Stream: ideas

Topic: split whitespace


view this post on Zulip Brendan Hansknecht (Apr 08 2024 at 03:27):

I'm not sure if this will hit any Unicode issues (I don't think so), but I think that we should add a Str.splitWhiteSpace function to the standard library that will split a string on any whitespace. It will remove full chunks of whitespace at a time.

So "hello world\t\t\t!\n\nmore text" would become ["hello", "world", "!", "more", "text"].
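
A minimal sketch of that behavior, written in Python purely for illustration. The name `split_whitespace` is hypothetical and this is not the Roc implementation. Python's `str.isspace` stands in for the whitespace definition here and may not match the Unicode `White_Space` property exactly.

```python
def split_whitespace(text: str) -> list[str]:
    """Split on runs of whitespace, never yielding empty strings."""
    words = []
    current = []  # characters of the word being built
    for ch in text:
        if ch.isspace():
            # A whitespace run ends the current word (if any) exactly once.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    # Flush the last word when the text does not end in whitespace.
    if current:
        words.append("".join(current))
    return words


print(split_whitespace("hello world\t\t\t!\n\nmore text"))
# prints ['hello', 'world', '!', 'more', 'text']
```

With this approach, an empty or all-whitespace input gives an empty list.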

view this post on Zulip Luke Boswell (Apr 08 2024 at 03:36):

I don't think that would be a problem, whitespace should be reasonably consistent across Unicode standards. For reference, here is the data table mapping code points to properties, including whitespace, for version 14: https://www.unicode.org/Public/14.0.0/ucd/PropList.txt

view this post on Zulip Richard Feldman (Apr 08 2024 at 04:00):

yeah Unicode issues seem unlikely, but how often does this come up? I can't think of a time I've ever reached for a function which did that :big_smile:

view this post on Zulip Brendan Hansknecht (Apr 08 2024 at 04:45):

I guess the main use cases I have seen for it are twofold:

  1. Data analytics on text
  2. Writing a simple tokenizer

view this post on Zulip Brendan Hansknecht (Apr 08 2024 at 04:46):

That said, it is in a ton of languages, so I would guess it is more popular than just those two use cases.

view this post on Zulip Anton (Apr 08 2024 at 09:05):

but how often does this come up?

I do actually use this pretty regularly, like for simple parsing stuff when scripting.

view this post on Zulip Anton (Apr 08 2024 at 09:08):

some_str.split() does this in Python; it has 2.7 million hits on GitHub.

view this post on Zulip Kevin Gillette (Apr 08 2024 at 14:43):

Richard Feldman said:

yeah Unicode issues seem unlikely, but how often does this come up? I can't think of a time I've ever reached for a function which did that :big_smile:

I use this kind of thing all the time. In Go it's https://pkg.go.dev/strings#Fields and it may well get used more than generalized string split, particularly because it has the nice quality that it doesn't produce empty-string elements when there is leading or trailing spacing.
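
To make the empty-string point concrete, here is a small Python comparison (Python rather than Go, only as an illustration; the sample string is made up). Splitting on a single space keeps an empty element for every extra space, while the no-argument `split()` behaves like a fields-style split.

```python
line = "  alpha   beta  "

# Generalized split on one space: every extra space yields an empty string.
by_space = line.split(" ")
print(by_space)
# prints ['', '', 'alpha', '', '', 'beta', '', '']

# Whitespace split: runs collapse and leading/trailing spacing is dropped.
fields = line.split()
print(fields)
# prints ['alpha', 'beta']
```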

If we were to have a stdlib parse module that is easy to use and universally understood, then this may not be useful. Otherwise, for certain kinds of text processing, the presence of this kind of functionality makes the developer experience much nicer.

view this post on Zulip Jasper Woudenberg (Apr 09 2024 at 06:44):

Do folks use this to parse the output from shell commands (shell tables)?
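
As one illustration of that use case, here is a rough Python sketch that parses a whitespace-aligned table. The table text and column names are invented for the example and are not real command output.

```python
# Invented sample resembling column-aligned shell output.
table = """\
NAME      STATUS    AGE
web-1     Running   3d
worker-2  Pending   12h
"""

lines = table.splitlines()
header = lines[0].split()  # column names from the first row
# Pair each row's fields with the header, skipping blank lines.
rows = [dict(zip(header, line.split())) for line in lines[1:] if line.strip()]

print(rows[1]["STATUS"])
# prints Pending
```

This only works when no cell contains whitespace itself; a value such as "Not Ready" would be split into two fields and shift the columns.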


Last updated: Jun 16 2026 at 16:19 UTC