`Str` module scope · ideas · Zulip Chat Archive

Stream: ideas

Topic: `Str` module scope

Richard Feldman (May 29 2023 at 03:40):

what if the scope of the Str module no longer included Unicode concepts like grapheme clusters, scalars, etc?

the diff would look like this:

https://github.com/roc-lang/roc/compare/main...simplify-str

and then those functions could be moved into a dedicated (currently extremely WIP) unicode package

Richard Feldman (May 29 2023 at 03:41):

also the module docs for Str could be dramatically simplified, because currently most of them consist of explaining Unicode concepts like extended grapheme clusters, which aren't commonly known but which are currently a prerequisite for understanding these Str APIs :sweat_smile:

Anton (May 29 2023 at 08:45):

Not being able to get the "chars" or the length of a Str without a package seems like a big disadvantage. I find it hard to imagine that users would prefer it that way.

Richard Feldman (May 29 2023 at 11:31):

hm, so what are some use cases where people want those things? :thinking:

Anton (May 29 2023 at 12:31):

validation, for example in web forms
statistics/data science
parsers
programming puzzles

String.length is used 4.5K times in elm code on github, I'd have to do a deeper relative analysis to see if that's a lot but it's a data point.
String.toList is used 2k times in elm code on github

Due to my background in data science I definitely use these functions more than most but validation of string inputs is a generally common use case.

Not having these functions in the std lib will likely be surprising to users and I would expect this to evoke negative emotions.

Richard Feldman (May 29 2023 at 13:14):

interesting! So for some more context: part of the reason I want to explore this is that the status quo (in programming in general) is that it's very common for people to reach for the wrong string operations, and then have code that works in a lot of cases but then totally breaks on edge cases like non-English languages or (especially) emoji

Richard Feldman (May 29 2023 at 13:15):

and while I think this is probably true:

Not having these functions in the std lib will likely be surprising to users and I would expect this to evoke negative emotions.

I'm wondering whether this is one of those cases where that's kind of the only way to get to a world where people reach for less error-prone tools

Richard Feldman (May 29 2023 at 13:16):

for example:

validation, for example in web forms

so let's say we have Str.countGraphemes as the easiest thing to reach for, and people have a mental model of "ok, if I want to count the number of 'characters', that's what I reach for, so I'll make sure the 'first name' field is no more than 16 'characters' by using Str.countGraphemes"

Richard Feldman (May 29 2023 at 13:17):

but actually the reason they're enforcing that is that the database column has a capacity of 16 bytes, and so when they get a first name that's 16 grapheme clusters but actually more than 16 bytes, the validation passes but then the database insert fails and the end user gets a cryptic server 500 error instead of a nice message
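A minimal sketch of the mismatch described above, written in Python purely for illustration. The helper name `fits_column` and the 16-byte limit are assumptions taken from the example scenario, not from any real schema. Python's `len` on a `str` counts code points rather than grapheme clusters, but for this input each "é" is one code point and one grapheme, so the point still holds.

```python
def fits_column(name: str, max_bytes: int = 16) -> bool:
    """Hypothetical check: does the UTF-8 encoding fit the column's byte capacity?"""
    return len(name.encode("utf-8")) <= max_bytes


# Sixteen precomposed "é" characters (U+00E9), each 2 bytes in UTF-8.
name = "\u00e9" * 16

print(len(name))                  # 16 "characters"
print(len(name.encode("utf-8")))  # 32 bytes
print(fits_column(name))          # False: a 16-"character" check would have passed
```

Validating against the unit the storage layer actually enforces (bytes here) is what lets the form show a friendly message instead of a failed insert.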

Richard Feldman (May 29 2023 at 13:18):

also, I think in the case of parsers - I could be wrong! - but I think there's a good chance that the nicest parser API is one that doesn't include a concept of "character"

Richard Feldman (May 29 2023 at 13:20):

and when it comes to programming puzzles, I guess it depends on the puzzle, but as Ayaz noted, for something like Advent of Code, having a parser package would be a big win, and if there's a nice parser package, I wonder what use cases remain for puzzles and things like "split this string into extended grapheme clusters"

Richard Feldman (May 29 2023 at 13:21):

I don't have experience with stats/data science use cases though - maybe some specific examples there would help!

Richard Feldman (May 29 2023 at 13:23):

maybe a better way to convey my thoughts here is that while I agree with this:

Not having these functions in the std lib will likely be surprising to users and I would expect this to evoke negative emotions.

...I wonder if this is a situation similar to RegEx: it's a tool that is useful, and should exist, but it's also a tool that is often used out of familiarity when another tool would be better overall - so maybe (like RegEx in Elm and in Rust) it's better if we don't have it in the stdlib, even though that's what people expect

Richard Feldman (May 29 2023 at 13:24):

worth noting btw that Rust doesn't have grapheme clusters in the stdlib - for that, you have to use the unicode_segmentation crate

Anton (May 29 2023 at 13:25):

I'm going to think about this some more tomorrow. It's a national holiday today so I wanted to finish up early :sweat_smile: :wave:

Richard Feldman (May 29 2023 at 13:25):

that said, Rust does have stdlib support for splitting strings into Unicode scalar values, which I definitely think is a footgun that's asking for emoji bugs :sweat_smile:
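To illustrate the scalar-value footgun, here is a small Python sketch (Python strings iterate by code point, which for this input is the same as iterating by Unicode scalar value). The family emoji below is a single user-perceived character built from three emoji joined by zero-width joiners; the variable names are just for illustration.

```python
# MAN + ZWJ + WOMAN + ZWJ + GIRL: renders as one family emoji.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

scalars = list(family)
print(len(scalars))                          # 5 scalar values
print(len(family.encode("utf-8")))           # 18 UTF-8 bytes
print([f"U+{ord(c):04X}" for c in scalars])  # the individual code points

# "Take the first 2 characters" by scalar value cuts the emoji apart,
# leaving a man emoji followed by a dangling zero-width joiner.
truncated = family[:2]
print(truncated == "\U0001F468\u200D")       # True
```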

Richard Feldman (May 29 2023 at 13:25):

sounds good, thanks for bringing up the use cases, and enjoy the holiday! :smiley:

Hannes (May 29 2023 at 13:26):

My initial reaction was that I would probably import the Unicode library in almost every program I wrote, but after reading what @Richard Feldman wrote about having a good parser package I've realised that would probably cover 90% of my use cases.

Richard Feldman (May 29 2023 at 13:27):

nice! Can you think of what the other 10% might be?

Richard Feldman (May 29 2023 at 13:27):

thinking through specific use cases is very helpful here, I think!

Hannes (May 29 2023 at 13:33):

That 10% number might be inflated by the fact that I started a new Unicode heavy project. It's a pretty printer for CSV files which aligns the columns, so I need to know if a character is a double width CJK character when printing it

Richard Feldman (May 29 2023 at 13:33):

oh cool! :smiley:

Richard Feldman (May 29 2023 at 13:33):

how would you feel if you had to import a unicode package for that project?

Hannes (May 29 2023 at 13:38):

I knew going in that there would be edge cases with double width characters because that's a problem I was trying to solve with existing tools, so I'd be fine with it

Hannes (May 29 2023 at 13:39):

It's a complex enough problem that I wouldn't expect it to be in the stdlib and I was surprised there was a specific function in the Python stdlib to get the visual width of a character
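For context, a rough sketch of column alignment with double-width characters in Python. The stdlib does expose `unicodedata.east_asian_width`, which may or may not be the function meant above; `display_width` and `pad_cell` are hypothetical helpers. This is a simplification: it ignores combining marks, zero-width characters, and the "ambiguous" width class, all of which a real pretty printer would need to handle.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate terminal width: wide (W) and fullwidth (F) characters count as 2."""
    width = 0
    for ch in text:
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


def pad_cell(text: str, target_width: int) -> str:
    """Right-pad with spaces so the cell occupies target_width terminal columns."""
    return text + " " * max(0, target_width - display_width(text))


print(display_width("abc"))     # 3
print(display_width("日本語"))  # 6
print(repr(pad_cell("日本", 6)))
```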

Hannes (May 29 2023 at 13:41):

If unicode functions aren't in the stdlib I'd expect lots of beginners asking questions like "how do I get the length of a string?", which is probably a better outcome because then someone can ask for more specific details about what they're trying to achieve

Richard Feldman (May 29 2023 at 14:00):

yeah that's a great point!

Brendan Hansknecht (May 29 2023 at 15:27):

The big question for me: will this just lead to many, many users converting strings to a list of bytes and then doing things manually? If so, they probably will only think about ASCII and not Unicode.

Sky Rose (May 29 2023 at 15:36):

You could have the Unicode package be builtin to make more people use it. People would still be forced to think about how the operation they're doing is complicated by unicode if they use it as Unicode.countGraphemes instead of Str.countGraphemes.

Sky Rose (May 29 2023 at 15:43):

Is Str supposed to represent arbitrary binary data, any human text, or only unicode? If it's only unicode, then that would be a reason to keep unicode functions within Str. (But then what would be the right way to handle arbitrary binary data in Roc?)

Brendan Hansknecht (May 29 2023 at 15:48):

As it currently stands Str is a List U8 with Utf8 encoding.

Brendan Hansknecht (May 29 2023 at 15:49):

So only unicode, which is often only ascii, which is why so many people mess up/write bugged code.

Brendan Hansknecht (May 29 2023 at 15:50):

That is a really good point. Should we be trying to remove unicode support from std, or should we be trying to write a good unicode interface in std with names that help users just make the right decision?

Richard Feldman (May 29 2023 at 16:24):

I think that's a plausible path, but my concern is that teaching Unicode is a big deal, and I'm not sure how to organize that if there are several unicode modules

Richard Feldman (May 29 2023 at 16:25):

like I don't think "names that help users just make the right decision" is achievable without someone having a certain baseline amount of Unicode knowledge, and it seems that this baseline level of knowledge is far from universal

Richard Feldman (May 29 2023 at 16:27):

Brendan Hansknecht said:

The big question for me: will this just lead to many, many users converting strings to a list of bytes and then doing things manually? If so, they probably will only think about ASCII and not Unicode.

I also have this concern, although my hope is that doing stuff like "convert it to a List U8 and then work with the U8s directly" will feel janky enough that people will (correctly) feel they're doing something sketchy, and will at least have the intuition that they should look for another way to do it

Brendan Hansknecht (May 29 2023 at 16:45):

Yeah, probably true for most normal string operations. That said, for a class of problems, working with the bytes and assuming they are ascii is pretty common.

Richard Feldman (May 29 2023 at 17:01):

hm, which problems?

Brendan Hansknecht (May 29 2023 at 17:04):

anything with parsing

Brendan Hansknecht (May 29 2023 at 17:04):

I think for a huge class of parsing problems, people don't think about unicode and just operate over ascii bytes.

Richard Feldman (May 29 2023 at 17:49):

Brendan Hansknecht said:

anything with parsing

but if we have a parser package, wouldn't people use that anyway?

Brendan Hansknecht (May 29 2023 at 19:04):

:shrug: I probably am the wrong person to ask. I still don't know if I like parser combinators. So I generally would avoid a parser package.

Brendan Hansknecht (May 29 2023 at 19:04):

Often times it is simple enough to just use something recursive over the bytes.
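As a concrete picture of the byte-level style being described, here is a small Python sketch of an ASCII-only integer parser over UTF-8 bytes. `parse_uint` is a hypothetical helper, and it uses a loop rather than recursion. It relies on one fact about UTF-8: bytes below 0x80 only ever encode ASCII characters, so matching ASCII digits byte by byte cannot land in the middle of a multi-byte character. It also shows the flip side: non-ASCII digits are simply not recognized.

```python
def parse_uint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Parse ASCII decimal digits starting at pos; return (value, next position)."""
    start = pos
    value = 0
    # 0x30..0x39 are the ASCII codes for '0'..'9'.
    while pos < len(data) and 0x30 <= data[pos] <= 0x39:
        value = value * 10 + (data[pos] - 0x30)
        pos += 1
    if pos == start:
        raise ValueError(f"expected an ASCII digit at byte {start}")
    return value, pos


print(parse_uint(b"123abc"))               # (123, 3)
print(parse_uint("42€".encode("utf-8")))   # (42, 2): stops before the multi-byte sign

try:
    # ARABIC-INDIC DIGIT THREE (U+0663) is a digit to a human, but not to this parser.
    parse_uint("\u0663".encode("utf-8"))
except ValueError as err:
    print(err)
```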

Richard Feldman (May 29 2023 at 20:14):

Brendan Hansknecht said:

I think for a huge class of parsing problems, people don't think about unicode and just operate over ascii bytes.

would this change if we offered a Str.walkGraphemes or something?

Richard Feldman (May 29 2023 at 20:14):

like if Str.walkGraphemes is available, and parser is available, and Str.walkUtf8 is available, how many people pick the first option over the other two? :thinking:

Bryce Miller (May 30 2023 at 00:32):

Not having these functions in the std lib will likely be surprising to users and I would expect this to evoke negative emotions.

I agree, but also:

Having string bugs enabled by the standard library will likely be surprising to users and I would expect this to evoke negative emotions.

Richard Feldman (May 30 2023 at 00:34):

well I don't know how surprising it would be, since most languages do it :sweat_smile:

Bryce Miller (May 30 2023 at 00:35):

Well that's true. The bugs might be surprising though!

Richard Feldman (May 30 2023 at 00:35):

the norm is that end users suffer the consequences, unfortunately, and then programmers write blog posts about things they learned the hard way

Bryce Miller (May 30 2023 at 00:43):

I know nothing about the intricacies of Unicode. But knowing that they exist, I think I'd vote for an approach that favors reliability and predictability. I don't want users of my software to suffer because I made a mistake.

This is reminding me of Json decoders in Elm. Surprising and unfamiliar to newcomers, but they force you to deal with the reality that servers cannot be trusted :melting_face:. After experiencing one production bug caused by failure to validate a server response in a JS codebase, you'd be hard pressed to convince me to skip that step for production code, regardless of language.

And there's nothing stopping someone who wants something more familiar from either:
a) Using a helper package that provides all the error-prone but familiar functions like Str.len
b) using a language that is better suited to applications where reliability and correctness aren't as important

Bryce Miller (May 30 2023 at 00:46):

Idk now you have me paranoid about Unicode-related bugs :sweat_smile:

Richard Feldman (May 30 2023 at 01:01):

yeah I remember a fun outage referred to as "the emoji fire" because users submitting emoji to a certain input box caused a serious production fire :sweat_smile:

Bryce Miller (May 30 2023 at 01:59):

Wheeeee

Brendan Hansknecht (May 30 2023 at 04:35):

After reading this, I think my main thought is that having it out of the standard mostly feels like a way to let it get iterated on more and have a different perception of quality. Also, it helps keep the standard simple and focused. Unicode is simply a complex problem that is easy to mess up. If we already had a high quality Unicode package we felt was safe to set in stone, maybe we should put it in the standard, but not with the flux around how it should be handled

Anton (May 30 2023 at 08:39):

I can get behind taking the unicode stuff out of std now, put it in its own package, improve it over the years, and if we want to we can put it back in std once we're satisfied with it.

Richard Feldman (May 30 2023 at 11:18):

interesting, I'm open to that!

Richard Feldman (Jun 18 2023 at 22:40):

random idea: Str.len : [Graphemes, Utf8Bytes] -> Nat

Richard Feldman (Jun 18 2023 at 22:40):

like just make it really up-front that you have to choose one or the other every time

Richard Feldman (Jun 18 2023 at 22:40):

and it would optimize away after inlining
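A rough Python analogue of the "you must pick a unit every time" idea, to make the shape of the API concrete. Everything here is hypothetical: `Unit`, `str_len`, and `approx_grapheme_count` are illustration names, and the grapheme counter is a crude approximation (it only merges combining marks and zero-width-joiner sequences), not an implementation of the Unicode segmentation rules. It also says nothing about the inlining behaviour mentioned above.

```python
import unicodedata
from enum import Enum


class Unit(Enum):
    GRAPHEMES = "graphemes"
    UTF8_BYTES = "utf8_bytes"


ZWJ = "\u200d"


def approx_grapheme_count(text: str) -> int:
    """Crude approximation only: skin tones, flags, Hangul, etc. are not handled."""
    count = 0
    prev = ""
    for ch in text:
        joins_previous = (
            unicodedata.category(ch).startswith("M")  # combining marks
            or ch == ZWJ                              # the joiner itself
            or prev == ZWJ                            # whatever follows a joiner
        )
        if not joins_previous:
            count += 1
        prev = ch
    return count


def str_len(text: str, unit: Unit) -> int:
    """The unit argument has no default, so every call site has to choose."""
    if unit is Unit.UTF8_BYTES:
        return len(text.encode("utf-8"))
    if unit is Unit.GRAPHEMES:
        return approx_grapheme_count(text)
    raise ValueError(f"unknown unit: {unit!r}")


family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(str_len(family, Unit.GRAPHEMES))   # 1
print(str_len(family, Unit.UTF8_BYTES))  # 18
```

Even in this toy version the two branches do very different amounts of work: one asks for an encoded length, the other has to inspect every code point.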

Luke Boswell (Jun 18 2023 at 23:36):

I feel like I would prefer |> Str.toUtf8 |> List.len as it is more explicit, and then keep unicode in a separate package that can change over time with the standards. To do this with graphemes we would need to support most of unicode in the stdlib right? Would the std lib need to understand different character code charts to understand grapheme boundaries? Would keeping unicode in a separate package reduce pressure for std to change over time?

Brendan Hansknecht (Jun 18 2023 at 23:43):

My biggest concern is that putting it in one function makes it feel like the perf cost is the same. Which it very much is not.

Richard Feldman (Jun 19 2023 at 01:10):

Luke Boswell said:

I feel like I would prefer |> Str.toUtf8 |> List.len as it is more explicit

I like this idea in theory, but unfortunately in practice it means a small string would have to incur a heap allocation just to get the byte length :sweat_smile:

Brendan Hansknecht (Jun 19 2023 at 01:15):

I guess if we add bytes as a type, that would be fixed? Str.toUtf8 would return a Bytes type that can be small.

Richard Feldman (Jun 19 2023 at 01:16):

if Bytes can be small, then yes

Brendan Hansknecht (Jun 19 2023 at 01:19):

I don't see why we would make it otherwise.

Richard Feldman (Jun 19 2023 at 01:20):

I think the main argument against it is that it would be a cost paid in a lot of places for a pretty narrow use case (potentially just small strings you want to do byte operations on)

Richard Feldman (Jun 19 2023 at 01:28):

a concern I have after thinking about this more is that if all the grapheme stuff is in a separate library, but byte-wise operations are in the stdlib (which they have to be; it has to be possible to convert a Str to bytes so they can be serialized) then people might just reach for things that way, which is the worst thing for correctness :sweat_smile:

Brendan Hansknecht (Jun 19 2023 at 01:29):

I'm not sure I follow in terms of perf. If it is a perf gain for strings, why wouldn't it be a perf gain for bytes?

Richard Feldman (Jun 19 2023 at 01:30):

oh, I'm assuming an opaque Bytes type would be used (approximately) everywhere List U8 is today

Richard Feldman (Jun 19 2023 at 01:30):

e.g. http requests, file I/O...

Brendan Hansknecht (Jun 19 2023 at 01:31):

Yep

Richard Feldman (Jun 19 2023 at 01:31):

and although a lot of UTF-8 strings are under 24B, and therefore can avoid a heap allocation, that's probably not going to be true of basically any of the I/O use cases of Bytes - which I think would be almost all of them in practice

Brendan Hansknecht (Jun 19 2023 at 01:32):

I guess that is fair, but if you are accessing memory on the heap, you are screwed anyway, so I don't think it would make much of a difference for perf.

Brendan Hansknecht (Jun 19 2023 at 01:33):

Also, people tend to chop things up, but I guess with seamless slices, we can keep those on the heap, so maybe we have a better case than most languages here.

Brendan Hansknecht (Jun 19 2023 at 01:34):

Cause in c++ with llvm::SmallVector for example, when you chop things up, small list optimization is a huge gain. In roc, if you aren't modifying the chunks, I guess it isn't a gain cause seamless slices will rescue you from the extra allocations.

Brendan Hansknecht (Jun 19 2023 at 01:34):

I guess that specifically may make a difference in terms of Roc's performance here

Brendan Hansknecht (Jun 19 2023 at 01:34):

So yeah, maybe not as clear cut due to that.

Brendan Hansknecht (Jun 19 2023 at 01:35):

But yeah, before we added seamless slices, splitting a giant string by spaces was definitely slower due to all of the data copying.

Brendan Hansknecht (Jun 19 2023 at 01:37):

I guess the question is: will strings need to be converted to bytes often to be used? If so, that would lead to a lot of pain points due to losing the small str optimization and paying the cost of allocating, but at the same time, you are reducing some costs for the potentially more common case of byte based data from files, http, etc.

Richard Feldman (Jun 19 2023 at 01:38):

yeah the other option is to try to implement things without needing Str to be converted into bytes

Richard Feldman (Jun 19 2023 at 01:38):

"things" being Unicode things, mainly :big_smile:

Brendan Hansknecht (Jun 19 2023 at 01:47):

As long as the unicode library is smart and able to walk (potentially in chunks), I bet a lot can be done without conversions. But, I'm not sure. Could always expose a function to get a tuple of the bytes or something, but that sounds very unpleasant

Luke Boswell (Jun 19 2023 at 02:58):

Maybe a silly question, but is it possible to have a small list optimisation? where we keep the bytes on the stack in the same way as small Str?

Tangentially, also probably a silly question, is getting the byte length of a small Str a common operation? Like would you be sorting a list of small Str by length or similar?

Brendan Hansknecht (Jun 19 2023 at 03:42):

is it possible to have a small list optimisation? where we keep the bytes on the stack in the same way as small Str?

100%. llvm::SmallVector does this, for example. It just is a less clear tradeoff and really works best with dynamically sized list types (as in different number of bytes on the stack depending on the element size). With Str and Bytes, the primitive is a U8 and the small optimization is pretty clear and simple. With a U64, and other bigger/more complex types, a small list can only hold 2 U64s if we keep the limit of 24 bytes (also need to store the length).

In the case of llvm::SmallVector, I believe that it targets 64 bytes worth of elements (unless otherwise changed by the end user). However many fit in that. So 8 U64s. This just adds a huge level of tradeoffs and perf details to consider. If we want to keep lists simple and not require essentially ridiculous c++ templates in any host language that accesses a roc list, this is even more complex. I honestly don't think it would likely be worth it to do this generically, especially since we have seamless slices.

As such, that means, we would maybe only apply it to very small types. So maybe for single byte types or something of that nature.
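The "only 2 U64s in 24 bytes" arithmetic can be worked through with a simplified model, sketched in Python below. This is not Roc's actual memory layout: `inline_capacity` is a hypothetical helper that assumes a 24-byte budget, a 1-byte length field, and that the length field is rounded up to a whole element slot to keep the elements aligned.

```python
def inline_capacity(budget_bytes: int, elem_size: int, len_field_bytes: int = 1) -> int:
    """How many elements fit inline once space for the length is reserved."""
    # Round the length field up to a multiple of elem_size (ceiling division).
    reserved = -(-len_field_bytes // elem_size) * elem_size
    return (budget_bytes - reserved) // elem_size


for name, size in [("U8", 1), ("U16", 2), ("U32", 4), ("U64", 8)]:
    print(name, inline_capacity(24, size))
# U8 23
# U16 11
# U32 5
# U64 2
```

Under these assumptions the inline capacity falls off quickly as the element size grows, which is the reason given above for limiting the optimization to very small element types.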

Brendan Hansknecht (Jun 19 2023 at 03:45):

is getting the byte length of a small Str a common operation?

Yes, but only because most people have strings they assume are 100% ascii

Brendan Hansknecht (Jun 19 2023 at 03:46):

So most code I have seen assumes that byte length == str length. Thus pretty common to get the length of a str that way whether small or large.

Brendan Hansknecht (Jun 19 2023 at 03:46):

Maybe also used for prefix trees or similar, but that is pretty niche

Last updated: Jul 23 2026 at 13:15 UTC