Maybe the template should use the roc-ascii package instead of Str? AoC is always ascii anyways, and it has some nice conveniences like case conversions
I already find it very weird that Str isn't the right library for what are, in every other language I've used, strings. Perhaps Str should be renamed to Utf8 and Roc should either drop the term string altogether or have an aggregation library that provides both utf8 and ascii utilities, while making the developer explicitly choose.
it's probably worth splitting off a separate topic, but I do think it's interesting to explore questions around strings as builtins
I think I can summarize the problem as "most programming languages have string libraries that are full of footguns that break by default in most common edge cases, and beginners struggle with the expected footguns not being there in Roc"
so like...I don't want to reintroduce footguns, but I also don't want beginners to feel frustrated, and those two seem to be in direct tension :sweat_smile:
as an example of the footguns - even though I have personally spent a ton of time learning about Unicode, grapheme clusters, etc. - I recently was working on a project at work (using Rust) in which we were doing some word wrapping, and I literally made the classic mistake of reaching for char (because it's right there!) instead of thinking about grapheme clustering, and a user immediately reported a bug around Chinese characters
I had all the knowledge, I designed Roc's string APIs to remove the footguns I knew were there in other languages, and I still fell for the footgun in my day job just because it was Right There in Rust's stdlib and it felt like the obvious choice
so I feel very strongly that the right solution here is not "repeat every other standard library's mistakes"
I believe we can dodge the classic complexity footguns without creating new simplicity footguns :)
that would be great!
one idea that comes to mind is that Rust's stdlib has ASCII operations which are named that way
so we could consider some things from https://github.com/Hasnep/roc-ascii for builtins
I think Zig has an interesting approach, in offering an std.ascii module in its standard library that contains the type of functions that would be footguns if you were to define them for utf8 strings.
One thought is to have an AsciiStr type that's a wrapper around List U8, with a fromAscii : List U8 -> Result AsciiStr [NotAnAsciiByte] method, similar to fromUtf8 for Str.
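A minimal sketch of the AsciiStr idea, written in Python rather than Roc purely for illustration. The names `AsciiStr`, `from_ascii`, and `NotAnAsciiByte` mirror the message above and are hypothetical; nothing here is an existing Roc or roc-ascii API, and an exception stands in for the `Result` error.

```python
from dataclasses import dataclass


class NotAnAsciiByte(ValueError):
    """Raised when a byte outside 0..127 is found (hypothetical error name)."""


@dataclass(frozen=True)
class AsciiStr:
    """Wrapper around raw bytes that are known to be valid ASCII."""
    data: bytes

    @staticmethod
    def from_ascii(raw: bytes) -> "AsciiStr":
        # Validate once at the boundary, in the spirit of fromUtf8 for Str.
        for index, byte in enumerate(raw):
            if byte > 0x7F:
                raise NotAnAsciiByte(f"byte {byte:#04x} at index {index}")
        return AsciiStr(raw)

    def to_upper(self) -> "AsciiStr":
        # Safe here: every byte is ASCII, so bytes.upper() only touches a-z.
        return AsciiStr(self.data.upper())
```

Once the constructor has validated the bytes, case conversion needs no further checks, which is the convenience being asked for.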
yeah, certainly ASCII doesn't get new editions the way Unicode does, so it's safe to assume there would never need to be breaking changes to those APIs
and if you're using ASCII (e.g. for Advent of Code) you know you're not doing something robust, but it's still convenient
and if you're using ASCII in production, at least there's a very strong hint that you're not handling any edge cases at all
19 messages were moved here from #ideas > AoC template idea -- using module params by Richard Feldman.
an interesting thing about making it a builtin is that, if we want to, we can actually track in the compiler whether a string is known to be valid ASCII
and then make those conversions be free at runtime
Oh, you mean for string literals, similar to how we can infer a number literal to be one of a couple of types?
yeah something like that
LLVM optimizer might take care of it anyway though
I moved my AoC-specific messages back there
I don't like the idea of making ascii a builtin. I think this sends the wrong message: that unicode is hard and you should just go for ascii.
It will probably lead to situations where many people use ascii for prototyping, leading to programs or packages that do not think about Unicode from the beginning. I like that Roc tries to work with strings in a correct way. But this should not lead to a world where people reach for ascii instead.
I still think that the Str type and the Str package should be removed from the builtins. They are a source of confusion for newcomers, and most of the problems should be solved with either List U8 or the unicode package instead.
The List package and the Str package are dangerously similar. There are functions that do (nearly) the same thing, and other functions that have the same name but do different things. For example, Str.split and List.split.
The only advantage of Str over List U8 is that it guarantees to be valid Utf8. This guarantee is weak, because many strings come from a platform, and Roc the language does not guarantee the content of a string generated by a platform. So it is more like an error-prone convention that Str is valid utf8.
I think if there is Str in the builtins, it should be an alias for List U8. A type that has any guarantees should be an opaque type in the unicode package.
Oskar Hahn said:
I think if there is Str in the builtins, it should be an alias for List U8. A type that has any guarantees should be an opaque type in the unicode package.
I don't think there's a world where Roc doesn't have string literals, and if we have string literals whose type is List U8, then we have definitely reintroduced the footgun of "string length Just Works on ASCII and Just Breaks on things like emoji if you have the typical mental model"
I think one of the selling points in favor of having a separate Str type is that it doesn't expose a function named "length"
that's one of the common footguns
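To make the "length" footgun concrete, here is a small Python illustration (not Roc). Python's `len` on a `str` counts code points, which matches neither the byte count nor what a reader perceives as one symbol. The helper name `lengths` is made up for this sketch.

```python
# Two different "lengths" of the same text. For the emoji, neither
# equals the single symbol that the user sees.
plain = "abc"
thumbs_up = "\U0001F44D\U0001F3FD"  # thumbs-up + skin-tone modifier: one visible symbol


def lengths(text: str) -> tuple[int, int]:
    """Return (code points, UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


print(lengths(plain))      # (3, 3)
print(lengths(thumbs_up))  # (2, 8)
```

For ASCII all the numbers agree, which is why the mistake tends to survive testing.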
also, when it comes to uppercasing and lowercasing, going straight to List U8 unfortunately doesn't help beginners...that's still an operation that's easy in most languages but not in Roc
we could have an Ascii module that works in terms of either Str or bytes
e.g. Ascii.upperCaseStr : Str -> Str and Ascii.upperCaseByte : U8 -> U8
but that doesn't feel great haha
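For reference, a rough Python sketch of what the two signatures above could mean. The function names are hypothetical translations of `Ascii.upperCaseByte` and `Ascii.upperCaseStr`, and the behavior (touch only a-z, leave everything else alone) is an assumption about the intent.

```python
def upper_case_byte(byte: int) -> int:
    """Uppercase a single byte if it is ASCII a-z; leave every other byte alone."""
    if ord("a") <= byte <= ord("z"):
        return byte - 32  # 'a' (0x61) and 'A' (0x41) differ by 32
    return byte


def upper_case_str(text: str) -> str:
    """Uppercase only ASCII letters, working on the UTF-8 bytes."""
    # Bytes of multi-byte UTF-8 sequences are all >= 0x80, so they never
    # fall in the a-z range and the result is still valid UTF-8.
    return bytes(upper_case_byte(b) for b in text.encode("utf-8")).decode("utf-8")
```

Under these assumptions `upper_case_str("café")` gives `"CAFé"`: the accented letter is simply skipped.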
:thinking: maybe we could have a Locale ability builtin or something
so we could have Str.toUpperCase : Str, locale -> Str where locale implements Locale
that would be totally discoverable, but you couldn't call it without a Locale, and then the docs for the function could explain how and why you need one of those
and walk you through how to get one
and although that wouldn't help in the case where you've converted the string to a List U8, it would at least make it easier to discover some docs which help you learn how to do that too
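A toy Python sketch of the "you can't call it without a Locale" shape. `Locale`, `DefaultLocale`, `TurkishLocale`, and `to_upper_case` are all hypothetical names, and the Turkish rule is reduced to a single replacement for illustration; real locale-aware casing has many more rules.

```python
from typing import Protocol


class Locale(Protocol):
    """Hypothetical stand-in for the proposed Locale ability."""
    def upper(self, text: str) -> str: ...


class DefaultLocale:
    def upper(self, text: str) -> str:
        return text.upper()  # Python's locale-independent Unicode uppercasing


class TurkishLocale:
    def upper(self, text: str) -> str:
        # Turkish uppercases dotted i to dotted capital I (U+0130), not to I.
        return text.replace("i", "\u0130").upper()


def to_upper_case(text: str, locale: Locale) -> str:
    # There is deliberately no default: the caller has to pick a locale.
    return locale.upper(text)
```

With this sketch, `to_upper_case("istanbul", DefaultLocale())` gives `"ISTANBUL"`, while the Turkish locale gives `"İSTANBUL"`, which is the kind of difference the required argument is meant to surface.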
similarly, a Str.getUtf8Byte : Str -> Result U8 [OutOfBounds] could be a discoverable way to help learn about indexing
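A possible reading of that signature, sketched in Python. The index parameter is an assumption (the signature as written doesn't show one), and `None` stands in for the `OutOfBounds` error.

```python
from typing import Optional


def get_utf8_byte(text: str, index: int) -> Optional[int]:
    """Return the UTF-8 byte at index, or None when out of bounds."""
    encoded = text.encode("utf-8")
    if 0 <= index < len(encoded):
        return encoded[index]
    return None
```

The name makes it explicit that the index is a byte offset: for `"é"`, indexes 0 and 1 are both valid and index 2 is out of bounds.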
What if Str : [Ascii (List U8), Utf8 (iforgetwhatthisis)] ?
Then different Str.foo functions can accept one or the other or both tags
That feels the most true/honest, from a domain modeling perspective
Also a can't-miss demonstration of the power of tag unions to new learners
It seems like (what the smart people call) a zero-cost abstraction, where depending on your Str constructor function call, your code might not need to be any longer or ever need to acknowledge an alternative that you don't use
Maybe we want a third option in there - , Either (...)] primarily for literals. Then it can settle into A/U whenever you do an A/U-specific operation on it, or in the meantime just hang out as ambiguous cause that might not hurt anything!
Maybe we can do some similar type classification magic as for number literals, to cast a string literal as Utf8 if it contains a non-ASCII character.
This idea feels like the deepest I've ever understood algebraic types, so nobody is allowed to point out a single flaw in it. I'll make a face like this: :weary: jk
That definitely isn't a zero cost abstraction. It is an extra branch anytime you interact with a string. To make it zero cost, the encoding would have to be compile time only. Not a tag which has runtime uses.
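To illustrate where that branch comes from, here is a Python sketch of the tagged representation. `Ascii`, `Utf8`, `TaggedStr`, `from_literal`, and `byte_count` are hypothetical names, and the payload types are guesses since the Utf8 payload was left unspecified above.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Ascii:
    data: bytes  # assumed to hold only bytes below 0x80


@dataclass(frozen=True)
class Utf8:
    data: str


TaggedStr = Union[Ascii, Utf8]


def from_literal(text: str) -> TaggedStr:
    """Pick the tag from the content, like the literal-inference idea."""
    if text.isascii():
        return Ascii(text.encode("ascii"))
    return Utf8(text)


def byte_count(value: TaggedStr) -> int:
    # This isinstance check is the runtime branch: it runs on every call,
    # for every operation that accepts both tags.
    if isinstance(value, Ascii):
        return len(value.data)
    return len(value.data.encode("utf-8"))
```

Removing the check would require the tag to be known at compile time, which is the distinction being drawn above.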
I agree both that the current state of roc is inconvenient and that simply using List U8 is not the correct choice if we want to help people avoid footguns.
We could have a str type with multiple encodings. Maybe even make the encoding an ability somehow, but that feels brittle and a lot of extra complexity.
My best thought currently is that Unicode should be an official builtin library, but versioned separately from the rest of roc.
I wonder how often someone looking for a Str.toUppercase turns out to be looking for locale aware unicode uppercasing vs ascii uppercasing. I guess it's risky for the documentation to assume folks are looking for one, because it creates a poor experience for folks really needing the other.
In that sense the current setup is nice. People end up in a thread like this, realize they need to make a decision for ascii or locale-based uppercasing, then continue on their way to either roc-ascii or roc-unicode. It'd be nice if folks didn't need to go through Zulip, but I wonder if we can do so in a way that keeps the 'what toUpperCase do you need?' question in the path of people looking for toUppercase.
The documentation of Str already has a section on capitalization that explains the situation, but folks looking for a toUppercase might miss it. One solution I think would offer a great experience would be to have a documentation entry for Str.toUppercase, but clearly flag it to say something like:
This function does not exist! If you're looking to uppercase ASCII-only text for programming identifiers, advent of code, or similar, take a look at toUppercase in the roc-ascii package. If you need to uppercase strings presented to your users, then you're looking for the roc-unicode package.
Even cooler would be to count how often both links are clicked, to get data on whether one or the other would make a good default.
What if we made Str a nonzero-cost abstraction with dev-mode-only optimization hint logs to use one of those dedicated libraries? Silly?
This conversation has me thinking of general-purpose strings as not actually a real thing, rather just a developer convenience that postpones optimal implementation.
Clippy says "I see you're looking for a Str built-in. I can help with that, but have you considered that strings are a figment of your imagination?"
In the age of emojis, a first-class Str built-in could feel a bit like a first-class Nibble built-in.
(I'm aware that I'm proposing gambling a big chunk of the weirdness budget on a principled future-looking stance with style points.)
I imagine that as a noob seeing Str redirecting me to dedicated A/U libraries (in any of the above ways), I'd have a 70% chance of going "whoa this language is smart".
Jasper Woudenberg said:
One solution I think would offer a great experience would be to have a documentation entry for Str.toUppercase
Do you mean something like this? :smiley:
toUppercase : [] -> [GoodLuckCallingMe]
toUppercase : [] -> [YouHaveMuchToLearn]
:big_smile:
Agus Zubiaga said:
Jasper Woudenberg said:
One solution I think would offer a great experience would be to have a documentation entry for Str.toUppercase
Do you mean something like this? :smiley:
toUppercase : [] -> [GoodLuckCallingMe]
Hahaha, not really. I was thinking the function signature would be what you'd expect:
toUppercase : Str -> Str
But marked in a very clear way to say that the module doesn't actually expose this function. It's just there to document the non-existence of the function for folks who expect it to exist.
Roc documentation would likely need special support for this. In a way it might be similar to deprecation support. Preventative deprecation!
That actually makes me think of another use case for documenting non-existing functions. If a package had a new major release that removes a particular function, it could be neat to keep a documentation entry of the function for folks who learned about it in a now outdated SO/zulip/github/LLM answer.
yeah maybe with a strikethrough
I've seen that in other languages, at least in ide autocomplete
seems reasonable for docs too!
I feel like I normally see that for deprecated. So you can technically call it, but you shouldn't
I guess that to get the auto-complete to pick up on it, the function should not exist only in documentation but in code in some stub form too. Maybe like this:
## uppercase is tricky yadda yadda
toUppercase : Str -> Str
toUppercase = nope
nope being a special keyword (needs a better name).
The point being that the function would show up in editor-autocomplete too (along with its documentation). The compiler could show an error if it encounters nope in a compiled program.
Could we just use crash?
## uppercase is tricky yadda yadda
toUppercase : Str -> Str
toUppercase = crash "deprecated -- DO NOT USE"
I'd like to avoid special keywords and runtime crashes if we can.
How about this:
toUppercase : Str -> [ReadDocsOfStrToUppercase]
toUppercase = \_str -> ReadDocsOfStrToUppercase
A type mismatch will be noticed immediately in the editor or at compile time.
I think downsides of not using the real type is that it might hurt integrations in some places. For instance, a type-search for Str -> Str would not find the function in your example, Anton.
Also, the compiler being able to tell you the reason the function doesn't exist would be a better experience than the compiler telling you to look up documentation. It saves you a hop!
The downside of using crash I think would be that you might use the function and only find out it's a problem at runtime.
Also, the compiler being able to tell you the reason the function doesn't exist would be a better experience than the compiler telling you to look up documentation. It saves you a hop!
Sure, but this issue is so complicated that it would be hard to create a nice reading experience (with several links) in an error message, especially when it is shown in some editor popup.
For instance, a type-search for Str -> Str would not find the function in your example
That is true but I don't think it will be common to search for toUppercase with a type-search. Most likely you're asking an LLM "roc function to uppercase" or some more high level description of your task.
Sure, I don't think we should put the entire explanation in there. But I think there'd be space for the highlights, like:
Use <roc-unicode> if you're working with text you present to the user, use <roc-ascii> if you're working with ascii-only text. Learn more about unicode <here>.
I'm using android-studio at work at the moment. Much prefer Vim for the most part, but one thing I do appreciate is being able to hover over a struck-through (deprecated) function and getting a little tooltip showing what I should be doing instead. I think it'd be nice if Roc would support something like that.
That is true but I don't think it will be common to search for toUppercase with a type-search. Most likely you're asking an LLM "roc function to uppercase" or some more high level description of your task.
I would totally use type-search over an LLM for this sort of thing!
here's a concrete proposal for 3 ASCII functions we could add to Str. (I wrote the docs in static dispatch style, but we'd convert them to today's syntax if we decide to land them.)
my goal with them is to make it super clear up front what they do and don't do, from the name to the initial example, and then to explain what they're useful for and why you should use the unicode package for everything else
I think this would be both useful and also a more discoverable way to learn why we don't have Unicode capitalization in builtins, since by searching for uppercase/lowercase you'd come across these, and then hopefully find the explanation in the docs
plus, as someone noted elsewhere, this would actually be the most performant way to do things like case-insensitive comparisons for env vars and command-line args where you know they're hardcoded ASCII strings, so it makes sense to offer a primitive for that use case
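As a rough illustration of that use case, here is a Python sketch of an ASCII-only case-insensitive comparison. `ascii_lowercased` and `ascii_caseless_eq` are hypothetical helper names, not the proposed Roc functions, and the behavior shown (only A-Z is folded) is an assumption about what the proposal intends.

```python
def ascii_lowercased(text: str) -> str:
    """Lowercase only A-Z; every other code point is returned unchanged."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in text
    )


def ascii_caseless_eq(left: str, right: str) -> bool:
    """Compare two strings, ignoring case for ASCII letters only."""
    return ascii_lowercased(left) == ascii_lowercased(right)
```

This suits hardcoded ASCII inputs such as `--Verbose` vs `--verbose`. It deliberately treats `É` and `é` as different, so it is not a substitute for Unicode case folding.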
my hope is that with these names and these docs, these won't be footguns in practice :big_smile:
I think eventually these should point users to roc-lang/unicode when that's more mature
Also, the with_* naming tends to be used for functions that take a callback that acts on some object, e.g. with_file! opens a file and passes it to a function that can do stuff with it, and auto-closes when finished
So maybe we should drop the with?
Or maybe only_ascii_*
I added the with later (I was thinking of it in the sense of like with_default) because I figured that comparing my_str.ascii_lowercased() vs my_str.with_ascii_lowercased(), the second one is clearer that it's only lowercasing the ASCII
I think you can still probably guess that if it doesn't have the with, but since I'm concerned about footguns here, I figured making it longer but clearer was worth it
Oh, I can see it now.
Okay, looks good to me
Ship it!
cool, if anyone wants to pick up the implementation, feel free!
I'll make an issue
https://github.com/roc-lang/roc/issues/7473
I would like to take this on as I haven't created a builtin yet. it seems like it would be a good start.
Richard Feldman said:
here's a concrete proposal for 3 ASCII functions we could add to Str. (I wrote the docs in static dispatch style, but we'd convert them to today's syntax if we decide to land them.)
To be pedantic, the é in the examples is a single codepoint outside the ascii range, U+00E9 and is probably not the ascii character e U+0065 with combining acute U+0301, which looks the same. Writing "café" using the combining character (and assuming the implementation of capitalize_ascii just looks byte-by-byte) would result in Str.capitalize_ascii "café" == "CAFÉ"
hm, why would a code point outside the ASCII range be capitalized when looking byte-by-byte for ASCII to capitalize? :thinking:
Oh, that's evil...dang Unicode
It is "cafe" plus the combining accent mark
So it is printed as café
But it may or may not be equal to CAFÉ, depending on whether the other use also uses a combining character or uses the proper single code point É
Basically two ways to represent the same thing. One of which uses ASCII letters and thus works with this API, another that uses Unicode and does not work
b'caf\xc3\xa9'.decode() vs b'cafe\xcc\x81'.decode() in python.
ahhh gotcha
yeah that's a good point. Probably worth noting in the docs.
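Extending the Python one-liner above into something runnable: a sketch showing how a byte-by-byte ASCII uppercase treats the two spellings of "café". `ascii_upper` is a hypothetical helper standing in for the proposed builtin, assuming it only touches a-z.

```python
import unicodedata

composed = b"caf\xc3\xa9".decode()     # é as the single code point U+00E9
decomposed = b"cafe\xcc\x81".decode()  # e (U+0065) + combining acute (U+0301)


def ascii_upper(text: str) -> str:
    """Uppercase only ASCII a-z, byte by byte."""
    return text.encode("utf-8").upper().decode("utf-8")


print(composed == decomposed)                   # False
print(ascii_upper(composed) == "CAF\u00e9")     # True: é is untouched
print(ascii_upper(decomposed) == "CAFE\u0301")  # True: the plain e became E
# Normalizing to NFC shows the second result is the precomposed É.
print(unicodedata.normalize("NFC", ascii_upper(decomposed)) == "CAF\u00c9")  # True
```

So the two inputs look identical on screen but uppercase differently, and only normalization makes the results comparable.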
Last updated: Jun 16 2026 at 16:19 UTC