Stream: ideas

Topic: ASCII in builtins


view this post on Zulip Kilian Vounckx (Nov 07 2024 at 07:04):

Maybe the template should use the roc-ascii package instead of Str? AoC is always ascii anyways, and it has some nice conveniences like case conversions

view this post on Zulip jan kili (Nov 07 2024 at 16:25):

I already find it very weird that Str isn't the right library for what are, in every other language I've used, strings. Perhaps Str should be renamed to Utf8 and Roc should either drop the term string altogether or have an aggregation library that provides both utf8 and ascii utilities, while making the developer explicitly choose.

view this post on Zulip Richard Feldman (Nov 07 2024 at 16:37):

it's probably worth splitting off a separate topic, but I do think it's interesting to explore questions around strings as builtins

view this post on Zulip Richard Feldman (Nov 07 2024 at 16:38):

I think I can summarize the problem as "most programming languages have string libraries that are full of footguns that break by default in most common edge cases, and beginners struggle with the expected footguns not being there in Roc"

view this post on Zulip Richard Feldman (Nov 07 2024 at 16:39):

so like...I don't want to reintroduce footguns, but I also don't want beginners to feel frustrated, and those two seem to be in direct tension :sweat_smile:

view this post on Zulip Richard Feldman (Nov 07 2024 at 16:41):

as an example of the footguns - even though I have personally spent a ton of time learning about Unicode, grapheme clusters, etc. - I recently was working on a project at work (using Rust) in which we were doing some word wrapping, and I literally made the classic mistake of reaching for char (because it's right there!) instead of thinking about grapheme clustering, and a user immediately reported a bug around Chinese characters

view this post on Zulip Richard Feldman (Nov 07 2024 at 16:42):

I had all the knowledge, I designed Roc's string APIs to remove the footguns I knew were there in other languages, and I still fell for the footgun in my day job just because it was Right There in Rust's stdlib and it felt like the obvious choice

view this post on Zulip Richard Feldman (Nov 07 2024 at 16:42):

so I feel very strongly that the right solution here is not "repeat every other standard library's mistakes"

view this post on Zulip jan kili (Nov 07 2024 at 16:49):

I believe we can dodge the classic complexity footguns without creating new simplicity footguns :)

view this post on Zulip Richard Feldman (Nov 07 2024 at 16:54):

that would be great!

view this post on Zulip Richard Feldman (Nov 07 2024 at 16:55):

one idea that comes to mind is that Rust's stdlib has ASCII operations which are named that way

view this post on Zulip Richard Feldman (Nov 07 2024 at 16:55):

so we could consider some things from https://github.com/Hasnep/roc-ascii for builtins

view this post on Zulip Jasper Woudenberg (Nov 07 2024 at 16:55):

I think Zig has an interesting approach, in offering an std.ascii module in its standard library that contains the type of functions that would be footguns if you were to define them for utf8 strings.

One thought is to have an AsciiStr type that's a wrapper around List U8, with a fromAscii : List U8 -> Result AsciiStr [NotAnAsciiByte] method, similar to fromUtf8 for Str.
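
A minimal Python sketch of the validated-wrapper idea above. The names `AsciiStr`, `from_ascii`, and `NotAnAsciiByte` mirror the message; raising an exception and the `to_upper` helper are assumptions made for illustration, since Roc would return a `Result` rather than raise.

```python
from dataclasses import dataclass


class NotAnAsciiByte(ValueError):
    """Raised when input contains a byte outside the ASCII range 0x00-0x7F."""


@dataclass(frozen=True)
class AsciiStr:
    """Hypothetical wrapper whose contents are known to be ASCII."""

    data: bytes

    @staticmethod
    def from_ascii(data: bytes) -> "AsciiStr":
        # Validate once at the boundary so later operations can skip checks.
        for index, byte in enumerate(data):
            if byte > 0x7F:
                raise NotAnAsciiByte(f"byte {byte:#04x} at index {index}")
        return AsciiStr(bytes(data))

    def to_upper(self) -> "AsciiStr":
        # bytes.upper() only changes the ASCII letters a-z, so this stays ASCII.
        return AsciiStr(self.data.upper())

    def __len__(self) -> int:
        # One byte per character holds for ASCII, so length is unambiguous.
        return len(self.data)
```

With a wrapper like this, `AsciiStr.from_ascii(b"abc").to_upper()` yields the bytes `b"ABC"`, while any input containing a non-ASCII byte is rejected at construction time.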

view this post on Zulip Richard Feldman (Nov 07 2024 at 16:56):

yeah, certainly ASCII doesn't get new editions the way Unicode does, so it's safe to assume there would never need to be breaking changes to those APIs

view this post on Zulip Richard Feldman (Nov 07 2024 at 16:56):

and if you're using ASCII (e.g. for Advent of Code) you know you're not doing something robust, but it's still convenient

view this post on Zulip Richard Feldman (Nov 07 2024 at 16:56):

and if you're using ASCII in production, at least there's a very strong hint that you're not handling any edge cases at all

view this post on Zulip Notification Bot (Nov 07 2024 at 17:03):

19 messages were moved here from #ideas > AoC template idea -- using module params by Richard Feldman.

view this post on Zulip Richard Feldman (Nov 07 2024 at 17:04):

an interesting thing about making it a builtin is that, if we want to, we can actually track in the compiler whether a string is known to be valid ASCII

view this post on Zulip Richard Feldman (Nov 07 2024 at 17:04):

and then make those conversions be free at runtime

view this post on Zulip Jasper Woudenberg (Nov 07 2024 at 17:13):

Oh, you mean for string literals, similar to how we can infer a number literal to be one of a couple of types?

view this post on Zulip Richard Feldman (Nov 07 2024 at 17:14):

yeah something like that

view this post on Zulip Richard Feldman (Nov 07 2024 at 17:14):

LLVM optimizer might take care of it anyway though

view this post on Zulip jan kili (Nov 07 2024 at 17:26):

I moved my AoC-specific messages back there

view this post on Zulip Oskar Hahn (Nov 08 2024 at 08:15):

I don't like the idea of making ascii a builtin. I think this sends the wrong message: that unicode is hard and you should just go for ascii.

It will probably lead to a situation where many people use ascii for prototyping, leading to programs or packages that do not think about Unicode from the beginning. I like that Roc tries to work with strings in a correct way. But this should not lead to a world where people reach for ascii instead.

view this post on Zulip Oskar Hahn (Nov 08 2024 at 08:28):

I still think that the Str-type and the Str-package should be removed from the builtins. It is a source of confusion for newcomers, and most of the problems should be solved with either List U8 or the unicode-package instead.

The List-package and the Str-package are dangerously similar. There are functions that do (nearly) the same thing, and other functions that have the same name but do different things. For example Str.split and List.split.

The only advantage of Str over List U8 is that it guarantees to be valid Utf8. This guarantee is weak, because many strings come from a platform, and Roc the language does not guarantee the content of a string generated by a platform. So it is more like an error-prone convention that Str is valid utf8.

I think if there is Str in the builtins, it should be an alias for List U8. A type that has any guarantees should be an opaque type in the unicode package.

view this post on Zulip Richard Feldman (Nov 08 2024 at 12:34):

Oskar Hahn said:

I think if there is Str in the builtins, it should be an alias for List U8. A type that has any guarantees should be an opaque type in the unicode package.

I don't think there's a world where Roc doesn't have string literals, and if we have string literals whose type is List U8, then we have definitely reintroduced the footgun of "string length Just Works on ASCII and Just Breaks on things like emoji if you have the typical mental model"

view this post on Zulip Richard Feldman (Nov 08 2024 at 12:45):

I think one of the selling points in favor of having a separate Str type is that it doesn't expose a function named "length"

view this post on Zulip Richard Feldman (Nov 08 2024 at 12:45):

that's one of the common footguns
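
To make the "length" footgun concrete, here is a small Python sketch. The helper name `measure` is made up for this illustration; it reports three different answers to "how long is this string?", none of which is the number of characters a user sees for the emoji case.

```python
def measure(text: str) -> dict[str, int]:
    """Return three common answers to "how long is this string?"."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "code_points": len(text),
    }


ascii_word = "hello"
# Thumbs-up (U+1F44D) followed by a skin tone modifier (U+1F3FD):
# rendered as a single user-perceived character.
thumbs_up = "\U0001F44D\U0001F3FD"

print(measure(ascii_word))  # all three counts are 5
print(measure(thumbs_up))   # 8 bytes, 4 UTF-16 units, 2 code points
```

For ASCII text every measure agrees, which is why a `length` function appears to Just Work until non-ASCII input arrives.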

view this post on Zulip Richard Feldman (Nov 08 2024 at 12:47):

also, when it comes to uppercasing and lowercasing, going straight to List U8 unfortunately doesn't help beginners...that's still an operation that's easy in most languages but not in Roc

view this post on Zulip Richard Feldman (Nov 08 2024 at 12:49):

we could have an Ascii module that works in terms of either Str or bytes

view this post on Zulip Richard Feldman (Nov 08 2024 at 12:50):

e.g. Ascii.upperCaseStr : Str -> Str and Ascii.upperCaseByte : U8 -> U8
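
A rough Python sketch of what that pair of functions might do. The snake_case names are stand-ins for the `Ascii.upperCaseStr` and `Ascii.upperCaseByte` signatures above, and the byte-by-byte implementation is an assumption, not a description of any existing Roc builtin.

```python
def upper_case_byte(byte: int) -> int:
    """Uppercase a single byte if it is an ASCII letter a-z, else return it unchanged."""
    if 0x61 <= byte <= 0x7A:  # 'a'..'z'
        return byte - 0x20    # 'A'..'Z' sit exactly 32 positions lower
    return byte


def upper_case_str(text: str) -> str:
    """Uppercase only the ASCII letters of a UTF-8 string."""
    # Every byte of a multi-byte UTF-8 sequence is >= 0x80, so it can never
    # be mistaken for an ASCII letter and passes through untouched.
    raw = text.encode("utf-8")
    return bytes(upper_case_byte(b) for b in raw).decode("utf-8")
```

Under these assumptions, `upper_case_str("héllo")` gives `"HéLLO"`: the precomposed `é` is left alone because it is not ASCII.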

view this post on Zulip Richard Feldman (Nov 08 2024 at 12:50):

but that doesn't feel great haha

view this post on Zulip Richard Feldman (Nov 08 2024 at 12:58):

:thinking: maybe we could have a Locale ability builtin or something

view this post on Zulip Richard Feldman (Nov 08 2024 at 12:59):

so we could have Str.toUpperCase : Str, locale -> Str where locale implements Locale

view this post on Zulip Richard Feldman (Nov 08 2024 at 13:00):

that would be totally discoverable, but you couldn't call it without a Locale, and then the docs for the function could explain how and why you need one of those

view this post on Zulip Richard Feldman (Nov 08 2024 at 13:00):

and walk you through how to get one
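
As an illustration of why a locale argument matters, here is a deliberately tiny Python sketch. The `Locale` class and `to_upper_case` function are hypothetical and handle only the well-known Turkish/Azerbaijani dotted-i rule; a real implementation would rely on full Unicode locale data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locale:
    """Hypothetical locale value; only a language code is modelled here."""

    language: str  # e.g. "en" or "tr"


def to_upper_case(text: str, locale: Locale) -> str:
    if locale.language in ("tr", "az"):
        # In Turkish and Azerbaijani, dotted "i" uppercases to dotted
        # "İ" (U+0130) rather than to "I".
        text = text.replace("i", "\u0130")
    # Python's str.upper() applies the locale-independent Unicode mappings.
    return text.upper()


english = to_upper_case("istanbul", Locale("en"))  # "ISTANBUL"
turkish = to_upper_case("istanbul", Locale("tr"))  # "İSTANBUL"
```

The same input produces two different correct answers, which is the reason the function cannot be called without choosing a locale.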

view this post on Zulip Richard Feldman (Nov 08 2024 at 13:03):

and although that wouldn't help in the case where you've converted the string to a List U8, it would at least make it easier to discover some docs which help you learn how to do that too

view this post on Zulip Richard Feldman (Nov 08 2024 at 13:11):

similarly, a Str.getUtf8Byte : Str, U64 -> Result U8 [OutOfBounds] could be a discoverable way to help learn about indexing

view this post on Zulip jan kili (Nov 08 2024 at 14:03):

What if Str : [Ascii (List U8), Utf8 (iforgetwhatthisis)] ?

view this post on Zulip jan kili (Nov 08 2024 at 14:04):

Then different Str.foo functions can accept one or the other or both tags

view this post on Zulip jan kili (Nov 08 2024 at 14:07):

That feels the most true/honest, from a domain modeling perspective

view this post on Zulip jan kili (Nov 08 2024 at 14:08):

Also a can't-miss demonstration of the power of tag unions to new learners

view this post on Zulip jan kili (Nov 08 2024 at 14:19):

It seems like (what the smart people call) a zero-cost abstraction, where depending on your Str constructor function call, your code might not need to be any longer or ever need to acknowledge an alternative that you don't use

view this post on Zulip jan kili (Nov 08 2024 at 14:29):

Maybe we want a third option in there - , Either (...)] primarily for literals. Then it can settle into A/U whenever you do an A/U-specific operation on it, or in the meantime just hang out as ambiguous cause that might not hurt anything!

view this post on Zulip jan kili (Nov 08 2024 at 14:30):

Maybe we can do some similar type classification magic as for number literals, to cast a string literal as Utf8 if it contains a non-ASCII character.

view this post on Zulip jan kili (Nov 08 2024 at 14:38):

This idea feels like the deepest I've ever understood algebraic types, so nobody is allowed to point out a single flaw in it. I'll make a face like this: :weary: jk

view this post on Zulip Brendan Hansknecht (Nov 08 2024 at 15:43):

That definitely isn't a zero cost abstraction. It is an extra branch anytime you interact with a string. To make it zero cost, the encoding would have to be compile time only. Not a tag which has runtime uses.

view this post on Zulip Brendan Hansknecht (Nov 08 2024 at 15:45):

I agree both that the current state of roc is inconvenient and that simply using List U8 is not the correct choice if we want to help people avoid footguns.

We could have a str type with multiple encodings. Maybe even make the encoding an ability somehow, but that feels brittle and a lot of extra complexity.

view this post on Zulip Brendan Hansknecht (Nov 08 2024 at 15:48):

My best thought currently is that Unicode should be an official builtin library, but versioned separately from the rest of roc.

view this post on Zulip Jasper Woudenberg (Nov 08 2024 at 17:14):

I wonder how often someone looking for a Str.toUppercase turns out to be looking for locale aware unicode uppercasing vs ascii uppercasing. I guess it's risky for the documentation to assume folks are looking for one, because it creates a poor experience for folks really needing the other.

In that sense the current setup is nice. People end up in a thread like this, realize they need to make a decision for ascii or locale-based uppercasing, then continue their way to either roc-ascii or roc-unicode. It'd be nice if folks didn't need to go through Zulip, but I wonder if we can do so in a way that keeps the 'what toUpperCase do you need?' question in the path of people looking for toUppercase.

The documentation of Str already has a section on capitalization that explains the situation, but folks looking for a toUppercase might miss it. One solution I think would offer a great experience would be to have a documentation entry for Str.toUppercase, but clearly flag it to say something like:

This function does not exist! If you're looking to uppercase ASCII-only text for programming identifiers, advent of code, or similar, take a look at toUppercase in the roc-ascii package. If you need to uppercase strings presented to your users, then you're looking for the roc-unicode package.

Even cooler would be to count how often both links are clicked, to get data on whether one or the other would make a good default.

view this post on Zulip jan kili (Nov 08 2024 at 17:37):

What if we made Str a nonzero-cost abstraction with dev-mode-only optimization hint logs to use one of those dedicated libraries? Silly?

view this post on Zulip jan kili (Nov 08 2024 at 17:39):

This conversation has me thinking of general-purpose strings as not actually a real thing, rather just a developer convenience that postpones optimal implementation.

view this post on Zulip jan kili (Nov 08 2024 at 17:40):

Clippy says "I see you're looking for a Str built-in. I can help with that, but have you considered that strings are a figment of your imagination?"

view this post on Zulip jan kili (Nov 08 2024 at 17:42):

In the age of emojis, a first-class Str built-in could feel a bit like a first-class Nibble built-in.

view this post on Zulip jan kili (Nov 08 2024 at 17:44):

(I'm aware that I'm proposing gambling a big chunk of the weirdness budget on a principled future-looking stance with style points.)

view this post on Zulip jan kili (Nov 08 2024 at 17:47):

I imagine that as a noob seeing Str redirecting me to dedicated A/U libraries (in any of the above ways), I'd have a 70% chance of going "whoa this language is smart".

view this post on Zulip Agus Zubiaga (Nov 08 2024 at 17:52):

Jasper Woudenberg said:

One solution I think would offer a great experience would be to have a documentation entry for Str.toUppercase

Do you mean something like this? :smiley:

toUppercase : [] -> [GoodLuckCallingMe]

view this post on Zulip Anton (Nov 08 2024 at 18:05):

toUppercase : [] -> [YouHaveMuchToLearn]

:big_smile:

view this post on Zulip Jasper Woudenberg (Nov 08 2024 at 18:11):

Agus Zubiaga said:

Jasper Woudenberg said:

One solution I think would offer a great experience would be to have a documentation entry for Str.toUppercase

Do you mean something like this? :smiley:

toUppercase : [] -> [GoodLuckCallingMe]

Hahaha, not really. I was thinking the function signature would be what you'd expect:

toUppercase : Str -> Str

But marked in a very clear way to say that the module doesn't actually expose this function. It's just there to document the non-existence of the function for folks who expect it to exist.

Roc documentation would likely need special support for this. In a way it might be similar to deprecation support. Preventative deprecation!

That actually makes me think of another use case for documenting non-existing functions. If a package had a new major release that removes a particular function, it could be neat to keep a documentation entry of the function for folks who learned about it in a now outdated SO/zulip/github/LLM answer.

view this post on Zulip Richard Feldman (Nov 08 2024 at 18:30):

yeah maybe with a strikethrough

view this post on Zulip Richard Feldman (Nov 08 2024 at 18:31):

I've seen that in other languages, at least in ide autocomplete

view this post on Zulip Richard Feldman (Nov 08 2024 at 18:31):

seems reasonable for docs too!

view this post on Zulip Brendan Hansknecht (Nov 09 2024 at 06:34):

I feel like I normally see that for deprecated functions. So you can technically call it, but you shouldn't

view this post on Zulip Jasper Woudenberg (Nov 09 2024 at 08:40):

I guess that to get the auto-complete to pick up on it, the function should not exist only in documentation but in code in some stub form too. Maybe like this:

## uppercase is tricky yadda yadda
toUppercase : Str -> Str
toUppercase = nope

nope being a special keyword (needs a better name).

The point being that the function would show up in editor-autocomplete too (along with its documentation). The compiler could show an error if it encounters nope in a compiled program.

view this post on Zulip Luke Boswell (Nov 09 2024 at 08:54):

Could we just use crash?

view this post on Zulip Luke Boswell (Nov 09 2024 at 08:55):

## uppercase is tricky yadda yadda
toUppercase : Str -> Str
toUppercase = crash "deprecated -- DO NOT USE"

view this post on Zulip Anton (Nov 09 2024 at 09:59):

I'd like to avoid special keywords and runtime crashes if we can.
How about this:

toUppercase : Str -> [ReadDocsOfStrToUppercase]
toUppercase = \_str -> ReadDocsOfStrToUppercase

A type mismatch will be noticed immediately in the editor or at compile time.

view this post on Zulip Jasper Woudenberg (Nov 09 2024 at 10:04):

I think a downside of not using the real type is that it might hurt integrations in some places. For instance, a type-search for Str -> Str would not find the function in your example, Anton.

Also, the compiler being able to tell you the reason the function doesn't exist would be a better experience than the compiler telling you to look up documentation. It saves you a hop!

The downside of using crash, I think, would be that you might use the function and only find out it's a problem at runtime.

view this post on Zulip Anton (Nov 09 2024 at 10:09):

Also, the compiler being able to tell you the reason the function doesn't exist would be a better experience than the compiler telling you to look up documentation. It saves you a hop!

Sure, but this issue is so complicated that it would be hard to create a nice reading experience (with several links) in an error message, especially when it is shown in some editor popup.

view this post on Zulip Anton (Nov 09 2024 at 10:13):

For instance, a type-search for Str -> Str would not find the function in your example

That is true but I don't think it will be common to search for toUppercase with a type-search. Most likely you're asking an LLM "roc function to uppercase" or some more high level description of your task.

view this post on Zulip Jasper Woudenberg (Nov 09 2024 at 10:15):

Sure, I don't think we should put the entire explanation in there. But I think there'd be space for the highlights, like:

Use <roc-unicode> if you're working with text you present to the user, use <roc-ascii> if you're working with ascii-only text. Learn more about unicode <here>.

I'm using android-studio at work at the moment. Much prefer Vim for the most part, but one thing I do appreciate is being able to hover over a struck-through (deprecated) function and getting a little tooltip showing what I should be doing instead. I think it'd be nice if Roc would support something like that.

That is true but I don't think it will be common to search for toUppercase with a type-search. Most likely you're asking an LLM "roc function to uppercase" or some more high level description of your task.

I would totally use type-search over an LLM for this sort of thing!

view this post on Zulip Richard Feldman (Jan 06 2025 at 17:19):

here's a concrete proposal for 3 ASCII functions we could add to Str. (I wrote the docs in static dispatch style, but we'd convert them to today's syntax if we decide to land them.)

view this post on Zulip Richard Feldman (Jan 06 2025 at 17:20):

my goal with them is to make it super clear up front what they do and don't do, from the name to the initial example, and then to explain what they're useful for and why you should use the unicode package for everything else

view this post on Zulip Richard Feldman (Jan 06 2025 at 17:21):

I think this would be both useful and also a more discoverable way to learn why we don't have Unicode capitalization in builtins, since by searching for uppercase/lowercase you'd come across these, and then hopefully find the explanation in the docs

view this post on Zulip Richard Feldman (Jan 06 2025 at 17:22):

plus, as someone noted elsewhere, this would actually be the most performant way to do things like case-insensitive comparisons for env vars and command-line args where you know they're hardcoded ASCII strings, so it makes sense to offer a primitive for that use case
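
A short Python sketch of that use case: comparing hardcoded ASCII names without regard to case. `with_ascii_lowercased` borrows a name from the proposal discussed in this thread, but its signature here, and the helper `ascii_caseless_equals`, are assumptions for illustration.

```python
import string

# Translation table that maps only A-Z to a-z; every other character is untouched.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def with_ascii_lowercased(text: str) -> str:
    """Lowercase the ASCII letters of text and leave everything else as-is."""
    return text.translate(_ASCII_LOWER)


def ascii_caseless_equals(left: str, right: str) -> bool:
    """Compare two strings, ignoring case differences in ASCII letters only."""
    return with_ascii_lowercased(left) == with_ascii_lowercased(right)
```

This treats `"--Verbose"` and `"--verbose"` as equal, while `"É"` and `"é"` stay different because neither is ASCII; handling that case is the job of a Unicode-aware package.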

view this post on Zulip Richard Feldman (Jan 06 2025 at 17:22):

my hope is that with these names and these docs, these won't be footguns in practice :big_smile:

view this post on Zulip Sam Mohr (Jan 06 2025 at 17:39):

I think eventually these should point users to roc-lang/unicode when that's more mature

view this post on Zulip Sam Mohr (Jan 06 2025 at 17:41):

Also, the with_* naming tends to be used for functions that take a callback that acts on some object, e.g. with_file! opens a file and passes it to a function that can do stuff with it, and auto-closes when finished

view this post on Zulip Sam Mohr (Jan 06 2025 at 17:41):

So maybe we should drop the with?

view this post on Zulip Sam Mohr (Jan 06 2025 at 17:41):

Or maybe only_ascii_*

view this post on Zulip Richard Feldman (Jan 06 2025 at 17:47):

I added the with later (I was thinking of it in the sense of like with_default) because I figured that comparing my_str.ascii_lowercased() vs my_str.with_ascii_lowercased(), the second one is clearer that it's only lowercasing the ASCII

view this post on Zulip Richard Feldman (Jan 06 2025 at 17:47):

I think you can still probably guess that if it doesn't have the with, but since I'm concerned about footguns here, I figured making it longer but clearer was worth it

view this post on Zulip Sam Mohr (Jan 06 2025 at 17:49):

Oh, I can see it now.

view this post on Zulip Sam Mohr (Jan 06 2025 at 17:49):

Okay, looks good to me

view this post on Zulip Anthony Bullard (Jan 06 2025 at 19:19):

Ship it!

view this post on Zulip Richard Feldman (Jan 06 2025 at 19:29):

cool, if anyone wants to pick up the implementation, feel free!

view this post on Zulip Sam Mohr (Jan 06 2025 at 19:30):

I'll make an issue

view this post on Zulip Sam Mohr (Jan 06 2025 at 19:40):

https://github.com/roc-lang/roc/issues/7473

view this post on Zulip Norbert Hajagos (Jan 06 2025 at 19:55):

I would like to take this on as I haven't created a builtin yet. It seems like it would be a good start.

view this post on Zulip shua (Jan 08 2025 at 23:54):

Richard Feldman said:

here's a concrete proposal for 3 ASCII functions we could add to Str. (I wrote the docs in static dispatch style, but we'd convert them to today's syntax if we decide to land them.)

To be pedantic, the é in the examples is a single codepoint outside the ascii range, U+00E9 and is probably not the ascii character e U+0065 with combining acute U+0301, which looks the same. Writing "café" using the combining character (and assuming the implementation of capitalize_ascii just looks byte-by-byte) would result in Str.capitalize_ascii "café" == "CAFÉ"

view this post on Zulip Richard Feldman (Jan 09 2025 at 00:15):

hm, why would a code point outside the ASCII range be capitalized when looking byte-by-byte for ASCII to capitalize? :thinking:

view this post on Zulip Brendan Hansknecht (Jan 09 2025 at 00:36):

Oh, that's evil...dang Unicode

view this post on Zulip Brendan Hansknecht (Jan 09 2025 at 00:36):

It is "cafe" plus the combining accent mark

view this post on Zulip Brendan Hansknecht (Jan 09 2025 at 00:36):

So it is printed as café

view this post on Zulip Brendan Hansknecht (Jan 09 2025 at 00:39):

But it may or may not be equal to CAFÉ, depending on whether the other side also uses a combining character or uses the proper single code point É

view this post on Zulip Brendan Hansknecht (Jan 09 2025 at 00:40):

Basically two ways to represent the same thing. One of which uses ASCII letters and thus works with this API, another that uses Unicode and does not work

view this post on Zulip Brendan Hansknecht (Jan 09 2025 at 00:42):

b'caf\xc3\xa9'.decode() vs b'cafe\xcc\x81'.decode() in python.
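
Expanding those two byte strings into a runnable Python sketch: the helper `ascii_upper` is a made-up stand-in for an ASCII-only capitalization function that works byte by byte, which is the assumption stated earlier in this thread.

```python
import unicodedata

precomposed = b"caf\xc3\xa9".decode()  # é as the single code point U+00E9
decomposed = b"cafe\xcc\x81".decode()  # e (U+0065) + combining acute (U+0301)

# They render the same but are different code point sequences.
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed


def ascii_upper(text: str) -> str:
    # bytes.upper() only touches ASCII a-z; UTF-8 bytes >= 0x80 pass through.
    return text.encode("utf-8").upper().decode("utf-8")


upper_precomposed = ascii_upper(precomposed)  # "CAFé": é is not ASCII
upper_decomposed = ascii_upper(decomposed)    # "CAFE" + U+0301, renders as CAFÉ
```

Normalizing to NFC before ASCII uppercasing would make the two inputs behave the same way, since the `e` would already be folded into the non-ASCII `é`.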

view this post on Zulip Richard Feldman (Jan 09 2025 at 00:51):

ahhh gotcha

view this post on Zulip Richard Feldman (Jan 09 2025 at 00:52):

yeah that's a good point. Probably worth noting in the docs.


Last updated: Jun 16 2026 at 16:19 UTC