ASCII in builtins · ideas · Zulip Chat Archive

Maybe the template should use the roc-ascii package instead of Str? AoC is always ascii anyways, and it has some nice conveniences like case conversions

jan kili (Nov 07 2024 at 16:25):

I already find it very weird that Str isn't the right library for what are, in every other language I've used, strings. Perhaps Str should be renamed to Utf8 and Roc should either drop the term string altogether or have an aggregation library that provides both utf8 and ascii utilities, while making the developer explicitly choose.

Richard Feldman (Nov 07 2024 at 16:37):

it's probably worth splitting off a separate topic, but I do think it's interesting to explore questions around strings as builtins

Richard Feldman (Nov 07 2024 at 16:38):

I think I can summarize the problem as "most programming languages have string libraries that are full of footguns that break by default in most common edge cases, and beginners struggle with the expected footguns not being there in Roc"

Richard Feldman (Nov 07 2024 at 16:39):

so like...I don't want to reintroduce footguns, but I also don't want beginners to feel frustrated, and those two seem to be in direct tension :sweat_smile:

Richard Feldman (Nov 07 2024 at 16:41):

as an example of the footguns - even though I have personally spent a ton of time learning about Unicode, grapheme clusters, etc. - I recently was working on a project at work (using Rust) in which we were doing some word wrapping, and I literally made the classic mistake of reaching for char (because it's right there!) instead of thinking about grapheme clustering, and a user immediately reported a bug around Chinese characters

Richard Feldman (Nov 07 2024 at 16:42):

I had all the knowledge, I designed Roc's string APIs to remove the footguns I knew were there in other languages, and I still fell for the footgun in my day job just because it was Right There in Rust's stdlib and it felt like the obvious choice

Richard Feldman (Nov 07 2024 at 16:42):

so I feel very strongly that the right solution here is not "repeat every other standard library's mistakes"

jan kili (Nov 07 2024 at 16:49):

I believe we can dodge the classic complexity footguns without creating new simplicity footguns :)

Richard Feldman (Nov 07 2024 at 16:54):

Richard Feldman (Nov 07 2024 at 16:55):

one idea that comes to mind is that Rust's stdlib has ASCII operations which are named that way
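For reference, a small sketch of the kind of Rust standard-library operations mentioned here. The `ascii` in each method name signals that only bytes in the ASCII range are affected; the sample word is purely illustrative.

```rust
fn main() {
    let word = "straße";

    // ASCII-only uppercasing leaves the non-ASCII ß untouched.
    assert_eq!(word.to_ascii_uppercase(), "STRAßE");

    // The Unicode-aware version expands ß to "SS", changing the length.
    assert_eq!(word.to_uppercase(), "STRASSE");

    // `is_ascii` reports whether every byte is in the ASCII range.
    assert!(!word.is_ascii());
    assert!("strasse".is_ascii());
}
```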

Richard Feldman (Nov 07 2024 at 16:55):

Jasper Woudenberg (Nov 07 2024 at 16:55):

I think Zig has an interesting approach, in offering an std.ascii module in its standard library that contains the type of functions that would be footguns if you were to define them for utf8 strings.

One thought is to have an AsciiStr type that's a wrapper around List U8, with a fromAscii : List U8 -> Result AsciiStr [NotAnAsciiByte] method, similar to fromUtf8 for Str.
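To make the wrapper idea concrete, here is a minimal sketch in Rust rather than Roc, so the names are only analogues of the proposal. The `index` field on the error and the `to_uppercase` method are assumptions added for illustration; they are not part of the suggestion above.

```rust
/// Hypothetical wrapper type: bytes that are guaranteed to be ASCII.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AsciiStr(Vec<u8>);

/// Error carrying the position of the first non-ASCII byte (an assumed detail).
#[derive(Debug, PartialEq, Eq)]
pub struct NotAnAsciiByte {
    pub index: usize,
}

impl AsciiStr {
    /// Validates once at construction, like `fromUtf8` does for `Str`.
    pub fn from_ascii(bytes: Vec<u8>) -> Result<AsciiStr, NotAnAsciiByte> {
        match bytes.iter().position(|b| !b.is_ascii()) {
            Some(index) => Err(NotAnAsciiByte { index }),
            None => Ok(AsciiStr(bytes)),
        }
    }

    /// Safe to define here because every byte is known to be ASCII.
    pub fn to_uppercase(&self) -> AsciiStr {
        AsciiStr(self.0.to_ascii_uppercase())
    }

    pub fn as_bytes(&self) -> &[u8] {
        &self.0
    }
}

fn main() {
    let s = AsciiStr::from_ascii(b"day 1".to_vec()).unwrap();
    assert_eq!(s.to_uppercase().as_bytes(), &b"DAY 1"[..]);

    // "é" is encoded as two non-ASCII bytes, so validation fails at index 0.
    let bad = AsciiStr::from_ascii("é".as_bytes().to_vec());
    assert_eq!(bad, Err(NotAnAsciiByte { index: 0 }));
}
```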

Richard Feldman (Nov 07 2024 at 16:56):

yeah, certainly ASCII doesn't get new editions the way Unicode does, so it's safe to assume there would never need to be breaking changes to those APIs

Richard Feldman (Nov 07 2024 at 16:56):

and if you're using ASCII (e.g. for Advent of Code) you know you're not doing something robust, but it's still convenient

Richard Feldman (Nov 07 2024 at 16:56):

and if you're using ASCII in production, at least there's a very strong hint that you're not handling any edge cases at all

Notification Bot (Nov 07 2024 at 17:03):

Richard Feldman (Nov 07 2024 at 17:04):

an interesting thing about making it a builtin is that, if we want to, we can actually track in the compiler whether a string is known to be valid ASCII

Richard Feldman (Nov 07 2024 at 17:04):

Jasper Woudenberg (Nov 07 2024 at 17:13):

Oh, you mean for string literals, similar to how we can infer a number literal to be one of a couple of types?

Richard Feldman (Nov 07 2024 at 17:14):

jan kili (Nov 07 2024 at 17:26):

Oskar Hahn (Nov 08 2024 at 08:15):

I don't like the idea of making ascii a builtin. I think this sends the wrong message: that unicode is hard and you should just go for ascii.

It will probably lead to situations where many people use ascii for prototyping, leading to programs or packages that do not think about Unicode from the beginning. I like that Roc tries to work with strings in a correct way. But this should not lead to a world where people reach for ascii instead.

Oskar Hahn (Nov 08 2024 at 08:28):

I still think that the Str-type and the Str-package should be removed from the builtins. It is a source of confusion for newcomers and most of the problems should be solved with either List U8 or the unicode-package instead.

The List-package and the Str-package are dangerously similar. There are functions that do (nearly) the same and other functions, that have the same name but do different things. For example Str.split and List.split.

The only advantage of Str over List U8 is that it guarantees to be valid Utf8. This guarantee is weak, because many strings come from a platform and Roc the language does not guarantee the content of a string generated by a platform. So it is more like an error-prone convention that Str is valid utf8.

I think if there is Str in the builtins, it should be an alias for List U8. A type that has any guarantees should be an opaque type in the unicode package.

Richard Feldman (Nov 08 2024 at 12:34):

I don't think there's a world where Roc doesn't have string literals, and if we have string literals whose type is List U8, then we have definitely reintroduced the footgun of "string length Just Works on ASCII and Just Breaks on things like emoji if you have the typical mental model"
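The length footgun can be sketched with Rust's byte-oriented `len`, which here stands in for what a length on a List U8 literal would report. The emoji is written with escapes so the code points are explicit.

```rust
fn main() {
    let ascii = "hello";
    // Thumbs-up followed by a skin-tone modifier: one visible symbol.
    let emoji = "\u{1F44D}\u{1F3FD}";

    // For ASCII, byte length matches the intuitive "number of characters".
    assert_eq!(ascii.len(), 5);
    assert_eq!(ascii.chars().count(), 5);

    // For the emoji it does not: 8 UTF-8 bytes, 2 code points, 1 perceived character.
    assert_eq!(emoji.len(), 8);
    assert_eq!(emoji.chars().count(), 2);
}
```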

Richard Feldman (Nov 08 2024 at 12:45):

I think one of the selling points in favor of having a separate Str type is that it doesn't expose a function named "length"

Richard Feldman (Nov 08 2024 at 12:45):

Richard Feldman (Nov 08 2024 at 12:47):

also, when it comes to uppercasing and lowercasing, going straight to List U8 unfortunately doesn't help beginners...that's still an operation that's easy in most languages but not in Roc

Richard Feldman (Nov 08 2024 at 12:49):

Richard Feldman (Nov 08 2024 at 12:50):

Richard Feldman (Nov 08 2024 at 12:58):

Richard Feldman (Nov 08 2024 at 12:59):

so we could have Str.toUpperCase : Str, locale -> Str where locale implements Locale

Richard Feldman (Nov 08 2024 at 13:00):

that would be totally discoverable, but you couldn't call it without a Locale, and then the docs for the function could explain how and why you need one of those
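A rough sketch of the shape of that API, using a Rust trait where Roc would use an ability. The `Locale` trait, the `Root` and `Turkish` types, and the single Turkish rule shown are all assumptions for illustration; a real implementation would come from Unicode data, not a hand-written match.

```rust
/// Hypothetical stand-in for the `Locale` ability.
pub trait Locale {
    fn uppercase(&self, s: &str) -> String;
}

/// Locale-independent default mapping.
pub struct Root;

/// Turkish distinguishes dotted and dotless i.
pub struct Turkish;

impl Locale for Root {
    fn uppercase(&self, s: &str) -> String {
        s.to_uppercase()
    }
}

impl Locale for Turkish {
    fn uppercase(&self, s: &str) -> String {
        let mut out = String::with_capacity(s.len());
        for c in s.chars() {
            match c {
                // In Turkish, 'i' uppercases to dotted capital I (U+0130).
                'i' => out.push('\u{0130}'),
                _ => out.extend(c.to_uppercase()),
            }
        }
        out
    }
}

/// The caller cannot uppercase without choosing a locale.
pub fn to_upper_case<L: Locale>(s: &str, locale: &L) -> String {
    locale.uppercase(s)
}

fn main() {
    assert_eq!(to_upper_case("title", &Root), "TITLE");
    assert_eq!(to_upper_case("title", &Turkish), "T\u{0130}TLE");
}
```

The same input gives two different answers depending on the locale, which is the reason the signature forces the choice.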

Richard Feldman (Nov 08 2024 at 13:00):

Richard Feldman (Nov 08 2024 at 13:03):

and although that wouldn't help in the case where you've converted the string to a List U8, it would at least make it easier to discover some docs which help you learn how to do that too

Richard Feldman (Nov 08 2024 at 13:11):

similarly, a Str.getUtf8Byte : Str -> Result U8 [OutOfBounds] could be a discoverable way to help learn about indexing
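As a sketch of what byte-level indexing exposes, here is a Rust analogue. The `index` parameter is an assumption (the signature above does not show one), and `get_utf8_byte` and `OutOfBounds` are hypothetical names mirroring the proposal.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct OutOfBounds;

/// Returns the raw UTF-8 byte at `index`, not a character.
pub fn get_utf8_byte(s: &str, index: usize) -> Result<u8, OutOfBounds> {
    s.as_bytes().get(index).copied().ok_or(OutOfBounds)
}

fn main() {
    // 'h' followed by é, which is encoded as the two bytes 0xC3 0xA9.
    let s = "h\u{e9}";

    assert_eq!(get_utf8_byte(s, 0), Ok(b'h'));
    // Index 1 is only the first half of é.
    assert_eq!(get_utf8_byte(s, 1), Ok(0xC3));
    assert_eq!(get_utf8_byte(s, 2), Ok(0xA9));
    assert_eq!(get_utf8_byte(s, 3), Err(OutOfBounds));
}
```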

jan kili (Nov 08 2024 at 14:03):

jan kili (Nov 08 2024 at 14:04):

jan kili (Nov 08 2024 at 14:07):

jan kili (Nov 08 2024 at 14:08):

jan kili (Nov 08 2024 at 14:19):

It seems like (what the smart people call) a zero-cost abstraction, where depending on your Str constructor function call, your code might not need to be any longer or ever need to acknowledge an alternative that you don't use

jan kili (Nov 08 2024 at 14:29):

Maybe we want a third option in there - , Either (...)] primarily for literals. Then it can settle into A/U whenever you do an A/U-specific operation on it, or in the meantime just hang out as ambiguous cause that might not hurt anything!

jan kili (Nov 08 2024 at 14:30):

Maybe we can do some similar type classification magic as for number literals, to cast a string literal as Utf8 if it contains a non-ASCII character.

jan kili (Nov 08 2024 at 14:38):

This idea feels like the deepest I've ever understood algebraic types, so nobody is allowed to point out a single flaw in it. I'll make a face like this: :weary: jk

Brendan Hansknecht (Nov 08 2024 at 15:43):

That definitely isn't a zero cost abstraction. It is an extra branch anytime you interact with a string. To make it zero cost, the encoding would have to be compile time only. Not a tag which has runtime uses.
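The distinction can be sketched in Rust, with all type names hypothetical. In the first form the encoding is a runtime tag, so operations must branch on it; in the second it is a compile-time marker that occupies no space in the value.

```rust
use std::marker::PhantomData;
use std::mem::size_of;

// Runtime tag: every operation has to branch on the variant.
pub enum TaggedStr {
    Ascii(Vec<u8>),
    Utf8(String),
}

impl TaggedStr {
    pub fn byte_len(&self) -> usize {
        match self {
            TaggedStr::Ascii(bytes) => bytes.len(),
            TaggedStr::Utf8(s) => s.len(),
        }
    }
}

// Compile-time marker: the encoding exists only in the type.
pub struct Ascii;

pub struct TypedStr<E> {
    bytes: Vec<u8>,
    _encoding: PhantomData<E>,
}

impl<E> TypedStr<E> {
    // No branch needed: the representation is the same for every encoding.
    pub fn byte_len(&self) -> usize {
        self.bytes.len()
    }
}

impl TypedStr<Ascii> {
    pub fn from_ascii(bytes: Vec<u8>) -> Option<Self> {
        if bytes.is_ascii() {
            Some(TypedStr { bytes, _encoding: PhantomData })
        } else {
            None
        }
    }
}

fn main() {
    let tagged = [TaggedStr::Ascii(b"roc".to_vec()), TaggedStr::Utf8("roc".to_string())];
    assert!(tagged.iter().all(|t| t.byte_len() == 3));

    let typed = TypedStr::<Ascii>::from_ascii(b"roc".to_vec()).unwrap();
    assert_eq!(typed.byte_len(), 3);

    // The marker adds nothing to the in-memory size.
    assert_eq!(size_of::<TypedStr<Ascii>>(), size_of::<Vec<u8>>());
}
```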

Brendan Hansknecht (Nov 08 2024 at 15:45):

I agree both that the current state of roc is inconvenient and that simply using List U8 is not the correct choice if we want to help people avoid footguns.

We could have a str type with multiple encodings. Maybe even make the encoding an ability somehow, but that feels brittle and a lot of extra complexity.

Brendan Hansknecht (Nov 08 2024 at 15:48):

My best thought currently is that Unicode should be an official builtin library, but versioned separately from the rest of roc.

Jasper Woudenberg (Nov 08 2024 at 17:14):

I wonder how often someone looking for a Str.toUppercase turns out to be looking for locale aware unicode uppercasing vs ascii uppercasing. I guess it's risky for the documentation to assume folks are looking for one, because it creates a poor experience for folks really needing the other.

In that sense the current setup is nice. People end up in a thread like this, realize they need to make a decision for ascii or locale-based uppercasing, then continue their way to either roc-ascii or roc-unicode. It'd be nice if folks didn't need to go through Zulip, but I wonder if we can do so in a way that keeps the 'what toUpperCase do you need?' question in the path of people looking for toUppercase.

The documentation of Str already has a section on capitalization that explains the situation, but folks looking for a toUppercase might miss it. One solution I think would offer a great experience would be to have a documentation entry for Str.toUppercase, but clearly flag it to say something like:

Even cooler would be to count how often both links are clicked, to get data on whether one or the other would make a good default.

jan kili (Nov 08 2024 at 17:37):

What if we made Str a nonzero-cost abstraction with dev-mode-only optimization hint logs to use one of those dedicated libraries? Silly?

jan kili (Nov 08 2024 at 17:39):

This conversation has me thinking of general-purpose strings as not actually a real thing, rather just a developer convenience that postpones optimal implementation.

jan kili (Nov 08 2024 at 17:40):

Clippy says "I see you're looking for a Str built-in. I can help with that, but have you considered that strings are a figment of your imagination?"

jan kili (Nov 08 2024 at 17:42):

In the age of emojis, a first-class Str built-in could feel a bit like a first-class Nibble built-in.

jan kili (Nov 08 2024 at 17:44):

(I'm aware that I'm proposing gambling a big chunk of the weirdness budget on a principled future-looking stance with style points.)

jan kili (Nov 08 2024 at 17:47):

I imagine that as a noob seeing Str redirecting me to dedicated A/U libraries (in any of the above ways), I'd have a 70% chance of going "whoa this language is smart".

Agus Zubiaga (Nov 08 2024 at 17:52):

toUppercase : [] -> [GoodLuckCallingMe]

Anton (Nov 08 2024 at 18:05):

toUppercase : [] -> [YouHaveMuchToLearn]

Jasper Woudenberg (Nov 08 2024 at 18:11):

Hahaha, not really. I was thinking the function signature would be what you'd expect:

toUppercase : Str -> Str

But marked in a very clear way to say that the module doesn't actually expose this function. It's just there to document the non-existence of the function for folks who expect it to exist.

Roc documentation would likely need special support for this. In a way it might be similar to deprecation support. Preventative deprecation!

That actually makes me think of another use case for documenting non-existing functions. If a package had a new major release that removes a particular function, it could be neat to keep a documentation entry of the function for folks who learned about it in a now outdated SO/zulip/github/LLM answer.

Richard Feldman (Nov 08 2024 at 18:30):

Richard Feldman (Nov 08 2024 at 18:31):

Brendan Hansknecht (Nov 09 2024 at 06:34):

I feel like I normally see that for deprecated. So you can technically call it, but you shouldn't

Jasper Woudenberg (Nov 09 2024 at 08:40):

I guess that to get the auto-complete to pick up on it, the function should not exist only in documentation but in code in some stub form too. Maybe like this:

## uppercase is tricky yadda yadda
toUppercase : Str -> Str
toUppercase = nope

The point being that the function would show up in editor-autocomplete too (along with its documentation). The compiler could show an error if it encounters nope in a compiled program.
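For comparison, the closest existing mechanism I can sketch is Rust's `#[deprecated]` attribute, which is an analogy and not the `nope` idea itself: the stub shows up in autocomplete and docs, and any call site gets a compiler diagnostic carrying the note. By default that diagnostic is a warning, not an error. The function names below are hypothetical.

```rust
/// Uppercasing depends on locale and Unicode version, so this stub only
/// documents the absence of a general-purpose version.
#[deprecated(note = "not provided: use `ascii_uppercase` or a Unicode library instead")]
pub fn to_uppercase(_s: &str) -> String {
    unimplemented!("intentionally missing; see the deprecation note")
}

/// The explicitly named alternative the note points to.
pub fn ascii_uppercase(s: &str) -> String {
    s.to_ascii_uppercase()
}

fn main() {
    // Calling `to_uppercase` here would compile with a warning that shows the
    // note; under `#![deny(deprecated)]` it becomes a hard error.
    assert_eq!(ascii_uppercase("roc"), "ROC");
}
```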

Luke Boswell (Nov 09 2024 at 08:54):

Luke Boswell (Nov 09 2024 at 08:55):

## uppercase is tricky yadda yadda
toUppercase : Str -> Str
toUppercase = crash "deprecated -- DO NOT USE"

Anton (Nov 09 2024 at 09:59):

I'd like to avoid special keywords and runtime crashes if we can.
How about this:

toUppercase : Str -> [ReadDocsOfStrToUppercase]
toUppercase = \_str -> ReadDocsOfStrToUppercase

Jasper Woudenberg (Nov 09 2024 at 10:04):

I think downsides of not using the real type is that it might hurt integrations in some places. For instance, a type-search for Str -> Str would not find the function in your example, Anton.

Also, the compiler being able to tell you the reason the function doesn't exist would be a better experience than the compiler telling you to look up documentation. It saves you a hop!

The downside of using crash I think would be that you might use the function and only find out it's a problem at runtime.

Anton (Nov 09 2024 at 10:09):

Sure, but this issue is so complicated that it would be hard to create a nice reading experience (with several links) in an error message, especially when it is shown in some editor popup.

Anton (Nov 09 2024 at 10:13):

That is true but I don't think it will be common to search for toUppercase with a type-search. Most likely you're asking an LLM "roc function to uppercase" or some more high level description of your task.

Jasper Woudenberg (Nov 09 2024 at 10:15):

Sure, I don't think we should put the entire explanation in there. But I think there'd be space for the highlights, like:

I'm using android-studio at work at the moment. Much prefer Vim for the most part, but one thing I do appreciate is being able to hover over a struck-through (deprecated) function and getting a little tooltip showing what I should be doing instead. I think it'd be nice if Roc would support something like that.

Richard Feldman (Jan 06 2025 at 17:19):

here's a concrete proposal for 3 ASCII functions we could add to Str. (I wrote the docs in static dispatch style, but we'd convert them to today's syntax if we decide to land them.)

Richard Feldman (Jan 06 2025 at 17:20):

my goal with them is to make it super clear up front what they do and don't do, from the name to the initial example, and then to explain what they're useful for and why you should use the unicode package for everything else

Richard Feldman (Jan 06 2025 at 17:21):

I think this would be both useful and also a more discoverable way to learn why we don't have Unicode capitalization in builtins, since by searching for uppercase/lowercase you'd come across these, and then hopefully find the explanation in the docs

Richard Feldman (Jan 06 2025 at 17:22):

plus, as someone noted elsewhere, this would actually be the most performant way to do things like case-insensitive comparisons for env vars and command-line args where you know they're hardcoded ASCII strings, so it makes sense to offer a primitive for that use case
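A minimal sketch of that use case in Rust, whose standard library already has an ASCII-only comparison that works in place without allocating. The `is_flag` helper and the flag strings are made up for the example.

```rust
/// Hypothetical helper: does `arg` match `flag`, ignoring ASCII case only?
pub fn is_flag(arg: &str, flag: &str) -> bool {
    arg.eq_ignore_ascii_case(flag)
}

fn main() {
    assert!(is_flag("--VERBOSE", "--verbose"));
    assert!(is_flag("Debug", "debug"));

    // Non-ASCII letters are compared byte-for-byte, so É and é do not match.
    assert!(!is_flag("CAF\u{c9}", "caf\u{e9}"));
}
```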

Richard Feldman (Jan 06 2025 at 17:22):

my hope is that with these names and these docs, these won't be footguns in practice :big_smile:

Sam Mohr (Jan 06 2025 at 17:39):

I think eventually these should point users to roc-lang/unicode when that's more mature

Sam Mohr (Jan 06 2025 at 17:41):

Also, the with_* naming tends to be used for functions that take a callback that acts on some object, e.g. with_file! opens a file and passes it to a function that can do stuff with it, and auto-closes when finished

Sam Mohr (Jan 06 2025 at 17:41):

Richard Feldman (Jan 06 2025 at 17:47):

I added the with later (I was thinking of it in the sense of like with_default) because I figured that comparing my_str.ascii_lowercased() vs my_str.with_ascii_lowercased(), the second one is clearer that it's only lowercasing the ASCII

Richard Feldman (Jan 06 2025 at 17:47):

I think you can still probably guess that if it doesn't have the with, but since I'm concerned about footguns here, I figured making it longer but clearer was worth it

Sam Mohr (Jan 06 2025 at 17:49):

Anthony Bullard (Jan 06 2025 at 19:19):

Richard Feldman (Jan 06 2025 at 19:29):

Sam Mohr (Jan 06 2025 at 19:30):

Sam Mohr (Jan 06 2025 at 19:40):

Norbert Hajagos (Jan 06 2025 at 19:55):

I would like to take this on as I haven't created a builtin yet. it seems like it would be a good start.

shua (Jan 08 2025 at 23:54):

To be pedantic, the é in the examples is a single codepoint outside the ascii range, U+00E9 and is probably not the ascii character e U+0065 with combining acute U+0301, which looks the same. Writing "café" using the combining character (and assuming the implementation of capitalize_ascii just looks byte-by-byte) would result in Str.capitalize_ascii "café" == "CAFÉ"

Richard Feldman (Jan 09 2025 at 00:15):

hm, why would a code point outside the ASCII range be capitalized when looking byte-by-byte for ASCII to capitalize? :thinking:

Brendan Hansknecht (Jan 09 2025 at 00:36):

Brendan Hansknecht (Jan 09 2025 at 00:39):

But it may or may not be equal to CAFÉ, depending on whether the other side also uses a combining character or uses the proper single code point É

Brendan Hansknecht (Jan 09 2025 at 00:40):

Basically two ways to represent the same thing. One of which uses ASCII letters and thus works with this API, another that uses Unicode and does not work
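The two representations can be checked with Rust's ASCII uppercasing, used here as a stand-in for a byte-by-byte `capitalize_ascii`. Escapes make the code points explicit.

```rust
fn main() {
    let precomposed = "caf\u{e9}"; // é as one code point, U+00E9
    let combining = "cafe\u{301}"; // ASCII e followed by combining acute, U+0301

    // Both render as "café", yet they are different byte sequences.
    assert_ne!(precomposed, combining);

    // ASCII uppercasing skips U+00E9 but does uppercase the plain e.
    assert_eq!(precomposed.to_ascii_uppercase(), "CAF\u{e9}");
    assert_eq!(combining.to_ascii_uppercase(), "CAFE\u{301}");

    // The second result renders as "CAFÉ" but is not equal to the
    // precomposed spelling that uses U+00C9.
    assert_ne!(combining.to_ascii_uppercase(), "CAF\u{c9}");
}
```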

Stream: ideas

Topic: ASCII in builtins

Kilian Vounckx (Nov 07 2024 at 07:04):
