Stream: ideas

Topic: Str: upper, lower, and scalars


view this post on Zulip Joshua Warner (Dec 24 2022 at 17:58):

I just implemented functions to convert between lower_case_names and UpperCaseNames, and ran into a few obstacles that I thought might be worth talking about.

For my first attempt at the upperName conversion (the forward direction, in the order I've written them), I tried to split on '_', convert each element to title case, then join again. It turns out there's no way to convert to title case. Nor is there a way to split off the first scalar, convert it to uppercase, and concat it back.

For the lowerName direction, I was originally going to find anything matching /[A-Z]/, insert underscores at each location, then lowercase the whole thing. However, there don't seem to be any functions to locate a pattern in a string (regex or not).

Eventually I found Str.walkScalars, which worked, and landed on this:

underscoreScalar = 95 # Yes, I now know I could have just used '_' literals
aLowerScalar = 97
zLowerScalar = 122
aUpperScalar = 65
zUpperScalar = 90

# map from a lower_case_name to a UpperCaseName
upperName : Str -> Str
upperName = \name ->
    result = Str.walkScalars name {text: "", needUpper: Bool.true} \{text, needUpper}, c ->
        if c == underscoreScalar then
            {text, needUpper: Bool.true}
        else
            newText =
                if needUpper then
                    Str.appendScalar text (c - aLowerScalar + aUpperScalar) |> orCrash
                else
                    Str.appendScalar text c |> orCrash
            {text: newText, needUpper: Bool.false}
    result.text

expect (upperName "hello_world") == "HelloWorld"

orCrash : Result a e -> a
orCrash = \result ->
    when result is
        Ok a -> a
        Err _ -> crash "orCrash"

lowerName : Str -> Str
lowerName = \name ->
    result = Str.walkScalars name {text: "", needUnder: Bool.false} \{text, needUnder}, c ->
        newText =
            if c >= aUpperScalar && c <= zUpperScalar then
                if needUnder then
                    text
                        |> Str.appendScalar underscoreScalar
                        |> orCrash
                        |> Str.appendScalar (c - aUpperScalar + aLowerScalar)
                        |> orCrash
                else
                    text
                        |> Str.appendScalar (c - aUpperScalar + aLowerScalar)
                        |> orCrash
            else
                Str.appendScalar text c |> orCrash

        {text: newText, needUnder: Bool.true}

    result.text

expect
    theResult = (lowerName "HelloWorld")
    theResult == "hello_world"

There are of course a few things that are non-ideal about that:

Here's what I would have liked to have written:

upperName : Str -> Str
upperName = \name ->
    result = Str.walkScalars name {text: "", needUpper: Bool.true} \{text, needUpper}, c ->
        if c == '_' then
            {text, needUpper: Bool.true}
        else
            newText =
                if needUpper then
                    Str.appendScalar text (Scalar.toAsciiUpper c)
                else
                    Str.appendScalar text c
            {text: newText, needUpper: Bool.false}
    result.text

lowerName : Str -> Str
lowerName = \name ->
    result = Str.walkScalars name {text: "", needUnder: Bool.false} \{text, needUnder}, c ->
        newText =
            # only insert an underscore before an uppercase ASCII letter
            if needUnder && c >= 'A' && c <= 'Z' then
                text
                    |> Str.appendScalar '_'
                    |> Str.appendScalar (Scalar.toAsciiLower c)
            else
                text |> Str.appendScalar (Scalar.toAsciiLower c)

        {text: newText, needUnder: Bool.true}

    result.text
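
For comparison, here is a minimal sketch of the same ASCII-only conversion in Python rather than Roc. The helpers `to_ascii_upper` and `to_ascii_lower` are hypothetical stand-ins for the `Scalar.toAsciiUpper` / `Scalar.toAsciiLower` functions wished for above; they deliberately leave every non-ASCII character untouched.

```python
def to_ascii_upper(ch: str) -> str:
    """Uppercase a single character only if it is an ASCII lowercase letter."""
    return chr(ord(ch) - 32) if "a" <= ch <= "z" else ch


def to_ascii_lower(ch: str) -> str:
    """Lowercase a single character only if it is an ASCII uppercase letter."""
    return chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch


def upper_name(name: str) -> str:
    """Map a lower_case_name to an UpperCaseName (ASCII only)."""
    out = []
    need_upper = True  # the first letter and any letter after '_' is capitalized
    for ch in name:
        if ch == "_":
            need_upper = True
        else:
            out.append(to_ascii_upper(ch) if need_upper else ch)
            need_upper = False
    return "".join(out)


def lower_name(name: str) -> str:
    """Map an UpperCaseName to a lower_case_name (ASCII only)."""
    out = []
    for index, ch in enumerate(name):
        # insert an underscore before every uppercase letter except the first character
        if index > 0 and "A" <= ch <= "Z":
            out.append("_")
        out.append(to_ascii_lower(ch))
    return "".join(out)
```

With these definitions, `upper_name("hello_world")` gives `"HelloWorld"` and `lower_name("HelloWorld")` gives `"hello_world"`, matching the two `expect`s in the Roc version.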

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:02):

can we split this into 2 ideas? :big_smile:

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:02):

uppercase/lowercase and scalars both have nontrivial design considerations, so I think it'll be helpful to have different threads about them!

view this post on Zulip Joshua Warner (Dec 24 2022 at 18:08):

I guess this current proposal is really _all_ about scalars

view this post on Zulip Joshua Warner (Dec 24 2022 at 18:08):

I don't really need Str.toUpper - I need Scalar.toUpper

view this post on Zulip Joshua Warner (Dec 24 2022 at 18:09):

I suppose we could have a Scalar module without a Scalar type, or just throw these in the Str module as Str.scalarToAsciiUpper or something

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:21):

so, here are some considerations I've thought about with regard to uppercase:

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:23):

of course, one more consideration there is the use case you mentioned here: if you're just trying to convert some programmatic text from snake_case to camelCase or PascalCase, it's probably all ASCII anyway and so getting locales involved wouldn't matter

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:25):

scalars are another rabbit hole; basically the more I've learned about Unicode, the more strongly I've become convinced that they should not get special treatment in stdlibs (which is why they don't in Roc), because I think the vast majority of the time they look like the right thing to reach for, they're actually the wrong thing to reach for

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:25):

(grapheme clusters are)

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:26):

some examples of why: https://github.com/roc-lang/roc/issues/4780

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:27):

there is basically no hope that anyone doing string processing in terms of Unicode scalars will implement the correct Unicode semantics for the cases mentioned in that issue

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:27):

so their best hope is that those cases don't come up (e.g. certain modifiers, emojis, etc.)

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:28):

whereas if you're doing string processing in terms of graphemes (with each grapheme represented as a Str) we can actually offer correct semantics by default in the stdlib! (such as in that issue)
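
A small illustration of the scalar-versus-grapheme gap, sketched in Python rather than Roc. Python's `str` is indexed by code point, which for these inputs is the same as by Unicode scalar. The two sample strings are assumptions chosen for the demonstration, not cases taken from the linked issue.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: two scalars that render as one "é"
decomposed = "e\u0301"
# Two regional-indicator scalars that render as a single flag
flag = "\U0001F1FA\U0001F1F8"

# len() counts scalars, not what a reader perceives as characters
accent_scalars = len(decomposed)
flag_scalars = len(flag)

# Reversing scalar by scalar detaches the accent from its base letter
reversed_by_scalar = decomposed[::-1]

# Normalization can merge this particular pair into one scalar, but that
# does not help for sequences like the flag, which have no composed form
composed = unicodedata.normalize("NFC", decomposed)
```

Here `accent_scalars` and `flag_scalars` are both 2 even though each string displays as a single grapheme, and `reversed_by_scalar` puts the combining accent before the `e`.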

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:32):

Unicode is hard :sweat_smile:

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:33):

but I want to create a Pit of Success around it if possible, instead of the footguns most stdlibs have!

view this post on Zulip Joshua Warner (Dec 24 2022 at 18:36):

FWIW this is exactly why I was calling it toAsciiUpper, not toUpper - with the intention that it very clearly only handles ascii.

view this post on Zulip Joshua Warner (Dec 24 2022 at 18:37):

At work a few years ago we had a meeting room named "turkish i", because of exactly the problem you describe.

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:39):

yeah, my concern with offering Str.toAsciiUpper as the only one in the stdlib is that a ton of people will reach for it when they shouldn't, because it's convenient, and then Turkish users (among others) will be sad

view this post on Zulip Kevin Gillette (Dec 24 2022 at 18:41):

Under what circumstances would an env var observably change during the running of a process? Windows perhaps? That cannot happen on unixy systems.

Certainly if there are other mechanisms for locale specification, then those could change at runtime, but perhaps the need to react to such changes is niche, such as for long running processes or interactive (gui/tui) apps, but it's likely quite an anti-feature for non-interactive cli apps and batch processing, in which determinism and speed are much more important

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:41):

like for example, suppose we had a Locale.toUpper, and if you know your text doesn't need to handle anything but ASCII, you just always pass it the same hardcoded locale

view this post on Zulip Kevin Gillette (Dec 24 2022 at 18:42):

I would favor at least Unicode case changing as the basic option, since it does the right thing for ascii (from an ascii interpretation)

view this post on Zulip Joshua Warner (Dec 24 2022 at 18:44):

That makes code that only wants to deal with ASCII needlessly more complicated, since (for example) upper/lower case can no longer operate over individual scalars - e.g. because uppercasing the German ß yields "SS" (two scalars).
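
The ß point can be checked with Python's built-in Unicode case mapping (Python is used here only for comparison; the helper `ascii_upper` is a hypothetical ASCII-only function, not an existing API):

```python
def ascii_upper(text: str) -> str:
    """Uppercase only ASCII a-z, one scalar in, one scalar out."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in text)


word = "stra\u00dfe"  # "straße"

# Full Unicode uppercasing maps the single scalar ß to the two scalars "SS",
# so the result is longer than the input
unicode_result = word.upper()

# The ASCII-only mapping keeps the length and leaves ß untouched
ascii_result = ascii_upper(word)
```

`unicode_result` is `"STRASSE"` (7 scalars from 6), while `ascii_result` is `"STRAßE"`. A scalar-to-scalar signature therefore cannot express full Unicode uppercasing, which is the complication described above.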

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:44):

@Kevin Gillette here's an easier way to see why that path won't work: if we're compiling to wasm which might run on any of:

...which wasm bytecode instructions should we emit to determine the locale in the builtins?

view this post on Zulip Kevin Gillette (Dec 24 2022 at 18:44):

locale handling is exceptionally hard, has been done poorly many, many times, and so when we tackle that, it should be considered a major initiative involving many perspectives, probably closer to a Roc 1.0 release, not something we just have one person roll up their sleeves to "solve"

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:45):

I totally buy that locale is a big project :+1:

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:45):

but I'm also pretty sure it's a big project that shouldn't be in the stdlib

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:46):

@Joshua Warner there might actually be a simple solution here: what if we just intentionally didn't put a toAsciiUppercase in the stdlib, and instead left it for a third party package to provide?

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:47):

that way the path of least resistance is not to reach for it inappropriately because it's in the stdlib

view this post on Zulip Joshua Warner (Dec 24 2022 at 18:47):

Sure, that works fine, as long as it's easy to pull in a third party package

view this post on Zulip Joshua Warner (Dec 24 2022 at 18:47):

(currently it's not!!!!)

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:47):

it should be now! I just landed that last week

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:47):

should be able to publish them and import them via URLs just like platforms now

view this post on Zulip Joshua Warner (Dec 24 2022 at 18:48):

Oh :sweat_smile: I guess I'm behind the times!

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:49):

I haven't announced it yet bc it's not totally complete yet - there's no documentation for it, and also transitive dependencies don't work yet (e.g. packages that depend on other packages)

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:50):

but we have a test in the test suite that makes use of the basic functionality now!

view this post on Zulip Richard Feldman (Dec 24 2022 at 18:51):

I think for this use case it should already work, since it should only need to depend on builtins

view this post on Zulip Anton (Dec 26 2022 at 10:50):

> yeah, my concern with offering Str.toAsciiUpper as the only one in the stdlib is that a ton of people will reach for it when they shouldn't, because it's convenient, and then Turkish users (among others) will be sad

On the other hand, if we include it in the stdlib we can include warnings in the docs and perhaps on autocompletion in the editor. I think if it wasn't in the stdlib, users would often write their own toAsciiUpper without being educated about the potential shortcomings.

view this post on Zulip Anton (Dec 26 2022 at 11:01):

We could also call it something like toUpperDangerous

view this post on Zulip Kevin Gillette (Dec 26 2022 at 16:13):

We could solve that in documentation by explaining that toUpper does the right thing for ascii as well

view this post on Zulip Kevin Gillette (Dec 26 2022 at 16:17):

We could also have a search function in documentation that supports synonyms or related functionality: you can search for plausible functions which don't actually exist, and it'll show you the closest function to what you want (or an explanation about why it doesn't exist in the stdlib)... A bit like the style used for compiler messages, but you get the assistance even when the compiler isn't involved.

view this post on Zulip Brendan Hansknecht (Dec 26 2022 at 16:44):

> I think if it wasn't in the stdlib, users would often write their own toAsciiUpper without being educated about the potential shortcomings.

Personally, I think this is totally fine. If the use case does not matter to the user, it doesn't matter. They can write/import whatever library that solves the problem. It may be naive, but most users really only need the naive solution.

Sure, a user may not currently know the shortcomings, but they would learn when they run into them or if they work on internationalization for their company. I think a locale library and a lot of userland experimentation make way more sense than adding something like this to the standard library.

view this post on Zulip Richard Feldman (Dec 26 2022 at 17:16):

yeah also I think a vanishingly small percentage of people would actually read the documentation for a function like Str.toUpperCase - they'd probably just see the name and reach for it right away.

As evidence for this, there's an easter egg in the docs for Elm's function like this, and I've met very few Elm programmers who know about it:

https://package.elm-lang.org/packages/elm/core/latest/String#toUpper

view this post on Zulip Anton (Dec 26 2022 at 17:42):

> They can write/import whatever library that solves the problem.

Asking the user to write something manually or search for a library for such a common case does not feel delightful.

view this post on Zulip Anton (Dec 26 2022 at 17:42):

> yeah also I think a vanishingly small percentage of people would actually read the documentation for a function like Str.toUpperCase - they'd probably just see the name and reach for it right away

I'd bet the percentage would be a lot better for toUpperDangerous :)

view this post on Zulip Brendan Hansknecht (Dec 26 2022 at 18:05):

> Asking the user to write something manually or search for a library for such a common case does not feel delightful.

Fair, I guess. Though it is a super simple function to write in the common case. Even in the single-language Turkish case, which is more complex, it should just be a small when expression. So I really don't think it hurts until the complex i18n cases. I don't think we should add the footguns of ignoring i18n to the language; instead, let libraries decide that individually. If we add anything, I think it should be much later, when we have a full design around locales.
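
As a rough sketch of how small the single-language Turkish case can be, here it is in Python rather than as a Roc `when` expression. The names `TURKISH_UPPER` and `to_upper_turkish` are hypothetical, and the sketch assumes only the dotted/dotless i pair needs special handling, with everything else falling back to the default mapping.

```python
# Hypothetical helper for illustration; not part of any library in the thread.
# Turkish pairs dotted i with dotted İ (U+0130) and dotless ı (U+0131) with I.
TURKISH_UPPER = {"i": "\u0130", "\u0131": "I"}


def to_upper_turkish(text: str) -> str:
    """Uppercase text using the Turkish i rules, default rules otherwise."""
    return "".join(TURKISH_UPPER.get(ch, ch.upper()) for ch in text)


# The locale-independent default turns "i" into a dotless capital "I"
default_result = "istanbul".upper()

# The Turkish-aware version keeps the dot
turkish_result = to_upper_turkish("istanbul")
```

`default_result` is `"ISTANBUL"`, while `turkish_result` is `"İSTANBUL"`. The difference is a single table entry, but choosing which table to apply is exactly the locale decision the thread is debating.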


Last updated: Jun 16 2026 at 16:19 UTC