Str: upper, lower, and scalars · ideas

Joshua Warner (Dec 24 2022 at 17:58):

I just implemented functions to convert between lower_case_names and UpperCaseNames, and ran into a few obstacles that I thought might be worth talking about.

For my first attempt at the upperName conversion (the forward direction, in the order I've written them), I tried to split on '_', convert each element to title case, and then join again. It turns out there's no way to convert to title case. Nor is there a way to split off the first scalar, convert it to uppercase, and concat it back.

For the lowerName direction, I was originally going to find anything matching /[A-Z]/, insert underscores at each location, then lowercase the whole thing. However, there don't seem to be any functions to locate a pattern in a string (regex or not).
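
For comparison, here is roughly what those two first attempts look like in a language that does have the pieces. This is a Python sketch, not Roc; the names `upper_name` and `lower_name` are made up for illustration, and it assumes ASCII-only identifiers.

```python
import re


def upper_name(name: str) -> str:
    """lower_case_name -> UpperCaseName: split on '_', capitalize each part, join."""
    # Uppercase only the first character of each part and leave the rest as-is.
    return "".join(part[:1].upper() + part[1:] for part in name.split("_"))


def lower_name(name: str) -> str:
    """UpperCaseName -> lower_case_name: underscore before each inner capital, then lowercase."""
    # (?<!^) skips the start of the string so no leading underscore is produced.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()
```

Both rely on exactly the operations that were missing: a split/join pair, a way to uppercase a single leading character, and pattern matching.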

underscoreScalar = 95 # Yes, I now know I could have just used '_' literals
aLowerScalar = 97
zLowerScalar = 122
aUpperScalar = 65
zUpperScalar = 90

# map from a lower_case_name to a UpperCaseName
upperName : Str -> Str
upperName = \name ->
    result = Str.walkScalars name {text: "", needUpper: Bool.true} \{text, needUpper}, c ->
        if c == underscoreScalar then
            {text, needUpper: Bool.true}
        else
            newText =
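                # Only ASCII lowercase letters are shifted; anything else passes through unchanged.
                if needUpper && c >= aLowerScalar && c <= zLowerScalar then
                    Str.appendScalar text (c - aLowerScalar + aUpperScalar) |> orCrash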
                else
                    Str.appendScalar text c |> orCrash
            {text: newText, needUpper: Bool.false}
    result.text

expect (upperName "hello_world") == "HelloWorld"

orCrash : Result a e -> a
orCrash = \result ->
    when result is
        Ok a -> a
        Err _ -> crash "orCrash"

lowerName : Str -> Str
lowerName = \name ->
    result = Str.walkScalars name {text: "", needUnder: Bool.false} \{text, needUnder}, c ->
        newText =
            if c >= aUpperScalar && c <= zUpperScalar then
                if needUnder then
                    text
                        |> Str.appendScalar underscoreScalar
                        |> orCrash
                        |> Str.appendScalar (c - aUpperScalar + aLowerScalar)
                        |> orCrash
                else
                    text
                        |> Str.appendScalar (c - aUpperScalar + aLowerScalar)
                        |> orCrash
            else
                Str.appendScalar text c |> orCrash

        {text: newText, needUnder: Bool.true}

    result.text

expect
    theResult = (lowerName "HelloWorld")
    theResult == "hello_world"

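# The same two functions, assuming hypothetical Scalar.toAsciiUpper / Scalar.toAsciiLower
# helpers, '_' scalar literals, and a Str.appendScalar that cannot fail.
upperName : Str -> Str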
upperName = \name ->
    result = Str.walkScalars name {text: "", needUpper: Bool.true} \{text, needUpper}, c ->
        if c == '_' then
            {text, needUpper: Bool.true}
        else
            newText =
                if needUpper then
                    Str.appendScalar text (Scalar.toAsciiUpper c)
                else
                    Str.appendScalar text c
            {text: newText, needUpper: Bool.false}
    result.text

lowerName : Str -> Str
lowerName = \name ->
    result = Str.walkScalars name {text: "", needUnder: Bool.false} \{text, needUnder}, c ->
        newText =
            if needUnder && c >= 'A' && c <= 'Z' then
                text
                    |> Str.appendScalar '_'
                    |> Str.appendScalar (Scalar.toAsciiLower c)
            else
                text |> Str.appendScalar (Scalar.toAsciiLower c)

        {text: newText, needUnder: Bool.true}

    result.text

Richard Feldman (Dec 24 2022 at 18:02):

uppercase/lowercase and scalars both have nontrivial design considerations, so I think it'll be helpful to have different threads about them!

Joshua Warner (Dec 24 2022 at 18:08):

Joshua Warner (Dec 24 2022 at 18:09):

I suppose we could have a Scalar module without a Scalar type, or just throw these in the Str module as Str.scalarToAsciiUpper or something

Richard Feldman (Dec 24 2022 at 18:21):

Richard Feldman (Dec 24 2022 at 18:23):

of course, one more consideration there is the use case you mentioned here: if you're just trying to convert some programmatic text from snake_case to camelCase or PascalCase, it's probably all ASCII anyway and so getting locales involved wouldn't matter

Richard Feldman (Dec 24 2022 at 18:25):

scalars are another rabbit hole; basically the more I've learned about Unicode, the more strongly I've become convinced that they should not get special treatment in stdlibs (which is why they don't in Roc), because I think the vast majority of the time they look like the right thing to reach for, they're actually the wrong thing to reach for

Richard Feldman (Dec 24 2022 at 18:25):

Richard Feldman (Dec 24 2022 at 18:26):

Richard Feldman (Dec 24 2022 at 18:27):

there is basically no hope that anyone doing string processing in terms of Unicode scalars will implement the correct Unicode semantics for the cases mentioned in that issue

Richard Feldman (Dec 24 2022 at 18:27):

so their best hope is that those cases don't come up (e.g. certain modifiers, emojis, etc.)

Richard Feldman (Dec 24 2022 at 18:28):

whereas if you're doing string processing in terms of graphemes (with each grapheme represented as a Str) we can actually offer correct semantics by default in the stdlib! (such as in that issue)
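
A small Python illustration of the gap between scalars and graphemes. Python's `str` is indexed by code point, which is close enough to scalars for this purpose. The strings are illustrative examples chosen here, not the cases from the issue mentioned above.

```python
import unicodedata

# "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT:
# one grapheme on screen, two scalars in the string.
decomposed = "e\u0301"
# The same grapheme as a single precomposed scalar, U+00E9.
precomposed = unicodedata.normalize("NFC", decomposed)

scalar_counts = (len(decomposed), len(precomposed))

# Reversing scalar by scalar detaches the accent from its base letter:
# the result begins with the bare combining accent.
word = "cafe\u0301"
reversed_by_scalar = word[::-1]

# Thumbs up followed by a skin tone modifier: one visible emoji, two scalars.
thumbs_up = "\U0001F44D\U0001F3FD"
```

Code that walks scalars has to know about every such case itself, which is the point being made about graphemes being the safer default unit.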

Richard Feldman (Dec 24 2022 at 18:32):

Richard Feldman (Dec 24 2022 at 18:33):

but I want to create a Pit of Success around it if possible, instead of the footguns most stdlibs have!

Joshua Warner (Dec 24 2022 at 18:36):

FWIW this is exactly why I was calling it toAsciiUpper, not toUpper - with the intention that it very clearly only handles ascii.
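
To make the "only handles ascii" intent concrete, here is a Python sketch of what an ASCII-only uppercase does. The name `to_ascii_upper` mirrors the proposal but the body is just one assumed way to write it, not anything from the Roc builtins.

```python
def to_ascii_upper(text: str) -> str:
    """Uppercase only the letters a-z; every other character is returned unchanged."""
    return "".join(
        # 'a'..'z' and 'A'..'Z' are exactly 32 code points apart in ASCII.
        chr(ord(ch) - 32) if "a" <= ch <= "z" else ch
        for ch in text
    )
```

With this definition `to_ascii_upper("straße")` leaves the ß alone, whereas a Unicode-aware uppercase would turn it into "SS".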

Joshua Warner (Dec 24 2022 at 18:37):

At work a few years ago we had a meeting room named "turkish i", because of exactly the problem you describe.

Richard Feldman (Dec 24 2022 at 18:39):

yeah, my concern with offering Str.toAsciiUpper as the only one in the stdlib is that a ton of people will reach for it when they shouldn't, because it's convenient, and then Turkish users (among others) will be sad

Kevin Gillette (Dec 24 2022 at 18:41):

Under what circumstances would an env var observably change during the running of a process? Windows perhaps? That cannot happen on unixy systems.

Certainly, if there are other mechanisms for locale specification, then those could change at runtime. But the need to react to such changes is perhaps niche (long-running processes or interactive GUI/TUI apps), and it's likely quite an anti-feature for non-interactive CLI apps and batch processing, in which determinism and speed are much more important

Richard Feldman (Dec 24 2022 at 18:41):

like for example, suppose we had a Locale.toUpper; if you know your text doesn't need to handle anything but ASCII, you just always pass it the same hardcoded locale
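
A rough Python sketch of that shape of API, where the locale is an explicit argument instead of something read from the environment. `to_upper` and its locale tags are hypothetical, and only the Turkish/Azerbaijani dotted and dotless i are special-cased; a real locale library would cover far more.

```python
def to_upper(text: str, locale: str) -> str:
    """Uppercase `text` using the rules of an explicitly passed locale tag.

    Hypothetical API: everything except the Turkic i handling falls back
    to Python's locale-independent str.upper().
    """
    if locale in ("tr", "az"):
        # i -> İ (U+0130) and ı (U+0131) -> I, applied before the generic mapping.
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()
```

Code that only ever sees ASCII identifiers would call it with one hardcoded tag, e.g. `to_upper(name, "en")`, and get the same answer on every machine.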

Kevin Gillette (Dec 24 2022 at 18:42):

I would favor at least Unicode case changing as the basic option, since it does the right thing for ascii (from an ascii interpretation)

Joshua Warner (Dec 24 2022 at 18:44):

That makes code that only wants to deal with ASCII needlessly more complicated, since (for example) upper/lower case can no longer operate scalar by scalar: uppercasing the German ß gives "SS" (double S), which is two scalars.
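
This is easy to see in Python, whose `str.upper` applies Unicode's full case mappings, so the result can be longer than the input. That is why a scalar-to-scalar function cannot express Unicode uppercasing in general.

```python
# One scalar in, two scalars out.
upper_eszett = "ß".upper()

# The "fi" ligature (U+FB01) also expands to two letters.
upper_ligature = "\ufb01".upper()

# The string as a whole grows by one scalar.
lengths = (len("straße"), len("straße".upper()))
```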

Richard Feldman (Dec 24 2022 at 18:44):

@Kevin Gillette here's an easier way to see why that path won't work: if we're compiling to wasm which might run on any of:

...which wasm bytecode instructions should we emit to determine the locale in the builtins?

Kevin Gillette (Dec 24 2022 at 18:44):

locale handling is exceptionally hard, has been done poorly many, many times, and so when we tackle that, it should be considered a major initiative involving many perspectives, probably closer to a Roc 1.0 release, not something we just have one person roll up their sleeves to "solve"

Richard Feldman (Dec 24 2022 at 18:45):

Richard Feldman (Dec 24 2022 at 18:46):

@Joshua Warner there might actually be a simple solution here: what if we just intentionally didn't put a toAsciiUppercase in the stdlib, and instead left it for a third party package to provide?

Richard Feldman (Dec 24 2022 at 18:47):

that way the path of least resistance is not to reach for it inappropriately because it's in the stdlib

Joshua Warner (Dec 24 2022 at 18:47):

Richard Feldman (Dec 24 2022 at 18:47):

Joshua Warner (Dec 24 2022 at 18:48):

Richard Feldman (Dec 24 2022 at 18:49):

I haven't announced it yet bc it's not totally complete yet - there's no documentation for it, and also transitive dependencies don't work yet (e.g. packages that depend on other packages)

Richard Feldman (Dec 24 2022 at 18:50):

but we have a test in the test suite that makes use of the basic functionality now!

Richard Feldman (Dec 24 2022 at 18:51):

I think for this use case it should already work, since it should only need to depend on builtins

Anton (Dec 26 2022 at 10:50):

On the other hand, if we include it in the stdlib we can include warnings in the docs and perhaps on autocompletion in the editor. I think if it wasn't in the stdlib, users would often write their own toAsciiUpper without being educated about the potential shortcomings.

Anton (Dec 26 2022 at 11:01):

Kevin Gillette (Dec 26 2022 at 16:13):

We could solve that in documentation by explaining that toUpper does the right thing for ascii as well

Kevin Gillette (Dec 26 2022 at 16:17):

We could also have a search function in documentation that supports synonyms or related functionality: you can search for plausible functions which don't actually exist, and it'll show you the closest function to what you want (or an explanation about why it doesn't exist in the stdlib)... A bit like the style used for compiler messages, but you get the assistance even when the compiler isn't involved.
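
One way such a search could work, sketched in Python with the standard `difflib` module. The index contents below are placeholders invented for illustration, not real documentation entries, and `search` is a hypothetical name.

```python
import difflib

# Hypothetical docs index: names that exist, with placeholder summaries.
EXISTING = {
    "Str.walkScalars": "Walk over the scalars in a string.",
    "Str.appendScalar": "Append a scalar to a string.",
}
# Plausible names that are deliberately absent, with the note to show instead.
ABSENT = {
    "Str.toUpper": "Not provided: case conversion depends on locale.",
    "Str.toAsciiUpper": "Not provided: see a third-party package.",
}


def search(query: str) -> list:
    """Return up to three closest names, existing or deliberately absent, with their notes."""
    names = list(ABSENT) + list(EXISTING)
    results = []
    # get_close_matches returns the best-scoring candidates first.
    for name in difflib.get_close_matches(query, names, n=3, cutoff=0.5):
        note = ABSENT.get(name) or EXISTING[name]
        results.append(f"{name}: {note}")
    return results
```

Searching for `Str.toUpper` would then surface the explanation for why it is missing, rather than an empty result page.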

Brendan Hansknecht (Dec 26 2022 at 16:44):

Personally, I think this is totally fine. If the use case does not matter to the user, it doesn't matter. They can write/import whatever library that solves the problem. It may be naive, but most users really only need the naive solution.

Sure, a user may not currently know the shortcomings, but they would learn when they run into it or if they work on internationalization for their company. I think a locale library and a lot of userland experimentation make way more sense than adding something like this to the standard library.

Richard Feldman (Dec 26 2022 at 17:16):

yeah also I think a vanishingly small percentage of people would actually read the documentation for a function like Str.toUpperCase - they'd probably just see the name and reach for it right away.

As evidence for this, there's an easter egg in the docs for Elm's function like this, and I've met very few Elm programmers who know about it:

Anton (Dec 26 2022 at 17:42):

Asking the user to write something manually or search for a library for such a common case does not feel delightful.

Anton (Dec 26 2022 at 17:42):

Brendan Hansknecht (Dec 26 2022 at 18:05):

Fair, I guess. Though it is a super simple function to write in the common case. Even in the single-language Turkish case, which is more complex, it should just be a small when expression. So I really don't think it hurts until the complex i18n cases. I don't think we should add the footguns of ignoring i18n to the language; instead, let libraries decide that individually. If we add anything, I think it should be much later, when we have a full design around locales.
