string manipulation questions · beginners

Elias Mulhall (Dec 01 2023 at 16:14):

In response to https://roc.zulipchat.com/#narrow/stream/358903-Advent-of-Code/topic/2023.20Day.201/near/405350559

There is a function Str.startsWithScalar but no Str.endsWithScalar. Is this on purpose?

Probably not. I think there are some deliberate holes where @Richard Feldman has particular API ideas (for example unicode) but frequently if a builtin is missing it's because no one has implemented it yet. If you'd like to add it I'd ask in #contributing first, but the code change is probably pretty straightforward.

Is there an easy way to slice a string? For example myStr[1:]? (Drop the first element in a memory efficient way)
My solution was [convert to graphemes list, drop first, join to string]

I think this is a place where the Str builtins will need some time to mature due to trickiness around unicode. Taken literally, myStr[1:] would drop a U8, while in your code example you're dropping a grapheme. Eventually there should be memory-efficient ways to handle this. I saw in @Luke Boswell's solution that he split the string into U8s up front and iterated over that, which is a different way to avoid the extra joinWiths.

Is Str.joinWith smart enough to see that no allocation is needed?

That's a really good question. I don't know enough about Roc's string representation to know if there's an opportunity for optimization in that case. If you want to dig into it, I'd maybe start here and here

Brendan Hansknecht (Dec 01 2023 at 16:42):

Str.joinWith will not allocate if the generated string is small. Otherwise, it will just do a single large allocation and copy over the data. I guess we could technically add a special case for Str.joinWith str "".

Oskar Hahn (Dec 01 2023 at 16:44):

Is there a logical difference between a grapheme and a Scalar? Or is it just the representation as Str vs U32?

I did not use toUtf8 on purpose, since it would drop the first byte, and not the first Scalar/Grapheme. But there is no function Str.fromScalars.

I would love to make a contribution. I will see if I am able to do it.

Elias Mulhall (Dec 01 2023 at 16:49):

I did not use toUtf8 on purpose, since it would drop the first byte, and not the first Scalar/Grapheme

In general I totally get it, but for AoC the input is basically always ascii, right?

Brendan Hansknecht (Dec 01 2023 at 17:00):

If I understand correctly (though I could totally be off), a grapheme is a group of scalars. Often it is just one, but it can be more (as is common with emoji)

Richard Feldman (Dec 01 2023 at 17:15):

yeah unicode handling is complicated enough that for several reasons I actually want to move it out of the builtins and into https://github.com/roc-lang/unicode, but that's got a ways to go before it's production-ready :big_smile:

Luke Boswell (Dec 01 2023 at 17:15):

For AoC, in my experience the input has always been ASCII, so toUtf8 works great.
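
To illustrate why byte-level slicing is safe for ASCII input but not in general, here is a minimal sketch. It is written in Python rather than Roc, since the Roc Unicode builtins discussed in this thread were still in flux; `drop_first_byte` is a hypothetical helper name, not a Roc or Python API.

```python
def drop_first_byte(s: str) -> bytes:
    """Encode to UTF-8 and drop the first byte (like slicing a list of U8s)."""
    return s.encode("utf-8")[1:]

# ASCII: every character is exactly one byte, so the rest is still valid text.
ascii_text = drop_first_byte("two1nine").decode("utf-8")

# Non-ASCII: "é" (U+00E9) takes two bytes in UTF-8, so dropping one byte
# leaves a dangling continuation byte and the result no longer decodes.
try:
    drop_first_byte("été").decode("utf-8")
    non_ascii_ok = True
except UnicodeDecodeError:
    non_ascii_ok = False

print(ascii_text)    # the ASCII input minus its first character
print(non_ascii_ok)  # whether the non-ASCII remainder was still valid UTF-8
```

For ASCII-only puzzle input the byte approach and the "drop the first character" approach agree, which is why it works for AoC.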

Luke Boswell (Dec 01 2023 at 17:18):

I have a PR in unicode that needs some love, I haven't forgotten about it, just deprioritised for AoC stuff last month or so. We should have a reasonable text segmentation implementation for breaking into extended grapheme clusters, but there's still a fair bit to do.

Oskar Hahn (Dec 01 2023 at 17:41):

Brendan Hansknecht said:

If I understand correctly (though I could totally be off), a grapheme is a group of scalars. Often it is just one, but it can be more (as is common with emoji)

If I understand the docs correctly, then a group of graphemes is called a glyph.

https://www.roc-lang.org/builtins/Str#graphemes

So a Grapheme and a Scalar seem to be the same

Elias Mulhall (Dec 01 2023 at 17:45):

Yeah, a grapheme might need multiple U8s to be utf8 encoded, but all graphemes can be encoded as a single U32. That doesn't mean all U32s are valid graphemes

Richard Feldman (Dec 01 2023 at 17:48):

without going down the Unicode rabbit hole too much, a scalar is a single integer, whereas a grapheme is one or more integers (and there's no upper limit on how many there can be in a single grapheme), even though both scalars and graphemes render as what we perceive to be an individual "character"

Richard Feldman (Dec 01 2023 at 17:49):

I think glyph would be a nice word for this except glyph already has a different (but related) meaning, and it seems like it would be confusing to reuse that term for this :sweat_smile:

Elias Mulhall (Dec 01 2023 at 17:50):

Ah beans, I got code unit and grapheme mixed up, didn't I?

Richard Feldman (Dec 01 2023 at 17:54):

potentially! Code units are also integers (every scalar is a code unit, but not every code unit is a scalar)
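
A small sketch of the scalar versus code unit distinction, in Python rather than Roc (an assumption made for illustration, since the Roc builtins here were still changing). UTF-8 and UTF-16 are picked as example encodings. Grapheme segmentation is not computed because Python's standard library does not provide it.

```python
# "e" followed by U+0301 COMBINING ACUTE ACCENT: renders as a single "é".
s = "e\u0301"

scalars = [ord(c) for c in s]         # Python strings iterate by code point
utf8_units = list(s.encode("utf-8"))  # UTF-8 code units are single bytes
scalar_count = len(scalars)
utf8_count = len(utf8_units)

# U+1F469 WOMAN is one scalar, but UTF-16 needs two code units for it.
emoji = "\U0001F469"
raw = emoji.encode("utf-16-le")
utf16_units = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

# Each UTF-16 unit here is a surrogate (0xD800..0xDFFF), which is a code unit
# but not a Unicode scalar value on its own.
all_surrogates = all(0xD800 <= u <= 0xDFFF for u in utf16_units)

print(scalar_count, utf8_count, [hex(u) for u in utf16_units], all_surrogates)
```

So one perceived character can be several scalars, and one scalar can be several code units, depending on the encoding.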

Oskar Hahn (Dec 01 2023 at 18:21):

Richard Feldman said:

without going down the Unicode rabbit hole too much, a scalar is a single integer, whereas a grapheme is one or more integers (and there's no upper limit on how many there can be in a single grapheme), even though both scalars and graphemes render as what we perceive to be an individual "character"

Ok. I expected that s |> Str.toScalars |> List.len == s |> Str.graphemes |> List.len. But this is not true:

expect
    glyph = "👩‍👩‍👦‍👦"
    scalarCount = glyph |> Str.toScalars |> List.len
    graphemesCount = glyph |> Str.graphemes |> List.len

    scalarCount == graphemesCount

roc test test.roc
glyph : Str
glyph = "👩‍👩‍👦‍👦"

scalarCount : Nat
scalarCount = 7

graphemesCount : Nat
graphemesCount = 4

1 failed and 0 passed in 845 ms.

My conclusion is that strings are complicated :face_with_peeking_eye:
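
The 7 scalars reported above can be accounted for by listing the code points of the family emoji. This is a sketch in Python rather than Roc (an assumption for illustration), and the string is spelled out with escapes so the sequence is unambiguous.

```python
import unicodedata

# WOMAN + ZWJ + WOMAN + ZWJ + BOY + ZWJ + BOY
glyph = "\U0001F469\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466"

# Look up the official Unicode name of each scalar in the sequence.
names = [unicodedata.name(c) for c in glyph]

scalar_count = len(glyph)               # number of code points
utf8_len = len(glyph.encode("utf-8"))   # number of UTF-8 bytes (U8s)

for name in names:
    print(name)
print(scalar_count, utf8_len)
```

The four people are glued together by three ZERO WIDTH JOINER scalars, so taking only the first scalar (or the first byte) yields a fragment of the emoji, not the emoji as a whole.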

Richard Feldman (Dec 01 2023 at 18:30):

yeah this is part of the reason I want to put it in a separate package haha

Richard Feldman (Dec 01 2023 at 18:30):

with lots of documentation!

Richard Feldman (Dec 01 2023 at 18:33):

in general, my feeling is that:

tons of string use cases are easy to understand and not error prone, and it makes sense to have a Str module for those
as soon as you say "I want to access some subset of a string" (other than specifically splitting a string by another string), we have taken an express train to Unicode Edge Case City and the learning curve required to avoid mistakes skyrockets (but it's not necessarily obvious that this is what has happened!)
I hope that by putting these use cases in a different package with lots of documentation, it makes it easier to realize that this is what's happened, and even though it's innately complicated, you at least have the resources in one place to learn how to do what you need to do

Oskar Hahn (Dec 01 2023 at 18:42):

So for AoC, I just use Str.toUtf8 and ignore the real world.

For the real world, there is no clear answer to the question "give me the first 'unit' of a string". Even Str.toScalars |> List.first or Str.graphemes |> List.first are not correct for a string that starts with :family_woman_woman_boy_boy:.

I can work with that for the next 24 days :octopus:

Last updated: Aug 17 2025 at 12:14 UTC