Stream: beginners

Topic: string manipulation questions


Elias Mulhall (Dec 01 2023 at 16:14):

In response to https://roc.zulipchat.com/#narrow/stream/358903-Advent-of-Code/topic/2023.20Day.201/near/405350559

There is a function Str.startsWithScalar but no Str.endsWithScalar. Is this on purpose?

Probably not. I think there are some deliberate holes where @Richard Feldman has particular API ideas (for example unicode) but frequently if a builtin is missing it's because no one has implemented it yet. If you'd like to add it I'd ask in #contributing first, but the code change is probably pretty straightforward.

Is there an easy way to slice a string? For example myStr[1:]? (Drop the first element in a memory efficient way)
My solution was [convert to graphemes list, drop first, join to string]

I think this is a place where the Str builtins will need some time to mature due to trickiness around unicode. Just from your question, myStr[1:] would be dropping a U8, while in your code example you're dropping a grapheme. Eventually there should be memory efficient ways to handle this. I saw in @Luke Boswell's solution that he split the string into U8s up front and iterated over that, which is a different way to avoid the extra joinWiths.

Is Str.joinWith smart enough to see that no allocation is needed?

That's a really good question. I don't know enough about Roc's string representation to know if there's an opportunity for optimization in that case. If you want to dig into it, I'd maybe start here and here

Brendan Hansknecht (Dec 01 2023 at 16:42):

Str.joinWith will not allocate if the generated string is a small string. Otherwise, it will just do a single large allocation and copy over the data. I guess we could technically add a special case for Str.joinWith str "".

Oskar Hahn (Dec 01 2023 at 16:44):

Is there a logical difference between a grapheme and a Scalar? Or is it just the representation as Str vs U32?

I did not use toUtf8 on purpose, since it would drop the first byte, and not the first Scalar/Grapheme. But there is no function Str.fromScalars.

I would love to make a contribution. I will see if I am able to do it.

Elias Mulhall (Dec 01 2023 at 16:49):

I did not use toUtf8 on purpose, since it would drop the first byte, and not the first Scalar/Grapheme

In general I totally get it, but for AoC the input is basically always ascii, right?
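To make the byte-versus-scalar distinction above concrete, here is a minimal sketch. It is written in Python rather than Roc, because the Roc string builtins under discussion are still changing. The helper names `drop_first_byte` and `drop_first_scalar` are hypothetical, chosen only for illustration. It suggests why dropping one byte is harmless for ASCII input but can split a multi-byte scalar.

```python
def drop_first_byte(s: str) -> bytes:
    """Encode to UTF-8 and drop one byte (the toUtf8-then-drop approach)."""
    return s.encode("utf-8")[1:]


def drop_first_scalar(s: str) -> str:
    """Drop the first code point; Python str indexing works on code points."""
    return s[1:]


# ASCII: one byte per scalar, so both approaches agree.
ascii_rest = drop_first_byte("abc")

# "\u00e9" (e with acute accent) takes two bytes in UTF-8, so dropping
# one byte leaves a stray continuation byte at the front.
broken = drop_first_byte("\u00e9a")
try:
    broken.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

scalar_rest = drop_first_scalar("\u00e9a")
```

For input that is known to be ASCII, as AoC input usually is, the byte-level approach gives the same result as the scalar-level one.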

Brendan Hansknecht (Dec 01 2023 at 17:00):

If I understand correctly (though I totally could be off), a grapheme is a group of scalars. Often it is just one, but it can be more (as is common with emoji).

Richard Feldman (Dec 01 2023 at 17:15):

yeah unicode handling is complicated enough that for several reasons I actually want to remove it from the builtins and into https://github.com/roc-lang/unicode, but that's got a ways to go before it's production-ready :big_smile:

Luke Boswell (Dec 01 2023 at 17:15):

For AoC, in my experience the input has always been ASCII, so toUtf8 works great.

Luke Boswell (Dec 01 2023 at 17:18):

I have a PR in unicode that needs some love, I haven't forgotten about it, just deprioritised for AoC stuff last month or so. We should have a reasonable text segmentation implementation for breaking into extended grapheme clusters, but there's still a fair bit to do.

Oskar Hahn (Dec 01 2023 at 17:41):

Brendan Hansknecht said:

If I understand correctly (though I totally could be off), a grapheme is a group of scalars. Often it is just one, but it can be more (as is common with emoji).

If I understand the docs correctly, then a group of graphemes is called a glyph.

https://www.roc-lang.org/builtins/Str#graphemes

So a Grapheme and a Scalar seem to be the same.

Elias Mulhall (Dec 01 2023 at 17:45):

Yeah, a grapheme might need multiple U8s to be utf8 encoded, but all graphemes can be encoded as a single U32. That doesn't mean all U32s are valid graphemes.

Richard Feldman (Dec 01 2023 at 17:48):

without going down the Unicode rabbit hole too much, a scalar is a single integer, whereas a grapheme is one or more integers (and there's no upper limit on how many there can be in a single grapheme), even though both scalars and graphemes render as what we perceive to be an individual "character"

Richard Feldman (Dec 01 2023 at 17:49):

I think glyph would be a nice word for this except glyph already has a different (but related) meaning, and it seems like it would be confusing to reuse that term for this :sweat_smile:

Elias Mulhall (Dec 01 2023 at 17:50):

Ah beans, I got codeunit and grapheme mixed up, didn't I?

Richard Feldman (Dec 01 2023 at 17:54):

potentially! Code units are also integers (every scalar is a code unit, but not every code unit is a scalar)

Oskar Hahn (Dec 01 2023 at 18:21):

Richard Feldman said:

without going down the Unicode rabbit hole too much, a scalar is a single integer, whereas a grapheme is one or more integers (and there's no upper limit on how many there can be in a single grapheme), even though both scalars and graphemes render as what we perceive to be an individual "character"

Ok. I expected that s |> Str.toScalars |> List.len == s |> Str.graphemes |> List.len. But this is not true:

expect
    glyph = "👩‍👩‍👦‍👦"
    scalarCount = glyph |> Str.toScalars |> List.len
    graphemesCount = glyph |> Str.graphemes |> List.len

    scalarCount == graphemesCount

roc test test.roc

glyph : Str
glyph = "👩‍👩‍👦‍👦"

scalarCount : Nat
scalarCount = 7

graphemesCount : Nat
graphemesCount = 4

1 failed and 0 passed in 845 ms.

My conclusion is that strings are complicated :face_with_peeking_eye:
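The scalar count of 7 in the output above can be reproduced outside Roc. The following is a small sketch in Python (an illustration only, not the Roc implementation) that builds the same family emoji from escapes, so the invisible zero-width joiners (U+200D) are explicit. The variable names are hypothetical.

```python
ZWJ = "\u200d"  # zero-width joiner, an invisible scalar
WOMAN = "\U0001F469"
BOY = "\U0001F466"

# woman + ZWJ + woman + ZWJ + boy + ZWJ + boy
family = ZWJ.join([WOMAN, WOMAN, BOY, BOY])

# Python's len() on a str counts code points (scalars here).
scalar_count = len(family)

# Each emoji takes 4 bytes in UTF-8 and each ZWJ takes 3 bytes.
byte_count = len(family.encode("utf-8"))

# Splitting on the joiner yields the individual emoji components.
parts = family.split(ZWJ)
```

Here `scalar_count` is 7 (four emoji plus three joiners) and `byte_count` is 25. Splitting on the joiner gives 4 parts, which happens to match the `graphemesCount` reported above. Whether that is how the builtin arrived at 4 is an assumption. Under the extended grapheme cluster rules of Unicode Standard Annex #29, a ZWJ sequence like this one forms a single cluster.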

Richard Feldman (Dec 01 2023 at 18:30):

yeah this is part of the reason I want to put it in a separate package haha

Richard Feldman (Dec 01 2023 at 18:30):

with lots of documentation!

Richard Feldman (Dec 01 2023 at 18:33):

in general, my feeling is that:

Oskar Hahn (Dec 01 2023 at 18:42):

So for AoC, I just use Str.toUtf8 and ignore the real world.

For the real world, there is no clear answer to the question "give me the first 'unit' of a string". Even Str.toScalars |> List.first or Str.graphemes |> List.first are not correct for a string that starts with :family_woman_woman_boy_boy:.

I can work with that for the next 24 days :octopus:


Last updated: Jul 05 2025 at 12:14 UTC