In response to https://roc.zulipchat.com/#narrow/stream/358903-Advent-of-Code/topic/2023.20Day.201/near/405350559
There is a function Str.startsWithScalar but no Str.endsWithScalar. Is this on purpose?
Probably not. I think there are some deliberate holes where @Richard Feldman has particular API ideas (for example unicode) but frequently if a builtin is missing it's because no one has implemented it yet. If you'd like to add it I'd ask in #contributing first, but the code change is probably pretty straightforward.
Is there an easy way to slice a string? For example myStr[1:]? (Drop the first element in a memory efficient way)
My solution was [convert to graphemes list, drop first, join to string]
I think this is a place where the Str builtins will need some time to mature due to trickiness around unicode. Just from your question, myStr[1:] would be dropping a U8, while in your code example you're dropping a grapheme. Eventually there should be memory efficient ways to handle this. I saw in @Luke Boswell's solution that he split the string into U8s up front and iterated over that, which is a different way to avoid the extra joinWiths.
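To make the byte-versus-character distinction concrete, here is a minimal sketch in Python rather than Roc (the helper names are made up for illustration). It assumes nothing about Roc's internals. It only shows that dropping the first UTF-8 byte and dropping the first scalar agree on ASCII input but diverge as soon as the first character takes more than one byte.

```python
def drop_first_byte(s: str) -> bytes:
    """Drop the first UTF-8 code unit (what a byte-level myStr[1:] would do)."""
    return s.encode("utf-8")[1:]


def drop_first_scalar(s: str) -> str:
    """Drop the first Unicode scalar (Python indexes str by code point)."""
    return s[1:]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ASCII input, as in typical AoC puzzles: both approaches agree.
ascii_line = "1abc2"
print(drop_first_byte(ascii_line).decode("utf-8") == drop_first_scalar(ascii_line))

# "\u00e9" (e with acute accent) is two bytes in UTF-8, so dropping one byte
# leaves a stray continuation byte and the remainder is no longer valid UTF-8.
accented = "\u00e9a"
print(is_valid_utf8(drop_first_byte(accented)))
print(drop_first_scalar(accented))
```

For pure ASCII input the byte-level approach is safe, which is why it works well for AoC.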
Is Str.joinWith smart enough to see that no allocation is needed?
That's a really good question. I don't know enough about Roc's string representation to know if there's an opportunity for optimization in that case. If you want to dig into it I'd maybe start here and here
Str.joinWith will not allocate if the generated string is a small string. Otherwise, it will just do a single large allocation and copy over the data. I guess we could technically add a special case for Str.joinWith str "".
Is there a logical difference between a grapheme and a Scalar? Or is it just the representation as Str vs U32?
I did not use toUtf8 on purpose, since it would drop the first byte, and not the first Scalar/Grapheme. But there is no function Str.fromScalars.
I would love to make a contribution. I will see, if I am able to do it.
I did not use toUtf8 on purpose, since it would drop the first byte, and not the first Scalar/Grapheme
In general I totally get it, but for AoC the input is basically always ascii, right?
If I understand correctly (and I could totally be off), a grapheme is a group of scalars. Often it is just one, but it can be more (as is common with emoji)
yeah unicode handling is complicated enough that for several reasons I actually want to remove it from the builtins and into https://github.com/roc-lang/unicode, but that's got a ways to go before it's production-ready :big_smile:
For AoC in my experience the input has always been ASCII, so toUtf8 works great.
I have a PR in unicode that needs some love, I haven't forgotten about it, just deprioritised for AoC stuff last month or so. We should have a reasonable text segmentation implementation for breaking into extended grapheme clusters, but there's still a fair bit to do.
Brendan Hansknecht said:
If I understand correctly (and I could totally be off), a grapheme is a group of scalars. Often it is just one, but it can be more (as is common with emoji)
If I understand the docs correctly, then a group of graphemes is called a glyph.
https://www.roc-lang.org/builtins/Str#graphemes
So a Grapheme and a Scalar seems to be the same
Yeah, a grapheme might need multiple U8s to be utf8 encoded, but all graphemes can be encoded as a single U32. That doesn't mean all U32s are valid graphemes
without going down the Unicode rabbit hole too much, a scalar is a single integer, whereas a grapheme is one or more integers (and there's no upper limit on how many there can be in a single grapheme), even though both scalars and graphemes render as what we perceive to be an individual "character"
I think glyph would be a nice word for this except glyph already has a different (but related) meaning, and it seems like it would be confusing to reuse that term for this :sweat_smile:
Ah beans, I got codeunit and grapheme mixed up, didn't I?
potentially! Code units are also integers (every scalar is a code unit, but not every code unit is a scalar)
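As a rough illustration of how these counts differ, the sketch below (Python, not Roc, with a hypothetical helper name) counts UTF-8 code units, UTF-16 code units, and scalars for a few strings. It relies on Python's str being indexed by code point, which matches the scalar count for ordinary text without lone surrogates.

```python
def unit_counts(s: str) -> dict:
    """Count UTF-8 code units, UTF-16 code units, and Unicode scalars in s."""
    return {
        "utf8_code_units": len(s.encode("utf-8")),
        # Each UTF-16 code unit is two bytes; "-le" avoids a byte order mark.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        "scalars": len(s),
    }


samples = {
    "ascii letter": "a",
    "e with acute (precomposed)": "\u00e9",
    "e + combining acute": "e\u0301",
    "woman emoji": "\U0001F469",
}

for label, text in samples.items():
    print(label, unit_counts(text))
```

The two spellings of the accented e look the same when rendered, yet one is a single scalar and the other is two, which is the kind of case where a grapheme spans more than one scalar.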
Richard Feldman said:
without going down the Unicode rabbit hole too much, a scalar is a single integer, whereas a grapheme is one or more integers (and there's no upper limit on how many there can be in a single grapheme), even though both scalars and graphemes render as what we perceive to be an individual "character"
Ok. I expected, that s |> Str.toScalars |> List.len == s |> Str.graphemes |> List.len
But this is not true:
expect
    glyph = "👩‍👩‍👦‍👦"
    scalarCount = glyph |> Str.toScalars |> List.len
    graphemesCount = glyph |> Str.graphemes |> List.len
    scalarCount == graphemesCount
roc test test.roc
glyph : Str
glyph = "👩‍👩‍👦‍👦"
scalarCount : Nat
scalarCount = 7
graphemesCount : Nat
graphemesCount = 4
1 failed and 0 passed in 845 ms.
My conclusion is, that strings are complicated :face_with_peeking_eye:
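The scalar count of 7 can be reproduced outside Roc. The Python sketch below builds the same family emoji from its four person emoji and three zero-width joiners, then groups scalars with a deliberately simplified rule (attach zero-width joiners, whatever follows them, and combining marks to the previous cluster). The function approx_clusters is a hypothetical helper and only an approximation. It is not the Unicode text segmentation algorithm and not what Str.graphemes does.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner, glues emoji into one visible symbol

# woman, ZWJ, woman, ZWJ, boy, ZWJ, boy
family = "\U0001F469\u200d\U0001F469\u200d\U0001F466\u200d\U0001F466"


def approx_clusters(s: str) -> list:
    """Group scalars into rough 'perceived characters' (simplified rule)."""
    clusters = []
    join_next = False  # True right after a ZWJ: the next scalar stays attached
    for ch in s:
        attach = join_next or ch == ZWJ or unicodedata.combining(ch) != 0
        if clusters and attach:
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = ch == ZWJ
    return clusters


print(len(family))                  # number of scalars
print(len(family.encode("utf-8")))  # number of UTF-8 bytes
print(len(approx_clusters(family))) # clusters under the simplified rule
```

Under this simplified rule the whole family is a single cluster, while it is 7 scalars and 25 UTF-8 bytes, so "the first unit of the string" depends entirely on which unit is meant.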
yeah this is part of the reason I want to put it in a separate package haha
with lots of documentation!
in general, my feeling is that:
Str module for those
So for AoC, I just use Str.toUtf8 and ignore the real world.
For the real world, there is no clear answer to the question "give me the first 'unit' of a string". Even Str.toScalars |> List.first or Str.graphemes |> List.first are not correct for a string that starts with :family_woman_woman_boy_boy:.
I can work with that for the next 24 days :octopus:
Last updated: Jul 05 2025 at 12:14 UTC