I'm considering developing functions to iterate through code points/graphemes for the unicode package. I decided to ask here because I am not experienced in contributing to open-source projects and don't know if it would be impolite to simply send unsolicited merge requests. Would this functionality be valuable, and is this the appropriate way to propose it?
I am thinking of something like
CodePoint.walkUtf8 : List U8, state, (state, CodePoint -> state) -> Result state Utf8ParseErr
This function is almost already there in the code. I would simply generalize and expose it.
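To make the shape of that signature concrete, here is a minimal sketch of the same fold in Python rather than Roc. It is only an illustration under stated assumptions, not the package's actual code: the names `walk_utf8` and `Utf8ParseError` are hypothetical, and an exception stands in for the `Result ... Utf8ParseErr` return value.

```python
from typing import Callable, TypeVar

S = TypeVar("S")


class Utf8ParseError(Exception):
    """Hypothetical stand-in for the Utf8ParseErr in the proposed signature."""


def walk_utf8(data: bytes, state: S, step: Callable[[S, int], S]) -> S:
    """Fold `step` over the code points encoded in `data`, one at a time.

    No intermediate list of code points is built; each decoded code point
    is handed to `step` right away.
    """
    i = 0
    n = len(data)
    while i < n:
        first = data[i]
        # The lead byte determines the sequence length and the initial payload bits.
        if first < 0x80:
            size, cp = 1, first
        elif 0xC2 <= first <= 0xDF:
            size, cp = 2, first & 0x1F
        elif 0xE0 <= first <= 0xEF:
            size, cp = 3, first & 0x0F
        elif 0xF0 <= first <= 0xF4:
            size, cp = 4, first & 0x07
        else:
            raise Utf8ParseError(f"invalid lead byte at index {i}")
        if i + size > n:
            raise Utf8ParseError(f"truncated sequence at index {i}")
        # Each continuation byte must look like 0b10xxxxxx and adds 6 bits.
        for b in data[i + 1 : i + size]:
            if b & 0xC0 != 0x80:
                raise Utf8ParseError(f"invalid continuation byte after index {i}")
            cp = (cp << 6) | (b & 0x3F)
        # Reject overlong encodings, surrogates, and values past U+10FFFF.
        min_cp = (0, 0x80, 0x800, 0x10000)[size - 1]
        if cp < min_cp or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise Utf8ParseError(f"invalid code point at index {i}")
        state = step(state, cp)
        i += size
    return state
```

A caller could, for example, count code points with `walk_utf8(data, 0, lambda n, _cp: n + 1)` without allocating anything per code point.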
I am also interested in implementing something similar to the grapheme module.
You can find an overview of the process for Roc ideas, proposals, and feature requests here. Basically:
In the idea stage, people are encouraged to describe their idea and explore the problem, potential solutions, and tradeoffs. It's a good idea to share the idea in #ideas on Zulip.
So you're definitely in the right place :smiley:
Thank you for your work on the unicode package adding the visual width property. I think that will be very useful.
Regarding your proposal
It sounds very similar to the builtin Str.walkUtf8. I think it would probably be quite useful. What use cases are you thinking of?
Would the name CodePoint.walk be more suitable, considering we aren't walking UTF-8 encoded bytes but instead the Unicode code points?
Another idea: what about walking the graphemes? Grapheme.walk seems to be something people look for when they think of "characters".
I am also interested in implementing something similar to the grapheme module.
What are you thinking?
The problem with Str.walkUtf8 is that it walks the bytes of the UTF-8 representation, while I believe most people would be interested in iterating through graphemes. Currently this would be achieved by first splitting the bytes into graphemes, but that (as far as I understand) allocates an extra array of strings, which may be unnecessary if the user is doing some kind of single-pass processing. The same could be said about code points.
To clarify, I think there is a valid use case for having all three variants.
I'm not proposing we need to change the Str builtin. Just using that as an example for a similar API.
I haven't looked into how to implement it for graphemes yet, as I currently don't fully understand how code points are converted into graphemes. For the CodePoint module, though, I would say it would even make the code more readable, as the current parseUtf8 function could be implemented in terms of walk instead of an internal parseHelper.
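The "parse in terms of walk" idea can be sketched as follows, again in Python rather than Roc and purely as an illustration. The names `walk` and `parse_utf8` are hypothetical, and for brevity the decoding leans on Python's standard incremental UTF-8 decoder instead of a hand-written helper, so errors surface as `UnicodeDecodeError` rather than a `Result`.

```python
import codecs
from typing import Callable, List, TypeVar

S = TypeVar("S")


def walk(data: bytes, state: S, step: Callable[[S, int], S]) -> S:
    """Fold `step` over the code points in `data`, feeding one byte at a time."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for i in range(len(data)):
        # The decoder returns "" until a full code point has been seen.
        for ch in decoder.decode(data[i : i + 1]):
            state = step(state, ord(ch))
    # Flush: raises UnicodeDecodeError if the input ended mid-sequence.
    decoder.decode(b"", final=True)
    return state


def parse_utf8(data: bytes) -> List[int]:
    """Collect all code points; just `walk` with a list-appending step."""

    def append(acc: List[int], cp: int) -> List[int]:
        acc.append(cp)
        return acc

    return walk(data, [], append)
```

With this shape, the list-building parser becomes one particular caller of the walk, and other single-pass consumers can reuse the same traversal without building the list.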
Luke Boswell said:
I'm not proposing we need to change the Str builtin. Just using that as an example for a similar API.
I think I was not clear. I agree the Str API is fine and is important, but these functions would solve different problems.
I am not advocating for any change in the Str API :sweat_smile:
Last updated: Jun 16 2026 at 16:19 UTC