Add iteration functions to the unicode package · ideas

Stream: ideas

Topic: Add iteration functions to the unicode package

Dilson Higa (Jun 28 2024 at 03:00):

I'm considering developing functions to iterate through code points/graphemes for the unicode package. I decided to ask here because I am not experienced in contributing to open-source projects and don't know if it would be impolite to simply send unsolicited merge requests. Would this functionality be valuable, and is this the appropriate way to propose it?

I am thinking of something like:
CodePoint.walkUtf8 : List U8, state, (state, CodePoint -> state) -> Result state Utf8ParseErr
This function is almost already there in the code. I would simply generalize and expose it.
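
To make the shape of the proposed signature concrete, here is a minimal sketch of the same idea: fold a step function over code points decoded from UTF-8 bytes, without building an intermediate list. It is written in Python purely as an illustration, not as Roc code from the unicode package. The names `walk_utf8` and `Utf8ParseError` are hypothetical stand-ins mirroring the signature above, and raising an exception stands in for the `Result` error case.

```python
from typing import Callable, TypeVar

S = TypeVar("S")


class Utf8ParseError(ValueError):
    """Hypothetical stand-in for the Utf8ParseErr in the proposed signature."""


def walk_utf8(data: bytes, state: S, step: Callable[[S, int], S]) -> S:
    """Decode `data` as UTF-8 and fold `step` over each code point in one pass."""
    i = 0
    n = len(data)
    while i < n:
        first = data[i]
        # Pick the sequence length, the payload bits of the first byte, and
        # the smallest code point allowed for that length (rejects overlongs).
        if first < 0x80:
            length, cp, min_cp = 1, first, 0
        elif 0xC2 <= first <= 0xDF:
            length, cp, min_cp = 2, first & 0x1F, 0x80
        elif 0xE0 <= first <= 0xEF:
            length, cp, min_cp = 3, first & 0x0F, 0x800
        elif 0xF0 <= first <= 0xF4:
            length, cp, min_cp = 4, first & 0x07, 0x10000
        else:
            raise Utf8ParseError(f"invalid start byte at index {i}")
        if i + length > n:
            raise Utf8ParseError(f"truncated sequence at index {i}")
        # Each continuation byte must look like 10xxxxxx and adds 6 bits.
        for j in range(i + 1, i + length):
            b = data[j]
            if b & 0xC0 != 0x80:
                raise Utf8ParseError(f"invalid continuation byte at index {j}")
            cp = (cp << 6) | (b & 0x3F)
        # Reject overlong encodings, surrogates, and values past U+10FFFF.
        if cp < min_cp or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise Utf8ParseError(f"invalid code point at index {i}")
        state = step(state, cp)
        i += length
    return state


# Usage: count code points without allocating a list of them.
count = walk_utf8("héllo 👋".encode("utf-8"), 0, lambda acc, _cp: acc + 1)
```

The caller chooses what the state is (a counter, a running width, a list), which is what would let a single-pass consumer avoid the intermediate collection.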

I am also interested in implementing something similar to the grapheme module.

Luke Boswell (Jun 28 2024 at 03:14):

You can find an overview of the process for Roc ideas, proposals, and feature requests here. Basically:

In the idea stage, people are encouraged to describe their idea and explore the problem, potential solutions, and tradeoffs. It's a good idea to share the idea in #ideas on Zulip.

So you're definitely in the right place :smiley:

Luke Boswell (Jun 28 2024 at 03:15):

Thank you for your work adding the visual width property to the unicode package. I think that will be very useful.

Luke Boswell (Jun 28 2024 at 03:19):

Regarding your proposal

It sounds very similar to the builtin Str.walkUtf8. I think it would probably be quite useful. What use cases are you thinking of?

Would the name CodePoint.walk be more suitable, considering we are walking Unicode code points rather than UTF-8 encoded bytes?

Another idea: what about walking the graphemes with a Grapheme.walk? That seems to be what people look for when they think of "characters".

Luke Boswell (Jun 28 2024 at 03:20):

Dilson Higa said:

I am also interested in implementing something similar to the grapheme module.

What are you thinking?

Dilson Higa (Jun 28 2024 at 12:22):

The problem with Str.walkUtf8 is that it walks the bytes of the UTF-8 representation, while I believe most people would be interested in iterating through graphemes. Currently this would be achieved by first splitting the bytes into graphemes, but that (as far as I understand) allocates an extra array of strings, which may be unnecessary if the user is doing some kind of single-pass processing. The same could be said about code points.

Luke Boswell (Jun 28 2024 at 12:25):

To clarify, I think there is a valid use case for having all three variants.

Luke Boswell (Jun 28 2024 at 12:25):

I'm not proposing we need to change the Str builtin. Just using that as an example for a similar API.

Dilson Higa (Jun 28 2024 at 12:27):

I haven't looked into how to implement it for graphemes yet, as I don't currently fully understand how code points are converted into graphemes. For the CodePoint module, though, I would say it would even make the code more readable, as the current parseUtf8 function could be implemented in terms of walk instead of an internal parseHelper.
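
The refactoring described here, a parse function expressed as a walk that accumulates into a list, can be sketched as follows. This is a Python illustration only, not the package's Roc code. `walk` and `parse_utf8` are hypothetical names echoing the ones in the discussion, and `walk` here simply delegates to Python's built-in UTF-8 decoder as a stand-in for a real single-pass decoder.

```python
from typing import Callable, List, TypeVar

S = TypeVar("S")


def walk(data: bytes, state: S, step: Callable[[S, int], S]) -> S:
    """Fold `step` over the code points of UTF-8 `data`.

    Stand-in implementation: uses Python's built-in decoder, which raises
    UnicodeDecodeError on invalid input.
    """
    for ch in data.decode("utf-8"):
        state = step(state, ord(ch))
    return state


def parse_utf8(data: bytes) -> List[int]:
    """Parse into a list of code points, written purely in terms of `walk`."""

    def append(acc: List[int], cp: int) -> List[int]:
        acc.append(cp)
        return acc

    return walk(data, [], append)


code_points = parse_utf8("aé".encode("utf-8"))
```

With this shape the list-building parser becomes one particular choice of state and step function, so no separate internal helper is needed.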

Dilson Higa (Jun 28 2024 at 12:28):

Luke Boswell said:

I'm not proposing we need to change the Str builtin. Just using that as an example for a similar API.

I think I was not clear. I agree the Str API is fine and is important, but these functions would solve different problems.

Dilson Higa (Jun 28 2024 at 12:29):

I am not advocating for any change in the Str API :sweat_smile:

Last updated: Jul 23 2026 at 13:15 UTC