In both Elm and JavaScript, you can pass an empty string to the split
function to get a list of characters from a string. However, according to Roc's documentation, it won't behave the same, and we should use graphemes
instead, yet this function does not exist in the Str
module.
Was the graphemes
function removed from the module? Was it intended to be built in the future? There's a discrepancy between the documentation and the code we should fix.
Is the intended path for this use case to use graphemes
? At least for what I'm working on, I can use toScalars
or toUtf8
. Are these alternatives the intended path?
Hmm.... @Richard Feldman is toScalars
different than what graphemes
would be? Did the function just get renamed?
As far as I understand from the (confusing) Unicode docs, graphemes are the smallest text element in a script. Scalars are a single unicode character, which may consist of one of more graphemes. Multiple scalars can then be grouped into grapheme clusters.
Edit: fixed the definitions
yeah they're different
it was never implemented, I just wrote about it as if it already existed, although it doesn't :sweat_smile:
@Prajwal S N that's correct!
basically if you want to split it into whole emojis instead of chopped-up emojis, you want graphemes and not scalars
because all emojis are one grapheme each, but multiple scalars each
It looks like a we can use grapheme.zig
with the logic used in Str.toScalars
to implement this
I would like to give this a shot if no one else is already working on it
that would be wonderful! :heart_eyes:
happy to help you with that in any way!
Thanks @Prajwal S N ! :raised_hands:
I can think of two ways to implement this:
List (List U32)
, with each sub-list consisting of the code points for a single graphemeList Str
, with each element being one graphemeWhich one do you guys think is better?
definitely #2!
you can quickly and easily go from Str
to List U32
, but going the other way around requires a utf-8 validation check and a Result
:big_smile:
I managed to get some sort of implementation done, but I've been dealing with all forms of allocation issues, segfaults, and core dumps since yesterday :skull: I'll open a draft PR, would be great if someone can help with fixing it (I have close to zero knowledge of Zig, so I'm not sure how to go about making it work either)
https://github.com/roc-lang/roc/pull/4286
cc @Travis in case you're interested - you seem to be very knowledgeable about Zig! :smiley:
yes i would like to help if i can. although i'm not very comfy in rust (and no rust install yet). @Prajwal S N feel free to ping me if you have any zig questions.
have you looked at ziglyph? https://github.com/jecolon/ziglyph
not sure if its applicable to your pr just the first thing that comes to mind
Yes I did take a look at it, but I'm not keen on adding another dependency to the backend
Or would it be alright to replace the existing grapheme.zig
dependency with ziglyph?
@Travis need some help with copying slices in Zig
var a : ?[]const u8 = null;
var b : []const u8 = "hello";
How do I copy the contents of b to a? Doing @memcpy(a, b, b.len)
throws expected [*]u8, got ?[]const u8
what do you want this to do? you have a null pointer, can't store anything into that?
do you want this to happen at runtime? if so, a would need to be a ?[] u8
cause you can't write to a const pointer.
or do you mean overwrite a to point to b?
cause that would just be a = b
. so yeah i'm not sure what youre trying to do.
in order for this to work, a needs to have memory somewhere.
if you have an allocator you could do a = try allocator.dupe(u8, b)
in order for this to work, a needs to have memory somewhere.
Yes I would like to allocate and copy the slice into a
what do you want this to do? you have a null pointer, can't store anything into that?
@Folkert de Vries It needs to be null since I'm using it in an if within a while loop. In every iteration of the loop, I need b to be copied and stored in a as the previous value. See this PR (https://github.com/roc-lang/roc/pull/4286) for the complete function. I'm trying to refactor it into using a u8 slice instead of u21 values for the codepoints
are you referring to opt_last_codepoint
here? https://github.com/snprajwal/roc/commit/240df7e679e23b7366732a94374ca578a432188e#diff-a30a29d5d2db3689cadd795ef9925fa8bd162463993bff0ff13c91e1178f7d7dR1328
and you're trying to refactor that to a ?[]const u8
?
would it be any easier to make it a ?[4]u8
?
and you're trying to refactor that to a ?[]const u8?
Bingo, that seems to make everything work correctly, but I need to copy the slice in order to not omit the first character
would it be any easier to make it a ?[4]u8?
I'm guessing even if I do that, I will still have to copy the slice over...
Prajwal S N said:
I'm guessing even if I do that, I will still have to copy the slice over...
Yes you'll have to copy over the slice over into the [4]u8 and keep track of the length.
This all seems kinda kludgy imo. Maybe better to use a u32 which you can set to zero and @bitcast to/from a [4]u8?
seems like doing it this way you wouldn't need an optional. just check if the u32 equals zero rather than null.
idk. these are just ideas. haven't thought carefully about the code in question.
I think I'll step back from this. I do not have the Zig expertise to get it working, and I've been going in circles for the past week. If anyone else would like to continue with what I've already implemented, please feel free to do so!
@Prajwal S N i am interested in attempting to implement the zig portion of this pr. however, i am not very familiar with the compiler or rust. so i would need a lot of guidance.
would you be interested in working together? if you or someone else could oversee and assign me tasks plus do the rust coding, i might be able to help complete this.
@Travis I would be happy to assist you in any way possible. The Rust portion of the implementation is already complete, along with the LLVM bindings for the Zig function. The only issue currently is with the logic used for splitting the string into graphemes. You won't have to touch anything outside of compiler/builtins/bitcode/src/str.zig
The isGraphemeBreak()
function used by strGraphemes()
is provided by compiler/builtins/bitcode/src/helpers/grapheme.zig
, you might want to take a look at that too
There were a couple of very useful reviews on the PR by Brian Carroll , would also recommend going through that once
ok thanks @Prajwal S N . sounds like a reasonable task. i'll take a closer look at the pr and comments.
question: are there existing ci tests for this? will the ci just fail if i have errors?
what would be even better is if there are zig tests i could run locally (using $ zig test ...
) without having to push and wait for ci to run. (i'm assuming the ci takes a while to run)
and if no zig tests exist, could you point me to where the tests are defined so that i can create zig tests.
or if they don't exist help me create them by providing some inputs and expected outputs to strGrapheme?
i don't see any strGrapheme
specific tests here: https://github.com/snprajwal/roc/blob/str-graphemes/crates/compiler/builtins/bitcode/src/str.zig#L1310
You're right, I haven't added any tests yet. You can define them in str.zig
itself, something like
test "strGraphemes: emojis, ut8, and ascii characters" {
const bytes_arr = "6🤔å";
const bytes_len = bytes_arr.len;
const str = RocStr.init(bytes_arr, bytes_len);
const expected = [3]RocStr{
RocStr.init("6", 1), RocStr.init("🤔", 1), RocStr.init("å", 1),
};
defer {
for (expected) |rocStr| {
rocStr.deinit();
}
str.deinit();
}
const array = strGraphemes(str);
try expectEqual(array.len, expected.len);
try expect(array[0].eq(expected[0]));
try expect(array[1].eq(expected[1]));
try expect(array[2].eq(expected[2]));
}
thank you! you wrote a whole test. very nice.
while this is much appreciated, i was only asking for inputs and outputs. i think i understand what strGraphemes() is supposed to do - just split up a string into graphemes. what i'm hoping for is some input / outputs like this:
1. "ab" -> ["a", "b"]
2. "6🤔å" -> ["6", "🤔", "å"]
i don't know what important edge cases will look like and would like to add them.
I put up a pr: https://github.com/roc-lang/roc/pull/4364
i consider it a first draft. please review and let me know what it needs.
@Brian Carroll please take a look if you can too ^
you wrote a whole test. very nice.
Haha no I just ripped off an existing test for Str.countGraphemeClusters
and Str.split
:stuck_out_tongue:
glad for a painless rebase. :relief: didn't want to have to deal with any conflicts.
i just did a squash. hoping that + signing worked ok
seems good now if anyone wants to kick off the ci
can't tell why the ci failed https://github.com/roc-lang/roc/pull/4364
doesn't seem so be related to strGraphemes()
?
Agreed. I reran it, let's see what happens
thanks. looks like it failed again with same the errors.
unix_wait_status(139)
is for segmentation faults. I've tested it locally and was able to reproduce it, all failed tests use a c platform.
huh :thinking: thanks for testing locally. thats an interesting observation that they're all using a c platform. could that perhaps suggest some type of linking error is happening?
the failing tests are
failures:
cli_run::parse_movies_csv
cli_run::platform_switching_main
cli_run::ruby_interop
and the first failure looks like this
test cli_run::false_interpreter ... ok
test cli_run::breakout ... ok
failures:
---- cli_run::parse_movies_csv stdout ---- thread 'cli_run::parse_movies_csv'
panicked at 'bad status Out { stdout: "2 movies were found:\n\nThe
movie 'Airplane!' was released in 1980 and stars Robert Hays and Julie
Hagerty\nThe movie 'Caddyshack' was released in 1980 and stars Chevy
Chase, Rodney Dangerfield, Ted Knight, Michael O'Keefe and Bill
Murray\n\nParse success!\n", stderr: "🔨 Rebuilding platform...\n", status:
ExitStatus(unix_wait_status(139)) }', crates/cli/tests/cli_run.rs:118:9
so it appears that the test's stdout looks good. and everything else seems ok except the segfault exit status.
is it possible that the tests i added are causing this segfault? i think the new tests are the only real addition. the strGraphemes()
function is very similar to @Prajwal S N's version which passed ci fine.
I don't think so, they're coming from an entirely different crate
This is somehow specifically caused by the surgical linker in optimized builds. The exit function is crashing. From valgrind:
==34239== Process terminating with default action of signal 11 (SIGSEGV)
==34239== Bad permissions for mapped region at address 0x121004
==34239== at 0x121004: ??? (in /home/bren077s/Projects/roc-misc/roc/examples/parser/parse-movies-csv)
==34239== by 0x498E0C4: __run_exit_handlers (in /nix/store/6f66prpgx1qx4n6k450sxs3d157ia1ps-glibc-2.35-163/lib/libc.so.6)
==34239== by 0x498E24D: exit (in /nix/store/6f66prpgx1qx4n6k450sxs3d157ia1ps-glibc-2.35-163/lib/libc.so.6)
==34239== by 0x4977254: (below main) (in /nix/store/6f66prpgx1qx4n6k450sxs3d157ia1ps-glibc-2.35-163/lib/libc.so.6)
Does not look to be nix/glibc version specific:
==34807== Process terminating with default action of signal 11 (SIGSEGV)
==34807== Bad permissions for mapped region at address 0x121010
==34807== at 0x121010: ??? (in /home/bren077s/Projects/roc-misc/roc/examples/parser/parse-movies-csv)
==34807== by 0x4988FB4: __run_exit_handlers (exit.c:113)
==34807== by 0x4989129: exit (exit.c:143)
==34807== by 0x4972210: (below main) (libc_start_call_main.h:74)
Based on some googling, this seems to happen when something in an app is trying to write to or execute from a section without those permissions. For example, if you have a string constant and try to write to it. The constant would be in a read only section and this would happen.
Ok so. when unloading, the app loads the FINI
address: FINI 0x0000000000019010
. It adds some value to that and then calls it as a function.
It looks to be successfully jumping to the fini
function, which based on the address looks like it is in an executable section, yet when trying to run the first instruction to increment the stack, it crashes.
Summary:
I can see what is broken. It doesn't look to me like it should be broken. I am quite confused how this change led to the breakage.
Cause:
New function, surgical linker, and optimized build leading to segfault when trying to exit the program.
could it possibly have something to do with the large arrays in helpers/grapheme.zig? doesn't seem likely as this file existed and was used prior to this pr.
like maybe we're trampling/overflowing the stack?
i remember zig having some issues with very large arrays at some point. not likely, just a thought.
Ok this is a stupid idea but ... the last change was the one to rename that export. And we're having some kind of linker issue. Wanna try changing it back as an experiment? If we don't learn anything we can change back again.
ok sounds good. will try it out..
ok just pushed. ci needs approval.
no luck. same errors on linux. i wondered about this too but forgot to bring it up. glad we ruled it out anyway.
Is this the first time importing the grapheme.zig file? If so, it is probably generating something that breaks the surgical linker (still not sure how given everything runs and it only crashes on exiting). Also, it breaks test that theoretically shouldn't include any code generated from it.
no helpers/grapheme.zig was there previous to this pr. its used in str.zig#countGraphemeClusters()
https://github.com/roc-lang/roc/blob/main/crates/compiler/builtins/bitcode/src/str.zig#L1216
Ah, it just wasn't imported at the top of the file. Ok
Ok, i figured something out
When we compile a c/c++ host, we always build it with the built-ins object file. This is not wanted for the llvm backend. The llvm backend already loads the built-ins through the llvm bitcode.
Somehow (haven't figured out why yet), if we compile the c/c++ host with the new builtins object generated from the new PR, it is broken.
Simply not compiling it with the built-ins object is a fix.
This may mean we will run into more related problems when the dev backend is more complete, but I don't think we need to worry about that too much for now.
So for now, a fix should be to comment out link.rs
line 437.
should i try commenting out that line in this pr? maybe just temporarily to see if it passes the ci.
i'll give it a try. if anyone sees this and can approve the ci run in a few minutes that would be appreciated.
awaiting ci approval: https://github.com/roc-lang/roc/pull/4364
Brendan Hansknecht said:
So for now, a fix should be to comment out
link.rs
line 437.
This worked! Nice job.
@Brendan Hansknecht @Brian Carroll do you guys want me to leave this link.rs:437 as is or should we put that in a separate pr? As is, that line is included (commented out) and i've removed the str_
prefixed symbol.
Should be fine to change in this pr. Can you add a comment. Something like:
With the addition of Str.graphemes, always linking the built-ins led to a surgical linker bug for optimized builds. Disabling until it is needed for dev builds.
does anyone have any guidance for this idea: https://github.com/roc-lang/roc/pull/4364#issuecomment-1286453759 ?
i just wanted to add a little bit more test coverage for strGraphemes() and it makes sense to me to combine with the existing countGraphemeClusters() tests.
i combined the tests. can easily change it back if necessary - let me know.
have you thought about writing the tests in Roc (using expect
inside Str.roc
, similar to how we are in some other places in that file) instead of zig?
no i hadn't noticed those tests in Str.roc
. if that is the preferred way, i can move them there.
note that it is nice to have the tests in zig for the all the devs (likely just me :smirk: ) who don't have a rust toolchain installed.
you should be able to run the Str.roc
tests with the normal nightly builds of the compiler I think!
roc test path/to/Str.roc
ah. :light_bulb: i forgot that str.zig
is compiled on the fly with the platform.
oh wait oops, I don't think the nightly will work if you need to incorporate new zig builtin code
think i got commit signing straightened out (...again :sweat_smile: ) if anyone can approve the ci: https://github.com/roc-lang/roc/pull/4364
Prajwal S N has marked this topic as resolved.
Last updated: Jul 06 2025 at 12:14 UTC