Stream: contributing

Topic: ✔ Missing Str.graphemes


view this post on Zulip Chris Duncan (Oct 09 2022 at 13:00):

In both Elm and JavaScript, you can pass an empty string to the split function to get a list of characters from a string. However, according to Roc's documentation, it won't behave the same, and we should use graphemes instead, yet this function does not exist in the Str module.

Was the graphemes function removed from the module? Was it intended to be built in the future? There's a discrepancy between the documentation and the code we should fix.

Is the intended path for this use case to use graphemes? At least for what I'm working on, I can use toScalars or toUtf8. Are these alternatives the intended path?

view this post on Zulip Brendan Hansknecht (Oct 09 2022 at 15:00):

Hmm.... @Richard Feldman is toScalars different than what graphemes would be? Did the function just get renamed?

view this post on Zulip Prajwal S N (Oct 09 2022 at 15:20):

As far as I understand from the (confusing) Unicode docs, graphemes are the smallest text element in a script. Scalars are a single unicode character, which may consist of one of more graphemes. Multiple scalars can then be grouped into grapheme clusters.

Edit: fixed the definitions

view this post on Zulip Richard Feldman (Oct 09 2022 at 15:27):

yeah they're different

view this post on Zulip Richard Feldman (Oct 09 2022 at 15:27):

it was never implemented, I just wrote about it as if it already existed, although it doesn't :sweat_smile:

view this post on Zulip Richard Feldman (Oct 09 2022 at 15:27):

@Prajwal S N that's correct!

view this post on Zulip Richard Feldman (Oct 09 2022 at 15:28):

basically if you want to split it into whole emojis instead of chopped-up emojis, you want graphemes and not scalars

view this post on Zulip Richard Feldman (Oct 09 2022 at 15:29):

because all emojis are one grapheme each, but multiple scalars each

view this post on Zulip Prajwal S N (Oct 09 2022 at 15:29):

It looks like a we can use grapheme.zig with the logic used in Str.toScalars to implement this

view this post on Zulip Prajwal S N (Oct 09 2022 at 15:30):

I would like to give this a shot if no one else is already working on it

view this post on Zulip Richard Feldman (Oct 09 2022 at 15:30):

that would be wonderful! :heart_eyes:

view this post on Zulip Richard Feldman (Oct 09 2022 at 15:30):

happy to help you with that in any way!

view this post on Zulip Chris Duncan (Oct 09 2022 at 15:31):

Thanks @Prajwal S N ! :raised_hands:

view this post on Zulip Prajwal S N (Oct 09 2022 at 15:50):

I can think of two ways to implement this:

  1. We return a List (List U32), with each sub-list consisting of the code points for a single grapheme
  2. We return a List Str, with each element being one grapheme

Which one do you guys think is better?

view this post on Zulip Richard Feldman (Oct 09 2022 at 16:18):

definitely #2!

view this post on Zulip Richard Feldman (Oct 09 2022 at 16:19):

you can quickly and easily go from Str to List U32, but going the other way around requires a utf-8 validation check and a Result :big_smile:

view this post on Zulip Prajwal S N (Oct 10 2022 at 15:21):

I managed to get some sort of implementation done, but I've been dealing with all forms of allocation issues, segfaults, and core dumps since yesterday :skull: I'll open a draft PR, would be great if someone can help with fixing it (I have close to zero knowledge of Zig, so I'm not sure how to go about making it work either)

view this post on Zulip Prajwal S N (Oct 10 2022 at 15:23):

https://github.com/roc-lang/roc/pull/4286

view this post on Zulip Richard Feldman (Oct 10 2022 at 16:10):

cc @Travis in case you're interested - you seem to be very knowledgeable about Zig! :smiley:

view this post on Zulip Travis (Oct 10 2022 at 16:18):

yes i would like to help if i can. although i'm not very comfy in rust (and no rust install yet). @Prajwal S N feel free to ping me if you have any zig questions.

view this post on Zulip Travis (Oct 10 2022 at 16:20):

have you looked at ziglyph? https://github.com/jecolon/ziglyph

view this post on Zulip Travis (Oct 10 2022 at 16:23):

not sure if its applicable to your pr just the first thing that comes to mind

view this post on Zulip Prajwal S N (Oct 10 2022 at 16:41):

Yes I did take a look at it, but I'm not keen on adding another dependency to the backend

view this post on Zulip Prajwal S N (Oct 10 2022 at 16:43):

Or would it be alright to replace the existing grapheme.zig dependency with ziglyph?

view this post on Zulip Prajwal S N (Oct 13 2022 at 11:45):

@Travis need some help with copying slices in Zig

var a : ?[]const u8 = null;
var b : []const u8 = "hello";

How do I copy the contents of b to a? Doing @memcpy(a, b, b.len) throws expected [*]u8, got ?[]const u8

view this post on Zulip Folkert de Vries (Oct 13 2022 at 12:09):

what do you want this to do? you have a null pointer, can't store anything into that?

view this post on Zulip Travis (Oct 13 2022 at 15:52):

do you want this to happen at runtime? if so, a would need to be a ?[] u8 cause you can't write to a const pointer.

view this post on Zulip Travis (Oct 13 2022 at 15:53):

or do you mean overwrite a to point to b?

view this post on Zulip Travis (Oct 13 2022 at 15:53):

cause that would just be a = b. so yeah i'm not sure what youre trying to do.

view this post on Zulip Travis (Oct 13 2022 at 15:56):

in order for this to work, a needs to have memory somewhere.

view this post on Zulip Travis (Oct 13 2022 at 15:56):

if you have an allocator you could do a = try allocator.dupe(u8, b)

view this post on Zulip Prajwal S N (Oct 13 2022 at 18:19):

in order for this to work, a needs to have memory somewhere.

Yes I would like to allocate and copy the slice into a

view this post on Zulip Prajwal S N (Oct 13 2022 at 18:22):

what do you want this to do? you have a null pointer, can't store anything into that?

@Folkert de Vries It needs to be null since I'm using it in an if within a while loop. In every iteration of the loop, I need b to be copied and stored in a as the previous value. See this PR (https://github.com/roc-lang/roc/pull/4286) for the complete function. I'm trying to refactor it into using a u8 slice instead of u21 values for the codepoints

view this post on Zulip Travis (Oct 13 2022 at 18:43):

are you referring to opt_last_codepoint here? https://github.com/snprajwal/roc/commit/240df7e679e23b7366732a94374ca578a432188e#diff-a30a29d5d2db3689cadd795ef9925fa8bd162463993bff0ff13c91e1178f7d7dR1328

view this post on Zulip Travis (Oct 13 2022 at 18:44):

and you're trying to refactor that to a ?[]const u8?

view this post on Zulip Travis (Oct 13 2022 at 18:45):

would it be any easier to make it a ?[4]u8?

view this post on Zulip Prajwal S N (Oct 17 2022 at 13:27):

and you're trying to refactor that to a ?[]const u8?

Bingo, that seems to make everything work correctly, but I need to copy the slice in order to not omit the first character

view this post on Zulip Prajwal S N (Oct 17 2022 at 13:28):

would it be any easier to make it a ?[4]u8?

I'm guessing even if I do that, I will still have to copy the slice over...

view this post on Zulip Travis (Oct 17 2022 at 20:15):

Prajwal S N said:

I'm guessing even if I do that, I will still have to copy the slice over...

Yes you'll have to copy over the slice over into the [4]u8 and keep track of the length.

view this post on Zulip Travis (Oct 17 2022 at 20:17):

This all seems kinda kludgy imo. Maybe better to use a u32 which you can set to zero and @bitcast to/from a [4]u8?

view this post on Zulip Travis (Oct 17 2022 at 20:23):

seems like doing it this way you wouldn't need an optional. just check if the u32 equals zero rather than null.

view this post on Zulip Travis (Oct 17 2022 at 20:24):

idk. these are just ideas. haven't thought carefully about the code in question.

view this post on Zulip Prajwal S N (Oct 18 2022 at 13:51):

I think I'll step back from this. I do not have the Zig expertise to get it working, and I've been going in circles for the past week. If anyone else would like to continue with what I've already implemented, please feel free to do so!

view this post on Zulip Travis (Oct 18 2022 at 16:29):

@Prajwal S N i am interested in attempting to implement the zig portion of this pr. however, i am not very familiar with the compiler or rust. so i would need a lot of guidance.

view this post on Zulip Travis (Oct 18 2022 at 16:29):

would you be interested in working together? if you or someone else could oversee and assign me tasks plus do the rust coding, i might be able to help complete this.

view this post on Zulip Prajwal S N (Oct 18 2022 at 17:34):

@Travis I would be happy to assist you in any way possible. The Rust portion of the implementation is already complete, along with the LLVM bindings for the Zig function. The only issue currently is with the logic used for splitting the string into graphemes. You won't have to touch anything outside of compiler/builtins/bitcode/src/str.zig

view this post on Zulip Prajwal S N (Oct 18 2022 at 17:36):

The isGraphemeBreak() function used by strGraphemes() is provided by compiler/builtins/bitcode/src/helpers/grapheme.zig, you might want to take a look at that too

view this post on Zulip Prajwal S N (Oct 18 2022 at 17:37):

There were a couple of very useful reviews on the PR by Brian Carroll , would also recommend going through that once

view this post on Zulip Travis (Oct 18 2022 at 17:41):

ok thanks @Prajwal S N . sounds like a reasonable task. i'll take a closer look at the pr and comments.

view this post on Zulip Travis (Oct 18 2022 at 17:44):

question: are there existing ci tests for this? will the ci just fail if i have errors?

view this post on Zulip Travis (Oct 18 2022 at 17:45):

what would be even better is if there are zig tests i could run locally (using $ zig test ...) without having to push and wait for ci to run. (i'm assuming the ci takes a while to run)

view this post on Zulip Travis (Oct 18 2022 at 17:51):

and if no zig tests exist, could you point me to where the tests are defined so that i can create zig tests.

view this post on Zulip Travis (Oct 18 2022 at 17:52):

or if they don't exist help me create them by providing some inputs and expected outputs to strGrapheme?

view this post on Zulip Travis (Oct 18 2022 at 18:01):

i don't see any strGrapheme specific tests here: https://github.com/snprajwal/roc/blob/str-graphemes/crates/compiler/builtins/bitcode/src/str.zig#L1310

view this post on Zulip Prajwal S N (Oct 19 2022 at 12:33):

You're right, I haven't added any tests yet. You can define them in str.zig itself, something like

test "strGraphemes: emojis, ut8, and ascii characters" {
    const bytes_arr = "6🤔å";
    const bytes_len = bytes_arr.len;
    const str = RocStr.init(bytes_arr, bytes_len);
    const expected = [3]RocStr{
        RocStr.init("6", 1), RocStr.init("🤔", 1), RocStr.init("å", 1),
    };
    defer {
        for (expected) |rocStr| {
            rocStr.deinit();
        }
    str.deinit();
    }

    const array = strGraphemes(str);
    try expectEqual(array.len, expected.len);
    try expect(array[0].eq(expected[0]));
    try expect(array[1].eq(expected[1]));
    try expect(array[2].eq(expected[2]));
}

view this post on Zulip Travis (Oct 19 2022 at 14:04):

thank you! you wrote a whole test. very nice.

while this is much appreciated, i was only asking for inputs and outputs. i think i understand what strGraphemes() is supposed to do - just split up a string into graphemes. what i'm hoping for is some input / outputs like this:

1. "ab" -> ["a", "b"]
2. "6🤔å" -> ["6", "🤔", "å"]

view this post on Zulip Travis (Oct 19 2022 at 14:05):

i don't know what important edge cases will look like and would like to add them.

view this post on Zulip Travis (Oct 19 2022 at 16:33):

I put up a pr: https://github.com/roc-lang/roc/pull/4364

view this post on Zulip Travis (Oct 19 2022 at 16:34):

i consider it a first draft. please review and let me know what it needs.

view this post on Zulip Travis (Oct 19 2022 at 16:41):

@Brian Carroll please take a look if you can too ^

view this post on Zulip Prajwal S N (Oct 19 2022 at 17:32):

you wrote a whole test. very nice.

Haha no I just ripped off an existing test for Str.countGraphemeClusters and Str.split :stuck_out_tongue:

view this post on Zulip Travis (Oct 19 2022 at 17:44):

glad for a painless rebase. :relief: didn't want to have to deal with any conflicts.

view this post on Zulip Travis (Oct 19 2022 at 18:58):

i just did a squash. hoping that + signing worked ok

view this post on Zulip Travis (Oct 19 2022 at 19:31):

seems good now if anyone wants to kick off the ci

view this post on Zulip Travis (Oct 19 2022 at 21:08):

can't tell why the ci failed https://github.com/roc-lang/roc/pull/4364

view this post on Zulip Travis (Oct 19 2022 at 21:09):

doesn't seem so be related to strGraphemes()?

view this post on Zulip Brian Carroll (Oct 19 2022 at 21:19):

Agreed. I reran it, let's see what happens

view this post on Zulip Travis (Oct 19 2022 at 21:39):

thanks. looks like it failed again with same the errors.

view this post on Zulip Anton (Oct 20 2022 at 07:04):

unix_wait_status(139) is for segmentation faults. I've tested it locally and was able to reproduce it, all failed tests use a c platform.

view this post on Zulip Travis (Oct 20 2022 at 16:35):

huh :thinking: thanks for testing locally. thats an interesting observation that they're all using a c platform. could that perhaps suggest some type of linking error is happening?

view this post on Zulip Travis (Oct 20 2022 at 16:45):

the failing tests are

 failures:
    cli_run::parse_movies_csv
    cli_run::platform_switching_main
    cli_run::ruby_interop

view this post on Zulip Travis (Oct 20 2022 at 16:46):

and the first failure looks like this

test cli_run::false_interpreter ... ok
test cli_run::breakout ... ok

failures:

---- cli_run::parse_movies_csv stdout ---- thread 'cli_run::parse_movies_csv'
     panicked at 'bad status Out { stdout: "2 movies were found:\n\nThe
     movie 'Airplane!' was released in 1980 and stars Robert Hays and Julie
     Hagerty\nThe movie 'Caddyshack' was released in 1980 and stars Chevy
     Chase, Rodney Dangerfield, Ted Knight, Michael O'Keefe and Bill
     Murray\n\nParse success!\n", stderr: "🔨 Rebuilding platform...\n", status:
     ExitStatus(unix_wait_status(139)) }', crates/cli/tests/cli_run.rs:118:9

view this post on Zulip Travis (Oct 20 2022 at 16:49):

so it appears that the test's stdout looks good. and everything else seems ok except the segfault exit status.

view this post on Zulip Travis (Oct 20 2022 at 17:46):

is it possible that the tests i added are causing this segfault? i think the new tests are the only real addition. the strGraphemes() function is very similar to @Prajwal S N's version which passed ci fine.

view this post on Zulip Prajwal S N (Oct 20 2022 at 18:04):

I don't think so, they're coming from an entirely different crate

view this post on Zulip Brendan Hansknecht (Oct 20 2022 at 18:35):

This is somehow specifically caused by the surgical linker in optimized builds. The exit function is crashing. From valgrind:

==34239== Process terminating with default action of signal 11 (SIGSEGV)
==34239==  Bad permissions for mapped region at address 0x121004
==34239==    at 0x121004: ??? (in /home/bren077s/Projects/roc-misc/roc/examples/parser/parse-movies-csv)
==34239==    by 0x498E0C4: __run_exit_handlers (in /nix/store/6f66prpgx1qx4n6k450sxs3d157ia1ps-glibc-2.35-163/lib/libc.so.6)
==34239==    by 0x498E24D: exit (in /nix/store/6f66prpgx1qx4n6k450sxs3d157ia1ps-glibc-2.35-163/lib/libc.so.6)
==34239==    by 0x4977254: (below main) (in /nix/store/6f66prpgx1qx4n6k450sxs3d157ia1ps-glibc-2.35-163/lib/libc.so.6)

view this post on Zulip Brendan Hansknecht (Oct 20 2022 at 18:37):

Does not look to be nix/glibc version specific:

==34807== Process terminating with default action of signal 11 (SIGSEGV)
==34807==  Bad permissions for mapped region at address 0x121010
==34807==    at 0x121010: ??? (in /home/bren077s/Projects/roc-misc/roc/examples/parser/parse-movies-csv)
==34807==    by 0x4988FB4: __run_exit_handlers (exit.c:113)
==34807==    by 0x4989129: exit (exit.c:143)
==34807==    by 0x4972210: (below main) (libc_start_call_main.h:74)

view this post on Zulip Brendan Hansknecht (Oct 20 2022 at 18:44):

Based on some googling, this seems to happen when something in an app is trying to write to or execute from a section without those permissions. For example, if you have a string constant and try to write to it. The constant would be in a read only section and this would happen.

view this post on Zulip Brendan Hansknecht (Oct 20 2022 at 18:53):

Ok so. when unloading, the app loads the FINI address: FINI 0x0000000000019010. It adds some value to that and then calls it as a function.

view this post on Zulip Brendan Hansknecht (Oct 20 2022 at 18:54):

It looks to be successfully jumping to the fini function, which based on the address looks like it is in an executable section, yet when trying to run the first instruction to increment the stack, it crashes.

view this post on Zulip Brendan Hansknecht (Oct 20 2022 at 19:35):

Summary:
I can see what is broken. It doesn't look to me like it should be broken. I am quite confused how this change led to the breakage.

view this post on Zulip Brendan Hansknecht (Oct 20 2022 at 19:36):

Cause:
New function, surgical linker, and optimized build leading to segfault when trying to exit the program.

view this post on Zulip Travis (Oct 20 2022 at 21:11):

could it possibly have something to do with the large arrays in helpers/grapheme.zig? doesn't seem likely as this file existed and was used prior to this pr.

view this post on Zulip Travis (Oct 20 2022 at 21:12):

like maybe we're trampling/overflowing the stack?

view this post on Zulip Travis (Oct 20 2022 at 21:12):

i remember zig having some issues with very large arrays at some point. not likely, just a thought.

view this post on Zulip Brian Carroll (Oct 20 2022 at 21:21):

Ok this is a stupid idea but ... the last change was the one to rename that export. And we're having some kind of linker issue. Wanna try changing it back as an experiment? If we don't learn anything we can change back again.

view this post on Zulip Travis (Oct 20 2022 at 21:24):

ok sounds good. will try it out..

view this post on Zulip Travis (Oct 20 2022 at 21:29):

ok just pushed. ci needs approval.

view this post on Zulip Travis (Oct 20 2022 at 22:20):

no luck. same errors on linux. i wondered about this too but forgot to bring it up. glad we ruled it out anyway.

view this post on Zulip Brendan Hansknecht (Oct 20 2022 at 22:25):

Is this the first time importing the grapheme.zig file? If so, it is probably generating something that breaks the surgical linker (still not sure how given everything runs and it only crashes on exiting). Also, it breaks test that theoretically shouldn't include any code generated from it.

view this post on Zulip Travis (Oct 20 2022 at 22:27):

no helpers/grapheme.zig was there previous to this pr. its used in str.zig#countGraphemeClusters()

view this post on Zulip Travis (Oct 20 2022 at 22:28):

https://github.com/roc-lang/roc/blob/main/crates/compiler/builtins/bitcode/src/str.zig#L1216

view this post on Zulip Brendan Hansknecht (Oct 20 2022 at 22:30):

Ah, it just wasn't imported at the top of the file. Ok

view this post on Zulip Brendan Hansknecht (Oct 20 2022 at 23:08):

Ok, i figured something out

view this post on Zulip Brendan Hansknecht (Oct 20 2022 at 23:09):

When we compile a c/c++ host, we always build it with the built-ins object file. This is not wanted for the llvm backend. The llvm backend already loads the built-ins through the llvm bitcode.

view this post on Zulip Brendan Hansknecht (Oct 20 2022 at 23:10):

Somehow (haven't figured out why yet), if we compile the c/c++ host with the new builtins object generated from the new PR, it is broken.

view this post on Zulip Brendan Hansknecht (Oct 20 2022 at 23:10):

Simply not compiling it with the built-ins object is a fix.

view this post on Zulip Brendan Hansknecht (Oct 20 2022 at 23:11):

This may mean we will run into more related problems when the dev backend is more complete, but I don't think we need to worry about that too much for now.

view this post on Zulip Brendan Hansknecht (Oct 20 2022 at 23:13):

So for now, a fix should be to comment out link.rs line 437.

view this post on Zulip Travis (Oct 21 2022 at 01:08):

should i try commenting out that line in this pr? maybe just temporarily to see if it passes the ci.

view this post on Zulip Travis (Oct 21 2022 at 01:10):

i'll give it a try. if anyone sees this and can approve the ci run in a few minutes that would be appreciated.

view this post on Zulip Travis (Oct 21 2022 at 01:11):

awaiting ci approval: https://github.com/roc-lang/roc/pull/4364

view this post on Zulip Travis (Oct 21 2022 at 02:19):

Brendan Hansknecht said:

So for now, a fix should be to comment out link.rs line 437.

This worked! Nice job.

view this post on Zulip Travis (Oct 21 2022 at 02:21):

@Brendan Hansknecht @Brian Carroll do you guys want me to leave this link.rs:437 as is or should we put that in a separate pr? As is, that line is included (commented out) and i've removed the str_ prefixed symbol.

view this post on Zulip Brendan Hansknecht (Oct 21 2022 at 03:21):

Should be fine to change in this pr. Can you add a comment. Something like:
With the addition of Str.graphemes, always linking the built-ins led to a surgical linker bug for optimized builds. Disabling until it is needed for dev builds.

view this post on Zulip Travis (Oct 21 2022 at 16:53):

does anyone have any guidance for this idea: https://github.com/roc-lang/roc/pull/4364#issuecomment-1286453759 ?

view this post on Zulip Travis (Oct 21 2022 at 16:54):

i just wanted to add a little bit more test coverage for strGraphemes() and it makes sense to me to combine with the existing countGraphemeClusters() tests.

view this post on Zulip Travis (Oct 21 2022 at 17:12):

i combined the tests. can easily change it back if necessary - let me know.

view this post on Zulip Richard Feldman (Oct 21 2022 at 17:18):

have you thought about writing the tests in Roc (using expect inside Str.roc, similar to how we are in some other places in that file) instead of zig?

view this post on Zulip Travis (Oct 21 2022 at 17:22):

no i hadn't noticed those tests in Str.roc. if that is the preferred way, i can move them there.

view this post on Zulip Travis (Oct 21 2022 at 17:25):

note that it is nice to have the tests in zig for the all the devs (likely just me :smirk: ) who don't have a rust toolchain installed.

view this post on Zulip Richard Feldman (Oct 21 2022 at 17:36):

you should be able to run the Str.roc tests with the normal nightly builds of the compiler I think!

view this post on Zulip Richard Feldman (Oct 21 2022 at 17:36):

roc test path/to/Str.roc

view this post on Zulip Travis (Oct 21 2022 at 17:48):

ah. :light_bulb: i forgot that str.zig is compiled on the fly with the platform.

view this post on Zulip Richard Feldman (Oct 21 2022 at 17:52):

oh wait oops, I don't think the nightly will work if you need to incorporate new zig builtin code

view this post on Zulip Travis (Oct 23 2022 at 02:39):

think i got commit signing straightened out (...again :sweat_smile: ) if anyone can approve the ci: https://github.com/roc-lang/roc/pull/4364

view this post on Zulip Notification Bot (Oct 23 2022 at 08:58):

Prajwal S N has marked this topic as resolved.


Last updated: Jul 06 2025 at 12:14 UTC