Fuzz testing and other contributions to roc-unicode · contributing

Stream: contributing

Topic: Fuzz testing and other contributions to roc-unicode

Hristo (Apr 24 2024 at 06:38):

I've decided to spin off a discussion from this message, authored by @Luke Boswell.

It's in the context of the current state of roc-unicode and potential current-stage contributions that would be welcome, such as giving a hand with fuzz testing.

I've been extremely snowed under at work, but I'd be happy to at least think about how I could help with the effort toward a more stable roc-unicode release, as it feels to me like one of the things fairly strongly tied to the overall user experience with Roc.

It'd also be nice to have some sort of action points that the community can pick up, subject to capacity and time availability.

Luke Boswell (Apr 24 2024 at 09:06):

@Brendan Hansknecht has done a lot of work on fuzzing. He's written notes about it in this Zulip discussion and upgraded his roc-fuzz platform so that it is more suitable for testing pure Roc code.

As the next step for roc-lang/unicode, I wanted to use that platform to provide additional assurance that we're handling the various edge cases in the Grapheme.split implementation properly. It's a pretty big state machine, and I have deliberately left a crash in it to help find any unhandled edge cases.
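To make the fuzzing idea concrete, here is a minimal Python sketch of the kind of property check a fuzzer could run. It is not roc-fuzz and not the real Grapheme.split: `split_graphemes` is a hypothetical stand-in that only keeps CR LF together, and `ALPHABET`, `random_input`, `check_invariants`, and `fuzz` are names invented for illustration. The point is the invariants: the splitter never crashes, never returns an empty cluster, and the clusters concatenate back to the input.

```python
import random


def split_graphemes(text: str) -> list[str]:
    """Hypothetical stand-in for the splitter under test.

    Emits one cluster per code point, except that CR LF is kept together
    (rule GB3 of the Unicode text segmentation rules). A real
    implementation handles many more rules.
    """
    clusters = []
    i = 0
    while i < len(text):
        if text.startswith("\r\n", i):
            clusters.append("\r\n")
            i += 2
        else:
            clusters.append(text[i])
            i += 1
    return clusters


# Assumed alphabet, biased toward code points that exercise segmentation
# rules: CR, LF, a combining mark, ZWJ, an emoji, Hangul jamo, and a
# regional indicator.
ALPHABET = ["a", "\r", "\n", "\u0301", "\u200d", "\U0001F600",
            "\u1100", "\u1161", "\U0001F1E6"]


def random_input(rng: random.Random, max_len: int = 8) -> str:
    """Build a short random string from ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(rng.randint(0, max_len)))


def check_invariants(text: str) -> None:
    """Properties that should hold for any input, whatever the rules say."""
    clusters = split_graphemes(text)  # must not raise
    assert "".join(clusters) == text, "clusters must concatenate back to the input"
    assert all(clusters), "no cluster may be empty"


def fuzz(iterations: int = 1000, seed: int = 0) -> int:
    """Run the invariant check on `iterations` random inputs."""
    rng = random.Random(seed)
    for _ in range(iterations):
        check_invariants(random_input(rng))
    return iterations
```

A fixed seed keeps any failure reproducible, which matters when the failing input then has to be traced by hand.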

We code-gen the test suite from the Unicode data file, and I went through the failures manually until all 1137 tests passed. Basically: take the first one that fails, use dbg print to follow the recursion through each byte and check that it's behaving correctly, fix anything that isn't in accordance with the text segmentation rules, then rinse and repeat.

But in the process of doing that, I found the coverage of the Unicode data file's test points is pretty average. For example, it might only have a test that covers an emoji at the start of a string, but not in the middle or at the end, or before a CRLF, or after a Hangul sequence... etc.
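One way to widen that positional coverage is to generate the combinations mechanically rather than rely on the data file. The Python sketch below is only an illustration: `CONTEXTS` and `positional_cases` are hypothetical names, and the chosen contexts (empty, ASCII, CRLF, a Hangul jamo pair) are assumptions based on the examples mentioned above. It places a target such as an emoji between every pair of contexts, so the target appears at the start, in the middle, and at the end, next to each kind of neighbour. The generated strings would then be fed to the splitter under test.

```python
from itertools import product

EMOJI = "\U0001F600"

# Assumed neighbouring contexts; extend with whatever sequences matter.
CONTEXTS = {
    "empty": "",
    "ascii": "ab",
    "crlf": "\r\n",
    "hangul": "\u1100\u1161",  # conjoining jamo: leading consonant + vowel
}


def positional_cases(target: str = EMOJI) -> dict[str, str]:
    """Return labelled inputs with `target` between every pair of contexts."""
    cases = {}
    for (before_name, before), (after_name, after) in product(CONTEXTS.items(), repeat=2):
        cases[f"{before_name}|target|{after_name}"] = before + target + after
    return cases
```

With four contexts this yields 16 labelled inputs per target, and the labels make it easy to see which position and neighbour triggered a failure.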

So I'm reasonably confident there are a couple of edge cases we haven't caught that could end up crashing someone's code. It would be nice to get to a point where we are reasonably confident that is not going to happen.

Last updated: Aug 17 2025 at 12:14 UTC