Stream: compiler development

Topic: zig compiler - fuzzing


view this post on Zulip Brendan Hansknecht (Feb 07 2025 at 06:50):

I wanted to spin up a discussion on fuzzing the new compiler specifically.

view this post on Zulip Brendan Hansknecht (Feb 07 2025 at 06:51):

It sounds like for a while, we will be using afl++ as our fuzzer and eventually we will get to swap to the zig integrated fuzzer (but probably not until at least 0.15.0).

view this post on Zulip Brendan Hansknecht (Feb 07 2025 at 06:52):

I'm trying to figure out what tooling will make fuzzing as seamless as possible. I know that it can be quite painful to manage corpora and enable anyone to fuzz.

view this post on Zulip Brendan Hansknecht (Feb 07 2025 at 06:53):

I think it is really useful to keep around at least a minimized corpus to help exploration, but we probably want to keep that out of the repo to avoid eating up tons of space with random garbage files. So I'm thinking that we may just need to cache the corpus for CI.

view this post on Zulip Brendan Hansknecht (Feb 07 2025 at 06:56):

Generally speaking, for each fuzzer, we need an input corpus (just basic starter examples). We can optionally add a dictionary (like a bunch of roc keywords and symbols or ast node names depending on the layer being fuzzed). It is probably good to take found crashes/regressions and add them to the input corpus such that they can be used as unit tests. I'm thinking that by default, CI would just run through the input corpus once to ensure there are no regressions.

Also, there is a chance that we can overlap the fuzzing input corpus with snapshot tests. Like we might be able to preprocess all the snapshot tests to turn them into the input corpus for the various fuzzers.

view this post on Zulip Brendan Hansknecht (Feb 07 2025 at 06:57):

This is all just open preliminary thinking, but overall, I feel that we will likely want some scripts to manage AFL such that fuzzing is easy for anyone to run.

view this post on Zulip Brendan Hansknecht (Feb 07 2025 at 06:58):

Currently fuzzing has a few different dependencies (like llvm). I think with the release of 0.14.0 there may be fewer required dependencies, but I'm not completely sure. So we probably will want to use nix to manage the dependencies.

view this post on Zulip Brendan Hansknecht (Feb 07 2025 at 06:58):

Almost certainly the first really useful fuzzer will be the parser to formatter loop.

view this post on Zulip Brendan Hansknecht (Feb 07 2025 at 07:00):

Anyway, I'm totally open to any and all ideas. I'm sure @Joshua Warner has some thoughts.

Oh also, when fuzzing, definitely should use the gpa with leak checks enabled.

view this post on Zulip Sam Mohr (Feb 07 2025 at 07:02):

I think fuzzing would want to at least guarantee we have none of these:

view this post on Zulip Brendan Hansknecht (Feb 07 2025 at 07:06):

Yeah, for many cases, we can only check those limited things. For some cases, like parse -> format -> parse -> format, we can check for equivalent formatted outputs both times.
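
A minimal sketch of the parse -> format -> parse -> format check described above, written in Python purely for illustration. The `parse` and `fmt` functions are toy stand-ins (assumptions, not the Roc parser or formatter); the point is only the shape of the property: formatting the already-formatted output must not change it.

```python
def parse(source: str) -> list[str]:
    """Toy stand-in for a parser: split the input into whitespace-separated words."""
    return source.split()


def fmt(tree: list[str]) -> str:
    """Toy stand-in for a formatter: join the words with single spaces."""
    return " ".join(tree)


def check_format_stable(source: str) -> None:
    """Run parse -> format -> parse -> format and compare the two formatted outputs."""
    once = fmt(parse(source))
    twice = fmt(parse(once))
    assert once == twice, f"formatter is not stable: {once!r} != {twice!r}"


# A fuzzer would call this with arbitrary input; here is one hand-written case.
check_format_stable("foo   =\n  bar\tbaz")
```

A fuzz harness built this way does not need to know the correct formatting; it only needs the second pass to agree with the first.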

view this post on Zulip Brendan Hansknecht (Feb 07 2025 at 07:07):

Also, if we add any verifiers to IR (like the old compiler has check mono ir), we can use that as well. I'm hoping we add a number of these verifiers.

view this post on Zulip Brendan Hansknecht (Feb 07 2025 at 07:36):

Oh, and if fuzzing is good enough at exploring (which it may not be), we could theoretically give it access to essentially a repl and have it fuzz the interpreter vs the llvm backend

view this post on Zulip Joshua Warner (Feb 07 2025 at 19:59):

I would also like to do things like run the compiled code, grab the output and also run the interpreter and assert the output is the same

view this post on Zulip Brendan Hansknecht (Feb 07 2025 at 20:48):

Yeah, that is what I was suggesting with my last comment. Probably would run the interpreter first. If the code is valid in the interpreter, run the backend (should always pass). Then run the code and assert equivalent output.
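
The ordering described here (interpreter first, then backend, then compare) could be sketched roughly as follows. This is a Python illustration under stated assumptions: `toy_interpret` and `toy_backend` are hypothetical stand-ins that evaluate sums like "1+2+3", not anything from the Roc compiler.

```python
from typing import Callable


def differential_check(
    program: str,
    interpret: Callable[[str], int],
    compile_and_run: Callable[[str], int],
) -> bool:
    """Return False if the interpreter rejects the program, otherwise assert
    that the compiled result matches the interpreted one and return True."""
    try:
        expected = interpret(program)
    except ValueError:
        # Invalid program: nothing to compare, so the fuzzer just moves on.
        return False
    # The program was valid for the interpreter, so the backend must accept it too.
    actual = compile_and_run(program)
    assert actual == expected, f"mismatch: {actual!r} != {expected!r}"
    return True


def toy_interpret(program: str) -> int:
    """Hypothetical interpreter: sums integers separated by '+'."""
    return sum(int(part) for part in program.split("+"))


def toy_backend(program: str) -> int:
    """Hypothetical compiled path: same semantics, separate implementation."""
    total = 0
    for part in program.split("+"):
        total += int(part)
    return total


differential_check("1+2+3", toy_interpret, toy_backend)
```

Running the interpreter first filters out invalid programs cheaply, so the expensive compile-and-run step only happens for inputs worth comparing.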

view this post on Zulip Loris Cro (Feb 07 2025 at 23:27):

When fuzzing my languages (a json replacement, a one-liner expression language, a html templating language), by making sure to turn off std.mem.eqlBytes_allowed (Zig 0.14.0-dev, named std.mem.backend_can_use_eql_bytes in 0.13.0), I got amazing performance from the fuzzer even without any corpus or dictionary.

Once my tokenizer(s) had been tested enough, I then worked on a simple valid-syntax generator in order to make sure the fuzzer would generate only valid HTML ASTs when flipping bytes, in order to reliably target the parser.

Here's what that looks like: https://github.com/kristoff-it/superhtml/blob/main/src/fuzz/astgen.zig

So the fuzzer writes ccuc and the executable turns that into:

<div>
   <div></div>
</div>
<div></div>

For a full blown programming language getting a valid source code generator up and running is a bit more involved, but Matthew Lugg has been working on one for Zig, you might take inspiration from his work once you want to start fuzzing deeper layers of the compiler.

view this post on Zulip Brendan Hansknecht (Feb 07 2025 at 23:38):

Awesome to know!

view this post on Zulip Brendan Hansknecht (Feb 14 2025 at 08:01):

Why does tokenization have StringBegin but not StringEnd? Seems strange to generate: [StringBegin, OpenCurly, Expr, CloseCurly, String]. Also, is there a reason we don't encode the $ in tokenization? Why is it OpenCurly and not DollarCurly or similar?

Context: trying to write a fuzzer for tokenization that tokenizes, prints in a bare bones form, then tokenizes again and asserts they are the same.

Intermediate looks something like:

zzz [zzzz!] { zzz: zzzzzzzz "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" }
zzzzzz zzz.Zzzzzz
zzzz! = |_zzzz|
    z = "~~~~~"
    Zzzzzz.zzzz!("~~~~~~~${z}"")
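
A rough Python sketch of the tokenize -> print bare bones -> tokenize again idea, assuming a toy tokenizer with only three hypothetical token kinds (Ident, Int, Op). It is not the Roc tokenizer; it only shows how replacing each token with a placeholder of the same kind and length gives text that should retokenize to the same sequence of kinds.

```python
import re

# Toy token grammar (an assumption for illustration only).
TOKEN_RE = re.compile(
    r"\s*(?:(?P<Ident>[A-Za-z_]\w*)|(?P<Int>\d+)|(?P<Op>[=+\-*/(){}\[\],:.]))"
)


def tokenize(source: str) -> list[tuple[str, str]]:
    """Return (kind, text) pairs, raising ValueError on unknown input."""
    source = source.rstrip()
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        kind = match.lastgroup
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens


def print_bare(tokens: list[tuple[str, str]]) -> str:
    """Print tokens in a bare-bones form: same kinds and lengths, dummy contents."""
    out = []
    for kind, text in tokens:
        if kind == "Ident":
            out.append("z" * len(text))
        elif kind == "Int":
            out.append("0" * len(text))
        else:
            out.append(text)
    return " ".join(out)


def check_round_trip(source: str) -> None:
    """Assert that the bare-bones print retokenizes to the same token kinds."""
    first = [kind for kind, _ in tokenize(source)]
    second = [kind for kind, _ in tokenize(print_bare(tokenize(source)))]
    assert first == second, f"{first} != {second}"


check_round_trip("foo = bar(1, 22)")
```

Any mismatch points at either the tokenizer or the printer, which is why bugs in the reprint step show up as fuzz failures too.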

view this post on Zulip Brendan Hansknecht (Feb 14 2025 at 08:01):

cc @Joshua Warner

view this post on Zulip Joshua Warner (Feb 14 2025 at 16:14):

@Brendan Hansknecht yes, that does need to be changed a bit before it's unambiguous

The _intent_ is to generate a sequence of StringBegin, <interpolation>, StringPart, <interpolation>, StringPart (for example)

The ambiguity right now is the OpenCurly/CloseCurly delimiters need to be specific to interpolations - otherwise that's ambiguous with having a string followed by a curly brace (for whatever reason).

In terms of not having StringEnd, that's not needed for disambiguation (at least, as long as I fix up the interpolation thing above). You can have StringBegin, <interpolation>, StringPart, StringBegin - and that's unambiguously a string with an interpolation, followed by a second string without.

view this post on Zulip Brendan Hansknecht (Feb 14 2025 at 17:26):

Joshua Warner said:

The _intent_ is to generate a sequence of StringBegin, <interpolation>, StringPart, <interpolation>, StringPart (for example)

I guess I found the first bug (though not via fuzzing, it was found while setting up fuzzing)

It currently generates StringBegin, <interpolation>, StringPart, <interpolation>, String

view this post on Zulip Brendan Hansknecht (Feb 14 2025 at 17:45):

have a fix on my branch

view this post on Zulip Brendan Hansknecht (Feb 14 2025 at 19:38):

Ok, first real fuzz bug of the new compiler:
id:000000,sig:06,src:000000,time:423,execs:751,op:quick,pos:224

thread 36762112 panic: index out of bounds: index 226, len 225
/Users/bren077s/Projects/roc/src/check/parse/tokenize.zig:406:58: 0x1023efff3 in decodeUnicode (repro-tokenize)
        const utf8_char = std.unicode.utf8Decode(self.buf[self.pos .. self.pos + len]) catch {
                                                         ^
/Users/bren077s/Projects/roc/src/check/parse/tokenize.zig:1050:59: 0x1023ef4ef in tokenize (repro-tokenize)
                    const info = self.cursor.decodeUnicode(b);
                                                          ^

Looks to be a buffer overflow due to assuming we have enough characters for a full unicode character
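
To make the failure mode concrete, here is a hedged Python sketch of the missing check: before slicing `len` bytes for a multi-byte character, confirm that many bytes actually remain. The function names are hypothetical and this is not the code in tokenize.zig. Python slices truncate silently instead of panicking, so the bounds check is written out explicitly.

```python
def utf8_sequence_length(lead: int) -> int:
    """Length of a UTF-8 sequence as implied by its lead byte."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    raise ValueError("invalid UTF-8 lead byte")


def decode_at(buf: bytes, pos: int) -> tuple[str, int]:
    """Decode one character at `pos`, returning (character, byte length)."""
    length = utf8_sequence_length(buf[pos])
    # The check the crash suggests was missing: the lead byte can promise
    # more bytes than the buffer still holds.
    if pos + length > len(buf):
        raise ValueError("truncated UTF-8 sequence at end of input")
    # bytes.decode raises UnicodeDecodeError (a ValueError) on malformed data.
    return buf[pos : pos + length].decode("utf-8"), length


print(decode_at("é".encode("utf-8"), 0))
```

With an input that ends in a lone lead byte such as `b"\xc3"`, this raises a ValueError instead of reading past the end of the buffer.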

view this post on Zulip Joshua Warner (Feb 14 2025 at 19:48):

Yep, I have some pending tokenizer updates that I'm waiting on Anthony's PR to land prior to posting

view this post on Zulip Joshua Warner (Feb 14 2025 at 19:48):

Good find on the unicode thing (that one's new to me)!

view this post on Zulip Brendan Hansknecht (Feb 14 2025 at 20:08):

Ah, and a second unicode bug: the call to decodeUnicode on line 754 will always fail. Its position is 1 past the current position of the byte buffer, because it is a . followed by unicode instead of just unicode.

view this post on Zulip Brendan Hansknecht (Feb 14 2025 at 20:17):

Also, PR for tokenizer fuzzer: https://github.com/roc-lang/roc/pull/7607

Fuzzer essentially instantly finds bugs, but that is fine for merging. The merge will just make sure the fuzzers keep compiling and don't bitrot. And we can use it to slowly burn down bugs.

view this post on Zulip Andrew Kelley (Feb 14 2025 at 20:57):

Brendan Hansknecht said:

self.buf[self.pos .. self.pos + len]

self.buf[self.pos..][0..len] slicing by length

view this post on Zulip Brendan Hansknecht (Feb 14 2025 at 22:04):

Ok, folded a few fixes into the PR

view this post on Zulip Brendan Hansknecht (Feb 14 2025 at 22:04):

Still crashes really quick (likely the mock module I am generating from the tokens does not quite match what the tokenizer expects)

view this post on Zulip Luke Boswell (Feb 14 2025 at 22:20):

Looks like you're having a lot of fun Brendan :grinning_face_with_smiling_eyes:

view this post on Zulip Brendan Hansknecht (Feb 14 2025 at 22:25):

I mean it is kinda fun seeing weird tokenizer edge cases that fail fuzzing. That said, my print and then retokenize definitely has some bugs (that or the tokenizer has some assumptions and lost state info, maybe both)

view this post on Zulip Joshua Warner (Feb 15 2025 at 04:46):

What command should I be running to kick off fuzzing?

view this post on Zulip Joshua Warner (Feb 15 2025 at 04:47):

joshw@Joshuas-MacBook-Air-3 ~/s/g/r/roc (parser-zig-rewrite)> zig build -Dllvm -Dfuzz
joshw@Joshuas-MacBook-Air-3 ~/s/g/r/roc (parser-zig-rewrite)> ./zig-out/bin/fuzz-cli
[-] FATAL: forkserver is already up, but an instrumented dlopen() library loaded afterwards. You must AFL_PRELOAD such libraries to be able to fuzz them or LD_PRELOAD to run outside of afl-fuzz.
To ignore this set AFL_IGNORE_PROBLEMS=1 but this will lead to ambiguous coverage data.
In addition, you can set AFL_IGNORE_PROBLEMS_COVERAGE=1 to ignore the additional coverage instead (use with caution!).
fish: Job 1, './zig-out/bin/fuzz-cli' terminated by signal SIGABRT (Abort)

view this post on Zulip Brendan Hansknecht (Feb 15 2025 at 04:49):

./zig-out/AFLplusplus/bin/afl-fuzz -i src/fuzz/tokenize-corpus/ -o /tmp/tokenize-out/ zig-out/bin/fuzz-tokenize

view this post on Zulip Brendan Hansknecht (Feb 15 2025 at 04:50):

also, you don't need -Dllvm, but you do need a system install version of llvm for afl++ to compile. Sadly, I was unable to get afl++ to compile with static llvm (might have to loop back to that at another point)

view this post on Zulip Brendan Hansknecht (Feb 15 2025 at 04:59):

Oh also, even without afl and all that hassle, you can zig build test -Dfuzz and it will build a repro-tokenize file. That file can take data from stdin, or a file arg and use it to reproduce directly. It also prints out a lot more info

view this post on Zulip Joshua Warner (Feb 15 2025 at 05:05):

Hmm, having a little trouble adapting this to Anthony's changes.

zig build test works, but not zig build fuzz-tokenize.

Does this make sense to you?

fuzz-tokenize
└─ install generated to repro-tokenize
   └─ zig build-exe repro-tokenize Debug native 1 errors
tokenize.zig:3:27: error: import of file outside module path: '../../collections/utils.zig'
const exitOnOom = @import("../../collections/utils.zig").exitOnOom;
                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
referenced by:
    tokenize: tokenize.zig:981:53
    zig_fuzz_test_inner: /Users/joshw/src/github.com/roc-lang/roc/src/fuzz/tokenize.zig:500:14
    remaining reference traces hidden; use '-freference-trace' to see all reference traces

view this post on Zulip Brendan Hansknecht (Feb 15 2025 at 05:06):

Ah, yeah. It's this todo: https://github.com/roc-lang/roc/blob/33a2c663e00c9309624978913a4f9ade3e66113f/build.zig#L111-L113

view this post on Zulip Brendan Hansknecht (Feb 15 2025 at 05:06):

Feel free to comment out the tokenizer test for now.

view this post on Zulip Brendan Hansknecht (Feb 15 2025 at 05:07):

I'll fix it up in a bit

view this post on Zulip Brendan Hansknecht (Feb 15 2025 at 05:09):

technically, moving all fuzz executables to src/fuzz-tokenize.zig and using relative imports would fix this. Otherwise, we need something like src/lib.zig and have all the fuzz executables go through that.

view this post on Zulip Brendan Hansknecht (Feb 15 2025 at 05:10):

The issue is that if you directly import src/check/parse/tokenize.zig as your root module file, it is only allowed to import things from src/check/parse/...

view this post on Zulip Joshua Warner (Feb 15 2025 at 05:11):

Interesting. Confusingly, that isn't the first such import in that file. But I guess it must be processing things out of order or something

view this post on Zulip Joshua Warner (Feb 15 2025 at 05:12):

And here I was looking forward to fuzzing the tokenizer :P

view this post on Zulip Brendan Hansknecht (Feb 15 2025 at 05:12):

Anyway, feel free to just disable, not exactly sure when I will next have time to look at it, but I'll fix it up if no one else does first. Just gonna be flying tomorrow, so not sure timing.

view this post on Zulip Joshua Warner (Feb 15 2025 at 05:12):

Which test should I be commenting out?

view this post on Zulip Brendan Hansknecht (Feb 15 2025 at 05:12):

https://github.com/roc-lang/roc/blob/33a2c663e00c9309624978913a4f9ade3e66113f/build.zig#L125-L138

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 03:14):

Fix the fuzz tests: https://github.com/roc-lang/roc/pull/7612

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 03:16):

Also adds a readme!

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 16:53):

Looks like we had our first merge conflict. Something simple, tokenize fuzzer got fixed and merged at the same time the tokenize function got updated to have malformed nodes. So now the fuzzer doesn't handle reprinting any of the malformed nodes.

I don't currently have time to work on this. cc @Joshua Warner in case he has time. He probably can add printing for the malformed nodes quicker than anyone else because he knows what they all are.

Otherwise, anyone can fix it by just handling a few extra cases in a switch statement. That or, for now, giving them empty handling just to unblock merging a PR.

view this post on Zulip Joshua Warner (Feb 16 2025 at 17:44):

https://github.com/roc-lang/roc/pull/7617

view this post on Zulip Joshua Warner (Feb 16 2025 at 17:48):

Right now this just bails out. The slightly better approach would be to copy the corresponding range in the input, and even better would be making sure we have enough information to trigger the same issue.

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 17:57):

Yep. Sounds totally good

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 21:10):

Also, @Joshua Warner, not sure it is the best use of time, but I think the fuzzer is now at the state where it finds reasonable tokenizer bugs. Some may be in the reprint, but I think a lot are due to minor mistakes around exact token type such that when reprinted in basic form it leads to bugs.

view this post on Zulip Joshua Warner (Feb 16 2025 at 21:48):

Cool, will take a look in a bit

view this post on Zulip Isaac Van Doren (Feb 28 2025 at 18:17):

Great talk about Tigerbeetle's fuzzing setup. Maybe there are some aspects we could adopt for Roc. https://www.hytradboi.com/2025/c222d11a-6f4d-4211-a243-f5b7fafc8d79-rocket-science-of-simulation-testing

view this post on Zulip Brendan Hansknecht (Feb 28 2025 at 19:16):

Yep

view this post on Zulip Brendan Hansknecht (Feb 28 2025 at 19:17):

Sounds like they are mostly solving the continuous fuzzing case.

view this post on Zulip Brendan Hansknecht (Feb 28 2025 at 19:17):

Which may be less important for roc, but still useful to glean from

view this post on Zulip Andrew Kelley (Feb 28 2025 at 23:15):

that was a great talk, it finally clicked for me the "level triggered" vs "edge triggered" thing. based on your comment Brendan I think that point might not have sunk in for you yet

view this post on Zulip Brendan Hansknecht (Feb 28 2025 at 23:48):

I only quickly skimmed it cause today has been busy, need to give it a proper watch still.

view this post on Zulip Brendan Hansknecht (Mar 01 2025 at 00:51):

I think my initial statement still mostly stands. It isn't that an async level-triggered setup wouldn't be great. It is more that roc is still super early on with limited resources, and fuzzing is only run on local machines. Long term, I would love to have a similar setup, but currently fuzzing is local only for roc and not tied to CI at all. That said, maybe grabbing a single machine and setting up what TigerBeetle did wouldn't be as much work as I expect. Just feels like a larger investment in infra than roc is ready for. Especially given that you would want to notify folks asynchronously on fuzzing failures and don't want too much of a backlog to build up.

But I definitely might be making a mountain out of a molehill; they made it look relatively simple to orchestrate all of this.

view this post on Zulip Luke Boswell (Mar 01 2025 at 01:37):

Just feels like a larger investment in infra than roc is ready for

I have a linux "server" sitting at home that I'm happy to leave running a fuzzer full-time

view this post on Zulip Luke Boswell (Mar 01 2025 at 01:38):

Not sure if that is the kind of infra you're referring to, but happy to offer that if it helps

view this post on Zulip Brendan Hansknecht (Mar 01 2025 at 01:40):

Yeah, that is part of it

view this post on Zulip Brendan Hansknecht (Mar 01 2025 at 01:40):

Then it is just extra ci flows, website or notifications for failures, and orchestration code.

view this post on Zulip Isaac Van Doren (Mar 01 2025 at 03:15):

and don't want too much of a backlog to build up.

One of the suggestions of the talk was to not worry about keeping old failing seeds around indefinitely but only keep the N most recent seeds and rely on the fact that unresolved issues will be found again.
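
The "keep only the N most recent failing seeds" policy can be sketched with a bounded queue. This is a Python illustration with a hypothetical `RecentFailures` class, not something taken from the talk or from the Roc repository.

```python
from collections import deque


class RecentFailures:
    """Remember only the most recent `limit` failing inputs."""

    def __init__(self, limit: int) -> None:
        # A deque with maxlen drops the oldest entry automatically when full.
        self._seeds: deque[bytes] = deque(maxlen=limit)

    def record(self, seed: bytes) -> None:
        """Record a failing input, moving it to the newest slot if already known."""
        if seed in self._seeds:
            self._seeds.remove(seed)
        self._seeds.append(seed)

    def seeds(self) -> list[bytes]:
        """Return the retained inputs, oldest first."""
        return list(self._seeds)


failures = RecentFailures(limit=2)
for seed in (b"a", b"b", b"c"):
    failures.record(seed)
print(failures.seeds())  # [b'b', b'c']
```

Dropping old entries is safe under the assumption stated in the talk summary above: an unresolved bug will be found again by later runs.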

view this post on Zulip Brendan Hansknecht (Mar 01 2025 at 03:29):

Yeah, sounds like most of the work is a minor database and a web frontend for that (along with a CI machine to run things)

view this post on Zulip Brendan Hansknecht (Mar 01 2025 at 03:30):

So not too bad

view this post on Zulip Brendan Hansknecht (Mar 01 2025 at 03:31):

Also, we use a corpus based method which has implications for multi machine setups, but I think roc can just do a single machine setup which would avoid much of that hassle.

view this post on Zulip Brendan Hansknecht (Mar 01 2025 at 03:37):

Given we do coverage-guided fuzzing, I don't think we can have a single integer seed. I think our inputs will remain a blob of text. So not as portable. I guess we could minimize and base64 encode them to at least make them trivial to copy.

view this post on Zulip Brendan Hansknecht (Mar 01 2025 at 03:57):

Yeah, maybe this is less work than I initially thought, maybe I'll try to hack something crazy simple together.

view this post on Zulip Brendan Hansknecht (Mar 02 2025 at 02:38):

Has anyone managed to run our fuzzer on linux? I am trying to but keep hitting linking issues. Wondering if it may be distro/config specific.

view this post on Zulip Luke Boswell (Mar 02 2025 at 05:57):

I haven't tried tbh

view this post on Zulip Luke Boswell (Mar 02 2025 at 05:57):

Actually I did early on, but couldn't get it working either

view this post on Zulip Brendan Hansknecht (Mar 03 2025 at 03:38):

Small PR that enabled me to get fuzzing working on linux (though sadly with system install AFL instead of zig compiled AFL): https://github.com/roc-lang/roc/pull/7651

view this post on Zulip Brendan Hansknecht (Mar 05 2025 at 07:48):

Please ignore the absolute ugliness of this site: https://roc-lang.github.io/roc-compiler-fuzz/

view this post on Zulip Luke Boswell (Mar 05 2025 at 08:14):

This will be really nice, looking forward to building fuzzers and getting some high scores :grinning_face_with_smiling_eyes:

view this post on Zulip Isaac Van Doren (Mar 05 2025 at 13:51):

Wow awesome!

view this post on Zulip Richard Feldman (Mar 05 2025 at 14:12):

yooooo this is sweet!!!

view this post on Zulip Brendan Hansknecht (Mar 05 2025 at 22:01):

Also, that site is happily taking contributions if anyone wants to make it pretty.

view this post on Zulip Brendan Hansknecht (Mar 05 2025 at 22:05):

Aside, I really love minimized repros:

zig build repro-tokenize -- -b YW5k -v

This is just passing "and" to our tokenizer, which breaks because we assume that "and" is "&&".
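
For anyone wondering how the `-b` payload relates to the input, here is a small Python sketch, assuming the flag takes standard base64 (which matches the example above, since YW5k decodes to the three bytes of "and"). The helper names are hypothetical.

```python
import base64


def encode_repro(data: bytes) -> str:
    """Turn a minimized crashing input into a copy-pasteable string."""
    return base64.b64encode(data).decode("ascii")


def decode_repro(text: str) -> bytes:
    """Recover the raw input bytes from the copy-pasteable string."""
    return base64.b64decode(text, validate=True)


print(decode_repro("YW5k"))  # b'and'
print(encode_repro(b"and"))  # YW5k
```

Base64 keeps inputs with newlines, quotes, or invalid UTF-8 intact when they are pasted into a shell command or a chat message.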

CC: @Joshua Warner real tokenizer bugs likely will pop up at that site (though some are with the fuzz harness as well).

Now we just need more fuzzers to start exploring more of the code.

view this post on Zulip Luke Boswell (Mar 06 2025 at 03:55):

@Brendan Hansknecht -- how does the src/fuzz-corpus/parse/grab_bag.roc work?

view this post on Zulip Luke Boswell (Mar 06 2025 at 03:56):

Is that something that the fuzzer will use to start with... and then randomly modify?

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 03:56):

Yeah, we are required to give the fuzzer at least one seed

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 03:56):

That test case seemed reasonable to me

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 03:57):

Also, currently parsing helloworld hangs and I didn't want to figure out fixing it. So I just used something different

view this post on Zulip Luke Boswell (Mar 06 2025 at 03:57):

Nice. Does it help being large with everything in there together? I wonder if it would be better to have multiple files with simpler syntax?

view this post on Zulip Luke Boswell (Mar 06 2025 at 03:58):

I was also wanting something similar for the snapshot tests... so I'm wondering if these things could/should be combined somehow.

But they are also quite different use cases so probably should be kept different.

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 03:59):

The fuzzer does a really good job at exploring, especially because I enabled cmplog in CI. It leads to llvm telling the fuzzer all values that are used in comparisons (strings, ints, etc). So the starting corpus isn't too important.

In CI, the fuzzer will cache the corpus and keep growing it.

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 04:01):

Yeah, fuzzers theoretically could share with snapshot tests (probably a good idea to ensure valid inputs). Would just need a tool to generate a corpus from the snapshots. For example, the tokenizer and parser fuzzers both use .roc files as input. So a script to extract all the original source from the snapshot test cases would create an amazing starting corpus for those fuzzers.

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 04:02):

If later fuzzers start with sexpr IR, sharing would also be nice, but likely more complex.

view this post on Zulip Luke Boswell (Mar 06 2025 at 04:02):

Nice. It looks like we've already got a fairly solid testing framework emerging...

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 04:02):

(depending if the fuzzers parse the sexpr or programmatically generate it)

view this post on Zulip Luke Boswell (Mar 06 2025 at 04:03):

I could definitely have the snapshot tool extract the roc source from each snapshot and dump it into a fuzzer corpus folder somewhere

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 04:05):

for simulation testing

I'm not sure our fuzzing is simulation testing cause nothing is simulated, but none the less is great automatic bug finding.

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 04:06):

I could definitely have the snapshot tool extract the roc source from each snapshot and dump it into a fuzzer corpus folder somewhere

That would be great. If you just add a flag to the snapshot tool to dump all the .roc files into a folder, I'll integrate that into the fuzz CI.

view this post on Zulip Luke Boswell (Mar 06 2025 at 04:16):

I'll do that in my next PR

view this post on Zulip Luke Boswell (Mar 06 2025 at 04:17):

nothing is simulated

We're simulating the source file that is being parsed...

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 04:21):

haha, I guess. I've always associated simulation testing with fake networks and databases and disks and what not.

view this post on Zulip Luke Boswell (Mar 06 2025 at 04:27):

Hopefully we can move up the stack someday, and simulate more interesting things... like here's a (generated) expression that should evaluate to some expected value.

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 04:34):

Will be interesting to see how far we can take it and what still explores well. At a minimum, we should be able to do repl style expressions.

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 04:35):

but not sure how well the fuzzer will explore that (and also have to be careful about infinite loops)

view this post on Zulip Luke Boswell (Mar 06 2025 at 05:21):

Do you think it's ok for the snapshots to be copied in all at the same level... or would you want them to maintain whatever folder structure they had from the snapshots/ folder?

view this post on Zulip Luke Boswell (Mar 06 2025 at 05:22):

When copying them into the corpus folder given as an argument to the snapshot tool

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 05:24):

Same level is preferred

view this post on Zulip Luke Boswell (Mar 06 2025 at 05:25):

Should I give them a pseudo-random name... or basically keep whatever they came with

view this post on Zulip Joshua Warner (Mar 06 2025 at 05:26):

We should probably keep the name; otherwise as we modify the snapshots we won't know which one they match up to (and thus which one to edit).

view this post on Zulip Joshua Warner (Mar 06 2025 at 05:27):

If we used something like the hash of the content, then it'd be hard to know which fuzz corpus we should _remove_ because it's the old version of that input. Otherwise old inputs would keep piling up as fuzz corpus entries, and that's probably not super valuable.

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 05:29):

I don't want to check these into the repo

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 05:29):

Let's keep just the snapshots as the source of truth

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 05:29):

For fuzzer, hash or random name is fine

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 05:30):

That's at least my default opinion for this

view this post on Zulip Joshua Warner (Mar 06 2025 at 05:31):

Ahhh got it

view this post on Zulip Joshua Warner (Mar 06 2025 at 05:32):

In that case I'd do hash, but whatever is fine.
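
A sketch of the flat, hash-named corpus dump discussed above, in Python for illustration only. It assumes the Roc source of each snapshot has already been extracted into a string; the snapshot file format itself is not modeled, and `dump_corpus` is a hypothetical name, not the snapshot tool's API.

```python
import hashlib
from pathlib import Path


def dump_corpus(sources: list[str], corpus_dir: Path) -> list[Path]:
    """Write each source into `corpus_dir` at a single level, named by content hash."""
    corpus_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sources:
        data = source.encode("utf-8")
        # Naming by content hash means identical inputs collapse into one file.
        name = hashlib.sha256(data).hexdigest()[:16]
        path = corpus_dir / name
        path.write_bytes(data)
        written.append(path)
    return written
```

Because the snapshots stay the source of truth, the corpus folder can be deleted and regenerated at any time, which sidesteps the stale-entry concern raised earlier.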

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 18:18):

Hardcore test case: zig build repro-parse -- -b ''

Break the parser with an empty string (we just have an incorrect assert that assumes parsing will generate something).

view this post on Zulip Brendan Hansknecht (Mar 06 2025 at 19:04):

As a general note, we have the ability to trigger the fuzzer targeting any branch/commit. So if you ever want a PR to get fuzzed, we can do that.

view this post on Zulip Luke Boswell (Mar 06 2025 at 22:45):

Looking at the snapshot tool, with the flag to copy source into our fuzz corpus.
One issue with giving the files a .roc extension is that now git wants to commit them to our repo...

Untracked files:
  (use "git add <file>..." to include in what will be committed)
    src/fuzz-corpus/cupgstzv.roc
    src/fuzz-corpus/hdddlnro.roc
    src/fuzz-corpus/kbdtxaol.roc
    src/fuzz-corpus/tmtpdcnw.roc
    src/fuzz-corpus/ufzxxvfu.roc

Do you think we should:

  1. leave these files without an extension
  2. give them some made up ext like .snap or something
  3. gitignore all files in /src/fuzz-corpus/, and if we want to add something there we need to git add it manually

view this post on Zulip Luke Boswell (Mar 06 2025 at 22:46):

I'm leaning towards (3) -- so I'll run with that for now in my draft PR

view this post on Zulip Brendan Hansknecht (Mar 07 2025 at 00:03):

Yeah, 3 sounds good

view this post on Zulip Luke Boswell (Mar 07 2025 at 03:19):

So it runs all the different fuzzers every 4 hours, but only if there's a new commit?

view this post on Zulip Luke Boswell (Mar 07 2025 at 03:21):

I've just noticed the scheduled runs have skipped the actual fuzz run.

view this post on Zulip Brendan Hansknecht (Mar 07 2025 at 03:23):

Yeah, I messed something up. Should be fixed already

view this post on Zulip Brendan Hansknecht (Mar 07 2025 at 03:24):

And yeah, runs every 4 hours for 30 minutes. Seemed like a reasonable cadence for now. Will keep fuzzing the same commit and expanding the corpus if no new commits exist.

view this post on Zulip Luke Boswell (Mar 07 2025 at 04:23):

Brendan Hansknecht said:

Hardcore test case: zig build repro-parse -- -b ''

Break the parser with an empty string (we just have an incorrect assert that assumes parsing will generate something).

Does it count as a real bugfix if I resolve this one? :sweat_smile:

view this post on Zulip Sam Mohr (Mar 07 2025 at 04:31):

As we say in the smash bros community, "we take those"

view this post on Zulip Luke Boswell (Mar 07 2025 at 05:41):

@Brendan Hansknecht -- how did you install LLVM on your mac?

view this post on Zulip Luke Boswell (Mar 07 2025 at 05:41):

Did you use brew?

view this post on Zulip Luke Boswell (Mar 07 2025 at 05:45):

I'm going to try the release from LLVM's github instead and see if I can get fuzzing working on my macos

view this post on Zulip Brendan Hansknecht (Mar 07 2025 at 05:49):

I just used brew

view this post on Zulip Brendan Hansknecht (Mar 07 2025 at 05:49):

Also, you don't need llvm and afl to run the repro

view this post on Zulip Brendan Hansknecht (Mar 07 2025 at 05:49):

That repro should work for everyone with only zig as a dep

view this post on Zulip Luke Boswell (Mar 07 2025 at 05:56):

Yeah repro is fine... I was wanting to run the fuzzer

view this post on Zulip Luke Boswell (Mar 07 2025 at 05:57):

I couldn't get the brew version working.

view this post on Zulip Luke Boswell (Mar 07 2025 at 05:57):

I've tried with a downloaded LLVM... and providing the path but having trouble with that too

view this post on Zulip Luke Boswell (Mar 07 2025 at 06:19):

Hey got it working using brew... :tada:

view this post on Zulip Luke Boswell (Mar 07 2025 at 06:19):

Needed to add export PATH=$PATH:/opt/homebrew/opt/llvm@18/bin to my zshrc so it could find brew's llvm-config

view this post on Zulip Luke Boswell (Mar 07 2025 at 06:23):

Got it running with

$ ./zig-out/AFLplusplus/bin/afl-fuzz -i src/fuzz-corpus/parse -o /tmp/fuzz-parse-out zig-out/bin/fuzz-parse

view this post on Zulip Luke Boswell (Mar 07 2025 at 06:24):

Screenshot 2025-03-07 at 17.23.53.png

view this post on Zulip Luke Boswell (Mar 08 2025 at 22:58):

My new commands to use the fuzzer with zig 0.14.0, thank you @Brendan Hansknecht for helping me workaround my issues to get back online fuzzing again

# for some reason homebrew llvm isn't working, but system afl is...
brew install afl++

zig build -Dfuzz -Dsystem-afl
zig build snapshot -- --fuzz-corpus ./src/fuzz-corpus/
afl-fuzz -i ./src/fuzz-corpus/ -o /tmp/corpus zig-out/bin/fuzz-parse

view this post on Zulip Brendan Hansknecht (Mar 08 2025 at 23:09):

My best guess is that there is a bug with this build script: https://github.com/allyourcodebase/AFLplusplus

As such, system afl is required instead of using zig built afl. Might just be some form of version overlap issue and not an issue with the actual build script. Need to do more testing at some point.

view this post on Zulip Brendan Hansknecht (Mar 08 2025 at 23:13):

Oh, it might just be that we need to wait for aflplusplus to update to zig-0.14_afl_4.31c

view this post on Zulip Brendan Hansknecht (Mar 08 2025 at 23:13):

maybe afl 21c doesn't work with newer zig

view this post on Zulip Luke Boswell (Mar 09 2025 at 23:17):

I've got it down to a couple of seconds before we get a crash now... :tada: (on the parser)

view this post on Zulip Luke Boswell (Mar 11 2025 at 10:07):

New record for fuzz-parser - 2min, 23 sec before my first crash

view this post on Zulip Anthony Bullard (Mar 11 2025 at 10:25):

Let me get these headers done and hopefully that goes up

view this post on Zulip Anthony Bullard (Mar 11 2025 at 10:26):

Have you been posting all of the crashing input?

view this post on Zulip Luke Boswell (Mar 11 2025 at 10:48):

I've been making a new snapshot for each failure, which I think will help protect against regressions and seed future fuzzing efforts.

view this post on Zulip Luke Boswell (Mar 11 2025 at 10:48):

You can see all the ones I have so far in my PR https://github.com/roc-lang/roc/pull/7672

view this post on Zulip Brendan Hansknecht (May 11 2025 at 17:39):

One fuzzing hang that comes up semi often is having a ton of { brackets. When Roc reformats that, it adds essentially infinite spaces due to indentation. This fails fuzzing due to the fuzzer thinking it is a hang.

For example, one fuzz failure I saw recently had ~7k { brackets. When formatted, that led to 92,910,578 spaces being printed. It is unsurprising that is too slow.

This leads me to a few questions:

  1. Should we have a limit to how nested of an expression can parse? Is allowing 7k+ levels of nesting ok?
  2. Is the correct solution to tons of { brackets to just nest infinitely deep and have ridiculously long lines?
  3. How can I avoid this hang in the fuzzer (hangs in general mean slower execution and worse fuzzing)?

I get that metric tons of { is contrived, but I think it is still best practice to consider and handle so we can maintain robust fuzzing.
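A rough back-of-the-envelope sketch of why the indentation explodes: if every `{` opens a new line one level deeper, total indentation grows quadratically with depth. The 4-characters-per-level default and the one-line-per-level layout are assumptions for illustration, not the formatter's actual behaviour, so the numbers will not match the reported count exactly.

```python
def total_indent_chars(depth: int, indent_width: int = 4) -> int:
    """Total indentation characters when line k (1..depth) is indented k levels.

    Sum of indent_width * k for k in 1..depth, which is quadratic in depth.
    """
    return indent_width * depth * (depth + 1) // 2


if __name__ == "__main__":
    for depth in (100, 1_000, 7_000):
        spaces = total_indent_chars(depth)          # assumed 4 spaces per level
        single = total_indent_chars(depth, 1)       # one character per level
        print(f"depth={depth}: {spaces} chars at width 4, {single} at width 1")
```

Under these assumptions, 7,000 levels gives roughly 98 million indentation characters, the same order of magnitude as the count reported above.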

view this post on Zulip Anthony Bullard (May 11 2025 at 18:02):

i think more than 100 levels of indentation is overkill

view this post on Zulip Anthony Bullard (May 11 2025 at 18:02):

let alone 7k+

view this post on Zulip Anthony Bullard (May 11 2025 at 18:04):

i would just move indenting to a function that panics if it exceeds some limit

view this post on Zulip Anthony Bullard (May 11 2025 at 18:04):

or to be more graceful, returns a parse error and a malformed node

view this post on Zulip Brendan Hansknecht (May 11 2025 at 18:24):

Yeah, that is my thought, maybe after a certain level of nesting, we should just bail and return a parse error

view this post on Zulip Anthony Bullard (May 11 2025 at 19:13):

i could add that to my current PR unless it was already merged

view this post on Zulip Richard Feldman (May 12 2025 at 14:09):

I would actually rather we handled this with a fuzzer change

view this post on Zulip Richard Feldman (May 12 2025 at 14:09):

pathological parsing cases like this can come up irl in generated code

view this post on Zulip Richard Feldman (May 12 2025 at 14:09):

where almost nobody notices, but then one person is totally blocked by it and has to try to make some complex workaround

view this post on Zulip Anthony Bullard (May 12 2025 at 14:11):

what kind of change would you like to see? some sort of filter on the fuzz inputs?

view this post on Zulip Richard Feldman (May 12 2025 at 14:11):

so if the payoff is something like "everyone gets faster builds" (e.g. u16 line counts instead of u32) I'm ok with that, but if the only problem is the fuzzer itself, I'd rather address this by changing the fuzzer than by changing the parser

view this post on Zulip Richard Feldman (May 12 2025 at 14:11):

yeah something like that - I'm not sure what options would be best there!

view this post on Zulip Brendan Hansknecht (May 12 2025 at 14:13):

Fair enough. Though if we format exceptionally nested code to be deeply indented, it won't be readable either.

view this post on Zulip Anthony Bullard (May 12 2025 at 14:14):

a good thing to note is that if it's taking this level of complexity to crash the fuzzer we must be doing something pretty good

view this post on Zulip Brendan Hansknecht (May 12 2025 at 14:14):

We have some basic crashes too, but I think we are overall doing well.

view this post on Zulip Anthony Bullard (May 12 2025 at 14:15):

feel free to send basic crashes to me if they seem legit

view this post on Zulip Brendan Hansknecht (May 12 2025 at 14:22):

Brendan Hansknecht said:

Fair enough. Though if we format exceptionally nested code to be deeply indented, it won't be readable either.

To be more concrete here: formatting this code to indent just slows down parsing. So if this were coming from generated code that no one is expected to read, we would just be making the experience worse by making the file way larger and way slower to parse.

That said, I do agree that given this is a contrived example, limiting the fuzzer is reasonable too. In the fuzzer, I could pre scan for nesting depth and limit.
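A minimal sketch of the pre-scan idea, written in Python for brevity rather than in the harness's own language. The function names and the limit of 100 (borrowed from the number floated earlier in the thread) are hypothetical, and the scan deliberately ignores strings and comments, so it only approximates real nesting depth.

```python
OPENERS = b"{[("
CLOSERS = b"}])"


def max_nesting_depth(data: bytes) -> int:
    """Return the deepest bracket nesting seen in a raw fuzz input.

    Brackets are not matched by kind; unbalanced closers are ignored.
    """
    depth = 0
    deepest = 0
    for byte in data:
        if byte in OPENERS:
            depth += 1
            deepest = max(deepest, depth)
        elif byte in CLOSERS and depth > 0:
            depth -= 1
    return deepest


def should_skip(data: bytes, limit: int = 100) -> bool:
    """Hypothetical filter: reject inputs nested deeper than `limit`."""
    return max_nesting_depth(data) > limit
```

A harness using this kind of filter would return early for skipped inputs, so the fuzzer sees them as uninteresting rather than as hangs.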

view this post on Zulip Brendan Hansknecht (May 12 2025 at 14:24):

Aside, we decided on tabs as the canonical form, right? So the formatter should be changed to use tabs for indentation instead of spaces?

view this post on Zulip Anthony Bullard (May 12 2025 at 14:31):

i didn't know that we made that decision but that should be easier

view this post on Zulip Anthony Bullard (May 12 2025 at 14:31):

that cuts the number of indentation characters per line by a factor of 4 on average

view this post on Zulip Anthony Bullard (May 12 2025 at 14:32):

so in that worst case example that's 7k tabs instead of 28k spaces

view this post on Zulip Brendan Hansknecht (May 12 2025 at 14:32):

That might actually be longer than max fuzzer input length.

view this post on Zulip Brendan Hansknecht (May 12 2025 at 14:32):

So changing to tabs might actually remove the hangs

view this post on Zulip Anthony Bullard (May 12 2025 at 14:32):

sweet

view this post on Zulip Brendan Hansknecht (May 12 2025 at 14:32):

I think max is 8 or 16k

view this post on Zulip Anthony Bullard (May 12 2025 at 14:33):

i can make that change tomorrow

view this post on Zulip Anthony Bullard (May 12 2025 at 14:33):

should be just a few loc change

view this post on Zulip Anthony Bullard (May 12 2025 at 18:10):

Looks like this is a wee bit harder than I expected. Zig multiline string literals don't allow literal tabs - they recommend a (IMHO) pretty insane system of postprocessing strings at comptime to replace some other sigil in the string with a tab

view this post on Zulip Anthony Bullard (May 12 2025 at 18:11):

Also, there might be a tokenization error with tabs currently, but I'm not sure. (Edit: there is not. Just an issue with Zig.)

view this post on Zulip Anthony Bullard (May 12 2025 at 18:31):

I'm pretty much going to need to move all parser tests that contain indentation out to snapshot files and make running snapshots more ergonomic during my development workflow

view this post on Zulip Brendan Hansknecht (May 12 2025 at 19:38):

:thinkies: yikes

view this post on Zulip Fábio Beirão (May 13 2025 at 12:45):

Anthony Bullard said:

i think more than 100 levels of indentation is overkill

To me this makes me think of the way that Elm's compiler refuses to have tuples with more than 3 elements, and gracefully explains the rationale to the user.
It could probably be a selling point if Roc didn't allow for .. 10? 20? levels of nesting. I think nesting can always be managed with some refactoring. If the language itself would encourage this behavior from the early days, I think the global quality of Roc code would just increase.
I can also see some "management selling points" when you can say that a language has built-in mandatory opinions about code complexity :sweat_smile:

view this post on Zulip Anthony Bullard (May 15 2025 at 14:29):

@Brendan Hansknecht Here's my PR on this https://github.com/roc-lang/roc/pull/7786

view this post on Zulip Anthony Bullard (May 16 2025 at 00:44):

Addressed all review feedback

view this post on Zulip Brendan Hansknecht (May 17 2025 at 18:21):

@Anthony Bullard just to show you the most commonly found fuzz failure. It is a variant of:
zig build repro-parse -- -b MApmb3I= -v

Which in this case is:

0
for

Most of the failures hit this panic:

thread 26416288 panic: Should have gotten a valid pattern, pos=3 peek=EndOfFile

/Users/bren077s/Projects/roc/src/check/parse/Parser.zig:1266:24: 0x1050855c7 in parsePattern (repro-parse)
        std.debug.panic("Should have gotten a valid pattern, pos={d} peek={s}\n", .{ self.pos, @tagName(self.peek()) });

Probably an easy fix around EOF handling
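The `-b` argument in the repro command appears to be the fuzz input encoded as base64; decoding it yields exactly the two-line input shown above. A small sketch for inspecting such arguments (the helper name is made up for illustration):

```python
import base64


def decode_repro_input(b64: str) -> bytes:
    """Decode the base64 payload passed to a `repro-*` step via `-b`."""
    return base64.b64decode(b64, validate=True)


if __name__ == "__main__":
    print(decode_repro_input("MApmb3I="))  # b'0\nfor'
```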

view this post on Zulip Anthony Bullard (May 17 2025 at 18:26):

cool i can find a fix for this rep quick

view this post on Zulip Anthony Bullard (May 17 2025 at 18:26):

are we parsing this as a statement? expr? module?

view this post on Zulip Anthony Bullard (May 17 2025 at 18:27):

NVM i can just read the code :wink:

view this post on Zulip Anthony Bullard (May 17 2025 at 18:47):

https://github.com/roc-lang/roc/pull/7792 @Brendan Hansknecht

view this post on Zulip Luke Boswell (Jun 28 2025 at 06:21):

I realised I could wire up our coordinate into the fuzzer really easily.

So I've been fuzzing the whole compiler pipeline (at least everything we have so far up to type checking)... and it's been really great so far.

view this post on Zulip Luke Boswell (Jun 28 2025 at 07:47):

In case this helps anyone... here is how I'm running the fuzzer for the roc check zig compiler pipeline.

brew install afl++

rm -rf /tmp/corpus/default/crashes

zig build -Dfuzz -Dsystem-afl

afl-fuzz -i ./src/fuzz-corpus/ -o /tmp/corpus zig-out/bin/fuzz-canonicalize

view this post on Zulip Luke Boswell (Jun 28 2025 at 07:48):

And this is what it looks like in my terminal...
Screenshot 2025-06-28 at 17.48.10.png

view this post on Zulip Richard Feldman (Jun 28 2025 at 11:57):

where do the crashes end up?

view this post on Zulip Brendan Hansknecht (Jun 28 2025 at 14:47):

roc-lang.github.io/roc-compiler-fuzz

view this post on Zulip Richard Feldman (Jun 28 2025 at 14:54):

what I mean is like if I run it locally and it reports a number of crashes, how do I reproduce an individual crash so I can try to fix it?

view this post on Zulip Brendan Hansknecht (Jun 28 2025 at 15:09):

I think you just have to run it passing in a file as the first arg. zig run fuzz-canonicalize -- /tmp/corpus/default/crashes/... I think
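Another option, sketched here under assumptions, is to turn each crash file into one of the `zig build repro-<stage> -- -b <base64> -v` commands used elsewhere in this thread. That `-b` takes the base64 of the raw input is inferred from the examples in the thread; the helper names are hypothetical, and the crash directory is whatever `-o` directory was given to `afl-fuzz`.

```python
import base64
from pathlib import Path


def repro_command(crash_bytes: bytes, stage: str) -> str:
    """Build a repro command line for one crashing input.

    `stage` is e.g. "parse", "tokenize" or "canonicalize".
    """
    encoded = base64.b64encode(crash_bytes).decode("ascii")
    return f"zig build repro-{stage} -- -b {encoded} -v"


def repro_commands_for_dir(crash_dir: str, stage: str) -> list[str]:
    """One command per crash file; AFL's README.txt in that directory is skipped."""
    return [
        repro_command(path.read_bytes(), stage)
        for path in sorted(Path(crash_dir).iterdir())
        if path.is_file() and path.name != "README.txt"
    ]


if __name__ == "__main__":
    print(repro_command(b"0\nfor", "parse"))
```

The resulting commands only need zig, so they can be pasted into a chat or an issue for others to run.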

view this post on Zulip Brendan Hansknecht (Jun 29 2025 at 17:40):

Fuzzing can make such curious crashes at times:

0
pr000000e:{e:0}pr000000e={p:0r}

This leads to an infinite (or near-infinite) loop in check.check_types.unify.Unifier.gatherRecordFields. I'm quite surprised this even makes it past parsing and to canonicalization.

view this post on Zulip Brendan Hansknecht (Jun 29 2025 at 17:41):

zig build repro-canonicalize -- -b MApwcjAwMDAwMGU6e2U6MH1wcjAwMDAwMGU9e3A6MHJ9 -v

view this post on Zulip Anthony Bullard (Jun 29 2025 at 18:10):

would like to see the snapshot for that failure

view this post on Zulip Anthony Bullard (Jun 29 2025 at 18:11):

this reminds me i would like to have a META option to limit the stages run on a snapshot

view this post on Zulip Richard Feldman (Jun 29 2025 at 18:13):

that one looks like:

rec : { e : 0 }
rec = { p: 0r }

so I suspect it's getting typed as an error, and something about trying to gather up all the record fields in an erroneous record is the problem

view this post on Zulip Richard Feldman (Jun 29 2025 at 18:24):

in this case both the type annotation and the record expression are invalid, but not sure if that's required to repro

view this post on Zulip Richard Feldman (Jun 29 2025 at 18:25):

this is probably the sort of situation that breaks the old compiler when you try to do the "run anyway despite errors" thing, so it's pretty great to see the fuzzer turning it up! :grinning_face_with_smiling_eyes:

view this post on Zulip Anthony Bullard (Jun 29 2025 at 18:28):

I'd love to see a count of how many runs of the fuzzer it takes to generate a file / statement (not expr) that is actually completely valid with no reports

view this post on Zulip Brendan Hansknecht (Jun 29 2025 at 18:38):

Anthony Bullard said:

I'd love to see a count of how many runs of the fuzzer it takes to generate a file / statement (not expr) that is actually completely valid with no reports

Just need to make an inverted fuzzer that only fails if everything goes successfully through the complete compiler stack

view this post on Zulip Kiryl Dziamura (Jun 29 2025 at 18:55):

Regression tests generator lol

view this post on Zulip Brendan Hansknecht (Jul 04 2025 at 19:10):

First tokenizer fuzz failure in a long long time: zig build repro-tokenize -- -b Jyc= -v

view this post on Zulip Brendan Hansknecht (Jul 04 2025 at 19:11):

Seems to be related to new single quote changes. I think it is a bug on the formatting side technically rather than truly a tokenizer bug.

view this post on Zulip Brendan Hansknecht (Jul 04 2025 at 19:11):

We now allow for empty single quote literals, which was not allowed before.

view this post on Zulip Kiryl Dziamura (Jul 04 2025 at 19:13):

I'll take a look

view this post on Zulip Brendan Hansknecht (Jul 04 2025 at 19:14):

Also, we are getting some fun canonicalize failures now like: zig build repro-canonicalize -- -b IiJ1PSc= -v

It leads to a zig slice that is invalid:

thread 219880510 panic: start index 1 is larger than end index 0
/Users/bren077s/Projects/roc/src/check/canonicalize.zig:1269:42: 0x104a227af in canonicalize_expr (repro-canonicalize)
            const inner_text = token_text[1 .. token_text.len - 1];

view this post on Zulip Kiryl Dziamura (Jul 04 2025 at 19:18):

Interesting, because there's a snapshot with an empty single quote. In such a case, '' is of length 2 so the slice is [1..1]. Looks like a problem in the tokenizer. Likely it creates the token too soon

view this post on Zulip Brendan Hansknecht (Jul 04 2025 at 19:21):

Of note, for the tokenizer fuzzer, we try to generate a "canonical" version of each token. Then retokenize a second time

view this post on Zulip Brendan Hansknecht (Jul 04 2025 at 19:21):

So that "canonical" version is probably wrong for single quotes now. It probably needs to be allowed to be empty
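The round-trip check described here can be sketched with a toy tokenizer. Everything below is hypothetical (the token grammar, the names, the `buggy_quotes` switch) and is not Roc's tokenizer; it only illustrates the property "tokenize, rebuild a canonical text from the tokens, tokenize again, expect the same tokens", and how a canonical form that drops the closing quote breaks it.

```python
import re

# Toy grammar: integers, lowercase identifiers, single-quote literals
# (possibly empty), and runs of spaces. Anything else is a 1-byte "invalid".
TOKEN_RE = re.compile(
    r"(?P<int>[0-9]+)|(?P<ident>[a-z]+)|(?P<quote>'[^']*')|(?P<space> +)"
)


def tokenize(text: str) -> list[tuple[str, int]]:
    """Return (kind, length) pairs covering the whole input."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            tokens.append(("invalid", 1))
            pos += 1
            continue
        tokens.append((match.lastgroup, match.end() - match.start()))
        pos = match.end()
    return tokens


def canonical(kind: str, length: int, buggy_quotes: bool = False) -> str:
    """Produce some text of the given kind and length."""
    if kind == "int":
        return "0" * length
    if kind == "ident":
        return "a" * length
    if kind == "space":
        return " " * length
    if kind == "quote":
        body = "x" * (length - 2)
        # The buggy variant forgets the closing quote.
        return "'" + body if buggy_quotes else "'" + body + "'"
    return "?" * length


def round_trips(text: str, buggy_quotes: bool = False) -> bool:
    first = tokenize(text)
    rebuilt = "".join(canonical(k, n, buggy_quotes) for k, n in first)
    return tokenize(rebuilt) == first
```

With the correct canonical form, `round_trips("''")` holds; with `buggy_quotes=True` the rebuilt text is a lone `'`, which retokenizes differently and fails the check.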

view this post on Zulip Brendan Hansknecht (Jul 04 2025 at 19:22):

Oh, I think the for loop here just needs to be from 0..length:
https://github.com/roc-lang/roc/blob/9a32c422f290713a312e18a96cb6f43c850aa4d0/src/check/parse/tokenize.zig#L1662-L1667

view this post on Zulip Brendan Hansknecht (Jul 04 2025 at 19:22):

Or maybe 1..length-1?

view this post on Zulip Kiryl Dziamura (Jul 04 2025 at 19:32):

It should be length-1, right

view this post on Zulip Kiryl Dziamura (Jul 04 2025 at 19:38):

Looks like this code generated only the opening single quote, truncating the closing one. So '' becomes ', hence the slice [1..(1 - 1)]
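To make the arithmetic concrete, here is a small Python sketch that mimics a bounds-checked slice (Python's own slicing would silently return an empty result). The helper names are hypothetical; only the `[1 .. len - 1]` expression and the panic wording come from the trace above.

```python
def checked_slice(text: bytes, start: int, end: int) -> bytes:
    """Slice with explicit bounds checks, similar in spirit to a safety-checked slice."""
    if start > end:
        raise IndexError(f"start index {start} is larger than end index {end}")
    if end > len(text):
        raise IndexError(f"end index {end} is out of bounds for length {len(text)}")
    return text[start:end]


def inner_text(token_text: bytes) -> bytes:
    """Strip the surrounding quotes: token_text[1 .. len - 1]."""
    return checked_slice(token_text, 1, len(token_text) - 1)


if __name__ == "__main__":
    print(inner_text(b"''"))   # length 2 -> slice [1..1], empty but valid
    try:
        inner_text(b"'")       # length 1 -> slice [1..0], invalid
    except IndexError as err:
        print(err)
```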

view this post on Zulip Kiryl Dziamura (Jul 04 2025 at 19:52):

https://github.com/roc-lang/roc/pull/7941

view this post on Zulip Brendan Hansknecht (Jul 05 2025 at 23:10):

General question, is it fair to say that all files under 16KB should definitely complete roc check in under a second, right?

16KB is just an arbitrary number I set for fuzzing and I bet the true number should be higher, but compiler perf wise, I assume we want to be able to roc check much much faster than that.

For reference, Dict.roc is 60KB and is only 1776 lines.

Of course in the worst case fuzzing experience, it will find code that takes maximal time and generates a metric ton of errors. So it isn't truly representative.

Just thinking about fuzzer hangs and settings.

view this post on Zulip Brendan Hansknecht (Jul 05 2025 at 23:16):

Also, how the heck does an input like this pass parsing and get to canonicalization? This feels pretty deeply wrong to me:

0]r={s=||{r={s=||{s={r=||{l={s=||{s={s={v={r={s={v=||{c00st=0t=c00st(0)c00st(0)t=c00st(0)

I get we want the compiler to be able to run as much as possible, but this has to fail parsing, right?

view this post on Zulip Brendan Hansknecht (Jul 05 2025 at 23:17):

Hmm... I guess it does fail parsing, but we just keep going anyway:

[0]: check.parse.AST.Diagnostic{ .tag = check.parse.AST.Diagnostic.Tag.missing_header, .region = check.parse.AST.TokenizedRegion{ .start = 0, .end = 1 } }
[1]: check.parse.AST.Diagnostic{ .tag = check.parse.AST.Diagnostic.Tag.expr_unexpected_token, .region = check.parse.AST.TokenizedRegion{ .start = 55, .end = 56 } }
[2]: check.parse.AST.Diagnostic{ .tag = check.parse.AST.Diagnostic.Tag.expr_unexpected_token, .region = check.parse.AST.TokenizedRegion{ .start = 56, .end = 57 } }

view this post on Zulip Luke Boswell (Jul 05 2025 at 23:38):

Brendan Hansknecht said:

Also, how the heck does an input like this pass parsing and get to canonicalization? This feels pretty deeply wrong to me:

0]r={s=||{r={s=||{s={r=||{l={s=||{s={s={v={r={s={v=||{c00st=0t=c00st(0)c00st(0)t=c00st(0)

I get we want the compiler to be able to run as much as possible, but this has to fail parsing, right?

Maybe we want to support droid mode. The robots can plug in and skip all the human whitespace nonsense.

view this post on Zulip Richard Feldman (Jul 06 2025 at 01:05):

Brendan Hansknecht said:

General question, it is fair to say that all files under 16KB should definitely complete roc check in under a second, right?

Hindley-Milner type inference has pathological asymptotic time complexity if you just keep nesting lets (or defs in our case), and relies on the fact that in practice people don't actually do that

view this post on Zulip Richard Feldman (Jul 06 2025 at 01:05):

but if a fuzzer did that, it would presumably get bad :smile:
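A simplified illustration of the kind of growth involved, assuming a textbook-style chain in which each definition pairs the previous one with itself (x1 = (x0, x0), x2 = (x1, x1), and so on). The syntax is pseudo-code rather than Roc, and this only shows the written size of the type doubling per definition; real checkers share structure, and the true worst case with polymorphic definitions is worse than this.

```python
def type_of(n: int) -> str:
    """Written-out type of x_n, where x_0 : a and x_{k+1} = (x_k, x_k)."""
    current = "a"
    for _ in range(n):
        current = f"({current}, {current})"
    return current


def leaf_count(n: int) -> int:
    """Number of `a` leaves in the written-out type of x_n (doubles each step)."""
    return type_of(n).count("a")


if __name__ == "__main__":
    for n in range(0, 11, 2):
        print(f"x{n}: {leaf_count(n)} leaves")
```

Ten short definitions already give a type with 1,024 leaves, which is why a fuzzer that stumbled onto this shape could produce tiny inputs with very long check times.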

view this post on Zulip Richard Feldman (Jul 06 2025 at 01:06):

Brendan Hansknecht said:

Also, how the heck does an input like this pass parsing and get to canonicalization? This feels pretty deeply wrong to me:

0]r={s=||{r={s=||{s={r=||{l={s=||{s={s={v={r={s={v=||{c00st=0t=c00st(0)c00st(0)t=c00st(0)

I get we want the compiler to be able to run as much as possible, but this has to fail parsing, right?

I think the right answer here is that parsing should generate a ton of error nodes, but then when we proceed to canonicalization, it finds essentially no work to do because it's all error nodes, so canonicalization and type-checking end up being no-ops

view this post on Zulip Richard Feldman (Jul 06 2025 at 01:07):

so you get the same outcome as if we "stopped at parsing" except:

view this post on Zulip Brendan Hansknecht (Jul 06 2025 at 01:22):

but if a fuzzer did that, it would presumably get bad

Makes sense, we'll see. The fuzzer just optimizes for new exploration so may be unlikely, but not really sure.

view this post on Zulip Brendan Hansknecht (Jul 06 2025 at 01:24):

I think the right answer here is that parsing should generate a ton of error nodes, but then when we proceed to canonicalization, it finds essentially no work to do because it's all error nodes, so canonicalization and type-checking end up being no-ops

If I understand what is happening, the parser generates a mostly valid tree by automatically adding a bunch of }s at the end. Can then runs with tons of recursive lambda and expression checks. Can is very slow. The end result of Can is mostly a bunch of unused variable and duplicate definition complaints.

view this post on Zulip Joshua Warner (Jul 06 2025 at 01:24):

One thing that a lot of fuzzers count as new coverage is loop counts (maybe recursion counts?) going over some threshold

view this post on Zulip Brendan Hansknecht (Jul 06 2025 at 01:25):

Yeah, bucketed loop counts is new coverage

view this post on Zulip Brendan Hansknecht (Jul 06 2025 at 01:25):

so it isn't down to the individual iteration, but it does count overall
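A sketch of what bucketed hit counts look like, using the bucket boundaries documented for classic AFL (1, 2, 3, 4-7, 8-15, 16-31, 32-127, 128+). AFL++ may differ in detail, and clamping to 255 is a simplification of the single-byte counters; the function name is made up for illustration.

```python
# (low, high) inclusive ranges of per-edge hit counts that share a bucket.
BUCKETS = [(1, 1), (2, 2), (3, 3), (4, 7), (8, 15), (16, 31), (32, 127), (128, 255)]


def hit_count_bucket(count: int) -> tuple[int, int]:
    """Map a raw hit count to its bucket; moving to a new bucket counts as new coverage."""
    if count <= 0:
        return (0, 0)
    count = min(count, 255)  # simplification: treat the counter as saturating
    for low, high in BUCKETS:
        if low <= count <= high:
            return (low, high)
    raise AssertionError("unreachable: buckets cover 1..255")


if __name__ == "__main__":
    for count in (1, 5, 7, 8, 100, 7000):
        print(count, hit_count_bucket(count))
```

Going from 5 to 7 iterations of a loop stays in the same bucket and looks like nothing new, while going from 7 to 8 crosses a boundary and is rewarded, which is how a fuzzer can drift toward ever-deeper nesting.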

view this post on Zulip Joshua Warner (Jul 06 2025 at 01:25):

I predict as soon as any low hanging fruit is cleared out, it'll start finding things like that (unless we dissuade it somehow!)

view this post on Zulip Joshua Warner (Jul 06 2025 at 01:26):

Would it be reasonable to put some limit on that let recursion and start erroring after that?

view this post on Zulip Brendan Hansknecht (Jul 06 2025 at 01:29):

Yeah, I'm sure when it comes up we can work around it. That said, right now there are lots of hangs with can in general (though hang is a pretty loose definition). Like the example above is considered a hang on the CI machine (old/weak cpu), but only takes 250ms on my M1 mac. That said, something that short taking 250ms is almost certainly a perf bug.

view this post on Zulip Brendan Hansknecht (Jul 06 2025 at 01:29):

So probably worthwhile currently to consider a failure.

view this post on Zulip Richard Feldman (Jul 06 2025 at 01:29):

agreed!

view this post on Zulip Brendan Hansknecht (Jul 06 2025 at 01:29):

At least that is my thought

view this post on Zulip Brendan Hansknecht (Jul 06 2025 at 01:29):

Just want to make sure that what we get currently that is considered a hang is useful. I think it is, but thought it would be worth double checking.

view this post on Zulip Brendan Hansknecht (Jul 06 2025 at 01:30):

And yeah, we still have tons of low hanging crashes in both parse and can


Last updated: Jul 06 2025 at 12:14 UTC