I'm just tinkering with tokenizer fuzzing bugs. Decided to pull this into a separate thread to not clutter other discussions.
@Joshua Warner how would you expect a numeric base without any following digits to tokenize? `0x.` for example. Should we generate some sort of malformed node or error if the base is not followed by a number?
Currently this is one of the tokenizer fuzzer failures and I was just trying to clean it up.
example failure: `zig build repro-tokenize -- -b MHguMA== -v`
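(For illustration, a minimal self-contained sketch of the recovery being proposed - hypothetical names, not the actual tokenizer: require at least one digit after the base prefix, and classify the literal as malformed otherwise.)

~~~zig
const std = @import("std");

const NumberTag = enum { int, malformed_no_digits };

// Hypothetical sketch: `prefix_end` points just past the "0x" prefix.
// If no hex digit follows, report a malformed number rather than crashing.
fn classifyHexLiteral(source: []const u8, prefix_end: usize) NumberTag {
    var pos = prefix_end;
    while (pos < source.len and std.ascii.isHex(source[pos])) pos += 1;
    return if (pos == prefix_end) .malformed_no_digits else .int;
}

test "0x. yields a malformed number instead of a crash" {
    try std.testing.expectEqual(NumberTag.malformed_no_digits, classifyHexLiteral("0x.", 2));
    try std.testing.expectEqual(NumberTag.int, classifyHexLiteral("0xFF", 2));
}
~~~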
Yeah that looks malformed to me
Found an issue from a fuzz crash. Made a small test that is failing, and investigating now. It tokenizes the `]` as a `CloseCurly` instead of a `CloseSquare`.
try testTokenization(
    gpa,
    \\f{o,
    \\ ]
    \\
    \\foo =
    \\
    \\ "onmo %
    ,
    &[_]Token.Tag{
        .LowerIdent,
        .OpenCurly,
        .LowerIdent,
        .Comma,
        .Newline,
        .CloseSquare,
        .Newline,
        .LowerIdent,
        .OpAssign,
        .Newline,
        .StringStart,
        .StringPart,
    },
);
Optimistically, I think that ought to detect the mismatched brackets, infer the intent was to have the correct closing bracket there, emit a message, and continue.
That said, it feels a bit more likely that that’s just an accident at this point.
Yeah, you're correct. I hadn't noticed the Diagnostic was being correctly reported for Mismatched braces.
I had a break for lunch, but I've almost finished re-wiring the way I'm dealing with Problems in the snapshots so these issues are surfaced more easily.
From your other post @Luke Boswell :
~~~PROBLEMS
check.parse.tokenize.Diagnostic.Tag.AsciiControl
check.parse.tokenize.Diagnostic.Tag.MismatchedBrace
check.parse.tokenize.Diagnostic.Tag.UnclosedString
Looks like it's there just fine. We just don't seem to short-circuit when we have tokenize errors
I hope we do for parse errors
The invariant should be: for any fuzz input that has no tokenization or parse errors, it should have stable formatting
And a stable AST
We just call the parser directly, passing in the source. It calls the tokenizer. Not sure where the tokenizer errors end up, but currently only the errors returned by parse are checked. This is where one global list of all errors that builds up with each new stage would likely be nice
I think I agree that after a stage any stage errors should be added to a tagged union list of some sort and then the next stage can choose to proceed or not based on it being non-empty
Yeah, may even switch on type of error. Cause we want to attempt to compile even with many classes of errors.
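(A rough sketch of what that could look like - the types here are hypothetical, not the compiler's actual diagnostics:)

~~~zig
const std = @import("std");

// One list that every stage appends to, tagged by stage, so the next
// stage can switch on the class of error before deciding to proceed.
const Diagnostic = union(enum) {
    tokenize: struct { offset: u32 },
    parse: struct { start: u32, end: u32 },
};

test "a stage can check for earlier errors before proceeding" {
    var diags = std.ArrayList(Diagnostic).init(std.testing.allocator);
    defer diags.deinit();

    try diags.append(.{ .tokenize = .{ .offset = 7 } });

    // A later stage could continue anyway for some tags and bail for others.
    try std.testing.expect(diags.items.len != 0);
}
~~~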
I've done this in my latest PR for review. Probably needs adjustment, but it's a start.
Anthony Bullard said:
I think I agree that after a stage any stage errors should be added to a tagged union list of some sort and then the next stage can choose to proceed or not based on it being non-empty
Pretty much every kind of error should be continued from; the only places I can think of this not being true are:
- `roc check`/`build`/etc not being found
- `roc build` and having type errors

There are definitely classes of malformed nodes for example where it would not make sense to continue...
Not to mention illegal tokens
Can you give an example of the former? Illegal tokens could make sense, but I think we'd still want to allow running the code with malformed nodes, it'd just crash quickly
It feels like these are cases where we'd now be deciding "surely the user doesn't want to run code that is malformed" for them
But they should be free to decide for themselves
If I have a malformed expression in a list literal, how do I run that?
I crash
Just like I do for a type error
Canonicalization will generate a runtime error for that whole node
Sure but that seems a little silly to me
I’m using a strongly typed, compiled language to shift errors to the left
What if that function is barely called?
I think you might be missing the second paradigm of using Roc
I’d still want to know closer to when I wrote it
The first is the one we both like and mainly care about
Then when it just happens to get called - maybe a week later
Oh, I understand the disconnect
The runtime error would be paired with a compiler error that would give a non-zero exit code on roc build
You can still use the binary, but you'd know immediately that you have a positive number of errors and probably should fix those
During dev, you want to be able to run anyway
And 99% of people will fix the error
But do we really want to block the 1% of people that want to run anyway?
If we decide against supporting that, then I truly have misaligned on "always inform, never block"
yeah, the idea is "always inform, never block" - report the compiler error so you know there's a problem, but always let you have the choice to run it anyway
you shift left while also simultaneously not shifting left if you don't want to. it's like Schrödinger's Shift Left
Got it. I'm just not the target audience for that paradigm. I get enough runtime errors writing JavaScript
really simple example of where this is nice at even the lexing level: I'm trying to merge in a branch and see if it fixes what I'm working on. There are merge conflicts in a part of the code base I'm not working on, creating syntax errors with all their `======`s etc.
I would still like to be able to run tests on my unrelated part of the code base, so I can tell where it's worth my time to fix those merge conflicts or whether that branch doesn't fix my problem. If the merge conflicts block me from trying it out, I have no choice but to fix them first before I know whether that'll be a waste of time!
After some digging, I found the Newline tokens don't have region information set. I think this might be the cause of challenges getting the correct offsets for AST nodes.
From what I've discovered so far, I think it's because we are using the offset field to store the indentation. I'm wondering if we could use the `extra` field for this instead.
pub fn pushNewline(self: *TokenizedBuffer, indent: u32) void {
    self.tokens.append(self.env.gpa, .{
        .tag = .Newline,
        .offset = indent, // store the indent in the offset field
        .extra = .{ .length = 0 },
    }) catch |err| exitOnOom(err);
}
(deleted) -- i think that example is just going to cause more confusion
Yeah, instead of having an offset, newlines store the indentation of the following line
So I think the parser needs to make sure not to use locations from newlines
It needs to take the token before or after the newline depending on its goal
But yeah, probably could move indents to the extra field instead
That sounds cleaner. Especially since we have the space anyway for it.
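(A sketch of that change against the snippet above - note it assumes the `extra` union gains an `indent` variant, which is not in the current code:)

~~~zig
// Hypothetical revision: keep the newline's real byte offset in `offset`
// so regions stay correct, and move the following line's indentation
// into `extra` (assumes `extra` has an `indent` variant).
pub fn pushNewline(self: *TokenizedBuffer, offset: u32, indent: u32) void {
    self.tokens.append(self.env.gpa, .{
        .tag = .Newline,
        .offset = offset, // actual position in the source, usable for regions
        .extra = .{ .indent = indent }, // indentation of the next line
    }) catch |err| exitOnOom(err);
}
~~~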
Yeah, I'm just starting with `module []` and trying to figure out why the region for that isn't what I expect.
I think the issue is that `self.expect(.CloseSquare) catch { ... }` calls `advance`, which moves `self.pos` forward n steps (we don't know how many) until the next non-newline token. So after parsing the module header, we set the region to be the start of the next non-newline token, instead of the end of the `]`.
I have an idea which might fix it
Actually I found the fix: change `_ = self.parseCollectionSpan` to `const end = self.parseCollectionSpan`, and use that for the region. :octopus:
@Joshua Warner how do you think we should handle something like `0u8.0` in the tokenizer?
Currently, we convert it into `111.1`, which leads to it retokenizing as a float, which fails. I'm debating using the extra field of the token to store the suffix. It would be an enum of `none`, `u8`, `i8`, etc. We could actually still store the length as well, assuming we limit int literals to less than 2^28 characters.
exact repro for context: `zig build repro-tokenize -- -b MHU4LjA= -v`
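(A sketch of how a suffix could pack into a 32-bit `extra` field alongside the length - a hypothetical layout, but it shows where the 2^28 limit comes from: 4 bits of suffix leave 28 bits of length.)

~~~zig
const std = @import("std");

// Hypothetical layout, not the actual token representation.
const NumberSuffix = enum(u4) { none, @"u8", @"i8", @"u16", @"i16", @"u32", @"i32", @"u64", @"i64" };

const NumberExtra = packed struct(u32) {
    suffix: NumberSuffix, // 4 bits
    length: u28, // hence int literals limited to < 2^28 characters
};

test "suffix and length pack into a 32-bit extra field" {
    const extra = NumberExtra{ .suffix = .@"u8", .length = 3 }; // e.g. "0u8"
    try std.testing.expectEqual(32, @bitSizeOf(NumberExtra));
    try std.testing.expectEqual(NumberSuffix.@"u8", extra.suffix);
}
~~~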
Also, is `.7` a valid float? I assume we should just make it a float and then reformat it. Though I could also see it mapping to a recoverable malformed node for missing the leading zero.
I think `.7` should not be valid
I guess the question is how is it invalid. As in, if we see `.7`, I am 99% sure that we want the formatter to output `0.7`. So simply treating it as a valid float in the compiler without any sort of malformed node may be the best bet? I guess we should still push a message. That way we can raise a warning?
I guess I'm not fully sure when we want a malformed node vs a message vs etc
today `.7` is a tuple accessor function, but I believe we discussed how that syntax needs to change to `_.7`
so if it's not a tuple accessor, it seems fine to have it be a number literal
but yeah I'd have the formatter add the `0`
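(Sketch of that formatter rule with a hypothetical helper, not the real formatter:)

~~~zig
const std = @import("std");

// A float literal missing its leading zero gets one added on reformat.
fn formatFloatLiteral(gpa: std.mem.Allocator, literal: []const u8) ![]u8 {
    if (literal.len > 0 and literal[0] == '.') {
        return std.fmt.allocPrint(gpa, "0{s}", .{literal});
    }
    return gpa.dupe(u8, literal);
}

test ".7 is reformatted as 0.7" {
    const out = try formatFloatLiteral(std.testing.allocator, ".7");
    defer std.testing.allocator.free(out);
    try std.testing.expectEqualStrings("0.7", out);
}
~~~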
totally forgot about tuple accessors
I would expect `0u8.0` to be: `0u8` (an int literal), then a `NoSpaceDotInt` token
Yeah, so the issue here is technically reprinting for the fuzzer. It gets turned into `111.1`.
hmm, yeah, I think we need smarter number reprinting. Another example failure that is really the reprinting: `0o0.0` becomes `111.1`. The first parses as an int and then a `NoSpaceDotInt`; the second parses as a float.
How do people feel about adding a `FileStart` token that is always the first, in the zeroth position?
For something like a `.missing_header` diagnostic, it would make more sense to attach that to the whole file than to a single token. Maybe there are other errors that apply to the whole file?
Or should we special-case the error here, and if something comes up in future that we want to apply to the whole file, we revisit?
Here is a fuzz crash I was looking at. To get formatting stable, we will need to format something for these malformed nodes.
The question is what.
My first thought was just write out the original bytes.
~~~SOURCE
4o|
~~~PROBLEMS
PARSER: missing_header
PARSER: unexpected_token
~~~TOKENS
MalformedNumberBadSuffix(1:1-1:3),OpBar(1:3-1:4),EndOfFile(1:4-1:4),
~~~PARSE
(file (1:1-1:4)
(malformed_header (1:1-1:3) "missing_header")
(malformed_expr (1:3-1:4) "unexpected_token"))
I would start by just not formatting on failures like this
You have problems, so don't format at all. Just stop after parse
This is what I'm looking at... is this just a `repro-parse`-specific thing then?
$ zig build repro-parse -- /private/tmp/corpus/default/crashes.2025-03-14-16:15:46/id:000003,sig:06,src:000007,time:57864030,execs:42508956,op:havoc,rep:4
Reading bytes for repro from /private/tmp/corpus/default/crashes.2025-03-14-16:15:46/id:000003,sig:06,src:000007,time:57864030,execs:42508956,op:havoc,rep:4
slices differ. first difference occurs at index 0 (0x0)
============ expected this output: ============= len: 2 (0x2)
[0]: check.parse.IR.Diagnostic{ .tag = check.parse.IR.Diagnostic.Tag.missing_header, .region = check.parse.IR.Region{ .start = 0, .end = 0 } }
[1]: check.parse.IR.Diagnostic{ .tag = check.parse.IR.Diagnostic.Tag.unexpected_token, .region = check.parse.IR.Region{ .start = 1, .end = 2 } }
============= instead found this: ============== len: 0 (0x0)
================================================
Long term, we do want to support formatting even with some problems, but that is likely to be a larger project (I guess you totally could take it on now if you want).
Interesting. I assumed we were already doing that because it seems the fuzzer is checking things even with errors.
It's not supposed to
Root error starts here:
std.testing.expectEqualSlices(IR.Diagnostic, parse_ast.errors, &[_]IR.Diagnostic{}) catch {
    return error.ParseFailed;
};
Should bubble up through:
const formatted = try parseAndFmt(gpa, input, debug);
Then should be ignored here:
error.ParseFailed => {
    // No issue. Just bad input we couldn't parse.
    return;
},
Ah I see. We're not pushing the errors into ModuleEnv yet.
Oh, I don't think this checks module env errors yet, is that the issue?
Or maybe the opposite, we are pushing into the ModuleEnv but in the fuzzer it's still checking in the IR
This test case `4o|` is passing the fuzzer
It just prints out a parse error
Due to the first parse failing, it exits early (and considers it a success)
`zig build repro-parse -- -b NG98Cg== -v && echo "all passed"`
I guess that debug print had me confused. Thanks for explaining that
$ echo $?
0
Maybe in verbose mode we should print out a message to clarify that on first parse failure we just exit.
That, or maybe add a flag that just uses some sort of mem eql instead of the testing version that prints on error
Yeah, someone else is going to get tripped up on this too I'm sure
How should we handle malformed tokens? Here is an example.
Screenshot 2025-03-14 at 17.08.52.png
That currently gives us these tokens -- there isn't anything for the NUL byte
~~~TOKENS
KwModule(1:1-1:7),OpenSquare(1:8-1:9),CloseSquare(1:9-1:10),Newline(1:1-1:1),
LowerIdent(2:1-2:4),EndOfFile(2:5-2:5),
~~~PARSE
(file (1:1-2:5)
(module (1:1-1:10))
(ident (2:1-2:4) "" "foo"))
Oh wait.. I think I see other MalformedXXX tokens :man_facepalming:
Might just be `.MalformedUnknownToken`
Random comment from Loris that I think is a good idea when we consider making more intensely optimized parsers via simd or only using start offset or similar:
Keep around the naive path. It makes a great fuzz oracle.
That might be a long time in the future, but something to consider for more invasive changes.
Yes!
Also good read on optimizing a tokenizer in zig: https://iridescence.me/2025/03/14/elyra-part-1-optimal-tokenization/
Thus far the tokenizer is surprisingly resistant to my attempts to simd-ify individual functions (e.g. chompTrivia/etc). I kinda suspect llvm is already doing a decent job of vectorizing the obvious things.
I would be stunned if chompTrivia uses simd. Way too many conditionals for llvm to work through
if you've got free time, the simdjson paper is my favorite paper I've ever read
it's really mind-expanding and it's not something you can just pick up anytime
really need time to digest it and play around with the concepts imo
also since that paper was written, Daniel Lemire posted how to do one of the critical vectorized classification techniques mentioned in the paper: https://lemire.me/blog/2025/06/01/easy-vectorized-classification-with-z3/
Yeah, no simd in current `chompTrivia`: https://zig.godbolt.org/z/PGeveEP1d
Also double-checked a simple function to make sure that godbolt zig is set up to allow llvm to automatically add simd
Also, I just realized that we limit the tokenizer to 128 diagnostics, but the rest of the stack is handled differently. Probably something to make consistent
Of note, I just used tracy to plot the number of bytes being consumed by chompTrivia. The max number of bytes consumed at once is 20, and most of the time I think it is only 1 byte. So you have a very, very short loop to take advantage of simd. (This is with one of my basic million-line files.)
That makes the problem harder if you only look at that thin slice of the problem and don't turn the whole parser to simd, where you can be more consistent.
If you look at something like `List.roc`, the distribution is:
82% 1 char -> simd probably makes this slower, in best case same (unless simd changes it to branchless then maybe faster, but still unlikely)
8% 4 char
5% 8 char
3% 12 char
1% 16 char
<1% 20 or 24 char
I'm kinda confused why all of these are multiples of 4. (oh...that's tabbing via spaces, isn't it)
Oh wait... sorry... that distribution is wrong; it missed the newline early exit from chompTrivia (so it is probably missing every comment which ends in a newline)
Ok, corrected distributions:
75% 1 char
5% 4 char
3% 8 char
2% 3 char
2% 12 char
1% 7 char
1% 10 char
~10% 13 to 90 char
So these are the comments. Looks like ~1 sample for most things 10 to about 90 characters. (all of these definitely should see benefit from simd)
But yeah, 75% 1 char means that simd will be hard to show value on this function alone. Cause 75% of the time we likely are doing more work because of simd.
Switching from the manual loop looking for newlines after comments to indexOfScalarPos instead (which is vectorized internally), seemed to have basically no difference (even a slight regression actually).
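(For reference, a minimal sketch of that change - a hypothetical helper rather than the real chompTrivia - where `std.mem.indexOfScalarPos` does the vectorized scan to the end of a line comment:)

~~~zig
const std = @import("std");

// Returns the index of the newline ending a line comment (left for the
// caller to tokenize), or the end of the source if the comment is last.
fn skipLineComment(source: []const u8, comment_start: usize) usize {
    return std.mem.indexOfScalarPos(u8, source, comment_start, '\n') orelse source.len;
}

test "comment skipping stops at the newline" {
    const src = "# a comment\nfoo";
    try std.testing.expectEqual(@as(usize, 11), skipLineComment(src, 0));
}
~~~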
What corpus are you using?
Yeah, good point about doing the simd-ification at the level of the outer loop
That distribution is just from `List.roc`. I wanted to see roughly what a normal (but highly commented) file would map to.
Cause `List.roc` is approximately the best case.
Joshua Warner said:
Switching from the manual loop looking for newlines after comments to indexOfScalarPos instead (which is vectorized internally), seemed to have basically no difference (even a slight regression actually).
Might turn into a case where it ends up being hardware/dataset-dependent as to whether it is a win. Might also only be a win for cpus that are in performance mode rather than standard power management settings.
One of the really hard parts of simd is that if only a small part of a program is simd, even if that part is faster, it might cause enough heat that the cpu down clocks. That then leads to all the rest of the program running slower. This is part of the reason why just having a tiny bit of simd can often fail.
Does zig have simd intrinsics?
`@Vector` I think is guaranteed to map to simd.
At least it is the intended type to use to get simd.
The basic simd stuff like `@Vector`/`@select`/etc is sufficient for a lot of cases, but I don't see any of the table lookup stuff (e.g. `vqtbl4q_u8`)
Oh, though also yes under std.simd
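(A minimal demonstration of the `@Vector` route: compare a chunk of bytes lane-wise, then bitcast the bool vector down to a bitmask, where element 0 maps to the least significant bit.)

~~~zig
const std = @import("std");

// Compare 16 bytes at once; the result is one bit per byte.
fn spaceMask(chunk: @Vector(16, u8)) u16 {
    const spaces: @Vector(16, u8) = @splat(' ');
    const is_space: @Vector(16, bool) = chunk == spaces;
    return @as(u16, @bitCast(is_space));
}

test "bitmask marks which of the 16 bytes are spaces" {
    const bytes: [16]u8 = "a b c d e f g h ".*;
    // spaces sit at the odd positions
    try std.testing.expectEqual(@as(u16, 0b1010101010101010), spaceMask(bytes));
}
~~~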
Joshua Warner said:
Switching from the manual loop looking for newlines after comments to indexOfScalarPos instead (which is vectorized internally), seemed to have basically no difference (even a slight regression actually).
Just tested this on my M1 mac: I see just under an 8% gain in performance for tokenizing the 1-million-line duplicated version of List.roc (and that is on battery, though plugging in makes no difference)
For tokenizing normal `List.roc`, I see a 3% gain, but that may be noise. It runs so fast that it might be other things causing the perf change.
I think maybe my current benchmark is too small
I finally figured out how to do an almost-fully-vectorized version of _most_ of the tokenizer (in particular, excluding comments and strings)
https://github.com/joshuawarner32/roc/blob/improve-tokenize/tokenize_experimental.zig
For a baseline, on my machine the current tokenizer does about 250 megabytes a second
If I don't try to _store_ the tokenized results - just leave them in the computed bitsets, this one gets ~700 mbps
... but when I do store, the perf tanks :/
It only gets ~140 mbps
I have a suspicion avx512 instructions may be more conducive to this than neon
If someone can figure out how to do this faster... that's the main bottleneck.
Also note that this is not capturing any info about idents/nums/etc - and is completely omitting ident interning, for instance.
Ident interning is the slowest part of the current tokenizer, and removing that brings its perf up to ~375 mbps or so
Oh and this one is missing keyword checking too
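(A rough sketch of the 64-bytes-into-bitsets idea, simdjson-style - illustrative only, not the code in the linked branch:)

~~~zig
const std = @import("std");

// One u64 per 64-byte chunk; each set bit marks a byte of interest.
fn bracketBits(chunk: @Vector(64, u8)) u64 {
    const open: @Vector(64, bool) = chunk == @as(@Vector(64, u8), @splat('['));
    const close: @Vector(64, bool) = chunk == @as(@Vector(64, u8), @splat(']'));
    return @as(u64, @bitCast(open)) | @as(u64, @bitCast(close));
}

test "bracket positions become set bits" {
    var chunk = [_]u8{' '} ** 64;
    chunk[3] = '[';
    chunk[10] = ']';
    try std.testing.expectEqual(@as(u64, (1 << 3) | (1 << 10)), bracketBits(chunk));
}
~~~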
Fun experiment. :racecar:
Yep, and I learned some things!
I think the actual answer is to do substantially less simd stuff and break out quicker into regular code
It's unclear as of yet whether this idea of processing things in chunks, 64 bytes at a time, has any legs.
Joshua Warner said:
If I don't try to _store_ the tokenized results - just leave them in the computed bitsets, this one gets ~700 mbps
... but when I do store, the perf tanks :/
can't we turn them into parse IR nodes and not store the tokens at all?
in other words, we materialize parse IR nodes but we never actually construct a full list of tokens; they only exist ephemerally in these 64B chunks (give or take 1 extra node for lookahead or something if necessary)
this would also mean we could just keep a running total of "current region" and only actually store `Region` nodes in memory when it's time to create actual AST nodes
Ah - so the parser is directly reading from this bitset representation?
yep!
basically have the parser be a function that's called like "chomp token" and it takes the current parser state, the current byte position in the source code, the byte length of the token, and maybe optionally the next token if we need to lookahead
so we only ever have one 64B bitset worth of tokens in memory at a time (again give or take an extra token for lookahead)
Hmm, why invert the control of the parser like that?
I think this could be done with the parser pulling from the tokenizer rather than the tokenizer pushing to the parser
Given the complex recursion going on in the parser, it'll be much better from a code perspective to have that be driving things
I think we will need at least two tokens of lookahead actually materialized
(not in the bitset)
Otherwise lookahead becomes very complicated
There are cases where we need to check what's after the immediate next token, and we may need to do that repeatedly so we don't want to do linear-time search thru the bitset repeatedly.
sure, either way
I mean another way to think of "lookahead" is like "I'm not ready to proceed yet, so just write down the token temporarily and give me the next one"
so when you need to look ahead multiple tokens, what that means in practice is that you start building up a list of tokens in memory - but just the ones you actually need, not materializing the entire token list. Then, once you're ready to go back and process the original one, you continue going through that list until it's exhausted, and then you're done and ready to go back to the non-materialized token stream
so in practice I assume you'd end up materializing way way less than today
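(A sketch of that pull model with a bounded lookahead buffer - all the types here are hypothetical:)

~~~zig
// The parser drives; at most two lookahead tokens are ever materialized,
// and everything else stays in the tokenizer's transient per-chunk state.
const Token = struct { tag: u8, offset: u32 };

const PullTokenizer = struct {
    lookahead: [2]?Token = .{ null, null },

    // Fill lookahead slots lazily, only when the parser asks to peek.
    fn peek(self: *PullTokenizer, n: u1) Token {
        var i: usize = 0;
        while (i <= n) : (i += 1) {
            if (self.lookahead[i] == null) self.lookahead[i] = self.decodeNext();
        }
        return self.lookahead[n].?;
    }

    // Drain buffered tokens before pulling fresh ones from the source.
    fn next(self: *PullTokenizer) Token {
        if (self.lookahead[0]) |tok| {
            self.lookahead[0] = self.lookahead[1];
            self.lookahead[1] = null;
            return tok;
        }
        return self.decodeNext();
    }

    fn decodeNext(self: *PullTokenizer) Token {
        // Stand-in for decoding the next token out of the 64-byte bitsets.
        _ = self;
        return .{ .tag = 0, .offset = 0 };
    }
};

test "lookahead materializes at most two tokens" {
    var t = PullTokenizer{};
    _ = t.peek(1); // fills both slots
    _ = t.next(); // drains slot 0
    _ = t.next(); // drains slot 1
}
~~~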
this is also relevant to:
Kiryl Dziamura said:
I don't feel good about replacing region info with numeric data in `extra`. The other options are always pushing this data to a separate collection (as with interned), or extending token size, which of course is a terrible idea.
because it would mean we could store extra info that AST nodes want more directly and not have to move them around so much
separately, I really want to get regions out of our string interner so we can just directly compare interned indices to know for sure whether the underlying strings are equal; that really comes up a lot in later stages of the compiler!
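(A toy interner showing the payoff - hypothetical, not the compiler's: with regions out of the table, each distinct string is stored once and equality becomes an integer compare.)

~~~zig
const std = @import("std");

const Interner = struct {
    map: std.StringHashMap(u32),

    // Each distinct string gets a dense index, assigned on first sight.
    fn intern(self: *Interner, s: []const u8) !u32 {
        const gop = try self.map.getOrPut(s);
        if (!gop.found_existing) gop.value_ptr.* = self.map.count() - 1;
        return gop.value_ptr.*;
    }
};

test "equal strings intern to the same index" {
    var interner = Interner{ .map = std.StringHashMap(u32).init(std.testing.allocator) };
    defer interner.map.deinit();
    // No string comparison needed in later stages: just compare indices.
    try std.testing.expectEqual(try interner.intern("foo"), try interner.intern("foo"));
}
~~~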