I'm just tinkering with tokenizing fuzzing bugs. Decided to pull into a separate thread to not clutter other discussions.
@Joshua Warner how would you expect a numeric base without any following digits to tokenize? `0x`, for example. Should we generate some sort of malformed node or error if the base is not followed by a number?
Currently this is one of the tokenizer fuzzer failures and I was just trying to clean it up.
example failure: zig build repro-tokenize -- -b MHguMA== -v
Yeah that looks malformed to me
Found an issue from a fuzz crash. Made a small test that is failing, and investigating now. It tokenizes the `]` as a `CloseCurly` instead of a `CloseSquare`.
try testTokenization(
    gpa,
    \\f{o,
    \\ ]
    \\
    \\foo =
    \\
    \\ "onmo %
    ,
    &[_]Token.Tag{
        .LowerIdent,
        .OpenCurly,
        .LowerIdent,
        .Comma,
        .Newline,
        .CloseSquare,
        .Newline,
        .LowerIdent,
        .OpAssign,
        .Newline,
        .StringStart,
        .StringPart,
    },
);
Optimistically, I think that ought to detect the mismatched brackets, infer the intent was to have the correct closing bracket there, emit a message, and continue.
That said, it feels a bit more likely that that’s just an accident at this point.
Yeah, you're correct. I hadn't noticed the Diagnostic was being correctly reported for Mismatched braces.
I had a break for lunch, but almost finished re-wiring the way I'm dealing with Problems in the snapshots so these issues are surfaced easier.
From your other post @Luke Boswell :
~~~PROBLEMS
check.parse.tokenize.Diagnostic.Tag.AsciiControl
check.parse.tokenize.Diagnostic.Tag.MismatchedBrace
check.parse.tokenize.Diagnostic.Tag.UnclosedString
Looks like it's there just fine. We just don't seem to short circuit the input when we have tokenize errors
I hope we do for parse errors
The invariant should be: for any fuzz input that has no tokenization or parse errors, it should have stable formatting
And a stable AST
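That invariant could be checked in the fuzz harness roughly like this (a sketch only; `parse`, `format`, and `hasErrors` are hypothetical stand-ins for whatever the real harness exposes):

```zig
// Sketch of the round-trip invariant: if tokenization and parsing
// produced no errors, formatting must be idempotent.
fn checkStableFormatting(gpa: std.mem.Allocator, source: []const u8) !void {
    var ast = try parse(gpa, source); // hypothetical parser entry point
    if (ast.hasErrors()) return; // no stability guarantee with tokenize/parse errors
    const once = try format(gpa, ast);
    var reparsed = try parse(gpa, once);
    const twice = try format(gpa, reparsed);
    // Stable formatting: formatting the formatted output changes nothing.
    try std.testing.expectEqualStrings(once, twice);
}
```

Checking a stable AST would be the analogous comparison on the parse trees of `source` and `once`.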
We just call the parser directly, passing in the source. It calls the tokenizer. Not sure where the tokenizer errors end up, but currently only the errors returned by parse are checked. This is where one global list of all errors, building up with each new stage, would likely be nice.
I think I agree that after a stage any stage errors should be added to a tagged union list of some sort and then the next stage can choose to proceed or not based on it being non-empty
Yeah, may even switch on type of error. Cause we want to attempt to compile even with many classes of errors.
I've done this in my latest PR for review. Probably needs adjustment, but it's a start.
Anthony Bullard said:
I think I agree that after a stage any stage errors should be added to a tagged union list of some sort and then the next stage can choose to proceed or not based on it being non-empty
Pretty much every kind of error should be continued from; the only places I can think of this not being true are:
- `roc check`/`build`/etc. not being found
- `roc build` having type errors

There are definitely classes of malformed nodes, for example, where it would not make sense to continue...
Not to mention illegal tokens
Can you give an example of the former? Illegal tokens could make sense, but I think we'd still want to allow running the code with malformed nodes, it'd just crash quickly
It feels like these are cases where we'd now be deciding "surely the user doesn't want to run code that is malformed" for them
But they should be free to decide for themselves
If I have a malformed expression in a list literal, how do I run that?
I crash
Just like I do for a type error
Canonicalization will generate a runtime error for that whole node
Sure but that seems a little silly to me
I’m using a strongly typed, compiled language to shift errors to the left
What if that function is barely called?
I think you might be missing the second paradigm of using Roc
I’d still want to know closer to when I wrote it
The first is the one we both like and mainly care about
Then when it just happens to get called - maybe a week later
Oh, I understand the disconnect
The runtime error would be paired with a compiler error that would give a non-zero exit code on roc build
You can still use the binary, but you'd know immediately that you have a positive number of errors and probably should fix those
During dev, you want to be able to run anyway
And 99% of people will fix the error
But do we really want to block the 1% of people that want to run anyway?
If we decide against supporting that, then I truly have misaligned on "always inform, never block"
yeah, that's the idea with "always inform, never block" - report the compiler error so you know there's a problem, but always let you have the choice to run it anyway
you shift left while also simultaneously not shifting left if you don't want to. it's like Schrödinger's Shift Left
Got it. I'm just not the target audience for that paradigm. I get enough runtime errors writing JavaScript
really simple example of where this is nice at even the lexing level: I'm trying to merge in a branch and see if it fixes what I'm working on. There are merge conflicts in a part of the code base I'm not working on, creating syntax errors with all their `======`s etc.
I would still like to be able to run tests on my unrelated part of the code base, so I can tell where it's worth my time to fix those merge conflicts or whether that branch doesn't fix my problem. If the merge conflicts block me from trying it out, I have no choice but to fix them first before I know whether that'll be a waste of time!
After some digging, I found the Newline tokens don't have region information set. I think this might be the cause of challenges getting the correct offsets for AST nodes.
From what I've discovered so far, I think it's because we are using the offset field to store the indentation. I'm wondering if we could use the `extra` field for this instead.
pub fn pushNewline(self: *TokenizedBuffer, indent: u32) void {
    self.tokens.append(self.env.gpa, .{
        .tag = .Newline,
        .offset = indent, // store the indent in the offset field
        .extra = .{ .length = 0 },
    }) catch |err| exitOnOom(err);
}
(deleted) -- i think that example is just going to cause more confusion
Yeah, instead of having an offset, newlines store the indentation of the following line
So I think the parser needs to make sure not to use locations from newlines
It needs to take the token before or after the newline depending on its goal
But yeah, probably could move indents to the extra field instead
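If the indent did move into `extra`, the push helper might end up looking something like this (a sketch; the `.indent` union variant and the extra `offset` parameter are hypothetical additions, not existing code):

```zig
// Hypothetical variant: keep a real source offset on Newline tokens, so the
// parser can use them for regions, and stash the indentation in `extra`.
pub fn pushNewline(self: *TokenizedBuffer, offset: u32, indent: u32) void {
    self.tokens.append(self.env.gpa, .{
        .tag = .Newline,
        .offset = offset, // a real byte offset, usable for region info
        .extra = .{ .indent = indent }, // hypothetical new union variant
    }) catch |err| exitOnOom(err);
}
```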
That sounds cleaner. Especially since we have the space anyway for it.
Yeah, I'm just starting with `module []` and trying to figure out why the region for that isn't what I expect.
I think the issue is that `self.expect(.CloseSquare) catch { ... }` calls `advance`, which moves `self.pos` forward n steps (we don't know how many) until the next non-newline token. So after parsing the module header, we set the region to be the start of the next non-newline token, instead of the end of the `]`.
I have an idea which might fix it
Actually I found the fix: change `_ = self.parseCollectionSpan` to `const end = self.parseCollectionSpan`, and use that for the region. :octopus:
@Joshua Warner how do you think we should handle something like `0u8.0` in the tokenizer? Currently, we convert it into `111.1`, which leads to it retokenizing as a float, which fails. I'm debating using the extra field of the token to store the suffix. It would be an enum of `none`, `u8`, `i8`, etc. We could actually still store the length as well, assuming we limit int literals to fewer than 2^28 characters.
exact repro for context: zig build repro-tokenize -- -b MHU4LjA= -v
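One way that suffix-plus-length packing could look, with entirely hypothetical names (a 4-bit suffix tag leaves 28 bits for the length, matching the 2^28 limit mentioned above):

```zig
// Hypothetical layout: pack an int-literal suffix and the literal's length
// into one u32 extra field (top 4 bits = suffix, low 28 bits = length).
const NumSuffix = enum(u4) { none, @"u8", @"i8", @"u16", @"i16", @"u32", @"i32", @"u64", @"i64" };

fn packNumExtra(suffix: NumSuffix, length: u32) u32 {
    std.debug.assert(length < (1 << 28)); // int literals limited to < 2^28 chars
    return (@as(u32, @intFromEnum(suffix)) << 28) | length;
}

fn unpackSuffix(extra: u32) NumSuffix {
    return @enumFromInt(@as(u4, @intCast(extra >> 28)));
}
```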
Also, is `.7` a valid float? I assume we should just make it a float and then reformat it. Though I could also see it mapping to a recoverable malformed node for missing the leading zero.
I think `.7` should not be valid
I guess the question is how it is invalid. As in, if we see `.7`, I am 99% sure that we want the formatter to output `0.7`. So simply treating it as a valid float in the compiler without any sort of malformed node may be the best bet? I guess we should still push a message; that way we can raise a warning.
I guess I'm not fully sure when we want a malformed node vs a message vs etc.
today `.7` is a tuple accessor function, but I believe we discussed how that syntax needs to change to `_.7`
so if it's not a tuple accessor, it seems fine to have it be a number literal
but yeah I'd have the formatter add the 0
totally forgot about tuple accessors
I would expect `0u8.0` to be: `0u8` (an int literal), then a `NoSpaceDotInt` token
Yeah, so the issue here is technically reprinting for the fuzzer. It gets turned into `111.1`.
hmm, yeah, I think we need smarter number reprinting. Another example failure that is really the reprinting: `0o0.0` becomes `111.1`. The first parses as an int and then a `NoSpaceDotInt`; the second parses as a float.
How do people feel about adding a `FileStart` token that is always the first, in the zeroth position?
For something like a `.missing_header` diagnostic, it would make more sense to attach that to the whole file than to a single token. Maybe there are other errors that apply to the whole file?
Or should we special case the error in this case and if something comes up in future that we want to apply for the whole file, we revisit?
Here is a fuzz crash I was looking at. To get formatting stable, we will need to format something for these malformed nodes.
The question is what.
My first thought was just write out the original bytes.
~~~SOURCE
4o|
~~~PROBLEMS
PARSER: missing_header
PARSER: unexpected_token
~~~TOKENS
MalformedNumberBadSuffix(1:1-1:3),OpBar(1:3-1:4),EndOfFile(1:4-1:4),
~~~PARSE
(file (1:1-1:4)
(malformed_header (1:1-1:3) "missing_header")
(malformed_expr (1:3-1:4) "unexpected_token"))
I would start by just not formatting on failures like this
You have problems, so don't format at all. Just stop after parse
This is what I'm looking at... is this just a `repro-parse`-specific thing then?
$ zig build repro-parse -- /private/tmp/corpus/default/crashes.2025-03-14-16:15:46/id:000003,sig:06,src:000007,time:57864030,execs:42508956,op:havoc,rep:4
Reading bytes for repro from /private/tmp/corpus/default/crashes.2025-03-14-16:15:46/id:000003,sig:06,src:000007,time:57864030,execs:42508956,op:havoc,rep:4
slices differ. first difference occurs at index 0 (0x0)
============ expected this output: ============= len: 2 (0x2)
[0]: check.parse.IR.Diagnostic{ .tag = check.parse.IR.Diagnostic.Tag.missing_header, .region = check.parse.IR.Region{ .start = 0, .end = 0 } }
[1]: check.parse.IR.Diagnostic{ .tag = check.parse.IR.Diagnostic.Tag.unexpected_token, .region = check.parse.IR.Region{ .start = 1, .end = 2 } }
============= instead found this: ============== len: 0 (0x0)
================================================
Long term, we do want to support formatting even with some problems, but that is likely to be a larger project (I guess you totally could take it on now if you want).
Interesting. I assumed we were already doing that because it seems the fuzzer is checking things even with errors.
It's not supposed to.
Root error starts here:
std.testing.expectEqualSlices(IR.Diagnostic, parse_ast.errors, &[_]IR.Diagnostic{}) catch {
    return error.ParseFailed;
};
Should bubble up through:
const formatted = try parseAndFmt(gpa, input, debug);
Then should be ignored here:
error.ParseFailed => {
    // No issue. Just bad input we couldn't parse.
    return;
},
Ah I see. We're not pushing the errors into ModuleEnv yet.
Oh, I don't think this checks module env errors yet, is that the issue?
Or maybe the opposite, we are pushing into the ModuleEnv but in the fuzzer it's still checking in the IR
This test case `4o|` is passing the fuzzer
It just prints out a parse error
Due to the first parse failing, it exits early (and considers it a success)
zig build repro-parse -- -b NG98Cg== -v && echo "all passed"
I guess that debug print had me confused. Thanks for explaining that
$ echo $?
0
Maybe in verbose mode we should print out a message to clarify that on first parse failure we just exit.
That, or maybe add a flag that just uses some sort of `mem.eql` instead of the testing version that prints on error
Yeah, someone else is going to get tripped up on this too I'm sure
How should we handle malformed tokens? Here is an example.
Screenshot 2025-03-14 at 17.08.52.png
That currently gives us these tokens -- there isn't anything for the NUL byte
~~~TOKENS
KwModule(1:1-1:7),OpenSquare(1:8-1:9),CloseSquare(1:9-1:10),Newline(1:1-1:1),
LowerIdent(2:1-2:4),EndOfFile(2:5-2:5),
~~~PARSE
(file (1:1-2:5)
(module (1:1-1:10))
(ident (2:1-2:4) "" "foo"))
Oh wait.. I think I see other MalformedXXX tokens :man_facepalming:
Might just be .MalformedUnknownToken
Random comment from Loris that I think is a good idea when we consider making more intensely optimized parsers via SIMD, or only using start offsets, or similar:
Keep around the naive path. It makes a great fuzz oracle.
That might be a long time in the future, but something to consider for more invasive changes.
Yes!
Also good read on optimizing a tokenizer in zig: https://iridescence.me/2025/03/14/elyra-part-1-optimal-tokenization/
Last updated: Jul 06 2025 at 12:14 UTC