fuzz crash - handling malformed tokens and parsing · compiler development

Stream: compiler development

Topic: fuzz crash - handling malformed tokens and parsing

Luke Boswell (Jun 28 2025 at 07:36):

Problem Description: Formatter Creates Different Parse Errors for Malformed Input

Summary

The Roc formatter is failing a fuzz test because it transforms malformed input with tokenization errors into output with different parse errors, violating the expectation that formatting should be stable and preserve error characteristics.

Specific Case

Input: module[]0}0.
Formatted Output: module []00.
Issue: Different error types between input and output

Root Cause Analysis

1. Tokenization Behavior

Original input module[]0}0. tokenizes as:

KwModule(1:1-1:7) → module
OpenSquare(1:7-1:8) → [
CloseSquare(1:8-1:9) → ]
Int(1:9-1:10) → 0
Float(1:11-1:13) → 0. (the } at position 10 is processed as "OverClosedBrace" error and discarded)
EndOfFile(1:13-1:13)

The } character creates a tokenization diagnostic but doesn't appear in the token stream.

2. Parsing Behavior

The parser sees the token sequence [Int, Float] and creates two separate valid expression nodes:

(e-int @1-9-1-10 (raw "0"))
(e-frac @1-11-1-13 (raw "0."))

Key Issue: The tokenization error (}) doesn't prevent successful parsing of two separate expressions.

3. Formatting Behavior

The formatter processes each expression independently:

Integer 0 → outputs 0
Float 0. → outputs 0.
Combined: 00. (no separator because they're parsed as separate statements)

4. Re-parsing the Formatted Output

When module []00. is parsed:

00 tokenizes as Int (with "LeadingZero" warning)
. tokenizes as Dot
Parser creates an integer expression for 00 but fails to handle the unexpected . token
Results in (e-malformed @1-12-1-13 (reason "expr_unexpected_token"))

Error Type Transformation

Original: Tokenization error (OverClosedBrace) + successful parsing (0 parse diagnostics)
Formatted: Tokenization warning (LeadingZero) + parse error (expr_unexpected_token)

The fuzzer expects the formatted output to have the same number of parse diagnostics as the original (0), but gets 1 instead.

Technical Details

Fuzzer Test Logic

The moduleFmtsStable function in src/fmt.zig:

Formats the input once
Attempts to format it again
The second attempt fails because parseAndFmt expects no parse diagnostics:
zig std.testing.expectEqualSlices(AST.Diagnostic, &[_]AST.Diagnostic{}, parse_ast.parse_diagnostics.items) catch { return error.ParseFailed; };

Information Loss

The core issue is that tokenization errors cause information loss that makes it impossible for the formatter to recreate something faithful to the original. The } character is completely lost because:

It's processed as an error during tokenization
No AST node represents it
The formatter has no way to know it existed

Proposed Solution Direction

The fundamental issue is that malformed syntax should be better represented in the AST so the formatter can recreate something closer to the original. Specifically:

Possible Solution: Token Error Preservation

Modify the tokenization/parsing pipeline to preserve error tokens in the AST so they can be reconstructed during formatting.

Reproduction

$ zig build repro-parse -- -b bW9kdWxlW10wfTAu -v

Luke Boswell (Jun 28 2025 at 07:37):

@Joshua Warner @Anthony Bullard -- I'm not sure the best way forward on this one. I'm not keen to make a big mess of the tokenizer. I'm hoping you have some advice for me.

Brendan Hansknecht (Jun 28 2025 at 07:52):

Of note, if something is truly broken, the correct solution may be to bail and just verbatim copy the original source.

Brendan Hansknecht (Jun 28 2025 at 07:53):

Of course nicest if we can do that for the minimal chunk and then format everything else

Anthony Bullard (Jun 28 2025 at 11:26):

I've never seen a language where the formatter does ANYTHING when there is a parse error

Anthony Bullard (Jun 28 2025 at 11:27):

If there is a parse error the formatter should return an error and bail

Anthony Bullard (Jun 28 2025 at 11:28):

I understand "Never Block, Always Inform", but I think at a minimum it has to have valid parse tree for the compiler to work with

Kiryl Dziamura (Jun 28 2025 at 11:31):

I would expect treesitter-like parsing. In some cases it's possible to format what is known to be correct and leave unknowns as is

Anthony Bullard (Jun 28 2025 at 11:43):

treesitter can do that because of it being a concrete syntax tree, but we are expecting to be able to do something with the output - which is unlikely to be meaningful with parse errors.

Anthony Bullard (Jun 28 2025 at 11:43):

We could try to format top level constructs until the first one that contains a malformed, and then just print the original source from there

Anthony Bullard (Jun 28 2025 at 11:44):

But once you have a parse error, it's not always possible to output something reasonable from there on out

Anthony Bullard (Jun 28 2025 at 11:45):

But I personally like a simple bail out and error (for a specific module) when there is a parse error, along with a good diagnostic so that I can fix it quickly

Richard Feldman (Jun 28 2025 at 11:59):

Anthony Bullard said:

I've never seen a language where the formatter does ANYTHING when there is a parse error

I know, and I hate it :laughing:

Richard Feldman (Jun 28 2025 at 12:00):

it leads to the common experience of "I'm trying to work on this code and everything looks terrible until I stop what I'm doing and hunt down the missing delimiter, then suddenly everything I'd been working on snaps jarringly into place"

Kiryl Dziamura (Jun 28 2025 at 12:00):

Anthony Bullard said:

But I personally like a simple bail out and error (for a specific module) when there is a parse error, along with a good diagnostic so that I can fix it quickly

Generally agree, but an often scenario for me is writing a long function not paying attention at formatting. Then format and save. If there was an error during parsing, I'd like to have valid formatting at least up to the first error. This way it's easier for me to read the messy nonsense I wrote.

Richard Feldman (Jun 28 2025 at 12:01):

I think the better approach is what Zig does, which is where there are certain points that the parser can say "okay we've gotten into an invalid state, but we'll stop being in an invalid state once we get to a certain marker that we can recover from"

Richard Feldman (Jun 28 2025 at 12:01):

a very simple example of this is an unclosed single-line string literal

Richard Feldman (Jun 28 2025 at 12:01):

such as:

foo = "missing close quote
bar = "all good"

Richard Feldman (Jun 28 2025 at 12:02):

although we're missing the closing quote, we know that it's a single-line string literal, so as soon as we encounter a newline, the parser can end the literal and report a problem

Richard Feldman (Jun 28 2025 at 12:02):

and then continue as normal

Kiryl Dziamura (Jun 28 2025 at 12:03):

Yeah, exactly! There are cases with no ambiguity!

Richard Feldman (Jun 28 2025 at 12:03):

we can adopt this philosophy in other situations too besides string literals

Anthony Bullard (Jun 28 2025 at 12:05):

I'm down with parse error recovery for unambiguous cases like that

Richard Feldman (Jun 28 2025 at 12:06):

a really useful one is top-level declarations being a boundary

for example, suppose we hit an unexpected assignment before a literal was closed, e.g. [5, 6, x =

if we look at the x = and discover that there's a newline right before the x = in the source code, then we can helpfully assume it's intended to be a new top-level declaration.

at that point we can end the current top-level decl we've been parsing (and record it as an error with an unclosed delimiter) and continue parsing the rest of the module instead of giving up

Anthony Bullard (Jun 28 2025 at 12:06):

But for cases like the motivating source at the top of this thread:

module[]0}0

There is nothing useful for us to do here

Richard Feldman (Jun 28 2025 at 12:06):

agreed

Anthony Bullard (Jun 28 2025 at 12:07):

But if I forget a comma in a list or record literal, I agree we can probably help there

Anthony Bullard (Jun 28 2025 at 12:08):

But that may be a fair amount of complexity in the formatter - and if that's worth it that's ok! But we should go into that with clear eyes

Richard Feldman (Jun 28 2025 at 12:08):

yeah I definitely think it makes the formatter more complicated, and I also definitely think it's worth it :smile:

Richard Feldman (Jun 28 2025 at 12:08):

to me the analogy is high-quality error messages - more compiler complexity for a nicer end user experience when a mistake has been made

Anthony Bullard (Jun 28 2025 at 12:09):

So we are really saying "in the formatter, if there is an unambiguous case of a syntax error we will fix it for you"

Anthony Bullard (Jun 28 2025 at 12:10):

I'm trying to figure out what we could and could not do

Richard Feldman (Jun 28 2025 at 12:11):

I think it makes sense to support these things on a case by case basis

Anthony Bullard (Jun 28 2025 at 12:11):

Things we definitely COULD do:

Terminate unclosed string literals
Terminate unclosed lists/tuples/records that are single line and not deeply nested
Insert commas into lists/tuples/records/calls

Richard Feldman (Jun 28 2025 at 12:11):

for example, start with the unclosed string literal one

Richard Feldman (Jun 28 2025 at 12:12):

I think the top-level decl one is a good one because it can really isolate error scenarios

Richard Feldman (Jun 28 2025 at 12:12):

also removing redundant commas, e.g. [1, 2,, 3]

Anthony Bullard (Jun 28 2025 at 12:12):

Yeah, that's what I was suggesting above

Anthony Bullard (Jun 28 2025 at 12:13):

But the chances of us doing something very wrong, and very irritating goes up FAST if we try to format after an unrecoverable-from parse error

Anthony Bullard (Jun 28 2025 at 12:15):

Even across that "boundary"

Richard Feldman (Jun 28 2025 at 12:19):

I think if we pick the right heuristics we'll be ok

Anthony Bullard (Jun 28 2025 at 12:19):

I think heuristics are a tough thing when you are changing someone's source code

Anthony Bullard (Jun 28 2025 at 12:20):

If that heuristic is wrong 2% of the time, there is going to be some very frustrated users

Anthony Bullard (Jun 28 2025 at 12:21):

But maybe I'm misunderestimating us :laughing:

Anton (Jun 28 2025 at 12:47):

Yeah, my complexity spidey sense is tingling here :p

Richard Feldman (Jun 28 2025 at 13:21):

what I mean is that we need to be thoughtful about which things we do and don't choose to recover from

Richard Feldman (Jun 28 2025 at 13:23):

like the unclosed string literal seems super uncontroversial to me

Richard Feldman (Jun 28 2025 at 13:23):

the top level decl one is more likely to have a false positive in theory, but seems so unlikely to me that I'd rather have it than not

Richard Feldman (Jun 28 2025 at 13:24):

in contrast, trying to auto close delimiters inside a given expression sounds much riskier

Richard Feldman (Jun 28 2025 at 13:24):

and by default I'd want to not do that

Brendan Hansknecht (Jun 28 2025 at 15:22):

I think assuming we can find the right closing, it makes sense to format other things. For example, a broken header probably should not affect formatting the main function.

Brendan Hansknecht (Jun 28 2025 at 15:22):

And one function should not affect another

Richard Feldman (Jun 28 2025 at 15:58):

I think both of those are handled by the top-level decl thing

Joshua Warner (Jun 28 2025 at 15:59):

For this specific example, I'd actually suggest three different fixes (defense in depth). This has uncovered multiple separate problems.

The over-closed brace should be reflected in the token stream and we should let the parser "forward" that error to the parse tree so we can see where in the tree it ends up
We should make sure that idents/keywords/numbers can never accidentally be merged in the formatter by keeping track of the "kind" of token we last spit into the output, and if we just spit out a ident/keyword/number and we're about to spit another one, we should force a space.
If the formatter sees a node with a parse error under it, we should _default_ to working out what the range of tokens that corresponds to, and copying that range from the source verbatim. Only if we feel confident in isolating the error and formatting the rest (or fixing the error all together), should we try to format.

Anthony Bullard (Jun 28 2025 at 17:26):

Joshua Warner said:

For this specific example, I'd actually suggest three different fixes (defense in depth). This has uncovered multiple separate problems.

The over-closed brace should be reflected in the token stream and we should let the parser "forward" that error to the parse tree so we can see where in the tree it ends up

We should make sure that idents/keywords/numbers can never accidentally be merged in the formatter by keeping track of the "kind" of token we last spit into the output, and if we just spit out a ident/keyword/number and we're about to spit another one, we should force a space.

If the formatter sees a node with a parse error under it, we should _default_ to working out what the range of tokens that corresponds to, and copying that range from the source verbatim. Only if we feel confident in isolating the error and formatting the rest (or fixing the error all together), should we try to format.

I don't think I understand 1 quite fully. 2 is a great idea, and I think 3 was the consensus we reached.

Anthony Bullard (Jun 28 2025 at 17:28):

@Brendan Hansknecht I think situations like:

 module [
    foo,

bar = 42

Are hard to work out. If we comsume foo and bar and then bail when we see the =, we can't parse the next decl correctly

Anthony Bullard (Jun 28 2025 at 17:29):

And it's pretty contentious to work out the intent there through heuristics even though in this example the intent is pretty clear

Brendan Hansknecht (Jun 28 2025 at 17:35):

 module [
    foo,

bar = 42

That module header has no clear end. We could guess one, but I expect that to break formatting...at least until we hit something that clearly can't be in the header, but full bail is fine.

Though if an equal sign is not valid in a header, we could start formatting everything after the bar line

Anthony Bullard (Jun 28 2025 at 17:36):

I think we would do that. we would copy the two spans for the malformed header, and malformed statement and print them verbatim, and then continue down the statements

Brendan Hansknecht (Jun 28 2025 at 17:36):

On the other hand, if foo was some garbage instead, but the module header has an ending ], we could skip formatting the module header and format everything after

Richard Feldman (Jun 28 2025 at 17:37):

what's wrong with the heuristic of "we hit a parsing error and noticed that the decl has a \n in front, so we assume it's supposed to be a new top-level decl and proceed accordingly?"

Anthony Bullard (Jun 28 2025 at 17:37):

Yeah, a case like:

 module [
    @#&,
]

bar = 42

Is easy

Brendan Hansknecht (Jun 28 2025 at 17:37):

Anthony Bullard said:

I think we would do that. we would copy the two spans for the malformed header, and malformed statement and print them verbatim, and then continue down the statements

Yeah, sounds great....wasn't sure what level we handle currently. The original failing case like shared seemed to suggest that we were still trying to format what is invalid

Anthony Bullard (Jun 28 2025 at 17:37):

Well the formatter doesn't really know about newlines

Richard Feldman (Jun 28 2025 at 17:38):

I mean in the parser

Richard Feldman (Jun 28 2025 at 17:38):

it can look at the source byte right before the pattern's region begins

Richard Feldman (Jun 28 2025 at 17:38):

so this is about parser recoverability more than formatter

Anthony Bullard (Jun 28 2025 at 17:39):

Hmmm....that sounds more complicated than formatter error recovery, and requires state to be passed forward through the parser, no?

Anthony Bullard (Jun 28 2025 at 17:40):

So a lookbehind?

Richard Feldman (Jun 28 2025 at 17:42):

the benefit is that we can keep checking the rest of the file and give you errors as normal

Richard Feldman (Jun 28 2025 at 17:42):

if I had to pick between either parser or formatter error tolerance, I'd pick parser for that reason :smile:

Richard Feldman (Jun 28 2025 at 17:44):

I don't think special state handling would be needed; if we've found an =, the parser ends up knowing the lhs and the rhs nodes of that =

Richard Feldman (Jun 28 2025 at 17:45):

so from there we look at the lhs node's region, and that tells us all we need to know where to look for the newline

Richard Feldman (Jun 28 2025 at 17:45):

and the parse error would occur right when we encountered the unexpected =

Kiryl Dziamura (Jun 28 2025 at 17:59):

Btw, is there a plan on having an incremental parser?

Richard Feldman (Jun 28 2025 at 18:01):

depends on what you mean by that :smile:

Kiryl Dziamura (Jun 28 2025 at 18:03):

I suppoae a lot of heuristics would come from deltas. Like, if we know the prev state is stable, and changes introduced a problem - we can narrow the invalid part of the tree

Richard Feldman (Jun 28 2025 at 18:06):

I think the parser should only have full source bytes as its input

Richard Feldman (Jun 28 2025 at 18:06):

I don't think we should have a version of the parser that takes "what we parsed to last time" as an input

Last updated: Jul 26 2025 at 12:14 UTC