The Roc formatter is failing a fuzz test because it transforms malformed input with tokenization errors into output with different parse errors, violating the expectation that formatting should be stable and preserve error characteristics.
Input: module[]0}0.
Formatted Output: module []00.
Issue: Different error types between input and output
Original input module[]0}0.
tokenizes as:
KwModule(1:1-1:7)
→ module
OpenSquare(1:7-1:8)
→ [
CloseSquare(1:8-1:9)
→ ]
Int(1:9-1:10)
→ 0
Float(1:11-1:13)
→ 0.
(the }
at position 10 is processed as "OverClosedBrace" error and discarded)EndOfFile(1:13-1:13)
The }
character creates a tokenization diagnostic but doesn't appear in the token stream.
The parser sees the token sequence [Int, Float]
and creates two separate valid expression nodes:
(e-int @1-9-1-10 (raw "0"))
(e-frac @1-11-1-13 (raw "0."))
Key Issue: The tokenization error (}
) doesn't prevent successful parsing of two separate expressions.
The formatter processes each expression independently:
0
→ outputs 0
0.
→ outputs 0.
00.
(no separator because they're parsed as separate statements)When module []00.
is parsed:
00
tokenizes as Int
(with "LeadingZero" warning).
tokenizes as Dot
00
but fails to handle the unexpected .
token(e-malformed @1-12-1-13 (reason "expr_unexpected_token"))
OverClosedBrace
) + successful parsing (0 parse diagnostics)LeadingZero
) + parse error (expr_unexpected_token
)The fuzzer expects the formatted output to have the same number of parse diagnostics as the original (0), but gets 1 instead.
The moduleFmtsStable
function in src/fmt.zig
:
parseAndFmt
expects no parse diagnostics:zig
std.testing.expectEqualSlices(AST.Diagnostic, &[_]AST.Diagnostic{}, parse_ast.parse_diagnostics.items) catch {
return error.ParseFailed;
};
The core issue is that tokenization errors cause information loss that makes it impossible for the formatter to recreate something faithful to the original. The }
character is completely lost because:
The fundamental issue is that malformed syntax should be better represented in the AST so the formatter can recreate something closer to the original. Specifically:
Modify the tokenization/parsing pipeline to preserve error tokens in the AST so they can be reconstructed during formatting.
$ zig build repro-parse -- -b bW9kdWxlW10wfTAu -v
@Joshua Warner @Anthony Bullard -- I'm not sure the best way forward on this one. I'm not keen to make a big mess of the tokenizer. I'm hoping you have some advice for me.
Of note, if something is truly broken, the correct solution may be to bail and just verbatim copy the original source.
Of course nicest if we can do that for the minimal chunk and then format everything else
I've never seen a language where the formatter does ANYTHING when there is a parse error
If there is a parse error the formatter should return an error and bail
I understand "Never Block, Always Inform", but I think at a minimum it has to have valid parse tree for the compiler to work with
I would expect treesitter-like parsing. In some cases it's possible to format what is known to be correct and leave unknowns as is
treesitter can do that because of it being a concrete syntax tree, but we are expecting to be able to do something with the output - which is unlikely to be meaningful with parse errors.
We could try to format top level constructs until the first one that contains a malformed, and then just print the original source from there
But once you have a parse error, it's not always possible to output something reasonable from there on out
But I personally like a simple bail out and error (for a specific module) when there is a parse error, along with a good diagnostic so that I can fix it quickly
Anthony Bullard said:
I've never seen a language where the formatter does ANYTHING when there is a parse error
I know, and I hate it :laughing:
it leads to the common experience of "I'm trying to work on this code and everything looks terrible until I stop what I'm doing and hunt down the missing delimiter, then suddenly everything I'd been working on snaps jarringly into place"
Anthony Bullard said:
But I personally like a simple bail out and error (for a specific module) when there is a parse error, along with a good diagnostic so that I can fix it quickly
Generally agree, but an often scenario for me is writing a long function not paying attention at formatting. Then format and save. If there was an error during parsing, I'd like to have valid formatting at least up to the first error. This way it's easier for me to read the messy nonsense I wrote.
I think the better approach is what Zig does, which is where there are certain points that the parser can say "okay we've gotten into an invalid state, but we'll stop being in an invalid state once we get to a certain marker that we can recover from"
a very simple example of this is an unclosed single-line string literal
such as:
foo = "missing close quote
bar = "all good"
although we're missing the closing quote, we know that it's a single-line string literal, so as soon as we encounter a newline, the parser can end the literal and report a problem
and then continue as normal
Yeah, exactly! There are cases with no ambiguity!
we can adopt this philosophy in other situations too besides string literals
I'm down with parse error recovery for unambiguous cases like that
a really useful one is top-level declarations being a boundary
for example, suppose we hit an unexpected assignment before a literal was closed, e.g. [5, 6, x =
if we look at the x =
and discover that there's a newline right before the x =
in the source code, then we can helpfully assume it's intended to be a new top-level declaration.
at that point we can end the current top-level decl we've been parsing (and record it as an error with an unclosed delimiter) and continue parsing the rest of the module instead of giving up
But for cases like the motivating source at the top of this thread:
module[]0}0
There is nothing useful for us to do here
agreed
But if I forget a comma in a list or record literal, I agree we can probably help there
But that may be a fair amount of complexity in the formatter - and if that's worth it that's ok! But we should go into that with clear eyes
yeah I definitely think it makes the formatter more complicated, and I also definitely think it's worth it :smile:
to me the analogy is high-quality error messages - more compiler complexity for a nicer end user experience when a mistake has been made
So we are really saying "in the formatter, if there is an unambiguous case of a syntax error we will fix it for you"
I'm trying to figure out what we could and could not do
I think it makes sense to support these things on a case by case basis
Things we definitely COULD do:
for example, start with the unclosed string literal one
I think the top-level decl one is a good one because it can really isolate error scenarios
also removing redundant commas, e.g. [1, 2,, 3]
Yeah, that's what I was suggesting above
But the chances of us doing something very wrong, and very irritating goes up FAST if we try to format after an unrecoverable-from parse error
Even across that "boundary"
I think if we pick the right heuristics we'll be ok
I think heuristics are a tough thing when you are changing someone's source code
If that heuristic is wrong 2% of the time, there is going to be some very frustrated users
But maybe I'm misunderestimating us :laughing:
Yeah, my complexity spidey sense is tingling here :p
what I mean is that we need to be thoughtful about which things we do and don't choose to recover from
like the unclosed string literal seems super uncontroversial to me
the top level decl one is more likely to have a false positive in theory, but seems so unlikely to me that I'd rather have it than not
in contrast, trying to auto close delimiters inside a given expression sounds much riskier
and by default I'd want to not do that
I think assuming we can find the right closing, it makes sense to format other things. For example, a broken header probably should not affect formatting the main function.
And one function should not affect another
I think both of those are handled by the top-level decl thing
For this specific example, I'd actually suggest three different fixes (defense in depth). This has uncovered multiple separate problems.
Joshua Warner said:
For this specific example, I'd actually suggest three different fixes (defense in depth). This has uncovered multiple separate problems.
- The over-closed brace should be reflected in the token stream and we should let the parser "forward" that error to the parse tree so we can see where in the tree it ends up
- We should make sure that idents/keywords/numbers can never accidentally be merged in the formatter by keeping track of the "kind" of token we last spit into the output, and if we just spit out a ident/keyword/number and we're about to spit another one, we should force a space.
- If the formatter sees a node with a parse error under it, we should _default_ to working out what the range of tokens that corresponds to, and copying that range from the source verbatim. Only if we feel confident in isolating the error and formatting the rest (or fixing the error all together), should we try to format.
I don't think I understand 1 quite fully. 2 is a great idea, and I think 3 was the consensus we reached.
@Brendan Hansknecht I think situations like:
module [
foo,
bar = 42
Are hard to work out. If we comsume foo and bar and then bail when we see the =
, we can't parse the next decl correctly
And it's pretty contentious to work out the intent there through heuristics even though in this example the intent is pretty clear
module [
foo,
bar = 42
That module header has no clear end. We could guess one, but I expect that to break formatting...at least until we hit something that clearly can't be in the header, but full bail is fine.
Though if an equal sign is not valid in a header, we could start formatting everything after the bar line
I think we would do that. we would copy the two spans for the malformed header, and malformed statement and print them verbatim, and then continue down the statements
On the other hand, if foo was some garbage instead, but the module header has an ending ]
, we could skip formatting the module header and format everything after
what's wrong with the heuristic of "we hit a parsing error and noticed that the decl has a \n in front, so we assume it's supposed to be a new top-level decl and proceed accordingly?"
Yeah, a case like:
module [
@#&,
]
bar = 42
Is easy
Anthony Bullard said:
I think we would do that. we would copy the two spans for the malformed header, and malformed statement and print them verbatim, and then continue down the statements
Yeah, sounds great....wasn't sure what level we handle currently. The original failing case like shared seemed to suggest that we were still trying to format what is invalid
Well the formatter doesn't really know about newlines
I mean in the parser
it can look at the source byte right before the pattern's region begins
so this is about parser recoverability more than formatter
Hmmm....that sounds more complicated than formatter error recovery, and requires state to be passed forward through the parser, no?
So a lookbehind?
the benefit is that we can keep checking the rest of the file and give you errors as normal
if I had to pick between either parser or formatter error tolerance, I'd pick parser for that reason :smile:
I don't think special state handling would be needed; if we've found an =
, the parser ends up knowing the lhs and the rhs nodes of that =
so from there we look at the lhs node's region, and that tells us all we need to know where to look for the newline
and the parse error would occur right when we encountered the unexpected =
Btw, is there a plan on having an incremental parser?
depends on what you mean by that :smile:
I suppoae a lot of heuristics would come from deltas. Like, if we know the prev state is stable, and changes introduced a problem - we can narrow the invalid part of the tree
I think the parser should only have full source bytes as its input
I don't think we should have a version of the parser that takes "what we parsed to last time" as an input
Last updated: Jul 06 2025 at 12:14 UTC