zig compiler - parser diagnostics · compiler development

For the parser (and the tokenizer, really), I've been thinking about a couple things: (1) I'd like to force as-much-as-possible, for us to do tolerant-parsing: mark a node as malformed and move on, and (2) how to record diagnostics.

Imagine, for a moment, that the parser didn't have a separate side channel to write diagnostic messages to. And, when we hit an error, we should probably be putting some kind of 'malformed' node in the tree.

Why don't we make those the same thing? i.e. make the _only_ way to report diagnostics be to put a 'malformed' node into the tree?

This way we can structurally guarantee that we're doing our best-effort at error recovery, and it also simplifies the interface.

We can do the same thing for the tokenizer - tokenizer diagnostics are simply error tokens inserted into that stream. When the parser sees one, it dutifully puts a malformed_token node into the tree and does its error recovery.

At the end of parsing we can quickly grab these back out by checking the list of nodes with SIMD instructions to fetch the list of diagnostics out of the associated data for each malformed/malformed_token node. If we simply store a bool for whether we encountered an error during parsing, we can skip this step in the happy path.

Richard Feldman (Feb 12 2025 at 04:44):

Richard Feldman (Feb 12 2025 at 04:45):

the nodes in later stages might be too small for the amount of error context we'd want to include, but seems like it should work great in parsing!

Richard Feldman (Feb 12 2025 at 04:45):

Joshua Warner (Feb 12 2025 at 04:52):

Joshua Warner (Feb 12 2025 at 05:00):

Looking thru the current tokenizer diagnostics, I think this works for most of them but not all

Joshua Warner (Feb 12 2025 at 05:00):

Joshua Warner (Feb 12 2025 at 05:02):

There are some things where we could in principle let the parser ignore the problem and just make a "side-note" - e.g. MisplacedCarriageReturn, AsciiControl, LeadingZero, and UppercaseBase, all of which are really "warnings" rather than errors.

Joshua Warner (Feb 12 2025 at 05:03):

Joshua Warner (Feb 12 2025 at 05:07):

Actually - if the formatter is just going to fix something up (e.g. fix the misplaced carriage return), and someone has "format on save" hooked up, it feels kinda silly to be reporting those kind of warnings

Richard Feldman (Feb 12 2025 at 12:26):

Richard Feldman (Feb 12 2025 at 12:27):

hm, actually I guess we could have roc format --check take care of reporting those? :thinking:

Richard Feldman (Feb 12 2025 at 12:28):

eh, but those are actual mistakes, not stylistic choices, and roc check should tell you about mistakes

Richard Feldman (Feb 12 2025 at 12:29):

roc format --check should just tell you about when your code is off stylistically

Brendan Hansknecht (Feb 12 2025 at 23:51):

That way, you still are required to have the node, but no fancy check or simd required

Brendan Hansknecht (Feb 12 2025 at 23:51):

Brendan Hansknecht (Feb 12 2025 at 23:52):

Can literally make the only way to add something to the data structure is to generate a malformed node at the same time

Brendan Hansknecht (Feb 12 2025 at 23:53):

For later stages, I think we just have to map all of these nodes to the crash builtin. Then the build stage of the compiler literally doesn't need to know anything about the malformed nodes or concept

Brendan Hansknecht (Feb 12 2025 at 23:53):

So the mapping to crashes would be the final stage of check (or first stage of build)

Brendan Hansknecht (Feb 12 2025 at 23:54):

Might be nice to do that stage before running the interpreter anyway, but that is optional

Sam Mohr (Feb 12 2025 at 23:58):

It'd be nice to show a message saying "you got 32 compiler problems, that's bad!"

Sam Mohr (Feb 12 2025 at 23:59):

And I don't know how to do that if we immediately convert the CompilerProblem nodes into crashes

Brendan Hansknecht (Feb 13 2025 at 00:00):

Oh, this wouldn't be immediately. They would be malformed nodes through check and check can report all these things. The backend just doesn't care about frontend compiler problems that should turn into crashes so that roc can still run in a partially broken state.

Brendan Hansknecht (Feb 13 2025 at 00:00):

Brendan Hansknecht (Feb 13 2025 at 00:01):

Sam Mohr (Feb 13 2025 at 00:05):

Sam Mohr (Feb 13 2025 at 00:06):

Brendan Hansknecht (Feb 13 2025 at 00:49):

Brendan Hansknecht (Feb 13 2025 at 00:50):

We easily could collect compiler problems after each stage and aggregate them up. Given they should be exceptionally rare, it is even fine if they have more costly handling.

Anthony Bullard (Feb 14 2025 at 14:58):

Hey y'all, I've been in bed with the flu for the past 4 days. I apologize for not communicating better but it hit me suddenly and hard.

Anthony Bullard (Feb 14 2025 at 14:59):

I should have my conflicts resolved and my parser PR ready to merge sometime today. But I also have a LOT of work to catch back up on today

Joshua Warner (Feb 14 2025 at 16:08):

No worries! Sounds like you were doing what you should have been doing - resting!

Richard Feldman (Feb 14 2025 at 16:20):

sorry to hear about the flu - hope you're feeling better, and don't worry about PRs. They can wait. :big_smile:

Anthony Bullard (Feb 15 2025 at 15:51):

Has anyone ever seen GIt have behavior like this? My editor has saved buffers for Parser.zig and IR.zig, and git knows about them, but they don't show up in my filesystem. I don't think I've seen this in my 13 years of using Git.

Richard Feldman (Feb 15 2025 at 16:02):

Anthony Bullard (Feb 15 2025 at 16:02):

Joshua Warner (Feb 15 2025 at 16:03):

Anthony Bullard (Feb 15 2025 at 16:03):

Anthony Bullard (Feb 15 2025 at 16:04):

Joshua Warner (Feb 15 2025 at 16:04):

Anthony Bullard (Feb 15 2025 at 16:04):

Joshua Warner (Feb 15 2025 at 16:05):

Anthony Bullard (Feb 15 2025 at 16:05):

Joshua Warner (Feb 15 2025 at 16:06):

Anthony Bullard (Feb 15 2025 at 16:06):

Anthony Bullard (Feb 15 2025 at 16:07):

I'll probably keep working on some of the low hanging fruit since my brain is still only at 80% capacity at the moment

Joshua Warner (Feb 15 2025 at 16:08):

Joshua Warner (Feb 15 2025 at 16:09):

I have some partially complete local changes to start emitting diagnostics from the parser instead of panicking

Joshua Warner (Feb 15 2025 at 16:10):

I think at this point the braces discussion has settled, so you could start converting the parser to that

Joshua Warner (Feb 15 2025 at 16:12):

Anthony Bullard (Feb 15 2025 at 16:15):

Anthony Bullard (Feb 15 2025 at 16:16):

I need to look at what feedback I didn't address. I thought I had addressed the last of Sam's feedback, but maybe that never made it in?

Joshua Warner (Feb 15 2025 at 16:16):

Joshua Warner (Feb 15 2025 at 16:17):

Joshua Warner (Feb 15 2025 at 16:18):

And doesn’t apply directly in the parser currently anyway since we’re catching ooms and panicking at a lower level

Joshua Warner (Feb 15 2025 at 16:18):

Anthony Bullard (Feb 15 2025 at 16:22):

Anthony Bullard (Feb 15 2025 at 16:23):

Joshua Warner (Feb 15 2025 at 16:23):

Anthony Bullard (Feb 15 2025 at 16:23):

I think planning out the move to braces is the most straightforward thing right now - and the most pressing

Anthony Bullard (Feb 15 2025 at 16:24):

The one thing I am pained about this change is - I will not be able to use our existing corpus of snapshots for testing

Anthony Bullard (Feb 15 2025 at 16:24):

Anthony Bullard (Feb 15 2025 at 16:25):

Which makes me wonder if we should try to develop a new set of principles for adding snapshots while we are at it

Joshua Warner (Feb 15 2025 at 16:25):

One thing that crossed my mind was to do a one-time formatter change to the old compiler, in order to spit out this new syntax

Anthony Bullard (Feb 15 2025 at 16:25):

Anthony Bullard (Feb 15 2025 at 16:26):

Joshua Warner (Feb 15 2025 at 16:26):

Anthony Bullard (Feb 15 2025 at 16:26):

Don't worry, I'm bogged down enough physically/mentally right now that I won't make TOO much progress while you do that

Joshua Warner (Feb 15 2025 at 18:17):

Essentially, have a .malformed tag that is allowed to stand in for any other node kind

Joshua Warner (Feb 15 2025 at 18:18):

I didn't implement this part yet, but when calling getStatement (for example), that'd return an error (or null?) if we're looking at a malformed node.

Anthony Bullard (Feb 19 2025 at 23:05):

Anthony Bullard (Feb 19 2025 at 23:06):

My biggest issue with malformed is: "How do we decide that that we've included enough in the malformed node to make the continuation of the token stream resulting in a well-formed node?"

Joshua Warner (Feb 20 2025 at 03:07):

A common heuristic is to consume tokens until the end of the line, and until all braces have been matched (supposing it’s not a braces problem)

Richard Feldman (Feb 20 2025 at 03:20):

like there are some "cutoff points" (or something - I forget what term they used) where you can tell it's definitely the end of something

Richard Feldman (Feb 20 2025 at 03:20):

Richard Feldman (Feb 20 2025 at 03:21):

you can say "ok, we don't know where it was supposed to end, but we can cut it off at the newline and go back to parsing normally so we don't have the rest of the file being treated as a string literal"

Richard Feldman (Feb 20 2025 at 03:22):

we could do something similar with top-level declarations although maybe it's not worth it to detect that :big_smile:

Richard Feldman (Feb 20 2025 at 03:22):

Joshua Warner (Feb 20 2025 at 03:23):

We also might be able to use indentation and/braces as guides. We can know whether braces are properly matched in the file (the tokenizer needs to compute this anyway). We can probably jump to the nearest closing brace.

Norbert Hajagos (Feb 20 2025 at 08:55):

A while back I was reading the source of rust-analyzer and they do the same there. Let me try to dig it out from my memory: When expecting a specific token, they also provide a set of "recovery tokens" which could start a node, even though such node wouldn't be in the right place.

fn struct User

tokens wouldn't be wrapped in an ErrorNode. struct would be included in the error recovery set when expecting an identifier of a function. When the parser sees a struct, an error would be reported that a function identifier was expected. Then it would start to parse a struct as normal.

fn 2

2 would not be in the recovery set, since it isn't a start of a node they would want to parse for resilience. It would just get skipped in it's own (which means, wrap it in an ErrorNode).
I remember there was a rule specifically to never skip a } in such cases in order to keep the braces balanced. They'd end the malformed node before the brace and return from that specific parse function. The caller will probably expect a closing brace to be the next token. At worst, you are getting an unmatched closing brace error, which is very easy to fix.

Stream: compiler development

Topic: zig compiler - parser diagnostics

Joshua Warner (Feb 12 2025 at 04:33):

Richard Feldman (Feb 12 2025 at 04:44):

Richard Feldman (Feb 12 2025 at 04:45):

Richard Feldman (Feb 12 2025 at 04:45):

Joshua Warner (Feb 12 2025 at 04:52):

Joshua Warner (Feb 12 2025 at 04:52):

Joshua Warner (Feb 12 2025 at 05:00):

Joshua Warner (Feb 12 2025 at 05:00):

Joshua Warner (Feb 12 2025 at 05:02):

Joshua Warner (Feb 12 2025 at 05:03):

Joshua Warner (Feb 12 2025 at 05:07):

Richard Feldman (Feb 12 2025 at 12:26):

Richard Feldman (Feb 12 2025 at 12:27):

Richard Feldman (Feb 12 2025 at 12:28):

Richard Feldman (Feb 12 2025 at 12:29):

Brendan Hansknecht (Feb 12 2025 at 23:51):

Brendan Hansknecht (Feb 12 2025 at 23:51):

Brendan Hansknecht (Feb 12 2025 at 23:51):

Brendan Hansknecht (Feb 12 2025 at 23:52):

Brendan Hansknecht (Feb 12 2025 at 23:53):

Brendan Hansknecht (Feb 12 2025 at 23:53):

Brendan Hansknecht (Feb 12 2025 at 23:54):

Sam Mohr (Feb 12 2025 at 23:58):

Sam Mohr (Feb 12 2025 at 23:59):

Brendan Hansknecht (Feb 13 2025 at 00:00):

Brendan Hansknecht (Feb 13 2025 at 00:00):

Brendan Hansknecht (Feb 13 2025 at 00:00):

Brendan Hansknecht (Feb 13 2025 at 00:00):

Brendan Hansknecht (Feb 13 2025 at 00:01):

Sam Mohr (Feb 13 2025 at 00:05):

Sam Mohr (Feb 13 2025 at 00:06):

Brendan Hansknecht (Feb 13 2025 at 00:49):

Brendan Hansknecht (Feb 13 2025 at 00:50):

Anthony Bullard (Feb 14 2025 at 14:58):

Anthony Bullard (Feb 14 2025 at 14:59):

Joshua Warner (Feb 14 2025 at 16:08):

Richard Feldman (Feb 14 2025 at 16:20):

Anthony Bullard (Feb 15 2025 at 15:51):

Richard Feldman (Feb 15 2025 at 16:02):

Richard Feldman (Feb 15 2025 at 16:02):

Anthony Bullard (Feb 15 2025 at 16:02):

Joshua Warner (Feb 15 2025 at 16:03):

Anthony Bullard (Feb 15 2025 at 16:03):

Anthony Bullard (Feb 15 2025 at 16:03):

Anthony Bullard (Feb 15 2025 at 16:04):

Anthony Bullard (Feb 15 2025 at 16:04):

Joshua Warner (Feb 15 2025 at 16:04):

Joshua Warner (Feb 15 2025 at 16:04):

Anthony Bullard (Feb 15 2025 at 16:04):

Joshua Warner (Feb 15 2025 at 16:05):

Anthony Bullard (Feb 15 2025 at 16:05):

Anthony Bullard (Feb 15 2025 at 16:05):

Anthony Bullard (Feb 15 2025 at 16:05):

Joshua Warner (Feb 15 2025 at 16:06):

Anthony Bullard (Feb 15 2025 at 16:06):

Anthony Bullard (Feb 15 2025 at 16:07):

Joshua Warner (Feb 15 2025 at 16:08):

Joshua Warner (Feb 15 2025 at 16:08):

Joshua Warner (Feb 15 2025 at 16:09):

Joshua Warner (Feb 15 2025 at 16:10):

Joshua Warner (Feb 15 2025 at 16:12):

Anthony Bullard (Feb 15 2025 at 16:15):

Anthony Bullard (Feb 15 2025 at 16:16):

Joshua Warner (Feb 15 2025 at 16:16):

Joshua Warner (Feb 15 2025 at 16:17):

Joshua Warner (Feb 15 2025 at 16:18):

Joshua Warner (Feb 15 2025 at 16:18):

Anthony Bullard (Feb 15 2025 at 16:22):

Anthony Bullard (Feb 15 2025 at 16:23):

Joshua Warner (Feb 15 2025 at 16:23):

Anthony Bullard (Feb 15 2025 at 16:23):

Anthony Bullard (Feb 15 2025 at 16:24):

Anthony Bullard (Feb 15 2025 at 16:24):

Anthony Bullard (Feb 15 2025 at 16:25):

Joshua Warner (Feb 15 2025 at 16:25):

Anthony Bullard (Feb 15 2025 at 16:25):

Anthony Bullard (Feb 15 2025 at 16:25):

Anthony Bullard (Feb 15 2025 at 16:25):