Stream: compiler development

Topic: zig compiler - parser diagnostics


view this post on Zulip Joshua Warner (Feb 12 2025 at 04:33):

For the parser (and the tokenizer, really), I've been thinking about a couple things: (1) I'd like to force as-much-as-possible, for us to do tolerant-parsing: mark a node as malformed and move on, and (2) how to record diagnostics.

Those don't seem related on the surface, but I think they actually are.

Imagine, for a moment, that the parser didn't have a separate side channel to write diagnostic messages to. And, when we hit an error, we should probably be putting some kind of 'malformed' node in the tree.

Why don't we make those the same thing? i.e. make the _only_ way to report diagnostics be to put a 'malformed' node into the tree?

This way we can structurally guarantee that we're doing our best-effort at error recovery, and it also simplifies the interface.

We can do the same thing for the tokenizer - tokenizer diagnostics are simply error tokens inserted into that stream. When the parser sees one, it dutifully puts a malformed_token node into the tree and does its error recovery.

At the end of parsing we can quickly grab these back out by checking the list of nodes with SIMD instructions to fetch the list of diagnostics out of the associated data for each malformed/malformed_token node. If we simply store a bool for whether we encountered an error during parsing, we can skip this step in the happy path.

Thoughts? (@Anthony Bullard?)

view this post on Zulip Richard Feldman (Feb 12 2025 at 04:44):

this sounds rad

view this post on Zulip Richard Feldman (Feb 12 2025 at 04:45):

the nodes in later stages might be too small for the amount of error context we'd want to include, but seems like it should work great in parsing!

view this post on Zulip Richard Feldman (Feb 12 2025 at 04:45):

single source of truth :muscle:

view this post on Zulip Joshua Warner (Feb 12 2025 at 04:52):

The same thing can probably work for canonicalization/desugaring

view this post on Zulip Joshua Warner (Feb 12 2025 at 04:52):

Agree it gets messy as you get deeper into the compiler

view this post on Zulip Joshua Warner (Feb 12 2025 at 05:00):

Looking thru the current tokenizer diagnostics, I think this works for most of them but not all

view this post on Zulip Joshua Warner (Feb 12 2025 at 05:00):

For anything that's _actually_ malformed, we should do this

view this post on Zulip Joshua Warner (Feb 12 2025 at 05:02):

There are some things where we could in principle let the parser ignore the problem and just make a "side-note" - e.g. MisplacedCarriageReturn, AsciiControl, LeadingZero, and UppercaseBase, all of which are really "warnings" rather than errors.

view this post on Zulip Joshua Warner (Feb 12 2025 at 05:03):

"hey this looks funky but I'll let it slide"

view this post on Zulip Joshua Warner (Feb 12 2025 at 05:07):

Actually - if the formatter is just going to fix something up (e.g. fix the misplaced carriage return), and someone has "format on save" hooked up, it feels kinda silly to be reporting those kind of warnings

view this post on Zulip Richard Feldman (Feb 12 2025 at 12:26):

I think it's important for CI

view this post on Zulip Richard Feldman (Feb 12 2025 at 12:27):

hm, actually I guess we could have roc format --check take care of reporting those? :thinking:

view this post on Zulip Richard Feldman (Feb 12 2025 at 12:28):

eh, but those are actual mistakes, not stylistic choices, and roc check should tell you about mistakes

view this post on Zulip Richard Feldman (Feb 12 2025 at 12:29):

roc format --check should just tell you about when your code is off stylistically

view this post on Zulip Brendan Hansknecht (Feb 12 2025 at 23:51):

Can we make them malformed nodes that store the issue side band?

view this post on Zulip Brendan Hansknecht (Feb 12 2025 at 23:51):

That way, you still are required to have the node, but no fancy check or simd required

view this post on Zulip Brendan Hansknecht (Feb 12 2025 at 23:51):

Just look at the length of the sideband data structure

view this post on Zulip Brendan Hansknecht (Feb 12 2025 at 23:52):

Can literally make the only way to add something to the data structure is to generate a malformed node at the same time

view this post on Zulip Brendan Hansknecht (Feb 12 2025 at 23:53):

For later stages, I think we just have to map all of these nodes to the crash builtin. Then the build stage of the compiler literally doesn't need to know anything about the malformed nodes or concept

view this post on Zulip Brendan Hansknecht (Feb 12 2025 at 23:53):

So the mapping to crashes would be the final stage of check (or first stage of build)

view this post on Zulip Brendan Hansknecht (Feb 12 2025 at 23:54):

Might be nice to do that stage before running the interpreter anyway, but that is optional

view this post on Zulip Sam Mohr (Feb 12 2025 at 23:58):

It'd be nice to show a message saying "you got 32 compiler problems, that's bad!"

view this post on Zulip Sam Mohr (Feb 12 2025 at 23:59):

And I don't know how to do that if we immediately convert the CompilerProblem nodes into crashes

view this post on Zulip Brendan Hansknecht (Feb 13 2025 at 00:00):

Oh, this wouldn't be immediately. They would be malformed nodes through check and check can report all these things. The backend just doesn't care about frontend compiler problems that should turn into crashes so that roc can still run in a partially broken state.

view this post on Zulip Brendan Hansknecht (Feb 13 2025 at 00:00):

Also, I think the nodes above aren't compiler problems

view this post on Zulip Brendan Hansknecht (Feb 13 2025 at 00:00):

They are parse issues

view this post on Zulip Brendan Hansknecht (Feb 13 2025 at 00:00):

Aren't compiler problems things we plan to ban via fuzzing?

view this post on Zulip Brendan Hansknecht (Feb 13 2025 at 00:01):

They are the guaranteed failure cases

view this post on Zulip Sam Mohr (Feb 13 2025 at 00:05):

Yes, earlier phase malformed nodes should get converted to crashes

view this post on Zulip Sam Mohr (Feb 13 2025 at 00:06):

Brendan Hansknecht said:

Aren't compiler problems things we plan to ban via fuzzing?

Ideally yes, but to plan for the worst, let's have good error messages anyway

view this post on Zulip Brendan Hansknecht (Feb 13 2025 at 00:49):

Yeah, for sure. But I don't see how that relates to this thread

view this post on Zulip Brendan Hansknecht (Feb 13 2025 at 00:50):

We easily could collect compiler problems after each stage and aggregate them up. Given they should be exceptionally rare, it is even fine if they have more costly handling.

view this post on Zulip Anthony Bullard (Feb 14 2025 at 14:58):

Hey y'all, I've been in bed with the flu for the past 4 days. I apologize for not communicating better but it hit me suddenly and hard.

view this post on Zulip Anthony Bullard (Feb 14 2025 at 14:59):

I should have my conflicts resolved and my parser PR ready to merge sometime today. But I also have a LOT of work to catch back up on today

view this post on Zulip Joshua Warner (Feb 14 2025 at 16:08):

No worries! Sounds like you were doing what you should have been doing - resting!

view this post on Zulip Richard Feldman (Feb 14 2025 at 16:20):

sorry to hear about the flu - hope you're feeling better, and don't worry about PRs. They can wait. :big_smile:

view this post on Zulip Anthony Bullard (Feb 15 2025 at 15:51):

Screenshot 2025-02-15 at 9.49.31 AM.png

Has anyone ever seen GIt have behavior like this? My editor has saved buffers for Parser.zig and IR.zig, and git knows about them, but they don't show up in my filesystem. I don't think I've seen this in my 13 years of using Git.

view this post on Zulip Richard Feldman (Feb 15 2025 at 16:02):

case sensitivity?

view this post on Zulip Richard Feldman (Feb 15 2025 at 16:02):

ir.zig and IR.zig are the same file on macOS, but not on Linux

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:02):

That's right

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:03):

I hit exactly that when I rebased your PR

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:03):

Well I'm trying to figure out what to do here...

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:03):

Oh, you already rebased it?

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:04):

I just saw your DM

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:04):

How did you fix this?

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:04):

https://github.com/roc-lang/roc/pull/7609

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:04):

To get past the casing rename thing I squashed the commits prior to rebase

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:04):

So this is already merged?

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:05):

Yep!

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:05):

So I can just do something useful then!

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:05):

Thank you!

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:05):

And thank you for making sure the commits were still mine, it means a lot to me

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:06):

Of course!

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:06):

Trying to figure out what to tackle next

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:07):

I'll probably keep working on some of the low hanging fruit since my brain is still only at 80% capacity at the moment

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:08):

Some of the feedback on your original PR is still outstanding

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:08):

(So that’s an option if you want)

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:09):

I have some partially complete local changes to start emitting diagnostics from the parser instead of panicking

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:10):

I think at this point the braces discussion has settled, so you could start converting the parser to that

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:12):

https://roc.zulipchat.com/#narrow/stream/304641-ideas/topic/braces.20syntax

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:15):

Ok, yeah, let me do that.

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:16):

I need to look at what feedback I didn't address. I thought I had addressed the last of Sam's feedback, but maybe that never made it in?

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:16):

https://github.com/roc-lang/roc/pull/7597#discussion_r1953437310 i think this is the last item

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:17):

That’s more of an FYI than concrete feedback tho

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:18):

And doesn’t apply directly in the parser currently anyway since we’re catching ooms and panicking at a lower level

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:18):

So :shrug:

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:22):

Yeah, just looked at it, hadn't seen the latest feedback after that

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:23):

There isn't a ton pressing there but I will work to address it when I can

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:23):

Makes sense

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:23):

I think planning out the move to braces is the most straightforward thing right now - and the most pressing

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:24):

The one thing I am pained about this change is - I will not be able to use our existing corpus of snapshots for testing

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:24):

They will all have to be worked up to fit the new syntax

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:25):

Which makes me wonder if we should try to develop a new set of principles for adding snapshots while we are at it

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:25):

One thing that crossed my mind was to do a one-time formatter change to the old compiler, in order to spit out this new syntax

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:25):

Oh, that could work - would be a bit of work but not too bad

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:25):

Do you think you could work on that in parallel?

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:25):

With me continuing on this?

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:26):

Since you are the ParseFather of the Rust compiler :smile:

view this post on Zulip Joshua Warner (Feb 15 2025 at 16:26):

Haha yeah

view this post on Zulip Anthony Bullard (Feb 15 2025 at 16:26):

Don't worry, I'm bogged down enough physically/mentally right now that I won't make TOO much progress while you do that

view this post on Zulip Joshua Warner (Feb 15 2025 at 18:17):

Here's what I was thinking for malformed nodes + diagnostics in the parser: https://github.com/roc-lang/roc/pull/7611

view this post on Zulip Joshua Warner (Feb 15 2025 at 18:17):

Essentially, have a .malformed tag that is allowed to stand in for any other node kind

view this post on Zulip Joshua Warner (Feb 15 2025 at 18:18):

I didn't implement this part yet, but when calling getStatement (for example), that'd return an error (or null?) if we're looking at a malformed node.

view this post on Zulip Anthony Bullard (Feb 19 2025 at 23:05):

That sounds good to me.

view this post on Zulip Anthony Bullard (Feb 19 2025 at 23:06):

My biggest issue with malformed is: "How do we decide that that we've included enough in the malformed node to make the continuation of the token stream resulting in a well-formed node?"

view this post on Zulip Joshua Warner (Feb 20 2025 at 03:07):

A common heuristic is to consume tokens until the end of the line, and until all braces have been matched (supposing it’s not a braces problem)

view this post on Zulip Richard Feldman (Feb 20 2025 at 03:20):

someone wrote an article about this in the context of Zig at some point

view this post on Zulip Richard Feldman (Feb 20 2025 at 03:20):

like there are some "cutoff points" (or something - I forget what term they used) where you can tell it's definitely the end of something

view this post on Zulip Richard Feldman (Feb 20 2025 at 03:20):

an example was if you have an open " and you hit a newline

view this post on Zulip Richard Feldman (Feb 20 2025 at 03:21):

you can say "ok, we don't know where it was supposed to end, but we can cut it off at the newline and go back to parsing normally so we don't have the rest of the file being treated as a string literal"

view this post on Zulip Richard Feldman (Feb 20 2025 at 03:22):

we could do something similar with top-level declarations although maybe it's not worth it to detect that :big_smile:

view this post on Zulip Richard Feldman (Feb 20 2025 at 03:22):

("that" being a newline followed immediately by an identifier)

view this post on Zulip Joshua Warner (Feb 20 2025 at 03:23):

We also might be able to use indentation and/braces as guides. We can know whether braces are properly matched in the file (the tokenizer needs to compute this anyway). We can probably jump to the nearest closing brace.

view this post on Zulip Norbert Hajagos (Feb 20 2025 at 08:55):

A while back I was reading the source of rust-analyzer and they do the same there. Let me try to dig it out from my memory: When expecting a specific token, they also provide a set of "recovery tokens" which could start a node, even though such node wouldn't be in the right place.

So for an input:

fn struct User

tokens wouldn't be wrapped in an ErrorNode. struct would be included in the error recovery set when expecting an identifier of a function. When the parser sees a struct, an error would be reported that a function identifier was expected. Then it would start to parse a struct as normal.

On the other hand, in

fn 2

2 would not be in the recovery set, since it isn't a start of a node they would want to parse for resilience. It would just get skipped in it's own (which means, wrap it in an ErrorNode).
I remember there was a rule specifically to never skip a } in such cases in order to keep the braces balanced. They'd end the malformed node before the brace and return from that specific parse function. The caller will probably expect a closing brace to be the next token. At worst, you are getting an unmatched closing brace error, which is very easy to fix.


Last updated: Jul 06 2025 at 12:14 UTC