Stream: compiler development

Topic: zig compiler - formatter and newlines


view this post on Zulip Richard Feldman (Feb 16 2025 at 02:25):

in the formatter we take care to make multiline things be multiline and single-line things stay single-line. I like this behavior and want to keep it; it's something I always miss with tools like rustfmt that make all newline decisions for me

view this post on Zulip Richard Feldman (Feb 16 2025 at 02:26):

but I don't like how the Rust compiler represents newline info using parse IR nodes

view this post on Zulip Richard Feldman (Feb 16 2025 at 02:26):

it didn't turn out to be nice to work with

view this post on Zulip Richard Feldman (Feb 16 2025 at 02:26):

I think a simpler design would be to scan source ranges for newlines

view this post on Zulip Joshua Warner (Feb 16 2025 at 02:28):

My thinking has always been to not have newlines/comments as part of the AST (and maybe not even in the token stream, now).

view this post on Zulip Joshua Warner (Feb 16 2025 at 02:29):

The formatter looks at the AST, sees a token id for the thing it's trying to format, then goes and looks in the source for the newlines/comments (if any) that came between that token and the one before.
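A rough sketch of that lookup, under some loud assumptions: the `Token` type, the `trivia_before` and `newlines_and_comments` helpers, and the use of `#` for line comments are all made up for illustration, and offsets index into a Python `str` rather than raw bytes. It is not the compiler's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    start: int  # offset of the token's first character
    end: int    # offset one past the token's last character


def trivia_before(source: str, tokens: list[Token], index: int) -> str:
    """Return the raw source text between token `index` and the token before it."""
    gap_start = tokens[index - 1].end if index > 0 else 0
    return source[gap_start:tokens[index].start]


def newlines_and_comments(trivia: str) -> tuple[int, list[str]]:
    """Count newlines and collect `#` line comments (assumed syntax) in a trivia slice."""
    comments = [
        line.strip() for line in trivia.split("\n") if line.strip().startswith("#")
    ]
    return trivia.count("\n"), comments


source = "x = 1\n# note\ny = 2"
# Hypothetical token ranges for `x`, `=`, `1`, `y`.
tokens = [Token(0, 1), Token(2, 3), Token(4, 5), Token(13, 14)]
gap = trivia_before(source, tokens, 3)
count, comments = newlines_and_comments(gap)
```

Here the formatter would only pay for the source scan at the tokens it actually formats, and neither the AST nor the token stream has to carry newline or comment nodes.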

view this post on Zulip Richard Feldman (Feb 16 2025 at 02:29):

sounds good!

view this post on Zulip Richard Feldman (Feb 16 2025 at 02:30):

also I think we can do the same in the parser for checking to see if tokens have a whitespace gap between them or not

view this post on Zulip Joshua Warner (Feb 16 2025 at 02:31):

My thinking is to have the parser never need to look at the underlying source

view this post on Zulip Richard Feldman (Feb 16 2025 at 02:31):

and that strategy works with source ranges being either in line/col or start byte/length

view this post on Zulip Richard Feldman (Feb 16 2025 at 02:31):

yeah it shouldn't need to - the ranges should be enough

view this post on Zulip Joshua Warner (Feb 16 2025 at 02:32):

Ahh I see - comparing the end byte of the last token to the start byte of this token

view this post on Zulip Richard Feldman (Feb 16 2025 at 02:32):

like if I see a ? token and the preceding token has a source range that ends right in front of it

view this post on Zulip Joshua Warner (Feb 16 2025 at 02:32):

:thinking:

view this post on Zulip Richard Feldman (Feb 16 2025 at 02:32):

then we know there's no gap

view this post on Zulip Joshua Warner (Feb 16 2025 at 02:32):

My current approach has been to explicitly put that data into the token stream, where it's needed

view this post on Zulip Richard Feldman (Feb 16 2025 at 02:32):

and we don't even need to keep around all token source ranges to do that

view this post on Zulip Richard Feldman (Feb 16 2025 at 02:32):

just the previous one and the current one
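A minimal sketch of the gap check described here, assuming ranges are `(start, end)` pairs with an exclusive end. The `gaps` helper is hypothetical, and treating the first token as having a gap before it is an arbitrary choice made for the example.

```python
from typing import Iterable, Iterator


def gaps(ranges: Iterable[tuple[int, int]]) -> Iterator[tuple[tuple[int, int], bool]]:
    """Yield (range, has_gap_before), remembering only the previous range's end."""
    prev_end = None
    for start, end in ranges:
        # No gap when the previous token ends exactly where this one starts.
        has_gap = prev_end is None or start > prev_end
        yield (start, end), has_gap
        prev_end = end


# Hypothetical ranges for the tokens of "a? b": `a`, `?`, `b`.
result = list(gaps([(0, 1), (1, 2), (3, 4)]))
```

The `?` at `(1, 2)` starts exactly where `a` ends, so it is reported as having no gap, and the source text itself is never consulted.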

view this post on Zulip Joshua Warner (Feb 16 2025 at 02:32):

So e.g. there's OpenRound and NoSpaceOpenRound

view this post on Zulip Richard Feldman (Feb 16 2025 at 02:33):

ah so lookahead 1 byte?

view this post on Zulip Joshua Warner (Feb 16 2025 at 02:33):

For that case we do lookbehind, but same idea
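For comparison, a sketch of the token-stream approach, assuming `NoSpaceOpenRound` means "open paren with no whitespace right before it". The `tokenize_parens` function is hypothetical, only looks at parens, and treats a paren at the very start of the input as a plain `OpenRound`; a real tokenizer would handle every token kind.

```python
def tokenize_parens(source: str) -> list[tuple[str, int]]:
    """Emit (kind, offset) for each `(`, using one character of lookbehind."""
    tokens = []
    for i, ch in enumerate(source):
        if ch == "(":
            # Lookbehind: inspect the single character before the paren.
            preceded_by_space = i == 0 or source[i - 1].isspace()
            kind = "OpenRound" if preceded_by_space else "NoSpaceOpenRound"
            tokens.append((kind, i))
    return tokens


kinds = tokenize_parens("f(x) (y)")
```

With this design the parser can tell `f(x)` from `f (x)` by token kind alone, without comparing ranges or touching the source.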

view this post on Zulip Richard Feldman (Feb 16 2025 at 02:33):

gotcha, that works too!

view this post on Zulip Richard Feldman (Feb 16 2025 at 02:34):

do you think the formatter can get away with just looking at tokens?

view this post on Zulip Richard Feldman (Feb 16 2025 at 02:34):

I'd assume it would need parse IR but maybe not

view this post on Zulip Joshua Warner (Feb 16 2025 at 02:46):

We're a lot closer with braces syntax than without

view this post on Zulip Joshua Warner (Feb 16 2025 at 02:47):

Ultimately I think there are going to be cases that are annoying to handle without looking at the parse IR

view this post on Zulip Joshua Warner (Feb 16 2025 at 02:47):

(which, side note, it's an AST; let's just call it that - we have too many IRs for IR to mean something useful)

view this post on Zulip Joshua Warner (Feb 16 2025 at 02:48):

Anyway

view this post on Zulip Joshua Warner (Feb 16 2025 at 02:50):

It would be interesting to _explore_ whether the formatter could look just at the token stream. I'm betting there will be cases that make that difficult, where we'd essentially be re-implementing significant parts of the parser to run during formatting.

view this post on Zulip Anthony Bullard (Feb 19 2025 at 23:00):

Joshua Warner said:

It would be interesting to _explore_ whether the formatter could look just at the token stream. I'm betting there will be cases that make that difficult, where we'd essentially be re-implementing significant parts of the parser to run during formatting.

This is my actual plan...

view this post on Zulip Richard Feldman (Feb 26 2025 at 17:16):

for what it's worth, I think we can simplify how the formatter thinks about newlines to just:

view this post on Zulip Richard Feldman (Feb 26 2025 at 17:17):

I think this because of the "trailing commas mean render multiline, and no trailing comma means don't" design in conjunction with parens-and-commas

view this post on Zulip Richard Feldman (Feb 26 2025 at 17:18):

if we want a multiline function application, we can do a trailing comma to indicate multiline mode (and unlike whitespace application, the case of "have the first arg on the same line but everything else is on a different line" would look weird and I don't think should be supported anymore)

view this post on Zulip Richard Feldman (Feb 26 2025 at 17:18):

this makes me wonder if we should try just having the formatter manage blank lines automatically

view this post on Zulip Richard Feldman (Feb 26 2025 at 17:19):

like it just has a rule for where they do and don't go, and it puts them in accordingly

view this post on Zulip Richard Feldman (Feb 26 2025 at 17:19):

it might be annoying (I'm not sure), but at that point the only user-configurable aspect of whitespace is "does this comma-separated thing render in single-line or multiline mode?" and that is determined entirely by whether it has a trailing comma or not
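A small sketch of the trailing-comma rule for a function application, assuming the parser has already recorded whether the argument list ended with a trailing comma. The `render_call` name, its parameters, and the four-space indent are placeholders for illustration.

```python
def render_call(name: str, args: list[str], trailing_comma: bool, indent: str = "    ") -> str:
    """Render multiline if the source had a trailing comma, single-line otherwise."""
    if trailing_comma and args:
        # Multiline mode: one argument per line, each followed by a comma.
        lines = [f"{name}("] + [f"{indent}{arg}," for arg in args] + [")"]
        return "\n".join(lines)
    # Single-line mode: no trailing comma is emitted.
    return f"{name}({', '.join(args)})"


single = render_call("foo", ["a", "b"], trailing_comma=False)
multi = render_call("foo", ["a", "b"], trailing_comma=True)
```

Because the multiline output keeps its trailing comma and the single-line output has none, formatting the result again would pick the same mode.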

view this post on Zulip Richard Feldman (Feb 26 2025 at 17:20):

kinda seems worth trying to me, just to see how it feels in practice?

view this post on Zulip Anton (Feb 26 2025 at 17:26):

Richard Feldman said:

this makes me wonder if we should try just having the formatter manage blank lines automatically

I feel very strongly about my blank lines and use them significantly more than the average dev :p

view this post on Zulip Sam Mohr (Feb 26 2025 at 17:53):

I also feel as Anton does, and I believe so does @Luke Boswell

view this post on Zulip Sam Mohr (Feb 26 2025 at 17:54):

Though there are some places that I think we should consider removing them that I mentioned to Luke but never firmed up enough to make a discussion about

view this post on Zulip Sam Mohr (Feb 26 2025 at 17:54):

The main one being newlines between a function's args and the first line of its body

view this post on Zulip Sam Mohr (Feb 26 2025 at 17:55):

Which now could be simplified to the rule "no blank newlines between an opening curly brace and its body's first line"

view this post on Zulip Sam Mohr (Feb 26 2025 at 17:56):

And maybe "always add a newline above the return expression of a function unless the body has only one expression"

view this post on Zulip Sam Mohr (Feb 26 2025 at 17:57):

But that one is tricky, like what about if your function ends with some Stdout.line calls, those aren't normal "I'm returning something useful" lines

view this post on Zulip Brendan Hansknecht (Feb 26 2025 at 18:10):

I feel like if you have more than 2 blank lines they definitely should be collapsed

view this post on Zulip Brendan Hansknecht (Feb 26 2025 at 18:11):

If you really want them....add comments

view this post on Zulip Brendan Hansknecht (Feb 26 2025 at 18:11):

I totally see the general argument for 2 blank lines between important things for visual separation

view this post on Zulip Brendan Hansknecht (Feb 26 2025 at 18:12):

That said, I think the formatter should manage blank lines. (Again, comments are the way to complete freedom)

view this post on Zulip Brendan Hansknecht (Feb 26 2025 at 18:14):

If I were to write the rule it would probably be 1 or 2 blank lines allowed between top levels.

1 or 0 blank lines between other things.

No blank lines after open brackets or before closing brackets.
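One possible reading of those three rules as a clamp on the number of blank lines the user wrote. The `allowed_blank_lines` function and its flags are hypothetical, and forcing at least one blank line between top-level items is an interpretation of "1 or 2", not something settled in the thread.

```python
def allowed_blank_lines(
    requested: int,
    top_level: bool,
    after_open: bool = False,
    before_close: bool = False,
) -> int:
    """Clamp the user's blank-line count to what the rules above would permit."""
    if after_open or before_close:
        # No blank lines right after an open bracket or right before a closing one.
        return 0
    if top_level:
        # Between top-level items: at least 1, at most 2 (assumed lower bound).
        return max(1, min(requested, 2))
    # Everywhere else: 0 or 1.
    return min(requested, 1)
```

For example, five blank lines between two top-level definitions would collapse to two, and any blank lines directly after an opening brace would be dropped.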

view this post on Zulip Brendan Hansknecht (Feb 26 2025 at 18:15):

Hmm...though for top-level single-line expressions, zero blank lines can be nice (like a block of constants)

view this post on Zulip Anthony Bullard (Feb 26 2025 at 18:55):

What Richard described was basically what I was going to do


Last updated: Jul 06 2025 at 12:14 UTC