zig compiler - formatter and newlines · compiler development

Stream: compiler development

Topic: zig compiler - formatter and newlines

Richard Feldman (Feb 16 2025 at 02:25):

in the formatter we take care to make multiline things be multiline and single-line things stay single-line. I like this behavior and want to keep it; it's something I always miss with tools like rustfmt that make all newline decisions for me

Richard Feldman (Feb 16 2025 at 02:26):

but I don't like how the Rust compiler represents newline info using parse IR nodes

Richard Feldman (Feb 16 2025 at 02:26):

it didn't turn out to be nice to work with

Richard Feldman (Feb 16 2025 at 02:26):

I think a simpler design would be to scan source ranges for newlines

Joshua Warner (Feb 16 2025 at 02:28):

My thinking has always been to not have newlines/comments as part of the AST (and maybe not even in the token stream, now).

Joshua Warner (Feb 16 2025 at 02:29):

The formatter looks at the AST, sees a token id for the thing it's trying to format, then goes and looks in the source for the newlines/comments (if any) that came between that token and the one before.
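
A minimal sketch of that lookup, assuming each token records start and end offsets into the source. The `Token` class, the `trivia_before` helper, and the use of `#` for line comments are illustrative assumptions, not the compiler's actual types:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Token:
    kind: str
    start: int  # offset of the token's first character
    end: int    # offset one past the token's last character


def trivia_before(source: str, tokens: list[Token], index: int) -> tuple[int, list[str]]:
    """Return (newline count, comments) found between token `index` and the one before it."""
    # The gap starts where the previous token ended (or at the start of the file).
    gap_start = tokens[index - 1].end if index > 0 else 0
    gap = source[gap_start:tokens[index].start]
    # Assumption: comments start with '#' and run to the end of the line.
    comments = [
        line.strip()
        for line in gap.splitlines()
        if line.strip().startswith("#")
    ]
    return gap.count("\n"), comments
```

Nothing about newlines or comments is stored in the AST or token stream here; the formatter recovers both from the source text on demand.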

Richard Feldman (Feb 16 2025 at 02:29):

sounds good!

Richard Feldman (Feb 16 2025 at 02:30):

also I think we can do the same in the parser for checking to see if tokens have a whitespace gap between them or not

Joshua Warner (Feb 16 2025 at 02:31):

My thinking is to have the parser never need to look at the underlying source

Richard Feldman (Feb 16 2025 at 02:31):

and that strategy works with source ranges being either in line/col or start byte/length

Richard Feldman (Feb 16 2025 at 02:31):

yeah it shouldn't need to - the ranges should be enough

Joshua Warner (Feb 16 2025 at 02:32):

Ahh I see - comparing the end byte of the last token to the start byte of this token

Richard Feldman (Feb 16 2025 at 02:32):

like if I see a ? token and the preceding token has a source range that ends right in front of it

Joshua Warner (Feb 16 2025 at 02:32):

:thinking:

Richard Feldman (Feb 16 2025 at 02:32):

then we know there's no gap

Joshua Warner (Feb 16 2025 at 02:32):

My current approach has been to explicitly put that data into the token stream, where it's needed

Richard Feldman (Feb 16 2025 at 02:32):

and we don't even need to keep around all token source ranges to do that

Richard Feldman (Feb 16 2025 at 02:32):

just the previous one and the current one
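
A rough sketch of that check, assuming tokens carry start/end byte offsets. The `Span` type and `gap_flags` function are hypothetical names. Only the previous token's end offset is kept around:

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Span(NamedTuple):
    start: int  # offset of the token's first byte
    end: int    # offset one past the token's last byte


def gap_flags(spans: Iterable[Span]) -> Iterator[bool]:
    """Yield, for each token, whether anything separates it from the previous token."""
    previous_end: Optional[int] = None
    for span in spans:
        # Assumption: the first token has no predecessor, so it counts as having a gap.
        yield previous_end is None or span.start > previous_end
        previous_end = span.end
```

For the source `a? b ?`, the first `?` starts exactly where `a` ends, so it has no gap, while the second `?` does.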

Joshua Warner (Feb 16 2025 at 02:32):

So e.g. there's OpenRound and NoSpaceOpenRound

Richard Feldman (Feb 16 2025 at 02:33):

ah so lookahead 1 byte?

Joshua Warner (Feb 16 2025 at 02:33):

For that case we do lookbehind, but same idea
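
A simplified sketch of the lookbehind idea, using the two token names mentioned above. The exact rule (whitespace or start of input immediately before the paren) is an assumption for illustration, and the function ignores strings and comments:

```python
def round_token_kinds(source: str) -> list:
    """Classify each '(' by looking one character behind it."""
    kinds = []
    for i, ch in enumerate(source):
        if ch != "(":
            continue
        # Lookbehind: a '(' at the very start, or right after whitespace,
        # becomes a plain OpenRound; anything else is NoSpaceOpenRound.
        preceded_by_space = i == 0 or source[i - 1].isspace()
        kinds.append("OpenRound" if preceded_by_space else "NoSpaceOpenRound")
    return kinds
```

With this approach the tokenizer does the source inspection once, and the parser only ever sees the distinction as a token kind.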

Richard Feldman (Feb 16 2025 at 02:33):

gotcha, that works too!

Richard Feldman (Feb 16 2025 at 02:34):

do you think the formatter can get away with just looking at tokens?

Richard Feldman (Feb 16 2025 at 02:34):

I'd assume it would need parse IR but maybe not

Joshua Warner (Feb 16 2025 at 02:46):

We're a lot closer with braces syntax than without

Joshua Warner (Feb 16 2025 at 02:47):

Ultimately I think there are going to be cases that are annoying to handle without looking at the parse IR

Joshua Warner (Feb 16 2025 at 02:47):

(which, side note, it's an AST; let's just call it that - we have too many IRs for IR to mean something useful)

Joshua Warner (Feb 16 2025 at 02:48):

Anyway

Joshua Warner (Feb 16 2025 at 02:50):

It would be interesting to _explore_ whether the formatter could look just at the token stream. I'm betting there will be cases that make that difficult, where we'd essentially be re-implementing significant parts of the parser to run during formatting.

Anthony Bullard (Feb 19 2025 at 23:00):

Joshua Warner said:

It would be interesting to _explore_ whether the formatter could look just at the token stream. I'm betting there will be cases that make that difficult, where we'd essentially be re-implementing significant parts of the parser to run during formatting.

This is my actual plan...

Richard Feldman (Feb 26 2025 at 17:16):

for what it's worth, I think we can simplify how the formatter thinks about newlines to just:

- it cares about preserving blank lines (that is, lines containing zero non-whitespace characters) and collapses consecutive blank lines into one
- it otherwise ignores newlines
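
The blank-line half of that rule could look roughly like this. It is a text-level sketch with a hypothetical function name; a real formatter would apply the rule while printing the tree rather than by rewriting lines:

```python
def normalize_blank_lines(source: str) -> str:
    """Keep at most one blank line in a row; leave non-blank lines untouched."""
    out = []
    previous_blank = False
    for line in source.split("\n"):
        # A line is blank if it has zero non-whitespace characters.
        blank = line.strip() == ""
        if blank and previous_blank:
            continue  # drop the second and later blank lines of a run
        out.append("" if blank else line)
        previous_blank = blank
    return "\n".join(out)
```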

Richard Feldman (Feb 26 2025 at 17:17):

I think this because of the "trailing commas mean render multiline, and no trailing comma means don't" design in conjunction with parens-and-commas

Richard Feldman (Feb 26 2025 at 17:18):

if we want a multiline function application, we can do a trailing comma to indicate multiline mode (and unlike whitespace application, the case of "have the first arg on the same line but everything else is on a different line" would look weird and I don't think should be supported anymore)
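
A small sketch of the trailing-comma rule for a parens-and-commas call. `format_call`, its parameters, and the four-space indent are assumptions for illustration:

```python
def format_call(name: str, args: list, trailing_comma: bool, indent: str = "    ") -> str:
    """Render a parens-and-commas call; a trailing comma selects multiline mode."""
    if not trailing_comma:
        # No trailing comma: everything stays on one line.
        return f"{name}({', '.join(args)})"
    # Trailing comma: one argument per line, each followed by a comma.
    lines = [f"{name}("]
    lines += [f"{indent}{arg}," for arg in args]
    lines.append(")")
    return "\n".join(lines)
```

Under this rule the original line breaks inside the call never need to be consulted; the presence of the trailing comma is the only input.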

Richard Feldman (Feb 26 2025 at 17:18):

this makes me wonder if we should try just having the formatter manage blank lines automatically

Richard Feldman (Feb 26 2025 at 17:19):

like it just has a rule for where they do and don't go, and it puts them in accordingly

Richard Feldman (Feb 26 2025 at 17:19):

it might be annoying (I'm not sure), but at that point the only user-configurable aspect of whitespace is "does this comma-separated thing render in single-line or multiline mode?" and that is determined entirely by whether it has a trailing comma or not

Richard Feldman (Feb 26 2025 at 17:20):

kinda seems worth trying to me, just to see how it feels in practice?

Anton (Feb 26 2025 at 17:26):

Richard Feldman said:

this makes me wonder if we should try just having the formatter manage blank lines automatically

I feel very strongly about my blank lines and use them significantly more than the average dev :p

Sam Mohr (Feb 26 2025 at 17:53):

I also feel as Anton does, and I believe so does @Luke Boswell

Sam Mohr (Feb 26 2025 at 17:54):

Though there are some places where I think we should consider removing them; I mentioned these to Luke but never firmed them up enough to start a discussion

Sam Mohr (Feb 26 2025 at 17:54):

The main one being newlines between a function's args and the first line of its body

Sam Mohr (Feb 26 2025 at 17:55):

Which now could be simplified to the rule "no blank lines between an opening curly brace and its body's first line"

Sam Mohr (Feb 26 2025 at 17:56):

And maybe "always add a newline above the return expression of a function unless the body has only one expression"

Sam Mohr (Feb 26 2025 at 17:57):

But that one is tricky: what if your function ends with some Stdout.line calls? Those aren't normal "I'm returning something useful" lines

Brendan Hansknecht (Feb 26 2025 at 18:10):

I feel like if you have more than 2 blank lines they definitely should be collapsed

Brendan Hansknecht (Feb 26 2025 at 18:11):

If you really want them... add comments

Brendan Hansknecht (Feb 26 2025 at 18:11):

I totally see the general argument for 2 blank lines between important things for visual separation

Brendan Hansknecht (Feb 26 2025 at 18:12):

That said, I think the formatter should manage blank lines. (Again, comments are the way to complete freedom)

Brendan Hansknecht (Feb 26 2025 at 18:14):

If I were to write the rule, it would probably be: 1 or 2 blank lines allowed between top-level items.

1 or 0 blank lines between other things.

No blank lines after open brackets or before closing brackets.
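
One way to read those three rules is as a clamp on however many blank lines the author wrote. This is a sketch with hypothetical names, not an agreed design:

```python
def allowed_blank_lines(requested: int, top_level: bool, next_to_bracket: bool) -> int:
    """Clamp the blank lines the author wrote to what the proposed rule allows."""
    if next_to_bracket:
        # None after an opening bracket or before a closing bracket.
        return 0
    if top_level:
        # 1 or 2 between top-level items.
        return min(max(requested, 1), 2)
    # 0 or 1 everywhere else.
    return min(requested, 1)
```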

Brendan Hansknecht (Feb 26 2025 at 18:15):

Hmm... though for top-level single-line expressions, zero blank lines can be nice (like a block of constants)

Anthony Bullard (Feb 26 2025 at 18:55):

What Richard described was basically what I was going to do

Last updated: Aug 17 2025 at 12:14 UTC