Wanted to pop this github thread out to zulip: https://github.com/roc-lang/roc/pull/7569#discussion_r1938529685
@Anthony Bullard / @Brendan Hansknecht
The approach used in the current tokenizer PR is to preserve exact indent info and let the parser use that however it wants to figure out nesting.
The more-commonly-used alternative (e.g. in Python) would be to generate indent and dedent tokens. I believe this approach is viable for the current roc grammar, but it does force us to be a little bit more picky about indentation. When developing the parser that went along with this tokenizer, I found that trying to use only indent/dedent tokens resulted in very picky indentation that I found frustrating to get working.
Thoughts?
I like it being more permissive - fits with the goal of having it be more error-tolerant
so yeah, sounds like indent/dedent tokens aren't the way to go
I have no thoughts besides making sure we document what we chose and why. I was just surprised that we are only counting spaces or tabs, that didn't feel like enough (but I don't work on the parser).
We're counting both spaces and tabs in the current PR (keeping the counts separate). That's just enough information to be agnostic to whatever the editor configuration for tab indent width is, and still parse correctly.
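A minimal sketch of the "two separate counts" idea, in Python rather than the PR's Zig. The names `Indent`, `measure_indent` and `compare_indent` are hypothetical, and the comparison rule is a conservative assumption rather than what the PR necessarily does: it only answers when the result holds for every fixed tab width, and reports ambiguity otherwise.

```python
from typing import NamedTuple, Optional


class Indent(NamedTuple):
    tabs: int
    spaces: int


def measure_indent(line: str) -> Indent:
    """Count tabs and spaces separately in the leading whitespace of a line."""
    lead = line[: len(line) - len(line.lstrip(" \t"))]
    return Indent(lead.count("\t"), lead.count(" "))


def compare_indent(a: Indent, b: Indent) -> Optional[int]:
    """Return 1 if b is deeper than a, -1 if shallower, 0 if equal.

    Returns None when the answer would depend on the tab width,
    e.g. four spaces versus two tabs.
    """
    dt = b.tabs - a.tabs
    ds = b.spaces - a.spaces
    if dt == 0 and ds == 0:
        return 0
    if dt >= 0 and ds >= 0:
        return 1
    if dt <= 0 and ds <= 0:
        return -1
    return None  # more tabs but fewer spaces (or vice versa): ambiguous
```

Under this rule, a line indented with four spaces followed by one indented with two tabs compares as `None`, which matches the "parser would report an error" case discussed later in the thread.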
But like is space space tab space
the same as tab space space space
?
Technically, if we wanted to be very particular, we could record the exact indent string that's used. That would let us catch cases where the first line uses spaces then tabs and the second line uses tabs then spaces. However, it doesn't feel important to me to reject that case in particular.
Ha ha yeah exactly
It is not perfectly the same, but would lead to the same indent, given typical editor configurations.
A user might be forgiven for mistaking those as the same
I guess it would depend on how many spaces a tab is for how it appears in a given editor and if it looks like the same indent level or not.
Yeah, if an editor is using tabs to align to e.g. 8 char intervals, then those aren't identical.
Yeah, but I guess I don't know what roc is supposed to do with that. So just counting tabs and spaces is probably fine.
By that you mean tabs+spaces?
(as a single number)
Joshua Warner said:
It is not perfectly the same, but would lead to the same indent, given typical editor configurations.
hm is that true? I think that's only true if the editor is replacing tabs with some fixed number of spaces, as opposed to doing tab stops
Err yeah for that example you're right
By that you mean tabs+spaces?
(as a single number)
No, sorry, I meant doing what you are currently doing in the PR with two different counts. That sounds reasonable cause we don't know what the user's tab width is. So we can't really do better.
There is a chance that space space space tab is 2 indents in the user's editor (tab rounds to 2 spaces). As such they see space space tab space as 2.5 indents and maybe accidentally use it as 3 indents, but I'm not sure how roc would derive any of that info.
For any case where you have a sequence of ((tab|space{n})*space*) (forgive my relaxed regular expression syntax), keeping the count of tabs and the count of spaces is sufficient to infer the correct indent level, regardless of whether the user has an improper tabs-versus-spaces setting.
The conservative alternative here is going back to what the current parser does - only allows spaces. That avoids all of these problems.
The expressive alternative is to track the exact order of tabs and spaces and only allow indents that have the previous line's indent string as a prefix
I think any other alternative here necessarily involves tracking the count of spaces and tabs. Given that, for example, we could give an error if the user is not consistently using either tabs or spaces with no mixing.
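The "expressive alternative" above (track the exact indent string and require the previous line's indent to be a prefix) could be sketched roughly as follows. The function names and the string results are hypothetical, chosen only for illustration.

```python
def leading_whitespace(line: str) -> str:
    """Return the exact run of tabs and spaces at the start of a line."""
    return line[: len(line) - len(line.lstrip(" \t"))]


def classify_indent(prev_ws: str, cur_ws: str) -> str:
    """Compare two exact indent strings using the prefix rule."""
    if cur_ws == prev_ws:
        return "same"
    if cur_ws.startswith(prev_ws):
        return "indent"
    if prev_ws.startswith(cur_ws):
        return "dedent"
    # Neither string extends the other, e.g. "\t " versus " \t".
    return "error"
```

With this rule, tab-then-space followed by space-then-tab is rejected even though the tab and space counts are identical, which is the extra strictness being weighed here.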
Just out of curiosity, what do we do if one line is all spaces and the next is all tabs? (This accidentally happens to me sometimes when I edit a file before saving it as a .roc file.) I think the editor defaults to tabs, but my roc config and roc in general use spaces. So I sometimes have one line with 4 spaces and then the next with 2 tabs. It really irks me to fix this, and last time I checked, the formatter fails to do so.
fwiw I'd like to report a warning for using anything other than tabs for indentation
so all of this is just for trying to recover if the user makes a mistake
(including having the formatter translate all of these things into the appropriate number of tabs for you)
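As a rough illustration of the formatter side of this, here is a sketch that rewrites leading whitespace as tabs. It has to assume a tab width (`tab_width`, a hypothetical parameter) to interpret existing spaces, which is exactly the piece of information the compiler does not have; `retab` is not a real Roc formatter function.

```python
def retab(source: str, tab_width: int = 4) -> str:
    """Rewrite each line's leading whitespace as tabs, assuming tab_width.

    Leftover columns that do not fill a whole tab are kept as spaces.
    Note: a trailing newline in the input is not preserved.
    """
    out = []
    for line in source.splitlines():
        body = line.lstrip(" \t")
        lead = line[: len(line) - len(body)]
        # expandtabs applies tab stops, so "  \t" counts as one stop, not 2 + width.
        columns = len(lead.expandtabs(tab_width))
        tabs, remainder = divmod(columns, tab_width)
        out.append("\t" * tabs + " " * remainder + body)
    return "\n".join(out)
```

For example, with the default width, a four-space line, a one-tab line and a "two spaces then tab" line all come out indented with a single tab.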
what do we do if one line is all spaces and the next is all tabs
The parser would report an error, since we can't disambiguate that.
:cry:
Are we switching to tabs? There was a lot of enthusiasm for using tabs for indentation :grinning:
FWIW this scheme comes directly from python
I think it is probably a totally reasonable scheme. It just doesn't solve my essentially only issue with tabs/spaces that I currently hit (though still pretty rare).
I don't think that's deterministically solvable, is it?
like how could the compiler infer what indentation level you intended without knowing what tab width you have configured?
oh I forgot to mention earlier - I think tabs being used for anything other than indentation should also be a warning (and the formatter should change them to spaces)
If we really wanted to be fancy, we could retry the parse assuming different values of tab width, and see which parse makes the most sense
@Brendan Hansknecht can you give a concrete example of the issue you typically hit?
Richard Feldman said:
oh I forgot to mention earlier - I think tabs being used for anything other than indentation should also be a warning (and the formatter should change them to spaces)
hrm, I just realized we'd have to make an exception for this inside doc comments in which you have code blocks.
also, those could be really annoying to edit actually, because a number of editors (most notably Cursor, but Zed is about to start doing this too) are binding the Tab button to mean "accept LLM-generated suggestion"
it's fine outside a doc comment bc the editor generally understands where indentation should happen and inserts indents for you, but inside doc comment code blocks they usually don't realize you're in code mode and they don't indent for you
I wouldn’t expect an editor to reliably insert tab characters in that position?
We could always make the formatter parse markdown code blocks inside doc comments and format those.
that would be good! :+1:
also maybe we can get the editor grammars able to understand that aspect of markdown and do the indentation
I wonder how Go deals with that :thinking:
Joshua Warner said:
Brendan Hansknecht can you give a concrete example of the issue you typically hit?
Here is what normally happens. Copy some code from github that is a repro or something. Paste it into my editor. Edit the file before saving (leads to inserting tabs instead of spaces). Save and now have some lines with tabs and some with spaces.
Just did a small edit of basic cli hello world and got this:
app [main!] { pf: platform "../platform/main.roc" }
# To run this example: check the README.md in this folder
import pf.Stdout
main! = \_args ->
	x = "Hello, World!"
    Stdout.line! x
	if Str.countUtf8Bytes x > 10 then
		Stdout.line! "large"
	else
		Stdout.line! "small"
The original indented line from hello world (Stdout.line! x) is now the only line indented with spaces. With large code snippets, this becomes a mess of tabs and spaces and I have to manually do a bunch of character replacing.
This happens cause I edit the file before saving it. Once I save it with a .roc extension, the editor figures it out.
But at that point the damage is done.
This sounds like the right approach would be to detect there’s a mix of spaces and tabs, issue a warning, and retry parsing with different tab width configs
Then pick the one with the fewest parse errors/warnings
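A toy sketch of that "retry with different tab widths" idea. Everything here is an assumption for illustration: `count_indent_errors` is a stand-in for a real parser's error count, `OPENERS` is a made-up list of block-introducing line endings, and the candidate widths 2, 4 and 8 are arbitrary.

```python
# Hypothetical line endings that may introduce an indented block.
OPENERS = ("->", "then", "else", "=", "is")


def count_indent_errors(source: str) -> int:
    """Very rough stand-in for 'number of parse errors' on space-only input."""
    errors = 0
    stack = [0]  # indentation widths of the enclosing blocks
    prev = ""
    for line in source.splitlines():
        if not line.strip():
            continue
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            # Deeper indent is only expected after a block opener.
            if not prev.rstrip().endswith(OPENERS):
                errors += 1
            stack.append(width)
        elif width < stack[-1]:
            while stack[-1] > width:
                stack.pop()
            if stack[-1] != width:
                # Dedent to a level that was never opened.
                errors += 1
                stack.append(width)
        prev = line
    return errors


def guess_tab_width(source: str, candidates=(2, 4, 8)) -> int:
    """Expand tabs at each candidate width and keep the least-broken result."""
    return min(candidates, key=lambda w: count_indent_errors(source.expandtabs(w)))
```

On a snippet like the one above, where one line uses four spaces and the rest use tabs, only a width of 4 makes every line fit, so that candidate wins. A real implementation would count actual parse errors and would still need a tie-breaking policy.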
https://forum.cursor.com/t/ambiguity-with-the-tab-key-in-cursor/7663/12?utm_source=chatgpt.com
ah, apparently a common workaround for indentation specifically is to use an "increase indentation level" keybinding
I don't think that would work in doc comments though
it would just indent the comment :sweat_smile:
I would tend to uncomment, edit normally, then recomment
Sorry everyone - had a number of Lunar New Year parties this weekend and have mostly been out.
@Joshua Warner I think that indent ambiguity is a really messy thing for both the parser and the user to deal with. I personally think the consistency we could apply with INDENT and DEDENT tokens far outweighs the permissiveness of the alternative plan
But I think regardless of what we decide to do with the specifics there, I think the current design of the token output is hard to work with.
Specifically, we have an SoA over the tokens, but the same struct also holds a fourth array for lines, and it's not exactly straightforward how the parser is supposed to interact with it (its length has no relation to the rest)
It’s actually not bad to work with. Logic for that is kept in a very thin layer around the tokenizer, and most of the parser doesn’t have to care.
That layer tracks a current line index, incrementing that as appropriate in the parser's advance method
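To make the "thin layer" a bit more concrete, here is a sketch in Python of a cursor that keeps a line index in step with the token position. The field names (`tags`, `offsets`, `line_starts`, `line_indents`) and the classes are hypothetical and are not taken from the PR; the indent is stored as a single number only to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class TokenBuffer:
    # Struct-of-arrays token storage: one entry per token.
    tags: list[str]
    offsets: list[int]
    # Separate per-line arrays; their length is unrelated to the token count.
    line_starts: list[int]
    line_indents: list[int]


class Cursor:
    """Tracks the current token and, alongside it, the current line."""

    def __init__(self, buf: TokenBuffer) -> None:
        self.buf = buf
        self.pos = 0
        self.line = 0
        self._sync_line()

    def _sync_line(self) -> None:
        if self.pos >= len(self.buf.tags):
            return
        offset = self.buf.offsets[self.pos]
        starts = self.buf.line_starts
        # Move forward while the next line begins at or before this token.
        while self.line + 1 < len(starts) and starts[self.line + 1] <= offset:
            self.line += 1

    def advance(self) -> None:
        self.pos += 1
        self._sync_line()

    def current_indent(self) -> int:
        return self.buf.line_indents[self.line]
```

The rest of a parser built this way would only call `advance` and `current_indent`, and would never index the line arrays directly.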
Maybe I should stop what I'm working on because it sounds like you have already designed a parser around this tokenizer
And the "thin layer" that you are speaking of doesn't jump off the page to me at all
But I've only had about an hour or two and I've spent most of that exploring some different designs of the AST
This is literally the tokenizer from the parser I linked in the other thread, translated to zig. So yeah there is a parser that goes with. But I wasn’t thinking of reusing that wholesale - just select parts.
Oh, the one with the linear tree structure?
Yeah
I spent quite a lot of time trying out the indent and dedent token idea, and it was quite fragile
Ok, I thought that was more of a thought experiment since we seemed to be aiming for simplicity
Yeah, the linear tree parts of that I want to throw away
That's interesting, usually with RD with a WSS language INDENT/DEDENT tokens are very straightforward
But maybe it's different with Roc's philosophy of maximal flexibility in the parser
The problem with indent/dedent in roc is there’s a long list of tokens that might introduce a block. Python only allows : to do that.
Don’t get me wrong; indent/dedent can be made to work. But it gets messy.
I think there would be two main helpers for that, say parse_block (parses newline-delimited statements at same level of indentation), and parse_collection (parses a comma-delimited sequence that can dedent).
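A toy sketch of how those two helpers might divide the work. The token shapes, the tiny "statement" grammar (an identifier optionally followed by a parenthesized collection) and the error messages are all assumptions made up for this example, not the real parser.

```python
# Tokens are (kind, value) pairs. A "newline" token carries the indent
# of the line that follows it; other kinds are "ident", "(", ")" and ",".


def parse_collection(tokens, pos):
    """Parse '(' item, item, ... ')' and ignore indentation inside it."""
    assert tokens[pos][0] == "("
    pos += 1
    items = []
    while pos < len(tokens) and tokens[pos][0] != ")":
        kind, value = tokens[pos]
        if kind == "ident":
            items.append(value)
        elif kind not in (",", "newline"):
            raise SyntaxError(f"unexpected {kind} in collection")
        pos += 1
    if pos >= len(tokens):
        raise SyntaxError("unterminated collection")
    return items, pos + 1  # skip the closing paren


def parse_block(tokens, pos, indent):
    """Parse newline-delimited statements that all sit at `indent`."""
    stmts = []
    while pos < len(tokens):
        kind, value = tokens[pos]
        if kind == "newline":
            if value < indent:
                break  # dedent: the enclosing block continues
            if value > indent:
                raise SyntaxError("unexpected indent")
            pos += 1
            continue
        if kind != "ident":
            raise SyntaxError(f"unexpected {kind} at statement start")
        name = value
        pos += 1
        args = None
        if pos < len(tokens) and tokens[pos][0] == "(":
            args, pos = parse_collection(tokens, pos)
        stmts.append((name, args))
    return stmts, pos
```

The point of the split is that only `parse_block` compares indentation; `parse_collection` accepts its items at any indent because the closing paren already delimits it.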
Ok, I'd love to see the difference between the two as I can't imagine the other being simpler (but I'm probably just a dolt).
You have to be careful because you don’t want the tokenizer to insert indent tokens in the middle of an expression
You also kinda want indent tokens inside parens (so that they can allow stmts), but you also… don’t, because folks are used to eg python where indentation is relaxed inside parens
Why not? In Python:
some_func(
    a,
    b,
)
Would be ID, PAREN_O, INDENT, ID, COMMA, NEWLINE, ID, COMMA, DEDENT, PAREN_C, no? I don't see why it wouldn't be the same in Roc
(I don't remember the exact token names)
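For reference, the token stream CPython produces for that snippet can be checked with the standard `tokenize` module. The helper name `token_names` is just for this example. CPython ignores indentation inside brackets, so it emits `NL` tokens there rather than `INDENT`, `DEDENT` or logical `NEWLINE` tokens.

```python
import io
import tokenize

SOURCE = "some_func(\n    a,\n    b,\n)\n"


def token_names(source: str) -> list[str]:
    """Return the generic token type name of every token in `source`."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type] for tok in tokenize.generate_tokens(readline)]


if __name__ == "__main__":
    print(token_names(SOURCE))
```

So in Python the relaxed behavior inside parens comes from the tokenizer suppressing layout tokens entirely, not from the parser balancing them.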
What if b is dedented there? I’d like that to parse properly too since it’s completely unambiguous
Similarly, maybe a is dedented but b is indented.
Hmm....I would just expect us to push a parse error and fast forward to the closing paren and move on
That’s a valid answer, yeah. But I want to be more forgiving than that.
I’d like to have fewer annoyances than an indentation sensitive lang like python, not more
Couldn't you just say in a collection "use INDENT and DEDENT to track current, and ensure that if we encounter a newline that the indentation is the same as the start?"
And an apply as well?
I don’t quite follow
Ok, let me try an example
foo( # start_indent=0, current_indent=0
    a, # INDENT = current_indent = current_indent + 1 (1)
b, # DEDENT = current_indent = current_indent - 1 (0)
) # NEWLINE doesn't change indent
This can parse because the apply (an args container), was satisfied because we ended with the current_indent the same as start_indent
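The bookkeeping in that example could be sketched like this, treating INDENT and DEDENT as plain state-change tokens rather than as braces. The function name and token spellings are hypothetical.

```python
def container_is_balanced(tokens, start_indent=0):
    """Check that indentation returns to start_indent by the closing delimiter.

    `tokens` is the sequence of token kinds between the opening and closing
    delimiter of one container, e.g. the arguments of an apply.
    """
    current_indent = start_indent
    for kind in tokens:
        if kind == "INDENT":
            current_indent += 1
        elif kind == "DEDENT":
            current_indent -= 1
        # NEWLINE and every other token leave the indent unchanged.
    return current_indent == start_indent
```

For the `foo(` example above, the tokens between the parens are INDENT, ID, COMMA, DEDENT, ID, COMMA, NEWLINE, so the counter ends at 0 and the container is accepted.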
Or you just throw away the INDENTs and DEDENTs inside the container completely, and only monitor indent at the statement level
But at some point I feel like you have to make the whitespace significant in a whitespace significant language.
Hmm so I was definitely assuming indent/dedent would properly nest with other braces. I don’t think that would work with your design?
In Python INDENT/DEDENT are braces - because it's a militantly WSS language
But they don't have to be, they can just be state-change tokens
Hmm
At that level, I think these two representations are approximately equivalent
Or we can take a page from Lua and require blocks have an explicitly textual end delimiter. (Don't kill me, I know I suggested this before)
Ehhhh
Yeah, it just pushes the logic for "what's an indent or dedent or just a newline?" to the tokenizer
No, that logic has to cross both the parser and tokenizer. The parser has to know to ignore some of these indent/dedents where it doesn’t care about them.
Let me ask you this, if
foo(
a,
b,
)
Should parse, what about:
when foo is
Ok -> something
Err(e) ->
something_else = func(a, b)
something_else
?
That should not parse
And doesn’t, currently
Joshua Warner said:
No, that logic has to cross both the parser and tokenizer. The parser has to know to ignore some of these indent/dedents where it doesn’t care about them.
Yes, but it doesn't have to decide what is an indent/dedent in terms of raw character bytes (1 space, 4 spaces, tab, etc)
Joshua Warner said:
That should not parse
But why not? It's just as unambiguous to me. Or is it because the former has an explicit bounding pair around it?
It’s only unambiguous if the input ends there
And then
when foo is
Ok -> something
Err(e) ->
something_else = func(a, b)
something_else
end
would be unambiguous if that syntax existed?
Yeah but it doesn’t
Cool, on the same page
I'm trying to think of how we would describe the grammar of Roc, in say EBNF
That’s a gnarly road to go down
As opposed to thinking about the logic we'll implement in Zig
But maybe it's just me, but I want things to be consistent
Roc is not well designed to have a simple-ish ebnf
I think that's what people like about C-syntax languages
I don’t disagree with you
Is that a block is _always_ in squirlies
But that’s not what roc is right now
(I would be open to changing that - and TBH that’s a direction I like, but that seems like a separate convo)
What I'm saying is that "Whitespace matters for a when expression, but not for a multi-line list or application" feels inconsistent to me
I want to have a parser that’s not annoying to work with as a user
It feels forgiving yes, but inconsistent
We can issue a warning if we want
Oh, so that's why it seems like you are saying to get rid of WSS :-)
Or at least suggesting it
Because it is fiddly
No. Only relaxing it where it’s unambiguous
I’m just describing how the current parser/language work. I want to keep that same behavior.
That makes sense, but this new compiler is probably the best time to write down the principles we want for the parser and kind of move away from the past. That's kind of what we are doing in every other phase (within reason).
Like, what would you change if the starting point for the grammar was properly formatted code from the current compiler?
And then say "what should we be more forgiving on from this point?"
You obviously don't want to require the user to write perfectly formatted code - humans aren't structural editors
Roc is not well designed to have a clean and simple grammar
But what version of that grammar parses very fast, is very consistent, and not a pain to work with?
Languages with significant white space are simply not encodable in context free grammars. You always need an adapter layer, and that can get ugly.
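The usual shape of such an adapter layer is an indent stack that turns leading whitespace into INDENT and DEDENT tokens before a context-free parser sees the input. A minimal sketch, assuming space-only indentation and a made-up `(kind, text)` token shape:

```python
def layout_tokens(source: str):
    """Turn indentation into INDENT/DEDENT tokens around LINE tokens."""
    stack = [0]  # widths of the currently open indentation levels
    out = []
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines do not affect layout
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            out.append(("INDENT", ""))
        else:
            while width < stack[-1]:
                stack.pop()
                out.append(("DEDENT", ""))
            if width != stack[-1]:
                raise IndentationError("dedent does not match any outer level")
        out.append(("LINE", line.strip()))
    # Close every block still open at end of input.
    while len(stack) > 1:
        stack.pop()
        out.append(("DEDENT", ""))
    return out
```

After this pass INDENT and DEDENT behave like paired braces, which is what makes the remaining grammar context-free. It is also where the strictness comes from, since every line's indent must match a level already on the stack.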
If we want a language that has a simple grammar, that’ll require making some significant changes.
Yes, I know. That's why when I created my language I did one trick to make it neither require braces, nor be WSS: All blocks _must_ end in an expression. But that requires a specific design of language that is not like Roc today and definitely not Roc of tomorrow
Roc used to be like that
It could technically be written in a single line and parse correctly (until I added that all blocks end in an expression followed by a newline)
Presumably you need substitute line separators? Or were expressions always self-delimiting?
Yeah, we seem to be much closer to a Ruby-style syntax
Joshua Warner said:
Presumably you need substitute line separators? Or were expressions always self-delimiting?
Every expression was self-delimiting. There were only literals for Strings, Numbers, Lists, Tuples, Records, Lambdas and then Tags and a Match expression. Lambdas delimited with {} (only the anonymous kind, not top-level), and match required branches to be introduced with "|" so the last branch would delimit the whole expression
Aha yep makes sense
Cool
It wasn't the most efficient, but parsed reasonably fast for being written in F#
Hand rolled the RD parser at first, and then moved to FParsec which was 95% API compatible
I’d like to keep the set of syntax changes we’re making with this rewrite as minimal as possible
(Thank you Scott Wlaschin of F# For Fun And Profit Fame for giving me inspiration)
Yeah I'd like to make zero
So that leads us back to the initial discussion
Cool, sounds like we’re aligned then
I think I'll just wait and see what the skeleton of the parser you put up looks like on top of your tokenizer
You are the parser guru, so I'll do my best to row the boat in the same direction
Just wanted to say my piece and learn a few things along the way
I appreciate the back pressure
You’re asking reasonable questions
As a young padawan, I still think of a parser as "enforcer of a described, mechanical grammar"
Haha
One that reliably generates AST Nodes when the user is being faithful to that grammar
@Anthony Bullard Added a skeleton parser to https://github.com/roc-lang/roc/pull/7569, that shows the intended way of interoperating with the lines
:thinking: Now that I think about this tho... I wonder if I could just have a Newline token, and subvert either the offset or the length field to sneak in the indent level...
That may be a bit easier to integrate, and have lower cognitive overhead on the parser side
Yep that seems cleaner; pushed that
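A rough sketch of what a Newline token carrying the indent might look like, written in Python instead of the PR's Zig. The `Token` shape, the tag names and the choice to reuse `length` are assumptions for illustration. The indent is stored here as a single character count, whereas the PR keeps tabs and spaces separate.

```python
from typing import NamedTuple


class Token(NamedTuple):
    tag: str
    offset: int
    length: int  # for "Newline": indent of the following line, not a span length


def tokenize(source: str) -> list[Token]:
    tokens = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch == "\n":
            # Consume the following indentation and record its size.
            j = i + 1
            while j < n and source[j] in " \t":
                j += 1
            tokens.append(Token("Newline", i, j - (i + 1)))
            i = j
        elif ch in " \t":
            i += 1  # whitespace inside a line is skipped
        elif ch.isalnum() or ch == "_":
            j = i
            while j < n and (source[j].isalnum() or source[j] == "_"):
                j += 1
            tokens.append(Token("Ident", i, j - i))
            i = j
        else:
            tokens.append(Token("Punct", i, 1))
            i += 1
    return tokens
```

With this layout the parser reads the indent straight off the Newline token it is already looking at, so no separate line array or line index is needed.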
Sweet that sounds like a good compromise
Anthony Bullard said:
But maybe it's different with Roc's philosophy of maximal flexibility in the parser
Just a comment: I believe the value of parser flexibility has dropped significantly. Beginners were likely to produce strange code, but beginners with LLMs are not. It is my understanding that flexibility comes with significant complexity.
I think the main value is actually when you're temporarily in a strange editor state, e.g. because you've copy/pasted something in from somewhere else and the indentation is different
because you've copy/pasted something in from somewhere else and the indentation is different
Can you explain more? Would you want the parser to be able to fully parse everything with mixed indentation?
Richard Feldman said:
I think the main value is actually when you're temporarily in a strange editor state, e.g. because you've copy/pasted something in from somewhere else and the indentation is different
Except roc doesn't fix this with the current plans. That is exactly the case I mentioned above that breaks and is really hard to fix deterministically. It requires having to guess the tab width.
But that’s just WSS
That’s only solved by only accepting tabs for indentation (or some other more draconian measure like “only four spaces”)
WSS?
White space significance
If you have 100% bounded expressions and statements this isn’t an issue
Otherwise you need a deterministic indentation to act like “braces”
Oh sure, but it is really frustrating nevertheless.
Theoretically roc could try and figure this out or at least the formatter could convert tabs to spaces or vice versa (even with broken code) such that I can notice why the file is broken.
Just a very frustrating user experience that is hard to debug. Given everything lines up in my editor, roc could theoretically figure this out by guessing tab widths.
Not saying it is worth doing, but I don't think it is unsolvable or innate as an edge case
I would be fine with a bad indentation error. In an editor I would then use the "llm autofix error" button (zed already has this I believe). I bet llms are great at fixing indentation.
I'm not a fan of that solution cause I don't use llms generally (and I'm sure many others are in this boat). So it really isn't a fix in my mind. Also, it isn't hard to manually fix, just the kind of thing where being permissive would be amazing.
Note: I totally also recognize that it may be infeasible/unreasonable to implement.
So the best we may get is a good error message
Was there ever a discussion about WSS? Especially with regards to the parencomma restructuring. I could not find anything
Nope
You could start a #ideas thread, but I would not expect it to see traction.
I’ve started such a thread in the past and it did not go well :rolling_on_the_floor_laughing:
Brendan Hansknecht said:
Note: I totally also recognize that it may be infeasible/unreasonable to implement.
I think it is actually within reason to implement. Or at least improve the situation.
Last updated: Jul 06 2025 at 12:14 UTC