Hi! I've been thinking about how we could make some small adjustments to how roc interprets newlines and indentation, to make the syntax more consistent, understandable, and easier to parse.
I wrote up a proposal here: https://gist.github.com/joshuawarner32/f1968ea4a5ec55099b99f9d100385815#file-items_and_blocks-md
Pending some questions at the bottom, I believe this should be a no-op change for almost all code in the wild. I expect few if any people will need to change code to get it to continue to parse and mean exactly the same thing it did before.
Items are required to be separated by newlines. The first tokens of all items within a block must be indented to the same level. If an item extends beyond one line, all subsequent lines must be indented further.
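A rough way to picture that rule is a small sketch that groups the lines of a single block into items. This is only an illustration of the rule as summarized here, not the Roc parser: the function name `split_items` is made up, and it assumes spaces-only indentation and one block at a time.

```python
def split_items(lines):
    """Group the lines of one block into items.

    Assumed rule (from the proposal summary): an item starts on a line at the
    block's base indentation; deeper-indented lines continue the current item.
    """
    items = []
    base = None
    for raw in lines:
        if not raw.strip():
            continue  # blank lines do not affect grouping
        indent = len(raw) - len(raw.lstrip(" "))
        if base is None:
            base = indent  # the first item fixes the block's indentation
        if indent == base:
            items.append([raw.strip()])
        elif indent > base:
            items[-1].append(raw.strip())  # continuation of the current item
        else:
            raise ValueError(f"line outdented past the block: {raw!r}")
    return items
```

Under these assumptions, the only thing that decides where an item ends is the indentation of the next line, which is what makes the rule easy to state.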
this was the rule in CoffeeScript and it seemed to work out fine from what I can remember
I've historically shied away from it because it makes it more obvious that the language is One Of Those Indentation-Sensitive Languages, but maybe we should just embrace that even though it doesn't (currently) come up as often as it does in, say, Python or CoffeeScript
like I don't think people lump Haskell or Elm in with Python and CoffeeScript even though they have just as much indentation-sensitivity as Roc does today
although I guess defs in Roc not having let and in as delimiters does make the indentation considerations mentioned in the writeup more prominent
I'm not a huge fan of "items" - here's an idea to consider:
The body of a Roc module is broken down into statements and expressions.
- An expression evaluates to a value. For example, `x` is an expression, as is `foo "a" "b" "c"`
- A statement does not evaluate to a value. There are 5 types of statements in Roc: assignments (e.g. `x = 5`), type aliases (e.g. `Foo : List Str`), opaque type declarations (e.g. `Bar := Nat`), `dbg`, and `expect`. Every statement in Roc must end in a newline, although expressions don't have to.
The body of a module is a sequence of statements. Each statement may contain expressions; for example, in the assignment statement `x = 5`, `5` is an expression.
Any expression can be preceded by any number of statements, as long as each of those statements begins at the same indentation level as that expression. (Remember that statements must end in newlines though!) So for example you can write:
x =
    dbg y
    z + 1
Here, the `z + 1` expression is preceded by the `dbg y` statement, which is at the same indentation level as the `z + 1` expression and ends in a newline. Each of the statements will run, in order, and then the expression will run, and the entire sequence will evaluate to that expression at the end.
Both statements and expressions can span multiple lines; if they do, each of the additional lines must be at a higher level of indentation than the line where the statement or expression began.
'statements' is a good term.
Any expression can be preceded by any number of statements, as long as each of those statements begins at the same indentation level as that expression. (Remember that statements must end in newlines though!) So for example you can write:
I think we're 100% in alignment here.
What I'm suggesting is that we enforce that constraint (effectively, that you can't end a block in a statement nor have an expression in the middle of a block) in canonicalization rather than enforcing it in the parser.
yeah that seems reasonable to me!
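To make the "check it after parsing" idea concrete, here is a minimal sketch of such a validation pass. It assumes the parser has already produced a flat list of items for a block, each tagged as a statement or an expression; the tags, the function name `check_block`, and the message wording are all hypothetical and not taken from the Roc compiler.

```python
def check_block(items):
    """items: list of ("stmt" | "expr", text) pairs in source order.

    Returns a list of problems; an empty list means the block is well formed
    (zero or more statements followed by exactly one final expression).
    """
    if not items:
        return ["empty block"]
    problems = []
    for kind, text in items[:-1]:
        if kind == "expr":
            problems.append(f"expression in the middle of a block: {text}")
    last_kind, last_text = items[-1]
    if last_kind == "stmt":
        problems.append(f"block ends in a statement: {last_text}")
    return problems
```

The point of the sketch is that the parser can accept any sequence of items and leave the shape check to a later pass, which can then report every problem in the block rather than stopping at the first one.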
That means:
sweet! I also like that this is pretty easy to explain
I'm curious to hear your thoughts on the "Open Questions" at the bottom
I vaguely remember thinking there was an interesting potential design for something if we had the "indentation is required to continue an expression or statement" design, but I don't recall exactly what it was at the moment :thinking:
Particularly - should statements be allowed inside parenthesized expressions?
... and if so, should we also allow them inside tuples? In other collections?
... and, even more confusingly, what does this look like in terms of where the ',' is required to be, separating the values?
One option would be to allow statements inside parens - but to specifically require that tuples must be on a single line.
from a teaching perspective I think it's better if we allow them, in the sense that otherwise the rules have to state "...except if the expression has parens around it, in which case you can't, also inside certain collection literals it's not allowed either"
I may be missing something, but would that create an ambiguity somehow?
That throws a bit of a wrench in the works of the idea to do indent checking purely in the lexer
hm, how so?
you mean with the INDENT and DEDENT tokens?
Yeah
The question is - how does the lexer know that it should emit an INDENT token and check the indentation of the following statements/expressions as part of a block.
checking as in like actual validation, which could produce errors?
I assumed the idea is that it just produces a stream of tokens and then it's totally up to the parser where expressions and statements begin and end, based on the tokens it encounters
oh wait I think I see
do you mean that the algorithm for deciding whether to emit an INDENT or DEDENT (as opposed to just like "there are some spaces here") becomes tricky?
Yes
gotcha
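For reference, the textbook indentation-stack approach looks roughly like the sketch below. It deliberately makes the simplifying assumption being questioned here, namely that every deeper-indented line opens a block, so it cannot tell a block from an expression that merely continues onto the next line. The token shapes and the name `layout_tokens` are made up for illustration.

```python
def layout_tokens(lines):
    """Return ("INDENT",), ("DEDENT",) and ("LINE", text) tokens.

    Assumes spaces only, and that every deeper-indented line opens a block.
    """
    stack = [0]  # indentation levels of the currently open blocks
    out = []
    for raw in lines:
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        if indent > stack[-1]:
            stack.append(indent)
            out.append(("INDENT",))
        while indent < stack[-1]:
            stack.pop()
            out.append(("DEDENT",))
        if indent != stack[-1]:
            raise ValueError(f"inconsistent outdent: {raw!r}")
        out.append(("LINE", raw.strip()))
    while len(stack) > 1:  # close any blocks still open at end of input
        stack.pop()
        out.append(("DEDENT",))
    return out
```

A lexer for the proposed rules would need more context than this: whether the previous line ended in something like `=` or `->`, for example, before deciding that an indent starts a block.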
Also, we need to be careful about the distinction between 'blocks' (as I've been calling them) and exprs that just happen to be on multiple lines, but that don't contain '=' / etc (maybe a long binary op)
In the latter case, it feels weird to me to require all the lines of the expr to be exactly aligned
i.e. the user may want to indent nested parts of the expr further
oh yeah, something I hadn't considered - does this mean that |> lines have to be indented? :thinking:
Yes
I think the formatter already does that tho
not anymore, they're at the same level of indentation as the preceding expression now
a
|> foo
|> bar
|> etc
Oh interesting
but I guess we could always make an exception for infix operators
like those don't count as outdents
or the end of the expression
:thinking:
in the lexer even
Yeah that could work
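One way to picture that exception: before any indentation checks, fold a line that begins with an infix operator into the line before it. The sketch below is only an illustration; the operator list is an arbitrary subset chosen for the example (it leaves out `-`, which would be ambiguous with unary minus), and `merge_operator_lines` is a made-up name.

```python
# Illustrative subset of infix operators; not Roc's actual operator table.
INFIX_OPERATORS = ("|>", "&&", "||", "==", "!=", "+", "*", "/")

def merge_operator_lines(lines):
    """Fold lines that start with an infix operator into the previous line."""
    merged = []
    for raw in lines:
        stripped = raw.strip()
        if merged and stripped.startswith(INFIX_OPERATORS):
            # Treat the line as a continuation, whatever its indentation.
            merged[-1] = merged[-1] + " " + stripped
        else:
            merged.append(raw.rstrip())
    return merged
```

With this pre-pass, the pipeline example above stays a single expression even though its `|>` lines are not indented.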
Ok, cool - I think under that rule we can make statements inside parens 'just work', along with any other collection.
so honestly my main concern here would be error message quality
like I've definitely seen some unhelpful "unexpected OUTDENT" messages in my time
and I'm not sure if this would exacerbate those or not matter
Python is not known for high-quality syntax errors ;)
(or maybe somehow make them easier?)
yeah, I'm mainly wondering what's possible
like if we want to do a good job providing helpful errors, does splitting out the lexer help us, hurt us, or make basically no difference?
Thinking about a few cases here...
Outdents inside parens, square brackets, etc - I think is 100% solvable, since we can use the matching delimiters to work out what you really meant there.
nice
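A sketch of how matching delimiters could drive that diagnosis: keep a stack of open brackets together with the line each one was opened on, and flag any line that is indented less than the line holding the innermost open bracket. This is a simplification for illustration only; it ignores string literals and comments, and the name `outdents_inside_brackets` is hypothetical.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}

def outdents_inside_brackets(lines):
    """Return (line_no, opener, opener_line_no) for each line indented less
    than the line that holds the innermost still-open bracket."""
    open_stack = []  # entries: (opener, line_no, indent of the opener's line)
    found = []
    for line_no, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        if open_stack and indent < open_stack[-1][2]:
            opener, opener_line, _ = open_stack[-1]
            found.append((line_no, opener, opener_line))
        for ch in raw:
            if ch in PAIRS:
                open_stack.append((ch, line_no, indent))
            elif open_stack and ch == PAIRS[open_stack[-1][0]]:
                open_stack.pop()
    return found
```

Because the report carries the opener and its line, an error message could point at both the stray outdent and the bracket it falls inside.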
For other cases, let's maybe take this as a representative example:
foo = \a, b ->
    x = baz a b
y = bar x
y + 1
The parser will see that as two defs
The first def there will not end in an expression, so we'll get a canonicalization error
I think in that case the best way to disambiguate the user intent might be to look at the names involved in the second def.
If that references names not defined in the outer scope, but that _are_ defined as locals inside the broken def, we can probably infer the user intent here and give a helpful error.
(conveniently, canonicalization has exactly that information available, I think)
that also sounds reasonable!
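The name-based heuristic can be sketched as a set comparison. This assumes canonicalization can supply three sets of names: the locals of the def that is missing its final expression, the names referenced by the def that follows it, and the names visible in the outer scope. The function and parameter names are hypothetical.

```python
def accidental_outdent_evidence(broken_def_locals, next_def_refs, outer_scope):
    """Names that suggest the next def was meant to be nested in the broken one.

    A name counts as evidence when the next def refers to it, it is a local of
    the broken def, and nothing by that name exists in the outer scope.
    """
    return sorted(
        name
        for name in next_def_refs
        if name in broken_def_locals and name not in outer_scope
    )
```

For the example above, the broken def `foo` has locals `a`, `b`, and `x`, and the following def refers to `bar` and `x`. Since `x` is not defined at the outer level, it is the evidence that `y = bar x` was probably meant to be indented.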
ok another question is performance, although that's easier to measure. We're introducing a new IR, and potentially a lot of memory to traverse; are we signing up for a bunch of cache misses?
I mean maybe it's fine because the whole parsing step is already so fast, maybe slowing it down a bit isn't noticeable
There would be some wins here. Specifically, there are some cases where the parser currently has to backtrack that could (probably?) be disambiguated by peeking at a fixed number of tokens [ahead] in the input
:thinking: do you think that would be enough to compensate for the extra work of creating and then traversing the token stream?
It's not 100% certain - but Zig does it, so it can't be too bad ;)
https://mitchellh.com/zig/parser
In particular, the parser has an array of tokens and token offsets:
const Parser = struct {
    gpa: Allocator,
    source: []const u8,
    token_tags: []const Token.Tag,
    token_starts: []const Ast.ByteOffset,
    tok_i: TokenIndex,
    ...
}
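To show how that layout supports fixed lookahead, here is a small sketch of the same idea: tags and start offsets live in two parallel arrays, and the parser peeks a fixed number of tokens ahead instead of backtracking. The tag names (`LOWER_IDENT`, `EQUALS`, `COLON`, and so on) and the `is_def_start` rule are invented for the example and are much simpler than real Roc definitions.

```python
from dataclasses import dataclass

@dataclass
class TokenStream:
    source: str
    tags: list    # one tag per token
    starts: list  # offset of each token in `source`, parallel to `tags`
    pos: int = 0  # index of the current token

    def peek(self, k=0):
        """Tag of the token k places ahead, or "EOF" past the end."""
        i = self.pos + k
        return self.tags[i] if i < len(self.tags) else "EOF"

def is_def_start(ts):
    # Decide "definition or expression?" with one token of lookahead,
    # without consuming anything or backtracking.
    return ts.peek(0) == "LOWER_IDENT" and ts.peek(1) in ("EQUALS", "COLON")
```

Peeking is just an array index here, so the cost question is mostly about building and walking the arrays, not about the lookahead itself.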
sure, I just want to make sure we're considering the potential risks as well as the potential benefits :big_smile:
certainly it's possible to lex+parse quickly overall, but that doesn't mean it would necessarily be faster in our case - might be slower, even if still fast in the grand scheme of things
Yep, for sure
so the rules change seems reasonable to me, and given that, the implementation seems worth a try!
I'd just want to keep an eye on error message quality and performance, make sure they're both still good afterwards
Yeah makes sense
thanks for writing it up and talking it through! :smiley:
FWIW, I think there will be some immediate wins in error messages - since right now accidentally putting an expr in the middle of a sequence of defs screws up the parse for the rest of the defs.
And on the perf side, we can actually implement these rules _without_ doing the separate lexer adjustment
And also, these rules make it easier to recover parsing after an error by looking for the next statement/expression in the nearest surrounding block (emitting a malformed statement / expression in the syntax tree). That means compilation can continue despite the broken code, and you can still run tests as long as they don't touch that code.
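That recovery strategy can be sketched as follows: split the block into items by indentation, parse each item independently, and replace any item that fails with a malformed node before carrying on. Everything here is illustrative: `ParseError`, `parse_block`, and the toy `parse_item` are invented names, and a real implementation would work on tokens rather than raw lines.

```python
class ParseError(Exception):
    pass

def indent_of(line):
    return len(line) - len(line.lstrip(" "))

def parse_block(lines, parse_item):
    """Parse the items of one block, recovering from per-item failures.

    `parse_item` is a hypothetical callback: it takes the lines of one item
    and either returns a node or raises ParseError.
    """
    lines = [line for line in lines if line.strip()]
    nodes = []
    i = 0
    while i < len(lines):
        base = indent_of(lines[i])
        j = i + 1
        # An item runs until the next line that is not indented deeper.
        while j < len(lines) and indent_of(lines[j]) > base:
            j += 1
        try:
            nodes.append(parse_item(lines[i:j]))
        except ParseError as err:
            nodes.append(("malformed", str(err)))  # keep going after the error
        i = j
    return nodes

def parse_item(item_lines):
    # Toy parser for the example: any item containing "??" is a syntax error.
    text = " ".join(line.strip() for line in item_lines)
    if "??" in text:
        raise ParseError("bad item: " + text)
    return ("ok", text)
```

In the sketch, a broken item costs only its own node; the items before and after it still parse, which is what lets later compilation stages keep going.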
Richard Feldman said:
I've historically shied away from it because it makes it more obvious that the language is One Of Those Indentation-Sensitive Languages, but maybe we should just embrace that even though it doesn't (currently) come up as often as it does in, say, Python or CoffeeScript
What's bad about being in the indentation sensitive language camp anyway?
Indentation sensitivity has rarely bothered me but there are some issues that can come up:
Anton said:
Indentation sensitivity has rarely bothered me but there are some issues that can come up:
- fixing code that you pasted can be tedious
- accidentally mixing up tabs and spaces
- following indentation rules can be tedious when using a simple editor
isn't it true to say though that structural editors can basically eliminate this whole class of problems?
so that in roc building upon the roc editor we wouldn't have that much of an issue introducing this as a syntax constraint
Yes indeed, I don't expect indentation to create any problems there.
The await bang brought an additional point to get it back on track.
What’s the expected amount of work? Can we split it? I don't mind to start working on it. However, the current parsing implementation has not yet settled in my head.
The reason I'm bringing it up again is outlined here and it's also kind of a blocker for this problem
With indentation-sensitive parsing, the problem will resolve itself
Kiryl Dziamura said:
The await bang brought an additional point to get it back on track.
What’s the expected amount of work? Can we split it? I don't mind to start working on it. However, the current parsing implementation has not yet settled in my head.
I suspect it’s a pretty big project, although I haven’t really thought about how it would work to modify the current parser to do it
as opposed to doing it as part of a larger change to a non-parser-combinator design
I have things _mostly_ working, in this WIP diff: https://github.com/roc-lang/roc/pull/6809
The last thing (I think / I hope) is squashing some bugs / assessing some changes in test_reporting
Here's a representative example:
Snapshot: dbg_without_final_expression
Source: crates/compiler/load/tests/test_reporting.rs:5740
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Expression: golden
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-old snapshot
+new results
────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
0 │-── INDENT ENDS AFTER EXPRESSION in tmp/dbg_without_final_expression/Test.roc ───
0 │+── MISSING FINAL EXPRESSION in tmp/dbg_without_final_expression/Test.roc ───────
1 1 │
2 │-I am partway through parsing a dbg statement, but I got stuck here:
2 │+I am partway through parsing a definition, but I got stuck here:
3 3 │
4 │+1│ app "test" provides [main] to "./platform"
5 │+2│
6 │+3│ main =
4 7 │ 4│ dbg 42
5 8 │ ^
6 9 │
7 │-I was expecting a final expression, like so
10 │+This definition is missing a final expression. A nested definition
11 │+must be followed by either another definition, or an expression
8 12 │
9 │- dbg 42
10 │- "done"
13 │+ x = 4
14 │+ y = 2
15 │+
16 │+ x + y
The TL;DR there is that dbg parsing now only parses the dbg 42 part itself and then pops back up to a higher level to parse the rest of the block, so the natural error message is just a generic "definition missing final expr" error, rather than something specific to dbg.
Should be possible to recover that original dbg-specific error; it just needs some attention on each of the failures
If someone would be interested in helping out, assessing each of these test failures is probably fairly parallelizable
I'm interested, I just want to finish a couple of other PR's I'm working on.
Hopefully we can have a new release of basic-ssg that supports Windows. It's been slow progress finding a compatible set of deps that Rust is happy with. I also want to finish the removal of rebuilding the host from roc.
Would be happy to help too
PR is green & ready for review: https://github.com/roc-lang/roc/pull/6809
:partying_face:
Last updated: Jun 16 2026 at 16:19 UTC