Do you have a goal for grammar complexity?
Please try to avoid bikeshedding the exact syntax changes, this is about reducing dependency on e.g. indentation levels being correct, or there being no missing newlines :)
It seems like roc's grammar could reach a lower chomsky hierarchy with a few minor changes, making it less indentation sensitive, allowing the formatter to read code with absolutely bonkers indentation, which greatly helps beginners and when copy/pasting code, at the expense of an extra character every once in a while. For example, something that delimits each branch in a when ... is expr, like a leading when (kotlin-style) or a trailing , (erlang-style) or ; (c++-style). Going from when ... is to when ... {with a matching end parenthesis would also be a change in this direction, but that's likely a step too far for an ml-syntax language :). A simpler grammar type also allows faster parsing and better error recovery from parse errors.
tree-sitter-elm is quite unstable and I wonder if that has anything to do with this. the maintainers did say the tricky part was in the C code needed due to white space sensitivity.
@Joshua Warner had some thoughts around this, although as I recall it was more about making things indentation-sensitive in more places :big_smile:
also, hi @drathier! :wave:
Yeah, as is the Roc grammar is pretty difficult to parse - tho honestly I don't think indentation sensitivity is a big barrier here. I used to be pretty staunchly anti-significant-whitespace, but that was before I understood how Python's lexer works :)
That said, I really wouldn't object to a syntax that's a bit more C-like, as that's where I have more experience.
IMO, the biggest pain points with parsing Roc are:
All of those are (I believe) still parseable as a context-free grammar (well, given a suitable lexer to deal with whitespace indentation) - but they are _not_ well handled in common formalisms. I believe Roc can't be parsed as LALR, LR(k), standard recursive descent, etc.
Richard Feldman said:
Joshua Warner had some thoughts around this, although as I recall it was more about making things indentation-sensitive in more places :big_smile:
Yeah, I wrote up some proposals around this - but the actual changes never actual landed, due to being dependent on an increasingly-ill-advised series of rewrite attempts (aka my own fault).
ha, no worries!
I actually think it would be really valuable to look at even design sketches
if you're comfortable sharing them?
Yeah, definitely
Be warned, the code isn't pretty ;)
Here's the current state of the most recent attempt: https://github.com/joshuawarner32/roc/blob/tokens/crates/compiler/parse/src/tree.rs
That's a _mostly_ recursive descent parser - with the interesting caveat that some things (like expressions/blocks and function types) have to know how to parse themselves both singly _and_ as part of a comma-separated list (seq). In the latter case, the expectation is that you call the _seq parsing function, and it'll parse some (but not necessarily all) comma-separated items, pushing them onto the array provided in the argument. This is theoretically sound, but very difficult to reason about / debug.
@Joshua Warner do you know if the challenges you listed (the "can't be parsed" statement) affect something like tree-sitter?
I would guess tree-sitter is pretty capable since it works with so many different languages but I was just wondering if you think there is something specific to Roc that would make it more challenging/error-prone
I believe tree-sitter should be able to parse those, but you'll need to be careful with the lexing step. You can actually probably adapt the tokenizer from the attempt I linked above (tokens.rs in the same dir). That part is basically working, but needs to be extended to correctly handle some of the fancier roc features like multiline strings and \(expr) escapes in strings.
Another challenge will be expressing the grammar in a way that tree-sitter sees as unambiguous. Specifically, some of the subtleties around commas require e.g. COMMA NEWLINE (i.e. two tokens) to be treated as a stronger delimiter than just COMMA - to disambiguate whether you're talking about separating list elements (newline allowed) or multibackpassing (no newline).
Last updated: Jun 16 2026 at 16:19 UTC