Grammar complexity · ideas · Zulip Chat Archive

Stream: ideas

Topic: Grammar complexity

drathier (May 23 2023 at 07:58):

Do you have a goal for grammar complexity?

Please try to avoid bikeshedding the exact syntax changes, this is about reducing dependency on e.g. indentation levels being correct, or there being no missing newlines :)

It seems like roc's grammar could reach a lower chomsky hierarchy with a few minor changes, making it less indentation sensitive, allowing the formatter to read code with absolutely bonkers indentation, which greatly helps beginners and when copy/pasting code, at the expense of an extra character every once in a while. For example, something that delimits each branch in a when ... is expr, like a leading when (kotlin-style) or a trailing , (erlang-style) or ; (c++-style). Going from when ... is to when ... {with a matching end parenthesis would also be a change in this direction, but that's likely a step too far for an ml-syntax language :). A simpler grammar type also allows faster parsing and better error recovery from parse errors.

Georges Boris (May 23 2023 at 10:21):

tree-sitter-elm is quite unstable and I wonder if that has anything to do with this. the maintainers did say the tricky part was in the C code needed due to white space sensitivity.

Richard Feldman (May 23 2023 at 10:31):

@Joshua Warner had some thoughts around this, although as I recall it was more about making things indentation-sensitive in more places :big_smile:

Richard Feldman (May 23 2023 at 11:35):

also, hi @drathier! :wave:

Joshua Warner (May 24 2023 at 01:45):

Yeah, as is the Roc grammar is pretty difficult to parse - tho honestly I don't think indentation sensitivity is a big barrier here. I used to be pretty staunchly anti-significant-whitespace, but that was before I understood how Python's lexer works :)

That said, I really wouldn't object to a syntax that's a bit more C-like, as that's where I have more experience.

IMO, the biggest pain points with parsing Roc are:

Distinguishing commas separating args in function types, from commas that are separating elements in a list
Similarly, distinguishing commas for multi-backpassing from commas separating tuple elements (inside parens)
A few places where tokenization depends on whether there's whitespace before/after - e.g. dots for field access, and ext types (both are awkward, but not a huge issue)

Joshua Warner (May 24 2023 at 01:48):

All of those are (I believe) still parseable as a context-free grammar (well, given a suitable lexer to deal with whitespace indentation) - but they are _not_ well handled in common formalisms. I believe Roc can't be parsed as LALR, LR(k), standard recursive descent, etc.

Joshua Warner (May 24 2023 at 01:59):

Richard Feldman said:

Joshua Warner had some thoughts around this, although as I recall it was more about making things indentation-sensitive in more places :big_smile:

Yeah, I wrote up some proposals around this - but the actual changes never actual landed, due to being dependent on an increasingly-ill-advised series of rewrite attempts (aka my own fault).

Richard Feldman (May 24 2023 at 02:00):

ha, no worries!

Richard Feldman (May 24 2023 at 02:00):

I actually think it would be really valuable to look at even design sketches

Richard Feldman (May 24 2023 at 02:00):

if you're comfortable sharing them?

Joshua Warner (May 24 2023 at 02:01):

Yeah, definitely

Joshua Warner (May 24 2023 at 02:01):

Be warned, the code isn't pretty ;)

Joshua Warner (May 24 2023 at 02:20):

Here's the current state of the most recent attempt: https://github.com/joshuawarner32/roc/blob/tokens/crates/compiler/parse/src/tree.rs

That's a _mostly_ recursive descent parser - with the interesting caveat that some things (like expressions/blocks and function types) have to know how to parse themselves both singly _and_ as part of a comma-separated list (seq). In the latter case, the expectation is that you call the _seq parsing function, and it'll parse some (but not necessarily all) comma-separated items, pushing them onto the array provided in the argument. This is theoretically sound, but very difficult to reason about / debug.

Georges Boris (May 24 2023 at 12:33):

@Joshua Warner do you know if the challenges you listed (the "can't be parsed" statement) affect something like tree-sitter?

I would guess tree-sitter is pretty capable since it works with so many different languages but I was just wondering if you think there is something specific to Roc that would make it more challenging/error-prone

Joshua Warner (May 24 2023 at 14:35):

I believe tree-sitter should be able to parse those, but you'll need to be careful with the lexing step. You can actually probably adapt the tokenizer from the attempt I linked above (tokens.rs in the same dir). That part is basically working, but needs to be extended to correctly handle some of the fancier roc features like multiline strings and \(expr) escapes in strings.

Another challenge will be expressing the grammar in a way that tree-sitter sees as unambiguous. Specifically, some of the subtleties around commas require e.g. COMMA NEWLINE (i.e. two tokens) to be treated as a stronger delimiter than just COMMA - to disambiguate whether you're talking about separating list elements (newline allowed) or multibackpassing (no newline).

Last updated: Jul 23 2026 at 13:15 UTC