Stream: ideas

Topic: Design: Indents and Blocks


view this post on Zulip Joshua Warner (Feb 23 2023 at 02:44):

Hi! I've been thinking about how we could make some small adjustments to how roc interprets newlines and indentation, to make the syntax more consistent, understandable, and easier to parse.

I wrote up a proposal here: https://gist.github.com/joshuawarner32/f1968ea4a5ec55099b99f9d100385815#file-items_and_blocks-md

Pending some questions at the bottom, I believe this should be a no-op change for almost all code in the wild. I expect few if any people will need to change code to get it to continue to parse and mean exactly the same thing it did before.

view this post on Zulip Richard Feldman (Feb 23 2023 at 02:48):

Items are required to be separated by newlines. The first token of each item within a block must all be indented to the same level. If an item extends beyond one line, all subsequent lines must be indented.

this was the rule in CoffeeScript and it seemed to work out fine from what I can remember

view this post on Zulip Richard Feldman (Feb 23 2023 at 02:49):

I've historically shied away from it because it makes it more obvious that the language is One Of Those Indentation-Sensitive Languages, but maybe we should just embrace that even though it doesn't (currently) come up as often as it does in, say, Python or CoffeeScript

view this post on Zulip Richard Feldman (Feb 23 2023 at 02:50):

like I don't think people lump Haskell or Elm in with Python and CoffeeScript even though they have just as much indentation-sensitivity as Roc does today

view this post on Zulip Richard Feldman (Feb 23 2023 at 02:51):

although I guess defs in Roc not having let and in as delimiters does make the indentation considerations mentioned in the writeup more prominent

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:00):

I'm not a huge fan of "items" - here's an idea to consider:

The body of a Roc module is broken down into statements and expressions.

The body of a module is a sequence of statements. Each statement may contain expressions; for example, in the assignment statement x = 5, 5 is an expression.

Any expression can be preceded by any number of statements, as long as each of those statements begins at the same indentation level as that expression. (Remember that statements must end in newlines though!) So for example you can write:

x =
    dbg y
    z + 1

Here, the z + 1 expression is preceded by the dbg y statement, which is at the same indentation level as the z + 1 expression and ends in a newline. Each of the statements will run, in order, and then the expression will run, and the entire sequence will evaluate to that expression at the end.

Both statements and expressions can span multiple lines; if they do, each of the additional lines must be at a higher level of indentation than the line where the statement or expression began.
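A minimal sketch of the two indentation rules above, assuming a block has already been isolated as a list of lines. The function name `split_items` and the list-of-lists result are hypothetical, chosen only for illustration; this is not how the Roc parser is implemented.

```python
def split_items(lines):
    """Group the lines of one block into statements/expressions.

    A line at the block's base indentation starts a new item; a line indented
    further continues the current item. Blank lines are skipped.
    """
    items = []
    base = None
    for line in lines:
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip(" "))
        if base is None:
            base = indent  # the first non-blank line fixes the block's level
        if indent == base:
            items.append([line.strip()])
        elif indent > base:
            items[-1].append(line.strip())  # continuation of the current item
        else:
            raise SyntaxError(f"line outdented past the block: {line!r}")
    return items


block = [
    "    dbg y",
    "    z +",
    "        1",
]
print(split_items(block))  # [['dbg y'], ['z +', '1']]
```

The last item in the result is the expression the block evaluates to; everything before it is a statement.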

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:02):

'statements' is a good term.

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:03):

Richard Feldman said:

Any expression can be preceded by any number of statements, as long as each of those statements begins at the same indentation level as that expression. (Remember that statements must end in newlines though!) So for example you can write:

I think we're 100% in alignment here.

What I'm suggesting is that we enforce that constraint (effectively, that you can't end a block in a statement nor have an expression in the middle of a block) in canonicalization rather than enforcing it in the parser.
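As a rough sketch of what checking this after parsing could look like, assume the parser accepts any sequence of statements and expressions and hands over a list of `(kind, text)` pairs. The `check_block` name, the pair representation, and the message wording are all assumptions made for this example.

```python
def check_block(items):
    """Validate a block that a permissive parser has already accepted.

    `items` is a list of ("stmt" | "expr", text) pairs in source order.
    Returns a list of problem descriptions; an empty list means the block is fine.
    """
    if not items:
        return ["empty block"]
    problems = []
    for kind, text in items[:-1]:
        if kind == "expr":
            problems.append(f"expression in the middle of a block: {text}")
    last_kind, last_text = items[-1]
    if last_kind != "expr":
        problems.append(f"block ends in a statement: {last_text}")
    return problems


print(check_block([("stmt", "dbg y"), ("expr", "z + 1")]))  # []
print(check_block([("stmt", "dbg 42")]))
```

Because the parser has already produced a complete tree by the time this runs, a check like this can report the problem without derailing the parse of whatever follows.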

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:05):

yeah that seems reasonable to me!

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:06):

That means:

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:06):

sweet! I also like that this is pretty easy to explain

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:07):

I'm curious to hear your thoughts on the "Open Questions" at the bottom

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:07):

I vaguely remember thinking there was an interesting potential design for something if we had the "indentation is required to continue an expression or statement" design, but I don't recall exactly what it was at the moment :thinking:

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:08):

Particularly - should statements be allowed inside parenthesized expressions?

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:08):

... and if so, should we also allow them inside tuples? In other collections?

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:10):

... and, even more confusingly, what does this look like in terms of where the ',' is required to be, separating the values?

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:11):

One option would be to allow statements inside parens - but to specifically require that tuples must be on a single line.

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:12):

from a teaching perspective I think it's better if we allow them, in the sense that otherwise the rules have to state "...except if the expression has parens around it, in which case you can't, also inside certain collection literals it's not allowed either"

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:13):

I may be missing something, but would that create an ambiguity somehow?

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:14):

That throws a bit of a wrench in the works of the idea to do indent checking purely in the lexer

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:15):

hm, how so?

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:16):

you mean with the INDENT and DEDENT tokens?

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:16):

Yeah

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:16):

The question is - how does the lexer know that it should emit an INDENT token and check the indentation of the following statements/expressions as part of a block.

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:17):

checking as in like actual validation, which could produce errors?

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:18):

I assumed the idea is that it just produces a stream of tokens and then it's totally up to the parser where expressions and statements begin and end, based on the tokens it encounters

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:19):

oh wait I think I see

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:19):

do you mean that the algorithm for deciding whether to emit an INDENT or DEDENT (as opposed to just like "there are some spaces here") becomes tricky?

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:19):

Yes
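For reference, here is a sketch of the simplest version of that algorithm, in the style of Python's tokenizer: keep a stack of indentation widths and emit INDENT or DEDENT whenever the width changes. The token shapes and the `indent_tokens` name are assumptions for illustration. Note that this naive version emits an INDENT for any deeper line, so it cannot tell a new block from a multi-line expression, which is exactly the difficulty being discussed.

```python
def indent_tokens(source):
    """Return (kind, text) tokens: INDENT/DEDENT markers plus one LINE per
    non-blank source line. Indentation is measured in leading spaces."""
    stack = [0]  # indentation widths of the currently open levels
    tokens = []
    for line in source.splitlines():
        if not line.strip():
            continue
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            tokens.append(("INDENT", ""))
        while width < stack[-1]:
            stack.pop()
            tokens.append(("DEDENT", ""))
        if width != stack[-1]:
            raise SyntaxError(f"inconsistent outdent: {line!r}")
        tokens.append(("LINE", line.strip()))
    while len(stack) > 1:  # close any levels still open at end of input
        stack.pop()
        tokens.append(("DEDENT", ""))
    return tokens


for token in indent_tokens("x =\n    dbg y\n    z + 1\n"):
    print(token)
```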

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:20):

gotcha

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:20):

Also, we need to be careful about the distinction between 'blocks' (as I've been calling them) and exprs that just happen to be on multiple lines, but that don't contain '=' / etc (maybe a long binary op)

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:21):

In the latter case, it feels weird to me to require all the lines of the expr to be exactly aligned

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:21):

i.e. the user may want to indent nested parts of the expr further

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:24):

oh yeah, something I hadn't considered - does this mean that |> lines have to be indented? :thinking:

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:24):

Yes

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:24):

I think the formatter already does that tho

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:24):

not anymore, they're at the same level of indentation as the preceding expression now

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:24):

a
  |> foo
  |> bar
  |> etc

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:25):

Oh interesting

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:25):

but I guess we could always make an exception for infix operators

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:25):

like those don't count as outdents

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:25):

or the end of the expression

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:25):

:thinking:

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:25):

in the lexer even

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:25):

Yeah that could work
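One way to picture the infix-operator exception is as a pre-pass that folds any line beginning with an infix operator into the previous logical line before indentation is compared, so such a line can never produce an outdent. This is only a sketch: the operator tuple is a small illustrative subset rather than Roc's operator list, and `join_continuations` is a hypothetical name.

```python
# Illustrative subset only; a leading "-" is left out because it could be unary.
INFIX_OPERATORS = ("|>", "&&", "||", "==")


def join_continuations(lines):
    """Merge each line that starts with an infix operator into the previous
    logical line, regardless of how that line is indented."""
    logical = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if logical and stripped.startswith(INFIX_OPERATORS):
            logical[-1] += " " + stripped
        else:
            logical.append(stripped)
    return logical


print(join_continuations(["a", "|> foo", "|> bar", "b = 1"]))
# ['a |> foo |> bar', 'b = 1']
```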

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:27):

Ok, cool - I think under that rule we can make statements inside parens 'just work', along with any other collection.

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:27):

so honestly my main concern here would be error message quality

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:27):

like I've definitely seen some unhelpful "unexpected OUTDENT" messages in my time

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:28):

and I'm not sure if this would exacerbate those or not matter

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:28):

Python is not known for high-quality syntax errors ;)

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:28):

(or maybe somehow make them easier?)

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:28):

yeah, I'm mainly wondering what's possible

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:28):

like if we want to do a good job providing helpful errors, does splitting out the lexer help us, hurt us, or make basically no difference?

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:31):

Thinking about a few cases here...

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:31):

Outdents inside parens, square brackets, etc - I think is 100% solvable, since we can use the matching delimiters to work out what you really meant there.
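One possible reading of this, sketched under simplifying assumptions: track which delimiters are still open and the indentation of the line that opened them, and when a later line is indented less than that, report it together with the opener it belongs to. The sketch ignores string literals and comments, and the function name and report shape are hypothetical.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}


def outdents_inside_delimiters(source):
    """Return (line_no, opener, opened_at_line) for every line that is indented
    less than the line that opened a still-unclosed delimiter."""
    open_stack = []  # entries are (opener, line_no, indent of the opening line)
    reports = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue
        indent = len(line) - len(line.lstrip(" "))
        if open_stack and indent < open_stack[-1][2]:
            opener, opened_at, _ = open_stack[-1]
            reports.append((line_no, opener, opened_at))
        for ch in stripped:
            if ch in PAIRS:
                open_stack.append((ch, line_no, indent))
            elif open_stack and ch == PAIRS[open_stack[-1][0]]:
                open_stack.pop()
    return reports


source = "main =\n    foo (\n        a,\n  b,\n    )\n"
print(outdents_inside_delimiters(source))  # [(4, '(', 2)]
```

With that information an error message can point at both the outdented line and the delimiter it sits inside, instead of reporting a bare unexpected outdent.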

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:33):

nice

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:33):

For other cases, let's maybe take this as a representative example:

foo = \a, b ->
    x = baz a b
y = bar x
    y + 1

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:34):

The parser will see that as two defs

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:34):

The first def there will not end in an expression, so we'll get a canonicalization error

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:35):

I think in that case the best way to disambiguate the user intent might be to look at the names involved in the second def.

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:36):

If that references names not defined in the outer scope, but that _are_ defined as locals inside the broken def, we can probably infer the user intent here and give a helpful error.

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:36):

(conveniently, canonicalization has exactly that information available, I think)
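A sketch of that heuristic, assuming canonicalization can supply the set of names in the outer scope and the set of locals defined inside the def that is missing its final expression. The regex-based name extraction is a stand-in for real symbol resolution, and every name in the example is hypothetical.

```python
import re

# Stand-in for real symbol resolution: lowercase-initial identifiers only.
IDENT = re.compile(r"[a-z][A-Za-z0-9]*")


def looks_misindented(next_def_body, outer_scope, locals_of_broken_def):
    """Guess whether the def after a broken one was meant to be nested in it.

    True when the body uses at least one name that is unknown in the outer
    scope, and every such unknown name is a local of the broken def.
    """
    used = set(IDENT.findall(next_def_body))
    unknown = used - set(outer_scope)
    return bool(unknown) and unknown <= set(locals_of_broken_def)


# From the example above: `y = bar x` follows the broken def of `foo`.
print(looks_misindented("bar x", {"foo", "bar", "baz"}, {"a", "b", "x"}))  # True
```

When the check succeeds, the error can suggest indenting the second def rather than only reporting that the first one is missing its final expression.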

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:38):

that also sounds reasonable!

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:38):

ok another question is performance, although that's easier to measure. We're introducing a new IR, and potentially a lot of memory to traverse; are we signing up for a bunch of cache misses?

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:38):

I mean maybe it's fine because the whole parsing step is already so fast, maybe slowing it down a bit isn't noticeable

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:42):

There would be some wins here. Specifically, there are some cases where the parser currently has to backtrack that could (probably?) be disambiguated by peeking at a fixed number of tokens [ahead] in the input
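To illustrate the general idea (not any specific case in the current Roc parser, which the thread does not name), here is a sketch where a def is told apart from an expression by peeking one token ahead rather than attempting one parse and backtracking. The token kinds and the `classify` name are assumptions.

```python
def classify(tokens, pos=0):
    """Decide "def" vs "expr" for the item starting at `pos` by looking at a
    fixed window of two tokens, with no backtracking.

    `tokens` is a list of (kind, text) pairs.
    """
    if (
        pos + 1 < len(tokens)
        and tokens[pos][0] == "IDENT"
        and tokens[pos + 1][0] == "EQUALS"
    ):
        return "def"
    return "expr"


print(classify([("IDENT", "x"), ("EQUALS", "="), ("NUMBER", "5")]))  # def
print(classify([("IDENT", "x"), ("OP", "+"), ("NUMBER", "1")]))      # expr
```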

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:44):

:thinking: do you think that would be enough to compensate for the extra work of creating and then traversing the token stream?

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:47):

It's not 100% certain - but zig does it, so it can't be too bad ;)

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:47):

https://mitchellh.com/zig/parser

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:48):

In particular, the parser has an array of tokens and token offsets:

const Parser = struct {
    gpa: Allocator,
    source: []const u8,

    token_tags: []const Token.Tag,
    token_starts: []const Ast.ByteOffset,
    tok_i: TokenIndex,

 ...
}
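The same layout can be sketched in Python as two parallel lists, one of token tags and one of byte offsets, with the parser holding only an index into them. The token grammar below is a toy invented for the example and has nothing to do with Zig's or Roc's real token sets.

```python
import re

# Toy grammar: identifiers, integers, and runs of operator characters.
TOKEN = re.compile(r"\s*(?:(?P<IDENT>[A-Za-z_]\w*)|(?P<NUMBER>\d+)|(?P<OP>[=+\-*/|>]+))")


def tokenize(source):
    """Lex `source` into parallel lists: token tags and token start offsets."""
    tags, starts = [], []
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            if source[pos:].strip():
                raise SyntaxError(f"unexpected character at offset {pos}")
            break  # only trailing whitespace is left
        tags.append(match.lastgroup)
        starts.append(match.start(match.lastgroup))
        pos = match.end()
    return tags, starts


tags, starts = tokenize("x = y + 1")
print(tags)    # ['IDENT', 'OP', 'IDENT', 'OP', 'NUMBER']
print(starts)  # [0, 2, 4, 6, 8]
```

Keeping the tags in their own compact list means a parser that mostly inspects tags walks a small, contiguous array, which is relevant to the cache-miss question raised earlier.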

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:48):

sure, I just want to make sure we're considering the potential risks as well as the potential benefits :big_smile:

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:48):

certainly it's possible to lex+parse quickly overall, but that doesn't mean it would necessarily be faster in our case - might be slower, even if still fast in the grand scheme of things

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:48):

Yep, for sure

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:50):

so the rules change seems reasonable to me, and given that, the implementation seems worth a try!

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:50):

I'd just want to keep an eye on error message quality and performance, make sure they're both still good afterwards

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:50):

Yeah makes sense

view this post on Zulip Richard Feldman (Feb 23 2023 at 03:51):

thanks for writing it up and talking it through! :smiley:

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:53):

FWIW, I think there will be some immediate wins in error messages - since right now accidentally putting an expr in the middle of a sequence of defs screws up the parse for the rest of the defs.

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:57):

And on the perf side, we can actually implement these rules _without_ doing the separate lexer adjustment

view this post on Zulip Joshua Warner (Feb 23 2023 at 03:58):

And also, these rules make it easier to recover parsing after an error by looking for the next statement/expression in the nearest surrounding block (emitting a malformed statement / expression in the syntax tree). That means compilation can continue despite the broken code, and you can still run tests as long as they don't touch that code.
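A small sketch of that recovery strategy, assuming the block has already been split into items by indentation. `parse_item` here is a toy stand-in for a real item parser, and the `"malformed"` node shape is hypothetical.

```python
def parse_item(text):
    """Toy item parser: rejects unbalanced parentheses, otherwise tags the item."""
    if text.count("(") != text.count(")"):
        raise SyntaxError("unbalanced parentheses")
    kind = "def" if " = " in text else "expr"
    return (kind, text)


def parse_block(items):
    """Parse every item in a block. A failing item becomes a malformed node and
    parsing resumes at the next item instead of aborting the whole block."""
    nodes = []
    for item in items:
        try:
            nodes.append(parse_item(item))
        except SyntaxError as err:
            nodes.append(("malformed", item, str(err)))
    return nodes


for node in parse_block(["x = foo (1", "y = 2", "x + y"]):
    print(node)
```

Because item boundaries come from indentation alone, the broken first item does not prevent the two items after it from parsing normally.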

view this post on Zulip Kesanov (Feb 26 2023 at 08:54):

Richard Feldman said:

I've historically shied away from it because it makes it more obvious that the language is One Of Those Indentation-Sensitive Languages, but maybe we should just embrace that even though it doesn't (currently) come up as often as it does in, say, Python or CoffeeScript

What's bad about being in the indentation sensitive language camp anyway?

view this post on Zulip Anton (Feb 26 2023 at 09:11):

Indentation sensitivity has rarely bothered me but there are some issues that can come up:

view this post on Zulip dank (Feb 26 2023 at 11:27):

Anton said:

Indentation sensitivity has rarely bothered me but there are some issues that can come up:

isn't it true to say though that structural editors can basically eliminate this whole class of problems?

view this post on Zulip dank (Feb 26 2023 at 11:28):

so that in roc building upon the roc editor we wouldn't have that much of an issue introducing this as a syntax constraint

view this post on Zulip Anton (Feb 26 2023 at 12:23):

Yes indeed, I don't expect indentation to create any problems there.

view this post on Zulip Kiryl Dziamura (Jun 02 2024 at 11:26):

The await bang brought an additional point to get it back on track.

What’s the expected amount of work? Can we split it? I don't mind starting work on it. However, the current parsing implementation has not yet settled in my head.

view this post on Zulip Kiryl Dziamura (Jun 04 2024 at 08:13):

The reason I'm bringing it up again is outlined here and it's also kind of a blocker for this problem

With indentation-sensitive parsing, the problem will resolve itself

view this post on Zulip Richard Feldman (Jun 04 2024 at 10:45):

Kiryl Dziamura said:

The await bang brought an additional point to get it back on track.

What’s the expected amount of work? Can we split it? I don't mind to start working on it. However, the current parsing implementation has not yet settled in my head.

I suspect it’s a pretty big project, although I haven’t really thought about how it would work to modify the current parser to do it

view this post on Zulip Richard Feldman (Jun 04 2024 at 10:46):

as opposed to doing it as part of a larger change to a non-parser-combinator design

view this post on Zulip Joshua Warner (Jul 10 2024 at 03:40):

I have things _mostly_ working, in this WIP diff: https://github.com/roc-lang/roc/pull/6809

view this post on Zulip Joshua Warner (Jul 10 2024 at 03:41):

The last thing (I think / I hope) is squashing some bugs / assessing some changes in test_reporting

view this post on Zulip Joshua Warner (Jul 10 2024 at 03:41):

Here's a representative example:

Snapshot: dbg_without_final_expression
Source: crates/compiler/load/tests/test_reporting.rs:5740
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Expression: golden
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-old snapshot
+new results
────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
    0       │-── INDENT ENDS AFTER EXPRESSION in tmp/dbg_without_final_expression/Test.roc ───
          0 │+── MISSING FINAL EXPRESSION in tmp/dbg_without_final_expression/Test.roc ───────
    1     1 │
    2       │-I am partway through parsing a dbg statement, but I got stuck here:
          2 │+I am partway through parsing a definition, but I got stuck here:
    3     3 │
          4 │+1│  app "test" provides [main] to "./platform"
          5 │+2│
          6 │+3│  main =
    4     7 │ 4│      dbg 42
    5     8 │               ^
    6     9 │
    7       │-I was expecting a final expression, like so
         10 │+This definition is missing a final expression. A nested definition
         11 │+must be followed by either another definition, or an expression
    8    12 │
    9       │-    dbg 42
   10       │-    "done"
         13 │+    x = 4
         14 │+    y = 2
         15 │+
         16 │+    x + y

view this post on Zulip Joshua Warner (Jul 10 2024 at 03:44):

The TL;DR there is that dbg parsing now only handles the dbg 42 part itself and then pops back up to a higher level to parse the rest of the block, so the natural error message is just a generic "definition missing final expr" error rather than something specific to dbg.

view this post on Zulip Joshua Warner (Jul 10 2024 at 03:45):

Should be possible to recover that original dbg-specific error; it just needs some attention on each of the failures

view this post on Zulip Joshua Warner (Jul 10 2024 at 03:45):

If someone would be interested in helping out, assessing each of these test failures is probably fairly parallelizable

view this post on Zulip Luke Boswell (Jul 10 2024 at 03:56):

I'm interested, I just want to finish a couple of other PR's I'm working on.

Hopefully we can have a new release of basic-ssg that supports Windows. It's been slow progress finding a compatible set of deps that Rust is happy with, and also finishing the removal of rebuilding the host from roc.

view this post on Zulip Kiryl Dziamura (Jul 10 2024 at 05:20):

Would be happy to help too

view this post on Zulip Joshua Warner (Jul 27 2024 at 21:25):

PR is green & ready for review: https://github.com/roc-lang/roc/pull/6809
:partying_face:


Last updated: Jun 16 2026 at 16:19 UTC