Stream: compiler development

Topic: zig compiler - indent handling


view this post on Zulip Joshua Warner (Feb 02 2025 at 19:38):

Wanted to pop this github thread out to zulip: https://github.com/roc-lang/roc/pull/7569#discussion_r1938529685
@Anthony Bullard / @Brendan Hansknecht

The approach used in the current tokenizer PR is to preserve exact indent info and let the parser use that however it wants to figure out nesting.

The more-commonly-used alternative (e.g. in Python) would be to generate indent and dedent tokens. I believe this approach is viable for the current roc grammar, but it does force us to be a little bit more picky about indentation. When developing the parser that went along with this tokenizer, I found that trying to use only indent/dedent tokens resulted in very picky indentation that I found frustrating to get working.

Thoughts?
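
For reference, a minimal Python sketch of the indent/dedent alternative, assuming space-only indentation. The function name `layout_tokens` and the token kinds are hypothetical and not taken from the PR; the indent stack mirrors the general approach Python's tokenizer uses.

```python
def layout_tokens(source):
    """Yield (kind, value) pairs, tracking indentation with a stack of widths."""
    stack = [0]  # indentation widths of the enclosing blocks
    for raw in source.splitlines():
        if not raw.strip():
            continue  # blank lines never change the indent level
        width = len(raw) - len(raw.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT", width)
        while width < stack[-1]:
            stack.pop()
            yield ("DEDENT", width)
        if width != stack[-1]:
            # The line dedented to a width that no enclosing block uses.
            yield ("ERROR", "dedent does not match any outer level")
        yield ("LINE", raw.strip())
    while len(stack) > 1:  # close any blocks still open at end of input
        stack.pop()
        yield ("DEDENT", 0)
```

The `ERROR` branch is where the pickiness shows up: every dedent has to land exactly on an enclosing level.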

view this post on Zulip Richard Feldman (Feb 02 2025 at 19:41):

I like it being more permissive - fits with the goal of having it be more error-tolerant

view this post on Zulip Richard Feldman (Feb 02 2025 at 19:42):

so yeah, sounds like indent/dedent tokens aren't the way to go

view this post on Zulip Brendan Hansknecht (Feb 02 2025 at 19:42):

I have no thoughts besides making sure we document what we chose and why. I was just surprised that we are only counting spaces or tabs, that didn't feel like enough (but I don't work on the parser).

view this post on Zulip Joshua Warner (Feb 02 2025 at 19:45):

We're counting both spaces and tabs in the current PR (keeping the counts separate). That's just enough information to be agnostic to whatever the editor configuration for tab indent width is, and still parse correctly.

view this post on Zulip Brendan Hansknecht (Feb 02 2025 at 19:46):

But like is space space tab space the same as tab space space space?

view this post on Zulip Joshua Warner (Feb 02 2025 at 19:46):

Technically, if we wanted to be very particular, we could record the exact indent string that's used. That would let us catch cases where the first line uses spaces then tabs and the second line uses tabs then spaces. However, it doesn't feel important to me to reject that case in particular.

view this post on Zulip Joshua Warner (Feb 02 2025 at 19:47):

Ha ha yeah exactly

view this post on Zulip Joshua Warner (Feb 02 2025 at 19:47):

It is not perfectly the same, but would lead to the same indent, given typical editor configurations.

view this post on Zulip Joshua Warner (Feb 02 2025 at 19:47):

A user might be forgiven for mistaking those as the same

view this post on Zulip Brendan Hansknecht (Feb 02 2025 at 19:48):

I guess it would depend on how many spaces a tab is for how it appears in a given editor and if it looks like the same indent level or not.

view this post on Zulip Joshua Warner (Feb 02 2025 at 19:55):

Yeah, if an editor is using tabs to align to e.g. 8 char intervals, then those aren't identical.

view this post on Zulip Brendan Hansknecht (Feb 02 2025 at 19:56):

Yeah, but I guess I don't know what roc is supposed to do with that. So just counting tabs and spaces is probably fine.

view this post on Zulip Joshua Warner (Feb 02 2025 at 19:56):

By that you mean tabs+spaces?

view this post on Zulip Joshua Warner (Feb 02 2025 at 19:56):

(as a single number)

view this post on Zulip Richard Feldman (Feb 02 2025 at 19:56):

Joshua Warner said:

It is not perfectly the same, but would lead to the same indent, given typical editor configurations.

hm is that true? I think that's only true if the editor is replacing tabs with some fixed number of spaces, as opposed to doing tab stops

view this post on Zulip Joshua Warner (Feb 02 2025 at 19:57):

Err yeah for that example you're right

view this post on Zulip Brendan Hansknecht (Feb 02 2025 at 19:59):

By that you mean tabs+spaces?
(as a single number)

No, sorry, I meant doing what you are currently doing in the PR with two different counts. That sounds reasonable cause we don't know what the user's tab width is. So we can't really do better.

There is a chance that space space space tab is 2 indents in the user's editor (tab rounds to 2 spaces). As such they see space space tab space as 2.5 indents and maybe accidentally use it as 3 indents, but I'm not sure how roc would derive any of that info.

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:00):

For any case where you have a sequence of ((tab|space{n})*space*) (forgive my relaxed regular-expression syntax), keeping the count of tabs and the count of spaces is sufficient to infer the correct indent level, regardless of whether the user has an improper tabs-versus-spaces setting.
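
A minimal Python sketch of the two-count idea, under stated assumptions: the names `Indent`, `measure` and `compare` are hypothetical, and the comparison rule (one indent is deeper only if neither count decreases) is one plausible reading of the scheme, not a description of what the PR does.

```python
from typing import NamedTuple


class Indent(NamedTuple):
    tabs: int
    spaces: int


def measure(line: str) -> Indent:
    """Count leading tabs and spaces separately, ignoring their order."""
    tabs = spaces = 0
    for ch in line:
        if ch == "\t":
            tabs += 1
        elif ch == " ":
            spaces += 1
        else:
            break
    return Indent(tabs, spaces)


def compare(prev: Indent, cur: Indent) -> str:
    """Relate cur to prev without assuming any particular tab width."""
    if cur == prev:
        return "same"
    if cur.tabs >= prev.tabs and cur.spaces >= prev.spaces:
        return "deeper"
    if cur.tabs <= prev.tabs and cur.spaces <= prev.spaces:
        return "shallower"
    return "ambiguous"  # e.g. 4 spaces vs 2 tabs: depends on tab width
```

Under this sketch, space space tab space and tab space space space both measure as one tab and three spaces, so they compare as `"same"`, while an all-spaces line followed by an all-tabs line comes out `"ambiguous"`.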

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:03):

The conservative alternative here is going back to what the current parser does: only allowing spaces. That avoids all of these problems.

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:04):

The expressive alternative is to track the exact order of tabs and spaces and only allow indents that have the previous line's indent string as a prefix
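
A minimal Python sketch of that prefix rule. The helper names are hypothetical, and treating a dedent as "the new indent is a prefix of the previous one" is an assumption added for symmetry; a real implementation would presumably check against the enclosing block's indent string instead.

```python
def leading_whitespace(line: str) -> str:
    """Return the exact run of tabs and spaces at the start of the line."""
    return line[: len(line) - len(line.lstrip(" \t"))]


def relate(prev_line: str, line: str) -> str:
    """Classify line's indentation relative to prev_line by exact indent string."""
    prev, cur = leading_whitespace(prev_line), leading_whitespace(line)
    if cur == prev:
        return "same"
    if cur.startswith(prev):
        return "indent"  # previous indent string is a prefix of this one
    if prev.startswith(cur):
        return "dedent"
    return "error"  # neither string extends the other, e.g. "  \t" vs "\t  "
```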

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:05):

I think any other alternative here necessarily involves tracking the count of spaces and tabs. Given that, for example, we could give an error if the user is not consistently using either tabs or spaces with no mixing.

view this post on Zulip Brendan Hansknecht (Feb 02 2025 at 20:06):

Just out of curiosity, what do we do if one line is all spaces and the next is all tabs? (This accidentally happens to me sometimes when I edit a file before saving it as a .roc file.) I think the editor defaults to tabs, but my roc config and roc in general use spaces. So I sometimes have one line with 4 spaces and then the next with 2 tabs. It really irks me to fix this, and last time I checked, the formatter fails to do so.

view this post on Zulip Richard Feldman (Feb 02 2025 at 20:07):

fwiw I'd like to report a warning for using anything other than tabs for indentation

view this post on Zulip Richard Feldman (Feb 02 2025 at 20:07):

so all of this is just for trying to recover if the user makes a mistake

view this post on Zulip Richard Feldman (Feb 02 2025 at 20:08):

(including having the formatter translate all of these things into the appropriate number of tabs for you)

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:10):

what do we do if one line is all spaces and the next is all tabs

The parser would report an error, since we can't disambiguate that.

view this post on Zulip Brendan Hansknecht (Feb 02 2025 at 20:11):

:cry:

view this post on Zulip Luke Boswell (Feb 02 2025 at 20:12):

Are we switching to tabs? There was a lot of enthusiasm for using tabs for indentation :grinning:

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:13):

FWIW this scheme comes directly from python

view this post on Zulip Brendan Hansknecht (Feb 02 2025 at 20:15):

I think it is probably a totally reasonable scheme. It just doesn't solve my essentially only issue with tabs/spaces that I currently hit (though still pretty rare).

view this post on Zulip Richard Feldman (Feb 02 2025 at 20:27):

I don't think that's deterministically solvable, is it?

view this post on Zulip Richard Feldman (Feb 02 2025 at 20:27):

like how could the compiler infer what indentation level you intended without knowing what tab width you have configured?

view this post on Zulip Richard Feldman (Feb 02 2025 at 20:28):

oh I forgot to mention earlier - I think tabs being used for anything other than indentation should also be a warning (and the formatter should change them to spaces)

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:28):

If we really wanted to be fancy, we could retry the parse assuming different values of tab width, and see which parse makes the most sense

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:30):

@Brendan Hansknecht can you give a concrete example of the issue you typically hit?

view this post on Zulip Richard Feldman (Feb 02 2025 at 20:34):

Richard Feldman said:

oh I forgot to mention earlier - I think tabs being used for anything other than indentation should also be a warning (and the formatter should change them to spaces)

hrm, I just realized we'd have to make an exception for this inside doc comments in which you have code blocks.

view this post on Zulip Richard Feldman (Feb 02 2025 at 20:36):

also, those could be really annoying to edit actually, because a number of editors (most notably Cursor, but Zed is about to start doing this too) are binding the Tab button to mean "accept LLM-generated suggestion"

view this post on Zulip Richard Feldman (Feb 02 2025 at 20:37):

it's fine outside a doc comment bc the editor generally understands where indentation should happen and inserts indents for you, but inside doc comment code blocks they usually don't realize you're in code mode and they don't indent for you

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:37):

I wouldn’t expect an editor to reliably insert tab characters in that position?

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:38):

We could always make the formatter parse markdown code blocks inside doc comments and format those.

view this post on Zulip Richard Feldman (Feb 02 2025 at 20:38):

that would be good! :+1:

view this post on Zulip Richard Feldman (Feb 02 2025 at 20:39):

also maybe we can get the editor grammars able to understand that aspect of markdown and do the indentation

view this post on Zulip Richard Feldman (Feb 02 2025 at 20:39):

I wonder how Go deals with that :thinking:

view this post on Zulip Brendan Hansknecht (Feb 02 2025 at 20:39):

Joshua Warner said:

Brendan Hansknecht can you give a concrete example of the issue you typically hit?

Here is what normally happens. Copy some code from github that is a repro or something. Paste it into my editor. Edit the file before saving (leads to inserting tabs instead of spaces). Save and now have some lines with tabs and some with spaces.

Just did a small edit of basic cli hello world and got this:

app [main!] { pf: platform "../platform/main.roc" }

# To run this example: check the README.md in this folder

import pf.Stdout

main! = \_args ->
    x = "Hello, World!"
    Stdout.line! x

    if Str.countUtf8Bytes x > 10 then
        Stdout.line! "large"
    else
        Stdout.line! "small"

The original indented line from hello world (Stdout.line! x) is now the only line indented with spaces. With large code snippets, this becomes a mess of tabs and spaces and I have to manually do a bunch of character replacing.

view this post on Zulip Brendan Hansknecht (Feb 02 2025 at 20:40):

This happens cause I edit the file before saving it. Once I save it with a .roc extension, the editor figures it out.

view this post on Zulip Brendan Hansknecht (Feb 02 2025 at 20:41):

But at that point the damage is done.

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:41):

This sounds like the right approach would be to detect there’s a mix of spaces and tabs, issue a warning, and retry parsing with different tab width configs

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:42):

Then pick the one with the fewest parse errors/warnings
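
A toy Python sketch of the retry idea, with loud assumptions: `indent_errors` is a stand-in for a real parse (it only checks that indents follow a block-opening line and that dedents land on an enclosing level), and `BLOCK_OPENERS`, the candidate widths, and the sample's tab/space mix are all made up for illustration.

```python
BLOCK_OPENERS = ("->", "then", "else", "=")  # hypothetical, far from complete


def indent_errors(source: str, tab_width: int) -> int:
    """Count indentation problems when tabs are expanded to tab_width."""
    stack = [0]
    errors = 0
    prev_opens_block = False
    for raw in source.splitlines():
        if not raw.strip():
            continue
        line = raw.expandtabs(tab_width)
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            if not prev_opens_block:
                errors += 1  # indented without anything introducing a block
            stack.append(width)
        else:
            while stack[-1] > width:
                stack.pop()
            if stack[-1] != width:
                errors += 1  # dedent to a level no enclosing block uses
                stack.append(width)
        prev_opens_block = line.rstrip().endswith(BLOCK_OPENERS)
    return errors


def guess_tab_width(source: str, candidates=(2, 4, 8)) -> int:
    """Pick the candidate tab width that yields the fewest indentation errors."""
    return min(candidates, key=lambda w: indent_errors(source, w))


SAMPLE = "\n".join([
    "main! = \\_args ->",
    '\tx = "Hello, World!"',
    "    Stdout.line! x",  # the one line still indented with spaces
    "",
    "\tif Str.countUtf8Bytes x > 10 then",
    '\t\tStdout.line! "large"',
    "\telse",
    '\t\tStdout.line! "small"',
])
```

With this sample, `guess_tab_width(SAMPLE)` settles on a width of 4, since that is the only candidate where the space-indented line lines up with its tab-indented neighbours.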

view this post on Zulip Richard Feldman (Feb 02 2025 at 20:43):

https://forum.cursor.com/t/ambiguity-with-the-tab-key-in-cursor/7663/12?utm_source=chatgpt.com

ah, apparently a common workaround for indentation specifically is to use an "increase indentation level" keybinding

view this post on Zulip Richard Feldman (Feb 02 2025 at 20:44):

I don't think that would work in doc comments though

view this post on Zulip Richard Feldman (Feb 02 2025 at 20:45):

it would just indent the comment :sweat_smile:

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:45):

I would tend to uncomment, edit normally, then recomment

view this post on Zulip Anthony Bullard (Feb 02 2025 at 20:49):

Sorry everyone - had a number of Lunar New Years parties this weekend and have mostly been out.

view this post on Zulip Anthony Bullard (Feb 02 2025 at 20:51):

@Joshua Warner I think that indent ambiguity is a really messy thing for both the parser and the user to deal with. I personally think the consistency we could apply with INDENT and DEDENT tokens far outweighs the permissiveness of the alternative plan

view this post on Zulip Anthony Bullard (Feb 02 2025 at 20:52):

But I think regardless of what we decide to do with the specifics there, I think the current design of the token output is hard to work with.

view this post on Zulip Anthony Bullard (Feb 02 2025 at 20:53):

Specifically, we have an SoA over the tokens, but in the same struct there's a fourth array for lines, and it's not exactly straightforward how the parser is supposed to interact with it (its length has no relation to the rest)

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:54):

It’s actually not bad to work with. Logic for that is kept in a very thin layer around the tokenizer, and most of the parser doesn’t have to care.

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:54):

That layer tracks a current line index, incrementing that as appropriate in the parser's advance method
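
A rough Python sketch of what such a layer could look like. Everything here is hypothetical (`TokenCursor`, the two offset arrays, the field names); the actual PR is in Zig and may organise this differently. The point is only that the lines array can have a different length from the token arrays and still be kept in step inside `advance`.

```python
class TokenCursor:
    """Walks a token array while keeping a separate line index in sync."""

    def __init__(self, token_offsets, line_starts):
        self.token_offsets = token_offsets  # byte offset of each token
        self.line_starts = line_starts      # byte offset where each line begins
        self.pos = 0                        # index of the current token
        self.line = 0                       # index of the line containing it
        self._sync_line()

    def _sync_line(self):
        if self.pos >= len(self.token_offsets):
            return  # past the last token; leave the line index where it is
        offset = self.token_offsets[self.pos]
        # Move forward while the next line starts at or before this token.
        while (self.line + 1 < len(self.line_starts)
               and self.line_starts[self.line + 1] <= offset):
            self.line += 1

    def advance(self):
        self.pos += 1
        self._sync_line()
```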

view this post on Zulip Anthony Bullard (Feb 02 2025 at 20:55):

Maybe I should stop what I'm working on because it sounds like you have already designed a parser around this tokenizer

view this post on Zulip Anthony Bullard (Feb 02 2025 at 20:56):

And the "thin layer" that you are speaking of doesn't jump off the page to me at all

view this post on Zulip Anthony Bullard (Feb 02 2025 at 20:57):

But I've only had about an hour or two and I've spent most of that exploring some different designs of the AST

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:57):

This is literally the tokenizer from the parser I linked in the other thread, translated to zig. So yeah there is a parser that goes with. But I wasn’t thinking of reusing that wholesale - just select parts.

view this post on Zulip Anthony Bullard (Feb 02 2025 at 20:58):

Oh, the one with the linear tree structure?

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:58):

Yeah

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:58):

I spent quite a lot of time trying out the indent and dedent token idea, and it was quite fragile

view this post on Zulip Anthony Bullard (Feb 02 2025 at 20:58):

Ok, I thought that was more of a thought experiment since we seemed to be aiming for simplicity

view this post on Zulip Joshua Warner (Feb 02 2025 at 20:59):

Yeah, the linear tree parts of that I want to throw away

view this post on Zulip Anthony Bullard (Feb 02 2025 at 20:59):

That's interesting, usually with RD with a WSS language INDENT/DEDENT tokens are very straightforward

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:00):

But maybe it's different with Roc's philosophy of maximal flexibility in the parser

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:00):

The problem with indent/dedent in roc is there’s a long list of tokens that might introduce a block. Python only allows : to do that.

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:02):

Don’t get me wrong; indent/dedent can be made to work. But it gets messy.

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:02):

I think there would be two main helpers for that, say parse_block (parses newline-delimited statements at the same level of indentation), and parse_collection (parses a comma-delimited sequence that can dedent).

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:03):

Ok, I'd love to see the difference between the two as I can't imagine the other being simpler (but I'm probably just a dolt).

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:05):

You have to be careful because you don’t want the tokenizer to insert indent tokens in the middle of an expression

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:06):

You also kinda want indent tokens inside parens (so that can allow stmts), but you also… don’t, because folks are used to eg python where indentation is relaxed inside parens

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:06):

Why not? In Python:

some_func(
    a,
    b,
)

Would be an ID, PAREN_O, INDENT, ID, COMMA, NEWLINE, ID, COMMA, DEDENT, PAREN_C, no? Don't see why it wouldn't be the same in Roc

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:07):

(I don't remember the exact token names)
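
For comparison, the standard library's `tokenize` module can show what CPython emits for this snippet. Inside brackets it reports line breaks as NL tokens and produces no INDENT/DEDENT, which lines up with the "relaxed inside parens" behaviour mentioned above. The helper name `token_names` is just for this sketch.

```python
import io
import tokenize

SOURCE = "some_func(\n    a,\n    b,\n)\n"
BLOCK = "if x:\n    y\n"


def token_names(source: str) -> list:
    """Return the token type names CPython's tokenizer produces for source."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type]
            for tok in tokenize.generate_tokens(readline)]


if __name__ == "__main__":
    print(token_names(SOURCE))  # bracketed call: NL tokens, no INDENT/DEDENT
    print(token_names(BLOCK))   # real block: INDENT and DEDENT appear
```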

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:07):

What if b is dedented there? I’d like that to parse properly too since it’s completely unambiguous

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:07):

Similarly, maybe a is dedented but b is indented.

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:08):

Hmm....I would just expect us to push a parse error and fast forward to the closing paren and move on

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:09):

That’s a valid answer, yeah. But I want to be more forgiving than that.

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:10):

I’d like to have fewer annoyances than an indentation sensitive lang like python, not more

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:10):

Couldn't you just say in a collection "use INDENT and DEDENT to track current, and ensure that if we encounter a newline that the indentation is the same as the start?"

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:10):

And an apply as well?

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:10):

I don’t quite follow

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:11):

Ok, let me try an example

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:13):

foo( # start_indent=0, current_indent=0
    a, # INDENT = current_indent = current_indent + 1 (1)
b, # DEDENT = current_indent = current_indent - 1 (0)
) # NEWLINE doesn't change indent

This can parse because the apply (an args container), was satisfied because we ended with the current_indent the same as start_indent
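
A small Python sketch of that check, assuming the tokenizer hands over INDENT/DEDENT as plain state-change tokens. The token names follow the example above; `collection_is_balanced` is a hypothetical helper.

```python
def collection_is_balanced(tokens) -> bool:
    """True if the indent level at the closing paren equals the level at the start."""
    start_indent = current_indent = 0
    depth = 0
    for tok in tokens:
        if tok == "INDENT":
            current_indent += 1
        elif tok == "DEDENT":
            current_indent -= 1
        elif tok == "PAREN_O":
            depth += 1
        elif tok == "PAREN_C":
            depth -= 1
            if depth == 0:
                # The collection is satisfied only if we are back where we began.
                return current_indent == start_indent
    return False  # never saw the matching closing paren
```

For the `foo(` example, the stream ID, PAREN_O, INDENT, ID, COMMA, DEDENT, ID, COMMA, NEWLINE, PAREN_C passes, because the DEDENT before `b` cancels the INDENT before `a`.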

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:15):

Or you just throw away the INDENTs and DEDENTs inside the container completely, and only monitor indent at the statement level

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:17):

But at some point I feel like you have to make the whitespace significant in a whitespace significant language.

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:17):

Hmm so I was definitely assuming indent/dedent would properly nest with other braces. I don’t think that would work with your design?

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:18):

In Python INDENT/DEDENT are braces - because it's a militantly WSS language

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:18):

But they don't have to be, they can just be state-change tokens

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:18):

Hmm

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:19):

At that level, I think these two representations are approximately equivalent

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:19):

Or we can take a page from Lua and require blocks have an explicitly textual end delimiter. (Don't kill me, I know I suggested this before)

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:20):

Ehhhh

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:20):

Yeah, it just pushes the logic for "what's an indent or dedent or just a newline?" to the tokenizer

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:21):

No, that logic has to cross both the parser and tokenizer. The parser has to know to ignore some of these indent/dedents where it doesn’t care about them.

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:22):

Let me ask you this, if

foo(
    a,
b,
)

Should parse, what about:

when foo is
    Ok -> something
Err(e) ->
something_else = func(a, b)
something_else

?

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:22):

That should not parse

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:23):

And doesn’t, currently

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:23):

Joshua Warner said:

No, that logic has to cross both the parser and tokenizer. The parser has to know to ignore some of these indent/dedents where it doesn’t care about them.

Yes, but it doesn't have to decide what is an indent/dedent in terms of raw character bytes (1 space, 4 spaces, tab, etc)

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:24):

Joshua Warner said:

That should not parse

But why not? It's just as unambiguous to me. Or is it because the former has an explicit bounding pair around it?

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:24):

It’s only unambiguous if the input ends there

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:25):

And then

when foo is
    Ok -> something
Err(e) ->
something_else = func(a, b)
something_else
end

would be unambiguous if that syntax existed?

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:25):

Yeah but it doesn’t

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:25):

Cool, on the same page

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:26):

I'm trying to think of how we would describe the grammar of Roc, in say EBNF

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:26):

That’s a gnarly road to go down

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:26):

As opposed to thinking about the logic we'll implement in Zig

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:26):

But maybe it's just me, but I want things to be consistent

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:27):

Roc is not well designed to have a simple-ish ebnf

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:27):

I think that's what people like about C-syntax languages

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:27):

I don’t disagree with you

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:27):

Is that a block is _always_ in squirlies

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:27):

But that’s not what roc is right now

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:28):

(I would be open to changing that - and TBH that’s a direction I like, but that seems like a separate convo)

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:28):

What I'm saying is that "Whitespace matters for a when expression, but not for a multi-line list or application" feels inconsistent to me

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:29):

I want to have a parser that’s not annoying to work with as a user

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:29):

It feels forgiving yes, but inconsistent

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:29):

We can issue a warning if we want

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:29):

Oh, so that's why it seems like you are saying to get rid of WSS :-)

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:29):

Or at least suggesting it

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:29):

Because it is fiddly

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:29):

No. Only relaxing it where it’s unambiguous

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:30):

I’m just describing how the current parser/language work. I want to keep that same behavior.

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:32):

That makes sense, but this new compiler is probably the best time to write down the principles we want for the parser and kind of move away from the past. That's kind of what we are doing in every other phase (within reason).

Like, what would you change if the starting point for the grammar was properly formatted code from the current compiler?

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:33):

And then say "what should we be more forgiving on from this point?"

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:33):

You obviously don't want to require the user to write perfectly formatted code - humans aren't structural editors

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:34):

Roc is not well designed to have a clean and simple grammar

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:34):

But what version of that grammar parses very fast, is very consistent, and not a pain to work with?

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:35):

Languages with significant white space are simply not encodable in context free grammars. You always need an adapter layer, and that can get ugly.

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:36):

If we want a language that has a simple grammar, that’ll require making some significant changes.

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:37):

Yes, I know. That's why when I created my language I did one trick to make it neither require braces, nor be WSS: All blocks _must_ end in an expression. But that requires a specific design of language that is not like Roc today and definitely not Roc of tomorrow

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:38):

Roc used to be like that

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:38):

It could technically be written in a single line and parse correctly (until I added that all blocks end in an expression followed by a newline)

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:38):

Presumably you need substitute line separators? Or were expressions always self-delimiting?

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:39):

Yeah, we seem to be much closer to a Ruby-style syntax

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:41):

Joshua Warner said:

Presumably you need substitute line separators? Or were expressions always self-delimiting?

Every expression was self-delimiting. There were only literals for Strings, Numbers, Lists, Tuples, Records, Lambdas and then Tags and a Match expression. Lambdas delimited with {} (only the anonymous kind, not top-level), and match required branches to be introduced with "|" so the last branch would delimit the whole expression

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:41):

Aha yep makes sense

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:41):

Cool

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:42):

It wasn't the most efficient, but parsed reasonably fast for being written in F#

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:42):

Hand rolled the RD parser at first, and then moved to FParsec which was 95% API compatible

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:43):

I’d like to keep the set of syntax changes we’re making with this rewrite as minimal as possible

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:43):

(Thank you Scott Wlaschin of F# For Fun And Profit Fame for giving me inspiration)

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:43):

Yeah I'd like to make zero

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:44):

So that leads us back to the initial discussion

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:44):

Cool, sounds like we’re aligned then

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:44):

I think I'll just wait and see what the skeleton of the parser you put up looks like on top of your tokenizer

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:45):

You are the parser guru, so I'll do my best to row the boat in the same direction

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:45):

Just wanted to say my piece and learn a few things along the way

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:46):

I appreciate the back pressure

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:46):

You’re asking reasonable questions

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:46):

As a young padawan, I still think of a parser as "enforcer of a described, mechanical grammar"

view this post on Zulip Joshua Warner (Feb 02 2025 at 21:46):

Haha

view this post on Zulip Anthony Bullard (Feb 02 2025 at 21:46):

One that reliably generates AST Nodes when the user is being faithful to that grammar

view this post on Zulip Joshua Warner (Feb 03 2025 at 05:16):

@Anthony Bullard Added a skeleton parser to https://github.com/roc-lang/roc/pull/7569, that shows the intended way of interoperating with the lines

view this post on Zulip Joshua Warner (Feb 03 2025 at 05:17):

:thinking: Now that I think about this tho... I wonder if I could just have a Newline token, and subvert either the offset or the length field to sneak in the indent level...

view this post on Zulip Joshua Warner (Feb 03 2025 at 05:17):

That may be a bit easier to integrate, and have lower cognitive overhead on the parser side
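
A Python sketch of how a Newline token might carry the indent in a field it otherwise doesn't need. The layout is purely an assumption: the 16/16 bit split for tabs and spaces, the `Token` shape, and all helper names are invented here and are not taken from the PR.

```python
from dataclasses import dataclass

NEWLINE = "Newline"


@dataclass(frozen=True)
class Token:
    tag: str
    offset: int
    length: int  # for NEWLINE tokens, reused to hold the packed indent


def pack_indent(tabs: int, spaces: int) -> int:
    """Pack both counts into one integer (hypothetical 16/16 bit split)."""
    if not (0 <= tabs < 1 << 16 and 0 <= spaces < 1 << 16):
        raise ValueError("indent too large to pack")
    return (tabs << 16) | spaces


def unpack_indent(packed: int):
    return packed >> 16, packed & 0xFFFF


def newline_token(offset: int, tabs: int, spaces: int) -> Token:
    return Token(NEWLINE, offset, pack_indent(tabs, spaces))


def indent_of(token: Token):
    """Read (tabs, spaces) back out of a NEWLINE token."""
    if token.tag != NEWLINE:
        raise ValueError("only Newline tokens carry indent info")
    return unpack_indent(token.length)
```

The parser then only needs to look at the most recent Newline token to know the current line's indent, with no side array to keep in step.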

view this post on Zulip Joshua Warner (Feb 03 2025 at 05:57):

Yep that seems cleaner; pushed that

view this post on Zulip Anthony Bullard (Feb 03 2025 at 11:01):

Sweet that sounds like a good compromise

view this post on Zulip Anton (Feb 03 2025 at 13:40):

Anthony Bullard said:

But maybe it's different with Roc's philosophy of maximal flexibility in the parser

Just a comment: I believe the value of parser flexibility has dropped significantly. Beginners were likely to produce strange code, but beginners with LLMs are not. It is my understanding that flexibility comes with significant complexity.

view this post on Zulip Richard Feldman (Feb 03 2025 at 15:46):

I think the main value is actually when you're temporarily in a strange editor state, e.g. because you've copy/pasted something in from somewhere else and the indentation is different

view this post on Zulip Anton (Feb 03 2025 at 16:23):

because you've copy/pasted something in from somewhere else and the indentation is different

Can you explain more? Would you want the parser to be able to fully parse everything with mixed indentation?

view this post on Zulip Brendan Hansknecht (Feb 03 2025 at 16:43):

Richard Feldman said:

I think the main value is actually when you're temporarily in a strange editor state, e.g. because you've copy/pasted something in from somewhere else and the indentation is different

Except roc doesn't fix this with the current plans. That is exactly the case I mentioned above that breaks and is really hard to fix deterministically. It requires having to guess the tab width.

view this post on Zulip Anthony Bullard (Feb 03 2025 at 16:54):

But that’s just WSS

view this post on Zulip Anthony Bullard (Feb 03 2025 at 16:56):

That’s only solved by only accepting tabs for indentation (or some other more draconian measure like “only four spaces”)

view this post on Zulip Brendan Hansknecht (Feb 03 2025 at 17:02):

WSS?

view this post on Zulip Anthony Bullard (Feb 03 2025 at 17:04):

White space significance

view this post on Zulip Anthony Bullard (Feb 03 2025 at 17:05):

If you have 100% bounded expressions and statements this isn’t an issue

view this post on Zulip Anthony Bullard (Feb 03 2025 at 17:05):

Otherwise you need a deterministic indentation to act like “braces”

view this post on Zulip Brendan Hansknecht (Feb 03 2025 at 17:10):

Oh sure, but it is really frustrating nevertheless.

Theoretically roc could try and figure this out or at least the formatter could convert tabs to spaces or vice versa (even with broken code) such that I can notice why the file is broken.

view this post on Zulip Brendan Hansknecht (Feb 03 2025 at 17:11):

Just a very frustrating user experience that is hard to debug. Given everything lines up in my editor, roc could theoretically figure this out by guessing tab widths.

view this post on Zulip Brendan Hansknecht (Feb 03 2025 at 17:11):

Not saying it is worth doing, but I don't think it is unsolvable or innate as an edge case

view this post on Zulip Anton (Feb 03 2025 at 17:29):

I would be fine with a bad indentation error. In an editor I would then use the "llm autofix error" button (zed already has this I believe). I bet llms are great at fixing indentation.

view this post on Zulip Brendan Hansknecht (Feb 03 2025 at 18:03):

I'm not a fan of that solution cause I don't use llms generally (and I'm sure many others are in this boat). So it really isn't a fix in my mind. Also, it isn't hard to manually fix, just the kind of thing where being permissive would be amazing.

view this post on Zulip Brendan Hansknecht (Feb 03 2025 at 19:25):

Note: I totally also recognize that it may be infeasible/unreasonable to implement.

view this post on Zulip Brendan Hansknecht (Feb 03 2025 at 19:25):

So the best we may get is a good error message

view this post on Zulip kris (Feb 03 2025 at 20:10):

Was there ever a discussion about WSS? especially with regards to the parencomma restructuring. i could not find anything

view this post on Zulip Brendan Hansknecht (Feb 03 2025 at 20:13):

Nope

view this post on Zulip Brendan Hansknecht (Feb 03 2025 at 20:14):

You could start a #ideas thread, but I would not expect it to see traction.

view this post on Zulip Anthony Bullard (Feb 03 2025 at 20:46):

I’ve started such a thread in the past and it did not go well :rolling_on_the_floor_laughing:

view this post on Zulip Joshua Warner (Feb 04 2025 at 02:08):

Brendan Hansknecht said:

Note: I totally also recognize that it may be infeasible/unreasonable to implement.

I think it is actually within reason to implement. Or at least improve the situation.


Last updated: Jul 06 2025 at 12:14 UTC