Crazy test_syntax snapshots and their value · compiler development

Anthony Bullard (Jan 13 2025 at 15:41):

8
("""""""")f:C
U

8
(
"""
"""
    "")
    f : C
U

8

(
"""
"""
    "")
    f : C
U

See that extra newline? The reason for it is complicated and has to do with ParensAround not existing in Patterns and the fact that we first parse annotation headers as an expr and translate to Pattern. But it really only impacts this case.

But what is the value of having a snapshot that tests the formatting behavior of an illegal pattern? Should we have some way to say "This thing we parsed actually doesn't make any sense, so we expect the formatter to fail here and shouldn't test this"? I think such a change would largely impact the fuzzer, since it's the thing that introduced these snapshots in the first place.

Anthony Bullard (Jan 13 2025 at 15:44):

My large point is we have a lot of illegal or invalid Roc syntax in test_syntax snapshots. I think there's value in us being able to parse them, but I think we should be more aggressive in making these not part of the Parse->Format->Reformat cycle - perhaps by making them Malformed sooner?

Anthony Bullard (Jan 13 2025 at 15:45):

Lastly, I really think that having test_syntax actually document invariants of what is actual, valid Roc syntax is just so much higher value.

Anthony Bullard (Jan 13 2025 at 15:46):

And if we can't find a way to turn a fuzzer failure into a real, valid piece of Roc syntax - we should be doing something in either the fuzzer or the parser to ensure that it is marked as Malformed and therefore the fuzzer will discard such an input in the future without generating noise.

Anthony Bullard (Jan 13 2025 at 15:49):

(
"""
"""
    "")
    f : C

Be a TypeAnnotation(TypeHeader(Apply(Malformed, Ident("f"))), Tag("C")) and then have fuzzer bail out at that point
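
A minimal sketch of the "bail out on Malformed" idea, in Python rather than the compiler's own code. The `Node` class and the function names are hypothetical stand-ins for illustration, not the real AST or fuzzer API:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical AST node: a kind such as "TypeAnnotation" or "Malformed".
    kind: str
    children: list["Node"] = field(default_factory=list)


def contains_malformed(node: Node) -> bool:
    """Return True if this node or any descendant is a Malformed node."""
    if node.kind == "Malformed":
        return True
    return any(contains_malformed(child) for child in node.children)


def should_check_formatting(ast: Node) -> bool:
    """Only assert formatting properties on input with no Malformed nodes."""
    return not contains_malformed(ast)


# Tree mirroring the shape above:
# TypeAnnotation(TypeHeader(Apply(Malformed, Ident("f"))), Tag("C"))
example = Node("TypeAnnotation", [
    Node("TypeHeader", [Node("Apply", [Node("Malformed"), Node("Ident")])]),
    Node("Tag"),
])
```

With a check like this, the fuzzer would still exercise the parser on the input, but would skip the Parse->Format->Reformat assertions for it.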

Anthony Bullard (Jan 13 2025 at 15:50):

Anthony Bullard (Jan 13 2025 at 15:54):

I guess I'll end my mini-rant on a positive note and give a potential vision of what these snapshots could be:

That will then help us identify clearly what the style is and the principles we use when maintaining and extending it with new syntax.

Anthony Bullard (Jan 13 2025 at 15:57):

8
"""
"""(
    "",
)(
    f,
) : C
U

8
"""
"""("")(f) : C
U

Anthony Bullard (Jan 13 2025 at 16:08):

Just one more thing before I go to work (and I may move this to #ideas): should this really just be an md document (or documents) with code blocks appropriately annotated? Then it could really be what I envision above - a programmatically checked guide to valid Roc syntax and the canonical formatted style.
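
One way such a checked guide could work, as a rough Python sketch. The `roc formatted` annotation on the code blocks, the function names, and the toy formatter are all assumptions made up for illustration; a real check would call the actual Roc formatter:

```python
import re

FENCE = "`" * 3  # built this way so the sketch can sit inside a fenced block

# Matches blocks opened with a hypothetical info string such as "roc formatted".
BLOCK_RE = re.compile(
    FENCE + r"roc[ \t]+(?P<tag>\w+)\n(?P<body>.*?)" + FENCE,
    re.DOTALL,
)


def extract_examples(markdown: str) -> list[tuple[str, str]]:
    """Return (tag, source) pairs for every annotated Roc block in the guide."""
    return [(m.group("tag"), m.group("body")) for m in BLOCK_RE.finditer(markdown)]


def check_guide(markdown: str, fmt) -> list[str]:
    """Return the sources of 'formatted' blocks that fmt would change."""
    return [
        src
        for tag, src in extract_examples(markdown)
        if tag == "formatted" and fmt(src) != src
    ]


def toy_fmt(src: str) -> str:
    """Stand-in formatter: strips trailing whitespace on each line."""
    return "".join(line.rstrip() + "\n" for line in src.splitlines())


guide = (
    "## Function application\n\n"
    + FENCE + "roc formatted\nfunc(arg)\n" + FENCE + "\n\n"
    + FENCE + "roc formatted\nTag(arg)   \n" + FENCE + "\n"
)

# The second block has trailing spaces, so it is reported as not canonical.
failures = check_guide(guide, toy_fmt)
```

Run in CI, a non-empty `failures` list would fail the build, which is what would keep the guide in sync with the compiler.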

Anton (Jan 13 2025 at 16:12):

Anthony Bullard (Jan 13 2025 at 16:15):

We could even tag the document sections and find a way to link Syntax problems with the relevant section(s) and output it with the report

Anthony Bullard (Jan 13 2025 at 16:22):

Someone does something crazy like the above and they get a nice syntax error report like:

Syntax Error @ main.roc 12:2-12:8 -----------------------------------------

("""""""") f: C
 ^----- | The problem is here

It looks like you are trying to perform function application on a string literal, but that
is not valid Roc.  Here's some tips:

Usually, you would apply an Identifier or a Tag, like this:

func(arg)
# Or
Tag(arg)

Both of these would format exactly the same as above.

To get more tips on syntax for function application, use `roc syntax apply`.

--------------------------------------------------------------------------------------

Anthony Bullard (Jan 13 2025 at 16:28):

Where roc syntax could be a new subcommand in the CLI to allow the user to browse or search the syntax guide

Anthony Bullard (Jan 13 2025 at 16:32):

And since this is checked in CI on every commit - this guide would always be correct for that version of the compiler.

Sam Mohr (Jan 13 2025 at 19:47):

Sam Mohr (Jan 13 2025 at 19:48):

I think a generated file would suffice here, something that's reasonably legible for humans and definitely for computers

Anthony Bullard (Jan 13 2025 at 19:50):

What I'm talking about is a systematic inventory of valid syntax that can be tested and verified, and that can replace at least the _vast_ majority of test_fmt snapshots. (They would act as snapshots)

Sam Mohr (Jan 13 2025 at 19:50):

Anthony Bullard (Jan 13 2025 at 19:50):

Anthony Bullard (Jan 13 2025 at 19:51):

And since they would be user-facing inside of documentation, there would be context around each example, and they would be valid Roc code, not a bunch of randomly generated nonsense

Sam Mohr (Jan 13 2025 at 19:51):

I agree that the snapshots are moving further from being a set of unit-test-like valid examples of Roc code and closer to Eldritch horrors that get cleaned up and saved to keep us from crashing when we see them

Anthony Bullard (Jan 13 2025 at 19:51):

And this is NOT saying that the fuzzer does not have value. But the way it's currently being used is painful

Anthony Bullard (Jan 13 2025 at 19:52):

Sam Mohr (Jan 13 2025 at 19:53):

Anthony Bullard (Jan 13 2025 at 19:53):

(I know the fuzzer technically is, but I mean generated from a specification, not from a small corpus and otherwise random text)

Anthony Bullard (Jan 13 2025 at 19:55):

I think that Rust reference is a good place to start. But I'd really like to show both the canonical form of each bit of syntax as well as the "most terrible way to type this and still have it parse right"

Anthony Bullard (Jan 13 2025 at 19:55):

Sam Mohr (Jan 13 2025 at 19:56):

That is a noble goal, but I don't know how achievable it is to make something that is a good reference resource AND good for testing

Sam Mohr (Jan 13 2025 at 19:56):

Anthony Bullard (Jan 13 2025 at 19:57):

I have a design that would make it possible to do both at once. It's called "the documentation links the valid code samples into the reference"

Anthony Bullard (Jan 13 2025 at 19:58):

And if we do make a CLI subcommand for it you could have a --extended flag or something and see everything that matches the search term

Luke Boswell (Jan 13 2025 at 21:46):

I feel like we could just delete any snapshots that are not helpful when making a change. They're not sacred or anything. Is this the core of your issue? Trying to save or fix snapshots that are super random and strange?

We've been on a mission to get fuzz clean.... parsing and canonicalisation of all the things and not crashing.

Returning Malformed for something really strange sounds like a good strategy to me.

I'm concerned about changing the current setup dramatically, Josh has used it to good effect finding and smashing a lot of bugs.

Anthony Bullard (Jan 13 2025 at 21:48):

Anthony Bullard (Jan 13 2025 at 21:49):

So if we just make things malformed earlier (and / or canonicalize them) I think it would be better

Joshua Warner (Jan 13 2025 at 21:57):

Agree a lot of them are on the funky side. Not sure I agree they aren't legitimate bugs.

Joshua Warner (Jan 13 2025 at 21:57):

Joshua Warner (Jan 13 2025 at 21:58):

I want to provide a 100% guarantee that using the formatter is "safe" - i.e. it won't change the meaning of your code or change again once formatted again, etc.
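
Those two properties can be written down as checks. A minimal Python sketch, using a toy `parse` and `fmt` (hypothetical stand-ins, not the Roc parser or formatter) just to show the shape of the assertions:

```python
def parse(src: str) -> list[str]:
    """Toy 'AST': the whitespace-separated tokens of the input."""
    return src.split()


def fmt(src: str) -> str:
    """Toy formatter: one space between tokens, plus a trailing newline."""
    return " ".join(parse(src)) + "\n"


def check_formatter_is_safe(src: str) -> None:
    once = fmt(src)
    # Property 1: formatting does not change what the code parses to.
    assert parse(once) == parse(src), "formatting changed the AST"
    # Property 2: formatting already-formatted code is a no-op.
    assert fmt(once) == once, "formatter is not idempotent"


check_formatter_is_safe("f   :  C\n\n")
```

A fuzzer would call something like `check_formatter_is_safe` on each generated input that parses successfully.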

Sam Mohr (Jan 13 2025 at 21:59):

I will say, not that we'll feel the benefit for the next month or so, but the roc_can rewrite aims to never crash for this stuff. The parser might crash, but there will be literally zero unwraps or expects in the new canonicalization code

Sam Mohr (Jan 13 2025 at 22:00):

So if we're putting a lot of effort into fixing current roc_can, that may not be necessary

Joshua Warner (Jan 13 2025 at 22:00):

Joshua Warner (Jan 13 2025 at 22:01):

Sam Mohr (Jan 13 2025 at 22:01):

Joshua Warner (Jan 13 2025 at 22:01):

Anyway to finish my earlier thought, I want to provide that 100% guarantee, but I'd be open to alternative ways of accomplishing that

Joshua Warner (Jan 13 2025 at 22:02):

For example, we could do things like detect some of these more niche cases and just refuse to format in that case (maybe that's what you're getting at)

Joshua Warner (Jan 13 2025 at 22:03):

Ideally, that only introduces a "local" problem, so if you have one tiny problem in a giant file, most of the file can still be formatted properly, and it's only the top-level def with the problem that is copied verbatim from the input
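
A small Python sketch of that "local" fallback, assuming the file has already been split into top-level defs. `CannotFormat`, `format_def`, and the rule it uses to refuse are hypothetical, chosen only to illustrate the control flow:

```python
class CannotFormat(Exception):
    """Raised by the toy formatter when a def hits a case it refuses to handle."""


def format_def(src: str) -> str:
    # Toy rule standing in for "detect a niche case": refuse anything that
    # contains a multiline string delimiter.
    if '"""' in src:
        raise CannotFormat(src)
    return " ".join(src.split())


def format_file(defs: list[str]) -> list[str]:
    """Format each top-level def independently; on failure keep it verbatim."""
    out = []
    for d in defs:
        try:
            out.append(format_def(d))
        except CannotFormat:
            out.append(d)  # copied verbatim from the input
    return out


defs = ["a   =  1", '("""""""")f:C', "b =    2"]
formatted = format_file(defs)
```

Only the middle def is left untouched here; the defs on either side of it are still formatted.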

Sam Mohr (Jan 13 2025 at 22:03):

Anthony Bullard (Jan 13 2025 at 22:08):

Yes it will be safe and only introduce a local issue where the illegal syntax does not get formatted

Joshua Warner (Jan 13 2025 at 22:09):

Anthony Bullard (Jan 13 2025 at 22:09):

Joshua Warner (Jan 13 2025 at 22:09):

Anthony Bullard (Jan 13 2025 at 22:09):

Anthony Bullard (Jan 13 2025 at 22:10):

Richard Feldman (Jan 13 2025 at 22:10):

Joshua Warner (Jan 13 2025 at 22:10):

Just because the minimal example that currently hits this case is silly, doesn't mean all such examples that hit this case are silly

Sam Mohr (Jan 13 2025 at 22:10):

Anthony Bullard (Jan 13 2025 at 22:10):

Richard Feldman (Jan 13 2025 at 22:11):

I've thought about an April fools joke announcement of like introducing truthiness or unchecked null or something like that

Anthony Bullard (Jan 13 2025 at 22:11):

A fuzzer bug should be able to be coerced into a real working code sample and still reproduce

Joshua Warner (Jan 13 2025 at 22:11):

Anthony Bullard (Jan 13 2025 at 22:12):

Anthony Bullard (Jan 13 2025 at 22:13):

Joshua Warner (Jan 13 2025 at 22:13):

For your example with multiline strings, I 100% agree applying a function like that is not valid - but take this as an example then:

"""abc""".foo(1)(2)

Joshua Warner (Jan 13 2025 at 22:14):

Anthony Bullard (Jan 13 2025 at 22:15):

Joshua Warner (Jan 13 2025 at 22:15):

Joshua Warner (Jan 13 2025 at 22:16):

Anyway, my point is that I've found it's better to just give in and fix the problem rather than avoiding it

Joshua Warner (Jan 13 2025 at 22:16):

Avoiding it completely ends up with very complicated conditions, or very "blunt" / annoying conditions

Anthony Bullard (Jan 13 2025 at 22:16):

Joshua Warner (Jan 13 2025 at 22:16):

Anthony Bullard (Jan 13 2025 at 22:17):

Joshua Warner (Jan 13 2025 at 22:17):

Anthony Bullard (Jan 13 2025 at 22:17):

Joshua Warner (Jan 13 2025 at 22:17):

Joshua Warner (Jan 13 2025 at 22:18):

I actually have a PR locally to refactor a bit and make that malformed syntax, which right now the fuzzer won't try to assert formatting conditions on

Anthony Bullard (Jan 13 2025 at 22:18):

Yes and if it bails early in can, I think we can kind of punt on it in formatting

Joshua Warner (Jan 13 2025 at 22:18):

Anthony Bullard (Jan 13 2025 at 22:18):

Joshua Warner (Jan 13 2025 at 22:19):

Anthony Bullard (Jan 13 2025 at 22:19):

Somehow we are disagreeing and agreeing at the same time. It’s probably my poor communication

Joshua Warner (Jan 13 2025 at 22:19):

Anthony Bullard (Jan 13 2025 at 22:19):

Joshua Warner (Jan 13 2025 at 22:20):

I've been pushing hard on the angle of "just make it work", since I've been seeing progress there recently

Anthony Bullard (Jan 13 2025 at 22:20):

I just think that we need tests that give context on what they are testing, why we care, and what we want things to look like

Joshua Warner (Jan 13 2025 at 22:20):

Like 2-ish years ago I ran into a period where I got very frustrated with that approach and basically gave up for a while

Anthony Bullard (Jan 13 2025 at 22:21):

I’ve read more gobbledygook fuzzer Roc than real Roc the past two weeks and I think I have PTSD

Joshua Warner (Jan 13 2025 at 22:21):

100% on board with taking tests and changing them to make them more realistic, so long as they're still covering the same conditions

Joshua Warner (Jan 13 2025 at 22:22):

Anthony Bullard (Jan 13 2025 at 22:22):

And then we can use the best of the best in the syntax reference I’m talking about (which could be very selective and part of the tutorial)

Richard Feldman (Jan 13 2025 at 22:25):

I'm extremely excited to see all the progress on fixing these things the fuzzer is turning up, because compiler bugs are one of the biggest things holding Roc back from reaching its potential

Richard Feldman (Jan 13 2025 at 22:25):

and fuzzers that run for a long time without turning up anything give me way more confidence than anything like what we've ever had in the past!

Richard Feldman (Jan 13 2025 at 22:26):

so I really appreciate all your efforts on wading through the gibberish to get us there! :hearts:

Joshua Warner (Jan 13 2025 at 22:27):

FWIW I don't think the fuzzer is covering any of the really "interesting" parts of the compiler yet (say, the solver) - where I'd define "interesting" as "users often hitting compiler crashes in this area"

Joshua Warner (Jan 13 2025 at 22:27):

Anthony Bullard (Jan 13 2025 at 22:29):

I think my open PR merging spaces within spaces will help with some fuzzer crashes

Joshua Warner (Jan 13 2025 at 22:31):

There are some peculiarities of roc syntax that make it particularly hard to parse+format consistently

Joshua Warner (Jan 13 2025 at 22:32):

For example, multiline strings very often cause problems if they're used outside of very specific situations

Joshua Warner (Jan 13 2025 at 22:32):

Like, they're fine if you're just assigning that to a local, but if you try to do anything else with them, that requires a lot of persnickety condition checking in the formatter

Joshua Warner (Jan 13 2025 at 22:33):

Joshua Warner (Jan 13 2025 at 22:39):

With backpassing gone, function types are [almost] the last instance where we have "naked" parens inside a syntax element (i.e. where there's not a starting + finishing delimiter to branch on, so we either have to do excessive backtracking or we have to de-normalize the function type parser in the context of tuple types and tag unions)

Joshua Warner (Jan 13 2025 at 22:39):

Sam Mohr (Jan 13 2025 at 22:40):

If where ... is the last place, is there a way to change how they look to make that not the case?

Joshua Warner (Jan 13 2025 at 22:41):

Function types are still causing problems, so where ... is definitely not the last place, but anyway...

Joshua Warner (Jan 13 2025 at 22:41):

The solution for function types would be to have some sort of "introduction" delimiter

Joshua Warner (Jan 13 2025 at 22:41):

Luke Boswell (Jan 13 2025 at 22:42):

Sam Mohr (Jan 13 2025 at 22:42):

Joshua Warner (Jan 13 2025 at 22:42):

Anthony Bullard (Jan 13 2025 at 22:43):

Joshua Warner (Jan 13 2025 at 22:43):

PNC for types only changes type application, e.g. List(foo) instead of List foo. That's not the issue here.

Anthony Bullard (Jan 13 2025 at 22:43):

Joshua Warner (Jan 13 2025 at 22:43):

Anthony Bullard (Jan 13 2025 at 22:43):

Anthony Bullard (Jan 13 2025 at 22:44):

Joshua Warner (Jan 13 2025 at 22:44):

Sam Mohr (Jan 13 2025 at 22:44):

I suggested that because it would make parsing code for devs and the compiler all very consistent

Sam Mohr (Jan 13 2025 at 22:45):

Joshua Warner (Jan 13 2025 at 22:45):

Anthony Bullard (Jan 13 2025 at 22:48):

Joshua Warner (Jan 13 2025 at 22:48):

For where ..., I think the solution would look something like allowing parens around the ... part, and furthermore _requiring_ cases where there are multiple implements clauses to use that parens syntax, at least if it's in a context where , would separate elements (e.g. in a tuple type)

Sam Mohr (Jan 13 2025 at 22:48):

Joshua Warner (Jan 13 2025 at 22:48):

That would almost never come up in practice, so probably not much of an actual change

Joshua Warner (Jan 13 2025 at 22:49):

Joshua Warner (Jan 13 2025 at 22:50):

That seems even better actually. Not sure why you'd ever want (List a where a implements Foo, List b where b implements Foo) instead of just (List a, List b) where a implements Foo, b implements Foo

Sam Mohr (Jan 13 2025 at 22:52):

The latter is my current thought for what Roc's type syntax would be. That's not a problem, right?

Joshua Warner (Jan 13 2025 at 22:52):

Technically speaking I guess there are very niche cases where that could come up, if there's a list at a higher level

Joshua Warner (Jan 13 2025 at 22:53):

e.g. a tuple of expressions, where one of the expressions is a Defs node with a type annotation

Joshua Warner (Jan 13 2025 at 22:54):

Distinguishing whether that comma means we should parse the next implements clause, or go up and parse the next top-level expr in the tuple is non-trivial

Anthony Bullard (Jan 13 2025 at 22:54):

Joshua Warner (Jan 13 2025 at 22:55):

Anthony Bullard (Jan 13 2025 at 22:55):

Joshua Warner (Jan 13 2025 at 22:56):

(
  a = 1
  b = 2
  foo: List a where a implements Foo,
  bar
)

Joshua Warner (Jan 13 2025 at 22:56):

That's not fully valid syntax, but at the point where we see bar, we don't know that yet

Joshua Warner (Jan 13 2025 at 22:57):

And in particular we don't know whether we should start parsing bar as a type (to be followed by implements) or an expr (i.e. the next element of the tuple).

Joshua Warner (Jan 13 2025 at 22:57):

Anthony Bullard (Jan 13 2025 at 22:59):

Joshua Warner (Jan 13 2025 at 22:59):

Anthony Bullard (Jan 13 2025 at 22:59):

Joshua Warner (Jan 13 2025 at 22:59):

Joshua Warner (Jan 13 2025 at 23:00):

Anthony Bullard (Jan 13 2025 at 23:00):

Joshua Warner (Jan 14 2025 at 15:46):

Here's that PR to introduce a proper TypeVar type (used in TypeHeader), and mark anything that's not a lowercase ident as Malformed in the AST. (Such things would already generate can errors) https://github.com/roc-lang/roc/pull/7511
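
The rule described there can be illustrated with a short Python sketch. This is not the code from the PR; the function name, the tuple representation, and the exact identifier pattern are assumptions for illustration:

```python
import re

# Assumed lexical rule: a lowercase letter followed by letters, digits, or "_".
LOWERCASE_IDENT = re.compile(r"[a-z][A-Za-z0-9_]*\Z")


def classify_type_var(text: str) -> tuple[str, str]:
    """Return ("TypeVar", name) for a lowercase ident, else ("Malformed", text)."""
    if LOWERCASE_IDENT.match(text):
        return ("TypeVar", text)
    return ("Malformed", text)
```

Under this rule a header argument like `a` stays a type variable, while a string literal or an uppercase tag in that position is marked Malformed.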
