So I have been looking into porting the Elm parser... I feel like I have a good handle on how the library works in general and I have a few suggestions for how to do the Roc implementation.
But one thing puzzles me a bit, so if anyone here has familiarity with the Elm parser I would gladly take some input. In the Advanced module there is a type

```elm
Bag c x = Empty | AddRight (Bag c x) (DeadEnd c x) | Append (Bag c x) (Bag c x)
```

which is used in the `Bad` branches of the parsers (when the parser fails). In the end one maps this to a list of `DeadEnd` to give lots of info when e.g. constructing the error messages.
But I don't quite understand why it is implemented this way. Why not just use a list (or a stack I guess) in the first place? Maybe there are some Elm specific performance considerations I am missing here?
Hi Johan, the Elm parser is highly optimized for performance. So you're right: the `Bag` is modeled like this because of performance considerations. (This is just my understanding...) In the function `oneOfHelp` we concatenate two lists. This is an expensive operation, so Evan decided to first create a binary tree of the lists which have to be concatenated (using the `Append` variant), and then create the final list from right to left.
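If I remember the library correctly, that right-to-left flattening looks roughly like this (a sketch from memory, not necessarily the exact elm/parser source):

```elm
-- Sketch: flatten a Bag into a List of DeadEnds, right to left.
-- The accumulator `list` grows only by prepending (::), so no list
-- concatenation ever happens while flattening.
bagToList : Bag c x -> List (DeadEnd c x) -> List (DeadEnd c x)
bagToList bag list =
    case bag of
        Empty ->
            list

        AddRight bag1 x ->
            bagToList bag1 (x :: list)

        Append bag1 bag2 ->
            -- flatten the right subtree first, then the left,
            -- so the final list comes out in order
            bagToList bag1 (bagToList bag2 list)
```

So the tree defers all the work until one final pass, and that pass only ever uses cheap prepends.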
I think the `AddRight` variant could be changed to something like `Single (DeadEnd c x)`, because the first element of `AddRight` is always `Empty`. This seems to be a leftover from a previous implementation.
I don't know whether something like this would be necessary in Roc. (I haven't touched Roc for a very long time.)
Thanks! Yes I figured something like that was going on.
My intuition says that it would be better to just build the list directly in Roc, since lists there are "just" arrays. But I don't really know; I am still learning how to get the best performance in Roc.
in general, in Roc if you have the choice between a `List` or a recursive tag union, the `List` approach will almost always be way faster
it's kind of the classic "linked list versus array" situation
Which is great, considering that it is much easier to reach for `List` than to construct a recursive tag union or a linked list :D
also this list is almost certainly unique, which means it can be updated in place (and with our current strategies, there will usually be free capacity in the list, making that push really cheap)
Cool. Is that also true for concatenating lists? Or does that require moving "memory" around more?
same idea. so long as there is space, it should be pretty cheap. and usually there will be space
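To make that concrete, a flat-`List` version of the error accumulation might look like this (a sketch with made-up names and a simplified `DeadEnd`, not code from the actual port):

```roc
# Hypothetical simplified DeadEnd; the real one would carry context etc.
DeadEnd : { offset : U64, problem : Str }

# Sketch: push one error onto a flat List instead of building an
# Elm-style Bag tree. When `deadEnds` is unique and has spare
# capacity, List.append updates it in place.
addDeadEnd : List DeadEnd, DeadEnd -> List DeadEnd
addDeadEnd = \deadEnds, deadEnd ->
    List.append deadEnds deadEnd

# Concatenation is the same story: cheap when the left list is
# unique and already has room for the extra elements.
mergeDeadEnds : List DeadEnd, List DeadEnd -> List DeadEnd
mergeDeadEnds = \left, right ->
    List.concat left right
```

The tree indirection that `Bag` exists to provide in Elm would then just not be needed.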
@Johan Lövgren have you done anything for the parser port yet? I'd love to look at it if you've got something. I'm trying to do something similar for the interpreter I'm trying to build, but really struggling to make a parser work without higher-kinded types, like in the other languages where I've used parser combinators.
Nick Hallstrom said:
Johan Lövgren have you done anything for the parser port yet? I'd love to look at it if you've got something. I'm trying to do something similar for the interpreter I'm trying to build, but really struggling at making a parser work without higher kinded types, like the other languages I've used parser combinators in.
Yes I have got something cooking... It's still a work in progress but you are free to take a look:
https://github.com/Subtlesplendor/roc-parser
To make it work for arbitrary inputs I had to make some simplifications from the Elm parser. For example, the Elm parser keeps track of row and col while parsing the input string, but that does not really make sense for arbitrary input. So right now it just keeps track of the offset (where the cursor is)
Then if one wants to parse e.g. strings, one has to implement a way to find row and col when one generates an error message (which is the only time they are used anyway)
I think this improves performance of the parser on success, but worsens it on failure (if you want a nice error message)
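For example, recovering row and col from the offset only when an error message is rendered might look like this (a sketch, assuming 1-based positions and `\n` line endings; not code from the actual package):

```roc
# Sketch: compute { row, col } for a byte offset, only at
# error-reporting time, by walking the input up to the offset.
rowCol : List U8, U64 -> { row : U64, col : U64 }
rowCol = \input, offset ->
    input
    |> List.takeFirst offset
    |> List.walk { row: 1, col: 1 } \pos, byte ->
        if byte == '\n' then
            { row: pos.row + 1, col: 1 }
        else
            { row: pos.row, col: pos.col + 1 }
```

This is O(offset) per error, which is exactly the success-fast / failure-slow tradeoff described above.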
And by arbitrary input I mean `List a` :sweat_smile:
I did have the thought that with higher kinded types you might be able to write such a Parser for everything that is "indexable" (not sure what that would be in Haskell)
Cool. I think in many situations it's the right tradeoff to have good performance on success but not care too much about it on failure because it's for debug.
If you use `oneOf` and the first one fails, does it skip the second pass to generate the nice error? Probably don't need it, since you're going to try the next parser anyway, which might succeed.
Yes, the idea is that the second pass is only done when the error message is generated. So it won't happen until the full parser has finished running
I'm interested to know what the plan is for this package. It looks really good so far. I'm wondering if it would make sense to use this in the Json package? Specifically, I was looking into parsing floats today, and I feel like it would be good to use some combinators to handle the valid patterns, e.g. before passing the string into `Str.toDec`. Most of the other decoders could possibly benefit too. If we used a Parser package, then I assume Json could no longer live in the builtins? I guess the alternative is to implement the Json parsers more manually, which may be better for optimising performance in the future.
This may be a terrible idea; I was skim-reading a paper and it included a survey of different algorithms to parse floats fast. I assume the Roc std lib would be the best place to do this kind of thing, but I'm not sure of the best way to interface that with a parser. The current implementation parses a string that looks like a valid float and then attempts to convert it to a Dec or F32 etc. The alternative would be to implement a fast algorithm in Roc, going directly from `List U8` to Dec etc.
I think the parsing should be a builtin. There are just a lot of subtle edge cases with floats, and performance is important
so on the parser side, you would need to be able to tokenize a float
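That tokenizing step might be as simple as splitting off the longest float-ish run of bytes and leaving the actual numeric conversion to a builtin. A rough sketch (made-up names, deliberately ignoring edge cases like repeated dots or misplaced signs, which is exactly why the real conversion should stay a builtin):

```roc
# Does this byte plausibly belong to a float literal?
isFloatByte : U8 -> Bool
isFloatByte = \b ->
    (b >= '0' && b <= '9')
    || b == '.' || b == '-' || b == '+'
    || b == 'e' || b == 'E'

# Sketch: split off the longest float-ish prefix; something like
# Str.toDec would then do the real (edge-case-aware) conversion.
tokenizeFloat : List U8 -> { token : List U8, rest : List U8 }
tokenizeFloat = \bytes ->
    len = List.walkUntil bytes 0 \count, b ->
        if isFloatByte b then Continue (count + 1) else Break count

    { token: List.takeFirst bytes len, rest: List.dropFirst bytes len }
```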
I took a break from developing for a few days because of a compiler error I did not understand. But now Ayaz has explained it to me so I can continue working. Up next I want to implement some examples and tests to root out any inevitable off-by-1 errors.
Folkert de Vries said:
so on the parser side, you would need to be able to tokenize a float
Would this be implemented in Roc or lower level? (I guess the alternative is in Zig?)
So... I started thinking about this again a bit, as I am working through Crafting interpreters but decided to write the interpreter in Roc instead of Java. Since there is a bit of parsing going on, I thought I could pick this project up again.
I am thinking whether it would be a good idea to pass the source, i.e. the input which to parse, to the parser as a module parameter. Currently the source is passed around to the different parsers via the "state" the parser is operating on. But because the Elm parser never consumes the input string, it just moves a cursor along it, the input is always the same. So it is more like a "global context" for the whole parser, and in my mind does not need to be passed around explicitly.
I was wondering if anyone had any opinions on this, if it sounds reasonable, if there are any performance implications?
So one might use the parser like this:

```roc
input = File.readUtf8! path

import Parser.Utf8 { input } as Parser

Parser.run (Parser.digit |> Parser.sepBy whitespace)
```

(i.e. when running the parser one does not pass it the input; it just uses the input you passed in when importing the module)
I definitely wouldn't use module params for this, but I guess you technically could
I would make the cursor and the reference string into a single structure.
Also, the most efficient approach will likely be to avoid the cursor altogether and just drop bytes off the beginning of the string.
Assuming you never rewind the cursor, I expect pattern matching on the list for bytes and just keeping around `.. as rest` to be super powerful.
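Concretely, consuming input that way might look like this (a sketch with made-up names):

```roc
# Sketch: consume one digit by pattern matching on the byte list.
# `.. as rest` binds the remaining bytes as a cheap slice of the
# same underlying allocation.
parseDigit : List U8 -> Result { digit : U8, rest : List U8 } [NotADigit]
parseDigit = \bytes ->
    when bytes is
        [b, .. as rest] if b >= '0' && b <= '9' ->
            Ok { digit: b - '0', rest }

        _ ->
            Err NotADigit
```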
This is how I did it for example: https://github.com/bhansconnect/monkey-roc/blob/main/src/Lexer.roc
But I didn't use a fancier parser
I also split lexing and parsing just for the fun of it. Probably they should be merged.
For this case yes I also split lexing and parsing
Brendan Hansknecht said:
Also, most efficient likely will be to avoid the cursor all together and just drop off bytes from the beginning of the string.
Hmm, so there is no real cost to dropping bytes from the start?
If you don't mutate, it is essentially free
Generates a seamless slice
So each drop is essentially two integer changes: one to increment the pointer offset, one to shorten the slice length
Last updated: Jul 06 2025 at 12:14 UTC