Stream: contributing

Topic: Elm parser port


view this post on Zulip Johan Lövgren (Mar 12 2023 at 12:39):

So I have been looking into porting the Elm parser... I feel like I have a good handle on how the library works in general and I have a few suggestions for how to do the Roc implementation.

view this post on Zulip Johan Lövgren (Mar 12 2023 at 12:43):

But one thing puzzles me a bit, so if anyone here has familiarity with the Elm parser I would gladly take some input. In the Advanced module there is a type
Bag c x = Empty | AddRight (Bag c x) (DeadEnd c x) | Append (Bag c x) (Bag c x)
which is used in the Bad branches of the parsers (when the parser fails). In the end one maps this to a list of DeadEnd to give lots of info when e.g. constructing the error messages.

view this post on Zulip Johan Lövgren (Mar 12 2023 at 12:44):

But I don't quite understand why it is implemented this way. Why not just use a list (or a stack I guess) in the first place? Maybe there are some Elm specific performance considerations I am missing here?

view this post on Zulip Pit Capitain (Mar 12 2023 at 13:10):

Hi Johan, the Elm parser is highly optimized for performance. So you're right: the Bag is modeled like this because of performance considerations. (This is just my understanding...) In function oneOfHelp we concatenate two lists. This is an expensive operation, so Evan decided to first create a binary tree of the lists which have to be concatenated (using the Append variant), and then create the final list from right to left.

view this post on Zulip Pit Capitain (Mar 12 2023 at 13:14):

I think the AddRight variant could be changed to something like Single (DeadEnd c x), because the first element of AddRight is always Empty. This seems to be a leftover from a previous implementation.

view this post on Zulip Pit Capitain (Mar 12 2023 at 13:16):

I doin't know whether something like this would be necessary in Roc. (I didn't touch Roc for a very long time.)

view this post on Zulip Johan Lövgren (Mar 12 2023 at 13:27):

Thanks! Yes I figured something like that was going on.

view this post on Zulip Johan Lövgren (Mar 12 2023 at 13:29):

My intuition says that it would be better to just build the list directly in Roc since they are "just" arrays. But I don't really know, I am still learning about the how to get the best performance in Roc.

view this post on Zulip Richard Feldman (Mar 12 2023 at 19:07):

in general, in Roc if you have the choice between a List or a recursive tag union, the List approach will almost always be way faster

view this post on Zulip Richard Feldman (Mar 12 2023 at 19:07):

it's kind of the classic "linked list versus array" situation

view this post on Zulip Johan Lövgren (Mar 12 2023 at 19:31):

Which is great, considering that it much more easy to reach for List than constructing a recursive tag union or a linked list :D

view this post on Zulip Folkert de Vries (Mar 12 2023 at 19:39):

also this list is almost certainly unique, which means it can be updated in place (and with our current strategies, there will usually be free capacity in the list, making that push really cheap)

view this post on Zulip Johan Lövgren (Mar 12 2023 at 19:44):

Cool. Is that also true for concatenating lists? Or does that require moving "memory" around more?

view this post on Zulip Folkert de Vries (Mar 12 2023 at 20:21):

same idea. so long as there is space, it should be pretty cheap. and usually there will be space

view this post on Zulip Nick Hallstrom (Mar 14 2023 at 01:02):

@Johan Lövgren have you done anything for the parser port yet? I'd love to look at it if you've got something. I'm trying to do something similar for the interpreter I'm trying to build, but really struggling at making a parser work without higher kinded types, like the other languages I've used parser combinators in.

view this post on Zulip Johan Lövgren (Mar 14 2023 at 07:51):

Nick Hallstrom said:

Johan Lövgren have you done anything for the parser port yet? I'd love to look at it if you've got something. I'm trying to do something similar for the interpreter I'm trying to build, but really struggling at making a parser work without higher kinded types, like the other languages I've used parser combinators in.

Yes I have got something cooking... It's still a work in progress but you are free to take a look:
https://github.com/Subtlesplendor/roc-parser

view this post on Zulip Johan Lövgren (Mar 14 2023 at 07:55):

To make it work for arbitrary inputs I had to make some simplifications from the Elm parser. For example, the Elm parser keeps track of row and col while parsing the input string, but that does not really make sense for arbitrary input. So right now it just keeps track of the offset (where the cursor is)

view this post on Zulip Johan Lövgren (Mar 14 2023 at 07:59):

Then if one wants to parse e.g. strings, one has to implement a way to find row and col when one generates an error message (which is the only time they are used anyway)

view this post on Zulip Johan Lövgren (Mar 14 2023 at 08:08):

I think this improves performance of the parser on success, but worsens it on failure (if you want a nice error message)

view this post on Zulip Johan Lövgren (Mar 14 2023 at 08:42):

And by arbitrary input I mean List a :sweat_smile:
I did have the thought that with higher kinded types you might be able to write such a Parser for everything that is "indexable" (not sure what that would be in Haskell)

view this post on Zulip Brian Carroll (Mar 14 2023 at 19:13):

Cool. I think in many situations it's the right tradeoff to have good performance on success but not care too much about it on failure because it's for debug.

view this post on Zulip Brian Carroll (Mar 14 2023 at 19:15):

If you use oneOf and the first one fails, does it skip the second pass to generate the nice error? Probably don't need it since you're going to try the next parser anyway, which might succeed.

view this post on Zulip Johan Lövgren (Mar 14 2023 at 19:46):

Yes the idea is that the second pass through is done first when the error message is generated. So it won't do it until the full parser is finished running

view this post on Zulip Luke Boswell (Mar 21 2023 at 10:29):

I'm interested to know what the plan is for this package. It looks really good so far. I'm wondering if it would make sense to use this in the Json package? Specifically I was looking into parsing floats today and I feel like it would be good to use some combinators to handle the valid patterns e.g. before passing the string into Str.toDec. Most of the other decoders also could benefit too possibly. If we used a Parser package, then I assume Json could no longer live in the builtins? I guess the alternative is to implement the Json parsers more manually which may be better to optimise for performance in future.

view this post on Zulip Luke Boswell (Mar 21 2023 at 10:35):

This may be a terrible idea, I was skim reading a paper and it included a survey of different algorithms to parse floats fast. I assume the roc std lib would be the best place to do this kind of thing, but not sure the best way to interface that with a parser. The current implementation parses a string that looks like a valid float and then attempts to convert it to a Dec or F32 etc. The alternative would be to implement a fast algorithm in Roc, going directly from List U8 to Dec etc.

view this post on Zulip Folkert de Vries (Mar 21 2023 at 10:58):

I think the parsing should be a builtin. There are just a lot of subtle edge cases with floats, and performance is important

view this post on Zulip Folkert de Vries (Mar 21 2023 at 10:58):

so on the parser side, you would need to be able to tokenize a float

view this post on Zulip Johan Lövgren (Mar 21 2023 at 14:19):

I took a break from developing for a few days because of a compiler error I did not understand. But now Ayaz has explained it to me so I can continue working. Up next I want to implement some examples and tests to root out any inevitable off-by-1 errors.

view this post on Zulip Johan Lövgren (Mar 21 2023 at 14:20):

Folkert de Vries said:

so on the parser side, you would need to be able to tokenize a float

Would this be implemented in Roc or lower level? (I guess the alternative is in Zig?)

view this post on Zulip Johan Lövgren (Jan 05 2025 at 19:43):

So... I started thinking about this again a bit, as I am working through Crafting interpreters but decided to write the interpreter in Roc instead of Java. Since there is a bit of parsing going on, I thought I could pick this project up again.

view this post on Zulip Johan Lövgren (Jan 05 2025 at 19:45):

I am thinking whether it would be a good idea to pass the source, i.e. the input which to parse, to the parser as a module parameter. Currently the source is passed around to the different parsers via the "state" the parser is operating on. But because the Elm parser never consumes the input string, it just moves a cursor along it, the input is always the same. So it is more like a "global context" for the whole parser, and in my mind does not need to be passed around explicitly.

view this post on Zulip Johan Lövgren (Jan 05 2025 at 19:46):

I was wondering if anyone had any opinions on this, if it sounds reasonable, if there are any performance implications?

view this post on Zulip Johan Lövgren (Jan 05 2025 at 19:49):

So one might use the parser like this:

input = File.readUtf8! path
import Parser.Utf8 { input } as Parser

Parser.run (Parser.digit |> Parser.sepBy whitespace)

view this post on Zulip Johan Lövgren (Jan 05 2025 at 19:50):

(i.e. when running the parser one does not pass it the input, it just uses the input which you passed in when importing the module)

view this post on Zulip Brendan Hansknecht (Jan 05 2025 at 20:19):

I definitely wouldn't use module params for this, but I guess you technically could

view this post on Zulip Brendan Hansknecht (Jan 05 2025 at 20:19):

I would make the cursor and the reference string into a single structure.

view this post on Zulip Brendan Hansknecht (Jan 05 2025 at 20:20):

Also, most efficient likely will be to avoid the cursor all together and just drop off bytes from the beginning of the string.

view this post on Zulip Brendan Hansknecht (Jan 05 2025 at 20:22):

Assuming you never rewind the cursor, I think pattern matching on the list for bytes and just keeping around .. as rest to be super powerful.

view this post on Zulip Brendan Hansknecht (Jan 05 2025 at 20:24):

This is how I did it for example: https://github.com/bhansconnect/monkey-roc/blob/main/src/Lexer.roc

view this post on Zulip Brendan Hansknecht (Jan 05 2025 at 20:25):

But I didn't use a fancier parser

view this post on Zulip Brendan Hansknecht (Jan 05 2025 at 20:25):

I also split lexong and parsing just for the fun of it. Probably should be merged.

view this post on Zulip Johan Lövgren (Jan 05 2025 at 20:36):

For this case yes I also split lexing and parsing

view this post on Zulip Johan Lövgren (Jan 05 2025 at 20:37):

Brendan Hansknecht said:

Also, most efficient likely will be to avoid the cursor all together and just drop off bytes from the beginning of the string.

Hmm, so there is no real cost to dropping bytes from the start?

view this post on Zulip Brendan Hansknecht (Jan 05 2025 at 23:33):

If you don't mutate it is essentially free

view this post on Zulip Brendan Hansknecht (Jan 05 2025 at 23:33):

Generates a seamless slice

view this post on Zulip Brendan Hansknecht (Jan 05 2025 at 23:34):

So each drop essentially is two integer changes. One to increment the pointer offset. One to shorten the slice length


Last updated: Jul 06 2025 at 12:14 UTC