So I have been looking into porting the Elm parser... I feel like I have a good handle on how the library works in general and I have a few suggestions for how to do the Roc implementation.
But one thing puzzles me a bit, so if anyone here has familiarity with the Elm parser I would gladly take some input. In the Advanced module there is a type

```elm
Bag c x = Empty | AddRight (Bag c x) (DeadEnd c x) | Append (Bag c x) (Bag c x)
```

which is used in the `Bad` branches of the parsers (when the parser fails). In the end one maps this to a list of `DeadEnd` to give lots of info when e.g. constructing the error messages.
But I don't quite understand why it is implemented this way. Why not just use a list (or a stack I guess) in the first place? Maybe there are some Elm specific performance considerations I am missing here?
Hi Johan, the Elm parser is highly optimized for performance. So you're right: the `Bag` is modeled like this because of performance considerations. (This is just my understanding...) In the function `oneOfHelp` we concatenate two lists. This is an expensive operation, so Evan decided to first create a binary tree of the lists which have to be concatenated (using the `Append` variant), and then create the final list from right to left.
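If I remember the library correctly, that right-to-left flattening looks roughly like this (a sketch from memory, not necessarily the exact elm/parser source):

```elm
-- Sketch: flatten a Bag into a List of DeadEnds, right to left.
-- The accumulator `list` grows only by prepending (::), so no list
-- concatenation ever happens while flattening.
bagToList : Bag c x -> List (DeadEnd c x) -> List (DeadEnd c x)
bagToList bag list =
    case bag of
        Empty ->
            list

        AddRight bag1 x ->
            bagToList bag1 (x :: list)

        Append bag1 bag2 ->
            -- flatten the right subtree first, then the left,
            -- so the final list comes out in order
            bagToList bag1 (bagToList bag2 list)
```

So the tree defers all the work until one final pass, and that pass only ever uses cheap prepends.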
I think the `AddRight` variant could be changed to something like `Single (DeadEnd c x)`, because the first element of `AddRight` is always `Empty`. This seems to be a leftover from a previous implementation.
I don't know whether something like this would be necessary in Roc. (I haven't touched Roc for a very long time.)
Thanks! Yes I figured something like that was going on.
My intuition says that it would be better to just build the list directly in Roc, since lists there are "just" arrays. But I don't really know; I am still learning how to get the best performance in Roc.
in general, in Roc if you have the choice between a `List` or a recursive tag union, the `List` approach will almost always be way faster
it's kind of the classic "linked list versus array" situation
Which is great, considering that it is much easier to reach for `List` than to construct a recursive tag union or a linked list :D
also this list is almost certainly unique, which means it can be updated in place (and with our current strategies, there will usually be free capacity in the list, making that push really cheap)
Cool. Is that also true for concatenating lists? Or does that require moving "memory" around more?
same idea. so long as there is space, it should be pretty cheap. and usually there will be space
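To make that concrete, a flat-`List` version of the error accumulation might look like this (a sketch with made-up names and a simplified `DeadEnd`, not code from the actual port):

```roc
# Hypothetical simplified DeadEnd; the real one would carry context etc.
DeadEnd : { offset : U64, problem : Str }

# Sketch: push one error onto a flat List instead of building an
# Elm-style Bag tree. When `deadEnds` is unique and has spare
# capacity, List.append updates it in place.
addDeadEnd : List DeadEnd, DeadEnd -> List DeadEnd
addDeadEnd = \deadEnds, deadEnd ->
    List.append deadEnds deadEnd

# Concatenation is the same story: cheap when the left list is
# unique and already has room for the extra elements.
mergeDeadEnds : List DeadEnd, List DeadEnd -> List DeadEnd
mergeDeadEnds = \left, right ->
    List.concat left right
```

The tree indirection that `Bag` exists to provide in Elm would then just not be needed.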
@Johan Lövgren have you done anything for the parser port yet? I'd love to look at it if you've got something. I'm trying to do something similar for the interpreter I'm trying to build, but really struggling to make a parser work without higher-kinded types, like in the other languages where I've used parser combinators.
Nick Hallstrom said:
Johan Lövgren have you done anything for the parser port yet? I'd love to look at it if you've got something. I'm trying to do something similar for the interpreter I'm trying to build, but really struggling at making a parser work without higher kinded types, like the other languages I've used parser combinators in.
Yes I have got something cooking... It's still a work in progress but you are free to take a look:
https://github.com/Subtlesplendor/roc-parser
To make it work for arbitrary inputs I had to make some simplifications from the Elm parser. For example, the Elm parser keeps track of row and col while parsing the input string, but that does not really make sense for arbitrary input. So right now it just keeps track of the offset (where the cursor is)
Then if one wants to parse e.g. strings, one has to implement a way to find row and col when one generates an error message (which is the only time they are used anyway)
I think this improves performance of the parser on success, but worsens it on failure (if you want a nice error message)
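For example, recovering row and col from the offset only when an error message is rendered might look like this (a sketch, assuming 1-based positions and `\n` line endings; not code from the actual package):

```roc
# Sketch: compute { row, col } for a byte offset, only at
# error-reporting time, by walking the input up to the offset.
rowCol : List U8, U64 -> { row : U64, col : U64 }
rowCol = \input, offset ->
    input
    |> List.takeFirst offset
    |> List.walk { row: 1, col: 1 } \pos, byte ->
        if byte == '\n' then
            { row: pos.row + 1, col: 1 }
        else
            { row: pos.row, col: pos.col + 1 }
```

This is O(offset) per error, which is exactly the success-fast / failure-slow tradeoff described above.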
And by arbitrary input I mean `List a` :sweat_smile:
I did have the thought that with higher kinded types you might be able to write such a Parser for everything that is "indexable" (not sure what that would be in Haskell)
Cool. I think in many situations it's the right tradeoff to have good performance on success but not care too much about it on failure because it's for debug.
If you use `oneOf` and the first one fails, does it skip the second pass to generate the nice error? Probably don't need it, since you're going to try the next parser anyway, which might succeed.
Yes, the idea is that the second pass is only done when the error message is generated. So it won't happen until the full parser has finished running
I'm interested to know what the plan is for this package. It looks really good so far. I'm wondering if it would make sense to use this in the Json package? Specifically, I was looking into parsing floats today, and I feel like it would be good to use some combinators to handle the valid patterns, e.g. before passing the string into `Str.toDec`. Most of the other decoders could possibly benefit too. If we used a Parser package, then I assume Json could no longer live in the builtins? I guess the alternative is to implement the Json parsers more manually, which may be better for optimising performance in the future.
This may be a terrible idea; I was skim-reading a paper and it included a survey of different algorithms to parse floats fast. I assume the Roc std lib would be the best place to do this kind of thing, but I'm not sure of the best way to interface that with a parser. The current implementation parses a string that looks like a valid float and then attempts to convert it to a Dec or F32 etc. The alternative would be to implement a fast algorithm in Roc, going directly from `List U8` to Dec etc.
I think the parsing should be a builtin. There are just a lot of subtle edge cases with floats, and performance is important
so on the parser side, you would need to be able to tokenize a float
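That tokenizing step might be as simple as splitting off the longest float-ish run of bytes and leaving the actual numeric conversion to a builtin. A rough sketch (made-up names, deliberately ignoring edge cases like repeated dots or misplaced signs, which is exactly why the real conversion should stay a builtin):

```roc
# Does this byte plausibly belong to a float literal?
isFloatByte : U8 -> Bool
isFloatByte = \b ->
    (b >= '0' && b <= '9')
    || b == '.' || b == '-' || b == '+'
    || b == 'e' || b == 'E'

# Sketch: split off the longest float-ish prefix; something like
# Str.toDec would then do the real (edge-case-aware) conversion.
tokenizeFloat : List U8 -> { token : List U8, rest : List U8 }
tokenizeFloat = \bytes ->
    len = List.walkUntil bytes 0 \count, b ->
        if isFloatByte b then Continue (count + 1) else Break count

    { token: List.takeFirst bytes len, rest: List.dropFirst bytes len }
```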
I took a break from developing for a few days because of a compiler error I did not understand. But now Ayaz has explained it to me so I can continue working. Up next I want to implement some examples and tests to root out any inevitable off-by-1 errors.
Folkert de Vries said:
so on the parser side, you would need to be able to tokenize a float
Would this be implemented in Roc or lower level? (I guess the alternative is in Zig?)
So... I started thinking about this again a bit, as I am working through Crafting interpreters but decided to write the interpreter in Roc instead of Java. Since there is a bit of parsing going on, I thought I could pick this project up again.
I am thinking whether it would be a good idea to pass the source, i.e. the input which to parse, to the parser as a module parameter. Currently the source is passed around to the different parsers via the "state" the parser is operating on. But because the Elm parser never consumes the input string, it just moves a cursor along it, the input is always the same. So it is more like a "global context" for the whole parser, and in my mind does not need to be passed around explicitly.
I was wondering if anyone had any opinions on this, if it sounds reasonable, if there are any performance implications?
So one might use the parser like this:

```roc
input = File.readUtf8! path

import Parser.Utf8 { input } as Parser

Parser.run (Parser.digit |> Parser.sepBy whitespace)
```

(i.e. when running the parser one does not pass it the input; it just uses the input you passed in when importing the module)
I definitely wouldn't use module params for this, but I guess you technically could
I would make the cursor and the reference string into a single structure.
Also, the most efficient approach will likely be to avoid the cursor altogether and just drop bytes off the beginning of the string.
Assuming you never rewind the cursor, I expect pattern matching on the list for bytes and just keeping around `.. as rest` to be super powerful.
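Concretely, consuming input that way might look like this (a sketch with made-up names):

```roc
# Sketch: consume one digit by pattern matching on the byte list.
# `.. as rest` binds the remaining bytes as a cheap slice of the
# same underlying allocation.
parseDigit : List U8 -> Result { digit : U8, rest : List U8 } [NotADigit]
parseDigit = \bytes ->
    when bytes is
        [b, .. as rest] if b >= '0' && b <= '9' ->
            Ok { digit: b - '0', rest }

        _ ->
            Err NotADigit
```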
This is how I did it for example: https://github.com/bhansconnect/monkey-roc/blob/main/src/Lexer.roc
But I didn't use a fancier parser
I also split lexing and parsing just for the fun of it. Probably they should be merged.
For this case yes I also split lexing and parsing
Brendan Hansknecht said:
Also, most efficient likely will be to avoid the cursor all together and just drop off bytes from the beginning of the string.
Hmm, so there is no real cost to dropping bytes from the start?
If you don't mutate, it is essentially free
Generates a seamless slice
So each drop is essentially two integer changes: one to increment the pointer offset, one to shorten the slice length
Last updated: Jul 06 2025 at 12:14 UTC