Stream: ideas

Topic: builtin for lexing


view this post on Zulip Richard Feldman (Apr 18 2024 at 18:18):

this gave me an idea:

https://roc.zulipchat.com/#narrow/stream/304902-show-and-tell/topic/Decoder.20composition.20performance.20benchmark/near/433999305

what if we had a builtin module for common forms of lexing? Not necessarily "arbitrary lexing" but rather common lexing operations like "string with backslash escapes" and "line comment" and so on

view this post on Zulip Richard Feldman (Apr 18 2024 at 18:19):

we could implement those using the SIMDjson techniques, so they'd be super fast, and then they'd produce tokens (of your choice) which you could then build parsers on top of, for use in decoding etc.

view this post on Zulip Richard Feldman (Apr 18 2024 at 18:20):

here's a sketch of a potential API (no idea if the types work out, but it's just an idea anyway):

Lexer.lineCol : Lexer LineCol * *
Lexer.lineCodePt : Lexer LineCodePt * *
Lexer.codePt : Lexer CodePt * *
Lexer.byteIndex : Lexer ByteIndex * *

Lexer.byte : Lexer region * err, U8, b -> Lexer region b err
Lexer.byteRange : Lexer region * err, { min : U8, max : U8 }, (U8 -> b) -> Lexer region b err
Lexer.delimited : Lexer region * err, { start : U8, end : U8, escape ? [Escape U8, NoEscapes] }, (List U8 -> b) -> Lexer b
Lexer.delimitedUtf8 : Lexer * []err, { start : U8, end : U8, escape ? [ByteAfter U8, NoEscapes], disallowed ? Str }, (List Str -> b) -> Lexer b [BadUtf8 { index : U64 }]err

view this post on Zulip Richard Feldman (Apr 18 2024 at 18:20):

and then that could be used like this, for example:

lexer : Lexer LineCol [SingleLineStr Str, Dot, Comma, Plus, Minus, Slash, Star, Digit U8] [BadUtf8]
lexer =
    Lexer.lineCol
    |> Lexer.byte '.' Dot
    |> Lexer.byte ',' Comma
    |> Lexer.byte '+' Plus
    |> Lexer.byte '-' Minus
    |> Lexer.byte '/' Slash
    |> Lexer.byte '*' Star
    |> Lexer.byteRange { min: '0', max: '9' } Digit
    |> Lexer.delimitedUtf8 { start: '"', end: '"', escape: ByteAfter '\\', disallowed: "\n" } SingleLineStr

view this post on Zulip Richard Feldman (Apr 18 2024 at 18:21):

assuming the deisgn could end up being nice, the hope would be that this could be used to implement high-perfomrance parsers of not just JSON, but also XML, HTML, CSV, etc.

view this post on Zulip Eli Dowling (Apr 18 2024 at 23:56):

I really like the idea! Although I do think it would be good to get a better handle on why current attempts to parse json are so slow, and see how close it can get. Seems like a good measure of real world roc performance.

view this post on Zulip Luke Boswell (Apr 19 2024 at 00:06):

The current JSON parser could be written much faster I think. I've done a bit of research and have some ideas inspired by simdjson, but haven't made any serious attempts at re-writing it.

I figured the library is still maturing with Roc and we might have a better idea once we have more features. The recent decoding of optional stuff is a good example.

Also we should probably resolve #3816 so we can support decoding of Tag unions.

view this post on Zulip Brendan Hansknecht (Apr 19 2024 at 00:30):

I would personally hope we don't need custom magic for this. We probably will, but I would prefer to delay this as long as possible and focus super heavily on enhancing roc perf first. Also, we should probably figure out the generic simd story in roc before this.

view this post on Zulip Eli Dowling (Apr 19 2024 at 01:29):

@Luke Boswell I'd be curious what your ideas are. I've done a bunch of experiments but haven't gotten much improvement

view this post on Zulip Luke Boswell (Apr 19 2024 at 02:26):

What if we built the proposed lexing library using the simdjson ideas such as branchless code and logical operators, but instead of a builtin we just make it a pure Roc package.

We could use this package to test out and find the performance issues in underlying roc, and also to explore the SIMD story?

Potentially we could then transition roc-json and other libraries to work on top of this lexing package?

view this post on Zulip Luke Boswell (Apr 19 2024 at 02:30):

Just sharing here as it took me a while to dig this out, but in this gist I implemented the Figure 3 from Parsing Gigabytes of JSON per Second paper in Roc using the simdjson ideas as an early POC.


Last updated: Jun 16 2026 at 16:19 UTC