this gave me an idea:
what if we had a builtin module for common forms of lexing? Not necessarily "arbitrary lexing" but rather common lexing operations like "string with backslash escapes" and "line comment" and so on
we could implement those using the SIMDjson techniques, so they'd be super fast, and then they'd produce tokens (of your choice) which you could then build parsers on top of, for use in decoding etc.
here's a sketch of a potential API (no idea if the types work out, but it's just an idea anyway):
Lexer.lineCol : Lexer LineCol * *
Lexer.lineCodePt : Lexer LineCodePt * *
Lexer.codePt : Lexer CodePt * *
Lexer.byteIndex : Lexer ByteIndex * *
Lexer.byte : Lexer region * err, U8, b -> Lexer region b err
Lexer.byteRange : Lexer region * err, { min : U8, max : U8 }, (U8 -> b) -> Lexer region b err
Lexer.delimited : Lexer region * err, { start : U8, end : U8, escape ? [Escape U8, NoEscapes] }, (List U8 -> b) -> Lexer b
Lexer.delimitedUtf8 : Lexer * []err, { start : U8, end : U8, escape ? [ByteAfter U8, NoEscapes], disallowed ? Str }, (List Str -> b) -> Lexer b [BadUtf8 { index : U64 }]err
and then that could be used like this, for example:
lexer : Lexer LineCol [SingleLineStr Str, Dot, Comma, Plus, Minus, Slash, Star, Digit U8] [BadUtf8]
lexer =
Lexer.lineCol
|> Lexer.byte '.' Dot
|> Lexer.byte ',' Comma
|> Lexer.byte '+' Plus
|> Lexer.byte '-' Minus
|> Lexer.byte '/' Slash
|> Lexer.byte '*' Star
|> Lexer.byteRange { min: '0', max: '9' } Digit
|> Lexer.delimitedUtf8 { start: '"', end: '"', escape: ByteAfter '\\', disallowed: "\n" } SingleLineStr
assuming the deisgn could end up being nice, the hope would be that this could be used to implement high-perfomrance parsers of not just JSON, but also XML, HTML, CSV, etc.
I really like the idea! Although I do think it would be good to get a better handle on why current attempts to parse json are so slow, and see how close it can get. Seems like a good measure of real world roc performance.
The current JSON parser could be written much faster I think. I've done a bit of research and have some ideas inspired by simdjson, but haven't made any serious attempts at re-writing it.
I figured the library is still maturing with Roc and we might have a better idea once we have more features. The recent decoding of optional stuff is a good example.
Also we should probably resolve #3816 so we can support decoding of Tag unions.
I would personally hope we don't need custom magic for this. We probably will, but I would prefer to delay this as long as possible and focus super heavily on enhancing roc perf first. Also, we should probably figure out the generic simd story in roc before this.
@Luke Boswell I'd be curious what your ideas are. I've done a bunch of experiments but haven't gotten much improvement
What if we built the proposed lexing library using the simdjson ideas such as branchless code and logical operators, but instead of a builtin we just make it a pure Roc package.
We could use this package to test out and find the performance issues in underlying roc, and also to explore the SIMD story?
Potentially we could then transition roc-json and other libraries to work on top of this lexing package?
Just sharing here as it took me a while to dig this out, but in this gist I implemented the Figure 3 from Parsing Gigabytes of JSON per Second paper in Roc using the simdjson ideas as an early POC.
Last updated: Jun 16 2026 at 16:19 UTC