builtin for lexing · ideas · Zulip Chat Archive

what if we had a builtin module for common forms of lexing? Not necessarily "arbitrary lexing" but rather common lexing operations like "string with backslash escapes" and "line comment" and so on

Richard Feldman (Apr 18 2024 at 18:19):

we could implement those using the SIMDjson techniques, so they'd be super fast, and then they'd produce tokens (of your choice) which you could then build parsers on top of, for use in decoding etc.

Richard Feldman (Apr 18 2024 at 18:20):

here's a sketch of a potential API (no idea if the types work out, but it's just an idea anyway):

Lexer.lineCol : Lexer LineCol * *
Lexer.lineCodePt : Lexer LineCodePt * *
Lexer.codePt : Lexer CodePt * *
Lexer.byteIndex : Lexer ByteIndex * *

Lexer.byte : Lexer region * err, U8, b -> Lexer region b err
Lexer.byteRange : Lexer region * err, { min : U8, max : U8 }, (U8 -> b) -> Lexer region b err
Lexer.delimited : Lexer region * err, { start : U8, end : U8, escape ? [Escape U8, NoEscapes] }, (List U8 -> b) -> Lexer b
Lexer.delimitedUtf8 : Lexer * []err, { start : U8, end : U8, escape ? [ByteAfter U8, NoEscapes], disallowed ? Str }, (List Str -> b) -> Lexer b [BadUtf8 { index : U64 }]err

Richard Feldman (Apr 18 2024 at 18:20):

lexer : Lexer LineCol [SingleLineStr Str, Dot, Comma, Plus, Minus, Slash, Star, Digit U8] [BadUtf8]
lexer =
    Lexer.lineCol
    |> Lexer.byte '.' Dot
    |> Lexer.byte ',' Comma
    |> Lexer.byte '+' Plus
    |> Lexer.byte '-' Minus
    |> Lexer.byte '/' Slash
    |> Lexer.byte '*' Star
    |> Lexer.byteRange { min: '0', max: '9' } Digit
    |> Lexer.delimitedUtf8 { start: '"', end: '"', escape: ByteAfter '\\', disallowed: "\n" } SingleLineStr

Richard Feldman (Apr 18 2024 at 18:21):

assuming the deisgn could end up being nice, the hope would be that this could be used to implement high-perfomrance parsers of not just JSON, but also XML, HTML, CSV, etc.

Eli Dowling (Apr 18 2024 at 23:56):

I really like the idea! Although I do think it would be good to get a better handle on why current attempts to parse json are so slow, and see how close it can get. Seems like a good measure of real world roc performance.

Luke Boswell (Apr 19 2024 at 00:06):

The current JSON parser could be written much faster I think. I've done a bit of research and have some ideas inspired by simdjson, but haven't made any serious attempts at re-writing it.

I figured the library is still maturing with Roc and we might have a better idea once we have more features. The recent decoding of optional stuff is a good example.

Brendan Hansknecht (Apr 19 2024 at 00:30):

I would personally hope we don't need custom magic for this. We probably will, but I would prefer to delay this as long as possible and focus super heavily on enhancing roc perf first. Also, we should probably figure out the generic simd story in roc before this.

Eli Dowling (Apr 19 2024 at 01:29):

@Luke Boswell I'd be curious what your ideas are. I've done a bunch of experiments but haven't gotten much improvement

Luke Boswell (Apr 19 2024 at 02:26):

What if we built the proposed lexing library using the simdjson ideas such as branchless code and logical operators, but instead of a builtin we just make it a pure Roc package.

We could use this package to test out and find the performance issues in underlying roc, and also to explore the SIMD story?

Potentially we could then transition roc-json and other libraries to work on top of this lexing package?

Stream: ideas

Topic: builtin for lexing

Richard Feldman (Apr 18 2024 at 18:18):

Richard Feldman (Apr 18 2024 at 18:19):

Richard Feldman (Apr 18 2024 at 18:20):

Richard Feldman (Apr 18 2024 at 18:20):

Richard Feldman (Apr 18 2024 at 18:21):

Eli Dowling (Apr 18 2024 at 23:56):

Luke Boswell (Apr 19 2024 at 00:06):

Brendan Hansknecht (Apr 19 2024 at 00:30):

Eli Dowling (Apr 19 2024 at 01:29):

Luke Boswell (Apr 19 2024 at 02:26):

Luke Boswell (Apr 19 2024 at 02:30):