builtins for parsing · ideas · Zulip Chat Archive

Richard Feldman (Aug 14 2024 at 11:11):

the main benefit to having a builtin like this for string parsing is that it lets us return the remainder of the Str without having to revalidate UTF-8 on everything. What if we offered a generic way to do that?

Str.parseUtf8 :
    Str,
    (List U8 -> Result { output, consumed: U64 } []err)
    -> Result { output, rest: Str } [BadUtf8 Utf8Problem]err

Richard Feldman (Aug 14 2024 at 11:12):

instead of revalidating the entire UTF-8, it could just validate that the number of bytes you said you consumed resulted in the Str still being valid

Richard Feldman (Aug 14 2024 at 11:12):

so all of this could be done with no allocations (when done on a non-small string)

Luke Boswell (Aug 14 2024 at 11:20):

Luke Boswell (Aug 14 2024 at 11:21):

If I'm looking to parse an I64, would I think to use this or know how to do that if I'm a beginner?

Luke Boswell (Aug 14 2024 at 11:22):

Or maybe we could include builtin "parsers" like Num.parseI64 : List U8 -> Result { output: I64, consumed: U64 } []_

Luke Boswell (Aug 14 2024 at 11:24):

# parse an I64
Str.parseUtf8 "-42" Num.parseI64 == Ok {output: -42, rest: ""}

Richard Feldman (Aug 14 2024 at 11:31):

yeah it probably makes sense to include specific parsers for all the number types in builtins

Richard Feldman (Aug 14 2024 at 11:31):

Richard Feldman (Aug 14 2024 at 11:39):

although there are three different commonly useful ways to parse each number type:

Richard Feldman (Aug 14 2024 at 11:40):

and there are 13 number types, so that's +39 builtin functions to cover them all :sweat_smile:

Luke Boswell (Aug 14 2024 at 11:41):

Richard Feldman (Aug 14 2024 at 11:43):

Luke Boswell (Aug 14 2024 at 11:44):

Oh right... I was just thinking about how I'm going to implement all of those functions... :smiley:

Brendan Hansknecht (Aug 14 2024 at 15:27):

Brendan Hansknecht (Aug 14 2024 at 15:30):

Obviously we can change opinions, but the opinion back then was just to leave the user to do the bit twiddling.

Part of the issue is that we can't make it symmetric with a nice API. If you have a Num a and want to convert it into bytes, you don't actually want to return a List U8 cause that requires an allocation.

Brendan Hansknecht (Aug 14 2024 at 15:41):

I find that nested function apis for things like this tend to not be nice. This also only works if you always are pulling from the front of the string. (what if you want to trim the end of the string for example).

I would push for Str.dropFirstBytes str n, Str.dropLastBytes str n and something that matches List.subList. I think we should just directly give the user the power and make all of those return results.

Richard Feldman (Aug 14 2024 at 16:59):

that's interesting, although at that point we probably also need like a Str.byteAt : Str, U64 -> Result U8 [OutOfBounds]

Richard Feldman (Aug 14 2024 at 16:59):

Brendan Hansknecht (Aug 15 2024 at 02:15):

I was thinking the user would do Str.toUtf8, process the list, then drop from the original string to avoid the UTF-8 validation cost. It would only be for really niche perf use cases.

Luke Boswell (Aug 17 2024 at 06:42):

Str.parseUtf8 :
    Str,
    (List U8 -> Result { output, consumed: U64 } []err)
    -> Result { output, rest: Str } [BadUtf8 Utf8Problem]err

List.parseUtf8 :
    List U8,
    (List U8 -> Result { output, consumed: U64 } []err)
    -> Result { output, rest: List U8 } []err

# parse an I64
Num.parseI64FromStr : List U8 -> Result { output: I64, consumed: U64 } []err
Num.parseI64FromBytes : List U8 -> Result { output: I64, consumed: U64 } []err
# etc for all 13 number types

# examples
expect Str.parseUtf8 "-42" Num.parseI64 == Ok {output: -42, rest: ""}
expect List.parseUtf8 [45, 52, 50] Num.parseI64 == Ok {output: -42, rest: []}

timotree (Aug 18 2024 at 04:09):

I think with Brendan's Str.dropFirstBytes we don't need Str.parseUtf8. Str.dropFirstBytes : Str, U64 -> Result Str [BadUtf8 Utf8Problem] is a simpler and more general primitive which allows you to implement Str.parseUtf8 without any performance penalty:

Str.dropFirstBytes : Str, U64 -> Result Str [BadUtf8]

parseUtf8 :
    Str,
    (List U8 -> Result { output, consumed: U64 } err)
    -> Result { output, rest: Str } [BadUtf8, ParseError err]
parseUtf8 = \s, f ->
    {output, consumed} <- s
        |> Str.toUtf8 # O(1)
        |> f
        |> Result.mapErr ParseError
        |> Result.try
    rest <- s
        |> Str.dropFirstBytes consumed # Will be O(1) with built-in
        |> Result.map
    { output, rest }

timotree (Aug 18 2024 at 04:12):

Also, is there a reason to expose a separate parse function for each number instead of exposing a generic one for Num *? You would still need to implement it once for each number type but it seems like the API surface can be reduced.

Luke Boswell (Aug 18 2024 at 04:19):

I'm a little confused. I have been trying to implement the last few remaining builtins that were missing. I was tracking a number of "parse" builtins for Str, and the above was a suggestion to reduce the implementation down to Str.parseUtf8.

Are you suggesting that users would implement this when they want this functionality? or maybe saying we can implement it in the builtins using pure Roc and don't need a lowlevel?

Luke Boswell (Aug 18 2024 at 04:21):

Thank you for sharing that example implementation. I could make a PR in a few minutes to add that, at least the Str.parseUtf8 part.

I'm not sure about the helpers like Num.parseI64 etc... if we can just have a single Num.parseNum : List U8 -> Result { output: Num *, consumed: U64 } err that is generic, that would be awesome.

timotree (Aug 18 2024 at 04:24):

Str.dropFirstBytes : Str, U64 -> Result Str [BadUtf8] # O(1) builtin which just has to UTF-8 validate the first character of the new string
Str.dropLastBytes : Str, U64 -> Result Str [BadUtf8] # O(1) builtin which just has to UTF-8 validate the last character of the new string

NumParseError : [OutOfRange, NotANumber]

Num.parseUtf8 : List U8 -> Result {output: Num *, rest: List U8} NumParseError # Should be a lowlevel so we can get different code for each number type
Num.parse : Str -> Result {output : Num *, rest : Str} NumParseError # Pure Roc wrapper around Num.parseUtf8 using Str.dropFirstBytes
Num.fromStr : Str -> Result (Num *) NumParseError # Pure Roc wrapper around Num.parseUtf8 that errors if rest is nonempty
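The three layers proposed above can be sketched outside Roc. The following is a minimal Python illustration, not the Roc builtins: the names parse_int_utf8, parse_int and int_from_str are hypothetical stand-ins for Num.parseUtf8, Num.parse and Num.fromStr, it only handles decimal integers in the I64 range, and it raises ValueError where the Roc versions would return an Err.

```python
from typing import Tuple

I64_MIN = -(2**63)
I64_MAX = 2**63 - 1


def parse_int_utf8(data: bytes) -> Tuple[int, bytes]:
    """Stand-in for Num.parseUtf8: parse from the front of the bytes, return the rest."""
    i = 0
    if i < len(data) and data[i] == ord("-"):
        i += 1
    start = i
    while i < len(data) and ord("0") <= data[i] <= ord("9"):
        i += 1
    if i == start:
        # No digits were found (covers empty input and a lone "-").
        raise ValueError("NotANumber")
    value = int(data[:i])
    if not I64_MIN <= value <= I64_MAX:
        raise ValueError("OutOfRange")
    return value, data[i:]


def parse_int(s: str) -> Tuple[int, str]:
    """Stand-in for Num.parse: wrapper over the bytes-level parser."""
    value, rest = parse_int_utf8(s.encode("utf-8"))
    # Only ASCII bytes were consumed, so the remainder is still valid UTF-8.
    return value, rest.decode("utf-8")


def int_from_str(s: str) -> int:
    """Stand-in for Num.fromStr: like parse_int, but any leftover input is an error."""
    value, rest = parse_int(s)
    if rest:
        raise ValueError("NotANumber")
    return value
```

Python has to re-encode and decode here, which is exactly the cost the Roc proposal avoids by dropping bytes from the original Str instead.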

Luke Boswell (Aug 18 2024 at 04:28):

Like, can we just drop bytes from a known valid Utf-8 Str and it's still going to be valid?

Luke Boswell (Aug 18 2024 at 04:29):

Luke Boswell (Aug 18 2024 at 04:31):

I'm guessing the idea here, is that we do Str.fromUtf8 on the remainder... or have some kind of lowlevel which specifically only needs to validate up to 4 bytes deep to check it's still a valid character.

Luke Boswell (Aug 18 2024 at 04:32):

Anyway, I think we have enough here to create an Issue and track this - which is my primary goal. I'm currently busy trying to iron out some issue with my rebuild-host PR and can maybe come back to this another time.

Luke Boswell (Aug 18 2024 at 04:35):

timotree (Aug 18 2024 at 04:38):

in utf-8, 0xxxxxxx bytes and 11xxxxxx bytes start characters, and 10xxxxxx bytes continue characters. therefore for it to be valid to drop the first n bytes from a valid utf-8 string, you need the new string to either be empty or begin with a byte that starts a character (0xxxxxxx or 11xxxxxx).

This is a quick to check local property and doesn't require revalidating the whole string
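A minimal sketch of that local check, written in Python rather than Roc so it can be run directly. The function names are hypothetical, the input is assumed to already be valid UTF-8, and ValueError stands in for the BadUtf8 and out-of-bounds errors a Roc builtin would return.

```python
def is_char_start(byte: int) -> bool:
    # Continuation bytes look like 10xxxxxx; every other byte starts a character.
    return (byte & 0b1100_0000) != 0b1000_0000


def drop_first_bytes(data: bytes, n: int) -> bytes:
    """Drop the first n bytes of valid UTF-8, inspecting only the new first byte."""
    if n > len(data):
        raise ValueError("out of bounds")
    # An empty remainder is always fine; otherwise the cut must land on a character start.
    if n < len(data) and not is_char_start(data[n]):
        raise ValueError("bad UTF-8: cut falls inside a multi-byte character")
    return data[n:]
```

The check reads a single byte regardless of string length. The Python slice copies the remainder, whereas the builtin discussed here would presumably return a view into the original string.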

timotree (Aug 18 2024 at 04:47):

Brendan Hansknecht (Aug 18 2024 at 07:17):

One note on the parse API:
It can be generic over Num *. That said, would it make more sense to split into parseInt and parseFrac? Not sold either way. Just feel like they are kinda different use cases and different error unions.

timotree (Aug 19 2024 at 03:41):

An API which is generic over Num * is more expressive because you can use it in generic contexts where you're working with Num *s. You can't go from specific versions for each number type to a generic Num * version in user code like you can in a builtin, because user code has to be open to new number types existing in the future. I don't actually have a use-case for wanting to parse numbers in generic code though, so I don't know if it makes much of a difference

Brendan Hansknecht (Aug 19 2024 at 05:57):

That's definitely true. Just means you have to deal with the full error union even if your specific type only has a subset of the possible errors. Given floats are more complex to parse, I would assume they will have more error variants
