Stream: ideas

Topic: builtins for parsing


view this post on Zulip Richard Feldman (Aug 14 2024 at 11:11):

Brendan Hansknecht said:

I think we want something like
Str.parseI16 : Str -> Result { val: I16, rest: Str } [WhatEverErrorsParsingCanProduce (invalid format? Not a number? Out of bound?)]

Maybe also want it on List U8 which assumes parsing from ASCII.

I wonder if we could make this generic for Str :thinking:

the main benefit to having a builtin like this for string parsing is that it lets us return the remainder of the Str without having to revalidate UTF-8 on everything. What if we offered a generic way to do that?

Str.parseUtf8 :
    Str,
    (List U8 -> Result { output, consumed: U64 } []err)
    -> Result { output, rest: Str } [BadUtf8 Utf8Problem]err

view this post on Zulip Richard Feldman (Aug 14 2024 at 11:12):

instead of revalidating the entire UTF-8, it could just validate that the number of bytes you said you consumed resulted in the Str still being valid

view this post on Zulip Richard Feldman (Aug 14 2024 at 11:12):

and the List U8 being passed in could be a slice

view this post on Zulip Richard Feldman (Aug 14 2024 at 11:12):

so all of this could be done with no allocations (when done on a non-small string)
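A minimal sketch of that idea, written in Python rather than Roc because the Roc builtin is only a proposal at this point. The names `parse_utf8` and `take_digits` are hypothetical, and `bytes` stands in for a `List U8` view of the string. The wrapper trusts that the input was valid UTF-8 to begin with, so it only inspects the single byte at the cut position instead of revalidating everything:

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def parse_utf8(text: str, parser: Callable[[bytes], Tuple[T, int]]) -> Tuple[T, str]:
    """Run `parser` over the UTF-8 bytes of `text`; return (output, rest)."""
    data = text.encode("utf-8")  # stand-in for a zero-copy view of the Str's bytes
    output, consumed = parser(data)
    if consumed > len(data):
        raise ValueError("BadUtf8: parser claimed more bytes than exist")
    # The input was valid UTF-8, so the rest is valid as long as it does not
    # begin with a 10xxxxxx continuation byte.
    if consumed < len(data) and (data[consumed] & 0xC0) == 0x80:
        raise ValueError("BadUtf8: cut falls inside a multi-byte character")
    # Python re-decodes here only so the result is a str again.
    return output, data[consumed:].decode("utf-8")


def take_digits(data: bytes) -> Tuple[int, int]:
    """Hypothetical parser: read leading ASCII digits, report bytes consumed."""
    n = 0
    while n < len(data) and 48 <= data[n] <= 57:  # ASCII '0'..'9'
        n += 1
    if n == 0:
        raise ValueError("NotANumber")
    return int(data[:n].decode("ascii")), n
```

With these definitions, `parse_utf8("42é", take_digits)` hands back the number together with the untouched remainder, while a parser that reports a byte count ending in the middle of `é` is rejected.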

view this post on Zulip Luke Boswell (Aug 14 2024 at 11:20):

Is the type of output also output here?

view this post on Zulip Luke Boswell (Aug 14 2024 at 11:21):

I like the idea, though I wonder how people will discover it.

If I'm looking to parse an I64, would I think to use this or know how to do that if I'm a beginner?

view this post on Zulip Luke Boswell (Aug 14 2024 at 11:22):

Or maybe we could include builtin "parsers" like, Num.parseI64 : List U8 -> Result { output: I64, consumed: U64 } []_

view this post on Zulip Luke Boswell (Aug 14 2024 at 11:24):

And the docs on those would show usage in tandem with the above.

# parse an I64
Str.parseUtf8 "-42" Num.parseI64 == Ok {output: -42, rest: ""}

view this post on Zulip Richard Feldman (Aug 14 2024 at 11:31):

yeah it probably makes sense to include specific parsers for all the number types in builtins

view this post on Zulip Richard Feldman (Aug 14 2024 at 11:31):

regardless of whether we also have a generic one

view this post on Zulip Richard Feldman (Aug 14 2024 at 11:39):

although there are three different commonly useful ways to parse each number type:

  1. from a Str, e.g. "42" becomes 42
  2. from a List U8, parsing it as UTF-8 just like the previous example, e.g. Str.toUtf8 "42" becomes 42
  3. from a List U8, interpreting the bytes directly as the integer, e.g. [3] becomes 3u8 - here, any number type bigger than 1 byte requires specifying the endianness to use in the parsing
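The three cases can be illustrated with a short Python sketch (Python is used here only as a stand-in; the function names are hypothetical and are not proposed Roc builtins). Note that Python's `int()` is more permissive than a Roc parser would likely be, for example about surrounding whitespace:

```python
def parse_from_str(text: str) -> int:
    """Case 1: parse decimal digits from a string."""
    return int(text)


def parse_from_utf8(data: bytes) -> int:
    """Case 2: parse decimal digits from bytes that hold ASCII/UTF-8 text."""
    return int(data.decode("ascii"))


def parse_from_raw(data: bytes, byteorder: str) -> int:
    """Case 3: reinterpret the bytes themselves as an unsigned integer.

    `byteorder` must be "little" or "big"; it only matters once the
    number is wider than one byte.
    """
    return int.from_bytes(data, byteorder)
```

For a single byte the endianness argument makes no difference, but the two-byte input `[1, 0]` reads as 1 in little-endian order and 256 in big-endian order, which is why case 3 needs the extra parameter.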

view this post on Zulip Richard Feldman (Aug 14 2024 at 11:40):

and there are 13 number types, so that's +39 builtin functions to cover them all :sweat_smile:

view this post on Zulip Luke Boswell (Aug 14 2024 at 11:41):

Richard Feldman said:

and there are 13 number types, so that's +39 builtin functions to cover them all :sweat_smile:

That's what LLMs are for though, right?

view this post on Zulip Richard Feldman (Aug 14 2024 at 11:43):

oh I meant that it's like a lot of surface area to add haha

view this post on Zulip Richard Feldman (Aug 14 2024 at 11:43):

in terms of docs, compilation time, etc.

view this post on Zulip Luke Boswell (Aug 14 2024 at 11:44):

Oh right... I was just thinking about how I'm going to implement all of those functions... :smiley:

view this post on Zulip Brendan Hansknecht (Aug 14 2024 at 15:27):

We already discussed and explicitly decided not to add 3.

view this post on Zulip Brendan Hansknecht (Aug 14 2024 at 15:30):

Obviously we can change opinions, but the opinion back then was just to leave the user to do the bit twiddling.

Part of the issue is that we can't make it symmetric with a nice API. If you have a Num a and want to convert it into bytes, you don't actually want to return a List U8 cause that requires an allocation.

view this post on Zulip Brendan Hansknecht (Aug 14 2024 at 15:41):

I find that nested function apis for things like this tend to not be nice. This also only works if you always are pulling from the front of the string. (what if you want to trim the end of the string for example).

I would push for Str.dropFirstBytes str n, Str.dropLastBytes str n and something that matches List.subList. I think we should just directly give the user the power and make all of those return results.

view this post on Zulip Richard Feldman (Aug 14 2024 at 16:59):

that's interesting, although at that point we probably also need like a Str.byteAt : Str, U64 -> Result U8 [OutOfBounds]

view this post on Zulip Richard Feldman (Aug 14 2024 at 16:59):

so you can read the first n bytes yourself and decide how many you want to drop

view this post on Zulip Brendan Hansknecht (Aug 15 2024 at 02:15):

I was thinking the user would do Str.toUtf8, process the list, then drop from the original string to avoid the UTF-8 validation cost. It would only be for really niche perf use cases.

view this post on Zulip Luke Boswell (Aug 17 2024 at 06:42):

I think we should add this as an Issue.

Just to clarify, is this an accurate description of what we are thinking?

Str.parseUtf8 :
    Str,
    (List U8 -> Result { output, consumed: U64 } []err)
    -> Result { output, rest: Str } [BadUtf8 Utf8Problem]err

List.parseUtf8 :
    List U8,
    (List U8 -> Result { output, consumed: U64 } []err)
    -> Result { output, rest: List U8 } []err

# parse an I64
Num.parseI64FromStr : List U8 -> Result { output: I64, consumed: U64 } []err
Num.parseI64FromBytes : List U8 -> Result { output: I64, consumed: U64 } []err
# etc for all 13 number types

# examples
expect Str.parseUtf8 "-42" Num.parseI64 == Ok {output: -42, rest: ""}
expect List.parseUtf8 [45, 52, 50] Num.parseI64 == Ok {output: -42, rest: []}

view this post on Zulip timotree (Aug 18 2024 at 04:09):

I think with Brendan's Str.dropFirstBytes we don't need Str.parseUtf8. Str.dropFirstBytes : Str, U64 -> Result Str [BadUtf8] is a simpler and more general primitive which allows you to implement Str.parseUtf8 without any performance penalty:

Str.dropFirstBytes : Str, U64 -> Result Str [BadUtf8]

parseUtf8 :
    Str,
    (List U8 -> Result { output, consumed: U64 } err)
    -> Result { output, rest: Str } [BadUtf8, ParseError err]
parseUtf8 = \s, f ->
    {output, consumed} <- s
        |> Str.toUtf8 # O(1)
        |> f
        |> Result.mapErr ParseError
        |> Result.try
    rest <- s
        |> Str.dropFirstBytes consumed # Will be O(1) with built-in
        |> Result.map
    { output, rest }

view this post on Zulip timotree (Aug 18 2024 at 04:12):

Also, is there a reason to expose a separate parse function for each number instead of exposing a generic one for Num *? You would still need to implement it once for each number type but it seems like the API surface can be reduced.

view this post on Zulip Luke Boswell (Aug 18 2024 at 04:19):

@timotree

we don't need Str.parseUtf8.

I'm a little confused. I have been trying to implement the last few remaining builtins that were missing. I was tracking a number of "parse" builtins for Str, and the above was a suggestion to reduce the implementation down to Str.parseUtf8.

Are you suggesting that users would implement this when they want this functionality? or maybe saying we can implement it in the builtins using pure Roc and don't need a lowlevel?

view this post on Zulip Luke Boswell (Aug 18 2024 at 04:21):

Thank you for sharing that example implementation. I could make a PR in a few minutes to add that, at least the Str.parseUtf8 part.

I'm not sure about the helpers like Num.parseI64 etc... if we can just have a single Num.parseNum : List U8 -> Result { output: Num *, consumed: U64 } err that is generic, that would be awesome.

view this post on Zulip timotree (Aug 18 2024 at 04:24):

I'm thinking the new API surface should be

Str.dropFirstBytes : Str, U64 -> Result Str [BadUtf8] # O(1) builtin which just has to UTF-8 validate the first character of the new string
Str.dropLastBytes : Str, U64 -> Result Str [BadUtf8] # O(1) builtin which just has to UTF-8 validate the last character of the new string

NumParseError : [OutOfRange, NotANumber]

Num.parseUtf8 : List U8 -> Result {output: Num *, rest: List U8} NumParseError # Should be a lowlevel so we can get different code for each number type
Num.parse : Str -> Result {output : Num *, rest : Str} NumParseError # Pure Roc wrapper around Num.parseUtf8 using Str.dropFirstBytes
Num.fromStr : Str -> Result (Num *) NumParseError # Pure Roc wrapper around Num.parseUtf8 that errors if rest is nonempty
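To make the layering concrete, here is a rough Python sketch of the three levels for one number type (I16). It is only an illustration under assumptions: the function names are hypothetical, exceptions stand in for the `NumParseError` tags, and only a leading `-` plus decimal digits are accepted. The real builtin would be a lowlevel and `Num.parse` would use `Str.dropFirstBytes` instead of re-decoding:

```python
from typing import Tuple

I16_MIN, I16_MAX = -(2**15), 2**15 - 1


class NumParseError(Exception):
    """Carries "OutOfRange" or "NotANumber", mirroring the proposed tags."""


def parse_i16_utf8(data: bytes) -> Tuple[int, bytes]:
    """Like Num.parseUtf8: parse a leading I16 and return (output, rest)."""
    pos = 0
    if pos < len(data) and data[pos] == ord("-"):
        pos += 1
    digits_start = pos
    while pos < len(data) and ord("0") <= data[pos] <= ord("9"):
        pos += 1
    if pos == digits_start:
        raise NumParseError("NotANumber")
    value = int(data[:pos].decode("ascii"))
    if not I16_MIN <= value <= I16_MAX:
        raise NumParseError("OutOfRange")
    return value, data[pos:]


def parse_i16(text: str) -> Tuple[int, str]:
    """Like Num.parse: a thin wrapper that returns the rest as a string."""
    value, rest = parse_i16_utf8(text.encode("utf-8"))
    # Only ASCII bytes were consumed, so `rest` starts on a character boundary.
    return value, rest.decode("utf-8")


def i16_from_str(text: str) -> int:
    """Like Num.fromStr: fail unless the whole input was consumed."""
    value, rest = parse_i16(text)
    if rest:
        raise NumParseError("NotANumber")
    return value
```

Under this sketch `parse_i16("-42abc")` yields the value and the leftover `"abc"`, whereas `i16_from_str("12x")` and `i16_from_str("40000")` both fail, the latter because the value does not fit in an I16.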

view this post on Zulip Luke Boswell (Aug 18 2024 at 04:28):

which just has to UTF-8 validate the first character of the new string

This part I'd like to understand more.

It's similar to the question I have on https://github.com/roc-lang/roc/pull/7007

view this post on Zulip Luke Boswell (Aug 18 2024 at 04:28):

Like, can we just drop bytes from a known valid UTF-8 Str and it's still going to be valid?

view this post on Zulip Luke Boswell (Aug 18 2024 at 04:29):

I'm pretty sure the answer is no, because it's a variable width encoding

view this post on Zulip Luke Boswell (Aug 18 2024 at 04:31):

I'm guessing the idea here, is that we do Str.fromUtf8 on the remainder... or have some kind of lowlevel which specifically only needs to validate up to 4 bytes deep to check it's still a valid character.

view this post on Zulip Luke Boswell (Aug 18 2024 at 04:32):

Anyway, I think we have enough here to create an Issue and track this - which is my primary goal. I'm currently busy trying to iron out some issues with my rebuild-host PR and can maybe come back to this another time.

view this post on Zulip Luke Boswell (Aug 18 2024 at 04:35):

https://github.com/roc-lang/roc/issues/7010

view this post on Zulip timotree (Aug 18 2024 at 04:38):

in UTF-8, 0xxxxxxx bytes and 11xxxxxx bytes start characters, and 10xxxxxx bytes continue characters. therefore, for it to be valid to drop the first n bytes from a valid UTF-8 string, you need the new string to either be empty or begin with a character-starting byte (0xxxxxxx or 11xxxxxx) rather than a 10xxxxxx continuation byte.

This is a quick-to-check local property and doesn't require revalidating the whole string
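As a rough illustration of that local check, here is a Python sketch in which `bytes` stands in for a Str that is already known to be valid UTF-8. The helper name `is_char_boundary` and the use of `ValueError` for `BadUtf8` are assumptions made for the example, not the proposed Roc API:

```python
def is_char_boundary(data: bytes, index: int) -> bool:
    """True if cutting valid UTF-8 `data` at `index` splits no character."""
    if index == 0 or index == len(data):
        return True
    if index < 0 or index > len(data):
        return False
    # 10xxxxxx is a continuation byte; any other byte starts a character.
    return (data[index] & 0b1100_0000) != 0b1000_0000


def drop_first_bytes(data: bytes, n: int) -> bytes:
    """Drop the first n bytes, inspecting only the byte at the cut."""
    if not is_char_boundary(data, n):
        raise ValueError("BadUtf8")
    return data[n:]


def drop_last_bytes(data: bytes, n: int) -> bytes:
    """Drop the last n bytes, inspecting only the byte at the cut."""
    cut = len(data) - n
    if not is_char_boundary(data, cut):
        raise ValueError("BadUtf8")
    return data[:cut]
```

Both functions look at one byte at most, whatever the length of the string. For `"héllo"`, whose `é` occupies bytes 1 and 2, dropping one or three leading bytes succeeds, while dropping two fails because the cut would land on the continuation byte.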

view this post on Zulip timotree (Aug 18 2024 at 04:47):

I'll post some more details about how to do the quick UTF-8 checks in the issue

view this post on Zulip Brendan Hansknecht (Aug 18 2024 at 07:17):

One note on the parse API:
It can be generic over Num *. That said, would it make more sense to split into parseInt and parseFrac? Not sold either way. Just feel like they are kinda different use cases and different error unions.

view this post on Zulip timotree (Aug 19 2024 at 03:41):

An API which is generic over Num * is more expressive because you can use it in generic contexts where you're working with Num *s. You can't go from specific versions for each number type to a generic Num * version in user code like you can in a builtin, because user code has to be open to new number types existing in the future. I don't actually have a use-case for wanting to parse numbers in generic code though, so I don't know if it makes much of a difference

view this post on Zulip Brendan Hansknecht (Aug 19 2024 at 05:57):

That's definitely true. Just means you have to deal with the full error union even if your specific type only has a subset of the possible errors. Given floats are more complex to parse, I would assume they will have more error variants


Last updated: Jun 16 2026 at 16:19 UTC