Brendan Hansknecht said:
I think we want something like
Str.parseI16 : Str -> Result { val: I16, rest: Str } [WhatEverErrorsParsingCanProduce (invalid format? Not a number? Out of bound?)]
Maybe also want it on List U8 which assumes parsing from ASCII.
I wonder if we could make this generic for Str :thinking:
the main benefit to having a builtin like this for string parsing is that it lets us return the remainder of the Str without having to revalidate UTF-8 on everything. What if we offered a generic way to do that?
Str.parseUtf8 :
    Str,
    (List U8 -> Result { output, consumed: U64 } []err)
    -> Result { output, rest: Str } [BadUtf8 Utf8Problem]err
instead of revalidating the entire UTF-8, it could just validate that the number of bytes you said you consumed resulted in the Str still being valid
and the List U8 being passed in could be a slice
so all of this could be done with no allocations (when done on a non-small string)
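A minimal sketch of the idea above, written in Python because the proposed Roc builtin does not exist yet. `parse_utf8` and `parse_digits` are hypothetical names, and a `bytes` value stands in for a Str that is already known to be valid UTF-8. The point it illustrates is that only the cut position reported by the parser needs checking, not the whole remainder.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def parse_digits(data: bytes) -> Tuple[int, int]:
    """Hypothetical parser: read leading ASCII digits, return (output, consumed)."""
    consumed = 0
    while consumed < len(data) and ord("0") <= data[consumed] <= ord("9"):
        consumed += 1
    if consumed == 0:
        raise ValueError("NotANumber")
    return int(data[:consumed]), consumed


def parse_utf8(valid_utf8: bytes, parser: Callable[[bytes], Tuple[T, int]]) -> Tuple[T, bytes]:
    """Run `parser` on the bytes, then check only the byte at the cut position.

    Assumes `valid_utf8` is already valid UTF-8, as a Str would be.
    """
    output, consumed = parser(valid_utf8)
    if not 0 <= consumed <= len(valid_utf8):
        raise ValueError("BadUtf8")
    # A cut is only invalid if the remainder would start on a continuation byte (10xxxxxx).
    if consumed < len(valid_utf8) and (valid_utf8[consumed] & 0xC0) == 0x80:
        raise ValueError("BadUtf8")
    return output, valid_utf8[consumed:]
```

A parser that reports a `consumed` count landing in the middle of a multi-byte character is rejected by the single-byte check, without rescanning the rest of the input.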
Is the type of output also output here?
I like the idea, though I wonder how people will discover it.
If I'm looking to parse an I64, would I think to use this or know how to do that if I'm a beginner?
Or maybe we could include builtin "parsers" like Num.parseI64 : List U8 -> Result { output: I64, consumed: U64 } []_
And the docs on those would show usage in tandem with the above.
# parse an I64
Str.parseUtf8 "-42" Num.parseI64 == Ok {output: -42, rest: ""}
yeah it probably makes sense to include specific parsers for all the number types in builtins
regardless of whether we also have a generic one
although there are three different commonly useful ways to parse each number type:
1. Str, e.g. "42" becomes 42
2. List U8, parsing it as UTF-8 just like the previous example, e.g. Str.toUtf8 "42" becomes 42
3. List U8, interpreting the bytes directly as the integer, e.g. [3] becomes 3u8 - here, any number type bigger than 1 byte requires specifying the endianness to use in the parsing
and there are 13 number types, so that's +39 builtin functions to cover them all :sweat_smile:
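To illustrate why interpreting raw bytes needs an endianness for anything wider than one byte, here is a small Python sketch using the standard `int.from_bytes`. It is only an illustration of the concept, not a proposed Roc API.

```python
# The same two bytes give different 16-bit integers depending on byte order.
raw = bytes([1, 2])

little = int.from_bytes(raw, byteorder="little", signed=False)  # 0x0201
big = int.from_bytes(raw, byteorder="big", signed=False)        # 0x0102

# A single byte has no ordering ambiguity.
single = int.from_bytes(bytes([3]), byteorder="little", signed=False)
```

Here `little` is 513 and `big` is 258, while `single` is 3 under either byte order.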
Richard Feldman said:
and there are 13 number types, so that's +39 builtin functions to cover them all :sweat_smile:
That's what LLMs are for though, right?
oh I meant that it's like a lot of surface area to add haha
in terms of docs, compilation time, etc.
Oh right... I was just thinking about how I'm going to implement all of those functions... :smiley:
We already discussed and explicitly decided not to add 3.
Obviously we can change opinions, but the opinion back then was just to leave the user to do the bit twiddling.
Part of the issue is that we can't make it symmetric with a nice API. If you have a Num a and want to convert it into bytes, you don't actually want to return a List U8 cause that requires an allocation.
I find that nested function APIs for things like this tend to not be nice. This also only works if you are always pulling from the front of the string (what if you want to trim the end of the string, for example?).
I would push for Str.dropFirstBytes str n, Str.dropLastBytes str n and something that matches List.subList. I think we should just directly give the user the power and make all of those return results.
that's interesting, although at that point we probably also need like a Str.byteAt : Str, U64 -> Result U8 [OutOfBounds]
so you can read the first n bytes yourself and decide how many you want to drop
I was thinking the user would do Str.toUtf8, process the list, then drop from the original string to avoid the UTF-8 validation cost. It would only be for really niche perf use cases.
I think we should add this as an Issue.
Just to clarify, is this an accurate description of what we are thinking?
Str.parseUtf8 :
    Str,
    (List U8 -> Result { output, consumed: U64 } []err)
    -> Result { output, rest: Str } [BadUtf8 Utf8Problem]err
List.parseUtf8 :
    List U8,
    (List U8 -> Result { output, consumed: U64 } []err)
    -> Result { output, rest: List U8 } []err
# parse an I64
Num.parseI64FromStr : List U8 -> Result { output: Str, consumed: U64 } []err
Num.parseI64FromBytes : List U8 -> Result { output: List U8, consumed: U64 } []err
# etc for all 13 number types
# examples
expect Str.parseUtf8 "-42" Num.parseI64 == Ok {output: -42, rest: ""}
expect List.parseUtf8 [45, 52, 50] Num.parseI64 == Ok {output: -42, rest: []}
I think with Brendan's Str.dropFirstBytes we don't need Str.parseUtf8. Str.dropFirstBytes : Str, U64 -> Result Str [BadUtf8 Utf8Problem] is a simpler and more general primitive which allows you to implement Str.parseUtf8 without any performance penalty:
Str.dropFirstBytes : Str, U64 -> Result Str [BadUtf8]
parseUtf8 :
    Str,
    (List U8 -> Result { output, consumed: U64 } err)
    -> Result { output, rest: Str } [BadUtf8, ParseError err]
parseUtf8 = \s, f ->
    {output, consumed} <- s
        |> Str.toUtf8 # O(1)
        |> f
        |> Result.mapErr ParseError
        |> Result.try
    rest <- s
        |> Str.dropFirstBytes consumed # Will be O(1) with built-in
        |> Result.map
    { output, rest }
Also, is there a reason to expose a separate parse function for each number instead of exposing a generic one for Num *? You would still need to implement it once for each number type but it seems like the API surface can be reduced.
@timotree said: "we don't need Str.parseUtf8."
I'm a little confused. I have been trying to implement the last few remaining builtins that were missing. I was tracking a number of "parse" builtins for Str, and the above was a suggestion to reduce the implementation down to Str.parseUtf8.
Are you suggesting that users would implement this when they want this functionality? or maybe saying we can implement it in the builtins using pure Roc and don't need a lowlevel?
Thank you for sharing that example implementation. I could make a PR in a few minutes to add that, at least the Str.parseUtf8 part.
I'm not sure about the helpers like Num.parseI64 etc... if we can just have a single generic Num.parseNum : List U8 -> Result { output: Num *, consumed: U64 } err, that would be awesome.
I'm thinking the new API surface should be
Str.dropFirstBytes : Str, U64 -> Result Str [BadUtf8] # O(1) builtin which just has to UTF-8 validate the first character of the new string
Str.dropLastBytes : Str, U64 -> Result Str [BadUtf8] # O(1) builtin which just has to UTF-8 validate the last character of the new string
NumParseError : [OutOfRange, NotANumber]
Num.parseUtf8 : List U8 -> Result {output: Num *, rest: List U8} NumParseError # Should be a lowlevel so we can get different code for each number type
Num.parse : Str -> Result {output : Num *, rest : Str} NumParseError # Pure Roc wrapper around Num.parseUtf8 using Str.dropFirstBytes
Num.fromStr : Str -> Result (Num *) NumParseError # Pure Roc wrapper around Num.parseUtf8 that errors if rest is nonempty
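A rough Python sketch of how the last three signatures could relate, specialised to I16 for concreteness. The function names are hypothetical, and reporting trailing input from `i16_from_str` as NotANumber is an assumption, since the proposal only says it "errors if rest is nonempty".

```python
from typing import Tuple

I16_MIN, I16_MAX = -(2**15), 2**15 - 1


class NumParseError(Exception):
    """Stands in for NumParseError : [OutOfRange, NotANumber]."""


def parse_i16_utf8(data: bytes) -> Tuple[int, bytes]:
    """Like Num.parseUtf8 at type I16: parse a prefix, return (output, rest)."""
    pos = 0
    negative = pos < len(data) and data[pos] == ord("-")
    if negative:
        pos += 1
    start = pos
    while pos < len(data) and ord("0") <= data[pos] <= ord("9"):
        pos += 1
    if pos == start:
        raise NumParseError("NotANumber")
    value = int(data[start:pos])
    if negative:
        value = -value
    if not I16_MIN <= value <= I16_MAX:
        raise NumParseError("OutOfRange")
    return value, data[pos:]


def i16_from_str(text: str) -> int:
    """Like Num.fromStr: wrap the byte parser and reject leftover input."""
    value, rest = parse_i16_utf8(text.encode("utf-8"))
    if rest:
        # Assumption: leftover input is reported as NotANumber.
        raise NumParseError("NotANumber")
    return value
```

For example, `parse_i16_utf8(b"-42 apples")` yields `-42` with `b" apples"` left over, while `i16_from_str("32768")` fails with OutOfRange.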
which just has to UTF-8 validate the first character of the new string
This part I'd like to understand more.
It's similar to the question I have on https://github.com/roc-lang/roc/pull/7007
Like, can we just drop bytes from a known valid UTF-8 Str and it's still going to be valid?
I'm pretty sure the answer is no, because it's a variable width encoding
I'm guessing the idea here, is that we do Str.fromUtf8 on the remainder... or have some kind of lowlevel which specifically only needs to validate up to 4 bytes deep to check it's still a valid character.
Anyway, I think we have enough here to create an Issue and track this - which is my primary goal. I'm currently busy trying to iron out some issues with my rebuild-host PR and can maybe come back to this another time.
https://github.com/roc-lang/roc/issues/7010
in UTF-8, 0xxxxxxx bytes and 11xxxxxx bytes start characters, and 10xxxxxx bytes continue characters. therefore for it to be valid to drop the first n bytes from a valid UTF-8 string, you need the new string to either be empty or start with a byte that begins a character (0xxxxxxx or 11xxxxxx)
This is a quick-to-check local property and doesn't require revalidating the whole string
I'll post some more details about how to do the quick UTF-8 checks in the issue
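The local check described above can be sketched in a few lines of Python. The names mirror the proposed Str.dropFirstBytes and Str.dropLastBytes but are hypothetical, and `bytes` stands in for a Str whose contents are assumed to already be valid UTF-8.

```python
def is_char_boundary(valid_utf8: bytes, index: int) -> bool:
    """True if cutting already-valid UTF-8 at `index` splits no character."""
    if index < 0 or index > len(valid_utf8):
        return False
    if index == len(valid_utf8):
        return True
    # Continuation bytes look like 10xxxxxx; any other byte starts a character.
    return (valid_utf8[index] & 0b1100_0000) != 0b1000_0000


def drop_first_bytes(valid_utf8: bytes, n: int) -> bytes:
    """Drop the first n bytes; only the first remaining byte is inspected."""
    if not is_char_boundary(valid_utf8, n):
        raise ValueError("BadUtf8")
    return valid_utf8[n:]


def drop_last_bytes(valid_utf8: bytes, n: int) -> bytes:
    """Drop the last n bytes; only the first dropped byte is inspected."""
    cut = len(valid_utf8) - n
    if not is_char_boundary(valid_utf8, cut):
        raise ValueError("BadUtf8")
    return valid_utf8[:cut]
```

Both functions look at a single byte regardless of string length, which is why the check is constant-time. With `"héllo"` encoded as six bytes, dropping 1 or 3 bytes from the front succeeds, while dropping 2 lands inside `é` and is rejected.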
One note on the parse API:
It can be generic over Num *. That said, would it make more sense to split into parseInt and parseFrac? Not sold either way. They just feel like kinda different use cases with different error unions.
An API which is generic over Num * is more expressive because you can use it in generic contexts where you're working with Num *s. You can't go from specific versions for each number type to a generic Num * version in user code like you can in a builtin, because user code has to be open to new number types existing in the future. I don't actually have a use-case for wanting to parse numbers in generic code though, so I don't know if it makes much of a difference
That's definitely true. Just means you have to deal with the full error union even if your specific type only has a subset of the possible errors. Given floats are more complex to parse, I would assume they will have more error variants
Last updated: Jun 16 2026 at 16:19 UTC