Is there a way to use Decode.fromBytesPartial
to decode Utf8 into a Roc type such as a F64, or U64? I am unsure if this has been implemented, and is working. I've been looking through tests and examples but haven't found anything yet. It would be great if this was something that worked in the REPL. I would like to update the builtin docs and also use it in my parsers.
I've been researching different primitives and thought I might attempt to implement a parser for floats and integers; e.g. 1_024
6.022e23
U+00FC0034
0xAC
0o01``0b0001_0002
etc
Yes, it is possible - see https://github.com/roc-lang/roc/blob/8ae88571250a69a2d0119a76df56f731aa55dd8c/crates/compiler/test_gen/src/gen_abilities.rs#L949. That said, you will probably need to write a parser for the exact style of ints/float you want to parse - and it may not be worth implementing a whole decoder in order to do so.
Thanks for that Ayaz! I've got an initial proof of concept working. So far it appears to handle everything I've thrown at it. I had some type errors along the way, but separating things out into different functions seems to have worked well.
Question. is something like this reasonably efficient for parsing floats? How would you test that? My thinking is that if I use walkUntil
to build up the state as I go then the compiler should be able to optimise this so it doesn't need to copy the whole input each time, and can probably do mutations in place.
float : Parser (List U8) F64
float =
buildPrimitiveParser \input ->
digitState =
List.walkUntil input initDigitState \state, elem ->
when elem is
'0' -> state |> pushDigit elem |> Continue
'1' -> state |> pushDigit elem |> Continue
'2' -> state |> pushDigit elem |> Continue
'3' -> state |> pushDigit elem |> Continue
'4' -> state |> pushDigit elem |> Continue
'5' -> state |> pushDigit elem |> Continue
'6' -> state |> pushDigit elem |> Continue
'7' -> state |> pushDigit elem |> Continue
'8' -> state |> pushDigit elem |> Continue
'9' -> state |> pushDigit elem |> Continue
'.' ->
if state.decimalPoint then
Break state
else
state |> pushDecimal |> Continue
'_' -> state |> pushUnderscore |> Continue
_ -> Break state
when Decode.fromBytes digitState.digits Json.fromUtf8 is
Ok n -> Ok { val: n, input: List.drop input digitState.length }
Err _ -> Err (ParsingFailure "Not a float")
DigitState : {
digits : List U8,
length : Nat,
decimalPoint : Bool,
}
initDigitState : DigitState
initDigitState = {
digits : [],
length : 0,
decimalPoint : Bool.false
}
pushDigit : DigitState, U8 -> DigitState
pushDigit = \state, digit ->
{state & length : state.length + 1, digits : List.append state.digits digit}
pushDecimal : DigitState -> DigitState
pushDecimal = \state ->
{state & length : state.length + 1, digits : List.append state.digits '.', decimalPoint : Bool.true}
pushUnderscore : DigitState -> DigitState
pushUnderscore = \state ->
{state & length : state.length + 1}
Not sure if/how much it matters for performance, but you can use or between cases: '0' | '1' -> ...
Also, not sure how well llvm optimizes the switch, but is would question whether it would be faster to use an if. if '0' <= elem && ...
Would definitely need to test though. Hard to judge these kinds of things from just reading the code.
As long as you can verify you aren't allocating and copying by accident, i would be this is ok.
Oh, though due to how roc currently works and some optimization problems, i am pretty sure that pushDecimal
is copying every time.
Lastly, may want to use List.withCapacity
to help minimize allocations
I think the biggest thing that stands out to me is, could you build up the float as you walk the list of bytes instead of pushing them into an intermediate list? That would eliminate all allocation. Or if not, you could instead use the walk to figure what the start/end indices of the input list are, and then do a List.slice
(I forget the exact function name, apologies) at the end to get a sub-list. Once Roc has slices that should require zero allocation for this use case
How would I test this:
Micro benchmark. Just make a corpus and run it through. Loop it many times if necessary to get decently accurate performance results. Then can time either with the platform or a cli tool like hyperfine.
Brendan Hansknecht said:
Not sure if/how much it matters for performance, but you can use or between cases:
'0' | '1' -> ...
aside: the performance should be the same - the compiler desugars or-patterns into multiple independent patterns.
Also FWIW this function is repeating what Decode.fromBytes .. Json.fromUtf8
does (though it also handles _ in the float, which the JSON decoding format does not). It is building a list that is then passed to Decode.fromBytes
, which calls the appropriate float decoder, which itself attempts to parse the list of bytes as a float. It think you'd want to either use this function to immediately decode into a float yourself, or use the JSON decoding of a float from the start - you can call Decode.fromBytesPartial
, and if it succeeds for a float type, you know you've parsed a JSON float with something left over; if it failed, the input was not a JSON float.
I asked gptchat for an implementation of float parsing that didn't allocate, and it said strtof
from libc. I asked how that was implemented and it said it varies by libc but here's an example implementation:
float strtof(const char *str, char **endptr) {
char *p = str;
// Skip leading white space
while (isspace(*p)) p++;
// Check for an optional plus or minus sign
int sign = 1;
if (*p == '+') {
p++;
} else if (*p == '-') {
p++;
sign = -1;
}
// Parse the integer part of the mantissa
float mantissa = 0.0;
while (isdigit(*p)) {
mantissa = mantissa * 10.0 + (*p - '0');
p++;
}
// Check for a decimal point
if (*p == '.') {
p++;
// Parse the fractional part of the mantissa
float frac = 0.0;
float divisor = 1.0;
while (isdigit(*p)) {
frac = frac * 10.0 + (*p - '0');
divisor *= 10.0;
p++;
}
mantissa += frac / divisor;
}
// Check for an optional exponent
if (*p == 'e' || *p == 'E') {
p++;
// Check for an optional exponent sign
int exp_sign = 1;
if (*p == '+') {
p++;
} else if (*p == '-') {
p++;
exp_sign = -1;
}
// Parse the exponent
int exponent = 0;
while (isdigit(*p)) {
exponent = exponent * 10 + (*p - '0');
p++;
}
// Adjust the mantissa and exponent based on the exponent sign
if (exp_sign == 1) {
for (int i = 0; i < exponent; i++) {
mantissa *= 10.0;
}
} else {
for (int i = 0; i < exponent; i++) {
mantissa /= 10.0;
}
}
}
// Return the final float value
if (endptr) *endptr = p;
return sign * mantissa;
}
certainly looks like it should work, but then again gptchamp is outstanding at generating stuff that looks like it should work but is actually wrong :laughing:
so might be worth looking up strtof
in a specific libc implementation, e.g. musl
Hey thanks gptchat! That's neat. I think I might be able make a version that doesn't allocate, or at least builds up the float as it parses. It's relatively easy to follow along with that source. I'll see if I can find another copy in glibc or something.
ha ha ha I do not recommend trying to understand glibc source code! Try musl libc instead!
maybe you're more familiar than me though, in which case, go for it!
Definitely not familiar here, just a curious adventure into the unknown...
Last updated: Jul 05 2025 at 12:14 UTC