Decode.fromBytesPartial · beginners

Is there a way to use Decode.fromBytesPartial to decode Utf8 into a Roc type such as a F64, or U64? I am unsure if this has been implemented, and is working. I've been looking through tests and examples but haven't found anything yet. It would be great if this was something that worked in the REPL. I would like to update the builtin docs and also use it in my parsers.

Luke Boswell (Dec 06 2022 at 04:01):

I've been researching different primitives and thought I might attempt to implement a parser for floats and integers; e.g. 1_024 6.022e23 U+00FC0034 0xAC 0o01``0b0001_0002 etc

Ayaz Hafiz (Dec 06 2022 at 04:38):

Luke Boswell (Dec 06 2022 at 05:48):

Thanks for that Ayaz! I've got an initial proof of concept working. So far it appears to handle everything I've thrown at it. I had some type errors along the way, but separating things out into different functions seems to have worked well.

Question. is something like this reasonably efficient for parsing floats? How would you test that? My thinking is that if I use walkUntil to build up the state as I go then the compiler should be able to optimise this so it doesn't need to copy the whole input each time, and can probably do mutations in place.

float : Parser (List U8) F64
float =
    buildPrimitiveParser \input ->
        digitState =
            List.walkUntil input initDigitState \state, elem ->
                when elem is
                    '0' -> state |> pushDigit elem |> Continue
                    '1' -> state |> pushDigit elem |> Continue
                    '2' -> state |> pushDigit elem |> Continue
                    '3' -> state |> pushDigit elem |> Continue
                    '4' -> state |> pushDigit elem |> Continue
                    '5' -> state |> pushDigit elem |> Continue
                    '6' -> state |> pushDigit elem |> Continue
                    '7' -> state |> pushDigit elem |> Continue
                    '8' -> state |> pushDigit elem |> Continue
                    '9' -> state |> pushDigit elem |> Continue
                    '.' ->
                        if state.decimalPoint then
                            Break state
                        else
                            state |> pushDecimal |> Continue
                    '_' -> state |> pushUnderscore |> Continue
                    _ -> Break state

        when Decode.fromBytes digitState.digits Json.fromUtf8 is
            Ok n -> Ok { val: n, input: List.drop input digitState.length }
            Err _ -> Err (ParsingFailure "Not a float")

DigitState : {
    digits : List U8,
    length : Nat,
    decimalPoint : Bool,
}

initDigitState : DigitState
initDigitState = {
    digits : [],
    length : 0,
    decimalPoint : Bool.false
}

pushDigit : DigitState, U8 -> DigitState
pushDigit = \state, digit ->
    {state & length : state.length + 1, digits : List.append state.digits digit}

pushDecimal : DigitState -> DigitState
pushDecimal = \state ->
    {state & length : state.length + 1, digits : List.append state.digits '.', decimalPoint : Bool.true}

pushUnderscore : DigitState -> DigitState
pushUnderscore = \state ->
    {state & length : state.length + 1}

Brendan Hansknecht (Dec 06 2022 at 06:02):

Not sure if/how much it matters for performance, but you can use or between cases: '0' | '1' -> ...

Brendan Hansknecht (Dec 06 2022 at 06:04):

Also, not sure how well llvm optimizes the switch, but is would question whether it would be faster to use an if. if '0' <= elem && ...

Would definitely need to test though. Hard to judge these kinds of things from just reading the code.

Brendan Hansknecht (Dec 06 2022 at 06:05):

As long as you can verify you aren't allocating and copying by accident, i would be this is ok.

Brendan Hansknecht (Dec 06 2022 at 06:05):

Oh, though due to how roc currently works and some optimization problems, i am pretty sure that pushDecimal is copying every time.

Brendan Hansknecht (Dec 06 2022 at 06:06):

Ayaz Hafiz (Dec 06 2022 at 06:07):

I think the biggest thing that stands out to me is, could you build up the float as you walk the list of bytes instead of pushing them into an intermediate list? That would eliminate all allocation. Or if not, you could instead use the walk to figure what the start/end indices of the input list are, and then do a List.slice (I forget the exact function name, apologies) at the end to get a sub-list. Once Roc has slices that should require zero allocation for this use case

Brendan Hansknecht (Dec 06 2022 at 06:08):

How would I test this:
Micro benchmark. Just make a corpus and run it through. Loop it many times if necessary to get decently accurate performance results. Then can time either with the platform or a cli tool like hyperfine.

Ayaz Hafiz (Dec 06 2022 at 06:10):

aside: the performance should be the same - the compiler desugars or-patterns into multiple independent patterns.

Ayaz Hafiz (Dec 06 2022 at 06:14):

Also FWIW this function is repeating what Decode.fromBytes .. Json.fromUtf8 does (though it also handles _ in the float, which the JSON decoding format does not). It is building a list that is then passed to Decode.fromBytes, which calls the appropriate float decoder, which itself attempts to parse the list of bytes as a float. It think you'd want to either use this function to immediately decode into a float yourself, or use the JSON decoding of a float from the start - you can call Decode.fromBytesPartial, and if it succeeds for a float type, you know you've parsed a JSON float with something left over; if it failed, the input was not a JSON float.

Richard Feldman (Dec 06 2022 at 06:50):

I asked gptchat for an implementation of float parsing that didn't allocate, and it said strtof from libc. I asked how that was implemented and it said it varies by libc but here's an example implementation:

float strtof(const char *str, char **endptr) {
  char *p = str;
  // Skip leading white space
  while (isspace(*p)) p++;

  // Check for an optional plus or minus sign
  int sign = 1;
  if (*p == '+') {
    p++;
  } else if (*p == '-') {
    p++;
    sign = -1;
  }

  // Parse the integer part of the mantissa
  float mantissa = 0.0;
  while (isdigit(*p)) {
    mantissa = mantissa * 10.0 + (*p - '0');
    p++;
  }

  // Check for a decimal point
  if (*p == '.') {
    p++;

    // Parse the fractional part of the mantissa
    float frac = 0.0;
    float divisor = 1.0;
    while (isdigit(*p)) {
      frac = frac * 10.0 + (*p - '0');
      divisor *= 10.0;
      p++;
    }
    mantissa += frac / divisor;
  }

  // Check for an optional exponent
  if (*p == 'e' || *p == 'E') {
    p++;

    // Check for an optional exponent sign
    int exp_sign = 1;
    if (*p == '+') {
      p++;
    } else if (*p == '-') {
      p++;
      exp_sign = -1;
    }

    // Parse the exponent
    int exponent = 0;
    while (isdigit(*p)) {
      exponent = exponent * 10 + (*p - '0');
      p++;
    }

    // Adjust the mantissa and exponent based on the exponent sign
    if (exp_sign == 1) {
      for (int i = 0; i < exponent; i++) {
        mantissa *= 10.0;
      }
    } else {
      for (int i = 0; i < exponent; i++) {
        mantissa /= 10.0;
      }
    }
  }

  // Return the final float value
  if (endptr) *endptr = p;
  return sign * mantissa;
}

Richard Feldman (Dec 06 2022 at 06:50):

certainly looks like it should work, but then again gptchamp is outstanding at generating stuff that looks like it should work but is actually wrong :laughing:

Richard Feldman (Dec 06 2022 at 06:50):

so might be worth looking up strtof in a specific libc implementation, e.g. musl

Luke Boswell (Dec 06 2022 at 07:30):

Hey thanks gptchat! That's neat. I think I might be able make a version that doesn't allocate, or at least builds up the float as it parses. It's relatively easy to follow along with that source. I'll see if I can find another copy in glibc or something.

Brian Carroll (Dec 06 2022 at 21:52):

ha ha ha I do not recommend trying to understand glibc source code! Try musl libc instead!

Stream: beginners

Topic: Decode.fromBytesPartial

Luke Boswell (Dec 06 2022 at 03:51):

Luke Boswell (Dec 06 2022 at 04:01):

Ayaz Hafiz (Dec 06 2022 at 04:38):

Luke Boswell (Dec 06 2022 at 05:48):

Brendan Hansknecht (Dec 06 2022 at 06:02):

Brendan Hansknecht (Dec 06 2022 at 06:04):

Brendan Hansknecht (Dec 06 2022 at 06:05):

Brendan Hansknecht (Dec 06 2022 at 06:05):

Brendan Hansknecht (Dec 06 2022 at 06:06):

Ayaz Hafiz (Dec 06 2022 at 06:07):

Brendan Hansknecht (Dec 06 2022 at 06:08):

Ayaz Hafiz (Dec 06 2022 at 06:10):

Ayaz Hafiz (Dec 06 2022 at 06:14):

Richard Feldman (Dec 06 2022 at 06:50):

Richard Feldman (Dec 06 2022 at 06:50):

Richard Feldman (Dec 06 2022 at 06:50):

Luke Boswell (Dec 06 2022 at 07:30):

Brian Carroll (Dec 06 2022 at 21:52):

Brian Carroll (Dec 06 2022 at 21:52):

Luke Boswell (Dec 06 2022 at 22:04):