Fast json decoding problem · contributing

Luke Boswell (Jul 22 2023 at 21:41):

This is an attempt to describe a problem I think I have with a design for a fast JSON decoder. I am a fair way off being blocked on this, as I haven't implemented the stage-1 pre-processing code and don't know exactly what it will look like. However, I'm trying to scope out how that stage might integrate with the decode ability, as there are things from simdjson etc. that don't translate directly into the Roc context.

Person : {
  name : Str,
  contacts : {
    email : Str,
    phone : Str,
  }
}

So the user passes List U8 bytes into Decode.fromBytes etc., and Roc will call the decodeRecord function in the JSON implementation, which is currently implemented using Decode.custom, something like the following:

decodeRecord = \initialState, stepField, finalizer -> Decode.custom \bytes, @Json { ... decoder state } ->

    # Recursively build up record from object field:value pairs
    decodeFields = \recordState, bytesBeforeField -> ...

When this implementation gets to the contacts field, it will retrieve a decoder and call Decode.decodeWith passing in the sublist of List U8 bytes for the contacts field. In this case this will be decodeRecord because this field is also an object.

The idea I currently have for implementing a fast JSON decoder is to have a preprocessing step to identify the document structure and then use that information to slice into the original input bytes.
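For illustration only, here is a minimal scalar sketch of what such a pre-processing pass could look like. It is not the stage-1 code discussed in this thread (that has not been written yet) and it uses no SIMD. The module name `StructuralIndex`, the function `structuralIndexes`, and the shape of its state record are all hypothetical. It is written in the 2023-era Roc syntax used elsewhere in this thread. The function walks the input once and records the index of every structural byte that sits outside a string literal.

```roc
interface StructuralIndex
    exposes [structuralIndexes]
    imports []

# Byte values written as numbers to avoid relying on escape handling
quote = 0x22 # double quote
backslash = 0x5C # backslash

# Hypothetical stage-1 pass: returns the index of every structural byte
# ({ } [ ] : , and string quotes) that is not inside a string literal.
structuralIndexes = \bytes ->
    initial = { index: 0, inString: Bool.false, escaped: Bool.false, found: [] }

    final = List.walk bytes initial \state, byte ->
        # By default advance the index and clear the escape flag
        next = { state & index: state.index + 1, escaped: Bool.false }

        if state.inString then
            if state.escaped then
                # The byte after a backslash never ends the string
                next
            else if byte == backslash then
                { next & escaped: Bool.true }
            else if byte == quote then
                # Closing quote
                { next & inString: Bool.false, found: List.append state.found state.index }
            else
                next
        else if byte == quote then
            # Opening quote
            { next & inString: Bool.true, found: List.append state.found state.index }
        else if List.contains ['{', '}', '[', ']', ':', ','] byte then
            { next & found: List.append state.found state.index }
        else
            next

    final.found

expect
    # Input is {"a":"x,y"} and the comma inside the string is not reported
    input = Str.toUtf8 "{\"a\":\"x,y\"}"
    structuralIndexes input == [0, 1, 3, 4, 5, 9, 10]
```

The resulting indexes could then be paired up to derive start and length values for slicing into the original input bytes.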

One problem with this idea is that Decode.custom is provided a List U8 bytes and this is the only information we have to work with. So if we preprocessed the input in an earlier stage (a separate function call), the results of that preprocessing are not available here.

One idea might be to preprocess the JSON document and store the original input bytes and field indexes in the decoder state @Json {inputBytes : List U8, fieldSlices : ... }. Then maybe some special sequence of bytes could flag that the preprocessed information should be used to get the bytes we want to process before proceeding with decoding. Or maybe this special sequence itself includes the information required to slice into the original input bytes.
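As a rough sketch of the first half of that idea, the snippet below keeps the input bytes and a list of slices together in one state value and looks up a field's bytes by position. Everything here is an assumption made for illustration: the module name `FieldSlices`, the function `fieldBytes`, the `NoSuchField` tag, and in particular the `{ start, len }` shape chosen for each entry of `fieldSlices`, which the description above leaves open. A plain record stands in for the opaque @Json state so the example is standalone.

```roc
interface FieldSlices
    exposes [fieldBytes]
    imports []

# Hypothetical state shape, one slice per field value found by stage 1:
#   { inputBytes : List U8, fieldSlices : List { start, len } }

# Return the bytes of the nth field value without re-scanning the input
fieldBytes = \state, n ->
    when List.get state.fieldSlices n is
        Ok { start, len } -> Ok (List.sublist state.inputBytes { start, len })
        Err OutOfBounds -> Err NoSuchField

expect
    # Input is {"name":"Ann"} and the value "Ann" (with quotes) spans bytes 8 to 12
    state = {
        inputBytes: Str.toUtf8 "{\"name\":\"Ann\"}",
        fieldSlices: [{ start: 8, len: 5 }],
    }
    fieldBytes state 0 == Ok (Str.toUtf8 "\"Ann\"")
```

This does not address how the lookup would be triggered from inside Decode.custom, which is the open question here.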

I'm not sure if this is a good problem description... I am likely missing something obvious and feel like we can probably do what we need with the current implementation.

Luke Boswell (Jul 22 2023 at 22:58):

It may also not be that important to solve this; I should probably use benchmarks to test some ideas. If the preprocess stage is fast enough, it may not be that bad to run it each time we decode a new object/record and still use the current recursive descent strategy.

Ayaz Hafiz (Jul 23 2023 at 01:51):

What are the limitations of storing the offset information in the decoder state? That is where my head was at. I think I do not totally follow what the downsides of that approach are.

Luke Boswell (Aug 09 2023 at 10:38):

I've had a bit of a breakthrough and made some progress putting things into the decoder state. :octopus:

% roc check package/Core.roc
An internal compiler expectation was broken.
This is definitely a compiler bug.
Please file an issue here: https://github.com/roc-lang/roc/issues/new/choose
thread '<unnamed>' panicked at 'ambient lambda set function import is not a function, found: Error', crates/compiler/solve/src/module.rs:182:36
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I've uploaded the relevant code to this gist, and you can see on line 2070 where I have isolated the issue.

I've tried a bunch of different things, re-structuring the code with functions and type annotations etc, but I can't seem to get it to type check.

Richard Feldman (Aug 09 2023 at 10:59):

Ayaz Hafiz (Aug 09 2023 at 14:16):

Nikita Tchayka (Aug 09 2023 at 14:53):

Anton (Aug 09 2023 at 14:56):

A tool that Ayaz made to debug the type solver (checker + inference + specialization engine), you can find it here.

Anton (Aug 09 2023 at 14:57):

Luke Boswell (Aug 09 2023 at 20:09):

I can try minimising. It's the Decode.decodeWith part that causes it, so I think I have to keep all the other unrelated decode ability functions around.

Luke Boswell (Aug 10 2023 at 04:59):

I tried implementing the function using List.walkUntil instead of recursion but still get the same issue.

decodeRecordPreProcessed = \stepField, finalizer, initialState, @Json ds ->
    when ds.structure is
        JsonObject fields -> decodeRecordPreProcessedHelp stepField finalizer (@Json ds) initialState fields
        _ -> crash "unreachable, pre-processed string index"

# Check each field/value pair of the object and decode if it is required
decodeRecordPreProcessedHelp = \stepField, finalizer, @Json ds, initialState, recordFieldValues ->

    help = \recordState, recordFieldValue ->
        result =
            # Decode the field name
            fieldNameStr <- decodeObjectFieldName recordFieldValue.field (@Json ds) |> Result.map

            # Retrieve value decoder for the current field
            when stepField recordState fieldNameStr is

                # Skip the field and value, leave record state unchanged
                Skip ->

                    recordState

                # Decode the value using the decoder from the recordState
                Keep valueDecoder ->

                    # UNCOMMENT TO 'STOP COMPILER BUG'
                    # { result: Err TooShort, rest: [] }

                    # COMMENTING OUT BELOW TO 'STOP COMPILER BUG'
                    Decode.decodeWith [] valueDecoder (objectFieldValueDecoder (@Json {ds & structure: recordFieldValue.value}))

        when result is
            Err _ ->

                # Return early, failed to decode the field
                Break recordState

            Ok updatedRecordState ->

                # Decode the next field, passing updated recordState
                Continue updatedRecordState

    finalRecordState = List.walkUntil recordFieldValues initialState help

    # Build final record
    when finalizer finalRecordState is
        Ok record -> { result: Ok record, rest: [] }
        Err _ -> { result: Err TooShort, rest: [] }

Ayaz Hafiz (Aug 10 2023 at 05:31):

Thanks for the update Luke. I’ll take a look tomorrow morning (central US time), but I suspect a minimal reproducer will still be necessary.
