builtin serialization format · ideas

Stream: ideas

Topic: builtin serialization format

Richard Feldman (Dec 21 2023 at 13:55):

I also think it would be sweet to have a builtin binary serialization format that you didn't need to use a package to get

Richard Feldman (Dec 21 2023 at 13:56):

that lets you serialize any Roc type that has Encode and Decode efficiently and without any loss of information

Richard Feldman (Dec 21 2023 at 13:57):

there are some interesting design questions there, which are probably worth some discussion in #ideas

Richard Feldman (Dec 21 2023 at 13:58):

for example, if it's a builtin, one encoding/decoding strategy we could use is to copy all the bytes from memory directly into a List U8 and not store any information in the bytes themselves about what they contain

Richard Feldman (Dec 21 2023 at 13:58):

so it's like you need to know exactly what Roc type you're serializing to/from

Richard Feldman (Dec 21 2023 at 13:59):

however, that could be error-prone in that if you serialize something, store it, then later change the type slightly, and then deserialize it, decoding wouldn't fail but rather would cause undefined behavior - so that design doesn't sound reasonable :sweat_smile:

Richard Feldman (Dec 21 2023 at 14:00):

one way to solve that problem would be to have the compiler include a hash of the type, computed at compile time, which could be inserted at the beginning of the binary; that would give you a quick yes/no answer as to whether you're decoding the same shape of thing that was encoded, which would prevent the undefined behavior (unless there was a hash collision, but using blake3 or something like that would make a collision unlikely enough to not be worth worrying about)
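
A minimal sketch of the type-hash prefix idea, in Python for illustration only. The type signature string, the 16-byte digest length, and the function names are all assumptions, and `blake2b` stands in for blake3 because blake3 is not in Python's standard library. In the proposal the hash would be computed by the compiler at compile time rather than at runtime.

```python
import hashlib

HASH_LEN = 16  # digest bytes stored in front of the payload (an assumption)


def type_hash(type_signature: str) -> bytes:
    # blake2b is a stand-in for blake3, which is not in the standard library.
    return hashlib.blake2b(type_signature.encode("utf-8"), digest_size=HASH_LEN).digest()


def encode(type_signature: str, payload: bytes) -> bytes:
    # Prefix the raw payload with the hash of the type it was encoded from.
    return type_hash(type_signature) + payload


def decode(type_signature: str, data: bytes) -> bytes:
    # A quick yes/no check before trusting the layout of the remaining bytes.
    if len(data) < HASH_LEN or data[:HASH_LEN] != type_hash(type_signature):
        raise ValueError("type hash mismatch: data was encoded from a different type")
    return data[HASH_LEN:]


old_type = "{ name : Str, age : U8 }"   # hypothetical type signatures
new_type = "{ name : Str, age : U16 }"
blob = encode(old_type, b"\x03bob\x2a")
```

Decoding `blob` with `old_type` returns the payload, while decoding it with `new_type` raises instead of reinterpreting the bytes.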

Richard Feldman (Dec 21 2023 at 14:01):

another way to solve it would be to tag each type as it's encoded, e.g. "the following is a string of length N" - so, more like what CBOR does
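
A rough sketch of the tagging approach, in Python for illustration. The tag values and the 4-byte length prefix are made up here and are not CBOR's actual encoding; the point is only that each value carries a marker saying what follows, so a mismatch is detected instead of misread.

```python
import struct

TAG_STR = 0x01  # hypothetical tag values, not CBOR's real major types
TAG_U64 = 0x02


def encode_value(value) -> bytes:
    if isinstance(value, str):
        raw = value.encode("utf-8")
        # "the following is a string of length N"
        return bytes([TAG_STR]) + struct.pack("<I", len(raw)) + raw
    if isinstance(value, int) and 0 <= value < 2**64:
        return bytes([TAG_U64]) + struct.pack("<Q", value)
    raise TypeError(f"unsupported value: {value!r}")


def decode_value(data: bytes, expected_tag: int):
    # Fail cleanly when the stored tag is not what the decoder expects.
    if not data or data[0] != expected_tag:
        raise ValueError("tag mismatch")
    if expected_tag == TAG_STR:
        (length,) = struct.unpack_from("<I", data, 1)
        if len(data) < 5 + length:
            raise ValueError("truncated string")
        return data[5:5 + length].decode("utf-8")
    if expected_tag == TAG_U64:
        (number,) = struct.unpack_from("<Q", data, 1)
        return number
    raise ValueError("unknown tag")
```

The cost relative to the hash approach is one extra byte (or more) per value, in exchange for data that can be inspected without knowing the type up front.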

Richard Feldman (Dec 21 2023 at 14:03):

also there are tradeoffs around how to represent integers; on the one hand, you can optimize for saving bytes by doing varints like protobuf does, or you can optimize for fewer instructions needed to decode, by storing the exact (little-endian, presumably) integer bytes, even if a lot of them end up being zeroes
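
To make the integer tradeoff concrete, here is a small Python sketch comparing an unsigned LEB128 varint (the base-128 scheme protobuf uses) with fixed-width little-endian bytes. The function names are illustrative, and only unsigned values are handled.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte, high bit means 'more follows'."""
    if n < 0:
        raise ValueError("this sketch only handles unsigned values")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes) -> int:
    # Decoding needs a loop with a branch per byte.
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


def encode_fixed_u64(n: int) -> bytes:
    # Always 8 bytes, even when most of them are zero.
    return n.to_bytes(8, "little")


def decode_fixed_u64(data: bytes) -> int:
    # Decoding is a single fixed-size read.
    return int.from_bytes(data[:8], "little")
```

For the value 5, the varint is 1 byte and the fixed encoding is 8 bytes; the varint saves space for small values but takes more work to decode.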

Richard Feldman (Dec 21 2023 at 14:03):

anyway, I think it's a good idea to talk through those tradeoffs!

Notification Bot (Dec 21 2023 at 14:03):

10 messages were moved here from #contributing > Binary encoding/decoding by Richard Feldman.

Hannes Nevalainen (Dec 21 2023 at 14:21):

This is one of the killer features of the Erlang VM: all terms/types/values, even anonymous functions, can be serialized to and deserialized from a binary format.
It is super handy to just dump your state to disk and then read it back exactly like it was without having to define some custom serialization format.

Maybe it is not as straightforward to do in Roc with all the types and such, but it would be amazing if it is possible :)

Keywords for the curious reader:
Erlang Term Format, ETF, erlang:term_to_binary, erlang:binary_to_term

Brendan Hansknecht (Dec 21 2023 at 14:35):

Personally for the standard library, I would push for a rather direct memcpy style binary format that just updates all the pointers to be byte offsets.
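
One way to picture "pointers become byte offsets", sketched in Python under invented assumptions: a list of strings is written as a count, then one (offset, length) pair per element where a pointer would be in memory, then the string bytes. The layout and names are hypothetical and not Roc's actual in-memory representation.

```python
import struct


def flatten_strings(strings: list[str]) -> bytes:
    raws = [s.encode("utf-8") for s in strings]
    header_size = 4 + 8 * len(raws)  # u32 count, then (u32 offset, u32 length) each
    header = bytearray(struct.pack("<I", len(raws)))
    body = bytearray()
    for raw in raws:
        # Where memory would hold a pointer, store an offset from the buffer start.
        header += struct.pack("<II", header_size + len(body), len(raw))
        body += raw
    return bytes(header + body)


def read_string(buf: bytes, index: int) -> str:
    # Random access without decoding the whole buffer: follow the offset.
    (count,) = struct.unpack_from("<I", buf, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    offset, length = struct.unpack_from("<II", buf, 4 + 8 * index)
    return buf[offset:offset + length].decode("utf-8")


buf = flatten_strings(["roc", "", "héllo"])
```

Because offsets are relative to the start of the buffer, the same bytes stay valid wherever they are loaded, which is what makes this style suitable for writing to disk.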

Brendan Hansknecht (Dec 21 2023 at 14:36):

That said, I would advise tagging the data for debuggability, and also to enable platforms to dynamically load the format if wanted.

Brendan Hansknecht (Dec 21 2023 at 14:37):

We would only need to tag the primitives that encode supports (oh, also, dicts probably should just be encoded as key value lists with no impl details)

Brendan Hansknecht (Dec 21 2023 at 14:37):

I would probably densely pack the nested tag at the beginning of the structure and put all the data afterwards

Brendan Hansknecht (Dec 21 2023 at 14:38):

As long as encode doesn't add new primitives, the tag should stay stable.

Brendan Hansknecht (Dec 21 2023 at 14:39):

That said, not exactly sure how we deal with opaque types and more complex things of that nature. Though maybe that isn't a big issue cause they still have to encode to primitives.

Sky Rose (Dec 21 2023 at 14:40):

Perhaps the version of Roc / the encoder should also be included alongside the type.

Brendan Hansknecht (Dec 21 2023 at 14:41):

Yeah

Brendan Hansknecht (Dec 21 2023 at 14:41):

Just give the binary format a version based on encode and its primitives.

Brendan Hansknecht (Dec 21 2023 at 14:42):

Could even make it gracefully backwards compatible with that.

Sky Rose (Dec 21 2023 at 14:43):

Is it desirable for this to be a well-specced format that other languages could read/write to besides the roc encoder/decoder?
I think this might be the same problem as reading data dynamically for a debugger or reading multiple encoder versions for migrations.

Richard Feldman (Dec 21 2023 at 14:51):

Sky Rose said:

Is it desirable for this to be a well-specced format that other languages could read/write to besides the roc encoder/decoder?

I think so, although I'm not thinking of it as a hard requirement.

I think the key design element that would separate it from other formats would be that it's specifically designed to be a great default choice for Roc (like how JSON is for JavaScript), but it would be even more useful if other languages could read/write it too.

Asier Elorz (he/him) (Dec 21 2023 at 19:03):

I am not the biggest fan. Two things mainly,

First, there are a lot of serialization formats. There are open serialization formats defined for most use cases and tradeoffs. There is no real need to invent a new one. https://xkcd.com/927/

I am not against including support for existing serialization formats in the standard library. That can be a good idea. I just wouldn't invent a new format.

Brendan Hansknecht said:

Personally for the standard library, I would push for a rather direct memcpy style binary format that just updates all the pointers to be byte offsets.

Second, the default serialization format that comes with the language will probably be the thing users will go to by default. It should favor more being reliable and forward compatible than being fast. JSON is probably a much better candidate for the serialization needs of a user that is not actively making the choice of what serialization format to use than a binary dump. It is just too easy to change anything and invalidate the saved data in these formats. If a user needs the loading performance of a binary dump of the runtime data structures, they can decide that by themselves and make it an active choice, but it is not a good default. AAA games do a lot of extra work in order to have the final data that ships to players in formats like these. It is not by any means something that comes for free in terms of development cost.

Brendan Hansknecht (Dec 21 2023 at 20:19):

Just because it is in the standard library doesn't mean it needs to be used. A user should just as easily be able to use JSON if they want to.

Brendan Hansknecht (Dec 21 2023 at 20:20):

I think no matter what format we pick, the end user would actively be making the choice to use it.

Brendan Hansknecht (Dec 21 2023 at 20:22):

I do agree that we could pick an existing format instead of using a roc specific one. I think a roc specific one would be easier to implement, but either should be fine.

Kevin Gillette (Dec 22 2023 at 16:24):

There have been many language-specific formats that end up not being used. For the languages I'm familiar with, Python's pickle has had the most popularity, but most projects seem to use it only for bootstrapping and then move beyond it, since interop beyond the origin language never meaningfully materializes.

Brendan Hansknecht (Dec 22 2023 at 16:28):

In roc's case, I think it really only matters if someone writes a version for the main platform languages.

Brendan Hansknecht (Dec 22 2023 at 16:28):

I don't think it really needs any sort of use past that.

Brendan Hansknecht (Dec 22 2023 at 16:30):

Using JSON to send data to the platform is unnecessary overhead. Just sending the roc data in memory would be best, but that doesn't just work. It would at least need extra runtime annotations sent over. Otherwise the platform wouldn't know how to use it.

Brendan Hansknecht (Dec 22 2023 at 16:31):

This is kinda a step past just sending typing info. It would be automatically derived and a stable offset based format that can be serialized to disk.

Richard Feldman (Dec 22 2023 at 16:35):

this is a good point - could be very useful for the "platform exposes an API for calling functions from dynamic libraries" use case

Kevin Gillette (Dec 22 2023 at 18:10):

I'm not as convinced about a language-specific format either. Designing the format is one thing, but building enough tooling to provide enough trust to use that format for production is quite an undertaking.

If a language offers a native format, I would expect it to have:

1. Complete data model support (of course).
2. Broad compatibility across minor versions in the language as a transmission format.
3. Compatibility across cpu architectures (i.e. arm32 can write data that can be seamlessly read by amd64 without any ahead-of-time preparation/configuration/negotiation).
4. Zero-configuration reasonable encoding, including transparent decoding of older format revisions.
5. Probably some grand announcement about transparent heterogeneous distributed computing, and how strictly a new format was needed for this. (The Pony language team is one of several that have historically thought about this kind of thing.)

Point 5 is of course tongue-in-cheek (though it does happen), but out of the others, it sounds like we've only talked about point 1, and the memcpy stuff we've discussed would rule out point 3, at least without some kind of endianness indicator in a format header.
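
A minimal sketch of what an endianness indicator in a format header could look like, in Python for illustration. The magic bytes, the one-byte flag, and the function names are hypothetical; the idea is that the writer records its byte order and the reader uses that flag instead of assuming its own.

```python
import struct
import sys

MAGIC = b"RBIN"      # hypothetical magic bytes for the format
LITTLE, BIG = 0, 1   # hypothetical one-byte endianness flag values


def write_u32s(values, byteorder: str = sys.byteorder) -> bytes:
    # Write in the producer's byte order and say so in the header.
    flag = LITTLE if byteorder == "little" else BIG
    prefix = "<" if flag == LITTLE else ">"
    return MAGIC + bytes([flag]) + struct.pack(f"{prefix}{len(values)}I", *values)


def read_u32s(data: bytes) -> list[int]:
    if len(data) < 5 or data[:4] != MAGIC:
        raise ValueError("not a recognized buffer")
    flag = data[4]
    if flag not in (LITTLE, BIG):
        raise ValueError("unknown endianness flag")
    # Decode using the writer's byte order, whatever the reader's CPU is.
    prefix = "<" if flag == LITTLE else ">"
    count = (len(data) - 5) // 4
    return list(struct.unpack_from(f"{prefix}{count}I", data, 5))
```

With a flag like this, a reader on the same architecture can still use the bytes as-is, and only a reader with the opposite byte order pays for swapping.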

JRI98 (Dec 22 2023 at 18:21):

This reminds me of Golang's gob format
https://go.dev/blog/gob
https://pkg.go.dev/encoding/gob

Richard Feldman (Dec 22 2023 at 18:27):

Richard Feldman said:

this is a good point - could be very useful for the "platform exposes an API for calling functions from dynamic libraries" use case

in this case, the "include a hash of the expected and received layouts" idea could be potentially useful, because it could mean that the host could verify the hash once (against a known constant that could be hardcoded), and if the message contains that constant, then it could treat the rest of the List U8 sent from the application as having the correct struct layout such that it could just point to it directly

:thinking: although I guess if the platform (as opposed to the application) performed the serialization, then it wouldn't even need that check.

so I guess in that world, any design that could - as quickly as possible - turn a Roc type with the Encoding ability into a List U8 would be a good fit for that specific use case, even if nobody ever had any use for it as a generic production serialization format

Kevin Gillette (Dec 22 2023 at 18:28):

I haven't seen much real use of Gob in 12 years of continual professional experience with the language. It just doesn't check enough boxes to see widespread use, even though it's been a stable format.

iirc, gob was initially started to support Go channels over a network (point 5 above), but that originating use case ended up being abandoned.

Kevin Gillette (Dec 22 2023 at 19:13):

Some observations:

For archiving data, interop is critical, so Roc's own format isn't going to be a win there. The better fit is either a columnar format or some compression atop a ubiquitous structured standard, depending on needs.
In practice, compression beats everything else you can do with the format itself in terms of size. Real-world compressed JSON is considerably more compact than uncompressed CBOR, Protobuf, or even Avro. Uncompressed Avro is way smaller than uncompressed JSON, but compressed Avro is not significantly smaller than compressed JSON.
As such, varint might not buy us anything compared to storing every I128 as 16 bytes if we assume compression will be applied afterwards. If most of those I128 values are close to zero, they'll compress to about the same size as a varint anyways; the compression algorithm will do more work, but the core format encoder will do less work.
CBOR and similar do a decent job of semantic encoding: 5i128 and 5u8 will encode to the same thing (based on value, not type), and it's really up to the decoder to unpack it into the expected types. Whether this is valuable to us depends on whether we want to be value-centric or type-centric (do we _need_ to mark that a 5 is an I8 or U16 or whatever, or do we just need to have the decoder do value range checking)?
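
The compression observation is easy to try out. The Python sketch below writes small values as fixed 16-byte little-endian integers and runs the result through `zlib`; the data set is invented, and real-world ratios depend entirely on the data and the compressor, so this only shows how one might measure it.

```python
import zlib


def fixed_i128(values) -> bytes:
    # Every value takes 16 bytes, mostly zeroes when values are near zero.
    return b"".join(v.to_bytes(16, "little", signed=True) for v in values)


values = [n % 100 for n in range(10_000)]  # made-up data: small, repetitive values
raw = fixed_i128(values)
packed = zlib.compress(raw, 9)

# Each value here is below 128, so a varint encoding would use 1 byte per value.
varint_size = len(values)

print(len(raw), len(packed), varint_size)
```

Comparing `len(packed)` against `varint_size` on representative data is one way to check whether varints would still pay for themselves once compression is applied.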

Kevin Gillette (Dec 22 2023 at 19:37):

For any external use, it seems to me that it'd be a great investment to focus on a canonical serialization data model (tags seem like the only novel thing to figure out), and then a canonical encoding atop a common textual format and a common binary format.

For example: "here's the canonical way to represent tags with payloads in JSON" (i.e. an array where the tag name is the first element). Likewise, providing a pretty good canonical mapping/encoding atop something like Avro could allow Roc to easily interoperate with a lot of modern technology, and for free (such as at compile-time), we'd be able to generate language-independent schema descriptions, check those in, and then have the compiler verify that compatibility with past format revisions is not broken.
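
As an illustration of the "array where the tag name is the first element" mapping, here is a short Python sketch. The function names are hypothetical, and this is only one possible convention, not an agreed Roc encoding.

```python
import json


def tag_to_json(name: str, *payload) -> str:
    # A tag with payloads becomes [name, payload1, payload2, ...].
    return json.dumps([name, *payload])


def tag_from_json(text: str):
    value = json.loads(text)
    if not isinstance(value, list) or not value or not isinstance(value[0], str):
        raise ValueError("expected a JSON array whose first element is the tag name")
    return value[0], value[1:]
```

Under this convention a tag `Ok` with payload 42 is written as `["Ok", 42]`, and a payload-free tag is a one-element array.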

Kevin Gillette (Dec 22 2023 at 19:38):

If this is just intended as a format for communicating in-memory with platforms, then much of the above certainly doesn't apply ;)

Brendan Hansknecht (Dec 22 2023 at 20:02):

Kevin Gillette said:

If this is just intended as a format for communicating in-memory with platforms, then much of the above certainly doesn't apply ;)

Yeah, I think it is very important to clearly enumerate the goals. My personal thoughts:

Usage for in-memory communication with the host: big waste, let's not do that. Just box the value and pass it to the host. If the host needs to deal with interacting with it at runtime in a dynamic way, generate a tag of type information that the host can look at to understand the data.
Saving Roc types to disk or serializing them for network communication: This could be useful. It would definitely be a lot simpler and more tailored than proto or whatever else. Might be worth doing, but we should seriously consider if it makes more sense to support an existing format.

Last updated: Jul 23 2026 at 13:15 UTC