I also think it would be sweet to have a builtin binary serialization format that you didn't need to use a package to get
that lets you serialize any Roc type that has Encode and Decode efficiently and without any loss of information
there are some interesting design questions there, which are probably worth some discussion in #ideas
for example, if it's a builtin, one encoding/decoding strategy we could use is to copy all the bytes from memory directly into a List U8 and not store any information in the bytes themselves about what they contain
so it's like you need to know exactly what Roc type you're serializing to/from
however, that could be error-prone in that if you serialize something, store it, then later change the type slightly, and then deserialize it, decoding wouldn't fail but rather would cause undefined behavior - so that design doesn't sound reasonable :sweat_smile:
one way to solve that problem would be to have the compiler include a hash of the type, computed at compile time, which could be inserted at the beginning of the binary; that would give you a quick yes/no answer as to whether you're decoding the same shape of thing that was encoded, which would prevent the undefined behavior (unless there was a hash collision, but using blake3 or something like that would make a collision unlikely enough to not be worth worrying about)
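A minimal sketch of the hash-prefix idea, assuming the type's shape can be written as a string. The names `type_hash`, `encode`, and `decode` are hypothetical, and `hashlib.blake2b` stands in for blake3, which is not in Python's standard library. In the real design the hash would be computed by the compiler at compile time, not at runtime.

```python
import hashlib
import struct

HASH_LEN = 16  # bytes of digest kept in the header (an arbitrary choice for this sketch)


def type_hash(type_signature: str) -> bytes:
    # Stand-in for the compile-time hash of the type; blake2b replaces blake3 here.
    return hashlib.blake2b(type_signature.encode("utf-8"), digest_size=HASH_LEN).digest()


def encode(type_signature: str, fmt: str, *values) -> bytes:
    # Header is the type hash; the body is the raw packed bytes with no per-field tags.
    return type_hash(type_signature) + struct.pack(fmt, *values)


def decode(type_signature: str, fmt: str, data: bytes) -> tuple:
    header, body = data[:HASH_LEN], data[HASH_LEN:]
    # One comparison gives the yes/no answer before any bytes are interpreted.
    if header != type_hash(type_signature):
        raise ValueError("type hash mismatch: data was encoded from a different type")
    return struct.unpack(fmt, body)
```

With this shape, decoding against a slightly changed type fails up front instead of reinterpreting the bytes.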
another way to solve it would be to tag each type as it's encoded, e.g. "the following is a string of length N" - so, more like what CBOR does
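A rough sketch of the tagging approach. The tag values and byte layout here are made up for illustration and are not CBOR's actual encoding; only the general idea ("a tag byte says what follows") is taken from it.

```python
import struct

# Hypothetical one-byte tags; CBOR uses a different, more compact scheme.
TAG_U64 = 0x01
TAG_STR = 0x02


def encode_value(value) -> bytes:
    if isinstance(value, str):
        raw = value.encode("utf-8")
        # "the following is a string of length N"
        return bytes([TAG_STR]) + struct.pack("<I", len(raw)) + raw
    if isinstance(value, int) and not isinstance(value, bool) and 0 <= value < 2**64:
        return bytes([TAG_U64]) + struct.pack("<Q", value)
    raise TypeError(f"unsupported value: {value!r}")


def decode_value(data: bytes, pos: int = 0):
    """Return (value, next_position); fail cleanly on unexpected input."""
    tag = data[pos]
    if tag == TAG_STR:
        (length,) = struct.unpack_from("<I", data, pos + 1)
        start = pos + 5
        end = start + length
        if end > len(data):
            raise ValueError("truncated string")
        return data[start:end].decode("utf-8"), end
    if tag == TAG_U64:
        (number,) = struct.unpack_from("<Q", data, pos + 1)
        return number, pos + 9
    raise ValueError(f"unknown tag {tag:#x}")
```

The cost is one extra byte per value, in exchange for a decoder that can reject data it does not understand.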
also there are tradeoffs around how to represent integers; on the one hand, you can optimize for saving bytes by doing varints like protobuf does, or you can optimize for fewer instructions needed to decode, by storing the exact (little-endian, presumably) integer bytes, even if a lot of them end up being zeroes
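To make the integer tradeoff concrete, here is a small sketch comparing an unsigned base-128 varint (the scheme protobuf uses) with fixed-width little-endian bytes. The function names are hypothetical, and the fixed width of 8 bytes is just an example.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned base-128 varint: 7 payload bits per byte, high bit means 'more follows'."""
    if n < 0:
        raise ValueError("this sketch only handles unsigned integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, next_position); needs a loop with a branch per byte."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_fixed_u64(n: int) -> bytes:
    # Always 8 bytes, even when most of them are zero.
    return n.to_bytes(8, "little")


def decode_fixed_u64(data: bytes, pos: int = 0) -> tuple[int, int]:
    # A single fixed-size read with no per-byte branching.
    return int.from_bytes(data[pos:pos + 8], "little"), pos + 8
```

Small values take one byte as a varint versus eight bytes fixed, but the fixed form decodes with a single read.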
anyway, I think it's a good idea to talk through those tradeoffs!
10 messages were moved here from #contributing > Binary encoding/decoding by Richard Feldman.
This is one of the killer features of the erlang vm, all terms/types/values even anonymous functions can be serialized to and deserialized from a binary format.
It is super handy to just dump your state to disk and then read it back exactly like it was without having to define some custom serialization format.
Maybe it is not as straightforward to do in roc with all the types and such but it would be amazing if it is possible :)
Keywords for the curious reader:
Erlang Term Format, ETF, erlang:term_to_binary, erlang:binary_to_term
Personally for the standard library, I would push for a rather direct memcpy style binary format that just updates all the pointers to be byte offsets.
That said, I would advise tagging the data for debuggability. Also to enable platforms to dynamically load the format if wanted.
We would only need to tag the primitives that encode supports (oh, also, dicts probably should just be encoded as key value lists with no impl details)
I would probably densely pack the nested tag at the beginning of the structure and put all the data afterwards
As long as encode doesn't add new primitives, the tag should stay stable.
That said, not exactly sure how we deal with opaque types and more complex things of that nature. Though maybe that isn't a big issue cause they still have to encode to primitives.
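A small sketch of the "pointers become byte offsets" idea mentioned above, using a list of strings as the example value. The layout (a count, then a table of offset/length pairs, then the string bytes) is an assumption made for illustration, not Roc's actual in-memory layout, and `flatten_strings` and `read_string` are hypothetical names.

```python
import struct


def flatten_strings(items: list[str]) -> bytes:
    """Lay out [count][(offset, length) per item][string bytes...] in one buffer."""
    encoded = [s.encode("utf-8") for s in items]
    table_size = 4 + 8 * len(encoded)
    table = bytearray(struct.pack("<I", len(encoded)))
    payload = bytearray()
    for raw in encoded:
        # Where a pointer would live in memory, store an offset from the buffer start.
        offset = table_size + len(payload)
        table += struct.pack("<II", offset, len(raw))
        payload += raw
    return bytes(table + payload)


def read_string(buf: bytes, index: int) -> str:
    """Read one element in place by following its offset, without decoding the rest."""
    (count,) = struct.unpack_from("<I", buf, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    offset, length = struct.unpack_from("<II", buf, 4 + 8 * index)
    return buf[offset:offset + length].decode("utf-8")
```

Because every reference is relative to the start of the buffer, the bytes can be written to disk or handed to a host and read back without fixing up addresses.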
Perhaps the version of Roc / the encoder should also be included alongside the type.
Yeah
Just give the binary format a version based on encode and its primitives.
Could even make it gracefully backwards compatible with that.
Is it desirable for this to be a well-specced format that other languages could read/write to besides the roc encoder/decoder?
I think this might be the same problem as reading data dynamically for a debugger or reading multiple encoder versions for migrations.
Sky Rose said:
Is it desirable for this to be a well-specced format that other languages could read/write to besides the roc encoder/decoder?
I think so, although I'm not thinking of it as a hard requirement.
I think the key design element that would separate it from other formats would be that it's specifically designed to be a great default choice for Roc (like how JSON is for JavaScript), but it would be even more useful if other languages could read/write it too.
I am not the biggest fan. Two things mainly,
First, there are a lot of serialization formats. There are open serialization formats defined for most use cases and tradeoffs. There is no real need to invent a new one. https://xkcd.com/927/
I am not against including support for existing serialization formats in the standard library. That can be a good idea. I just wouldn't invent a new format.
Personally for the standard library, I would push for a rather direct memcpy style binary format that just updates all the pointers to be byte offsets.
Second, the default serialization format that comes with the language will probably be the thing users will go to by default. It should favor being reliable and forward compatible over being fast. JSON is probably a much better candidate for the serialization needs of a user that is not actively making the choice of what serialization format to use than a binary dump. It is just too easy to change anything and invalidate the saved data in these formats. If a user needs the loading performance of a binary dump of the runtime data structures, they can decide that by themselves and make it an active choice, but it is not a good default. AAA games do a lot of extra work in order to have the final data that ships to players in formats like these. It is not by any means something that comes for free in terms of development cost.
Just because it is in the standard library doesn't mean it needs to be used. A user should just as easily be able to use json if they want to.
I think no matter what format we pick, the end user would actively be making the choice to use it.
I do agree that we could pick an existing format instead of using a roc specific one. I think a roc specific one would be easier to implement, but either should be fine.
There have been many language-specific formats that end up not being used. For the languages I'm familiar with, Python's pickle has had the most popularity, but most projects seem to use it only for bootstrapping and then move beyond it, since interop beyond the origin language never meaningfully materializes.
In roc's case, I think it really only matters if someone writes a version for the main platform languages.
I don't think it really needs any sort of use past that.
Using JSON to send data to the platform is unnecessary overhead. Just sending the roc data in memory would be best, but that doesn't just work. It would at least need extra runtime annotations sent over. Otherwise the platform wouldn't know how to use it.
This is kinda a step past just sending typing info. It would be automatically derived and a stable offset based format that can be serialized to disk.
this is a good point - could be very useful for the "platform exposes an API for calling functions from dynamic libraries" use case
I'm not as convinced about a language-specific format either. Designing the format is one thing, but building enough tooling to provide enough trust to use that format for production is quite an undertaking.
If a language offers a native format, I would expect it to have:
Point 5 is of course tongue-in-cheek (though it does happen), but out of the others, it sounds like we've only talked about point 1, and the memcpy stuff we've discussed would rule out point 3, at least without some kind of endianness indicator in a format header.
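For the endianness indicator, a minimal sketch of what such a header could look like. The magic bytes `RBF0` and the one-byte order flag are invented for this example; nothing in the thread specifies a header layout.

```python
import struct

MAGIC = b"RBF0"  # hypothetical magic bytes identifying the format
LITTLE, BIG = 0, 1  # hypothetical values for the byte-order flag


def write_u32s(values: list[int], order: int) -> bytes:
    """Write the values in the writer's byte order and record that order in the header."""
    prefix = "<" if order == LITTLE else ">"
    header = MAGIC + bytes([order])
    return header + struct.pack(f"{prefix}{len(values)}I", *values)


def read_u32s(data: bytes) -> list[int]:
    """Read the values back correctly regardless of which byte order wrote them."""
    if data[:4] != MAGIC:
        raise ValueError("bad magic bytes")
    order = data[4]
    if order not in (LITTLE, BIG):
        raise ValueError("unknown byte-order flag")
    prefix = "<" if order == LITTLE else ">"
    body = data[5:]
    if len(body) % 4:
        raise ValueError("body is not a whole number of u32 values")
    return list(struct.unpack(f"{prefix}{len(body) // 4}I", body))
```

A reader on the same architecture as the writer can still use the body directly; only a mismatched reader pays for byte swapping.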
This reminds me of Golang's gob format
https://go.dev/blog/gob
https://pkg.go.dev/encoding/gob
Richard Feldman said:
this is a good point - could be very useful for the "platform exposes an API for calling functions from dynamic libraries" use case
in this case, the "include a hash of the expected and received layouts" idea could be potentially useful, because it could mean that the host could verify the hash once (against a known constant that could be hardcoded), and if the message contains that constant, then it could treat the rest of the List U8 sent from the application as having the correct struct layout such that it could just point to it directly
:thinking: although I guess if the platform (as opposed to the application) performed the serialization, then it wouldn't even need that check.
so I guess in that world, any design that could - as quickly as possible - turn a Roc type with the Encoding ability into a List U8 would be a good fit for that specific use case, even if nobody ever had any use for it as a generic production serialization format
I haven't seen much real use of Gob in 12 years of continual professional experience with the language. It just doesn't check enough boxes to see widespread use, even though it's been a stable format.
iirc, gob was initially started to support Go channels over a network (point 5 above), but that originating use case ended up being abandoned.
Some observations:
varint might not buy us anything compared to storing every I128 as 16 bytes if we assume compression will be applied afterwards. If most of those I128 values are close to zero, they'll compress to about the same size as a varint anyways; the compression algorithm will do more work, but the core format encoder will do less work.
5i128 and 5u8 will encode to the same thing (based on value, not type), and it's really up to the decoder to unpack it into the expected types. Whether this is valuable to us depends on whether we want to be value-centric or type-centric (do we _need_ to mark that a 5 is an I8 or U16 or whatever, or do we just need to have the decoder do value range checking)?
For any external use, it seems to me that it'd be a great investment to focus on a canonical serialization data model (tags seem like the only novel thing to figure out), and then a canonical encoding atop a common textual format and a common binary format.
For example: "here's the canonical way to represent tags with payloads in JSON" (i.e. an array where the tag name is the first element). Likewise, providing a pretty good canonical mapping/encoding atop something like Avro could allow Roc to easily interoperate with a lot of modern technology, and for free (such as at compile-time), we'd be able to generate language-independent schema descriptions, check those in, and then have the compiler verify that compatibility with past format revisions is not broken.
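A quick sketch of the "tag name as the first array element" mapping described above, using Python's standard `json` module. The helper names `tag_to_json` and `tag_from_json` are hypothetical, and this is one possible mapping, not an agreed-upon Roc convention.

```python
import json


def tag_to_json(name: str, *payload) -> str:
    """Encode a tag with payloads as a JSON array: [name, payload1, payload2, ...]."""
    return json.dumps([name, *payload])


def tag_from_json(text: str) -> tuple[str, tuple]:
    """Decode the array form back into (tag name, payload values)."""
    value = json.loads(text)
    if not isinstance(value, list) or not value or not isinstance(value[0], str):
        raise ValueError("expected a JSON array starting with a tag name")
    return value[0], tuple(value[1:])
```

A payload-free tag becomes a one-element array, so every tag decodes through the same path.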
If this is just intended as a format for communicating in-memory with platforms, then much of the above certainly doesn't apply ;)
If this is just intended as a format for communicating in-memory with platforms, then much of the above certainly doesn't apply ;)
Yeah, I think it is very important to clearly enumerate the goals. My personal thoughts:
Last updated: Jun 16 2026 at 16:19 UTC