encoding tag union indices · ideas

I was just looking into how serde handles this (our encode ability is based on that), and I do think we may have a gap in our encoder, which may argue for something like Color::Red being clearly specified.

We currently have tag : Str, List (Encoder fmt) -> Encoder fmt | fmt has EncoderFormatting for encoding tags.

Serde has an extra piece of information. When encoding tags, it also passes in the variant index.
So the function would be tag : Str, Nat, List (Encoder fmt) -> Encoder fmt | fmt has EncoderFormatting

This enables encoding to protobuf and other formats that want to just use dense index forms.
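
A rough sketch of why the extra argument matters, written in Python so it can be run as-is. The `JsonLike` and `Dense` classes and their `tag` methods are hypothetical stand-ins for a formatter, not Roc's or serde's actual API, and payloads are left out for brevity.

```python
# Hypothetical formatters: both receive the tag name and its index,
# and each one decides which piece of information to write out.

class JsonLike:
    def tag(self, name: str, index: int) -> bytes:
        # A text format can ignore the index and write the quoted name.
        return ('"' + name + '"').encode("utf-8")


class Dense:
    def tag(self, name: str, index: int) -> bytes:
        # A dense format can ignore the name and write the index as one byte
        # (this sketch assumes fewer than 256 tags).
        return bytes([index])


json_bytes = JsonLike().tag("Red", 2)
dense_bytes = Dense().tag("Red", 2)
```

Without the index argument, the `Dense` formatter would have nothing but the name to work with, which is the gap described above.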

Brendan Hansknecht (May 13 2023 at 17:43):

To get full functionality here, we need the index. To get the index, we need to know the full type, not just [Red]*.

Brendan Hansknecht (May 13 2023 at 17:44):

Even if we require opaque types here, there is still no way to encode a tag densely. We would at a minimum need to add a function that knows the tag is final. That way it can use the index instead of the name.

Matthias Toepp (May 13 2023 at 17:45):

It sounds like you're really onto something there! :grinning_face_with_smiling_eyes: (details are over my head)

Brendan Hansknecht (May 13 2023 at 17:47):

Basically, if we want to support encoding into dense formats instead of always encoding enums/tags as strings, we need the index. The index can only be calculated if we know the entire tag.

Richard Feldman (May 13 2023 at 17:49):

Richard Feldman (May 13 2023 at 17:50):

it's a great point that we should probably provide the index so you can serialize to something like protobuf, and I totally missed that

Richard Feldman (May 13 2023 at 17:50):

and I do think monomorphization will take care of this so that it won't be a problem in practice

Richard Feldman (May 13 2023 at 17:50):

however, the rabbit hole is: today, there is no way to go from a tag union to an index at runtime

Richard Feldman (May 13 2023 at 17:50):

Richard Feldman (May 13 2023 at 17:51):

Richard Feldman (May 13 2023 at 17:52):

Richard Feldman (May 13 2023 at 17:53):

like if we can encode a tag into a raw index integer, then we'll need to be able to do the reverse to decode it: go from a raw index integer into a tag union...so what are the implications of being able to go from index to tag?

Brendan Hansknecht (May 13 2023 at 17:55):

Encode.encode Red

We will think the index is 0 cause the tag is [Red]*. This is the same issue we described with bool essentially.

Richard Feldman (May 13 2023 at 17:55):

Brendan Hansknecht (May 13 2023 at 17:56):

Richard Feldman (May 13 2023 at 17:57):

the difference is that if I'm encoding something into JSON (for example, but most formats have a distinction between bool and other enums), there is a totally separate encoding for booleans vs others

Brendan Hansknecht (May 13 2023 at 17:57):

But maybe not an issue in practice cause encode and decode generally need to be used with types.

Brendan Hansknecht (May 13 2023 at 17:58):

I get the separate encoding problem, but I am more talking about resolving to an unexpected type in the user's eyes and not having a good way to correct the type.

Richard Feldman (May 13 2023 at 17:58):

Richard Feldman (May 13 2023 at 17:59):

to me, a very important part of that distinction is that there's already a way to solve the former: as you noted earlier, you can name the variable and put a type annotation on it

Richard Feldman (May 13 2023 at 18:00):

so the concern is "this might be inconvenient in practice" as opposed to "this is ambiguous and there's no way for the compiler to help you other than guessing based on a heuristic"

Brendan Hansknecht (May 13 2023 at 18:00):

Make a special encode function that takes a tag value and then correctly dispatches to the current tag encode function.

Brendan Hansknecht (May 13 2023 at 18:00):

So then the tag index would only ever be used or seen by people who implement encoders and nowhere else

Brendan Hansknecht (May 13 2023 at 18:00):

Richard Feldman (May 13 2023 at 18:00):

Richard Feldman (May 13 2023 at 18:01):

Encode.encode {
    firstName: "Sam",
    lastName: "Sample",
    role: Guest,
}

Richard Feldman (May 13 2023 at 18:02):

and I think the problem here is more fundamental than "just allow Role::Guest syntax"

Brendan Hansknecht (May 13 2023 at 18:02):

Richard Feldman (May 13 2023 at 18:02):

Richard Feldman (May 13 2023 at 18:03):

it's not a User or anything, I could have just put that entire literal directly into the repl

Richard Feldman (May 13 2023 at 18:03):

so even if there's a Role::Guest syntax, I still have to remember to do that here

Richard Feldman (May 13 2023 at 18:03):

the problem is that if I wanted a Role : [Guest, Admin, Moderator] or whatever, I didn't think to specify that

Brendan Hansknecht (May 13 2023 at 18:03):

Richard Feldman (May 13 2023 at 18:04):

I just put Guest because that's what I'd normally do, and it will work out fine everywhere else except specifically when passing a value to Encode.encode

Brendan Hansknecht (May 13 2023 at 18:04):

Richard Feldman (May 13 2023 at 18:04):

Brendan Hansknecht (May 13 2023 at 18:04):

Richard Feldman (May 13 2023 at 18:04):

the obvious solution is to not give the Encode or Decode abilities to tag unions

Brendan Hansknecht (May 13 2023 at 18:05):

That is kinda an argument for encode only working on opaque types in order to ensure proper typing always.

Richard Feldman (May 13 2023 at 18:05):

Brendan Hansknecht (May 13 2023 at 18:06):

That's fair, but i don't think the current other encode functions are powerful enough to encode tags

Richard Feldman (May 13 2023 at 18:06):

Brendan Hansknecht (May 13 2023 at 18:07):

*they are powerful enough but very inconvenient. Look at serde. It has multiple functions around variant encoding.

Brendan Hansknecht (May 13 2023 at 18:08):

As in a user doesn't want to manually generate a different struct for each variant.

Richard Feldman (May 13 2023 at 18:08):

Brendan Hansknecht (May 13 2023 at 18:09):

Oh, I guess we can just remove the auto-derived version, but leave in the function for encoding a tag.

Richard Feldman (May 13 2023 at 18:09):

also it would create a disincentive to use tag unions in any data structure you might want to serialize, which is definitely the wrong incentive to create

Brendan Hansknecht (May 13 2023 at 18:09):

Then the user could pick the index and it would avoid exposing the internal tag index

Richard Feldman (May 13 2023 at 18:10):

well the problem is, let's say today I have User : { ... } - I can just serialize and deserialize that right away without writing any more code

Richard Feldman (May 13 2023 at 18:10):

as soon as I put a tag union in User, if tag unions no longer have Encode and Decode, now I have to customize how to do that

Matthias Toepp (May 13 2023 at 18:10):

What about tracking whether there is a closed value that it could be referring to, and if there is, then it's a warning or something?

Richard Feldman (May 13 2023 at 18:11):

certainly we can have a warning if you give values of certain types to Encode.encode

Brendan Hansknecht (May 13 2023 at 18:12):

Richard Feldman (May 13 2023 at 18:13):

to be honest, I'd just start with the warning first and then see if there's demand for the syntax in practice

Richard Feldman (May 13 2023 at 18:13):

Richard Feldman (May 13 2023 at 18:14):

e.g. tests (for serialization formats, for example) are one scenario where I can imagine this coming up, but also I can imagine tests writing helper functions to generate data, and those helper functions would have type annotations which would take care of this automatically

Richard Feldman (May 13 2023 at 18:15):

Richard Feldman (May 13 2023 at 18:16):

I'm not sure how the warning would interact with tag union polarity though :thinking:

Richard Feldman (May 13 2023 at 18:17):

Encode.encode {
    firstName: "Sam",
    lastName: "Sample",
    role: Guest,
}

...and we want to give a compiler warning like "hey you're giving Encode.encode a tag with the type [Guest] - this is probably not what you want, can you give that an explicit type annotation if it really is what you want?"

Richard Feldman (May 13 2023 at 18:17):

hm, actually - maybe the warning heuristic we want is that you're trying to encode a single-tag union?

Richard Feldman (May 13 2023 at 18:17):

Richard Feldman (May 13 2023 at 18:18):

where one branch of the conditional had Guest and another had Moderator, so the type given to Encode.encode was [Guest, Moderator] - which has different indices than if Admin is in the mix
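
To make the index shift concrete, here is a small Python sketch. It assumes indices follow the alphabetical order of the tag names in whatever union was inferred (the ordering discussed later in this thread); `tag_index` is a made-up helper, not a compiler function.

```python
def tag_index(union: list[str], tag: str) -> int:
    # Assumption: a tag's index is its position among the alphabetically
    # sorted tag names of the union the value was inferred to have.
    return sorted(union).index(tag)


intended = ["Guest", "Admin", "Moderator"]

guest_in_full_union = tag_index(intended, "Guest")
guest_in_single_tag_union = tag_index(["Guest"], "Guest")
moderator_in_full_union = tag_index(intended, "Moderator")
moderator_in_two_tag_union = tag_index(["Guest", "Moderator"], "Moderator")
```

Under that assumption the same tag gets a different index depending on which union was inferred, so a single-tag check alone would not catch the two-branch conditional case.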

Richard Feldman (May 13 2023 at 18:19):

I gotta run some errands, will be back later...but great catch noticing this scenario! Glad to have found it before people started tripping over it in production :sweat_smile:

Richard Feldman (May 13 2023 at 18:21):

oh one more question I just thought of, before I head out: what size would the index be?

Richard Feldman (May 13 2023 at 18:22):

it'll almost always fit in U8, but we have logic to upgrade to U16 if you have more than 256 tags in the union

Richard Feldman (May 13 2023 at 18:23):

I guess we could always give U64 just to be safe, and then have a separate U64 argument for how many total tags there are, which would let you infer the smallest integer you could use to represent the discriminant (e.g. in protobuf)
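
A minimal Python sketch of that inference, assuming the formatter is handed the total tag count. `discriminant_bits` is a hypothetical helper name.

```python
def discriminant_bits(total_tags: int) -> int:
    """Smallest unsigned width (8, 16, 32 or 64 bits) that can hold
    every index in the range 0 .. total_tags - 1."""
    if total_tags < 1:
        raise ValueError("a tag union needs at least one tag")
    max_index = total_tags - 1
    for bits in (8, 16, 32, 64):
        if max_index < 2 ** bits:
            return bits
    raise ValueError("too many tags for a 64-bit index")
```

With 256 tags the largest index is 255, which still fits in 8 bits; the 257th tag is what forces 16.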

Richard Feldman (May 13 2023 at 18:23):

Brendan Hansknecht (May 13 2023 at 18:24):

Brendan Hansknecht (May 13 2023 at 18:26):

I can sadly imagine a tag that eventually is bigger than u16, but I think u32 is super safe size.

Richard Feldman (May 13 2023 at 19:57):

just to put it out there in case we forget: another way to address the error-prone situation is to leave the API as-is, which would have the tradeoffs of:

Richard Feldman (May 13 2023 at 19:58):

Brendan Hansknecht (May 13 2023 at 20:46):

Brendan Hansknecht (May 13 2023 at 20:57):

An extra note, we could keep the current form, but add an extra field to Encoder. Instead of just tag, add indexedTag that uses an integer instead of the name. That would never be used by auto derived, but it could be used by an opaque type. That way, if you want the denser encoding, you could do it manually with an opaque type.

This also means we don't have to expose the tag index. Instead the opaque type encode would define its own when to get the correct integer. So user defined as opposed to the internal compiler index.

Brendan Hansknecht (May 13 2023 at 21:00):

Color := [Red,Green,Blue] has Encode {encode}

encode = \encoder, color ->
    when color is
        Red -> Encode.indexedTag encoder 0 []
        Green -> Encode.indexedTag encoder 1 []
        Blue -> Encode.indexedTag encoder 2 []

Brendan Hansknecht (May 13 2023 at 21:00):

Brendan Hansknecht (May 13 2023 at 21:01):

And if you don't really want the opaque type for regular code, just unwrap it right after encoding/decoding.

Agus Zubiaga (May 13 2023 at 21:04):

Brendan Hansknecht (May 13 2023 at 21:05):

Agus Zubiaga (May 13 2023 at 21:07):

Ah, true. I guess that’d be even worse though because a protocol won’t likely use that indexing.

Brendan Hansknecht (May 13 2023 at 21:08):

Protocol would just see it as an integer? I don't think protocol has any control over that value

Agus Zubiaga (May 13 2023 at 21:09):

I mean protocols/contracts usually have enum definitions and the integer associated to each value won't be based on alphabetical order

Brendan Hansknecht (May 13 2023 at 21:11):

Yeah, if you want a special order, no matter what, you would have to fall back on opaque types and explicit definitions. The question is if we want to be able to auto-derive alphabetical order.

Agus Zubiaga (May 13 2023 at 21:13):

I can only see that working if you're encoding data that only your Roc program is going to decode

Agus Zubiaga (May 13 2023 at 21:14):

Georges Boris (May 13 2023 at 21:14):

yeah - imo deriving index based encoder/decoders seems really error-prone when doing anything that doesn't live entirely inside Roc. and even then, can be a mess if you think about a live system with different versions living together sometimes.

Georges Boris (May 13 2023 at 21:15):

Brendan Hansknecht (May 13 2023 at 21:16):

So yeah, maybe all tags should be strings by default and opaque is required for integer indices.

Brendan Hansknecht (May 13 2023 at 21:16):

Brendan Hansknecht (May 13 2023 at 21:20):

I guess the standard in a language like Rust would be definition order for these things....but that sounds like a mistake in Roc due to how tags reorder by default. Yeah, I like the opaque way of thinking about this more.

Richard Feldman (May 13 2023 at 21:26):

this is an option, but I can imagine people being unhappy about the ergonomics in practice

Richard Feldman (May 13 2023 at 21:26):

I think it'll be pretty common to want to store (non-opaque) tag unions in data structures, e.g. enumerations like [Admin, Moderator, Guest]

Richard Feldman (May 13 2023 at 21:27):

and if you've done that a bunch and are suddenly like "oh we've decided to change serialization formats from JSON to protobuf, now I have to go back and rewrite a ton of code to deal with opaque wrappers solely because that's the only viable way to go to protobuf," that's a really unpleasant experience :grimacing:

Richard Feldman (May 13 2023 at 21:29):

separately, it bothers me that this would mean less efficient serialization formats would have better ergonomics than more efficient ones - e.g. "just always use JSON, that way you won't have to deal with opaque type wrappers" (or, alternatively, "always use Bool over tag unions, because those Just Work with protobuf and you don't have to write custom serialization code for them" - which to me would be even worse!)

Richard Feldman (May 13 2023 at 21:29):

ideally we could make the ergonomics of JSON about the same as for binary formats that use (typically) 1-byte discriminants for tags

Richard Feldman (May 13 2023 at 21:30):

none of that makes the footgun go away, of course, but it makes me feel motivated to try to find another way to remove it :big_smile:

Brendan Hansknecht (May 13 2023 at 21:35):

Very true, though ordering will be a strange question. Since declaration order doesn't seem well suited for roc and always using alphabetical may be rough in cases

Richard Feldman (May 13 2023 at 21:46):

Richard Feldman (May 13 2023 at 21:47):

the nice thing about alphabetical is that if you know the tag names in the union, you know the index

Richard Feldman (May 13 2023 at 21:48):

which in turn means the index doesn't change if the code around it changes (which is a downside of designs like having a global index across all tag names that are used in the program: you can add a tag name somewhere else and have it change other tags' indices and break things)

Richard Feldman (May 13 2023 at 21:49):

the warning idea seems promising to me, because this is something that we expect to come up rarely in practice (but be an easy-to-miss error if it does come up); if that ends up being true, then most people won't even see it

Richard Feldman (May 13 2023 at 21:49):

but if they do see it, it would hopefully be because they were actually about to make a mistake (as opposed to a false positive)

Agus Zubiaga (May 13 2023 at 22:06):

You could only use an index based on alphabetical order if you own the protobuf schema you’re talking to

Agus Zubiaga (May 13 2023 at 22:07):

Otherwise, I think you just won’t be able to use auto derived encoders because they won’t match up

Agus Zubiaga (May 13 2023 at 22:13):

enum Role {
  ROLE_GUEST = 0;
  ROLE_MEMBER = 1;
  ROLE_ADMIN = 2;
}

Agus Zubiaga (May 13 2023 at 22:14):

Role : [Guest, Member, Admin]
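
A quick Python check of how those two definitions line up, assuming an auto-derived encoder numbers tags alphabetically. The dictionaries are illustrative only; the schema numbers are copied from the enum above.

```python
# Numbers copied from the protobuf enum above.
schema = {"Guest": 0, "Member": 1, "Admin": 2}

# Assumption: an auto-derived encoder numbers the tags alphabetically.
derived = {tag: i for i, tag in enumerate(sorted(["Guest", "Member", "Admin"]))}

# Tags whose schema number differs from the derived index,
# stored as (schema number, derived index).
mismatches = {
    tag: (schema[tag], derived[tag])
    for tag in schema
    if schema[tag] != derived[tag]
}
```

Under that assumption every tag in this example ends up with a different number than the schema expects.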

Agus Zubiaga (May 13 2023 at 22:16):

Richard Feldman (May 13 2023 at 22:31):

Richard Feldman (May 13 2023 at 22:32):

yeah I don't see a way around it in the case where somebody else is in charge of that mapping

Richard Feldman (May 13 2023 at 22:32):

Richard Feldman (May 13 2023 at 22:33):

to generate explicit Roc encoders/decoders from the schema you're using as a source of truth

Agus Zubiaga (May 13 2023 at 22:34):

Agus Zubiaga (May 13 2023 at 22:37):

I guess I just don't see many cases where this could work apart from apps that want to dump their own data to disk or something

Richard Feldman (May 13 2023 at 22:39):

Agus Zubiaga (May 13 2023 at 22:39):

This is a really good point, and the only solution that comes to mind is being able to implement abilities on non-opaque types, but I don't know if that's even possible

Richard Feldman (May 13 2023 at 22:46):

yeah the original motivation for adding opaque types to the language was that abilities wouldn't work otherwise :big_smile:

Agus Zubiaga (May 13 2023 at 22:47):

Richard Feldman (May 13 2023 at 23:53):

Brendan Hansknecht (May 14 2023 at 02:20):

Also, even if we keep status quo, we still need a new method for generating encoding tags via index in the Encoder trait.

Brendan Hansknecht (May 14 2023 at 02:21):

Ayaz Hafiz (May 14 2023 at 14:19):

I agree re. Brendan's point that you may want this for something like a front-end/back-end that you control, where you're okay with an auto-derived impl for something like protobuf

Ayaz Hafiz (May 14 2023 at 14:20):

I think we should add the tag index to both the encoding and decoding APIs. The auto-derived implementation would work as-is based on definition order (or any other order) and encoding formats could implement this optimization as desired. Concretely, the Encoding api would change to

# `tag {name, index} payloads` encodes a tag of `name` at `index` in its definition and a list of its payloads.
tag : {name: Str, index: U64}, List (Encoder fmt) -> Encoder fmt | fmt has EncoderFormatting

And the Decoding api would become (note that we haven't implemented auto derived decoding for tags yet!!)

## `discriminant {tagNames, maxIndex}` decodes the index of a tag given the names of the tags and the number of tags in the definition.
discriminant : {tagNames: List (List U8), maxIndex: U64} -> Decoder U64 fmt | fmt has DecoderFormatting

## `sequence state stepElem finalizer` decodes a possibly-heterogenous sequence representation into `state`.
sequence : state, (state -> [Keep (Decoder state fmt), Skip]), (state -> Result val DecodeError) -> Decoder val fmt | fmt has DecoderFormatting
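
One way to read the proposed `discriminant` is that each format decides how to recover the index from its own bytes. A hedged Python sketch follows; both function names are made up for illustration and neither is part of any Roc API.

```python
def discriminant_from_name(tag_names: list[bytes], raw: bytes) -> int:
    # A string-based format looks the decoded name up in the known tag names.
    # list.index raises ValueError when the name is not in the union.
    return tag_names.index(raw)


def discriminant_from_index(tag_names: list[bytes], raw: bytes) -> int:
    # A dense format reads one byte and only checks that it is in range.
    index = raw[0]
    if index >= len(tag_names):
        raise ValueError("index out of range for this union")
    return index


names = [b"Admin", b"Guest", b"Moderator"]
by_name = discriminant_from_name(names, b"Guest")
by_index = discriminant_from_index(names, bytes([1]))
```

The name-based path rejects unknown tags outright, while the index-based path can only check the range, which is why the ordering questions in this thread matter for dense formats.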

Richard Feldman (May 14 2023 at 14:22):

@Ayaz Hafiz what do you think about the footgun mentioned earlier? e.g. that this:

Encode.encode {
    firstName: "Sam",
    lastName: "Sample",
    role: Guest,
}

Richard Feldman (May 14 2023 at 14:22):

because the type encode sees is [Guest], even if the intention is something else like [Guest, Moderator, Admin]

Ayaz Hafiz (May 14 2023 at 14:25):

I think my bias is that I don't think that would happen in practice. Like, it seems you would catch that very quickly - if you are writing test code, you would probably try to deserialize it and see that it's a problem. Otherwise, you are likely passing bound, typed variables (not literals) and are less likely to run into this.

Richard Feldman (May 14 2023 at 14:25):

I predict it would happen rarely in practice, but I think if it does happen it could be very nasty

Richard Feldman (May 14 2023 at 14:25):

like that code looks totally normal and correct, and I think would not be likely to be caught in code review

Richard Feldman (May 14 2023 at 14:26):

Brendan Hansknecht (May 14 2023 at 14:26):

With the warning about open tags and Role::Guest I think that would fix most of the nastiness, right?

Richard Feldman (May 14 2023 at 14:26):

I think the warning by itself could be sufficient (if we're right that it happens very rarely in practice)

Richard Feldman (May 14 2023 at 14:26):

Brendan Hansknecht (May 14 2023 at 14:27):

Richard Feldman (May 14 2023 at 14:27):

the warning idea is basically "if Encode.encode tries to encode a tag union that's still open, give a compile-time warning"

Brendan Hansknecht (May 14 2023 at 14:27):

Richard Feldman (May 14 2023 at 14:28):

which would catch both the situation above, as well as the situation where I had role: set to a conditional where each branch returned a different tag, but the actual union I wanted had more than those two in it

Richard Feldman (May 14 2023 at 14:29):

but it wouldn't fire for any function with the return type User, where User is a type alias for that record which includes { ..., role : [Admin, Guest, Moderator] } - which would be closed, and therefore prevent the warning

Richard Feldman (May 14 2023 at 14:30):

so as long as I'm giving Encode.encode a User value that I made from a function like that, or if I'm manually annotating a variable (e.g. to get around the warning), I won't get a warning

Ayaz Hafiz (May 14 2023 at 14:30):

Brendan Hansknecht (May 14 2023 at 14:31):

Ayaz Hafiz (May 14 2023 at 14:32):

You cannot define tag unions as closed unless they are under an opaque type or you take a tag union as an input to a function and return the same union (same as in same type)

Richard Feldman (May 14 2023 at 14:32):

so another heuristic I had an idea for is: give a warning for a single-tag union

Richard Feldman (May 14 2023 at 14:33):

Ayaz Hafiz (May 14 2023 at 14:34):

that seems more reasonable IMO. i don’t really see how you run into this unless you’re writing literals in test code. in every other case it seems like you’d flag this in code review given how structurally typed Roc is.

Ayaz Hafiz (May 14 2023 at 14:34):

Richard Feldman (May 14 2023 at 14:34):

Richard Feldman (May 14 2023 at 14:35):

Brendan Hansknecht (May 14 2023 at 14:38):

The entire message doesn't need to be constant. Just one field.
So if I am in the admin branch of my code and go: role = Admin. Then eventually put role in a struct, that would lead to this issue, right?

Richard Feldman (May 14 2023 at 14:39):

I genuinely have a hard time imagining this ever coming up. Like you'd have to write either something like role: when ... or else have role assigned to an un-annotated variable that was a conditional, and even then it would only be a problem if you wrote out the whole literal like this, and it never got unified with anything that made the tag union have all the tags

Richard Feldman (May 14 2023 at 14:40):

Ayaz Hafiz (May 14 2023 at 14:41):

there is another side that’s a problem here, which is the decoding side - where the problem is far more likely. You are probably going to run into this there more than you are on the encoding side unless you have annotations (again I think mostly in test code, but I think the scenarios are easier to come up with- for example you match on the expectation of only one tag you want to see appear, and the rest falls into the wildcard and is not seen by the type system)

Richard Feldman (May 14 2023 at 14:42):

so it won't be an issue if something else is causing that role field to have the bigger type, for example:

Ayaz Hafiz (May 14 2023 at 14:42):

what if we had auto derivers for opaque types pass the index, but derivations for the structural types do not

Richard Feldman (May 14 2023 at 14:42):

Richard Feldman (May 14 2023 at 14:43):

Ayaz Hafiz (May 14 2023 at 14:46):

If we want to fully remove the footgun, I believe the only option is opaque types - there is no other way to force a closed union.

Brendan Hansknecht (May 14 2023 at 14:47):

I feel like this overall seems to suggest that we really want encode and decode to require typing info. Like never autoderived and always explicit (which as you just said above, is currently done in roc via opaque types).

Ayaz Hafiz (May 14 2023 at 14:47):

Ayaz Hafiz (May 14 2023 at 14:49):

I disagree. I think having it autoderived for structural types is a huge productivity boost for things like JSON over web services and prototyping

Ayaz Hafiz (May 14 2023 at 14:50):

The challenge is balancing those kinds of use cases with the optimal cases like protobuf as you describe where you want the schema to be strict.

Brendan Hansknecht (May 14 2023 at 14:52):

That's fair. That was why I suggested a separate index based tag encoder and string base tag encoder.

String base would autoderive (exactly like current roc). If you need indices, you need an opaque type and to specify them explicitly (in user land code, not alphabetical). Just make it an opaque type where you expose wrap and unwrap. Then you just need to wrap when throwing it in the final struct. Or you use the same as Bool.true for your type.

Brendan Hansknecht (May 14 2023 at 14:54):

That does make it less convenient to use the optimized version, but if you are using the optimized version, you are probably clearly defining all your types. So you just need to add a function that converts from the JSON-friendly version of the type to the proto-friendly version. That doesn't seem too hard if you want the perf gain from proto.

Brendan Hansknecht (May 14 2023 at 15:00):

# Json version
Role: [User, Admin, Guest]
User: {id: U64, firstName: Str, lastName: Str, role: Role}

main =
   ....
   # Constants always correct. Tags encode as strings.
   Encode.encode {id: 3, firstName: "John", lastName: "Doe", role: Admin}

# Switch to proto (Still keep json types in code, but add proto type for boundaries)
# This is probably in its own module.
ProtoRole := [User, Admin, Guest] has Encode {encode: encodeRole}
encodeRole = \encoder, role ->
    when role is
        Admin -> Encode.indexedTag encoder 0 []
        User -> Encode.indexedTag encoder 1 []
        Guest -> Encode.indexedTag encoder 2 []

ProtoUser: {id: U64, firstName: Str, lastName: Str, role: ProtoRole}

fromUser = \{id, firstName, lastName, role} -> {id, firstName, lastName, role: @ProtoRole role}


main =
   ....
   Encode.encode (Proto.fromUser {id: 3, firstName: "John", lastName: "Doe", role: Admin})

Ayaz Hafiz (May 14 2023 at 15:01):

Agreed. Probably the main consideration in this case is what Richard mentioned, what is the cost of moving from a Json-based encoding to Protobuf based encoding since you’d need to perform this transformation globally?

Brendan Hansknecht (May 14 2023 at 15:01):

This really is not a big deal if you want perf gain from proto or whatever other format, but maybe for some formats it could be painful. Like if a format wants flexibility, but is limited, thus must use the numbered indices (not sure if such formats exist).

Brendan Hansknecht (May 14 2023 at 15:02):

Theoretically, fromUser doesn't actually need to do anything, but Roc probably doesn't know that.

Brendan Hansknecht (May 14 2023 at 15:03):

Also, my point was that you can just do this at the encode edges and avoid changing your main code.

Brendan Hansknecht (May 14 2023 at 15:03):

Ayaz Hafiz (May 14 2023 at 15:03):

Brendan Hansknecht (May 14 2023 at 15:04):

It is just adding one method call after searching for each Encode.encode and Decode.decode

Ayaz Hafiz (May 14 2023 at 15:04):

we could also have a tool that refactors all named structural types to be opaque types for you, I can imagine how that analysis is done

Ayaz Hafiz (May 14 2023 at 15:07):

somewhat related, have we talked at all about whether we want the Roc tool chain to enforce semver for encode/decode, and if so how we do that? this conversation would be a part of that, but we also should discuss that (in a separate context) for what happens if a library changes how it encodes an opaque type.

Ayaz Hafiz (May 14 2023 at 15:09):

not to sidetrack this conversation, just a note for later (can’t figure out how to make a new thread on my phone ): )

Richard Feldman (May 14 2023 at 15:10):

Richard Feldman (May 14 2023 at 15:11):

so we could say "hey you're using a tag union with encode/decode, you should really annotate that to make sure it's doing the thing you expect"

Richard Feldman (May 14 2023 at 15:11):

so you don't need to stop using structural types, just make sure to use an annotation sometime before you give them to encode or decode to make sure it's clear what you actually want to encode/decode

Richard Feldman (May 14 2023 at 15:12):

and it's just a warning bc we can of course do it without the annotation, so you're not blocked if you're e.g. doing some quick and dirty JSON prototyping

Sky Rose (May 15 2023 at 01:49):

A use case where someone might want a custom order: Adding a new tag to something in a backwards-compatible way. Any automatic order (like alphabetical) could cause pre-existing data to get decoded incorrectly if the new tag doesn't happen to go to the end of the list.
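
A small Python sketch of that failure mode, assuming alphabetical indices. `encode` and `decode` are made-up helpers, and "Editor" is an invented tag added in the newer version.

```python
def encode(union: list[str], tag: str) -> int:
    # Assumption: the index is the tag's position in alphabetical order.
    return sorted(union).index(tag)


def decode(union: list[str], index: int) -> str:
    return sorted(union)[index]


v1 = ["Admin", "Guest"]
v2 = ["Admin", "Editor", "Guest"]  # "Editor" is a hypothetical new tag

stored = encode(v1, "Guest")   # written by the old version
restored = decode(v2, stored)  # read back by the new version
```

Here the old data decodes without any error, but as "Editor" instead of "Guest".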

Sky Rose (May 15 2023 at 01:58):

All of the automatic orders seem magic and would make me afraid that it'd bite me sometime. Even the strictest way of automatically generating indices (definition order, only in places with annotations) seems like laying a trap. I want to be able to refactor and reorder my type definitions without having to worry about whether there's serialized data out there that would become incompatible. It seems helpful for quick and dirty JSON prototyping, but for any production code, I think we should push people towards writing explicit mappings from Tag to integer.

Sky Rose (May 15 2023 at 02:01):

(A compiler warning for times when you prefer the convenience of automatically generated indices and accept the tradeoffs seems like a good way to allow that use case without endorsing it.)

Brendan Hansknecht (May 15 2023 at 03:58):

Compiler warnings are made to block CI, so i don't think that is great for this case.

If we auto-derive order, we should just document what it is. This is super common in many languages and generally not a problem. Serde does it and I have never heard anyone complain about it. I do think it is slightly different in Roc because you cannot specify enum ordering where it is defined, but most enums in languages like Rust just have implicit declaration order.

Brendan Hansknecht (May 15 2023 at 03:59):

Brendan Hansknecht (May 15 2023 at 04:00):

That said, I also think it would be totally reasonable to just not auto-derive indices and always make it explicit. So if you use auto derived you just get strings. I think that is clean and would work for most things. Also, wouldn't block something like proto, just would make a worse API without adding opaque types (due to using strings instead of enums).

Richard Feldman (May 15 2023 at 10:06):

Richard Feldman (May 15 2023 at 10:08):

Richard Feldman (May 15 2023 at 10:09):

and then of course protobuf itself (and similar encodings) would use their own schemas as the source of truth anyway, and not auto-derived encoders/decoders

Richard Feldman (May 15 2023 at 12:53):

that is to say: the dense encoding also uses field indices instead of string names for fields

Richard Feldman (May 15 2023 at 12:54):

and it has the same backwards-compatibility concern (although not the same "what if you do a single-tag union" thing)

Richard Feldman (May 15 2023 at 12:55):

which is to say: if we're storing field indices instead of string labels, and I add a new field whose name doesn't happen to be alphabetically later than all the others, then its index will be somewhere in the middle, and now if I receive an older version of this type (using the same serialization format), it might successfully decode erroneously rather than giving an error

Richard Feldman (May 15 2023 at 12:56):

so it seems like in the case of both tags and records, if we want to offer auto-generated encoders and decoders, using string field and tag names is significantly less error-prone even in a binary representation (although it is of course significantly less compact than indices in both cases)

Richard Feldman (May 15 2023 at 12:58):

Richard Feldman (May 15 2023 at 12:59):

I think it would be safe to auto-generate encoders and decoders for MessagePack without concerns about backwards-compatibility potentially causing erroneous decodings or needing new compiler warnings

Brendan Hansknecht (May 15 2023 at 13:21):

Oh, so for record fields we also need two versions? One auto-derived with strings and another that can be explicitly implemented with indices?

Brendan Hansknecht (May 15 2023 at 13:22):

That sucks. This feels like it is getting more edge case filled, less efficient, and less convenient overall.

Richard Feldman (May 15 2023 at 13:24):

Richard Feldman (May 15 2023 at 13:25):

and say if you want indices, you need to go to an explicit schema (e.g. via opaque types) and manage backwards-compatibility yourself

Richard Feldman (May 15 2023 at 13:25):

Brendan Hansknecht (May 15 2023 at 13:25):

Yeah, so less efficient, less convenient, and more chances people will need opaque types.

Richard Feldman (May 15 2023 at 13:26):

Brendan Hansknecht (May 15 2023 at 13:26):

Records aren't like tags. You won't get a nice error if you miss a field like you will if you miss a tag variant.

Richard Feldman (May 15 2023 at 13:29):

sorry, I still don't follow - do you mean that offering record indices is a good idea? bad idea? something else?

Bryce Miller (May 15 2023 at 13:37):

I'm pretty ignorant on typical use cases for e.g. ProtoBuf, so I might be asking silly questions here. What are the situations where you would even want to auto-derive an encoder for something like ProtoBuf? Most often you would be generating Roc types and encoders/decoders from a protobuf definition file right?

But perhaps you're building a server and want to auto-derive an encoder/definition file that you can send to the team building the client application, skipping the step of writing a definition file and generating types. Is this the sort of use case we are thinking of here?

Is this discussion more focused on a use case where you need Some compact binary format to be used by the application you are writing, either on the same machine or a different machine? Save files, multiplayer interactions or collaboration, etc. In this case the format is a bit arbitrary, as your application is the only application that cares about the format. (A prime candidate for auto-derivation)

Trying to understand the problem space so I can follow the conversation a bit better :sweat_smile:

Brendan Hansknecht (May 15 2023 at 13:48):

Let's say that records and tags always auto-generate strings. When writing a tag encoder manually to use indices, you write:

ProtoRole := [User, Admin, Guest] has Encode {encode: encodeRole}
encodeRole = \encoder, role ->
    when role is
        Admin -> Encode.indexedTag encoder 0 []
        User -> Encode.indexedTag encoder 1 []
        Guest -> Encode.indexedTag encoder 2 []

ProtoUser := {id: U64, firstName: Str, lastName: Str, role: ProtoRole} has Encode {encode: encodeUser}
encodeUser = \encoder, user ->
    encoder
    |> Encode.indexedRecord 0 user.id
    |> Encode.indexedRecord 1 user.firstName
    |> Encode.indexedRecord 2 user.lastName
    |> Encode.indexedRecord 3 user.role

Now imagine that you need to add a variant to the tag or a field to the record. In the tag case, you get a compiler error in the encode function. In the record case, you just miss data.
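The same asymmetry can be sketched in Rust, which is only an analogy for the Roc code above. All names here are hypothetical. The last function shows a Rust-specific idiom (destructuring without `..`) that restores the compile-time check for records; nothing in this thread says Roc offers an equivalent.

```rust
#[derive(Clone, Copy)]
enum Role {
    User,
    Admin,
    Guest,
}

struct User {
    id: u64,
    first_name: String,
    last_name: String,
    role: Role,
}

/// Hand-written dense index per variant. The `match` is exhaustive, so
/// adding a variant to `Role` fails to compile until it is handled here.
fn role_index(role: Role) -> u8 {
    match role {
        Role::Admin => 0,
        Role::User => 1,
        Role::Guest => 2,
    }
}

/// Reads fields one at a time. A field added to `User` later is silently
/// left out of the encoding: no compiler error.
fn encode_user_lossy(user: &User) -> Vec<(u8, String)> {
    vec![
        (0, user.id.to_string()),
        (1, user.first_name.clone()),
        (2, user.last_name.clone()),
        (3, role_index(user.role).to_string()),
    ]
}

/// Destructuring without `..` must name every field, so a new field
/// becomes a compile error here, just like a new enum variant.
fn encode_user_checked(user: &User) -> Vec<(u8, String)> {
    let User { id, first_name, last_name, role } = user;
    vec![
        (0, id.to_string()),
        (1, first_name.clone()),
        (2, last_name.clone()),
        (3, role_index(*role).to_string()),
    ]
}
```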

Brendan Hansknecht (May 15 2023 at 13:52):

Proto is reasonable to talk about because it is a format edge case, but you are correct that long term, proto should be auto-generated from a definition file. That said, for proto to be generated, we need to define how we could support it; the first uses in Roc will be hand-written; and there are other formats without generators that have some or all of proto's complexities. Maybe BSON would technically be a better format to talk about. If you are using BSON, you probably want more density, but you may be using it as just a faster JSON and not really care much about the exact encoding. So auto-derive would be much more useful to you. Just make the frontend alphabetical as well; the message only has a single consumer, so both sides can be updated to a new version at once.

Brendan Hansknecht (May 15 2023 at 13:56):

Was realizing that everything is less efficient than I initially thought because also every struct name will be encoded as a string as well. Just was kinda lamenting that that is sad. Also, it would mean that we would be pushing more users towards opaque types. Opaque types are less convenient.

Brendan Hansknecht (May 15 2023 at 14:03):

Slightly tangential thought: In Rust, for example, there is no implicit auto-derive. There is only an explicit auto-derive. Are we concerned at all about the security implications of an implicit auto-derive? As in, imagine I have a user record. I just encode it and send it to the frontend. One day, a new engineer adds a feature and, as part of it, adds a field to the User type that really should not be public. Because they didn't realize that we encode directly on the User type, we are now encoding and sending that private information to the frontend.

With something like Rust, this could still happen, but it is less likely because the fact that a record could be sent to the frontend (and will implicitly add new fields to the encoding) lives right above the record: #[derive(Serialize)].

Brendan Hansknecht (May 15 2023 at 14:04):

I am honestly starting to lean towards only allowing encode on opaque types at all. I still think we can autoderive, but that should be done something like this:

ProtoRole := [User, Admin, Guest] has Encode {encode: _}

Brendan Hansknecht (May 15 2023 at 14:11):

The big downside is that you either have to use opaque types in many places in your codebase, or you need to make a method to convert from your regular type to your encoding-friendly type. That said, it should just be one method. And if it works out type-wise, it may just be exposing the @MyType method.

Richard Feldman (May 15 2023 at 17:09):

Richard Feldman (May 15 2023 at 17:15):

yeah I thought about this, but the issue is that if you need to make everything opaque in order to serialize it, then people will just start making everything opaque as a matter of course, without thinking about it, and the same mistake will be about as likely to happen

Richard Feldman (May 15 2023 at 17:16):

I think the better solution is to always make sensitive data opaque and don't opt into Encode there (because it is already opt-in on opaque types)

Richard Feldman (May 15 2023 at 17:17):

a good practice for that is to have a wrapper type like Sensitive a := a, which doesn't have Encode and which overrides Display and Inspect to just return "***" so the values don't accidentally end up in logs either
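As a rough analogy in Rust (not Roc), such a wrapper could look like the sketch below. `Sensitive`, `new`, and `expose` are hypothetical names; Rust's `Display` and `Debug` traits stand in for the Display and Inspect abilities mentioned above, and the type simply has no encoding trait implemented.

```rust
use std::fmt;

/// Wrapper for values that must never show up in logs or output.
/// It deliberately implements no serialization trait.
struct Sensitive<T>(T);

impl<T> Sensitive<T> {
    fn new(value: T) -> Self {
        Sensitive(value)
    }

    /// Explicit, greppable accessor for the rare code that needs the value.
    fn expose(&self) -> &T {
        &self.0
    }
}

/// User-facing formatting always prints a mask.
impl<T> fmt::Display for Sensitive<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("***")
    }
}

/// Debug formatting prints the same mask, so `{:?}` on a containing
/// struct does not leak the value either.
impl<T> fmt::Debug for Sensitive<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("***")
    }
}
```

Because the wrapper has no encoder, any type containing it cannot be encoded until someone makes an explicit decision about that field.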

Brendan Hansknecht (May 15 2023 at 22:56):

Brendan Hansknecht (May 15 2023 at 22:57):

I'm not convinced. Opaque types are less convenient to use, especially if you ever cross module boundaries.

Brendan Hansknecht (May 15 2023 at 22:58):

I don't think this would lead to the proliferation of opaque types in most code bases. I think many more people would just add opaque types at the boundaries with a conversion function.

Brendan Hansknecht (May 15 2023 at 22:58):

Brendan Hansknecht (May 15 2023 at 22:59):

I do agree about using opaque types for sensitive data, but I bet many newer users won't even think about that as an option, so I think it will be pretty uncommon.

Brendan Hansknecht (May 15 2023 at 23:00):

For more advanced users that would use opaque types for sensitive data and care more about code quality, I think it is pretty easy to suggest to them that Encode and its related opaque types should be nicely wrapped in their own module and only be converted to/from at the boundaries of the system.

Richard Feldman (May 15 2023 at 23:13):

oh interesting, I think I misunderstood the idea you were proposing - so you're saying we still support deriving Encode and Decode for structural types, we just don't do it automatically?

Richard Feldman (May 15 2023 at 23:13):

in other words, if I make a new opaque type and declare that it has Encode, even if it's a big complicated nested structural type in there, the whole thing will get derived

Richard Feldman (May 15 2023 at 23:14):

but I can no longer give a structural type to Encode.encode directly; rather, I have to wrap it in an opaque type first

Richard Feldman (May 15 2023 at 23:14):

like I have SerializedUser := User and then I call Encode.encode on SerializedUser
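A loose Rust analogy of that arrangement, with a hypothetical `Encode` trait standing in for Roc's ability and a newtype standing in for the opaque type. The JSON-like output is hand-rolled for the sketch and does no string escaping.

```rust
/// Hypothetical stand-in for the Encode ability.
trait Encode {
    fn encode(&self) -> String;
}

/// Plain application type: deliberately has no `Encode` impl.
struct User {
    id: u64,
    name: String,
}

/// Boundary type: the only form of `User` that can be encoded.
struct SerializedUser(User);

/// The single conversion function at the edge of the system.
impl From<User> for SerializedUser {
    fn from(user: User) -> Self {
        SerializedUser(user)
    }
}

impl Encode for SerializedUser {
    fn encode(&self) -> String {
        // Illustrative output only; real code would escape `name`.
        format!("{{\"id\":{},\"name\":\"{}\"}}", self.0.id, self.0.name)
    }
}

/// Accepts only encodable values, so passing a bare `User` is a type error.
fn send<T: Encode>(value: &T) -> String {
    value.encode()
}
```

Calling `send` on a bare `User` does not compile; wrapping it with `SerializedUser::from` is the explicit opt-in step.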

Richard Feldman (May 15 2023 at 23:14):

Brendan Hansknecht (May 15 2023 at 23:54):

Of course, if your opaque type includes other opaque types, those must also have Encode.

Richard Feldman (May 16 2023 at 00:11):

hm, but if everybody is just uncritically doing SerializedUser := User as a matter of course, before passing that to Encode.encode, does that give people more pause when it comes to checking whether they're encoding secrets?

Richard Feldman (May 16 2023 at 00:11):

Richard Feldman (May 16 2023 at 00:12):

actually, come to think of it - that would probably be an effective solution to the index problem

Richard Feldman (May 16 2023 at 00:12):

Richard Feldman (May 16 2023 at 00:13):

because SerializedUser := User followed by Encode.encode (@SerializedUser { role: Guest, ... }) will make the Guest have the type [Guest, Admin, Moderator] because of the @SerializedUser
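A small Rust sketch of why the closed type matters. It assumes, for illustration only, that a tag's index is its position in the alphabetically sorted list of all tags in the union; `tag_index` is a hypothetical helper.

```rust
/// Position of `tag` among the alphabetically sorted tags of a union.
/// (Assumed indexing rule, for illustration.)
fn tag_index(known_tags: &[&str], tag: &str) -> Option<usize> {
    let mut sorted: Vec<&str> = known_tags.to_vec();
    sorted.sort_unstable();
    sorted.iter().position(|t| *t == tag)
}
```

If only `Guest` is known, as with an open `[Guest]*`, the index comes out as 0. Once the full union `[Guest, Admin, Moderator]` is pinned down, the same tag gets index 1, which is why the index can only be trusted when the whole type is known.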

Richard Feldman (May 16 2023 at 00:14):

so I guess if we want to support index-based encoding/decoding, that's a possible way we could do it

Brendan Hansknecht (May 16 2023 at 00:32):

Brendan Hansknecht (May 16 2023 at 00:33):

Brendan Hansknecht (May 16 2023 at 00:34):

Brendan Hansknecht (May 16 2023 at 00:37):

But yeah, I guess most of the time when using this, you will just do SerializedType := Type has Encode {encode: _}
So that wouldn't actually help with security at all.

Brendan Hansknecht (May 16 2023 at 00:38):

I was initially thinking that the SerializedType would be defined explicitly. But I guess that probably would not be common.
