Stream: compiler development

Topic: Boxing Anything


view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 18:26):

Just to ask this concretely, should these effects be allowed:

storeAnything : Box a -> Effect U64
loadAnything : U64 -> Effect (Box a)

They are fundamentally what enable dynamic FFI and would enable caching any data in basic webserver. They just have no type safety. Unless loadAnything could inspect the type expected to be returned, roc user code could request a Box Str when you actually have a Box U64 stored.

view this post on Zulip Jasper Woudenberg (Jul 05 2024 at 19:27):

These remind me a bit of the Var type in Haskell. Maybe you're already familiar; their types (translated to Roc syntax) would look like:

new : Box a -> Effect (Var a)
read : Var a -> Effect (Box a)
write : Var a, Box a -> Effect {}

Where Var a represents a mutable reference to some value of type a. I think that would allow caching in a type-safe manner.

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 19:29):

I don't think roc currently would allow you to define Var a cause it would have an unused type variable.

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 19:35):

Also, just to give the more complex ffi example, it is essentially:

ffi: Lib, Str, Box a -> Effect (Box b)

So I don't think there is any sort of Var a strategy that could work. But I think that is expected. Obviously, FFI can totally transform types.

view this post on Zulip Jasper Woudenberg (Jul 05 2024 at 19:36):

I think Roc allows it. Maybe I misunderstand?

$ roc repl
» Var a := U64
» mkvar : a -> Var a
… mkvar = \_ -> @Var 42

<function> : a -> Var a
» mkvar "Hi!"

@Var 42 : Var Str

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 19:36):

Oh interesting. I thought we generated a warning for that. Guess not.

view this post on Zulip Agus Zubiaga (Jul 05 2024 at 19:41):

I think we don't because phantom types are useful :smile:

view this post on Zulip Richard Feldman (Jul 05 2024 at 19:44):

yeah it's an error for type aliases but not even a warning for opaque types (because yeah, phantom types are useful!)

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 19:45):

Ah, alias vs opaque. Makes sense.

view this post on Zulip Richard Feldman (Jul 05 2024 at 19:48):

Brendan Hansknecht said:

Just to ask this concretely, should these effects be allowed:

storeAnything : Box a -> Effect U64
loadAnything : U64 -> Effect (Box a)

They are fundamentally what enable dynamic FFI and would enable caching any data in basic webserver. They just have no type safety. Unless loadAnything could inspect the type expected to be returned, roc user code could request a Box Str when you actually have a Box U64 stored.

this is a very good question

view this post on Zulip Richard Feldman (Jul 05 2024 at 19:48):

there's some other way to do it which is less efficient, right?

view this post on Zulip Richard Feldman (Jul 05 2024 at 19:48):

like with List U8 instead of a

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 19:51):

Yeah, you could do a List U8 instead of Box a and force some form of encoding and decoding.

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 19:52):

You also can make it type safe with Var a as @Jasper Woudenberg mentioned.

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 19:53):

To the platform, box is just a pointer, so they can handle and store it safely for the most part. (refcounting may have some complications, but I'm not 100% sure)

view this post on Zulip Richard Feldman (Jul 05 2024 at 21:21):

yeah I think for now we should be conservative and disallow this

view this post on Zulip Richard Feldman (Jul 05 2024 at 21:21):

and consider supporting it in the future if there’s demand for it in practice

view this post on Zulip Richard Feldman (Jul 05 2024 at 21:22):

because it’s both super unsafe and also prevents an automatic replay feature

view this post on Zulip Richard Feldman (Jul 05 2024 at 21:22):

because we can’t know the layout of what’s behind that pointer, so we don’t know how to write it down to replay it later

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 21:24):

Hmm... You have to allow some form of it due to Box a being used for models? Or is that different

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 21:26):

Like in an elm architecture style app

view this post on Zulip Richard Feldman (Jul 05 2024 at 22:04):

yeah that's different

view this post on Zulip Richard Feldman (Jul 05 2024 at 22:04):

I specifically mean *

view this post on Zulip Richard Feldman (Jul 05 2024 at 22:06):

hm, actually

view this post on Zulip Richard Feldman (Jul 05 2024 at 22:06):

I guess for replay we can know what the type is after monomorphization :thinking:

view this post on Zulip Richard Feldman (Jul 05 2024 at 22:07):

because we'll have inferred it

view this post on Zulip Richard Feldman (Jul 05 2024 at 22:07):

so we could use that to know the layout, which in turn means we'd know how to traverse it for replay

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 22:07):

Yeah, roc knows all the types. The platform just doesn't

view this post on Zulip Richard Feldman (Jul 05 2024 at 22:07):

it's unsafe, but to be fair, literally anything we get from the host could be a bad pointer

view this post on Zulip Richard Feldman (Jul 05 2024 at 22:07):

so it's all equally unsafe in that sense

view this post on Zulip Richard Feldman (Jul 05 2024 at 22:08):

the difference is whether application authors can cause UB

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 22:08):

The part that makes it less safe is that if exposed directly to userland, the app author can hit it

view this post on Zulip Richard Feldman (Jul 05 2024 at 22:08):

which they can if we allow this, and can't if we don't allow this

view this post on Zulip Richard Feldman (Jul 05 2024 at 22:08):

right

view this post on Zulip Richard Feldman (Jul 05 2024 at 22:09):

so we'd go from "only platform authors can cause UB" to "application authors (or any of the libraries they use which return Task) can cause UB if the platform supports it"

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 22:20):

Yep. Though I don't think UB is the right term. More like can generate totally broken bindings that crash the app or return garbage data.

That said, it is one of those weird cases where depending on how it is exposed, it could be totally safe. Like stored with Var a. So kinda a case where it enables new features, but those features may have too much power

view this post on Zulip Richard Feldman (Jul 05 2024 at 22:42):

unfortunately I think UB is accurate here :big_smile:

view this post on Zulip Richard Feldman (Jul 05 2024 at 22:43):

if you get the layout wrong, all bets are off and anything could happen

view this post on Zulip Richard Feldman (Jul 05 2024 at 22:43):

including overwriting data in a completely unrelated data structure which was unfortunate enough to be adjacent in memory to the thing with the wrong layout, which in turn leads to other incorrect pointers, which…etc etc

view this post on Zulip Richard Feldman (Jul 05 2024 at 22:45):

yeah this feels like “we can always introduce support later if it feels worth it, but it would be a big breaking change to take away if we support it now and regret it later”

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 22:53):

UB is just a very specific compiler term that means something different. Like when C++ or LLVM says UB, it means that they won't specify a specific behaviour, generally for performance reasons. Then the optimizer will pick the faster choice for the specific hardware.

Having a wrong layout for a call isn't UB. It is totally incorrect and broken code.

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 22:54):

Richard Feldman said:

yeah this feels like “we can always introduce support later if it feels worth it, but it would be a big breaking change to take away if we support it now and regret it later”

Note, we do support this today

view this post on Zulip Ayaz Hafiz (Jul 05 2024 at 23:45):

i’m pretty sure exposing any generic types to the platform, in either input or output is a bad idea

view this post on Zulip Ayaz Hafiz (Jul 05 2024 at 23:45):

it feels like a mistake that it can be done today

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 23:45):

How are you supposed to do something elm architecture like then?

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 23:46):

Where the model depends on the app?

view this post on Zulip Ayaz Hafiz (Jul 05 2024 at 23:48):

that seems like a special case, because the type is specialized to the app right? in that case the runtime should box the model type, but it’s not generic in the sense that the application cannot call it with any other type

view this post on Zulip Ayaz Hafiz (Jul 05 2024 at 23:52):

said another way, from the perspective of the app there are no generic inputs or outputs

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 23:55):

In all of these cases, I think we are talking about actually sending a Box a over to the platform. The app is required to know what a is. I guess in the ffi case specifically, it is so flexible that generally it requires a type annotation or roc compilation will break.

view this post on Zulip Brendan Hansknecht (Jul 05 2024 at 23:55):

I think any sort of type variable passed to the platform has to be wrapped in an indirection. So like List a or Box a. I 100% agree that a raw a passed to a platform is definitely wrong.

view this post on Zulip Ayaz Hafiz (Jul 06 2024 at 00:02):

to me the problem is that this allows the programmer to write a function a -> b where a and b are generic. i think this is prone to writing mistakes, and makes it hard to manage your applications (whenever you change the input or output type, you need to make sure the types propagate but the compiler won’t help you). i think the right solution is runtime type tagging (not saying the language runtime should provide that, but some serialization should be used). The model case is a bit different I think because the programmer must explicitly type the model type (as something non-generic), and so you don’t end up being able to write a function a->b

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 00:15):

I think nested runtime type tagging would be too slow. On the other hand, I think sideband type tagging could work great.

Basically taggedEffect (Box.box a) (Type.type a)

Where if a is a List (Str, I32), Type.type would return a tag explaining that info ListType (TupleType [StrType, I32Type]).

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 00:17):

Then the roc type only has to be boxed and no other serialization has to happen. Type.type would actually always be a compile time constant. So no cost to building it out. Mono would just know the answer.
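A minimal sketch of sideband tagging, in Python rather than Roc. The descriptor helpers (`STR`, `I32`, `U64`, `list_of`, `tuple_of`) and the `store_anything` / `load_anything` functions are hypothetical stand-ins. In the proposal the tag would be a compile-time constant produced by the compiler; here it is passed by hand. The point is that the payload is stored untouched and only the tags are compared.

```python
from typing import Any

# Hypothetical type descriptors, playing the role of StrType, I32Type, etc.
STR = ("Str",)
I32 = ("I32",)
U64 = ("U64",)


def list_of(elem: tuple) -> tuple:
    return ("List", elem)


def tuple_of(*elems: tuple) -> tuple:
    return ("Tuple", elems)


# Hypothetical platform-side cache: key -> (type tag, opaque payload).
_store: dict[int, tuple[tuple, Any]] = {}


def store_anything(value: Any, type_tag: tuple) -> int:
    # The payload is kept as-is; only the tag travels alongside it.
    key = len(_store)
    _store[key] = (type_tag, value)
    return key


def load_anything(key: int, expected_tag: tuple) -> Any:
    actual_tag, value = _store[key]
    # One comparison of descriptors; the payload is never traversed.
    if actual_tag != expected_tag:
        raise TypeError(f"stored {actual_tag}, requested {expected_tag}")
    return value


key = store_anything([("a", 1)], list_of(tuple_of(STR, I32)))
rows = load_anything(key, list_of(tuple_of(STR, I32)))
```

Requesting the same key as `list_of(tuple_of(STR, U64))` raises `TypeError` instead of handing back a value with the wrong layout.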

view this post on Zulip Ayaz Hafiz (Jul 06 2024 at 00:19):

yeah, that’s a much larger change though. that has some things to figure out too, for example if you do not want to add type tags to all runtime values you need to figure out where to drop them, and it may not be trivial to do so, because of the vitality of type inference

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 00:21):

Why is it such a large change? after type inference runs and mono, won't the variable a have a concrete type. So it would just be running a mapping over the concrete type of a? I mean today, inspect could be used to build the Type.type function.

view this post on Zulip Ayaz Hafiz (Jul 06 2024 at 00:23):

yes, but there are a lot of details to discuss - how and when to check the tags, whether to infer whether tags should be added, the tag representation, so on. you’re right that it’s simple, but my guess is it’s not trivial to implement. yes, the actual tag to add is easy to compute

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 00:24):

Ok, yeah. Totally agree with that sentiment.

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 00:25):

I just want to make sure we don't lock to only decode/encode. I think that would be really sad for the perf story of some common patterns. Even having to box is a bit sad, but it gets around variable sized types so makes sense.

view this post on Zulip Ayaz Hafiz (Jul 06 2024 at 00:26):

I also empathize with the concern about runtime type data being too slow, but i would suggest using serialization at the ffi boundary for now until performance becomes a concern in practice at which point some ecosystem or language level solution can be devised. i don’t think any real uses of Roc are going to run into a performance problem here while Roc is still being used for small/medium enterprise applications - and if you do run into something that has a perf problem, you can create a special-case effect. reducing the surface area of the language also makes development easier; there are still a ton of holes in compilation, and might be worth avoiding another potential source of those right now.

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 00:28):

reducing the surface area of the language also makes development easier; there are still a ton of holes in compilation, and might be worth avoiding another potential source of those right now.

Very true

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 00:31):

I guess maybe we just need someone to implement a simple binary serialization format that avoids needing to tag every individual piece of data. Make sure it can represent a List (Str, Int) without generating the equivalent of everything being nested RocObjects. Where you have a RocObject (List) -> RocObject (Tuple) -> RocObject(Str/Int). Cause that is a horrid amount of wrapping that would definitely ruin perf.

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 00:31):

Also, by invent, probably just mean implement. I'm sure something must exist.

view this post on Zulip Richard Feldman (Jul 06 2024 at 01:24):

Brendan Hansknecht said:

I guess maybe we just need someone to implement a simple binary serialization format that avoids needing to tag every individual piece of data. Make sure it can represent a List (Str, Int) without generating the equivalent of everything being nested RocObjects. Where you have a RocObject (List) -> RocObject (Tuple) -> RocObject(Str/Int). Cause that is a horrid amount of wrapping that would definitely ruin perf.

this would be a nice binary serialization format to have in general, not just for FFI!

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 01:50):

Yeah, maybe that will be something I dig into after sqlite. Not really sure what binary format to target though.

Things like protobuf and cap'n proto have no type info in the serialized format and require codegen for decode/encode.

msgpack or maybe bson look reasonable. They have repeated types inline, which is kinda annoying. So a list of 100 strings specifies that each individual element in the list is a string. That said, with both of these, specifying the element type is a single byte. And everything gets built into a single flat buffer. So should be reasonable for perf.

apache avro is one of the few things that seems to have out of band types that are sent as metadata. They send the types as json for some reason. I guess it is meant for big data, so parsing a single json to learn you are decoding a List (Str, Int) is no big deal. Then the actual data is densely packed with no type info.

Anyone have general input on binary formats with type info? I would guess that bson is the most popular simply because it is used with MongoDB. That said, msgpack looks a lot cleaner and simpler. But I don't really have much knowledge of the various options in this space.

view this post on Zulip Luke Boswell (Jul 06 2024 at 01:56):

Not quite sure if it's the same. But folkert mentioned this format to me when I was asking about something similar https://postcard.jamesmunns.com/wire-format

view this post on Zulip Luke Boswell (Jul 06 2024 at 01:57):

Benefit would be interop with rust

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 02:01):

I think postcard is in the same category as protobuf and cap'n proto. Totally type unsafe. I think it is kinda a simpler variant of those two libraries.

So if you pick the wrong type to decode into List (I32, Str) when it should have been List (U64, Str), you will just decode wrong. It may fail. It may succeed and just give you garbage data.

I think for our first spec here, we probably want something with some form of type info, but maybe that is a wrong assumption.
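The silent mis-decode described above can be reproduced with a toy format. This is a sketch in Python using a made-up fixed-width encoding (a little-endian integer, then a 4-byte length, then UTF-8 bytes). It is not postcard's real wire format, and the function names are hypothetical. A (U64, Str) pair is written, then read back as (I32, Str).

```python
import struct


def encode_u64_str(n: int, s: str) -> bytes:
    # Layout: 8-byte unsigned int, 4-byte length, then the string bytes.
    data = s.encode("utf-8")
    return struct.pack("<QI", n, len(data)) + data


def decode_i32_str(buf: bytes) -> tuple[int, str]:
    # Wrong layout on purpose: 4-byte signed int, 4-byte length, string bytes.
    n, length = struct.unpack_from("<iI", buf, 0)
    data = buf[8:8 + length]
    return n, data.decode("utf-8")


wire = encode_u64_str(7, "ok")
wrong = decode_i32_str(wire)  # (7, "")
```

The decoder reads the upper four bytes of the U64 as the string length, which is 0 here, so it returns an empty string and never looks at the real payload. Nothing fails; the result is just wrong. With a larger integer the bogus length would be non-zero and the decoder would read unrelated bytes as text.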

view this post on Zulip Brian Carroll (Jul 06 2024 at 09:13):

One possibility would be to have a header section with a serialized version of the Layout, then a body section with the actual data, in the same format we use at runtime. Pointers would get translated to byte offsets within the payload.

That pointer translation is something we already do in the Web REPL, where the user's compiled app is in a separate address space from the REPL app.
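A small sketch of the header-plus-body shape, in Python. The layout string, field widths, and function names are all assumptions made for illustration; this is not Roc's runtime representation. The header names the layout once, the body holds a List Str with no per-element type tags, and what would be pointers at runtime are stored as byte offsets from the start of the body.

```python
import struct


def encode(strings: list[str], layout: str = "List Str") -> bytes:
    header = layout.encode("ascii")
    blobs = [s.encode("utf-8") for s in strings]
    # Body starts with a count, then one (offset, length) pair per element.
    table_size = 4 + 8 * len(blobs)
    table = struct.pack("<I", len(blobs))
    offset = table_size  # first string sits right after the table
    for blob in blobs:
        table += struct.pack("<II", offset, len(blob))
        offset += len(blob)
    body = table + b"".join(blobs)
    # Header: 2-byte length prefix followed by the layout description.
    return struct.pack("<H", len(header)) + header + body


def decode(buf: bytes, expected_layout: str = "List Str") -> list[str]:
    (header_len,) = struct.unpack_from("<H", buf, 0)
    layout = buf[2:2 + header_len].decode("ascii")
    if layout != expected_layout:
        raise TypeError(f"expected {expected_layout}, found {layout}")
    body = buf[2 + header_len:]
    (count,) = struct.unpack_from("<I", body, 0)
    result = []
    for i in range(count):
        # Offsets are relative to the body, so the buffer can live anywhere.
        off, length = struct.unpack_from("<II", body, 4 + 8 * i)
        result.append(body[off:off + length].decode("utf-8"))
    return result


payload = encode(["roc", "box", ""])
roundtrip = decode(payload)
```

Checking the header once up front gives the type safety discussed earlier in the thread, while the body stays densely packed.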

view this post on Zulip Richard Feldman (Jul 06 2024 at 10:49):

we can make a roc-specific one like rvn if there’s nothing off the shelf that does what we want!

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 15:36):

For sure. I was trying to find something off the shelf simply to help with value. Like bson would both help here and would be useful if someone interacts with mongo DB. Plus it means we don't have to implement it in all host languages.

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 15:38):

Brian Carroll said:

One possibility would be to have a header section with a serialized version of the Layout, then a body section with the actual data, in the same format we use at runtime. Pointers would get translated to byte offsets within the payload.

This is something I am trying to make as an Encoder/Decoder written in roc. I think it would actually be pretty hard to match Roc's runtime format using an encoder.

view this post on Zulip Richard Feldman (Jul 06 2024 at 16:27):

Brendan Hansknecht said:

For sure. I was trying to find something off the shelf simply to help with value. Like bson would both help here and would be useful if someone interacts with mongo DB

true, although if someone wants bson and we have this, it should give them a very strong starting point!


Last updated: Jul 06 2025 at 12:14 UTC