Stream: API design

Topic: reading integers from bytes


view this post on Zulip Richard Feldman (Jan 23 2024 at 20:26):

The current Num.bytesToU16 (and similar) functions don't take endianness into account, which means they'll give different answers on different CPUs. We've been eliminating those scenarios, and this seems like one worth eliminating too.

More broadly, I recently realized that adding things to Num which rely on endianness (like bytesToU16 and similar) was probably premature. I think they should be removed for now and revisited later in the context of specific concrete use cases.

For example, binary serialization formats necessarily specify things like endianness as part of the format, so it's not clear to me how helpful dedicated Num builtins for translating between bytes and certain integer sizes would be in practice.

I don't think this would block anything from being built in Roc. Consider Zig's std.mem.readVarPackedInt. It supports decoding bit offsets, and does everything in userspace using (as far as I can see) operations that are available in Num in Roc. If a function like that is implementable in userspace, I think all the relevant use cases should be unblocked here.

Obviously in the future we can revisit this if specific use cases come up which justify builtins, but I think in this case it's worth the forcing function of starting with userspace and seeing what that experience is like in practice.
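
To make the userspace idea concrete, here is a minimal sketch in Rust (not Roc) of reading a U16 from bytes using only shifts and ORs, with the byte order passed in explicitly. The names Endian and read_u16 are hypothetical and only illustrate the approach; they are not an existing or proposed Roc API.

```rust
/// Byte order named by the format being decoded, never by the host CPU.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Endian {
    Big,
    Little,
}

/// Reads a u16 starting at `offset`, or returns None if fewer than
/// two bytes remain. Only indexing, shifts, and ORs are used, so the
/// result is the same on every CPU.
pub fn read_u16(bytes: &[u8], offset: usize, endian: Endian) -> Option<u16> {
    let first = *bytes.get(offset)? as u16;
    let second = *bytes.get(offset.checked_add(1)?)? as u16;
    Some(match endian {
        // The first byte in the buffer is the most significant one.
        Endian::Big => (first << 8) | second,
        // The first byte in the buffer is the least significant one.
        Endian::Little => (second << 8) | first,
    })
}
```

Because the byte order is an argument rather than a property of the machine, the same input always produces the same integer.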

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 20:43):

The plan here was to change the api and make it take endianness

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 20:46):

Also to make it return a tuple when going from a U16 to bytes.

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 20:48):

The apis discussed were the following:

Num.u16ToBytes : U16, [BE, LE] -> (U8, U8)
Num.u16FromBytes : (U8, U8), [BE, LE] -> U16

Num.appendBytesToList : List U8, Num a, [BE, LE] -> List U8
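
For readers who want to see the shape of these signatures in action, here is a rough Rust analogue. It is only a sketch: Endian, u16_to_bytes, u16_from_bytes, and append_u32 are hypothetical stand-ins for the Roc signatures above, and append_u32 covers a single integer width rather than the generic Num a.

```rust
/// Stand-in for the [BE, LE] tag union in the signatures above.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Endian {
    Be,
    Le,
}

/// U16 -> (U8, U8): the tuple is ordered as the bytes would appear in a buffer.
pub fn u16_to_bytes(value: u16, endian: Endian) -> (u8, u8) {
    let [hi, lo] = value.to_be_bytes();
    match endian {
        Endian::Be => (hi, lo),
        Endian::Le => (lo, hi),
    }
}

/// (U8, U8) -> U16: the inverse of u16_to_bytes for the same byte order.
pub fn u16_from_bytes(bytes: (u8, u8), endian: Endian) -> u16 {
    match endian {
        Endian::Be => u16::from_be_bytes([bytes.0, bytes.1]),
        Endian::Le => u16::from_le_bytes([bytes.0, bytes.1]),
    }
}

/// Appends the four bytes of `value` to `list` in the requested order.
pub fn append_u32(mut list: Vec<u8>, value: u32, endian: Endian) -> Vec<u8> {
    match endian {
        Endian::Be => list.extend_from_slice(&value.to_be_bytes()),
        Endian::Le => list.extend_from_slice(&value.to_le_bytes()),
    }
    list
}
```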

view this post on Zulip Richard Feldman (Jan 23 2024 at 21:36):

yeah I remember the discussion, I just think a better plan is to try taking them out altogether :big_smile:

view this post on Zulip Richard Feldman (Jan 23 2024 at 21:37):

and seeing if it really feels like there's justification for making them builtins after all, based on how they end up being used in practice

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 21:37):

I don't understand though. They will clearly be wanted for binary protocols

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 21:37):

They also are much more performant as builtins than as bit shifting.

view this post on Zulip Richard Feldman (Jan 23 2024 at 21:38):

hm, why would they be more performant as builtins? :thinking:

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 21:38):

Cause they just use type casting instead of bitshifting.

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 21:39):

Feels like something llvm should be able to optimize to the same thing, but IIRC when working on hashing, that wasn't the case.

view this post on Zulip Richard Feldman (Jan 23 2024 at 21:39):

you mean in the case where the target endianness matches the requested endianness?

view this post on Zulip Richard Feldman (Jan 23 2024 at 21:40):

yeah I assumed LLVM would optimize that, surprising that it doesn't

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 21:40):

Both are faster, but that is the most affected

view this post on Zulip Richard Feldman (Jan 23 2024 at 21:40):

well it's not safe to cast unless the endianness matches, right?

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 21:40):

I can double check. Maybe I had something else off

view this post on Zulip Richard Feldman (Jan 23 2024 at 21:40):

otherwise you can end up with the wrong answer

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 21:40):

There is a single instruction to flip the endianness though

view this post on Zulip Richard Feldman (Jan 23 2024 at 21:40):

whoa, I didn't know that! :astonished:

view this post on Zulip Richard Feldman (Jan 23 2024 at 21:41):

ok if LLVM doesn't optimize that, that's a very important consideration haha

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 21:41):

Yeah cause most network protocols are big endian but most cpus are little endian.

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 21:41):

Will go double check in godbolt now. Maybe before something else was messing up the generation.

view this post on Zulip Richard Feldman (Jan 23 2024 at 21:45):

appreciate it!

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 21:52):

Ok, nvm, llvm gets it: https://zig.godbolt.org/z/xG4YPYcoG

I wonder what I hit before that messed this up... :shrug:
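
The comparison being checked here can be sketched in Rust (the godbolt link above uses Zig; this is a separate illustration with hypothetical function names). Both functions below compute the same big-endian read: one with shifts, one by reinterpreting the bytes in native order and then converting from big endian, which is the "cast plus byte swap" form.

```rust
/// Big-endian read written with shifts and ORs.
pub fn read_u32_be_shifts(b: [u8; 4]) -> u32 {
    ((b[0] as u32) << 24) | ((b[1] as u32) << 16) | ((b[2] as u32) << 8) | (b[3] as u32)
}

/// Big-endian read written as a native-order reinterpretation followed
/// by a conversion from big endian (a byte swap on little-endian hosts,
/// a no-op on big-endian hosts).
pub fn read_u32_be_cast(b: [u8; 4]) -> u32 {
    u32::from_be(u32::from_ne_bytes(b))
}
```

Whether a given compiler version lowers the shift version to the same instructions as the cast version is something to confirm in a tool like godbolt rather than assume.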

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 21:52):

So verbose, but doable

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 21:54):

Oh sorry, I remember the issue I was hitting. It is in bytes to integers.

We would need a way to load a tuple from a list. In current roc, you have to get each individual element. That is a huge cost.

So I guess the primitive that I would need is List.get8: List a, Nat -> Result (a, a, a, a, a, a, a, a) [OutOfBounds]. Same for other numeric sizes.

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 21:56):

Otherwise, you are stuck with n branches to check size (which hopefully optimize into one), and then loading each individual element a single byte at a time.

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 21:57):

Which I guess the proposed Num.*FromBytes apis don't actually fix. So good thing this was thought about now either way.

view this post on Zulip Richard Feldman (Jan 23 2024 at 21:58):

ha, makes sense!

view this post on Zulip Richard Feldman (Jan 23 2024 at 21:59):

yeah those list primitives seem totally reasonable :+1:

view this post on Zulip Richard Feldman (Jan 23 2024 at 22:00):

so how about we try adding those primitives and see how it goes in practice?

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 22:03):

Oh wait, even better, we shouldn't need the primitives. Pattern matching and seamless slices for the win.

when List.dropFirst list index is
    [a, b, c, d, e, f, g, h, ...] ->
    _ ->
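
The same "slice, then pattern match on a fixed number of leading elements" idea can be sketched in Rust, which has a comparable slice pattern. The function name read_u64_le_at is hypothetical, and the little-endian choice is just an assumption for the example.

```rust
/// Reads eight bytes starting at `index` as a little-endian u64.
/// Returns None if `index` is past the end or fewer than eight bytes remain.
pub fn read_u64_le_at(list: &[u8], index: usize) -> Option<u64> {
    // `get(index..)` plays the role of dropping the first `index` elements
    // without copying; it yields None when `index` exceeds the length.
    match list.get(index..)? {
        // A single length check covers all eight elements.
        [a, b, c, d, e, f, g, h, ..] => {
            Some(u64::from_le_bytes([*a, *b, *c, *d, *e, *f, *g, *h]))
        }
        _ => None,
    }
}
```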

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 22:04):

Yeah, I say let's leave it to userland. Should be trivial for someone to make a package for it if they want to.

view this post on Zulip Richard Feldman (Jan 23 2024 at 22:13):

wow, great point!!!

view this post on Zulip Luke Boswell (Jan 23 2024 at 22:21):

The specific use case I want this for is a binary encoder/decoder so I can efficiently cache data and reuse it between calls in basic-webserver. We can add something like set : U64, List U8 -> Task {} [OutOfSpace] and get : U64 -> Task (List U8) [NotFound] and encode/decode in the app or even the platform if we want.

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 22:25):

Sure, so when you define the binary encoder/decoder, you just have to manually implement this stuff in Encoder.i32 and friends.

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 22:26):

It all can be done in userland

view this post on Zulip Luke Boswell (Jan 23 2024 at 22:27):

Maybe if you could give me a worked example for I32 or something I can make the rest happen. I'm not sure I follow how it all comes together

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 22:30):

Let's pair, I think we have a few different things to discuss

view this post on Zulip Brendan Hansknecht (Jan 23 2024 at 23:23):

Ok. so there are a few functions we do need for binary encoding in the std lib:

We need dec to/from i128. We also need f32/f64 to/from sign, exponent, and mantissa.

view this post on Zulip Luke Boswell (Jan 23 2024 at 23:25):

We should use these for the impl

inline fn floatExponentBits(comptime T: type) comptime_int
inline fn floatMantissaBits(comptime T: type) comptime_int

view this post on Zulip Brendan Hansknecht (Jan 24 2024 at 01:23):

@Richard Feldman For dec, it obviously can't use Num.toI128 to get the underlying I128. I'm not exactly sure of a good name for it. I guess it technically is atto in terms of metric prefixes. Num.decToAtto doesn't sound even vaguely discoverable. Could do Num.decToBytes and return a tuple of U8. Or maybe Num.decToRaw or something like that....yeah...not sure of a good name here.

view this post on Zulip Luke Boswell (Jan 24 2024 at 01:50):

I like Num.decToRaw or Num.decToBytes

view this post on Zulip Brendan Hansknecht (Jan 28 2024 at 02:29):

Cause this will be useful for some of the fuzzing stuff I am currently looking at, wanted to pin down api a bit. Also, will be needed for binary encoding/decoding in roc.

For Dec, I think it can probably simply be this with a doc mentioning that these are integers scaled by 10^-18

Num.decToRaw : Dec -> I128
Num.decFromRaw : I128 -> Dec
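
As an illustration of what "integers scaled by 10^-18" means, here is a small Rust sketch. The Dec struct below is a hypothetical stand-in (a plain wrapper around an i128), not Roc's actual implementation; it only shows how the raw integer relates to the decimal value.

```rust
/// 10^18: one whole unit expressed in the raw integer.
const DEC_SCALE: i128 = 1_000_000_000_000_000_000;

/// Hypothetical stand-in for Dec: the value is `raw * 10^-18`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Dec(i128);

impl Dec {
    /// Analogue of decFromRaw: wraps the scaled integer unchanged.
    pub fn from_raw(raw: i128) -> Dec {
        Dec(raw)
    }

    /// Analogue of decToRaw: returns the scaled integer unchanged.
    pub fn to_raw(self) -> i128 {
        self.0
    }

    /// The whole-number portion, for contrast with the raw value.
    pub fn whole_part(self) -> i128 {
        self.0 / DEC_SCALE
    }
}
```

Under this assumption, 1.5 has the raw value 1_500_000_000_000_000_000, while its whole-number part is 1.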

For float, I think there are a few possibilities for the api.
Probably most direct would be:

Num.f32ToRaw : F32 -> { sign: Bool, exponent: U8, fraction: U32 }
Num.f64ToRaw : F64 -> { sign: Bool, exponent: U16, fraction: U64 }

-- Also the reverse

That said, we could also just allow bitcasting a float to/from a U32/U64.

Num.f32ToRaw : F32 -> U32
Num.f64ToRaw : F64 -> U64

-- Also the reverse

Or a direct byte function of some sort that gives a tuple, but I think that would be less useful.

For all the above float APIs, they could also use signed types instead of unsigned.

Anyone have any thoughts on what would be the nicest api here?
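
To compare the two float options side by side, here is a hedged Rust sketch. F32Parts and f32_to_parts are hypothetical names mirroring the record-returning signature; the bitcast option corresponds to Rust's existing f32::to_bits. The layout assumed is IEEE 754 binary32: 1 sign bit, 8 exponent bits, 23 fraction bits.

```rust
/// Mirrors { sign: Bool, exponent: U8, fraction: U32 } from the first option.
#[derive(Debug, PartialEq)]
pub struct F32Parts {
    pub sign: bool,
    pub exponent: u8,
    pub fraction: u32,
}

/// Splits the raw bits (the second option) into the three fields.
pub fn f32_to_parts(x: f32) -> F32Parts {
    let bits = x.to_bits();
    F32Parts {
        sign: (bits >> 31) == 1,
        // Biased exponent, stored in bits 23..=30.
        exponent: ((bits >> 23) & 0xff) as u8,
        // Only the low 23 bits of the U32 are ever used.
        fraction: bits & 0x007f_ffff,
    }
}
```

The parts form can always be derived from the bitcast form with a few shifts and masks, which is one argument for exposing only the bitcast.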

view this post on Zulip Brendan Hansknecht (Jan 28 2024 at 02:30):

For extracting parts of a float, we could make each part its own function, but we can't really do that for building a float. I mean we could, but it would be kinda strange to like apply the fractional part and then add on an exponent

view this post on Zulip Luke Boswell (Jan 28 2024 at 07:39):

Looking at the postcard wire format for no particular reason other than it looks like a useful reference, https://postcard.jamesmunns.com/wire-format#13---f64

They say that an f64 will be bitwise converted into a u64, and encoded as a little-endian array of eight bytes.

For example, the float value -32.005859375f64 would be bitwise represented as 0xc040_00c0_0000_0000u64, and encoded as [0x00, 0x00, 0x00, 0x00, 0xc0, 0x00, 0x40, 0xc0].

Would the above Num.f64ToRaw : F64 -> U64 be the same as this? I assume we might want the more explicit { sign: Bool, exponent: U16, fraction: U64 } if we want to support really specific encodings/decodings?
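
The postcard example above can be reproduced in Rust with standard library calls, which is a quick way to see what a bitwise F64 -> U64 conversion followed by a little-endian write produces. The function name encode_f64_le is hypothetical.

```rust
/// Bitwise-converts the float to a u64, then emits its bytes in
/// little-endian order, as the postcard wire format describes.
pub fn encode_f64_le(x: f64) -> [u8; 8] {
    x.to_bits().to_le_bytes()
}
```

For -32.005859375 the intermediate u64 is 0xc040_00c0_0000_0000 and the encoded bytes are [0x00, 0x00, 0x00, 0x00, 0xc0, 0x00, 0x40, 0xc0], matching the values quoted from the postcard documentation.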

view this post on Zulip Luke Boswell (Jan 28 2024 at 07:41):

I guess we can always just bitshift things around if we need a different ordering. Though maybe the API should be more like Num.f64ToRaw : F64, [BE, LE] -> U64?

view this post on Zulip Brendan Hansknecht (Jan 28 2024 at 07:46):

Converting to a U64 without an endian specifier should be fine. It will just stay in the same endian ambiguous form. Then you can write it in little endian form into the final buffer.

view this post on Zulip Brendan Hansknecht (Jan 28 2024 at 07:48):

Cause both the float and the int will be in native endian. Then you write the int into the buffer lowest byte to highest byte to get a little endian buffer
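
A minimal Rust sketch of "write the int lowest byte to highest byte", using only shifts so that the host's byte order never matters. The name push_u64_le is hypothetical.

```rust
/// Appends `value` to `buf` from least significant byte to most
/// significant byte, producing a little-endian encoding on any host.
pub fn push_u64_le(buf: &mut Vec<u8>, value: u64) {
    for i in 0..8 {
        // Shifting selects byte `i` by numeric significance, not by
        // memory position; the `as u8` cast keeps the low eight bits.
        buf.push((value >> (8 * i)) as u8);
    }
}
```

Feeding it the bits of a float gives the same bytes as a dedicated little-endian conversion, so no endian argument is needed on the float-to-integer step itself.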

view this post on Zulip Richard Feldman (Jan 28 2024 at 12:18):

what about these for names?

Num.withoutDecimalPt : Dec -> I128
Num.withDecimalPt : I128 -> Dec

view this post on Zulip Brendan Hansknecht (Jan 28 2024 at 16:29):

Interesting. I definitely get the idea...

I definitely don't think I would ever think to reach for a function named that. Also, I am a bit concerned the name is too close to withoutDecimalPart which sounds like it would return the whole number portion of the Dec.

view this post on Zulip Richard Feldman (Jan 28 2024 at 16:58):

yeah I just always try to avoid names that basically say "to internal implementation" because it pretty much guarantees you can never change the internal implementation

view this post on Zulip Richard Feldman (Jan 28 2024 at 16:59):

as opposed to a name that describes the transformation, which at least potentially leaves the door open to changing the internal representation in the future and then backporting the function to still return what it says it does

view this post on Zulip Richard Feldman (Jan 28 2024 at 17:00):

which in this particular case might never happen, but people look to builtins for naming conventions, so I want to avoid establishing "to internal representation" as a naming convention in builtins if possible! :big_smile:

view this post on Zulip Brendan Hansknecht (Jan 28 2024 at 17:22):

True, but these functions are actually for serde and binary protocols. They truly are meant to get the raw bytes.

view this post on Zulip Brendan Hansknecht (Jan 28 2024 at 17:23):

These types just don't allow raw access like integers (via bit shifts and masks)

view this post on Zulip Brendan Hansknecht (Jan 28 2024 at 17:24):

Personally, I think I would prefer just a single generic Num.bitCast

view this post on Zulip Luke Boswell (Jan 28 2024 at 21:47):

Num.bitCast sounds nice. Does it always return a List U8?

view this post on Zulip Brendan Hansknecht (Jan 28 2024 at 23:19):

No, it converts any numeric type into any other numeric type by just directly moving the bits.

view this post on Zulip Brendan Hansknecht (Jan 28 2024 at 23:20):

No checks. If the old type is smaller, zero pad (maybe sign extend?). If the old type is bigger, truncate.
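
The three behaviors mentioned here (zero pad, sign extend, truncate) can be shown with Rust's integer `as` casts, used only as an analogy for what a generic bit-moving conversion might do. The function names are hypothetical.

```rust
/// Smaller unsigned source: the new high bits are filled with zeros.
pub fn zero_extend(x: u8) -> u32 {
    x as u32
}

/// Smaller signed source: the new high bits copy the sign bit.
pub fn sign_extend(x: i8) -> u32 {
    x as u32
}

/// Larger source: the high bits are discarded.
pub fn truncate(x: u32) -> u8 {
    x as u8
}
```

The "zero pad or sign extend?" question matters for negative inputs: the byte 0xff widens to 0x0000_00ff under zero padding but to 0xffff_ffff under sign extension.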

view this post on Zulip Brendan Hansknecht (Jan 28 2024 at 23:20):

I think zig has @bitCast that would be the same

view this post on Zulip Richard Feldman (Jan 28 2024 at 23:34):

I'm a little worried about that...I don't know if a lot of people appreciate the distinction between bit cast and numeric cast, and I could see people calling Num.bitCast thinking it will work more like Num.intCast
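
The distinction being worried about can be seen in a few lines of Rust, where the two operations have different spellings. The wrapper names numeric_cast and bit_cast are hypothetical; `as` and f32::to_bits are the real Rust operations.

```rust
/// Numeric cast: converts the value, so 1.0 becomes the integer 1.
pub fn numeric_cast(x: f32) -> u32 {
    x as u32
}

/// Bit cast: keeps the bit pattern, so 1.0 becomes 0x3f80_0000.
pub fn bit_cast(x: f32) -> u32 {
    x.to_bits()
}
```

Someone expecting numeric behavior from a bit cast would get 1065353216 instead of 1 for the input 1.0, which is the kind of surprise described above.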

view this post on Zulip Brendan Hansknecht (Jan 28 2024 at 23:38):

What are intCast's semantics currently? I think it is a bitcast, just only for integers.

view this post on Zulip Brendan Hansknecht (Jan 28 2024 at 23:41):

Though maybe it panics if a value doesn't fit or is supposed to panic if a value doesn't fit?

view this post on Zulip Richard Feldman (Jan 29 2024 at 00:23):

I'm not sure...also I'm not totally sure we should have that one either :laughing:

view this post on Zulip Brendan Hansknecht (Jan 29 2024 at 00:40):

Fair enough. In that case, sounds like a couple of bespoke methods. One to get the parts of a float and one to remove the decimal point from a dec is probably the way to go.

view this post on Zulip Brendan Hansknecht (Jan 29 2024 at 00:40):

Then of course the reverse method for building the types.

view this post on Zulip Brendan Hansknecht (Jan 29 2024 at 00:45):

Num.withoutDecimalPoint : Dec -> I128
Num.withDecimalPoint : I128 -> Dec

Num.f32ToParts : F32 -> { sign : Bool, exponent : U8, fraction : U32 }
-- plus reverse

-- plus for f64

view this post on Zulip Richard Feldman (Jan 29 2024 at 00:55):

sounds good to me!

view this post on Zulip Fabian Schmalzried (Mar 15 2024 at 07:50):

I will try to implement those

view this post on Zulip Fabian Schmalzried (Mar 20 2024 at 07:07):

What should happen if the fraction is bigger than allowed in f32FromParts? Ignore the extra bits, or should it return a result?
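
The two options in this question can be sketched in Rust. This does not say which behavior Roc chose; the names f32_from_parts_masked, f32_from_parts_checked, and FractionTooLarge are hypothetical and only illustrate the trade-off. An F32 fraction has 23 bits, so a U32 argument can carry up to 9 extra bits.

```rust
const FRACTION_BITS: u32 = 23;
const FRACTION_MASK: u32 = (1 << FRACTION_BITS) - 1;

/// Hypothetical error for the Result-returning variant.
#[derive(Debug, PartialEq)]
pub struct FractionTooLarge;

/// Option 1: silently ignore any bits above the low 23.
pub fn f32_from_parts_masked(sign: bool, exponent: u8, fraction: u32) -> f32 {
    let bits = ((sign as u32) << 31)
        | ((exponent as u32) << FRACTION_BITS)
        | (fraction & FRACTION_MASK);
    f32::from_bits(bits)
}

/// Option 2: reject fractions that do not fit in 23 bits.
pub fn f32_from_parts_checked(
    sign: bool,
    exponent: u8,
    fraction: u32,
) -> Result<f32, FractionTooLarge> {
    if fraction > FRACTION_MASK {
        return Err(FractionTooLarge);
    }
    Ok(f32_from_parts_masked(sign, exponent, fraction))
}
```

Masking keeps the function total and cheap, while the checked form surfaces caller mistakes at the cost of a Result at every call site.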

view this post on Zulip Luke Boswell (Apr 14 2025 at 01:47):

Just saw this Issue raised https://github.com/roc-lang/roc/issues/7739 -- is this thread effectively the direction we plan on going for this, so Num.f32ToParts?

view this post on Zulip Luke Boswell (Apr 14 2025 at 01:49):

I thought we had an issue for this but maybe we never made one as I can't find it

view this post on Zulip Brendan Hansknecht (Apr 14 2025 at 02:12):

They are already implemented?
https://github.com/roc-lang/roc/blob/966d0459e7ccb1bd28cb77c05b8419953ef167af/crates/compiler/builtins/roc/Num.roc#L154-L157

view this post on Zulip Brendan Hansknecht (Apr 14 2025 at 02:13):

That said, in practice, I think this ended up being the wrong decision (cause roc doesn't have arbitrary width integers)

view this post on Zulip Brendan Hansknecht (Apr 14 2025 at 02:13):

Frankly, at this point, I would suggest removing this and going with the raw conversion like the issue you linked suggested.

view this post on Zulip Brendan Hansknecht (Apr 14 2025 at 02:13):

let the user do the bit twiddling as they need.

view this post on Zulip Luke Boswell (Apr 14 2025 at 02:35):

Ahk, I forgot about that. I don't have a strong opinion here, but moving to the simpler API sounds good to me.

I think we should drop a comment/update on that Issue so Lars or someone else has a decision to reference and is unblocked to progress the change.

WDYT @Brendan Hansknecht ?

view this post on Zulip Brendan Hansknecht (Apr 14 2025 at 03:44):

I commented on the issue


Last updated: Jul 06 2025 at 12:14 UTC