so reading through @dank's awesome JVM interop example, I noticed that the JVM's Native Interface requires that strings are encoded in Modified UTF-8. Apparently Android's Dalvik uses the same representation, so this would be relevant for any Roc application running on Android.
separately, many C functions don't allow null characters because they use one to tell where the string terminates
however, UTF-8 allows null characters in it
today, Roc strings are encoded as UTF-8, which means they allow null characters
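To make the mismatch concrete, here is a minimal Rust sketch from a host's point of view. `to_c_string` is a hypothetical helper name; `CString::new` is the standard library call that scans for interior 0 bytes and refuses to build a C string if it finds one.

```rust
use std::ffi::CString;

/// Hypothetical helper: tries to build a C string from a Rust string.
/// Returns None when the input contains a 0 byte anywhere.
fn to_c_string(s: &str) -> Option<CString> {
    // CString::new walks the whole input and fails on any 0 byte,
    // because C would treat that byte as the end of the string.
    CString::new(s).ok()
}
```

A string like `"ab\0cd"` is five bytes of perfectly valid UTF-8, yet `to_c_string` returns `None` for it.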
one reason I went with this design (which Rust does too) is that it seems less error-prone; we can just say "Roc strings are valid UTF-8" instead of something more complicated than that, which host authors might get wrong in ways that would be easy to overlook
and which wouldn't show up except in really unusual edge cases
another problem is performance; we want to be able to convert a Rust string (which can contain null characters) from the host directly into a RocStr without having to reallocate it
I just thought of an interesting design idea: what if we did this?
1. Pure Roc code has no way to introduce a null character into a string (no \0, no using \u to encode a null character, and if you try to do Str.fromUtf8 on a List U8 which contains any 0 bytes, we return Err instead of Ok). This means that we can now safely pass Roc strings to JVM/Android without needing to do a check or conversion, because we already know they don't contain any null characters. Also this means we can pass any strings that were created in Roc directly to C functions that require null-termination without checking for interior null characters first; we can just ensure that it ends in a null character (which may not even require a reallocation, if there's at least 1 byte of excess capacity in the string) and then pass it directly to the C function in question.
2. We assume in our Str implementations that the host may give us some null bytes (e.g. because they took a Rust string, which can contain null bytes, and converted it to a RocStr). This means we can still do a free conversion from Rust strings etc. to Roc strings even though the Rust strings may contain null bytes.
putting those two together, we essentially have the rule "pure Roc code won't introduce a null character, but if we get one from the host, we'll preserve it"
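A rough Rust sketch of what "Str.fromUtf8 returns Err on any 0 byte" could look like. This is not Roc's implementation; `FromUtf8Problem` and `from_utf8_no_nul` are hypothetical names, and the separate scan is only for readability.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum FromUtf8Problem {
    InvalidUtf8,
    InteriorNul { index: usize },
}

/// Accepts the bytes only if they are valid UTF-8 and contain no 0 byte.
fn from_utf8_no_nul(bytes: Vec<u8>) -> Result<String, FromUtf8Problem> {
    // Separate pass for clarity; a tuned version would fold this
    // check into the UTF-8 validation loop.
    if let Some(index) = bytes.iter().position(|&b| b == 0) {
        return Err(FromUtf8Problem::InteriorNul { index });
    }
    // On success the Vec's allocation is reused, so there is no copy.
    String::from_utf8(bytes).map_err(|_| FromUtf8Problem::InvalidUtf8)
}
```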
Are there documented cases where that actually matters? (thinking particularly of the JVM/dalvik restriction)
well if you give them a string with a null byte in it, apparently bad things happen (maybe even Undefined Behavior?)
I guess what I meant was more along the lines of, why would you ever want to use the overlong-null encoding?
That just seems like asking for weird shit to happen.
In terms of what the "right" way to structure interaction with platform strings is, I think that's a really tough call. Trying to make one string type work for all platforms works for like 99.9% of cases just fine, but then breaks down when you find that one person who put an unpaired surrogate in a filename on NTFS (because that's valid!).
What validation does rust currently do on strings that come from the host?
well currently none
Err sorry I meant what Roc... what does roc do.
oh, no validation either
like we don't validate going to or from Rust strings
but we currently have to validate going from Roc strings to JVM strings
Right
I wonder if it would be worth-while to add aggressive validation, at least in debug builds.
I guess a better way to formulate the goals here would be:
Hmm
Joshua Warner said:
I wonder if it would be worth-while to add aggressive validation, at least in debug builds.
I've thought about this, and I think it still might be too error-prone tbh...like the real problem is that interior nulls are so rare you wouldn't likely even encounter them in local development or testing
Fair
you'd either encounter them in production or else you'd know about them (in which case you'd know to handle them properly, but the assumption is that most host authors won't because it's such a random edge case)
but the performance angle is pretty compelling to me
This feels like a very slippery slope toward "anything's allowed"
hm, is it though?
I mean UTF-8 is definitely the overall encoding we want, interior nulls are not something we actually use, and skipping checks that require iterating over the entire string can be a really big deal if you're sending a lot of (and/or large) strings to/from the host
a big deal performance-wise I mean
Someone somewhere is eventually going to misuse that and end up using strings for data that's clearly not UTF-8
Like, binary data. Or data that's sneakily _almost_ UTF-8. Or UTF-16. (without converting)
possibly, but I think the performance cost of trying to defend against that would be too high given the benefit
Someone somewhere is eventually going to misuse that and end up using strings for data that's clearly not UTF-8
definitely going to happen, but that will be a bug in the platform, not a bug in the roc app. The platform should correctly generate and check strings if it is passing them into roc.
So like, when rust reads from a file to a string, it validates utf8. No reason for roc to defensively revalidate this
yeah I mean we definitely could do it, but it puts a permanent ceiling on how fast certain use cases can be
in that we'd unavoidably be double validating in some use cases
Richard Feldman said:
if you try to do Str.fromUtf8 on a List U8 which contains any 0 bytes, we return Err instead of Ok
Maybe it makes sense to allow 0 only as the last byte? So if you get some C-like bytes from somewhere you can still use Str.fromUtf8.
I think if this will be allowed, then the trailing 0 should be removed? That way we still always have "valid UTF-8 without 0" from Str.fromUtf8.
hm, would this actually come up in practice though?
in order to have a List U8 to give to Str.fromUtf8, the list must already have a known length, which would mean the C code could just pass a list with 1 fewer length to drop the trailing 0 it knows is there :big_smile:
That's true.
Maybe there are some cases where you get bytes from a low-level bus system or through a web API that you need to parse as UTF-8 and that may or may not contain a trailing 0? But I cannot think of a real-world example.
I could see wanting to keep the zero at the end for when you pass back to c. That way c can treat it as a normal c string instead of as a roc str.
Or something roughly along those lines
yeah another option would be, since you have a List U8, doing List.dropLast to drop the zero if you know it's there before passing it to Str.fromUtf8
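The "drop the trailing zero before decoding" idea, sketched in Rust rather than Roc. `str_from_c_like_bytes` is a hypothetical helper; it removes at most one trailing 0 and leaves interior bytes alone.

```rust
/// Decodes bytes as UTF-8 after dropping a single trailing 0 byte, if present.
fn str_from_c_like_bytes(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    // Slice pattern: match "anything, then a final 0" and keep the front part.
    let trimmed = match bytes {
        [rest @ .., 0] => rest,
        _ => bytes,
    };
    std::str::from_utf8(trimmed)
}
```

No copy is needed here because the result just borrows a shorter view of the same bytes, which is the same trick as passing a list with one fewer length.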
C strings are not defined as UTF-8 encoded. iiuc, it's not even entirely clear whether the elements of the string are signed or not, and given that C casts types pretty readily, it often doesn't matter.
Why not just define C interop, and thus this aspect of Java interop, as utilizing List U8 ?
I believe it's probably fine (though perhaps a bit unusual) to have Roc validate that its own string literals not contain null-bytes, but the same argument could apply to any ASCII control or non-printing character (including 0x7F). These are all valid Unicode (and UTF-8), yet it's rather likely that, if used, they're not being used to convey their original meanings, and that the string likely represents binary.
That said, it sounds unreasonable to me that Roc validate all strings it encounters just to confirm they don't hold null-bytes, just to avoid hazy interop issues with a single niche integration, and when nearly all such validations can be expected to pass.
The counter-argument is that it will be surprising, particularly for Str.fromUtf8 or similar, to reject inputs which are valid UTF-8.
Kevin Gillette said:
C strings are not defined as UTF-8 encoded. iiuc, it's not even entirely clear whether the elements of the string are signed or not, and given that C casts types pretty readily, it often doesn't matter.
Why not just define C interop, and thus this aspect of Java interop, as utilizing List U8?
well both C and Java strings will potentially do very bad things if you give them interior 0 bytes, so if Roc strings can contain those, then when platform authors are converting them to either C or Java strings, the only responsible choice is to iterate through the entire Roc string they were given to check for zeroes, which is expensive
the goal would be to create a guarantee which would mean they didn't have to do that, meaning C and Java interop (among others) would be strictly faster—to the tune of not having to traverse and potentially copy every single string that's ever sent from Roc to them—than if Roc strings didn't have that guarantee
Redis uses a length-prefixed string format that is C compatible. Essentially the pointer to the string data is actually pointing 4 or 8 bytes past the start of an allocation, while also null-terminating the string. When redis code uses any libc string functions that aren't length-aware, it "just works," and when redis needs the length of a string, rather than scanning until a null, it just reads the length a word length before the data pointer.
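A simplified sketch of that length-prefixed, null-terminated layout. This is not Redis's actual code: `PrefixedStr`, the fixed 8-byte little-endian header, and the method names are assumptions made for illustration, and a safe `Vec<u8>` stands in for the raw allocation.

```rust
/// Hypothetical buffer laid out as: [8-byte length][string bytes][0].
struct PrefixedStr {
    buf: Vec<u8>,
}

const PREFIX_LEN: usize = 8;

impl PrefixedStr {
    fn new(s: &str) -> Self {
        let mut buf = Vec::with_capacity(PREFIX_LEN + s.len() + 1);
        buf.extend_from_slice(&(s.len() as u64).to_le_bytes());
        buf.extend_from_slice(s.as_bytes());
        buf.push(0); // terminator for C functions that are not length-aware
        PrefixedStr { buf }
    }

    /// O(1) length: read the header instead of scanning for the terminator.
    fn len(&self) -> usize {
        let mut header = [0u8; PREFIX_LEN];
        header.copy_from_slice(&self.buf[..PREFIX_LEN]);
        u64::from_le_bytes(header) as usize
    }

    /// What a C function would see through the data pointer: bytes plus terminator.
    fn c_bytes(&self) -> &[u8] {
        &self.buf[PREFIX_LEN..]
    }
}
```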
We could do something similar in Roc's contract with platforms that are expecting C-style strings, but in addition to length data, we can also set a bit in string header data (or somewhere) indicating whether the string _may_ contain interior nulls.
For Roc string literals we can prove this trivially, since the compiler will know. For strings obtained anew at runtime, if unscanned, we can mark it as "maybe containing" nulls, and for any string manipulation operations (like concat), we just OR the bits from the source strings.
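The flag propagation could be sketched like this. `FlaggedStr` and its methods are hypothetical, and a plain `bool` field stands in for a bit in the string header.

```rust
/// Hypothetical string wrapper carrying a conservative "may contain a null" flag.
struct FlaggedStr {
    text: String,
    may_have_nul: bool,
}

impl FlaggedStr {
    /// For literals the answer is known exactly (a compiler could compute it at build time).
    fn from_literal(s: &str) -> Self {
        FlaggedStr { text: s.to_string(), may_have_nul: s.bytes().any(|b| b == 0) }
    }

    /// For strings obtained at runtime and never scanned, assume the worst.
    fn from_unscanned(s: String) -> Self {
        FlaggedStr { text: s, may_have_nul: true }
    }

    /// Concatenation just ORs the flags of the two inputs.
    fn concat(&self, other: &FlaggedStr) -> FlaggedStr {
        FlaggedStr {
            text: format!("{}{}", self.text, other.text),
            may_have_nul: self.may_have_nul | other.may_have_nul,
        }
    }
}
```

The flag is conservative: `true` means "might", so a string built from an unscanned input stays flagged even if it happens to be clean.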
The receiving platform could then decide what to do based on whether the use-case assumes no interior nulls, and also whether the string is marked as possibly null-containing.
There are presumably various optimizations that can be performed to arrive at an answer about null-bytes while performing other operations. For example, instead of a memcpy when performing a string concat, we could use a (hopefully-hardware-optimized) strncpy, passing the length Roc knows the string possesses. If the copy completed n bytes, then we know it's free of interior nulls and unset the maybe-null-indicating bit. If it copied fewer, we set the bit and just follow it with a regular memcpy for the remainder of the string.
Also, if we're only passing a string data pointer to the platform (not passing ownership of the allocation and not passing a length), then I'd imagine no bad things could happen because they'd have no way of readily knowing that their view of the string isn't the "complete" string. It may be application-undefined behavior, but that might be okay depending on the case.
If transferring an allocation, and if the platform provides it, we could just resize the string allocation to stop at the first null. The memory might be somewhat fragmented if the string is held onto for a long time, but semantically it'd be in an okay state.
hmm interesting!
one question to consider: specifically where in memory would the "no interior nulls" bit be stored in both small strings and large strings?
For large strings, we already use the 8 bytes before the start of the allocation to store the refcount, but we could potentially add a 9th byte for this. Alternatively, we could try to sneak it into an unused bit pattern in the refcount itself, which would save a byte of memory but make all refcount operations require additional instructions.
For small strings, I think this would probably be free; we already have some leftover bits because we have an entire byte for the length, but the maximum length of a small string is only 23, so we have several leftover bits in that length byte that could be used for this. And since we already need to do a mask to get the length, I don't even think that code would need to change.
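To illustrate the spare-bits argument: lengths 0 through 23 fit in 5 bits, so one byte has room for a flag. The bit positions below are made up for this sketch and are not Roc's actual small-string layout; all names are hypothetical.

```rust
const SMALL_LEN_MASK: u8 = 0b0001_1111; // 5 bits cover 0..=31, enough for 23
const SMALL_MAY_HAVE_NUL: u8 = 0b0010_0000; // hypothetical spare bit

/// Packs a small-string length and the flag into one byte.
fn pack_small_header(len: u8, may_have_nul: bool) -> u8 {
    assert!(len <= 23, "small strings hold at most 23 bytes");
    len | if may_have_nul { SMALL_MAY_HAVE_NUL } else { 0 }
}

/// Reading the length already needs a mask, so the flag adds no work here.
fn small_len(header: u8) -> u8 {
    header & SMALL_LEN_MASK
}

fn small_may_have_nul(header: u8) -> bool {
    header & SMALL_MAY_HAVE_NUL != 0
}
```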
so in that design, it would be +1 byte on the heap per large string, and essentially no cost for small strings. That doesn't sound like a deal-breaker amount of increased memory consumption to me
:thinking: how would this interact with seamless slices?
an important part of their design is that they can point to arbitrary heap allocations which may not have refcounts preceding them (e.g. they may come from the platform)
so I guess they'd just always have to be assumed to potentially have interior nulls
in which case a reasonable follow-up question is: how often in practice would a platform be specifically receiving a slice? How much would that limit the usefulness of the interior null flag?
another potentially interesting variation on this design: what if instead of storing the flag at the beginning of the allocation, we stored it at the end of the allocation? That way, if 0 means "no interior nulls" and 1 means "may have interior nulls" then the "no interior nulls" flag also serves as a null terminator for C
there are some considerations there though - for example, it means that resizing a string requires writing a new 0 or 1 to the end of the allocation
(also means we have to be careful of off-by-one errors when comparing length and capacity, but that's just an implementation detail)
another question is how to balance wanting string operations to be as cheap as possible in the general case, versus how conservative to be with "may contain interior nulls"
for example, when doing Str.fromUtf8, you could have a separate check on every single byte to see if it's zero; if so, then you set the flag
but that's a separate conditional on every single byte; is that worth it? Or would it be better to say Str.fromUtf8 always sets the flag just in case?
Edit: worth noting that "separate conditional on every single byte" would also be required for the "disallow interior null bytes in Roc strings" design. And actually this one could be cheaper; instead of using a conditional at all, this could be mutating a local variable like hasInteriorNull = hasInteriorNull || currentByte == 0
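The "mutate a local variable instead of branching" idea, as a small Rust sketch. `has_nul_branchless` is a hypothetical name. It uses the non-short-circuiting `|=` rather than `||`, so the source has no per-byte early exit; whether the compiler emits branch-free or vectorized code is up to the optimizer.

```rust
/// Scans every byte and accumulates "saw a zero" into a local flag.
fn has_nul_branchless(bytes: &[u8]) -> bool {
    let mut has_interior_null = false;
    for &current_byte in bytes {
        // No `if` and no early return: just OR the comparison result in.
        has_interior_null |= current_byte == 0;
    }
    has_interior_null
}
```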
other options:
Str.fromNonNullUtf8 which returns an Err if there are interior nulls, or maybe a more flexible Str.fromUtf8Filtered which takes a predicate function and runs it on each byte
Str.fromUtf8NullChecked which does the check (seems unlikely library authors would reach for this over Str.fromUtf8 though)
an interesting thought about the "null-terminate" idea: there are various different ways to do that.
one simple way is to guarantee that the end of the allocation is 0. This means if the string is using up all of its available capacity, then it is null-terminated. However, some strings (e.g. strings that are the result of concatenation) may have excess capacity, which may not be zeroed. So when passing one of these Roc strings to a C function that requires null termination, you'd need to check whether it is actually null-terminated (by looking at the byte 1 past the end of the length) and if it isn't, write a 0 there.
This is what we do today when sending Roc strings to C, but we have the additional step of checking to see if there is even enough space in the allocation for a null terminator, because sometimes there isn't. In that case, we actually have to reallocate the whole string, which could be super expensive. This design would rule out that case; there would always be space for a null terminator even if that byte happens to be nonzero - and C can always write a zero there, because when Roc gives C a string, Roc is no longer using it.
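A sketch of the "is there room for a terminator?" check, with a Rust `Vec<u8>` standing in for a Roc string's allocation. `nul_terminate` is a hypothetical helper. Unlike the scheme described above, `push` also bumps the length, which a real host would handle differently.

```rust
/// Appends a 0 terminator. Returns true if it fit in the existing capacity
/// (no reallocation), false if the buffer had to grow first.
fn nul_terminate(buf: &mut Vec<u8>) -> bool {
    let had_room = buf.len() < buf.capacity();
    // When had_room is true, push writes into spare capacity in place.
    buf.push(0);
    had_room
}
```

The expensive case is the `false` one: the whole string gets copied into a bigger allocation just to add one byte, which is what a guaranteed spare byte would rule out.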
(if the allocation is not writable because it's located in readonly memory, then it couldn't possibly have been the result of concatenation, so there would already be a zero there for sure)
a stronger guarantee would be "all Roc strings are null-terminated" - meaning they have a 0 right after the end of the length, regardless of what the capacity is
this seems a lot more expensive performance-wise because it means any Str operation that needs to add bytes to the end of the string additionally needs to write a 0 or 1 after whatever they just wrote
a lot of those operations happen in loops, so that seems a lot more likely to cause a performance problem (especially on non-C platforms) than doing the "C platforms have to double-check for null termination, and if it's not null-terminated, write a zero there"
that said, all of this does kind of gloss over another important consideration: suppose the bit is 1, and the string contains interior nulls...now what is C supposed to do?
in the case of Java, it's pretty clear: you have to convert the null into the representation they use for nulls, which will be potentially expensive but at least correct
what about C though? Is the JVM "modified UTF-8" representation of nulls a safe thing to pass to C functions expecting UTF-8 without interior null bytes? I honestly don't know!
so an upside of the "just don't allow Roc strings to introduce interior nulls" design is that it means C doesn't have to worry about them altogether
but maybe the same representation the JVM uses is fine for C too?
I feel like we are over-complicating something simpler here:
This leaves one core question in my opinion, do we want to block null bytes when converting a List U8 to a Str? If we write our own code for validating unicode/modify what is in the zig standard library, I think this should be do-able fast. If we don't it requires looping over the entire string twice.
well if we want there to be no way to add them to a string from Roc, then we have no choice but to do that :big_smile:
We shouldn't pay the cost of C in Roc.
I think this is a reasonable stance to take, but I also think it's reasonable to say "Roc aspires to run fast on lots of different platforms, and both JVM and C are very widely used, so penalizing string conversions on them so heavily is not something Roc should do if it can avoid it cheaply enough"
I'm honestly not sure which stance we should go with!
When I say "no way", I mean "no direct way". That is why I asked the question. I think it would be fine if there was some weird workaround to add in a null (convert to list, append zero, convert back). Just don't want to make it easy to do directly. Though it is also valid to say we should block all paths.
As for performance, I think adding the null check would be essentially free. I think we can do it with bit twiddling (in fact switch out UTF-8 validation to bit twiddling would probably make it faster than what we get from zig currently).
As for performance when sending strings to C, that will depend mostly on whether the string has extra capacity (which I think will be exceptionally common because we grow in blocks), with the exception of seamless slices. That said, if we didn't have seamless slices, you would have just had the copy somewhere else in the program, so no perf diff.
Brendan Hansknecht said:
When I say "no way", I mean "no direct way". That is why I asked the question. I think it would be fine if there was some weird workaround to add in a null (convert to list, append zero, convert back). Just don't want to make it easy to do directly. Though it is also valid to say we should block all paths.
the problem with that is: let's say I'm the author of a JVM host. Do I check for interior nulls or not? Like if it's at all possible, even through a weird backdoor, I have to worry about it. What am I gonna say, "you got UB from my platform but don't blame me, blame whoever used Str.fromUtf8 on a list that had zeroes in it"?
Yeah, for roc, I can see that tradeoff mattering. We want the experience to be delightful and that can definitely be a break in the experience. I think we should just check for the zero case then.
I could also see this being a security concern. If you're using 3rd party Roc code (packages or some plugin system) you might not realize they can sneak in strings with nulls in them.
oh another design we didn't talk about: what if we converted to JVM representation? So like we support \0 but it doesn't actually put a null character in there, and if we do Str.fromUtf8 we convert any zero bytes we encounter to that representation too
That sounds much slower.
but doable
Also more confusing to an end user if they expect to pass an actual null byte to something.
some consideration with that design:
Str.fromUtf8!
With the conversion you would probably need to change how the check works in a way that also slows down fromUtf8. I guess you would first check for valid UTF-8 without nulls. If that is the case, convert to Str freely. If not, loop through on a slow path that finds a null, copies up to it, adds \0, then copies everything else. If the UTF-8 validation from before exits early on seeing a null byte, you also need to validate all the UTF-8 after the null byte here.
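The fast-path/slow-path shape described above could look roughly like this in Rust. The JVM's Modified UTF-8 encodes U+0000 as the two bytes 0xC0 0x80; this sketch handles only that rule (Modified UTF-8 also encodes supplementary characters differently, which is ignored here). `encode_nuls` is a hypothetical name, and the input is assumed to be already-validated UTF-8.

```rust
use std::borrow::Cow;

/// Replaces each 0 byte with 0xC0 0x80. Borrows the input unchanged when
/// there is nothing to replace, and allocates only on the slow path.
fn encode_nuls(utf8: &str) -> Cow<'_, [u8]> {
    let bytes = utf8.as_bytes();
    let nul_count = bytes.iter().filter(|&&b| b == 0).count();
    if nul_count == 0 {
        return Cow::Borrowed(bytes); // fast path: free conversion
    }
    // Slow path: every null grows by one byte, so the size is known up front.
    let mut out = Vec::with_capacity(bytes.len() + nul_count);
    for &b in bytes {
        if b == 0 {
            out.extend_from_slice(&[0xC0, 0x80]);
        } else {
            out.push(b);
        }
    }
    Cow::Owned(out)
}
```

Note the output of the slow path is no longer valid standard UTF-8, since 0xC0 0x80 is an overlong encoding.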
yeah I don't like that design overall
I'm not sure if C will gracefully handle this representation
It should.
so then there's the big-picture question: to what extent should Roc accommodate C and JVM string representations?
another way to look at it is: to what extent should Roc value having fast string conversions for C and JVM?
and a relevant question there is "what potential platforms get significantly worse if string conversions are slow for C or JVM?"
one that comes immediately to mind is a JVM web server that's doing a lot of database stuff
if you move a bunch of database logic into Roc (which I assume would decode a List U8 of bytes coming from the database driver into whatever local types make sense for those columns, which in many cases would be strings) then you would potentially end up sending a large number of potentially large strings from Roc to the JVM
of course then there's the question of "what is actually considered unacceptably slow?" and for a webserver, even traversing every single string like that to verify it is probably fine; I'd guess it wouldn't even add 1ms of latency to the response
another example that comes to mind is really low-level C stuff, like embedded systems
there the perf cost of defensively checking would be higher, and having to reallocate to add a null terminator would also be really bad
on the other hand, they also have limited memory, so +1 byte for approximately all Roc strings on the heap might be noticeable?
also I guess a lot of embedded systems are going to have a pretty low quantity of strings, so maybe it would come up infrequently enough that it wouldn't matter
also worth noting: a lot of libc functions have versions that accept a length and don't require the string to be null-terminated
e.g. https://man7.org/linux/man-pages/man2/write.2.html
so there may be workarounds there
on the host side
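For example, a host that writes through a length-based API never needs a terminator at all. A minimal Rust sketch, where `write_all_bytes` is a hypothetical helper and `std::io::Write::write_all` is the standard library call that takes a slice with an explicit length:

```rust
use std::io::Write;

/// Writes every byte of `s`, interior 0 bytes included, using the slice's
/// length instead of looking for a terminator.
fn write_all_bytes<W: Write>(out: &mut W, s: &str) -> std::io::Result<()> {
    out.write_all(s.as_bytes())
}
```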
another thing to consider: JavaScript strings are UTF-16, which requires a much more expensive conversion every time regardless of how this design question shakes out
so that might give an idea in practice of the upper bound on how big of a deal this is in practice on a web server
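For a sense of what that conversion involves, here is a minimal Rust sketch. `to_utf16` is a hypothetical helper; `str::encode_utf16` is the standard library iterator. Every character has to be decoded and re-encoded into a new buffer, so it is a full traversal plus an allocation no matter what the string contains.

```rust
/// Converts UTF-8 text to UTF-16 code units, as a host targeting
/// JavaScript strings would have to.
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}
```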
There should be no conversion needed when sending to java if we don't allow null bytes.
For C, if we don't allow null bytes, conversion would only be more costly in the case that there is no extra capacity (I think rare in many string manipulation tasks that wouldn't need to copy anyway) and C can't use an API that takes a length.
If you want null bytes, or null terminated strings, I think you should make a wrapper of List U8 that deals with the terminator.
Also, use List U8 if you need to do byte level manipulation that would lead to setting null bytes. I mean we have already decided not to let users directly manipulate the bytes of a Str, so I think this matches the current Roc apis/ideas.
JavaScript strings are UTF-16
This is a really good point. I am sure someone has looked at this for using JavaScript on Node with rust/c/etc. How much work does rust/c/etc need to do for the conversion to be worth it? What is the real cost of paying those conversions? It would definitely be more than the cost of converting null bytes to "\0". That might give us better insight into the perf tradeoffs.
Brendan Hansknecht said:
If you want null bytes, or null terminated strings, I think you should make a wrapper of List U8 that deals with the terminator.
well the trouble is the package ecosystem - people using third-party libraries that work in terms of Str, which have no guarantees
but yeah we'll get some insight into the cost on converting to JS strings at Vendr!
people using third-party libraries that work in terms of Str, which have no guarantees
If we ban null, all libraries that work on Str should work with all Str. If you need the terminator, you won't be able to use the libraries because you can't convert to Str. So you will have to write your own or find a custom library that takes List U8
So the type safety of Str libraries should still be good.
oh sure, yeah
my feeling based on this discussion is that the following 3 designs are plausible:
Str.fromUtf8
based on everything we've discussed, I'm inclined to stick with the status quo and see if the performance problem is significant in practice
if it is, either of the other two options become worth reconsidering - and notably, each alternative design has a breaking change for either host authors, or Roc authors, but not both
that is, banning them from the language only affects Roc authors but doesn't cause a breaking change to RocStr's semantics to the host
and changing the in-memory representation wouldn't be a breaking change for any Roc code but would be a breaking change for hosts
Side note, because of discussing this, I realized that we can validate unicode way faster than the zig standard lib does on average.
For huge ascii only strings, we could be ~30x faster than zig's standard library based on some of my quick testing. The rough cost of doing so is about ~2% slower for large unicode only strings.
nice, let's do that! :smiley:
can that be implemented using pure Roc code?
I mean currently I still fall back to the zig underlying functions when dealing with non-ascii, but theoretically could be.
Also, long term, we would potentially want to make it even faster by using cpu specific optimization like simd which would likely be better to implement in zig.
fair
But yeah, processing ascii 8 or 16 bytes at a time is much much faster.
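A minimal sketch of the "check 8 bytes at a time" idea, written in Rust rather than Zig. This is not the code from the PR; `is_ascii_swar` is a hypothetical name. A byte is ASCII exactly when its top bit is clear, so one mask tests eight bytes at once.

```rust
use std::convert::TryInto;

/// Returns true if every byte is ASCII, testing 8 bytes per step where possible.
fn is_ascii_swar(bytes: &[u8]) -> bool {
    const HIGH_BITS: u64 = 0x8080_8080_8080_8080;
    let mut chunks = bytes.chunks_exact(8);
    for chunk in &mut chunks {
        // Byte order does not matter: we only ask whether any top bit is set.
        let word = u64::from_ne_bytes(chunk.try_into().unwrap());
        if (word & HIGH_BITS) != 0 {
            return false;
        }
    }
    // Fewer than 8 bytes left: fall back to checking one byte at a time.
    chunks.remainder().iter().all(|&b| b < 0x80)
}
```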
sure seems that way! haha
Also, optimizing code is hard. I am trying to make the unicode-only case run faster. So I basically added a loop to say "after processing a unicode character, if you are still pointing to a unicode character, keep processing unicode characters" instead of having the code loop back to load 8 bytes and check if they all are ascii (which of course will fail in the all unicode case). Yet, that tight loop around the unicode section is slower than running the large loop and checking ascii each time. My only guess is loading the 8 bytes leads to some sort of pre-fetching that makes the less tight loop faster than the tight loop. I tried to add similar prefetching to the tight loop, but it hasn't helped (probably gets optimized away).
Yep, was pre-fetching related. Anyway managed to get it so that we are only slower in the small unicode case. In the larger unicode cases, we are basically break even now. Ridiculous gains for ascii only: #5139
Richard Feldman said:
Edit: worth noting that "separate conditional on every single byte" would also be required for the "disallow interior null bytes in Roc strings" design. And actually this one could be cheaper; instead of using a conditional at all, this could be mutating a local variable like
hasInteriorNull = hasInteriorNull || currentByte == 0
iirc, there are SIMD operations which can determine whether a byte is present in a wide byte range?
Yeah, there are simd and swar techniques for this. The biggest issue is that it will have to happen during the super fast and very hot loop that checks if a batch of bytes are ASCII. So it definitely will be a perf hit. Even if we collect until the end, we probably make the core loop twice as many instructions.
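One well-known SWAR trick for this is the "has zero byte" test: `(v - 0x01..01) & !v & 0x80..80` is nonzero exactly when some byte of `v` is zero. A Rust sketch, with hypothetical names, that collects the result until the end instead of branching per word:

```rust
use std::convert::TryInto;

/// True if any of the 8 bytes in `word` is zero.
fn word_has_zero_byte(word: u64) -> bool {
    const ONES: u64 = 0x0101_0101_0101_0101;
    const HIGHS: u64 = 0x8080_8080_8080_8080;
    (word.wrapping_sub(ONES) & !word & HIGHS) != 0
}

/// Scans for a 0 byte 8 bytes at a time, accumulating into a flag.
fn has_nul_swar(bytes: &[u8]) -> bool {
    let mut chunks = bytes.chunks_exact(8);
    let mut found = false;
    for chunk in &mut chunks {
        let word = u64::from_ne_bytes(chunk.try_into().unwrap());
        found |= word_has_zero_byte(word);
    }
    // Handle the final partial chunk one byte at a time.
    found | chunks.remainder().iter().any(|&b| b == 0)
}
```

Even in this form it is a subtract, a NOT, two ANDs and an OR per word on top of the ASCII mask test, which lines up with the "roughly twice as many instructions" estimate.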
Last updated: Jun 16 2026 at 16:19 UTC