null characters in strings · ideas

Richard Feldman (Mar 11 2023 at 20:40):

so reading through @dank's awesome JVM interop example, I noticed that the JVM's Native Interface requires that strings are encoded in Modified UTF-8. Apparently Android's Dalvik uses the same representation, so this would be relevant for any Roc application running on Android.

Richard Feldman (Mar 11 2023 at 20:41):

separately, many C functions don't allow null characters because they use it to tell where the string terminates

Richard Feldman (Mar 11 2023 at 20:41):

however, UTF-8 allows null characters in it

Richard Feldman (Mar 11 2023 at 20:41):

today, Roc strings are encoded as UTF-8, which means they allow null characters

Richard Feldman (Mar 11 2023 at 20:42):

one reason I went with this design (which Rust does too) is that it seems less error-prone; we can just say "Roc strings are valid UTF-8" instead of something more complicated than that, which host authors might get wrong in ways that would be easy to overlook

Richard Feldman (Mar 11 2023 at 20:42):

and which wouldn't show up except in really unusual edge cases

Richard Feldman (Mar 11 2023 at 20:46):

another problem is performance; we want to be able to convert a Rust string (which can contain null characters) from the host directly into a RocStr without having to reallocate it

Richard Feldman (Mar 11 2023 at 20:47):

I just thought of an interesting design idea: what if we did this?

We explicitly disallow encoding null characters in Roc strings (e.g. no \0, no using \u to encode a null character, and if you try to do Str.fromUtf8 on a List U8 which contains any 0 bytes, we return Err instead of Ok). This means that we can now safely pass Roc strings to JVM/Android without needing to do a check or conversion, because we already know they don't contain any null characters. Also this means we can pass any strings that were created in Roc directly to C functions that require null-termination without checking for interior null characters first; we can just ensure that it ends in a null character (which may not even require a reallocation, if there's at least 1 byte of excess capacity in the string) and then pass it directly to the C function in question.
Despite doing this, on the Roc side we still assume that strings might contain null bytes; that is, we assume in our builtin Str implementations that the host may give us some null bytes (e.g. because they took a Rust string, which can contain null bytes, and converted it to a RocStr). This means we can still do a free conversion from Rust strings etc. to Roc strings even though the Rust strings may contain null bytes.
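
A minimal Rust sketch of the first rule above (rejecting 0 bytes when converting bytes to a string), assuming a hypothetical `StrError` type and function name; this is not Roc's actual `Str.fromUtf8` implementation. A 0 byte in valid UTF-8 can only ever encode U+0000, so scanning raw bytes is sufficient.

```rust
/// Why a byte list was rejected (hypothetical error type for this sketch).
#[derive(Debug, PartialEq)]
pub enum StrError {
    /// The bytes are not valid UTF-8; holds the index of the first bad byte.
    InvalidUtf8(usize),
    /// The bytes contain a 0 byte at this index.
    InteriorNull(usize),
}

/// Accepts only valid UTF-8 that contains no 0 bytes.
/// The null check runs first, so a 0 byte is reported even if
/// invalid UTF-8 appears elsewhere in the input.
pub fn str_from_utf8_no_null(bytes: &[u8]) -> Result<&str, StrError> {
    if let Some(i) = bytes.iter().position(|&b| b == 0) {
        return Err(StrError::InteriorNull(i));
    }
    std::str::from_utf8(bytes).map_err(|e| StrError::InvalidUtf8(e.valid_up_to()))
}
```

This version walks the input twice (once for the null scan, once for UTF-8 validation); a fused validator could do both in one pass.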

Richard Feldman (Mar 11 2023 at 20:48):

putting those two together, we essentially have the rule "pure Roc code won't introduce a null character, but if we get one from the host, we'll preserve it"

Joshua Warner (Mar 11 2023 at 20:51):

Are there documented cases where that actually matters? (thinking particularly of the JVM/dalvik restriction)

Richard Feldman (Mar 11 2023 at 20:51):

well if you give them a string with a null byte in it, apparently bad things happen (maybe even Undefined Behavior?)

Joshua Warner (Mar 11 2023 at 20:53):

I guess what I meant was more along the lines of, why would you ever want to use the overlong-null encoding?

Joshua Warner (Mar 11 2023 at 20:54):

That just seems like asking for weird shit to happen.

Joshua Warner (Mar 11 2023 at 21:02):

In terms of the "right" way to structure interaction with platform strings, I think that's a really tough call. Trying to make one string type work for all platforms works for like 99.9% of cases just fine, but then breaks down when you find that one person who put an unpaired surrogate in a filename on NTFS (because that's valid!).

Joshua Warner (Mar 11 2023 at 21:06):

What validation does rust currently do on strings that come from the host?

Richard Feldman (Mar 11 2023 at 21:14):

well currently none

Joshua Warner (Mar 11 2023 at 21:15):

Err sorry I meant what Roc... what does roc do.

Richard Feldman (Mar 11 2023 at 21:15):

oh, no validation either

Richard Feldman (Mar 11 2023 at 21:15):

like we don't validate going to or from Rust strings

Richard Feldman (Mar 11 2023 at 21:15):

but we currently have to validate going from Roc strings to JVM strings

Joshua Warner (Mar 11 2023 at 21:16):

Right

Joshua Warner (Mar 11 2023 at 21:16):

I wonder if it would be worth-while to add aggressive validation, at least in debug builds.

Richard Feldman (Mar 11 2023 at 21:16):

I guess a better way to formulate the goals here would be:

allow Rust, C, and JVM hosts all to take their preferred UTF-8 strings (which vary in their rules around nulls) and convert them directly to Roc strings without having to verify or reallocate memory
regardless of host, allow strings created in pure Roc code to be converted to Rust, C, or JVM strings without having to scan through the Roc string looking for nulls that might be in an unsupported representation

Joshua Warner (Mar 11 2023 at 21:17):

Hmm

Richard Feldman (Mar 11 2023 at 21:17):

Joshua Warner said:

I wonder if it would be worth-while to add aggressive validation, at least in debug builds.

I've thought about this, and I think it still might be too error-prone tbh...like the real problem is that interior nulls are so rare you wouldn't likely even encounter them in local development or testing

Joshua Warner (Mar 11 2023 at 21:17):

Fair

Richard Feldman (Mar 11 2023 at 21:17):

you'd either encounter them in production or else you'd know about them (in which case you'd know to handle them properly, but the assumption is that most host authors won't because it's such a random edge case)

Richard Feldman (Mar 11 2023 at 21:18):

but the performance angle is pretty compelling to me

Joshua Warner (Mar 11 2023 at 21:18):

This feels like a very slippery slope toward "anything's allowed"

Richard Feldman (Mar 11 2023 at 21:18):

hm, is it though?

Richard Feldman (Mar 11 2023 at 21:20):

I mean UTF-8 is definitely the overall encoding we want, interior nulls are not something we actually use, and skipping checks that require iterating over the entire string can be a really big deal if you're sending a lot of (and/or large) strings to/from the host

Richard Feldman (Mar 11 2023 at 21:20):

a big deal performance-wise I mean

Joshua Warner (Mar 11 2023 at 21:21):

Someone somewhere is eventually going to misuse that and end up using strings for data that's clearly not UTF-8

Joshua Warner (Mar 11 2023 at 21:22):

Like, binary data. Or data that's sneakily _almost_ UTF-8. Or UTF-16. (without converting)

Richard Feldman (Mar 11 2023 at 21:33):

possibly, but I think the performance cost of trying to defend against that would be too high given the benefit

Brendan Hansknecht (Mar 11 2023 at 21:37):

Someone somewhere is eventually going to misuse that and end up using strings for data that's clearly not UTF-8

definitely going to happen, but that will be a bug in the platform, not a bug in the roc app. The platform should correctly generate and check strings if it is passing them into roc.

Brendan Hansknecht (Mar 11 2023 at 21:38):

So like, when rust reads from a file to a string, it validates utf8. No reason for roc to defensively revalidate this

Richard Feldman (Mar 11 2023 at 21:39):

yeah I mean we definitely could do it, but it puts a permanent ceiling on how fast certain use cases can be

Richard Feldman (Mar 11 2023 at 21:39):

in that we'd unavoidably be double validating in some use cases

Fabian Schmalzried (Mar 11 2023 at 21:49):

Richard Feldman said:

if you try to do Str.fromUtf8 on a List U8 which contains any 0 bytes, we return Err instead of Ok

Maybe it makes sense to allow 0 only as the last byte? So if you get some C-like bytes from somewhere, you can still use Str.fromUtf8.
I think if this is allowed, then the trailing 0 should be removed? That way we still always have "valid UTF-8 without 0" from Str.fromUtf8.

Richard Feldman (Mar 11 2023 at 21:51):

hm, would this actually come up in practice though?

in order to have a List U8 to give to Str.fromUtf8, the list must already have a known length, which would mean the C code could just pass a list with 1 fewer length to drop the trailing 0 it knows is there :big_smile:
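
A small sketch of that host-side trimming in Rust, assuming the host holds the bytes as a slice with a known length (the function name is hypothetical):

```rust
/// Returns the bytes without a single trailing 0, if one is present.
/// Interior zeros are left untouched.
pub fn drop_trailing_nul(bytes: &[u8]) -> &[u8] {
    bytes.strip_suffix(&[0u8]).unwrap_or(bytes)
}
```

No copy is made; the result is just a shorter view of the same memory.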

Fabian Schmalzried (Mar 11 2023 at 21:57):

That's true.
Maybe there are some cases where you get bytes from a low-level bus system or through a web API that you need to parse to UTF-8 and that may or may not contain a trailing 0? But I cannot think of a real-world example.

Brendan Hansknecht (Mar 11 2023 at 22:04):

I could see wanting to keep the zero at the end for when you pass back to c. That way c can treat it as a normal c string instead of as a roc str.

Brendan Hansknecht (Mar 11 2023 at 22:04):

Or something roughly along those lines

Richard Feldman (Mar 11 2023 at 22:04):

yeah another option would be, since you have a List U8, doing List.dropLast to drop the zero if you know it's there before passing it to Str.fromUtf8

Kevin Gillette (Mar 13 2023 at 05:43):

C strings are not defined as UTF-8 encoded. iiuc, it's not even entirely clear whether the elements of the string are signed or not, and given that C casts types pretty readily, it often doesn't matter.

Why not just define C interop, and thus this aspect of Java interop, as utilizing List U8 ?

Kevin Gillette (Mar 13 2023 at 05:47):

I believe it's probably fine (though perhaps a bit unusual) to have Roc validate that its own string literals not contain null-bytes, but the same argument could apply to any ASCII control or non-printing character (including 0x7F). These are all valid Unicode (and UTF-8), yet it's rather likely that, if used, they're not being used to convey their original meanings, and that the string likely represents binary.

Kevin Gillette (Mar 13 2023 at 05:50):

That said, it sounds unreasonable to me that Roc validate all strings it encounters just to confirm they don't hold null-bytes, just to avoid hazy interop issues with a single niche integration, and when nearly all such validations can be expected to pass.

Kevin Gillette (Mar 13 2023 at 05:52):

The counter-argument is that it will be surprising, particularly for Str.fromUtf8 or similar, to reject inputs which are valid UTF-8.

Richard Feldman (Mar 13 2023 at 10:35):

Kevin Gillette said:

C strings are not defined as UTF-8 encoded. iiuc, it's not even entirely clear whether the elements of the string are signed or not, and given that C casts types pretty readily, it often doesn't matter.

Why not just define C interop, and thus this aspect of Java interop, as utilizing List U8 ?

well both C and Java strings will potentially do very bad things if you give them interior 0 bytes, so if Roc strings can contain those, then when platform authors are converting them to either C or Java strings, the only responsible choice is to iterate through the entire Roc string they were given to check for zeroes, which is expensive

Richard Feldman (Mar 13 2023 at 10:36):

the goal would be to create a guarantee which would mean they didn't have to do that, meaning C and Java interop (among others) would be strictly faster—to the tune of not having to traverse and potentially copy every single string that's ever sent from Roc to them—than if Roc strings didn't have that guarantee

Kevin Gillette (Mar 14 2023 at 13:21):

Redis uses a length-prefixed string format that is C compatible. Essentially the pointer to the string data is actually pointing 4 or 8 bytes past the start of an allocation, while also null-terminating the string. When redis code uses any libc string functions that aren't length-aware, it "just works," and when redis needs the length of a string, rather than scanning until a null, it just reads the length a word length before the data pointer.
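
A rough Rust sketch of the general idea (length header, then data, then a 0 terminator, all in one allocation). The fixed 8-byte little-endian header and the function names are assumptions for illustration; this is not Redis's actual layout.

```rust
/// Size of the length header that precedes the string data (assumed: 8 bytes).
const HEADER: usize = 8;

/// Builds `[len: u64 little-endian][data bytes][0]` in one allocation.
pub fn make_prefixed(data: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(HEADER + data.len() + 1);
    buf.extend_from_slice(&(data.len() as u64).to_le_bytes());
    buf.extend_from_slice(data);
    buf.push(0); // terminator, so functions that are not length-aware can find the end
    buf
}

/// Reads the length from the header in O(1), without scanning for a 0.
pub fn prefixed_len(buf: &[u8]) -> usize {
    let mut header = [0u8; HEADER];
    header.copy_from_slice(&buf[..HEADER]);
    u64::from_le_bytes(header) as usize
}

/// The view a C function would receive: everything after the header,
/// including the terminator.
pub fn as_c_view(buf: &[u8]) -> &[u8] {
    &buf[HEADER..]
}
```

Length-aware code calls `prefixed_len`, while code that only understands C strings is handed `as_c_view` and stops at the terminator.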

Kevin Gillette (Mar 14 2023 at 13:40):

We could do something similar in Roc's contract with platforms that are expecting C-style strings, but in addition to length data, we can also set a bit in string header data (or somewhere) indicating whether the string _may_ contain interior nulls.

For Roc string literals we can prove this trivially, since the compiler will know. For strings obtained anew at runtime, if unscanned, we can mark it as "maybe containing" nulls, and for any string manipulation operations (like concat), we just OR the bits from the source strings.

The receiving platform could then decide what to do based on whether the use-case assumes no interior nulls, and also whether the string is marked as possibly null-containing.

There are presumably various optimizations that can be performed to arrive at an answer about null-bytes while performing other operations. For example, instead of a memcpy when performing a string concat, we could use a (hopefully-hardware-optimized) strncpy, passing the length Roc knows the string possesses. If the copy completed n bytes, then we know it's free of interior nulls and unset the maybe-null-indicating bit. If it copied fewer, we set the bit and just follow it with a regular memcpy for the remainder of the string.
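
The flag propagation described above could look roughly like this in Rust. `FlaggedStr` and its methods are hypothetical names for this sketch; a real implementation would keep the bit in the string header instead of a separate field.

```rust
/// A string plus a conservative "may contain 0 bytes" flag (hypothetical type).
#[derive(Debug, Clone)]
pub struct FlaggedStr {
    pub text: String,
    /// false means "proven free of 0 bytes"; true means "unknown or present".
    pub may_have_null: bool,
}

impl FlaggedStr {
    /// For strings whose contents were checked, such as literals the compiler has seen.
    pub fn scanned(text: &str) -> Self {
        FlaggedStr {
            text: text.to_string(),
            may_have_null: text.as_bytes().contains(&0),
        }
    }

    /// For strings obtained at runtime without scanning them.
    pub fn unscanned(text: &str) -> Self {
        FlaggedStr { text: text.to_string(), may_have_null: true }
    }

    /// Concatenation ORs the flags of the two source strings.
    pub fn concat(&self, other: &FlaggedStr) -> FlaggedStr {
        let mut text = String::with_capacity(self.text.len() + other.text.len());
        text.push_str(&self.text);
        text.push_str(&other.text);
        FlaggedStr { text, may_have_null: self.may_have_null || other.may_have_null }
    }
}
```

The flag is only ever conservative: it can say "maybe" for a string that is in fact clean, but never "clean" for a string that holds a 0 byte.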

Kevin Gillette (Mar 14 2023 at 13:48):

Also, if we're only passing a string data pointer to the platform (not passing ownership of the allocation and not passing a length), then I'd imagine no bad things could happen because they'd have no way of readily knowing that their view of the string isn't the "complete" string. It may be application-undefined behavior, but that might be okay depending on the case.

If transferring an allocation, and if the platform provides it, we could just resize the string allocation to stop at the first null. The memory might be somewhat fragmented if the string is held onto for a long time, but semantically it'd be in an okay state.

Richard Feldman (Mar 14 2023 at 14:06):

hmm interesting!

Richard Feldman (Mar 14 2023 at 14:10):

one question to consider: specifically where in memory would the "no interior nulls" bit be stored in both small strings and large strings?

For large strings, we already use the 8 bytes before the start of the allocation to store the refcount, but we could potentially add a 9th byte for this. Alternatively, we could try to sneak it into an unused bit pattern in the refcount itself, which would save a byte of memory but make all refcount operations require additional instructions.

For small strings, I think this would probably be free; we already have some leftover bits because we have an entire byte for the length, but the maximum length of a small string is only 23, so we have several leftover bits in that length byte that could be used for this. And since we already need to do a mask to get the length, I don't even think that code would need to change.
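
As an illustration of the spare bits in the small-string length byte: lengths 0 through 23 fit in 5 bits, which leaves room for a flag. The bit positions below (including the tag bit) are assumptions made for this sketch and are not Roc's actual layout.

```rust
// Hypothetical layout of the final byte of a small string.
const LEN_MASK: u8 = 0b0001_1111; // 5 bits are enough for lengths 0..=23
const MAY_HAVE_NULL: u8 = 0b0010_0000; // one of the leftover bits
const SMALL_TAG: u8 = 0b1000_0000; // assumed marker for "this is a small string"
const MAX_SMALL_LEN: u8 = 23;

/// Packs a length and the flag into one byte.
pub fn pack_small(len: u8, may_have_null: bool) -> u8 {
    assert!(len <= MAX_SMALL_LEN, "small strings hold at most 23 bytes");
    let flag = if may_have_null { MAY_HAVE_NULL } else { 0 };
    SMALL_TAG | flag | len
}

/// Reading the length already needs a mask, so the flag bit costs nothing here.
pub fn small_len(last_byte: u8) -> u8 {
    last_byte & LEN_MASK
}

pub fn small_may_have_null(last_byte: u8) -> bool {
    last_byte & MAY_HAVE_NULL != 0
}
```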

Richard Feldman (Mar 14 2023 at 14:11):

so in that design, it would be +1 byte on the heap per large string, and essentially no cost for small strings. That doesn't sound like a deal-breaker amount of increased memory consumption to me

Richard Feldman (Mar 14 2023 at 14:13):

:thinking: how would this interact with seamless slices?

Richard Feldman (Mar 14 2023 at 14:14):

an important part of their design is that they can point to arbitrary heap allocations which may not have refcounts preceding them (e.g. they may come from the platform)

Richard Feldman (Mar 14 2023 at 14:14):

so I guess they'd just always have to be assumed to potentially have interior nulls

Richard Feldman (Mar 14 2023 at 14:16):

in which case a reasonable follow-up question is: how often in practice would a platform be specifically receiving a slice? How much would that limit the usefulness of the interior null flag?

Richard Feldman (Mar 14 2023 at 14:20):

another potentially interesting variation on this design: what if instead of storing the flag at the beginning of the allocation, we stored it at the end of the allocation? That way, if 0 means "no interior nulls" and 1 means "may have interior nulls" then the "no interior nulls" flag also serves as a null terminator for C

Richard Feldman (Mar 14 2023 at 14:20):

there are some considerations there though - for example, it means that resizing a string requires writing a new 0 or 1 to the end of the allocation

Richard Feldman (Mar 14 2023 at 14:21):

(also means we have to be careful of off-by-one errors when comparing length and capacity, but that's just an implementation detail)

Richard Feldman (Mar 14 2023 at 14:22):

another question is how to balance wanting string operations to be as cheap as possible in the general case, versus how conservative to be with "may contain interior nulls"

Richard Feldman (Mar 14 2023 at 14:23):

for example, when doing Str.fromUtf8, you could have a separate check on every single byte to see if it's zero; if so, then you set the flag

Richard Feldman (Mar 14 2023 at 14:23):

but that's a separate conditional on every single byte; is that worth it? Or would it be better to say Str.fromUtf8 always sets the flag just in case?

Edit: worth noting that "separate conditional on every single byte" would also be required for the "disallow interior null bytes in Roc strings" design. And actually this one could be cheaper; instead of using a conditional at all, this could be mutating a local variable like hasInteriorNull = hasInteriorNull || currentByte == 0
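
A sketch of that accumulate-a-flag loop in Rust. It uses the non-short-circuiting `|=` so the loop body contains no explicit conditional; whether the generated machine code is actually branch-free is up to the compiler.

```rust
/// Returns true if any byte is 0, accumulating the answer without an `if`.
pub fn has_null_byte(bytes: &[u8]) -> bool {
    let mut has_null = false;
    for &b in bytes {
        // `|=` always evaluates the right-hand side; there is no early exit.
        has_null |= b == 0;
    }
    has_null
}
```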

Richard Feldman (Mar 14 2023 at 14:25):

other options:

also have Str.fromNonNullUtf8 which returns an Err if there are interior nulls, or maybe a more flexible Str.fromUtf8Filtered which takes a predicate function and runs it on each byte
also have Str.fromUtf8NullChecked which does the check (seems unlikely library authors would reach for this over Str.fromUtf8 though)

Richard Feldman (Mar 14 2023 at 14:32):

an interesting thought about the "null-terminate" idea: there are various different ways to do that.

one simple way is to guarantee that the end of the allocation is 0. This means if the string is using up all of its available capacity, then it is null-terminated. However, some strings (e.g. strings that are the result of concatenation) may have excess capacity, which may not be zeroed. So when passing one of these Roc strings to a C function that requires null termination, you'd need to check whether it is actually null-terminated (by looking at the byte 1 past the end of the length) and if it isn't, write a 0 there.

This is what we do today when sending Roc strings to C, but we have the additional step of checking to see if there is even enough space in the allocation for a null terminator, because sometimes there isn't. In that case, we actually have to reallocate the whole string, which could be super expensive. This design would rule out that case; there would always be space for a null terminator even if that byte happens to be nonzero - and C can always write a zero there, because when Roc gives C a string, Roc is no longer using it.
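
The host-side step described above can be sketched in Rust with a `Vec<u8>` standing in for the Roc string's allocation (an assumption for illustration; `RocStr` has its own layout). The terminator is written one byte past the length without changing the length, and a reallocation only happens when there is no spare capacity.

```rust
/// Writes a 0 one byte past the end of the string's length.
/// Returns true if the terminator fit in existing capacity,
/// false if more space had to be reserved first.
pub fn write_terminator(buf: &mut Vec<u8>) -> bool {
    let fits = buf.len() < buf.capacity();
    if !fits {
        // The expensive case: this may reallocate and copy the whole string.
        buf.reserve(1);
    }
    // The length is unchanged; only the first spare byte is written.
    buf.spare_capacity_mut()[0].write(0);
    fits
}
```

With a guarantee that every allocation has at least one spare byte, the `reserve` branch would never run.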

Richard Feldman (Mar 14 2023 at 14:33):

(if the allocation is not writable because it's located in readonly memory, then it couldn't possibly have been the result of concatenation, so there would already be a zero there for sure)

Richard Feldman (Mar 14 2023 at 14:34):

a stronger guarantee would be "all Roc strings are null-terminated" - meaning they have a 0 right after the end of the length, regardless of what the capacity is

Richard Feldman (Mar 14 2023 at 14:34):

this seems a lot more expensive performance-wise because it means any Str operation that needs to add bytes to the end of the string additionally needs to write a 0 or 1 after whatever they just wrote

Richard Feldman (Mar 14 2023 at 14:35):

a lot of those operations happen in loops, so that seems a lot more likely to cause a performance problem (especially on non-C platforms) than doing the "C platforms have to double-check for null termination, and if it's not null-terminated, write a zero there"

Richard Feldman (Mar 14 2023 at 14:36):

that said, all of this does kind of gloss over another important consideration: suppose the bit is 1, and the string contains interior nulls...now what is C supposed to do?

Richard Feldman (Mar 14 2023 at 14:37):

in the case of Java, it's pretty clear: you have to convert the null into the representation they use for nulls, which will be potentially expensive but at least correct
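
A host-side sketch of that conversion in Rust, assuming the JNI's Modified UTF-8 rules: U+0000 is written as the two bytes 0xC0 0x80, and characters above U+FFFF are written as a UTF-16 surrogate pair with each surrogate encoded in three bytes. The function name is hypothetical.

```rust
/// Converts a standard UTF-8 string to Modified UTF-8 bytes.
pub fn to_modified_utf8(s: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(s.len());
    for c in s.chars() {
        let cp = c as u32;
        if cp == 0 {
            // U+0000 becomes a two-byte form, so no 0 byte appears in the output.
            out.extend_from_slice(&[0xC0, 0x80]);
        } else if cp > 0xFFFF {
            // Supplementary characters: encode each UTF-16 surrogate in 3 bytes.
            let mut units = [0u16; 2];
            for &unit in c.encode_utf16(&mut units).iter() {
                let u = unit as u32;
                out.push(0xE0 | (u >> 12) as u8);
                out.push(0x80 | ((u >> 6) & 0x3F) as u8);
                out.push(0x80 | (u & 0x3F) as u8);
            }
        } else {
            // Everything else is identical to standard UTF-8.
            let mut tmp = [0u8; 4];
            out.extend_from_slice(c.encode_utf8(&mut tmp).as_bytes());
        }
    }
    out
}
```

The output can be longer than the input, so this always allocates. Note that 0xC0 0x80 is an overlong form, so strict UTF-8 validators such as Rust's `std::str::from_utf8` reject it.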

Richard Feldman (Mar 14 2023 at 14:37):

what about C though? Is the JVM "modified UTF-8" representation of nulls a safe thing to pass to C functions expecting UTF-8 without interior null bytes? I honestly don't know!

Richard Feldman (Mar 14 2023 at 14:38):

so an upside of the "just don't allow Roc strings to introduce interior nulls" design is that it means C doesn't have to worry about them altogether

Richard Feldman (Mar 14 2023 at 14:46):

but maybe the same representation the JVM uses is fine for C too?

Brendan Hansknecht (Mar 14 2023 at 14:55):

I feel like we are over-complicating something simpler here:

Roc should simply never care about interior nulls and should have no way to add them to a string. If roc uses a C, Java, etc. platform that would break with interior nulls, the only interior nulls that could exist on the Roc side would be due to the platform passing them into Roc. So if the platform doesn't want interior nulls, they need to make sure not to pass any string with null in it to Roc.
We shouldn't pay the cost of C in Roc. Roc has a well formed full featured string. Due to slices, we have no way to guarantee it is null terminated. If a platform needs a null terminated string, they can check the capacity of a string returned by roc and set a byte to zero. In the rare case there is no capacity left, they can reallocate the string or fall back on code that uses the length. (This would be better than java where you have to always copy the string to add the null terminator)

This leaves one core question in my opinion: do we want to block null bytes when converting a List U8 to a Str? If we write our own code for validating unicode (or modify what is in the zig standard library), I think this should be doable fast. If we don't, it requires looping over the entire string twice.

Richard Feldman (Mar 14 2023 at 15:10):

well if we want there to be no way to add them to a string from Roc, then we have no choice but to do that :big_smile:

Richard Feldman (Mar 14 2023 at 15:12):

We shouldn't pay the cost of C in Roc.

I think this is a reasonable stance to take, but I also think it's reasonable to say "Roc aspires to run fast on lots of different platforms, and both JVM and C are very widely used, so penalizing string conversions on them so heavily is not something Roc should do if it can avoid it cheaply enough"

Richard Feldman (Mar 14 2023 at 15:13):

I'm honestly not sure which stance we should go with!

Brendan Hansknecht (Mar 14 2023 at 15:17):

When I say "no way", I mean "no direct way". That is why I asked the question. I think it would be fine if there was some weird workaround to add in a null (convert to list, append zero, convert back). Just don't want to make it easy to do directly. Though it is also valid to say we should block all paths.

Brendan Hansknecht (Mar 14 2023 at 15:20):

As for performance, I think adding the null check would be essentially free. I think we can do it with bit twiddling (in fact, switching out UTF-8 validation to bit twiddling would probably make it faster than what we get from zig currently).

As for performance when sending strings to C, that will depend mostly on whether the string has extra capacity (which I think will be exceptionally common because we grow in blocks), with the exception of seamless slices. That said, if we didn't have seamless slices, you would have just had the copy somewhere else in the program, so no perf diff.
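
One well-known bit-twiddling approach to the null check is to test eight bytes at a time with the classic "has zero byte" word trick. The Rust sketch below shows that trick on its own; it is an illustration, not the implementation discussed in the thread, and the function names are hypothetical.

```rust
const LOW_ONES: u64 = 0x0101_0101_0101_0101;
const HIGH_BITS: u64 = 0x8080_8080_8080_8080;

/// True if any of the 8 bytes packed into `w` is zero.
fn word_has_zero_byte(w: u64) -> bool {
    // Subtracting 1 from each byte borrows into the high bit only for a zero
    // byte (or a byte whose high bit was already set, which `!w` masks out).
    (w.wrapping_sub(LOW_ONES) & !w & HIGH_BITS) != 0
}

/// Scans for a 0 byte one 64-bit word at a time, then finishes byte by byte.
pub fn contains_null(bytes: &[u8]) -> bool {
    let mut chunks = bytes.chunks_exact(8);
    for chunk in &mut chunks {
        let mut word = [0u8; 8];
        word.copy_from_slice(chunk);
        if word_has_zero_byte(u64::from_le_bytes(word)) {
            return true;
        }
    }
    chunks.remainder().iter().any(|&b| b == 0)
}
```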

Richard Feldman (Mar 14 2023 at 15:23):

Brendan Hansknecht said:

When I say "no way", I mean "no direct way". That is why I asked the question. I think it would be fine if there was some weird workaround to add in a null (convert to list, append zero, convert back). Just don't want to make it easy to do directly. Though it is also valid to say we should block all paths.

the problem with that is: let's say I'm the author of a JVM host. Do I check for interior nulls or not? Like if it's at all possible, even through a weird backdoor, I have to worry about it. What am I gonna say, "you got UB from my platform but don't blame me, blame whoever used Str.fromUtf8 on a list that had zeroes in it"?

Brendan Hansknecht (Mar 14 2023 at 15:25):

Yeah, for roc, I can see that tradeoff mattering. We want the experience to be delightful and that can definitely be a break in the experience. I think we should just check for the zero case then.

Martin Stewart (Mar 14 2023 at 15:25):

I could also see this being a security concern. If you're using 3rd party Roc code (packages or some plugin system) you might not realize they can sneak in strings with nulls in them.

Richard Feldman (Mar 14 2023 at 15:30):

oh another design we didn't talk about: what if we converted to JVM representation? So like we support \0 but it doesn't actually put a null character in there, and if we do Str.fromUtf8 we convert any zero bytes we encounter to that representation too

Brendan Hansknecht (Mar 14 2023 at 15:31):

That sounds much slower.

Brendan Hansknecht (Mar 14 2023 at 15:31):

but doable

Brendan Hansknecht (Mar 14 2023 at 15:33):

Also more confusing to an end user if they expect to pass an actual null byte to something.

Richard Feldman (Mar 14 2023 at 15:34):

some considerations with that design:

you still can get interior nulls on certain hosts, but some Roc libraries might not think to handle that edge case because they have the mental model of "Roc always converts those to the other representation" - so there could be bugs there...but it seems like it would be a super rare edge case to come up
I'm not sure if C will gracefully handle this representation
the conversion would have to expand 1 byte to multiple bytes, which would require reallocating the string - very bad for performance of Str.fromUtf8!

Brendan Hansknecht (Mar 14 2023 at 15:38):

With the conversion you would probably need to change how the check works in a way that also slows down fromUtf8. I guess you would first check for valid UTF-8 without nulls. If that is the case, convert to Str freely. If not, loop through on a slow path that finds a null, copies up to it, adds \0, then copies everything else. If the UTF-8 validation from before exits early on seeing a null byte, you also need to validate all the UTF-8 after the null byte here.

Richard Feldman (Mar 14 2023 at 15:38):

yeah I don't like that design overall

Brendan Hansknecht (Mar 14 2023 at 15:38):

I'm not sure if C will gracefully handle this representation

It should.

Richard Feldman (Mar 14 2023 at 16:17):

so then there's the big-picture question: to what extent should Roc accommodate C and JVM string representations?

Richard Feldman (Mar 14 2023 at 16:17):

another way to look at it is: to what extent should Roc value having fast string conversions for C and JVM?

Richard Feldman (Mar 14 2023 at 16:18):

and a relevant question there is "what potential platforms get significantly worse if string conversions are slow for C or JVM?"

Richard Feldman (Mar 14 2023 at 16:20):

one that comes immediately to mind is a JVM web server that's doing a lot of database stuff

Richard Feldman (Mar 14 2023 at 16:21):

if you move a bunch of database logic into Roc (which I assume would decode a List U8 of bytes coming from the database driver into whatever local types make sense for those columns, which in many cases would be strings) then you would potentially end up sending a large number of potentially large strings from Roc to the JVM

Richard Feldman (Mar 14 2023 at 16:24):

of course then there's the question of "what is actually considered unacceptably slow?" and for a webserver, even traversing every single string like that to verify it is probably fine; I'd guess it wouldn't even add 1ms of latency to the response

Richard Feldman (Mar 14 2023 at 16:25):

another example that comes to mind is really low-level C stuff, like embedded systems

Richard Feldman (Mar 14 2023 at 16:26):

there the perf cost of defensively checking would be higher, and having to reallocate to add a null terminator would also be really bad

Richard Feldman (Mar 14 2023 at 16:26):

on the other hand, they also have limited memory, so +1 byte for approximately all Roc strings on the heap might be noticeable?

Richard Feldman (Mar 14 2023 at 16:28):

also I guess a lot of embedded systems are going to have a pretty low quantity of strings, so maybe it would come up infrequently enough that it wouldn't matter

Richard Feldman (Mar 14 2023 at 16:28):

also worth noting: a lot of libc functions have versions that accept a length and don't require the string to be null-terminated

Richard Feldman (Mar 14 2023 at 16:29):

e.g. https://man7.org/linux/man-pages/man2/write.2.html

Richard Feldman (Mar 14 2023 at 16:29):

so there may be workarounds there

Richard Feldman (Mar 14 2023 at 16:29):

on the host side

Richard Feldman (Mar 14 2023 at 16:30):

another thing to consider: JavaScript strings are UTF-16, which requires a much more expensive conversion every time regardless of how this design question shakes out

Richard Feldman (Mar 14 2023 at 16:31):

so that might give an idea of the upper bound on how big of a deal this is in practice on a web server

Brendan Hansknecht (Mar 14 2023 at 16:48):

There should be no conversion needed when sending to java if we don't allow null bytes.

Brendan Hansknecht (Mar 14 2023 at 16:49):

For C, if we don't allow null bytes, conversion would only be more costly in the case that there is no extra capacity (which I think is rare in many string manipulation tasks that wouldn't need to copy anyway) and C can't use an API that takes a length.

Brendan Hansknecht (Mar 14 2023 at 16:49):

If you want null bytes, or null terminated strings, I think you should make a wrapper of List U8 that deals with the terminator.

Brendan Hansknecht (Mar 14 2023 at 16:52):

Also, use List U8 if you need to do byte level manipulation that would lead to setting null bytes. I mean we have already decided not to let users directly manipulate the bytes of a Str, so I think this matches the current Roc apis/ideas.

Brendan Hansknecht (Mar 14 2023 at 16:58):

JavaScript strings are UTF-16

This is a really good point. I am sure someone has looked at this for using javascript on Node with rust/c/etc. How much work does rust/c/etc need to do for the conversion to be worth it? What is the real cost of paying for those conversions? It would definitely be more than the cost of converting null bytes to "\0". That might get us better insights on the tradeoffs in perf.

Richard Feldman (Mar 14 2023 at 17:48):

Brendan Hansknecht said:

If you want null bytes, or null terminated strings, I think you should make a wrapper of List U8 that deals with the terminator.

well the trouble is the package ecosystem - people using third-party libraries that work in terms of Str, which have no guarantees

Richard Feldman (Mar 14 2023 at 17:49):

but yeah we'll get some insight into the cost of converting to JS strings at Vendr!

Brendan Hansknecht (Mar 14 2023 at 17:57):

people using third-party libraries that work in terms of Str, which have no guarantees

If we ban null, all libraries that work on Str should work with all Str. If you need the terminator, you won't be able to use the libraries because you can't convert to Str. So you will have to write your own or find a custom library that takes List U8.

Brendan Hansknecht (Mar 14 2023 at 17:57):

So the type safety of Str libraries should still be good.

Richard Feldman (Mar 14 2023 at 17:58):

oh sure, yeah

Richard Feldman (Mar 14 2023 at 18:01):

my feeling based on this discussion is that the following 3 designs are plausible:

1. prevent Roc code from creating interior nulls, by disallowing them in string literals and also in functions like Str.fromUtf8
2. allow interior nulls, but record whether they are present by adding an extra byte to the end of every string, which can also serve as a null terminator for C
3. status quo, where interior nulls are allowed, and C and JVM need to check for them and handle appropriately

based on everything we've discussed, I'm inclined to stick with the status quo and see if the performance problem is significant in practice

Richard Feldman (Mar 14 2023 at 18:01):

if it is, either of the other two options become worth reconsidering - and notably, each alternative design has a breaking change for either host authors, or Roc authors, but not both

Richard Feldman (Mar 14 2023 at 18:02):

that is, banning them from the language only affects Roc authors but doesn't cause a breaking change to RocStr's semantics to the host

Richard Feldman (Mar 14 2023 at 18:03):

and changing the in-memory representation wouldn't be a breaking change for any Roc code but would be a breaking change for hosts
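For reference, the status quo check on the host side might look like the following Rust sketch. `to_c_string` is a hypothetical helper name and the `&[u8]` argument stands in for the bytes of a `RocStr`; `CString::new` and `NulError::nul_position` are from Rust's standard library.

```rust
use std::ffi::CString;

/// Hypothetical host-side helper: turn the bytes of a Roc string into a
/// null-terminated C string. On failure, returns the index of the first
/// interior null byte.
fn to_c_string(roc_bytes: &[u8]) -> Result<CString, usize> {
    // CString::new scans the input for 0 bytes, and on success copies it
    // into a new buffer with the terminator appended.
    CString::new(roc_bytes).map_err(|err| err.nul_position())
}
```

The scan and the copy here are the per-call costs that the other two designs try to avoid; whether they matter in practice is exactly the open question.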

Brendan Hansknecht (Mar 14 2023 at 23:20):

Side note, because of discussing this, I realized that we can validate unicode way faster than the zig standard lib does on average.

For huge ASCII-only strings, we could be ~30x faster than zig's standard library based on some of my quick testing. The rough cost of doing so is being ~2% slower for large unicode-only strings.

Richard Feldman (Mar 14 2023 at 23:24):

nice, let's do that! :smiley:

Richard Feldman (Mar 14 2023 at 23:25):

can that be implemented using pure Roc code?

Brendan Hansknecht (Mar 14 2023 at 23:29):

I mean currently I still fall back to the zig underlying functions when dealing with non-ascii, but theoretically could be.

Brendan Hansknecht (Mar 14 2023 at 23:29):

Also, long term, we would potentially want to make it even faster by using cpu specific optimization like simd which would likely be better to implement in zig.

Richard Feldman (Mar 14 2023 at 23:30):

fair

Brendan Hansknecht (Mar 14 2023 at 23:33):

But yeah, processing ascii 8 or 16 bytes at a time is much much faster.
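A simplified Rust sketch of the 8-bytes-at-a-time idea follows. This is not the code from the PR: `is_valid_utf8` is a hypothetical name, and unlike the approach described in this thread it falls back to the standard library for everything after the first non-ASCII chunk instead of returning to the fast path.

```rust
/// Mask with the high bit of every byte set; ASCII bytes never have it set.
const HIGH_BITS: u64 = 0x8080_8080_8080_8080;

/// Returns true if `bytes` is valid UTF-8.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    let mut i = 0;
    // Fast path: skip over 8 ASCII bytes per iteration.
    while i + 8 <= bytes.len() {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&bytes[i..i + 8]);
        let word = u64::from_ne_bytes(buf);
        if (word & HIGH_BITS) != 0 {
            // At least one byte in this chunk is non-ASCII.
            break;
        }
        i += 8;
    }
    // Everything before `i` is ASCII, so `i` is a character boundary and
    // the remainder can be validated on its own by the standard library.
    std::str::from_utf8(&bytes[i..]).is_ok()
}
```

The ASCII check is one load, one AND, and one compare per 8 bytes, which gives a sense of why the ASCII-only case can be so much faster than decoding byte by byte.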

Richard Feldman (Mar 14 2023 at 23:35):

sure seems that way! haha

Brendan Hansknecht (Mar 15 2023 at 00:11):

Also, optimizing code is hard. I am trying to make the unicode-only case run faster. So I basically added a loop to say "after processing a unicode character, if you are still pointing to a unicode character, keep processing unicode characters" instead of having the code loop back to load 8 bytes and check if they all are ASCII (which of course will fail in the all-unicode case). Yet, that tight loop around the unicode section is slower than running the large loop and checking ASCII each time. My only guess is that loading the 8 bytes leads to some sort of pre-fetching that makes the less tight loop faster than the tight loop. I tried to add similar prefetching to the tight loop, but it hasn't helped (probably gets optimized away).

Brendan Hansknecht (Mar 15 2023 at 00:47):

Yep, was pre-fetching related. Anyway managed to get it so that we are only slower in the small unicode case. In the larger unicode cases, we are basically break even now. Ridiculous gains for ascii only: #5139

Kevin Gillette (Mar 23 2023 at 02:20):

Richard Feldman said:

Edit: worth noting that "separate conditional on every single byte" would also be required for the "disallow interior null bytes in Roc strings" design. And actually this one could be cheaper; instead of using a conditional at all, this could be mutating a local variable like hasInteriorNull = hasInteriorNull || currentByte == 0

iirc, there are SIMD operations which can determine whether a byte is present in a wide byte range?

Brendan Hansknecht (Mar 23 2023 at 02:45):

Yeah, there are SIMD and SWAR techniques for this. The biggest issue is that it will have to happen during the super fast and very hot loop that checks if a batch of bytes is ASCII. So it definitely will be a perf hit. Even if we only accumulate a flag and check it at the end, we would probably double the number of instructions in the core loop.
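To make the SWAR option concrete, here is a Rust sketch that folds a zero-byte test into an 8-byte ASCII scan. It uses the well-known "has zero byte" bit trick and accumulates flags without branching, in the spirit of the `hasInteriorNull = hasInteriorNull || currentByte == 0` suggestion. The names `word_has_zero_byte` and `scan` are hypothetical, and this is not the code from the PR.

```rust
const LOW_BITS: u64 = 0x0101_0101_0101_0101;
const HIGH_BITS: u64 = 0x8080_8080_8080_8080;

/// True if any of the 8 bytes packed into `word` is zero.
fn word_has_zero_byte(word: u64) -> bool {
    // Subtracting 1 from each byte borrows into the high bit only for a
    // zero byte (or a byte following one); `!word` rules out bytes that
    // already had their high bit set.
    (word.wrapping_sub(LOW_BITS) & !word & HIGH_BITS) != 0
}

/// Scans `bytes` once and returns `(all_ascii, has_null)`.
fn scan(bytes: &[u8]) -> (bool, bool) {
    let mut non_ascii_bits = 0u64;
    let mut has_null = false;
    let mut chunks = bytes.chunks_exact(8);
    for chunk in &mut chunks {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        let word = u64::from_ne_bytes(buf);
        // Both results are accumulated and only inspected after the loop.
        non_ascii_bits |= word & HIGH_BITS;
        has_null |= word_has_zero_byte(word);
    }
    // Handle the last 0..=7 bytes one at a time.
    for &b in chunks.remainder() {
        non_ascii_bits |= (b & 0x80) as u64;
        has_null |= b == 0;
    }
    (non_ascii_bits == 0, has_null)
}
```

Compared with the ASCII-only check, each iteration gains a subtract, a NOT, two ANDs, and an OR, which is consistent with the concern about the hot loop growing noticeably.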

Last updated: Jul 23 2026 at 13:15 UTC