Suppose one has a program that recursively builds up a string by traversing an AST and concatenating strings together in a recursive manner. This is very clean from an isolation / readability perspective, but it does generate a lot of separate allocations and copies.
The better way to write that in rust/zig would be to have the node-printing function take a pointer to a buffer that everything is appended to. That way, we avoid allocating all the intermediate strings separately and then copying them.
In Roc, a similar thing could be accomplished by having the node-printing function take as an argument a string to append to (which it then appends to and returns, of course). But that's ugly!
Can we automatically make the simple code do the fast thing?
Just like when returning a large struct from a function, the compiler will typically have the caller allocate space for the return value and pass a pointer to that memory as a hidden arg to the function in question (which the function just writes its result into), we can treat "returning a string" similarly and have it be represented at the calling-convention level as passing a string to append to.
If a function returns a Str, we transform it to instead take an extra Str argument, perhaps even one that's statically guaranteed to have a refcount of 1 (because the caller guarantees this already).
If the caller doesn't need/want the return value to be appended to something, it can just pass an empty string with no backing allocation.
Now - there's one case that this leads to extra copying: if the callee is just slicing a string from something else and returning that, and the caller is just storing that in a record for later processing, we would have unnecessarily made a copy. This can be handled via either: (1) statically inspecting both the callee and caller to see if the "slice" optimization should apply instead, or (2) having the callee check at runtime if an empty/unallocated string has been passed in as the ret value slot, and if so ignoring that and just returning the slice directly. (2) is obviously more general, but does lead to an extra branch.
Thoughts?
If the caller doesn't need/want the return value to be appended to something, it can just pass an empty string with not backing allocation.
Who is doing this? The roc code author? Some heuristic in the compiler? Other?
This would be code generated by the compiler, with a simple heuristic: if the value is immediately concatenated to another string, we pass that string as the "append str", and if not we pass an empty/unallocated str.
So, if we see the pattern:
x : Str
x = someFn(...)
buf = buf.append(s)
We transform to pass buf into the function?
Yep! And _every function_ that returns a Str is modified in the same way - to take an extra/implicit "append str" arg.
Then we transform the entire inside of the function to change all string operations to append to this buffer instead of build up an intermediate?
Cause most likely the code inside is not just cleanly appending to a buffer. So likely you would need very invasive transformations.
The callee function is of course free to ignore the passed-in parameter until the end of the function - so that's the fall back if things don't neatly chain together.
I don't understand this, this isn't code authorred by a human. This is compiler transformations, there is no such things as free to ignore. Everything must be specified.
"Free to ignore" means the transformation in the compiler can bail and compute an intermediate and then only just before returning do the append.
Sure, I guess I'm just not seeing where this would do anything without invasive transformations.
If code isn't written with buffers in mind, it will just build up intermediates separately and then merge them
So this transformation would just end up last minute merging the intermediates.
Probably need concrete examples of complex string code. Maybe from roc glue written the naive way?
Take this as a simple example:
printMe = |me| {
match me {
MyInt(i) => Str.from_int(i),
MyPair(a, b) => {
"[".concat(printMe(a).concat(", ")).concat(printMe(b)) + "]"
}
}
}
Actually, let me write that with explicit function call syntax to make this more clear...
printMe = |me| {
match me {
Int(i) => Str.from_int(i),
MyPair(a, b) => {
Str.concat("[", Str.concat(Str.concat(Str.concat(printMe(a), ", "), printMe(b)), "]"))
}
}
}
I've been thinking about this over the year and I super want to do it
That gets transformed into:
printMe = |buffer, me| {
match me {
Int(i) => Str.from_int(buffer, i),
MyPair(a, b) => {
buffer = Str.concat(buffer, "[");
buffer = printMe(buffer, a);
buffer = Str.concat(buffer, ", ");
buffer = printMe(buffer, b);
buffer = Str.concat(buffer, "]");
buffer
}
}
}
I think the generalization is "autobuffering"
That obviously means we need to:
(1) know that Str.concat is associative
(2) be ok with reordering calls to it
not just strings but lists in general!
Yes!
Ah, I see. So work in reverse updating bufferizable functions that are pure to automatically take the buffer
if we know it's pure then we know it's ok to reorder
One fun note, in this specific case, you probably just hurt perf
Huh, say more?
Cause the entire thing would have likely been a small string originally. So you would have built up a small string then appended that in one go to the buffer.
Now instead you have many different append calls to the buffer each with tiny strings.
Interesting. Why would the small string optimization not compose nicely with this one?
But I would guess on average this helps even if in many of the leaf functions it hurts.
The idea is kind of a transformation of recursion to tail call recursion where possible?
It's roughly two things:
(1) having the compiler recognize the associativity of Str.concat and adjusting the tree to be "left-leaning" - as I've written above
(2) Then, instead of building up intermediate values for calling nested functions, we do this calling convention transform
(1) is not strictly necessary, but makes (2) helpful in more cases
Joshua Warner said:
Interesting. Why would the small string optimization not compose nicely with this one?
Just that the buffer passed in is likely not a small string. So you are likely appending to a large buffer many super tiny strings each with the chance to cause a reallocation or cache miss.
Anyway, I don't think it's really relevant in practice. This cost is minor. The cost of copying tons of super long strings and building up recursive intermediate is way worse.
Yeah, so this makes sense to me
No tail calls necessary - this can still be true recursion
Just that the buffer passed in is likely not a small string. So you are likely appending to a large buffer many super tiny strings each with the chance to cause a reallocation or cache miss.
Ah, but we're still talking about a single branch that in the limit should be well-predicted since we're not allocating very often.
And we'd need a separate branch for checking if the string fits the small string optimization anyway, and that might be mispredicted much more often in the normal case (depending on the example, of course)
The small string optimization still does a buffer-size check
Yeah, very example dependent, but small strings do some crazy heavy lifting when used right
In theory, the optimization could go even further and attempt to pre-compute the necessary buffer size and avoid all reallocations
Also, this is where estimated reserve capacities can help a lot.
Yeah, that... Didn't type fast enough
a related optimization (which I suspect would rely a LOT on inlining) would be inserting automatic .reserve() calls before a batch of appends based on the length of things known in advance to be immutable and going to be appended later in the chain
hahaha
Yeah I think that's exactly what I was just saying :upside_down:
I, too, didn't type fast enough :joy:
Doesn't necessarily have to rely on inlining
doesn't have to, but inlining would minimize the number of reallocations
The compiler could generate a printMe__computeBufferSize function that it calls first
Could walk the call graph similar to what we do for borrow inference. Just with capacity estimates. (Always reserve the minimum guaranteed capacity ahead of time)
If the buffer size is statically known, printMe__computeBufferSize would itself just return a constant (or it could just be a constant). But if there's some dynamic computation it could be an upper bound (or perhaps lower bound).
Yeah, definitely could be more accurate with runtime info.
e.g. it generates:
printMe__computeBufferSize = |me| {
match me {
Int(i) => 1, # We know a int will take up at least 1 byte (or we could do an upper-bound, or the average of the two, ...)
MyPair(a, b) => {
size = 1
size = size + printMe__computeBufferSize(a);
size = size + 1;
size = size + printMe__computeBufferSize(b);
size = size + 1;
size
}
}
}
It could even return a pair (upper_bound, lower_bound) - and the caller could do something sensible with that
Like, if the upper bound fits within a small string, we can omit all bounds checks and optimize heavily
:thinking: I wonder if this would be useful in the context of Set/Dict as well?
e.g. In cases like computing a Set of free variables
Or - crazy thought - I wonder if there's a world in which doing some optimization around "trees of tag unions with boxes" to convert them to SOA automatically, putting them all into a contiguous buffer instead of scattered in the heap. Use index offsets instead of pointer indirection. Functions that took a reference to the value now take a reference to the containing "tree buffer" and an index of the value. You then do the same "pass in the buffer" optimization to have a buffer to plop the return value into.
Automatic SOA arena allocation?
That does get gnarly if you later want to re-use just parts of the tree for some reason (say, you're building a parser that does minimal re-parses with ast structure sharing after edits)
Stuff like this starts to worry me cause it adds a lot more complex tradeoffs (both in terms of compiler complexity and adding things to runtime that have some form of cost). So I prefer slim optimizations where possible. But I also am someone who likes to manipulate code to get the perf I want...so too complex of a compiler is often a disadvantage to me.
Expanding on my comment above. If perf matters for some code, I will manually buffer it cause auto-buffering is not guaranteed to work and be consisted. On top of that, my context likely means I can employ other tricks to get more perf. If libraries are designed assuming auto-bufferization and the optimization can't apply for whatever reason, then users are just out of luck. The perf is gonna be bad. If instead, libraries manually buffer cause they know that is the only way to get perf, things are guaranteed to be performant.
That said, I do think auto-bufferization likely falls into the category of roc should just do it. Though this depends on if we can make it happen essentially 100% of the time with very limited compiler complexity. If this only happens ~95% of the time, I would argue it is likely worse than not having it and pushing users to manually bufferize. Honestly when consider large code bases and package ecosystems, it probably needs to work well more than 99.9% of the time.
I see anysort of runtime bound guesstimating as likely a mistake. More compiletime and runtime complexity likely for little gain. On top of that, the functions if actually walking a recursive datastructure to calculate a size, could be crazy expensive. Again, something dumb that just gets a lower bound without recursion and is super fast likely is reasonable (and hopefully can be done at compile time), but anythong that steps into runtime or has any real complexity will certainly have many cases where it hurts more than it helps. So I would suggest avoiding it.
Same with any sort of automatic SOA or crazy tag transformations. Not only does this make host interactions a lot more complicated, but it is not clearly a win and may just be wasting tons of compile time for gains that don't matter in practice. So these kinds of transformations I baseline strongly against. This feels a lot like automatic transformations to run on GPU or similar. While they can help, but of the time they lead to worse results and are a loss for a lot of complexity. That has been tried many many times with very poor results.
All that said, if some of these have simple implementations that lead to across the board wins with little compile time costs, then of course they are welcome. I just have a strong default against most of this compiler magic that maybe occasionally works and costs a lot in terms of complexity, compile time, etc.
Last updated: Jun 16 2026 at 16:19 UTC