Stream: ideas

Topic: String optimization: "append str" calling convention


view this post on Zulip Joshua Warner (Sep 01 2025 at 17:47):

Suppose one has a program that recursively builds up a string by traversing an AST and concatenating strings together in a recursive manner. This is very clean from an isolation / readability perspective, but it does generate a lot of separate allocations and copies.

The better way to write that in rust/zig would be to have the node-printing function take a pointer to a buffer that everything is appended to. That way, we avoid allocating all the intermediate strings separately and then copying them.

In Roc, a similar thing could be accomplished by having the node-printing function take as an argument a string to append to (which it then appends to and returns, of course). But that's ugly!

Can we automatically make the simple code do the fast thing?

Just like when returning a large struct from a function, the compiler will typically have the caller allocate space for the return value and pass a pointer to that memory as a hidden arg to the function in question (which the function just writes its result into), we can treat "returning a string" similarly and have it be represented at the calling-convention level as passing a string to append to.

If a function returns a Str, we transform it to instead take an extra Str argument, perhaps even one that's statically guaranteed to have a refcount of 1 (because the caller guarantees this already).

If the caller doesn't need/want the return value to be appended to something, it can just pass an empty string with no backing allocation.

Now - there's one case that this leads to extra copying: if the callee is just slicing a string from something else and returning that, and the caller is just storing that in a record for later processing, we would have unnecessarily made a copy. This can be handled via either: (1) statically inspecting both the callee and caller to see if the "slice" optimization should apply instead, or (2) having the callee check at runtime if an empty/unallocated string has been passed in as the ret value slot, and if so ignoring that and just returning the slice directly. (2) is obviously more general, but does lead to an extra branch.

Thoughts?

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 17:54):

If the caller doesn't need/want the return value to be appended to something, it can just pass an empty string with not backing allocation.

Who is doing this? The roc code author? Some heuristic in the compiler? Other?

view this post on Zulip Joshua Warner (Sep 01 2025 at 17:55):

This would be code generated by the compiler, with a simple heuristic: if the value is immediately concatenated to another string, we pass that string as the "append str", and if not we pass an empty/unallocated str.

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 17:57):

So, if we see the pattern:

x : Str
x = someFn(...)
buf = buf.append(s)

We transform to pass buf into the function?

view this post on Zulip Joshua Warner (Sep 01 2025 at 17:58):

Yep! And _every function_ that returns a Str is modified in the same way - to take an extra/implicit "append str" arg.

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 17:59):

Then we transform the entire inside of the function to change all string operations to append to this buffer instead of build up an intermediate?

Cause most likely the code inside is not just cleanly appending to a buffer. So likely you would need very invasive transformations.

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:01):

The callee function is of course free to ignore the passed-in parameter until the end of the function - so that's the fall back if things don't neatly chain together.

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 18:02):

I don't understand this, this isn't code authorred by a human. This is compiler transformations, there is no such things as free to ignore. Everything must be specified.

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:03):

"Free to ignore" means the transformation in the compiler can bail and compute an intermediate and then only just before returning do the append.

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 18:04):

Sure, I guess I'm just not seeing where this would do anything without invasive transformations.

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 18:04):

If code isn't written with buffers in mind, it will just build up intermediates separately and then merge them

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 18:05):

So this transformation would just end up last minute merging the intermediates.

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 18:05):

Probably need concrete examples of complex string code. Maybe from roc glue written the naive way?

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:05):

Take this as a simple example:

printMe = |me| {
  match me {
    MyInt(i) => Str.from_int(i),
    MyPair(a, b) => {
        "[".concat(printMe(a).concat(", ")).concat(printMe(b)) + "]"
    }
  }
}

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:07):

Actually, let me write that with explicit function call syntax to make this more clear...

printMe = |me| {
  match me {
    Int(i) => Str.from_int(i),
    MyPair(a, b) => {
        Str.concat("[", Str.concat(Str.concat(Str.concat(printMe(a), ", "), printMe(b)), "]"))
    }
  }
}

view this post on Zulip Richard Feldman (Sep 01 2025 at 18:10):

I've been thinking about this over the year and I super want to do it

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:10):

That gets transformed into:

printMe = |buffer, me| {
  match me {
    Int(i) => Str.from_int(buffer, i),
    MyPair(a, b) => {
        buffer = Str.concat(buffer, "[");
        buffer = printMe(buffer, a);
        buffer = Str.concat(buffer, ", ");
        buffer = printMe(buffer, b);
        buffer = Str.concat(buffer, "]");
        buffer
    }
  }
}

view this post on Zulip Richard Feldman (Sep 01 2025 at 18:10):

I think the generalization is "autobuffering"

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:10):

That obviously means we need to:
(1) know that Str.concat is associative
(2) be ok with reordering calls to it

view this post on Zulip Richard Feldman (Sep 01 2025 at 18:10):

not just strings but lists in general!

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:10):

Yes!

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 18:11):

Ah, I see. So work in reverse updating bufferizable functions that are pure to automatically take the buffer

view this post on Zulip Richard Feldman (Sep 01 2025 at 18:11):

if we know it's pure then we know it's ok to reorder

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 18:11):

One fun note, in this specific case, you probably just hurt perf

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:11):

Huh, say more?

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 18:12):

Cause the entire thing would have likely been a small string originally. So you would have built up a small string then appended that in one go to the buffer.

Now instead you have many different append calls to the buffer each with tiny strings.

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:13):

Interesting. Why would the small string optimization not compose nicely with this one?

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 18:13):

But I would guess on average this helps even if in many of the leaf functions it hurts.

view this post on Zulip Kiryl Dziamura (Sep 01 2025 at 18:14):

The idea is kind of a transformation of recursion to tail call recursion where possible?

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:16):

It's roughly two things:
(1) having the compiler recognize the associativity of Str.concat and adjusting the tree to be "left-leaning" - as I've written above
(2) Then, instead of building up intermediate values for calling nested functions, we do this calling convention transform

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:16):

(1) is not strictly necessary, but makes (2) helpful in more cases

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 18:17):

Joshua Warner said:

Interesting. Why would the small string optimization not compose nicely with this one?

Just that the buffer passed in is likely not a small string. So you are likely appending to a large buffer many super tiny strings each with the chance to cause a reallocation or cache miss.


Anyway, I don't think it's really relevant in practice. This cost is minor. The cost of copying tons of super long strings and building up recursive intermediate is way worse.


Yeah, so this makes sense to me

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:17):

No tail calls necessary - this can still be true recursion

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:18):

Just that the buffer passed in is likely not a small string. So you are likely appending to a large buffer many super tiny strings each with the chance to cause a reallocation or cache miss.

Ah, but we're still talking about a single branch that in the limit should be well-predicted since we're not allocating very often.

And we'd need a separate branch for checking if the string fits the small string optimization anyway, and that might be mispredicted much more often in the normal case (depending on the example, of course)

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:19):

The small string optimization still does a buffer-size check

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 18:19):

Yeah, very example dependent, but small strings do some crazy heavy lifting when used right

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:20):

In theory, the optimization could go even further and attempt to pre-compute the necessary buffer size and avoid all reallocations

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 18:21):

Also, this is where estimated reserve capacities can help a lot.

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 18:21):

Yeah, that... Didn't type fast enough

view this post on Zulip Richard Feldman (Sep 01 2025 at 18:21):

a related optimization (which I suspect would rely a LOT on inlining) would be inserting automatic .reserve() calls before a batch of appends based on the length of things known in advance to be immutable and going to be appended later in the chain

view this post on Zulip Richard Feldman (Sep 01 2025 at 18:21):

hahaha

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:21):

Yeah I think that's exactly what I was just saying :upside_down:

view this post on Zulip Richard Feldman (Sep 01 2025 at 18:22):

I, too, didn't type fast enough :joy:

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:22):

Doesn't necessarily have to rely on inlining

view this post on Zulip Richard Feldman (Sep 01 2025 at 18:22):

doesn't have to, but inlining would minimize the number of reallocations

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:23):

The compiler could generate a printMe__computeBufferSize function that it calls first

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 18:23):

Could walk the call graph similar to what we do for borrow inference. Just with capacity estimates. (Always reserve the minimum guaranteed capacity ahead of time)

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:23):

If the buffer size is statically known, printMe__computeBufferSize would itself just return a constant (or it could just be a constant). But if there's some dynamic computation it could be an upper bound (or perhaps lower bound).

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 18:24):

Yeah, definitely could be more accurate with runtime info.

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:25):

e.g. it generates:

printMe__computeBufferSize = |me| {
  match me {
    Int(i) => 1, # We know a int will take up at least 1 byte (or we could do an upper-bound, or the average of the two, ...)
    MyPair(a, b) => {
        size = 1
        size = size + printMe__computeBufferSize(a);
        size = size + 1;
        size = size + printMe__computeBufferSize(b);
        size = size + 1;
        size
    }
  }
}

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:26):

It could even return a pair (upper_bound, lower_bound) - and the caller could do something sensible with that

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:26):

Like, if the upper bound fits within a small string, we can omit all bounds checks and optimize heavily

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:29):

:thinking: I wonder if this would be useful in the context of Set/Dict as well?

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:29):

e.g. In cases like computing a Set of free variables

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:34):

Or - crazy thought - I wonder if there's a world in which doing some optimization around "trees of tag unions with boxes" to convert them to SOA automatically, putting them all into a contiguous buffer instead of scattered in the heap. Use index offsets instead of pointer indirection. Functions that took a reference to the value now take a reference to the containing "tree buffer" and an index of the value. You then do the same "pass in the buffer" optimization to have a buffer to plop the return value into.

Automatic SOA arena allocation?

view this post on Zulip Joshua Warner (Sep 01 2025 at 18:35):

That does get gnarly if you later want to re-use just parts of the tree for some reason (say, you're building a parser that does minimal re-parses with ast structure sharing after edits)

view this post on Zulip Brendan Hansknecht (Sep 01 2025 at 18:48):

Stuff like this starts to worry me cause it adds a lot more complex tradeoffs (both in terms of compiler complexity and adding things to runtime that have some form of cost). So I prefer slim optimizations where possible. But I also am someone who likes to manipulate code to get the perf I want...so too complex of a compiler is often a disadvantage to me.

view this post on Zulip Brendan Hansknecht (Sep 03 2025 at 01:05):

Expanding on my comment above. If perf matters for some code, I will manually buffer it cause auto-buffering is not guaranteed to work and be consisted. On top of that, my context likely means I can employ other tricks to get more perf. If libraries are designed assuming auto-bufferization and the optimization can't apply for whatever reason, then users are just out of luck. The perf is gonna be bad. If instead, libraries manually buffer cause they know that is the only way to get perf, things are guaranteed to be performant.

That said, I do think auto-bufferization likely falls into the category of roc should just do it. Though this depends on if we can make it happen essentially 100% of the time with very limited compiler complexity. If this only happens ~95% of the time, I would argue it is likely worse than not having it and pushing users to manually bufferize. Honestly when consider large code bases and package ecosystems, it probably needs to work well more than 99.9% of the time.


I see anysort of runtime bound guesstimating as likely a mistake. More compiletime and runtime complexity likely for little gain. On top of that, the functions if actually walking a recursive datastructure to calculate a size, could be crazy expensive. Again, something dumb that just gets a lower bound without recursion and is super fast likely is reasonable (and hopefully can be done at compile time), but anythong that steps into runtime or has any real complexity will certainly have many cases where it hurts more than it helps. So I would suggest avoiding it.


Same with any sort of automatic SOA or crazy tag transformations. Not only does this make host interactions a lot more complicated, but it is not clearly a win and may just be wasting tons of compile time for gains that don't matter in practice. So these kinds of transformations I baseline strongly against. This feels a lot like automatic transformations to run on GPU or similar. While they can help, but of the time they lead to worse results and are a loss for a lot of complexity. That has been tried many many times with very poor results.


All that said, if some of these have simple implementations that lead to across the board wins with little compile time costs, then of course they are welcome. I just have a strong default against most of this compiler magic that maybe occasionally works and costs a lot in terms of complexity, compile time, etc.


Last updated: Jun 16 2026 at 16:19 UTC