zig compiler - profiling and optimization · compiler development

So, I have been messing a lot with profiling. It's fun to tinker with the new compiler, and interesting to see some of the tradeoffs.

Thought I should make a standalone thread, since I assume there will be many findings and plenty of discussion over time.

A few random findings.

1 million line challenge

Parsing and formatting 1 million lines of a syntax grab bag.
The zig compiler is ~5x faster and uses 4x less memory.
In real terms, the zig compiler took ~300ms to parse and format the million lines.
It used ~300MB to do so.
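To put those two figures in per-line terms, here is a small back-of-the-envelope calculation. It only uses the numbers quoted above and assumes 1MB means 10**6 bytes; the variable names are made up for illustration.

```python
# Back-of-the-envelope numbers for the 1 million line run.
lines = 1_000_000
elapsed_s = 0.300       # ~300ms, as quoted above
memory_bytes = 300e6    # ~300MB, assuming 1MB = 10**6 bytes

lines_per_second = lines / elapsed_s
bytes_per_line = memory_bytes / lines

print(f"{lines_per_second:,.0f} lines/s, {bytes_per_line:.0f} bytes per line")
```

So the rough picture is a few million lines per second and a few hundred bytes of memory per source line, both approximate because the inputs are approximate.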

c allocator vs the new zig smp allocator.

When dealing with 100 files of 1000 lines each:
The c allocator uses way less memory than the zig smp allocator (~4x less).
That said, it also takes significantly longer to run (~1.4x slower).

Definitely something to consider switching to. Though need to test on more cases and such.

Brendan Hansknecht (Mar 16 2025 at 01:36):

Richard Feldman (Mar 16 2025 at 01:38):

Brendan Hansknecht (Mar 16 2025 at 01:45):

Also, I am still very new to the tracy profiler (demo), but it is an awesome tool for diving into performance. I think it will be extra useful once we start doing multi-threaded work. It has too many features for me to describe here, but I definitely should give a demo of using it with roc at some point.

I graciously borrowed how the zig compiler integrates tracy and have it on a branch. At some point soon, I want to make a PR for it. It is relatively non-invasive. Just a sprinkling of trace points, some build config, and an optional allocation tracker.

Brendan Hansknecht (Mar 16 2025 at 18:51):

One thing that shows up clearly in profiling is that good default container capacities can save a lot of time by avoiding many reallocations and copies.

I was thinking of adding a bunch of initCapacity functions to our various data structures, but realized that in many cases the caller does not really know the capacity it wants. I'm thinking of flipping the script and giving the data structures control of their default size, so calling init will simply allocate the default capacity we think is reasonable for that data structure.

As an example, instead of adding initCapacity to the small string interner, we would just update the small string interner init function to always allocate enough space for (x strings of a specific size), maybe 1000 strings of 4 characters.

I'm not totally sold on this idea, but it feels like it might be easier to tune on a per data structure level than at a per instance level.
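A rough sketch of that idea, written in Python rather than Zig purely for illustration. Everything here is hypothetical: `GrowableBytes` is a toy model of a growable buffer that counts regrowths, and `SmallStringInterner` is not the real interner, just a stand-in whose init picks the "1000 strings of 4 characters" default itself.

```python
class GrowableBytes:
    """Toy model of a growable byte buffer that counts reallocations."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray()
        self.regrowths = 0

    def extend(self, chunk):
        """Append chunk and return the offset where it starts."""
        offset = len(self.data)
        needed = offset + len(chunk)
        while needed > self.capacity:
            # Doubling growth; each step stands in for a realloc plus copy.
            self.capacity = max(1, self.capacity * 2)
            self.regrowths += 1
        self.data += chunk
        return offset


class SmallStringInterner:
    # Hypothetical defaults: room for 1000 strings of 4 bytes each.
    DEFAULT_STRINGS = 1000
    DEFAULT_STRING_LEN = 4

    def __init__(self, capacity_bytes=None):
        # The data structure, not the caller, decides the default size.
        if capacity_bytes is None:
            capacity_bytes = self.DEFAULT_STRINGS * self.DEFAULT_STRING_LEN
        self.bytes = GrowableBytes(capacity_bytes)
        self.index = {}

    def intern(self, s):
        """Return the byte offset of s, storing it on first sight."""
        existing = self.index.get(s)
        if existing is not None:
            return existing
        offset = self.bytes.extend(s.encode("utf-8"))
        self.index[s] = offset
        return offset


tuned = SmallStringInterner()                   # init picks the default
naive = SmallStringInterner(capacity_bytes=0)   # grows from nothing
for i in range(1000):
    name = f"v{i:03d}"  # 1000 distinct 4-byte identifiers
    tuned.intern(name)
    naive.intern(name)

print("regrowths with default capacity:", tuned.bytes.regrowths)
print("regrowths starting from zero:   ", naive.bytes.regrowths)
```

The point of the comparison is only that the call site stays a plain init while the tuning lives in one place per data structure; the actual default would need measuring.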

Richard Feldman (Mar 16 2025 at 19:00):

I think what they did in the Zig compiler was to do some benchmarks on heuristics and go by that

Richard Feldman (Mar 16 2025 at 19:00):

like for example "here's how much to allocate for tokens as a multiple of size of source bytes"

Richard Feldman (Mar 16 2025 at 19:01):

not an exact science obviously, but can do heuristics based on measurements in the wild

Brendan Hansknecht (Mar 16 2025 at 19:01):

Yeah, that's a good point. A lot of this can likely use simple heuristics that go beyond data-structure-specific and into input-specific.
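A minimal sketch of what such an input-based heuristic could look like, again in Python for illustration only. The sample measurements are invented, and the function names, the 0.25 default ratio, and the floor of 16 are assumptions, not values from the Zig or Roc compilers.

```python
def measure_ratio(samples):
    """Derive tokens-per-byte from (source_bytes, token_count) pairs."""
    total_bytes = sum(source_bytes for source_bytes, _ in samples)
    total_tokens = sum(token_count for _, token_count in samples)
    return total_tokens / total_bytes


def estimate_token_capacity(source_len, tokens_per_byte=0.25, floor=16):
    """Guess how many token slots to reserve for a source of source_len bytes."""
    return max(floor, int(source_len * tokens_per_byte))


# Invented measurements: (source size in bytes, tokens produced).
samples = [(1000, 210), (4000, 800), (500, 115)]
ratio = measure_ratio(samples)

capacity = estimate_token_capacity(10_000, tokens_per_byte=ratio)
print(f"measured ratio {ratio:.4f} tokens/byte, reserve {capacity} tokens")
```

In practice the ratio would come from benchmarks over real code, and an over-estimate trades some memory for fewer regrowths.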

Joshua Warner (Mar 17 2025 at 01:25):

Brendan Hansknecht (Mar 17 2025 at 01:26):

Brendan Hansknecht (Mar 17 2025 at 01:27):

Repetition will definitely benefit the interner and lead to less allocating and regrowth, though. So it is biased for sure.

Brendan Hansknecht (Mar 17 2025 at 01:27):

I really should update the builtins and/or basic CLI to the new syntax to get a more realistic feel.

Joshua Warner (Mar 17 2025 at 01:29):

I have a large corpus of all public roc code, but written in the old syntax of course

Joshua Warner (Mar 17 2025 at 01:29):

Joshua Warner (Mar 17 2025 at 01:30):

Been thinking about running that thru the migration formatter in the old compiler (that'll need a bit of work!) and then using that as a somewhat more realistic corpus

Brendan Hansknecht (Mar 17 2025 at 01:30):

Oh, that would be awesome to update and run some benchmarks on. I think we are still a ways from supporting enough of the language to use that corpus, but it would be great.

Brendan Hansknecht (Mar 17 2025 at 01:31):

Can we make that corpus a GitHub repo? And make two branches, one for old and one for new syntax?

Brendan Hansknecht (Mar 17 2025 at 01:31):

Joshua Warner (Mar 17 2025 at 01:32):

Brendan Hansknecht (Mar 17 2025 at 01:35):

Joshua Warner (Mar 17 2025 at 01:35):

Joshua Warner (Mar 17 2025 at 01:36):

There are a few missing translations there that I know of, and likely some bugs. Basically completely untested.

Joshua Warner (Mar 17 2025 at 01:45):

Of course this is somewhat complicated by the old compiler still depending on zig 0.13, which breaks the build there, since I've upgraded to 0.14 for the new compiler :grimacing:

Brendan Hansknecht (Mar 17 2025 at 01:49):

Joshua Warner (Mar 17 2025 at 02:37):

Brendan Hansknecht (Mar 17 2025 at 02:38):

Depends on how hard it ends up being to upgrade inkwell and llvm. Occasionally that is trivial. A lot of the time that is a huge hassle.

Joshua Warner (Mar 17 2025 at 02:38):

Brendan Hansknecht (Mar 17 2025 at 02:39):

Yeah....one of the other huge gains of the new compiler is that we will generate llvm bitcode directly, which gives us much more flexibility to decouple that

Joshua Warner (Mar 17 2025 at 02:40):

Brendan Hansknecht (Mar 17 2025 at 02:46):

Brendan Hansknecht (Mar 17 2025 at 04:23):

Some of these optimizations are bespoke tuning that probably won't be kept or will need proper heuristics, but the rest are just simple cleanups to have fewer allocations overall.

optimization results (-38% execution time)

Each executable builds on top of the last and includes all previous optimizations.

Also, for some reason, changes to file loading led to the smp allocator using way less memory. It doesn't make sense to me. Definitely questioning if I am missing something or if the other case of 4x memory was some weird bug/fluke.

Benchmark 1 (31 runs): ./zig-out/bin/roc-base format /tmp/new
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           159ms ± 2.86ms     153ms …  168ms          2 ( 6%)        0%
  peak_rss            745KB ±    0       745KB …  745KB          0 ( 0%)        0%
  cpu_cycles          170M  ± 5.79M      156M  …  184M           3 (10%)        0%
  instructions        321M  ±  961       321M  …  321M           0 ( 0%)        0%
  cache_references   3.17M  ± 86.4K     3.05M  … 3.36M           0 ( 0%)        0%
  cache_misses       1.82K  ± 1.38K     1.11K  … 8.01K           2 ( 6%)        0%
  branch_misses      1.10M  ± 94.2K      858K  … 1.24M           3 (10%)        0%
Benchmark 2 (35 runs): ./zig-out/bin/roc-buffered-fmt format /tmp/new
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           144ms ± 2.75ms     138ms …  149ms          0 ( 0%)        ⚡-  9.8% ±  0.9%
  peak_rss            864KB ±    0       864KB …  864KB          0 ( 0%)        💩+ 15.9% ±  0.0%
  cpu_cycles          158M  ± 5.91M      146M  …  167M           0 ( 0%)        ⚡-  7.2% ±  1.7%
  instructions        308M  ±  454       308M  …  308M           0 ( 0%)        ⚡-  4.2% ±  0.0%
  cache_references   2.89M  ± 75.4K     2.77M  … 3.08M           0 ( 0%)        ⚡-  9.0% ±  1.3%
  cache_misses       1.34K  ±  646       979   … 4.97K           1 ( 3%)          - 26.4% ± 28.6%
  branch_misses       997K  ±  106K      785K  … 1.17M           0 ( 0%)        ⚡-  9.4% ±  4.5%
Benchmark 3 (39 runs): ./zig-out/bin/roc-arena format /tmp/new
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           127ms ± 2.93ms     120ms …  131ms          0 ( 0%)        ⚡- 20.5% ±  0.9%
  peak_rss            864KB ±    0       864KB …  864KB          0 ( 0%)        💩+ 15.9% ±  0.0%
  cpu_cycles          154M  ± 6.73M      140M  …  165M           0 ( 0%)        ⚡-  9.6% ±  1.8%
  instructions        308M  ±  454       308M  …  308M           0 ( 0%)        ⚡-  3.9% ±  0.0%
  cache_references   2.61M  ± 80.5K     2.48M  … 2.78M           0 ( 0%)        ⚡- 17.6% ±  1.3%
  cache_misses       1.69K  ±  641      1.22K  … 5.39K           2 ( 5%)          -  7.0% ± 27.3%
  branch_misses       976K  ±  122K      734K  … 1.17M           0 ( 0%)        ⚡- 11.3% ±  4.8%
Benchmark 4 (41 runs): ./zig-out/bin/roc-more-cap format /tmp/new
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           123ms ± 2.54ms     117ms …  130ms          2 ( 5%)        ⚡- 22.8% ±  0.8%
  peak_rss            864KB ±    0       864KB …  864KB          0 ( 0%)        💩+ 15.9% ±  0.0%
  cpu_cycles          153M  ± 5.80M      140M  …  170M           4 (10%)        ⚡-  9.8% ±  1.6%
  instructions        306M  ±  444       306M  …  306M           0 ( 0%)        ⚡-  4.7% ±  0.0%
  cache_references   2.46M  ± 48.8K     2.38M  … 2.60M           0 ( 0%)        ⚡- 22.5% ±  1.0%
  cache_misses       1.69K  ±  642      1.18K  … 5.46K           2 ( 5%)          -  7.0% ± 26.7%
  branch_misses       986K  ±  105K      746K  … 1.28M           4 (10%)        ⚡- 10.4% ±  4.3%
Benchmark 5 (43 runs): ./zig-out/bin/roc-file-size format /tmp/new
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           117ms ± 2.96ms     111ms …  122ms          0 ( 0%)        ⚡- 26.8% ±  0.9%
  peak_rss            864KB ±    0       864KB …  864KB          0 ( 0%)        💩+ 15.9% ±  0.0%
  cpu_cycles          150M  ± 6.47M      137M  …  162M           0 ( 0%)        ⚡- 12.0% ±  1.7%
  instructions        305M  ±  954       305M  …  305M           0 ( 0%)        ⚡-  4.9% ±  0.0%
  cache_references   2.37M  ± 37.3K     2.31M  … 2.47M           3 ( 7%)        ⚡- 25.2% ±  0.9%
  cache_misses       1.61K  ±  327       832   … 3.00K           3 ( 7%)          - 11.7% ± 23.8%
  branch_misses       924K  ±  116K      696K  … 1.14M           0 ( 0%)        ⚡- 16.1% ±  4.6%
Benchmark 6 (50 runs): ./zig-out/bin/roc-smp format /tmp/new
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          98.7ms ± 2.92ms    92.2ms …  105ms          0 ( 0%)        ⚡- 38.0% ±  0.8%
  peak_rss            864KB ±    0       864KB …  864KB          0 ( 0%)        💩+ 15.9% ±  0.0%
  cpu_cycles          147M  ± 6.21M      133M  …  161M           0 ( 0%)        ⚡- 13.5% ±  1.6%
  instructions        303M  ±  960       303M  …  303M           0 ( 0%)        ⚡-  5.7% ±  0.0%
  cache_references   2.11M  ± 40.6K     2.03M  … 2.19M           0 ( 0%)        ⚡- 33.5% ±  0.9%
  cache_misses       1.44K  ±  533      1.07K  … 4.75K           4 ( 8%)          - 20.9% ± 23.7%
  branch_misses       919K  ±  107K      675K  … 1.13M           0 ( 0%)        ⚡- 16.6% ±  4.2%
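For reading the delta column, here is a small sketch that recomputes the wall_time deltas from the printed means. It assumes the delta is the relative change of each mean against Benchmark 1; since the printed means are rounded, the recomputed values differ slightly from the tool's own column.

```python
# wall_time means copied from the benchmark output above (milliseconds).
baseline_ms = 159.0  # roc-base
wall_time_ms = {
    "roc-buffered-fmt": 144.0,
    "roc-arena": 127.0,
    "roc-more-cap": 123.0,
    "roc-file-size": 117.0,
    "roc-smp": 98.7,
}
# Delta column as printed by the benchmark tool, for comparison.
reported_delta = {
    "roc-buffered-fmt": -9.8,
    "roc-arena": -20.5,
    "roc-more-cap": -22.8,
    "roc-file-size": -26.8,
    "roc-smp": -38.0,
}


def delta_percent(mean, baseline):
    """Relative change of mean against baseline, in percent."""
    return (mean - baseline) / baseline * 100.0


deltas = {name: delta_percent(ms, baseline_ms) for name, ms in wall_time_ms.items()}
for name, delta in deltas.items():
    print(f"{name:18s} {delta:6.1f}%  (reported {reported_delta[name]:.1f}%)")
```

The last row is where the -38% in the heading comes from: roughly 159ms down to 98.7ms.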

Luke Boswell (Mar 19 2025 at 05:17):

@Brendan Hansknecht -- we're adding a lot of knobs and dials for tuning the compiler. I appreciate these are all things that we can tune later.

Luke Boswell (Mar 19 2025 at 05:19):

Maybe one day we have some automated thing that can help us tune these based on real code (i.e. using something like Osprey)... but even manually it would be easier to surface all of these decisions if they are in one place.

Brendan Hansknecht (Mar 19 2025 at 05:52):

Yeah, definitely lots of knobs. I have just been learning tracy and thus tuning a bunch of random ones

Brendan Hansknecht (Mar 19 2025 at 05:53):

Apart from initial capacities, I don't think we'll have too many bespoke constants

Brendan Hansknecht (Mar 19 2025 at 05:53):

Brendan Hansknecht (Mar 19 2025 at 05:54):

That said, setting a constant somewhere for the default capacity if people don't know what to pick sounds like potentially a good idea.

Luke Boswell (Mar 19 2025 at 05:56):

What I like about putting the constants in a single file is that it's easier to track the history of any changes. If we change constants in the future based on some profiling, we will include the analysis/evaluation in the PR, so we always have a good point of reference that is easy to find.

Luke Boswell (Mar 19 2025 at 05:58):

I could also imagine a future where different users might want different parameters. Like maybe if I'm using roc in some special way I might want to change things to suit me.

Brendan Hansknecht (Mar 19 2025 at 06:03):

Yeah, makes some sense. I'm not fully sure there are good names for these various constants, because many of them will just be the starting size of arbitrary containers or maybe a ratio from the input source to the size. That is where local reasoning makes a lot of sense. But I totally understand the want to have all knobs in one place.
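One possible shape for the "all knobs in one place" idea, sketched in Python for illustration. The `Tuning` type, its field names, and every default value are hypothetical; nothing here reflects actual constants in the compiler.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Tuning:
    """Single home for tuning constants, so changes show up in one file's history."""

    default_capacity: int = 64            # fallback when nobody knows what to pick
    interner_initial_bytes: int = 4000    # e.g. 1000 strings of 4 characters
    tokens_per_source_byte: float = 0.25  # ratio from input size to container size


DEFAULT_TUNING = Tuning()


def token_capacity(source_len, tuning=DEFAULT_TUNING):
    """Initial token capacity for a source file of source_len bytes."""
    estimate = int(source_len * tuning.tokens_per_source_byte)
    return max(tuning.default_capacity, estimate)


# A user with unusual inputs could override a single knob.
custom = replace(DEFAULT_TUNING, tokens_per_source_byte=0.5)

print(token_capacity(1000))          # uses the shared defaults
print(token_capacity(1000, custom))  # uses the overridden ratio
print(token_capacity(100))           # small input falls back to default_capacity
```

Ratio-style constants like `tokens_per_source_byte` fit this layout fairly naturally, while one-off starting sizes for arbitrary containers are harder to name, which is the tension raised above.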

Stream: compiler development

Topic: zig compiler - profiling and optimization

Brendan Hansknecht (Mar 16 2025 at 01:35):
