Stream: compiler development

Topic: ci


view this post on Zulip Richard Feldman (Dec 05 2025 at 14:16):

https://github.com/roc-lang/roc/pull/8565 failed on Nix checks only, but there's no error message and I'm not sure how to debug :sweat_smile:

I'm going to merge because it has some important fixes and is prone to merge conflicts, plus I don't see why anything should have changed in terms of Nix :raised:

can anyone who has Nix set up try to reproduce that locally?

view this post on Zulip Anton (Dec 05 2025 at 14:45):

I'll take a quick look

view this post on Zulip Anton (Dec 05 2025 at 15:04):

It passes now https://github.com/roc-lang/roc/pull/8575 , I did not change anything :p
There was also nothing in your failing check output that showed what was going wrong, I guess we'll see if it pops up again.

view this post on Zulip Richard Feldman (Dec 06 2025 at 18:05):

I think I found a fix - https://github.com/roc-lang/roc/pull/8579 - I believe the problem was that we were rebuilding the roc binary before each fx test, to prevent staleness, and this resulted in too much disk usage for the Nix environment, resulting in an error with no clear error message.

view this post on Zulip Richard Feldman (Dec 06 2025 at 18:06):

however, that PR is now failing on a valgrind error that suggests maybe we install a non-stripped ld.so equivalent in that environment?

debug: reported NIL problems
debug: processing snapshot file: /home/runner/work/roc/roc/test/snapshots/expr_no_space_dot_int.md
debug: Generating snapshot for: /home/runner/work/roc/roc/test/snapshots/expr_no_space_dot_int.md
debug: processing snapshot file: /home/runner/work/roc/roc/test/snapshots/can_import_exposing_types.md
debug: Generating snapshot for: /home/runner/work/roc/roc/test/snapshots/can_import_exposing_types.md
==6829==
==6829== HEAP SUMMARY:
==6829== in use at exit: 0 bytes in 0 blocks
==6829== total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==6829==
==6829== All heap blocks were freed -- no leaks are possible
==6829==
==6829== For lists of detected and suppressed errors, rerun with: -s
==6829== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==6906== Memcheck, a memory error detector
==6906== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==6906== Using Valgrind-3.26.0 and LibVEX; rerun with -h for copyright info
==6906== Command: ./zig-out/bin/roc --no-cache test/str/app.roc
==6906==
valgrind: Fatal error at startup: a function redirection
valgrind: which is mandatory for this platform-tool combination
valgrind: cannot be set up. Details of the redirection are:
valgrind:
valgrind: A must-be-redirected function
valgrind: whose name matches the pattern: memcmp
valgrind: in an object with soname matching: ld-linux-x86-64.so.2
valgrind: was not found whilst processing
valgrind: symbols from the object with soname: ld-linux-x86-64.so.2
valgrind:
valgrind: Possible fixes: (1, short term): install glibc's debuginfo
valgrind: package on this machine. (2, longer term): ask the packagers
valgrind: for your Linux distribution to please in future ship a non-
valgrind: stripped ld.so (or whatever the dynamic linker .so is called)
valgrind: that exports the above-named function using the standard
valgrind: calling conventions for this platform. The package you need
valgrind: to install for fix (1) is called
valgrind:
valgrind: On Debian, Ubuntu: libc6-dbg
valgrind: On SuSE, openSuSE, Fedora, RHEL: glibc-debuginfo
valgrind:
valgrind: Note that if you are debugging a 32 bit process on a
valgrind: 64 bit system, you will need a corresponding 32 bit debuginfo
valgrind: package (e.g. libc6-dbg:i386).
valgrind:
valgrind: Cannot continue -- exiting now. Sorry

view this post on Zulip Anton (Dec 08 2025 at 10:42):

Looks like it's fixed :)

view this post on Zulip Richard Feldman (Dec 08 2025 at 14:32):

oh yeah Claude had a suggested fix in a recent PR, and it ended up resolving it

view this post on Zulip Anton (Apr 16 2026 at 12:53):

We switched internet providers today and internet is not working :sweat_smile: the benchmarks workflow and the old workflows will not pass until it is fixed.

view this post on Zulip Anton (Apr 17 2026 at 08:19):

CI machines are back online

view this post on Zulip Anton (May 18 2026 at 13:47):

Lots of CI workflows remain queued, waiting on a github hosted runner. The github status page does not show any issues. Perhaps we exceeded some quota?

view this post on Zulip Anton (May 18 2026 at 15:06):

Looks like it's moving along now, there is probably a limited number of github runners that we can use simultaneously.

view this post on Zulip Luke Boswell (May 19 2026 at 00:13):

Yeah things seem to be running really slowly compared to what I remember.

view this post on Zulip Anton (May 19 2026 at 11:45):

I also spotted some recent significant slowdowns with roc check and build when running basic-cli all_tests.sh (branch migrate-zig-compiler-edits). I have not looked into it deeply but it's on my TODO list.

view this post on Zulip Richard Feldman (May 19 2026 at 14:26):

recent as in after the MIR rewrite landed?

view this post on Zulip Anton (May 19 2026 at 14:28):

I checked right before MIR rewrite, that was fast, let me check right after...

view this post on Zulip Richard Feldman (May 19 2026 at 14:28):

yeah I added a ton of debug-only checks and lints and stuff in there, so I wouldn't be surprised if that slowed down ci

view this post on Zulip Richard Feldman (May 19 2026 at 14:29):

benchmark ci step passed, so if release builds got significantly slower, it would be for a scenario we don't have benchmarks for

view this post on Zulip Anton (May 19 2026 at 14:30):

Oh ok, let me check with a release build

view this post on Zulip Anton (May 19 2026 at 14:47):

Yeah, still a significant slowdown with release, I will dig into it sometime:
Commit 0f56082:

=== Checking examples ===
Checking: command-line-args.roc
No errors found in 104ms for examples/command-line-args.roc
Checking: hello-world.roc
No errors found in 100ms for examples/hello-world.roc
Checking: stdin-basic.roc
No errors found in 95ms for examples/stdin-basic.roc
Checking: path.roc
No errors found in 102ms for examples/path.roc
Checking: command.roc
No errors found in 94ms for examples/command.roc
Checking: time.roc
No errors found in 100ms for examples/time.roc
Checking: random.roc
No errors found in 98ms for examples/random.roc
Checking: locale.roc
No errors found in 89ms for examples/locale.roc
Checking: tty.roc
No errors found in 104ms for examples/tty.roc
Checking: dir.roc
No errors found in 104ms for examples/dir.roc
Checking: env-var.roc
No errors found in 97ms for examples/env-var.roc

Latest main:

Checking: command-line-args.roc
No errors found in 270ms for examples/command-line-args.roc
Checking: hello-world.roc
No errors found in 263ms for examples/hello-world.roc
Checking: stdin-basic.roc
No errors found in 287ms for examples/stdin-basic.roc
Checking: path.roc
No errors found in 265ms for examples/path.roc
Checking: command.roc
No errors found in 308ms for examples/command.roc
Checking: time.roc
No errors found in 280ms for examples/time.roc
Checking: random.roc
No errors found in 276ms for examples/random.roc
Checking: locale.roc
No errors found in 278ms for examples/locale.roc
Checking: tty.roc
No errors found in 276ms for examples/tty.roc
Checking: dir.roc
No errors found in 311ms for examples/dir.roc
Checking: env-var.roc
No errors found in 283ms for examples/env-var.roc
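
To put a number on the slowdown shown in the two logs above, here is a minimal sketch that averages the reported `roc check` times and takes their ratio. The timings are copied from the logs; the helper name `slowdown` is only illustrative and is not part of any Roc tooling.

```python
from statistics import mean

# Timings in milliseconds, copied from the two logs above.
before_ms = [104, 100, 95, 102, 94, 100, 98, 89, 104, 104, 97]      # commit 0f56082
after_ms = [270, 263, 287, 265, 308, 280, 276, 278, 276, 311, 283]  # latest main

def slowdown(before, after):
    """Ratio of the mean check time after vs. before (illustrative helper)."""
    return mean(after) / mean(before)

ratio = slowdown(before_ms, after_ms)
print(f"mean before: {mean(before_ms):.1f} ms")
print(f"mean after:  {mean(after_ms):.1f} ms")
print(f"slowdown:    {ratio:.2f}x")
```

On these eleven examples the mean goes from roughly 99 ms to roughly 282 ms, which is a slowdown of about 2.85x.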

view this post on Zulip Anton (May 19 2026 at 14:54):

CI related message: I have a fix for the flaky macos x86 issue "Coordinator stuck" that I am testing now:
https://github.com/roc-lang/roc/pull/9439/changes/1b95f9f4b6c87d211b2550c49c1112b5a51dc4d6

view this post on Zulip Anton (May 23 2026 at 14:50):

roc_build_times_comparison.webp
:rocket: :rocket: :rocket:

view this post on Zulip Anton (May 23 2026 at 14:52):

Benchmarks CI did not catch the slowdown after the MIR rewrite because it uses roc run everywhere, which is currently set to single-threaded mode. The slowdown was only visible when running multi-threaded, as we do for roc check and build.

view this post on Zulip Richard Feldman (May 23 2026 at 14:52):

whoooa!

view this post on Zulip Richard Feldman (May 23 2026 at 14:52):

what leads to the massive speedup?

view this post on Zulip Anton (May 23 2026 at 14:53):

The whole breakdown is here: https://github.com/roc-lang/roc/pull/9453

view this post on Zulip Anton (May 23 2026 at 14:54):

Do we want to keep roc run single threaded?
https://github.com/roc-lang/roc/blob/06d3d901c011a2b55fae84b182deb5a5f722343c/src/cli/main.zig#L1963

view this post on Zulip Richard Feldman (May 23 2026 at 14:58):

nah

view this post on Zulip Luke Boswell (Jun 02 2026 at 05:56):

I'm finding a few pathologically slow tests in our CI ... I'm thinking of skipping these and pointing them at GH Issue tasks to investigate and fix. From looking into a few so far I get the feeling they will be simple changes to re-use work or caches properly, and will cut orders of magnitude off the runtime.

view this post on Zulip Luke Boswell (Jun 03 2026 at 00:04):

Here's a sneak peek at the generated "dashboard" from the new minici :smiley:

Screenshot 2026-06-03 at 10.03.05.png

view this post on Zulip Luke Boswell (Jun 03 2026 at 00:18):

yeah it's actually awesome... I've already shaved 20+ minutes off the CI runtime just picking off low hanging fruit

view this post on Zulip Luke Boswell (Jun 03 2026 at 02:16):

OK, I think this is ready to land... https://github.com/roc-lang/roc/pull/9501

There is a lot more we can do, but we can follow up with those changes.

view this post on Zulip Luke Boswell (Jun 03 2026 at 02:21):

I really leaned into the naming convention we have been using, so build-* steps do building and prep work, and run-* or check-* steps actually execute things. zig build minici is the only exception now: it does build-ci, then orchestrates running all the leaf steps.

view this post on Zulip Luke Boswell (Jun 03 2026 at 02:51):

How would we feel about pulling out the valgrind and llvm tests from every PR CI run, and having those run only as a follow up to things landing in main? Now that we are doing less backend development, I feel like the risk of us breaking these is much lower. Almost all of our work recently has been in stages of the pipeline earlier than llvm, and I think valgrind was most helpful while we were working on the serialisation side of things.

view this post on Zulip Richard Feldman (Jun 03 2026 at 03:47):

I think it would be even better if we could skip valgrind based on certain files/directories not being touched, e.g. if the only changes are to parsing/canonicalization/type-checking/reporting/repl/etc, and there's no change to anything after type-checking up through code gen, then the odds of valgrind failing are so small that yeah, just checking periodically on main seems fine

view this post on Zulip Richard Feldman (Jun 03 2026 at 03:47):

and similarly if we have no changes to llvm code gen, then yeah there's probably no need to run those tests except periodically on main, since any bugs should in theory be caught by one of the other backends
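
The path-based skip described above could be sketched roughly as follows. This is only an illustration of the idea: the directory prefixes and file names are hypothetical stand-ins for the front-end stages mentioned (parsing, canonicalization, type-checking, reporting, repl) and may not match the repository layout.

```python
# Hypothetical path prefixes for the front-end stages named above; the real
# directory layout of the repository may differ.
FRONTEND_PREFIXES = (
    "src/parse/",
    "src/canonicalize/",
    "src/check/",
    "src/reporting/",
    "src/repl/",
)

def needs_valgrind(changed_files):
    """Return True if any changed file lies outside the front-end prefixes."""
    return any(not path.startswith(FRONTEND_PREFIXES) for path in changed_files)

# Hypothetical file names, for illustration only.
print(needs_valgrind(["src/parse/tokenize.zig", "src/check/unify.zig"]))
print(needs_valgrind(["src/parse/tokenize.zig", "src/backend/llvm/emit.zig"]))
```

The check is deliberately conservative: any file that is not known to be front-end only triggers valgrind, so an unrecognized path never causes the tests to be skipped.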

view this post on Zulip Luke Boswell (Jun 03 2026 at 03:58):

I think I've almost fixed all the current broken things in our existing CI (post zig 16 merge). I hope to push that work up in the next hour or so. I don't want to touch the above valgrind or llvm right now, I think we can do that in a follow up. I'm sure @Anton will have ideas for how to do that properly.

view this post on Zulip Anton (Jun 03 2026 at 12:08):

Probably very similar to ci_zig_nix.yml, I have added it to my TODO list.

view this post on Zulip Anton (Jun 03 2026 at 13:40):

I am going to look at the windows 11 arm64 CI failure now.

view this post on Zulip Anton (Jun 07 2026 at 18:01):

Luke's big improvement to the testing workflow just landed :tada:

view this post on Zulip Anton (Jun 07 2026 at 18:01):

If you had a PR up, I recommend that you check carefully if a test that you added should be moved to a different file.

view this post on Zulip Richard Feldman (Jun 07 2026 at 20:54):

I also did some test rearranging stuff which landed recently, but I think it made it in first :fingers_crossed:

view this post on Zulip Anton (Jun 08 2026 at 17:04):

The valgrind tests are re-enabled and working again :)

view this post on Zulip Richard Feldman (Jun 08 2026 at 17:14):

awesome, thank you so much for getting them fixed!!!

view this post on Zulip Anton (Jun 08 2026 at 18:28):

Ok hello_world.zig does not work on windows 11 arm64 with zig 16, this is issue https://codeberg.org/ziglang/zig/issues/31865. Claude did a whole bunch of involved debugging before I decided to see if hello world even runs :p

view this post on Zulip Anton (Jun 08 2026 at 18:29):

Maybe we should use zig master which has the fix, I will see tomorrow...

view this post on Zulip Luke Boswell (Jun 10 2026 at 07:00):

@Anton can you please make a tracking issue for that zig windows arm issue... I'm not sure how it breaks Roc specifically

view this post on Zulip Anton (Jun 10 2026 at 11:27):

I made #9598 for the issue, it is not breaking Roc specifically, hello world (written in zig) does not work on win 11 arm64. Given how involved our zig 16 upgrade was, I'm thinking we just keep win 11 arm64 CI disabled until the next release instead of upgrading to master now.

view this post on Zulip Luke Boswell (Jun 14 2026 at 01:18):

CI has a lot of work backed up ... it would be good to land a few of these PRs in the pipeline

view this post on Zulip Luke Boswell (Jun 14 2026 at 01:53):

I wonder if we should have a "staging" branch which we create PRs against and CI runs just the lighter "check" runs against. This would mean we can have many PRs landing together and avoid all the merge conflicts etc. Presumably everyone is running zig build minici on their machines locally before pushing anyway. When I'm working on something cross-OS I typically run that on each machine also before merging the PR.

Then we could make PRs from staging into main that run the full CI suite and catch any regressions. But as it's much quicker to make PRs against staging it's not a massive job to fix those and then restart the single staging->main workflows.

At the moment we have like 10+ PRs all trying to run the full CI suite and it's very slow. So we have many PRs that will need to be restarted multiple times just from merging changes that land in main.

view this post on Zulip Luke Boswell (Jun 14 2026 at 01:56):

This would mean we only have one big CI run going at any given time instead of many
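
As a rough back-of-the-envelope model of the difference, the sketch below counts full CI runs under two simplifying assumptions that are not taken from the thread: without a queue, every landing forces all still-open PRs to rerun the full suite (a worst case), and with a staging branch each batch needs exactly one full staging->main run that passes first time. Both function names are hypothetical.

```python
def full_runs_without_queue(open_prs):
    """Worst case: each landing invalidates every PR still open, so they all rerun."""
    # Round 1 runs all n PRs and one lands; the remaining n-1 rerun; and so on.
    return sum(range(1, open_prs + 1))

def full_runs_with_staging(batches):
    """One full staging->main run per batch, assuming each batch passes first time."""
    return batches

print(full_runs_without_queue(10))  # 55
print(full_runs_with_staging(1))    # 1
```

Real numbers would land somewhere in between, since not every merge forces a rerun and a staging->main run can fail and need repeating.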

view this post on Zulip Luke Boswell (Jun 14 2026 at 01:58):

Or I guess I'm suggesting a merge queue... but I don't have much experience with using one of these

view this post on Zulip Luke Boswell (Jun 14 2026 at 02:02):

We have talked about this previously... #contributing > Latest PR (#8487 str-to-utf8) fails the tests @ 💬

view this post on Zulip Richard Feldman (Jun 14 2026 at 02:30):

Luke Boswell said:

CI has a lot of work backed up ... it would be good to land a few of these PRs in the pipeline

I'm landing as fast as I can, but they have to actually pass ci first :sweat_smile:

view this post on Zulip Luke Boswell (Jun 14 2026 at 02:34):

Sorry I didn't mean that as criticism ... more an expression of "is there a better way to do this?"

view this post on Zulip Richard Feldman (Jun 14 2026 at 02:34):

oh sure haha

view this post on Zulip Richard Feldman (Jun 14 2026 at 02:35):

I honestly kinda wonder if maybe just investing in more CI machines is the answer

view this post on Zulip Anton (Jun 14 2026 at 11:13):

I have not kept up with cloud pricing but I suspect that could be quite expensive, especially because we will likely be producing more and more PRs in the future.

view this post on Zulip Anton (Jun 14 2026 at 11:14):

I am willing to try out a merge queue.

view this post on Zulip Anton (Jun 14 2026 at 11:15):

Maybe Claude can even bundle PRs that are unlikely to interfere with each other and run CI on them once.

view this post on Zulip Anton (Jun 14 2026 at 11:16):

By the way, we are only using a self-hosted server for the zig-benchmarks job, all the rest is done on github servers.

view this post on Zulip Anton (Jun 14 2026 at 11:40):

Anton said:

I have not kept up with cloud pricing but I suspect that could be quite expensive, especially because we will likely be producing more and more PRs in the future.

Like in the last 90 days, we used 10428 hours of CI time.
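
For a sense of scale, that figure can be converted into an average number of runners kept busy around the clock. This is plain arithmetic on the two numbers in the message above; it says nothing about cost, which depends on runner type and pricing.

```python
ci_hours = 10428   # CI hours used, from the message above
window_days = 90   # length of the reporting window, from the message above

hours_per_day = ci_hours / window_days
avg_busy_runners = hours_per_day / 24

print(f"{hours_per_day:.1f} CI hours per day")
print(f"{avg_busy_runners:.1f} runners busy around the clock, on average")
```

That works out to roughly 116 CI hours per day, or just under five runners running continuously.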


Last updated: Jun 16 2026 at 16:19 UTC