https://github.com/roc-lang/roc/pull/8565 failed on Nix checks only, but there's no error message and I'm not sure how to debug :sweat_smile:
I'm going to merge because it has some important fixes and is prone to merge conflicts, plus I don't see why anything should have changed in terms of Nix :raised:
can anyone who has Nix set up try to reproduce that locally?
I'll take a quick look
It passes now https://github.com/roc-lang/roc/pull/8575 , I did not change anything :p
There was also nothing in your failing check output that showed what was going wrong, I guess we'll see if it pops up again.
I think I found a fix - https://github.com/roc-lang/roc/pull/8579 - I believe the problem was that we were rebuilding the roc binary before each fx test to prevent staleness. That used too much disk space in the Nix environment, which made the check fail with no clear error message.
however, that PR is now failing on a valgrind error that suggests maybe we install a non-stripped ld.so equivalent in that environment?
debug: reported NIL problems
debug: processing snapshot file: /home/runner/work/roc/roc/test/snapshots/expr_no_space_dot_int.md
debug: Generating snapshot for: /home/runner/work/roc/roc/test/snapshots/expr_no_space_dot_int.md
debug: processing snapshot file: /home/runner/work/roc/roc/test/snapshots/can_import_exposing_types.md
debug: Generating snapshot for: /home/runner/work/roc/roc/test/snapshots/can_import_exposing_types.md
==6829==
==6829== HEAP SUMMARY:
==6829== in use at exit: 0 bytes in 0 blocks
==6829== total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==6829==
==6829== All heap blocks were freed -- no leaks are possible
==6829==
==6829== For lists of detected and suppressed errors, rerun with: -s
==6829== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==6906== Memcheck, a memory error detector
==6906== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==6906== Using Valgrind-3.26.0 and LibVEX; rerun with -h for copyright info
==6906== Command: ./zig-out/bin/roc --no-cache test/str/app.roc
==6906==
valgrind: Fatal error at startup: a function redirection
valgrind: which is mandatory for this platform-tool combination
valgrind: cannot be set up. Details of the redirection are:
valgrind:
valgrind: A must-be-redirected function
valgrind: whose name matches the pattern: memcmp
valgrind: in an object with soname matching: ld-linux-x86-64.so.2
valgrind: was not found whilst processing
valgrind: symbols from the object with soname: ld-linux-x86-64.so.2
valgrind:
valgrind: Possible fixes: (1, short term): install glibc's debuginfo
valgrind: package on this machine. (2, longer term): ask the packagers
valgrind: for your Linux distribution to please in future ship a non-
valgrind: stripped ld.so (or whatever the dynamic linker .so is called)
valgrind: that exports the above-named function using the standard
valgrind: calling conventions for this platform. The package you need
valgrind: to install for fix (1) is called
valgrind:
valgrind: On Debian, Ubuntu: libc6-dbg
valgrind: On SuSE, openSuSE, Fedora, RHEL: glibc-debuginfo
valgrind:
valgrind: Note that if you are debugging a 32 bit process on a
valgrind: 64 bit system, you will need a corresponding 32 bit debuginfo
valgrind: package (e.g. libc6-dbg:i386).
valgrind:
valgrind: Cannot continue -- exiting now. Sorry
Looks like it's fixed :)
oh yeah Claude had a suggested fix in a recent PR, and it ended up resolving it
We switched internet providers today and internet is not working :sweat_smile: the benchmarks workflow and the old workflows will not pass until it is fixed.
CI machines are back online
Lots of CI workflows remain queued, waiting on a github hosted runner. The github status page does not show any issues. Perhaps we exceeded some quota?
Looks like it's moving along now, there is probably a limited number of github runners that we can use simultaneously.
Yeah things seem to be running really slowly compared to what I remember.
I also spotted some recent significant slowdowns with roc check and build when running basic-cli all_tests.sh (branch migrate-zig-compiler-edits). I have not looked into it deeply but it's on my TODO list.
recent as in after the MIR rewrite landed?
I checked right before MIR rewrite, that was fast, let me check right after...
yeah I added a ton of debug-only checks and lints and stuff in there, so I wouldn't be surprised if that slowed down ci
benchmark ci step passed, so if release builds got significantly slower, it would be for a scenario we don't have benchmarks for
Oh ok, let me check with a release build
Yeah, still a significant slowdown with release, I will dig into it sometime:
Commit 0f56082:
=== Checking examples ===
Checking: command-line-args.roc
No errors found in 104ms for examples/command-line-args.roc
Checking: hello-world.roc
No errors found in 100ms for examples/hello-world.roc
Checking: stdin-basic.roc
No errors found in 95ms for examples/stdin-basic.roc
Checking: path.roc
No errors found in 102ms for examples/path.roc
Checking: command.roc
No errors found in 94ms for examples/command.roc
Checking: time.roc
No errors found in 100ms for examples/time.roc
Checking: random.roc
No errors found in 98ms for examples/random.roc
Checking: locale.roc
No errors found in 89ms for examples/locale.roc
Checking: tty.roc
No errors found in 104ms for examples/tty.roc
Checking: dir.roc
No errors found in 104ms for examples/dir.roc
Checking: env-var.roc
No errors found in 97ms for examples/env-var.roc
Latest main:
Checking: command-line-args.roc
No errors found in 270ms for examples/command-line-args.roc
Checking: hello-world.roc
No errors found in 263ms for examples/hello-world.roc
Checking: stdin-basic.roc
No errors found in 287ms for examples/stdin-basic.roc
Checking: path.roc
No errors found in 265ms for examples/path.roc
Checking: command.roc
No errors found in 308ms for examples/command.roc
Checking: time.roc
No errors found in 280ms for examples/time.roc
Checking: random.roc
No errors found in 276ms for examples/random.roc
Checking: locale.roc
No errors found in 278ms for examples/locale.roc
Checking: tty.roc
No errors found in 276ms for examples/tty.roc
Checking: dir.roc
No errors found in 311ms for examples/dir.roc
Checking: env-var.roc
No errors found in 283ms for examples/env-var.roc
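For a rough sense of scale, here is a minimal sketch that averages the two sets of timings above and reports the ratio. The numbers are copied from the output above; the variable and function names are just for illustration.

```python
# Timings in milliseconds, copied from the check output above.
before_ms = [104, 100, 95, 102, 94, 100, 98, 89, 104, 104, 97]      # commit 0f56082
after_ms = [270, 263, 287, 265, 308, 280, 276, 278, 276, 311, 283]  # latest main


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Ratio of the average check time on latest main to the average at 0f56082.
slowdown = mean(after_ms) / mean(before_ms)

print(f"before: {mean(before_ms):.1f} ms, after: {mean(after_ms):.1f} ms, "
      f"slowdown: {slowdown:.2f}x")
# prints "before: 98.8 ms, after: 281.5 ms, slowdown: 2.85x"
```

This is a single run per example, so it only shows the size of the gap, not its variance.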
CI related message: I have a fix for the flaky macos x86 issue "Coordinator stuck" that I am testing now:
https://github.com/roc-lang/roc/pull/9439/changes/1b95f9f4b6c87d211b2550c49c1112b5a51dc4d6
:rocket: :rocket: :rocket:
Benchmarks CI did not catch the slowdown after the MIR rewrite because it uses roc run everywhere, which is currently set to single-threaded mode. The slowdown was only visible when running multi-threaded, as we do for roc check and build.
whoooa!
what leads to the massive speedup?
The whole breakdown is here: https://github.com/roc-lang/roc/pull/9453
Do we want to keep roc run single threaded?
https://github.com/roc-lang/roc/blob/06d3d901c011a2b55fae84b182deb5a5f722343c/src/cli/main.zig#L1963
nah
I'm finding a few pathologically slow tests in our CI ... I'm thinking of skipping these and pointing them at GH Issue tasks to investigate and fix. From looking into a few so far, I get the feeling they will be simple changes to re-use work or caches properly, and will cut orders of magnitude off the runtime.
Here's a sneak peek at the generated "dashboard" from the new minici :smiley:
yeah it's actually awesome... I've already shaved 20+ minutes off the CI runtime just picking off low hanging fruit
OK, I think this is ready to land... https://github.com/roc-lang/roc/pull/9501
There is a lot more we can do, but we can follow up with those changes.
I really leaned into the naming convention we have been using, so build-* steps do the building and prep work, and run-* or check-* steps actually execute things. zig build minici is the only exception now: it does build-ci and then orchestrates running all the leaf steps.
How would we feel about pulling the valgrind and llvm tests out of every PR CI run, and having those run only as a follow-up to things landing in main? Now that we are doing less backend development, I feel like the risk of us breaking these is much lower. Almost all of our work recently has been in stages of the pipeline earlier than llvm, and I think valgrind was most helpful while we were working on the serialisation side of things.
I think it would be even better if we could skip valgrind based on certain files/directories not being touched, e.g. if the only changes are to parsing/canonicalization/type-checking/reporting/repl/etc, and there's no change to anything after type-checking up through code gen, then the odds of valgrind failing are so small that yeah, just checking periodically on main seems fine
and similarly if we have no changes to llvm code gen, then yeah there's probably no need to run those tests except periodically on main, since any bugs should in theory be caught by one of the other backends
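A minimal sketch of that path-based gating, assuming the list of changed files is already available (for example from a git diff against main). The directory prefixes and function names below are hypothetical placeholders, not the real repository layout.

```python
# HYPOTHETICAL prefixes for illustration only; the real layout may differ.
FRONTEND_PREFIXES = (
    "src/parse/",
    "src/canonicalize/",
    "src/check/",
    "src/reporting/",
    "src/repl/",
)
LLVM_PREFIXES = ("src/backend/llvm/",)


def needs_valgrind(changed_files):
    """True if any changed file lies outside the front-end-only prefixes."""
    return any(not path.startswith(FRONTEND_PREFIXES) for path in changed_files)


def needs_llvm_tests(changed_files):
    """True if any changed file touches the (hypothetical) LLVM code gen paths."""
    return any(path.startswith(LLVM_PREFIXES) for path in changed_files)


if __name__ == "__main__":
    changed = ["src/parse/tokenize.zig", "src/reporting/render.zig"]
    print("run valgrind:", needs_valgrind(changed))
    print("run llvm tests:", needs_llvm_tests(changed))
```

The valgrind check is written as an allow-list, so a file in an unrecognised directory still triggers the run. That seems like the safer default, with the periodic run on main as the backstop.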
I think I've almost fixed all the current broken things in our existing CI (post zig 16 merge). I hope to push that work up in the next hour or so. I don't want to touch the above valgrind or llvm right now, I think we can do that in a follow up. I'm sure @Anton will have ideas for how to do that properly.
Probably very similar to ci_zig_nix.yml, I have added it to my TODO list.
I am going to look at the windows 11 arm64 CI failure now.
Luke's big improvement to the testing workflow just landed :tada:
If you had a PR up, I recommend that you check carefully if a test that you added should be moved to a different file.
I also did some test rearranging stuff which landed recently, but I think it made it in first :fingers_crossed:
The valgrind tests are re-enabled and working again :)
awesome, thank you so much for getting them fixed!!!
Ok hello_world.zig does not work on windows 11 arm64 with zig 16, this is issue https://codeberg.org/ziglang/zig/issues/31865. Claude did a whole bunch of involved debugging before I decided to see if hello world even runs :p
Maybe we should use zig master which has the fix, I will see tomorrow...
@Anton can you please make a tracking issue for that zig windows arm issue... I'm not sure how it breaks Roc specifically
I made #9598 for the issue, it is not breaking Roc specifically, hello world (written in zig) does not work on win 11 arm64. Given how involved our zig 16 upgrade was, I'm thinking we just keep win 11 arm64 CI disabled until the next release instead of upgrading to master now.
CI has a lot of work backed up ... it would be good to land a few of these PRs in the pipeline
I wonder if we should have a "staging" branch which we create PR's against and CI runs just the lighter "check" runs against. This would mean we can have many PR's landing together and avoid all the merge conflicts etc. Presumably everyone is running zig build minici on their machines locally before pushing anyway. When I'm working on something cross-OS I typically run that on each machine also before merging the PR.
Then we could make PR's from staging into main that run the full CI suite and catch any regressions. But as it's much quicker to make PR's against staging it's not a massive job to fix those and then restart the single staging->main workflows.
At the moment we have like 10+ PR's all trying to run the full CI suite and it's very slow. So we have many PR's that will need to be restarted multiple times just from merging changes that land in main.
This would mean we only have one big CI run going at any given time instead of many
Or I guess I'm suggesting a merge queue... but I don't have much experience with using one of these
We have talked about this previously...
Luke Boswell said:
CI has a lot of work backed up ... it would be good to land a few of these PRs in the pipeline
I'm landing as fast as I can, but they have to actually pass ci first :sweat_smile:
Sorry I didn't mean that as criticism ... more an expression of "is there a better way to do this?"
oh sure haha
I honestly kinda wonder if maybe just investing in more CI machines is the answer
I have not kept up with cloud pricing but I suspect that could be quite expensive, especially because we will likely be producing more and more PRs in the future.
I am willing to try out a merge queue.
Maybe Claude can even bundle PRs that are unlikely to interfere with each other and run CI on them once.
By the way, we are only using a self-hosted server for the zig-benchmarks job, all the rest is done on github servers.
Anton said:
I have not kept up with cloud pricing but I suspect that could be quite expensive, especially because we will likely be producing more and more PRs in the future.
Like in the last 90 days, we used 10428 hours of CI time.
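As a back-of-the-envelope check on that figure, here is a small sketch that converts it into a daily average and an equivalent number of always-busy runners. It assumes the 10428 hours are total runner-hours summed across jobs, which the message above does not state explicitly.

```python
# Figures from the message above.
ci_hours = 10428   # CI time used in the window
days = 90          # length of the window in days

# Average CI hours consumed per calendar day.
hours_per_day = ci_hours / days

# Equivalent number of runners busy 24 hours a day
# (ASSUMPTION: ci_hours counts runner-hours across all jobs).
avg_concurrent = hours_per_day / 24

print(f"{hours_per_day:.1f} CI hours per day, "
      f"about {avg_concurrent:.1f} runners busy around the clock")
# prints "115.9 CI hours per day, about 4.8 runners busy around the clock"
```

Actual load is bursty, so peak demand will be well above this average.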
Last updated: Jun 16 2026 at 16:19 UTC