Continuing from @Brendan Hansknecht
long_15 is for roc build on this code:
app [main!] { cli: platform "https://github.com/roc-lang/basic-cli/releases/download/0.19.0/Hj-J_zxz7V9YurCSTFcFdu6cQJie4guzsPMUi5kBYUk.tar.br" }
import cli.Stdout
import cli.Arg exposing [Arg]
main! : List Arg => Result {} _
main! = |_args|
Stdout.line!("1")?
Stdout.line!("2")?
Stdout.line!("3")?
Stdout.line!("4")?
Stdout.line!("5")?
Stdout.line!("6")?
Stdout.line!("7")?
Stdout.line!("8")?
Stdout.line!("9")?
Stdout.line!("10")?
Stdout.line!("11")?
Stdout.line!("12")?
Stdout.line!("13")?
Stdout.line!("14")?
Stdout.line!("15")
20 is the same thing with 5 more Stdout.line! calls. The flamegraph for 15 looks normal but for 20 it looks off. It spends a lot of time in handle_mm_fault though, so that may be what is going wrong.
flamegraph_long_15.svg
flamegraph_long_20.svg
The potential call explosion does seem clear from the 15 flamegraph.
Based on the heaptrack data, deep recursion looks likely
heaptrack_15vs20.png
I'll see if ftrace can give me some nice stats
uftrace tells a very different and more plausible story than the flamegraph:
# uftrace diff
# [0] base: uftrace.data (from uftrace record --filter roc_* ./target/debug/roc build temp_long_15.roc)
# [1] diff: uftrace20.data/ (from uftrace record --filter roc_* ./target/debug/roc build temp_long_20.roc)
#
Total time Self time Calls Function
=========== =========== =========== ====================
+13.007 m +0.084 us +0 roc_build::program::build_file
+13.007 m +0.497 us +0 roc_cli::build
+13.007 m +14.717 us +0 roc::main
+13.007 m +0.385 us +0 roc_load::load
+13.007 m +0.218 us +0 roc_load::load_and_monomorphize
+13.007 m +23.707 us +0 roc_load_internal::file::load_single_threaded
+13.007 m -1.656 us +0 roc_load_internal::file::load
+13.007 m +39.427 us +0 roc_load_internal::file::state_thread_step
+13.007 m +101.288 us +0 roc_load_internal::file::update
+13.007 m +2.436 us +0 roc_mono::inc_dec::insert_inc_dec_operations
+13.007 m +4.556 us +0 roc_mono::inc_dec::insert_inc_dec_operations_proc
+13.007 m +10.585 s +4571136 roc_mono::inc_dec::insert_refcount_operations_stmt
+7.007 m +2.716 s +11681792 roc_mono::inc_dec::insert_refcount_operations_stmt::_{{closure}}
+5.042 m +16.057 s +65704457 core::hash::BuildHasher::hash_one
+5.015 m +4.405 s +67737523 core::hash::impls::_<impl core::hash::Hash for &T>::hash
+4.053 m +190.553 ms +2031656 _<core::iter::adapters::rev::Rev<I>>::fold
+4.053 m +1.089 s +2031656 core::iter::traits::double_ended::DoubleEndedIterator::rfold
+4.048 m +10.363 s +65700970 _<roc_module::symbol::Symbol>::hash
+4.027 m +59.520 s +135469517 _<wyhash::traits::WyHash>::write
+4.026 m +8.766 s +131405247 core::hash::Hasher::write_u32
+3.027 m +9.369 s +5079040 roc_mono::inc_dec::insert_refcount_operations_binding
+3.013 m +4.815 s +27426891 hashbrown::map::HashMap<K,V,S,A>::contains_key
+2.048 m +1.880 s +11681792 roc_mono::inc_dec::SymbolRcTypes::get
+2.046 m +3.340 s +14222412 hashbrown::map::HashMap<K,V,S,A>::insert
+2.046 m +1.857 s +23363644 std::collections::hash::set::HashSet<T,S>::contains
+2.022 m +5.588 s +65704262 core::hash::impls::_<impl core::hash::Hash for u32>::hash
+2.015 m +5.469 s +65700985 _<core::num::nonzero::NonZero<u32>>::hash
+2.010 m +234.683 ms +1524052 bumpalo::collections::vec::Vec<T>::from_iter_in
+2.009 m +831.661 ms +1524247 _<bumpalo::collections::vec::Vec<T>>::extend
+2.004 m +1.001 m +135469517 wyhash::functions::wyhash_core
+2.002 m +714.215 ms +9651156 std::collections::hash::map::HashMap<K,V,S>::insert
+1.059 m +337.363 ms +2038914 _<core::iter::adapters::map::Map<I,F>>::next
+1.059 m +8.820 s +51200480 _<hashbrown::map::HashMap<K,V,S,A>>::clone
+1.059 m +117.193 ms +1020568 core::ops::function::impls::_<impl core::ops::function::FnOnce<A> for &mut F>::call_once
+1.057 m +393.806 ms +5586944 roc_mono::inc_dec::RefcountEnvironment::add_symbol
+1.056 m +607.791 ms +5586944 roc_mono::inc_dec::RefcountEnvironment::add_symbol_with
+1.049 m +8.621 s +51200480 _<hashbrown::raw::RawTable<T,A>>::clone
+1.047 m +102.891 ms +1015828 bumpalo::collections::collect_in::CollectIn::collect_in
+1.047 m +83.509 ms +1015828 _<bumpalo::collections::vec::Vec<T>>::from_iter_in
+1.045 m +547.355 ms +5588649 core::iter::traits::iterator::Iterator::collect
+1.044 m +1.318 s +5586949 _<std::collections::hash::set::HashSet<T,S>>::from_iter
+1.043 m +1.095 s +7110656 roc_mono::inc_dec::RefcountEnvironment::get_symbol_rc_type
+1.043 m +1.366 s +5586949 _<hashbrown::set::HashSet<T,S,A>>::extend
+1.041 m +1.801 s +5586964 _<hashbrown::map::HashMap<K,V,S,A>>::extend
+1.039 m +590.532 ms +6096312 _<core::iter::adapters::map::Map<I,F>>::fold
+1.036 m +532.162 ms +6603467 core::iter::traits::iterator::Iterator::for_each
+1.030 m +2.694 s +14222412 hashbrown::raw::RawTable<T,A>::find_or_find_insert_slot
+1.024 m +4.486 s +54755838 _<hashbrown::raw::RawTable<T,A>>::drop
+1.022 m +412.975 ms +2539520 _<roc_mono::inc_dec::RefcountEnvironment>::clone
+1.022 m +21.982 s +44698317 hashbrown::raw::RawTable<T,A>::find
+1.022 m +401.394 ms +5079495 _<std::collections::hash::map::HashMap<K,V,S>>::clone
+1.020 m +705.180 ms +9144322 std::collections::hash::map::HashMap<K,V,S>::get
+1.019 m +6.564 s +54755838 hashbrown::raw::RawTableInner::drop_inner_table
+1.019 m +1.780 s +9144437 hashbrown::map::HashMap<K,V,S,A>::get
+1.015 m +1.023 s +28901461 _<hashbrown::raw::RawTable<T,A>>::clone_from_spec
+1.014 m +8.608 s +2539575 hashbrown::raw::RawTable<T,A>::clone_from_impl
+1.013 m +25.222 s +270939034 _<core::slice::iter::Chunks<T>>::next
+1.010 m +438.165 ms +5079040 roc_mono::inc_dec::RefcountEnvironment::consume_rc_symbol
+1.010 m +2.412 s +36570443 hashbrown::raw::RawTable<T,A>::get
+1.008 m +443.216 ms +4063252 _<core::iter::adapters::filter::Filter<I,P>>::fold
+1.005 m +943.156 ms +3047424 roc_mono::inc_dec::RefcountEnvironment::borrowed_usages
+1.004 m +475.935 ms +2539520 core::ptr::drop_in_place<roc_mono::inc_dec::RefcountEnvironment>
+1.003 m +374.460 ms +2031616 roc_mono::inc_dec::RefcountEnvironment::consume_symbol
+1.002 m +1.004 s +21333133 hashbrown::raw::RawTable<T,A>::reserve
+1.001 m +467.025 ms +3555368 core::iter::adapters::filter::filter_fold::_{{closure}}
+1.001 m +5.078 s +8175857 hashbrown::raw::RawTable<T,A>::reserve_rehash
+1.000 m +268.364 ms +2539520 core::ptr::drop_in_place<std::collections::hash::map::HashMap<roc_mono::ir::JoinPointId,std::collections::hash::set::HashSet<roc_module::symbol::Symbol,core::hash::BuildHasherDefault<wyhash::traits::WyHash>>,core::hash::BuildHasherDefault<wyhash::traits::WyHash>>>
+1.000 m +234.728 ms +2539520 core::ptr::drop_in_place<hashbrown::map::HashMap<roc_mono::ir::JoinPointId,std::collections::hash::set::HashSet<roc_module::symbol::Symbol,core::hash::BuildHasherDefault<wyhash::traits::WyHash>>,core::hash::BuildHasherDefault<wyhash::traits::WyHash>>>
I'm going to see if I can reproduce this with a simple platform; basic-cli may cause a bunch of distractions when debugging.
Reproduction with simple platform :tada:
The list argument is key, even though it is unused! It does not happen with myargs = 5 or {}, but it does also happen with "".
app [main!] { pf: platform "../test-platform-effects-zig/main.roc" }
import pf.Effect
main! : {} => {}
main! = |{}|
    myargs = []
    when wrapper!(myargs) is
        Ok(he) -> Effect.put_line!(Inspect.to_str(he))
        Err(e) -> Effect.put_line!(Inspect.to_str(e))

wrapper! = |_args|
    Effect.foo_line!("1")?
    Effect.foo_line!("2")?
    Effect.foo_line!("3")?
    Effect.foo_line!("4")?
    Effect.foo_line!("5")?
    Effect.foo_line!("6")?
    Effect.foo_line!("7")?
    Effect.foo_line!("8")?
    Effect.foo_line!("9")?
    Effect.foo_line!("10")?
    Effect.foo_line!("11")?
    Effect.foo_line!("12")?
    Effect.foo_line!("13")?
    Effect.foo_line!("14")?
    Effect.foo_line!("15")?
    Effect.foo_line!("16")?
    Effect.foo_line!("17")
Using myargs = [] instead of myargs = 0 leads to 73628 more calls of roc_mono::inc_dec::insert_refcount_operations_stmt
Wow....crazy :rolling_on_the_floor_laughing:
I wonder what we are scanning over that explodes like that
Or recursing over
is this still doing the closure thing?
where Effect is a continuation monad
if so, that's probably why
Ah yeah, I thought it didn't have closures, but it probably has repeated calls to Result.try
Otherwise, it is a gigantic nested when is
Ayaz Hafiz said:
is this still doing the closure thing?
Yeah it's using closures, but do you know why that would result in a massive difference depending on the type of myargs?
Lists are refcounted
Ints and records are not
And all values are being captured by each nested closure
Still really should not explode like that....feels exponential...probably a bad algorithm.
Thanks Brendan :)
On a further simplified example I did find it was repeating the same lengthy "recursive walk" multiple times, so memoization is possible
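(As an aside, a minimal Rust sketch of what memoizing such a recursive walk can look like; the node ids, the Analysis type, and the tree shape are invented for illustration, this is not the actual roc_mono code:)
use std::collections::HashMap;

// Hypothetical analysis result; not a real roc_mono data structure.
#[derive(Clone)]
struct Analysis {
    visited: Vec<u32>,
}

// Walk a tree of nodes, caching the result per node so the same subtree
// is never walked twice even if it is reached along many paths.
fn analyze(
    node: u32,
    children: &HashMap<u32, Vec<u32>>,
    cache: &mut HashMap<u32, Analysis>,
) -> Analysis {
    if let Some(cached) = cache.get(&node) {
        return cached.clone();
    }
    let mut visited = vec![node];
    for &child in children.get(&node).into_iter().flatten() {
        visited.extend(analyze(child, children, cache).visited);
    }
    let result = Analysis { visited };
    cache.insert(node, result.clone());
    result
}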
Ah, makes sense
I know how I want to solve this, I'm currently entangled in a web of algorithmic and lifetime complexity :web: :smile:
The optimizations I tried were either not significant or resulted in wrong reference counts when running all tests. Let's make sure to write unit tests for complicated functions in the new compiler :p
I'm going to give it one more day to see if I can skip work for unused variables.
Did you try changing the desugaring to use when is instead of result.try (assuming it is currently using result.try)?
No, not yet, that's a good idea. I'll explore that for a bit.
Turns out they are actually using when. It goes: TrySuffix > LowLevelTry > Try > When. Earlier I was looking at the IR after specialization and it wasn't clear to me that the problematic IR came from when expressions.
I like the overview we get of the output of all phases with the snapshots in the new compiler; it requires substantial effort to get that output for some phases from the old compiler.
Interesting.... So this isn't from nested closures
it's due to nested joins:
procedure : `#UserApp.wrapper!` [C {}, C {}]
procedure = `#UserApp.wrapper!` (`#UserApp.7`: List []):
    joinpoint `#UserApp.70` `#UserApp.66`:
        joinpoint `#UserApp.58` `#UserApp.54`:
            joinpoint `#UserApp.46` `#UserApp.42`:
                joinpoint `#UserApp.34` `#UserApp.30`:
                    joinpoint `#UserApp.22` `#UserApp.18`:
                        let `#UserApp.17` : Str = "6";
                        let `#UserApp.16` : [C {}, C {}] = CallByName `pf.Effect.foo_line!` `#UserApp.17`;
                        ret `#UserApp.16`;
                    in
                    let `#UserApp.29` : Str = "5";
                    let `#UserApp.21` : [C {}, C {}] = CallByName `pf.Effect.foo_line!` `#UserApp.29`;
                    let `#UserApp.26` : U8 = 1i64;
                    let `#UserApp.27` : U8 = GetTagId `#UserApp.21`;
                    let `#UserApp.28` : Int1 = lowlevel Eq `#UserApp.26` `#UserApp.27`;
                    if `#UserApp.28` then
                        let `#UserApp.19` : {} = UnionAtIndex (Id 1) (Index 0) `#UserApp.21`;
                        jump `#UserApp.22` `#UserApp.19`;
                    else
                        let `#UserApp.20` : {} = UnionAtIndex (Id 0) (Index 0) `#UserApp.21`;
                        let `#UserApp.25` : [C {}, C {}] = TagId(0) `#UserApp.20`;
                        ret `#UserApp.25`;
                in
                let `#UserApp.41` : Str = "4";
                let `#UserApp.33` : [C {}, C {}] = CallByName `pf.Effect.foo_line!` `#UserApp.41`;
                let `#UserApp.38` : U8 = 1i64;
                let `#UserApp.39` : U8 = GetTagId `#UserApp.33`;
                let `#UserApp.40` : Int1 = lowlevel Eq `#UserApp.38` `#UserApp.39`;
                if `#UserApp.40` then
                    let `#UserApp.31` : {} = UnionAtIndex (Id 1) (Index 0) `#UserApp.33`;
                    jump `#UserApp.34` `#UserApp.31`;
                else
                    let `#UserApp.32` : {} = UnionAtIndex (Id 0) (Index 0) `#UserApp.33`;
                    let `#UserApp.37` : [C {}, C {}] = TagId(0) `#UserApp.32`;
                    ret `#UserApp.37`;
            in
            let `#UserApp.53` : Str = "3";
            let `#UserApp.45` : [C {}, C {}] = CallByName `pf.Effect.foo_line!` `#UserApp.53`;
            let `#UserApp.50` : U8 = 1i64;
            let `#UserApp.51` : U8 = GetTagId `#UserApp.45`;
            let `#UserApp.52` : Int1 = lowlevel Eq `#UserApp.50` `#UserApp.51`;
            if `#UserApp.52` then
                let `#UserApp.43` : {} = UnionAtIndex (Id 1) (Index 0) `#UserApp.45`;
                jump `#UserApp.46` `#UserApp.43`;
            else
                let `#UserApp.44` : {} = UnionAtIndex (Id 0) (Index 0) `#UserApp.45`;
                let `#UserApp.49` : [C {}, C {}] = TagId(0) `#UserApp.44`;
                ret `#UserApp.49`;
        in
        let `#UserApp.65` : Str = "2";
        let `#UserApp.57` : [C {}, C {}] = CallByName `pf.Effect.foo_line!` `#UserApp.65`;
        let `#UserApp.62` : U8 = 1i64;
        let `#UserApp.63` : U8 = GetTagId `#UserApp.57`;
        let `#UserApp.64` : Int1 = lowlevel Eq `#UserApp.62` `#UserApp.63`;
        if `#UserApp.64` then
            let `#UserApp.55` : {} = UnionAtIndex (Id 1) (Index 0) `#UserApp.57`;
            jump `#UserApp.58` `#UserApp.55`;
        else
            let `#UserApp.56` : {} = UnionAtIndex (Id 0) (Index 0) `#UserApp.57`;
            let `#UserApp.61` : [C {}, C {}] = TagId(0) `#UserApp.56`;
            ret `#UserApp.61`;
    in
    let `#UserApp.77` : Str = "1";
    let `#UserApp.69` : [C {}, C {}] = CallByName `pf.Effect.foo_line!` `#UserApp.77`;
    let `#UserApp.74` : U8 = 1i64;
    let `#UserApp.75` : U8 = GetTagId `#UserApp.69`;
    let `#UserApp.76` : Int1 = lowlevel Eq `#UserApp.74` `#UserApp.75`;
    if `#UserApp.76` then
        let `#UserApp.67` : {} = UnionAtIndex (Id 1) (Index 0) `#UserApp.69`;
        jump `#UserApp.70` `#UserApp.67`;
    else
        let `#UserApp.68` : {} = UnionAtIndex (Id 0) (Index 0) `#UserApp.69`;
        let `#UserApp.73` : [C {}, C {}] = TagId(0) `#UserApp.68`;
        ret `#UserApp.73`;
It's this loop specifically.
Alright, I was able to hack together a fix for my specific problematic file as proof of concept :)
I should be able to turn this proof of concept into a proper fix :fingers_crossed:
Congrats @Anton :sparkler:
what was the issue?
This loop does expensive fixed point iteration even when all variables that could potentially be reference counted are unused.
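(Roughly the shape of the skip, as a hedged Rust sketch; the names and types are invented and this is not the code from the actual fix:)
use std::collections::HashSet;

// Sketch: a join point's parameters only matter to refcounting when at least
// one of them is a refcounted value that is actually used. If none are, the
// expensive fixed-point iteration over the join body can be skipped, since
// repeated passes cannot change the (empty) result.
fn live_join_params(
    params: &[u32],
    refcounted: &HashSet<u32>,
    used: &HashSet<u32>,
    analyze_body: impl Fn(&HashSet<u32>) -> HashSet<u32>,
) -> HashSet<u32> {
    let worth_tracking = params
        .iter()
        .any(|p| refcounted.contains(p) && used.contains(p));
    if !worth_tracking {
        return HashSet::new(); // early out: no fixed point needed
    }

    // Otherwise iterate until the set of live parameters stabilizes.
    let mut live = HashSet::new();
    loop {
        let next = analyze_body(&live);
        if next == live {
            return live;
        }
        live = next;
    }
}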
I made a tool so that I can understand how the recursion iterates over the code :)
debug_tool.mp4
This is awesome. I want to know how you built it!
Here is the source code index.html
I used the regular Claude web interface to start; for js you can have a live preview on the side, so that is very useful. This was my initial back and forth with claude: https://claude.ai/share/7ff2aacf-b7a5-40ad-839c-46b0e70b1882
Claude servers were overloaded, so then I switched to the latest gemini pro 2.5 (free on https://aistudio.google.com). I don't think it has the same live js preview, so I had to do a bunch of manual copy, paste, and run. In between I did some good old-fashioned human brain debugging and that's it :)
From several past experiments, I've found vanilla js to improve things a lot. Models can have a difficult time fixing React errors, but they often default to React if you don't specify what to use.
I've had good experiences using Elm. Though I haven't built anything substantial with it, mostly just smaller demo things.
Good to know, I should give roc htmx a try as well :)
status update: I'm trying to achieve a more ambitious improvement than earlier; doing less work for unused variables is pretty easy. I'm now also trying to prevent work for vars after their last use. There is a way to do this with a lot of reasonably simple code changes, but I'm trying to see if I can make it work with far fewer code changes.
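(A common way to find "after their last use" is a single reverse pass over the statements; a small Rust sketch with invented types, just to illustrate the idea rather than the actual implementation:)
use std::collections::HashMap;

// Sketch: given, per statement, the symbols it uses, walk the statements in
// reverse and record the index of each symbol's last use. Statements after
// that index no longer need to track the symbol.
fn last_uses(stmt_uses: &[Vec<u32>]) -> HashMap<u32, usize> {
    let mut last_use: HashMap<u32, usize> = HashMap::new();
    for (idx, uses) in stmt_uses.iter().enumerate().rev() {
        for &symbol in uses {
            // Walking backwards, the first time we see a symbol is its last
            // use, so don't overwrite it with earlier occurrences.
            last_use.entry(symbol).or_insert(idx);
        }
    }
    last_use
}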
This nested joinpoint case is very common in current Roc and blows up 2^n so it's worth spending some time on in my opinion.
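(To make the 2^n concrete with a toy model: if each level of join nesting caused its nested body to be analyzed twice, say once per fixed-point pass — an assumption for illustration, not the measured constant — the innermost statement would be visited about 2^15 ≈ 33k times for the 15-call program versus 2^20 ≈ 1M times for the 20-call one:)
// Toy model of the blowup: visits to the innermost statement double with
// every extra level of nesting.
fn innermost_visits(nesting_depth: u32) -> u64 {
    2u64.pow(nesting_depth)
}

fn main() {
    for depth in [15u32, 20] {
        println!("{depth} nested joins -> {} visits", innermost_visits(depth));
    }
}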
I believe we still plan on doing Perceus refcounting in the new compiler @Richard Feldman?
for sure!
It is done :tada: :fireworks: :partying_face: PR#7884
fix_comparison_plot.png
wooooooooow, amazing @Anton!!!
great job @Anton
Wait....but I thought the goal of all graphs was to go up and to the right.....
:grinning_face_with_smiling_eyes: Congrats on fixing this @Anton
Brendan Hansknecht said:
Wait....but I thought the goal of all graphs was to go up and to the right.....
I should have plotted the speedup % :p