Stream: show and tell

Topic: Go Platform


view this post on Zulip Oskar Hahn (Jan 07 2024 at 12:29):

The stream "show and tell" does not really fit. This topic is more "show and ask for help" :wink:

I would really like to call roc from go. But I have no experience with C, the C ABI, cgo or anything related. Finally, I managed to run the "platform-switching" example with a very simple go-platform.

https://github.com/roc-lang/roc/compare/main...ostcar:go-platform

There is some unpleasantness: when I call roc build, I get this error:

🔨 Rebuilding platform...
An internal compiler expectation was broken.
This is definitely a compiler bug.
Please file an issue here: https://github.com/roc-lang/roc/issues/new/choose
thread '<unnamed>' panicked at 'failed to open file "go-platform/dynhost": No such file or directory (os error 2)', crates/linker/src/lib.rs:590:29
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'Failed to (re)build platform.: Any { .. }', crates/compiler/build/src/program.rs:976:46

This can be worked around by calling roc build --no-link, which creates a main.o file. You have to move it into the go-platform folder. I was not able to build go with the main.go file from another location.

After this, it is possible to call go run main.go to run the go-platform, or go build to create an executable.

Would it be possible to build the go-platform with roc build?

I did not write any "roc_alloc and friends" code. I think no allocations are needed for this simple example, since it only contains a constant string. But it still feels like a big step (with many even bigger steps to come).

Should I create a PR for the platform-switching-example? It is not much, but maybe a starting point for something more.

view this post on Zulip Brendan Hansknecht (Jan 07 2024 at 15:39):

Oskar Hahn said:

Would it be possible to build the go-platform with roc build?

The plan is to rip all that special code out anyway. We want platforms to control their own build and to tell roc what they need it to generate.

view this post on Zulip Brendan Hansknecht (Jan 07 2024 at 15:40):

Roc may deal with the final linking (eventually only surgically), but we don't want it to know about and call every toolchain under the sun.

view this post on Zulip Brendan Hansknecht (Jan 07 2024 at 15:41):

So you could add code for go into roc, but it would be short lived.

view this post on Zulip Brendan Hansknecht (Jan 07 2024 at 15:42):

On a more general note, roc and go may not really be a good match. Go really dislikes interacting with C. It is horribly slow.

view this post on Zulip Anton (Jan 08 2024 at 09:55):

Should I create a PR for the platform-switching-example?

The examples repo is a better fit; we've been wanting to move most "other language" examples there.

view this post on Zulip Oskar Hahn (Jan 13 2024 at 18:44):

I created a PR for the example repo: https://github.com/roc-lang/examples/pull/152

I have no experience with cgo, so I cannot tell how slow "slow" is. But this article comes to the conclusion that the overhead of cgo is similar to two mutex operations. I think this is ok for many use cases.

view this post on Zulip Brendan Hansknecht (Jan 13 2024 at 19:04):

My similar benchmarks are 17x faster than what Cockroach labs saw in 2015

haha. Glad it got better

view this post on Zulip Brendan Hansknecht (Jan 13 2024 at 19:05):

Imagine if instead it was just as slow as 35 mutex operations :face_palm:

view this post on Zulip Brendan Hansknecht (Jan 13 2024 at 19:06):

Probably would matter for platform design. Need to make sure roc is doing enough work that it is worth calling back and forth.

For a task heavy workflow, that could be super expensive.
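The trade-off described here can be put into numbers with a small back-of-the-envelope model. This is only a sketch: the function names are hypothetical, and the 40 ns crossing cost and 5% budget are assumed values chosen for illustration, not measurements.

```rust
/// Fraction of total time spent crossing the language boundary,
/// given the per-call crossing cost and the useful work done per call.
fn overhead_fraction(crossing_ns: f64, work_ns: f64) -> f64 {
    crossing_ns / (crossing_ns + work_ns)
}

/// Minimum useful work per call (in ns) needed to keep the crossing
/// overhead at or below `budget`, where `budget` is a fraction in (0, 1).
fn min_work_ns(crossing_ns: f64, budget: f64) -> f64 {
    crossing_ns * (1.0 - budget) / budget
}

fn main() {
    // Assumed values for illustration only.
    let crossing_ns = 40.0;
    let budget = 0.05;

    let needed = min_work_ns(crossing_ns, budget);
    println!(
        "to stay under {:.0}% overhead, each call needs about {:.0} ns of work",
        budget * 100.0,
        needed
    );
    println!(
        "overhead at that point: {:.1}%",
        overhead_fraction(crossing_ns, needed) * 100.0
    );
}
```

Under these assumptions, short calls such as individual allocations are dominated by the crossing cost, while calls that do hundreds of nanoseconds of work or more amortize it well.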

view this post on Zulip Oskar Hahn (Jan 13 2024 at 19:08):

It would also be expensive when there are a lot of calls to roc_alloc, roc_realloc and roc_dealloc. But it would still be faster than most IO calls.

view this post on Zulip Brendan Hansknecht (Jan 13 2024 at 19:10):

true....yeah. not great with roc's base design

view this post on Zulip Oskar Hahn (Jan 13 2024 at 19:11):

It would be nice if roc allocated bigger chunks at once. I guess that would also be good for other platforms.

view this post on Zulip Brendan Hansknecht (Jan 13 2024 at 19:12):

that is a decision we leave to the platform. They can group and chunk allocations as they want. For example using an arena. Roc is just a consumer of what the platform picks.
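As an illustration of the arena idea, here is a minimal bump-arena sketch. It is not taken from any existing Roc platform: `BumpArena` and its methods are hypothetical names, and a real platform would still have to expose this through its roc_alloc and roc_dealloc functions, which is not shown here.

```rust
/// A minimal bump arena: one buffer is reserved up front, and each
/// allocation only advances an offset into it.
struct BumpArena {
    buf: Vec<u8>,
    offset: usize,
}

impl BumpArena {
    fn with_capacity(capacity: usize) -> Self {
        BumpArena { buf: vec![0; capacity], offset: 0 }
    }

    /// Hand out `size` bytes aligned to `align` (a power of two),
    /// or None if the arena does not have enough room left.
    fn alloc(&mut self, size: usize, align: usize) -> Option<*mut u8> {
        assert!(align.is_power_of_two());
        let base = self.buf.as_mut_ptr() as usize;
        // Round the next free address up to the requested alignment.
        let aligned = (base + self.offset + align - 1) & !(align - 1);
        let start = aligned - base;
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        self.offset = end;
        Some(self.buf.as_mut_ptr().wrapping_add(start))
    }

    /// Number of bytes consumed so far, including alignment padding.
    fn used(&self) -> usize {
        self.offset
    }

    /// Free everything at once, e.g. at the end of a request.
    fn reset(&mut self) {
        self.offset = 0;
    }
}

fn main() {
    let mut arena = BumpArena::with_capacity(1024);
    let a = arena.alloc(24, 8).expect("arena has room");
    let b = arena.alloc(100, 16).expect("arena has room");
    println!("a = {:p}, b = {:p}, used = {} bytes", a, b, arena.used());
    arena.reset();
}
```

The point of the design is that the platform pays for one large allocation and then serves many small requests without going back to the system allocator; individual frees become no-ops and memory is reclaimed in bulk by `reset`.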

view this post on Zulip Brendan Hansknecht (Jan 13 2024 at 19:14):

If roc did its own thing, it would likely defeat some of the optimizations that the platform is doing.

view this post on Zulip Brendan Hansknecht (Jan 13 2024 at 20:15):

I took the false interpreter example. It has a long enough run time, with multiple allocations and tasks, so I thought it would be reasonable to measure the cost of a roughly 40ns delay on each effect and allocation function.

I can't use an actual sleep function cause it is too slow for a 40ns delay.
This seems to take roughly 40ns on my machine and does not get optimized away (generally it errs on the faster side in my testing):

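// Counter kept in a static so the loop has an observable side effect.
static mut I: i64 = 0;

// Busy-wait that stands in for the per-call cgo overhead.
#[inline(never)]
fn cgo_cost() {
    unsafe {
        I = 0;

        // black_box keeps the compiler from optimizing the loop away.
        while I < 40 {
            I = std::hint::black_box(I) + 1;
        }
    }
}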

Used the nqueens example cause it takes about a second to run.
With the added delay, it takes 12% longer to execute.

So a hefty but definitely manageable perf cost. Also, other applications with better allocation patterns likely will have less of a perf loss.
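As a rough sanity check on these figures, the slowdown can be turned into an implied call count. The sketch below assumes the delay is purely additive and that each delayed call costs exactly 40 ns, which the post itself describes as only approximate; the function name is hypothetical.

```rust
/// Estimate how many delayed calls would explain a given slowdown,
/// assuming each call adds a fixed delay and nothing else changes.
fn implied_calls(baseline_ns: u64, slowdown_percent: u64, delay_ns: u64) -> u64 {
    let extra_ns = baseline_ns * slowdown_percent / 100;
    extra_ns / delay_ns
}

fn main() {
    // Figures from the post: about 1 s baseline, 12% longer, about 40 ns per call.
    let calls = implied_calls(1_000_000_000, 12, 40);
    println!("implied delayed calls: {}", calls);
}
```

With those inputs the estimate comes out at about three million effect and allocation calls for the one-second run.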

view this post on Zulip Oskar Hahn (Jan 14 2024 at 12:57):

This is an interesting comparison. But to get better numbers, I converted the false interpreter to use a go platform. This was a fun exercise: https://github.com/ostcar/roc-examples/tree/go-false/examples/false-interpreter-go

To run it, I called:

roc build --no-link False.roc
go build platform/main.go
time (echo "9\n" | ./main examples/queens.false)

For the go platform, it returns:

real    0m26,863s
user    0m27,679s
sys     0m1,165s

For the original rust platform, it returns:

real    0m26.401s
user    0m26.252s
sys     0m0.004s

So the real time is about the same (I ran it multiple times; sometimes go was faster, sometimes rust was faster). There is a relevant difference in the sys time, but this seems to be insignificant on multi-core CPUs when there is an idle core.

view this post on Zulip Brendan Hansknecht (Jan 14 2024 at 15:47):

Oh wow. Awesome :+1:

view this post on Zulip Brendan Hansknecht (Jan 14 2024 at 15:57):

Out of curiosity, can you run something like this to get more accurate time comparisons:

hyperfine -w 5 -r 20 -L v rust,go "/tmp/false-{v} examples/cli/false-interpreter/examples/queens.false <<< 9"

The two executables would be saved as /tmp/false-rust and /tmp/false-go, and it would be run from the root of the roc repo.

view this post on Zulip Brendan Hansknecht (Jan 14 2024 at 15:58):

roc build --no-link False.roc

This is missing --optimize

view this post on Zulip Brendan Hansknecht (Jan 14 2024 at 16:17):

On M1 machine, the go version was crashing (unsurprising, I think false hits some roc bugs currently and the stricter memory protection can notice that)

For my x86 linux machine, these are the timings that I see with hyperfine and --optimize:

Summary
  '/tmp/false-rust examples/cli/false-interpreter/examples/queens.false <<< 9' ran
    1.10 ± 0.25 times faster than '/tmp/false-go examples/cli/false-interpreter/examples/queens.false <<< 9'

view this post on Zulip Brendan Hansknecht (Jan 14 2024 at 16:19):

10% perf loss +- 25%. The go version has a crazy high standard deviation.
rust stdev ± 0.038 s
go stdev ± 0.729 s

view this post on Zulip Brendan Hansknecht (Jan 14 2024 at 18:00):

So, I wanted a bit cleaner testing. I removed reading from stdin: it is for a single character, noisy, and requires a shell. Instead, I just hardcoded getChar to return 9 in both rust and go.

I then noticed that go was missing buffered file reading, so I hacked that in:

diff

Also closed everything else on that PC and set the cpu to performance mode to make sure underclocking wasn't happening.

With much less noise, here are the full hyperfine results:
hyperfine results

For fun I also ran it with a zig performance tool by andrew kelley that adds more info:
poop results

This second tool was run for longer, so I will use its results. Go is 26% +/- 4% slower than rust. This does not show how much of the time is used by cgo though. Luckily, we have perf for that.

go-flamegraph.svg
rust-flamegraph.svg

Looking at the go flamegraph, it looks like 10.5% of the time is spent in runtime.cgocallback.abi0, plus another 2.5% for the cgo malloc calls, for 13% in total. Free didn't measure any overhead.

Of that time, 1.5% is spent in the actual malloc impl.

This would mean a total overhead of 11.5% for using cgo with this program. The other 14.5% of perf loss looks to be coming from go runtime stuff and general setup.
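The breakdown follows from simple arithmetic on the figures quoted in this post. A small sketch of that calculation is below; the function name is hypothetical, and it simply mirrors the post's own reasoning, which treats the flamegraph percentages and the 26% slowdown as directly comparable.

```rust
/// Split a measured slowdown into the part attributed to cgo and the rest.
/// All arguments are percentages as quoted in the post.
fn split_slowdown(total: f64, cgocallback: f64, cgo_malloc: f64, malloc_impl: f64) -> (f64, f64) {
    // The malloc implementation itself would run on any platform,
    // so it is subtracted from the cgo-attributed time.
    let cgo = cgocallback + cgo_malloc - malloc_impl;
    (cgo, total - cgo)
}

fn main() {
    let (cgo, other) = split_slowdown(26.0, 10.5, 2.5, 1.5);
    println!("cgo overhead: {}%, other: {}%", cgo, other);
}
```

With the quoted inputs this gives 11.5% for cgo and 14.5% for everything else.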

view this post on Zulip Richard Feldman (Jan 14 2024 at 18:09):

this is such a sweet analysis, love it! :heart_eyes:

view this post on Zulip Oskar Hahn (Jan 14 2024 at 21:28):

Wow. Very interesting. I think for a simple webserver it is fast enough.

view this post on Zulip Brendan Hansknecht (Jan 14 2024 at 22:17):

Oh yeah, for sure. As a note, false is probably a worst case (or at least used to be; not sure how bad it is now). It allocates like crazy.

So that will be tons of calls and overhead. Most webservers and whatnot will be IO bound. Also, they will hopefully allocate much, much less and run a limited number of tasks.

So this isn't a "don't use cgo". It was mostly me being curious, cause I used to work with go in chrome os and had always heard it was super slow. So I wanted to test.

view this post on Zulip Brendan Hansknecht (Jan 14 2024 at 22:20):

Thanks for proving out go platforms!


Last updated: Jul 06 2025 at 12:14 UTC