Hi! I'm new here, and I'm learning Roc by rewriting, in Roc, some coding puzzles I've previously done in Haskell. I'm currently doing the One Billion Row Challenge (which is basically just parsing a list of cities with temperatures and returning the min/mean/max temperature for each city).
I've gotten this far, working from the CLI args example:
app [main] {
    pf: platform "https://github.com/roc-lang/basic-cli/releases/download/0.10.0/vNe6s9hWzoTZtFmNkvEICPErI9ptji_ySjicO6CkucY.tar.br",
}

import pf.Stdout
import pf.File
import pf.Path
import pf.Task
import pf.Arg

main =
    finalTask =
        pathArg = readFirstArgT!
        readFileToStr (Path.fromStr pathArg)

    finalResult <- Task.attempt finalTask

    when finalResult is
        Err ZeroArgsGiven ->
            Task.err (Exit 1 "Error ZeroArgsGiven")
        Err (ReadFileErr errMsg) ->
            Task.err (Exit 1 "Error ReadFileErr:\n$(errMsg)")
        Ok fileContentStr ->
            fileContentStr |> parse |> Stdout.line

readFirstArgT =
    args = Arg.list!
    dbg args
    List.get args 1 |> Result.mapErr (\_ -> ZeroArgsGiven) |> Task.fromResult

readFileToStr = \path ->
    path
    |> File.readUtf8
    |> Task.mapErr \_ -> ReadFileErr "Failed to read file: $(Path.display path)"

parse = \contents ->
    citiesList = Str.split contents "\n"
    results = List.walk citiesList (Dict.empty {}) \state, elem ->
        cityWithTemp = Str.split elem ";"
        city =
            when List.get cityWithTemp 0 is
                Ok v -> v
                Err _ -> crash "err"
        temp =
            when List.get cityWithTemp 1 is
                Ok v ->
                    when Str.toF64 v is
                        Ok w -> w
                        Err _ -> crash "err"
                Err _ -> crash "err"
        Dict.update state city \value ->
            when value is
                Missing ->
                    Present { min: temp, mean: temp, max: temp }
                Present { min: oldMin, mean: oldMean, max: oldMax } ->
                    Present
                        {
                            min: Num.min temp oldMin,
                            mean: (temp + oldMean) / 2,
                            max: Num.max temp oldMax,
                        }
    results
    |> Dict.toList
    |> List.map \(city, { min, mean, max }) ->
        "$(city): $(Num.toStr min), $(Num.toStr mean), $(Num.toStr max)"
    |> Str.joinWith "\n"
roc dev tells me "The surgical linker currently has issue #3609 and would fail linking your app. Please use --linker=legacy to avoid the issue for now." I'm not sure which change made the surgical linker stop working, but I think it was around the time I realised I can't Stdout.line a Dict and have to create the string to print manually. When I run it with the legacy linker (roc dev --linker=legacy -- measurements.txt), it always results in ZeroArgsGiven, and dbg args gives []. If I remember correctly, this wasn't the case before I introduced whatever caused the problem with the linker. Could the two be related, or is it a coincidence? If it's a coincidence, what mistake am I making in my code?
I also wanted to know if there is a way to print, e.g., dicts. In Haskell I'm used to types having show, so I can call print on basically any structure. Having to do that last results |> Dict.toList |> List.map ... |> Str.joinWith "\n" bit is a bit tedious, tbh.
Then, I'm still working on my understanding of error handling, but I'm pretty sure this is suboptimal:

when List.get cityWithTemp 1 is
    Ok v ->
        when Str.toF64 v is
            Ok w -> w
            Err _ -> crash "err"
    Err _ -> crash "err"

Is there something simple I'm missing? Or something more fundamental?
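As a language-agnostic illustration of the flattening being asked about, here is the same parse step sketched in Python, where both failure points (a missing ";" field and an unparsable float) collapse into a single handler. This is just a sketch of the idea, not Roc advice:

```python
def parse_line(line: str) -> tuple[str, float]:
    # Both failure points (missing ";" field, bad float) surface as one
    # ValueError instead of two nested pattern matches.
    try:
        city, temp_str = line.split(";", 1)
        return city, float(temp_str)
    except ValueError as exc:
        raise ValueError(f"bad line: {line!r}") from exc

assert parse_line("Seoul;8.7") == ("Seoul", 8.7)
```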
Thanks!
Inspect.toStr should help you, for printing a dict.
Oh yes! I forgot about Inspect. Thanks.
I'll do a new basic-cli release soon that should improve the args parsing.
If you want you can use this to simplify the file reading a lot.
I'm already "watching" the basic-cli repo :) looking forward to it <3
About the import file syntax, does that compile the file into the binary? Because the one billion row file is 14GB haha. Also, I think I'd prefer to keep the file dynamic, so I can benchmark performance on different sizes without recompiling.
I've rerun the CLI args example as is, and it doesn't see the args when using the legacy linker, so it seems the problem is with using the legacy linker. Legacy linker means using shared system libraries and such, right? In which case maybe the problem is because I'm running NixOS. Does anyone know what in my code above prevents the surgical linker from working? I didn't see anything on the issue that seemed relevant.
About the import file syntax, does that compile the file into the binary?
Yes indeed.
Also, I think I'd prefer to keep the file dynamic, so I can benchmark performance on different sizes without recompiling.
Makes sense :)
In which case maybe the problem is because I'm running NixOS.
I'm on NixOS as well, I'll see if it is any different on my ubuntu laptop.
Now that I think of it; for the benchmark later, make sure to use:
$ roc build yourApp.roc --optimize
And then time the execution of ./yourApp cities.csv
Oh yeah, I've been intending to do that but haven't gotten it to build yet in the first place haha :smiling: but thanks tho
I've been following the meta conversation about Roc development, all Richard's talks and podcasts and such, so I know a lot about the way things work, I just haven't actually sat down and written code until now hahaha
Yeah the behavior is the same on Ubuntu, the tracking issue for this is here.
Oh huh thanks for confirming
The Str.toF64 is what requires the legacy linker; Str.toU64 does work (if I change cities.csv to natural numbers).
Hm, I tried Str.toDec and it segfaulted :skull:
Yeah, I tried that too :smiling_face_with_tear:
I'll see if we have an issue for that
No issue, it doesn't happen every time when using Str.toDec (in other code), I'll minimize and make an issue.
I was doing the same challenge (1brc) a few weeks ago and ran into what I understand is a refcount performance issue. And I think we also have an open issue about loading a large file in chunks, which is probably required for the full one-billion-row file.
https://roc.zulipchat.com/#narrow/stream/231634-beginners/topic/Profiling.20an.20AoC.20app/near/440118892
These were some of my numbers for a super naive implementation that stopped after parsing and creating the dict
1M row file
roc reference: 103271ms
gleam reference: 2251ms
gleam (arrays): 2901ms
go reference: 140ms
btw this is the issue for reading in chunks: github.com/roc-lang/basic-cli/issues/205
I'll minimize and make an issue.
I've investigated the empty args issue further and also found a temporary workaround :)
Oh, I implemented that
https://github.com/roc-lang/basic-cli/pull/206
Just forgot to add a test script and make it ready for review
I forgot I had that sitting there
Ok for completion's sake, I managed to get things working!
Code and results:
app [main] {
    pf: platform "./basic-cli/platform/main.roc",
}

import pf.Stdout
import pf.File
import pf.Task
import pf.Arg

main =
    finalTask =
        pathArg = readFirstArgT!
        readFileToStr pathArg

    finalResult <- finalTask |> Task.attempt

    when finalResult is
        Err ZeroArgsGiven ->
            Task.err (Exit 1 "Error ZeroArgsGiven")
        Err (ReadFileErr errMsg) ->
            Task.err (Exit 1 "Error ReadFileErr:\n$(errMsg)")
        Ok fileContentStr ->
            fileContentStr |> parse |> Inspect.toStr |> Stdout.line!
        _ ->
            Task.err (Exit 1 "Unknown error")

readFirstArgT =
    args = Arg.list!
    List.get args 1 |> Result.mapErr (\_ -> ZeroArgsGiven) |> Task.fromResult

readFileToStr = \path ->
    path
    |> File.readUtf8
    |> Task.mapErr \_ -> ReadFileErr "Failed to read file: $(path)"

parse = \contents ->
    citiesList = Str.split contents "\n"
    List.walk citiesList (Dict.empty {}) \state, elem ->
        cityWithTemp = Str.split elem ";"
        [city, t] = cityWithTemp
        temp = Result.withDefault (Str.toF32 t) 0
        Dict.update state city \value ->
            when value is
                Missing -> Present { min: temp, mean: temp, max: temp }
                Present { min: oldMin, mean: oldMean, max: oldMax } ->
                    Present {
                        min: Num.min temp oldMin,
                        mean: (temp + oldMean) / 2,
                        max: Num.max temp oldMax,
                    }
hyperfine --warmup 1 -L num 1_000,10_000,100_000,1_000_000 "./main measurements_{num}.txt"
Benchmark 1: ./main ../1brc-hs/measurements_1_000.txt
Time (mean ± σ): 6.9 ms ± 0.3 ms [User: 5.8 ms, System: 1.1 ms]
Range (min … max): 6.2 ms … 8.4 ms 261 runs
Benchmark 2: ./main ../1brc-hs/measurements_10_000.txt
Time (mean ± σ): 67.6 ms ± 1.2 ms [User: 65.2 ms, System: 2.0 ms]
Range (min … max): 65.4 ms … 71.3 ms 42 runs
Benchmark 3: ./main ../1brc-hs/measurements_100_000.txt
Time (mean ± σ): 661.0 ms ± 5.8 ms [User: 654.7 ms, System: 3.9 ms]
Range (min … max): 653.8 ms … 671.7 ms 10 runs
Benchmark 4: ./main ../1brc-hs/measurements_1_000_000.txt
Time (mean ± σ): 6.557 s ± 0.041 s [User: 6.510 s, System: 0.025 s]
Range (min … max): 6.501 s … 6.621 s 10 runs
Which is faster than what I got with my Haskell code, in a few cases at least! That was:
{-# LANGUAGE OverloadedStrings #-}
{-# OPTIONS_GHC -Wno-incomplete-uni-patterns #-}

module Main where

import System.Environment (getArgs)
import Data.Foldable (foldl')
import qualified Data.Map.Strict as M
import qualified Data.Text as T

type Results = M.Map T.Text (Float, Float, Float)

main :: IO ()
main = do
    [path] <- getArgs
    txt <- readFile path
    print $ M.toList . foldl' addCity M.empty . T.lines . T.pack $ txt

addCity :: Results -> T.Text -> Results
addCity results line =
    let (city, temp) = parseLine line
    in M.insertWith f city (temp, temp, temp) results
  where
    f (temp, _, _) (oldMin, oldMean, oldMax) =
        (min temp oldMin, (temp + oldMean) / 2, max temp oldMax)

parseLine :: T.Text -> (T.Text, Float)
parseLine line =
    (x, read (T.unpack y) :: Float)
  where
    [x, y] = T.splitOn ";" line
Which resulted in:
| rows | runs | result (ms) | ± |
|-----------:|-----:|------------:|--------:|
| 1 000 | 164 | 13.4 | 0.9 |
| 10 000 | 50 | 52.9 | 3.8 |
| 100 000 | 10 | 386.2 | 10.7 |
| 1 000 000 | 10 | 4 107.0 | 28.0 |
| 10 000 000 | 10 | 47 138.0 | 1 736.0 |
For some reason, the Haskell is faster from 100 000 rows up. Since Roc starts out faster, maybe this is a file-reading thing? Not sure. (I did compile with --optimize.)
I do like the Haskell for being more concise, but then again it doesn't do any error handling.
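Worth noting about both versions above: updating with mean: (temp + oldMean) / 2 computes a pairwise running average, which weights later rows more heavily than a true mean does. A small Python sketch of the difference:

```python
def pairwise_mean(temps):
    # The update rule used in both versions: mean = (temp + old_mean) / 2.
    mean = temps[0]
    for t in temps[1:]:
        mean = (t + mean) / 2
    return mean

def true_mean(temps):
    return sum(temps) / len(temps)

temps = [10.0, 10.0, 10.0, 40.0]
print(pairwise_mean(temps))  # 25.0: the last row alone contributes half
print(true_mean(temps))      # 17.5
```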
Just a note: we have a bug that affects dicts containing strings and would be killing your perf.
Also, I'm really surprised that this works; it is probably a bug: [city, t] = cityWithTemp. Because there is no guarantee the list has length 2.
So, I wrote a terrible function that, instead of using strings, uses tuples of 32 U8s. The longest string was 26 bytes, so everything fits in these tuples, but they are way less efficient to hash than a string. This should actually be a slower solution, but it works around a current refcounting bug.
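The fixed-width key trick can be sketched in Python with a hypothetical helper (just to show the idea of padding each name into a pointer-free, fixed-size tuple):

```python
def to_fixed_key(name: str, width: int = 32) -> tuple[int, ...]:
    """Pad a city name into a fixed-width tuple of byte values,
    analogous to using a (U8, ..., U8) tuple as the Dict key."""
    raw = name.encode("utf-8")
    assert len(raw) <= width, "name too long for the fixed-width key"
    # Zero-pad so every key has the same shape and size.
    return tuple(raw) + (0,) * (width - len(raw))

key = to_fixed_key("Seoul")
assert len(key) == 32 and key[:5] == tuple(b"Seoul")
```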
Your code on my machine (so using Dict Str):
| rows | runs | result (ms) | ± |
|-----------:|-----:|------------:|--------:|
| 1 000 | 294 | 6.8 | 0.4 |
| 10 000 | 55 | 54.9 | 6.5 |
| 100 000 | 10 | 531.8 | 72.8 |
| 1 000 000 | 10 | 5 434.0 | 737.0 |
| 10 000 000 | 10 | 51 639.0 | 6 809.0 |
My slightly modified code using tuples (so Dict (U8, ..., U8)):
| rows | runs | result (ms) | ± |
|-----------:|-----:|------------:|--------:|
| 1 000 | 248 | 4.1 | 0.4 |
| 10 000 | 300 | 6.3 | 0.2 |
| 100 000 | 96 | 27.6 | 2.7 |
| 1 000 000 | 12 | 233.2 | 1.1 |
| 10 000 000 | 10 | 2 291.0 | 15.0 |
The Dict Str version should actually be faster than this second version with the tuples, but we have a refcounting bug still to finish fixing. With that fixed, the Roc version should be about 20x faster for large inputs.
Brendan Hansknecht said:
I'm really surprised that this works. It is probably a bug:
[city, t] = cityWithTemp
. Cause there is no guarantee the list is length 2.
Oh huh, I think I wrote that with the intention of afterwards chasing down the type errors to do error handling, and then when the checker didn't complain I just forgot about it!
I wrote a terrible function
thank you, this is beautiful
It's so exciting to write functional/high-level code — and a pretty naive implementation also — that is just really fast!!
I tried to investigate the destructuring thing a bit, and tried this input:
Seoul;8.7
Minneapolis;3.5
El Paso;22.0
Helsinki;9.1
Foo
Reggane;15.4
Gaborone;25.3
Benghazi;31.6
Bridgetown;38.0
Algiers;26.8
Chihuahua;12.9
which resulted in
{
"Seoul": {max: 8.699999809265137, mean: 8.699999809265137, min: 8.699999809265137 },
"Minneapolis": {max: 3.5, mean: 3.5, min: 3.5 },
"El Paso": {max: 22, mean: 22, min: 22 },
"Helsinki": {max: 9.100000381469727, mean: 9.100000381469727, min: 9.100000381469727 },
"Foo": {max: 9.100000381469727, mean: 9.100000381469727, min: 9.100000381469727 },
"Reggane": {max: 15.399999618530273, mean: 15.399999618530273, min: 15.399999618530273 },
"Gaborone": {max: 25.299999237060547, mean: 25.299999237060547, min: 25.299999237060547 },
"Benghazi": {max: 31.600000381469727, mean: 31.600000381469727, min: 31.600000381469727 },
"Bridgetown": {max: 38, mean: 38, min: 38 },
"Algiers": {max: 26.799999237060547, mean: 26.799999237060547, min: 26.799999237060547 },
"Chihuahua": {max: 12.899999618530273, mean: 12.899999618530273, min: 12.899999618530273 },
"": {max: 12.899999618530273, mean: 12.899999618530273, min: 12.899999618530273 }
}
So, it carried over the previous value for t during the walk.
I also noticed that it added a blank string at the bottom, which is because splitting the input on "\n" results in a blank string at the end (or rather, because the file string ends on a \n):
» Str.split "Seoul;8.7\nMinneapolis;3.5\n" "\n"
["Seoul;8.7", "Minneapolis;3.5", ""] : List Str
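This trailing empty string isn't Roc-specific; for example, Python's str.split behaves the same way:

```python
parts = "Seoul;8.7\nMinneapolis;3.5\n".split("\n")
# The final "\n" produces a trailing empty element, same as Str.split:
assert parts == ["Seoul;8.7", "Minneapolis;3.5", ""]
```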
Anyway, when I insert dbg cityWithTemp above [city, t] = cityWithTemp, compilation fails with
This looks like an operator, but it's not one I recognize!
45│ dbg cityWithTemp
46│ [city, t] = cityWithTemp
^
I have no specific suggestion for this operator, see TODO for the full
list of operators in Roc.
I remember trying [city, t] = Str.split elem ";", which, trying again, results in the same "looks like an operator" error. I originally thought this was maybe because split can fail, but I realise now it can't, so it must be something else.
Adding dbg after the line works. This:
...
List.walk citiesList (Dict.empty {}) \state, elem ->
    cityWithTemp = Str.split elem ";"
    [city, t] = cityWithTemp
    dbg cityWithTemp
    dbg t
    temp = Result.withDefault (Str.toF32 t) 0
    dbg temp
    Dict.update state city \value -> ...
gives
...
[main.roc:46] cityWithTemp = ["Helsinki", "9.1"]
[main.roc:47] t = "9.1"
[main.roc:49] temp = 9.100000381469727
[main.roc:46] cityWithTemp = ["Foo"]
[main.roc:47] t = "9.1"
[main.roc:49] temp = 9.100000381469727
[main.roc:46] cityWithTemp = ["Reggane", "15.4"]
[main.roc:47] t = "15.4"
[main.roc:49] temp = 15.399999618530273
...
Can you try Str.splitFirst there instead of split?
I think it should be faster, and it gives you a record that can be matched inline like that. As I mentioned earlier, I don't think [city, t] = ... is actually expected to work in general, so it may confuse the compiler.
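For comparison, Python's str.partition does the same split-at-first-separator operation, returning a fixed-shape result that can be destructured safely:

```python
city, sep, temp = "El Paso;22.0".partition(";")
assert (city, temp) == ("El Paso", "22.0")

# A line with no separator signals it via an empty `sep`,
# much like Str.splitFirst returning Err NotFound:
_, sep, _ = "Foo".partition(";")
assert sep == ""
```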
Also, the value carry over is definitely a bug. If you could file an issue on GitHub, that would be great.
Brendan Hansknecht said:
Can you try Str.splitFirst there instead of split?
I tried this (not sure if the nested whens are the best way to express this :thinking:)
parse = \contents ->
    citiesList = contents |> Str.split "\n"
    List.walk citiesList (Dict.empty {}) \state, elem ->
        when Str.splitFirst elem ";" is
            Ok { before: city, after } ->
                temp = after |> Str.toF32 |> Result.withDefault 0
                Dict.update state city \value ->
                    when value is
                        Missing -> Present { min: temp, mean: temp, max: temp }
                        Present { min: oldMin, mean: oldMean, max: oldMax } ->
                            Present {
                                min: Num.min temp oldMin,
                                mean: (temp + oldMean) / 2,
                                max: Num.max temp oldMax,
                            }
            Err NotFound -> state
and it's actually a tiny bit slower. But by like 1ms in the 1000 line case and 200ms in the 1m line case, so could be insignificant.
I don't think
[city, t] = ...
is actually expected to work in general
oohhh I misunderstood earlier, sorry.
If you could file an issue on GitHub, that would be great.
Will do! :salute:
and it's actually a tiny bit slower
This intuitively makes some sense to me. I'm pretty sure that [city, t] = ... has no error handling, so it runs faster due to having a bug that is hit if the list isn't exactly (at least?) 2 elements.
Also, does it help if you change the error case to:
Err NotFound -> crash "impossible"
Sometimes crashing leads to better optimizations by the compiler
Brendan Hansknecht said:
Also, does it help if you change the error case to:
Err NotFound -> crash "impossible"
Problem is, the error case is reached because of the newline at the end of the file, and Str.split adds blank strings, e.g., Str.split "foo" "o" => ["f", "", ""].
I think this is expected behaviour, since Haskell's splitOn also does this. I was actually looking for an equivalent of lines. I see there's Str.trim, will try that in a moment.
An interesting observation:
You guys were getting 5-6s for the 1M files, and even though my implementation (Dict Str) was virtually identical, my M1 Mac would consistently finish around 100-110s. After looking into it a lot, it came down to my input generation script. The one I was using would create the correct number of rows, but it had waaay more unique cities. When I swapped it for the script linked on the official 1brc site, I started getting ~5s. This input script only has 413 unique cities, whereas my old one could go up to 44k, which exacerbates the Dict issue I think.
Trimming the newline at the end before splitting, and using crash on the NotFound error:
Benchmark 1: ./main ../1brc-hs/measurements_1_000.txt
Time (mean ± σ): 7.1 ms ± 0.3 ms [User: 5.7 ms, System: 1.3 ms]
Range (min … max): 6.4 ms … 8.4 ms 267 runs
Benchmark 2: ./main ../1brc-hs/measurements_10_000.txt
Time (mean ± σ): 69.2 ms ± 0.7 ms [User: 67.0 ms, System: 1.7 ms]
Range (min … max): 68.0 ms … 71.8 ms 41 runs
Benchmark 3: ./main ../1brc-hs/measurements_100_000.txt
Time (mean ± σ): 677.7 ms ± 4.1 ms [User: 670.1 ms, System: 5.0 ms]
Range (min … max): 672.5 ms … 686.2 ms 10 runs
Benchmark 4: ./main ../1brc-hs/measurements_1_000_000.txt
Time (mean ± σ): 6.839 s ± 0.060 s [User: 6.777 s, System: 0.029 s]
Range (min … max): 6.755 s … 6.920 s 10 runs
I think this is within the margin of error of the previous version.
@Musab Nazir oh interesting, and yeah, my test files were generated by the "official" python script.
This input script only has 413 unique cities whereas my old one could go up to 44k which exacerbates the Dict issue I think
Yeah, the issue is linear with the size of the dict
Basically, it is recursively updating all of the refcounts of the stored string keys
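A rough model of that scaling, assuming each Dict.update touches the refcount of every stored key (the counts here are illustrative, not measurements):

```python
def total_key_touches(n_rows: int, n_unique: int) -> int:
    """Count key 'touches' if every Dict.update walks all stored keys.
    The dict grows until it holds n_unique keys, then stays that size."""
    touches = 0
    size = 0
    for _ in range(n_rows):
        touches += size  # one refcount touch per stored key
        if size < n_unique:
            size += 1
    return touches

# With 44k unique cities instead of 413, per-update overhead is ~100x worse:
assert total_key_touches(1_000_000, 44_000) > 50 * total_key_touches(1_000_000, 413)
```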
This is a basic impl of 1brc using the latest roc (including #6911) and basic-cli: https://github.com/bhansconnect/roc-1brc/blob/main/1brc.roc
It's only single threaded, so obviously the performance can't be that good. I'm sure it also has tons of room for improvement (probably a lot within roc itself).
Perf result on the full 1 billion rows:
M1 mac 8GB ram: 145s
Linux machine 32GB ram: 272s
Assuming perfect parallelism over the 8 cores the test uses, we would expect something in the range of:
M1: 18s
Linux: 34s
For context, good results for the expanded key version of the test start at about 20s and go down as low as 3s. That said, good results also do a lot more tricks than we do (like swar for finding the semicolons).
Note: I used the python generator, which actually leads to slower results cause it uses 8,921 weather station names instead of just the 413 of the java generator.
Comparing to https://benhoyt.com/writings/go-1brc/, I am doing something roughly similar to r4, which took 51s in golang.
To get a bit more of a direct comparison, this is the timing of r4 on my machines with my same input data:
M1: 66s
Linux: 73s
So roughly 2.2x slower currently on M1 and 3.7x slower on linux.
Perf of the roc code is split as such:
25% list.splitFirst
25% parsing the rest of row (probably the pattern match)
50% Dict.update
I wonder how much of that is algorithmic differences
not sure how splitFirst could be made significantly faster! :big_smile:
What do you mean by algorithmic?
Aside from using an mmap instead of buffered io, the two implementations are almost identical algorithmically. With the caveat that I use pattern matching while go uses manual but roughly equivalent code.
Also, at a minimum for split, it is doing a ton of redundant checks by calling List.sublist twice. But I would guess that iterating in roc, with refcounts, tags, and list checks, probably costs a solid amount compared to a quick loop in zig.
Also, for bytes specifically, you can use simd or swar to find much faster.
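For reference, the SWAR trick can be sketched in Python: XOR makes matching bytes zero, and the classic has-zero-byte bit trick flags them without a per-byte loop. A sketch of the idea, not any actual Roc or Zig implementation:

```python
def find_semicolon_swar(chunk: bytes) -> int:
    """Find the index of the first ';' in an 8-byte chunk, or -1."""
    assert len(chunk) == 8
    word = int.from_bytes(chunk, "little")
    # XOR so that bytes equal to ';' (0x3B) become 0x00.
    x = word ^ (0x3B * 0x0101010101010101)
    # Has-zero-byte trick: sets the high bit of every zero byte. Borrow
    # propagation can only cause false positives in bytes *above* a real
    # match, so the lowest set bit always marks the first match.
    hits = (x - 0x0101010101010101) & ~x & 0x8080808080808080
    if hits == 0:
        return -1
    return ((hits & -hits).bit_length() - 1) // 8

assert find_semicolon_swar(b"Seoul;8.") == 5
assert find_semicolon_swar(b"Gaborone") == -1
```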
Brendan Hansknecht said:
Also, at a minimum for split, it is doing a ton of redundant checks by calling
List.sublist
twice. But I would guess that iterating in roc with refcount, tags, and list checks probably costs a solid amount then a quick loop in zig.
oh interesting! do you think it would be viable to move more of that to zig to avoid redundant checks?
Yeah, I think we could get a decent gain from moving it into zig.
Also, in this specific case, a bytes specific version could use simd or swar. Which would be even faster, but I don't think the go version does that.
I also like the idea of doing that!
working with bytes is super common
and having builtins just automatically run faster for common cases like that seems great
it's a good use of builtins :big_smile:
Yeah, one note related to that: it would also be great if we allowed for special ability specialization for lists of bytes. For example, hashing has addBytes, but you have to wrap a List U8 in an opaque type to use it. So List U8 hashing is slow by default.
I should actually double check and verify this
Yeah, double-checked: List U8 hashing is on the slow path by default. It hashes each element individually.
oh yeah we should totally do that one too
same with equality checks for sure
yeah, and with equality checks, we can treat any list of types without pointers (or opaques) as a big list of bytes and just blit over them instead of caring about padding.
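The bag-of-bits comparison can be illustrated with Python's array, whose elements contain no pointers, so one memcmp-style pass over the raw bytes agrees with elementwise equality (modulo float oddities like NaN and -0.0):

```python
from array import array

a = array("d", [1.0, 2.0, 3.0])
b = array("d", [1.0, 2.0, 3.0])

# Elementwise comparison...
assert a == b
# ...matches comparing the raw underlying bytes in a single pass:
assert a.tobytes() == b.tobytes()
```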
Actually, I guess we could do that with hashing as well... maybe... depends if we want hashing a, b, and c to be equivalent to hashing [a, b, c], where a/b/c are variables of some random type alias.
Oh wait, list hashing is already different than just hashing the elements...so yeah, any type that can be seen as a bag of bits (memcpy safe in the llvm backend) could use way faster list hashing.
That said, I do think there is a problem here with ordering of passes. Abilities are generated when we don't know concrete types. These optimizations require knowing the concrete final type.
concrete as in monomorphized?
I'd be surprised if there wasn't some way to get that to work without restructuring things, although @Ayaz Hafiz would know better
Oh, I bet we could just make hashList a low level. So the ability dispatches to hashList. Then the low level should know the concrete type and be able to dispatch farther.
Or at least something roughly like that. So kinda two steps.
Also, as much as I would love to keep dict in roc (and think it should long term be possible to make efficient in roc), we probably could get some large perf gains by moving dict to zig. (oh, and making sure we enable borrow inference for dict and set the same as list and str).
That said, it would be a pretty large project to move dict and may not be worth it for now.
Do we have a location in the compiler pipeline, after or during specialization, where we can pick the lowlevel we generate? I'm thinking about trying to specialize before we get to the backends, instead of dealing with the specialization in each of the backends; the backends would then still just dispatch a function call to zig without any special wiring. I'm assuming this would just be a pass on the mono IR, and that there is already a location in the compiler to add something like this.
I believe so, although I'm not sure offhand where it is
actually I think the lowlevels all do that
so if you search for pattern matches on lowlevels you might find it (I'm on mobile right now)
Also, I ran this through poop, and there are definitely some heavy memory issues (I need to test with a large bufreader instead of an mmap). That may be a large part of the perf gap, but the memory issues might be from something else (like mmap page loads being labeled as cache misses without actually being that expensive, not sure):
$ poop -d 1 "./r4 data/measurements_1_000_000_000.txt" "./1brc data/measurements_1_000_000_000.txt"
Benchmark 1 (3 runs): ./r4 data/measurements_1_000_000_000.txt
measurement mean ± σ min … max outliers delta
wall_time 87.3s ± 3.30s 83.5s … 89.5s 0 ( 0%) 0%
peak_rss 4.15MB ± 227KB 3.89MB … 4.28MB 0 ( 0%) 0%
cpu_cycles 276G ± 958M 275G … 277G 0 ( 0%) 0%
instructions 471G ± 175M 471G … 471G 0 ( 0%) 0%
cache_references 16.3G ± 265M 16.0G … 16.6G 0 ( 0%) 0%
cache_misses 1.06M ± 224K 870K … 1.31M 0 ( 0%) 0%
branch_misses 2.75G ± 21.6M 2.74G … 2.77G 0 ( 0%) 0%
Benchmark 2 (3 runs): ./1brc data/measurements_1_000_000_000.txt
measurement mean ± σ min … max outliers delta
wall_time 169s ± 10.1s 159s … 180s 0 ( 0%) 💩+ 94.1% ± 19.6%
peak_rss 16.0GB ± 74.5KB 16.0GB … 16.0GB 0 ( 0%) 💩+385274.2% ± 9.2%
cpu_cycles 640G ± 2.13G 638G … 642G 0 ( 0%) 💩+131.9% ± 1.4%
instructions 1.40T ± 1.59G 1.40T … 1.40T 0 ( 0%) 💩+197.5% ± 0.5%
cache_references 10.9G ± 133M 10.8G … 11.0G 0 ( 0%) ⚡- 33.0% ± 2.9%
cache_misses 462M ± 1.43M 460M … 463M 0 ( 0%) 💩+43454.8% ± 218.7%
branch_misses 2.95G ± 30.2M 2.92G … 2.97G 0 ( 0%) 💩+ 7.4% ± 2.2%
Highlights ~45x as many cache misses and ~3x as many instructions.
Also, it's totally possible all the cache misses are from our dictionary being in two different allocations, along with the memory pressure from the giant mmap.
Also, note, these times are way faster than the times above due to not clearing the page cache between runs. So the file was already in ram more or less.
So with a better version of readline, perf is even worse than with the mmap. This version of readline actually accumulates to a RocList directly instead of a Vec (which helped a lot, but not enough):
Benchmark 1 (3 runs): ./r4 data/measurements_1_000_000_000.txt
wall_time 77.6s ± 4.87s 74.5s … 83.2s 0 ( 0%) 0%
Benchmark 2 (3 runs): ./1brc-mmap data/measurements_1_000_000_000.txt
wall_time 177s ± 7.44s 168s … 182s 0 ( 0%) 💩+127.7% ± 18.4%
Benchmark 3 (3 runs): ./1brc-readlines data/measurements_1_000_000_000.txt
wall_time 255s ± 2.91s 252s … 258s 0 ( 0%) 💩+228.0% ± 11.7%
Last updated: Jul 06 2025 at 12:14 UTC