Is there currently a reasonable cross-platform way to read a local file in chunks, e.g. with basic-cli Tcp.readLine?
On Ubuntu I will be trying to use the netcat program to serve the file and read it as a Tcp stream, but I can't find an analogous built-in tool for e.g. Windows
so like Tcp.readLine but for file I/O instead?
Yes, essentially - to be able to handle large files
I realized in my case I probably won't get bottlenecked by the current strict File.readUtf8 for now, but this is a use-case to consider
I made https://github.com/roc-lang/basic-cli/issues/205 for this
@Musab Nazir if you are on Linux, you could try serving your file with netcat and stream-processing it using the Tcp builtin module
Maybe https://gist.github.com/talwai/d94c71ca09729ac655b4 will be of help, which you can execute with the Cmd builtin module
What if we just add a File.readLine similar to Rust's std::io::BufRead read_line method?
Then you can just write a task to process lines and call it using Task.loop
Maybe you call File.openBuffered to get a handle first?
That would be ideal
With File.readLine, File.readUpTo and File.readExactly to match Tcp builtin
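For reference, the Rust side of a File.readLine could lean on `BufRead::read_line` roughly as sketched here. This is a minimal sketch, not the basic-cli implementation: `Totals`, `count_lines` and the temp-file name are made up for illustration, and only the readLine analogue is shown.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};

/// Hypothetical summary type: running totals gathered while streaming.
#[derive(Debug, Default, PartialEq)]
struct Totals {
    bytes_read: usize,
    lines_read: usize,
}

/// Read `reader` one line at a time, reusing a single String buffer.
fn count_lines<R: BufRead>(mut reader: R) -> io::Result<Totals> {
    let mut totals = Totals::default();
    let mut line = String::new();
    loop {
        line.clear();
        // read_line returns Ok(0) at end of input; the trailing newline is kept in `line`.
        let n = reader.read_line(&mut line)?;
        if n == 0 {
            return Ok(totals);
        }
        totals.bytes_read += n;
        totals.lines_read += 1;
    }
}

fn main() -> io::Result<()> {
    // Hypothetical sample file, written to the temp dir so the sketch is self-contained.
    let path = std::env::temp_dir().join("read_line_sketch.txt");
    std::fs::write(&path, "first\nsecond\n\nlast")?;

    let totals = count_lines(BufReader::new(File::open(&path)?))?;
    println!("Done reading, got {:?}", totals);

    let _ = std::fs::remove_file(&path);
    Ok(())
}
```

Because `count_lines` only asks for `BufRead`, the same function would also accept a `BufReader` wrapped around a `TcpStream`.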
I can't contribute as I am not familiar with Rust
Another idea is maybe these functions can be unified between File and Tcp modules with a single Stream module
I have a design somewhere for a general Stream concept
maybe now's the time to implement it! :smiley:
Thought I might have a go and see how to implement it. It works, but not sure this is the right direction.
https://github.com/roc-lang/basic-cli/pull/206
$ roc examples/file-read-buffered.roc
🔨 Rebuilding platform...
Read 119 bytes
Read 1 bytes
Read 52 bytes
Read 1 bytes
Read 496 bytes
Read 1 bytes
Read 22 bytes
Read 1 bytes
Read 194 bytes
Read 1 bytes
Read 330 bytes
Read 1 bytes
Read 52 bytes
Read 1 bytes
Read 181 bytes
Read 1 bytes
Read 461 bytes
Done reading, got {bytesRead: 1915, linesRead: 17}
I'm using this to keep track of the buffered file reader
thread_local! {
    static READERS: RefCell<Vec<Rc<RefCell<BufReader<File>>>>> = RefCell::new(Vec::new());
}
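A rough sketch of how a registry like this could hand out handles, assuming the handle is simply the index into the Vec. The helpers `open_buffered` and `read_line` are hypothetical names for illustration and are not taken from the PR.

```rust
use std::cell::RefCell;
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;
use std::rc::Rc;

thread_local! {
    // Same shape as the registry quoted above: one slot per open reader.
    static READERS: RefCell<Vec<Rc<RefCell<BufReader<File>>>>> = RefCell::new(Vec::new());
}

/// Hypothetical helper: open `path`, store the reader, and return its index as the handle.
fn open_buffered(path: &Path) -> io::Result<usize> {
    let reader = BufReader::new(File::open(path)?);
    Ok(READERS.with(|readers| {
        let mut readers = readers.borrow_mut();
        readers.push(Rc::new(RefCell::new(reader)));
        readers.len() - 1
    }))
}

/// Hypothetical helper: read the next line (newline included) for `handle`.
/// Returns Ok(None) at end of file or when the handle is unknown.
fn read_line(handle: usize) -> io::Result<Option<String>> {
    // Clone the Rc so the registry borrow ends before any I/O happens.
    let reader = READERS.with(|readers| readers.borrow().get(handle).cloned());
    let Some(reader) = reader else { return Ok(None) };
    let mut line = String::new();
    let n = reader.borrow_mut().read_line(&mut line)?;
    Ok(if n == 0 { None } else { Some(line) })
}

fn main() -> io::Result<()> {
    // Hypothetical sample file so the sketch runs on its own.
    let path = std::env::temp_dir().join("readers_registry_sketch.txt");
    std::fs::write(&path, "alpha\nbeta\n")?;

    let handle = open_buffered(&path)?;
    while let Some(line) = read_line(handle)? {
        println!("Read {} bytes", line.len());
    }

    // The reader stays registered, so removal may fail on some platforms; ignore that here.
    let _ = std::fs::remove_file(&path);
    Ok(())
}
```

Nothing in this sketch ever removes a reader from the Vec, so a real version would presumably also need a way to close a handle.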
I have another question tangentially related to this. When we use the import syntax to load a file like:
import "sample.txt" as sample : Str
Is it the same as using File.read, or is there extra stuff Roc is doing to the file in the import version?
I believe this causes the file to be read at compile time and included in the executable but I'd have to check to make sure.
Apologies if this isn't the right place for these questions but I'm trying to find the fastest way I can read the raw contents of a file and my current method is showing non-linear runtime as I feed bigger and bigger files.
main =
    startTime = Utc.now!
    input <- File.readBytes (Path.fromStr "measurements.txt") |> Task.attempt
    when input is
        Ok _ ->
            endTime = Utc.now!
            runTime = Utc.deltaAsMillis startTime endTime |> Num.toStr
            Stdout.line! "File read in $(runTime)ms"
        Err _ -> Stdout.line! "Failed to read"
With a 100mil row file (1.5gb) my fastest run was 800ms on the dot
With a 200mil row file (3.0gb) I get 5500ms. I might be running into memory limitations because I'm on an 8gig M1 machine.
Is there any low hanging fruit in terms of cutting this time down more?
There is, 100%
Pretty sure we are duplicating the read-in file
Also, if we add something for streamed reading, please make sure it has either a configurable or very large buffer (4k minimum, 16k probably better)
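On the Rust side, a configurable buffer is mostly a matter of calling `BufReader::with_capacity` instead of `BufReader::new`. A minimal sketch, where `DEFAULT_CAPACITY` and `buffered` are hypothetical names and the 16k default is just the figure suggested above:

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical default, taken from the 16k figure suggested above.
const DEFAULT_CAPACITY: usize = 16 * 1024;

/// Wrap any reader in a BufReader whose capacity the caller may override.
fn buffered<R: Read>(inner: R, capacity: Option<usize>) -> BufReader<R> {
    BufReader::with_capacity(capacity.unwrap_or(DEFAULT_CAPACITY), inner)
}

fn main() -> io::Result<()> {
    // In-memory stand-in for a file; a File or TcpStream would work the same way.
    let data = vec![b'x'; 40_000];
    let mut reader = buffered(&data[..], None);

    // fill_buf exposes whatever one refill pulled into the internal buffer.
    let first_fill = reader.fill_buf()?.len();
    println!("capacity {}, first fill {} bytes", reader.capacity(), first_fill);
    Ok(())
}
```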
Though personally, I would push for an mmap'ed load directly into a Roc list instead. Fast and nice: you don't have to think about buffers at all, just one magical block. It may not play nice with Roc, though, since it will explode on the first edit; it needs to be specially treated as read-only or managed by the platform with more effects.
Can we pass a slice into Roc?
I'm trying to find the fastest way I can read the raw contents of a file and my current method is showing non-linear runtime as I feed bigger and bigger files.
One thing to point out is that I think the limitations you will see with basic-cli are probably just the platform, which is pretty primitive, and not necessarily anything specific to Roc. So making a fork, or writing another platform implementation that is better suited, may be a good idea to consider, depending on what it is you're trying to do.
ah right I forget I'm going through a platform. Might look into what basic-cli is doing and see if a local tweak gets me what I'm after.
We can pass slices into roc.
I think we should really avoid mmap in Roc platform I/O...it would introduce the characteristic "any other process can now cause your Roc application code to perform undefined behavior" whereas you don't have that concern if there's no mmap primitive
I don't follow. Just mmap private.
whoa, TIL that MAP_PRIVATE protects against that! :mind_blown:
ok in that case I guess there's no potential for UB?
Of course, on modification, mmapped slices have an issue in roc. They will be cloned and the original map won't be freed.... Though maybe it could be intercepted if done right....
what if we only offered readonly access?
like the platform just didn't offer write access to any file that's mmap'd
Yeah, would probably do that by offering a seamless slice to the mmap'ed file
Then, if the user tries to write, it would copy the entire thing.
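The copy-on-first-write behaviour described here can be illustrated in plain Rust with `std::borrow::Cow`. This is only an analogy under stated assumptions: the byte slice stands in for the mapped file, `set_byte` is a hypothetical helper, and nothing here reflects how Roc's seamless slices are actually implemented.

```rust
use std::borrow::Cow;

/// Hypothetical helper: write one byte through the view.
/// If the view is still borrowed, to_mut clones the whole region first.
fn set_byte(view: &mut Cow<'_, [u8]>, index: usize, value: u8) {
    view.to_mut()[index] = value;
}

fn main() {
    // Stand-in for the mapped file: a read-only byte region owned elsewhere.
    let mapped: &[u8] = b"station;12.3\n";

    // Handing out the bytes as a borrowed view copies nothing.
    let mut view: Cow<'_, [u8]> = Cow::Borrowed(mapped);
    assert!(matches!(view, Cow::Borrowed(_)));

    // The first write copies the entire region into an owned buffer.
    set_byte(&mut view, 0, b'S');
    assert!(matches!(view, Cow::Owned(_)));

    // The original region is untouched; only the copy changed.
    assert_eq!(mapped[0], b's');
    assert_eq!(view[0], b'S');
}
```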
Last updated: Jul 06 2025 at 12:14 UTC