zig compiler - integrate with tooling · compiler development

Stream: compiler development

Topic: zig compiler - integrate with tooling

Eli Dowling (Feb 04 2025 at 08:17):

I think this rewrite is a great time to also think a little about how the roc compiler can integrate well with external tooling and the language server:
Things like:

Could the compiler be designed to be able to serialize enough of its internal info: symbols, ast, etc that the language server, and maybe other exciting tooling could be written in roc?
There needs to be two modes of error reporting. One for the CLI that includes all info and a snippet and such, and one for the language server that only includes the error as a string and line info. (changing that looked like kind of a nightmare in the current roc compiler)
Obviously recoverable parsing.
Probably a few things I've forgotten that I can add when I take a look at the current language server impl.

I just hadn't seen any discussion of it and wanted to try to make sure it's not forgotten :)

Sam Mohr (Feb 04 2025 at 08:31):

I think Richard isn't a fan of tying an external protocol to our compiler:

Richard Feldman said:

zooming out for a sec, I'm trying to avoid coupling the public roc API to other protocols, so that we don't get in a situation where people are saying "hey please update the language and do a release ASAP because we're blocked and there's no way for us to unblock ourselves short of a language release."

examples of this include:

not wanting to offer generating llvm IR because that couples roc to llvm updates

not wanting to have Unicode segmentation in builtins because that couples roc to Unicode updates

not wanting to couple roc to lsp for the same reason.

with that in mind, I wonder if there's a way we could keep the relevant logic in the roc binary but expose it in a way that lsp (and others) can access functionality with it in a way that can be upgraded independently from roc itself - like for example what we do with roc glue and giving it a .roc script that describes what to do

Sam Mohr (Feb 04 2025 at 08:32):

But since we're aiming for a single binary that does everything, it'd be really nice to have the same experience as with gleam lsp

Eli Dowling (Feb 04 2025 at 08:37):

Well that's why I was hoping that the roc binary itself could expose very low level primitives, symbols, types, ast etc.
That way improvements to the language server can occur separately to the language.
It seems like that decouples roc from lsp much more, right?

Sam Mohr (Feb 04 2025 at 08:38):

Yeah, I'd agree

Sam Mohr (Feb 04 2025 at 08:38):

Sorry, I think I read your message too quickly

Luke Boswell (Feb 04 2025 at 08:39):

I think the plan is for each IR to be nicely serialisable for caching and snapshotting, and also for external tools to integrate with.

Luke Boswell (Feb 04 2025 at 08:39):

I'm not exactly sure how that will look or work. But definitely something I'm keen to explore and understand more to help ensure we dont miss it.

Eli Dowling (Feb 04 2025 at 08:41):

Also I think a language server written in roc would be a really great, "here is roc doing real things that are non trivial" example.
language servers do:

serialization and deserialization of a very complex spec
caching and invalidation in response to input
parallelism
quite complex data types that require lots of manipulation and "rooting about in"

Luke Boswell (Feb 04 2025 at 08:42):

oh, interesting... writing the language server in roc. How would that work? as a plugin?

Luke Boswell (Feb 04 2025 at 08:44):

I'm really interested in this idea, it sounds great.

Luke Boswell (Feb 04 2025 at 08:45):

Would that need to be a custom platform?

Eli Dowling (Feb 04 2025 at 08:48):

We can look at how other language servers work.
i have few ideas, but one I quite like is having the compiler also be a roc platform.
That way you guarantee the language server will always work even if you have a different version of the compiler, but it can still release independently of the compiler.

The compiler could also expose apis over jsonrpc or the like and the language server could communicate that way.

Eli Dowling (Feb 04 2025 at 08:50):

I think choosing either one should allow us to switch to the other with very little hassle.
Eventually we should definitely expose both, having your compiler be able to be run as a service lets people build interesting tools with it, but also I think the platform approach would be better for the language server and making it super simple for people to build other tools in roc.

Luke Boswell (Feb 04 2025 at 08:51):

When you say the roc compiler, are you thinking the roc cli binary/executable?

Eli Dowling (Feb 04 2025 at 08:53):

Yup. You could start it with a special flag and it goes into "rpc mode" where you can send messages back and forth and get stuff out.

Eli Dowling (Feb 04 2025 at 08:54):

In the short term, I think a platform for roc would be the best choice though. Again, it's a good way to show off our cool features, with a real word use case

Richard Feldman (Feb 04 2025 at 14:39):

yeah it could work like glue

Richard Feldman (Feb 04 2025 at 14:40):

where instead of gleam lsp you run something like roc tooling lsp.roc

Richard Feldman (Feb 04 2025 at 14:42):

and then lsp.roc works like rust-glue.roc in that its platform is provided by the compiler, and lsp.roc plays the role of translating direct function calls to/from the compiler into the language server protocol

Eli Dowling (Feb 04 2025 at 14:47):

I think that's a good solution for now.

In the longer term it may be nice to be able to release changes to the language server without updating roc as a whole (not sure how often you plan to do releases).
If so I was thinking we could have a platform that produces it's own standalone binary and uses the compiler as a library. That way the language server can be bumped separate to the compiler because it bundles its own copy of the compiler.

Eli Dowling (Feb 04 2025 at 14:50):

I had a chat to @Luke Boswell earlier today about this.
I think it would be great if we could make the "roc tooling" platform be more than just a language server.
If we try to keep everything pretty general purpose on the zig side, getting ast, different IR, types, symbols etc, then we could easily create a base for other analysis tools, like linters, or codegen etc. All written in roc

Richard Feldman (Feb 04 2025 at 15:10):

yeah I do think that type of thing seems reasonable

Richard Feldman (Feb 04 2025 at 15:13):

I like the "linting" philosophy where it's not so much about enforcing arbitrary stylistic preferences as it is about project-specific invariants

Richard Feldman (Feb 04 2025 at 15:15):

e.g. "we have decided to move away from doing things in this way that we used to, and the tool's job is to fail the build if that way is used in any new code, with exceptions carved out for old code"

Eli Dowling (Feb 04 2025 at 15:17):

I think one day it would be really cool if we could do things like create linting rules using the refcount IR.
That way we could create a lint that says "hey this platform expects to be able to reuse this buffer, and you're storing an extra reference to it here"

Richard Feldman (Feb 04 2025 at 15:21):

I think it's reasonable to do that on the application side but not the platform side

Richard Feldman (Feb 04 2025 at 15:21):

like "I always want to be reusing this for perf, and if I ever stop doing that here I want to know about it"

Richard Feldman (Feb 04 2025 at 15:22):

as opposed to "you must only give me one that can be reused or else your build will fail" - at which point we've added a janky version of Rust's ownership types to Roc :big_smile:

Eli Dowling (Feb 04 2025 at 15:30):

oh no, I more just meant:
If we build a good compiler platform that allows plugins for things like linting, and we expose the right stuff, we could enable a platform author to add little suggestions like that. Not as an error necessarily but as a hint.
You know: "if anyone calls this function, and the variable it's assigned to has multiple references, show a little suggestion that says "sure you want to do that mate?"

I would definitely write one for myself that does: "if i have a variable called buff and it has multiple references, warn me"

Brendan Hansknecht (Feb 04 2025 at 16:43):

Richard Feldman said:

and then lsp.roc works like rust-glue.roc in that its platform is provided by the compiler, and lsp.roc plays the role of translating direct function calls to/from the compiler into the language server protocol

This really don't make sense to me. Unlike glue, the language server is a bespoke single application.

Brendan Hansknecht (Feb 04 2025 at 16:46):

Luke Boswell said:

I think the plan is for each IR to be nicely serialisable for caching and snapshotting, and also for external tools to integrate with.

I think we should be really careful of this. The more coupling we expose the less flexible and changeable roc becomes. This sounds like a trivial way to hit hyrum's law really hard.

Not saying we shouldn't do it, but if we do it, we probably should pick one very explicit cutting point that we think is unlikely to change.

Brendan Hansknecht (Feb 04 2025 at 16:46):

That said, I think both zig and Odin expose parts of their compiler in their standard library, so maybe it isn't too bad.

I guess a lot of this depends on good versioning guarantees.

Brendan Hansknecht (Feb 04 2025 at 16:48):

Also, this all may fall nice into a libroc workflow where you just use the compiler as a library instead of as an executable

Eli Dowling (Feb 04 2025 at 16:48):

I am strongly with Brendan on this. I would be leery of exposing the full IR. I would like to transform it slightly to include only the info we see external services wanting, recounts symbol locations type info etc. nothing weird and internal if we can avoid it.

Hopefully that will keep us from having the external API change much.

Luke Boswell (Feb 04 2025 at 19:20):

As a starting point I was thinking of making libroc essentially expose the roc check functionality, and it returns the ResolvedIR (prev Can) or maybe as far as the last IR before code gen ... and making a roc platform that provides that to a roc app aling with Problems.

I figure this would be all that's required for making an LSP, or something like checkmate, or our playground.

jan kili (Feb 10 2025 at 21:33):

Is this topic a superset of (something we talked about several months ago) wanting to be able to convert raw Roc source code files to/from a serialization format like JSON or YAML etc, powered by something like a first-party JSON Schema? Should I start a new topic?

Luke Boswell (Feb 10 2025 at 22:05):

I think the general idea is to serialise the IR's to an S-expression format, which should then be easy to parse and work with -- I'm not sure about a schema, though I guess once the IR has firmed up that might help standardise it.

Anthony Bullard (Feb 10 2025 at 22:07):

yes, each IR would have it's own sexpr representation

Anthony Bullard (Feb 10 2025 at 22:07):

Which would be very simple

Brendan Hansknecht (Feb 10 2025 at 22:28):

I think the general idea is to serialise the IR's to an S-expression format

Is that for this tooling as well? This tooling wouldn't be using a serialized text at all. It should be directly using some sort of roc tag union representation of the IR.

Eli Dowling (Feb 10 2025 at 22:31):

Well making it a format able to be sent outside of zig is pretty essential if we want to build tools for roc in roc.
So I'd call it tangentially relevant.

Brendan Hansknecht (Feb 10 2025 at 23:28):

For sure, I was think of text representation vs tag union representation which may be two very different shapes

jan kili (Feb 11 2025 at 00:10):

Anthony Bullard said:

Which would be very simple

Sweet! So could I write a Roc library that helps you "read/write Roc code" by mapping+translating typed values (likely mostly Strs, but maybe lots of tags) from/to the raw contents of main.roc.ir_step_5.sexpr.idk files that the compiler writes beforehand / reads later? (Maybe in real time if a platform called the compiler in a certain way?)

Luke Boswell (Feb 11 2025 at 00:13):

I'd like to explore the idea of roc glue -- potentially even becoming something more like roc gen and it could potentially access any or all of the IR's and then we could write plugins (roc scripts) that do things with roc source code really easily. The primary usecase is for things like tooling (e.g. checkmate).

Luke Boswell (Feb 11 2025 at 00:14):

If we use a Str and parse the S-expressions on the roc side... it will be much easier than trying to maintain a binding to the roc types for all the IR's.

Luke Boswell (Feb 11 2025 at 00:17):

So I imagine a roc package that parses the IR Str and gives us an AST using roc tag unions etc.

Sam Mohr (Feb 11 2025 at 00:18):

How would we make sure that it always stays in sync with the Zig equivalent? Seems easier to keep this in a Zig library

Luke Boswell (Feb 11 2025 at 00:18):

Then I imagine having a few simple effects available like Stdout.line!, File.write! or Http.send! to do stuff with this.

Luke Boswell (Feb 11 2025 at 00:19):

Sam Mohr said:

How would we make sure that it always stays in sync with the Zig equivalent? Seems easier to keep this in a Zig library

I imagine we could fuzz it somehow... we will be using this for glue generation anyway [in my hypothesis here] -- which we would want to be reliable.

jan kili (Feb 11 2025 at 00:20):

Are the alternatives (b) not having this functionality and (c) having third parties write plugins in Zig?

Sam Mohr (Feb 11 2025 at 00:22):

Sounds about right, with more leaning towards (c)

Sam Mohr (Feb 11 2025 at 00:22):

Or writing bindings to Zig, which seems janky

Luke Boswell (Feb 11 2025 at 00:23):

Note that if we expose things via zig, e.g. (c), we wouldn't want to use the builtins... as they're really our internals and not a very nice abstraction for working with. So if we do expect people to work with zig, it's the same issue where we have a separate thing (zig library) that needs to be maintained and kept in sync.

Luke Boswell (Feb 11 2025 at 00:24):

If we can standardise on a simple protocol (S-expressions) instead of a library (one blessed language) then it will be much easier for tooling in any language.

Luke Boswell (Feb 11 2025 at 00:27):

But maybe I'm wrong here... we will need zig code to serialise and deserialise the IR's anyway, and a parser in Roc would be duplicating this effort.

Brendan Hansknecht (Feb 11 2025 at 00:27):

and it could potentially access any or all of the IR's and then we could write plugins

Please no

Luke Boswell (Feb 11 2025 at 00:27):

Why not?

Brendan Hansknecht (Feb 11 2025 at 00:27):

The more exposed our internals are, the more lockdown they are

Brendan Hansknecht (Feb 11 2025 at 00:28):

I think we should expose a single cutting point with its own transformed IR and nothing else

Brendan Hansknecht (Feb 11 2025 at 00:28):

Huge hit by hyrum's law

Luke Boswell (Feb 11 2025 at 00:28):

Ahk... well I guess maybe we just expose the IR after type checking, and then the IR after the full build (which includes refcounting etc)??

Brendan Hansknecht (Feb 11 2025 at 00:29):

well I guess maybe we just expose the IR after type checking

Yeah, something around here is the one point I think we should expose.

....

refcounting is technically an internal detail, why do we want to expose it?

Luke Boswell (Feb 11 2025 at 00:29):

We need to expose something for generating glue anyway... my hypothesis is that we could also expose something that tooling like checkmate can use too

Luke Boswell (Feb 11 2025 at 00:30):

Oh, it was the LSP that wanted to know about tail-call or other optimisations.

I was wondering if we could even write our LSP as a roc plugin?

Mostly spit-balling here... these aren't really thought through ideas. I just feel like we could use roc scripts to simplify a lot of things for our own tooling.

Brendan Hansknecht (Feb 11 2025 at 00:32):

It possible. If we do so, I think we just need to make sure to pick a limited set of cutting points with really clear apis.

Brendan Hansknecht (Feb 11 2025 at 00:33):

Potentially even separating them completely from the IR so the IR can change separately from the api (probably required anyway essentially to translate from zig to roc)

Brendan Hansknecht (Feb 11 2025 at 00:34):

JanCVanB said:

Are the alternatives (b) not having this functionality and (c) having third parties write plugins in Zig?

d. standard c api shared libaries

Richard Feldman (Feb 11 2025 at 00:37):

also, roc glue does not support effects by design, because it means you know it is always 100% harmless to run anyone's glue script

Richard Feldman (Feb 11 2025 at 00:37):

all it's going to do is to spit out files into the directory you requested, because that's all it knows how to do :big_smile:

Luke Boswell (Feb 11 2025 at 00:38):

My gut feeling here is that we should kick the tooling can down the road a little further -- and option (d) aka libroc is probably the best option, but there's a lot of design work around the interpreter etc that would be good to understand first.

Brendan Hansknecht (Feb 11 2025 at 00:49):

that is actually option (e) techincally. For (d) I meant that the compiler could run liblsp.so instead of making a roc interface for the lsp.

Brendan Hansknecht (Feb 11 2025 at 00:49):

but yeah, I think this part of tooling, we should wait on.

Brendan Hansknecht (Feb 11 2025 at 00:49):

First we should figure out the best abstraction for glue (in the new compiler) and learn from wiring that up happily

Brendan Hansknecht (Feb 11 2025 at 00:50):

Then we should revisit exposing things specifically thinking of the LSP use case

Brendan Hansknecht (Feb 11 2025 at 00:50):

Finally we should think if that can be expanded to more general use cases.

Brendan Hansknecht (Feb 11 2025 at 00:50):

That is at least roughly how I would push on it

Lucas Rosa (Feb 12 2025 at 04:16):

honestly, I've been more tempted to think that compilers should be built lsp-first and out. but maybe that's crazy. not saying we should, I don't think there's enough precedence to justify it. but I get this feeling that at this point a language server ends up inevitable and having a good one makes all the difference in terms of adoption. considering how most people actually spend more time interacting with the LSP than the cli, it kinda makes sense to focus on it first and with highest priority. the cli pass would largely end up being for CI most of the time.

Richard Feldman (Feb 12 2025 at 04:27):

to be honest, I'm not sure how different we'll want it to be

Richard Feldman (Feb 12 2025 at 04:32):

here's a sketch of how it could look:

for the roc check phases of the compiler (which is all that language servers care about) we're already compiling each module separately, with its data structures stored in its own arena so we can serialize them to/from disk easily
we also know the full module dependency graph statically, and it's a DAG (or could be converted to one if we were to allow cyclic imports)
putting these together, we can essentially start with a CLI build, then we have everything in memory, and whenever a module changes, we can invalidate the relevant parts of the graph, reset their arenas, and redo the relevant work to get their info up to date again

Richard Feldman (Feb 12 2025 at 04:33):

at that point we have all the parsing, canonical IR, and type info in memory and up to date...so we can expose an interface to ask questions about that

Richard Feldman (Feb 12 2025 at 04:34):

maybe I'm missing something, but I don't really understand what specific advantages a fundamentally "query-based" compiler architecture would have over that :sweat_smile:

Joshua Warner (Feb 12 2025 at 04:35):

What you're describing is a manually-orchestrated query-based compiler architecture

Richard Feldman (Feb 12 2025 at 04:38):

I guess? The relevant part to me is that it sounds like it has about 98% code reuse with the batch compiler :big_smile:

Joshua Warner (Feb 12 2025 at 04:38):

Also - in the context where you need to potentially do type inference globally, the more traditional query-based architecture doesn't really do much for you. That's most convenient where you have very clear cut-points in the graph (e.g. function boundaries)

Joshua Warner (Feb 12 2025 at 04:39):

(side note: I would like to explore automatically carving out such boundaries based on where type annotations occur in the source code)

Richard Feldman (Feb 12 2025 at 04:39):

like when I watched Anders talking about C#'s Roslyn compiler architecture years ago, or when I looked at salsa in Rust, they seem wildly different architecturally

Joshua Warner (Feb 12 2025 at 04:41):

If you squint hard enough, it's the same

Richard Feldman (Feb 12 2025 at 04:41):

the thing is, my assumption is that rerunning type checking on an individual module will be so fast it won't matter, as long as we don't have to redo any work in other modules to do it - which is already the case unless you're doing something like a rename of an exposed thing

Joshua Warner (Feb 12 2025 at 04:42):

Agree that we should make the batch case super duper fast as a first priority, and only fall back to other strategies when that clearly _can't_ keep up.

Lucas Rosa (Feb 12 2025 at 19:28):

@Richard Feldman yea makes sense, it's probably not much different, especially if things are just fast enough

Lucas Rosa (Feb 12 2025 at 19:29):

just vague musings I've had lately on my part

Richard Feldman (Feb 12 2025 at 19:40):

yeah like if we can single-threaded roc check a file in under 1 ms per 2K lines of code, then as long as we have uninterrupted access to a CPU core, we can do that at 120fps for all but the top 1% of humongous individual files

Brendan Hansknecht (Feb 12 2025 at 23:42):

I guess it depends on the speed of cross file dependency resolution with cached can IR. That likely will get slower in large projects

Last updated: Aug 17 2025 at 12:14 UTC