This is a pretty open thought, but I am questioning if we should change the Dev backend to have an ir that can be stored in a flat array (rather than a tree structure/not existing), that is essentially a portable assembly (one op per function in the current Assembler trait).
So it would be kinda akin to llvm ir, but just what we need to support the dev backend. It would also just be for defining within a single procedure, and any jmp type instructions would just use an integer as the target, i.e. which other instruction in the array to jump to.
The main gain of doing this is for future minor optimizations that we want to run on the ir, the big one being register allocation. With the current IR, register allocation, and lifetimes in general, are very much not a clean problem. Mono is a tree. It also is not nearly as fine-grained as individual assembly instructions. So it is very hard to have any idea how many registers will be needed for a specific mono ir node.
I think that we should have a very fast pass that runs over mono and blits out a flat array of assembly-like instructions (essentially equivalent to the current dev backend, but less complex due to having infinite registers, for example).
Then we would run register allocation on the flat array (lots of good linear-time algorithms exist for this). Given the assembly ir won't perfectly match real assembly instructions, we will have a backend-dependent function that labels exactly how many temporary registers an instruction will need. That is all the information we should need to deal with register allocation.
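For concreteness, here's a rough sketch of what such a flat, per-procedure ir could look like in Rust. Everything here is hypothetical (made-up names, not the real Assembler trait ops); the point is just virtual registers plus integer jump targets into the same array:

```rust
// Hypothetical sketch of a flat, per-procedure IR.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct VReg(u32); // "infinite" virtual registers before allocation

#[derive(Clone, Copy, Debug)]
enum Op {
    LoadArg { dst: VReg, index: u8 },
    AddI64 { dst: VReg, lhs: VReg, rhs: VReg },
    // Jump targets are plain indices of other instructions in the same array.
    Jump { target: u32 },
    JumpIfZero { cond: VReg, target: u32 },
    Ret { value: VReg },
}

// One procedure is a flat Vec of ops, blitted out in a single pass over mono.
struct FlatProc {
    ops: Vec<Op>,
}
```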
Once that works, we can consider other minor but fast optimizations:
- grouping instructions for final output
- intentionally reusing source and dest reg to avoid extra data movement instructions
- freeing up the base pointer by calculating all offsets with the stack pointer
- storing some aggregate values in registers if available instead of on the stack
- maybe some minor and quick rewrites
Especially given we generate at the procedure level, it should be possible to run this part of the dev backend in parallel. Only merging the final generated assembly into an output object file.
I think this will help clean up some of the complexities of the dev backend related to where data is stored and how it is loaded. It should help separate that work cleanly from the rest of the ir selection at a minimum. It also gives us some opportunities for minor but important optimization passes over dev backend ir.
What are people's thoughts on this? I would guess @Folkert de Vries and @Richard Feldman would have the most input.
so my default thinking on anything to do with the dev backend is "fewer passes is faster, so let's minimize passes"
obviously there can possibly exist a way to take a given implementation and actually make a faster implementation which does more passes, but by default it's going to be slower
so my immediate reaction is "to what extent could we achieve some of these goals without introducing any new passes or IRs?"
I think it is important to start with, we definitely want ok register allocation in the long term. It will be important to get enough performance that end users actually want to use the dev backend for reasonably sized projects. On top of that, it will be useful for making the dev backend output assembly easier to read and debug.
I don't think it will be feasible to generate ok register allocation while depending only on Mono IR.
Or more accurately, we probably could find a way to force it to happen, but it would be exceptionally messy. The current super simple register and stack allocation strategy built on mono is already messy.
I'll just say that in principle I'm on board with this idea. The current register allocation is fundamentally limited, I don't think there is a (good) way to traverse the tree structure and perform the register allocation in one pass
so my default thinking on anything to do with the dev backend is "fewer passes is faster, so let's minimize passes"
My general thoughts are:
so could it be done in 1 extra pass?
as in:
Yeah. Though technically today we do one pass on mono to record lifetimes and in the new form we would do one pass on the new ir to do register allocation instead.
that sounds good to me then! :+1:
One extra note, if we do any of the other extra optimizations that I mentioned, they would all run on this IR (and some may actually just be extensions to the linear register allocator pass). So definitely no other IRs.
What's the status of this?
nobody is working on it right now, as far as I know
but it would be awesome if anyone wanted to get it going! :smiley:
Yeah, totally unstarted
Just documented
also how would this pass / preserve registers when jumps happen (what's the calling convention)?
Jumps or calls?
Either; calls would be the c calling convention
Jumps are up to us but theoretically can keep everything alive.
If stashing would happen inside a loop from a join point, we would just want to hoist whatever is possible before the join point so it isn't stashing on repeat in the loop
calls
Of course still might have stashing in a loop due to locals for the loop
Yeah, calls are just c abi.
c is just preserving the arguments on the stack, right?
Anytime one comes up, we just ban a list of registers and have to stash what's in them
Yeah, we stash to the stack whenever out of register space
got it
In the GitHub issue, I link to some good articles on linear-time register allocation passes
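For reference, the core of the linear-scan idea over a flat instruction array looks roughly like this. This is only a sketch with made-up types, and the spilling step is elided:

```rust
// Live intervals come from one pass over the flat array; allocation is one
// pass over the intervals sorted by start point.
#[derive(Clone, Copy)]
struct Interval {
    vreg: u32,  // virtual register this interval belongs to
    start: u32, // first instruction index that defines or uses it
    end: u32,   // last instruction index that uses it
}

fn linear_scan(mut intervals: Vec<Interval>, num_phys_regs: usize) -> Vec<Option<u8>> {
    let max_vreg = intervals.iter().map(|i| i.vreg).max().unwrap_or(0) as usize;
    intervals.sort_by_key(|i| i.start);

    let mut assignment: Vec<Option<u8>> = vec![None; max_vreg + 1];
    let mut free: Vec<u8> = (0..num_phys_regs as u8).rev().collect();
    let mut active: Vec<Interval> = Vec::new(); // intervals currently holding a register

    for interval in intervals {
        // Expire intervals that ended before this one starts, freeing their registers.
        active.retain(|a| {
            if a.end < interval.start {
                if let Some(reg) = assignment[a.vreg as usize] {
                    free.push(reg);
                }
                false
            } else {
                true
            }
        });

        if let Some(reg) = free.pop() {
            assignment[interval.vreg as usize] = Some(reg);
            active.push(interval);
        } else {
            // Out of registers: a real implementation would spill the interval
            // with the furthest end point to the stack here.
        }
    }

    assignment
}
```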
@Wizard ish are you interested in diving into this? :smiley:
@Richard Feldman Perhaps, although I'll have to freshen up on my assembly first
that's awesome! feel free to post any questions here! :grinning_face_with_smiling_eyes:
For the most part you should not have to generate any assembly directly. Cause we already have the assembler trait. That said, brushing up on assembly will definitely help use the trait, map it to an ir, and debug issues.
yeah and i also want to re-familiarize myself with just the whole thing yk
been like a year or two since ive done any assembly
Makes sense
only one thing i really disagree with, and that's that the ir should use an integer as the target for jumps, which, given that the whole point would be for optimization, seems like it would be pretty hard to deal with (i would make essentially "label" no-ops that are used when generating the actual machine code, so basically like assembly). For instance, say we have a function like this (a very contrived example, and i also didn't test this on any assembler; note that destination-first order is used):
store LOC eax ;0
add ebx eax ;1
load eax [LOC] ;2
JumpTo: ;3
sub ebx #1 ;4
cmp ebx #5 ;5
jge JumpTo ;6
if we first transform the labels to the instructions they reference (technically LOC is also a label but that's not important) we get:
store LOC eax ;0
add ebx eax ;1
load eax [LOC] ;2
nop ;3
sub ebx #1 ;4
cmp ebx #5 ;5
jge [3] ;6
Then when optimized, we get
add ebx eax ;0
sub ebx #1 ;1
cmp ebx #5 ;2
jge [3] ;3
Where the jge is now a loop, not the desired behavior! However, if we look up the labels after, it is first optimized to:
add ebx eax ;0
JumpTo: ;1
sub ebx #1 ;2
cmp ebx #5 ;3
jge JumpTo ;4
Then we realize the labels:
add ebx eax ;0
nop ;1
sub ebx #1 ;2
cmp ebx #5 ;3
jge [1] ;4
We fix the problem
I don't think we plan to support those kinds of optimizations
Moving a node's index is really expensive. It would require shifting all later nodes in the array.
Not if it was a linked list
That is a lot of unnecessary cost
We only want something super slim
This is the dev backend
got it
We can pattern match on final assembly generation if we want, but I don't think we want any passes that actually mutate this array in large ways (like inserting or deleting a node)
Register and stack info probably will be sideband to avoid inserting a bunch of load and store instructions into the ir.
At least that is the hope if it works out in practice
Just one quick pass to go from mono to a flat array of roughly-assembly. N passes that just edit nodes in place or add sideband info. A final pass to take all that info and actually generate the final assembly.
That's roughly the theory and hope.
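As a sketch, the overall shape would be something like this (reusing the hypothetical Op from the earlier snippet; again, all names made up):

```rust
// Hypothetical pipeline shape: passes never insert or delete ops, they only
// rewrite ops in place or fill in sideband tables.
struct Sideband {
    // e.g. physical register per virtual register (from linear scan),
    // stack offsets, lifetimes, ...
    reg_assignment: Vec<Option<u8>>,
}

fn compile_proc(ops: &mut [Op]) -> Vec<u8> {
    // Passes 1..N: in-place edits and sideband info only.
    let sideband = Sideband { reg_assignment: Vec::new() };

    // Final pass: walk the flat array once and emit real machine code,
    // consulting the sideband tables for registers and stack slots.
    let mut machine_code = Vec::new();
    for op in ops.iter() {
        emit_one(op, &sideband, &mut machine_code);
    }
    machine_code
}

// Backend-specific: maps one Op to one or more real instructions.
fn emit_one(_op: &Op, _info: &Sideband, _out: &mut Vec<u8>) {}
```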
Note: I totally could be off about a constraint here. I have not thought about this in a long time or dug through it in a ton of detail. For background context: there was debate around whether we should even add another ir. What we do currently is super ad hoc, but it generates assembly crazy fast. We really want to keep as much of that speed as possible, just get slightly more reasonable dev assembly.
But the target is still roughly -O0
assembly
oh yeah i thought it was gonna be something that would act as an IR for all the backends, so in other words, Roc -> IR
, and then IR -> LLVM
, IR -> WASM
, and, for the dev backend, just have an evaluator to get rid of the giant overhead of compilation
but that makes a lot less sense now that i think about it
Also why isn’t the dev backend just an interpreter? Wouldn’t that make it much faster
Roc isn't limited to being used through the cli, you can also compile it and embed it in other things. Also this interpreter would need to be platform specific.
No, simply for the dev backend, as the idea for it is to be used for... developing. So keep LLVM, use that for release builds, and use dev for debugging.
I think @Ryan Barth had some interesting ideas in this direction. Not sure where we got to with that.
ok, also, just because i feel like there is definitely something I'm doing wrong in #7249, how does one sign old commits that have already been pushed
An admin can bypass and merge, or you can interactively rebase and sign. I have a note on my laptop how to do it if you want to try that.
yeah the rebasing, ive tried that a couple times but for some reason it appears to not fix the issue...
Nvm, I just realised what we were talking about. All done :grinning:
Thanks for working on that, and your patience
ok ty and np (alas this is why i need to read the fine print of stuff...)
Pro tip, you can do git rebase --exec 'git commit --amend --no-edit -S' main
to rebase on main and sign all your commits.
It can be a lot more complicated if you got merges in your commits as well
Merges? I thought we lived in a civilized world?
Sorry I did not see your comment, I did a force push after my rebase :p
Hopefully no one else is working on my feature branch :-)
Anthony Bullard said:
Merges? I thought we lived in a civilized world?
I think that a rebase only world is delightful, but I have really learned that many people can't cope with rebase. To be fair, rebase gets absolutely horrid if you make millions of small commits. (Which I think is way more common in open source).
Also, git and GitHub are only mid at best. If they were better, rebasing might be more common.
Should 128-bit registers be part of this IR? as far as I understand, 128-bit values have to be moved to stack, then mov'd to 2 64-bit registers in our current design, but if we're going to have some abstract IR register map to chip register anyway, I wonder how tricky it would be to allow 1 IR register to map to one or two chip registers based on the size?
I don't think so. Cause you don't necessarily have to have both parts loaded at the same time. And that can enable better register use.
So better to see it as a tuple of 2 u64
do you have an example in mind? I suspect it won't be useful at the Mono->IR translation level to split across two IR registers, but rather something we do when translating IR->asm. In other words, it's not useful to work on one half at a time in gen_dev/generic64/mod.rs (which is where I imagine translating Mono to IR happens), but rather in gen_dev/generic64/x86_64.rs (where translating IR to x86 asm happens).
(I'm assuming also that allocating registers happens in the IR->asm steps)
I actually think that registers should just be unspecified in size, and however large they need to be is specified at IR->Asm according to instruction size, with perhaps an 'extend' and 'truncate' op, as it doesn't really seem to be about "how much can we optimize instructions" so much as "how much can we optimize memory usage", but that's just my opinion
The main goal is to avoid tons of extra instructions that are just spilling and loading from stack.
Not really about memory usage, more about huge time sink and perf hit
Just too heavy to be reasonable in a loop
ASM doesn't work on 128 bit values, only register size or smaller values. I think keeping that abstraction is useful to make the target as simple as possible
That said, infinite sized registers are common with more complex register allocation algorithms.
So maybe would be valuable, haven't thought this through fully
Given that, due to following c abi, most things will be required on the stack anyway, I had leaned towards keeping it simple and storing the true value of more things on the stack (like is done today). Then only solve register allocation for live subvalues that are actually being acted on.
Definitely possible to do the reverse. Just changes the infrastructure farther away from the simpler and more direct design of today (everything large has space on the stack and we only ever load register sized chunks of it)
I could be wrong, but I think with infinite sized registers, you likely need more passes and a more complex algorithm to complete register allocation
fair, and just to be clear, I haven't thought much past my recent attempt to implement add/sub/mul x checked/overflow/wrap for u128/i128 as asm rather than bitcode call. It feels like it would make those cases simpler, but raised the question here because that's a pretty narrow view.
Hmmm yeah that makes sense, although abstracting that way makes it fundamentally harder to translate it to 32 bit
And also it might save a bit of memory by splitting registers (which would only be possible if we weren’t assuming each register is 64 bits)
But that would undoubtedly make it much more complex
One thing that would be hard to implement without memory apart from 64-bit registers is structs
As you'd need to use multiple registers
Although you could just have “large registers” that will probably just become memory locations
re "following c abi, most things will be required on the stack", for the very specific case of those arithmetic functions that only take two u128/i128 arguments, they fit entirely in registers; and the return fits in rax+rdx as well. I haven't found a way to work with u128/i128 args that doesn't first push them to stack, so that's a case where not having 128bit registers (or a way to say "give me the two 64bit registers that this 128bit argument has been already loaded into") ends up with more mov's between stack/register.
This could also be over-optimizing an edge case. I don't know how much roc code will be using Num.mul on u128/i128, but maybe Dec?
Yeah, I do think that having registers of different sizes would be useful, even if it was a little more complex, either with or without splitting them. And yeah, Dec is a case, and a very large case, of where a 128-bit register would be useful (though frankly I think if performance is a concern, f64 or perhaps Frac should be used instead)
Wizard ish said:
Although you could just have “large registers” that will probably just become memory locations
Yeah, I guess this is essentially what we have now
Perhaps a way to "unsplit" registers, ie, 'rax' and 'rbx' together become a 128-bit 'rabx'
if an example is any help, here's the change to impl add u128/i128 as asm.
I don't quite follow that change, why is the argument pushed onto the stack? If it is already in registers, couldn't it be just used directly from those registers?
We could have it so the virtual registers can have "hints" for the max size, so if they're moved into memory they can be smaller than the regular guarantee (ie if a register is only used in add-32 and mul-16 instructions and it has to be moved into memory at that point, instead of giving it a full 64 bits, it could just be given a 32-bit slot of memory). This would happen in IR->Backend
Brendan Hansknecht said:
I don't quite follow that change, why is the argument pushed onto the stack? If it is already in registers, couldn't it be just used directly from those registers?
that's just me not understanding the best way to get u128 args. I couldn't find any examples using registers, only other things that pushed to stack, then read what was pushed.
ok
Also, I had assumed this ir would be just for generic64, I guess if it is for all of gen dev and a future theoretical generic32, it would need a more featured solution
That said, if it is too generic, it can't do register allocation
For example, you can't register allocate until you know if an i64 fits in a register or belongs on the stack
If it belongs on the stack you have to generate multiple instructions to handle the addition
And that set of instructions needs register allocation
Given we don't want too much complexity, it is nice to have the abstraction that everything larger than a register is always on the stack. An optimization would be to allow things to be partially on the stack and partially loaded. A different optimization would be to use 2 registers as the cut off cause that matches c abi (this is the issue referenced with adding u128 currently and requiring dumping to stack). I'm not sure any of those optimizations need to be in the dev backend
My biggest concern with being too generic is that it will lead to more required transformations and a significant cost in time to generate a binary with the dev backend.
Yeah that's the thing ain't it, can't be too complex
That is why I would lean towards just solving a bespoke solution for generic64 and not worrying about 32 bit at all
32bit really needs a different asm trait. Which means it needs an entirely different ir.
yeah, I don't know what the right answer is, but "complex structs on the stack, primitives in registers" sounds correct. The question is what to do with u128/i128 and Dec. Are these primitives or complex?
Although I feel that the stack should be abstracted as “struct registers” just to be the abstraction, although that might cause a performance problem
I think you are probably right
About the abstraction
The assembly would have functions to load from and store to "struct registers". Then it would have tons more functions to operate on normal registers. It would later calculate stack offsets for all struct registers (maybe even using lifetime analysis to reuse stack space)
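Roughly like this, say (a hypothetical trait, not the real Assembler; VReg as in the earlier sketch):

```rust
// "Struct registers" only support loads and stores at compile-time-known
// offsets; everything else operates on ordinary register-sized values.
#[derive(Clone, Copy, Debug)]
struct StructReg(u32); // later lowered to a stack offset

trait RoarOps {
    // struct registers: load/store register-sized chunks at known offsets
    fn load64(&mut self, dst: VReg, src: StructReg, offset: u32);
    fn store64(&mut self, dst: StructReg, offset: u32, src: VReg);

    // normal registers: the bulk of the instruction set
    fn add_i64(&mut self, dst: VReg, lhs: VReg, rhs: VReg);
    fn mul_i64(&mut self, dst: VReg, lhs: VReg, rhs: VReg);
    // ...and so on, mirroring the existing Assembler ops
}
```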
Wait so I was right about it being a good idea or a bad idea?
Also it could just have functions to go from struct to normal
That having struct registers as an abstraction for the stack is a good idea
shua said:
yeah, I don't know what the right answer is, but "complex structs on the stack, primitives in registers" sounds correct. The question is what to do with u128/i128 and Dec. Are these primitives or complex?
I think they have to be seen as complex due to no real assembly operations being available for them. If we want to do a minor optimization, we could make a special optimization to allow storing them in 2 registers.
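For example, a 128-bit add really decomposes into two 64-bit adds where the high half consumes the low half's carry (add/adc on x86-64, adds/adc on aarch64). Illustrated in plain Rust rather than asm:

```rust
// Each u128 is kept as two 64-bit halves (two ordinary registers).
fn add_u128(lo_a: u64, hi_a: u64, lo_b: u64, hi_b: u64) -> (u64, u64) {
    let (lo, carry) = lo_a.overflowing_add(lo_b);
    let hi = hi_a.wrapping_add(hi_b).wrapping_add(carry as u64);
    (lo, hi)
}

fn main() {
    let a: u128 = u64::MAX as u128 + 5;
    let b: u128 = 123;
    let (lo, hi) = add_u128(a as u64, (a >> 64) as u64, b as u64, (b >> 64) as u64);
    assert_eq!(((hi as u128) << 64) | lo as u128, a.wrapping_add(b));
}
```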
I think you're right. I only have one counterpoint, which is that in x86_64, mul rcx will have as dest rax and rdx, which is a 128-bit value split over two 64-bit registers.
Yeah, it is definitely imperfect. We need a smart way for assembly operations to claim registers temporarily.
Hmmm well for the meantime tomorrow ill write up something to describe it just so we’re all on the same page
:folded_hands:
Oh, and one final thing, how should results be given? (all of the following accomplish the same thing; ACC is an implicit accumulator)
1. add A B, then mov C ACC
2. mov C A, then add C B
3. add C A B
also IMO it should be called ROAR :)
ROAR = Roc Optimizable Abstract Representation
I would suggest (2)
yeah that is the most common, but i also like option 1, as it allows for calls to functions and their returns to be modeled quite simply, although of course that increases complexity
Current assembly is all 3 arg like arm
I think that is nicest
Encode result register in the ASM directly
i think all have their merits personally
So it all matches 3 today
although given its purpose 3 is probably the best (it makes it very apparent what registers are being modified)
Given risc v and arm are both 3 and x86 is the only special one (and it is trivial to decompose to x86), I would push for 3.
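As a sketch of how trivial that decomposition is (hypothetical, registers as plain numbers): a 3-operand add C A B on x86 becomes a mov plus a 2-operand add, and the mov disappears whenever the destination already aliases an operand:

```rust
// Lower a 3-operand `add dst, src1, src2` to 2-operand x86-style text.
fn lower_add_x86(dst: u8, src1: u8, src2: u8, emit: &mut Vec<String>) {
    if dst == src1 {
        emit.push(format!("add r{dst}, r{src2}"));
    } else if dst == src2 {
        // addition commutes, so fold the other operand in instead of clobbering src2
        emit.push(format!("add r{dst}, r{src1}"));
    } else {
        emit.push(format!("mov r{dst}, r{src1}"));
        emit.push(format!("add r{dst}, r{src2}"));
    }
}
```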
ok then itll be 3 :)
so i think thats it, which is to say in about fifteen minutes ill probably have another question :)
Also, you can always do (add a a b for in-place, if register allocation realizes it is ok)
Wizard ish said:
also IMO it shoud be called ROAR :)
ROAR = Roc Optimizable Abstract Representation
Also, what do yall think, yea or nay for ROAR?
I'm here for the name. It's fun to pick something like that and have it persist for a decade or so and have people take it seriously
Made a first draft of the proposal
Looking great, thanks for taking the time to do this design work. I think it will really help to clarify things.
I've copied some random notes from skimming back through the thread.
I think it would be helpful to provide a little more context/background or a summary to motivate this change. We can talk about how the dev backend is currently very fast but maybe fragile or overly complicated... (my words, no idea if this is accurate).
Goal: reduce complexity of the dev backend related to where data is stored and how it is loaded
Goal: maintain fast compilation speed for the development backend, while improving runtime performance.
we definitely want ok register allocation in the long term. It will be important to get enough performance that end users actually want to use the dev backend for reasonably sized projects.
it will be useful for making the dev backend output assembly easier to read and debug.
Does this improve our ability to test things? should we mention that?
change the Dev backend to have an ir that can be stored in a flat array (rather than a tree structure/not existing), that is essentially a portable assembly (one op per function in the current Assembler trait). Main gain of doing this is for future minor optimizations that we want to run on the ir. The big one being register allocation. With the current IR, register allocation, and lifetimes in general, are very much not a clean problem. Mono is a tree. It also is not nearly as fine grain as individual assembly instructions. So it is very hard to have any idea how many registers will be needed for a specific mono ir node.
Should we include this motivation in the ROAR proposal?
I may have missed it, but is this mentioned in the ROAR proposal? 32-bit architectures are out of scope.
"It would also just be for defining within a single procedure and any jmp type instructions would just use a integer as the target that is which other instruction in the array to jump to."
Is this covered in the proposal?
Possible future optimisations
- grouping instructions for final output
- intentionally reusing source and dest reg to avoid extra data movement instructions
- freeing up the base pointer by calculating all offsets with the stack pointer
- storing some aggregate values in registers if available instead of on the stack
- maybe some minor and quick rewrites.
Would be good to include future work in the context of ROAR
All "extra" optimizations can be evaluated one at a time to see if they meet our tradeoffs.
Future optimizations can be evaluated individually. how would we do this? is this something we can benchmark easily?
it should be possible to run this part of the dev backend in parallel. Only merging the final generated assembly into an output object file.
Is this possible or a goal? would be good to mention it.
Ok, yeah Ill get to adding those (thanks for taking time to review it)
Sorry for being slow about commenting on this (energy for roc has been low lately). Will try to write up more comprehensive comments hopefully this afternoon PST.
I have two ideas, both of which involve function calls, that I'm interested to hear your opinions on:
- While local jumps have been covered as being absolute addresses, I think that there should be more addressing modes. As mentioned before, ROAR will definitely be split up into functions, with each one being a discrete chunk of code. However, I think we should also have larger segments, with those being groups of functions, each one representing a single program unit (a library or file, perhaps).
- Relatedly, ROAR does not have a way to deal with IO and other specific needs. As the goal of ROAR is to be simple, I would propose a simple "black box" instruction, int N (derived from the instruction in x86 for interrupts), which would do a special action N which can't be modeled in ROAR. IMO this would go right along with Roc's notion of "platforms", and they could be correlated (although this isn't substantiated by any logical reason, just a general sort of feeling).
Also, just making sure about this (because it's a pretty big thing): the goal would be to go from Roc->ROAR->Bin as opposed to the current Roc->Mono->Bin?
Not an expert, but presumably it'd be Roc->Mono->ROAR->Bin
Monomorphization is the process that takes generic functions and spits out statically typed functions
It's probably not worth having to duplicate that effort in ROAR
Let me read the proposal first
wait a moment, is Roc Mono different from .Net Mono?
it's not like an extension?
I don't know dot net's Mono
But Roc mono is a lot like Rust's mono
oh that's why a lot of stuff about it didn't make much sense
It's a phase of the compiler after type checking where we take type checked code and spit out IR that is simple yet concrete
So if I have a function that typechecked to numToStr : Num * -> Str, and I use it with the inputs U64 and I8, I'll get two functions out of mono: numToStr : U64 -> Str and numToStr : I8 -> Str
Yeah that makes more sense...
I was wondering why Roc used the Microsoft IR, but this makes way more sense
Now, that's based on my incomplete understanding of what:
https://en.wikipedia.org/wiki/Monomorphization
Monomorphization is good because it trades a larger binary size for better performance
If you don't do monomorphization, you end up with a single function that needs to check the input's type at runtime
More or less
I thought Mono Ir referred to this :speechless:
https://github.com/roc-lang/roc/tree/main/crates#roc-compiler-stages
This might be helpful
Specialize is the term in that diagram
yeah that was my bad for really reading things too quickly
As in "take each specific use of generic entities and make a specialized version"
Welcome to the club
Don't worry about it too much. I think the more important part is that you're helping us make a big improvement to a tricky part of the compiler. That's awesome!
yeah, well fortunately it doesn't actually affect my understanding of it that much, as the idea is basically the same, but thank you for the patience :)
Of course
Wizard ish said:
I have two ideas, both of which involve function calls, that I'm interested to hear your opinions on:
- While local jumps have been covered as being absolute addresses, I think that there should be more addressing modes. As mentioned before, ROAR will definitely be split up into functions, with each one being a discrete chunk of code. However, I think we should also have larger segments, with those being groups of functions, each one representing a single program unit (a library or file, perhaps).
- Relatedly, ROAR does not have a way to deal with IO and other specific needs. As the goal of ROAR is to be simple, I would propose a simple "black box" instruction, int N (derived from the instruction in x86 for interrupts), which would do a special action N which can't be modeled in ROAR. IMO this would go right along with Roc's notion of "platforms", and they could be correlated (although this isn't substantiated by any logical reason, just a general sort of feeling).
So yeah, what do yall think about these?
I don't think we'll have a need for either. The dev backend already has solutions for both.
The dev backend only ever worries about generating a single procedure at a time. As such, each procedure can be 100% self contained without any worry of what other procedures are doing. So we never need long jumps or other special things like segments. I think it would be 100% fine for ROAR to slot in such that it is only used during single function compilation. I think that it will actually fit best there.
This is the same for io. There is no need to worry about it. We never do it directly in roc. All of this will always go through a function call. So the call instruction is all you will need.
Using relocations + call instructions solve all of the above nicely.
So I would imagine the compilation process as:
You say each "function in mono", which is currently all top-level values, since we convert (at least most) top-level constants to thunks with force_thunk AFAICT. I presume this strategy would work if/when we decided to move towards non-thunked top level constants, but just wanna call that out in case there would be an issue
Sure, we should either force thunks for the dev backend or we would also have to do constant generation, but for constants, we wouldn't need ROAR at all.
Maybe I missed something in the discussion, but it would still need to translate complex constants, right? We said above that even I128 would get multiple "registers", so a struct/discriminated union would need to be loaded like any other variable that was stored in a prior register. This should be very minor compared to the main scope of ROAR, though
no, they will all just be equivalent to the constant stored on the stack. So they are just a chunk of bytes that we get the address of and load data from.
So they would in the roar proposal just be represented as a structured register (which is really just a pointer to memory)
Oh and relating to your comment on the PR, I've moved the proposal to its own repo here
comments on the proposal:
Just one thing, if registers aren't being explicitly saved, how will the program differentiate registers that are set for the purpose of returning values and those that simply hold intermediate values? Would the plan be for the return and call ops to take an argument that is which register should be returned / saved to?
Oh, this is actually a really good point I hadn't fully thought through. In the current design, due to having access to mono, a call or return instruction is given a list of symbols. Those symbols are then loaded and dealt with according to the c abi for that system. If we are abstracting into ROAR, we have to decide how to deal with that mapping and calling convention. Hmm.
So that gives three main options I can think of:
I lean 3, but am open to other opinions and ideas for sure.
Personally I think 1 is better, as it keeps the process pure (ie, machine agnostic and separate from mono), which would probably help with maintainability
Also if we restricted ROAR functions to returning one argument, we could just have 'call' instructions take an output register that would then be set to a corresponding register returned with 'ret'
As for passing arguments, this could be even more simple, just pass them in registers, and only differentiate it from the original once it’s been set
It definitely has to learn to speak c abi no matter what, sadly. It has to be callable from the platform and be able to call the zig builtins and platform effects.
This is why the dev backend currently speaks c abi. I found it easier to be consistent than to make exceptions for the edges.
but shouldn't this be abstracted as part of the ROAR->Binary interface, given that one of the primary goals is to be platform agnostic? (which could be solved by something like this?)
I don't think the goal of roar is really to be platform agnostic. It is trying to abstract away a level of assembly details, but register allocation (which will run on roar) is platform specific. On top of that, the specific handling of the stack that will come out of converting from structured registers to stack use is also platform specific.
So roar should be expected to pull in platform details. That said, I think that 3 is platform agnostic. Roar will simply record the information it needs to pass to the final assembly generation for it to deal with generating the correct call conv.
The reason I don't think we need something like black box is that nothing is black box. They are just function calls, the same as any other function call.
ok got it
I'm not sure how much you have looked at the current traits, but we actually have one for assembly and a second for callconv. If you look at a callconv trait impl, this is what making a function call looks like today: https://github.com/roc-lang/roc/blob/f8b0be725c1677dd1c39cc68723fe4ad5ce09ade/crates/compiler/gen_dev/src/generic64/x86_64.rs#L292-L337
ROAR could call essentially the same primitive.
Just instead of using Symbol from mono, it would use abstract and struct registers.
so if i'm understanding correctly you would just have the call function accept any number of args, including one for where to place the return value?
yep
With the minor caveat that we may want to handle storage in a special way to avoid bloating the roar enum in rust. Don't want to take up too many bytes per assembly instruction
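For example (hypothetical, reusing VReg from the earlier sketch), the call op could store just a range into a sideband args array, so every enum variant stays a few words wide:

```rust
// The op itself only holds a (start, len) range; the actual argument list
// lives in a sideband Vec owned by the procedure.
#[derive(Clone, Copy)]
struct ArgRange {
    start: u32,
    len: u32,
}

enum CallishOp {
    Call { target: u32, args: ArgRange, ret: Option<VReg> },
    // ... other ops
}

struct ProcWithCalls {
    ops: Vec<CallishOp>,
    call_args: Vec<VReg>, // sideband storage indexed by ArgRange
}
```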
I've been making a little progress with the proposal, just wanted to make sure I'm putting down correctly what yall mean, in particular about the function calls https://github.com/wizard7377/ROAR
I'm really liking how this is looking!
I'm not sure the value of making cmp, jmp, ret, etc. functions return a null value, but if in practice it helps build out a consistent interface, that sounds fine. I assume the assembly will map to a rust enum. The powerful pattern matching in rust should make it ok for some ops to differ.
ROAR is that every function is pure (with respect to registers)
I'm not 100% sure of the caveats of the meaning here, but it is important to note that some values passed into functions will get mutated. That said, it will never be a register value. It will always be a pointer to data on the heap that gets mutated (like a list with a refcount of one).
I think this likely falls into what you are saying, but wanted to clarify
Also, for structured registers (at least as roc exists today), all loads and stores should be to a compile-time known offset. Also, we will need the ability to load at different widths, but that can just be a load8 vs load32 vs load64.
And roc will make sure that all loads are properly aligned based on the layout of the data. So we should have no concerns there.
Also, for jumps, I know that roc only has structured loops, so we may be able to take advantage of something there. But I'm not sure what exactly. Just noting.
Oh, one other note for jmps. Roc's jumps technically have arguments, but we should be able to map that into editing the same roar registers over and over again.
Yeah, I might've forgotten that the references not being mutated doesn't correlate to the referents not being mutated
Also can you explain a little bit more about what you mean about them taking arguments, do you mean they should be similar to RISC branch instructions (ie, the instruction contains the condition)
I just mean that in mono ir, jumps look a lot like function calls.
There is a block declared with a joinpoint. The join point has a list of args. Within the block you can access any constants from before the joinpoint and the args as well. At some point in the main code flow, it calls a jump instruction to a join point with a list of args.
In roar, this could be modeled as raw assembly style jumps with register updates before the jump to set the args. Or it could be modeled closer to a function call. Not sure either is better, just worth noting and thinking about.
yeah personally im much more familiar with x86 and 6502 assembly, and in those, jmps basically only change the instruction pointer, so that just seems more natural to me, but only because that's the only thing ive ever encountered
Yeah, rocs mono isn't trying to be assembly. So makes sense it has a high level concept
should ROAR have separate floating point registers?
Would be nice for a tiny bit of extra safety, but would be fine to start with unified registers and then only convert to float vs general registers during allocation.
So I think either is fine before register allocation. I lean slightly towards just having them
yeah i also lean on having them, there isn't really a reason not to, and basically all systems do, so it probably will be more efficient
Did some work on the proposal (still not done)... and i may have just realized gh md has a diagramming feature... https://github.com/wizard7377/ROAR#optimizations
Ok I think this is the first draft, is there anything any of you think needs to be added to this (obviously needs to be cleaned up a lot)
I'll read through it when I get the chance, but packing tonight and traveling tomorrow.
nice safe travels :)
Also, on the llvm side, I think I am going to greatly simplify abi (including the c abi we generate). Dev backend may want to do the same (might not really affect the proposal). The core change is that all aggregate types will be passed by const ref. Then only enums and numeric types are passed in registers.
if anything that actually works with the proposal quite nicely
I don't know very much about c abi, but I can say that this was a well-written doc that even a mid-wit (as regards assembly) like myself could understand
Thank you :)
So could I say (as a midwit as noted above) that this is a SSA representation that's meant to be a level higher than say LLVMIR because it can assume a lot of Roc's language semantics?
But still low-level enough that we can use those semantics to drive efficient optimization passes before codegen?
Yes, that is the primary purpose
Ok, I passed my comprehension test. I'm going to bed now :-)
I'll be following the (assumed) implementation of this, as I am very interested in learning a lot in this area
A few thoughts on the ROAR document:
type Stmt { Load (Register, Expr), Store (Register, Expr), Ret, ...} and type Expr { Add (Register, Register), Sub (Register, Register), ... }
jmp
other than jmp
itself, and model conditional jumps via branches to basic blocks. Reducing the number of nodes in the IR makes it easier to implement a backend, which I think is the best cost function for the IR.
I personally think the null register is quite important. It allows statements to be consistent internally as one structure, rather than having expressions which can be statements, which would significantly complicate the process of obtaining modified registers. In addition, sometimes one wants to have expressions which don't write to anything, for example on x64 'cmp' vs 'sub', where the only difference is whether they write to something or just set flags. In addition, the null register wouldn't be that hard to implement, and in case this wasn't clear, the null register can't be read; it just serves as a "trash" register
Also the purpose of this is to be stored in a flat array, which wouldn’t allow for nested expressions
(Note, refer back to the first message in this channel)
Brendan Hansknecht said:
This is a pretty open thought, but I am questioning if we should change the Dev backend to have an ir that can be stored in a flat array (rather than a tree structure/not existing), that is essentially a portable assembly (one op per function in the current Assembler trait).
So it would kinda be akin to llvm ir, but just what we need to support the dev backend. It would also just be for defining within a single procedure and any jmp type instructions would just use a integer as the target that is which other instruction in the array to jump to.
The main gain of doing this is for future minor optimizations that we want to run on the ir. The big one being register allocation. With the current IR, register allocation, and lifetimes in general, are very much not a clean problem. Mono is a tree. It also is not nearly as fine grain as individual assembly instructions. So it is very hard to have any idea how many registers will be needed for a specific mono ir node.
I think that we should have a very fast pass that runs over mono and blits out a flat array of assembly like instructions. (essentially will be equivalent to the current dev backend but less complex due to having infinite registers for example).
Then we would run register allocation on the flat array (lots of good algorithms with linear time exist for this). Given the assembly ir won't perfectly match to assembly instructions, we will have a backend dependent function that will label exactly how many temporary registers an instruction will need. That is all the information we should need to deal with register allocations.
Once that works, we can consider other minor but fast optimizations:
- grouping instructions for final output
- intentionally reusing source and dest reg to avoid extra data movement instructions
- freeing up the base pointer by calculating all offsets with the stack pointer
- storing some aggregate values in registers if available instead of on the stack
- maybe some minor and quick rewrites.
Especially given we generate at the procedure level, it should be possible to run this part of the dev backend in parallel. Only merging the final generated assembly into an output object file.
I think this will help clean up some of the complexities of the dev backend related to where data is stored and how it is loaded. It should help separate that work cleanly from the rest of the ir selection at a minimum. It also gives us some opportunities for minor but important optimization passes over dev backend ir.
What are people's thoughts on this? I would guess Folkert de Vries and Richard Feldman would have the most input.
Also, just making sure of something: ROAR would require reimplementing the Backend trait, but not the Assembler and CallConv traits (as the Backend trait takes in mono as input)
Yeah, might even end up replacing the backend trait with just a raw method to go from mono to roar and a separate method that directly calls out to the assembler and call conv traits
also was just thinking, it might be useful to have a concept of "true global variables" in ROAR. While struct registers persist, they require references being passed around, and what I'm thinking of would just be accessed with its symbol. This wouldn't be preferred (ie, local vars would still be in abstract registers) but i think it might make it a lot more clean
oh also if no one has any more comments on this, ill probably start working on it tomorrow
While struct registers persist, they require references being passed around, and what I'm thinking of would just be accessed with its symbol.
I'm not sure I fully follow what you are suggesting here
basically having things like struct registers, but instead of symbols being dynamically generated their symbols persist from mono
I think most of the symbols in mono are locals. There are true globals. For those, we will need some solution. The final output is a chunk of data and a relocation to load that data locally in a function.
Also, my only main comment on the proposal is that I would try to stay very close to the assembler trait to start with (in terms of which operations to support). For example, it looks like we currently only need a jmp and a jne instruction. Clearly more optimizations are possible and in some cases other jump instructions might map better, but I think it will make the transition smoothest if we start as close as possible. Then incrementally make things more powerful.
Oh and instead of using things like u32 directly, wrap them in nominal types to avoid using an index as a register or a float register as an int register etc
I think that was already the plan, just explicitly clarifying
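i.e. the usual Rust newtype pattern (hypothetical names), so the different index spaces can't be mixed up at compile time:

```rust
// Zero-cost wrappers: each index space gets its own type.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct GeneralReg(u32);

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct FloatReg(u32);

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct OpIndex(u32); // jump targets: an index into the instruction array

fn jump_to(target: OpIndex) {
    let OpIndex(_i) = target;
}
// jump_to(GeneralReg(3)); // would be a compile error, which is the point
```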
also just so yall can comment if you like, i made a draft pr at #7397
first piece of ROAR (handcoded into rust), only the printing is done
[%32] {
%78 <- addu %73 %32
_ <- subs %73 %32
}
also i do plan to make the code not break every single style guideline possible
Left a handful of comments
ok, yeah i wasn't thinking that much about the 128 bit types, those might be a small problem
actually having the abstract registers be sized might not be that bad of an idea
also maybe separating structure references from regular registers
I think it should be fine to see 128-bit values as a structured register containing two 64-bit registers
That is how operations function on them anyways
okay
Mostly called them out to make sure they are being thought of
Relating to your comment here, what do you think should be the max size per instruction?
I would hope it can fit into 4 u64s, but that may not be realistic.
1 for op code and any sort of metadata
1 for output register
2 for input registers
If an op might need more than 2 input registers, the input u64s would instead be a pointer and length into a sideband array of extra inputs.
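Something like this hypothetical packing would fit that budget:

```rust
// One u64 of opcode + metadata, one output register, and either two inline
// inputs or a (start, len) pair into a sideband array of extra inputs.
#[repr(C)]
struct PackedOp {
    opcode_and_meta: u64,
    output: u64,
    input_a: u64, // or sideband start index
    input_b: u64, // or sideband length
}

const _: () = assert!(std::mem::size_of::<PackedOp>() == 32); // 4 * u64
```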
i don't think it will be necessary to have them actually modeled as inputs, as they should be constant, and therefore are not a part of the internal logic
also just realized it was never discussed if function pointers should be included or not
I think a function pointer can be treated the same as an int constant with a special op to create it.
ok!
why does roc use bump allocation? It does seem to speed it up, but it makes the code much less clean (same thing with layout interning)
Cause we want a really fast compiler. That is also why it is written in rust
Obviously there is still large room for improvement, but we care a lot about compile time.
fair enough
Also, I won’t be able to work on this for a couple days
On a related note, anyone know of any cheap, small, low-middle end pc’s?
(I somehow managed to break my laptops keyboard by cleaning it… so now the whole thing is crippled)
update on this, the classic strategy of doing nothing about it has worked, and now my keyboard is working again :partying_face:
Hey everyone, sorry about the slow progress, it’s just that school takes up a lot of my time, and in particular recently I’ve just been doing way too much stuff, hopefully however I’ll be able to do more work on ROAR this week (and after that) :)
It's alg, we're all volunteers here juggling life things. :smiley:
error[E0515]: cannot return value referencing local variable `env`
--> crates/compiler/build/src/program.rs:589:5
|
577 | roc_gen_dev::build_module(&env, &mut interns, &mut layout_interner, target, procedures);
| ---- `env` is borrowed here
...
589 | / (
590 | | CodeObject::Vector(module_out),
591 | | CodeGenTiming {
592 | | generate_final_ir,
... |
600 | | },
601 | | )
| |_____^ returns a value referencing data owned by the current function
error[E0515]: cannot return value referencing local variable `layout_interner`
--> crates/compiler/build/src/program.rs:589:5
|
577 | roc_gen_dev::build_module(&env, &mut interns, &mut layout_interner, target, procedures);
| -------------------- `layout_interner` is borrowed here
...
589 | / (
590 | | CodeObject::Vector(module_out),
591 | | CodeGenTiming {
592 | | generate_final_ir,
... |
600 | | },
601 | | )
| |_____^ returns a value referencing data owned by the current function
error[E0515]: cannot return value referencing local variable `interns`
--> crates/compiler/build/src/program.rs:589:5
|
577 | roc_gen_dev::build_module(&env, &mut interns, &mut layout_interner, target, procedures);
| ------------ `interns` is borrowed here
...
589 | / (
590 | | CodeObject::Vector(module_out),
591 | | CodeGenTiming {
592 | | generate_final_ir,
... |
600 | | },
601 | | )
| |_____^ returns a value referencing data owned by the current function
error[E0505]: cannot move out of `interns` because it is borrowed
--> crates/compiler/build/src/program.rs:597:13
|
549 | fn gen_from_mono_module_dev_assembly<'a>(
| -- lifetime `'a` defined here
...
562 | mut interns,
| ----------- binding `interns` declared here
...
577 | roc_gen_dev::build_module(&env, &mut interns, &mut layout_interner, target, procedures);
| ------------ borrow of `interns` occurs here
...
589 | / (
590 | | CodeObject::Vector(module_out),
591 | | CodeGenTiming {
592 | | generate_final_ir,
... |
597 | | interns,
| | ^^^^^^^ move out of `interns` occurs here
... |
600 | | },
601 | | )
| |_____- returning this value requires that `interns` is borrowed for `'a`
error[E0505]: cannot move out of `layout_interner` because it is borrowed
--> crates/compiler/build/src/program.rs:598:13
|
549 | fn gen_from_mono_module_dev_assembly<'a>(
| -- lifetime `'a` defined here
...
564 | mut layout_interner,
| ------------------- binding `layout_interner` declared here
...
577 | roc_gen_dev::build_module(&env, &mut interns, &mut layout_interner, target, procedures);
| -------------------- borrow of `layout_interner` occurs here
...
589 | / (
590 | | CodeObject::Vector(module_out),
591 | | CodeGenTiming {
592 | | generate_final_ir,
... |
598 | | layout_interner,
| | ^^^^^^^^^^^^^^^ move out of `layout_interner` occurs here
599 | | expectations: loaded.expectations,
600 | | },
601 | | )
| |_____- returning this value requires that `layout_interner` is borrowed for `'a`
Some errors have detailed explanations: E0505, E0515.
For more information about an error, try `rustc --explain E0505`.
anyone know how to fix this? (full state is here https://github.com/wizard7377/roc/tree/roar)
(note the change in this signature https://github.com/wizard7377/roc/blob/d4de3b1545fcfd3bbb4c6ccc0ee7b918c663208d/crates/compiler/gen_dev/src/object_builder.rs#L49)
anyone able to help? (rust should definitely have something to bypass the borrow checker in development...)
I can look when I'm done with #7530
thank you!
(this has been driving me insane)
@Wizard ish using the latest roar branch commit I currently only get a mismatched types error at crates/compiler/build/src/program.rs:456:17. That's with cargo build --release. How can I produce your borrow checker errors?
run with -F use_roar
(the flag is temporary and will be replaced with a sensible build system later)
also if it's helpful this is the mess that i use to compile it clear && RUSTFLAGS="-Awarnings" cargo run -F use_roar -- --dev ../tests/fizzbuzz.roc
sorry it's really a mess
No problem, we all have to learn by doing :)
fn build_module_help<'a, 'r : 'a>(
env: &'r Env<'a>,
interns: &'r mut Interns,
layout_interner: &'r mut STLayoutInterner<'a>,
target: Target,
procedures: MutMap<(symbol::Symbol, ProcLayout<'a>), Proc<'a>>,
) -> Object<'a> {
Why are 'a and 'r linked? I don't think that is correct: 'r : 'a. Not that I 100% know what that syntax means. Anyway, I think you should be able to fully remove 'r?
yeah it was that i changed it bc in the Converter it (the borrow checker) required 'b to outlive 'a (the Converter's 'a), which then required 'r to outlive 'a (of the above function)
Something is definitely wrong if 'r needs to outlive 'a. 'a is essentially a global lifetime. 'r is temporary for the current function. So the code creating that requirement must be wrong
Oh wait :skull: i somehow forgot the most crucial point, without that requirement, i get this error
error: lifetime may not live long enough
--> crates/compiler/gen_dev/src/roar/convert.rs:240:22
|
162 | ...c, 'a: 'c, 'b: 'c> Converter<'a, 'b>
| -- -- lifetime `'b` defined here
| |
| lifetime `'a` defined here
...
240 | ... &mut sym_map.insert(*sym, (Output::Register(new_reg.clone()), repr, form...
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ argument requires that `'b` must outlive `'a`
|
= help: consider adding the following bound: `'b: 'a`
error: could not compile `roc_gen_dev` (lib) due to 1 previous error
(all the random lifetimes are me just trying to fix this)
I think you have a loop that is angering rust. env -> arena -> converter -> back to env.
That in general won't work
likely converter shouldn't be allocated in the env at all
Should just be passed around on the stack
Then just converter -> env, interner, etc
no loop
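A minimal sketch of that shape (hypothetical names, not the real roc types):

```rust
struct Env<'a> {
    module_name: &'a str,
    // arena, interns, etc. would live here
}

// The converter borrows the env instead of living inside it, so there is no
// env -> converter -> env cycle for the borrow checker to reject.
struct Converter<'e, 'a> {
    env: &'e Env<'a>,
    next_register: u64, // scratch state for converting one procedure
}

impl<'e, 'a> Converter<'e, 'a> {
    fn new(env: &'e Env<'a>) -> Self {
        Converter { env, next_register: 0 }
    }

    fn fresh_register(&mut self) -> u64 {
        let r = self.next_register;
        self.next_register += 1;
        r
    }
}

fn build_proc(env: &Env<'_>) {
    // Created per procedure on the stack, dropped when the procedure is done.
    let mut converter = Converter::new(env);
    let _r0 = converter.fresh_register();
    let _name = converter.env.module_name;
}
```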
ok
Wait, so only the converter object, or also everything created by it?
I assume most things can be created in the env. The problem with converter is specifically the loop.
so just converter
And I think you will be able to remove a bunch of type variables; &'c mut self will just be &mut self.
And I think proc_queue: RefCell<Vec<'b, (symbol::Symbol, &'b ir::ProcLayout<'b>, &'b ir::Proc<'b>)>> likely should be all 'a? Cause those are all allocated in the env, so they have the same lifetime as the bump in the env.
yeah this is a little bit of a mess...
Part of learning rust in a codebase that uses arenas (thus tons of extra lifetimes to manage)
the good news is it's fixed :)
the bad news is that it was fixed by using incredibly bad practice to get around the borrow checker...
Hey, given the plans to completely rewrite the compiler, does it make sense to continue this project in Rust (as opposed to switching it to Zig)?
I'd say switch to zig!
okay :)
As a note, some of the rewrite plans may make this project obsolete. As part of the switch, we want to try using an interpreter instead of the dev backend.
We need an interpreter anyway for constant folding and some other work and we think that an interpreter should be capable of running fast enough that it makes the dev backend a lot less useful as a proposition
So we want to start by trying to only have the interpreter and skipping the dev backend.
hmmm okay, although still might be useful as a sort of in between for the interpreter and the new mono ir
Interpreter is planned to run higher up the stack (it may still want a specific IR, but haven't thought through the details yet).
Totally will be lots of things to help with and IR design work, just might not map straight to roar.
I'm really sorry about this and any time that was wasted. The big zig rewrite seems like the best time to try something like only using an interpreter. Assuming only using an interpreter doesn't work out, or if we end up wanting a dev backend again, this work would revive, but at least for now, the hope is to avoid any dev backends if possible.
Nah it’s all cool, plus I’ve been putting off learning zig so this’ll be fun
I appreciate your having an awesome attitude about it! :smiley: