Stream: show and tell

Topic: autograd in Roc


view this post on Zulip Ayaz Hafiz (Dec 31 2024 at 07:32):

Had some good fun implementing autograd in Roc:
https://gist.github.com/ayazhafiz/88aec4c50b6403a9f343b13998a41886
The TLDR behind autograd for those not already in the know: you build up some computational expression (e.g. any function y depending on any number of variables), and then you can run two passes - a "forward" pass that computes the value of y given the values of the variables, and a "backward" pass that computes the derivative of y with respect to each variable. If y is a function you want to minimize (a loss function), you can then use those derivatives to update the variables and incrementally push the value of y closer and closer to zero. This is the basis for a lot of neural networks (there's a simple multi-layer NN in the example above that tries to fit y = x^2 + z^2 with a neural net)
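
For concreteness, a minimal hand-rolled sketch (not the gist's code) of one training update for y = x^2 + z^2, with the gradients dy/dx = 2x and dy/dz = 2z written out by hand rather than derived from a computation graph:

# Minimal sketch, not the gist's API: one gradient-descent step for y = x^2 + z^2.
# The gradients are hard-coded here; the autograd version derives them from the graph.
step : { x : F64, z : F64 }, F64 -> { x : F64, z : F64 }
step = \{ x, z }, lr ->
    dydx = 2 * x # dy/dx of y = x^2 + z^2
    dydz = 2 * z # dy/dz
    { x: x - lr * dydx, z: z - lr * dydz }

# Each step moves x and z toward 0, which is where y is minimized.
expect
    after = step { x: 3, z: 4 } 0.1
    after.x < 3 && after.z < 4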

view this post on Zulip Ayaz Hafiz (Dec 31 2024 at 07:33):

This is quite a simple implementation: it's not optimized, probably has bugs, and doesn't support tensors (matrices) like a real library would

view this post on Zulip Ayaz Hafiz (Dec 31 2024 at 07:41):

A few thoughts:

view this post on Zulip Sam Mohr (Dec 31 2024 at 07:43):

Seemingly all positive!

view this post on Zulip Ayaz Hafiz (Dec 31 2024 at 07:50):

A few things I struggled with:

view this post on Zulip Sam Mohr (Dec 31 2024 at 07:58):

On the bad type comparisons, I think there was a discussion of doing a git diff-like delta, e.g.

It seems like you have a type mismatch:

  List {
>     item: Str,
<     item: U64,
  }

view this post on Zulip Sam Mohr (Dec 31 2024 at 07:59):

Which might not be directly related to the [..] is not equal to [..] issue, but a rework of that code as part of a diffing feature push should fix it as well

view this post on Zulip Sam Mohr (Dec 31 2024 at 08:00):

The LSP definitely feels ad hoc at the moment. The main thing I use it for is displaying warnings/errors, but I also need to keep a "sixth sense" as to when it has crashed and for what reason, and remember to restart it in those cases

view this post on Zulip Ayaz Hafiz (Dec 31 2024 at 08:01):

I'd love folks' feedback on how to make this faster. In general autograd benefits heavily from in-place mutation - you do many rounds of training, and both within a training round and between them you'd like to re-use the same memory and mutate in-place. One particular problem with my implementation is stuff like this:

            Mul m n ->
                # x = m*n, dy/dm = dy/dx * dx/dm, dx/dm = n
                mval = forward m xs # ouch
                nval = forward n xs # ouch
                grads1 = go m (dydx * nval) grads
                grads2 = go n (dydx * mval) grads1
                grads2

Basically, when computing the gradients for x and y for a function z = x*y, we know that dz/dx = y and dz/dy = x. That means we need to compute the values of x and y, which is why forward is called. But this is wasteful, because forward is always called before backward, so there's no new information gained by doing this. Especially if the computation graph/equation is very large beneath the particular node (pretend for example we are doing x * y where x and y are themselves equations of a large # of variables), this can get expensive really quickly.

The alternative I was thinking of is to have a representation like

Graph : [
    Const F64 F64, # forward pass value, constant value
    Var F64 Str, # forward pass value, variable name
    ... # etc
]

and change the signature of forward to Graph, Vars -> Graph, i.e. directly compute the value of the forward pass and save it in the computation graph, so that the value is pre-cached when you go to back-propagate the gradients. The issue with this is twofold. The first issue is trivial: I think you would ideally want to lift the forward-pass value out and have a record like Node = {forward_pass: F64, op: Op}, but that runs into the record recursion mentioned before, though that seems solvable.
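
A hedged sketch of that direction (illustrative only - it assumes Vars is a Dict Str F64 and adds a Mul tag to the cached Graph shape above; these are not the gist's actual definitions):

# Assumed shape (extending the Graph above): each tag's first payload is the
# cached forward-pass value.
Graph : [
    Const F64 F64,       # cached value, constant value
    Var F64 Str,         # cached value, variable name
    Mul F64 Graph Graph, # cached value, operands
]

# Read the cached forward value off any node.
valOf : Graph -> F64
valOf = \node ->
    when node is
        Const v _ -> v
        Var v _ -> v
        Mul v _ _ -> v

# Rebuild the graph with the caches filled in, so backward never has to call forward.
forward : Graph, Dict Str F64 -> Graph
forward = \node, vars ->
    when node is
        Const _ c -> Const c c
        Var _ name ->
            v = Dict.get vars name |> Result.withDefault 0
            Var v name
        Mul _ m n ->
            mf = forward m vars
            nf = forward n vars
            Mul (valOf mf * valOf nf) mf nf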

The second is that reconstructing the entire AST needs to re-use memory, or else it will be quite expensive to do this over and over (imagine, on the low end, a 100K-parameter graph with 100K+ training rounds). And I'm not sure there's a way to guarantee that. This gets more problematic once you introduce higher-dimensional tensors (i.e. actual matrices, not just the scalars used here).

Anyway, curious if anyone has thoughts or ideas

view this post on Zulip Ayaz Hafiz (Dec 31 2024 at 08:01):

Sam Mohr said:

Seemingly all positive!

Yep! it was great how well this worked

view this post on Zulip Eli Dowling (Dec 31 2024 at 18:35):

Yeah any time code with recursive or nested types starts coming into play the language server is pretty well useless.

In my experience, this is because the compiler has a lot of bugs that cause hangs in that domain, and so the language server constantly ends up stuck with a hanging compiler.

We should be able to terminate the hanging process, and it is set up to do that, but I think I don't know enough about async Rust and I'm not yielding correctly or something, so it can never actually kill the hanging process

view this post on Zulip Ayaz Hafiz (Dec 31 2024 at 19:04):

ah okay, one issue i was running into with the LS was #compiler development > bug: Outstanding references to the derived module @ 💬 - removing the main.roc fixed it

view this post on Zulip Brendan Hansknecht (Dec 31 2024 at 19:24):

Basically, when computing the gradients for x and y for a function z = x*y, we know that dz/dx = y and dz/dy = x. That means we need to compute the values of x and y, which is why forward is called.

Couldn't you make forward take an optional grad dictionary (or a different variant of forward that takes grad)? Then just set the values during the forward pass?

view this post on Zulip Brendan Hansknecht (Dec 31 2024 at 19:25):

Also, when you have a Dict.get followed by a Dict.insert, you can get some extra efficiency by using Dict.update. Avoids looking up the key twice.
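
A hedged sketch of that pattern for accumulating gradients (the exact callback shape of Dict.update has varied across builtins versions; the [Present v, Missing] form is assumed here):

# Accumulate into a gradient entry with a single lookup, instead of Dict.get
# followed by Dict.insert. Assumes Dict.update's callback takes and returns
# [Present v, Missing]; adjust if your builtins use Result v [Missing].
addGrad : Dict Str F64, Str, F64 -> Dict Str F64
addGrad = \grads, name, dydx ->
    Dict.update grads name \entry ->
        when entry is
            Present old -> Present (old + dydx)
            Missing -> Present dydx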

view this post on Zulip Brendan Hansknecht (Dec 31 2024 at 19:45):

Or, I guess it would be a value cache instead of a gradient cache

view this post on Zulip Ayaz Hafiz (Dec 31 2024 at 19:52):

Couldn't you make forward take an optional grad dictionary (or a different variant of forward that takes grad)? Then just set the values during the forward pass?

That would work, but I believe unique IDs would then have to be created for each node in the graph somehow. Because, for example, if I'm at x = a*b where b = e^c (but being just at x, I don't know that the second operand is b = e^c), I need the result of b - and b is an intermediate value, not a variable I can index by name

view this post on Zulip Brendan Hansknecht (Dec 31 2024 at 20:12):

https://gist.github.com/bhansconnect/effa61cb21e879e28b6cc816fbb2850e

view this post on Zulip Brendan Hansknecht (Dec 31 2024 at 20:12):

This just makes a new graph for going backwards

view this post on Zulip Brendan Hansknecht (Dec 31 2024 at 20:12):

Solid perf gains. Definitely could be made cleaner.

view this post on Zulip Brendan Hansknecht (Dec 31 2024 at 20:16):

That said, making new nodes instead of mutating in place definitely is not as nice as it could be.

view this post on Zulip Brendan Hansknecht (Dec 31 2024 at 20:19):

Using your last expect and increasing to 1000 rounds of training, I see:

Summary
  ./grad-new ran
    1.53 ± 0.04 times faster than ./grad

view this post on Zulip Ayaz Hafiz (Dec 31 2024 at 20:43):

yeah that def works. my only concern is that it creates a new tree for each forward pass. but yeah, def better

view this post on Zulip Brendan Hansknecht (Dec 31 2024 at 21:34):

Yeah, not sure the best way to map this into roc.

view this post on Zulip Ayaz Hafiz (Jan 01 2025 at 22:11):

another interesting thing is that the program spends much of its time in Dict operations. for small graphs (small # of vars) it looks like an association list is faster
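
For reference, a minimal sketch of what such an association list might look like (illustrative only, not the code that was measured):

# A plain list of key/value pairs, searched linearly. No hashing, and the pairs
# sit contiguously in memory, which is why it can beat Dict for a few variables.
assocGet : List (k, v), k -> Result v [KeyNotFound] where k implements Eq
assocGet = \pairs, key ->
    when List.findFirst pairs (\(k, _) -> k == key) is
        Ok (_, v) -> Ok v
        Err NotFound -> Err KeyNotFound

assocInsert : List (k, v), k, v -> List (k, v) where k implements Eq
assocInsert = \pairs, key, value ->
    when List.findFirstIndex pairs (\(k, _) -> k == key) is
        Ok i -> List.set pairs i (key, value)
        Err NotFound -> List.append pairs (key, value)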

view this post on Zulip Brendan Hansknecht (Jan 01 2025 at 22:23):

Makes sense. No hashing.

view this post on Zulip Brendan Hansknecht (Jan 01 2025 at 22:23):

Also denser memory.

view this post on Zulip Brendan Hansknecht (Jan 01 2025 at 22:23):

Not to mention, our dict is two loads due to being an index map (hash -> index -> list of kv).

view this post on Zulip Brendan Hansknecht (Jan 01 2025 at 22:24):

Also, equality can fail fast; hashing cannot.

view this post on Zulip Ayaz Hafiz (Jan 01 2025 at 22:24):

yeah

view this post on Zulip Ayaz Hafiz (Jan 01 2025 at 22:43):

oh, another thing that came up: I was thinking about how to support arbitrary differentiable operations, like if someone wants to write a custom Sigmoid op or something. I think the best API would be an interface (e.g. abilities). But you can't hold on to opaque values of abilities (i.e. values hidden behind a pointer, where the concrete type is never materialized), so this doesn't work
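
A hedged sketch of what that ability-based API might look like (the names here are made up; defining and implementing the ability is fine, the problem is that a graph node cannot store "some value of any type that implements DiffOp"):

# Hypothetical ability for custom differentiable ops (illustrative names only).
DiffOp implements
    evalOp : op, F64 -> F64 where op implements DiffOp
    gradOp : op, F64 -> F64 where op implements DiffOp

# One concrete implementation. This part works fine...
Square := {} implements [DiffOp { evalOp: squareEval, gradOp: squareGrad }]

squareEval : Square, F64 -> F64
squareEval = \@Square {}, x -> x * x

squareGrad : Square, F64 -> F64
squareGrad = \@Square {}, x -> 2 * x

# ...but there is no way to write a graph node that stores "any DiffOp value"
# without naming the concrete type, which is the limitation described above.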

view this post on Zulip Richard Feldman (Jan 01 2025 at 22:52):

@Ayaz Hafiz do you think we should support that?

view this post on Zulip Ayaz Hafiz (Jan 01 2025 at 22:53):

idk. i think it might make sense at some point, but probably not right now

view this post on Zulip Richard Feldman (Jan 01 2025 at 22:53):

seems like it would require doing something conceptually similar to lambda sets, where we make a tag union behind the scenes of all the different instantiations that come up in practice
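
A tiny illustration of that idea, reusing the hypothetical DiffOp/Square sketch above plus an imagined Sigmoid implementation: if those were the only implementations appearing in a program, the compiler could conceptually lower "some DiffOp value" to a closed tag union behind the scenes.

# Hypothetical compiler-generated union: one tag per concrete DiffOp implementation
# that shows up in the program, dispatched much like lambda sets are today.
AnyDiffOp : [OpSquare Square, OpSigmoid Sigmoid]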

view this post on Zulip Ayaz Hafiz (Jan 01 2025 at 22:53):

yeah, or you could compile it to the boxed representation

view this post on Zulip Ayaz Hafiz (Jan 01 2025 at 22:55):

i also feel like there must be some generalization of reset-reuse to support in-place mutation of trees, e.g. when doing something like

Node : [
    Const U64,
    Add Node Node,
    Mul Node Node,
]

somewalk : Node -> Node
somewalk = \node ->
    when node is
        Const x -> Const (x + 1)
        Add m n -> Add (somewalk m) (somewalk n)
        Mul m n -> Mul (somewalk m) (somewalk n)

which would make it a non-issue to create and tear down trees between passes, since they're almost always unique (and this is a common operation in anything graph shaped, be it ML or compilers or whatever)

view this post on Zulip Ayaz Hafiz (Jan 01 2025 at 22:56):

however, i'm struggling to find an intuition that holds up.

view this post on Zulip Ayaz Hafiz (Jan 01 2025 at 22:59):

oh wait, im silly

view this post on Zulip Ayaz Hafiz (Jan 01 2025 at 23:00):

this works fine under reset-reuse, so it is updated in place

view this post on Zulip Norbert Hajagos (Jan 04 2025 at 09:13):

Ayaz Hafiz said:

I have a branch that solves this in the dev backend that got silently left behind while implementing the llvm backend. I'll revive it today as I should have done quite some time ago.

view this post on Zulip Ayaz Hafiz (Jan 04 2025 at 17:01):

:thinking: the bug occurs during typechecking though, i believe

view this post on Zulip Norbert Hajagos (Jan 05 2025 at 13:48):

Oh, yes... I thought you meant you couldn't write tail recursive functions using tag unions where the recursive data (op) was inside a struct, not directly in the payload of a tag union. Realized we're talking about a completely different problem.

view this post on Zulip shua (Jan 17 2025 at 18:28):

I haven't really used Enzyme, but I wonder what the tradeoffs are between doing autograd on the Roc source vs doing it on the LLVM IR (or a similar low-level IR)?

https://enzyme.mit.edu/

