Stream: compiler development

Topic: current type inference incomplete...


view this post on Zulip GordonBGood (Jul 05 2024 at 22:54):

Since reliable and "Type Complete" Type Inference is such a key feature of functional languages such as Roc, it seems to me of the utmost priority to be sure that Type Inference is completely decidable in all cases, yet there seem to be many cases where it is not. This may just be due to the development state of Roc with many remaining compiler bugs in this area, but it is my concern that the type system is ambiguous or undecidable in far too many cases, which can be tested in the Roc REPL as follows:

  1. Given a Binary Tree defined as Tree v : [Mt, Br v (Tree v) (Tree v)], try to define tree2 = Br 0 (Br 1 Mt Mt) Mt, which panics, I believe due to trying to resolve the type with nested types, as it works if it is done in two steps as in treei = Br 1 Mt Mt followed by tree2 = Br 1 treei Mt.

  2. A nested Co-Inductive Stream (CIS) defined as CIS a : [CIS a ({} -> CIS a)]: define a function to create a sequence of numbers as mkcis = \ i -> CIS i (\ {} -> mkcis (i + 1)), then a function to create a sequence of sequences of numbers as alls = \ (CIS v vtl) -> CIS (mkcis v) (\ {} -> alls vtl). Now call the second function with a call to the first function as its argument, as in alls (mkcis 2), to get the following message: "This mkcis call produces: […] as * But alls needs its 1st argument to be: […] as *" (this case is assembled into a single snippet after this list). This actually infers the correct type but, likely due to undecidable ambiguities in Type Inference, can't determine that it is in fact the correct type.

  3. Church numerals - although quite an esoteric subject with little practical use, writing Church numeral implementations is an excellent test of the soundness of a type system. Again, Roc fails here, being unable to implement the full set of Church functions as can be done in Elm, with the Church type defined as a union with two tags: one containing Church functions (Church inputs and outputs) and the other Church values (just a Church wrapper for a value), since neither Elm nor Roc has a forall for existential types. Even the slightly more advanced Church functions require nesting of Church types, which causes a panic crash in the compiler.
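
For reference, here is case 2 assembled into a single snippet (code verbatim from the description above) that can be pasted into the REPL:

CIS a : [CIS a ({} -> CIS a)]

mkcis = \ i -> CIS i (\ {} -> mkcis (i + 1))

alls = \ (CIS v vtl) -> CIS (mkcis v) (\ {} -> alls vtl)

alls (mkcis 2)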

Of course some of these problems could be avoided by, for example, implementing the Tree as an array/"List" in memory to avoid having to nest types, by implementing sequences as memory-mapped arrays/"List"s for many use cases, etc., but that just avoids the underlying Type Inference problems...
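
As a sketch of that array-backed workaround (a hypothetical encoding with made-up names, not code from this discussion), the nesting can be replaced by indices into a List, so no type is ever nested inside itself:

Node v : { value : v, left : [NoChild, Child U64], right : [NoChild, Child U64] }

TreeArr v : List (Node v)

# tree2 = Br 0 (Br 1 Mt Mt) Mt from case 1, flattened into index form
tree2arr : TreeArr I64
tree2arr = [
    { value: 0, left: Child 1, right: NoChild },
    { value: 1, left: NoChild, right: NoChild },
]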

Although the problems may just be due to the stage of development of the compiler and bugs in the Type Inference, I would be concerned that the problem runs deeper than that: the ability to resolve "open"/"*" types implies a kind of Existential Types/"RankNTypes" type system, which in turn could mean that the type system isn't "sound" and "complete". I would think it would be of utmost concern to the core development team to determine whether the problems are just software bugs or due to Type Inference algorithm limitations...

It seems to me that much of the language design (the more complex bits anyway) is based on the use of "open" types that are resolved later in the compilation process, and if these cannot be resolved for any but the simplest types, the Roc language can never be more than a "toy" one that has some interesting ideas but isn't anywhere near able to be a general purpose language. I sincerely hope it is not the latter case...

view this post on Zulip Ajai Nelson (Jul 06 2024 at 01:26):

I'm having trouble reproducing the 1st case you provided. Regardless of whether I provide a type signature or not, it doesn't seem to be panicking:

» Tree v : [Mt, Br v (Tree v) (Tree v)]

» tree2 = Br 0 (Br 1 Mt Mt) Mt

Br 0 (Br 1 Mt Mt) Mt : [Br (Num *) [Br (Num *) [Mt] [Mt]] [Mt]]

» tree3 : Tree _
… tree3 = Br 0 (Br 1 Mt Mt) Mt

Br 0 (Br 1 Mt Mt) Mt : Tree (Num *)

» tree4 : Tree I64
… tree4 = Br 0 (Br 1 Mt Mt) Mt

Br 0 (Br 1 Mt Mt) Mt : Tree I64

view this post on Zulip Richard Feldman (Jul 06 2024 at 01:30):

GordonBGood said:

This may just be due to the development state of Roc with many remaining compiler bugs in this area, but it is my concern that the type system is ambiguous or undecidable in far too many cases

by design, Roc's type system should support 100% principal decidable type inference. If it doesn't appear to, then either there's a compiler bug or there's a mistake in the design; either way, it's something we'd want to fix!

view this post on Zulip Richard Feldman (Jul 06 2024 at 01:31):

GordonBGood said:

which panics, I believe due to trying to resolve the type with nested types, as it works if it is done in two steps as in treei = Br 1 Mt Mt followed by tree2 = Br 1 treei Mt.

yep, panic definitely means compiler bug! :big_smile:

view this post on Zulip Richard Feldman (Jul 06 2024 at 01:33):

GordonBGood said:

get the following message: "This mkcis call produces: […] as * But alls needs its 1st argument to be: […] as *".

this is at a minimum a compiler bug; any error message that says "expected Foo but got Foo" is a bug

view this post on Zulip Richard Feldman (Jul 06 2024 at 01:34):

GordonBGood said:

Roc fails here, being unable to implement the full set of Church functions [...] Even the slightly more advanced Church functions require nesting of Church types, which causes a panic crash in the compiler.

of note, something being unimplementable in Roc's type system is unrelated to whether Roc's type inference is decidable in all cases...there are plenty of things Roc's type system is not designed to support! :big_smile:

(a panic is definitely a compiler bug though!)

view this post on Zulip Richard Feldman (Jul 06 2024 at 01:35):

thanks for sharing these small reproductions btw!

view this post on Zulip GordonBGood (Jul 06 2024 at 02:07):

Ajai Nelson said:

I'm having trouble reproducing the 1st case you provided. Regardless of whether I provide a type signature or not, it doesn't seem to be panicking...

I definitely had the problem when I posted it, as I copied directly out of the REPL I was using and confirmed it with the online REPL from the front Roc web page; however, I've now upgraded to the latest Roc nightly and it no longer panics, as you showed, nor does it any longer crash in the online version of the REPL...

Thanks for trying this out and letting me know; maybe I'll try another implementation of a persistent data structure Priority Queue which uses a version of a Tree...

However, there is still the problem with Type Inference for a particular custom Type nested inside itself as required for some algorithms and for Church Numerals...

view this post on Zulip GordonBGood (Jul 06 2024 at 03:44):

Richard Feldman said:

GordonBGood said:

get the following message: "This mkcis call produces: […] as * But alls needs its 1st argument to be: […] as *".

this is at a minimum a compiler bug; any error message that says "expected Foo but got Foo" is a bug

thanks for sharing these small reproductions btw!

Here's another small reproduction showing the type resolution problem, as follows:

CIS a : [CIS a ({} -> CIS a)]

cisTest : {} -> CIS I32
cisTest = \ {} ->
    ycombo = \ f -> (f{}) (ycombo f)
    nextcis = \ (CIS i itl) -> CIS (i + 2) (\ {} -> nextcis (itl{}))
    nxtf = \ cis -> CIS 3 (\ {} -> nextcis cis)
    CIS 2 (\ {} -> ycombo (\ {} -> nxtf))

The above code should compile but fails with:

I'm inferring a weird self-referential type for cisTest:

75│>  cisTest = \ {} ->
76│>      ycombo = \ f -> (f{}) (ycombo f)
77│>      nextcis = \ (CIS i itl) -> CIS (i + 2) (\ {} -> nextcis (itl{}))
78│>      nxtf = \ cis -> CIS 3 (\ {} -> nextcis cis)
79│>      CIS 2 (\ {} -> ycombo (\ {} -> nxtf))

Here is my best effort at writing down the type. You will see ∞ for
parts of the type that repeat something already printed out
infinitely.

    {} -> CIS I32

It determines the type of the function as desired but states that it is an infinite type when it isn't. This time, the CIS is not nested, but I had to use a Y combinator in order to bypass Roc's cyclic reference checking (with injected laziness to prevent the obvious runaway recursion), which is likely what leads the Type Inference astray. This code (or something like it, except for differences in syntax) works in Elm but not in Roc...
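
For context, without the Y combinator the direct formulation would need a value defined cyclically in terms of itself, something like the following sketch (hypothetical names cisTestDirect and rslt; this is the style of definition that the cyclic reference checking mentioned above rejects):

cisTestDirect : {} -> CIS I32
cisTestDirect = \ {} ->
    nextcis = \ (CIS i itl) -> CIS (i + 2) (\ {} -> nextcis (itl{}))
    # rslt refers to itself inside its own definition, which Roc rejects
    rslt = CIS 2 (\ {} -> nextcis rslt)
    rslt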

EDIT_ADD:
Although the equivalent to the above code compiles with Elm, it overflows the stack when run due to not formulating the laziness injection correctly; in order to run correctly, it would need to be written as per the following equivalent Roc code:

CIS a : [CIS a ({} -> CIS a)]

cisTest : {} -> CIS I32
cisTest = \ {} ->
    ycombo = \ f -> f (\ {} -> ycombo f)
    nextcis = \ (CIS i itl) -> CIS (i + 2) (\ {} -> nextcis (itl{}))
    nextr = \ cisf -> CIS 3 (\ {} -> nextcis (cisf{}))
    CIS 2 (\ {} -> ycombo nextr)

Compiling the above code also fails to properly resolve the function type, even though it is actually correct, just as before:

I'm inferring a weird self-referential type for cisTest:

75│>  cisTest = \ {} ->
76│>      ycombo = \ f -> f (\ {} -> ycombo f)
77│>      nextcis = \ (CIS i itl) -> CIS (i + 2) (\ {} -> nextcis (itl{}))
78│>      nextr = \ cisf -> CIS 3 (\ {} -> nextcis (cisf{}))
79│>      CIS 2 (\ {} -> ycombo nextr)

Here is my best effort at writing down the type. You will see ∞ for
parts of the type that repeat something already printed out
infinitely.

    {} -> CIS I32

view this post on Zulip Richard Feldman (Jul 06 2024 at 10:51):

:thinking: what’s the type you would expect it to infer?

view this post on Zulip GordonBGood (Jul 06 2024 at 11:05):

Richard Feldman said:

:thinking: what’s the type you would expect it to infer?

I would expect it to infer the type that it did infer, {} -> CIS I32, but it didn't find that type acceptable, even though it matches the type signature for the cisTest function - another case of the erroneous "looking for type Foo but found type Foo" error...

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 15:40):

I think in this case, roc just doesn't know that lambdas are safe for recursion in types.

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 15:40):

Anything indirect should be safe cause we can calculate a compile time known size

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 15:41):

Box, List, lambda, recursive tag...they all break the chain and should allow for something like this to work.
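
For example (using the types from earlier in this thread verbatim), both of the following are recursive, but the recursion sits behind a tag and behind a function respectively, each of which gives the compiler a point of indirection and therefore a compile-time-known size:

Tree v : [Mt, Br v (Tree v) (Tree v)]

CIS a : [CIS a ({} -> CIS a)]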

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 15:46):

That said, due to a lambda's ability to capture data, and lambda sets not being boxed, maybe I am wrong and this could lead to an infinitely sized type. Not fully sure at the moment; I would have to work out the types with captures to calculate it.

view this post on Zulip Richard Feldman (Jul 06 2024 at 16:29):

GordonBGood said:

Richard Feldman said:

:thinking: what’s the type you would expect it to infer?

I would expect it to infer the type that it did infer, {} -> CIS I32, but it didn't find that type acceptable, even though it matches the type signature for the cisTest function - another case of the erroneous "looking for type Foo but found type Foo" error...

ah I see - yeah I just noticed that in the error message it says to look for an infinity symbol, but then it doesn’t print one :sweat_smile:

so it’s at a minimum a compiler bug of some sort too!

view this post on Zulip GordonBGood (Jul 06 2024 at 21:51):

Brendan Hansknecht said:

That said, due to a lambda's ability to capture data, and lambda sets not being boxed, maybe I am wrong and this could lead to an infinitely sized type. Not fully sure at the moment; I would have to work out the types with captures to calculate it.

Yes, one must consider lambda captures, but they are not "infinite", as captures have a known finite size at compile time. I'm not sure what you mean by "lambda sets not being boxed", other than that lambda function pointers with their captured data may not be boxed to the heap if they don't escape their scope in Roc's implementation.

In Elm, Evan did an early optimization, in anticipation of using reference-counted memory management for WASM, of not allowing cyclic references other than to functions (he missed allowing lambdas as Roc does, but it would be a simple fix to allow them); however, he missed that there still could be cyclic data references if a function/lambda is an escaping closure/has captures that would also be reference counted. Of course, it didn't matter for current Elm, as this optimization was wasted: Elm never went past JavaScript and thus uses JavaScript's garbage collector.

I think (I hope) that when you "work out the types with captures" you will see that lambdas (actually lambda thunks when used this way) are neither infinite types (as in unresolvable because they contain themselves as a component) nor consumers of unbounded memory, as using streams in this way is about the only functional way of returning an infinite collection. If Roc currently has no means to lift these to the heap when they escape their scope, then that would be a serious problem...

Using reference counting on closures IS INDEED a tricky problem, as they MUST be allowed to have cyclic references, as here, in order to be able to do things like this example; I think you need to look into not letting cyclic references (which are known at compile time due to all data being seen as persistent data structures) participate in reference counting, so that only external references control the lifetime of the captures...

Once one has "broken the chain" of cyclic references by making them "weaK" (not participating in reference counts), then there is no longer a reason that cyclic references need to be limited to only functions/lambda's and should work for any data type, even reference counted ones (at least I think so)...

view this post on Zulip GordonBGood (Jul 06 2024 at 22:15):

In the little reproduction that fails here, none of the lambdas are actually closures that capture reference-counted data, as they only refer to external lambdas, none of which capture external values that would be reference counted, so this example should pass anyway. (The captures are the following: f and ycombo in ycombo, itl and nextcis in nextcis, nextcis and cisf in nextr, and ycombo and nextr in the last thunk; none of these lambdas/thunks need reference counting, as none capture heap values - given that lambdas aren't "lifted" when they don't capture.) However, it can't be guaranteed that there would never be captures of external heap-based data, and thus the lambdas may be closures in the general case...

view this post on Zulip Folkert de Vries (Jul 06 2024 at 22:17):

Lambda sets are boxed sometimes; they have to be. Here is an example where the capture does not have a size known at compile time (at least if we assume the functions were somehow dependent on runtime data):

compose : (a -> b), (b -> c) -> (a -> c)
compose = \f, g -> \x -> g (f x)

functions : List (I64 -> I64)
functions = [\_ -> 0, \x -> x + 1, \y -> y - 1]

composed = List.walk functions (\x -> x) compose

Therefore the captures are boxed in this case.

view this post on Zulip Folkert de Vries (Jul 06 2024 at 22:18):

and because it is heap-allocated it must be reference counted, too, though generally that is trivial because there is no sharing

view this post on Zulip Folkert de Vries (Jul 06 2024 at 22:19):

in this particular case there are no functions that capture in functions, but there could be. And then they are composed some unknown number of times as far as type inference is concerned
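
A rough sketch of that growth (illustrative comments only, with the walk state as f and each list element as g in compose):

# step 0:  state0 = \x -> x              captures nothing
# step 1:  state1 = \x -> g1 (state0 x)  captures state0 and g1
# step 2:  state2 = \x -> g2 (state1 x)  captures state1 and g2
# ...each step nests the previous closure's capture one level deeper,
# so the total capture size depends on the runtime length of the list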

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 22:19):

My knowledge is limited here, but let me try to clarify anyway.

In most languages, closure captures are boxed. The compiler just allocates a chunk of memory dynamically and shoves in all of the captured data.

This is not the case in Roc. Roc uses lambda sets. In the lambda set world, all closures secretly take an extra argument. This argument is the captures. Instead of dynamically allocating space for captured data, Roc calculates the union of all possible captures of a specific lambda signature. In fact, the lambda is not even a function pointer at all. Picking the correct function to run just uses an enum and a switch statement. So lambdas in Roc pass around this concrete capture union instead of boxing. This avoids a heap allocation and a pointer indirection.

That all said, it is also one of the most bug-prone parts of the compiler currently. It is complex new territory that should benefit Roc with a lot of performance gains over traditional boxed lambdas.
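
As a minimal sketch of that lowering (hypothetical names FnSet, dispatch, AddN, and Identity; Brendan walks through a fuller example further down this thread), two lambdas that share the type I64 -> I64 become one capture union plus a switch:

# source lambdas, both of type I64 -> I64:
#   \x -> x + n    (captures n : I64)
#   \x -> x        (captures nothing)

# roughly what the compiler derives:
FnSet : [AddN I64, Identity]

dispatch : I64, FnSet -> I64
dispatch = \x, fn ->
    when fn is
        AddN n -> x + n
        Identity -> x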

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 22:20):

When I say an infinite capture, I simply mean that we may be missing a recursion point that requires boxing. Any recursive data requires a pointer indirection or becomes an infinitely sized type.

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 22:21):

So the outer capture would not be boxed, but its recursive data still would need to be.

view this post on Zulip GordonBGood (Jul 06 2024 at 22:26):

Brendan Hansknecht said:

When I say an infinite capture, I simply mean that we may be missing a recursion point that requires boxing. Any recursive data requires a pointer indirection or becomes an infinitely sized type.

Yes, I'm aware of that, which is why records can't use recursive fields, as they are considered to be "value" types, while unions can use recursive "fields", as they are considered to be "reference" types, and the compiler enforces this...

view this post on Zulip GordonBGood (Jul 06 2024 at 22:37):

Brendan Hansknecht said:

In most languages, closure captures are boxed. The compiler just allocates a chunk of memory dynamically and shoves in all of the captured data.

This is not the case in Roc. Roc uses lambda sets. In the lambda set world, all closures secretly take an extra argument. This argument is the captures. Instead of dynamically allocating space for captured data, Roc calculates the union of all possible captures of a specific lambda signature. In fact, the lambda is not even a function pointer at all. Picking the correct function to run just uses an enum and a switch statement. So lambdas in Roc pass around this concrete capture union instead of boxing. This avoids a heap allocation and a pointer indirection.

Roc isn't really unique in this: lots/most/all languages that have closures implement them this way; Roc is unique in that it tries not to automatically allocate the captures on the heap (but must if they escape their scope, as @Folkert de Vries has shown). So avoiding heap allocation is perhaps a good optimization for when closures don't escape, but it can't be used when they must; if Roc is currently trying to forbid closures from escaping, then that is a restriction that must be lifted...

That all said, it is also one of the most bug-prone parts of the compiler currently. It is complex new territory that should benefit Roc with a lot of performance gains over traditional boxed lambdas.

Ah yes, but it doesn't apply generally, as above. And being complex as to determining when it can and can't be applied, how to distinguish when there might be nesting where the nested captures might need to be lifted to the heap, and so on, of course leads to bugs. This almost seems to be a case of "early optimization" that could be left for later, given that Roc currently has many more serious performance issues, as we have discussed before...

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 22:44):

I'm actually surprised that the list of functions is boxed. I thought we would just store a list of lambda set unions and only box on recursion.

view this post on Zulip GordonBGood (Jul 06 2024 at 22:49):

Brendan Hansknecht said:

I'm actually surprised that the list of functions is boxed. I thought we would just store a list of lambda set unions and only box on recursion.

The List doesn't have to be boxed, but the enclosed lambdas-that-might-be-closures need to be, because they could escape their scope - they could be passed anywhere; tracking that the contents of a collection never escape gets even more complex...

view this post on Zulip Brendan Hansknecht (Jul 06 2024 at 22:59):

I'm not talking about the list. I mean the captures. Why would they need to be boxed? Essentially a tuple of captures should be fine there. It just changes what the refcounting function would be.

view this post on Zulip GordonBGood (Jul 06 2024 at 23:01):

How can you guarantee that no use of the enclosing collection, be it tuple or List, will ever escape the scope?

view this post on Zulip Brendan Hansknecht (Jul 07 2024 at 00:19):

I'm not sure I follow the question. Boxing and escape don't have to be strictly related.

Let's modify the code above a bit:

compose : (a -> b), (b -> c) -> (a -> c)
compose = \f, g -> \x -> g (f x)

genFunctions : (I64 -> I64), Str -> List (I64 -> I64)
genFunctions = \firstFn, aCapture ->
    secondCapture = 12
    [ firstFn, \x -> x + Num.toI64 (Str.len aCapture), \y -> y - secondCapture]

composed = List.walk (genFunctions (\_ -> 0) "This is going to be captured and needs refcounting") (\x -> x) compose

In this situation, the lambda set for I64 -> I64 can be seen as a tagged union roughly like:

LambdaSet : [
    FirstFn,
    XFn Str,
    YFn I64,
]

So without any boxing, genFunctions can actually return a List LambdaSet

Then we have the underlying lambda calling function that would be equivalent to:

callLambda : I64, LambdaSet -> I64
callLambda = \input, set ->
    # Note: I inlined all of these functions which llvm probably would do, but they technically generate standalone functions.
    when set is
        FirstFn ->
            _ = input
            0
        XFn aCapture ->
            x = input
            x + Num.toI64 (Str.len aCapture)
        YFn secondCapture ->
            y = input
            y - secondCapture

Values escaping aren't a problem. I just have to run a refcounting function on the LambdaSet type.

view this post on Zulip GordonBGood (Jul 07 2024 at 00:49):

Brendan Hansknecht said:

In this situation, the lambda set for I64 -> I64 can be seen as a tagged union roughly like:

LambdaSet : [
    FirstFn,
    XFn Str,
    YFn I64,
]

Ah, now I see your definition of LambdaSet, although it seems pretty strange that a List of a million lambdas would need a million tags in a tagged union, with one of these reference-counted "LambdaSet"s instantiated for each of those lambdas. Yes, it can escape because the captures are in a reference-counted heap structure, but why the complexity of the "LambdaSet" and the subsequent when ... is switch? (Yes, I know that this would probably be a switch and not a sequence of if ... then ... else's.)

It seems to only trade one way of capturing closures to the heap for another, more complex way (in requiring multiple tags for the different lambdas)...

view this post on Zulip Ayaz Hafiz (Jul 07 2024 at 00:56):

The tradeoff is that you get static dispatch everywhere, which can produce more optimal code. It also means specializing something like List.map to a specific passed closure becomes trivial and comes for free. But there is a mode in development to heap-allocate closures as well.

view this post on Zulip Brendan Hansknecht (Jul 07 2024 at 01:36):

it seems pretty strange that a List of a million lambdas would need a million tags in a tagged union, with one of these reference-counted "LambdaSet"s instantiated for each of those lambdas

Just a note here on cost.

Boxed Closures
A list of 1 million lambdas with boxed closures would be 1 million and 1 allocations for all of the captures (assuming all closures have a capture; of course, closures without captures would just have a null capture).
The list itself is the extra allocation. The List would be storing tuples of 2 pointers: one to the lambda code and one to the boxed data. Each box is sized exactly for the captures with no extra space. Refcounting one is just refcounting the box. To call a closure is a pointer indirection that will never be inlined and is essentially a random jump in memory.

Lambda Sets
A list of 1 million lambdas with lambda sets is 1 allocation. The list contains all of the data. 1 million fits in 32 bits, so the tag is a 32-bit integer. Each element is sized to the largest capture due to being a tagged union. There is a chance that functions will get inlined. If you also get element 3 and the list is constant, llvm can fully inline the function due to the lambda set using static dispatch and a switch. This is a potentially large perf gain (unlikely to actually happen with the 1 million element list, but super common in normal code like List.map). Calling a lambda is static dispatch to a known code region. Even the call to individual closure implementations is known at compile time and can receive extra optimization. It also is more friendly to the branch predictor than pointer indirections. Refcounting one is equivalent to refcounting the closed-over data, so it depends on what is closed over.

The one potentially large drawback is that one giant capture ruins the space efficiency of the lambda set. Luckily, we have control over the tag union layer. We can box any lambda set capture that is too large. Basically, limit to size n, and all elements after n get put in a single box. So only lambda sets with large amounts of captures see any boxing. Also, the boxing here is still better than what happens with fully boxed closures. We are only boxing some of the data. We never box the tag. So we still have static dispatch. Some closure capture data just has to pay a pointer indirection on load.


Also, both obviously have some number of allocations for any closed-over data that is heap allocated.
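
A sketch of that "box only the oversized capture" idea (hypothetical types BigCapture, LambdaSetSketch, SmallFn, and HugeFn, not the compiler's actual layout): the tag itself stays unboxed, so dispatch remains static, and only the large payload pays a pointer indirection:

BigCapture : { matrix : List (List F64), label : Str }

LambdaSetSketch : [
    SmallFn I64,             # small capture stays inline in the union
    HugeFn (Box BigCapture), # oversized capture goes behind a Box
]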

view this post on Zulip GordonBGood (Jul 07 2024 at 03:55):

Ayaz Hafiz said:

The tradeoff is that you get static dispatch everywhere, which can produce more optimal code. It also means specializing something like List.map to a specific passed closure becomes trivial and comes for free. But there is a mode in development to heap-allocate closures as well.

Thanks for the quick explanation of the advantages; however, if there is an alternate development with conventional heap-allocation of closure captures, there must be some trade-offs...

Brendan Hansknecht said:

it seems pretty strange that a List of a million lambdas would need a million tags in a tagged union, with one of these reference-counted "LambdaSet"s instantiated for each of those lambdas

Just a note here on cost.

Boxed Closures
A list of 1 million lambdas with boxed closures would be 1 million and 1 allocations ... To call a closure is a pointer indirection that will never be inlined and is essentially a random jump in memory.

Lambda Sets
A list of 1 million lambdas with lambda sets is 1 allocation ... Calling a lambda is static dispatch to a known code region. Even the call to individual closure implementations is known at compile time and can receive extra optimization. It also is more friendly to the branch predictor than pointer indirections. Refcounting one is equivalent to refcounting the closed-over data, so it depends on what is closed over.

The one potentially large drawback is that one giant capture ruins the space efficiency of the lambda set. Luckily, we have control over the tag union layer. We can box any lambda set capture that is too large. Basically, limit to size n, and all elements after n get put in a single box. So only lambda sets with large amounts of captures see any boxing. Also, the boxing here is still better than what happens with fully boxed closures. We are only boxing some of the data. We never box the tag. So we still have static dispatch. Some closure capture data just has to pay a pointer indirection on load.


Also, both obviously have some number of allocations for any closed-over data that is heap allocated.

But back to the subject of the thread: no matter what its potential performance advantages due to static dispatch, if use of "LambdaSet" prevents recursive closures (yet to be determined) and it is the exclusive method used, then the language is useless for using streams as an output method, which is one of the common techniques in functional programming...

Which is why I wonder if working on this complex optimization is "too early"?

Yes, heap allocations are expensive, as is dynamic dispatch, but one of the most common uses for this kind of thing, streams, needs hardly any allocations at all and could easily adapt to static dispatch: most uses will consume the sets/closures at the same rate as they are being generated, and could therefore avoid any allocations per "loop" by just re-using the space of the just-consumed one, in a manner somewhat analogous to the in-place mutation of array/List elements. When used this way, there don't need to be any new allocations, the closure data (whether on the heap or not) is actually at a fixed location, as is the actual function/lambda code dispatch, and everything runs at near-imperative speeds. The first use of the data generated by a closure is almost exclusively a pattern match, which moves its data onto the stack and makes the closure being consumed available for destruction or reuse; it can then be recomposed in place, along with its closure and closure data (also recomposed in place), for the "tail" result. Attention needs to be paid to the cyclic reference, but whatever is done to solve this doesn't need to change, since the net reference count doesn't change...

It seems to me that such techniques would make more difference for more practical uses than "LambdaSet"s and their complexities, but perhaps I'm wrong; it seems to work out for Haskell (although garbage collected), which can run the above type of algorithm, with what would be about 18 million heap allocations of value/closure pairs, at about 20 CPU clock cycles per "allocation"/operation/loop, likely using something like the above techniques, where it takes about two and a half times that long in Elm with its lack of such optimizations (depending on what the JavaScript engine is doing)...

What I am trying to say is that there may be more important areas to work on for performance than "LambdaSet"s with all their potential bugs, and it is more important to be able to have a "full" functional programming language before being so concerned with such optimizations, even including the one above...

view this post on Zulip Brendan Hansknecht (Jul 07 2024 at 04:10):

if there is an alternate development with conventional heap-allocation of closure captures, there must be some trade-offs...

I think mostly the complexity/bug-proneness of unifying lambda sets, along with heap-allocated closures being easier to expose to a platform. Though maybe there is another reason.

view this post on Zulip Brendan Hansknecht (Jul 07 2024 at 04:11):

if use of "LambdaSet" prevents recursive closures

It doesn't. Just a bug in our implementation.

view this post on Zulip Brendan Hansknecht (Jul 07 2024 at 04:13):

Which is why I wonder if working on this complex optimization is "too early"?

We started with lambda sets a long while ago. Due to their complexity, we are looking at enabling heap-allocated closures. Long term, we definitely want lambda sets as the default, but it might take a while to get to stable compilation with them. They have sharp edges.

view this post on Zulip Brendan Hansknecht (Jul 07 2024 at 04:15):

the language is useless

No need for hyperbole. The language already has value to people. Even if we iron out the rest of the language with a more limited type system, it is definitely useful. Also, please recognize that you are talking to a group of volunteers who use some of their free time to work on Roc.

view this post on Zulip Brendan Hansknecht (Jul 07 2024 at 04:24):

What I am trying to say is that there may be more important areas to work on

I don't mean to be rude, but all of your suggestions about how we should spend our time as we develop roc and what is important come off as very derogatory. Roc has made lots of progress over the years. Yes, it has many sharp edges, but it is much much more functional than even a year ago. Compiler crashes are a lot less common and there are many more potential uses. Still, there are tons of missing pieces and sharp edges (lambda sets, glue generation, borrow inference, alloca handling in llvm, general error message quality, the effect interpreters transition, recursive refcounting, etc). We are aware. Oftentimes, progress on these issues is slow because the work can be full of segfaults and opaque crashes. So when many contributors have limited bandwidth, it will move slowly or even not at all for large chunks of time. That is the reality of having around 5 to 15 people who put larger amounts of their free time into roc.

view this post on Zulip Brendan Hansknecht (Jul 07 2024 at 04:25):

There is no paid core team with dedicated time to really iron out the hardest and slowest moving parts of the compiler.

view this post on Zulip GordonBGood (Jul 07 2024 at 04:42):

Brendan Hansknecht said:

What I am trying to say is that there may be more important areas to work on

I don't mean to be rude, but all of your suggestions about how we should spend our time as we develop roc and what is important come off as very derogatory.

Sorry, I came out too strong there with frustration over not being able to do some things in which I am particularly interested

it is much much more functional than even a year ago ...

I'm glad to be able to acknowledge your progress...

Compiler crashes are a lot less common and there are many more potential uses ...

Yes, I've experienced that in things one didn't used to be able to do at all are now possible...

So when many contributors have limited bandwidth, it will move slowly or even not at all for large chunks of time. That is the reality of having around 5 to 15 people who put larger amounts of their free time into roc.

Acknowledging that you have a limited team contributing their spare time, I guess that is where my despair factor kicks in: it seems to me better to have a stable language (as in stopping breaking changes as quickly as possible) that compiles and runs reliably with meaningful error and warning messages, and that can be tuned for performance later, than one that strives for the utmost in performance from the outset with lots of weird compiler crashes and errors such as this one, which says a type that should be determinable can't be determined...

But that's just me commenting from outside your team; back to the point of this thread, which was to prompt your team to give this issue in Type Inference a higher priority, because it affects Roc's ability to be a complete FP language...

view this post on Zulip Richard Feldman (Jul 07 2024 at 11:37):

so back when the compiler did not have any implementation of code generation for closures, we chose to use Morphic's strategy for generating them. Morphic's strategy (lambda sets) has the advantages discussed earlier in the thread (and the first word in Roc's tag line of "fast, friendly, functional" is fast; outperforming all mainstream garbage-collected imperative languages has been a concrete goal of the language since 2019)

view this post on Zulip Richard Feldman (Jul 07 2024 at 11:38):

we implemented it during type-checking rather than as a separate pass (Morphic uses a separate pass) and although this compiles faster, we think it has caused a subset of the lambda set bugs we've seen (but there are others that are unrelated)

view this post on Zulip Richard Feldman (Jul 07 2024 at 11:39):

sometime after we implemented this, we also added Abilities (ad-hoc polymorphism), which Morphic does not have, and which also complicate lambda sets

view this post on Zulip Richard Feldman (Jul 07 2024 at 11:40):

I mention this context to emphasize that there's never been a world where Roc had traditional heap-allocated closures. It's not like we are "prioritizing performance improvements" - it’s not an improvement over a working version that didn’t run as fast; the lambda set system is the only implementation we’ve ever had!

view this post on Zulip Richard Feldman (Jul 07 2024 at 11:40):

there is a separate project underway to build, for the first time ever in Roc's development, a heap-allocated closure system in order to buy time to go back and revisit lambda sets as a separate pass later

view this post on Zulip Richard Feldman (Jul 07 2024 at 11:43):

progress on this is going to be very slow because contributions in this area require both availability and very narrow expertise

view this post on Zulip Richard Feldman (Jul 07 2024 at 11:44):

there are some projects in Roc where essentially anyone with any programming background can jump in and figure out what needs to be done and make progress

view this post on Zulip Richard Feldman (Jul 07 2024 at 11:46):

this category of bugs (as well as the heap-allocated closures rewrite) is at the extreme other end of the spectrum; making any usable progress on this requires a very large amount of domain-specific knowledge

view this post on Zulip Richard Feldman (Jul 07 2024 at 11:47):

as it happens, the volunteers who have the relevant knowledge on that have much less bandwidth today than they did when they built the system we have now. So progress on this is going to be very, very slow, and that is not really a fixable problem in the short term

view this post on Zulip Richard Feldman (Jul 07 2024 at 11:48):

in the long term, if there are people who would like to learn more about Roc's type inference implementation, how we want to change it, etc. - then I would absolutely love to start making investments in ramping people up now. That would be fantastic!

view this post on Zulip Richard Feldman (Jul 07 2024 at 11:49):

however, that means after some substantial period of mentoring, those new contributors would hopefully be able to start making some amount of progress on these things; it still would not be a short-term solution. There is no short-term solution here.

view this post on Zulip Richard Feldman (Jul 07 2024 at 11:49):

if we had a paid team of full-time contributors, maybe that would be different, but that's not the world we live in, so we're going to make the best choices we can given the world we do live in! :big_smile:

view this post on Zulip Richard Feldman (Jul 07 2024 at 11:50):

speaking of which, the main thing I think needs to be unblocked by bugs in this category is Task-level parallelism

view this post on Zulip Richard Feldman (Jul 07 2024 at 11:51):

to me, that is much more "obviously this needs to work as soon as possible" than, say, Church numerals

view this post on Zulip Richard Feldman (Jul 07 2024 at 11:52):

this means that among the (slow and steady) progress that's being made on this category of issue, the focus is (correctly, I think) on unblocking Task-level parallelism

view this post on Zulip Richard Feldman (Jul 07 2024 at 11:52):

and I think that ought to be the top priority in this area until we reach that goal

view this post on Zulip Richard Feldman (Jul 07 2024 at 12:15):

I can totally see how if you look at the amazing stream of PRs that are making progress on things that aren’t this, it can give the impression that this “isn’t a priority” or is de-prioritized in favor of those other things

view this post on Zulip Richard Feldman (Jul 07 2024 at 12:17):

but really the actual barrier to progress in this particular area is limited more by the large amount of narrow domain knowledge necessary to make any amount of usable progress at all

view this post on Zulip Richard Feldman (Jul 07 2024 at 12:19):

meanwhile, other projects which don’t have that high barrier to entry are able to make awesome progress at a much faster rate!

view this post on Zulip Richard Feldman (Jul 07 2024 at 12:26):

so I can totally see how that can result in a feeling of frustration, but I don’t see an alternative solution short of taking on VC funding or something :big_smile:

view this post on Zulip GordonBGood (Jul 07 2024 at 18:16):

Richard Feldman said:

I mention this context to emphasize that there's never been a world where Roc had traditional heap-allocated closures. It's not like we are "prioritizing performance improvements" - it’s not an improvement over a working version that didn’t run as fast; the lambda set system is the only implementation we’ve ever had!

Richard, thank you for the explanation of the history of "LambdaSet"s in the development of the Roc language, what they are, and how and why they are used in Roc. Not having expertise in the areas required to improve the problem "edge cases", I can now drop referring to them and proceed with just helping to find the "edge cases"...

view this post on Zulip GordonBGood (Jul 07 2024 at 18:46):

Richard Feldman said:

speaking of which, the main thing I think needs to be unblocked by bugs in this category is Task-level parallelism

Yes, many of these exotic schemes that seem good can experience difficulties when parallelism is thrown into the mix...

Richard Feldman said:

to me, that is much more "obviously this needs to work as soon as possible" than, say, Church numerals

Sorry for seeming to over-emphasize being able to implement "Church numerals"; I only brought them up to show that there are currently Type Inference problems in Roc, and that when a language can cleanly implement a reasonably complete set of the Church numeral functions, that language is generally type sound and complete (other than that it doesn't test parallelism, of course)...

Richard Feldman said:

but really the actual barrier to progress in this particular area is limited more by the large amount of narrow domain knowledge necessary to make any amount of usable progress at all

Yes, I can see that. As I have quite a bit of spare time, I considered becoming a more active contributor to Roc; however, as well as the ability to dig into algorithmic things such as the Morphic paper's ideas and implementation (written in Rust), one needs to be somewhat of an expert in writing Rust code (which I can do but don't really enjoy, as it feels too imperative), Zig code (which again I don't care for, for the same reasons), and LLVM-IR (ditto), so I lost interest in proceeding with this. That limits me to testing and pushing "edge cases", as I have done in filing a couple of issues and here in this thread...

In doing this, I have run into the following problems:

  1. Problems with the performance of in-place mutation, which have been attributed to how LLVM-IR is used and emitted, and are fixable but only by an LLVM-IR expert whose time is in short supply...
  2. A problem in the performance of the sorting implementation, which has been attributed to the implementation in Zig but which can't be converted with good performance to using "native" Roc code due to the above problem...
  3. The problem of Type Inference identified here and in filed issue #6542, where I'm coming to understand that the problem may be in the implementation of "LambdaSet"s, which requires algorithm knowledge, Rust language implementation expertise, and likely general knowledge of the entire Roc compiler implementation...

Richard Feldman said:

but really the actual barrier to progress in this particular area is limited more by the large amount of narrow domain knowledge necessary to make any amount of usable progress at all

Point taken, as per my above comment...

Richard Feldman said:

so I can totally see how that can result in a feeling of frustration, but I don’t see an alternative solution short of taking on VC funding or something :big_smile:

The age-old problem with this kind of project: the bite you've chosen is almost too big to chew, and once bitten off, it isn't very practical to get other people to help :big_smile:...

Anyway, for my part I'll continue to monitor progress on my issues of concern/your "edge cases" as my contribution...

Thank you for your participation in my "enlightenment"...

view this post on Zulip Folkert de Vries (Jul 07 2024 at 19:27):

I mean, if you have free time and are interested in type systems, I think many of these issues have big picture solutions. So what you could do is build your own simple type checker that uses the lambda set ideas, and explore how these cases should be handled. Implementing fixes in the roc compiler itself then will be a lot easier, either for you or for us

view this post on Zulip Folkert de Vries (Jul 07 2024 at 19:27):

also it would even be really good for us if this lambda set idea became more widespread

view this post on Zulip Ayaz Hafiz (Jul 07 2024 at 23:42):

if you are interested in the type system, you can take a look at my implementations of much smaller languages with the core of various type system features here: https://github.com/ayazhafiz/cor. For example, here is a compiler for a language with lambda sets like Roc has: https://github.com/ayazhafiz/cor/tree/base/experiments/compose_fx. There are some examples in the test/ directory, and the design+implementation should be much easier to follow than in the Roc compiler (for example, the inference algorithms are 600 LoC). Here is another one for Roc's tag unions: https://github.com/ayazhafiz/cor/tree/base/experiments/easy_tags

view this post on Zulip Ayaz Hafiz (Jul 07 2024 at 23:45):

Problems with the performance of in-place mutation, which have been attributed to how LLVM-IR is used and emitted, and are fixable but only by an LLVM-IR expert whose time is in short supply

Honestly, I don't think this is true. A lot of the big issues we continue to find are oversights that just require someone to sit down and dig into it. For example, we recently found an issue where the code generation of LLVM IR for list literals was extremely inefficient, and the solution was very simple. It's just that someone has to sit down and actually do the work to diagnose the problems, and the bandwidth of people who are excited and able to do so is very limited right now (for example, I personally can contribute almost no development time to Roc)

view this post on Zulip Brendan Hansknecht (Jul 07 2024 at 23:49):

Yeah, probably same with the alloca issues. Not the same as the list issue, but fundamentally it is probably not a hard problem. Just a bit tedious and will take a while for someone to sort through, but probably can be dealt with a single alloca at a time.

view this post on Zulip Ayaz Hafiz (Jul 08 2024 at 00:01):

I'd also like to encourage anyone who is interested in seeing various parts of the language become more mature and stable to be curious. Finding problems that block adoption is one half of the path towards success. The other half, finding solutions for them and implementing them, is equally important, and it is the part where the project could use the most help. It is true that some of the foundations are wrong and will need to be adjusted, but much of it is sound and correct. I hope I can speak on behalf of everyone in the project when I say everyone would be willing to share their knowledge and perspective on the design/implementation of various things, whether in the compiler as discussed here, or in the ecosystem, etc. There are many materials already available, but we'd all be happy to sit down and work with anyone who wants to help grow the language. Today there are some parts of the project that are only really understood by one person, and though everyone is smart, we are all human and can overlook things or convince ourselves of a solution when there is a simpler/better one. For that we need teams of people who can work together to solve implementation issues like this. So if you're interested in that, I encourage you to be curious.

view this post on Zulip GordonBGood (Jul 08 2024 at 01:19):

Folkert de Vries said:

I mean, if you have free time and are interested in type systems, I think many of these issues have big picture solutions. So what you could do is build your own simple type checker that uses the lambda set ideas, and explore how these cases should be handled. Implementing fixes in the roc compiler itself then will be a lot easier, either for you or for us

So you are saying that you think that the Type Inference problems with "LambdaSet"s could be explored using an alternate Type Inference engine written in another language? It turns out that I am about to embark on such a little project in a couple of weeks, and I could look into it. I would, of course, share any results that could be helpful here...

view this post on Zulip GordonBGood (Jul 08 2024 at 01:31):

Ayaz Hafiz said:

if you are interested in the type system, you can take a look at my implementations of much smaller languages with the core of various type system features here: https://github.com/ayazhafiz/cor. For example, here is a compiler for a language with lambda sets like Roc has: https://github.com/ayazhafiz/cor/tree/base/experiments/compose_fx. There are some examples in the test/ directory, and the design+implementation should be much easier to follow than in the Roc compiler (for example, the inference algorithms are 600 LoC). Here is another one for Roc's tag unions: https://github.com/ayazhafiz/cor/tree/base/experiments/easy_tags

Thank you so much for these links; they are something I can easily get started on, being written in a functional language as is my preference...

Ayaz Hafiz said:

Problems with the performance of in-place mutation, which have been attributed to how LLVM-IR is used and emitted, and are fixable but only by an LLVM-IR expert whose time is in short supply

Honestly, I don't think this is true...

I'm only referring to what I have been told by @Brendan Hansknecht as to what he feels is the ultimate resolution of my reported issue...
Brendan Hansknecht said:

Yeah, probably same with the alloca issues. Not the same as the list issue, but fundamentally it is probably not a hard problem. Just a bit tedious and will take a while for someone to sort through, but probably can be dealt with a single alloca at a time.

Which is my point: it may not be a hard problem to an LLVM-IR expert, but it is beyond my level of expertise in this area, and probably beyond that of many others, whose expertise is in other areas...

view this post on Zulip GordonBGood (Jul 08 2024 at 01:34):

Ayaz Hafiz said:

... I encourage you to be curious.

If I weren't curious, why else would I be here?

view this post on Zulip Ayaz Hafiz (Jul 08 2024 at 01:36):

No, you are - all your comments here are very valuable! I apologize if that comment felt directed at you in any unfair way. I meant it for everyone generally, as a reflection.

view this post on Zulip GordonBGood (Jul 08 2024 at 01:57):

Ayaz Hafiz said:

the design+implementation should be much easier to follow than in the Roc compiler (for example, the inference algorithms are 600 LoC)

Thanks again for this code that is written in a functional language I can easily read and reasonably quickly understand; I have had a quick scan through this file and see the match ... with cases where the "LambdaSet" considerations kick in and what you have done with them...

I haven't fully groc'ed your code as it will take a few passes through it; in order to save me some time, is this HM Type Inference and if not, what modifications to HM are made to make it (perhaps not quite) HM?

view this post on Zulip Ayaz Hafiz (Jul 08 2024 at 01:58):

yes, its HM

view this post on Zulip Ayaz Hafiz (Jul 08 2024 at 01:59):

HM with the standard let-polymorphism and the value restriction
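
For reference, a tiny sketch of those two ingredients in Roc-style syntax (illustrative names id, n, and s): let-polymorphism means a let-bound definition is generalized so it can be used at several types, while the value restriction only generalizes syntactic values such as lambdas:

id = \x -> x   # generalized to id : a -> a

n = id 123     # here id is used at Num * -> Num *
s = id "hi"    # and here at Str -> Str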

view this post on Zulip GordonBGood (Jul 08 2024 at 02:01):

Great, that will make my investigations even easier, as I am currently working on another project that uses HM, although I am just starting to fully understand it; I believe it also has "standard let-polymorphism and the value restriction"...

view this post on Zulip GordonBGood (Jul 09 2024 at 20:39):

Richard Feldman said:

so back when the compiler did not have any implementation of code generation for closures, we chose to use Morphic's strategy for generating them.

Okay, so I've read through the materials available from the link, especially the published paper, and while I see that the authors are very adept at type theory in proving that a type system using "LambdaSet"s should be sound and complete, I don't see that they have very well proved that "Our evaluation confirmed that LLS yields considerable performance benefits in practice." Remember that they, like every other research paper's authors, could hardly conclude anything else, if the results could be slanted that way at all, after investing a year or more of their time in doing this research and testing it, even to the extent of applying it to their own language implementation. I don't want to teach a fox how to suck eggs, but it seems to me that Roc needs more evaluation than what they did to confirm that the benefits are worth the pain points you are currently experiencing...

My problems with their evaluations are as follows:

  1. In tests that can be very subject to the size of the test, such as the primes array tests being affected by cache associativity, they don't state the size of the tests or the number of iterations used...
  2. The OCaml and MLton code was automatically generated from the Morphic code, so we can't review how well suited it is to its purpose in these specific languages.
  3. The huge gains in using LSS for Morphic mostly only indicate how poorly optimized its generated code without LSS is, and there are no comparisons of its actual testing times for the same data sets against the other languages.
  4. The executable size comparisons are therefore pretty much meaningless: yes, it is a given that any gains in a highly optimized language such as OCaml will likely come from increased inlining (which aids the static dispatch benefit), and of course the code sizes will be a little larger; the other languages aren't optimized as well (or even nearly as well, in the case of Morphic), so the LSS optimizations can somewhat reduce their code sizes.
  5. The only comparison that makes any sense here is comparing Morphic and OCaml, but it is of very much limited relevance without actual code and benchmark parameters, as well as actual timing results...

So we can fix that. First, just as they say, the technique is most effective when there is a lot of combinator-style code, as used in parsers; this is confirmed by the biggest gains for a sophisticated optimized language such as OCaml being for the two calc and parse-JSON tests that are based on a combinator parser implementation. But as I say, without being able to see the actual translated source code for the OCaml benchmarks for these, we can't really comment on whether other ways of optimization might produce nearly the same results as LSS. However, we can pretty much reconstruct their OCaml code for the Sieve of Eratosthenes from their Morphic implementation, although it isn't clear whether they also translated their combinator Iter or just translated to using OCaml's standard library Seq type (which is not so combinator-based), but I tried it both ways and they aren't significantly different...

The OCaml SoE implementation I am using (a translation of their Morphic code, which is terrible in that it wastes a huge number of loops/lambdas doing redundant work) is the following code:

type primality = Prime | Composite

let primesM limit =
  if limit < 2 then Seq.empty else
  let sz = limit + 1 in
  let cmpsts = Array.make sz Prime in
  cmpsts.(0) <- Composite; cmpsts.(1) <- Composite;
  Seq.unfold (fun i -> if i < sz then Some (i, i + 1) else None) 0
    |> Seq.fold_left ( fun arr bp ->
      if arr.(bp) == Composite then arr
      else Seq.unfold (fun m -> Some (m, m + 1)) 2
           |> Seq.map (fun m -> m * bp)
           |> Seq.unfold ( fun seq ->
                match seq() with Seq.Nil -> None |
                                 Seq.Cons (hd, tl) ->
                                   if hd >= sz then None
                                   else Some (hd, tl) )
           |> Seq.fold_left ( fun narr c ->
                narr.(c) <- Composite; narr ) arr ) cmpsts
    |> Array.to_seqi
    |> Seq.filter_map ( fun (p, pt) ->
         if p < 2 || pt == Composite then None else Some p )

which returns a Seq of the prime results. From the relative timings, I assume this is something like the code they are using for the sieve benchmark, although if one translates their combinator Iter and uses it instead of OCaml's standard library Seq as here, it runs about another 25 percent slower...

Now, the best optimization one could do would be to eliminate the closures entirely (which is what GHC Haskell does in "List Fusion" for an implementation of this, only paying the price that the boxed array is filled with lazy values that need to be evaluated later upon evaluation of the generated primes list, at a slight cost of about 25%), so I did this by using direct recursive loops instead of the Seqs/Iters in the following code for the same (lousy) algorithm:

let primes limit =
  if limit < 2 then Seq.empty else
  let sz = limit + 1 in
  let cmpsts = Array.make sz Prime in
  let rec loopbp bp =           (* outer loop over base primes *)
    if bp >= sz then () else
    if cmpsts.(bp) == Composite then loopbp (bp + 1) else
    let rec loopc c =           (* inner culling loop over multiples *)
      if c >= sz then ()
      else ( cmpsts.(c) <- Composite; loopc (c + bp) ) in
    ( loopc (bp + bp); loopbp (bp + 1) ) in
  let rec loopn n =             (* scan for the next prime *)
    if n >= sz then None else
    if cmpsts.(n) == Composite then loopn (n + 1) else Some (n, n + 1) in
  cmpsts.(0) <- Composite; cmpsts.(1) <- Composite;
  loopbp 0; Seq.unfold (fun n -> loopn n) 2

The above code runs about 3.5 times faster than the extremely "closurized" previous version when sieving a range to ten million: about 210 milliseconds for this version versus 730 milliseconds for the previous one, both running on an AMD 7840HS with a boosted CPU clock rate of 5.1 GHz.

If the testing conditions were the same as used in the paper, this indicates that, for this algorithm, their LSS technique wasn't fully getting rid of the closures, as they reported a gain from using it of only about 1.79 times (but who knows, their ranges could have been different, etc.).

I was unable to translate this work into Roc because of the recursive Type Inference bugs.

If this were their testing methodology, then it kind of explains why their first two benchmarks using their combinator parsers were more effective for OCaml at a ratio of about 3.3 times, as those closures will be even more costly and the remaining computation after they are reduced is so simple (for any reasonable range of sieving, a huge memory array such as used here has terrible memory bandwidth with cache thrashing).

Now, it seems to me that the question that should be asked is "Are there other ways to eliminate the closures or make them more efficient?". This sieve benchmark is completely artificial in that if one were concerned with performance, one wouldn't use Seq's/Iter's and there would be nothing to optimize in that regard. Even if one were to use them, they are consumed as they are generated, so any kind of "Perceus" and "Mod Cons" optimization would inline most of the code inside the closures to the same effect as Haskell's "List Fusion". Combinator parsers will also respond to this, although perhaps not quite as effectively. I have had some experience with using parsers in that Safari, with its Proper Tail Call optimization and inlining optimizations, often runs JavaScript parser code two to three times faster than Google Chrome or Firefox without them, and I believe that Safari also gets away from the excessive dynamic dispatch by some sort of optimization that is almost certainly not LSS with all its complexities.

Next, I'll look at an algorithm that is something like a combinatorial parser in which the nested functions may not be able to be completely eliminated...

view this post on Zulip Brendan Hansknecht (Jul 09 2024 at 21:02):

I think something very concrete we could do is finish implementing erased closures and directly compare the perf in Roc with and without lambda sets.

view this post on Zulip Richard Feldman (Jul 09 2024 at 21:02):

totally!

view this post on Zulip Brendan Hansknecht (Jul 09 2024 at 21:02):

Obviously there isn't that much roc code, but some of the parsers should exercise the benefits.

view this post on Zulip Richard Feldman (Jul 09 2024 at 21:13):

for what it's worth, we already know that lambda sets enable llvm optimizations of closures that are otherwise blocked by llvm's inability to optimize function pointers - I believe @Folkert de Vries observed that this alone made a big difference in practice for Roc, and that's even before applying morphic's optimizations (which also require lambda sets)

view this post on Zulip Richard Feldman (Jul 09 2024 at 21:15):

so there are at least 3 performance benefits that defunctionalization (via lambda sets) unlocks:

we learned about lambda set defunctionalization from the Morphic authors but it's a separate thing from Morphic's optimizations themselves

view this post on Zulip Folkert de Vries (Jul 09 2024 at 21:21):

this is exactly the problem that the morphic paper describes: boxed closures obscure control flow from LLVM. I'd say we today really only get a marginal benefit from alias analysis (in large part because we don't model all builtins accurately)

view this post on Zulip Richard Feldman (Jul 09 2024 at 21:27):

alias analysis being morphic's optimizations

view this post on Zulip Ayaz Hafiz (Jul 09 2024 at 22:24):

to be fair, LLVM is typically not a good compilation target for functional languages (at least for higher-order programs, before they are lowered to first-order). other compilation targets besides SSA like A normal form or CPS are more amenable to the kinds of optimizations that you want to do over functional languages. It just so happens that compiling via lambda sets is another way of achieving that, because compilation with lambda sets means you get to a first-order program as soon as you specialize all lambda set types. But the tradeoff is really more like (lambda sets | no lambda sets) + (optimization IRs: llvm | ANF | CPS | Morphic | etc), rather than necessarily being tied to LLVM as a target.

view this post on Zulip GordonBGood (Jul 10 2024 at 00:30):

Ayaz Hafiz said:

to be fair, LLVM is typically not a good compilation target for functional languages (at least for higher-order programs, before they are lowered to first-order)...

I guess that's why GHC Haskell is able to compile with LLVM-IR as a back-end: by the time it gets to the back-end, the code has been lowered by successive passes to a C-- ("Cee Minus Minus") form, etc., and it generally does pretty well compared to the Native Code Generator; it's a trade-off, as LLVM has a much better register allocator but sometimes misses a bit (even after the lowering) on mostly functional forms of code.

For the Sieve benchmark, I kind of sloughed off my conclusion that there shouldn't have been a problem, as there didn't need to be any closures, but that if closures were used, they should have been optimized away by other means than LSS. Here is the Haskell Sieve benchmark in the same (lousy) form of sieve as from the paper:

import Control.Monad (forM_, when)
import Data.Array (assocs)
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST (newArray, runSTArray)

data Primality = Prime | Composite
  deriving (Eq)

primesM :: Int -> [Int]
primesM limit =
  let sievearr = runSTArray $ do
        marr <- newArray (0, limit) Prime
        unsafeWrite marr 0 Composite; unsafeWrite marr 1 Composite
        forM_ [2 .. limit] $ \ bp -> do
           v <- unsafeRead marr bp
           when (v == Prime) $
             forM_ (takeWhile (<= limit) $ map (* bp) [2..]) $ \ c ->
                unsafeWrite marr c Composite
        return marr
  in [ bp | (bp, Prime) <- assocs sievearr ]

Notice that GHC Haskell runs this about as fast as if there were no lazy list (with its inherent closures) in the "inner loop" of the innermost forM_, because Haskell has fused the complex building of what is actually the simple list [ bp + bp, bp + bp + bp .. limit ] via List Fusion. In this case, List Fusion (plus closure lifting) does a better job than LSS would have done and without (quite) so much complexity, or at least without the major problems with types that LSS sees as recursive...
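
In effect, after fusion the innermost forM_ behaves as if its body had been written directly over an arithmetic range; a paraphrase of what the fusion amounts to (not literal GHC output), as a drop-in for the inner loop of primesM above:

             forM_ [bp + bp, bp + bp + bp .. limit] $ \ c ->
                unsafeWrite marr c Composite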

OCaml doesn't have Seq fusion, but if I manually do it for the "inner loop" as per the following code:

let primesM limit =
  if limit < 2 then Seq.empty else
  let sz = limit + 1 in
  let cmpsts = Array.make sz Prime in
  cmpsts.(0) <- Composite; cmpsts.(1) <- Composite;
  Seq.unfold (fun i -> if i < sz then Some (i, i + 1) else None) 0
    |> Seq.fold_left ( fun arr bp ->
      if arr.(bp) == Composite then arr
      else Seq.unfold (fun c -> if c < sz then Some (c, c + bp) else None) (bp + bp)
           (* Seq.unfold (fun m -> Some (m, m + 1)) 2
           |> Seq.map (fun m -> m * bp)
           |> Seq.unfold ( fun seq ->
                match seq() with Seq.Nil -> None |
                                 Seq.Cons (hd, tl) ->
                                   if hd >= sz then None
                                   else Some (hd, tl) ) *)
           |> Seq.fold_left ( fun narr c ->
                narr.(c) <- Composite; narr ) arr ) cmpsts
    |> Array.to_seqi
    |> Seq.filter_map ( fun (p, pt) ->
         if p < 2 || pt == Composite then None else Some p )

then we gain just about exactly the same speed-up as found in the LSS article...

Thus, my conclusion is that LSS doesn't really do much for us in this situation other than cause some grief...

view this post on Zulip GordonBGood (Jul 10 2024 at 14:32):

GordonBGood said:

Next, I'll look at an algorithm that is something like a combinatorial parser in which the nested functions may not be able to be completely eliminated...

So, up to now it has been determined that the best candidates to benefit from LSS optimizations are likely combinator-parser types of applications. As I am not (at least yet) ready to translate the Morphic people's parser code into other languages, let's consider some other "parser-like" alternatives. I believe that code such as the little few-line example near the start of this thread clearly demonstrates that Roc currently isn't able to Type Infer even what should be legal recursive types. In order to keep it simple, that code is pretty trivial, but the Co-Inductive Stream type that it uses could be considered a parser of sorts: although it is formed of "thunk" closures instead of passing the state forward as an argument, the state that is carried forward is the "thunk" captures. This makes this type of code even less performant, generally requiring new "boxing" allocations for each advance of the stream, with at least one for the stream "pair" and another for the closure "thunk" that forms the tail, and perhaps others if producing the result requires advancing the heads of other streams. That is kind of like a worst-case parser...

The little example is actually extracted from a purely functional incremental (as in it depends on the state of the stream as of the last value to determine the next) Sieve of Eratosthenes, which is kind of interesting since sieves are often used for bench-marking: the numbers dealt with are random as far as general branch prediction is concerned, and thus a sieve implemented this way needs a comparison for each stream merge call that won't be predictable, with a more-or-less constant overhead of several tens of CPU clock cycles. The code is quite short, as it can be expressed in Haskell in just a few LoC (including the extra conditions to make the pattern matches total) as follows:

primes :: () -> [Int]
primes() = 2 : _Y ((3:) . gaps 5 . _U . map(\p-> [p*p, p*p+2*p..])) where
  _Y g = g (_Y g)  -- = g (g (g ( ... )))   non-sharing multistage fixpoint combinator
  gaps k [] = [k]
  gaps k s@(c:cs) | k < c     = k : gaps (k+2) s  -- ~= ([k,k+2..] \\ s)
                  | otherwise =     gaps (k+2) cs --   when null(s\\[k,k+2..])
  _U [] = []
  _U [[]] = []
  _U ([] : _ : _) = []
  _U ((x:xs):t) = x : (merge xs . _U . pairs) t   -- tree-shaped folding big union
  pairs [] = []
  pairs [xs] = [xs]
  pairs (xs:ys:t) = merge xs ys : pairs t
  merge xs [] = xs
  merge [] ys = ys
  merge xs@(x:xt) ys@(y:yt) | x < y     = x : merge xt ys
                            | y < x     = y : merge xs yt
                            | otherwise = x : merge xt yt

where the gaps function checks whether the advancing test value is in the _U-generated merged list of odd composite numbers to determine primality, and where the depth of the merge tree has been reduced by "infinite tree folding" so as to be only about the log base two of the number of base primes; it involves only the base primes because this is a recursive determination where the result primes are fed back into the calculation as a secondary stream by virtue of the fixed point Y Combinator.
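
A quick sanity check in GHCi (the expected first ten primes):

take 10 (primes ())   -- ==> [2,3,5,7,11,13,17,19,23,29]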

Now, the above code enjoys certain advantages compared to implementations in other languages because GHC Haskell applies certain List specializations, although it also loses slightly due to having to deal with deferred non-strict results. To compare fairly, I converted the above List-based code to use Co-Inductive Streams (CIS's), just as the implementations in other languages do. CIS's without memoization are adequate for this algorithm (it doesn't require full memoized lazy lists) because each subsequent new value for each of the streams is determined from the directly preceding value and doesn't need values from further back. So the above Haskell code translated to use CIS's is as follows:

import Data.List (unfoldr)

data CIS a = CIS a (() -> CIS a)

cisToList :: CIS a -> [a]
cisToList = unfoldr (\ (CIS hd tl) -> Just (hd, tl()))

primesTF :: () -> [Int]
primesTF() =
    cisToList $ CIS 2 $ \ () -> _Y (\ bps -> CIS 3 $ \ () -> testAt 5 $ _U $ allmults bps) where
  _Y g = g (_Y g)  -- = g (g (g ( ... )))   non-sharing multistage fixpoint combinator
  testAt k cs@(CIS c ctl) | k < c     = CIS k $ \ () -> testAt (k+2) cs  -- ~= ([k,k+2..] \\ s)
                          | otherwise =                 testAt (k+2) (ctl()) --   when null(s\\[k,k+2..])
  _U (CIS (CIS x xtl) cstl) = CIS x $ \ () -> (merge (xtl()) . _U . pairs) (cstl())   -- tree-shaped folding big union
  pairs (CIS xs xstl) = let (CIS ys ystl) = xstl() in
                        CIS (merge xs ys) $ \ () -> pairs (ystl())
  bpmults bp = let adv = bp + bp in
               let mlts c = CIS c $ \ () -> mlts (c + adv) in mlts (bp * bp)
  allmults (CIS bp bptl) = CIS (bpmults bp) $ \ () -> allmults (bptl())
  merge xs@(CIS x xtl) ys@(CIS y ytl) | x < y     = CIS x $ \ () -> merge (xtl()) ys
                                      | y < x     = CIS y $ \ () -> merge xs (ytl())
                                      | otherwise = CIS x $ \ () -> merge (xtl()) (ytl())

which indeed runs about 25 percent slower than the previous List version. Other than the missed branch predictions per merge call, most of the overhead is related to boxing CIS's and their tail closures, with their associated memory allocation times. Interestingly, this program translated to Elm and run with the JavaScript engine of Google Chrome runs only a few tens of percent slower than this last Haskell version!

The execution profile of this algorithm when sieving to ten million is as follows:

  1. The merge function is run 43,744,361 times and is where most of the computation is done, with two pattern matches, an unpredictable comparison, an allocation of the result CIS and another for its tail closure, which will trigger at least one advance of one of the input streams, although most advances will be just another call to merge.
  2. There are 8,931,203 calls to the inner function of the bpmults function (and only a few hundred to the outer function, one call per base prime up to the square root of the computed range), with a much easier computation: allocating a new result CIS whose head is the sum of the captured old value and the captured adv value, and also allocating a new tail closure containing the updated current value.
  3. The other significant function is testAt, which is called 5,004,276 times, but for all but the 664,579 prime value determinations these will just be recursive loops, and OCaml will optimize the pattern match and the consumption of the composite number stream to a simple imperative loop, so it consumes little time.
  4. The number of times the rest of the functions are called is insignificant, the next most being 1,086 calls to pairs as it does the "infinite tree folding" for the base primes structure...

So to optimize this algorithm, we really only need to consider the base prime multiplier function and the merge function. Since both of these are "parser-like" advances of state per call, they basically consume a stream as they feed a stream, so it is very easy to see how a compiler could recognize that the old value is never used again and re-use its storage for the new value in the case of the multiplier function. For the merge function it may not be as clear to the compiler, but with a little eta expansion/reduction one can see that, after the first external call to merge that needs to allocate a new result CIS, all internal recursive calls from the tail closures can then re-use that result value, consuming the old value and producing a new one. Also, the one or two advances of the captured streams can be done in place, as they can also be identified as last used by the old value. For the merge closures, once one is re-using the space for the captured streams, there are three possible values for each closure, one each for LT, GT, and EQ, with different captured streams advanced for each and a different consumed result moved over to the head of the new result CIS (in place); alternatively, we could define a LSS structure with a tag for each of the three cases, although this, while gaining static dispatch, would require another missed-prediction branch on the tag, so it might be pretty much a wash for this algorithm...

Implementing this in OCaml using ref's actually caused a slowdown due to the extra ref indirection, so it's time to use a different, more imperative language for these experiments. This is not contrary to the development of a functional language, as all of these in-place mutation optimizations would be compiler generated and hidden from the programmer...
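
For the record, the shape of the ref-style mutation I attempted was roughly the following sketch (with mutable record fields standing in for the ref cells; the names are mine):

type 'a mcis = { mutable head : 'a; mutable tail : unit -> 'a mcis }

(* multiples of bp, advanced in place: every pull mutates and returns the same
   cell, but every access now pays the extra mutable-cell indirection *)
let bpmults bp =
  let adv = bp + bp in
  let rec cell =
    { head = bp * bp;
      tail = (fun () -> cell.head <- cell.head + adv; cell) } in
  cell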

view this post on Zulip GordonBGood (Jul 11 2024 at 01:22):

GordonBGood said:

so it's time to use a different, more imperative language for these experiments. This is not contrary to the development of a functional language, as all of these in-place mutation optimizations would be compiler generated and hidden from the programmer...

I decided to use Nim as a language with a nice syntax whose backend is C, which then allows testing with GCC or Clang to perhaps get some advantage of compiling through LLVM. Nim is very much an imperative language by choice of the BDFL, who associates functional programming with the use of linked lists and admires the language developments based on Algol 68, so Nim includes reference parameters, overloaded functions, variadic arguments, etc.; but for all that, it is a pretty good (and powerful) language if one is willing to get functional purity by programming discipline. It has a good Type system and some Type Inference, limited by the fact that mutability is always an option and by the range of other features that conflict with full HM type inference.

Nim already has "smart" elided reference counting memory management and, more importantly, implements closures by calling a hidden function with an extra hidden argument of a heap allocated capture structure, so fits in very nicely with what has been proposed...

So the functional Tree Folding SoE can be implemented and tested using Nim code as follows:

# for smart reference counting, maximum optimization,
# for output with GCC, compile with:  nim c -d:danger --mm:arc -t:-march=native <filename>
# for output with clang, add the --cc:clang option...

import sugar
import std/monotimes
from std/times import inMilliseconds

type  PrimeType = uint64

let cLIMIT: PrimeType = 10_000_000

type
  CIS[T] = ref object
    head: T
    tail: () -> CIS[T]

iterator primesTreeFolding(): PrimeType {.closure.} =

  proc merge(xs, ys: CIS[PrimeType]): CIS[PrimeType] =
    let x = xs.head;
    let y = ys.head
    if x < y:
      CIS[PrimeType](head: x, tail: () => merge(xs.tail(), ys))
    elif y < x:
      CIS[PrimeType](
        head: y,
        tail: () => merge(xs, ys.tail()))
    else:
      CIS[PrimeType](
        head: x,
        tail: () => merge(xs.tail(), ys.tail()))

  proc bpmults(bp: PrimeType): CIS[PrimeType] =
    let inc = bp + bp
    proc mlts(c: PrimeType): CIS[PrimeType] =
      CIS[PrimeType](head: c, tail: () => mlts(c + inc))
    mlts(bp * bp)

  proc allmults(ps: CIS[PrimeType]): CIS[CIS[PrimeType]] =
    CIS[CIS[PrimeType]](
      head: bpmults(ps.head),
      tail: () => allmults(ps.tail()))

  proc pairs(css: CIS[CIS[PrimeType]]): CIS[CIS[PrimeType]] =
    let cs0 = css.head;
    let rest0 = css.tail()
    CIS[CIS[PrimeType]](
      head: merge(cs0, rest0.head),
      tail: () => pairs(rest0.tail()))

  proc cmpsts(css: CIS[CIS[PrimeType]]): CIS[PrimeType] =
    let cs0 = css.head
    CIS[PrimeType](
      head: cs0.head,
      tail: () => merge(cs0.tail(), css.tail().pairs.cmpsts))

  proc testAt(n: PrimeType, cs: CIS[PrimeType]): CIS[PrimeType] =
    var nn = n;
    var ncs = cs
    while nn >= ncs.head: # emulate tail call with loop
      nn += 2;
      ncs = ncs.tail()
    CIS[PrimeType](head: nn, tail: () => testAt(nn + 2, ncs))

  proc oddprms(): CIS[PrimeType] =
    CIS[PrimeType](
      head: 3.PrimeType,
      tail: () => testAt(5.PrimeType, oddprms().allmults.cmpsts))

  var prms = CIS[PrimeType](head: 2.PrimeType, tail: oddprms)
  while true:
    yield prms.head;
    prms = prms.tail()

for p in primesTreeFolding():
  if p > 100: echo ""; break
  stdout.write p, " "

let strt = getMonoTime()
var count = 0
for p in primesTreeFolding():
  if p > cLIMIT: break
  count += 1
let elpsd = (getMonoTime() - strt).inMilliseconds
echo "Found ", count, " primes up to ", cLIMIT, " in ", elpsd, " milliseconds."

The sieve is written as an infinite closure iterator for ease of testing, but this doesn't affect the performance. This runs on my AMD 7840HS at 5.1 GHz with single threaded boost in about 1.75 seconds compiled with either the GCC or Clang back-ends, which time is likely mostly memory allocations, for which the Nim standard memory allocator is known to be slow; however, that isn't going to matter for this testing, as most of the allocation will be elided away for the final comparisons...

As discussed previously, the sub functions that will most increase the performance of this algorithm are the bpmults function and the merge function. Since bpmults doesn't really need a "LambdaSet" (or rather needs only a set of one condition), and was already "eta expanded" to use a sub function, that function can be changed to mutate-in-place by just substituting the following version:

  proc bpmults(bp: PrimeType): CIS[PrimeType] =
    let adv = bp + bp
    proc mltclsr(): CIS[PrimeType] {.inline, closure.} # forward declaration
    var rslt = CIS[PrimeType](head: bp * bp, tail: mltclsr)
    proc mltclsr(): CIS[PrimeType] {.inline, closure.} =
      rslt.head += adv; rslt
    rslt

Changing the merge function for in-place mutation, but not yet using a "LambdaSet", is a little more complex as it requires that it first be "eta expanded" to use an internal sub function (which I manually inlined) and internal closure options (with forward declarations to allow the mutual recursion), resulting in the following code for that function:

  proc merge(xs, ys: CIS[PrimeType]): CIS[PrimeType] =
    var vxs = xs; var vys = ys
    var vrslt = CIS[PrimeType](head: xs.head, tail: xs.tail)
    proc clsrLT(): CIS[PrimeType] {.closure.} # forward declarations
    proc clsrGT(): CIS[PrimeType] {.closure.} # for recursion...
    proc clsrEQ(): CIS[PrimeType] {.closure.}
    proc clsrLT(): CIS[PrimeType] {.closure.} =
      vxs = vxs.tail()
      let x = vxs.head; let y = vys.head
      if x < y: vrslt.head = x; vrslt.tail = clsrLT
      elif y < x: vrslt.head = y; vrslt.tail = clsrGT
      else: vrslt.head = x; vrslt.tail = clsrEQ
      vrslt
    proc clsrGT(): CIS[PrimeType] {.closure.} =
      vys = vys.tail()
      let x = vxs.head; let y = vys.head
      if x < y: vrslt.head = x; vrslt.tail = clsrLT
      elif y < x: vrslt.head = y; vrslt.tail = clsrGT
      else: vrslt.head = x; vrslt.tail = clsrEQ
      vrslt
    proc clsrEQ(): CIS[PrimeType] {.closure.} =
      vxs = vxs.tail(); vys = vys.tail()
      let x = vxs.head; let y = vys.head
      if x < y: vrslt.head = x; vrslt.tail = clsrLT
      elif y < x: vrslt.head = y; vrslt.tail = clsrGT
      else: vrslt.head = x; vrslt.tail = clsrEQ
      vrslt
    let x = vxs.head; let y = vys.head
    if x < y: vrslt.head = x; vrslt.tail = clsrLT
    elif y < x: vrslt.head = y; vrslt.tail = clsrGT
    else: vrslt.head = x; vrslt.tail = clsrEQ
    vrslt

As discussed previously, this uses three different closures, chosen by the merge branch logic as to which one should define the tail of the stream, thus avoiding the cost of the missed-prediction branches at the cost of the dynamic invocation of the closure, which prevents it from being inlined (a close trade-off). With it, the code runs on my above machine sieving to ten million in about 425 milliseconds, and is about 25 milliseconds (about five to six percent) faster compiling with Clang than with GCC.

It was actually easier to write the "LambdaSet" version than the above version, using only one closure and just changing the tag value in the (mutable) closure to suit the branch condition; although this makes the closure and called code inlinable, it costs an extra missed branch prediction on the internal branch on the tag, as per the following merge code:

  proc merge(xs, ys: CIS[PrimeType]): CIS[PrimeType] =
    type Cond = enum LT, GT, EQ
    type LambdaSet = object
          cond: Cond
          xs, ys, rslt: CIS[PrimeType]
    var vlss = LambdaSet(cond: LT, xs: xs, ys: ys,
                        rslt: CIS[PrimeType](head: xs.head, tail: xs.tail))
    proc mrg() {.inline, closure.} =
      let x = vlss.xs.head; let y = vlss.ys.head
      if x < y: vlss.rslt.head = x; vlss.cond = LT
      elif y < x: vlss.rslt.head = y; vlss.cond = GT
      else: vlss.rslt.head = x; vlss.cond = EQ
    proc clsr(): CIS[PrimeType] {.inline, closure.} =
      case vlss.cond:
        of LT:
          vlss.xs = vlss.xs.tail()
          mrg(); vlss.rslt
        of GT:
          vlss.ys = vlss.ys.tail()
          mrg(); vlss.rslt
        of EQ:
          vlss.xs = vlss.xs.tail(); vlss.ys = vlss.ys.tail()
          mrg(); vlss.rslt
    mrg(); vlss.rslt.tail = clsr; vlss.rslt

With this substitution, the code runs about 25 milliseconds faster whether compiled with GCC or Clang, with the Clang generated code still about 25 milliseconds faster than the GCC generated code. This shows that the "LambdaSet" approach does indeed work for this processor architecture and algorithm, where the missed prediction seems to cost less than the extra closure invocation; for other architectures and/or algorithms, where the processing time has a different ratio to the cost of the branch and/or closure invocation, the results could change significantly. In particular for parsing, which stands to benefit the most from this optimization: parsers that do a lot of work using lots of "mini-parsers" and mini branch decisions will benefit even more from this, but more normal parsers that parse longer text sequences only looking for character "triggers" probably won't benefit as much, or may end up losing to the previous version.

However, I am going to go with the "LambdaSet" implementation, not for the fairly insignificant performance benefits over the previous version without it, but because it seems easier to implement once one gets past the need for "eta expansion" and the types it perceives as recursive in Type Inference...

view this post on Zulip GordonBGood (Jul 11 2024 at 02:29):

Note that most of the performance gains above come from "Mod Cons" in-place mutation and not from "LambdaSet" optimizations, so if one had a working implementation of "Mod Cons" not using "LambdaSet"'s that was compatible with complete Type Inference, then there would be minimal performance loss...

This may look like an over four times performance gain, but that is just because the lousy memory allocation makes the gain seem larger; the actual gain, as compared with a functional language with suitable memory allocation such as GHC Haskell or OCaml (using the same CIS type as here), is about 2.5 times...

Nim actually has a built-in way of tracking "last used" for copy/move semantics that I didn't tap into here; using it, one could have automatically compiled either the mutate-in-place version or the plain one via Nim's compile-time term re-writing macro facility. In this case, the determination is aided in that the internal composites stream generated from the possibly external primes stream is never exposed outside the algorithm: the consumed stream starts with the bpmults function and ends with the cmpsts function, but cmpsts is never used outside the testAt function, which builds a new possibly non-consumed stream only by comparison with it, and thus it is fairly easy to show that it is always used in a way that consumes as it produces, using either OCaml's type system or even Nim's...
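
A tiny illustration of that tracking, using a sink parameter (this is just the bare language feature, not wired into the sieve code above):

type Cell = object
  head: uint64

proc advance(c: sink Cell): Cell =
  # `sink` moves ownership in: when the call site passes its last use of
  # the value, the compiler may reuse the storage instead of copying
  result = c
  result.head += 2

var a = Cell(head: 3)
a = advance(a)    # last use of the old value, so this is eligible for a move
echo a.head       # 5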

As to the completeness of the "LambdaSet" LSS Type system, I'll have to rely on the gurus of the paper and those who have reviewed it, as I am not experienced in this field, the paper is difficult for someone at my stage of learning Type Theory to read, and I don't see much benefit at this stage in spending the time to learn it...

view this post on Zulip GordonBGood (Jul 14 2024 at 03:01):

GordonBGood said:

Okay, so I've read through the materials available from the link, especially the published paper, and while I see that the authors are very adept at Type Theory in proving that a type system using "LambdaSet"'s should be sound and complete, I don't see that they have very well proved "Our evaluation confirmed that LSS yields considerable performance benefits in practice."

I got to thinking that my initial analysis may not have been completely fair to the Morphic paper's evaluation, so I investigated this more thoroughly by downloading their evaluation docker file and repeating their evaluations on my machine. First, I immediately found that they chose numbers of iterations and ranges for the benchmarks that were reasonable for the type of algorithms being evaluated, so scratch that criticism off the list. However, in looking at the actual relative timings between the Morphic generated code and the other languages (as compiled from code translated from Morphic code by a pre-processor stage), it became obvious that some translations just didn't work very well. For instance, the translation of the array sieve code into OCaml produced relatively very inefficient code, likely due to translating into increased use of OCaml ref's, with which I had similar problems when trying to apply these to in-place mutations. So, although they do give an indication that LSS has a benefit when applied to lambda's/closure's, the relative results don't really mean anything. Also, the Morphic generated code shows huge gains due to LSS mostly because the code generated without these optimizations is so terrible, but the general indication truly is that LSS optimizations are mostly effective when applied to combinator style code.

However, these benchmarks are somewhat artificial in that if one were really tasked with these types of things and performance were of concern, one wouldn't write them this way, as follows:

  1. The quicksort benchmark uses iterators internally; if one really wanted it to be fast, one would use recursive loops instead, which should run about ten times faster than the fastest result of this benchmark.
  2. The iterative primes benchmark is meaningless as a practical algorithm and is only a placeholder confirming that heavy use of lambda's/closures can be improved through LSS. Even as a benchmark, if it were re-written to use recursive loops for the isprime function, it would again be many times faster than using lambda's/closures even after LSS optimization.
  3. The same goes for the array sieving primes benchmark, where my implementation not using lambda's/closure's hardly benefits at all from LSS optimization, yet runs faster than the evaluation implementation by a large factor.
  4. The unify, "calc" numeric parsing, and the JSON parsing benchmarks all test the effectiveness of LSS optimizations on the heavy use of combinators, but even there the range of results shows that the optimization depends on the work done within the combinators and how the algorithms are written, with not that large a benefit for the "unify" benchmark, and increasing benefits when applied to the numeric parsing and JSON parsing, likely due to that the numeric parsing requires many more smaller parsers whereas the JSON parsing is parsing across larger strings of text only scanning for mostly start/stop character markers...
  5. The good thing about the LSS optimization as used by Morphic is that it generally doesn't seem to do any or much harm (other than in the complexity of the compiler), although that may be due to some of the limitations that Morphic has as a language and in its type system. On the other hand, Morphic relies on LSS to produce acceptable speed and when it comes into a situation where this isn't effective, the code produced from Morphic isn't competitive (although this may just be due to the experimental nature of the Morphic compiler)...

So I learned enough Morphic language syntax to be able to add two tests: an array based Sieve of Eratosthenes (still following their inefficient algorithm) using an array just as they do but eliminating all (almost) uses of lambda's and closure's in the inner culling loops, and the incremental Tree Folding Sieve of Eratosthenes that did depend on lambdas/closures. From running these, I learned the following:

Comparing Morphic to Roc, it is easy to see that the Roc compiler likely started out as a fork from the Morphic one, as follows:

  1. Putting collections as the first argument of functions and using the forward pipe operator to pass them from function to function is the same between the two.
  2. No currying and partial application for functions.
  3. Only allowing lambda's/closure's to be defined in interior scopes of functions is the same, although Roc takes this a step further and defines all functions as lambda's.
  4. Ugly Rust panic error messages are common across the two, with Morphic only showing a location as an offset in the source file without line and column numbers, which has been improved in Roc (when the compiler doesn't panic).

Roc has also made quite a few other improvements over Morphic, as follows:

  1. Roc doesn't require type annotations on global functions (Morphic requires these, but doesn't allow them on local lambda definitions).
  2. Roc makes curly braces and line/block terminators entirely optional as well as eliminates the let .. in keywords for definition blocks.
  3. One of the biggest differences is in the use of recursive functions: Morphic only allows recursive functions at the global level, and internal lambda's are not allowed to recurse nor to capture functions from outside their scope. This may be why they don't have problems with Type Inference, but it is also quite limiting for a functional language. Roc allows these, especially with a simple fixed point Y Combinator, but then currently runs into problems with Type Inference on what it thinks is a recursively defined Type...

Running my two new tests pretty much proved that LSS is only really applicable to reducing/"lifting" lambdas and closures: the sieved array benchmark sees little benefit from the LSS optimization, yet runs much faster than the previous sieve benchmark because it has no lambda's/closure's in the inner loop at all, and the incremental sieve algorithm, which does have lambda's/closure's, sees only a quite small benefit as they are already quite optimized, and even this benefit would mostly be eliminated if there were in-place mutation for consuming/producing ("Mod Cons") optimizations. Interestingly, the Roc benchmark runs at almost optimum speed for the array sieving without lambdas, at under three CPU clock cycles per inner culling operation, with the translated ML and especially OCaml versions running much (much) slower; for the incremental sieve, the reverse happens, with the translated ML and OCaml versions running at pretty much the optimal speed of about 100 CPU clock cycles per merge operation, whereas the direct Morphic version of the same algorithm is almost three times slower. I think this is because Morphic doesn't optimize lambda's/closure's very well without LSS, which doesn't apply much here (the Mod Cons optimizations don't currently appear to be applied to any of these, although perhaps memory pools are used for OCaml and ML)...

view this post on Zulip GordonBGood (Jul 14 2024 at 06:57):

The problem of the error message printing that the type is an "infinite" type but not showing it seems to be entirely due to how type annotations are handled on error: when there is a type annotation (new in Roc as compared to Morphic), the compiler seems to assume that the given type annotation will be the one derived and prints that rather than the derived one on error; probably quite a simple fix to show the actually derived one...

However, there is still a Type Inference problem for recursive functions that refer to their own function name interpreted as a value as in the following function definition:

CIS a : [CIS a ({} -> CIS a)]

# cisTest : {} -> CIS I32 # no type annotation to show the proper error...
cisTest = \ {} ->
  nxtcis = \ (CIS i itl) -> CIS (i + 2) (\ {} -> nxtcis (itl{}))
  oddscis = \ {} -> CIS 3 (\ {} -> nxtcis (oddscis{}))
  CIS 2 (\ {} -> oddscis{})

which fails to compile with the following error message:

── CIRCULAR TYPE in Seq.roc ────────────────────────────────────────────────────

I'm inferring a weird self-referential type for oddscis:

18│ oddscis = \ {} -> CIS 3 (\ {} -> nxtcis (oddscis{}))
^^^^^^^

Here is my best effort at writing down the type. You will see ∞ for
parts of the type that repeat something already printed out
infinitely.

{} -> [CIS (Num *) ({} -> ∞)] as *

From what I can see, the type inference is first treating oddscis as a value (which it actually is) for purposes of use as a capture in the interior closure, but due to Roc requiring that all functions be defined as lambda's, that is a closure value captured by a closure nested inside a closure, and that causes it to puke...

In Morphic, one is not allowed to define interior functions either, only lambda's as here, but Morphic doesn't allow closure's to be recursive at all, which means that oddscis would need to be defined as a function at the global level in order to compile; the function name is then allowed to be recursive and captured inside the interior closure since, being global, it is not a closure itself. This problem likely came about due to the changes made in converting the Morphic compiler into the Roc compiler. This workaround is not possible in Roc because all functions must be defined as lambdas...

The fix is likely to do a little more complex type analysis when it has been determined that an outer closure is cyclic, "eta expanding" the outer closure so it is only a lambda and not a closure (i.e. passing the captured values as some type of arguments, which is actually how closures are generally implemented), in which case the name assigned to the outer lambda no longer represents a (reference counted) closure structure/"LambdaSet" and can (perhaps) be safely passed to the inner closure as a value...
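
A sketch of that eta expansion in Haskell terms, using the CIS type from earlier and Data.Function.fix to tie the knot outside the closure (my illustration, not a description of what the Roc compiler does):

import Data.Function (fix)

data CIS a = CIS a (() -> CIS a)

nxtcis :: CIS Int -> CIS Int
nxtcis (CIS i itl) = CIS (i + 2) (\ () -> nxtcis (itl ()))

-- the recursive self-reference becomes an ordinary parameter...
oddscisL :: (() -> CIS Int) -> () -> CIS Int
oddscisL self () = CIS 3 (\ () -> nxtcis (self ()))

-- ...and the knot is tied back up outside, so no closure needs to capture itself
oddscis :: () -> CIS Int
oddscis = fix oddscisL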

With the above problem understood (I think), I beg to reverse myself on the benefit of LSS from the Morphic paper: the evaluation is indeed flawed, as the (sometimes huge) performance gains attributed to it are mostly due to the terrible optimization of the baseline case, and generally the resulting optimized performance may be barely any better than what ML or OCaml might achieve if there were no translation process; thus the comparisons to translated versions of OCaml and ML implementations have no relevance, as I previously noted, and the Morphic implementation does not show that it is particularly fast, although it should be a slight amount faster in some cases due to static rather than dynamic function invocation. In other words, LSS is a convoluted way to implement polymorphic lambda lifting, which may already be possible in other ways in languages such as GHC Haskell (whose performance was not compared, since there was no easy way to translate Morphic code into it). Running the equivalent of the incremental sieve benchmark in GHC Haskell runs at about the same speed or faster than LSS optimized Morphic, with whatever GHC Haskell uses for "closure lifting"...

view this post on Zulip GordonBGood (Jul 15 2024 at 02:51):

My main problem with the Morphic LSS paper's evaluation is that they don't compare LSS optimizations to monomorphic optimizations using the Morphic compiler, which would have shown the advantage (or not) of LSS as they set out to do; trying to compare these using other languages like OCaml and MLton adds too many other variables, and it seems that translating/simulating monomorphic optimizations to those languages doesn't work...

view this post on Zulip Anton (Jul 15 2024 at 14:54):

Thanks for the in depth analysis @GordonBGood!

view this post on Zulip Anton (Jul 15 2024 at 14:56):

the Roc compiler likely started out as a fork from the Morphic one

The Roc compiler actually started out based on the Elm compiler. I don't know how many changes we made to use it with morphic afterwards though.

view this post on Zulip Richard Feldman (Jul 15 2024 at 14:57):

the initial version of Roc's type checker was basically a direct port of Elm's (porting from Haskell to Rust)

view this post on Zulip Richard Feldman (Jul 15 2024 at 14:58):

the parser, canonicalizer, monomorphizer, and code generation were all written from scratch

view this post on Zulip Richard Feldman (Jul 15 2024 at 14:58):

the only part of Morphic that we use is the Morphic library during alias analysis

view this post on Zulip GordonBGood (Jul 15 2024 at 18:59):

Richard Feldman said:

the parser, canonicalizer, monomorphizer, and code generation were all written from scratch

Ah, okay, I was led astray there by the similarities between the Morphic and Roc syntaxes as far as forward piping, no partial function application or currying, etc.

Rather than ranting on about the deficiencies in the Morphic evaluation of their results, I'm going to change tack and try to look at what problems there are in applying the Morphic alias analysis at an early stage in the Roc compiler...

There still is one problem that the Morphic compiler seems to have that would be revealed more clearly if the compiler were given a more exhaustive set of benchmark tests: it seems to occasionally decide to do LSS closure lifting when it shouldn't. GHC Haskell talks about this problem and implements selective closure lifting, recognizing that lifting indiscriminately can lead to inefficient code in some instances. In the Morphic evaluations plus my own, current Morphic seems to fail in this way on the "unify" and my "functional tree folding sieve" evaluations. This would need to be looked at if Morphic were to be turned into a production ready language...

I observe that the Morphic compiler has a total of about 21 compiler passes before LLVM-IR code generation, with Hindley-Milner Type Inference used twice (once at the usual early stage, and once at a much later pass to resolve (perhaps nested) array type variables); but other than when it attempts LSS closure lifting when it probably shouldn't, it produces executables with very good performance, which is extremely rare for an experimental language such as this...

If I were to step back and say I wanted to write a Roc compiler in Rust that embraces the Morphic LSS optimization with only a small development team, I would probably have done what you haven't: just fork the Morphic repo and work on a new Roc front end. That would mostly include some parts of what you have done in translating the Elm parser, canonicalization, and Type Inference passes to Rust from Haskell, with the appropriate changes to adapt to the Roc syntax and type differences, and any changes required to pass the Morphic compiler chain what it needs from what it lists as the third stage on. As I see it from my analysis, if LSS is a good optimization, then the only deficiency seems to be over-enthusiastic application of it, and perhaps some minor work needs to be done to determine when it applies and when it doesn't...

In that way, you wouldn't have the huge problem of efficient LLVM-IR production you are experiencing, as the Morphic compiler already lowers the generated code to something more acceptable to LLVM, all the automatically elided reference counting would be done for you (which you could later perhaps tweak), and the automatic in-place mutation when appropriate would just be there (which already works very well in Morphic)...

Proceeding this way would be helped by the many similarities between Roc and Morphic syntax; after stripping away the cosmetic differences of compulsory curly brackets and bracketed and comma'd function arguments (perhaps offside rule detection versus syntax delineation), the major work here would be adding your "open" custom types and records, which also seem to be used to describe your expanded numeric "stack"...

Then your small contributor team could concentrate on the Roc language definition just up to the Type Inference stage, with the idea that if the Roc code Type checks then it will almost certainly compile with the Morphic back-end, with all optimization and code generation done for you by the Morphic stages; and given that the Morphic compiler seems quite stable (other than pretty primitive error reporting, much of which would be done at the front end anyway), it would seem you could get much closer to your goals much sooner. Morphic even has a Target Abstraction Layer that seems to be similar in concept to your "platforms" idea...

With the above in mind, combining the Type inference stage with the LSS optimization is likely a bad idea, especially at this stage of development...

Or at least, that's how it currently seems to me :smile:

view this post on Zulip Richard Feldman (Jul 15 2024 at 22:02):

GordonBGood said:

fork the Morphic repo and work on a new Roc front end

unfortunately, those 21 compiler passes are at odds with another of Roc's goals, which is to be a compiler that runs fast :big_smile:

view this post on Zulip Richard Feldman (Jul 15 2024 at 22:03):

we ended up concluding that doing type inference and LSS at the same time is the wrong design

view this post on Zulip Richard Feldman (Jul 15 2024 at 22:03):

and LSS should run in a separate pass, perhaps only in --optimize builds

view this post on Zulip Richard Feldman (Jul 15 2024 at 22:03):

(assuming we have a backend that can emit heap-allocated closures, which we currently don't but want to have access to regardless)

view this post on Zulip GordonBGood (Jul 16 2024 at 02:01):

Richard Feldman said:

unfortunately, those 21 compiler passes are at odds with another of Roc's goals, which is to be a compiler that runs fast :big_smile:

I can see that being the goal in the long term, but perhaps you might have considered that even Rust, with all of its contributors and support from Mozilla, decided it was better to have a slow compiler that generates fast code without too many crashes, and then work on improving compiler speed as one can...

view this post on Zulip GordonBGood (Jul 16 2024 at 02:22):

Richard Feldman said:

we ended up concluding that doing type inference and LSS at the same time is the wrong design

and LSS should run in a separate pass, perhaps only in --optimize builds

(assuming we have a backend that can emit heap-allocated closures, which we currently don't but want to have access to regardless)

I think that is the right decision, and likely so is moving the LSS optimization (which seems to require multiple passes anyway - at least three, from what I recall from the code translation effort) to an option later in the compilation: putting it in the optimization pass would mean fast debug builds, with any slowness caused by it confined to optimized builds, and with the LSS optimizations later in the compilation process, the closure lifting could be more selective and solve some of my perceived problems...

However, that still leaves you with the problems unrelated to LSS of generating fast LLVM-IR code (which seems to require "lowering" of the functional form of code) and smart reference counting in spite of recursion, which seem to have been (mostly) solved by Morphic...

I note that Evan's rejection of recursive references in Elm, other than to functions, in preparation for compiling to WebAssembly automatically means that Elm does not allow defining recursive functions as values which are lambda's, which is Roc's only means of defining them, but that would be easy to fix. However, what he seems to have failed to consider is that such allowed cyclic references, whether in function form or in lambda form, could be closures capturing external values and thus have associated heap allocated/reference counted allocations to consider (when such a closure escapes; when it doesn't escape, the function/lambda can be eta lifted to just be a lambda with hidden parameters filled by the "captured" arguments so as not to be a closure), and the closure escaping case would still generate memory leaks due to the extra internal reference to the captured heap-allocated data...

view this post on Zulip Brendan Hansknecht (Jul 16 2024 at 02:29):

smart reference counting in spite of recursion, which seem to have been (mostly) solved by Morphic...

I think with some more robust ownership tracking (something that started with the borrow inference PR by @Folkert de Vries), we will be able to resolve refcounting issues. That plus things like drop specialization and morphic. So I think that definitely is fixable. Just a work in progress today.

view this post on Zulip Brendan Hansknecht (Jul 16 2024 at 02:35):

generating fast LLVM-IR code (which seems to require "lowering" of the functional form of code)

I'm definitely really curious to see what is needed here. I think it heavily depends on:

  1. how long you're willing to wait on llvm
  2. what functional patterns we expect to be common and want to optimize

Today, Roc code gen with llvm can be fairly optimal (or at least has a few known issues that we can fix). That said, llvm is not a fan of piles of nested closures. It will very slowly inline and unwind them. This is part of the reason why heavily nested (though not particularly complicated) parser combinator code can be such a pain for llvm to optimize. It generally will do a decent job, sometimes will give up, but either way will take much longer to compile than more imperative style code.

If you write relatively flat Roc with some limited higher order functions, llvm can definitely handle it. In many cases, it generates something very close to the raw C. I'm not sure how badly llvm really falls apart with a truly functional pile of code, but am definitely curious. Still thinking about final runtime, I know it can be really bad for compile time.

view this post on Zulip Brendan Hansknecht (Jul 16 2024 at 02:37):

Aside:
I would expect LSS to shine the most in a lot of the simple cases that get made opaque by function pointers, not in complex and fancy more recursive code. As in, I would expect that in real world code, the effect on things like List.map closures would be much larger than the effect on recursive primes generation. (totally just my rough understanding of the effects)

view this post on Zulip Brendan Hansknecht (Jul 16 2024 at 02:39):

Though, I don't know enough about other closure lifting and inlining optimizations to know how well they would work in these cases.

view this post on Zulip GordonBGood (Jul 16 2024 at 02:46):

Brendan Hansknecht said:

Though, I don't know enough about other closure lifting and inlining optimizations to know how well they would work in these cases.

I don't know nearly enough about them either, and only know what I do from dabbling in GHC Haskell and reading what they write about them. However, my observation from the Morphic evaluation is that LSS as implemented there is pretty effective at reducing chains of lambda's/closures, although it stops short of totally eliminating them (the reduction may be entirely due to LLVM inlining and not what LSS does itself), whereas GHC Haskell can completely eliminate lambda's/closure's in inner loops for some specializations such as List Fusion...

view this post on Zulip Brendan Hansknecht (Jul 16 2024 at 02:49):

Does that require lazy? Do you know if their optimizations would apply to roc?

view this post on Zulip Brendan Hansknecht (Jul 16 2024 at 02:51):

And yeah, I would guess that morphic enables llvm to inline everything instead of doing some other optimization themselves to remove the need for inlining at all. That said, I'm not sure how other languages would inline if they have multiple possible choices for a function, cause those would be two distinct function pointers. Can Haskell fuse if a function midway through is a map with an option of 1 of 3 possible functions?

view this post on Zulip GordonBGood (Jul 16 2024 at 02:55):

Brendan Hansknecht said:

Does that require lazy? Do you know if their optimizations would apply to roc?

I don't believe that the lambda lifting optimizations require being lazy (other than that there is no other kind of list than a lazy one, of course); however, GHC Haskell seems perfectly capable of lifting lambda's/closures that are not related to (lazy) lists, such as in tight prime sieve culling loops when formulated in a way that forces such lifting.

I don't know enough about the Roc compiler (and that seems to be in a state of flux anyway as per Richard) to be able to answer whether Roc can use the Haskell techniques, but from what I read about the Haskell lambda lifting, it is the result of multiple compiler passes that have "lowered" the code to a more imperative style...

view this post on Zulip Brendan Hansknecht (Jul 16 2024 at 02:57):

Ok

view this post on Zulip Brendan Hansknecht (Jul 16 2024 at 02:57):

That makes sense

view this post on Zulip Brendan Hansknecht (Jul 16 2024 at 03:03):

Useful reference: https://haskell.foundation/hs-opt-handbook.github.io/src/Optimizations/GHC_opt/lambda_lifting.html

view this post on Zulip Brendan Hansknecht (Jul 16 2024 at 03:05):

I'm still unsure how/if it can deal with cases that a function can be 1 of n options. It seems more focused on if you know your closure is specifically one case.

view this post on Zulip Brendan Hansknecht (Jul 16 2024 at 03:07):

Or if the lambda comes from far enough away that it may not be known at the call site that there is only one variant and it can be lifted and inlined.

view this post on Zulip GordonBGood (Jul 16 2024 at 03:08):

Brendan Hansknecht said:

That said, I'm not sure how other languages would inline if they have multiple possible choices for a function, cause those would be two distinct function pointers. Can Haskell fuse if a function midway through is a map with an option of 1 of 3 possible functions?

I think that GHC Haskell can't generally inline in the situation of the possibility of multiple output lambda's/closure's, but then these are often not inlinable anyway due to being recursive - for instance, a lazy merge function with three possible closure outputs, each of whose closures refers to the outer merge recursively. As I showed in the Nim exploration, first using general in-place mutation of the loop captures and result, even LSS is only of slight benefit here, trading off whatever the function is doing against the cost of dynamic versus static function invocation...

List Fusion is a specialization, and an interior map that results in a variant union type which is a set of closures would be fused, but the closures, being escaping, might only be inlined if a later HOF in the chain unpeeled the variant union to call the appropriate one.
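
A minimal Haskell illustration of the shape under discussion, with the set of possible functions defunctionalized into a variant by hand (hypothetical names):

data F = Inc | Dbl | Neg

apply :: F -> Int -> Int
apply Inc = (+ 1)
apply Dbl = (* 2)
apply Neg = negate

-- the producer/consumer chain around the map can still fuse, but the
-- per-element case on f remains unless f is statically known at the call site
process :: F -> [Int] -> Int
process f = sum . map (apply f) . filter even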

view this post on Zulip Brendan Hansknecht (Jul 16 2024 at 03:09):

Thanks for the details!

view this post on Zulip GordonBGood (Jul 16 2024 at 03:10):

Brendan Hansknecht said:

Or if the lambda comes from far enough away that it may not be known at the call site that there is only one variant and it can be lifted and inlined.

Aye, there's that, but there are exceptions to just about any optimization. The Morphic evaluation just chose to benchmark the ones that most applied to LSS, missing only on the "unify" benchmark...

view this post on Zulip Richard Feldman (Jul 16 2024 at 03:36):

related past discussion: #ideas > basic deforestation

view this post on Zulip GordonBGood (Jul 26 2024 at 20:08):

@Richard Feldman:

I have updated the issue I filed on this to include the short example from near the beginning of this thread...

view this post on Zulip Richard Feldman (Jul 26 2024 at 22:35):

nice, thanks!

