Stream: compiler development

Topic: zig compiler - dealing with memory dynamics of temporary ...


view this post on Zulip Anthony Bullard (Feb 27 2025 at 12:14):

I noticed in my last PR some subtle memory issues with how slice fields in typed nodes returned from the NodeStore worked. I believe there are places where, if the scratch array that one of those structs came from resizes while you still have a reference to that slice, you can get some weird behavior.

I really would like to avoid creating owned slices over data that is in extra_data. My idea is to replace the slices with a new data structure called a DataSpan defined like this:

    pub const DataSpan = struct {
        start: u32,
        len: u32,
    };

This span would be created by separately sending the slice (on typed node creation) to a method that returns a typed instance of this DataSpan type, which just records where the nodes were stored in extra_data.

When you want to get the items back from the Store, you would call a method on the Store that takes the typed DataSpan and returns a typed Iterator that you can use:

    pub const ExprSpan = struct { span: DataSpan };

    pub fn IdIter(comptime T: type) type {
        return struct {
            iter: Iterator,
            pub fn next(self: *@This()) ?T {
                if (self.iter.next()) |n| {
                    return .{ .id = n };
                }
                return null;
            }
        };
    }

    pub const ExprIter = IdIter(ExprIdx);

    const Iterator = struct {
        start: usize,
        end: usize,
        pos: usize,
        data: *std.ArrayList(u32),

        pub fn new(span: DataSpan, data: *std.ArrayList(u32)) @This() {
            const start = @as(usize, @intCast(span.start));
            const end = start + @as(usize, @intCast(span.len));
            return .{
                .start = start,
                .end = end,
                .pos = start,
                .data = data,
            };
        }

        pub fn next(self: *@This()) ?u32 {
            if (self.pos == self.end or self.data.items.len <= self.pos) {
                return null;
            }
            const curr = self.data.items[self.pos];
            self.pos += 1;
            return curr;
        }
    };

The advantage is that we are no longer allocating a bunch of short-lived slices whose lifetimes we have to manage, and we don't have to worry about pointers being invalidated (or memory regions being overwritten). And working with these Spans and Iterators is relatively straightforward.
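To make the idea concrete, here is a small standalone sketch of span-based iteration. It repeats (slightly condensed) versions of the `DataSpan` and `Iterator` definitions above so it compiles on its own; the values in `extra_data` are made-up stand-ins for node ids, not anything from the real NodeStore.

```zig
const std = @import("std");

// Condensed re-statement of the definitions above, so this compiles standalone.
const DataSpan = struct { start: u32, len: u32 };

const Iterator = struct {
    end: usize,
    pos: usize,
    data: *std.ArrayList(u32),

    pub fn new(span: DataSpan, data: *std.ArrayList(u32)) Iterator {
        const start: usize = @intCast(span.start);
        return .{ .end = start + span.len, .pos = start, .data = data };
    }

    pub fn next(self: *Iterator) ?u32 {
        if (self.pos >= self.end or self.pos >= self.data.items.len) return null;
        const curr = self.data.items[self.pos];
        self.pos += 1;
        return curr;
    }
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    var extra_data = std.ArrayList(u32).init(gpa.allocator());
    defer extra_data.deinit();

    // Pretend these are raw node ids living in extra_data.
    try extra_data.appendSlice(&.{ 10, 20, 30, 40 });

    // A span covering items [1..3): yields 20, then 30.
    var it = Iterator.new(.{ .start = 1, .len = 2 }, &extra_data);
    while (it.next()) |id| {
        std.debug.print("{d}\n", .{id});
    }
}
```

Note that the iterator holds a pointer to the ArrayList itself rather than to its items, so it stays valid even if the list reallocates between `next()` calls.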

view this post on Zulip Anthony Bullard (Feb 27 2025 at 12:26):

So today, instead of this when creating an Expr.list:

            // ... collect the list items in scratch_exprs
            const items = self.store.scratch_exprs.items[scratch_top..];
            expr = self.store.addExpr(.{ .list = .{
                .items = items,
                .region = .{ .start = start, .end = self.pos },
            } });

you would have one additional step (really function call):

            const items = self.store.addExprSpan(self.store.scratch_exprs.items[scratch_top..]);
            expr = self.store.addExpr(.{ .list = .{
                .items = items,
                .region = .{ .start = start, .end = self.pos },
            } });

The pattern above is common enough that it may instead be

            const items = self.store.addExprSpan(scratch_top); // addExprSpan will make the span from scratch_top until the end of the ArrayList's items
            expr = self.store.addExpr(.{ .list = .{
                .items = items,
                .region = .{ .start = start, .end = self.pos },
            } });

view this post on Zulip Anthony Bullard (Feb 27 2025 at 12:33):

And getting it back for things like formatting (or reading it to move to the next IR) is very simple as well. This:

        .list => |l| {
            fmt.push('[');
            var i: usize = 0;
            for (l.items) |item| {
                fmt.formatExpr(item);
                if (i < (l.items.len - 1)) {
                    fmt.pushAll(", ");
                }
                i += 1;
            }
            fmt.push(']');
        },

Becomes:

        .list => |l| {
            fmt.push('[');
            var i: usize = 0;
        var iter = fmt.ast.store.getExprIter(l.items); // Get the IdIter (next() takes a mutable pointer, so this must be var)
            while (iter.next()) |item| { // Consume items while next() returns a value
                fmt.formatExpr(item);
                if (i < (l.items.len - 1)) {
                    fmt.pushAll(", ");
                }
                i += 1;
            }
            fmt.push(']');
        },

DataSpan still has a len field, so the if condition there can remain the way it is!

view this post on Zulip Joshua Warner (Feb 27 2025 at 13:25):

The other thing that you could do is guarantee that the child nodes for a given node are all contiguous, by returning and buffering the full node struct rather than using a list of node ids. That means that each node only needs a begin and end index into the primary node store - no need for an unbounded-size “extra” data.
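For concreteness, the contiguous-children layout might look something like this sketch (the names and the `Tag` enum are assumed for illustration, not Joshua's actual code): children are flushed into the primary node list in one contiguous run just before their parent, so the parent needs only two u32 indices.

```zig
const std = @import("std");

// A node whose children occupy a contiguous range of the primary node
// list, so no separate extra_data array is needed.
const Node = struct {
    tag: Tag,
    children_begin: u32, // index of the first child in the node list
    children_end: u32, // one past the last child

    const Tag = enum { num, list };

    pub fn children(self: Node, nodes: []const Node) []const Node {
        return nodes[self.children_begin..self.children_end];
    }
};
```

With this layout, reading a node's children is a single slice into the node list, at the cost of moving whole nodes (rather than ids) through the scratch stack during parsing.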

view this post on Zulip Anthony Bullard (Feb 27 2025 at 13:44):

Not really possible due to recursion

view this post on Zulip Anthony Bullard (Feb 27 2025 at 13:47):

I'm very interested in feedback here from @Sam Mohr and others working downstream from Parse IR

view this post on Zulip Joshua Warner (Feb 27 2025 at 14:29):

Hmm I don’t follow why recursion would cause problems?

view this post on Zulip Joshua Warner (Feb 27 2025 at 14:30):

This is exactly what I’ve done for some IRs in the past

view this post on Zulip Anthony Bullard (Feb 27 2025 at 14:39):

What do you use to track the nested nodes?

view this post on Zulip Anthony Bullard (Feb 27 2025 at 14:40):

Let's say you have this list expression:

[
  [1, 2, 3],
  [4, 5, 6],
  [7, 8, 9],
]

if you store the top level list's items - what would that look like?

view this post on Zulip Anthony Bullard (Feb 27 2025 at 14:45):

I guess the more relevant question is, have you implemented those IRs using SOA before?

view this post on Zulip Anthony Bullard (Feb 27 2025 at 14:45):

If so, maybe there is a technique you know that I'm not thinking of!

view this post on Zulip Joshua Warner (Feb 27 2025 at 15:13):

You can use a stack of scratch nodes, just like you're using now.
In your example, the parser would:

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:45):

So slices are just stored as a u32 length? And then we have to load each of them to know how far to backtrack to build up the slice of items?

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:48):

So #13 (the parent list) is requested, it has a length for items of 3. I have to load #12 recursively to see where it ends (the index of the first descendant recursively), then do the same for the other two?

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:50):

Seems like I have to load all of the nodes from the one requested to the first one added while it was added to just get the list of node idxs for its children

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:50):

And will have to do that each time

view this post on Zulip Joshua Warner (Feb 27 2025 at 15:51):

Slices are a begin and end NodeIdx

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:51):

But I still have to load each child to get it

view this post on Zulip Joshua Warner (Feb 27 2025 at 15:52):

Not sure I follow? When is this load happening?

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:52):

To get the slice info

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:52):

Which is the data field on the node

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:52):

Today

view this post on Zulip Joshua Warner (Feb 27 2025 at 15:53):

The begin/end are the lhs and rhs

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:53):

We should chat about this later. Hard to illustrate on my phone while in a meeting

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:54):

And I want to make sure I understand your approach

view this post on Zulip Joshua Warner (Feb 27 2025 at 15:54):

Haha yeah I’m trying to type on the train and not fall over

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:54):

:grinning:. Been there, used to be everyday on the Muni

view this post on Zulip Joshua Warner (Feb 27 2025 at 15:54):

Later is probs better

view this post on Zulip Brendan Hansknecht (Feb 27 2025 at 19:06):

Anthony Bullard said:

I noticed in my last PR some subtle memory issues with how slice fields in typed nodes returned from the NodeStore worked. I believe there are places where if the scratch array that one of those structs came from resize while you still have a reference to that slice you can get some weird behavior.

I really would like to avoid creating owned slices over data that is in extra_data. My idea is to replace the slices with a new data structure called a DataSpan defined like this:

    pub const DataSpan = struct {
        start: u32,
        len: u32,
    };

Just to quickly note, this is a totally reasonable alternative to slices when you are working with data with unstable pointers. Fundamentally, you can't keep a slice into a growing container: when the container grows, all pointers into it become invalid.

As long as we link the data span to the original container, that is fundamentally reasonable.

Also, I don't think array lists have this feature yet, but Zig has a feature to lock pointers and assert stability (at least on hash maps). It doesn't solve this issue, but it does make it easier to catch these kinds of issues.


That said, having the slices in the first place that outlive growth might be a sign of an issue in how we are designing this.

Slices never own data. They are at the whims of the owning container and must be used carefully.
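Brendan's point is easy to demonstrate with a short sketch: a pointer taken from an ArrayList's items can be left dangling by the very next append that forces the backing allocation to grow (the function below only compares the old and new base pointers, since actually reading through the stale one would be undefined behavior).

```zig
const std = @import("std");

// Returns true if growing the list moved its backing allocation,
// i.e. any previously taken slice into it would now dangle.
pub fn pointerMoved(allocator: std.mem.Allocator) !bool {
    var list = std.ArrayList(u32).init(allocator);
    defer list.deinit();

    try list.append(1);
    const before = list.items.ptr; // base of the current allocation

    // Grow far past the initial capacity so a reallocation occurs.
    try list.appendNTimes(0, 4096);
    return list.items.ptr != before;
}
```

This is exactly the failure mode a `(start, len)` span avoids: the span stays meaningful across reallocation because it is resolved against the container at read time.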

view this post on Zulip Sam Mohr (Feb 27 2025 at 22:56):

I don't think the canonicalization would have any problem ingesting data with this interface, and I expect that a small proportion of the canonicalization work is actually "talking" to the parse AST, and most of the work is elsewhere

view this post on Zulip Sam Mohr (Feb 27 2025 at 22:56):

So this seems fine to me

view this post on Zulip Brendan Hansknecht (Feb 27 2025 at 23:09):

Oh, previously the list expression was storing a slice of sub expressions

view this post on Zulip Brendan Hansknecht (Feb 27 2025 at 23:10):

If we plan to serialize list expressions, they would need the integer-based spans anyway

view this post on Zulip Brendan Hansknecht (Feb 27 2025 at 23:10):

That said, this is the parse AST, and caching just requires serializing the can AST. So I'm not sure that is important.

view this post on Zulip Brendan Hansknecht (Feb 27 2025 at 23:11):

What actually owns the scratch expressions? If they are stored in the other expression nodes, it seems that they need proper ownership and are not just scratch.

view this post on Zulip Anthony Bullard (Feb 28 2025 at 00:15):

The scratch is just that, temporary bits of memory to go from raw u32s to typed ids and back

view this post on Zulip Anthony Bullard (Feb 28 2025 at 00:16):

This change will ensure that they are always temporary - data added will never outlive the function scope

view this post on Zulip Anthony Bullard (Feb 28 2025 at 00:17):

The span will just describe where the extra data for the node lives if any

view this post on Zulip Anthony Bullard (Feb 28 2025 at 00:23):

And Brendan, scratch is a stack, so it helps us keep related extra data contiguous in the array list

view this post on Zulip Anthony Bullard (Feb 28 2025 at 00:25):

And a serialization scheme could easily consume this, and I believe it should be able to recreate the exact same layout deterministically

view this post on Zulip Anthony Bullard (Mar 01 2025 at 18:01):

Here's my first PR towards implementing this. It was pretty painless: https://github.com/roc-lang/roc/pull/7649

view this post on Zulip Joshua Warner (Mar 01 2025 at 22:41):

I hadn’t considered just making the scratch stack a stack of ids rather than full nodes. It looks pretty clean tho!

view this post on Zulip Joshua Warner (Mar 01 2025 at 22:42):

Doing nodes instead would trade off a bit of extra data movement during parsing for better data locality later.

view this post on Zulip Joshua Warner (Mar 01 2025 at 22:43):

We might be able to selectively remove that data shuffling during parsing :thinking:

view this post on Zulip Joshua Warner (Mar 01 2025 at 22:43):

Anyway, this looks like a clear improvement as is.


Last updated: Jul 06 2025 at 12:14 UTC