Stream: compiler development

Topic: zig compiler - dealing with memory dynamics of temporary ...


view this post on Zulip Anthony Bullard (Feb 27 2025 at 12:14):

I noticed in my last PR some subtle memory issues with how slice fields in typed nodes returned from the NodeStore worked. I believe there are places where, if the scratch array that one of those structs came from resizes while you still have a reference to that slice, you can get some weird behavior.

I really would like to avoid creating owned slices over data that is in extra_data. My idea is to replace the slices with a new data structure called a DataSpan defined like this:

    pub const DataSpan = struct {
        start: u32,
        len: u32,
    };

This span would be created by separately sending the slice (on typed node creation) to a method that returns a typed instance of this DataSpan type, which just records where the nodes were stored in extra_data.

When you want to get the items back from the Store, you would call a method on the Store that takes the typed DataSpan and returns a typed Iterator that you can use:

    pub const ExprSpan = struct { span: DataSpan };

    pub fn IdIter(comptime T: type) type {
        return struct {
            iter: Iterator,
            pub fn next(self: *@This()) ?T {
                if (self.iter.next()) |n| {
                    return .{ .id = n };
                }
                return null;
            }
        };
    }

    pub const ExprIter = IdIter(ExprIdx);

    const Iterator = struct {
        start: usize,
        end: usize,
        pos: usize,
        data: *std.ArrayList(u32),

        pub fn new(span: DataSpan, data: *std.ArrayList(u32)) @This() {
            const start = @as(usize, @intCast(span.start));
            const end = start + @as(usize, @intCast(span.len));
            return .{
                .start = start,
                .end = end,
                .pos = start,
                .data = data,
            };
        }

        pub fn next(self: *@This()) ?u32 {
            if (self.pos == self.end or self.data.items.len <= self.pos) {
                return null;
            }
            const curr = self.data.items[self.pos];
            self.pos += 1;
            return curr;
        }
    };

The advantage is that we are no longer allocating a bunch of short-lived slices whose lifetimes we have to manage, and we don't have to worry about pointers being invalidated (or memory regions being overwritten). And working with these Spans and Iterators is relatively straightforward.
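To make the idea concrete, here is a small standalone sketch of span-based iteration. It repeats (slightly condensed) versions of the `DataSpan` and `Iterator` definitions above so it compiles on its own; the values in `extra_data` are made-up stand-ins for node ids, not anything from the real NodeStore.

```zig
const std = @import("std");

// Condensed re-statement of the definitions above, so this compiles standalone.
const DataSpan = struct { start: u32, len: u32 };

const Iterator = struct {
    end: usize,
    pos: usize,
    data: *std.ArrayList(u32),

    pub fn new(span: DataSpan, data: *std.ArrayList(u32)) Iterator {
        const start: usize = @intCast(span.start);
        return .{ .end = start + span.len, .pos = start, .data = data };
    }

    pub fn next(self: *Iterator) ?u32 {
        if (self.pos >= self.end or self.pos >= self.data.items.len) return null;
        const curr = self.data.items[self.pos];
        self.pos += 1;
        return curr;
    }
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    var extra_data = std.ArrayList(u32).init(gpa.allocator());
    defer extra_data.deinit();

    // Pretend these are raw node ids living in extra_data.
    try extra_data.appendSlice(&.{ 10, 20, 30, 40 });

    // A span covering items [1..3): yields 20, then 30.
    var it = Iterator.new(.{ .start = 1, .len = 2 }, &extra_data);
    while (it.next()) |id| {
        std.debug.print("{d}\n", .{id});
    }
}
```

Note that the iterator holds a pointer to the ArrayList itself rather than to its items, so it stays valid even if the list reallocates between `next()` calls.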

view this post on Zulip Anthony Bullard (Feb 27 2025 at 12:26):

So today, instead of this when creating an Expr.list:

            // ... collect the list items in scratch_exprs
            const items = self.store.scratch_exprs.items[scratch_top..];
            expr = self.store.addExpr(.{ .list = .{
                .items = items,
                .region = .{ .start = start, .end = self.pos },
            } });

you would have one additional step (really function call):

            const items = self.store.addExprSpan(self.store.scratch_exprs.items[scratch_top..]);
            expr = self.store.addExpr(.{ .list = .{
                .items = items,
                .region = .{ .start = start, .end = self.pos },
            } });

The pattern above is common enough that it may instead be

            const items = self.store.addExprSpan(scratch_top); // addExprSpan will make the span from scratch_top until the end of the ArrayList's items
            expr = self.store.addExpr(.{ .list = .{
                .items = items,
                .region = .{ .start = start, .end = self.pos },
            } });

view this post on Zulip Anthony Bullard (Feb 27 2025 at 12:33):

And getting it back for things like formatting (or reading it to move to the next IR) is very simple as well. This:

        .list => |l| {
            fmt.push('[');
            var i: usize = 0;
            for (l.items) |item| {
                fmt.formatExpr(item);
                if (i < (l.items.len - 1)) {
                    fmt.pushAll(", ");
                }
                i += 1;
            }
            fmt.push(']');
        },

Becomes:

        .list => |l| {
            fmt.push('[');
            var i: usize = 0;
        var iter = fmt.ast.store.getExprIter(l.items); // Get the IdIter (next() takes a mutable pointer, so this must be var)
            while (iter.next()) |item| { // Consume items while next() returns a value
                fmt.formatExpr(item);
                if (i < (l.items.len - 1)) {
                    fmt.pushAll(", ");
                }
                i += 1;
            }
            fmt.push(']');
        },

DataSpan still has a len field, so the if condition there can remain the way it is!

view this post on Zulip Joshua Warner (Feb 27 2025 at 13:25):

The other thing that you could do is guarantee that the child nodes for a given node are all contiguous, by returning and buffering the full node struct rather than using a list of node ids. That means that each node only needs a begin and end index into the primary node store - no need for an unbounded-size “extra” data.
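For concreteness, the contiguous-children layout might look something like this sketch (the names and the `Tag` enum are assumed for illustration, not Joshua's actual code): children are flushed into the primary node list in one contiguous run just before their parent, so the parent needs only two u32 indices.

```zig
const std = @import("std");

// A node whose children occupy a contiguous range of the primary node
// list, so no separate extra_data array is needed.
const Node = struct {
    tag: Tag,
    children_begin: u32, // index of the first child in the node list
    children_end: u32, // one past the last child

    const Tag = enum { num, list };

    pub fn children(self: Node, nodes: []const Node) []const Node {
        return nodes[self.children_begin..self.children_end];
    }
};
```

With this layout, reading a node's children is a single slice into the node list, at the cost of moving whole nodes (rather than ids) through the scratch stack during parsing.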

view this post on Zulip Anthony Bullard (Feb 27 2025 at 13:44):

Not really possible due to recursion

view this post on Zulip Anthony Bullard (Feb 27 2025 at 13:47):

I'm very interested in feedback here from @Sam Mohr and others working downstream from Parse IR

view this post on Zulip Joshua Warner (Feb 27 2025 at 14:29):

Hmm I don’t follow why recursion would cause problems?

view this post on Zulip Joshua Warner (Feb 27 2025 at 14:30):

This is exactly what I’ve done for some IRs in the past

view this post on Zulip Anthony Bullard (Feb 27 2025 at 14:39):

What do you use to track the nested nodes?

view this post on Zulip Anthony Bullard (Feb 27 2025 at 14:40):

Let's say you have this list expression:

[
  [1, 2, 3],
  [4, 5, 6],
  [7, 8, 9],
]

if you store the top level list's items - what would that look like?

view this post on Zulip Anthony Bullard (Feb 27 2025 at 14:45):

I guess the more relevant question is, have you implemented those IRs using SOA before?

view this post on Zulip Anthony Bullard (Feb 27 2025 at 14:45):

If so, maybe there is a technique you know that I'm not thinking of!

view this post on Zulip Joshua Warner (Feb 27 2025 at 15:13):

You can use a stack of scratch nodes, just like you're using now.
In your example, the parser would:

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:45):

So slices are just stored as a u32 length? And then we have to load each of them to know how far to backtrack to build up the slice of items?

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:48):

So #13 (the parent list) is requested, it has a length for items of 3. I have to load #12 recursively to see where it ends (the index of the first descendant recursively), then do the same for the other two?

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:50):

Seems like I have to load all of the nodes from the one requested to the first one added while it was added to just get the list of node idxs for its children

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:50):

And will have to do that each time

view this post on Zulip Joshua Warner (Feb 27 2025 at 15:51):

Slices are a begin and end NodeIdx

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:51):

But I still have to load each child to get it

view this post on Zulip Joshua Warner (Feb 27 2025 at 15:52):

Not sure I follow? When is this load happening?

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:52):

To get the slice info

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:52):

Which is the data field on the node

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:52):

Today

view this post on Zulip Joshua Warner (Feb 27 2025 at 15:53):

The begin/end are the lhs and rhs

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:53):

We should chat about this later. Hard to illustrate on my phone while in a meeting

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:54):

And I want to make sure I understand your approach

view this post on Zulip Joshua Warner (Feb 27 2025 at 15:54):

Haha yeah I’m trying to type on the train and not fall over

view this post on Zulip Anthony Bullard (Feb 27 2025 at 15:54):

:grinning:. Been there, used to be everyday on the Muni

view this post on Zulip Joshua Warner (Feb 27 2025 at 15:54):

Later is probs better

view this post on Zulip Brendan Hansknecht (Feb 27 2025 at 19:06):

Anthony Bullard said:

I noticed in my last PR some subtle memory issues with how slice fields in typed nodes returned from the NodeStore worked. I believe there are places where if the scratch array that one of those structs came from resize while you still have a reference to that slice you can get some weird behavior.

I really would like to avoid creating owned slices over data that is in extra_data. My idea is to replace the slices with a new data structure called a DataSpan defined like this:

    pub const DataSpan = struct {
        start: u32,
        len: u32,
    };

Just to quickly note, this is a totally reasonable alternative to slices when you are working with data with unstable pointers. Fundamentally, you can't keep a slice into a growing container: when the container grows, all pointers into it become invalid.

As long as we link the data span to the original container, that is fundamentally reasonable.

Also, I don't think array lists have this feature yet, but Zig has a feature to lock pointers and assert stability (at least on hash maps). It doesn't solve this issue, but it does make it easier to catch these kinds of issues.


That said, having the slices in the first place that outlive growth might be a sign of an issue in how we are designing this.

Slices never own data. They are at the whims of the owning container and must be used carefully.
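Brendan's point is easy to demonstrate with a short sketch: a pointer taken from an ArrayList's items can be left dangling by the very next append that forces the backing allocation to grow (the function below only compares the old and new base pointers, since actually reading through the stale one would be undefined behavior).

```zig
const std = @import("std");

// Returns true if growing the list moved its backing allocation,
// i.e. any previously taken slice into it would now dangle.
pub fn pointerMoved(allocator: std.mem.Allocator) !bool {
    var list = std.ArrayList(u32).init(allocator);
    defer list.deinit();

    try list.append(1);
    const before = list.items.ptr; // base of the current allocation

    // Grow far past the initial capacity so a reallocation occurs.
    try list.appendNTimes(0, 4096);
    return list.items.ptr != before;
}
```

This is exactly the failure mode a `(start, len)` span avoids: the span stays meaningful across reallocation because it is resolved against the container at read time.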

view this post on Zulip Sam Mohr (Feb 27 2025 at 22:56):

I don't think the canonicalization would have any problem ingesting data with this interface, and I expect that a small proportion of the canonicalization work is actually "talking" to the parse AST, and most of the work is elsewhere

view this post on Zulip Sam Mohr (Feb 27 2025 at 22:56):

So this seems fine to me

view this post on Zulip Brendan Hansknecht (Feb 27 2025 at 23:09):

Oh, previously the list expression was storing a slice of sub expressions

view this post on Zulip Brendan Hansknecht (Feb 27 2025 at 23:10):

If we plan to serialize list expressions, they would need the integer-based spans anyway

view this post on Zulip Brendan Hansknecht (Feb 27 2025 at 23:10):

That said, this is the parse AST, and caching just requires serializing the can AST. So I'm not sure that is important.

view this post on Zulip Brendan Hansknecht (Feb 27 2025 at 23:11):

What actually owns the scratch expressions? If they are stored in the other expression nodes, it seems that they need proper ownership and are not just scratch.

view this post on Zulip Anthony Bullard (Feb 28 2025 at 00:15):

The scratch is just that, temporary bits of memory to go from raw u32s to typed ids and back

view this post on Zulip Anthony Bullard (Feb 28 2025 at 00:16):

This change will ensure that they are always temporary - data added will never outlive the function scope

view this post on Zulip Anthony Bullard (Feb 28 2025 at 00:17):

The span will just describe where the extra data for the node lives if any

view this post on Zulip Anthony Bullard (Feb 28 2025 at 00:23):

And Brendan, scratch is a stack, so it helps us keep related extra data contiguous in the array list

view this post on Zulip Anthony Bullard (Feb 28 2025 at 00:25):

And a serialization scheme could easily consume this, and I believe it should be able to recreate the exact same layout deterministically

view this post on Zulip Anthony Bullard (Mar 01 2025 at 18:01):

Here's my first PR towards implementing this. It was pretty painless: https://github.com/roc-lang/roc/pull/7649

view this post on Zulip Joshua Warner (Mar 01 2025 at 22:41):

I hadn’t considered just making the scratch stack a stack of ids rather than full nodes. It looks pretty clean tho!

view this post on Zulip Joshua Warner (Mar 01 2025 at 22:42):

Doing nodes instead would trade off a bit of extra data movement during parsing for better data locality later.

view this post on Zulip Joshua Warner (Mar 01 2025 at 22:43):

We might be able to selectively remove that data shuffling during parsing :thinking:

view this post on Zulip Joshua Warner (Mar 01 2025 at 22:43):

Anyway, this looks like a clear improvement as is.


Last updated: Jul 06 2025 at 12:14 UTC