ideating chunk-based data structures

2026-06-08 21:45:30 -04:00
parent 0be9b4f0fb
commit d7b66a5dda
1 changed files with 241 additions and 0 deletions
@@ -0,0 +1,241 @@
+# Ed's Chunk-Based Data Structure Ideation — 2026-05-23
+
+**Source:** User-provided notes from an ideation session (Discord messages + 5 image transcriptions via `MiniMax understand_image`).
+**Date:** 2026-05-23 (per timestamp in the notes)
+**Archived:** 2026-06-08
+**Status:** Raw ideation. Not yet an article. The user noted: *"Ok I'm done thats the basic drafting for w/e that turns into when I feel like writing a proper article and I have some code on a repo to give some weight to it"*.
+
+> **Context for this archive.** The user mentioned this ideation while asking for transcripts of two YouTube videos (Casey Muratori "Big OOPs" + Andrew Reece "Assuming as Much as Possible") and the two Fleury Digital Grove articles ("The Codepath Combinatoric Explosion" + "A Taxonomy of Computation Shapes"). The chunk ideation is highly aligned with Reece's talk (the Xar data structure is exactly this chunking pattern) and with Muratori's talk (ECS archetype chunks are systems-over-hierarchies). The user wants all four sources to ground the upcoming `code_path_audit_20260607` track.
+
+---
+
+## Image 1 (2026-05-23 12:39 PM) — Original ideation, raw notes
+
+> One of the articles I want to write that's not on there, but that I want more exp before I do, is related to making truly generic, scalable data structures that fundamentally operate with constraints that allow them to lego properly.
+>
+> And the fundamental thing you have to preserve or utilize with any data structure that's multi-element is fixed-size slices. You don't have to bake the fixed size for the slice at comp time, but you must always decide a fixed-size heuristic to use.
+>
+> As soon as you do that you can lego a bunch of things and they will nearly always last longer and perform better than if you assumed an indefinite linear tape or array for storage, or some arbitrary fragmentation storage pool. And then the concept of indefinite linearity becomes a frontend ergonomic for the user of a module or interface, not an actual behavior.
+>
+> So like a TArray in UE, I would force there to be a fixed size you must pass for the slice chunk. Same with tmap.
+>
+> If you were to ever process data from those data structures you must be aware of that chunk. You don't get to ignore it, even if you do linear access. From there you can pretty much preserve a low-performance interface and opt into chunk awareness for parallel processing, cache awareness, etc. And it becomes way easier to opt in without a rewrite.
+>
+> The thing people point to as bad, the double indirection of having two indices (one for the chunk and one for the element in the chunk), is a bad fear to have. The computer cannot process that much data in cache anyway, so at some point, if the data is large enough, you are going to stall and the indirection arithmetic is irrelevant.
+>
+> So even for things like files, this has to be the case, you will only window so much of a file, you will only process so many lines at "once in cache", so many tokens, etc.
+
+## Image 2 (2026-05-23 12:47 PM) — Continuation on parallel processing and arena allocation
+
+> Trying to bake that away because it's a browser and you're targeting hundreds of devices is not a good enough excuse. At worst, for critical chunks, you can have a size per performance class (mobile, console, desktop, laptop, 10 years old, 15 years old, 5 years old, etc).
+>
+> That's something I see implicitly from good devs but I never see confronted when people learn data structures.
+>
+> As it is at [-work-], everything is just for-loop spam with tarray; you can have hundreds like that and you're relying on the cpu or gpu to just tank it. Whereas if that was a design consideration the dev is confronted with from the start, then when a failure does occur you don't spend days to weeks rewriting a system. You just change a for loop to be chunk-aware and start profiling different chunk sizes. Or set up threads to attack chunks in parallel.
+>
+> On top of this, it leads to fewer reallocs or never needing a realloc: the chunk naturally represents a compute batch, you allocate on an arena or block-allocate by the chunk, and there is your tarray's realloc.
+>
+> So you don't have to worry about linear locality, because it can never be preserved anyway when pipelining code for the cpu receiving it; the mmu doesn't magically go "oh hey, these addresses are all in the same region, yeah, we can just know to batch it where this one will be next."
+>
+> Once you're past a few hundred entities that goes out the window.
+>
+> All of a sudden, since a chunk is your proper data structure container element, you can utilize the same allocation scheme for arrays, pools, maps, etc. You can swap between heap allocators (with minimal to non-existent fragmentation), GC, or arena allocators, and they will always perform better, because you can recycle by the chunk instead of by a more downstream, more complex, or non-trivial collection of objects or entities.
+
+## Image 3 (2026-05-23 12:56 PM) — Distillation with code pattern
+
+> So basically the distillation is:
+>
+> ```cpp
+> // 1. Chunk-oblivious: the chunk indirection is handled for you.
+> for (auto& element : DataStructure)
+> {
+>   // do stuff with the chunks' elements
+> }
+>
+> // 2. Chunk-aware: you handle the chunk level yourself.
+> for (auto& Chunk : DataStructure) for (auto& element : Chunk)
+> {
+>   // do stuff with the chunk's elements
+> }
+>
+> // 3. Chunk-parallel: one thread plans the split, then each thread works its share.
+> SomeThreadBatch per_thread_work;
+> if (first_arriving_thread()) planner_figure_out_the_split(DataStructure, per_thread_work);
+> sync_wait_for_planner_thread();
+> for (auto& Chunk : per_thread_work[thread_id].DataStructure) for (auto& element : Chunk)
+> {
+>   // do stuff with the chunk's elements, which have been distributed to threads
+> }
+> ```
+>
+> This is universal. It will never change no matter what machine you use until you die. This scales on CPUs, GPUs, FPGAs, ASICs, period.
+>
+> The top most loop is the simplest and you can always make a for range operator that abstracts away the chunk if chunk processing doesn't need to be taken into account, but as soon as you do need to be chunk aware you are fucked in most language libraries because they don't acknowledge it as a fundamental aspect of modern computing. Including odin and jai.
+
+----
+
+> Ok I'm done thats the basic drafting for w/e that turns into when I feel like writing a proper article and I have some code on a repo to give some weight to it.
+>
+> Ideally imo the chunk is so important it should be a cpu aware construct for instruction sets that correlate with SIMD, MIMD, etc. And the OS should also enforce it for memory ops and other things they do on their side. It kinda is already, but because CS curriculums don't really treat it as a proper constraint, it's kinda just a thing hidden in plain sight as soon as you do any performance programming.
+
+## Image 4 — Work-stealing thread model (rebuttal fragment)
+
+> **The Work-Stealing Thread Model**
+> Chunks form the perfect atomic unit of work for a multithreaded job system.
+>
+> - You do not need to lock the entire data structure.
+> - You maintain an atomic counter representing the "next available chunk."
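+> - Each thread claims a chunk with a single atomic fetch-and-increment on the counter: Thread 0 receives Chunk 0, Thread 1 receives Chunk 1, and so on. The read and the increment must be one atomic operation, otherwise two threads can claim the same chunk.
+> - Because the chunks are distinct memory regions (and ideally aligned to, and a multiple of, the 64-byte cache line size to prevent false sharing), threads can mutate data within their respective chunks with no locking and no cache-line contention between cores.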
+
+## Image 5 — Common objections and rebuttals
+
+### 1. The "Wasted Memory" Fallacy (Internal Fragmentation)
+
+**The Objection:** "If my chunk size is 1,000 elements, but I only have 5 elements to store, aren't I wasting a massive amount of memory?"
+
+**The Pragmatic Dismissal:**
+
+In the real world, you are already "wasting" memory; you just can't see it. Modern operating systems map memory in pages (typically 4 KB). A 5-byte allocation is served out of a page the process already holds, and when the allocator needs more from the OS, it gets at least a whole page.
+
+Furthermore, standard dynamic arrays (like `std::vector`) typically grow geometrically, commonly by 1.5x or 2x. With doubling, if an array is full at 100,000 elements and you add one more, it allocates space for 200,000 elements, leaving 99,999 slots unused.
+
+- **The Reality:** With chunking, you only ever have "wasted" space in the *very last* chunk of a sequence.
+- **The Solution:** If a specific system truly only ever holds a tiny handful of elements, you define a smaller chunk size for that specific arena. It is a compile-time tweak, not an architectural crisis.
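The slack arithmetic above can be checked with a small sketch. The function names `chunked_slack` and `doubling_slack` are hypothetical, and the doubling model assumes a growth factor of exactly 2 on an array that was completely full.

```cpp
#include <cassert>
#include <cstddef>

// Unused slots when n elements live in fixed-size chunks:
// only the last chunk can be partially filled.
inline std::size_t chunked_slack(std::size_t n, std::size_t chunk_size) {
    assert(chunk_size > 0);
    return (chunk_size - n % chunk_size) % chunk_size;
}

// Unused slots right after a full array of `capacity` elements doubles
// its storage to make room for one more element.
inline std::size_t doubling_slack(std::size_t capacity) {
    assert(capacity > 0);
    return 2 * capacity - (capacity + 1);
}
```

Under these assumptions, 100,001 elements in 1,024-element chunks leave 351 slots unused, while the doubled array leaves 99,999. The objection's own case (5 elements in a 1,000-element chunk) leaves 995 unused, which is what the smaller per-arena chunk size is meant to address.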
+
+### 2. The "Double Indirection is Slow" Myth
+
+**The Objection:** "To get an element, I have to look up the chunk, and then look up the element inside the chunk. That's two lookups! Doesn't that double indirection kill performance?"
+
+**The Pragmatic Dismissal:**
+
+This argument assumes all CPU operations take the same amount of time. They don't. The CPU is incredibly fast at math and incredibly slow at waiting for RAM.
+
+A cache miss (waiting for main memory) costs hundreds of CPU cycles. Bitwise arithmetic takes one cycle. If your chunk sizes are powers of two (e.g., 256), finding the chunk and the element requires a simple bitwise shift and a bitwise AND mask.
+
+| Operation | Approximate CPU Cost | Consequence |
+|---|---|---|
+| **Bitwise Math (Finding the chunk)** | ~1 cycle | CPU doesn't even break a sweat. |
+| **L1 Cache Hit (Reading the chunk)** | ~3-4 cycles | Instantaneous data processing. |
+| **RAM Fetch (Standard OOP Pointer)** | ~100-300 cycles | CPU completely stalls waiting for data. |
+
+Because chunks keep data tightly packed in the CPU cache, and the small chunk directory itself tends to stay cached, paying a cycle or two of arithmetic for "double indirection" to avoid a 300-cycle RAM stall is the best trade you will ever make in systems programming.
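As a minimal sketch of the shift-and-mask lookup described above, assuming a chunk size of 256 (the names `kChunkShift`, `ChunkPos`, `locate`, and `flat_index` are illustrative, not from the notes):

```cpp
#include <cassert>
#include <cstddef>

// Assumed chunk size for this sketch; it must be a power of two.
constexpr std::size_t kChunkShift = 8;  // log2(256)
constexpr std::size_t kChunkSize  = std::size_t{1} << kChunkShift;
constexpr std::size_t kChunkMask  = kChunkSize - 1;

struct ChunkPos {
    std::size_t chunk;   // which chunk
    std::size_t offset;  // which element inside that chunk
};

// Split a flat index into (chunk, offset) with one shift and one AND.
inline ChunkPos locate(std::size_t i) {
    return ChunkPos{i >> kChunkShift, i & kChunkMask};
}

// Inverse mapping, useful for checking the split.
inline std::size_t flat_index(ChunkPos p) {
    assert(p.offset < kChunkSize);
    return (p.chunk << kChunkShift) | p.offset;
}
```

For example, flat index 1000 lands in chunk 3 at offset 232, since 3 * 256 + 232 = 1000.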
+
+### 3. The "Polymorphic Soup" Problem
+
+**The Objection:** "What if my list needs to hold different types of objects? A `Vehicle` chunk can't hold a `Car` (size 64 bytes) and a `Truck` (size 128 bytes) because chunks rely on fixed sizes!"
+
+**The Pragmatic Dismissal:**
+
+You simply shouldn't be processing heterogeneous data in the same continuous loop if you care about performance.
+
+- **The Reality:** If you have an array of mixed objects, every iteration of your loop requires the CPU to figure out what type of object it's looking at, look up its specific functions (vtable lookups), and fetch completely different memory footprints. This defeats hardware branch prediction.
+- **The Solution:** You split them up. You have a chunk for `Cars` and a chunk for `Trucks`. If a system only cares about their common `Position` data, you extract the `Position` into its own chunk-based array. This is the entire philosophy behind Entity Component Systems (ECS).
+
+### 4. The "Dangling Pointer / Object Reference" Panic
+
+**The Objection:** "If elements are packed into chunks, and one gets deleted, and we move things around to fill the gap, what happens to all the other objects that were pointing to it? Their pointers are now broken!"
+
+**The Pragmatic Dismissal:**
+
+Raw pointers are a massive liability for game state or complex application logic anyway. The industry standard solution for this is **Generational Indices (Handles)**.
+
+Instead of Object A holding a memory address pointing to Object B, it holds a 32-bit or 64-bit integer ID.
+
+- **How it works:** A Handle contains the `Chunk Index`, the `Element Index`, and a `Generation Counter`.
+- **The Safety Net:** When an element is deleted, its slot in the chunk is freed, and that slot's "Generation" is incremented. If Object A tries to use its old handle, the system sees the generation numbers no longer match and safely rejects the request, rather than silently reading a recycled slot or crashing the program with a segmentation fault.
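A minimal sketch of the generation check, under simplifying assumptions: the `Handle` here carries a single flat `slot` index (a chunked pool would split it into chunk index and element index), and `HandlePool` is a hypothetical name, not an API from the notes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical handle layout: slot index plus the generation it was issued for.
struct Handle {
    std::uint32_t slot;
    std::uint32_t generation;
};

template <typename T>
class HandlePool {
public:
    Handle create(const T& value) {
        std::uint32_t slot;
        if (!free_slots_.empty()) {            // reuse a freed slot
            slot = free_slots_.back();
            free_slots_.pop_back();
        } else {                               // or grow by one slot
            assert(slots_.size() < 0xFFFFFFFFu);
            slot = static_cast<std::uint32_t>(slots_.size());
            slots_.push_back(Slot{});
        }
        slots_[slot].value = value;
        slots_[slot].alive = true;
        return Handle{slot, slots_[slot].generation};
    }

    // Freeing bumps the generation, which invalidates every older handle.
    void destroy(Handle h) {
        if (get(h) == nullptr) return;         // stale handle or double free
        slots_[h.slot].alive = false;
        ++slots_[h.slot].generation;
        free_slots_.push_back(h.slot);
    }

    // Returns nullptr instead of exposing a slot the handle no longer owns.
    T* get(Handle h) {
        if (h.slot >= slots_.size()) return nullptr;
        Slot& s = slots_[h.slot];
        if (!s.alive || s.generation != h.generation) return nullptr;
        return &s.value;
    }

private:
    struct Slot {
        T value{};
        std::uint32_t generation = 0;
        bool alive = false;
    };
    std::vector<Slot> slots_;
    std::vector<std::uint32_t> free_slots_;
};
```

After `destroy`, the old handle resolves to `nullptr` even when a later `create` reuses the same slot, because the reused slot carries a newer generation.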
+
+### The Bottom Line
+
+Most arguments against chunking come from developers treating memory like an abstract, infinite, and perfectly flat void. Once you accept that hardware is deeply physical and relies on fixed-size batches to run efficiently, the edge cases of chunking look remarkably easy to manage.
+
+---
+
+## Postscript from Ed (a question to himself)
+
+> **PS:** "But Ed what if you want to do handles to entities and you want to enqueue processing of those entities."
+>
+> Simple: you have a getter to resolve their owning chunk. That means you'll have at worst case an intrusive flag in the entity to be processed so it ignores most entities in the chunk when pipelined, or has a segregated chunk whitelist of entities within that chunk.
+
+---
+
+## Image 3 (transcribed): "The Hardware Reality: Why 'Indefinite Linearity' Fails"
+
+The core argument against standard dynamic arrays (like `std::vector` in C++ or `TArray` in Unreal) is that they abstract away the physical realities of modern hardware. The CPU does not read memory byte-by-byte; it reads in cache lines (typically 64 bytes) and manages memory in pages (typically 4KB to 2MB).
+
+- **The Reallocation Cost (O(N)):** A contiguous dynamic array must eventually grow. When it exceeds its capacity, a `realloc`-style allocator may be able to extend the block in place, but if the adjacent virtual memory is occupied (and always, in the case of `std::vector`), growth means a full reallocation: allocating a new, larger block and copying or moving every single element over. This pollutes the cache, stalls the CPU, and fragments the heap.
+- **TLB Misses:** Massive contiguous allocations spread across disparate physical memory pages increase Translation Lookaside Buffer (TLB) misses, slowing down memory fetches.
+- **False Sharing in Concurrency:** If multiple threads process adjacent elements in a tightly packed linear array, they will likely write to the same cache line, causing cache invalidation across CPU cores (false sharing).
+
+By committing to a fixed-size chunk heuristic, whether baked in at compile time or chosen at initialization, you align your software with the hardware's fixed-size execution models.
+
+## Advanced Chunk-Aware Data Structures
+
+When you drop the requirement for strictly contiguous memory, you can implement high-performance, chunk-based equivalents for standard data structures.
+
+| Standard Structure | Chunk-Aware Equivalent | Primary Hardware Benefit |
+|---|---|---|
+| Dynamic Array (`std::vector`) | Segmented Array / Unrolled Linked List | O(1) expansion. No reallocation copies. Safe concurrent reads during expansion. |
+| Binary Search Tree (`std::map`) | B-Tree / B+ Tree | Nodes are sized exactly to CPU cache lines or OS pages, minimizing memory fetches. |
+| Hash Table (`std::unordered_map`) | Swiss Table / Flat Hash Map | Metadata is chunked into 16-byte blocks for parallel SIMD querying before fetching payloads. |
+| Array of Structs (AoS Entities) | ECS Archetype Tables | Entities with identical component layouts are grouped into dense, fixed-size chunks for optimal linear iteration. |
+
+## Technical Implementations of Chunking
+
+### 1. The Segmented Array (Unrolled Linked List)
+
+Instead of a single contiguous block, you allocate an array of pointers to fixed-size blocks (chunks). `std::deque` in C++ operates similarly, but a strict, custom Segmented Array gives you explicit control over the chunk size to match your specific cache or threading needs.
+
+- **Memory Growth:** When capacity is reached, you allocate a single new chunk and add its pointer to your directory. Existing elements never move, meaning raw pointers to elements are never invalidated by an append operation.
+- **Indexing:** Finding an element at index `i` requires simple division and modulo arithmetic. `Chunk index = i / CHUNK_SIZE`. `Element offset = i % CHUNK_SIZE`. If `CHUNK_SIZE` is a power of two, this compiles down to ultra-fast bitwise shifts and masks.
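The growth and indexing rules above can be sketched as follows. This is an illustrative container, not the author's implementation: `SegmentedArray` and its `ChunkShift` parameter (log2 of the chunk size) are assumed names, and the sketch only supports append and indexed access.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Sketch of a segmented array. ChunkShift is log2 of the chunk size,
// so the chunk size is always a power of two.
template <typename T, std::size_t ChunkShift = 8>
class SegmentedArray {
public:
    static constexpr std::size_t kChunkSize = std::size_t{1} << ChunkShift;
    static constexpr std::size_t kChunkMask = kChunkSize - 1;

    // Appending never moves existing elements: at most one new chunk is
    // allocated and its pointer is added to the directory.
    T& push_back(const T& value) {
        if ((size_ & kChunkMask) == 0) {  // last chunk is full (or none exists yet)
            chunks_.push_back(std::unique_ptr<T[]>(new T[kChunkSize]()));
        }
        T& slot = chunks_[size_ >> ChunkShift][size_ & kChunkMask];
        slot = value;
        ++size_;
        return slot;
    }

    // index -> (chunk, offset) via shift and mask.
    T& operator[](std::size_t i) {
        assert(i < size_);
        return chunks_[i >> ChunkShift][i & kChunkMask];
    }

    std::size_t size() const { return size_; }
    std::size_t chunk_count() const { return chunks_.size(); }

private:
    std::vector<std::unique_ptr<T[]>> chunks_;  // the directory of chunk pointers
    std::size_t size_ = 0;
};
```

The directory itself is a `std::vector` and may reallocate as it grows, but only the chunk pointers move; the elements stay where they were allocated, which is why a pointer taken before later appends remains valid.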
+
+### 2. Archetype Chunks in Entity Component Systems (ECS)
+
+In high-performance game engines (like Unity's DOTS or custom internal engines), data is heavily oriented around chunks.
+
+- Entities are not objects; they are just IDs.
+- Components (like Position or Velocity) are stored in chunks.
+- An "Archetype" dictates the layout of a chunk (e.g., a chunk containing exactly 128 Positions and 128 Velocities).
+- When a system runs, it does not query individual entities. It queries a central registry for all chunks matching a specific Archetype, operating directly on the dense component arrays within that chunk.
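A minimal sketch of one archetype, assuming the 128-entity chunk capacity from the example above. The names `MovingChunk`, `MovingArchetype`, and `integrate` are hypothetical; a real engine would keep a registry of many archetypes and map entity IDs to rows, which this sketch leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

constexpr std::size_t kEntitiesPerChunk = 128;  // capacity taken from the example above

struct Position { float x, y; };
struct Velocity { float dx, dy; };

// One chunk of the {Position, Velocity} archetype:
// dense parallel arrays plus a count of live rows.
struct MovingChunk {
    Position positions[kEntitiesPerChunk];
    Velocity velocities[kEntitiesPerChunk];
    std::size_t count = 0;
};

// Stand-in for the registry entry holding every chunk of this archetype.
struct MovingArchetype {
    std::vector<std::unique_ptr<MovingChunk>> chunks;

    void add(Position p, Velocity v) {
        if (chunks.empty() || chunks.back()->count == kEntitiesPerChunk) {
            chunks.push_back(std::unique_ptr<MovingChunk>(new MovingChunk()));
        }
        MovingChunk& c = *chunks.back();
        assert(c.count < kEntitiesPerChunk);
        c.positions[c.count] = p;
        c.velocities[c.count] = v;
        ++c.count;
    }
};

// A "system": it walks chunks and their dense arrays, never entity objects.
inline void integrate(MovingArchetype& archetype, float dt) {
    for (std::unique_ptr<MovingChunk>& chunk : archetype.chunks) {
        for (std::size_t i = 0; i < chunk->count; ++i) {
            chunk->positions[i].x += chunk->velocities[i].dx * dt;
            chunk->positions[i].y += chunk->velocities[i].dy * dt;
        }
    }
}
```

Adding 300 entities fills two chunks and puts the remaining 44 in a third; `integrate` then touches only the position and velocity arrays of those three chunks.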
+
+### 3. Spatial Partitioning (Grid/Octree Chunks)
+
+For collision detection or render culling, a single contiguous array scales poorly, because every query has to scan entities that are nowhere near the region of interest.
+
+- By chunking spatial data into fixed voxel grids or Octree leaves, you map physical 3D space to hardware chunks.
+- Entities moving between spatial regions are simply removed from one chunk's contiguous array and swapped into another's, keeping memory strictly localized to spatial proximity.
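The move between cells can be sketched with a swap-remove. This is a simplified 2D uniform grid under stated assumptions: `Grid`, `cell_of`, and `move` are hypothetical names, positions are assumed to lie inside the grid, and each cell uses a plain `std::vector` of entity IDs where a stricter design would use fixed-size chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 2D uniform grid; each cell owns a dense array of entity IDs.
struct Grid {
    int width, height;
    float cell_size;
    std::vector<std::vector<std::uint32_t>> cells;

    Grid(int w, int h, float size)
        : width(w), height(h), cell_size(size),
          cells(static_cast<std::size_t>(w) * static_cast<std::size_t>(h)) {}

    // Map a position (assumed to be inside the grid) to its cell index.
    std::size_t cell_of(float x, float y) const {
        int cx = static_cast<int>(x / cell_size);
        int cy = static_cast<int>(y / cell_size);
        assert(cx >= 0 && cx < width && cy >= 0 && cy < height);
        return static_cast<std::size_t>(cy) * static_cast<std::size_t>(width)
             + static_cast<std::size_t>(cx);
    }

    void insert(std::uint32_t id, float x, float y) {
        cells[cell_of(x, y)].push_back(id);
    }

    // Swap-remove from the old cell so its array stays dense,
    // then append to the new cell.
    void move(std::uint32_t id, float old_x, float old_y, float new_x, float new_y) {
        std::size_t from = cell_of(old_x, old_y);
        std::size_t to = cell_of(new_x, new_y);
        if (from == to) return;
        std::vector<std::uint32_t>& src = cells[from];
        for (std::size_t i = 0; i < src.size(); ++i) {
            if (src[i] == id) {
                src[i] = src.back();  // fill the gap with the last element
                src.pop_back();
                cells[to].push_back(id);
                return;
            }
        }
    }
};
```

Swap-remove changes the order of the remaining IDs in the source cell, which is usually acceptable for broad-phase queries since they treat a cell as an unordered set.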
+
+## Execution Mapping: SIMD and Multithreading
+
+The most significant advantage of fixed-size chunks is how elegantly they map to parallel execution architectures.
+
+### SIMD (Single Instruction, Multiple Data)
+
+Modern CPUs feature wide vector registers (256-bit AVX2, 512-bit AVX-512) that can process 8 or 16 single-precision floats per instruction. If your data structure relies on a linear array that might have gaps or require complex branching, auto-vectorization often fails. By ensuring data is tightly packed into a fixed-size chunk, you give the compiler a loop it can safely unroll and turn into SIMD instructions.
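As an illustration of the loop shape this paragraph has in mind, here is a sketch with an assumed chunk capacity of 256 floats. `FloatChunk` and `scale_add` are hypothetical names; whether the loop is actually vectorized depends on the compiler and its optimization flags.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 256;  // assumed chunk capacity, a multiple of 8 and 16

// A densely packed chunk of floats: no gaps, no per-element branches.
struct FloatChunk {
    alignas(64) float values[kLanes];
};

// Fixed trip count plus contiguous, aligned data: a loop shape that
// optimizing compilers can usually turn into SIMD instructions.
inline void scale_add(FloatChunk& dst, const FloatChunk& src, float scale) {
    assert(&dst != &src);  // this sketch expects two distinct chunks
    for (std::size_t i = 0; i < kLanes; ++i) {
        dst.values[i] += src.values[i] * scale;
    }
}
```

The point of the fixed capacity is that the trip count is a compile-time constant and a multiple of the vector width, so no scalar remainder loop is needed.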
+
+### The Work-Stealing Thread Model
+
+Chunks form the perfect atomic unit of work for a multithreaded job system.
+
+- You do not need to lock the entire data structure.
+- You maintain an atomic counter representing the "next available chunk."
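+- Each thread claims a chunk with a single atomic fetch-and-increment on the counter: Thread 0 receives Chunk 0, Thread 1 receives Chunk 1, and so on. The read and the increment must be one atomic operation, otherwise two threads can claim the same chunk.
+- Because the chunks are distinct memory regions (and ideally aligned to, and a multiple of, the 64-byte cache line size to prevent false sharing), threads can mutate data within their respective chunks with no locking and no cache-line contention between cores.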
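The counter scheme in the list above can be sketched with `std::atomic` and `std::thread`. Strictly speaking, a single shared counter is simpler than classic work stealing, which usually gives each worker its own queue; the sketch follows the shared-counter description as written. `IntChunk`, `worker`, and `double_all` are hypothetical names, and the chunk capacity is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

constexpr std::size_t kChunkInts = 1024;  // assumed chunk capacity

struct IntChunk {
    int values[kChunkInts];
};
// Chunk size is a whole number of 64-byte cache lines; a production version
// would also align the allocation itself to 64 bytes.
static_assert(sizeof(IntChunk) % 64 == 0, "chunk should span whole cache lines");

// Each worker claims chunk indices with one atomic fetch_add, so no two
// threads ever receive the same chunk and no lock is taken.
inline void worker(std::vector<IntChunk>& chunks, std::atomic<std::size_t>& next) {
    for (;;) {
        std::size_t c = next.fetch_add(1, std::memory_order_relaxed);
        if (c >= chunks.size()) return;          // nothing left to claim
        for (int& v : chunks[c].values) v *= 2;  // exclusive access to chunk c
    }
}

inline void double_all(std::vector<IntChunk>& chunks, unsigned thread_count) {
    assert(thread_count > 0);
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < thread_count; ++t) {
        threads.emplace_back(worker, std::ref(chunks), std::ref(next));
    }
    for (std::thread& t : threads) t.join();     // join publishes the results
}
```

Relaxed ordering is enough for the counter here because the only requirement is that each index is handed out once; the `join` calls are what make the workers' writes visible to the caller.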
+
+---
+
+## Cross-references to other sources (added by Tier 2, 2026-06-08)
+
+These notes are deeply aligned with the 4 other sources loaded for the same audit:
+
+| Source | Alignment with these notes |
+|---|---|
+| Andrew Reece, "Assuming as Much as Possible" (BSC 2025) | The Xar is *exactly* this chunking pattern (fixed-size chunks, exponential growth, bitwise divmod, no realloc copy). Reece's "byte-first thinking" maps to Ed's "indefinite linearity becomes a frontend ergonomic, not an actual behavior." |
+| Casey Muratori, "The Big OOPs" (BSC 2025) | The "ECS Archetype Tables" section in the image-3 follow-up is literally Muratori's argument: data-oriented ECS over hierarchical OOP. "Entities are not objects; they are just IDs" is the entire thesis. |
+| Ryan Fleury, "The Codepath Combinatoric Explosion" | Ed's "double indirection is a bad fear to have" is a corollary of Fleury's "effective codepaths" — by exposing the chunk layer (with its known performance shape), the *user* codepath becomes simpler (a single straight-line loop over `Chunk` instead of cache-miss-vulnerable iteration over pointer-chained nodes). |
+| Ryan Fleury, "A Taxonomy of Computation Shapes" | The "chunks form the atomic unit of work" framing IS a wide-codepath visualization: each chunk is dispatched to a different thread (a sub-codepath), with no shared mutable state between them. The "no locking overhead" is a consequence of the *separation*. |
+
+The user's intuition that "the chunk is so important it should be a cpu aware construct for instruction sets that correlate with SIMD, MIMD" is essentially the SIMD section of image 3 made into a hardware design recommendation. The CS curriculum gap the user laments is exactly what this archive (and the 4 other sources) collectively try to address.
+
+---
+
+*End of ideation archive. Reference for the upcoming code_path_audit_20260607 track and the user's eventual article on chunk-based data structures.*