Private
Public Access
0
0
Files
manual_slop/docs/reports/c11_python_interop_assessment_20260608.md
T
conductor-tier2 999fdea467 docs(c11-interop): cross-reference SSDL digest in See Also
The SSDL digest (docs/reports/computational_shapes_ssdl_digest_20260608.md,
504 lines, 30KB) is the theoretical foundation for the chunkification
pattern. Per the digest's Technique 5 "Assume-away (Xar)" in §2.2
and the "Xar-style chunked arrays" recommendation in §5.2, the
chunkification track is a *direct application* of the SSDL's
"assume as much as possible" lens (§4).

This commit adds the SSDL digest to the See Also of the v1+v2
C11-Python interop assessment (front-matter Cross-references line).
The same cross-reference is also being added to:
- conductor/tracks/chunkification_optimization_20260608_PLACEHOLDER/spec.md
  (in a new §6.1 "SSDL alignment" subsection)
- conductor/tracks/manual_ux_validation_20260608_PLACEHOLDER/spec.md
  (in §5 Architectural Reference + §6 See Also + a new §2.6
  "SSDL cross-reference" section that distinguishes GUI ASCII
  vocabulary from SSDL vocabulary)

No code modified. Cross-reference only.

Also: small update to conductor/tracks.md to add the 2 new
tracks (manual_ux_validation_20260608_PLACEHOLDER as Active;
chunkification_optimization_20260608_PLACEHOLDER as Backlog/Contingency).
2026-06-08 23:42:21 -04:00

57 KiB
Raw Blame History

C11 ↔ Python Interop Assessment — 2026-06-08

Question source: end-of-session user clarification on the proposed chunkification_optimization_20260608_PLACEHOLDER track. Author: Tier 1 Orchestrator (synthesis + technical assessment) Date: 2026-06-08 Status: Honest tractable-vs-not verdict, no code proposed Cross-references: docs/reports/session_synthesis_20260608.md §8.2, docs/ideation/ed_chunk_data_structures_20260523.md, docs/transcripts/i-h95QIGchY_assuming_as_much_as_possible_andrewreece.txt §56:42, docs/reports/computational_shapes_ssdl_digest_20260608.md (the SSDL digest; the theoretical foundation for the chunkification pattern — Technique 5 "Assume-away (Xar)" in §2.2 is the explicit pre-support for the chunk-arrays recommendation in §5.2)


0. The user-correction that reshaped the question

First framing (mine, in proposed_new_tracks_20260608.md): "Manual Slop's comms.log could be replaced by a C11 chunk-based data structure, with Python user-space interop via numpy/ctypes/etc."

User's clarification: "it's not really an interop pattern, I just wanted to show how I like todo C11."

What changed: the C11 codebases I was pointed to (forth_bootslop/attempt_1/duffle.amd64.win32.h + main.c, and Pikuma/ps1/code/duffle/* + gte_hello/) are style references — they show what C11 looks like when Ed writes it. They do not contain a Python interop layer, and weren't meant to be read as one. The "interop design space" question is a separate open question, and the user explicitly said "lots of ambiguities."

This document is split into two parts that should not be conflated:

  • Part 1 — the C11 style reference (what the duffle.h + pikuma ps1 headers show)
  • Part 2 — the interop design space (the actual question the user is asking, with honest tractable-vs-not assessment)

PART 1 — C11 Style Reference (what your duffle.h + pikuma ps1 show)

1.1 The duffle.h "DSL" (forth_bootslop/attempt_1/duffle.amd64.win32.h, 727 lines)

A single-header file that defines a C DSL in pure macros + inline functions. Compiled with clang in c23 mode. Target: amd64 + Windows 11. Zero external dependencies (the only #pragma comment(lib, ...) lines are to Kernel32/User32/Gdi32/Advapi32).

The core conventions:

1.1.1 Byte-width typedef convention (mandatory, used everywhere)

typedef __UINT8_TYPE__  U1; typedef __UINT16_TYPE__ U2; typedef __UINT32_TYPE__ U4; typedef __UINT64_TYPE__ U8;
typedef __INT8_TYPE__   S1; typedef __INT16_TYPE__  S2; typedef __INT32_TYPE__  S4; typedef __INT64_TYPE__  S8;
typedef unsigned char   B1; typedef __UINT16_TYPE__ B2; typedef __UINT32_TYPE__ B4; typedef __UINT64_TYPE__ B8;
typedef float           F4; typedef double          F8;
  • U = unsigned, S = signed, B = byte (char)
  • The number is the bit-width, not the byte count
  • All custom code uses these; int/long/size_t only appear in system headers

Casts are wrapped: u4_(value) / u8_(value) / f4_(value) etc. enforce precedence in arithmetic and signal at the call site "this is an explicit narrowing."

1.1.2 Macro meta-DSL (the "duffle" layer)

#define m_expand(...)      __VA_ARGS__
#define glue_impl(A, B)    A ## B
#define glue(A, B)         glue_impl(A, B)
#define tmpl(prefix, type) prefix ## _ ## type

The rest of the file is built on these. Patterns:

  • Struct_(Foo) expands to struct Foo Foo; struct Foo — a forward decl + a typedef in one go, so you can use Foo as a type or a struct namespace immediately
  • Enum_(U4, MyEnum) similarly gives you MyEnum as the type and enum MyEnum as the tag
  • Union_(Foo), Array_(type, len), Slice_(type) — same pattern, all single-line

This is the meta-primitive that the entire codebase builds on. There is no class, no templates, no codegen — just #define and _Generic.

1.1.3 Inline / always-inline / no-inline discipline

#define I_  internal inline
#define IA_ I_ __attribute__((always_inline))
#define N_  internal __attribute__((noinline))

Plus the macro name encodes intent: I_* is a normal inline, IA_* is forced inline (small, hot), N_* is forced out-of-line (debugging, code-size). Functions written as IA_ void foo(...) carry the intent in the function signature itself.

1.1.4 The r/v discipline (restrict / volatile, and nothing else)

#define r restrict  // pointers are either restricted or volatile and nothing else
#define v volatile

Plus typed pointer aliases: r_(ptr) = C_(T_(ptr[0])*r, ptr) is a typed restrict pointer, v_(ptr) is a typed volatile pointer. The user comment says this directly: "pointers are either restricted or volatile and nothing else."

There are no const pointers, no volatile restrict, no fancy CV qualifiers. Just two states. This is a real constraint on the design.

1.1.5 Slice as the core compound type

typedef Struct_(Slice)  { U8 ptr, len; };  // Untyped slice
#define Slice_(type) Struct_(tmpl(Slice,type)) { type* ptr; U8 len; }
  • Untyped Slice is { void*, size_t } (well, {U8 ptr, U8 len}U8 is the byte-width convention)
  • Typed Slice_T wraps a typed T* with the same len field
  • slice_iter(container, iter) is the iteration macro
  • slice_end(slice) returns slice.ptr + slice.len (pointer past the end, not a pointer to last element)
  • slice_to_ut(s) converts a typed slice to an untyped slice (used for memcpy / hash / format)
  • S_slice(s) is s.len * sizeof(s.ptr[0]) — the byte size

This is the data-structure primitive of the duffle system. Arenas, stacks, KTL tables — everything is built on Slice + Slice_T + FArena.

1.1.6 The FArena (the chunk-adjacent data structure)

typedef Struct_(FArena) { U8 start, capacity, used; };
  • Linear-bump allocator with a start / capacity / used triple
  • farena_push(arena, amount, options) returns a Slice
  • farena_save(arena) -> used (snapshot), farena_rewind(arena, save_point) (rollback to snapshot)
  • farena_reset(arena) zeroes used (does NOT free; that requires slice_free or arena destruction)
  • farena_push_type(arena, type, ...) and farena_push_array(arena, type, amount, ...) are typed convenience macros

Key observation: this is not a chunk-based arena. It is a single contiguous buffer with a bump pointer. The user could extend it to chunked (with Slice<FArena> as the backing, or by allocating new pages and chaining them), but the current FArena is monolithic.

1.1.7 Memory-barrier and atomic primitives (asm volatile)

IA_ void barrier_compiler(void){asm volatile("::""memory");}
IA_ void barrier_memory  (void){__builtin_ia32_mfence();}
IA_ void barrier_read    (void){__builtin_ia32_lfence();}
IA_ void barrier_write   (void){__builtin_ia32_sfence();}

IA_ U4 atm_add_u4 (U4*r addr, U4 value){asm volatile("lock xaddl %0,%1":"=r"(value),"=m"(addr[0]):"0"(value),"m"(addr[0]):"memory","cc");}

These are written as raw inline asm, not stdatomic.h. The user prefers __builtin_* intrinsics and raw asm volatile(...) over library abstractions. This matters for interop: there's no portable way to call these from Python.

1.1.8 Control-flow and defer discipline

#define defer(expr) for(U4 once= 1; once!=1; ++once, (expr))
#define scope(begin,end) for(U4 once=(1,(begin)); once!=1; ++once, (end))
#define defer_rewind(cursor) for(T_(cursor) sp=cursor, once=0; once!=1; ++once, cursor=sp)

defer is a single-statement cleanup that fires when the enclosing block exits. defer_rewind is the arena-aware variant: it captures the current cursor at block entry and restores it on exit. This is the pattern for "transactional" arena allocation.

1.1.9 The KTL (Key Table Linear) — a small key-value table

#define KTL_Slot_(type) Struct_(tmpl(KTL_Slot,type)) { U8 key; type value; }
#define KTL_(type) Slice_(tmpl(Slot,type));
typedef Slice KTL_Byte;

A linear array of {key, value} slots, with FNV-1a 64-bit hashing on Str8 keys. The comment in the code says: "We do a linear iteration instead of a hash table lookup because the user should never subst with more than 100 unique tokens." — this is the "assume as much as possible" principle applied directly. No hash table; linear scan wins for small N.

1.2 The duffle.h ↔ main.c interface (forth_bootslop/attempt_1/main.c, 1426 lines)

main.c is a stack-machine JIT compiler. It uses duffle.h to:

  • Define an STag enum (X-macro pattern: 7 entries in a single Tag_Entries() table, then #define X + #undef X to repurpose the macro inside the table generator)
  • Define tape_arena (an FArena for the bytecode tape) and anno_arena (parallel arena for annotation strings)
  • Use u4_r(...) / u8_r(...) for typed restrict pointers
  • Use mem_copy / mem_zero (which are wrappers around __builtin_memcpy / __builtin_memset)
  • Hand-emit x64 machine code using emit8 / emit32 / emit64 macros
  • Build a JIT (Just-In-Time compiler for a custom stack-based VM) that emits REX prefixes, ModRM bytes, SIB bytes via a per-field macro DSL

What this tells us about how Ed uses duffle.h:

  • The DSL is meant to support low-level systems work (JIT, OS syscalls, raw asm) without sacrificing readability
  • The byte-width typedef convention is rigid — every new line of code in main.c uses U1/U4/U8; int/long only appear in system header forward-decls
  • Memory discipline is arena-first: tape_arena + anno_arena + code_arena are global FArena instances, no malloc/free in user code
  • The defer / defer_rewind pattern is the user's answer to RAII — it's the only structured cleanup mechanism

1.3 The Pikuma ps1 duffle/ (Pikuma/ps1/code/duffle/*, the more recent style)

The Pikuma ps1 duffle/ is a refined, smaller version of the forth_bootslop DSL. Same conventions, but with platform-specific concerns (PS1 MIPS + GTE + GPU command encoders). Notable differences:

  • dsl.h adds TSet_(type) (type + restricted-pointer + volatile-pointer in one typedef), Proc_(symbol) (typedef for void(*)())
  • memory.h adds sll_stack_push_n / sll_queue_push_nz — singly-linked list / queue macros (the DAG region)
  • gp.h is the GPU command encoder; every GPU command is a (gcmd_X << 24 | ...) bit-packing macro, same pattern as the x64 emission DSL in forth_bootslop main.c
  • gte.h is the GTE coprocessor instruction encoder; per-field macros, asm volatile(asm_inline(gte_cmd_rtpt, ...)) to emit constant-folded instruction words
  • math.h defines V2_S2, V3_S2, V4_S2 (S2/S4 are 16/32-bit signed), Rect_S2, M3_S2 — 3x3 matrix with translation vector

What Pikuma ps1 duffle/ shows that's different from forth_bootslop:

  • The DSL is split across multiple small headers (dsl.h, memory.h, math.h, gp.h, gte.h, mips.h, gcc_asm.h, strings.h) — one concept per file, easier to reason about
  • The INTELLISENSE_DIRECTIVES guard at the top of every header lets IDEs (#pragma once + includes) see the full type graph without requiring the user to include dsl.h in every file. Production builds skip the include
  • The TSet_ / PtrSet_ / Array_expand macros are a more complete type-builder system: one macro gives you type, type*restrict, type*volatile in one shot
  • The GTE/GPU encoding layers are fully composableenc_gte_cmdw(sf, mx, v, cv, lm, cmd) is a flat OR of 6 per-field encoders, each of which is its own named function

hello_gte.c shows usage:

  • SMemory is the global state struct; static_mem is a single global instance
  • prim__alloc(type_width, type_name) is the arena-style allocation primitive for the GTE primitive buffer
  • ent_cube128_init / ent_floor_init are __forceinline initializers that copy baked vertex/face data into the entity's arena slot
  • Ent_Cube and Ent_Floor are entity structs that embed their data (A8_V3_S2 verts; A6_V4_S2 faces;) — entities are POD, not heap-allocated

1.4 The 11 style observations that matter for chunkification

Distilled from the duffle.h + main.c + pikuma ps1 headers + hello_gte.c reading:

  1. No malloc/free in user code. Everything is arena-allocated. For chunk-based data structures, this means the chunks themselves would be allocated from an FArena (or a chunk-aware variant), and the structure holds a Slice<Chunk> of pointers into the arena.
  2. No classes, no templates, no inheritance. POD structs only. Methods are free functions that take a pointer: void farena_push(FArena* arena, U8 amount, Opt_farena o).
  3. The Slice + Slice_T pair is the data-structure primitive. A chunk-array is probably modeled as Slice<Chunk> where Chunk is a fixed-size T[N].
  4. Pointer discipline is restrict or volatile, never both, never const. This is a hard constraint.
  5. The byte-width convention is rigid. U1/U2/U4/U8 for unsigned, S1/S2/S4/S8 for signed, B1/B2/B4/B8 for byte, F4/F8 for float. int and long are forbidden in user code.
  6. asm volatile + __builtin_* are preferred over library wrappers. No stdatomic.h, no stddef.h for size_t.
  7. The DSL compiles in c23 mode (clang). This means _Generic is available, __builtin_* are stable, and typeof works.
  8. __attribute__((always_inline)) is the default for small hot functions. Hot path code has zero call overhead.
  9. Macros encode intent, not just abbreviation. I_ vs IA_ vs N_ is meaningful; I_proc was specifically removed in the duffle.h because the user found it harder to read than just writing inline functions.
  10. Entities are POD structs with embedded data. No handles, no IDs, no virtual dispatch.
  11. X-macros are the pattern for data-driven code. Tag_Entries() defines the table; #define X(n, s, c, p) + #undef X lets the same table feed the enum, the colors array, the prefix array, the name array.

1.5 What the style implies for the chunkified data structure

If the user wrote a chunk-based C11 data structure in their style, it would probably look like:

// Likely shape (NOT actually written, this is what their style suggests)
typedef Struct_(ChunkArray_T) {                  // ChunkArray<T>
    Slice         chunks;                          // { Chunk* ptr; U8 len; }
    U4            chunk_size;                      // power-of-2
    U4            element_size;                    // sizeof(T)
    U8            total_used;                      // sum of all chunk use
    FArena*       backing;                         // where chunks live
};

// Push: O(1) amortized
I_ U8 chunkarray_push(ChunkArray_T* ca, U8 element) {
    U4 chunk_idx = ca->total_used >> log2_of(ca->chunk_size);
    if (chunk_idx >= ca->chunks.len) {
        // grow: add a new chunk
        Chunk* new_chunk = farena_push_type(ca->backing, Chunk, ...);
        ca->chunks.ptr[ca->chunks.len] = new_chunk;
        ca->chunks.len += 1;
    }
    U4 offset = ca->total_used & (ca->chunk_size - 1);
    U8* dst = (U8*)&ca->chunks.ptr[chunk_idx][offset * ca->element_size];
    dst[0] = element;  // copy
    ca->total_used += 1;
    return ca->total_used - 1;
}

// Index: O(1) bitwise
IA_ U8 chunkarray_at(ChunkArray_T* ca, U8 i) {
    U4 chunk_idx = i >> log2_of(ca->chunk_size);
    U4 offset    = i & (ca->chunk_size - 1);
    return ((U8*)ca->chunks.ptr[chunk_idx])[offset * ca->element_size];
}

This is exactly Reece's Xar pattern (8-byte header, power-of-2 chunks, bitwise divmod), written in Ed's duffle.h style.

The point: the style is consistent with the chunkification optimization. If you wrote this in C11, it would look like duffle.h. There's no impedance mismatch between "the user's preferred C11 style" and "the chunk-idea C11 implementation."

The impedance is between any C11 chunk-array and the Python runtime, regardless of style. That's Part 2.


PART 2 — Interop Design Space (the actual question)

2.1 What "interop" actually means in this context

The question isn't "can Python call C11?" — that's a solved problem with multiple working answers (ctypes, cffi, pybind11, Cython, custom CPython module, etc.). The question is more specific:

Can a Python user-space program actually exploit a chunk-based C11 data structure as if it were a "lego set" of composable pieces — where the user picks which chunk operations to run, in which order, with custom callbacks for filter/map/reduce — without paying the FFI overhead per element?

The user's skepticism is well-founded. The standard FFI answers have specific impedance-mismatch properties:

2.2 The 5 candidate interop layers, honestly assessed

2.2.1 ctypes (Python stdlib)

What it is: load a .dll / .so and call C functions via FFI. No compile step. Structs, arrays, pointers, callbacks all work.

Pros for chunkification:

  • Zero build-time cost — ctypes.CDLL("./libchunks.so") and you're in
  • Structure + Array classes map naturally to a ChunkArray header + Chunk* array
  • POINTER(c_uint64) can wrap the chunk pointer, indexed like a Python list
  • Thread-safe (GIL released on foreign calls)

Cons for chunkification:

  • Per-call overhead is ~1-5 microseconds. A chunkarray_at(arr, i) round trip is 1 µs of FFI overhead. A 10,000-element loop is 10ms. Python's native list iteration is ~50ns/element, so ctypes is ~20-100x slower for tight loops.
  • No inlining. The "lego set" pattern requires the user to compose operations (filter + map + reduce over chunks). With ctypes, each operation is a separate FFI call, so composition costs O(N) FFI round trips.
  • Type coercion is one-shot. You can't ask ctypes to call chunkarray_at and have the result auto-converted to a Python int without going through the ctypes object.
  • No SIMD/AVX exposure. The user could write the C11 to use AVX, but ctypes sees only the C function signature.

Verdict for chunkification: Tractable but defeats the purpose. If the use case is "process a 100K-element chunk-array in a hot loop," ctypes is wrong. If the use case is "occasionally bulk-load or bulk-dump a chunk-array and do the rest in Python," ctypes is fine.

Style fit with duffle.h: low. ctypes would require the user to write Python-side struct definitions that mirror the C struct layout. The duffle.h Struct_(ChunkArray_T) { Slice chunks; U4 chunk_size; U4 element_size; U8 total_used; } would become:

class ChunkArray_T(ctypes.Structure):
    _fields_ = [
        ("chunks", Slice),       # needs its own Structure
        ("chunk_size", c_uint32),
        ("element_size", c_uint32),
        ("total_used", c_uint64),
    ]

That's 2x the code on the Python side, and you have to keep the two in sync. The user's unorthodox Slice + Struct_ macros would have to be unwound into a C-friendly layout.

2.2.2 cffi (PyPy / CPython, third-party)

What it is: write C declarations in a Python string, cffi compiles them and gives you ABI-stable handles.

Pros over ctypes:

  • C-level type declarations are the source of truth (not Python-side mirroring)
  • ABI mode vs API mode: ABI is like ctypes (no compile); API mode compiles a Python extension module
  • More Pythonic: from ffi import ffi; lib = ffi.dlopen("./libchunks.so")

Cons for chunkification: same as ctypes for the per-call overhead. Plus the C declaration layer adds a build step (cffi "compiles" the C declarations at import time, which is a real cost on cold start).

Verdict for chunkification: same as ctypes — tractable but defeats the purpose for hot loops.

Style fit with duffle.h: low-medium. cffi is more idiomatic for the C-decl-as-source-of-truth, but you still pay the FFI cost.

2.2.3 pybind11 (C++ heavy)

What it is: C++ header-only library that generates Python bindings from C++ type signatures. Requires the C++ compiler.

Pros for chunkification:

  • Type-safe bindings
  • STL containers (vector, array) have automatic conversions to Python list / numpy array
  • py::buffer_info lets you expose raw memory as a NumPy array (zero-copy)

Cons for chunkification:

  • C++ is not the user's style. The user writes pure C11 with macros. pybind11 is C++-only.
  • pybind11's STL conversions don't fit the duffle.h Slice / FArena model. You'd be writing the C++ adapter layer, not the C11 chunk-array.
  • The "pybind11 generates bindings" claim is misleading for non-trivial types — you write glue code, and for an FArena-backed chunk array, the glue is more code than the C11 implementation.

Verdict for chunkification: not a fit. Style mismatch is fatal here.

2.2.4 Custom CPython C extension (CPython C API)

What it is: write a real CPython extension module using <Python.h>. You get a Python-importable module that wraps the C11 code directly.

Pros for chunkification:

  • Zero FFI overhead for tightly-coupled code. Once the module is loaded, import chunks; chunks.push(arr, val) is a normal C function call with refcount discipline, ~50ns/element.
  • The C API is C-compatible (C11 or later), so the duffle.h macros can be used directly inside the extension module
  • The user controls the module surface — can expose ChunkArray.push, .at, .chunk_count, .chunk_size, .arena_capacity etc.
  • Generator/coroutine support (__iter__ over chunks) is straightforward in C
  • Can release the GIL for long-running pure-C operations

Cons for chunkification:

  • Refcount discipline is manual. The user must Py_INCREF / Py_DECREF correctly. The duffle.h style doesn't have a notion of refcounting (everything is arena-owned). A new discipline is needed at the Python boundary.
  • Must compile. Build the .pyd/.so, ensure it's on sys.path, deal with Python version compatibility (3.11 ABI tag, etc.). The user's Manual Slop project uses uv; this would be a pyproject.toml [tool.uv]-style build hook.
  • CPython-specific. PyPy / GraalPy / RustPython don't all support the C API the same way. For a tool that's CPython-only (Manual Slop is), this is fine, but it's a lock-in.
  • GIL. Free-threaded Python (PEP 703) is shipping; chunk-array code that releases the GIL has to be careful about which Python objects it touches.

Verdict for chunkification: Most tractable option. The custom C extension model lets the user write the chunk-array in their preferred C11 style (duffle.h compatible), wrap it with a small Python-facing layer (refcount-aware), and ship it as a real importable module. Build cost is one-time.

Style fit with duffle.h: high. The C11 code is C11. The Python-facing layer is a thin PyTypeObject / PyMethodDef table at the bottom of the file. The duffle.h macros can be used inside the extension module without modification.

Sketch (not actually written — for the design conversation):

// chunks_module.c
#include <Python.h>
#include "duffle.amd64.win32.h"   // user's existing style

typedef Struct_(ChunkArray) {
    Slice  chunks;        // { Chunk* ptr; U8 len; }
    U4     chunk_size;    // power-of-2
    U4     element_size;
    U8     total_used;
    FArena backing_arena;
};

static PyObject* chunka_push(PyObject* self, PyObject* args) {
    PyObject* py_arr;
    U8        value;
    if (!PyArg_ParseTuple(args, "OK", &py_arr, &value)) return nullptr;
    ChunkArray* arr = ((ChunkArrayObject*)py_arr)->c_arr;
    U8 idx = chunkarray_push(arr, value);
    return PyLong_FromUnsignedLongLong(idx);
}

static PyObject* chunka_at(PyObject* self, PyObject* args) {
    PyObject* py_arr; U8 i;
    if (!PyArg_ParseTuple(args, "OK", &py_arr, &i)) return nullptr;
    ChunkArray* arr = ((ChunkArrayObject*)py_arr)->c_arr;
    U8 val = chunkarray_at(arr, i);
    return PyLong_FromUnsignedLongLong(val);
}

static PyMethodDef ChunkArrayMethods[] = {
    {"push", chunka_push, METH_VARARGS, "Append an element, return its index"},
    {"at",   chunka_at,   METH_VARARGS, "Random access by index"},
    {nullptr, nullptr, 0, nullptr}
};

static struct PyModuleDef chunkmodule = {
    PyModuleDef_HEAD_INIT, "chunks", nullptr, -1, ChunkArrayMethods
};

PyMODINIT_FUNC PyInit_chunks(void) {
    return PyModule_Create(&chunkmodule);
}

This is ~80 lines of glue for a fully-functional module. The actual chunkarray_push and chunkarray_at are duffle.h-style C11.

2.2.5 NumPy + custom C API (PyArray_Interface)

What it is: NumPy has a C API (<numpy/arrayobject.h>) that lets C extensions allocate and manipulate ndarray objects. The C extension holds the actual memory, and NumPy wraps it as an array with zero copy.

Pros for chunkification:

  • If the chunk-array is logically a 1D contiguous sequence, NumPy can wrap it as a ndarray with zero copy
  • The user can then do np.sum(chunks), chunks[1000:2000], chunks[chunks > threshold] in NumPy land — all the vectorized ops for free
  • For batch operations (load 10K elements, do something to all of them, write back), NumPy is the right level of abstraction
  • Most Manual Slop hot-path code (text processing, JSON-L serialization, list-mutation) can be re-expressed as NumPy operations

Cons for chunkification:

  • NumPy semantics are flat 1D/2D/ND arrays, not chunk-aware. The "lego set" pattern (iterate over chunks, custom callback per chunk) is not a first-class NumPy concept.
  • The C API requires linking against NumPy's headers and ABI version compatibility
  • NumPy's array protocol is strongly typed (dtype); chunk-array-of-mixed-type is not a fit
  • For a chunk-array that needs to be both chunk-aware (user iterates chunks) and element-wise (NumPy ops on the flat view), you'd need a custom NumPy dtype with chunk-aware accessors — possible but not trivial

Verdict for chunkification: orthogonal. NumPy is a great consumer of a chunk-array (zero-copy wrap), but not a great driver (you can't easily express chunk-aware iteration in NumPy). The combination is: write the chunk-array in C11, expose a NumPy-compatible 1D view, let NumPy do batch ops when appropriate, do chunk-aware iteration in C.

Style fit with duffle.h: medium. NumPy's C API doesn't conflict with duffle.h, but the PyArrayObject types are intrusive. You'd write an adapter layer that converts between Slice<U8> (raw bytes) and PyArrayObject (typed ndarray).

2.3 The honest assessment matrix

For the actual question — "can a Python user-space program fully exploit a C11 chunk-based data structure lego-set?" — here's what the design space looks like:

Approach Build cost Per-op overhead Style fit Lego-set pattern support Verdict
ctypes 0 ~1-5 µs/call low low (each op = FFI call) Tractable but defeats the purpose
cffi ABI mode 0 ~1-5 µs/call low-medium low Same as ctypes
cffi API mode 1x (compile) ~50ns/call medium medium Good middle ground
pybind11 1x (compile) ~50ns/call very low (C++) medium Style mismatch — not a fit
CPython C ext 1x (compile) ~50ns/call high (C11) high (full C API) Most tractable
NumPy wrap 1x (compile) ~50ns/call medium low (flat view) Orthogonal — good for batch, not lego-set
HPy / PyO3 / nanobind 1x (compile) ~50ns/call low (Rust/C++/new API) medium Better than pybind11 but still style-mismatched

The recommendation:

For the lego-set (chunk-aware user-driven iteration): custom CPython C extension is the most tractable. The duffle.h style is C11; the C extension wrapping is ~80 lines of glue per chunk-array class; per-element overhead is the same as native Python (~50ns).

For batch operations on a chunk-array: NumPy wrap is the most tractable. Expose the chunk-array's memory as a 1D ndarray, let NumPy do the work. Zero-copy, vectorized, free.

For occasional FFI from Python: ctypes is fine. Load the lib, call the function, get the result. Don't try to do hot loops this way.

2.4 What "a chunked C11 package that interops with Python" actually requires

If the user wants to build this, the minimum viable product is:

  1. The chunk-array C11 code (duffle.h style, ~200-400 lines)

    • ChunkArray_T struct
    • chunkarray_push, chunkarray_at, chunkarray_grow, chunkarray_iter_chunks
    • Backing is an FArena for chunk memory + a Slice<Chunk*> for the chunk pointer table
  2. A CPython C extension wrapper (~80-150 lines)

    • PyTypeObject for ChunkArrayObject (wraps the C struct)
    • __init__ (creates the C struct from Python args: chunk_size, element_size, initial_capacity)
    • __len__ (returns total_used)
    • __getitem__ / __setitem__ (calls chunkarray_at / in-place write)
    • __iter__ (yields elements one at a time; can be optimized to yield per-chunk for the lego-set pattern)
    • push(value) method
    • chunks() method (yields per-chunk ndarray views for the NumPy interop path)
    • arena_capacity, chunk_count, chunk_size read-only properties
  3. A build step in pyproject.toml (one-time cost, ~5 lines)

    • [tool.uv.build-backend] config
    • Build the .pyd/.so for the current Python version
    • Wheels for distribution (optional, build for arm64 + x86_64 + win32 + linux)
  4. Tests in tests/test_chunka_c11.py (~100-300 lines)

    • TDD-style: write tests in Python first, then write the C, then verify
    • Grow pattern tests, random access tests, edge cases (empty, full, resize)
    • NumPy interop test: ensure np.array(chunks) is zero-copy
    • Comparison test: chunk-array must beat list.append for the relevant N
  5. A chunks/__init__.py Python wrapper (~30-50 lines, optional but recommended)

    • High-level API: ChunkArray(chunk_size=1024, element_size=8), .push(x), .at(i), .numpy()
    • Type hints for IDE support
    • This is the only Python code; everything else is C

Total: ~500-1000 lines of C + ~50-150 lines of Python glue + build/test config.

2.5 The honest tractable-vs-not answer

Tractable:

  • Writing a chunk-array in C11 duffle.h style: trivially tractable (Reece's Xar is the reference impl, ~200 lines)
  • Wrapping it as a CPython C extension: tractable (~150 lines of glue)
  • Per-element overhead matching native Python: yes (50ns vs 50ns, no FFI tax)
  • NumPy interop via zero-copy ndarray wrap: tractable (NumPy's C API is well-documented)
  • Build + distribution via uv + pyproject.toml: tractable (one-time setup, well-trodden path)

Not tractable (or not worth the cost):

  • Letting the user arbitrarily compose C11 chunk operations from Python at the lego-set level: not tractable without compiling Python → C11 on the fly. ctypes/cffi/pybind11 are all per-call; you'd need a C-subset JIT (like the user's forth_bootslop does for stack machine bytecode) to compose C11 ops in Python. That's a different track.
  • Having Python extend the chunk-array with user-defined per-element callbacks (like list(map(fn, arr))) that run at C speed: not tractable. Cython can compile Python-ish syntax to C, but the duffle.h style doesn't fit Cython's type system. The workaround is to ship pre-baked operations (push, at, iter_chunks, filter_chunk(fn_ptr)) and let users choose from those, not define new ones in Python.
  • Making the chunk-array cross-implementation (CPython + PyPy + RustPython): not tractable with the C extension approach. Use HPy (new Python C API targeting multiple impls) if this matters. HPy has a separate style, would need an adapter.

The "numpy DSL" the user mentioned: the closest analog is Cython's typed memoryviews or NumPy's ndarray protocol — both give you "Python can see a chunk of C memory and operate on it efficiently." Neither is a literal DSL; both are ABI/protocol layers. If the user wants a Python-side DSL for composing chunk operations, that's a separate design problem (Cython-like compile-to-C, or a small Python AST → C11 emitter).

Don't start with C11. Start with pure Python chunkification of the target (the comms.log ring buffer in app_controller.py:716). Verify:

  • The chunk pattern delivers a measurable speedup
  • The API is ergonomic from Python
  • The thread-safety story is correct
  • The serial/deserial path still works

Then, if the user wants the C11 lego-set:

  • Build the duffle.h-style C11 chunk-array (one type, ~200 lines)
  • Build the CPython C extension wrapper (~150 lines of glue)
  • Build the NumPy-compatible 1D view (lets existing Python code consume the chunk-array)
  • Optional: add a few pre-baked chunk-aware operations (filter_chunks, map_chunks, reduce_chunks) in C, exposed as Python methods
  • Optional: build a "lego-set" Python API that lets users compose pre-baked operations without writing C

Defer the "Python-defined chunk-aware callback" goal — it's the most ambitious, requires either Cython or a custom AST emitter, and is not clearly worth the complexity for a single project.

2.7 The 5 questions to ask the user (before this becomes a track)

These map directly to the design decisions in §2.3-§2.6:

  1. Build cost acceptable? Custom C extension is one-time ~half-day of build setup (pyproject.toml, compiler config, wheel build). One-time.
  2. Per-element overhead target? Native (~50ns) requires the C extension. ctypes is ~1-5µs (20-100x slower). What's the SLA?
  3. NumPy interop required? If yes, the C extension must expose the underlying memory as a 1D ndarray view (one-time setup).
  4. Cross-implementation? CPython only? Or HPy for CPython+PyPy? Big style difference.
  5. Lego-set composition in Python? Pre-baked ops (push, at, iter_chunks, filter_chunks) is tractable. User-defined Python→C11 callbacks is not (without Cython or a custom AST emitter).

2.8 The crucial insight

The user said: "the way I would define the C11 package or interop stuff would be unorthodox and would follow a similar pattern to what you would fine in either my forth_bootslop repo or my pikuma ps1 repo."

Reading both repos carefully (and the user's correction that they're "not really an interop pattern, I just wanted to show how I like todo C11"), the implication is:

  • The user is comfortable with a single C11 .h file as the entire interop boundary
  • The user is not going to write a complex pybind11 C++ layer or a Cython .pyx file
  • The user is comfortable with a thin CPython C extension if the C11 code stays in their style

The most likely path the user would actually take, given their style and your "lots of ambiguities" caveat:

  • Write the chunk-array in duffle.h style as a single header
  • Wrap it with a small PyTypeObject block at the bottom of the same file (or a separate chunks_module.c that includes the header)
  • Build it with uv + pyproject.toml
  • Import it from Manual Slop and verify the speedup on comms.log

That's tractable. The "lego set of composable Python-driven chunk operations" is a stretch goal that requires more design work, and probably isn't needed for the comms.log target.


3. The non-recommendations

Don't do any of these:

  • pybind11. Style mismatch. C++ is not the user's idiom.
  • Cython. The user writes pure C11 with macros. Cython is Python-with-C-type-annotations. Style mismatch.
  • Rust + PyO3. The user writes C, not Rust. PyO3 is great for Rust shops, not relevant here.
  • HPy. Cross-implementation matters less than style fit. Revisit if PyPy becomes a target.
  • Pure Python implementation of the lego-set pattern. Defeats the point. If you're not crossing the FFI boundary, you don't need C11.

4. Summary verdict (SUPERSEDED — see Part 3)

The table in this section is the v1 verdict, written before the user's second correction (Part 3). Kept for the record, but Part 3 is the action-oriented section.

The user's question The honest answer
Can chunk-based C11 interop with Python? Yes, via custom CPython C extension. ~150 lines of glue per chunk-array type.
Is it worth the cost? Depends on the use case. For comms.log, the C extension is tractable. For "compose arbitrary C11 ops from Python," it's not (needs a Python→C emitter).
What does the lego-set pattern look like? Pre-baked C operations exposed as Python methods (push, at, iter_chunks, filter_chunks). User-defined per-element Python callbacks running at C speed is not tractable.
What about numpy? NumPy can zero-copy wrap the chunk-array as a 1D ndarray. Best for batch ops, not chunk-aware iteration.
What's the build cost? One-time ~half-day (uv + pyproject.toml + C extension). Wheels for distribution optional.
What about HPy / cross-impl? Not needed unless PyPy becomes a target. Stick with CPython C API.
What's the style fit with duffle.h? High. The chunk-array is written in duffle.h style; the C extension wrapper is a thin PyTypeObject block at the bottom of the file.

Original recommended action (v1):

  1. Verify the chunk pattern delivers value first. Pure-Python chunkification of comms.log (or another target), measure, confirm.
  2. If C11 is desired, build the C extension in duffle.h style. ~500 lines total (200 C array + 150 glue + 100 tests + 50 Python wrapper).
  3. If NumPy is the consumer, expose the 1D view. One-time, ~20 lines of NumPy C API glue.
  4. Defer the "user-defined Python→C11 callback" goal unless a specific use case demands it.

PART 3 — Revised Verdict (after the user's second correction)

3.1 The second user-correction (verbatim)

"This seems like it would only be worth it if I reach a hard constraint that I cannot solve with an existing python package. Then I could make a custom pipelien to deal with the hot data set witha custom cpython extension. Such as, parsing markdown files or sources int aggregate markdown, context snapshot processing and possibly other things in the future. The python would have to define the payload in a simple text or binary format as the request and then the extenion pipeline in C11 would do the ops and provide the output in another binary or text blob/s."

3.2 What the second correction changed

Two distinct moves, both significant:

Move 1 — threshold-shift on when to bother:

"only worth it if I reach a hard constraint that I cannot solve with an existing python package"

This inverts the default. v1 framed the chunkification_optimization track as "if you want the C11 path, here's how to build it." v2 frames it as "don't build it until a hard constraint forces the issue, and here's the specific shape of the build when that day comes."

Move 2 — shape-change on what to build:

"the python would have to define the payload in a simple text or binary format as the request and then the extension pipeline in C11 would do the ops and provide the output in another binary or text blob/s"

This is not a stateful C extension with a Python-facing API. It is a request/response blob pipeline:

   Python user-space                    C11 pipeline
  ┌──────────────────┐                 ┌──────────────────┐
  │ 1. Assemble      │                 │                  │
  │    request:      │   request.bin   │  parse request   │
  │    {files: [...],│ ───────────────▶│  load payload    │
  │     ops: [...],  │                 │  run ops         │
  │     params: {}}  │                 │  format output   │
  │ 2. Serialize to  │                 │                  │
  │    blob (text or │                 │                  │
  │    binary)       │                 │                  │
  │ 3. Hand to C11   │   response.bin  │                  │
  │ 4. Parse         │ ◀───────────────│                  │
  │    response      │                 │                  │
  └──────────────────┘                 └──────────────────┘

This is strictly better than the v1 framing in 4 ways:

  1. Composition in Python is trivial. The "lego set" the user worried about isn't a problem: the Python side composes the request, and the C side just executes the pre-defined op pipeline. No Python→C11 emitter needed.
  2. The wire format IS the contract. Both sides agree on a schema (text or binary), not on a Python type. The C side has zero knowledge of PyObject / PyTypeObject / refcounting. The Python side has zero knowledge of FArena / Slice / U8. Cleanest possible boundary.
  3. Per-op FFI cost is zero. There's exactly one FFI call per pipeline run, not per element. The "ctypes per-call overhead defeats the purpose" concern from v1 §2.2.1 disappears.
  4. State-free C side. The C pipeline reads the request, runs ops, writes the response, exits. No need to maintain Python refcount discipline over a long-lived C object. The C side is a pure function process(request_bytes) -> response_bytes.

3.3 The two target use cases, grounded in actual code

3.3.1 Target 1: parsing markdown files / sources into aggregate markdown

Current state (read from src/aggregate.py:380-454 build_markdown_from_items + src/summarize.py:7-219):

  • The aggregate pipeline builds markdown by pure Python string concatenation (f"### \{original}`\n\n```{suffix}\n{skeleton}\n``""and"\n\n---\n\n".join(sections)`)
  • _summarise_markdown in summarize.py only extracts headings — does NOT parse the body
  • pyproject.toml has zero third-party markdown dependencies (mistune, markdown-it-py, commonmark-py, markdown are all not in the deps)
  • build_file_items at aggregate.py:142 does the path resolution + content reading; build_markdown_from_items does the string-concat assembly; summarize.summarise_file is called per-file for non-focus tiers

Where the actual bottleneck is (right now):

  • The string concatenation in build_markdown_from_items — Python's f-strings are fast but "\n\n---\n\n".join(sections) over a list of ~50-500 sections scales linearly
  • The parser.get_skeleton(content) call in aggregate.py:444 for every .py file in the composition
  • The mcp_client.py_get_definition / mcp_client.ts_cpp_get_* calls for masked symbols
  • The summarize.summarise_file calls per file

Where the bottleneck would be IF real markdown parsing were added:

  • Adding a markdown parser (e.g., markdown-it-py) to extract structural elements (headings, code blocks, links) for navigation/context-aware aggregation
  • For projects with many .md files (e.g., docs/ with 14 guides, 30+ IDE markdown files), the parse cost would dominate

Is this a hard constraint that Python packages can't solve?

  • No, today. markdown-it-py is ~10x faster than python-markdown and ~50x faster than pure-Python regex parsing. It's well-maintained, C-accelerated (via cmark/commonmark), and has a clean AST API. Adopting it is a one-line pyproject.toml change, not a C11 build.
  • Possible yes, in the future. If the user adds cross-file markdown analysis (TOC generation, link graph, code-block extraction across many files) at runtime, the cumulative parse time for hundreds of files could push past markdown-it-py's comfort zone. That would be the hard constraint.

When to act: the moment the markdown-parse hot path becomes a real bottleneck in profiling (i.e., the user can demonstrate via performance_monitor.py that build_markdown_from_items is the slow part of a real workflow). Until then, the existing Python path is fine, and markdown-it-py is the first thing to try.

3.3.2 Target 2: context snapshot processing

Current state (read from src/history.py:1-141):

  • UISnapshot is a @dataclass with 13 fields. The "large" fields are disc_entries: list[dict], files: list[dict], context_files: list[dict], screenshots: list[str]
  • HistoryManager is a small Python class. push / undo / redo / jump_to_undo are the only mutating ops
  • Snapshot capacity is 100 (default in HistoryManager.__init__)
  • The actual work is UISnapshot.to_dict and from_dict — deep-copy of nested dicts

Where the actual bottleneck is:

  • The to_dict / from_dict deep-copies. 100 snapshots × ~5KB each = 500KB of nested dict copying per push/undo. At 60 FPS push rate, that's 30MB/s of dict copy — Python's not great at that but pushes are debounced in docs/guide_state_lifecycle.md (render frame at gui_2.py:1140-1170), so the actual rate is much lower
  • The list copy of disc_entries is the heaviest single op (a 23-op matrix can have ~50-200 entries per snapshot)

Is this a hard constraint that Python packages can't solve?

  • No, today. Python's copy.deepcopy is the canonical answer; pickle round-trips are 5-10x faster than to_dict/from_dict for nested data. If snapshot capture is slow, the fix is to switch to pickle (or to msgspec / orjson for json-like schemas), not C11.
  • Possible yes, in the future. If snapshots grow to MB-scale (e.g., per-frame UI state for video-game-like content) and push rate goes up (e.g., per-frame state push during a long session), the cumulative cost would matter. That would be the hard constraint.

When to act: the moment the user sees history.py push() in a profile. Until then, switching to pickle is the cheap fix.

3.4 The request/response wire format (the contract)

The user said "simple text or binary format as the request and then the extension pipeline in C11 would do the ops and provide the output in another binary or text blob/s."

Two options on the table. The choice has real implications:

3.4.1 Option A: text (line-based, JSON-ish, debuggable)

# request.txt
op parse_md
op summarise_python
op mask_symbols @sym1 def @sym2 sig
op build_section tier=3
input file src/foo.py
input file src/bar.py
format markdown_v3
end
  • Pros: human-readable, greppable, version-controllable, easy to debug (you can cat the request and the response)
  • Cons: parsing cost on the C side (strncmp per op), bigger payload, slower to roundtrip

3.4.2 Option B: binary (msgpack / protobuf / custom)

[1 byte: format version]
[1 byte: op_count]
[for each op:
   [1 byte: op_id]
   [varint: param_count]
   [for each param:
      [1 byte: type_id]
      [varint: byte_len]
      [bytes: value]]]
[for each input:
   [varint: byte_len]
   [bytes: file_path]]
[for each input file blob:
   [varint: byte_len]
   [bytes: file_content]]
  • Pros: fast to parse (~1-10µs per op on C side), small payload, deterministic
  • Cons: not human-readable, harder to debug, format versioning required, binary compatibility across Python/C versions

The recommendation: start with text for v1 (debuggability > speed when you're not sure what the ops look like), switch to binary for v2 if profiling shows the parse cost matters. The wire format is the only contract, so it's also the only thing you have to maintain compat with.

A reasonable middle path: text for the envelope (which ops to run, which params), binary for the payloads (file contents, result blobs). This way you can cat the envelope to debug, and the heavy bytes move binary-only.

3.5 The pipeline API (what the C11 side exposes)

If we adopt the request/response model, the C11 side has exactly one entry point:

// chunks_module.c (hypothetical)
// Returns: response blob (caller frees)
// Args: request blob (opaque, owned by caller)
typedef Struct_(PipelineResponse) {
    U8* bytes;
    U8  len;
    U4  exit_code;  // 0 = success, non-zero = error
    Str8 error_msg; // optional, only populated on error
};

IA_ PipelineResponse pipeline_run(Slice request);

The C side:

  1. Parses the request envelope (op list + params + input file list)
  2. Loads the requested input files (or accepts inline blobs)
  3. Runs each op in order
  4. Collects the output into a single response blob
  5. Returns the blob + exit code

The Python side:

  1. Builds the request envelope (text or binary)
  2. Subprocess-launches the C pipeline binary (or calls via ctypes) with the request on stdin
  3. Reads the response from stdout
  4. Parses the response (text or binary)
  5. Returns the parsed result to the calling code

The subprocess model is strongly recommended over the in-process FFI model for v1:

  • Zero FFI surface (no ctypes, no PyTypeObject, no refcount discipline)
  • Trivially testable (the C binary can be run from the shell, results compared)
  • Total process isolation (C crash doesn't take down the Python process)
  • ~10-20ms startup tax per call (acceptable for batch ops, not for hot loops)
  • Easy to swap implementations (rewrite the C binary, keep the wire format)

If profiling later shows the subprocess startup is the bottleneck, switch to in-process via ctypes. The wire format doesn't change.

3.6 The "chunkification" question, revisited

The original chunkification_optimization_20260608_PLACEHOLDER track was about replacing growable buffers (comms.log, summary_cache, etc.) with chunk-based data structures (Reece's Xar pattern, duffle.h style).

Under the new framing:

  • If the target (comms.log etc.) is on a hot path that an existing Python package can't solve, build a C11 pipeline that takes a request like {op: append_chunk, arena: comms, data: {...}} and returns {status: ok, count: 42}. The C side owns the chunk-array as a private data structure; the Python side never sees it.
  • The chunk-array is now an implementation detail of the C pipeline, not a Python data type. The user's "lego set" worry is moot because Python doesn't have direct access to the lego set — it only has the request/response protocol.

This is much cleaner than the v1 framing (stateful C extension with Python-facing API). The chunk-array is internal to the C pipeline. Python user-space has zero access to the underlying memory layout. The wire format is the entire surface area.

3.7 When to act (the decision tree)

Is the target code path actually a bottleneck in profiling?
├── No  → Don't act. Use existing Python packages (`markdown-it-py`,
│         `pickle`, `msgspec`, `orjson`, `numpy`, `pandas` as appropriate).
│         Re-evaluate next quarter.
│
└── Yes → Is the bottleneck solvable with existing Python packages?
    ├── Yes (e.g., switch `to_dict`/`from_dict` to `pickle`) → Apply that fix.
    │         Cost: hours. Don't reach for C11.
    │
    └── No (existing packages aren't fast enough or can't do the op) → Build the C11 pipeline:
              1. Define the wire format (text v1, binary v2)
              2. Write the C11 pipeline binary in duffle.h style
              3. Write the Python wrapper that builds requests and parses responses
              4. Ship as a subprocess (not in-process FFI) for v1
              5. Add an in-process FFI path only if subprocess startup is the new bottleneck
              6. Profile: confirm the C11 path is actually faster than the Python baseline
              7. If not faster, throw away the C11 code and try a different Python package

Default action for the current session: don't build the C11 pipeline. No profiling has been done; no existing Python package has been ruled out. The hard constraint doesn't exist yet.

3.8 The 4 questions to revisit when a hard constraint actually surfaces

These are the design decisions that have to be made when (not before) the user hits a real bottleneck:

  1. Which target? Is it markdown parsing, snapshot processing, log aggregation, RAG indexing, or something else? Each has different op shapes, different request schemas, different response schemas.
  2. Subprocess or in-process FFI? Start with subprocess (zero FFI surface, ~10-20ms startup tax). Move to in-process only if startup cost is the new bottleneck.
  3. Text or binary wire format? Text v1 (debuggable, slower). Binary v2 (fast, not debuggable). Envelope-text + payload-binary middle ground.
  4. One pipeline binary or many? One binary with an op registry is simpler to build/test/deploy. Many binaries (one per op) is more modular but harder to coordinate. Recommend one binary with a registry.

3.9 The crucial insight (revised)

v1's insight: "The user's 'unorthodox' interop is most likely a single duffle.h-style C11 .h file with a thin PyTypeObject block at the bottom. Tractable."

v2's insight (the better one): "The C11 side doesn't need to be a Python-aware module at all. It can be a standalone binary that takes a request on stdin, runs ops, returns a response on stdout. Python user-space just shells out. Zero FFI surface. Zero refcount discipline. The wire format is the contract, period."

The v2 model is strictly more tractable than v1:

  • No pyproject.toml build hook required
  • No PyTypeObject, no PyMethodDef, no PyArg_ParseTuple
  • No Python GIL concerns
  • No CPython version compat (works with any Python that can subprocess.run())
  • Testable from the shell (echo 'op foo' | ./pipeline_bin returns the response)
  • Deployable as a single binary, or a wheel that bundles the binary
  • The C11 code is 100% duffle.h style, no Python adaptation needed

The cost trade-off: subprocess startup is ~10-20ms per call. For batch ops (parse 100 markdown files, generate 100 snapshots, build one big context) this is fine. For per-frame hot loops (e.g., 60 FPS text rendering) it's not. If a target is per-frame, the v1 in-process FFI model is required; otherwise, the v2 subprocess model is strictly better.

3.10 What this means for the track

chunkification_optimization_20260608_PLACEHOLDER is no longer a track. It is a contingency that activates when a hard constraint surfaces. The contingency plan is:

  1. Default: don't build. Use existing Python packages. Re-evaluate quarterly.
  2. If a hard constraint surfaces: build the v2 subprocess pipeline model. Wire format is the contract. C11 code is duffle.h-style standalone binary. Python wrapper is a thin subprocess.run() caller.
  3. Track artifact, deferred: the chunkification_optimization_20260608_PLACEHOLDER directory should hold a 1-page "contingency plan" doc (essentially a copy of this §3) rather than a full spec/plan. Promote to a full track when the first hard constraint surfaces.

manual_ux_validation_20260608_PLACEHOLDER (the other v1 proposal) is unaffected by this correction. It remains a small, well-scoped track to promote the ASCII-sketch UX workflow.

3.11 The honest re-verdict matrix (v2)

The user's question The honest answer (v2)
When is the C11 path worth the cost? Only when a hard constraint surfaces that no existing Python package can solve. Default: don't build.
What does the C11 path look like? A standalone subprocess binary. Request in (text or binary), response out. Zero Python-awareness. Wire format is the contract.
How does Python compose chunk operations? It composes the request envelope (which ops to run, with which params), not the C ops themselves. The C side just executes the pre-defined op list. No Python→C11 emitter needed.
What's the per-op overhead? Zero FFI overhead (subprocess model). ~10-20ms per call (subprocess startup). Acceptable for batch ops, not for per-frame hot loops.
What about numpy? NumPy is a Python package; the question doesn't apply to the v2 model. The C pipeline is its own world, with its own data structures. NumPy doesn't help here.
What's the build cost? One-time ~half-day (just a C binary, no Python integration). Build via existing uv + a new [tool.uv.scripts] entry that runs clang on the .c file.
What about HPy / cross-impl? Not relevant; the v2 model is a standalone subprocess, no Python implementation specifics.
What's the style fit with duffle.h? Perfect. The C pipeline is 100% duffle.h style. No Python adaptation.
What's the wire format? The user chooses. Recommend text-v1 (debuggable) → binary-v2 (fast) as the workload justifies.
What's the deploy shape? Single C binary. Python subprocess.run() to call. Optional wheel that bundles the binary.
What about in-process FFI? Skip for v1. Add later if subprocess startup is the new bottleneck. The wire format doesn't change.

3.12 Summary (v2, the action-oriented section)

Don't build anything yet. Profile first; adopt existing Python packages; only reach for C11 when an existing package can't solve the bottleneck. The user said this directly: "only worth it if I reach a hard constraint that I cannot solve with an existing python package."

When you do build, the shape is: subprocess C11 binary + wire format contract + thin Python subprocess.run() wrapper. No FFI, no PyTypeObject, no refcount discipline, no Python adaptation of the C code. The chunk-array (or whatever data structure) lives entirely inside the C binary; Python only sees request/response blobs.

chunkification_optimization_20260608_PLACEHOLDER should become a 1-page contingency plan, not a full track. Promote to a track when (if) the first hard constraint surfaces.

manual_ux_validation_20260608_PLACEHOLDER (Track #1 from the v1 proposal) is unaffected and remains a small, well-scoped track. Confirmed worth doing in the user's first message ("I love the idea and definitely see poitental").


End of v2 assessment. The 2 user-corrections in this session (style reference, then request/response model) reshaped the answer from "build a stateful C extension" to "don't build anything, here's the contingency plan for when you do." Track #1 (manual_ux_validation) is confirmed. Track #2 (chunkification) is downgraded to a contingency document.

Cross-references for re-anchoring: docs/reports/session_synthesis_20260608.md §8.2 (the original v1 proposal), docs/ideation/ed_chunk_data_structures_20260523.md (the user's chunk-ideation), docs/transcripts/i-h95QIGchY_assuming_as_much_as_possible_andrewreece.txt §56:42 (Reece's Xar reference impl), src/aggregate.py:380-454 (the actual current markdown hot path), src/history.py:1-141 (the actual current snapshot hot path), pyproject.toml:6-27 (the current zero-markdown-deps state).