conductor-tier2 68354841cb docs(interop-assessment): C11 <-> Python interop design space for chunkification_optimization

The user asked a sharp, skeptical question: can a chunk-based C11
data structure actually interop with Python's runtime in a way
that's useful for Manual Slop? They explicitly corrected my
first-draft framing (the duffle.h + pikuma ps1 files are a C11
*style reference*, not an interop pattern). The assessment
investigates honestly and reports tractable-vs-not.

docs/reports/c11_python_interop_assessment_20260608.md (564 lines, 38KB):

Part 1: C11 style reference summary
- 11 style observations from reading duffle.h + main.c + pikuma
  ps1 duffle/ + hello_gte.c end-to-end
- Byte-width typedef convention (U1/U2/U4/U8, S1/S2/S4/S8, B1-B8, F4/F8)
- The macro meta-DSL (Struct_/Enum_/Array_/Slice_/Opt_/Ret_)
- The I_/IA_/N_ inline discipline
- The r/v pointer rule (restrict OR volatile, never both, never const)
- Slice + Slice_T as the data-structure primitive
- FArena as the allocation primitive (single-buffer, NOT chunked)
- defer/defer_rewind/scope as the cleanup primitive
- KTL (linear key-value table) as the "assume small N" pattern
- What a chunk-array in duffle.h style would look like

Part 2: Interop design space (the actual question)
- 5 candidate interop layers: ctypes, cffi, pybind11, custom
  CPython C extension, NumPy wrap
- Honest assessment matrix: build cost, per-op overhead, style
  fit, lego-set pattern support
- Verdict: custom CPython C extension is most tractable; pybind11
  is style-mismatched; ctypes/cffi work for non-hot-path
- What "MVP chunked C11 package" requires (~500-1000 LOC total)
- 5 questions to ask the user before this becomes a track
- Crucial insight: the user's "unorthodox" interop is most likely
  duffle.h-style C11 + thin PyTypeObject glue at the bottom of
  the same .h file. Tractable, style-fit high.

Cross-references the 5 sources:
- docs/transcripts/i-h95QIGchY (Reece's Xar reference impl)
- docs/ideation/ed_chunk_data_structures_20260523.md
- docs/reports/session_synthesis_20260608.md (the original proposal)
- src/app_controller.py:716 (the comms.log target)
- The user's local forth_bootslop + pikuma ps1 repos (read in full)

This is a follow-on to the synthesis's 2 proposed tracks
(manual_ux_validation_20260608_PLACEHOLDER + chunkification_optimization_20260608_PLACEHOLDER).
The user's question resolved the "skeptical of #2" concern by
scoping the tractable path: CPython C extension in duffle.h style.
The "lego-set of user-defined Python->C11 chunk ops" is NOT
tractable without a Python->C11 AST emitter, which is a
different (much larger) track.

2026-06-08 22:50:03 -04:00

37 KiB


C11 ↔ Python Interop Assessment — 2026-06-08

Question source: end-of-session user clarification on the proposed chunkification_optimization_20260608_PLACEHOLDER track.
Author: Tier 1 Orchestrator (synthesis + technical assessment)
Date: 2026-06-08
Status: Honest tractable-vs-not verdict, no code proposed
Cross-references: docs/reports/session_synthesis_20260608.md §8.2, docs/ideation/ed_chunk_data_structures_20260523.md, docs/transcripts/i-h95QIGchY_assuming_as_much_as_possible_andrewreece.txt §56:42

0. The user-correction that reshaped the question

First framing (mine, in proposed_new_tracks_20260608.md): "Manual Slop's comms.log could be replaced by a C11 chunk-based data structure, with Python user-space interop via numpy/ctypes/etc."

User's clarification: "it's not really an interop pattern, I just wanted to show how I like todo C11."

What changed: the C11 codebases I was pointed to (forth_bootslop/attempt_1/duffle.amd64.win32.h + main.c, and Pikuma/ps1/code/duffle/* + gte_hello/) are style references — they show what C11 looks like when Ed writes it. They do not contain a Python interop layer, and weren't meant to be read as one. The "interop design space" question is a separate open question, and the user explicitly said "lots of ambiguities."

This document is split into two parts that should not be conflated:

Part 1 — the C11 style reference (what the duffle.h + pikuma ps1 headers show)
Part 2 — the interop design space (the actual question the user is asking, with honest tractable-vs-not assessment)

PART 1 — C11 Style Reference (what your duffle.h + pikuma ps1 show)

1.1 The duffle.h "DSL" (forth_bootslop/attempt_1/duffle.amd64.win32.h, 727 lines)

A single-header file that defines a C DSL in pure macros + inline functions. Compiled with clang in c23 mode. Target: amd64 + Windows 11. Zero external dependencies (the only #pragma comment(lib, ...) lines are to Kernel32/User32/Gdi32/Advapi32).

The core conventions:

1.1.1 Byte-width typedef convention (mandatory, used everywhere)

typedef __UINT8_TYPE__  U1; typedef __UINT16_TYPE__ U2; typedef __UINT32_TYPE__ U4; typedef __UINT64_TYPE__ U8;
typedef __INT8_TYPE__   S1; typedef __INT16_TYPE__  S2; typedef __INT32_TYPE__  S4; typedef __INT64_TYPE__  S8;
typedef unsigned char   B1; typedef __UINT16_TYPE__ B2; typedef __UINT32_TYPE__ B4; typedef __UINT64_TYPE__ B8;
typedef float           F4; typedef double          F8;

U = unsigned, S = signed, B = byte (char)
The number is the width in bytes, not bits (U4 is 4 bytes, i.e. 32 bits)
All custom code uses these; int/long/size_t only appear in system headers

Casts are wrapped: u4_(value) / u8_(value) / f4_(value) etc. enforce precedence in arithmetic and signal at the call site "this is an explicit narrowing."
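As a quick sanity check of the naming rule, the sketch below mirrors the typedef table with ctypes types and confirms that the trailing digit of each name equals its size in bytes. The `WIDTHS` table and `byte_width` helper are hypothetical; only the type names come from duffle.h.

```python
import ctypes

# Hypothetical Python-side mirror of the typedef table; only the names come from duffle.h.
WIDTHS = {
    "U1": ctypes.c_uint8,  "U2": ctypes.c_uint16, "U4": ctypes.c_uint32, "U8": ctypes.c_uint64,
    "S1": ctypes.c_int8,   "S2": ctypes.c_int16,  "S4": ctypes.c_int32,  "S8": ctypes.c_int64,
    "F4": ctypes.c_float,  "F8": ctypes.c_double,
}

def byte_width(name: str) -> int:
    """Size in bytes of the ctypes type that mirrors a duffle.h typedef."""
    return ctypes.sizeof(WIDTHS[name])

for name in WIDTHS:
    # The trailing digit of every name equals its size in bytes.
    assert byte_width(name) == int(name[-1]), name
```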

1.1.2 Macro meta-DSL (the "duffle" layer)

#define m_expand(...)      __VA_ARGS__
#define glue_impl(A, B)    A ## B
#define glue(A, B)         glue_impl(A, B)
#define tmpl(prefix, type) prefix ## _ ## type

The rest of the file is built on these. Patterns:

Struct_(Foo) expands to struct Foo Foo; struct Foo — a forward decl + a typedef in one go, so you can use Foo as a type or a struct namespace immediately
Enum_(U4, MyEnum) similarly gives you MyEnum as the type and enum MyEnum as the tag
Union_(Foo), Array_(type, len), Slice_(type) — same pattern, all single-line

This is the meta-primitive that the entire codebase builds on. There is no class, no templates, no codegen — just #define and _Generic.

1.1.3 Inline / always-inline / no-inline discipline

#define I_  internal inline
#define IA_ I_ __attribute__((always_inline))
#define N_  internal __attribute__((noinline))

Plus the macro name encodes intent: I_* is a normal inline, IA_* is forced inline (small, hot), N_* is forced out-of-line (debugging, code-size). Functions written as IA_ void foo(...) carry the intent in the function signature itself.

1.1.4 The `r`/`v` discipline (restrict / volatile, and nothing else)

#define r restrict  // pointers are either restricted or volatile and nothing else
#define v volatile

Plus typed pointer aliases: r_(ptr) = C_(T_(ptr[0])*r, ptr) is a typed restrict pointer, v_(ptr) is a typed volatile pointer. The user comment says this directly: "pointers are either restricted or volatile and nothing else."

There are no const pointers, no volatile restrict, no fancy CV qualifiers. Just two states. This is a real constraint on the design.

1.1.5 Slice as the core compound type

typedef Struct_(Slice)  { U8 ptr, len; };  // Untyped slice
#define Slice_(type) Struct_(tmpl(Slice,type)) { type* ptr; U8 len; }

Untyped Slice is { void*, size_t } (well, {U8 ptr, U8 len} — U8 is the byte-width convention)
Typed Slice_T wraps a typed T* with the same len field
slice_iter(container, iter) is the iteration macro
slice_end(slice) returns slice.ptr + slice.len (pointer past the end, not a pointer to last element)
slice_to_ut(s) converts a typed slice to an untyped slice (used for memcpy / hash / format)
S_slice(s) is s.len * sizeof(s.ptr[0]) — the byte size

This is the data-structure primitive of the duffle system. Arenas, stacks, KTL tables — everything is built on Slice + Slice_T + FArena.

1.1.6 The `FArena` (the chunk-adjacent data structure)

typedef Struct_(FArena) { U8 start, capacity, used; };

Linear-bump allocator with a start / capacity / used triple
farena_push(arena, amount, options) returns a Slice
farena_save(arena) -> used (snapshot), farena_rewind(arena, save_point) (rollback to snapshot)
farena_reset(arena) zeroes used (does NOT free; that requires slice_free or arena destruction)
farena_push_type(arena, type, ...) and farena_push_array(arena, type, amount, ...) are typed convenience macros

Key observation: this is not a chunk-based arena. It is a single contiguous buffer with a bump pointer. The user could extend it to chunked (with Slice<FArena> as the backing, or by allocating new pages and chaining them), but the current FArena is monolithic.

1.1.7 Memory-barrier and atomic primitives (asm volatile)

IA_ void barrier_compiler(void){asm volatile("":::"memory");}
IA_ void barrier_memory  (void){__builtin_ia32_mfence();}
IA_ void barrier_read    (void){__builtin_ia32_lfence();}
IA_ void barrier_write   (void){__builtin_ia32_sfence();}

IA_ U4 atm_add_u4 (U4*r addr, U4 value){asm volatile("lock xaddl %0,%1":"=r"(value),"=m"(addr[0]):"0"(value),"m"(addr[0]):"memory","cc"); return value;}

These are written as raw inline asm, not stdatomic.h. The user prefers __builtin_* intrinsics and raw asm volatile(...) over library abstractions. This matters for interop: there's no portable way to call these from Python.

1.1.8 Control-flow and defer discipline

#define defer(expr) for(U4 once= 0; once!=1; ++once, (expr))
#define scope(begin,end) for(U4 once=((begin),0); once!=1; ++once, (end))
#define defer_rewind(cursor) for(T_(cursor) sp=cursor, once=0; once!=1; ++once, cursor=sp)

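defer attaches a single cleanup expression to the block that follows it: the block runs once, then the expression fires. Because this is a for-loop trick, a break or return inside the block skips the cleanup. defer_rewind is the arena-aware variant: it captures the current cursor at block entry and restores it on exit. This is the pattern for "transactional" arena allocation.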
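To make the save/rewind idea concrete, here is a minimal Python model of a bump arena with a context manager standing in for defer_rewind. `BumpArena` and `rewind_scope` are hypothetical names, and the model only imitates the semantics described above, not the duffle.h implementation.

```python
from contextlib import contextmanager

class BumpArena:
    """Toy model of FArena: one fixed buffer plus a `used` cursor."""
    def __init__(self, capacity: int):
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.used = 0

    def push(self, amount: int) -> memoryview:
        """Bump-allocate `amount` bytes and return a view of them."""
        if self.used + amount > self.capacity:
            raise MemoryError("arena exhausted")
        start = self.used
        self.used += amount
        return memoryview(self.buf)[start:start + amount]

    def save(self) -> int:
        return self.used

    def rewind(self, save_point: int) -> None:
        self.used = save_point

@contextmanager
def rewind_scope(arena: BumpArena):
    """Rough analog of defer_rewind: restore the cursor when the block ends."""
    save_point = arena.save()
    try:
        yield arena
    finally:
        arena.rewind(save_point)

arena = BumpArena(64)
arena.push(16)                      # long-lived allocation
with rewind_scope(arena) as scratch:
    scratch.push(32)                # temporary allocation
    assert arena.used == 48
assert arena.used == 16             # scratch space was released on exit
```

One difference worth noting: the Python `finally` also runs on exceptions and early returns, which the C for-loop macro does not.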

1.1.9 The `KTL` (Key Table Linear) — a small key-value table

#define KTL_Slot_(type) Struct_(tmpl(KTL_Slot,type)) { U8 key; type value; }
#define KTL_(type) Slice_(tmpl(KTL_Slot,type))
typedef Slice KTL_Byte;

A linear array of {key, value} slots, with FNV-1a 64-bit hashing on Str8 keys. The comment in the code says: "We do a linear iteration instead of a hash table lookup because the user should never subst with more than 100 unique tokens." — this is the "assume as much as possible" principle applied directly. No hash table; linear scan wins for small N.
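The "hash the key, then scan linearly" idea can be sketched in a few lines of Python. The FNV-1a constants are the standard 64-bit ones; `LinearTable` is a hypothetical model and assumes, as the `U8 key` slot suggests, that the slot stores the hash rather than the string.

```python
FNV_OFFSET_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF

def fnv1a_64(data: bytes) -> int:
    """Standard FNV-1a, 64-bit: xor each byte in, then multiply by the prime."""
    h = FNV_OFFSET_64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_64) & MASK_64
    return h

class LinearTable:
    """Toy model of a KTL: a flat list of (key_hash, value) slots, scanned linearly."""
    def __init__(self):
        self.slots = []

    def set(self, key: str, value) -> None:
        k = fnv1a_64(key.encode("utf-8"))
        for i, (slot_key, _) in enumerate(self.slots):
            if slot_key == k:
                self.slots[i] = (k, value)   # overwrite existing slot
                return
        self.slots.append((k, value))

    def get(self, key: str, default=None):
        k = fnv1a_64(key.encode("utf-8"))
        for slot_key, value in self.slots:
            if slot_key == k:
                return value
        return default

table = LinearTable()
table.set("name", "duffle")
table.set("name", "ktl")
assert table.get("name") == "ktl" and len(table.slots) == 1
```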

1.2 The duffle.h ↔ main.c interface (forth_bootslop/attempt_1/main.c, 1426 lines)

main.c is a stack-machine JIT compiler. It uses duffle.h to:

Define an STag enum (X-macro pattern: 7 entries in a single Tag_Entries() table, then #define X + #undef X to repurpose the macro inside the table generator)
Define tape_arena (an FArena for the bytecode tape) and anno_arena (parallel arena for annotation strings)
Use u4_r(...) / u8_r(...) for typed restrict pointers
Use mem_copy / mem_zero (which are wrappers around __builtin_memcpy / __builtin_memset)
Hand-emit x64 machine code using emit8 / emit32 / emit64 macros
Build a JIT (Just-In-Time compiler for a custom stack-based VM) that emits REX prefixes, ModRM bytes, SIB bytes via a per-field macro DSL

What this tells us about how Ed uses duffle.h:

The DSL is meant to support low-level systems work (JIT, OS syscalls, raw asm) without sacrificing readability
The byte-width typedef convention is rigid — every new line of code in main.c uses U1/U4/U8; int/long only appear in system header forward-decls
Memory discipline is arena-first: tape_arena + anno_arena + code_arena are global FArena instances, no malloc/free in user code
The defer / defer_rewind pattern is the user's answer to RAII — it's the only structured cleanup mechanism

1.3 The Pikuma ps1 duffle/ (Pikuma/ps1/code/duffle/*, the more recent style)

The Pikuma ps1 duffle/ is a refined, smaller version of the forth_bootslop DSL. Same conventions, but with platform-specific concerns (PS1 MIPS + GTE + GPU command encoders). Notable differences:

dsl.h adds TSet_(type) (type + restricted-pointer + volatile-pointer in one typedef), Proc_(symbol) (typedef for void(*)())
memory.h adds sll_stack_push_n / sll_queue_push_nz — singly-linked list / queue macros (the DAG region)
gp.h is the GPU command encoder; every GPU command is a (gcmd_X << 24 | ...) bit-packing macro, same pattern as the x64 emission DSL in forth_bootslop main.c
gte.h is the GTE coprocessor instruction encoder; per-field macros, asm volatile(asm_inline(gte_cmd_rtpt, ...)) to emit constant-folded instruction words
math.h defines V2_S2, V3_S2, V4_S2 (S2/S4 are 16/32-bit signed), Rect_S2, M3_S2 — 3x3 matrix with translation vector

What Pikuma ps1 duffle/ shows that's different from forth_bootslop:

The DSL is split across multiple small headers (dsl.h, memory.h, math.h, gp.h, gte.h, mips.h, gcc_asm.h, strings.h) — one concept per file, easier to reason about
The INTELLISENSE_DIRECTIVES guard at the top of every header lets IDEs (#pragma once + includes) see the full type graph without requiring the user to include dsl.h in every file. Production builds skip the include
The TSet_ / PtrSet_ / Array_expand macros are a more complete type-builder system: one macro gives you type, type*restrict, type*volatile in one shot
The GTE/GPU encoding layers are fully composable — enc_gte_cmdw(sf, mx, v, cv, lm, cmd) is a flat OR of 6 per-field encoders, each of which is its own named function

hello_gte.c shows usage:

SMemory is the global state struct; static_mem is a single global instance
prim__alloc(type_width, type_name) is the arena-style allocation primitive for the GTE primitive buffer
ent_cube128_init / ent_floor_init are __forceinline initializers that copy baked vertex/face data into the entity's arena slot
Ent_Cube and Ent_Floor are entity structs that embed their data (A8_V3_S2 verts; A6_V4_S2 faces;) — entities are POD, not heap-allocated

1.4 The 11 style observations that matter for chunkification

Distilled from the duffle.h + main.c + pikuma ps1 headers + hello_gte.c reading:

No malloc/free in user code. Everything is arena-allocated. For chunk-based data structures, this means the chunks themselves would be allocated from an FArena (or a chunk-aware variant), and the structure holds a Slice<Chunk> of pointers into the arena.
No classes, no templates, no inheritance. POD structs only. Methods are free functions that take a pointer: void farena_push(FArena* arena, U8 amount, Opt_farena o).
The Slice + Slice_T pair is the data-structure primitive. A chunk-array is probably modeled as Slice<Chunk> where Chunk is a fixed-size T[N].
Pointer discipline is restrict or volatile, never both, never const. This is a hard constraint.
The byte-width convention is rigid. U1/U2/U4/U8 for unsigned, S1/S2/S4/S8 for signed, B1/B2/B4/B8 for byte, F4/F8 for float. int and long are forbidden in user code.
asm volatile + __builtin_* are preferred over library wrappers. No stdatomic.h, no stddef.h for size_t.
The DSL compiles in c23 mode (clang). This means _Generic is available, __builtin_* are stable, and typeof works.
__attribute__((always_inline)) is the default for small hot functions. Hot path code has zero call overhead.
Macros encode intent, not just abbreviation. I_ vs IA_ vs N_ is meaningful; I_proc was specifically removed in the duffle.h because the user found it harder to read than just writing inline functions.
Entities are POD structs with embedded data. No handles, no IDs, no virtual dispatch.
X-macros are the pattern for data-driven code. Tag_Entries() defines the table; #define X(n, s, c, p) + #undef X lets the same table feed the enum, the colors array, the prefix array, the name array.

1.5 What the style implies for the chunkified data structure

If the user wrote a chunk-based C11 data structure in their style, it would probably look like:

// Likely shape (NOT actually written, this is what their style suggests).
// Elements are assumed to be U8-sized (element_size == 8) to keep the sketch short.
typedef U1 Chunk;                                // chunk memory is addressed as raw bytes
typedef Struct_(ChunkArray_T) {                  // ChunkArray<T>
    Slice         chunks;                          // pointer table: { Chunk** ptr; U8 len; }
    U4            chunk_size;                      // elements per chunk, power-of-2
    U4            element_size;                    // sizeof(T)
    U8            total_used;                      // elements pushed so far
    FArena*       backing;                         // where chunks live
};

// Push: O(1); a new chunk is bump-allocated whenever the last one fills up
I_ U8 chunkarray_push(ChunkArray_T* ca, U8 element) {
    Chunk** table     = (Chunk**)ca->chunks.ptr;
    U8      chunk_idx = ca->total_used >> __builtin_ctz(ca->chunk_size);  // log2 of a power-of-2
    if (chunk_idx >= ca->chunks.len) {
        // grow: add a new chunk (the pointer table is assumed to have a free slot)
        Chunk* new_chunk = farena_push_type(ca->backing, Chunk, ...);
        table[ca->chunks.len] = new_chunk;
        ca->chunks.len += 1;
    }
    U8  offset = ca->total_used & (ca->chunk_size - 1);
    U8* dst    = (U8*)(table[chunk_idx] + offset * ca->element_size);  // byte offset first, then typed
    dst[0] = element;  // copy
    ca->total_used += 1;
    return ca->total_used - 1;
}

// Index: O(1) bitwise (caller guarantees i < total_used)
IA_ U8 chunkarray_at(ChunkArray_T* ca, U8 i) {
    Chunk** table     = (Chunk**)ca->chunks.ptr;
    U8      chunk_idx = i >> __builtin_ctz(ca->chunk_size);
    U8      offset    = i & (ca->chunk_size - 1);
    return ((U8*)(table[chunk_idx] + offset * ca->element_size))[0];
}

This is exactly Reece's Xar pattern (8-byte header, power-of-2 chunks, bitwise divmod), written in Ed's duffle.h style.

The point: the style is consistent with the chunkification optimization. If you wrote this in C11, it would look like duffle.h. There's no impedance mismatch between "the user's preferred C11 style" and "the chunk-idea C11 implementation."

The impedance is between any C11 chunk-array and the Python runtime, regardless of style. That's Part 2.
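For comparison, the same shift-and-mask indexing can be modeled in pure Python. This is a minimal sketch with hypothetical names (`ChunkArray`, `shift`, `mask`); it says nothing about performance, only about the shape of the data structure.

```python
class ChunkArray:
    """Pure-Python model of the chunk-array sketch: a table of fixed-size chunks."""
    def __init__(self, chunk_size: int = 1024):
        if chunk_size <= 0 or chunk_size & (chunk_size - 1):
            raise ValueError("chunk_size must be a power of two")
        self.chunk_size = chunk_size
        self.shift = chunk_size.bit_length() - 1   # log2(chunk_size)
        self.mask = chunk_size - 1
        self.chunks = []                           # chunk pointer table
        self.total_used = 0

    def push(self, element) -> int:
        chunk_idx = self.total_used >> self.shift
        if chunk_idx >= len(self.chunks):
            # grow: add one chunk; existing chunks never move
            self.chunks.append([None] * self.chunk_size)
        self.chunks[chunk_idx][self.total_used & self.mask] = element
        self.total_used += 1
        return self.total_used - 1

    def at(self, i: int):
        if not 0 <= i < self.total_used:
            raise IndexError("chunk array index out of range")
        return self.chunks[i >> self.shift][i & self.mask]

    def __len__(self) -> int:
        return self.total_used

ca = ChunkArray(chunk_size=4)
for n in range(10):
    ca.push(n * n)
assert ca.at(9) == 81
assert len(ca.chunks) == 3      # 10 elements in chunks of 4
```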

PART 2 — Interop Design Space (the actual question)

2.1 What "interop" actually means in this context

The question isn't "can Python call C11?" — that's a solved problem with multiple working answers (ctypes, cffi, pybind11, Cython, custom CPython module, etc.). The question is more specific:

Can a Python user-space program actually exploit a chunk-based C11 data structure as if it were a "lego set" of composable pieces — where the user picks which chunk operations to run, in which order, with custom callbacks for filter/map/reduce — without paying the FFI overhead per element?

The user's skepticism is well-founded. The standard FFI answers have specific impedance-mismatch properties:

2.2 The 5 candidate interop layers, honestly assessed

2.2.1 ctypes (Python stdlib)

What it is: load a .dll / .so and call C functions via FFI. No compile step. Structs, arrays, pointers, callbacks all work.

Pros for chunkification:

Zero build-time cost — ctypes.CDLL("./libchunks.so") and you're in
Structure + Array classes map naturally to a ChunkArray header + Chunk* array
POINTER(c_uint64) can wrap the chunk pointer, indexed like a Python list
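The GIL is released around each foreign call made through CDLL, so other Python threads keep running (thread safety of the C code itself is still the author's job)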

Cons for chunkification:

Per-call overhead is ~1-5 microseconds. A chunkarray_at(arr, i) round trip is 1 µs of FFI overhead. A 10,000-element loop is 10ms. Python's native list iteration is ~50ns/element, so ctypes is ~20-100x slower for tight loops.
No inlining. The "lego set" pattern requires the user to compose operations (filter + map + reduce over chunks). With ctypes, each operation is a separate FFI call, so composition costs O(N) FFI round trips.
Type coercion is shallow. A simple restype such as c_uint64 comes back as a Python int, but pointers, nested structs, and arrays come back as ctypes objects that must be dereferenced from Python, one access at a time.
No SIMD/AVX exposure. The user could write the C11 to use AVX, but ctypes sees only the C function signature.

Verdict for chunkification: Tractable but defeats the purpose. If the use case is "process a 100K-element chunk-array in a hot loop," ctypes is wrong. If the use case is "occasionally bulk-load or bulk-dump a chunk-array and do the rest in Python," ctypes is fine.
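The per-call figures above are estimates, so a small harness for measuring them on the target machine may be useful. This sketch assumes CPython and uses `ctypes.pythonapi` (the interpreter itself, loaded as a library) so that no compiled .so is needed; `Py_IsInitialized` stands in for a trivial function like chunkarray_at. Note that `pythonapi` keeps the GIL held, so the numbers can differ slightly from a CDLL call.

```python
import ctypes
import timeit

# Py_IsInitialized() does almost no work, so the measured time is mostly call overhead.
c_func = ctypes.pythonapi.Py_IsInitialized
c_func.argtypes = []
c_func.restype = ctypes.c_int

data = list(range(10_000))

def ffi_loop(n: int = 10_000) -> int:
    total = 0
    for _ in range(n):
        total += c_func()          # one foreign call per element
    return total

def native_loop() -> int:
    total = 0
    for _ in data:                 # plain list iteration
        total += 1
    return total

def per_element_ns(fn, n: int = 10_000, repeat: int = 5) -> float:
    """Best-of-`repeat` wall time for one call of fn, divided by n elements."""
    best = min(timeit.repeat(fn, number=1, repeat=repeat))
    return best / n * 1e9

if __name__ == "__main__":
    print(f"ctypes call : {per_element_ns(ffi_loop):8.1f} ns/element")
    print(f"list loop   : {per_element_ns(native_loop):8.1f} ns/element")
```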

Style fit with duffle.h: low. ctypes would require the user to write Python-side struct definitions that mirror the C struct layout. The duffle.h Struct_(ChunkArray_T) { Slice chunks; U4 chunk_size; U4 element_size; U8 total_used; FArena* backing; } would become:

import ctypes
from ctypes import c_uint32, c_uint64, c_void_p

class Slice(ctypes.Structure):           # mirrors the untyped { U8 ptr, len; }
    _fields_ = [("ptr", c_uint64), ("len", c_uint64)]

class ChunkArray_T(ctypes.Structure):
    _fields_ = [
        ("chunks", Slice),
        ("chunk_size", c_uint32),
        ("element_size", c_uint32),
        ("total_used", c_uint64),
        ("backing", c_void_p),           # FArena*
    ]

That's 2x the code on the Python side, and you have to keep the two in sync. The user's unorthodox Slice + Struct_ macros would have to be unwound into a C-friendly layout.
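One way to catch the two definitions drifting apart is to compare field offsets at import time. The sketch below is an assumption-laden illustration: `EXPECTED_LAYOUT` is hypothetical and would in practice be generated from the C side (sizeof/offsetof), and the struct includes the `backing` pointer from the §1.5 sketch.

```python
import ctypes
from ctypes import c_uint32, c_uint64, c_void_p

class Slice(ctypes.Structure):
    # mirrors the untyped duffle.h Slice: { U8 ptr, len; }
    _fields_ = [("ptr", c_uint64), ("len", c_uint64)]

class ChunkArray_T(ctypes.Structure):
    _fields_ = [
        ("chunks", Slice),
        ("chunk_size", c_uint32),
        ("element_size", c_uint32),
        ("total_used", c_uint64),
        ("backing", c_void_p),      # FArena*
    ]

# Hypothetical values: in practice these would be emitted by the C build.
EXPECTED_LAYOUT = {"chunks": 0, "chunk_size": 16, "element_size": 20,
                   "total_used": 24, "backing": 32}

def layout_mismatches(struct, expected):
    """Return the field names whose Python-side offset differs from the C side."""
    return [name for name, offset in expected.items()
            if getattr(struct, name).offset != offset]

assert layout_mismatches(ChunkArray_T, EXPECTED_LAYOUT) == []
```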

2.2.2 cffi (PyPy / CPython, third-party)

What it is: write C declarations in a Python string, cffi compiles them and gives you ABI-stable handles.

Pros over ctypes:

C-level type declarations are the source of truth (not Python-side mirroring)
ABI mode vs API mode: ABI is like ctypes (no compile); API mode compiles a Python extension module
More Pythonic: from cffi import FFI; ffi = FFI(); ffi.cdef("..."); lib = ffi.dlopen("./libchunks.so")

Cons for chunkification: same as ctypes for the per-call overhead in ABI mode. ABI mode also parses the C declarations at import time, which is a real cost on cold start; API mode moves that work into a compile step.

Verdict for chunkification: same as ctypes — tractable but defeats the purpose for hot loops.

Style fit with duffle.h: low-medium. cffi is more idiomatic for the C-decl-as-source-of-truth, but you still pay the FFI cost.

2.2.3 pybind11 (C++ heavy)

What it is: C++ header-only library that generates Python bindings from C++ type signatures. Requires the C++ compiler.

Pros for chunkification:

Type-safe bindings
STL containers (vector, array) have automatic conversions to Python list / numpy array
py::buffer_info lets you expose raw memory as a NumPy array (zero-copy)

Cons for chunkification:

C++ is not the user's style. The user writes pure C11 with macros. pybind11 is C++-only.
pybind11's STL conversions don't fit the duffle.h Slice / FArena model. You'd be writing the C++ adapter layer, not the C11 chunk-array.
The "pybind11 generates bindings" claim is misleading for non-trivial types — you write glue code, and for an FArena-backed chunk array, the glue is more code than the C11 implementation.

Verdict for chunkification: not a fit. Style mismatch is fatal here.

2.2.4 Custom CPython C extension (CPython C API)

What it is: write a real CPython extension module using <Python.h>. You get a Python-importable module that wraps the C11 code directly.

Pros for chunkification:

Zero FFI overhead for tightly-coupled code. Once the module is loaded, import chunks; chunks.push(arr, val) is a normal C function call with refcount discipline, ~50ns/element.
The C API is C-compatible (C11 or later), so the duffle.h macros can be used directly inside the extension module
The user controls the module surface — can expose ChunkArray.push, .at, .chunk_count, .chunk_size, .arena_capacity etc.
Generator/coroutine support (__iter__ over chunks) is straightforward in C
Can release the GIL for long-running pure-C operations

Cons for chunkification:

Refcount discipline is manual. The user must Py_INCREF / Py_DECREF correctly. The duffle.h style doesn't have a notion of refcounting (everything is arena-owned). A new discipline is needed at the Python boundary.
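Must compile. Build the .pyd/.so, ensure it's on sys.path, deal with Python version compatibility (3.11 ABI tag, etc.). The user's Manual Slop project uses uv; since uv's own build backend only handles pure-Python packages, this means a [build-system] entry in pyproject.toml naming a backend that can compile C (setuptools, for example).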
CPython-specific. PyPy / GraalPy / RustPython don't all support the C API the same way. For a tool that's CPython-only (Manual Slop is), this is fine, but it's a lock-in.
GIL. Free-threaded Python (PEP 703) is shipping; chunk-array code that releases the GIL has to be careful about which Python objects it touches.

Verdict for chunkification: Most tractable option. The custom C extension model lets the user write the chunk-array in their preferred C11 style (duffle.h compatible), wrap it with a small Python-facing layer (refcount-aware), and ship it as a real importable module. Build cost is one-time.

Style fit with duffle.h: high. The C11 code is C11. The Python-facing layer is a thin PyTypeObject / PyMethodDef table at the bottom of the file. The duffle.h macros can be used inside the extension module without modification.

Sketch (not actually written — for the design conversation):

// chunks_module.c
#include <Python.h>
#include "duffle.amd64.win32.h"   // user's existing style

typedef Struct_(ChunkArray) {
    Slice  chunks;        // pointer table: { Chunk** ptr; U8 len; }
    U4     chunk_size;    // power-of-2
    U4     element_size;
    U8     total_used;
    FArena backing_arena;
};

// Python-side object that owns a pointer to the C struct.
// Its PyTypeObject (alloc / dealloc / init) is omitted from this sketch.
typedef struct {
    PyObject_HEAD
    ChunkArray* c_arr;
} ChunkArrayObject;

static PyObject* chunka_push(PyObject* self, PyObject* args) {
    PyObject* py_arr;
    U8        value;
    // "K" parses an unsigned 64-bit integer without overflow checking.
    // A real module would use "O!" with the type object to reject other types.
    if (!PyArg_ParseTuple(args, "OK", &py_arr, &value)) return NULL;
    ChunkArray* arr = ((ChunkArrayObject*)py_arr)->c_arr;
    U8 idx = chunkarray_push(arr, value);
    return PyLong_FromUnsignedLongLong(idx);
}

static PyObject* chunka_at(PyObject* self, PyObject* args) {
    PyObject* py_arr; U8 i;
    if (!PyArg_ParseTuple(args, "OK", &py_arr, &i)) return NULL;
    ChunkArray* arr = ((ChunkArrayObject*)py_arr)->c_arr;
    if (i >= arr->total_used) {   // never read past the last pushed element
        PyErr_SetString(PyExc_IndexError, "chunk array index out of range");
        return NULL;
    }
    U8 val = chunkarray_at(arr, i);
    return PyLong_FromUnsignedLongLong(val);
}

static PyMethodDef ChunkArrayMethods[] = {
    {"push", chunka_push, METH_VARARGS, "Append an element, return its index"},
    {"at",   chunka_at,   METH_VARARGS, "Random access by index"},
    {NULL, NULL, 0, NULL}   // sentinel
};

static struct PyModuleDef chunkmodule = {
    PyModuleDef_HEAD_INIT, "chunks", NULL, -1, ChunkArrayMethods
};

PyMODINIT_FUNC PyInit_chunks(void) {
    return PyModule_Create(&chunkmodule);
}

This is ~80 lines of glue for a fully-functional module. The actual chunkarray_push and chunkarray_at are duffle.h-style C11.

2.2.5 NumPy + custom C API (`PyArrayInterface`)

What it is: NumPy has a C API (<numpy/arrayobject.h>) that lets C extensions allocate and manipulate ndarray objects. The C extension holds the actual memory, and NumPy wraps it as an array with zero copy.

Pros for chunkification:

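If the chunk-array's elements sit in one contiguous block (always true within a single chunk, and true across chunks only when the arena hands them out back-to-back), NumPy can wrap that block as an ndarray with zero copy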
The user can then do np.sum(chunks), chunks[1000:2000], chunks[chunks > threshold] in NumPy land — all the vectorized ops for free
For batch operations (load 10K elements, do something to all of them, write back), NumPy is the right level of abstraction
Most Manual Slop hot-path code (text processing, JSON-L serialization, list-mutation) can be re-expressed as NumPy operations

Cons for chunkification:

NumPy semantics are flat 1D/2D/ND arrays, not chunk-aware. The "lego set" pattern (iterate over chunks, custom callback per chunk) is not a first-class NumPy concept.
The C API requires linking against NumPy's headers and ABI version compatibility
NumPy's array protocol is strongly typed (dtype); chunk-array-of-mixed-type is not a fit
For a chunk-array that needs to be both chunk-aware (user iterates chunks) and element-wise (NumPy ops on the flat view), you'd need a custom NumPy dtype with chunk-aware accessors — possible but not trivial

Verdict for chunkification: orthogonal. NumPy is a great consumer of a chunk-array (zero-copy wrap), but not a great driver (you can't easily express chunk-aware iteration in NumPy). The combination is: write the chunk-array in C11, expose a NumPy-compatible 1D view, let NumPy do batch ops when appropriate, do chunk-aware iteration in C.
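The "good consumer, poor driver" split can be illustrated without NumPy, using the standard-library buffer protocol as a stand-in for an ndarray view. This is a swapped-in technique, not the NumPy C API: `array` objects play the role of C-owned chunks and `memoryview` plays the role of the zero-copy wrap. All names are hypothetical.

```python
from array import array

CHUNK_SIZE = 4  # elements per chunk (hypothetical, tiny for illustration)

# Two separately allocated chunks of unsigned 64-bit integers ('Q' typecode).
chunks = [array("Q", [0] * CHUNK_SIZE) for _ in range(2)]

def chunk_views(chunk_list):
    """Zero-copy: each view aliases one chunk's memory via the buffer protocol."""
    return [memoryview(c) for c in chunk_list]

def flat_copy(chunk_list):
    """A flat 1D sequence across chunks needs a copy, since chunks are not adjacent."""
    flat = array("Q")
    for c in chunk_list:
        flat.extend(c)
    return flat

views = chunk_views(chunks)
views[1][2] = 99                 # write through the view...
assert chunks[1][2] == 99        # ...and the chunk itself changes (no copy)

flat = flat_copy(chunks)
flat[0] = 7                      # writing to the flat copy...
assert chunks[0][0] == 0         # ...leaves the chunks untouched
```

Per-chunk views are free; a single flat view over all chunks is only free when the chunks happen to be adjacent in memory.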

Style fit with duffle.h: medium. NumPy's C API doesn't conflict with duffle.h, but the PyArrayObject types are intrusive. You'd write an adapter layer that converts between Slice<U8> (raw bytes) and PyArrayObject (typed ndarray).

2.3 The honest assessment matrix

For the actual question — "can a Python user-space program fully exploit a C11 chunk-based data structure lego-set?" — here's what the design space looks like:

Approach	Build cost	Per-op overhead	Style fit	Lego-set pattern support	Verdict
ctypes	0	~1-5 µs/call	low	low (each op = FFI call)	Tractable but defeats the purpose
cffi ABI mode	0	~1-5 µs/call	low-medium	low	Same as ctypes
cffi API mode	1x (compile)	~50ns/call	medium	medium	Good middle ground
pybind11	1x (compile)	~50ns/call	very low (C++)	medium	Style mismatch — not a fit
CPython C ext	1x (compile)	~50ns/call	high (C11)	high (full C API)	Most tractable
NumPy wrap	1x (compile)	~50ns/call	medium	low (flat view)	Orthogonal — good for batch, not lego-set
HPy / PyO3 / nanobind	1x (compile)	~50ns/call	low (Rust/C++/new API)	medium	Better than pybind11 but still style-mismatched

The recommendation:

For the lego-set (chunk-aware user-driven iteration): custom CPython C extension is the most tractable. The duffle.h style is C11; the C extension wrapping is ~80 lines of glue per chunk-array class; per-element overhead is the same as native Python (~50ns).

For batch operations on a chunk-array: NumPy wrap is the most tractable. Expose the chunk-array's memory as a 1D ndarray, let NumPy do the work. Zero-copy, vectorized, free.

For occasional FFI from Python: ctypes is fine. Load the lib, call the function, get the result. Don't try to do hot loops this way.

2.4 What "a chunked C11 package that interops with Python" actually requires

If the user wants to build this, the minimum viable product is:

The chunk-array C11 code (duffle.h style, ~200-400 lines)
- ChunkArray_T struct
- chunkarray_push, chunkarray_at, chunkarray_grow, chunkarray_iter_chunks
- Backing is an FArena for chunk memory + a Slice<Chunk*> for the chunk pointer table
A CPython C extension wrapper (~80-150 lines)
- PyTypeObject for ChunkArrayObject (wraps the C struct)
- __init__ (creates the C struct from Python args: chunk_size, element_size, initial_capacity)
- __len__ (returns total_used)
- __getitem__ / __setitem__ (calls chunkarray_at / in-place write)
- __iter__ (yields elements one at a time; can be optimized to yield per-chunk for the lego-set pattern)
- push(value) method
- chunks() method (yields per-chunk ndarray views for the NumPy interop path)
- arena_capacity, chunk_count, chunk_size read-only properties
A build step in pyproject.toml (one-time cost, ~5 lines)
- [build-system] config naming a backend that can compile C extensions (uv's own build backend is pure-Python only)
- Build the .pyd/.so for the current Python version
- Wheels for distribution (optional, build for arm64 + x86_64 + win32 + linux)
Tests in tests/test_chunka_c11.py (~100-300 lines)
- TDD-style: write tests in Python first, then write the C, then verify
- Grow pattern tests, random access tests, edge cases (empty, full, resize)
- NumPy interop test: ensure np.asarray(chunks) is zero-copy (plain np.array(chunks) copies by default)
- Comparison test: chunk-array must beat list.append for the relevant N
A chunks/__init__.py Python wrapper (~30-50 lines, optional but recommended)
- High-level API: ChunkArray(chunk_size=1024, element_size=8), .push(x), .at(i), .numpy()
- Type hints for IDE support
- This is the only Python code; everything else is C

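Total: roughly 400-900 lines by the estimates above: ~280-550 lines of C (chunk-array plus extension wrapper), ~130-350 lines of Python (tests plus wrapper), plus a few lines of build config.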

2.5 The honest tractable-vs-not answer

Tractable:

Writing a chunk-array in C11 duffle.h style: trivially tractable (Reece's Xar is the reference impl, ~200 lines)
Wrapping it as a CPython C extension: tractable (~150 lines of glue)
Per-element overhead matching native Python: yes (50ns vs 50ns, no FFI tax)
NumPy interop via zero-copy ndarray wrap: tractable (NumPy's C API is well-documented)
Build + distribution via uv + pyproject.toml: tractable (one-time setup, well-trodden path)

Not tractable (or not worth the cost):

Letting the user arbitrarily compose C11 chunk operations from Python at the lego-set level: not tractable without compiling Python → C11 on the fly. ctypes/cffi/pybind11 are all per-call; you'd need a C-subset JIT (like the user's forth_bootslop does for stack machine bytecode) to compose C11 ops in Python. That's a different track.
Having Python extend the chunk-array with user-defined per-element callbacks (like list(map(fn, arr))) that run at C speed: not tractable. Cython can compile Python-ish syntax to C, but the duffle.h style doesn't fit Cython's type system. The workaround is to ship pre-baked operations (push, at, iter_chunks, filter_chunks(fn_ptr)) and let users choose from those, not define new ones in Python.
Making the chunk-array cross-implementation (CPython + PyPy + RustPython): not tractable with the C extension approach. Use HPy (a newer Python C API targeting multiple implementations) if this matters; HPy has its own style and would need an adapter.

The "numpy DSL" the user mentioned: the closest analog is Cython's typed memoryviews or NumPy's ndarray protocol — both give you "Python can see a chunk of C memory and operate on it efficiently." Neither is a literal DSL; both are ABI/protocol layers. If the user wants a Python-side DSL for composing chunk operations, that's a separate design problem (Cython-like compile-to-C, or a small Python AST → C11 emitter).
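The "Python can see a chunk of C memory" idea rests on the buffer protocol, which can be demonstrated with the standard library alone. In this sketch an `array.array` stands in for one C-owned chunk (an assumption; the real chunk would live in the extension), and `memoryview` plays the role that `np.frombuffer` or a typed memoryview would play as the consumer.

```python
from array import array

chunk = array("q", [10, 20, 30, 40])  # stands in for one C-owned chunk

view = memoryview(chunk)   # zero-copy: exports the chunk's own buffer
copy = array("q", chunk)   # explicit copy, for contrast

view[1] = 99                      # write through the view
shares_memory = chunk[1] == 99    # the view and the chunk are the same bytes
copy_unchanged = copy[1] == 20    # the copy has its own storage

# While a view is exported, the underlying array cannot be resized.
try:
    chunk.append(50)
    resize_blocked = False
except BufferError:
    resize_blocked = True

view.release()
chunk.append(50)  # allowed again once the view has been released
```

The resize restriction is one reason fixed-capacity chunks pair well with zero-copy views: a full chunk never needs to grow, so a view onto it can stay valid.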

2.6 The recommended path forward for chunkification_optimization

Don't start with C11. Start with pure Python chunkification of the target (the comms.log ring buffer in app_controller.py:716). Verify:

The chunk pattern delivers a measurable speedup
The API is ergonomic from Python
The thread-safety story is correct
The serialize/deserialize path still works
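The first check, a measurable speedup, can start from a small timing harness like the one below. It is a sketch under assumptions: the workload is a plain append of `N` integers (the real comms.log access pattern may differ), and `fill_list`, `fill_chunked`, and `best_seconds` are hypothetical names. No outcome is assumed; the point is to measure before committing to C.

```python
import timeit
from array import array


def fill_list(n):
    """Baseline: append n integers to a plain list."""
    out = []
    for i in range(n):
        out.append(i)
    return out


def fill_chunked(n, chunk_size=1024):
    """Candidate: append n integers into fixed-size chunks."""
    chunks = [array("q")]
    for i in range(n):
        if len(chunks[-1]) == chunk_size:
            chunks.append(array("q"))
        chunks[-1].append(i)
    return chunks


def best_seconds(fn, n, repeat=5):
    """Best-of-`repeat` wall time for one call of fn(n)."""
    return min(timeit.repeat(lambda: fn(n), number=1, repeat=repeat))


N = 100_000
results = {
    "list.append": best_seconds(fill_list, N),
    "chunked": best_seconds(fill_chunked, N),
}
for name, secs in results.items():
    print(f"{name:12s} {secs * 1e3:8.2f} ms for N={N}")
```

Swapping `fill_chunked` for the real comms.log write path, and `N` for a realistic log volume, turns this into the comparison the track needs.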

Then, if the user wants the C11 lego-set:

Build the duffle.h-style C11 chunk-array (one type, ~200 lines)
Build the CPython C extension wrapper (~150 lines of glue)
Build the NumPy-compatible 1D view (lets existing Python code consume the chunk-array)
Optional: add a few pre-baked chunk-aware operations (filter_chunks, map_chunks, reduce_chunks) in C, exposed as Python methods
Optional: build a "lego-set" Python API that lets users compose pre-baked operations without writing C
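The last two optional steps can be prototyped in Python to settle the API shape before the operations move to C. In this sketch the operation names follow the list above, while the registries (`PREDICATES`, `KERNELS`, `REDUCERS`) and `run_pipeline` are hypothetical: users pick pre-baked pieces by name and never supply their own callback, which mirrors the constraint that callbacks would be C function pointers.

```python
import operator
from array import array

# Pre-baked pieces, selected by name. In the C version these would be
# function pointers compiled into the extension.
PREDICATES = {"positive": lambda x: x > 0, "even": lambda x: x % 2 == 0}
KERNELS = {"double": lambda x: x * 2, "negate": lambda x: -x}
REDUCERS = {"sum": (operator.add, 0)}


def filter_chunks(chunks, predicate_name):
    pred = PREDICATES[predicate_name]
    return [array(c.typecode, (x for x in c if pred(x))) for c in chunks]


def map_chunks(chunks, kernel_name):
    kernel = KERNELS[kernel_name]
    return [array(c.typecode, (kernel(x) for x in c)) for c in chunks]


def reduce_chunks(chunks, reducer_name):
    fn, acc = REDUCERS[reducer_name]
    for c in chunks:
        for x in c:
            acc = fn(acc, x)
    return acc


def run_pipeline(chunks, steps):
    """Apply (operation, argument) pairs in order, one whole pass per step."""
    ops = {"filter": filter_chunks, "map": map_chunks}
    for op_name, arg_name in steps:
        chunks = ops[op_name](chunks, arg_name)
    return chunks


chunks = [array("q", [1, -2, 3]), array("q", [4, -5, 6])]
out = run_pipeline(chunks, [("filter", "positive"), ("map", "double")])
total = reduce_chunks(out, "sum")
```

Composition here happens per pass over all chunks, so each step pays one boundary crossing per chunk rather than one per element.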

Defer the "Python-defined chunk-aware callback" goal — it's the most ambitious, requires either Cython or a custom AST emitter, and is not clearly worth the complexity for a single project.

2.7 The 5 questions to ask the user (before this becomes a track)

These map directly to the design decisions in §2.3-§2.6:

Build cost acceptable? A custom C extension needs roughly half a day of one-time build setup (pyproject.toml, compiler config, wheel build).
Per-element overhead target? Native (~50ns) requires the C extension. ctypes is ~1-5µs (20-100x slower). What's the SLA?
NumPy interop required? If yes, the C extension must expose the underlying memory as a 1D ndarray view (one-time setup).
Cross-implementation? CPython only? Or HPy for CPython+PyPy? Big style difference.
Lego-set composition in Python? Pre-baked ops (push, at, iter_chunks, filter_chunks) is tractable. User-defined Python→C11 callbacks is not (without Cython or a custom AST emitter).

2.8 The crucial insight

The user said: "the way I would define the C11 package or interop stuff would be unorthodox and would follow a similar pattern to what you would find in either my forth_bootslop repo or my pikuma ps1 repo."

Reading both repos carefully (and the user's correction that they're "not really an interop pattern, I just wanted to show how I like todo C11"), the implication is:

The user is comfortable with a single C11 .h file as the entire interop boundary
The user is not going to write a complex pybind11 C++ layer or a Cython .pyx file
The user is comfortable with a thin CPython C extension if the C11 code stays in their style

The most likely path the user would actually take, given their style and their "lots of ambiguities" caveat:

Write the chunk-array in duffle.h style as a single header
Wrap it with a small PyTypeObject block at the bottom of the same file (or a separate chunks_module.c that includes the header)
Build it with uv + pyproject.toml
Import it from Manual Slop and verify the speedup on comms.log

That's tractable. The "lego set of composable Python-driven chunk operations" is a stretch goal that requires more design work, and probably isn't needed for the comms.log target.

3. The non-recommendations

Don't do any of these:

pybind11. Style mismatch. C++ is not the user's idiom.
Cython. The user writes pure C11 with macros. Cython is Python-with-C-type-annotations. Style mismatch.
Rust + PyO3. The user writes C, not Rust. PyO3 is great for Rust shops, not relevant here.
HPy. Cross-implementation matters less than style fit. Revisit if PyPy becomes a target.
Pure Python implementation of the lego-set pattern. Defeats the point. If you're not crossing the FFI boundary, you don't need C11.

4. Summary verdict

| The user's question | The honest answer |
| --- | --- |
| Can chunk-based C11 interop with Python? | Yes, via custom CPython C extension. ~150 lines of glue per chunk-array type. |
| Is it worth the cost? | Depends on the use case. For `comms.log`, the C extension is tractable. For "compose arbitrary C11 ops from Python," it's not (needs a Python→C emitter). |
| What does the lego-set pattern look like? | Pre-baked C operations exposed as Python methods (push, at, iter_chunks, filter_chunks). User-defined per-element Python callbacks running at C speed is not tractable. |
| What about numpy? | NumPy can zero-copy wrap the chunk-array as a 1D ndarray. Best for batch ops, not chunk-aware iteration. |
| What's the build cost? | One-time ~half-day (uv + pyproject.toml + C extension). Wheels for distribution optional. |
| What about HPy / cross-impl? | Not needed unless PyPy becomes a target. Stick with CPython C API. |
| What's the style fit with duffle.h? | High. The chunk-array is written in duffle.h style; the C extension wrapper is a thin `PyTypeObject` block at the bottom of the file. |

Recommended action:

Verify the chunk pattern delivers value first. Pure-Python chunkification of comms.log (or another target), measure, confirm.
If C11 is desired, build the C extension in duffle.h style. ~500 lines total (200 C array + 150 glue + 100 tests + 50 Python wrapper).
If NumPy is the consumer, expose the 1D view. One-time, ~20 lines of NumPy C API glue.
Defer the "user-defined Python→C11 callback" goal unless a specific use case demands it.

End of assessment. The track chunkification_optimization_20260608_PLACEHOLDER is now scoped: tractable as a CPython C extension, scope = chunk-array in duffle.h style + thin C extension wrapper + NumPy 1D view. Not tractable as a lego-set of user-defined chunk operations. Next step is to confirm the use case (which target, which SLA) and pick the build path.

Cross-references for re-anchoring: docs/reports/session_synthesis_20260608.md §8.2 (the original proposal), docs/ideation/ed_chunk_data_structures_20260523.md (the user's chunk-ideation), docs/transcripts/i-h95QIGchY_assuming_as_much_as_possible_andrewreece.txt §56:42 (Reece's Xar reference impl).

C11 ↔ Python Interop Assessment — 2026-06-08

0. The user-correction that reshaped the question

PART 1 — C11 Style Reference (what your duffle.h + pikuma ps1 show)

1.1 The duffle.h "DSL" (forth_bootslop/attempt_1/duffle.amd64.win32.h, 727 lines)

1.1.1 Byte-width typedef convention (mandatory, used everywhere)

1.1.2 Macro meta-DSL (the "duffle" layer)

1.1.3 Inline / always-inline / no-inline discipline

1.1.4 The r/v discipline (restrict / volatile, and nothing else)

1.1.5 Slice as the core compound type

1.1.6 The FArena (the chunk-adjacent data structure)

1.1.7 Memory-barrier and atomic primitives (asm volatile)

1.1.8 Control-flow and defer discipline

1.1.9 The KTL (Key Table Linear) — a small key-value table

1.2 The duffle.h ↔ main.c interface (forth_bootslop/attempt_1/main.c, 1426 lines)

1.3 The Pikuma ps1 duffle/ (Pikuma/ps1/code/duffle/*, the more recent style)

1.4 The 11 style observations that matter for chunkification

1.5 What the style implies for the chunkified data structure

PART 2 — Interop Design Space (the actual question)

2.1 What "interop" actually means in this context

2.2 The 5 candidate interop layers, honestly assessed

2.2.1 ctypes (Python stdlib)

2.2.2 cffi (PyPy / CPython, third-party)

2.2.3 pybind11 (C++ heavy)

2.2.4 Custom CPython C extension (CPython C API)

2.2.5 NumPy + custom C API (PyArray_Interface)

2.3 The honest assessment matrix

2.4 What "a chunked C11 package that interops with Python" actually requires

2.5 The honest tractable-vs-not answer

2.6 The recommended path forward for chunkification_optimization

2.7 The 5 questions to ask the user (before this becomes a track)

2.8 The crucial insight

3. The non-recommendations

4. Summary verdict
