docs(interop-assessment): C11 <-> Python interop design space for chunkification_optimization
The user asked a sharp, skeptical question: can a chunk-based C11 data structure actually interop with Python's runtime in a way that's useful for Manual Slop? They explicitly corrected my first-draft framing (the duffle.h + pikuma ps1 files are a C11 *style reference*, not an interop pattern). The assessment investigates honestly and reports tractable-vs-not.

docs/reports/c11_python_interop_assessment_20260608.md (564 lines, 38KB):

Part 1: C11 style reference summary
- 11 style observations from reading duffle.h + main.c + pikuma ps1 duffle/ + hello_gte.c end-to-end
- Byte-width typedef convention (U1/U2/U4/U8, S1/S2/S4/S8, B1-B8, F4/F8)
- The macro meta-DSL (Struct_/Enum_/Array_/Slice_/Opt_/Ret_)
- The I_/IA_/N_ inline discipline
- The r/v pointer rule (restrict OR volatile, never both, never const)
- Slice + Slice_T as the data-structure primitive
- FArena as the allocation primitive (single-buffer, NOT chunked)
- defer/defer_rewind/scope as the cleanup primitive
- KTL (linear key-value table) as the "assume small N" pattern
- What a chunk-array in duffle.h style would look like

Part 2: Interop design space (the actual question)
- 5 candidate interop layers: ctypes, cffi, pybind11, custom CPython C extension, NumPy wrap
- Honest assessment matrix: build cost, per-op overhead, style fit, lego-set pattern support
- Verdict: custom CPython C extension is most tractable; pybind11 is style-mismatched; ctypes/cffi work for non-hot-path
- What "MVP chunked C11 package" requires (~500-1000 LOC total)
- 5 questions to ask the user before this becomes a track
- Crucial insight: the user's "unorthodox" interop is most likely duffle.h-style C11 + thin PyTypeObject glue at the bottom of the same .h file. Tractable, style-fit high.

Cross-references the 5 sources:
- docs/transcripts/i-h95QIGchY (Reece's Xar reference impl)
- docs/ideation/ed_chunk_data_structures_20260523.md
- docs/reports/session_synthesis_20260608.md (the original proposal)
- src/app_controller.py:716 (the comms.log target)
- The user's local forth_bootslop + pikuma ps1 repos (read in full)

This is a follow-on to the synthesis's 2 proposed tracks (manual_ux_validation_20260608_PLACEHOLDER + chunkification_optimization_20260608_PLACEHOLDER). The user's question resolved the "skeptical of #2" concern by scoping the tractable path: CPython C extension in duffle.h style. The "lego-set of user-defined Python->C11 chunk ops" is NOT tractable without a Python->C11 AST emitter, which is a different (much larger) track.
@@ -0,0 +1,564 @@
|
||||
# C11 ↔ Python Interop Assessment — 2026-06-08
|
||||
|
||||
**Question source:** end-of-session user clarification on the proposed `chunkification_optimization_20260608_PLACEHOLDER` track.
|
||||
**Author:** Tier 1 Orchestrator (synthesis + technical assessment)
|
||||
**Date:** 2026-06-08
|
||||
**Status:** Honest tractable-vs-not verdict, no code proposed
|
||||
**Cross-references:** `docs/reports/session_synthesis_20260608.md` §8.2, `docs/ideation/ed_chunk_data_structures_20260523.md`, `docs/transcripts/i-h95QIGchY_assuming_as_much_as_possible_andrewreece.txt` §56:42
|
||||
|
||||
---
|
||||
|
||||
## 0. The user-correction that reshaped the question
|
||||
|
||||
**First framing (mine, in `proposed_new_tracks_20260608.md`):** "Manual Slop's `comms.log` could be replaced by a C11 chunk-based data structure, with Python user-space interop via numpy/ctypes/etc."
|
||||
|
||||
**User's clarification:** "it's not really an interop pattern, I just wanted to show how I like todo C11."
|
||||
|
||||
**What changed:** the C11 codebases I was pointed to (`forth_bootslop/attempt_1/duffle.amd64.win32.h` + `main.c`, and `Pikuma/ps1/code/duffle/*` + `gte_hello/`) are **style references** — they show what C11 looks like when *Ed* writes it. They do not contain a Python interop layer, and weren't meant to be read as one. The "interop design space" question is a *separate* open question, and the user explicitly said "lots of ambiguities."
|
||||
|
||||
This document is split into two parts that should not be conflated:
|
||||
- **Part 1** — the C11 style reference (what the duffle.h + pikuma ps1 headers show)
|
||||
- **Part 2** — the interop design space (the actual question the user is asking, with honest tractable-vs-not assessment)
|
||||
|
||||
---
|
||||
|
||||
# PART 1 — C11 Style Reference (what your duffle.h + pikuma ps1 show)
|
||||
|
||||
## 1.1 The duffle.h "DSL" (forth_bootslop/attempt_1/duffle.amd64.win32.h, 727 lines)
|
||||
|
||||
A single-header file that defines a **C DSL** in pure macros + inline functions. Compiled with `clang` in c23 mode. Target: amd64 + Windows 11. Zero external dependencies (the only `#pragma comment(lib, ...)` lines are to `Kernel32`/`User32`/`Gdi32`/`Advapi32`).
|
||||
|
||||
The core conventions:
|
||||
|
||||
### 1.1.1 Byte-width typedef convention (mandatory, used everywhere)
|
||||
|
||||
```c
typedef __UINT8_TYPE__ U1; typedef __UINT16_TYPE__ U2; typedef __UINT32_TYPE__ U4; typedef __UINT64_TYPE__ U8;
typedef __INT8_TYPE__ S1; typedef __INT16_TYPE__ S2; typedef __INT32_TYPE__ S4; typedef __INT64_TYPE__ S8;
typedef unsigned char B1; typedef __UINT16_TYPE__ B2; typedef __UINT32_TYPE__ B4; typedef __UINT64_TYPE__ B8;
typedef float F4; typedef double F8;
```
|
||||
|
||||
- `U` = unsigned, `S` = signed, `B` = byte (char)
|
||||
- The *number* is the byte count, not the bit-width (`U4` is 4 bytes, i.e. 32 bits)
|
||||
- All custom code uses these; `int`/`long`/`size_t` only appear in system headers
|
||||
|
||||
**Casts are wrapped:** `u4_(value)` / `u8_(value)` / `f4_(value)` etc. enforce precedence in arithmetic and signal at the call site "this is an explicit narrowing."
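
Because Part 2 is about reaching these types from Python, it may help to see how the convention could map onto `ctypes` fixed-width types. This is a minimal sketch, not something from the codebase: the `DUFFLE_TO_CTYPES` table and the `byte_width` helper are hypothetical names, and `B1`..`B8` are left out because they share their layouts with `U1`..`U8`.

```python
import ctypes

# Hypothetical lookup table: duffle.h typedef name -> ctypes equivalent.
DUFFLE_TO_CTYPES = {
    "U1": ctypes.c_uint8, "U2": ctypes.c_uint16, "U4": ctypes.c_uint32, "U8": ctypes.c_uint64,
    "S1": ctypes.c_int8,  "S2": ctypes.c_int16,  "S4": ctypes.c_int32,  "S8": ctypes.c_int64,
    "F4": ctypes.c_float, "F8": ctypes.c_double,
}

def byte_width(name: str) -> int:
    """Read the size in bytes straight off the typedef name (the trailing digit)."""
    return int(name[1:])

# Every ctypes type has the size its duffle.h name announces.
for name, ctype in DUFFLE_TO_CTYPES.items():
    assert ctypes.sizeof(ctype) == byte_width(name), name
```

If the loop passes, a Python-side mirror of a duffle.h struct can be written by looking names up in the table instead of guessing widths.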
|
||||
|
||||
### 1.1.2 Macro meta-DSL (the "duffle" layer)
|
||||
|
||||
```c
#define m_expand(...) __VA_ARGS__
#define glue_impl(A, B) A ## B
#define glue(A, B) glue_impl(A, B)
#define tmpl(prefix, type) prefix ## _ ## type
```
|
||||
|
||||
The rest of the file is built on these. Patterns:
|
||||
- `Struct_(Foo)` expands to `struct Foo Foo; struct Foo` — a forward decl + a typedef in one go, so you can use `Foo` as a type *or* a struct namespace immediately
|
||||
- `Enum_(U4, MyEnum)` similarly gives you `MyEnum` as the type and `enum MyEnum` as the tag
|
||||
- `Union_(Foo)`, `Array_(type, len)`, `Slice_(type)` — same pattern, all single-line
|
||||
|
||||
This is **the meta-primitive** that the entire codebase builds on. There is no `class`, no templates, no codegen — just `#define` and `_Generic`.
|
||||
|
||||
### 1.1.3 Inline / always-inline / no-inline discipline
|
||||
|
||||
```c
#define I_ internal inline
#define IA_ I_ __attribute__((always_inline))
#define N_ internal __attribute__((noinline))
```
|
||||
|
||||
Plus the macro name encodes intent: `I_*` is a normal inline, `IA_*` is forced inline (small, hot), `N_*` is forced out-of-line (debugging, code-size). Functions written as `IA_ void foo(...)` carry the intent in the function signature itself.
|
||||
|
||||
### 1.1.4 The `r`/`v` discipline (restrict / volatile, and nothing else)
|
||||
|
||||
```c
#define r restrict // pointers are either restricted or volatile and nothing else
#define v volatile
```
|
||||
|
||||
Plus typed pointer aliases: `r_(ptr) = C_(T_(ptr[0])*r, ptr)` is a typed restrict pointer, `v_(ptr)` is a typed volatile pointer. The user comment says this directly: *"pointers are either restricted or volatile and nothing else."*
|
||||
|
||||
There are no `const` pointers, no `volatile restrict`, no fancy CV qualifiers. Just two states. This is a real constraint on the design.
|
||||
|
||||
### 1.1.5 Slice as the core compound type
|
||||
|
||||
```c
typedef Struct_(Slice) { U8 ptr, len; }; // Untyped slice
#define Slice_(type) Struct_(tmpl(Slice,type)) { type* ptr; U8 len; }
```
|
||||
|
||||
- Untyped `Slice` is `{ void*, size_t }` (well, `{U8 ptr, U8 len}` — `U8` is the byte-width convention)
|
||||
- Typed `Slice_T` wraps a typed `T*` with the same `len` field
|
||||
- `slice_iter(container, iter)` is the iteration macro
|
||||
- `slice_end(slice)` returns `slice.ptr + slice.len` (pointer past the end, *not* a pointer to last element)
|
||||
- `slice_to_ut(s)` converts a typed slice to an untyped slice (used for memcpy / hash / format)
|
||||
- `S_slice(s)` is `s.len * sizeof(s.ptr[0])` — the byte size
|
||||
|
||||
This is the *data-structure primitive* of the duffle system. Arenas, stacks, KTL tables — everything is built on `Slice` + `Slice_T` + `FArena`.
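
To make the slice semantics concrete from the Python side, here is a small `ctypes` sketch. The `Slice_U4` class and the two helper functions are hypothetical stand-ins written for this illustration; they mirror the `{ type* ptr; U8 len; }` shape and the `S_slice` arithmetic described above, not the actual duffle.h code.

```python
import ctypes

class Slice_U4(ctypes.Structure):
    """Hypothetical ctypes mirror of a typed slice: { U4* ptr; U8 len; }."""
    _fields_ = [("ptr", ctypes.POINTER(ctypes.c_uint32)), ("len", ctypes.c_uint64)]

def slice_byte_size(s: Slice_U4) -> int:
    """Same arithmetic as S_slice: len * sizeof(ptr[0])."""
    return s.len * ctypes.sizeof(ctypes.c_uint32)

def slice_items(s: Slice_U4) -> list:
    """Walk ptr[0] .. ptr[len-1]; index len is one past the end and is never read."""
    return [s.ptr[i] for i in range(s.len)]

# The slice does not own memory: it points into a buffer allocated elsewhere.
backing = (ctypes.c_uint32 * 4)(10, 20, 30, 40)
view = Slice_U4(ctypes.cast(backing, ctypes.POINTER(ctypes.c_uint32)), len(backing))
```

Writing to `backing` is visible through `view.ptr`, which is the property an arena-backed slice relies on.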
|
||||
|
||||
### 1.1.6 The `FArena` (the chunk-adjacent data structure)
|
||||
|
||||
```c
typedef Struct_(FArena) { U8 start, capacity, used; };
```
|
||||
|
||||
- Linear-bump allocator with a `start` / `capacity` / `used` triple
|
||||
- `farena_push(arena, amount, options)` returns a `Slice`
|
||||
- `farena_save(arena) -> used` (snapshot), `farena_rewind(arena, save_point)` (rollback to snapshot)
|
||||
- `farena_reset(arena)` zeroes `used` (does NOT free; that requires `slice_free` or arena destruction)
|
||||
- `farena_push_type(arena, type, ...)` and `farena_push_array(arena, type, amount, ...)` are typed convenience macros
|
||||
|
||||
**Key observation:** this is *not* a chunk-based arena. It is a single contiguous buffer with a bump pointer. The user could extend it to chunked (with `Slice<FArena>` as the backing, or by allocating new pages and chaining them), but the current `FArena` is monolithic.
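
The save/rewind behaviour is easy to misread, so here is a Python model of the semantics listed above. It is a sketch under assumptions: `FArenaModel` is a hypothetical name, a `bytearray` stands in for the raw buffer, and the alignment options that `farena_push` takes are ignored.

```python
class FArenaModel:
    """Python model of a linear-bump arena: one fixed buffer plus a `used` cursor."""

    def __init__(self, capacity: int):
        self.buffer = bytearray(capacity)   # single contiguous backing store
        self.capacity = capacity
        self.used = 0

    def push(self, amount: int) -> memoryview:
        """Bump-allocate `amount` bytes; fails rather than growing (not chunked)."""
        if self.used + amount > self.capacity:
            raise MemoryError("arena exhausted")
        start = self.used
        self.used += amount
        return memoryview(self.buffer)[start:start + amount]

    def save(self) -> int:
        """Snapshot the cursor."""
        return self.used

    def rewind(self, save_point: int) -> None:
        """Roll back to a snapshot; everything pushed since is discarded."""
        self.used = save_point

    def reset(self) -> None:
        """Discard everything; the buffer itself is kept."""
        self.used = 0

arena = FArenaModel(64)
arena.push(16)
mark = arena.save()
arena.push(32)        # temporary allocation
arena.rewind(mark)    # the 32 bytes are available again
```

The `MemoryError` branch is the monolithic limitation noted above: a chunked variant would add a new buffer at that point instead of failing.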
|
||||
|
||||
### 1.1.7 Memory-barrier and atomic primitives (asm volatile)
|
||||
|
||||
```c
IA_ void barrier_compiler(void){asm volatile("":::"memory");}
IA_ void barrier_memory  (void){__builtin_ia32_mfence();}
IA_ void barrier_read    (void){__builtin_ia32_lfence();}
IA_ void barrier_write   (void){__builtin_ia32_sfence();}

IA_ U4 atm_add_u4 (U4*r addr, U4 value){asm volatile("lock xaddl %0,%1":"=r"(value),"=m"(addr[0]):"0"(value),"m"(addr[0]):"memory","cc"); return value;}
```
|
||||
|
||||
These are written as raw inline asm, not `stdatomic.h`. The user prefers `__builtin_*` intrinsics and raw `asm volatile(...)` over library abstractions. This matters for interop: as `IA_` (internal, always-inline) functions they presumably export no symbol, so Python could only reach them through an exported C wrapper.
|
||||
|
||||
### 1.1.8 Control-flow and defer discipline
|
||||
|
||||
```c
#define defer(expr)          for(U4 once=0; once!=1; ++once, (expr))
#define scope(begin,end)     for(U4 once=((begin),0); once!=1; ++once, (end))
#define defer_rewind(cursor) for(T_(cursor) sp=cursor, once=0; once!=1; ++once, cursor=sp)
```
|
||||
|
||||
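`defer` is a single-statement cleanup that fires when the block attached to the macro finishes: the `for` loop runs its body once, then evaluates `expr` in the increment clause. Leaving the block early with `break`, `return`, or `goto` skips the cleanup. `defer_rewind` is the arena-aware variant: it captures the current cursor at block entry and restores it on exit. This is *the* pattern for "transactional" arena allocation.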
|
||||
|
||||
### 1.1.9 The `KTL` (Key Table Linear) — a small key-value table
|
||||
|
||||
```c
#define KTL_Slot_(type) Struct_(tmpl(KTL_Slot,type)) { U8 key; type value; }
#define KTL_(type) Slice_(tmpl(Slot,type));
typedef Slice KTL_Byte;
```
|
||||
|
||||
A linear array of `{key, value}` slots, with FNV-1a 64-bit hashing on `Str8` keys. The comment in the code says: *"We do a linear iteration instead of a hash table lookup because the user should never subst with more than 100 unique tokens."* — this is the "assume as much as possible" principle applied directly. No hash table; linear scan wins for small N.
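
The linear-scan idea can be modelled in a few lines of Python. This is an illustrative sketch, not the duffle.h code: `KTLModel` and its method names are hypothetical, and it assumes (as the `U8 key` slot field suggests) that the stored key is the 64-bit FNV-1a hash of the string. The offset basis and prime are the standard FNV-1a 64-bit constants.

```python
FNV64_OFFSET = 0xcbf29ce484222325   # standard FNV-1a 64-bit offset basis
FNV64_PRIME = 0x100000001b3         # standard FNV-1a 64-bit prime

def fnv1a_64(data: bytes) -> int:
    """FNV-1a: XOR each byte in, then multiply by the prime, modulo 2**64."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

class KTLModel:
    """Linear key-value table: a flat list of (hashed_key, value) slots."""

    def __init__(self):
        self.slots = []

    def set(self, key: str, value) -> None:
        k = fnv1a_64(key.encode("utf-8"))
        for i, (slot_key, _) in enumerate(self.slots):
            if slot_key == k:               # overwrite an existing slot
                self.slots[i] = (k, value)
                return
        self.slots.append((k, value))

    def get(self, key: str, default=None):
        k = fnv1a_64(key.encode("utf-8"))
        for slot_key, value in self.slots:  # linear scan, no buckets
            if slot_key == k:
                return value
        return default
```

With at most ~100 slots, the scan touches a small contiguous list, which is the trade the quoted comment is making.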
|
||||
|
||||
## 1.2 The duffle.h ↔ main.c interface (forth_bootslop/attempt_1/main.c, 1426 lines)
|
||||
|
||||
main.c is a stack-machine JIT compiler. It uses duffle.h to:
|
||||
- Define an `STag` enum (X-macro pattern: 7 entries in a single `Tag_Entries()` table, then `#define X` + `#undef X` to repurpose the macro inside the table generator)
|
||||
- Define `tape_arena` (an `FArena` for the bytecode tape) and `anno_arena` (parallel arena for annotation strings)
|
||||
- Use `u4_r(...)` / `u8_r(...)` for typed restrict pointers
|
||||
- Use `mem_copy` / `mem_zero` (which are wrappers around `__builtin_memcpy` / `__builtin_memset`)
|
||||
- Hand-emit x64 machine code using `emit8` / `emit32` / `emit64` macros
|
||||
- Build a `JIT` (Just-In-Time compiler for a custom stack-based VM) that emits `REX` prefixes, `ModRM` bytes, `SIB` bytes via a per-field macro DSL
|
||||
|
||||
**What this tells us about how Ed uses duffle.h:**
|
||||
- The DSL is meant to support **low-level systems work** (JIT, OS syscalls, raw asm) without sacrificing readability
|
||||
- The byte-width typedef convention is **rigid** — every new line of code in main.c uses U1/U4/U8; `int`/`long` only appear in system header forward-decls
|
||||
- Memory discipline is **arena-first**: `tape_arena` + `anno_arena` + `code_arena` are global `FArena` instances, no `malloc`/`free` in user code
|
||||
- The `defer` / `defer_rewind` pattern is the user's answer to RAII — it's the only structured cleanup mechanism
|
||||
|
||||
## 1.3 The Pikuma ps1 duffle/ (Pikuma/ps1/code/duffle/*, the more recent style)
|
||||
|
||||
The Pikuma ps1 duffle/ is a **refined, smaller** version of the forth_bootslop DSL. Same conventions, but with platform-specific concerns (PS1 MIPS + GTE + GPU command encoders). Notable differences:
|
||||
|
||||
- `dsl.h` adds `TSet_(type)` (type + restricted-pointer + volatile-pointer in one typedef), `Proc_(symbol)` (typedef for `void(*)()`)
|
||||
- `memory.h` adds `sll_stack_push_n` / `sll_queue_push_nz` — singly-linked list / queue macros (the DAG region)
|
||||
- `gp.h` is the GPU command encoder; every GPU command is a `(gcmd_X << 24 | ...)` bit-packing macro, same pattern as the x64 emission DSL in forth_bootslop main.c
|
||||
- `gte.h` is the GTE coprocessor instruction encoder; per-field macros, `asm volatile(asm_inline(gte_cmd_rtpt, ...))` to emit constant-folded instruction words
|
||||
- `math.h` defines `V2_S2`, `V3_S2`, `V4_S2` (S2/S4 are 16/32-bit signed), `Rect_S2`, `M3_S2` — 3x3 matrix with translation vector
|
||||
|
||||
**What Pikuma ps1 duffle/ shows that's different from forth_bootslop:**
|
||||
- The DSL is **split across multiple small headers** (dsl.h, memory.h, math.h, gp.h, gte.h, mips.h, gcc_asm.h, strings.h) — one concept per file, easier to reason about
|
||||
- The `INTELLISENSE_DIRECTIVES` guard at the top of every header wraps `#pragma once` and the includes, so IDEs see the full type graph *without* requiring the user to include `dsl.h` in every file. Production builds skip the guarded block
|
||||
- The `TSet_` / `PtrSet_` / `Array_expand` macros are a more complete type-builder system: one macro gives you `type`, `type*restrict`, `type*volatile` in one shot
|
||||
- The GTE/GPU encoding layers are **fully composable** — `enc_gte_cmdw(sf, mx, v, cv, lm, cmd)` is a flat OR of 6 per-field encoders, each of which is its own named function
|
||||
|
||||
**`hello_gte.c` shows usage:**
|
||||
- `SMemory` is the global state struct; `static_mem` is a single global instance
|
||||
- `prim__alloc(type_width, type_name)` is the arena-style allocation primitive for the GTE primitive buffer
|
||||
- `ent_cube128_init` / `ent_floor_init` are `__forceinline` initializers that copy baked vertex/face data into the entity's arena slot
|
||||
- `Ent_Cube` and `Ent_Floor` are entity structs that *embed* their data (`A8_V3_S2 verts; A6_V4_S2 faces;`) — entities are POD, not heap-allocated
|
||||
|
||||
## 1.4 The 11 style observations that matter for chunkification
|
||||
|
||||
Distilled from the duffle.h + main.c + pikuma ps1 headers + hello_gte.c reading:
|
||||
|
||||
1. **No `malloc`/`free` in user code.** Everything is arena-allocated. For chunk-based data structures, this means the chunks themselves would be allocated from an `FArena` (or a chunk-aware variant), and the structure holds a `Slice<Chunk>` of pointers into the arena.
|
||||
2. **No classes, no templates, no inheritance.** POD structs only. Methods are free functions that take a pointer: `void farena_push(FArena* arena, U8 amount, Opt_farena o)`.
|
||||
3. **The `Slice` + `Slice_T` pair is *the* data-structure primitive.** A chunk-array is probably modeled as `Slice<Chunk>` where `Chunk` is a fixed-size `T[N]`.
|
||||
4. **Pointer discipline is `restrict` or `volatile`, never both, never `const`.** This is a hard constraint.
|
||||
5. **The byte-width convention is rigid.** `U1`/`U2`/`U4`/`U8` for unsigned, `S1`/`S2`/`S4`/`S8` for signed, `B1`/`B2`/`B4`/`B8` for byte, `F4`/`F8` for float. `int` and `long` are forbidden in user code.
|
||||
6. **`asm volatile` + `__builtin_*` are preferred over library wrappers.** No `stdatomic.h`, no `stddef.h` for size_t.
|
||||
7. **The DSL compiles in c23 mode (clang).** This means `_Generic` is available, `__builtin_*` are stable, and `typeof` works.
|
||||
8. **`__attribute__((always_inline))` is the default for small hot functions.** Hot path code has zero call overhead.
|
||||
9. **Macros encode intent, not just abbreviation.** `I_` vs `IA_` vs `N_` is meaningful; `I_proc` was specifically *removed* in the duffle.h because the user found it harder to read than just writing inline functions.
|
||||
10. **Entities are POD structs with embedded data.** No handles, no IDs, no virtual dispatch.
|
||||
11. **X-macros are the pattern for data-driven code.** `Tag_Entries()` defines the table; `#define X(n, s, c, p)` + `#undef X` lets the same table feed the enum, the colors array, the prefix array, the name array.
|
||||
|
||||
## 1.5 What the style implies for the chunkified data structure
|
||||
|
||||
If the user wrote a chunk-based C11 data structure in their style, it would probably look like:
|
||||
|
||||
```c
// Likely shape (NOT actually written, this is what their style suggests).
// Simplification: elements are U8 (8 bytes); a generic version would
// mem_copy element_size bytes instead of assigning.
typedef Struct_(ChunkArray_T) {   // ChunkArray<T>
    Slice   chunks;               // chunk pointer table; untyped Slice, so ptr is an address stored as U8
    U4      chunk_size;           // elements per chunk, power-of-2
    U4      element_size;         // sizeof(T)
    U8      total_used;           // elements pushed so far, across all chunks
    FArena* backing;              // where chunks live
};

// Push: O(1) amortized
I_ U8 chunkarray_push(ChunkArray_T* ca, U8 element) {
    U8** table     = (U8**)ca->chunks.ptr;
    U8   chunk_idx = ca->total_used >> __builtin_ctz(ca->chunk_size);   // ctz == log2 for a power of 2
    if (chunk_idx >= ca->chunks.len) {
        // grow: add a new chunk. Assumes the pointer table still has a free slot
        // (growing the table itself is elided) and that farena_push_array yields a U8*.
        table[ca->chunks.len] = farena_push_array(ca->backing, U8, ca->chunk_size, ...);
        ca->chunks.len += 1;
    }
    U8 offset = ca->total_used & (ca->chunk_size - 1);
    table[chunk_idx][offset] = element;   // copy
    ca->total_used += 1;
    return ca->total_used - 1;
}

// Index: O(1) bitwise. No bounds check: the caller guarantees i < total_used.
IA_ U8 chunkarray_at(ChunkArray_T* ca, U8 i) {
    U8** table     = (U8**)ca->chunks.ptr;
    U8   chunk_idx = i >> __builtin_ctz(ca->chunk_size);
    U8   offset    = i & (ca->chunk_size - 1);
    return table[chunk_idx][offset];
}
```
|
||||
|
||||
This is *exactly* Reece's Xar pattern (8-byte header, power-of-2 chunks, bitwise divmod), written in Ed's duffle.h style.
|
||||
|
||||
**The point:** the style is *consistent with* the chunkification optimization. If you wrote this in C11, it would look like duffle.h. There's no impedance mismatch between "the user's preferred C11 style" and "the chunk-idea C11 implementation."
|
||||
|
||||
The impedance is between *any* C11 chunk-array and the Python runtime, regardless of style. That's Part 2.
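
Before moving to Part 2, the index math can be checked in isolation with a pure-Python reference model. This is a sketch for reasoning about the design, not a proposed implementation: `ChunkArrayModel` and `locate` are hypothetical names, and Python lists stand in for arena-allocated chunks.

```python
class ChunkArrayModel:
    """Pure-Python reference model of the push/at index math."""

    def __init__(self, chunk_size: int):
        if chunk_size <= 0 or chunk_size & (chunk_size - 1):
            raise ValueError("chunk_size must be a power of 2")
        self.chunk_size = chunk_size
        self.shift = chunk_size.bit_length() - 1   # log2(chunk_size)
        self.mask = chunk_size - 1
        self.chunks = []                           # chunk pointer table
        self.total_used = 0

    def locate(self, i: int):
        """Bitwise divmod: (chunk index, offset inside that chunk)."""
        return i >> self.shift, i & self.mask

    def push(self, element) -> int:
        chunk_idx, offset = self.locate(self.total_used)
        if chunk_idx >= len(self.chunks):
            # grow: add one chunk; existing chunks never move
            self.chunks.append([None] * self.chunk_size)
        self.chunks[chunk_idx][offset] = element
        self.total_used += 1
        return self.total_used - 1

    def at(self, i: int):
        if not 0 <= i < self.total_used:
            raise IndexError(i)
        chunk_idx, offset = self.locate(i)
        return self.chunks[chunk_idx][offset]
```

A model like this could also serve as the oracle when testing a C implementation: push the same values into both and compare `at(i)` for every index.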
|
||||
|
||||
---
|
||||
|
||||
# PART 2 — Interop Design Space (the actual question)
|
||||
|
||||
## 2.1 What "interop" actually means in this context
|
||||
|
||||
The question isn't "can Python call C11?" — that's a solved problem with multiple working answers (ctypes, cffi, pybind11, Cython, custom CPython module, etc.). The question is more specific:
|
||||
|
||||
> Can a Python *user-space* program actually *exploit* a chunk-based C11 data structure as if it were a "lego set" of composable pieces — where the user picks which chunk operations to run, in which order, with custom callbacks for filter/map/reduce — without paying the FFI overhead per element?
|
||||
|
||||
The user's skepticism is well-founded. The standard FFI answers have specific impedance-mismatch properties:
|
||||
|
||||
## 2.2 The 5 candidate interop layers, honestly assessed
|
||||
|
||||
### 2.2.1 ctypes (Python stdlib)
|
||||
|
||||
**What it is:** load a `.dll` / `.so` and call C functions via FFI. No compile step. Structs, arrays, pointers, callbacks all work.
|
||||
|
||||
**Pros for chunkification:**
|
||||
- No binding build step: once the C code is compiled into a shared library, `ctypes.CDLL("./libchunks.so")` and you're in
|
||||
- `Structure` + `Array` classes map naturally to a `ChunkArray` header + `Chunk*` array
|
||||
- `POINTER(c_uint64)` can wrap the chunk pointer, indexed like a Python list
|
||||
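- The GIL is released for the duration of each foreign call (with `CDLL`), so other Python threads keep running; the C code itself still has to be thread-safe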
|
||||
|
||||
**Cons for chunkification:**
|
||||
- **Per-call overhead is ~1-5 microseconds.** A `chunkarray_at(arr, i)` round trip is 1 µs of FFI overhead. A 10,000-element loop is 10ms. Python's native list iteration is ~50ns/element, so ctypes is ~20-100x slower for tight loops.
|
||||
- **No inlining.** The "lego set" pattern requires the user to *compose* operations (filter + map + reduce over chunks). With ctypes, each operation is a separate FFI call, so composition costs O(N) FFI round trips.
|
||||
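- **Type coercion has to be declared per function.** Unless `restype` and `argtypes` are set, ctypes assumes a C `int` return, so a `U8` result from `chunkarray_at` is silently truncated; structs and pointers always come back as ctypes objects rather than native Python values.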
|
||||
- **No SIMD/AVX exposure.** The user could write the C11 to use AVX, but ctypes sees only the C function signature.
|
||||
|
||||
**Verdict for chunkification:** **Tractable but defeats the purpose.** If the use case is "process a 100K-element chunk-array in a hot loop," ctypes is wrong. If the use case is "occasionally bulk-load or bulk-dump a chunk-array and do the rest in Python," ctypes is fine.
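
The arithmetic behind that verdict can be written down as a small cost model. The per-call figures are the rough estimates quoted above (1 µs per FFI call, ~50 ns per native element); the 256-element chunk size and the `loop_cost_ns` helper are assumptions made for this sketch, and the model counts call overhead only, not the work done inside each call.

```python
def loop_cost_ns(n_elements: int, per_call_ns: int, elements_per_call: int = 1) -> int:
    """Total call overhead in ns when each call handles `elements_per_call` elements."""
    n_calls = -(-n_elements // elements_per_call)   # ceiling division
    return n_calls * per_call_ns

per_element = loop_cost_ns(10_000, 1_000)        # one FFI call per element at 1 µs
per_chunk = loop_cost_ns(10_000, 1_000, 256)     # one FFI call per 256-element chunk
native = loop_cost_ns(10_000, 50)                # ~50 ns per element in pure Python

print(per_element / 1e6, "ms per-element FFI")
print(per_chunk / 1e3, "µs per-chunk FFI")
print(native / 1e6, "ms native list iteration")
```

The per-chunk line is the interesting one: if the C side exposes whole-chunk operations, the same ctypes overhead is paid 40 times instead of 10,000 times, which is why ctypes remains viable for bulk calls.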
|
||||
|
||||
**Style fit with duffle.h:** *low.* ctypes would require the user to write *Python-side* struct definitions that mirror the C struct layout. The duffle.h `Struct_(ChunkArray_T) { Slice chunks; U4 chunk_size; U4 element_size; U8 total_used; }` would become:
|
||||
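```python
import ctypes
from ctypes import c_uint32, c_uint64

class Slice(ctypes.Structure):              # mirrors the untyped Slice { U8 ptr, len; }
    _fields_ = [("ptr", c_uint64), ("len", c_uint64)]

class ChunkArray_T(ctypes.Structure):
    _fields_ = [
        ("chunks", Slice),                  # needs its own Structure (above)
        ("chunk_size", c_uint32),
        ("element_size", c_uint32),
        ("total_used", c_uint64),
    ]
```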
|
||||
That roughly doubles the declaration code, and the two copies have to be kept in sync by hand. The user's unorthodox `Slice` + `Struct_` macros would have to be expanded manually into explicit ctypes field lists.
|
||||
|
||||
### 2.2.2 cffi (PyPy / CPython, third-party)
|
||||
|
||||
**What it is:** write C declarations in a Python string, cffi compiles them and gives you ABI-stable handles.
|
||||
|
||||
**Pros over ctypes:**
|
||||
- C-level type declarations are the source of truth (not Python-side mirroring)
|
||||
- ABI mode vs API mode: ABI is like ctypes (no compile); API mode compiles a Python extension module
|
||||
- More Pythonic: `from cffi import FFI; ffi = FFI(); ffi.cdef("..."); lib = ffi.dlopen("./libchunks.so")`
|
||||
|
||||
**Cons for chunkification:** in ABI mode, the per-call overhead is the same as ctypes, and cffi parses the C declarations at import time, which is a real cost on cold start. API mode removes most of the per-call overhead but adds a compile step.
|
||||
|
||||
**Verdict for chunkification:** same as ctypes — *tractable but defeats the purpose* for hot loops.
|
||||
|
||||
**Style fit with duffle.h:** *low-medium.* cffi is more idiomatic for the C-decl-as-source-of-truth, but you still pay the FFI cost.
|
||||
|
||||
### 2.2.3 pybind11 (C++ heavy)
|
||||
|
||||
**What it is:** C++ header-only library that generates Python bindings from C++ type signatures. Requires the C++ compiler.
|
||||
|
||||
**Pros for chunkification:**
|
||||
- Type-safe bindings
|
||||
- STL containers (vector, array) have automatic conversions to Python list / numpy array
|
||||
- `py::buffer_info` lets you expose raw memory as a NumPy array (zero-copy)
|
||||
|
||||
**Cons for chunkification:**
|
||||
- **C++ is not the user's style.** The user writes pure C11 with macros. pybind11 is C++-only.
|
||||
- pybind11's STL conversions don't fit the duffle.h `Slice` / `FArena` model. You'd be writing the C++ adapter layer, not the C11 chunk-array.
|
||||
- The "pybind11 generates bindings" claim is misleading for non-trivial types — you write glue code, and for an `FArena`-backed chunk array, the glue is more code than the C11 implementation.
|
||||
|
||||
**Verdict for chunkification:** *not a fit.* Style mismatch is fatal here.
|
||||
|
||||
### 2.2.4 Custom CPython C extension (CPython C API)
|
||||
|
||||
**What it is:** write a real CPython extension module using `<Python.h>`. You get a Python-importable module that wraps the C11 code directly.
|
||||
|
||||
**Pros for chunkification:**
|
||||
- **Zero FFI overhead for tightly-coupled code.** Once the module is loaded, `import chunks; chunks.push(arr, val)` is a normal C function call with refcount discipline, ~50ns/element.
|
||||
- The C API is C-compatible (C11 or later), so the duffle.h macros can be used directly inside the extension module
|
||||
- The user controls the module surface — can expose `ChunkArray.push`, `.at`, `.chunk_count`, `.chunk_size`, `.arena_capacity` etc.
|
||||
- Iterator support (`__iter__` over chunks, via the `tp_iter` / `tp_iternext` slots) is straightforward in C
|
||||
- Can release the GIL for long-running pure-C operations
|
||||
|
||||
**Cons for chunkification:**
|
||||
- **Refcount discipline is manual.** The user must `Py_INCREF` / `Py_DECREF` correctly. The duffle.h style doesn't have a notion of refcounting (everything is arena-owned). A new discipline is needed at the Python boundary.
|
||||
- **Must compile.** Build the `.pyd`/`.so`, ensure it's on `sys.path`, deal with Python version compatibility (3.11 ABI tag, etc.). The user's Manual Slop project uses `uv`, which builds through whatever backend `pyproject.toml` declares under `[build-system]`, so this would mean adding an extension-capable backend (e.g. setuptools) there.
|
||||
- **CPython-specific.** PyPy / GraalPy / RustPython don't all support the C API the same way. For a tool that's CPython-only (Manual Slop is), this is fine, but it's a lock-in.
|
||||
- **GIL.** Free-threaded Python (PEP 703) is shipping; chunk-array code that releases the GIL has to be careful about which Python objects it touches.
|
||||
|
||||
**Verdict for chunkification:** **Most tractable option.** The custom C extension model lets the user write the chunk-array in their preferred C11 style (duffle.h compatible), wrap it with a small Python-facing layer (refcount-aware), and ship it as a real importable module. Build cost is one-time.
|
||||
|
||||
**Style fit with duffle.h:** *high.* The C11 code is C11. The Python-facing layer is a thin `PyTypeObject` / `PyMethodDef` table at the bottom of the file. The duffle.h macros can be used *inside* the extension module without modification.
|
||||
|
||||
**Sketch (not actually written — for the design conversation):**
|
||||
```c
// chunks_module.c
#include <Python.h>               // first: duffle.h defines short macros (r, v) that must not rewrite Python's headers
#include "duffle.amd64.win32.h"   // user's existing style

typedef Struct_(ChunkArray) {
    Slice  chunks;                // chunk pointer table (untyped Slice)
    U4     chunk_size;            // power-of-2
    U4     element_size;
    U8     total_used;
    FArena backing_arena;
};

// Python-side wrapper object. The PyTypeObject that creates and type-checks
// these is not shown; until it exists, the casts below trust the caller.
typedef struct { PyObject_HEAD ChunkArray* c_arr; } ChunkArrayObject;

static PyObject* chunka_push(PyObject* self, PyObject* args) {
    PyObject* py_arr;
    U8        value;
    // "O" = any object, "K" = unsigned long long (no overflow check)
    if (!PyArg_ParseTuple(args, "OK", &py_arr, &value)) return NULL;
    ChunkArray* arr = ((ChunkArrayObject*)py_arr)->c_arr;
    U8 idx = chunkarray_push(arr, value);
    return PyLong_FromUnsignedLongLong(idx);
}

static PyObject* chunka_at(PyObject* self, PyObject* args) {
    PyObject* py_arr; U8 i;
    if (!PyArg_ParseTuple(args, "OK", &py_arr, &i)) return NULL;
    ChunkArray* arr = ((ChunkArrayObject*)py_arr)->c_arr;
    if (i >= arr->total_used) {   // chunkarray_at itself does no bounds check
        PyErr_SetString(PyExc_IndexError, "chunk-array index out of range");
        return NULL;
    }
    U8 val = chunkarray_at(arr, i);
    return PyLong_FromUnsignedLongLong(val);
}

static PyMethodDef ChunkArrayMethods[] = {
    {"push", chunka_push, METH_VARARGS, "Append an element, return its index"},
    {"at",   chunka_at,   METH_VARARGS, "Random access by index"},
    {NULL, NULL, 0, NULL}         // sentinel
};

static struct PyModuleDef chunkmodule = {
    PyModuleDef_HEAD_INIT, "chunks", NULL, -1, ChunkArrayMethods
};

PyMODINIT_FUNC PyInit_chunks(void) {
    return PyModule_Create(&chunkmodule);
}
```
|
||||
|
||||
A complete module, with the `PyTypeObject` and constructor that this sketch omits, comes to roughly 80 lines of glue. The actual `chunkarray_push` and `chunkarray_at` are duffle.h-style C11.
|
||||
|
||||
### 2.2.5 NumPy + custom C API (`PyArrayInterface`)
|
||||
|
||||
**What it is:** NumPy has a C API (`<numpy/arrayobject.h>`) that lets C extensions allocate and manipulate `ndarray` objects. The C extension holds the *actual* memory, and NumPy wraps it as an array with zero copy.
|
||||
|
||||
**Pros for chunkification:**
|
||||
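- If the chunks are physically contiguous (for example, allocated back-to-back from one `FArena`), NumPy can wrap the whole sequence as one `ndarray` with zero copy; otherwise only per-chunk views are zero-copy and a flat view needs a copy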
|
||||
- The user can then do `np.sum(chunks)`, `chunks[1000:2000]`, `chunks[chunks > threshold]` in NumPy land — all the vectorized ops for free
|
||||
- For *batch* operations (load 10K elements, do something to all of them, write back), NumPy is the right level of abstraction
|
||||
- Most Manual Slop hot-path code (text processing, JSON-L serialization, list-mutation) can be re-expressed as NumPy operations
|
||||
|
||||
**Cons for chunkification:**
|
||||
- NumPy semantics are *flat* 1D/2D/ND arrays, not chunk-aware. The "lego set" pattern (iterate over chunks, custom callback per chunk) is not a first-class NumPy concept.
|
||||
- The C API requires linking against NumPy's headers and ABI version compatibility
|
||||
- NumPy's array protocol is *strongly* typed (dtype); chunk-array-of-mixed-type is not a fit
|
||||
- For a chunk-array that needs to be both chunk-aware (user iterates chunks) and element-wise (NumPy ops on the flat view), you'd need a custom NumPy `dtype` with chunk-aware accessors — possible but not trivial
|
||||
|
||||
**Verdict for chunkification:** *orthogonal.* NumPy is a great *consumer* of a chunk-array (zero-copy wrap), but not a great *driver* (you can't easily express chunk-aware iteration in NumPy). The combination is: write the chunk-array in C11, expose a NumPy-compatible 1D view, let NumPy do batch ops when appropriate, do chunk-aware iteration in C.
|
||||
|
||||
**Style fit with duffle.h:** *medium.* NumPy's C API doesn't conflict with duffle.h, but the `PyArrayObject` types are intrusive. You'd write an adapter layer that converts between `Slice<U1>` (raw bytes) and `PyArrayObject` (typed ndarray).
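
The zero-copy idea rests on the buffer protocol, which NumPy also consumes. The stdlib-only sketch below uses `memoryview` as a stand-in for `ndarray` and two `bytearray` objects as hypothetical chunks (the names and sizes are assumptions for illustration). It shows why a per-chunk view is free while a flat view across separately allocated chunks has to copy.

```python
CHUNK_ELEMS = 4   # hypothetical chunk size, in 8-byte elements
ELEM_BYTES = 8

# Two separately allocated chunks, as a chunk-array would hold them.
chunks = [bytearray(CHUNK_ELEMS * ELEM_BYTES) for _ in range(2)]

def chunk_view(chunk: bytearray) -> memoryview:
    """Zero-copy typed view of one chunk; 'Q' is an unsigned 64-bit element."""
    return memoryview(chunk).cast("Q")

def flat_copy(all_chunks) -> memoryview:
    """A flat view over non-adjacent chunks must first copy them into one buffer."""
    return memoryview(b"".join(all_chunks)).cast("Q")

view0 = chunk_view(chunks[0])
view0[1] = 42               # writes through to the chunk's own memory
flat = flat_copy(chunks)    # snapshot taken after the first write
view0[2] = 7                # not visible in the earlier copy
```

This mirrors the split described in the verdict: per-chunk views suit the batch path, while anything that needs one flat array either requires contiguous chunks or pays for a copy.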
|
||||
|
||||
## 2.3 The honest assessment matrix
|
||||
|
||||
For the actual question — *"can a Python user-space program fully exploit a C11 chunk-based data structure lego-set?"* — here's what the design space looks like:
|
||||
|
||||
| Approach | Build cost | Per-op overhead | Style fit | Lego-set pattern support | Verdict |
|---|---|---|---|---|---|
| **ctypes** | 0 | ~1-5 µs/call | low | low (each op = FFI call) | Tractable but defeats the purpose |
| **cffi ABI mode** | 0 | ~1-5 µs/call | low-medium | low | Same as ctypes |
| **cffi API mode** | 1x (compile) | ~50ns/call | medium | medium | Good middle ground |
| **pybind11** | 1x (compile) | ~50ns/call | very low (C++) | medium | Style mismatch, not a fit |
| **CPython C ext** | 1x (compile) | ~50ns/call | high (C11) | high (full C API) | **Most tractable** |
| **NumPy wrap** | 1x (compile) | ~50ns/call | medium | low (flat view) | Orthogonal: good for batch, not lego-set |
| **HPy / PyO3 / nanobind** | 1x (compile) | ~50ns/call | low (Rust/C++/new API) | medium | Better than pybind11 but still style-mismatched |
|
||||
|
||||
**The recommendation:**

**For the *lego-set* (chunk-aware user-driven iteration):** a custom CPython C extension is the most tractable. The duffle.h style is C11; the C extension wrapping is ~80-150 lines of glue per chunk-array class; per-element overhead is the same as a native Python builtin call (~50ns).

**For *batch* operations on a chunk-array:** NumPy wrap is the most tractable. Expose the chunk-array's memory as a 1D ndarray and let NumPy do the work. Zero-copy, vectorized, free.

**For *occasional* FFI from Python:** ctypes is fine. Load the lib, call the function, get the result. Don't try to do hot loops this way.

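The per-call figures in the matrix are approximate, so it may be worth measuring them on the target machine before committing to a path. The sketch below times one ctypes crossing against one builtin call. It uses `ctypes.pythonapi` (the interpreter's own C API, so no external library is needed) and calls `Py_IsInitialized` purely because it takes no arguments; `ns_per_call` is a hypothetical helper, and the builtin baseline includes the cost of a small lambda.

```python
import ctypes
import timeit

# ctypes.pythonapi exposes the running interpreter's C API as a loaded
# library, so this measurement needs no external .so/.dll.
py_is_initialized = ctypes.pythonapi.Py_IsInitialized
py_is_initialized.argtypes = []
py_is_initialized.restype = ctypes.c_int

sample = [1, 2, 3]


def ns_per_call(fn, number: int = 200_000) -> float:
    """Average wall-clock nanoseconds per call of fn()."""
    seconds = timeit.timeit(fn, number=number)
    return seconds / number * 1e9


ffi_ns = ns_per_call(py_is_initialized)        # one ctypes FFI crossing per call
builtin_ns = ns_per_call(lambda: len(sample))  # builtin implemented in C

print(f"ctypes call : {ffi_ns:8.1f} ns")
print(f"builtin call: {builtin_ns:8.1f} ns")
```

Calls that convert arguments or return values will cost more than this no-argument case, so treat the ctypes number as a lower bound for the FFI tax.
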
## 2.4 What "a chunked C11 package that interops with Python" actually requires

If the user wants to build this, the minimum viable product is:

1. **The chunk-array C11 code** (duffle.h style, ~200-400 lines)
   - `ChunkArray_T` struct
   - `chunkarray_push`, `chunkarray_at`, `chunkarray_grow`, `chunkarray_iter_chunks`
   - Backing is an `FArena` for chunk memory + a `Slice<Chunk*>` for the chunk pointer table

2. **A CPython C extension wrapper** (~80-150 lines)
   - `PyTypeObject` for `ChunkArrayObject` (wraps the C struct)
   - `__init__` (creates the C struct from Python args: `chunk_size`, `element_size`, `initial_capacity`)
   - `__len__` (returns `total_used`)
   - `__getitem__` / `__setitem__` (calls `chunkarray_at` / in-place write)
   - `__iter__` (yields elements one at a time; can be optimized to yield per-chunk for the lego-set pattern)
   - `push(value)` method
   - `chunks()` method (yields per-chunk `ndarray` views for the NumPy interop path)
   - `arena_capacity`, `chunk_count`, `chunk_size` read-only properties

3. **A build step** in `pyproject.toml` (one-time cost, ~5 lines)
   - `[build-system]` config naming a backend that can compile C extensions (for example setuptools or scikit-build-core); uv's own `uv_build` backend, configured under `[tool.uv.build-backend]`, targets pure-Python packages
   - Build the `.pyd`/`.so` for the current Python version
   - Wheels for distribution (optional; one per OS and architecture, e.g. Windows and Linux on x86_64 and arm64)

4. **Tests** in `tests/test_chunka_c11.py` (~100-300 lines)
   - TDD-style: write tests in Python first, then write the C, then verify
   - Grow pattern tests, random access tests, edge cases (empty, full, resize)
   - NumPy interop test: ensure the per-chunk view is zero-copy (`np.asarray(chunk)` shares memory with the chunk; note that plain `np.array(...)` copies by default)
   - Comparison test: chunk-array must beat `list.append` for the relevant N

5. **A `chunks/__init__.py` Python wrapper** (~30-50 lines, optional but recommended)
   - High-level API: `ChunkArray(chunk_size=1024, element_size=8)`, `.push(x)`, `.at(i)`, `.numpy()`
   - Type hints for IDE support
   - This is the *only* Python code in the package itself (tests aside); everything else is C

**Total:** roughly 500-1000 lines: ~280-550 lines of C (array plus glue), ~30-50 lines of Python wrapper, ~100-300 lines of tests, plus build config.

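Because the tests are meant to be written first, a pure-Python model of the same API can serve as a reference to test against before any C exists. The sketch below is one possible shape, not the planned implementation: it mirrors `push`, indexing, `chunks()`, `__len__`, and `chunk_count` from the list above, stores each chunk as a plain Python list, and ignores `element_size` and negative indices.

```python
class ChunkArray:
    """Pure-Python model of the chunk-array API (hypothetical reference)."""

    def __init__(self, chunk_size: int = 1024) -> None:
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size
        self._chunks = []  # chunk table; existing chunks never move
        self._used = 0     # total elements stored

    def __len__(self) -> int:
        return self._used

    @property
    def chunk_count(self) -> int:
        return len(self._chunks)

    def push(self, value) -> None:
        # Grow by appending a fresh chunk; earlier chunks are untouched.
        if self._used == self.chunk_count * self.chunk_size:
            self._chunks.append([])
        self._chunks[-1].append(value)
        self._used += 1

    def _locate(self, index: int) -> tuple:
        if not 0 <= index < self._used:
            raise IndexError("ChunkArray index out of range")
        return divmod(index, self.chunk_size)  # (chunk number, offset)

    def __getitem__(self, index: int):
        chunk, offset = self._locate(index)
        return self._chunks[chunk][offset]

    def __setitem__(self, index: int, value) -> None:
        chunk, offset = self._locate(index)
        self._chunks[chunk][offset] = value

    def chunks(self):
        """Yield each chunk in order (the chunk-aware iteration path)."""
        yield from self._chunks


arr = ChunkArray(chunk_size=4)
for n in range(10):
    arr.push(n * n)
arr[9] = -1
chunk_lengths = [len(c) for c in arr.chunks()]
print(len(arr), arr.chunk_count, chunk_lengths)
```

The same test file can later run against the C extension by swapping the import, which keeps the Python model useful as an oracle for the grow and random-access cases.
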
## 2.5 The honest tractable-vs-not answer

**Tractable:**

- Writing a chunk-array in C11 duffle.h style: trivially tractable (Reece's Xar is the reference impl, ~200 lines)
- Wrapping it as a CPython C extension: tractable (~150 lines of glue)
- Per-element overhead matching native Python: yes (~50ns either way, no FFI tax)
- NumPy interop via zero-copy ndarray wrap: tractable (NumPy's C API is well-documented)
- Build + distribution via uv + pyproject.toml: tractable (one-time setup, well-trodden path)

**Not tractable (or not worth the cost):**

- Letting the user *arbitrarily compose* C11 chunk operations from Python at the lego-set level: **not tractable without compiling Python → C11 on the fly**. ctypes/cffi/pybind11 are all per-call; you'd need a C-subset JIT (as the user's `forth_bootslop` does for stack machine bytecode) to compose C11 ops in Python. That's a different track.
- Having Python *extend* the chunk-array with user-defined per-element callbacks (like `list(map(fn, arr))`) that run at C speed: **not tractable**. Cython can compile Python-ish syntax to C, but the duffle.h style doesn't fit Cython's type system. The workaround is to ship pre-baked operations (`push`, `at`, `iter_chunks`, `filter_chunk(fn_ptr)`) and let users choose from those, not define new ones in Python.
- Making the chunk-array *cross-implementation* (CPython + PyPy + RustPython): **not tractable** with the C extension approach. Use HPy (a newer Python C API targeting multiple implementations) if this matters. HPy has its own style and would need an adapter.

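The pre-baked-operations workaround can be pictured as a small registry that Python only selects from and orders. In the sketch below the operations are plain Python stand-ins for what would be C functions in the real package, and every name (`op_scale`, `op_keep_even`, `PREBAKED`, `compile_pipeline`) is hypothetical. The point it illustrates is the dispatch granularity: one call per chunk per operation, not one per element.

```python
# Pre-baked per-chunk operations. In the real package these would be C
# functions; here they are plain Python stand-ins with hypothetical names.
def op_scale(factor):
    return lambda chunk: [x * factor for x in chunk]


def op_keep_even():
    return lambda chunk: [x for x in chunk if x % 2 == 0]


PREBAKED = {"scale": op_scale, "keep_even": op_keep_even}


def compile_pipeline(steps):
    """Resolve (name, *args) steps to callables once, before iterating."""
    resolved = [PREBAKED[name](*args) for name, *args in steps]

    def run(chunks):
        for chunk in chunks:
            for op in resolved:  # one dispatch per chunk per operation
                chunk = op(chunk)
            yield chunk

    return run


pipeline = compile_pipeline([("scale", 3), ("keep_even",)])
result = list(pipeline([[1, 2, 3, 4], [5, 6]]))
print(result)
```

Users can reorder and parameterize the steps freely, but an operation that is not in the registry raises `KeyError`; defining a new one still means writing C, which is exactly the limit described above.
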
**The "numpy DSL" the user mentioned:** the closest analog is **Cython's typed memoryviews** or **NumPy's `ndarray` protocol**. Both give you "Python can see a chunk of C memory and operate on it efficiently." Neither is a literal DSL; both are ABI/protocol layers. If the user wants a Python-side DSL for *composing* chunk operations, that's a separate design problem (Cython-like compile-to-C, or a small Python AST → C11 emitter).

## 2.6 The recommended path forward for chunkification_optimization

**Don't start with C11.** Start with **pure Python chunkification** of the target (the `comms.log` ring buffer in `app_controller.py:716`). Verify:

- The chunk pattern delivers a measurable speedup
- The API is ergonomic from Python
- The thread-safety story is correct
- The serialization/deserialization path still works

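As a starting point for that pure-Python step, a chunked bounded log might look like the sketch below. It makes no claim about how the existing `comms.log` buffer in `app_controller.py` is structured; `ChunkedLog`, its parameters, and the per-chunk eviction policy are all assumptions made for illustration. A single lock covers the thread-safety check, and `snapshot()` is the hook a serialization path would read from.

```python
import threading
from collections import deque


class ChunkedLog:
    """Bounded append-only log stored as fixed-size chunks (hypothetical)."""

    def __init__(self, chunk_size: int = 256, max_chunks: int = 16) -> None:
        self.chunk_size = chunk_size
        # deque(maxlen=...) drops the oldest chunk when a new one is appended
        # past the limit, so eviction happens per chunk, not per entry.
        self._chunks = deque(maxlen=max_chunks)
        self._lock = threading.Lock()

    def append(self, entry: str) -> None:
        with self._lock:
            if not self._chunks or len(self._chunks[-1]) == self.chunk_size:
                self._chunks.append([])
            self._chunks[-1].append(entry)

    def snapshot(self) -> list:
        """Copy out all retained entries, oldest first."""
        with self._lock:
            return [entry for chunk in self._chunks for entry in chunk]

    def __len__(self) -> int:
        with self._lock:
            return sum(len(chunk) for chunk in self._chunks)


log = ChunkedLog(chunk_size=3, max_chunks=2)
for i in range(8):
    log.append(f"msg {i}")
retained = log.snapshot()
print(retained)
```

Evicting a whole chunk at once means the retained count fluctuates between `(max_chunks - 1) * chunk_size + 1` and `max_chunks * chunk_size`, which is the trade-off to measure against the current buffer.
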
**Then, if the user wants the C11 lego-set:**

- Build the duffle.h-style C11 chunk-array (one type, ~200 lines)
- Build the CPython C extension wrapper (~150 lines of glue)
- Build the NumPy-compatible 1D view (lets existing Python code consume the chunk-array)
- Optional: add a few pre-baked chunk-aware operations (`filter_chunks`, `map_chunks`, `reduce_chunks`) in C, exposed as Python methods
- Optional: build a "lego-set" Python API that lets users compose pre-baked operations without writing C

**Defer the "Python-defined chunk-aware callback" goal.** It's the most ambitious, requires either Cython or a custom AST emitter, and is not clearly worth the complexity for a single project.

## 2.7 The 5 questions to ask the user (before this becomes a track)

These map directly to the design decisions in §2.3-§2.6:

1. **Build cost acceptable?** A custom C extension needs roughly half a day of one-time build setup (pyproject.toml, compiler config, wheel build).
2. **Per-element overhead target?** Native (~50ns) requires the C extension. ctypes is ~1-5µs (20-100x slower). What's the SLA?
3. **NumPy interop required?** If yes, the C extension must expose the underlying memory as a 1D ndarray view (one-time setup).
4. **Cross-implementation?** CPython only? Or HPy for CPython+PyPy? The two imply very different code styles.
5. **Lego-set composition in Python?** Pre-baked ops (push, at, iter_chunks, filter_chunks) are tractable. User-defined Python→C11 callbacks are not (without Cython or a custom AST emitter).

## 2.8 The crucial insight

The user said: *"the way I would define the C11 package or interop stuff would be unorthodox and would follow a similar pattern to what you would fine in either my forth_bootslop repo or my pikuma ps1 repo."*

Reading both repos carefully (and the user's correction that they're "not really an interop pattern, I just wanted to show how I like todo C11"), the implication is:

- The user is comfortable with a **single C11 .h file** as the entire interop boundary
- The user is **not** going to write a complex pybind11 C++ layer or a Cython .pyx file
- The user is **comfortable with a thin CPython C extension** if the C11 code stays in their style

The most likely path the user would actually take, given their style and their "lots of ambiguities" caveat:

- Write the chunk-array in duffle.h style as a single header
- Wrap it with a small `PyTypeObject` block at the bottom of the same file (or a separate `chunks_module.c` that includes the header)
- Build it with `uv` + `pyproject.toml`
- Import it from Manual Slop and verify the speedup on `comms.log`

That's tractable. The "lego set of composable Python-driven chunk operations" is a stretch goal that requires more design work, and probably isn't needed for the comms.log target.

---

## 3. The non-recommendations

**Don't do any of these:**

- **pybind11.** Style mismatch. C++ is not the user's idiom.
- **Cython.** The user writes pure C11 with macros. Cython is Python-with-C-type-annotations. Style mismatch.
- **Rust + PyO3.** The user writes C, not Rust. PyO3 is great for Rust shops, not relevant here.
- **HPy.** Cross-implementation matters less than style fit. Revisit if PyPy becomes a target.
- **Pure Python implementation of the lego-set pattern as the end state.** Defeats the point. If you're not crossing the FFI boundary, you don't need C11. (The pure-Python prototype in §2.6 is a validation step, not the lego-set itself.)

## 4. Summary verdict

| The user's question | The honest answer |
|---|---|
| Can chunk-based C11 interop with Python? | Yes, via custom CPython C extension. ~150 lines of glue per chunk-array type. |
| Is it worth the cost? | Depends on the use case. For `comms.log`, the C extension is tractable. For "compose arbitrary C11 ops from Python," it's not (needs a Python→C emitter). |
| What does the lego-set pattern look like? | Pre-baked C operations exposed as Python methods (push, at, iter_chunks, filter_chunks). User-defined per-element Python callbacks running at C speed is not tractable. |
| What about numpy? | NumPy can zero-copy wrap the chunk-array as a 1D ndarray. Best for batch ops, not chunk-aware iteration. |
| What's the build cost? | One-time ~half-day (uv + pyproject.toml + C extension). Wheels for distribution optional. |
| What about HPy / cross-impl? | Not needed unless PyPy becomes a target. Stick with CPython C API. |
| What's the style fit with duffle.h? | High. The chunk-array is written in duffle.h style; the C extension wrapper is a thin `PyTypeObject` block at the bottom of the file. |

**Recommended action:**

1. **Verify the chunk pattern delivers value first.** Pure-Python chunkification of `comms.log` (or another target), measure, confirm.
2. **If C11 is desired, build the C extension in duffle.h style.** ~500 lines total (200 C array + 150 glue + 100 tests + 50 Python wrapper).
3. **If NumPy is the consumer, expose the 1D view.** One-time, ~20 lines of NumPy C API glue.
4. **Defer the "user-defined Python→C11 callback" goal** unless a specific use case demands it.

---

*End of assessment. The track `chunkification_optimization_20260608_PLACEHOLDER` is now scoped: tractable as a CPython C extension, scope = chunk-array in duffle.h style + thin C extension wrapper + NumPy 1D view. Not tractable as a lego-set of user-defined chunk operations. Next step is to confirm the use case (which target, which SLA) and pick the build path.*

*Cross-references for re-anchoring: `docs/reports/session_synthesis_20260608.md` §8.2 (the original proposal), `docs/ideation/ed_chunk_data_structures_20260523.md` (the user's chunk-ideation), `docs/transcripts/i-h95QIGchY_assuming_as_much_as_possible_andrewreece.txt` §56:42 (Reece's Xar reference impl).*