diff --git a/docs/reports/c11_python_interop_assessment_20260608.md b/docs/reports/c11_python_interop_assessment_20260608.md index e576052e..cdeb3a90 100644 --- a/docs/reports/c11_python_interop_assessment_20260608.md +++ b/docs/reports/c11_python_interop_assessment_20260608.md @@ -539,7 +539,9 @@ That's tractable. The "lego set of composable Python-driven chunk operations" is - **HPy.** Cross-implementation matters less than style fit. Revisit if PyPy becomes a target. - **Pure Python implementation of the lego-set pattern.** Defeats the point. If you're not crossing the FFI boundary, you don't need C11. -## 4. Summary verdict +## 4. Summary verdict (SUPERSEDED — see Part 3) + +The table in this section is the v1 verdict, written before the user's second correction (Part 3). Kept for the record, but **Part 3 is the action-oriented section.** | The user's question | The honest answer | |---|---| @@ -551,7 +553,7 @@ That's tractable. The "lego set of composable Python-driven chunk operations" is | What about HPy / cross-impl? | Not needed unless PyPy becomes a target. Stick with CPython C API. | | What's the style fit with duffle.h? | High. The chunk-array is written in duffle.h style; the C extension wrapper is a thin `PyTypeObject` block at the bottom of the file. | -**Recommended action:** +**Original recommended action (v1):** 1. **Verify the chunk pattern delivers value first.** Pure-Python chunkification of `comms.log` (or another target), measure, confirm. 2. **If C11 is desired, build the C extension in duffle.h style.** ~500 lines total (200 C array + 150 glue + 100 tests + 50 Python wrapper). 3. **If NumPy is the consumer, expose the 1D view.** One-time, ~20 lines of NumPy C API glue. @@ -559,6 +561,283 @@ That's tractable. The "lego set of composable Python-driven chunk operations" is --- -*End of assessment. 
The track `chunkification_optimization_20260608_PLACEHOLDER` is now scoped: tractable as a CPython C extension, scope = chunk-array in duffle.h style + thin C extension wrapper + NumPy 1D view. Not tractable as a lego-set of user-defined chunk operations. Next step is to confirm the use case (which target, which SLA) and pick the build path.* +# PART 3 — Revised Verdict (after the user's second correction) -*Cross-references for re-anchoring: `docs/reports/session_synthesis_20260608.md` §8.2 (the original proposal), `docs/ideation/ed_chunk_data_structures_20260523.md` (the user's chunk-ideation), `docs/transcripts/i-h95QIGchY_assuming_as_much_as_possible_andrewreece.txt` §56:42 (Reece's Xar reference impl).* +## 3.1 The second user-correction (verbatim) + +> "This seems like it would only be worth it if I reach a hard constraint that I cannot solve with an existing python package. Then I could make a custom pipelien to deal with the hot data set witha custom cpython extension. Such as, parsing markdown files or sources int aggregate markdown, context snapshot processing and possibly other things in the future. The python would have to define the payload in a simple text or binary format as the request and then the extenion pipeline in C11 would do the ops and provide the output in another binary or text blob/s." + +## 3.2 What the second correction changed + +Two distinct moves, both significant: + +**Move 1 — threshold-shift on *when* to bother:** +> "only worth it if I reach a hard constraint that I cannot solve with an existing python package" + +This inverts the default. v1 framed the chunkification_optimization track as "if you want the C11 path, here's how to build it." v2 frames it as "don't build it until a hard constraint forces the issue, and *here's the specific shape* of the build when that day comes." 
+ +**Move 2 — shape-change on *what* to build:** +> "the python would have to define the payload in a simple text or binary format as the request and then the extension pipeline in C11 would do the ops and provide the output in another binary or text blob/s" + +This is **not** a stateful C extension with a Python-facing API. It is a **request/response blob pipeline**: + +``` + Python user-space C11 pipeline + ┌──────────────────┐ ┌──────────────────┐ + │ 1. Assemble │ │ │ + │ request: │ request.bin │ parse request │ + │ {files: [...],│ ───────────────▶│ load payload │ + │ ops: [...], │ │ run ops │ + │ params: {}} │ │ format output │ + │ 2. Serialize to │ │ │ + │ blob (text or │ │ │ + │ binary) │ │ │ + │ 3. Hand to C11 │ response.bin │ │ + │ 4. Parse │ ◀───────────────│ │ + │ response │ │ │ + └──────────────────┘ └──────────────────┘ +``` + +**This is strictly better than the v1 framing in 4 ways:** + +1. **Composition in Python is trivial.** The "lego set" the user worried about isn't a problem: the Python side composes the *request*, and the C side just executes the pre-defined op pipeline. No Python→C11 emitter needed. +2. **The wire format IS the contract.** Both sides agree on a schema (text or binary), not on a Python type. The C side has zero knowledge of `PyObject` / `PyTypeObject` / refcounting. The Python side has zero knowledge of `FArena` / `Slice` / `U8`. Cleanest possible boundary. +3. **Per-op FFI cost is zero.** There's exactly one FFI call per pipeline run, not per element. The "ctypes per-call overhead defeats the purpose" concern from v1 §2.2.1 disappears. +4. **State-free C side.** The C pipeline reads the request, runs ops, writes the response, exits. No need to maintain Python refcount discipline over a long-lived C object. The C side is a pure function `process(request_bytes) -> response_bytes`. 
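The request/response shape described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the planned implementation: `process` is a pure-Python stand-in for the C11 pipeline, JSON stands in for the still-undecided wire format, and the op names (`strip`, `prefix`) and helper names (`build_request`, `parse_response`) are hypothetical.

```python
import json

# Hypothetical op registry: stand-ins for the fixed set of ops the C11 side would own.
OPS = {
    "strip": lambda text, params: text.strip(),
    "prefix": lambda text, params: params["with"] + text,
}


def build_request(payload: str, ops: list[dict]) -> bytes:
    """Python side: compose the request envelope and serialize it to one blob."""
    return json.dumps({"ops": ops, "payload": payload}).encode("utf-8")


def process(request_bytes: bytes) -> bytes:
    """Stand-in for the C11 pipeline: a pure function from request blob to response blob."""
    request = json.loads(request_bytes.decode("utf-8"))
    text = request["payload"]
    for op in request["ops"]:
        handler = OPS.get(op["name"])
        if handler is None:
            # Unknown ops are reported in the response blob, not raised across the boundary.
            error = {"exit_code": 1, "error": f"unknown op: {op['name']}"}
            return json.dumps(error).encode("utf-8")
        text = handler(text, op.get("params", {}))
    return json.dumps({"exit_code": 0, "output": text}).encode("utf-8")


def parse_response(response_bytes: bytes) -> dict:
    """Python side: decode the response blob back into a plain dict."""
    return json.loads(response_bytes.decode("utf-8"))


if __name__ == "__main__":
    request = build_request(
        "  hello  ",
        [{"name": "strip"}, {"name": "prefix", "params": {"with": "# "}}],
    )
    print(parse_response(process(request)))
```

The point of the sketch is the boundary: all composition happens in `build_request`, the executor sees only bytes, and there is exactly one crossing per pipeline run. Swapping the stand-in for a real binary should not change the Python call sites as long as the blob format is held fixed.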
+ +## 3.3 The two target use cases, grounded in actual code + +### 3.3.1 Target 1: parsing markdown files / sources into aggregate markdown + +**Current state** (read from `src/aggregate.py:380-454` `build_markdown_from_items` + `src/summarize.py:7-219`): +- The aggregate pipeline builds markdown by **pure Python string concatenation** (`f"### \`{original}\`\n\n\`\`\`{suffix}\n{skeleton}\n\`\`\""` and `"\n\n---\n\n".join(sections)`) +- `_summarise_markdown` in `summarize.py` only extracts headings; it does NOT parse the body +- **`pyproject.toml` has zero third-party markdown dependencies** (`mistune`, `markdown-it-py`, `commonmark-py`, `markdown` are all *not* in the deps) +- `build_file_items` at `aggregate.py:142` does the path resolution + content reading; `build_markdown_from_items` does the string-concat assembly; `summarize.summarise_file` is called per-file for non-focus tiers + +**Where the actual bottleneck is (right now):** +- The string concatenation in `build_markdown_from_items`: Python's f-strings are fast, but `"\n\n---\n\n".join(sections)` over a list of ~50-500 sections scales linearly +- The `parser.get_skeleton(content)` call in `aggregate.py:444` for every `.py` file in the composition +- The `mcp_client.py_get_definition` / `mcp_client.ts_cpp_get_*` calls for masked symbols +- The `summarize.summarise_file` calls per file + +**Where the bottleneck would be IF real markdown parsing were added:** +- Adding a markdown parser (e.g., `markdown-it-py`) to extract structural elements (headings, code blocks, links) for navigation/context-aware aggregation +- For projects with many `.md` files (e.g., `docs/` with 14 guides, 30+ IDE markdown files), the parse cost would dominate + +**Is this a hard constraint that Python packages can't solve?** +- **No, today.** `markdown-it-py` is a well-maintained pure-Python parser (a port of the JavaScript markdown-it) with a clean token/AST API. Rough, unmeasured expectations are ~10x faster than `python-markdown` and ~50x faster than ad hoc regex parsing; confirm with a benchmark before relying on those figures. 
Adopting it is a one-line `pyproject.toml` change, not a C11 build. +- **Possibly yes, in the future.** If the user adds cross-file markdown analysis (TOC generation, link graph, code-block extraction across many files) at runtime, the cumulative parse time for hundreds of files could push past `markdown-it-py`'s comfort zone. **That would be the hard constraint.** + +**When to act:** the moment the markdown-parse hot path becomes a real bottleneck in profiling (i.e., the user can demonstrate via `performance_monitor.py` that `build_markdown_from_items` is the slow part of a real workflow). Until then, the existing Python path is fine, and `markdown-it-py` is the first thing to try. + +### 3.3.2 Target 2: context snapshot processing + +**Current state** (read from `src/history.py:1-141`): +- `UISnapshot` is a `@dataclass` with 13 fields. The "large" fields are `disc_entries: list[dict]`, `files: list[dict]`, `context_files: list[dict]`, `screenshots: list[str]` +- `HistoryManager` is a small Python class. `push` / `undo` / `redo` / `jump_to_undo` are the only mutating ops +- Snapshot capacity is 100 (default in `HistoryManager.__init__`) +- The actual work is `UISnapshot.to_dict` and `from_dict`: deep copies of nested dicts + +**Where the actual bottleneck is:** +- The `to_dict` / `from_dict` deep-copies. 100 snapshots × ~5KB each = 500KB of nested dict copying per push/undo. At a 60 FPS push rate, that's 30MB/s of dict copying. Python is not great at that, but **pushes are debounced** (per `docs/guide_state_lifecycle.md`; render frame at `gui_2.py:1140-1170`), so the actual rate is much lower +- The list copy of `disc_entries` is the heaviest single op (a 23-op matrix can have ~50-200 entries per snapshot) + +**Is this a hard constraint that Python packages can't solve?** +- **No, today.** Python's `copy.deepcopy` is the canonical answer; `pickle` round-trips are plausibly 5-10x faster than `to_dict`/`from_dict` for nested data (an estimate, not yet measured on this codebase). 
If snapshot capture is slow, the fix is to switch to `pickle` (or to `msgspec` / `orjson` for json-like schemas), not C11. +- **Possibly yes, in the future.** If snapshots grow to MB-scale (e.g., per-frame UI state for video-game-like content) and push rate goes up (e.g., per-frame state push during a long session), the cumulative cost would matter. **That would be the hard constraint.** + +**When to act:** the moment `push()` in `history.py` shows up as a hot spot in a profile. Until then, switching to `pickle` is the cheap fix. + +## 3.4 The request/response wire format (the contract) + +The user said *"simple text or binary format as the request and then the extension pipeline in C11 would do the ops and provide the output in another binary or text blob/s."* + +Two options on the table. The choice has real implications: + +### 3.4.1 Option A: text (line-based, debuggable) + +``` +# request.txt +op parse_md +op summarise_python +op mask_symbols @sym1 def @sym2 sig +op build_section tier=3 +input file src/foo.py +input file src/bar.py +format markdown_v3 +end +``` + +- Pros: human-readable, greppable, version-controllable, easy to debug (you can `cat` the request and the response) +- Cons: parsing cost on the C side (strncmp per op), bigger payload, slower to roundtrip + +### 3.4.2 Option B: binary (msgpack / protobuf / custom) + +``` +[1 byte: format version] +[1 byte: op_count] +[for each op: + [1 byte: op_id] + [varint: param_count] + [for each param: + [1 byte: type_id] + [varint: byte_len] + [bytes: value]]] +[varint: input_count] +[for each input: + [varint: byte_len] + [bytes: file_path]] +[for each input file blob: + [varint: byte_len] + [bytes: file_content]] +``` + +- Pros: fast to parse (estimated ~1-10µs per op on the C side), small payload, deterministic +- Cons: not human-readable, harder to debug, format versioning required, and the Python encoder and C decoder must be kept in lockstep + +**The recommendation:** start with text for v1 (debuggability > speed when you're not sure what the ops look like), 
switch to binary for v2 if profiling shows the parse cost matters. The wire format is the *only* contract, so it's also the *only* thing you have to maintain compat with. + +A reasonable middle path: **text for the *envelope* (which ops to run, which params), binary for the *payloads* (file contents, result blobs).** This way you can `cat` the envelope to debug, and the heavy bytes move binary-only. + +## 3.5 The pipeline API (what the C11 side exposes) + +If we adopt the request/response model, the C11 side has exactly one entry point: + +```c +// chunks_module.c (hypothetical) +// Returns: response blob (caller frees) +// Args: request blob (opaque, owned by caller) +typedef Struct_(PipelineResponse) { + U8* bytes; + U8 len; + U4 exit_code; // 0 = success, non-zero = error + Str8 error_msg; // optional, only populated on error +}; + +IA_ PipelineResponse pipeline_run(Slice request); +``` + +The C side: +1. Parses the request envelope (op list + params + input file list) +2. Loads the requested input files (or accepts inline blobs) +3. Runs each op in order +4. Collects the output into a single response blob +5. Returns the blob + exit code + +The Python side: +1. Builds the request envelope (text or binary) +2. Subprocess-launches the C pipeline binary (or calls via ctypes) with the request on stdin +3. Reads the response from stdout +4. Parses the response (text or binary) +5. 
Returns the parsed result to the calling code + +**The subprocess model is strongly recommended over the in-process FFI model for v1**: +- Zero FFI surface (no ctypes, no PyTypeObject, no refcount discipline) +- Trivially testable (the C binary can be run from the shell, results compared) +- Total process isolation (C crash doesn't take down the Python process) +- ~10-20ms startup tax per call (acceptable for batch ops, not for hot loops) +- Easy to swap implementations (rewrite the C binary, keep the wire format) + +If profiling later shows the subprocess startup is the bottleneck, switch to in-process via ctypes. The wire format doesn't change. + +## 3.6 The "chunkification" question, revisited + +The original `chunkification_optimization_20260608_PLACEHOLDER` track was about replacing growable buffers (`comms.log`, `summary_cache`, etc.) with chunk-based data structures (Reece's Xar pattern, duffle.h style). + +**Under the new framing:** +- If the *target* (`comms.log` etc.) is on a hot path that an existing Python package *can't* solve, build a C11 pipeline that takes a request like `{op: append_chunk, arena: comms, data: {...}}` and returns `{status: ok, count: 42}`. The C side owns the chunk-array as a *private* data structure; the Python side never sees it. +- The chunk-array is now an *implementation detail* of the C pipeline, not a *Python data type*. The user's "lego set" worry is moot because Python doesn't have direct access to the lego set — it only has the request/response protocol. + +**This is much cleaner than the v1 framing** (stateful C extension with Python-facing API). The chunk-array is internal to the C pipeline. Python user-space has zero access to the underlying memory layout. The wire format is the entire surface area. + +## 3.7 When to act (the decision tree) + +``` +Is the target code path actually a bottleneck in profiling? +├── No → Don't act. 
Use existing Python packages (`markdown-it-py`, +│ `pickle`, `msgspec`, `orjson`, `numpy`, `pandas` as appropriate). +│ Re-evaluate next quarter. +│ +└── Yes → Is the bottleneck solvable with existing Python packages? + ├── Yes (e.g., switch `to_dict`/`from_dict` to `pickle`) → Apply that fix. + │ Cost: hours. Don't reach for C11. + │ + └── No (existing packages aren't fast enough or can't do the op) → Build the C11 pipeline: + 1. Define the wire format (text v1, binary v2) + 2. Write the C11 pipeline binary in duffle.h style + 3. Write the Python wrapper that builds requests and parses responses + 4. Ship as a subprocess (not in-process FFI) for v1 + 5. Add an in-process FFI path only if subprocess startup is the new bottleneck + 6. Profile: confirm the C11 path is actually faster than the Python baseline + 7. If not faster, throw away the C11 code and try a different Python package +``` + +**Default action for the current session: don't build the C11 pipeline.** No profiling has been done; no existing Python package has been ruled out. The hard constraint doesn't exist yet. + +## 3.8 The 4 questions to revisit when a hard constraint actually surfaces + +These are the design decisions that have to be made *when* (not before) the user hits a real bottleneck: + +1. **Which target?** Is it markdown parsing, snapshot processing, log aggregation, RAG indexing, or something else? Each has different op shapes, different request schemas, different response schemas. +2. **Subprocess or in-process FFI?** Start with subprocess (zero FFI surface, ~10-20ms startup tax). Move to in-process only if startup cost is the new bottleneck. +3. **Text or binary wire format?** Text v1 (debuggable, slower). Binary v2 (fast, not debuggable). Envelope-text + payload-binary middle ground. +4. **One pipeline binary or many?** One binary with an op registry is simpler to build/test/deploy. Many binaries (one per op) is more modular but harder to coordinate. 
Recommend one binary with a registry. + +## 3.9 The crucial insight (revised) + +**v1's insight:** "The user's 'unorthodox' interop is most likely a single duffle.h-style C11 .h file with a thin PyTypeObject block at the bottom. Tractable." + +**v2's insight (the better one):** "The C11 side doesn't need to be a Python-aware module at all. It can be a standalone binary that takes a request on stdin, runs ops, returns a response on stdout. Python user-space just shells out. Zero FFI surface. Zero refcount discipline. The wire format is the contract, period." + +The v2 model is **strictly more tractable** than v1: +- No `pyproject.toml` build hook required +- No `PyTypeObject`, no `PyMethodDef`, no `PyArg_ParseTuple` +- No Python GIL concerns +- No CPython version compat (works with any Python that can `subprocess.run()`) +- Testable from the shell (`echo 'op foo' | ./pipeline_bin` returns the response) +- Deployable as a single binary, or a wheel that bundles the binary +- The C11 code is 100% duffle.h style, no Python adaptation needed + +**The cost trade-off:** subprocess startup is ~10-20ms per call. For batch ops (parse 100 markdown files, generate 100 snapshots, build one big context) this is fine. For per-frame hot loops (e.g., 60 FPS text rendering) it's not. If a target is per-frame, the v1 in-process FFI model is required; otherwise, the v2 subprocess model is strictly better. + +## 3.10 What this means for the track + +**`chunkification_optimization_20260608_PLACEHOLDER`** is no longer a track. It is a **contingency** that activates when a hard constraint surfaces. The contingency plan is: + +1. **Default: don't build.** Use existing Python packages. Re-evaluate quarterly. +2. **If a hard constraint surfaces:** build the v2 subprocess pipeline model. Wire format is the contract. C11 code is duffle.h-style standalone binary. Python wrapper is a thin `subprocess.run()` caller. +3. 
**Track artifact, deferred:** the `chunkification_optimization_20260608_PLACEHOLDER` directory should hold a 1-page "contingency plan" doc (essentially a copy of this §3) rather than a full spec/plan. Promote to a full track when the first hard constraint surfaces. + +**`manual_ux_validation_20260608_PLACEHOLDER`** (the other v1 proposal) is **unaffected** by this correction. It remains a small, well-scoped track to promote the ASCII-sketch UX workflow. + +## 3.11 The honest re-verdict matrix (v2) + +| The user's question | The honest answer (v2) | +|---|---| +| When is the C11 path worth the cost? | Only when a hard constraint surfaces that no existing Python package can solve. Default: don't build. | +| What does the C11 path look like? | A standalone subprocess binary. Request in (text or binary), response out. Zero Python-awareness. Wire format is the contract. | +| How does Python compose chunk operations? | It composes the *request envelope* (which ops to run, with which params), not the C ops themselves. The C side just executes the pre-defined op list. No Python→C11 emitter needed. | +| What's the per-op overhead? | Zero FFI overhead (subprocess model). ~10-20ms per call (subprocess startup). Acceptable for batch ops, not for per-frame hot loops. | +| What about numpy? | NumPy is a *Python* package; the question doesn't apply to the v2 model. The C pipeline is its own world, with its own data structures. NumPy doesn't help here. | +| What's the build cost? | One-time ~half-day (just a C binary, no Python integration). Build by running `clang` on the .c file from a small build script that `uv run` can invoke. | +| What about HPy / cross-impl? | Not relevant; the v2 model is a standalone subprocess, no Python implementation specifics. | +| What's the style fit with duffle.h? | Perfect. The C pipeline is 100% duffle.h style. No Python adaptation. | +| What's the wire format? | The user chooses. 
Recommend text-v1 (debuggable) → binary-v2 (fast) as the workload justifies. | +| What's the deploy shape? | Single C binary. Python `subprocess.run()` to call. Optional wheel that bundles the binary. | +| What about in-process FFI? | Skip for v1. Add later if subprocess startup is the new bottleneck. The wire format doesn't change. | + +## 3.12 Summary (v2, the action-oriented section) + +**Don't build anything yet.** Profile first; adopt existing Python packages; only reach for C11 when an existing package *can't* solve the bottleneck. The user said this directly: *"only worth it if I reach a hard constraint that I cannot solve with an existing python package."* + +**When you do build, the shape is:** subprocess C11 binary + wire format contract + thin Python `subprocess.run()` wrapper. No FFI, no PyTypeObject, no refcount discipline, no Python adaptation of the C code. The chunk-array (or whatever data structure) lives entirely inside the C binary; Python only sees request/response blobs. + +**`chunkification_optimization_20260608_PLACEHOLDER`** should become a 1-page contingency plan, not a full track. Promote to a track when (if) the first hard constraint surfaces. + +**`manual_ux_validation_20260608_PLACEHOLDER`** (Track #1 from the v1 proposal) is unaffected and remains a small, well-scoped track. Confirmed worth doing in the user's first message ("I love the idea and definitely see poitental"). + +--- + +*End of v2 assessment. The 2 user-corrections in this session (style reference, then request/response model) reshaped the answer from "build a stateful C extension" to "don't build anything, here's the contingency plan for when you do." Track #1 (manual_ux_validation) is confirmed. 
Track #2 (chunkification) is downgraded to a contingency document.* + +*Cross-references for re-anchoring: `docs/reports/session_synthesis_20260608.md` §8.2 (the original v1 proposal), `docs/ideation/ed_chunk_data_structures_20260523.md` (the user's chunk-ideation), `docs/transcripts/i-h95QIGchY_assuming_as_much_as_possible_andrewreece.txt` §56:42 (Reece's Xar reference impl), `src/aggregate.py:380-454` (the actual current markdown hot path), `src/history.py:1-141` (the actual current snapshot hot path), `pyproject.toml:6-27` (the current zero-markdown-deps state).*