docs(tier2): append update to refined investigation (T-shirt done, layout didn't fix)

Per user feedback this round:

1. T-shirt size removed from conductor/workflow.md (policy), conductor/tracks.md (registry), and the prior NEGATIVE_FLOWS_INVESTIGATION_20260617.md report.
2. Layout regenerated from _default_windows (17KB -> 3KB, 10 stale windows -> 3). Layout fix did NOT fix the crash.

Three new diagnostic experiments (results appended to the report):

- diag_no_click.py: process survives 60s without clicks (render loop is stable in isolation; crash is click-triggered).
- diag_thread.py: standalone ThreadPoolExecutor + adapter call works fine in all 3 MOCK_MODE modes (subprocess spawn is not the issue).
- diag_realbig2_run.py: bumping threading.stack_size(8MB) does NOT prevent the crash (io_pool worker is not where the stack is exhausted).

Refined hypothesis: the crash is in the MAIN THREAD's imgui-bundle render loop (1.94 MB stack), running concurrently with the io_pool worker's adapter call. The subprocess spawn + CreateProcessW causes the kernel to allocate resources at the moment the main thread is deep in imgui-bundle C++ frames, exhausting the main thread's small guard page.

What's needed for definitive diagnosis: a Windows crash dump (procdump -ma or cdb.exe) to see the actual C-side stack frame, OR a SetUnhandledExceptionFilter in sitecustomize.py that logs the crashing thread's TEB and call stack to stderr before the process dies.
2026-06-17 12:25:29 -04:00
parent 54eb4740b3
commit 788ebbc608
1 changed files with 85 additions and 0 deletions
@@ -130,6 +130,91 @@ The most likely culprit is one of:
 - Move the GUI's render loop off the main thread (or use imgui-bundle's offscreen rendering mode) so the main thread is a thin renderer
 - Move all `subprocess.Popen` calls to dedicated subprocess worker pool

+
+## Update 2026-06-17 (post-user-feedback round)
+
+User feedback after the previous report:
+1. Remove the T-shirt size metric from all places encountered.
+2. Fix the layout (it was stale: 10 entries referenced deleted/renamed windows).
+3. The user correctly suspected "Something more fundamental is wrong" - the layout fix was a guess.
+
+### T-shirt size removal (done)
+
+Removed T-shirt size from:
+- `conductor/workflow.md` (the policy file) - removed the S/M/L/XL table, the replacement pattern row, and the "reasonable effort" guard's reference. Scope (N files, M sites, N tasks) is now the only effort dimension.
+- `conductor/tracks.md` (the registry) - removed the T-shirt column header and the Fable track entry's T-shirt mentions.
+- `docs/reports/NEGATIVE_FLOWS_INVESTIGATION_20260617.md` - removed the T-shirt mention in the follow-up suggestion.
+
+Track artifacts (`conductor/tracks/fable_review_20260617/metadata.json`, `conductor/tracks/result_migration_20260616/metadata.json`, their spec.md files) still have T-shirt references. These are historical track snapshots - left as records of past decisions.
+
+### Layout fix (done, didn't help)
+
+Regenerated `manualslop_layout.ini`: 17,360 bytes -> 3,361 bytes (102 windows -> 23 windows). The file now matches the windows registered in `_default_windows` in `src/app_controller.py` (lines 1862-1886). The docking section is preserved. The stale-window warning dropped from 10 windows to 3.
+
+**The layout fix did NOT fix the crash.** The process still dies with `rc=3221225725` (`0xC00000FD`, the NTSTATUS code `STATUS_STACK_OVERFLOW`) within 1s of the click.
+
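The exit code above is easier to read once converted to hex. The snippet below is a minimal sketch of that conversion; the helper name `describe_exit_code` and the small lookup table are illustrative and not part of the repository. The three NTSTATUS values listed are standard Windows codes.

```python
# Windows reports a fatal exception as an NTSTATUS value in the process exit code.
# This table is a hypothetical helper for this report; it lists only a few codes.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def describe_exit_code(rc: int) -> str:
    """Format a process return code as hex plus a symbolic name if known."""
    # Some tools report the code as a negative signed 32-bit integer,
    # so normalise it to the unsigned 32-bit form first.
    code = rc & 0xFFFFFFFF
    name = KNOWN_NTSTATUS.get(code, "unknown")
    return f"0x{code:08X} ({name})"


print(describe_exit_code(3221225725))
```

The mask also handles the signed form of the same code (`-1073741571`), which some launchers print instead of the unsigned value.
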
+### Three new diagnostic experiments (everything points at the main thread)
+
+**Experiment 1: No-click baseline (`diag_no_click.py`).** Spawned sloppy.py with hook server, did NO clicks, waited 60s polling status every 2s. **Process survived 60s.** So the render loop is stable in isolation; the crash is specifically triggered by the click chain.
+
+**Experiment 2: Standalone ThreadPoolExecutor (`diag_thread.py`).** Created a fresh ThreadPoolExecutor, called the adapter from a worker thread, tested all 3 MOCK_MODE values. **No crash, no stack overflow.** So the io_pool thread + adapter + subprocess stack usage is fine in isolation.
+
+**Experiment 3: Bumped io_pool to 8MB stack (`diag_realbig2_run.py`).** Used `threading.stack_size(8 * 1024 * 1024)` via sitecustomize.py, then spawned sloppy.py. Verified via the log: `[DIAGSTK] Set thread stack size to 8388608 bytes`. **Process STILL dies with 0xC00000FD.** So the io_pool worker's stack is not the bottleneck.
+
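One detail matters when reading Experiment 3: `threading.stack_size` only affects threads created after the call. It never resizes the main thread, whose stack is fixed when the process starts. The sketch below illustrates that behaviour; `run_with_stack` is a hypothetical helper and not the code in `diag_realbig2_run.py`.

```python
import threading


def run_with_stack(size_bytes, fn):
    """Run fn() in a new thread created with the requested stack size."""
    # stack_size(n) returns the previous setting (0 means "platform default")
    # and applies n only to threads created from now on.
    previous = threading.stack_size(size_bytes)
    result = {}

    def target():
        result["value"] = fn()

    try:
        worker = threading.Thread(target=target)
        worker.start()
        worker.join()
    finally:
        # Restore the earlier setting so unrelated threads are not affected.
        threading.stack_size(previous)
    return result.get("value")


# Inside the worker, the configured size is visible; the main thread is unchanged.
print(run_with_stack(8 * 1024 * 1024, threading.stack_size))
```

This is why an 8MB setting can rule out the io_pool worker while saying nothing about the main thread.
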
+### Refined understanding
+
+Combining all the data:
+
+| What we know | What it means |
+|---|---|
+| Call depth at crash is 13 frames | Not runaway Python recursion; the call chain is too shallow to exhaust a stack by depth alone |
+| `threading.stack_size(8MB)` doesn't help | The io_pool worker (and `_loop_thread`) are not where the stack is exhausted |
+| Main thread stack is 1.94 MB (verified via `kernel32.GetCurrentThreadStackLimits`) | The only thread left with a small stack is the main thread |
+| Crash happens after `_send_gemini_cli` returns ok=False but before the "response" event is emitted | The crash is in the `ai_client.send -> _handle_request_event -> _on_api_event` chain OR in something concurrent with it (render loop on main thread) |
+| Standalone ThreadPoolExecutor + adapter works fine | The subprocess spawn is fine; the issue is specific to sloppy.py's environment |
+| Render loop is stable in isolation (no clicks) | The crash is triggered by the click -> worker -> adapter call chain |
+
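The main-thread stack figure in the table can be checked with a few lines of `ctypes`. This is a sketch under assumptions: the function name `current_thread_stack_bytes` is hypothetical, the call requires Windows 8 or later, and on other platforms the function simply returns `None`. It measures the thread that calls it, so it has to run on the thread of interest.

```python
import ctypes
import sys
from typing import Optional


def current_thread_stack_bytes() -> Optional[int]:
    """Return the calling thread's stack size in bytes (Windows only)."""
    if sys.platform != "win32":
        return None  # GetCurrentThreadStackLimits is a Windows API.

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    get_limits = kernel32.GetCurrentThreadStackLimits
    # VOID GetCurrentThreadStackLimits(PULONG_PTR LowLimit, PULONG_PTR HighLimit)
    get_limits.argtypes = [
        ctypes.POINTER(ctypes.c_size_t),
        ctypes.POINTER(ctypes.c_size_t),
    ]
    get_limits.restype = None

    low = ctypes.c_size_t()
    high = ctypes.c_size_t()
    get_limits(ctypes.byref(low), ctypes.byref(high))
    # The stack occupies the address range [low, high).
    return high.value - low.value


size = current_thread_stack_bytes()
print("not available on this platform" if size is None else f"{size / 2**20:.2f} MB")
```

Calling the same function from inside an io_pool worker would give a direct comparison between the worker's stack and the main thread's.
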
+### Most likely cause (re-formulated hypothesis)
+
+The crash is almost certainly in the **main thread**, not the io_pool worker. The main thread's imgui-bundle render loop is running concurrently with the io_pool worker's adapter call. When the click is processed:
+1. The io_pool worker calls `subprocess.Popen` (CreateProcessW on Windows)
+2. The Windows kernel allocates resources for the new process
+3. The main thread's render loop is in a frame draw call
+4. Some imgui-bundle native code in the render loop uses the C stack
+5. The main thread's 1.94 MB stack is exhausted
+
+The cmd_list debug print (in the io_pool worker) succeeds because the io_pool worker has an 8MB stack in Experiment 3. But the main thread is rendering concurrently and runs out.
+
+The "after `_send_gemini_cli` returns" timing is incidental - it just happens to be when the main thread's render loop hits the stack limit. The actual crash is in imgui-bundle's render code, not in the AI call chain.
+
+### What's needed for definitive diagnosis
+
+To find the actual C-side stack frame that's overflowing, we need:
+
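+1. **A Windows crash dump.** Run sloppy.py under a debugger. The debugger has to launch the Python interpreter, because sloppy.py is not an executable image (this assumes `python` is on PATH):
+   ```bash
+   cdb.exe -g -G -o python sloppy.py --enable-test-hooks
+   ```
+   Or use `procdump`, which needs `-x <dump folder>` to launch a process and monitor it:
+   ```bash
+   procdump -ma -e 1 -f "" -x . python sloppy.py --enable-test-hooks
+   ```
+   The .dmp file gives full call stacks for ALL threads at the moment of crash.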
+
+2. **Or: `SetUnhandledExceptionFilter` in sitecustomize.py** that dumps the crashing thread's TEB and call stack to stderr before the process dies. This avoids needing a debugger.
+
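A lighter first step, before either option above, might be Python's standard `faulthandler` module. This is a different technique from `SetUnhandledExceptionFilter`: on Windows it installs a handler for Windows exceptions and writes the Python-level traceback of every thread, which could at least show which thread was active at the crash. It does not show C-side frames, and it may not get to run at all if the overflowing thread has no stack left. The helper name `install_crash_logger` and the log path are assumptions for illustration.

```python
import faulthandler


def install_crash_logger(log_path):
    """Enable faulthandler so fatal errors dump all thread tracebacks to a file."""
    # Line-buffered text file; faulthandler writes through its file descriptor,
    # so the file object must stay open for the lifetime of the process.
    log_file = open(log_path, "w", buffering=1)
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Example for a sitecustomize.py-style hook (the path is hypothetical):
# _crash_log = install_crash_logger("logs/sloppy_faulthandler.log")
```

Writing to a dedicated file rather than stderr avoids losing the output when the parent process is not capturing the child's streams.
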
+### Files added in this round
+
+- `scripts/tier2/artifacts/send_result_to_send_20260616/diag_no_click.py` (no-click baseline - confirms crash is click-triggered)
+- `scripts/tier2/artifacts/send_result_to_send_20260616/diag_thread.py` (standalone ThreadPoolExecutor - confirms subprocess works in isolation)
+- `scripts/tier2/artifacts/send_result_to_send_20260616/diag_realbig2_run.py` (8MB thread stack - confirms io_pool worker is not the bottleneck)
+- `scripts/tier2/artifacts/send_result_to_send_20260616/diag_thread_stk_run.py` (instrumented thread.start logging)
+- `scripts/tier2/artifacts/send_result_to_send_20260616/regen_layout.py` (regenerates layout from `_default_windows`)
+- `scripts/tier2/artifacts/send_result_to_send_20260616/remove_tshirt3.py` (removes T-shirt from conductor files)
+- `logs/sloppy_no_click_*.log` (process alive after 60s, no clicks)
+- `logs/sloppy_diag2_*_after_layout.log` (process dies after layout fix)
+
+
 ## Files in this report

 - `docs/reports/THEME_BUG_ANALYSIS_send_result_to_send_20260616.md` (the prior theme fix report, restored in `8c6d9aa0`)