diff --git a/docs/reports/NEGATIVE_FLOWS_INVESTIGATION_20260617_REFINED.md b/docs/reports/NEGATIVE_FLOWS_INVESTIGATION_20260617_REFINED.md
index 07f0dedb..4f8cb3fd 100644
--- a/docs/reports/NEGATIVE_FLOWS_INVESTIGATION_20260617_REFINED.md
+++ b/docs/reports/NEGATIVE_FLOWS_INVESTIGATION_20260617_REFINED.md
@@ -130,6 +130,91 @@ The most likely culprit is one of:
 - Move the GUI's render loop off the main thread (or use imgui-bundle's offscreen rendering mode) so the main thread is a thin renderer
 - Move all `subprocess.Popen` calls to dedicated subprocess worker pool
+
+## Update 2026-06-17 (post-user-feedback round)
+
+User feedback after the previous report:
+1. Remove the T-shirt size metric from all places encountered.
+2. Fix the layout (it was stale - 10 windows referencing deleted/renamed windows).
+3. The user correctly suspected "Something more fundamental is wrong" - the layout fix was a guess.
+
+### T-shirt size removal (done)
+
+Removed T-shirt size from:
+- `conductor/workflow.md` (the policy file) - removed the S/M/L/XL table, the replacement pattern row, and the "reasonable effort" guard's reference. Scope (N files, M sites, N tasks) is now the only effort dimension.
+- `conductor/tracks.md` (the registry) - removed the T-shirt column header and the Fable track entry's T-shirt mentions.
+- `docs/reports/NEGATIVE_FLOWS_INVESTIGATION_20260617.md` - removed the T-shirt mention in the follow-up suggestion.
+
+Track artifacts (`conductor/tracks/fable_review_20260617/metadata.json`, `conductor/tracks/result_migration_20260616/metadata.json`, their spec.md files) still have T-shirt references. These are historical track snapshots - left as records of past decisions.
+
+### Layout fix (done, didn't help)
+
+Regenerated `manualslop_layout.ini`: 17,360 bytes -> 3,361 bytes (102 windows -> 23 windows). Now matches the windows registered in `src/app_controller.py` `_default_windows` (lines 1862-1886). Docking section preserved. Stale window warning dropped from 10 windows to 3.
+
+**The layout fix did NOT fix the crash.** Process still dies with `rc=3221225725` (`0xC00000FD`) within 1s of click.
+
+### Three new diagnostic experiments (everything points at the main thread)
+
+**Experiment 1: No-click baseline (`diag_no_click.py`).** Spawned sloppy.py with hook server, did NO clicks, waited 60s polling status every 2s. **Process survived 60s.** So the render loop is stable in isolation; the crash is specifically triggered by the click chain.
+
+**Experiment 2: Standalone ThreadPoolExecutor (`diag_thread.py`).** Created a fresh ThreadPoolExecutor, called the adapter from a worker thread, tested all 3 MOCK_MODE values. **No crash, no stack overflow.** So the io_pool thread + adapter + subprocess stack usage is fine in isolation.
+
+**Experiment 3: Bumped io_pool to 8MB stack (`diag_realbig2_run.py`).** Used `threading.stack_size(8 * 1024 * 1024)` via sitecustomize.py, then spawned sloppy.py. Verified via the log: `[DIAGSTK] Set thread stack size to 8388608 bytes`. **Process STILL dies with 0xC00000FD.** So the io_pool worker's stack is not the bottleneck.
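The mechanism behind Experiment 3 can be sketched as follows. This is a minimal illustration, not the contents of the actual sitecustomize.py or `diag_realbig2_run.py`; the function and thread names are hypothetical. It shows that `threading.stack_size` changes the setting used for threads created after the call, and that the call returns the previous setting so it can be restored.

```python
import threading

STACK_BYTES = 8 * 1024 * 1024  # 8388608, the value in the [DIAGSTK] log line


def stack_size_setting_during_worker(size=STACK_BYTES):
    """Set the thread stack size, run a worker, and return the setting in effect while it ran."""
    previous = threading.stack_size(size)  # returns the old setting (0 means platform default)
    seen = {}

    def worker():
        # Calling stack_size() with no argument only reads the current setting.
        seen["size"] = threading.stack_size()

    try:
        thread = threading.Thread(target=worker, name="diag-worker")  # hypothetical name
        thread.start()
        thread.join()
    finally:
        threading.stack_size(previous)  # restore the earlier setting
    return seen["size"]


if __name__ == "__main__":
    print(stack_size_setting_during_worker())
```

The setting only applies to threads that Python creates afterwards. The main thread's stack is fixed when the process starts (on Windows, from the executable's PE header), so this experiment could not have enlarged it.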
+
+### Refined understanding
+
+Combining all the data:
+
+| What we know | What it means |
+|---|---|
+| Call depth at crash is 13 frames | Not Python recursion; not call depth |
+| `threading.stack_size(8MB)` doesn't help | The io_pool worker (and `_loop_thread`) are not where the stack is exhausted |
+| Main thread stack is 1.94 MB (verified via `kernel32.GetCurrentThreadStackLimits`) | The only thread left with a small stack is the main thread |
+| Crash happens after `_send_gemini_cli` returns ok=False but before the "response" event is emitted | The crash is in the `ai_client.send -> _handle_request_event -> _on_api_event` chain OR in something concurrent with it (render loop on main thread) |
+| Standalone ThreadPoolExecutor + adapter works fine | The subprocess spawn is fine; the issue is specific to sloppy.py's environment |
+| Render loop is stable in isolation (no clicks) | The crash is triggered by the click -> worker -> adapter call chain |
+
+### Most likely cause (re-formulated hypothesis)
+
+The crash is almost certainly in the **main thread**, not the io_pool worker. The main thread's imgui-bundle render loop is running concurrently with the io_pool worker's adapter call. When the click is processed:
+1. The io_pool worker calls `subprocess.Popen` (CreateProcessW on Windows)
+2. The Windows kernel allocates resources for the new process
+3. The main thread's render loop is in a frame draw call
+4. Some imgui-bundle native code in the render loop uses the C stack
+5. The main thread's 1.94 MB stack is exhausted
+
+The cmd_list debug print (in the io_pool worker) succeeds because the io_pool worker has 8MB. But the main thread is rendering concurrently and runs out.
+
+The "after `_send_gemini_cli` returns" timing is incidental - it just happens to be when the main thread's render loop hits the stack limit. The actual crash is in imgui-bundle's render code, not in the AI call chain.
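The 1.94 MB main-thread figure above was obtained via `kernel32.GetCurrentThreadStackLimits`. A minimal sketch of such a probe is shown here, assuming a `ctypes` call from Python; the helper name is hypothetical and this is not necessarily how the original measurement was taken. Off Windows the helper returns `None` instead of measuring anything.

```python
import ctypes
import sys


def current_thread_stack_bytes():
    """Return the calling thread's stack span in bytes, or None when not on Windows."""
    if sys.platform != "win32":
        return None  # GetCurrentThreadStackLimits is a Windows-only API
    low = ctypes.c_size_t()
    high = ctypes.c_size_t()
    get_limits = ctypes.WinDLL("kernel32").GetCurrentThreadStackLimits
    # Both parameters are PULONG_PTR (pointer-sized unsigned integers).
    get_limits.argtypes = [ctypes.POINTER(ctypes.c_size_t), ctypes.POINTER(ctypes.c_size_t)]
    get_limits.restype = None  # the API returns void
    get_limits(ctypes.byref(low), ctypes.byref(high))
    return high.value - low.value


if __name__ == "__main__":
    size = current_thread_stack_bytes()
    if size is None:
        print("not on Windows; nothing to measure")
    else:
        print(f"current thread stack: {size / (1024 * 1024):.2f} MB")
```

Calling the helper from the main thread and again from an io_pool worker would let the two stack spans be compared directly, since the API always reports on the thread that calls it.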
+
+### What's needed for definitive diagnosis
+
+To find the actual C-side stack frame that's overflowing, we need:
+
+1. **A Windows crash dump.** Run sloppy.py under a debugger. Neither tool can start a `.py` file directly, so launch the interpreter (shown as `python` here; substitute whichever interpreter normally runs sloppy.py):
+   ```bash
+   # -g / -G: skip the initial and final breakpoints; -o: also debug child processes
+   cdb.exe -g -G -o python sloppy.py --enable-test-hooks
+   ```
+   Or use `procdump`:
+   ```bash
+   # -ma: full dump; -e: on unhandled exception; -x .: launch the target, write dumps to the current directory
+   procdump -ma -e -x . python sloppy.py --enable-test-hooks
+   ```
+   The .dmp file gives full call stacks for ALL threads at the moment of crash.
+
+2. **Or: `SetUnhandledExceptionFilter` in sitecustomize.py** that dumps the crashing thread's TEB and call stack to stderr before the process dies. This avoids needing a debugger.
+
+### Files added in this round
+
+- `scripts/tier2/artifacts/send_result_to_send_20260616/diag_no_click.py` (no-click baseline - confirms crash is click-triggered)
+- `scripts/tier2/artifacts/send_result_to_send_20260616/diag_thread.py` (standalone ThreadPoolExecutor - confirms subprocess works in isolation)
+- `scripts/tier2/artifacts/send_result_to_send_20260616/diag_realbig2_run.py` (8MB thread stack - confirms io_pool worker is not the bottleneck)
+- `scripts/tier2/artifacts/send_result_to_send_20260616/diag_thread_stk_run.py` (instrumented thread.start logging)
+- `scripts/tier2/artifacts/send_result_to_send_20260616/regen_layout.py` (regenerates layout from `_default_windows`)
+- `scripts/tier2/artifacts/send_result_to_send_20260616/remove_tshirt3.py` (removes T-shirt from conductor files)
+- `logs/sloppy_no_click_*.log` (process alive after 60s, no clicks)
+- `logs/sloppy_diag2_*_after_layout.log` (process dies after layout fix)
+
+
 ## Files in this report
 - `docs/reports/THEME_BUG_ANALYSIS_send_result_to_send_20260616.md` (the prior theme fix report, restored in `8c6d9aa0`)