
docs(tier2): append update to refined investigation (T-shirt done, layout didn't fix)

Per user feedback this round:
1. T-shirt size removed from conductor/workflow.md (policy),
   conductor/tracks.md (registry), and the prior
   NEGATIVE_FLOWS_INVESTIGATION_20260617.md report.
2. Layout regenerated from _default_windows (17KB -> 3KB, 10 stale
   windows -> 3). Layout fix did NOT fix the crash.

Three new diagnostic experiments (results appended to the report):
- diag_no_click.py: process survives 60s without clicks (render loop
  is stable in isolation; crash is click-triggered).
- diag_thread.py: standalone ThreadPoolExecutor + adapter call works
  fine in all 3 MOCK_MODE modes (subprocess spawn is not the issue).
- diag_realbig2_run.py: bumping threading.stack_size(8MB) does NOT
  prevent the crash (io_pool worker is not where the stack is exhausted).

Refined hypothesis: the crash is in the MAIN THREAD's imgui-bundle
render loop (1.94 MB stack), running concurrently with the io_pool
worker's adapter call. The subprocess spawn + CreateProcessW causes
the kernel to allocate resources at the moment the main thread is
deep in imgui-bundle C++ frames, exhausting the main thread's small
stack.

What's needed for definitive diagnosis: a Windows crash dump (procdump
-ma or cdb.exe) to see the actual C-side stack frame, OR a
SetUnhandledExceptionFilter in sitecustomize.py that logs the
crashing thread's TEB and call stack to stderr before the process dies.
2026-06-17 12:25:29 -04:00
parent 54eb4740b3
commit 788ebbc608
@@ -130,6 +130,91 @@ The most likely culprit is one of:
- Move the GUI's render loop off the main thread (or use imgui-bundle's offscreen rendering mode) so the main thread is a thin renderer
- Move all `subprocess.Popen` calls to dedicated subprocess worker pool
## Update 2026-06-17 (post-user-feedback round)
User feedback after the previous report:
1. Remove the T-shirt size metric from all places encountered.
2. Fix the layout (it was stale - 10 windows referencing deleted/renamed windows).
3. The user correctly suspected "Something more fundamental is wrong" - the layout fix was a guess.
### T-shirt size removal (done)
Removed T-shirt size from:
- `conductor/workflow.md` (the policy file) - removed the S/M/L/XL table, the replacement pattern row, and the "reasonable effort" guard's reference. Scope (N files, M sites, N tasks) is now the only effort dimension.
- `conductor/tracks.md` (the registry) - removed the T-shirt column header and the Fable track entry's T-shirt mentions.
- `docs/reports/NEGATIVE_FLOWS_INVESTIGATION_20260617.md` - removed the T-shirt mention in the follow-up suggestion.
Track artifacts (`conductor/tracks/fable_review_20260617/metadata.json`, `conductor/tracks/result_migration_20260616/metadata.json`, their spec.md files) still have T-shirt references. These are historical track snapshots - left as records of past decisions.
### Layout fix (done, didn't help)
Regenerated `manualslop_layout.ini`: 17,360 bytes -> 3,361 bytes (102 windows -> 23 windows). It now matches the windows registered in `_default_windows` in `src/app_controller.py` (lines 1862-1886). The docking section is preserved. The stale-window warning dropped from 10 windows to 3.
**The layout fix did NOT fix the crash.** Process still dies with `rc=3221225725` (`0xC00000FD`) within 1s of click.
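For reference, the decimal `rc` and the hex code above are the same 32-bit value, and `0xC00000FD` is the NTSTATUS code Windows uses for a stack overflow. A minimal sketch of the conversion follows; `describe_exit_code` is a hypothetical helper, not part of the project:

```python
# NTSTATUS value that Windows reports for a stack overflow.
STATUS_STACK_OVERFLOW = 0xC00000FD


def describe_exit_code(rc: int) -> str:
    """Format a Windows exit code as unsigned 32-bit hex.

    Accepts either the unsigned form (3221225725) or the signed form
    (-1073741571) that some tools report for the same value.
    """
    unsigned = rc & 0xFFFFFFFF
    label = "STATUS_STACK_OVERFLOW" if unsigned == STATUS_STACK_OVERFLOW else "unknown"
    return f"0x{unsigned:08X} ({label})"


print(describe_exit_code(3221225725))
print(describe_exit_code(-1073741571))
```

Both calls print the same line, which is useful when comparing logs from tools that disagree on signedness.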
### Three new diagnostic experiments (everything points at the main thread)
**Experiment 1: No-click baseline (`diag_no_click.py`).** Spawned sloppy.py with hook server, did NO clicks, waited 60s polling status every 2s. **Process survived 60s.** So the render loop is stable in isolation; the crash is specifically triggered by the click chain.
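The shape of such a baseline can be sketched as below. This is not the contents of `diag_no_click.py`: it only polls process liveness (the real script also polls the hook server's status), and `survives` and its parameters are hypothetical names.

```python
import subprocess
import sys
import time
from typing import List, Optional, Tuple


def survives(cmd: List[str], duration_s: float, poll_interval_s: float) -> Tuple[bool, Optional[int]]:
    """Start cmd and poll it until duration_s elapses or the process exits.

    Returns (True, None) if the process was still alive at the deadline
    (it is then terminated), otherwise (False, returncode).
    """
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline and proc.poll() is None:
        time.sleep(poll_interval_s)
    rc = proc.poll()
    if rc is None:
        # Still running at the deadline: clean up and report survival.
        proc.terminate()
        proc.wait()
        return True, None
    return False, rc


# Stand-in child process; the experiment used sloppy.py with 60s / 2s.
child = [sys.executable, "-c", "import time; time.sleep(5)"]
alive, rc = survives(child, duration_s=1.0, poll_interval_s=0.2)
print(alive, rc)
```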
**Experiment 2: Standalone ThreadPoolExecutor (`diag_thread.py`).** Created a fresh ThreadPoolExecutor, called the adapter from a worker thread, tested all 3 MOCK_MODE values. **No crash, no stack overflow.** So the io_pool thread + adapter + subprocess stack usage is fine in isolation.
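A minimal sketch of this kind of isolation test, assuming the adapter boils down to spawning a child process from a pool worker. `run_child` and the mode names are hypothetical stand-ins; the real adapter and the actual MOCK_MODE values are not shown here.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple


def run_child(mode: str) -> Tuple[int, str]:
    """Stand-in for the adapter call: spawn a child and capture its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", mode],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.returncode, result.stdout.strip()


# Hypothetical mode names; the report only states that there are three.
modes = ["mode_a", "mode_b", "mode_c"]

# A fresh single-worker pool, so every spawn happens on a worker thread.
with ThreadPoolExecutor(max_workers=1) as pool:
    results = [pool.submit(run_child, mode).result() for mode in modes]

print(results)
```

If the spawn itself were the problem, a test of this shape would be expected to fail outside the GUI process as well.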
**Experiment 3: Bumped io_pool to 8MB stack (`diag_realbig2_run.py`).** Used `threading.stack_size(8 * 1024 * 1024)` via sitecustomize.py, then spawned sloppy.py. Verified via the log: `[DIAGSTK] Set thread stack size to 8388608 bytes`. **Process STILL dies with 0xC00000FD.** So the io_pool worker's stack is not the bottleneck.
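One detail matters when reading this result: `threading.stack_size()` only applies to threads created after the call and does not resize the already-running main thread. A minimal sketch of the call, with a hypothetical demo thread:

```python
import threading

EIGHT_MB = 8 * 1024 * 1024

# Setting a size returns the previous setting (0 means "platform default").
previous = threading.stack_size(EIGHT_MB)
# Calling with no argument only queries the current setting.
current = threading.stack_size()

worker_ran = []

# Threads created from here on are requested with an 8 MB stack.
t = threading.Thread(
    target=lambda: worker_ran.append(threading.current_thread().name),
    name="stack-size-demo",
)
t.start()
t.join()

# Restore the earlier setting so later threads are unaffected.
threading.stack_size(previous)
print(current, worker_ran)
```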
### Refined understanding
Combining all the data:
| What we know | What it means |
|---|---|
| Call depth at crash is 13 frames | Not runaway Python recursion; Python call depth is not what exhausts the stack |
| `threading.stack_size(8MB)` doesn't help | The io_pool worker (and `_loop_thread`) are not where the stack is exhausted |
| Main thread stack is 1.94 MB (verified via `kernel32.GetCurrentThreadStackLimits`) | The only thread left with a small stack is the main thread |
| Crash happens after `_send_gemini_cli` returns ok=False but before the "response" event is emitted | The crash is in the `ai_client.send -> _handle_request_event -> _on_api_event` chain OR in something concurrent with it (render loop on main thread) |
| Standalone ThreadPoolExecutor + adapter works fine | The subprocess spawn is fine; the issue is specific to sloppy.py's environment |
| Render loop is stable in isolation (no clicks) | The crash is triggered by the click -> worker -> adapter call chain |
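The stack-size figure in the table can be reproduced with a short `ctypes` call. This is a sketch, not the code used for the measurement: `current_thread_stack_bytes` is a hypothetical helper, it needs Windows 8 or later, and it returns `None` on other platforms.

```python
import ctypes
import sys
from typing import Optional


def current_thread_stack_bytes() -> Optional[int]:
    """Return the calling thread's stack range in bytes, or None off Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.WinDLL("kernel32")
    fn = kernel32.GetCurrentThreadStackLimits
    # void GetCurrentThreadStackLimits(PULONG_PTR LowLimit, PULONG_PTR HighLimit)
    fn.argtypes = [ctypes.POINTER(ctypes.c_size_t), ctypes.POINTER(ctypes.c_size_t)]
    fn.restype = None
    low = ctypes.c_size_t()
    high = ctypes.c_size_t()
    fn(ctypes.byref(low), ctypes.byref(high))
    return high.value - low.value


size = current_thread_stack_bytes()
if size is None:
    print("not on Windows; nothing to measure")
else:
    print(f"stack range: {size / (1024 * 1024):.2f} MiB")
```

Called from the main thread it reports the main thread's range; called inside a worker it reports that worker's, which makes it easy to compare the two.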
### Most likely cause (re-formulated hypothesis)
The crash is almost certainly in the **main thread**, not the io_pool worker. The main thread's imgui-bundle render loop is running concurrently with the io_pool worker's adapter call. When the click is processed:
1. The io_pool worker calls `subprocess.Popen` (CreateProcessW on Windows)
2. The Windows kernel allocates resources for the new process
3. The main thread's render loop is in a frame draw call
4. Some imgui-bundle native code in the render loop uses the C stack
5. The main thread's 1.94 MB stack is exhausted
The cmd_list debug print (in the io_pool worker) succeeds because the io_pool worker has an 8 MB stack. Meanwhile the main thread is rendering concurrently and runs out.
Under this hypothesis, the "after `_send_gemini_cli` returns" timing is incidental: it is simply when the main thread's render loop hits the stack limit. The overflow would then be in imgui-bundle's render code, not in the AI call chain.
### What's needed for definitive diagnosis
To find the actual C-side stack frame that's overflowing, we need:
1. **A Windows crash dump.** Run sloppy.py under a debugger:
```bash
# Assumes python.exe on PATH is the interpreter that normally runs sloppy.py.
cdb.exe -g -G -o python.exe sloppy.py --enable-test-hooks
```
Or use `procdump`:
```bash
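# -x launches the target and writes the dump into the given folder (here "."); python.exe on PATH is assumed.
procdump -ma -e -x . python.exe sloppy.py --enable-test-hooks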
```
The .dmp file gives full call stacks for ALL threads at the moment of crash.
2. **Or: `SetUnhandledExceptionFilter` in sitecustomize.py** that dumps the crashing thread's TEB and call stack to stderr before the process dies. This avoids needing a debugger.
### Files added in this round
- `scripts/tier2/artifacts/send_result_to_send_20260616/diag_no_click.py` (no-click baseline - confirms crash is click-triggered)
- `scripts/tier2/artifacts/send_result_to_send_20260616/diag_thread.py` (standalone ThreadPoolExecutor - confirms subprocess works in isolation)
- `scripts/tier2/artifacts/send_result_to_send_20260616/diag_realbig2_run.py` (8MB thread stack - confirms io_pool worker is not the bottleneck)
- `scripts/tier2/artifacts/send_result_to_send_20260616/diag_thread_stk_run.py` (instrumented thread.start logging)
- `scripts/tier2/artifacts/send_result_to_send_20260616/regen_layout.py` (regenerates layout from `_default_windows`)
- `scripts/tier2/artifacts/send_result_to_send_20260616/remove_tshirt3.py` (removes T-shirt from conductor files)
- `logs/sloppy_no_click_*.log` (process alive after 60s, no clicks)
- `logs/sloppy_diag2_*_after_layout.log` (process dies after layout fix)
## Files in this report
- `docs/reports/THEME_BUG_ANALYSIS_send_result_to_send_20260616.md` (the prior theme fix report, restored in `8c6d9aa0`)