manual_slop/docs/reports/NEGATIVE_FLOWS_INVESTIGATION_20260617_REFINED.md
ed 788ebbc608 docs(tier2): append update to refined investigation (T-shirt done, layout didn't fix)
Per user feedback this round:
1. T-shirt size removed from conductor/workflow.md (policy),
   conductor/tracks.md (registry), and the prior
   NEGATIVE_FLOWS_INVESTIGATION_20260617.md report.
2. Layout regenerated from _default_windows (17KB -> 3KB, 10 stale
   windows -> 3). Layout fix did NOT fix the crash.

Three new diagnostic experiments (results appended to the report):
- diag_no_click.py: process survives 60s without clicks (render loop
  is stable in isolation; crash is click-triggered).
- diag_thread.py: standalone ThreadPoolExecutor + adapter call works
  fine in all 3 MOCK_MODE modes (subprocess spawn is not the issue).
- diag_realbig2_run.py: bumping threading.stack_size(8MB) does NOT
  prevent the crash (io_pool worker is not where the stack is exhausted).

Refined hypothesis: the crash is in the MAIN THREAD's imgui-bundle
render loop (1.94 MB stack), running concurrently with the io_pool
worker's adapter call. The subprocess spawn + CreateProcessW causes
the kernel to allocate resources at the moment the main thread is
deep in imgui-bundle C++ frames, exhausting the main thread's small
guard page.

What's needed for definitive diagnosis: a Windows crash dump (procdump
-ma or cdb.exe) to see the actual C-side stack frame, OR a
SetUnhandledExceptionFilter in sitecustomize.py that logs the
crashing thread's TEB and call stack to stderr before the process dies.
2026-06-17 12:25:29 -04:00


test_z_negative_flows.py Failure - Refined Root Cause Analysis

Investigator: Tier 2 Tech Lead (autonomous run)
Track context: Post-completion of send_result_to_send_20260616
Previous report: NEGATIVE_FLOWS_INVESTIGATION_20260617.md (now superseded by this one for the root-cause section)

TL;DR

The 3 tests in tests/test_z_negative_flows.py fail with Windows 0xC00000FD = STATUS_STACK_OVERFLOW in the GUI subprocess. The Python call stack at the moment of the crash is only 13 frames deep, so this is not a Python recursion bug. The most likely cause is that the main thread of sloppy.py only has a 1.94 MB stack on this Python 3.11.6 / Windows installation (verified via kernel32.GetCurrentThreadStackLimits). The io_pool workers DO get the 8MB stack from threading.stack_size(8MB) (set by my diagnostic sitecustomize), and the process STILL dies with 0xC00000FD, which points to the stack overflow being in the main thread, not the io_pool worker.

Why the previous "thread stack is too small" theory is wrong

I previously hypothesized the io_pool's 1MB thread stack was the bottleneck. After running three follow-up experiments, this is no longer credible:

  1. Bumping threading.stack_size(8 * 1024 * 1024) before any thread is created (via sitecustomize.py loaded into the subprocess) → process still dies with 0xC00000FD. So the io_pool workers and _loop_thread (both created after the sitecustomize) have 8MB stacks and still crash.
  2. Replacing concurrent.futures.ThreadPoolExecutor with a custom pool that uses threading.Thread(..., stack_size=8MB) → fails because Thread.__init__ does not accept a stack_size kwarg (on Python 3.11, as on other versions, only the global threading.stack_size() works). Bypassed that by using the global.
  3. Running the adapter directly in ThreadPoolExecutor from a standalone Python process (no imgui-bundle, no render loop) → works fine for all 3 MOCK_MODE values. So the io_pool thread is not the problem in isolation.
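Experiments 1 and 2 can be sketched as follows. This is a minimal illustration, not the diagnostic scripts themselves: `thread_accepts_stack_size_kwarg` and `run_with_big_stack` are hypothetical helper names, and the only library behaviour relied on is that `threading.stack_size()` sets the stack size for threads created after the call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

EIGHT_MB = 8 * 1024 * 1024


def thread_accepts_stack_size_kwarg() -> bool:
    """Return True if threading.Thread takes a per-thread stack_size argument."""
    try:
        threading.Thread(target=lambda: None, stack_size=EIGHT_MB)
    except TypeError:
        return False
    return True


def run_with_big_stack(fn, *args, size=EIGHT_MB):
    """Run fn(*args) on a pool worker created while the global stack size is `size`."""
    previous = threading.stack_size(size)  # affects threads created from now on
    try:
        # The executor starts its worker lazily on submit(), i.e. after the call above.
        with ThreadPoolExecutor(max_workers=1) as pool:
            return pool.submit(fn, *args).result()
    finally:
        threading.stack_size(previous)  # restore the previous process-wide setting
```

Note that neither helper can change the stack of a thread that already exists, which is why the main thread is unaffected by this approach.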

The actual data

Python call stack at crash

Instrumented _send_gemini_cli and GeminiCliAdapter.send via sitecustomize.py. Stack at adapter.send ENTRY:

[STK] _send_gemini_cli ENTRY depth=9
[STK] adapter.send ENTRY depth=13
[STK]     sitecustomize.py:25 _walk_stack
[STK]     sitecustomize.py:42 _patched_send
[STK]     ai_client.py:1853 _send
[STK]     ai_client.py:808 run_with_tool_loop
[STK]     ai_client.py:1917 _send_gemini_cli
[STK]     sitecustomize.py:69 _patched_send_gc
[STK]     ai_client.py:3016 send
[STK]     app_controller.py:3674 _handle_request_event
[STK]     thread.py:58 run                <-- io_pool worker
[STK]     thread.py:83 _worker
[STK]     threading.py:982 run
[STK]     threading.py:1045 _bootstrap_inner
[STK]     threading.py:1002 _bootstrap

13 frames is trivial. ~6-7KB of Python stack. ~50KB of C stack underneath. No recursion anywhere.
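A depth trace like the one above can be produced with a small decorator. The sketch below is an assumed reconstruction, not the actual sitecustomize.py: `walk_stack`, `format_entry` and `log_entry_depth` are hypothetical names, and the depth it reports starts at the caller of the wrapped function, so the count may differ slightly from the trace shown.

```python
import functools
import os
import sys


def walk_stack(skip=1):
    """Return (file, line, function) for each Python frame, innermost first."""
    frames = []
    frame = sys._getframe(skip)
    while frame is not None:
        code = frame.f_code
        frames.append((os.path.basename(code.co_filename), frame.f_lineno, code.co_name))
        frame = frame.f_back
    return frames


def format_entry(label, frames):
    """Render frames in the [STK] log format."""
    lines = [f"[STK] {label} ENTRY depth={len(frames)}"]
    lines += [f"[STK]     {name}:{lineno} {func}" for name, lineno, func in frames]
    return lines


def log_entry_depth(label):
    """Decorator: log the Python call stack to stderr each time fn is entered."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # skip=2 drops walk_stack and this wrapper, starting at the caller.
            for line in format_entry(label, walk_stack(skip=2)):
                print(line, file=sys.stderr)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

This only sees Python frames; the C stack underneath each frame is invisible to it, which is why the trace alone cannot locate a native overflow.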

Thread stack sizes in this process (verified)

[DIAGSTK] Set thread stack size to 8388608 bytes
[DIAGSTK] Main thread stack: 1.94 MB

Confirmed via kernel32.GetCurrentThreadStackLimits:

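import ctypes

# void GetCurrentThreadStackLimits(PULONG_PTR LowLimit, PULONG_PTR HighLimit)
GetCurrentThreadStackLimits = ctypes.windll.kernel32.GetCurrentThreadStackLimits
GetCurrentThreadStackLimits.argtypes = [ctypes.POINTER(ctypes.c_size_t), ctypes.POINTER(ctypes.c_size_t)]
GetCurrentThreadStackLimits.restype = None
low = ctypes.c_size_t(); high = ctypes.c_size_t()
GetCurrentThreadStackLimits(ctypes.byref(low), ctypes.byref(high))
stack_mb = (high.value - low.value) / (1024 * 1024)  # limits of the calling thread
# Result: high - low = 1.94 MB on the main thread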

The main thread's stack is 1.94 MB, set by the Windows PE header (Python 3.11.6's python.exe). The sitecustomize's threading.stack_size(8MB) call sets the default for new threads (the io_pool workers, the _loop_thread, the HookServer thread), but the main thread was created before sitecustomize ran, so it keeps its PE-header-baked 1.94 MB.
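The PE-header value can also be read directly from the executable, which gives a cross-check that does not depend on a running process. The sketch below parses the `SizeOfStackReserve` field of the optional header (offset 72 in both PE32 and PE32+); `pe_stack_reserve` is a hypothetical helper name, and the value the loader actually reserves may be rounded up from what the header states.

```python
import struct


def pe_stack_reserve(image: bytes) -> int:
    """Return SizeOfStackReserve (bytes) from a PE image's optional header."""
    if image[:2] != b"MZ":
        raise ValueError("not a PE image (missing MZ signature)")
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)  # e_lfanew
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    optional = pe_offset + 4 + 20  # skip the signature and the 20-byte COFF header
    (magic,) = struct.unpack_from("<H", image, optional)
    if magic == 0x10B:    # PE32: the field is 4 bytes wide
        (reserve,) = struct.unpack_from("<I", image, optional + 72)
    elif magic == 0x20B:  # PE32+: the field is 8 bytes wide
        (reserve,) = struct.unpack_from("<Q", image, optional + 72)
    else:
        raise ValueError(f"unknown optional header magic {magic:#x}")
    return reserve


# Usage on Windows (not run here):
#   from pathlib import Path; import sys
#   print(pe_stack_reserve(Path(sys.executable).read_bytes()))
```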

Process death pattern

$ poll=3221225725  (= 0xC00000FD)

Reproducible 100% across runs and across all 3 MOCK_MODE values (malformed_json, error_result, success).

When the main thread's stack overflows, the whole process dies — including all worker threads. So when the io_pool worker is mid-call to adapter.send, the main thread's stack overflow kills everything.
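For reading the poll value in test logs, a small decoder helps. This is an illustrative sketch: `describe_exit_code` and the `STATUS_NAMES` table are hypothetical, and the table lists only two well-known NTSTATUS codes.

```python
# A tiny lookup of NTSTATUS codes; extend as needed.
STATUS_NAMES = {
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def describe_exit_code(returncode: int) -> str:
    """Format a process exit code as hex, with its NTSTATUS name when known."""
    code = returncode & 0xFFFFFFFF  # normalise a signed 32-bit value to unsigned
    name = STATUS_NAMES.get(code)
    return f"0x{code:08X}" + (f" ({name})" if name else "")
```

The mask handles tools that report the same status as a negative signed integer (-1073741571).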

What is the main thread doing during the test?

The main thread runs immapp.run(...) from imgui-bundle, which is the HelloImGui native render loop. It calls our Python _gui_func callback ~60 times/second. The render loop has been running since startup. By the time the test clicks btn_gen_send:

  • Frames have been rendering for about 4 seconds (1 second of warmup + 0.5s × 6 setup calls), which is up to ~240 frames at the ~60 calls/second above
  • The imgui-bundle render context has been built up with widgets, fonts, theme

Hypothesis (not yet verified): the render loop is calling into imgui-bundle's native layout/draw code, which is using C++ frames with deep template instantiations. After many frames, the C stack grows. When the click is dispatched and the render loop continues to run alongside the io_pool worker's adapter.send, the main thread's stack hits its 1.94MB guard page and dies.

This is not Python recursion. It's the imgui-bundle native render code's stack usage, accumulated over many frames.

What we know for sure

  1. The crash is 0xC00000FD = STATUS_STACK_OVERFLOW on Windows. NOT a Python exception.
  2. The Python call chain at the crash point is 13 frames deep. NOT a Python recursion bug.
  3. The crash happens in the GUI subprocess (sloppy.py with --enable-test-hooks), not in pytest.
  4. The crash happens after click("btn_gen_send") is processed, not before. All 6 setup API calls return 200.
  5. The crash is reproducible 100% with MOCK_MODE in {malformed_json, error_result, success}. Not specific to the exception path.
  6. The main thread has 1.94 MB. The io_pool workers, after threading.stack_size(8MB), have 8 MB. Bumping the io_pool stack doesn't fix the crash.
  7. The standalone Python process (no imgui-bundle, no render loop) running the same adapter call from a ThreadPoolExecutor with default 1MB stack works fine for all 3 MOCK_MODE values.

What we don't know yet

  • Whether the main thread is actually the one whose stack overflows (vs. a thread we haven't yet identified — e.g., a HelloImGui-internal thread, or a thread created by imgui-bundle). To verify, I'd need to attach a debugger or add SetUnhandledExceptionFilter logging in the subprocess to dump the crashing thread's TEB.
  • What specific imgui-bundle code path causes the C stack to grow. Without a debugger or WER crash dump, we can't see the C-side stack trace.
  • Whether the stack growth is linear (slow leak over many frames) or sudden (one specific draw call).

Plausible root cause (next investigation step)

The most likely culprit is one of:

  1. _render_message_panel / _render_response_panel rendering path: when ai_status becomes "error", the response panel starts rendering an error overlay. If the error overlay calls into imgui-bundle with a pathological layout (e.g., add_rect with a malformed argument list — the bug from 9fcf0517!), imgui-bundle may recurse deeply into its C++ template metaprogramming for layout calc. Even with the theme fix in 9fcf0517, the C++ stack usage per frame may have grown to the point where the next frame overflows the 1.94MB main thread stack.

  2. A specific frame's draw call: clicking btn_gen_send triggers _do_generate in a worker, which puts an event on the queue, which gets processed by the render loop on the next frame. The render loop renders the new state. That specific draw call has a deep C++ stack.

  3. External MCP server thread: if any external MCP server is connected, its thread may have a small stack. But this would be caught by the io_pool stack bump, which we did.

Next investigation steps:

  1. Capture a Windows crash dump from the subprocess. Run the interpreter under a debugger (e.g., cdb.exe -g -G -o python.exe sloppy.py --enable-test-hooks) or use procdump -ma -e -x <dump_dir> python.exe sloppy.py --enable-test-hooks. Both tools take an executable image, so the target is python.exe with sloppy.py as its argument. This will give us a .dmp file with full call stacks for ALL threads at the moment of crash.
  2. Add SetUnhandledExceptionFilter to the subprocess that logs the crashing thread's TEB and stack to stderr before the process dies. The handler can be installed via sitecustomize.py so it doesn't require code changes to sloppy.py.
  3. Reduce the test's render load: if the test workspace's layout file is 17KB and references 10 stale window names, that may be a major source of native stack usage per frame. Fix the stale layout (it has been stale for 7+ days per the WARNING in the log: "Run the 'Reset Layout' command from the Command Palette").
  4. Bump the main thread's stack at the OS level: This requires modifying the PE header of python.exe (via editbin /STACK:8388608 python.exe on Windows) or recompiling. Neither is in scope for a 1-track fix.

The fix path forward

Short-term (ship in next track, 1-2 hours):

  • Fix the stale manualslop_layout.ini (it references 10 deleted window names, causing imgui-bundle to do extra work each frame)
  • Capture a WER dump to identify the actual C-side stack frame that overflows
  • If the dump points to a specific render function, fix that function

Medium-term (separate track, 1-2 days):

  • Bump sloppy.py's main thread stack via editbin (Windows). CPython does not provide a PYTHONSTACKSIZE env var, so there is no environment-level switch for the main thread
  • Migrate heavy AI calls to a subprocess (multiprocessing.Process) so the C stack is per-call, not per-thread

Long-term (architectural):

  • Move the GUI's render loop off the main thread (or use imgui-bundle's offscreen rendering mode) so the main thread, with its fixed 1.94 MB stack, only does thin coordination work
  • Move all subprocess.Popen calls to dedicated subprocess worker pool

Update 2026-06-17 (post-user-feedback round)

User feedback after the previous report:

  1. Remove the T-shirt size metric from all places encountered.
  2. Fix the layout (it was stale: 10 entries referencing deleted/renamed windows).
  3. The user correctly suspected "Something more fundamental is wrong" - the layout fix was a guess.

T-shirt size removal (done)

Removed T-shirt size from:

  • conductor/workflow.md (the policy file) - removed the S/M/L/XL table, the replacement pattern row, and the "reasonable effort" guard's reference. Scope (N files, M sites, N tasks) is now the only effort dimension.
  • conductor/tracks.md (the registry) - removed the T-shirt column header and the Fable track entry's T-shirt mentions.
  • docs/reports/NEGATIVE_FLOWS_INVESTIGATION_20260617.md - removed the T-shirt mention in the follow-up suggestion.

Track artifacts (conductor/tracks/fable_review_20260617/metadata.json, conductor/tracks/result_migration_20260616/metadata.json, their spec.md files) still have T-shirt references. These are historical track snapshots - left as records of past decisions.

Layout fix (done, didn't help)

Regenerated manualslop_layout.ini: 17,360 bytes -> 3,361 bytes (102 windows -> 23 windows). Now matches the windows registered in src/app_controller.py _default_windows (lines 1862-1886). Docking section preserved. Stale window warning dropped from 10 windows to 3.

The layout fix did NOT fix the crash. Process still dies with rc=3221225725 (0xC00000FD) within 1s of click.

Three new diagnostic experiments (everything points at the main thread)

Experiment 1: No-click baseline (diag_no_click.py). Spawned sloppy.py with hook server, did NO clicks, waited 60s polling status every 2s. Process survived 60s. So the render loop is stable in isolation; the crash is specifically triggered by the click chain.
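The shape of this baseline can be sketched as below. It is an assumption-laden simplification of diag_no_click.py: `watch_process` is a hypothetical name, and it only polls the process exit code rather than querying the hook server's status endpoint.

```python
import subprocess
import time


def watch_process(cmd, duration_s=60.0, interval_s=2.0):
    """Run cmd and poll it; return None if it outlives duration_s, else its exit code."""
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline:
            code = proc.poll()
            if code is not None:
                return code  # the process died inside the observation window
            remaining = max(0.0, deadline - time.monotonic())
            time.sleep(min(interval_s, remaining))
        return proc.poll()
    finally:
        # Never leave the child running after the observation ends.
        if proc.poll() is None:
            proc.kill()
            proc.wait()
```

A return value of None corresponds to the "process survived 60s" outcome; any integer is the exit code of an early death.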

Experiment 2: Standalone ThreadPoolExecutor (diag_thread.py). Created a fresh ThreadPoolExecutor, called the adapter from a worker thread, tested all 3 MOCK_MODE values. No crash, no stack overflow. So the io_pool thread + adapter + subprocess stack usage is fine in isolation.
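A stand-in for this experiment looks roughly like the following. Everything here is hypothetical scaffolding: `FAKE_CLI`, `fake_adapter_send` and the JSON payloads imitate the three MOCK_MODE behaviours and are not the real GeminiCliAdapter or its output format. The point is only the structure: a pool worker spawns a child process and parses its output.

```python
import json
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical child scripts, one per MOCK_MODE value.
FAKE_CLI = {
    "success": f"print({json.dumps({'ok': True, 'text': 'hi'})!r})",
    "error_result": f"print({json.dumps({'ok': False, 'error': 'mock failure'})!r})",
    "malformed_json": "print('{not json')",
}


def fake_adapter_send(mode):
    """Spawn a child process for `mode` and parse its stdout as JSON."""
    proc = subprocess.run(
        [sys.executable, "-c", FAKE_CLI[mode]],
        capture_output=True, text=True, timeout=30,
    )
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"malformed JSON: {exc.msg}"}


def run_all_modes():
    """Call the fake adapter from a worker thread for every mode."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        return {mode: pool.submit(fake_adapter_send, mode).result() for mode in FAKE_CLI}
```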

Experiment 3: Bumped io_pool to 8MB stack (diag_realbig2_run.py). Used threading.stack_size(8 * 1024 * 1024) via sitecustomize.py, then spawned sloppy.py. Verified via the log: [DIAGSTK] Set thread stack size to 8388608 bytes. Process STILL dies with 0xC00000FD. So the io_pool worker's stack is not the bottleneck.

Refined understanding

Combining all the data:

| What we know | What it means |
| --- | --- |
| Call depth at crash is 13 frames | Not Python recursion; not call depth |
| threading.stack_size(8MB) doesn't help | The io_pool worker (and _loop_thread) are not where the stack is exhausted |
| Main thread stack is 1.94 MB (verified via kernel32.GetCurrentThreadStackLimits) | The only thread left with a small stack is the main thread |
| Crash happens after _send_gemini_cli returns ok=False but before the "response" event is emitted | The crash is in the ai_client.send -> _handle_request_event -> _on_api_event chain OR in something concurrent with it (render loop on main thread) |
| Standalone ThreadPoolExecutor + adapter works fine | The subprocess spawn is fine; the issue is specific to sloppy.py's environment |
| Render loop is stable in isolation (no clicks) | The crash is triggered by the click -> worker -> adapter call chain |

Most likely cause (re-formulated hypothesis)

The crash is almost certainly in the main thread, not the io_pool worker. The main thread's imgui-bundle render loop is running concurrently with the io_pool worker's adapter call. When the click is processed:

  1. The io_pool worker calls subprocess.Popen (CreateProcessW on Windows)
  2. The Windows kernel allocates resources for the new process
  3. The main thread's render loop is in a frame draw call
  4. Some imgui-bundle native code in the render loop uses the C stack
  5. The main thread's 1.94 MB stack is exhausted

The cmd_list debug print (in the io_pool worker) succeeds because the io_pool worker has 8MB. But the main thread is rendering concurrently and runs out.

The "after _send_gemini_cli returns" timing is incidental - it just happens to be when the main thread's render loop hits the stack limit. The actual crash is in imgui-bundle's render code, not in the AI call chain.

What's needed for definitive diagnosis

To find the actual C-side stack frame that's overflowing, we need:

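  1. A Windows crash dump. Run the interpreter under a debugger (cdb needs an executable image, so the target is python.exe with sloppy.py as its argument):

    cdb.exe -g -G -o python.exe sloppy.py --enable-test-hooks

    Or use procdump, launching the interpreter with -x so the dump is written when the unhandled exception fires:

    procdump -ma -e -x <dump_dir> python.exe sloppy.py --enable-test-hooks

    The .dmp file gives full call stacks for ALL threads at the moment of crash.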

  2. Or: SetUnhandledExceptionFilter in sitecustomize.py that dumps the crashing thread's TEB and call stack to stderr before the process dies. This avoids needing a debugger.
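A lighter-weight variant of option 2 is the standard library's faulthandler module, which is a different mechanism from SetUnhandledExceptionFilter but can be installed from sitecustomize.py in the same way. On Windows it also registers a handler for Windows exceptions, and its output marks the faulting thread as "Current thread". The sketch below is an assumption, not something that has been tried in this investigation: `install_crash_logger` is a hypothetical name, and whether the handler still manages to write when the stack is already exhausted is unverified.

```python
import faulthandler


def install_crash_logger(log_path):
    """Dump Python tracebacks of all threads to log_path on a fatal error."""
    # The file must stay open for the lifetime of the process, because
    # faulthandler writes to its file descriptor from the fault handler.
    log_file = open(log_path, "w", buffering=1)
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# In sitecustomize.py (hypothetical path):
#   _crash_log = install_crash_logger("logs/sloppy_faulthandler.log")
```

This would only show Python frames per thread, so it can answer which thread faulted but not which imgui-bundle C++ frame overflowed; the crash dump remains necessary for that.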

Files added in this round

  • scripts/tier2/artifacts/send_result_to_send_20260616/diag_no_click.py (no-click baseline - confirms crash is click-triggered)
  • scripts/tier2/artifacts/send_result_to_send_20260616/diag_thread.py (standalone ThreadPoolExecutor - confirms subprocess works in isolation)
  • scripts/tier2/artifacts/send_result_to_send_20260616/diag_realbig2_run.py (8MB thread stack - confirms io_pool worker is not the bottleneck)
  • scripts/tier2/artifacts/send_result_to_send_20260616/diag_thread_stk_run.py (instrumented thread.start logging)
  • scripts/tier2/artifacts/send_result_to_send_20260616/regen_layout.py (regenerates layout from _default_windows)
  • scripts/tier2/artifacts/send_result_to_send_20260616/remove_tshirt3.py (removes T-shirt from conductor files)
  • logs/sloppy_no_click_*.log (process alive after 60s, no clicks)
  • logs/sloppy_diag2_*_after_layout.log (process dies after layout fix)

Files referenced in this report

  • docs/reports/THEME_BUG_ANALYSIS_send_result_to_send_20260616.md (the prior theme fix report, restored in 8c6d9aa0)
  • docs/reports/NEGATIVE_FLOWS_INVESTIGATION_20260617.md (the previous investigation — partially superseded)
  • docs/reports/NEGATIVE_FLOWS_INVESTIGATION_20260617_REFINED.md (this file)
  • scripts/tier2/artifacts/send_result_to_send_20260616/diag_diag_stacks_init.py (sitecustomize that sets 8MB stack + reports main thread stack size)
  • logs/sloppy_diag_stk_20260617_*.log (log showing "Main thread stack: 1.94 MB" then crash)