Private
Public Access
0
0
Commit Graph

3 Commits

Author SHA1 Message Date
ed 694cfd2b70 diag(tier2): isolate the jank - _trigger_blink in render_response_panel
User asked: 'what does negative flows cause in the imgui procedural
dag graph that would cause a recursive processing of the stack?'

Tested 4 hypotheses:
1. PYTHONSTACKSIZE env var to bump main thread stack: IGNORED. Main
   thread stays at 1.94MB regardless of env var or PE header (PE
   header SizeOfStackReserve is 4TB but Windows OS uses its own
   default for the main thread commit size).
2. -X faulthandler: doesn't capture native STATUS_STACK_OVERFLOW
   (faulthandler only catches Python-level signals).
3. Editbin /STACK: editbin not installed on this system.
4. PE header patching with ctypes: SizeOfStackReserve is 4TB but the
   OS commits only 1.94MB for the main thread and Python doesn't
   honor any env var to change it.

The breakthrough: monkey-patched _handle_ai_response via sitecustomize
to disable _trigger_blink and _autofocus_response_tab. Result:

  WITHOUT _trigger_blink: process survives 60s, response event
  arrives with status='error' and correct error text. The test
  WOULD PASS.

  WITH _trigger_blink (default): process dies with 0xC00000FD
  (STATUS_STACK_OVERFLOW) within 1s of click.

The jank: in src/gui_2.py:render_response_panel (line 5537), the
_trigger_blink flag triggers imgui.set_window_focus('Response') on
the SAME frame as the response render. This native imgui call
apparently triggers imgui-bundle to do extra C++ draw work that
exhausts the main thread's 1.94MB stack.

Why negative_flows specifically: it's the ONLY tier-3 test where the
error response triggers the _trigger_blink path. Success responses
also trigger _trigger_blink but don't crash (perhaps because imgui-
bundle's layout calculations for an error overlay are heavier than
for a normal text response).

User predicted: 'i wont solve it but just pad out until failure'.
Confirmed - bumping stack didn't fix it (couldn't bump anyway, but
the prediction about recursion-related behavior is on track).

The fix (per user's framing 'needs to be guarded'): wrap the
set_window_focus call in render_response_panel in a try/except or
add a stack-depth guard before calling it. Or move the
_trigger_blink logic to a deferred frame to avoid the same-frame
race with the response render.
2026-06-17 13:22:38 -04:00
ed cc234b1b83 docs(tier2): architecture check - click chain isolation is correct
Per user question about whether execution is properly isolated between
AppController and gui_2.py main thread.

Verified by reading the architecture contract (docs/guide_architecture.md
lines 12, 884-890) and the two click handlers in question:

- _handle_generate_send (btn_gen_send): self.submit_io(worker)
- _cb_plan_epic (btn_mma_plan_epic): self.submit_io(_bg_task)

BOTH click handlers return immediately after submitting work. The
heavy AI call (ai_client.send -> subprocess.Popen -> process.communicate)
runs on the io_pool worker thread. The execution isolation between
AppController and gui_2.py's main render thread IS being followed.

The crash (STATUS_STACK_OVERFLOW, 0xC00000FD) is NOT in the click
handler chain. It IS in the main thread's imgui-bundle render loop.

The render loop runs concurrently with the io_pool worker's subprocess
operations. imgui-bundle's per-frame C++ draw code can exceed the main
thread's 1.94 MB stack (verified via kernel32.GetCurrentThreadStackLimits).

What aspect of negative_flows triggers this: the error-response render
path. MOCK_MODE=malformed_json causes the adapter to raise, which
triggers _handle_request_event to emit a 'response' event with
status='error'. The render loop draws this error response on the next
frame, exhausting the main thread's stack.

test_visual_orchestration.py uses the same provider setup but does NOT
set MOCK_MODE, so the mock defaults to 'success' mode, the adapter
returns normally, no error event, no crash. Empirically PASSED in
11.01s.

The architecture's render-loop contract assumes imgui-bundle's C stack
usage is bounded. It's not. The architecture has no enforcement
mechanism (no stack guard, no per-frame stack measurement, no graceful
degradation).

Next step (post-compact): capture Windows crash dump via procdump to
identify the specific imgui-bundle draw call.
2026-06-17 13:09:57 -04:00
ed cc2105dc65 docs(tier2): what's special about test_z_negative_flows
User asked why this test is uniquely affected. Answer: it's the ONLY
tier-3 test where the AI call runs ASYNCHRONOUSLY in the io_pool worker
while the imgui-bundle render loop continues on the main thread.

Verified: test_visual_orchestration.py::test_mma_epic_lifecycle uses
the same provider setup (gemini_cli + mock_gemini_cli.py + click) but
calls orchestrator_pm.generate_tracks() synchronously in the main
thread, blocking the render loop. It PASSES in 11s.

test_mma_step_mode_sim.py::test_mma_step_mode_approval_flow also uses
the async path but is @pytest.mark.skipif(not RUN_MMA_INTEGRATION) -
skipped by default. Would likely also crash if unsuppressed.

All other MockProvider tests short-circuit at ai_client.send and never
spawn a subprocess.

The crash is on the MAIN thread (1.94 MB stack, verified via
kernel32.GetCurrentThreadStackLimits), not the io_pool worker (which
has 8MB after threading.stack_size(8MB) patch). The main thread's
imgui-bundle render loop runs concurrently with the io_pool worker's
subprocess.Popen / process.communicate. The accumulated imgui-bundle
C++ frames exhaust the main thread's 1.94 MB stack.

This explains:
- Why bumping io_pool stack to 8MB doesn't help (the patch can't reach
  the main thread, which was created before any sitecustomize runs).
- Why the standalone subprocess call works (no render loop concurrent).
- Why the no-click baseline survives 60s (no AI call to trigger the race).

Next step: capture a Windows crash dump via procdump or cdb.exe to
confirm the crashing thread is the main thread and identify the
specific imgui-bundle C++ stack frame.
2026-06-17 12:58:15 -04:00