Private
Public Access
0
0
Commit Graph

1 Commits

Author SHA1 Message Date
ed e1c8730f20 fix(tests): bound run_tests_batched.py hang at 30s via daemon watchdog
run_tests_batched.py hangs at the end of a batch when the pytest
subprocess fails to exit cleanly. Two hang chains have been observed:

  1. ThreadPoolExecutor.__del__ -> shutdown(wait=True) joining a
     blocked worker during interpreter finalization
     (concurrent.futures._python_exit, pool __del__, etc.).
  2. The session-scoped \live_gui\ fixture teardown hanging in
     client.reset_session() (HTTP call to hook server) or
     kill_process_tree(process.pid) / process.wait(timeout=2)
     (waiting for the sloppy.py subprocess to die on Windows).

A previous atexit-based fix (commit 8957c9a5) attempted to preempt
chain #1, but verified empirically that atexit handlers do NOT fire
at all when a pool worker is blocked in user code (see
src/io_pool.py module docstring for the full analysis). The
atexit-based fix is therefore ineffective, and was removed from
the conftest in this commit.

Solution: a daemon-thread watchdog that unconditionally calls
os._exit(0) after 30s. If pytest exits cleanly first, the thread
is killed when the process tears down (daemon=True). If pytest
hangs, the watchdog kicks in and the batched runner can move to
the next batch. Same pattern as
src/app_controller.py:_install_sigint_exit_handler (the production
Ctrl+C fix); the difference is the trigger (time-based vs. SIGINT).

Files:
- tests/conftest.py: replaced the ineffective atexit-based fix
  with the daemon-thread watchdog. Header comment documents both
  hang chains and explains why atexit was abandoned.
- tests/test_conftest_watchdog.py: 3 static regression tests that
  verify the watchdog is registered as a daemon thread with a
  timeout in the 25-35s range. Static checks (not subprocess) so
  the test itself isn't recursively bound by the watchdog.
2026-06-07 10:02:07 -04:00