
fix(tests): bound run_tests_batched.py hang at 30s via daemon watchdog

run_tests_batched.py hangs at the end of a batch when the pytest
subprocess fails to exit cleanly. Two hang chains have been observed:

  1. ThreadPoolExecutor.__del__ -> shutdown(wait=True) joining a
     blocked worker during interpreter finalization
     (concurrent.futures._python_exit, pool __del__, etc.).
  2. The session-scoped `live_gui` fixture teardown hanging in
     client.reset_session() (HTTP call to hook server) or
     kill_process_tree(process.pid) / process.wait(timeout=2)
     (waiting for the sloppy.py subprocess to die on Windows).

A previous atexit-based fix (commit 8957c9a5) attempted to preempt
chain #1, but empirical testing showed that atexit handlers do NOT
fire at all while a pool worker is blocked in user code (see
src/io_pool.py module docstring for the full analysis). The
atexit-based fix is therefore ineffective and is removed from the
conftest in this commit.
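
The ordering behind this finding can be reproduced in isolation. The
sketch below is not taken from src/io_pool.py. It assumes a plain
non-daemon threading.Thread is a fair stand-in for a pool worker, and
it uses a 1s sleep in place of an indefinite block; CHILD and
run_child are illustrative names. It shows that CPython joins
non-daemon threads before it runs atexit handlers, so a handler cannot
preempt a worker that never returns.

```python
import subprocess
import sys
import textwrap

# Child script: a non-daemon worker outlives the main thread, and an
# atexit handler reports when it finally gets to run.
CHILD = textwrap.dedent("""
    import atexit, threading, time

    def worker():
        time.sleep(1.0)  # stand-in for a worker blocked in user code
        print("worker done", flush=True)

    atexit.register(lambda: print("atexit ran", flush=True))
    threading.Thread(target=worker).start()  # non-daemon by default
    print("main done", flush=True)
""")

def run_child():
    """Run the child script and return its stdout lines in order."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout.splitlines()

if __name__ == "__main__":
    print(run_child())
```

The handler only reports after the worker has finished. If the worker
blocked forever, the handler would never be reached.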

Solution: a daemon-thread watchdog that unconditionally calls
os._exit(0) 30s after tests/conftest.py is imported. If pytest exits
cleanly first, the thread dies with the process (daemon=True). If
pytest hangs, the watchdog fires and the batched runner can move on
to the next batch. Same pattern as
src/app_controller.py:_install_sigint_exit_handler (the production
Ctrl+C fix); the only difference is the trigger (time-based vs.
SIGINT).
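
A minimal sketch of the watchdog pattern, assuming a 1s timeout
instead of 30s so that it finishes quickly. CHILD and run_child are
illustrative names, and the never-returning non-daemon thread is a
stand-in for hang chain #1, not the real pool worker.

```python
import subprocess
import sys
import textwrap
import time

# Child script: without the watchdog this process would never exit,
# because interpreter shutdown joins the blocked non-daemon thread.
CHILD = textwrap.dedent("""
    import os, threading, time

    def watchdog(timeout):
        time.sleep(timeout)
        os._exit(0)  # exits immediately, skipping interpreter finalization

    threading.Thread(target=watchdog, args=(1.0,), daemon=True).start()
    # Non-daemon thread that waits on an Event nobody ever sets.
    threading.Thread(target=threading.Event().wait).start()
""")

def run_child():
    """Return (exit code, elapsed seconds) for the child script."""
    start = time.monotonic()
    result = subprocess.run([sys.executable, "-c", CHILD], timeout=30)
    return result.returncode, time.monotonic() - start

if __name__ == "__main__":
    code, elapsed = run_child()
    print(f"exit code {code} after {elapsed:.1f}s")
```

Because os._exit(0) always reports status 0, a caller that inspects
the return code cannot tell a watchdog exit from a clean one.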

Files:
- tests/conftest.py: replaced the ineffective atexit-based fix
  with the daemon-thread watchdog. Header comment documents both
  hang chains and explains why atexit was abandoned.
- tests/test_conftest_watchdog.py: 3 static regression tests that
  verify the watchdog is registered as a daemon thread with a
  timeout in the 25-35s range. Static checks (not subprocess) so
  the test itself isn't recursively bound by the watchdog.
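
A static check of this kind might look like the sketch below. It is
not the contents of tests/test_conftest_watchdog.py: find_watchdog
and SAMPLE are hypothetical names, and the approach (walking the
source with the ast module) is an assumption about how such a check
could be written without running the watchdog.

```python
import ast

def find_watchdog(source):
    """Return (daemon flag, sleep seconds) found in `source`.

    Either element is None when the matching call is absent. If several
    Thread(...) or sleep(...) calls exist, the last one visited wins.
    """
    daemon = None
    timeout = None
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handle both `threading.Thread(...)` and a bare `Thread(...)`.
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name == "Thread":
            for kw in node.keywords:
                if kw.arg == "daemon" and isinstance(kw.value, ast.Constant):
                    daemon = kw.value.value
        elif name == "sleep" and node.args and isinstance(node.args[0], ast.Constant):
            timeout = node.args[0].value
    return daemon, timeout

# Sample input mirroring the watchdog lines added to tests/conftest.py.
SAMPLE = '''
def _watchdog_exit() -> None:
    import time
    time.sleep(30.0)
    os._exit(0)
import threading
threading.Thread(target=_watchdog_exit, daemon=True, name="conftest-hang-watchdog").start()
'''

if __name__ == "__main__":
    daemon, timeout = find_watchdog(SAMPLE)
    assert daemon is True
    assert 25 <= timeout <= 35
```

The source is parsed, never executed, so no watchdog thread is
started while the check runs.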
2026-06-07 10:02:07 -04:00
parent 01ddf9f163
commit e1c8730f20
2 changed files with 126 additions and 17 deletions
+33 -17
@@ -34,18 +34,33 @@ install()
 # the live_gui fixture also creates one), this call is a no-op or
 # fast (warmup already done).
 #
-# FIX (startup_speedup_20260606 sub-track 4 follow-up): The original
-# code held `_warmup_app_controller` at module scope for the entire
-# pytest session. When pytest exits, GC of the AppController triggers
-# ThreadPoolExecutor.__del__ -> shutdown(wait=True). If warmup hasn't
-# fully completed, shutdown blocks indefinitely, causing the batched
-# test runner to hang after pytest exits.
+# HANG PROTECTION: The run_tests_batched.py runner hangs at the end
+# of a batch when the pytest subprocess fails to exit cleanly. Two
+# hang chains have been observed:
+# 1. ThreadPoolExecutor.__del__ -> shutdown(wait=True) joining a
+# blocked worker (concurrent.futures._python_exit, pool __del__,
+# etc.). An earlier atexit fix at commit 8957c9a5 attempted to
+# preempt this; verified empirically that atexit handlers do NOT
+# fire at all when a pool worker is blocked in user code, so the
+# fix is ineffective (see src/io_pool.py module docstring).
+# 2. The session-scoped `live_gui` fixture teardown (conftest.py:~451)
+# hangs in client.reset_session() (HTTP call to the hook server)
+# or kill_process_tree(process.pid) / process.wait(timeout=2)
+# (waiting for the sloppy.py subprocess to die on Windows).
+# Both chains keep the pytest subprocess alive indefinitely, which
+# makes run_tests_batched.py hang at subprocess.run() waiting for the
+# child to exit.
 #
-# Fix: register an atexit handler that captures the pool reference
-# directly (not the AppController) and shuts it down with wait=False.
-# shutdown() is idempotent, so the subsequent shutdown(wait=True) in
-# __del__ is a no-op. The pool reference is captured by closure so it
-# survives even after the AppController is GC'd.
+# Solution: a daemon-thread watchdog that unconditionally calls
+# os._exit(0) after a generous timeout. If pytest exits cleanly
+# first, the thread is killed when the process tears down
+# (daemon=True). If pytest hangs, the watchdog kicks in and the
+# batched runner can move to the next batch. 30s timeout: batches
+# 1-3 in the user's run completed in 1-5s of test execution; 30s
+# leaves headroom for slow batches while bounding the worst-case
+# hang at half a minute. See src/app_controller.py:_install_sigint_exit_handler
+# for the same pattern (SIGINT + os._exit(0)) applied to the
+# production Ctrl+C path.
 import atexit
 from src.app_controller import AppController
 _warmup_app_controller = AppController()
@@ -58,12 +73,13 @@ if not _warmup_app_controller.wait_for_warmup(timeout=60.0):
         RuntimeWarning,
         stacklevel=2,
     )
-_warmup_io_pool = getattr(_warmup_app_controller, "_io_pool", None)
-def _shutdown_warmup_pool(pool: object = _warmup_io_pool) -> None:
-    if pool is not None:
-        try: pool.shutdown(wait=False)
-        except Exception: pass
-atexit.register(_shutdown_warmup_pool)
+def _watchdog_exit() -> None:
+    import time
+    time.sleep(30.0)
+    os._exit(0)
+import threading
+threading.Thread(target=_watchdog_exit, daemon=True, name="conftest-hang-watchdog").start()
 from src.gui_2 import App