manual_slop/tests/test_conftest_watchdog.py
ed e1c8730f20 fix(tests): bound run_tests_batched.py hang at 30s via daemon watchdog
run_tests_batched.py hangs at the end of a batch when the pytest
subprocess fails to exit cleanly. Two hang chains have been observed:

  1. ThreadPoolExecutor.__del__ -> shutdown(wait=True) joining a
     blocked worker during interpreter finalization
     (concurrent.futures._python_exit, pool __del__, etc.).
  2. The session-scoped `live_gui` fixture teardown hanging in
     client.reset_session() (HTTP call to hook server) or
     kill_process_tree(process.pid) / process.wait(timeout=2)
     (waiting for the sloppy.py subprocess to die on Windows).

A previous atexit-based fix (commit 8957c9a5) attempted to preempt
chain #1, but empirical testing showed that atexit handlers do NOT
fire at all when a pool worker is blocked in user code (see
src/io_pool.py module docstring for the full analysis). The
atexit-based fix is therefore ineffective and is removed from
the conftest in this commit.

Solution: a daemon-thread watchdog that unconditionally calls
os._exit(0) after 30s. If pytest exits cleanly first, the thread
is killed when the process tears down (daemon=True). If pytest
hangs, the watchdog kicks in and the batched runner can move to
the next batch. Same pattern as
src/app_controller.py:_install_sigint_exit_handler (the production
Ctrl+C fix); the difference is the trigger (time-based vs. SIGINT).
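
The watchdog pattern described above can be sketched roughly as follows.
This is a minimal illustration, not the actual tests/conftest.py code:
the function name install_hang_watchdog and the injectable exit_fn and
sleep_fn parameters are hypothetical, added only so the sketch can be
exercised without killing the interpreter. The real conftest evidently
uses a literal time.sleep(<seconds>) call, since the regression test
searches the source for one.

```python
import os
import threading
import time
from typing import Callable

# Thread name taken from the regression test in this commit.
WATCHDOG_NAME = "conftest-hang-watchdog"


def install_hang_watchdog(
    timeout_s: float = 30.0,
    exit_fn: Callable[[int], None] = os._exit,        # hypothetical hook
    sleep_fn: Callable[[float], None] = time.sleep,   # hypothetical hook
) -> threading.Thread:
    """Start a daemon thread that force-exits the process after timeout_s."""

    def _watchdog() -> None:
        sleep_fn(timeout_s)
        # os._exit skips atexit handlers and interpreter finalization,
        # which is the point: those are the steps that hang.
        exit_fn(0)

    thread = threading.Thread(target=_watchdog, name=WATCHDOG_NAME, daemon=True)
    thread.start()
    return thread


# In a conftest this would be called once at import time, for example:
#   install_hang_watchdog(30.0)
```

Because the thread is a daemon, it does not keep the process alive when
pytest exits normally; it only matters when finalization is stuck.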

Files:
- tests/conftest.py: replaced the ineffective atexit-based fix
  with the daemon-thread watchdog. Header comment documents both
  hang chains and explains why atexit was abandoned.
- tests/test_conftest_watchdog.py: 3 static regression tests that
  verify the watchdog is registered as a daemon thread with a
  timeout in the 25-35s range. Static checks (not subprocess) so
  the test itself isn't recursively bound by the watchdog.
2026-06-07 10:02:07 -04:00

94 lines
3.3 KiB
Python

"""Regression: pytest conftest must install a hang-bounding watchdog.

The run_tests_batched.py runner hangs at the end of a batch when the
pytest subprocess fails to exit cleanly. Two hang chains have been
observed:

  1. ThreadPoolExecutor.__del__ -> shutdown(wait=True) on a blocked
     worker during interpreter finalization.
  2. The session-scoped `live_gui` fixture teardown (conftest.py:~451)
     hanging on the HTTP call to the hook server or on process.wait()
     for the sloppy.py subprocess.

The conftest installs a daemon-thread watchdog (os._exit(0) after a
timeout) to bound the hang. These tests verify that the watchdog is
actually registered after the conftest loads. They do NOT spawn a
subprocess (which would itself be bound by the watchdog and create a
recursive timeout); they inspect threading.enumerate() at the time
the test runs and read the timeout from the conftest source.

If the watchdog is removed or the timeout drifts outside the
tolerance, these tests fail, flagging that the run_tests_batched.py
hang may return.
"""
import sys
import threading
from pathlib import Path

import pytest

ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(ROOT))

# The conftest has already been loaded by pytest before this test
# collection. We just need to verify the watchdog thread is alive.
WATCHDOG_NAME = "conftest-hang-watchdog"
WATCHDOG_SLEEP_SECONDS = 30.0
WATCHDOG_TOLERANCE_SECONDS = 5.0

def test_watchdog_thread_registered() -> None:
    """Verify the conftest's hang-bounding watchdog thread is alive.

    The watchdog is a daemon thread named "conftest-hang-watchdog" that
    sleeps for ~30s then calls os._exit(0). It must be alive (not yet
    fired) at the time this test runs, because the pytest session has
    not been running for 30s yet.
    """
    threads = threading.enumerate()
    names = [t.name for t in threads]
    assert WATCHDOG_NAME in names, (
        f"conftest watchdog thread {WATCHDOG_NAME!r} not found in "
        f"threading.enumerate(); run_tests_batched.py will hang at end "
        f"of batch. Active threads: {names}"
    )

def test_watchdog_thread_is_daemon() -> None:
    """Watchdog must be daemon so it doesn't block pytest's own exit."""
    for t in threading.enumerate():
        if t.name == WATCHDOG_NAME:
            assert t.daemon, (
                f"watchdog thread is not daemon (daemon={t.daemon}); "
                f"this would prevent pytest from exiting cleanly"
            )
            return
    pytest.fail(f"watchdog thread {WATCHDOG_NAME!r} not found")

def test_watchdog_timeout_within_tolerance() -> None:
    """Watchdog timeout must be near the documented 30s value.

    If the timeout drifts too low (<25s), normal slow batches could
    be killed prematurely. If it drifts too high (>35s), the hang
    bounding is too loose. This test enforces the contract.
    """
    import re

    conftest_path = Path(__file__).resolve().parent / "conftest.py"
    text = conftest_path.read_text(encoding="utf-8")
    # Look for the watchdog sleep call and extract the timeout.
    # Note: this matches the first literal time.sleep(<number>) in the
    # file, so it assumes the watchdog's sleep is that first call.
    match = re.search(r"time\.sleep\(([\d.]+)\)", text)
    assert match is not None, (
        f"could not find time.sleep() call in {conftest_path}; "
        f"watchdog may have been removed or restructured"
    )
    sleep_value = float(match.group(1))
    assert (
        WATCHDOG_SLEEP_SECONDS - WATCHDOG_TOLERANCE_SECONDS
        <= sleep_value
        <= WATCHDOG_SLEEP_SECONDS + WATCHDOG_TOLERANCE_SECONDS
    ), (
        f"watchdog timeout is {sleep_value}s; expected "
        f"~{WATCHDOG_SLEEP_SECONDS}s +/- {WATCHDOG_TOLERANCE_SECONDS}s. "
        f"If the timeout was intentionally changed, update this test."
    )