The os._exit(2) change in 719c5e27 introduced a regression: the watchdog's daemon thread continues running through pytest's interpreter shutdown. On EVERY batch (even ones that complete successfully in 17s), the watchdog's time.sleep(30.0) elapses during finalization and the thread calls os._exit(2) just as pytest is wrapping up. Result: every batch was reported as 'Batch N failed' by run_tests_batched.py, even ones with '126 passed in 17.14s'.
Revert watchdog to os._exit(0) — its original purpose (force-exit any stuck pytest at 30s) doesn't need a non-zero code; it's a sledgehammer, not a signal. The runner does its own failure detection.
Update scripts/run_tests_batched.py to:
- Use subprocess.run(timeout=180) per batch
- Catch TimeoutExpired as a batch failure (with elapsed time + reason printed)
- Catch CalledProcessError as a batch failure (preserved from before)
- Print elapsed time for every batch (pass or fail) so hang behavior is visible
- Print a final summary that lists all FAILED FILES (not batches) for easy re-running
- Add --batch-size and --timeout CLI flags
- Add 1-space indentation + type hints per project style
Verified: ast.parse OK; --help works; test_conftest_watchdog 3/3 pass.
The conftest watchdog (e1c8730f) used os._exit(0) after the 30s sleep. run_tests_batched.py calls subprocess.run(check=True) and only prints 'Batch N failed.' when the subprocess exits non-zero. Exit 0 hid the failure: pytest got killed mid-test, the FAILURES section never printed, and the runner silently moved to the next batch. The 'Total batches with failures: 1' summary at the end was therefore undercounting.
Fix: os._exit(0) -> os._exit(2). Code 2 is the standard 'interrupted by signal/timeout' code; pytest also uses it for Ctrl-C. The batched runner now correctly reports a non-zero exit as a failure.
Test updated (docstring) to document the new contract. 3/3 test_conftest_watchdog.py still pass.
run_tests_batched.py hangs at the end of a batch when the pytest
subprocess fails to exit cleanly. Two hang chains have been observed:
1. ThreadPoolExecutor.__del__ -> shutdown(wait=True) joining a
blocked worker during interpreter finalization
(concurrent.futures._python_exit, pool __del__, etc.).
2. The session-scoped \live_gui\ fixture teardown hanging in
client.reset_session() (HTTP call to hook server) or
kill_process_tree(process.pid) / process.wait(timeout=2)
(waiting for the sloppy.py subprocess to die on Windows).
A previous atexit-based fix (commit 8957c9a5) attempted to preempt
chain #1, but verified empirically that atexit handlers do NOT fire
at all when a pool worker is blocked in user code (see
src/io_pool.py module docstring for the full analysis). The
atexit-based fix is therefore ineffective, and was removed from
the conftest in this commit.
Solution: a daemon-thread watchdog that unconditionally calls
os._exit(0) after 30s. If pytest exits cleanly first, the thread
is killed when the process tears down (daemon=True). If pytest
hangs, the watchdog kicks in and the batched runner can move to
the next batch. Same pattern as
src/app_controller.py:_install_sigint_exit_handler (the production
Ctrl+C fix); the difference is the trigger (time-based vs. SIGINT).
Files:
- tests/conftest.py: replaced the ineffective atexit-based fix
with the daemon-thread watchdog. Header comment documents both
hang chains and explains why atexit was abandoned.
- tests/test_conftest_watchdog.py: 3 static regression tests that
verify the watchdog is registered as a daemon thread with a
timeout in the 25-35s range. Static checks (not subprocess) so
the test itself isn't recursively bound by the watchdog.