docs(tier2): ban output filtering + prefer targeted tier runs
Two new rules for Tier 2 (added per user directive 2026-06-27 after Tier 2 ran the full batch and piped through Select-Object -Last 20, losing the full record): 1. NEVER filter test output (Select-Object, head, tail, | Select -First N). ALWAYS redirect to a log file, then read it with read_file/grep. 2. Prefer targeted tier runs (--tier tier3, --filter test_<file>) over the full 11-tier batch. The full batch is for the USER post-merge, not for Tier 2 per-task verification. Applied to 3 files: tier2-autonomous.md, tier-2-auto-execute.md, workflow.md Tier 2 Autonomous Sandbox conventions.
This commit is contained in:
@@ -91,6 +91,8 @@ This gate catches the failure mode in the 2026-06-24 MCP regression where Tier 2
|
||||
## Conventions (MUST follow - added 2026-06-17; updated 2026-06-27)
|
||||
|
||||
- **Test runner:** ALWAYS use `uv run python scripts/run_tests_batched.py` for test runs. NEVER call `uv run pytest` directly. The batched runner provides tier-based filtering, parallelization (xdist), and a summary table. Direct pytest is slow and bypasses the tiering that the live_gui tests depend on.
|
||||
- **NEVER filter test output** (added 2026-06-27 per user directive). Do NOT pipe test output through `Select-Object`, `| Select -First N`, `| Select -Last N`, `head`, `tail`, or any truncation filter. If you need to see more output later, you'll have to re-run the entire test — which wastes time and context. Instead, ALWAYS redirect to a log file: `uv run python scripts/run_tests_batched.py > tests/artifacts/tier2_state/<track>/test_run_<phase>_<task>.log 2>&1`. Then read the log file with `manual-slop_read_file` or `grep` to find the relevant sections. The log file is your full record; you can search it without re-running.
|
||||
- **Prefer targeted tier runs** (added 2026-06-27 per user directive). Do NOT run the full 11-tier batch for every verification. Run only the tiers relevant to the current task (e.g., `uv run python scripts/run_tests_batched.py --tier tier3` or `--filter test_<specific_file>`). The full batch is for the USER to run after merge review, not for Tier 2's per-task verification. Running the full batch every time wastes 20+ minutes and the output is too large to be useful in context.
|
||||
- **Default branch:** this repo uses `master` (not `main`). Always use `origin/master` in `git fetch` and as the base for new branches. Do not assume `main` exists.
|
||||
- **Line endings:** preserve existing line endings on edit. This repo has a mix of CRLF and LF (a repo-wide LF standardization is a future track). If the file is CRLF, keep it CRLF. If the file is LF, keep it LF. Do not add CRLF to LF files or strip CRLF from CRLF files.
|
||||
- **Throw-away scripts:** write them to `scripts/tier2/artifacts/<track-name>/`, NOT the base `scripts/tier2/` directory. The base directory is reserved for production code that ships with the sandbox (failcount.py, run_track.py, write_report.py, the .ps1 launchers). Throw-away scripts are kept for archival but live in a track-specific subdir so they don't pollute the base.
|
||||
|
||||
@@ -51,6 +51,8 @@ Optional flags: `--resume` (continue from last completed task), `--toast` (Windo
|
||||
## Conventions (MUST follow - added 2026-06-17)
|
||||
|
||||
- **Test runner:** use `uv run python scripts/run_tests_batched.py` (NOT `uv run pytest`)
|
||||
- **NEVER filter test output** (added 2026-06-27 per user directive). Do NOT pipe test output through `Select-Object`, `| Select -First N`, `| Select -Last N`, `head`, `tail`, or any truncation filter. Instead, ALWAYS redirect to a log file: `uv run python scripts/run_tests_batched.py > tests/artifacts/tier2_state/<track>/test_run_<phase>_<task>.log 2>&1`. Then read the log file to find relevant sections. The log file is your full record; you can search it without re-running.
|
||||
- **Prefer targeted tier runs** (added 2026-06-27 per user directive). Do NOT run the full 11-tier batch for every verification. Run only the tiers relevant to the current task (e.g., `--tier tier3` or `--filter test_<specific_file>`). The full batch is for the USER to run after merge review, not for Tier 2's per-task verification.
|
||||
- **Default branch:** `master` (this repo never had `main`)
|
||||
- **Line endings:** preserve existing (CRLF stays CRLF, LF stays LF)
|
||||
- **Throw-away scripts:** write to `scripts/tier2/artifacts/<track-name>/`, NOT the base directory
|
||||
|
||||
@@ -383,11 +383,13 @@ The Tier 2 autonomous mode is the unattended execution mode for tracks. See `doc
|
||||
### Conventions (MUST follow)
|
||||
|
||||
1. **Test runner:** Tier 2 always uses `uv run python scripts/run_tests_batched.py`. NEVER `uv run pytest` directly. The batched runner provides tier-based filtering, parallelization (xdist), and a summary table that direct pytest does not.
|
||||
2. **Default branch:** this repo uses `master` (not `main`). When fetching or branching, use `origin/master`. Do not assume `main` exists.
|
||||
3. **Line endings:** preserve existing line endings on edit. This repo has a mix of CRLF and LF; repo-wide LF standardization is a future track. For now, do not normalize.
|
||||
4. **Throw-away scripts:** Tier 2 writes its working scripts to `scripts/tier2/artifacts/<track-name>/`, NOT the base `scripts/tier2/` directory. The base is reserved for production code (failcount.py, run_track.py, write_report.py, the .ps1 launchers). Throw-away scripts are kept for archival but isolated.
|
||||
5. **End-of-track report:** at the end of every track, Tier 2 writes `docs/reports/TRACK_COMPLETION_<track-name>.md` (follow the precedent set by `TRACK_COMPLETION_tier2_autonomous_sandbox_20260616.md`) and updates `conductor/tracks/<track-name>/state.toml` to `status = "completed"`. The user reads this report to decide merge.
|
||||
6. **Run-time expectation:** tracks are 1-4 hours. If the model reports it is running out of context, Tier 2 notes progress to disk (the failcount state file) and continues. The user expects autonomous runs to complete without manual "press continue" intervention. The `--resume` flag picks up from the last completed task.
|
||||
2. **NEVER filter test output** (added 2026-06-27 per user directive). Do NOT pipe test output through `Select-Object`, `| Select -First N`, `| Select -Last N`, `head`, `tail`, or any truncation filter. If you need to see more output later, you'll have to re-run the entire test — which wastes time and context. Instead, ALWAYS redirect to a log file: `uv run python scripts/run_tests_batched.py > tests/artifacts/tier2_state/<track>/test_run_<phase>_<task>.log 2>&1`. Then read the log file with `manual-slop_read_file` or `grep` to find the relevant sections. The log file is your full record; you can search it without re-running.
|
||||
3. **Prefer targeted tier runs** (added 2026-06-27 per user directive). Do NOT run the full 11-tier batch for every verification. Run only the tiers relevant to the current task (e.g., `--tier tier3` or `--filter test_<specific_file>`). The full batch is for the USER to run after merge review, not for Tier 2's per-task verification. Running the full batch every time wastes 20+ minutes and the output is too large to be useful in context.
|
||||
4. **Default branch:** this repo uses `master` (not `main`). When fetching or branching, use `origin/master`. Do not assume `main` exists.
|
||||
5. **Line endings:** preserve existing line endings on edit. This repo has a mix of CRLF and LF; repo-wide LF standardization is a future track. For now, do not normalize.
|
||||
6. **Throw-away scripts:** Tier 2 writes its working scripts to `scripts/tier2/artifacts/<track-name>/`, NOT the base `scripts/tier2/` directory. The base is reserved for production code (failcount.py, run_track.py, write_report.py, the .ps1 launchers). Throw-away scripts are kept for archival but isolated.
|
||||
7. **End-of-track report:** at the end of every track, Tier 2 writes `docs/reports/TRACK_COMPLETION_<track-name>.md` (follow the precedent set by `TRACK_COMPLETION_tier2_autonomous_sandbox_20260616.md`) and updates `conductor/tracks/<track-name>/state.toml` to `status = "completed"`. The user reads this report to decide merge.
|
||||
8. **Run-time expectation:** tracks are 1-4 hours. If the model reports it is running out of context, Tier 2 notes progress to disk (the failcount state file) and continues. The user expects autonomous runs to complete without manual "press continue" intervention. The `--resume` flag picks up from the last completed task.
|
||||
|
||||
### Hard bans (3-layer enforcement)
|
||||
|
||||
|
||||
Reference in New Issue
Block a user