07a0e66a19
User feedback from the first sandbox run (send_result_to_send_20260616, 2026-06-17) identified 6 conventions Tier 2 must follow. Update the agent prompt template, slash command template, user guide, and workflow doc:

1. Test runner: ALWAYS use 'uv run python scripts/run_tests_batched.py' (NOT 'uv run pytest'). The batched runner provides tier filtering, parallelization (xdist), and a summary table that direct pytest lacks.
2. Default branch: this repo uses 'master', not 'main'. The Tier 2 slash command now does 'git fetch origin master' (was 'origin main').
3. Line endings: preserve existing. This repo has a mix of CRLF and LF; a repo-wide LF standardization is a future track.
4. Throw-away scripts: write to 'scripts/tier2/artifacts/<track>/', NOT the base 'scripts/tier2/' directory. The base is reserved for production code; throw-away scripts are kept for archival but isolated per-track.
5. End-of-track report: write 'docs/reports/TRACK_COMPLETION_<track>.md' and update 'state.toml' to 'status=completed'. The user reads this to decide merge. Previously this was implicit; now it's explicit.
6. Run-time expectation: tracks are 1-4 hours. If context runs out, Tier 2 notes progress to disk and continues. The --resume flag picks up from the last completed task.

Also updated the user guide with a 'Conventions' section and a troubleshooting entry for the resume flow. The verify-the-sandbox checklist now uses 'origin master' instead of 'origin main'.
131 lines
6.7 KiB
Markdown
# Tier 2 Autonomous Sandbox

## Why this exists

When you run Tier 2 in the main repo, every `edit` and every `bash`
call prompts you for approval (`permission: ask`). For well-regularized
tracks (TDD red/green with atomic per-task commits), this is noise.
This track adds an **autonomous mode** in a sibling clone where Tier 2
runs unattended, with a 3-layer enforcement stack to keep it contained.

## One-time bootstrap

```powershell
cd C:\projects\manual_slop
pwsh -File scripts\tier2\setup_tier2_clone.ps1 -WhatIf   # dry run first
pwsh -File scripts\tier2\setup_tier2_clone.ps1           # actual bootstrap
```

The bootstrap:
1. Clones the main repo to `C:\projects\manual_slop_tier2\`
2. Sets `origin = C:\projects\manual_slop` (a local path, not a network remote)
3. Copies the agent, slash command, and opencode.json templates to the clone
4. Installs the git hooks (`pre-push` refuses all pushes; `post-checkout` logs checkouts)
5. Creates `C:\Users\Ed\AppData\Local\manual_slop\tier2\` with restricted ACLs
6. Creates a "Tier 2 (Sandboxed)" desktop shortcut

## Per-track invocation

1. Double-click the "Tier 2 (Sandboxed)" desktop shortcut
   (or run `pwsh -File C:\projects\manual_slop\scripts\tier2\run_tier2_sandboxed.ps1` manually)
2. In the OpenCode session, type:
   ```
   /tier-2-auto-execute <track-name>
   ```
   Examples:
   - `/tier-2-auto-execute result_migration_review_pass`
   - `/tier-2-auto-execute data_structure_strengthening_20260606 --resume`
   - `/tier-2-auto-execute rag_test_failures_20260615 --toast`
3. Tier 2 runs the track autonomously, commits per task, and monitors the failcount
4. On success: prints a summary
5. On give-up: writes a failure report and prints the path

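## Review and merge

After Tier 2 finishes (success or give-up):
1. `cd C:\projects\manual_slop` (back to the main repo)
2. `git fetch C:/projects/manual_slop_tier2 tier2/<track-name>:tier2/<track-name>` (the `src:dst` refspec creates the local `tier2/<track-name>` branch; a fetch without it only updates `FETCH_HEAD`, and the merge in step 4 would not find the branch)
3. Review the diff with Tier 1 (interactive)
4. On approval: `git merge --no-ff tier2/<track-name>` into `master` in the main repo
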
## The 4 hard bans (enforced at 3 layers)

| Ban | Layer 1 (OpenCode) | Layer 2 (OS) | Layer 3 (git hook) |
|---|---|---|---|
| `git push*` (any push) | `permission.bash` deny rule | n/a | `pre-push` hook refuses all pushes |
| `git checkout*` (any form) | `permission.bash` deny rule | n/a | `post-checkout` hook logs the checkout |
| `git restore*` (any form) | `permission.bash` deny rule | n/a | n/a |
| `git reset*` (any form) | `permission.bash` deny rule | n/a | n/a |
| File access outside Tier 2 clone + app-data dir | `permission.read`/`write` path allowlist | Windows ACL | n/a |

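
To make the Layer 1 column concrete, here is a minimal Python sketch of how glob-style deny patterns like the ones in the table behave. The patterns are copied from the table; the matching logic (`fnmatch` after whitespace normalization) and the helper name `is_denied` are assumptions for illustration, not OpenCode's actual rule engine.

```python
from fnmatch import fnmatchcase

# Glob patterns copied from the ban table. How OpenCode really evaluates
# `permission.bash` rules is not shown in this guide; this is an assumption.
DENY_PATTERNS = ["git push*", "git checkout*", "git restore*", "git reset*"]


def is_denied(command: str) -> bool:
    """Return True if the command matches any banned glob pattern."""
    normalized = " ".join(command.split())  # collapse repeated whitespace
    return any(fnmatchcase(normalized, pattern) for pattern in DENY_PATTERNS)


if __name__ == "__main__":
    samples = [
        "git push origin master",
        "git reset --hard HEAD~1",
        "git status",
        "git switch -c test-branch",
    ]
    for cmd in samples:
        verdict = "denied" if is_denied(cmd) else "allowed"
        print(f"{cmd!r}: {verdict}")
```

Under these assumptions `git switch -c test-branch` stays allowed while every `git checkout` form is denied, which matches the checklist later in this guide.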
## The failcount threshold

Tier 2 gives up if ANY of these hit:
- 3 consecutive red-phase failures (the test doesn't fail when it should)
- 3 consecutive green-phase failures (the implementation doesn't make the test pass)
- 30 minutes with no progress (no commit, no green test)

Override via `scripts/tier2/failcount.toml`.

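
The three give-up conditions can be sketched as a small Python check. This is an illustrative sketch, not the shipped implementation: the `FailcountState` fields and the `max_red` / `max_green` parameter names are hypothetical, and only `no_progress_minutes` is a key name this guide actually mentions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailcountState:
    """Hypothetical counters; the real state layout is not shown in this guide."""
    consecutive_red_failures: int = 0
    consecutive_green_failures: int = 0
    minutes_since_progress: float = 0.0


def give_up_signal(
    state: FailcountState,
    max_red: int = 3,                   # assumed parameter name
    max_green: int = 3,                 # assumed parameter name
    no_progress_minutes: float = 30.0,  # key name taken from this guide
) -> Optional[str]:
    """Return the name of the first threshold that was hit, or None."""
    if state.consecutive_red_failures >= max_red:
        return "red_phase"
    if state.consecutive_green_failures >= max_green:
        return "green_phase"
    if state.minutes_since_progress >= no_progress_minutes:
        return "no_progress"
    return None
```

Any single threshold is enough to stop the run, so the checks are independent and the first one that trips names the give-up signal.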
## The failure report

Written to `C:\Users\Ed\AppData\Local\manual_slop\tier2_failures\<track>_<timestamp>.md` with 7 sections:
1. Header (track, branch, started, stopped, duration, give-up signal)
2. Tasks completed
3. Current task (where it stopped)
4. Last 3 failures
5. Failcount state
6. Git state (`git log tier2/<track> ^origin/master`)
7. Recommendation (heuristic-based)

A `.STOPPED` flag file is created alongside the report. The main repo
can check for it on next Tier 1 session start (an opt-in banner).

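
A session-start check for the flag file could look roughly like the Python sketch below. The exact flag file name is not specified here, so the sketch assumes any file whose name ends in `.STOPPED`; the function names and the banner wording are hypothetical.

```python
from pathlib import Path
from typing import List


def find_stopped_flags(failures_dir: Path) -> List[Path]:
    """List files ending in .STOPPED in failures_dir, sorted by path."""
    if not failures_dir.is_dir():
        return []
    return sorted(
        p for p in failures_dir.iterdir()
        if p.is_file() and p.name.endswith(".STOPPED")
    )


def banner_text(flags: List[Path]) -> str:
    """Build a one-line banner, or an empty string when nothing stopped."""
    if not flags:
        return ""
    return f"Tier 2 stopped on {len(flags)} run(s); see {flags[0].parent}"
```

Calling `find_stopped_flags` on the `tier2_failures` directory named above and printing `banner_text(...)` only when it is non-empty keeps the banner opt-in and silent on clean runs.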
## Conventions (added 2026-06-17)

These are enforced by the Tier 2 agent prompt. The agent MUST follow them; they are not optional.

- **Test runner:** Tier 2 always uses `uv run python scripts/run_tests_batched.py`. Never `uv run pytest` directly. The batched runner provides tier-based filtering, parallelization (xdist), and a summary table that direct pytest doesn't.
- **Default branch:** this repo uses `master` (not `main`). When fetching or branching, use `origin/master`. Tier 2 may otherwise get confused by the missing `main` reference.
- **Line endings:** Tier 2 preserves existing line endings on edit. This repo has a mix of CRLF and LF; standardizing to repo-wide LF is a future track. For now, do not normalize.
- **Throw-away scripts:** Tier 2 writes its working scripts to `scripts/tier2/artifacts/<track-name>/`, NOT the base `scripts/tier2/` directory. The base directory is reserved for production code. Throw-away scripts are kept for archival but isolated in a track-specific subdir.
- **End-of-track report:** at the end of every track, Tier 2 writes `docs/reports/TRACK_COMPLETION_<track-name>.md` (follow the precedent set by `TRACK_COMPLETION_tier2_autonomous_sandbox_20260616.md`) and updates `conductor/tracks/<track-name>/state.toml` to `status = "completed"`. The user reads this report to decide merge.
- **Run-time expectation:** tracks are expected to take 1-4 hours. If the model reports it is running out of context, Tier 2 notes progress to disk and continues. The user expects autonomous runs to complete without manual "press continue" intervention.

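
As an illustration of the line-endings convention, the following Python sketch detects a file's dominant line ending and re-applies it to edited text. It is one possible approach under stated assumptions, not how the agent is known to do it; both function names are hypothetical, and a file that mixes CRLF and LF internally would be rewritten to whichever ending dominates.

```python
def detect_newline(data: bytes) -> bytes:
    """Return the dominant line ending in data (CRLF or LF); LF on a tie or if none."""
    crlf = data.count(b"\r\n")
    bare_lf = data.count(b"\n") - crlf  # LF bytes not preceded by CR
    return b"\r\n" if crlf > bare_lf else b"\n"


def apply_edit_preserving_newlines(original: bytes, new_text: str) -> bytes:
    """Encode new_text as UTF-8 using the line ending found in the original bytes."""
    newline = detect_newline(original).decode("ascii")
    normalized = new_text.replace("\r\n", "\n")  # start from LF-only text
    return normalized.replace("\n", newline).encode("utf-8")
```

Reading and writing in binary mode matters here: Python's text mode would otherwise translate line endings on its own and hide the difference.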
## Verify the sandbox (manual checklist)

After bootstrap, run these inside the Tier 2 sandboxed OpenCode session
to verify the bans are enforced:

- [ ] Try `git restore tests/test_failcount.py`: should print "denied"
- [ ] Try `git push origin master`: should print "denied" (or the pre-push hook fires)
- [ ] Try `git checkout -- src/foo.py`: should print "denied"
- [ ] Try `git reset --hard HEAD~1`: should print "denied"
- [ ] Try to read `C:\Users\Ed\Documents\test.txt` (from a Python subprocess): should print "ACCESS_DENIED"

And verify allowed operations work:
- [ ] `git status`: works
- [ ] `git switch -c test-branch`: works
- [ ] Edit a file in the Tier 2 clone: works
- [ ] `git add <file> && git commit -m "test"`: works

## Troubleshooting

- **"Tier 2 (Sandboxed) shortcut doesn't work"**: check that
  `pwsh.exe` is on the PATH (`where.exe pwsh`).
- **"Permission denied" on file access inside the sandbox**: the
  Windows ACL may be too restrictive. Re-run the bootstrap
  (`setup_tier2_clone.ps1` is idempotent).
- **"Failcount state not found"**: the `<app-data>/tier2/<track>/`
  dir may be missing. The bootstrap creates it; check `$env:LOCALAPPDATA`.
- **"Pre-push hook not firing"**: check that `.git/hooks/pre-push`
  is executable. On Windows, Git Bash runs the hook; check
  `git config core.hooksPath` if you have a custom hooks dir.
- **"Tier 2 keeps giving up at 30 min"**: increase
  `no_progress_minutes` in `scripts/tier2/failcount.toml`.
- **"Tier 2 ran out of context"**: the model stopped mid-track. The
  user (interactive Tier 1) should `cd` to the Tier 2 clone, inspect
  `<app-data>/tier2/<track>/state.json` for the last completed task,
  and re-invoke with `/tier-2-auto-execute <track-name> --resume`
  to continue. The state file persists across runs.

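
For the resume flow, inspecting the state file can be scripted. The Python sketch below is hedged heavily: this guide does not document the `state.json` schema, so the `completed_tasks` key and the function name are purely hypothetical placeholders to adapt once you have looked at a real file.

```python
import json
from pathlib import Path
from typing import Optional


def last_completed_task(state_path: Path) -> Optional[str]:
    """Return the last entry of the assumed "completed_tasks" list, if any."""
    if not state_path.is_file():
        return None
    state = json.loads(state_path.read_text(encoding="utf-8"))
    completed = state.get("completed_tasks", [])  # assumed key name
    return completed[-1] if completed else None
```

A `None` result means either the file is missing or no task has finished yet; in both cases a plain run without `--resume` is probably the safer starting point.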