User feedback from the first sandbox run (send_result_to_send_20260616, 2026-06-17) identified 6 conventions Tier 2 must follow. Update the agent prompt template, slash command template, user guide, and workflow doc:
1. Test runner: ALWAYS use 'uv run python scripts/run_tests_batched.py' (NOT 'uv run pytest'). The batched runner provides tier filtering, parallelization (xdist), and a summary table that direct pytest lacks.
2. Default branch: this repo uses 'master', not 'main'. The Tier 2 slash command now does 'git fetch origin master' (was 'origin main').
3. Line endings: preserve existing. This repo has a mix of CRLF and LF; a repo-wide LF standardization is a future track.
4. Throw-away scripts: write to 'scripts/tier2/artifacts/<track>/', NOT the base 'scripts/tier2/' directory. The base is reserved for production code; throw-away scripts are kept for archival but isolated per-track.
5. End-of-track report: write 'docs/reports/TRACK_COMPLETION_<track>.md' and update 'state.toml' to 'status=completed'. The user reads this to decide merge. Previously this was implicit; now it's explicit.
6. Run-time expectation: tracks are 1-4 hours. If context runs out, Tier 2 notes progress to disk and continues. The --resume flag picks up from the last completed task.
Also updated the user guide with a 'Conventions' section and a troubleshooting entry for the resume flow. The verify-the-sandbox checklist now uses 'origin master' instead of 'origin main'.
Tier 2 Autonomous Sandbox
Why this exists
When you run Tier 2 in the main repo, every edit and every bash
call prompts you for approval (permission: ask). For well-regularized
tracks (TDD red/green with atomic per-task commits), this is noise.
This track adds an autonomous mode in a sibling clone where Tier 2
runs unattended, with a 3-layer enforcement stack to keep it contained.
One-time bootstrap
cd C:\projects\manual_slop
pwsh -File scripts\tier2\setup_tier2_clone.ps1 -WhatIf # dry run first
pwsh -File scripts\tier2\setup_tier2_clone.ps1 # actual bootstrap
The bootstrap:
- Clones the main repo to C:\projects\manual_slop_tier2\
- Sets origin = C:\projects\manual_slop (local path; no remote)
- Copies the agent, slash command, and opencode.json templates to the clone
- Installs the git hooks (pre-push refuses all pushes; post-checkout logs checkouts)
- Creates C:\Users\Ed\AppData\Local\manual_slop\tier2\ with restricted ACLs
- Creates a "Tier 2 (Sandboxed)" desktop shortcut
Per-track invocation
- Double-click the "Tier 2 (Sandboxed)" desktop shortcut (or run pwsh -File C:\projects\manual_slop\scripts\tier2\run_tier2_sandboxed.ps1 manually)
- In the OpenCode session, type:
/tier-2-auto-execute <track-name>
Examples:
/tier-2-auto-execute result_migration_review_pass
/tier-2-auto-execute data_structure_strengthening_20260606 --resume
/tier-2-auto-execute rag_test_failures_20260615 --toast
- Tier 2 runs the track autonomously, commits per task, monitors failcount
- On success: prints a summary
- On give-up: writes a failure report and prints the path
Review and merge
After Tier 2 finishes (success or give-up):
- cd C:\projects\manual_slop (back to the main repo)
- git fetch C:/projects/manual_slop_tier2 tier2/<track-name>
- Review the diff with Tier 1 (interactive)
- On approval: git merge --no-ff tier2/<track-name> into the main repo
The 4 hard bans (enforced at 3 layers)
| Ban | Layer 1 (OpenCode) | Layer 2 (OS) | Layer 3 (git hook) |
|---|---|---|---|
| git push* (any push) | permission.bash deny rule | n/a | pre-push hook refuses all pushes |
| git checkout* (any form) | permission.bash deny rule | n/a | post-checkout hook logs the checkout |
| git restore* (any form) | permission.bash deny rule | n/a | n/a |
| git reset* (any form) | permission.bash deny rule | n/a | n/a |
| File access outside Tier 2 clone + app-data dir | permission.read/write path allowlist | Windows ACL | n/a |
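To make the Layer 1 column concrete, here is a minimal sketch of how prefix-style deny patterns such as git push* could be checked against a command line. The four patterns come from the table; the function name is_denied and the use of glob-style matching are assumptions for illustration and may not reflect how OpenCode actually evaluates permission.bash rules.

```python
from fnmatch import fnmatchcase

# Patterns copied from the ban table; the matching logic below is illustrative only.
DENY_PATTERNS = ("git push*", "git checkout*", "git restore*", "git reset*")


def is_denied(command: str) -> bool:
    """Return True if the command matches any deny pattern (hypothetical helper)."""
    # Collapse runs of whitespace so "git   push" is treated like "git push".
    normalized = " ".join(command.split())
    return any(fnmatchcase(normalized, pattern) for pattern in DENY_PATTERNS)


if __name__ == "__main__":
    for cmd in ("git push origin master", "git switch -c test-branch", "git status"):
        print(cmd, "->", "denied" if is_denied(cmd) else "allowed")
```

A plain prefix match like this would not catch a variant such as git -C . push, which is one reason a second line of defence like the pre-push hook is useful for pushes.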
The failcount threshold
Tier 2 gives up if ANY of these hit:
- 3 consecutive red-phase failures (the test doesn't fail when it should)
- 3 consecutive green-phase failures (the implementation doesn't make the test pass)
- 30 minutes with no progress (no commit, no green test)
Override via scripts/tier2/failcount.toml.
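The three give-up conditions can be sketched as a single check. This is an illustration under assumptions, not the actual Tier 2 code: the FailcountState class, its field names, and the returned signal names are hypothetical, and the layout of failcount.toml is not shown in this guide, so the thresholds are passed as plain arguments. Only no_progress_minutes is a key this guide names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailcountState:
    """Hypothetical snapshot of the counters; field names are illustrative."""
    red_failures: int = 0            # consecutive red-phase failures
    green_failures: int = 0          # consecutive green-phase failures
    minutes_since_progress: float = 0.0  # time since the last commit or green test


def give_up_signal(
    state: FailcountState,
    max_red: int = 3,
    max_green: int = 3,
    no_progress_minutes: float = 30,
) -> Optional[str]:
    """Return the name of the first tripped threshold, or None to keep going."""
    if state.red_failures >= max_red:
        return "red_phase"
    if state.green_failures >= max_green:
        return "green_phase"
    if state.minutes_since_progress >= no_progress_minutes:
        return "no_progress"
    return None


if __name__ == "__main__":
    print(give_up_signal(FailcountState(red_failures=2, green_failures=3)))
```

Because the counters track consecutive failures, a caller would reset the relevant counter to zero after each successful phase.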
The failure report
Written to C:\Users\Ed\AppData\Local\manual_slop\tier2_failures\<track>_<timestamp>.md with 7 sections:
- Header (track, branch, started, stopped, duration, give-up signal)
- Tasks completed
- Current task (where it stopped)
- Last 3 failures
- Failcount state
- Git state (git log tier2/<track> ^origin/master)
- Recommendation (heuristic-based)
A .STOPPED flag file is created alongside the report. The main repo
can check for it on next Tier 1 session start (an opt-in banner).
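As a rough sketch of the opt-in banner check, the main repo could scan the failure directory for flag files at session start. The function name find_stopped_flags is hypothetical, and the code assumes only that flag file names end in .STOPPED; the exact naming scheme is not specified here.

```python
from pathlib import Path
from typing import List, Union


def find_stopped_flags(failure_dir: Union[str, Path]) -> List[Path]:
    """Return files in failure_dir whose names end in '.STOPPED', sorted by name."""
    root = Path(failure_dir)
    if not root.is_dir():
        # No failure directory yet means no track has given up.
        return []
    return sorted(
        p for p in root.iterdir() if p.is_file() and p.name.endswith(".STOPPED")
    )


if __name__ == "__main__":
    # Directory taken from the failure-report path above.
    flags = find_stopped_flags(r"C:\Users\Ed\AppData\Local\manual_slop\tier2_failures")
    for flag in flags:
        print(f"Tier 2 stopped early: {flag.name}")
```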
Conventions (added 2026-06-17)
These are enforced by the Tier 2 agent prompt. The agent MUST follow them — they're not optional.
- Test runner: Tier 2 always uses uv run python scripts/run_tests_batched.py. Never uv run pytest directly. The batched runner provides tier-based filtering, parallelization (xdist), and a summary table that direct pytest doesn't.
- Default branch: this repo uses master (not main). When fetching or branching, use origin/master. Tier 2 may otherwise get confused by the missing main reference.
- Line endings: Tier 2 preserves existing line endings on edit. This repo has a mix of CRLF and LF; standardizing to repo-wide LF is a future track. For now, do not normalize.
- Throw-away scripts: Tier 2 writes its working scripts to scripts/tier2/artifacts/<track-name>/, NOT the base scripts/tier2/ directory. The base directory is reserved for production code. Throw-away scripts are kept for archival but isolated in a track-specific subdir.
- End-of-track report: at the end of every track, Tier 2 writes docs/reports/TRACK_COMPLETION_<track-name>.md (follow the precedent set by TRACK_COMPLETION_tier2_autonomous_sandbox_20260616.md) and updates conductor/tracks/<track-name>/state.toml to status = "completed". The user reads this report to decide merge.
- Run-time expectation: tracks are expected to take 1-4 hours. If the model reports it is running out of context, Tier 2 notes progress to disk and continues. The user expects autonomous runs to complete without manual "press continue" intervention.
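The line-endings convention is the easiest one to break by accident, so here is a minimal sketch of one way to preserve a file's existing endings when rewriting it. The helper names detect_newline and rewrite_preserving_newlines are hypothetical, and the sketch assumes each individual file predominantly uses a single style (a file that mixes both internally would come out in its majority style).

```python
def detect_newline(data: bytes) -> bytes:
    """Return the CRLF sequence if CRLF is the majority ending in data, else LF."""
    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf  # LF characters not preceded by CR
    return b"\r\n" if crlf > lf_only else b"\n"


def rewrite_preserving_newlines(original: bytes, new_text: str) -> bytes:
    """Encode new_text as UTF-8 using the dominant line ending of original."""
    newline = detect_newline(original).decode("ascii")
    # Normalize to LF first so existing CRLF pairs are not doubled.
    normalized = new_text.replace("\r\n", "\n")
    return normalized.replace("\n", newline).encode("utf-8")


if __name__ == "__main__":
    before = b"first\r\nsecond\r\n"
    after = rewrite_preserving_newlines(before, "first\nsecond\nthird\n")
    print(after)
```

Working on bytes avoids Python's universal-newline translation, which would otherwise hide the original endings when a file is opened in text mode.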
Verify the sandbox (manual checklist)
After bootstrap, run these inside the Tier 2 sandboxed OpenCode session to verify the bans are enforced:
- Try git restore tests/test_failcount.py: should print "denied"
- Try git push origin master: should print "denied" (or the pre-push hook fires)
- Try git checkout -- src/foo.py: should print "denied"
- Try git reset --hard HEAD~1: should print "denied"
- Try to read C:\Users\Ed\Documents\test.txt (from a Python subprocess): should print "ACCESS_DENIED"
And verify allowed operations work:
- git status: works
- git switch -c test-branch: works
- Edit a file in the Tier 2 clone: works
- git add <file> && git commit -m "test": works
Troubleshooting
- "Tier 2 (Sandboxed) shortcut doesn't work": check that pwsh.exe is on the PATH (where.exe pwsh).
- "Permission denied" on file access inside the sandbox: the Windows ACL may be too restrictive. Re-run the bootstrap (setup_tier2_clone.ps1 is idempotent).
- "Failcount state not found": the <app-data>/tier2/<track>/ dir may be missing. The bootstrap creates it; check $env:LOCALAPPDATA.
- "Pre-push hook not firing": check that .git/hooks/pre-push is executable. On Windows, Git Bash runs the hook; check git config core.hooksPath if you have a custom hooks dir.
- "Tier 2 keeps giving up at 30 min": increase no_progress_minutes in scripts/tier2/failcount.toml.
- "Tier 2 ran out of context": the model stopped mid-track. The user (interactive Tier 1) should cd to the Tier 2 clone, inspect <app-data>/tier2/<track>/state.json for the last completed task, and re-invoke with /tier-2-auto-execute <track-name> --resume to continue. The state file persists across runs.