ed 07a0e66a19 docs(tier2): apply user feedback - 6 workflow conventions

User feedback from the first sandbox run (send_result_to_send_20260616,
2026-06-17) identified 6 conventions Tier 2 must follow. Update the agent
prompt template, slash command template, user guide, and workflow doc:

1. Test runner: ALWAYS use 'uv run python scripts/run_tests_batched.py'
   (NOT 'uv run pytest'). The batched runner provides tier filtering,
   parallelization (xdist), and a summary table that direct pytest lacks.

2. Default branch: this repo uses 'master', not 'main'. The Tier 2 slash
   command now does 'git fetch origin master' (was 'origin main').

3. Line endings: preserve existing. This repo has a mix of CRLF and LF;
   a repo-wide LF standardization is a future track.

4. Throw-away scripts: write to 'scripts/tier2/artifacts/<track>/', NOT
   the base 'scripts/tier2/' directory. The base is reserved for
   production code; throw-away scripts are kept for archival but
   isolated per-track.

5. End-of-track report: write 'docs/reports/TRACK_COMPLETION_<track>.md'
   and update 'state.toml' to 'status=completed'. The user reads this
   to decide merge. Previously this was implicit; now it's explicit.

6. Run-time expectation: tracks are 1-4 hours. If context runs out, Tier
   2 notes progress to disk and continues. The --resume flag picks up
   from the last completed task.

Also updated the user guide with a 'Conventions' section and a
troubleshooting entry for the resume flow. The verify-the-sandbox
checklist now uses 'origin master' instead of 'origin main'.

2026-06-17 02:13:29 -04:00

6.7 KiB


Tier 2 Autonomous Sandbox

Why this exists

When you run Tier 2 in the main repo, every edit and every bash call prompts you for approval (permission: ask). For well-structured tracks (TDD red/green with atomic per-task commits), these prompts are noise. This track adds an autonomous mode in a sibling clone where Tier 2 runs unattended, with a 3-layer enforcement stack to keep it contained.

One-time bootstrap

cd C:\projects\manual_slop
pwsh -File scripts\tier2\setup_tier2_clone.ps1 -WhatIf   # dry run first
pwsh -File scripts\tier2\setup_tier2_clone.ps1            # actual bootstrap

The bootstrap:

Clones the main repo to C:\projects\manual_slop_tier2\
Sets origin = C:\projects\manual_slop (a local path; the clone has no network remote)
Copies the agent, slash command, and opencode.json templates to the clone
Installs the git hooks (pre-push refuses all pushes; post-checkout logs checkouts)
Creates C:\Users\Ed\AppData\Local\manual_slop\tier2\ with restricted ACLs
Creates a "Tier 2 (Sandboxed)" desktop shortcut

Per-track invocation

Double-click the "Tier 2 (Sandboxed)" desktop shortcut (or run pwsh -File C:\projects\manual_slop\scripts\tier2\run_tier2_sandboxed.ps1 manually)
In the OpenCode session, type:
```
/tier-2-auto-execute <track-name>
```
Examples:
- /tier-2-auto-execute result_migration_review_pass
- /tier-2-auto-execute data_structure_strengthening_20260606 --resume
- /tier-2-auto-execute rag_test_failures_20260615 --toast
Tier 2 runs the track autonomously, commits per task, and monitors the failcount
On success: prints a summary
On give-up: writes a failure report and prints the path

Review and merge

After Tier 2 finishes (success or give-up):

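cd C:\projects\manual_slop (back to the main repo)
git fetch C:/projects/manual_slop_tier2 tier2/<track-name>:tier2/<track-name> (the refspec creates the local branch; a fetch from a bare path without it only updates FETCH_HEAD)
Review the diff with Tier 1 (interactive)
On approval: git merge --no-ff tier2/<track-name> into master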

The 4 hard bans (enforced at 3 layers)

| Ban | Layer 1 (OpenCode) | Layer 2 (OS) | Layer 3 (git hook) |
| --- | --- | --- | --- |
| `git push*` (any push) | `permission.bash` deny rule | n/a | `pre-push` hook refuses all pushes |
| `git checkout*` (any form) | `permission.bash` deny rule | n/a | `post-checkout` hook logs the checkout |
| `git restore*` (any form) | `permission.bash` deny rule | n/a | n/a |
| `git reset*` (any form) | `permission.bash` deny rule | n/a | n/a |
| File access outside Tier 2 clone + app-data dir | `permission.read`/`write` path allowlist | Windows ACL | n/a |
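
To illustrate how the Layer 1 deny patterns in the table behave, here is a minimal Python sketch of glob-style command matching. It is not OpenCode's actual matcher: `DENY_PATTERNS` and `is_denied` are hypothetical names, and the real `permission.bash` semantics may differ.

```python
from fnmatch import fnmatchcase

# Patterns copied from the ban table. How OpenCode really evaluates them
# is an assumption here; this only demonstrates plain glob matching.
DENY_PATTERNS = ["git push*", "git checkout*", "git restore*", "git reset*"]


def is_denied(command: str) -> bool:
    """Return True if the command matches any deny pattern."""
    normalized = " ".join(command.split())  # collapse repeated whitespace
    return any(fnmatchcase(normalized, pattern) for pattern in DENY_PATTERNS)
```

Under this sketch, `git switch -c test-branch` and `git status` pass, while every form of the four banned subcommands is rejected.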

The failcount threshold

Tier 2 gives up if ANY of these hit:

3 consecutive red-phase failures (the test doesn't fail when it should)
3 consecutive green-phase failures (the implementation doesn't make the test pass)
30 minutes with no progress (no commit, no green test)

Override via scripts/tier2/failcount.toml.
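
The three give-up signals can be sketched as a single check. This is a hedged illustration, not the code in the repo: `FailcountState`, `give_up_reason`, and every field name except `no_progress_minutes` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailcountState:
    # Hypothetical field names; the real state layout is not shown in this guide.
    consecutive_red_failures: int = 0
    consecutive_green_failures: int = 0
    minutes_since_progress: float = 0.0


def give_up_reason(
    state: FailcountState,
    max_red: int = 3,
    max_green: int = 3,
    no_progress_minutes: float = 30.0,
) -> Optional[str]:
    """Return the first triggered give-up signal, or None to keep going."""
    if state.consecutive_red_failures >= max_red:
        return "red-phase"
    if state.consecutive_green_failures >= max_green:
        return "green-phase"
    if state.minutes_since_progress >= no_progress_minutes:
        return "no-progress"
    return None
```

The defaults mirror the thresholds listed above; raising `no_progress_minutes` corresponds to the override in scripts/tier2/failcount.toml.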

The failure report

Written to C:\Users\Ed\AppData\Local\manual_slop\tier2_failures\<track>_<timestamp>.md with 7 sections:

Header (track, branch, started, stopped, duration, give-up signal)
Tasks completed
Current task (where it stopped)
Last 3 failures
Failcount state
Git state (git log tier2/<track> ^origin/master)
Recommendation (heuristic-based)

A .STOPPED flag file is created alongside the report. The main repo can check for this flag at the start of the next Tier 1 session and show a banner (opt-in).
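
A session-start check for the flag could look roughly like the sketch below. The function name `find_stopped_flags` and the `*.STOPPED` glob are assumptions; the guide does not specify the exact flag file name, so adjust the pattern to whatever the report writer produces.

```python
from pathlib import Path
from typing import List


def find_stopped_flags(failure_dir: Path, pattern: str = "*.STOPPED") -> List[Path]:
    """List flag files in the failure-report directory, sorted by name."""
    if not failure_dir.is_dir():
        return []  # no directory yet means no failed runs to report
    # The glob pattern is a guess at the naming scheme, not a documented one.
    return sorted(p for p in failure_dir.glob(pattern) if p.is_file())
```

A Tier 1 startup hook would call this on the tier2_failures directory under the local app-data folder and print a banner when the list is non-empty.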

Conventions (added 2026-06-17)

These are enforced by the Tier 2 agent prompt. The agent MUST follow them; none are optional.

Test runner: Tier 2 always uses uv run python scripts/run_tests_batched.py. Never uv run pytest directly. The batched runner provides tier-based filtering, parallelization (xdist), and a summary table that direct pytest doesn't.
Default branch: this repo uses master (not main). When fetching or branching, use origin/master. Commands that reference origin/main fail here because that ref does not exist.
Line endings: Tier 2 preserves existing line endings on edit. This repo has a mix of CRLF and LF; standardizing to repo-wide LF is a future track. For now, do not normalize.
Throw-away scripts: Tier 2 writes its working scripts to scripts/tier2/artifacts/<track-name>/, NOT the base scripts/tier2/ directory. The base directory is reserved for production code. Throw-away scripts are kept for archival but isolated in a track-specific subdir.
End-of-track report: at the end of every track, Tier 2 writes docs/reports/TRACK_COMPLETION_<track-name>.md (follow the precedent set by TRACK_COMPLETION_tier2_autonomous_sandbox_20260616.md) and updates conductor/tracks/<track-name>/state.toml to status = "completed". The user reads this report to decide merge.
Run-time expectation: tracks are expected to take 1-4 hours. If the model reports it is running out of context, Tier 2 notes progress to disk and continues. The user expects autonomous runs to complete without manual "press continue" intervention.
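
The line-ending convention above can be sketched as follows. This is one possible approach, not the agent's actual edit path: `detect_newline` and `rewrite_preserving_newlines` are hypothetical helpers, and the sketch assumes UTF-8 files and keeps each file's dominant line ending.

```python
from pathlib import Path


def detect_newline(raw: bytes) -> str:
    """Return the dominant line ending in raw bytes (LF when tied or absent)."""
    crlf = raw.count(b"\r\n")
    lf = raw.count(b"\n") - crlf  # count bare LF only
    return "\r\n" if crlf > lf else "\n"


def rewrite_preserving_newlines(path: Path, new_text: str) -> None:
    """Write new_text to path using the line ending the file already uses."""
    newline = detect_newline(path.read_bytes())
    # Normalize the incoming text to LF first, then convert exactly once.
    normalized = new_text.replace("\r\n", "\n")
    path.write_bytes(normalized.replace("\n", newline).encode("utf-8"))
```

Writing bytes rather than text avoids Python's platform newline translation, which would otherwise turn LF into CRLF on Windows.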

Verify the sandbox (manual checklist)

After bootstrap, run these inside the Tier 2 sandboxed OpenCode session to verify the bans are enforced:

Try git restore tests/test_failcount.py — should print "denied"
Try git push origin master — should print "denied" (or the pre-push hook fires)
Try git checkout -- src/foo.py — should print "denied"
Try git reset --hard HEAD~1 — should print "denied"
Try to read C:\Users\Ed\Documents\test.txt (from a Python subprocess) — should print "ACCESS_DENIED"
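
For the last check, a small probe like the one below can be run as the Python subprocess. It is a sketch under assumptions: `probe_read` is a hypothetical helper, and it relies on Python raising `PermissionError` when the Windows ACL blocks the read.

```python
from pathlib import Path


def probe_read(path: Path) -> str:
    """Try to read one byte and report the outcome as a short status string."""
    try:
        with path.open("rb") as handle:
            handle.read(1)
    except PermissionError:
        return "ACCESS_DENIED"  # expected for paths outside the sandbox
    except FileNotFoundError:
        return "NOT_FOUND"  # create the test file first, or the check proves nothing
    return "READABLE"
```

Create the target file from outside the sandbox before running the probe, so that a NOT_FOUND result is not mistaken for containment.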

And verify allowed operations work:

git status — works
git switch -c test-branch — works
Edit a file in the Tier 2 clone — works
git add <file> && git commit -m "test" — works

Troubleshooting

"Tier 2 (Sandboxed) shortcut doesn't work": check that pwsh.exe is on the PATH (where.exe pwsh).
"Permission denied" on file access inside the sandbox: the Windows ACL may be too restrictive. Re-run the bootstrap (setup_tier2_clone.ps1 is idempotent).
"Failcount state not found": the <app-data>/tier2/<track>/ dir may be missing. The bootstrap creates it; check $env:LOCALAPPDATA.
"Pre-push hook not firing": check that .git/hooks/pre-push is executable. On Windows, Git Bash runs the hook; check git config core.hooksPath if you have a custom hooks dir.
"Tier 2 keeps giving up at 30 min": increase no_progress_minutes in scripts/tier2/failcount.toml.
"Tier 2 ran out of context": the model stopped mid-track. The user (interactive Tier 1) should cd to the Tier 2 clone, inspect <app-data>/tier2/<track>/state.json for the last completed task, and re-invoke with /tier-2-auto-execute <track-name> --resume to continue. The state file persists across runs.
