conductor(spec): Tier 2 autonomous sandbox track spec
@@ -0,0 +1,614 @@
# Track Specification: Tier 2 Autonomous Sandbox (unattended track execution with bounded blast radius)

**Track ID:** `tier2_autonomous_sandbox_20260616`
**Status:** Planned (spec pending user review)
**Priority:** A (user-blocking; eliminates the manual `permission: ask` bottleneck for well-regularized tracks)
**Owner:** Tier 2 Tech Lead (per `conductor/workflow.md`)
**Type:** feature (meta-tooling: adds a new execution mode to the existing MMA workflow, not to the Manual Slop app itself)
**Scope:** ~7 new files in main repo + 1 sibling clone at `C:\projects\manual_slop_tier2\` (one-time bootstrap)
**Parent tracks:** `opencode_config_overhaul_20260310` (shipped; established the agent profile scaffolding this track extends)
**Sibling tracks:** none (independent)

> **Note on effort estimates:** per the Tier 1 rules (see
> `conductor/workflow.md` §"Tier 1 Track Initialization Rules"), this
> spec does NOT include day estimates. Effort is measured by **scope** (N
> files, M sites) and **T-shirt size** (S/M/L/XL). The user / Tier 2
> agent decides the actual pacing.

---

## 0. TL;DR

This track adds an **unattended execution mode** for Tier 2: you open
OpenCode in a sibling clone (`C:\projects\manual_slop_tier2\`), type
`/tier-2-auto-execute <track-name>`, and Tier 2 runs the track
autonomously, with **no `permission: ask` prompts**, while a **3-layer
defense-in-depth** enforcement stack prevents it from touching the
filesystem outside its clone + an app-data temp dir, and from running
destructive git operations (`git restore`, `git push*`, `git checkout`,
`git reset`). If Tier 2 can't make progress (3 red-phase failures, 3
green-phase failures, or 30 minutes with no commit/green), it stops
early, writes a failure report, and notifies you. You review the
feature branch with Tier 1 in the main repo, then merge.

**T-shirt size: L.** 7 new files in main repo (mostly config +
scripts + 1 small Python module), 6 new test files (§4.9), 1 PowerShell
wrapper, 1 bootstrap script, 1 user guide. ~600 lines of new code.

---

## 1. Overview

### 1.1 The State Before This Track (as of `88e44d1c`)

The current OpenCode configuration has these properties:

- **One repo, two modes via agent profile.** `opencode.json:11` sets
  `default_agent: "tier2-tech-lead"`. Tier 1 and Tier 2 are
  distinguished by which agent profile the user selects in the OpenCode
  session, not by which directory they're in.
- **Permission bottleneck on Tier 2.** `.opencode/agents/tier2-tech-lead.md:6-9`
  sets `permission: { edit: "ask", bash: "ask", 'manual-slop_*': allow }`.
  Every `edit` and every `bash` call from Tier 2 prompts the user for
  approval. For well-regularized tracks (TDD red/green/refactor with
  atomic per-task commits, e.g., the upcoming `result_migration_*`
  tracks), this is **noise**: the user has already pre-approved the
  track plan, and the per-task approval doesn't add safety, it just
  adds 50+ clicks per track.
- **No filesystem boundary enforcement.** Tier 2 has the same
  filesystem access as the user. There is nothing preventing Tier 2 (or
  a delegated Tier 3 worker) from reading `C:\Users\Ed\.aws\credentials`
  or writing to a different project entirely.
- **No git ban enforcement.** Nothing prevents Tier 2 from running
  `git restore`, `git push origin`, `git checkout -- <file>`, or
  `git reset --hard`. These are the four operations the user has
  called out as "destructive to its progress or affects the origin
  server" in the original ask.
- **No failure threshold / give-up mechanism.** A stuck Tier 2 runs
  until the user notices or the agent self-terminates. There is no
  "3 red-phase attempts without progress → stop and write a report"
  guardrail.
- **One OpenCode session at a time.** The main repo's OpenCode session
  is the only execution environment. Tier 2 cannot run in parallel with
  Tier 1 review.

### 1.2 The Goal

Add a **second execution mode** for Tier 2 that is:

- **Autonomous**: no `permission: ask` prompts for `edit` or `bash`
- **Sandboxed**: file access is restricted to the Tier 2 clone + an
  app-data temp dir, enforced at 3 independent layers (OpenCode
  permission system, Windows restricted token + ACLs, git hooks)
- **Bounded**: a one-shot run with a failure threshold; stuck runs
  stop early and write a report
- **Reviewable**: the run produces a feature branch in the clone;
  the user fetches it back to main and reviews with Tier 1
- **Opt-in to the app's test suite**: the sandbox / bootstrap / smoke
  tests are env-var-gated so the default `uv run pytest` run stays
  app-focused and fast

The main repo's existing configuration (the Tier 1 control plane) is
**not modified**: `opencode.json` stays the same (Tier 1 still has
`permission: ask`), and the existing MMA agents stay the same. What
this track adds to the main repo is purely additive (templates,
scripts, tests, and a guide).

### 1.3 What the User Experiences

**One-time bootstrap (the user runs once):**
```powershell
cd C:\projects\manual_slop
pwsh scripts/tier2/setup_tier2_clone.ps1
```

**Per-track invocation (the user's normal flow from now on):**
1. `cd C:\projects\manual_slop_tier2`
2. Open OpenCode in that directory (the "Tier 2 (Sandboxed)" desktop
   shortcut the bootstrap created)
3. In the OpenCode session, type:
   ```
   /tier-2-auto-execute result_migration_review_pass
   ```
4. Tier 2 fetches the spec, creates the `tier2/result_migration_review_pass`
   branch, runs the plan, commits per task
5. On success: prints a summary. On give-up: writes a failure report
   and prints its path.
6. `cd C:\projects\manual_slop` (back to main)
7. `git fetch C:/projects/manual_slop_tier2 tier2/result_migration_review_pass:tier2/result_migration_review_pass`
   (the `<src>:<dst>` refspec creates the local branch that step 9
   merges; a bare `git fetch <path> <branch>` only updates `FETCH_HEAD`)
8. Review the diff with Tier 1 (interactive)
9. `git merge --no-ff tier2/result_migration_review_pass` to main

**No `permission: ask` prompts in step 4.** If a Tier 2 tool call
attempts a banned operation, the OpenCode permission system denies it;
if a delegated Tier 3 worker tries to escape via a Python subprocess,
the Windows ACLs deny it; if a `git push` somehow slips through, the
pre-push hook blocks it. **Three independent layers, all enforcing the
same ban list.**

---

## 2. Current State Audit (as of `88e44d1c`)

### 2.1 Already Implemented (DO NOT re-implement)

- **OpenCode agent profile scaffolding**:
  `.opencode/agents/tier{1,2,3,4}-*.md:1-200` and the
  `opencode.json:1-50` config file. The `tier2-autonomous` agent
  profile this track adds follows the same pattern.
- **Slash command pattern**: `.opencode/commands/conductor-implement.md:1-100`
  is the existing pattern for slash commands. The
  `tier-2-auto-execute.md` command follows the same structure (front
  matter `agent:` and `description:`, markdown body with protocol).
- **Conductor track convention**: `conductor/tracks/<id>/{spec,plan}.md`
  and `metadata.json` per `conductor/workflow.md` "State.toml
  Template" + "Track Dependencies and Execution Order" sections. This
  track's artifacts follow that pattern.
- **Project-level test opt-in convention**: the `live_gui` fixture
  in `tests/conftest.py` and the existing env-var-gated tests (e.g.,
  the `RUN_LIVE_GUI=1` pattern in `tests/test_live_*.py`). The
  `TIER2_SANDBOX_TESTS=1` opt-in gate for this track's sandbox tests
  follows the same shape.
- **PowerShell-based tooling**: `scripts/` already contains
  PowerShell-adjacent Python scripts. The new wrapper is a pure
  PowerShell script, consistent with `pywin32`-based operations on
  Windows.
- **`scripts/audit_*.py` pattern**: the 4 existing audit scripts
  (`audit_exception_handling.py`, `audit_weak_types.py`,
  `audit_main_thread_imports.py`, `audit_no_models_config_io.py`) are
  the project's enforcement mechanism. This track does not introduce
  a new audit (the failcount thresholds are TOML-config, not
  statically checkable), but follows the `scripts/audit_<name>.py`
  naming for any future addition.

### 2.2 Gaps to Fill (This Track's Scope)

**Gap 1: A second clone as the Tier 2 execution environment.**

The main repo (`C:\projects\manual_slop\`) currently doubles as both
the Tier 1 control plane and the Tier 2 execution environment. The
fix is a sibling clone at `C:\projects\manual_slop_tier2\` with
`origin` set to the main repo's local path (no remote). The clone is
where the feature branch lives; the user fetches the branch back into
main for review.

**Gap 2: A `tier2-autonomous` agent profile with deny rules.**

The existing `tier2-tech-lead` agent has `permission: ask` for `edit`
and `bash`. The fix is a new `tier2-autonomous` agent profile (in the
Tier 2 clone's `opencode.json`) with:
- `permission.edit: allow`
- `permission.bash: { "*": "allow", "git push*": "deny",
  "git checkout*": "deny", "git restore*": "deny", "git reset*": "deny" }`
- `permission.read` / `permission.write` restricted to the Tier 2
  clone + `C:\Users\Ed\AppData\Local\manual_slop\tier2\`

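To make the intent of the `permission.bash` patterns concrete, here is a minimal Python sketch of how such a rule table could be evaluated. It assumes shell-glob matching and a "deny wins over allow" precedence; neither is confirmed by this spec as OpenCode's actual behavior, and `decide` is a hypothetical helper, not an OpenCode API.

```python
from fnmatch import fnmatchcase

# Patterns copied from the Gap 2 list; the matching semantics are assumed.
BASH_RULES = {
    "*": "allow",
    "git push*": "deny",
    "git checkout*": "deny",
    "git restore*": "deny",
    "git reset*": "deny",
}


def decide(command: str, rules: dict[str, str] = BASH_RULES) -> str:
    """Return "deny" if any deny pattern matches, else "allow" if an allow pattern matches."""
    normalized = " ".join(command.split())  # collapse repeated whitespace
    matched = [
        action
        for pattern, action in rules.items()
        if fnmatchcase(normalized, pattern)
    ]
    if "deny" in matched:
        return "deny"
    # Fall back to deny when nothing matches at all (assumed default).
    return "allow" if "allow" in matched else "deny"
```

Under these assumptions `decide("git push origin main")` is denied and `decide("git switch -c tier2/x origin/main")` is allowed. Note that a prefix pattern does not catch a variant such as `git -C . push`, which is one reason the spec layers further enforcement underneath this one.
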
**Gap 3: A sandboxed launcher (Windows restricted token + ACLs).**

OpenCode's permission system only governs tool calls made through
OpenCode itself; it does not constrain what a spawned process does. A
determined Tier 3 worker calling `os.system("...")` from a delegated
Python script could in principle bypass it. The fix is a PowerShell
wrapper that:
- Acquires a Windows restricted token (drops `SeBackupPrivilege`,
  `SeRestorePrivilege`, `SeTakeOwnershipPrivilege`, `SeDebugPrivilege`,
  `SeLoadDriverPrivilege`)
- Sets explicit ACLs on the Tier 2 clone + app-data temp dir (allow
  the restricted token, deny everything else)
- Wraps the process tree in a Job Object (no breakaway)
- Launches OpenCode + the MCP server under the restricted token via
  `CreateProcessWithTokenW`

**Gap 4: A `tier-2-auto-execute` slash command.**

The existing slash commands are conductor-style ("start
implementation", "create track"). The new slash command takes a
`<track-name>` argument, fetches the spec from `origin/main`, creates
a `tier2/<track-name>` branch via `git switch -c` (NOT `git checkout`),
runs the plan via Tier 2, monitors the failcount, and reports back.

**Gap 5: A failure threshold + give-up mechanism (`failcount.py`).**

The current Tier 2 has no built-in "I can't make progress" detection.
A stuck agent burns tokens until the user notices. The fix is a pure
Python module that tracks three orthogonal signals:
- `red_phase_failures` (3 = give up)
- `green_phase_failures` (3 = give up)
- `no_progress_minutes` (30 = give up)

Whichever signal hits its threshold first triggers give-up. The
module is pure logic, fully unit-testable, with a TOML config for
threshold overrides.

**Gap 6: A failure report writer + flag file + notification.**

When give-up fires, the system needs to:
- Write a markdown report to
  `C:\Users\Ed\AppData\Local\manual_slop\tier2_failures\<track>_<utc-timestamp>.md`
  with: header, tasks completed, current task state, last 3 failures,
  failcount state, git log, recommendation
- Create a `.STOPPED` flag file alongside the report
- Print a clear "TRACK ABORTED" banner in the OpenCode session with
  the report path
- Optionally: Windows toast notification (opt-in via `--toast` flag)

**Gap 7: Git hooks as defense-in-depth (Layer 3).**

The OpenCode permission system is the primary enforcement for git bans.
A pre-push hook (`pre-push` in the clone's `.git/hooks/`) is the
backup that catches `git push origin*` even if the OpenCode deny rule
is somehow misconfigured. A `post-checkout` hook logs any checkout of
tracked files to a detection log.

**Gap 8: A user guide for bootstrap + invocation + manual verification.**

The user needs to know:
- How to run the bootstrap once
- How to invoke the slash command
- What the failure report looks like
- How to review and merge the feature branch
- How to manually verify the sandbox blocks the banned operations

---

## 3. Goals

- **Eliminate the `permission: ask` bottleneck** for well-regularized
  tracks. The user clicks zero times during a normal Tier 2 run
  (excluding the "did Tier 2 give up?" check at the end).
- **Enforce the 4 hard git bans** (`git restore`, `git push*`,
  `git checkout`, `git reset`) at 3 independent layers (OpenCode,
  Windows OS, git hooks). A bypass of one layer is caught by another.
- **Enforce the filesystem boundary** (Tier 2 clone + app-data temp
  only) at 2 independent layers (OpenCode path allowlist, Windows
  ACLs). Even a delegated Python subprocess can't read outside the
  allowlist.
- **Bound the blast radius** with a failure threshold. A stuck Tier 2
  stops within ~30 minutes and writes a report, instead of running
  indefinitely.
- **Keep the default test run app-focused.** All sandbox/bootstrap/
  smoke tests are env-var-gated; `uv run pytest` with no env vars
  stays fast and never touches the Windows ACL subsystem.
- **Keep Tier 1 unchanged.** The main repo's `opencode.json` is not
  modified. Tier 1 retains its `permission: ask` workflow.

## 4. Functional Requirements

### 4.1 Bootstrap (one-time, user-driven)

**FR1.1:** `scripts/tier2/setup_tier2_clone.ps1` (new) clones the
main repo to `C:\projects\manual_slop_tier2\`, sets
`origin = C:\projects\manual_slop`, copies the agent/command/
opencode.json templates to the clone, installs the git hooks into
the clone's `.git/hooks/`, creates the app-data temp dir
`C:\Users\Ed\AppData\Local\manual_slop\tier2\` with restricted ACLs,
and creates a "Tier 2 (Sandboxed)" desktop shortcut.

**FR1.2:** The bootstrap is idempotent: re-running it does not
destroy an existing clone's feature branches (it runs `git fetch origin`
and re-copies the latest templates, but does not `git reset` the clone).

**FR1.3:** The bootstrap's dry-run mode (`-WhatIf`) shows what would
happen without making changes. Required for safety.

### 4.2 The tier2-autonomous agent profile

**FR2.1:** `.opencode/agents/tier2-autonomous.md` (template) lives in
the main repo and is copied to the Tier 2 clone during bootstrap. It
defines the autonomous-mode agent with the deny rules in §2.2 Gap 2.

**FR2.2:** The agent uses `temperature: 0.4` (matching the Tier 2 Tech
Lead). It uses `git switch -c <branch>` for new branches and
`git switch <branch>` for switching; `git checkout` is banned
project-wide.

**FR2.3:** The agent prompt includes the failcount monitoring
contract: "After each task commit, check
`<app-data>/tier2/<track>/state.json` via the failcount module. If
`should_give_up` returns true, write the failure report and stop."

### 4.3 The sandboxed launcher

**FR3.1:** `scripts/tier2/run_tier2_sandboxed.ps1` (new) is the
entry point that opens OpenCode in the Tier 2 clone under a
restricted token.

**FR3.2:** The wrapper acquires a restricted token via the Win32
`CreateRestrictedToken` API (called from PowerShell through .NET
P/Invoke), sets ACLs on the Tier 2 clone + app-data dir to grant the
restricted token read/write, wraps the process tree in a Job Object,
and launches OpenCode + the MCP server under the restricted token via
`CreateProcessWithTokenW`.

**FR3.3:** The wrapper is the target of the "Tier 2 (Sandboxed)"
desktop shortcut created during bootstrap. Right-click → Properties
shows the command: `pwsh -File C:\projects\manual_slop\scripts\tier2\run_tier2_sandboxed.ps1`.

### 4.4 The slash command

**FR4.1:** `.opencode/commands/tier-2-auto-execute.md` (template) lives
in the main repo and is copied to the Tier 2 clone during bootstrap. It
takes a required `<track-name>` argument.

**FR4.2:** The slash command:
1. Runs `git fetch origin main`, then reads
   `conductor/tracks/<track-name>/spec.md` + `plan.md` from
   `origin/main`
2. Creates a `tier2/<track-name>` branch via
   `git switch -c tier2/<track-name> origin/main`
3. Initializes the failcount state file at
   `<app-data>/tier2/<track-name>/state.json`
4. Delegates the plan to the tier2-autonomous agent
5. After each task commit, checks failcount; on give-up, writes the
   report and stops
6. On success, prints a summary (branch name, N commits, M tasks)

**FR4.3:** The slash command's protocol is duplicated in a CLI
entry point (`scripts/tier2/run_track.py`) so the smoke e2e test
can invoke the same logic without spinning up an OpenCode session.

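The git side of the FR4.2 protocol can be sketched as a pure function that returns the commands a CLI entry point would run, which keeps it testable without touching a repository. This is only an illustration: `plan_git_commands`, the track-name validation rule, and the use of `git show origin/main:<path>` to read the spec and plan are assumptions, not details fixed by this spec.

```python
import re

# Hypothetical validation rule: track names are assumed to look like
# `result_migration_review_pass` (letters, digits, underscores, hyphens).
_TRACK_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")


def plan_git_commands(track_name: str) -> list[list[str]]:
    """Return the git invocations for FR4.2 steps 1-2, in order, as argv lists."""
    if not _TRACK_NAME.fullmatch(track_name):
        raise ValueError(f"invalid track name: {track_name!r}")
    track_dir = f"conductor/tracks/{track_name}"
    branch = f"tier2/{track_name}"
    return [
        # Step 1: refresh origin/main, then read spec + plan from it.
        ["git", "fetch", "origin", "main"],
        ["git", "show", f"origin/main:{track_dir}/spec.md"],
        ["git", "show", f"origin/main:{track_dir}/plan.md"],
        # Step 2: create the feature branch with `git switch -c`, never `git checkout`.
        ["git", "switch", "-c", branch, "origin/main"],
    ]
```

A caller could pass each argv list to `subprocess.run(cmd, check=True)`; rejecting odd track names up front avoids building a branch name or path from unvetted input.
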
**FR4.4:** The slash command supports `--resume` to continue a track
that previously gave up, starting from the last completed task (the
state is in the `state.json` file). Default behavior: refuse to resume
and ask for explicit confirmation.

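As a rough sketch of how the CLI entry point might mirror these arguments, the following uses `argparse` from the standard library. The flag names come from this spec (`<track-name>`, `--resume`, `--toast`); `build_parser`, `may_start`, and the "refuse unless `--resume` is given" rule as coded here are one hypothetical reading of FR4.4.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the slash command's arguments (assumed CLI shape)."""
    parser = argparse.ArgumentParser(
        prog="run_track.py",
        description="Run a conductor track unattended in the Tier 2 clone.",
    )
    parser.add_argument("track_name", help="track id under conductor/tracks/")
    parser.add_argument(
        "--resume",
        action="store_true",
        help="continue a track that previously gave up",
    )
    parser.add_argument(
        "--toast",
        action="store_true",
        help="show a Windows toast notification on give-up",
    )
    return parser


def may_start(args: argparse.Namespace, state_exists: bool) -> bool:
    """Allow a run if there is no prior state, or if --resume was passed explicitly."""
    return (not state_exists) or args.resume
```

For example, `build_parser().parse_args(["demo_track"])` combined with an existing state file yields `may_start(...) == False`, matching the default of refusing to resume.
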
### 4.5 The failcount module

**FR5.1:** `scripts/tier2/failcount.py` (new) is a pure-Python module
with no external deps. Exposes:
- `class FailcountState`: the signal state dataclass
- `class FailcountConfig`: threshold loader (from TOML or defaults)
- `def should_give_up(state: FailcountState, config: FailcountConfig,
  now: datetime) -> Result[bool, ErrorInfo]`
- `def record_red_failure(state: FailcountState) -> FailcountState`
- `def record_green_failure(state: FailcountState) -> FailcountState`
- `def record_green_success(state: FailcountState,
  now: datetime) -> FailcountState` (resets no_progress)
- `def record_commit(state: FailcountState,
  now: datetime) -> FailcountState` (resets no_progress)
- `def to_dict(state) -> dict`, `def from_dict(d) -> FailcountState`
- `def load_state(track_name: str) -> Result[FailcountState, ErrorInfo]`
- `def save_state(track_name: str, state: FailcountState) -> Result[None, ErrorInfo]`

**FR5.2:** Default thresholds (override via `failcount.toml`):
- `red_phase_threshold: 3`
- `green_phase_threshold: 3`
- `no_progress_minutes: 30`

**FR5.3:** `should_give_up` returns `True` if ANY signal hits its
threshold. The `now` parameter is injectable for testing.

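The any-signal rule can be sketched in a few lines. This is a simplified illustration, not the module itself: it returns a plain `bool` instead of the `Result[bool, ErrorInfo]` named in FR5.1, and the `last_progress_at` field is a hypothetical way of representing the no-progress timer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FailcountConfig:
    red_phase_threshold: int = 3
    green_phase_threshold: int = 3
    no_progress_minutes: int = 30


@dataclass(frozen=True)
class FailcountState:
    red_phase_failures: int
    green_phase_failures: int
    last_progress_at: datetime  # hypothetical field: time of last commit/green


def should_give_up(state: FailcountState, config: FailcountConfig, now: datetime) -> bool:
    """Return True as soon as any one of the three signals reaches its threshold."""
    stalled_for = now - state.last_progress_at
    return (
        state.red_phase_failures >= config.red_phase_threshold
        or state.green_phase_failures >= config.green_phase_threshold
        or stalled_for >= timedelta(minutes=config.no_progress_minutes)
    )


# `now` is passed in rather than read from the clock, so tests can pin it.
START = datetime(2026, 6, 16, 12, 0, tzinfo=timezone.utc)
```

Because `now` is injected, a test can evaluate the same state at `START + 29 minutes` and `START + 30 minutes` and observe the timer signal flip without sleeping.
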
**FR5.4:** `record_green_success` and `record_commit` reset the
`no_progress_minutes` timer. They do NOT reset the red/green
failure counters (those only reset on the next progress signal of
the same type; e.g., a red failure is reset by a green test that
eventually passes).

### 4.6 The failure report writer

**FR6.1:** `scripts/tier2/write_report.py` (new) takes a track name,
branch name, state, and a list of `TaskResult` records, and writes
the markdown report to
`C:\Users\Ed\AppData\Local\manual_slop\tier2_failures\<track>_<utc-timestamp>.md`.

**FR6.2:** The report contains the 7 sections in order:
1. Header (track, branch, started-at, stopped-at, duration, give-up signal)
2. Tasks completed (list with task IDs, commit SHAs, summaries)
3. Current task state (where it stopped: task ID, phase, worker output, test failure)
4. Last 3 failures (truncated to 50 lines, full output in `..._full.log`)
5. Failcount state at give-up
6. Git state (`git log --oneline tier2/<track> ^origin/main`)
7. Recommendation (heuristic-based: "track too complex", "spec needs clearer plan", "external dependency missing", "review carefully")

**FR6.3:** A `.STOPPED` flag file is created at
`C:\Users\Ed\AppData\Local\manual_slop\tier2_failures\<track>.STOPPED`.

**FR6.4:** The report writer returns the report path on success
(via `Result[str, ErrorInfo]`).

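Two small pieces of the report writer lend themselves to a sketch: deriving the report and flag-file paths, and truncating failure output to 50 lines. Both helpers below are hypothetical. The timestamp format (`%Y%m%dT%H%M%SZ`) and the choice to keep the last 50 lines rather than the first are assumptions; the spec specifies neither.

```python
from datetime import datetime, timezone
from pathlib import Path


def report_paths(failures_dir: Path, track: str, stopped_at: datetime) -> tuple[Path, Path]:
    """Return (report path, .STOPPED flag path) for a give-up at `stopped_at`."""
    # Assumed timestamp format; chosen because it sorts well and has no ':' (invalid on Windows).
    stamp = stopped_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return failures_dir / f"{track}_{stamp}.md", failures_dir / f"{track}.STOPPED"


def truncate_output(text: str, max_lines: int = 50) -> str:
    """Keep the last `max_lines` lines, prefixed by a marker noting what was omitted."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    omitted = len(lines) - max_lines
    marker = f"... ({omitted} earlier lines omitted; see the full log)"
    return "\n".join([marker] + lines[-max_lines:])
```

Taking `failures_dir` as a parameter, rather than hard-coding the app-data path, lets a test point the writer at a temporary directory.
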
### 4.7 The git hooks (Layer 3)

**FR7.1:** `conductor/tier2/githooks/pre-push` (template) is a
shell/PowerShell script that refuses `git push` invocations to any
remote. The script exits with code 1 and the message
"Tier 2 autonomous mode: `git push` is disabled. Push the branch
manually from the main repo after review."

**FR7.2:** `conductor/tier2/githooks/post-checkout` (template) is a
detection-only hook that logs any checkout of tracked files to
`C:\Users\Ed\AppData\Local\manual_slop\tier2\tier2_checkout_log.txt`
with a timestamp, the commit hash, and the affected paths.

**FR7.3:** The bootstrap script copies both hooks to the Tier 2
clone's `.git/hooks/` and makes them runnable: `chmod +x` on
Linux/WSL, or an `icacls` read/execute grant on Windows (NTFS has no
executable bit).

### 4.8 The user guide

**FR8.1:** `docs/guide_tier2_autonomous.md` (new) covers:
- Why this exists (the `permission: ask` bottleneck)
- One-time bootstrap procedure (with `-WhatIf` instructions)
- Per-track invocation procedure
- The slash command arguments (`<track-name>`, `--resume`, `--toast`)
- The failure report layout (with screenshot/example)
- How to review and merge the feature branch
- The "Verify the sandbox" checklist (manual verification)
- Troubleshooting (common errors: origin not set, hooks not
  executable, failcount.toml missing)

**FR8.2:** The guide includes a "Verify the sandbox" section that
walks the user through attempting each banned operation manually
and confirming the denial. This is the user-driven checklist from
the design.

### 4.9 The test suite (opt-in)

**FR9.1:** `tests/test_failcount.py` (new), **default-on**. Unit
tests for the failure threshold module. The full test inventory:
- `test_initial_state_zero`
- `test_red_phase_failure_increments`
- `test_green_success_resets_red_counter`
- `test_green_phase_failure_increments`
- `test_no_progress_advances`
- `test_no_progress_resets_on_commit`
- `test_no_progress_resets_on_green`
- `test_threshold_fires_at_three`
- `test_threshold_does_not_fire_at_two`
- `test_multi_signal_independence`
- `test_any_signal_triggers`
- `test_state_persistence_round_trip`
- `test_configurable_thresholds`

Target: 100% line + branch coverage on `failcount.py`.

**FR9.2:** `tests/test_tier2_slash_command_spec.py` (new), **default-on**.
Loads the slash command markdown, verifies its protocol contract
(argument parsing, git commands, failcount check, report writing).

**FR9.3:** `tests/test_tier2_setup_bootstrap.py` (new), **opt-in**
(`TIER2_SANDBOX_TESTS=1`). Runs `setup_tier2_clone.ps1` against a
fixture workspace, verifies the side effects (clone exists, origin
set, templates copied, hooks installed, app-data dir created with
ACLs).

**FR9.4:** `tests/test_tier2_sandbox_enforcement.py` (new),
**opt-in** (`TIER2_SANDBOX_TESTS=1`). The critical test: spawns the
wrapper in a subprocess, attempts each banned operation inside the
sandboxed context, and verifies each is denied.

**FR9.5:** `tests/test_tier2_report_writer.py` (new), **opt-in**
(`TIER2_SANDBOX_TESTS=1`). Invokes failcount until give-up,
verifies the report file is created at the right path with the
right 7 sections.

**FR9.6:** `tests/test_tier2_smoke_e2e.py` (new), **opt-in**
(`TIER2_SANDBOX_TESTS=1 TIER2_SMOKE=1`). Runs the full pipeline
against a fixture workspace: bootstrap → invoke the CLI entry
point → verify the feature branch exists with 1 commit → verify
the report file is NOT created (success path).

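The opt-in gating described in FR9.3 through FR9.6 could be expressed with two small predicates. This is a sketch under the assumption that the gate is an exact `"1"` match on the environment variables named above; the helper names are hypothetical.

```python
import os
from collections.abc import Mapping


def sandbox_tests_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """True only when TIER2_SANDBOX_TESTS is set to exactly "1"."""
    return env.get("TIER2_SANDBOX_TESTS") == "1"


def smoke_tests_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """The smoke e2e test needs both gates, per FR9.6."""
    return sandbox_tests_enabled(env) and env.get("TIER2_SMOKE") == "1"
```

In a test module these predicates can feed pytest's skip marker, for example `pytestmark = pytest.mark.skipif(not sandbox_tests_enabled(), reason="set TIER2_SANDBOX_TESTS=1 to run")`, so a plain `uv run pytest` collects the tests but skips them.
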
## 5. Non-Functional Requirements

**NFR1. Performance:** the failcount module adds <1ms per check.
The slash command's protocol adds <500ms to a typical Tier 2 task
(spec fetch + branch creation + state init).

**NFR2. Reliability:** the failcount state is persisted after every
commit. A killed run can be resumed (or the resume refused) on the
next invocation. The state file uses an atomic write (write to
`state.json.tmp` + `os.replace`) to survive crashes mid-write.

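The write-then-rename pattern named in NFR2 might look like the following. It is a minimal sketch: `save_state_atomically` is a hypothetical helper that takes an explicit path and a plain dict, whereas the spec's `save_state` takes a track name and a `FailcountState`.

```python
import json
import os
from pathlib import Path


def save_state_atomically(state_path: Path, state: dict) -> None:
    """Write JSON to `<name>.tmp`, flush it to disk, then swap it into place."""
    state_path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = state_path.with_name(state_path.name + ".tmp")  # e.g. state.json.tmp
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
        handle.flush()
        os.fsync(handle.fileno())  # push the bytes to disk before the rename
    # os.replace overwrites the destination in one step, so a reader sees
    # either the old complete file or the new complete file.
    os.replace(tmp_path, state_path)
```

If the process is killed before `os.replace`, the previous `state.json` is still intact and only a stray `.tmp` file is left behind, which the next save overwrites.
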
**NFR3. Security:**
- The 4 git bans are enforced at 3 independent layers (OpenCode
  permission system, Windows OS-level via restricted token, git
  hooks). A bypass of one layer is caught by another.
- The filesystem boundary is enforced at 2 independent layers
  (OpenCode path allowlist, Windows ACLs).
- The Tier 2 process tree is wrapped in a Job Object that
  prevents child process escape.

**NFR4. Testability:**
- The failcount module is pure logic, 100% unit-testable without
  any infrastructure.
- The slash command's protocol is duplicated in
  `scripts/tier2/run_track.py` (CLI entry point) so the smoke e2e
  test runs without an OpenCode session.
- All sandbox / bootstrap / smoke tests are env-var-gated
  (`TIER2_SANDBOX_TESTS=1`, `TIER2_SMOKE=1`).

**NFR5. Auditability:** every Tier 2 run writes to
`C:\Users\Ed\AppData\Local\manual_slop\tier2\<track>\state.json`
and (on give-up) `C:\Users\Ed\AppData\Local\manual_slop\tier2_failures\<track>_<timestamp>.md`.
The user can inspect the state at any time.

**NFR6. UX:** the user clicks zero times during a normal Tier 2
run. The "did Tier 2 give up?" check is passive (an OpenCode
banner, an optional Windows toast, and a flag file the user can
check on next Tier 1 session start).

**NFR7. Backward compatibility:** the main repo's `opencode.json`
is not modified. Tier 1 retains its `permission: ask` workflow.
The new agent profile (`tier2-autonomous`) is in the Tier 2 clone
only. The new slash command is in the Tier 2 clone only.

## 6. Architecture Reference

**This track's design follows these existing patterns:**

- **`docs/guide_architecture.md`** §"Threading model": the
  Tier 2 process tree runs in its own Job Object, isolated from
  the user's main session.
- **`docs/guide_mma.md`** §"Tier 2/3/4 lifecycles": the Tier 2
  Tech Lead's existing delegation patterns (Task tool to
  `@tier3-worker`, `@tier4-qa`) are preserved in the autonomous
  mode.
- **`docs/guide_meta_boundary.md`**: this track is squarely in
  the "Meta-Tooling" environment (it builds execution infrastructure
  for the agents), not the "Application" environment. No changes
  to `src/*.py`.
- **`docs/guide_testing.md`** §"Authoring robust live_gui tests"
  + the `live_gui` session-scoped pattern: the smoke e2e test
  follows the same opt-in env-var-gated pattern.
- **`conductor/code_styleguides/python.md`**: 1-space indentation,
  CRLF line endings, no comments, strict type hints. All new Python
  code in this track follows this styleguide.
- **`conductor/code_styleguides/error_handling.md`**: the
  failcount module uses `Result[T, ErrorInfo]` per the convention
  (the 3 refactored baseline files use it; the convention is being
  rolled out across the codebase per
  `data_oriented_error_handling_20260606` + the upcoming
  `result_migration_20260616` sub-tracks).

**This track's NEW patterns (the contribution to the codebase):**

- **Sibling clone as execution mode switch**: opening OpenCode in
  a different directory IS the mode switch (no `mode:` flag in
  `opencode.json`, no env var, just a directory).
- **3-layer enforcement stack**: OpenCode permission system +
  Windows restricted token + git hooks. Documented in
  `docs/guide_tier2_autonomous.md` (this track's new guide).
- **Bounded autonomous run with fail-loud**: the failcount module
  is a general-purpose "I'm stuck" detector, applicable to any
  future autonomous run (not just Tier 2). The pattern is
  reusable for any sub-agent that has a contract to follow.

## 7. Out of Scope

- **No changes to the Manual Slop app (`src/*.py`).** This is
  meta-tooling, not the app. The 4 audit scripts
  (`audit_exception_handling.py`, `audit_weak_types.py`,
  `audit_main_thread_imports.py`, `audit_no_models_config_io.py`)
  are not modified.
- **No changes to the main repo's `opencode.json` or MMA agent
  profiles.** The new `tier2-autonomous` profile lives in the
  Tier 2 clone only.
- **No new top-level `src/<thing>.py` files.** Per the file-naming
  convention (`AGENTS.md` §"File Size and Naming Convention"), the
  new code is in `scripts/tier2/`, `conductor/tier2/`, and `tests/`
  (all namespace-isolated by directory).
- **No changes to existing tracks or in-flight work.** The
  `result_migration_20260616` umbrella track, the
  `data_oriented_error_handling_20260606` track, and the
  `exception_handling_audit_20260616` track are not affected.
- **No new audit script.** The failcount thresholds are TOML config,
  not statically checkable. If a future track adds a checkable
  convention (e.g., "all CLI entry points must use Result[T]"),
  the new audit script should follow the
  `scripts/audit_<name>.py` pattern from the existing 4.
- **No WSL2 / Docker / Windows Sandbox variants.** The user
  approved Approach 1 (OpenCode + Windows restricted token + git
  hooks, all native Windows). WSL2 was considered and deferred;
  the failure to run Dear PyGui/ImGui tests in WSL2 was the
  deciding factor.
- **No parallel Tier 2 runs.** The Tier 2 clone is a single
  workspace. Two parallel Tier 2 runs would conflict on the
  feature branch. If parallel runs become a need, that's a
  follow-up track.
- **No `git push` to non-origin remotes.** Even though the deny
  rule is `git push*` (any push), the practical use case is
  "Tier 2 doesn't push at all; the user pushes after review."
  Adding a "push to a tier2-remote bare dir" workflow is a
  follow-up if needed.
- **No automated review of the feature branch.** Tier 1 reviewing
  Tier 2's branch is a future track (out of scope here).

---

**Spec ends.** The implementation plan (`plan.md` + `metadata.json`)
will be written by the `writing-plans` skill in the next phase, after
the user reviews this spec.