From 11db26e0513d0d1d843c73b87f7dc1b6fa124d8e Mon Sep 17 00:00:00 2001
From: Ed_
Date: Sat, 27 Jun 2026 13:42:27 -0400
Subject: [PATCH] docs(report): add outstanding MMA test failure track proposal

Documents the 4 stacked regressions in test_mma_concurrent_tracks_sim that
need a proper fix. Not sweeping under the rug - the test was passing in
some prior state but the cruft_elimination_20260627 changes (commit 0d2a9b5e
and related) broke multiple consumers without updating them.

Fixes already in (a4901fa2, 635ca552):
- flat.setdefault(...)[...] = ... on frozen ProjectContext (3 sites)
- t_data['id'] on Ticket objects (1 site)
- mock_concurrent_mma.py --resume handling

Remaining: 1 critical failure where the second track's _start_track_logic
never fires. Recommend a dedicated track to investigate + fix.
---
 .../OUTSTANDING_MMA_TEST_FAILURES_20260627.md | 123 ++++++++++++++++++
 1 file changed, 123 insertions(+)
 create mode 100644 docs/reports/OUTSTANDING_MMA_TEST_FAILURES_20260627.md

diff --git a/docs/reports/OUTSTANDING_MMA_TEST_FAILURES_20260627.md b/docs/reports/OUTSTANDING_MMA_TEST_FAILURES_20260627.md
new file mode 100644
index 00000000..65a62ad9
--- /dev/null
+++ b/docs/reports/OUTSTANDING_MMA_TEST_FAILURES_20260627.md
@@ -0,0 +1,123 @@
+# Outstanding MMA Test Failures - Track Proposal
+
+**Date:** 2026-06-27
+**Branch:** `tier2/post_module_taxonomy_de_cruft_20260627`
+**Latest commit:** `635ca552` (partial fix)
+
+---
+
+## Status: 1 critical test still failing in tier-3-live_gui
+
+```
+tests/test_mma_concurrent_tracks_sim.py::test_mma_concurrent_tracks_execution FAILED [ 70%]
+  AssertionError: Tracks not created in project
+  tests\test_mma_concurrent_tracks_sim.py:66: AssertionError
+```
+
+After plan-epic succeeds (2 proposed tracks), the test clicks `btn_mma_accept_tracks`. The bg_task logs "Starting 2 tracks..." but only 1 sprint-ticket mock call is observed (for track-a). The 2nd sprint call for track-b never happens. Test polls `tracks` for 30 seconds and times out.
+
+**Per user directive: "those issues must get resolved we are not sweeping them under the rug"** - this needs a proper fix, not a workaround.
+
+---
+
+## Root Cause Analysis
+
+The failure is the result of a **chain of cruft_elimination_20260627 changes that propagated incompletely through the production code and the test mock**:
+
+### 1. `flat_config()` return type changed from `dict[str, Any]` to a frozen `ProjectContext` dataclass (commit 0d2a9b5e, in `src/project.py`)
+
+**Impact:** 3 production sites in `src/app_controller.py` mutated the returned object via dict-style assignment:
+- `_do_generate` (line 4027): `flat["files"] = ...` and `flat["files"]["paths"] = ...`
+- `_cb_plan_epic` (line 4604): `flat.setdefault("files", {})["paths"] = ...`
+- `_start_track_logic_result` (line 4793): `flat.setdefault("files", {})["paths"] = ...`
+
+Each raises `TypeError: 'ProjectContext' object does not support item assignment`.
+
+**Status:** ✅ **FIXED** in commits `a4901fa2` and `635ca552` (call `flat.to_dict()` to get a mutable dict).
+
+### 2. `conductor_tech_lead.topological_sort()` return type changed from `list[str]` to `list[Ticket]` (likely also in 0d2a9b5e or related)
+
+**Impact:** `_start_track_logic_result` in `src/app_controller.py` iterated over `sorted_tickets_data` and used `t_data["id"]`, `t_data.get("description")`, etc. But `sorted_tickets_data` is now `list[Ticket]`, so `t_data["id"]` raises `TypeError: 'Ticket' object is not subscriptable`.
+
+**Status:** ✅ **FIXED** in commit `635ca552` (use Ticket attribute access: `t_data.id`, `t_data.description`, etc.).
+
+### 3. `gemini_cli_adapter` uses session persistence via `--resume` flag (commit 0d2a9b5e or related)
+
+**Impact:** The mock `tests/mock_concurrent_mma.py` was written when each LLM call was stateless. Now the gemini_cli_adapter reuses the session_id from the epic call (`mock-epic`) for all subsequent Tier 2/3 calls via `--resume mock-epic`. The mock's response routing (based on prompt substrings) broke because:
+- Epic init: `if 'PATH: Epic Initialization' in prompt` (prompt is real)
+- Sprint: `if 'generate the implementation tickets' in prompt` (prompt is empty in resume mode!)
+- Worker: `if 'You are assigned to Ticket' in prompt` (prompt is empty)
+
+So all resume calls fell to the default case, which returns a generic mock response that doesn't parse as JSON.
+
+**Status:** ✅ **PARTIALLY FIXED** in commit `635ca552` (mock now parses `--resume` from sys.argv and uses a persistent call counter to route to per-track responses).
+
+### 4. ⚠️ **UNRESOLVED** - Second track's `_start_track_logic` never fires
+
+Even with the mock fix, only 1 sprint-ticket call is observed (for track-a). The for loop in `_cb_accept_tracks._bg_task` is:
+```python
+for i, track_data in enumerate(self.proposed_tracks):
+    title = track_data.get("title") or track_data.get("goal", "Untitled Track")
+    self.ai_status = f"Processing track {i+1} of {total_tracks}: '{title}'..."
+    self._start_track_logic(track_data, skeletons_str=generated_skeletons)
+```
+
+The first iteration should:
+- Call `_start_track_logic(track_a, ...)` → mock returns sprint-A → track created
+- Then continue to track_b
+
+But the second iteration's mock call is never observed. Possible causes:
+- `_start_track_logic` for track-a hangs (e.g., `project_manager.save_track_state` blocks)
+- The IO pool is saturated
+- The `submit_io(engine.run, ...)` for track-a blocks the bg_task
+- The `aggregate.run(flat)` call hangs
+- The new `flat.to_dict()` conversion is missing the `screenshots` field that `aggregate.run` requires
+
+The test counter is at 2 after the test runs (one epic + one sprint). This proves the mock was called twice. The third call (sprint-B) never happens.
+
+**Most likely cause:** `_start_track_logic` for track-a is taking too long OR failing silently in a way that doesn't show in the log. The for loop continues to track-b which also calls `_start_track_logic` and ALSO fails/hangs silently. The 30-second test poll times out before either track completes.
+
+---
+
+## What's Needed
+
+### Option A: Continue investigation in this iteration (Tier 2 autonomous track)
+
+1. **Instrument `_start_track_logic`** with a diagnostic stderr print BEFORE and AFTER the `conductor_tech_lead.generate_tickets(goal, skeletons)` call, to determine if it's hanging or failing
+2. **Run the test in isolation** with the instrumentation
+3. **If hanging:** check `aggregate.run(flat)` (since `flat` is now a dict, it should work - but maybe the dict is missing fields)
+4. **If failing:** the except block in `_start_track_logic_result` catches it; add a print before the `return Result(data=None, errors=[err])` to see the error
+
+### Option B: Open a new Tier 2 track
+
+Create `conductor/tracks/fix_mma_concurrent_tracks_sim_20260627/spec.md` with:
+- **Goal:** Make `test_mma_concurrent_tracks_sim::test_mma_concurrent_tracks_execution` pass in the batched test suite
+- **Scope:** Investigate the second-track-not-firing issue, fix the root cause (production OR mock), verify
+- **Owner:** Tier 2 autonomous (this session) or Tier 1 manual review
+- **Estimated scope:** 3-5 files changed (production in `src/app_controller.py` and/or mock in `tests/mock_concurrent_mma.py`), 1-2 hour investigation + fix + verify
+
+---
+
+## Files Currently Modified (uncommitted in working tree)
+
+| File | Change |
+|------|--------|
+| `src/app_controller.py` | `flat.setdefault(...)["paths"] = ...` → `flat = flat.to_dict() if hasattr...; flat.setdefault(...)["paths"] = ...` (2 sites); `t_data["id"]` → `t_data.id` (1 site) |
+| `tests/mock_concurrent_mma.py` | Parse `--resume` arg from sys.argv; use persistent call counter for per-call response routing |
+
+**Not committed yet** - staged for the next tier2 autonomous run.
+
+---
+
+## Recommendation
+
+**Open a dedicated track** for this work. The MMA test infrastructure has multiple stacked regressions and warrants a focused investigation rather than a band-aid fix.
+
+If the user wants me to **continue in this session**, I can:
+1. Add stderr instrumentation to `_start_track_logic` to diagnose
+2. Run the test in isolation
+3. Fix the root cause based on the diagnosis
+4. Verify the test passes
+5. Commit the fix
+
+Per user direction, no sweeping under the rug - this needs a real fix.
\ No newline at end of file
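
Reviewer note: the first recommended step (stderr instrumentation around the ticket-generation call) could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not code from the repository: `trace_stderr` and `fake_generate_tickets` are hypothetical names, and the stand-in only mirrors the `generate_tickets(goal, skeletons)` call shape quoted in the report.

```python
import functools
import sys
import threading
import time


def trace_stderr(label):
    """Hypothetical helper: report entry, exit, and exceptions of a call on stderr."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            thread = threading.current_thread().name
            start = time.monotonic()
            # Flush immediately so the line survives even if the call never returns.
            print(f"[trace] ENTER {label} thread={thread}", file=sys.stderr, flush=True)
            try:
                result = fn(*args, **kwargs)
            except Exception as exc:
                elapsed = time.monotonic() - start
                print(f"[trace] RAISED {label} after {elapsed:.3f}s: {exc!r}",
                      file=sys.stderr, flush=True)
                raise  # keep the original behaviour; only add visibility
            elapsed = time.monotonic() - start
            print(f"[trace] EXIT {label} after {elapsed:.3f}s", file=sys.stderr, flush=True)
            return result
        return wrapper
    return decorator


# Hypothetical stand-in for the ticket-generation call named in the report.
@trace_stderr("generate_tickets")
def fake_generate_tickets(goal, skeletons):
    if not goal:
        raise ValueError("empty goal")
    return [f"ticket for {goal}"]


if __name__ == "__main__":
    fake_generate_tickets("track-a", "")
```

Reading the output separates the two hypotheses in the report: an `ENTER` line with no matching `EXIT` or `RAISED` line points to a hang, while a `RAISED` line points to a failure that is currently being swallowed. The thread name helps confirm whether the call runs on the background task or on an IO pool worker.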