All tier-3-live_gui tests now pass. Track complete with 5 fixes:
1. e9919059: TrackMetadata import (production NameError)
2. 913aa48c: Mock sprint routing (session_id-based was fragile)
3. fad1755b: Mock epic catch-all (literal-substring was fragile)
4. d28e373e: Mock worker fallback (stale session_id leaked)
5. 55dae159: Remove 'refresh_from_project' task (was overwriting self.tracks with a disk read returning 0 tracks in batched env)

Verified:
- test_mma_concurrent_tracks_execution: PASS
- test_mma_concurrent_tracks_stress: PASS
- 15 wider tests: PASS (237.63s)
- 3 consecutive runs of the failing combination: PASS (100s each)

OUTSTANDING_MMA_TEST_FAILURES_20260627.md updated with section 7 documenting the refresh_from_project bug and fix.
State.toml updated to reflect all 5 fixes and the 3 verification runs.

Track status: active (final SHIPPED commit pending TRACK_COMPLETION update).
The parent branch tier2/post_module_taxonomy_de_cruft_20260627 is now ready for merge after this fix track is reviewed.
Outstanding MMA Test Failures — Track Proposal
Date: 2026-06-27
Branch: tier2/post_module_taxonomy_de_cruft_20260627
Latest commit: 635ca552 (partial fix)
Status: 1 critical test still failing in tier-3-live_gui
tests/test_mma_concurrent_tracks_sim.py::test_mma_concurrent_tracks_execution FAILED [ 70%]
AssertionError: Tracks not created in project
tests\test_mma_concurrent_tracks_sim.py:66: AssertionError
After plan-epic succeeds (2 proposed tracks), the test clicks btn_mma_accept_tracks. The bg_task logs "Starting 2 tracks..." but only 1 sprint-ticket mock call is observed (for track-a). The 2nd sprint call for track-b never happens. Test polls tracks for 30 seconds and times out.
Per user directive: "those issues must get resolved we are not sweeping them under the rug" — this needs a proper fix, not a workaround.
Root Cause Analysis
The failure is the result of a chain of cruft_elimination_20260627 changes that propagated incompletely through the production code and the test mock:
1. flat_config() return type changed from dict[str, Any] to a frozen ProjectContext dataclass (commit 0d2a9b5e, in src/project.py)
Impact: 3 production sites in src/app_controller.py mutated the returned object via dict-style assignment:
- _do_generate (line 4027): flat["files"] = ... and flat["files"]["paths"] = ...
- _cb_plan_epic (line 4604): flat.setdefault("files", {})["paths"] = ...
- _start_track_logic_result (line 4793): flat.setdefault("files", {})["paths"] = ...
Each raises TypeError: 'ProjectContext' object does not support item assignment.
Status: ✅ FIXED in commits a4901fa2 and 635ca552 (call flat.to_dict() to get a mutable dict).
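The failure mode and the to_dict() fix can be reproduced in isolation. This is a minimal sketch, not the code in src/project.py: the single files field and the with_paths helper are assumptions made for illustration, and only the ProjectContext name, the to_dict() method, and the setdefault pattern come from the notes above.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass(frozen=True)
class ProjectContext:
    # Hypothetical field; the real dataclass in src/project.py has its own schema.
    files: dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> dict[str, Any]:
        # Returns a fresh, mutable copy of the frozen context.
        return asdict(self)


def with_paths(flat: Any, paths: list[str]) -> dict[str, Any]:
    # Hypothetical helper: convert to a plain dict before mutating.
    if hasattr(flat, "to_dict"):
        flat = flat.to_dict()
    flat.setdefault("files", {})["paths"] = paths
    return flat


ctx = ProjectContext()
try:
    ctx["files"] = {}  # dict-style assignment on a dataclass instance
except TypeError as exc:
    error_message = str(exc)

mutable = with_paths(ctx, ["src/app_controller.py"])
```

Because asdict copies the nested dict, the mutation lands on the copy and the frozen context stays untouched.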
2. conductor_tech_lead.topological_sort() return type changed from list[str] to list[Ticket] (likely also in 0d2a9b5e or related)
Impact: _start_track_logic_result in src/app_controller.py iterated over sorted_tickets_data and used t_data["id"], t_data.get("description"), etc. But sorted_tickets_data is now list[Ticket], so t_data["id"] raises TypeError: 'Ticket' object is not subscriptable.
Status: ✅ FIXED in commit 635ca552 (use Ticket attribute access: t_data.id, t_data.description, etc.).
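The same class of breakage applies to the Ticket change. The sketch below assumes a two-field Ticket dataclass (the real class has its own fields) and a hypothetical summarize helper; it only illustrates why subscript access fails and attribute access works.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    # Hypothetical minimal shape; only id and description are named in the notes.
    id: str
    description: str = ""


def summarize(tickets: list[Ticket]) -> list[tuple[str, str]]:
    # Attribute access works for dataclass instances.
    return [(t.id, t.description) for t in tickets]


sorted_tickets = [Ticket("ticket-A-1", "first ticket"), Ticket("ticket-B-1")]

try:
    sorted_tickets[0]["id"]  # the pre-fix access pattern
except TypeError as exc:
    subscript_error = str(exc)

summary = summarize(sorted_tickets)
```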
3. gemini_cli_adapter uses session persistence via --resume flag (commit 0d2a9b5e or related)
Impact: The mock tests/mock_concurrent_mma.py was written when each LLM call was stateless. Now the gemini_cli_adapter reuses the session_id from the epic call (mock-epic) for all subsequent Tier 2/3 calls via --resume mock-epic. The mock's response routing (based on prompt substrings) broke because:
- Epic init: if 'PATH: Epic Initialization' in prompt (prompt is real)
- Sprint: if 'generate the implementation tickets' in prompt (prompt is empty in resume mode!)
- Worker: if 'You are assigned to Ticket' in prompt (prompt is empty)
So all resume calls fell to the default case, which returns a generic mock response that doesn't parse as JSON.
Status: ✅ PARTIALLY FIXED in commit 635ca552 (mock now parses --resume from sys.argv and uses a persistent call counter to route to per-track responses).
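The routing failure described above can be sketched as follows. The three substrings are taken from the notes; the route function, the JSON payloads, and the default text are placeholders, not the contents of tests/mock_concurrent_mma.py.

```python
import json


def route(prompt: str) -> str:
    # Substring-based routing, as the mock did before resume mode existed.
    if "PATH: Epic Initialization" in prompt:
        return json.dumps([{"id": "track-a"}, {"id": "track-b"}])  # placeholder payload
    if "generate the implementation tickets" in prompt:
        return json.dumps([{"id": "ticket-A-1"}])  # placeholder payload
    if "You are assigned to Ticket" in prompt:
        return json.dumps({"status": "done"})  # placeholder payload
    return "generic mock response"  # default branch: plain text, not JSON


def parses_as_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


epic_ok = parses_as_json(route("PATH: Epic Initialization for the demo project"))
resume_ok = parses_as_json(route(""))  # resume mode delivers an empty prompt
```

With an empty prompt no branch matches, so the caller receives text it cannot parse.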
4. ✅ RESOLVED — Production bug: NameError on models.Metadata call site
After all 3 prior fixes in commit 635ca552, only 1 sprint-ticket call was observed (for track-a). The for loop in _cb_accept_tracks._bg_task was reached, but track-a's _start_track_logic raised a NameError that was not caught by the except block (which catches only 7 specific exception types). The io_pool worker died, so the for loop never reached track-b.
Root cause: The de-cruft migration in commit ee763eea removed the "from src import models" import from src/app_controller.py but did not update the call site models.Metadata(...) at line 4830. The line is:
meta = models.Metadata(id=track_id, name=title, status="todo", created_at=datetime.now(), updated_at=datetime.now())
models is no longer in scope, so this raises NameError: name 'models' is not defined.
Status: ✅ FIXED in commit e9919059 (added TrackMetadata to the from src.mma import line; changed models.Metadata(...) to TrackMetadata(...)).
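A small reproduction shows why an exception outside the except tuple stops the whole loop without any visible error. Everything here is a stand-in: the exception tuple is hypothetical (the notes say only that 7 specific types are caught), and a ThreadPoolExecutor stands in for the io_pool.

```python
from concurrent.futures import ThreadPoolExecutor

started: list[str] = []


def start_track(title: str):
    started.append(title)
    try:
        if title == "Track A":
            # Simulates the missing import at the call site.
            raise NameError("name 'models' is not defined")
    except (ValueError, KeyError, TypeError):  # hypothetical list; NameError is not in it
        return None
    return title


def bg_task(titles: list[str]) -> list:
    # The loop aborts at the first uncaught exception.
    return [start_track(t) for t in titles]


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(bg_task, ["Track A", "Track B"])
    error = future.exception()  # the error surfaces only if someone asks for it
```

Nothing is printed unless the future is inspected, which matches the silent failure seen in the log.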
Verification: 5 consecutive PASS runs of test_mma_concurrent_tracks_execution (7.49s, 7.54s, 7.97s, 8.02s, 8.45s). The full diag log shows both tracks are created:
[DIAG] _start_track_logic_result self.tracks.append OK title='Track A' track_id=track_ef3ff66ba50c
[DIAG] _start_track_logic_result ENTER title='Track B' goal='Track B Goal' skeletons_len=0
[DIAG] _start_track_logic_result AFTER generate_tickets title='Track B' raw_tickets_count=1
...
[DIAG] _start_track_logic_result self.tracks.append OK title='Track B' track_id=track_52e6741b0748
5. ✅ RESOLVED — Mock bug: session_id-based routing for sprints is fragile
The session_id-based routing added in commit 635ca552 had two sub-bugs:
- call_n literal matching (== 2, == 3) is fragile to test ordering: the file-based counter persists across tests in the same session, so call_n != 2 for the 1st sprint if a prior test ran.
- session_id="mock-sprint-A" means "this is a follow-up call after the 1st sprint returned mock-sprint-A", so the response should be sprint-B (2nd track tickets), not sprint-A. The prior code routed this to sprint-A, causing track-b's worker to have stream id ticket-A-1 (not ticket-B-1).
Status: ✅ FIXED in commit 913aa48c (replaced session_id-based sprint routing with prompt-content-based routing; the original pre-635ca552 design).
Verification: 3 consecutive PASS runs after the fix.
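The ordering fragility of the first sub-bug can be illustrated with a counter file that outlives a single test. The file name, helper names, and sample prompt are assumptions; only the == 2 comparison and the sprint substring come from the notes.

```python
import tempfile
from pathlib import Path

SPRINT_MARKER = "generate the implementation tickets"


def next_call_number(counter_file: Path) -> int:
    """Increment and return a call counter persisted on disk."""
    previous = int(counter_file.read_text()) if counter_file.exists() else 0
    counter_file.write_text(str(previous + 1))
    return previous + 1


def is_first_sprint_by_counter(counter_file: Path) -> bool:
    # Literal matching: assumes the sprint is always call number 2.
    return next_call_number(counter_file) == 2


def is_sprint_by_prompt(prompt: str) -> bool:
    # Content matching: independent of how many calls came before.
    return SPRINT_MARKER in prompt


def run_one_test(counter_file: Path) -> tuple[bool, bool]:
    next_call_number(counter_file)  # the epic call
    prompt = "Please generate the implementation tickets for Track A"
    return is_first_sprint_by_counter(counter_file), is_sprint_by_prompt(prompt)


with tempfile.TemporaryDirectory() as tmp:
    counter = Path(tmp) / "mock_call_count.txt"  # hypothetical file name
    first_test = run_one_test(counter)
    second_test = run_one_test(counter)  # same session, counter not reset
```

In the second run the sprint is call number 4, so the counter check misses it while the prompt check still matches.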
The test counter is at 2 after the test runs (one epic + one sprint). This proves the mock was called twice. The third call (sprint-B) never happens.
Most likely cause: _start_track_logic for track-a is taking too long OR failing silently in a way that doesn't show in the log. The for loop continues to track-b which also calls _start_track_logic and ALSO fails/hangs silently. The 30-second test poll times out before either track completes.
What's Needed
Option A: Continue investigation in this iteration (Tier 2 autonomous track)
- Instrument _start_track_logic with a diagnostic stderr print BEFORE and AFTER the conductor_tech_lead.generate_tickets(goal, skeletons) call, to determine if it's hanging or failing
- Run the test in isolation with the instrumentation
- If hanging: check aggregate.run(flat) (since flat is now a dict, it should work, but the dict may be missing fields)
- If failing: the except block in _start_track_logic_result catches it; add a print before the return Result(data=None, errors=[err]) to see the error
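The before/after instrumentation in Option A could look roughly like this. It is a sketch only: diag and generate_with_diag are hypothetical names, and generate_tickets is passed in as a parameter rather than imported from conductor_tech_lead.

```python
import sys
from typing import Callable


def diag(message: str) -> None:
    # Flush immediately so the line survives a hang or a dying worker thread.
    print(f"[DIAG] {message}", file=sys.stderr, flush=True)


def generate_with_diag(
    generate_tickets: Callable[[str, str], list],
    goal: str,
    skeletons: str,
    title: str,
) -> list:
    diag(f"BEFORE generate_tickets title={title!r} skeletons_len={len(skeletons)}")
    raw_tickets = generate_tickets(goal, skeletons)
    diag(f"AFTER generate_tickets title={title!r} raw_tickets_count={len(raw_tickets)}")
    return raw_tickets
```

If the BEFORE line appears without a matching AFTER line, the call either hung or raised; if both appear, the problem lies later in the function.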
Option B: Open a new Tier 2 track
Create conductor/tracks/fix_mma_concurrent_tracks_sim_20260627/spec.md with:
- Goal: Make test_mma_concurrent_tracks_sim::test_mma_concurrent_tracks_execution pass in the batched test suite
- Scope: Investigate the second-track-not-firing issue, fix the root cause (production OR mock), verify
- Owner: Tier 2 autonomous (this session) or Tier 1 manual review
- Estimated scope: 3-5 files changed (production in src/app_controller.py and/or mock in tests/mock_concurrent_mma.py), 1-2 hour investigation + fix + verify
Files Currently Modified (uncommitted in working tree)
| File | Change |
|---|---|
| src/app_controller.py | flat.setdefault(...)["paths"] = ... → flat = flat.to_dict() if hasattr...; flat.setdefault(...)["paths"] = ... (2 sites); t_data["id"] → t_data.id (1 site) |
| tests/mock_concurrent_mma.py | Parse --resume arg from sys.argv; use persistent call counter for per-call response routing |
Not committed yet — staged for the next tier2 autonomous run.
Recommendation
Open a dedicated track for this work. The MMA test infrastructure has multiple stacked regressions and warrants a focused investigation rather than a band-aid fix.
If the user wants me to continue in this session, I can:
- Add stderr instrumentation to _start_track_logic to diagnose
- Run the test in isolation
- Fix the root cause based on the diagnosis
- Verify the test passes
- Commit the fix
Per user direction, no sweeping under the rug — this needs a real fix.
6. ✅ RESOLVED — Mock bug: epic branch only matches one literal prompt
Date: 2026-06-27 (discovered after the fix_mma_concurrent_tracks_sim_20260627 track SHIPPED)
The stress test (tests/test_mma_concurrent_tracks_stress_sim.py::test_mma_concurrent_tracks_stress) uses mma_epic_input='STRESS TEST: TRACK A AND TRACK B', which the mock's epic branch did NOT match (it only matched 'PATH: Epic Initialization'). The stress prompt fell to the Default branch, which returns text (not JSON); the production orchestrator_pm.generate_tracks failed to parse it and returned 0 tracks.
Root cause: The mock's epic branch was a literal-substring check for a single test-specific prompt. It was not robust to other test prompts.
Status: ✅ FIXED in commit fad1755b (restructured routing so sprint and worker are checked first, and any non-empty prompt that doesn't match those patterns is treated as an epic request returning 2 tracks).
Verification: 3 consecutive PASS runs of both test_mma_concurrent_tracks_execution AND test_mma_concurrent_tracks_stress (13.94s, 14.81s, 14.13s).
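The restructured routing order can be sketched as a small classifier. The function name and return labels are illustrative, not taken from tests/mock_concurrent_mma.py; the order of checks follows the description of commit fad1755b.

```python
def classify(prompt: str) -> str:
    # Specific patterns are checked first.
    if "generate the implementation tickets" in prompt:
        return "sprint"
    if "You are assigned to Ticket" in prompt:
        return "worker"
    # Any other non-empty prompt is treated as an epic request.
    if prompt.strip():
        return "epic"
    return "default"


labels = [
    classify("PATH: Epic Initialization"),
    classify("STRESS TEST: TRACK A AND TRACK B"),
    classify("Please generate the implementation tickets for Track A"),
    classify("You are assigned to Ticket ticket-A-1"),
    classify(""),
]
```

Checking the specific patterns first matters: with the catch-all placed earlier, sprint and worker prompts would also be classified as epics.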
7. ✅ RESOLVED — Production bug: 'refresh_from_project' task overwrites self.tracks
Date: 2026-06-27 (discovered after the second batched test run)
After the epic catch-all fix, the batched test still failed. Diagnostic logging revealed that self.tracks was being replaced between track appends (different id(self.tracks) values in the log). Root cause:
_start_track_logic_result (and _cb_accept_tracks._bg_task) appended a 'refresh_from_project' task to _pending_gui_tasks at the end. The main thread processed this task by calling _refresh_from_project, which does:
self.tracks = project_manager.get_all_tracks(self.active_project_root)
This REPLACED self.tracks with a fresh disk read. In batched test environments, the disk read returned 0 tracks (due to timing or path issues), losing the in-memory tracks that were just appended by self.tracks.append(...).
Fix: Remove the 'refresh_from_project' task appends from both _start_track_logic_result and _cb_accept_tracks._bg_task. The bg_task already updates self.tracks directly via self.tracks.append(...). The refresh is unnecessary for the accept flow because the other state (files, disc_entries, etc.) doesn't change during the accept.
Status: ✅ FIXED in commit 55dae159.
Verification: 3 consecutive PASS runs of the failing test combination (test_context_sim_live + test_mma_concurrent_tracks_execution + test_mma_concurrent_tracks_stress) at 100.57s, 100.29s, 100.18s. Also passes 15 wider tests (237.63s) with no regressions.
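The append-versus-rebind interaction behind this bug is easy to reproduce. The Controller class below is a hypothetical stand-in for the app controller, and the empty loader simulates the disk read that returned 0 tracks.

```python
class Controller:
    def __init__(self) -> None:
        self.tracks: list[str] = []

    def accept_track(self, title: str) -> None:
        # Background task path: mutates the existing list in place.
        self.tracks.append(title)

    def refresh_from_project(self, load_tracks) -> None:
        # GUI task path: rebinds the attribute to a brand-new list.
        self.tracks = load_tracks()


ctrl = Controller()
ctrl.accept_track("Track A")
list_before = ctrl.tracks

ctrl.refresh_from_project(lambda: [])  # simulated disk read returning 0 tracks
ctrl.accept_track("Track B")

same_list = ctrl.tracks is list_before
```

After the refresh, Track A survives only in the old list object, which is why the log showed different id(self.tracks) values between appends.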