
docs(ideation): video UX-eval pipeline + triage overlay on ASCII DSL

Adds a manual-first pipeline for finding UX regressions in long screen recordings: ffmpeg re-encode to proxy, LAB-palette frame-change detection (kasa-style), pixel-diff backup, manual triage into a triage overlay on the existing ASCII UI Layout Map DSL (docs/guide_ascii_layout_map.md). The overlay adds only a thin meta-layer (entry headers, @delta, @ux_finding) on top of the existing visual grammar; the existing DSL remains the source of truth for the visual layer. Includes 8 edge-case worked examples ranked by LLM difficulty and a findings-report template for the user-in-the-loop iteration. Future track candidates: build the keyframe-extraction tool (scripts/dogfood_extract.py) after ≥3 manual dogfoods validate the DSL shape.
2026-06-17 09:09:15 -04:00
parent 07a0e66a19
commit ee75660834
@@ -0,0 +1,774 @@
# Ed's Video UX-Eval Pipeline Ideation — 2026-06-17
**Source:** Tier 1 orchestration session, 2026-06-17. User did a multi-hour dogfood of the Application on a previous night; captured a ~3-hour screen recording at 120 fps / high bitrate (≈80 GB) on a home server. Wanted a way to surface UX regressions without manually scrubbing 1.3M frames, then shifted to a more rigorous-but-manual-first approach.
**Status:** Raw ideation. Not a track, not a spec, not an implementation commitment. The user explicitly chose manual triage for the current dogfood ("for now I'll do the manual way") but wants the pipeline + DSL designed rigorously enough that the manual step produces structured, automatable signal — so a future LLM/diffusion pass can be dropped in without re-doing the work.
**Date:** 2026-06-17 (today's session).
**Archived:** 2026-06-17.
> **Revision note (added during the same session).** An existing canonical DSL was found after the first draft: [`docs/guide_ascii_layout_map.md`](../guide_ascii_layout_map.md) (visual grammar: window frames, buttons, combos, sliders, panel zooms, grid overlays) and [`docs/reports/ascii_sketch_ux_workflow_20260608.md`](../reports/ascii_sketch_ux_workflow_20260608.md) (the workflow + vocabulary refinements). The first draft of §3 invented a parallel `@entry`/`@window`/`@panel` prefix-tag system that ignored both. The revised §3 below reuses the existing visual grammar and adds only the **time-series + change-log + severity meta-layer** that those guides don't cover (the existing DSL is for forward *design*; this is for retrospective *triage*).
---
## 0. Context (why this exists)
The Application is a high-density multi-viewport ImGui orchestrator for LLM-driven coding sessions. Its UX surface is dense, stateful, and has a lot of failure modes that don't show up in unit tests (panel ordering, focus loss, modal stacking, status bar stale state, undo/redo corruption, MMA dashboard drift, persona editor state desync, etc.). A dogfood session is the most reliable way to find these — but a session is a stream, not a regression list.
The capture: 3 hours, 120 fps, ≈80 GB. The user can re-encode but cannot realistically scrub every frame. The user wants two things:
1. **Now:** A rigorous way to convey UX failures from a manual watch-through so the failures become actionable tickets (not just a memory dump).
2. **Later:** A pipeline that can do (1) automatically, optionally using LLMs and/or vision/diffusion models, so future dogfoods don't require manual scrubbing.
The unifying concept: a **triage overlay on top of the existing ASCII UI Layout Map DSL** (`docs/guide_ascii_layout_map.md`). The existing DSL provides the visual grammar — boxes, brackets, combos, sliders, panel zooms, state annotations, SSDL primitives. What it doesn't cover is the *time-series* and *change-log* dimension needed for retrospective triage: timestamps, frame references, before/after deltas, severity-tagged findings. That meta-layer is what this report designs.
---
## 1. The Problem (concrete numbers)
| Property | Value | Implication |
|---|---|---|
| Source video length | ~3 hours | 10,800 seconds |
| Capture frame rate | 120 fps | ~1.3M raw frames |
| File size | ~80 GB | Won't fit in working memory; needs proxy |
| Frames a human can review | ~1/second realistic | ~10K frames max in a single sit-down |
| Frames where a UX bug is *visible* | Maybe 200-500 across 3 hours | <0.05% of all frames |
| Frames where a UX bug *occurs* but isn't visually obvious | Could be many more (state desync without visible artifact) | Need state introspection, not just pixel diff |
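The figures in the table follow from simple arithmetic. A quick check, using only the numbers quoted above (the variable names are illustrative):

```python
# Back-of-envelope check of the numbers in the table above.
duration_s = 3 * 60 * 60          # ~3 hours of capture
capture_fps = 120

raw_frames = duration_s * capture_fps
print(f"raw frames: {raw_frames:,}")            # 1,296,000 (~1.3M)

# Upper end of the 200-500 estimate for frames where a bug is visible.
visible_bug_frames = 500
share = visible_bug_frames / raw_frames
print(f"share of frames with a visible bug: {share:.4%}")

# A human reviewing ~1 frame per second covers at most this many frames.
reviewable_frames = duration_s * 1
print(f"reviewable in one pass: {reviewable_frames:,}")
```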
**Constraints:**
- LLMs cannot watch video. They can ingest text and (some) images. 1.3M images is not viable.
- Diffusion / vision models work on still images. Cost scales per-image; 1.3M is not viable. 200-500 is.
- Pure pixel diff catches glitches but not semantic regressions (e.g., wrong button label is invisible to pixel diff at low res).
- Manual scrubbing through 3 hours is feasible but produces unstructured notes ("around the 1h mark something looked off in the panel").
**The gap.** Manual scrubbing produces a story; the team needs a ticket. Today the conversion from "I saw a thing" → "this is a bug with these reproduction steps" is lossy. The DSL is the explicit target output of the manual step — it's the lossy compression that doesn't lose structure.
---
## 2. The Pipeline (proposed; not built yet)
Six stages, numbered 0-5. Stages 0-2 are the "make it small" path. Stage 3 is the manual triage, which writes the DSL. Stage 4 aggregates the DSL into tickets. Stage 5 is where future automation slots in.
### Stage 0 — Re-encode (mandatory first step)
ffmpeg downsample + transcode. The 80 GB raw is the wrong starting point.
```bash
ffmpeg -i raw.mp4 \
-vf "scale=1280:-2,fps=4" \
-c:v libx264 -crf 24 -preset slow -an \
dogfood_proxy.mp4
```
Result: ~1.5 GB at 4 fps and 1280 px wide (720p for a 16:9 source). 4 fps is the deliberate budget: UI events faster than 250 ms aren't regressions you can triage anyway. The audio is dropped because (a) audio doesn't help UX eval and (b) discarding it protects privacy if any ambient sound was captured.
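If the re-encode ends up scripted rather than typed by hand, the same command can be assembled in Python. This is a minimal sketch: `build_proxy_cmd` and its parameters are hypothetical names, not part of any existing script, and the flags simply mirror the command above.

```python
import shlex

def build_proxy_cmd(src, dst, width=1280, fps=4, crf=24):
    """Return the ffmpeg argv for the Stage 0 proxy encode (hypothetical helper)."""
    return [
        "ffmpeg", "-i", src,
        "-vf", f"scale={width}:-2,fps={fps}",   # -2 keeps aspect ratio with an even height
        "-c:v", "libx264", "-crf", str(crf), "-preset", "slow",
        "-an",                                   # drop the audio track
        dst,
    ]

cmd = build_proxy_cmd("raw.mp4", "dogfood_proxy.mp4")
print(shlex.join(cmd))

# Frame budget of the proxy: 3 hours at 4 fps.
proxy_frames = 3 * 60 * 60 * 4
print(proxy_frames)  # 43200
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would run the encode. The sketch only builds and prints the command, so it is safe to execute without ffmpeg installed.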
### Stage 1 — Coarse scene change (LAB palette delta)
Per-frame signature: downsample to 100×100, convert to LAB, K-means with k=5, return cluster centers sorted by size. Compare consecutive signatures via size-weighted L2. When distance > threshold (0.10-0.15 in normalized LAB space), flag the frame.
This is the **kasa pattern** (`C:\projects\kasa\kasa_cinematic_bulbs.py:50-72`). The kasa code does live screen capture for a lightbulb ambient-lighting use case, but the palette extraction is exactly right for frame-change detection: it's robust to cursor blinks, subpixel font rendering, and JPEG noise, while catching modal opens, panel switches, and theme shifts.
Output: ~200-500 candidate keyframes from 3 hours.
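The comparison step can be sketched without any image library. The text does not spell out "size-weighted L2", so the version below is one plausible reading and an assumption: clusters are paired by rank, and each pair's Euclidean distance is weighted by the mean of the two cluster sizes. The signatures are hypothetical, already-extracted values (LAB channels normalized to 0..1). The K-means extraction itself is not shown.

```python
import math

def signature_distance(sig_a, sig_b):
    """Size-weighted L2 distance between two palette signatures.

    Each signature is a list of (weight, (L, a, b)) tuples sorted by
    weight descending, with weights summing to 1. Clusters are paired
    by rank, which is an assumption of this sketch.
    """
    total = 0.0
    for (w_a, c_a), (w_b, c_b) in zip(sig_a, sig_b):
        weight = (w_a + w_b) / 2.0
        total += weight * math.dist(c_a, c_b)
    return total

def is_scene_change(sig_a, sig_b, threshold=0.12):
    """Flag the frame when the distance exceeds the threshold (0.10-0.15 band)."""
    return signature_distance(sig_a, sig_b) > threshold

# Hypothetical signatures: a dark UI, then the same UI with a light modal on top.
before = [(0.7, (0.15, 0.50, 0.50)), (0.3, (0.80, 0.50, 0.50))]
after = [(0.6, (0.85, 0.50, 0.50)), (0.4, (0.15, 0.50, 0.50))]

print(is_scene_change(before, before))  # False
print(is_scene_change(before, after))   # True
```

Rank pairing is the fragile part: two clusters of near-equal size can swap rank between frames and inflate the distance, so a real implementation may prefer nearest-center matching.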
### Stage 2 — Pixel-diff backup (catches what palette misses)
For frames where palette delta < threshold, run `cv2.absdiff` against the last *kept* frame, masked to UI regions (top status bar, panel areas, modal layer). If any region's per-pixel mean luminance delta > 0.05, save it.
This catches text additions, tooltip pops, and small widget glitches that don't move the dominant palette. Trade-off: ~30% more saved frames, ~2× the Stage 1 cost.
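The region-masked check can be illustrated with plain nested lists. This is a sketch under stated assumptions: luminance is on a 0..1 scale, regions are axis-aligned rectangles, and the region names and coordinates are made up. A real implementation would more likely use `cv2.absdiff` on arrays, as the text says.

```python
def region_mean_delta(prev, curr, region):
    """Mean absolute luminance difference inside one rectangular region.

    prev/curr are 2D lists of luminance values in 0..1 with the same
    shape; region is (top, left, bottom, right), bottom/right exclusive.
    """
    top, left, bottom, right = region
    total = 0.0
    count = 0
    for y in range(top, bottom):
        for x in range(left, right):
            total += abs(curr[y][x] - prev[y][x])
            count += 1
    return total / count if count else 0.0

def should_keep(prev, curr, regions, threshold=0.05):
    """Keep the frame if any named UI region changed more than the threshold."""
    return any(region_mean_delta(prev, curr, r) > threshold for r in regions.values())

# Hypothetical 4x8 frames compared against the last kept frame.
prev = [[0.2] * 8 for _ in range(4)]
curr = [row[:] for row in prev]
curr[0][0] = curr[0][1] = 1.0          # a small change inside the status bar
regions = {"status_bar": (0, 0, 1, 8), "panel": (1, 0, 4, 8)}

print(should_keep(prev, curr, regions))  # True
```

Masking matters because averaging the same change over the whole frame would dilute it by a factor of four in this toy example.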
### Stage 3 — Manual triage (the current path)
User opens the proxy video in a player, scrubs at 4× speed, and for each visual event writes a structured note in the DSL (Section 3 below). Output: a single `triage.dsl` file with N entries.
The DSL is the contract. It is **append-only** during triage (entries can be marked `superseded` but not deleted). Each entry has a timestamp, a frame reference, a state snapshot, and a finding. The format is plain text, diff-friendly, and reviewable in any text editor.
### Stage 4 — DSL aggregation → tickets
A small parser reads `triage.dsl` and groups related entries. Grouping rule: same window + same panel (as named in each entry's ASCII layout block) + temporal proximity (<60 s) = one ticket. Output: one markdown file per group under `conductor/tracks/dogfood_<date>/tickets/`, each with reproduction steps + the supporting DSL diffs.
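A minimal sketch of the grouping rule, assuming entries have already been parsed into records with a window, a panel, and a timestamp in seconds. The `Entry` type is hypothetical, and "temporal proximity" is read here as the gap to the previous entry in the same group, which is one of several reasonable interpretations.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    entry_id: str
    t_seconds: float
    window: str
    panel: str

def group_entries(entries, max_gap_s=60.0):
    """Group entries sharing window + panel whose timestamps chain within max_gap_s."""
    groups = []
    last_group = {}   # (window, panel) -> most recently opened group for that key
    for entry in sorted(entries, key=lambda e: e.t_seconds):
        key = (entry.window, entry.panel)
        group = last_group.get(key)
        if group is not None and entry.t_seconds - group[-1].t_seconds < max_gap_s:
            group.append(entry)
        else:
            group = [entry]
            groups.append(group)
            last_group[key] = group
    return groups

# Hypothetical entries; timestamps are seconds into the proxy video.
entries = [
    Entry("E01", 872.5, "Main", "Status Bar"),
    Entry("E02", 900.0, "Main", "Status Bar"),
    Entry("E03", 1200.0, "Main", "Status Bar"),
    Entry("E04", 905.0, "Main", "Comm History"),
]
groups = group_entries(entries)
ids = [[e.entry_id for e in g] for g in groups]
print(ids)  # [['E01', 'E02'], ['E04'], ['E03']]
```

Each inner list would become one ticket file.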
### Stage 5 — Future automation (where LLMs/diffusion plug in)
Three pluggable stages, each independent:
- **5a. DSL-from-image (diffusion/vision):** a vision model takes the candidate keyframe + the previous keyframe + the App's UI hierarchy dump → emits a DSL entry (ASCII layout block plus `@delta`). Trainable, fallible, but reduces manual effort from "watch 3 hours" to "verify 200-500 model outputs."
- **5b. Narrative-from-DSL (LLM text):** an LLM reads the full `triage.dsl` and emits one sentence per `@ux_finding` in standardized ticket format. Pure text → text.
- **5c. Cross-video regression dedup (RAG over past DSL):** index all past `triage.dsl` files via RAG. When a new finding looks semantically similar to a past finding, surface "you've seen this before — ticket T-1234." Uses the conservative-RAG pattern (opt-in, complement not replace, provenance, no mutation).
The design intent: **stages 0-4 work today with zero AI.** Stage 5 is a multiplier, not a dependency. If stage 5a produces garbage, you fall back to stage 3 manually. The pipeline degrades gracefully.
---
## 3. The Triage Overlay (built on the existing ASCII Layout Map DSL)
### 3.1 The split: visual layer (existing) vs meta layer (new)
The existing ASCII UI Layout Map DSL ([`docs/guide_ascii_layout_map.md`](../guide_ascii_layout_map.md)) defines the **visual grammar** — how to draw an ImGui panel as ASCII. It covers 14 widget types (buttons, checkboxes, combos, sliders, tables, tree nodes, etc.), high-resolution techniques (feature zooming, grid overlays, state multiplicity annotations), and SSDL control-flow primitives (`[Q:]` `[B:]` `[S:]` `[N:]` `[I:]`).
What it does NOT cover is **the temporal dimension**. A static sketch is one frame; a triage session is many frames over time, and the *changes* between frames are what carry the regression signal. The overlay defined here adds only what the existing DSL lacks:
| Layer | Source | Purpose | Examples |
|---|---|---|---|
| **Visual** | `docs/guide_ascii_layout_map.md` (existing) | Draw the panel | `+=== Title ===+`, `[Save]`, `[X]`, `[v]`, `|text|`, `[Zoom: …]`, `---` |
| **State annotation** | `docs/guide_ascii_layout_map.md` §4.3 (existing) | Single-frame state | `[State: app.show_X == True]` |
| **Triage meta** | **this report (new)** | **Multi-frame change log + findings** | **`--- E## @t=… @frame=N ---` header, `@delta vs E##`, `@ux_finding severity=… category=…`** |
The visual layer is reused unchanged. The triage meta layer is the only thing this report defines. Keeping the visual grammar untouched means any future change to the canonical guide automatically propagates to triage output — no parallel grammar to maintain.
### 3.2 Worked example (a real finding, rendered in the existing grammar)
Same `stale_state` finding from the prior draft, but rendered using the **existing** visual grammar + the new meta layer. Compare against the existing guide's worked examples in §6 of `docs/guide_ascii_layout_map.md`.
```
--- E01 @t=00:14:32.500 @frame=420 @palette_delta=0.18 @pixel_delta=0.04 ---
[State: observed during active MMA session, t=00:14:32]
+==================================================+
| Manual Slop — Main [X] |
+--------------------------------------------------+
| Active Track: mma_tier_usage_reset_fix |
| Progress: [============-----------] 60% | <- was 65% at E00
| Tickets: 5 done / 2 in progress / 0 blocked |
| |
| Comm History |
| +----------------------------------------------+ |
| | [ERROR] tier3-worker: Cannot connect to API | |
| | [INFO] tier2-tech-lead: Retrying... | |
| +----------------------------------------------+ |
| |
| Status: FPS:60 CPU:12% Tokens:14.2k |
| Last update: 00:08:14 |
| ^^^^^^^^^ |
| stale (6m18s old) |
+==================================================+
@delta vs E00
- Panel "Comm History" gained 2 entries (1 ERROR tier3-worker, 1 INFO tier2-tech-lead)
- Progress bar p1 dropped 0.65 -> 0.60 (-5pp, no visible cause)
- Status bar "Last update" field unchanged at 00:08:14 (now 00:14:32, +6m18s)
while session is observably active (comm history growing, worker spawning)
@ux_finding severity=high category=stale_state
Status bar "Last update" timestamp does not refresh during active MMA
sessions. Misleading to operators who may believe the session is idle
when worker activity is ongoing.
@repro
1. Open any MMA dashboard
2. Trigger a worker spawn
3. Wait 5+ minutes
4. Observe "Last update" field — does not refresh
@screenshots
- out/frames/E01_00-14-32_full.png
- out/frames/E01_00-14-32_zoom_status.png
@cross_refs
- src/gui_2.py:_render_status_bar (TODO: locate)
- Past dogfood 2026-06-10 (verbal, not in DSL): "status bar lies sometimes"
```
The visual block (`+===+`, `[ERROR]`, `[INFO]`, `[============-----------]`) is **existing grammar** (see [`docs/guide_ascii_layout_map.md` §2](../guide_ascii_layout_map.md)). The `[State: ...]` annotation is also existing grammar (§4.3 of the guide), repurposed for *observed* state rather than the *design* state it was originally scoped for. The only new constructs are:
- the entry header line (`--- E## @t=… @frame=N ---`)
- `@delta vs E##` (bulleted change list)
- `@ux_finding severity=… category=…` (regression note + `@repro`, `@screenshots`, `@cross_refs` sub-blocks)
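The "stale (6m18s old)" annotation in the worked example is the kind of figure a tool can recompute from the entry header. A minimal sketch, assuming the on-screen "Last update" value uses the same time base as `@t` (the function names are illustrative):

```python
def parse_hms(text):
    """Convert 'HH:MM:SS' or 'HH:MM:SS.mmm' to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def staleness(entry_t, field_t):
    """Seconds between the entry timestamp and a timestamp shown in the UI."""
    return parse_hms(entry_t) - parse_hms(field_t)

def format_age(seconds):
    """Render a duration as e.g. '6m18s'."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}m{secs:02d}s"

age = staleness("00:14:32", "00:08:14")
print(format_age(age))  # 6m18s
```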
### 3.3 The meta-layer grammar (the only new part)
Five constructs. All are line-oriented. All are optional except the entry header (every observation is one entry, every entry has one header).
| Construct | Required | Optional | Purpose |
|---|---|---|---|
| `--- E## @t=HH:MM:SS.mmm @frame=N ---` | `E##`, `t`, `frame` | `@palette_delta`, `@pixel_delta`, `@notes` | Entry header; canonical separator between observations |
| `[State: …]` | — | — | Observed state at this entry; reuses existing guide §4.3 grammar |
| ASCII Layout block | — | — | Visual snapshot; reuses existing guide grammar verbatim |
| `@delta vs E##` | `vs E##` | — | Bulleted change list vs the referenced prior entry |
| `@ux_finding severity=<lvl> category=<name>` | `severity`, `category` | `@repro`, `@screenshots`, `@cross_refs`, `@notes` | A regression note; body is free prose |
`severity` uses the existing conductor ticket convention: `low | medium | high | critical`. `category` is free-form for v1; see §7 for the convergence plan. Entry IDs are monotonic `E00`, `E01`, … per `triage.dsl` file (matches the existing conductor ticket convention).
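Because the header is line-oriented, it can be parsed by splitting on whitespace. The sketch below is illustrative rather than the project's parser. It assumes header field values contain no spaces, which may not hold for `@notes`, and it converts only the two delta fields to numbers.

```python
import re

ENTRY_ID_RE = re.compile(r"E\d{2,}")

def parse_header(line):
    """Parse one entry header line; return a dict, or None if it is not a header."""
    tokens = line.split()
    if len(tokens) < 5 or tokens[0] != "---" or tokens[-1] != "---":
        return None
    entry_id = tokens[1]
    if not ENTRY_ID_RE.fullmatch(entry_id):
        return None
    fields = {}
    for token in tokens[2:-1]:
        if not token.startswith("@") or "=" not in token:
            return None
        key, value = token[1:].split("=", 1)
        fields[key] = value
    if "t" not in fields or "frame" not in fields:
        return None                      # t and frame are required
    parsed = {"id": entry_id, "t": fields["t"], "frame": int(fields["frame"])}
    for key in ("palette_delta", "pixel_delta"):
        if key in fields:
            parsed[key] = float(fields[key])
    return parsed

header = parse_header(
    "--- E01 @t=00:14:32.500 @frame=420 @palette_delta=0.18 @pixel_delta=0.04 ---"
)
print(header)
```

Lines that are not headers, such as `@delta vs E00`, return `None`, so the same function can serve as the entry separator test while scanning a file.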
### 3.4 Why this shape (instead of a separate DSL)
- **No grammar duplication.** The visual layer is the existing guide. Only the meta layer is new. Future edits to the canonical guide propagate automatically.
- **Existing tools apply.** Anything that already reads ASCII Layout Maps (the design-contract workflow in [`docs/reports/ascii_sketch_ux_workflow_20260608.md`](../reports/ascii_sketch_ux_workflow_20260608.md), the `MiniMax understand_image` cross-checks, the docstring convention in `gui_2.py`) works on triage output unchanged.
- **The existing visual grammar is opinionated for ImGui specifically.** It already encodes that `[X]` means "on", `[v]` is a dropdown arrow, `+===+` is a window frame. Inventing a parallel grammar would have re-litigated all of that.
- **Stage 5 prompt compatibility.** A future LLM stage that reads an existing ASCII Layout Map can already do so (per the workflow doc §1 Step 3). The prompt just needs to ask for *the meta layer* on top: "given this before/after pair of ASCII Layout Maps, emit the `@delta` and any `@ux_finding`."
- **Manual triage is faster.** The user already knows the visual grammar from existing design work; only the meta layer (5 constructs) is new to learn.
### 3.5 The meta layer is the contract for the LLM/diffusion stages
If Stage 5a writes the meta layer (and the visual layer that reuses the existing grammar), the rest of the pipeline doesn't care whether the meta came from a human or a model. The aggregation stage (4) and the future RAG dedup (5c) operate on the meta layer (`@ux_finding` + `@delta`), not on raw visual snapshots. This is the **separation of perception from reasoning**: perception (frame → ASCII + meta) is the hard part; reasoning (meta → ticket) is the easy part.
The visual layer has the additional benefit that **it's already verified against the rendered GUI.** The design-contract workflow ([`docs/guide_ascii_layout_map.md` §7](../guide_ascii_layout_map.md)) already includes a Puppeteer visual audit step. Triage output that reuses the same grammar can be cross-checked the same way: a future verification pass ("does the triage entry match the actual frame?") can plug into existing verification infrastructure.
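To make the "operate on the meta layer only" point concrete, the sketch below pulls `@ux_finding` lines out of DSL text and ignores the ASCII blocks entirely. It is an illustration, not the Stage 4 parser: the function name is hypothetical, and the sample text is abbreviated from the examples in this report.

```python
import re

SEVERITIES = ("low", "medium", "high", "critical")
HEADER_ID_RE = re.compile(r"^--- (E\d{2,}) ")
FINDING_RE = re.compile(r"^@ux_finding severity=(\w+) category=(\w+)\s*$")

def extract_findings(dsl_text):
    """Return one record per @ux_finding, tagged with the entry it belongs to."""
    findings = []
    current_entry = None
    for raw_line in dsl_text.splitlines():
        line = raw_line.strip()
        header = HEADER_ID_RE.match(line)
        if header:
            current_entry = header.group(1)
            continue
        match = FINDING_RE.match(line)
        if match and match.group(1) in SEVERITIES:
            findings.append({
                "entry": current_entry,
                "severity": match.group(1),
                "category": match.group(2),
            })
    return findings

sample = """\
--- E01 @t=00:14:32.500 @frame=420 ---
@delta vs E00
- Status bar "Last update" field unchanged at 00:08:14
@ux_finding severity=high category=stale_state
Status bar timestamp does not refresh.
--- E07 @t=00:32:14.000 @frame=1928 ---
@ux_finding severity=medium category=modal_focus_steal
"""
findings = extract_findings(sample)
print(findings)
```

Whether the entries were written by a human or by a model makes no difference to this pass.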
---
### 3.6 Edge cases that exercise the LLM/DSL boundary (the 80/20)
The 8 examples below cover the failure modes most likely to ship in this codebase, ranked by LLM difficulty. Each example shows (a) the DSL block a human or Stage 5a would emit, (b) the specific challenge for an LLM processing image → ASCII, and (c) the `@ux_finding` annotation that should be generated. **Difficulty ratings** are how hard the case is for a vision model to convert to ASCII *correctly* — not how hard the case is to spot after the ASCII exists.
---
#### Case 1 — Modal stacking + focus loss (difficulty: medium)
The negative finding is the load-bearing part: focus *should* be on the Track Browser row but is not. Pixel diff alone cannot detect absence; the LLM must cross-reference prior entries.
```
--- E07 @t=00:32:14.000 @frame=1928 @palette_delta=0.22 ---
[State: app.active_modal = "Confirm Delete"]
+==================================================+
| Manual Slop — Main [X] |
+--------------------------------------------------+
| Track Browser |
| > COMPLETED TRACKS |
| > ARCHIVED TRACKS |
| (no focused row — was "ai_loop_regressions") | <- focus stolen
| |
| +------------------------------------+ |
| | Confirm Delete [X] | | <- modal on top
| +------------------------------------+ |
| | Delete track "ai_loop_regressions"?| |
| | | |
| | [Cancel] [Delete] | |
| +------------------------------------+ |
+==================================================+
@delta vs E06
- Modal "Confirm Delete" opened above Track Browser
- Track Browser focus indicator: visible -> absent (negative change)
- Underlying "Comm History" panel still auto-scrolling (visible through modal? verify alpha)
@ux_finding severity=medium category=modal_focus_steal
Opening a confirmation modal does not return focus to the prior Track
Browser row when closed. After Esc/Cancel, no row is highlighted.
@repro
1. Select any track in Track Browser
2. Press Delete (modal opens)
3. Press Escape (modal closes)
4. Observe: focus indicator gone, no row highlighted
@cross_refs src/gui_2.py:render_confirm_modal (TODO: locate)
@llm_observation
Difficulty: MEDIUM. Negative findings (something absent that should be
present) require cross-referencing E06 where the focus WAS visible.
An LLM processing only E07 in isolation cannot detect this bug.
```
---
#### Case 2 — Mid-drag state (difficulty: high)
A snapshot of a drag-in-progress captures a state that is not in the design contract — there's no "during drag" mockup. The LLM must infer the meaning of the ghost preview from context.
```
--- E23 @t=01:14:08.500 @frame=12724 @palette_delta=0.08 @pixel_delta=0.03 ---
[State: drag_in_progress, source=ticket_t2_4, target=phase_2]
+==================================================+
| Ticket Queue |
| |
| [✓] t2_1: Extract File IO |
| [✓] t2_2: Extract Python |
| ~> t2_4: Implement Parser [DRAG] | <- source, dimmed
| |
| (ghost outline at phase_2 slot) | <- LLM-inferred
| |
| [ ] t3_1: Write tests |
+==================================================+
@delta vs E22
- Ticket t2_4 entered drag state (highlighted, dimmed)
- Ghost outline visible at phase_2 slot (indicating drop target)
- No entry-level @delta — drag is a transient state
@ux_finding severity=low category=during_interaction
No regression; documenting the drag visual state for completeness.
The ghost outline uses a different border weight than the standard
drag indicator described in the design contract — may be intentional.
@llm_observation
Difficulty: HIGH. "Ghost outline" and "[DRAG]" annotations are
LLM inferences, not literal pixel features. The model must recognize
the drag pattern from context (dimmed source + offset outline) and
add the bracketed annotation by convention.
```
---
#### Case 3 — Stale data with fresh UI labels (difficulty: high)
The label says "updated just now" but the data shown is from 3 hours ago. **Pixel diff passes** (the UI *did* update — the label changed). **Semantic diff** fails (the data didn't actually update). The LLM must read the label text, parse a timestamp, and check it against frame time.
```
--- E41 @t=02:07:33.000 @frame=23892 @palette_delta=0.04 @pixel_delta=0.02 ---
[State: data_panel.showing = "session_metrics", session.last_update = 23:14:51]
+==================================================+
| Session Metrics |
| |
| Last refresh: 23:14:51 (3m42s ago) | <- label
| Tokens: 14,231 |
| Active workers: 2 |
| |
| [Refresh Now] |
+==================================================+
@delta vs E40
- Label "Last refresh" changed: 23:10:51 -> 23:14:51 (4 minutes newer)
- Token count: 14,231 -> 14,231 (unchanged)
- Worker count: 2 -> 2 (unchanged)
- No new events in the session log between 23:14:51 and 02:07:33
@ux_finding severity=high category=stale_data
The "Last refresh" label updates from a different source than the data
it labels. The label advanced 4 minutes but token count + worker count
did not change — suggesting the label refresh is triggered by heartbeat,
but the underlying data fetch is failing silently.
@repro
1. Open Session Metrics panel
2. Note token count
3. Wait 5 minutes
4. Observe: label advances, token count unchanged
@cross_refs src/gui_2.py:render_session_metrics (TODO: locate)
@llm_observation
Difficulty: HIGH. Requires (a) reading the timestamp in the label,
(b) comparing to frame time, (c) cross-referencing with session log
to verify whether a refresh event occurred. Pure pixel diff misses
this completely — the label DID change, just not in sync with data.
```
---
#### Case 4 — Cross-panel coupling from one root cause (difficulty: medium)
A single user action (saving a preset) updates 3 panels simultaneously. The LLM must group these as one finding, not three.
```
--- E52 @t=02:48:12.000 @frame=31692 @palette_delta=0.31 ---
[State: preset_saved, propagated to 3 panels]
[Panel: Context Hub]
+----------------------------------------------------+
| Context Hub |
| Active preset: [fast_coding_v3 v] (was: v2) | <- changed
+----------------------------------------------------+
[Panel: AI Settings]
+----------------------------------------------------+
| AI Settings |
| System Prompt Preset: [fast_coding_v3 v] | <- changed
+----------------------------------------------------+
[Panel: Status Bar]
+----------------------------------------------------+
| Status: Preset "fast_coding_v3" loaded | <- changed
+----------------------------------------------------+
@delta vs E51
- Context Hub: Active preset v2 -> v3
- AI Settings: System Prompt Preset v2 -> v3
- Status Bar: shows new preset name (transient, fades in 3s)
@ux_finding severity=low category=propagation_correct
Single user action "Save preset fast_coding_v3" propagated correctly
to all 3 dependent panels. Documenting as a passing case for the
propagation pattern. (Not a bug.)
@llm_observation
Difficulty: MEDIUM. The LLM must group 3 panel changes as one finding
(correct propagation) rather than 3 independent findings (false alarm).
Requires temporal clustering: all 3 changes within the same frame.
```
---
#### Case 5 — Spinner stuck after task complete (difficulty: medium)
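The visual cue is "spinner still present" but the semantic cue is "underlying task is done". The spinner's animation produces only a small pixel delta (0.01 here, below the Stage 2 threshold), so the pixel statistics neither flag nor clear the frame. The LLM must read the text labels and notice that "Status: Ready" contradicts "Rebuilding...".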
```
--- E68 @t=03:21:05.000 @frame=38185 @palette_delta=0.03 @pixel_delta=0.01 ---
[State: spinner_active_but_task_complete=true]
+----------------------------------------------------+
| RAG Engine |
| |
| Status: Ready | <- says Ready
| Index size: 14,231 vectors |
| |
| [spinner] Rebuilding... (animated) | <- contradiction
| |
| [Rebuild Index] |
+----------------------------------------------------+
@delta vs E67
- Spinner is animating (delta is animated pixels, not state)
- "Status: Ready" label unchanged
- "Rebuilding..." text unchanged
- Task completion event NOT in session log (expected if rebuild never ran)
@ux_finding severity=high category=state_contradiction
"Status: Ready" + animated "Rebuilding..." spinner are simultaneously
true. The spinner is stuck from a prior incomplete rebuild. User
cannot tell whether a rebuild is in progress or stuck.
@repro
1. Trigger RAG rebuild
2. Cancel mid-rebuild
3. Observe: spinner persists, Status: Ready
@cross_refs src/gui_2.py:render_rag_status (TODO: locate)
@llm_observation
Difficulty: MEDIUM. The LLM must recognize that a low palette delta
+ low pixel delta does NOT mean "no change" — animation creates
pixel deltas. The LLM must read the text labels and detect the
contradiction, not trust the pixel statistics.
```
---
#### Case 6 — Wrong label / semantic text error (difficulty: very high)
The button says `[Save]` but the action is destructive (deletes files). **Pixel diff is useless** — the button renders correctly. **OCR + semantic classification** is required. This is the hardest case for an LLM.
```
--- E73 @t=03:42:18.500 @frame=42981 @palette_delta=0.02 ---
[State: button_label_wrong, action_actual=delete_files]
+----------------------------------------------------+
| Clear Workspace [X] |
+----------------------------------------------------+
| This will delete all session artifacts. |
| |
| Name: |confirm-clear_________________________| |
| |
| [Save] | <- WRONG LABEL
+----------------------------------------------------+
@delta vs E72
- (no visual delta; this is a semantic-only finding)
@ux_finding severity=critical category=wrong_label
The "Clear Workspace" confirmation modal has a button labeled [Save]
but the action deletes session artifacts. This is a destructive
operation with an incorrect non-destructive label.
@repro
1. Trigger "Clear Workspace"
2. Type "confirm-clear" in the name field
3. Observe the primary action button: it says [Save]
4. Click it -> session artifacts are deleted
@cross_refs
- src/gui_2.py:render_clear_workspace_modal (TODO: locate)
- Possibly related: the button label is reused from a "Save Profile" modal
@llm_observation
Difficulty: VERY HIGH. Pixel diff returns no delta. The LLM must
(a) read the button text via OCR/ASCII, (b) read the surrounding
context ("This will delete all session artifacts"), (c) recognize
the contradiction. Vision models that only describe pixels will
miss this. Models that perform text+context reasoning may catch
it; accuracy depends on training data distribution for "destructive
action with non-destructive label".
```
---
#### Case 7 — Multi-viewport / popped-out panel drift (difficulty: high)
A popped-out panel shows a different state than the main window. The LLM must read multiple frames (or the main + popped-out viewports) and detect the state desync.
```
--- E88 @t=04:18:42.000 @frame=49957 @palette_delta=0.15 ---
[State: viewport.main = "MMA Dashboard v2", viewport.popout_discussion = "Discussion #3 v1"]
[Main viewport:]
+==================================================+
| MMA Dashboard [Pop-out] | <- v2 indicator
| Active: mma_tier_usage_reset_fix |
+==================================================+
[Pop-out viewport: "Discussion #3"]
+==================================================+
| Discussion #3 [Dock back] | <- v1 indicator
| Last entry: 5 minutes ago (stale in popout) |
+==================================================+
@delta vs E87
- Main viewport: MMA Dashboard refreshed (v2 indicator visible)
- Pop-out viewport: Discussion #3 stale (v1 indicator, no refresh)
@ux_finding severity=medium category=viewport_state_drift
When a panel is popped out into a separate viewport, it stops
receiving state updates from the main app. The popped-out panel
shows stale data even when the equivalent in-main panel is fresh.
@repro
1. Pop out the Discussion panel
2. Add a new entry in the main Discussion panel
3. Observe popped-out panel: no update
@cross_refs src/gui_2.py:popout_discussion_viewport (TODO: locate)
@llm_observation
Difficulty: HIGH. Requires reasoning about TWO simultaneous viewports
in a single frame. The LLM must compare state across viewports and
recognize the drift. May require Stage 5a to emit multiple ASCII
blocks per entry (one per viewport).
```
---
#### Case 8 — Long static period with hidden event (difficulty: medium)
20 minutes of identical UI (E94 through E98), but the session log shows 3 worker crashes. **Pixel diff returns zero** for the entire period. The LLM must consult a *secondary signal* (the session log) to detect what the pixels don't show.
```
--- E94 @t=04:55:00.000 @frame=53172 ---
--- E95 @t=05:00:00.000 @frame=54000 --- (delta vs E94: 0.00)
--- E96 @t=05:05:00.000 @frame=54900 --- (delta vs E95: 0.00)
--- E97 @t=05:10:00.000 @frame=55800 --- (delta vs E96: 0.00)
--- E98 @t=05:15:00.000 @frame=56700 --- (delta vs E97: 0.00)
[State: app.ui_idle = true, but session_events = [worker_crash, worker_crash, worker_crash]]
+==================================================+
| MMA Dashboard |
| (same content as E94) |
+==================================================+
@ux_finding severity=high category=hidden_event
UI is static for 20 minutes (04:55 - 05:15 video time) while the
session log shows 3 worker crashes in the same window. The UI gives
no indication that anything is wrong; an operator watching the screen
would believe the system is idle.
@evidence
- Session log shows 3 ERROR events between 04:55 and 05:15
- "Comm History" panel SHOULD show these events but does not
(possibly a render-thread bug blocking the update)
@cross_refs
- logs/sessions/2026-06-17_dogfood.jsonl (3 ERROR events)
- src/gui_2.py:render_comm_history (TODO: locate)
@llm_observation
Difficulty: MEDIUM (but undetectable from pixels alone). The LLM
must triangulate 3 signals: (a) no pixel change for 20 min,
(b) session log shows events, (c) Comm History panel not updating.
This is the case where vision-only LLMs fail entirely; the pipeline
needs a "secondary signals" channel (logs, hook events) accessible
to the same reasoning pass.
```
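The "secondary signals" check for this case can be sketched as a join between visually static entries and log events. Everything below is hypothetical: the tuple layouts, the event timestamps, and the assumption that log times have already been converted to the same time base as `@t`.

```python
def find_hidden_events(entries, log_events, max_delta=0.0):
    """Return log events that fall inside a run of visually static entries.

    entries: list of (t_seconds, delta_vs_previous) in time order.
    log_events: list of (t_seconds, level, message).
    """
    hidden = []
    for (t_prev, _), (t_curr, delta) in zip(entries, entries[1:]):
        if delta <= max_delta:
            # The UI did not change across this interval; collect what the log saw.
            hidden += [ev for ev in log_events if t_prev <= ev[0] < t_curr]
    return hidden

# E94..E98 as (seconds, delta vs previous entry); the first delta is unused.
entries = [(17700, 0.0), (18000, 0.0), (18300, 0.0), (18600, 0.0), (18900, 0.0)]

# Made-up log events for illustration.
log_events = [
    (17850, "ERROR", "worker_crash"),
    (18420, "ERROR", "worker_crash"),
    (18780, "ERROR", "worker_crash"),
    (19000, "INFO", "worker_respawn"),   # outside the static window
]

hidden = find_hidden_events(entries, log_events)
print(len(hidden))  # 3
```

A non-empty result for a zero-delta stretch is the trigger for a `hidden_event` finding.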
---
### 3.7 Findings report format (what Stage 5b emits)
Stage 5a produces DSL. Stage 5b consumes DSL across many entries and emits a **findings report**. The user reads the report and decides which entries to dig deeper on.
#### Template
```markdown
# Triage Findings Report — {dogfood_date}
**Source:** docs/dogfood_{date}/triage.dsl ({N} entries, {M} @ux_finding)
**Generated:** {timestamp}
**Coverage:** {X}% of @ux_finding have direct screenshot evidence
## Summary
- Total entries processed: {N}
- Total @ux_finding emitted: {M}
- Severity: critical={c}, high={h}, medium={m}, low={l}
- Time range: {T_start} to {T_end}
- Categories seen: {list with counts}
## Top findings (severity=high, sorted by occurrence count)
### 1. {category}: {one-sentence description}
- **Evidence:** E##, E##, E## ({N_occurrences} occurrences)
- **Pattern:** {observed pattern, e.g. "occurs after every worker spawn"}
- **Likely root cause:** {hypothesis, e.g. "render thread not subscribed to worker event channel"}
- **Confidence:** {high|medium|low}
- **Suggested ticket:** {file path under conductor/tracks/.../tickets/}
### 2. ...
## Cross-cutting patterns
### Pattern A: {name} ({N} entries span this)
- Affected categories: {list}
- Affected panels: {list}
- Time cluster: {T_start} - {T_end}
- Hypothesis: {shared root cause?}
## Time clusters (events grouped by proximity)
| Cluster | Time range | N entries | Top category | Hypothesis |
|---|---|---|---|---|
| 1 | 00:14:00 - 00:18:00 | 16 | stale_state | worker connection retries |
| 2 | 01:42:00 - 01:45:00 | 9 | undo_redo | history corruption sequence |
| ... |
## Single-occurrence findings (need human confirmation)
- **E23:** mid-drag state — possible visual regression, need to verify design contract
- **E47:** focus loss — single observation, may be one-off; suggest re-test
- ...
## Items I am NOT calling findings (uncertainty disclosure)
These look suspicious but I am not confident enough to flag:
- **E88:** viewport drift — could be intentional behavior; check spec
- **E103:** spinner animation — probably not stuck, just animated; verify duration
- **E117:** empty panel — could be intentional empty state, not a missing data bug
- ...
## Suggested follow-ups (timestamps the user should re-watch)
1. **Re-watch E47-E62 at 0.25× speed** — rapid state churn during worker spawn; need finer granularity
2. **Re-watch E88 from start to end** — viewport drift appeared mid-session; verify when it started
3. **Cross-check E94-E98 against session log** — the hidden-event case; verify the log evidence
4. **Compare E73's modal screenshot against the "Clear Workspace" design contract** — if a design contract exists, verify the [Save] label is intentional
## What I would investigate next with more compute
- Build a dependency graph between @delta entries to find root causes across clusters
- Diff this report against past dogfood reports (via RAG over past triage.dsl files) to flag recurring patterns
- Run a second pass at 0.5× speed on the time ranges where pixel change was high but @ux_finding was low (possible missed findings)
```
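The "Time clusters" table in the template groups entries by proximity. One simple way to do that grouping is a gap threshold over sorted timestamps, sketched below. The function names and the 60-second default gap are assumptions for illustration; the template does not fix a clustering rule.

```python
def cluster_by_gap(timestamps_s, max_gap_s=60.0):
    """Group timestamps into clusters; a gap larger than max_gap_s starts a new one."""
    clusters = []
    for t in sorted(timestamps_s):
        if clusters and t - clusters[-1][-1] <= max_gap_s:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters


def format_hms(seconds):
    """Render seconds as HH:MM:SS, matching the time-range column in the template."""
    total = int(seconds)
    return f"{total // 3600:02d}:{(total % 3600) // 60:02d}:{total % 60:02d}"
```

A cluster's table row would then use `format_hms(cluster[0])` and `format_hms(cluster[-1])` for the time range and `len(cluster)` for the entry count.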
#### User iteration loop
The user reads the report and replies with **one of four intents**:
| User reply | Stage 5b action |
|---|---|
| "Confirmed, ship the top-3 findings as tickets" | Generate ticket markdown files; commit |
| "Check E47-E62 at higher granularity" | Re-process entries E47-E62; emit deeper per-entry findings |
| "E88 isn't a bug, it's intentional — remove it" | Mark E88 as `superseded` in triage.dsl; regenerate report without it |
| "I disagree with the {category} cluster hypothesis; here's what I think is happening" | Record the human hypothesis as `@human_note` in triage.dsl; re-run with the constraint |
The DSL supports all four: confirmed findings become tickets, deeper digests are just more `@ux_finding` blocks per entry, supersession is a flag, and human notes are a meta-layer annotation. **The loop is the value**: the LLM does the broad sweep, the user does the precision surgery.
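Three of the four intents reduce to small operations on the entry set, which the sketch below illustrates. It is hypothetical throughout: `Entry` is an in-memory stand-in for a parsed `triage.dsl` entry, not the DSL's real schema, and the unpadded `E<n>` id format is assumed from the examples above. Ticket generation (the first intent) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """In-memory stand-in for one triage.dsl entry (hypothetical shape)."""
    entry_id: str
    category: str
    superseded: bool = False
    human_notes: list = field(default_factory=list)


def select_range(entries, first, last):
    """'Check E47-E62': pick entries for a deeper re-processing pass."""
    wanted = {f"E{n}" for n in range(first, last + 1)}
    return [e for e in entries.values() if e.entry_id in wanted]


def supersede(entries, entry_id):
    """'E88 isn't a bug': flag the entry so a regenerated report omits it."""
    entries[entry_id].superseded = True


def add_human_note(entries, entry_id, note):
    """'I disagree with the hypothesis': record the user's view as an @human_note."""
    entries[entry_id].human_notes.append(note)


def reportable(entries):
    """Entries that still appear when the report is regenerated."""
    return [e for e in entries.values() if not e.superseded]
```

Superseding sets a flag rather than deleting the entry, so the original observation stays in `triage.dsl` even after the report drops it.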
#### Worked example (rolled-up output from §3.6)
If §3.6's 8 examples were the only @ux_finding in a 3-hour dogfood, the report's top section would be:
```markdown
## Top findings (severity=high, sorted by occurrence count)
### 1. stale_data (E41): Session Metrics label advances but data does not
- **Evidence:** E41 (1 occurrence so far)
- **Pattern:** label-data desync after idle periods
- **Likely root cause:** heartbeat triggers label refresh; data fetch is failing silently
- **Confidence:** medium (single occurrence, but the contradiction is unambiguous)
- **Suggested ticket:** conductor/tracks/dogfood_2026-06-17/tickets/stale-data-label.md
### 2. state_contradiction (E68): RAG spinner stuck after task complete
- **Evidence:** E68 (1 occurrence)
- **Pattern:** appears after cancelled rebuild
- **Likely root cause:** spinner state not reset on cancel path
- **Confidence:** high (the contradiction is visible in a single frame)
### 3. wrong_label (E73): Clear Workspace modal labels destructive action as [Save]
- **Evidence:** E73 (1 occurrence)
- **Pattern:** button label reused from a different modal
- **Likely root cause:** label hardcoded instead of parameterized by modal context
- **Confidence:** high (text is unambiguous)
### 4. hidden_event (E94-E98): UI idle while the session log records 3 worker crashes
- **Evidence:** E94-E98 + session log correlation
- **Pattern:** UI render thread not subscribed to worker event channel
- **Likely root cause:** missing event subscription in render_comm_history
- **Confidence:** high (3 corroborating signals: no pixel change + log shows events + Comm History panel stale)
```
A user reading this in 60 seconds would say: "ship 3 and 4, dig into 1 more, and skip 2 — I'll re-test the RAG spinner manually." That's the loop working.
---
## 4. Manual Triage Workflow (what to do now)
For the current 3-hour dogfood:
1. **Stage 0:** Run the re-encode command. Confirm `dogfood_proxy.mp4` exists, is ~1-2 GB, plays in any player.
2. **Stages 1-2:** Run the keyframe extraction (once the tool exists — this is the deferred work). Output ~200-500 keyframes into `out/frames/`.
3. **Stage 3:** Open the proxy at 4× speed in VLC or mpv. When something looks off, step frame-by-frame (in mpv, `,` steps back and `.` steps forward; in VLC, `e` steps forward). For each event:
- Hit a bookmark shortcut (e.g., `b` in mpv with a config line) to record the timestamp.
- When you stop, write a DSL entry for each bookmark using the format in §3.2 above — the visual block uses the existing grammar ([`docs/guide_ascii_layout_map.md`](../guide_ascii_layout_map.md)); only the header line, `@delta`, and `@ux_finding` blocks are new.
- Entries with `@ux_finding severity>=medium` are mandatory. Entries below that threshold are nice-to-have.
4. **Stage 4:** Run the aggregator. Get the ticket list.
5. **Commit:** `triage.dsl` goes into `docs/dogfood_<date>/triage.dsl`. Tickets go into the conductor track.
The **time budget** for Stage 3: a 3-hour video at 4× speed is 45 minutes of playback. Writing ~30 DSL entries (one per material finding) at 1 minute each is another 30 minutes. Total: ~75 minutes of triage for a 3-hour session, so the session is 2.4× longer than its triage (180 / 75). That is significantly better than the current "I watched it and have feelings" outcome. The 1-minute-per-entry estimate assumes the user is already familiar with the existing visual grammar from prior design work; first-time users should budget about +30 minutes, covering a 5-minute skim of `docs/guide_ascii_layout_map.md §2` plus slower entry writing while the grammar is still unfamiliar.
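The same arithmetic as a small function, in case the inputs change (a longer session, a different playback speed, more entries). The function name and argument names are illustrative only.

```python
def triage_budget(video_hours, playback_speed, n_entries, minutes_per_entry):
    """Return (playback_min, writing_min, total_min, video-to-triage ratio)."""
    video_min = video_hours * 60
    playback_min = video_min / playback_speed
    writing_min = n_entries * minutes_per_entry
    total_min = playback_min + writing_min
    return playback_min, writing_min, total_min, video_min / total_min
```

With the figures above, `triage_budget(3, 4, 30, 1)` gives 45 minutes of playback, 30 minutes of writing, 75 minutes total, and a ratio of 2.4.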
---
## 5. When to Build the Pipeline Tool (future track)
The manual workflow above is the **MVP**. It produces the DSL format, which is itself the deliverable that justifies the rest of the pipeline. Build the tool when at least **two** of the following are true:
1. You've done ≥3 manual dogfoods using the DSL and the manual step feels redundant.
2. You have ≥2 hours of dogfood per week where manual triage is the bottleneck.
3. The DSL grammar has stabilized (you've stopped adding fields).
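Read as a gate, the rule is a count over three booleans. The sketch below assumes "two" means "at least two" and reduces each criterion to its measurable part; the subjective halves ("feels redundant", "is the bottleneck") stay a human judgment. The function name is hypothetical.

```python
def should_build_tool(manual_dogfoods, weekly_dogfood_hours, grammar_stable):
    """True when at least two of the three build criteria hold."""
    criteria = [
        manual_dogfoods >= 3,        # criterion 1: enough manual runs
        weekly_dogfood_hours >= 2,   # criterion 2: enough weekly volume
        bool(grammar_stable),        # criterion 3: no new DSL fields lately
    ]
    return sum(criteria) >= 2
```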
When the tool gets built:
- **Scope:** `scripts/dogfood_extract.py` + `tests/test_dogfood_extract.py`. ~150 LOC + tests.
- **Interface:** `python -m scripts.dogfood_extract --video dogfood_proxy.mp4 --out out/ [--threshold 0.12] [--include-pixel-diff]`.
- **Output:** keyframe PNGs + `palette_timeline.json` + `keyframe_index.csv`.
- **DSL generation:** out of scope for v1. The tool produces frames; humans still write DSL.
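The interface line above maps directly onto `argparse`. This is a sketch of the argument surface only, not the tool: the help strings are guesses, and treating `0.12` as the default threshold (rather than just an example value) is an assumption.

```python
import argparse


def build_parser():
    """Argument parser mirroring the proposed dogfood_extract interface."""
    parser = argparse.ArgumentParser(prog="dogfood_extract")
    parser.add_argument("--video", required=True, help="path to the proxy video")
    parser.add_argument("--out", required=True, help="output directory")
    parser.add_argument(
        "--threshold", type=float, default=0.12,
        help="palette-change threshold for emitting a keyframe (assumed default)",
    )
    parser.add_argument(
        "--include-pixel-diff", action="store_true",
        help="also run the pixel-diff backup detector",
    )
    return parser
```

Keeping the parser in its own function lets `tests/test_dogfood_extract.py` exercise the flags without touching a video file.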
Stage 5 (LLM/diffusion pass) is a **separate** future track, gated on the DSL being proven via manual use.
---
## 6. Cross-References
### Existing DSL and workflow (the visual layer + workflow this report reuses)
| Source | Relevance |
|---|---|
| [`docs/guide_ascii_layout_map.md`](../guide_ascii_layout_map.md) | The canonical ASCII UI Layout Map DSL. Defines the visual grammar (window frames, buttons, combos, sliders, panels, zooms, grid overlays, state annotations, SSDL primitives) that this report's triage overlay reuses unchanged. |
| [`docs/guide_ssdl.md`](../guide_ssdl.md) | Spec/Sketch Description Language — the operational companion to the ASCII Layout Map DSL. The 6 computational shapes + the `[Q:] [B:] [S:] [I:] [N:]` primitives appear in ASCII sketches as inline annotations. |
| [`docs/reports/ascii_sketch_ux_workflow_20260608.md`](../reports/ascii_sketch_ux_workflow_20260608.md) | The 5-step collaborative design workflow + 10-element vocabulary that the user has already adopted for *forward* design. The triage workflow in §4 above mirrors this workflow's structure (boundary → sketch → iterate → lock) but for *retrospective* observation. |
### Pipeline technical references
| Source | Relevance |
|---|---|
| `C:\projects\kasa\kasa_cinematic_bulbs.py:50-72` | The exact LAB-palette extraction algorithm this pipeline's Stage 1 is based on. The kasa code is live-screen-capture; this pipeline is video-frame, but the downsample-and-K-means-on-LAB core is identical. |
| `C:\projects\kasa\kasa_test.py:83-98` | Earlier variant of the palette extractor using RGB instead of LAB. Euclidean distance in LAB tracks perceived color difference more closely than distance in RGB, so LAB is the known upgrade. |
| `docs/guide_gui_2.md` | The Application's UI surface. The DSL's `[Zoom: …]` names should match the actual panel registry in `gui_2.py` so cross-references resolve. |
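For readers without access to the kasa sources, the LAB distance that the palette comparison relies on can be sketched with the standard library alone. This is not the kasa code: it is the textbook sRGB to CIELAB conversion (D65 white point) followed by the plain Euclidean distance (ΔE76), and all function names are illustrative. The K-means palette step is not shown.

```python
import math

# D65 reference white, the usual pairing for sRGB.
_WHITE = (0.95047, 1.0, 1.08883)


def _linearize(channel_8bit):
    """Undo the sRGB transfer curve for one 0-255 channel."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _f(t):
    """CIELAB companding function."""
    delta = 6 / 29
    return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29


def srgb_to_lab(rgb):
    """Convert an (R, G, B) tuple of 0-255 values to (L*, a*, b*)."""
    r, g, b = (_linearize(c) for c in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b
    fx, fy, fz = (_f(v / w) for v, w in zip((x, y, z), _WHITE))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e76(rgb_a, rgb_b):
    """Euclidean distance between two colors in LAB space."""
    return math.dist(srgb_to_lab(rgb_a), srgb_to_lab(rgb_b))
```

A frame-change detector would compare palette colors between consecutive frames with `delta_e76` and emit a keyframe when the distance crosses the chosen threshold; how that threshold relates to the tool's `--threshold` scale is left open here.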
### Project conventions
| Source | Relevance |
|---|---|
| `docs/guide_architecture.md` | The Application's thread model. Useful for Stage 3 triage: knowing which thread owns which UI region explains some "stale state" findings (status bar is updated by the render thread, not the worker thread — if the render thread is busy, the status bar can lag). |
| `conductor/code_styleguides/agent_memory_dimensions.md` | The 4-dim model. This ideation lives in the **knowledge** dimension (per-project durable, provenance-aware, user-editable). The DSL files are the artifacts; the digest of past findings is the projection. |
| `conductor/code_styleguides/feature_flags.md` | Stage 5a/b/c are feature-flag candidates. Each is "off by default in new projects; turned on per-dogfood." File-presence or config-flag pattern, not CLI. |
| `docs/reports/test_infrastructure_hardening_batch_green_20260610.md` | Reminder of the "isolated-pass fallacy." When the pipeline tool exists, run it on multiple dogfoods in batch before declaring it correct. |
---
## 7. Open Questions
1. **Where does `triage.dsl` live?** Per-dogfood (`docs/dogfood_<date>/triage.dsl`) is simplest. Per-project (aggregated) is more powerful but adds a write-path. Lean toward per-dogfood for v1; aggregate lazily.
2. **What's the schema for `@severity`?** `low | medium | high | critical` mirrors the conductor ticket convention. Confirm.
3. **What's the schema for `@category`?** Free-form string for v1, but should converge on a controlled vocabulary (`stale_state`, `missing_element`, `wrong_label`, `layout_overflow`, `focus_loss`, `modal_stack`, `color_state`, ...). Defer.
4. **What about non-UI regressions** (e.g., AI provider timeout, MMA worker crash)? These show up in `Comm History` / `Diagnostics` panels — they ARE in the DSL's UI surface. But raw application logs (`logs/sessions/`) may have richer signals. Hybrid: DSL for UI-visible state; raw logs as a separate annotation stream.
5. **The 80 GB video — keep or discard?** After proxy generation, the raw file is redundant for UX eval. Keep one dogfood's raw for archival; re-encode going forward.
6. **Should the meta layer be merged into `guide_ascii_layout_map.md`?** Currently this report defines the meta layer separately. Once stabilized (after ≥3 manual dogfoods), the natural home is a new section §8 "Triage Overlay" appended to the canonical guide. Alternative: keep it as a separate `docs/guide_ascii_layout_map_triage.md` to preserve the canonical guide's "design-only" scope. Lean: merge, after stabilization.
7. **Does the `[State: ...]` annotation need a new prefix for "observed" vs "design" state?** Currently reusing the existing prefix, repurposed. Risk: a future reader of `guide_ascii_layout_map.md §4.3` may assume all `[State: ...]` lines are design-time, not observed. Mitigation: when the canonical guide is revised (see question 6), add a sentence there: "this annotation is also used in retrospective triage; see `docs/ideation/ed_video_ux_eval_pipeline_20260617.md` §3.2."
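Questions 2 and 3 could be pinned down with a small validator once the vocabularies are confirmed. The sketch below is provisional: the severity tuple is the proposal from question 2, the category set is only the examples listed in question 3, and the function names are hypothetical. Category checking is off by default to match the "free-form string for v1" stance.

```python
SEVERITIES = ("low", "medium", "high", "critical")

# Candidate controlled vocabulary, taken from the examples in question 3.
CATEGORIES = {
    "stale_state", "missing_element", "wrong_label", "layout_overflow",
    "focus_loss", "modal_stack", "color_state",
}


def severity_at_least(severity, floor):
    """True when `severity` ranks at or above `floor` in SEVERITIES."""
    return SEVERITIES.index(severity) >= SEVERITIES.index(floor)


def validate_finding(severity, category, strict_categories=False):
    """Return a list of problems; an empty list means the finding is well-formed."""
    problems = []
    if severity not in SEVERITIES:
        problems.append(f"unknown severity: {severity!r}")
    if strict_categories and category not in CATEGORIES:
        problems.append(f"category not in vocabulary: {category!r}")
    return problems
```

`severity_at_least(sev, "medium")` is the same test as the "mandatory entry" rule in §4, so one ordering serves both the triage step and the report.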
---
## 8. The One-Sentence Version
If I had to summarize this for someone in 30 seconds: *"Watch the video, write a structured text log of what changed when (the DSL), turn that into tickets; eventually teach an LLM to write the DSL for you, but the DSL is the canonical artifact either way."*
---
*End of ideation archive. Next step: user approves the DSL shape (or revises §3.2-§3.4), then either (a) does a manual dogfood triage as the first instance, or (b) defers to a future track.*