
Technical Deep Dive: Paths & Nuances

This document explores the low-level technical execution paths and implementation nuances of the 4-Tier Hierarchical Multi-Model Architecture.

1. Execution Paths

The architecture distinguishes between different "paths" to optimize for latency, cost, and accuracy.

A. The Fast Path (Reactive)

  • Trigger: Low-complexity intents (e.g., "Hello", "What is the current time?", "Status check").
  • Flow: User -> Tier 1 -> User.
  • Nuance: Tier 1 identifies that no specialized knowledge (Tier 3) or tool execution (Tier 2) is required. It responds directly using its internal weights or a local cache.
  • Goal: Sub-100ms response time.
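The Fast Path routing decision can be sketched as follows. This is a minimal illustration, not the real implementation: the `classify_intent` keyword matcher stands in for Tier 1's actual intent model, and the intent labels are assumptions.

```python
# Hypothetical Fast Path router. The toy keyword classifier below
# stands in for Tier 1's real intent model; labels are assumptions.
FAST_INTENTS = {"greeting", "time_query", "status_check"}

def classify_intent(message: str) -> str:
    """Toy classifier standing in for Tier 1's intent model."""
    words = set(message.lower().rstrip("?!.").split())
    if words & {"hello", "hi", "hey"}:
        return "greeting"
    if "time" in message.lower():
        return "time_query"
    if "status" in words:
        return "status_check"
    return "complex_task"

def route(message: str) -> str:
    intent = classify_intent(message)
    if intent in FAST_INTENTS:
        # Fast Path: Tier 1 answers directly from its weights/cache.
        return f"[tier1:{intent}] handled locally"
    # Slow Path: escalate to the Tier 2 specialist.
    return f"[tier2] escalated intent '{intent}'"
```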

B. The Slow Path (Reflective / Agentic)

  • Trigger: Complex tasks (e.g., "Fix the bug in the UI layout", "Refactor the ai_client.py").
  • Flow: User -> Tier 1 (Intent) -> Tier 2 (Specialist) -> Tier 3 (Context/RAG) -> Tier 2 (Execution) -> Tier 1 (Synthesis) -> User.
  • Nuance: This involves high-latency operations, including tool calls and codebase searches. Tier 1 acts as a supervisor, potentially looping back to Tier 2 if the initial output is insufficient.

C. The Governance Path (Tier 4 Integration)

  • Trigger: Any operation that modifies the system or presents a high-risk answer.
  • Flow: (Parallel or Post-hoc) Tier 1/2 Output -> Tier 4 (Validation) -> User/Log.
  • Nuance: Tier 4 runs an "LLM-as-a-judge" or a static analysis tool (like ruff or mypy) on the output. If validation fails, the system may automatically trigger a "re-plan" in Tier 1.

2. Context & Token Management

A critical nuance is how the limited context window (token budget) is managed across tiers.

A. Token Budgeting

  • Tier 1 (Global Context): Holds the conversation history and high-level project metadata. Budget: ~20% of window.
  • Tier 2 (Local Context): Receives a "surgical" injection of relevant files/data from Tier 3. Budget: ~60% of window.
  • Output Space: Reserved for generating large code blocks or summaries. Budget: ~20% of window.

B. Context Folding (The "Accordion" Effect)

To prevent context overflow, the system "folds" (summarizes) older parts of the conversation.

  • Recent History: Full fidelity.
  • Mid-term History: Summarized by Tier 1.
  • Long-term History: Archived in Tier 3 (searchable but not in-context).

3. Communication Protocols

  • Inter-Tier Format: Strictly structured JSON (e.g., OpenAI Tool Call format or Google GenAI Function Call).
  • Streaming: On the Slow Path, Tier 1 typically streams its "thinking" process to give the user immediate feedback while Tier 2 is still working.
  • Handshake: Tier 2 must acknowledge receipt of context from Tier 3 with a "Digest" hash to ensure data integrity.
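The digest handshake can be sketched with a SHA-256 checksum over the injected payload. The message schema and field names here are illustrative assumptions, not a real wire protocol.

```python
# Sketch of the Tier 3 -> Tier 2 handshake. The message schema and
# field names are illustrative assumptions, not a real protocol.
import hashlib

def package_context(payload: str) -> dict:
    """Tier 3 wraps retrieved context with an integrity digest."""
    digest = hashlib.sha256(payload.encode()).hexdigest()
    return {"type": "context_injection", "payload": payload, "digest": digest}

def acknowledge(message: dict) -> dict:
    """Tier 2 recomputes the digest to confirm the context arrived intact."""
    recomputed = hashlib.sha256(message["payload"].encode()).hexdigest()
    return {
        "type": "ack",
        "digest": message["digest"],
        "verified": recomputed == message["digest"],
    }

msg = package_context('{"file": "ai_client.py", "lines": "1-40"}')
ack = acknowledge(msg)
```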

4. Nuances vs. Standard RAG

| Feature   | Standard RAG                | MMA (4-Tier)                                   |
| --------- | --------------------------- | ---------------------------------------------- |
| Logic     | Flat (Query -> Doc -> Result) | Hierarchical (Intent -> Route -> Expert -> Doc) |
| Expertise | Homogeneous                 | Heterogeneous (different models per tier)      |
| Feedback  | Manual                      | Automated (Tier 4 closed loop)                 |
| State     | Stateless or simple session | Multi-layered (orchestrator vs. specialist state) |