# Technical Deep Dive: Paths & Nuances

This document explores the low-level technical execution paths and implementation nuances of the 4-Tier Hierarchical Multi-Model Architecture.

## 1. Execution Paths

The architecture distinguishes between different "paths" to optimize for latency, cost, and accuracy.

### A. The Fast Path (Reactive)

* **Trigger:** Low-complexity intents (e.g., "Hello", "What is the current time?", "Status check").
* **Flow:** User -> Tier 1 -> User.
* **Nuance:** Tier 1 identifies that no specialized knowledge (Tier 3) or tool execution (Tier 2) is required. It responds directly using its internal weights or a local cache.
* **Goal:** Sub-100ms response time.

### B. The Slow Path (Reflective / Agentic)

* **Trigger:** Complex tasks (e.g., "Fix the bug in the UI layout", "Refactor `ai_client.py`").
* **Flow:** User -> Tier 1 (Intent) -> Tier 2 (Specialist) -> Tier 3 (Context/RAG) -> Tier 2 (Execution) -> Tier 1 (Synthesis) -> User.
* **Nuance:** This path involves high-latency operations, including tool calls and codebase searches. Tier 1 acts as a supervisor, potentially looping back to Tier 2 if the initial output is insufficient.

### C. The Governance Path (Tier 4 Integration)

* **Trigger:** Any operation that modifies the system or produces a high-risk answer.
* **Flow:** (Parallel or Post-hoc) Tier 1/2 Output -> Tier 4 (Validation) -> User/Log.
* **Nuance:** Tier 4 runs an "LLM-as-a-judge" or a static analysis tool (such as `ruff` or `mypy`) on the output. If validation fails, the system may automatically trigger a "re-plan" in Tier 1.

---

## 2. Context & Token Management

A critical nuance is how the limited context window (token budget) is managed across tiers.

### A. Token Budgeting

* **Tier 1 (Global Context):** Holds the conversation history and high-level project metadata. Budget: ~20% of window.
* **Tier 2 (Local Context):** Receives a "surgical" injection of relevant files/data from Tier 3. Budget: ~60% of window.
* **Output Space:** Reserved for generating large code blocks or summaries. Budget: ~20% of window.

### B. Context Folding (The "Accordion" Effect)

To prevent context overflow, the system "folds" (summarizes) older parts of the conversation.

* **Recent History:** Full fidelity.
* **Mid-term History:** Summarized by Tier 1.
* **Long-term History:** Archived in Tier 3 (searchable but not in-context).

---

## 3. Communication Protocols

* **Inter-Tier Format:** Strictly structured JSON (e.g., the OpenAI Tool Call format or the Google GenAI Function Call format).
* **Streaming:** Tier 1 typically streams its "thinking" process (Slow Path) to provide the user with immediate feedback while Tier 2 is still working.
* **Handshake:** Tier 2 must acknowledge receipt of context from Tier 3 with a "Digest" hash to ensure data integrity.

---

## 4. Nuances vs. Standard RAG

| Feature | Standard RAG | MMA (4-Tier) |
| :--- | :--- | :--- |
| **Logic** | Flat (Query -> Doc -> Result) | Hierarchical (Intent -> Route -> Expert -> Doc) |
| **Expertise** | Homogeneous | Heterogeneous (different models for different tiers) |
| **Feedback** | Manual | Automated (Tier 4 closed loop) |
| **State** | Stateless or simple session | Multi-layered state (orchestrator vs. specialist state) |
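---

## Appendix: Illustrative Sketches

The Fast Path / Slow Path split in Section 1 can be sketched as a Tier 1 routing decision. This is a minimal illustration only: a real Tier 1 would classify intent with a model, not the keyword heuristic below, and the tier-hop labels are assumptions, not part of the architecture spec.

```python
from dataclasses import dataclass

# Hypothetical trigger words; a real Tier 1 model classifies intent
# with its own weights rather than a keyword match.
SLOW_PATH_KEYWORDS = {"fix", "refactor", "implement", "debug"}

@dataclass
class RoutingDecision:
    path: str          # "fast" or "slow"
    tiers: list[str]   # ordered tier hops for this request

def route(user_message: str) -> RoutingDecision:
    """Illustrative Tier 1 router: low-complexity requests take the
    Fast Path (Tier 1 answers directly); complex ones take the Slow
    Path through the specialist and context tiers."""
    words = {w.strip(".,!?").lower() for w in user_message.split()}
    if words & SLOW_PATH_KEYWORDS:
        return RoutingDecision(
            path="slow",
            tiers=["T1:intent", "T2:specialist", "T3:context",
                   "T2:execution", "T1:synthesis"],
        )
    return RoutingDecision(path="fast", tiers=["T1:direct"])
```

For example, `route("Refactor ai_client.py")` resolves to the Slow Path, while `route("Hello")` stays on the Fast Path.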
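The "accordion" fold from Section 2.B can be sketched as follows. The function name and the placeholder summary string are assumptions for illustration; in the architecture described above, Tier 1 would generate the actual summary text.

```python
def fold_history(messages: list[str], keep_recent: int = 4) -> list[str]:
    """Illustrative context fold: keep the most recent turns at full
    fidelity and collapse older turns into a single summary slot.
    (A real system would have Tier 1 write the summary and would
    archive the originals in Tier 3.)"""
    if len(messages) <= keep_recent:
        return list(messages)
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = f"[summary of {len(older)} earlier turns]"
    return [summary] + recent
```

Tuning `keep_recent` trades recall of verbatim history against the token budget reserved for Tier 2's local context.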
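The Tier 3 -> Tier 2 "Digest" handshake from Section 3 can be sketched as a content-digest check. The packet shape and function names are hypothetical; the source specifies only that Tier 2 acknowledges context with a digest hash.

```python
import hashlib
import json

def make_context_packet(payload: dict) -> dict:
    """Tier 3 side: serialize the context canonically and attach a
    SHA-256 digest of the serialized body."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return {"body": body, "digest": digest}

def acknowledge(packet: dict) -> bool:
    """Tier 2 side: recompute the digest before using the context;
    a mismatch means the packet was truncated or corrupted in transit."""
    expected = hashlib.sha256(packet["body"].encode("utf-8")).hexdigest()
    return expected == packet["digest"]
```

Canonical serialization (`sort_keys=True`, fixed separators) matters here: both tiers must hash byte-identical bodies for the handshake to succeed.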