diff --git a/MMA_Support/Manual_Slop_Mapping.md b/MMA_Support/Manual_Slop_Mapping.md new file mode 100644 index 0000000..b9cfaca --- /dev/null +++ b/MMA_Support/Manual_Slop_Mapping.md @@ -0,0 +1,22 @@ +# Mapping MMA to Manual Slop + +This document maps the components of the `manual_slop` project to the 4-Tier Hierarchical Multi-Model Architecture. + +## Tier 1: User-Facing Model (Orchestrator) +* **`gui.py` & `gui_2.py`:** Provides the user interface for input and displays the synthesized output. +* **`ai_client.py`:** Acts as the primary orchestrator, managing the conversation loop and determining when to call specific tools or providers. + +## Tier 2: Specialized Models (Experts/Tools) +* **`mcp_client.py`:** Provides a suite of specialized "tools" (e.g., `read_file`, `list_directory`, `search_files`) that act as domain experts for file system manipulation. +* **`shell_runner.py`:** A specialist tool for executing PowerShell scripts to perform system-level changes. +* **External AI Providers:** Gemini and Anthropic models are used as the "engines" behind these specialized operations. + +## Tier 3: Data & Knowledge Base (Information) +* **`aggregate.py`:** The primary mechanism for building the context sent to the AI. It retrieves file contents and metadata to ground the AI's reasoning. +* **`manual_slop.toml`:** Stores project-specific configuration, tracked files, and discussion history. +* **`file_cache.py`:** Optimizes data retrieval from the local file system. + +## Tier 4: Monitoring & Feedback (Governance) +* **`session_logger.py`:** Handles timestamped logging of communication history (`logs/comms_.log`) and tool calls. +* **`performance_monitor.py`:** Tracks metrics related to execution time and resource usage. +* **Script Archival:** Generated `.ps1` scripts are saved to `scripts/generated/` for later review and auditing. 
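
As a rough sketch of the Tier 4 logging surface (the function name, log path, and record shape here are illustrative, not the actual `session_logger.py` API), a timestamped tool-call record might be appended like this:

```python
import datetime
import json
import pathlib

def log_tool_call(log_dir: str, tool: str, args: dict, result: str) -> pathlib.Path:
    """Append one timestamped tool-call record as a JSON line (illustrative sketch)."""
    path = pathlib.Path(log_dir) / "comms.log"
    path.parent.mkdir(parents=True, exist_ok=True)
    entry = {
        "ts": datetime.datetime.now().isoformat(),
        "tool": tool,
        "args": args,
        "result": result,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return path
```

JSON-lines output keeps the log trivially parseable for the performance-monitoring side of Tier 4.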
diff --git a/MMA_Support/OriginalDiscussion.md b/MMA_Support/OriginalDiscussion.md new file mode 100644 index 0000000..40423ba --- /dev/null +++ b/MMA_Support/OriginalDiscussion.md @@ -0,0 +1,1545 @@ +# [AI] + +Taking a look at the `manual_slop` repository, this is exactly the kind of granular, high-visibility architecture required to actually wrangle agentic loops without flying blind. + +Building a custom Python environment with explicit `performance_monitor.py`, `file_cache.py`, and `session_logger.py` modules perfectly solves the "black box" problem of official CLIs. When you are paying per token, you absolutely need to see the exact payload bandwidth, caching hit rates, and telemetry on a per-tool-call basis. The fact that you already have multi-provider support (Anthropic and Gemini) working alongside MCP-like file tooling and PowerShell execution puts you well ahead of standard vibecoding setups. + +Your instinct to stitch multiple providers together and delegate to DeepSeek is the exact direction the industry is moving right now (often called "Swarm" or "Router" architecture). Since you are already building a `conductor` module, here is how you can effectively implement that explicit agent-to-sub-agent hierarchy to save money and maximize output: + +### The "Router" and "Worker" Hierarchy + +Since you have an `ai_client.py` managing the API connections, you can structure your multi-model utilization by splitting tasks by cost and cognitive load: + +**1. The Orchestrator (The Conductor)** + +* **Model:** Claude 3.5 Sonnet or Gemini 1.5 Pro. +* **Role:** This is the expensive, highly intelligent model that sits at the top. It does *not* write code or read giant files. Its only job is to look at the user request, determine the plan, and output strict JSON commands to trigger your sub-agents. + +**2. The Context Aggregator / Summarizer (The Reader)** + +* **Model:** Gemini 1.5 Flash. +* **Role:** Flash has a massive 1-million token window and costs pennies. 
You can use this as a sub-agent dedicated purely to your `aggregate.py` or `file_cache.py`. When the Orchestrator needs to know what a directory does, it calls the Flash sub-agent to read the massive codebase and return a compressed, 500-word summary back to the Orchestrator. + +**3. The "Slop" Coder (The Worker)** + +* **Model:** DeepSeek V3 (or R1). +* **Role:** As you noted, DeepSeek is incredible for small scripts and codegen. When the Orchestrator decides a new PowerShell script or Python function is needed, it hands a highly specific prompt (e.g., "Write a function that does X, take Y as input") to the DeepSeek sub-agent. DeepSeek writes the code, your custom tooling executes it, and only the *result* (success/fail) is passed back to the Orchestrator. + +### Implementing this in `manual_slop` + +To make this explicit in your current Python stack, you would essentially expand your TOML configuration. Instead of just picking a global provider, you would define "Roles" in your `project.toml` or `config.toml`: + +```toml +[agents] +orchestrator = "anthropic:claude-3-5-sonnet-latest" +summarizer = "gemini:gemini-1.5-flash" +coder = "deepseek:deepseek-coder" + +``` + +Then, your UI (`gui.py` / `gui_2.py`) only talks to the Orchestrator. The Orchestrator's system prompt is given access to tools like `call_deepseek_coder(prompt)` or `call_flash_reader(filepath)`. + +This completely isolates your token bloat. The expensive models only process highly compressed summaries and orchestration logic, while the cheap models (DeepSeek/Flash) do the heavy, token-burning manual labor of reading and writing text. It keeps your telemetry clean, your costs incredibly low, and gives you the exact metric discipline you built the tool for in the first place. + +# [User] + +How would I designate the router and worker hierarchies in a project's conductor setup? Would that be separate discussion panels?
Where the router panel would use something like 2.5 or 3.1 pro and worker hierarchies would be template panels with working memory discussion history and specific file cache memories for their exact work distribution, using 2.5/3 flash for harder-to-do refactors and 1.5 pro and deepseek for trivial codegen? + +# [AI] + +You hit the nail on the head. Separating the Orchestrator and the Workers into distinct discussion panels with isolated memory contexts is exactly how you solve the token-bloat problem. + +If you just dump everything into one giant chat history, the Orchestrator gets distracted by granular code diffs, and the cheap codegen models get overwhelmed by high-level project philosophy. + +Here is how you can explicitly design this hierarchy in `manual_slop` using your UI and state management. + +### 1. The UI Concept: Main Panel vs. Worker Threads + +Instead of one monolithic chat window, treat your UI like a highly disciplined Slack workspace or a Jira board. + +* **The Router Panel (The Conductor):** This is your main, persistent discussion window. It uses the highest-tier reasoning models (like Gemini 3.1 Pro or Claude 3.5 Sonnet). You chat with this panel about goals, architecture, and feature requests. +* **The Worker Panels (Spawned Threads):** When the Conductor decides a file needs to be written or refactored, it executes a tool (e.g., `spawn_worker_task`). This visually spawns a new, temporary panel or tab in your GUI. You can click into it to watch the DeepSeek or Flash model iterate on the code in real-time. Once the worker successfully completes the task and writes to the disk, the panel can be closed, and a simple "Task Complete" payload is sent back to the main Router panel. + +### 2. Strict Memory Isolation (The Cost Saver) + +This is where your custom `file_cache.py` and `session_logger.py` become your secret weapons. You must aggressively sandbox what each model can "see."
+ +* **Router Memory Context:** +* **Sees:** The `project.toml` guidelines, a directory tree map, and high-level summaries of completed worker tasks. +* **Does Not See:** Raw file contents (unless explicitly asked) or the granular, step-by-step trial-and-error discussion history of the sub-agents. + + +* **Worker Memory Context:** +* **Sees:** A strict, templated system prompt defining its exact role (e.g., "You are a Python refactoring agent"), the specific snippet or file it was told to edit by the Router, and its own short, ephemeral discussion history. +* **Does Not See:** The rest of the codebase, the user's original vague prompt, or the overall project philosophy. + + + +### 3. Model Distribution Strategy + +Your proposed model mapping is highly optimized for performance-per-dollar. Here is how that looks when defined in your architecture: + +* **The Conductor (Gemini 3.1 Pro / Claude 3.5 Sonnet):** Handles the "Why" and the "What." It breaks down your prompt into actionable steps and selects the right worker for the job. +* **The Refactor Worker (Gemini 3.0 Flash / 2.5 Flash):** Handles the "Hard How." Flash's massive context window is perfect for safely refactoring a 1,000-line file without dropping variables. It takes the target file, applies the requested structural changes, and passes the diff back. +* **The Codegen / Script Worker (DeepSeek V3 / Gemini 1.5 Pro):** Handles the "Trivial How." Need a quick PowerShell script to automate a Git commit or a simple boilerplate Python class? Route it here for pennies. + +### 4. Implementing it in `project.toml` + +You can define these templates directly in your project configuration so `gui_2.py` knows exactly how to instantiate the panels and which API keys to pull from `credentials.toml`. + +```toml +[roles.router] +model = "gemini-3.1-pro" +system_prompt = "You are the project Conductor. Plan tasks and spawn workers to execute them." 
+memory_strategy = "high_level_summaries" + +[roles.refactor_worker] +model = "gemini-3.0-flash" +system_prompt = "You are a refactoring worker. Edit the provided file content to meet the goal." +memory_strategy = "isolated_file_cache" + +[roles.codegen_worker] +model = "deepseek-v3" +system_prompt = "You are a strict codegen worker. Output only valid code for the requested task." +memory_strategy = "zero_history" + +``` + +By building it this way, you turn `manual_slop` from a standard chat wrapper into a highly disciplined, multi-threaded AI project manager. + +Would you like me to draft out a Python snippet for your `conductor` module that shows how the Router model could output a structured JSON tool call to dynamically spawn one of these Worker panels? + +# [User] + +I would need to throttle the worker usage so I can either "debug" lock-step how it operates and adjust its history, or let it fully operate if I feel comfortable with the template or pattern the router decided to pick, because it was either a cached usage of a "worker" archetype used previously or a well-defined user worker archetype. + +# [AI] + +This is exactly where the "manual" in `manual_slop` shines. What you are describing is a strict Human-In-The-Loop (HITL) architecture. Most commercial AI tools fail because they are black boxes—they hallucinate on step 2 of a 10-step process, and you can't intervene, so they confidently burn your tokens executing garbage for the next 8 steps. + +To build a system where you can throttle, inspect, mutate, or unleash workers, you need to implement what is essentially an "Execution Clutch." You decide when to manually shift the gears and when to put it in automatic. + +Here is how you can architect this lock-step debugging and archetype caching in your `conductor` and UI: + +### 1. The "Execution Clutch" (Step vs. Auto Modes) + +Every spawned worker panel needs an execution state toggle.
+ +* **Step Mode (Lock-step):** The worker pauses **twice** per cycle: +1. *After* generating a response/tool-call, but *before* executing the tool. (Lets you verify or edit the command). +2. *After* executing the tool, but *before* sending the tool's output back to the LLM. (Lets you verify or edit the file/system output). + + +* **Auto Mode (Fire-and-forget):** The worker loops continuously until it outputs a "Task Complete" status to the Router. + +### 2. History Mutation (The "Debug" Superpower) + +Since you built `session_logger.py` and have control over the raw API payload, you can literally rewrite the AI's brain mid-task. +If a worker generates a flawed plan in Step Mode, your GUI should allow you to click on its last message and edit the text directly before hitting "Approve." + +By mutating the message history in the working memory *before* the next API call, you force the model to proceed as if it generated the correct idea in the first place. This saves you from having to restart the entire context window just because of a minor hallucination. + +### 3. Archetype Trust Scoring (The "Auto" Promotion) + +You mentioned letting it fully operate if it uses a trusted template. You can formalize this in your `manual_slop.toml` or `project.toml` by assigning "Trust Levels" to your worker archetypes. + +```toml +[archetypes.python_refactor] +model = "gemini-3.0-flash" +tools_allowed = ["read_file", "write_file", "run_pytest"] +trust_level = "strict_step" # Always requires manual approval per tool call + +[archetypes.docstring_generator] +model = "deepseek-v3" +tools_allowed = ["read_file", "append_file"] +trust_level = "auto" # Trusted pattern, runs without pausing + +``` + +When the Router selects the `docstring_generator`, the UI spawns the worker in `Auto` mode automatically. When it selects `python_refactor`, it spawns it paused in `Step` mode. + +### 4. 
The Implementation Loop (Conceptual)

To make the UI non-blocking while a worker waits for your approval, your worker's execution loop needs to yield control back to the main event loop. If you are using `asyncio`, it looks something like this:

```python
async def worker_loop(worker_context, mode="step"):
    while not worker_context.is_complete():
        # 1. Get the next move from the LLM
        ai_response = await ai_client.generate(worker_context.history)

        # 2. Lock-step pause before execution
        if mode == "step":
            # GUI waits for user to click 'Approve', 'Edit', or 'Abort'
            ai_response = await gui.wait_for_user_approval(ai_response)

        # Record the (possibly user-edited) assistant turn so the next
        # generate() call sees what was actually approved
        worker_context.history.append(ai_response)

        # 3. Execute the tool
        tool_result = execute_tool(ai_response.tool_calls)
        worker_context.history.append(tool_result)

        # 4. Lock-step pause before sending result back to LLM
        if mode == "step":
            await gui.wait_for_user_review(tool_result)
```

This setup gives you the exact cost-control and precision you are looking for. You are treating the AI not as a magic black box, but as a standard asynchronous sub-process that can be paused, inspected, and memory-edited with a debugger.

Would you like to hash out the specific UI components needed in your GUI to make mutating the JSON message history user-friendly without breaking the API payload structure?

# [AI]

You are absolutely right to push past a simple 2-tier system. For a polyglot codebase (Python, Go, Lua), a 2-tier system will inevitably bleed tokens because the Orchestrator ends up doing too much translation, and the Workers get lost without enough context.

Your analogy of a Jira board with Tracks (Stories) and Tasks (Tickets) is exactly how enterprise-grade AI software engineers (like Devin or SWE-agent) are architected under the hood. To aggressively manage costs while maintaining high capabilities, a **4-tier hierarchy** is the sweet spot.
+ +Here is how you can explicitly map out those 4 tiers in `manual_slop` to achieve that selective memory and dynamic model routing: + +### Tier 1: The Product Manager (Top-Level Orchestrator) + +* **Model:** Gemini 3.1 Pro or Claude 3.5 Sonnet (Expensive, High-Reasoning). +* **Execution Frequency:** Very low. Runs once at the start of a feature, and once when the Track Conductor reports completion. +* **Memory Context:** Purely architectural. It sees the `project.toml`, the overall user prompt, and high-level directory structure. It **never** sees code. +* **Role:** Breaks the user's massive request into specific "Tracks" (Jira Epics/Stories), assigns a difficulty score to the Track, and spawns the Track Conductor. + +### Tier 2: The Tech Lead (Track Conductor) + +* **Model:** Gemini 3.0 Flash or Gemini 2.5 Pro (Mid-Cost, Large Context Window). +* **Execution Frequency:** Medium. +* **Memory Context:** This is where your brilliant "interface-level memory" comes in. The Track Conductor does not see the whole codebase. It gets a "Skeleton View" of the repository (just file names and function/class signatures) to understand how modules connect, but it only gets full "source-awareness" (the actual raw code) for the 1-3 files strictly assigned to its Track. +* **Role:** It looks at the Track goal, maps out the exact file changes needed, scores the complexity of each individual Task, and spawns the specific Worker Agents with the appropriate models. + +### Tier 3: The Contributors (Worker Agents) + +* **Model:** Dynamically routed based on the Track Conductor's difficulty score: +* *Hard Logic/Refactors:* Gemini 2.5 Flash. +* *Boilerplate/Trivial Codegen:* DeepSeek V3/R1. + + +* **Execution Frequency:** High (The core looping engine). +* **Memory Context:** Extremely hyper-focused. A Worker *only* sees the specific file it is told to edit and the exact Jira-ticket-style prompt from the Track Conductor. Zero awareness of the rest of the project. 
+* **Role:** Writes the code, runs the localized tests, and reports "Diff applied successfully" or "Failed with error X" back to the Track Conductor. + +### Tier 4: The Compiler / QA (Utility Agents) + +* **Model:** DeepSeek V3 (Dirt cheap, fast). +* **Execution Frequency:** As needed by Workers or Conductors. +* **Memory Context:** Zero memory. True stateless operation. +* **Role:** These are hidden micro-agents. If a Worker Agent writes a Go file and the compiler throws a massive, obscure error, you don't want the Worker burning its 10k token history trying to read the error log. Instead, you spawn a stateless QA Agent, feed it *just* the error log and the specific code snippet, and ask for a fix. It returns the fix and immediately dies, keeping the main Worker's memory clean. + +--- + +### Implementing "Interface-Level" Memory + +To make the Track Conductor actually work without blowing up your token budget, you cannot just concatenate files. You need to build a custom tool in your `mcp_client.py` or `file_cache.py` that generates that "Interface-Level" view. + +For Python, Go, and Lua, the most efficient way to do this is to implement a script that uses **Tree-sitter** or a basic `ctags` wrapper. +When the Orchestrator spawns a Track Conductor, the system runs Tree-sitter over the codebase and outputs a condensed map like this: + +```python +# file: src/api_client.py +class APIClient: + def connect(self) -> bool: ... + def fetch_data(self, endpoint: str) -> dict: ... + +``` + +This reduces a 2,000-line script into 20 lines of tokens. The Track Conductor can easily read this for the entire project to understand global dependencies, costing you pennies instead of dollars. + +Would you like me to help write a quick Python script using the `ast` module (for your Python files) that can strip out the bodies of functions and generate this exact "Interface-Level" skeleton view for your Track Conductors to use? 
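
As a minimal sketch of what that `ast`-based script could look like (Python files only; it ignores decorators, argument annotations, and module-level statements, so treat it as a starting point rather than the project's actual tooling):

```python
import ast

def skeleton(source: str) -> str:
    """Emit an interface-level view: class and function signatures only, bodies elided."""
    tree = ast.parse(source)
    lines = []

    def visit(node, indent=0):
        pad = "    " * indent
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                lines.append(f"{pad}class {child.name}:")
                visit(child, indent + 1)  # recurse into methods
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in child.args.args)
                ret = f" -> {ast.unparse(child.returns)}" if child.returns else ""
                lines.append(f"{pad}def {child.name}({args}){ret}: ...")

    visit(tree)
    return "\n".join(lines)
```

Running this over a module yields exactly the condensed `class APIClient: / def connect(self) -> bool: ...` style map shown above, at a tiny fraction of the raw token count.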
+ +# [User] + +I need merge tasks to be handled by 3.1 or 2.5 pro depending on severity, and I need the product manager or tech lead to at least get the skeleton of the major module interfaces. The tech lead shouldn't just have a skeleton view; it needs a "curated" implementation view for its designated module of code, where the detail isn't every procedure but at least the implementation algorithm/pipeline/heuristics, etc. And maybe code for very hot or important functions or data structure definitions. + +The contributors have the skeleton of other modules + designated module implementation. + +The compiler/QA seems to make sense. + +# [AI] + +This adjustment is incredibly sharp. You are essentially recreating the cognitive load distribution of a senior engineering team. Giving the Tech Lead a "curated implementation view" rather than just a barren skeleton is exactly what prevents the model from making naive architectural decisions that look good on paper but fail in practice. + +Here is how you can technically structure these exact refinements into `manual_slop`'s data pipelines and execution loops: + +### 1. The Merge/Integration Tier (Gemini 3.1 Pro / 2.5 Pro) + +Merging is often the highest-reasoning task in software development because it requires understanding *intent*, not just syntax. + +* **The Routing Logic:** You can build a `severity_scorer` tool for the Tech Lead. When a Contributor finishes a task, the Tech Lead generates a diff. +* If the diff is isolated to a single function or doesn't touch core data structures -> **Route to Gemini 2.5 Pro**. +* If the diff spans multiple files, alters a schema, or creates a git conflict -> **Route to Gemini 3.1 Pro**. + + +* **The Memory:** The Merge Agent gets the Curated View of the module, the exact diffs from the Contributors, and the original Track goal to ensure the merged logic actually solves the problem without regressing existing heuristics. + +### 2.
Generating the "Curated Implementation View" (Tech Lead Context) + +To give the Tech Lead the algorithms, heuristics, and data structures without blowing up the context window with thousands of lines of boilerplate, your file reader tool needs an extraction pipeline. + +Instead of just reading the raw `.py` or `.go` file, your tool parses the Abstract Syntax Tree (AST) and filters it based on these rules: + +* **Keep all Class/Struct definitions:** The data shapes are critical for the Tech Lead. +* **Keep all Docstrings and Block Comments:** This is where the pipelines, heuristics, and algorithms are usually explained in plain English. +* **Keep "Hot" Function Bodies:** You can implement a rule where any function explicitly decorated (e.g., `@core_logic`) or containing specific comment tags (e.g., `# [HOT]`) has its entire body preserved. +* **Strip the Rest:** Replace the bodies of standard methods, getters/setters, and utility functions with `... # Implementation hidden`. + +This compresses a 3,000-line module into maybe 400 lines of highly dense, heavily curated architectural context. + +### 3. The Contributor Context (The "Header + Local" Approach) + +Your proposed memory layout for the Contributors is the exact gold standard for agentic coding. + +* **Foreign Modules:** They receive the stripped-down, interface-only skeleton (just function names and types). This prevents hallucinating non-existent APIs. +* **Local Module:** They receive the full, unredacted source code of the specific module they are working in. +* **The Result:** A DeepSeek V3 worker can confidently write a new feature in `module_A.py` that calls `module_B.init()` because it knows the signature of `init()`, but it didn't have to spend 5,000 tokens reading how `init()` actually works under the hood. + +### 4. The Compiler/QA Loop + +Because these micro-agents are stateless, you can hardcode their execution directly into the `shell_runner.py` or your test execution pipeline. 
If a test fails, the system automatically intercepts the `stderr`, wraps it with the specific code chunk that failed, fires it at DeepSeek V3, and asks for a patch. It acts as an automated syntax-level filter before the failure ever reaches the Tech Lead. + +--- + +Would you like me to write a custom Python script using the `ast` module that you can drop into your `file_cache.py` to automatically generate this "Curated Implementation View" for your Python files? + +# [User] + +Tier 1: The Product Manager (Top-Level Orchestrator): give me an updated memory context for each of its paths of use + +# [AI] + +To make the Tier 1 Product Manager (PM) truly cost-effective and highly intelligent, we have to treat its context window like a boardroom. It should only be fed high-level executive summaries, strategic goals, and cross-departmental (cross-module) updates. + +Since the PM (using Gemini 3.1 Pro or Claude 3.5 Sonnet) operates at the highest cost per token, its "memory context" must dynamically shift depending on exactly *why* it is being called. + +Here is the updated, strict memory context routing for the PM across its three primary paths of use: + +### Path A: Project Initialization & Planning (The "Epic" Phase) + +**Trigger:** You drop a massive new feature request or architectural shift into the main UI. +**Role:** Break the user request into isolated "Tracks" (Jira Epics), estimate severity, and define the Acceptance Criteria. + +* **What it Sees (The Context):** +* **The User Prompt:** The raw feature request. +* **Project Meta-State:** The `project.toml` or `manual_slop.toml` (rules, allowed languages, dependencies). +* **Repository Map:** A strict, file-tree outline (e.g., `tree /F` output). No source code, just names and paths. +* **Global Architecture Docs:** High-level markdown files (like your `docs/guide_architecture.md`). + + +* **What it Ignores:** All source code, all AST skeletons, and all previous micro-task histories. 
+* **Output:** A JSON array of Tracks, identifying which modules will be affected, the required Tech Lead persona, and the severity level of the eventual merge. + +### Path B: Track Delegation (The "Sprint Kickoff") + +**Trigger:** The PM is handing a defined Track down to a Tier 2 Tech Lead. +**Role:** Equip the Tech Lead with the exact boundaries of their job so they don't hallucinate out of scope. + +* **What it Sees (The Context):** +* **The Target Track:** The specific goal and Acceptance Criteria it generated in Path A. +* **Module Interfaces (Skeleton View):** It uses your `file_cache.py` to pull the strict AST skeleton (just class/function definitions) *only* for the modules this specific Track is allowed to touch. +* **Track Roster:** A list of currently active or completed Tracks (to prevent duplicate work). + + +* **What it Ignores:** The user's original massive prompt, documentation for unrelated modules, and implementation details. +* **Output:** A compiled "Track Brief" (system prompt + curated file list) that gets passed to instantiate the Tier 2 Tech Lead panel. + +### Path C: Macro-Merge & Acceptance Review (The "Severity Resolution") + +**Trigger:** A Tier 2 Tech Lead reports "Track Complete" and submits a pull request/diff. +**Role:** Verify the Track meets the global architecture goals. If the PM previously flagged this Track as "High Severity" (e.g., changing a core database schema or a global event bus), the PM handles the merge logic. + +* **What it Sees (The Context):** +* **Original Acceptance Criteria:** What the Tech Lead was *supposed* to do. +* **The Tech Lead's Executive Summary:** A 200-word explanation from Tier 2 explaining the chosen implementation algorithm. +* **The Macro-Diff:** The actual changes made to the codebase. 
+* **Curated Implementation View (Only on High Severity):** If the merge affects multiple systems, the PM is granted the Tech Lead's "Curated View" (data structures + hot functions) of the boundary files to ensure the merge doesn't break foreign modules. + + +* **What it Ignores:** The Tier 3 Contributors' trial-and-error chat history, Tier 4 compiler error logs, and the raw bodies of functions that weren't changed. +* **Output:** "Approved" (commits to master/main memory) OR "Rejected" with specific architectural feedback sent back down to the Tech Lead. + +--- + +### Why this Memory State is Powerful + +By strictly siloing the PM's memory this way in your `conductor` logic, you guarantee that a 3.1 Pro or Claude 3.5 Sonnet call never costs more than a few thousand tokens, even on a 100,000-line codebase. The PM is managing *metadata* and *summaries*, not syntax. + +Would you like to map out the exact dynamic memory context for **Tier 2 (The Tech Lead)** next, detailing how we inject that "Curated Implementation View" when it wakes up to handle a Track? + +# [User] + +Sure + +# [AI] + +Here is the exact memory state mapping for Tier 2: The Tech Lead (Track Conductor). + +Because the Tech Lead bridges the gap between high-level architecture and actual code syntax, its memory context is the most dynamic in the entire system. It operates in a "need-to-know" state, utilizing your `file_cache.py` AST parsing to keep token counts low while maintaining structural awareness. + +### Path A: Track Kickoff & Task Delegation (The "Sprint Planning") + +**Trigger:** Tier 1 (PM) assigns a Track (Epic) and wakes up the Tech Lead. +**Role:** Understand the specific module's architecture, map out the technical implementation, and spawn Tier 3 Contributors (Workers) with hyper-specific "Tickets." + +* **What it Sees (The Context):** +* **The Track Brief:** The exact Acceptance Criteria generated by Tier 1. 
+* **The Curated Implementation View (Target Modules):** For the 1-3 files this Track is explicitly allowed to modify, it gets the dense AST extraction (Class structures, docstrings, heuristics, and `# [HOT]` function bodies). +* **The Skeleton View (Foreign Modules):** For external dependencies it needs to call but not modify, it only gets function signatures and return types. + + +* **What it Ignores:** The rest of the repository, the PM's overarching project-planning logic, and the raw line-by-line code of non-hot functions. +* **Output:** A JSON array of discrete Tier 3 "Tickets." (e.g., Ticket 1: *Write DB migration script using DeepSeek*. Ticket 2: *Update core API endpoints using Gemini Flash*). + +### Path B: Contributor Review & Local Integration (The "Code Review") + +**Trigger:** A Tier 3 Contributor completes a Ticket and submits a diff, OR Tier 4 (QA) flags a persistent failure. +**Role:** Act as a strict code reviewer. Verify the Contributor's diff doesn't violate the module's heuristics or break the data structures established in the Curated View. + +* **What it Sees (The Context):** +* **The Specific Ticket Goal:** What the Contributor was supposed to do. +* **The Proposed Diff:** The exact line changes submitted by Tier 3. +* **Test/QA Output:** Any relevant logs from the stateless Tier 4 compiler checks. +* **The Curated Implementation View:** To cross-reference the proposed diff against the existing architecture. + + +* **What it Ignores:** The Contributor's internal trial-and-error chat history (the 10 loops it took to get the code right). It only sees the final submission. It also ignores tickets currently being worked on by other parallel Contributors. +* **Output:** * *Approve:* Merges the diff into the Track's temporary working branch and updates the Curated View. +* *Reject:* Sends a specific, technical critique back to the Tier 3 Contributor to try again. 
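
The severity-based merge routing described in the Merge/Integration Tier can be sketched as a tiny pure function (thresholds and model names are illustrative, mirroring the discussion rather than any real routing API):

```python
def route_merge_model(files_touched: int, schema_changed: bool, has_conflict: bool) -> str:
    """Pick a merge model by diff severity (illustrative thresholds)."""
    if files_touched > 1 or schema_changed or has_conflict:
        return "gemini-3.1-pro"  # high severity: cross-file, schema, or conflict
    return "gemini-2.5-pro"      # low severity: isolated, single-file change
```

Keeping this as a pure function makes the routing decision itself auditable in the Tier 4 logs.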
+ + + +### Path C: Track Finalization & Upward Reporting (The "Merge Request") + +**Trigger:** All Tier 3 Tickets assigned to this Track are marked "Approved." +**Role:** Compile the finished Track into a clean package for Tier 1 to review, ensuring it meets the original Epic criteria. + +* **What it Sees (The Context):** +* **The Original Track Brief:** To verify all Tier 1 requirements were met. +* **The Aggregated Track Diff:** The sum total of all changes made across all Tier 3 Tickets. +* **Dependency Delta:** A list of any new foreign modules or libraries the Contributors had to import. + + +* **What it Ignores:** The back-and-forth review cycles from Path B. The original AST Curated View (it no longer needs to write code, just report on it). +* **Output:** An Executive Summary (e.g., "Implemented caching layer using Redis; updated 3 endpoints; all local tests pass") and the final Macro-Diff, sent back up to the Tier 1 PM for final global merge approval. + +--- + +### Why this is the "Gold Standard" for Agentic Tooling + +By structuring Tier 2 this way, you achieve two massive wins for `manual_slop`: + +1. **Parallelization without Collisions:** Because the Tech Lead only cares about its specific Curated View, you can have three different Tech Leads running simultaneously on different Tracks without their context windows ever overlapping or causing git conflicts. +2. **Token Economy:** The Tech Lead evaluates a 50-line diff against a 400-line Curated View using a mid-tier model like Gemini 2.5 Pro. Compare this to standard vibecoding tools that force a $3.00/M-token model to re-read the entire 10,000-line repository just to approve a CSS change. + +Would you like to map out **Tier 3 (The Contributors)** to finalize exactly how they operate in complete isolation to write the actual code? + +# [User] + +continue with Tier 3 map + +# [AI] + +Here is the exact memory state mapping for Tier 3: The Contributors (Worker Agents). 
+ +This tier is the engine room of `manual_slop`. Because Contributors execute the highest volume of API calls, their memory context must be ruthlessly pruned. By leveraging cheap, fast models (DeepSeek V3/R1 or Gemini 2.5 Flash), they operate with zero architectural anxiety—they just write the code they are told to write. + +### Path A: Task Execution (The "Heads Down" Phase) + +**Trigger:** Tier 2 (Tech Lead) hands down a hyper-specific Ticket (e.g., "Implement a 3-try exponential backoff in the `APIClient.fetch_data` method"). +**Role:** Generate the actual syntax, write the file modifications, and utilize the filesystem tools. + +* **What it Sees (The Context):** +* **The Ticket Prompt:** The exact, isolated instructions from Tier 2. +* **The Target File (Full Source):** It gets the raw, unredacted, line-by-line source code of *only* the specific file (or even just the specific class/function) it was assigned to modify. +* **Foreign Interfaces (Skeleton View):** If the Ticket requires calling an external module, it receives the strict AST skeleton (signatures only) of that dependency to ensure correct syntax without reading the dependency's underlying logic. + + +* **What it Ignores:** The Epic/Track goals, the Tech Lead's overarching Curated View, other files in the same directory, and tickets being worked on by parallel Contributors. +* **Output:** A strict code diff or a direct write to the local filesystem using your `mcp_client.py` tools. + +### Path B: Local Iteration & Tool Execution (The "Trial and Error" Loop) + +**Trigger:** The Contributor writes code, and your `shell_runner.py` executes a local linter/test, OR it encounters a syntax error, OR you (the human) pause execution using your "Step" mode. +**Role:** Fix syntax errors, satisfy the linter, and incorporate immediate human feedback. 
+ +* **What it Sees (The Context):** +* **Ephemeral Working History:** A short, rolling window of its last 2–3 attempts (e.g., "Attempt 1: Wrote code -> Tool Output: SyntaxError on line 42"). +* **Tier 4 (QA) Injections:** If a massive stack trace occurs, the Tier 4 compiler agent intercepts it, summarizes the exact problem, and injects a 2-sentence fix recommendation into the Contributor's context. +* **Human Mutations:** Any direct edits you made to its JSON history payload before allowing it to proceed. + + +* **What it Ignores:** The Tech Lead's upcoming code review (it hasn't reached that stage yet), and attempts older than the rolling window (to prevent the context from ballooning with garbage data). +* **Output:** A revised code diff until the local tests pass or the human approves the step. + +### Path C: Task Submission (The "Micro-Pull Request") + +**Trigger:** The code executes cleanly locally, and the "Step" mode is finalized into "Task Complete." +**Role:** Package the localized changes and report back to the Tier 2 Tech Lead for review. + +* **What it Sees (The Context):** +* **The Original Ticket:** To confirm it actually did what was asked. +* **The Final State:** The cleanly modified file or exact diff. + + +* **What it Ignores:** **All of Path B.** This is crucial for your token budget. Before submitting to Tier 2, your `conductor` module must wipe the messy trial-and-error history from the payload. +* **Output:** A concise completion message ("Added exponential backoff to `fetch_data`; local tests passing") and the clean diff, sent directly up to the Tier 2 Tech Lead. + +--- + +### The Power of the "Amnesiac Worker" + +By enforcing this strict amnesia on your Contributors, you solve the biggest problem with agentic loops: the "confused context balloon." When a vibecoding agent fails 5 times, its context window fills up with its own bad code and error logs, eventually causing it to hallucinate entirely. 
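The rolling window and the Path C wipe are purely mechanical, so they can live in the conductor rather than in any prompt. A minimal sketch, assuming the common list-of-message-dicts chat format (the role names follow the usual OpenAI-style convention, not anything from the repo):

```python
def prune_worker_history(messages: list[dict], keep_attempts: int = 3) -> list[dict]:
    """Keep the system prompt and the original Ticket, plus only the last
    `keep_attempts` assistant/tool exchanges; older failed attempts vanish."""
    system = [m for m in messages if m["role"] == "system"][:1]
    ticket = [m for m in messages if m["role"] == "user"][:1]
    loop = [m for m in messages if m["role"] in ("assistant", "tool")]
    return system + ticket + loop[-2 * keep_attempts:]

def wipe_for_submission(messages: list[dict]) -> list[dict]:
    """Path C: drop ALL trial-and-error before reporting up to Tier 2."""
    return [m for m in messages if m["role"] in ("system", "user")]
```

The conductor would call `prune_worker_history` before every Path B retry and `wipe_for_submission` once when the Ticket is marked complete.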
+ +Your Tier 3 Contributors never get the chance to go crazy. If they fail too many times, Tier 2 simply kills the thread, wipes the Contributor's memory, and spawns a fresh one with the exact same Ticket. + +Would you like to map out the final, stateless **Tier 4 (The QA/Compiler Agent)**, detailing how it intercepts those raw stack traces and summarizes them before they can pollute Tier 3's memory? + +# [User] + +yes perfect, map tier 4 + +# [AI] + +Here is the exact memory state mapping for Tier 4: The Compiler / QA (Utility Agents). + +Tier 4 is your financial firewall. It is designed to solve a specific, expensive problem: LLMs are notoriously bad at reading raw, 500-line stack traces, and feeding those traces into an ongoing chat history instantly destroys your token budget. + +Tier 4 agents are entirely **stateless, single-shot micro-agents**. They wake up, translate machine garbage into human English, and immediately die. + +### Path A: The Stack Trace Interceptor (The "Translator") + +**Trigger:** A Tier 3 Contributor executes a script via your `shell_runner.py`, and the process returns a non-zero exit code with a massive `stderr` payload. +**Role:** Compress a 5,000-token wall of red text into a 50-token actionable instruction. + +* **What it Sees (The Context):** +* **The Raw Error Output:** The exact traceback from Python, Go, or Lua. +* **The Offending Snippet:** *Only* the specific function or 20-line block of code where the error originated. + + +* **What it Ignores:** Everything else. It does not know what the Tier 3 Contributor is trying to build, it does not know the Epic, and it does not see the rest of the file. It is completely blind to the "Why" and only focuses on the "What broke." +* **Output:** A surgical, highly compressed string passed back into the Tier 3 Contributor's working memory. +* *Example Output:* "Syntax Error on line 42: You missed a closing parenthesis in the `fetch_data` list comprehension. 
Add `]` before the return statement." + + + +### Path B: The Linter / Formatter (The "Pedant") + +**Trigger:** Tier 3 believes it has finished a Ticket, but your automated pre-commit hooks (like `ruff`, `gofmt`, or `eslint`) fail. +**Role:** Fix trivial formatting and syntax rules without wasting a mid-tier model's reasoning capabilities. + +* **What it Sees (The Context):** +* **The Linter Warning:** e.g., "Line too long", "Unused import", "Missing type hint". +* **The Target File:** The code written by Tier 3. + + +* **What it Ignores:** The business logic. It does not care if the code actually achieves the Track goal; it only cares that it complies with the repo's style guide. +* **Output:** A direct `sed` command or an immediate, silent diff overwrite via `mcp_client.py`. It fixes the indentation or removes the unused import and flags the file as "Clean" so Tier 3 can submit it to Tier 2. + +### Path C: The Flaky Test Debugger (The "Isolator") + +**Trigger:** A localized unit test fails, but it's a logic error (e.g., `assert 5 == 4`), not a syntax crash. +**Role:** Identify the logical disconnect between the test parameters and the function output. + +* **What it Sees (The Context):** +* **The Failing Test Function:** The exact `pytest` or `go test` block. +* **The Target Function:** The specific function the test is calling. + + +* **What it Ignores:** The rest of the test suite and the rest of the module. +* **Output:** A quick diagnosis sent to Tier 3. (e.g., "The test expects an integer, but your function is currently returning a stringified float. Cast the return value to `int`.") + +--- + +### Why Tier 4 is the Secret to Affordable Agentic Code + +Imagine a Tier 3 Contributor is using a model that costs $1.50 per million input tokens. It makes a typo, resulting in a 3,000-token stack trace. + +* **Standard Agent approach:** The 3,000 tokens go into the Contributor's memory. It tries again and fails. Now the memory has *two* stack traces (6,000 tokens). 
By loop 5, you are sending 15,000 tokens of pure error logs per request. +* **Your Tier 4 approach:** `shell_runner.py` intercepts the 3,000-token error. It fires it at DeepSeek V3 (which costs ~$0.14 per million tokens). DeepSeek reads the error for fractions of a penny, and returns a 20-token summary ("Fix the typo on line 8"). Only those 20 tokens go into the expensive Tier 3 Contributor's memory. + +You have effectively built a system where **errors cost almost nothing.** + +### Bringing it all together in `manual_slop` + +To build this 4-Tier system, your immediate next step is upgrading your `ai_client.py` and `project.toml` to support these explicit routing profiles. + +Would you like to start by drafting out the updated `project.toml` configuration structure to formally define these 4 tiers, their assigned models, and their allowed toolsets so your `conductor` knows how to spawn them? + +# [User] + +Write me a markdown document that gives me an exhaustive specification of this so that I can use it later to devise how I'll incorporate this multi-model architecture into my prototype agentic co-dev app. + +# [AI] + +Here is the exhaustive specification document for your 4-Tier Hierarchical Multi-Model Architecture. You can save this directly into your repository (e.g., as `docs/architecture_spec.md`) to guide your implementation. + +--- + +# System Specification: 4-Tier Hierarchical Multi-Model Architecture + +**Project:** `manual_slop` (or equivalent Agentic Co-Dev Prototype) +**Core Philosophy:** Token Economy, Strict Memory Siloing, and Human-In-The-Loop (HITL) Execution. + +## 1. Architectural Overview + +This system rejects the "monolithic black-box" approach to agentic coding. Instead of passing an entire codebase into a single expensive context window, the architecture mimics a senior engineering department. It uses a 4-Tier hierarchy where cognitive load and context are aggressively filtered from top to bottom. 
+
+Expensive, high-reasoning models manage metadata and architecture (Tier 1 & 2), while cheap, fast models handle repetitive syntax and error parsing (Tier 3 & 4).
+
+### 1.1 Core Paradigms
+
+* **Token Firewalling:** Error logs and deep history are never allowed to bubble up to high-tier models.
+* **Context Amnesia:** Worker agents have their trial-and-error histories wiped upon task completion to prevent context ballooning and hallucination.
+* **The Execution Clutch (HITL):** Agents operate based on Archetype Trust Scores. Trusted patterns run in `Auto` mode; untrusted or complex refactors run in `Step` mode, pausing before tool execution for human review and JSON history mutation.
+
+---
+
+## 2. The 4-Tier Hierarchy
+
+### Tier 1: The Top-Level Orchestrator (Product Manager)
+
+* **Designated Models:** Gemini 3.1 Pro, Claude 3.5 Sonnet.
+* **Execution Frequency:** Low (Start of feature, Macro-merge resolution).
+* **Core Role:** Epic planning, architecture enforcement, and cross-module task delegation.
+
+**Memory Context & Paths:**
+
+* **Path A: Epic Initialization:**
+  * *Sees:* User feature request, global config (`project.toml`), directory tree map (names/paths only), high-level architecture docs.
+  * *Action:* Outputs JSON array of Tracks (Jira Epics) and their estimated severity.
+
+* **Path B: Track Delegation:**
+  * *Sees:* Track goal, target module AST Skeleton View (signatures only).
+  * *Action:* Generates the "Track Brief" for Tier 2.
+
+* **Path C: Macro-Merge (High Severity):**
+  * *Sees:* Track Brief, Tech Lead's executive summary, cross-module diffs, Curated Implementation View of boundary files.
+  * *Action:* Resolves complex architectural conflicts and approves final merge.
+
+### Tier 2: The Track Conductor (Tech Lead)
+
+* **Designated Models:** Gemini 3.0 Flash, Gemini 2.5 Pro.
+* **Execution Frequency:** Medium.
+* **Core Role:** Module-specific planning, code review, and spawning Worker agents.
+
+**Memory Context & Paths:**
+
+* **Path A: Sprint Planning:**
+  * *Sees:* Track Brief, Curated Implementation View (AST-extracted class structures, docstrings, and `# [HOT]` function bodies) of target files, Skeleton View of foreign dependencies.
+  * *Action:* Spawns Tier 3 Worker threads with specific Tickets.
+
+* **Path B: Code Review:**
+  * *Sees:* Ticket goal, proposed Contributor diff, Curated Implementation View. (Blind to Contributor's trial-and-error history).
+  * *Action:* Approves or rejects the diff with technical feedback.
+
+* **Path C: Track Finalization:**
+  * *Sees:* Original Track Brief, aggregated approved diffs.
+  * *Action:* Submits a compiled Executive Summary to Tier 1.
+
+### Tier 3: The Worker Agents (Contributors)
+
+* **Designated Models:** DeepSeek V3/R1, Gemini 2.5 Flash.
+* **Execution Frequency:** High (The core loop).
+* **Core Role:** Generating syntax, writing localized files, running unit tests.
+
+**Memory Context & Paths:**
+
+* **Path A: Heads Down Execution:**
+  * *Sees:* The exact Ticket prompt, the raw full-text of the *single* file it is modifying, Skeleton View of required imports.
+  * *Action:* Writes code/diffs.
+
+* **Path B: Trial and Error:**
+  * *Sees:* Ephemeral rolling history of the last 2-3 local test attempts, highly compressed error hints injected by Tier 4, human manual edits.
+  * *Action:* Iterates on syntax until local tests pass.
+
+* **Path C: Task Submission:**
+  * *Sees:* The final clean diff. (All trial-and-error history is wiped).
+  * *Action:* Submits to Tier 2 for review.
+
+### Tier 4: The Utility Agents (Compiler / QA)
+
+* **Designated Models:** DeepSeek V3 (Lowest cost possible).
+* **Execution Frequency:** On-demand (Intercepts local failures).
+* **Core Role:** Single-shot, stateless translation of machine garbage into human English.
+
+**Memory Context & Paths:**
+
+* **Path A: Stack Trace Interceptor:**
+  * *Sees:* Massive raw `stderr` log, the exact offending 20-line snippet.
+  * *Action:* Returns a 20-50 token plain-English fix instruction (e.g., "Add a closing bracket on line 42") directly to Tier 3. Immediately dies.
+
+* **Path B: Linter / Formatter:**
+  * *Sees:* Linter warning, target file.
+  * *Action:* Generates a silent regex or `sed` overwrite to fix styling without bothering Tier 2 or 3.
+
+---
+
+## 3. Data Extraction Pipelines (The Memory Views)
+
+To prevent token bloat, raw file text is heavily restricted. The `file_cache.py` or `mcp_client.py` must implement an AST (Abstract Syntax Tree) parser to generate these specific views:
+
+1. **The Directory Map (Tier 1):** Just filenames and nested paths.
+2. **The Skeleton View (Tier 2 & 3 Dependencies):** Extracts only `class` and `def` signatures, type hints, and return types. Strips all docstrings and function bodies.
+3. **The Curated Implementation View (Tier 2 Target Modules):**
+   * Keeps class/struct definitions.
+   * Keeps module-level docstrings and block comments (heuristics).
+   * Keeps full bodies of functions marked with `@core_logic` or `# [HOT]`.
+   * Replaces standard function bodies with `... # Implementation hidden`.
+4. **The Raw View (Tier 3 Target File):** Unredacted source code.
+
+---
+
+## 4. Configuration Schema (`project.toml` or `manual_slop.toml`)
+
+To make this architecture dynamic, roles and archetype trust levels must be codified.
+
+```toml
+[OMITTED DIDN'T LIKE]
+```
+
+---
+
+Would you like me to draft out the Python code for the AST parser (to generate the "Curated Implementation View") so you can drop it directly into your `file_cache.py` module?
+
+# [User]
+
+I would like the agentic configuration scheme to be a different TOML from the project TOML. Also, some of this has to be baked into the program, as it's too expensive to make everything about this multi-model architecture fully user-configurable; a lot of it will need to be integrated into how the GUI presents and conveys these orchestrations between agents.
+ +# [AI] + +You are entirely right, and this is a crucial realization for building production-grade UI. If you make the entire hierarchy dynamically configurable via TOML, you aren't building an app; you are building an engine, which requires writing a massively complex, dynamic GUI renderer that will break the moment a user typos a configuration. + +To make `manual_slop` stable, the **State Machine** (the 4 tiers, how they talk to each other, and the memory views they receive) must be strictly hardcoded into your Python logic and UI panels. The TOML should only act as a control board for the "knobs" (model routing, temperature, and trust levels). + +Here is how you separate the hardcoded "Engine" from the configurable `agents.toml`. + +### 1. What Gets Baked into Python (The Engine) + +These elements are rigid. The user cannot change them without editing the source code. + +* **The GUI Layout:** `gui_2.py` hardcodes the existence of exactly three visual components: +1. The **Main Orchestrator Panel** (always active). +2. The **Track Conductor Tabs** (spawned dynamically, but rigidly formatted to show a Jira-like Epic view). +3. The **Worker Modals/Splits** (which contain the "Execution Clutch" step/auto buttons). + + +* **The Memory Boundaries:** The logic that generates the "Skeleton View" vs. "Curated View" vs. "Raw View" lives strictly in `file_cache.py`. The TOML cannot change what a Worker sees; it is hardcoded that a Tier 3 Worker *only* gets the Raw View of its target file. +* **The Spawning Logic:** Tier 1 can *only* spawn Tier 2. Tier 2 can *only* spawn Tier 3. Tier 4 is a hidden background process triggered by `shell_runner.py` exceptions. + +### 2. The Dedicated `agents.toml` (The Knobs) + +Since the architecture is baked in, this new TOML file exists strictly to route those hardcoded roles to specific APIs and set their operational behaviors. + +This file lives alongside `project.toml` but is entirely focused on the AI orchestration. 
+ +```toml +# agents.toml +# This configures the models and behaviors for the hardcoded 4-Tier architecture. + +[global] +# Fallback models if a specific tier doesn't define one +default_expensive = "anthropic:claude-3-5-sonnet-latest" +default_cheap = "gemini:gemini-1.5-flash" + +[tier_1_orchestrator] +model = "gemini:gemini-3.1-pro" +temperature = 0.2 +# The base persona. The Python engine will append the specific Epic planning prompt. +base_system_prompt = "You are the Principal Engineer. Plan macro-architecture and do not write implementation code." + +[tier_2_track_conductor] +model = "gemini:gemini-3.0-flash" +temperature = 0.3 + +# Tier 3 is split into "Worker Archetypes" so Tier 2 knows who to hire for what ticket. +[tier_3_workers.archetypes.refactor] +model = "gemini:gemini-2.5-flash" +trust_level = "step" # GUI will pause before tool execution +description = "Used for complex logic changes spanning multiple functions." + +[tier_3_workers.archetypes.codegen] +model = "deepseek:deepseek-v3" +trust_level = "auto" # GUI will execute tools without pausing +description = "Used for isolated boilerplate, docstrings, or standalone scripts." + +[tier_4_utility] +# Hardcoded to be entirely background/stateless processes +model = "deepseek:deepseek-v3" +temperature = 0.0 # Must be strictly deterministic for error parsing + +``` + +### 3. How the Program Binds Them Together + +When you launch `manual_slop`, your `ai_client.py` and `conductor` module read `agents.toml` and inject those settings into your hardcoded Python classes. + +It looks something like this under the hood: + +```python +# Inside your conductor or project_manager.py + +class WorkerAgent: + def __init__(self, archetype_name: str, target_file: str): + # 1. Load the "knobs" from agents.toml + config = load_toml("agents.toml")["tier_3_workers"]["archetypes"][archetype_name] + self.model = config["model"] + self.trust_level = config["trust_level"] + + # 2. 
Hardcode the "Engine" constraints
+        # The user CANNOT configure this. A worker ALWAYS gets this memory view.
+        self.memory_context = file_cache.get_raw_view(target_file)
+
+    async def execute_loop(self):
+        # The GUI lock-step is tied to the TOML configuration
+        if self.trust_level == "step":
+            await gui.wait_for_user_approval()
+        # ... execute tools
+
+```
+
+By structuring it this way, you give yourself the flexibility to hot-swap models (e.g., changing DeepSeek out for a new Llama model in the future) to chase cost efficiency, without having to rebuild the highly complex UI signaling and memory-siloing logic that actually makes the app work.
+
+Would you like to draft out the AST parser logic for the "Curated Implementation View," or should we look at how to structure the GUI event loop to handle these pausing "Step" modes without freezing your main Tkinter/PyQt window?
+
+# [User]
+
+We'd use tree-sitter for all parsing. I want to focus on only scripting languages for now, so it should be fine. You'd, I guess, curate from there based on what you're saying. If you want an idea of how the GUI currently works, I'd suggest reading gui.py, but for this usage it would need to be overhauled, as it's only focused on single-agent use right now.
+
+# [AI]
+
+Tree-sitter is the absolute best choice for this. It is incredibly fast, language-agnostic at the query level, and outputs an exact syntax tree that makes filtering out function bodies while preserving docstrings and class signatures mathematically precise, rather than relying on messy regex.
+
+Transitioning `gui.py` from a single-agent chat box to a multi-agent control center is going to be the heaviest lift of this entire refactor. A single-agent UI assumes a linear flow: *User types -> UI waits -> LLM responds -> UI updates.* In a 4-Tier architecture, that linear flow is entirely destroyed.
The Tier 1 PM might be waiting on you for a Track approval, while two Tier 3 Workers are simultaneously running local tests in the background, and a Tier 4 QA agent is silently fixing a linter error. If your GUI is not perfectly decoupled from the agent execution loops, the entire application will freeze the moment a second agent spawns. + +Here is the blueprint for overhauling your GUI architecture to handle the `manual_slop` multi-model hierarchy without locking up the main thread. + +### 1. The Event Loop Redesign (Decoupling UI from Agents) + +Your current single-agent setup likely uses a standard async wait or blocking call when fetching the LLM response. In the new architecture, the GUI must be incredibly "dumb." It should only render state; it should never manage state. + +* **The Agent Bus (Message Queue):** You need to implement an `asyncio.Queue` or a thread-safe signaling system (like `pyqtSignal` if you are using PyQt/PySide, or virtual events in Tkinter). +* **Background Workers:** When the PM spawns a Tier 2 Tech Lead, the GUI does not wait for it. It fires a `StartTrack` event into the queue. A background async task picks it up, runs the LLM call, and fires a `TrackStateUpdated` signal back to the GUI. The GUI just listens and redraws. + +### 2. The Visual Hierarchy Redesign + +You can no longer use a standard "Chat Window." You are building an IDE for AI Agents. + +* **Left Sidebar (The Tier 1 View):** This is the PM's domain. It shows the `project.toml` status, the high-level tree, and a list of active Tracks (Epics). It is where you put your overarching architectural prompts. +* **Main Center Canvas (The Tier 2 View):** When you click a Track in the sidebar, the center populates. This is the Tech Lead's domain. It displays the **Curated Implementation View** (your Tree-sitter output) and a Kanban-style list of the Tier 3 Tickets currently assigned to that Track. 
+* **The Bottom/Right Split (The Tier 3 View):** When you click a specific Ticket, a focused panel opens. This is the Worker's domain. +* *Left Half of Split:* The specific file's **Raw View** and the live diff. +* *Right Half of Split:* The Worker's ephemeral trial-and-error chat log and tool outputs. + + + +### 3. Implementing the "Execution Clutch" in the UI + +Because Tier 3 uses the "Step" vs. "Auto" trust levels defined in your `agents.toml`, the UI for a Worker panel requires strict control elements. + +* **The Status Banner:** Every Worker panel needs a highly visible state: `[Idle]`, `[Thinking]`, `[Executing Tool]`, `[Waiting for Human]`, or `[Task Complete]`. +* **The Interception Deck:** When a Worker is in "Step" mode, and it decides to run a tool (e.g., `write_file`), the GUI must intercept the payload. +* It renders a diff preview: "DeepSeek wants to replace lines 40-50." +* It presents three buttons: **[Approve]**, **[Edit Payload]**, **[Abort & Rewrite Memory]**. + + +* **The Memory Mutator:** If you click "Edit Payload", the GUI opens a raw JSON/text view of the Worker's immediate context window. You fix the model's hallucination directly in the text, hit "Save", and the event loop passes the mutated history back to the agent engine as if the model generated it perfectly the first time. + +### 4. Handling Tier 4 (The Ghost in the Machine) + +The Tier 4 QA/Compiler agents should have **no dedicated UI**. They are background utilities. The only time the GUI acknowledges them is in the Tier 3 Worker's chat log. + +If a test fails, the GUI simply renders a small inline badge in the Worker's log: *`⚡ Tier 4 QA intercepted a 3,000-line stack trace and injected a 20-word fix.`* This keeps the visual clutter strictly focused on human-level decision-making. + +### The Immediate Next Step + +To avoid ripping `gui.py` entirely apart on day one, the safest path forward is to build the decoupled **Agent Bus (Message Queue)** in your `conductor` logic first. 
Once you can spawn dummy agents in a background python script and have them print their statuses to the console concurrently, wiring them up to a new multi-pane GUI becomes purely a frontend rendering task. + +# [User] + +If a task requires modifying multiple modules, the delegation of worker agents must enact stubs for another agent to fully implement and patch placeholders, or correct implementation once the dependency is resolved by them in the corresponding module. This allows the context to not blow up and reduce failures. How would we define this path in this architecture? + +# [AI] + +This is a brilliant addition. What you are describing is essentially **Interface-Driven Development (IDD)** applied to AI agents. By forcing the system to write the "Contract" (the stub) before the "Implementation" (the logic), you completely eliminate the chicken-and-egg problem of cross-module dependencies without ever merging two massive files into one context window. + +In our 4-Tier Architecture, this logic belongs squarely in **Tier 2 (The Tech Lead)**, acting as a Topological Dependency Graph. + +Here is how you formally define this "Stub-and-Resolve" path in your specification. + +--- + +### Path D: The Contract-First Delegation (Stub-and-Resolve) + +**Trigger:** Tier 2 (Tech Lead) evaluates a Track and detects that `Module B` requires a new function or data structure from `Module A` that does not exist yet. +**Role:** Break the cross-module dependency into a strict sequence of Interface Generation, Skeleton Broadcasting, and Parallel Implementation. + +#### Step 1: Contract Definition (Tier 2) + +Instead of assigning a massive ticket to rewrite both modules, the Tech Lead splits the requirement into three distinct Tickets: + +1. **The Stub Ticket:** Create the placeholder in `Module A`. +2. **The Consumer Ticket:** Write the logic in `Module B` that calls the placeholder. +3. **The Implementation Ticket:** Fill in the actual logic of the placeholder in `Module A`. 
+ +#### Step 2: Stub Generation (Tier 3 - Fast Worker) + +* **Assigned Model:** DeepSeek V3 (Dirt cheap, strict syntax). +* **The Ticket:** "Add a stub for `def calculate_metrics(data: dict) -> float:` in `Module A`. Include full type hints, a docstring describing the intended behavior, and make it `return 0.0` or `raise NotImplementedError`." +* **Context:** Sees only the Raw View of `Module A`. +* **Result:** The worker writes the stub and closes the ticket in seconds. + +#### Step 3: Skeleton Broadcast (System / `file_cache.py`) + +This is the critical mechanical step. The moment the Stub Ticket is merged into the Track's working branch, your `file_cache.py` **must immediately re-run Tree-sitter**. + +* The newly generated stub is instantly added to the global "Skeleton View". + +#### Step 4: Parallel Implementation (Tier 3 - Heavy Workers) + +Now that the Skeleton View is updated, the Tech Lead can spawn the next two tickets simultaneously without their context windows ever overlapping: + +* **Worker 1 (The Consumer):** * *Task:* Build the feature in `Module B`. +* *Context:* Sees Raw View of `Module B`, and the *Skeleton View* of `Module A`. +* *Advantage:* Because the Skeleton View now contains `calculate_metrics(data: dict) -> float`, Worker 1 can write perfect, type-safe integration code without hallucinating the API, and without reading a single line of `Module A`'s internal logic. + + +* **Worker 2 (The Implementer):** +* *Task:* Replace the stub in `Module A` with the actual complex algorithm. +* *Context:* Sees the Raw View of `Module A`. +* *Advantage:* It doesn't need to know *who* is calling it or *why*. It just needs to fulfill the contract defined by the stub's docstring and type hints. 
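Concretely, the contract committed in Step 2 and consumed in Step 4 might look like the sketch below. Only the `calculate_metrics` signature comes from the Ticket above; the docstring contract and the `render_dashboard` consumer are illustrative inventions:

```python
# --- Module A: the stub the contract_stubber commits in Step 2 ---
def calculate_metrics(data: dict) -> float:
    """Aggregate the raw counters in `data` into a single score.

    Contract (hypothetical): returns a float in [0.0, 1.0].
    Implementation pending -- see the Implementation Ticket.
    """
    raise NotImplementedError("stub: awaiting Implementation Ticket")

# --- Module B: the Consumer (Worker 1) codes against the signature alone ---
def render_dashboard(data: dict) -> str:
    try:
        score = calculate_metrics(data)
    except NotImplementedError:
        score = 0.0  # graceful placeholder until Worker 2 resolves the stub
    return f"quality: {score:.2f}"
```

Because the consumer only touches the signature, Worker 2 can later replace the `raise` with the real algorithm without a single line of Worker 1's code changing.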
+
+
+
+#### Step 5: Resolution & Testing (Tier 4)
+
+Because the consumer (`Module B`) might fail its tests if it hits a `NotImplementedError` from the stub, the Tech Lead instructs your `shell_runner.py` to defer integration testing until *both* parallel implementation tickets are marked complete. Once both are done, Tier 4 runs the suite, and the Track is completed.
+
+---
+
+### Updating the Configuration
+
+To support this smoothly, you would add a specific Archetype to your `agents.toml` designed purely for writing these contracts:
+
+```toml
+[tier_3_workers.archetypes.contract_stubber]
+model = "deepseek:deepseek-v3"
+trust_level = "auto"
+description = "Strictly generates empty function signatures, classes, and type hints. Never writes business logic."
+system_prompt = "You are an API designer. Write the requested function signature with strict type hints and docstrings. Return a dummy value or pass. Do not implement the logic."
+
+```
+
+### Why This is the Ultimate Cost-Saver
+
+If a human asks the system to "Update the database schema and migrate the API endpoints," a standard agent will load the DB file and the API file into one 20,000-token prompt, get confused, and rewrite half your repo.
+
+With this path, the Tech Lead realizes the dependency, uses DeepSeek to write a 5-line DB class stub for $0.0001, updates the Skeleton, and then the API worker builds against that stub flawlessly. You maintain complete modular isolation.
+
+Would you like to design the specific trigger event in your `conductor`'s event loop that listens for "Stub Merged" to automatically fire off the Tree-sitter re-cache command?
+
+# [User]
+
+When would this path be enacted? Only when there is a cross-module dependency identified by Tier 2?
+
+# [AI]
+
+Yes, exactly. The detection of a **missing interface** (usually a cross-module dependency) by Tier 2 is the primary trigger.
+
+However, to make the system truly robust, we should define this trigger slightly more broadly.
Tier 2 (The Tech Lead) should enact the Stub-and-Resolve path **any time execution is blocked by an undefined signature**, whether that crosses a file boundary or just unlocks parallel work. + +Here is exactly when and why Tier 2 would enact this path in the architecture: + +### 1. The Primary Trigger: Cross-Module Dependencies + +This is the scenario you identified. + +* **The Situation:** Tier 1 assigned a Track to build a "User Activity Dashboard." Tier 2 realizes that the `frontend_api.py` module needs to call `get_user_metrics()`, but that function needs to be built inside `database_queries.py`. +* **The Trigger:** Tier 2 checks the current Skeleton View of `database_queries.py` and sees that `get_user_metrics()` does not exist. +* **The Action:** It halts the `frontend_api.py` implementation, spawns a `contract_stubber` for the database file, updates the Skeleton, and then resumes parallel execution. + +### 2. The Secondary Trigger: Unlocking Parallel Execution (The "Bottleneck" Bypass) + +Sometimes, a single module contains a massive, complex algorithm that will take a heavy Worker (e.g., Gemini 2.5 Flash) several iterations to get right. + +* **The Situation:** A single Track requires writing a complex data parser in `parser.py` and also updating 5 other files that rely on that parser. +* **The Trigger:** If Tier 2 assigns the parser Ticket first and waits for it to finish, the other 5 Workers are bottlenecked. +* **The Action:** Tier 2 immediately enacts the Stub path for `parser.py`. It takes 3 seconds to generate the stub. Now, Tier 2 can spawn all 6 Workers simultaneously. The 5 consumer Workers code against the stub, while the heavy Worker spends the next 5 minutes filling in the actual parser logic. + +### 3. The Tertiary Trigger: Massive Intra-Module Refactors + +Even within a *single* file, stubs are incredibly useful for saving tokens and preventing hallucination. 
* **The Situation:** A 1,000-line Python script needs to be refactored from functional programming into an Object-Oriented architecture.
* **The Trigger:** Giving a Worker a 1,000-line file and saying "Make this OO" usually results in the LLM losing track of variables or truncating the file to save space (the classic "rest of code here" hallucination).
* **The Action:** Tier 2 enacts the Stub path *locally*. It asks a Worker to write the empty Class definitions and method signatures at the top of the file. Once the Skeleton is established, Tier 2 assigns granular Tickets: "Move the logic for function X into Class Y's method Z."

---

### How to Enforce This Programmatically

To make sure Tier 2 actually does this without you having to manually micromanage it, you bake this strict rule into Tier 2's `base_system_prompt` in your `agents.toml`:

> **Tier 2 (Tech Lead) Delegation Rule:**
> "Before assigning an Implementation Ticket, you must verify that all required classes, methods, and functions exist in the provided Skeleton View. If a Ticket requires a dependency that does not exist, you MUST output a `spawn_worker` command using the `contract_stubber` archetype to create the interface first. Do not assign implementation or consumer tasks until the interface stub is merged."

By hardcoding this logic, the Tech Lead becomes a strict interface enforcer. It forces the LLM to think like a systems architect rather than a junior developer who just starts writing code top-to-bottom.

Would you like to review how the `project_manager.py` (or `conductor`) module will actually handle the state management of these dependent tickets? For example, how it knows to keep a Consumer Worker "paused" in the queue until the Stubber Worker reports success?

# [User]

You can read my codebase ./ and the current "conductor" is just Gemini's conductor extension directory. It has no code for "conducting" the program this way, as the program is still single-agent.
# [AI]

That makes perfect sense. I was looking at the `conductor` directory in your Gitea repository and assuming it was the foundation for this orchestrator, but if it's just the Gemini extension placeholder, we are essentially starting the multi-agent engine from a blank slate.

Honestly, this is the best possible position to be in. Trying to mutate a linear, single-agent chat script into a multi-threaded, asynchronous routing engine usually results in horrific spaghetti code. Building a dedicated `orchestrator.py` (or rewriting your `conductor` logic from scratch) allows you to build the event loop cleanly.

Since your `ai_client.py`, `shell_runner.py`, and `mcp_client.py` are already built to handle tool execution and API calls, your new Conductor module just needs to act as the central dispatcher.

Here is the exact architectural roadmap for how you should build this new multi-agent Conductor module to transition `manual_slop` out of the single-agent phase.

---

### 1. Data Structures (The State Machine)

Before writing any API calls, the Conductor needs strict definitions of what a "Track" and a "Ticket" are. You should use `dataclasses` or `pydantic` to enforce this state, so your UI knows exactly what to render.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Ticket:
    id: str
    target_file: str
    prompt: str
    worker_archetype: str  # e.g., 'contract_stubber', 'refactor'
    status: str = "pending"  # pending, running, blocked, step_paused, completed
    dependencies: List[str] = field(default_factory=list)  # IDs of tickets that must finish first

@dataclass
class Track:
    id: str
    title: str
    description: str
    status: str = "planning"
    tickets: List[Ticket] = field(default_factory=list)
```

### 2. The Async Event Bus (The Engine)

Your current single-agent setup probably uses a simple `await client.generate()`.
The new Conductor needs an asynchronous event loop that can manage multiple workers at once, pausing some while letting others run.

You will need an `asyncio.Queue` to handle messages passing between the agents, the UI, and the filesystem.

* **The Dispatcher Loop:** A background `async` function that constantly checks the `Track` states. If a `Ticket` is "pending" and has no unresolved "dependencies" (like waiting for a Stub), the Dispatcher spins up a new `ai_client` task.
* **The Dependency Blocker:** If Ticket B (Implement Logic) depends on Ticket A (Generate Stub), the Dispatcher leaves Ticket B in the "blocked" state until the Event Bus receives a `TicketCompleted` event for Ticket A.

### 3. The UI Bridge (Decoupling `gui_2.py`)

Right now, your GUI likely triggers an LLM call and waits for the text to stream back. In the new architecture, the GUI and the Conductor must be completely divorced.

* The GUI never calls `ai_client.py`.
* When you click "Submit" in the GUI, it simply pushes a `UserRequestEvent` into the Conductor's queue.
* The Conductor processes it (spawning Tier 1, then Tier 2, etc.) and pushes `StateUpdateEvents` back to a queue that the GUI reads 60 times a second to redraw the screens.
* This ensures your UI never freezes, even if three DeepSeek models are generating code simultaneously.

### 4. The `agents.toml` Parser

The Conductor needs a factory function that reads the `agents.toml` we discussed earlier. When the Tech Lead says, "Spawn a worker for Ticket 1 using `contract_stubber`," the Conductor looks up `contract_stubber` in the TOML, grabs the DeepSeek V3 API config, sets the `trust_level` to `auto`, and injects the AST Skeleton View from `file_cache.py`.

---

### The Migration Strategy

To prevent breaking your current working prototype, I recommend building this alongside your existing code rather than overwriting it immediately.

1. Create a new file called `multi_agent_conductor.py`.
2.
Build the `Track` and `Ticket` state classes.
3. Build a dummy event loop that just uses `asyncio.sleep()` to simulate agents taking time to work, and have it print state changes to the terminal (e.g., `[Tier 2] Spawned Stubber -> [Tier 3] Stub Finished -> [Tier 2] Spawning Implementer`).
4. Once the logic of dependencies and routing works in the terminal, you wire it into `gui_2.py` and swap the dummy sleepers for actual `ai_client.py` calls.

Would you like me to write the core `asyncio` dispatcher loop for `multi_agent_conductor.py` so you have the foundational code for handling the Ticket dependencies and routing?

# [User]

I would like the ability to always be able to fall back to "linear" or step mode for any tier. Worker spawning becomes linear, and so does its sub-agent spawning. This allows for feature development on this orchestration and isolates whether we're having issues with async coherence vs. a logical bug in a specific batch/worker/linear unit queue execution logic.

# [AI]

This is an exceptionally mature architectural decision. Building a multi-agent system without a global synchronous fallback is a one-way ticket to debugging hell. When you have three asynchronous workers concurrently modifying the filesystem and rewriting the AST Skeleton View, a race condition or context leak becomes almost impossible to trace.

By forcing a strict "Linear" (or "Step") execution mode at the orchestrator level, you turn your entire 4-Tier system into a deterministic state machine. You can step through it like a debugger in an IDE, watching exactly how the memory views mutate after every single agent interaction.

Here is how you architect this Global Execution Toggle into your new Conductor engine.

### 1. The Global Execution Toggle

This should be defined at the highest level of your application—ideally passed as a CLI argument when starting `manual_slop` (e.g., `uv run .\gui.py --mode=linear`) or set in a global `config.toml` so it overrides all individual agent trust levels.

```toml
# config.toml
[execution]
mode = "linear"  # "linear" or "async"
pause_on_tier_spawn = true  # Pauses for human confirmation before waking up the next tier
```

### 2. The Orchestrator Dispatch Logic

In your `multi_agent_conductor.py`, the core loop that handles Track and Ticket delegation must explicitly check this execution mode.

When the Tech Lead (Tier 2) generates an array of 5 Tickets, the Dispatcher's behavior forks:

**A. Async Mode (Production):**
The Dispatcher throws all 5 Tickets into an `asyncio.TaskGroup` or `asyncio.gather()`. They spawn instantly, fight for API rate limits, read the skeleton, and run in parallel.

**B. Linear Mode (Debug):**
The Dispatcher iterates through the array sequentially using a strict `for` loop. It `awaits` the absolute completion of Ticket 1 (including its Tier 4 QA loops and Tier 2 code review) before it even instantiates the `WorkerAgent` class for Ticket 2.

### 3. Python Implementation (The Dispatcher)

Here is exactly how that logic looks in the Conductor's execution engine:

```python
import asyncio
from typing import List

class ConductorEngine:
    def __init__(self, mode: str = "async"):
        self.mode = mode

    async def execute_tickets(self, tickets: List['Ticket']):
        """Handles the lifecycle of Tier 3 Workers based on execution mode."""

        if self.mode == "linear":
            print("[SYSTEM] Running in strict LINEAR mode.")
            for ticket in tickets:
                print(f"\n--- [DEBUG: Paused before spawning Ticket: {ticket.id}] ---")
                # In UI mode, this awaits a button click. In CLI, it's an input.
                await self.wait_for_debugger_step()

                # Blocks the entire orchestrator until this specific ticket is closed
                await self.run_worker_lifecycle(ticket)

                print(f"[SYSTEM] Ticket {ticket.id} closed. State machine updated.")

        elif self.mode == "async":
            print("[SYSTEM] Running in ASYNC mode. Spawning swarm.")
            # Spawns all unblocked tickets simultaneously
            tasks = [self.run_worker_lifecycle(t) for t in tickets if not t.is_blocked()]
            await asyncio.gather(*tasks)

    async def run_worker_lifecycle(self, ticket: 'Ticket'):
        # 1. Generate context (Skeleton View, Raw View)
        # 2. Call Tier 3 Model
        # 3. If Error -> Call Tier 4 Model (also linear/blocking)
        # 4. Return to Tier 2 for review
        pass
```

### 4. Overriding Archetype Trust Levels

When `mode = "linear"` is active, the Conductor must forcibly override the `agents.toml` configuration.
Even if you set your `contract_stubber` to `trust_level = "auto"`, the Linear execution engine forces it into `step` mode. This ensures that a DeepSeek V3 worker doesn't silently write a broken stub while you are trying to debug the overarching Track delegation logic.

### 5. State Snapshots (The Benefit of Linear Mode)

Because Linear Mode guarantees that no other agent is mutating the project state in the background, you can inject a hook directly after `await self.run_worker_lifecycle(ticket)` to dump the current state to disk.

You can have `manual_slop` output a `debug_state.json` containing:

* The exact Track JSON generated by Tier 1.
* The exact Skeleton View generated by `file_cache.py`.
* The exact chat history payload of the worker that just finished.

This allows you to open `debug_state.json` in a separate editor pane and verify the async coherence manually before stepping to the next ticket.
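That snapshot hook is small enough to sketch in full. This is a minimal sketch only: the `dump_debug_state` name, the stripped-down `Track`/`Ticket` stand-ins, and the exact snapshot fields are illustrative, not code that exists in the repository yet.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List

# Minimal stand-ins for the Track/Ticket dataclasses defined earlier.
@dataclass
class Ticket:
    id: str
    status: str = "pending"

@dataclass
class Track:
    id: str
    tickets: List[Ticket] = field(default_factory=list)

def dump_debug_state(track: Track, skeleton_view: str,
                     worker_history: list, path: str = "debug_state.json") -> None:
    """Freeze the orchestrator state to disk after a linear-mode step."""
    snapshot = {
        "track": asdict(track),            # the Track JSON generated by Tier 1
        "skeleton_view": skeleton_view,    # the view produced by file_cache.py
        "worker_history": worker_history,  # chat payload of the finished worker
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(snapshot, f, indent=2)
```

In linear mode the Conductor would call this immediately after `await self.run_worker_lifecycle(ticket)` returns, so each stepped ticket leaves a frozen state file behind.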
---

Would you like me to map out the full `multi_agent_conductor.py` skeleton, including the `Track`/`Ticket` dataclasses and this exact execution loop, so you can drop it into your repository and start wiring it up to your `gui_2.py`?

# [User]

No, don't map it out. You can define a high-level skeleton of different things vs. how they are currently in the codebase. This will eventually turn into several tracks and updates to the conductor's plan for the Gemini CLI activities, to slowly iterate and implement.

# [AI]

This is the perfect way to approach a refactor of this magnitude. Trying to boil the ocean and rewrite `manual_slop` into a multi-agent system in one massive commit will break everything you've built.

By defining the delta between your *current* codebase and the *future* 4-Tier architecture, you can feed this exact skeleton to your current `conductor` scripts as a series of isolated Epics.

Here is the high-level structural diff: **Current State vs. Proposed Architecture**.

---

### 1. Configuration & Persona Management

* **Current State:** `config.toml` / `project.toml` hold API keys, global theme settings, and a single overarching system prompt. It assumes one model doing all the work.
* **Future State:**
  * `project.toml` handles strict project metadata and repository boundaries.
  * A new `agents.toml` defines the AI hierarchy: Tier 1/2 overarching parameters, Tier 3 worker archetypes (`refactor`, `codegen`, `contract_stubber`), Tier 4 utility roles, and their assigned `trust_level` (`step` vs. `auto`).

### 2. Execution Engine & Orchestration

* **Current State:** Linear, single-thread execution. `gui.py` or `gui_2.py` takes an input, calls `ai_client.py`, waits for the stream, and appends it to the global history. The current `conductor/` directory acts mostly as a wrapper/automation script for the Gemini CLI.
* **Future State:**
  * A dedicated `state_machine.py` defining strict `Track` and `Ticket` dataclasses.
  * A decoupled `orchestrator.py` containing an asynchronous Event Bus (`asyncio.Queue`).
  * A Global Execution Toggle (`mode="async"` vs. `mode="linear"`) that can force the entire swarm to pause and run synchronously for debugging.

### 3. Memory & File Context (`file_cache.py` / `mcp_client.py`)

* **Current State:** The system reads raw file contents, perhaps chunks them, and dumps them into the main context window. Summarization (`summarize.py` / `aggregate.py`) is done linearly and manually.
* **Future State:** `file_cache.py` is upgraded to use Tree-sitter for deterministic AST parsing. It exposes four rigid memory pipelines:
  1. **Directory Map:** Just paths.
  2. **Skeleton View:** Function/class signatures only.
  3. **Curated Implementation View:** Skeletons + docstrings + `# [HOT]` function bodies.
  4. **Raw View:** Unredacted text.

### 4. Tool Execution & Interception (`shell_runner.py` / `api_hooks.py`)

* **Current State:** An agent executes a PowerShell script or file write, and the output (success or error trace) is piped directly back into the main chat history.
* **Future State:**
  * `shell_runner.py` intercepts `stderr`. Before Tier 3 ever sees the crash, it silently fires a Tier 4 QA (DeepSeek V3) stateless request to translate the stack trace into a 20-word fix.
  * The "Execution Clutch" intercepts tool payloads *before* execution if `trust_level="step"`, awaiting a UI signal to proceed, edit, or abort.

### 5. Frontend UI (`gui_2.py`)

* **Current State:** A chat-centric interface (even with dockable panels) where the user talks directly to the AI, maintaining a single timeline of events.
* **Future State:** A "dumb" renderer that acts as a control surface.
  * It no longer calls `ai_client.py` directly. It only pushes and pops events from the Orchestrator's queue.
  * Visual layout shifts to a Jira-like Project Management board: Tier 1 Tracks on the side, Tier 2 Ticket Kanban in the center, and modal popups for monitoring specific Tier 3 Worker trial-and-error loops.

---

### The Iteration Plan (Conductor Implementation Tracks)

To get your current Gemini CLI setup to build this for you safely, you should sequence the implementation into these specific Tracks:

**Track 1: The Memory Foundations (Lowest Risk)**

* *Task:* Integrate `tree-sitter` into `file_cache.py`.
* *Goal:* Build the parser functions that can take any `.py` or `.go` file and successfully return the "Skeleton View" and "Curated View" strings. Test this entirely in isolation from the AI.

**Track 2: State Machine & `agents.toml`**

* *Task:* Create the `Ticket` and `Track` dataclasses. Build the parser for `agents.toml`.
* *Goal:* Be able to instantiate a dummy `WorkerAgent` object in Python that correctly inherits its model (e.g., DeepSeek) and trust level from the TOML based on its archetype string.

**Track 3: The Linear Orchestrator & Execution Clutch**

* *Task:* Build `orchestrator.py` with `mode="linear"` hardcoded. Wire up a Tier 3 Worker to use `ai_client.py` and `shell_runner.py`.
* *Goal:* Prove you can spawn *one* isolated task, have it generate code, pause for human approval (the Clutch), execute the tool, and report "Ticket Complete" back to the state machine.

**Track 4: Tier 4 Interception**

* *Task:* Modify `shell_runner.py` to catch non-zero exit codes.
* *Goal:* Prove that a massive error log triggers a silent, secondary API call to a cheap model, and only the summarized fix gets appended to the Tier 3 Worker's history.

**Track 5: UI Decoupling & Tier 1/2 Routing (The Final Boss)**

* *Task:* Rip the direct LLM calls out of `gui_2.py`. Wire the UI strictly to the Orchestrator's event queue.
* *Goal:* You type a prompt. Tier 1 generates JSON Tracks. Tier 2 breaks them into Tickets.
The UI renders the tickets. You click "Start" on a ticket, and Tier 3 executes.

Would you like to start structuring the specific prompts you will feed to your current Gemini CLI conductor to kick off Track 1 (Tree-sitter integration)?

# [User]

No, go into more detail only on these tracks. Also, for configuration, the credentials should be in a separate TOML from the global config or project config. They are the most security-vulnerable aspect of the program, since they directly authenticate against a vendor provider's billing.

# [AI]

You are absolutely right to isolate the billing logic. Hardcoding API keys or putting them in a generic `config.toml` that might get accidentally committed to version control is a massive security vulnerability.

You already had the right instinct for this in your original `manual_slop` setup instructions by requiring a `credentials.toml`. We will formalize that isolation in this architecture. The runtime environment should strictly load `credentials.toml` purely for instantiating the HTTP clients, while `project.toml` (repo rules) and `agents.toml` (model behavior) handle the safe metadata.

Here is the deep-dive expansion of the 5 implementation Tracks, structured exactly how you should feed them to your current Gemini CLI conductor to iteratively build the engine.

---

### Security Prerequisite: The Config Isolation

Before starting the tracks, the configuration parsers must enforce this strict separation of concerns:

* **`credentials.toml`:** Holds the bare-metal authentication (`gemini_api_key`, `anthropic_api_key`, `deepseek_api_key`). **This file must be in `.gitignore`.**
* **`project.toml`:** Holds repository-specific bounds (e.g., "This project uses Python 3.12 and strictly follows PEP8").
* **`agents.toml`:** Holds the AI routing rules, temperature settings, and `trust_level` tags for the 4 Tiers.
---

### Track 1: The Memory Foundations (AST Parser)

**Goal:** Build the engine that prevents token bloat by turning massive source files into curated memory views.

**Implementation Details for your CLI:**

1. **Dependencies:** Install `tree-sitter` and the specific language bindings (e.g., `tree-sitter-python`).
2. **The Parser Class:** Create an `ASTParser` inside `file_cache.py`.
3. **The Extraction Rules:** Write the tree-sitter queries to extract:
   * *Rule 1 (Skeleton View):* Strip out everything inside a `function_definition` or `class_definition` EXCEPT the signature, parameters, and type hints. Replace the body with `pass`.
   * *Rule 2 (Curated View):* Keep class structures. Keep `expression_statement` strings that act as docstrings. Search for comments matching `# [HOT]` and preserve the entire function body below them. Replace all other function bodies with `... # Hidden`.
4. **Acceptance Criteria:** You can pass a 1,000-line Python file into `file_cache.get_curated_view('script.py')` in your terminal and print a perfectly formatted, 100-line summary string.

### Track 2: State Machine & Data Structures

**Goal:** Define the rigid Python objects that the AI agents will pass to each other, so the system relies on structured data, not loose chat strings.

**Implementation Details for your CLI:**

1. **The Dataclasses:** Create `models.py` using `pydantic` or standard `dataclasses`. Define `Track` (Epic) and `Ticket` (Task).
2. **The Worker Context:** Define a `WorkerContext` object. This holds a Ticket ID, the assigned model (pulled from `agents.toml`), the specific `credentials.toml` injection, and an empty list for its localized `messages` payload.
3. **The State Mutators:** Add helper methods to these classes to transition their states securely (e.g., `ticket.mark_blocked(dependency_id)` or `ticket.mark_complete()`).
4.
**Acceptance Criteria:** You can instantiate a `Track` with 3 `Tickets`, mark one as complete, and the Python object strictly enforces and prints the correct state changes without any AI involved.

### Track 3: The Linear Orchestrator & Execution Clutch

**Goal:** Build the synchronous, debuggable core loop that runs a single Tier 3 Worker and pauses for your approval.

**Implementation Details for your CLI:**

1. **The Engine Core:** Create `multi_agent_conductor.py`. Build a `run_worker_lifecycle(ticket: Ticket)` function.
2. **Context Injection:** The function reads the Ticket, queries `file_cache.py` for the Raw View, and formats the `messages` array for the API call.
3. **The Clutch (HITL):** Before executing the tool returned by the API (e.g., `write_file`), implement a strict `input()` pause (for CLI) or a wait state (for GUI).
4. **Memory Mutation:** Allow the user to edit the tool payload during this pause. If the model hallucinated a file path, you correct it in the JSON before resuming execution.
5. **Acceptance Criteria:** The script can take a hardcoded Ticket, send it to DeepSeek, pause in the terminal to show you the proposed file diff, wait for you to hit "y/n/edit", apply the diff using `mcp_client.py`, and wipe the worker's history when done.

### Track 4: Tier 4 QA Interception

**Goal:** Stop error traces from destroying the Worker's token window by routing crashes through a cheap, stateless translator.

**Implementation Details for your CLI:**

1. **The Interceptor:** Open `shell_runner.py`. Wrap the `subprocess.run()` execution in a `try/except` or check for `returncode != 0`.
2. **The Tier 4 Call:** If an error occurs, do *not* append the `stderr` to the main Worker's history. Instead, immediately instantiate a synchronous API call to your cheapest model (`default_cheap` from `agents.toml`).
3.
**The Translation Prompt:** Send the raw `stderr` and the target file snippet to Tier 4 with the prompt: *"You are an error parser. Output only a 1-2 sentence instruction on how to fix this syntax error. No markdown, no pleasantries."*
4. **The Injection:** Take that tiny text response and append it to the main Worker's history as a "System Hint", then loop the Worker again.
5. **Acceptance Criteria:** You purposely introduce a syntax error into a script. The execution engine catches it, silently pings the cheap API, and the main Worker receives a 20-word correction instead of a 200-line stack trace.

### Track 5: UI Decoupling & Tier 1/2 Routing

**Goal:** Bring the whole system online by letting Tier 1 and Tier 2 generate the Tickets dynamically, managed via an asynchronous Event Bus.

**Implementation Details for your CLI:**

1. **The Event Bus:** In `multi_agent_conductor.py`, implement an `asyncio.Queue`.
2. **The Upstream Logic (Tier 1 & 2):** Write the system prompts that force the 3.1 Pro/3.5 Sonnet models to output strict JSON arrays defining the Tracks and Tickets.
3. **The Dispatcher:** Write an async loop that reads the JSON from Tier 2, converts them into `Ticket` objects (from Track 2), and pushes them onto the queue.
4. **The Stub Resolver:** Enforce the logic where if a Ticket has `archetype = contract_stubber`, the Dispatcher pauses dependent Tickets, runs the stubber, triggers `file_cache.py` to rebuild the Skeleton View, and then resumes.
5. **Acceptance Criteria:** You give a vague command ("Refactor the config system"). Tier 1 outputs a Track. Tier 2 breaks it into an interface stub Ticket and an implementation Ticket. The system executes the stub, updates the AST, and finishes the implementation automatically (or steps through it if you have the Linear toggle on).
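The Track 4 interception loop can be prototyped with zero API spend. In this sketch, `summarize_error` is a local stand-in for the Tier 4 `default_cheap` model call, and the function names and history shape are illustrative assumptions, not existing `shell_runner.py` code:

```python
import subprocess
import sys

def summarize_error(stderr: str) -> str:
    # Stand-in for the Tier 4 call: the real version would send stderr to the
    # cheap model with the "error parser" prompt and return its 1-2 sentences.
    last_line = next((l for l in reversed(stderr.splitlines()) if l.strip()),
                     "unknown error")
    return f"System Hint: the tool failed with: {last_line.strip()}"

def run_with_interception(cmd: list, worker_history: list) -> int:
    """Run a tool command; on failure, inject a condensed hint, never the raw trace."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        worker_history.append({"role": "user", "content": summarize_error(result.stderr)})
    else:
        worker_history.append({"role": "user", "content": "Tool executed successfully."})
    return result.returncode
```

Running it against a deliberately broken script (e.g., `python -c "raise ValueError('boom')"`) shows the Worker's history only ever receives the one-line hint, never the full traceback.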
---

Would you like to focus on structuring the specific Tree-sitter AST queries for Track 1 so you have the exact filtering logic ready to paste into your `file_cache.py`?

# [User]

Should we really use JSON? Is that an ideal format that models are fine-tuned for?

# [AI]

That is an incredibly sharp question, and your skepticism is entirely justified.

The short answer is: **JSON is actually a terrible format for LLMs to generate raw, but it is the format the API providers force us to use.**

When you are building agentic loops, relying purely on text-based JSON generation is one of the biggest sources of parser crashes and token bloat. Here is the reality of how models handle JSON, why it breaks, and what you should actually use in `manual_slop`.

### The Problem with Raw JSON

If you just prompt a model to "Output your response in JSON," you run into massive friction:

1. **The Escaping Nightmare:** This is the biggest issue for code-generation agents. If Tier 3 generates a 50-line Python script and tries to put it inside `{"code": "..."}`, the LLM has to correctly escape every single double-quote (`\"`) and newline (`\n`). LLMs frequently fail at this, outputting invalid JSON that crashes `json.loads()`.
2. **Token Bloat:** JSON is heavily punctuated. Every `{`, `"`, `:`, and `}` is a token. In a massive array of objects, 20% of your output tokens are just syntax overhead.
3. **Streaming Brittleness:** You cannot easily parse JSON while it is streaming. A single missing `}` at the very end of the generation invalidates the entire payload.

### The Alternatives and When to Use Them

Because you are building a multi-provider system (Gemini, Anthropic, DeepSeek), you have to adapt to how different models are fine-tuned.

#### 1. Native Structured Outputs (The "Safe" JSON)

**Best for:** Gemini 1.5/2.5/3.0 and OpenAI.
Instead of prompting the model to write JSON in the text, you pass a strict JSON Schema directly into the API request payload (e.g., Gemini's `response_schema` parameter).

* **Why it works:** The model provider handles the syntax at the inference engine level. It forces the LLM's token probabilities to strictly adhere to your schema. You are mathematically guaranteed to get parseable JSON back.
* **The Catch:** It doesn't solve the token bloat or the string-escaping friction when returning large blocks of code.

#### 2. XML Tags (The Anthropic Standard)

**Best for:** Claude 3.5 Sonnet (your Tier 1 Orchestrator) and raw DeepSeek text outputs.
Anthropic explicitly fine-tunes Claude to read and write XML tags.

* **Why it works:** XML is incredibly resilient for LLMs. You can wrap raw, unescaped Python code perfectly inside tags like `<code> ... </code>`. There are no quotes to escape, and no commas to forget.
* **Streaming:** You can write a simple regex parser in Python that extracts data the second the closing tag (e.g., `</code>`) streams in, rather than waiting for the entire payload to finish.

#### 3. YAML (The Token Saver)

**Best for:** Configurations and hierarchical summaries.
YAML relies on indentation rather than brackets and quotes.

* **Why it works:** It is highly token-efficient and very readable.
* **The Catch:** LLMs (especially cheaper ones like DeepSeek V3) can easily hallucinate a single space of indentation, which completely corrupts the YAML hierarchy. It is generally too risky for automated agent loops.

### The Best Strategy for `manual_slop`

To make your Conductor bulletproof, you should use a **hybrid approach** depending on the Tier:

* **For Tier 1 & Tier 2 (Routing & Orchestration):** Use **Native API Structured Outputs (JSON Schema)**. When the PM generates a list of Tracks, or the Tech Lead generates a list of Tickets, you want strict JSON arrays parsed cleanly by `pydantic`.
Because they are generating metadata (not raw code), the escaping issue doesn't apply.
* **For Tier 3 (Code Generation & Tools):** Use **XML Tags**. When a DeepSeek worker writes a file, instruct it to use dedicated tags (e.g., `<file_path>` and `<file_content>`). Your `mcp_client.py` can easily extract the text between those tags using a basic regular expression. This completely eliminates the JSON string-escaping nightmare for raw code.

Would you like me to show you the regex snippet for a robust XML-tag extractor that you can use in your `ai_client.py` to safely parse the Tier 3 worker outputs?

# [User]

What about TOML? What about CSV?

# [AI]

Both TOML and CSV have their places in software engineering, but when you force an LLM to generate them on the fly, their architectural quirks become massive liabilities.

Here is exactly why you should keep them out of your agentic execution loops, even though you (correctly) use TOML for your static configurations.

### 1. TOML (Tom's Obvious, Minimal Language)

**The Verdict: Perfect for Human Configs, Terrible for Dynamic LLM Output.**

You are already using TOML perfectly in `manual_slop` for `project.toml` and `agents.toml` because it is designed for *humans* to read and write. But for an LLM generating complex state objects (like a Track with 5 nested Tickets), it breaks down quickly:

* **The Nesting Problem:** TOML handles deep nesting using table headers (e.g., `[track.tickets.ticket_1]`). LLMs are notoriously bad at remembering the exact path of a nested table header 500 tokens later in a stream. They frequently hallucinate the hierarchy, outputting `[[tickets]]` when they meant `[track.tickets]`.
* **Fine-Tuning Deficit:** Models see exponentially more JSON, HTML/XML, and YAML in their training data than TOML. Because of this, their internal token probabilities for valid TOML syntax are much weaker.
* **Multiline Strings:** While TOML supports multiline strings (`"""`), if a Tier 3 worker tries to output a Python script that *also* contains a `"""` docstring, the TOML parser immediately crashes.

**Rule of Thumb:** Use TOML to configure the agents before they start. Never ask an agent to speak TOML back to you.

### 2. CSV (Comma-Separated Values)

**The Verdict: Token-Efficient, but Structurally Paralyzed.**

CSV is the absolute king of token efficiency. There are no brackets, no tags, and no keys repeated per item. If you need a model to output a list of 100 file paths, CSV is the cheapest way to do it.

However, for `manual_slop`'s Orchestrator, it is entirely unusable:

* **Zero Hierarchy:** A Tier 1 PM needs to output a Track (Epic) that contains a list of Tickets (Tasks), which contain a list of Dependencies. CSV is strictly flat. You cannot represent hierarchical state machines in CSV without writing a wildly complex, custom relational parser.
* **The "Comma" Minefield:** If a Tier 2 Tech Lead outputs a Ticket prompt in CSV like: `ticket_1, refactor_worker, "Rewrite the API, ensuring fast, async, and concurrent connections"`, the parser must perfectly handle the escaped quotes around the commas in the prompt. LLMs frequently mess up CSV quote-escaping when writing natural language, breaking the column alignment for the rest of the file.

---

### The Final Hierarchy of LLM Output Formats

If you map out the formats based on how well AI models handle them in a programmatic loop, it looks like this:

1. **Native Structured Outputs (JSON Schema forced by the API):** The undisputed king for routing, metadata, and state objects (Tier 1 and Tier 2). The API provider guarantees the syntax.
2. **XML Tags (e.g., `<file_path>`, `<file_content>`):** The undisputed king for raw code generation and complex strings (Tier 3). It naturally isolates syntax and requires zero string escaping.
3.
**Markdown Code Blocks (`` ```python ``):** Okay for simple CLI tools, but prone to models adding unwanted conversational text before or after the block. +4. **Raw JSON (Prompt-directed):** A fragile nightmare of missing commas and broken string escapes. +5. **YAML:** Great token efficiency, but prone to catastrophic indentation hallucinations. +6. **TOML / CSV:** Structurally incompatible with dynamic, nested agent outputs. + +Stick to native JSON Schema for your `Track`/`Ticket` orchestration, and XML tags for your Tier 3 Worker code modifications. It is the only way to build a parser that doesn't crash on loop 3. + +# [User] + +What about game dev serialization formats for hierarchical information (DAGs)? + +# [AI] + +This is a galaxy-brain angle to take. Game developers have spent the last 30 years aggressively optimizing how to serialize massive Directed Acyclic Graphs (DAGs)—like behavior trees, scene graphs, and quest dependencies—because parsing a 100MB nested JSON file mid-frame would literally halt the game. + +Looking at game dev serialization formats for your Tier 1 and Tier 2 Orchestrators is highly relevant because your Tracks and Tickets *are* a DAG. Ticket C cannot execute until Ticket A (the stub) and Ticket B (the database schema) are resolved. + +Here is how game development formats map to LLM capabilities, what fails, and the specific paradigm you should steal for `manual_slop`. + +### 1. The Binary Trap (What Doesn't Work) + +First, we have to immediately rule out the actual production formats game engines use at runtime: **MessagePack, BSON, FlatBuffers (compiled), and Protocol Buffers (compiled).** + +* **The Reality:** LLMs are text-prediction engines. They output tokens, not bytes. If you ask DeepSeek or Gemini to output raw MessagePack or binary FlatBuffers, it will completely hallucinate the byte offsets and crash your parser instantly. + +### 2. 
The Text-Based IDLs (The "Meh" Option) + +Game engines often use Interface Definition Languages (IDLs) or text-representations of binary formats for version control (like Protobuf Text Format `.pbtxt` or Unreal's `T3D`). + +* **The Verdict:** While these are strongly typed and handle references well, LLMs are not trained heavily enough on generating *dynamic data* in these formats. They are trained on the *schemas* (writing the `.proto` file), but if you ask Claude to manage a dynamic project state using `.pbtxt`, its token probabilities get shaky compared to JSON. + +### 3. The Godot Paradigm: Linearized ECS (The Golden Ticket) + +This is the game dev format you should actually steal. + +When you save a scene in the Godot Engine (which is a massive DAG of nodes and resources), it doesn't save it as a deeply nested JSON tree. It saves it as a `.tscn` text file using a **Flat Relational List**. It completely flattens the DAG into an Entity Component System (ECS) layout. + +If you try to make an LLM output a DAG in nested JSON, it breaks because JSON is strictly a Tree, not a Graph. To represent a shared dependency, the LLM has to duplicate data. + +**Instead, teach your Tier 1/Tier 2 agents to output "Linearized Entities" with ID Pointers.** Here is what that looks like when applied to your Orchestrator output: + +```text +[Track id="trk_api_refactor" severity="high"] +description="Overhaul the API endpoints" + +[Ticket id="tkt_stub_db" type="stub"] +target="db.py" +prompt="Create empty DB connection class" + +[Ticket id="tkt_impl_db" type="implement" depends_on="tkt_stub_db"] +target="db.py" +prompt="Fill in connection logic" + +[Ticket id="tkt_api_route" type="implement" depends_on="tkt_stub_db"] +target="api.py" +prompt="Call the DB connection class" + +``` + +**Why this is perfect for LLMs:** + +1. **Zero Nesting:** The LLM never has to keep track of closing brackets `}` across 500 tokens. It just writes one flat block after another. +2. 
**Native DAG Support:** By using `depends_on="tkt_stub_db"`, you natively construct the DAG. Both `tkt_impl_db` and `tkt_api_route` point to the same parent, creating a graph structure that JSON struggles to represent elegantly. +3. **High Token Efficiency:** It reads like an INI file, which models generate flawlessly, but functions like a relational database. + +### 4. The DOT / Mermaid.js Approach (The "Trained" Graph Format) + +If you want to use a format that LLMs are *explicitly* and massively fine-tuned on for generating DAGs, you should look at **Mermaid.js** or **Graphviz DOT**. + +LLMs are heavily trained to generate Mermaid charts for users asking for visual diagrams. You can hijack this fine-tuning for your data orchestration. + +**The Prompt to Tier 2:** "Output the Ticket execution DAG using Mermaid syntax." + +**The LLM Output:** + +```mermaid +graph TD + A["Ticket: tkt_stub | target: db.py"] --> B["Ticket: tkt_impl | target: db.py"] + A --> C["Ticket: tkt_route | target: api.py"] + +``` + +**How to parse it:** Your Python `conductor` can easily run a regex over this Mermaid output. It grabs the nodes (the Tickets), parses the text inside the brackets, and reads the arrows (`-->`) to automatically build your `asyncio` dependency blockers. (Note: the node labels are quoted because unquoted `|` and `:` characters inside Mermaid node text can break the parser.) + +### The Final Recommendation for `manual_slop` + +If you want the absolute most resilient system for your top-tier Orchestrators to define complex dependency chains (Tracks and Tickets): + +1. **Do not use nested JSON.** As you suspected, it will blow up when the graph gets complex. +2. **Use the "Godot ECS" flat-list approach** (represented via strict JSON arrays or XML tags). +    * Tell the model: *"Output a flat list of Tickets. Use the `depends_on` array field to link to the IDs of previous tickets. Never nest a ticket inside another ticket."* + +3. 
**Enforce the DAG locally.** Your Python `state_machine.py` reads that flat list, looks at the `depends_on` IDs, and builds the DAG locally in memory to feed to your `asyncio.Queue`. + +This keeps the LLM's cognitive load entirely focused on *what* needs to be built and *what* it relies on, rather than fighting bracket-matching syntax. diff --git a/MMA_Support/Overview.md b/MMA_Support/Overview.md new file mode 100644 index 0000000..55cad0c --- /dev/null +++ b/MMA_Support/Overview.md @@ -0,0 +1,27 @@ +# 4-Tier Hierarchical Multi-Model Architecture (MMA) - Overview + +The 4-Tier Hierarchical Multi-Model Architecture is a conceptual framework designed to manage complexity in AI systems by decomposing responsibilities into distinct, specialized layers. This modular approach enhances scalability, maintainability, and overall system performance. + +## Architectural Tiers + +1. **Tier 1: User-Facing Model (The Orchestrator/Router)** + * Direct user interface and intent interpretation. + * Routes requests to appropriate specialized models or tools. + +2. **Tier 2: Specialized Models (The Experts/Tools)** + * Domain-specific models or tools (e.g., code generation, data analysis). + * Performs the "heavy lifting" for specific tasks. + +3. **Tier 3: Data & Knowledge Base (The Information Layer)** + * A repository of structured and unstructured information. + * Provides context and facts to specialized models. + +4. **Tier 4: Monitoring & Feedback (The Governance Layer)** + * Overarching layer for evaluation, error analysis, and continuous improvement. + * Closes the loop between user experience and model refinement. + +## Core Goals +* **Modularity:** Decouple different functions to allow for independent development. +* **Efficiency:** Use smaller, specialized models for specific tasks instead of one monolithic model. +* **Contextual Accuracy:** Ensure specialized tools have access to relevant data. 
+* **Continuous Improvement:** Establish a systematic way to monitor performance and iterate. diff --git a/MMA_Support/Principles_Interactions.md b/MMA_Support/Principles_Interactions.md new file mode 100644 index 0000000..e202022 --- /dev/null +++ b/MMA_Support/Principles_Interactions.md @@ -0,0 +1,30 @@ +# Principles & Interactions + +The effectiveness of the 4-Tier Multi-Model Architecture depends on well-defined interfaces and clear communication protocols between layers. + +## Interaction Flow + +1. **Ingress:** The User sends a query to Tier 1. +2. **Intent & Routing:** Tier 1 analyzes the query and identifies the required expertise. +3. **Specialist Call:** Tier 1 dispatches a request to one or more Tier 2 specialists. +4. **Knowledge Retrieval:** Tier 2 specialists query Tier 3 for specific facts or context needed for their task. +5. **Execution:** Tier 2 specialists process the request using the retrieved data. +6. **Synthesis:** Tier 1 receives the output from Tier 2, synthesizes it, and presents it to the User. +7. **Observation:** Tier 4 logs the entire transaction, collects feedback, and updates metrics. + +## Core Architectural Principles + +### 1. Separation of Concerns +Each tier should have a single, clear responsibility. Tier 1 should not perform heavy computation; Tier 2 should not handle user-facing conversation logic. + +### 2. Standardized Communication +Use structured data formats (like JSON) for all inter-tier communication. This ensures that different models (potentially from different providers) can work together seamlessly. + +### 3. Graceful Degradation +If a Tier 2 specialist fails or is unavailable, Tier 1 should be able to fall back to a more general model or provide a meaningful error message to the user. + +### 4. Verification Over Trust +Tier 1 should validate the output of Tier 2 specialists before presenting it to the user. Tier 4 should periodically audit the entire pipeline to ensure quality and safety. + +### 5. 
Data Privacy & Governance +Ensure that data flowing through Tier 3 and 4 is handled according to security policies, with proper sanitization and access controls. diff --git a/MMA_Support/Technical_Deep_Dive.md b/MMA_Support/Technical_Deep_Dive.md new file mode 100644 index 0000000..06fcec5 --- /dev/null +++ b/MMA_Support/Technical_Deep_Dive.md @@ -0,0 +1,59 @@ +# Technical Deep Dive: Paths & Nuances + +This document explores the low-level technical execution paths and implementation nuances of the 4-Tier Hierarchical Multi-Model Architecture. + +## 1. Execution Paths + +The architecture distinguishes between different "paths" to optimize for latency, cost, and accuracy. + +### A. The Fast Path (Reactive) +* **Trigger:** Low-complexity intents (e.g., "Hello", "What is the current time?", "Status check"). +* **Flow:** User -> Tier 1 -> User. +* **Nuance:** Tier 1 identifies that no specialized knowledge (Tier 3) or tool execution (Tier 2) is required. It responds directly using its internal weights or a local cache. +* **Goal:** Sub-100ms response time. + +### B. The Slow Path (Reflective / Agentic) +* **Trigger:** Complex tasks (e.g., "Fix the bug in the UI layout", "Refactor the ai_client.py"). +* **Flow:** User -> Tier 1 (Intent) -> Tier 2 (Specialist) -> Tier 3 (Context/RAG) -> Tier 2 (Execution) -> Tier 1 (Synthesis) -> User. +* **Nuance:** This involves high-latency operations, including tool calls and codebase searches. Tier 1 acts as a supervisor, potentially looping back to Tier 2 if the initial output is insufficient. + +### C. The Governance Path (Tier 4 Integration) +* **Trigger:** Any operation that modifies the system or presents a high-risk answer. +* **Flow:** (Parallel or Post-hoc) Tier 1/2 Output -> Tier 4 (Validation) -> User/Log. +* **Nuance:** Tier 4 runs an "LLM-as-a-judge" or a static analysis tool (like `ruff` or `mypy`) on the output. If validation fails, the system may automatically trigger a "re-plan" in Tier 1. + +--- + +## 2. 
Context & Token Management + +A critical nuance is how the limited context window (token budget) is managed across tiers. + +### A. Token Budgeting +* **Tier 1 (Global Context):** Holds the conversation history and high-level project metadata. Budget: ~20% of window. +* **Tier 2 (Local Context):** Receives a "surgical" injection of relevant files/data from Tier 3. Budget: ~60% of window. +* **Output Space:** Reserved for generating large code blocks or summaries. Budget: ~20% of window. + +### B. Context Folding (The "Accordion" Effect) +To prevent context overflow, the system "folds" (summarizes) older parts of the conversation. +* **Recent History:** Full fidelity. +* **Mid-term History:** Summarized by Tier 1. +* **Long-term History:** Archived in Tier 3 (searchable but not in-context). + +--- + +## 3. Communication Protocols + +* **Inter-Tier Format:** Strictly structured JSON (e.g., OpenAI Tool Call format or Google GenAI Function Call). +* **Streaming:** Tier 1 typically streams its "thinking" process (Slow Path) to provide the user with immediate feedback while Tier 2 is still working. +* **Handshake:** Tier 2 must acknowledge receipt of context from Tier 3 with a "Digest" hash to ensure data integrity. + +--- + +## 4. Nuances vs. 
Standard RAG + +| Feature | Standard RAG | MMA (4-Tier) | +| :--- | :--- | :--- | +| **Logic** | Flat (Query -> Doc -> Result) | Hierarchical (Intent -> Route -> Expert -> Doc) | +| **Expertise** | Homogeneous | Heterogeneous (Different models for different tiers) | +| **Feedback** | Manual | Automated (Tier 4 Closed-loop) | +| **State** | Stateless or simple session | Multi-layered state (Orchestrator vs Specialist state) | diff --git a/MMA_Support/Tier1_Orchestrator.md b/MMA_Support/Tier1_Orchestrator.md new file mode 100644 index 0000000..644f9fd --- /dev/null +++ b/MMA_Support/Tier1_Orchestrator.md @@ -0,0 +1,30 @@ +# Tier 1: User-Facing Model (Orchestrator/Router) + +The User-Facing Model is the entry point for all user interactions. It serves as the "brain" that understands what the user wants and decides how the system should respond. + +## Key Responsibilities + +### 1. Intent Recognition +* Analyze the user's natural language input. +* Classify the request into one or more categories (e.g., "request for code", "general inquiry", "data analysis"). +* Extract key parameters and constraints from the user's query. + +### 2. Routing +* Map recognized intents to specific Tier 2 models or tools. +* Determine if multiple specialized tools need to be called in sequence or parallel. +* Handle tool dispatching and manage the flow of data between tiers. + +### 3. Context Management +* Maintain the history of the conversation. +* Decide what information from the history is relevant to the current turn. +* Synthesize a coherent prompt for downstream models based on the current context. + +### 4. Response Synthesis +* Integrate the raw outputs from Tier 2 models into a final, user-friendly response. +* Ensure the tone and style are consistent with user expectations. +* Validate that the final response directly addresses the user's original intent. + +## Characteristics +* **High Reasoning:** Needs to be strong at logic and instruction following. 
+* **General Purpose:** While not necessarily a domain expert, it must be broad enough to understand any valid user input. +* **Speed:** Should ideally be responsive to minimize perceived latency. diff --git a/MMA_Support/Tier2_Specialists.md b/MMA_Support/Tier2_Specialists.md new file mode 100644 index 0000000..21a4272 --- /dev/null +++ b/MMA_Support/Tier2_Specialists.md @@ -0,0 +1,28 @@ +# Tier 2: Specialized Models (Experts/Tools) + +Tier 2 consists of a collection of specialized agents, models, or tools, each optimized for a specific domain or task. This allows the system to leverage "best-in-class" capabilities for different problems. + +## Key Responsibilities + +### 1. Task Execution +* Perform deep processing in a specific area (e.g., writing Python code, generating images, performing complex mathematical calculations). +* Operate within the constraints provided by the Tier 1 Orchestrator. + +### 2. Domain Expertise +* Provide specialized knowledge that a general model might lack. +* Utilize specialized formatting or protocols (e.g., returning structured JSON for data analysis tools). + +### 3. Tool Integration +* Act as wrappers for external APIs or local scripts (e.g., `shell_runner` in Manual Slop). +* Manage its own internal state or "scratchpad" during complex multi-step operations. + +## Common Specialist Examples +* **Code Expert:** Optimized for high-quality software engineering and debugging. +* **Search/Web Tool:** Specialized in retrieving and summarizing real-time information. +* **Data Scientist:** Capable of running statistical models and generating visualizations. +* **Creative Writer:** Focused on tone, narrative, and artistic expression. + +## Implementation Principles +* **Fine-Tuning:** Models in this tier are often smaller models fine-tuned on specialized datasets. +* **Isolation:** Specialists should ideally be stateless or have well-defined, temporary state to prevent cross-contamination. 
+* **Interface Standards:** Use consistent input/output formats (like JSON) to simplify communication with Tier 1. diff --git a/MMA_Support/Tier3_Knowledge.md b/MMA_Support/Tier3_Knowledge.md new file mode 100644 index 0000000..dee9699 --- /dev/null +++ b/MMA_Support/Tier3_Knowledge.md @@ -0,0 +1,27 @@ +# Tier 3: Data & Knowledge Base (Information Layer) + +Tier 3 is the foundational layer that provides the necessary facts, documents, and data required by the higher tiers. It is a passive repository that enables informed reasoning and specialized processing. + +## Key Responsibilities + +### 1. Information Storage +* Maintain large-scale repositories of structured data (SQL/NoSQL databases) and unstructured data (PDFs, Markdown files, Codebases). +* Host internal company documents, project-specific files, and external knowledge graphs. + +### 2. Retrieval Mechanisms (RAG) +* Support efficient querying via Vector Search, keyword indexing, or metadata filtering. +* Provide Retrieval-Augmented Generation (RAG) capabilities to enrich the prompts of Tier 2 models with relevant snippets. + +### 3. Contextual Enrichment +* Supply specialized models with "ground truth" data to minimize hallucinations. +* Manage versioned data to ensure the system reflects the most up-to-date information. + +## Components +* **Vector Databases:** (e.g., Pinecone, Milvus, Chroma) for semantic search. +* **Traditional Databases:** (e.g., PostgreSQL) for structured business data. +* **File Systems:** Local or cloud storage for direct file access. +* **External APIs:** Real-time data sources (weather, finance, etc.). + +## Interactions +* Tier 2 specialists query Tier 3 to get the data they need to perform their tasks. +* Tier 1 may occasionally query Tier 3 directly to determine if sufficient information exists before routing. 
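
## Illustrative Sketch: A Minimal Tier 3 Retrieval Interface

The retrieval interaction described above can be sketched in a few lines. This is a hypothetical example, not part of `manual_slop`: the `Snippet` dataclass and `retrieve` helper are invented names, and the term-overlap score stands in for what a production Tier 3 would do with vector search or keyword indexing. The point is the narrow interface a Tier 2 specialist calls.

```python
import re
from dataclasses import dataclass


@dataclass
class Snippet:
    """One retrieved piece of context, attributed to its source document."""
    source: str
    score: int
    text: str


def retrieve(query: str, docs: dict[str, str], top_k: int = 3) -> list[Snippet]:
    """Score each document by how many query terms appear in it,
    and return the top_k matches as truncated snippets."""
    terms = set(re.findall(r"\w+", query.lower()))
    results = []
    for name, text in docs.items():
        words = re.findall(r"\w+", text.lower())
        score = sum(1 for w in words if w in terms)
        if score:  # drop documents with zero term overlap
            results.append(Snippet(source=name, score=score, text=text[:400]))
    return sorted(results, key=lambda s: s.score, reverse=True)[:top_k]
```

A Tier 2 code expert might call `retrieve("fix database connection", project_docs)` and inject the returned snippets into its prompt, keeping the specialist stateless while Tier 3 owns the data.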
diff --git a/MMA_Support/Tier4_Monitoring.md b/MMA_Support/Tier4_Monitoring.md new file mode 100644 index 0000000..ec1045a --- /dev/null +++ b/MMA_Support/Tier4_Monitoring.md @@ -0,0 +1,27 @@ +# Tier 4: Monitoring & Feedback (Governance Layer) + +Tier 4 acts as the "supervisor" of the entire architecture. It ensures the system is performing correctly, ethically, and efficiently, while providing a path for continuous evolution. + +## Key Responsibilities + +### 1. Performance Monitoring +* Track latency, token usage, and error rates across all tiers. +* Identify bottlenecks (e.g., a Tier 2 specialist that is consistently slow). + +### 2. Evaluation & Feedback +* Collect explicit user feedback (e.g., "Good/Bad" ratings). +* Perform automated evaluation using "LLM-as-a-judge" to score responses based on accuracy, tone, and safety. +* Log failures for manual review and human-in-the-loop (HITL) intervention. + +### 3. Error Analysis & Root Cause +* Analyze why specific routes failed or why a specialist produced a low-quality output. +* Maintain a "lesson learned" database to inform future system prompts or fine-tuning. + +### 4. Continuous Improvement +* Inform the retraining or fine-tuning of Tier 2 models based on real-world usage patterns. +* Optimize Tier 1 routing logic based on success/failure metrics. + +## Tools & Techniques +* **Logging/Observability:** (e.g., LangSmith, Weights & Biases, custom JSON-L logs). +* **A/B Testing:** Compare different model versions or routing strategies. +* **Red Teaming:** Proactively test the system for vulnerabilities and biases.
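
## Illustrative Sketch: A Minimal Tier 4 JSON-L Recorder

To make the metrics above concrete, here is a minimal sketch of the JSON-L logging approach. The `TierMonitor` class is hypothetical (it is not the project's actual `performance_monitor.py` or `session_logger.py`); it only shows the shape of the closed loop: one append-only record per tier call, plus a summary that surfaces error rate and latency for optimizing routing.

```python
import json
import time
from pathlib import Path


class TierMonitor:
    """Append one JSON-L record per tier call; summarize latency, tokens, errors."""

    def __init__(self, log_path: str = "logs/tier_metrics.jsonl"):
        self.path = Path(log_path)
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def record(self, tier: str, tool: str, latency_ms: float,
               tokens: int, ok: bool) -> dict:
        # One flat JSON object per line: trivially parseable, append-only.
        entry = {"ts": time.time(), "tier": tier, "tool": tool,
                 "latency_ms": latency_ms, "tokens": tokens, "ok": ok}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
        return entry

    def summary(self) -> dict:
        """Aggregate the log into the Tier 4 metrics: error rate, latency, cost."""
        entries = [json.loads(line) for line in self.path.read_text().splitlines()]
        if not entries:
            return {}
        return {
            "calls": len(entries),
            "error_rate": sum(1 for e in entries if not e["ok"]) / len(entries),
            "avg_latency_ms": sum(e["latency_ms"] for e in entries) / len(entries),
            "total_tokens": sum(e["tokens"] for e in entries),
        }
```

A governance loop might call `summary()` after each session and flag any tier whose error rate or average latency crosses a threshold, feeding directly into the "Continuous Improvement" responsibility above.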