manual_slop/MMA_Support/Data_Pipelines_and_Config.md
2026-02-24 19:11:15 -05:00

Data Pipelines, Memory Views & Configuration

The 4-Tier Architecture relies on strictly managed data pipelines and configuration files to prevent token bloat and keep the execution environment deterministic and safe.

1. AST Extraction Pipelines (Memory Views)

To keep LLMs from hallucinating and from filling their context windows, access to raw file text is heavily restricted. file_cache.py uses Tree-sitter to parse source files deterministically into an Abstract Syntax Tree (AST) and generates the following views:

  1. The Directory Map (Tier 1): Just filenames and nested paths (e.g., output of tree /F). No source code.
  2. The Skeleton View (Tier 2 & 3 Dependencies): Extracts only class and def signatures, parameters, and type hints. Strips all docstrings and function bodies, replacing them with pass. Used for foreign modules a worker must call but not modify.
  3. The Curated Implementation View (Tier 2 Target Modules):
    • Keeps class/struct definitions.
    • Keeps module-level docstrings and block comments (selected heuristically).
    • Keeps full bodies of functions marked with @core_logic or # [HOT].
    • Replaces all other function bodies with ... # Hidden.
  4. The Raw View (Tier 3 Target File): Unredacted, line-by-line source code of the single file a Tier 3 worker is assigned to modify.
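
The Skeleton and Curated Implementation views can be sketched as follows. This is a minimal illustration, not the contents of file_cache.py: it swaps Tree-sitter for Python's standard-library `ast` module so the example stays self-contained, and the names `render_view` and `_Redactor` are hypothetical. Because `ast` discards comments, the sketch detects only the `@core_logic` decorator; it cannot see `# [HOT]` markers or emit the `# Hidden` comment.

```python
import ast

MARKER = "core_logic"  # decorator name taken from the view description above


def _has_marker(func) -> bool:
    """Return True if the function carries an @core_logic decorator."""
    for dec in func.decorator_list:
        target = dec.func if isinstance(dec, ast.Call) else dec
        name = getattr(target, "attr", None) or getattr(target, "id", None)
        if name == MARKER:
            return True
    return False


def _drop_docstring(node) -> None:
    """Remove a leading docstring while keeping the body non-empty."""
    if ast.get_docstring(node, clean=False) is not None:
        node.body = node.body[1:] or [ast.Pass()]


class _Redactor(ast.NodeTransformer):
    """Hypothetical transformer: skeleton mode by default, curated if asked."""

    def __init__(self, curated: bool) -> None:
        self.curated = curated

    def visit_Module(self, node):
        if not self.curated:  # the skeleton view strips all docstrings
            _drop_docstring(node)
        return self.generic_visit(node)

    def visit_ClassDef(self, node):
        if not self.curated:
            _drop_docstring(node)
        return self.generic_visit(node)

    def visit_FunctionDef(self, node):
        if self.curated and _has_marker(node):
            return node  # marked functions keep their full body
        # Skeleton view uses `pass`; curated view uses `...`
        placeholder = ast.Expr(ast.Constant(...)) if self.curated else ast.Pass()
        node.body = [placeholder]
        return node

    visit_AsyncFunctionDef = visit_FunctionDef


def render_view(source: str, curated: bool = False) -> str:
    """Return the redacted source for one module."""
    tree = _Redactor(curated).visit(ast.parse(source))
    return ast.unparse(tree)
```

Calling `render_view(source)` yields signatures with `pass` bodies, while `render_view(source, curated=True)` keeps the module docstring, class definitions, and the full bodies of `@core_logic` functions. A Tree-sitter version would work on the concrete syntax tree instead, which is what makes comment-based markers such as `# [HOT]` recoverable.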

2. Configuration Schema

The architecture keeps sensitive, billable credentials separate from repository rules and AI behavior routing.

  • credentials.toml (Security Prerequisite): Holds the raw API keys (gemini_api_key, anthropic_api_key, deepseek_api_key). This file must be listed in .gitignore. It is loaded only to instantiate HTTP clients.
  • project.toml (Repo Rules): Holds repository-specific bounds (e.g., "This project uses Python 3.12 and strictly follows PEP8").
  • agents.toml (AI Routing): Defines the operational behavior of the hardcoded hierarchy. It includes fallback models (default_expensive, default_cheap), Tier 1/2 global parameters (temperature, base system prompts), and Tier 3 worker archetypes (refactor, codegen, contract_stubber), each mapped to a specific model (e.g., DeepSeek V3, Gemini Flash) and a trust_level tag (step or auto).

3. LLM Output Formats

To ensure robust parser execution and avoid JSON string-escaping nightmares, the architecture uses a hybrid approach for LLM outputs depending on the Tier:

  • Native Structured Outputs (JSON Schema enforced by the API): Used for Tier 1 and Tier 2 routing and orchestration. The model provider constrains the response to the supplied schema, so Track and Ticket metadata can be parsed cleanly with pydantic.
  • XML Tags (<file_path>, <file_content>): Used for Tier 3 Code Generation & Tools. The tags delimit the payload, so the code inside needs no string escaping. The UI/Orchestrator parses the tags via regex to extract raw Python code safely, without bracket-matching failures.
  • Godot ECS Flat List (Linearized Entities with ID Pointers): Deeply nested JSON becomes error-prone for models to generate once it runs to around 500 tokens, so Tier 1/2 Orchestrators instead define complex dependency DAGs as a flat list of items that reference each other by ID (e.g., [Ticket id="tkt_impl" depends_on="tkt_stub"]). The Python state machine reconstructs the DAG locally.
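
The two text-based formats above can be parsed with a few lines of standard-library Python. The sketch below is one possible approach, not the project's parser: the function names (`extract_files`, `parse_tickets`, `execution_order`) are hypothetical, comma-separated `depends_on` values are an assumption, and `graphlib.TopologicalSorter` stands in for whatever the Python state machine actually uses to rebuild the DAG.

```python
import re
from graphlib import TopologicalSorter

# Tier 3 output: a <file_path> tag followed by a <file_content> tag.
FILE_BLOCK = re.compile(
    r"<file_path>(.*?)</file_path>\s*<file_content>\n?(.*?)</file_content>",
    re.DOTALL,
)
# Tier 1/2 output: flat items such as [Ticket id="a" depends_on="b"].
TICKET = re.compile(r"\[Ticket\s+([^\]]*)\]")
ATTR = re.compile(r'(\w+)="([^"]*)"')


def extract_files(reply: str) -> dict[str, str]:
    """Map each file path to its raw, unescaped content."""
    return {path.strip(): body for path, body in FILE_BLOCK.findall(reply)}


def parse_tickets(reply: str) -> dict[str, set[str]]:
    """Build {ticket_id: set of prerequisite ticket ids} from the flat list."""
    graph: dict[str, set[str]] = {}
    for match in TICKET.finditer(reply):
        attrs = dict(ATTR.findall(match.group(1)))
        raw_deps = attrs.get("depends_on", "")
        # Assumption: several dependencies are written comma-separated.
        graph[attrs["id"]] = {d.strip() for d in raw_deps.split(",") if d.strip()}
    return graph


def execution_order(graph: dict[str, set[str]]) -> list[str]:
    """Return ticket ids with every prerequisite before its dependents."""
    unknown = set().union(*graph.values()) - set(graph)
    if unknown:
        raise ValueError(f"depends_on references unknown tickets: {sorted(unknown)}")
    # static_order() raises graphlib.CycleError if the tickets form a cycle.
    return list(TopologicalSorter(graph).static_order())
```

Checking for unknown IDs before sorting catches a hallucinated pointer early, which is the main failure mode a flat list leaves open. The file regex is non-greedy, so it would cut a file short if the generated code itself contained a literal `</file_content>` string; a real parser would need a policy for that case.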