ed 886df61051 docs(rag): correct the 'Removed fields' note (claim ChunkingConfig was wrong)
The previous note in guide_rag.md §RAGConfig Schema said:
  'ast_chunking_enabled lives in ChunkingConfig (not in RAGConfig)'

This was a documentation lie. Verified by grep:
- 'class ChunkingConfig' returns 0 matches in src/
- 'ast_chunking_enabled' returns 0 matches anywhere in src/
- The 5 fields (ast_chunking_enabled, auto_index_on_load,
  auto_sync_interval_seconds, vector_store_backend, vector_store_path)
  were never in the real RAGConfig. They were fictional.

Rewrite the note to be honest: 'the old doc was fictional; the
real RAGConfig has 5 fields; the other 5 fields never existed'.
Clarify that top_k is a real runtime parameter (on
RAGEngine.search()) not a config field.
2026-06-10 20:32:11 -04:00


RAG (Retrieval-Augmented Generation)

Top | Architecture | MMA | Tools & IPC | Simulations


Overview

Manual Slop integrates Retrieval-Augmented Generation (RAG) to extend the AI's working context beyond the explicit file list. When a project is RAG-enabled, the system maintains a vector index of file content; AI calls can retrieve semantically similar fragments at query time and prepend them to the prompt.

The RAG implementation is pluggable: the vector store, the embedding provider, and the chunking strategy are all configurable per project. The default backend is ChromaDB (local persistent), the default embedding is Gemini Embedding 001 (cloud), and the default chunking is character-based with overlap (with AST-aware chunking for Python files when enabled).

This guide covers:

  1. Architecture: Where RAG fits in the dispatch pipeline
  2. Components: RAGEngine, embedding providers, vector store
  3. Data Flow: Indexing, query, retrieval, injection
  4. Configuration: RAGConfig schema and TOML settings
  5. Verification: Test infrastructure and known edge cases

Architecture

RAG sits between the project's tracked files and the AI provider's input prompt. It is not an internal AI call — it is a pre-processing step that augments md_content before the provider sees it.

                ┌─────────────────────────────────┐
                │ AppController / ConductorEngine │
                │ (caller of ai_client.send)      │
                └────────────┬────────────────────┘
                             │ constructs RAGEngine once per project
                             ▼
        ┌────────────────────────────────────────────┐
        │ RAGEngine                                   │
        │  ├─ EmbeddingProvider (Local or Gemini)    │
        │  ├─ VectorStore (ChromaDB persistent)        │
        │  └─ Chunkers (_chunk_text, _chunk_code)    │
        └────────────┬───────────────────────────────┘
                     │ on every ai_client.send() call:
                     │   rag_engine.search(user_message) -> fragments
                     ▼
        ┌────────────────────────────────────────────┐
        │ ai_client.send(rag_engine=...)             │
        │   injects [RETRIEVED CONTEXT] block        │
        │   into md_content before provider call     │
        └────────────────────────────────────────────┘

Lifecycle:

  • The AppController constructs a single RAGEngine per project load (lazily, when the project is first opened or when a RAG-related setting changes).
  • The RAGEngine is passed through to ai_client.send() for every AI call from the main discussion flow.
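  • For Tier 3 workers spawned by the MMA, the caller (typically AppController or a test harness) is responsible for constructing the engine, usually with the same configuration as the main discussion. The ConductorEngine does not construct one itself (see MMA Worker Integration).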
  • If a project disables RAG, rag_engine=None is passed to send() and the integration is a no-op.

Why caller-owned? The RAG engine is decoupled from ai_client so that the same module can be reused by the GUI's RAG panel for direct queries, by MMA workers for ticket-specific retrieval, and by future automation scripts. ai_client only knows how to use an engine if one is provided.


Components

RAGEngine (src/rag_engine.py)

The central class. Owns the embedding provider and the vector store, exposes high-level methods for indexing and search.

class RAGEngine:
    def __init__(self, config: models.RAGConfig, base_dir: str = "."):
        ...

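Construction: Takes a RAGConfig (from src/models.py) and a base_dir. The config specifies the embedding provider type, the vector store settings (a nested VectorStoreConfig), the chunk size, and the chunk overlap. The on-disk store path is derived from base_dir and the collection name rather than read from the config.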

Internal state:

  • embedding_provider: BaseEmbeddingProvider — set by _init_embedding_provider
  • client: chromadb.PersistentClient — the chroma client (or the string "mock" in mock mode)
  • collection: chromadb.Collection — the actual collection (or "mock" in mock mode)
  • chunk_size: int — character count per chunk
  • chunk_overlap: int — overlap between adjacent chunks

Embedding Providers

Two providers are implemented; new ones can be added by subclassing BaseEmbeddingProvider.

BaseEmbeddingProvider

class BaseEmbeddingProvider:
    def embed(self, texts: List[str]) -> List[List[float]]:
        """Embed a batch of texts. Returns one vector per input text."""
        ...

A contract: embed() takes a list of strings and returns one float vector per input text, all of the same length. The vector dimensionality is provider-specific (e.g., 384 for all-MiniLM-L6-v2, 3072 by default for gemini-embedding-001).

LocalEmbeddingProvider

Uses sentence-transformers (all-MiniLM-L6-v2 by default) for embedding.

  • Pros: Fully local, no API quota, deterministic.
  • Cons: Lower-quality embeddings than cloud models for code; CPU/GPU usage during indexing.
  • Default model: all-MiniLM-L6-v2 (384 dimensions, ~80MB download on first use).
class LocalEmbeddingProvider(BaseEmbeddingProvider):
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        ...

GeminiEmbeddingProvider

Uses the Gemini Embedding 001 model via the google-genai SDK.

  • Pros: Higher-quality embeddings, especially for code; no local model download.
  • Cons: Requires Gemini API key, network round-trip per embedding call, subject to API quotas.
class GeminiEmbeddingProvider(BaseEmbeddingProvider):
    def __init__(self, model_name: str = 'gemini-embedding-001'):
        ...

Lazy Loading

The heavy dependencies (sentence_transformers, google.genai, chromadb) are loaded lazily via _get_sentence_transformers(), _get_google_genai(), _get_chromadb(). This means RAG is opt-in: a project that doesn't enable RAG pays no import-time cost.

Vector Store

ChromaDB is the default persistent vector store. The store is created at <project_dir>/.slop_cache/chroma_<collection_name>/ (auto-generated from VectorStoreConfig.collection_name, default "manual_slop"). The .slop_cache location is intentional — it co-locates the chroma index with the existing per-project cache layout.

def _init_vector_store(self):
    vs_config = self.config.vector_store
    if vs_config.provider == 'chroma':
        db_path = os.path.abspath(os.path.join(
            self.base_dir, ".slop_cache", f"chroma_{vs_config.collection_name}"
        ))
        os.makedirs(db_path, exist_ok=True)
        chromadb, Settings = _get_chromadb()
        self.client     = chromadb.PersistentClient(path=db_path)
        self.collection = self.client.get_or_create_collection(name=vs_config.collection_name)
        self._validate_collection_dim()
    elif vs_config.provider == 'mock':
        self.client     = "mock"
        self.collection = "mock"
    else:
        raise ValueError(f"Unknown vector store provider: {vs_config.provider}")

Backends (VectorStoreConfig.provider):

  • chroma (default for real use) — local persistent, single-process
  • mock — no-op collection (for tests / RAG-disabled paths)

The mcp_server + mcp_tool fields in VectorStoreConfig are placeholders for the future External RAG Bridge via MCP (e.g., a remote vector database server); not yet implemented.

Chunking Strategies

Two strategies are implemented. The choice is made per-file based on extension and config.

Character-Based (_chunk_text)

Default for non-Python files and for Python files when AST chunking is disabled.

def _chunk_text(self, content: str) -> List[str]:
    """Character-based chunking with overlap."""
    chunks = []
    start = 0
    while start < len(content):
        end = min(start + self.chunk_size, len(content))
        chunks.append(content[start:end])
        if end >= len(content): break
        start = end - self.chunk_overlap
    return chunks
  • Default chunk size: 1000 characters
  • Default overlap: 200 characters
  • Edge cases: Empty files return []; single-chunk files return [content].
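
To make the windowing concrete, here is a standalone version of the same loop with a small input. The function name `chunk_text` and the argument validation are assumptions added for this sketch; the validation guards against an overlap that is not smaller than the chunk size, which would stop the window from advancing.

```python
from typing import List


def chunk_text(content: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Character-based chunking with overlap (standalone sketch)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        # Assumption: reject overlaps that would keep `start` from moving forward.
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    chunks: List[str] = []
    start = 0
    while start < len(content):
        end = min(start + chunk_size, len(content))
        chunks.append(content[start:end])
        if end >= len(content):
            break
        # Step back by the overlap so adjacent chunks share context.
        start = end - chunk_overlap
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
# chunks == ["abcd", "defg", "ghij"]
```

Each chunk starts on the last character of the previous one, which is the overlap of 1 in this example.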

AST-Aware (_chunk_code)

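Used for .py files. Earlier versions of this guide gated this on RAGConfig.ast_chunking_enabled; that field does not exist anywhere in src/ (see the schema note under Configuration), so it is not a persisted config switch.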

def _chunk_code(self, content: str, file_path: str) -> List[str]:
    """AST-aware chunking for Python code."""
    # Parses with stdlib ast
    # Splits on top-level def/class boundaries
    # Each chunk is a complete top-level definition with its docstring
    ...
  • Strategy: Each top-level function, class, or constant block becomes one chunk. Docstrings are preserved as the first line of the chunk for context.
  • Pros: Semantic boundaries produce more meaningful retrieval results. A query for "how does X work" is more likely to return the entire definition of X rather than a fragment.
  • Cons: Requires valid Python; syntax errors fall back to character-based chunking.

The chunker uses stdlib ast (not tree-sitter) to avoid pulling tree-sitter for a feature that only handles Python.
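
A rough sketch of the approach with stdlib ast is shown below. It is a simplification under stated assumptions: `chunk_python_source` is a hypothetical name, every top-level statement becomes its own chunk (the real chunker may group constants differently), and `None` signals that the caller should fall back to character-based chunking.

```python
import ast
from typing import List, Optional


def chunk_python_source(content: str) -> Optional[List[str]]:
    """Return one chunk per top-level statement, or None if the source does not parse."""
    try:
        tree = ast.parse(content)
    except SyntaxError:
        return None  # caller falls back to character-based chunking
    lines = content.splitlines()
    chunks: List[str] = []
    for node in tree.body:
        # Decorators sit above the def/class line, so start at the earliest one.
        decorator_lines = [d.lineno for d in getattr(node, "decorator_list", [])]
        start = min([node.lineno] + decorator_lines)
        chunks.append("\n".join(lines[start - 1 : node.end_lineno]))
    return chunks
```

Since the docstring is part of the function or class body, it travels with the chunk without extra handling.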


Data Flow

Indexing Flow

When a project is loaded with RAG enabled, the RAGEngine is populated by indexing all tracked files.

1. Project load: AppController reads [rag] section from manual_slop.toml
2. AppController constructs RAGEngine(config)
3. RAGEngine._init_vector_store() creates/loads ChromaDB collection
   - Calls _validate_collection_dim() to detect/recover from dim mismatch
4. For each tracked file (parallelized):
     a. Read content
     b. Choose chunker based on extension and config
     c. For each chunk: call embedding_provider.embed([chunk])
     d. Add to vector store with metadata {path, chunk_index, ...}
5. Indexing complete; engine is ready for queries

Parallelization: The indexing pipeline uses ThreadPoolExecutor for parallel embedding calls (the embedding step is the bottleneck). The chunking is fast and sequential per file.
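
The shape of that parallel step might look like the sketch below. `embed_chunks_parallel` and `fake_embed` are illustrative names, and the one-chunk-per-call batching mirrors step 4c above rather than any confirmed implementation detail.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Vector = List[float]


def embed_chunks_parallel(
    chunks: Sequence[str],
    embed: Callable[[List[str]], List[Vector]],
    max_workers: int = 4,
) -> List[Vector]:
    """Embed each chunk in a worker thread, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which worker finishes first.
        batches = list(pool.map(lambda chunk: embed([chunk]), chunks))
    return [batch[0] for batch in batches]


def fake_embed(texts: List[str]) -> List[Vector]:
    # Stand-in provider: one 1-dimensional vector per text.
    return [[float(len(text))] for text in texts]
```

Threads suit this step when the embedding call is I/O-bound, as it is for a cloud provider; the benefit for a local model depends on how much of its work releases the GIL.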

Incremental Updates: When a file's mtime changes (detected by pathlib.Path.stat().st_mtime), delete_documents_by_path() is called first, then the file is re-indexed. This is critical for the auto-sync flow (see Configuration below).
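
The mtime check can be sketched like this. `find_changed_files` and the `last_seen` dictionary are hypothetical; in the flow described above, each returned path would then go through delete_documents_by_path() followed by index_file().

```python
from pathlib import Path
from typing import Dict, Iterable, List


def find_changed_files(paths: Iterable[Path], last_seen: Dict[str, float]) -> List[Path]:
    """Return paths whose mtime differs from the recorded one, updating the record."""
    changed: List[Path] = []
    for path in paths:
        mtime = path.stat().st_mtime
        if last_seen.get(str(path)) != mtime:
            changed.append(path)
            last_seen[str(path)] = mtime
    return changed
```

A file that has never been seen has no recorded mtime, so it is reported as changed on the first scan.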

Path resolution resilience: index_file() falls back to os.getcwd() if the base_dir-relative path doesn't exist. This handles batched test conditions where the subprocess CWD differs from the project root (e.g., a test chdir'ing into tests/artifacts/live_gui_workspace_*/ for fixture isolation). Without the fallback, indexing silently skipped files in those conditions.

Dimension Mismatch Protection

_init_vector_store() calls _validate_collection_dim() after creating the collection. The validation inspects the first existing vector's dim and compares it to the current embedding provider's output. On mismatch (e.g., the user switched from Gemini 3072-dim to local 384-dim, or vice versa, or a prior run populated the collection with a different model), the chroma directory is wiped via shutil.rmtree (with the client closed first to release file handles) and the collection is recreated with the correct dim.

Why this exists: Without validation, dim-mismatched upserts silently corrupt the collection. The next search() raises chromadb.errors.InvalidDimensionError: Collection expecting embedding with dimension of X, got Y, the AI request never reaches 'done' status, and the live_gui test polling times out at 50 × 0.5 s = 25 s. This pattern was the dominant cause of tier-3-live_gui failures in the 2026-06-08 to 2026-06-10 window.
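
The core comparison is small enough to show in isolation. The helper below is a hypothetical sketch of the decision only; it does not touch ChromaDB, and the wipe-and-recreate step described above is left to the caller.

```python
from typing import Optional, Sequence


def collection_dim_matches(
    stored_vector: Optional[Sequence[float]],
    probe_vector: Sequence[float],
) -> bool:
    """True when the collection is empty or its vectors match the provider's dimension.

    stored_vector: the first vector already in the collection, or None if empty.
    probe_vector:  a vector freshly produced by the current embedding provider.
    """
    if stored_vector is None:
        return True  # nothing stored yet, so nothing can conflict
    return len(stored_vector) == len(probe_vector)
```

For example, a collection populated with 3072-dim vectors fails the check against a 384-dim probe, which is the case that triggers the rebuild.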

Regression tests in tests/test_rag_engine.py: test_rag_collection_dim_mismatch_recreates_collection, test_rag_collection_dim_match_preserves_collection.

Query Flow

When ai_client.send(rag_engine=engine) is called:

1. send() receives user_message
2. If rag_engine is not None:
     a. rag_engine.search(user_message, top_k=5) -> list of {text, metadata, distance}
     b. If results non-empty: inject [RETRIEVED CONTEXT] block into md_content
     c. The block contains the top_k fragments, formatted as:
        ```
        [RETRIEVED CONTEXT]
        File: path/to/file.py (chunk 0)
        <chunk text>

        File: path/to/another.py (chunk 2)
        <chunk text>
        ...
        ```
3. send() proceeds to the provider call with the augmented md_content

The injection point is before the system prompt construction. This means the retrieved context is treated as part of the project's tracked content, not as ad-hoc advice.
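
A sketch of how the block in step 2c could be assembled is given below. The function names are hypothetical, the metadata keys `path` and `chunk_index` are taken from the indexing flow above, and placing the block ahead of md_content is an assumption based on the overview's "prepend" wording.

```python
from typing import Any, Dict, List


def format_retrieved_context(results: List[Dict[str, Any]]) -> str:
    """Render search results as a [RETRIEVED CONTEXT] block."""
    lines = ["[RETRIEVED CONTEXT]"]
    for result in results:
        meta = result.get("metadata", {})
        lines.append(f"File: {meta.get('path', '?')} (chunk {meta.get('chunk_index', '?')})")
        lines.append(result["text"])
        lines.append("")  # blank line between fragments
    return "\n".join(lines).rstrip("\n")


def inject_rag_context(md_content: str, results: List[Dict[str, Any]]) -> str:
    """Prepend the context block; return md_content unchanged when there are no results."""
    if not results:
        return md_content
    return format_retrieved_context(results) + "\n\n" + md_content
```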

Public Methods

# Index a single file
rag_engine.index_file(path: str) -> None

# Search the index
rag_engine.search(query: str, top_k: int = 5) -> List[Dict[str, Any]]
# Returns: [{"text": str, "metadata": dict, "distance": float}, ...]

# Index management
rag_engine.add_documents(ids: List[str], texts: List[str], metadatas: Optional[List[dict]] = None) -> None
rag_engine.delete_documents(ids: List[str]) -> None
rag_engine.delete_documents_by_path(path: str) -> None
rag_engine.get_all_indexed_paths() -> List[str]
rag_engine.is_empty() -> bool

Configuration

RAG is configured via the project's manual_slop.toml:

[rag]
enabled = true
embedding_provider = "gemini"  # or "local"
chunk_size = 1000
chunk_overlap = 200

[rag.vector_store]
provider = "chroma"              # "chroma" | "mock"
collection_name = "manual_slop"  # the chroma subdir under .slop_cache/
url = ""                         # future: external HTTP vector store
api_key = ""                     # future: external HTTP auth
mcp_server = ""                  # future: MCP-based external RAG bridge
mcp_tool = ""                    # future: tool name on the MCP server

RAGConfig + VectorStoreConfig Schema (src/models.py)

@dataclass
class VectorStoreConfig:
    provider:        str                              # "chroma" | "mock"
    url:             Optional[str] = None             # future: external HTTP
    api_key:         Optional[str] = None             # future: external HTTP auth
    collection_name: str = "manual_slop"
    mcp_server:      Optional[str] = None             # future: MCP bridge
    mcp_tool:        Optional[str] = None             # future: MCP tool name

@dataclass
class RAGConfig:
    enabled:            bool = False
    vector_store:       VectorStoreConfig = field(default_factory=lambda: VectorStoreConfig(provider='mock'))
    embedding_provider: str = 'gemini'                 # "gemini" | "local"
    chunk_size:         int = 1000
    chunk_overlap:      int = 200

What about the fields the old doc showed? The 2026-06-10 docs sync verified against src/models.py:1029-1040 that the previous RAGConfig schema was fictional: most of the fields it listed never existed in the real dataclass. Specifically:

  • ast_chunking_enabled does not exist anywhere in src/. There is no ChunkingConfig class either; an earlier draft of this note claimed one existed, and that claim was wrong.
  • vector_store_backend and vector_store_path never existed on RAGConfig (they were a flattened version of the now-nested VectorStoreConfig).
  • auto_index_on_load and auto_sync_interval_seconds do not exist anywhere in src/. They were aspirational; the actual index-on-load and auto-sync behavior is wired in RAGEngine and the controller's mma_state_update flow, not via persisted config.
  • top_k is real, but it is a runtime parameter to RAGEngine.search(query, top_k=5) and RAGEngine._search_mcp(query, top_k=5) (src/rag_engine.py:339, 322), not a field on RAGConfig. The old doc confused "config field" with "search parameter."
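
To show how the nested schema lines up with the TOML layout, here is a self-contained sketch that rebuilds both dataclasses and populates them from a parsed dictionary (the shape a TOML parser such as `tomllib` would produce). `rag_config_from_dict` is a hypothetical helper, and mapping empty strings to `None` is an assumption, not confirmed loader behavior.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class VectorStoreConfig:
    provider: str
    url: Optional[str] = None
    api_key: Optional[str] = None
    collection_name: str = "manual_slop"
    mcp_server: Optional[str] = None
    mcp_tool: Optional[str] = None


@dataclass
class RAGConfig:
    enabled: bool = False
    vector_store: VectorStoreConfig = field(
        default_factory=lambda: VectorStoreConfig(provider="mock")
    )
    embedding_provider: str = "gemini"
    chunk_size: int = 1000
    chunk_overlap: int = 200


def rag_config_from_dict(data: Dict[str, Any]) -> RAGConfig:
    """Build a RAGConfig from a parsed config dict with an optional 'rag' table."""
    rag = dict(data.get("rag", {}))
    vs = rag.pop("vector_store", None)
    if vs is not None:
        # Assumption: treat empty strings as "not set".
        cleaned = {key: (value if value != "" else None) for key, value in vs.items()}
        rag["vector_store"] = VectorStoreConfig(**cleaned)
    return RAGConfig(**rag)
```

Passing an unknown key such as top_k raises a TypeError from the dataclass constructor, which is one way a loader could surface stale config fields early.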

Behavior When Disabled

If enabled = false (the default), RAGEngine is never constructed. ai_client.send() receives rag_engine=None and the integration is a no-op. The lazy-loading of chromadb, sentence_transformers, and google.genai is also skipped, so there is zero overhead for projects that don't use RAG.

Auto-Sync

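A background thread periodically scans tracked files for mtime changes and re-indexes them. This keeps the vector store consistent with on-disk changes without requiring explicit user action. There is no auto_sync_interval_seconds field in RAGConfig (see the schema note above); the behavior is wired in RAGEngine and the controller's mma_state_update flow rather than driven by persisted config.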

The sync uses pathlib.Path.stat().st_mtime for change detection (same mechanism as the file cache in file_cache.py). For very large projects, the sync can be tuned to skip files above a size threshold.


Cross-System Integration

ai_client.send() Integration

See guide_architecture.md#rag-integration for the full dispatch flow. Summary:

def send(md_content, user_message, ..., rag_engine=None) -> str:
    if rag_engine is not None:
        retrieved = rag_engine.search(user_message, top_k=5)  # top_k is a search() argument, not a RAGConfig field
        if retrieved:
            md_content = _inject_rag_context(md_content, retrieved)
    ...

The injection is a no-op if:

  • rag_engine is None
  • rag_engine.is_empty() (index has no documents)
  • search() returns no results above the distance threshold
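
The three no-op conditions can be exercised with a stub engine, as in the sketch below. `SearchEngine`, `retrieve_for_prompt`, and `StubEngine` are all hypothetical; the stub uses substring matching in place of vector search purely so the example runs without an index.

```python
from typing import Any, Dict, List, Optional, Protocol


class SearchEngine(Protocol):
    def is_empty(self) -> bool: ...
    def search(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]: ...


def retrieve_for_prompt(
    engine: Optional[SearchEngine], user_message: str, top_k: int = 5
) -> List[Dict[str, Any]]:
    """Return fragments to inject, or [] when any no-op condition holds."""
    if engine is None or engine.is_empty():
        return []
    return engine.search(user_message, top_k=top_k)


class StubEngine:
    """Stand-in engine: substring match instead of vector similarity."""

    def __init__(self, docs: List[str]) -> None:
        self._docs = docs

    def is_empty(self) -> bool:
        return not self._docs

    def search(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
        hits = [doc for doc in self._docs if query.lower() in doc.lower()]
        return [{"text": doc, "metadata": {}, "distance": 0.0} for doc in hits[:top_k]]
```

An empty return value covers all three cases, so the caller only needs a single truthiness check before injecting.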

MMA Worker Integration

The ConductorEngine does not construct RAGEngine itself. Workers receive context via md_content which is built by the caller. To use RAG in workers:

  1. Construct a RAGEngine in the caller (typically AppController or test harness).
  2. Pass it to multi_agent_conductor.run_worker_lifecycle(..., rag_engine=...) (if supported) or to the test invocation.
  3. The worker passes it to ai_client.send(rag_engine=...).

Note: As of 2026-06-02, the direct rag_engine parameter on run_worker_lifecycle is not yet implemented. Workers currently rely on the md_content already being augmented by the caller, or on Tier 4 / Tier 2 setting up the augmentation before spawning workers.

GUI Integration

The GUI's RAG panel (under AI Settings → RAG) provides:

  • Status indicator: RAGEngine.is_empty() → "Empty" / "Indexed N chunks"
  • Manual search box: for testing retrieval quality without sending a full AI call
  • Re-index button: forces a full rebuild of the index
  • Settings editor: modifies RAGConfig fields and writes back to manual_slop.toml

The RAG panel also surfaces the auto-sync status (last sync time, files indexed, files pending re-index).


Testing

Unit Tests

  • tests/test_rag_engine.py: RAGEngine basic lifecycle with mock ChromaDB and mock embedding provider
  • tests/test_rag_integration.py: End-to-end indexing + search + retrieval

Simulation Tests

  • tests/test_rag_gui_presence.py — Verifies the RAG panel renders correctly
  • tests/test_rag_visual_sim.py — Visual verification of the RAG search results panel

Stress Tests

  • tests/test_rag_phase4_stress.py — Indexes 1000+ files, measures retrieval latency
  • tests/test_rag_phase4_final_verify.py — End-to-end verification of RAG-augmented AI responses

Test Patterns

The standard pattern for testing RAG-augmented calls:

def test_rag_augmented_send(live_gui):
    # `client` is the test's API client; its construction is not shown in this guide.
    # 1. Set up project with RAG enabled
    client.set_rag_config(enabled=True, embedding_provider="local")
    client.reindex_project()

    # 2. Send a question that requires retrieval
    response = client.send("How does the Execution Clutch work?")

    # 3. Verify the response references the retrieved content
    # (The exact assertion depends on what was indexed)
    assert response

For unit tests that don't need real embedding models, the BaseEmbeddingProvider is mocked to return deterministic vectors (e.g., based on the hash of the input text).
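
One possible shape for such a mock is sketched below. `HashEmbeddingProvider` is a hypothetical class that follows the embed() contract without subclassing BaseEmbeddingProvider, so the snippet stays self-contained; the vectors carry no semantic meaning and are only useful for exercising the plumbing.

```python
import hashlib
from typing import List


class HashEmbeddingProvider:
    """Deterministic stand-in: derives each vector from the SHA-256 of the text."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def embed(self, texts: List[str]) -> List[List[float]]:
        vectors: List[List[float]] = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Cycle through the 32 digest bytes and scale each into [0, 1].
            vectors.append([digest[i % len(digest)] / 255.0 for i in range(self.dim)])
        return vectors
```

The same text always maps to the same vector, so assertions on stored or retrieved documents are stable across runs.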


Edge Cases & Limitations

  1. Empty Index: If the index has no documents, search() returns [] and no context is injected. The AI call proceeds normally with just the explicit file context.

  2. Network Failures (Gemini Embeddings): If the Gemini API is unreachable, GeminiEmbeddingProvider.embed() raises an exception. The caller (typically _chunk_code → index_file → RAG indexer) should handle this gracefully and either retry or fall back to the local provider.

  3. Stale Index: Auto-sync runs periodically but not on every read. If a file is changed between sync intervals, the index may be stale. The delete_documents_by_path + index_file cycle is atomic per file, so a partial sync leaves the index in a consistent (if incomplete) state.

  4. Large Files: A single file larger than chunk_size is split into multiple chunks with overlap. There's no upper limit on the number of chunks per file, but very large files (>10MB) may slow down indexing significantly.

  5. Binary Files: RAG only handles text files. Binary files (images, compiled Python, etc.) are skipped during indexing with a warning logged to comms_log.

  6. Cross-Project Queries: The vector store is per-project (<project_dir>/.slop_cache/chroma_<collection_name>/). Cross-project retrieval is not supported; each project has its own isolated index.

  7. Concurrent Writes: ChromaDB's PersistentClient is single-writer. If multiple processes try to write to the same index simultaneously, ChromaDB will raise. Manual Slop uses a threading.Lock to serialize writes from the auto-sync thread and the manual re-index button; the lock covers threads within one process only, so a second process pointed at the same index is not protected by it.
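
The write serialization in item 7 follows a common pattern, sketched here with an in-memory dictionary standing in for the collection. `SerializedIndexWriter` is a hypothetical class and does not reflect the actual attribute or method names in RAGEngine.

```python
import threading
from typing import Dict


class SerializedIndexWriter:
    """Guards every write to a shared store with a single lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._docs: Dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        # Only one thread at a time may mutate the store.
        with self._lock:
            self._docs[doc_id] = text

    def count(self) -> int:
        with self._lock:
            return len(self._docs)
```

A threading.Lock coordinates threads inside one process; coordinating separate processes would need a different mechanism, such as a file lock.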


Future Work

  • External RAG Bridge — Connect to remote vector databases (e.g., a managed Pinecone or Weaviate) via MCP. The _search_mcp method is a placeholder for this.
  • Hybrid Search — Combine dense (vector) retrieval with sparse (BM25) retrieval for better recall on code keywords.
  • Re-ranking — Apply a cross-encoder reranker to the top-k results before injection to improve precision.
  • Caching — Cache query results in memory to avoid re-embedding for repeated questions.
  • Provider Routing — Allow per-query provider selection (e.g., use Gemini for general queries, local for code).

See guide_tools.md for the MCP tool inventory; see guide_architecture.md for the dispatch pipeline.