RAG (Retrieval-Augmented Generation)



Overview

Manual Slop integrates Retrieval-Augmented Generation (RAG) to extend the AI's working context beyond the explicit file list. When a project is RAG-enabled, the system maintains a vector index of file content; AI calls can retrieve semantically similar fragments at query time and prepend them to the prompt.

The RAG implementation is pluggable: the vector store, the embedding provider, and the chunking strategy are all configurable per project. The default backend is ChromaDB (local persistent), the default embedding is Gemini Embedding 001 (cloud), and the default chunking is character-based with overlap (with AST-aware chunking for Python files when enabled).

This guide covers:

  1. Architecture: Where RAG fits in the dispatch pipeline
  2. Components: RAGEngine, embedding providers, vector store
  3. Data Flow: Indexing, query, retrieval, injection
  4. Configuration: RAGConfig schema and TOML settings
  5. Verification: Test infrastructure and known edge cases

Architecture

RAG sits between the project's tracked files and the AI provider's input prompt. It is not an internal AI call — it is a pre-processing step that augments md_content before the provider sees it.

                ┌─────────────────────────────────┐
                │ AppController / ConductorEngine │
                │ (caller of ai_client.send)      │
                └────────────┬────────────────────┘
                             │ constructs RAGEngine once per project
                             ▼
        ┌────────────────────────────────────────────┐
        │ RAGEngine                                  │
        │  ├─ EmbeddingProvider (Local or Gemini)    │
        │  ├─ VectorStore (ChromaDB persistent)      │
        │  └─ Chunkers (_chunk_text, _chunk_code)    │
        └────────────┬───────────────────────────────┘
                     │ on every ai_client.send() call:
                     │   rag_engine.search(user_message) -> fragments
                     ▼
        ┌────────────────────────────────────────────┐
        │ ai_client.send(rag_engine=...)             │
        │   injects [RETRIEVED CONTEXT] block        │
        │   into md_content before provider call     │
        └────────────────────────────────────────────┘

Lifecycle:

  • The AppController constructs a single RAGEngine per project load (lazily, when the project is first opened or when a RAG-related setting changes).
  • The RAGEngine is passed through to ai_client.send() for every AI call from the main discussion flow.
  • For Tier 3 workers spawned by the MMA, the caller is responsible for constructing the engine (typically with the same configuration as the main discussion); the ConductorEngine does not construct one itself (see MMA Worker Integration).
  • If a project disables RAG, rag_engine=None is passed to send() and the integration is a no-op.

Why caller-owned? The RAG engine is decoupled from ai_client so that the same module can be reused by the GUI's RAG panel for direct queries, by MMA workers for ticket-specific retrieval, and by future automation scripts. ai_client only knows how to use an engine if one is provided.


Components

RAGEngine (src/rag_engine.py)

The central class. It owns the embedding provider and the vector store, and exposes high-level methods for indexing and search.

class RAGEngine:
    def __init__(self, config: models.RAGConfig, base_dir: str = "."):
        ...

Construction: Takes a RAGConfig (from src/models.py) and a base_dir. The config specifies the embedding provider type, the vector store path, the chunk size, and the chunk overlap.

Internal state:

  • embedding_provider: BaseEmbeddingProvider — set by _init_embedding_provider
  • vector_store — a ChromaDB Collection (or a stub for tests)
  • chunk_size: int — character count per chunk
  • chunk_overlap: int — overlap between adjacent chunks

Embedding Providers

Two providers are implemented; new ones can be added by subclassing BaseEmbeddingProvider.

BaseEmbeddingProvider

class BaseEmbeddingProvider:
    def embed(self, texts: List[str]) -> List[List[float]]:
        """Embed a batch of texts. Returns one vector per input text."""
        ...

The contract: embed() takes a list of strings and returns one float vector per input text, all of the same length. The vector dimensionality is provider-specific (e.g., 384 for all-MiniLM-L6-v2; gemini-embedding-001 defaults to 3072 and can be configured for smaller sizes such as 768).

LocalEmbeddingProvider

Uses sentence-transformers (all-MiniLM-L6-v2 by default) for embedding.

  • Pros: Fully local, no API quota, deterministic.
  • Cons: Lower-quality embeddings than cloud models for code; CPU/GPU usage during indexing.
  • Default model: all-MiniLM-L6-v2 (384 dimensions, ~80MB download on first use).

class LocalEmbeddingProvider(BaseEmbeddingProvider):
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        ...

GeminiEmbeddingProvider

Uses the Gemini Embedding 001 model via the google-genai SDK.

  • Pros: Higher-quality embeddings, especially for code; no local model download.
  • Cons: Requires Gemini API key, network round-trip per embedding call, subject to API quotas.

class GeminiEmbeddingProvider(BaseEmbeddingProvider):
    def __init__(self, model_name: str = 'gemini-embedding-001'):
        ...

Lazy Loading

The heavy dependencies (sentence_transformers, google.genai, chromadb) are loaded lazily via _get_sentence_transformers(), _get_google_genai(), _get_chromadb(). This means RAG is opt-in: a project that doesn't enable RAG pays no import-time cost.

Vector Store

ChromaDB is the default persistent vector store. The store is created at <project_dir>/.rag/chroma/ by default (configurable via RAGConfig.vector_store_path).

def _init_vector_store(self):
    if self.config.vector_store_backend == "chromadb":
        client = chromadb.PersistentClient(path=...)
        self.vector_store = client.get_or_create_collection(name=...)
    else:
        raise NotImplementedError(...)

Backends:

  • chromadb (default) — local persistent, single-process
  • Future: External RAG Bridge via MCP (e.g., a remote vector database server)

The _search_mcp method is a placeholder for the future external bridge integration; current local-only mode uses vector_store.query() directly.

Chunking Strategies

Two strategies are implemented. The choice is made per-file based on extension and config.

Character-Based (_chunk_text)

Default for non-Python files and for Python files when AST chunking is disabled.

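def _chunk_text(self, content: str) -> List[str]:
    """Character-based chunking with overlap."""
    # Guard: with overlap >= chunk_size, start would never advance.
    if self.chunk_overlap >= self.chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(content):
        end = min(start + self.chunk_size, len(content))
        chunks.append(content[start:end])
        if end >= len(content):
            break
        # Step back by the overlap so adjacent chunks share context.
        start = end - self.chunk_overlap
    return chunks

  • Default chunk size: 1000 characters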
  • Default overlap: 200 characters
  • Edge cases: Empty files return []; single-chunk files return [content].
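
To make the overlap behavior concrete, here is a standalone version of the same loop run on a tiny input. The free function `chunk_text` and the guard against `chunk_overlap >= chunk_size` are assumptions of this sketch (without such a guard the loop would not advance), not a copy of the engine's method.

```python
from typing import List


def chunk_text(content: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Standalone character-based chunker with overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks: List[str] = []
    start = 0
    while start < len(content):
        end = min(start + chunk_size, len(content))
        chunks.append(content[start:end])
        if end >= len(content):
            break
        # The next chunk starts chunk_overlap characters before this one ended.
        start = end - chunk_overlap
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the overlap at work. With the defaults (1000 and 200), every chunk after the first contributes 800 new characters.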

AST-Aware (_chunk_code)

Used for .py files when RAGConfig.ast_chunking_enabled = True.

def _chunk_code(self, content: str, file_path: str) -> List[str]:
    """AST-aware chunking for Python code."""
    # Parses with stdlib ast
    # Splits on top-level def/class boundaries
    # Each chunk is a complete top-level definition with its docstring
    ...

  • Strategy: Each top-level function, class, or constant block becomes one chunk. Docstrings are preserved as the first line of the chunk for context.
  • Pros: Semantic boundaries produce more meaningful retrieval results. A query for "how does X work" is more likely to return the entire definition of X rather than a fragment.
  • Cons: Requires valid Python; syntax errors fall back to character-based chunking.

The chunker uses stdlib ast (not tree-sitter) to avoid pulling tree-sitter for a feature that only handles Python.
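
A minimal sketch of AST-based splitting with stdlib `ast`, under stated assumptions: `chunk_python_source` is a hypothetical free function, it only emits `def` and `class` definitions (the constant-block handling described above is omitted), and on a syntax error it returns the whole file as one chunk where the real engine is described as falling back to character-based chunking.

```python
import ast
from typing import List


def chunk_python_source(source: str) -> List[str]:
    """Return one chunk per top-level function or class definition."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Simplified fallback for this sketch only.
        return [source] if source else []

    lines = source.splitlines()
    chunks: List[str] = []
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        # Decorators sit above node.lineno, so start from the earliest one.
        first = min([node.lineno] + [d.lineno for d in node.decorator_list])
        # lineno/end_lineno are 1-based and inclusive (Python 3.8+).
        chunks.append("\n".join(lines[first - 1 : node.end_lineno]))
    return chunks


SAMPLE = '''import os

X = 1

def greet(name):
    """Say hello."""
    return f"hello {name}"

class Box:
    def get(self):
        return X
'''

for chunk in chunk_python_source(SAMPLE):
    print(chunk)
    print("---")
```

Because each chunk spans a full definition, the docstring and body always travel together, which is the property the Pros bullet above relies on.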


Data Flow

Indexing Flow

When a project is loaded with RAG enabled (and auto_index_on_load = true), the RAGEngine is populated by indexing all tracked files.

1. Project load: AppController reads [rag] section from manual_slop.toml
2. AppController constructs RAGEngine(config)
3. RAGEngine._init_vector_store() creates/loads ChromaDB collection
4. For each tracked file (parallelized):
     a. Read content
     b. Choose chunker based on extension and config
     c. For each chunk: call embedding_provider.embed([chunk])
     d. Add to vector store with metadata {path, chunk_index, ...}
5. Indexing complete; engine is ready for queries

Parallelization: The indexing pipeline uses ThreadPoolExecutor for parallel embedding calls (the embedding step is the bottleneck). The chunking is fast and sequential per file.
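
The parallel embedding step might look roughly like this. It is a sketch, not the project's indexer: `embed_chunks_parallel` and `fake_embed` are hypothetical names, and `fake_embed` stands in for `embedding_provider.embed()` so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Vector = List[float]


def embed_chunks_parallel(
    chunks: List[str],
    embed: Callable[[List[str]], List[Vector]],
    max_workers: int = 4,
) -> List[Vector]:
    """Embed each chunk in a worker thread, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of
        # which call finishes first.
        results = pool.map(lambda chunk: embed([chunk])[0], chunks)
        return list(results)


def fake_embed(texts: List[str]) -> List[Vector]:
    # Hypothetical stand-in for a real embedding provider.
    return [[float(len(t)), float(t.count(" "))] for t in texts]


vectors = embed_chunks_parallel(["alpha", "beta gamma", ""], fake_embed)
print(vectors)  # [[5.0, 0.0], [10.0, 1.0], [0.0, 0.0]]
```

Threads help here because a cloud embedding call spends most of its time waiting on the network. For a local model the gain depends on whether the underlying library releases the GIL during inference.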

Incremental Updates: When a file's mtime changes (detected by pathlib.Path.stat().st_mtime), delete_documents_by_path() is called first, then the file is re-indexed. This is critical for the auto-sync flow (see Configuration below).
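
The mtime comparison can be sketched as below. `find_changed_files` and the `seen_mtimes` dictionary are assumptions made for illustration; in the engine, each path reported as changed would go through delete_documents_by_path() followed by index_file().

```python
import os
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def find_changed_files(paths: Iterable[Path], seen_mtimes: Dict[Path, float]) -> List[Path]:
    """Return paths whose mtime differs from the recorded one, and record the new value."""
    changed: List[Path] = []
    for path in paths:
        mtime = path.stat().st_mtime
        if seen_mtimes.get(path) != mtime:
            changed.append(path)
            seen_mtimes[path] = mtime
    return changed


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.py"
    target.write_text("x = 1\n")
    seen: Dict[Path, float] = {}

    first_scan = find_changed_files([target], seen)   # new file: reported
    second_scan = find_changed_files([target], seen)  # unchanged: not reported

    # Bump the mtime explicitly so the demo does not depend on clock resolution.
    stat = target.stat()
    os.utime(target, (stat.st_atime, stat.st_mtime + 10))
    third_scan = find_changed_files([target], seen)   # modified: reported again

print(len(first_scan), len(second_scan), len(third_scan))  # 1 0 1
```

Comparing for inequality rather than "newer than" also catches files whose mtime moved backwards, for example after a checkout of an older revision.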

Query Flow

When ai_client.send(rag_engine=engine) is called:

1. send() receives user_message
2. If rag_engine is not None:
     a. rag_engine.search(user_message, top_k=5) -> list of {text, metadata, distance}
     b. If results non-empty: inject [RETRIEVED CONTEXT] block into md_content
     c. The block contains the top_k fragments, formatted as:
        ```
        [RETRIEVED CONTEXT]
        File: path/to/file.py (chunk 0)
        <chunk text>

        File: path/to/another.py (chunk 2)
        <chunk text>
        ...
        ```
3. send() proceeds to the provider call with the augmented md_content

The injection point is before the system prompt construction. This means the retrieved context is treated as part of the project's tracked content, not as ad-hoc advice.
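
A sketch of how the block could be rendered from search results. The function names here are hypothetical (the real helper is referred to elsewhere in this guide as _inject_rag_context), the metadata keys `path` and `chunk_index` follow the indexing flow above, and prepending the block to md_content is an assumption based on the overview.

```python
from typing import Any, Dict, List


def format_retrieved_context(results: List[Dict[str, Any]]) -> str:
    """Render search results in the [RETRIEVED CONTEXT] layout."""
    parts = []
    for item in results:
        meta = item.get("metadata", {})
        header = f"File: {meta.get('path', '?')} (chunk {meta.get('chunk_index', '?')})"
        parts.append(f"{header}\n{item['text']}")
    return "[RETRIEVED CONTEXT]\n" + "\n\n".join(parts)


def inject_rag_context(md_content: str, results: List[Dict[str, Any]]) -> str:
    """Prepend the block to md_content; leave md_content unchanged if nothing was retrieved."""
    if not results:
        return md_content
    return format_retrieved_context(results) + "\n\n" + md_content


sample_results = [
    {"text": "def a(): pass", "metadata": {"path": "src/a.py", "chunk_index": 0}, "distance": 0.12},
    {"text": "def b(): pass", "metadata": {"path": "src/b.py", "chunk_index": 2}, "distance": 0.31},
]
print(inject_rag_context("# Project files", sample_results))
```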

Public Methods

# Index a single file
rag_engine.index_file(path: str) -> None

# Search the index
rag_engine.search(query: str, top_k: int = 5) -> List[Dict[str, Any]]
# Returns: [{"text": str, "metadata": dict, "distance": float}, ...]

# Index management
rag_engine.add_documents(ids: List[str], texts: List[str], metadatas: Optional[List[dict]] = None) -> None
rag_engine.delete_documents(ids: List[str]) -> None
rag_engine.delete_documents_by_path(path: str) -> None
rag_engine.get_all_indexed_paths() -> List[str]
rag_engine.is_empty() -> bool

Configuration

RAG is configured via the project's manual_slop.toml:

[rag]
enabled = true
embedding_provider = "gemini"  # or "local"
chunk_size = 1000
chunk_overlap = 200
ast_chunking_enabled = true
vector_store_backend = "chromadb"
vector_store_path = ".rag/chroma"  # relative to project base_dir
auto_index_on_load = true
auto_sync_interval_seconds = 60  # background re-indexing
top_k = 5

RAGConfig Schema (src/models.py)

@dataclass
class RAGConfig:
    enabled: bool = False
    embedding_provider: str = "gemini"  # "local" | "gemini"
    chunk_size: int = 1000
    chunk_overlap: int = 200
    ast_chunking_enabled: bool = True
    vector_store_backend: str = "chromadb"
    vector_store_path: str = ".rag/chroma"
    auto_index_on_load: bool = True
    auto_sync_interval_seconds: int = 60
    top_k: int = 5
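
One way to turn a parsed [rag] table into the dataclass is sketched below. `rag_config_from_table` is a hypothetical helper, the dataclass is a trimmed copy of the schema above, and the overlap check is an added assumption rather than documented behavior. The input dict is what a TOML parser such as `tomllib` (Python 3.11+) would return for the [rag] section.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class RAGConfig:
    # Trimmed copy of the schema, just enough for the demo.
    enabled: bool = False
    embedding_provider: str = "gemini"
    chunk_size: int = 1000
    chunk_overlap: int = 200
    top_k: int = 5


def rag_config_from_table(table: Dict[str, Any]) -> RAGConfig:
    """Build a RAGConfig from a parsed [rag] table, ignoring unknown keys."""
    known = {f.name for f in fields(RAGConfig)}
    config = RAGConfig(**{k: v for k, v in table.items() if k in known})
    # Assumed sanity check: the overlap must be smaller than the chunk.
    if not 0 <= config.chunk_overlap < config.chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    return config


config = rag_config_from_table(
    {"enabled": True, "embedding_provider": "local", "unknown_key": 1}
)
print(config.enabled, config.embedding_provider, config.chunk_size)  # True local 1000
```

Keys missing from the table fall back to the dataclass defaults, which matches the way the TOML example and the schema share the same default values.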

Behavior When Disabled

If enabled = false (the default), RAGEngine is never constructed. ai_client.send() receives rag_engine=None and the integration is a no-op. Because chromadb, sentence_transformers, and google.genai are imported lazily, none of them is ever loaded, so projects that don't use RAG pay no import-time cost.

Auto-Sync

When auto_sync_interval_seconds > 0, a background thread periodically scans tracked files for mtime changes and re-indexes them. This keeps the vector store consistent with on-disk changes without requiring explicit user action.

The sync uses pathlib.Path.stat().st_mtime for change detection (same mechanism as the file cache in file_cache.py). For very large projects, the sync can be tuned to skip files above a size threshold.
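
A periodic background scan can be sketched with a daemon thread and a `threading.Event`. The `AutoSync` class is hypothetical and only illustrates the timing loop; the `scan` callable stands in for whatever performs the mtime check and re-index.

```python
import threading
from typing import Callable


class AutoSync:
    """Run `scan` every `interval_seconds` on a daemon thread until stopped."""

    def __init__(self, scan: Callable[[], None], interval_seconds: float):
        self._scan = scan
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        if self._interval > 0:  # mirrors "auto_sync_interval_seconds > 0"
            self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()

    def _run(self) -> None:
        # Event.wait returns True once stop() is called, which ends the loop
        # without waiting out the rest of the interval.
        while not self._stop.wait(self._interval):
            self._scan()


ran = threading.Event()
sync = AutoSync(scan=ran.set, interval_seconds=0.01)
sync.start()
scan_happened = ran.wait(timeout=2.0)
sync.stop()
print(scan_happened)
```

Waiting on the event instead of calling `time.sleep` lets `stop()` take effect immediately. In this sketch an exception raised by `scan` would end the thread, so a real loop would likely catch and log errors per iteration.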


Cross-System Integration

ai_client.send() Integration

See guide_architecture.md#rag-integration for the full dispatch flow. Summary:

def send(md_content, user_message, ..., rag_engine=None) -> str:
    if rag_engine is not None:
        retrieved = rag_engine.search(user_message, top_k=rag_engine.config.top_k)
        if retrieved:
            md_content = _inject_rag_context(md_content, retrieved)
    ...

The injection is a no-op if:

  • rag_engine is None
  • rag_engine.is_empty() (index has no documents)
  • search() returns no results within the distance threshold (a lower distance means a closer match)

MMA Worker Integration

The ConductorEngine does not construct RAGEngine itself. Workers receive context via md_content which is built by the caller. To use RAG in workers:

  1. Construct a RAGEngine in the caller (typically AppController or test harness).
  2. Pass it to multi_agent_conductor.run_worker_lifecycle(..., rag_engine=...) (if supported) or to the test invocation.
  3. The worker passes it to ai_client.send(rag_engine=...).

Note: As of 2026-06-02, the direct rag_engine parameter on run_worker_lifecycle is not yet implemented. Workers currently rely on the md_content already being augmented by the caller, or on Tier 4 / Tier 2 setting up the augmentation before spawning workers.

GUI Integration

The GUI's RAG panel (under AI Settings → RAG) provides:

  • Status indicator: RAGEngine.is_empty() → "Empty" / "Indexed N chunks"
  • Manual search box: for testing retrieval quality without sending a full AI call
  • Re-index button: forces a full rebuild of the index
  • Settings editor: modifies RAGConfig fields and writes back to manual_slop.toml

The RAG panel also surfaces the auto-sync status (last sync time, files indexed, files pending re-index).


Testing

Unit Tests

  • tests/test_rag_engine.py: RAGEngine basic lifecycle with mock ChromaDB and mock embedding provider
  • tests/test_rag_integration.py: End-to-end indexing + search + retrieval

Simulation Tests

  • tests/test_rag_gui_presence.py — Verifies the RAG panel renders correctly
  • tests/test_rag_visual_sim.py — Visual verification of the RAG search results panel

Stress Tests

  • tests/test_rag_phase4_stress.py — Indexes 1000+ files, measures retrieval latency
  • tests/test_rag_phase4_final_verify.py — End-to-end verification of RAG-augmented AI responses

Test Patterns

The standard pattern for testing RAG-augmented calls:

def test_rag_augmented_send(live_gui):
    # `client` is assumed to be the API client provided by the test harness;
    # its construction is omitted in this excerpt.

    # 1. Set up project with RAG enabled
    client.set_rag_config(enabled=True, embedding_provider="local")
    client.reindex_project()

    # 2. Send a question that requires retrieval
    response = client.send("How does the Execution Clutch work?")

    # 3. Verify the response references the retrieved content
    # (The exact assertion depends on what was indexed)
    assert response

For unit tests that don't need real embedding models, the BaseEmbeddingProvider is mocked to return deterministic vectors (e.g., based on the hash of the input text).
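
A hash-based mock along those lines could look like this. `HashEmbeddingProvider` is a hypothetical name and a sketch only; in the project it would presumably subclass BaseEmbeddingProvider, which is left out here so the example stays self-contained.

```python
import hashlib
from typing import List


class HashEmbeddingProvider:
    """Deterministic stand-in for an embedding provider in unit tests."""

    def __init__(self, dimensions: int = 8):
        # A SHA-256 digest has 32 bytes, which caps the vector length here.
        if not 1 <= dimensions <= 32:
            raise ValueError("dimensions must be between 1 and 32")
        self.dimensions = dimensions

    def embed(self, texts: List[str]) -> List[List[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Map the first `dimensions` bytes to floats in [0, 1].
            vectors.append([byte / 255.0 for byte in digest[: self.dimensions]])
        return vectors


provider = HashEmbeddingProvider(dimensions=4)
first, second = provider.embed(["same text", "same text"])
print(first == second)  # True
```

The vectors are stable across runs but carry no semantic meaning, so this kind of mock is suited to testing plumbing (IDs, metadata, counts) rather than retrieval quality.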


Edge Cases & Limitations

  1. Empty Index: If the index has no documents, search() returns [] and no context is injected. The AI call proceeds normally with just the explicit file context.

  2. Network Failures (Gemini Embeddings): If the Gemini API is unreachable, GeminiEmbeddingProvider.embed() raises an exception. The caller (typically the _chunk_code / index_file → RAG indexer path) should handle this gracefully and either retry or fall back to the local provider.

  3. Stale Index: Auto-sync runs periodically but not on every read. If a file is changed between sync intervals, the index may be stale. The delete_documents_by_path + index_file cycle is atomic per file, so a partial sync leaves the index in a consistent (if incomplete) state.

  4. Large Files: A single file larger than chunk_size is split into multiple chunks with overlap. There's no upper limit on the number of chunks per file, but very large files (>10MB) may slow down indexing significantly.

  5. Binary Files: RAG only handles text files. Binary files (images, compiled Python, etc.) are skipped during indexing with a warning logged to comms_log.

  6. Cross-Project Queries: The vector store is per-project (<project_dir>/.rag/chroma/). Cross-project retrieval is not supported; each project has its own isolated index.

  7. Concurrent Writes: ChromaDB's PersistentClient is single-writer. If multiple processes try to write to the same index simultaneously, ChromaDB will raise. Manual Slop uses a threading.Lock to serialize writes from the auto-sync thread and the manual re-index button; this covers threads within one process only, so a second process writing to the same project index is not protected.
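
For the network-failure case in item 2, retry-then-fallback handling might be sketched as follows. `embed_with_retry`, its parameters, and the `flaky` provider are all hypothetical; the broad `except Exception` is a simplification, since the actual exception types raised by the Gemini SDK are not specified here.

```python
import time
from typing import Callable, List, Optional

Vector = List[float]
EmbedFn = Callable[[List[str]], List[Vector]]


def embed_with_retry(
    texts: List[str],
    primary: EmbedFn,
    fallback: Optional[EmbedFn] = None,
    attempts: int = 3,
    base_delay: float = 0.5,
) -> List[Vector]:
    """Try `primary` up to `attempts` times with exponential backoff, then `fallback`."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return primary(texts)
        except Exception as exc:  # a real caller would catch narrower error types
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    if fallback is not None:
        return fallback(texts)
    raise RuntimeError("embedding failed after retries") from last_error


calls = {"n": 0}


def flaky(texts: List[str]) -> List[Vector]:
    # Simulated provider that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated outage")
    return [[1.0] for _ in texts]


print(embed_with_retry(["a", "b"], flaky, base_delay=0.0))  # [[1.0], [1.0]]
```

One caveat with the fallback branch: vectors from a different provider generally have a different dimensionality, so they cannot be mixed into a collection built with the primary provider. A fallback of this kind is likely to need a separate collection or a full re-index.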


Future Work

  • External RAG Bridge — Connect to remote vector databases (e.g., a managed Pinecone or Weaviate) via MCP. The _search_mcp method is a placeholder for this.
  • Hybrid Search — Combine dense (vector) retrieval with sparse (BM25) retrieval for better recall on code keywords.
  • Re-ranking — Apply a cross-encoder reranker to the top-k results before injection to improve precision.
  • Caching — Cache query results in memory to avoid re-embedding for repeated questions.
  • Provider Routing — Allow per-query provider selection (e.g., use Gemini for general queries, local for code).

See guide_tools.md for the MCP tool inventory; see guide_architecture.md for the dispatch pipeline.