docs(rag): new guide for RAG subsystem covering architecture, components, data flow, configuration, and testing
# RAG (Retrieval-Augmented Generation)

[Top](../README.md) | [Architecture](guide_architecture.md) | [MMA](guide_mma.md) | [Tools & IPC](guide_tools.md) | [Simulations](guide_simulations.md)

---

## Overview

Manual Slop integrates Retrieval-Augmented Generation (RAG) to extend the AI's working context beyond the explicit file list. When a project is RAG-enabled, the system maintains a vector index of file content; AI calls can retrieve semantically similar fragments at query time and prepend them to the prompt.

The RAG implementation is pluggable: the vector store, the embedding provider, and the chunking strategy are all configurable per project. The default backend is **ChromaDB** (local persistent), the default embedding is **Gemini Embedding 001** (cloud), and the default chunking is **character-based with overlap** (with **AST-aware chunking** for Python files when enabled).

This guide covers:

1. **Architecture**: where RAG fits in the dispatch pipeline
2. **Components**: `RAGEngine`, embedding providers, vector store
3. **Data Flow**: indexing, query, retrieval, injection
4. **Configuration**: `RAGConfig` schema and TOML settings
5. **Verification**: test infrastructure and known edge cases

---

## Architecture

RAG sits between the project's tracked files and the AI provider's input prompt. It is **not** an internal AI call; it is a pre-processing step that augments `md_content` before the provider sees it.

```
┌─────────────────────────────────┐
│ AppController / ConductorEngine │
│ (caller of ai_client.send)      │
└────────────┬────────────────────┘
             │ constructs RAGEngine once per project
             ▼
┌────────────────────────────────────────────┐
│ RAGEngine                                  │
│  ├─ EmbeddingProvider (Local or Gemini)    │
│  ├─ VectorStore (ChromaDB persistent)      │
│  └─ Chunkers (_chunk_text, _chunk_code)    │
└────────────┬───────────────────────────────┘
             │ on every ai_client.send() call:
             │ rag_engine.search(user_message) -> fragments
             ▼
┌────────────────────────────────────────────┐
│ ai_client.send(rag_engine=...)             │
│ injects [RETRIEVED CONTEXT] block          │
│ into md_content before provider call       │
└────────────────────────────────────────────┘
```

**Lifecycle**:
- The `AppController` constructs a single `RAGEngine` per project load (lazily, when the project is first opened or when a RAG-related setting changes).
- The `RAGEngine` is passed through to `ai_client.send()` for every AI call from the main discussion flow.
- For Tier 3 workers spawned by the MMA, the caller is responsible for constructing the engine (typically with the same configuration as the main discussion); the ConductorEngine does not construct one itself (see MMA Worker Integration).
- If a project disables RAG, `rag_engine=None` is passed to `send()` and the integration is a no-op.

**Why caller-owned?** The RAG engine is decoupled from `ai_client` so that the same module can be reused by the GUI's RAG panel for direct queries, by MMA workers for ticket-specific retrieval, and by future automation scripts. `ai_client` only knows how to *use* an engine if one is provided.

---

## Components

### `RAGEngine` (`src/rag_engine.py`)

The central class. It owns the embedding provider and the vector store, and exposes high-level methods for indexing and search.

```python
class RAGEngine:
    def __init__(self, config: models.RAGConfig, base_dir: str = "."):
        ...
```

**Construction**: Takes a `RAGConfig` (from `src/models.py`) and a `base_dir`. The config specifies the embedding provider type, the vector store path, the chunk size, and the chunk overlap.

**Internal state**:
- `embedding_provider: BaseEmbeddingProvider`: set by `_init_embedding_provider`
- `vector_store`: a ChromaDB `Collection` (or a stub for tests)
- `chunk_size: int`: character count per chunk
- `chunk_overlap: int`: overlap between adjacent chunks

### Embedding Providers

Two providers are implemented; new ones can be added by subclassing `BaseEmbeddingProvider`.

#### `BaseEmbeddingProvider`

```python
from typing import List


class BaseEmbeddingProvider:
    def embed(self, texts: List[str]) -> List[List[float]]:
        """Embed a batch of texts. Returns one vector per input text."""
        ...
```

The contract: `embed()` takes a list of strings and returns one float vector per input, all of the same length. The vector dimensionality is provider-specific (e.g., 384 for `all-MiniLM-L6-v2`; `gemini-embedding-001` returns 3072 dimensions by default and can be configured for smaller sizes such as 768).

#### `LocalEmbeddingProvider`

Uses **sentence-transformers** (`all-MiniLM-L6-v2` by default) for embedding.

- **Pros**: Fully local, no API quota, deterministic.
- **Cons**: Lower-quality embeddings than cloud models for code; CPU/GPU usage during indexing.
- **Default model**: `all-MiniLM-L6-v2` (384 dimensions, ~80MB download on first use).

```python
class LocalEmbeddingProvider(BaseEmbeddingProvider):
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        ...
```

#### `GeminiEmbeddingProvider`

Uses the **Gemini Embedding 001** model via the google-genai SDK.

- **Pros**: Higher-quality embeddings, especially for code; no local model download.
- **Cons**: Requires Gemini API key, network round-trip per embedding call, subject to API quotas.

```python
class GeminiEmbeddingProvider(BaseEmbeddingProvider):
    def __init__(self, model_name: str = 'gemini-embedding-001'):
        ...
```

#### Lazy Loading

The heavy dependencies (`sentence_transformers`, `google.genai`, `chromadb`) are loaded lazily via `_get_sentence_transformers()`, `_get_google_genai()`, and `_get_chromadb()`. This means RAG is opt-in: a project that doesn't enable RAG pays no import-time cost.
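
A minimal sketch of how such lazy getters might look, assuming a cached `importlib` lookup. The helper name `_lazy_import` is hypothetical and the real `_get_chromadb()` may be implemented differently; the stdlib `json` module stands in for a heavy dependency so the example runs without any third-party package installed.

```python
import importlib
from functools import lru_cache
from types import ModuleType


@lru_cache(maxsize=None)
def _lazy_import(module_name: str) -> ModuleType:
    """Import a module on first use and cache the module object."""
    return importlib.import_module(module_name)


def _get_chromadb() -> ModuleType:
    # Hypothetical getter: chromadb is not imported until this is first called.
    return _lazy_import("chromadb")


# Demonstration with a stdlib module standing in for a heavy dependency.
first = _lazy_import("json")
second = _lazy_import("json")
print(first is second)  # True: the second call is served from the cache
```

Because the import happens inside the getter rather than at module top level, merely importing the RAG module stays cheap.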

### Vector Store

ChromaDB is the default persistent vector store. The store is created at `<project_dir>/.rag/chroma/` by default (configurable via `RAGConfig.vector_store_path`).

```python
def _init_vector_store(self):
    if self.config.vector_store_backend == "chromadb":
        client = chromadb.PersistentClient(path=...)
        self.vector_store = client.get_or_create_collection(name=...)
    else:
        raise NotImplementedError(...)
```

**Backends**:
- `chromadb` (default): local persistent, single-process
- *Future*: External RAG Bridge via MCP (e.g., a remote vector database server)

The `_search_mcp` method is a placeholder for the future external bridge integration; current local-only mode uses `vector_store.query()` directly.
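
For tests that should not touch ChromaDB, a small in-memory stand-in can mimic the `Collection` calls used here. The sketch below is an assumption about what such a stub could look like, not the project's actual test double; the class name `InMemoryCollection` is hypothetical. It mirrors the nested-list shape that ChromaDB's `Collection.query()` returns (one inner list per query embedding) and ranks by cosine distance for simplicity, which may differ from the metric configured on a real collection.

```python
import math
from typing import Any, Dict, List, Optional


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


class InMemoryCollection:
    """Hypothetical stub exposing add(), count(), and query()."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Any]] = {}

    def add(self, ids: List[str], embeddings: List[List[float]],
            documents: List[str],
            metadatas: Optional[List[dict]] = None) -> None:
        metadatas = metadatas or [{} for _ in ids]
        for doc_id, emb, doc, meta in zip(ids, embeddings, documents, metadatas):
            self._rows[doc_id] = {"embedding": emb, "document": doc, "metadata": meta}

    def count(self) -> int:
        return len(self._rows)

    def query(self, query_embeddings: List[List[float]],
              n_results: int = 5) -> Dict[str, list]:
        out: Dict[str, list] = {"ids": [], "documents": [], "metadatas": [], "distances": []}
        for query in query_embeddings:
            # Score every stored row, then keep the n_results closest ones.
            scored = sorted(
                ((cosine_distance(query, row["embedding"]), doc_id, row)
                 for doc_id, row in self._rows.items()),
                key=lambda item: item[0],
            )[:n_results]
            out["ids"].append([doc_id for _, doc_id, _ in scored])
            out["documents"].append([row["document"] for _, _, row in scored])
            out["metadatas"].append([row["metadata"] for _, _, row in scored])
            out["distances"].append([dist for dist, _, _ in scored])
        return out


store = InMemoryCollection()
store.add(ids=["a", "b"], embeddings=[[1.0, 0.0], [0.0, 1.0]],
          documents=["alpha", "beta"],
          metadatas=[{"path": "a.py"}, {"path": "b.py"}])
result = store.query(query_embeddings=[[0.9, 0.1]], n_results=1)
print(result["ids"])  # [['a']]
```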

### Chunking Strategies

Two strategies are implemented. The choice is made per-file based on extension and config.

#### Character-Based (`_chunk_text`)

Default for non-Python files and for Python files when AST chunking is disabled.

```python
def _chunk_text(self, content: str) -> List[str]:
    """Character-based chunking with overlap."""
    # Assumes chunk_overlap < chunk_size; otherwise start never advances.
    chunks = []
    start = 0
    while start < len(content):
        end = min(start + self.chunk_size, len(content))
        chunks.append(content[start:end])
        if end >= len(content):
            break
        start = end - self.chunk_overlap
    return chunks
```

- **Default chunk size**: 1000 characters
- **Default overlap**: 200 characters
- **Edge cases**: Empty files return `[]`; single-chunk files return `[content]`.
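
To make the overlap arithmetic concrete, here is a standalone version of the same loop run with deliberately small numbers (chunk size 10, overlap 3). The free function `chunk_text` and its explicit `ValueError` guard are illustrative additions, not part of `RAGEngine`.

```python
from typing import List


def chunk_text(content: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Standalone copy of the character-based chunking loop."""
    if chunk_overlap >= chunk_size:
        # Illustrative guard: without it the loop would never advance.
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks: List[str] = []
    start = 0
    while start < len(content):
        end = min(start + chunk_size, len(content))
        chunks.append(content[start:end])
        if end >= len(content):
            break
        start = end - chunk_overlap  # step back so neighbours share text
    return chunks


alphabet = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(alphabet, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts 7 characters (size minus overlap) after the previous one, so the last 3 characters of one chunk reappear at the start of the next.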

#### AST-Aware (`_chunk_code`)

Used for `.py` files when `RAGConfig.ast_chunking_enabled = True`.

```python
def _chunk_code(self, content: str, file_path: str) -> List[str]:
    """AST-aware chunking for Python code."""
    # Parses with stdlib ast
    # Splits on top-level def/class boundaries
    # Each chunk is a complete top-level definition with its docstring
    ...
```

- **Strategy**: Each top-level function, class, or constant block becomes one chunk. Docstrings stay inside the chunk of the definition they document, which preserves context.
- **Pros**: Semantic boundaries produce more meaningful retrieval results. A query for "how does X work" is more likely to return the entire definition of X rather than a fragment.
- **Cons**: Requires valid Python; syntax errors fall back to character-based chunking.

The chunker uses stdlib `ast` (not tree-sitter) to avoid pulling in tree-sitter for a feature that only handles Python.
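
The body of `_chunk_code` is elided above, so the following is only a sketch of one way to split on top-level boundaries with stdlib `ast`; it is not the project's implementation. The function name `chunk_python_source` and the `fallback` parameter are hypothetical. As a simplification, comments and blank lines that sit between top-level nodes are dropped.

```python
import ast
from typing import Callable, List

DEF_NODES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def chunk_python_source(content: str,
                        fallback: Callable[[str], List[str]]) -> List[str]:
    """Split Python source into one chunk per top-level def/class."""
    try:
        tree = ast.parse(content)
    except SyntaxError:
        return fallback(content)  # e.g. character-based chunking

    lines = content.splitlines()
    chunks: List[str] = []
    pending: List[str] = []  # consecutive non-def statements (imports, constants)

    def flush() -> None:
        if pending:
            chunks.append("\n".join(pending))
            pending.clear()

    for node in tree.body:
        start = node.lineno
        if isinstance(node, DEF_NODES):
            if node.decorator_list:
                # Decorators sit above the def line; include them in the chunk.
                start = min(d.lineno for d in node.decorator_list)
            flush()
            chunks.append("\n".join(lines[start - 1:node.end_lineno]))
        else:
            pending.extend(lines[start - 1:node.end_lineno])
    flush()
    return chunks


SOURCE = '''import os
X = 1

def f():
    """Doc."""
    return X

class C:
    pass
'''

chunks = chunk_python_source(SOURCE, fallback=lambda text: [text])
broken = chunk_python_source("def (", fallback=lambda text: ["FALLBACK"])
print(len(chunks))  # 3
```

The three chunks are the import/constant block, the function `f` with its docstring, and the class `C`. `end_lineno` requires Python 3.8 or newer.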

---

## Data Flow

### Indexing Flow

When a project is loaded with RAG enabled, the `RAGEngine` is populated by indexing all tracked files.

```
1. Project load: AppController reads [rag] section from manual_slop.toml
2. AppController constructs RAGEngine(config)
3. RAGEngine._init_vector_store() creates/loads ChromaDB collection
4. For each tracked file (parallelized):
   a. Read content
   b. Choose chunker based on extension and config
   c. For each chunk: call embedding_provider.embed([chunk])
   d. Add to vector store with metadata {path, chunk_index, ...}
5. Indexing complete; engine is ready for queries
```

**Parallelization**: The indexing pipeline uses `ThreadPoolExecutor` for parallel embedding calls (the embedding step is the bottleneck). The chunking is fast and sequential per file.
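
A minimal sketch of the parallel embedding step, assuming one `embed([chunk])` call per chunk as in step 4c. The function `embed_chunks_parallel` and the `fake_embed` stand-in are hypothetical; the real indexer may batch or schedule work differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

EmbedFn = Callable[[List[str]], List[List[float]]]


def embed_chunks_parallel(chunks: List[str], embed: EmbedFn,
                          max_workers: int = 4) -> List[List[float]]:
    """Embed each chunk on a worker thread, preserving chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so vector i
        # always belongs to chunk i regardless of completion order.
        return [vectors[0] for vectors in pool.map(lambda c: embed([c]), chunks)]


def fake_embed(texts: List[str]) -> List[List[float]]:
    # Stand-in provider: one-dimensional "embedding" equal to the text length.
    return [[float(len(text))] for text in texts]


vectors = embed_chunks_parallel(["a", "bb", "ccc"], fake_embed)
print(vectors)  # [[1.0], [2.0], [3.0]]
```

Threads suit this step when the provider is network-bound, since the workers spend most of their time waiting on I/O.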

**Incremental Updates**: When a file's `mtime` changes (detected via `pathlib.Path.stat().st_mtime`), `delete_documents_by_path()` is called first, then the file is re-indexed. Deleting first ensures that chunks from the old version of the file do not linger in the index. This is critical for the auto-sync flow (see Configuration below).
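
The change detection can be sketched as a comparison against the last `mtime` seen per path. The function `find_changed_files` and the `last_seen` dictionary are assumptions made for illustration; how the engine actually stores previous timestamps is not described here.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def find_changed_files(paths: Iterable[Path],
                       last_seen: Dict[str, float]) -> List[Path]:
    """Return paths whose mtime differs from the recorded one, and record it."""
    changed: List[Path] = []
    for path in paths:
        mtime = path.stat().st_mtime
        if last_seen.get(str(path)) != mtime:
            changed.append(path)
            last_seen[str(path)] = mtime
    return changed


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "a.py"
    source.write_text("x = 1\n")
    seen: Dict[str, float] = {}

    first_scan = find_changed_files([source], seen)   # new file: reported
    second_scan = find_changed_files([source], seen)  # untouched: not reported

    # Simulate an edit by moving the modification time forward.
    stat = source.stat()
    os.utime(source, (stat.st_atime, stat.st_mtime + 10))
    third_scan = find_changed_files([source], seen)   # reported again
```

For each path returned, the flow described above would call `delete_documents_by_path(path)` followed by `index_file(path)`.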

### Query Flow

When `ai_client.send(rag_engine=engine)` is called:

```
1. send() receives user_message
2. If rag_engine is not None:
   a. rag_engine.search(user_message, top_k=5) -> list of {text, metadata, distance}
   b. If results non-empty: inject [RETRIEVED CONTEXT] block into md_content
   c. The block contains the top_k fragments, formatted as:

        [RETRIEVED CONTEXT]
        File: path/to/file.py (chunk 0)
        <chunk text>

        File: path/to/another.py (chunk 2)
        <chunk text>
        ...

3. send() proceeds to the provider call with the augmented md_content
```

The injection point is **before** the system prompt construction. This means the retrieved context is treated as part of the project's tracked content, not as ad-hoc advice.
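
The block format above can be produced with a few lines of string formatting. This is a sketch under assumptions: the helper names `format_retrieved_context` and `inject_context` are hypothetical, the metadata keys `path` and `chunk_index` are taken from the indexing flow, and prepending the block to `md_content` follows the Overview's description rather than the actual `_inject_rag_context` source.

```python
from typing import Any, Dict, List


def format_retrieved_context(results: List[Dict[str, Any]]) -> str:
    """Render search results as a [RETRIEVED CONTEXT] block."""
    if not results:
        return ""
    entries = []
    for result in results:
        meta = result.get("metadata", {})
        header = f"File: {meta.get('path', '<unknown>')} (chunk {meta.get('chunk_index', '?')})"
        entries.append(f"{header}\n{result['text']}")
    return "[RETRIEVED CONTEXT]\n" + "\n\n".join(entries)


def inject_context(md_content: str, results: List[Dict[str, Any]]) -> str:
    """Prepend the block to md_content; leave md_content untouched if empty."""
    block = format_retrieved_context(results)
    return f"{block}\n\n{md_content}" if block else md_content


results = [
    {"text": "def f(): ...", "metadata": {"path": "a.py", "chunk_index": 0}, "distance": 0.12},
    {"text": "class C: ...", "metadata": {"path": "b.py", "chunk_index": 2}, "distance": 0.31},
]
augmented = inject_context("# Project files", results)
print(augmented)
```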

### Public Methods

```python
# Index a single file
rag_engine.index_file(path: str) -> None

# Search the index
rag_engine.search(query: str, top_k: int = 5) -> List[Dict[str, Any]]
# Returns: [{"text": str, "metadata": dict, "distance": float}, ...]

# Index management
rag_engine.add_documents(ids: List[str], texts: List[str], metadatas: Optional[List[dict]] = None) -> None
rag_engine.delete_documents(ids: List[str]) -> None
rag_engine.delete_documents_by_path(path: str) -> None
rag_engine.get_all_indexed_paths() -> List[str]
rag_engine.is_empty() -> bool
```

---

## Configuration

RAG is configured via the project's `manual_slop.toml`:

```toml
[rag]
enabled = true
embedding_provider = "gemini" # or "local"
chunk_size = 1000
chunk_overlap = 200
ast_chunking_enabled = true
vector_store_backend = "chromadb"
vector_store_path = ".rag/chroma" # relative to project base_dir
auto_index_on_load = true
auto_sync_interval_seconds = 60 # background re-indexing
top_k = 5
```

### `RAGConfig` Schema (`src/models.py`)

```python
from dataclasses import dataclass


@dataclass
class RAGConfig:
    enabled: bool = False
    embedding_provider: str = "gemini"  # "local" | "gemini"
    chunk_size: int = 1000
    chunk_overlap: int = 200
    ast_chunking_enabled: bool = True
    vector_store_backend: str = "chromadb"
    vector_store_path: str = ".rag/chroma"
    auto_index_on_load: bool = True
    auto_sync_interval_seconds: int = 60
    top_k: int = 5
```
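
As an illustration of how the `[rag]` table could map onto the dataclass, the sketch below parses TOML with stdlib `tomllib` (Python 3.11+) and ignores unknown keys. The function `load_rag_config` and its overlap check are assumptions; the project's actual loader is not shown in this guide.

```python
import tomllib  # stdlib TOML parser, available from Python 3.11
from dataclasses import dataclass, fields


@dataclass
class RAGConfig:
    # Copied from the schema above so the example is self-contained.
    enabled: bool = False
    embedding_provider: str = "gemini"
    chunk_size: int = 1000
    chunk_overlap: int = 200
    ast_chunking_enabled: bool = True
    vector_store_backend: str = "chromadb"
    vector_store_path: str = ".rag/chroma"
    auto_index_on_load: bool = True
    auto_sync_interval_seconds: int = 60
    top_k: int = 5


def load_rag_config(toml_text: str) -> RAGConfig:
    """Build a RAGConfig from the [rag] table; missing keys keep defaults."""
    section = tomllib.loads(toml_text).get("rag", {})
    known = {f.name for f in fields(RAGConfig)}
    config = RAGConfig(**{k: v for k, v in section.items() if k in known})
    if config.chunk_overlap >= config.chunk_size:
        # Hypothetical validation: the chunking loop needs overlap < size.
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return config


config = load_rag_config('[rag]\nenabled = true\nembedding_provider = "local"\nchunk_size = 500\n')
print(config.enabled, config.embedding_provider, config.chunk_size, config.top_k)
# True local 500 5
```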

### Behavior When Disabled

If `enabled = false` (the default), `RAGEngine` is never constructed. `ai_client.send()` receives `rag_engine=None` and the integration is a no-op. Because `chromadb`, `sentence_transformers`, and `google.genai` are only imported lazily, none of them is loaded either, so projects that don't use RAG pay no import overhead.

### Auto-Sync

When `auto_sync_interval_seconds > 0`, a background thread periodically scans tracked files for `mtime` changes and re-indexes them. This keeps the vector store consistent with on-disk changes without requiring explicit user action.

The sync uses `pathlib.Path.stat().st_mtime` for change detection (same mechanism as the file cache in `file_cache.py`). For very large projects, the sync can be tuned to skip files above a size threshold.
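
A periodic background sync of this kind can be sketched with a daemon thread and a `threading.Event` used as both timer and stop signal. The `AutoSync` class and the `sync_once` callback are hypothetical names; the actual thread management in Manual Slop may differ.

```python
import threading
from typing import Callable


class AutoSync:
    """Call sync_once() every interval_seconds until stop() is called."""

    def __init__(self, interval_seconds: float, sync_once: Callable[[], None]) -> None:
        self._interval = interval_seconds
        self._sync_once = sync_once
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        # Event.wait() returns False on timeout (time to sync) and
        # True once stop() has been called (time to exit).
        while not self._stop.wait(self._interval):
            self._sync_once()


# Demonstration: a tiny interval and a callback that records that it ran.
ran = threading.Event()
sync = AutoSync(interval_seconds=0.01, sync_once=ran.set)
sync.start()
completed = ran.wait(timeout=2.0)
sync.stop()
print(completed)  # True
```

Waiting on the event instead of sleeping lets `stop()` interrupt the pause immediately rather than after a full interval.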

---

## Cross-System Integration

### `ai_client.send()` Integration

See [guide_architecture.md#rag-integration](guide_architecture.md#rag-integration) for the full dispatch flow. Summary:

```python
def send(md_content, user_message, ..., rag_engine=None) -> str:
    if rag_engine is not None:
        retrieved = rag_engine.search(user_message, top_k=rag_engine.config.top_k)
        if retrieved:
            md_content = _inject_rag_context(md_content, retrieved)
    ...
```

The injection is a no-op if:
- `rag_engine is None`
- `rag_engine.is_empty()` (index has no documents)
- `search()` returns no results that pass the distance threshold

### MMA Worker Integration

The ConductorEngine does not construct `RAGEngine` itself. Workers receive context via `md_content`, which is built by the caller. To use RAG in workers:

1. Construct a `RAGEngine` in the caller (typically `AppController` or test harness).
2. Pass it to `multi_agent_conductor.run_worker_lifecycle(..., rag_engine=...)` (if supported) or to the test invocation.
3. The worker passes it to `ai_client.send(rag_engine=...)`.

Note: As of 2026-06-02, the direct `rag_engine` parameter on `run_worker_lifecycle` is **not yet implemented**. Workers currently rely on the `md_content` already being augmented by the caller, or on Tier 4 / Tier 2 setting up the augmentation before spawning workers.

### GUI Integration

The GUI's RAG panel (under AI Settings → RAG) provides:
- **Status indicator**: `RAGEngine.is_empty()` → "Empty" / "Indexed N chunks"
- **Manual search box**: for testing retrieval quality without sending a full AI call
- **Re-index button**: forces a full rebuild of the index
- **Settings editor**: modifies `RAGConfig` fields and writes back to `manual_slop.toml`

The RAG panel also surfaces the **auto-sync status** (last sync time, files indexed, files pending re-index).

---

## Testing

### Unit Tests

- `tests/test_rag_engine.py`: `RAGEngine` basic lifecycle with mock ChromaDB and mock embedding provider
- `tests/test_rag_integration.py`: End-to-end indexing + search + retrieval

### Simulation Tests

- `tests/test_rag_gui_presence.py`: Verifies the RAG panel renders correctly
- `tests/test_rag_visual_sim.py`: Visual verification of the RAG search results panel

### Stress Tests

- `tests/test_rag_phase4_stress.py`: Indexes 1000+ files, measures retrieval latency
- `tests/test_rag_phase4_final_verify.py`: End-to-end verification of RAG-augmented AI responses

### Test Patterns

The standard pattern for testing RAG-augmented calls:

```python
def test_rag_augmented_send(live_gui):
    # NOTE: `client` is assumed to come from the test harness
    # (for example, derived from the live_gui fixture).

    # 1. Set up project with RAG enabled
    client.set_rag_config(enabled=True, embedding_provider="local")
    client.reindex_project()

    # 2. Send a question that requires retrieval
    response = client.send("How does the Execution Clutch work?")

    # 3. Verify the response references the retrieved content
    # (The exact assertion depends on what was indexed)
    assert response
```

For unit tests that don't need real embedding models, the `BaseEmbeddingProvider` is mocked to return deterministic vectors (e.g., based on the hash of the input text).
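
One way to build such a mock is to derive the vector from a SHA-256 digest of the text. The class `HashEmbeddingProvider` below is a hypothetical test double with the same `embed()` shape as `BaseEmbeddingProvider`; it does not reproduce the project's actual mock, and its vectors carry no semantic meaning.

```python
import hashlib
from typing import List


class HashEmbeddingProvider:
    """Deterministic fake: identical text always yields an identical vector."""

    def __init__(self, dimensions: int = 8) -> None:
        # A SHA-256 digest has 32 bytes, which bounds the vector length here.
        if not 1 <= dimensions <= 32:
            raise ValueError("dimensions must be between 1 and 32")
        self.dimensions = dimensions

    def embed(self, texts: List[str]) -> List[List[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Scale each byte into [0.0, 1.0] to get float components.
            vectors.append([byte / 255.0 for byte in digest[: self.dimensions]])
        return vectors


provider = HashEmbeddingProvider(dimensions=8)
first, second, other = provider.embed(["hello", "hello", "world"])
print(first == second, first == other)  # True False
```

Such a fake is suitable for checking plumbing (ids, metadata, result shapes), not retrieval quality.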

---

## Edge Cases & Limitations

1. **Empty Index**: If the index has no documents, `search()` returns `[]` and no context is injected. The AI call proceeds normally with just the explicit file context.

2. **Network Failures (Gemini Embeddings)**: If the Gemini API is unreachable, `GeminiEmbeddingProvider.embed()` raises an exception. The caller (typically `index_file`, invoked by the RAG indexer) should handle this gracefully and either retry or fall back to the local provider. A fallback requires a full re-index, because the two providers produce vectors of different dimensionality.

3. **Stale Index**: Auto-sync runs periodically but not on every read. If a file is changed between sync intervals, the index may be stale. The `delete_documents_by_path` + `index_file` cycle is atomic per file, so a partial sync leaves the index in a consistent (if incomplete) state.

4. **Large Files**: A single file larger than `chunk_size` is split into multiple chunks with overlap. There's no upper limit on the number of chunks per file, but very large files (>10MB) may slow down indexing significantly.

5. **Binary Files**: RAG only handles text files. Binary files (images, compiled Python, etc.) are skipped during indexing with a warning logged to `comms_log`.

6. **Cross-Project Queries**: The vector store is per-project (`<project_dir>/.rag/chroma/`). Cross-project retrieval is **not** supported; each project has its own isolated index.

7. **Concurrent Writes**: ChromaDB's PersistentClient is single-writer, so multiple processes must not write to the same index simultaneously. Within the Manual Slop process, a `threading.Lock` serializes writes from the auto-sync thread and the manual re-index button; this lock does not protect against a second process opening the same store.

---

## Future Work

- **External RAG Bridge**: Connect to remote vector databases (e.g., a managed Pinecone or Weaviate) via MCP. The `_search_mcp` method is a placeholder for this.
- **Hybrid Search**: Combine dense (vector) retrieval with sparse (BM25) retrieval for better recall on code keywords.
- **Re-ranking**: Apply a cross-encoder reranker to the top-k results before injection to improve precision.
- **Caching**: Cache query results in memory to avoid re-embedding for repeated questions.
- **Provider Routing**: Allow per-query provider selection (e.g., use Gemini for general queries, local for code).

See [guide_tools.md](guide_tools.md) for the MCP tool inventory; see [guide_architecture.md](guide_architecture.md) for the dispatch pipeline.