docs(rag): sync with src/rag_engine.py (collection attr, chroma path, dim validation)

Critical fixes:
- Chroma path: .rag/chroma/ -> .slop_cache/chroma_<collection_name>/
- self.vector_store -> self.client (PersistentClient) + self.collection (Collection)
- vector_store_backend -> vector_store.provider (nested VectorStoreConfig)
- RAGConfig schema: removed fictional fields (ast_chunking_enabled, vector_store_backend, vector_store_path, auto_index_on_load, auto_sync_interval_seconds, top_k); added nested VectorStoreConfig

New sections:
- Dimension Mismatch Protection: documents _validate_collection_dim and why it exists (silent corruption from provider switches)
- Path resolution resilience: index_file() CWD fallback for batched tests
2026-06-10 19:50:35 -04:00
parent f973fb275f
commit 5aa19e59e7
1 changed files with 59 additions and 27 deletions
@@ -73,7 +73,8 @@ class RAGEngine:

 **Internal state**:
 - `embedding_provider: BaseEmbeddingProvider` — set by `_init_embedding_provider`
-- `vector_store` — a ChromaDB `Collection` (or a stub for tests)
+- `client: chromadb.PersistentClient` — the chroma client (or the string `"mock"` in mock mode)
+- `collection: chromadb.Collection` — the actual collection (or `"mock"` in mock mode)
 - `chunk_size: int` — character count per chunk
 - `chunk_overlap: int` — overlap between adjacent chunks

@@ -125,22 +126,32 @@ The heavy dependencies (`sentence_transformers`, `google.genai`, `chromadb`) are

 ### Vector Store

-ChromaDB is the default persistent vector store. The store is created at `<project_dir>/.rag/chroma/` by default (configurable via `RAGConfig.vector_store_path`).
+ChromaDB is the default persistent vector store. The store is created at `<project_dir>/.slop_cache/chroma_<collection_name>/` (auto-generated from `VectorStoreConfig.collection_name`, default `"manual_slop"`). The `.slop_cache` location is intentional — it co-locates the chroma index with the existing per-project cache layout.

 ```python
 def _init_vector_store(self):
-    if self.config.vector_store_backend == "chromadb":
-        client = chromadb.PersistentClient(path=...)
-        self.vector_store = client.get_or_create_collection(name=...)
+    vs_config = self.config.vector_store
+    if vs_config.provider == 'chroma':
+        db_path = os.path.abspath(os.path.join(
+            self.base_dir, ".slop_cache", f"chroma_{vs_config.collection_name}"
+        ))
+        os.makedirs(db_path, exist_ok=True)
+        chromadb, Settings = _get_chromadb()
+        self.client     = chromadb.PersistentClient(path=db_path)
+        self.collection = self.client.get_or_create_collection(name=vs_config.collection_name)
+        self._validate_collection_dim()
+    elif vs_config.provider == 'mock':
+        self.client     = "mock"
+        self.collection = "mock"
     else:
-        raise NotImplementedError(...)
+        raise ValueError(f"Unknown vector store provider: {vs_config.provider}")
 ```

-**Backends**:
-- `chromadb` (default) — local persistent, single-process
-- *Future*: External RAG Bridge via MCP (e.g., a remote vector database server)
+**Backends** (`VectorStoreConfig.provider`):
+- `chroma`: local persistent, single-process. Select it explicitly; the `RAGConfig` dataclass default is `mock`
+- `mock`: no-op collection (for tests / RAG-disabled paths)

-The `_search_mcp` method is a placeholder for the future external bridge integration; current local-only mode uses `vector_store.query()` directly.
+The `mcp_server` and `mcp_tool` fields in `VectorStoreConfig` are placeholders for a future External RAG Bridge via MCP (e.g., a remote vector database server). The bridge is not yet implemented.

 ### Chunking Strategies

@@ -198,6 +209,7 @@ When a project is loaded with RAG enabled, the `RAGEngine` is populated by index
 1. Project load: AppController reads [rag] section from manual_slop.toml
 2. AppController constructs RAGEngine(config)
 3. RAGEngine._init_vector_store() creates/loads ChromaDB collection
+   - Calls _validate_collection_dim() to detect/recover from dim mismatch
 4. For each tracked file (parallelized):
     a. Read content
     b. Choose chunker based on extension and config
@@ -210,6 +222,16 @@ When a project is loaded with RAG enabled, the `RAGEngine` is populated by index

 **Incremental Updates**: When a file's `mtime` changes (detected by `pathlib.Path.stat().st_mtime`), `delete_documents_by_path()` is called first, then the file is re-indexed. This is critical for the auto-sync flow (see Configuration below).
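
The mtime check described above can be sketched as follows. This is an illustration only: `files_needing_reindex` and the `seen_mtimes` dict are hypothetical names, not part of `src/rag_engine.py`, and the real engine would call `delete_documents_by_path()` before re-indexing each returned path.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def files_needing_reindex(paths: Iterable[Path], seen_mtimes: Dict[Path, float]) -> List[Path]:
    """Return the paths whose mtime differs from the last recorded value.

    `seen_mtimes` is updated in place, so a second call with no file
    changes returns an empty list. (Hypothetical helper for illustration.)
    """
    stale: List[Path] = []
    for path in paths:
        mtime = path.stat().st_mtime
        if seen_mtimes.get(path) != mtime:
            stale.append(path)
            seen_mtimes[path] = mtime
    return stale
```

A caller would keep one `seen_mtimes` dict per project and, for each stale path, delete the old chunks first and then index the file again.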

+**Path resolution resilience**: `index_file()` falls back to `os.getcwd()` if the `base_dir`-relative path doesn't exist. This handles batched test conditions where the subprocess CWD differs from the project root (e.g., a test chdir'ing into `tests/artifacts/live_gui_workspace_*/` for fixture isolation). Without the fallback, indexing silently skipped files in those conditions.
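
A minimal sketch of the fallback order, assuming the lookup is "try `base_dir` first, then the current working directory". The helper name `resolve_for_indexing` is hypothetical; the actual logic lives inside `index_file()` and may differ in detail.

```python
import os
from typing import Optional


def resolve_for_indexing(base_dir: str, rel_path: str) -> Optional[str]:
    """Return the first existing candidate for rel_path, or None.

    Candidates are tried in order: base_dir, then os.getcwd().
    (Hypothetical helper; names are assumptions for illustration.)
    """
    for root in (base_dir, os.getcwd()):
        candidate = os.path.abspath(os.path.join(root, rel_path))
        if os.path.isfile(candidate):
            return candidate
    return None
```

Returning `None` instead of raising leaves the caller free to log the miss, which avoids the silent skip described above.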
+
+### Dimension Mismatch Protection
+
+`_init_vector_store()` calls `_validate_collection_dim()` after creating the collection. The validation inspects the first existing vector's dim and compares it to the current embedding provider's output. On mismatch (e.g., the user switched from Gemini 3072-dim to local 384-dim, or vice versa, or a prior run populated the collection with a different model), the chroma directory is wiped via `shutil.rmtree` (with the client closed first to release file handles) and the collection is recreated with the correct dim.
+
+**Why this exists:** Without validation, dim-mismatched upserts silently corrupt the collection. The next `search()` raises `chromadb.errors.InvalidDimensionError: Collection expecting embedding with dimension of X, got Y`, the AI request never reaches `'done'` status, and the live_gui test's polling loop times out after 50 × 0.5 s = 25 s. This pattern was the dominant cause of `tier-3-live_gui` failures between 2026-06-08 and 2026-06-10.
+
+Regression tests in `tests/test_rag_engine.py`: `test_rag_collection_dim_mismatch_recreates_collection`, `test_rag_collection_dim_match_preserves_collection`.
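
The decision logic can be sketched without ChromaDB, assuming the stored vectors are available as plain lists. `needs_rebuild` and `wipe_store` are hypothetical names; the real `_validate_collection_dim` also closes the client before deleting the directory and recreates the collection afterwards.

```python
import os
import shutil
from typing import Optional, Sequence


def stored_dim(embeddings: Sequence[Sequence[float]]) -> Optional[int]:
    """Dimension of the first stored vector, or None for an empty collection."""
    return len(embeddings[0]) if len(embeddings) > 0 else None


def needs_rebuild(embeddings: Sequence[Sequence[float]], provider_dim: int) -> bool:
    """True only when vectors exist and their dim differs from the provider's."""
    dim = stored_dim(embeddings)
    return dim is not None and dim != provider_dim


def wipe_store(db_path: str) -> None:
    """Delete the on-disk store and leave an empty directory in its place."""
    shutil.rmtree(db_path, ignore_errors=True)
    os.makedirs(db_path, exist_ok=True)
```

An empty collection never triggers a rebuild, so a fresh project pays no cost for the check.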
+
 ### Query Flow

 When `ai_client.send(rag_engine=engine)` is called:
@@ -262,33 +284,43 @@ RAG is configured via the project's `manual_slop.toml`:
 [rag]
 enabled = true
 embedding_provider = "gemini"  # or "local"
 chunk_size = 1000
 chunk_overlap = 200
-ast_chunking_enabled = true
-vector_store_backend = "chromadb"
-vector_store_path = ".rag/chroma"  # relative to project base_dir
-auto_index_on_load = true
-auto_sync_interval_seconds = 60  # background re-indexing
-top_k = 5
+
+# Nested table goes last: TOML forbids declaring [rag] a second time,
+# so every scalar [rag] key must appear above this header.
+[rag.vector_store]
+provider = "chroma"              # "chroma" | "mock"
+collection_name = "manual_slop"  # stored under .slop_cache/chroma_<collection_name>/
+url = ""                         # future: external HTTP vector store
+api_key = ""                     # future: external HTTP auth
+mcp_server = ""                  # future: MCP-based external RAG bridge
+mcp_tool = ""                    # future: tool name on the MCP server
 ```

-### `RAGConfig` Schema (`src/models.py`)
+### `RAGConfig` + `VectorStoreConfig` Schema (`src/models.py`)

 ```python
+@dataclass
+class VectorStoreConfig:
+    provider:        str                              # "chroma" | "mock"
+    url:             Optional[str] = None             # future: external HTTP
+    api_key:         Optional[str] = None             # future: external HTTP auth
+    collection_name: str = "manual_slop"
+    mcp_server:      Optional[str] = None             # future: MCP bridge
+    mcp_tool:        Optional[str] = None             # future: MCP tool name
+
 @dataclass
 class RAGConfig:
-    enabled: bool = False
-    embedding_provider: str = "gemini"  # "local" | "gemini"
-    chunk_size: int = 1000
-    chunk_overlap: int = 200
-    ast_chunking_enabled: bool = True
-    vector_store_backend: str = "chromadb"
-    vector_store_path: str = ".rag/chroma"
-    auto_index_on_load: bool = True
-    auto_sync_interval_seconds: int = 60
-    top_k: int = 5
+    enabled:            bool = False
+    vector_store:       VectorStoreConfig = field(default_factory=lambda: VectorStoreConfig(provider='mock'))
+    embedding_provider: str = 'gemini'                 # "gemini" | "local"
+    chunk_size:         int = 1000
+    chunk_overlap:      int = 200
 ```

+> **Removed fields** (moved to other systems or not yet implemented): `ast_chunking_enabled` lives in `ChunkingConfig` (not in `RAGConfig`); `vector_store_backend`/`vector_store_path` replaced by nested `VectorStoreConfig`; `auto_index_on_load`/`auto_sync_interval_seconds`/`top_k` are runtime parameters set by the controller, not persisted in `RAGConfig`.
+
 ### Behavior When Disabled

 If `enabled = false` (the default), `RAGEngine` is never constructed. `ai_client.send()` receives `rag_engine=None` and the integration is a no-op. The lazy-loading of `chromadb`, `sentence_transformers`, and `google.genai` is also skipped, so there is zero overhead for projects that don't use RAG.