
docs(rag): sync with src/rag_engine.py (collection attr, chroma path, dim validation)

Critical fixes:
- Chroma path: .rag/chroma/ -> .slop_cache/chroma_<collection_name>/
- self.vector_store -> self.client (PersistentClient) + self.collection (Collection)
- vector_store_backend -> vector_store.provider (nested VectorStoreConfig)
- RAGConfig schema: removed fictional fields (ast_chunking_enabled,
  vector_store_backend, vector_store_path, auto_index_on_load,
  auto_sync_interval_seconds, top_k); added nested VectorStoreConfig

New sections:
- Dimension Mismatch Protection: documents _validate_collection_dim
  and why it exists (silent corruption from provider switches)
- Path resolution resilience: index_file() CWD fallback for batched tests
2026-06-10 19:50:35 -04:00
parent f973fb275f
commit 5aa19e59e7
+59 -27
@@ -73,7 +73,8 @@ class RAGEngine:
**Internal state**:
- `embedding_provider: BaseEmbeddingProvider` — set by `_init_embedding_provider`
- `vector_store` — a ChromaDB `Collection` (or a stub for tests)
- `client: chromadb.PersistentClient` — the chroma client (or the string `"mock"` in mock mode)
- `collection: chromadb.Collection` — the actual collection (or `"mock"` in mock mode)
- `chunk_size: int` — character count per chunk
- `chunk_overlap: int` — overlap between adjacent chunks
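How `chunk_size` and `chunk_overlap` interact can be illustrated with a sliding character window. This is a minimal sketch, not the engine's chunker: the function name `chunk_text` is hypothetical, and it assumes each window starts `chunk_size - chunk_overlap` characters after the previous one.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into character windows that share chunk_overlap characters.

    Hypothetical helper for illustration; the real chunker may differ.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between window starts
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults shown in the config (1000 and 200), consecutive chunks would start 800 characters apart under this assumption.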
@@ -125,22 +126,32 @@ The heavy dependencies (`sentence_transformers`, `google.genai`, `chromadb`) are
### Vector Store
ChromaDB is the default persistent vector store. The store is created at `<project_dir>/.rag/chroma/` by default (configurable via `RAGConfig.vector_store_path`).
ChromaDB is the default persistent vector store. The store is created at `<project_dir>/.slop_cache/chroma_<collection_name>/` (auto-generated from `VectorStoreConfig.collection_name`, default `"manual_slop"`). The `.slop_cache` location is intentional — it co-locates the chroma index with the existing per-project cache layout.
```python
def _init_vector_store(self):
if self.config.vector_store_backend == "chromadb":
client = chromadb.PersistentClient(path=...)
self.vector_store = client.get_or_create_collection(name=...)
vs_config = self.config.vector_store
if vs_config.provider == 'chroma':
db_path = os.path.abspath(os.path.join(
self.base_dir, ".slop_cache", f"chroma_{vs_config.collection_name}"
))
os.makedirs(db_path, exist_ok=True)
chromadb, Settings = _get_chromadb()
self.client = chromadb.PersistentClient(path=db_path)
self.collection = self.client.get_or_create_collection(name=vs_config.collection_name)
self._validate_collection_dim()
elif vs_config.provider == 'mock':
self.client = "mock"
self.collection = "mock"
else:
raise NotImplementedError(...)
raise ValueError(f"Unknown vector store provider: {vs_config.provider}")
```
**Backends**:
- `chromadb` (default) — local persistent, single-process
- *Future*: External RAG Bridge via MCP (e.g., a remote vector database server)
**Backends** (`VectorStoreConfig.provider`):
- `chroma` (default for real use) — local persistent, single-process
- `mock` — no-op collection (for tests / RAG-disabled paths)
The `_search_mcp` method is a placeholder for the future external bridge integration; current local-only mode uses `vector_store.query()` directly.
The `mcp_server` + `mcp_tool` fields in `VectorStoreConfig` are placeholders for the future External RAG Bridge via MCP (e.g., a remote vector database server); not yet implemented.
### Chunking Strategies
@@ -198,6 +209,7 @@ When a project is loaded with RAG enabled, the `RAGEngine` is populated by index
1. Project load: AppController reads [rag] section from manual_slop.toml
2. AppController constructs RAGEngine(config)
3. RAGEngine._init_vector_store() creates/loads ChromaDB collection
- Calls _validate_collection_dim() to detect/recover from dim mismatch
4. For each tracked file (parallelized):
a. Read content
b. Choose chunker based on extension and config
@@ -210,6 +222,16 @@ When a project is loaded with RAG enabled, the `RAGEngine` is populated by index
**Incremental Updates**: When a file's `mtime` changes (detected by `pathlib.Path.stat().st_mtime`), `delete_documents_by_path()` is called first, then the file is re-indexed. This is critical for the auto-sync flow (see Configuration below).
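The mtime check described above can be sketched as follows. This is an illustration under assumptions: `files_needing_reindex` and the `last_indexed` mapping are hypothetical names, and the engine may track modification times differently.

```python
from pathlib import Path


def files_needing_reindex(paths: list[str], last_indexed: dict[str, float]) -> list[str]:
    """Return the paths whose mtime differs from the recorded one.

    last_indexed maps path -> mtime seen at the previous indexing pass
    (hypothetical bookkeeping) and is updated in place.
    """
    changed: list[str] = []
    for path in paths:
        mtime = Path(path).stat().st_mtime
        if last_indexed.get(path) != mtime:
            changed.append(path)
            last_indexed[path] = mtime
    return changed
```

For each returned path, the flow in the text would call `delete_documents_by_path()` first and then re-index the file, so stale chunks never coexist with fresh ones.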
**Path resolution resilience**: `index_file()` falls back to `os.getcwd()` if the `base_dir`-relative path doesn't exist. This handles batched test conditions where the subprocess CWD differs from the project root (e.g., a test chdir'ing into `tests/artifacts/live_gui_workspace_*/` for fixture isolation). Without the fallback, indexing silently skipped files in those conditions.
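A minimal sketch of that fallback, assuming the lookup order is "relative to `base_dir`, then relative to the current working directory". The helper name `resolve_index_path` is hypothetical and does not appear in `src/rag_engine.py`.

```python
import os
from typing import Optional


def resolve_index_path(base_dir: str, file_path: str) -> Optional[str]:
    """Resolve file_path against base_dir, falling back to the process CWD.

    Returns an absolute path, or None if the file exists in neither place.
    """
    candidate = os.path.join(base_dir, file_path)  # absolute file_path wins in join
    if os.path.exists(candidate):
        return os.path.abspath(candidate)

    # Fallback for runs where the CWD is not the project root.
    fallback = os.path.join(os.getcwd(), file_path)
    if os.path.exists(fallback):
        return os.path.abspath(fallback)
    return None
```

Returning `None` rather than raising keeps the caller in charge of whether a missing file is logged or skipped.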
### Dimension Mismatch Protection
`_init_vector_store()` calls `_validate_collection_dim()` after creating the collection. The validation inspects the dimension of the first existing vector and compares it to the current embedding provider's output dimension. On mismatch (e.g., the user switched from Gemini 3072-dim to local 384-dim, or vice versa, or a prior run populated the collection with a different model), the chroma directory is wiped via `shutil.rmtree` (with the client closed first to release file handles) and the collection is recreated with the correct dimension.
**Why this exists:** Without validation, dim-mismatched upserts silently corrupt the collection. The next `search()` raises `chromadb.errors.InvalidDimensionError: Collection expecting embedding with dimension of X, got Y`, the AI request never reaches `'done'` status, and the live_gui test polls time out at 50×0.5s = 25s. This pattern was the dominant cause of `tier-3-live_gui` failures in the 2026-06-08 to 2026-06-10 window.
Regression tests in `tests/test_rag_engine.py`: `test_rag_collection_dim_mismatch_recreates_collection`, `test_rag_collection_dim_match_preserves_collection`.
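The decision logic can be sketched without chromadb. This is an assumption-laden illustration: `needs_rebuild` and `reset_store_dir` are hypothetical helpers, and the real `_validate_collection_dim` reads the stored dimension from the collection and closes the client before wiping.

```python
import os
import shutil
from typing import Optional


def needs_rebuild(stored_dim: Optional[int], provider_dim: int) -> bool:
    """True when the collection holds vectors of a different dimension.

    stored_dim is None for an empty collection, which is always compatible.
    """
    return stored_dim is not None and stored_dim != provider_dim


def reset_store_dir(db_path: str) -> None:
    """Wipe the on-disk store and recreate an empty directory."""
    shutil.rmtree(db_path, ignore_errors=True)
    os.makedirs(db_path, exist_ok=True)
```

An empty collection is treated as compatible here, so a fresh project never triggers a wipe.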
### Query Flow
When `ai_client.send(rag_engine=engine)` is called:
@@ -262,33 +284,43 @@ RAG is configured via the project's `manual_slop.toml`:
[rag]
enabled = true
embedding_provider = "gemini" # or "local"
[rag.vector_store]
provider = "chroma" # "chroma" | "mock"
collection_name = "manual_slop" # the chroma subdir under .slop_cache/
url = "" # future: external HTTP vector store
api_key = "" # future: external HTTP auth
mcp_server = "" # future: MCP-based external RAG bridge
mcp_tool = "" # future: tool name on the MCP server
[rag]
chunk_size = 1000
chunk_overlap = 200
ast_chunking_enabled = true
vector_store_backend = "chromadb"
vector_store_path = ".rag/chroma" # relative to project base_dir
auto_index_on_load = true
auto_sync_interval_seconds = 60 # background re-indexing
top_k = 5
```
### `RAGConfig` Schema (`src/models.py`)
### `RAGConfig` + `VectorStoreConfig` Schema (`src/models.py`)
```python
@dataclass
class VectorStoreConfig:
provider: str # "chroma" | "mock"
url: Optional[str] = None # future: external HTTP
api_key: Optional[str] = None # future: external HTTP auth
collection_name: str = "manual_slop"
mcp_server: Optional[str] = None # future: MCP bridge
mcp_tool: Optional[str] = None # future: MCP tool name
@dataclass
class RAGConfig:
enabled: bool = False
embedding_provider: str = "gemini" # "local" | "gemini"
chunk_size: int = 1000
chunk_overlap: int = 200
ast_chunking_enabled: bool = True
vector_store_backend: str = "chromadb"
vector_store_path: str = ".rag/chroma"
auto_index_on_load: bool = True
auto_sync_interval_seconds: int = 60
top_k: int = 5
enabled: bool = False
vector_store: VectorStoreConfig = field(default_factory=lambda: VectorStoreConfig(provider='mock'))
embedding_provider: str = 'gemini' # "gemini" | "local"
chunk_size: int = 1000
chunk_overlap: int = 200
```
> **Removed fields** (moved to other systems or not yet implemented): `ast_chunking_enabled` lives in `ChunkingConfig` (not in `RAGConfig`); `vector_store_backend`/`vector_store_path` replaced by nested `VectorStoreConfig`; `auto_index_on_load`/`auto_sync_interval_seconds`/`top_k` are runtime parameters set by the controller, not persisted in `RAGConfig`.
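To show how the nested schema lines up with a parsed `[rag]` table, here is a minimal sketch. The dataclasses repeat the post-commit fields from the schema above; `rag_config_from_dict` is a hypothetical helper, not a function confirmed to exist in `src/models.py`.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class VectorStoreConfig:
    provider: str  # "chroma" | "mock"
    url: Optional[str] = None
    api_key: Optional[str] = None
    collection_name: str = "manual_slop"
    mcp_server: Optional[str] = None
    mcp_tool: Optional[str] = None


@dataclass
class RAGConfig:
    enabled: bool = False
    vector_store: VectorStoreConfig = field(
        default_factory=lambda: VectorStoreConfig(provider="mock")
    )
    embedding_provider: str = "gemini"
    chunk_size: int = 1000
    chunk_overlap: int = 200


def rag_config_from_dict(data: dict[str, Any]) -> RAGConfig:
    """Build a RAGConfig from a parsed [rag] table (hypothetical helper)."""
    data = dict(data)  # copy so the caller's dict is not mutated
    vs_data = data.pop("vector_store", None)
    if vs_data is None:
        return RAGConfig(**data)  # falls back to the mock vector store
    return RAGConfig(vector_store=VectorStoreConfig(**vs_data), **data)
```

Unknown keys raise `TypeError` in this sketch, which would surface any of the removed fields still present in an old `manual_slop.toml`.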
### Behavior When Disabled
If `enabled = false` (the default), `RAGEngine` is never constructed. `ai_client.send()` receives `rag_engine=None` and the integration is a no-op. The lazy-loading of `chromadb`, `sentence_transformers`, and `google.genai` is also skipped, so there is zero overhead for projects that don't use RAG.