docs(rag): sync with src/rag_engine.py (collection attr, chroma path, dim validation)
Critical fixes:
- Chroma path: .rag/chroma/ -> .slop_cache/chroma_<collection_name>/
- self.vector_store -> self.client (PersistentClient) + self.collection (Collection)
- vector_store_backend -> vector_store.provider (nested VectorStoreConfig)
- RAGConfig schema: removed fictional fields (ast_chunking_enabled, vector_store_backend, vector_store_path, auto_index_on_load, auto_sync_interval_seconds, top_k); added VectorStoreConfig nested

New sections:
- Dimension Mismatch Protection: documents _validate_collection_dim and why it exists (silent corruption from provider switches)
- Path resolution resilience: index_file() CWD fallback for batched tests
+59
-27
@@ -73,7 +73,8 @@ class RAGEngine:

 **Internal state**:
 - `embedding_provider: BaseEmbeddingProvider` — set by `_init_embedding_provider`
-- `vector_store` — a ChromaDB `Collection` (or a stub for tests)
+- `client: chromadb.PersistentClient` — the chroma client (or the string `"mock"` in mock mode)
+- `collection: chromadb.Collection` — the actual collection (or `"mock"` in mock mode)
 - `chunk_size: int` — character count per chunk
 - `chunk_overlap: int` — overlap between adjacent chunks

@@ -125,22 +126,32 @@ The heavy dependencies (`sentence_transformers`, `google.genai`, `chromadb`) are

 ### Vector Store

-ChromaDB is the default persistent vector store. The store is created at `<project_dir>/.rag/chroma/` by default (configurable via `RAGConfig.vector_store_path`).
+ChromaDB is the default persistent vector store. The store is created at `<project_dir>/.slop_cache/chroma_<collection_name>/` (auto-generated from `VectorStoreConfig.collection_name`, default `"manual_slop"`). The `.slop_cache` location is intentional — it co-locates the chroma index with the existing per-project cache layout.

 ```python
 def _init_vector_store(self):
-    if self.config.vector_store_backend == "chromadb":
-        client = chromadb.PersistentClient(path=...)
-        self.vector_store = client.get_or_create_collection(name=...)
+    vs_config = self.config.vector_store
+    if vs_config.provider == 'chroma':
+        db_path = os.path.abspath(os.path.join(
+            self.base_dir, ".slop_cache", f"chroma_{vs_config.collection_name}"
+        ))
+        os.makedirs(db_path, exist_ok=True)
+        chromadb, Settings = _get_chromadb()
+        self.client = chromadb.PersistentClient(path=db_path)
+        self.collection = self.client.get_or_create_collection(name=vs_config.collection_name)
+        self._validate_collection_dim()
+    elif vs_config.provider == 'mock':
+        self.client = "mock"
+        self.collection = "mock"
     else:
-        raise NotImplementedError(...)
+        raise ValueError(f"Unknown vector store provider: {vs_config.provider}")
 ```

-**Backends**:
-- `chromadb` (default) — local persistent, single-process
-- *Future*: External RAG Bridge via MCP (e.g., a remote vector database server)
+**Backends** (`VectorStoreConfig.provider`):
+- `chroma` (default for real use) — local persistent, single-process
+- `mock` — no-op collection (for tests / RAG-disabled paths)

-The `_search_mcp` method is a placeholder for the future external bridge integration; current local-only mode uses `vector_store.query()` directly.
+The `mcp_server` + `mcp_tool` fields in `VectorStoreConfig` are placeholders for the future External RAG Bridge via MCP (e.g., a remote vector database server); not yet implemented.

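As a minimal sketch of the path rule in the added lines above, the store directory can be derived from the project base directory and the collection name. `chroma_store_path` is a hypothetical helper name used only for illustration; it mirrors the `os.path.join` call shown in the snippet and is not taken from `src/rag_engine.py`.

```python
import os


def chroma_store_path(base_dir: str, collection_name: str = "manual_slop") -> str:
    """Return <base_dir>/.slop_cache/chroma_<collection_name> as an absolute path.

    Hypothetical helper: it only reproduces the path arithmetic from the
    documented _init_vector_store() snippet and creates nothing on disk.
    """
    return os.path.abspath(
        os.path.join(base_dir, ".slop_cache", f"chroma_{collection_name}")
    )


if __name__ == "__main__":
    # Prints an absolute path ending in .slop_cache/chroma_manual_slop
    print(chroma_store_path("my_project"))
```

Because the collection name is part of the directory name, two collections in the same project would resolve to two separate store directories under this rule.
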
 ### Chunking Strategies

@@ -198,6 +209,7 @@ When a project is loaded with RAG enabled, the `RAGEngine` is populated by index
 1. Project load: AppController reads [rag] section from manual_slop.toml
 2. AppController constructs RAGEngine(config)
 3. RAGEngine._init_vector_store() creates/loads ChromaDB collection
+   - Calls _validate_collection_dim() to detect/recover from dim mismatch
 4. For each tracked file (parallelized):
    a. Read content
    b. Choose chunker based on extension and config
@@ -210,6 +222,16 @@ When a project is loaded with RAG enabled, the `RAGEngine` is populated by index

 **Incremental Updates**: When a file's `mtime` changes (detected by `pathlib.Path.stat().st_mtime`), `delete_documents_by_path()` is called first, then the file is re-indexed. This is critical for the auto-sync flow (see Configuration below).

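The delete-then-reindex order described for incremental updates can be sketched as follows. This is an illustration under assumptions, not the project's sync code: `sync_file`, `ReindexTarget`, and the `seen_mtimes` bookkeeping dict are hypothetical, and the single-argument signatures of `delete_documents_by_path` and `index_file` are guessed from their names.

```python
from pathlib import Path
from typing import Dict, Protocol


class ReindexTarget(Protocol):
    """The two engine methods this sketch relies on (signatures are assumed)."""

    def delete_documents_by_path(self, path: str) -> None: ...

    def index_file(self, path: str) -> None: ...


def sync_file(engine: ReindexTarget, path: str, seen_mtimes: Dict[str, float]) -> bool:
    """Re-index `path` when its mtime differs from the last recorded value.

    Returns True if the file was re-indexed, False if it was unchanged.
    """
    mtime = Path(path).stat().st_mtime
    if seen_mtimes.get(path) == mtime:
        return False  # unchanged since the last sync
    # Delete first, as the text describes, then index the current content.
    engine.delete_documents_by_path(path)
    engine.index_file(path)
    seen_mtimes[path] = mtime
    return True
```

A file that has never been seen is treated as changed, so the first call for a path always indexes it.
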
+**Path resolution resilience**: `index_file()` falls back to `os.getcwd()` if the `base_dir`-relative path doesn't exist. This handles batched test conditions where the subprocess CWD differs from the project root (e.g., a test chdir'ing into `tests/artifacts/live_gui_workspace_*/` for fixture isolation). Without the fallback, indexing silently skipped files in those conditions.
+
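A minimal sketch of the fallback order described above, assuming the lookup tries `base_dir` first and the current working directory second. `resolve_index_path` is a hypothetical helper; the actual logic inside `index_file()` may differ in detail.

```python
import os
from typing import Optional


def resolve_index_path(base_dir: str, file_path: str) -> Optional[str]:
    """Resolve `file_path` against base_dir first, then against the CWD.

    Returns the absolute path of the first candidate that exists, or None
    when neither location contains the file.
    """
    if os.path.isabs(file_path):
        return file_path if os.path.exists(file_path) else None
    for root in (base_dir, os.getcwd()):
        candidate = os.path.join(root, file_path)
        if os.path.exists(candidate):
            return os.path.abspath(candidate)
    return None
```

Returning `None` instead of raising lets the caller decide whether a missing file is an error or something to log and skip.
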
+### Dimension Mismatch Protection
+
+`_init_vector_store()` calls `_validate_collection_dim()` after creating the collection. The validation inspects the first existing vector's dim and compares it to the current embedding provider's output. On mismatch (e.g., the user switched from Gemini 3072-dim to local 384-dim, or vice versa, or a prior run populated the collection with a different model), the chroma directory is wiped via `shutil.rmtree` (with the client closed first to release file handles) and the collection is recreated with the correct dim.
+
+**Why this exists:** Without validation, dim-mismatched upserts silently corrupt the collection. The next `search()` raises `chromadb.errors.InvalidDimensionError: Collection expecting embedding with dimension of X, got Y`, the AI request never reaches `'done'` status, and the live_gui test polls timeout at 50×0.5s = 25s. This pattern was the dominant cause of `tier-3-live_gui` failures in the 2026-06-08 to 2026-06-10 window.
+
+Regression tests in `tests/test_rag_engine.py`: `test_rag_collection_dim_mismatch_recreates_collection`, `test_rag_collection_dim_match_preserves_collection`.
+
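The check described in this section can be sketched roughly as below. This is not the body of `_validate_collection_dim`; the helper names are hypothetical, and the sketch assumes the stored dimension is read by fetching one record with `collection.get(limit=1, include=["embeddings"])`, which is one way to do it with a ChromaDB collection.

```python
import shutil
from typing import Any, Optional


def first_vector_dim(collection: Any) -> Optional[int]:
    """Return the dimension of one stored embedding, or None if none exist."""
    result = collection.get(limit=1, include=["embeddings"])
    embeddings = result.get("embeddings")
    # Use len() rather than truthiness so array-like results also work.
    if embeddings is None or len(embeddings) == 0:
        return None
    return len(embeddings[0])


def needs_reset(collection: Any, provider_dim: int) -> bool:
    """True when stored vectors exist and their dimension differs from the provider's."""
    stored = first_vector_dim(collection)
    return stored is not None and stored != provider_dim


def wipe_store(db_path: str) -> None:
    """Remove the on-disk store; the caller is assumed to have closed the client first."""
    shutil.rmtree(db_path, ignore_errors=True)
```

An empty collection never triggers a reset in this sketch, since there is no stored vector to disagree with the provider.
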
 ### Query Flow

 When `ai_client.send(rag_engine=engine)` is called:
@@ -262,33 +284,43 @@ RAG is configured via the project's `manual_slop.toml`:
 [rag]
 enabled = true
 embedding_provider = "gemini"  # or "local"
+
+[rag.vector_store]
+provider = "chroma"  # "chroma" | "mock"
+collection_name = "manual_slop"  # the chroma subdir under .slop_cache/
+url = ""  # future: external HTTP vector store
+api_key = ""  # future: external HTTP auth
+mcp_server = ""  # future: MCP-based external RAG bridge
+mcp_tool = ""  # future: tool name on the MCP server
+
+[rag]
 chunk_size = 1000
 chunk_overlap = 200
-ast_chunking_enabled = true
-vector_store_backend = "chromadb"
-vector_store_path = ".rag/chroma"  # relative to project base_dir
-auto_index_on_load = true
-auto_sync_interval_seconds = 60  # background re-indexing
-top_k = 5
 ```

-### `RAGConfig` Schema (`src/models.py`)
+### `RAGConfig` + `VectorStoreConfig` Schema (`src/models.py`)

 ```python
 @dataclass
+class VectorStoreConfig:
+    provider: str  # "chroma" | "mock"
+    url: Optional[str] = None  # future: external HTTP
+    api_key: Optional[str] = None  # future: external HTTP auth
+    collection_name: str = "manual_slop"
+    mcp_server: Optional[str] = None  # future: MCP bridge
+    mcp_tool: Optional[str] = None  # future: MCP tool name
+
+@dataclass
 class RAGConfig:
-    enabled: bool = False
-    embedding_provider: str = "gemini"  # "local" | "gemini"
-    chunk_size: int = 1000
-    chunk_overlap: int = 200
-    ast_chunking_enabled: bool = True
-    vector_store_backend: str = "chromadb"
-    vector_store_path: str = ".rag/chroma"
-    auto_index_on_load: bool = True
-    auto_sync_interval_seconds: int = 60
-    top_k: int = 5
+    enabled: bool = False
+    vector_store: VectorStoreConfig = field(default_factory=lambda: VectorStoreConfig(provider='mock'))
+    embedding_provider: str = 'gemini'  # "gemini" | "local"
+    chunk_size: int = 1000
+    chunk_overlap: int = 200
 ```

+> **Removed fields** (moved to other systems or not yet implemented): `ast_chunking_enabled` lives in `ChunkingConfig` (not in `RAGConfig`); `vector_store_backend`/`vector_store_path` replaced by nested `VectorStoreConfig`; `auto_index_on_load`/`auto_sync_interval_seconds`/`top_k` are runtime parameters set by the controller, not persisted in `RAGConfig`.
+
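To illustrate how the nested schema lines up with the `[rag]` table, here is a minimal sketch that builds the two dataclasses from an already-parsed mapping. The dataclass fields are copied from the listing above; `rag_config_from_dict` is a hypothetical loader, and mapping empty strings to `None` is an assumption made for this sketch, not documented behavior.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class VectorStoreConfig:
    provider: str
    url: Optional[str] = None
    api_key: Optional[str] = None
    collection_name: str = "manual_slop"
    mcp_server: Optional[str] = None
    mcp_tool: Optional[str] = None


@dataclass
class RAGConfig:
    enabled: bool = False
    vector_store: VectorStoreConfig = field(
        default_factory=lambda: VectorStoreConfig(provider="mock")
    )
    embedding_provider: str = "gemini"
    chunk_size: int = 1000
    chunk_overlap: int = 200


def rag_config_from_dict(data: Dict[str, Any]) -> RAGConfig:
    """Build a RAGConfig from a parsed config mapping (hypothetical loader)."""
    rag = dict(data.get("rag", {}))  # copy so the caller's mapping is untouched
    vs_raw = rag.pop("vector_store", None)
    if vs_raw is not None:
        # Assumption: empty-string placeholders are treated as unset.
        cleaned = {k: (None if v == "" else v) for k, v in vs_raw.items()}
        rag["vector_store"] = VectorStoreConfig(**cleaned)
    return RAGConfig(**rag)


if __name__ == "__main__":
    parsed = {
        "rag": {
            "enabled": True,
            "chunk_size": 1000,
            "vector_store": {"provider": "chroma", "url": ""},
        }
    }
    print(rag_config_from_dict(parsed))
```

The mapping has the shape that `tomllib.loads` (Python 3.11+) would return for the project file. TOML parsers reject a table header that is declared twice, so the scalar `[rag]` keys would need to sit under a single `[rag]` header for the file to parse.
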
 ### Behavior When Disabled

 If `enabled = false` (the default), `RAGEngine` is never constructed. `ai_client.send()` receives `rag_engine=None` and the integration is a no-op. The lazy-loading of `chromadb`, `sentence_transformers`, and `google.genai` is also skipped, so there is zero overhead for projects that don't use RAG.
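The disabled path can be pictured with a small sketch like the one below. `maybe_build_engine` and the factory argument are hypothetical names; the point is only that nothing engine-related runs, including any import placed inside the factory, unless the flag is set.

```python
from typing import Any, Callable, Optional


def maybe_build_engine(enabled: bool, factory: Callable[[], Any]) -> Optional[Any]:
    """Return factory() only when RAG is enabled; otherwise return None.

    The factory is never called on the disabled path, so imports placed
    inside it are not triggered either.
    """
    if not enabled:
        return None
    return factory()


if __name__ == "__main__":
    def build() -> str:
        # A real factory would import the heavy dependencies here.
        return "engine"

    print(maybe_build_engine(False, build))  # None
    print(maybe_build_engine(True, build))   # engine
```
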