diff --git a/conductor/code_styleguides/error_handling.md b/conductor/code_styleguides/error_handling.md new file mode 100644 index 00000000..fc0b5732 --- /dev/null +++ b/conductor/code_styleguides/error_handling.md @@ -0,0 +1,322 @@ +# Data-Oriented Error Handling + +> **Status:** Active convention as of 2026-06-11. Established by the +> `data_oriented_error_handling_20260606` track. Canonical reference for all +> Python error-handling decisions in this codebase. + +This styleguide codifies Ryan Fleury's "errors are just cases" framework as the +project convention. The 5 patterns below replace `Optional[T]` returns and +exception-based control flow with `Result[T]` dataclasses and nil-sentinel +dataclasses. SDK-boundary exceptions are caught and converted to `ErrorInfo`; +the rest of the application works with data, not control flow. + +Reference: [Ryan Fleury, "The Easiest Way To Handle Errors Is To Not Have +Them"](https://www.dgtlgrove.com/p/the-easiest-way-to-handle-errors). +Independent corroboration: Timothy Lottes (`ERROR[__line__]: _code_` exit +pattern; each error code has exactly one meaning — never overload `UNKNOWN`), +Valigo ("Exceptions are horrifying"; modern languages without legacy baggage +move away from exceptions — Rust, Jai, Zig, Odin). + +--- + +## The 5 Patterns + +### 1. Nil-Sentinel Dataclasses (replaces `None`) + +When a function would "return None" in conventional Python, return a +nil-sentinel dataclass instead. The sentinel has all default values +(zero-initialized) and is safe to read from. + +```python +from dataclasses import dataclass, field + +@dataclass(frozen=True) +class NilPath: + exists: bool = False + read_text: str = "" + errors: list[ErrorInfo] = field(default_factory=list) + +NIL_PATH = NilPath() # module-level singleton +``` + +Callers don't need `if x is None:` checks; they can call `x.read_text` and +get `""` on the nil path. + +**Convention:** `NIL_*` (uppercase) is the module-level singleton. `Nil*` +(PascalCase) is the class. Frozen dataclass prevents runtime mutation. + +### 2. Zero-Initialization (via `@dataclass` defaults) + +Fresh memory from the OS is zero-initialized. In Python, `@dataclass` with +field defaults achieves the same: the data is in a valid "empty" state +without any explicit constructor logic. + +```python +@dataclass(frozen=True) +class String8: + text: str = "" + size: int = 0 +``` + +Code that consumes `String8` (e.g., a for-loop bounded by `size`) works +correctly with the zero-initialized instance. + +**Convention:** Mutable defaults use `field(default_factory=list)` (NOT `= []`, +which is shared across instances). + +### 3. Fail Early (push validation to shallow stack frames) + +Don't defer error checks to deep in the call stack. Push them to the entry +point so the user knows ASAP if the operation cannot succeed. + +```python +def do_thing(path: Path) -> Result[str]: + resolved = _resolve_path(path) # validation happens HERE, not deeper + if not resolved.ok: + return Result(data="", errors=resolved.errors) + ... +``` + +**Convention:** `assert` at entry points for invariants. Early `return` for +user-facing errors. `try/finally` (Python's analog to `goto defer`) for +cleanup. + +### 4. AND over OR (Result with side-channel errors; no sum types) + +Instead of `Union[T, E]` or `Result`, return a struct with BOTH data +and errors as parallel fields: + +```python +@dataclass(frozen=True) +class Result(Generic[T]): + data: T # the happy-path result (zero-initialized on failure) + errors: list[ErrorInfo] = field(default_factory=list) # side-channel; empty = success +``` + +Callers: + +```python +r = do_thing(path) +if r.errors: + for err in r.errors: log(err.ui_message()) +# use r.data regardless (it's the zero-initialized value on failure) +``` + +**Convention:** `Result` is generic over `T` (the success data) but NOT over +the error type. Errors are always `list[ErrorInfo]` (a side-channel list, not +a tagged sum). This collapses the bifurcated `if r.ok: ... else: ...` +codepaths into a single flat codepath. + +### 5. Error Info as Side-Channel (not as exception) + +Errors flow as DATA in the `Result` struct, not as exceptions. SDK +boundaries (which must catch vendor exceptions) convert them to `ErrorInfo`: + +```python +@dataclass(frozen=True) +class ErrorInfo: + kind: ErrorKind + message: str + source: str = "" + original: BaseException | None = None + def ui_message(self) -> str: + src = f"[{self.source}] " if self.source else "" + return f"{src}{self.kind.value}: {self.message}" +``` + +**Convention:** `ErrorInfo` is the canonical error type. The legacy +`ai_client.ProviderError` exception class is removed; SDK helpers +(`_classify__error()`) RETURN `ErrorInfo` instead of raising. + +--- + +## The Data Model + +The canonical types live in `src/result_types.py`: + +| Type | Form | Purpose | +|---|---|---| +| `ErrorKind` | `str, Enum` (12+ values) | Canonical error taxonomy: `NETWORK`, `AUTH`, `QUOTA`, `RATE_LIMIT`, `BALANCE`, `PERMISSION`, `NOT_FOUND`, `INVALID_INPUT`, `NOT_READY`, `UNKNOWN`, `CONFIG`, `INTERNAL`, plus optional `PROVIDER_HISTORY_DIVERGED_FROM_UI` for app-vs-provider-state-divergence cases. Each value has exactly one meaning. | +| `ErrorInfo` | `@dataclass(frozen=True)` | A single error: `kind: ErrorKind`, `message: str`, `source: str = ""`, `original: BaseException \| None = None`. Frozen; carries `ui_message()` for display. | +| `Result[T]` | `@dataclass(frozen=True)` `Generic[T]` | The success-or-failure container: `data: T`, `errors: list[ErrorInfo] = field(default_factory=list)`, `ok: bool` property, `with_error()`, `with_errors()`, `with_data()` methods. | +| `NilPath` | `@dataclass(frozen=True)` + `NIL_PATH` | Nil-sentinel for filesystem paths. Has `exists=False`, `read_text=""`, `errors=[]`. | +| `NilRAGState` | `@dataclass(frozen=True)` + `NIL_RAG_STATE` | Nil-sentinel for the RAG engine. Has `enabled=False`, `is_empty_result=True`, `errors=[]`. | +| `OK` | `Result[None]` constant | Trivial success for fail-or-succeed operations that carry no data. | + +`Result` is **generic over `T` only** (not over the error type). Errors are +always `list[ErrorInfo]`. This is the AND-over-OR principle: data and errors +are parallel fields, not a tagged sum. + +--- + +## Decision Tree + +``` +Need to represent "missing or failed"? +| ++-- Is the value a "data" value (not a control-flow signal)? +| +-- Use a Result dataclass (data + errors list) +| +-- Use a nil-sentinel dataclass (zero-initialized) +| ++-- Is the value a control-flow signal (e.g., "abort" or "skip")? +| +-- Use a boolean (or enum) +| +-- Use Optional[bool] / Optional[Enum] ONLY if the absence is meaningful +| ++-- Is the failure "unrecoverable" (programmer error, not runtime condition)? +| +-- Use assert (debug builds) +| +-- Use raise (only for programmer errors like KeyError on a known dict) +| ++-- Does the SDK raise an exception you can't avoid? + +-- Catch at the boundary; convert to ErrorInfo inside a Result +``` + +--- + +## Anti-Patterns + +**DON'T do these things:** + +1. **DON'T** use `Optional[X]` for "this might fail at runtime". Use + `Result[X]` instead. +2. **DON'T** use `None` as a sentinel for "no result". Use a nil-sentinel + dataclass. +3. **DON'T** raise a custom exception class for runtime failures. Catch SDK + exceptions and return `ErrorInfo`. +4. **DON'T** use `Union[T, E]` (sum type). Use a struct with parallel fields + (AND over OR). +5. **DON'T** have `if x is None: handle; else: use_x` patterns in production + code. The nil-sentinel makes them unnecessary. +6. **DON'T** catch `except Exception` and silently swallow. Convert to + `ErrorInfo` and return in the `Result`. + +--- + +## Examples + +The 3 refactored subsystems demonstrate each pattern in context: + +- **`src/mcp_client.py:205-294`** — `read_file`, `list_directory`, + `search_files` return `Result[str]`; `(p, err)` tuples become + `Result[Path]`; the 30+ `assert p is not None` chain (lines 304-794) is + removed. +- **`src/ai_client.py`** — `_send__result()` returns `Result[str]` + (8 vendors: gemini, anthropic, deepseek, minimax, gemini_cli, qwen, llama, + grok); `send_result()` is the new public API; `send()` is `@deprecated`. +- **`src/rag_engine.py:100-180`** — `_init_vector_store_result`, + `_validate_collection_dim_result`, `is_empty_result`, `add_documents_result` + return `Result[None]` or `Result[T]`; broad `except Exception` blocks + become `ErrorInfo` entries. + +--- + +## Hard Rules (enforced in the 3 refactored files) + +These are non-negotiable in `src/mcp_client.py`, `src/ai_client.py`, and +`src/rag_engine.py`: + +- **`Optional[T]` return types are FORBIDDEN** in the 3 refactored files. Use + `Result[T]` (with `NIL_T` singleton if needed) instead. Rationale: + `Optional[T]` is the sum type `Union[T, None]` that Fleury's framework + replaces. Mixing the two patterns reintroduces the bifurcation the + convention is designed to remove. +- **Function return types must be `Result[T]` for any function that can fail + at runtime.** A function that can't fail (e.g., `get_name() -> str`) + doesn't need a `Result`. The classification is "can this return a different + value under different runtime conditions?" If yes, `Result`. If no, plain + return type. +- **Catch SDK exceptions at the boundary only.** Inside the 3 refactored + files, the only place an exception is caught is at the SDK call site + (e.g., `_send__result()` wrapping the SDK call). Internal + `try/except` is reserved for converting `OSError`, `PermissionError`, and + similar I/O exceptions to `ErrorInfo` at the mcp_client tool boundary. + +The verification script `scripts/audit_optional_in_3_files.py` enforces the +`Optional[X]` rule by failing CI if any new `Optional[X]` appears in the 3 +refactored files. + +### `Optional[X]` in argument types + +The `Optional[X]` ban above applies to **return types only**. Argument types +that genuinely may be `None` (e.g., `rag_engine: Optional[Any] = None`, +`pre_tool_callback: Optional[Callable] = None`) remain allowed; they describe +a caller choice, not a runtime failure of this function. + +### Cross-thread safety + +`Result` and `ErrorInfo` are `@dataclass(frozen=True)` and therefore +thread-safe by immutability. The `with_error()` / `with_errors()` / +`with_data()` methods produce new instances (no mutation), matching the +project's "no shared mutable state across threads" invariant. Deprecation +warnings use `warnings.warn(..., stacklevel=2)` which is thread-safe. + +--- + +## When to Use This Convention + +**Use it for:** + +- New public APIs (any function that can fail at runtime and the caller + might care). +- New internal functions where the caller benefits from knowing the failure + (vs. just propagating `None`). + +**Don't use it for:** + +- Constructors (`__init__`) that fail with programmer errors (use `assert` or + `raise` for these). +- Trivial getters that can't fail (`get_name() -> str` doesn't need a + `Result`). +- Performance-critical hot paths where the overhead of the dataclass + allocation is measurable (rare; benchmark first). + +--- + +## Migration Playbook + +When converting existing code: + +1. Identify the `Optional[X]` return type or the `raise` statement. +2. Define a `Result` dataclass (or use the existing one) with `data: X` and + `errors: list[ErrorInfo]`. +3. Replace `None` returns with `Result(data=NIL_X, errors=[...])` or + `Result(data=zero_value, errors=[...])`. +4. Replace `raise X` with + `return Result(data=zero_value, errors=[ErrorInfo(kind=..., message=...)])`. +5. Update the caller to check `result.errors` instead of `is None` / + `try/except`. +6. Add a test that verifies both the success and failure paths return the + right `Result`. + +--- + +## Deprecation: `ai_client.send()` → `ai_client.send_result()` + +The public `ai_client.send()` is marked `@deprecated` (via +`typing_extensions.deprecated`, the Python 3.11+ backport of +`@warnings.deprecated`). It still works for backward compat but emits a +`DeprecationWarning` at runtime. New code MUST use `ai_client.send_result()`. + +- `send_result(...) -> Result[str, ErrorInfo]` — the new public API. +- `send(...) -> str` — **deprecated.** Returns `str` for backward compat; + errors are logged to the comms log but not returned. +- Removal timeline: `public_api_migration_20260606` follow-up track. + +The deprecation warning is cached per call site (Python's `__warningregistry__`) +to avoid log spam. `tests/conftest.py` adds a `filterwarnings` entry to +silence the warning during the transition; new tests for the new API should +assert the warning is NOT emitted by `send_result()`. + +--- + +## See Also + +- `conductor/tracks/data_oriented_error_handling_20260606/spec.md` — the spec + that established this convention. +- `docs/guide_ai_client.md` "Data-Oriented Error Handling (Fleury Pattern)" + — the in-context guide for the provider layer. +- `docs/guide_mcp_client.md` "Data-Oriented Error Handling (Fleury Pattern)" + — the in-context guide for the MCP tool layer. +- `docs/guide_rag.md` "Data-Oriented Error Handling (Fleury Pattern)" — the + in-context guide for the RAG engine. +- Ryan Fleury's [original article](https://www.dgtlgrove.com/p/the-easiest-way-to-handle-errors) + — the philosophical foundation.