docs(styleguide): add canonical reference for Data-Oriented Error Handling
This commit is contained in:
@@ -0,0 +1,322 @@
|
||||
# Data-Oriented Error Handling
|
||||
|
||||
> **Status:** Active convention as of 2026-06-11. Established by the
|
||||
> `data_oriented_error_handling_20260606` track. Canonical reference for all
|
||||
> Python error-handling decisions in this codebase.
|
||||
|
||||
This styleguide codifies Ryan Fleury's "errors are just cases" framework as the
|
||||
project convention. The 5 patterns below replace `Optional[T]` returns and
|
||||
exception-based control flow with `Result[T]` dataclasses and nil-sentinel
|
||||
dataclasses. SDK-boundary exceptions are caught and converted to `ErrorInfo`;
|
||||
the rest of the application works with data, not control flow.
|
||||
|
||||
Reference: [Ryan Fleury, "The Easiest Way To Handle Errors Is To Not Have
|
||||
Them"](https://www.dgtlgrove.com/p/the-easiest-way-to-handle-errors).
|
||||
Independent corroboration: Timothy Lottes (`ERROR[__line__]: _code_` exit
|
||||
pattern; each error code has exactly one meaning — never overload `UNKNOWN`),
|
||||
Valigo ("Exceptions are horrifying"; modern languages without legacy baggage
|
||||
move away from exceptions — Rust, Jai, Zig, Odin).
|
||||
|
||||
---
|
||||
|
||||
## The 5 Patterns
|
||||
|
||||
### 1. Nil-Sentinel Dataclasses (replaces `None`)
|
||||
|
||||
When a function would "return None" in conventional Python, return a
|
||||
nil-sentinel dataclass instead. The sentinel has all default values
|
||||
(zero-initialized) and is safe to read from.
|
||||
|
||||
```python
|
||||
from dataclasses import dataclass, field
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class NilPath:
|
||||
exists: bool = False
|
||||
read_text: str = ""
|
||||
errors: list[ErrorInfo] = field(default_factory=list)
|
||||
|
||||
NIL_PATH = NilPath() # module-level singleton
|
||||
```
|
||||
|
||||
Callers don't need `if x is None:` checks; they can call `x.read_text` and
|
||||
get `""` on the nil path.
|
||||
|
||||
**Convention:** `NIL_*` (uppercase) is the module-level singleton. `Nil*`
|
||||
(PascalCase) is the class. Frozen dataclass prevents runtime mutation.
|
||||
|
||||
### 2. Zero-Initialization (via `@dataclass` defaults)
|
||||
|
||||
Fresh memory from the OS is zero-initialized. In Python, `@dataclass` with
|
||||
field defaults achieves the same: the data is in a valid "empty" state
|
||||
without any explicit constructor logic.
|
||||
|
||||
```python
|
||||
@dataclass(frozen=True)
|
||||
class String8:
|
||||
text: str = ""
|
||||
size: int = 0
|
||||
```
|
||||
|
||||
Code that consumes `String8` (e.g., a for-loop bounded by `size`) works
|
||||
correctly with the zero-initialized instance.
|
||||
|
||||
**Convention:** Mutable defaults use `field(default_factory=list)` (NOT `= []`,
|
||||
which is shared across instances).
|
||||
|
||||
### 3. Fail Early (push validation to shallow stack frames)
|
||||
|
||||
Don't defer error checks to deep in the call stack. Push them to the entry
|
||||
point so the user knows ASAP if the operation cannot succeed.
|
||||
|
||||
```python
|
||||
def do_thing(path: Path) -> Result[str]:
|
||||
resolved = _resolve_path(path) # validation happens HERE, not deeper
|
||||
if not resolved.ok:
|
||||
return Result(data="", errors=resolved.errors)
|
||||
...
|
||||
```
|
||||
|
||||
**Convention:** `assert` at entry points for invariants. Early `return` for
|
||||
user-facing errors. `try/finally` (Python's analog to `goto defer`) for
|
||||
cleanup.
|
||||
|
||||
### 4. AND over OR (Result with side-channel errors; no sum types)
|
||||
|
||||
Instead of `Union[T, E]` or `Result<T, E>`, return a struct with BOTH data
|
||||
and errors as parallel fields:
|
||||
|
||||
```python
|
||||
@dataclass(frozen=True)
|
||||
class Result(Generic[T]):
|
||||
data: T # the happy-path result (zero-initialized on failure)
|
||||
errors: list[ErrorInfo] = field(default_factory=list) # side-channel; empty = success
|
||||
```
|
||||
|
||||
Callers:
|
||||
|
||||
```python
|
||||
r = do_thing(path)
|
||||
if r.errors:
|
||||
for err in r.errors: log(err.ui_message())
|
||||
# use r.data regardless (it's the zero-initialized value on failure)
|
||||
```
|
||||
|
||||
**Convention:** `Result` is generic over `T` (the success data) but NOT over
|
||||
the error type. Errors are always `list[ErrorInfo]` (a side-channel list, not
|
||||
a tagged sum). This collapses the bifurcated `if r.ok: ... else: ...`
|
||||
codepaths into a single flat codepath.
|
||||
|
||||
### 5. Error Info as Side-Channel (not as exception)
|
||||
|
||||
Errors flow as DATA in the `Result` struct, not as exceptions. SDK
|
||||
boundaries (which must catch vendor exceptions) convert them to `ErrorInfo`:
|
||||
|
||||
```python
|
||||
@dataclass(frozen=True)
|
||||
class ErrorInfo:
|
||||
kind: ErrorKind
|
||||
message: str
|
||||
source: str = ""
|
||||
original: BaseException | None = None
|
||||
def ui_message(self) -> str:
|
||||
src = f"[{self.source}] " if self.source else ""
|
||||
return f"{src}{self.kind.value}: {self.message}"
|
||||
```
|
||||
|
||||
**Convention:** `ErrorInfo` is the canonical error type. The legacy
|
||||
`ai_client.ProviderError` exception class is removed; SDK helpers
|
||||
(`_classify_<vendor>_error()`) RETURN `ErrorInfo` instead of raising.
|
||||
|
||||
---
|
||||
|
||||
## The Data Model
|
||||
|
||||
The canonical types live in `src/result_types.py`:
|
||||
|
||||
| Type | Form | Purpose |
|
||||
|---|---|---|
|
||||
| `ErrorKind` | `str, Enum` (12+ values) | Canonical error taxonomy: `NETWORK`, `AUTH`, `QUOTA`, `RATE_LIMIT`, `BALANCE`, `PERMISSION`, `NOT_FOUND`, `INVALID_INPUT`, `NOT_READY`, `UNKNOWN`, `CONFIG`, `INTERNAL`, plus optional `PROVIDER_HISTORY_DIVERGED_FROM_UI` for app-vs-provider-state-divergence cases. Each value has exactly one meaning. |
|
||||
| `ErrorInfo` | `@dataclass(frozen=True)` | A single error: `kind: ErrorKind`, `message: str`, `source: str = ""`, `original: BaseException \| None = None`. Frozen; carries `ui_message()` for display. |
|
||||
| `Result[T]` | `@dataclass(frozen=True)` `Generic[T]` | The success-or-failure container: `data: T`, `errors: list[ErrorInfo] = field(default_factory=list)`, `ok: bool` property, `with_error()`, `with_errors()`, `with_data()` methods. |
|
||||
| `NilPath` | `@dataclass(frozen=True)` + `NIL_PATH` | Nil-sentinel for filesystem paths. Has `exists=False`, `read_text=""`, `errors=[]`. |
|
||||
| `NilRAGState` | `@dataclass(frozen=True)` + `NIL_RAG_STATE` | Nil-sentinel for the RAG engine. Has `enabled=False`, `is_empty_result=True`, `errors=[]`. |
|
||||
| `OK` | `Result[None]` constant | Trivial success for fail-or-succeed operations that carry no data. |
|
||||
|
||||
`Result` is **generic over `T` only** (not over the error type). Errors are
|
||||
always `list[ErrorInfo]`. This is the AND-over-OR principle: data and errors
|
||||
are parallel fields, not a tagged sum.
|
||||
|
||||
---
|
||||
|
||||
## Decision Tree
|
||||
|
||||
```
|
||||
Need to represent "missing or failed"?
|
||||
|
|
||||
+-- Is the value a "data" value (not a control-flow signal)?
|
||||
| +-- Use a Result dataclass (data + errors list)
|
||||
| +-- Use a nil-sentinel dataclass (zero-initialized)
|
||||
|
|
||||
+-- Is the value a control-flow signal (e.g., "abort" or "skip")?
|
||||
| +-- Use a boolean (or enum)
|
||||
| +-- Use Optional[bool] / Optional[Enum] ONLY if the absence is meaningful
|
||||
|
|
||||
+-- Is the failure "unrecoverable" (programmer error, not runtime condition)?
|
||||
| +-- Use assert (debug builds)
|
||||
| +-- Use raise (only for programmer errors like KeyError on a known dict)
|
||||
|
|
||||
+-- Does the SDK raise an exception you can't avoid?
|
||||
+-- Catch at the boundary; convert to ErrorInfo inside a Result
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Anti-Patterns
|
||||
|
||||
**DON'T do these things:**
|
||||
|
||||
1. **DON'T** use `Optional[X]` for "this might fail at runtime". Use
|
||||
`Result[X]` instead.
|
||||
2. **DON'T** use `None` as a sentinel for "no result". Use a nil-sentinel
|
||||
dataclass.
|
||||
3. **DON'T** raise a custom exception class for runtime failures. Catch SDK
|
||||
exceptions and return `ErrorInfo`.
|
||||
4. **DON'T** use `Union[T, E]` (sum type). Use a struct with parallel fields
|
||||
(AND over OR).
|
||||
5. **DON'T** have `if x is None: handle; else: use_x` patterns in production
|
||||
code. The nil-sentinel makes them unnecessary.
|
||||
6. **DON'T** catch `except Exception` and silently swallow. Convert to
|
||||
`ErrorInfo` and return in the `Result`.
|
||||
|
||||
---
|
||||
|
||||
## Examples
|
||||
|
||||
The 3 refactored subsystems demonstrate each pattern in context:
|
||||
|
||||
- **`src/mcp_client.py:205-294`** — `read_file`, `list_directory`,
|
||||
`search_files` return `Result[str]`; `(p, err)` tuples become
|
||||
`Result[Path]`; the 30+ `assert p is not None` chain (lines 304-794) is
|
||||
removed.
|
||||
- **`src/ai_client.py`** — `_send_<vendor>_result()` returns `Result[str]`
|
||||
(8 vendors: gemini, anthropic, deepseek, minimax, gemini_cli, qwen, llama,
|
||||
grok); `send_result()` is the new public API; `send()` is `@deprecated`.
|
||||
- **`src/rag_engine.py:100-180`** — `_init_vector_store_result`,
|
||||
`_validate_collection_dim_result`, `is_empty_result`, `add_documents_result`
|
||||
return `Result[None]` or `Result[T]`; broad `except Exception` blocks
|
||||
become `ErrorInfo` entries.
|
||||
|
||||
---
|
||||
|
||||
## Hard Rules (enforced in the 3 refactored files)
|
||||
|
||||
These are non-negotiable in `src/mcp_client.py`, `src/ai_client.py`, and
|
||||
`src/rag_engine.py`:
|
||||
|
||||
- **`Optional[T]` return types are FORBIDDEN** in the 3 refactored files. Use
|
||||
`Result[T]` (with `NIL_T` singleton if needed) instead. Rationale:
|
||||
`Optional[T]` is the sum type `Union[T, None]` that Fleury's framework
|
||||
replaces. Mixing the two patterns reintroduces the bifurcation the
|
||||
convention is designed to remove.
|
||||
- **Function return types must be `Result[T]` for any function that can fail
|
||||
at runtime.** A function that can't fail (e.g., `get_name() -> str`)
|
||||
doesn't need a `Result`. The classification is "can this return a different
|
||||
value under different runtime conditions?" If yes, `Result`. If no, plain
|
||||
return type.
|
||||
- **Catch SDK exceptions at the boundary only.** Inside the 3 refactored
|
||||
files, the only place an exception is caught is at the SDK call site
|
||||
(e.g., `_send_<vendor>_result()` wrapping the SDK call). Internal
|
||||
`try/except` is reserved for converting `OSError`, `PermissionError`, and
|
||||
similar I/O exceptions to `ErrorInfo` at the mcp_client tool boundary.
|
||||
|
||||
The verification script `scripts/audit_optional_in_3_files.py` enforces the
|
||||
`Optional[X]` rule by failing CI if any new `Optional[X]` appears in the 3
|
||||
refactored files.
|
||||
|
||||
### `Optional[X]` in argument types
|
||||
|
||||
The `Optional[X]` ban above applies to **return types only**. Argument types
|
||||
that genuinely may be `None` (e.g., `rag_engine: Optional[Any] = None`,
|
||||
`pre_tool_callback: Optional[Callable] = None`) remain allowed; they describe
|
||||
a caller choice, not a runtime failure of this function.
|
||||
|
||||
### Cross-thread safety
|
||||
|
||||
`Result` and `ErrorInfo` are `@dataclass(frozen=True)` and therefore
|
||||
thread-safe by immutability. The `with_error()` / `with_errors()` /
|
||||
`with_data()` methods produce new instances (no mutation), matching the
|
||||
project's "no shared mutable state across threads" invariant. Deprecation
|
||||
warnings use `warnings.warn(..., stacklevel=2)` which is thread-safe.
|
||||
|
||||
---
|
||||
|
||||
## When to Use This Convention
|
||||
|
||||
**Use it for:**
|
||||
|
||||
- New public APIs (any function that can fail at runtime and the caller
|
||||
might care).
|
||||
- New internal functions where the caller benefits from knowing the failure
|
||||
(vs. just propagating `None`).
|
||||
|
||||
**Don't use it for:**
|
||||
|
||||
- Constructors (`__init__`) that fail with programmer errors (use `assert` or
|
||||
`raise` for these).
|
||||
- Trivial getters that can't fail (`get_name() -> str` doesn't need a
|
||||
`Result`).
|
||||
- Performance-critical hot paths where the overhead of the dataclass
|
||||
allocation is measurable (rare; benchmark first).
|
||||
|
||||
---
|
||||
|
||||
## Migration Playbook
|
||||
|
||||
When converting existing code:
|
||||
|
||||
1. Identify the `Optional[X]` return type or the `raise` statement.
|
||||
2. Define a `Result` dataclass (or use the existing one) with `data: X` and
|
||||
`errors: list[ErrorInfo]`.
|
||||
3. Replace `None` returns with `Result(data=NIL_X, errors=[...])` or
|
||||
`Result(data=zero_value, errors=[...])`.
|
||||
4. Replace `raise X` with
|
||||
`return Result(data=zero_value, errors=[ErrorInfo(kind=..., message=...)])`.
|
||||
5. Update the caller to check `result.errors` instead of `is None` /
|
||||
`try/except`.
|
||||
6. Add a test that verifies both the success and failure paths return the
|
||||
right `Result`.
|
||||
|
||||
---
|
||||
|
||||
## Deprecation: `ai_client.send()` → `ai_client.send_result()`
|
||||
|
||||
The public `ai_client.send()` is marked `@deprecated` (via
|
||||
`typing_extensions.deprecated`, the Python 3.11+ backport of
|
||||
`@warnings.deprecated`). It still works for backward compat but emits a
|
||||
`DeprecationWarning` at runtime. New code MUST use `ai_client.send_result()`.
|
||||
|
||||
- `send_result(...) -> Result[str, ErrorInfo]` — the new public API.
|
||||
- `send(...) -> str` — **deprecated.** Returns `str` for backward compat;
|
||||
errors are logged to the comms log but not returned.
|
||||
- Removal timeline: `public_api_migration_20260606` follow-up track.
|
||||
|
||||
The deprecation warning is cached per call site (Python's `__warningregistry__`)
|
||||
to avoid log spam. `tests/conftest.py` adds a `filterwarnings` entry to
|
||||
silence the warning during the transition; new tests for the new API should
|
||||
assert the warning is NOT emitted by `send_result()`.
|
||||
|
||||
---
|
||||
|
||||
## See Also
|
||||
|
||||
- `conductor/tracks/data_oriented_error_handling_20260606/spec.md` — the spec
|
||||
that established this convention.
|
||||
- `docs/guide_ai_client.md` "Data-Oriented Error Handling (Fleury Pattern)"
|
||||
— the in-context guide for the provider layer.
|
||||
- `docs/guide_mcp_client.md` "Data-Oriented Error Handling (Fleury Pattern)"
|
||||
— the in-context guide for the MCP tool layer.
|
||||
- `docs/guide_rag.md` "Data-Oriented Error Handling (Fleury Pattern)" — the
|
||||
in-context guide for the RAG engine.
|
||||
- Ryan Fleury's [original article](https://www.dgtlgrove.com/p/the-easiest-way-to-handle-errors)
|
||||
— the philosophical foundation.
|
||||
Reference in New Issue
Block a user