diff --git a/conductor/code_styleguides/error_handling.md b/conductor/code_styleguides/error_handling.md index c86981c9..5d08c322 100644 --- a/conductor/code_styleguides/error_handling.md +++ b/conductor/code_styleguides/error_handling.md @@ -263,7 +263,7 @@ warnings use `warnings.warn(..., stacklevel=2)` which is thread-safe. **Don't use it for:** - Constructors (`__init__`) that fail with programmer errors (use `assert` or - `raise` for these). + `raise` for these). See "Constructors Can Raise" below for the full rule. - Trivial getters that can't fail (`get_name() -> str` doesn't need a `Result`). - Performance-critical hot paths where the overhead of the dataclass @@ -271,6 +271,337 @@ warnings use `warnings.warn(..., stacklevel=2)` which is thread-safe. --- +## Boundary Types: What Counts as a "Boundary"? + +The convention says "exceptions are reserved for the SDK boundary," but what +counts as a boundary? There are 3 categories: + +### 1. Third-party SDK calls + +A try/except that wraps a call to a third-party SDK is the canonical +boundary use of the pattern. The catch site converts the SDK's exception +to `ErrorInfo` (or re-raises if the function is the public API and a Result +is the right return type). + +Recognized third-party SDK modules (partial list): +`anthropic`, `google` / `google.genai` / `google.api_core`, `openai`, +`groq`, `cohere`, `chromadb`, `sentence_transformers`, `huggingface_hub`, +`requests`, `urllib3`, `httpx`, `aiohttp`, `websockets`, `psutil`, +`imgui_bundle`, `dearpygui`, `PIL`, `cv2`, `numpy`. + +Recognized third-party exception types (partial list): +`anthropic.APIError` / `RateLimitError` / `AuthenticationError`, +`google.api_core.exceptions.GoogleAPIError` / `ResourceExhausted`, +`openai.OpenAIError` / `APIError` / `RateLimitError`, +`requests.RequestException` / `ConnectionError` / `Timeout`, +`httpx.HTTPError` / `RequestError`, +`chromadb.errors.ChromaError`, +`pydantic.ValidationError`. + +### 2. Stdlib I/O that can raise + +File and network I/O via stdlib (`open()`, `os.path.*`, `json.loads()`, +`subprocess.run()`, `socket.*`, `sqlite3.*`, `csv.*`, `zipfile.*`, +`xml.etree.ElementTree`) commonly raises. Catching the specific exception +(`OSError`, `FileNotFoundError`, `PermissionError`, +`json.JSONDecodeError`, `subprocess.CalledProcessError`, etc.) at the +tool boundary and converting to `ErrorInfo` is compliant. + +This is the "stdlib I/O exception caught in our own code is acceptable" +rule. The catch site should be **specific** (`except FileNotFoundError`, +not `except Exception`) and should convert to `ErrorInfo`, not swallow. + +### 3. Framework boundaries (FastAPI) + +A try/except or `raise` in a FastAPI `_api_*` handler is the framework +boundary. `raise HTTPException(status_code=..., detail=...)` is the +FastAPI-idiomatic way to signal an HTTP error; FastAPI converts it to a +JSON response at the framework level. This is **not** an exception leak +into internal code; it's the framework contract. + +```python +# Compliant: FastAPI boundary in _api_* handler +async def _api_get_key(controller, header_key: str) -> str: + if not _is_valid_key(header_key): + raise HTTPException(status_code=403, detail="Could not validate API Key") + return header_key + +# Compliant: broad catch + HTTPException at the FastAPI boundary +async def _api_generate(controller, payload): + try: + result = ai_client.send_result(...) + return result.data + except Exception as e: + raise HTTPException(status_code=500, detail=f"AI call failed: {e}") +``` + +The catch-all `except Exception` is acceptable here **because the +conversion is to the framework's exception** (HTTPException), not to a +silent swallow. The detail message includes the original error; the +HTTP status code is the framework contract. + +### What is NOT a boundary + +- Internal business logic: `try/except` around a `for` loop in a + controller method is internal, not boundary. +- Cross-method calls within `src/`: calling a method in + `app_controller.py` from a method in `app_controller.py` is internal, + not boundary. +- stdlib I/O that the user controls directly: opening a file the user + passed via `--config` is internal; converting the failure should be + Result-based, not exception-based. + +--- + +## The Broad-Except Distinction + +Anti-pattern #6 says "DON'T catch `except Exception` and silently swallow." +But `except Exception` is **not always a violation**. The distinction is +**what the catch site does with the exception**: + +| What the catch does | Classification | Convention status | +|---|---|---| +| `pass` (or no body) | `INTERNAL_SILENT_SWALLOW` | **Violation** | +| `print(...)` / `log(...)` only | `INTERNAL_SILENT_SWALLOW` | **Violation** (the data is lost) | +| `return None` / `return Optional[T]` | `INTERNAL_OPTIONAL_RETURN` | **Violation** (use `Result[T]`) | +| `return Result(data=..., errors=[ErrorInfo(...)])` | `BOUNDARY_CONVERSION` | **Compliant** (the canonical pattern) | +| `raise` (re-raise) | `INTERNAL_RETHROW` (or `BOUNDARY_SDK` if at third-party call) | **Suspicious** (often refactorable) | +| `raise HTTPException(...)` (in `_api_*` handler) | `BOUNDARY_FASTAPI` | **Compliant** (the framework contract) | + +**The canonical pattern** (in `_result` functions that wrap third-party SDK +calls): + +```python +def _validate_collection_dim_result(self) -> Result[None]: + if self.collection is None or self.collection == "mock": + return Result(data=None) + try: + res = self.collection.get(limit=1, include=["embeddings"]) + # ... validation logic ... + return Result(data=None) + except Exception as e: + return Result(data=None, errors=[ + ErrorInfo(kind=ErrorKind.INTERNAL, + message=f"Failed to validate collection dim: {e}", + source="rag._validate_collection_dim", + original=e) + ]) +``` + +This `except Exception` is **compliant** because the catch + ErrorInfo +conversion IS the data-oriented pattern. The `original=e` field preserves +the original exception for debugging. + +**The anti-pattern** (in internal code that has nothing to do with a +third-party SDK): + +```python +# VIOLATION: broad catch + silent swallow +try: + do_something() +except Exception: + pass + +# VIOLATION: broad catch + log-only (data is lost) +try: + do_something() +except Exception as e: + print(f"Error: {e}") +``` + +--- + +## Constructors Can Raise + +Per the "When to Use This Convention" section, constructors (`__init__`) +that fail with programmer errors use `assert` or `raise`. This section +elaborates. + +**Compliant constructor raises:** + +```python +class MyClass: + def __init__(self, config: Config): + if config is None: + raise ValueError("MyClass requires a non-None Config") + if not config.api_key: + raise ValueError("MyClass requires a non-empty api_key") + self._config = config +``` + +**Compliant assert (for impossible states):** + +```python +def _set_rag_status(self, status: str): + # The status string is one of a known set; if it's not, the caller + # has a bug. + assert status in {"idle", "ready", "syncing", "error"}, f"Unknown status: {status}" + self._rag_status = status +``` + +**The rule:** if the failure is "this object cannot exist without X," raise +in `__init__` is the canonical pattern. The Result pattern is for runtime +failures ("the network is down"); raise is for programmer errors ("you +forgot to pass X"). + +**Recognized programmer-error exception types** (per +`scripts/audit_exception_handling.py` `INTERNAL_PROGRAMMER_RAISE` +category): +`AssertionError`, `ValueError`, `KeyError`, `IndexError`, `TypeError`, +`AttributeError`, `NameError`, `RuntimeError`, `NotImplementedError`. + +--- + +## Re-Raise Patterns + +A `try/except + raise` (without ErrorInfo conversion) is **suspicious** but +not always a violation. There are 3 legitimate re-raise patterns: + +### 1. Catch + convert + raise as a different type + +```python +# Compliant: convert library error to user-friendly error +try: + value = json.loads(raw) +except json.JSONDecodeError as e: + raise ValueError(f"Invalid JSON: {e}") from e +``` + +The `from e` preserves the original exception in the traceback. The +new exception type (`ValueError`) is more meaningful to the caller. + +### 2. Catch + log + re-raise + +```python +# Compliant: log before propagating +try: + do_something() +except Exception as e: + logger.exception("do_something failed; will propagate") + raise +``` + +The log line provides a record; the re-raise preserves the original +control flow. This is appropriate when the failure is severe and the +caller should still handle it. + +### 3. Catch + cleanup + re-raise + +```python +# Compliant: ensure cleanup before propagating +try: + resource = acquire() + do_something(resource) +finally: + release(resource) # `finally` is cleaner; `except+raise` is for when + # you also need to log or convert +``` + +Use `try/finally` for the pure cleanup case (no logging/conversion). +Use `try/except + re-raise` when you need to log or convert AND ensure +cleanup. + +### Suspicious re-raise (often a code smell) + +```python +# SUSPICIOUS: catch + re-raise the same exception (no value-add) +try: + do_something() +except Exception: + raise +``` + +This catches an exception, does nothing with it, and re-raises. The +`try/except` is dead code; remove it or use a `Result`-based propagation +instead. + +The audit script flags this as `INTERNAL_RETHROW` (suspicious). If you +see this pattern in code review, ask "is the `try/except` doing anything +useful? If not, remove it." + +--- + +## Audit Script + +The convention is enforced via +`scripts/audit_exception_handling.py`. This is a static analyzer (AST-based) +that classifies every `try/except/finally/raise` site in the codebase per +the categories in the previous sections. + +**Usage:** + +```bash +# Human-readable report +uv run python scripts/audit_exception_handling.py + +# JSON output for tooling +uv run python scripts/audit_exception_handling.py --json + +# Include tests/ and scripts/ +uv run python scripts/audit_exception_handling.py --include-tests + +# Top N files (default: 15) +uv run python scripts/audit_exception_handling.py --top 20 + +# Show every site inline +uv run python scripts/audit_exception_handling.py --verbose + +# Strict mode (exit 1 on any violation; for CI use) +uv run python scripts/audit_exception_handling.py --strict +``` + +**"Delete to turn off"** (per `feature_flags.md`): `rm +scripts/audit_exception_handling.py` disables the audit. Re-enable by +restoring the file (it's tracked in git). + +**Classification categories** (the canonical taxonomy; matches the +script's output): + +| Category | Convention status | When | +|---|---|---| +| `BOUNDARY_SDK` | Compliant | Wraps a third-party SDK call | +| `BOUNDARY_IO` | Compliant | Wraps stdlib I/O that can raise | +| `BOUNDARY_CONVERSION` | Compliant | Catches and converts to `ErrorInfo` in a `Result` | +| `BOUNDARY_FASTAPI` | Compliant | FastAPI `HTTPException` in `_api_*` handler | +| `INTERNAL_SILENT_SWALLOW` | **Violation** | `except ...: pass` or just logs | +| `INTERNAL_BROAD_CATCH` | **Violation** | `except Exception` without ErrorInfo conversion, in non-`*_result` code | +| `INTERNAL_OPTIONAL_RETURN` | **Violation** | `try/except + return None/Optional[T]` | +| `INTERNAL_RETHROW` | Suspicious | `try/except + raise` (without ErrorInfo conversion) | +| `INTERNAL_PROGRAMMER_RAISE` | Compliant | `raise` for impossible state / precondition | +| `INTERNAL_COMPLIANT` | Compliant | `try/finally` (no except) — canonical cleanup | +| `UNCLEAR` | Review needed | Can't determine automatically | + +**Output structure:** + +``` +=== Exception Handling Audit (Data-Oriented Convention) === + +Files scanned: 65 +Files with findings: 42 +Total sites: 348 +Compliant sites: 80 +Suspicious sites: 25 +Violation sites: 211 +Unclear (review): 32 + +--- Baseline (refactored files: mcp_client, ai_client, rag_engine) --- + Sites: 112, violations: 77 +--- Migration target (all other src/ files) --- + Sites: 236, violations: 134 +``` + +The **baseline** is the 3 fully-refactored files (the convention reference). +The **migration target** is the ~10 unrefactored files in `src/`. The +violation count is informational; the user decides which migration-target +files warrant a refactor track. + +**Important:** the audit is **informational**, not a CI gate. The script +exits 0 by default. Use `--strict` to enable CI-gate mode (exit 1 on any +violation). The user is expected to review the report and decide the +next action. + +--- + ## Migration Playbook When converting existing code: