# Knowledge Artifacts (the harvest pattern)

**Status:** Styleguide; codifies the knowledge harvest pattern: category files, provenance, sha256 ledger, digest regeneration, "delete to turn off."

**Date:** 2026-06-12

**Cross-refs:** `conductor/code_styleguides/agent_memory_dimensions.md` §4; `conductor/code_styleguides/feature_flags.md`; `docs/guide_knowledge_curation.md`; `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.1, §4.

> **What this is.** The 4th memory dimension (per `agent_memory_dimensions.md` §4) is the durable, provenance-aware, user-editable knowledge store. It's a *layer*, not a *snapshot*: category files are the source of truth; the digest is a projection; the ledger is the audit log. This styleguide names the files, the formats, the harvest workflow, and the "delete to turn off" pattern.

---

## 0. The one-glance directory layout

```
~/.manual_slop/knowledge/
├── facts.md                     # - {statement} {provenance}
├── decisions.md                 # - {statement, reason} {provenance}
├── questions.md                 # - {question} {provenance}
├── playbooks.md                 # - **{name}**: {steps} {provenance}
├── tasks.md                     # ## Open / ## Done
├── files/
│   └── {file_id}.md             # per-file notes (keyed by inode)
├── digest.md                    # bounded 4KB; the projection; "delete to turn off"
├── ledger.json                  # sha256-of-content audit log
└── prompts/
    └── harvest-conversation.md  # user-editable harvest prompt
```

---

## 1. The category files (the source of truth)

### 1.1 `facts.md` (durable statements)

```markdown
# Facts

- The MCP dispatch uses a flat if/elif chain. 4 places, 45 tools. [from: 2026-05-12-investigate-dispatch, 2026-05-12]
- ai_client.py has 5 separate per-provider history lists, each with their own lock. Switching providers mid-session loses history. [from: 2026-05-13-state-mutation-matrix, 2026-05-13]
- RAG is opt-in. Default-off in new projects. [from: 2026-06-12-rag-discipline, 2026-06-12]
```

**The shape:** `- {statement} {provenance}`. Plain markdown. Append-only. User-editable.
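
The line shape above can be parsed mechanically. A minimal sketch, assuming the provenance is always a trailing `[from: {source}, {date}]` bracket with an ISO date, as in the examples; `Entry`, `ENTRY_RE`, and `parse_entry` are hypothetical names for illustration, not part of the harvest code:

```python
import re
from typing import NamedTuple, Optional

# Assumed shape: "- {statement} [from: {source}, {date}]" with an ISO date.
ENTRY_RE = re.compile(
    r"^- (?P<statement>.+?) \[from: (?P<source>[^,\]]+), (?P<date>\d{4}-\d{2}-\d{2})\]$"
)


class Entry(NamedTuple):
    statement: str
    source: str
    date: str


def parse_entry(line: str) -> Optional[Entry]:
    """Return the parsed entry, or None for headings, blanks, and free-form edits."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    return Entry(match["statement"], match["source"], match["date"])


entry = parse_entry(
    "- RAG is opt-in. Default-off in new projects. "
    "[from: 2026-06-12-rag-discipline, 2026-06-12]"
)
```

Returning `None` rather than raising keeps the sketch tolerant of user edits: a hand-written line without provenance is skipped, not treated as an error.
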
### 1.2 `decisions.md` (decisions with reasons)

```markdown
# Decisions

- Knowledge harvest is a complement to curation + discussion, not a RAG replacement. [from: 2026-06-12-candidate-11, 2026-06-12]
- Cache TTL defaults to 5 min (Anthropic) + 60 min (Gemini); configurable per-discussion. [from: 2026-06-12-cache-strategy, 2026-06-12]
```

**The shape:** `- {statement} {provenance}`. The "why" lives in the LLM's harvest output; the user's edits override.

### 1.3 `questions.md` (unanswered questions)

```markdown
# Questions

- Where does intent resolution live: per-verb, per-block, or global? [from: 2026-06-12-follow-up-b, 2026-06-12]
- How should the knowledge digest TTL be exposed in the GUI? [from: 2026-06-12-cache-ttl, 2026-06-12]
```

**The shape:** `- {question} {provenance}`. Open questions are *valuable*: they're the TODO list the next session can act on.

### 1.4 `playbooks.md` (reusable sequences)

```markdown
# Playbooks

- **Knowledge Harvest**: scan -> classify -> LLM-distill -> append -> digest -> reclaim. [from: 2026-06-12-candidate-11, 2026-06-12]
- **Stable-to-Volatile Cache Ordering**: identify Instance: boundary -> pass to --cache-prefix-chars. [from: 2026-06-12-candidate-12, 2026-06-12]
- **Candidate Verification (TBD)**: read src/ai_client.py:run_discussion_compression -> check failure mode. [from: 2026-06-12-candidate-15, 2026-06-12]
```

**The shape:** `- **{name}**: {steps} {provenance}`. Playbooks are the "I did this once; here it is" record. Future workers use them directly.

### 1.5 `tasks.md` (open and done)

```markdown
# Tasks

## Open

- Create canonical DOD file at conductor/code_styleguides/data_oriented_design.md. [from: 2026-06-12-candidate-16, 2026-06-12]
- Verify Candidate 15 by reading src/ai_client.py:run_discussion_compression. [from: 2026-06-12-candidate-15, 2026-06-12]

## Done

- Read nagent source in full (18 files). [from: 2026-05-15, 2026-05-15]
- Wrote v2.3 review (272KB / 3965 lines). [from: 2026-06-12-v2.3, 2026-06-12]
```

**The shape:** `- {task} {provenance}`. The harvest places open items in `## Open` and done items in `## Done`; beyond that, the two sections are manually maintained.

### 1.6 `files/{file_id}.md` (per-file notes)

```markdown
# /repo/src/ai_client.py

- Uses `cache_control: {"type": "ephemeral"}` blocks for Anthropic caching. [from: 2026-06-12-investigate-cache, 2026-06-12]
- The 5 per-provider history lists are gated by their own locks. [from: 2026-05-13-state-mutation-matrix, 2026-05-13]
- `run_discussion_compression` failure mode: TBD (Candidate 15). [from: 2026-06-12-candidate-15, 2026-06-12]
```

**The shape:** `- {note} {provenance}`. Keyed by `file_id` (the `st_dev:st_ino` of the file). Survives renames within the same filesystem.

**The file_id pattern** (per nagent's `bin/helpers/nagent_file_edit_lib.py:file_id_for_path`):

```python
from pathlib import Path


def file_id_for_path(path: Path) -> str:
    """Stable file identity across renames. Returns 'device:inode'."""
    stat = path.stat()
    return f"{stat.st_dev}:{stat.st_ino}"
```

**The "files" category in the harvest output** has a special branch: if the path resolves to an existing file, the note goes to `knowledge/files/{file_id}.md`; if not, the note falls back to `facts.md` as `{path}: {note} {provenance}`. The note survives; it just loses the per-file binding.

---

## 2. The digest (`digest.md`)

The digest is a *projection* of the category files, bounded to **4KB**. It's injected as the `{knowledge}` block in the initial context.

**The format** (per nagent's `regenerate_digest`):

```markdown
# Knowledge digest (regenerated by nagent-gc; edit the category files, not this file)

## Open tasks

- Create canonical DOD file at conductor/code_styleguides/data_oriented_design.md. [from: 2026-06-12-candidate-16, 2026-06-12]

## Open questions

- Where does intent resolution live: per-verb, per-block, or global? [from: 2026-06-12-follow-up-b, 2026-06-12]

## Decisions

- Knowledge harvest is a complement to curation + discussion, not a RAG replacement. [from: 2026-06-12-candidate-11, 2026-06-12]

## Facts

- nagent has 5 providers; Manual Slop has 8. [from: 2026-06-12-v2.3, 2026-06-12]

## Playbooks

- **Knowledge Harvest**: scan -> classify -> LLM-distill -> append -> digest -> reclaim. [from: 2026-06-12-candidate-11, 2026-06-12]
```

**The ordering is fixed:** Open tasks, Open questions, Decisions, Facts, Playbooks (per nagent's `DIGEST_SECTIONS = (('Open tasks', 'tasks_open'), ('Open questions', 'questions'), ('Decisions', 'decisions'), ('Facts', 'facts'), ('Playbooks', 'playbooks'))`).

**Within each section, newest first** (because the category files are append-only; reversing gives newest-first).

**Truncation:** if the sections don't fit in 4KB, the rest is truncated with a visible `(truncated; see the category files for the rest)` note.

**"Delete to turn off":** if all sections are empty, the digest is *deleted*:

```python
# In regenerate_digest
if not sections:
    if target.is_file():
        target.unlink()  # delete to turn off
    return None
```

**The injection point** (in `aggregate.py:run`):

```python
# In aggregate.py:run (the consumer of the digest)
knowledge_digest_path = paths.knowledge_dir() / "digest.md"
if knowledge_digest_path.is_file():
    knowledge_digest = knowledge_digest_path.read_text(encoding="utf-8")
    stable_prefix.append(f"{{knowledge}}\n{knowledge_digest}\n{{/knowledge}}\n")
```

---

## 3. The ledger (`ledger.json`)

The ledger is the **sha256-of-content audit log**. It gates deletion on a proven harvest.
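
A minimal sketch of how a content hash could key the ledger and gate deletion. The helper names (`content_sha256`, `may_reclaim`) and the `RECLAIMABLE` set are assumptions made for illustration; only the `entries` mapping and the status strings come from this styleguide:

```python
import hashlib

# Assumption: only these two statuses allow the source conversation to be unlinked.
RECLAIMABLE = {"harvested", "deleted-unharvested"}


def content_sha256(data: bytes) -> str:
    """Ledger key: the hex sha256 of the conversation's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def may_reclaim(ledger: dict, data: bytes) -> bool:
    """True only when the ledger already records a reclaimable status for this content."""
    entry = ledger.get("entries", {}).get(content_sha256(data))
    return entry is not None and entry.get("status") in RECLAIMABLE


conversation = b"user: hello\nassistant: hi\n"
ledger = {"entries": {content_sha256(conversation): {"status": "harvested"}}}
failed = {"entries": {content_sha256(conversation): {"status": "harvest-failed"}}}
```

Because the key depends only on the bytes, a renamed or copied conversation maps to the same entry, while content that has no entry (or a failed one) is never eligible for deletion.
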

**The format:**

```json
{
  "entries": {
    "": {
      "path": "/home/user/.nagent/conversations/-",
      "status": "harvested",
      "at": "2026-06-12T14:23:45.123456+00:00",
      "items": {
        "facts": 3,
        "decisions": 2,
        "tasks_done": 1,
        "tasks_open": 0,
        "questions": 1,
        "playbooks": 0,
        "files": 1
      },
      "deleted": true
    },
    "": {
      "path": "...",
      "status": "harvest-failed",
      "at": "2026-06-12T14:24:00.000000+00:00",
      "deleted": false,
      "error": "provider 'openai' not available"
    }
  }
}
```

**The status values:**

| Status | Meaning | Action |
|---|---|---|
| `harvested` | LLM distillation succeeded; items appended to category files | reclaim (unlink) |
| `harvest-failed` | LLM distillation failed after retries | keep the conversation; record the error |
| `deleted-unharvested` | User passed `--no-harvest`; the conversation is reclaimed without LLM | reclaim (unlink) |
| `too-large` | File > 1MB; kept without harvesting | keep |

**The sha256-of-content dedup:** two conversations with the same content share a ledger entry. The second is reclaimed without paying the LLM cost again.

---

## 4. The harvest workflow

### 4.1 The 7-category schema (the LLM output)

The LLM's harvest output is strict JSON (no prose, no markdown fence):

```json
{
  "facts": [
    {"statement": "The system has 4 memory dimensions", "detail": ""}
  ],
  "decisions": [
    {"statement": "Knowledge harvest is a complement to curation + discussion", "detail": "not a RAG replacement"}
  ],
  "tasks_done": [
    {"statement": "v2.3 review identified 10 future-track candidates", "detail": ""}
  ],
  "tasks_open": [
    {"statement": "Create canonical DOD file at conductor/code_styleguides/data_oriented_design.md", "detail": "Candidate 14"}
  ],
  "questions": [
    {"statement": "Where does intent resolution live: per-verb, per-block, or global?", "detail": ""}
  ],
  "playbooks": [
    {"name": "Knowledge Harvest", "steps": "scan -> classify -> LLM-distill -> append -> digest -> reclaim"}
  ],
  "files": [
    {"path": "/repo/src/ai_client.py", "note": "Cache TTL GUI: per-discussion state; cache hit rate per provider"}
  ]
}
```

**The prompt** (in `prompts/harvest-conversation.md`; user-editable, root-first resolution):

```markdown
# Harvest durable knowledge from a manual_slop conversation

You are given one conversation (or a summary of one). Extract only knowledge that stays useful after this conversation is deleted.

Return only JSON in exactly this form (no prose, no markdown fence):

[the 7-category schema above]

Category rules:
- facts: durable statements about systems, repositories, tools, environments, or constraints that were learned, not assumed.
- decisions: choices that were made, with the why in `detail`.
- tasks_done: concrete work completed in this conversation.
- tasks_open: work that was started, planned, or requested but not finished.
- questions: questions raised and never answered.
- playbooks: command sequences or processes that worked and are reusable; `steps` is the runnable sequence.
- files: a note tied to one specific file path (use the absolute path seen in the conversation).

General rules:
- Empty arrays are valid and expected: most conversations contain nothing durable. Do not invent items to fill categories.
- One item per distinct piece of knowledge; keep `statement` to one sentence.
- `detail` is optional context; omit it or use "" when the statement stands alone.
- Do not include conversation mechanics, tool output noise, retries, or one-off trivia (timestamps, token counts, transient errors).
```

### 4.2 The retry budget

`HARVEST_MAX_ATTEMPTS = 2`. The retry is at the parse level (not the API level):

```python
import json


def harvest_conversation(path, provider, model, config_path, *, generate, summarize=None):
    content = read_or_summarize(path, provider, model)
    template = harvest_prompt_path().read_text(encoding="utf-8").strip()
    last_error = None
    for attempt in range(HARVEST_MAX_ATTEMPTS):
        # From the second attempt on, the retry suffix is added to the prompt.
        prompt = build_harvest_prompt(template, path.name, content, retry=attempt > 0)
        response = generate(prompt, provider, model)
        try:
            return parse_harvest_json(response)
        except (json.JSONDecodeError, ValueError) as exc:
            # Only parse failures are retried; errors raised by generate() propagate.
            last_error = exc
    raise RuntimeError(f"harvest output invalid after {HARVEST_MAX_ATTEMPTS} attempts: {last_error}")
```

**The retry-suffix:** on retry, append `\nYour previous reply was not valid JSON. Return only the JSON object.\n` to the prompt. The LLM sees its previous (malformed) output and a one-line correction.

**The strict parser** (tolerates code-fence; otherwise strict):

```python
def parse_harvest_json(text: str) -> dict:
    stripped = text.strip()
    fence = JSON_FENCE.match(stripped)  # tolerates a ```json ... ``` wrapper
    if fence:
        stripped = fence.group(1).strip()
    payload = json.loads(stripped)
    if not isinstance(payload, dict):
        raise ValueError("harvest output is not a JSON object")
    harvested = {}
    for category in ITEM_CATEGORIES:
        # Missing or non-list categories degrade to an empty list.
        rows = payload.get(category, [])
        harvested[category] = rows if isinstance(rows, list) else []
    return harvested
```

### 4.3 The size limits (the budgets)

| Constant | Value | Why |
|---|---|---|
| `SUMMARIZE_THRESHOLD_BYTES` | 64 KB | Files > 64KB get summarized first |
| `MAX_HARVEST_SOURCE_BYTES` | 1 MB | Files > 1MB are kept (not harvested) |
| `DIGEST_MAX_BYTES` | 4 KB | The bounded digest size |
| `HARVEST_MAX_ATTEMPTS` | 2 | Retry budget on parse failure |

**The "too-large" branch** (the budget guard):

```python
if artifact.size_bytes > MAX_HARVEST_SOURCE_BYTES:
    entries[sha] = {"status": "too-large", "deleted": False}
    emit(f"kept (too large): {label}")
    continue
```

### 4.4 The dry-run-by-default safety

The harvest CLI defaults to **dry-run**. Without `--apply`, the CLI classifies, estimates cost, and prints a report. **No mutation.**

```bash
$ python -m src.knowledge_harvest
artifacts: live:42, user-kept:3, prune:0, harvest:17, keep:1
harvest candidates: 2.3MB (~600K input tokens), prune candidates: 0B
dry run; pass --apply to harvest and reclaim

$ python -m src.knowledge_harvest --apply
reclaimed: 2.3MB
harvested items: facts:42, decisions:18, tasks_done:7, tasks_open:3, questions:5, playbooks:2, files:11
digest: /home/user/.manual_slop/knowledge/digest.md
ledger: /home/user/.manual_slop/knowledge/ledger.json
```

---

## 5. The "delete to turn off" pattern (per `feature_flags.md`)

**The principle.** Feature flags should be data, not config. If a feature is gated by the presence of a file, the user can turn it off by deleting the file. No GUI toggle, no env var, no `config.toml` edit. Just `rm`.

**The knowledge harvest pattern:** `rm ~/.manual_slop/knowledge/digest.md` → no `{knowledge}` block is injected.
Re-enable by running `python -m src.knowledge_harvest --apply` (which regenerates the digest).

**The implementation:**

```python
# In aggregate.py:run (the consumer)
knowledge_digest_path = paths.knowledge_dir() / "digest.md"
if knowledge_digest_path.is_file():
    knowledge_digest = knowledge_digest_path.read_text(encoding="utf-8")
    stable_prefix.append(f"{{knowledge}}\n{knowledge_digest}\n{{/knowledge}}\n")
# else: skip; the file is the switch
```

**The general pattern** recurs in 3 places:

1. `regenerate_digest` deletes the digest when sections are empty
2. The `aggregate.py:run` injection check is the load-bearing one
3. The `Knowledge` panel shows the file state (so the user knows what to do)

**The alternative** (config toggle) is also supported: `[ai_settings.knowledge].digest_enabled = false`. See `feature_flags.md` for the rule on when to use file presence vs config.

---

## 6. The graceful failure modes

| Failure | Handling |
|---|---|
| LLM returns invalid JSON | Retry once (2 attempts total); on the 2nd failure, mark `harvest-failed` in the ledger; keep the conversation |
| File > 1MB | Mark `too-large` in the ledger; keep the conversation |
| File > 64KB | Summarize via `run_subagent_summarization` (or equivalent); use the summary as the LLM input |
| Provider not available | Mark `harvest-failed`; keep the conversation |
| Network timeout | Same; mark `harvest-failed`; keep the conversation |
| Disk full writing to category files | Raise; mark `harvest-failed`; keep the conversation (don't reclaim) |

**The pattern:** critical operations complete; non-essential post-steps are best-effort. The marker is visible. The user can re-run.

---

## 7. The cross-references

- `conductor/code_styleguides/agent_memory_dimensions.md` §4: the knowledge dim in context
- `conductor/code_styleguides/feature_flags.md`: the "delete to turn off" pattern
- `conductor/code_styleguides/cache_friendly_context.md`: where the digest is injected (layer 7, stable)
- `conductor/code_styleguides/data_oriented_design.md` §1.2: "Design around a model of the world" (the anti-pattern)
- `data_oriented_error_handling_20260606`: the `Result[T, ErrorInfo]` pattern for the harvest LLM call
- `docs/guide_knowledge_curation.md`: the user-facing deep-dive
- `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.1, §4: the nagent pattern that informed this styleguide