# Knowledge Curation Guide

**Status:** User-facing deep-dive on the 4th memory dimension (the knowledge memory). For agents, see `./docs/AGENTS.md` §6.

**Date:** 2026-06-12

**Cross-refs:** `conductor/code_styleguides/knowledge_artifacts.md`; `docs/guide_agent_memory_dimensions.md` §4; `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.1, §4.

> **What this is.** The 4th memory dimension is the *durable, user-editable, provenance-aware* knowledge store. It's a *layer*, not a *snapshot*. Category files are the source of truth; the digest is a projection; the ledger is the audit log. This guide is the user-facing deep-dive on how to use it, how to harvest it, and how to query it.

---

## 0. The 30-second version

Manual Slop's knowledge memory lives at `~/.manual_slop/knowledge/`. It has 5 category files (`facts.md`, `decisions.md`, `questions.md`, `playbooks.md`, `tasks.md`) plus per-file notes (`files/{file_id}.md`) plus a 4KB bounded digest plus a sha256 ledger. The LLM harvests past discussions into these files; the user can edit any of them in plain text. The digest is injected into every new discussion's initial context as a `{knowledge}` block.

```
$ ls ~/.manual_slop/knowledge/
facts.md       # - {statement} {provenance}
decisions.md   # - {statement, reason} {provenance}
questions.md   # - {question} {provenance}
playbooks.md   # - **{name}**: {steps} {provenance}
tasks.md       # ## Open / ## Done
files/         # per-file notes (keyed by inode)
digest.md      # bounded 4KB; the projection
ledger.json    # sha256-of-content audit log
prompts/       # user-editable harvest prompt
```

---

## 1. The 5 category files (the source of truth)

**The canonical reference is `conductor/code_styleguides/knowledge_artifacts.md` §1** (the full per-category formats + the `───` data shape markers + the append-only rule + the user-editable contract). This section is the user-facing summary.

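
To make the append-only rule concrete, here is a minimal sketch of how one harvested fact could be rendered in the `- {statement} {provenance}` shape and appended to `facts.md`. The helper names `format_provenance` and `append_bullet` are hypothetical and are not the project's implementation; the provenance format is the `[from: {conversation_name}, {date}]` string that this section describes, and the `# Facts` heading is taken from the harvest code quoted in §2.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def format_provenance(conversation_name: str, harvested_at: datetime) -> str:
    """Hypothetical helper: build '[from: {conversation_name}, {date}]'."""
    # The date is the ISO-8601 date prefix of the harvest timestamp.
    return f"[from: {conversation_name}, {harvested_at.date().isoformat()}]"


def append_bullet(category_file: Path, heading: str, statement: str, provenance: str) -> None:
    """Hypothetical helper: append one bullet without rewriting existing lines."""
    if not category_file.exists():
        # First write: create the file with its heading.
        category_file.write_text(f"{heading}\n\n", encoding="utf-8")
    # Open in append mode so earlier (possibly user-edited) lines are untouched.
    with category_file.open("a", encoding="utf-8") as handle:
        handle.write(f"- {statement} {provenance}\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        facts = Path(tmp) / "facts.md"
        when = datetime(2026, 6, 12, 14, 23, 45, tzinfo=timezone.utc)
        provenance = format_provenance("2026-06-12-v2.3", when)
        append_bullet(facts, "# Facts", "nagent has 5 providers; Manual Slop has 8.", provenance)
        append_bullet(facts, "# Facts", "The digest is bounded to 4KB.", provenance)
        print(facts.read_text(encoding="utf-8"))
```

Because the file is only ever opened in append mode after creation, manual edits made between two harvests stay in place, which is the property the user-editable contract relies on.
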

| File | Shape | What it stores |
|---|---|---|
| `facts.md` | `- {statement} {provenance}` | Durable statements about systems, repos, tools |
| `decisions.md` | `- {statement, reason} {provenance}` | Decisions that were made |
| `questions.md` | `- {question} {provenance}` | Unanswered questions |
| `playbooks.md` | `- **{name}**: {steps} {provenance}` | Reusable command sequences |
| `tasks.md` | `- {task}` (## Open / ## Done) | Open and done tasks |

**The provenance string:** `[from: {conversation_name}, {date}]`. The `date` is the ISO-8601 date prefix of the harvest timestamp.

**The user can edit any of the 5.** The LLM's output is a *suggestion*; the user is the editor. The harvest will *append*; it will not *overwrite*.

**The example listings** for each category file (`facts.md`, etc.) are in `conductor/code_styleguides/knowledge_artifacts.md` §1.1-§1.5. This section is a pointer.

## 2. The per-file notes (`files/{file_id}.md`)

**An example file:**

```markdown
# /repo/src/ai_client.py

- Uses `cache_control: {"type": "ephemeral"}` blocks for Anthropic caching. [from: 2026-06-12-investigate-cache, 2026-06-12]
- The 5 per-provider history lists are gated by their own locks. [from: 2026-05-13-state-mutation-matrix, 2026-05-13]
- `run_discussion_compression` failure mode: TBD (Candidate 15). [from: 2026-06-12-candidate-15, 2026-06-12]
```

**The shape:** `- {note} {provenance}`. Keyed by `file_id` (the `st_dev:st_ino` of the file). Survives renames within the same filesystem.

**The `file_id_for_path` pattern** (per nagent's `bin/helpers/nagent_file_edit_lib.py:file_id_for_path`):

```python
from pathlib import Path


def file_id_for_path(path: Path) -> str:
    """Stable file identity across renames. Returns 'device:inode'."""
    stat = path.stat()
    return f"{stat.st_dev}:{stat.st_ino}"
```

**Why inode and not path?** The path can change (rename, move, link); the inode is stable. A note about `src/foo.py` is preserved if `src/foo.py` is renamed to `src/bar.py` (same inode).
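
The rename claim can be checked directly. The following is a minimal sketch, assuming a filesystem where a rename inside one directory keeps the inode (the usual case on Linux and macOS); the temporary file names and the `rename_keeps_file_id` helper are illustrative, and `file_id_for_path` is repeated from above so the block is self-contained.

```python
import tempfile
from pathlib import Path


def file_id_for_path(path: Path) -> str:
    """Stable file identity across renames. Returns 'device:inode'."""
    stat = path.stat()
    return f"{stat.st_dev}:{stat.st_ino}"


def rename_keeps_file_id() -> bool:
    """Create a file, rename it in place, and compare the two identities."""
    with tempfile.TemporaryDirectory() as tmp:
        old_path = Path(tmp) / "foo.py"
        old_path.write_text("print('hello')\n", encoding="utf-8")
        id_before = file_id_for_path(old_path)

        # Path.rename returns the new Path (Python 3.8+).
        new_path = old_path.rename(Path(tmp) / "bar.py")
        id_after = file_id_for_path(new_path)
        return id_before == id_after


if __name__ == "__main__":
    print(rename_keeps_file_id())
```

One caveat worth keeping in mind: tools that save by writing a new file and swapping it into place create a new inode, so a note can lose its binding after such a save even though the path never changed.
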
If the file is moved across filesystems, the inode changes; the user must re-add the note.

**The "files" category in the harvest output has a special branch:**

```python
# In merge_harvest (the harvest pipeline)
file_notes = 0
for row in harvested.get("files", []):
    if not isinstance(row, dict):
        continue
    path_text = str(row.get("path") or "").strip()
    note = str(row.get("note") or "").strip()
    if not note:
        continue
    target = Path(path_text) if path_text else None
    if target is not None and target.is_file():
        try:
            file_id = file_id_for_path(target)
        except OSError:
            file_id = None
        if file_id is not None:
            _append_bullets(
                file_knowledge_path(root, file_id),
                f"# {target.resolve()}",
                [f"{note} {provenance}"],
            )
            file_notes += 1
            continue
    # Target no longer resolvable: the note survives as a fact.
    prefix = f"{path_text}: " if path_text else ""
    _append_bullets(knowledge / "facts.md", "# Facts", [f"{prefix}{note} {provenance}"])
    file_notes += 1
counts["files"] = file_notes
```

**The behavior:**

- If the path resolves to an existing file → the note goes to `knowledge/files/{file_id}.md`
- If the path doesn't resolve (the file is gone) → the note falls back to `facts.md` as `{path}: {note} {provenance}`. The note survives, just loses the per-file binding.

---

## 3. The digest (`digest.md`)

The digest is a *projection* of the category files, bounded to **4KB**. It's injected as the `{knowledge}` block in the initial context.

**The format:**

```markdown
# Knowledge digest

(regenerated by knowledge_harvest; edit the category files, not this file)

## Open tasks
- Create canonical DOD file at conductor/code_styleguides/data_oriented_design.md. [from: 2026-06-12-candidate-16, 2026-06-12]

## Open questions
- Where does intent resolution live: per-verb, per-block, or global? [from: 2026-06-12-follow-up-b, 2026-06-12]

## Decisions
- Knowledge harvest is a complement to curation + discussion, not a RAG replacement. [from: 2026-06-12-candidate-11, 2026-06-12]

## Facts
- nagent has 5 providers; Manual Slop has 8. [from: 2026-06-12-v2.3, 2026-06-12]

## Playbooks
- **Knowledge Harvest**: scan -> classify -> LLM-distill -> append -> digest -> reclaim. [from: 2026-06-12-candidate-11, 2026-06-12]
```

**The ordering is fixed:** Open tasks, Open questions, Decisions, Facts, Playbooks.

**Within each section, newest first** (because the category files are append-only; reversing gives newest-first).

**Truncation:** if the sections don't fit in 4KB, the rest is truncated with a visible `(truncated; see the category files for the rest)` note.

**"Delete to turn off":** `rm ~/.manual_slop/knowledge/digest.md` → no `{knowledge}` block injected. Re-enable by running the harvest (which regenerates the digest).

---

## 4. The ledger (`ledger.json`)

The ledger is the **sha256-of-content audit log**. It gates deletion on a proven harvest.

**The format:**

```json
{
  "entries": {
    "": {
      "path": "/home/user/.manual_slop/conversations/-",
      "status": "harvested",
      "at": "2026-06-12T14:23:45.123456+00:00",
      "items": {
        "facts": 3,
        "decisions": 2,
        "tasks_done": 1,
        "tasks_open": 0,
        "questions": 1,
        "playbooks": 0,
        "files": 1
      },
      "deleted": true
    }
  }
}
```

**The status values:**

| Status | Meaning | Action |
|---|---|---|
| `harvested` | LLM distillation succeeded; items appended to category files | reclaim (unlink) |
| `harvest-failed` | LLM distillation failed after retries | keep the conversation; record the error |
| `deleted-unharvested` | User passed `--no-harvest`; the conversation is reclaimed without LLM | reclaim (unlink) |
| `too-large` | File > 1MB; kept without harvesting | keep |

**The sha256-of-content dedup:** two conversations with the same content share a ledger entry. The second is reclaimed without paying the LLM cost again.

---

## 5. The harvest workflow

### 5.1 The 7-category schema (the LLM output)

The LLM's harvest output is strict JSON (no prose, no markdown fence):

```json
{
  "facts": [{"statement": "...", "detail": "..."}],
  "decisions": [{"statement": "...", "detail": "..."}],
  "tasks_done": [{"statement": "...", "detail": "..."}],
  "tasks_open": [{"statement": "...", "detail": "..."}],
  "questions": [{"statement": "...", "detail": "..."}],
  "playbooks": [{"name": "...", "steps": "..."}],
  "files": [{"path": "...", "note": "..."}]
}
```

**The prompt** (in `~/.manual_slop/knowledge/prompts/harvest-conversation.md`; user-editable, root-first resolution):

```markdown
# Harvest durable knowledge from a manual_slop conversation

You are given one conversation (or a summary of one). Extract only knowledge that stays useful after this conversation is deleted.

Return only JSON in exactly this form (no prose, no markdown fence):

[the 7-category schema above]

Category rules:
- facts: durable statements about systems, repositories, tools, environments, or constraints that were learned, not assumed.
- decisions: choices that were made, with the why in `detail`.
- tasks_done: concrete work completed in this conversation.
- tasks_open: work that was started, planned, or requested but not finished.
- questions: questions raised and never answered.
- playbooks: command sequences or processes that worked and are reusable; `steps` is the runnable sequence.
- files: a note tied to one specific file path (use the absolute path seen in the conversation).

General rules:
- Empty arrays are valid and expected: most conversations contain nothing durable. Do not invent items to fill categories.
- One item per distinct piece of knowledge; keep `statement` to one sentence.
- `detail` is optional context; omit it or use "" when the statement stands alone.
- Do not include conversation mechanics, tool output noise, retries, or one-off trivia (timestamps, token counts, transient errors).
```

### 5.2 The retry budget (the contract)

`HARVEST_MAX_ATTEMPTS = 2`. The retry is at the parse level (not the API level):

```python
import json


def harvest_conversation(path, provider, model, *, generate, summarize=None):
    content = read_or_summarize(path, provider, model)
    template = harvest_prompt_path().read_text(encoding="utf-8").strip()
    last_error = None
    for attempt in range(HARVEST_MAX_ATTEMPTS):
        # Only the second and later attempts carry the retry suffix.
        prompt = build_harvest_prompt(template, path.name, content, retry=attempt > 0)
        response = generate(prompt, provider, model)
        try:
            return parse_harvest_json(response)
        except (json.JSONDecodeError, ValueError) as exc:
            last_error = exc
    raise RuntimeError(f"harvest output invalid after {HARVEST_MAX_ATTEMPTS} attempts: {last_error}")
```

**The retry-suffix:** on retry, append `\nYour previous reply was not valid JSON. Return only the JSON object.\n` to the prompt.

### 5.3 The size limits (the budgets)

| Constant | Value | Why |
|---|---|---|
| `SUMMARIZE_THRESHOLD_BYTES` | 64 KB | Files > 64KB get summarized first |
| `MAX_HARVEST_SOURCE_BYTES` | 1 MB | Files > 1MB are kept (not harvested) |
| `DIGEST_MAX_BYTES` | 4 KB | The bounded digest size |
| `HARVEST_MAX_ATTEMPTS` | 2 | Retry budget on parse failure |

### 5.4 The dry-run-by-default safety

The harvest CLI defaults to **dry-run**. Without `--apply`, the CLI classifies, estimates cost, and prints a report. **No mutation.**

```bash
$ python -m src.knowledge_harvest
artifacts: live:42, user-kept:3, prune:0, harvest:17, keep:1
harvest candidates: 2.3MB (~600K input tokens), prune candidates: 0B
dry run; pass --apply to harvest and reclaim

$ python -m src.knowledge_harvest --apply
reclaimed: 2.3MB
harvested items: facts:42, decisions:18, tasks_done:7, tasks_open:3, questions:5, playbooks:2, files:11
digest: /home/user/.manual_slop/knowledge/digest.md
ledger: /home/user/.manual_slop/knowledge/ledger.json
```

---

## 6. The "delete to turn off" pattern

**The principle.** Feature flags should be data, not config.
If a feature is gated by the presence of a file, the user can turn it off by deleting the file. No GUI toggle, no env var, no `config.toml` edit. Just `rm`.

**The knowledge digest pattern:** `rm ~/.manual_slop/knowledge/digest.md` → no `{knowledge}` block is injected. Re-enable by running `python -m src.knowledge_harvest --apply` (which regenerates the digest).

**The implementation:**

```python
# In aggregate.py:run (the consumer of the digest)
knowledge_digest_path = paths.knowledge_dir() / "digest.md"
if knowledge_digest_path.is_file():
    knowledge_digest = knowledge_digest_path.read_text(encoding="utf-8")
    stable_prefix.append(f"{{knowledge}}\n{knowledge_digest}\n{{/knowledge}}\n")
# else: skip; the file is the switch
```

**The pattern recurs in 3 places:**

1. `regenerate_digest` deletes the digest when sections are empty
2. The `aggregate.py:run` injection check is the load-bearing one
3. The GUI `Knowledge` panel shows the file state and provides a `[Delete to turn off]` button

---

## 7. The graceful failure modes

| Failure | Handling |
|---|---|
| LLM returns invalid JSON | Retry (up to 2 attempts); on 2nd failure, mark `harvest-failed` in the ledger; keep the conversation |
| File > 1MB | Mark `too-large` in the ledger; keep the conversation |
| File > 64KB | Summarize via `run_subagent_summarization`; use the summary as the LLM input |
| Provider not available | Mark `harvest-failed`; keep the conversation |
| Network timeout | Same; mark `harvest-failed`; keep the conversation |
| Disk full writing to category files | Raise; mark `harvest-failed`; keep the conversation (don't reclaim) |

**The pattern:** critical operations complete; non-essential post-steps are best-effort. The marker is visible. The user can re-run.

---

## 8. The injection (where the digest is used)

The digest is injected into the *stable* position of the initial context (layer 7 of the 12-layer model; per `cache_friendly_context.md`):

```python
# In aggregate.py:run (the consumer)
def build_initial_context(ctrl, user_message):
    stable_prefix = []
    # Layer 1-6: role, schema, tools, system prompt, persona, project context
    stable_prefix.append(...)
    # Layer 7: knowledge digest (the 4KB bounded projection)
    knowledge_digest_path = paths.knowledge_dir() / "digest.md"
    if knowledge_digest_path.is_file():
        knowledge_digest = knowledge_digest_path.read_text(encoding="utf-8")
        stable_prefix.append(f"{{knowledge}}\n{knowledge_digest}\n{{/knowledge}}\n")
    # Layer 8-12: discussion metadata, active preset, per-file details, prior turns, user message
    volatile_suffix = [...]
    return "".join(stable_prefix + volatile_suffix)
```

**The position matters.** The digest is in the *stable* position (before the `Instance:` volatile block). The cache can include the digest in the cached prefix; the volatile suffix is not cached. Per `cache_friendly_context.md` §1.

---

## 9. The cross-references

- `conductor/code_styleguides/knowledge_artifacts.md`: the canonical styleguide
- `docs/guide_agent_memory_dimensions.md` §4: the knowledge dim in context
- `docs/guide_caching_strategy.md` §5: where the digest is injected
- `conductor/code_styleguides/feature_flags.md`: the "delete to turn off" pattern
- `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.1, §4: the nagent pattern that informed this guide