# Knowledge Curation Guide
**Status:** User-facing deep-dive on the 4th memory dimension (the knowledge memory). For agents, see `./docs/AGENTS.md` §6.
**Date:** 2026-06-12
**Cross-refs:** `conductor/code_styleguides/knowledge_artifacts.md`; `docs/guide_agent_memory_dimensions.md` §4; `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.1, §4.
> **What this is.** The 4th memory dimension is the *durable, user-editable, provenance-aware* knowledge store. It's a *layer*, not a *snapshot*. Category files are the source of truth; the digest is a projection; the ledger is the audit log. This guide is the user-facing deep-dive on how to use it, how to harvest it, and how to query it.
---
## 0. The 30-second version
Manual Slop's knowledge memory lives at `~/.manual_slop/knowledge/`. It holds 5 category files (`facts.md`, `decisions.md`, `questions.md`, `playbooks.md`, `tasks.md`), per-file notes (`files/{file_id}.md`), a digest bounded to 4KB, and a sha256 ledger. The LLM harvests past discussions into these files; the user can edit any of them in plain text. The digest is injected into every new discussion's initial context as a `{knowledge}` block.
```
$ ls ~/.manual_slop/knowledge/
facts.md       # - {statement} {provenance}
decisions.md   # - {statement, reason} {provenance}
questions.md   # - {question} {provenance}
playbooks.md   # - **{name}**: {steps} {provenance}
tasks.md       # ## Open / ## Done
files/         # per-file notes (keyed by device:inode)
digest.md      # bounded 4KB; the projection
ledger.json    # sha256-of-content audit log
prompts/       # user-editable harvest prompt
```
---
## 1. The 5 category files (the source of truth)
**The canonical reference is `conductor/code_styleguides/knowledge_artifacts.md` §1** (the full per-category formats + the `───` data shape markers + the append-only rule + the user-editable contract). This section is the user-facing summary.
| File | Shape | What it stores |
|---|---|---|
| `facts.md` | `- {statement} {provenance}` | Durable statements about systems, repos, tools |
| `decisions.md` | `- {statement, reason} {provenance}` | Decisions that were made |
| `questions.md` | `- {question} {provenance}` | Unanswered questions |
| `playbooks.md` | `- **{name}**: {steps} {provenance}` | Reusable command sequences |
| `tasks.md` | `- {task}` (## Open / ## Done) | Open and done tasks |
**The provenance string:** `[from: {conversation_name}, {date}]`. The `date` is the ISO-8601 date prefix of the harvest timestamp.
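A minimal sketch of how the provenance suffix could be built from a harvest timestamp. The function name `format_provenance` and its signature are illustrative assumptions, not names taken from the codebase; only the output format comes from this guide.

```python
from datetime import datetime, timezone


def format_provenance(conversation_name: str, harvested_at: datetime) -> str:
    """Build the '[from: {conversation_name}, {date}]' suffix.

    `date` is the ISO-8601 date prefix (YYYY-MM-DD) of the harvest timestamp.
    Hypothetical helper: the real implementation may differ.
    """
    date_prefix = harvested_at.isoformat()[:10]
    return f"[from: {conversation_name}, {date_prefix}]"


stamp = datetime(2026, 6, 12, 14, 23, 45, tzinfo=timezone.utc)
print(format_provenance("2026-06-12-investigate-cache", stamp))
# [from: 2026-06-12-investigate-cache, 2026-06-12]
```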
**The user can edit any of the 5.** The LLM's output is a *suggestion*; the user is the editor. The harvest will *append*; it will not *overwrite*.
**The example listings** (one per category file: `facts.md`, `decisions.md`, and so on) are in `conductor/code_styleguides/knowledge_artifacts.md` §1.1-§1.5; they are not repeated here.
## 2. The per-file notes (`files/{file_id}.md`)
**An example file:**
```markdown
# /repo/src/ai_client.py
- Uses `cache_control: {"type": "ephemeral"}` blocks for Anthropic caching. [from: 2026-06-12-investigate-cache, 2026-06-12]
- The 5 per-provider history lists are gated by their own locks. [from: 2026-05-13-state-mutation-matrix, 2026-05-13]
- `run_discussion_compression` failure mode: TBD (Candidate 15). [from: 2026-06-12-candidate-15, 2026-06-12]
```
**The shape:** `- {note} {provenance}`. Keyed by `file_id` (the st_dev:st_ino of the file). Survives renames within the same filesystem.
**The `file_id_for_path` pattern** (per nagent's `bin/helpers/nagent_file_edit_lib.py:file_id_for_path`):
```python
from pathlib import Path


def file_id_for_path(path: Path) -> str:
    """Stable file identity across renames. Returns 'device:inode'."""
    stat = path.stat()
    return f"{stat.st_dev}:{stat.st_ino}"
```
**Why inode and not path?** The path can change (rename, move, link); the inode is stable. A note about `src/foo.py` is preserved if `src/foo.py` is renamed to `src/bar.py` (same inode). If the file is moved across filesystems, the inode changes; the user must re-add the note.
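The rename behavior can be checked with a short, self-contained experiment. This is a sketch that repeats the `file_id_for_path` helper so it runs on its own; the temporary file names are arbitrary. On a typical local filesystem the two identities are expected to match, because a rename within one filesystem keeps the inode.

```python
import os
import tempfile
from pathlib import Path


def file_id_for_path(path: Path) -> str:
    """Return 'device:inode' for the file at `path`."""
    stat = path.stat()
    return f"{stat.st_dev}:{stat.st_ino}"


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "foo.py"
    original.write_text("print('hi')\n", encoding="utf-8")
    id_before = file_id_for_path(original)

    # Rename inside the same directory, hence the same filesystem.
    renamed = Path(tmp) / "bar.py"
    os.rename(original, renamed)
    id_after = file_id_for_path(renamed)

print("identity preserved:", id_before == id_after)
```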
**The "files" category in the harvest output has a special branch:**
```python
# In merge_harvest (the harvest pipeline)
file_notes = 0
for row in harvested.get("files", []):
    if not isinstance(row, dict):
        continue
    path_text = str(row.get("path") or "").strip()
    note = str(row.get("note") or "").strip()
    if not note:
        continue
    target = Path(path_text) if path_text else None
    if target is not None and target.is_file():
        try:
            file_id = file_id_for_path(target)
        except OSError:
            file_id = None
        if file_id is not None:
            _append_bullets(
                file_knowledge_path(root, file_id), f"# {target.resolve()}",
                [f"{note} {provenance}"],
            )
            file_notes += 1
            continue
    # Target no longer resolvable: the note survives as a fact.
    prefix = f"{path_text}: " if path_text else ""
    _append_bullets(knowledge / "facts.md", "# Facts", [f"{prefix}{note} {provenance}"])
    file_notes += 1
counts["files"] = file_notes
```
**The behavior:**
- If the path resolves to an existing file → the note goes to `knowledge/files/{file_id}.md`
- If the path doesn't resolve (the file is gone) → the note falls back to `facts.md` as `{path}: {note} {provenance}`. The note survives, just loses the per-file binding.
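
The branch above relies on an append-only helper. The following is a sketch of what such a helper might look like, assuming it writes the header only for a new or empty file and never rewrites existing content; the name `append_bullets` and this body are hypothetical and are not the project's `_append_bullets`.

```python
import tempfile
from pathlib import Path


def append_bullets(path: Path, header: str, bullets: list[str]) -> int:
    """Append '- ' bullets to `path`; write `header` first if the file is new.

    Existing content is only appended to, so manual edits survive a harvest.
    """
    if not bullets:
        return 0
    path.parent.mkdir(parents=True, exist_ok=True)
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    with path.open("a", encoding="utf-8") as handle:
        if not existing.strip():
            handle.write(f"{header}\n")
        elif not existing.endswith("\n"):
            # A hand-edited file may lack a trailing newline.
            handle.write("\n")
        for bullet in bullets:
            handle.write(f"- {bullet}\n")
    return len(bullets)


with tempfile.TemporaryDirectory() as tmp:
    facts = Path(tmp) / "knowledge" / "facts.md"
    append_bullets(facts, "# Facts", ["first [from: a, 2026-06-12]"])
    append_bullets(facts, "# Facts", ["second [from: b, 2026-06-12]"])
    result = facts.read_text(encoding="utf-8")

print(result)
```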
---
## 3. The digest (`digest.md`)
The digest is a *projection* of the category files, bounded to **4KB**. It's injected as the `{knowledge}` block in the initial context.
**The format:**
```markdown
# Knowledge digest
(regenerated by knowledge_harvest; edit the category files, not this file)
## Open tasks
- Create canonical DOD file at conductor/code_styleguides/data_oriented_design.md. [from: 2026-06-12-candidate-16, 2026-06-12]
## Open questions
- Where does intent resolution live — per-verb, per-block, or global? [from: 2026-06-12-follow-up-b, 2026-06-12]
## Decisions
- Knowledge harvest is a complement to curation + discussion, not a RAG replacement. [from: 2026-06-12-candidate-11, 2026-06-12]
## Facts
- nagent has 5 providers; Manual Slop has 8. [from: 2026-06-12-v2.3, 2026-06-12]
## Playbooks
- **Knowledge Harvest**: scan -> classify -> LLM-distill -> append -> digest -> reclaim. [from: 2026-06-12-candidate-11, 2026-06-12]
```
**The ordering is fixed:** Open tasks, Open questions, Decisions, Facts, Playbooks. **Within each section, newest first** (because the category files are append-only; reversing gives newest-first).
**Truncation:** if the sections don't fit in 4KB, the rest is truncated with a visible `(truncated; see the category files for the rest)` note.
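The ordering and truncation rules can be sketched as a pure function. This is an illustration under stated assumptions: `render_digest`, `SECTION_ORDER`, and the line-by-line truncation strategy are hypothetical, and 4KB is taken to mean 4096 bytes of UTF-8.

```python
DIGEST_MAX_BYTES = 4 * 1024  # assumption: 4KB means 4096 bytes
TRUNCATION_NOTE = "(truncated; see the category files for the rest)"
SECTION_ORDER = ["Open tasks", "Open questions", "Decisions", "Facts", "Playbooks"]
HEADER = [
    "# Knowledge digest",
    "(regenerated by knowledge_harvest; edit the category files, not this file)",
]


def render_digest(sections: dict[str, list[str]], max_bytes: int = DIGEST_MAX_BYTES) -> str:
    """Render sections in the fixed order, newest bullet first, within max_bytes."""
    lines = list(HEADER)
    for title in SECTION_ORDER:
        bullets = sections.get(title, [])
        if not bullets:
            continue
        lines.append(f"## {title}")
        # Category files are append-only, so reversing yields newest-first.
        lines.extend(f"- {bullet}" for bullet in reversed(bullets))

    full = "\n".join(lines) + "\n"
    if len(full.encode("utf-8")) <= max_bytes:
        return full

    # Keep whole lines while leaving room for the truncation note.
    budget = max_bytes - len((TRUNCATION_NOTE + "\n").encode("utf-8"))
    kept, used = [], 0
    for line in lines:
        size = len((line + "\n").encode("utf-8"))
        if used + size > budget:
            break
        kept.append(line)
        used += size
    kept.append(TRUNCATION_NOTE)
    return "\n".join(kept) + "\n"


small = render_digest({"Facts": ["older fact", "newer fact"], "Open tasks": ["only task"]})
print(small)
```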
**"Delete to turn off":** `rm ~/.manual_slop/knowledge/digest.md` → no `{knowledge}` block injected. Re-enable by running the harvest (which regenerates the digest).
---
## 4. The ledger (`ledger.json`)
The ledger is the **sha256-of-content audit log**. It gates deletion on a proven harvest.
**The format:**
```json
{
  "entries": {
    "<sha256-of-conversation-content>": {
      "path": "/home/user/.manual_slop/conversations/<name>-<uuid>",
      "status": "harvested",
      "at": "2026-06-12T14:23:45.123456+00:00",
      "items": {
        "facts": 3,
        "decisions": 2,
        "tasks_done": 1,
        "tasks_open": 0,
        "questions": 1,
        "playbooks": 0,
        "files": 1
      },
      "deleted": true
    }
  }
}
```
**The status values:**
| Status | Meaning | Action |
|---|---|---|
| `harvested` | LLM distillation succeeded; items appended to category files | reclaim (unlink) |
| `harvest-failed` | LLM distillation failed after retries | keep the conversation; record the error |
| `deleted-unharvested` | User passed `--no-harvest`; the conversation is reclaimed without LLM | reclaim (unlink) |
| `too-large` | File > 1MB; kept without harvesting | keep |
**The sha256-of-content dedup:** two conversations with the same content share a ledger entry. The second is reclaimed without paying the LLM cost again.
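A small sketch of the dedup check, assuming the ledger is loaded as a plain dict in the format shown above. The helper names `content_key` and `needs_llm_harvest` are hypothetical.

```python
import hashlib


def content_key(content: bytes) -> str:
    """Ledger key: the sha256 hex digest of the conversation content."""
    return hashlib.sha256(content).hexdigest()


def needs_llm_harvest(ledger: dict, content: bytes) -> bool:
    """True when no 'harvested' entry exists for this exact content."""
    entry = ledger.get("entries", {}).get(content_key(content))
    return not (entry and entry.get("status") == "harvested")


ledger = {"entries": {}}
first = b"user: hello\nassistant: hi\n"
print(needs_llm_harvest(ledger, first))      # True: never seen before

# Record the first harvest, then present a byte-identical copy.
ledger["entries"][content_key(first)] = {"status": "harvested", "items": {"facts": 1}}
duplicate = b"user: hello\nassistant: hi\n"
print(needs_llm_harvest(ledger, duplicate))  # False: same sha256, no second LLM call
```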
---
## 5. The harvest workflow
### 5.1 The 7-category schema (the LLM output)
The LLM's harvest output is strict JSON (no prose, no markdown fence):
```json
{
  "facts": [{"statement": "...", "detail": "..."}],
  "decisions": [{"statement": "...", "detail": "..."}],
  "tasks_done": [{"statement": "...", "detail": "..."}],
  "tasks_open": [{"statement": "...", "detail": "..."}],
  "questions": [{"statement": "...", "detail": "..."}],
  "playbooks": [{"name": "...", "steps": "..."}],
  "files": [{"path": "...", "note": "..."}]
}
```
**The prompt** (in `~/.manual_slop/knowledge/prompts/harvest-conversation.md`; user-editable, root-first resolution):
```markdown
# Harvest durable knowledge from a manual_slop conversation
You are given one conversation (or a summary of one). Extract only knowledge that
stays useful after this conversation is deleted. Return only JSON in exactly this
form (no prose, no markdown fence):
[the 7-category schema above]
Category rules:
- facts: durable statements about systems, repositories, tools, environments, or
constraints that were learned, not assumed.
- decisions: choices that were made, with the why in `detail`.
- tasks_done: concrete work completed in this conversation.
- tasks_open: work that was started, planned, or requested but not finished.
- questions: questions raised and never answered.
- playbooks: command sequences or processes that worked and are reusable; `steps`
is the runnable sequence.
- files: a note tied to one specific file path (use the absolute path seen in
the conversation).
General rules:
- Empty arrays are valid and expected: most conversations contain nothing durable.
Do not invent items to fill categories.
- One item per distinct piece of knowledge; keep `statement` to one sentence.
- `detail` is optional context; omit it or use "" when the statement stands alone.
- Do not include conversation mechanics, tool output noise, retries, or one-off
trivia (timestamps, token counts, transient errors).
```
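A sketch of how a reply in the 7-category schema might be validated. The function name matches the `parse_harvest_json` call used in the harvest loop, but this body is an assumption: in particular, treating a missing category as an empty list and dropping non-object rows are guesses, not documented behavior.

```python
import json

CATEGORIES = (
    "facts", "decisions", "tasks_done", "tasks_open",
    "questions", "playbooks", "files",
)


def parse_harvest_json(response: str) -> dict[str, list[dict]]:
    """Parse a harvest reply; raise ValueError when the shape is wrong."""
    data = json.loads(response)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("harvest output must be a JSON object")
    harvested = {}
    for category in CATEGORIES:
        rows = data.get(category, [])  # assumption: missing means empty
        if not isinstance(rows, list):
            raise ValueError(f"{category!r} must be an array")
        # Assumption: rows that are not objects are silently dropped.
        harvested[category] = [row for row in rows if isinstance(row, dict)]
    return harvested


reply = '{"facts": [{"statement": "nagent has 5 providers", "detail": ""}], "files": []}'
parsed = parse_harvest_json(reply)
print(len(parsed["facts"]), len(parsed["questions"]))  # 1 0
```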
### 5.2 The retry budget (the contract)
`HARVEST_MAX_ATTEMPTS = 2`. The retry is at the parse level (not the API level):
```python
def harvest_conversation(path, provider, model, *, generate, summarize=None):
    content = read_or_summarize(path, provider, model)
    template = harvest_prompt_path().read_text(encoding="utf-8").strip()
    last_error = None
    for attempt in range(HARVEST_MAX_ATTEMPTS):
        prompt = build_harvest_prompt(template, path.name, content, retry=attempt > 0)
        response = generate(prompt, provider, model)
        try:
            return parse_harvest_json(response)
        except (json.JSONDecodeError, ValueError) as exc:
            last_error = exc
    raise RuntimeError(f"harvest output invalid after {HARVEST_MAX_ATTEMPTS} attempts: {last_error}")
```
**The retry-suffix:** on retry, append `\nYour previous reply was not valid JSON. Return only the JSON object.\n` to the prompt.
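A sketch of how the retry suffix could be applied. The call signature mirrors the `build_harvest_prompt(template, path.name, content, retry=...)` call above; how the template, conversation name, and content are laid out in the prompt is an assumption made for illustration.

```python
RETRY_SUFFIX = "\nYour previous reply was not valid JSON. Return only the JSON object.\n"


def build_harvest_prompt(template: str, name: str, content: str, *, retry: bool = False) -> str:
    """Combine template and conversation; add the retry suffix on later attempts."""
    # Assumed layout: template, then the conversation name, then its content.
    prompt = f"{template}\n\nConversation: {name}\n\n{content}\n"
    if retry:
        prompt += RETRY_SUFFIX
    return prompt


first_try = build_harvest_prompt("# Harvest", "demo-conversation", "user: hi")
second_try = build_harvest_prompt("# Harvest", "demo-conversation", "user: hi", retry=True)
print(second_try.endswith(RETRY_SUFFIX))  # True
```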
### 5.3 The size limits (the budgets)
| Constant | Value | Why |
|---|---|---|
| `SUMMARIZE_THRESHOLD_BYTES` | 64 KB | Files > 64KB get summarized first |
| `MAX_HARVEST_SOURCE_BYTES` | 1 MB | Files > 1MB are kept (not harvested) |
| `DIGEST_MAX_BYTES` | 4 KB | The bounded digest size |
| `HARVEST_MAX_ATTEMPTS` | 2 | Retry budget on parse failure |
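
The two size thresholds imply a three-way routing decision. A minimal sketch, assuming KB and MB are binary units (1024 and 1024 * 1024 bytes) and using a hypothetical `harvest_route` function with made-up route labels, apart from `too-large`, which is the ledger status from §4:

```python
SUMMARIZE_THRESHOLD_BYTES = 64 * 1024    # assumption: 64 KB = 65536 bytes
MAX_HARVEST_SOURCE_BYTES = 1024 * 1024   # assumption: 1 MB = 1048576 bytes


def harvest_route(size_bytes: int) -> str:
    """Decide how a conversation of the given size is handled."""
    if size_bytes > MAX_HARVEST_SOURCE_BYTES:
        return "too-large"               # kept, recorded in the ledger
    if size_bytes > SUMMARIZE_THRESHOLD_BYTES:
        return "summarize-then-harvest"  # summary becomes the LLM input
    return "harvest-directly"


for size in (10_000, 200_000, 5_000_000):
    print(size, harvest_route(size))
```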
### 5.4 The dry-run-by-default safety
The harvest CLI defaults to **dry-run**. Without `--apply`, the CLI classifies, estimates cost, and prints a report. **No mutation.**
```bash
$ python -m src.knowledge_harvest
artifacts: live:42, user-kept:3, prune:0, harvest:17, keep:1
harvest candidates: 2.3MB (~600K input tokens), prune candidates: 0B
dry run; pass --apply to harvest and reclaim
$ python -m src.knowledge_harvest --apply
reclaimed: 2.3MB
harvested items: facts:42, decisions:18, tasks_done:7, tasks_open:3, questions:5, playbooks:2, files:11
digest: /home/user/.manual_slop/knowledge/digest.md
ledger: /home/user/.manual_slop/knowledge/ledger.json
```
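The dry-run default maps naturally onto an opt-in flag. The sketch below uses `argparse` with the two flags named in this guide (`--apply` and `--no-harvest`); the `plan` function, the returned strings other than the dry-run message, and the way the two flags combine are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="knowledge_harvest")
    # store_true flags default to False, so a bare invocation is a dry run.
    parser.add_argument("--apply", action="store_true",
                        help="harvest and reclaim instead of only reporting")
    parser.add_argument("--no-harvest", action="store_true",
                        help="reclaim without calling the LLM")
    return parser


def plan(argv: list[str]) -> str:
    """Return a description of what the given arguments would do."""
    args = build_parser().parse_args(argv)
    if not args.apply:
        return "dry run; pass --apply to harvest and reclaim"
    if args.no_harvest:
        return "reclaim without harvest"  # assumed combination semantics
    return "harvest and reclaim"


print(plan([]))
print(plan(["--apply"]))
```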
---
## 6. The "delete to turn off" pattern
**The principle.** Feature flags should be data, not config. If a feature is gated by the presence of a file, the user can turn it off by deleting the file. No GUI toggle, no env var, no `config.toml` edit. Just `rm`.
**The knowledge digest pattern:** `rm ~/.manual_slop/knowledge/digest.md` → no `{knowledge}` block is injected. Re-enable by running `python -m src.knowledge_harvest --apply` (which regenerates the digest).
**The implementation:**
```python
# In aggregate.py:run (the consumer of the digest)
knowledge_digest_path = paths.knowledge_dir() / "digest.md"
if knowledge_digest_path.is_file():
    knowledge_digest = knowledge_digest_path.read_text(encoding="utf-8")
    stable_prefix.append(f"{{knowledge}}\n{knowledge_digest}\n{{/knowledge}}\n")
# else: skip; the file is the switch
```
**The pattern recurs in 3 places:**
1. `regenerate_digest` deletes the digest when sections are empty
2. The `aggregate.py:run` injection check is the load-bearing one
3. The GUI `Knowledge` panel shows the file state and provides a `[Delete to turn off]` button
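
The first two places in the list above can be sketched as a producer and a consumer that agree only on the file's presence. Both function names are hypothetical, and the real `regenerate_digest` may decide emptiness differently.

```python
import tempfile
from pathlib import Path


def write_or_remove_digest(digest_path: Path, body: str) -> bool:
    """Producer side: write the digest, or delete it when there is nothing to say.

    Returns True when a digest file exists afterwards.
    """
    if body.strip():
        digest_path.write_text(body, encoding="utf-8")
        return True
    digest_path.unlink(missing_ok=True)  # Python 3.8+
    return False


def knowledge_block(digest_path: Path) -> str:
    """Consumer side: the file is the switch; no file means no block."""
    if not digest_path.is_file():
        return ""
    digest = digest_path.read_text(encoding="utf-8")
    return f"{{knowledge}}\n{digest}\n{{/knowledge}}\n"


with tempfile.TemporaryDirectory() as tmp:
    digest_file = Path(tmp) / "digest.md"
    write_or_remove_digest(digest_file, "# Knowledge digest\n## Facts\n- a fact\n")
    enabled = knowledge_block(digest_file)
    write_or_remove_digest(digest_file, "")  # empty sections: digest removed
    disabled = knowledge_block(digest_file)

print(repr(disabled))  # ''
```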
---
## 7. The graceful failure modes
| Failure | Handling |
|---|---|
| LLM returns invalid JSON | Retry once (`HARVEST_MAX_ATTEMPTS = 2` attempts in total); if the 2nd attempt also fails, mark `harvest-failed` in the ledger; keep the conversation |
| File > 1MB | Mark `too-large` in the ledger; keep the conversation |
| File > 64KB | Summarize via `run_subagent_summarization`; use the summary as the LLM input |
| Provider not available | Mark `harvest-failed`; keep the conversation |
| Network timeout | Same; mark `harvest-failed`; keep the conversation |
| Disk full writing to category files | Raise; mark `harvest-failed`; keep the conversation (don't reclaim) |
**The pattern:** critical operations complete; non-essential post-steps are best-effort. Every failure keeps the conversation on disk and leaves a visible status marker in the ledger, so the user can fix the cause and re-run.
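One way to read the table is that any exception during the harvest becomes a ledger status rather than a lost conversation. The sketch below illustrates that reading; `run_with_ledger`, the `error` field, and the broad `except Exception` are assumptions, not the project's implementation.

```python
def run_with_ledger(ledger: dict, key: str, harvest) -> bool:
    """Run `harvest()`; record the outcome; return True only if reclaim is safe."""
    try:
        items = harvest()
    except Exception as exc:  # broad on purpose: every failure keeps the file
        ledger["entries"][key] = {"status": "harvest-failed", "error": str(exc)}
        return False
    ledger["entries"][key] = {"status": "harvested", "items": items}
    return True


def failing_harvest():
    raise TimeoutError("provider did not answer")


ledger = {"entries": {}}
ok = run_with_ledger(ledger, "sha-a", lambda: {"facts": 2})
failed = run_with_ledger(ledger, "sha-b", failing_harvest)

print(ok, ledger["entries"]["sha-a"]["status"])      # True harvested
print(failed, ledger["entries"]["sha-b"]["status"])  # False harvest-failed
```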
---
## 8. The injection (where the digest is used)
The digest is injected into the *stable* position of the initial context (layer 7 of the 12-layer model; per `cache_friendly_context.md`):
```python
# In aggregate.py:run (the consumer)
def build_initial_context(ctrl, user_message):
    stable_prefix = []
    # Layer 1-6: role, schema, tools, system prompt, persona, project context
    stable_prefix.append(...)
    # Layer 7: knowledge digest (the 4KB bounded projection)
    knowledge_digest_path = paths.knowledge_dir() / "digest.md"
    if knowledge_digest_path.is_file():
        knowledge_digest = knowledge_digest_path.read_text(encoding="utf-8")
        stable_prefix.append(f"{{knowledge}}\n{knowledge_digest}\n{{/knowledge}}\n")
    # Layer 8-12: discussion metadata, active preset, per-file details, prior turns, user message
    volatile_suffix = [...]
    return "".join(stable_prefix + volatile_suffix)
```
**The position matters.** The digest is in the *stable* position (before the `Instance:` volatile block). The cache can include the digest in the cached prefix; the volatile suffix is not cached. Per `cache_friendly_context.md` §1.
---
## 9. The cross-references
- `conductor/code_styleguides/knowledge_artifacts.md` — the canonical styleguide
- `docs/guide_agent_memory_dimensions.md` §4 — the knowledge dim in context
- `docs/guide_caching_strategy.md` §5 — where the digest is injected
- `conductor/code_styleguides/feature_flags.md` — the "delete to turn off" pattern
- `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.1, §4 — the nagent pattern that informed this guide