Knowledge Curation Guide
Status: User-facing deep-dive on the 4th memory dimension (the knowledge memory). For agents, see ./docs/AGENTS.md §6.
Date: 2026-06-12
Cross-refs: conductor/code_styleguides/knowledge_artifacts.md; docs/guide_agent_memory_dimensions.md §4; conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md §3.1, §4.
What this is. The 4th memory dimension is the durable, user-editable, provenance-aware knowledge store. It's a layer, not a snapshot. Category files are the source of truth; the digest is a projection; the ledger is the audit log. This guide is the user-facing deep-dive on how to use it, how to harvest it, and how to query it.
0. The 30-second version
Manual Slop's knowledge memory lives at ~/.manual_slop/knowledge/. It has 5 category files (facts.md, decisions.md, questions.md, playbooks.md, tasks.md) plus per-file notes (files/{file_id}.md) plus a 4KB bounded digest plus a sha256 ledger. The LLM harvests past discussions into these files; the user can edit any of them in plain text. The digest is injected into every new discussion's initial context as a {knowledge} block.
$ ls ~/.manual_slop/knowledge/
facts.md # - {statement} {provenance}
decisions.md # - {statement, reason} {provenance}
questions.md # - {question} {provenance}
playbooks.md # - **{name}**: {steps} {provenance}
tasks.md # ## Open / ## Done
files/ # per-file notes (keyed by inode)
digest.md # bounded 4KB; the projection
ledger.json # sha256-of-content audit log
prompts/ # user-editable harvest prompt
1. The 5 category files (the source of truth)
The canonical reference is conductor/code_styleguides/knowledge_artifacts.md §1 (the full per-category formats + the ─── data shape markers + the append-only rule + the user-editable contract). This section is the user-facing summary.
| File | Shape | What it stores |
|---|---|---|
| `facts.md` | `- {statement} {provenance}` | Durable statements about systems, repos, tools |
| `decisions.md` | `- {statement, reason} {provenance}` | Decisions that were made |
| `questions.md` | `- {question} {provenance}` | Unanswered questions |
| `playbooks.md` | `- **{name}**: {steps} {provenance}` | Reusable command sequences |
| `tasks.md` | `- {task}` (`## Open` / `## Done`) | Open and done tasks |
The provenance string: [from: {conversation_name}, {date}]. The date is the ISO-8601 date prefix of the harvest timestamp.
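As a minimal sketch of the provenance format described above: the helper name `provenance` and its signature are illustrative assumptions, not the project's actual API. It only shows how the date is taken from the ISO-8601 prefix of the harvest timestamp.

```python
from datetime import datetime, timezone


def provenance(conversation_name: str, harvested_at: datetime) -> str:
    """Format the provenance suffix attached to a harvested bullet (hypothetical helper)."""
    # The date is the ISO-8601 date prefix (YYYY-MM-DD) of the harvest timestamp.
    return f"[from: {conversation_name}, {harvested_at.date().isoformat()}]"


stamp = datetime(2026, 6, 12, 14, 23, 45, tzinfo=timezone.utc)
print(provenance("2026-06-12-investigate-cache", stamp))
# prints: [from: 2026-06-12-investigate-cache, 2026-06-12]
```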
The user can edit any of the 5. The LLM's output is a suggestion; the user is the editor. The harvest will append; it will not overwrite.
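The append-only contract can be sketched as follows. This is an assumption about how such a helper might look, not the project's `_append_bullets`; the name `append_bullets` and the header handling are illustrative. The point is that the file is opened in append mode, so anything the user wrote earlier is left untouched.

```python
from pathlib import Path


def append_bullets(path: Path, header: str, bullets: list[str]) -> None:
    """Append bullets to a category file, creating it with its header if missing."""
    if not bullets:
        return
    existing = path.read_text(encoding="utf-8") if path.is_file() else ""
    parts = []
    if not existing:
        parts.append(f"{header}\n\n")  # new file: start with the header
    elif not existing.endswith("\n"):
        parts.append("\n")  # keep a user's unterminated last line intact
    parts.extend(f"- {bullet}\n" for bullet in bullets)
    # Append mode: earlier content, including user edits, is never rewritten.
    with path.open("a", encoding="utf-8") as handle:
        handle.write("".join(parts))
```

Calling it twice on the same file yields the first call's content followed by the second call's bullets.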
The example listings (per-file path / file facts.md, etc.) are in conductor/code_styleguides/knowledge_artifacts.md §1.1-§1.5. This section is a pointer.
2. The per-file notes (files/{file_id}.md)
An example file:
# /repo/src/ai_client.py
- Uses `cache_control: {"type": "ephemeral"}` blocks for Anthropic caching. [from: 2026-06-12-investigate-cache, 2026-06-12]
- The 5 per-provider history lists are gated by their own locks. [from: 2026-05-13-state-mutation-matrix, 2026-05-13]
- `run_discussion_compression` failure mode: TBD (Candidate 15). [from: 2026-06-12-candidate-15, 2026-06-12]
The shape: - {note} {provenance}. Keyed by file_id (the st_dev:st_ino of the file). Survives renames within the same filesystem.
The file_id_for_path pattern (per nagent's bin/helpers/nagent_file_edit_lib.py:file_id_for_path):
def file_id_for_path(path: Path) -> str:
    """Stable file identity across renames. Returns 'device:inode'."""
    stat = path.stat()
    return f"{stat.st_dev}:{stat.st_ino}"
Why inode and not path? The path can change (rename, move, link); the inode is stable. A note about src/foo.py is preserved if src/foo.py is renamed to src/bar.py (same inode). If the file is moved across filesystems, the inode changes; the user must re-add the note.
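The rename claim can be checked with a small, self-contained experiment. The sketch below repeats `file_id_for_path` from above and renames a temporary file within one directory (so within one filesystem); the file names are placeholders.

```python
import os
import tempfile
from pathlib import Path


def file_id_for_path(path: Path) -> str:
    """Stable file identity across renames. Returns 'device:inode'."""
    stat = path.stat()
    return f"{stat.st_dev}:{stat.st_ino}"


with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "foo.py"
    old.write_text("print('hi')\n", encoding="utf-8")
    before = file_id_for_path(old)

    new = Path(tmp) / "bar.py"
    os.rename(old, new)  # same directory, so same filesystem
    after = file_id_for_path(new)

    # The path changed but the device:inode pair did not.
    print(before == after)
```

On a POSIX filesystem this is expected to print `True`, which is why a note keyed by `file_id` follows the file through the rename.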
The "files" category in the harvest output has a special branch:
# In merge_harvest (the harvest pipeline)
file_notes = 0
for row in harvested.get("files", []):
    if not isinstance(row, dict):
        continue
    path_text = str(row.get("path") or "").strip()
    note = str(row.get("note") or "").strip()
    if not note:
        continue
    target = Path(path_text) if path_text else None
    if target is not None and target.is_file():
        try:
            file_id = file_id_for_path(target)
        except OSError:
            file_id = None
        if file_id is not None:
            _append_bullets(
                file_knowledge_path(root, file_id), f"# {target.resolve()}",
                [f"{note} {provenance}"],
            )
            file_notes += 1
            continue
    # Target no longer resolvable: the note survives as a fact.
    prefix = f"{path_text}: " if path_text else ""
    _append_bullets(knowledge / "facts.md", "# Facts", [f"{prefix}{note} {provenance}"])
    file_notes += 1
counts["files"] = file_notes
The behavior:
- If the path resolves to an existing file → the note goes to `knowledge/files/{file_id}.md`.
- If the path doesn't resolve (the file is gone) → the note falls back to `facts.md` as `{path}: {note} {provenance}`. The note survives, just loses the per-file binding.
3. The digest (digest.md)
The digest is a projection of the category files, bounded to 4KB. It's injected as the {knowledge} block in the initial context.
The format:
# Knowledge digest
(regenerated by knowledge_harvest; edit the category files, not this file)
## Open tasks
- Create canonical DOD file at conductor/code_styleguides/data_oriented_design.md. [from: 2026-06-12-candidate-16, 2026-06-12]
## Open questions
- Where does intent resolution live — per-verb, per-block, or global? [from: 2026-06-12-follow-up-b, 2026-06-12]
## Decisions
- Knowledge harvest is a complement to curation + discussion, not a RAG replacement. [from: 2026-06-12-candidate-11, 2026-06-12]
## Facts
- nagent has 5 providers; Manual Slop has 8. [from: 2026-06-12-v2.3, 2026-06-12]
## Playbooks
- **Knowledge Harvest**: scan -> classify -> LLM-distill -> append -> digest -> reclaim. [from: 2026-06-12-candidate-11, 2026-06-12]
The ordering is fixed: Open tasks, Open questions, Decisions, Facts, Playbooks. Within each section, newest first (because the category files are append-only; reversing gives newest-first).
Truncation: if the sections don't fit in 4KB, the rest is truncated with a visible (truncated; see the category files for the rest) note.
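The ordering and truncation rules can be sketched as below. This is an illustration under stated assumptions, not the project's `regenerate_digest`: the function name `render_digest`, the line-granular truncation, and the choice to reserve room for the truncation note inside the 4KB budget are all assumptions.

```python
DIGEST_MAX_BYTES = 4096
TRUNCATION_NOTE = "(truncated; see the category files for the rest)\n"
SECTION_ORDER = ["Open tasks", "Open questions", "Decisions", "Facts", "Playbooks"]


def render_digest(sections: dict[str, list[str]], max_bytes: int = DIGEST_MAX_BYTES) -> str:
    """Render sections in fixed order, newest bullet first, within a byte budget."""
    lines = ["# Knowledge digest\n"]
    for title in SECTION_ORDER:
        bullets = sections.get(title, [])
        if not bullets:
            continue
        lines.append(f"## {title}\n")
        # Category files are append-only (oldest first); reverse for newest first.
        lines.extend(f"- {bullet}\n" for bullet in reversed(bullets))

    text = "".join(lines)
    if len(text.encode("utf-8")) <= max_bytes:
        return text

    # Over budget: keep whole lines until the note would no longer fit.
    budget = max_bytes - len(TRUNCATION_NOTE.encode("utf-8"))
    kept, used = [], 0
    for line in lines:
        size = len(line.encode("utf-8"))
        if used + size > budget:
            break
        kept.append(line)
        used += size
    return "".join(kept) + TRUNCATION_NOTE
```

Because the sections are emitted in priority order, truncation in this sketch drops the tail (older facts and playbooks) before it touches open tasks.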
"Delete to turn off": rm ~/.manual_slop/knowledge/digest.md → no {knowledge} block injected. Re-enable by running the harvest (which regenerates the digest).
4. The ledger (ledger.json)
The ledger is the sha256-of-content audit log. It gates deletion on a proven harvest.
The format:
{
  "entries": {
    "<sha256-of-conversation-content>": {
      "path": "/home/user/.manual_slop/conversations/<name>-<uuid>",
      "status": "harvested",
      "at": "2026-06-12T14:23:45.123456+00:00",
      "items": {
        "facts": 3,
        "decisions": 2,
        "tasks_done": 1,
        "tasks_open": 0,
        "questions": 1,
        "playbooks": 0,
        "files": 1
      },
      "deleted": true
    }
  }
}
The status values:
| Status | Meaning | Action |
|---|---|---|
| `harvested` | LLM distillation succeeded; items appended to category files | reclaim (unlink) |
| `harvest-failed` | LLM distillation failed after retries | keep the conversation; record the error |
| `deleted-unharvested` | User passed `--no-harvest`; the conversation is reclaimed without LLM | reclaim (unlink) |
| `too-large` | File > 1MB; kept without harvesting | keep |
The sha256-of-content dedup: two conversations with the same content share a ledger entry. The second is reclaimed without paying the LLM cost again.
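A minimal sketch of the dedup check, assuming the ledger is the parsed `ledger.json` shown above. The helper names `content_key` and `already_harvested` are hypothetical; only `hashlib.sha256` and the ledger layout come from known sources.

```python
import hashlib
from pathlib import Path


def content_key(path: Path) -> str:
    """Ledger key: the sha256 hex digest of the conversation's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def already_harvested(ledger: dict, path: Path) -> bool:
    """True if identical content has a 'harvested' entry, whatever its path."""
    entry = ledger.get("entries", {}).get(content_key(path))
    return entry is not None and entry.get("status") == "harvested"
```

Since the key depends only on content, a second conversation with the same bytes hits the existing entry and can be reclaimed without another LLM call.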
5. The harvest workflow
5.1 The 7-category schema (the LLM output)
The LLM's harvest output is strict JSON (no prose, no markdown fence):
{
  "facts": [{"statement": "...", "detail": "..."}],
  "decisions": [{"statement": "...", "detail": "..."}],
  "tasks_done": [{"statement": "...", "detail": "..."}],
  "tasks_open": [{"statement": "...", "detail": "..."}],
  "questions": [{"statement": "...", "detail": "..."}],
  "playbooks": [{"name": "...", "steps": "..."}],
  "files": [{"path": "...", "note": "..."}]
}
The prompt (in ~/.manual_slop/knowledge/prompts/harvest-conversation.md; user-editable, root-first resolution):
# Harvest durable knowledge from a manual_slop conversation
You are given one conversation (or a summary of one). Extract only knowledge that
stays useful after this conversation is deleted. Return only JSON in exactly this
form (no prose, no markdown fence):
[the 7-category schema above]
Category rules:
- facts: durable statements about systems, repositories, tools, environments, or
constraints that were learned, not assumed.
- decisions: choices that were made, with the why in `detail`.
- tasks_done: concrete work completed in this conversation.
- tasks_open: work that was started, planned, or requested but not finished.
- questions: questions raised and never answered.
- playbooks: command sequences or processes that worked and are reusable; `steps`
is the runnable sequence.
- files: a note tied to one specific file path (use the absolute path seen in
the conversation).
General rules:
- Empty arrays are valid and expected: most conversations contain nothing durable.
Do not invent items to fill categories.
- One item per distinct piece of knowledge; keep `statement` to one sentence.
- `detail` is optional context; omit it or use "" when the statement stands alone.
- Do not include conversation mechanics, tool output noise, retries, or one-off
trivia (timestamps, token counts, transient errors).
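The 7-category schema in 5.1 lends itself to a mechanical check. The sketch below is one plausible shape for that check, not the project's `parse_harvest_json`; the name `parse_harvest_reply` and the decision to drop non-object rows (rather than reject the whole reply) are assumptions.

```python
import json

CATEGORIES = ("facts", "decisions", "tasks_done", "tasks_open",
              "questions", "playbooks", "files")


def parse_harvest_reply(response: str) -> dict[str, list[dict]]:
    """Parse the model reply; raise ValueError if it does not match the schema."""
    data = json.loads(response)  # raises json.JSONDecodeError on non-JSON
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    result = {}
    for category in CATEGORIES:
        rows = data.get(category, [])  # a missing category counts as empty
        if not isinstance(rows, list):
            raise ValueError(f"{category!r} must be an array")
        # Keep only object rows; stray strings or numbers are dropped.
        result[category] = [row for row in rows if isinstance(row, dict)]
    return result
```

`json.JSONDecodeError` is a subclass of `ValueError`, so a caller that catches both (as the retry loop in 5.2 does) handles malformed JSON and schema violations the same way.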
5.2 The retry budget (the contract)
HARVEST_MAX_ATTEMPTS = 2. The retry is at the parse level (not the API level):
def harvest_conversation(path, provider, model, *, generate, summarize=None):
    content = read_or_summarize(path, provider, model)
    template = harvest_prompt_path().read_text(encoding="utf-8").strip()
    last_error = None
    for attempt in range(HARVEST_MAX_ATTEMPTS):
        prompt = build_harvest_prompt(template, path.name, content, retry=attempt > 0)
        response = generate(prompt, provider, model)
        try:
            return parse_harvest_json(response)
        except (json.JSONDecodeError, ValueError) as exc:
            last_error = exc
    raise RuntimeError(f"harvest output invalid after {HARVEST_MAX_ATTEMPTS} attempts: {last_error}")
The retry-suffix: on retry, append \nYour previous reply was not valid JSON. Return only the JSON object.\n to the prompt.
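A sketch of how the retry suffix might be attached. The signature mirrors the call in `harvest_conversation` above, but the body is an assumption; in particular the "Conversation:" framing is illustrative and not taken from the project.

```python
RETRY_SUFFIX = "\nYour previous reply was not valid JSON. Return only the JSON object.\n"


def build_harvest_prompt(template: str, name: str, content: str, *, retry: bool = False) -> str:
    """Assemble template + conversation; add the retry suffix on later attempts."""
    # Assumed layout: template, then the conversation name, then its content.
    prompt = f"{template}\n\nConversation: {name}\n\n{content}\n"
    if retry:
        prompt += RETRY_SUFFIX
    return prompt
```

With `HARVEST_MAX_ATTEMPTS = 2`, the first attempt gets the plain prompt and the second attempt gets the same prompt plus the suffix.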
5.3 The size limits (the budgets)
| Constant | Value | Why |
|---|---|---|
| `SUMMARIZE_THRESHOLD_BYTES` | 64 KB | Files > 64KB get summarized first |
| `MAX_HARVEST_SOURCE_BYTES` | 1 MB | Files > 1MB are kept (not harvested) |
| `DIGEST_MAX_BYTES` | 4 KB | The bounded digest size |
| `HARVEST_MAX_ATTEMPTS` | 2 | Retry budget on parse failure |
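The two source-size budgets above imply a three-way routing decision. The sketch below assumes binary units (1 KB = 1024 bytes) and strict greater-than comparisons, as the table's "> 64KB" and "> 1MB" suggest; the function name `size_route` and the route labels other than `too-large` are hypothetical.

```python
SUMMARIZE_THRESHOLD_BYTES = 64 * 1024      # assumed: 64 KB in binary units
MAX_HARVEST_SOURCE_BYTES = 1024 * 1024     # assumed: 1 MB in binary units


def size_route(size_bytes: int) -> str:
    """Decide what happens to a conversation file based on its size alone."""
    if size_bytes > MAX_HARVEST_SOURCE_BYTES:
        return "too-large"   # kept, recorded in the ledger, never sent to the LLM
    if size_bytes > SUMMARIZE_THRESHOLD_BYTES:
        return "summarize"   # summarized first; the summary is the LLM input
    return "harvest"         # small enough to harvest directly
```

The larger limit is checked first so that a 2MB file is never summarized just to be rejected afterwards.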
5.4 The dry-run-by-default safety
The harvest CLI defaults to dry-run. Without --apply, the CLI classifies, estimates cost, and prints a report. No mutation.
$ python -m src.knowledge_harvest
artifacts: live:42, user-kept:3, prune:0, harvest:17, keep:1
harvest candidates: 2.3MB (~600K input tokens), prune candidates: 0B
dry run; pass --apply to harvest and reclaim
$ python -m src.knowledge_harvest --apply
reclaimed: 2.3MB
harvested items: facts:42, decisions:18, tasks_done:7, tasks_open:3, questions:5, playbooks:2, files:11
digest: /home/user/.manual_slop/knowledge/digest.md
ledger: /home/user/.manual_slop/knowledge/ledger.json
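Dry-run-by-default falls out naturally from an opt-in boolean flag. The sketch below uses the standard library's `argparse`; the flag names `--apply` and `--no-harvest` appear in this guide, while the parser layout, help strings, and the `describe_mode` helper are assumptions.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse CLI flags; without --apply the run must not mutate anything."""
    parser = argparse.ArgumentParser(prog="knowledge_harvest")
    parser.add_argument("--apply", action="store_true",
                        help="harvest and reclaim (default: dry run)")
    parser.add_argument("--no-harvest", action="store_true",
                        help="reclaim without LLM distillation")
    return parser.parse_args(argv)


def describe_mode(args: argparse.Namespace) -> str:
    """Hypothetical helper: the closing line of the report for this run."""
    if not args.apply:
        return "dry run; pass --apply to harvest and reclaim"
    return "applying: harvest and reclaim"
```

`store_true` flags default to `False`, so forgetting the flag yields the safe behaviour.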
6. The "delete to turn off" pattern
The principle. Feature flags should be data, not config. If a feature is gated by the presence of a file, the user can turn it off by deleting the file. No GUI toggle, no env var, no config.toml edit. Just rm.
The knowledge digest pattern: rm ~/.manual_slop/knowledge/digest.md → no {knowledge} block is injected. Re-enable by running python -m src.knowledge_harvest --apply (which regenerates the digest).
The implementation:
# In aggregate.py:run (the consumer of the digest)
knowledge_digest_path = paths.knowledge_dir() / "digest.md"
if knowledge_digest_path.is_file():
    knowledge_digest = knowledge_digest_path.read_text(encoding="utf-8")
    stable_prefix.append(f"{{knowledge}}\n{knowledge_digest}\n{{/knowledge}}\n")
# else: skip; the file is the switch
The pattern recurs in 3 places:
- `regenerate_digest` deletes the digest when sections are empty
- The `aggregate.py:run` injection check is the load-bearing one
- The GUI `Knowledge` panel shows the file state and provides a `[Delete to turn off]` button
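The first of these places, deleting the digest when there is nothing to project, might look like the sketch below. It is an assumption about the behaviour of `regenerate_digest`, not its actual code; the helper name `write_or_remove_digest` is hypothetical.

```python
from pathlib import Path


def write_or_remove_digest(digest_path: Path, body: str) -> bool:
    """Write the digest, or delete it when there is nothing to project.

    Returns True if a digest file exists afterwards.
    """
    if not body.strip():
        # Absence of the file is the off switch; missing_ok avoids an error
        # when the user has already deleted it.
        digest_path.unlink(missing_ok=True)
        return False
    digest_path.write_text(body, encoding="utf-8")
    return True
```

Keeping the empty case as "no file" rather than "empty file" means the `is_file()` check on the consumer side stays the single switch.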
7. The graceful failure modes
| Failure | Handling |
|---|---|
| LLM returns invalid JSON | Retry once (2 attempts total); on the 2nd failure, mark harvest-failed in the ledger; keep the conversation |
| File > 1MB | Mark too-large in the ledger; keep the conversation |
| File > 64KB | Summarize via run_subagent_summarization; use the summary as the LLM input |
| Provider not available | Mark harvest-failed; keep the conversation |
| Network timeout | Same; mark harvest-failed; keep the conversation |
| Disk full writing to category files | Raise; mark harvest-failed; keep the conversation (don't reclaim) |
The pattern: critical operations complete; non-essential post-steps are best-effort. The marker is visible. The user can re-run.
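The common thread in the table above is that a failure is recorded and the conversation is kept. A minimal sketch of that wrapper follows; the function name `harvest_and_record`, the `error` field, and the in-memory ledger dict are assumptions, while the status strings are the ones used in this guide.

```python
def harvest_and_record(key: str, path: str, ledger: dict, harvest) -> bool:
    """Run harvest(path) and record the outcome. Returns True if safe to reclaim."""
    entry = {"path": path}
    entries = ledger.setdefault("entries", {})
    try:
        items = harvest(path)
    except Exception as exc:  # provider down, timeout, invalid JSON, disk full
        entry.update(status="harvest-failed", error=str(exc))
        entries[key] = entry
        return False  # keep the conversation so the user can re-run
    entry.update(status="harvested", items=items)
    entries[key] = entry
    return True
```

Only a `True` return should lead to unlinking the conversation; every failure path leaves the source file in place with a visible marker in the ledger.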
8. The injection (where the digest is used)
The digest is injected into the stable position of the initial context (layer 7 of the 12-layer model; per cache_friendly_context.md):
# In aggregate.py:run (the consumer)
def build_initial_context(ctrl, user_message):
    stable_prefix = []
    # Layer 1-6: role, schema, tools, system prompt, persona, project context
    stable_prefix.append(...)
    # Layer 7: knowledge digest (the 4KB bounded projection)
    knowledge_digest_path = paths.knowledge_dir() / "digest.md"
    if knowledge_digest_path.is_file():
        knowledge_digest = knowledge_digest_path.read_text(encoding="utf-8")
        stable_prefix.append(f"{{knowledge}}\n{knowledge_digest}\n{{/knowledge}}\n")
    # Layer 8-12: discussion metadata, active preset, per-file details, prior turns, user message
    volatile_suffix = [...]
    return "".join(stable_prefix + volatile_suffix)
The position matters. The digest is in the stable position (before the Instance: volatile block). The cache can include the digest in the cached prefix; the volatile suffix is not cached. Per cache_friendly_context.md §1.
9. The cross-references
- `conductor/code_styleguides/knowledge_artifacts.md`: the canonical styleguide
- `docs/guide_agent_memory_dimensions.md` §4: the knowledge dim in context
- `docs/guide_caching_strategy.md` §5: where the digest is injected
- `conductor/code_styleguides/feature_flags.md`: the "delete to turn off" pattern
- `conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md` §3.1, §4: the nagent pattern that informed this guide