manual_slop/docs/guide_knowledge_curation.md

Knowledge Curation Guide

Status: User-facing deep-dive on the 4th memory dimension (the knowledge memory). For agents, see ./docs/AGENTS.md §6.

Date: 2026-06-12

Cross-refs: conductor/code_styleguides/knowledge_artifacts.md; docs/guide_agent_memory_dimensions.md §4; conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md §3.1, §4.

What this is. The 4th memory dimension is the durable, user-editable, provenance-aware knowledge store. It's a layer, not a snapshot. Category files are the source of truth; the digest is a projection; the ledger is the audit log. This guide is the user-facing deep-dive on how to use it, how to harvest it, and how to query it.


0. The 30-second version

Manual Slop's knowledge memory lives at ~/.manual_slop/knowledge/. It has 5 category files (facts.md, decisions.md, questions.md, playbooks.md, tasks.md) plus per-file notes (files/{file_id}.md) plus a 4KB bounded digest plus a sha256 ledger. The LLM harvests past discussions into these files; the user can edit any of them in plain text. The digest is injected into every new discussion's initial context as a {knowledge} block.

$ ls ~/.manual_slop/knowledge/
facts.md            # - {statement} {provenance}
decisions.md        # - {statement, reason} {provenance}
questions.md        # - {question} {provenance}
playbooks.md        # - **{name}**: {steps} {provenance}
tasks.md            # ## Open / ## Done
files/              # per-file notes (keyed by device:inode)
digest.md           # bounded 4KB; the projection
ledger.json         # sha256-of-content audit log
prompts/            # user-editable harvest prompt

1. The 5 category files (the source of truth)

The canonical reference is conductor/code_styleguides/knowledge_artifacts.md §1 (the full per-category formats + the ─── data shape markers + the append-only rule + the user-editable contract). This section is the user-facing summary.

| File | Shape | What it stores |
| --- | --- | --- |
| facts.md | `- {statement} {provenance}` | Durable statements about systems, repos, tools |
| decisions.md | `- {statement, reason} {provenance}` | Decisions that were made |
| questions.md | `- {question} {provenance}` | Unanswered questions |
| playbooks.md | `- **{name}**: {steps} {provenance}` | Reusable command sequences |
| tasks.md | `- {task}` (under `## Open` / `## Done`) | Open and done tasks |

The provenance string: [from: {conversation_name}, {date}]. The date is the ISO-8601 date prefix of the harvest timestamp.
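As a minimal sketch of how the provenance string could be built: the helper name `format_provenance` is hypothetical (it does not appear in the source), and the sketch assumes the harvest timestamp is available as a `datetime`.

```python
from datetime import datetime, timezone


def format_provenance(conversation_name: str, harvested_at: datetime) -> str:
    """Render '[from: {conversation_name}, {date}]'.

    Hypothetical helper: the date is the ISO-8601 date prefix (YYYY-MM-DD)
    of the harvest timestamp, as described above.
    """
    return f"[from: {conversation_name}, {harvested_at.date().isoformat()}]"


# Example: a facts.md bullet with its provenance suffix.
stamp = datetime(2026, 6, 12, 14, 23, 45, tzinfo=timezone.utc)
bullet = f"- nagent has 5 providers; Manual Slop has 8. {format_provenance('2026-06-12-v2.3', stamp)}"
print(bullet)
```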

The user can edit any of the 5. The LLM's output is a suggestion; the user is the editor. The harvest will append; it will not overwrite.
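The append-only contract can be illustrated with a small helper. This is a sketch under assumptions, not the project's `_append_bullets`: the name `append_bullets`, its return value, and the newline handling for hand-edited files are all illustrative.

```python
import tempfile
from pathlib import Path


def append_bullets(path: Path, header: str, bullets: list[str]) -> int:
    """Append bullets to a category file without rewriting existing content.

    Hypothetical sketch: creates the file with `header` when it is missing,
    and opens in append mode so user edits are never overwritten.
    """
    if not bullets:
        return 0
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists()
    needs_newline = False
    if not is_new:
        existing = path.read_text(encoding="utf-8")
        # A hand-edited file may lack a trailing newline; keep bullets on their own lines.
        needs_newline = bool(existing) and not existing.endswith("\n")
    with path.open("a", encoding="utf-8") as handle:
        if is_new:
            handle.write(f"{header}\n\n")
        if needs_newline:
            handle.write("\n")
        for bullet in bullets:
            handle.write(f"- {bullet}\n")
    return len(bullets)


# Usage: two harvests append; nothing that was already in the file changes.
with tempfile.TemporaryDirectory() as tmp:
    facts = Path(tmp) / "knowledge" / "facts.md"
    append_bullets(facts, "# Facts", ["first fact [from: conv-a, 2026-06-12]"])
    append_bullets(facts, "# Facts", ["second fact [from: conv-b, 2026-06-12]"])
    print(facts.read_text(encoding="utf-8"))
```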

The example listings for each category file (facts.md, decisions.md, and so on) are in conductor/code_styleguides/knowledge_artifacts.md §1.1-§1.5. This section only points to them.

2. The per-file notes (files/{file_id}.md)

An example file:

# /repo/src/ai_client.py

- Uses `cache_control: {"type": "ephemeral"}` blocks for Anthropic caching. [from: 2026-06-12-investigate-cache, 2026-06-12]
- The 5 per-provider history lists are gated by their own locks. [from: 2026-05-13-state-mutation-matrix, 2026-05-13]
- `run_discussion_compression` failure mode: TBD (Candidate 15). [from: 2026-06-12-candidate-15, 2026-06-12]

The shape: - {note} {provenance}. Keyed by file_id (the st_dev:st_ino of the file). Survives renames within the same filesystem.

The file_id_for_path pattern (per nagent's bin/helpers/nagent_file_edit_lib.py:file_id_for_path):

def file_id_for_path(path: Path) -> str:
    """Stable file identity across renames. Returns 'device:inode'."""
    stat = path.stat()
    return f"{stat.st_dev}:{stat.st_ino}"

Why inode and not path? The path can change (rename, move, link); the inode is stable. A note about src/foo.py is preserved if src/foo.py is renamed to src/bar.py (same inode). If the file is moved across filesystems, the inode changes; the user must re-add the note.
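The rename behavior can be checked directly. The sketch below reuses `file_id_for_path` as shown above; the `demo_ids` function and the temporary file names are illustrative only. It also shows that a copy gets a different id, because a copy is a new file on disk.

```python
import shutil
import tempfile
from pathlib import Path


def file_id_for_path(path: Path) -> str:
    """Stable file identity across renames. Returns 'device:inode'."""
    stat = path.stat()
    return f"{stat.st_dev}:{stat.st_ino}"


def demo_ids() -> tuple[str, str, str]:
    """Return (id before rename, id after rename, id of a copy)."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "foo.py"
        original.write_text("print('hi')\n", encoding="utf-8")
        before = file_id_for_path(original)

        # Rename within the same directory (same filesystem): identity is kept.
        renamed = original.rename(Path(tmp) / "bar.py")
        after = file_id_for_path(renamed)

        # A copy is a distinct file, so it gets its own inode.
        copied = Path(tmp) / "copy.py"
        shutil.copy(renamed, copied)
        copy_id = file_id_for_path(copied)
        return before, after, copy_id


before, after, copy_id = demo_ids()
print("rename keeps id:", before == after)
print("copy gets new id:", copy_id != after)
```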

The "files" category in the harvest output has a special branch:

# In merge_harvest (the harvest pipeline)
file_notes = 0
for row in harvested.get("files", []):
    if not isinstance(row, dict):
        continue
    path_text = str(row.get("path") or "").strip()
    note = str(row.get("note") or "").strip()
    if not note:
        continue
    target = Path(path_text) if path_text else None
    if target is not None and target.is_file():
        try:
            file_id = file_id_for_path(target)
        except OSError:
            # stat() can still fail after is_file() (race, permissions);
            # fall through to the facts.md fallback so the note is not lost.
            file_id = None
        if file_id is not None:
            _append_bullets(
                file_knowledge_path(root, file_id), f"# {target.resolve()}",
                [f"{note} {provenance}"],
            )
            file_notes += 1
            continue
    # Target missing or not stat-able: the note survives as a fact.
    prefix = f"{path_text}: " if path_text else ""
    _append_bullets(knowledge / "facts.md", "# Facts", [f"{prefix}{note} {provenance}"])
    file_notes += 1
counts["files"] = file_notes

The behavior:

  • If the path resolves to an existing file → the note goes to knowledge/files/{file_id}.md
  • If the path doesn't resolve (the file is gone) → the note falls back to facts.md as {path}: {note} {provenance}. The note survives, just loses the per-file binding.

3. The digest (digest.md)

The digest is a projection of the category files, bounded to 4KB. It's injected as the {knowledge} block in the initial context.

The format:

# Knowledge digest
(regenerated by knowledge_harvest; edit the category files, not this file)

## Open tasks
- Create canonical DOD file at conductor/code_styleguides/data_oriented_design.md. [from: 2026-06-12-candidate-16, 2026-06-12]

## Open questions
- Where does intent resolution live — per-verb, per-block, or global? [from: 2026-06-12-follow-up-b, 2026-06-12]

## Decisions
- Knowledge harvest is a complement to curation + discussion, not a RAG replacement. [from: 2026-06-12-candidate-11, 2026-06-12]

## Facts
- nagent has 5 providers; Manual Slop has 8. [from: 2026-06-12-v2.3, 2026-06-12]

## Playbooks
- **Knowledge Harvest**: scan -> classify -> LLM-distill -> append -> digest -> reclaim. [from: 2026-06-12-candidate-11, 2026-06-12]

The ordering is fixed: Open tasks, Open questions, Decisions, Facts, Playbooks. Within each section, newest first (because the category files are append-only; reversing gives newest-first).

Truncation: if the sections don't fit in 4KB, the rest is truncated with a visible (truncated; see the category files for the rest) note.
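The ordering and truncation rules can be sketched as follows. This is an illustration under assumptions, not the project's `regenerate_digest`: the function name `render_digest`, the input shape (section title mapped to bullets in file order), and the choice to truncate at whole-line boundaries are all hypothetical.

```python
DIGEST_MAX_BYTES = 4 * 1024  # assumes "4KB" means 4096 bytes
SECTION_ORDER = ["Open tasks", "Open questions", "Decisions", "Facts", "Playbooks"]
HEADER = (
    "# Knowledge digest\n"
    "(regenerated by knowledge_harvest; edit the category files, not this file)\n"
)
TRUNCATION_NOTE = "(truncated; see the category files for the rest)\n"


def render_digest(sections: dict[str, list[str]], max_bytes: int = DIGEST_MAX_BYTES) -> str:
    """Render the digest: fixed section order, newest-first bullets, bounded size.

    `sections` maps a section title to its bullets in category-file order
    (oldest first, because the files are append-only).
    """
    chunks = [HEADER]
    for title in SECTION_ORDER:
        bullets = sections.get(title) or []
        if not bullets:
            continue
        chunks.append(f"\n## {title}\n")
        # Reversing the append-only order gives newest-first.
        chunks.extend(f"- {bullet}\n" for bullet in reversed(bullets))

    text = "".join(chunks)
    if len(text.encode("utf-8")) <= max_bytes:
        return text

    # Over budget: keep whole chunks, reserving room for the visible note.
    budget = max_bytes - len(TRUNCATION_NOTE.encode("utf-8"))
    kept, used = [], 0
    for chunk in chunks:
        size = len(chunk.encode("utf-8"))
        if used + size > budget:
            break
        kept.append(chunk)
        used += size
    return "".join(kept) + TRUNCATION_NOTE


print(render_digest({"Facts": ["older fact", "newer fact"], "Open tasks": ["write the guide"]}))
```

Measuring in encoded bytes rather than characters keeps the bound honest when bullets contain non-ASCII text.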

"Delete to turn off": rm ~/.manual_slop/knowledge/digest.md → no {knowledge} block injected. Re-enable by running the harvest (which regenerates the digest).


4. The ledger (ledger.json)

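The ledger is the sha256-of-content audit log. It gates deletion: a conversation is reclaimed only after the ledger records a successful harvest, or when the user explicitly passes --no-harvest.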

The format:

{
  "entries": {
    "<sha256-of-conversation-content>": {
      "path": "/home/user/.manual_slop/conversations/<name>-<uuid>",
      "status": "harvested",
      "at": "2026-06-12T14:23:45.123456+00:00",
      "items": {
        "facts": 3,
        "decisions": 2,
        "tasks_done": 1,
        "tasks_open": 0,
        "questions": 1,
        "playbooks": 0,
        "files": 1
      },
      "deleted": true
    }
  }
}

The status values:

| Status | Meaning | Action |
| --- | --- | --- |
| harvested | LLM distillation succeeded; items appended to category files | reclaim (unlink) |
| harvest-failed | LLM distillation failed after retries | keep the conversation; record the error |
| deleted-unharvested | User passed --no-harvest; the conversation is reclaimed without LLM | reclaim (unlink) |
| too-large | File > 1MB; kept without harvesting | keep |

The sha256-of-content dedup: two conversations with the same content share a ledger entry. The second is reclaimed without paying the LLM cost again.
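A minimal sketch of the dedup check, assuming the ledger is loaded as a plain dict with the shape shown above. The helper names `content_key`, `already_harvested`, and `record_entry` are hypothetical and are not taken from the source.

```python
import hashlib
from datetime import datetime, timezone


def content_key(data: bytes) -> str:
    """Ledger key: the sha256 hex digest of the conversation content."""
    return hashlib.sha256(data).hexdigest()


def already_harvested(ledger: dict, data: bytes) -> bool:
    """True when identical content was harvested before (no second LLM call needed)."""
    entry = ledger.get("entries", {}).get(content_key(data))
    return entry is not None and entry.get("status") == "harvested"


def record_entry(ledger: dict, data: bytes, path: str, status: str,
                 items: dict, deleted: bool) -> None:
    """Store one ledger entry in the documented shape."""
    ledger.setdefault("entries", {})[content_key(data)] = {
        "path": path,
        "status": status,
        "at": datetime.now(timezone.utc).isoformat(),
        "items": items,
        "deleted": deleted,
    }


ledger: dict = {}
first = b"user: how does caching work?\nassistant: ...\n"
record_entry(ledger, first, "/tmp/conv-a", "harvested", {"facts": 1}, deleted=True)

duplicate = bytes(first)  # a second conversation with byte-identical content
print("skip LLM for duplicate:", already_harvested(ledger, duplicate))
```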


5. The harvest workflow

5.1 The 7-category schema (the LLM output)

The LLM's harvest output is strict JSON (no prose, no markdown fence):

{
  "facts": [{"statement": "...", "detail": "..."}],
  "decisions": [{"statement": "...", "detail": "..."}],
  "tasks_done": [{"statement": "...", "detail": "..."}],
  "tasks_open": [{"statement": "...", "detail": "..."}],
  "questions": [{"statement": "...", "detail": "..."}],
  "playbooks": [{"name": "...", "steps": "..."}],
  "files": [{"path": "...", "note": "..."}]
}
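The schema above can be checked with a small validator. This is a sketch of what a `parse_harvest_json` might do, not the project's implementation: treating a missing category as an empty array and dropping non-object rows are assumptions made for illustration.

```python
import json

CATEGORY_KEYS = (
    "facts", "decisions", "tasks_done", "tasks_open",
    "questions", "playbooks", "files",
)


def parse_harvest_json(response: str) -> dict:
    """Parse strict JSON harvest output into the 7-category dict.

    Raises json.JSONDecodeError (a ValueError subclass) for prose or fenced
    output, and ValueError when the top-level shape is wrong.
    """
    data = json.loads(response)
    if not isinstance(data, dict):
        raise ValueError("harvest output must be a JSON object")
    parsed = {}
    for key in CATEGORY_KEYS:
        rows = data.get(key, [])  # assumption: a missing category means "nothing durable"
        if not isinstance(rows, list):
            raise ValueError(f"{key!r} must be an array")
        # Assumption: rows that are not objects are ignored rather than fatal.
        parsed[key] = [row for row in rows if isinstance(row, dict)]
    return parsed


result = parse_harvest_json('{"facts": [{"statement": "Digest is bounded to 4KB.", "detail": ""}]}')
print({key: len(rows) for key, rows in result.items()})
```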

The prompt (in ~/.manual_slop/knowledge/prompts/harvest-conversation.md; user-editable, root-first resolution):

# Harvest durable knowledge from a manual_slop conversation

You are given one conversation (or a summary of one). Extract only knowledge that
stays useful after this conversation is deleted. Return only JSON in exactly this
form (no prose, no markdown fence):

[the 7-category schema above]

Category rules:
- facts: durable statements about systems, repositories, tools, environments, or
  constraints that were learned, not assumed.
- decisions: choices that were made, with the why in `detail`.
- tasks_done: concrete work completed in this conversation.
- tasks_open: work that was started, planned, or requested but not finished.
- questions: questions raised and never answered.
- playbooks: command sequences or processes that worked and are reusable; `steps`
  is the runnable sequence.
- files: a note tied to one specific file path (use the absolute path seen in
  the conversation).

General rules:
- Empty arrays are valid and expected: most conversations contain nothing durable.
  Do not invent items to fill categories.
- One item per distinct piece of knowledge; keep `statement` to one sentence.
- `detail` is optional context; omit it or use "" when the statement stands alone.
- Do not include conversation mechanics, tool output noise, retries, or one-off
  trivia (timestamps, token counts, transient errors).

5.2 The retry budget (the contract)

HARVEST_MAX_ATTEMPTS = 2. The retry is at the parse level (not the API level):

import json

HARVEST_MAX_ATTEMPTS = 2  # total attempts, so at most one retry

def harvest_conversation(path, provider, model, *, generate, summarize=None):
    content = read_or_summarize(path, provider, model)
    template = harvest_prompt_path().read_text(encoding="utf-8").strip()
    last_error = None
    for attempt in range(HARVEST_MAX_ATTEMPTS):
        # From the second attempt on, the prompt carries the retry suffix.
        prompt = build_harvest_prompt(template, path.name, content, retry=attempt > 0)
        response = generate(prompt, provider, model)
        try:
            return parse_harvest_json(response)
        except (json.JSONDecodeError, ValueError) as exc:
            # Parse-level failure only; errors raised by generate() propagate.
            last_error = exc
    raise RuntimeError(
        f"harvest output invalid after {HARVEST_MAX_ATTEMPTS} attempts: {last_error}"
    ) from last_error

The retry-suffix: on retry, append \nYour previous reply was not valid JSON. Return only the JSON object.\n to the prompt.
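To see the retry suffix in action, here is a self-contained sketch with a fake provider that fails once. The prompt layout inside `build_harvest_prompt` and the names `harvest_with_retry` and `make_flaky_generate` are assumptions for illustration; only the suffix text and the attempt budget come from this guide.

```python
import json

HARVEST_MAX_ATTEMPTS = 2
RETRY_SUFFIX = "\nYour previous reply was not valid JSON. Return only the JSON object.\n"


def build_harvest_prompt(template: str, name: str, content: str, *, retry: bool = False) -> str:
    """Assemble the prompt; the layout here is assumed, the suffix is from the guide."""
    prompt = f"{template}\n\nConversation: {name}\n\n{content}\n"
    if retry:
        prompt += RETRY_SUFFIX
    return prompt


def harvest_with_retry(template, name, content, provider, model, generate):
    """Parse-level retry: re-ask only when the reply is not valid JSON."""
    last_error = None
    for attempt in range(HARVEST_MAX_ATTEMPTS):
        prompt = build_harvest_prompt(template, name, content, retry=attempt > 0)
        try:
            return json.loads(generate(prompt, provider, model))
        except json.JSONDecodeError as exc:
            last_error = exc
    raise RuntimeError(
        f"harvest output invalid after {HARVEST_MAX_ATTEMPTS} attempts: {last_error}"
    ) from last_error


def make_flaky_generate():
    """Fake provider: prose on the first call, valid JSON on the second."""
    prompts = []

    def generate(prompt, provider, model):
        prompts.append(prompt)
        return "Sure! Here is the JSON:" if len(prompts) == 1 else '{"facts": []}'

    return generate, prompts


generate, prompts = make_flaky_generate()
result = harvest_with_retry("TEMPLATE", "conv-a", "chat text", "fake", "fake-model", generate)
print(result, "after", len(prompts), "attempts")
```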

5.3 The size limits (the budgets)

| Constant | Value | Why |
| --- | --- | --- |
| SUMMARIZE_THRESHOLD_BYTES | 64 KB | Files > 64KB get summarized first |
| MAX_HARVEST_SOURCE_BYTES | 1 MB | Files > 1MB are kept (not harvested) |
| DIGEST_MAX_BYTES | 4 KB | The bounded digest size |
| HARVEST_MAX_ATTEMPTS | 2 | Retry budget on parse failure |
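The size budgets translate into a three-way decision per conversation file. The sketch below assumes binary units (1 KB = 1024 bytes) and uses a hypothetical function name and return labels; the exact byte values in the real module may differ.

```python
SUMMARIZE_THRESHOLD_BYTES = 64 * 1024   # assumption: "64 KB" means 65536 bytes
MAX_HARVEST_SOURCE_BYTES = 1024 * 1024  # assumption: "1 MB" means 1048576 bytes


def plan_for_size(size_bytes: int) -> str:
    """Decide what the harvest does with a source file of the given size."""
    if size_bytes > MAX_HARVEST_SOURCE_BYTES:
        return "too-large"               # kept, never sent to the LLM
    if size_bytes > SUMMARIZE_THRESHOLD_BYTES:
        return "summarize-then-harvest"  # summary becomes the LLM input
    return "harvest"                     # content is sent as-is


for size in (10_000, 200_000, 5_000_000):
    print(size, "->", plan_for_size(size))
```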

5.4 The dry-run-by-default safety

The harvest CLI defaults to dry-run. Without --apply, the CLI classifies, estimates cost, and prints a report. No mutation.

$ python -m src.knowledge_harvest
artifacts: live:42, user-kept:3, prune:0, harvest:17, keep:1
harvest candidates: 2.3MB (~600K input tokens), prune candidates: 0B
dry run; pass --apply to harvest and reclaim

$ python -m src.knowledge_harvest --apply
reclaimed: 2.3MB
harvested items: facts:42, decisions:18, tasks_done:7, tasks_open:3, questions:5, playbooks:2, files:11
digest: /home/user/.manual_slop/knowledge/digest.md
ledger: /home/user/.manual_slop/knowledge/ledger.json

6. The "delete to turn off" pattern

The principle. Feature flags should be data, not config. If a feature is gated by the presence of a file, the user can turn it off by deleting the file. No GUI toggle, no env var, no config.toml edit. Just rm.

The knowledge digest pattern: rm ~/.manual_slop/knowledge/digest.md → no {knowledge} block is injected. Re-enable by running python -m src.knowledge_harvest --apply (which regenerates the digest).

The implementation:

# In aggregate.py:run (the consumer of the digest)
knowledge_digest_path = paths.knowledge_dir() / "digest.md"
if knowledge_digest_path.is_file():
    knowledge_digest = knowledge_digest_path.read_text(encoding="utf-8")
    stable_prefix.append(f"{{knowledge}}\n{knowledge_digest}\n{{/knowledge}}\n")
# else: skip; the file is the switch

The pattern recurs in 3 places:

  1. regenerate_digest deletes the digest when sections are empty
  2. The aggregate.py:run injection check is the load-bearing one
  3. The GUI Knowledge panel shows the file state and provides a [Delete to turn off] button
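The first two places can be sketched together: a writer that removes the file when there is nothing to project, and a reader that treats the file's presence as the switch. Both function names are hypothetical, and the empty-check (`body.strip()`) is an assumption about what "sections are empty" means.

```python
import tempfile
from pathlib import Path


def write_or_remove_digest(digest_path: Path, body: str) -> bool:
    """Write the digest, or delete it when there is nothing to project.

    Returns True when the file exists afterwards (the feature is "on").
    """
    if body.strip():
        digest_path.parent.mkdir(parents=True, exist_ok=True)
        digest_path.write_text(body, encoding="utf-8")
        return True
    digest_path.unlink(missing_ok=True)  # already absent is fine
    return False


def knowledge_block(digest_path: Path) -> str:
    """Return the {knowledge} block, or '' when the file is absent."""
    if not digest_path.is_file():
        return ""  # the file is the switch
    digest = digest_path.read_text(encoding="utf-8")
    return f"{{knowledge}}\n{digest}\n{{/knowledge}}\n"


with tempfile.TemporaryDirectory() as tmp:
    digest = Path(tmp) / "knowledge" / "digest.md"
    write_or_remove_digest(digest, "# Knowledge digest\n\n## Facts\n- example\n")
    print("on: ", bool(knowledge_block(digest)))
    digest.unlink()  # the user's `rm`
    print("off:", bool(knowledge_block(digest)))
```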

7. The graceful failure modes

| Failure | Handling |
| --- | --- |
| LLM returns invalid JSON | Up to 2 attempts in total (HARVEST_MAX_ATTEMPTS); after the 2nd failure, mark harvest-failed in the ledger; keep the conversation |
| File > 1MB | Mark too-large in the ledger; keep the conversation |
| File > 64KB | Summarize via run_subagent_summarization; use the summary as the LLM input |
| Provider not available | Mark harvest-failed; keep the conversation |
| Network timeout | Mark harvest-failed; keep the conversation |
| Disk full writing to category files | Raise; mark harvest-failed; keep the conversation (don't reclaim) |

The pattern: a conversation is reclaimed only after its harvest has succeeded. Every failure leaves the conversation in place and a visible marker in the ledger, so the user can fix the cause and re-run.
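One way to read the failure handling described in this section is as a function from outcome to ledger status plus a reclaim decision. The sketch below is an assumption-laden illustration: `harvest_or_mark`, its return shape, and the exception types it catches are hypothetical, chosen to mirror the failures listed in this section.

```python
MAX_HARVEST_SOURCE_BYTES = 1024 * 1024  # assumption: "1MB" means 1048576 bytes


def harvest_or_mark(size_bytes: int, harvest) -> dict:
    """Run `harvest()` and map the outcome to a ledger status.

    `harvest` is any zero-argument callable that raises on failure
    (invalid JSON after retries, provider or network trouble, disk full).
    """
    if size_bytes > MAX_HARVEST_SOURCE_BYTES:
        return {"status": "too-large", "reclaim": False, "error": None}
    try:
        harvest()
    except (RuntimeError, OSError) as exc:  # TimeoutError is an OSError subclass
        # Never reclaim on failure: the conversation is the only copy.
        return {"status": "harvest-failed", "reclaim": False, "error": str(exc)}
    return {"status": "harvested", "reclaim": True, "error": None}


def flaky():
    raise RuntimeError("harvest output invalid after 2 attempts")


print(harvest_or_mark(10_000, lambda: None))
print(harvest_or_mark(10_000, flaky))
print(harvest_or_mark(5_000_000, lambda: None))
```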


8. The injection (where the digest is used)

The digest is injected into the stable position of the initial context (layer 7 of the 12-layer model; per cache_friendly_context.md):

# In aggregate.py:run (the consumer)
def build_initial_context(ctrl, user_message):
    stable_prefix = []

    # Layers 1-6: role, schema, tools, system prompt, persona, project context
    # (elided here; each layer appends its rendered text to stable_prefix)

    # Layer 7: knowledge digest (the 4KB bounded projection)
    knowledge_digest_path = paths.knowledge_dir() / "digest.md"
    if knowledge_digest_path.is_file():
        knowledge_digest = knowledge_digest_path.read_text(encoding="utf-8")
        stable_prefix.append(f"{{knowledge}}\n{knowledge_digest}\n{{/knowledge}}\n")

    # Layers 8-12: discussion metadata, active preset, per-file details, prior turns, user message
    # (elided here; each layer appends its rendered text to volatile_suffix)
    volatile_suffix = []

    return "".join(stable_prefix + volatile_suffix)

The position matters. The digest is in the stable position (before the Instance: volatile block). The cache can include the digest in the cached prefix; the volatile suffix is not cached. Per cache_friendly_context.md §1.


9. The cross-references

  • conductor/code_styleguides/knowledge_artifacts.md — the canonical styleguide
  • docs/guide_agent_memory_dimensions.md §4 — the knowledge dim in context
  • docs/guide_caching_strategy.md §5 — where the digest is injected
  • conductor/code_styleguides/feature_flags.md — the "delete to turn off" pattern
  • conductor/tracks/nagent_review_20260608/nagent_review_v2_3_20260612.md §3.1, §4 — the nagent pattern that informed this guide