Private
Public Access
0
0
Files
manual_slop/conductor/code_styleguides/data_oriented_design.md
T
ed 434b6d0d54 docs: reduce redundant content across files; map references to canonical sources
Per user 'a bunch of docs just committed had redundant content across
files. Can we do a reduction of that and instead map references to
other files?'

This commit reduces content duplication across 9 files. The
canonical sources are kept as detailed references; the other
files now point to them.

Reductions (table replaced with 'see canonical' reference):

1. data_oriented_design.md §9: the 4-dim memory table
   (canonical: conductor/code_styleguides/agent_memory_dimensions.md §0)

2. guide_agent_memory_dimensions.md §0: the 4-dim memory table
   (canonical: conductor/code_styleguides/agent_memory_dimensions.md §0)

3. guide_caching_strategy.md §1: the 12-layer model
   (canonical: conductor/code_styleguides/cache_friendly_context.md §1)

4. guide_ai_client.md 'Cache strategy' section: the 12-layer model recap
   (canonical: conductor/code_styleguides/cache_friendly_context.md §1)

5. guide_knowledge_curation.md §1: the 5 category file details
   (canonical: conductor/code_styleguides/knowledge_artifacts.md §1)

6. product-guidelines.md 'Memory Dimensions' section: the 4-dim table
   (canonical: conductor/code_styleguides/agent_memory_dimensions.md §0)

7. guide_mma.md '4 memory dimensions' section: the MMA scope table
   (canonical: conductor/code_styleguides/agent_memory_dimensions.md §0)

8. docs/AGENTS.md §0 + §5-§8: 4-dim table + caching/knowledge/RAG/
   feature flag tables (canonical: the per-topic styleguides in
   conductor/code_styleguides/)

9. AGENTS.md 'Code Styleguides' section: the 6-styleguide list
   (canonical: docs/AGENTS.md §2)

The principle: each piece of content has ONE source of truth; other
places point to it. The data-oriented way. Files retain their
narrative flow and the 'what this is' intros, but the detailed
tables are now in their canonical home.

Net effect: -2100 bytes across 9 files (without losing any
information - the canonical sources are unchanged). The
'cross-references' sections are kept; the duplicated content
is removed.
2026-06-12 14:10:30 -04:00

19 KiB

Data-Oriented Design (the canonical rules)

Status: This is the canonical DOD reference for Manual Slop. Imported by AGENTS.md and injected into the Application's RAG / context assembly via manual_slop.toml [agent].context_files. One source of truth for both harnesses. Source: Adapted from Mike Acton's context/data-oriented-design.md (13,084 bytes, the nagent canonical reference). Date: 2026-06-12

What this is. Operating rules, not philosophy: every rule here tells you what to do. Approach every problem — code, plan, pipeline, document — by understanding the real data first, then designing the simplest machine that transforms the input you actually have into the output you actually need, at a cost you can state. Decide from facts and measurement, not habit, analogy, or dogma.

Manual Slop context. The project is an ImGui GUI orchestrator for LLM-driven coding sessions. The dominant data is the conversation — a typed message list with role + content + metadata + optional thinking segments. The data has to survive across workers (MMA Tier 3 subprocesses), across tools (the 45 MCP tools), across LLM providers (8 send paths), and across the user's editing session (per-entry edit, branch, undo). The data is the thing; the workers and processes are disposable.


0. Scope, tiers, and precedence

Scale the ceremony to the task. Decide the tier first; when unsure, pick the higher tier and say which you picked.

Tier When What to do
Tier 0 Trivial: typo fixes, mechanical edits, one-line bugfixes, answering questions Apply the defaults silently (naming, explicit error behavior, no speculative generality). No written plan or checklist
Tier 1 Non-trivial change: new function or feature, behavior change, anything that touches a data layout, contract, or interface Required: answer the framing + data questions in a short written plan before implementing, run the simplification pass, run the final self-check
Tier 2 Subsystem-scale: new or substantially reworked subsystem, pipeline, or tool Everything in tier 1 plus the enforceable deliverables (per §10)

Precedence when rules conflict:

  1. An explicit instruction from the user for the current task
  2. This document (conductor/code_styleguides/data_oriented_design.md)
  3. Existing codebase or workflow convention

When this document conflicts with existing convention and complying would mean a large refactor, do not silently rewrite and do not silently conform: state the conflict, estimate the cost of each option, and propose the smallest compliant change.


1. The 3 defaults to reject

These are the three default beliefs that produce bad solutions. Each comes with the replacement behavior — do the replacement, every time:

1.1 "The tools are the platform."

Reality is the platform: the actual hardware, organization, deadline, physics.

Do instead: before designing, name the real platform and the 2-3 of its fixed properties that constrain this solution, and design within them.

For Manual Slop: the platform is the user's machine (Windows; 1-8 cores; 16-128 GB RAM), the LLM provider API (rate limits, context window, cost), and the MCP tool surface (45 tools, 3-layer security). Not the ImGui API; not the Python version. The ImGui API is the view; the platform is the view + the data + the user.

1.2 "Design around a model of the world."

World models (objects, metaphors, idealized categories) hide the actual data and the actual cost.

Do instead: design around the data. Do not introduce an abstraction until you can describe, concretely, the data it organizes and the transform it serves — and what the abstraction costs.

For Manual Slop: the data is the disc_entries list, the FileItem schema, the ContextPreset schema, the RAGEngine index, the comms.log JSON-L. Not the Discussion or the Persona or the Project as objects. The objects are convenient summaries; the data is the ground truth.

1.3 "The solution matters more than the data."

The only purpose of any solution is to transform data from one form to another.

Do instead: start every task from the actual inputs and required outputs, never from the machinery you'd like to build.

For Manual Slop: before proposing a new class, module, or pipeline, write down (in a comment, in the plan, in the test) what the input is and what the output is. If you can't, that's the first task.


2. The 8 core defaults (any problem)

  1. The problem is the data. Before proposing any solution, describe the input and output concretely. If you can't, getting that description is the first task.
  2. State the cost. Every design recommendation you make must state its cost (time, memory, complexity, maintenance) and on what platform that cost is paid. A recommendation without a cost is a guess.
  3. Solve only the problem you have. Different data is a different problem. Do not add parameters, options, abstraction layers, or extension points for hypothetical future needs. If you're tempted, write the one-line note of what you didn't build and why, and move on.
  4. Where there is one, there are many. Anything that happens once almost always happens many times — across space or across the time axis. Default every design to the batch; treat the single case as a batch of size one.
  5. The common case dominates. Identify the most common case explicitly and design the straight-line path for it. Handle rare and error cases, but outside that path — a "maybe" checked everywhere is an "always."
  6. Exploit every constraint you have. List the known constraints (ranges, volumes, rates, invariants) and use them to remove work. Do not discard a constraint to make the solution "more general" — that generality is a cost paid forever.
  7. Simplicity is removing work. Prefer fewer states, fewer steps, fewer special cases, fewer moving parts. Every added state or branch must be carried, tested, and explained — count them as cost.
  8. "Can't be done" is a cost claim. When something seems impossible, what is almost always true is that it costs more than it's worth. Say that, with the estimate, so the tradeoff can actually be decided.

3. Get the real data (required before designing)

You cannot observe data you were not given — so observe what you can, and label everything else:

  • Inspect before assuming. Read representative input files, sample actual values, read the actual call sites, run the code on real input when a way to do so exists. Do not design from the type signatures or the docs alone.
  • Label every assumption. For each fact you need but cannot observe, write an explicit line — ASSUMPTION: — affects — in your plan, and prefer designs that are cheap to revisit if the assumption is wrong. Ask the user only when the answer materially changes the design.
  • Never fabricate. Do not invent plausible-looking values, distributions, or measurements and treat them as real.

Answer these about the data (in the tier 1+ plan):

  1. What does the input actually look like — shape, volume, source?
  2. What are the most common real values, and how are they distributed?
  3. What are the acceptable ranges, and what happens when out-of-range data arrives?
  4. What is the frequency of change — what is stable, what is volatile?
  5. What does the solution read and where does it come from? What does it write and where is it used? What does it touch that it doesn't need?

For Manual Slop specifically: the data is disc_entries (the conversation), FileItem (per-file curation), ContextPreset (per-preset curation), RAGEngine (semantic search), comms.log (audit), Persona (agent profile), manual_slop.toml (project config), app_state (live state). Read the actual files before designing.


4. Method (tier 1+)

Show this work as a short plan, a line or two per step:

  1. Frame it. What is the problem, why is it worth solving, where is the limit beyond which it isn't, and what is plan B?
  2. Get the data (per §3).
  3. State the cost of the dominant transform on the real platform.
  4. Design the transform: a sequence or DAG of explicit transformations — what comes in, what goes out, what each step is responsible for, with explicit contracts (shape, meaning, ownership, lifetime, valid ranges) at each boundary.
  5. Run the simplification pass (per §5); say which questions applied and what work they removed.
  6. Define done. State the success criteria and what evidence would prove the approach wrong, before building.
  7. Verify. Check the result against the real data and the stated criteria, and report what was and wasn't verified.

5. The simplification pass (run recursively on every sub-problem)

The 7 questions, applied in order, to every sub-problem:

# Question Reduces
1 Can we not do this at all? Work that shouldn't exist
2 Can we do this only once (precompute, cache, amortize)? Repeated work
3 Can we do this fewer times? Frequency of work
4 Can we approximate the result so that no one notices the difference? Precision cost
5 Can we use a small lookup table? Branching cost
6 Can we use a large lookup table? Branching cost (alternative)
7 Can we use a small buffer/FIFO to decouple producer from consumer? Coupling cost
8 Can we constrain the problem further so a simpler machine suffices? Generality cost

If any question applies, do the cheaper thing. If a question doesn't apply, say why and move on. The questions are not a checklist to score against; they're a habit.


6. Design rules

  • Minimize states and branches by design, not by adding checks. Where the data genuinely varies, partition it by case and handle each partition straight-line, rather than re-deciding the case per element.
  • Out-of-range and error behavior is always explicit — clamp, reject, drop, or fail loudly; chosen deliberately and written down. Never leave undefined behavior as an implicit policy, in any tier.
  • Complexity requires evidence. Add complexity only against a real, observed need — never a hypothetical one.

7. Performance claims

  • Never assert an unmeasured performance result. Not "this should be faster," not invented numbers.
  • If a way to measure exists (benchmark, profiler, test harness, counters), measure, and include before/after numbers with the change.
  • If no way to measure exists here, label the change unverified, state the expected effect as a hypothesis, and specify the exact measurement that would verify it.
  • If there is no measurable performance requirement, build the simplest correct design and skip speculative optimization entirely.

For Manual Slop: the existing audit scripts (scripts/audit_main_thread_imports.py, scripts/audit_weak_types.py, scripts/check_test_toml_paths.py) are the measurement infrastructure. Use them. Don't claim "faster" without a number from one of these.


8. Software specifics (systems, engine, embedded, game)

The rules above apply to any problem. These are their conclusions for software, where the hardware is unforgiving and the data volumes are real.

8.1 Batch-first transforms (plural by default)

  • Write transforms to operate on batches/arrays by default, named in the plural (update_things, not update_thing).
  • A singular call is a degenerate batch: the same batch path with count = 1. Do not maintain separate singular logic without a proven, measured need.
  • Exception: true singletons (configuration state, a single shared resource). Taking the exception requires a written note: why the data is genuinely singular and batch semantics don't apply.

8.2 Memory, layout, and access

  • Indices over pointers/references/handles by default (index into a contiguous array or table). Any pointer-heavy hot path must include a short written justification for why indices are insufficient.
  • Organize data by access pattern, not conceptual ownership. Split hot and cold fields when the cold fields aren't needed in the dominant loop.
  • For each hot path, write down the expected access pattern (linear / strided / random), expected branch behavior (predictable / unpredictable), and the hardware assumptions.
  • When branch entropy is high, prefer partitioned passes (bucket by state/tag, process each bucket straight-line) over per-element branching.
  • Keep the common-case path branch-minimal; rare and error handling lives outside the hot loop.

8.3 Data protocols between systems

Systems communicate through explicit data protocols, modeled after network protocols and file formats — explicit layout, versioning, documented meaning. The default is a flat struct: fixed layout, no hidden pointers, no OO-style interfaces. Use tagged unions or header-plus-payload when the flat struct genuinely can't express it. Do not model system boundaries as objects, virtual calls, or opaque handles.

For Manual Slop: the boundary between the AI client and the LLM provider is a flat struct (the Message dataclass: role, content, tool_calls, tool_results); the boundary between the MCP client and the tool implementer is a flat struct (the tool_input dict); the boundary between the LLM client and the GUI is the comms.log JSON-L. Not objects with virtual methods. Not opaque handles. Flat structs.

8.4 Hardware is the platform

Design with the actual hardware's properties — cache hierarchy, memory bandwidth, alignment, latency vs throughput — and to its strengths.

  • Latency and throughput are only the same thing in a sequential system. For every performance requirement, identify which one it actually is before designing for it.
  • The compiler and language are tools, not magic: memory layout, access order, and the choice of what work to do at all are your job, not theirs — and they are roughly 90% of the problem. Know what the compiler can reasonably do with what you wrote, and don't delegate what it can't.

9. The 4 memory dimensions (the Manual Slop context)

The conversation data has 4 distinct memory dimensions (curation / discussion / RAG / knowledge). Each lives at a different layer; each serves a different purpose.

The canonical reference is conductor/code_styleguides/agent_memory_dimensions.md §0 (the full 4-dim table + per-dim deep-dives + boundaries + decision tree). This section is a pointer.

The one-line summary:

  • Curation is per-file structural (the FileItem schema)
  • Discussion is per-turn conversational (the disc_entries list)
  • RAG is opt-in semantic (the ChromaDB vector store)
  • Knowledge is per-project durable (the markdown files at ~/.manual_slop/knowledge/)

The shape rule. A feature that wants one should use the matching dimension; mixing them is a maintenance liability.

10. Enforceable deliverables (tier 2)

For each new or substantially reworked subsystem:

  • One explicit batch transform contract: input layout, output layout, owner, lifetime, valid value ranges.
  • A plural/batch path for every transform; singular calls are thin wrappers over the batch implementation (count = 1) unless documented as a true singleton.
  • A written justification for any pointer/reference/handle-heavy hot path explaining why index-based access is insufficient.
  • Explicit out-of-range behavior (clamp/reject/drop/error) at every input boundary.
  • Unresolved design questions filed as local issue files under issues/ — not GitHub issues, not inline TODOs.

For Manual Slop specifically: the equivalent of issues/ is docs/reports/ (where session retrospectives, audit reports, and design-issue docs live) or per-track spec.md §9 "Open Questions".


11. Final self-check (run before delivering tier 1+ work)

Verify, and fix or flag anything that fails:

  • The plan answered the framing, data, and cost questions — or every gap is labeled ASSUMPTION with what it affects.
  • The most common case is identified and the design serves it straight-line; rare/error cases are out of the common path.
  • The simplification pass ran; the work it removed (or why nothing could be removed) is stated.
  • No speculative generality: no parameter, option, or abstraction exists for a need that isn't real yet.
  • Out-of-range and error behavior is explicit at every boundary.
  • Transforms are plural/batch, or the singleton exception is documented.
  • Pointer-heavy hot paths carry their written justification; everything else uses indices.
  • No unmeasured performance claim anywhere in code, comments, or summary; measurements included where possible, hypotheses labeled where not.
  • Done-criteria from the plan were checked, and the summary reports what was verified and what wasn't.
  • (Tier 2) Deliverables above are present; open questions are filed under docs/reports/ or per-track spec.md §9.

12. Cross-references

  • AGENTS.md — imports this file; the project-root agent-facing rules
  • ./docs/AGENTS.md — the agent-facing mirror of docs/Readme.md (recommended first read for any agent scoping a feature)
  • conductor/code_styleguides/agent_memory_dimensions.md — the 4 memory dimensions
  • conductor/code_styleguides/rag_integration_discipline.md — the conservative-RAG rule
  • conductor/code_styleguides/cache_friendly_context.md — stable-to-volatile ordering + the cache TTL contract
  • conductor/code_styleguides/knowledge_artifacts.md — the knowledge harvest pattern
  • conductor/code_styleguides/feature_flags.md — "delete to turn off" + config flags
  • conductor/product-guidelines.md — the project's other product conventions
  • conductor/tech-stack.md — the tech stack constraints
  • conductor/edit_workflow.md — the edit-tool contract

13. External sources (the prior art this was adapted from)

  • Mike Acton, "Data-Oriented Design and C++" (cppCon 2014) — the foundational DOD talk
  • Casey Muratori, "The Big OOPs: Anatomy of a Thirty-Five-Year Mistake" (BSC 2025) — the historical indictment of OOP
  • Ryan Fleury, "A Taxonomy of Computation Shapes" (Feb 2023) — the 6 computational shapes
  • Ryan Fleury, "The Codepath Combinatoric Explosion" (Apr 2023) — the nil-sentinel / immediate-mode defusing techniques
  • Ryan Fleury, "Errors are just cases" (the Result[T, ErrorInfo] pattern) — the data-oriented error handling
  • Andrew Reece, "Assuming as Much as Possible" (BSC 2025) — the Xar pattern; the engineering discipline for stripping layers
  • John O'Donnell, "IMGUI / The Pitch / MVC" — the immediate-mode + IEventTarget paradigm
  • Mike Acton, context/data-oriented-design.md (nagent canonical; 13,084 bytes) — the immediate source for the structure of this document