progress

GEMINI.md
@@ -6,6 +6,9 @@ DO NOT EVER make a shell script unless told to. DO NOT EVER make a readme or a f
WHEN WRITING SCRIPTS USE A 120-160 character limit per line. I don't want to see scrunched code.
## Coding Conventions

Before writing any C code in this workspace, you MUST review the strict stylistic and architectural guidelines defined in [CONVENTIONS.md](./CONVENTIONS.md). These dictate the usage of byte-width types, X-Macros, WinAPI FFI mapping, and memory arenas.
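
For orientation only, here is a rough sketch of the kinds of constructs those conventions refer to; the names and exact shapes below are placeholders (CONVENTIONS.md is authoritative), not the canonical style:

```c
/* Illustrative sketch only -- CONVENTIONS.md is authoritative; all names here are placeholders. */

/* Byte-width types instead of bare int/long. */
typedef unsigned char      U1;
typedef unsigned short     U2;
typedef unsigned int       U4;
typedef unsigned long long U8;

/* X-Macro: one table expanded into both an enum and a matching name array. */
#define OPCODE_TABLE(X) \
    X(OP_NOP)           \
    X(OP_CALL)          \
    X(OP_RET)

#define X_AS_ENUM(n) n,
typedef enum { OPCODE_TABLE(X_AS_ENUM) OP_COUNT } OpKind;
#undef X_AS_ENUM

#define X_AS_NAME(n) #n,
static const char *op_names[] = { OPCODE_TABLE(X_AS_NAME) };
#undef X_AS_NAME

/* Memory arena: one big block (e.g. from VirtualAlloc), bump-allocated, reset wholesale. */
typedef struct { U1 *base; U8 used; U8 cap; } Arena;
```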
## Necessary Background for Goal

Watch or read the following:
@@ -14,111 +17,31 @@ Watch or read the following:
* [Metaprogramming VAMP in KYRA, a Next-gen Forth-like language](https://youtu.be/J9U_5tjdegY)
* [Neokineogfx - 4th And Beyond](https://youtu.be/Awkdt30Ruvk)

There are transcripts for each of these videos in the [references](./references/) directory, along with a comprehensive curation of Lottes's blogs, Onat's tweets, and architectural consolidations.
## Goal

Learn ColorForth and be able to build a ColorForth derivative from scratch, similar to what Timothy Lottes and Onat have built.

**Critical Clarification:** The goal is *not* for the AI to auto-generate a novelty solution or dump a finished codebase. The objective is for me (the user) to *learn* how to build this architecture from scratch. The AI must act as a highly contextualized mentor, providing guided nudges, architectural validation, and specific tactical assistance when requested. We are at the cusp of implementation. The AI should lean on the extensive curation in `./references/` to ensure its advice remains strictly aligned with the Lottes/Onat "sourceless, zero-overhead" paradigm, minimizing generic LLM hallucinations.
## Some excerpts from discussing things with Grok 4.20 beta

### User
```md
What are the prerequisites for me to understand this completely? I know code such as that conveyed by Ryan Fleury, Jonathan Blow, Casey Muratori, Mike Acton, John Carmack, Tim Sweeney, Eskil Steenberg, Sebastian Aaltonen, etc.

I'm only aware of x86-64 and MIPS assembly from doing Pikuma, reading some information on assemblers, looking deeply into linkers, doing my own build scripts, and looking into the Portable Executable format on Windows. But I'm not fluent in machine-level code or its nuances. I don't have intuition for the trade-offs of Lottes's subset of x86, nor for Onat's, nor for how easy it is to extend and support new codegen within their Forth-like derivatives, etc.

They always advertise it as simple, and I think there is a true simplicity to it, but there is a gulf. It's similar to being able to use Lua while it would still take a while to bootstrap your own LuaJIT. The biggest hurdle is the frontend they need for actually dealing with non-text-based code modification or management.

The other thing is that I value being able to integrate or use external tooling like the RAD debugger. That will most likely require me to generate some source file and dynamically populate the data it needs on the fly so I can step-debug the program at runtime... I'm not sure how big of a hurdle that is. Do I even need linkers with this setup?
```
### Grok
```md
**Lottes' x68 (32-bit granularity x86-64 subset)** and **Onat's KYRA codegen style** are both deliberate restrictions on the full x86-64 ISA. They are chosen for **editor simplicity** and **zero-overhead execution**, not raw peak FLOPS. Both deliver **net performance gains** in their target domain (tiny self-contained interactive systems, shaders, live-reloaded tools) because the code stays <100 KB and fits entirely in L1/L2 cache. There is **no notable runtime degradation** for real work — often the opposite.

Here is the exact trade-off breakdown.

### Lottes' x68 — 32-bit aligned opcodes only

Every instruction (and immediate) is forced to 32-bit (4-byte) boundaries by padding.

**How padding works (exact from his talk):**

- Ignored prefixes (e.g. 0x3E DS segment)
- Multi-byte NOPs (0x0F 0x1F ...)
- Extra REX prefixes when not needed
- Example: normal RET (C3) → C3 90 90 90 (padded to 4 bytes)

**Trade-offs:**

| Aspect | Benefit (why he does it) | Cost / Potential downside | Real perf impact (his domain) |
|---------------------|-------------------------------------------------------|----------------------------------------------------|-------------------------------|
| **Editor / Sourceless** | Code = pure 32-bit word array. Hex editor, find/replace, auto-relink jumps all trivial. No variable-length decode. | None in practice | Huge win (instant edit) |
| **Jumps / Calls** | Relative offsets always 32-bit and aligned to end of opcode → editor can patch without decoding | Slightly larger code | Neutral (tiny code) |
| **Inline data** | Immediates always at 32-bit boundaries → easy visual data in hex view | — | Win |
| **Code size** | — | 20–50% larger due to padding (estimated) | Negligible (everything in L1) |
| **Decode / uop cache** | — | Slightly more bytes fetched, possible extra uops | Negligible on Zen/Intel wide decoders |
| **ISA coverage** | Still full x64 power for the ops he needs | Can't use arbitrary unpadable instructions easily | None (he only needs tiny subset) |

**Perf verdict from Lottes:**

He explicitly says the code is so small it all fits in cache, so the padding cost is irrelevant. The **gain** is massive reduction in total system complexity (no assembler, no linker, no text parser). Sourceless mode = true zero-overhead runtime. He views this as **better** than full ISA for interactive work.
### Onat's KYRA style (not a strict opcode subset, but a coding / emission convention)

Full x86-64 base + SSE (without VEX prefixes) + all addressing modes (including RIP-relative). The restriction is in **how** he uses it.

**Core rules (exact from talk + his site):**

- Only **two temporary registers** for the "stack": RAX and RDX + 1 hidden bit ("which is top?").
- Before every call/definition: `xchg rax, rdx` (3 bytes: 48 87 C2).
- Everything else = global memory accessed via **one base register** (R15 points to a contiguous struct of all variables).
- Every high-level word emits **1–3 real instructions** max (often 1).
- SPIR-V shaders = direct binary emit (no glslang).

**Trade-offs:**

| Aspect | Benefit (why Onat does it) | Cost / downside | Real perf impact |
|-------------------------|-----------------------------------------------------------------|------------------------------------------------------|------------------|
| **Runtime stack** | Zero stack ops, zero frame setup, no juggling | Must manage globals manually | **Massive gain** (60 ms text/Forth vs 4 ms) |
| **Register usage** | Only 2 temps + base → trivial templates, 1800-byte compiler | Occasional spill to global (he says 3 ns, ignore) | Gain (fewer instructions) |
| **Globals** | One contiguous block via R15 → perfect cache locality | Programmer tracks lifetimes (single-threaded) | Gain (better than stack) |
| **SSE / AVX** | SSE without VEX (2/3-byte encoding) | No VEX/AVX2+ in templates (easy to add later) | Minor loss if you need AVX, but he doesn't |
| **Dictionary / lookup** | Pure 32-bit index (next version 32-bit index + separate tag byte) | You curate the index space by hand | **Huge gain** (no hash, 1–2 ms full compile) |
| **Code size / density** | Extremely dense (single instructions) | None | Win |

**Perf numbers Onat actually shows (VAMP program):**

- Full recompile (editor + Vulkan + FFmpeg + shaders): 4–8 ms on normal hardware.
- Shader compile: <5 ms (mostly driver), ~100× faster than glslang.
- Instruction generation: 180 million instructions/second/core on Ryzen 5900HX.
- Compared to C: "thousand times more time to compile in C" for the same program.
He explicitly calls out the classic Forth stack as the bottleneck: "the difference between using data stack versus using registers only".

### Overall: Do they lose or gain perf?

**Net gain — often by 5-100× in the metrics that matter to them.**

**Gains:**

- Iteration speed (4–8 ms full rebuild vs seconds in C).
- Runtime speed (fewer instructions, perfect locality, no ABI/calling-convention tax).
- Cache behavior (tiny code + contiguous globals = L1 resident forever).
- Debuggability / visibility (live dataflow viz of RAX/RDX + globals).

**Losses (real but tiny):**

- Slightly larger code from Lottes padding.
- Occasional extra global access instead of a 3rd/4th register (Onat).
- Can't blindly copy-paste AVX-heavy code without adjusting templates.

For programs that fit in a few cache lines (exactly their use-case: editors, shaders, tools, games), the simplicity wins so hard that the "losses" disappear. This is why both say "simple" and mean it — they removed everything that costs more than it gives.

If you build the 60-line emitter I described earlier, you will feel the gain immediately: change one token → re-emit 100 bytes → run again in <1 ms. That is the real performance story.

Want the exact 32-bit x68 padding helper + Onat's xchg + global load/store templates as copy-paste C code next? Just say the word and I'll drop the blocks.
```
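
The excerpt above stops right where the copy-paste blocks would start. As a minimal, unverified sketch of the two mechanisms it describes (4-byte NOP padding for Lottes-style x68 alignment, and the `xchg rax, rdx` / R15-relative templates for Onat's style), assuming a flat byte buffer and helper names of my own invention rather than anything from either author's code:

```c
#include <stdint.h>

/* Sketch only: a flat code buffer plus a write cursor. Names are assumptions, not from the talks. */
typedef struct { uint8_t *buf; uint32_t len; } Emit;

static void emit1(Emit *e, uint8_t b) { e->buf[e->len++] = b; }

/* Lottes-style x68 alignment: pad the cursor up to the next 4-byte boundary
   with single-byte NOPs (0x90), so every opcode starts on a 32-bit word. */
static void pad32(Emit *e) {
    while (e->len & 3) emit1(e, 0x90);
}

static void emit_ret_padded(Emit *e) {
    emit1(e, 0xC3);   /* ret */
    pad32(e);         /* -> C3 90 90 90 when the ret began on a boundary */
}

/* Onat-style two-register "stack": swap top-of-stack between RAX and RDX. */
static void emit_xchg_rax_rdx(Emit *e) {
    emit1(e, 0x48); emit1(e, 0x87); emit1(e, 0xC2);   /* xchg rax, rdx */
}

/* Globals live in one contiguous block addressed off R15. */
static void emit_load_global(Emit *e, uint32_t off) {   /* mov rax, [r15+off] */
    emit1(e, 0x49); emit1(e, 0x8B); emit1(e, 0x87);
    emit1(e, (uint8_t)off); emit1(e, (uint8_t)(off >> 8));
    emit1(e, (uint8_t)(off >> 16)); emit1(e, (uint8_t)(off >> 24));
}

static void emit_store_global(Emit *e, uint32_t off) {  /* mov [r15+off], rax */
    emit1(e, 0x49); emit1(e, 0x89); emit1(e, 0x87);
    emit1(e, (uint8_t)off); emit1(e, (uint8_t)(off >> 8));
    emit1(e, (uint8_t)(off >> 16)); emit1(e, (uint8_t)(off >> 24));
}
```

The takeaway is that every template is a handful of fixed bytes, which is what keeps the compiler and editor tiny.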
## Architectural Constraints (The "Lottes/Onat" Paradigm)

Based on the curation in `./references/`, the resulting system MUST adhere to these non-standard rules:

1. **Sourceless Environment (x68):** No string parsing at runtime. Code exists purely as an array of 32-bit tokens.
   - **Token Layout:** 28 bits of payload (compressed name/index/value) + 4 bits for the semantic "Color" Tag (see the packing sketch after this list).
2. **Visual Editor as the OS:** The editor directly maps to the token array. It does not read text files. It uses the 4-bit tags to colorize the tokens live.
3. **Register-Only Stack:** The traditional in-memory Forth data stack is eliminated entirely.
   - We strictly use a **2-item register stack** (`RAX` and `RDX`).
   - Stack rotation is handled via the `xchg rax, rdx` instruction.
4. **Preemptive Scatter ("Tape Drive"):** Function arguments are not pushed to a stack before a call. They are "scattered" into pre-allocated, contiguous global memory slots during compilation/initialization. The function simply reads from these known offsets, eliminating argument-gathering overhead.
5. **No `if/then` branches:** Rely on hardware flags and conditional returns (`ret-if-signed`) combined with factored calls to avoid writing complex AST parsers.
6. **No Dependencies:** The C implementation must be minimal (`-nostdlib`), ideally running directly against OS APIs (e.g., WinAPI `VirtualAlloc`, `ExitProcess`, `GDI32` for rendering).
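
A minimal sketch of that token layout, assuming the 4-bit tag occupies the top nibble (the references do not pin down the exact bit placement, so treat the layout below as an assumption):

```c
#include <stdint.h>

/* One token: 28-bit payload (name index / value) + 4-bit color tag.
   Placing the tag in the top nibble is an assumption, not a confirmed layout. */
typedef uint32_t Token;

#define TOKEN_PAYLOAD_BITS 28u
#define TOKEN_PAYLOAD_MASK ((1u << TOKEN_PAYLOAD_BITS) - 1u)   /* 0x0FFFFFFF */

static Token token_pack(uint32_t tag4, uint32_t payload28) {
    return (Token)(((tag4 & 0xFu) << TOKEN_PAYLOAD_BITS) | (payload28 & TOKEN_PAYLOAD_MASK));
}

static uint32_t token_tag(Token t)     { return t >> TOKEN_PAYLOAD_BITS; }
static uint32_t token_payload(Token t) { return t & TOKEN_PAYLOAD_MASK; }
```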
## Visual Context Synthesis & Color Semantics

Based on the extracted frame OCR data from the references (Lottes' and Onat's presentations), here is the persistent mapping of ColorForth visual semantics to language logic for this project (a hypothetical 4-bit tag mapping is sketched after the list):
- **Red (`<RED>`):** Defines a new word or symbol in the dictionary. This is the entry point for compilation.
- **Green (`<GREEN>`):** Compiles a word into the current definition.
@@ -126,8 +49,3 @@ Based on the extracted frame OCR data from the references (Lottes' and Onat's pr
- **Cyan/Blue (`<CYAN>` / `<BLUE>`):** Used for variables, memory addresses, or formatting layout (not executable instruction logic).
- **White/Dim (`<WHITE>` / `<DIM>`):** Comments, annotations, and UI elements.
- **Magenta (`<MAGENTA>`):** Typically used for pointers or state modifiers.
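
Tying these color semantics to the 4-bit tag field, a hypothetical mapping could look like the following; the numeric values are arbitrary placeholders, since the references do not specify an encoding:

```c
/* Hypothetical 4-bit tag values for the color semantics above.
   Only the meanings come from the references; the numbers are placeholders. */
typedef enum {
    TAG_RED     = 0x1,  /* defines a new word / symbol in the dictionary */
    TAG_GREEN   = 0x2,  /* compiles a word into the current definition   */
    TAG_CYAN    = 0x3,  /* variables / memory addresses                  */
    TAG_BLUE    = 0x4,  /* formatting / layout                           */
    TAG_WHITE   = 0x5,  /* comments, annotations, UI elements            */
    TAG_DIM     = 0x6,  /* dimmed annotations / UI                       */
    TAG_MAGENTA = 0x7   /* pointers / state modifiers                    */
} ColorTag;

/* e.g. token_pack(TAG_RED, name_index) with the packing helper sketched earlier. */
```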
**Architectural Notes Extracted:**

1. **Sourceless Environment:** The underlying system doesn't deal with parsing strings. It deals with 32-bit tagged tokens (as noted in Lottes' 32-bit x68 alignment).
2. **Visual Editor:** The editor is intrinsically tied to the compiler. It reads the same memory structure. It uses these color properties to colorize the tokens live.
3. **Hardware Locality:** We see a major focus on removing the stack in favor of register rotation (`RAX`, `RDX`) as per Onat's methodology.