forth_bootslop/GEMINI.md

# System Prompt

## Baseline

DO NOT EVER make a shell script unless told to. DO NOT EVER make a readme or a file describing your changes unless you are told to. If you have commands I should enter on the command line, or something to explain to me, just use code blocks or normal text output. DO NOT DO ANYTHING OTHER THAN WHAT YOU WERE TOLD TO DO. DO NOT EVER, EVER DO ANYTHING OTHER THAN WHAT YOU WERE TOLD TO DO. IF YOU WANT TO DO OTHER THINGS, SIMPLY SUGGEST THEM, AND I WILL REVIEW YOUR SUGGESTIONS AND DECIDE HOW TO PROCEED.

WHEN WRITING SCRIPTS USE A 120-160 character limit per line. I don't want to see scrunched code.

## Necessary Background for Goal

Watch or read the following:

* [Forth Day 2020 - Preview of x64 & ColorForth & SPIR V - Onat](https://youtu.be/ajZAECYdJvE)
* [Metaprogramming VAMP in KYRA, a Next-gen Forth-like language](https://youtu.be/J9U_5tjdegY)
* [Neokineogfx - 4th And Beyond](https://youtu.be/Awkdt30Ruvk)

There are transcripts for each of these videos in the [references](./references/) directory.

## Goal

Learn ColorForth and be able to build a ColorForth derivative from scratch, similar to those built by Timothy Lottes and Onat. First I want to make sure the agent has a full grasp of the background content provided. I can provide more if necessary.

## Some excerpts from discussing things with Grok 4.20 beta

### User

```md
What are the prerequisites for me to understand this completely? I know code such as that conveyed by Ryan Fleury, Jonathan Blow, Casey Muratori, Mike Acton, John Carmack, Tim Sweeney, Eskil Steenberg, Sebastian Aaltonen, etc.

I'm only familiar with x86-64 and MIPS assembly from doing Pikuma and reading some material on assemblers, looking deeply into linkers and writing my own build scripts, and looking into the Portable Executable format on Windows. But I'm not fluent in machine-level code or its nuances. For example, I don't have intuition for the trade-offs of the x86 subset Lottes uses, nor for Onat's, or for how easy it is to extend and support new codegen within their Forth-like derivatives.

They always advertise it as simple and I think there is a true simplicity to it but there is a gulf. Similar to being able to use lua but maybe it would take a while to bootstrap your own lua jit I guess. The biggest hurdle being the frontend they need to actually deal with non-text based code modification or management.

The other thing is I value being able to integrate or use external tooling like the RAD Debugger. That will most likely require me to generate some source file and dynamically populate the data it needs on the fly so I can step-debug the program at runtime... I'm not sure how big of a hurdle that is. Do I even need linkers with this setup?
```

### Grok

```md
**Lottes' x68 (32-bit granularity x86-64 subset)** and **Onat's KYRA codegen style** are both deliberate restrictions on the full x86-64 ISA. They are chosen for **editor simplicity** and **zero-overhead execution**, not raw peak FLOPS. Both deliver **net performance gains** in their target domain (tiny self-contained interactive systems, shaders, live-reloaded tools) because the code stays <100 KB and fits entirely in L1/L2 cache. There is **no notable runtime degradation** for real work — often the opposite.

Here is the exact trade-off breakdown.

### Lottes' x68 — 32-bit aligned opcodes only
Every instruction (and immediate) is forced to 32-bit (4-byte) boundaries by padding.

**How padding works (exact from his talk):**
- Ignored prefixes (e.g. 0x3E DS segment)
- Multi-byte NOPs (0x0F 0x1F ...)
- Extra REX prefixes when not needed
- Example: normal RET (C3) → C3 90 90 90 (padded to 4 bytes)

**Trade-offs:**

| Aspect              | Benefit (why he does it)                              | Cost / Potential downside                          | Real perf impact (his domain) |
|---------------------|-------------------------------------------------------|----------------------------------------------------|-------------------------------|
| **Editor / Sourceless** | Code = pure 32-bit word array. Hex editor, find/replace, auto-relink jumps all trivial. No variable-length decode. | None in practice                                  | Huge win (instant edit)      |
| **Jumps / Calls**   | Relative offsets always 32-bit and aligned to end of opcode → editor can patch without decoding | Slightly larger code                               | Neutral (tiny code)          |
| **Inline data**     | Immediates always at 32-bit boundaries → easy visual data in hex view | —                                                  | Win                          |
| **Code size**       | —                                                     | 20–50% larger due to padding (estimated)          | Negligible (everything in L1) |
| **Decode / uop cache** | —                                                     | Slightly more bytes fetched, possible extra uops  | Negligible on Zen/Intel wide decoders |
| **ISA coverage**    | Still full x64 power for the ops he needs            | Can't use arbitrary unpadable instructions easily | None (he only needs tiny subset) |

**Perf verdict from Lottes:**
He explicitly says the code is so small it all fits in cache, so the padding cost is irrelevant. The **gain** is massive reduction in total system complexity (no assembler, no linker, no text parser). Sourceless mode = true zero-overhead runtime. He views this as **better** than full ISA for interactive work.

### Onat's KYRA style (not a strict opcode subset, but a coding / emission convention)
Full x86-64 base + SSE (without VEX prefixes) + all addressing modes (including RIP-relative). The restriction is in **how** he uses it.

**Core rules (exact from talk + his site):**
- Only **two temporary registers** for the "stack": RAX and RDX + 1 hidden bit ("which is top?").
- Before every call/definition: `xchg rax, rdx` (3 bytes as 48 87 C2; the short form 48 92 is 2 bytes).
- Everything else = global memory accessed via **one base register** (R15 points to a contiguous struct of all variables).
- Every high-level word emits **1–3 real instructions** max (often 1).
- SPIR-V shaders = direct binary emit (no glslang).

**Trade-offs:**

| Aspect                  | Benefit (why Onat does it)                                      | Cost / downside                                      | Real perf impact |
|-------------------------|-----------------------------------------------------------------|------------------------------------------------------|------------------|
| **Runtime stack**       | Zero stack ops, zero frame setup, no juggling                   | Must manage globals manually                         | **Massive gain** (60 ms text/Forth vs 4 ms) |
| **Register usage**      | Only 2 temps + base → trivial templates, 1800-byte compiler    | Occasional spill to global (he says 3 ns, ignore)   | Gain (fewer instructions) |
| **Globals**             | One contiguous block via R15 → perfect cache locality          | Programmer tracks lifetimes (single-threaded)       | Gain (better than stack) |
| **SSE / AVX**           | SSE without VEX (2/3-byte encoding)                             | No VEX/AVX2+ in templates (easy to add later)       | Minor loss if you need AVX, but he doesn't |
| **Dictionary / lookup** | Pure 32-bit index (next version 32-bit index + separate tag byte) | You curate the index space by hand                   | **Huge gain** (no hash, 1–2 ms full compile) |
| **Code size / density** | Extremely dense (single instructions)                           | None                                                 | Win |

**Perf numbers Onat actually shows (VAMP program):**
- Full recompile (editor + Vulkan + FFmpeg + shaders): 4–8 ms on normal hardware.
- Shader compile: <5 ms (mostly driver), ~100× faster than glslang.
- Instruction generation: 180 million instructions/second/core on Ryzen 5900HX.
- Compared to C: "thousand times more time to compile in C" for the same program.

He explicitly calls out the classic Forth stack as the bottleneck: "the difference between using data stack versus using registers only".

### Overall: Do they lose or gain perf?
**Net gain — often by 5-100× in the metrics that matter to them.**

**Gains:**
- Iteration speed (4–8 ms full rebuild vs seconds in C).
- Runtime speed (fewer instructions, perfect locality, no ABI/calling-convention tax).
- Cache behavior (tiny code + contiguous globals = L1 resident forever).
- Debuggability / visibility (live dataflow viz of RAX/RDX + globals).

**Losses (real but tiny):**
- Slightly larger code from Lottes padding.
- Occasional extra global access instead of a 3rd/4th register (Onat).
- Can't blindly copy-paste AVX-heavy code without adjusting templates.

For programs small enough to stay resident in L1/L2 cache (exactly their use-case: editors, shaders, tools, games), the simplicity wins so hard that the "losses" disappear. This is why both say "simple" and mean it: they removed everything that costs more than it gives.

If you build the 60-line emitter I described earlier, you will feel the gain immediately: change one token → re-emit 100 bytes → run again in <1 ms. That is the real performance story.

Want the exact 32-bit x68 padding helper + Onat's xchg + global load/store templates as copy-paste C code next? Just say the word and I'll drop the blocks.
```
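
The padding and template ideas in the excerpt above can be sketched in C. This is a minimal illustration, not code from Lottes or Onat: `CodeBuf`, `x68_emit`, and the `emit_*` helpers are hypothetical names, and the sketch pads with trailing single-byte NOPs (0x90) as in the `C3 90 90 90` example. The talk describes padding with ignored prefixes and multi-byte NOPs so that immediates land at the end of the aligned word, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-capacity code buffer (illustrative, not from either talk). */
typedef struct {
    uint8_t bytes[256];
    size_t  used;       /* stays a multiple of 4 after every x68_emit call */
} CodeBuf;

/* Append one instruction encoding, then pad with 0x90 (NOP) up to the next 4-byte boundary.
   Returns the index of the 32-bit word where the instruction starts. */
static size_t x68_emit(CodeBuf *cb, const uint8_t *op, size_t len)
{
    size_t padded = (len + 3u) & ~(size_t)3u;
    assert(len > 0 && cb->used + padded <= sizeof cb->bytes);
    size_t start = cb->used;
    memcpy(cb->bytes + start, op, len);
    memset(cb->bytes + start + len, 0x90, padded - len);
    cb->used = start + padded;
    return start / 4;
}

/* ret */
static size_t emit_ret(CodeBuf *cb)
{
    const uint8_t op[] = { 0xC3 };
    return x68_emit(cb, op, sizeof op);
}

/* xchg rax, rdx (REX.W 87 /r form, 3 bytes) */
static size_t emit_xchg_rax_rdx(CodeBuf *cb)
{
    const uint8_t op[] = { 0x48, 0x87, 0xC2 };
    return x68_emit(cb, op, sizeof op);
}

/* mov rax, [r15 + disp32]: load a global from the block addressed through R15.
   The displacement is written little-endian; only non-negative offsets are accepted here. */
static size_t emit_load_rax_global(CodeBuf *cb, uint32_t disp)
{
    assert(disp <= 0x7FFFFFFFu);
    const uint8_t op[] = {
        0x49, 0x8B, 0x87,
        (uint8_t)disp, (uint8_t)(disp >> 8), (uint8_t)(disp >> 16), (uint8_t)(disp >> 24)
    };
    return x68_emit(cb, op, sizeof op);
}
```

Because every emitted instruction starts on a word boundary, the returned word index can double as the address an editor would show or patch, which is the property the excerpt attributes to the 32-bit layout.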

## Visual Context Synthesis & Color Semantics

Based on the extracted frame OCR data from the references (Lottes' and Onat's presentations), here is the persistent mapping of ColorForth visual semantics to language logic for this project:

- **Red (`<RED>`):** Defines a new word or symbol in the dictionary. This is the entry point for compilation.
- **Green (`<GREEN>`):** Compiles a word into the current definition.
- **Yellow/Orange (`<YELLOW>` / `<ORANGE>`):** Executes a word immediately. Often used for macros or immediate mode evaluation within the editor.
- **Cyan/Blue (`<CYAN>` / `<BLUE>`):** Used for variables, memory addresses, or formatting layout (not executable instruction logic).
- **White/Dim (`<WHITE>` / `<DIM>`):** Comments, annotations, and UI elements.
- **Magenta (`<MAGENTA>`):** Typically used for pointers or state modifiers.
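
One way to carry these colors in a 32-bit token is sketched below. The layout is an assumption made for illustration only: a 4-bit tag in the low bits and a 28-bit dictionary index above it. The references establish that tokens are 32-bit and tagged, but not this exact split, and the excerpt above mentions that a later KYRA version keeps the tag in a separate byte. The `ColorTag` values and `token_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tag values following the color list above. */
typedef enum {
    TAG_DEFINE  = 0,  /* red: define a new word */
    TAG_COMPILE = 1,  /* green: compile into the current definition */
    TAG_EXECUTE = 2,  /* yellow/orange: execute immediately */
    TAG_DATA    = 3,  /* cyan/blue: variables, addresses, layout */
    TAG_COMMENT = 4,  /* white/dim: comments and annotations */
    TAG_POINTER = 5,  /* magenta: pointers or state modifiers */
    TAG_COUNT
} ColorTag;

#define TAG_BITS  4u
#define TAG_MASK  ((1u << TAG_BITS) - 1u)
#define MAX_INDEX ((1u << (32u - TAG_BITS)) - 1u)

/* Pack a tag and a dictionary index into one 32-bit token (assumed layout). */
static uint32_t token_pack(ColorTag tag, uint32_t index)
{
    assert((unsigned)tag < TAG_COUNT && index <= MAX_INDEX);
    return (index << TAG_BITS) | (uint32_t)tag;
}

static ColorTag token_tag(uint32_t tok)   { return (ColorTag)(tok & TAG_MASK); }
static uint32_t token_index(uint32_t tok) { return tok >> TAG_BITS; }

/* Color the editor would draw this token in; the same lookup serves editor and compiler. */
static const char *token_color(uint32_t tok)
{
    static const char *const names[TAG_COUNT] = {
        "red", "green", "yellow", "cyan", "white", "magenta"
    };
    uint32_t tag = tok & TAG_MASK;
    return tag < TAG_COUNT ? names[tag] : "unknown";
}
```

With a layout like this, the editor colorizes a token and the compiler dispatches on it from the same low bits, so neither has to parse text.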

**Architectural Notes Extracted:**
1. **Sourceless Environment:** The underlying system doesn't deal with parsing strings. It deals with 32-bit tagged tokens (as noted in Lottes' 32-bit x68 alignment).
2. **Visual Editor:** The editor is intrinsically tied to the compiler. It reads the same memory structure. It uses these color properties to colorize the tokens live.
3. **Hardware Locality:** We see a major focus on removing the stack in favor of register rotation (`RAX`, `RDX`) as per Onat's methodology.
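
The register-rotation idea in note 3 can be modeled in plain C before any machine code is emitted. The sketch below is a simulation under assumptions, not Onat's implementation: `Machine`, the `op_*` helpers, and `add_globals` are hypothetical names, the `globals` array stands in for the block addressed through `R15`, and the exact points where KYRA inserts the exchange are not reproduced.

```c
#include <assert.h>
#include <stdint.h>

enum { GLOBAL_SLOTS = 16 };

/* Hypothetical model of the two-register "stack" plus one global block. */
typedef struct {
    uint64_t rax;                    /* current top item */
    uint64_t rdx;                    /* the other item */
    uint64_t globals[GLOBAL_SLOTS];  /* stands in for memory addressed via R15 */
} Machine;

/* Models `xchg rax, rdx`: rotates which value is on top. */
static void op_xchg(Machine *m)
{
    uint64_t t = m->rax;
    m->rax = m->rdx;
    m->rdx = t;
}

/* Models `mov rax, [r15 + slot*8]`. */
static void op_load(Machine *m, unsigned slot)
{
    assert(slot < GLOBAL_SLOTS);
    m->rax = m->globals[slot];
}

/* Models `mov [r15 + slot*8], rax`. */
static void op_store(Machine *m, unsigned slot)
{
    assert(slot < GLOBAL_SLOTS);
    m->globals[slot] = m->rax;
}

/* Models `add rax, rdx`. */
static void op_add(Machine *m) { m->rax += m->rdx; }

/* globals[dst] = globals[a] + globals[b], using only the two registers and no memory stack. */
static void add_globals(Machine *m, unsigned a, unsigned b, unsigned dst)
{
    op_load(m, a);   /* rax = a */
    op_xchg(m);      /* rdx = a, rax free for the next value */
    op_load(m, b);   /* rax = b */
    op_add(m);       /* rax = a + b */
    op_store(m, dst);
}
```

Each `op_*` helper corresponds to a single instruction, which mirrors the claim in the excerpt that a high-level word emits only one to three real instructions.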