2026-02-20 21:25:46 -05:00
parent b3984970a8
commit 0d96c85012
2 changed files with 78 additions and 70 deletions

@@ -1,86 +1,92 @@
# Advanced Source-Less Programming & JIT Architecture: A Hardcore Technical Study
# In-Depth Chronological Breakdown of Source-Less Programming Reference Videos
This document contains a deep-dive technical extraction of the mechanics, JIT compiler optimizations, and paradigms presented by Timothy Lottes and Onat Türkçüoğlu. These notes surpass high-level theory, detailing the exact x86-64 assembly generation rules, state-tracking mechanisms, and memory layouts required to implement a zero-overhead, source-less Forth environment.
This document provides an exhaustive, highly detailed chronological paraphrase of the technical specifics, screen visuals, and mechanical explanations provided by Timothy Lottes and Onat Türkçüoğlu.
---
## 1. The Lottes "x68" Paradigm: Editor as the OS
## 1. "Forth Day 2020 - Preview of x64 & ColorForth & SPIR V" (Onat, 2020)
Lottes's approach fundamentally transforms the editor into a live, dynamic linker and machine-code orchestrator.
**0:00 - 3:00 | Introduction & The Editor Visuals**
Onat introduces his 1-month-old iteration of Forth, inspired by ColorForth.
* **Screen Details:** A custom 3-pane UI rendered in C and Vulkan. Left/center panes show block-based colored tokens; the right pane displays live x64 assembly output that updates instantly as he edits.
* The editor treats code blocks as tracked state objects, supporting native undo/redo.
### 1.1 The Lexical Grid and 32-Bit Instruction Granularity
In x68, the runtime contains *no parsing logic*. Code is a flat array of 32-bit tokens (4-bit tag, 28-bit payload).
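
The token layout described above can be sketched as a pair of pack/unpack helpers. This is only an illustration: the notes give the field widths (4-bit tag, 28-bit payload) but not the bit positions, so placing the tag in the top four bits is an assumption, and `pack_token` / `unpack_token` are hypothetical names.

```python
TAG_BITS = 4
PAYLOAD_BITS = 28
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1


def pack_token(tag: int, payload: int) -> int:
    """Combine a 4-bit tag and a 28-bit payload into one 32-bit token.

    Assumption: the tag occupies the top 4 bits; the notes do not say where it sits.
    """
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("tag must fit in 4 bits")
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload must fit in 28 bits")
    return (tag << PAYLOAD_BITS) | payload


def unpack_token(token: int) -> tuple[int, int]:
    """Split a 32-bit token back into (tag, payload)."""
    return token >> PAYLOAD_BITS, token & PAYLOAD_MASK
```

Because every token has the same width, the editor can index the source as a flat array and never needs to scan for delimiters.
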
To make the x86-64 architecture fit this visual editor grid, Lottes forces all generated machine code to 32-bit boundaries:
* **Instruction Padding:** Native instructions that are smaller than 4 bytes are padded.
* *Example:* `RET` (`0xC3`) becomes `C3 90 90 90` (using three `NOP`s).
* *Example:* `MOV` or `ADD` can use ignored segment overrides (like the `3E` DS prefix) or unnecessary `REX` prefixes to reach exactly 4 bytes.
* **Auto-Relinking:** The editor implicitly acts as a linker. Because every instruction is 32 bits, 32-bit RIP-relative offsets for `CALL` (`E8`) and `JMP` (`E9`) are perfectly aligned. When the user inserts or deletes a token in the editor, the editor instantly recalculates and updates the raw binary relative offsets for all branch instructions.
* **Shorthand Assembly UI:** The editor can decode these 32-bit blocks and display human-readable macro-assembly, e.g., mapping `add rcx, qword ptr [rdx + 0x8]` to the visual string `h + at i08`.
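
A minimal sketch of the padding and relinking ideas above, under stated assumptions. It only models trailing `NOP` padding (not the prefix-based padding with `3E` or `REX`), and the helper names `pad_to_cell` and `rel32` are hypothetical. The one fixed fact it relies on is that an x86 `rel32` displacement is measured from the address immediately after the 4-byte displacement field.

```python
import struct

NOP = 0x90


def pad_to_cell(code: bytes, cell: int = 4) -> bytes:
    """Pad an instruction with NOPs up to the next multiple of `cell` bytes."""
    remainder = len(code) % cell
    if remainder == 0:
        return code
    return code + bytes([NOP]) * (cell - remainder)


def rel32(disp_field_addr: int, target_addr: int) -> bytes:
    """Encode the displacement for CALL (E8) / JMP (E9).

    The CPU adds the displacement to the address of the next instruction,
    which begins right after the 4-byte displacement field.
    """
    return struct.pack("<i", target_addr - (disp_field_addr + 4))
```

When a token is inserted or deleted, an editor following this scheme would shift the affected addresses by a multiple of four and call something like `rel32` again for every branch whose source or target moved.
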
**3:00 - 6:00 | O(1) Dictionary Lookup & Execution Tracing**
* To avoid hashing, his compiler allocates an extra 4 bytes per character strictly to store the *source memory location* of the currently compiled word.
* **Visual Feature:** "Jump to Definition" and an "Execution Trace" overlay. He demonstrates invoking a command that instantly numbers every occurrence of a word across the codebase in the exact chronological order of execution, providing a "compile-time call graph" without running the program.
### 1.2 ColorForth Semantic Tags & The State Machine
The 4-bit color tag dictates how the editor/JIT interprets the 28-bit payload:
* **White (Ignored):** Comments, formatting, or skipped words.
* **Yellow (Immediate Execution):**
* If a number: Append it to the data stack *during edit/compile time*.
* If a word: Look it up in the dictionary and execute its associated code *immediately*.
* **Red (Define):** Sets a word in the dictionary to point to the current compilation address (or TOS).
* **Green (Compile):**
* If a number: Emits machine code to push that number (e.g., `mov rax, imm`).
* If a word: Looks it up in the *Macro* dictionary; if found, calls it (code generation). Otherwise, looks it up in the *Forth* dictionary and emits a `CALL` to it.
* **Cyan/Blue (Defer Execution):** Looks up a word in the macro dictionary and appends a call to it. Used for macros that generate other macros.
* **Magenta (Variable/Pointer):** Sets the dictionary value to point to the *next source token* in memory.
* **The Transition Trigger:** A transition from Yellow (Execution) to Green (Compilation) causes the JIT to pop the current Top of Stack and emit a native machine-code instruction to push that value. (i.e., "Turning a computed number back into a program").
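
The tag rules above can be sketched as a small loop, with heavy simplification. Everything here is illustrative: colors are plain strings, "machine code" is a list of tuples, a Yellow word is a Python callable standing in for a dictionary lookup, and treating White tokens as invisible to the transition check is an assumption. Red, Cyan, and Magenta are left out.

```python
def mul(stack):
    """Hypothetical immediate word: multiply the top two stack items."""
    b = stack.pop()
    a = stack.pop()
    stack.append(a * b)


def compile_tokens(tokens):
    """tokens: list of (color, value). Returns (emitted ops, compile-time stack)."""
    stack = []        # data stack used while editing/compiling
    emitted = []      # stand-in for generated machine code
    prev_color = None
    for color, value in tokens:
        if color == "white":
            continue  # ignored; assumed not to affect the transition check
        if color == "green" and prev_color == "yellow" and stack:
            # Yellow -> Green transition: the computed value becomes a literal.
            emitted.append(("push_literal", stack.pop()))
        if color == "yellow":
            if isinstance(value, int):
                stack.append(value)   # number: push at compile time
            else:
                value(stack)          # word: execute immediately
        elif color == "green":
            if isinstance(value, int):
                emitted.append(("push_literal", value))
            else:
                emitted.append(("call", value))
        prev_color = color
    return emitted, stack
```

For example, the Yellow sequence `1024 4 *` followed by a Green word leaves a single literal `4096` in the generated program, followed by the call.
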
### 1.3 The 5-Byte Folded Interpreter
To eliminate the massive pipeline stall (branch misprediction) caused by a standard `NEXT` instruction in threaded-code interpreters, Lottes suggests embedding a micro-interpreter at the *end of every word*:
1. **`LODSD` (1 byte or 2 bytes with REX):** Loads the next 32-bit token from `RSI` (the instruction pointer) into `EAX`/`RAX` and increments `RSI`.
2. **Lookup (2 bytes):** Uses a highly optimized hash or direct mapping to translate the token payload to a memory address.
3. **Jump (2 bytes):** Emits an indirect jump (e.g., `JMP RAX`).
*Result:* Every word transition has its own dedicated branch predictor slot in the CPU hardware, reducing average clock stalls from ~16 to near 0.
**6:00 - 11:00 | The High-Level x64 Macro Assembler & SPIR-V**
* **Screen Details:** Syntax like `AX to BX` or `CX + offset`. Toggling a "direction register" macro changes `from AX to BX register, let's move an unsigned` into a 32-bit `mov ebx, eax`. Modifiers like `long` emit 64-bit `mov rbx, rax`.
* He uses this same macro-assembler to generate SPIR-V. He notes x64 was actually less complicated than SPIR-V because x64 is a flat instruction stream, whereas SPIR-V requires strict sections, type declarations, and capabilities, forcing him to introduce "sections" into his JIT.
---
## 2. Onat's VAMP / KYRA: High-Performance Macro-Assembler
## 2. "4th And Beyond" (Timothy Lottes, NeoKineoGfx, 2026)
Onat's implementation provides a masterclass in eliminating the Forth data stack and leveraging x86-64 hardware registers optimally.
**0:00 - 8:00 | HP48 Evolution & ColorForth Mechanics**
* Lottes advocates removing compilers, linkers, and debuggers. He starts with HP48's RPN as the baseline.
* **Screen Details:** He defines a red word `4K` pointing to the next item on the data stack. Typing `1024 4 *` computes `4096`. `4K` acts as a variable.
* He defines `DROP` pointing to `add esi, -4` and `ret`. `4K 1 2 + DROP` yields 4096.
* He reviews ColorForth: Code compiles onto the data stack. Yellow = Execute, Red = Define, Green = Compile, Magenta = Variable. A Yellow-to-Green transition pops the stack and emits a `push` instruction.
* **Screen Details:** Disassembly of Block 24/26 shows `168B 2 , C28B0689 ,`. This pushes bytes onto the stack, disassembling to `mov edx, dword ptr [esi]` and `mov dword ptr [esi], eax` (literally byte-banging machine code).
### 2.1 The 2-Register Stack & JIT State Tracking
Traditional Forth maintains a data stack in RAM, requiring constant memory loads/stores. Onat eliminates this:
* **The Stack is `RAX` and `RDX`.** No memory is used for parameter passing.
* **The 1-Bit JIT Optimizer:** The JIT compiler maintains a single bit of state: `is_rax_tos` (Is RAX currently the Top of Stack?).
* **Smart Compilation:**
* If the user types a Cyan number (Immediate), the JIT checks `is_rax_tos`. If true, it emits `mov rax, imm`. If false, it emits `mov rdx, imm`.
* Before compiling a `CALL`, the JIT knows which register the target function expects the TOS to be in. If the current JIT state mismatches the target's expectation, it automatically emits the 3-byte `xchg rax, rdx` (`48 87 C2`) instruction *before* the call.
* This makes operations like `SWAP` virtually free—they often just flip the compiler's internal `is_rax_tos` boolean without emitting any machine code.
* **Function Prologue/Epilogue:** Functions do not push/pop to a return stack in memory manually; they rely purely on the native x86 `call` and `ret` instructions utilizing `RSP` purely as a call stack.
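
The one-bit state tracking above can be sketched as follows. This is an illustration, not VAMP's code: `TwoRegJit` and its methods are hypothetical names, the code buffer is assumed to start at address 0, and the literal rule is taken exactly as stated in the list above. The byte encodings used (`48 B8` / `48 BA` for `mov rax/rdx, imm64`, `48 87 C2` for the exchange, `E8` for `call rel32`) are standard x86-64.

```python
import struct

XCHG_RAX_RDX = bytes.fromhex("4887C2")  # 3-byte encoding quoted in the notes


class TwoRegJit:
    def __init__(self):
        self.code = bytearray()  # assumed to be loaded at address 0
        self.is_rax_tos = True   # the single bit of compiler state

    def literal(self, value: int) -> None:
        # Rule as stated above: RAX if it is currently TOS, otherwise RDX.
        opcode = 0xB8 if self.is_rax_tos else 0xBA
        self.code += bytes([0x48, opcode]) + struct.pack("<Q", value & (2**64 - 1))

    def swap(self) -> None:
        # SWAP emits nothing; it only flips the compiler's view of the registers.
        self.is_rax_tos = not self.is_rax_tos

    def call(self, target_addr: int, target_wants_rax_tos: bool = True) -> None:
        if self.is_rax_tos != target_wants_rax_tos:
            self.code += XCHG_RAX_RDX
            self.is_rax_tos = target_wants_rax_tos
        end_of_call = len(self.code) + 5  # E8 plus a 4-byte displacement
        self.code += b"\xE8" + struct.pack("<i", target_addr - end_of_call)
```

The design point is that `swap` costs zero bytes until a call site actually disagrees with the tracked state.
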
**8:00 - 20:00 | Branch Misprediction, Folded Interpreter, & x68**
* Standard Forth causes 16-clock branch misprediction stalls due to tag branching.
* **The Folded Interpreter:** Lottes fixes this by folding a 5-byte interpreter into the end of every word: `LODSD`, lookup, `JMP RAX`. Every transition gets its own branch predictor slot.
* **x68 Architecture:** Forces all instructions to 32-bit boundaries. `RET` (`C3`) is padded with three `NOP`s (`90 90 90`). `MOV ESI, imm32` is padded with a `3E` ignored DS prefix.
* This makes relative offsets (`CALL`, `JMP`) align perfectly. The editor auto-relinks offsets as tokens are inserted/deleted.
* **Assembly Shorthand:** Editor maps `add rcx, qword ptr [rdx + 0x8]` to visual `h + at i08`.
### 2.2 Global Preemptive Scatter (The "Tape Drive")
Because the data stack is limited to two items, passing deep context is impossible.
* **Global Single-Register Base:** A single x86 register (e.g., `R12` or `R15`) is dedicated globally as the base pointer for all application memory (giving "gigabytes of state").
* **Colors map to memory operations:**
* **Green Tag (Read):** Emits `mov REG, [base_ptr + token_offset]`.
* **Red Tag (Write):** Emits `mov [base_ptr + token_offset], REG`.
* **FFI (Foreign Function Interface):** To call complex OS APIs (like Vulkan `VkImageCreateInfo`), VAMP does not use C-struct bindings. It manually calculates byte-offsets from the global base, emits instructions to write the struct data inline, aligns `RSP` for the OS calling convention, and calls the dynamic library pointer.
**20:00 - End | Live Execution (SteamOS/Linux)**
* Lottes targets a mix of high-level JIT and raw x68 sourceless.
* **Cartridge execution:** The binary copies itself to `cart.back`, maps into memory at a fixed address (bypassing ASLR), and provides a zero-fill space. 32-bit tokens act as direct absolute memory pointers, removing lookup overhead.
### 2.3 Lexical Syntax and Color Semantics
Onat uses a 24-bit dictionary index + 8-bit color tag. The semantics map directly to JIT actions:
* **Magenta Pipe (`|`):** Defines the boundary of a function. The JIT encounters this, emits a `RET` (`C3`) to close the previous function, and records the current instruction pointer as the start address of the new function.
* **White (Call):** Emits a relative `CALL` to the target. (If jumping to a dynamic address already in a register, it optimizes to `JMP RAX`).
* **Yellow (Macro):** Executes the attached code *during JIT compilation*. Used for compiler directives, setting layouts, or emitting specialized instructions like `LOCK` prefixes.
* **Blue (Comment):** Ignored by the JIT pointer entirely.
---
### 2.4 Control Flow without ASTs
VAMP abandons standard `IF/ELSE/THEN` parsing trees in favor of assembly-level basic blocks and lambdas.
* **Lambdas `{ }`:** Defining a lambda simply compiles the block of code elsewhere and leaves its executable memory address on the stack (`RAX` or `RDX`).
* **Conditionals via Global State:**
1. A comparison (e.g., `>`) is executed.
2. The result is written to a dedicated global variable (e.g., `condition` using a Red tag).
3. The conditional jump word reads the `condition` variable, consumes the lambda's address from the stack, and emits `CMP condition, 0` followed by `JZ lambda_address`.
* **Basic Blocks `[ ]`:** These constrain the scope of assembly generation. If a conditional within a block passes, execution falls through. If it fails, it jumps to the nearest closing `]`.
## 3. "Metaprogramming VAMP in KYRA" (Onat, SVFIG, 2025-04-26)
### 2.5 Live Debugging via Instruction Injection
The most powerful UX feature of VAMP is its real-time data flow visualization.
* The editor tracks the user's cursor position.
* During JIT compilation, if the `compiler_instruction_ptr` equals the `editor_cursor_ptr`, the JIT injects a debug macro.
* This macro emits instructions to copy the current state of `RAX` and `RDX` (the entire data stack) into a global circular buffer.
* The UI reads this buffer, instantly displaying the exact runtime state of the program at the cursor's location, acting as an instant, zero-cost `printf`.
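
A toy model of the cursor-triggered probe described above. All names are hypothetical, words are Python callables over an `(rax, rdx)` pair rather than machine code, and injecting the probe after the word under the cursor (rather than before it) is an assumption.

```python
from collections import deque


def compile_with_probe(tokens, cursor_index):
    """Emit one op per token, plus a recording op at the cursor position."""
    ops = []
    for index, token in enumerate(tokens):
        ops.append(("word", token))
        if index == cursor_index:
            ops.append(("record_rax_rdx",))  # injected debug macro
    return ops


def run(ops, words, trace_size=8):
    """Execute ops; return the circular trace buffer the UI would read."""
    rax, rdx = 0, 0
    trace = deque(maxlen=trace_size)  # oldest entries fall off automatically
    for op in ops:
        if op[0] == "word":
            rax, rdx = words[op[1]](rax, rdx)
        else:
            trace.append((rax, rdx))
    return list(trace)
```

Moving the cursor only changes which index gets the extra op, so the program is recompiled with the probe in a new place instead of being stepped in a debugger.
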
This presentation contains the most explicit, hardcore low-level details regarding Onat's binary-encoded compiler (VAMP).
**0:00 - 10:00 | The Binary Editor, Compilation Speed, & The 2-Item Stack**
* VAMP compiles the entire program (Vulkan renderers, UI) in **8.24 milliseconds** on Windows/Linux. His previous text-based Forth took 16-17.8ms just to compile the editor.
* **Hardware Locality & The Stack:** Traditional Forth is "runtime opinionated" with a memory data stack, making GPU compute shaders difficult. Onat strictly restricts the stack to two CPU registers: **`RAX` and `RDX`**.
* **Screen Details:** The stack state is constantly visualized in the top left corner.
* **Magenta Pipe `|`:** There are no `begin` or `end` definition words. A magenta pipe token implicitly signals the end of the previous definition (compiling a `ret`) and starts the new one. Spaces between words imply execution.
**10:00 - 18:00 | Dictionary Management, UX, & Indexing**
* **Dictionary Encoding:** Words are stored as 24-bit indices pointing to 8-byte cells, packed with an 8-bit color tag. (He notes the next iteration will use 32-bit indices + a separate 1-byte tag block for faster skipping of empty blocks).
* This pure index mapping eliminates hashing and string parsing. It allows IP-protection: you can ship the source indices without the symbols/dictionary. Core language is just 2 to 4 KB.
* **Screen Details:** Words are organized explicitly into 16-word horizontal "scrolls" (e.g., "Vulkan API", "FFMPEG", "x64 Assembly"). He presses `Ctrl+Space` to manually redefine a word in a specific scroll.
* **Comments:** A comment (Blue tag) is encoded as a string directly inside the 24-bit payload (3 characters).
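
The 24-bit index plus 8-bit tag cell, and the three-character comment payload, can be sketched like this. The bit positions, the tag numbers, the byte order of the comment characters, and the space padding are all assumptions; the notes only give the field widths.

```python
INDEX_BITS = 24
INDEX_MASK = (1 << INDEX_BITS) - 1


def pack_cell(tag: int, index: int) -> int:
    """Pack an 8-bit color tag and a 24-bit dictionary index into 32 bits."""
    if not 0 <= tag <= 0xFF or not 0 <= index <= INDEX_MASK:
        raise ValueError("tag must fit in 8 bits and index in 24 bits")
    return (tag << INDEX_BITS) | index  # assumption: tag in the top byte


def unpack_cell(cell: int) -> tuple[int, int]:
    return cell >> INDEX_BITS, cell & INDEX_MASK


def pack_comment(tag: int, text: str) -> int:
    """Store up to three ASCII characters directly in the 24-bit payload."""
    raw = text.encode("ascii")
    if len(raw) > 3:
        raise ValueError("a comment cell holds at most 3 characters")
    return pack_cell(tag, int.from_bytes(raw.ljust(3, b" "), "little"))


def unpack_comment(cell: int) -> tuple[int, str]:
    tag, payload = unpack_cell(cell)
    return tag, payload.to_bytes(3, "little").decode("ascii")
```

Longer comments would then span several consecutive comment cells, three characters at a time.
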
**18:00 - 28:00 | Data Flow Visualization & Global Memory**
* **Free Printf:** Hovering over a word injects code to record `RAX` and `RDX`. Pressing Previous/Next steps through the execution flow visually.
* **Global Variables vs. Stacks:** To pass complex state (since the stack only holds two items), he relies entirely on global memory. He explicitly critiques Rust's "safe programming" for forcing developers to pass state through 30 layers of call stacks.
* **Single-Register Memory Access:** He dedicates a single CPU register to act as the base pointer for all program memory, giving instant access to "gigabytes of state".
**28:00 - 45:00 | Syntax, Tags, and JIT Assembly Mechanics**
* He demonstrates compiling Vulkan commands. Instead of typing `vkGetSwapchainImagesKHR`, he defines a word `get swap chain images` in the `vk device` scroll.
* **The `xchg` Trick (`48 92`):** Because the stack is just `RAX` and `RDX`, keeping `RAX` as the Top of Stack is vital. He explicitly notes that `xchg rax, rdx` compiles to just two bytes: `48 92` (REX.W + xchg eax, edx). Before starting a definition or making a call, the JIT emits `48 92` to ensure `RAX` is correctly aligned as the top.
* **Color Semantics:**
* **White (Call):** Emits a `CALL` or `JMP RAX` (e.g., `FFE0`).
* **Green (Load):** Emits `mov rax, [global_offset]`.
* **Red (Store):** Emits `mov [global_offset], rax`.
* **Yellow (Immediate/Execute):** Used heavily. For a number, emits `mov rax, imm`. Also used to invoke a lambda block.
* **Blue (Comment):** Ignored.
* **Cyan (Number):** Data literal.
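
The two-byte figure for the exchange follows from the short `XCHG RAX, r64` form, which is a `REX.W` prefix followed by `90 + rd`. A minimal sketch of that encoding, with hypothetical helper names, restricted to the low registers that need no extra REX bits:

```python
REX_W = 0x48
RDX = 2  # register number of RDX in x86-64 encodings

JMP_RAX = bytes([0xFF, 0xE0])  # indirect jump through RAX, as quoted above


def xchg_rax_with(reg_number: int) -> bytes:
    """Encode the short form `xchg rax, r64` for register numbers 1..7.

    Register 0 is excluded because that byte pattern (48 90) is a NOP.
    """
    if not 1 <= reg_number <= 7:
        raise ValueError("only RCX..RDI (1..7) are handled in this sketch")
    return bytes([REX_W, 0x90 + reg_number])
```

Calling `xchg_rax_with(RDX)` produces the bytes `48 92`.
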
**45:00 - 55:00 | Lambdas `{ }` & Basic Blocks `[ ]`**
* He explicitly eliminates `if/else` ASTs.
* **Lambdas `{ }`:** Defining a lambda block (Yellow `{`) does not execute it. It compiles the block elsewhere and leaves its executable memory address in `RAX`.
* **Basic Blocks `[ ]`:** These define a constrained range of assembly with implicit begin, link, and end jump targets.
* **Conditionals in Blocks:** He shows checking `if luma > 0.6`. He explicitly creates a `condition` variable (e.g., `26E`). The `>` operator consumes the values and writes the boolean to `condition`. The conditional word then reads `condition` and consumes the lambda address from `RAX`, emitting a `cmp condition, 0` and `jz lambda_address`.
**55:00 - 1:10:00 | FFI, Stack Pointers, and OS Interop**
* **`RSP` Alignment:** The hardware stack pointer (`RSP`) is exclusively used for the call stack, eliminating buffer overflows. When calling OS APIs (like FFMPEG), he explicitly reads `RSP` into a variable to align it to 16 bytes (required by C ABI), makes the call, and restores it.
* **Filling Structs:** For `VkImageCreateInfo`, he uses a temporary variable `$` (Dollar sign). He doesn't use C headers. He knows `14` is the Type ID, manually pushing offsets into the contiguous memory space (e.g., `info + offset`).
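
Both points above can be illustrated with plain integer and byte arithmetic. This is a sketch under stated assumptions: the global memory is modeled as a `bytearray`, the function names are hypothetical, and the field offsets are those of the first four `VkImageCreateInfo` members on a typical 64-bit ABI (`sType` at 0, `pNext` at 8, `flags` at 16, `imageType` at 20). Only those four fields are written.

```python
import struct

VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO = 14  # the "Type ID" mentioned above
VK_IMAGE_TYPE_2D = 1


def align_down_16(rsp: int) -> int:
    """Round a stack pointer down to a 16-byte boundary (like `and rsp, -16`)."""
    return rsp & ~0xF


def fill_image_create_info(memory: bytearray, info_offset: int) -> None:
    """Write the leading fields of a VkImageCreateInfo at `info + offset`."""
    struct.pack_into("<I", memory, info_offset + 0, VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO)
    struct.pack_into("<Q", memory, info_offset + 8, 0)    # pNext = NULL
    struct.pack_into("<I", memory, info_offset + 16, 0)   # flags
    struct.pack_into("<I", memory, info_offset + 20, VK_IMAGE_TYPE_2D)
```

Working from raw offsets like this trades the safety of a header-derived layout for zero dependence on a C toolchain, which matches the approach described in the notes.
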
**1:10:00 - End | SPIR-V, Bug Triage, and Implicit Registers**
* **SPIR-V Generation:** VAMP directly emits SPIR-V. He shows the spec (Opcode 194 is Shift Right Logical) and demonstrates a 4-line definition that writes exactly `194` and its operands into a binary vector, replacing a 100MB `glslang` compiler with ~256KB of VAMP code.
* **Bug Triage:** He does not use tests or asserts. He triages bugs by commenting out blocks of code (disabling them) and hitting compile (8ms) until the crash stops.
* **Implicit Register Passing:** He shows UI hover logic where the `slot ID` is implicitly passed in register `R12D` across functions, completely avoiding pushing it to the data stack.
* **Lock Prefix:** Writing concurrent code is handled by the macro assembler. Placing the word `lock` before an `inc` token simply emits the `F0` prefix byte.
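
To make the SPIR-V bullet above concrete: a SPIR-V instruction is a sequence of 32-bit words whose first word holds the total word count in the high 16 bits and the opcode in the low 16 bits. The sketch below encodes opcode 194 with its four operands (result type, result id, base, shift). The helper names and the ids in the usage example are hypothetical, and this is not VAMP's definition.

```python
import struct

OP_SHIFT_RIGHT_LOGICAL = 194


def spirv_instruction(opcode: int, *operands: int) -> list[int]:
    """Build one instruction: (word_count << 16 | opcode) followed by operands."""
    word_count = 1 + len(operands)
    return [(word_count << 16) | opcode, *operands]


def shift_right_logical(result_type_id: int, result_id: int,
                        base_id: int, shift_id: int) -> list[int]:
    return spirv_instruction(OP_SHIFT_RIGHT_LOGICAL,
                             result_type_id, result_id, base_id, shift_id)


def to_bytes(words: list[int]) -> bytes:
    """Serialize words as little-endian 32-bit values."""
    return struct.pack(f"<{len(words)}I", *words)
```

For example, `shift_right_logical(6, 20, 18, 19)` yields five words, the first of which is `0x000500C2`.
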