In-Depth Analysis: Metaprogramming KYRA in KYRA (Onat Türkçüoğlu)
This document provides a comprehensive synthesis of the "Metaprogramming KYRA in KYRA" presentation given by Onat Türkçüoğlu at the Silicon Valley Forth Interest Group (SVFIG) on April 26, 2025. It integrates insights from the video transcript and the extensive OCR analysis of his visual editor.
This presentation is the most explicit, hardcore low-level deep dive into Onat's binary-encoded compiler (KYRA) and serves as the definitive mechanical blueprint for our bootslop project.
1. Performance and "Runtime-Opinionated" Languages
Onat's primary critique of traditional Forth (and languages like C or Rust) is that they are "runtime opinionated." Standard Forth dictates a memory-based data stack and return stack. This makes it fundamentally incompatible with environments like GPU compute shaders.
- Compilation Speed: KYRA compiles its entire program (including a custom editor, Vulkan renderers, and FFMPEG integrations) in 8.24 milliseconds natively on Windows/Linux.
- The 2-Item Hardware Stack: To achieve hardware locality and GPU compatibility, KYRA strictly restricts the data stack to exactly two CPU registers: `RAX` (Top of Stack) and `RDX` (Next on Stack).
- Zero Stack Overhead: By having no memory data stack, KYRA eliminates the push/pop overhead that plagues standard Forth implementations.
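
The two-register stack can be modeled in plain C to see what a second "push" costs. This is a minimal sketch, not KYRA's code: the struct, the function names, and the "swap then load" push idiom are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two-item data stack. The names are
   illustrative and do not come from the KYRA source. */
typedef struct {
    uint64_t rax; /* top of stack  */
    uint64_t rdx; /* next on stack */
} TwoRegStack;

/* Models `xchg rax, rdx`: swap top and next. */
static void stack_xchg(TwoRegStack *s) {
    assert(s != 0);
    uint64_t tmp = s->rax;
    s->rax = s->rdx;
    s->rdx = tmp;
}

/* Models `mov rax, imm`: overwrite the top of stack. */
static void stack_lit(TwoRegStack *s, uint64_t imm) {
    assert(s != 0);
    s->rax = imm;
}

/* Assumed "push" idiom: swap first so the old top survives in RDX,
   then load the new value. The old RDX is lost because there is no
   third slot and no memory stack behind the registers. */
static void stack_push(TwoRegStack *s, uint64_t imm) {
    stack_xchg(s);
    stack_lit(s, imm);
}
```

Pushing 1, 2, 3 in sequence leaves 3 in `rax` and 2 in `rdx`; the 1 is gone, which is why anything longer-lived has to be stored at a global offset instead.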
2. The Mechanics of the KYRA Emitter
KYRA is not an interpreter; it is a high-level macro assembler that generates direct x86-64 machine code via JIT compilation.
The xchg Trick (The Magenta Pipe |)
- Because the stack is just `RAX` and `RDX`, ensuring `RAX` is the active "Top of Stack" before executing a word is vital.
- The `xchg rax, rdx` instruction compiles to a tiny 2-byte encoding: `48 92` (a REX.W prefix followed by the one-byte opcode).
- Definitions: There are no `begin` or `end` words. A magenta pipe token (`|`) implicitly signals the start of a new definition. The JIT reacts to this by:
  - Emitting a `RET` (`C3`) to close the previous definition.
  - Emitting `48 92` (`xchg rax, rdx`) to put the two stack registers in the expected order for the new definition.
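
The definition boundary described above comes down to three bytes. A minimal sketch in C, assuming a fixed-size code buffer (the `CodeBuf` type and function names are hypothetical, not taken from KYRA):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical code buffer; the struct and function names are illustrative. */
typedef struct {
    uint8_t bytes[64];
    size_t  len;
} CodeBuf;

static void emit_byte(CodeBuf *cb, uint8_t b) {
    assert(cb->len < sizeof cb->bytes); /* this sketch has no growth policy */
    cb->bytes[cb->len++] = b;
}

/* Definition boundary: close the previous definition, then swap
   RAX and RDX at the top of the new one. */
static void emit_definition_boundary(CodeBuf *cb) {
    emit_byte(cb, 0xC3); /* ret                         */
    emit_byte(cb, 0x48); /* REX.W prefix                */
    emit_byte(cb, 0x92); /* xchg rax, rdx (with REX.W)  */
}
```

The address of the byte after `C3` is the entry point of the new definition, so that is the value a dictionary cell would record for the word being defined.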
Color Semantics and Code Generation (From Transcript & OCR)
- Magenta (`|`): Definition boundary (`RET` + `xchg rax, rdx`).
- White (Call): A compile-time call. Emits a direct `CALL` instruction or a `JMP RAX` (e.g., `FF E0`) if optimizing a tail call.
- Green (Load): Emits a read from memory: `mov rax, [global_offset]`.
- Red (Store): Emits a write to memory: `mov [global_offset], rax`.
- Yellow (Execute/Immediate): A highly overloaded color used for runtime execution, immediate invocation of lambdas, or prefix accessors (like struct member reading).
- Cyan (Literal): Compiles an immediate value load: `mov rax, imm`.
- Blue (Comment): Stored directly in the token payload (3 characters per 24-bit payload) without polluting the global dictionary.
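
A color-driven emitter can be sketched as a switch over the token tag. The version below is an illustration under stated assumptions, not Onat's emitter: the tag names are invented, `RBX` is assumed to be the global base register (the notes do not say which register is used), the caller is assumed to have already computed the `rel32` for calls, and Yellow, Blue, and the tail-call `JMP RAX` path are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag names; the notes do not give the real tag values. */
typedef enum { TAG_DEFINE, TAG_CALL, TAG_LOAD, TAG_STORE, TAG_LITERAL } ColorTag;

typedef struct {
    uint8_t bytes[128];
    size_t  len;
} CodeBuf;

static void emit_byte(CodeBuf *cb, uint8_t b) {
    assert(cb->len < sizeof cb->bytes); /* fixed-size buffer in this sketch */
    cb->bytes[cb->len++] = b;
}

/* Append the low n bytes of v in little-endian order, as x86-64 expects. */
static void emit_le(CodeBuf *cb, uint64_t v, size_t n) {
    for (size_t i = 0; i < n; i++)
        emit_byte(cb, (uint8_t)(v >> (8 * i)));
}

/* Emit machine code for one token. `operand` is a rel32, a disp32,
   or an imm64 depending on the tag. Assumes RBX holds the global base. */
static void emit_token(CodeBuf *cb, ColorTag tag, uint64_t operand) {
    switch (tag) {
    case TAG_DEFINE:  /* ret ; xchg rax, rdx */
        emit_byte(cb, 0xC3);
        emit_byte(cb, 0x48); emit_byte(cb, 0x92);
        break;
    case TAG_CALL:    /* call rel32 */
        emit_byte(cb, 0xE8);
        emit_le(cb, operand, 4);
        break;
    case TAG_LOAD:    /* mov rax, [rbx + disp32] */
        emit_byte(cb, 0x48); emit_byte(cb, 0x8B); emit_byte(cb, 0x83);
        emit_le(cb, operand, 4);
        break;
    case TAG_STORE:   /* mov [rbx + disp32], rax */
        emit_byte(cb, 0x48); emit_byte(cb, 0x89); emit_byte(cb, 0x83);
        emit_le(cb, operand, 4);
        break;
    case TAG_LITERAL: /* mov rax, imm64 */
        emit_byte(cb, 0x48); emit_byte(cb, 0xB8);
        emit_le(cb, operand, 8);
        break;
    }
}
```

Because every tag maps to a fixed byte template plus one operand, the whole compile step is a linear pass over the token array, which is consistent with the millisecond-scale compile times reported in the talk.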
3. Global Memory vs. Local Variables
Onat heavily critiques the conventional wisdom of avoiding global variables, specifically calling out Rust for forcing developers to pass state through 30 layers of call stacks.
- Implicit Register Passing: For passing transient state (like the active UI element's `slot ID`), he implicitly passes the value in a dedicated register (e.g., `R12D`) across functions, completely bypassing any need to push it to a stack.
- Single-Register Memory Base: He dedicates a single CPU register to act as the base pointer for all program memory. This gives instant `[BASE_REG + offset]` access to "gigabytes of state."
- The "Tape Drive" in Practice: Instead of a stack, data needed for complex API calls (like Vulkan initialization) is pre-scattered into these known global offsets using Red (Store) words, and then passed via a single pointer.
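
The base-plus-offset model and the "pre-scatter, then pass one pointer" pattern can be mimicked in C with a byte arena. This is a sketch under assumptions: the offsets, the arena size, and `area_from_block` are invented for illustration and stand in for a real API that takes a pointer to a filled-in argument block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: these offsets are invented for illustration. */
enum {
    OFF_WIDTH  = 0x00,
    OFF_HEIGHT = 0x08,
    ARENA_SIZE = 0x1000
};

/* Stands in for the memory behind the dedicated base register. */
static uint8_t arena[ARENA_SIZE];

/* Red (store): write a 64-bit value at base + offset. */
static void store64(uint8_t *base, uint32_t off, uint64_t v) {
    assert(off + sizeof v <= ARENA_SIZE);
    memcpy(base + off, &v, sizeof v);
}

/* Green (load): read a 64-bit value from base + offset. */
static uint64_t load64(const uint8_t *base, uint32_t off) {
    uint64_t v;
    assert(off + sizeof v <= ARENA_SIZE);
    memcpy(&v, base + off, sizeof v);
    return v;
}

/* A consumer that receives one pointer to the pre-scattered arguments,
   the way a C API receives a pointer to a create-info struct. */
static uint64_t area_from_block(const uint8_t *args) {
    return load64(args, OFF_WIDTH) * load64(args, OFF_HEIGHT);
}
```

A caller stores the width and height at their offsets and then hands `arena` to `area_from_block`; no argument ever travels through a data stack.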
4. Dictionary Management and The "Deck"
Unlike text-based Forths that require hashing, KYRA uses a pure binary index map.
- 24-Bit Indices: Words are stored as 24-bit indices pointing to 8-byte cells. (Onat notes his next iteration moves to 32-bit indices + a separate 1-byte tag array, exactly matching Lottes's `x68` annotation model).
- Visual Organization (The "Scrolls"): The dictionary is explicitly organized by the programmer into 16-word horizontal "scrolls" (e.g., one scroll for "Vulkan API", another for "Math").
- IP Protection: Because the dictionary mapping is separate from the source array, you can ship the binary source indices without the dictionary symbols, effectively stripping the symbols while retaining the executable structure.
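
Index-based resolution can be sketched with a packed 32-bit token. The layout below (tag in the high 8 bits, 24-bit payload in the low bits) is an assumption; the notes only establish that the payload is 24 bits wide, that it indexes 8-byte cells, and that a comment token carries three characters. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAYLOAD_MASK 0x00FFFFFFu
#define CELL_COUNT   16u

/* Assumed layout: tag in the high 8 bits, payload in the low 24. */
static uint32_t token_pack(uint8_t tag, uint32_t payload) {
    assert(payload <= PAYLOAD_MASK);
    return ((uint32_t)tag << 24) | payload;
}

static uint8_t  token_tag(uint32_t tok)     { return (uint8_t)(tok >> 24); }
static uint32_t token_payload(uint32_t tok) { return tok & PAYLOAD_MASK; }

/* Comment token: three 8-bit characters packed into the payload,
   first character in the lowest byte. */
static uint32_t comment_pack(uint8_t tag, const char c[3]) {
    uint32_t p = (uint32_t)(uint8_t)c[0]
               | ((uint32_t)(uint8_t)c[1] << 8)
               | ((uint32_t)(uint8_t)c[2] << 16);
    return token_pack(tag, p);
}

/* Dictionary: 8-byte cells addressed directly by index. Symbol names
   would live in a separate table that the shipped binary can omit. */
static uint64_t cells[CELL_COUNT];

static uint64_t resolve(uint32_t tok) {
    uint32_t idx = token_payload(tok);
    assert(idx < CELL_COUNT);
    return cells[idx]; /* no hashing and no string comparison */
}
```

Resolution is a single array read, and nothing in `resolve` touches a name, which is what makes it possible to ship the token array without the symbol table.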
5. Control Flow: Basic Blocks [ ] and Lambdas { }
KYRA eliminates standard Abstract Syntax Trees (ASTs) and if/else/then branching.
- Basic Blocks `[ ]`: These visually constrain the assembly output. They provide implicit begin, link (else), and end jump targets for the JIT to resolve relative offsets within a limited scope.
- Lambdas `{ }`: A lambda (colored Yellow `{`) does not execute inline. The JIT compiles the block of code elsewhere in the arena and leaves its executable memory address in `RAX`.
- Conditionals: To perform an `IF`:
  - Evaluate a condition (e.g., `luma > 0.6`).
  - Write the boolean result to a dedicated global `condition` variable.
  - Define a lambda block containing the "true" branch (leaving its address in `RAX`).
  - Call an execution word that reads the `condition` variable, emits a `cmp condition, 0`, and executes a `jz` (jump if zero) to skip the lambda address stored in `RAX`.
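
The conditional steps above map onto a function pointer and a global flag in C. This sketch models the behavior only, not the emitted machine code; `exec_if`, `shade`, and `brighten` are hypothetical names, and the function pointer plays the role of the lambda address left in `RAX`.

```c
#include <assert.h>
#include <stdint.h>

typedef void (*Lambda)(void);

static uint64_t condition;  /* the dedicated global condition variable */
static int      brightened; /* side effect used to observe the branch  */

/* The "true" branch, compiled out of line like a KYRA lambda. */
static void brighten(void) { brightened++; }

/* Models the execution word: compare the global condition with zero
   and skip the lambda when it is zero (the `jz` path). */
static void exec_if(Lambda body) {
    assert(body != 0);
    if (condition == 0)
        return;
    body();
}

static void shade(double luma) {
    condition = (luma > 0.6); /* steps 1 and 2: evaluate, store the result */
    Lambda body = brighten;   /* step 3: the lambda's address              */
    exec_if(body);            /* step 4: conditional execution             */
}
```

Calling `shade(0.9)` runs the lambda once; calling `shade(0.2)` afterwards leaves the counter unchanged because the condition word takes the skip path.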
6. FFI: Bridging to C and Vulkan (WinAPI equivalent)
Dealing with OS APIs and standard C libraries (like Vulkan and FFMPEG) requires satisfying the C Application Binary Interface (ABI).
- RSP Alignment: The hardware stack pointer (`RSP`) is exclusively used for the call stack (return addresses). Since no data buffers live on that stack, a buffer overflow has no return address nearby to overwrite.
- The FFI Dance: When calling external C functions, Onat's macros explicitly read `RSP` into a temporary variable, align `RSP` to 16 bytes (a strict requirement of both the Windows and Linux x64 C ABIs), execute the `CALL`, and then restore `RSP`.
- (Note for Bootslop: We saw `CCALL1`, `CCALL2`, etc., in the OCR, confirming he uses specialized macro words to map the `RAX`/`RDX` stack and global variables into `RCX`, `RDX`, `R8`, `R9`, the integer argument registers of the Windows x64 calling convention, before triggering the OS call).
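
The save, align, call, restore sequence is mostly address arithmetic, which can be checked in C before any bytes are emitted. A minimal sketch, assuming a hypothetical `FfiFrame` planning struct; the 32-byte reserve in the usage note is the shadow space the Windows x64 convention requires the caller to allocate.

```c
#include <assert.h>
#include <stdint.h>

/* Round a stack address down to a 16-byte boundary. */
static uint64_t align_down_16(uint64_t rsp) {
    return rsp & ~(uint64_t)0xF;
}

/* Hypothetical plan for one external call. */
typedef struct {
    uint64_t saved_rsp; /* original RSP, restored after the call returns */
    uint64_t call_rsp;  /* value of RSP at the CALL instruction          */
} FfiFrame;

/* `reserve` is scratch space the ABI requires below the aligned address.
   It must be a multiple of 16 so the alignment is preserved. */
static FfiFrame ffi_plan(uint64_t rsp, uint64_t reserve) {
    assert(reserve % 16 == 0);
    FfiFrame f;
    f.saved_rsp = rsp;
    f.call_rsp  = align_down_16(rsp) - reserve;
    return f;
}
```

For an incoming `RSP` of `0x7FFC1238` and a 32-byte reserve, the call is made at `0x7FFC1210` and `RSP` is put back to `0x7FFC1238` afterwards. Saving the original value is what makes the restore exact, since the amount dropped by the alignment step varies from call to call.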
7. Development Workflow
- Bug Triage over Asserts: There are no unit tests or assertions. Bugs are found by commenting out blocks of code (disabling them) and hitting compile. Because compilation takes 8ms, binary searching for the crash point is faster than writing tests.
- Free Printf / Data Flow: By hovering over a word in the editor, the system automatically injects code to record `RAX` and `RDX` at that exact execution step, allowing the programmer to step through the data flow visually without running traditional debuggers.
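
The comment-out-and-recompile triage is a bisection over code blocks. The sketch below is one way to express it, not a tool from the talk: `bisect_first_bad` and `demo_crashes` are hypothetical, and the search assumes that a build with no blocks enabled works, that a build with all blocks enabled crashes, and that the crash is caused by a single block.

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero if the program crashes with the first
   `enabled` blocks compiled in and the rest commented out. */
typedef int (*CrashTest)(size_t enabled);

/* Find the 0-based index of the block that introduces the crash.
   Invariant: the build with `lo` blocks works, the build with `hi` crashes. */
static size_t bisect_first_bad(size_t n, CrashTest crashes) {
    assert(n > 0);
    size_t lo = 0, hi = n;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (crashes(mid))
            hi = mid;
        else
            lo = mid;
    }
    return hi - 1;
}

/* Demo predicate: block 5 is the culprit, so any build that
   includes it (six or more blocks enabled) crashes. */
static int demo_crashes(size_t enabled) {
    return enabled > 5;
}
```

With ten blocks the search needs three rebuilds to isolate block 5; at roughly 8 ms per compile the rebuilds cost far less than the time spent toggling the comments.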
Conclusion for bootslop
The "Metaprogramming KYRA" talk confirms that the 2-register stack and "preemptive scatter" global memory model in `attempt_1/main.c` are on the right path.
The next major hurdles for bootslop will be:
- Implementing the `xchg rax, rdx` definition boundary logic.
- Creating an FFI bridge (like Onat's `CCALL`) that aligns `RSP` to 16 bytes and maps globals to WinAPI registers, allowing our minimal Forth to summon full OS windows and graphics.
- Transitioning dictionary definitions from string-parsing to direct array index resolution.