684 lines
38 KiB
Plaintext
0:00 Welcome to Forth and Beyond. My name is Timothy Lis and this is the Neoenographics channel. This talk will not be covering standard Forth. Instead, this talk is going to start with the "beyond Forth" part. Let's begin. What if we didn't actually need Visual Studio? What if we didn't need a separate debugger, or even the C language?
0:17 Let's start with the first principle: question everything until the problem is truly minimized. Begin by peeling the onion of computing, passing through APIs, compilers, languages, code generation, and so on. Search the alternative realities until greatness is found. And we'll start by rewinding time and learning from the past masters.
0:34 We'll start with the most basic interactive computer tool, the calculator. My favorite calculator was the HP48; that's what I used. The HP48 used reverse Polish notation, which made it very easy to type in math and get answers: you didn't have to mess with parentheses. The HP48 provided RPL. The later machines provided System RPL, which could even assemble machine code, with offline tools for the HP48. People even built games for these machines. Now, what if we were to take that calculator and evolve it into something more Forth-like?
1:06 We'll start with simple reverse Polish notation calculator math. Next we'll introduce a dictionary. The dictionary will point to positions on the data stack. For instance, in the second line here, we have a red word "4k". That "4k" word would point to the next stack item, and we can do some evaluations to come up with the number for it. So we type in 1024, then type in 4, and then type in multiply, and now we have 4096. So the "4k" word would point to 4096. This is a basic way of doing a variable.

1:37 The next thing we could do is build numbers, or sequences of numbers, which represent op codes: things that we could actually execute to perform an operation on the machine. So in this case there's "drop", and "drop" points to a number on the data stack which disassembles to add esi, -4 followed by a return. "drop" would basically drop the top item from the data stack, where esi is pointing to the data stack. And now, once we have this in our dictionary, we can continue to do things on the stack, and we can use "drop" if we want to. So now we could write "4k", which would pull that number 4096 that we had put in the dictionary earlier, then do 1, then do 2, and then a plus, which would create a 3, and then we execute "drop", which drops the 3, leaving 4096 on the stack. Thus we've now created something quite powerful. So in this context, the gold numbers get pushed on the stack, and a gold word gets its value in the dictionary pushed on the stack.
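The calculator-plus-dictionary mechanics described so far can be sketched in a few lines. This is a hypothetical Python model for illustration, not colorForth itself: numbers push, "*" and "+" operate on the stack, a red-style definition binds a name to the current top of stack, and invoking the name pushes that value back. (A real Forth points words at stack or memory addresses rather than copying values.)

```python
# Minimal RPN machine with a dictionary, sketching the "4k" example.
# Hypothetical model for illustration only.

def rpn(tokens):
    stack, words = [], {}
    for tok in tokens:
        if tok == "*":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif tok == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif tok == "drop":
            stack.pop()                     # drop top of data stack
        elif tok.startswith(":"):
            words[tok[1:]] = stack[-1]      # red word: bind name to top of stack
        elif tok in words:
            stack.append(words[tok])        # gold word: push its dictionary value
        else:
            stack.append(int(tok))          # gold number: push it
    return stack

# 1024 4 * defines 4k = 4096; later "4k 1 2 + drop" leaves 4096 on top.
print(rpn(["1024", "4", "*", ":4k"]))                                       # [4096]
print(rpn(["1024", "4", "*", ":4k", "drop", "4k", "1", "2", "+", "drop"]))  # [4096]
```

Note how "drop" here is just another dictionary lookup; in the talk's model it would instead point at executable machine code on the data stack.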
2:37 A green word gets its value in the dictionary executed, and a red word puts a pointer to the top of stack into the actual word in the dictionary. In some respects, you can see how this starts to create an extremely powerful system. So a Forth-like machine is really the ultimate form of tool building. The language is free-form. The dictionary defines words, and these words become the language you program in. It enables any kind of factoring of a problem. The language, the assembler, the compiler, the linker, the editor, the debugger: they're all defined in the source itself. And these systems can be tiny, tiny as in the whole thing fits in the cache.

3:06 In my opinion, a Forth-like machine would have been a better option than BASIC for a boot language. A lot of people learned BASIC because they could type in a program from a book, say on the C64. But imagine if it was a Forth machine instead: you'd have something that runs significantly faster and is significantly more powerful. The irony here is that later Apple, IBM, and Sun actually used a Forth-based Open Firmware, but few had programmed in it at the time.

3:38 So, let's look back at Forth. Forth was invented in 1968 by Chuck Moore, or Charles Moore. Chuck later focused on building multiple stack-based processors. He used his own VLSI CAD system, OKAD, for layout and simulation, and these tools were written in his language. Early on it was a sourceless system, and later it got moved to colorForth, from my understanding. The images below show some of the actual editor and simulation; these are from the UltraTechnology site. What's impressive here is that these were dramatically small systems, and yet they were used to do some of the most complicated stuff that humans can do, which is design chips that actually got fabricated and got used.
4:16 Chuck Moore's colorForth, I think, is worth learning about. It's an example of real system minimization: a 32-bit reverse Polish notation language. It provides a data stack, which gives you memory to work with, and note that code is compiled onto the data stack too. It provides dictionaries, which map a name to a value. The value is typically a 32-bit number or a 32-bit address into the source or data stack. The dictionaries are searched in linear order, from the last to the first defined word. There are two main dictionaries: forth, which is used for words to call, and macro, which is a secondary dictionary used for words that do code generation. Source is broken up into blocks; there is no file system. Inside the source blocks are 32-bit tokens. These tokens contain 28 bits of compressed name, or string, and four bits of tag. The tag controls how to interpret the source token.

5:07 Let's go through some of the tags. The white tag means an ignored word. The yellow tag means execute: if it's a number, we append the number on the data stack; if it's a word, we look up the word in the dictionary and then we call the word. If it's a red word, we're doing a definition: we're setting the word in the dictionary to the top of the stack, or a pointer to the top of the stack. If it's green, we're compiling. If it's a green number, we're appending a push of the number onto the stack; effectively, we're encoding the machine language that would push that number. If we compile a word, we're first going to look up the word in the macro dictionary, and if it exists, we're going to call it. Otherwise, we look up the word in the forth dictionary, and we append a call to the word itself. Cyan, or blue, is used to defer a word's execution: we look up the word in the macro dictionary, and we append a call to the word. This way, we can make words that do code generation that call other words that do code generation. Next is the variable, which uses magenta.
6:16 A variable sets the dictionary value of the word to the pointer to the next source token in the source code as it's being evaluated. And then, any time we have a yellow-to-green transition, we pop a number off the stack and we append a push of that number to the data stack, which basically means we're taking a number and turning it back into a program, a program that pushes the number.

6:36 So let's look at some of the blocks inside colorForth, and notice this one here, block 18. This starts the code generation. It'll push 24 and then execute load, which takes block 24 and brings in all the code generation macros. And then the next one, 26 load, brings in more code generation macros from block 26. If you look at block 24, it starts by executing macro, which moves us to making defines in the macro dictionary. The first define is swap. Then it does 168B, and then it does a "2,". The "2," pushes two bytes onto the data stack. The next one is C28B0689, followed by a ",". The "," pushes four bytes onto the data stack. So effectively what we're doing is pushing bytes that form actual code onto the data stack, where swap is defined. And if we disassemble these six bytes, we get mov edx, dword ptr [esi].
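The "2," and "," words append their value to the code area as little-endian bytes. A sketch of that byte building in Python (my own model, using the byte values from the swap example above):

```python
# Sketch of colorForth-style "2," and "," : append a value to the code
# area as 2 or 4 little-endian bytes. The hex values are from the swap
# example above; this is an illustration, not colorForth source.

code = bytearray()

def comma2(v):
    code.extend(v.to_bytes(2, "little"))   # "2," : append two bytes

def comma4(v):
    code.extend(v.to_bytes(4, "little"))   # ","  : append four bytes

comma2(0x168B)       # bytes 8B 16        -> mov edx, dword ptr [esi]
comma4(0xC28B0689)   # bytes 89 06 8B C2  -> mov [esi], eax ; mov eax, edx

print(code.hex())    # 8b1689068bc2
```

Writing the value little-endian is what makes the hex constants in the block read "backwards" relative to the instruction bytes.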
7:40 So effectively we're pulling from the stack into edx, and the stack in this case is the data stack of Forth. The next one is mov dword ptr [esi], eax. So we're pushing the existing cached value of the top of the stack, which is in eax; we're putting that on the stack. And then we mov edx into eax, which takes the old second value on the stack and puts it into the cached value, which is the top of the stack in colorForth. So basically this whole block is defining op codes that are used for code generation.

8:14 So let's fast-forward now and critique colorForth. Perhaps one of the biggest critiques of colorForth is that it's a mismatch to hardware today. It's a stack-based machine, and modern machines are register-based. Modern machines have really deep pipelines; they don't deal with branching well, and Forth is extremely branch-heavy. The interpreter costs that you have to pay per token are pretty high: we have to branch based on tag, and dictionaries are searched from last added to first added, with no hashing or any acceleration.
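That last-to-first search order is also what makes redefinition work: the newest definition of a name shadows older ones. A small hypothetical Python sketch:

```python
# Dictionary as an append-only list of (name, value) pairs, searched
# linearly from the last-defined word back to the first, as described
# above. Illustrative sketch, not colorForth internals.

dictionary = []

def define(name, value):
    dictionary.append((name, value))

def find(name):
    # Walk from newest to oldest; a newer definition shadows older ones.
    for n, v in reversed(dictionary):
        if n == name:
            return v
    raise KeyError(name)

define("size", 100)
define("size", 200)          # redefine: later lookups see the new value
print(find("size"))          # 200
```

The cost is linear in dictionary length per lookup, which is exactly the acceleration-free behavior being critiqued here.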
8:43 Most commonly, every time you interpret, after branching on the tag you're going to branch again to another thing, which is going to be a mispredicted address. And note, you get an average 16-clock stall on, say, Zen 2 for a branch misprediction. Of course, the logical response here is that if you only have a tiny amount of code, there's no reason it has to be super fast. After all, the most important optimization is doing less total work. For example, an F1 car driving 1,000 meters is going to be substantially slower than a turtle walking one foot. Well, towards the end of 2025, Chuck Moore said, "I think fate is trying to tell me it's time to move on."
9:20 And this was in response to Windows auto-updating and then breaking his colorForth. But I ask, should we actually move on? The world did move on, to mass hardware and software complexity, but perhaps Chuck's way of thinking is exactly what is needed today. How about a localized reboot? We have a lot of FPGA-based systems showing up, and I'm hopeful that they're finding commercial success, but these are effectively all emulators of prior hardware. What about doing something new? Maybe Forth thinking could be a part of that. What about neo-vintage parallel machines? After all, Forth thinking is ideal for a fixed hardware platform.
9:56 FPGA-based hardware emulators focus mostly on the serial-thinking era. But there is actually a universal speed-of-light barrier here: these product lines are going to stop around the N64 and so on, because after that, serial CPU clock rates cannot be FPGA-emulated. But FPGAs have crazy parallel DSP capabilities. Perhaps we should design for DSPs as the processors, and provide radically parallel but medium-clock machines, and these are things we could actually drive with a Forth-style language. There is the challenge of minimalism in a maximalist world. Software is a problem, but the root is hardware complexity growth. For example, the RDNA4 ISA guide is almost 4 megabytes in itself. And try writing a modern USB-C controller driver yourself. And yet, even with all of today's hardware complexity, I still believe Forth-inspired software can be quite useful.

10:45 I spent a lot of time exploring the permutation space around Forth, specifically more around colorForth, and seeing what variations could be made. One way I varied from Forth was in the op encoding: I don't necessarily stick with a stack-based language. Sometimes I treat the register file more like a close, highly aliased memory. Sometimes I use a stack-based language as a macro language, say for a native hardware assembler. And sometimes I mix a stack-based language with something that has arguments, for instance having a word take arguments after the word, and still use it like a stack-based language.

11:22 So I have used Forth-like things in commercial products. One example: I used to run a photography business and a software development business, and the old business website that I ran in my prior life doing landscape photography was actually generated by a Forth-like language running server-side, which generated all the HTML pages. It made managing a huge website actually practical. Now, I had to use the Wayback Machine to find this, so sorry in advance for the broken images. And of course, I had a different last name then, from a broken marriage, but that's another story. But I did a lot more Forth-like things beyond this one.

11:54 One of the things that got me right away was, of course, the lure to optimize. For example, colorForth uses a Huffman-style encoding for the names in its tokens. Remember, a source token is a T-bit tag, typically T is 4, with an S-bit string, where S is 28 bits. We could do a better job of encoding those 28 bits. For instance, we could split the full number range by some initial probability table for the first character, and then split each of those ranges by, say, a one- or two-character predictor, and then train this thing on a giant dictionary. And of course, you're going to have to use lookup tables. And of course, the memory used for the predictor is going to be greater than the rest of this whole system combined. And yeah, it worked. It provided some very interesting stuff: you could put a number in and it would basically spit a string out, which was pretty cool. This journey, I think, was useful. I learned a lot of things in the process, like where to optimize and where not to optimize.

12:57 The next question is: should we hash, or should we not hash? It turns out that a compressed string, like in the prior slide, is actually a great hash function. I can simply mask off some number of the least significant bits, and that becomes my hash function.
13:10 For hashing, I always disliked the issue of only partly using cache lines; that's not very efficient. And of course, we can try to fix that too. We can check a very tiny hash table first, and size that hash table to stay in the cache. Then, if we miss on that, we can go to the full-size one, the one that's going to have pretty poor utilization of cache lines. And assuming lots of reuse, that tiny hash table is going to keep high cache utilization.
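The two-level scheme just described, a tiny cache-resident table checked before the full one, can be sketched like this (hypothetical Python; the "hash" is just the low bits of an already-compressed name, as described above):

```python
# Two-level lookup: a tiny direct-mapped table sized to stay in cache,
# backed by a full-size dictionary. Illustrative sketch only.

TINY_BITS = 4
tiny = [None] * (1 << TINY_BITS)   # tiny direct-mapped, cache-resident table
full = {}                          # full-size backing dictionary

def insert(key, value):
    full[key] = value
    tiny[key & ((1 << TINY_BITS) - 1)] = (key, value)

def lookup(key):
    slot = tiny[key & ((1 << TINY_BITS) - 1)]
    if slot is not None and slot[0] == key:
        return slot[1]             # hit in the tiny table: cheap and cache-hot
    return full[key]               # miss: fall back to the full table

insert(0x12345, "dup")
insert(0x54325, "swap")            # same low bits: evicts 0x12345 from tiny
print(lookup(0x12345), lookup(0x54325))   # dup swap
```

With lots of reuse, most lookups hit the tiny table; only evicted keys pay for the poorly-utilized full-size structure.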
13:35 However, now we've done two stages of optimization. We really should start asking: why are we hashing, and why are we compressing? Why are we doing all this overhead? Why do we not just direct-map? After all, if we're depending on an editor, we could just direct-map, or perhaps just address into the dictionary directly. Then we can split off the T-bit tag and the S-bit string for editor use. And that can start simplifying things, so we don't have all this complexity in the first place. The next thing we can do, if we're interpreting, is solve the problem of branch misses. Normally with an interpreter, you would evaluate the word and then return back to the interpreter.
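That evaluate-then-return shape is the classic interpreter loop below; the indirect call at its center is the branch that keeps mispredicting on real hardware. (A hypothetical Python model of the control flow, not real colorForth.)

```python
# Classic interpreter shape: a central loop looks up each word and makes
# one indirect call per token. The call target changes from token to
# token, which is exactly the branch a real CPU keeps mispredicting.

stack = []

words = {
    "one":  lambda: stack.append(1),
    "two":  lambda: stack.append(2),
    "plus": lambda: stack.append(stack.pop() + stack.pop()),
}

def interpret(tokens):
    for tok in tokens:          # the interpreter loop
        words[tok]()            # indirect call: evaluate the word...
                                # ...then fall back into the loop

interpret(["one", "two", "plus"])
print(stack)                    # [3]
```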
14:12 That interpreter would look up another word and do another branch, but that branch would always be mispredicted. One option is to just fold the interpreter back into the words themselves. But of course, we've got to make that interpreter really small, otherwise we're doing a lot of code duplication: imagine if you have a thousand words, you're going to embed the interpreter a thousand times. So there are a lot of different ways we can design an interpreter down to a few bytes. For instance, this one is an 8-byte interpreter; this is one I've never actually used. You can do better than 8 bytes, and I'll show you that towards the end.

14:45 Of course, the best way to learn is to build stuff. So I built many colorForth-inspired things over the years. With some, like the one to the right here, I got distracted with editor graphics, effectively making something extremely nice to use and very pretty. This one was cool: the dictionary I moved into the source itself, and I did a direct binary editor. So in this thing you'd actually see the cache lines, and you're effectively using a hex editor that uses tags to give you some contextual information. Each line has a comment on the top, followed by the data on the bottom. And of course I used different fonts, because sometimes I'm packing a full number and sometimes I'm packing characters and so on in comments. It was a relatively complicated system, but actually simple when you think about it in the context of what we build today.

15:28 One of the first questions to ask yourself is whether you want to work with text source versus a binary editor. So sometimes I would work with text source. In order to make this work well, I would have a prefix character in front of every word, which basically would be the tag. It would also enable me to use very simple syntax coloring inside, say, nano. Most of these I built were more like a Forth macro language that was used to create a binary. So what I would do, and for instance what you can see on the right, is define something that would enable me to build the ELF header, and then after the ELF header was built, I would actually write the assembler in the source code, and then finish off the rest of the binary. These kinds of languages are extremely small, and the whole thing is in, say, a few-kilobyte binary.

16:16 The other thing I do with these is bootstrap. So the first time, I might write the thing in C, and I'd run the interpreter in C. Then later I would rewrite the interpreter inside the source code and compile that, and now I would be bootstrapped into the language itself. And by doing that, I could actually compare my C code to the code I wrote inside my own language.
16:43 And of course I'm faster inside my own language than in the C code. And of course I'm a lot smaller in the binary as well, because I have a very, very small ELF header in the binary that I generated, compared to the one that, say, GCC would generate. I built some custom x86 operating systems. It was fun to build custom VGA fonts, and of course mess with the palette entries to improve the initial colors. I did lots of different Forth variations, but typically these projects just got blocked on the mass complexity of today's hardware. Meaning, once you get down to the point where you want to, say, draw something on the screen, other than using the old DOS VGA frame buffer, or you want to start using input, you start needing a USB driver, and then all of a sudden everything turns into a nightmare. One thing I mentioned before is that it's very nice to use a Forth-like language as a macro assembly language.
17:33 In traditional assembly language you do something like, say, add ecx, edx, and then a comment: advance pointer by stride. The latter part here is heavily commented; in fact, typically assembly is mostly comments, otherwise a human can't really understand it. When you start using a Forth-like language as a macro assembler, a lot of the time, instead of using the register number, you would just put the register number inside another word and then use that word. So now you start self-documenting. And if you had common blocks of, say, multiple instructions, you would start defining those in some other word, and then you start factoring. This way you self-document everything, and it becomes actually very easy to understand, a lot easier to understand than raw assembler. And on top of this, of course, you can also put comments, but you typically don't need as many.

18:22 So if we look back at some of the lessons of all these projects, I think the key thing is that when your OS is an editor, is a hyper-calculator, is a debugger, is a hex editor, you end up with this interactive, instant-iteration software development, and that part is wonderful. The Forth style of thinking keeps source small enough that it's approachable by a single developer, and that, I think, is very important. You basically build out your own tools for exactly the way you like to think, and that's where its true beauty lies. Others like Onot have built full systems, meaning he is running something that actually works with Vulkan and generates SPIR-V.

19:05 So there is another option, and that is going sourceless. No language, no assembler: the code is the data, or the data is the code. Chuck's OKAD was a source of inspiration. I've only read about it, but it did send me down a spiral of trying various ideas related to what I read. So when we think about sourceless programming, it's best to just work from the opposite extreme.
19:24 Start with, say, a hex editor, and then work towards what we would need to make that practical for code generation. So I think of a binary as an array of, say, N 32-bit words, and then we could have another thing, an annotation, which is an array of N 64-bit words. The annotation could provide a tag which gives context to the data, or could control how the editor manipulates the data. The annotation can also provide an 8-character text annotation for the individual binary words, which serves as documentation for what the word is for.

19:56 So part of sourceless programming is how you generate code, and with Forth, hand-assembling words is actually relatively easy, because you don't have that many low-level operations if you're doing a stack-based machine. I invented something called x68, which I'm planning to do a separate talk on. It's a subset of x64 which works with op codes at 32-bit granularity only. Note that x86-64 supports ignored prefixes, which can pad op codes out to 32 bits. And we also have multi-byte NOPs, which can align the ends too. And we can do things like adding REX prefixes when we don't need them, again to pad out to 32 bits. So, for instance, if we wanted to do a 32-bit instruction for return, we might put the return, which is a C3, and then pad the rest of it with a three-byte NOP. And once we've built this 32-bit return number, which we annotate, we can insert a return anywhere just by copying and inserting this word in the source code. And later, if we built different op codes, and say they were multi-word, we can just use find-and-replace to change those. Effectively, we're turning compilation into edit-time operations.

21:05 One of the nice things about being at 32-bit granularity for the instruction is that the 32-bit immediates are now at 32-bit granularity as well. And so now we can just make it so that we have a tag which says "this is an op code" and a tag which says "this is, say, an immediate hex value".
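A sketch of building such fixed 32-bit words in Python. The byte values are standard x86 (C3 is ret, 0F 1F 00 is a three-byte NOP, 3E is the ignored DS-segment prefix, BE is mov esi, imm32), but the exact padding choices here are my own illustration, not necessarily the x68 encoding:

```python
# Building fixed 32-bit "instruction words" for a subset of x86, in the
# spirit of the x68 idea above. Illustrative sketch.

def word32(*bs):
    assert len(bs) == 4                   # every unit is exactly 4 bytes
    return bytes(bs)

RET = word32(0xC3, 0x0F, 0x1F, 0x00)      # ret, padded with a 3-byte NOP
MOV_ESI_IMM = word32(0x3E, 0x3E, 0x3E, 0xBE)  # ignored DS prefixes + mov esi, imm32

imm = (0x12345678).to_bytes(4, "little")  # the immediate lands on its own 32-bit word

code = RET + MOV_ESI_IMM + imm
print(len(code), code.hex())              # 12 c30f1f003e3e3ebe78563412
```

Because every unit is exactly four bytes, op codes and immediates can each carry their own tag, and the editor can copy, insert, or recolor them as whole words.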
21:26 And we could show them separately with different colors. In this case, I have code for setting esi to a 32-bit immediate. And you'll notice that this one is using the 3E ignored DS segment selector prefix to pad out the op code to 32 bits. And then after that, we have a silly number which we're setting into esi. That silly number is 1 2 3 4 5 6 7 8 in hex. So it's very easy to do inline data this way. Of course, calls and jumps are another question, and we have an easy solution for that one as well.
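One way to sketch that solution: tag each relative operand with its absolute target, and have the editor recompute the offset whenever words move. (Hypothetical Python model; offsets are in 32-bit words and measured from the end of the operand's instruction, as with x86 call/jmp rel32.)

```python
# Relinking sketch: rel32 operands are annotated with an absolute target
# word index, and recomputed whenever words move. Illustrative model.

words = ["op", "rel", "op", "op", "target"]   # rel at index 1 aims at index 4
rel_targets = {1: 4}                          # annotation: operand index -> target index

def relink():
    # rel32 is relative to the end of the operand's instruction word.
    return {i: t - (i + 1) for i, t in rel_targets.items()}

print(relink())                               # {1: 2}

# Insert a word before the target: the editor fixes the offset up.
words.insert(3, "new-op")
rel_targets[1] = 5                            # target moved down by one word
print(relink())                               # {1: 3}
```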
21:57 In x86-64, call and jump use a 32-bit relative immediate, and that relative is relative to the end of the op code, not the beginning. And so if we want to make an editor support this, we would just tag the relative branch address as a word that is a relative address. And then, when we start editing the words inside the binary, and say we move things around, we would just relink all of the words in the binary that have a relative address. So as code changes, things just get fixed up. And this effectively solves the call and jump problem; it's very easy to make an editor which repatches everything.

22:34 Conditional branches: you might think those are complicated, but they're actually not. A conditional branch is just an 8-bit relative branch address. And so when I make words of these, I would say "jump-unequal minus 2", which would jump, if unequal, to the word that is two words before this one, or say "jump minus 4" for four words back, and so on. And so I can just build a few of these constructs and change the op codes around whenever I need, say, a jump-on-zero or so on. The nice thing about this is that now you no longer have to label things, because you just go and count, and when you move it around it's all relative, so you don't need to do any patching. If you want to add more stuff in your loop, you just change the op code a little.

23:22 Another option is that the editor could have built-in 32-bit word assembly and disassembly, meaning I could use a shorthand for registers and op codes. And the shorthand that I would use labels the registers starting with G, so that 0 through F can be used for hex. This is an example of how you might want to do it: in this case I have "h + at i 0 8", which disassembles to add rcx, qword ptr [rdx + 0x8]. We can shorthand this very easily, and so I could have an editor that shows you either the disassembly or the shorthand instruction, and that aids understanding and the ability to insert stuff without using separate tools.

24:04 So, I did build this sourceless system once, for real. It was back when I was building tiny x86-based operating systems. I built the editor as a console program, so it would run in Linux, and I would build binaries in that console program and then use an x86 emulator running in Linux to actually test them. And this was a pretty liberating experience; I learned a lot from it. On the right, I'm showing one of the boot sectors of one of the examples running in the editor.

24:33 Note that with sourceless programming we could extend the annotation quite a bit. For instance, we could have tables that map register numbers to strings, and then for each cache line we could have an index into a table. In this way, registers could be automatically annotated. For instance, if register 2 is set to "position" and register 4 is set to "velocity", and we had add r2, r4, we could just show add pos, velocity, and that would make it a lot easier to understand, automatically. We could also extend this and have each cache line carry a string as well. This way we could automatically annotate, say, a label: the first word maybe could be the label, and the rest of the string could be used for a comment. So there are a lot of ways to do sourceless programming where we just provide annotation tools to make it a practical experience.

25:22 So let's talk about some of the variations, the pros and cons. The easiest way to start would perhaps be to work with text source. And usually with text source, you're going to use prefix characters for the tag: for instance, slash for comment, colon for define, maybe tick for execute. This does have the slowest runtime, however, because you have a character-granularity loop. It does have a benefit in that, to get started, there is no editor you have to write: you can use an external text editor, and you can do easy custom syntax coloring. It's going to be very easy to understand and to work in. However, I think you're missing a big piece if you go down this path, and that is that you don't get any live interaction or debugging. You're basically depending on the fast compile times of your custom language and the fast load times of whatever program you're building to get you into that iteration loop. And you can work this way, I've done it many times, but the experience is nowhere near as good as doing the fully interactive one with a binary editor. I guess one of the other benefits here is that you get very easy code portability.
26:28 There are no binary files, just text files; you can copy and paste as you will. Of course, the next jump up from that is going to binary source. This would be middle-performance at runtime, because now you're working at word granularity inside your interpreter. You have portability, because you have code generation that can adapt to the system.
|
|
26:41
|
|
Meaning, as you interpret your code, you can look at what's underlying in the hardware and you can make changes to how
|
|
26:46
|
|
the code is generated at runtime. You can build just what bits of an assembler are needed. You don't need to build out
|
|
26:52
|
|
everything like you would with say a disassembler tool or an assembler tool. So for instance with x86 I don't
|
|
26:58
|
|
actually generate much of the ice. I only use a very very tiny subset. You do have to write the binary editor and that
|
|
27:04
|
|
can be a lot of work and sometimes that presents a problem with bootstrapping because you don't have the language to
|
|
27:10
|
|
write the editor in from the beginning. So you have to write the editor in some other language and then in that language
|
|
27:15
|
|
write the editor again and then you know complete the bootstrapping process. The
|
|
27:21
|
|
one benefit here is you get interactive development and debugging from the beginning, and also now your source code shows how
|
|
27:28
|
|
constructs are built instead of just showing the result as you would get with, say, sourceless. One interesting thing
|
|
27:33
|
|
about Forth with binary source and this concept that you can rebuild source
|
|
27:39
|
|
code at runtime at any point in execution is that now you can start compiling things you load into machine code and
|
|
27:46
|
|
then you don't have to pay all the interpreter overhead. Basically you can bake things into machine code anytime at edit time
|
|
27:52
|
|
and that's a very powerful feature. Now if we go and look at sourceless programming there's a bunch of pros and
|
|
27:58
|
|
cons. The one pro is that it's the fastest runtime; it's true native code. You can build things that are tightly
|
|
28:04
|
|
optimized for a platform and they're as optimized as you could possibly get. You
|
|
28:09
|
|
do however lose capacity for showing how constants came to be and you lose the
|
|
28:15
|
|
capability of adapting to what the machine is. You can work around this, however: you can make smaller embedded
|
|
28:20
|
|
constants that you write into modified bit fields of instructions, but it's
|
|
28:26
|
|
really taking it a little too far in terms of complexity. You do need to write an editor which includes an opcode
|
|
28:31
|
|
assembler and disassembler potentially if you want to go down that route. And if you're going to do CPU and GPU,
|
|
28:37
|
|
that's a lot of work. It can be very complicated when systems include auto annotation. For instance, if you want to
|
|
28:44
|
|
type in, say, a readable register name and then have it go and figure out what register number that is. I guess the
|
|
28:50
|
|
primary disadvantage here is that there's no possibility for portability. You have a raw binary editor. And today
|
|
28:57
|
|
we have a problem with GPUs. Steam Deck is RDNA 2, Steam Machine's RDNA 3, and
|
|
29:04
|
|
who knows, the future may be RDNA 4, 5, or 6. The problem with this is that when the ISA changes across those
|
|
29:10
|
|
chipsets, that can result in different instruction sizing. So you can't just do
|
|
29:16
|
|
simple sourceless. For example, MIMG in RDNA 2 has a different size than VIMAGE
|
|
29:22
|
|
in RDNA 4. And that's all due to the ISA changes. So if you do sourceful
|
|
29:27
|
|
programming, you can port through a compile step. However, with sourceless, you would need to do something else. And
|
|
29:33
|
|
perhaps that is just rewriting chipset specific versions of all the shaders, but that may be something that you don't
|
|
29:39
|
|
want to do. So thinking ahead on the SteamOS/Linux project that I'm working on, effectively I'm building an AMD-only
|
|
29:46
|
|
solution. I don't really have a name for this project, so I'm going to call it Fifth. I think for Fifth I want a mix of
|
|
29:52
|
|
various Forth-style concepts, and perhaps the best mix would be the best
|
|
29:57
|
|
of both worlds. A fast high-level interpreter that's intended for doing GPU code generation where I need chipset
|
|
30:04
|
|
portability at runtime mixed with low-level sourceless words for the CPU
|
|
30:09
|
|
side for simple x86-64 where I don't need portability at this time. Since this is
|
|
30:15
|
|
for Linux, we should think about what we can do on Linux that we might not be able to do on Windows today. We first
|
|
30:21
|
|
start thinking about execution. With the Linux x86-64 ABI, when you're running
|
|
30:26
|
|
without address space randomization, the execution starts at a fixed 4 megabytes
|
|
30:31
|
|
in, and you can still do this today if you compile without -fPIE. Also note,
|
|
30:37
|
|
even if you did get position-independent execution, you could effectively just map where you want your fixed
|
|
30:44
|
|
position to be and then just start execution there and just ignore the original mapping that they threw you at.
|
|
30:50
|
|
Another thing we can do is we can use page maps. And if we look at Steam Deck, we'll notice that dirty pages get
|
|
30:56
|
|
written back about as often as every 30 seconds, which is an important number. It means it won't be overwriting too fast.
|
|
31:03
|
|
So, let's look at something we can do on Linux that we're not allowed to do on Windows anymore. The self-modifying
|
|
31:08
|
|
binary. The idea here is that we have a cart file. The cart file represents the ROM. Actually, in this case, it's a RAM
|
|
31:15
|
|
because we're going to be modifying it. So, first we would execute the cart file. And when the cart file runs, it
|
|
31:21
|
|
would realize that it's not the backup. And then it would copy itself to a backup file, cart.back, and then it would
|
|
31:28
|
|
launch cart.back and then exit. This cart.back would realize that it is the backup file, and then it would map
|
|
31:35
|
|
the original file cart at, say, 6 megabytes in, and it would make that mapping read/write/execute, and
|
|
31:42
|
|
then afterwards it would map an adjustable zero fill and that would be for when we're doing compilation or when
|
|
31:48
|
|
we have data that we don't want to be backed up all the time and after that it would jump to the interpreter and so if
|
|
31:54
|
|
we look at the memory mapping we'd have at 4 megabytes we would have say a 4 kilobyte boot loader section
|
|
32:01
|
|
And then we would have somewhere say at 6 megabytes we'd have the whole file and then after that we would have the zero
|
|
32:07
|
|
fill. And the nice thing about this is we automatically create a backup. We don't have to write any code to save the
|
|
32:13
|
|
file because it's going to autosave every 30 seconds. Also, the data and code are together, and we can make a new
|
|
32:19
|
|
version just by copying the file itself. Inside the binary we'd have a specific spot for the size of the file and the
|
|
32:25
|
|
size of the zero fill. So as part of this execution process, we can very easily resize things when
|
|
32:31
|
|
we build the cart.back file. So for source code, I'm thinking 32-bit
|
|
32:36
|
|
words. For words, they're going to be direct addresses into the binary, and the binary is going to be the dictionary.
|
|
32:43
|
|
Makes it quite fast to interpret. The nice thing here is they can be direct addresses because we fix the position.
|
|
32:50
|
|
We're not using address space randomization. We'll just fix RSI to the interpreter source position, and then all
|
|
32:56
|
|
the words will contain the NEXT of the interpreter, meaning all the words end in the interpreter itself or fold the
|
|
33:02
|
|
interpreter in whatever form they want into their own execution. And by doing this we enable lots of branch predictor
|
|
33:10
|
|
slots, because each of these end-of-word interpreter copies is going to get its own branch predictor entry. So we can
|
|
33:17
|
|
actually get this down to five bytes if we want to. We can use LODS to
|
|
33:22
|
|
basically load a word from RSI and then advance RSI and we can use two bytes to
|
|
33:29
|
|
look up the value in the dictionary and we can use two bytes later to jump to
|
|
33:34
|
|
the address for the next thing to run. So now we've gotten down to a five byte interpreter. Another thing I think I
|
|
33:40
|
|
would do for variation is I would make source more free-form, not strict reverse Polish notation. In other words, with
|
|
33:48
|
|
regular reverse Polish notation, you're going to have words that are going to push values onto a stack, and therefore your
|
|
33:53
|
|
granularity is at the word level. If instead we have arguments, we can fetch
|
|
33:59
|
|
a bunch of values off the stack right off the bat and we can look them up in the dictionary in parallel. And now our
|
|
34:05
|
|
branch granularity is dropping significantly, maybe by a factor of two or four depending on what our
|
|
34:12
|
|
common argument count is. I think this is a better compromise for when you're doing lots of code generation, which is
|
|
34:17
|
|
what we'll be doing on the CPU, mostly GPU code generation. So, for the editor, I'll just do an advanced 32-bit hex
|
|
34:24
|
|
editor. I'll split the source into blocks, and then each one of those blocks will be split into subblocks. And
|
|
34:31
|
|
the subblock will be source and then some annotation blocks. And so for every
|
|
34:37
|
|
32-bit source word, I'm going to have, say, 64 bits
|
|
34:42
|
|
of annotation information split across two 32-bit words. And that'll give me eight characters. Each of those
|
|
34:49
|
|
characters will be 7-bit. And I'll have an 8-bit tag for the editor. And the tag will give me the format of the 32-bit value
|
|
34:56
|
|
in memory and give me whatever else I want in there. So I can adapt, you know,
|
|
35:01
|
|
for whatever I feel like doing in the future. And I can make this pretty uniform because I'll mostly have numbers
|
|
35:07
|
|
and then I'll have direct addresses to words inside this. So that's it for now. This is a late welcome to 2026. I used
|
|
35:15
|
|
the holiday for some deep thinking, but I think it's time now for some more building. Take care. |