684 lines
38 KiB
Plaintext
0:00 Welcome to Forth and Beyond. My name is Timothy Lis and this is the Neoenographics channel. This talk will not be covering standard Forth. Instead, this talk is going to start with the "beyond Forth" part. Let's begin. What if we didn't actually need Visual Studio? What if we didn't need a separate debugger, or even the C language?
0:17 Let's start with the first principle: question everything until the problem is truly minimized. Begin by peeling the onion of computing, passing through APIs, compilers, languages, code generation, and so on. Search the alternative realities until greatness is found. And we'll start by rewinding time and learning from the past masters.
0:34 We'll start with the most basic interactive computer tool, the calculator. My favorite calculator was the HP48; that's what I used. The HP48 used reverse Polish notation, which made it very easy to type in math and get answers: you didn't have to mess with parentheses. The HP48 provided RPL. The later machines provided System RPL, which could even assemble machine code, with offline tools for the HP48. People even built games for these machines. Now, what if we were to take that calculator and evolve it into something more Forth-like?
1:06 We'll start with simple reverse Polish notation calculator math. Next we'll introduce a dictionary. The dictionary will point to positions on the data stack. For instance, in the second line here, we have a red word "4k". That "4k" word would point to the next stack item, and we can do some evaluations to come up with the number for it. So we type in 1024, then type in 4, and then type in multiply, and now we have 4096. So the "4k" word would point to 4096. This is a basic way of doing a variable.

1:37 The next thing we could do is build numbers, or sequences of numbers, which represent op codes: things that we could actually execute to perform an operation on the machine. So in this case there's "drop", and "drop" points to a number on the data stack which disassembles to add esi, -4 followed by a return. "drop" would basically drop the top item from the data stack, where esi is pointing to the data stack. And now, once we have this in our dictionary, we can continue to do things on the stack, and we can use "drop" if we want to. So now we could write "4k", which would pull that number 4096 that we had put in the dictionary earlier, then do 1, then do 2, and then a plus, which would create a 3, and then we execute "drop", which drops the 3, leaving 4096 on the stack. Thus we've now created something quite powerful. So in this context, the gold numbers get pushed on the stack, and a gold word gets its value in the dictionary pushed on the stack.
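The calculator-plus-dictionary mechanics described so far can be sketched in a few lines. This is a hypothetical Python model for illustration, not colorForth itself: numbers push, "*" and "+" operate on the stack, a red-style definition binds a name to the current top of stack, and invoking the name pushes that value back. (A real Forth points words at stack or memory addresses rather than copying values.)

```python
# Minimal RPN machine with a dictionary, sketching the "4k" example.
# Hypothetical model for illustration only.

def rpn(tokens):
    stack, words = [], {}
    for tok in tokens:
        if tok == "*":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif tok == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif tok == "drop":
            stack.pop()                     # drop top of data stack
        elif tok.startswith(":"):
            words[tok[1:]] = stack[-1]      # red word: bind name to top of stack
        elif tok in words:
            stack.append(words[tok])        # gold word: push its dictionary value
        else:
            stack.append(int(tok))          # gold number: push it
    return stack

# 1024 4 * defines 4k = 4096; later "4k 1 2 + drop" leaves 4096 on top.
print(rpn(["1024", "4", "*", ":4k"]))                                       # [4096]
print(rpn(["1024", "4", "*", ":4k", "drop", "4k", "1", "2", "+", "drop"]))  # [4096]
```

Note how "drop" here is just another dictionary lookup; in the talk's model it would instead point at executable machine code on the data stack.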
2:37 A green word gets its value in the dictionary executed, and a red word puts a pointer to the top of stack into the actual word in the dictionary. In some respects, you can see how this starts to create an extremely powerful system. So a Forth-like machine is really the ultimate form of tool building. The language is free-form. The dictionary defines words, and these words become the language you program in. It enables any kind of factoring of a problem. The language, the assembler, the compiler, the linker, the editor, the debugger: they're all defined in the source itself. And these systems can be tiny, tiny as in the whole thing fits in the cache.

3:06 In my opinion, a Forth-like machine would have been a better option than BASIC for a boot language. A lot of people learned BASIC because they could type in a program from a book, say on the C64. But imagine if it was a Forth machine instead: you'd have something that runs significantly faster and is significantly more powerful. The irony here is that later Apple, IBM, and Sun actually used a Forth-based Open Firmware, but few had programmed in it at the time.

3:38 So, let's look back at Forth. Forth was invented in 1968 by Chuck Moore, or Charles Moore. Chuck later focused on building multiple stack-based processors. He used his own VLSI CAD system, OKAD, for layout and simulation, and these tools were written in his language. Early on it was a sourceless system, and later it got moved to colorForth, from my understanding. The images below show some of the actual editor and simulation; these are from the UltraTechnology site. What's impressive here is that these were dramatically small systems, and yet they were used to do some of the most complicated stuff that humans can do, which is design chips that actually got fabricated and got used.
4:16 Chuck Moore's colorForth, I think, is worth learning about. It's an example of real system minimization: a 32-bit reverse Polish notation language. It provides a data stack, which gives you memory to work with, and note that code is compiled onto the data stack too. It provides dictionaries, which map a name to a value. The value is typically a 32-bit number or a 32-bit address into the source or data stack. The dictionaries are searched in linear order, from the last to the first defined word. There are two main dictionaries: forth, which is used for words to call, and macro, which is a secondary dictionary used for words that do code generation. Source is broken up into blocks; there is no file system. Inside the source blocks are 32-bit tokens. These tokens contain 28 bits of compressed name, or string, and four bits of tag. The tag controls how to interpret the source token.

5:07 Let's go through some of the tags. The white tag means an ignored word. The yellow tag means execute: if it's a number, we append the number on the data stack; if it's a word, we look up the word in the dictionary and then we call the word. If it's a red word, we're doing a definition: we're setting the word in the dictionary to the top of the stack, or a pointer to the top of the stack. If it's green, we're compiling. If it's a green number, we're appending a push of the number onto the stack; effectively, we're encoding the machine language that would push that number. If we compile a word, we're first going to look up the word in the macro dictionary, and if it exists, we're going to call it. Otherwise, we look up the word in the forth dictionary, and we append a call to the word itself. Cyan, or blue, is used to defer a word's execution: we look up the word in the macro dictionary, and we append a call to the word. This way, we can make words that do code generation that call other words that do code generation. Next is the variable, which uses magenta.
6:16 A variable sets the dictionary value of the word to the pointer to the next source token in the source code as it's being evaluated. And then, any time we have a yellow-to-green transition, we pop a number off the stack and we append a push of that number to the data stack, which basically means we're taking a number and turning it back into a program, a program that pushes the number.

6:36 So let's look at some of the blocks inside colorForth, and notice this one here, block 18. This starts the code generation. It'll push 24 and then execute load, which takes block 24 and brings in all the code generation macros. And then the next one, 26 load, brings in more code generation macros from block 26. If you look at block 24, it starts by executing macro, which moves us to making defines in the macro dictionary. The first define is swap. Then it does 168B, and then it does a "2,". The "2," pushes two bytes onto the data stack. The next one is C28B0689, followed by a ",". The "," pushes four bytes onto the data stack. So effectively what we're doing is pushing bytes that form actual code onto the data stack, where swap is defined. And if we disassemble these six bytes, we get mov edx, dword ptr [esi].
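The "2," and "," words append their value to the code area as little-endian bytes. A sketch of that byte building in Python (my own model, using the byte values from the swap example above):

```python
# Sketch of colorForth-style "2," and "," : append a value to the code
# area as 2 or 4 little-endian bytes. The hex values are from the swap
# example above; this is an illustration, not colorForth source.

code = bytearray()

def comma2(v):
    code.extend(v.to_bytes(2, "little"))   # "2," : append two bytes

def comma4(v):
    code.extend(v.to_bytes(4, "little"))   # ","  : append four bytes

comma2(0x168B)       # bytes 8B 16        -> mov edx, dword ptr [esi]
comma4(0xC28B0689)   # bytes 89 06 8B C2  -> mov [esi], eax ; mov eax, edx

print(code.hex())    # 8b1689068bc2
```

Writing the value little-endian is what makes the hex constants in the block read "backwards" relative to the instruction bytes.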
7:40 So effectively we're pulling from the stack into edx, and the stack in this case is the data stack of Forth. The next one is mov dword ptr [esi], eax. So we're pushing the existing cached value of the top of the stack, which is in eax; we're putting that on the stack. And then we mov edx into eax, which takes the old second value on the stack and puts it into the cached value, which is the top of the stack in colorForth. So basically this whole block is defining op codes that are used for code generation.

8:14 So let's fast-forward now and critique colorForth. Perhaps one of the biggest critiques of colorForth is that it's a mismatch to hardware today. It's a stack-based machine, and modern machines are register-based. Modern machines have really deep pipelines; they don't deal with branching well, and Forth is extremely branch-heavy. The interpreter costs that you have to pay per token are pretty high: we have to branch based on tag, and dictionaries are searched from last added to first added, with no hashing or any acceleration.
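That last-to-first search order is also what makes redefinition work: the newest definition of a name shadows older ones. A small hypothetical Python sketch:

```python
# Dictionary as an append-only list of (name, value) pairs, searched
# linearly from the last-defined word back to the first, as described
# above. Illustrative sketch, not colorForth internals.

dictionary = []

def define(name, value):
    dictionary.append((name, value))

def find(name):
    # Walk from newest to oldest; a newer definition shadows older ones.
    for n, v in reversed(dictionary):
        if n == name:
            return v
    raise KeyError(name)

define("size", 100)
define("size", 200)          # redefine: later lookups see the new value
print(find("size"))          # 200
```

The cost is linear in dictionary length per lookup, which is exactly the acceleration-free behavior being critiqued here.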
8:43 Most commonly, every time you interpret, after branching on the tag you're going to branch again to another thing, which is going to be a mispredicted address. And note, you get an average 16-clock stall on, say, Zen 2 for a branch misprediction. Of course, the logical response here is that if you only have a tiny amount of code, there's no reason it has to be super fast. After all, the most important optimization is doing less total work. For example, an F1 car driving 1,000 meters is going to be substantially slower than a turtle walking one foot. Well, towards the end of 2025, Chuck Moore said, "I think fate is trying to tell me it's time to move on."
9:20 And this was in response to Windows auto-updating and then breaking his colorForth. But I ask, should we actually move on? The world did move on, to mass hardware and software complexity, but perhaps Chuck's way of thinking is exactly what is needed today. How about a localized reboot? We have a lot of FPGA-based systems showing up, and I'm hopeful that they're finding commercial success, but these are effectively all emulators of prior hardware. What about doing something new? Maybe Forth thinking could be a part of that. What about neo-vintage parallel machines? After all, Forth thinking is ideal for a fixed hardware platform.
9:56 FPGA-based hardware emulators focus mostly on the serial-thinking era. But there is actually a universal speed-of-light barrier here: these product lines are going to stop around the N64 and so on, because after that, serial CPU clock rates cannot be FPGA-emulated. But FPGAs have crazy parallel DSP capabilities. Perhaps we should design for DSPs as the processors, and provide radically parallel but medium-clock machines, and these are things we could actually drive with a Forth-style language. There is the challenge of minimalism in a maximalist world. Software is a problem, but the root is hardware complexity growth. For example, the RDNA4 ISA guide is almost 4 megabytes in itself. And try writing a modern USB-C controller driver yourself. And yet, even with all of today's hardware complexity, I still believe Forth-inspired software can be quite useful.

10:45 I spent a lot of time exploring the permutation space around Forth, specifically more around colorForth, and seeing what variations could be made. One way I varied from Forth was in the op encoding: I don't necessarily stick with a stack-based language. Sometimes I treat the register file more like a close, highly aliased memory. Sometimes I use a stack-based language as a macro language, say for a native hardware assembler. And sometimes I mix a stack-based language with something that has arguments, for instance having a word take arguments after the word, and still use it like a stack-based language.

11:22 So I have used Forth-like things in commercial products. One example: I used to run a photography business and a software development business, and the old business website that I ran in my prior life doing landscape photography was actually generated by a Forth-like language running server-side, which generated all the HTML pages. It made managing a huge website actually practical. Now, I had to use the Wayback Machine to find this, so sorry in advance for the broken images. And of course, I had a different last name then, from a broken marriage, but that's another story. But I did a lot more Forth-like things beyond this one.

11:54 One of the things that got me right away was, of course, the lure to optimize. For example, colorForth uses a Huffman-style encoding for the names in its tokens. Remember, a source token is a T-bit tag, typically T is 4, with an S-bit string, where S is 28 bits. We could do a better job of encoding those 28 bits. For instance, we could split the full number range by some initial probability table for the first character, and then split each of those ranges by, say, a one- or two-character predictor, and then train this thing on a giant dictionary. And of course, you're going to have to use lookup tables. And of course, the memory used for the predictor is going to be greater than the rest of this whole system combined. And yeah, it worked. It provided some very interesting stuff: you could put a number in and it would basically spit a string out, which was pretty cool. This journey, I think, was useful. I learned a lot of things in the process, like where to optimize and where not to optimize.

12:57 The next question is: should we hash, or should we not hash? It turns out that a compressed string, like in the prior slide, is actually a great hash function. I can simply mask off some number of the least significant bits, and that becomes my hash function.
13:10 For hashing, I always disliked the issue of only partly using cache lines; that's not very efficient. And of course, we can try to fix that too. We can check a very tiny hash table first, and size that hash table to stay in the cache. Then, if we miss on that, we can go to the full-size one, the one that's going to have pretty poor utilization of cache lines. And assuming lots of reuse, that tiny hash table is going to keep high cache utilization.
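The two-level scheme just described, a tiny cache-resident table checked before the full one, can be sketched like this (hypothetical Python; the "hash" is just the low bits of an already-compressed name, as described above):

```python
# Two-level lookup: a tiny direct-mapped table sized to stay in cache,
# backed by a full-size dictionary. Illustrative sketch only.

TINY_BITS = 4
tiny = [None] * (1 << TINY_BITS)   # tiny direct-mapped, cache-resident table
full = {}                          # full-size backing dictionary

def insert(key, value):
    full[key] = value
    tiny[key & ((1 << TINY_BITS) - 1)] = (key, value)

def lookup(key):
    slot = tiny[key & ((1 << TINY_BITS) - 1)]
    if slot is not None and slot[0] == key:
        return slot[1]             # hit in the tiny table: cheap and cache-hot
    return full[key]               # miss: fall back to the full table

insert(0x12345, "dup")
insert(0x54325, "swap")            # same low bits: evicts 0x12345 from tiny
print(lookup(0x12345), lookup(0x54325))   # dup swap
```

With lots of reuse, most lookups hit the tiny table; only evicted keys pay for the poorly-utilized full-size structure.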
13:35 However, now we've done two stages of optimization. We really should start asking: why are we hashing, and why are we compressing? Why are we doing all this overhead? Why do we not just direct-map? After all, if we're depending on an editor, we could just direct-map, or perhaps just address into the dictionary directly. Then we can split off the T-bit tag and the S-bit string for editor use. And that can start simplifying things, so we don't have all this complexity in the first place. The next thing we can do, if we're interpreting, is solve the problem of branch misses. Normally with an interpreter, you would evaluate the word and then return back to the interpreter.
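That evaluate-then-return shape is the classic interpreter loop below; the indirect call at its center is the branch that keeps mispredicting on real hardware. (A hypothetical Python model of the control flow, not real colorForth.)

```python
# Classic interpreter shape: a central loop looks up each word and makes
# one indirect call per token. The call target changes from token to
# token, which is exactly the branch a real CPU keeps mispredicting.

stack = []

words = {
    "one":  lambda: stack.append(1),
    "two":  lambda: stack.append(2),
    "plus": lambda: stack.append(stack.pop() + stack.pop()),
}

def interpret(tokens):
    for tok in tokens:          # the interpreter loop
        words[tok]()            # indirect call: evaluate the word...
                                # ...then fall back into the loop

interpret(["one", "two", "plus"])
print(stack)                    # [3]
```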
14:12 That interpreter would look up another word and do another branch, but that branch would always be mispredicted. One option is to just fold the interpreter back into the words themselves. But of course, we've got to make that interpreter really small, otherwise we're doing a lot of code duplication: imagine if you have a thousand words, you're going to embed the interpreter a thousand times. So there are a lot of different ways we can design an interpreter down to a few bytes. For instance, this one is an 8-byte interpreter; this is one I've never actually used. You can do better than 8 bytes, and I'll show you that towards the end.

14:45 Of course, the best way to learn is to build stuff. So I built many colorForth-inspired things over the years. With some, like the one to the right here, I got distracted with editor graphics, effectively making something extremely nice to use and very pretty. This one was cool: the dictionary I moved into the source itself, and I did a direct binary editor. So in this thing you'd actually see the cache lines, and you're effectively using a hex editor that uses tags to give you some contextual information. Each line has a comment on the top, followed by the data on the bottom. And of course I used different fonts, because sometimes I'm packing a full number and sometimes I'm packing characters and so on in comments. It was a relatively complicated system, but actually simple when you think about it in the context of what we build today.

15:28 One of the first questions to ask yourself is whether you want to work with text source versus a binary editor. So sometimes I would work with text source. In order to make this work well, I would have a prefix character in front of every word, which basically would be the tag. It would also enable me to use very simple syntax coloring inside, say, nano. Most of these I built were more like a Forth macro language that was used to create a binary. So what I would do, and for instance what you can see on the right, is define something that would enable me to build the ELF header, and then after the ELF header was built, I would actually write the assembler in the source code, and then finish off the rest of the binary. These kinds of languages are extremely small, and the whole thing is in, say, a few-kilobyte binary.

16:16 The other thing I do with these is bootstrap. So the first time, I might write the thing in C, and I'd run the interpreter in C. Then later I would rewrite the interpreter inside the source code and compile that, and now I would be bootstrapped into the language itself. And by doing that, I could actually compare my C code to the code I wrote inside my own language.
16:43 And of course I'm faster inside my own language than in the C code. And of course I'm a lot smaller in the binary as well, because I have a very, very small ELF header in the binary that I generated, compared to the one that, say, GCC would generate. I built some custom x86 operating systems. It was fun to build custom VGA fonts, and of course mess with the palette entries to improve the initial colors. I did lots of different Forth variations, but typically these projects just got blocked on the mass complexity of today's hardware. Meaning, once you get down to the point where you want to, say, draw something on the screen, other than using the old DOS VGA frame buffer, or you want to start using input, you start needing a USB driver, and then all of a sudden everything turns into a nightmare. One thing I mentioned before is that it's very nice to use a Forth-like language as a macro assembly language.
17:33 In traditional assembly language you do something like, say, add ecx, edx, and then a comment: advance pointer by stride. The latter part here is heavily commented; in fact, typically assembly is mostly comments, otherwise a human can't really understand it. When you start using a Forth-like language as a macro assembler, a lot of the time, instead of using the register number, you would just put the register number inside another word and then use that word. So now you start self-documenting. And if you had common blocks of, say, multiple instructions, you would start defining those in some other word, and then you start factoring. This way you self-document everything, and it becomes actually very easy to understand, a lot easier to understand than raw assembler. And on top of this, of course, you can also put comments, but you typically don't need as many.

18:22 So if we look back at some of the lessons of all these projects, I think the key thing is that when your OS is an editor, is a hyper-calculator, is a debugger, is a hex editor, you end up with this interactive, instant-iteration software development, and that part is wonderful. The Forth style of thinking keeps source small enough that it's approachable by a single developer, and that, I think, is very important. You basically build out your own tools for exactly the way you like to think, and that's where its true beauty lies. Others like Onot have built full systems, meaning he is running something that actually works with Vulkan and generates SPIR-V.

19:05 So there is another option, and that is going sourceless. No language, no assembler: the code is the data, or the data is the code. Chuck's OKAD was a source of inspiration. I've only read about it, but it did send me down a spiral of trying various ideas related to what I read. So when we think about sourceless programming, it's best to just work from the opposite extreme.
19:24 Start with, say, a hex editor, and then work towards what we would need to make that practical for code generation. So I think of a binary as an array of, say, N 32-bit words, and then we could have another thing, an annotation, which is an array of N 64-bit words. The annotation could provide a tag which gives context to the data, or could control how the editor manipulates the data. The annotation can also provide an 8-character text annotation for the individual binary words, which serves as documentation for what the word is for.

19:56 So part of sourceless programming is how you generate code, and with Forth, hand-assembling words is actually relatively easy, because you don't have that many low-level operations if you're doing a stack-based machine. I invented something called x68, which I'm planning to do a separate talk on. It's a subset of x64 which works with op codes at 32-bit granularity only. Note that x86-64 supports ignored prefixes, which can pad op codes out to 32 bits. And we also have multi-byte NOPs, which can align the ends too. And we can do things like adding REX prefixes when we don't need them, again to pad out to 32 bits. So, for instance, if we wanted to do a 32-bit instruction for return, we might put the return, which is a C3, and then pad the rest of it with a three-byte NOP. And once we've built this 32-bit return number, which we annotate, we can insert a return anywhere just by copying and inserting this word in the source code. And later, if we built different op codes, and say they were multi-word, we can just use find-and-replace to change those. Effectively, we're turning compilation into edit-time operations.

21:05 One of the nice things about being at 32-bit granularity for the instruction is that the 32-bit immediates are now at 32-bit granularity as well. And so now we can just make it so that we have a tag which says "this is an op code" and a tag which says "this is, say, an immediate hex value".
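A sketch of building such fixed 32-bit words in Python. The byte values are standard x86 (C3 is ret, 0F 1F 00 is a three-byte NOP, 3E is the ignored DS-segment prefix, BE is mov esi, imm32), but the exact padding choices here are my own illustration, not necessarily the x68 encoding:

```python
# Building fixed 32-bit "instruction words" for a subset of x86, in the
# spirit of the x68 idea above. Illustrative sketch.

def word32(*bs):
    assert len(bs) == 4                   # every unit is exactly 4 bytes
    return bytes(bs)

RET = word32(0xC3, 0x0F, 0x1F, 0x00)      # ret, padded with a 3-byte NOP
MOV_ESI_IMM = word32(0x3E, 0x3E, 0x3E, 0xBE)  # ignored DS prefixes + mov esi, imm32

imm = (0x12345678).to_bytes(4, "little")  # the immediate lands on its own 32-bit word

code = RET + MOV_ESI_IMM + imm
print(len(code), code.hex())              # 12 c30f1f003e3e3ebe78563412
```

Because every unit is exactly four bytes, op codes and immediates can each carry their own tag, and the editor can copy, insert, or recolor them as whole words.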
21:26 And we could show them separately with different colors. In this case, I have code for setting esi to a 32-bit immediate. And you'll notice that this one is using the 3E ignored DS segment selector prefix to pad out the op code to 32 bits. And then after that, we have a silly number which we're setting into esi. That silly number is 1 2 3 4 5 6 7 8 in hex. So it's very easy to do inline data this way. Of course, calls and jumps are another question, and we have an easy solution for that one as well.
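One way to sketch that solution: tag each relative operand with its absolute target, and have the editor recompute the offset whenever words move. (Hypothetical Python model; offsets are in 32-bit words and measured from the end of the operand's instruction, as with x86 call/jmp rel32.)

```python
# Relinking sketch: rel32 operands are annotated with an absolute target
# word index, and recomputed whenever words move. Illustrative model.

words = ["op", "rel", "op", "op", "target"]   # rel at index 1 aims at index 4
rel_targets = {1: 4}                          # annotation: operand index -> target index

def relink():
    # rel32 is relative to the end of the operand's instruction word.
    return {i: t - (i + 1) for i, t in rel_targets.items()}

print(relink())                               # {1: 2}

# Insert a word before the target: the editor fixes the offset up.
words.insert(3, "new-op")
rel_targets[1] = 5                            # target moved down by one word
print(relink())                               # {1: 3}
```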
21:57 In x86-64, call and jump use a 32-bit relative immediate, and that relative is relative to the end of the op code, not the beginning. And so if we want to make an editor support this, we would just tag the relative branch address as a word that is a relative address. And then, when we start editing the words inside the binary, and say we move things around, we would just relink all of the words in the binary that have a relative address. So as code changes, things just get fixed up. And this effectively solves the call and jump problem; it's very easy to make an editor which repatches everything.

22:34 Conditional branches: you might think those are complicated, but they're actually not. A conditional branch is just an 8-bit relative branch address. And so when I make words of these, I would say "jump-unequal minus 2", which would jump, if unequal, to the word that is two words before this one, or say "jump minus 4" for four words back, and so on. And so I can just build a few of these constructs and change the op codes around whenever I need, say, a jump-on-zero or so on. The nice thing about this is that now you no longer have to label things, because you just go and count, and when you move it around it's all relative, so you don't need to do any patching. If you want to add more stuff in your loop, you just change the op code a little.

23:22 Another option is that the editor could have built-in 32-bit word assembly and disassembly, meaning I could use a shorthand for registers and op codes. And the shorthand that I would use labels the registers starting with G, so that 0 through F can be used for hex. This is an example of how you might want to do it: in this case I have "h + at i 0 8", which disassembles to add rcx, qword ptr [rdx + 0x8]. We can shorthand this very easily, and so I could have an editor that shows you either the disassembly or the shorthand instruction, and that aids understanding and the ability to insert stuff without using separate tools.

24:04 So, I did build this sourceless system once, for real. It was back when I was building tiny x86-based operating systems. I built the editor as a console program, so it would run in Linux, and I would build binaries in that console program and then use an x86 emulator running in Linux to actually test them. And this was a pretty liberating experience; I learned a lot from it. On the right, I'm showing one of the boot sectors of one of the examples running in the editor.

24:33 Note that with sourceless programming we could extend the annotation quite a bit. For instance, we could have tables that map register numbers to strings, and then for each cache line we could have an index into a table. In this way, registers could be automatically annotated. For instance, if register 2 is set to "position" and register 4 is set to "velocity", and we had add r2, r4, we could just show add pos, velocity, and that would make it a lot easier to understand, automatically. We could also extend this and have each cache line carry a string as well. This way we could automatically annotate, say, a label: the first word maybe could be the label, and the rest of the string could be used for a comment. So there are a lot of ways to do sourceless programming where we just provide annotation tools to make it a practical experience.

25:22 So let's talk about some of the variations, the pros and cons. The easiest way to start would perhaps be to work with text source. And usually with text source, you're going to use prefix characters for the tag: for instance, slash for comment, colon for define, maybe tick for execute. This does have the slowest runtime, however, because you have a character-granularity loop. It does have a benefit in that, to get started, there is no editor you have to write: you can use an external text editor, and you can do easy custom syntax coloring. It's going to be very easy to understand and to work in. However, I think you're missing a big piece if you go down this path, and that is that you don't get any live interaction or debugging. You're basically depending on the fast compile times of your custom language and the fast load times of whatever program you're building to get you into that iteration loop. And you can work this way, I've done it many times, but the experience is nowhere near as good as doing the fully interactive one with a binary editor. I guess one of the other benefits here is that you get very easy code portability.
26:28 There are no binary files, just text files; you can copy and paste as you will. Of course, the next jump up from that is going to binary source. This would be middle-performance at runtime, because now you're working at word granularity inside your interpreter. You have portability, because you have code generation that can adapt to the system.
|
|
26:41
|
|
Meaning, as you interpret your code, you can look at what's underlying in the hardware and you can make changes to how
|
|
26:46
|
|
the code is generated at runtime. You can build just what bits of an assembler are needed. You don't need to build out
|
|
26:52
|
|
everything like you would with say a disassembler tool or an assembler tool. So for instance with x86 I don't
|
|
26:58
|
|
actually generate much of the ice. I only use a very very tiny subset. You do have to write the binary editor and that
|
|
27:04
|
|
can be a lot of work and sometimes that presents a problem with bootstrapping because you don't have the language to
|
|
27:10
|
|
write the editor in from the beginning. So you have to write the editor in some other language and then in that language
|
|
27:15
|
|
write the editor again and then you know complete the bootstrapping process. The
|
|
27:21
|
|
one benefit here is you get interactive development and debugging from the beginning, and also now your source code shows how
|
|
27:28
|
|
constructs are built instead of just showing the result as you would get with, say, sourceless. One interesting thing
|
|
27:33
|
|
about Forth with binary source and this concept that you can rebuild source
|
|
27:39
|
|
code at runtime at any point in execution is that now you can start compiling things you load into machine code and
|
|
27:46
|
|
then you don't have to pay all the interpreter overhead. Basically you can bake things into machine code anytime at edit time
|
|
27:52
|
|
and that's a very powerful feature. Now if we go and look at sourceless programming there's a bunch of pros and
|
|
27:58
|
|
cons. The one pro is that it's the fastest runtime; it's true native code. You can build things that are tightly
|
|
28:04
|
|
optimized for a platform and they're as optimized as you could possibly get. You
|
|
28:09
|
|
do however lose capacity for showing how constants came to be and you lose the
|
|
28:15
|
|
capability of adapting to what the machine is. You can work around this, however: you can make smaller embedded
|
|
28:20
|
|
constants that you write into modified bit fields of instructions, but it's
|
|
28:26
|
|
really taking it a little too far in terms of complexity. You do need to write an editor which includes an opcode
|
|
28:31
|
|
assembler and disassembler potentially if you want to go down that route. And if you're going to do CPU and GPU,
|
|
28:37
|
|
that's a lot of work. It can be very complicated when systems include auto annotation. For instance, if you want to
|
|
28:44
|
|
type in, say, a readable register name and then have it go and figure out what register number that is. I guess the
|
|
28:50
|
|
primary disadvantage here is that there's no possibility for portability. You have a raw binary editor. And today
|
|
28:57
|
|
we have a problem with GPUs. Steam Deck is RDNA 2, Steam Machine's RDNA 3, and
|
|
29:04
|
|
who knows, the future may be RDNA 4, 5, or 6. The problem with this is that when the ISA changes across those
|
|
29:10
|
|
chipsets, that can result in different instruction sizing. So you can't just do
|
|
29:16
|
|
simple sourceless. For example, MIMG in RDNA 2 has a different size than VIMAGE
|
|
29:22
|
|
in RDNA 4. And that's all due to the ISA changes. So if you do sourceful
|
|
29:27
|
|
programming, you can port through a compile step. However, with sourceless, you would need to do something else. And
|
|
29:33
|
|
perhaps that is just rewriting chipset specific versions of all the shaders, but that may be something that you don't
|
|
29:39
|
|
want to do. So thinking ahead on the SteamOS/Linux project that I'm working on, effectively I'm building an AMD-only
|
|
29:46
|
|
solution. I don't really have a name for this project, so I'm going to call it Fifth. I think for Fifth I want a mix of
|
|
29:52
|
|
various Forth-style concepts, and perhaps the best mix would be the best
|
|
29:57
|
|
of both worlds. A fast high-level interpreter that's intended for doing GPU code generation where I need chipset
|
|
30:04
|
|
portability at runtime mixed with low-level sourceless words for the CPU
|
|
30:09
|
|
side for simple x86-64 where I don't need portability at this time. Since this is
|
|
30:15
|
|
for Linux, we should think about what we can do on Linux that we might not be able to do on Windows today. We first
|
|
30:21
|
|
start thinking about execution. With the Linux x86-64 ABI, when you're running
|
|
30:26
|
|
without address space randomization, the execution starts at a fixed 4 megabytes
|
|
30:31
|
|
in, and you can still do this today if you compile without -fPIE. Also note,
|
|
30:37
|
|
even if you did get position-independent execution, you could effectively just map where you want your fixed
|
|
30:44
|
|
position to be and then just start execution there and just ignore the original mapping that they threw you at.
|
|
30:50
|
|
Another thing we can do is we can use page maps. And if we look at Steam Deck, we'll notice that dirty pages get
|
|
30:56
|
|
written back about as often as every 30 seconds, which is an important number. It means it won't be overwriting too fast.
|
|
31:03
|
|
So, let's look at something we can do on Linux that we're not allowed to do on Windows anymore. The self-modifying
|
|
31:08
|
|
binary. The idea here is that we have a cart file. The cart file represents the ROM. Actually, in this case, it's a RAM
|
|
31:15
|
|
because we're going to be modifying it. So, first we would execute the cart file. And when the cart file runs, it
|
|
31:21
|
|
would realize that it's not the backup. And then it would copy itself to a backup file, cart.back, and then it would
|
|
31:28
|
|
launch cart.back and then exit. This cart.back would realize that it is the backup file, and then it would map
|
|
31:35
|
|
the original file cart at, say, 6 megabytes in, and it would make that mapping read/write/execute, and
|
|
31:42
|
|
then afterwards it would map an adjustable zero fill and that would be for when we're doing compilation or when
|
|
31:48
|
|
we have data that we don't want to be backed up all the time and after that it would jump to the interpreter and so if
|
|
31:54
|
|
we look at the memory mapping we'd have at 4 megabytes we would have say a 4 kilobyte boot loader section
|
|
32:01
|
|
And then we would have somewhere say at 6 megabytes we'd have the whole file and then after that we would have the zero
|
|
32:07
|
|
fill. And the nice thing about this is we automatically create a backup. We don't have to write any code to save the
|
|
32:13
|
|
file because it's going to autosave every 30 seconds. Also, the data and code are together, and we can make a new
|
|
32:19
|
|
version just by copying the file itself. Inside the binary we'd have a specific spot for the size of the file and the
|
|
32:25
|
|
size of the zero fill. So as part of this execution process, we can very easily resize things when
|
|
32:31
|
|
we build the cart.back file. So for source code, I'm thinking 32-bit
|
|
32:36
|
|
words. For words, they're going to be direct addresses into the binary, and the binary is going to be the dictionary.
|
|
32:43
|
|
Makes it quite fast to interpret. The nice thing here is they can be direct addresses because we fix the position.
|
|
32:50
|
|
We're not using address space randomization. We'll just fix RSI to the interpreter source position, and then all
|
|
32:56
|
|
the words will contain the NEXT of the interpreter, meaning all the words end in the interpreter itself or fold the
|
|
33:02
|
|
interpreter in whatever form they want into their own execution. And by doing this we enable lots of branch predictor
|
|
33:10
|
|
slots, because each of these end-of-word interpreter copies is going to get its own branch predictor entry. So we can
|
|
33:17
|
|
actually get this down to five bytes if we want to. We can use LODS to
|
|
33:22
|
|
basically load a word from RSI and then advance RSI and we can use two bytes to
|
|
33:29
|
|
look up the value in the dictionary and we can use two bytes later to jump to
|
|
33:34
|
|
the address for the next thing to run. So now we've gotten down to a five byte interpreter. Another thing I think I
|
|
33:40
|
|
would do for variation is I would make source more free-form, not strict reverse Polish notation. In other words, with
|
|
33:48
|
|
regular reverse Polish notation, you're going to have words that are going to push values onto a stack, and therefore your
|
|
33:53
|
|
granularity is at the word level. If instead we have arguments, we can fetch
|
|
33:59
|
|
a bunch of values off the stack right off the bat and we can look them up in the dictionary in parallel. And now our
|
|
34:05
|
|
branch granularity is dropping significantly, maybe by a factor of two or four depending on what our
|
|
34:12
|
|
common argument count is. I think this is a better compromise for when you're doing lots of code generation, which is
|
|
34:17
|
|
what we'll be doing on the CPU, mostly GPU code generation. So, for the editor, I'll just do an advanced 32-bit hex
|
|
34:24
|
|
editor. I'll split the source into blocks, and then each one of those blocks will be split into subblocks. And
|
|
34:31
|
|
the subblock will be source and then some annotation blocks. And so for every
|
|
34:37
|
|
32-bit source word, I'm going to have, say, 64 bits
|
|
34:42
|
|
of annotation information split across two 32-bit words. And that'll give me eight characters. Each of those
|
|
34:49
|
|
characters will be 7-bit. And I'll have an 8-bit tag for the editor. And the tag will give me the format of the 32-bit value
|
|
34:56
|
|
in memory and give me whatever else I want in there. So I can adapt, you know,
|
|
35:01
|
|
for whatever I feel like doing in the future. And I can make this pretty uniform because I'll mostly have numbers
|
|
35:07
|
|
and then I'll have direct addresses to words inside this. So that's it for now. This is a late welcome to 2026. I used
|
|
35:15
|
|
the holiday for some deep thinking, but I think it's time now for some more building. Take care. |