references/Neokineogfx - 4th And Beyond - Transcript.txt
0:00 Welcome to Forth and Beyond. My name is Timothy Lottes and this is the Neokineogfx channel. This talk will not be covering standard Forth. Instead, this talk is going to start with the beyond-Forth part. Let's begin. What if we didn't actually need Visual Studio? What if we didn't need a separate debugger, or even the C language?

0:17 Let's start with the first principle: question everything until the problem is truly minimized. Begin by peeling the onion of computing, passing through APIs, compilers, languages, code generation, and so on. Search the alternative realities until greatness is found. And we'll start by rewinding time and learning from the past masters.
0:34 We'll start with the most basic interactive computer tool, the calculator. My favorite calculator was the HP48; that's what I used. The HP48 used reverse Polish notation. This made it very easy to type in math and get answers: you didn't have to mess with parentheses. The HP48 provided RPL, and the later machines provided System RPL, which could even assemble machine code, with offline tools for the HP48. People even built games for these machines.

1:00 Now, what if we were to take that calculator and evolve it into something more Forth-like? We'll start with simple reverse Polish notation calculator math, and next we'll introduce a dictionary. The dictionary will point to positions on the data stack. For instance, in the second line here we have a red word, 4k. That 4k word would be pointing to the next stack item. For the next stack item we can do some evaluations to come up with the number: we type in 1024, then type in 4, then type in multiply, and now we have 4096. So the 4k word would point to 4096. This is a basic way of doing a variable.
1:37 The next thing we could do is actually build numbers which represent op codes, or multiple numbers which represent op codes: things that we could actually execute and have do an operation on the machine. So in this case there's drop, and drop points to a number on the data stack which disassembles to add esi, -4 and then a return. Drop would basically drop the top item from the data stack, where ESI is pointing to the data stack. And now, once we have this in our dictionary, we can continue to do things on the stack, and we can use drop if we want to. So now we could write 4k, which would pull that number 4096 that we had put in the dictionary prior, then do 1, then 2, then a plus, which would create a 3, and then we execute drop, which will drop the 3, leaving 4096 on the stack. And thus now we've created something quite powerful.

2:31 So in this context, the gold numbers get pushed on the stack; the gold words get their value in the dictionary pushed on the stack; the green words are getting a value in the dictionary executed; and the red word is putting a pointer to the top of stack into the actual word in the dictionary. In some respects, you can see how this starts to create an extremely powerful system.

2:49 So a Forth-like machine is really the ultimate form of tool building. The language is free-form. The dictionary defines words. These words become the language you program in. It enables any kind of factoring of a problem. The language, the assembler, the compiler, the linker, the editor, the debugger: they're all defined in the source itself. And these systems can be tiny, tiny as in the whole thing fits in the cache. In my opinion, a Forth-like machine would have been a better option than BASIC for a boot language. A lot of people learned BASIC because they could type in a program from a book, say on the C64. But imagine if it was a Forth machine instead: you'd have something that runs significantly faster and is significantly more powerful. The irony here is that later Apple, IBM, and Sun actually used the Forth-based Open Firmware, but few had programmed in it at the time.

3:38 So, let's look back at Forth. Forth was invented in 1968 by Chuck Moore, or Charles Moore. Chuck
3:43 later focused on building multiple stack-based processors. He used his own VLSI CAD system, OKAD, for layout and simulation, and these were written in his language. Early on it was a sourceless language, and later it got moved to colorForth, from my understanding. The images below show some of the actual editor and simulation; these are from the UltraTechnology site. What's impressive here is that these were dramatically small systems, and yet they were used to do some of the most complicated stuff that humans can do, which is design chips that actually got fabricated and got used.

4:16 Chuck Moore's colorForth, I think, is worth learning about. It's an example of real system minimization: a 32-bit reverse Polish notation language. It provides a data stack, which gives you memory to work with, and note that code is compiled onto the data stack too. It provides dictionaries, which map a name to a value. The value is typically a 32-bit number or a 32-bit address into the source or data stack. The dictionaries are searched in a linear order from the last to the first defined word. There are two main dictionaries: the forth one, which is used for words to call, and macro, which is a secondary dictionary used for words that do code generation. Source is broken up into blocks; there is no file system. Inside the source blocks are 32-bit tokens. These tokens contain 28 bits of compressed name or string and four bits of tag.
5:07 The tag controls how to interpret the source token. Let's go through some of the tags. The white tag means an ignored word. The yellow tag means execute: if it's a number, we append the number on the data stack; if it's a word, we look up the word in the dictionary and then we call the word. If it's a red word, we're doing a definition: we're setting the word in the dictionary to the top of the stack, or a pointer to the top of the stack. If it's green, we're compiling. If it's a green number, we're appending a push-number onto the stack; effectively, we're encoding the code, the machine language, that would push that number.

5:43 If we compile a word, we're first going to look up the word in the macro dictionary, and if it exists, we're going to call it. Otherwise, we look up the word in the forth dictionary, and we append a call to the word itself. Cyan, or blue, is used to defer a word's execution: we'll look up the word in the macro dictionary, and we will append a call to the word. This way, we can make words that do code generation that call other words that do code generation. Next is the variable, which uses magenta. Variable sets the dictionary value of the word to the pointer to the next source token in the source code as it's being evaluated. And then any time we have a yellow-to-green transition, we pop a number off the stack and then we append a push-number to the data stack, which basically means we're taking a number and we're turning it back into a program, a program that pushes the number.
6:36 So if we look at some of the blocks inside colorForth, notice this one here, block 18. This starts doing the code generation. So it'll push 24 and then it'll load, which will take block 24 and actually bring in all the code generation macros. And then the next one, 26 load, will bring in more code generation macros from block 26. If you look at block 24, it starts with executing macro, which moves us to making defines in the macro dictionary. The first define is swap. Then it does 168B, and then it does a 2-comma. The 2-comma pushes two bytes onto the data stack. The next one is C28B0689 followed by a comma. The comma pushes four bytes onto the data stack. So effectively what we're doing is pushing some bytes to actually create code onto the data stack, where swap is defined. And if we disassemble these six bytes, we get mov edx, dword ptr [esi]; so effectively we're pulling from the stack into EDX, and the stack in this case is the data stack of Forth. The next one is mov dword ptr [esi], eax; so we're pushing the existing cached value of the top of the stack, which is in EAX, we're putting that on the stack. And then we move EDX into EAX, which is taking the old second value on the stack and putting it into the cached value, which is the top of the stack in colorForth. So basically this whole block is defining op codes that are used for code generation.
8:14 So let's fast-forward now, and let's just critique colorForth. Perhaps one of the biggest critiques of colorForth is that it's a mismatch to hardware today. It's a stack-based machine, and modern machines are register-based. Modern machines have really deep pipelines; they don't deal with branching well, and Forth is extremely branch-heavy. The interpreter costs that you have to pay per token are pretty high: we have to branch based on tag. Dictionaries are searched from last added to first added, with no hashing or any acceleration.
8:43 Most commonly, every time you do an interpretation, after branching on the tag you're going to branch again to another thing, which is going to be a mispredicted address. And note, you've got an average 16-clock stall on, say, Zen 2 for a branch misprediction. Of course, the logical response here is that if you only have a tiny amount of code, there's no reason it has to be super fast. After all, the most important optimization is doing less total work. For example, an F1 car driving a thousand miles is going to be substantially slower than a turtle walking one foot. Well, towards the end of 2025, Chuck Moore said, "I think fate is trying to tell me it's time to move on."
9:20 And this is in response to Windows auto-updating and then breaking his colorForth. But I ask, should we actually move on? The world did move on to mass hardware and software complexity, but perhaps Chuck's way of thinking is actually exactly what is needed today. How about a localized reboot? We have a lot of FPGA-based systems showing up, and I'm hopeful that they're getting commercial success, but these are effectively all emulators of prior hardware. What about doing something new? Maybe Forth thinking could be a part of that. What about neo-vintage parallel machines? After all, Forth thinking is ideal for a fixed hardware platform.

9:56 FPGA-based hardware emulators focus mostly on the serial-thinking era. But there is actually a universal speed-of-light barrier: these product lines are going to stop around the N64 and so on, because after that, serial CPU clock rates cannot be FPGA-emulated. But FPGAs have crazy parallel DSP capabilities. Perhaps we should design for DSPs as the processors, and then provide radically parallel but medium-clock machines, and these are things we could actually drive with a Forth-style language. There is a challenge of minimalism in a maximalist world. Software is a problem, but the root is hardware complexity growth. For example, the RDNA4 ISA guide is almost 4 megabytes in itself. And try writing a modern USB-C controller driver yourself. And yet, even with all of today's hardware complexity, I still believe Forth-inspired software can be quite useful. I spent a lot of time exploring the permutation space around Forth,
10:51 specifically more around colorForth, and seeing what variations could be made. One way I varied from Forth was in the encoding. I don't necessarily stick with a stack-based language. Sometimes I treat the register file more like a close, highly aliased memory. Sometimes I use a stack-based language as a macro language, say for a native hardware assembler. And sometimes I mix a stack-based language with something that has arguments, for instance having a word take arguments after the word, and still use it like a stack-based language.

11:22 So I have used Forth-like things in commercial products. One example: I used to run a photography business and a software development business, and the old business website that I ran in my prior life doing landscape photography was actually generated by a Forth-like language running server-side, which generated all the HTML pages. It made managing a huge website actually practical. Now, I had to use the Wayback Machine to find this, so sorry in advance for the broken images. And of course, I had a different last name then, from a broken marriage, but that's another story. But I did a lot more Forth-like things beyond this one.

11:54 One of the things that got me right away was, of course, the lure to optimize. For example, colorForth uses this Huffman-style encoding for the names in its tokens. Remember, a source token is a T-bit tag, typically T is 4, with an S-bit string, where S is 28 bits. We could do a better job of encoding those 28 bits. For instance, we could split that full number range by some initial probability table of the first character. And then we could split each of those ranges by, say, a one- or two-character predictor. And then we train this thing on a giant dictionary. And of course, you're going to have to use lookup tables. And of course, the memory used for the predictor is going to be greater than the rest of this whole system combined. And yeah, it worked. It provided some very interesting stuff: you could put a number in and it would basically spit a string out, which was pretty cool. This journey I think was useful; I learned a lot of things in the process, like where to optimize and where not to optimize.

12:57 The next question is, well, should we hash or should we not hash? It turns out that a compressed string like in the prior slide is actually a great hash function. I can simply mask off some number of the least significant bits, and that becomes my hash function.
13:10 I always disliked the issue of only partly using cache lines; that's not very efficient. And of course, we can try to fix that too. We can check a very tiny hash table first, and size that hash table to stay in the cache. And then if we miss on that, we can go to the full-size one, the one that's going to have pretty poor utilization of cache lines. And assuming lots of reuse, that tiny hash table is going to keep high cache utilization. However, now we've done two stages of optimization. But we really should start asking: why are we hashing, and why are we compressed? Why are we doing all this overhead? Why do we not just direct-map? After all, if we're depending on an editor, we could just direct-map, or perhaps just address into the dictionary directly. Then we can split off the T-bit tag and the S-bit string for editor use. And that can start simplifying things so we don't have all this complexity in the first place.

14:00 The next thing we can do, if we're interpreting, is solve the problem of branch misses. Normally with an interpreter, you would evaluate the word and then you'd return back to the interpreter. That interpreter would look up another word and do another branch, but that branch would always be mispredicted. One option is we could just fold the interpreter back into the words themselves. But of course, we've got to make that interpreter really small, otherwise we're doing a lot of code duplication. You can imagine if you have a thousand words, you're going to embed the interpreter a thousand times. So there are a lot of different ways we can design an interpreter down into a few bytes. For instance, this one is an 8-byte interpreter. This is one I've never used; you can actually do better than 8 bytes, and I'll show you that towards the end.
14:45 So of course, the best way to learn is to build stuff. So I built many colorForth-inspired things over the years. Some, like the one to the right here, where I got distracted with editor graphics, effectively making something extremely nice to use and very pretty. This one was cool: the dictionary I moved into the source itself, and I did a direct binary editor. So in this thing you'd actually see the cache lines, and you're effectively using a hex editor that uses tags to tell you some contextual information, and then each line of course has a comment on the top followed by the data on the bottom. And of course I use different fonts, because sometimes I'm packing a full number and sometimes I'm packing characters and so on in comments. It was a relatively complicated system, but actually simple when you think about it in the context of what we build today.

15:28 One of the first questions to ask yourself is whether you want to work with text source versus a binary editor. So sometimes I would work with text source. In order to make this work well, I would have a prefix character in front of every word, which basically would be the tag. And it would also enable me to use very simple syntax coloring inside, say, nano. Most of these I built were more like a Forth macro language that was used to create a binary. So what I would do, and for instance what you can see on the right, is I would define something that would enable me to build the ELF header, and then after the ELF header was built, I would actually write the assembler in the source code and then finish off the rest of the binary. These kinds of languages are extremely small, and the whole thing is in, say, a few-kilobyte binary.

16:16 The other thing I do with these is bootstrap. So the first time, I might write the thing in C, and then I'd run the interpreter in C, and then later I would rewrite the interpreter inside the source code and then compile that. Now I would be bootstrapped into the language itself. And by doing that, I could actually compare my C code to the code I wrote inside my own language. And of course, I'm faster inside my own language than in the C code. And of course, I'm a lot smaller in the binary as well, because I have a very, very small ELF header in the binary that I generated, compared to the one that, say, GCC would generate.

16:55 I built some custom x86 operating systems. It was fun to build custom VGA fonts and of course mess with the palette entries to improve the initial colors. I did lots of different Forth variations, but typically these projects just got blocked on the mass complexity of today's hardware. Meaning, once you get down to the point where you want to, say, draw something on the screen other than using the old DOS VGA frame buffer, or if you want to start using input, you start needing a USB driver, and then all of a sudden everything turns into a nightmare. One thing I mentioned before is that it's very nice to use a Forth-like language as a macro assembly language.
17:33 In traditional assembly language, you do something like, say, add ecx, edx, and then a comment: advance pointer by stride. The latter part here is heavily commented; in fact, typically assembly is mostly comments, otherwise a human can't really understand it. When you start using a Forth-like language as a macro assembler, a lot of times what you do is, instead of using the register number, you put the register number inside another word and then use that word. So now you start self-documenting. And if you had common blocks of, say, multiple instructions, you would start defining those in some other word, and then you start factoring. And this way you self-document everything, and it becomes actually very easy to understand, a lot easier to understand than, say, raw assembler. And on top of this, of course, you can also put comments, but you don't typically need as many.

18:22 So if we were to look back at some of the lessons of all these projects, I think the key thing is that when your OS is an editor, is a hyper-calculator, is a debugger, is a hex editor, you end up with this interactive, instant-iteration software development, and that part is wonderful. The Forth style of thinking keeps source small enough that it's approachable by a single developer, and that I think is very important. You basically build out your own tools for exactly the way you like to think, and that's where its true beauty lies. Others, like Onot, have built full systems, meaning he is running something that actually works with Vulkan and generates SPIR-V.

19:05 So there is another option, and that is going sourceless. No language, no assembler. The code is the data, or the data is the code. Chuck's OKAD was a source of inspiration. I've only read about this, but it did send me down a spiral of trying various ideas related to what I read. So when we think about sourceless programming, it's best to just work from the opposite extreme: start with, say, a hex editor, and then work towards what we would need to make that practical for code generation. So I think of a binary as an array of, say, N 32-bit words, and then we could have another thing, which is an annotation, which is an array of N 64-bit words. The annotation could provide a tag which gives context to the data, or could control how the editor manipulates the data. The annotation can also provide an 8-character text annotation for the individual binary words, which serves as documentation for what the word is for.

19:50 So part of sourceless programming is how you generate code, and with Forth, hand-assembling words is actually relatively easy, because you don't have that many low-level operations if you're doing a stack-based machine. I invented something called x68, which I'm planning to do a separate talk on. It's a subset of x64 which works with op codes at 32-bit granularity only. Note that x86-64 supports ignored prefixes, which can pad op codes out to 32 bits. And we also have multi-byte NOPs, which can align the ends too. And we can do things like adding REX prefixes when we don't need them, to again pad out to 32 bits. So for instance, if we wanted to do a 32-bit instruction for return, we might put the return, which is a C3, and then pad the rest of it with a three-byte NOP.
20:47 And once we've built this 32-bit return number, which we annotate, we can insert a return anywhere just by copying and inserting this word in the source code. And later, if we built different op codes, and say they were multi-word, we can just use find-and-replace to change those. Effectively, we're removing compilation into some edit-time operations. One of the nice things about being at 32-bit granularity for the instruction is that the 32-bit immediates are now at 32-bit granularity as well. And so now we can just make it so that we have a tag which says this is an op code, and a tag which says this is, say, an immediate hex value, and we could show them separately with different colors. In this case, I have code for setting ESI to a 32-bit immediate. And you'll notice that this one is using the 3E ignored DS segment selector prefix to pad out the op code to 32 bits. And then after that, we have a silly number which we're setting into ESI. That silly number is 1 2 3 4 5 6 7 8 in hex. So it's very easy to do inline data this way.

21:51 Of course, calls and jumps are another question, and we have an easy solution for that one as well. In x86-64, call and jump use a 32-bit relative immediate, and that relative is relative to the end of the op code, not the beginning. And so if we want to make an editor support this, we would just tag the relative branch address as a word that is a relative address. And then when we start editing the words inside the binary, and say we move things around, we would just relink all of the words in the binary that have a relative address. So as code changes, things just get fixed up. And so this effectively solves the call and the jump problem. It's very easy to make an editor which repatches everything.
22:40 Conditional branches: you might think those are complicated, but they're actually not. A conditional branch is just an 8-bit relative branch address. And so when I make words for these, I would say jump-unequal minus two, which would jump if unequal to the word that is two words before this one, or say minus four for four words back, and so on. And so I can just build a few of these constructs and change the op codes around whenever I need, say, a jump-on-zero or so on. The nice thing about this is that now you no longer have to label things, because you just go and count, and when you move it around it's all relative, so you don't need to do any patching. If you want to add more stuff in your loop, you just change the op code a little.

23:16 Another option is that the editor could have built-in 32-bit word assembly and disassembly, meaning I could use a shorthand for registers and op codes. And the shorthand that I would use would be labeling the registers starting with G, so that 0 through F could be used for hex. So this is an example of how you might want to do it. So in this case I have h + at i08, which is going to disassemble to add rcx, qword ptr [rdx + 0x8], which we can shorthand very easily. And so I could have an editor that would show you either the disassembly or the shorthand instruction, and that would aid in the understanding and the ability to insert stuff without using separate tools. So, I did build this sourceless system once for real. It was back when I was building tiny
24:09 x86-based operating systems. I built the editor as a console program, so this would run in Linux, and I would build binaries in that console program and then use an x86 emulator running in Linux to actually test them. And this was a pretty liberating experience; I learned a lot from there. On the right, I'm showing one of the boot sectors of one of the examples running in the editor.

24:33 Note that with sourceless programming we could extend the annotation quite a bit. For instance, we could have tables that map register numbers to strings, and then for each cache line we could have an index into a table. In this way, registers could be automatically annotated. For instance, if register 2 is set to position and register 4 is set to velocity, and we had add r2, r4, we could just put in add pos, velocity, right? And that would make it a lot easier to understand, automatically. We could also extend it and have each cache line have a string as well. And this way we could automatically annotate, say, a label: the first word maybe could be the label, and the rest of the string could be used for a comment. So there are a lot of ways to do sourceless programming where we just provide annotation tools to actually make it a practical experience. So let's
25:22 talk about some of the variations, the pros and cons. The easiest way to perhaps start this would be to work with text source. And usually with text source, you're going to use prefix characters for the tag: for instance, slash for comment, colon for define, maybe tick for execute. This does have the slowest runtime, however, because you have a character-granularity loop. It does have a benefit in that, for you to get started, there is no editor that you have to write; you can use an external text editor, and you can do easy custom syntax coloring. It's going to be very easy to understand and to work in. However, I think you're missing a big piece if you go down this path, and that is you don't get any live interaction or debugging. You're basically depending on the fast compile times of your custom language and the fast load times of whatever program you're building to get you into that iteration loop. And you can work this way, I've done it many times, but the experience is nowhere near as good as the full interactive one with a binary editor. I guess one of the other benefits here is you get very easy code portability.
26:23
There are no binary files, just text files. You can copy and paste as you will. Of course, the next jump up from
26:28
that is going to binary source. This would be middle-performance runtime because now you're working at a word
26:34
granularity inside your interpreter. You have portability because you have code generation that can adapt to the system.
26:41
Meaning, as you interpret your code, you can look at what's underlying in the hardware and you can make changes to how
26:46
the code is generated at runtime. You can build just what bits of an assembler are needed. You don't need to build out
26:52
everything like you would with, say, a disassembler tool or an assembler tool. So for instance with x86 I don't
26:58
actually generate much of the ISA. I only use a very, very tiny subset. You do have to write the binary editor, and that
27:04
can be a lot of work, and sometimes that presents a problem with bootstrapping because you don't have the language to
27:10
write the editor in from the beginning. So you have to write the editor in some other language and then in that language
27:15
write the editor again and then, you know, complete the bootstrapping process. The
27:21
one benefit here is you get interactive development and debug from the beginning, and also now your source code shows how
27:28
constructs are built instead of just showing the result as you would get with, say, sourceless. One interesting thing
27:33
about fourth with binary source and this concept that you can rebuild source
27:39
code at runtime at any point in execution is now you can start compiling things you load into machine code and
27:46
then you don't have to run all the overhead. Basically you can bake things into machine code anytime at edit time,
27:52
and that's a very powerful feature. Now if we go and look at sourceless programming there's a bunch of pros and
27:58
cons. The one pro is that it's the fastest runtime. It's a true no-op. You can build things that are tightly
28:04
optimized for a platform and they're as optimized as you could possibly get. You
28:09
do however lose the capacity for showing how constants came to be and you lose the
28:15
capability of adapting to what the machine is. You can work around this, however: you can make smaller embedded
28:20
constants that you write into modified bit fields in instructions, but that's
28:26
really taking it a little too far in terms of complexity. You do need to write an editor, which includes an opcode
28:31
assembler and disassembler potentially, if you want to go down that route. And if you're going to do CPU and GPU,
28:37
that's a lot of work. It can be very complicated when systems include auto annotation. For instance, if you want to
28:44
type in, say, a readable string for a register and then have it go and figure out what register number that is. I guess the
28:50
primary disadvantage here is that there's no possibility for portability. You have a raw binary editor. And today
28:57
we have a problem with GPUs. Steam Deck is RDNA 2, Steam Machine is RDNA 3, and
29:04
who knows, the future may be RDNA 4 or 5 or 6. The problem with this is that when the ISA changes across those
29:10
chipsets, that can result in different instruction sizing. So you can't just do
29:16
simple sourceless. For example, MIMG in RDNA 2 has a different size than VIMAGE
29:22
in RDNA 4. And that's all due to the ISA changes. So if you do sourceful
29:27
programming, you can port through a compile step. However, with sourceless, you would need to do something else. And
29:33
perhaps that is just rewriting chipset-specific versions of all the shaders, but that may be something that you don't
29:39
want to do. So thinking ahead on the SteamOS/Linux project that I'm working on, effectively I'm building an AMD-only
29:46
solution. I don't really have a name for this project, so I'm going to call it fifth. I think for fifth I want a mix of
29:52
various fourth-style concepts, and perhaps the best mix would be the best
29:57
of both worlds. A fast high-level interpreter that's intended for doing GPU code generation, where I need chipset
30:04
portability at runtime, mixed with low-level sourceless words for the CPU
30:09
side, for simple x86-64, where I don't need portability at this time. Since this is
30:15
for Linux, we should think about what we can do on Linux that we might not be able to do on Windows today. We first
30:21
start thinking about execution. With the Linux x86-64 ABI, when you're running
30:26
without address space randomization, execution starts at a fixed 4 megabytes
30:31
in, and you can still do this today if you compile without -fPIE. Also note,
30:37
even if you did get position-independent execution, you could effectively just map where you want your fixed
30:44
position to be and then just start execution there and just ignore the original mapping that they threw you at.
30:50
Another thing we can do is we can use page maps. And if we look at Steam Deck, we'll notice that dirty pages get
30:56
written about as fast as every 30 seconds, which is an important number. It means it won't be overwriting too fast.
31:03
So, let's look at something we can do on Linux that we're not allowed to do on Windows anymore: the self-modifying
31:08
binary. The idea here is that we have a cart file. The cart file represents the ROM. Actually, in this case, it's RAM,
31:15
because we're going to be modifying it. So, first we would execute the cart file. And when the cart file runs, it
31:21
would realize that it's not a backup. And then it would copy itself to a backup file, cart.back, and then it would
31:28
launch cart.back and then exit. This cart.back would realize that it is the backup file and then it would map
31:35
the original file cart at, say, 6 megabytes in, and it would provide that mapping as read, write, and execute, and
31:42
then afterwards it would map an adjustable zero fill, and that would be for when we're doing compilation or when
31:48
we have data that we don't want to be backed up all the time, and after that it would jump to the interpreter. And so if
31:54
we look at the memory mapping, we'd have at 4 megabytes, say, a 4-kilobyte bootloader section.
32:01
And then somewhere, say at 6 megabytes, we'd have the whole file, and then after that we would have the zero
32:07
fill. And the nice thing about this is we automatically create a backup. We don't have to write any code to save the
32:13
file because it's going to autosave every 30 seconds. Also the data and code are together and we can make a new
32:19
version just by copying the file itself. Inside the binary we'd have a specific spot for the size of the file and the
32:25
size of the zero fill. So in the process of doing this execution, we can resize when
32:31
we build the cart.back file very easily. So for source code I'm thinking 32-bit
32:36
words. Words are going to be direct addresses into the binary, and the binary is going to be the dictionary.
32:43
Makes it quite fast to interpret. The nice thing here is they can be direct addresses because we fix the position.
32:50
We're not using address space randomization. We'll just fix RSI to the interpreter source position, and then all
32:56
the words will contain the NEXT of the interpreter, meaning all the words end in the interpreter itself or fold the
33:02
interpreter, in whatever form they want, into their own execution. And by doing this we enable lots of branch predictor
33:10
slots, because each of these end-of-word interpreters is going to get a different branch predictor. So we can
33:17
actually get this down to five bytes if we want to. We can use LODS to
33:22
basically load a word from RSI and then advance RSI, and we can use two bytes to
33:29
look up the value in the dictionary, and we can use two bytes later to jump to
33:34
the address for the next thing to run. So now we've gotten down to a five-byte interpreter. Another thing I think I
33:40
would do for variation is I would make source more free-form, not strict reverse Polish notation. In other words, with
33:48
regular reverse Polish notation, you're going to have words that are going to push values onto a stack, and therefore your
33:53
granularity is at the word level. If instead we have arguments, we can fetch
33:59
a bunch of values off the stack right off the bat and we can look them up in the dictionary in parallel. And now our
34:05
branch granularity is dropping significantly, maybe by a factor of two or four, depending on what our
34:12
common argument count is. I think this is a better compromise for when you're doing lots of code generation, which is
34:17
what we'll be doing on the CPU, mostly GPU code generation. So, for the editor, I'll just do an advanced 32-bit hex
34:24
editor. I'll split the source into blocks, and then each one of those blocks will be split into subblocks. And
34:31
the subblock will be source and then some annotation blocks. And so for every
34:37
32-bit source word, I'm going to have, say, 64 bits
34:42
of annotation information split across two 32-bit words. And that'll give me eight characters. Each of those
34:49
characters will be 7-bit, and I'll have an 8-bit tag for the editor. And the tag will give me the format of the 32-bit value
34:56
in memory and give me whatever else I want in there. So I can adapt, you know,
35:01
for whatever I feel like doing in the future. And I can make this pretty uniform because I'll mostly have numbers
35:07
and then I'll have direct addresses to words inside this. So that's it for now. This is a late welcome to 2026. I used
35:15
the holiday for some deep thinking, but I think it's time now for some more building. Take care.