references/Neokineogfx - 4th And Beyond - Transcript.txt
0:00 Welcome to Forth and Beyond. My name is Timothy Lottes and this is the Neokineogfx channel. This talk will not be covering standard Forth. Instead, this talk is going to start with the beyond-Forth part. Let's begin. What if we didn't actually need Visual Studio? What if we didn't need a separate debugger, or even the C language?

0:17 Let's start with the first principle: question everything until the problem is truly minimized. Begin by peeling the onion of computing, passing through APIs, compilers, languages, code generation, and so on. Search the alternative realities until greatness is found. And we'll start by rewinding time and learning from the past masters.
0:34 We'll start with the most basic interactive computer tool, the calculator. My favorite calculator was the HP48; that's what I used. The HP48 used reverse Polish notation. This made it very easy to type in math and get answers: you didn't have to mess with parentheses. The HP48 provided RPL, and the later machines provided System RPL, which could even assemble machine code, with offline tools for the HP48. People even built games for these machines.

1:00 Now, what if we were to take that calculator and evolve it into something more Forth-like? We'll start with simple reverse Polish notation calculator math, and next we'll introduce a dictionary. The dictionary will point to positions on the data stack. For instance, in the second line here we have a red word, 4k. That 4k word would be pointing to the next stack item. For the next stack item we can do some evaluations to come up with the number: we type in 1024, then type in 4, then type in multiply, and now we have 4096. So the 4k word would point to 4096. This is a basic way of doing a variable.
1:37 The next thing we could do is actually build numbers which represent op codes, or multiple numbers which represent op codes: things that we could actually execute and have do an operation on the machine. So in this case there's drop, and drop points to a number on the data stack which disassembles to add esi, -4 and then a return. Drop would basically drop the top item from the data stack, where ESI is pointing to the data stack. And now, once we have this in our dictionary, we can continue to do things on the stack, and we can use drop if we want to. So now we could write 4k, which would pull that number 4096 that we had put in the dictionary prior, then do 1, then 2, then a plus, which would create a 3, and then we execute drop, which will drop the 3, leaving 4096 on the stack. And thus now we've created something quite powerful.

2:31 So in this context, the gold numbers get pushed on the stack; the gold words get their value in the dictionary pushed on the stack; the green words are getting a value in the dictionary executed; and the red word is putting a pointer to the top of stack into the actual word in the dictionary. In some respects, you can see how this starts to create an extremely powerful system.

2:49 So a Forth-like machine is really the ultimate form of tool building. The language is free-form. The dictionary defines words. These words become the language you program in. It enables any kind of factoring of a problem. The language, the assembler, the compiler, the linker, the editor, the debugger: they're all defined in the source itself. And these systems can be tiny, tiny as in the whole thing fits in the cache. In my opinion, a Forth-like machine would have been a better option than BASIC for a boot language. A lot of people learned BASIC because they could type in a program from a book, say on the C64. But imagine if it was a Forth machine instead: you'd have something that runs significantly faster and is significantly more powerful. The irony here is that later Apple, IBM, and Sun actually used the Forth-based Open Firmware, but few had programmed in it at the time.

3:38 So, let's look back at Forth. Forth was invented in 1968 by Chuck Moore, or Charles Moore. Chuck
3:43 later focused on building multiple stack-based processors. He used his own VLSI CAD system, OKAD, for layout and simulation, and these were written in his language. Early on it was a sourceless language, and later it got moved to colorForth, from my understanding. The images below show some of the actual editor and simulation; these are from the UltraTechnology site. What's impressive here is that these were dramatically small systems, and yet they were used to do some of the most complicated stuff that humans can do, which is design chips that actually got fabricated and got used.

4:16 Chuck Moore's colorForth, I think, is worth learning about. It's an example of real system minimization: a 32-bit reverse Polish notation language. It provides a data stack, which gives you memory to work with, and note that code is compiled onto the data stack too. It provides dictionaries, which map a name to a value. The value is typically a 32-bit number or a 32-bit address into the source or data stack. The dictionaries are searched in a linear order from the last to the first defined word. There are two main dictionaries: the forth one, which is used for words to call, and macro, which is a secondary dictionary used for words that do code generation. Source is broken up into blocks; there is no file system. Inside the source blocks are 32-bit tokens. These tokens contain 28 bits of compressed name or string and four bits of tag.
5:07 The tag controls how to interpret the source token. Let's go through some of the tags. The white tag means an ignored word. The yellow tag means execute: if it's a number, we append the number on the data stack; if it's a word, we look up the word in the dictionary and then we call the word. If it's a red word, we're doing a definition: we're setting the word in the dictionary to the top of the stack, or a pointer to the top of the stack. If it's green, we're compiling. If it's a green number, we're appending a push-number onto the stack; effectively, we're encoding the code, the machine language, that would push that number.

5:43 If we compile a word, we're first going to look up the word in the macro dictionary, and if it exists, we're going to call it. Otherwise, we look up the word in the forth dictionary, and we append a call to the word itself. Cyan, or blue, is used to defer a word's execution: we'll look up the word in the macro dictionary, and we will append a call to the word. This way, we can make words that do code generation that call other words that do code generation. Next is the variable, which uses magenta. Variable sets the dictionary value of the word to the pointer to the next source token in the source code as it's being evaluated. And then any time we have a yellow-to-green transition, we pop a number off the stack and then we append a push-number to the data stack, which basically means we're taking a number and we're turning it back into a program, a program that pushes the number.
6:36 So if we look at some of the blocks inside colorForth, notice this one here, block 18. This starts doing the code generation. So it'll push 24 and then it'll load, which will take block 24 and actually bring in all the code generation macros. And then the next one, 26 load, will bring in more code generation macros from block 26. If you look at block 24, it starts with executing macro, which moves us to making defines in the macro dictionary. The first define is swap. Then it does 168B, and then it does a 2-comma. The 2-comma pushes two bytes onto the data stack. The next one is C28B0689 followed by a comma. The comma pushes four bytes onto the data stack. So effectively what we're doing is pushing some bytes to actually create code onto the data stack, where swap is defined. And if we disassemble these six bytes, we get mov edx, dword ptr [esi]; so effectively we're pulling from the stack into EDX, and the stack in this case is the data stack of Forth. The next one is mov dword ptr [esi], eax; so we're pushing the existing cached value of the top of the stack, which is in EAX, we're putting that on the stack. And then we move EDX into EAX, which is taking the old second value on the stack and putting it into the cached value, which is the top of the stack in colorForth. So basically this whole block is defining op codes that are used for code generation.
8:14 So let's fast-forward now, and let's just critique colorForth. Perhaps one of the biggest critiques of colorForth is that it's a mismatch to hardware today. It's a stack-based machine, and modern machines are register-based. Modern machines have really deep pipelines; they don't deal with branching well, and Forth is extremely branch-heavy. The interpreter costs that you have to pay per token are pretty high: we have to branch based on tag. Dictionaries are searched from last added to first added, with no hashing or any acceleration.
8:43 Most commonly, every time you do an interpretation, after branching on the tag you're going to branch again to another thing, which is going to be a mispredicted address. And note, you've got an average 16-clock stall on, say, Zen 2 for a branch misprediction. Of course, the logical response here is that if you only have a tiny amount of code, there's no reason it has to be super fast. After all, the most important optimization is doing less total work. For example, an F1 car driving a thousand miles is going to be substantially slower than a turtle walking one foot. Well, towards the end of 2025, Chuck Moore said, "I think fate is trying to tell me it's time to move on."
9:20 And this is in response to Windows auto-updating and then breaking his colorForth. But I ask, should we actually move on? The world did move on to mass hardware and software complexity, but perhaps Chuck's way of thinking is actually exactly what is needed today. How about a localized reboot? We have a lot of FPGA-based systems showing up, and I'm hopeful that they're getting commercial success, but these are effectively all emulators of prior hardware. What about doing something new? Maybe Forth thinking could be a part of that. What about neo-vintage parallel machines? After all, Forth thinking is ideal for a fixed hardware platform.

9:56 FPGA-based hardware emulators focus mostly on the serial-thinking era. But there is actually a universal speed-of-light barrier: these product lines are going to stop around the N64 and so on, because after that, serial CPU clock rates cannot be FPGA-emulated. But FPGAs have crazy parallel DSP capabilities. Perhaps we should design for DSPs as the processors, and then provide radically parallel but medium-clock machines, and these are things we could actually drive with a Forth-style language. There is a challenge of minimalism in a maximalist world. Software is a problem, but the root is hardware complexity growth. For example, the RDNA4 ISA guide is almost 4 megabytes in itself. And try writing a modern USB-C controller driver yourself. And yet, even with all of today's hardware complexity, I still believe Forth-inspired software can be quite useful. I spent a lot of time exploring the permutation space around Forth,
10:51 specifically more around colorForth, and seeing what variations could be made. One way I varied from Forth was in the encoding. I don't necessarily stick with a stack-based language. Sometimes I treat the register file more like a close, highly aliased memory. Sometimes I use a stack-based language as a macro language, say for a native hardware assembler. And sometimes I mix a stack-based language with something that has arguments, for instance having a word take arguments after the word, and still use it like a stack-based language.

11:22 So I have used Forth-like things in commercial products. One example: I used to run a photography business and a software development business, and the old business website that I ran in my prior life doing landscape photography was actually generated by a Forth-like language running server-side, which generated all the HTML pages. It made managing a huge website actually practical. Now, I had to use the Wayback Machine to find this, so sorry in advance for the broken images. And of course, I had a different last name then, from a broken marriage, but that's another story. But I did a lot more Forth-like things beyond this one.

11:54 One of the things that got me right away was, of course, the lure to optimize. For example, colorForth uses this Huffman-style encoding for the names in its tokens. Remember, a source token is a T-bit tag, typically T is 4, with an S-bit string, where S is 28 bits. We could do a better job of encoding those 28 bits. For instance, we could split that full number range by some initial probability table of the first character. And then we could split each of those ranges by, say, a one- or two-character predictor. And then we train this thing on a giant dictionary. And of course, you're going to have to use lookup tables. And of course, the memory used for the predictor is going to be greater than the rest of this whole system combined. And yeah, it worked. It provided some very interesting stuff: you could put a number in and it would basically spit a string out, which was pretty cool. This journey I think was useful; I learned a lot of things in the process, like where to optimize and where not to optimize.

12:57 The next question is, well, should we hash or should we not hash? It turns out that a compressed string like in the prior slide is actually a great hash function. I can simply mask off some number of the least significant bits, and that becomes my hash function.
13:10 I always disliked the issue of only partly using cache lines; that's not very efficient. And of course, we can try to fix that too. We can check a very tiny hash table first, and size that hash table to stay in the cache. And then if we miss on that, we can go to the full-size one, the one that's going to have pretty poor utilization of cache lines. And assuming lots of reuse, that tiny hash table is going to keep high cache utilization. However, now we've done two stages of optimization. But we really should start asking: why are we hashing, and why are we compressed? Why are we doing all this overhead? Why do we not just direct-map? After all, if we're depending on an editor, we could just direct-map, or perhaps just address into the dictionary directly. Then we can split off the T-bit tag and the S-bit string for editor use. And that can start simplifying things so we don't have all this complexity in the first place.

14:00 The next thing we can do, if we're interpreting, is solve the problem of branch misses. Normally with an interpreter, you would evaluate the word and then you'd return back to the interpreter. That interpreter would look up another word and do another branch, but that branch would always be mispredicted. One option is we could just fold the interpreter back into the words themselves. But of course, we've got to make that interpreter really small, otherwise we're doing a lot of code duplication. You can imagine if you have a thousand words, you're going to embed the interpreter a thousand times. So there are a lot of different ways we can design an interpreter down into a few bytes. For instance, this one is an 8-byte interpreter. This is one I've never used; you can actually do better than 8 bytes, and I'll show you that towards the end.
14:45 So of course, the best way to learn is to build stuff. So I built many colorForth-inspired things over the years. Some, like the one to the right here, where I got distracted with editor graphics, effectively making something extremely nice to use and very pretty. This one was cool: the dictionary I moved into the source itself, and I did a direct binary editor. So in this thing you'd actually see the cache lines, and you're effectively using a hex editor that uses tags to tell you some contextual information, and then each line of course has a comment on the top followed by the data on the bottom. And of course I use different fonts, because sometimes I'm packing a full number and sometimes I'm packing characters and so on in comments. It was a relatively complicated system, but actually simple when you think about it in the context of what we build today.

15:28 One of the first questions to ask yourself is whether you want to work with text source versus a binary editor. So sometimes I would work with text source. In order to make this work well, I would have a prefix character in front of every word, which basically would be the tag. And it would also enable me to use very simple syntax coloring inside, say, nano. Most of these I built were more like a Forth macro language that was used to create a binary. So what I would do, and for instance what you can see on the right, is I would define something that would enable me to build the ELF header, and then after the ELF header was built, I would actually write the assembler in the source code and then finish off the rest of the binary. These kinds of languages are extremely small, and the whole thing is in, say, a few-kilobyte binary.

16:16 The other thing I do with these is bootstrap. So the first time, I might write the thing in C, and then I'd run the interpreter in C, and then later I would rewrite the interpreter inside the source code and then compile that. Now I would be bootstrapped into the language itself. And by doing that, I could actually compare my C code to the code I wrote inside my own language. And of course, I'm faster inside my own language than in the C code. And of course, I'm a lot smaller in the binary as well, because I have a very, very small ELF header in the binary that I generated, compared to the one that, say, GCC would generate.

16:55 I built some custom x86 operating systems. It was fun to build custom VGA fonts and of course mess with the palette entries to improve the initial colors. I did lots of different Forth variations, but typically these projects just got blocked on the mass complexity of today's hardware. Meaning, once you get down to the point where you want to, say, draw something on the screen other than using the old DOS VGA frame buffer, or if you want to start using input, you start needing a USB driver, and then all of a sudden everything turns into a nightmare. One thing I mentioned before is that it's very nice to use a Forth-like language as a macro assembly language.
17:33 In traditional assembly language, you do something like, say, add ecx, edx, and then a comment: advance pointer by stride. The latter part here is heavily commented; in fact, typically assembly is mostly comments, otherwise a human can't really understand it. When you start using a Forth-like language as a macro assembler, a lot of times what you do is, instead of using the register number, you put the register number inside another word and then use that word. So now you start self-documenting. And if you had common blocks of, say, multiple instructions, you would start defining those in some other word, and then you start factoring. And this way you self-document everything, and it becomes actually very easy to understand, a lot easier to understand than, say, raw assembler. And on top of this, of course, you can also put comments, but you don't typically need as many.

18:22 So if we were to look back at some of the lessons of all these projects, I think the key thing is that when your OS is an editor, is a hyper-calculator, is a debugger, is a hex editor, you end up with this interactive, instant-iteration software development, and that part is wonderful. The Forth style of thinking keeps source small enough that it's approachable by a single developer, and that I think is very important. You basically build out your own tools for exactly the way you like to think, and that's where its true beauty lies. Others, like Onot, have built full systems, meaning he is running something that actually works with Vulkan and generates SPIR-V.

19:05 So there is another option, and that is going sourceless. No language, no assembler. The code is the data, or the data is the code. Chuck's OKAD was a source of inspiration. I've only read about this, but it did send me down a spiral of trying various ideas related to what I read. So when we think about sourceless programming, it's best to just work from the opposite extreme: start with, say, a hex editor, and then work towards what we would need to make that practical for code generation. So I think of a binary as an array of, say, N 32-bit words, and then we could have another thing, which is an annotation, which is an array of N 64-bit words. The annotation could provide a tag which gives context to the data, or could control how the editor manipulates the data. The annotation can also provide an 8-character text annotation for the individual binary words, which serves as documentation for what the word is for.

19:50 So part of sourceless programming is how you generate code, and with Forth, hand-assembling words is actually relatively easy, because you don't have that many low-level operations if you're doing a stack-based machine. I invented something called x68, which I'm planning to do a separate talk on. It's a subset of x64 which works with op codes at 32-bit granularity only. Note that x86-64 supports ignored prefixes, which can pad op codes out to 32 bits. And we also have multi-byte NOPs, which can align the ends too. And we can do things like adding REX prefixes when we don't need them, to again pad out to 32 bits. So for instance, if we wanted to do a 32-bit instruction for return, we might put the return, which is a C3, and then pad the rest of it with a three-byte NOP.
20:47 And once we've built this 32-bit return number, which we annotate, we can insert a return anywhere just by copying and inserting this word in the source code. And later, if we built different op codes, and say they were multi-word, we can just use find-and-replace to change those. Effectively, we're removing compilation into some edit-time operations. One of the nice things about being at 32-bit granularity for the instruction is that the 32-bit immediates are now at 32-bit granularity as well. And so now we can just make it so that we have a tag which says this is an op code, and a tag which says this is, say, an immediate hex value, and we could show them separately with different colors. In this case, I have code for setting ESI to a 32-bit immediate. And you'll notice that this one is using the 3E ignored DS segment selector prefix to pad out the op code to 32 bits. And then after that, we have a silly number which we're setting into ESI. That silly number is 1 2 3 4 5 6 7 8 in hex. So it's very easy to do inline data this way.

21:51 Of course, calls and jumps are another question, and we have an easy solution for that one as well. In x86-64, call and jump use a 32-bit relative immediate, and that relative is relative to the end of the op code, not the beginning. And so if we want to make an editor support this, we would just tag the relative branch address as a word that is a relative address. And then when we start editing the words inside the binary, and say we move things around, we would just relink all of the words in the binary that have a relative address. So as code changes, things just get fixed up. And so this effectively solves the call and the jump problem. It's very easy to make an editor which repatches everything.
22:40 Conditional branches: you might think those are complicated, but they're actually not. A conditional branch is just an 8-bit relative branch address. And so when I make words for these, I would say jump-unequal minus two, which would jump if unequal to the word that is two words before this one, or say minus four for four words back, and so on. And so I can just build a few of these constructs and change the op codes around whenever I need, say, a jump-on-zero or so on. The nice thing about this is that now you no longer have to label things, because you just go and count, and when you move it around it's all relative, so you don't need to do any patching. If you want to add more stuff in your loop, you just change the op code a little.

23:16 Another option is that the editor could have built-in 32-bit word assembly and disassembly, meaning I could use a shorthand for registers and op codes. And the shorthand that I would use would be labeling the registers starting with G, so that 0 through F could be used for hex. So this is an example of how you might want to do it. So in this case I have h + at i08, which is going to disassemble to add rcx, qword ptr [rdx + 0x8], which we can shorthand very easily. And so I could have an editor that would show you either the disassembly or the shorthand instruction, and that would aid in the understanding and the ability to insert stuff without using separate tools. So, I did build this sourceless system once for real. It was back when I was building tiny
24:09 x86-based operating systems. I built the editor as a console program, so this would run in Linux, and I would build binaries in that console program and then use an x86 emulator running in Linux to actually test them. And this was a pretty liberating experience; I learned a lot from there. On the right, I'm showing one of the boot sectors of one of the examples running in the editor.

24:33 Note that with sourceless programming we could extend the annotation quite a bit. For instance, we could have tables that map register numbers to strings, and then for each cache line we could have an index into a table. In this way, registers could be automatically annotated. For instance, if register 2 is set to position and register 4 is set to velocity, and we had add r2, r4, we could just put in add pos, velocity, right? And that would make it a lot easier to understand, automatically. We could also extend it and have each cache line have a string as well. And this way we could automatically annotate, say, a label: the first word maybe could be the label, and the rest of the string could be used for a comment. So there are a lot of ways to do sourceless programming where we just provide annotation tools to actually make it a practical experience. So let's
25:22 talk about some of the variations, the pros and cons. The easiest way to perhaps start this would be to work with text source. And usually with text source, you're going to use prefix characters for the tag: for instance, slash for comment, colon for define, maybe tick for execute. This does have the slowest runtime, however, because you have a character-granularity loop. It does have a benefit in that, for you to get started, there is no editor that you have to write; you can use an external text editor, and you can do easy custom syntax coloring. It's going to be very easy to understand and to work in. However, I think you're missing a big piece if you go down this path, and that is you don't get any live interaction or debugging. You're basically depending on the fast compile times of your custom language and the fast load times of whatever program you're building to get you into that iteration loop. And you can work this way, I've done it many times, but the experience is nowhere near as good as the full interactive one with a binary editor. I guess one of the other benefits here is you get very easy code portability.
26:23
There are no binary files, just text files. You can copy and paste as you will. Of course, the next jump up from
26:28
that is going to binary source. This would be middle-performance runtime because now you're working at a word
26:34
granularity inside your interpreter. You have portability because you have code generation that can adapt to the system.
26:41
Meaning, as you interpret your code, you can look at what's underlying in the hardware and you can make changes to how
26:46
the code is generated at runtime. You can build just what bits of an assembler are needed. You don't need to build out
26:52
everything like you would with, say, a disassembler tool or an assembler tool. So for instance with x86 I don't
26:58
actually generate much of the ISA. I only use a very, very tiny subset. You do have to write the binary editor, and that
27:04
can be a lot of work, and sometimes that presents a problem with bootstrapping because you don't have the language to
27:10
write the editor in from the beginning. So you have to write the editor in some other language and then in that language
27:15
write the editor again and then, you know, complete the bootstrapping process. The
27:21
one benefit here is you get interactive development and debug from the beginning, and also now your source code shows how
27:28
constructs are built instead of just showing the result as you would get with, say, sourceless. One interesting thing
27:33
about fourth with binary source and this concept that you can rebuild source
27:39
code at runtime at any point in execution is now you can start compiling things you load into machine code and
27:46
then you don't have to run all the overhead. Basically you can bake things into machine code anytime at edit time,
27:52
and that's a very powerful feature. Now if we go and look at sourceless programming there's a bunch of pros and
27:58
cons. The one pro is that it's the fastest runtime. It's a true no-op. You can build things that are tightly
28:04
optimized for a platform and they're as optimized as you could possibly get. You
28:09
do however lose the capacity for showing how constants came to be and you lose the
28:15
capability of adapting to what the machine is. You can work around this, however: you can make smaller embedded
28:20
constants that you write into modified bit fields in instructions, but that's
28:26
really taking it a little too far in terms of complexity. You do need to write an editor, which includes an opcode
28:31
assembler and disassembler potentially, if you want to go down that route. And if you're going to do CPU and GPU,
28:37
that's a lot of work. It can be very complicated when systems include auto annotation. For instance, if you want to
28:44
type in, say, a readable string for a register and then have it go and figure out what register number that is. I guess the
28:50
primary disadvantage here is that there's no possibility for portability. You have a raw binary editor. And today
28:57
we have a problem with GPUs. Steam Deck is RDNA 2, Steam Machine is RDNA 3, and
29:04
who knows, the future may be RDNA 4 or 5 or 6. The problem with this is that when the ISA changes across those
29:10
chipsets, that can result in different instruction sizing. So you can't just do
29:16
simple sourceless. For example, MIMG in RDNA 2 has a different size than VIMAGE
29:22
in RDNA 4. And that's all due to the ISA changes. So if you do sourceful
29:27
programming, you can port through a compile step. However, with sourceless, you would need to do something else. And
29:33
perhaps that is just rewriting chipset-specific versions of all the shaders, but that may be something that you don't
29:39
want to do. So thinking ahead on the SteamOS/Linux project that I'm working on, effectively I'm building an AMD-only
29:46
solution. I don't really have a name for this project, so I'm going to call it fifth. I think for fifth I want a mix of
29:52
various fourth-style concepts, and perhaps the best mix would be the best
29:57
of both worlds. A fast high-level interpreter that's intended for doing GPU code generation, where I need chipset
30:04
portability at runtime, mixed with low-level sourceless words for the CPU
30:09
side, for simple x86-64, where I don't need portability at this time. Since this is
30:15
for Linux, we should think about what we can do on Linux that we might not be able to do on Windows today. We first
30:21
start thinking about execution. With the Linux x86-64 ABI, when you're running
30:26
without address space randomization, execution starts at a fixed 4 megabytes
30:31
in, and you can still do this today if you compile without -fPIE. Also note,
30:37
even if you did get position-independent execution, you could effectively just map where you want your fixed
30:44
position to be and then just start execution there and just ignore the original mapping that they threw you at.
30:50
Another thing we can do is we can use page maps. And if we look at Steam Deck, we'll notice that dirty pages get
30:56
written about as fast as every 30 seconds, which is an important number. It means it won't be overwriting too fast.
31:03
So, let's look at something we can do on Linux that we're not allowed to do on Windows anymore: the self-modifying
31:08
binary. The idea here is that we have a cart file. The cart file represents the ROM. Actually, in this case, it's RAM,
31:15
because we're going to be modifying it. So, first we would execute the cart file. And when the cart file runs, it
31:21
would realize that it's not a backup. And then it would copy itself to a backup file, cart.back, and then it would
31:28
launch cart.back and then exit. This cart.back would realize that it is the backup file and then it would map
31:35
the original file cart at, say, 6 megabytes in, and it would provide that mapping as read, write, and execute, and
31:42
then afterwards it would map an adjustable zero fill, and that would be for when we're doing compilation or when
31:48
we have data that we don't want to be backed up all the time, and after that it would jump to the interpreter. And so if
31:54
we look at the memory mapping, we'd have at 4 megabytes, say, a 4-kilobyte bootloader section.
32:01
And then somewhere, say at 6 megabytes, we'd have the whole file, and then after that we would have the zero
32:07
fill. And the nice thing about this is we automatically create a backup. We don't have to write any code to save the
32:13
file because it's going to autosave every 30 seconds. Also the data and code are together and we can make a new
32:19
version just by copying the file itself. Inside the binary we'd have a specific spot for the size of the file and the
32:25
size of the zero fill. So in the process of doing this execution, we can resize when
32:31
we build the cart.back file very easily. So for source code I'm thinking 32-bit
32:36
words. Words are going to be direct addresses into the binary, and the binary is going to be the dictionary.
32:43
Makes it quite fast to interpret. The nice thing here is they can be direct addresses because we fix the position.
32:50
We're not using address space randomization. We'll just fix RSI to the interpreter source position, and then all
32:56
the words will contain the NEXT of the interpreter, meaning all the words end in the interpreter itself or fold the
33:02
interpreter, in whatever form they want, into their own execution. And by doing this we enable lots of branch predictor
33:10
slots, because each of these end-of-word interpreters is going to get a different branch predictor. So we can
33:17
actually get this down to five bytes if we want to. We can use LODS to
33:22
basically load a word from RSI and then advance RSI, and we can use two bytes to
33:29
look up the value in the dictionary, and we can use two bytes later to jump to
33:34
the address for the next thing to run. So now we've gotten down to a five-byte interpreter. Another thing I think I
33:40
would do for variation is I would make source more free-form, not strict reverse Polish notation. In other words, with
33:48
regular reverse Polish notation, you're going to have words that are going to push values onto a stack, and therefore your
33:53
granularity is at the word level. If instead we have arguments, we can fetch
33:59
a bunch of values off the stack right off the bat and we can look them up in the dictionary in parallel. And now our
34:05
branch granularity is dropping significantly, maybe by a factor of two or four, depending on what our
34:12
common argument count is. I think this is a better compromise for when you're doing lots of code generation, which is
34:17
what we'll be doing on the CPU, mostly GPU code generation. So, for the editor, I'll just do an advanced 32-bit hex
34:24
editor. I'll split the source into blocks, and then each one of those blocks will be split into subblocks. And
34:31
the subblock will be source and then some annotation blocks. And so for every
34:37
32-bit source word, I'm going to have, say, 64 bits
34:42
of annotation information split across two 32-bit words. And that'll give me eight characters. Each of those
34:49
characters will be 7-bit, and I'll have an 8-bit tag for the editor. And the tag will give me the format of the 32-bit value
34:56
in memory and give me whatever else I want in there. So I can adapt, you know,
35:01
for whatever I feel like doing in the future. And I can make this pretty uniform because I'll mostly have numbers
35:07
and then I'll have direct addresses to words inside this. So that's it for now. This is a late welcome to 2026. I used
35:15
the holiday for some deep thinking, but I think it's time now for some more building. Take care.