Welcome to Forth and Beyond. My name is Timothy Lis and this is the Neoenographics channel. This talk will not be covering standard Forth; instead, it starts with the "beyond Forth" part. Let's begin. What if we didn't actually need Visual Studio? What if we didn't need a separate debugger, or even the C language? Let's start with the first principle: question everything until the problem is truly minimized. Begin by peeling the onion of computing, passing through APIs, compilers, languages, code generation, and so on. Search the alternative realities until greatness is found. We'll start by rewinding time and learning from the past masters.

We'll start with the most basic interactive computer tool, the calculator. My favorite calculator was the HP48; that's what I used. The HP48 used reverse Polish notation, which made it very easy to type in math and get answers: you didn't have to mess with parentheses. The HP48 provided RPL, and the later machines provided System RPL, which could even assemble machine code with offline tools for the HP48. People even built games for these machines. Now, what if we were to take that calculator and evolve it into something more Forth-like?

We'll start with simple reverse Polish notation calculator math. Next, we'll introduce a dictionary. The dictionary will point to positions on the data stack. For instance, in the second line here we have a red word, 4k. That 4k word points to the next stack item, and for that stack item we can do some evaluation to come up with the number. So we type in 1024, then type in 4, then type in multiply, and now we have 4096. The 4k word points to 4096. This is a basic way of doing a variable. The next thing we could do is build numbers, or groups of numbers, which represent op codes.
These are things we could actually execute to perform an operation on the machine. In this case there's drop, and drop points to a number on the data stack which disassembles to add esi, -4 and then a return. Drop basically drops the top item from the data stack, where ESI is pointing to the data stack. Once we have this in our dictionary, we can continue to do things on the stack and use drop whenever we want. So now we could write 4k, which pulls the 4096 that we put in the dictionary earlier, then 1, then 2, then plus, which creates a 3, and then execute drop, which drops the 3, leaving 4096 on the stack. And thus we've created something quite powerful. In this context, a gold number gets pushed on the stack; a gold word gets its dictionary value pushed on the stack; a green word gets its dictionary value executed; and a red word puts a pointer to the top of stack into that word in the dictionary. In some respects, you can see how this starts to create an extremely powerful system.

A Forth-like machine is really the ultimate form of tool building. The language is free-form. The dictionary defines words, and those words become the language you program in. It enables any kind of factoring of a problem. The language, the assembler, the compiler, the linker, the editor, the debugger: they're all defined in the source itself. And these systems can be tiny, tiny as in the whole thing fits in the cache. In my opinion, a Forth-like machine would have been a better option than BASIC for a boot language. A lot of people learned BASIC because they could type in a program from a book, say on the C64. But imagine if it was a Forth machine instead: you'd have something that runs significantly faster and is significantly more powerful.
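The calculator-with-dictionary model above fits in a few lines. Here is a minimal sketch in Python (my own toy, not the talk's system; the `:name` prefix is my stand-in for a red word): an RPN loop plus a dictionary whose entries point at positions on the data stack.

```python
stack = []        # the data stack
dictionary = {}   # word name -> index into the data stack

def rpn(tokens):
    for t in tokens:
        if t == "*":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif t == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif t == "drop":
            stack.pop()                       # discard top of stack
        elif t.startswith(":"):               # "red" word: point at top of stack
            dictionary[t[1:]] = len(stack) - 1
        elif t in dictionary:                 # known word: push its value
            stack.append(stack[dictionary[t]])
        else:
            stack.append(int(t))              # plain number: push it

rpn(["1024", "4", "*", ":4k"])      # 1024 4 * -> 4096, then define 4k
rpn(["4k", "1", "2", "+", "drop"])  # push 4096, then 1 2 + -> 3, then drop it
```

After both lines run, the 3 is gone and 4096 is on top, matching the walkthrough in the talk.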
The irony here is that later Apple, IBM, and Sun actually used a Forth-based Open Firmware, but few had programmed in Forth at the time. So, let's look back at Forth. Forth was invented around 1970 by Chuck Moore, or Charles Moore. Chuck later focused on building multiple stack-based processors. He used his own VLSI CAD system, OKAD, for layout and simulation, and these tools were written in his language. Early on it was a sourceless system, and later it moved to colorForth, from my understanding. The images below show some of the actual editor and simulation; these are from the Ultra Technology site. What's impressive here is that these were dramatically small systems, and yet they were used to do some of the most complicated work humans can do: designing chips that actually got fabricated and used.

Chuck Moore's colorForth is, I think, worth learning about. It's an example of real system minimization: a 32-bit reverse Polish notation language. It provides a data stack, which gives you memory to work with; note that code is compiled onto the data stack too. It provides dictionaries which map a name to a value; the value is typically a 32-bit number or a 32-bit address into the source or data stack. The dictionaries are searched in linear order, from the last-defined word to the first. There are two main dictionaries: forth, which is used for words to call, and macro, a secondary dictionary used for words that do code generation. Source is broken up into blocks; there is no file system. Inside the source blocks are 32-bit tokens. These tokens contain 28 bits of compressed name (a string) and four bits of tag. The tag controls how to interpret the source token. Let's go through some of the tags. The white tag means an ignored word. The yellow tag means execute: if it's a number, we append the number to the data stack.
If it's a word, we look up the word in the dictionary and then call it. If it's a red word, we're doing a definition: we're setting the word in the dictionary to the top of the stack, or a pointer to the top of the stack. If it's green, we're compiling. For a green number, we append code that pushes the number onto the stack; effectively we're encoding the machine language that would push that number. If we compile a word, we first look it up in the macro dictionary, and if it exists, we call it. Otherwise, we look up the word in the forth dictionary and append a call to the word itself. Cyan, or blue, is used to defer a word's execution: we look up the word in the macro dictionary and append a call to the word. This way we can make words that do code generation which call other words that do code generation. Next is the variable, which uses magenta. Variable sets the dictionary value of the word to the pointer to the next source token in the source code as it's being evaluated. And any time we have a yellow-to-green transition, we pop a number off the stack and append a push of that number to the data stack, which basically means we're taking a number and turning it back into a program: a program that pushes the number.

So let's look at some of the blocks inside colorForth, and notice this one here, block 18. This starts the code generation: it pushes 24 and then executes load, which takes block 24 and brings in all the code generation macros. The next one, 26 load, brings in more code generation macros from block 26. If you look at block 24, it starts by executing macro, which moves us to making definitions in the macro dictionary. The first definition is swap. It does 168B and then a 2, (two-comma).
The 2, appends two bytes to the data stack. The next number is C28B0689, followed by a , (comma), which appends four bytes. So effectively what we're doing is pushing bytes onto the data stack, where swap is being defined, to create code. If we disassemble these six bytes, we get mov edx, dword ptr [esi]: we're pulling from the stack into EDX, where the stack in this case is Forth's data stack. The next is mov dword ptr [esi], eax: we're putting the cached top-of-stack value, which lives in EAX, onto the stack. Then mov eax, edx takes the old second value on the stack and moves it into the cached top-of-stack value of colorForth. So basically this whole block is defining op codes that are used for code generation.

Let's fast-forward now and critique colorForth. Perhaps the biggest critique of colorForth is that it's a mismatch to hardware today. It's a stack-based machine, and modern machines are register-based. Modern machines have really deep pipelines; they don't deal with branching well, and Forth is extremely branch-heavy. The interpreter costs per token are pretty high: we have to branch based on the tag. Dictionaries are searched from last-added to first-added with no hashing or any other acceleration. Most commonly, every time you interpret, after branching on the tag you're going to branch again to another target, which is going to be a mispredicted address. And note, you get an average 16-clock stall on, say, Zen 2 for a branch misprediction. Of course, the logical response here is that if you only have a tiny amount of code, there's no reason it has to be super fast. After all, the most important optimization is doing less total work.
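As a sanity check on those numbers (using Python's struct in place of colorForth's 2, and , words, which append the top-of-stack value as little-endian bytes), the two compiled numbers really do produce the six bytes of swap:

```python
import struct

code = bytearray()
code += struct.pack("<H", 0x168B)      # "2," appends two bytes  -> 8B 16
code += struct.pack("<I", 0xC28B0689)  # ","  appends four bytes -> 89 06 8B C2

# 8B 16 = mov edx, [esi] ; 89 06 = mov [esi], eax ; 8B C2 = mov eax, edx
assert bytes(code) == bytes([0x8B, 0x16, 0x89, 0x06, 0x8B, 0xC2])
```

The little-endian byte order is why 168B reads "backwards" relative to the 8B 16 in the disassembly.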
For example, an F1 car driving a thousand miles is going to be substantially slower than a turtle walking one foot. Well, towards the end of 2025, Chuck Moore said, "I think fate is trying to tell me it's time to move on," and this was in response to Windows auto-updating and then breaking his colorForth. But I ask: should we actually move on? The world did move on to mass hardware and software complexity, but perhaps Chuck's way of thinking is exactly what is needed today. How about a localized reboot? We have a lot of FPGA-based systems showing up, and I'm hopeful that they're finding commercial success, but these are effectively all emulators of prior hardware. What about doing something new? Maybe Forth thinking could be a part of that. What about neo-vintage parallel machines? After all, Forth thinking is ideal for a fixed hardware platform. FPGA-based hardware emulators focus mostly on the serial-thinking era, but there is a universal speed-of-light barrier here: these product lines are going to stop around the N64 and so on, because after that, serial CPU clock rates cannot be FPGA-emulated. But FPGAs have crazy parallel DSP capabilities. Perhaps we should design with DSPs as the processors, providing radically parallel but medium-clock machines, and these are things we could actually drive with a Forth-style language.

There is a challenge of minimalism in a maximalist world. Software is a problem, but the root is hardware complexity growth. For example, the RDNA4 ISA guide is almost 4 megabytes by itself. And try writing a modern USB-C controller driver yourself. And yet, even with all of today's hardware complexity, I still believe Forth-inspired software can be quite useful. I spent a lot of time exploring the permutation space around Forth, specifically around colorForth, and seeing what variations could be made.
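Before getting into my variations, the colorForth tag semantics described earlier (yellow execute, red define, green compile) can be recapped in a toy dispatch loop. This is my simplification for illustration only; the real interpreter works on packed 32-bit tokens, and I omit the macro-dictionary lookup for compiled words.

```python
YELLOW, RED, GREEN = "yellow", "red", "green"

def interpret(tokens, dictionary, code, stack):
    for tag, value in tokens:
        if tag == YELLOW:                         # execute now
            if isinstance(value, int):
                stack.append(value)               # yellow number: append to data stack
            else:
                dictionary[value](stack)          # yellow word: look up and call
        elif tag == RED:                          # definition: word -> top of stack
            dictionary[value] = stack[-1]
        elif tag == GREEN:                        # compile instead of execute
            if isinstance(value, int):
                code.append(("push", value))      # code that pushes the literal
            else:
                code.append(("call", value))      # append a call to the word

stack, code = [], []
dictionary = {"+": lambda s: s.append(s.pop() + s.pop())}
interpret([(YELLOW, 2), (YELLOW, 3), (YELLOW, "+"),   # 2 3 + runs now -> 5
           (RED, "five"),                             # "five" now names the 5
           (GREEN, 10), (GREEN, "+")],                # these are compiled, not run
          dictionary, code, stack)
```

The yellow tokens execute immediately, while the green ones only append to the compiled code.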
One way I varied from Forth was in op-code encoding. I don't necessarily stick with a stack-based language. Sometimes I treat the register file more like a close, highly aliased memory. Sometimes I use a stack-based language as a macro language, say for a native hardware assembler. And sometimes I mix a stack-based language with something that has arguments, for instance having a word take arguments after the word, while still using it like a stack-based language. So I have used Forth-like things in commercial products. One example: I used to run a photography business and a software development business, and the old business website that I ran in my prior life doing landscape photography was actually generated by a Forth-like language running server-side, which generated all the HTML pages. It made managing a huge website actually practical. Now, I had to use the Wayback Machine to find this, so sorry in advance for the broken images. And of course, I had a different last name then, from a broken marriage, but that's another story. But I did a lot more Forth-like things beyond this one.

One of the things that got me right away was, of course, the lure to optimize. For example, colorForth uses a Huffman-style encoding for the names in its tokens. Remember, a source token is a T-bit tag, typically T is 4, with an S-bit string, where S is 28 bits. We could do a better job of encoding those 28 bits. For instance, we could split the full number range by an initial probability table over the first character, and then split each of those ranges by, say, a one- or two-character predictor, and then train this thing on a giant dictionary. And of course, you're going to have to use lookup tables. And of course, the memory used for the predictor is going to be greater than the rest of this whole system combined. And yeah, it worked.
It produced some very interesting stuff: you could put a number in and it would basically spew a string out, which was pretty cool. This journey, I think, was useful; I learned a lot of things in the process, like where to optimize and where not to optimize. The next question is: should we hash or should we not hash? It turns out that a compressed string like the one in the prior slide is actually a great hash function. I can simply mask off some number of the least significant bits and that becomes my hash. For hashing, I always disliked the issue of only partially using cache lines; that's not very efficient. And of course, we can try to fix that too. We can check a very tiny hash table first, sized to stay in the cache, and then if we miss on that, we can go to the full-size one, the one that's going to have pretty poor utilization of cache lines. And assuming lots of reuse, that tiny hash table is going to keep cache utilization high. However, now we've done two stages of optimization. But we really should start asking: why are we hashing, and why are we compressing? Why are we doing all this overhead? Why do we not just direct-map? After all, if we're depending on an editor, we could just direct-map, or perhaps just address into the dictionary directly. Then we can split off the T-bit tag and the S-bit string for editor use. And that starts simplifying things so we don't have all this complexity in the first place.

The next thing we can do, if we're interpreting, is solve the problem of branch misses. Normally with an interpreter, you would evaluate the word and then return back to the interpreter. That interpreter would look up another word and do another branch, but that branch would always be mispredicted. One option is to just fold the interpreter back into the words themselves.
But of course, we have to make that interpreter really small, otherwise we're doing a lot of code duplication. You can imagine that if you have a thousand words, you're going to embed the interpreter a thousand times. So there are a lot of different ways we can design an interpreter down into a few bytes. For instance, this one is an 8-byte interpreter. This is one I've never used; you can actually do better than 8 bytes, and I'll show you that towards the end.

So of course, the best way to learn is to build stuff, and I built many colorForth-inspired things over the years. With some, like the one to the right here, I got distracted with editor graphics, effectively making something extremely nice to use and very pretty. This one was cool: I moved the dictionary into the source itself and did a direct binary editor. In this thing you'd actually see the cache lines; you're effectively using a hex editor that uses tags to give you contextual information, and each line has a comment on the top followed by the data on the bottom. And of course I used different fonts, because sometimes I'm packing a full number and sometimes I'm packing characters and so on in comments. It was a relatively complicated system, but actually simple when you think about it in the context of what we build today.

One of the first questions to ask yourself is whether you want to work with text source versus a binary editor. So sometimes I would work with text source. In order to make this work well, I would have a prefix character in front of every word, which basically served as the tag. It would also enable me to use very simple syntax coloring inside, say, nano. Most of the things I built this way were more like a Forth macro language used to create a binary.
For instance, as you can see on the right, I would define something that enabled me to build the ELF header; after the ELF header was built, I would actually write the assembler in the source code and then finish off the rest of the binary. These kinds of languages are extremely small; the whole thing fits in, say, a few-kilobyte binary. The other thing I do with these is bootstrap. The first time, I might write the thing in C and run the interpreter in C, and then later I would rewrite the interpreter inside the source code itself and compile that. Now I'm bootstrapped into the language itself. And by doing that, I could actually compare my C code to the code I wrote inside my own language. And of course, I'm faster inside my own language than in C, and the binary is a lot smaller as well, because I have a very, very small ELF header in the binary I generate compared to the one that, say, GCC would generate.

I built some custom x86 operating systems. It was fun to build custom VGA fonts and, of course, mess with the palette entries to improve the initial colors. I did lots of different Forth variations, but typically these projects just got blocked on the mass complexity of today's hardware. Meaning, once you get to the point where you want to, say, draw something on the screen other than using the old DOS VGA frame buffer, or you want to start using input, you start needing a USB driver, and then all of a sudden everything turns into a nightmare.

One thing I mentioned before is that it's very nice to use a Forth-like language as a macro assembly language. In traditional assembly language you write something like add ecx, edx and then a comment: advance pointer by stride. That latter part is heavily commented.
In fact, typically assembly is mostly comments; otherwise a human can't really understand it. When you start using a Forth-like language as a macro assembler, a lot of the time, instead of using the register number, you put the register number inside another word and then use that word. So now you start self-documenting. And if you have common blocks of, say, multiple instructions, you start defining those in some other word, and then you start factoring. This way you self-document everything, and it becomes actually very easy to understand, a lot easier than raw assembler. And on top of this, of course, you can also put comments, but you typically don't need as many.

If we look back at the lessons of all these projects, I think the key thing is that when your OS is an editor, is a hyper-calculator, is a debugger, is a hex editor, you end up with interactive, instant-iteration software development, and that part is wonderful. The Forth style of thinking keeps source small enough that it's approachable by a single developer, and that, I think, is very important. You basically build out your own tools for exactly the way you like to think, and that's where its true beauty lies. Others, like Onot, have built full systems, meaning he is running something that actually works with Vulkan and generates SPIR-V.

So there is another option, and that is going sourceless: no language, no assembler. The code is the data, or the data is the code. Chuck's OKAD was a source of inspiration here. I've only read about it, but it did send me down a spiral of trying various ideas related to what I read. So when we think about sourceless programming, it's best to work from the opposite extreme: start with, say, a hex editor, and then work towards what we would need to make that practical for code generation.
So I think of a binary as an array of, say, n 32-bit words, and then we could have another thing, an annotation, which is an array of n 64-bit words. The annotation could provide a tag which gives context to the data, or could control how the editor manipulates the data. The annotation can also provide an 8-character text annotation for each binary word, which serves as documentation for what the word is for. So part of sourceless programming is how you generate code, and with Forth, hand-assembling words is actually relatively easy, because you don't have that many low-level operations if you're doing a stack-based machine.

I invented something called x68, which I'm planning to do a separate talk on. It's a subset of x86-64 which works with op codes at 32-bit granularity only. Note that x86-64 supports ignored prefixes, which can pad op codes out to 32 bits, and we also have multi-byte NOPs which can align the ends too. And we can do things like adding REX prefixes when we don't need them, again to pad out to 32 bits. So for instance, if we wanted a 32-bit instruction for return, we might put the return, which is a C3, and then pad the rest with a three-byte NOP. And once we've built this 32-bit return number, which we annotate, we can insert a return anywhere just by copying and inserting this word in the source. And later, if we built different op codes, say multi-word ones, we can just use find-and-replace to change those. Effectively, we're moving compilation into edit-time operations. One of the nice things about being at 32-bit granularity for the instruction is that the 32-bit immediates are now at 32-bit granularity as well. And so now we can just have a tag which says "this is an op code" and a tag which says "this is, say, an immediate hex value."
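The padded return word might look like this; this is my reconstruction of the scheme, since the talk specifies only "C3 plus a three-byte NOP."

```python
RET32 = bytes([
    0xC3,              # ret
    0x0F, 0x1F, 0x00,  # 3-byte nop (nop dword ptr [eax]) pads to a 32-bit word
])
assert len(RET32) == 4  # exactly one 32-bit word: insertable by copying
```

Because every op code occupies exactly one 32-bit word, inserting a return anywhere is just copying this one value into the array of words.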
And we could show them separately with different colors. In this case, I have code for setting ESI to a 32-bit immediate. And you'll notice that this one is using the 3E ignored DS segment selector prefix to pad the op code out to 32 bits. And then after that, we have a silly number which we're setting into ESI; that silly number is 12345678 in hex. So it's very easy to do inline data this way. Of course, calls and jumps are another question, and we have an easy solution for that one as well. In x86-64, call and jump use a 32-bit relative immediate, and that displacement is relative to the end of the op code, not the beginning. And so if we want to make an editor support this, we would just tag the relative branch address as a word that is a relative address. And then when we start editing the words inside the binary and, say, we move things around, we would just relink all the words in the binary that have a relative address. So as code changes, things just get fixed up, and this effectively solves the call and jump problem. It's very easy to make an editor which repatches everything.

Conditional branches, you might think those are complicated, but they're actually not: a conditional branch is just an 8-bit relative branch address. And so when I make words for these, I would say "jump-not-equal minus two," which would jump, if not equal, to the word that is two words before this one, or say a jump minus four for four words back, and so on. And so I can just build a few of these constructs and change the op codes around whenever I need, say, a jump-on-zero and so on. The nice thing about this is that now you no longer have to label things, because you just count, and when you move code around it's all relative, so you don't need to do any patching. If you want to add more stuff in your loop, you just change the op code a little.
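The relinking pass can be sketched in a few lines (the data layout here is invented for illustration): words tagged as relative addresses remember their target word index, and every edit recomputes the displacement, measured from the end of the 32-bit word as x86 branches require.

```python
WORD = 4  # bytes per 32-bit word

def relink(words, tags, targets):
    # words[i] tagged "rel" holds a displacement to word index targets[i],
    # measured from the end of word i, i.e. from byte (i + 1) * WORD
    for i, tag in enumerate(tags):
        if tag == "rel":
            words[i] = (targets[i] - (i + 1)) * WORD

words   = [0x11111111, 0x22222222, 0]
tags    = [None, None, "rel"]
targets = [None, None, 0]          # word 2 branches back to word 0
relink(words, tags, targets)        # word 2 becomes the displacement -12
```

After any insertion or move, running relink over the whole array fixes up every tagged word, which is what lets the editor drop labels entirely.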
Another option is for the editor to have built-in 32-bit-word assembly and disassembly, meaning I could use a shorthand for registers and op codes. The shorthand I would use labels the registers starting with G, so that 0 through F can still be used for hex. This is an example of how you might want to do it. In this case I have h + at i 0 8, which disassembles to add rcx, qword ptr [rdx + 0x8]. We can shorthand this very easily, and so I could have an editor that would show you either the disassembly or the shorthand instruction, and that would aid understanding and the ability to insert stuff without using separate tools.

So, I did build this sourceless system once for real. It was back when I was building tiny x86-based operating systems. I built the editor as a console program that would run in Linux; I would build binaries in that console program and then use an x86 emulator running in Linux to actually test them. And this was a pretty liberating experience; I learned a lot from it. On the right, I'm showing one of the boot sectors of one of the examples running in the editor.

Note that with sourceless programming we could extend the annotation quite a bit. For instance, we could have tables that map register numbers to strings, and then for each cache line we could have an index into a table. In this way, registers could be automatically annotated. For instance, if register 2 is set to position and register 4 is set to velocity, then instead of add r2, r4 we could just display add pos, velocity, and that would make it a lot easier to understand automatically. We could also extend this so each cache line has a string as well, and this way we could automatically annotate, say, a label: the first word could be the label and the rest of the string could be used for a comment.
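The register shorthand can be checked with a couple of lines. The exact letter-to-register mapping is my assumption, inferred from the h/i example above (letters count up from 'g' as register 0 in the standard x86-64 encoding order).

```python
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]

def reg(letter):
    # 'g' is register 0, 'h' is 1, ... leaving the characters 0-f free for hex
    return REGS[ord(letter) - ord("g")]

# matches the talk's example: "h + at i 0 8" -> add rcx, qword ptr [rdx + 0x8]
```

Under this mapping, h is rcx and i is rdx, which is consistent with the disassembly given in the talk.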
So there are a lot of ways to do sourceless programming where we just provide annotation tools to make it a practical experience. So let's talk about some of the variations, the pros and cons. The easiest way to start would perhaps be to work with text source. And usually with text source, you're going to use prefix characters for the tag: for instance, slash for comment, colon for define, maybe tick for execute. This does have the slowest runtime, however, because you have a character-granularity loop. It does have a benefit in that, to get started, there is no editor you have to write. You can use an external text editor and you can do easy custom syntax coloring. It's going to be very easy to understand and to work in. However, I think you're missing a big piece if you go down this path, and that is that you don't get any live interaction or debugging. You're basically depending on the fast compile times of your custom language and the fast load times of whatever program you're building to get you into that iteration loop. And you can work this way, I've done it many times, but the experience is nowhere near as good as the fully interactive one with a binary editor. I guess one of the other benefits here is very easy code portability: there are no binary files, just text files, and you can copy and paste as you will.

Of course, the next step up from that is binary source. This gives middle-performance runtime, because now you're working at word granularity inside your interpreter. You have portability, because you have code generation that can adapt to the system, meaning that as you interpret your code, you can look at the underlying hardware and change how the code is generated at runtime. You can build just the bits of an assembler that are needed.
You don't need to build out everything as you would with, say, a disassembler tool or an assembler tool. So for instance, with x86 I don't actually generate much of the ISA; I only use a very, very tiny subset. You do have to write the binary editor, and that can be a lot of work. Sometimes that presents a bootstrapping problem, because you don't have the language to write the editor in from the beginning, so you have to write the editor in some other language, and then in your language write the editor again, and, you know, complete the bootstrapping process. The big benefit here is that you get interactive development and debugging from the beginning, and also your source code now shows how constructs are built instead of just showing the result, as you would get with source-free. One interesting thing about Forth with binary source, and this concept that you can rebuild source code at runtime at any point in execution, is that you can start compiling the things you load into machine code and then skip the interpreter overhead. Basically, you can bake things into machine code anytime at edit time, and that's a very powerful feature.

Now, if we look at sourceless programming, there are a bunch of pros and cons. The one big pro is that it's the fastest runtime; it's a true no-op. You can build things that are tightly optimized for a platform, and they're as optimized as you could possibly get. You do, however, lose the capacity for showing how constants came to be, and you lose the capability of adapting to what the machine is. You can work around this: you can make smaller embedded constants that you write into modified bit fields in instructions, but that's really taking it a little too far in terms of complexity. You do need to write an editor which includes an op-code assembler, and potentially a disassembler, if you want to go down that route.
And if you're going to do both CPU and GPU, that's a lot of work. It can be very complicated when systems include auto-annotation, for instance if you want to type in, say, a readable register string and have it figure out what register number that is. I guess the primary disadvantage here is that there's no possibility of portability; you have a raw binary editor. And today we have a problem with GPUs: the Steam Deck is RDNA 2, the Steam Machine is RDNA 3, and who knows, the future may be RDNA 4 or 5 or 6. The problem with this is that when the ISA changes across those chipsets, it can result in different instruction sizes, so you can't just do simple source-free. For example, MIMG in RDNA 2 has a different size than VIMAGE in RDNA 4, and that's all due to the ISA changes. So if you do sourceful programming, you can port through a compile step. However, with sourceless, you would need to do something else, and perhaps that is just rewriting chipset-specific versions of all the shaders, but that may be something you don't want to do.

So, thinking ahead on the SteamOS/Linux project that I'm working on: effectively, I'm building an AMD-only solution. I don't really have a name for this project, so I'm going to call it Fifth. I think for Fifth I want a mix of various Forth-style concepts, and perhaps the best mix would be the best of both worlds: a fast high-level interpreter intended for GPU code generation, where I need chipset portability at runtime, mixed with low-level sourceless words for the CPU side, for simple x86-64 where I don't need portability at this time. Since this is for Linux, we should think about what we can do on Linux that we might not be able to do on Windows today. We first start thinking about execution.
With the Linux x86-64 ABI, when you're running 30:26 without address-space layout randomization, execution starts at a fixed 4 megabytes 30:31 in, and you can still do this today if you compile without -fPIE. Also note, 30:37 even if you did get position-independent execution, you could effectively just map where you want your fixed 30:44 position to be, then start execution there and just ignore the original mapping that they threw you at. 30:50 Another thing we can do is lean on the kernel's page writeback. And if we look at Steam Deck, we'll notice that dirty pages get 30:56 written back about as often as every 30 seconds, which is an important number. It means the file won't be overwritten too fast. 31:03 So, let's look at something we can do on Linux that we're not allowed to do on Windows anymore: the self-modifying 31:08 binary. The idea here is that we have a cart file. The cart file represents the ROM. Actually, in this case, it's a RAM, 31:15 because we're going to be modifying it. So, first we would execute the cart file. And when the cart file runs, it 31:21 would realize that it's not a backup. And then it would copy itself to a backup file, cart.back, and then it would 31:28 launch cart.back, and then exit. This cart.back would realize that it is the backup file, and then it would map 31:35 the original file, cart, at say 6 megabytes in, and it would make that mapping read/write/execute, and 31:42 then afterwards it would map an adjustable zero fill, and that would be for when we're doing compilation, or when 31:48 we have data that we don't want to be backed up all the time. And after that, it would jump to the interpreter. And so if 31:54 we look at the memory mapping, we'd have at 4 megabytes, say, a 4-kilobyte bootloader section. 32:01 And then we would have somewhere, say at 6 megabytes, the whole file, and then after that, the zero 32:07 fill. And the nice thing about this is we automatically create a backup.
We don't have to write any code to save the 32:13 file, because it's going to autosave every 30 seconds. Also, the data and code are together, and we can make a new 32:19 version just by copying the file itself. Inside the binary we'd have a specific spot for the size of the file and the 32:25 size of the zero fill, so in the process of doing this execution, we can resize when 32:31 we build the cart.back file very easily. So for source code, I'm thinking 32-bit 32:36 words. The words are going to be direct addresses into the binary, and the binary is going to be the dictionary. 32:43 That makes it quite fast to interpret. The nice thing here is they can be direct addresses because we fix the position: 32:50 we're not using address-space randomization. We'll just fix RSI to the interpreter's source position, and then all 32:56 the words will contain the NEXT of the interpreter, meaning all the words end in the interpreter itself, or fold the 33:02 interpreter, in whatever form they want, into their own execution. And by doing this we enable lots of branch-predictor 33:10 slots, because each of these end-of-word copies of the interpreter gets its own branch-predictor slot. So we can 33:17 actually get this down to five bytes if we want to. We can use LODS to 33:22 load a word from RSI and then advance RSI, and we can use two bytes to 33:29 look up the value in the dictionary, and two bytes after that to jump to 33:34 the address of the next thing to run. So now we've gotten down to a five-byte interpreter. Another thing I think I 33:40 would do for variation is make the source more free-form, not strict reverse Polish notation. In other words, with 33:48 regular reverse Polish notation, you're going to have words that push values onto a stack, and therefore your 33:53 granularity is at the word level. If instead we have arguments, we can fetch 33:59 a bunch of values off the stack right off the bat, and we can look them up in the dictionary in parallel.
And now our 34:05 branch granularity is dropping significantly, maybe by a factor of two or four, depending on what our 34:12 common argument count is. I think this is a better compromise for when you're doing lots of code generation, which is 34:17 what we'll be doing on the CPU: mostly GPU code generation. So, for the editor, I'll just do an advanced 32-bit hex 34:24 editor. I'll split the source into blocks, and then each one of those blocks will be split into sub-blocks. And 34:31 each sub-block will be source and then some annotation blocks. And so for every 34:37 32-bit source word, I'm going to have, say, 64 bits 34:42 of annotation information split across two 32-bit words. And that'll give me eight characters. Each of those 34:49 characters will be 7-bit, and I'll have an 8-bit tag for the editor. And the tag will give me the format of the 32-bit value 34:56 in memory and give me whatever else I want in there, so I can adapt, you know, 35:01 for whatever I feel like doing in the future. And I can make this pretty uniform, because I'll mostly have numbers, 35:07 and then direct addresses to words, inside this. So that's it for now. This is a late welcome to 2026. I used 35:15 the holiday for some deep thinking, but I think it's time now for some more building. Take care.