Fascinating exercise and nice work!
Adjacent (resilient, low-level, big-vision, auditable) projects include:
http://collapseos.org/ Forth OS, bootstrappable from paper, for Z80
https://urbit.org/ standalone, distributed, auditable, provable, minimalist
https://justine.lol/ APE (actually portable executable); cosmopolitan libc
Urbit is all that, but it's way stranger than that.
TL;DR: this is an exercise in implementing a C compiler from scratch. "From scratch" here means "without an existing gcc/clang," so consider civilization destruction scenarios, aliens reading our source code, EMP strike that takes out all smart silicon, corporate policy won't let you download development tools, you only have a javascript console and gumption, etc...
To do this, you must:
1. Implement a small tool that turns hexadecimal into binary (you can do this in any language)
2. Use whatever you have (python, POSIX shell, alien crystal substrate, x86-64 machine code, ...) to implement a small VM that runs simple bytecode. The VM has 16 registers and 16MB of working memory. There are sixteen opcodes to implement for arithmetic, memory manipulation, and control flow. There are also twelve syscalls for fopen/fread/fwrite/unlink(!)/etc.
After these two steps (that you have to repeat yourself post-civilization collapse), everything's self-hosted:
3. Use the VM to write a manual linker that resolves labels
4. Use the linker to write an assembler for a custom assembly language
5. Use the assembler to implement a minimal C compiler / preprocessor, that then compiles a more complex C compiler, that can compile a C17 compiler, that then compiles Doom
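To make step 3 a bit more concrete, here is a minimal sketch of two-pass label resolution. The syntax is made up for illustration and is not Onramp's: `name:` defines a label, `@name` refers to one, and everything else is a two-digit hex byte. Addresses are emitted as a single byte, which only works for tiny programs.

```python
def resolve_labels(lines):
    """Two-pass label resolution over a list of source lines.

    Hypothetical syntax (an assumption, not Onramp's format):
    'name:' defines a label at the current byte offset, '@name' emits
    that label's address as one byte, anything else is a hex byte.
    """
    # Pass 1: record the byte offset of every label definition.
    labels = {}
    offset = 0
    for line in lines:
        for token in line.split():
            if token.endswith(":"):
                labels[token[:-1]] = offset
            else:
                offset += 1  # every other token emits exactly one byte

    # Pass 2: emit bytes, replacing label references with addresses.
    out = bytearray()
    for line in lines:
        for token in line.split():
            if token.endswith(":"):
                continue  # definitions emit nothing
            if token.startswith("@"):
                out.append(labels[token[1:]])  # fails past address 255
            else:
                out.append(int(token, 16))
    return bytes(out)
```

The first pass is what allows forward references: `@end` can appear before `end:` is defined because all offsets are known before any byte is emitted.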
See also: nand2tetris (focus is on teaching, less pragmatism), Cosmopolitan C (x64 as actually portable runtime)
What is the concern about aliens reading source code?
Our C compilers don’t work on their alien crystal substrate because the ISA is different and they don’t use ASCII, but the point is that if they understand English, this gives them the tools necessary to make a C compiler without too much prework involved.
I wonder what the author's view on Forth is; it seems like the role of the bytecode VM here might be interchangeable with a Forth implementation.
Author here. I think my opinion would be about the same as the authors of the stage0 project [1]. They invested quite a bit of time trying to get Forth to work but ultimately abandoned it. Forth has been suggested often for bootstrapping a C compiler, and I hope someone does it someday, but so far no one has succeeded.
Programming for a stack machine is really hard, whereas programming for a register machine is comparatively easy. I designed the Onramp VM specifically to be easy to program in bytecode, while also being easy to implement in machine code. Onramp bootstraps through the same linker and assembly languages that are used in a traditional C compilation process so there are no detours into any other languages like Forth (or Scheme, which live-bootstrap does with mescc.)
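As a rough illustration of why a small register VM is easy to implement, here is a toy interpreter. The opcodes and the 4-byte instruction layout are invented for this sketch and are not the actual Onramp instruction set; only the register count of 16 comes from the description above.

```python
def run(program, max_steps=10_000):
    """Run a toy bytecode program; each instruction is 4 bytes: op, a, b, c.

    All opcodes here are hypothetical, chosen only for illustration.
    """
    regs = [0] * 16              # 16 general-purpose registers
    pc = 0                       # program counter (byte offset)
    for _ in range(max_steps):
        op, a, b, c = program[pc:pc + 4]
        pc += 4
        if op == 0x00:           # HALT: stop and return the registers
            return regs
        elif op == 0x01:         # LOADI: regs[a] = immediate value b
            regs[a] = b
        elif op == 0x02:         # ADD: regs[a] = regs[b] + regs[c], 32-bit wrap
            regs[a] = (regs[b] + regs[c]) & 0xFFFFFFFF
        elif op == 0x03:         # SUB: regs[a] = regs[b] - regs[c], 32-bit wrap
            regs[a] = (regs[b] - regs[c]) & 0xFFFFFFFF
        elif op == 0x04:         # JNZ: jump to byte offset b if regs[a] != 0
            if regs[a] != 0:
                pc = b
        else:
            raise ValueError(f"unknown opcode {op:#x}")
    raise RuntimeError("step limit exceeded")


# Sum 5 + 4 + 3 + 2 + 1 into r0, using r1 as a counter and r2 as the constant 1.
SUM_PROGRAM = bytes([
    0x01, 1, 5, 0,   # offset 0:  LOADI r1, 5
    0x01, 2, 1, 0,   # offset 4:  LOADI r2, 1
    0x02, 0, 0, 1,   # offset 8:  ADD r0, r0, r1
    0x03, 1, 1, 2,   # offset 12: SUB r1, r1, r2
    0x04, 1, 8, 0,   # offset 16: JNZ r1, 8
    0x00, 0, 0, 0,   # offset 20: HALT
])
```

Calling `run(SUM_PROGRAM)` leaves 15 in register 0. The whole interpreter is one fetch-decode loop over a flat register array, which is the kind of thing that stays manageable even when hand-written in machine code.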
tl;dr I'm not really convinced that Forth would simplify things, but I'd love to be proven wrong!
[1]: https://github.com/oriansj/stage0?tab=readme-ov-file#forth
You might get a kick out of the C compiler in DuskOS (a bare-metal Forth system).
https://git.sr.ht/~vdupras/duskos/tree/master/item/fs/doc/co...
To add a bit to this, although Dusk OS doesn't have the same goals as stage0, that is to mitigate the "trusting trust" attack, I think it effectively does it. Dusk OS kernels are less than 3000 bytes. The rest boots from source. One can easily audit those 3000 bytes manually to ensure that there's nothing inserted.
That being said, the goal of stage0 is to ultimately compile gcc and there's no way to do that with Dusk OS.
That being said (again), this README in stage0 could be updated because I indeed think that Dusk is a good counterpoint to this critique of Forth.
Oh, amazing! I've heard of DuskOS before but I didn't realize its C compiler was written in Forth.
Looks like it makes quite a few changes to C so it can't really run unmodified C code. I wonder how much work it would take to convert a full C compiler into something DuskCC can compile.
One of my goals with Onramp is to compile as much unmodified POSIX-style C code as possible without having to implement a full POSIX system. For example Onramp will never support a real fork() because the VM doesn't have virtual memory, but I do want to implement vfork() and exec().
It can't compile unmodified C code targeting POSIX. That's by design. Allowing this would import way too much complexity in the project.
But it does implement a fair chunk of C itself. The idea is to minimize the magnitude of the porting effort and make it mechanical.
For example, the driver for the DWC USB controller (the controller on the Raspberry Pi) comes from Plan 9. There was a fair amount of porting to do, but it was mostly to remove the unnecessary hooks. The code itself, where the real logic happens, stays pretty much the same and can be compiled just fine by Dusk's C compiler.
Forth is not a convenient VM to target for C compilers because of its numerous idiosyncrasies (e.g. the stacks don't neatly map to what a typical naive C implementation would expect).
> Security: Compiler binaries can contain malware and backdoors that insert viruses into programs they compile. Malicious code in a compiler can even recognize its own source code and propagate itself. Recompiling a compiler with itself therefore does not eliminate the threat. The only compiler that can truly be trusted is one that you've bootstrapped from scratch.
It is a laudable goal, but without using from-scratch hardware and either running the bootstrap on bare metal or on a from-scratch OS, I think "truly be trusted" isn't quite reachable with an approach that only handles user-space program execution.
Indeed! An eventual goal of Onramp is to bootstrap in freestanding so we can boot directly into the VM without an OS. This eliminates all binaries except for the firmware of the machine. The stage0/live-bootstrap team has already accomplished this so we know it's possible. Eliminating firmware is platform-dependent and mostly outside the scope of Onramp but it's certainly something I'd like to do as a related bootstrap project.
A modern UEFI is probably a million lines of code so there's a huge firmware trust surface there. One way to eliminate this would be to bootstrap on much simpler hardware. A rosco_m68k [1] is an example, one that requires no third party firmware at all aside from the non-programmable microcode of the processor. (A Motorola 68010 is thousands of times slower than a modern processor so the bootstrap would take days, but that's fine, I can wait!)
Of course there's still the issue of trusting that the data isn't modified getting into the machine. For example you have to trust the tools you're using to flash EEPROM chips, or if you're using an SD card reader you have to trust its firmware. You also have to trust that your chips are legit, that the Motorola 68010 isn't a modern fake that emulates it while compromising it somehow. If you had the resources you'd probably want to x-ray the whole board at a minimum to make sure the chips are real. As for trusting ROM, I have some crazy ideas on how to get data into the machine in a trustable way, but I'm not quite ready to embarrass myself by saying them out loud yet :)
[1]: https://rosco-m68k.com/
From the GitHub for Onramp: it’s “self-bootstrapping and can compile itself from scratch”. What does that mean? How can it compile itself if it doesn’t exist?
They do go into some detail of the steps involved. Basically, it seems as though the system unravels itself, going from simple things to more complex.
Thanks, but perhaps I should have been more clear in my question. How can something self-compile if it doesn't have a compiler to start with? Does the onramp source contain some machine code that is a compiler already?
Indeed it does! It all starts with a hexadecimal code which is converted by a tool into the machine code for some simple VM; there seem to be a few more steps, where one thing is used to build another, etc. It is in this sense that this particular system is said to be able to “compile itself.”
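For a sense of how small that first hex-to-binary tool can be, here is a minimal sketch. The input format is an assumption for illustration (`#` starts a comment, whitespace is ignored) and may not match the project's actual hex files.

```python
def hex_to_bytes(text):
    """Convert commented hex text into raw bytes.

    Assumed format (not taken from the project): '#' starts a comment
    that runs to the end of the line, and whitespace between hex digits
    is ignored.
    """
    digits = []
    for line in text.splitlines():
        code = line.split("#", 1)[0]   # drop any trailing comment
        digits.extend(code.split())    # keep only the hex digit runs
    # bytes.fromhex raises ValueError on stray characters or an odd digit count.
    return bytes.fromhex("".join(digits))
```

To use it as a tool, read the hex file as text, pass it through `hex_to_bytes`, and write the result to a file opened in binary mode.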
I think the missing piece here is that onramp contains the definition of a very simple virtual machine. One that the reader could implement themselves (though a reference implementation is also provided).
After that, it incrementally builds on itself all the way up to a fully functional C compiler.
Since this hasn't gotten much attention, I just wanted to say that I think this is a cool project. Nice work!
This is so great. I've been watching the project develop and it's really neat to see this milestone!
Cool project, love the bit about aliens
Can you do a blog about the goals of your project in terms of tech-archaeology? Fascinating topic
as an alpine linux enthusiast, i can say that this is fantastic. keep it clean