Assuming you are genuinely not aware of language implementation processes, this is called bootstrapping. You may want to implement a language A in A (a very common goal!) but this is generally impossible when you don't have a compiler for A, so you first write an implementation of A in an already implemented language B, and then use that implementation to write an A implementation in A, abandoning the initial implementation at the end. Zig just did this.
Cross-compilers are about target platforms (e.g. producing a macOS executable from Windows). We are talking about implementing a language in a different language. (The word "bootstrap" is very general and can be used for both cases.)
The main use case here is not so much building Zig for a new architecture as it is letting contributors build the latest Zig trivially.
Without a bootstrap process like this one, it could happen that you run git pull and then can't build the compiler because it's using a new feature that the version of the compiler you have doesn't support yet.
The wasm process ensures that you can always build the latest commit.
If you want to have a Zig compiler written in Zig, you need to bootstrap it once initially. Cross compilation makes it so you don't need to do that again when you want the compiler to work on a different architecture. Of course, there's the question of why you want to self-host your compiler (instead of keeping the C++ one):
- dogfooding identifies problems & helps with prioritization
- it demonstrates that the language is suitable for serious projects
- most importantly, Zig developers prefer writing Zig over having to use C++
gcc bootstraps using itself. Rust bootstraps using itself. Go bootstraps using itself. D bootstraps using itself. Zig bootstraps using itself. It seems pretty common :)
The need for bootstrapping in this context comes from the lack of a compiler for the language you want to implement, as in before the Zig compiler in C++ there was no Zig compiler.
I think you need to be a bit more humble on a topic that's clearly going over your head.
Imagine a scenario where you are testing out a brand new RISC-V development board. The vendor has only provided a C compiler for this board, as is often the case. You want to use the Zig language to write programs for your new development board, but no Zig compiler exists for this board yet. That means you need to compile the Zig compiler from source. The latest version of the Zig compiler is written in Zig. Again, you don't have a Zig compiler, so how will you compile the Zig compiler from source? You need a way to go from a C compiler to a Zig compiler. That's what this is describing.

It does not make sense to maintain two completely separate versions of the compiler: the "real" one written in Zig and a "bootstrap" one for compiling from C. So the Zig source is compiled into WASM and stored in the repo. On a system with only a C compiler this WASM can be run instead of a native Zig binary. The WASM version can then be used to compile an actual native Zig binary.
They are compiling the current compiler to wasm and then using that compiler to build future versions of their compiler.
In other words, they are doing the described Rust approach, but using a platform-agnostic target instead of a binary. That allows them to build on any platform that has a C compiler and to use current language features without needing to manually backport them.
They could directly target C or C++ but that runs a greater risk of accidentally generating UB. Targeting a bytecode decreases that risk.
That's not "converting back", because by going through WebAssembly you have restricted the language. As long as you have a correct wasm-to-C implementation this is a valid strategy to finish the bootstrap: you no longer depend on C, just WebAssembly.
I'm not the parent commenter, but I will be honest: I fail to understand the purpose of bootstrapping low-level languages. If you were just given the task of writing a compiler, would you honestly choose Zig? No. Then why not have the compiler written in Haskell or whatever high-level language where writing code is actually productive and not error-prone, since a compiler is not a performance-critical application?
The Zig developers would strongly disagree that a compiler is not a performance-critical application, and they would also probably disagree that Zig doesn't bring anything to the table when it comes to writing compilers.
As a generalization, the people who are motivated enough to work on a language, want to use that language. It's only natural that they would want to write their compiler in that language too, if practical. Contributors to the Zig project would on average probably be more proficient and productive in Zig than they would be in a language they don't care about so much.
It's also just helpful to have the people who are designing the language working in that language regularly in the context of a sizable and nontrivial project.
> The Zig developers would strongly disagree that a compiler is not a performance-critical application,
Have they written an incremental compiler then? Or just an old-fashioned slow batch one? Compiler architecture matters much more than whether its runtime has a GC or not.
I understood your original comment to imply that you disagree with their choice of self-hosting the Zig compiler because they should instead have focused on architectural improvements, in which case I disagree, because the two things are completely orthogonal in my mind: the benefits of self-hosting have little to do with performance, and certainly don't come at the cost of it. I apologize if that's not what you were implying.
> This is a thinly veiled ad hominem.
I never personally attacked you, so I'm not sure why you think this is ad hominem. That's a very uncharitable interpretation of my comment.
> If you were just given the task of writing a compiler, would you honestly choose Zig? No.
Actually, yes.
A compiler is not just a dumb filter that eats text and spits out machine code. It can provide infrastructure to other tools (like linters, analyzers, LSP servers...), and even allow importing and using parts of it in user programs.
Some of this can't be done in a different language (like Haskell) without some very crazy FFI, and might add an extra runtime dependency for those tools, which might not always be desirable.
Thank you for this comment, I only wish it was higher up in this thread.
The number of ignorant comments in this thread is astounding. All this criticism feels like it was written by back-seat drivers who have no clue about the complexities of language design or compiler implementation.
Think of porting to a new CPU architecture. If your compiler is written in its own language, then when you add support for the new CPU target you can compile your compiler using your compiler, targeting the new CPU. Now you have a compiler that not only produces code for the new CPU, but one that also runs on that CPU.
The alternative would be to port your Haskell compiler to the new CPU too in order to set up a self-hosting toolchain. That's much more work, because you not only have to be proficient in Haskell, you also need Haskell compiler implementation skills on top of the skills for your own compiler.
Well, this might be a valid reason, especially given that embedded is important for Zig’s target domain. (Though, are there all that many new architectures nowadays?)
As far as I can tell, the main reason we all spend so much time waiting for compilers is that compilers aren't considered as performance-critical as they should be.
My full-time job is making a compiler for a high-level language, and I only considered systems languages (e.g. Zig, Rust) as contenders for what to write it in - solely because compiler performance is so critical to the experience of using the compiler.
In our case, since the compiler is for a high-level language, we plan never to self-host, because that would slow it down.
To me, it seems clear that taking performance very seriously, including language choice, is the best path to delivering the fastest feedback loop for a compiler's users.
If I were to give a bad-faith argument, I would say that Rust’s compiler, while written in Rust, is not famous for its speed (though I do know that is due to it doing more work).
I honestly fail to see why a lower-level language would be faster, especially since compilers are notorious for their non-standard allocation and lifecycle patterns, so a GC might actually be faster here.
> I honestly fail to see why a lower-level language would be faster, especially since compilers are notorious for their non-standard allocation and lifecycle patterns, so a GC might actually be faster here.
The nonstandard allocation and lifecycle patterns are a major part of the reason I want a systems language and not a GC - it means I have strictly more control over when allocations happen, I can do cheap phase-oriented allocations and deallocations with arenas, etc.
Rust's compiler is an interesting example. It was originally implemented in OCaml (which has a reputation for being a GC'd language with good runtime performance), and then rewritten in Rust in order to self-host - and it got faster. In contrast, the Go team rewrote their compiler from a systems language to Go (which also has a good reputation for runtime performance), again in order to self-host, and it got slower.
Rewrites are a different beast; I doubt they are fairly comparable. They have probably realized some better abstractions by now that ease implementation, and may thus also boost performance. Also, Go’s GC has never been considered “good”. And OCaml only recently got multicore support, didn’t it?
Non-standard lifetimes are not really helped by arena allocators though, and not everything is needed in each phase - or there aren't such cleanly divided phases at all. But you may be right; I honestly can’t tell with certainty.
The slow part of Rust compiles is LLVM, though some of it may be due to bloated IR input, which is a frontend concern. There's an alternate Cranelift-based backend that's usable already if runtime efficiency is not your priority.
LLVM's fast path (used by clang -O0) is fast. Rust's primary problem is that it can't use LLVM's fast path (because the fast path implements only a subset of LLVM IR), and LLVM upstream is uninterested in extending it because that would slow down clang -O0.
Yes. I guess FastISel still isn't fast enough to overcome the larger compilation units, but isn't it a substantial improvement over the default code generator?
I'm sorry, I haven't taken any measurements to find out the answer to this question. I would be curious to hear about how much it affected Rust builds if you explore this. I remember I spent an afternoon trying to enable FastISel in order to speed up LLVM, only to realize that we had been using it all along.
> If you were just given the task of writing a compiler, would you honestly choose Zig? No.
Why wouldn't you? I understand why not for high level languages like Python or Ruby (since they're interpreted) but not for low level ones. Rust for example is also bootstrapped.
Compilation can be very intensive, and it's detrimental to a developer's workflow if they must wait for long recompiles.
Rust was originally written in OCaml before being self-hosted, and it wouldn't be as fast (or would be even slower ;) ) today if it was still OCaml.
And remember, low-level =/= poor abstractions. I think there are several novel abstractions available in Zig which the compiler devs probably want to make use of themselves.
They might have good language abstractions, but manual memory management is simply an implementation detail orthogonal to solving the problem; dealing with it is just more work and more leaky abstractions.
[ as someone who does not work in language design ] - it does feel sometimes like this achievement is more a source of pride than a hard requirement. A sort of symbolic (no pun) closing of the circle.
Is there a reason why keeping a compiler in, say, C would be a bad idea long-term?
I'm of course not Andrew Kelley ;-), but I think it's strategic. C++ would indeed be less productive than Haskell for writing a language implementation, but if the target audience (at least initially) already knows C++, then writing the compiler in a suboptimal language may help. I know this is not universally applicable; for example, Rust bootstrapped from OCaml, but it achieved self-hosting very early in its life (back when the goal of the language was not yet certain), so that might also have been strategic.
Writing the initial compiler in C++ is a rather surprising choice. My time would be too precious for that. OCaml was a good choice from the Rust devs and had a nice influence on the language. Rust has proper algebraic data types and pattern matching.
They wrote it in C++ because LLVM is in C++. Currently the only critical parts of the Zig compiler that are in C++ are the bindings to LLVM and, I think, the code for linking in Clang.
Compilers are performance-critical in the sense that you wouldn't want to wait 15 minutes to an hour for your code to build. Consider the layers of cache applied to compilation pipelines, along with distributed builds, used to speed things up in places with millions of lines of code (e.g. game studios), along with turning off linker features which slow things down.
You'd want faster compilation so that you can test your changes without breaking your flow; having to wait 1-5 minutes means you'll end up reading HN or checking chat for 10 or so minutes. That's also why there's interest in hot-code reloading and incremental linking: they further reduce compilation to just the changes you've made and nothing more.
There is no significant difference between managed languages and something like Zig in this category of programs - it is not a video codec - so I stand by my “not performance-sensitive” claim. And especially because algorithms matter much, much more, there is a good chance that a faster one can be implemented when you don’t have to care about whether that memory location is still alive or not.
After on-disk caching, the biggest performance improvements I've seen to compilers have been from changing how data structures are allocated/deallocated and laid out in memory.
I haven't seen much opportunity to improve algorithm performance because which algorithms are applicable is heavily constrained by language design.
In my experience the difference between something like Zig and a language with managed memory is large.
Dogfooding is not always related to bootstrapping. In fact, this is a reasonably common pitfall for language designers: language implementations are very specific, and if you only tune your language for them, your language will of course need ADTs and generics and value semantics and GCs. Not that those are bad things, but you may have missed a whole slew of possibilities by doing so! And that's why you can do dogfooding without bootstrapping: you can instead have other big-enough software written in your language and coevolve with it. For example, Rust did this with Servo.
Writing the compiler in a different language limits the language users' ability to contribute - they'll need to learn another language - and makes porting more complex since you'll need to port Haskell or whatever to the new platform. Dogfooding can also be an advantage.
Today’s managed languages are very fast. For example, if Java is not fast enough for your HFT algorithm, then neither is C++, or even generic CPUs! You have to go the custom-chip route then. Where there is a significant difference between these categories is in memory usage and predictability of performance. (In other applications, e.g. video codecs, you will have to write assembly by hand in the hot loops, since there low-level languages are not low level enough.) Since these concerns do not apply to compilers, I don’t think a significant performance difference would be observable between, say, a Java and a Zig implementation of a certain compiler.