Hey! The idea isn't to replace the compiler with an LLM, the tech is not there yet. Where we see value is in using these models to guide an existing compiler, e.g. orchestrating optimization passes. That way the LLM won't break your code, nor will the compiler (to the extent that your compiler is free from bugs, which can be tricky to detect; cf. Sec 3.1 of our paper).
I've done some similar LLM compiler work, obviously not at Meta's scale: teaching an LLM to optimize by feeding an encoder/decoder pairs of -O0 and -O3 code. Even at my small scale I managed to get the LLM to spit out the correct optimization every once in a while.
I think there's a lot of value in using LLM compilers specifically for superoptimization, where you can generate many candidate optimizations, verify their correctness, and pick the best one. I'm excited to see where y'all go with this.
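The generate/verify/select loop I mean is easy to sketch. Here's a toy superoptimizer in Python over a made-up four-instruction machine (the instruction set and function names are all hypothetical, just to show the shape of the search; a real one would enumerate machine instructions or LLVM pass orderings):

```python
import itertools
import random

# Toy single-register instruction set (purely illustrative).
OPS = {
    "add_self": lambda x: x + x,  # x -> 2x
    "inc":      lambda x: x + 1,
    "dec":      lambda x: x - 1,
    "neg":      lambda x: -x,
}

def run(seq, x):
    """Execute a sequence of ops on the single register."""
    for op in seq:
        x = OPS[op](x)
    return x

def superoptimize(target, max_len=3, trials=100):
    """Enumerate candidate sequences shortest-first, keep only those that
    agree with `target` on random test inputs, and return the first
    (i.e. shortest) verified candidate."""
    tests = [random.randint(-1000, 1000) for _ in range(trials)]
    for length in range(1, max_len + 1):
        for seq in itertools.product(OPS, repeat=length):
            if all(run(seq, t) == target(t) for t in tests):
                return seq
    return None

# Find a shortest sequence computing 2x + 1.
print(superoptimize(lambda x: 2 * x + 1))  # -> ('add_self', 'inc')
```

In a real superoptimizer the verification step would be an SMT query or differential testing rather than random probing, but the generate-many/verify/pick-best structure is the same.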
Thank you for freeing me from one of my to-do projects. I wanted to do a similar autoencoder with optimisations. Did you write about it anywhere? I'd love to read the details.
There's code there to generate unoptimized/optimized pairs via C program generators like yarpgen and csmith, then compile, train, run inference, and disassemble the results.
then maybe don't name it "LLM Compiler", just "Compiler Guidance with LLMs" or "LLM-aided Compiler Optimization" or something - it will get much more to the point without overpromising
Yeah, the name was misleading. I thought it was going to be source-to-object translation, maybe with techniques like those used for translating natural languages.
> The idea isn't to replace the compiler with an LLM, the tech is not there yet
What do you mean the tech isn't there yet? Why would it ever even go in that direction? We do those kinds of things for shits and giggles, but for any practical use? Come on. From fast and reliable to glacial and not even working a quarter of the time.
I guess maybe if all compiler designers die in a freak accident and there's literally nobody to replace them, then we'll have to resort to that after the existing versions break.
We use the same architecture as other LLMs, but we include no natural language in our pretraining. We figured a single-domain training corpus would make evaluation easier. We’ll be looking at layering this on top of something like Code Llama next
Hey, we’re targeting code size in this work, not runtime performance. You would use an option like -O3 to optimize for runtime and -Oz to optimize for code size. The pass pipelines are different for both
Program synthesis is part of the loss function, which is what makes it an auxiliary learning task.
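Concretely, an auxiliary task just contributes an extra weighted term to the training loss. A minimal sketch (the function name and the weight are hypothetical, not the paper's actual values):

```python
def total_loss(pass_list_loss, synthesis_loss, aux_weight=0.5):
    """Combine the primary objective (predicting good pass lists) with the
    auxiliary program-synthesis objective as a weighted sum. Gradients flow
    through both terms, so the model is trained on both tasks at once."""
    return pass_list_loss + aux_weight * synthesis_loss
```

The auxiliary head can be dropped at inference time; its only job is to shape the representation during training.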
We haven’t experimented with model size yet, we just used the same configuration as the smallest Code Llama. We did play with dataset size and found that performance tracks the usual scaling laws. Details in the paper
Hey Jon, we don't change semantics. We just choose optimization passes to run to match a particular input code. Agreed that code size wins could regress performance. We don't measure that yet, but will be looking into it next. There are some applications where code size above all else is what you want to optimize for.
Hey, author here. We use machine learning to generate a list of _optimization passes_ for the compiler to run. These optimization passes give us a 3.0% improvement over the default (-Oz), and it is still the compiler that generates the code. We don't do anything to the code that breaks semantics (assuming no bugs in the compiler passes; some nuance is needed here ;) ).
We also train the model to generate what it thinks the optimized code will look like. We find that this helps the model choose better pass lists, but obviously the code cannot be trusted and semantics are not guaranteed. It only compiles in 91% of cases. "Perfectly emulating the output of the compiler" means the model spat out code that is character-for-character identical to what the compiler generates with the given pass list (even choosing the same variable names etc). IMO this is no mean feat, but there is still a long way to go before using LLMs for codegen. We provide a bunch of examples in the paper of things that LLMs can and cannot do.
Hey, I just read through this paper. Phase ordering is currently driven by heuristics. I noticed that you're only using LLVM instruction count as the measurement. However, this metric might not accurately reflect the program's actual performance or code size: LLVM has instructions like GEP that can lower to several assembly instructions. Additionally, I suggest running some large benchmarks like SPEC to demonstrate the performance benefits of using the LLM.
Hey, yes that's right, and good callout on GEP instructions. We admit that instruction count is a bit handwavy, but we use it as a starting point as that's what the prior works we compare against optimize for. We'll be looking at true binary size next, and code runtime after.
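For anyone curious what "IR instruction count" means as a metric, here's a rough sketch of counting instructions in a textual LLVM IR snippet (real tooling would ask LLVM itself rather than scrape text; this is just to make the metric concrete). Note how the GEP criticism above applies: each counted line is one IR instruction, however many machine instructions it lowers to.

```python
def count_ir_instructions(ir_text):
    """Very rough proxy: count lines inside function bodies that look like
    LLVM instructions (indented statements, skipping labels, comments, and
    the define/closing-brace lines)."""
    count = 0
    in_func = False
    for line in ir_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("define"):
            in_func = True
        elif stripped == "}":
            in_func = False
        elif (in_func and stripped
              and not stripped.startswith(";")
              and not stripped.endswith(":")):
            count += 1
    return count

ir = """
define i32 @square(i32 %x) {
entry:
  %mul = mul nsw i32 %x, %x
  ret i32 %mul
}
"""
print(count_ir_instructions(ir))  # -> 2
```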
This. For now we rely on differential testing against a gold-standard implementation (e.g. unoptimized). For the action space we expose, any semantics-breaking change induced by our tool is a compiler bug.
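For readers unfamiliar with differential testing, the idea in miniature (toy functions, not our actual harness): run the candidate and the gold standard on the same inputs and flag any disagreement.

```python
import random

def reference_sum(xs):
    """Gold standard: straightforward, unoptimized implementation."""
    total = 0
    for x in xs:
        total += x
    return total

def optimized_sum(xs):
    """Candidate 'optimized' implementation under test."""
    return sum(xs)

def differential_test(ref, candidate, trials=1000):
    """Feed both implementations the same random inputs; any disagreement
    is a bug in the candidate (or in the reference)."""
    for _ in range(trials):
        xs = [random.randint(-100, 100)
              for _ in range(random.randint(0, 20))]
        assert candidate(xs) == ref(xs), f"mismatch on {xs}"

differential_test(reference_sum, optimized_sum)
```

In our setting "reference" is the unoptimized build and "candidate" is the build produced with the model-chosen pass list.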
Author here. In the sense that a trained neural net produces the same output from the same input, this is deterministic. But I don’t think that’s what you’re getting at. Where it gets interesting is if we inserted a feedback loop (such as FDO) so that the compiler could fine tune itself on past decisions. In that case, the build would still be deterministic, but the compiler would keep changing, invalidating the build cache.