The Pizlo special approach sounds a bit like converting out of SSA form via compensating `alloca`s in LLVM. E.g. one `alloca` per SSA variable, with a `store` into the `alloca` in the source block, and the `phi` replaced by a `load`.
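This is essentially what LLVM's reg2mem pass does. At the C level, the demotion looks roughly like this (an illustrative sketch; the function names are made up):

```cpp
// SSA form: the merge of `a` and `b` becomes a phi after mem2reg,
// e.g. %v = phi i32 [ %a, %then ], [ %b, %else ]
int pick_ssa(bool cond, int a, int b) {
    return cond ? a : b;
}

// Demoted form: one stack slot (alloca) per SSA variable, a store in
// each source block, and a load where the phi used to be.
int pick_demoted(bool cond, int a, int b) {
    int slot;            // the compensating alloca
    if (cond) slot = a;  // store in the `then` block
    else      slot = b;  // store in the `else` block
    return slot;         // load replaces the phi
}
```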
I created a monster, then a primordial explosion, then a nested simulation within the mind of the monster, and asked it to describe the physics of this nested simulation. Very engrossing.
Cool! You might even be able to run Rellic [1,2] on the LLVM IR produced by Clang when compiling Objective-C code. If it works, this will spit out goto-free C code, not C++.
> It would certainly not hurt performance to emit a compiler warning about deleting the if statement testing for signed overflow, or about optimizing away the possible null pointer dereference in Do().
I think that the nature of Clang and LLVM makes this kind of reporting non-trivial / non-obvious. Deletions of this form may result from multiple independent optimizations being brought to bear in sequence, culminating in the discovery of a trivially always-taken or never-taken branch. Deletion is thus just the last step of a process, and by that step, the original intent of the code (e.g. an overflow check) has been lost to time.
Attempts to recover that original intent are likely to fail. LLVM instructions may have source location information, but these are just file:line:column triples, not pointers to AST nodes, and so they lack actionability: you can't reliably go from a source location to an AST node. In general, LLVM's modular architecture means that you can't assume the presence of an AST in the first place, as the optimization may be executed independently of Clang, e.g. via the `opt` command-line tool. This further implies that the pass may run on a different machine than the one where the original LLVM IR was produced, meaning you can't trust that a referenced source file even exists. There's also the DWARF metadata, but that's just another minefield.
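To make the deleted-check scenario concrete, here's the classic shape of an overflow test that an optimizer is entitled to fold away (a hedged sketch; whether it is actually deleted depends on the compiler and flags, and the function names are made up):

```cpp
#include <climits>

// Signed overflow is undefined behaviour in C and C++, so the optimizer
// may assume `x + 1` never overflows, fold this comparison to `false`,
// and delete any branch guarded by it.
bool next_would_overflow(int x) {
    return x + 1 < x;  // may be folded to `false` at -O2
}

// A well-defined formulation the optimizer must preserve:
bool next_would_overflow_safe(int x) {
    return x == INT_MAX;
}
```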
Perhaps there's a middle ground with `switch`, `select`, or `br` instructions in LLVM operating on `undef` values, though.
At Trail of Bits, we've been working on this type of IR for C and C++ code [1]. We operate as a kind of Clang middle end: we take in a Clang AST and emit Clang-compatible LLVM IR out the other end. In between, we progressively lower from a high-level MLIR dialect down to LLVM.
There's been a longstanding wish to do things like inject a std::vector::reserve ahead of a loop that appends to a vector. That's difficult to do once you've lowered the standard library to pointer arithmetic on structs. Clang emitting an MLIR dialect that preserves a lot of C++ semantic information before translation to LLVM IR would be a big deal in the LLVM pipeline.
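For illustration, this is the transformation such a library-aware pass might perform, written out by hand (a sketch; the function names are mine, and a real pass would do this on the IR, not the source):

```cpp
#include <cstddef>
#include <vector>

// Before: each push_back may trigger a reallocation and copy.
std::vector<int> squares_naive(std::size_t n) {
    std::vector<int> out;
    for (std::size_t i = 0; i < n; ++i)
        out.push_back(static_cast<int>(i * i));
    return out;
}

// After: the reserve a library-aware pass would inject, since the trip
// count `n` is known before the loop; one allocation instead of many.
std::vector<int> squares_reserved(std::size_t n) {
    std::vector<int> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        out.push_back(static_cast<int>(i * i));
    return out;
}
```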
It was pretty cool; you could stream in new facts to it over time and it would incrementally and differentially update itself. The key idea was that I wanted the introduction of ground facts to be messages that the database reads (e.g. off of a message bus), and I wanted the database to be able to publish its conclusions onto the same or other message buses. I also wanted to be able to delete ground facts, which meant it could publish withdrawals of previously published conclusions. A lot of it was inspired by Frank McSherry's work, although I didn't use timely or differential dataflow. In retrospect I probably should have!
This particular system isn't used anymore because we made a classic monotonicity mistake by making it the brain of a distributed system, and then having it publish and receive messages with a bunch of microservices. The internal consistency model of the datalog engine didn't extend out to the microservices, and the possibility of feedback loops in the system meant that the whole thing could lie to itself and diverge uncontrollably! Despite this particular application of the engine being a failure, the engine itself worked quite well and I hope to one day return to datalog.
I think what a lot of people miss with datalog, and what becomes apparent as you use it more, is just how unpredictable many engines can be with the execution behavior of rules. This is the same problem that you have with a database, where the query planner makes a bad choice or where you lack an index, and so performance is bad. But with datalog, the composition of rules that comes so naturally also tends to compound this issue, resulting in time spent trying to chase down weird performance things and doing spooky re-ordering of your clause bodies to try to appease whatever choices the engine makes.
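To make the evaluation model concrete, here's the textbook path/edge program evaluated bottom-up to a fixpoint (a naive sketch, not semi-naive evaluation and not any particular engine; `Rel` and `transitive_closure` are names I've made up). The nested join order here is fixed by hand; a real engine chooses it for you, which is exactly where the unpredictability comes from:

```cpp
#include <set>
#include <utility>

// datalog program being evaluated:
//   path(X, Y) :- edge(X, Y).
//   path(X, Z) :- path(X, Y), edge(Y, Z).
using Rel = std::set<std::pair<int, int>>;

Rel transitive_closure(const Rel& edge) {
    Rel path = edge;                     // path(X, Y) :- edge(X, Y).
    bool changed = true;
    while (changed) {                    // iterate until fixpoint
        changed = false;
        Rel next = path;
        for (const auto& [x, y] : path)      // join path(X, Y) ...
            for (const auto& [y2, z] : edge) // ... with edge(Y, Z)
                if (y == y2 && next.insert({x, z}).second)
                    changed = true;      // derived a new path(X, Z) fact
        path = std::move(next);
    }
    return path;
}
```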
At Trail of Bits, we are creating a new compiler front/middle end for Clang called VAST [1]. It consumes Clang ASTs and creates a high-level, information-rich MLIR dialect. Then, we progressively lower it through various other dialects, eventually down to the LLVM dialect in MLIR, which can be translated directly to LLVM IR.
Our goals with this pipeline are to enable static analyses that can choose the right abstraction level(s) for their goals, and, using provenance, to cross abstraction levels and relate results back to source code.
Neither Clang ASTs nor LLVM IR alone meet our needs for static analysis. Clang ASTs are too verbose and lack explicit representations for implicit behaviours in C++. LLVM IR isn't really "one IR," it's two IRs (LLVM proper, and metadata), where LLVM proper is an unspecified family of dialects (-O0, -O1, -O2, -O3, then all the arch-specific stuff). LLVM IR also isn't easy to relate to source, even in the presence of maximal debug information. The Clang codegen process does ABI-specific lowering that takes high-level types/values and transforms them to be more amenable to storing in target-CPU locations (e.g. registers). This actively works against relating information across levels; something that we want to solve with intermediate MLIR dialects.
Beyond our static analysis goals, I think an MLIR-based setup will be a key enabler of library-aware compiler optimizations. Right now, library-aware optimizations are challenging because Clang ASTs are hard to mutate, and by the time things are in LLVM IR, the abstraction boundaries provided by libraries are broken down by optimizations (e.g. inlining, specialization, folding), forcing optimization passes to reckon with the mechanics of how libraries are implemented.
We're very excited about MLIR, and we're pushing full steam ahead with VAST. MLIR is a technology that we can use to fix a lot of issues in Clang/LLVM that hinder really good static analysis.
These are great questions! We think our approaches are complementary. We think that CIR, or ClangIR, can be a codegen target from a combination of our high-level IR and our medium-level IR dialects.
Our understanding of ClangIR is that it has side-stepped the problem of trying to relate/map high-level values/types to low-level values/types -- a process which the Clang codegen brings about when generating LLVM IR. We care about explicitly representing this mapping so that there is data flow from a low-level (e.g. LLVM) representation all the way back up to a high-level. There's a lot of value and implied semantics in high-level representations that is lost to Clang's codegen, and thus to the Clang IR codegen. The distinction between `time_t` and `int` is an example of this. We would like to be able to see an `i32` in LLVM and follow it back to a `time_t` in our high-level dialect. This is not a problem that ClangIR sets out to solve. Thus, ClangIR is too low level to achieve some of our goals, but it is also at the right level to achieve some of our other goals.
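For instance (a hedged illustration of the type erasure, with a made-up helper function):

```cpp
#include <ctime>

// `time_t` carries meaning ("seconds since the epoch") that a raw integer
// does not, but Clang's codegen erases the distinction: both parameters
// below lower to plain LLVM integer types on typical targets, so an
// IR-level analysis cannot tell them apart without the kind of provenance
// a high-level dialect can preserve.
long add_seconds(std::time_t when, long delta) {
    return static_cast<long>(when) + delta;  // both are just integers here
}
```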
> LLVM dialect in MLIR, which can be translated directly to MLIR
Should be:
LLVM dialect in MLIR, which can be translated directly to LLVM IR
Otherwise, great project! We're also using MLIR internally and it's been awesome, game-changing even when considering how much can be accomplished with a reasonable amount of effort.
I think the next big problems for MLIR to address are things like: metadata/location maintenance when integrating with third-party dialects and transformations. With LLVM optimizations, getting the optimization right has always seemed like the top priority, and then maybe getting metadata propagation working came a distant second.
I think the opportunity with MLIR is that metadata/location info can be the old nodes or other dialects. In our work, we want a tower/progression of IRs, and we want them simultaneously in memory, all living together. You could think of the debug metadata for a lower level dialect being the higher level dialect. This is why I sometimes think about LLVM IR as really being two IRs: LLVM "code" and metadata nodes. Metadata nodes in LLVM IR can represent arbitrary structures, but lack concrete checks/balances. MLIR fixes this by unifying the representations, bringing in structure while retaining flexibility.
I've always assumed that the `0x` prefix is to make copy & paste easier, i.e. so that you can copy the number, then immediately use it in a C- or C++-like expression.
I work on a Google Test-like unit testing framework called DeepState [1] that gives you access to fuzzing and symbolic execution from your C or C++ unit tests. We have a tutorial [2] describing how to use it. It's new and still under development, but shows massive potential, and really brings these software security tools into a more developer-centric workflow.
If this is the case, this is an approach I've taken in the past to unify how LLVM-based taint tracking instrumentation of "normal" `alloca`s and phi nodes works, e.g.: https://github.com/lifting-bits/vmill/blob/master/tools/Tain...