Hacker News | past | comments | ask | show | jobs | submit | pag's comments

I implemented something like this! The Dr. Lojekyll [1] Datalog system was built around the idea of receiving inputs as messages over time, and publishing outputs (also as messages) or answering queries (using materialized views). It worked quite well, but its differential nature ended up being the downfall for the use case we had in mind.

Specifically, it had no way to "extend" its consistency model out to consumers of its messages, and so if consumers were also producers, then the distributed system as a whole could enter a state where parts of it were lying to each other!

Otherwise, there were some really fun things you could do with it. For example, we could have per-client databases that would bring themselves up-to-date given the differential outputs of the server database. This would let you engage in a kind of manual sharding of data.

[1] https://www.petergoodman.me/docs/dr-lojekyll.pdf


These are great questions! We think our approaches are complementary. We think that CIR, or ClangIR, can be a codegen target from a combination of our high-level IR and our medium-level IR dialects.

Our understanding of ClangIR is that it has side-stepped the problem of relating/mapping high-level values/types to low-level values/types -- a process that Clang's codegen performs when generating LLVM IR. We care about explicitly representing this mapping so that there is data flow from a low-level (e.g. LLVM) representation all the way back up to a high-level one. There's a lot of value and implied semantics in high-level representations that are lost to Clang's codegen, and thus to the ClangIR codegen. The distinction between `time_t` and `int` is an example of this. We would like to be able to see an `i32` in LLVM and follow it back to a `time_t` in our high-level dialect. This is not a problem that ClangIR sets out to solve. Thus, ClangIR is too low-level to achieve some of our goals, but it is at the right level to achieve some of our others.


Can you explain how your offering compares with open-source alternatives, such as KLEE [1] and DeepState [2]?

P.S. your logo is surprisingly similar to that of ForAllSecure [3], which also provides a symbolic execution engine called Mayhem [4].

[1] https://klee.github.io/ (online example: http://klee.doc.ic.ac.uk/)

[2] https://github.com/trailofbits/deepstate#readme

[3] https://forallsecure.com/

[4] https://users.ece.cmu.edu/~aavgerin/papers/mayhem-oakland-12...


Yeah sure. We initially wanted to build something to compare the equivalence of two programs written in different languages, and for this we started out using KLEE. Unfortunately, we found it didn't quite satisfy our needs, and the changes we wanted to make didn't really fit with its architectural model. We also felt it wasn't as amenable to parallelisation as we would have liked. The core technology uses an SMT solver just like KLEE, but our hope is that Symbolica will be more scalable and easier to use for larger problems.

DeepState looks interesting; we'd seen it before, along with other Trail of Bits projects, but admittedly we haven't used it ourselves. Our impression is that it's more of a frontend to various symbolic execution / fuzzing backends, so maybe it's something we should consider integrating with too.

We hadn't come across ForAllSecure before, so thanks for pointing them out. They do appear to be similar, but their focus seems to be on large commercial/enterprise projects in sectors like defence and aerospace. So maybe I'm wrong, but I don't think they have an offering for individuals / smaller teams.

On the logo point, as another commenter pointed out, it's the mathematical forall symbol, so I guess we both had the same thought when coming up with a logo.


Their logo is just the \forall quantifier, which is pretty much what these products do, so I wouldn't be that surprised.

https://en.wikipedia.org/wiki/Universal_quantification


DeepState [1] is a tool that lets you write Google Test-style unit tests, as well as property tests, in either C or C++, and plug in fuzzers and symbolic executors. That is, DeepState bridges the gap between fuzz testing and property testing.

[1] https://github.com/trailofbits/deepstate


It sounds like what you want is Mishegos [1], described here [2]. In fact, this work shows that Capstone performs rather poorly compared to more permissively licensed open-source competitors such as Intel XED and Zydis. Really, if your focus is x86, then Capstone is not the right choice. If you need cross-architecture support, then perhaps Capstone is the right choice.

[1] https://github.com/trailofbits/mishegos

[2] https://blog.trailofbits.com/2019/10/31/destroying-x86_64-in...


Mishegos looks really nice!

Yes, that's half of what I was suggesting!

With a little bit of additional work, Mishegos could also be used to test x86 Assemblers... and/or to "pit" x86 Assemblers vs. x86 Disassemblers...

But anyway, excellent links!


And let's not forget that Rust's borrow checker is implemented with Datalog [1].

[1] https://smallcultfollowing.com/babysteps/blog/2018/04/27/an-...


Datalog also comes up in cluster management / provisioning. For example: Differential Datalog [1].

[1] https://github.com/vmware/differential-datalog


Datalog and variants of it come up a bunch in program analysis and software security. For example, GitHub/Microsoft acquired a company, Semmle [1], that has an object-oriented query language that compiles down into Datalog, and can be used to query source code in interesting ways. Souffle [2] comes up in static analysis as well, and is used in systems such as Doop [3]. A kind of predecessor to Doop is bddbddb [4].

Souffle has been used in Ddisasm [5] for disassembling binaries. XSB Prolog has been used in OOAnalyzer [6] for inferring class hierarchies and virtual tables in binaries as well.

I myself spent some time working on Datalog compilation for program analysis and decompilation :-)

[1] https://semmle.com

[2] https://souffle-lang.github.io/index.html

[3] https://bitbucket.org/yanniss/doop/src/master

[4] https://suif.stanford.edu/bddbddb

[5] https://github.com/GrammaTech/ddisasm

[6] https://resources.sei.cmu.edu/library/asset-view.cfm?assetid...


If you can modify the kernel, you could modify binfmt_elf.c in the case of the Linux kernel, and log out the load address to dmesg or something like that.

Another alternative, which I believe is available with both PIN and DynamoRIO, is to implement (or use an existing) ELF loader, fork yourself, and load the target binary of interest at a known location.


If you're inside the process, as the LD_PRELOAD interceptor mentioned in the article is, then in theory you could take the address of the binary's main function (or some other symbol that you expect to be exported), scan down until you find the lowest mapped page, and then store that address somewhere known in advance. You can detect the first unmapped page through clever use of system calls that return EFAULT as a possible error code.

