Writing a Simple Decompiler for .NET, Part 1

jbevain · on July 27, 2015

If that's a subject you find interesting, you can also read the source for two OSS .NET decompilers:

ILSpy: https://github.com/icsharpcode/ilspy

JustDecompile: https://github.com/telerik/JustDecompileEngine/

Both are based on the same library that is used in the post: Mono.Cecil (https://github.com/jbevain/cecil).

zerratar · on July 27, 2015

Hi jbevain :-) Nice to see you here! We're in the middle of a small switch to a new blog engine, but the next part should be up soon enough. Have a nice day, and once more: you rock!

jbevain · on July 27, 2015

Thanks! Looking forward to reading the next posts in the series!

userbinator · on July 27, 2015

Stack-based, high-level VMs like CLR, JVM, and Flash's ActionScript are certainly quite easy to decompile, although I think this article unfortunately misses the point - it's full of (rather verbose) code, but little explanation. From what I can see it's very fragile too - it attempts to match exact instruction sequences so won't work for anything even slightly different from what's presented. This is equivalent to the test() method given, but won't get decompiled correctly:

    ldc.i4.4
    ldarg.0
    call System.Int32 Test1.AwesomeClass::c()
    starg.s b
    starg.s a

The right way to decompile a stack-based language requires keeping track of what's on the stack, building expressions instead of evaluating values.
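A minimal sketch of that idea, run against the five instructions quoted above. It is not the article's code and does not use Mono.Cecil: the tuple encoding of instructions, the `decompile` function, and the argument name table are all hypothetical, and `call` is assumed to be an instance call that pops a receiver. The stack holds expression strings rather than runtime values, so the instruction order no longer matters.

```python
def decompile(instructions, arg_names):
    """Symbolically execute straight-line IL and return source-like statements."""
    stack = []       # expression strings, not evaluated values
    statements = []
    for op, operand in instructions:
        if op == "ldc.i4":
            # Push a constant as an expression.
            stack.append(str(operand))
        elif op == "ldarg":
            # Push the name of the argument (index 0 is `this` here).
            stack.append(arg_names[operand])
        elif op == "call":
            # Assumed instance call: pop the arguments, then the receiver.
            name, argc = operand
            args = [stack.pop() for _ in range(argc)][::-1]
            receiver = stack.pop()
            stack.append(f"{receiver}.{name}({', '.join(args)})")
        elif op == "starg":
            # A store consumes the top expression and becomes a statement.
            statements.append(f"{operand} = {stack.pop()};")
        else:
            raise ValueError(f"unsupported opcode: {op}")
    return statements


il = [
    ("ldc.i4", 4),
    ("ldarg", 0),
    ("call", ("c", 0)),
    ("starg", "b"),
    ("starg", "a"),
]
print(decompile(il, ["this", "a", "b"]))  # ['b = this.c();', 'a = 4;']
```

A pattern matcher that expects the constant to be stored immediately after it is loaded would miss this sequence, while the symbolic stack pairs each store with whatever expression is on top at that point.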

That InstructionHelper class also looks like it could be rewritten more clearly...

jcranmer · on July 27, 2015

The standard way to resolve distinct variables is SSA-based decompilation. I've only worked with decompiling Java bytecode, so I don't know how the CLR works, but in Java, the compiler definitely reuses the local variable slots.
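To illustrate why slot reuse matters, here is a rough sketch of the renaming step for straight-line code only. The `("store", slot)` / `("load", slot)` encoding and the `rename_slots` function are made up for illustration; a real SSA construction also needs phi nodes at control-flow merge points, which this deliberately skips.

```python
def rename_slots(ops):
    """Give every store to a local slot a fresh name (straight-line code only)."""
    version = {}   # slot index -> number of the latest definition
    renamed = []
    for kind, slot in ops:
        if kind == "store":
            # Each store starts a new, distinct variable.
            version[slot] = version.get(slot, -1) + 1
        elif slot not in version:
            raise ValueError(f"slot {slot} is read before it is written")
        renamed.append((kind, f"v{slot}_{version[slot]}"))
    return renamed


# Slot 0 is reused for two unrelated values.
ops = [("store", 0), ("load", 0), ("store", 0), ("load", 0)]
print(rename_slots(ops))
# [('store', 'v0_0'), ('load', 'v0_0'), ('store', 'v0_1'), ('load', 'v0_1')]
```

After renaming, `v0_0` and `v0_1` can be given different names and types in the output, even though the compiler packed them into one slot.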

There's also no discussion of type inference for variables or of parenthesizing expression DAGs properly. I suspect properly decompiling control flow would be in part 2, but I'd be surprised if that were anywhere near robust, based on the quality demonstrated so far. Which is sad because this sort of decompilation has been practically demonstrated and solved for, oh, 10-20 years.
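On the parenthesization point, a small sketch of precedence-aware printing. The nested-tuple expression format, the `PRECEDENCE` table, and `render` are assumptions for illustration, and only four left-associative binary operators are covered. It is conservative: it parenthesizes any right operand of equal precedence, even where the operator happens to be associative.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def render(expr, parent_prec=0, is_right=False):
    """Print a binary expression tree with only the parentheses it needs."""
    if not isinstance(expr, tuple):
        return str(expr)          # leaf: variable name or constant
    op, lhs, rhs = expr
    prec = PRECEDENCE[op]
    text = f"{render(lhs, prec)} {op} {render(rhs, prec, True)}"
    # Parenthesize when the parent binds tighter, or when this node is the
    # right operand of an operator with the same precedence.
    if prec < parent_prec or (prec == parent_prec and is_right):
        return f"({text})"
    return text


print(render(("*", ("+", "a", "b"), "c")))   # (a + b) * c
print(render(("-", "a", ("-", "b", "c"))))   # a - (b - c)
print(render(("+", ("*", "a", "b"), "c")))   # a * b + c
```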

tptacek · on July 27, 2015

Are you and 'userbinator saying the same thing? I can't tell. I know how the simple symbolic stack->expression evaluation works, and it happens that in my code I generate something pretty close to SSA expressions, but does SSA do something else profound for decompilation?

sklogic · on July 27, 2015

SSA abstracts the stack away and makes it much easier to reason about types.

tptacek · on July 27, 2015

I'm not sure I'm following. To get from stack operations to expressions, I just symbolically evaluate the stack, creating temporary variables as I go. It happens that the resulting IR is pretty much SSA form. But I'm not taking much else from SSA. I'm wondering if I'm missing opportunities.
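One way to picture the approach described here, as a sketch under assumptions: every value pushed on the symbolic stack gets its own temporary, so each temporary is assigned exactly once. The op encoding and `to_three_address` are hypothetical, and only `push`, `add`, and `mul` are handled.

```python
def to_three_address(ops):
    """Turn stack operations into single-assignment temporaries."""
    stack = []   # names of temporaries currently on the symbolic stack
    code = []    # emitted assignments, one per temporary

    def fresh(expr):
        # Each new value gets a new name, so no temporary is assigned twice.
        name = f"t{len(code)}"
        code.append(f"{name} = {expr}")
        stack.append(name)

    for op in ops:
        if op[0] == "push":
            fresh(str(op[1]))
        elif op[0] in ("add", "mul"):
            rhs = stack.pop()
            lhs = stack.pop()
            symbol = "+" if op[0] == "add" else "*"
            fresh(f"{lhs} {symbol} {rhs}")
        else:
            raise ValueError(f"unsupported op: {op[0]}")
    return code


ops = [("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]
print(to_three_address(ops))
# ['t0 = 2', 't1 = 3', 't2 = t0 + t1', 't3 = 4', 't4 = t2 * t3']
```

Within a single basic block this already has the single-assignment property; what it lacks compared with full SSA is the handling of values that merge from different control-flow paths.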

sklogic · on July 27, 2015

It's easier to transform your expressions into a useful form from a guaranteed, proper SSA than from a simple tree representation. For example, induction variable extraction is totally trivial in SSA, and you really need to do it if you want to reconstruct nice-looking `for` loops.
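A sketch of why this is cheap in SSA, assuming a toy representation: `phis` maps a loop-header phi to its (initial value, back-edge value) pair, and `defs` maps a name to its defining `("add", lhs, rhs)` tuple. Both structures and the function name are hypothetical. A basic induction variable is then just a phi whose back-edge value is itself plus a constant.

```python
def find_induction_variables(phis, defs):
    """Return {phi_name: (initial_value, step)} for basic induction variables."""
    found = {}
    for var, (init, back) in phis.items():
        definition = defs.get(back)
        if not definition or definition[0] != "add":
            continue
        _, lhs, rhs = definition
        # Match `back = var + constant` in either operand order.
        if lhs == var and isinstance(rhs, int):
            found[var] = (init, rhs)
        elif rhs == var and isinstance(lhs, int):
            found[var] = (init, lhs)
    return found


# i1 = phi(i0, i2); i2 = i1 + 1   (a counting loop)
phis = {"i1": ("i0", "i2")}
defs = {"i2": ("add", "i1", 1)}
print(find_induction_variables(phis, defs))  # {'i1': ('i0', 1)}
```

With the initial value and step in hand, the loop can be printed as `for (i = i0; ...; i += 1)` once the exit condition is identified.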

It also pays well to have distinct basic blocks - loop analysis is much easier then.

tptacek · on July 27, 2015

This is helpful. But I read it and think, for instance, "distinct basic blocks aren't SSA"; compilers worked in terms of CFGs before SSA existed. :)

Again this is more about my lack of confidence about fully grokking the implications of SSA; I'm not nerd-sniping.

sklogic · on July 27, 2015

Of course, you can have basic blocks without an SSA. It's just another feature that was missing from the article that was worth mentioning.

Another thing you'll get for free from an SSA - nice ternary expressions reconstructed (even if the original code was using ifs).
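As a rough illustration of the ternary case, assuming a diamond-shaped control flow where a branch leads to two arms that merge at a phi: the `Branch` and `Phi` classes, the `side_effects` map, and `phi_to_ternary` are all invented for this sketch. The phi's inputs are exactly the two arms of the conditional, so it can be printed as `cond ? a : b` provided neither arm does anything else.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    cond: str         # condition expression, already decompiled
    then_block: str   # name of the block taken when cond is true
    else_block: str   # name of the block taken when cond is false


@dataclass
class Phi:
    target: str       # variable defined at the merge point
    inputs: dict      # predecessor block name -> value expression


def phi_to_ternary(branch, phi, side_effects):
    """Return a ternary statement for a diamond, or None if it does not fit."""
    arms = (branch.then_block, branch.else_block)
    # The phi must merge exactly the two arms of this branch.
    if set(phi.inputs) != set(arms):
        return None
    # Arms with other statements must stay as an if/else.
    if any(side_effects.get(block) for block in arms):
        return None
    then_value = phi.inputs[branch.then_block]
    else_value = phi.inputs[branch.else_block]
    return f"{phi.target} = {branch.cond} ? {then_value} : {else_value};"


branch = Branch("a > b", "then", "else")
phi = Phi("max", {"then": "a", "else": "b"})
print(phi_to_ternary(branch, phi, {}))  # max = a > b ? a : b;
```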

zerratar · on July 27, 2015

Thank you for your comment! :-)

Haha, yep I do agree that I did not cover all important aspects of writing a decompiler. Neither did I use a stack based solution to tackle the problem.

A reminder though: the idea behind this post was to cover a little bit of everything while keeping it as simple as possible. This is not a fully fledged decompiler and will not decompile everything. It is meant to give an idea of how CIL works and how to use Mono.Cecil, and to just hack away!

The next part of the tutorial DOES actually manage the stack to try and create a more complex solution, together with code refactoring and more.

And I'm sorry for any information that I might have forgotten and/or for any poorly written code. I will try and do better next time. Yet I hope you still like the article.

/zerratar

ghuntley · on July 27, 2015

If this is of interest to you then you will most likely find the recent .NET Core Design API review on ILDASM interesting as well: https://www.youtube.com/watch?v=HuRc6CpiOVg

EliRivers · on July 26, 2015

For something with "UX" in the name, it's a surprisingly bad layout. Massive waste of screen width, and code boxes forcing me to scroll sideways even as acres of empty space sits there unused.

EliRivers · on July 27, 2015

Ah, I see it's been partially fixed. The source code sections no longer require scrolling sideways to see it all, at least.

lichinobu · on July 27, 2015

I really apologize for that. We were totally taken by surprise by all the attention, and around 2AM I saw your (very valid) reply and was like oh snap! It's not a perfect fix, but I'll try and improve it as soon as I can. And thanks Eli.