QEMU VM Escape (blog.bi0s.in)
298 points by ngaut on Aug 25, 2019 | 57 comments



QEMU developer here. Some context on the impact and the security architecture of QEMU:

1. Production VMs almost exclusively use tap networking, not slirp. This CVE mostly affects users running QEMU manually for development and test VMs.

2. Slirp (https://gitlab.freedesktop.org/slirp/libslirp) is part of the QEMU userspace process, which runs unprivileged and confined by SELinux when launched via libvirt. To be clear: this is not a host ring-0 exploit!

3. Getting root on the host or accessing other VMs requires further exploits to elevate privileges of the QEMU process and escape SELinux confinement.

More info on QEMU's security architecture: https://qemu.weilnetz.de/doc/qemu-doc.html#Security

For a more detailed overview of how QEMU is designed to mitigate exploits like this, see my talk from KVM Forum 2018: https://www.youtube.com/watch?v=YAdRf_hwxU8 https://vmsplice.net/~stefan/stefanha-kvm-forum-2018.pdf


> confined by SELinux

Only on platforms that use SELinux though.


For Ubuntu users there is AppArmor support in libvirt too: https://wiki.ubuntu.com/LibvirtApparmor

QEMU runs on other operating systems like *BSD, macOS, and Windows. It is less mature on those platforms, and it's safer to avoid running untrusted VMs there.


And AFAIK seccomp everywhere else. libvirt enables QEMU's seccomp sandbox too, and slirp runs in-process.


and libvirt


Looks like this was something that QEMU inherited when it took code from https://en.wikipedia.org/wiki/Slirp . Even after reading that page, I'm still not sure how it works; is it like a SOCKS proxy?


Slirp is a little like NAT, but implemented differently.

It creates what looks like a virtual NIC (literally in qemu's case, indirectly by SLIP in the original Slirp), and reassembles the packets it gets coming in from the guest OS or SLIP user. For example, a SYN packet gets turned into a call to connect(), and a data packet gets turned into a write() on the appropriate TCP socket FD. An RST packet gets turned into a call to close(). The reverse happens in the other direction, based on the read() data, fake TCP packets are generated, a closed socket gets mapped to a RST packet, and so forth.

From the outside, it looks like the guest (or SLIP user) is NATed through the host's IP address. But really it is just reassembling the intention of the guest based on the packets it is sending and calling host kernel functions to cause the same effects.
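A very rough sketch of the idea in C (illustrative only, not the actual slirp code; the struct and function names are made up):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <unistd.h>

    /* hypothetical per-connection state slirp keeps for each guest TCP flow */
    struct guest_conn { int host_fd; struct sockaddr_in dst; };

    void on_guest_syn(struct guest_conn *c)   /* SYN from the guest */
    {
        c->host_fd = socket(AF_INET, SOCK_STREAM, 0);
        connect(c->host_fd, (struct sockaddr *)&c->dst, sizeof(c->dst));
        /* ...then synthesize a SYN-ACK packet back to the guest */
    }

    void on_guest_data(struct guest_conn *c, const void *payload, size_t len)
    {
        write(c->host_fd, payload, len);      /* data segment -> write() */
    }

    void on_guest_rst(struct guest_conn *c)
    {
        close(c->host_fd);                    /* RST -> close() */
    }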


Thanks for the explanation. Given the original purpose of the code, where it would basically see "good" packets exclusively, it's not too surprising that giving it arbitrary "weird" packets could trigger bugs.


Is that similar to what sshuttle does?


Not really. sshuttle is really just a convenient wrapper around plain old ssh. It creates an SSH tunnel to the host you specify and forwards all traffic through that. It's more like a VPN.


Slirp was a userland, user-accessible equivalent of SLIP. It allowed end users to have IP-equivalent connectivity if all they had was a shell account.

This was in the days when people had dial-in accounts to a shell, and wanted to use Mosaic web browser on their machines.


Amiga Mosaic and FTP, also. :)


As far as I know, many projects (Podman, VirtualBox, Rootless Docker, Usernetes, etc.) use a fork of slirp (e.g. slirp4netns).

Let's hope these projects are not affected too.


slirp4netns v0.2.3, v0.3.2, and v0.4.0-beta.3 are already patched for this CVE.

https://github.com/rootless-containers/slirp4netns/security/...

Also, v0.4.0-beta.2+ can harden its own process by unsharing the mount namespace and pivot_root-ing into an empty directory that only contains /etc and /run, mounted noexec. v0.4.0-beta.4+ additionally supports seccomp filters.
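For reference, the unshare + pivot_root trick looks roughly like this in C (a sketch of the technique, not slirp4netns's actual code; error handling is minimal):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/mount.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int lock_down_fs(const char *newroot)  /* e.g. a dir containing only /etc and /run */
    {
        if (unshare(CLONE_NEWNS) < 0)
            return -1;
        /* make mount changes private to this namespace */
        if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
            return -1;
        /* bind-mount the new root onto itself so it is a mount point */
        if (mount(newroot, newroot, NULL, MS_BIND | MS_REC, NULL) < 0)
            return -1;
        if (chdir(newroot) < 0)
            return -1;
        /* glibc has no pivot_root() wrapper; stack the old root on top of the new one */
        if (syscall(SYS_pivot_root, ".", ".") < 0)
            return -1;
        /* detach the old root that is now shadowing "." */
        return umount2(".", MNT_DETACH);
    }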


Rootless Docker at least supports vpnkit, which is an alternative memory safe implementation written in OCaml.


It’s kind of amazing that SLiRP is still in use. Back in 1995, I was working tech support at an ISP and we used to kick people off of shell accounts all the time for using it to get a cheaper dialup connection. It was usually the MUDers (the bottom of the barrel of internet users in ‘95).


What was wrong with MUDers? I resemble that statement, but I don't remember SLiRP. Does anyone else on HN remember Demon Internet, potentially fondly, from the 1990s and prior to the acquisition by Thus? I just noticed that brand was terminated by Vodafone in January, which is sad, they were a super clueful provider and helped out EFnet, IRCnet and QuakeNet with servers back in the day, plus were a jolly good dial-up provider for UK customers.


Nothing inherently, but as ISP customers in the 1990s, they were terrible. They tended to spend extremely long periods online, which screwed up the economics of dialup hosting, which was priced for typical residential users (a couple of hours of use a day). We ended up adding ToS clauses to put a stop to them.

The worst of our offenders was a couple, a husband and wife, who both spent nearly all day on MUDs. One of our employees knew them and told me that they lived in a trailer in squalor. Their kid ended up getting taken away from them by Child Protective Services because of neglect. It was a really bad situation and colored my opinion of hard-core gamers.


irc.demon.co.uk was my go-to EFnet server. It was perceived as "strong" and not prone to the EFnet network splits that could cause you to lose control of your channels. Back in 1995-1996.


We had to use SLiRP to connect to the _only_ ISP in town, back in around 93, IIRC.


Why was a shell account cheaper?


Text-only terminal access was less fancy than real IP connectivity.


Shell accounts were cheaper to provide because people spent most of their time doing things that were purely local (news, mail, etc.), so the bandwidth needed per user was trivial.

Users running Mosaic/Netscape over dialup, on the other hand, were likely to be pegging their modem link the whole time, and almost all of those bits were bits you needed to buy from a transit provider over an expensive leased line.


Could a user install anything on those shell accounts?


Generally you'd have a small disk quota, and you'd get yelled at / disconnected if you used too much CPU or RAM, and you'd not be allowed to run processes while you're not dialed up. But other than that, you could compile and run stuff like slirp, sure.

Then ISPs didn't like people running SLIRP, so often there were "no SLIRP" rules. :P


From the wikipedia article:

> This was especially useful in the 1990s because simple shell accounts were less expensive and/or more widely available than full SLIP/PPP accounts.

https://en.wikipedia.org/wiki/Slirp


If I read it correctly, the attack has three steps:

1. Exploit a miscalculated pointer to write arbitrary data.

2. Exploit an ASLR info leak to figure out the target addresses.

3. Use (1) to create a fake timer with a callback to "system()".

Can some forms of Control Flow Integrity mitigate this type of attack?

W^X is useless in this case, but if Control Flow Integrity offers code pointer target verification, it could have a chance to catch the final bogus callback, am I correct?


Unfortunately all shared library entry points have to be marked as possible destinations for indirect jumps. However, both seccomp and SELinux can block the execve system call.
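For illustration, a minimal libseccomp filter that denies execve might look like this (a sketch only; QEMU's actual seccomp policy behind -sandbox is more elaborate):

    #include <errno.h>
    #include <seccomp.h>

    int block_exec(void)
    {
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);   /* default: allow */
        if (!ctx)
            return -1;
        /* make execve/execveat fail with EPERM, so a hijacked process can't system() */
        seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(execve), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(execveat), 0);
        int rc = seccomp_load(ctx);
        seccomp_release(ctx);
        return rc;
    }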


> which is a pointer miscalculation in network backend of QEMU

Out of interest does Rust prevent this kind of mistake?


Yes: in fact, any memory-safe language can guarantee that a “safe” program would not have this class of bug. In languages without pointer arithmetic and with bounds checking, this type of bug just isn’t possible. Not only does this include Rust but also Java, Go, and a fairly large list of other programming languages.

Most of what Rust improves on is related to memory ownership and concurrency - we’ve had ways to prevent this class of problem for a long time, and in fact there are even C variants that can do that too.


Yes.

At a surface level, the bug is in doing raw pointer arithmetic: determining the size of some value by subtracting two pointers from each other, incorrectly assuming that both pointers are within the same object. There's a codepath where one is not, and therefore this size computation is incorrect. Later, that size is added to another pointer, allowing for out-of-bounds access, overwriting other variables.
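In simplified C, the dangerous pattern looks roughly like this (an illustrative sketch, not the actual slirp code):

    struct buf {
        char  inline_data[32];   /* small payloads stored inline...        */
        char *ext_data;          /* ...large ones in a separate allocation */
        int   uses_ext;          /* flag saying which one is in use        */
    };

    size_t bad_len(struct buf *b, char *cursor)
    {
        /* Assumes cursor points into inline_data. If it actually points into
           ext_data, this difference is garbage (and UB in C), and a later
           memcpy(dst + bad_len(...), ...) writes out of bounds. */
        return (size_t)(cursor - b->inline_data);
    }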

Rust doesn't let you subtract two pointers from each other. Even unsafe Rust does not; you'd have to cast the pointers to integers, first, because finding the difference between two unrelated pointers is a fundamentally meaningless operation. (Indeed, it's undefined behavior in C, and Rust compiles through LLVM and would inherit the same optimization passes that wish to consider things UB, so it doesn't pass a request through to LLVM that's going to be undefined.) And safe Rust doesn't let you index to an arbitrary spot in an array / buffer without a bounds check, so even if you got a nonsense offset, it would crash instead of overwriting unrelated values.

At a slightly higher level, it seems like the underlying issue here (if I'm reading the article right) is that struct mbuf has two ways of representing the data: the array member m_dat and the pointer m_ext. Which one you're supposed to use is represented by a flag. The code correctly kept track of which one to use in all cases except one. Entirely apart from the memory safety stuff, Rust gives you tagged enums (enums with data, aka "sum types" in functional programming) with the property that you can only access data inside a particular enum variant if the variable you're looking at is actually of that variant. So, for instance, you could have something roughly like:

    enum MData<'a> {
        Internal { buffer: [u8; 32] },
        External { ptr: &'a [u8] },
    }
and syntactically there's no way to get a ptr out of an Internal or a buffer out of an External, so you couldn't have the logic confusion that led up to the memory unsafety. Even if you could do raw pointer arithmetic in Rust, you'd still get it right:

    let delta = match mbuf.data {
        MData::Internal { buffer } => q - buffer.as_ptr(),
        MData::External { ptr } => q - ptr.as_ptr(),
    };
so it's impossible to forget to check the flag. (In this case they do check the flag but it sounds like they're not checking the right flag or something? Or the flag is set too early? I don't totally follow the description, but if it's something like that, using a Rust enum would guarantee that the "flag" accurately matches whatever you're looking at.)

The memory safety stuff is great, but I really think that having a richer type system like this is more fundamentally what prevents bugs, compared to C where all you have is numbers, pointers, structures, and structures-where-things-overlap. (Another good use of this is nullable pointers that force you to do null checks before dereferencing them, and a little more broadly, this pattern also gives you locked data that forces you to take the lock before dereferencing the data, avoiding issues where you take the wrong lock, which could end up as memory unsafety eventually.)

FWIW there are a few hypervisor projects in the same space as QEMU that are written in Rust: AWS's Firecracker and Chrome OS's crosvm come to mind.


> The memory safety stuff is great, but I really think that having a richer type system like this is more fundamentally what prevents bugs

I know this is only tangentially related to your point, but all the research I have read about points to the opposite conclusion - richer and stricter type systems don't have a proven effect on bugs, whereas memory safety is guaranteed to eliminate whole classes of bugs.

One guess why this would be true, despite your good example of a type-system-level fix that would have prevented this bug entirely, is that memory safety is automatic (or at least opt-out), while the richer types are opt-in: nothing in Rust (or Java, etc.) would have prevented writing the equivalent of the original C; the programmer would have had to think about using the enum (or other equivalent) in order for the compiler to help. Granted, in this case they would almost certainly have done so, as enums are just so much nicer than flag checking, but for other type-level solutions the same may not happen.


Interesting, I'm very curious if you have pointers to this! Intuitively I feel like I write better code with a better type system, but that argument makes sense to me :-(

One thing I'm curious about is if there's a slightly different property than "richer type system" that helps. In Rust but not necessarily in other languages with good type systems, you can't leave a struct member uninitialized, so perhaps the easiest way to write this code is actually to use an enum if you know you'll only ever use one or the other. That might be close enough to the "automatic" property?


I'm actually curious why the project chose to use two fields in the structure to store the buffers instead of a union (not that C's untagged union would have helped much here).
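Something like this is what I have in mind (a simplified sketch, not slirp's actual struct mbuf); although, as noted, C's untagged union still wouldn't stop you from reading the wrong arm:

    struct mbuf {
        enum { M_INLINE, M_EXT } kind;   /* advisory tag: C won't enforce it */
        union {
            char  m_dat[128];   /* small packet stored inline               */
            char *m_ext;        /* large packet in a separate allocation    */
        } u;
    };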


Genuinely curious: why can't the C language be extended to do this?


This is basically Cyclone, a "dialect" of C (but, effectively, its own language) which was a major influence on Rust. See https://cyclone.thelanguage.org/ and http://www.cs.umd.edu/~mwh/papers/cyclone-cuj.pdf for details. Cyclone primarily provides memory safety (bounds-checked pointer dereferences, lifetimes) and also provides things like type-safe tagged unions.

But Cyclone is a separate language (and so is Rust) because in order for memory or type safety to be useful, you need to carry information that cannot be represented in the C type system. The only information a C pointer carries, besides the pointer address, is the C type. You can't have it carry a lifetime, a bound, or a type that isn't a valid C type. So you lose your safety properties at the boundary between C and Cyclone/Rust. You can't pass a tagged union from Cyclone/Rust into C, because the C compiler isn't going to enforce that C code updates the tag properly - if it did, it would start rejecting valid C code. At that point it's just a matter of style/taste whether you have a language that looks like C but isn't (Cyclone) or looks less like C (Rust).

(Come to think of it, C++ also has the same property of being built from C and being mostly backwards-compatible, but being a separate language.)

Both Rust and Cyclone make it easy to call into C for interoperating with existing code, even though such calls cannot be guaranteed safe at the language level. So it's quite feasible to take a large C project and start converting parts of the code to Rust one at a time, ensuring that you can maintain safety within the parts that you've already converted. (Rust in particular is the unique memory-safe language at the intersection of "actively developed and popular" and "can be used as a drop-in replacement for C and export C-compatible binary interfaces back to any calling code, without overhead" - there are a number of neat languages, not just Cyclone, if you don't have the former requirement, and there are a number of great languages like Go or Java if you don't have the latter requirement.)


By default, yes, but with `unsafe` blocks you can dereference raw pointers, which can theoretically land you in the same situation. You'd have to opt into that possibility, though, and you'd be very aware of it.


If the VM is run in a hypervisor, will this be able to break out of the hypervisor?


Yes, but slirp isn't used in production-grade setups. It's more of a simple, slow default that's guaranteed to work, so it's used mostly in development setups.


Note that as of the latest QEMU release (4.1, released a few days ago), slirp has been moved out into a separate library hosted at https://gitlab.freedesktop.org/slirp/libslirp.

There is interest in using it in container runtimes as well, and hopefully this will give slirp more love. It's very old code that most QEMU developers wouldn't have touched with a ten-foot pole...


>Testing

> Unfortunately, there are no automated tests available.

> You may run QEMU -net user linked with your development version.

Not trying to shame anyone here, but I do think that this would be a good entry point for anyone trying to help this project out. Hell, this might be a fun little target for property testing or other kinds of generative testing
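For example, a libFuzzer-style harness skeleton could look something like this (feed_packet_to_slirp is a placeholder; the real work would be setting up a Slirp instance and its callbacks, e.g. around slirp_input()):

    #include <stdint.h>
    #include <stddef.h>

    static void feed_packet_to_slirp(const uint8_t *pkt, size_t len)
    {
        /* placeholder: hand the frame to a pre-initialized Slirp instance,
           e.g. via slirp_input(); wiring that up is the actual work */
        (void)pkt;
        (void)len;
    }

    /* libFuzzer entry point: each call gets one fuzzer-generated "guest packet" */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        feed_packet_to_slirp(data, size);
        return 0;
    }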


Why would it compromise the hypervisor though?


There's no separate "hypervisor" component. QEMU itself is the hypervisor. That is, QEMU is a regular computer program whose functionality is to emulate a machine; that emulated machine is called a virtual machine. QEMU itself runs on a regular computer OS like Linux. (QEMU with hardware acceleration is a perfectly good and quite common production-grade virtualization setup.) All the code described here is part of QEMU itself, not part of the virtualized OS running inside QEMU.

If a VM running inside QEMU is able to compromise QEMU's code, it then runs with the privileges of QEMU on the host. The host, the regular OS running QEMU, is colloquially called "the hypervisor." And in most production virtualization setups, there is nothing of interest on the host other than some QEMU processes, i.e., anything valuable on the host is perfectly well accessible to QEMU, and therefore to malicious code that has taken over the QEMU process.


It's quite reasonable to distinguish QEMU from the hypervisor: QEMU typically runs with strictly lower privileges.

The "hardware acceleration" that helps QEMU emulate a computer operates in root VMX and non-root VMX (when actually executing guest instructions) modes. When operating in root VMX mode as a supervisor, it has approximately the highest privileges in the system (ignoring SMM), as it is operating in host ring 0. QEMU runs in host ring 3, making it a typical user mode process. If you manage to compromise QEMU (and only QEMU), you have only escaped into host user mode. In that case you haven't compromised the hypervisor, merely the virtual machine monitor.

If, on the other hand, you manage to compromise KVM (or vhost) you could reasonably claim to have actually compromised the hypervisor.

Type 2 hypervisors make this generally much uglier than type 1, but it's reasonable to say that the hypervisor in a type 2 deployment is limited to the kernel component rather than the user-mode VMM.
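To make the split concrete, here's roughly how a ring-3 VMM drives the in-kernel hypervisor through /dev/kvm (a bare sketch; guest memory setup, register state, and error handling are omitted):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    void run_tiny_vm(void)
    {
        int kvm  = open("/dev/kvm", O_RDWR);
        int vm   = ioctl(kvm, KVM_CREATE_VM, 0);    /* VM file descriptor */
        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);   /* one virtual CPU    */
        /* ...map guest memory with KVM_SET_USER_MEMORY_REGION, set registers... */
        ioctl(vcpu, KVM_RUN, 0);   /* KVM switches the CPU into VMX non-root mode here */
        /* on VM exit, control returns to this ring-3 process */
    }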


Where do you get this type of academic education of virtualization techniques?


Some of it is academic study (BS and MS at Georgia Tech) -- that helped with the theory and a bit of the breadth. That said, most of what I know of modern virtualization was more or less "on the job" from working on Google Compute Engine. Lots of practical experience understanding virtualization boundaries, and really really enormous access to historical domain experts in the space.


Depends on your country, but I suppose it has to do with theoretical computer science at the master's level; you can see a lot of academic content around this (in Paris, France, for example): https://wikimpri.dptinfo.ens-cachan.fr/doku.php (yes, HTTPS is broken…)


> it has approximately the highest privileges in the system (ignoring SMM)

and MINIX in case of Intel :-)


> anything valuable on the host is perfectly well accessible to QEMU

In a well-configured KVM instance, QEMU obeys the principle of least privilege as much as possible; that is, it can only access the resources it needs to do its job.

In practical terms, this means QEMU is confined (via SELinux, cgroups, Unix permissions, seccomp) to only access resources for the VM it is running; taking over QEMU does not give you access to "anything valuable on the host".

Of course, having remote code execution in QEMU is awful; it can be a base for exploiting a kernel vulnerability to get root access. Luckily this bug would not be exploitable in a production setting.


I think the intersection of configurations where QEMU is using SLiRP networking and where it's carefully configured to use the (non-default) sandboxing is pretty small. The sandboxing configs are typically associated with use of libvirt, where you'll probably be using tun/tap or some other non-slirp kind of networking.


Exactly, this is why this bug is unlikely to be exploitable in production (OpenStack for example only runs QEMU via Libvirt).


In a production system, it's highly unlikely that there is anything on the host besides VMs. You're not running qemu on one core and an unvirtualized database on another. Amazon is not running an S3 file server on the same physical machine that's being an EC2 host. The only valuable thing is other VMs.

And I think that there is no common configuration of OpenStack or libvirt that assigns unique access control labels per VM. Every qemu runs with the same privileges, ergo one exploited qemu can laterally attack another. (But maybe there's something nifty I'm missing?)

(Agree that for the use case of desktop virtualization, a good SELinux config can keep it from accessing your browser cookies.)


SVirt will usually use a per-VM label that should inhibit direct lateral interaction.


FWIW, sVirt is the "code name" for Libvirt's SELinux labeling scheme. It uses SELinux multi-category security to ensure that each QEMU can only access resources destined to one VM.


I don't see why they even bother to reassemble the packet fragments. If they're well formed, pass 'em along!


You wouldn't be able to simply pass IP packets along with usermode networking, since slirp works through ordinary host sockets rather than sending raw packets. It may also be a better use of bandwidth to reassemble and refragment the packets if the MTU of the host is smaller than the MTU advertised to the VM (although I suppose the hypervisor should try to match the MTU).


I don't know if QEMU has this feature, but you need to reassemble fragments to do L4 firewalling.



