Anything that powers a service like AWS Lambda needs to be really fast, and it needs to be secure. AWS could have gone with existing technology, but to satisfy both requirements it built something new: Firecracker. It is really fast (it can boot Linux and start executing user-space processes in 125 ms) and secure (it uses hardware virtualization and more to isolate one Lambda environment from another). In this article, we’ll take a deep dive and look at how exactly Firecracker achieves these two goals. Firecracker is not entirely new, though. Significant chunks of its low-level functionality are based on Google’s crosvm, with some significant changes.
Sparkler: A light-weight Firecracker
While it certainly is fun to read a lot of source code and figure out what is going on under the hood, it is not as much fun as firing up your favorite editor and whipping up a lightweight virtual machine environment under Linux. With Sparkler, we will build a virtual machine monitor (VMM) that manages a virtual machine and provides the environment in which it runs. We will also write a tiny “operating system” which will run inside the virtual machine. The VMM emulates some interesting hardware: a device that can read the latest tweet from Command Line Magic’s Twitter handle, a device that can get the weather for certain cities, another device that can fetch the latest air quality measurements for certain cities and, finally, a console device that lets the virtual machine read the keyboard and output text to the terminal.
If you are interested in understanding how KVM works while getting your hands dirty with example code, I recommend you read this article on Sparkler.
Making things really fast
Running unmodified operating systems, especially ones for which source code isn’t available, means that the virtual machine has to very closely resemble a real PC. QEMU (through KVM) and VirtualBox do use hardware virtualization to speed up guests as much as possible, but they also emulate a real PC with BIOS routines and various legacy and peripheral devices like graphics, sound and storage cards. This is why it is easy to run off-the-shelf versions of MS-DOS, Microsoft Windows or binary Linux distributions like Fedora or Ubuntu with QEMU or VirtualBox.
KVM itself, however, has very limited support for these legacy and peripheral devices. Also, when an operating system boots, the boot loader usually depends on BIOS routines to deal with hardware like the screen, keyboard and disks. QEMU, for example, ships with SeaBIOS as its BIOS firmware. Without BIOS routines, the virtual machine would not behave like a real PC, and this would stop it from running off-the-shelf operating systems.
The most important takeaway is that virtualization systems designed to emulate a PC as closely as possible carry a lot of legacy code, such as the BIOS and the emulation of full-blown peripherals, and this code is relatively slow.
Modern kernel booting
When a PC starts, the CPU is in real mode or 16-bit mode. The BIOS runs the power-on self-test or POST, reads in the boot loader from the designated boot device to a set location in RAM and passes control over to it. The boot loader usually uses BIOS routines to display text, read disks, get system information etc.
It turns out that the Linux kernel does not need BIOS routines at all. It is the boot loader that does. Linux has what is known as a “boot protocol”, which tells the boot loader, or any other program present at boot time, how to lay out the kernel and supporting data structures in RAM and how to pass control to the kernel. For the x86 architecture, boot protocols are available for real mode, 32-bit and 64-bit modes.
So, just like how we loaded the monitor program binary into the allocated guest memory, Firecracker loads the Linux kernel and sets up the required data structures, like the initial stack and command line parameters, as per the Linux kernel boot protocol. See the following files from the Firecracker Git repo:
This does away with the need to load any BIOS routines during the boot process. It also simplifies the design of the virtual machine by removing all the complexity otherwise required to support a boot loader.
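To make the idea of a boot protocol a little more concrete, here is a minimal sketch that reads a few fields from the setup header at the start of an x86 bzImage, using the offsets given in the Linux x86 boot protocol documentation. It is only an illustration of the kind of structure a loader inspects, not Firecracker’s loader, and the function name parse_setup_header is made up for this example.

```python
import struct

# Offsets into the real-mode kernel header (Linux x86 boot protocol).
SETUP_SECTS_OFFSET = 0x1F1   # u8: size of the setup code in 512-byte sectors
BOOT_FLAG_OFFSET = 0x1FE     # u16: must be 0xAA55
HEADER_MAGIC_OFFSET = 0x202  # 4 bytes: "HdrS"
VERSION_OFFSET = 0x206       # u16: protocol version, e.g. 0x020F for 2.15


def parse_setup_header(image: bytes) -> dict:
    """Return a few boot-protocol fields from the start of a bzImage.

    Hypothetical helper for illustration; not part of any real loader.
    """
    if len(image) < VERSION_OFFSET + 2:
        raise ValueError("image too short to contain a setup header")
    (boot_flag,) = struct.unpack_from("<H", image, BOOT_FLAG_OFFSET)
    magic = image[HEADER_MAGIC_OFFSET:HEADER_MAGIC_OFFSET + 4]
    if boot_flag != 0xAA55 or magic != b"HdrS":
        raise ValueError("not a Linux bzImage with a setup header")
    (version,) = struct.unpack_from("<H", image, VERSION_OFFSET)
    # A value of 0 in setup_sects means 4, for backwards compatibility.
    setup_sects = image[SETUP_SECTS_OFFSET] or 4
    return {
        "protocol": (version >> 8, version & 0xFF),
        "setup_sects": setup_sects,
        # The protected-mode kernel follows the boot sector and setup sectors.
        "kernel_offset": (setup_sects + 1) * 512,
    }
```

A loader would go on to copy the kernel into guest memory and fill in fields such as the command line pointer before handing over control; those steps are left out here.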
The tiny Sparkler monitor program does not use any BIOS routines either, because there is no BIOS at all. The print_str routine, for example, uses the x86 OUT instruction to output characters, which causes a VM exit. These exits are handled by the sparkler program, which uses the putc() library function to display the character on the terminal. Something very similar happens in Firecracker as well. A virtual serial console is set up for the Linux kernel via memory-mapped I/O (MMIO), which is just a fancy term for I/O that is done by reading from and writing to particular memory addresses. In other words, rather than using specialized instructions like OUT, normal load/store instructions like mov are used for I/O. When these special MMIO addresses are read from or written to, a KVM exit is triggered, letting the virtual machine monitor or the hypervisor handle it. In Firecracker, the function register_mmio_serial() in the file vmm/src/device_manager/mmio.rs makes a serial console available to the Linux kernel for text input/output.
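The difference between port I/O and MMIO matters less to the VMM than it might seem: in both cases an exit delivers an address and some data, and the VMM looks up which emulated device owns that address. The following is a rough sketch of that dispatch step with the exits simulated as plain function calls, so no KVM is involved. The class names, the handle_exit function and the addresses used are all assumptions made for this example, not Sparkler’s or Firecracker’s code.

```python
import sys


class SerialConsole:
    """Toy output-only console: bytes written to offset 0 go to `out`."""

    def __init__(self, out=sys.stdout):
        self.out = out

    def write(self, offset: int, data: bytes) -> None:
        if offset == 0:
            self.out.write(data.decode("latin-1"))


class Bus:
    """Maps address ranges to devices; usable for port I/O or MMIO."""

    def __init__(self):
        self.ranges = []  # list of (base, size, device)

    def register(self, base: int, size: int, device) -> None:
        self.ranges.append((base, size, device))

    def write(self, addr: int, data: bytes) -> bool:
        for base, size, device in self.ranges:
            if base <= addr < base + size:
                # Devices see offsets relative to their own base address.
                device.write(addr - base, data)
                return True
        return False  # no device claimed this address


def handle_exit(kind: str, addr: int, data: bytes,
                pio_bus: Bus, mmio_bus: Bus) -> bool:
    """Route one simulated write exit ("io" or "mmio") to the right bus."""
    bus = pio_bus if kind == "io" else mmio_bus
    return bus.write(addr, data)
```

In a real VMM, the kind, address and data would come from the run structure that KVM shares with the VMM after each exit, and reads would be handled alongside writes.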
The Firecracker guest model
The other clever thing about Firecracker is that the virtual machine in which it runs the Linux kernel has very few devices. Linux on x86 assumes the presence of an interrupt controller and an interval timer. These are really fundamental since the CPU doesn’t have these built in. While it is possible to emulate these devices in the VMM, KVM provides a much better alternative: it can emulate them in-kernel for you. This is a big deal. Remember that as long as there is no VM exit, the virtual machine code is executing natively on the CPU. Every exit that has to be handled by the VMM also brings expensive context switches, which kill performance. The more time you spend in the kernel or executing virtual machine code, the better the performance. See KVM_CREATE_PIT2 in the KVM API documentation.
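As a rough sketch of what asking KVM for these in-kernel devices looks like, the snippet below issues the relevant ioctls from Python. The ioctl numbers are computed the way the Linux headers do on x86, and the 64-byte size assumed for struct kvm_pit_config (a 32-bit flags field plus fifteen 32-bit padding words) is taken from the KVM UAPI header. The function name is made up, and it needs access to /dev/kvm on an x86 host to actually run.

```python
import fcntl
import os
import struct

KVMIO = 0xAE  # ioctl "type" byte used by all KVM ioctls


def _io(nr: int) -> int:
    """_IO(KVMIO, nr): an ioctl that carries no argument structure."""
    return (KVMIO << 8) | nr


def _iow(nr: int, size: int) -> int:
    """_IOW(KVMIO, nr, struct): userspace passes `size` bytes to the kernel."""
    return (1 << 30) | (size << 16) | (KVMIO << 8) | nr


KVM_PIT_CONFIG_SIZE = 64          # assumed: u32 flags + u32 pad[15]
KVM_CREATE_VM = _io(0x01)
KVM_CREATE_IRQCHIP = _io(0x60)
KVM_CREATE_PIT2 = _iow(0x77, KVM_PIT_CONFIG_SIZE)
KVM_PIT_SPEAKER_DUMMY = 1         # flag: stub out the PC speaker port


def create_vm_with_in_kernel_devices(path: str = "/dev/kvm"):
    """Create a VM whose interrupt controller and timer live in the kernel.

    Hypothetical helper; returns (kvm_fd, vm_fd) for the caller to close.
    """
    kvm_fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
    vm_fd = fcntl.ioctl(kvm_fd, KVM_CREATE_VM, 0)
    # The interrupt controller has to exist before the PIT can be created.
    fcntl.ioctl(vm_fd, KVM_CREATE_IRQCHIP, 0)
    pit_config = struct.pack("<16I", KVM_PIT_SPEAKER_DUMMY, *([0] * 15))
    fcntl.ioctl(vm_fd, KVM_CREATE_PIT2, pit_config)
    return kvm_fd, vm_fd
```

After these calls, guest accesses to the interrupt controller and the timer are handled inside the kernel and never reach the VMM as exits.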
Here is a list of devices in the Firecracker guest model:
- Nested i8259 Programmable Interrupt Controller chips + an IOAPIC (emulated in-kernel by KVM)
- i8254 Programmable Interval Timer (emulated in-kernel by KVM)
- i8042 PS/2 Keyboard and Mouse Controller (emulated by Firecracker in devices/src/legacy/8042.rs)
- Serial console (emulated by Firecracker in devices/src/legacy/serial.rs)
- VirtIO Block (emulated by Firecracker in devices/src/virtio/block.rs)
- VirtIO Net (emulated by Firecracker in devices/src/virtio/net.rs)
If you look at the source code of the PS/2 keyboard controller emulated by Firecracker, you’ll notice that it is really sparse and implements only one main function, trigger_ctrl_alt_del(), which is pretty self-explanatory.
VirtIO Block and Net devices
VirtIO needs a bit more explanation. Real hardware devices have all kinds of quirks and are fairly complicated to program. When you have operating systems that can’t be modified, but come with drivers for some hardware devices, it makes sense to emulate them. That way, it becomes possible to run these operating systems unmodified. But with Linux’s rich history of virtualization, there are more high-performance solutions available. Modern Linux kernels ship with drivers for a virtual I/O system that was specially designed for virtualized systems. Firecracker takes advantage of this.
VirtIO was developed initially by Rusty Russell for lguest, which made its way into the kernel in 2007 but was removed in 2017. VirtIO, however, continues to thrive. It now has a specification, and device drivers are available in-tree in the Linux kernel. The concept behind VirtIO is very simple: it specifies a way for the guest and host to communicate efficiently. It defines various device types such as network, block, console, entropy, memory balloon and SCSI host. It supports PCI as a transport, meaning that the guest OS can enumerate VirtIO devices like regular PCI bus based devices and continue to use them like regular PCI devices. VirtIO is relatively simple to program compared to real hardware, and it is also designed to be very high performance. Part of that performance comes from not having to emulate whole, real hardware devices.
Since the Linux kernel also ships with device drivers for VirtIO devices, all the host needs to do is to emulate the specially-designed-for-virtualization VirtIO devices and have Linux work seamlessly with them, with much better performance compared to emulating some other real hardware device Linux also supports.
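To give a feel for how simple the guest/host interface is, here is a small sketch of one piece of it: walking a chain of descriptors in a split virtqueue’s descriptor table. The 16-byte descriptor layout (64-bit address, 32-bit length, 16-bit flags, 16-bit next index) and the two flag values come from the VirtIO specification. Guest memory is modelled as a plain bytes object, and the walk_chain name is an assumption made for this example, so treat it as an illustration rather than how Firecracker implements its queues.

```python
import struct

DESC_FORMAT = "<QIHH"  # addr (u64), len (u32), flags (u16), next (u16)
DESC_SIZE = struct.calcsize(DESC_FORMAT)  # 16 bytes per descriptor
VRING_DESC_F_NEXT = 1   # the chain continues at the `next` index
VRING_DESC_F_WRITE = 2  # the device may write to this buffer


def walk_chain(guest_mem: bytes, table_addr: int, head: int, queue_size: int):
    """Yield (addr, length, device_writable) for each descriptor in a chain.

    Hypothetical helper: `table_addr` is the descriptor table's offset in
    `guest_mem`, and `head` is the index of the chain's first descriptor.
    """
    index = head
    # Bound the walk so a malicious guest cannot make the host loop forever.
    for _ in range(queue_size):
        if index >= queue_size:
            raise ValueError("descriptor index out of range")
        addr, length, flags, nxt = struct.unpack_from(
            DESC_FORMAT, guest_mem, table_addr + index * DESC_SIZE)
        yield addr, length, bool(flags & VRING_DESC_F_WRITE)
        if not flags & VRING_DESC_F_NEXT:
            return
        index = nxt
    raise ValueError("descriptor chain is longer than the queue")
```

A block device, for instance, would typically see a request as a chain like this: a header buffer it reads, followed by data and status buffers it may write, depending on the request type.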
There is another very powerful feature of the Linux kernel called vhost. These are basically VirtIO devices emulated in the kernel directly, which means that there is no need to context switch to the VMM to deal with I/O from VirtIO devices. At the time of this writing however, Firecracker emulates both the network and the block VirtIO devices in the VMM and does not depend on vhost for further acceleration.
At the time of this writing, Linux kernel 5.4 hasn’t yet been released, but there is a patch that implements virtio-fs, which allows efficient sharing of files and directories between host and guest. This way, a directory containing the guest’s file system can live on the host, much like how Docker works.
Container vs hardware virtualization security
For a use case like AWS Lambda, it might not be such a good idea to just run processes (inside which Lambda functions run) belonging to different accounts on the same physical or virtual server. It would be a security nightmare. You can’t blame an architect, however, if she chose to run processes from different accounts inside Linux containers (using something like Docker). The only problem is that we don’t really know what holes exist that could let processes break out of their container jails. You see, the attack surface is the whole Linux kernel API. It is only a logical separation. The same can be said about KVM as well, but its attack surface is relatively small. Also, the VM itself runs in a special hardware virtualization mode, making it a lot safer, relatively speaking. Recent happenings reduce this confidence a bit, but it is human nature to believe hardware is somehow a lot safer than software.
With containers being the main unit of abstraction, one common worry about Kubernetes security is the Linux API’s attack surface. There has been improvement in this direction with projects like Kata Containers, which run the containers in micro-VMs that use hardware virtualization while providing a Kubernetes-compatible interface, so that Kubernetes can be used to orchestrate them. Similarly, Intel started a project, NEMU, which is QEMU stripped of most legacy hardware support while taking advantage of the virtual devices Linux supports, an approach very similar to Firecracker’s.
From the interface perspective, another important point to consider is the minimal set of devices supported by Firecracker. This further reduces the available attack surface.
Going beyond hardware-based virtualization
Firecracker goes much further than just resting on the security that hardware-based virtualization buys. There are two primary mechanisms used to secure the Firecracker process further. First, it can be deployed via a jailer utility, which uses cgroups to restrict the Firecracker process. Second, it uses seccomp rules to limit what the Firecracker process can do on the host system. The rules are set up carefully to allow only the system calls that Firecracker explicitly needs. There is also an advanced mode in which system call parameters are whitelisted as well. This ensures that not much can be done should the Firecracker process get compromised.
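The difference between the basic and the advanced mode can be illustrated with a toy model of an allowlist. This is plain Python that only mimics the decision a seccomp filter makes; it installs no real filter, and the syscall lists and argument rules are assumptions picked for the example rather than Firecracker’s actual rules. The only real-world constant is 0xAE80, the ioctl request number of KVM_RUN.

```python
# Toy model of a syscall allowlist. A value of None means "any arguments";
# a dict maps an argument index to the set of values allowed there.
# These policies are illustrative assumptions, not Firecracker's rules.
BASIC_POLICY = {
    "read": None,
    "write": None,
    "exit_group": None,
    "ioctl": None,
}

ADVANCED_POLICY = {
    "read": None,
    "write": None,
    "exit_group": None,
    # Allow ioctl only when argument 1 (the request number) is KVM_RUN.
    "ioctl": {1: {0xAE80}},
}


def evaluate(policy: dict, syscall: str, args: tuple = ()) -> str:
    """Return "allow" or "kill" for one syscall under the given policy."""
    if syscall not in policy:
        return "kill"  # anything not explicitly listed is denied
    rules = policy[syscall]
    if rules is None:
        return "allow"
    for index, allowed_values in rules.items():
        if index >= len(args) or args[index] not in allowed_values:
            return "kill"
    return "allow"
```

Under the basic policy any ioctl gets through, whereas the advanced policy lets through only the one request the process is expected to make, which narrows what a compromised process can ask of the kernel.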
Conclusion / Summary
Firecracker is a VM environment specially built to run only the Linux kernel, doing away with the legacy BIOS and legacy devices, while leveraging modern virtualization techniques like VirtIO and securing the Firecracker process with cgroups and seccomp rules. On one side, we have the most generic virtualization systems like QEMU and VirtualBox, which can run pretty much any OS targeted at PCs. On the other side, we have systems like Firecracker that are specialized to run guests based on the Linux kernel very efficiently, giving up the generic nature of the virtual machines they create.
We also went into great detail of how KVM works by building a virtual machine monitor (VMM) in C. This is the piece that interfaces with KVM to create a hardware based virtual machine in which we ran a small program written in assembly language which talks to the VMM using devices that the VMM emulates. While there is a simple “console” device that lets the VM input and output text, there are other more complex devices that can read a tweet, get the weather and air quality for a few cities.
My name is Shuveb Hussain and I’m the author of this Linux-focused blog. You can follow me on Twitter where I post tech-related content mostly focusing on Linux, performance, scalability and cloud technologies.