Serverless computing is quite the rage these days and AWS Lambda is on the forefront of this. A while ago, they released Firecracker, the engine behind Lambda. Unsurprisingly, it was based on Linux’s KVM (Kernel-based Virtual Machine) technology, but what was surprising was how it gave up the ability to run all kinds of operating systems to become this super-sleek virtual machine manager that can run only Linux, but can bring up a virtual machine in about 125 ms! I take a deep look at how Firecracker works, along with an analysis of what it does well and what it doesn’t in this article. But I wanted to go a lot deeper than just looking at Firecracker’s code. How about building a tiny virtual machine manager (VMM) and a super tiny “operating system” to understand how KVM really works? That’s exactly what we’ll be doing with Sparkler.
Sparkler: A light-weight Firecracker
While it certainly is fun reading Firecracker’s source code and figuring out what is going on under the hood, it is not as much fun as firing up your favorite editor and whipping up a lightweight virtual machine environment under Linux. With Sparkler, we will build a virtual machine monitor (VMM) that manages a virtual machine and provides the environment in which that virtual machine runs. You can find Sparkler’s source code here on GitHub. We will also write a tiny “operating system” which will run inside the virtual machine. The VMM emulates some interesting hardware: a device that can read the latest tweet from Command Line Magic’s Twitter handle, a device that can get the weather for certain cities, another device that can fetch the latest air quality measurements for certain cities and, finally, a console device that lets the virtual machine read the keyboard and output text to the terminal.
AWS Firecracker uses Linux’s KVM virtualization toolkit to create and run virtual machines. As we progress, we’ll see how exactly Firecracker’s awesome speed and security are achieved. But first, let’s lay down some groundwork to better understand how we can take advantage of Linux’s KVM (Kernel-based Virtual Machine) to build something like Firecracker. To demonstrate how this works, we build Sparkler, a lightweight virtual environment. The Sparkler environment or the virtual machine monitor (VMM) is written in C and is a KVM-based virtual machine, while the “operating system” we run inside that environment is written in assembly language. The Sparkler VM has an interesting structure, unlike any other virtual machine you might know of. Born in the internet age, it is a truly native citizen.
The Sparkler virtual machine exposes 4 devices and here is what these “devices” do:
- Console: this device is like a serial port. It allows the virtual machine to display information and also get user input via the keyboard where required.
- Twitter device: reading from this device makes available the latest tweet from one of my favorite Twitter handles, @climagic.
- Weather Info device: reading from this device, the virtual machine can get the latest weather forecast for 6 different cities.
- Air Quality Info device: this device makes available air quality information for 6 different cities.
Here is what a session in Sparkler looks like:
Hardware virtualization background
Since 2005, most Intel chips have had support for hardware virtualization. Before such support was available, virtual machines worked by either emulating every single instruction or, at the very least, emulating privileged instructions, because those cause faults when run in user space. With Intel VT and AMD’s SVM technology, a new processor mode was created where operating system code could run natively, at full speed on the real hardware CPU, without the need to emulate or trap regular or privileged instructions. The hypervisor or the virtual machine monitor can let the CPU know when to “exit”, that is, give control back to the hypervisor: for example, when the guest accesses I/O ports with the IN or OUT instructions, when it accesses certain privileged CPU registers that are normally only accessed by the operating system, or when it executes an instruction like CPUID, which provides information on the CPU (the hypervisor might want to control the CPU features the guest sees).
In this article, I refer to “VT” technology as a term to include the corresponding, equivalent AMD SVM technology as well.
The Unixification of hardware virtualization
KVM has several interesting features, but we shall look at the interface it provides to Intel’s VT technology. You can program KVM using the well known UNIX file paradigm. Let’s look at some code from main.c in Sparkler.
    kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    if (kvm == -1)
        err(1, "/dev/kvm");

    vmfd = ioctl(kvm, KVM_CREATE_VM, (unsigned long)0);
    if (vmfd == -1)
        err(1, "KVM_CREATE_VM");

    /* Allocate 0x8000 bytes (32 KB) of page-aligned guest memory to hold the code. */
    mem = mmap(NULL, 0x8000, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) /* mmap() reports failure with MAP_FAILED, not NULL */
        err(1, "allocating guest memory");

    /* Read our monitor program into RAM */
    int fd = open("monitor", O_RDONLY);
    if (fd == -1)
        err(1, "Unable to open stub");
    struct stat st;
    fstat(fd, &st);
    read(fd, mem, st.st_size);

    struct kvm_userspace_memory_region region = {
        .slot = 0,
        .guest_phys_addr = 0x1000,
        .memory_size = 0x8000,
        .userspace_addr = (uint64_t)mem,
    };
    ret = ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);
    if (ret == -1)
        err(1, "KVM_SET_USER_MEMORY_REGION");

    vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, (unsigned long)0);
    if (vcpufd == -1)
        err(1, "KVM_CREATE_VCPU");

    /* Map the shared kvm_run structure and following data. */
    ret = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, NULL);
    if (ret == -1)
        err(1, "KVM_GET_VCPU_MMAP_SIZE");
    mmap_size = ret;
    if (mmap_size < sizeof(*run))
        errx(1, "KVM_GET_VCPU_MMAP_SIZE unexpectedly small");
    run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, vcpufd, 0);
    if (run == MAP_FAILED)
        err(1, "mmap vcpu");

    /* Set CPUID */
    struct kvm_cpuid2 *cpuid;
    int nent = 40;
    unsigned long size = sizeof(*cpuid) + nent * sizeof(*cpuid->entries);
    cpuid = (struct kvm_cpuid2*) malloc(size);
    bzero(cpuid, size);
    cpuid->nent = nent;

    ret = ioctl(kvm, KVM_GET_SUPPORTED_CPUID, cpuid);
    if (ret < 0) {
        free(cpuid);
        err(1, "KVM_GET_SUPPORTED_CPUID");
    }

    /* Pass the host's brand string (leaves 0x80000002..4) through to the guest */
    for (int i = 0; i < cpuid->nent; i++) {
        if (cpuid->entries[i].function == 0x80000002)
            __get_cpuid(0x80000002, &cpuid->entries[i].eax, &cpuid->entries[i].ebx,
                        &cpuid->entries[i].ecx, &cpuid->entries[i].edx);
        if (cpuid->entries[i].function == 0x80000003)
            __get_cpuid(0x80000003, &cpuid->entries[i].eax, &cpuid->entries[i].ebx,
                        &cpuid->entries[i].ecx, &cpuid->entries[i].edx);
        if (cpuid->entries[i].function == 0x80000004)
            __get_cpuid(0x80000004, &cpuid->entries[i].eax, &cpuid->entries[i].ebx,
                        &cpuid->entries[i].ecx, &cpuid->entries[i].edx);
    }

    ret = ioctl(vcpufd, KVM_SET_CPUID2, cpuid);
    if (ret < 0) {
        free(cpuid);
        err(1, "KVM_SET_CPUID2");
    }
    free(cpuid);

    /* Initialize CS to point at 0, via a read-modify-write of sregs. */
    ret = ioctl(vcpufd, KVM_GET_SREGS, &sregs);
    if (ret == -1)
        err(1, "KVM_GET_SREGS");
    sregs.cs.base = 0;
    sregs.cs.selector = 0;
    ret = ioctl(vcpufd, KVM_SET_SREGS, &sregs);
    if (ret == -1)
        err(1, "KVM_SET_SREGS");

    /* Initialize registers: instruction pointer for our code and the
     * initial flags required by the x86 architecture. */
    struct kvm_regs regs = {
        .rip = 0x1000,
        .rflags = 0x2,
    };
    ret = ioctl(vcpufd, KVM_SET_REGS, &regs);
    if (ret == -1)
        err(1, "KVM_SET_REGS");

    char *latest_tweet = NULL;
    char *weather_forecast = NULL;
    char *aq_report = NULL;
    int tweet_str_idx = 0;
    int weather_str_idx = 0;
    int aq_str_idx = 0;

    /* Run the VM while handling any exits for device emulation */
    while (1) {
        ret = ioctl(vcpufd, KVM_RUN, NULL);
        if (ret == -1)
            err(1, "KVM_RUN");

        switch (run->exit_reason) {
        case KVM_EXIT_HLT:
            puts("KVM_EXIT_HLT");
            return 0;
        case KVM_EXIT_IO:
            if (run->io.direction == KVM_EXIT_IO_OUT) {
                switch (run->io.port) {
                case SERIAL_PORT:
                    putchar(*(((char *)run) + run->io.data_offset));
                    break;
                default:
                    printf("Port: 0x%x\n", run->io.port);
                    errx(1, "unhandled KVM_EXIT_IO");
                }
            } else {
                /* KVM_EXIT_IO_IN */
                switch (run->io.port) {
                case SERIAL_PORT:
                    *(((char *)run) + run->io.data_offset) = getche();
                    break;
                case TWITTER_DEVICE:
                    if (latest_tweet == NULL)
                        latest_tweet = fetch_latest_tweet();
                    char tweet_chr = *(latest_tweet + tweet_str_idx);
                    *(((char *)run) + run->io.data_offset) = tweet_chr;
                    tweet_str_idx++;
                    if (tweet_chr == '\0') {
                        free(latest_tweet);
                        latest_tweet = NULL;
                        tweet_str_idx = 0;
                    }
                    break;
                case WEATHER_DEVICE_CHENNAI:
                case WEATHER_DEVICE_DELHI:
                case WEATHER_DEVICE_LONDON:
                case WEATHER_DEVICE_CHICAGO:
                case WEATHER_DEVICE_SFO:
                case WEATHER_DEVICE_NY:
                    if (weather_forecast == NULL) {
                        char city[64];
                        if (run->io.port == WEATHER_DEVICE_CHENNAI)
                            strncpy(city, "Chennai", sizeof(city));
                        else if (run->io.port == WEATHER_DEVICE_DELHI)
                            strncpy(city, "New%20Delhi", sizeof(city));
                        else if (run->io.port == WEATHER_DEVICE_LONDON)
                            strncpy(city, "London", sizeof(city));
                        else if (run->io.port == WEATHER_DEVICE_CHICAGO)
                            strncpy(city, "Chicago", sizeof(city));
                        else if (run->io.port == WEATHER_DEVICE_SFO)
                            strncpy(city, "San%20Francisco", sizeof(city));
                        else if (run->io.port == WEATHER_DEVICE_NY)
                            strncpy(city, "New%20York", sizeof(city));
                        weather_forecast = fetch_weather(city);
                    }
                    char weather_chr = *(weather_forecast + weather_str_idx);
                    *(((char *)run) + run->io.data_offset) = weather_chr;
                    weather_str_idx++;
                    if (weather_chr == '\0') {
                        free(weather_forecast);
                        weather_forecast = NULL;
                        weather_str_idx = 0;
                    }
                    break;
                case AIR_QUALITY_DEVICE_CHENNAI:
                case AIR_QUALITY_DEVICE_DELHI:
                case AIR_QUALITY_DEVICE_LONDON:
                case AIR_QUALITY_DEVICE_CHICAGO:
                case AIR_QUALITY_DEVICE_SFO:
                case AIR_QUALITY_DEVICE_NY:
                    if (aq_report == NULL) {
                        char city[64];
                        char country[3];
                        if (run->io.port == AIR_QUALITY_DEVICE_CHENNAI) {
                            strncpy(city, "Chennai", sizeof(city));
                            strncpy(country, "IN", sizeof(country));
                        } else if (run->io.port == AIR_QUALITY_DEVICE_DELHI) {
                            strncpy(city, "Delhi", sizeof(city));
                            strncpy(country, "IN", sizeof(country));
                        } else if (run->io.port == AIR_QUALITY_DEVICE_LONDON) {
                            strncpy(city, "London", sizeof(city));
                            strncpy(country, "GB", sizeof(country));
                        } else if (run->io.port == AIR_QUALITY_DEVICE_CHICAGO) {
                            strncpy(city, "Chicago-Naperville-Joliet", sizeof(city));
                            strncpy(country, "US", sizeof(country));
                        } else if (run->io.port == AIR_QUALITY_DEVICE_SFO) {
                            strncpy(city, "San%20Francisco-Oakland-Fremont", sizeof(city));
                            strncpy(country, "US", sizeof(country));
                        } else if (run->io.port == AIR_QUALITY_DEVICE_NY) {
                            strncpy(city, "New%20York-Northern%20New%20Jersey-Long%20Island", sizeof(city));
                            strncpy(country, "US", sizeof(country));
                        }
                        aq_report = fetch_air_quality(country, city);
                    }
                    char aq_chr = *(aq_report + aq_str_idx);
                    *(((char *)run) + run->io.data_offset) = aq_chr;
                    aq_str_idx++;
                    if (aq_chr == '\0') {
                        free(aq_report);
                        aq_report = NULL;
                        aq_str_idx = 0;
                    }
                    break;
                default:
                    printf("Port: 0x%x\n", run->io.port);
                    errx(1, "unhandled KVM_EXIT_IO");
                }
            }
            break;
        case KVM_EXIT_FAIL_ENTRY:
            errx(1, "KVM_EXIT_FAIL_ENTRY: hardware_entry_failure_reason = 0x%llx",
                 (unsigned long long)run->fail_entry.hardware_entry_failure_reason);
        case KVM_EXIT_INTERNAL_ERROR:
            errx(1, "KVM_EXIT_INTERNAL_ERROR: suberror = 0x%x", run->internal.suberror);
        default:
            errx(1, "exit_reason = 0x%x", run->exit_reason);
        }
    }
The pseudocode to create and run a VM with KVM:
- open("/dev/kvm"): Open the global KVM device
- ioctl(KVM_CREATE_VM): Create a virtual machine
- mmap(size): Create memory region for the guest to use
- read("monitor"): Read our operating system binary into the allocated memory
- ioctl(KVM_CREATE_VCPU): Create a VCPU for use in our newly created virtual machine
- ioctl(KVM_SET_REGS): Set initial values for some registers
- while(1)
  - run = ioctl(KVM_RUN): Run the VM till there is an exit
  - switch(run->exit_reason): Decide based on exit reason
    - case KVM_EXIT_HLT: VM executed the halt instruction. Let’s exit.
    - case KVM_EXIT_IO: There was I/O from the VM. Handle it.
As you can see, with just simple Linux system calls like open(), read(), write(), mmap() and ioctl(), we’re able to create and run hardware virtualization-based VMs.
Another way to handle guest I/O is via eventfd(), which can be done with the KVM_IOEVENTFD ioctl() call. This attaches an eventfd file descriptor to a port I/O or MMIO address in the guest, so that a guest write to that address signals the file descriptor instead of causing an exit from KVM_RUN. This file descriptor can then be passed to poll() or the epoll_* calls and events dealt with in a better fashion. This is what Firecracker does. Now, let’s look at the “operating system” that runs inside of Sparkler.
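The eventfd-plus-poll pattern can be tried out without KVM at all. The sketch below is not Sparkler or Firecracker code: the function names are invented for this example, and the "guest" is simulated by writing to the eventfd ourselves, where a real VMM would have registered the descriptor with KVM_IOEVENTFD and let the kernel signal it.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Stand-in for the guest touching a monitored address: adds 1 to the
 * eventfd counter. Returns 0 on success, -1 on failure. */
int signal_event(int efd)
{
    uint64_t one = 1;
    return write(efd, &one, sizeof(one)) == sizeof(one) ? 0 : -1;
}

/* Wait up to timeout_ms for the eventfd to become readable.
 * Returns the number of signals accumulated since the last read,
 * 0 on timeout, or -1 on error. Reading resets the counter to zero. */
int64_t wait_for_event(int efd, int timeout_ms)
{
    struct pollfd pfd = { .fd = efd, .events = POLLIN };

    int ready = poll(&pfd, 1, timeout_ms);
    if (ready < 0)
        return -1;
    if (ready == 0)
        return 0;

    uint64_t count;
    if (read(efd, &count, sizeof(count)) != sizeof(count))
        return -1;
    return (int64_t)count;
}
```

Create the descriptor with eventfd(0, EFD_CLOEXEC), call signal_event() a couple of times and wait_for_event() reports how many signals were pending. The appeal of this design is that one thread can poll many such descriptors alongside sockets and timers, instead of decoding every event in the KVM_RUN exit switch.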
Our tiny little Sparkler operating system
I had trouble calling this an operating system, so I’m calling this a monitor program, which is a very common term used in embedded systems for operating system-like programs that are not quite operating systems themselves. Intel CPUs since Westmere (introduced 2010) have supported something called unrestricted guest mode. This means essentially that the virtual CPU starts running in real mode, or 16-bit mode, much like a real PC. The operating system can then switch the CPU to 32-bit or 64-bit mode as required. Our monitor program does not switch to 32-bit or 64-bit mode, but lives its life as a 16-bit program.
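Living in real mode means addresses are formed the 8086 way: the 16-bit segment value is shifted left by four bits and added to a 16-bit offset. A small worked example in C makes the numbers used by Sparkler easier to follow. main.c maps guest memory at physical address 0x1000 and starts execution there, and the monitor loads DS with 0x100 and SS with 0x120 (0x100 + 0x20). The function name below is made up for this sketch.

```c
#include <stdint.h>

/* Real-mode address translation: physical = segment * 16 + offset.
 * The result can exceed 16 bits, so it is returned as 32 bits. */
uint32_t real_mode_phys(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

With these values, real_mode_phys(0x100, 0) is 0x1000, so offsets relative to DS line up with the start of the loaded binary, and real_mode_phys(0x120, 0x1000) is 0x2200, which is where the initial stack pointer ends up in guest physical memory.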
As part of the Sparkler build process, NASM turns monitor.asm into monitor, which is the binary program that we then load into guest memory from main.c. This is a file with no real structure, just raw CPU instructions and data.
Although we call this the monitor program, the sparkler program that runs and interacts with KVM is called the VMM or the virtual machine monitor. Do not confuse these two terms during the course of reading this article. I’ll use the terms “sparkler” and “VMM” interchangeably to refer to the same thing.
    bits 16

    SERIAL_PORT             equ 0x3f8
    TWITTER_DEVICE          equ 0x100
    WEATHER_DEVICE_BASE     equ 0x100
    AIR_QUALITY_DEVICE_BASE equ 0x200

    start:
        mov ax, 0x100
        add ax, 0x20
        mov ss, ax
        mov sp, 0x1000
        cld
        mov ax, 0x100
        mov ds, ax
        mov si, welcome_msg
        call print_str
        jmp menu_loop

    press_key:
        mov si, press_any_key
        call print_str
        call get_users_choice

    menu_loop:
        call display_main_menu
        call get_users_choice
        cmp al, 0x31
        je .cpu_details
        cmp al, 0x32
        je .latest_tweet
        cmp al, 0x33
        je .weather
        cmp al, 0x34
        je .air_quality
        cmp al, 0x35
        je .halt
        mov si, illegal_choice
        call print_str
        jmp press_key

    .cpu_details:
        call print_cpu_details
        jmp press_key

    .latest_tweet:
        call print_latest_tweet
        call print_new_line
        jmp press_key

    .weather:
        mov si, weather_str
        call print_str
        call print_new_line
        mov si, cities_str
        call print_str
        call print_new_line
        mov si, your_choice
        call print_str
        sub ax, ax
        call get_users_choice
        sub ax, 0x30                    ; turn it from ascii to number
        cmp ax, 1
        jl .illegal_choice
        cmp ax, 6
        jg .illegal_choice
        add ax, WEATHER_DEVICE_BASE     ; this gives us the port number for the city
        mov dx, ax
        call print_weather
        jmp press_key

    .air_quality:
        mov si, air_quality_str
        call print_str
        call print_new_line
        mov si, cities_str
        call print_str
        call print_new_line
        mov si, your_choice
        call print_str
        sub ax, ax
        call get_users_choice
        sub ax, 0x30                    ; turn it from ascii to number
        cmp ax, 1
        jl .illegal_choice
        cmp ax, 6
        jg .illegal_choice
        add ax, AIR_QUALITY_DEVICE_BASE ; this gives us the port number for the city
        mov dx, ax
        call print_weather
        jmp press_key

    .illegal_choice:
        call print_new_line
        mov si, illegal_choice
        call print_str
        jmp press_key

    .halt:
        hlt

    data:
        welcome_msg db `Welcome to Sparkler!\n`, 0

        ; Used by the menu system
        main_menu db `\nMain menu:\n==========\n`, 0
        main_menu_items db `1. CPU Info\n2. Latest CliMagic Tweet\n3. Get Weather\n4. Get Air Quality\n5. Halt VM\n`, 0
        your_choice db `Your choice: \n`, 0
        illegal_choice db `You entered an illegal choice!\n\n`, 0
        press_any_key db `Press any key to continue...\n`, 0

        ; Used by our CPU ID routines
        cpu_info_str db `\nHere is your CPU information:\n`, 0
        cpuid_str db `Vendor ID\t: `, 0
        brand_str db `Brand string\t: `, 0
        cpu_type_str db `CPU type\t: `, 0
        cpu_type_oem db 'Original OEM Processor', 0
        cpu_type_overdrive db 'Intel Overdrive Processor', 0
        cpu_type_dual db 'Dual processor', 0
        cpu_type_reserved db 'Reserved', 0
        cpu_family_str db `Family\t\t: `, 0
        cpu_model_str db `Model\t\t: `, 0
        cpu_stepping_str db `Stepping\t: `, 0

        ; Used by devices which fetch over the internet
        fetching_wait db `\nFetching, please wait...\n`, 0
        weather_str db `\nChoose the city to get weather forecast for:`, 0
        air_quality_str db `\nChoose the city to get air quality report for:`, 0

        ; Cities
        cities_str db `1. Chennai\n2. New Delhi\n3. London\n4. Chicago\n5. San Francisco\n6. New York`,0

        cpuid_function dd 0x80000002

    get_users_choice:
        mov dx, SERIAL_PORT
        in ax, dx
        ret

    display_main_menu:
        mov si, main_menu
        call print_str
        mov si, main_menu_items
        call print_str
        mov si, your_choice
        call print_str
        ret

    print_latest_tweet:
        mov si, fetching_wait
        call print_str
        mov dx, TWITTER_DEVICE
    .get_next_char:
        in ax, dx
        cmp ax, 0
        je .done
        call print_char
        jmp .get_next_char
    .done:
        ret

    ; To be called with weather port already in DX
    print_weather:
        mov si, fetching_wait
        call print_str
    .get_next_char:
        in ax, dx
        cmp ax, 0
        je .done
        call print_char
        jmp .get_next_char
    .done:
        ret

    print_cpu_details:
        mov si, cpu_info_str
        call print_str
        mov si, cpuid_str
        call print_str
        call print_cpuid
        call print_new_line
        call print_cpu_info
        mov si, brand_str
        call print_str
        call print_cpu_brand_string
        call print_new_line
        ret

    print_cpuid:
        mov eax, 0
        cpuid
        push ecx
        push edx
        push ebx
        mov cl, 3
    .next_dword:
        pop eax
        mov bl, 4
    .print_register:
        call print_char
        shr eax, 8
        dec bl
        jnz .print_register
        dec cl
        jnz .next_dword
        ret

    print_cpu_brand_string:
        mov al, '"'
        call print_char
    .next_function:
        mov eax, [cpuid_function]
        cpuid
        push edx
        push ecx
        push ebx
        push eax
        mov cl, 4
    .next_dword:
        pop eax
        mov bl, 4
    .print_register:
        call print_char
        shr eax, 8
        dec bl
        jnz .print_register
        dec cl
        jnz .next_dword
        inc dword[cpuid_function]
        cmp dword[cpuid_function], 0x80000004
        jle .next_function
        mov al, '"'
        call print_char
        ret

    print_cpu_info:
        mov eax, 1
        cpuid
        mov si, cpu_type_str
        call print_str
        mov ecx, eax                    ; save a copy
        shr eax, 12
        and eax, 0x0003                 ; processor type is the 2-bit field at bits 13:12
        cmp al, 0
        je .type_oem
        cmp al, 1
        je .type_overdrive
        cmp al, 2
        je .type_dual
        cmp al, 3
        je .type_reserved
    .type_oem:
        mov si, cpu_type_oem
        jmp .print_cpu_type
    .type_overdrive:
        mov si, cpu_type_overdrive
        jmp .print_cpu_type
    .type_dual:
        mov si, cpu_type_dual
        jmp .print_cpu_type
    .type_reserved:
        mov si, cpu_type_reserved
        jmp .print_cpu_type
    .print_cpu_type:
        call print_str
        call print_new_line

        ; Family
        mov si, cpu_family_str
        call print_str
        mov eax, ecx
        shr eax, 8
        and ax, 0x000f
        cmp ax, 15                      ; if Family == 15, Family is derived as the
        je .calculate_family            ; sum of Family + Extended family bits
        jmp .family_done                ; else
    .calculate_family:
        mov ebx, ecx
        shr ebx, 20
        and bx, 0x00ff
        add ax, bx
    .family_done:
        call print_word_hex

        ; Model
        mov si, cpu_model_str
        call print_str
        cmp al, 6                       ; If family is 6 or 15, the model number
        je .calculate_model             ; is derived from the extended model ID bits
        cmp al, 15
        je .calculate_model
        mov eax, ecx                    ; else
        shr eax, 4
        and ax, 0x000f
        jmp .model_done
    .calculate_model:
        mov eax, ecx
        mov ebx, ecx
        shr eax, 16
        and ax, 0x000f
        shl eax, 4
        shr ebx, 4
        and bx, 0x000f
        add eax, ebx
    .model_done:
        call print_word_hex

        ; Stepping
        mov si, cpu_stepping_str
        call print_str
        mov eax, ecx
        and ax, 0x000f
        call print_word_hex
        ret

    print_new_line:
        push dx
        push ax
        mov dx, SERIAL_PORT
        mov al, `\n`
        out dx, al
        pop ax
        pop dx
        ret

    print_char:
        push dx
        mov dx, SERIAL_PORT
        out dx, al
        pop dx
        ret

    print_str:
        push dx
        push ax
        mov dx, SERIAL_PORT
    .print_next_char:
        lodsb                           ; load byte pointed to by SI into AL and SI++
        cmp al, 0
        je .printstr_done
        out dx, al
        jmp .print_next_char
    .printstr_done:
        pop ax
        pop dx
        ret

    ; Print the 16-bit value in AX as HEX
    print_word_hex:
        xchg al, ah                     ; Print the high byte first
        call print_byte_hex
        xchg al, ah                     ; Print the low byte second
        call print_byte_hex
        call print_new_line
        ret

    ; Print lower 8 bits of AL as HEX
    print_byte_hex:
        push dx
        push cx
        push ax
        lea bx, [.table]                ; Get translation table address
        ; Translate each nibble to its ASCII equivalent
        mov ah, al                      ; Make copy of byte to print
        and al, 0x0f                    ; Isolate lower nibble in AL
        mov cl, 4
        shr ah, cl                      ; Isolate the upper nibble in AH
        xlat                            ; Translate lower nibble to ASCII
        xchg ah, al
        xlat                            ; Translate upper nibble to ASCII
        mov dx, SERIAL_PORT
        mov ch, ah                      ; Make copy of lower nibble
        out dx, al
        mov al, ch
        out dx, al
        pop ax
        pop cx
        pop dx
        ret
    .table: db "0123456789ABCDEF", 0
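The bit-twiddling in print_cpu_info is easier to check in C than in assembly. The sketch below decodes the value CPUID leaf 1 returns in EAX using the field layout from Intel’s CPUID documentation: stepping in bits 3:0, model in bits 7:4, family in bits 11:8, processor type in bits 13:12, extended model in bits 19:16 and extended family in bits 27:20. The struct and function names are made up for this example, and it follows the documented rule of keying the model calculation off the base family field.

```c
#include <stdint.h>

/* Hypothetical container for the decoded CPUID leaf 1 signature. */
struct cpu_signature {
    unsigned type;      /* 0 = OEM, 1 = OverDrive, 2 = dual, 3 = reserved */
    unsigned family;    /* display family */
    unsigned model;     /* display model */
    unsigned stepping;
};

struct cpu_signature decode_cpu_signature(uint32_t eax)
{
    unsigned base_family = (eax >> 8) & 0xf;
    unsigned base_model  = (eax >> 4) & 0xf;
    unsigned ext_family  = (eax >> 20) & 0xff;
    unsigned ext_model   = (eax >> 16) & 0xf;

    struct cpu_signature sig;
    sig.type = (eax >> 12) & 0x3;   /* two-bit field, so the mask is 0x3 */
    sig.stepping = eax & 0xf;

    /* Family 15 is extended by adding the extended family field. */
    sig.family = (base_family == 15) ? base_family + ext_family : base_family;

    /* For base family 6 or 15, the extended model supplies the high nibble. */
    if (base_family == 6 || base_family == 15)
        sig.model = (ext_model << 4) + base_model;
    else
        sig.model = base_model;

    return sig;
}
```

As an illustrative input, decode_cpu_signature(0x000906EA) yields family 6, model 0x9E and stepping 0xA, while 0x00800F11 yields family 0x17 (15 + 8), model 1 and stepping 1.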
The monitor program is written in assembly language and is assembled using the venerable NASM, or Netwide Assembler. It starts and enters a loop in which it displays a menu with various options the user can choose from. This text output and user input is done via SERIAL_PORT, or the “Console device” you can see in the Sparkler architecture diagram.
For all devices that are available to Sparkler, the communication happens via the CPU’s IN and OUT instructions. These instructions cause a VM exit and are handled by our sparkler program, emulating these devices. Similarly, there are other devices that allow you to get the latest Tweet from a particular Twitter account, the weather for certain cities and the air quality report for certain cities.
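The Twitter, weather and air quality cases in main.c all follow the same pattern: fetch a string on the first IN, hand it to the guest one byte per IN, and reset once the terminating NUL has been delivered. That pattern can be factored into a tiny state machine. The sketch below is one possible way to do it, not how main.c is written; the struct, the function names and the fake fetcher are all invented for this example.

```c
#include <stdlib.h>
#include <string.h>

/* A fetcher returns a heap-allocated, NUL-terminated string (or NULL). */
typedef char *(*fetch_fn)(void);

/* Hypothetical per-device state: where the data comes from, the
 * currently buffered string, and the read position within it. */
struct string_device {
    fetch_fn fetch;
    char *data;
    size_t idx;
};

/* Emulates one IN from the device: returns the next byte of the string.
 * The string is fetched lazily on the first read. Once the terminating
 * NUL has been returned, the buffer is freed so the next read refetches. */
char string_device_read(struct string_device *dev)
{
    if (dev->data == NULL) {
        dev->data = dev->fetch();
        dev->idx = 0;
        if (dev->data == NULL)
            return '\0';        /* fetch failed: report an empty string */
    }

    char c = dev->data[dev->idx++];
    if (c == '\0') {
        free(dev->data);
        dev->data = NULL;
        dev->idx = 0;
    }
    return c;
}

/* Stand-in for a network fetcher such as fetch_latest_tweet(). */
char *fake_fetch(void)
{
    const char *msg = "hi";
    char *copy = malloc(strlen(msg) + 1);
    if (copy != NULL)
        strcpy(copy, msg);
    return copy;
}
```

With a device initialised as { .fetch = fake_fetch }, three consecutive reads return 'h', 'i' and '\0', and a fourth read starts over with a fresh fetch. In the KVM_EXIT_IO handler, the returned byte would be stored at run + run->io.data_offset, exactly as the existing cases do.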
The Sparkler web service
Although we use libcurl to fetch content off the internet and json-parser to parse JSON, doing this is a real pain from C. This is very apparent if you’re like me and you’ve been exposed to the simplicity of handling this kind of stuff with higher-level languages. And so, I wrote a quick-and-dirty Sparkler Web Service that outputs JSON that is easily parsable from C. Also, it lets you try out Sparkler in its full glory without first having to register for a Twitter developer account to access the Twitter API and fetch the tweet. This NodeJS service runs on the excellent Heroku platform for free. You can check out some JSON it outputs by clicking on these links here:
As you can see, I’ve made output from these different APIs structurally similar while removing a whole lot of JSON data we’ll never use. This lets us handle this with C fairly easily. When the monitor program requests information from the sparkler program, sparkler makes a request to the web service, parses the response and returns it to the monitor program as a simple string. Trust me, you don’t want to be parsing JSON in assembly language.
Hope you had fun
Summarizing, we went into great detail on how KVM works by building a virtual machine monitor (VMM) in C. This is the piece that interfaces with KVM to create a hardware-based virtual machine, in which we ran a small program written in assembly language that talks to the VMM using devices the VMM emulates. While there is a simple “console” device that lets the VM input and output text, there are other, more complex devices that can read a tweet and get the weather and air quality for a few cities.
About me
My name is Shuveb Hussain and I’m the author of this Linux-focused blog. You can follow me on Twitter where I post tech-related content mostly focusing on Linux, performance, scalability and cloud technologies.