How AWS Firecracker works: a deep dive

Anything that powers technology like AWS Lambda needs to be really fast. And it needs to be secure. AWS could have gone with existing technology, but to satisfy both of these requirements it built something new: Firecracker. It is really fast (it can boot Linux and start executing user space processes in 125ms) and secure (it uses hardware virtualization and more to isolate one Lambda environment from another). In this article, we’ll take a deep dive and look at how exactly Firecracker achieves these two goals. Firecracker is not entirely new, though. Significant chunks of its low-level functionality are based on Google’s crosvm, with some significant changes.

Sparkler: A light-weight Firecracker

While it certainly is fun to read a lot of source code and figure out what is going on under the hood, it is not as much fun as firing up your favorite editor and whipping up a lightweight virtual machine environment under Linux. With Sparkler, we will build a virtual machine monitor (VMM) that manages a virtual machine and provides a certain environment to it. We will also write a tiny “operating system” which will run inside the virtual machine. The VMM emulates some interesting hardware: a device that can read the latest tweet from Command Line Magic’s Twitter handle, a device that can get the weather for certain cities, another device that can fetch the latest air quality measurements for certain cities and, finally, a console device that lets the virtual machine read the keyboard and output text to the terminal.

AWS Firecracker uses Linux’s KVM virtualization toolkit to create and run virtual machines. As we progress, we’ll see how exactly Firecracker’s awesome speed and security are achieved. But first, let’s lay down some groundwork to better understand how we can take advantage of Linux’s KVM (Kernel-based Virtual Machine) to build something like Firecracker. To demonstrate how this works, we build Sparkler, a lightweight virtual environment. The Sparkler environment or the virtual machine monitor (VMM) is written in C and is a KVM-based virtual machine, while the “operating system” we run inside that environment is written in assembly language. The Sparkler VM has an interesting structure, unlike any other virtual machine you might know of. Born in the internet age, it is a truly native citizen.

Sparkler Architecture

The Sparkler virtual machine exposes 4 devices and here is what these “devices” do:

  • Console: this device is like a serial port. It allows the virtual machine to display information and also get user input via the keyboard where required.
  • Twitter device: reading from this device makes available the latest tweet from one of my favorite Twitter handles, @climagic.
  • Weather Info device: Reading from this device, the virtual machine can get the latest weather forecast for 6 different cities.
  • Air Quality Info device: This device makes available air quality information for 6 different cities.

Here is what a session in Sparkler looks like:

A Sparkler session

Hardware virtualization background

Starting in 2005, most Intel chips have had support for hardware virtualization. Before such support was available, virtual machines worked by either emulating every single instruction or, at the least, emulating privileged instructions, because those cause faults when run in user space. With Intel VT and AMD’s SVM technology, a new processor mode was created where operating system code could run natively, at full speed on the real CPU, without the need to emulate or trap every regular or privileged instruction. The hypervisor or the virtual machine monitor can tell the CPU when to “exit”, that is, give control back to the hypervisor. Typical examples are when the virtual machine accesses I/O ports with the IN or OUT instructions, when it accesses certain privileged CPU registers that are normally only touched by the operating system, or when it executes an instruction like CPUID, which provides information on the CPU (the hypervisor might want to control the CPU features the guest sees).

In this article, I refer to “VT” technology as a term to include the corresponding, equivalent AMD SVM technology as well.
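To make the CPUID case concrete, here is a minimal sketch of how a VMM might filter what the guest sees before handing the table to the virtual CPU. The `cpuid_entry` struct and the `clear_ecx_bits` helper are hypothetical, simplified stand-ins and not part of any KVM header. The only hardware fact assumed is that CPUID leaf 1 reports VMX support in bit 5 of ECX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one CPUID leaf as a VMM might store it. */
struct cpuid_entry {
    uint32_t function;              /* CPUID leaf: the value the guest puts in EAX */
    uint32_t eax, ebx, ecx, edx;    /* register values the guest will get back */
};

#define CPUID_1_ECX_VMX (1u << 5)   /* leaf 1, ECX bit 5: VMX supported */

/* Clear `mask` in ECX of every entry for `leaf`.
 * Returns the number of entries that were actually changed. */
size_t clear_ecx_bits(struct cpuid_entry *entries, size_t n,
                      uint32_t leaf, uint32_t mask)
{
    assert(entries != NULL || n == 0);
    size_t changed = 0;
    for (size_t i = 0; i < n; i++) {
        if (entries[i].function == leaf && (entries[i].ecx & mask)) {
            entries[i].ecx &= ~mask;
            changed++;
        }
    }
    return changed;
}
```

Calling `clear_ecx_bits(entries, n, 1, CPUID_1_ECX_VMX)` on such a table would hide VMX from the guest, so a guest kernel would not try to use nested virtualization.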

The Unixification of hardware virtualization

KVM has several interesting features, but we shall look at the interface it provides to Intel’s VT technology. You can program KVM using the well known UNIX file paradigm. Let’s look at some code from main.c in Sparkler.

    kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    if (kvm == -1)
        err(1, "/dev/kvm");

    vmfd = ioctl(kvm, KVM_CREATE_VM, (unsigned long)0);
    if (vmfd == -1)
        err(1, "KVM_CREATE_VM");

    /* Allocate 0x8000 bytes (eight 4 KiB pages) of guest memory to hold the code. */
    mem = mmap(NULL, 0x8000, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)      /* mmap() signals failure with MAP_FAILED, not NULL */
        err(1, "allocating guest memory");

    /* Read our monitor program into RAM */
    int fd = open("monitor", O_RDONLY);
    if (fd == -1)
        err(1, "Unable to open stub");
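    struct stat st;
    if (fstat(fd, &st) == -1)
        err(1, "fstat monitor");
    if (st.st_size > 0x8000)
        errx(1, "monitor does not fit in guest memory");
    if (read(fd, mem, st.st_size) != st.st_size)
        err(1, "reading monitor");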

    struct kvm_userspace_memory_region region = {
            .slot = 0,
            .guest_phys_addr = 0x1000,
            .memory_size = 0x8000,
            .userspace_addr = (uint64_t)mem,
    };
    ret = ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);
    if (ret == -1)
        err(1, "KVM_SET_USER_MEMORY_REGION");

    vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, (unsigned long)0);
    if (vcpufd == -1)
        err(1, "KVM_CREATE_VCPU");

    /* Map the shared kvm_run structure and following data. */
    ret = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, NULL);
    if (ret == -1)
        err(1, "KVM_GET_VCPU_MMAP_SIZE");
    mmap_size = ret;
    if (mmap_size < sizeof(*run))
        errx(1, "KVM_GET_VCPU_MMAP_SIZE unexpectedly small");
    run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, vcpufd, 0);
    if (run == MAP_FAILED)
        err(1, "mmap vcpu");

    /* Set CPUID */
    struct kvm_cpuid2 *cpuid;
    int nent = 40;
    unsigned long size = sizeof(*cpuid) + nent * sizeof(*cpuid->entries);
    cpuid = (struct kvm_cpuid2*) malloc(size);
    bzero(cpuid, size);
    cpuid->nent = nent;

    ret = ioctl(kvm, KVM_GET_SUPPORTED_CPUID, cpuid);
    if (ret < 0) {
        free(cpuid);
        err(1, "KVM_GET_SUPPORTED_CPUID");
    }

    for (int i = 0; i < cpuid->nent; i++) {
        if (cpuid->entries[i].function == 0x80000002)
            __get_cpuid(0x80000002, &cpuid->entries[i].eax, &cpuid->entries[i].ebx, &cpuid->entries[i].ecx, &cpuid->entries[i].edx);
        if (cpuid->entries[i].function == 0x80000003)
            __get_cpuid(0x80000003, &cpuid->entries[i].eax, &cpuid->entries[i].ebx, &cpuid->entries[i].ecx, &cpuid->entries[i].edx);
        if (cpuid->entries[i].function == 0x80000004)
            __get_cpuid(0x80000004, &cpuid->entries[i].eax, &cpuid->entries[i].ebx, &cpuid->entries[i].ecx, &cpuid->entries[i].edx);
    }

    ret = ioctl(vcpufd, KVM_SET_CPUID2, cpuid);
    if (ret < 0) {
        free(cpuid);
        err(1, "KVM_SET_CPUID2");
    }
    free(cpuid);

    /* Initialize CS to point at 0, via a read-modify-write of sregs. */
    ret = ioctl(vcpufd, KVM_GET_SREGS, &sregs);
    if (ret == -1)
        err(1, "KVM_GET_SREGS");
    sregs.cs.base = 0;
    sregs.cs.selector = 0;
    ret = ioctl(vcpufd, KVM_SET_SREGS, &sregs);
    if (ret == -1)
        err(1, "KVM_SET_SREGS");

    /* Initialize registers: instruction pointer for our code and the
     * initial flags required by the x86 architecture (bit 1 is always set). */
    struct kvm_regs regs = {
            .rip = 0x1000,
            .rflags = 0x2,
    };
    ret = ioctl(vcpufd, KVM_SET_REGS, &regs);
    if (ret == -1)
        err(1, "KVM_SET_REGS");

    char *latest_tweet      = NULL;
    char *weather_forecast  = NULL;
    char *aq_report         = NULL;
    int tweet_str_idx       = 0;
    int weather_str_idx     = 0;
    int aq_str_idx          = 0;

    /* Run the VM while handling any exits for device emulation */
    while (1) {
        ret = ioctl(vcpufd, KVM_RUN, NULL);
        if (ret == -1)
            err(1, "KVM_RUN");
        switch (run->exit_reason) {
            case KVM_EXIT_HLT:
                puts("KVM_EXIT_HLT");
                return 0;
            case KVM_EXIT_IO:
                if (run->io.direction == KVM_EXIT_IO_OUT) {
                    switch (run->io.port) {
                        case SERIAL_PORT:
                            putchar(*(((char *)run) + run->io.data_offset));
                            break;
                        default:
                            printf("Port: 0x%x\n", run->io.port);
                            errx(1, "unhandled KVM_EXIT_IO");
                    }
                } else {
                    /* KVM_EXIT_IO_IN */
                    switch (run->io.port) {
                        case SERIAL_PORT:
                            *(((char *)run) + run->io.data_offset) = getche();
                            break;
                        case TWITTER_DEVICE:
                            if (latest_tweet == NULL)
                                latest_tweet = fetch_latest_tweet();
                            char tweet_chr = *(latest_tweet + tweet_str_idx);
                            *(((char *)run) + run->io.data_offset) = tweet_chr;
                            tweet_str_idx++;
                            if (tweet_chr == '\0') {
                                free(latest_tweet);
                                latest_tweet = NULL;
                                tweet_str_idx = 0;
                            }
                            break;
                        case WEATHER_DEVICE_CHENNAI:
                        case WEATHER_DEVICE_DELHI:
                        case WEATHER_DEVICE_LONDON:
                        case WEATHER_DEVICE_CHICAGO:
                        case WEATHER_DEVICE_SFO:
                        case WEATHER_DEVICE_NY:
                            if (weather_forecast == NULL) {
                                char city[64];
                                if (run->io.port == WEATHER_DEVICE_CHENNAI)
                                    strncpy(city, "Chennai", sizeof(city));
                                else if (run->io.port == WEATHER_DEVICE_DELHI)
                                    strncpy(city, "New%20Delhi", sizeof(city));
                                else if (run->io.port == WEATHER_DEVICE_LONDON)
                                    strncpy(city, "London", sizeof(city));
                                else if (run->io.port == WEATHER_DEVICE_CHICAGO)
                                    strncpy(city, "Chicago", sizeof(city));
                                else if (run->io.port == WEATHER_DEVICE_SFO)
                                    strncpy(city, "San%20Francisco", sizeof(city));
                                else if (run->io.port == WEATHER_DEVICE_NY)
                                    strncpy(city, "New%20York", sizeof(city));

                                weather_forecast = fetch_weather(city);
                            }
                            char weather_chr = *(weather_forecast + weather_str_idx);
                            *(((char *)run) + run->io.data_offset) = weather_chr;
                            weather_str_idx++;
                            if (weather_chr == '\0') {
                                free(weather_forecast);
                                weather_forecast = NULL;
                                weather_str_idx = 0;
                            }
                            break;
                        case AIR_QUALITY_DEVICE_CHENNAI:
                        case AIR_QUALITY_DEVICE_DELHI:
                        case AIR_QUALITY_DEVICE_LONDON:
                        case AIR_QUALITY_DEVICE_CHICAGO:
                        case AIR_QUALITY_DEVICE_SFO:
                        case AIR_QUALITY_DEVICE_NY:
                            if (aq_report == NULL) {
                                char city[64];
                                char country[3];
                                if (run->io.port == AIR_QUALITY_DEVICE_CHENNAI) {
                                    strncpy(city, "Chennai", sizeof(city));
                                    strncpy(country, "IN", sizeof(country));
                                }
                                else if (run->io.port == AIR_QUALITY_DEVICE_DELHI) {
                                    strncpy(city, "Delhi", sizeof(city));
                                    strncpy(country, "IN", sizeof(country));
                                }
                                else if (run->io.port == AIR_QUALITY_DEVICE_LONDON) {
                                    strncpy(city, "London", sizeof(city));
                                    strncpy(country, "GB", sizeof(country));
                                }
                                else if (run->io.port == AIR_QUALITY_DEVICE_CHICAGO) {
                                    strncpy(city, "Chicago-Naperville-Joliet", sizeof(city));
                                    strncpy(country, "US", sizeof(country));
                                }
                                else if (run->io.port == AIR_QUALITY_DEVICE_SFO) {
                                    strncpy(city, "San%20Francisco-Oakland-Fremont", sizeof(city));
                                    strncpy(country, "US", sizeof(country));
                                }
                                else if (run->io.port == AIR_QUALITY_DEVICE_NY) {
                                    strncpy(city, "New%20York-Northern%20New%20Jersey-Long%20Island", sizeof(city));
                                    strncpy(country, "US", sizeof(country));
                                }
                                aq_report = fetch_air_quality(country, city);
                            }
                            char aq_chr = *(aq_report + aq_str_idx);
                            *(((char *)run) + run->io.data_offset) = aq_chr;
                            aq_str_idx++;
                            if (aq_chr == '\0') {
                                free(aq_report);
                                aq_report = NULL;
                                aq_str_idx = 0;
                            }
                            break;
                        default:
                            printf("Port: 0x%x\n", run->io.port);
                            errx(1, "unhandled KVM_EXIT_IO");
                    }
                }

                break;
            case KVM_EXIT_FAIL_ENTRY:
                errx(1, "KVM_EXIT_FAIL_ENTRY: hardware_entry_failure_reason = 0x%llx",
                     (unsigned long long)run->fail_entry.hardware_entry_failure_reason);
            case KVM_EXIT_INTERNAL_ERROR:
                errx(1, "KVM_EXIT_INTERNAL_ERROR: suberror = 0x%x", run->internal.suberror);
            default:
                errx(1, "exit_reason = 0x%x", run->exit_reason);
        }
    }

The pseudocode to create and run a VM with KVM

  • open("/dev/kvm") : Open the global KVM device
  • ioctl(KVM_CREATE_VM) : Create a virtual machine
  • mmap(size) : Create memory region for the guest to use
  • read("monitor") : Read our operating system binary into the allocated memory
  • ioctl(KVM_SET_USER_MEMORY_REGION) : Map that memory into the guest’s physical address space
  • ioctl(KVM_CREATE_VCPU) : Create a VCPU for use in our newly created virtual machine
  • run = mmap(vcpufd) : Map the kvm_run structure that KVM shares with us
  • ioctl(KVM_SET_REGS) : Set initial values for some registers
  • while(1)
    • ioctl(KVM_RUN) : Run the VM till there is an exit
    • switch(run->exit_reason) : Decide based on exit reason
      • case KVM_EXIT_HLT: VM executed the halt instruction. Let’s exit.
      • case KVM_EXIT_IO: There was I/O from the VM. Handle it.

As you can see, with just simple Linux system calls like open(), read(), write(), mmap() and ioctl(), we’re able to create and run hardware virtualization-based VMs.
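The long if/else chains in the exit loop only map a port number to a city, and a lookup table can express the same mapping more compactly. The following is a sketch and not Sparkler’s actual code: the port values are an assumption inferred from the monitor program, which adds the menu choice (1 to 6) to a base of 0x100, and `weather_city_for_port` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WEATHER_DEVICE_BASE 0x100   /* assumed: the base the monitor program uses */

/* URL-encoded city names in menu order (choices 1..6). */
static const char *const weather_cities[] = {
    "Chennai", "New%20Delhi", "London", "Chicago", "San%20Francisco", "New%20York",
};

static_assert(sizeof(weather_cities) / sizeof(weather_cities[0]) == 6,
              "the menu offers six cities");

/* Return the city for a weather port, or NULL if the port is not a weather device. */
const char *weather_city_for_port(uint16_t port)
{
    size_t count = sizeof(weather_cities) / sizeof(weather_cities[0]);
    if (port <= WEATHER_DEVICE_BASE)
        return NULL;
    size_t index = (size_t)(port - WEATHER_DEVICE_BASE) - 1;
    return index < count ? weather_cities[index] : NULL;
}
```

With a helper like this, the weather branch of the loop could shrink to a single lookup followed by the fetch, and adding a city would only mean adding a table entry.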

Another way to handle guest I/O is via eventfd(), using the KVM_IOEVENTFD ioctl() call. This attaches an eventfd file descriptor to a port I/O or MMIO address in the guest. When the guest writes to that address, KVM signals the eventfd instead of making a full exit to the VMM’s KVM_RUN loop. This file descriptor can then be passed to poll() or the epoll_* calls, so that events can be dealt with in a better fashion. This is what Firecracker does. Now, let’s look at the “operating system” that runs inside of Sparkler.
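Before moving on, here is a minimal sketch of the eventfd half of that mechanism. It deliberately does not register anything with KVM: a plain write() stands in for the guest’s write, and the function names are hypothetical. What it shows is how a VMM thread can sleep in poll() and then learn how many notifications arrived.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create the descriptor a VMM would later hand to KVM_IOEVENTFD. */
int create_kick_fd(void)
{
    return eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
}

/* Stand-in for the guest writing to the watched address:
 * adds 1 to the eventfd counter. Returns 0 on success, -1 on error. */
int simulate_guest_kick(int efd)
{
    uint64_t one = 1;
    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Wait up to timeout_ms for notifications. Returns how many kicks were
 * pending (the read resets the counter), or -1 on timeout or error. */
int64_t drain_kicks(int efd, int timeout_ms)
{
    assert(efd >= 0);
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    if (poll(&pfd, 1, timeout_ms) <= 0 || !(pfd.revents & POLLIN))
        return -1;
    uint64_t count;
    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return -1;
    return (int64_t)count;
}
```

Because the counter accumulates, several guest notifications that arrive while the VMM is busy collapse into a single wakeup, which is part of why this style scales better than taking one exit per write.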

Our tiny little Sparkler operating system

I had trouble calling this an operating system, so I’m calling it a monitor program, which is a very common term used in embedded systems for operating system-like programs that are not quite operating systems themselves. Intel CPUs since Westmere (introduced 2010) have supported something called unrestricted guest mode. This essentially means that the virtual CPU can start running in real mode, or 16-bit mode, much like a real PC. The operating system can then switch the CPU to 32-bit or 64-bit mode as required. Our monitor program does not switch to 32-bit or 64-bit mode, but lives its life as a 16-bit program.
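In real mode, a physical address is formed as segment × 16 + offset, which is how the monitor’s segment registers line up with the memory region registered in main.c. The sketch below works through that arithmetic with hypothetical helper names. The constants come from the listings in this article: guest memory at 0x1000 with size 0x8000, DS set to 0x100, SS set to 0x120 and SP set to 0x1000.

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode address translation: segment * 16 + offset. */
uint32_t real_mode_phys(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Translate a guest physical address into a byte offset inside the
 * host buffer backing the region, or -1 if it falls outside of it. */
int64_t guest_to_region_offset(uint32_t gpa, uint32_t region_base,
                               uint32_t region_size)
{
    assert(region_size > 0);
    if (gpa < region_base || gpa - region_base >= region_size)
        return -1;
    return (int64_t)(gpa - region_base);
}
```

With DS = 0x100, offset 0 translates to physical 0x1000, exactly where the monitor binary was loaded, so labels in the binary can be used as plain offsets. The initial stack pointer SS:SP = 0x120:0x1000 translates to 0x2200, which is 0x1200 bytes into the 0x8000-byte region.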

As part of the Sparkler build process, NASM turns monitor.asm into monitor, which is the binary program which we then load into guest memory from main.c. This is a file with no real structure, just raw CPU instructions and data.

Although we call this the monitor program, the sparkler program that runs and interacts with KVM is called the VMM, or the virtual machine monitor. Do not confuse these two terms while reading this article. I’ll use the terms “sparkler” and “VMM” interchangeably to refer to the same thing.

bits 16

SERIAL_PORT             equ 0x3f8
TWITTER_DEVICE          equ 0x100
WEATHER_DEVICE_BASE     equ 0x100
AIR_QUALITY_DEVICE_BASE equ 0x200

start:
    mov ax, 0x100
    add ax, 0x20
    mov ss, ax
    mov sp, 0x1000
    cld

    mov ax, 0x100
    mov ds, ax

    mov si, welcome_msg
    call print_str

    jmp menu_loop

press_key:
    mov si, press_any_key
    call print_str
    call get_users_choice
menu_loop:
    call display_main_menu
    call get_users_choice
    cmp al, 0x31
    je .cpu_details
    cmp al, 0x32
    je .latest_tweet
    cmp al, 0x33
    je .weather
    cmp al, 0x34
    je .air_quality
    cmp al, 0x35
    je .halt

    mov si, illegal_choice
    call print_str
    jmp press_key

    .cpu_details:
        call print_cpu_details
        jmp press_key
    .latest_tweet:
        call print_latest_tweet
        call print_new_line
        jmp press_key
    .weather:
        mov si, weather_str
        call print_str
        call print_new_line
        mov si, cities_str
        call print_str
        call print_new_line
        mov si, your_choice
        call print_str
        sub ax, ax
        call get_users_choice
        sub ax, 0x30                    ; turn it from ascii to number

        cmp ax, 1
        jl  .illegal_choice
        cmp ax, 6
        jg .illegal_choice

        add ax, WEATHER_DEVICE_BASE     ; this gives us the port number for the city
        mov dx, ax
        call print_weather
        jmp press_key
    .air_quality:
        mov si, air_quality_str
        call print_str
        call print_new_line
        mov si, cities_str
        call print_str
        call print_new_line
        mov si, your_choice
        call print_str
        sub ax, ax
        call get_users_choice
        sub ax, 0x30                        ; turn it from ascii to number

        cmp ax, 1
        jl  .illegal_choice
        cmp ax, 6
        jg .illegal_choice

        add ax, AIR_QUALITY_DEVICE_BASE     ; this gives us the port number for the city
        mov dx, ax
        call print_weather
        jmp press_key

        .illegal_choice:
            call print_new_line
            mov si, illegal_choice
            call print_str
            jmp press_key
    .halt:
        hlt

data:
    welcome_msg         db `Welcome to Sparkler!\n`, 0

    ; Used by the menu system
    main_menu           db  `\nMain menu:\n==========\n`, 0
    main_menu_items     db  `1. CPU Info\n2. Latest CliMagic Tweet\n3. Get Weather\n4. Get Air Quality\n5. Halt VM\n`, 0
    your_choice         db  `Your choice: \n`, 0
    illegal_choice      db  `You entered an illegal choice!\n\n`, 0
    press_any_key       db  `Press any key to continue...\n`, 0

    ; Used by our CPU ID routines
    cpu_info_str        db  `\nHere is your CPU information:\n`, 0
    cpuid_str           db  `Vendor ID\t: `, 0
    brand_str           db  `Brand string\t: `, 0
    cpu_type_str        db  `CPU type\t: `, 0
    cpu_type_oem        db  'Original OEM Processor', 0
    cpu_type_overdrive  db  'Intel Overdrive Processor', 0
    cpu_type_dual       db  'Dual processor', 0
    cpu_type_reserved   db  'Reserved', 0
    cpu_family_str      db  `Family\t\t: `, 0
    cpu_model_str       db  `Model\t\t: `, 0
    cpu_stepping_str    db  `Stepping\t: `, 0

    ; Used by devices which fetch over the internet
    fetching_wait       db  `\nFetching, please wait...\n`, 0


    weather_str         db `\nChoose the city to get weather forecast for:`, 0
    air_quality_str     db `\nChoose the city to get air quality report for:`, 0
    ; Cities
    cities_str          db  `1. Chennai\n2. New Delhi\n3. London\n4. Chicago\n5. San Francisco\n6. New York`,0

    cpuid_function      dd  0x80000002

get_users_choice:
    mov dx, SERIAL_PORT
    in al, dx                   ; the VMM supplies one byte, so read one byte
    ret

display_main_menu:
    mov si, main_menu
    call print_str
    mov si, main_menu_items
    call print_str
    mov si, your_choice
    call print_str
    ret

print_latest_tweet:
    mov si, fetching_wait
    call print_str
    mov dx, TWITTER_DEVICE
    .get_next_char:
        in al, dx               ; the device returns one byte per read
        cmp al, 0
        je .done
        call print_char
        jmp .get_next_char

    .done:
        ret

; To be called with the weather or air quality port already in DX
print_weather:
    mov si, fetching_wait
    call print_str
    .get_next_char:
        in al, dx               ; the device returns one byte per read
        cmp al, 0
        je .done
        call print_char
        jmp .get_next_char

    .done:
        ret

print_cpu_details:
    mov si, cpu_info_str
    call print_str

    mov si, cpuid_str
    call print_str
    call print_cpuid
    call print_new_line

    call print_cpu_info

    mov si, brand_str
    call print_str
    call print_cpu_brand_string
    call print_new_line
    ret

print_cpuid:
    mov eax, 0
    cpuid
    push ecx
    push edx
    push ebx

    mov cl, 3
    .next_dword:
        pop eax
        mov bl, 4
        .print_register:
            call print_char
            shr eax, 8
            dec bl
            jnz .print_register
        dec cl
        jnz .next_dword

    ret

print_cpu_brand_string:
    mov al, '"'
    call print_char
    .next_function:
        mov eax, [cpuid_function]
        cpuid
        push edx
        push ecx
        push ebx
        push eax

    mov cl, 4
    .next_dword:
        pop eax
        mov bl, 4
        .print_register:
            call print_char
            shr eax, 8
            dec bl
            jnz .print_register
        dec cl
        jnz .next_dword

    inc dword[cpuid_function]
    cmp dword[cpuid_function], 0x80000004
    jle .next_function

    mov al, '"'
    call print_char
    ret

print_cpu_info:
    mov eax, 1
    cpuid

    mov si, cpu_type_str
    call print_str
    mov ecx, eax                        ; save a copy
    shr eax, 12
    and eax, 0x0003                     ; processor type is the 2-bit field in bits 13:12
    cmp al, 0
    je .type_oem
    cmp al, 1
    je .type_overdrive
    cmp al, 2
    je .type_dual
    cmp al, 3
    je .type_reserved

    .type_oem:
        mov si, cpu_type_oem
        jmp .print_cpu_type
    .type_overdrive:
        mov si, cpu_type_overdrive
        jmp .print_cpu_type
    .type_dual:
        mov si, cpu_type_dual
        jmp .print_cpu_type
    .type_reserved:
        mov si, cpu_type_reserved
        jmp .print_cpu_type

    .print_cpu_type:
    call print_str
    call print_new_line

    ; Family
    mov si, cpu_family_str
    call print_str
    mov eax, ecx
    shr eax, 8
    and ax, 0x000f

    cmp ax, 15                  ; if Family == 15, Family is derived as the
    je .calculate_family        ; sum of Family + Extended family bits

    jmp .family_done            ; else

    .calculate_family:
        mov ebx, ecx
        shr ebx, 20
        and bx, 0x00ff
        add ax, bx
    .family_done:
        call print_word_hex

    ; Model
    mov si, cpu_model_str
    call print_str
    cmp al, 6                   ; If family is 6 or 15, the model number
    je .calculate_model         ; is derived from the extended model ID bits
    cmp al, 15
    je .calculate_model

    mov eax, ecx                ; else
    shr eax, 4
    and ax, 0x000f
    jmp .model_done

    .calculate_model:
        mov eax, ecx
        mov ebx, ecx
        shr eax, 16
        and ax, 0x000f
        shl eax, 4
        shr ebx, 4
        and bx, 0x000f
        add eax, ebx
    .model_done:
        call print_word_hex

    ; Stepping
    mov si, cpu_stepping_str
    call print_str
    mov eax, ecx
    and ax, 0x000f
    call print_word_hex

    ret

print_new_line:
    push dx
    push ax
    mov dx, SERIAL_PORT
    mov al, `\n`
    out dx, al
    pop ax
    pop dx
    ret

print_char:
    push dx
    mov dx, SERIAL_PORT
    out dx, al
    pop dx
    ret

print_str:
    push dx
    push ax
    mov dx, SERIAL_PORT
    .print_next_char:
        lodsb               ; load byte pointed to by SI into AL and SI++
        cmp al, 0
        je .printstr_done
        out dx, al
        jmp .print_next_char
    .printstr_done:
        pop ax
        pop dx
        ret

; Print the 16-bit value in AX as HEX
print_word_hex:
    xchg al, ah             ; Print the high byte first
    call print_byte_hex
    xchg al, ah             ; Print the low byte second
    call print_byte_hex
    call print_new_line
    ret

; Print lower 8 bits of AL as HEX
print_byte_hex:
    push dx
    push cx
    push ax

    lea bx, [.table]        ; Get translation table address

    ; Translate each nibble to its ASCII equivalent
    mov ah, al              ; Make copy of byte to print
    and al, 0x0f            ;     Isolate lower nibble in AL
    mov cl, 4
    shr ah, cl              ; Isolate the upper nibble in AH
    xlat                    ; Translate lower nibble to ASCII
    xchg ah, al
    xlat                    ; Translate upper nibble to ASCII

    mov dx, SERIAL_PORT
    mov ch, ah              ; Make copy of lower nibble
    out dx, al
    mov al, ch
    out dx, al

    pop ax
    pop cx
    pop dx
    ret
.table: db "0123456789ABCDEF", 0

The monitor program is written in assembly language and is assembled using the venerable NASM, or Netwide Assembler. It starts and enters a loop in which it displays a menu with various options the user can choose from. This text output and user input is done via SERIAL_PORT, the “Console device” you can see in the Sparkler architecture diagram.

For all devices that are available to Sparkler, the communication happens via the CPU’s IN and OUT instructions. These instructions cause a VM exit and are handled by our sparkler program, emulating these devices. Similarly, there are other devices that allow you to get the latest Tweet from a particular Twitter account, the weather for certain cities and the air quality report for certain cities.
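The Twitter, weather and air quality devices all follow the same byte-at-a-time protocol: the first IN triggers a fetch, each further IN returns the next character, and the terminating NUL resets the device. That shared behavior can be sketched as one small state machine. The `stream_device` type and its functions are hypothetical and not part of Sparkler, and `fake_fetch` stands in for the network call so that the sketch runs offline.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper type: a device that hands out a fetched string
 * one byte per read. */
struct stream_device {
    char *(*fetch)(void);   /* returns a malloc()ed, NUL-terminated string */
    char *data;             /* string being streamed, or NULL when idle */
    size_t pos;             /* index of the next byte to return */
};

/* Emulate one IN from the device. The terminating NUL is returned to the
 * guest as well, and it resets the device so the next read fetches again. */
char stream_device_read(struct stream_device *dev)
{
    assert(dev != NULL && dev->fetch != NULL);
    if (dev->data == NULL) {
        dev->data = dev->fetch();
        dev->pos = 0;
        if (dev->data == NULL)
            return '\0';    /* fetch failed: behave like an empty string */
    }
    char c = dev->data[dev->pos++];
    if (c == '\0') {
        free(dev->data);
        dev->data = NULL;
    }
    return c;
}

/* Stand-in for a network fetch. */
static char *fake_fetch(void)
{
    char *s = malloc(3);
    if (s != NULL)
        memcpy(s, "hi", 3);
    return s;
}
```

In the exit loop, each device case would then reduce to writing the result of `stream_device_read()` into the kvm_run data area, with one `stream_device` instance per device.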

The Sparkler web service

Although we use libcurl to fetch content off the internet and json-parser to parse JSON, doing this is a real pain from C. This is very apparent if, like me, you’ve been exposed to the simplicity of handling this kind of thing in higher-level languages. And so, I wrote a quick-and-dirty Sparkler Web Service that outputs JSON that is easily parsable from C. It also lets you try out Sparkler in its full glory without first having to register for a Twitter developer account just to fetch a tweet through the Twitter API. This NodeJS service runs on the excellent Heroku platform for free. You can check out some of the JSON it outputs by clicking on these links here:

As you can see, I’ve made the output from these different APIs structurally similar while removing a whole lot of JSON data we’ll never use. This lets us handle it from C fairly easily. When the monitor program requests information, the sparkler program makes a request to the web service, parses the response and returns it to the monitor program as a simple string. Trust me, you don’t want to be parsing JSON in assembly language.

Making things really fast

Running unmodified operating systems, especially ones for which source code isn’t available, means that the virtual machine has to very closely resemble a real PC. QEMU (with KVM) and VirtualBox do use hardware virtualization to speed up guests as much as possible, but they also emulate a real PC with BIOS routines and various legacy and peripheral devices like graphics, sound and storage cards. This is why it is easy to run off-the-shelf versions of MS-DOS, Microsoft Windows or binary Linux distributions like Fedora or Ubuntu with QEMU or VirtualBox.

KVM itself, however, has very limited support for these legacy and other devices. Also, when an operating system boots, the boot loader usually depends on BIOS routines to deal with hardware like the screen, keyboard and disks. QEMU, for example, ships with SeaBIOS as its BIOS firmware of choice. Without BIOS routines being present, the virtual machine wouldn’t be a real PC, and this would stop it from running off-the-shelf operating systems.

The most important takeaway is that virtualization systems designed to emulate a PC as closely as possible carry a lot of legacy code, like the BIOS and the emulation of full-blown peripherals, and that this code is relatively slow.

Modern kernel booting

When a PC starts, the CPU is in real mode or 16-bit mode. The BIOS runs the power-on self-test or POST, reads in the boot loader from the designated boot device to a set location in RAM and passes control over to it. The boot loader usually uses BIOS routines to display text, read disks, get system information etc.

It turns out that the Linux kernel does not need BIOS routines at all. It is the boot loader that does. Linux has what is known as a “boot protocol”, which tells the boot loader, or any other program present at boot time, how to lay out the kernel and supporting data structures in RAM and how to pass control to the kernel. For the x86 architecture, boot protocols are available for real mode, 32-bit and 64-bit modes.
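As a small taste of what the boot protocol specifies, an x86 kernel image carries a setup header at fixed offsets: the 16-bit boot flag 0xAA55 at 0x1FE, the ASCII magic “HdrS” at 0x202 and the protocol version at 0x206. The sketch below only performs the sanity check a loader might start with. It is not Firecracker’s loader, and `boot_protocol_version` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BOOT_FLAG_OFFSET  0x1FE   /* 16-bit little-endian 0xAA55 */
#define HDRS_MAGIC_OFFSET 0x202   /* ASCII "HdrS" */
#define VERSION_OFFSET    0x206   /* 16-bit little-endian, 0x020c means 2.12 */

/* Return the boot protocol version as (major << 8) | minor, or -1 if the
 * buffer does not look like an x86 Linux kernel image. */
int boot_protocol_version(const uint8_t *image, size_t len)
{
    assert(image != NULL);
    if (len < VERSION_OFFSET + 2)
        return -1;
    if (image[BOOT_FLAG_OFFSET] != 0x55 || image[BOOT_FLAG_OFFSET + 1] != 0xAA)
        return -1;
    if (memcmp(image + HDRS_MAGIC_OFFSET, "HdrS", 4) != 0)
        return -1;
    return image[VERSION_OFFSET] | (image[VERSION_OFFSET + 1] << 8);
}
```

A real loader would go on to read further header fields and fill in the boot parameters before jumping to the kernel’s entry point; those steps are left out here.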

So, just like how we loaded the monitor program binary into the allocated guest memory, Firecracker loads the Linux kernel, setting up the required data structures like initial stack, command line parameters, as per the Linux kernel boot protocol. See the following files from the Firecracker Git repo:

This does away with the need for having any BIOS routines loaded during the boot process. This simplifies the design of the virtual machine by removing all the complexity required to enable a boot loader or having a need for BIOS routines at all.

The tiny Sparkler monitor program does not use any BIOS routines either; there is no BIOS at all. The print_str routine, for example, uses the x86 OUT instruction to output characters, which causes a VM exit. These exits are handled by the sparkler program, which uses the putchar() library function to display the character on the terminal. Something very similar happens in Firecracker as well. A virtual serial console is set up for the Linux kernel via memory-mapped I/O (MMIO), which is just a fancy term for I/O that is done by reading and writing particular memory addresses. In other words, rather than using specialized instructions like IN and OUT, normal load/store instructions like mov are used for I/O. When these special MMIO addresses are read from or written to, a KVM exit is triggered, letting the virtual machine monitor or the hypervisor handle it. In Firecracker, the function register_mmio_serial() in the file vmm/src/device_manager/mmio.rs makes a serial console available to the Linux kernel for text input/output.
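Where port I/O is dispatched on a port number, an MMIO exit is dispatched on the guest physical address that was touched. A toy version of that lookup is sketched below. Everything in it is hypothetical, including the `mmio_device` type, the function names and the register layout of the pretend serial device; it only illustrates routing by address range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device slot on a toy MMIO bus. */
struct mmio_device {
    uint64_t base;      /* first guest physical address the device claims */
    uint64_t len;       /* size of the claimed range in bytes */
    void (*write)(struct mmio_device *dev, uint64_t offset, uint8_t value);
    uint8_t last;       /* last byte "transmitted", kept for the demo */
};

/* Find the device whose range contains addr, or NULL if nobody claims it. */
struct mmio_device *mmio_find(struct mmio_device *devs, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= devs[i].base && addr - devs[i].base < devs[i].len)
            return &devs[i];
    }
    return NULL;
}

/* Route one guest write. Returns 0 if a device handled it, -1 otherwise. */
int mmio_dispatch_write(struct mmio_device *devs, size_t n,
                        uint64_t addr, uint8_t value)
{
    struct mmio_device *dev = mmio_find(devs, n, addr);
    if (dev == NULL)
        return -1;
    assert(dev->write != NULL);
    dev->write(dev, addr - dev->base, value);
    return 0;
}

/* Pretend serial port: offset 0 is the transmit register in this toy model. */
static void serial_write(struct mmio_device *dev, uint64_t offset, uint8_t value)
{
    if (offset == 0)
        dev->last = value;
}
```

The device callback receives an offset relative to its own base, so the same device code works no matter where the VMM decides to place it in the guest’s address space.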

The Firecracker guest model

The other clever thing about Firecracker is that the virtual machine in which it runs the Linux kernel has very few devices. Linux on x86 assumes the presence of an interrupt controller and an interval timer. These are really fundamental, since the CPU doesn’t have them built in. While it is possible to emulate these devices in the VMM, KVM provides a far better alternative: it can emulate them in-kernel for you. This is a big deal. Remember that as long as there is no VM exit, the virtual machine code is executing natively on the CPU. Every time there is an exit that the VMM has to handle, there are also expensive context switches, which kill performance. The more time is spent executing virtual machine code or staying in the kernel, the better the performance. See KVM_CREATE_IRQCHIP and KVM_CREATE_PIT2 in the KVM API documentation.
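As a rough sketch of what asking KVM for these in-kernel devices looks like from a VMM written in C, the two ioctls can be issued on the VM file descriptor as shown here. The ioctl names, struct kvm_pit_config and the KVM_PIT_SPEAKER_DUMMY flag are from the KVM API on x86; the function name is hypothetical, and this is not a copy of what Firecracker does.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper, x86 only. vm_fd is the descriptor returned by
 * ioctl(kvm_fd, KVM_CREATE_VM, 0). Call this before creating any vCPUs.
 * Returns 0 on success, -1 if either ioctl fails (errno is set by ioctl).
 */
int create_in_kernel_devices(int vm_fd)
{
    struct kvm_pit_config pit;

    /* In-kernel interrupt controllers: PIC, IOAPIC, and a local APIC
     * for every vCPU created afterwards. */
    if (ioctl(vm_fd, KVM_CREATE_IRQCHIP, 0) < 0)
        return -1;

    /* In-kernel i8254 interval timer; it needs the irqchip to exist first. */
    memset(&pit, 0, sizeof pit);
    pit.flags = KVM_PIT_SPEAKER_DUMMY;
    if (ioctl(vm_fd, KVM_CREATE_PIT2, &pit) < 0)
        return -1;

    return 0;
}
```

Once these calls succeed, timer ticks and interrupt delivery are handled inside the kernel and never reach the VMM's exit-handling loop.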

Here is a list of devices in the Firecracker guest model:

If you look at the source code of the PS/2 keyboard controller emulated by Firecracker, you’ll notice that it is really sparse and it implements only one main function trigger_ctrl_alt_del(), which is pretty self-explanatory.

VirtIO Block and Net devices

VirtIO needs a bit more explanation. Real hardware devices have all kinds of quirks and are fairly complicated to program. When you have operating systems that can’t be modified, but come with drivers for some hardware devices, it makes sense to emulate those devices. That way, it becomes possible to run these operating systems unmodified. But with Linux’s rich history of virtualization, higher-performance solutions are available. Modern Linux kernels ship with drivers for a virtual I/O system that was specially designed for virtualized systems. Firecracker takes advantage of this.

VirtIO was initially developed by Rusty Russell for lguest, a hypervisor that made its way into the kernel in 2007 but was removed in 2017. VirtIO, however, continues to thrive. It now has a specification, and device drivers are available in-tree in the Linux kernel. The concept behind VirtIO is very simple: it specifies a way for the guest and host to communicate efficiently. It defines various device types, such as network, block, console, entropy, memory balloon and SCSI host. It supports PCI as a transport, meaning that the guest OS can enumerate VirtIO devices like regular PCI devices and continue to use them as such. VirtIO is relatively simple to program compared to real hardware, and it is also designed to be very high performance. Part of that performance comes from not having to emulate whole, real hardware devices.

Since the Linux kernel also ships with device drivers for VirtIO devices, all the host needs to do is to emulate the specially-designed-for-virtualization VirtIO devices and have Linux work seamlessly with them, with much better performance compared to emulating some other real hardware device Linux also supports.
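The article does not go into how VirtIO moves data, but a small sketch may help show why it is simple to implement on the host side. In the VirtIO specification, the guest describes each request as a chain of descriptors in a shared table (a “split virtqueue”). The struct layout and flag values in this C example follow the specification; the helper function is hypothetical, and the sketch assumes a little-endian host and a descriptor table that has already been mapped into the VMM's address space.

```c
#include <stddef.h>
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT  1  /* the chain continues via the `next` field */
#define VIRTQ_DESC_F_WRITE 2  /* the device writes into this buffer */

/* Split-virtqueue descriptor, as laid out in the VirtIO specification. */
struct virtq_desc {
    uint64_t addr;   /* guest-physical address of the buffer */
    uint32_t len;    /* buffer length in bytes */
    uint16_t flags;  /* VIRTQ_DESC_F_* */
    uint16_t next;   /* index of the next descriptor, if F_NEXT is set */
};

/*
 * Hypothetical helper: walk the chain starting at `head` and add up how
 * many bytes the device may read and how many it may write. Returns 0 on
 * success, -1 if the guest supplied an out-of-range index or a loop.
 */
int chain_lengths(const struct virtq_desc *table, uint16_t queue_size,
                  uint16_t head, uint64_t *readable, uint64_t *writable)
{
    uint16_t idx = head;
    uint16_t seen = 0;

    *readable = 0;
    *writable = 0;
    for (;;) {
        /* Never trust guest-provided indices; bound the walk as well. */
        if (idx >= queue_size || seen++ >= queue_size)
            return -1;

        const struct virtq_desc *d = &table[idx];

        if (d->flags & VIRTQ_DESC_F_WRITE)
            *writable += d->len;
        else
            *readable += d->len;

        if (!(d->flags & VIRTQ_DESC_F_NEXT))
            return 0;
        idx = d->next;
    }
}
```

A block read, for instance, typically arrives as a short readable header followed by a writable data buffer and a one-byte writable status. There are no registers to emulate and no hardware quirks to reproduce, only shared memory with a well-defined layout.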

There is another very powerful feature of the Linux kernel called vhost. These are basically VirtIO devices emulated in the kernel directly, which means that there is no need to context switch to the VMM to deal with I/O from VirtIO devices. At the time of this writing however, Firecracker emulates both the network and the block VirtIO devices in the VMM and does not depend on vhost for further acceleration.

At the time of this writing, Linux kernel 5.4 hasn’t yet been released, but there is a patch that implements virtio-fs, which allows efficient sharing of files and directories between host and guest. This way, a directory containing the guest’s file system can live on the host, much like how Docker works.

Container vs hardware virtualization security

For a use case like AWS Lambda, it might not be such a good idea to just run processes (inside which Lambda functions run) belonging to different accounts on the same physical or virtual server. It would be a security nightmare. You can’t blame an architect, however, if she chose to run processes from different accounts inside Linux containers (using something like Docker). The only problem is that we don’t really know what holes exist that would let processes break out of their container jails. You see, the attack surface is the whole Linux kernel API, and the separation between containers is only a logical one. The same can be said about KVM as well, but its attack surface is relatively small. Also, the VM itself runs in a special hardware virtualization mode, making it a lot safer, relatively speaking. Recent happenings reduce this confidence a bit, but it is human nature to believe hardware is somehow a lot safer than software.

With containers being the main unit of abstraction, one common worry about Kubernetes security is the attack surface of the Linux API. There has been improvement in this direction with projects like Kata Containers, which run containers inside micro-VMs that use hardware virtualization, while providing a Kubernetes-compatible interface so that Kubernetes can still be used to orchestrate these containers. Intel also started a project, NEMU, which is QEMU stripped of most legacy hardware support while taking advantage of the virtual devices Linux supports, an approach very similar to Firecracker’s.

From the interface perspective, another important point to consider is the minimal set of devices supported by Firecracker. This further reduces the available attack surface.

Going beyond hardware-based virtualization

Firecracker goes much further than just relying on the security that hardware-based virtualization buys. There are two primary mechanisms used to secure the Firecracker process further. First, it can be deployed via a jailer utility, which uses chroot() and cgroups to restrict the Firecracker process. Second, it uses seccomp rules to limit what the Firecracker process can do on the host system. Rules are set up carefully to allow only the system calls that Firecracker explicitly needs. There is also an advanced mode in which system call parameters are whitelisted as well. This ensures that not much can be done should the Firecracker process get compromised.
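To give a feel for what a seccomp allow-list is at the lowest level, this C sketch builds a classic BPF program that permits a given set of system calls and kills the process on anything else. The macros, structures and return values come from the Linux headers; the function name and the exact layout of the program are this example's own choices and are not taken from Firecracker. It filters on system call numbers only, not on parameters.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Hypothetical helper: fill `out` with a filter that allows only the
 * system calls listed in `allowed`. `arch` is an AUDIT_ARCH_* value from
 * <linux/audit.h>, e.g. AUDIT_ARCH_X86_64. Returns the number of
 * instructions written, or 0 if `cap` is too small.
 */
size_t build_allowlist(uint32_t arch, const int *allowed, size_t n,
                       struct sock_filter *out, size_t cap)
{
    size_t need = 5 + 2 * n;
    size_t i = 0;

    if (cap < need)
        return 0;

    /* Refuse to run under an unexpected system call ABI. */
    out[i++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                            offsetof(struct seccomp_data, arch));
    out[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, arch, 1, 0);
    out[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS);

    /* Load the system call number and compare it against each entry. */
    out[i++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                            offsetof(struct seccomp_data, nr));
    for (size_t s = 0; s < n; s++) {
        /* Match: fall through to ALLOW. No match: skip over it. */
        out[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                                (uint32_t)allowed[s], 0, 1);
        out[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
    }

    /* Default action for everything that is not on the list. */
    out[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS);
    return i;
}
```

To apply such a program, a process would wrap it in a struct sock_fprog, call prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0), and then prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog). From that point on the filter cannot be removed, which is exactly what you want for a process that may one day be compromised.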

Conclusion / Summary

Firecracker is a VM environment specially built to run only the Linux kernel, doing away with legacy BIOS and devices, while leveraging modern virtualization techniques like VirtIO and securing the Firecracker process with chroot(), cgroups and seccomp rules. On one side we have the most generic virtualization systems, like QEMU and VirtualBox, which can run pretty much any OS targeted at PCs. On the other side, we have systems like Firecracker that are specialized to run Linux-based guests very efficiently, giving up the generic nature of the virtual machines they create.

We also went into great detail on how KVM works by building a virtual machine monitor (VMM) in C. This is the piece that interfaces with KVM to create a hardware-based virtual machine, in which we ran a small program written in assembly language that talks to the VMM using devices the VMM emulates. While there is a simple “console” device that lets the VM input and output text, there are other, more complex devices that can read a tweet and get the weather and air quality for a few cities.