Linux Applications Performance: Part III: Preforked Servers

This chapter is part of a series of articles on Linux application performance.

While the iterative server has trouble serving clients in parallel, the forking server incurs a lot of overhead forking a child process every time a client request is received. We saw from the performance numbers that in our test setup, it has trouble handling more than 1,500 concurrent connections. If forking on every arriving request is indeed the aspect that is causing all this overhead, we should be good if we create a set of pre-forked children ready to accept any incoming requests. This is what the preforked server does. Rather than call fork() on every new request, it calls fork() to create a bunch of child processes beforehand and they take up serving requests. Also, in the forking server, each child process quit after it had processed a request. This is not to be the case with the pre-forked server. Each child goes back to accepting more connections once it has served the previous one. Any number of worker processes created stay around for the lifetime of the server.

We fork a certain count of children when the server starts. Let’s call these worker processes, as they are generally known. All we need to do is to ensure we have sufficient worker processes enough to handle the kind of concurrency we are expecting.

int main(int argc, char *argv[])
    int server_port;
    signal(SIGINT, sigint_handler);

    if (argc > 1)
      server_port = atoi(argv[1]);
      server_port = DEFAULT_SERVER_PORT;

    if (argc > 2)
        strcpy(redis_host_ip, argv[2]);
        strcpy(redis_host_ip, REDIS_SERVER_HOST);

    int server_socket = setup_listening_socket(server_port);
    printf("ZeroHTTPd server listening on port %d\n", server_port);
    for(int i=0; i < PREFORK_CHILDREN; i++)
        pids[i] = create_child(i, server_socket);


The main() function of the preforked server is slightly different since it does not call enter_server_loop() directly. It calls the create_child() function in a loop which runs for PREFORK_CHILDREN iterations.

pid_t create_child(int index, int listening_socket) {
    pid_t pid;

    pid = fork();
    if(pid < 0)
    else if (pid > 0)
        return (pid);

    printf("Server %d(pid: %ld) starting\n", index, (long)getpid());

The create_child() function is not a very busy one. We simply call fork() to create a new child/worker process. In the parent process we return, while in the child process we call enter_server_loop(), which is not very different from what we saw in earlier implementations. It blocks in accept(). The difference now compared to the forking server being that we used to fork a child process after accept() returned with a connection and here each child created waits in accept() on its own. Is this a problem? How can multiple processes wait on accept() on the very same port? Well it turns out that you can’t really call bind() multiple times, since it is the system call that binds to a specific port. Once a process bind to a port, any other process that needs to bind on the same port needs to wait until the port is not listened on any more. Any number of processes that inherit the listening socket can call accept() to accept incoming connections. If there are 10 child processes waiting on accept(), the operating system will, with fairness, equally try to distribute incoming connections among them.

The main parent process is not involved in any work at all once this setup is done. You can see that in main(), it simply keeps calling pause() in an infinite loop. The pause() system calls pauses the caller, but it can return on getting a signal that terminates the process or when a signal handler is called. To ensure that the main process doesn’t exit, we call pause() in a loop, just in case pause() happens to return.

The Thundering Herd

When any process is blocked waiting on an event, it is the job of the kernel to wake it up once the event has occurred. This block could be due to waiting for data to arrive on a socket, waiting on a client to connect (blocking on accept()), waiting for a file to be read in from disk, etc. The Thundering Herd is a problem in computer science when many processes, all waiting for the same event are woken up by the operating system. A good example is our pre-forked server in which there are several child processes all blocked on the same socket in accept(). This problem occurs when the part of the kernel that wakes up all processes or threads leaves it to another part of the kernel, the scheduler, to figure out which process or thread will actually get to run, while all others go back to blocking. When the number of processes/threads increases, this determination wastes CPU cycles every time an event that potentially turns processes or threads runnable occurs. While this problem doesn’t seem to affect Linux anymore (see this and this), it is always good to know about this, since you will see solutions to this problem implemented in code for various other network servers. Linux is now aware if there are multiple processes or threads blocked on the very same socket and will only wake up one so that the scheduler doesn’t have to pick one between several “ready” processes, which could potentially waste CPU cycles.

When switching to a pre-threaded/thread pool architecture for ZeroHTTPd, we will use a common technique to acquire a lock so that only one thread is blocked on accept() and all other threads wait on the mutex.

Performance of the preforked server

Here is our table of performance numbers for easy reference:

concurrency iterative forking preforked threaded prethreaded poll epoll
20 7 112 2,100 1,800 2,250 1,900 2,050
50 7 190 2,200 1,700 2,200 2,000 2,000
100 7 245 2,200 1,700 2,200 2,150 2,100
200 7 330 2,300 1,750 2,300 2,200 2,100
300 380 2,200 1,800 2,400 2,250 2,150
400 410 2,200 1,750 2,600 2,000 2,000
500 440 2,300 1,850 2,700 1,900 2,212
600 460 2,400 1,800 2,500 1,700 2,519
700 460 2,400 1,600 2,490 1,550 2,607
800 460 2,400 1,600 2,540 1,400 2,553
900 460 2,300 1,600 2,472 1,200 2,567
1,000 475 2,300 1,700 2,485 1,150 2,439
1,500 490 2,400 1,550 2,620 900 2,479
2,000 350 2,400 1,400 2,396 550 2,200
2,500 280 2,100 1,300 2,453 490 2,262
3,000 280 1,900 1,250 2,502 wide variations 2,138
5,000 wide variations 1,600 1,100 2,519 2,235
8,000 1,200 wide variations 2,451 2,100
10,000 wide variations 2,200 2,200
11,000 2,200 2,122
12,000 970 1,958
13,000 730 1,897
14,000 590 1,466
15,000 532 1,281

Straight off the bat, we see that performance compared to the forking server has skyrocketed. While the changes to the code are only a handful, the jump in performance is many fold. Up until 2,500 concurrent connections, our pre-forked server is able to handle in excess of 2,100 requests/second. This is more than 5x the performance of our forking server which manages only around 400 requests/second at comparable concurrencies!

In our next article in the series, we shall look at how threaded (as opposed to process-oriented) servers compare to the server architectures we’ve implemented so far.

Articles in this series

  1. Series Introduction
  2. Part I. Iterative Servers
  3. Part II. Forking Servers
  4. Part III. Pre-forking Servers
  5. Part IV. Threaded Servers
  6. Part V. Pre-threaded Servers
  7. Part VI: poll-based server
  8. Part VII: epoll-based server