Linux Applications Performance: Part III: Preforked Servers

This chapter is part of a series of articles on Linux application performance.

While the iterative server has trouble serving clients in parallel, the forking server incurs a lot of overhead forking a child process every time a client request is received. We saw from the performance numbers that in our test setup, it has trouble handling more than 1,500 concurrent connections. If forking on every arriving request is indeed the aspect that is causing all this overhead, we should be good if we create a set of pre-forked children ready to accept any incoming requests. This is what the preforked server does. Rather than call fork() on every new request, it calls fork() to create a bunch of child processes beforehand and they take up serving requests. Also, in the forking server, each child process quit after it had processed a request. This is not to be the case with the pre-forked server. Each child goes back to accepting more connections once it has served the previous one. Any number of worker processes created stay around for the lifetime of the server.

We fork a certain count of children when the server starts. Let’s call these worker processes, as they are generally known. All we need to do is to ensure we have sufficient worker processes enough to handle the kind of concurrency we are expecting.

int main(int argc, char *argv[])
{
    int server_port;
    signal(SIGINT, sigint_handler);

    if (argc > 1)
      server_port = atoi(argv[1]);
    else
      server_port = DEFAULT_SERVER_PORT;

    if (argc > 2)
        strcpy(redis_host_ip, argv[2]);
    else
        strcpy(redis_host_ip, REDIS_SERVER_HOST);

    int server_socket = setup_listening_socket(server_port);
    printf("ZeroHTTPd server listening on port %d\n", server_port);
    for(int i=0; i < PREFORK_CHILDREN; i++)
        pids[i] = create_child(i, server_socket);

    for(;;)
        pause();
}

The main() function of the preforked server is slightly different since it does not call enter_server_loop() directly. It calls the create_child() function in a loop which runs for PREFORK_CHILDREN iterations.

pid_t create_child(int index, int listening_socket) {
    pid_t pid;

    pid = fork();
    if(pid < 0)
        fatal_error("fork()");
    else if (pid > 0)
        return (pid);

    printf("Server %d(pid: %ld) starting\n", index, (long)getpid());
    enter_server_loop(listening_socket);
}

The create_child() function is not a very busy one. We simply call fork() to create a new child/worker process. In the parent process we return, while in the child process we call enter_server_loop(), which is not very different from what we saw in earlier implementations. It blocks in accept(). The difference now compared to the forking server being that we used to fork a child process after accept() returned with a connection and here each child created waits in accept() on its own. Is this a problem? How can multiple processes wait on accept() on the very same port? Well it turns out that you can’t really call bind() multiple times, since it is the system call that binds to a specific port. Once a process bind to a port, any other process that needs to bind on the same port needs to wait until the port is not listened on any more. Any number of processes that inherit the listening socket can call accept() to accept incoming connections. If there are 10 child processes waiting on accept(), the operating system will, with fairness, equally try to distribute incoming connections among them.

The main parent process is not involved in any work at all once this setup is done. You can see that in main(), it simply keeps calling pause() in an infinite loop. The pause() system calls pauses the caller, but it can return on getting a signal that terminates the process or when a signal handler is called. To ensure that the main process doesn’t exit, we call pause() in a loop, just in case pause() happens to return.

The Thundering Herd

When any process is blocked waiting on an event, it is the job of the kernel to wake it up once the event has occurred. This block could be due to waiting for data to arrive on a socket, waiting on a client to connect (blocking on accept()), waiting for a file to be read in from disk, etc. The Thundering Herd is a problem in computer science when many processes, all waiting for the same event are woken up by the operating system. A good example is our pre-forked server in which there are several child processes all blocked on the same socket in accept(). This problem occurs when the part of the kernel that wakes up all processes or threads leaves it to another part of the kernel, the scheduler, to figure out which process or thread will actually get to run, while all others go back to blocking. When the number of processes/threads increases, this determination wastes CPU cycles every time an event that potentially turns processes or threads runnable occurs. While this problem doesn’t seem to affect Linux anymore (see this and this), it is always good to know about this, since you will see solutions to this problem implemented in code for various other network servers. Linux is now aware if there are multiple processes or threads blocked on the very same socket and will only wake up one so that the scheduler doesn’t have to pick one between several “ready” processes, which could potentially waste CPU cycles.

When switching to a pre-threaded/thread pool architecture for ZeroHTTPd, we will use a common technique to acquire a lock so that only one thread is blocked on accept() and all other threads wait on the mutex.

Performance of the preforked server

Here is our table of performance numbers for easy reference:

	requests/second
concurrency	iterative	forking	preforked	threaded	prethreaded	poll	epoll
20	7	112	2,100	1,800	2,250	1,900	2,050
50	7	190	2,200	1,700	2,200	2,000	2,000
100	7	245	2,200	1,700	2,200	2,150	2,100
200	7	330	2,300	1,750	2,300	2,200	2,100
300	–	380	2,200	1,800	2,400	2,250	2,150
400	–	410	2,200	1,750	2,600	2,000	2,000
500	–	440	2,300	1,850	2,700	1,900	2,212
600	–	460	2,400	1,800	2,500	1,700	2,519
700	–	460	2,400	1,600	2,490	1,550	2,607
800	–	460	2,400	1,600	2,540	1,400	2,553
900	–	460	2,300	1,600	2,472	1,200	2,567
1,000	–	475	2,300	1,700	2,485	1,150	2,439
1,500	–	490	2,400	1,550	2,620	900	2,479
2,000	–	350	2,400	1,400	2,396	550	2,200
2,500	–	280	2,100	1,300	2,453	490	2,262
3,000	–	280	1,900	1,250	2,502	wide variations	2,138
5,000	–	wide variations	1,600	1,100	2,519	–	2,235
8,000	–	–	1,200	wide variations	2,451	–	2,100
10,000	–	–	wide variations	–	2,200	–	2,200
11,000	–	–	–	–	2,200	–	2,122
12,000	–	–	–	–	970	–	1,958
13,000	–	–	–	–	730	–	1,897
14,000	–	–	–	–	590	–	1,466
15,000	–	–	–	–	532	–	1,281

Straight off the bat, we see that performance compared to the forking server has skyrocketed. While the changes to the code are only a handful, the jump in performance is many fold. Up until 2,500 concurrent connections, our pre-forked server is able to handle in excess of 2,100 requests/second. This is more than 5x the performance of our forking server which manages only around 400 requests/second at comparable concurrencies!

In our next article in the series, we shall look at how threaded (as opposed to process-oriented) servers compare to the server architectures we’ve implemented so far.

Articles in this series

Posted

April 28, 2019

Linux, Network Programming, Performance

shuveb

Tags:

Comments

5 responses to “Linux Applications Performance: Part III: Preforked Servers”

Linux Applications Performance: Part I: Iterative Servers – Unixism

April 28, 2019

[…] Part III. Pre-forking Servers […]
Linux Applications Performance: Part VII: epoll Servers – Unixism

April 28, 2019

[…] Part III. Pre-forking Servers […]
Linux Applications Performance: Part II: Forking Servers – Unixism

April 28, 2019

[…] Part III. Pre-forking Servers […]
Linux Applications Performance: Introduction – Unixism

May 14, 2019

[…] Part III. Pre-forking Servers […]
Linux Applications Performance: Part V: Pre-threaded Servers – Unixism

May 16, 2019

[…] Rather than having all threads block on accept(), all threads call pthread_mutex_lock(). “Mutex” is a short form of the term “mutual exclusion”. Only one thread “acquires” the mutex and executes past pthread_mutex_lock(). All other threads block on pthread_mutex_lock(). This is an incredibly useful idea. Once one thread returns from accept(), it then calls pthread_mutex_unlock() to release the mutex so that some other thread can then acquire it and call accept(). This setup ensures that only one thread among the pool can actually block on accept(), thus avoiding the thundering herd problem discussed in the pre-forked server architecture article. […]