Linux Applications Performance: Part VII: epoll Servers

This chapter is part of a series of articles on Linux application performance.

If you came here without reading the poll()-based implementation description

All the explanation of how exactly we move from a process- or thread-based model to an event-based model is in the poll()-based article. Without going through it, this article might not make much sense.

Continuing on with epoll

The main difference between a server built around poll() and one built around epoll is just the API; otherwise, the general program structure remains the same. The really big changes came when we moved from a threaded model to an event-based model, where pretty much every function in the program had to change. In this article, we’ll look at the main differences between using poll() and epoll_wait().

As we saw in the previous article in this series, the problem with poll() is that it takes as input a pointer to an array of structures with one element for every file descriptor you want to monitor. This array can grow significantly when you’re dealing with thousands of concurrent requests. Also, when poll() returns, it only tells you how many descriptors are ready. It is your job to iterate through the array and check the revents member of each element to see whether that particular descriptor is the one that’s ready. This does not scale well as the number of concurrent connections increases.
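
To make the contrast concrete, here is a rough sketch of the pattern poll() forces on the caller. This is not ZeroHTTPd code; the fds array, the nfds count and handle_ready_fd() are stand-ins for illustration:

/* Sketch: after poll() returns, we must scan every monitored descriptor
   to find the ones that are actually ready. */
int ready = poll(fds, nfds, -1);        /* the whole array is passed to the kernel each time */
for (nfds_t i = 0; i < nfds && ready > 0; i++) {
    if (fds[i].revents != 0) {
        handle_ready_fd(fds[i].fd);     /* hypothetical per-descriptor handler */
        ready--;
    }
}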

With epoll, you first create an epoll descriptor:

epoll_fd = epoll_create1(0);

Now, you can add descriptors to watch for activity by calling epoll_ctl. Take a look at our utility function add_to_epoll_fd_list().

void add_to_epoll_fd_list(int fd, uint32_t ep_events) {
    struct epoll_event event;
    event.data.fd = fd;
    event.events = ep_events;
    /* Tell the kernel to start monitoring this descriptor for the given events */
    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &event))
        fatal_error("add epoll_ctl()");
}

Finally, you call epoll_wait(), which is the equivalent of calling poll(), but you’ve already told the operating system which descriptors you want to monitor with epoll_ctl(). There is no huge array sitting in user space that you need to pass to the kernel every time you call epoll_wait().

void enter_server_loop(int server_socket)
{
    int epoll_ret;
    struct sockaddr_in client_addr;
    socklen_t client_addr_len = sizeof(client_addr);

    while (1) {
        epoll_ret = epoll_wait(epoll_fd, events, MAX_EVENTS, -1);
        if (epoll_ret < 0)
            fatal_error("epoll_wait()");

        for (int i = 0; i < epoll_ret; i++) {

            if (events[i].events == 0)
                continue;

            if (events[i].data.fd == server_socket) {
                /* The listening socket is ready: accept the new connection */
                int client_socket = accept(
                        server_socket,
                        (struct sockaddr *)&client_addr,
                        &client_addr_len);

                if (client_socket > 0) {
                    handle_client(client_socket);
                }
            } else {
                if (!((events[i].events & EPOLLIN) || (events[i].events & EPOLLOUT))) {
                    printf("ERROR: Error condition in socket %d!\n", events[i].data.fd);
                    exit(1);
                }
                /* A descriptor handling one of the clients is ready */
                client_context *check_cc;
                HASH_FIND(hh_cs, cc_cs_hash, &events[i].data.fd, sizeof(int), check_cc);
                if (!check_cc) {
                    HASH_FIND(hh_rs, cc_rs_hash, &events[i].data.fd, sizeof(int), check_cc);
                    if (!check_cc) {
                        printf("Unable to find client context associated with socket %d.\n", events[i].data.fd);
                        exit(1);
                    }
                }
                check_cc->read_write_callback(check_cc);
            }
        }
    }
}

Take a look at enter_server_loop() above. epoll_wait() fills the events array with entries only for descriptors that are ready and returns how many of them there are. This way, you don’t have to loop through the entire set of monitored file descriptors when only one of them is actually ready.

Apart from the addition of epoll_create1() and epoll_ctl(), the replacement of poll() with epoll_wait() and the changes that go along with that, there are really no other changes needed to convert the poll()-based version of ZeroHTTPd to the epoll-based one.
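
One of the small supporting changes that goes along with this: when a request completes, client_fini() needs a way to stop monitoring its descriptors. With epoll, that means calling epoll_ctl() with EPOLL_CTL_DEL. The article does not list this utility; a minimal sketch of a hypothetical remove_from_epoll_fd_list(), mirroring add_to_epoll_fd_list() above, could look like this:

void remove_from_epoll_fd_list(int fd) {
    /* The event argument is ignored for EPOLL_CTL_DEL on current kernels,
       but kernels before 2.6.9 required a non-NULL pointer, so we pass one. */
    struct epoll_event event;
    if (epoll_ctl(epoll_fd, EPOLL_CTL_DEL, fd, &event))
        fatal_error("del epoll_ctl()");
}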

Performance of the epoll based server

Let’s bring up our table of performance numbers again for easy reference

requests/second
concurrency  iterative  forking  preforked  threaded  prethreaded  poll   epoll
20           7          112      2,100      1,800     2,250        1,900  2,050
50           7          190      2,200      1,700     2,200        2,000  2,000
100          7          245      2,200      1,700     2,200        2,150  2,100
200          7          330      2,300      1,750     2,300        2,200  2,100
300          -          380      2,200      1,800     2,400        2,250  2,150
400          -          410      2,200      1,750     2,600        2,000  2,000
500          -          440      2,300      1,850     2,700        1,900  2,212
600          -          460      2,400      1,800     2,500        1,700  2,519
700          -          460      2,400      1,600     2,490        1,550  2,607
800          -          460      2,400      1,600     2,540        1,400  2,553
900          -          460      2,300      1,600     2,472        1,200  2,567
1,000        -          475      2,300      1,700     2,485        1,150  2,439
1,500        -          490      2,400      1,550     2,620        900    2,479
2,000        -          350      2,400      1,400     2,396        550    2,200
2,500        -          280      2,100      1,300     2,453        490    2,262
3,000        -          280      1,900      1,250     2,502        *      2,138
5,000        -          *        1,600      1,100     2,519        -      2,235
8,000        -          -        1,200      *         2,451        -      2,100
10,000       -          -        *          -         2,200        -      2,200
11,000       -          -        -          -         2,200        -      2,122
12,000       -          -        -          -         970          -      1,958
13,000       -          -        -          -         730          -      1,897
14,000       -          -        -          -         590          -      1,466
15,000       -          -        -          -         532          -      1,281
(* = wide variations seen in results; - = not tested)

In this test, as in the poll()-based test, I had 10 epoll-based processes running. The real competition is with the pre-threaded version of ZeroHTTPd, of which I ran two processes with 10,000 threads in the pool each. You can see something very interesting: while both servers are neck-and-neck in performance, our epoll-based server extends well up to 15,000 concurrent users, whereas the pre-threaded server loses steam around the 12,000 concurrent connections mark. I think the most important observation is that, while the epoll-based server’s raw throughput is merely decent, its relatively low operating system overhead lets it scale very well to very high concurrencies.

There are still a couple of places where the poll()- and epoll-based servers can block: the initial read from the client right after we accept a connection, and all the writes to both the client and to Redis. I believe that if we make these non-blocking as well, we should see even better performance compared to the pre-threaded version.
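
As a side note, switching a descriptor to non-blocking mode is a small fcntl() dance. This is just a sketch of the general technique, not code from ZeroHTTPd: after it, read() and write() fail with EAGAIN/EWOULDBLOCK instead of blocking, so the event loop can register the descriptor with epoll and retry when it is reported ready.

#include <fcntl.h>

int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);      /* fetch the current file status flags */
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}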

Articles in this series

  1. Series Introduction
  2. Part I. Iterative Servers
  3. Part II. Forking Servers
  4. Part III. Pre-forking Servers
  5. Part IV. Threaded Servers
  6. Part V. Pre-threaded Servers
  7. Part VI: poll-based server
  8. Part VII: epoll-based server

Thank you!

If you made it all the way here through the 7 articles, I hope you feel it was worth the journey. Systems programming is a fascinating subject and knowing the nuances can be very rewarding and satisfying. Feel free to tweet at me for any feedback or just a hi-fi: @shuveb.

Linux Applications Performance: Part IV: Threaded Servers

This chapter is part of a series of articles on Linux application performance.

Threads were all the rage in the 90s. They allowed the cool guys to give jaw-dropping demos to their friends and colleagues. Like so many cool things from the 90s, today it is just another tool in any programmer’s arsenal. In Linux, the API to use is POSIX Threads, lovingly shortened to pthreads. Underneath it all, Linux implements threads using the clone() system call, but you don’t really have to worry about that. It is enough to know that the C library presents the well-known pthreads API and you just need to use it.

The threaded server is the equivalent of our forking server. That is, we create a new thread every time we get a connection from a client and handle that request in the new thread, after which the thread terminates. You might think that, since we now know that creating a new process or thread for every incoming request incurs a lot of overhead, this technique isn’t a great idea. You are right. A better idea is the threads equivalent of the pre-forked server: a pre-threaded or thread-pool based server. We are building this one for the sake of completeness, and it is also totally worth it, since it reveals something very interesting: the difference in overhead between creating a process versus creating a thread in response to each new connection.

int main(int argc, char *argv[])
{
    int server_port;
    signal(SIGINT, print_stats);

    if (argc > 1)
      server_port = atoi(argv[1]);
    else
      server_port = DEFAULT_SERVER_PORT;

    if (argc > 2)
        strcpy(redis_host_ip, argv[2]);
    else
        strcpy(redis_host_ip, REDIS_SERVER_HOST);

    int server_socket = setup_listening_socket(server_port);
    printf("ZeroHTTPd server listening on port %d\n", server_port);
    enter_server_loop(server_socket);
    return (0);
}

We are back to a simple main() function. It calls enter_server_loop() directly.

void enter_server_loop(int server_socket)
{
    struct sockaddr_in client_addr;
    socklen_t client_addr_len = sizeof(client_addr);
    pthread_t tid;

    while (1)
    {
        int client_socket = accept(
                server_socket,
                (struct sockaddr *)&client_addr,
                &client_addr_len);
        if (client_socket == -1)
            fatal_error("accept()");

        pthread_create(&tid, NULL, &handle_client, (void *)(intptr_t) client_socket);
    }
}

Here, we get into an infinite loop which calls accept(), and every time there is a new connection, we create a new thread with pthread_create(). We pass handle_client() as the thread’s start routine and client_socket as its argument.

void *handle_client(void *targ)
{
    char line_buffer[1024];
    char method_buffer[1024];
    int method_line = 0;
    int client_socket = (long) targ;
    int redis_server_socket;

    /* No one needs to do a pthread_join() for the OS to free up this thread's resources */
    pthread_detach(pthread_self());

    connect_to_redis_server(&redis_server_socket);

    while (1)
    {
        get_line(client_socket, line_buffer, sizeof(line_buffer));
        method_line++;

        unsigned long len = strlen(line_buffer);

        /*
         The first line has the HTTP method/verb. It's the only
         thing we care about. We read the rest of the lines and
         throw them away.
         */
        if (method_line == 1)
        {
            if (len == 0)
                return (NULL);

            strcpy(method_buffer, line_buffer);
        }
        else
        {
            if (len == 0)
                break;
        }
    }

    handle_http_method(method_buffer, client_socket);
    close(client_socket);
    close(redis_server_socket);
    return(NULL);
}

We see that the handle_client() function implemented here is not very different from the ones we saw in earlier server architectures. Right at the beginning, we call pthread_detach(), which tells the operating system that it can free this thread’s resources without another thread having to call pthread_join() to collect details about its termination. This is much like how a parent process needs to call wait() for its children. Otherwise, this function has all the usual calls needed to handle the client request. Remember, however, that once this function returns, the thread ends, since this function was specified as the thread’s start routine.
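
As an aside, an alternative to calling pthread_detach() inside the thread is to create the thread in the detached state to begin with, using a thread attributes object. This is just a sketch of that variant, not what ZeroHTTPd does:

pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
pthread_create(&tid, &attr, &handle_client, (void *)(intptr_t) client_socket);
pthread_attr_destroy(&attr);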

Performance of the Threaded Server

Here is our table of performance numbers for easy reference:

requests/second
concurrency  iterative  forking  preforked  threaded  prethreaded  poll   epoll
20           7          112      2,100      1,800     2,250        1,900  2,050
50           7          190      2,200      1,700     2,200        2,000  2,000
100          7          245      2,200      1,700     2,200        2,150  2,100
200          7          330      2,300      1,750     2,300        2,200  2,100
300          -          380      2,200      1,800     2,400        2,250  2,150
400          -          410      2,200      1,750     2,600        2,000  2,000
500          -          440      2,300      1,850     2,700        1,900  2,212
600          -          460      2,400      1,800     2,500        1,700  2,519
700          -          460      2,400      1,600     2,490        1,550  2,607
800          -          460      2,400      1,600     2,540        1,400  2,553
900          -          460      2,300      1,600     2,472        1,200  2,567
1,000        -          475      2,300      1,700     2,485        1,150  2,439
1,500        -          490      2,400      1,550     2,620        900    2,479
2,000        -          350      2,400      1,400     2,396        550    2,200
2,500        -          280      2,100      1,300     2,453        490    2,262
3,000        -          280      1,900      1,250     2,502        *      2,138
5,000        -          *        1,600      1,100     2,519        -      2,235
8,000        -          -        1,200      *         2,451        -      2,100
10,000       -          -        *          -         2,200        -      2,200
11,000       -          -        -          -         2,200        -      2,122
12,000       -          -        -          -         970          -      1,958
13,000       -          -        -          -         730          -      1,897
14,000       -          -        -          -         590          -      1,466
15,000       -          -        -          -         532          -      1,281
(* = wide variations seen in results; - = not tested)

The best comparison one can make for the threaded server is with the forking server. At lower concurrency numbers, you can see around a 10x performance improvement over the forking design. At concurrencies of several hundred connections, it outperforms the forking server by 5-6x! Also, very importantly, it is able to deal stably with concurrencies of up to 8,000 connections, although by around 3,000 concurrent connections we begin to see a drop in performance.

In the next article in this series, we’ll see how using a thread pool improves performance over the threaded design described here.

Articles in this series

  1. Series Introduction
  2. Part I. Iterative Servers
  3. Part II. Forking Servers
  4. Part III. Pre-forking Servers
  5. Part IV. Threaded Servers
  6. Part V. Pre-threaded Servers
  7. Part VI: poll-based server
  8. Part VII: epoll-based server

Linux Applications Performance: Part V: Pre-threaded Servers

This chapter is part of a series of articles on Linux application performance.

The design discussed in this article is more popularly known as “thread pool”. Essentially, there is a pre-created pool of threads that are ready to serve any incoming requests. This is comparable to the pre-forked server design. Whereas there was a process pool in the pre-forked architecture, we have a thread pool in this case.

Most of the code is very similar to the pre-forked server. Let’s take a look at the main() function:

int main(int argc, char *argv[])
{
    int server_port;
    signal(SIGINT, print_stats);

    if (argc > 1)
      server_port = atoi(argv[1]);
    else
      server_port = DEFAULT_SERVER_PORT;

    if (argc > 2)
        strcpy(redis_host_ip, argv[2]);
    else
        strcpy(redis_host_ip, REDIS_SERVER_HOST);

    printf("ZeroHTTPd server listening on port %d\n", server_port);
    server_socket = setup_listening_socket(server_port);
    for (int i = 0; i < THREADS_COUNT; ++i) {
        create_thread(i);
    }

    for (;;)
        pause();
}

We call create_thread() in a for loop that runs THREADS_COUNT iterations. After this, the main thread calls pause() in an infinite loop. create_thread() itself is very simple:

void create_thread(int index) {
   pthread_create(&threads[index], NULL, &enter_server_loop, NULL);
}

We call pthread_create() there with enter_server_loop() as the start function. Let’s take a look at that function now:

void *enter_server_loop(void *targ)
{
    struct sockaddr_in client_addr;
    socklen_t client_addr_len = sizeof(client_addr);

    while (1)
    {
        pthread_mutex_lock(&mlock);
        long client_socket = accept(
                server_socket,
                (struct sockaddr *)&client_addr,
                &client_addr_len);
        if (client_socket == -1)
            fatal_error("accept()");
        pthread_mutex_unlock(&mlock);

        handle_client(client_socket);
    }
}

Rather than having all threads block on accept(), all threads call pthread_mutex_lock(). “Mutex” is short for “mutual exclusion”. Only one thread acquires the mutex and executes past pthread_mutex_lock(); all other threads block there. This is an incredibly useful idea. Once the thread holding the mutex returns from accept(), it calls pthread_mutex_unlock() to release the mutex so that some other thread can acquire it and call accept() in turn. This setup ensures that only one thread in the pool is ever blocked on accept(), avoiding the thundering herd problem discussed in the pre-forked server architecture article.
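
The mlock mutex used above is not shown in the listing; it is presumably just a global pthread mutex shared by all the worker threads, declared along these lines (a sketch, not the article’s exact code):

/* One process-wide mutex, initialized before any worker thread is created */
pthread_mutex_t mlock = PTHREAD_MUTEX_INITIALIZER;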

Other parts of the server are pretty much the same code as in the pre-forked server.

Pre-threaded Server Performance

Given that threads are pretty light-weight compared to processes since they share much of the main process memory, there is very little operating system overhead when creating a new thread relative to creating a new process. Let’s bring up our performance numbers table:

requests/second
concurrency  iterative  forking  preforked  threaded  prethreaded  poll   epoll
20           7          112      2,100      1,800     2,250        1,900  2,050
50           7          190      2,200      1,700     2,200        2,000  2,000
100          7          245      2,200      1,700     2,200        2,150  2,100
200          7          330      2,300      1,750     2,300        2,200  2,100
300          -          380      2,200      1,800     2,400        2,250  2,150
400          -          410      2,200      1,750     2,600        2,000  2,000
500          -          440      2,300      1,850     2,700        1,900  2,212
600          -          460      2,400      1,800     2,500        1,700  2,519
700          -          460      2,400      1,600     2,490        1,550  2,607
800          -          460      2,400      1,600     2,540        1,400  2,553
900          -          460      2,300      1,600     2,472        1,200  2,567
1,000        -          475      2,300      1,700     2,485        1,150  2,439
1,500        -          490      2,400      1,550     2,620        900    2,479
2,000        -          350      2,400      1,400     2,396        550    2,200
2,500        -          280      2,100      1,300     2,453        490    2,262
3,000        -          280      1,900      1,250     2,502        *      2,138
5,000        -          *        1,600      1,100     2,519        -      2,235
8,000        -          -        1,200      *         2,451        -      2,100
10,000       -          -        *          -         2,200        -      2,200
11,000       -          -        -          -         2,200        -      2,122
12,000       -          -        -          -         970          -      1,958
13,000       -          -        -          -         730          -      1,897
14,000       -          -        -          -         590          -      1,466
15,000       -          -        -          -         532          -      1,281
(* = wide variations seen in results; - = not tested)

Comparison to the threaded architecture

The pre-threaded server architecture delivers about 40% better performance on average than the threaded server. However, its performance is similar to that of the pre-forked server. This is consistent with the fact that under Linux, processes and threads are scheduled the same way and have similar performance characteristics. Whatever difference in overhead there is lies in creating a process versus creating a thread, since a new process shares less with its parent while a new thread shares pretty much everything with the thread that created it.

Comparison to the pre-forked architecture

There is one other crucial difference compared to the pre-forked architecture. While our pre-forked server scales reliably to about 5,000 concurrent connections with decent performance, our pre-threaded server handles up to 11,000 concurrent connections without significant deterioration in performance. This is a clear advantage over the pre-forked server.

Is switching from pre-forked to pre-threaded worth it?

It is fairly easy in most cases to change a server based on a pre-forked architecture to a pre-threaded one, especially if there are few synchronization primitives in use in the pre-forked version. The performance gains achieved when going from a pre-forked to a pre-threaded architecture are non-trivial, so the effort is worth it. However, this should be done only if you are expecting a lot of concurrent users: as you can see from the table, the numbers are very similar up to a few thousand concurrent users.

Now, for something completely different

Now that you’ve seen how servers based on processes and threads work, we are ready to see, in the next article in the series, how servers based on the event processing model work.

Articles in this series

  1. Series Introduction
  2. Part I. Iterative Servers
  3. Part II. Forking Servers
  4. Part III. Pre-forking Servers
  5. Part IV. Threaded Servers
  6. Part V. Pre-threaded Servers
  7. Part VI: poll-based server
  8. Part VII: epoll-based server

Linux Applications Performance: Part VI: Polling Servers

This chapter is part of a series of articles on Linux application performance.

When things happen sequentially, we understand them easily. All our flowcharts, algorithms and workflows are sequential in nature; it’s easy for our brains to follow sequential happenings. Moving to a multi-process or threaded model is a simple extension of that model: there are still sequential things happening, just in several processes or threads. Event-based architectures, as we will see in the poll- and epoll-based servers, are a little trickier to wrap your head around, but they are not rocket science.

System calls like poll() and epoll_wait() are simple. You give them a bunch of sockets or other file descriptors and they tell you when any of them become ready for I/O, meaning an attempted read or write on them won’t block. Until one or more of these descriptors becomes ready, poll() or epoll_wait() will block. You can specify a timeout to wait for activity on the descriptors, or you can ask to wait indefinitely. There you have the shortest (but incomplete) combined man page for poll() and epoll_wait().

What is the overhead we want to avoid when processing a large number of concurrent users?

Your goal is to process thousands of concurrent users. Let’s take a look at the server architectures we have so far. Iterative is out of the question. We shall also discount the forking and threaded servers, since they create a new process or thread the moment they receive a new connection, leading to clearly measurable performance issues. That leaves us with the pre-forked and pre-threaded servers. We saw that the pre-threaded server can handle up to 11,000 concurrent users as far as our Guestbook app goes. As the number of threads and processes increases, the operating system overhead of scheduling them grows. While this is not a problem at hundreds of processes, at very high concurrencies it becomes a measurable issue.

Analyzing blocking operations

So, we want to avoid running a lot of processes or threads, but still serve a large number of concurrent connections. How exactly can we do this? We do this by handling as many connections as possible in a single process or thread. Although that is tricky to do, it is not very difficult. This is where operating system facilities like poll and epoll come in. Let’s revisit our flow of the Guestbook app:

If you look at the color coding in the flow image above, the red blocks are the ones that could potentially block our process. The orange block is the file read operation where we read in the template file. Unfortunately, under Linux, even though file operations could block, there is no way to find out if they actually will. Having said that, file reads and writes are pretty fast and they are usually cached aggressively. Usually, they are not the culprit. Network operations, like the ones shown in the red blocks can really hold up the process.

The main idea with event based servers is that we want to avoid getting blocked at all costs. Blocking blocks the whole process and remember, the idea is to process as many concurrent requests as possible with just one process. And the one thing that can keep us away from processing more and more requests is blocking. If we are not blocked, we can be accepting more connections, we can be initiating more Redis read and write requests and writing results back to client sockets, thus successfully serving requests.

There are two other blocking operations that are not in the flow diagram: blocking on a call to accept(), where we accept incoming connections, and reading from the client. Both of these can also block our process. These two involve the client socket, while all the red blocks in the diagram involve the Redis socket. Writes can block if you have a lot of data to write, but in our application we don’t do big writes, so for the sake of simplicity we will only worry about getting blocked on reads. If you are waiting to read even a single byte, you can get blocked.

In summary, these are the operations that can block us:

  • The server socket, when accepting new connections with accept()
  • The client socket while reading with read() or recv()
  • The Redis socket while connecting to Redis with connect()
  • Reading from the Redis socket with read() or recv()

Structuring our program

The general idea is this: if we are going to perform a potentially blocking operation, like calling accept(), connect(), read() or recv(), we never just call it directly. We first add the file descriptor in question to poll()’s list of monitored descriptors so that we know when there is data to read, a new incoming connection to accept(), or an outgoing connect() that has completed. We do this because, in the meanwhile, we can process other reads and connects.

Consider this: You have a new client connection #5. If you now try and connect to Redis with socket #6, you could block and not accept new client connections #7 and #8. Rather, you tell poll() to monitor your newly initiated Redis connect call, which could take a while. In the meanwhile, poll() tells you that there are newer client connections #7 and #8 available. It later tells you that Redis connection #6, which is the connection initiated to serve client #5 is now ready. Now that #6 is available, you can ask it to read the guestbook entries in Redis. In the meanwhile, you initiated Redis connections #9 and #10 which are paired with clients #7 and #8 and poll() tells you that connect() got through. You now react to events poll() tells you occurred. This is why this paradigm is called event-driven programming.

The Client Context

In threaded programs, you block and unblock with each call to connect() and read() or recv(), but you don’t really notice it, because the whole context of a single client request lives in one thread, which makes it naturally easy to follow. Take the client flow diagram above: you can think of each block as a state that the client request moves through. With event-driven servers, like those using poll() and epoll_wait(), it is as if you suspend one client request in a particular state when it blocks and switch to another client request, which could be in a different state. poll() or epoll_wait() will tell you when a request is ready to move from one state to the next.

To maintain the state/context of each request, we use a struct called the client_context.

typedef struct client_context{
    int client_sock;
    int redis_sock;
    char templ[16384];      /* Template storage */
    char rendering[16384];

    /* Next callback to make to render the page */
    void (*next_callback)(struct client_context *cc);

    /* Used by Redis reader/writer callback functions */
    void (*read_write_callback)(struct client_context *cc);

    int entries_count;
    char **entries;
    char redis_key_buffer[64];
    int intval;

    /* Hash table handles to look up by client socket or by Redis socket */
    UT_hash_handle hh_cs;   /* Lookup by client socket */
    UT_hash_handle hh_rs;   /* Lookup by Redis socket  */
} client_context;

As far as processing client requests is concerned, we mainly have to deal with two sockets: the client connected socket, which is client_sock and the Redis connected socket, which is redis_sock in the client_context structure.

Each client or Redis socket is identified by a unique file descriptor number assigned by the operating system. When poll() tells us one of the file descriptors is ready, we need to find the client context associated with it. For this, we use two hash tables: given either the client socket or the Redis socket, we can retrieve the associated client_context, which we allocate for every new successful client connection.
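
The two hash table heads, cc_cs_hash and cc_rs_hash, are not shown in the listings. With uthash, the bookkeeping lives inside each client_context via the UT_hash_handle members, so the "tables" are just global head pointers, presumably declared along these lines (a sketch):

client_context *cc_cs_hash = NULL;   /* lookup keyed by client socket (hh_cs handle) */
client_context *cc_rs_hash = NULL;   /* lookup keyed by Redis socket (hh_rs handle)  */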

We use the read_write_callback function pointer to move from one state to the next. After poll() returns, we fetch the associated client_context and simply call the function pointed to by read_write_callback. That function stores the next function to call in the same pointer, initiates the next operation, adds the file descriptor to poll()’s monitored list if it is not already there, and returns, at which point the loop calls poll() again.

The poll() loop

The beating heart of our program is this loop in enter_server_loop():

void enter_server_loop(int server_socket)
{
    int poll_ret;
    struct sockaddr_in client_addr;
    socklen_t client_addr_len = sizeof(client_addr);

    while (1) {
        poll_ret = poll(poll_fds, poll_nfds, -1);
        if (poll_ret < 0)
            fatal_error("poll()");

        for (int i = 0; i < poll_nfds; i++) {

            if (poll_fds[i].revents == 0)
                continue;

            if (poll_fds[i].fd == server_socket) {
                /* The listening socket is ready: accept the new connection */
                int client_socket = accept(
                        server_socket,
                        (struct sockaddr *)&client_addr,
                        &client_addr_len);

                if (client_socket > 0) {
                    handle_client(client_socket);
                }
            } else {
                if (!((poll_fds[i].revents & POLLIN) || (poll_fds[i].revents & POLLOUT))) {
                    printf("ERROR: Error condition in socket %d!\n", poll_fds[i].fd);
                    exit(1);
                }
                /* A descriptor handling one of the clients is ready */
                client_context *check_cc;
                HASH_FIND(hh_cs, cc_cs_hash, &poll_fds[i].fd, sizeof(int), check_cc);
                if (!check_cc) {
                    HASH_FIND(hh_rs, cc_rs_hash, &poll_fds[i].fd, sizeof(int), check_cc);
                    if (!check_cc) {
                        printf("Unable to find client context associated with socket %d.\n", poll_fds[i].fd);
                        exit(1);
                    }
                }
                check_cc->read_write_callback(check_cc);
            }
        }
    }
}
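
The loop keeps its set of monitored descriptors in the poll_fds array, paired with the poll_nfds count. The add_to_poll_fd_list() and remove_from_poll_fd_list() utility functions that maintain them are not listed in this article; a minimal sketch of what they could look like follows. The fixed-size array and the MAX_POLL_FDS constant are assumptions made for illustration:

#define MAX_POLL_FDS 16384

struct pollfd poll_fds[MAX_POLL_FDS];
nfds_t poll_nfds = 0;

void add_to_poll_fd_list(int fd, short events) {
    if (poll_nfds == MAX_POLL_FDS)
        fatal_error("poll fd list full");
    poll_fds[poll_nfds].fd = fd;
    poll_fds[poll_nfds].events = events;
    poll_fds[poll_nfds].revents = 0;
    poll_nfds++;
}

void remove_from_poll_fd_list(int fd) {
    for (nfds_t i = 0; i < poll_nfds; i++) {
        if (poll_fds[i].fd == fd) {
            /* Move the last entry into the hole to keep the array compact */
            poll_fds[i] = poll_fds[poll_nfds - 1];
            poll_nfds--;
            return;
        }
    }
}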

Our main() function adds the server’s listening socket to poll()’s file descriptor list by calling our utility function add_to_poll_fd_list(). After that, our program pretty much revolves around this loop. poll() starts out with only the listening socket, which can accept new client connections. When a new client connection is made, handle_client() is called. This sets off the dominoes for that request. We leave the static file handling logic of the ZeroHTTPd server untouched. However, when it comes to the Guestbook app, we take care not to make any potentially blocking calls directly. For example, if we want to read the guestbook remarks list from Redis, we write out the Redis command to do so, but tell poll() to let us know when our Redis socket has data ready for reading. Let’s look at that flow, starting with how we make our Redis connection.

int init_render_guestbook_template(int client_sock) {
    /* This is a new connection. Allocate a client context. */
    client_context *cc = malloc(sizeof(client_context));
    if (!cc)
        fatal_error("malloc()");
    cc->client_sock = client_sock;
    HASH_ADD(hh_cs, cc_cs_hash, client_sock, sizeof(int), cc);

    cc->next_callback = render_guestbook_post_redis_connect;
    connect_to_redis_server(cc);
    HASH_ADD(hh_rs, cc_rs_hash, redis_sock, sizeof(int), cc);
}

This is where it all starts. This function allocates memory for a new client_context structure, adds the client and Redis sockets to the two lookup tables, sets the next_callback function pointer in the client context to render_guestbook_post_redis_connect, and calls connect_to_redis_server(), which initiates the connection to the Redis server:

void connect_to_redis_server(client_context *cc) {
    struct sockaddr_in redis_servaddr;
    /* Create a non-blocking socket */
    int redis_socket_fd = socket(AF_INET, SOCK_STREAM|SOCK_NONBLOCK, 0);
    memset(&redis_servaddr, 0, sizeof(redis_servaddr));
    redis_servaddr.sin_family = AF_INET;
    redis_servaddr.sin_port = htons(REDIS_SERVER_PORT);

    int pton_ret = inet_pton(AF_INET, redis_host_ip, &redis_servaddr.sin_addr);

    if (pton_ret < 0)
        fatal_error("inet_pton()");
    else if (pton_ret == 0) {
        fprintf(stderr, "Error: Please provide a valid Redis server IP address.\n");
        exit(1);
    }

    int cret = connect(redis_socket_fd,
                       (struct sockaddr *) &redis_servaddr,
                       sizeof(redis_servaddr));
    if (cret == -1 && errno != EINPROGRESS) {
        fatal_error("redis connect()");
    } else {
        cc->redis_sock = redis_socket_fd;
        cc->read_write_callback = connect_to_redis_server_callback;
        add_to_poll_fd_list(redis_socket_fd, POLLOUT);
    }
}

If you notice, here in connect_to_redis_server() we call connect(), but we’ve put this socket into non-blocking mode with the SOCK_NONBLOCK option to the socket() call. This makes sure connect() does not block. We then add the Redis socket to poll()’s polling list, which is the list of file descriptors it monitors. Another important thing to note is the function we set read_write_callback to. If you go back and look at enter_server_loop(), you will see that once we locate the client context associated with the file descriptor that is ready, we simply call the function pointed to by read_write_callback. This function pointer is responsible for moving the request from state to state, as in the state-flow diagram. Here, we’ve set it to connect_to_redis_server_callback().

Now, poll() will wait for the connect() to complete. Establishing a TCP connection involves a 3-way handshake that can take a while, which is why we don’t want to potentially block for a long time in connect(). It may also take longer if the Redis server is busy for some reason. Let’s take a closer look at this portion of the code in enter_server_loop():

            /* A descriptor handling one of the clients is ready */
            client_context *check_cc;
            HASH_FIND(hh_cs, cc_cs_hash, &poll_fds[i].fd, sizeof(int), check_cc);
            if (!check_cc) {
                HASH_FIND(hh_rs, cc_rs_hash, &poll_fds[i].fd, sizeof(int), check_cc);
                if (!check_cc) {
                    printf("Unable to find client context associated with socket %d.\n", poll_fds[i].fd);
                    exit(1);
                }
            }
            check_cc->read_write_callback(check_cc);

We use the HASH_FIND() macros to find the client context associated with the file descriptor poll() says is ready. We have no clue whether this file descriptor is a client connection or a Redis socket, so we simply check both hash tables. Once we locate the client_context, we call the function pointed to by read_write_callback. At this point, that is connect_to_redis_server_callback(). Let’s take a look at that function:

void connect_to_redis_server_callback(client_context *cc) {

    int result;
    socklen_t result_len = sizeof(result);
    if (getsockopt(cc->redis_sock, SOL_SOCKET, SO_ERROR, &result, &result_len) < 0) {
        fatal_error("Redis connect getsockopt()");
    }

    if (result != 0) {
        fatal_error("Redis connect getsockopt() result");
    }

    remove_from_poll_fd_list(cc->redis_sock);

    /* Make the redis socket blocking */
    fcntl(cc->redis_sock, F_SETFL,
          fcntl(cc->redis_sock, F_GETFL) &
          ~O_NONBLOCK);

    add_to_poll_fd_list(cc->redis_sock, POLLIN);
    cc->next_callback(cc);
}

Here, we remove the non-blocking mode on the Redis socket. We do this because we want to write Redis commands in blocking mode. As discussed earlier, we’ve decided to have poll() track readiness only for connect() and for reads; we write to the socket directly, even if we can potentially block. We’ve chosen this route since it is simpler; handling writes in callbacks as well would add more complexity to our program. It is a great exercise, however, to move the writes into callbacks that run once poll() tells us our sockets are ready for writing. Finally, we call the function pointed to by next_callback. This is the other function pointer we use to move the state forward: while read_write_callback is what we call when poll() tells us a file descriptor is ready for reading or a new connection is available, next_callback is what actually moves the request from one state to the next. At this point, next_callback points to render_guestbook_post_redis_connect(), as we stored in init_render_guestbook_template(). Let’s take a look at that function:

void render_guestbook_post_redis_connect(client_context *cc) {
    add_to_poll_fd_list(cc->client_sock, POLLIN);
    memset(cc->rendering, 0, sizeof(cc->rendering));
    memset(cc->templ, 0, sizeof(cc->templ));

    /* Read the template file */
    int fd = open(GUESTBOOK_TEMPLATE, O_RDONLY);
    if (fd == -1)
        fatal_error("Template read()");
    read(fd, cc->templ, sizeof(cc->templ));
    close(fd);

    /* Get guestbook entries and render them as HTML */
    cc->next_callback = render_guestbook_entries;
    redis_get_list( GUESTBOOK_REDIS_REMARKS_KEY, cc);
}

Here, we read the template file and store it in the client context, change next_callback to point to render_guestbook_entries and then call redis_get_list().

/*
 * Get the range of items in a list from 'start' to 'end'.
 * This function allocates memory. An array of pointers and all the strings pointed to by it
 * are dynamically allocated. redis_free_array_results() has to be called to free this
 * allocated memory by the caller.
 * */

void redis_list_get_range(char *key, int start, int end, client_context *cc) {
    /* Shortcut: initializing a char array with an empty string zero-fills the whole array */
    char cmd_buf[1024]="", start_str[16], end_str[16];
    sprintf(start_str, "%d", start);
    sprintf(end_str, "%d", end);
    sprintf(cmd_buf, "*4\r\n$6\r\nLRANGE\r\n$%ld\r\n%s\r\n$%ld\r\n%s\r\n$%ld\r\n%s\r\n", strlen(key), key, strlen(start_str), start_str, strlen(end_str), end_str);
    redis_write(cc->redis_sock, cmd_buf, strlen(cmd_buf));
    cc->read_write_callback = redis_list_get_range_callback;
}

/*
 * Utility function to get the whole list.
 * We set start to 0 and end to -1, which is a magic number that means the last element.
 * */

void redis_get_list(char *key, client_context *cc) {
    redis_list_get_range(key, 0, -1, cc);
}

As you can see, redis_get_list() simply calls redis_list_get_range(). There, we write out the command to Redis and set read_write_callback to redis_list_get_range_callback(), which will be called whenever poll() reports data ready to read on the Redis socket. This happens when Redis has processed the command to fetch the list and has begun writing data to the socket.
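
redis_write() is another small helper that is not listed in the article. Since we decided to keep writes to the Redis socket blocking, all it needs to do is keep writing until the whole command has gone out. A minimal sketch of that idea, assuming <unistd.h> and <errno.h> are included (not the article’s actual implementation):

ssize_t redis_write(int fd, const char *buf, size_t len) {
    size_t written = 0;
    while (written < len) {
        ssize_t ret = write(fd, buf + written, len - written);
        if (ret < 0) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            return -1;                  /* a real error */
        }
        written += ret;
    }
    return (ssize_t) written;           /* everything was written */
}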

If you look at the source code for redis_list_get_range_callback(), which I’m omitting here for simplicity (it’s a long function), you’ll see that it finally calls the function pointed to by next_callback, which now points to render_guestbook_entries(), like we set in render_guestbook_post_redis_connect(). At this point redis_list_get_range_callback() has stored the guestbook entries data from Redis in the client context.

void render_guestbook_entries(client_context *cc) {
    char guest_entries_html[16384]="";

    for (int i = 0; i < cc->entries_count; i++) {
        char guest_entry[1024];
        sprintf(guest_entry, "<p class=\"guest-entry\">%s</p>", cc->entries[i]);
        strcat(guest_entries_html, guest_entry);
    }
    redis_free_array_result(cc->entries, cc->entries_count);

    /* Replace guestbook entries */
    char *entries = strstr(cc->templ, GUESTBOOK_TMPL_REMARKS);
    if (entries) {
        memcpy(cc->rendering, cc->templ, entries-cc->templ);
        strcat(cc->rendering, guest_entries_html);
        char *copy_offset = cc->templ + (entries-cc->templ) + strlen(GUESTBOOK_TMPL_REMARKS);
        strcat(cc->rendering, copy_offset);
        strcpy(cc->templ, cc->rendering);
        memset(cc->rendering, 0, sizeof(cc->rendering));
    }

    bump_visitor_count(cc);
}

Here, we form the HTML snippet needed for the guestbook entries, replace the template variable with the snippet and then call bump_visitor_count().

void bump_visitor_count(client_context *cc) {
    cc->next_callback = bump_visitor_count_callback;
    redis_incr(GUESTBOOK_REDIS_VISITOR_KEY, cc);
}

This sets next_callback to point to bump_visitor_count_callback() and then calls redis_incr(), which again sets a read_write_callback, writes the command to Redis, and depends on poll() to tell us when data is ready to read. The actual reading is done in the callback pointed to by read_write_callback, which then calls next_callback. This chain continues until finally client_fini() is called, which marks the end of processing a single request:

void client_fini(client_context *cc) {
    client_context *check_cc;
    HASH_FIND(hh_cs, cc_cs_hash, &cc->client_sock, sizeof(cc->client_sock), check_cc);
    if (check_cc) {
        remove_from_poll_fd_list(cc->client_sock);
        remove_from_poll_fd_list(cc->redis_sock);
        close(cc->client_sock);
        close(cc->redis_sock);
        HASH_DELETE(hh_cs, cc_cs_hash, check_cc);
        HASH_DELETE(hh_rs, cc_rs_hash, check_cc);
        free(check_cc);
    }
    else {
        printf("Could not retrieve client context back from hash!\n");
    }
}

This function closes the client and Redis sockets, removes the entries from both hash tables and finally frees the memory allocated for the client context. Phew!

Performance of the Polling Server

As you can see from the performance numbers table, our poll()-based server fares better than the threaded server at lower concurrencies, but gets worse at higher concurrencies. The pre-threaded and pre-forked servers easily beat it.

requests/second
concurrency  iterative  forking  preforked  threaded  prethreaded  poll   epoll
20           7          112      2,100      1,800     2,250        1,900  2,050
50           7          190      2,200      1,700     2,200        2,000  2,000
100          7          245      2,200      1,700     2,200        2,150  2,100
200          7          330      2,300      1,750     2,300        2,200  2,100
300          -          380      2,200      1,800     2,400        2,250  2,150
400          -          410      2,200      1,750     2,600        2,000  2,000
500          -          440      2,300      1,850     2,700        1,900  2,212
600          -          460      2,400      1,800     2,500        1,700  2,519
700          -          460      2,400      1,600     2,490        1,550  2,607
800          -          460      2,400      1,600     2,540        1,400  2,553
900          -          460      2,300      1,600     2,472        1,200  2,567
1,000        -          475      2,300      1,700     2,485        1,150  2,439
1,500        -          490      2,400      1,550     2,620        900    2,479
2,000        -          350      2,400      1,400     2,396        550    2,200
2,500        -          280      2,100      1,300     2,453        490    2,262
3,000        -          280      1,900      1,250     2,502        *      2,138
5,000        -          *        1,600      1,100     2,519        -      2,235
8,000        -          -        1,200      *         2,451        -      2,100
10,000       -          -        *          -         2,200        -      2,200
11,000       -          -        -          -         2,200        -      2,122
12,000       -          -        -          -         970          -      1,958
13,000       -          -        -          -         730          -      1,897
14,000       -          -        -          -         590          -      1,466
15,000       -          -        -          -         532          -      1,281
(* = wide variations seen in results; - = not tested)

But one thing you have to realize is that our poll()-based server is capable of handling far more concurrency in a single process. The pre-forked and pre-threaded servers conveniently created many processes and threads to handle requests. The convention with event-based servers is to run a few instances to take on the load, usually as many as there are CPU cores available on the hosting server. In these tests, although I tested with only one CPU core, I created 10 instances of the poll()-based server, whereas there were 10,000 processes and 20,000 threads while testing the pre-forked and pre-threaded servers respectively. The poll()-based server can indeed handle a lot of concurrent connections per process.

Then, for all the talk about event-based servers exhibiting very high performance, how come we are not able to compete with the threaded and pre-threaded servers? The problem lies in the design of the poll() system call. Take a look at its signature:

int poll(struct pollfd *fds, nfds_t nfds, int timeout);

The first argument points to an array of structures of type pollfd. The next parameter tells poll() how many structures are present in the array. Here is the trouble: every time we invoke poll(), this array of structures has to be copied into kernel space. When poll() returns successfully, meaning that one or more file descriptors have activity, the only way to find out which descriptors are ready is to iterate through the whole list and check the revents member of every pollfd in the array. Although poll() returns the number of descriptors that are ready, it doesn’t tell us which ones they are; we have to iterate through the list until we’ve found them all. This can take a while when we are tracking thousands of file descriptors, since we need to run this linear search every time poll() returns. Also, remember that we track both a client socket and a Redis connected socket per client connection. This is the main bottleneck with poll().
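
To put rough numbers on that: at the 15,000-connection end of our tests we are tracking two descriptors per client (the client socket and its paired Redis socket), or about 30,000 pollfd entries. At 8 bytes per struct pollfd, that is roughly 240 KB copied into the kernel, plus a scan of up to 30,000 entries in user space, on every single call to poll().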

In the next section, we will see how the epoll family of calls does away with this problem and provides a descriptor activity tracking mechanism in the operating system that is a lot more scalable.

Articles in this series

  1. Series Introduction
  2. Part I. Iterative Servers
  3. Part II. Forking Servers
  4. Part III. Pre-forking Servers
  5. Part IV. Threaded Servers
  6. Part V. Pre-threaded Servers
  7. Part VI: poll-based server
  8. Part VII: epoll-based server

Linux Applications Performance: Part III: Preforked Servers

This chapter is part of a series of articles on Linux application performance.

While the iterative server has trouble serving clients in parallel, the forking server incurs a lot of overhead by forking a child process for every client request it receives. We saw from the performance numbers that, in our test setup, it has trouble handling more than 1,500 concurrent connections. If forking on every arriving request is indeed what is causing all this overhead, we should be good if we create a set of pre-forked children ready to accept incoming requests. This is what the preforked server does. Rather than calling fork() on every new request, it calls fork() to create a bunch of child processes beforehand, and they take up serving requests. Also, in the forking server, each child process quit after it had processed a request. This is not the case with the pre-forked server: each child goes back to accepting more connections once it has served the previous one. The worker processes created stay around for the lifetime of the server.

We fork a certain number of children when the server starts. Let’s call them worker processes, as they are generally known. All we need to do is ensure we have enough worker processes to handle the kind of concurrency we are expecting.

int main(int argc, char *argv[])
{
    int server_port;
    signal(SIGINT, sigint_handler);

    if (argc > 1)
      server_port = atoi(argv[1]);
    else
      server_port = DEFAULT_SERVER_PORT;

    if (argc > 2)
        strcpy(redis_host_ip, argv[2]);
    else
        strcpy(redis_host_ip, REDIS_SERVER_HOST);

    int server_socket = setup_listening_socket(server_port);
    printf("ZeroHTTPd server listening on port %d\n", server_port);
    for(int i=0; i < PREFORK_CHILDREN; i++)
        pids[i] = create_child(i, server_socket);

    for(;;)
        pause();
}

The main() function of the preforked server is slightly different since it does not call enter_server_loop() directly. It calls the create_child() function in a loop which runs for PREFORK_CHILDREN iterations.

pid_t create_child(int index, int listening_socket) {
    pid_t pid;

    pid = fork();
    if(pid < 0)
        fatal_error("fork()");
    else if (pid > 0)
        return (pid);

    printf("Server %d(pid: %ld) starting\n", index, (long)getpid());
    enter_server_loop(listening_socket);
}

The create_child() function is not a very busy one. We simply call fork() to create a new child/worker process. In the parent process we return, while in the child process we call enter_server_loop(), which is not very different from what we saw in earlier implementations: it blocks in accept(). The difference compared to the forking server is that we used to fork a child process after accept() returned with a connection, whereas here every child waits in accept() on its own. Is this a problem? How can multiple processes wait in accept() on the very same port? It turns out that you can’t call bind() on the same port more than once: once a process has bound a port, any other process that wants to bind it has to wait until the port is no longer being listened on. However, any number of processes that inherit the listening socket can call accept() on it to accept incoming connections. If there are 10 child processes waiting in accept(), the operating system will try to distribute incoming connections among them fairly.

The main parent process is not involved in any work at all once this setup is done. You can see that in main() it simply keeps calling pause() in an infinite loop. The pause() system call pauses the caller, but it can return when a signal that terminates the process arrives or when a signal handler runs. To ensure that the main process doesn’t exit, we call pause() in a loop, just in case pause() happens to return.

The Thundering Herd

When any process is blocked waiting on an event, it is the job of the kernel to wake it up once the event has occurred. The block could be due to waiting for data to arrive on a socket, waiting for a client to connect (blocking in accept()), waiting for a file to be read in from disk, and so on. The Thundering Herd is a problem in computer science where many processes, all waiting for the same event, are woken up by the operating system. A good example is our pre-forked server, in which several child processes are all blocked on the same socket in accept(). The problem occurs when the part of the kernel that wakes up all the processes or threads leaves it to another part of the kernel, the scheduler, to figure out which one actually gets to run, while all the others go back to blocking. As the number of processes or threads increases, this determination wastes CPU cycles every time an event occurs that could make processes or threads runnable. While this problem doesn’t seem to affect Linux anymore (see this and this), it is good to know about, since you will see solutions to it implemented in the code of various other network servers. Linux is now aware when multiple processes or threads are blocked on the very same socket and will wake up only one, so that the scheduler doesn’t have to pick between several “ready” processes, which could waste CPU cycles.

When switching to a pre-threaded/thread pool architecture for ZeroHTTPd, we will use a common technique to acquire a lock so that only one thread is blocked on accept() and all other threads wait on the mutex.

Performance of the preforked server

Here is our table of performance numbers for easy reference:

requests/second
concurrency  iterative  forking  preforked  threaded  prethreaded  poll   epoll
20           7          112      2,100      1,800     2,250        1,900  2,050
50           7          190      2,200      1,700     2,200        2,000  2,000
100          7          245      2,200      1,700     2,200        2,150  2,100
200          7          330      2,300      1,750     2,300        2,200  2,100
300          -          380      2,200      1,800     2,400        2,250  2,150
400          -          410      2,200      1,750     2,600        2,000  2,000
500          -          440      2,300      1,850     2,700        1,900  2,212
600          -          460      2,400      1,800     2,500        1,700  2,519
700          -          460      2,400      1,600     2,490        1,550  2,607
800          -          460      2,400      1,600     2,540        1,400  2,553
900          -          460      2,300      1,600     2,472        1,200  2,567
1,000        -          475      2,300      1,700     2,485        1,150  2,439
1,500        -          490      2,400      1,550     2,620        900    2,479
2,000        -          350      2,400      1,400     2,396        550    2,200
2,500        -          280      2,100      1,300     2,453        490    2,262
3,000        -          280      1,900      1,250     2,502        *      2,138
5,000        -          *        1,600      1,100     2,519        -      2,235
8,000        -          -        1,200      *         2,451        -      2,100
10,000       -          -        *          -         2,200        -      2,200
11,000       -          -        -          -         2,200        -      2,122
12,000       -          -        -          -         970          -      1,958
13,000       -          -        -          -         730          -      1,897
14,000       -          -        -          -         590          -      1,466
15,000       -          -        -          -         532          -      1,281
(* = wide variations seen in results; - = not tested)

Straight off the bat, we see that performance has skyrocketed compared to the forking server. While the changes to the code are only a handful, the jump in performance is many-fold. Up until 2,500 concurrent connections, our pre-forked server handles in excess of 2,100 requests/second. This is more than 5x the performance of our forking server, which manages only around 400 requests/second at comparable concurrencies!

In our next article in the series, we shall look at how threaded (as opposed to process-oriented) servers compare to the server architectures we’ve implemented so far.

Articles in this series

  1. Series Introduction
  2. Part I. Iterative Servers
  3. Part II. Forking Servers
  4. Part III. Pre-forking Servers
  5. Part IV. Threaded Servers
  6. Part V. Pre-threaded Servers
  7. Part VI: poll-based server
  8. Part VII: epoll-based server

Linux Applications Performance: Part II: Forking Servers

This chapter is part of a series of articles on Linux application performance.

In Part I: Iterative Servers, we took a look at a server that deals with one client request at a time. This server called accept() whenever it was done serving one client so that it could accept more client connections and process them one after the other. We know the problem with this approach: if there are many concurrent client connections, they all get queued up waiting for the server to call accept(). In this “forking” server, we create a new child process every time we accept a client request. This child process deals with the client request, while the parent can immediately go back to blocking in accept(). So, while several requests may be in the process of being served by child processes, the parent is always ready to accept more client connections, and clients don’t have to wait for their requests to be accepted while the server is busy with other clients.

While this approach seems great and it looks like we’re all set up for the rest of our lives as legendary Linux programmers, we will see why performance takes a big hit with this server architecture and how to fix it. In the remaining parts of this article series, we will explore more server architectures and measure their performance to understand how one is better or worse than another.

void enter_server_loop(int server_socket)
{
    struct sockaddr_in client_addr;

    socklen_t client_addr_len = sizeof(client_addr);
    while (1)
    {
        int client_socket = accept(
                server_socket,
                (struct sockaddr *)&client_addr,
                &client_addr_len);
        if (client_socket == -1)
            fatal_error("accept()");

        int pid = fork();
        if (pid == 0) {
            /* child process*/
            signal(SIGINT, SIG_DFL);
            connect_to_redis_server(redis_host_ip);
            handle_client(client_socket);
            close(client_socket);
            exit(0);
        }
        else {
            /* parent process */
            close(client_socket);
        }
    }
}

The main change in the forking server, compared to the iterative server, is in the enter_server_loop() function, and the changes are astonishingly simple because of the simplicity of the fork() system call. The other changes relate to zombie process handling, which we describe in a later section here.

I don’t think it would be a great idea to bring how fork() works into the scope of this article series; that’s for another article by itself. It is enough to say for now that fork() creates an exact duplicate of the process and starts executing the statement after fork() in both the parent and the child. Everything is duplicated, including open files and sockets. While fork() returns the PID, or process ID, of the new child in the parent, in the child it returns 0. With this simple but crucial difference, you conditionally do what you need to do in the parent and in the child. In this example, we check the return value of fork() and, if it is not zero, it means the program is executing in the parent process, and we close the client socket. Remember, since everything is duplicated for the child, the child process also has the client socket. In the child, which we detect by checking that the return value of fork() is zero, we simply call handle_client(), passing in client_socket. You can also see that in the child process, once handle_client() returns, we close the client socket and exit, which ends the child process.

To summarize: earlier we called handle_client() directly after accept(); now we call fork(), check that we are in the child process, and call handle_client() there. In the parent, we close client_socket and go back to blocking in accept(), which means new clients don’t have to wait for their requests to be accepted and served by our server.

Pro Tip #1

Technically, in Unix, fork() is a system call. Under Linux, however, when you make the call from your programs, glibc, the main C library most Linux distributions ship with, actually makes the clone() system call. In Linux, clone() is the Swiss army knife of process creation; under the hood, it is the same system call that is used to create threads.

Zombies!

The model where parent processes get real work done with the help of child processes or threads is pretty much the way almost everything works in Linux and in almost all other operating systems. In most cases, the parent also needs to know whether a child successfully did what it was supposed to do. To this end, child processes can pass a single integer value back to the parent. This is the int you see returned from the main() function of Linux programs, the convention being that any process that returns 0 succeeded and any other value indicates things didn’t go all that well. Parent processes collect this return value explicitly, and other things like process accounting (CPU and other resource usage) implicitly, when they call one of the wait() family of system calls. When the children of a process die, they are kept around by Linux in a weird state called the “zombie” state until the parent explicitly calls wait(). Calling wait() is also known as reaping children, since once wait() is called, the operating system can hand the parent the child’s return value and process accounting information and remove the child’s entry from the process table, which has a limited number of slots.

Reaping children

Calling wait() in the parent avoids zombies, but wait() can block, and remember, we want to be blocked on accept() as much as possible, not on wait(). If we are blocked on wait(), we won’t be able to quickly accept and process client connections. Thankfully, there is a way in Linux to reap children without blocking. When a child process terminates, the operating system sends its parent a SIGCHLD signal. Even when you are not handling SIGCHLD, it is still sent to the parent process; nothing happens, since SIGCHLD is ignored by default. You could set up a signal handler for this signal and call wait() in the handler. This collects the child’s return status, clears its process table entry and solves our problem. It does, however, introduce another problem: when the signal handler is done running, back in our program, the system call we were most likely blocked on, accept(), returns -1 and sets the global errno to EINTR, which tells us that accept() was interrupted by a signal. Luckily for us, you can ask the operating system to automatically restart the accept() system call if it is interrupted by the need to run a signal handler.

Apart from the changes to enter_server_loop(), the other main change is to the main() function, where we install the signal handler with the SA_RESTART flag, which tells the operating system to restart any interrupted system calls.

    /* setup the SIGCHLD handler with SA_RESTART */
    struct sigaction sa;
    sa.sa_handler = sigchld_handler;
    sa.sa_flags = SA_RESTART;
    sigfillset(&sa.sa_mask);
    sigaction(SIGCHLD, &sa, NULL);

And here is the signal handler itself:

void sigchld_handler(int signo) {

    /* Let's call wait() for all children so that process accounting gets stats */
    while (1) {
        int wstatus;
        pid_t pid = waitpid(-1, &wstatus, WNOHANG);
        if (pid == 0 || pid == -1) {
            /* 0: live children remain but none have exited; -1: no children left */
            break;
        }
        child_processes++;
    }
}

The signal handler defined in sigchld_handler() is quite simple, really. Remember that earlier we did not want to call wait() since it would block? It turns out that Linux has a variant of the wait() system call named waitpid() that allows parent processes to wait for a child with a particular PID, and it takes an option, WNOHANG, which tells waitpid() not to block if there are no children available to reap; it returns immediately instead. waitpid() also allows us to pass -1 for the process ID, meaning it will wait for any child, not just a child with a particular process ID.

But why are we calling waitpid() in a loop? Because the SIGCHLD handler can be invoked just once as a result of the termination of multiple child processes. So, we call waitpid() in a loop, reaping as many children as possible.

Pro Tip #2

Another way to solve the problem of accept() being interrupted by signals is to wrap the call to accept() in a loop that checks whether errno is EINTR and, if so, calls accept() again. This is a good exercise for a budding Linux programmer. 250BPM has a good article on EINTR which makes for some very good reading. Remember that not all system calls can be automatically restarted after interruption by a signal. See the section “Interruption of system calls and library functions by signal handlers” in the signal(7) man page for a list of system calls that can be automatically restarted. Since accept() is in that list, we were able to get away with SA_RESTART. If you are stuck with system calls that do not support SA_RESTART, however, the retry technique should come in handy.
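A sketch of such a wrapper (an illustration, not code from the ZeroHTTPd repository) might look like this:

#include <errno.h>
#include <sys/socket.h>

/* Call accept() and simply retry if a signal handler interrupted it.
 * This gets us the same effect as SA_RESTART, but works even for
 * system calls that cannot be automatically restarted. */
int accept_retry(int server_socket, struct sockaddr *addr, socklen_t *addrlen) {
    while (1) {
        int client_socket = accept(server_socket, addr, addrlen);
        if (client_socket == -1 && errno == EINTR)
            continue;           /* interrupted by a signal: try again */
        return client_socket;   /* success, or a real error for the caller */
    }
}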

Observing Zombies

It is a good exercise to observe zombies. You can do this by commenting out the sigaction() call in main(), recompiling and running. Once you’ve served some requests, run ps aux in another terminal while ZeroHTTPd is running and you’ll see some zombie processes. In ps‘s STAT column, the value “Z+” indicates that they are zombies. ps also encloses the process name in brackets.

 USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
shuveb 12252 0.0 0.0 4516 796 pts/0 S+ 09:43 0:00 forking
shuveb 12322 0.0 0.0 0 0 pts/0 Z+ 09:44 0:00 [forking]
shuveb 12361 0.0 0.0 0 0 pts/0 Z+ 09:44 0:00 [forking]
shuveb 12364 0.0 0.0 0 0 pts/0 Z+ 09:44 0:00 [forking]
shuveb 12471 0.0 0.0 0 0 pts/0 Z+ 10:01 0:00 [forking]
shuveb 12472 0.0 0.0 0 0 pts/0 Z+ 10:01 0:00 [forking]

Pro Tip #3

Under Unix-like operating systems, unhandled signals are either ignored by default or they terminate the process. We saw earlier that SIGCHLD is ignored by default if the process does not set up a handler for it. SIGINT, the signal the shell sends the foreground process when you hit Ctrl+C on your keyboard, kills the process by default, but your programs can either handle it or set it to be ignored. If you set it to be ignored, as in the following code snippet, Ctrl+C will no longer kill the foreground process:

signal(SIGINT, SIG_IGN);

It is worth mentioning that there are some signals, like SIGKILL that cannot be ignored. The main point of this tip is this: although the SIGCHLD signal is ignored by default, explicitly setting it to be ignored tells Linux to reap child processes automatically on the parent’s behalf without the parent having to call any of the wait() family of system calls. If you are not interested in collecting the return value or gathering resource usage information of the children, this is an easy way out. You don’t have to worry about restarting system calls and such.

signal(SIGCHLD, SIG_IGN);

Please be aware that explicitly telling the operating system to ignore SIGCHLD, although it works on Linux by automatically reaping children, is not very portable across Unix-like operating systems. See this excerpt from the sigaction man page:

POSIX.1-1990 disallowed setting the action for SIGCHLD to SIG_IGN. POSIX.1-2001 and later allow this possibility, so that ignoring SIGCHLD can be used to prevent the creation of zombies (see wait(2)). Nevertheless, the historical BSD and System V behaviors for ignoring SIGCHLD differ, so that the only completely portable method of ensuring that terminated children do not become zombies is to catch the SIGCHLD signal and perform a wait(2) or similar.

Pro Tip #4

Signals are a real pain to deal with, but in the case of reaping children they are pretty much the only way to tackle the problem, especially if the parent is busy doing something else, like accepting new client connections with accept() in our case. Linux has an interesting system call, signalfd(), which makes life slightly better by letting you handle signals in a more synchronous fashion. You should read more about it.
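As a taste, here is a minimal sketch (not part of ZeroHTTPd) of handling SIGCHLD with signalfd(): we block normal delivery of the signal, obtain a descriptor for it and then read signal notifications like data from any other descriptor, which also means the descriptor can be dropped into a poll() or epoll() loop:

#include <sys/signalfd.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

int setup_sigchld_fd(void) {
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGCHLD);
    /* Block asynchronous delivery so SIGCHLD is reported only
     * through the descriptor created below. */
    sigprocmask(SIG_BLOCK, &mask, NULL);
    return signalfd(-1, &mask, 0);
}

void reap_children_from_fd(int sfd) {
    struct signalfd_siginfo si;
    /* One notification can stand for several dead children,
     * so we still reap in a loop with WNOHANG. */
    if (read(sfd, &si, sizeof(si)) == sizeof(si))
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;
}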

Forking Server Performance

Let’s take a look at our performance numbers table.

requests/second
concurrency  iterative  forking          preforked        threaded         prethreaded  poll             epoll
20           7          112              2,100            1,800            2,250        1,900            2,050
50           7          190              2,200            1,700            2,200        2,000            2,000
100          7          245              2,200            1,700            2,200        2,150            2,100
200          7          330              2,300            1,750            2,300        2,200            2,100
300          -          380              2,200            1,800            2,400        2,250            2,150
400          -          410              2,200            1,750            2,600        2,000            2,000
500          -          440              2,300            1,850            2,700        1,900            2,212
600          -          460              2,400            1,800            2,500        1,700            2,519
700          -          460              2,400            1,600            2,490        1,550            2,607
800          -          460              2,400            1,600            2,540        1,400            2,553
900          -          460              2,300            1,600            2,472        1,200            2,567
1,000        -          475              2,300            1,700            2,485        1,150            2,439
1,500        -          490              2,400            1,550            2,620        900              2,479
2,000        -          350              2,400            1,400            2,396        550              2,200
2,500        -          280              2,100            1,300            2,453        490              2,262
3,000        -          280              1,900            1,250            2,502        wide variations  2,138
5,000        -          wide variations  1,600            1,100            2,519        -                2,235
8,000        -          -                1,200            wide variations  2,451        -                2,100
10,000       -          -                wide variations  -                2,200        -                2,200
11,000       -          -                -                -                2,200        -                2,122
12,000       -          -                -                -                970          -                1,958
13,000       -          -                -                -                730          -                1,897
14,000       -          -                -                -                590          -                1,466
15,000       -          -                -                -                532          -                1,281

While the iterative server maxes out at 7 requests/second, the forking server lets us wring out considerably more performance. For our measurements, we run each test at every concurrency level 3 times and average the requests/second. We see the server begin to struggle beyond a concurrency of 1,500, and performance worsens further as concurrency increases from 2,000 to 3,000. At 5,000, each run gives us numbers that vary so widely from one another that the measurement becomes unreliable.

While the forking server is far better than the iterative server, which is stuck at 7 requests/second no matter what the concurrency, choosing a forking design lets you handle many more concurrent users. Also, there are minimal changes required to your program to go from iterative to forking. The design of the fork() system call is powerful and elegant, letting us keep the program structure simple while parceling out work to child processes. If you are feeling really lazy, you can move up from an iterative server to a forking server with minimal effort while still seeing significant benefits.

Articles in this series

  1. Series Introduction
  2. Part I. Iterative Servers
  3. Part II. Forking Servers
  4. Part III. Pre-forking Servers
  5. Part IV. Threaded Servers
  6. Part V. Pre-threaded Servers
  7. Part VI: poll-based server
  8. Part VII: epoll-based server

Linux Applications Performance: Part I: Iterative Servers

This chapter is part of a series of articles on Linux application performance.

The iterative network server is one of the earliest programs you might have written if you ever took a course or read a book on network programming. This type of server is not very useful except for learning network programming, or in rare cases where traffic and concurrency are expected to be low. An iterative server, as its name suggests, serves one client after the other, usually running in a single process. While a client is being served, any other requests that arrive are queued by the operating system; they wait until the server is done with the current client, after which it picks up the next one. An iterative server thus exhibits no concurrency: it serves exactly one client at a time.

ZeroHTTPd workings

Although this chapter/article covers the mechanics of how an iterative server works, it also covers how ZeroHTTPd, a simple web server, works, including how it handles the HTTP protocol in general. The other articles in this series won’t go into these details; they stick to the specific changes we make to the server architecture in order to better support concurrency, since the generic aspects of ZeroHTTPd are covered here. So, if you are interested in how the HTTP handling in ZeroHTTPd works, this is the article in the series you should be reading.

Compiling and running ZeroHTTPd

Clone the ZeroHTTPd Git repo and change into the zerohttpd directory. You can then compile and run ZeroHTTPd: use the following commands to build everything and start the iterative server. You can press Ctrl+C at the terminal to stop the server. It is assumed that you are running a Redis server locally.

➜  zerohttpd git:(master) ✗ make all  
gcc -o iterative 01_iterative/main.c
gcc -o forking 02_forking/main.c
gcc -o preforked 03_preforked/main.c
gcc -o threaded 04_threaded/main.c -lpthread
gcc -o prethreaded 05_prethreaded/main.c -lpthread
gcc -o poll 06_poll/main.c
gcc -o epoll 07_epoll/main.c
➜  zerohttpd git:(master) ✗ ./iterative
Connected to Redis server@ 127.0.0.1:6379
ZeroHTTPd server listening on port 8000

I shall now walk you through the source code for the iterative version of ZeroHTTPd, our super simple web server that will serve as our learning tool. Like any C program, ZeroHTTPd starts life with the main() function, and here it is:

int main(int argc, char *argv[])
{
    int server_port;
    if (argc > 1)
        server_port = atoi(argv[1]);
    else
        server_port = DEFAULT_SERVER_PORT;

    if (argc > 2)
        strcpy(redis_host_ip, argv[2]);
    else
        strcpy(redis_host_ip, REDIS_SERVER_HOST);

    int server_socket = setup_listening_socket(server_port);

    connect_to_redis_server();
    /* Setting the locale and using the %' escape sequence lets us format
     * numbers with commas */
    setlocale(LC_NUMERIC, "");
    printf("ZeroHTTPd server listening on port %d\n", server_port);
    signal(SIGINT, print_stats);
    enter_server_loop(server_socket);
    return (0);
}

To all versions of ZeroHTTPd, you can optionally pass 2 command line arguments: the first is the port on which you want the server to listen and the second is the IP address of the Redis server to which you want ZeroHTTPd to connect. If either of these parameters is not provided, defaults are used: port 8,000 for the server and 127.0.0.1, your local machine, for Redis. Let’s now move on to setup_listening_socket(), which is the first function main() calls.

int setup_listening_socket(int port) {
    int sock;
    struct sockaddr_in srv_addr;

    sock = socket(PF_INET, SOCK_STREAM, 0);
    if (sock == -1)
        fatal_error("socket()");

    int enable = 1;
    if (setsockopt(sock,
                   SOL_SOCKET, SO_REUSEADDR,
                   &enable, sizeof(int)) < 0)
        fatal_error("setsockopt(SO_REUSEADDR)");

    bzero(&srv_addr, sizeof(srv_addr));
    srv_addr.sin_family = AF_INET;
    srv_addr.sin_port = htons(port);
    srv_addr.sin_addr.s_addr = htonl(INADDR_ANY);

    /* We bind to a port and turn this socket into a listening
     * socket.
     * */
    if (bind(sock,
             (const struct sockaddr *)&srv_addr,
             sizeof(srv_addr)) < 0)
        fatal_error("bind()");

    if (listen(sock, 10) < 0)
        fatal_error("listen()");

    return (sock);
}

The call to socket() creates the socket, which is like a pipe from which the server can read and write. We specify the protocol family and type. The setsockopt() call that follows tells the operating system it is OK to reuse this socket’s address. Remember, we are binding to port 8,000. Without this option, if you ran the server, stopped it and then restarted it immediately after having served at least one client, it wouldn’t bind back to port 8,000, since client connections linger in the TIME_WAIT state while the OS waits for any potential leftover data to be transferred. This prevents quick restarts. You can try commenting out the call to setsockopt() and its related fatal_error() call, then recompile and run. Now, open a browser tab and point it to http://localhost:8000. Close the tab, stop the server by hitting Ctrl+C and restart it again. You should see:

bind(): Address already in use

This is the OS waiting while the client (browser) connection you just established times out. You can check out sockets in the TIME_WAIT state using the netstat utility. I leave that as an exercise for you, the reader. Setting the SO_REUSEADDR option on the socket lets the operating system reuse the address and bind(), which we will discuss next, will succeed.

Next, in the srv_addr structure, we describe the socket’s address. Every socket address has two associated pieces of information: the address and the port. Think of the address as a large locker panel and the port as a particular locker; the server is the locker panel. If the machine has more than one network interface, each with its own unique address (a network card is a network interface, for example), you can specify which particular address you want to bind to. In our case, by specifying INADDR_ANY, we bind to all available interfaces. localhost, or 127.0.0.1, is a special address that refers to the local interface, that is, the same host on which the program is running. You might already know this.

The bind() system call then binds the identity we just created (the address and the port) to the socket. This gives the socket an address, which allows clients to reach it. It is the bind() and listen() system calls that turn a program into a network server.

The listen() system call converts this socket into a listening socket. The queue length, which determines the number of clients that can wait until their connection is accepted, is a parameter of the listen() system call. While a large number can be specified, most operating systems will silently truncate it to some internal limit. On Linux, this limit defaults to 128 and you can change it by writing a new value to /proc/sys/net/core/somaxconn.

Our server socket is now setup and is listening on all network interfaces on the machine on which it is running on port 8,000.

void enter_server_loop(int server_socket)
{
    struct sockaddr_in client_addr;
    socklen_t client_addr_len = sizeof(client_addr);
    while (1)
    {
        int client_socket = accept(
                server_socket,
                (struct sockaddr *)&client_addr,
                &client_addr_len);
        if (client_socket == -1)
            fatal_error("accept()");

        handle_client(client_socket);
        close(client_socket);
    }
}

Let us now turn our attention to the enter_server_loop() function. The main action is in the while(1) loop. When accept() is called on a listening socket, it blocks until a client connects. Whenever a client connects to our server, accept() returns the client socket, which we can use with the read() and write() system calls to read from and write to the client. We pass this client socket, returned by accept(), to the handle_client() function, which handles the HTTP protocol with the help of other functions, as we will see in the following sections. When I first started learning network programming years ago, I thought that if you treated the networking-related system calls like English verbs, listen() should have been the one that really blocks. But all it does is convert a socket into a listening socket. It is the accept() call that actually blocks, waiting to accept connections from clients. So, there’s that.

To understand how HTTP works, it is best to take a look at a session in action. It is quite simple, really. Here, I’ve run the curl command with the -v switch which turns on verbose mode so that you can see what transpires on the wire.

➜  ~ curl -v http://localhost:8000/
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET / HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.58.0
> Accept: */*
> 
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Server: zerohttpd/0.1
< Content-Type: text/html
< content-length: 611
< 
<!DOCTYPE html>
<html lang="en">

<head>
    <title>Welcome to ZeroHTTPd</title>
    <style>
        body {
            font-family: sans-serif;
            margin: 15px;
            text-align: center;
        }
    </style>
</head>

<body>
    <h1>It works! (kinda)</h1>
    <img src="tux.png" alt="Tux, the Linux mascot">
    <p>It is sure great to get out of that socket!</p>
    <p>ZeroHTTPd is a ridiculously simple (and incomplete) web server written for learning about Linux performance.</p>
    <p>Learn more about Linux performance at <a href="https://unixism.net">unixism.net</a></p>
</body>

* Closing connection 0
</html>%

The lines starting with a * are “comments” by curl. The lines starting with a > are part of the request curl sent to our server and the lines starting with < are part of the response it gets back from our server, ZeroHTTPd. Finally, we see the content sent back, which is from the index.html file. Since this is just text and is human readable, you can see that it is all quite simple.

An HTTP request consists of multiple lines. Each line is called an HTTP header. Users sitting at browsers want to see content transferred by HTTP, and that is the primary purpose of the protocol. The underlying conversation “spoken” between the client and the server happens mostly via these HTTP “headers”. The main part of the request, as far as ZeroHTTPd is concerned, is the first line, which has the HTTP method and the name of the resource the client wants. This resource could be a static file or some dynamic content. You can see that the very first line of the request is GET / HTTP/1.1. In ZeroHTTPd, we cheat: we only care about this first line and we read and discard all the other headers. These headers often carry very important information and no serious web server worth its salt should ignore them like we do. Each header ends with a \r\n (carriage return and line feed), and the server knows the client is done sending headers when it reads an empty line containing just \r\n by itself.

void handle_client(int client_socket)
{
    char line_buffer[1024];
    char method_buffer[1024];
    int method_line = 0;

    while (1)
    {
        get_line(client_socket, line_buffer, sizeof(line_buffer));
        method_line++;

        unsigned long len = strlen(line_buffer);

        /*
         The first line has the HTTP method/verb. It's the only
         thing we care about. We read the rest of the lines and
         throw them away.
         */
        if (method_line == 1)
        {
            if (len == 0)
                return;

            strcpy(method_buffer, line_buffer);
        }
        else
        {
            if (len == 0)
                break;
        }
    }

    handle_http_method(method_buffer, client_socket);
}

We call the get_line() function, which reads the client socket character by character until it sees the \r\n sequence. This is a particularly inefficient way to read a socket, but in the interest of simplicity, we will forgive ourselves, something we wouldn’t do in real life. A better way would be to read the request in chunks and then parse it, like real web servers do. But as I said, ZeroHTTPd’s goals are different and we want to keep things simple. Each line is read into the array line_buffer. When we know it is the first line, which is expected to have the GET or other HTTP method and the request details, we save that header in the method_buffer array. The reason we’ve named the buffer “method_buffer” is that HTTP defines various “verbs” (GET, POST, DELETE, etc.) which tell the server what the client wants it to do. It is enough for now to know that the GET method means the client wants to fetch some resource from the server. We then continue to read the remaining lines, one header each, and throw them away. We break out of the while (1) loop when we read an empty line, which means the client is done sending request headers. We then call handle_http_method(), passing it method_buffer, which holds the part of the request we are interested in: the request verb and the resource path.
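get_line() itself is a small helper in the ZeroHTTPd source; purely as an illustration of the character-by-character approach described above, a simplified version might look like this (the real one may differ in details):

#include <unistd.h>

/* Read one header line from the socket into buf, one character at a
 * time, stopping at the "\r\n" sequence. The terminator is not stored
 * and the result is always NUL-terminated. A simplified sketch. */
void get_line(int sock, char *buf, int size) {
    int i = 0;
    char c;

    while (i < size - 1) {
        if (read(sock, &c, 1) <= 0)
            break;              /* connection closed or read error */
        if (c == '\r')
            continue;           /* skip; the '\n' that follows ends the line */
        if (c == '\n')
            break;              /* end of this header line */
        buf[i++] = c;
    }
    buf[i] = '\0';
}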

void handle_http_method(char *method_buffer, int client_socket)
{
    char *method, *path;

    method = strtok(method_buffer, " ");
    strtolower(method);
    path = strtok(NULL, " ");

    if (strcmp(method, "get") == 0)
    {
        handle_get_method(path, client_socket);
    }
    else if (strcmp(method, "post") == 0) {

        handle_post_method(path, client_socket);
    }
    else
    {
        handle_unimplemented_method(client_socket);
    }
}

The handle_http_method() function breaks the line up into tokens, that is, the words separated by spaces. We compare the first word to “get” after converting it to lower case. This is the HTTP method we talked about previously. Although HTTP has several methods or verbs like GET, POST, DELETE, etc., ZeroHTTPd deals only with GET and POST and nothing else. If we don’t see a GET or a POST, we tell the client we can’t handle that method. If we do see a GET or a POST, we call handle_get_method() or handle_post_method() respectively, with the path specified in the header. In the curl example discussed above, the path, if you recall, was /, which is the index.
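The strtolower() helper used above is part of the ZeroHTTPd source and is not reproduced in this article; something along these lines, which lower-cases a NUL-terminated string in place, would do the job:

#include <ctype.h>

void strtolower(char *str) {
    for (; *str; str++)
        *str = (char) tolower((unsigned char) *str);
}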

void handle_get_method(char *path, int client_socket)
{
    char final_path[1024];

    if (handle_app_get_routes(path, client_socket) == METHOD_HANDLED)
        return;

    /*
     If a path ends in a trailing slash, the client probably wants the index
     file inside of that directory.
     */
    if (path[strlen(path) - 1] == '/')
    {
        strcpy(final_path, "public");
        strcat(final_path, path);
        strcat(final_path, "index.html");
    }
    else
    {
        strcpy(final_path, "public");
        strcat(final_path, path);
    }

    /* The stat() system call will give you information about the file
     * like type (regular file, directory, etc), size, etc. */
    struct stat path_stat;
    if (stat(final_path, &path_stat) == -1)
    {
        printf("404 Not Found: %s\n", final_path);
        handle_http_404(client_socket);
    }
    else
    {
        /* Check if this is a normal/regular file and not a directory or something else */
        if (S_ISREG(path_stat.st_mode))
        {
            send_headers(final_path, path_stat.st_size, client_socket);
            transfer_file_contents(final_path, client_socket, path_stat.st_size);
            printf("200 %s %ld bytes\n", final_path, path_stat.st_size);
        }
        else
        {
            handle_http_404(client_socket);
            printf("404 Not Found: %s\n", final_path);
        }
    }
}

The first check is to see if the path component of the URL points to an “app” in ZeroHTTPd. For example, the Guestbook app is part of ZeroHTTPd. We call handle_app_get_routes() at the beginning and if that function returns METHOD_HANDLED, it means a response was already sent to the client. There is no need to carry on further with the request, so we promptly return. Otherwise, we continue on to see if there is a static file we can serve. We discuss handle_app_get_routes() a bit later.

For ZeroHTTPd, all the files it serves live inside the public folder. That folder in turn can contain other sub-folder hierarchies. If the request path ends in a /, ZeroHTTPd looks for an index.html file inside that path. If no file is found, an HTTP 404 is sent to the client along with a simple message. In HTTP, a status code of 200 means everything went well. In our curl example, you saw that we requested the path /; in this case, ZeroHTTPd serves up the file public/index.html. Once the path is constructed, we use the stat() system call to check whether the file exists and to retrieve its size, which we will use in the content-length header we send as part of the response. From the same system call, we can also tell whether the specified path is a regular file; the index.html check for paths ending in / happened earlier, while constructing the path, and anything that is not a regular file also gets a 404. Since there is no way to “view” a directory, some web servers show a list of the files in the requested directory; you could take that up as an exercise. If we don’t find the requested file, we return an HTTP 404 status.

When we do find a file, we call send_headers() and then transfer_file_contents(). Finally, we print a log message. The send_headers() function is fairly simple to understand.

void send_headers(const char *path, off_t len, int client_socket) {
    char small_case_path[1024];
    char send_buffer[1024];
    strcpy(small_case_path, path);
    strtolower(small_case_path);

    strcpy(send_buffer, "HTTP/1.0 200 OK\r\n");
    send(client_socket, send_buffer, strlen(send_buffer), 0);
    strcpy(send_buffer, SERVER_STRING);
    send(client_socket, send_buffer, strlen(send_buffer), 0);

    /*
     * Check the file extension for certain common types of files
     * on web pages and send the appropriate content-type header.
     * Since extensions can be mixed case like JPG, jpg or Jpg,
     * we turn the extension into lower case before checking.
     * */
    const char *file_ext = get_filename_ext(small_case_path);
    if (strcmp("jpg", file_ext) == 0)
        strcpy(send_buffer, "Content-Type: image/jpeg\r\n");
    if (strcmp("jpeg", file_ext) == 0)
        strcpy(send_buffer, "Content-Type: image/jpeg\r\n");
    if (strcmp("png", file_ext) == 0)
        strcpy(send_buffer, "Content-Type: image/png\r\n");
    if (strcmp("gif", file_ext) == 0)
        strcpy(send_buffer, "Content-Type: image/gif\r\n");
    if (strcmp("htm", file_ext) == 0)
        strcpy(send_buffer, "Content-Type: text/html\r\n");
    if (strcmp("html", file_ext) == 0)
        strcpy(send_buffer, "Content-Type: text/html\r\n");
    if (strcmp("js", file_ext) == 0)
        strcpy(send_buffer, "Content-Type: application/javascript\r\n");
    if (strcmp("css", file_ext) == 0)
        strcpy(send_buffer, "Content-Type: text/css\r\n");
    if (strcmp("txt", file_ext) == 0)
        strcpy(send_buffer, "Content-Type: text/plain\r\n");
    send(client_socket, send_buffer, strlen(send_buffer), 0);

    /* Send the content-length header, which is the file size in this case. */
    sprintf(send_buffer, "content-length: %ld\r\n", len);
    send(client_socket, send_buffer, strlen(send_buffer), 0);

    /*
     * When the browser sees a '\r\n' sequence in a line on its own,
     * it understands there are no more headers. Content may follow.
     * */
    strcpy(send_buffer, "\r\n");
    send(client_socket, send_buffer, strlen(send_buffer), 0);
}

We first send HTTP/1.0 200 OK, which tells the client that everything went well and the resource was found. Then we send a server string header, which identifies the type of server. Next, based on the file extension, we send the content-type header, followed by the content-length header with the length we got from the stat() system call. Finally, we send \r\n on a line by itself, signaling the client that we are done sending headers and the content will now follow. On the same socket, we then send the file contents with the transfer_file_contents() function.

void transfer_file_contents(char *file_path, int client_socket, off_t file_size)
{
    int fd;

    fd = open(file_path, O_RDONLY);
    sendfile(client_socket, fd, NULL, file_size);
    close(fd);
}

In general, to transfer or copy a file, we read chunks of the source into a buffer, write each chunk to the destination file or socket and repeat until the source is completely read. This is an extremely common operation. However, reading a file means copying data into user space and writing it out means copying that data from user space back into the kernel. These copies are expensive, and the repeated read() and write() calls mean multiple context switches as well. Because this is such a common operation, the kernel developers came up with the sendfile() system call, which takes two file descriptors, an offset and a length, and does the whole copy right in the kernel, avoiding both the round trips through a user-space buffer and the extra system calls. We use sendfile() to immensely simplify transferring data to the client socket.
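For contrast, here is roughly what the conventional user-space copy loop that sendfile() replaces looks like (a sketch for illustration, not code from ZeroHTTPd):

#include <unistd.h>

/* Copy a file to a socket the traditional way: every chunk is read
 * into a user-space buffer and then written back out, costing two
 * copies and at least two system calls per chunk. */
void transfer_with_read_write(int fd, int client_socket) {
    char buffer[4096];
    ssize_t bytes_read;

    while ((bytes_read = read(fd, buffer, sizeof(buffer))) > 0) {
        ssize_t written = 0;
        while (written < bytes_read) {
            ssize_t n = write(client_socket, buffer + written,
                              bytes_read - written);
            if (n <= 0)
                return;         /* write error: give up on this transfer */
            written += n;
        }
    }
}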

This ends the processing of the client request. We now close the client socket in the enter_server_loop() function right after handle_client() returns.

While we were processing the client request, that is, parsing the HTTP headers to get the path and, if everything went well, using sendfile() to transfer the requested file over the client socket, any other client connections that arrived were queued, waiting for accept() to retrieve them. This is why our server is called an iterative server: it handles clients one after the other and cannot serve clients in parallel. In real life, though, any popular service will see more than one user concurrently, and if each request takes time to serve, the wait gets even longer for clients, considering that data also takes time to travel between their browser and the server over the internet. Imagine connecting to a popular website and having to wait a couple of minutes for the page to load. It is easy to see whether clients have to wait for the server to process their request: if the server is not designed so that it can remain blocked on accept() while it serves requests from other clients, client requests will queue up. This is the main problem with the iterative server.

Also, if you open the ZeroHTTPd index page in your browser by going to http://localhost:8000, you can see that we have an image there as well. Once the HTML file is transferred to the browser, the browser parses it and figures out which other files referenced in the HTML it needs to load from the server. These requests also hit our server and are processed one after the other. When you are the only person connected, the server will certainly feel fast, but imagine hundreds of clients connected to it; now that you know and understand how the server is structured, you should be able to appreciate why this architecture will not perform well at all.

The Guestbook App

While we’ve seen how ZeroHTTPd serves static files, we will now see how it serves the Guestbook “app”, which is really an HTML page with some dynamic content. Take a look at the annotated screenshot below:

The labels “1” and “2” mark the two areas with dynamic content in the Guestbook app

The resulting HTML page you see in the screenshot above is constructed from what we call a “template”. That template is really a regular HTML page, which you can look at in templates/guestbook/index.html. While more sophisticated web frameworks allow you to use specialized language snippets inside view templates, in ZeroHTTPd we place special strings we call variables at the locations where we want to insert dynamic content, build HTML snippets for that dynamic content at runtime and replace the variables with those snippets. For example, in the screenshot above, each guestbook entry is an HTML <p> tag. The section labeled “1” contains guestbook entries: once we have the data from Redis, we create as many <p> tags as there are entries and replace the variable $GUEST_REMARKS$ with the tags we generate. As for the visitor count at the bottom of the page, labeled “2”, we replace the variable $VISITOR_COUNT$ in the template with the value of the key we get from Redis. We will look at the function render_guestbook_template() in more detail below.

Routing

In many web frameworks, you get a lot of help calling specific pieces of your code in response to the path in the URL. This functionality is called routing: popular web frameworks route the request to the right request-handling code based on the request path. In ZeroHTTPd, this is done by a simple function, handle_app_get_routes(). This function gets the path component of the URL as an argument and checks whether it is a path we need to handle specially. Currently, the only check it performs is whether the path matches /guestbook, in which case it calls the render_guestbook_template() function to do the real work. It then returns METHOD_HANDLED to signal that it took care of responding to the client, so that the calling function, in this case handle_get_method(), refrains from sending any more responses.

If you were to extend ZeroHTTPd to handle more dynamic content, you would begin by checking for the path of your interest in handle_app_get_routes() and calling the appropriate handler function, as in the sketch below.
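To make the shape of the routing function concrete, here is a sketch of what handle_app_get_routes() could look like; it assumes the surrounding ZeroHTTPd definitions, and METHOD_NOT_HANDLED is just a stand-in name here for the “nothing matched” return value (the actual function is in the ZeroHTTPd source):

/* Check whether the requested path belongs to a dynamic "app" route.
 * Returns METHOD_HANDLED if a response has already been sent to the
 * client, METHOD_NOT_HANDLED (a stand-in name) otherwise, so the
 * caller can fall back to serving static files. */
int handle_app_get_routes(char *path, int client_socket) {
    if (strcmp(path, "/guestbook") == 0) {
        render_guestbook_template(client_socket);
        return METHOD_HANDLED;
    }
    /* Add checks for new dynamic pages here. */
    return METHOD_NOT_HANDLED;
}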

Rendering the Guestbook template

As described previously, the render_guestbook_template() function takes care of getting the information from Redis, replacing the variables in the template file with the dynamic content created and writing the result to the client socket.

int render_guestbook_template(int client_socket) {
    char templ[16384]="";
    char rendering[16384]="";

    /* Read the template file */
    int fd = open(GUESTBOOK_TEMPLATE, O_RDONLY);
    if (fd == -1)
        fatal_error("Template read()");
    read(fd, templ, sizeof(templ));
    close(fd);

    /* Get guestbook entries and render them as HTML */
    int entries_count;
    char **guest_entries;
    char guest_entries_html[16384]="";
    redis_get_list(GUESTBOOK_REDIS_REMARKS_KEY, &guest_entries, &entries_count);
    for (int i = 0; i < entries_count; i++) {
        char guest_entry[1024];
        sprintf(guest_entry, "<p class=\"guest-entry\">%s</p>", guest_entries[i]);
        strcat(guest_entries_html, guest_entry);
    }
    redis_free_array_result(guest_entries, entries_count);

    /* In Redis, increment visitor count and fetch latest value */
    int visitor_count;
    char visitor_count_str[16]="";
    redis_incr(GUESTBOOK_REDIS_VISITOR_KEY);
    redis_get_int_key(GUESTBOOK_REDIS_VISITOR_KEY, &visitor_count);
    sprintf(visitor_count_str, "%'d", visitor_count);

    /* Replace guestbook entries */
    char *entries = strstr(templ, GUESTBOOK_TMPL_REMARKS);
    if (entries) {
        memcpy(rendering, templ, entries-templ);
        strcat(rendering, guest_entries_html);
        char *copy_offset = templ + (entries-templ) + strlen(GUESTBOOK_TMPL_REMARKS);
        strcat(rendering, copy_offset);
        strcpy(templ, rendering);
        bzero(rendering, sizeof(rendering));
    }

    /* Replace visitor count */
    char *vcount = strstr(templ, GUESTBOOK_TMPL_VISITOR);
    if (vcount) {
        memcpy(rendering, templ, vcount-templ);
        strcat(rendering, visitor_count_str);
        char *copy_offset = templ + (vcount-templ) + strlen(GUESTBOOK_TMPL_VISITOR);
        strcat(rendering, copy_offset);
        strcpy(templ, rendering);
        bzero(rendering, sizeof(rendering));
    }

    /*
     * We've now rendered the template. Send headers and finally the
     * template over to the client.
     * */
    char send_buffer[1024];
    strcpy(send_buffer, "HTTP/1.0 200 OK\r\n");
    send(client_socket, send_buffer, strlen(send_buffer), 0);
    strcpy(send_buffer, SERVER_STRING);
    send(client_socket, send_buffer, strlen(send_buffer), 0);
    strcpy(send_buffer, "Content-Type: text/html\r\n");
    send(client_socket, send_buffer, strlen(send_buffer), 0);
    sprintf(send_buffer, "content-length: %ld\r\n", strlen(templ));
    send(client_socket, send_buffer, strlen(send_buffer), 0);
    strcpy(send_buffer, "\r\n");
    send(client_socket, send_buffer, strlen(send_buffer), 0);
    send(client_socket, templ, strlen(templ), 0);
    printf("200 GET /guestbook %ld bytes\n", strlen(templ));
}

We start by opening and reading the template file. We then call redis_get_list() to fetch the guestbook entries, which are stored in a Redis list. Once we have that list as an array, the loop that follows builds the HTML snippet holding the <p> tags that make up the guestbook entries, after which redis_free_array_result(), a utility function, frees all the memory allocated by redis_get_list(). Next, redis_incr() increments the Redis key that tracks the visitor count and redis_get_int_key() fetches that key back and converts it to an integer value. Now that we have the bits of information we need, the two strstr()-based blocks replace the variable strings in the template with the dynamic content built from the Redis data. Finally, we send the headers and then the rendered template to the client.

Done!

In the next part, we will look at serving each client with a child process of its own, while the parent process immediately goes back to blocking on accept(), which means that new client connections can be accepted and served without having to wait.

Articles in this series

  1. Series Introduction
  2. Part I. Iterative Servers
  3. Part II. Forking Servers
  4. Part III. Pre-forking Servers
  5. Part IV. Threaded Servers
  6. Part V. Pre-threaded Servers
  7. Part VI: poll-based server
  8. Part VII: epoll-based server

Linux Applications Performance: Introduction

Articles in this series

  1. Part I. Iterative Servers
  2. Part II. Forking Servers
  3. Part III. Pre-forking Servers
  4. Part IV. Threaded Servers
  5. Part V. Pre-threaded Servers
  6. Part VI: poll-based server
  7. Part VII: epoll-based server

On HackerNews

There are several interesting takeaways from the HackerNews thread for this article series. Do check it out.

Web apps are the staple of consumers and enterprises. Among the many protocols used to move and make sense of bits, HTTP has an overwhelming mind share. As you encounter and learn the nuances of web application development, most of you might pay very little attention to the operating system that finally runs your applications. The separation of Dev and Ops only made this worse. But with DevOps culture becoming commonplace and developers becoming responsible for running their apps in the cloud, it is a clear advantage to better understand the backend operating system’s nitty-gritty. You don’t really have to bother with Linux and how your backend will scale if you are deploying your backend for personal use or for a few concurrent users. If you are trying to deploy for thousands or tens of thousands of concurrent users, however, having a good understanding of how the operating system figures in your whole stack will be incredibly useful.

The constraints we are up against in the web services we write are very similar to those faced by the other applications needed to make a web service work, be they load balancers or database servers. All these classes of applications have similar challenges in high-performance environments. Understanding these fundamental constraints and how to work around them will, in general, make you appreciate what the performance and scalability of your web applications or services are up against.

I am writing this series in response to the questions I get asked by young developers who want to become well-informed system architects. Without diving into the basics of what makes Linux applications tick and the different ways of structuring Linux or Unix network applications, it is not possible to gain a thorough and clear understanding of Linux application performance. While there are many types of Linux applications, what I want to explore here are networking-oriented applications, as opposed to, say, desktop applications like a browser or a text editor. This is because the audience for this series is web service/application developers and architects who want to understand how Linux or Unix applications work and how to structure these services for high performance.

Linux is the server operating system and, more often than not, your applications end up running on Linux. Although I say Linux, most of the time you can safely assume I also include other Unix-like operating systems in general. However, I haven’t extensively tested the accompanying code on other Unix-like systems. So, if you are interested in FreeBSD or OpenBSD, your mileage may vary. Where I attempt anything Linux-specific, I’ve done my best to point it out.

While you can certainly use this knowledge to choose the best possible structure for a new network application you want to write from scratch, you probably won’t be firing up your favorite text editor and writing a web server in C or C++ to deliver the next business app in your organization. That might be a guaranteed way to get yourself fired. Having said that, knowing these application structures will help you tremendously in choosing among existing applications, since you will know how they are structured. After reading this article series, you will be able to appreciate process-based vs. thread-based vs. event-based systems. You will understand and appreciate why Nginx performs better than Apache httpd, or why a Tornado-based Python application might be able to serve more concurrent users than a Django-based Python application.

ZeroHTTPd: A learning tool

ZeroHTTPd is a web server I wrote from scratch in C as a teaching tool. It has no external dependencies, including for Redis access. We roll our own Redis routines–read more below.

While we could talk a whole lot of theory, there is nothing like writing code, running it and benchmarking it to compare each of the server architectures we evolve. This should cement your understanding like no other method. To this end, we will develop a simple web server called ZeroHTTPd using process-based, thread-based and event-based models. We will benchmark each of these servers and see how they perform relative to one another. ZeroHTTPd is a simple-as-possible HTTP server written from scratch in pure C with no external library dependencies, implemented in a single C file. For the event-based servers, I include uthash, an excellent hash table implementation, which comes in a single header file. Otherwise, there are no dependencies; this is to keep things simple.

I’ve heavily commented the code to aid understanding. Apart from being a simple web server written in a couple of hundred lines of C, ZeroHTTPd is also a bare-minimal web development framework. It doesn’t do a lot, but it can serve static files and very simple “dynamic” pages. Having said this, ZeroHTTPd is well suited to understanding how to architect your Linux applications for high performance. At the end of the day, most web services wait around for requests, look at what each request is and process it. This is exactly what we will be doing with ZeroHTTPd as well. It is a learning tool, not something you’ll use in production. It is also not going to win awards for error handling, security best practices (oh yes, I’ve used strcpy) or for clever tricks and shortcuts of the C language, of which there are several. But, it’ll hopefully serve its purpose well (pun unintended).

Here we see ZeroHTTPd’s index page. It can serve various file types, including images.

The Guestbook App

Modern web applications hardly serve just static files. They have complex interactions with various databases, caches, etc. To that end, we build a simple web app named “Guestbook” that lets guests lovingly leave their name and remarks. Guestbook also lists the remarks previously left by various guests. There is also a visitor counter towards the bottom of the page.

The ZeroHTTPd “Guestbook” web app

We store the visitor counter and the guestbook entries in Redis. To talk to Redis, we do not depend on an external library; we have our own custom C routines. I’m not a big fan of rolling your own when something well tested is already available, but the goal of ZeroHTTPd is to teach Linux performance, and accessing external services while in the middle of serving an HTTP request has huge implications for performance. We need to be in full control of the way we talk to Redis in each of the server architectures we are building: in some architectures we use blocking calls, in others event-based routines. Using an external Redis client library would not allow us this control. Also, we implement our own Redis client only to the extent that we use Redis (getting, setting and incrementing a key; getting and appending to a list). Moreover, the Redis protocol is super elegant and simple, something worth learning deliberately. The very fact that you can implement a super-fast protocol that does its job in about 100 lines of code says a lot about how well thought out it is.
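To give you a flavor of how simple the protocol is, here is a hedged sketch (not the actual ZeroHTTPd routine) of sending an INCR command and parsing the integer reply. Commands travel as an array of length-prefixed “bulk strings” and an integer reply comes back as ":<number>\r\n"; the function and parameter names below are made up for illustration and error handling is kept to a minimum:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

long redis_incr_sketch(int redis_socket, const char *key) {
    char cmd[256], reply[64];

    /* "*2" says the command has two parts; each part is sent as
     * "$<length>\r\n<bytes>\r\n". */
    int len = snprintf(cmd, sizeof(cmd),
                       "*2\r\n$4\r\nINCR\r\n$%zu\r\n%s\r\n",
                       strlen(key), key);
    send(redis_socket, cmd, len, 0);

    ssize_t n = recv(redis_socket, reply, sizeof(reply) - 1, 0);
    if (n <= 0 || reply[0] != ':')
        return -1;              /* unexpected reply for this simple sketch */
    reply[n] = '\0';
    return strtol(reply + 1, NULL, 10);    /* the new counter value */
}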

The following figure illustrates the steps we follow in order to get the HTML ready to serve when a client (browser) requests the /guestbook URL.

Guestbook app flow

When a Guestbook page needs to be served, there is one file-system call to read the template into memory and three network calls to Redis. The template file has most of the HTML content that makes up the Guestbook page you see in the screenshot above. It also has special placeholders marking where the dynamic content that comes from Redis, the guest remarks and the visitor counter, should go. We fetch these from Redis, substitute them for the placeholders in the template and finally write the fully formed content out to the client. The third call to Redis could have been avoided, since Redis returns the new value of an incremented key. However, for our purposes, as we move our server to asynchronous, event-based architectures, having a server that is busy blocking on a bunch of network calls is a good way to learn about things. So, we discard the value Redis returns when we increment the visitor count and read it back in a separate call.

ZeroHTTPd Server Architectures

We will build ZeroHTTPd, retaining the same functionality, using 7 different architectures:

  • Iterative
  • Forking (one child process per request)
  • Pre-forked server (pre-forked processes)
  • Threaded (one thread per request)
  • Pre-threaded (threads pre-created)
  • poll()-based
  • epoll based

We shall also measure the performance of each architecture by loading each one with 10,000 HTTP requests. As we move on to comparing architectures that can handle a lot more concurrency, we will switch to testing with 30,000 requests. We test 3 times and take the average.

Testing Methodology

The ZeroHTTPd load testing setup

It is important that these tests not be run with all components on the same machine. If they are, the operating system has the extra overhead of scheduling between all those components as they vie for CPU. Measuring operating system overhead under each of the chosen server architectures is one of the most important aims of this exercise, and adding more variables would be detrimental. Hence, the setup described in the illustration above works best.

Here is what each of these servers do:

  • load.unixism.net: This is where we run ab, the Apache Benchmark utility, which generates the load we need to test our server architectures.
  • nginx.unixism.net: At times we may want to run more than one instance of our server program. So, we use a suitably configured Nginx server as a load balancer to spread the load coming in from ab on to our server processes.
  • zerohttpd.unixism.net: This is where we run our server programs, which are based on the 7 different architectures listed above, one architecture at a time.
  • redis.unixism.net: This server runs the Redis daemon which stores the guest remarks and the visitor counter.

All servers have a single CPU core. The idea is to see how much performance we can wring out of it with each of our server architectures. Since all of our server programs are measured against the same hardware, it acts as the baseline against which we measure the relative performance of each architecture. My testing setup consisted of virtual servers rented from Digital Ocean.

What are we measuring?

There are many things we can measure. However, given a certain amount of compute resources, we want to see how much performance we can squeeze out of each architecture at various levels of increasing concurrency. We test with up to 15,000 concurrent users.

Test Results

The following chart shows how servers employing the different process architectures perform when subjected to various concurrency levels. On the y-axis we have requests/second and on the x-axis we have concurrent connections.

Here is a table with the numbers laid out:

requests/second
concurrency  iterative  forking          preforked        threaded         prethreaded  poll             epoll
20           7          112              2,100            1,800            2,250        1,900            2,050
50           7          190              2,200            1,700            2,200        2,000            2,000
100          7          245              2,200            1,700            2,200        2,150            2,100
200          7          330              2,300            1,750            2,300        2,200            2,100
300          -          380              2,200            1,800            2,400        2,250            2,150
400          -          410              2,200            1,750            2,600        2,000            2,000
500          -          440              2,300            1,850            2,700        1,900            2,212
600          -          460              2,400            1,800            2,500        1,700            2,519
700          -          460              2,400            1,600            2,490        1,550            2,607
800          -          460              2,400            1,600            2,540        1,400            2,553
900          -          460              2,300            1,600            2,472        1,200            2,567
1,000        -          475              2,300            1,700            2,485        1,150            2,439
1,500        -          490              2,400            1,550            2,620        900              2,479
2,000        -          350              2,400            1,400            2,396        550              2,200
2,500        -          280              2,100            1,300            2,453        490              2,262
3,000        -          280              1,900            1,250            2,502        wide variations  2,138
5,000        -          wide variations  1,600            1,100            2,519        -                2,235
8,000        -          -                1,200            wide variations  2,451        -                2,100
10,000       -          -                wide variations  -                2,200        -                2,200
11,000       -          -                -                -                2,200        -                2,122
12,000       -          -                -                -                970          -                1,958
13,000       -          -                -                -                730          -                1,897
14,000       -          -                -                -                590          -                1,466
15,000       -          -                -                -                532          -                1,281

You can see from the chart and table above that beyond 8,000 concurrent requests we only have 2 contenders: pre-threaded and epoll. In fact, our poll-based server fares worse than the threaded server, which comfortably beats it even at the same concurrency levels. The pre-threaded server giving the epoll-based server a good run for its money is a testament to how well the Linux kernel handles scheduling for a very large number of threads.

ZeroHTTPd Source Code Layout

You can find the source code for ZeroHTTPd here. Each server architecture gets its own directory.

ZeroHTTPd
│
├── 01_iterative
│   ├── main.c
├── 02_forking
│   ├── main.c
├── 03_preforking
│   ├── main.c
├── 04_threading
│   ├── main.c
├── 05_prethreading
│   ├── main.c
├── 06_poll
│   ├── main.c
├── 07_epoll
│    └── main.c
├── Makefile
├── public
│   ├── index.html
│   └── tux.png
└── templates
    └── guestbook
        └── index.html

In the top-level directory, apart from the 7 folders that hold the code for ZeroHTTPd in the 7 different architectures we discuss, there are 2 other directories: “public” and “templates”. The “public” directory contains an index file and the image you see in the screenshot. You can put other files and folders in here and ZeroHTTPd should serve those static files without problems. When the path component entered in the browser matches a directory inside the “public” folder, ZeroHTTPd looks for an “index.html” file in that directory before giving up. Our Guestbook app, which is accessed by going to the path /guestbook, is a dynamic app, meaning its content is dynamically generated. It has only one main page, whose content is based on the file “templates/guestbook/index.html”. It is easy to add more dynamic pages to ZeroHTTPd and extend it: users can add more templates inside this directory and extend ZeroHTTPd as needed.

To build all 7 servers, all you need to do is run “make all” from the top-level directory; all 7 servers are built and placed there. The executables expect the “public” and “templates” directories to be present in the directory they are run from.

Linux APIs

If you don’t understand the Linux API well, you should still be able to follow this series and come away with a decent understanding. I do, however, recommend that you read more about the Linux programming API. There are innumerable resources to help you in this regard and that is out of scope for this article series. Although we will touch upon several of Linux’s API categories, our focus will mainly be on processes, threads, events and networking. If you don’t know the Linux API well, I encourage you to read the man pages for the system calls and library functions we use, apart from reading books and articles on their usage.

Performance and scalability

One thought about performance and scalability. There is no relationship between them, theoretically speaking. You can have a web service that performs really well, responds within a few milliseconds, but does not scale at all. Similarly, there can be a badly performing web application that takes several seconds to respond, but scales to handle tens of thousands of concurrent users. Having said that, the combination of high-performance, highly scalable services is very powerful. High-performance applications use resources sparingly in general and are thus efficient at serving more concurrent users per server, driving down costs, which is a good thing.

CPU and I/O–bound tasks

Finally, there are only two possible types of tasks in computing: I/O bound and CPU bound. Getting requests over the internet (network I/O), serving files (network and disk I/O) and talking to a database (network and disk I/O) are all I/O bound activities. Several types of DB queries can use a bit of CPU, though (sorting, calculating the mean of a million results, etc.). Most of the web applications you build will be I/O bound and the CPU will rarely be used to its full capacity. When you see a lot of CPU being used in an I/O bound application, it most likely points to poor application architecture; it could mean that CPU time is essentially being spent on process management and context switching overhead, which is not exactly useful. If you are doing things like heavy image processing, audio file conversion or machine learning inference, then your application will tend to be CPU bound. However, for the majority of applications, this won’t be the case.

Special Thanks

Writing an article series that is tens of thousands of words becomes easy with help from reviewers. Thanks go out to Vijay Lakshminarayanan and Arun Venkataswamy for spending their time reviewing this series and suggesting corrections to several obvious and not-so-obvious problems.

Go deeper into a server architecture

  1. Part I. Iterative Servers
  2. Part II. Forking Servers
  3. Part III. Pre-forking Servers
  4. Part IV. Threaded Servers
  5. Part V. Pre-threaded Servers
  6. Part VI: poll-based server
  7. Part VII: epoll-based server