Linux Applications Performance: Part VII: epoll Servers

This chapter is part of a series of articles on Linux application performance.

If you came here without reading the poll()-based implementation description

All the explanation of how exactly we move from a process- or thread-based model to an event-based model is in the poll()-based article. Without going through it, this article might not make much sense.

Continuing on with epoll

The main difference between a server built around poll() and one built around epoll is just the API; otherwise, the general program structure remains the same. The really big changes came when we moved from a threaded model to an event-based model, where pretty much every function in the program had to change. In this article, we’ll look at the main differences between using poll() and epoll_wait().

As we saw in the previous article in this series, the problem with poll() is that it takes as input a pointer to an array of structures with one element for every file descriptor you want to monitor. This array can grow significantly when you’re dealing with thousands of concurrent requests. Also, when poll() returns, it only tells you how many descriptors are ready. It is your job to iterate through the array and check the revents member of each element to see whether that particular descriptor is the one that’s ready. This does not scale well as the number of concurrent connections increases.
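
To see why, here is a minimal sketch of the pattern a poll()-based loop forces on you. This is not ZeroHTTPd’s actual code; handle_ready_fd() is a hypothetical stand-in for whatever per-descriptor processing the server does, and fatal_error() is the same helper used elsewhere in the program.

#include <poll.h>

/* Sketch only: the whole pollfd array is handed to the kernel on every
 * call, and every element has to be scanned afterwards, even if only
 * one descriptor is actually ready. */
void poll_iteration(struct pollfd *poll_fds, int nfds) {
    int ready = poll(poll_fds, nfds, -1);
    if (ready < 0)
        fatal_error("poll()");

    for (int i = 0; i < nfds; i++) {
        if (poll_fds[i].revents & (POLLIN | POLLOUT))
            handle_ready_fd(poll_fds[i].fd);    /* hypothetical helper */
    }
}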

With epoll, you first create an epoll descriptor:

epoll_fd = epoll_create1(0);
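
In a real program you will also want to check its return value; a small sketch, reusing the program’s fatal_error() helper:

/* epoll_create1() returns -1 on failure. */
epoll_fd = epoll_create1(0);
if (epoll_fd < 0)
    fatal_error("epoll_create1()");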

Now you can add the descriptors you want watched for activity by calling epoll_ctl(). Take a look at our utility function add_to_epoll_fd_list().

void add_to_epoll_fd_list(int fd, uint32_t ep_events) {
    struct epoll_event event;
    event.data.fd = fd;         /* lets us recover the fd when the event fires */
    event.events = ep_events;   /* e.g. EPOLLIN and/or EPOLLOUT */
    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &event))
        fatal_error("add epoll_ctl()");
}
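
For example, to start watching the listening socket for incoming connections, the server would call it along these lines (a usage sketch, assuming server_socket holds the listening socket):

/* A readable listening socket means there is a connection waiting
 * to be accept()ed. */
add_to_epoll_fd_list(server_socket, EPOLLIN);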

Finally, you call epoll_wait(), which is the equivalent of calling poll(), but you’ve already told the operating system which descriptors you want to monitor with epoll_ctl(). There is no huge array sitting in user space that you need to pass to the kernel every time you call epoll_wait().
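
enter_server_loop() below relies on a global array that epoll_wait() fills in on each call. A declaration along these lines is assumed; note that MAX_EVENTS only caps how many ready descriptors a single epoll_wait() call reports, not how many descriptors you can watch.

#define MAX_EVENTS 1024                 /* assumed value; any reasonable cap works */
struct epoll_event events[MAX_EVENTS];  /* filled in by epoll_wait() */
int epoll_fd;                           /* from epoll_create1() */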

void enter_server_loop(int server_socket)
{
    int epoll_ret;
    struct sockaddr_in client_addr;
    socklen_t client_addr_len = sizeof(client_addr);

    while (1) {
        epoll_ret = epoll_wait(epoll_fd, events, MAX_EVENTS, -1);
        if (epoll_ret < 0)
            fatal_error("epoll_wait()");

        for (int i = 0; i < epoll_ret; i++) {

            if (events[i].events == 0)
                continue;

            if (events[i].data.fd == server_socket) {
                /* Activity on the listening socket: accept the new connection. */
                int client_socket = accept(
                        server_socket,
                        (struct sockaddr *)&client_addr,
                        &client_addr_len);

                if (client_socket > 0)
                    handle_client(client_socket);
            } else {
                if (!((events[i].events & EPOLLIN) || (events[i].events & EPOLLOUT))) {
                    printf("ERROR: Error condition in socket %d!\n", events[i].data.fd);
                    exit(1);
                }
                /* A descriptor handling one of the clients is ready */
                client_context *check_cc;
                HASH_FIND(hh_cs, cc_cs_hash, &events[i].data.fd, sizeof(int), check_cc);
                if (!check_cc) {
                    HASH_FIND(hh_rs, cc_rs_hash, &events[i].data.fd, sizeof(int), check_cc);
                    if (!check_cc) {
                        printf("Unable to find client context associated with socket %d.\n", events[i].data.fd);
                        exit(1);
                    }
                }
                check_cc->read_write_callback(check_cc);
            }
        }
    }
}

Take a look at enter_server_loop() above. epoll_wait() fills the events array with information only about descriptors that are ready and returns how many of them there are. This way, you do not have to loop through your entire set of monitored file descriptors when only one of them is actually ready.

Apart from the addition of epoll_create1() and epoll_ctl(), and the replacement of poll() with epoll_wait() along with the changes that go with it, there are really no other changes needed to convert the poll()-based version of ZeroHTTPd into the epoll-based one.

Performance of the epoll-based server

Let’s bring up our table of performance numbers again for easy reference

requests/second

concurrency | iterative | forking         | preforked       | threaded        | prethreaded | poll            | epoll
20          | 7         | 112             | 2,100           | 1,800           | 2,250       | 1,900           | 2,050
50          | 7         | 190             | 2,200           | 1,700           | 2,200       | 2,000           | 2,000
100         | 7         | 245             | 2,200           | 1,700           | 2,200       | 2,150           | 2,100
200         | 7         | 330             | 2,300           | 1,750           | 2,300       | 2,200           | 2,100
300         | -         | 380             | 2,200           | 1,800           | 2,400       | 2,250           | 2,150
400         | -         | 410             | 2,200           | 1,750           | 2,600       | 2,000           | 2,000
500         | -         | 440             | 2,300           | 1,850           | 2,700       | 1,900           | 2,212
600         | -         | 460             | 2,400           | 1,800           | 2,500       | 1,700           | 2,519
700         | -         | 460             | 2,400           | 1,600           | 2,490       | 1,550           | 2,607
800         | -         | 460             | 2,400           | 1,600           | 2,540       | 1,400           | 2,553
900         | -         | 460             | 2,300           | 1,600           | 2,472       | 1,200           | 2,567
1,000       | -         | 475             | 2,300           | 1,700           | 2,485       | 1,150           | 2,439
1,500       | -         | 490             | 2,400           | 1,550           | 2,620       | 900             | 2,479
2,000       | -         | 350             | 2,400           | 1,400           | 2,396       | 550             | 2,200
2,500       | -         | 280             | 2,100           | 1,300           | 2,453       | 490             | 2,262
3,000       | -         | 280             | 1,900           | 1,250           | 2,502       | wide variations | 2,138
5,000       | -         | wide variations | 1,600           | 1,100           | 2,519       | -               | 2,235
8,000       | -         | -               | 1,200           | wide variations | 2,451       | -               | 2,100
10,000      | -         | -               | wide variations | -               | 2,200       | -               | 2,200
11,000      | -         | -               | -               | -               | 2,200       | -               | 2,122
12,000      | -         | -               | -               | -               | 970         | -               | 1,958
13,000      | -         | -               | -               | -               | 730         | -               | 1,897
14,000      | -         | -               | -               | -               | 590         | -               | 1,466
15,000      | -         | -               | -               | -               | 532         | -               | 1,281

In this test, as in the poll()-based test, I had 10 epoll-based server processes running. Their real competition is the pre-threaded version of ZeroHTTPd, of which I ran two processes, each with 10,000 threads in its pool. You can see something very interesting: while both servers are neck and neck in throughput, the epoll-based server holds up well all the way to 15,000 concurrent users, whereas the pre-threaded server loses steam around the 12,000 concurrent connection mark. I think the most important observation is that, thanks to its relatively low operating-system overhead, the epoll-based server not only performs decently but also scales very well when handling very high numbers of concurrent requests.

There are still a couple of places where the poll()- and epoll-based servers can block: the initial read from the client right after we accept a connection, and the writes to both the client and Redis. I believe that if we make these non-blocking as well (as sketched below), we should see even better performance compared to the pre-threaded version.
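
Making a descriptor non-blocking is itself a small change; here is a sketch using fcntl(), reusing the program’s fatal_error() helper. The real work lies in restructuring the code to cope with partial reads and writes and EAGAIN.

#include <fcntl.h>

/* Sketch: switch an already-open descriptor to non-blocking mode so that
 * read()/write() return -1 with errno set to EAGAIN/EWOULDBLOCK instead
 * of blocking. */
void make_socket_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        fatal_error("fcntl(F_GETFL)");
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        fatal_error("fcntl(F_SETFL)");
}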

Articles in this series

  1. Series Introduction
  2. Part I. Iterative Servers
  3. Part II. Forking Servers
  4. Part III. Pre-forking Servers
  5. Part IV. Threaded Servers
  6. Part V. Pre-threaded Servers
  7. Part VI: poll-based server
  8. Part VII: epoll-based server

Thank you!

If you made it all the way here through the 7 articles, I hope you feel it was worth the journey. Systems programming is a fascinating subject and knowing the nuances can be very rewarding and satisfying. Feel free to tweet at me for any feedback or just a hi-fi: @shuveb.