Linux Applications Performance: Part VI: Polling Servers

This chapter is part of a series of articles on Linux application performance.

When things happen sequentially, we understand them. All our flowcharts, algorithms and workflows are sequential in nature. It’s easy for our brains to follow sequential happenings. Moving to a multi-process or threaded model is a simple extension of that: there are still sequential things happening, just in several processes or threads. Event-based architectures, as we will see in the poll and epoll-based servers, are a little trickier to wrap your head around, but they are hardly rocket science.

System calls like poll() and epoll_wait() are simple. You give them a bunch of sockets or other file descriptors and they tell you if any of them are ready for I/O, meaning reads or writes on them won’t block. Until one or more of these descriptors become ready, poll() or epoll_wait() will block. You can specify a timeout to limit how long to wait for activity on the descriptors, or you can ask to wait indefinitely. There you have the shortest (but incomplete) combined man page for poll() and epoll_wait().
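To make that concrete, here is a minimal sketch of typical poll() usage. It is not taken from this chapter’s server code; the two descriptors and the 5-second timeout are arbitrary placeholders.

#include <poll.h>
#include <stdio.h>

/* Wait up to 5 seconds for either of two sockets to become readable.
 * sock1 and sock2 are assumed to be already-connected descriptors. */
void wait_for_data(int sock1, int sock2)
{
    struct pollfd fds[2];
    fds[0].fd = sock1; fds[0].events = POLLIN;
    fds[1].fd = sock2; fds[1].events = POLLIN;

    int ready = poll(fds, 2, 5000);   /* pass -1 to wait indefinitely */
    if (ready <= 0)
        return;                       /* 0 means timeout, -1 means error */

    for (int i = 0; i < 2; i++)
        if (fds[i].revents & POLLIN)
            printf("fd %d is ready: a read() will not block now\n", fds[i].fd);
}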

What is the overhead we want to avoid when processing a large number of concurrent users?

Our goal is to process thousands of concurrent users. Let’s take a look at the server architectures we have so far. The iterative server is out of the question. We can also discount the forking and threading servers, since they create a new process or thread the moment they receive a new connection, which leads to clearly measurable performance issues. That leaves us with the pre-forked and pre-threaded servers. We saw that the pre-threaded server can handle up to 11,000 concurrent users as far as our Guestbook app goes. But as the number of processes and threads increases, the operating system spends more and more time scheduling them. At hundreds of processes this is not a problem; at very high concurrencies it measurably is.

Analyzing blocking operations

So, we want to avoid running a lot of processes or threads, but still serve a large number of concurrent connections. How exactly can we do this? We do it by handling as many connections as possible in a single process or thread. Although that sounds tricky, it is not very difficult. This is where operating system facilities like poll and epoll come in. Let’s revisit the flow of the Guestbook app:

If you look at the color coding in the flow image above, the red blocks are the ones that could potentially block our process. The orange block is the file read operation where we read in the template file. Unfortunately, under Linux, even though file operations can block, there is no way to find out whether they actually will. Having said that, file reads and writes are pretty fast and are usually cached aggressively, so they are rarely the culprit. Network operations, like the ones shown in the red blocks, can really hold up the process.

The main idea behind event-based servers is that we want to avoid getting blocked at all costs. Blocking stalls the whole process, and remember, the idea is to process as many concurrent requests as possible with just one process. The one thing that can keep us from processing more and more requests is blocking. If we are not blocked, we can be accepting more connections, initiating more Redis read and write requests, and writing results back to client sockets, thus successfully serving requests.

There are two other blocking operations that are not in the flow diagram: being blocked in a call to accept(), where we accept incoming connections, and reading from the client. Both of these can also block our process. These two involve the client socket, while all the red blocks in the diagram involve the Redis socket. Writes can block you if you have a lot of data to write, but in our application we don’t do big writes, so for the sake of simplicity we will only worry about getting blocked on reads. If you are waiting to read even a single byte, you can get blocked.

In summary, these are the operations that can block us (a small sketch illustrating blocking versus non-blocking reads follows the list):

  • The server socket when accepting new connections with accept()
  • The client socket while reading with read() or recv()
  • The Redis socket while connecting to Redis with connect()
  • The Redis socket while reading with read() or recv()
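To illustrate what “blocked on a read” means, here is a small sketch that is not part of the server code. Putting a socket into non-blocking mode makes read() return immediately with EAGAIN/EWOULDBLOCK instead of putting the process to sleep, which is exactly the behaviour a single-process event-driven server needs.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Attempt a read on a socket without risking getting blocked.
 * 'sock' is assumed to be a connected socket. */
void try_read_without_blocking(int sock)
{
    /* Turn on O_NONBLOCK, preserving the other file status flags */
    fcntl(sock, F_SETFL, fcntl(sock, F_GETFL) | O_NONBLOCK);

    char buf[512];
    ssize_t n = read(sock, buf, sizeof(buf));
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        printf("No data yet: a blocking read() would have stalled the process here.\n");
    else if (n >= 0)
        printf("Read %zd bytes without blocking.\n", n);
}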

Structuring our program

The general idea is this: if we are going to perform a potentially blocking operation like accept(), connect(), read() or recv(), we never just call it directly. We first add the file descriptor in question to poll()’s list of monitored descriptors so that it can tell us when data is available, or when a new incoming (accept()) or outgoing (connect()) connection is ready. In the meanwhile, we can process other reads and connects.

Consider this: you have a new client connection #5. If you now try to connect to Redis with socket #6, you could block and fail to accept new client connections #7 and #8. Instead, you tell poll() to monitor your newly initiated Redis connection attempt, which could take a while. In the meanwhile, poll() tells you that there are newer client connections #7 and #8 available. Later, it tells you that Redis connection #6, which was initiated to serve client #5, is now ready. Now that #6 is available, you can ask it to read the guestbook entries from Redis. In the meanwhile, you initiated Redis connections #9 and #10, paired with clients #7 and #8, and poll() tells you when their connect() calls get through. You simply react to the events poll() tells you occurred. This is why this paradigm is called event-driven programming.

The Client Context

In threaded programs, you block and unblock with each call to connect() and read() or recv(), but you don’t realize it, because the whole context of a single client request lives in one thread, which makes it naturally easy to understand. Look at the client flow diagram above: you can think of each block as a state that the client request moves through. With event-driven servers, like those using poll() and epoll_wait(), it is as if you suspend a client request in a particular state when it would block and switch to another client request, which could be in a different state. poll() or epoll_wait() tells you when a request is ready to move from one state to the next.

To maintain the state/context of each request, we use a struct called the client_context.

typedef struct client_context{
    int client_sock;
    int redis_sock;
    char templ[16384];      /* Template storage */
    char rendering[16384];

    /* Next callback to make to render the page */
    void (*next_callback)(struct client_context *cc);

    /* Used by Redis reader/writer callback functions */
    void (*read_write_callback)(struct client_context *cc);

    int entries_count;
    char **entries;
    char redis_key_buffer[64];
    int intval;

    /* Hash table handles to look up by Redis or by client socket */
    UT_hash_handle hh_cs;   /* Lookup by client socket */
    UT_hash_handle hh_rs;   /* Lookup by Redis socket  */
} client_context;

As far as processing client requests is concerned, we mainly have to deal with two sockets: the client connected socket, which is client_sock, and the Redis connected socket, which is redis_sock in the client_context structure.

Each client or Redis socket is identified by a file descriptor, a unique number assigned by the operating system. When poll() tells us one of the file descriptors is ready, we need to find the client context associated with it. For this, we use two hash tables. Given the client socket or the Redis socket, we can retrieve the associated client_context, which we allocate for every new successful client connection.
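The heads of these two hash tables are not shown in the excerpts in this chapter; presumably they are globals declared along these lines (the names cc_cs_hash and cc_rs_hash match the lookups you will see later):

#include "uthash.h"

/* Heads of the two lookup tables: one keyed by the client socket (via the
 * hh_cs handle), one keyed by the Redis socket (via the hh_rs handle). */
client_context *cc_cs_hash = NULL;
client_context *cc_rs_hash = NULL;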

We use the read_write_callback function pointer to move from one state to the next. After poll() returns, we fetch the associated client_context and simply call the function pointed to by read_write_callback. That function stores the next function to call in the same pointer, initiates the next operation, adds the file descriptor to the list monitored by poll() if it isn’t already present, and returns, taking us back into the loop that calls poll() again.

The poll() loop

The beating heart of our program is this loop in enter_server_loop():

void enter_server_loop(int server_socket)
{
    int poll_ret;
    struct sockaddr_in client_addr;
    socklen_t client_addr_len = sizeof(client_addr);

    while (1) {
        poll_ret = poll(poll_fds, poll_nfds, -1);
        if (poll_ret < 0)
            fatal_error("poll()");

        for (int i = 0; i < poll_nfds; i++) {

            if (poll_fds[i].revents == 0)
                continue;

            if (poll_fds[i].fd == server_socket) {
                /* The listening socket is ready: accept the new connection */
                int client_socket = accept(
                        server_socket,
                        (struct sockaddr *)&client_addr,
                        &client_addr_len);

                if (client_socket > 0)
                    handle_client(client_socket);
            } else {
                if (!((poll_fds[i].revents & POLLIN) || (poll_fds[i].revents & POLLOUT))) {
                    printf("ERROR: Error condition in socket %d!\n", poll_fds[i].fd);
                    exit(1);
                }
                /* A descriptor handling one of the clients is ready */
                client_context *check_cc;
                HASH_FIND(hh_cs, cc_cs_hash, &poll_fds[i].fd, sizeof(int), check_cc);
                if (!check_cc) {
                    HASH_FIND(hh_rs, cc_rs_hash, &poll_fds[i].fd, sizeof(int), check_cc);
                    if (!check_cc) {
                        printf("Unable to find client context associated with socket.\n");
                        continue;
                    }
                }
                check_cc->read_write_callback(check_cc);
            }
        }
    }
}
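The loop above leans on a global array poll_fds and its counter poll_nfds. They are maintained by the utility functions add_to_poll_fd_list() and remove_from_poll_fd_list(), which are not shown in these excerpts. A minimal sketch of what they might look like, assuming a fixed-size global array:

#include <poll.h>

#define MAX_POLL_FDS 16384

struct pollfd poll_fds[MAX_POLL_FDS];
nfds_t poll_nfds = 0;

void add_to_poll_fd_list(int fd, short events)
{
    /* If the descriptor is already being monitored, just merge the events */
    for (nfds_t i = 0; i < poll_nfds; i++) {
        if (poll_fds[i].fd == fd) {
            poll_fds[i].events |= events;
            return;
        }
    }
    /* No overflow check, for brevity */
    poll_fds[poll_nfds].fd = fd;
    poll_fds[poll_nfds].events = events;
    poll_fds[poll_nfds].revents = 0;
    poll_nfds++;
}

void remove_from_poll_fd_list(int fd)
{
    for (nfds_t i = 0; i < poll_nfds; i++) {
        if (poll_fds[i].fd == fd) {
            /* Overwrite this slot with the last entry and shrink the list */
            poll_fds[i] = poll_fds[poll_nfds - 1];
            poll_nfds--;
            return;
        }
    }
}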

Our main() function adds the server listening socket to poll()’s file descriptor list by calling our utility function add_to_poll_fd_list(). After that, our program pretty much revolves around this loop. poll() starts out monitoring only the listening socket, which accepts new client connections. When a new client connection is made, handle_client() is called. This sets off the dominoes for this request. We leave the static file handling logic of the ZeroHTTPd server untouched. When it comes to the Guestbook app, however, we take care never to call a potentially blocking operation directly. For example, if we want to read the guestbook remarks list from Redis, we write out the Redis command to do so, but tell poll() to let us know when our Redis socket has data ready for reading. Let’s look at that flow, starting from how we make our Redis connection.

void init_render_guestbook_template(int client_sock) {
    /* This is a new connection. Allocate a client context. */
    client_context *cc = malloc(sizeof(client_context));
    if (!cc)
        fatal_error("malloc()");
    cc->client_sock = client_sock;
    HASH_ADD(hh_cs, cc_cs_hash, client_sock, sizeof(int), cc);

    cc->next_callback = render_guestbook_post_redis_connect;
    connect_to_redis_server(cc);
    HASH_ADD(hh_rs, cc_rs_hash, redis_sock, sizeof(int), cc);
}

This is where it all starts. This function allocates memory for a new client_context structure, adds the client and Redis sockets to the two lookup tables, sets the next_callback function pointer in the client context to render_guestbook_post_redis_connect and calls connect_to_redis_server(), which initiates the connection to the Redis server:

void connect_to_redis_server(client_context *cc) {
    struct sockaddr_in redis_servaddr;
    /* Create a non-blocking socket */
    int redis_socket_fd = socket(AF_INET, SOCK_STREAM|SOCK_NONBLOCK, 0);
    memset(&redis_servaddr, 0, sizeof(redis_servaddr));
    redis_servaddr.sin_family = AF_INET;
    redis_servaddr.sin_port = htons(REDIS_SERVER_PORT);

    int pton_ret = inet_pton(AF_INET, redis_host_ip, &redis_servaddr.sin_addr);

    if (pton_ret < 0)
        fatal_error("inet_pton()");
    else if (pton_ret == 0) {
        fprintf(stderr, "Error: Please provide a valid Redis server IP address.\n");
        exit(1);
    }

    int cret = connect(redis_socket_fd,
                       (struct sockaddr *) &redis_servaddr,
                       sizeof(redis_servaddr));
    if (cret == -1 && errno != EINPROGRESS) {
        fatal_error("redis connect()");
    } else {
        cc->redis_sock = redis_socket_fd;
        cc->read_write_callback = connect_to_redis_server_callback;
        add_to_poll_fd_list(redis_socket_fd, POLLOUT);
    }
}

Notice that here in connect_to_redis_server(), we call connect(), but we’ve put this socket into non-blocking mode with the SOCK_NONBLOCK option to the socket() call. This makes sure connect() does not block. We then add the Redis socket to poll()’s polling list, which is the list of file descriptors it monitors. The other important thing to note here is the function we set read_write_callback to. If you go back and look at the enter_server_loop() function, you will see that once we locate the client context associated with the file descriptor that is ready, we simply call the function pointed to by read_write_callback. This is the function pointer responsible for moving the request from state to state, as we saw in the state-flow diagram. Here, we’ve set it to connect_to_redis_server_callback().

Now, poll() will wait for the connect() to succeed. TCP’s 3-way handshake can take a while to complete, which is why we don’t want to potentially block for a long time in connect(). It may take even longer if the Redis server is busy for some reason. Let’s take a closer look at this portion of the code in enter_server_loop():

            /* A descriptor handling one of the clients is ready */
            client_context *check_cc;
            HASH_FIND(hh_cs, cc_cs_hash, &poll_fds[i].fd, sizeof(int), check_cc);
            if (!check_cc) {
                HASH_FIND(hh_rs, cc_rs_hash, &poll_fds[i].fd, sizeof(int), check_cc);
                if (!check_cc) {
                    printf("Unable to find client context associated with socket.\n");
                    continue;
                }

            }
            check_cc->read_write_callback(check_cc);

We use the HASH_FIND() macro to find the client context associated with the file descriptor poll() says is ready. We have no clue whether this file descriptor is a client connection or a Redis socket, so we simply check both hash tables. Once we locate the client_context, we call the function pointed to by read_write_callback. At this point, that calls connect_to_redis_server_callback(). Let’s take a look at that function:

void connect_to_redis_server_callback(client_context *cc) {

    int result;
    socklen_t result_len = sizeof(result);
    if (getsockopt(cc->redis_sock, SOL_SOCKET, SO_ERROR, &result, &result_len) < 0) {
        fatal_error("Redis connect getsockopt()");
    }

    if (result != 0) {
        fatal_error("Redis connect getsockopt() result");
    }

    remove_from_poll_fd_list(cc->redis_sock);

    /* Make the redis socket blocking */
    fcntl(cc->redis_sock, F_SETFL,
          fcntl(cc->redis_sock, F_GETFL) &
          ~O_NONBLOCK);

    add_to_poll_fd_list(cc->redis_sock, POLLIN);
    cc->next_callback(cc);
}

Here, we remove the non-blocking mode on the Redis socket. We do this because we want to write Redis commands in blocking mode. As we discussed earlier, we’ve decided to have poll() track readiness only for connect() and for reads. We write to the socket directly, even though we could potentially block. We’ve chosen this route since it is simpler; handling writes through callbacks as well would add more complexity to our program. It is a great exercise, however, to do writes in callbacks once poll() tells us our sockets are ready for writing. Finally, we call the function pointed to by next_callback. This is the other function pointer we use to move the state machine forward. While read_write_callback is what we call when poll() tells us a file descriptor is ready for reading or a new connection is available, next_callback is what actually moves the request from one state to the next. At this point, next_callback points to render_guestbook_post_redis_connect(), as we stored in init_render_guestbook_template(). Let’s take a look at that function:

void render_guestbook_post_redis_connect(client_context *cc) {
    add_to_poll_fd_list(cc->client_sock, POLLIN);
    memset(cc->rendering, 0, sizeof(cc->rendering));
    memset(cc->templ, 0, sizeof(cc->templ));

    /* Read the template file */
    int fd = open(GUESTBOOK_TEMPLATE, O_RDONLY);
    if (fd == -1)
        fatal_error("Template read()");
    read(fd, cc->templ, sizeof(cc->templ));
    close(fd);

    /* Get guestbook entries and render them as HTML */
    cc->next_callback = render_guestbook_entries;
    redis_get_list( GUESTBOOK_REDIS_REMARKS_KEY, cc);
}

Here, we read the template file and store it in the client context, change next_callback to point to render_guestbook_entries and then call redis_get_list().

/*
 * Get the range of items in a list from 'start' to 'end'.
 * This function allocates memory. An array of pointers and all the strings pointed to by it
 * are dynamically allocated. redis_free_array_results() has to be called to free this
 * allocated memory by the caller.
 * */

void redis_list_get_range(char *key, int start, int end, client_context *cc) {
    /* Shortcut: initializing a character array with an empty string zero-fills the whole array */
    char cmd_buf[1024]="", start_str[16], end_str[16];
    sprintf(start_str, "%d", start);
    sprintf(end_str, "%d", end);
    sprintf(cmd_buf, "*4\r\n$6\r\nLRANGE\r\n$%zu\r\n%s\r\n$%zu\r\n%s\r\n$%zu\r\n%s\r\n", strlen(key), key, strlen(start_str), start_str, strlen(end_str), end_str);
    redis_write(cc->redis_sock, cmd_buf, strlen(cmd_buf));
    cc->read_write_callback = redis_list_get_range_callback;
}

/*
 * Utility function to get the whole list.
 * We set start to 0 and end to -1, which is a magic number that means the last element.
 * */

void redis_get_list(char *key, client_context *cc) {
    redis_list_get_range(key, 0, -1, cc);
}

As you can see, redis_get_list() simply calls redis_list_get_range(). There, we write out the command to Redis and set read_write_callback to redis_list_get_range_callback(), which will be called whenever poll() reports data ready to read on the Redis socket. This happens when Redis has processed the command to fetch the list and has begun writing data to the socket.
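The redis_write() helper used by redis_list_get_range() is not shown in these excerpts. Since we deliberately keep the Redis socket in blocking mode for writes, it is presumably just a loop that retries write() until the whole command buffer has gone out; a sketch under that assumption:

#include <unistd.h>

/* Write 'len' bytes from 'buf' to 'fd', retrying on partial writes.
 * The Redis socket is blocking here, so in principle this can stall
 * the process; we accept that for simplicity. */
int redis_write(int fd, const char *buf, size_t len)
{
    size_t written = 0;
    while (written < len) {
        ssize_t ret = write(fd, buf + written, len - written);
        if (ret <= 0)
            return -1;   /* error, or the connection went away */
        written += ret;
    }
    return 0;
}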

If you look at the source code for redis_list_get_range_callback(), which I’m omitting here for simplicity (it’s a long function), you’ll see that it finally calls the function pointed to by next_callback, which now points to render_guestbook_entries(), as we set in render_guestbook_post_redis_connect(). At this point, redis_list_get_range_callback() has stored the guestbook entries read from Redis in the client context.

void render_guestbook_entries(client_context *cc) {
    char guest_entries_html[16384]="";

    for (int i = 0; i < cc->entries_count; i++) {
        char guest_entry[1024];
        sprintf(guest_entry, "<p class=\"guest-entry\">%s</p>", cc->entries[i]);
        strcat(guest_entries_html, guest_entry);
    }
    redis_free_array_result(cc->entries, cc->entries_count);

    /* Replace guestbook entries */
    char *entries = strstr(cc->templ, GUESTBOOK_TMPL_REMARKS);
    if (entries) {
        memcpy(cc->rendering, cc->templ, entries-cc->templ);
        strcat(cc->rendering, guest_entries_html);
        char *copy_offset = cc->templ + (entries-cc->templ) + strlen(GUESTBOOK_TMPL_REMARKS);
        strcat(cc->rendering, copy_offset);
        strcpy(cc->templ, cc->rendering);
        memset(cc->rendering, 0, sizeof(cc->rendering));
    }

    bump_visitor_count(cc);
}

Here, we form the HTML snippet needed for the guestbook entries, replace the template variable with the snippet and then call bump_visitor_count().

void bump_visitor_count(client_context *cc) {
    cc->next_callback = bump_visitor_count_callback;
    redis_incr(GUESTBOOK_REDIS_VISITOR_KEY, cc);
}

This sets next_callback to point to bump_visitor_count_callback() and then calls redis_incr(), which again sets a read_write_callback, writes the INCR command to Redis, and relies on poll() to tell us when the reply is ready to read. The actual reading happens in the callback pointed to by read_write_callback, which then calls next_callback. (A sketch of what redis_incr() might look like follows the client_fini() discussion below.) This chain continues until client_fini() is finally called, which marks the end of processing a single request:

void client_fini(client_context *cc) {
    client_context *check_cc;
    HASH_FIND(hh_cs, cc_cs_hash, &cc->client_sock, sizeof(cc->client_sock), check_cc);
    if (check_cc) {
        remove_from_poll_fd_list(cc->client_sock);
        remove_from_poll_fd_list(cc->redis_sock);
        close(cc->client_sock);
        close(cc->redis_sock);
        HASH_DELETE(hh_cs, cc_cs_hash, check_cc);
        HASH_DELETE(hh_rs, cc_rs_hash, check_cc);
        free(check_cc);
    }
    else {
        printf("Could not retrieve client context back from hash!\n");
    }
}

This function removes both descriptors from poll()’s monitored list, closes the client and Redis sockets, removes the entries from both hash tables and finally frees the memory allocated for the client context. Phew!
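For completeness, redis_incr() and its read callback are also not shown above. Following the same pattern as redis_list_get_range(), a sketch of redis_incr() might look like the code below; redis_incr_callback is a hypothetical name for a reader that parses the integer reply into cc->intval and then invokes next_callback.

#include <stdio.h>
#include <string.h>

void redis_incr(char *key, client_context *cc)
{
    /* RESP encoding of: INCR <key> */
    char cmd_buf[1024] = "";
    sprintf(cmd_buf, "*2\r\n$4\r\nINCR\r\n$%zu\r\n%s\r\n", strlen(key), key);
    redis_write(cc->redis_sock, cmd_buf, strlen(cmd_buf));

    /* poll() will fire this callback when the reply is ready to read */
    cc->read_write_callback = redis_incr_callback;
}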

Performance of the Polling Server

As you can see from the performance numbers table, our poll() based server fares better than the threaded server at lower concurrencies, but gets worse at higher concurrencies. The pre-threaded and pre-forked servers easily beat it there.

requests/second

concurrency  iterative  forking          preforked        threaded         prethreaded  poll             epoll
20           7          112              2,100            1,800            2,250        1,900            2,050
50           7          190              2,200            1,700            2,200        2,000            2,000
100          7          245              2,200            1,700            2,200        2,150            2,100
200          7          330              2,300            1,750            2,300        2,200            2,100
300          -          380              2,200            1,800            2,400        2,250            2,150
400          -          410              2,200            1,750            2,600        2,000            2,000
500          -          440              2,300            1,850            2,700        1,900            2,212
600          -          460              2,400            1,800            2,500        1,700            2,519
700          -          460              2,400            1,600            2,490        1,550            2,607
800          -          460              2,400            1,600            2,540        1,400            2,553
900          -          460              2,300            1,600            2,472        1,200            2,567
1,000        -          475              2,300            1,700            2,485        1,150            2,439
1,500        -          490              2,400            1,550            2,620        900              2,479
2,000        -          350              2,400            1,400            2,396        550              2,200
2,500        -          280              2,100            1,300            2,453        490              2,262
3,000        -          280              1,900            1,250            2,502        wide variations  2,138
5,000        -          wide variations  1,600            1,100            2,519        -                2,235
8,000        -          -                1,200            wide variations  2,451        -                2,100
10,000       -          -                wide variations  -                2,200        -                2,200
11,000       -          -                -                -                2,200        -                2,122
12,000       -          -                -                -                970          -                1,958
13,000       -          -                -                -                730          -                1,897
14,000       -          -                -                -                590          -                1,466
15,000       -          -                -                -                532          -                1,281

(A dash indicates no result for that server at that concurrency.)

But one thing you have to realize is that our poll() based server handles far more concurrency in a single process. The pre-forked and pre-threaded servers conveniently create several processes and threads to handle requests. The convention with event-based servers is likewise to run a few instances to take on the load, usually as many as there are CPU cores available on the host. In these tests, although I tested with only one CPU core, I ran 10 instances of the poll() based server, whereas there were 10,000 processes and 20,000 threads while testing the pre-forked and pre-threaded servers respectively. The poll() based server can indeed handle a lot of concurrent connections per process.

Then, for all the talk about event-based servers exhibiting very high performance, why are we not able to compete with the threaded and pre-threaded servers? The problem lies in the design of the poll() system call. Take a look at its signature:

int poll(struct pollfd *fds, nfds_t nfds, int timeout);

The first argument points to an array of structures of type pollfd. The next parameter tells poll() how many structures are present in the array. Here is the trouble: every time we invoke poll(), this array of structures has to be copied into kernel space. And when poll() returns successfully, meaning that one or more file descriptors have activity, the only way to find out which descriptors are ready is to iterate through the whole list and check the revents member of every pollfd in the array. Although poll() returns the number of descriptors that are ready, it doesn’t tell us which ones they are; we have to scan the list until we’ve found them all. This can take a while when we are tracking thousands of file descriptors, since we run this linear search every time poll() returns. Also, remember that we track both a client socket and a Redis connected socket for every client connection. This is the main bottleneck with poll().
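For reference, each element of that array is a struct pollfd, which the poll(2) man page describes as:

struct pollfd {
    int   fd;       /* file descriptor to monitor */
    short events;   /* requested events (POLLIN, POLLOUT, ...) */
    short revents;  /* returned events, filled in by the kernel */
};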

In the next section, we will see how the epoll family of calls does away with this problem and provides a descriptor activity tracking mechanism in the operating system that is a lot more scalable.

Articles in this series

  1. Series Introduction
  2. Part I. Iterative Servers
  3. Part II. Forking Servers
  4. Part III. Pre-forking Servers
  5. Part IV. Threaded Servers
  6. Part V. Pre-threaded Servers
  7. Part VI: poll-based server
  8. Part VII: epoll-based server