Linux Applications Performance: Part V: Pre-threaded Servers

This article is part of a series on Linux application performance.

The design discussed in this article is more popularly known as “thread pool”. Essentially, there is a pre-created pool of threads that are ready to serve any incoming requests. This is comparable to the pre-forked server design. Whereas there was a process pool in the pre-forked architecture, we have a thread pool in this case.

Most of the code is very similar to the pre-forked server. Let’s take a look at the main() function:

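int main(int argc, char *argv[])
{
    int server_port;

    /* Call print_stats() when the server receives SIGINT (Ctrl-C). */
    signal(SIGINT, print_stats);

    /* Optional first argument: the port to listen on. */
    if (argc > 1)
        server_port = atoi(argv[1]);
    else
        server_port = DEFAULT_SERVER_PORT;

    /* Optional second argument: the Redis host. strcpy() does no bounds
     * checking, so this assumes the value fits in redis_host_ip. */
    if (argc > 2)
        strcpy(redis_host_ip, argv[2]);
    else
        strcpy(redis_host_ip, REDIS_SERVER_HOST);

    printf("ZeroHTTPd server listening on port %d\n", server_port);
    server_socket = setup_listening_socket(server_port);

    /* Create the whole pool up front. */
    for (int i = 0; i < THREADS_COUNT; ++i) {
        create_thread(i);
    }

    /* pause() returns after a signal handler has run, so keep calling it. */
    for (;;)
        pause();
}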

We call create_thread() THREADS_COUNT times in a for loop. After this, the main thread simply calls pause() in an infinite loop, since all request handling happens in the pool threads. create_thread() itself is very simple:

void create_thread(int index) {
   pthread_create(&threads[index], NULL, &enter_server_loop, NULL);
}
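The create_thread() listing above ignores the return value of pthread_create(). As a minimal sketch of how pool start-up could check for errors, the block below starts a pool and joins it again. The names start_pool(), join_pool(), count_job() and jobs_completed() are hypothetical helpers invented for this illustration and are not part of ZeroHTTPd.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Starts up to `count` threads running fn(arg).
 * Returns how many threads were actually created. */
int start_pool(pthread_t *threads, int count, void *(*fn)(void *), void *arg)
{
    int started = 0;
    for (int i = 0; i < count; i++) {
        /* pthread_create() returns an error number; it does not set errno. */
        int rc = pthread_create(&threads[started], NULL, fn, arg);
        if (rc != 0) {
            fprintf(stderr, "pthread_create: %s\n", strerror(rc));
            break;
        }
        started++;
    }
    return started;
}

/* Waits for the first `started` threads to finish. */
void join_pool(pthread_t *threads, int started)
{
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
}

/* Hypothetical worker used only to exercise the pool:
 * it bumps a shared counter under a mutex and exits. */
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static int jobs_done = 0;

void *count_job(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&count_lock);
    jobs_done++;
    pthread_mutex_unlock(&count_lock);
    return NULL;
}

/* Only meaningful after join_pool() has returned. */
int jobs_completed(void)
{
    return jobs_done;
}
```

A server like the one in this article never joins its threads because they loop forever; join_pool() is here only so the sketch can finish and be checked.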

We call pthread_create() there with enter_server_loop() as the start function. Let’s take a look at that function now:

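void *enter_server_loop(void *targ)
{
    struct sockaddr_in client_addr;

    while (1)
    {
        /* accept() overwrites the length, so reset it on every iteration. */
        socklen_t client_addr_len = sizeof(client_addr);

        /* Only the thread that holds the mutex may block in accept(). */
        pthread_mutex_lock(&mlock);
        int client_socket = accept(
                server_socket,
                (struct sockaddr *)&client_addr,
                &client_addr_len);
        if (client_socket == -1)
            fatal_error("accept()");
        pthread_mutex_unlock(&mlock);

        /* The client is served outside the lock, so other threads can accept. */
        handle_client(client_socket);
    }
}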

Rather than having all threads block on accept(), all threads call pthread_mutex_lock(). “Mutex” is short for “mutual exclusion”. Only one thread acquires the mutex and proceeds past pthread_mutex_lock(); all the other threads block there until the mutex is released. Once that thread returns from accept(), it calls pthread_mutex_unlock() to release the mutex so that another thread can acquire it and call accept(), while the first thread goes on to handle its client outside the lock. This setup ensures that at most one thread in the pool is blocked in accept() at any time, thus avoiding the thundering herd problem discussed in the pre-forked server architecture article.
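To try the accept-under-mutex pattern in isolation, here is a small sketch that runs it over the loopback interface. It is not ZeroHTTPd code: DEMO_WORKERS, accept_one() and run_accept_demo() are hypothetical names, and each worker accepts only one connection so that the demo can terminate, whereas the real server loops forever.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define DEMO_WORKERS 4   /* hypothetical pool size for this sketch */

static pthread_mutex_t accept_lock = PTHREAD_MUTEX_INITIALIZER;
static int listen_fd = -1;

/* Worker: take the mutex, accept one connection, release the mutex,
 * then serve the client outside the lock. */
static void *accept_one(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&accept_lock);    /* at most one thread waits in accept() */
    int fd = accept(listen_fd, NULL, NULL);
    pthread_mutex_unlock(&accept_lock);  /* let the next worker in */
    if (fd == -1)
        return NULL;
    send(fd, "ok", 2, 0);
    close(fd);
    return NULL;
}

/* Returns the number of clients that got the 2-byte reply,
 * or -1 if the listening socket could not be set up. */
int run_accept_demo(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                   /* let the kernel pick a free port */

    listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_fd == -1)
        return -1;
    if (bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||
        listen(listen_fd, DEMO_WORKERS) == -1 ||
        getsockname(listen_fd, (struct sockaddr *)&addr, &len) == -1) {
        close(listen_fd);
        return -1;
    }

    pthread_t workers[DEMO_WORKERS];
    int started = 0;
    for (int i = 0; i < DEMO_WORKERS; i++) {
        if (pthread_create(&workers[started], NULL, accept_one, NULL) == 0)
            started++;
    }

    /* One client per worker; each waits for its reply before the next connects. */
    int served = 0;
    for (int i = 0; i < started; i++) {
        char reply[2];
        int c = socket(AF_INET, SOCK_STREAM, 0);
        if (c == -1)
            continue;
        if (connect(c, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
            recv(c, reply, sizeof(reply), MSG_WAITALL) == 2 &&
            memcmp(reply, "ok", 2) == 0)
            served++;
        close(c);
    }

    /* On Linux, shutting down a listening socket makes an accept() that is
     * still blocked on it fail, so a worker whose client never connected
     * does not hang the join below. */
    shutdown(listen_fd, SHUT_RDWR);
    for (int i = 0; i < started; i++)
        pthread_join(workers[i], NULL);
    close(listen_fd);
    return served;
}
```

Note that the mutex is released immediately after accept() returns, before any per-client work. If the lock were held while serving the client, the pool would degrade to handling one connection at a time.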

Other parts of the server are pretty much the same code as in the pre-forked server.

Pre-threaded Server Performance

Threads are light-weight compared to processes because they share most of the memory of the process that created them, so creating a new thread costs the operating system far less than creating a new process. Let’s bring up our performance numbers table:

requests/second (a dash marks a concurrency level with no reported result)

concurrency  iterative  forking          preforked        threaded         prethreaded  poll             epoll
20           7          112              2,100            1,800            2,250        1,900            2,050
50           7          190              2,200            1,700            2,200        2,000            2,000
100          7          245              2,200            1,700            2,200        2,150            2,100
200          7          330              2,300            1,750            2,300        2,200            2,100
300          -          380              2,200            1,800            2,400        2,250            2,150
400          -          410              2,200            1,750            2,600        2,000            2,000
500          -          440              2,300            1,850            2,700        1,900            2,212
600          -          460              2,400            1,800            2,500        1,700            2,519
700          -          460              2,400            1,600            2,490        1,550            2,607
800          -          460              2,400            1,600            2,540        1,400            2,553
900          -          460              2,300            1,600            2,472        1,200            2,567
1,000        -          475              2,300            1,700            2,485        1,150            2,439
1,500        -          490              2,400            1,550            2,620        900              2,479
2,000        -          350              2,400            1,400            2,396        550              2,200
2,500        -          280              2,100            1,300            2,453        490              2,262
3,000        -          280              1,900            1,250            2,502        wide variations  2,138
5,000        -          wide variations  1,600            1,100            2,519        -                2,235
8,000        -          -                1,200            wide variations  2,451        -                2,100
10,000       -          -                wide variations  -                2,200        -                2,200
11,000       -          -                -                -                2,200        -                2,122
12,000       -          -                -                -                970          -                1,958
13,000       -          -                -                -                730          -                1,897
14,000       -          -                -                -                590          -                1,466
15,000       -          -                -                -                532          -                1,281

Comparison to the threaded architecture

The pre-threaded server performs about 40% better on average than the threaded server, which creates a new thread for every connection rather than paying that cost once at startup. However, its performance is similar to that of the pre-forked server. This is consistent with the fact that, under Linux, processes and threads are scheduled in the same way and have similar performance characteristics once they are running. If there is any difference in overhead, it is in creation: a new process shares less with its parent, whereas a new thread shares pretty much everything with the thread that created it.
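The creation-cost difference mentioned above can be measured with a rough micro-benchmark. The sketch below times pthread_create() plus pthread_join() against fork() plus waitpid(). The function names time_thread_create() and time_fork() are hypothetical, and the results depend heavily on the machine, the kernel and how much memory the calling process has mapped, so treat any numbers as indicative only.

```c
#define _POSIX_C_SOURCE 200809L

#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Thread body that does nothing, so only creation and teardown are timed. */
static void *noop(void *arg)
{
    return arg;
}

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Average seconds to create and join one thread, or -1.0 on failure. */
double time_thread_create(int rounds)
{
    double start = now_sec();
    for (int i = 0; i < rounds; i++) {
        pthread_t t;
        if (pthread_create(&t, NULL, noop, NULL) != 0)
            return -1.0;
        pthread_join(t, NULL);
    }
    return (now_sec() - start) / rounds;
}

/* Average seconds to fork and reap one child, or -1.0 on failure. */
double time_fork(int rounds)
{
    double start = now_sec();
    for (int i = 0; i < rounds; i++) {
        pid_t pid = fork();
        if (pid == -1)
            return -1.0;
        if (pid == 0)
            _exit(0);            /* child: exit immediately without flushing stdio */
        waitpid(pid, NULL, 0);
    }
    return (now_sec() - start) / rounds;
}
```

Calling both functions with the same number of rounds (say 1,000) and printing the two averages gives a per-creation cost for each. A pre-threaded or pre-forked server pays this cost only at startup, which is why the two perform so similarly under load.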

Comparison to the pre-forked architecture

There is one other crucial difference compared to the pre-forked architecture. While our pre-forked server scales reliably to about 5,000 concurrent connections with decent performance, our pre-threaded server handles up to 11,000 concurrent connections without significant deterioration in performance. This is clearly an advantage over the pre-forked server.

Is switching from pre-forked to pre-threaded worth it?

It is fairly easy in most cases to change a server based on a pre-forked architecture to a pre-threaded architecture, especially if the pre-forked version uses few synchronization primitives. It should be worth the effort, since the performance gains from moving to a pre-threaded architecture are non-trivial at high concurrency. However, this should be done only if you are expecting a lot of concurrent users. As you can see from the table, the two designs perform very similarly up to about 2,000 concurrent connections, and the pre-threaded server pulls ahead only beyond that.

Now, for something completely different

Now that you’ve seen how servers based on processes and threads work, we are ready to see, in the next article in the series, how servers based on the event processing model work.

Articles in this series

  1. Series Introduction
  2. Part I. Iterative Servers
  3. Part II. Forking Servers
  4. Part III. Pre-forking Servers
  5. Part IV. Threaded Servers
  6. Part V. Pre-threaded Servers
  7. Part VI: poll-based server
  8. Part VII: epoll-based server