Linux Applications Performance: Part V: Pre-threaded Servers

This article is part of a series on Linux application performance.

The design discussed in this article is more popularly known as a “thread pool”. Essentially, a pool of threads is created up front, and those threads are ready to serve incoming requests. This is comparable to the pre-forked server design: where the pre-forked architecture had a process pool, we have a thread pool in this case.

Most of the code is very similar to the pre-forked server. Let’s take a look at the main() function:

int main(int argc, char *argv[])
{
    int server_port;
    signal(SIGINT, print_stats);

    if (argc > 1)
      server_port = atoi(argv[1]);
    else
      server_port = DEFAULT_SERVER_PORT;

    if (argc > 2)
        strcpy(redis_host_ip, argv[2]);
    else
        strcpy(redis_host_ip, REDIS_SERVER_HOST);

    printf("ZeroHTTPd server listening on port %d\n", server_port);
    server_socket = setup_listening_socket(server_port);
    for (int i = 0; i < THREADS_COUNT; ++i) {
        create_thread(i);
    }

    for (;;)
        pause();
}

We call create_thread() in a for loop that runs THREADS_COUNT times. After this, the main thread calls pause() in an infinite loop, sleeping until a signal arrives while the worker threads do all the serving. create_thread() itself is very simple:

void create_thread(int index) {
   pthread_create(&threads[index], NULL, &enter_server_loop, NULL);
}
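
As a side note, pthread_create() reports failure through its return value rather than through errno, and the listing above ignores that value. The following is a minimal sketch of how thread creation could be checked, not the code ZeroHTTPd uses. The names worker_main(), spawn_and_join() and MAX_WORKERS are made up for illustration, and the start routine only sets a flag where a real server would enter its accept loop.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORKERS 64   /* illustrative upper bound, not from the article */

/* Hypothetical start routine: a real server would loop on accept() here. */
static void *worker_main(void *arg)
{
    int *started = arg;
    *started = 1;        /* each worker writes only to its own slot */
    return NULL;
}

/* Creates `count` threads, checks every pthread_create() result,
 * joins the threads that were created, and returns how many ran. */
int spawn_and_join(int count)
{
    pthread_t tids[MAX_WORKERS];
    int started[MAX_WORKERS] = {0};
    int created = 0;
    int ran = 0;

    assert(count > 0 && count <= MAX_WORKERS);

    for (int i = 0; i < count; i++) {
        int rc = pthread_create(&tids[i], NULL, worker_main, &started[i]);
        if (rc != 0) {
            /* rc is the error number itself; errno is not set. */
            fprintf(stderr, "pthread_create: %s\n", strerror(rc));
            break;
        }
        created++;
    }

    for (int i = 0; i < created; i++) {
        pthread_join(tids[i], NULL);
        ran += started[i];
    }
    return ran;
}
```

Calling spawn_and_join(8) from a main() function should return 8 when all eight threads could be created. A long-running server would not join its workers like this; the join is only here so the sketch can report a result.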

We call pthread_create() there with enter_server_loop() as the start function. Let’s take a look at that function now:

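void *enter_server_loop(void *targ)
{
    struct sockaddr_in client_addr;
    socklen_t client_addr_len;

    while (1)
    {
        /* Only the thread holding the mutex may wait in accept(). */
        pthread_mutex_lock(&mlock);
        /* accept() overwrites the length, so reset it on every call. */
        client_addr_len = sizeof(client_addr);
        long client_socket = accept(
                server_socket,
                (struct sockaddr *)&client_addr,
                &client_addr_len);
        if (client_socket == -1)
            fatal_error("accept()");
        pthread_mutex_unlock(&mlock);

        /* The request is served outside the lock, in parallel with other threads. */
        handle_client(client_socket);
    }
}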

Rather than having all threads block on accept(), every thread first calls pthread_mutex_lock(). “Mutex” is short for “mutual exclusion”. Only one thread at a time “acquires” the mutex and executes past pthread_mutex_lock(); all other threads block there. The thread that holds the mutex is the only one that goes on to wait in accept(). Once it returns from accept(), it calls pthread_mutex_unlock() to release the mutex so that some other thread can acquire it and call accept() in turn. This setup ensures that at most one thread in the pool is blocked on accept() at any time, thus avoiding the thundering herd problem discussed in the pre-forked server architecture article.
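
The locking pattern can be tried out without any sockets. The sketch below is an illustration under stated assumptions, not ZeroHTTPd code: a shared counter stands in for the listening socket, taking the next number from it stands in for accept(), and incrementing a per-thread tally stands in for handle_client(). The names worker(), run_pool(), max_in_section() and MAX_THREADS are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_THREADS 64   /* illustrative upper bound */

static pthread_mutex_t mlock = PTHREAD_MUTEX_INITIALIZER;
static int next_conn;    /* stand-in for the queue behind accept() */
static int total_conns;  /* how many simulated connections exist */
static int inside;       /* threads currently inside the locked region */
static int max_inside;   /* highest value `inside` ever reached */

static void *worker(void *arg)
{
    long *handled = arg;

    for (;;) {
        pthread_mutex_lock(&mlock);
        inside++;
        if (inside > max_inside)
            max_inside = inside;
        /* Simulated accept(): take the next connection, or -1 if none left. */
        int conn = (next_conn < total_conns) ? next_conn++ : -1;
        inside--;
        pthread_mutex_unlock(&mlock);

        if (conn == -1)
            break;
        (*handled)++;    /* simulated handle_client(), outside the lock */
    }
    return NULL;
}

/* Runs `nthreads` workers over `nconns` simulated connections and
 * returns the total number of connections handled. */
long run_pool(int nthreads, int nconns)
{
    pthread_t tids[MAX_THREADS];
    long handled[MAX_THREADS] = {0};
    long total = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    next_conn = 0;
    total_conns = nconns;
    inside = 0;
    max_inside = 0;

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[i], NULL, worker, &handled[i]) != 0)
            abort();
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);
        total += handled[i];
    }
    return total;
}

/* Largest number of threads ever seen inside the locked region. */
int max_in_section(void)
{
    return max_inside;
}
```

With 4 threads and 1,000 simulated connections, run_pool(4, 1000) returns 1000 (every connection is handled exactly once), and max_in_section() then returns 1, because the mutex never lets a second thread into the region that stands in for accept().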

Other parts of the server are pretty much the same code as in the pre-forked server.

Pre-threaded Server Performance

Given that threads are pretty light-weight compared to processes since they share much of the main process memory, there is very little operating system overhead when creating a new thread relative to creating a new process. Let’s bring up our performance numbers table:

	requests/second
concurrency	iterative	forking	preforked	threaded	prethreaded	poll	epoll
20	7	112	2,100	1,800	2,250	1,900	2,050
50	7	190	2,200	1,700	2,200	2,000	2,000
100	7	245	2,200	1,700	2,200	2,150	2,100
200	7	330	2,300	1,750	2,300	2,200	2,100
300	–	380	2,200	1,800	2,400	2,250	2,150
400	–	410	2,200	1,750	2,600	2,000	2,000
500	–	440	2,300	1,850	2,700	1,900	2,212
600	–	460	2,400	1,800	2,500	1,700	2,519
700	–	460	2,400	1,600	2,490	1,550	2,607
800	–	460	2,400	1,600	2,540	1,400	2,553
900	–	460	2,300	1,600	2,472	1,200	2,567
1,000	–	475	2,300	1,700	2,485	1,150	2,439
1,500	–	490	2,400	1,550	2,620	900	2,479
2,000	–	350	2,400	1,400	2,396	550	2,200
2,500	–	280	2,100	1,300	2,453	490	2,262
3,000	–	280	1,900	1,250	2,502	wide variations	2,138
5,000	–	wide variations	1,600	1,100	2,519	–	2,235
8,000	–	–	1,200	wide variations	2,451	–	2,100
10,000	–	–	wide variations	–	2,200	–	2,200
11,000	–	–	–	–	2,200	–	2,122
12,000	–	–	–	–	970	–	1,958
13,000	–	–	–	–	730	–	1,897
14,000	–	–	–	–	590	–	1,466
15,000	–	–	–	–	532	–	1,281

Comparison to the threaded architecture

The pre-threaded server performs about 40% better on average than the threaded server. However, its performance is similar to that of the pre-forked server. This is consistent with the fact that under Linux, processes and threads are scheduled in the same way and have similar performance characteristics. If there is any difference in overhead, it is in creation: a new process shares less with its creator, while a new thread shares pretty much everything with the creating thread.
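
One way to sanity-check the 40% figure against the table is to average the per-row ratio of pre-threaded to threaded throughput. The sketch below does that for the rows where both columns have a number (concurrency 20 through 5,000). The averaging method and the function name mean_speedup() are assumptions made for this illustration; the article does not say how its average was computed.

```c
#include <assert.h>
#include <stddef.h>

/* Requests/second copied from the table, concurrency 20 through 5,000. */
static const double threaded[] = {
    1800, 1700, 1700, 1750, 1800, 1750, 1850, 1800, 1600,
    1600, 1600, 1700, 1550, 1400, 1300, 1250, 1100
};
static const double prethreaded[] = {
    2250, 2200, 2200, 2300, 2400, 2600, 2700, 2500, 2490,
    2540, 2472, 2485, 2620, 2396, 2453, 2502, 2519
};

#define ROWS (sizeof(threaded) / sizeof(threaded[0]))

/* Mean of the per-row prethreaded/threaded ratio over the first `rows`
 * rows of the table. A result of 1.4 means "40% faster on average". */
double mean_speedup(size_t rows)
{
    double sum = 0.0;

    assert(rows > 0 && rows <= ROWS);
    for (size_t i = 0; i < rows; i++)
        sum += prethreaded[i] / threaded[i];
    return sum / (double)rows;
}
```

mean_speedup(12) covers concurrency up to 1,000 and comes out at roughly 1.4, in line with the figure above. Over all 17 rows the ratio is closer to 1.6, because the threaded server slows down at high concurrency while the pre-threaded one does not.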

Comparison to the pre-forked architecture

There is one other crucial difference compared to the pre-forked architecture. While our pre-forked server scales reliably, with decent performance, up to about 5,000 concurrent connections, our pre-threaded server handles up to 11,000 concurrent connections without significant deterioration in performance. This is clearly an advantage over the pre-forked server.

Is switching from pre-forked to pre-threaded worth it?

It is fairly easy in most cases to change a server based on a pre-forked architecture to a pre-threaded architecture, especially if the pre-forked version uses few synchronization primitives. The effort should be worth it, since the performance gains from moving to a pre-threaded architecture are non-trivial. However, this makes sense only if you are expecting a lot of concurrent users. As you can see from the table, the two designs perform very similarly up to a couple of thousand concurrent connections, and the gap opens up only beyond that.

Now, for something completely different

Now that you’ve seen how servers based on processes and threads work, we are ready to see, in the next article in the series, how servers based on the event processing model work.
