This chapter is part of a series of articles on Linux application performance.
Threads were all the rage in the 90s. They allowed the cool guys to give jaw-dropping demos to their friends and colleagues. Like so many cool things from the 90s, threads are today just another tool in any programmer's arsenal. On Linux, the API to use is POSIX Threads, lovingly shortened to pthreads. Underneath it all, Linux implements threads using the clone() system call, but you don't really have to worry about that. It is sufficient to know that the C library presents the well-known pthreads API, and you just need to use it.
The threaded server is the equivalent of our forking server. That is, we create a new thread every time we get a connection from a client, handle the request in that thread, and let the thread terminate afterwards. Now that we know that creating a new process or thread for every incoming request incurs a lot of overhead, you might think this technique isn't a great idea. You are right. A better idea is the thread equivalent of the pre-forked server: a pre-threaded, or thread-pool based, server. We are building this version for the sake of completeness, and it is also totally worth it since it reveals something very interesting: the difference in overhead between creating a process for each new connection versus creating a thread for each new connection.
```c
int main(int argc, char *argv[])
{
    int server_port;

    signal(SIGINT, print_stats);

    if (argc > 1)
        server_port = atoi(argv[1]);
    else
        server_port = DEFAULT_SERVER_PORT;

    if (argc > 2)
        strcpy(redis_host_ip, argv[2]);
    else
        strcpy(redis_host_ip, REDIS_SERVER_HOST);

    int server_socket = setup_listening_socket(server_port);
    printf("ZeroHTTPd server listening on port %d\n", server_port);
    enter_server_loop(server_socket);
    return (0);
}
```
We are back to a simple main() function. It calls enter_server_loop() directly.
```c
void enter_server_loop(int server_socket)
{
    struct sockaddr_in client_addr;
    socklen_t client_addr_len = sizeof(client_addr);
    pthread_t tid;

    while (1) {
        int client_socket = accept(
                server_socket,
                (struct sockaddr *)&client_addr,
                &client_addr_len);
        if (client_socket == -1)
            fatal_error("accept()");

        pthread_create(&tid, NULL, &handle_client,
                       (void *)(intptr_t) client_socket);
    }
}
```
Here, we get into an infinite loop that calls accept(), and every time there is a new connection, we create a new thread with pthread_create(). We pass handle_client() as the thread's start routine and client_socket as its argument.
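Note that the loop above ignores the return value of pthread_create(). Under very heavy load, thread creation can fail (for instance with EAGAIN when a resource limit is hit), and silently ignoring that failure leaks the accepted socket. This check is not part of ZeroHTTPd as presented; it is a minimal hardening sketch:

```c
/* Sketch only, not in the original: inside enter_server_loop(),
 * check pthread_create()'s result. It returns an error number
 * directly (it does not set errno). Needs <stdio.h> and <string.h>. */
int ret = pthread_create(&tid, NULL, &handle_client,
                         (void *)(intptr_t) client_socket);
if (ret != 0) {
    fprintf(stderr, "pthread_create: %s\n", strerror(ret));
    close(client_socket);   /* drop this client but keep serving others */
}
```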
```c
void *handle_client(void *targ)
{
    char line_buffer[1024];
    char method_buffer[1024];
    int method_line = 0;
    int client_socket = (intptr_t) targ;
    int redis_server_socket;

    /* No one needs to do a pthread_join() for the OS
     * to free up this thread's resources */
    pthread_detach(pthread_self());

    connect_to_redis_server(&redis_server_socket);

    while (1) {
        get_line(client_socket, line_buffer, sizeof(line_buffer));
        method_line++;

        unsigned long len = strlen(line_buffer);

        /* The first line has the HTTP method/verb. It's the only
         * thing we care about. We read the rest of the lines and
         * throw them away. */
        if (method_line == 1) {
            if (len == 0) {
                /* Empty request: close both sockets before bailing
                 * out so this thread doesn't leak descriptors. */
                close(client_socket);
                close(redis_server_socket);
                return (NULL);
            }
            strcpy(method_buffer, line_buffer);
        } else {
            if (len == 0)
                break;
        }
    }

    handle_http_method(method_buffer, client_socket);
    close(client_socket);
    close(redis_server_socket);
    return (NULL);
}
```
The handle_client() function implemented here is not very different from the ones we saw in earlier server architectures. Right at the beginning, we call pthread_detach(), which tells the operating system that it can free this thread's resources on termination without another thread having to call pthread_join() to collect details about its exit status. This is much like how a parent process needs to call wait() to reap a terminated child. Otherwise, this function has all the usual calls needed to handle the client request. Remember, however, that once this function returns, the thread ends, since this function is the one that was specified as the thread's start routine.
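As an aside, instead of having each thread detach itself, the thread can be created in the detached state to begin with, using a thread attributes object. ZeroHTTPd doesn't do this; the following is just a sketch of the standard pthreads attribute calls, as they might be used in enter_server_loop():

```c
/* Sketch: create handler threads already detached so handle_client()
 * never needs to call pthread_detach() on itself. */
pthread_attr_t attr;
pthread_t tid;

pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

pthread_create(&tid, &attr, &handle_client,
               (void *)(intptr_t) client_socket);

/* The attribute object can be reused for every thread and
 * destroyed once we are done creating threads. */
pthread_attr_destroy(&attr);
```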
Performance of the Threaded Server
Here is our table of performance numbers (requests served per second) for easy reference:
| concurrency | iterative | forking | preforked | threaded | prethreaded | poll | epoll |
|---|---|---|---|---|---|---|---|
| 20 | 7 | 112 | 2,100 | 1,800 | 2,250 | 1,900 | 2,050 |
| 50 | 7 | 190 | 2,200 | 1,700 | 2,200 | 2,000 | 2,000 |
| 100 | 7 | 245 | 2,200 | 1,700 | 2,200 | 2,150 | 2,100 |
| 200 | 7 | 330 | 2,300 | 1,750 | 2,300 | 2,200 | 2,100 |
| 300 | – | 380 | 2,200 | 1,800 | 2,400 | 2,250 | 2,150 |
| 400 | – | 410 | 2,200 | 1,750 | 2,600 | 2,000 | 2,000 |
| 500 | – | 440 | 2,300 | 1,850 | 2,700 | 1,900 | 2,212 |
| 600 | – | 460 | 2,400 | 1,800 | 2,500 | 1,700 | 2,519 |
| 700 | – | 460 | 2,400 | 1,600 | 2,490 | 1,550 | 2,607 |
| 800 | – | 460 | 2,400 | 1,600 | 2,540 | 1,400 | 2,553 |
| 900 | – | 460 | 2,300 | 1,600 | 2,472 | 1,200 | 2,567 |
| 1,000 | – | 475 | 2,300 | 1,700 | 2,485 | 1,150 | 2,439 |
| 1,500 | – | 490 | 2,400 | 1,550 | 2,620 | 900 | 2,479 |
| 2,000 | – | 350 | 2,400 | 1,400 | 2,396 | 550 | 2,200 |
| 2,500 | – | 280 | 2,100 | 1,300 | 2,453 | 490 | 2,262 |
| 3,000 | – | 280 | 1,900 | 1,250 | 2,502 | wide variations | 2,138 |
| 5,000 | – | wide variations | 1,600 | 1,100 | 2,519 | – | 2,235 |
| 8,000 | – | – | 1,200 | wide variations | 2,451 | – | 2,100 |
| 10,000 | – | – | wide variations | – | 2,200 | – | 2,200 |
| 11,000 | – | – | – | – | 2,200 | – | 2,122 |
| 12,000 | – | – | – | – | 970 | – | 1,958 |
| 13,000 | – | – | – | – | 730 | – | 1,897 |
| 14,000 | – | – | – | – | 590 | – | 1,466 |
| 15,000 | – | – | – | – | 532 | – | 1,281 |
The most useful comparison for the threaded server is with the forking server. At lower concurrency levels, you can see around a 10x performance improvement over the forking design. At concurrencies of several hundred connections, it still outperforms the forking server by 4-5x. Also, very importantly, it copes with concurrency levels of up to 5,000 connections with stable numbers, although throughput begins to slide past roughly 1,500 concurrent connections, and at 8,000 connections we see wide variations rather than a steady rate.
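As mentioned earlier, the interesting number hiding behind this table is the cost of creating a process versus creating a thread per connection. If you want a rough, first-order feel for that gap on your own machine, a standalone micro-benchmark along these lines works. This program is hypothetical and not part of ZeroHTTPd:

```c
/* Micro-benchmark sketch: compare the cost of spawning-and-reaping a
 * process against spawning-and-joining a thread.
 * Build with: gcc -O2 -pthread bench.c -o bench */
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 1000

static void *noop(void *arg)
{
    (void) arg;
    return NULL;
}

static double elapsed_ms(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1000.0 +
           (b->tv_nsec - a->tv_nsec) / 1e6;
}

int main(void)
{
    struct timespec start, end;

    /* Process creation: fork a child that exits at once, then reap it,
     * mirroring what the forking server does per connection. */
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);
        waitpid(pid, NULL, 0);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("fork():           %8.2f ms for %d iterations\n",
           elapsed_ms(&start, &end), ITERATIONS);

    /* Thread creation: start a no-op thread and join it, mirroring the
     * waitpid() above so the comparison stays apples-to-apples. */
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_t tid;
        pthread_create(&tid, NULL, noop, NULL);
        pthread_join(tid, NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("pthread_create(): %8.2f ms for %d iterations\n",
           elapsed_ms(&start, &end), ITERATIONS);

    return 0;
}
```

On a typical Linux machine, the thread loop finishes considerably faster than the fork loop, which lines up with the gap between the forking and threaded columns in the table above.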
In the next article in this series, we will see how using a thread pool improves performance over the thread-per-connection design described here.