ApacheBench & HTTPerf
People's beliefs and convictions are almost always gotten at second hand, and without examination.
Web Server Benchmark Tools: ApacheBench and HTTPerf
If you believe in the merits of making your own opinion and want to test a server or a Web application, then this page may help you.
This information is for Linux. I left the Windows world (where I spent 30 years of my life) after I discovered in 2009 how much better G-WAN performs on Linux (but the ab.c source code below works on both environments, to let you compare them).
Multi-Core CPUs
In 2000, Intel shipped its last (mainstream) single-Core CPU design, the Pentium 4. All its successors have been multi-Core CPUs, making single-Core CPUs obsolete.
In the past, CPUs got faster by raising the clock frequency. But at 4GHz, heat dissipation problems became unmanageable. To continue delivering more power, CPUs started to embed several small CPUs (the CPU 'Cores') etched at a smaller scale.
Programs that do not exploit the new CPU Cores will not run much faster on new CPUs. Established software vendors face a serious challenge because their product lines were designed at a time when parallelism was not a concern on mass-market PCs and servers.
Around 2020, Moore's law will collapse as transistors approach the size of an atom, making it impossible to pack more Cores into CPUs. Then, making more powerful CPUs will require breaking the laws of physics as we know them today. In the meantime, writing more efficient software is the only way to make computers run faster.
To test SMP servers, you need to use CPUs with many Cores (and use several workers on the client and server sides).
64-Bit Linux
Use the most recent version of a 64-bit Linux distribution. Even when G-WAN runs as a 32-bit process, a 64-bit kernel works twice as fast as a 32-bit kernel. This is what my tests have shown over the last two years, and this is probably the easiest way for you to save on hardware and energy consumption.
System configuration
The first issue is the lack of file descriptors (the default is only 1,024 files per process, resulting in very poor performance).
The second issue is the lack of TCP port numbers. As it takes time to fully close connections lingering in the TIME_WAIT state, the number of available ports quickly decreases, and new connections cannot be established until the system releases them. The default client port range is narrow (typically [32,768 - 61,000] on Linux, [1,024 - 5,000] on some other systems) and must be extended to the whole [1,024 - 65,535] ephemeral port range.
If you don't, you will quickly run out of ports held in the TIME_WAIT state and the AB, HTTPerf or Weighttp tools will produce errors like:
"error: connect() failed: Cannot assign requested address (99)"
To avoid these issues and improve general performance, you have to change the following system options:
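ulimit -aH (this shows your hard limits)
sudo sh -c "ulimit -HSn 200000" (this sets your limit; the quotes are needed so that sh receives the whole command, and the new limit only applies to that shell and its children)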
To make the following options permanent (available after a reboot) you must edit a couple of system configuration files:
Edit the file /etc/security/limits.conf:
sudo gedit /etc/security/limits.conf
And add the values below:
* soft nofile 200000
* hard nofile 200000
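After logging in again, it can be worth confirming that the new limits actually reached your session. The sketch below uses Python's standard `resource` module to read the open-file limits of the current process; the helper names `nofile_limits` and `is_enough` are illustrative only, and the 200000 target is simply the value used above.

```python
import resource

def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def is_enough(soft_limit, wanted):
    """True when the soft limit covers `wanted` descriptors.

    resource.RLIM_INFINITY means that no limit is enforced.
    """
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= wanted

if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"soft={soft} hard={hard} enough for 200000: {is_enough(soft, 200000)}")
```

Only the soft limit is enforced at run time; the hard limit is the ceiling up to which an unprivileged process may raise its own soft limit.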
Edit the file /etc/sysctl.conf:
sudo gedit /etc/sysctl.conf
And add the values below:
# "Performance Scalability of a Multi-Core Web Server", Nov 2007
# Bryan Veal and Annie Foong, Intel Corporation, Page 4/10
fs.file-max = 5000000
net.core.netdev_max_backlog = 400000
net.core.optmem_max = 10000000
net.core.rmem_default = 10000000
net.core.rmem_max = 10000000
net.core.somaxconn = 100000
net.core.wmem_default = 10000000
net.core.wmem_max = 10000000
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_congestion_control = bic
net.ipv4.tcp_ecn = 0
net.ipv4.tcp_max_syn_backlog = 12000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_mem = 30000000 30000000 30000000
net.ipv4.tcp_rmem = 30000000 30000000 30000000
net.ipv4.tcp_sack = 1
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_wmem = 30000000 30000000 30000000
# optionally, avoid TIME_WAIT states on localhost no-HTTP Keep-Alive tests:
# "error: connect() failed: Cannot assign requested address (99)"
# On Linux, the 2MSL time is hardcoded to 60 seconds in /include/net/tcp.h:
# #define TCP_TIMEWAIT_LEN (60*HZ)
# The option below is safe to use:
net.ipv4.tcp_tw_reuse = 1
# The option below lets you reduce TIME_WAITs further
# but this option is for benchmarks, NOT for production (NAT issues)
# (it was removed in Linux 4.12; drop this line on newer kernels)
net.ipv4.tcp_tw_recycle = 1
Then save the file and make the system reload it:
sudo sysctl -p /etc/sysctl.conf
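To check that the kernel really took the new values, you can read them back from /proc/sys, where every sysctl key is exposed as a file. This is a minimal sketch: the function names are made up for the example, and the key-to-path mapping assumes that no component of the key contains a dot (which holds for the keys listed above).

```python
from pathlib import Path

def sysctl_path(key, root="/proc/sys"):
    """Map a sysctl key such as 'net.core.somaxconn' to its file under /proc/sys."""
    return Path(root).joinpath(*key.split("."))

def read_sysctl(key, root="/proc/sys"):
    """Return the current value with whitespace normalized, or None if unreadable."""
    try:
        return " ".join(sysctl_path(key, root).read_text().split())
    except OSError:
        return None

if __name__ == "__main__":
    for key in ("fs.file-max", "net.core.somaxconn", "net.ipv4.ip_local_port_range"):
        print(key, "=", read_sysctl(key))
```

A value of None usually means that the key does not exist on the running kernel, which is a hint that the corresponding line in /etc/sysctl.conf was rejected.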
The options above are important because values that are too low just block benchmarks. You will find other options in the ab.c wrapper described below.
If enabled, SELinux may prevent G-WAN from raising the number of file descriptors. If this is the case, apply the following SELinux module:
/usr/sbin/semodule -DB
service auditd restart
service gwan restart
grep gwan /var/log/audit/audit.log | audit2allow -M gwan_maxfds
semodule -i gwan_maxfds.pp
service gwan start
Starting gwan: [ OK ]
/usr/sbin/semodule -B
The number of file descriptors used by G-WAN can be found in /proc:
cat /proc/`ps ax | grep gwan | grep -v grep | awk -F " " '{print $1}'`/limits | grep "Max open files"
Max open files 2048 2048 files
This is good for a one-time check, but don't use the above command to constantly monitor G-WAN; use the more efficient ab.c program described below.
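The same one-time check can be scripted. The sketch below parses the text of a process's limits file under /proc and extracts the "Max open files" row; the function names are hypothetical, and the column layout (soft limit, then hard limit, then unit) is the one shown in the output above.

```python
def _to_limit(field):
    """Convert a limits field to int; 'unlimited' becomes None."""
    return None if field == "unlimited" else int(field)

def parse_max_open_files(limits_text):
    """Return (soft, hard) from the text of a /proc limits file, or None if absent."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # fields: 'Max', 'open', 'files', soft, hard, 'files'
            fields = line.split()
            return _to_limit(fields[3]), _to_limit(fields[4])
    return None

def max_open_files(pid):
    """Read the limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as limits_file:
        return parse_max_open_files(limits_file.read())
```

Calling `max_open_files("self")` reports the limits of the Python process itself; pass the G-WAN process id to inspect the server.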
IBM - ApacheBench (AB)
To install ApacheBench:
sudo apt-get -y install apache2-utils
Basic usage (ab -h for more options):
ab -n 100000 -c 100 -t 1 -k "http://127.0.0.1:8080/100.html"
-n ........ number of HTTP requests
-c ........ number of concurrent connections
-k ........ enable HTTP keep-alives
-t ........ number of seconds of the test
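When AB is run from a script, the figure of interest is the "Requests per second" line of its report. Here is a rough sketch of how that could be automated; `run_ab` and `parse_requests_per_second` are names invented for this example, and the `-t` option is deliberately left out because ab's documentation notes that it implies `-n 50000` internally.

```python
import re
import subprocess

def parse_requests_per_second(ab_output):
    """Extract the 'Requests per second' figure from ab's text report, or None."""
    match = re.search(r"^Requests per second:\s+([0-9.]+)", ab_output, re.MULTILINE)
    return float(match.group(1)) if match else None

def run_ab(url, requests=100000, concurrency=100, keepalive=True):
    """Run ab once and return requests/second (ab must be installed)."""
    cmd = ["ab", "-n", str(requests), "-c", str(concurrency)]
    if keepalive:
        cmd.append("-k")
    cmd.append(url)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_requests_per_second(result.stdout)
```

For instance, `run_ab("http://127.0.0.1:8080/100.html")` mirrors the command line shown above, minus the time limit.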
AB is reliable, simple to understand and easy to use. Its only defects are a relatively high CPU usage and its inability to put SMP (Symmetric Multiprocessing) servers (which use several worker threads) under pressure.
This is because AB uses a single thread and an outdated event polling method. AB was made at a time when CPU Cores did not exist, and this now makes AB mostly irrelevant for testing the load of a modern SMP server.
Knowing this, the authors of single-threaded servers usually use AB to compare themselves to SMP servers. This is because their servers are much slower with SMP clients like Weighttp, presented below.
Lighttpd - Weighttp (WG)
Like IBM AB, Weighttp has been written by Web server authors, probably because they felt the (real) need for a serious HTTP stress tool able to test modern multi-Core CPUs. To install Weighttp:
wget http://github.com/lighttpd/weighttp/zipball/master
unzip lighttpd-weighttp-v0.2-6-g1bdbe40.zip
cd lighttpd-weighttp-v0.2-6-g1bdbe40
sudo apt-get install libev-dev
gcc -g2 -O2 -DVERSION='"123"' src/*.c -o weighttp -lev -lpthread
sudo cp ./weighttp /usr/local/bin
Basic usage (weighttp -h for more options):
weighttp -n 100000 -c 100 -t 4 -k "http://127.0.0.1:8080/100.html"
-n ........ number of HTTP requests
-c ........ number of concurrent connections (default: 1)
-k ........ enable HTTP keep-alives (default: none)
-t ........ number of threads of the test (default: 1, use one thread per CPU Core)
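As an illustration of the "one thread per CPU Core" advice, the sketch below builds a Weighttp command line from the options listed above and picks the thread count from the machine it runs on. `weighttp_command` is a hypothetical helper, and `os.cpu_count()` reports logical CPUs (hyper-threads included), so pass `threads` explicitly if you want to match physical Cores.

```python
import os

def weighttp_command(url, requests=100000, concurrency=100, threads=None, keepalive=True):
    """Build the argument list for one weighttp run."""
    if threads is None:
        threads = os.cpu_count() or 1
    # A thread that owns no connection would have nothing to do.
    threads = min(threads, concurrency)
    cmd = ["weighttp", "-n", str(requests), "-c", str(concurrency), "-t", str(threads)]
    if keepalive:
        cmd.append("-k")
    cmd.append(url)
    return cmd

if __name__ == "__main__":
    print(" ".join(weighttp_command("http://127.0.0.1:8080/100.html")))
```

The resulting list can be handed to `subprocess.run` as is.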
Based on epoll on Linux, Weighttp is much faster than AB, even with a single thread. But its real value shows when you use as many threads/processes as there are CPU Cores on the server you target, because THIS IS THE ONLY WAY TO REALLY TEST AN SMP SERVER (that is, a server using several worker threads; by default G-WAN uses one thread per CPU Core).
With Weighttp being so fast, you will almost certainly hit the TIME_WAIT state wall (see the TIME_WAIT fix above in the "System Configuration" paragraph).
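A back-of-the-envelope calculation shows how close that wall is. The sketch assumes that all connections go from one client IP address to one server address and port, that every closed connection keeps its client port busy for the whole TIME_WAIT period (60 seconds, the TCP_TIMEWAIT_LEN value quoted in the sysctl comments), and that neither tcp_tw_reuse nor keep-alives are in play. The function name is illustrative.

```python
def max_new_connections_per_second(first_port, last_port, time_wait_seconds=60):
    """Upper bound on the sustained rate of new connections before ports run out."""
    usable_ports = last_port - first_port + 1
    return usable_ports / time_wait_seconds

if __name__ == "__main__":
    rate = max_new_connections_per_second(1024, 65535)
    print(f"{rate:.0f} new connections per second at most")  # prints 1075
```

Even with the fully extended port range, a test without keep-alives cannot sustain much more than a thousand new connections per second under these assumptions, far below what the tools can generate.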
Weighttp is by far the best stress tool I know today: it uses the clean AB interface and works reasonably well. It could be made even faster by using leaner code, but there are not many serious coders investing their time to write decent client tools, it seems.
Hewlett Packard - HTTPerf
Basic usage (httperf -h for more options):
httperf --server=127.0.0.1 --port=8080 --rate=100 --num-conns=100 --num-calls=100000 --timeout=5 --hog --uri=/100.html
Yes, HTTPerf is more complex than AB. This is visible at first glance in its syntax.
And HTTPerf does not let you specify the concurrency level, nor the duration of the test:
--num-calls ......... number of HTTP requests per connection (> 1 for keep-alives)
--num-conns ......... total number of connections to create
--rate .............. number of connections to start per second
If we want 100,000 HTTP requests, we have to calculate which '--num-conns' and '--num-calls' values to specify for a given '--rate'. The number of requests started each second is:
nbr_req = rate * num-calls
and the total for the whole test is num-conns * num-calls.
'--num-conns' makes the test last longer, but to sustain any given '--rate', '--num-conns' must always be >= '--rate'.
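The arithmetic is easier to follow with numbers. The sketch below relies on httperf's documented model (each of the num-conns connections issues num-calls requests, and new connections are started at the given rate); `httperf_plan` and its dictionary keys are names chosen for this example, and the ramp time is approximate.

```python
def httperf_plan(rate, num_conns, num_calls):
    """Summarize the load that a set of httperf parameters asks for."""
    return {
        # every connection issues num_calls requests
        "total_requests": num_conns * num_calls,
        # requests belonging to the connections opened during one second
        "requests_started_per_second": rate * num_calls,
        # approximate time needed to open all connections
        "ramp_seconds": num_conns / rate,
    }

if __name__ == "__main__":
    # the values of the command line shown above
    print(httperf_plan(rate=100, num_conns=100, num_calls=100000))
    # one way to ask for 100,000 requests in total
    print(httperf_plan(rate=100, num_conns=100, num_calls=1000))
```

With the command line shown above, the run asks for 10,000,000 requests in total; to get 100,000 requests at the same rate, num-calls would have to be 1,000.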
HTTPerf takes great care to create new connections progressively and it only collects statistics after 5 seconds. This was probably done to 'warm-up' servers that have problems with 'cold' starts and memory allocation.
Removing this useful information from benchmark tests makes them NOT reflect reality (where clients send requests in short but intense bursts).
Also, HTTPerf's pointlessly long shots for each test make the TIME_WAIT state become a problem (see the TIME_WAIT fix above in the "System Configuration" paragraph).
Finally, HTTPerf cannot test client concurrency accurately: if rate=1 but num-conns=2 and num-calls=100000, you are more than likely to end up with concurrent connections (despite rate=1) because not all HTTP requests of the first connection will have been processed when the second connection is launched.
And if you use a smaller num-calls value, you are testing the TCP/IP stack (creating TCP/IP connections is slow and this is done by the kernel, not by the user-mode HTTP server or Web application that you want to test).
As a result, HTTPerf can only be reliably used without HTTP Keep-Alives (with num-calls=1). And even in this case, I found ApacheBench to be a far better proposition.
The ab.c wrapper for ApacheBench, Weighttp and HTTPerf
If you run a one-shot test at a single concurrency level, you will hardly get the same result twice. Benchmarks that can be reproduced have more value.
This is why running a test on the [1 - 1,000] concurrency range makes sense (especially if you are using 10 rounds for each concurrency test).
With such a long (and continuous) string of tests, you get more relevant results. A general trend can be extracted from the whole test, and the slope of each server's results curve is as useful as its variability for interpreting the behavior of a program.
But running ApacheBench 1,000 times (or more) for each server, in a continuous way, and each time with different parameters, is a tedious task (best left to computers).
The ab.c program does just that: it lets you define the URLs to test, the range, and it collects the results in a CSV file suitable for charting with LibreOffice or gnuplot (apt-get install gnuplot).
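To make the idea concrete, here is a small sketch of such a sweep. It is not the ab.c code: the function names, the min/avg/max columns and the `run_once` callback (any function that takes a concurrency level and returns requests per second) are assumptions made for this illustration.

```python
import csv
import io
import statistics

def sweep(run_once, concurrencies, rounds=10):
    """Call run_once(concurrency) `rounds` times per level; yield one summary row each."""
    for concurrency in concurrencies:
        samples = [run_once(concurrency) for _ in range(rounds)]
        yield {
            "concurrency": concurrency,
            "min": min(samples),
            "avg": statistics.mean(samples),
            "max": max(samples),
        }

def write_csv(rows, stream):
    """Write the summary rows as CSV, ready for LibreOffice or gnuplot."""
    writer = csv.DictWriter(stream, fieldnames=["concurrency", "min", "avg", "max"])
    writer.writeheader()
    writer.writerows(rows)

if __name__ == "__main__":
    fake_run = lambda concurrency: 1000.0 * concurrency  # stand-in for a real test
    out = io.StringIO()
    write_csv(sweep(fake_run, range(1, 4), rounds=3), out)
    print(out.getvalue())
```

Keeping the minimum and maximum next to the average is what lets you read the variability of each server from the chart, not just its trend.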
ab.c lets you choose between ApacheBench, Weighttp or HTTPerf, but if you have read this page, you know which one to use. Here again, ab.c leaves the choice to you, so you can form your own opinion on the matter.
Know what you test
Keep in mind that Web servers do NOT receive or send data. The OS kernel is doing it.
So, when you are serving a large file (a file that requires many TCP packets, each packet being up to 1,500 bytes in size), you are testing the OS kernel rather than the Web server.
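A quick calculation shows how many packets are involved. The sketch assumes a 1,500-byte Ethernet MTU and 20-byte IP and TCP headers with no options (with TCP timestamps enabled, each packet carries a little less); the function name is illustrative.

```python
import math

def tcp_segments_needed(payload_bytes, mtu=1500, ip_header=20, tcp_header=20):
    """Number of TCP segments needed to carry a payload of the given size."""
    max_segment_size = mtu - ip_header - tcp_header  # 1,460 bytes by default
    return math.ceil(payload_bytes / max_segment_size)

if __name__ == "__main__":
    print(tcp_segments_needed(100))          # prints 1
    print(tcp_segments_needed(1024 * 1024))  # prints 719
```

A 100-byte file fits in a single packet, while a 1 MiB file keeps the kernel busy with several hundred of them for each request.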
Benchmarks use a 100-byte file to let each server show how good it is at *parsing* client requests and *building* a reply (depending on the Web server the real payload is 3-8 times larger because on top of the HTML file you have HTTP headers and TCP packet headers).
For the same reason, HTTP Keep-Alives should be used to test Web servers: establishing new connections is very slow - and this is done by the OS kernel rather than by the Web server. When you create many new connections per second, you test the OS kernel, not the user-mode application.
Further, AJAX applications heavily rely on HTTP keep-alives, making them more than relevant on today's Web.
But even with a small 100-byte file and HTTP Keep-Alives, most of the time is consumed by the saturation of the CPU address bus due to broadcast snoops. That, as Intel R&D recognized, is the bottleneck.
Future multi-Core CPUs will only make things better for G-WAN and worse for all others.
Not all 3.0 GHz CPUs are Equal
All our 6-Core tests are made with this MacPro CPU (identified as follows in the gwan.log file):
Intel(R) Xeon(R) CPU W3680 @ 3.33GHz (6 Cores/CPU, 2 threads/Core)
But (extensive) third-party CPU tests show that many same-frequency CPUs are not as fast (in 2012, some same-frequency CPUs are 5 times slower and a few others are 1.5 times faster). You can identify your CPU here.
Moral of the story: a "3 GHz Xeon CPU" may give different results from another "3 GHz Xeon CPU", so the exact reference is needed to better identify the test platform.
And the same kind of approximations (or omissions) in the testbed environment (OS type and configuration, hardware, drivers, network devices) lead to similar inaccuracies, making it impossible to validate undocumented results.
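One simple habit is to record the test platform next to every result file. The sketch below gathers a few of these details with Python's standard library; the function names and the selection of fields are only a suggestion, and the "model name" line is the one found in /proc/cpuinfo on x86 Linux (other architectures may not provide it).

```python
import platform

def parse_cpu_model(cpuinfo_text):
    """Return the first 'model name' value found in /proc/cpuinfo text, or None."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "model name":
            return value.strip()
    return None

def testbed_summary(cpuinfo_path="/proc/cpuinfo"):
    """Collect the CPU reference and the OS details of the machine running the test."""
    try:
        with open(cpuinfo_path) as cpuinfo:
            cpu = parse_cpu_model(cpuinfo.read())
    except OSError:
        cpu = None
    return {
        "cpu": cpu,
        "os": platform.system(),
        "kernel": platform.release(),
        "arch": platform.machine(),
    }

if __name__ == "__main__":
    print(testbed_summary())
```

Hardware such as network devices and drivers still has to be documented by hand.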
Virtualization
Virtualization is another hardware abstraction layer on top of the OS kernel (which, to avoid more bugs, additional critical security holes and further loss of performance, is the only abstraction layer that we should be running on any given machine).
And it is not only slower: it also has a completely different performance profile because everything is encapsulated in new code (for example, memory allocation is notoriously hurt by virtualization, even more than all other tasks).
So, instead of having the OS kernel as the bottleneck (like on a normal machine), you have a (much) slower 'virtual machine' as the new bottleneck (see "Multi-Core scaling in a virtualized environment").
Unsurprisingly, if the speed is limited to 30km/h, then a car will not 'run faster' than a bicycle.
Beware what you are testing.
Conclusion
The fact that benchmarking tools do not tell you how to run successful tests should raise some questions. So should the fact that Web/Proxy server software publishers rarely publish extensive comparative benchmarks.
Do your homework! At least now you know how to proceed.