VARNISH PERFORMANCE TUNING (PART I)
Introduction
Varnish is a caching engine that is installed in front of a website to speed up delivery. It is used in front of many high traffic sites to do just that without requiring high number of backend servers.
Reading this article series will get you through basic Varnish performance tuning as well as monitoring crucial metrics. You will also learn how to use Varnish effectively with other popular software (like HAproxy and Nginx) and what the common pitfalls are. Last thing to keep in mind when building large scale infrastructures, like the ones managed by A-nine, is to never over-engineer or over-optimize a solution.
Varnish Storage Engines
Varnish has three main storage engines: malloc, file and persistent. The last one is experimental and will likely be removed from future versions of Varnish, so we will not consider it.
File based storage engine is, well, a file. mmap(2) is called to map the file to Varnish’s virtual memory. It is worth noting that file is not persistent and every restart of varnish will cause all of cache to be dropped. File backed storage should be used when you don’t have enough main memory to cache all content and the speed of your IO subsystem overcomes the speed of content generation. For example, you have a slow storage where your content resides and you want to keep the hot content in cache on smaller, but faster, IO system.
Malloc based storage works so that each object is allocated memory space using the malloc(3) system call. This is the fastest type of cache as memory access latency and throughput are a few orders of magnitude lower/faster than even the fastest SSD drives.
Long story short, use malloc backend if the content set of interest fits into memory and file if it doesn’t. Storage backends are chosen with the -s parameter. For example:
-s malloc,12G
Varnish threads
Second most important thing to tune are Varnish threads. Varnish uses a few types of threads performing various tasks, but here we will focus on the most important ones – cache-worker threads. Cache worker threads are the ones that will actually serve the HTTP request that was sent. There are three variables that we will focus on here: threadpools, threadpoolmin and threadpool_max.
The first variable, thread_pools is, as the name says, number of pools that threads will be grouped into. Each of them will consist of at least thread_pool_min and no more than thread_pool_max. This variable is set to ‘2’ by default and is recommended you leave it like that. No noticeable performance improvement was observed by upping its value.
The second variable, threadpoolmin, defines the minimum amount of threads that always need to be alive in (per thread pool), even if idle. It is wise to keep a decent amount of threads idle (we usually like to keep at least 30-40) as creating threads is an expensive operations, compared to having them run idle. This way if you have a sudden spike in traffic, you will have enough threads to handle the first hit while new ones are being spawned.
The third variable, thread_pool_max, defines the maximum amount of threads to run per thread pool. This variable obviously needs to be high enough to accommodate your traffic and needs to be adjusted per workload. Usually, you don’t want to go over 5000 threads as specified in the documentation.
Last thing to consider is a variable called threadpoolstack and we have had good experience setting it to 256k. Otherwise Varnish threads will use your system default which, depending on the operating system, can cause quite a lot of memory waste.
IO system tuning
Most important thing is to mount the Varnish working directory (where the shmlog is stored) to tmpfs.
tmpfs /usr/lib/varnish tmpfs rw,size=256M 0 0
In case you are using the file backend, make sure you set noatime on the file systems where that file is saved. This will prevent unneeded IO to update the file’s access times (which is useless for Varnish). Be sure that your partitions are aligned, that your RAID systems have decent chunk sizes (depending on your filesize) and that file systems are aligned to those chunk sizes.
Network related
Last, but not least, one must tune the network parameters of Varnish and the operating system (we will stick to Linux here). There is a lot of low level thinking to this, but we will stick to things that have most effect. The single most important thing is to properly size your listen queue size (and the later described sysctl). Listen queue size is passed to the listen(2) system call and will limit the size of not-yet-accepted connections by the application (and connections in SYNRECV state in case smaller than tcpmaxsynbacklog).
Varnish can set the listen backlog size using the the -p parameter as follows:
-p listen_depth=16383
We use 16383 as the number is first incremented by the kernel and then rounded up to the higher power of two. There is also a sysctl that controls the maximum size of the listen backlog, net.somaxconn. You should set it higher than the listen depth, but be careful as it is represented as uint_16 in the kernel, so the maximum value is 65535.
Other variables to consider are: tcp_max_syn_backlog, net.ipv4.tcp_tw_reuse, net.ipv4.tcp_max_tw_buckets, net.ipv4.ip_local_port_range and net.ipv4.tcp_syncookies. Going into kernel tuning details is out of scope of this article.
That would be all in part I. In the next blog post we will see how to understand Varnish variables, as well as monitor and debug it. Stay tuned!