In Part I of this series, we talked about the basic variables that can be used to tune the performance of Varnish. In this part, we will discuss tools that help us understand what to optimize and by how much.
The single most important tool for tuning Varnish performance is varnishstat. The tool comes bundled with Varnish and shows you a number of counters and rates which, when interpreted, tell you which variables you need to tune further to get the most out of your Varnish installation. There are two ways to run varnishstat: with or without the ‘-1’ flag. Without the flag, you get a continuously updating display that shows current rates. With the flag, varnishstat prints all counters once and exits, which is convenient for scripts and for the examples below. We will show only the important variables and explain some of them.
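Because the ‘-1’ output is plain text with one counter per line, it is easy to load into a script. The following is a minimal sketch in Python, assuming each line has the form "name value rate description" as in the listings in this article; the Counter type and the parse_varnishstat function are illustrative names, not part of Varnish.

```python
from typing import Dict, NamedTuple, Optional


class Counter(NamedTuple):
    value: int
    rate: Optional[float]  # None when varnishstat prints "." instead of a rate
    description: str


def parse_varnishstat(text: str) -> Dict[str, Counter]:
    """Parse `varnishstat -1` style text into a dict keyed by counter name."""
    counters: Dict[str, Counter] = {}
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 3:
            continue  # blank or malformed line
        name, raw_value, raw_rate = parts[:3]
        description = parts[3] if len(parts) > 3 else ""
        try:
            value = int(raw_value)
            rate = None if raw_rate == "." else float(raw_rate)
        except ValueError:
            continue  # skip lines that do not look like counters
        counters[name] = Counter(value, rate, description)
    return counters


# Sample lines copied from the listings in this article.
SAMPLE = """
client_conn 4234206 41.27 Client connections accepted
client_drop 0 0.00 Connection dropped, no sess/wrk
n_wrk 100 . N worker threads
"""

counters = parse_varnishstat(SAMPLE)
for name, counter in counters.items():
    print(name, counter.value, counter.rate)
```

On a live server the text would typically come from running varnishstat -1 through subprocess.run with capture_output=True and text=True, instead of the embedded sample.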
To make the metrics easier to explain and to highlight the most important ones, we will divide them into the following groups:
- Client and backend related
- Worker thread related
- ESI related
- Storage backend related
Client and Backend Related
# varnishstat -1
client_conn            4234206        41.27 Client connections accepted
client_drop                  0         0.00 Connection dropped, no sess/wrk
client_req            29233157       284.94 Client requests received
cache_hit             32093887       312.82 Cache hits
cache_hitpass              921         0.01 Cache hits for pass
cache_miss              422706         4.12 Cache misses
backend_conn             57122         0.56 Backend conn. success
backend_unhealthy            0         0.00 Backend conn. not attempted
backend_busy                 0         0.00 Backend conn. too many
backend_fail                 0         0.00 Backend conn. failures
The most important variables to watch are client_drop, backend_busy, backend_unhealthy and backend_fail. The first one usually increases when you go over session_max or queue_max. The default values for those two variables are good enough, so if you see client_drop increasing you should look into other things (worker threads, queue sizes, backend speed, cache hit ratio, etc.) rather than blindly increasing them. The backend_busy variable indicates that you have reached the maximum number of connections to your backend. Do not blindly increase that limit either, as you may overload your backend. The last two variables, backend_unhealthy and backend_fail, indicate respectively that Varnish declared your backend unhealthy because of failing health checks, or that the connection itself failed (a network issue, for example).
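As a rough illustration of how these counters can be read together, the sketch below computes the cache hit ratio and reports which of the four watched counters are non-zero. It is written in Python and uses the values from the listing above; the function names are hypothetical, and defining the hit ratio as hits divided by hits plus misses is an assumption rather than something varnishstat reports itself.

```python
# Values copied from the client and backend listing in this article.
sample = {
    "client_drop": 0,
    "cache_hit": 32093887,
    "cache_miss": 422706,
    "backend_unhealthy": 0,
    "backend_busy": 0,
    "backend_fail": 0,
}

# The four counters the article singles out as most important.
WATCHED = ("client_drop", "backend_busy", "backend_unhealthy", "backend_fail")


def hit_ratio(counters):
    """Assumed definition: hits / (hits + misses); 0.0 when there were no lookups."""
    lookups = counters["cache_hit"] + counters["cache_miss"]
    return counters["cache_hit"] / lookups if lookups else 0.0


def nonzero_watched(counters):
    """Return the watched counters whose value is greater than zero."""
    return {name: counters[name] for name in WATCHED if counters.get(name, 0) > 0}


ratio = hit_ratio(sample)
problems = nonzero_watched(sample)
print(f"hit ratio: {ratio:.2%}")
print("counters needing attention:", problems or "none")
```

These counters are cumulative, so a non-zero value only says that the event has happened at some point; whether it is still growing between two runs is usually the more useful signal.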
Worker Threads Related
# varnishstat -1
n_wrk                      100          .   N worker threads
n_wrk_create              2853         0.03 N worker threads created
n_wrk_failed                 0         0.00 N worker threads not created
n_wrk_max                    0         0.00 N worker threads limited
n_wrk_lqueue                 0         0.00 work request queue length
n_wrk_queued             13614         0.13 N queued work requests
n_wrk_drop                   0         0.00 N dropped work requests
Worker thread related metrics tell you whether your thread pools and their sizes are properly tuned. The n_wrk_max metric shows how many times a new thread could not be created because the thread pools had reached their maximum size. The queue metric, n_wrk_lqueue, shows the current queue length (requests waiting for a worker thread to become available). The last two metrics, n_wrk_queued and n_wrk_drop, show how many times a request has been queued and how many times a request was dropped because the queue length was exceeded.
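Since these are cumulative counters, what matters is whether they grow between two runs of varnishstat. The Python sketch below compares two snapshots and lists the worker counters that increased. The "before" values come from the listing above, while the "after" values are invented purely for illustration, and counter_growth is a hypothetical helper name.

```python
# Cumulative worker-thread counters discussed in this section.
WORKER_COUNTERS = ("n_wrk_failed", "n_wrk_max", "n_wrk_queued", "n_wrk_drop")


def counter_growth(before, after, names=WORKER_COUNTERS):
    """Return how much each named counter grew between two snapshots."""
    return {name: after.get(name, 0) - before.get(name, 0) for name in names}


# "before" is taken from the listing in this article.
before = {"n_wrk_failed": 0, "n_wrk_max": 0, "n_wrk_queued": 13614, "n_wrk_drop": 0}
# "after" is a made-up later snapshot, used only to demonstrate the comparison.
after = {"n_wrk_failed": 0, "n_wrk_max": 3, "n_wrk_queued": 13700, "n_wrk_drop": 0}

growth = counter_growth(before, after)
growing = [name for name, diff in growth.items() if diff > 0]
print("growth:", growth)
print("growing counters:", growing)
```

In this invented example, growth in n_wrk_max together with n_wrk_queued would suggest that the thread pools are hitting their size limit, while a steady n_wrk_drop means no requests have been dropped yet.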
ESI Related
# varnishstat -1
esi_errors                   0         0.00 ESI parse errors (unlock)
esi_warnings                 0         0.00 ESI parse warnings (unlock)
If you are using Edge Side Includes (ESI), esi_errors and esi_warnings will give you information about the validity of your ESI syntax. If you see them increasing, inspect the ESI markup returned by the backend and fix any errors found.
Storage Backend Related
# varnishstat -1
n_lru_nuked                 128154          .   N LRU nuked objects
SMA.s0.c_req               1129966        10.98 Allocator requests
SMA.s0.c_fail               128616         1.25 Allocator failures
SMA.s0.g_bytes          4294824024          .   Bytes outstanding
SMA.s0.g_space              143272          .   Bytes available
SMA.Transient.c_req          24264         0.24 Allocator requests
SMA.Transient.c_fail             0         0.00 Allocator failures
SMA.Transient.g_bytes        27464          .   Bytes outstanding
SMA.Transient.g_space            0          .   Bytes available
Here we have an example of the malloc storage backend, and you can see that there are two types of storage, s0 and Transient. Transient storage is used by Varnish to store short-lived objects. What Varnish considers short-lived is defined by the shortlived variable; the value is in seconds and defaults to 10. Storage s0 is the storage defined with the -s flag and is used to store all other objects. Varnishstat reports failed allocations in s0.c_fail, and these are usually a result of reaching the cache size limit. The last two variables, g_bytes and g_space, show the amount of used and free space. One of the most important variables found in varnishstat is n_lru_nuked, which tells you how many objects were removed from the cache by the LRU algorithm. If this number is increasing fast, consider raising your cache size.
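To make the storage numbers more tangible, the following Python sketch derives the configured cache size, the fill level and the allocation failure rate from the s0 counters in the listing above. Treating g_bytes plus g_space as the total cache size is an assumption that happens to match this sample, and storage_summary is a hypothetical helper name.

```python
# Values copied from the storage listing in this article.
s0 = {
    "c_req": 1129966,       # allocator requests
    "c_fail": 128616,       # allocator failures
    "g_bytes": 4294824024,  # bytes outstanding (used)
    "g_space": 143272,      # bytes available (free)
}


def storage_summary(stats):
    """Summarise one storage backend from its c_* and g_* counters."""
    # Assumption: used plus free bytes equals the configured cache size.
    total = stats["g_bytes"] + stats["g_space"]
    return {
        "total_bytes": total,
        "fill": stats["g_bytes"] / total if total else 0.0,
        "fail_rate": stats["c_fail"] / stats["c_req"] if stats["c_req"] else 0.0,
    }


summary = storage_summary(s0)
print(f"cache size: {summary['total_bytes'] / 1024 ** 3:.2f} GiB")
print(f"fill level: {summary['fill']:.4%}")
print(f"allocation failure rate: {summary['fail_rate']:.2%}")
```

For this sample the used and free bytes add up to exactly 4 GiB, the cache is almost completely full, and roughly one allocation request in nine fails, which is consistent with the high n_lru_nuked value in the same listing.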
We have explained how varnishstat works and how it can be used to debug your Varnish installation. Hopefully this will be enough for you to start using varnishstat and understand where your Varnish might be bottlenecking or working non-optimally.
 ESI specification – http://www.w3.org/TR/esi-lang