VARNISHSTAT – VARNISH PERFORMANCE TUNING (PART II)
Using varnishstat
In Part I of this series, we talked about basic variables that can be used to tune performance of Varnish. In this part of the series, we will discuss tools that can help us understand what to optimize and by how much.
The single most important tool for tuning Varnish performance is varnishstat. The tool comes bundled with Varnish and shows you a number of counters and rates which, when interpreted, will show you which variables you must tune further to get the most out of your Varnish installation. There are two ways to run varnishstat, with or without the ‘-1’ flag. If you run it without the flag, you will get a continuously updating display which will show you rates. If you run it with the flag, you will just get a static output over the last update interval. We will show just the important variables and explain some of them.
To better explain what each metric is and what are the most important ones, we will group them to the following groups:
- Client and backend related
- Worker thread related
- ESI related
- Storage backend related
- Client and Backend Related
# varnishstat -1
client_conn 4234206 41.27 Client connections accepted
client_drop 0 0.00 Connection dropped, no sess/wrk
client_req 29233157 284.94 Client requests received
cache_hit 32093887 312.82 Cache hits
cache_hitpass 921 0.01 Cache hits for pass
cache_miss 422706 4.12 Cache misses
backend_conn 57122 0.56 Backend conn. success
backend_unhealthy 0 0.00 Backend conn. not attempted
backend_busy 0 0.00 Backend conn. too many
backend_fail 0 0.00 Backend conn. failures
The most important variables to watch are client_drop, backend_busy, backend_unhealthy and backend_fail. The first one usually happens when you go over session_max or queue_max. The default values for those two variable are good enough and if you see client_drop increasing you should look into other things (thread workers, queue sizes, backend speed, cache hit ratio, etc.) and not blindly increase those two. The backend_busy variable indicates that you have reached the maximum amount of connections to your backend. Do not blindly increase that variable either as you may overload your backend. Last two variables, backend_unhealthy and backend_fail, indicate that either your backend was declared unhealthy by Varnish due to failing checks or it was a pure connection failure (network issue for example).
Worker Threads Related
# varnishstat -1
n_wrk 100 . N worker threads
n_wrk_create 2853 0.03 N worker threads created
n_wrk_failed 0 0.00 N worker threads not created
n_wrk_max 0 0.00 N worker threads limited
n_wrk_lqueue 0 0.00 work request queue length
n_wrk_queued 13614 0.13 N queued work requests
n_wrk_drop 0 0.00 N dropped work requests
Worker thread related metrics give you insight if you have properly tuned your thread pools and their sizes. n_wrk_max metric will show you how many times you have exhausted your thread pools and threads failed to be created. Queue metric, n_wrk_lqueue, shows you the current queue length (requests waiting on worker thread to become available). Two metrics, n_wrk_queued and n_wrk_drop show you how many times a request has been queued and how many times it was dropped due exceeding queue length.
ESI Related
# varnishstat -1
esi_errors 0 0.00 ESI parse errors (unlock)
esi_warnings 0 0.00 ESI parse warnings (unlock)
If you are using Edge Side Includes [1], esierrors and esiwarnings will give you information about the validity of your ESI syntax. If you see them increasing, inspect what is returned by the backend regarding ESI and fix any errors found.
Storage Backend Related
# varnishstat -1
n_lru_nuked 128154 . N LRU nuked objects
SMA.s0.c_req 1129966 10.98 Allocator requests
SMA.s0.c_fail 128616 1.25 Allocator failures
SMA.s0.g_bytes 4294824024 . Bytes outstanding
SMA.s0.g_space 143272 . Bytes available
SMA.Transient.c_req 24264 0.24 Allocator requests
SMA.Transient.c_fail 0 0.00 Allocator failures
SMA.Transient.g_bytes 27464 . Bytes outstanding
SMA.Transient.g_space 0 . Bytes available
Here we have an example of the malloc storage backend and you can see that there are two types of storage, s0 and Transient. Transient storage is used by Varnish to store short-lived objects. What Varnish considers short-lived is defined by the shortlived variable – value is in seconds and it defaults to 10. Storage s0 is the storage defined with the -s flag and it is used to store all other objects. Varnishstat displays information about failed allocations in s0.c_fail and this is usually a result of reaching the cache size limit. The last variables, g_bytes and g_bytes show you the amount of used and free space. One of the most important variables found in varnishstat is the n_lru_nuked variable, which tells you how many objects got removed from cache based on the LRU algorithm. If this number is increasing fast, consider raising your cache size.
Conclusion
We have explained how varnishstat works and how it can be used to debug your Varnish installation. Hopefully this will be enough for you to start using varnishstat and understand where your Varnish might be bottlenecking or working non-optimally.
[1] ESI specification – http://www.w3.org/TR/esi-lang