VARNISHSTAT – VARNISH PERFORMANCE TUNING (PART II)

Using varnishstat

In Part I of this series, we talked about the basic variables that can be used to tune the performance of Varnish. In this part of the series, we will discuss tools that help us understand what to optimize and by how much.

The single most important tool for tuning Varnish performance is varnishstat. The tool comes bundled with Varnish and exposes a number of counters and rates which, when interpreted, tell you which variables you should tune further to get the most out of your Varnish installation. There are two ways to run varnishstat: with or without the ‘-1’ flag. Without the flag, you get a continuously updating display that shows you current rates. With the flag, you get a one-time static dump of the counters, with per-second values averaged since Varnish was started. We will show just the important variables and explain some of them.
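
As a quick, hedged illustration (assuming a standard shell with grep available; the counter names are the ones shown later in this article), the first command below starts the continuously updating display, the second dumps all counters once, and the third picks out only the hit and miss counters from the one-shot dump:

# varnishstat
# varnishstat -1
# varnishstat -1 | grep -E 'cache_hit|cache_miss'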

To better explain what each metric means and which ones are the most important, we will group them into the following categories:

  • Client and backend related
  • Worker thread related
  • ESI related
  • Storage backend related

Client and Backend Related

# varnishstat -1
client_conn               4234206      41.27   Client connections accepted
client_drop                     0       0.00   Connection dropped, no sess/wrk
client_req               29233157     284.94   Client requests received
cache_hit                32093887     312.82   Cache hits
cache_hitpass                 921       0.01   Cache hits for pass
cache_miss                 422706       4.12   Cache misses
backend_conn                57122       0.56   Backend conn. success
backend_unhealthy               0       0.00   Backend conn. not attempted
backend_busy                    0       0.00   Backend conn. too many
backend_fail                    0       0.00   Backend conn. failures

The most important variables to watch are client_drop, backend_busy, backend_unhealthy and backend_fail. The first one usually increases when you go over session_max or queue_max. The default values for those two variables are good enough, and if you see client_drop increasing you should look into other things (worker threads, queue sizes, backend speed, cache hit ratio, etc.) rather than blindly increase those two. The backend_busy counter indicates that you have reached the maximum number of connections to your backend. Do not blindly raise that connection limit either, as you may overload your backend. The last two variables, backend_unhealthy and backend_fail, indicate that either your backend was declared unhealthy by Varnish due to failing health checks or there was a plain connection failure (a network issue, for example).
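
As a rough sketch (assuming a standard awk and the counter names from the output above; it ignores cache_hitpass and passed requests), you can compute your current cache hit ratio directly from the one-shot output:

# varnishstat -1 | awk '/^cache_hit /{hit=$2} /^cache_miss /{miss=$2} END{printf "hit ratio: %.2f%%\n", 100*hit/(hit+miss)}'
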
Worker Threads Related

# varnishstat -1
n_wrk                         100          .   N worker threads
n_wrk_create                 2853       0.03   N worker threads created
n_wrk_failed                    0       0.00   N worker threads not created
n_wrk_max                       0       0.00   N worker threads limited
n_wrk_lqueue                    0       0.00   work request queue length
n_wrk_queued                13614       0.13   N queued work requests
n_wrk_drop                      0       0.00   N dropped work requests

Worker thread related metrics give you insight into whether you have properly tuned your thread pools and their sizes. The n_wrk_max metric shows how many times Varnish needed a new thread but was limited by thread_pool_max, while n_wrk_failed counts threads that failed to be created. The queue metric, n_wrk_lqueue, shows the current queue length (requests waiting for a worker thread to become available). Finally, n_wrk_queued and n_wrk_drop show how many times a request has been queued and how many times one was dropped because the queue length was exceeded.
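
If these counters show that the pools are too small, the thread parameters discussed in Part I can also be adjusted at runtime through the management CLI (assuming varnishadm can reach your management interface; the values 100 and 1000 below are purely illustrative, not recommendations for your workload):

# varnishadm param.set thread_pool_min 100
# varnishadm param.set thread_pool_max 1000
# varnishadm param.show thread_pool_max
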
ESI Related

# varnishstat -1
esi_errors                      0       0.00   ESI parse errors (unlock)
esi_warnings                    0       0.00   ESI parse warnings (unlock)

If you are using Edge Side Includes [1], esi_errors and esi_warnings will give you information about the validity of your ESI syntax. If you see them increasing, inspect the ESI markup returned by the backend and fix any errors you find.
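
A simple, hedged way to track such problems down (the hostname, port and path below are hypothetical placeholders for your own backend) is to fetch the object straight from the backend, inspect the ESI markup it returns, and keep an eye on the two counters:

# curl -s http://backend.example.com:8080/fragment-page.html | grep -i '<esi:'
# varnishstat -1 | grep esi_
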
Storage Backend Related

# varnishstat -1
n_lru_nuked                   128154         .   N LRU nuked objects
SMA.s0.c_req                 1129966     10.98   Allocator requests
SMA.s0.c_fail                 128616      1.25   Allocator failures
SMA.s0.g_bytes            4294824024         .   Bytes outstanding
SMA.s0.g_space                143272         .   Bytes available
SMA.Transient.c_req            24264      0.24   Allocator requests
SMA.Transient.c_fail               0      0.00   Allocator failures
SMA.Transient.g_bytes          27464         .   Bytes outstanding
SMA.Transient.g_space              0         .   Bytes available

Here we have an example of the malloc storage backend, and you can see that there are two types of storage, s0 and Transient. Transient storage is used by Varnish to store short-lived objects. What Varnish considers short-lived is defined by the shortlived variable; the value is in seconds and it defaults to 10. Storage s0 is the storage defined with the -s flag and it is used to store all other objects. Varnishstat displays information about failed allocations in SMA.s0.c_fail, which is usually a result of reaching the cache size limit. The last two variables, g_bytes and g_space, show you the amount of used and free space. One of the most important variables found in varnishstat is n_lru_nuked, which tells you how many objects have been evicted from the cache by the LRU algorithm to make room for new ones. If this number is increasing fast, consider raising your cache size.
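
If n_lru_nuked keeps climbing and SMA.s0.c_fail is non-zero, the usual remedy is a larger cache. A hedged sketch of a startup line with a bigger malloc storage (the 4G value and the paths are illustrative only; size the cache to your working set and available RAM) and a quick way to keep an eye on the relevant counters:

# varnishd -a :80 -f /etc/varnish/default.vcl -s malloc,4G
# varnishstat -1 | grep -E 'n_lru_nuked|SMA.s0.c_fail'
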
Conclusion

We have explained how varnishstat works and how it can be used to debug your Varnish installation. Hopefully this will be enough for you to start using varnishstat and to understand where your Varnish setup might be hitting a bottleneck or running sub-optimally.

[1] ESI specification – http://www.w3.org/TR/esi-lang