@JonCole
Last active January 6, 2020 17:52
Diagnosing Redis errors caused by issues on the client side

Diagnosing Redis errors on the client side

Customers periodically ask "Why am I getting errors when talking to Redis?" The answer is complicated: it could be a client-side or a server-side problem. In this article, I am going to talk about client-side issues. For server-side issues, see here

Clients can see connectivity issues or timeouts for several reasons. Here are some of the common ones I see:


Memory pressure

Problem: Memory pressure on the client machine leads to all kinds of performance problems that can delay processing of data that the Redis instance sent without any delay. When memory pressure hits, the system typically has to page data out of physical memory to the page file on disk. This page faulting causes the system to slow down significantly.

Measurement:

  1. Monitor memory usage on the machine to make sure that it does not exceed available memory.
  2. Monitor the Page Faults/Sec perf counter. Most systems will have some page faults even during normal operation, so watch for spikes in this counter that coincide with timeouts.
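
One way to line up counter spikes with timeouts is sketched below. It is only an illustration: it assumes you have already exported counter samples and timeout timestamps as plain numbers (seconds), and the function names, the "5x the median" spike rule, and the 5-second window are arbitrary choices made for this sketch, not recommendations from any tool.

```python
from statistics import median


def find_spikes(samples, factor=5.0):
    """Return timestamps whose counter value is far above the typical value.

    samples: list of (timestamp_seconds, counter_value) pairs.
    factor:  hypothetical multiplier; a sample counts as a spike when it
             exceeds factor * median of all samples.
    """
    baseline = median(value for _, value in samples)
    # Guard against a zero baseline so that tiny values are not all "spikes".
    threshold = factor * max(baseline, 1.0)
    return [ts for ts, value in samples if value > threshold]


def timeouts_near_spikes(spikes, timeouts, window_seconds=5.0):
    """Return the timeout timestamps that fall within the window of any spike."""
    return [t for t in timeouts
            if any(abs(t - s) <= window_seconds for s in spikes)]


if __name__ == "__main__":
    page_faults = [(0, 100), (10, 120), (20, 110), (30, 4000), (40, 105)]
    timeout_times = [31, 100]
    spikes = find_spikes(page_faults)
    print(spikes, timeouts_near_spikes(spikes, timeout_times))
```

The same comparison works for any counter sampled over time, such as system-wide CPU usage.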

Resolution: Upgrade to a larger client VM size with more memory or dig into your memory usage patterns to reduce memory consumption.


Burst of traffic

Problem: Bursts of traffic combined with poor ThreadPool settings can result in delays in processing data already sent by the Redis Server but not yet consumed on the client side.

Measurement: Monitor how your ThreadPool statistics change over time using code like this. You can also look at the TimeoutException message from StackExchange.Redis. Here is an example:

System.TimeoutException: Timeout performing EVAL, inst: 8, mgr: Inactive, queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 64221, ar: 0, 
IOCP: (Busy=6,Free=999,Min=2,Max=1000), WORKER: (Busy=7,Free=8184,Min=2,Max=8191)

In the above message, two things stand out:

  1. In both the "IOCP" section and the "WORKER" section, the "Busy" value is greater than the "Min" value. Once the number of busy threads exceeds the minimum, the ThreadPool throttles the creation of new threads, so this means that your ThreadPool settings need adjusting.
  2. You can also see "in: 64221". This indicates that 64221 bytes have been received at the kernel socket layer but haven't yet been read by the application (e.g. StackExchange.Redis). This typically means that your application isn't reading data from the network as quickly as the server is sending it to you.
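
If you collect many of these messages, the two checks above can be automated. The following is a rough sketch that parses only the fields discussed here (the IOCP/WORKER sections and "in:"), using the sample message from this article; the function names are made up for the example, and the message format may differ between StackExchange.Redis versions.

```python
import re

# Sample text copied from the exception shown above (joined onto one line).
MESSAGE = (
    "System.TimeoutException: Timeout performing EVAL, inst: 8, mgr: Inactive, "
    "queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 64221, ar: 0, "
    "IOCP: (Busy=6,Free=999,Min=2,Max=1000), "
    "WORKER: (Busy=7,Free=8184,Min=2,Max=8191)"
)

POOL_RE = re.compile(
    r"(IOCP|WORKER): \(Busy=(\d+),Free=(\d+),Min=(\d+),Max=(\d+)\)"
)
IN_RE = re.compile(r"\bin: (\d+)")


def parse_timeout_message(message):
    """Extract the thread pool stats and the unread byte count ("in:")."""
    pools = {}
    for name, busy, free, minimum, maximum in POOL_RE.findall(message):
        pools[name] = {
            "busy": int(busy),
            "free": int(free),
            "min": int(minimum),
            "max": int(maximum),
        }
    match = IN_RE.search(message)
    unread = int(match.group(1)) if match else None
    return pools, unread


def findings(message):
    """Return human-readable notes for the two conditions described above."""
    pools, unread = parse_timeout_message(message)
    notes = []
    for name, stats in pools.items():
        if stats["busy"] > stats["min"]:
            notes.append(f"{name}: Busy={stats['busy']} exceeds Min={stats['min']}")
    if unread:
        notes.append(f"in: {unread} bytes not yet read by the client")
    return notes


if __name__ == "__main__":
    for note in findings(MESSAGE):
        print(note)
```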

Resolution: Configure your ThreadPool Settings to make sure that your threadpool will scale up quickly under burst scenarios.


High CPU usage

Problem: High CPU usage can cause the client to fail to process a response from Redis in a timely fashion even though Redis sent the response very quickly.

Measurement: Monitor the system-wide CPU usage through the Azure portal or through the associated perf counter. Be careful not to monitor only process CPU, because a single process can have low CPU usage while overall system CPU is high. Watch for spikes in CPU usage that coincide with timeouts. As a result of high CPU, you may also see high "in: XXX" values in TimeoutException error messages, as described above in the "Burst of traffic" section.

Resolution: Upgrade to a larger VM size with more CPU capacity or investigate what is causing CPU spikes.


Client Side Bandwidth Exceeded

Problem: Different sized client machines have limitations on how much network bandwidth they have available. If the client exceeds the available bandwidth, then data will not be processed on the client side as quickly as the server is sending it. This can lead to timeouts.

Measurement: Monitor how your bandwidth usage changes over time using code like this. Note that this code may not run successfully in some environments with restricted permissions (like Azure WebSites).
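
Whatever tool you use to read the network counters, the arithmetic is the same: take two readings of a cumulative byte counter and divide the difference by the elapsed time. Here is a minimal sketch of that calculation; the function names and the 100 Mbps limit in the usage example are placeholders invented for illustration, so substitute the published limit for your own VM size.

```python
def throughput_mbps(bytes_before, bytes_after, seconds):
    """Average throughput in megabits per second between two counter readings."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (bytes_after - bytes_before) * 8 / seconds / 1_000_000


def utilization(bytes_before, bytes_after, seconds, limit_mbps):
    """Fraction of the assumed bandwidth limit used during the interval."""
    return throughput_mbps(bytes_before, bytes_after, seconds) / limit_mbps


if __name__ == "__main__":
    # Two readings of a cumulative "bytes received" counter, 1 second apart.
    # limit_mbps=100 is a made-up figure, not the limit of any real VM size.
    print(throughput_mbps(0, 12_500_000, 1.0))
    print(utilization(0, 12_500_000, 1.0, limit_mbps=100))
```

Utilization that stays close to 1.0 during the periods when timeouts occur suggests the client is hitting its bandwidth limit.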

Resolution: Increase Client VM size or reduce network bandwidth consumption.


Large Request/Response Size

Problem: A large request/response can cause timeouts. As an example, suppose the timeout value configured on your client is 1 second. Your application requests two keys (e.g. 'A' and 'B') at the same time (using the same physical network connection). Most clients support "pipelining" of requests, such that both requests 'A' and 'B' are sent on the wire to the server one after the other without waiting for the responses. The server will send the responses back in the same order. If response 'A' is large enough, it can eat up most of the timeout for subsequent requests.

Below, I will try to demonstrate this. In this scenario, requests 'A' and 'B' are sent quickly and the server starts sending responses 'A' and 'B' quickly, but because of data transfer times, response 'B' gets stuck behind response 'A' and times out even though the server responded quickly.

|-------- 1 Second Timeout (A)----------|
|-Request A-|
     |-------- 1 Second Timeout (B) ----------|
     |-Request B-|
            |- Read Response A --------|
                                       |- Read Response B-| (**TIMEOUT**)
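
The timeline above can also be worked through with numbers. The sketch below is a deliberately simplified model: it assumes responses are read back to back over one connection at a fixed rate, and it ignores server processing time and network latency. The function name, the 10 MB/s rate, and the response sizes are all invented for the example.

```python
def pipelined_completions(requests, bytes_per_second, timeout_seconds):
    """Estimate when each pipelined response finishes being read.

    requests: list of (name, sent_at_seconds, response_bytes) in send order.
    Returns a list of (name, elapsed_seconds, timed_out) tuples.
    """
    results = []
    link_free_at = 0.0  # time at which the previous response finished
    for name, sent_at, response_bytes in requests:
        # A response cannot start until the one ahead of it is fully read.
        start = max(link_free_at, sent_at)
        done = start + response_bytes / bytes_per_second
        link_free_at = done
        elapsed = done - sent_at
        results.append((name, elapsed, elapsed > timeout_seconds))
    return results


if __name__ == "__main__":
    # Hypothetical numbers: 10 MB/s effective rate and a 1 second timeout.
    requests = [
        ("A", 0.00, 9_500_000),  # large response, takes about 0.95 s to read
        ("B", 0.05, 2_000_000),  # smaller response, queued behind A
    ]
    for name, elapsed, timed_out in pipelined_completions(
        requests, bytes_per_second=10_000_000, timeout_seconds=1.0
    ):
        print(f"{name}: {elapsed:.2f}s {'TIMEOUT' if timed_out else 'ok'}")
```

In this model, 'B' would need only about 0.2 seconds on its own, but it exceeds the 1 second timeout because it cannot be read until 'A' has finished.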

Measurement: This is a difficult one to measure. You basically have to instrument your client code to track large requests and responses.

Resolution:

  1. Redis is optimized for a large number of small values, rather than a few large values. The preferred solution is to break up your data into related smaller values.
  2. Increase the size of your VM (for client and Redis Cache Server), to get higher bandwidth capabilities, reducing data transfer times for larger responses. Note that getting more bandwidth on just the server or just on the client may not be enough. Measure your bandwidth usage and compare it to the capabilities of the size of VM you currently have.
  3. Increase the number of ConnectionMultiplexer objects you use and round-robin requests over different connections.
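
The round-robin idea in the last item can be sketched in a few lines. This is a language-neutral illustration rather than StackExchange.Redis code: the class name is hypothetical, and the strings stand in for whatever connection objects your client library provides (in a .NET application these would be ConnectionMultiplexer instances).

```python
import itertools
import threading


class RoundRobinPool:
    """Hand out connections from a fixed list in rotating order."""

    def __init__(self, connections):
        self._connections = list(connections)
        if not self._connections:
            raise ValueError("at least one connection is required")
        self._cycle = itertools.cycle(self._connections)
        self._lock = threading.Lock()  # make selection safe across threads

    def next_connection(self):
        """Return the next connection in the rotation."""
        with self._lock:
            return next(self._cycle)


if __name__ == "__main__":
    # Placeholder strings stand in for real connection objects.
    pool = RoundRobinPool(["conn-0", "conn-1", "conn-2"])
    print([pool.next_connection() for _ in range(5)])
```

Spreading requests this way means a large response on one connection delays only the requests queued behind it on that same connection.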
@AlanMacdonald
But where is the list of error codes for me to compare to what is occurring in the exception? What does wq mean for instance? These error codes are critical info for diagnosing client side error

@merchanf
@AlanMacdonald take a look at this: https://azure.microsoft.com/en-us/blog/investigating-timeout-exceptions-in-stackexchange-redis-for-azure-redis-cache/ I couldn't see wq error, but the other attributes are explained.

@kierenj
kierenj commented May 8, 2018

I have an exception with an in value of ~16KiB.. however, the client VM is super chunky and capable, and the busy/min are 1 and 4. What else could be causing a backlog like this?
