Wednesday, June 19, 2019

Dealing with Connection Timeouts to Redis when "Reclaimed Items" Spikes

Recently I encountered a Redis problem where new Redis connections could not be established from the server and client requests timed out in large numbers. On the AWS managed Redis cluster, a few metrics coincided with the surge of connection issues: "New Connections" and "Reclaimed Items" both spiked during the incident.

Commonly, Redis responds slowly and causes connection issues on the client when "Eviction" events occur. This usually happens when the maximum capacity of the Redis node has been reached and entries are forced out of Redis as a result. However, in my case the "Eviction" metric showed 0, and "Bytes Used for Cache" suggested that Redis had enough RAM. Clearly, the surge of "New Connections" was a symptom rather than a direct cause: when a system stalls, connections are expected to build up while they wait on some resource.
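
To tell the two cases apart outside the AWS console, one option is to compare two snapshots of the Redis INFO counters. The sketch below is only an illustration: `classify_pressure` is a hypothetical helper name, and it assumes the snapshots are plain dictionaries containing the `evicted_keys` and `expired_keys` fields from the INFO "stats" section (for example, what redis-py returns from `info("stats")`).

```python
def classify_pressure(before: dict, after: dict) -> str:
    """Compare two INFO snapshots and report why keys left Redis in between.

    Both arguments are assumed to carry the cumulative counters
    `evicted_keys` and `expired_keys` from the INFO "stats" section.
    """
    evicted = after["evicted_keys"] - before["evicted_keys"]
    expired = after["expired_keys"] - before["expired_keys"]
    if evicted > 0:
        # Keys were forced out because the memory limit was reached.
        return "eviction"
    if expired > 0:
        # Keys were removed because their TTL ran out.
        return "expiration"
    return "none"


# Hypothetical snapshots resembling my incident: no evictions, many expirations.
before = {"evicted_keys": 0, "expired_keys": 100}
after = {"evicted_keys": 0, "expired_keys": 5100}
print(classify_pressure(before, after))
```

With numbers like the ones I saw, the helper reports "expiration" rather than "eviction", which points away from a memory limit problem.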

The AWS documentation describes "Reclaimed Items" as "the total number of key expiration events": https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/CacheMetrics.Redis.html

The Redis documentation explains that "basically [key] expired events are generated when the Redis server deletes the key" after the TTL on a key reaches zero (https://redis.io/topics/notifications). Because the background Redis process keeps deleting items for as long as more than 25% of the sampled keys turn out to be expired (https://redis.io/commands/expire), Redis may take a few seconds to finish deleting when a large number of items expire at once. During that time established connections cannot proceed with their operations, which shows up as connection issues on the client.
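
To make the sampling loop concrete, here is a rough simulation of it. It is a sketch, not Redis code: the sample size of 20 and the 25% threshold follow the description on the EXPIRE page at the time of writing, the function and variable names are made up for illustration, and any limit the real server places on time spent per cycle is ignored.

```python
import random

SAMPLE_SIZE = 20         # keys sampled per round (per the EXPIRE documentation)
REPEAT_THRESHOLD = 0.25  # sample again if more than 25% of the sample was expired


def run_expire_cycle(keys: dict, now: float, rng: random.Random) -> int:
    """Simulate one background expiry cycle.

    `keys` maps key name -> expiry timestamp and is modified in place.
    Returns the number of sampling rounds the cycle needed.
    """
    rounds = 0
    while keys:
        rounds += 1
        sample = rng.sample(list(keys), min(SAMPLE_SIZE, len(keys)))
        expired = [k for k in sample if keys[k] <= now]
        for k in expired:
            del keys[k]
        # Stop once 25% or less of the sample was expired.
        if len(expired) <= REPEAT_THRESHOLD * len(sample):
            break
    return rounds


rng = random.Random(0)
all_expired = {f"key{i}": 0.0 for i in range(1000)}
print(run_expire_cycle(all_expired, now=1.0, rng=rng))  # 50 rounds: 1000 keys / 20 per round
```

When nothing is expired the loop exits after a single round, but when a large batch expires together it keeps going round after round, which is the behaviour that lines up with the "Reclaimed Items" spike.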

Note that this problem is related to both the write rate (writes per second) and the TTL length. You have the option of either reducing the writes per second or making the TTL shorter, so that items do not build up to the point where the background process may have to delete a large number of them all at once.
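
As a back-of-the-envelope check, the number of keys waiting on a TTL at any moment is roughly the write rate multiplied by the TTL. The sketch below assumes every write creates a new key with the same TTL; the function name and the numbers are made up for illustration.

```python
def keys_with_ttl(writes_per_second: float, ttl_seconds: float) -> float:
    """Rough steady-state count of keys that are still waiting to expire.

    Assumes each write creates a distinct key and all keys share one TTL.
    """
    return writes_per_second * ttl_seconds


# Hypothetical workload: 500 writes per second.
print(keys_with_ttl(500, 3600))  # one hour TTL    -> 1800000 keys
print(keys_with_ttl(500, 1800))  # half an hour    -> 900000 keys
print(keys_with_ttl(250, 3600))  # half the writes -> 900000 keys
```

Halving either the write rate or the TTL halves the number of keys that could end up expiring around the same time.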