Recently I encountered a Redis problem where new connections to the server could not be established and clients timed out en masse. On the AWS managed Redis cluster, two metrics coincided with the surge of connection issues: "New Connections" and "Reclaimed Items" both spiked during the incident.
Commonly, Redis responds slowly and causes connection issues on the client when "Evictions" events occur. This usually happens when the max memory of the Redis node has been reached and entries are forced out of Redis as a result. However, in my case the "Evictions" metric showed 0, and "Bytes Used for Cache" suggested that Redis had enough RAM.
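The same counters are also visible from Redis itself: `INFO stats` reports `evicted_keys` and `expired_keys`, so you can cross-check the CloudWatch metrics directly on the node. A minimal sketch of parsing that output (the sample text below is made up; in practice you would feed in the result of `redis-cli INFO stats`):

```python
def parse_info(info_text):
    """Parse the 'key:value' lines of Redis INFO output into a dict."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, _, value = line.partition(":")
        stats[key] = value
    return stats

# Hypothetical INFO stats excerpt, matching the incident described above:
sample = """\
# Stats
total_connections_received:144761
expired_keys:982347
evicted_keys:0
"""

stats = parse_info(sample)
print(stats["evicted_keys"], stats["expired_keys"])  # prints: 0 982347
```

A high `expired_keys` alongside zero `evicted_keys` matches the metric pattern described above: no memory pressure, but heavy expiration activity.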
Clearly, the surge of "New Connections" is a symptom rather than a direct cause: when the server stalls, clients time out and retry, and new connections pile up waiting on whatever resource is blocked.
The AWS documentation describes "Reclaimed Items" as "the total number of key expiration events":
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/CacheMetrics.Redis.html
The Redis documentation explains that "expired events are generated when the Redis server deletes the key", i.e. after the TTL on a key reaches zero:
https://redis.io/topics/notifications
Because the background expiration process keeps deleting keys as long as more than 25% of the sampled keys turn out to be expired, Redis can spend several seconds doing little but deletions when a large number of keys expire at once.
https://redis.io/commands/expire
This prevents established connections from making progress with their operations, which surfaces as connection issues on the client.
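The sampling loop documented on the EXPIRE page can be sketched as a toy model: repeatedly sample up to 20 keys with a TTL, delete the expired ones, and loop again whenever more than 25% of the sample was expired. The model below (a simplification; real Redis time-boxes each cycle) shows why a burst of keys expiring at the same instant keeps the cycle running:

```python
import random

def active_expire_cycle(store, now, sample_size=20, threshold=0.25):
    """Toy model of Redis's active expiration cycle: sample keys that carry
    a TTL, delete the expired ones, and repeat as long as more than 25% of
    the sample was expired. `store` maps key -> expiry timestamp."""
    iterations = 0
    while store:
        iterations += 1
        sample = random.sample(list(store), min(sample_size, len(store)))
        expired = [k for k in sample if store[k] <= now]
        for k in expired:
            del store[k]
        # Stop once expired keys drop to 25% of the sample or fewer.
        if len(expired) <= threshold * len(sample):
            break
    return iterations

# A burst of keys that all expire at the same moment keeps the cycle busy
# until the backlog is drained:
burst = {f"key:{i}": 100 for i in range(10_000)}
print(active_expire_cycle(burst, now=101))  # prints: 500
```

With 10,000 simultaneously expired keys and a sample size of 20, every sample is 100% expired, so the loop runs 500 times before the store is empty; if almost nothing has expired, it exits after a single sample.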
Note that this problem depends on both the write rate (writes/second) and the TTL length. You have the option of either reducing writes per second or making the TTL shorter, so that items don't build up to the point where the background process may delete them all at once.
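A complementary way to keep items from all becoming deletable at the same moment is to randomize each TTL slightly, so that keys written in the same burst expire spread out over a window rather than in one spike. A minimal sketch (the helper name and the +/-10% default are my own choices, not from the Redis docs):

```python
import random

def jittered_ttl(base_ttl_seconds, jitter_fraction=0.10):
    """Return the base TTL plus random jitter of +/- jitter_fraction,
    so a burst of writes does not produce a burst of expirations."""
    jitter = int(base_ttl_seconds * jitter_fraction)
    return base_ttl_seconds + random.randint(-jitter, jitter)

# e.g. set keys with a jittered expiry instead of a fixed one:
# r.set("user:42", payload, ex=jittered_ttl(3600))
print(jittered_ttl(3600))  # somewhere in [3240, 3960]
```

This keeps the average TTL (and therefore memory usage) roughly the same while smoothing the expiration rate the background process has to keep up with.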