ROOT CAUSE
This incident was part of a series of incidents caused by bottlenecking in a load balancing system we placed in front of our query engine on 2022-09-01. This load balancer is shared across many of our underlying services, so many upstream Kentik portal pages were affected in different ways. The bottlenecking only occurred during peak query usage, at which time the load balancer would begin hitting its global connection limits.
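To make the failure mode concrete: many load balancers enforce a single global cap on concurrent connections, shared by every service behind them. The sketch below uses HAProxy-style syntax purely as an illustration; this report does not identify the actual load balancer product or its configured limits.

```
# Illustrative only: one global cap shared by every frontend behind this proxy.
global
    maxconn 10000        # hard limit on concurrent connections, proxy-wide

defaults
    timeout queue 30s    # requests beyond maxconn wait here, delaying them
                         # before any backend ever sees the connection
```

With a shared cap like this, adding one more high-volume workload (such as query traffic) can push the whole proxy to its limit and degrade every service behind it at once.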
RESOLUTION
Because this issue only occurred during our peak query times, it took us much longer than desired to identify the pattern and isolate a root cause. Each business day starting 2022-09-06, we saw elevated response times around the same time of day, but found no obvious culprit in metrics, logs, or traces.
For the first few days, Kentik Engineering teams identified potential performance bottlenecks in various software services based on trace data, rolled out patches, and saw improved response times. While these changes did improve the performance of various services, the improvements in response times observed immediately after each patch deployment were false positives: the patches rolled out during off-peak hours, while the root issue coincided with our query peak.
After hitting our query peaks on 2022-09-06 to 2022-09-08, we began to see the pattern emerge, but still could not clearly point at a root cause. The biggest blocker was that our load balancer was not reporting the bottlenecking in any fashion. In fact, when a Kentik Portal user loaded a page whose requests went through this load balancer, the load balancer reported nominal response times while the web server reported elevated ones. This led us to believe there was a performance issue on our web servers, and we focused much of our effort there for the first few days. In addition to software improvements, the team allocated 66% more hardware capacity for our web servers, hoping this would buy us headroom to identify the true root cause, but to no avail.
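One way such a discrepancy can arise is sketched below in Python (a hypothetical illustration, not Kentik's actual stack): if a proxy starts its latency timer only when a request leaves an internal queue, time spent waiting in that queue is invisible in the proxy's own metrics but fully visible to the caller.

```python
import time

def handle(arrival_ts, queue_delay, backend_latency):
    """Simulate a proxy that times a request only after it is dequeued."""
    time.sleep(queue_delay)                 # request waits in the proxy's queue
    dequeued_ts = time.monotonic()
    time.sleep(backend_latency)             # simulated backend work
    done_ts = time.monotonic()
    proxy_reported = done_ts - dequeued_ts  # what the proxy's metrics show
    end_to_end = done_ts - arrival_ts       # what the caller experiences
    return proxy_reported, end_to_end

proxy_seen, client_seen = handle(time.monotonic(),
                                 queue_delay=0.2, backend_latency=0.01)
print(f"proxy-reported latency:  {proxy_seen:.2f}s")   # ~0.01s: looks nominal
print(f"client-observed latency: {client_seen:.2f}s")  # ~0.21s: includes queueing
```

Under load, only the end-to-end number grows, which is exactly why the servers downstream of such a proxy can look like the culprit.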
Several days into the incident, a look back at macro trends showed a slight decrease in overall responsiveness and an increase in error rates, both coinciding with our load balancer changes. Only then did we begin to investigate the load balancer as a potential root cause.
Our load balancer employs several concurrency limits, and the addition of query load caused us to hit these limits during query peaks. In hindsight, this was clearly visible in concurrent-connection metrics, but we had no monitoring or alerting for this scenario, and the load balancer did not log or otherwise indicate that it was occurring. It queued requests and silently incurred delays while reporting nominal request and response times in its latency metrics.
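One guard against this failure mode is alerting on connection-limit saturation rather than on latency alone. The sketch below uses Prometheus alerting-rule syntax; the metric names (`lb_current_connections`, `lb_max_connections`) are hypothetical, not Kentik's actual telemetry.

```yaml
# Hypothetical alerting rule: fire when connections approach the cap,
# since the proxy's latency metrics alone will not reveal queueing.
groups:
  - name: load-balancer-saturation
    rules:
      - alert: LoadBalancerNearConnectionLimit
        expr: lb_current_connections / lb_max_connections > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Load balancer is above 85% of its configured connection limit"
```

A saturation-based alert like this would have flagged the bottleneck at the first query peak, even while per-request latency metrics still looked nominal.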
On 2022-09-15, Kentik Engineering removed the query load from this load balancer, and performance returned to consistently nominal levels.
However, performing this rollback in conjunction with rapidly deploying new hardware for the web portal caused different bottlenecks in our query system during query peaks – the very bottlenecks we had been anticipating and trying to get ahead of by introducing the load balancer in the first place.
On 2022-09-21, Kentik Engineering was able to get all affected systems into a nominal state in terms of query performance and overall latency.
FOLLOW UP
The team is now focused on adding several layers of observability to our platform in order to improve our ability to respond to these types of incidents. In addition to more thorough monitoring of all components of our infrastructure, we are focused on identifying performance issues more proactively. During Q4 2022, our team will be working towards:
Please contact your Customer Success team or support@kentik.com if you have any further questions or concerns.