APIv6 and some Kentik agent APIs down
Incident Report for Kentik SaaS US Cluster
Postmortem

ROOT CAUSE

Our inbound proxy/load balancer was configured with a small concurrent connection pool for these API paths. At approximately 13:30 UTC, several high-volume paths were brought online; these filled the pool and caused requests to queue faster than they could be drained, producing intermittent 503 and 429 responses from our API.
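
For illustration, here is a minimal Go sketch of this failure mode, assuming a proxy that gates each upstream path behind a fixed-size connection pool implemented as a semaphore. The handler names, pool size, and timeout below are hypothetical and not Kentik's actual configuration; the point is only that once the pool is saturated, queued requests time out and the proxy sheds load with a 503.

```go
package main

import (
	"net/http"
	"time"
)

// pooledHandler gates an upstream handler behind a fixed-size
// "connection pool" modeled as a buffered-channel semaphore.
// poolSize and waitTimeout are illustrative values only.
func pooledHandler(upstream http.Handler, poolSize int, waitTimeout time.Duration) http.Handler {
	pool := make(chan struct{}, poolSize)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case pool <- struct{}{}: // acquired a pool slot
			defer func() { <-pool }() // release the slot when done
			upstream.ServeHTTP(w, r)
		case <-time.After(waitTimeout):
			// Pool exhausted and the queue never drained in time:
			// shed load rather than queuing indefinitely.
			http.Error(w, "service unavailable", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	api := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(200 * time.Millisecond) // simulated upstream latency
		w.Write([]byte("ok\n"))
	})
	// A pool this small fills as soon as a few high-volume paths
	// come online, and waiting requests start failing with 503s.
	http.Handle("/api/", pooledHandler(api, 8, time.Second))
	http.ListenAndServe(":8080", nil)
}
```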

RESOLUTION

At approximately 17:45 UTC, the connection pool size was increased to address this issue.
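
In terms of the sketch above, the remediation amounts to a one-line configuration change: raising the pool bound to cover the peak concurrency of the new high-volume paths rather than the old baseline. The value shown is hypothetical.

```go
// Hypothetical fix: size the pool for the new high-volume paths.
http.Handle("/api/", pooledHandler(api, 512, time.Second))
```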

We have raised the severity of the internal alerts that monitor these metrics so that similar events are identified and resolved more quickly in the future.

Posted Oct 25, 2022 - 16:48 UTC

Resolved
This incident has been resolved.
Posted Oct 20, 2022 - 21:15 UTC
Monitoring
The issue has been resolved, and we will continue to monitor.
Posted Oct 20, 2022 - 20:30 UTC
Update
A fix has been deployed, and engineering continues to monitor the API.
Posted Oct 20, 2022 - 20:05 UTC
Identified
The root cause has been identified and a fix is being put into place.
Posted Oct 20, 2022 - 19:50 UTC
Investigating
At approximately 13:30 UTC, we began seeing elevated error rates for APIv6 (synthetics, cloud, etc.) and Kentik agent (kbgp, kproxy) APIs. We have determined these APIs to be non-operational and are investigating the root cause.
Posted Oct 20, 2022 - 13:30 UTC
This incident affected: REST API.