Azure flow loss
Incident Report for Kentik SaaS US Cluster
Resolved
Starting at 2024-01-19 10:20:00 UTC some exporter processes encountered errors while trying to use the Azure Resource Manager ListKeys API for customer storage accounts. This exposed a bug that caused affected exporters to go into an idle state, stopping flow collection and ingestion. Azure ARM failures gradually increased, and with it the number of exporters entering that idle state. By 2024-01-21 05:15 UTC, flow processing for all customers had halted.

We detected this on 2024-01-21 17:44:00 and pushed a fix on 2024-01-21 17:52:00.

Underlying causes:
- Azure AMR failures (tracking NKRF-1TG)
- Code bug in Kentik exporters (fixed)

As a follow up action we made changes to our monitoring in order to alert on gradually decreasing flow rates.
Posted Jan 19, 2024 - 18:00 UTC