Increase in check-in failures from agents. Site slowdown.
Incident Report for Pingdom Server Monitor
Postmortem

There are three things we are taking away from this incident:

1) We didn't have sensitive enough alerts in place to tell us proactively when 503s were elevated. This is pretty basic, and we're fixing it.

2) The flooding that caused this issue was a new type that our existing flooding controls couldn't detect. We're working out whether, and how, we can detect it automatically. Either way, we are documenting it along with manual runbooks to address the problem if it happens again.

3) Flooding from one customer should never affect others. This is a tougher nut to crack, given constraints inherent in SaaS / multi-tenant architecture -- by definition, resources are shared. We're going to take a hard look at this, and try to minimize impacts going forward.
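
To illustrate the idea behind point 3, here is a minimal sketch of per-customer rate limiting using a token bucket. It is not our actual flooding control. The class name PerCustomerLimiter, the rate and burst parameters, and the customer IDs are all hypothetical and chosen only for illustration. Each customer draws from its own budget, so one account that floods runs out of its own tokens without consuming anyone else's.

```python
import time
from collections import defaultdict


class PerCustomerLimiter:
    """Hypothetical token bucket per customer (illustrative only)."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec   # tokens refilled per second (assumed value)
        self.burst = burst         # maximum tokens one bucket can hold
        self.clock = clock         # injectable clock, handy for testing
        self.tokens = defaultdict(lambda: float(burst))
        self.last_seen = {}

    def allow(self, customer_id):
        """Return True if this customer's request may enter the pipeline."""
        now = self.clock()
        elapsed = now - self.last_seen.get(customer_id, now)
        self.last_seen[customer_id] = now
        # Refill in proportion to elapsed time, capped at the burst size.
        refilled = self.tokens[customer_id] + elapsed * self.rate
        self.tokens[customer_id] = min(float(self.burst), refilled)
        if self.tokens[customer_id] >= 1.0:
            self.tokens[customer_id] -= 1.0
            return True
        return False
```

In this sketch a flooding customer starts getting False from allow() once its burst is spent, while a quiet customer's bucket is untouched. Shared resources downstream of the limiter can still be contended, which is the harder part of the problem described above.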

I apologize for the false positive alerts this generated.

Posted Aug 24, 2017 - 10:31 MDT

Resolved
The flooding control we deployed resolved the issue. Operations are back to normal.
Posted Aug 23, 2017 - 23:02 MDT
Monitoring
The issue was caused by flooding from one customer account, which starved resources within the ingestion pipeline. We have deployed an ad-hoc flooding control to fix the immediate issue. We will look into more generalized flooding controls to prevent similar issues going forward.
Posted Aug 23, 2017 - 22:48 MDT
Investigating
We are currently investigating this issue.
Posted Aug 23, 2017 - 20:08 MDT