There are three things we are taking away from this incident:
1) We didn't have sensitive enough alerts in place to tell us proactively when 503s were elevated. This is pretty basic, and we're fixing it.
2) The flooding that caused this issue was a new type that our existing flooding controls couldn't detect. We're working out whether, and how, we can detect it automatically. Whether or not that turns out to be possible, we are documenting it and writing manual runbooks so we can address the problem if it happens again.
3) Flooding from one customer should never affect others. This is a tougher nut to crack, given the constraints inherent in a SaaS / multi-tenant architecture: by definition, resources are shared. We're going to take a hard look at this and try to minimize the impact on other customers going forward.
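To make the third point concrete, one common way to keep a single customer from consuming shared capacity is a per-tenant token bucket. The sketch below is purely illustrative: the names `TokenBucket` and `PerTenantLimiter`, and the `rate` and `capacity` values, are assumptions for the example and not a description of our actual controls. It also only limits request admission, so it would not by itself isolate shared downstream resources.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so a new tenant can burst
        self.updated = now

    def allow(self, now):
        # Refill based on elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerTenantLimiter:
    """Keeps one bucket per tenant so one tenant's flood cannot drain another's budget."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, tenant_id, now=None):
        if now is None:
            now = time.monotonic()
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[tenant_id] = bucket
        return bucket.allow(now)


if __name__ == "__main__":
    limiter = PerTenantLimiter(rate=1.0, capacity=2.0)
    # Tenant "a" floods; tenant "b" is still admitted.
    print([limiter.allow("a", now=0.0) for _ in range(3)])
    print(limiter.allow("b", now=0.0))
```

In this sketch a flooding tenant is rejected once its own bucket is empty, while every other tenant keeps its full allowance.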
I apologize for the false positive alerts this generated.