I was out of the office today when the outage happened, but I'm happy to say that very few people in our IT organization were confused as to how to troubleshoot the issue and how to utilize the tools made available to them. It's refreshing to see the implementation of tools and processes, even if at a bare minimum at this point in time, make an impact when something like today's outage comes up.
At 3:22:20 PM ET, we received our first BoxTone email notification letting us know one of our servers was in an unavailable status and had lost its SRP connection. Within a minute, we were alerted for the rest of the BES servers we have in the United States (we do not have the international sites configured for notifications at this time). The outage was reported by RIM to have started at 3:20 PM ET. Not too shabby of a turnaround time.
Ten minutes after we first received notifications from our BoxTone connector, we received our first RIM email notification for the outage. This outage notification took over 5 minutes to deliver from RIM to our infrastructure. Over an hour after our initial BoxTone alert, we received our first notification from AT&T (ATTOM), which took nearly 20 minutes to deliver to our system after it was sent from AT&T.
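To put numbers on that timeline, here is a rough sketch of the arithmetic in Python. It uses only the times quoted above; the RIM email time is approximate, since all I can say is that it arrived about ten minutes after the first BoxTone alert. The AT&T notice is left out because "over an hour" is not precise enough to calculate with. The variable and function names are just illustrative.

```python
from datetime import timedelta

# Clock times from the post, expressed as offsets from midnight ET.
outage_start = timedelta(hours=15, minutes=20)                # RIM-reported start, 3:20 PM
boxtone_alert = timedelta(hours=15, minutes=22, seconds=20)   # first BoxTone email, 3:22:20 PM

# Approximation: the post only says "ten minutes after" the first BoxTone alert.
rim_email = boxtone_alert + timedelta(minutes=10)


def lag_seconds(event, start=outage_start):
    """Seconds between the reported outage start and an event."""
    return int((event - start).total_seconds())


boxtone_lag = lag_seconds(boxtone_alert)
rim_lag = lag_seconds(rim_email)
head_start = rim_lag - boxtone_lag

print(f"BoxTone alert lag: {boxtone_lag} s")
print(f"RIM email lag (approx.): {rim_lag} s")
print(f"Head start over vendor notice: {head_start} s")
```

Under those assumptions, the internal alert arrived a little over two minutes into the outage, roughly ten minutes ahead of the first vendor email.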
Granted, a monitoring solution can do nothing to fix an outage, but it can certainly reduce the amount of time spent troubleshooting end-user issues when outages happen. This determination period is vital when dealing with thousands and thousands of users. Bulletins can be posted, internal notifications can be sent, Help Desk personnel can start notifying rather than troubleshooting... all within minutes of an outage developing, and quite often much more rapidly than official vendor acknowledgement and notification. During these important minutes ticking away, vendors are typically in the process of drafting a response, gaining approvals to send the message to a select few hundred thousand customers, and straining their own mail queues; meanwhile the monitoring system is doing its thing - gathering real-time statistics, aggregating the data, sending alerts to internal technology groups, and helping deduce the outage's scope of impact in your own environment.
Here's what our environment looked like following the reconnection of SRP when messages were still increasing in the pending queues. Quite astonishing.
BES: North America
SABES: South America