As I sat and pondered life without Gmail for a while, I wondered whether someone in Mountain View wasn't lamenting the removal of the beta tag from the service earlier this year. In looking for an explanation, it's best to turn to the source. The old adage, "It's not that you have a problem but rather how you handle it that is most important," applies here in a way that Google would rather not repeat. Here are some official words from the official Gmail blog:
Gmail's web interface had a widespread outage earlier today, lasting about 100 minutes. We know how many people rely on Gmail for personal and professional communications, and we take it very seriously when there's a problem with the service. Thus, right up front, I'd like to apologize to all of you - today's outage was a Big Deal, and we're treating it as such. We've already thoroughly investigated what happened, and we're currently compiling a list of things we intend to fix or improve as a result of the investigation.
The blog then goes on to explain the five Ws of the situation in layman's terms and, in my opinion, provides an appropriate mea culpa while showing that work is under way to ensure this does not happen again to the same degree. What was most interesting was the recognition that the architecture in place at the time of the failure caused a shutdown rather than a slowdown, and that Gmail is opting for slow service over no service in the future. Good choice.
What's next: We've turned our full attention to helping ensure this kind of event doesn't happen again. Some of the actions are straightforward and are already done - for example, increasing request router capacity well beyond peak demand to provide headroom. Some of the actions are more subtle - for example, we have concluded that request routers don't have sufficient failure isolation (i.e. if there's a problem in one datacenter, it shouldn't affect servers in another datacenter) and do not degrade gracefully (e.g. if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load). We'll be hard at work over the next few weeks implementing these and other Gmail reliability improvements - Gmail remains more than 99.9% available to all users, and we're committed to keeping events like today's notable for their rarity.
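For the curious, here is a toy sketch of the "slow service over no service" idea. It is in no way Google's actual design: the `Router` class, its two policies, the capacity of 100 requests per tick and the sample loads are all made up by me for illustration. One policy refuses anything over capacity, while the other accepts everything and lets the excess wait in a backlog.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Router:
    """Hypothetical request router; all names and numbers are illustrative."""

    capacity: int      # requests it can serve per tick
    backlog: int = 0   # requests waiting (only grows under the degrade policy)

    def tick_reject(self, incoming: int) -> tuple[int, int]:
        """Serve up to capacity and refuse the rest outright."""
        served = min(incoming, self.capacity)
        return served, incoming - served

    def tick_degrade(self, incoming: int) -> tuple[int, int]:
        """Accept everything; excess requests wait, so service gets slower."""
        self.backlog += incoming
        served = min(self.backlog, self.capacity)
        self.backlog -= served
        return served, 0


def simulate(policy: str, loads: list[int], capacity: int = 100) -> tuple[int, int]:
    """Return (total refused requests, backlog left at the end)."""
    router = Router(capacity)
    refused_total = 0
    for incoming in loads:
        step = router.tick_reject if policy == "reject" else router.tick_degrade
        _, refused = step(incoming)
        refused_total += refused
    return refused_total, router.backlog


if __name__ == "__main__":
    loads = [80, 120, 130, 60]  # made-up traffic with a brief spike
    print("reject :", simulate("reject", loads))
    print("degrade:", simulate("degrade", loads))
```

With the spike in the sample loads, the reject policy turns requests away, while the degrade policy turns nobody away and simply carries a queue of waiting requests past the spike. In a real system the refused traffic would land on some other router, which is how one overloaded box can drag down its neighbors.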
For something of this magnitude, I give Google a decent grade for being transparent enough to say "Yup, we're not perfect" while working to get it right for the future. Today will be a great day for all of the Google haters out there. I, on the other hand, have decided that since I am far from perfect myself, expecting perfection from others is, well, a waste of time. Does that mean I will welcome future outages with open arms? Of course not. Based on what I have seen here, though, I suspect that Google won't either.