What To Do When an Outage Brings Your Site or App Down
Amazon's S3 cloud storage, a ubiquitous and indispensable resource for many companies and developers, experienced some acute outages earlier this week, shutting down thousands of popular apps and websites, and severely hobbling others. The effects spread to other services at Amazon as well, including EC2, Elastic Beanstalk, Lambda and OpsWorks.
Pages of news sites reporting on the problems with S3 were riddled with their own broken picture links, as their stored images weren't reachable, even if their CMS and their core servers were.
It's difficult if not impossible for a startup or company to have a fully-baked package to avoid being affected by an outage as widespread as this. The work involved in trying to totally mitigate something like this probably isn't worth it, given the infrequency of such an event. Overall, S3 has been very dependable over the years, which is one of the reasons that so much of the web has been built around it.
That said, startups can put some best practices in place to ensure that, during any outage like this where some assets may be unreachable or the startups' servers aren't operating, customers and users know what's going on, why it happened, and the company's best guess on when normal operations will be restored.
To keep users and customers out of the dark:
1. Have an error page
This is usually a flat HTML file that your server can direct a browser to in the case of an application error, which could be the result of a down service, database, or cloud storage system, depending on how an application has been built.
The HTML error page is usually hosted somewhere like S3, which, of course, is a problem if and when S3 goes down. So it's helpful to have an alternative page stashed elsewhere. There are lots of free places to put a web-accessible HTML file, including Dropbox and GitHub.
Remember that the text on the error page can be quickly edited, and the page re-uploaded to the server, giving it up-to-date information to reflect the current situation.
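To make that quick edit less error-prone in the middle of an incident, the page can be generated from a small template. The following is a minimal Python sketch, not a prescribed tool: the function name `render_error_page`, the template layout, and the sample wording are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from html import escape

# Hypothetical template; adapt the markup and wording to your own brand.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>{title}</title>
</head>
<body>
<h1>{title}</h1>
<p>{message}</p>
<p>Estimated recovery: {eta}</p>
<p>Last updated: {updated}</p>
</body>
</html>
"""


def render_error_page(title, message, eta, updated_at=None):
    """Return a flat HTML status page as a string.

    All text is HTML-escaped so a hurried edit cannot break the markup.
    """
    if updated_at is None:
        updated_at = datetime.now(timezone.utc)
    # Normalize to UTC so the "UTC" label below is accurate.
    # (A naive datetime is treated as local time by astimezone.)
    updated_utc = updated_at.astimezone(timezone.utc)
    return PAGE_TEMPLATE.format(
        title=escape(title),
        message=escape(message),
        eta=escape(eta),
        updated=updated_utc.strftime("%Y-%m-%d %H:%M UTC"),
    )


if __name__ == "__main__":
    page = render_error_page(
        "We're experiencing an outage",
        "Our storage provider is having problems, so images may not load.",
        "unknown; we will update this page as we learn more",
    )
    # Write the file locally, then upload it to the backup host by hand.
    with open("error.html", "w", encoding="utf-8") as handle:
        handle.write(page)
```

Running the script writes `error.html` to the current directory; uploading it to the backup location is left as a manual step, since that depends on where the page is hosted.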
2. Redirect traffic
If a cloud server goes down, know how to redirect traffic to an error page, either via DNS or by forwarding the domain.
Typically, a service like Heroku or DigitalOcean will automatically redirect traffic to your error page in the case of an application error or data error. But if DigitalOcean or Heroku, or whatever hosting service you may be employing, is totally down, then traffic won't be redirected. The browser will just spin its wheels until it errors out. This was the fate of hundreds of sites this week that depended on servers that had critical elements on S3.
In this case, founders or engineers can quickly and temporarily forward the domain to the static HTML file. This can usually be done with a quick change at the site's domain registrar.
Again, it's extra helpful to users if the error file has been edited and updated with text that reflects the current situation. It's likely that many other sites haven't taken this step, so any company that does will look proactive and ahead-of-the-game to users and consumers. It's a small step that can instill confidence in your product.
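Knowing when to make that switch is half the battle. As a rough sketch, the Python below checks whether a site is answering and only recommends failing over after several consecutive failures, so a single blip doesn't trigger a domain change. The function names, the retry count, and the delay are assumptions for illustration; the forwarding change itself still happens by hand at the registrar.

```python
import time
import urllib.request


def site_is_up(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def should_fail_over(check, attempts=3, delay_seconds=10.0, sleep=time.sleep):
    """Return True only if every one of `attempts` consecutive checks fails.

    `check` is any zero-argument callable returning True when the site is up.
    """
    for attempt in range(attempts):
        if check():
            return False
        # Pause between checks, but not after the final one.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return True


if __name__ == "__main__":
    # example.com is a placeholder; substitute your own home page.
    if should_fail_over(lambda: site_is_up("https://example.com/")):
        print("Site appears down: forward the domain to the backup error page.")
    else:
        print("Site is responding.")
```

Run this from a machine outside your own hosting provider, otherwise the check may go down along with the site it is watching.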
3. Reach out and be proactive
For SaaS companies or any kind of startup with keystone customers, founders should reach out to these clients via email and, if necessary, by phone, to let them know exactly what is going on.
In the case of SaaS, a startup may be providing a critical process or tool for clients, one that is used every day. If and when the app goes down, it will be conspicuous. Before tempers can boil or customers go exploring for alternative solutions (competitors), reach out and let clients know what's happened, and, if it was avoidable, what steps are being taken to ensure it doesn't recur.
At the very least, every customer of a SaaS or enterprise business should get an email, even if it's just part of a larger blast, explaining the situation as soon as it's clear that the outage will last more than a few minutes.
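Having the notice drafted in advance makes it easier to send quickly. Here is one possible sketch using Python's standard `email` library; the function name `build_outage_email`, the addresses, and the wording are hypothetical, and actually sending the message (for example through your mail provider's SMTP server) is deliberately left out.

```python
from email.message import EmailMessage


def build_outage_email(sender, recipient, service_name, summary, eta):
    """Build a plain-text outage notice for one customer (does not send it)."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"{service_name} service disruption"
    message.set_content(
        f"Hello,\n\n"
        f"{service_name} is currently experiencing an outage.\n\n"
        f"What happened: {summary}\n"
        f"Expected recovery: {eta}\n\n"
        f"We are working to restore normal operations and will write again "
        f"as soon as we know more.\n"
    )
    return message


if __name__ == "__main__":
    notice = build_outage_email(
        sender="support@example.com",      # placeholder address
        recipient="customer@example.org",  # placeholder address
        service_name="ExampleApp",         # placeholder product name
        summary="Our cloud storage provider is having problems.",
        eta="not yet known",
    )
    print(notice)
```

Looping over a customer list and calling this once per recipient keeps addresses private, which a single message with everyone in the "To" field would not.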
Consumer-facing businesses should consider similar steps. Power users of particular worth might merit special treatment, and all regular users should be assured that the company is working hard to restore normal operations.
Thoroughly explaining the problem and being as upfront as possible will buy a startup a deep reserve of goodwill with users and customers, an invaluable commodity in times of crisis.