
3-part Plan to Minimize the Impact of Site Outages

“The site is down.”

Few things worry developers as much as outages, especially on ecommerce sites where every minute represents lost revenue.

But outages happen. To minimize their impact, plan for them. One of the trickier aspects is establishing priorities — what needs to be done first. Here’s my answer:

  1. Get back online.
  2. Communicate with customers.
  3. Repair the cause.

1. Get Back Online

The primary goal is to get your site back online. The longer it is offline, the more it will cost you in revenue and customer trust.

The site does not have to be fully recovered, only functional, so that shoppers can use it again.

This might mean that advanced features are disabled. It might mean the storefront is taking orders but not yet sending them to your fulfillment system. It could even be that you’re having to process payments manually through your payment gateway.
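As a rough illustration of "functional but not fully recovered," here is a minimal Python sketch of a storefront that keeps accepting orders while fulfillment is switched off, then releases them later. The Storefront class, its fields, and its method names are all hypothetical; a real shop would hold orders in a database or queue rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Storefront:
    """Toy storefront with a degraded mode (all names are hypothetical)."""

    fulfillment_enabled: bool = True
    sent: list = field(default_factory=list)  # orders passed to fulfillment
    held: list = field(default_factory=list)  # orders waiting for fulfillment

    def place_order(self, order_id: str) -> str:
        # Always accept the order so shoppers can keep buying.
        if self.fulfillment_enabled:
            self.sent.append(order_id)
            return "sent"
        # Fulfillment is down or disabled: hold the order for later.
        self.held.append(order_id)
        return "held"

    def resume_fulfillment(self) -> int:
        # Once fulfillment is healthy, release held orders in arrival order.
        self.fulfillment_enabled = True
        released = len(self.held)
        self.sent.extend(self.held)
        self.held.clear()
        return released
```

Creating the shop with Storefront(fulfillment_enabled=False) puts it in degraded mode. Calling resume_fulfillment() afterward returns the number of held orders that were released.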

While manual work might be necessary to serve customers, it could cause problems, too. There are two important points to consider.

First, many outages are prolonged due to unrecognized dependencies that cause a domino effect. For example, manually rebooting a server could cause other servers to crash, which deepens the crisis.

So try your existing processes first before turning to manual ones. You will make fewer mistakes, lower the risk of further damage, and make the later repair phase easier.
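If you do have to touch a server by hand, it helps to know what relies on it first. The sketch below assumes you keep a simple map of which service depends on which. The map, the service names, and the affected_by function are hypothetical examples, not part of any particular tool.

```python
from collections import deque


def affected_by(target, depends_on):
    """Return every service that directly or indirectly depends on target.

    depends_on maps a service name to the list of services it needs.
    """
    # Invert the map: for each service, record who depends on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the target.
    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)

    seen.discard(target)  # only relevant if the map contains a cycle
    return seen


# Hypothetical dependency map for a small shop.
SHOP = {
    "storefront": ["db", "cache"],
    "checkout": ["storefront", "payments"],
    "cache": ["db"],
    "payments": [],
}
```

With this map, affected_by("db", SHOP) reports the storefront, cache, and checkout. That is a hint that restarting the database by hand could ripple through most of the site.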

Second, save all forensic data about the outage. This includes log files, crash dumps, software runtime snapshots, or even entire servers. Do that, however, only if it won’t pause or stall getting the site back online.
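One low-effort way to save log files is to bundle them into a single timestamped archive before anyone restarts or cleans up. Here is a minimal sketch using only Python's standard library. The snapshot_logs function and the archive naming are my own assumptions, and it skips missing paths so that it never stalls the recovery.

```python
import tarfile
import time
from pathlib import Path


def snapshot_logs(paths, dest_dir):
    """Copy the given files or directories into one timestamped .tar.gz archive."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # UTC timestamp so archives from different servers sort consistently.
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    archive = dest / f"outage-{stamp}.tar.gz"

    with tarfile.open(str(archive), "w:gz") as tar:
        for raw in paths:
            path = Path(raw)
            # Skip anything missing instead of failing mid-outage.
            if path.exists():
                tar.add(str(path), arcname=path.name)
    return archive
```

Note that entries are stored under their base name only, so two inputs with the same file name would collide. Adjust arcname if you collect from several directories.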

2. Communicate with Customers

During an outage, clear and direct communication with customers is important.

I’ve listed this as step 2, but it can be handled at the same time as step 1. While technical teams are getting the site back online, everyone else can be focused on keeping shoppers informed and helping them however necessary.

The exception is when there’s a conflict between getting back online and communication. When that occurs, communication is secondary. For example, if developers need to disable a conversion-tracking tool, the marketing staff should help them before communicating with customers.

How and where you communicate with customers will depend on your store and how many customers might be impacted. The larger the outage, the more official and widespread the communication should be. Consider these channels:

  • Social media accounts.
  • Email broadcasts, making sure nothing in them (links, images, tracking pixels) depends on the affected site.
  • Your blog if it’s separate or unaffected by the outage.
  • Status page if you have one.
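For the email channel, a quick check before sending can catch links, images, or tracking pixels that point back at the site that is down. This is a minimal sketch that scans href and src attributes with a regular expression. The function name and the example hosts are hypothetical, and a regular expression is only a rough stand-in for a real HTML parser.

```python
import re
from urllib.parse import urlparse

# Matches the value of href="..." or src="..." attributes.
URL_PATTERN = re.compile(r"""(?:href|src)\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def urls_touching_site(html, site_host):
    """Return URLs in href/src attributes hosted on site_host or one of its subdomains."""
    site_host = site_host.lower()
    hits = []
    for url in URL_PATTERN.findall(html):
        # Relative URLs have no hostname and are treated as not matching.
        host = (urlparse(url).hostname or "").lower()
        if host == site_host or host.endswith("." + site_host):
            hits.append(url)
    return hits
```

An empty result means none of the absolute URLs found in the message resolve to the affected host. Anything returned should be removed or re-hosted before the broadcast goes out.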

The important points to include in your communication are (a) that you’re working on the issue, (b) where to find further updates, and (c) answers to likely fears, such as credit card or personal information leaks.

3. Repair the Cause

By now the site is back online or at least functional. You’ve notified customers and provided updates. You’re making progress toward getting back to normal.

This is a good time to give staff a bit of rest. Depending on the outage, it might be a few hours or a couple of days. Rotate your staff if necessary — some can rest while others manage the shop.

It might seem counterintuitive to let people relax before the cause is fully repaired. But fatigue causes mistakes. The rest will help prevent a second outage caused by human error.

During the repair stage, inform your customers that the issue has been resolved, that you’re monitoring it closely, and that further information might arise as you investigate.

Now figure out what happened, and repair it.

During step 1, you may have discovered factors that could have caused or accelerated the outage. Hopefully you have forensic data. Now is the time to collect all of that and try to piece together what happened.

You can use a standard risk management process or your own process to help guide your thinking. An outage can be viewed as a risk (or multiple risks) that actually occurred.
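One common way to structure that thinking is to score each contributing factor by likelihood and impact, then work on the highest scores first. The sketch below assumes a simple 1-to-5 scale for both. The scale, the Risk class, and the example risks are illustrative assumptions, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # 1 (rare) to 5 (frequent), an assumed scale
    impact: int      # 1 (minor) to 5 (severe), an assumed scale

    @property
    def score(self) -> int:
        # Simple likelihood-times-impact score; higher means more urgent.
        return self.likelihood * self.impact


def prioritize(risks):
    """Sort risks from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical findings from an outage review.
FINDINGS = [
    Risk("expired TLS certificate", likelihood=2, impact=5),
    Risk("single database server", likelihood=3, impact=5),
    Risk("slow image CDN", likelihood=4, impact=1),
]
```

Here prioritize(FINDINGS) puts the single database server first, which suggests that adding redundancy there would be the most valuable repair.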

In step 3, also find ways to improve your processes to prevent a similar outage, or at least minimize it. This could include changing software, switching vendors, adding redundancy, or a combination.

Outages Happen

Remember, outages happen. Even giants such as Amazon have them. The key is to learn from them to lessen their occurrence. Developing an outage plan, such as the one above, will minimize their impact.

Eric Davis