Last time around, I outlined a straightforward way to think about responding to incidents that threaten your ability to operate (“Thinking About Disaster Recovery: Response-Based Planning,” February 2015). Let’s think for a bit about how you need to prepare your technical plant for those incidents. After all, a plan isn’t any good if you aren’t prepared to execute it.
Here’s a cool thought: What if you could respond to any incident requiring staff to evacuate studios by simply driving to another building, walking inside and finding an exact replica of the main plant, all the way down to offices, computers and coffeemakers, with freshly made coffee just dripping into the carafe? Sounds great, right?
Some of your staffers (and maybe management) expect that level of “recovery” from incidents. Most people just don’t think about the complexity of a broadcast operation—or any business, for that matter. That level of recovery is certainly possible for your shop, but it’s going to cost you at least as much as it cost to build your original plant, not counting the cost of keeping everything well maintained and current. That setup is probably not in this year’s budget. If it is, give me a call…
So what’s a good way to figure out what’s most important to focus on when evaluating disaster recovery preparations?
In larger IT-heavy shops there’s a tendency to extend the definition of “disaster” beyond high-level scenarios to losing individual racks, or even individual servers, and to sketch out very detailed plans for recovering from these point failures. This can be a good approach for some places; it reflects the way that IT staff think about their plants as an archipelago of functionality served by a common infrastructure. It’s also a good way to help design inherently resilient systems; you can put your hand on an individual server and say, “OK, what happens if this box goes down?” and use the answer as a guideline for planning.
IT tools from this individual-system approach can be very useful. In particular, thinking in terms of recovery time objective (RTO) and recovery point objective (RPO) metrics provides a consistent framework for communicating expectations both internally (to the technical groups) and externally (to management and to the users of those systems).
Put simply, RTO describes how long a business process, including its underlying infrastructure, can be offline before its absence affects the business.
RPO is a description, also in time, of how much data a particular system or process can lose before it negatively affects the business.
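To make the two definitions concrete, here is a minimal sketch of how the two numbers apply to an outage. The service names and figures are invented for illustration, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    rto_minutes: int  # maximum tolerable downtime
    rpo_minutes: int  # maximum tolerable data loss, measured in time

def outage_report(svc: Service, downtime_minutes: int,
                  minutes_since_last_backup: int) -> tuple[bool, bool]:
    """Check an actual outage against a service's RTO and RPO targets."""
    rto_ok = downtime_minutes <= svc.rto_minutes
    rpo_ok = minutes_since_last_backup <= svc.rpo_minutes
    return rto_ok, rpo_ok

# A hypothetical main channel: minutes of downtime hurt, but its
# "data" (the program log) can be restored from a daily backup.
main_air = Service("Main channel", rto_minutes=5, rpo_minutes=1440)

# A 20-minute outage, with the last backup taken 90 minutes earlier:
print(outage_report(main_air, 20, 90))  # → (False, True)
```

The 20-minute outage blows past the five-minute RTO, while the 90-minute-old backup is comfortably inside the 24-hour RPO; the two numbers measure different kinds of loss.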
RECOVERY TIME OBJECTIVE
So think about this for a moment: What is the RTO of your main signal? In other words, how long can you be off the air before the absence causes unacceptable problems for your business? I’ll bet that it’s measured in a very few minutes. Remember: If your main service is down, you’re not making money, and you’re not serving your community.
Now apply the same focus to an online stream that’s a repeat of your main channel. What’s the RTO there? This may vary according to the market you’re in. In a lot of places, losing an online stream means that only a few listeners—maybe a small percentage of your overall total—can’t receive your content. Maybe most of those online listeners are coming in from outside your local market. How much time can you tolerate that service being offline? In some places, though, that same online stream may be the way that a large fraction of your listeners get your content. Consider also the effect losing an online stream has on your commitment to your listeners—that’s worth something, as discussed in my last article.
How about email? How important is email to your operation and to your ability to communicate with staffers, listeners and others? Given that email is asynchronous, you can probably stand to lose some recovery time there—but not much.
RECOVERY POINT OBJECTIVE
Think also about the RPO of your various services. Here are some examples.
You have an accounting system that you use to run your business; it has financial data, payroll data, and it helps you manage your incoming revenue. How much of that data could you stand to lose if you lost that system during an incident? That’s a system that has stringent RPO requirements—but a longer RTO, because it can be offline for a while before it directly affects the business.
You could have a website that doesn’t change that often. It may have contact information and a static schedule. Its RPO could be pretty lengthy, maybe even days or weeks. Alternately, you could have a frequently updated website that has newsfeeds, “now playing” information, weather updates and a form for listeners or prospective advertisers to contact the station. Its RTO could be pretty short—on the order of a few minutes to an hour—because it’s an important part of your business, and its RPO could be pretty short, too, because you don’t want to lose the important contact data.
How much data can you lose from a newsroom system before that affects your news operation? This one may be a bit surprising. It’s entirely possible that the RTO of a newsroom system could be pretty long, maybe even as long as a day or two, assuming that you have adequate ways for your news staff to operate during an incident. Depending on how you use your newsroom system, its RPO could also be lengthy; you may be able to stand losing a day’s worth of unbacked-up content without a lot of bad aftereffects.
So how do we use those tools to think about our disaster recovery preparations? Here’s an exercise: Open a spreadsheet and list all the customer-facing services your station provides. You’ll get a list something like this:
– Main channel
– HD1 - repeats main channel
– HD2 - all-news service
– HD3 - Jazz service
– Online stream 1
– Updated news on the website
Then go through and list all the internal services your staff uses to get their jobs done:
– Phone system
– Playback system in Studio 1
– Playback system in Studio 2
– Accounting applications
– Links to the transmitter site
It helps to build the list by going to your machine/server/rack room. Stand in front of every device and put what it does on your list. Don’t be too concerned about the box itself; think about what role, or roles, it performs in your operation. The goal here is to be as comprehensive as possible and to include everything that goes into running your station.
Label the second column “RTO.” For each item on the list, enter the number of minutes that the service can be offline before the consequences are unacceptable.
Then label the third column “RPO.” For each item, enter the most data, measured in time, that the service can afford to lose. If you’re not sure, put the amount of time between planned backups. (You are making backups of all your data and storing them offsite, right?)
Now go back and sort your spreadsheet so that the shortest overall RTO is at the top, secondarily sorted by the shortest RPO. The sorted list quickly shows where you need to expend effort to make sure you have proper backup systems to cover the most likely incident scenarios, and gives you a good idea of what data and how often you need to move that data offsite to minimize loss.
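If you keep the list in a script rather than a spreadsheet, that sort step looks like this. The services and numbers below are made-up examples standing in for your own list:

```python
# Each entry: (service, RTO in minutes, RPO in minutes)
services = [
    ("Accounting applications",   1440,   60),
    ("Main channel",                 5, 1440),
    ("Updated news on the website", 60,   30),
    ("Online stream 1",             30, 1440),
]

# Shortest RTO first; ties broken by shortest RPO.
by_urgency = sorted(services, key=lambda s: (s[1], s[2]))

for name, rto, rpo in by_urgency:
    print(f"{name}: RTO {rto} min, RPO {rpo} min")
```

With these example numbers, the main channel rises to the top and the accounting applications sink to the bottom, which matches intuition: the air signal needs an immediate backup path, while accounting mostly needs frequent offsite backups.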
This is also where you document assumptions. For example, you can say that any plan to evacuate your main site assumes that staff will return in less than 48 hours. This is very important because it drives recovery and backup preparations for services with a long RTO. If you discover you have a service that can be down for 48 or 72 hours, you can safely shut it down, if you assume that you’ll return before the nominal time.
Of course bad stuff happens, and you may not be able to return in 24, 48 or even 72 hours.
RTO and RPO are the tools you need to identify and characterize the relative risks of making assumptions. As soon as you have your list of services and assumptions, talk with other staffers to check them. You may be surprised when you find out that something you thought was optional—like email—is actually critical to your operation and needs to be accommodated during an incident.
The RTO/RPO spreadsheet is a great tool for communicating expectations to management and other staffers and for providing a framework for further discussions of what’s most important to the operation. It moves the conversation from a declarative “here’s what we’re gonna do” to a collaborative “what are the priorities of the organization and what are the most important things to back up given our resources?”
Using the example above, you could go to your GM with a few spreadsheets and be able to say, “Based on our conversations over the past few weeks, we need to make some decisions.
“I heard from you that we want less than 30 seconds downtime on our main and HD channels during an incident. That requires not only redundant transmitters, but also redundant interconnection among all the studio sites and transmitters and generators at all the sites, not to mention the operational and training challenges. That’s going to cost twice what it would if we rethought the main channel RTO to something more like ten minutes.
“Now, I also heard from you that we don’t have to reinstate some of the accounting systems within 24 hours after a building evacuation. That’s good because we don’t have to go to the expense or complexity of keeping a live system running offsite 24/7. We do, however, need to be careful that we have good backups at least twice daily so we don’t lose data.”
(If you really want to impress your GM or IT director, say something like, “The accounting systems can stand a long RTO, but they have very stringent RPO requirements.”)
“Email is critical for communicating with staff and others during an incident; while we don’t have any control over whether a person is able to receive email if they’re offsite, we need to make sure that we can at least attempt to deliver it; I propose using an outside service for backup email use in case our local systems go down.”
Utilizing RTO and RPO to document assumptions and to see what’s really necessary to keep your operation going is a great tool for communicating with management; it puts everything in a format where you’re talking about costs versus benefits—and then naturally puts things in the proper priority when considering the plan as a whole.
After those conversations, you should have a very good picture of the current state of your preparations—and a very good idea of where you and the rest of the staff need to expend time and money to be able to meet your RTO and RPO objectives.
Next time: Where is your content?
Bridgewater works with media companies to help them solve sticky problems, including analyzing and enhancing disaster recovery preparations. Contact him at email@example.com.