Managing Technology: Thinking About Disaster Recovery; Response-Based Planning

There are key differences between cause-based and response-based disaster recovery plans.

Bad stuff happens, right? Even in a well-built and maintained operation, there's always something that threatens to take you off-air: bad weather, power utility issues, human error ("oops!"). Even something like your air conditioner breaking late on a Friday night can temporarily put you out of business. Those and many other incidents all fit under the "Disaster Recovery" heading.

Why worry about disaster recovery? The simplest and cheapest thing to do during a threatening incident is to shut everything off, hunker down and wait for the bad situation to pass. What's wrong with that? Plenty. Preparing for incidents isn't an abstraction or a luxury; when there's a failure, you need to respond.

If you aren't prepared to respond to an incident (or don't have a good plan to exercise those preparations), you'll lose money, audience and respect. Simple as that. If you're off the air, you're not making money. You can do the math to calculate how long you can be offline before you've eaten up any "savings" from not being prepared. There's a secondary cost, too. If you're off the air when your audience needs your services, they'll go to someone else and maybe not come back. There's also the commitment radio stations have with their communities to provide useful information during incidents; if you're off the air, you're not meeting your commitments.

OK, you have good disaster recovery preparations and lots of redundancy in your plant. You're ready to go, right? If you have disaster preparations but don't have a written, tested, practiced and curated plan, your station isn't ready for an incident. Broadcast facilities can be fairly complex operations, especially when you factor in translators, HD Radio channels and online streams, all with different (or common) content stores and technology.

Do any of the following sound familiar?

• "I've been meaning to write something down, but it's just been too busy … forever."

• "Yes, we have a disaster recovery plan; I keep it in my head … so all anyone has to do is call."

• "That was an intern project five or six years ago."

• "Yes, we had one, but it was impossible to maintain. There are just too many ways things can go wrong."

A lot of disaster recovery plans are unwieldy, burdened by a seemingly infinite number of specific scenarios. Some, good intentions aside, never get written down because of the daunting amount of detail required. They're written from a troubleshooting perspective instead of an operations perspective: by engineers who want to understand what's gone wrong rather than take protective action to keep their service alive. In other words, a lot of disaster recovery plans are cause-focused rather than response-focused.

Cause-focused scenarios aren't an inherently terrible way to think about disaster recovery. It helps to think through a specific incident as a template for the preparations needed to respond and recover. A cause-focused scenario could be "the air conditioning died," which generates procedures on troubleshooting and fixing the issue. Additional procedures describe what operations and on-air staff needs to do: in this case, get ready to shut down unneeded devices or get ready to move to a backup site.

As you add more scenarios, you get a detailed list of possible causes of service interruptions with overlapping response plans. So far, so good.


The trouble comes when faced with a complex facility and dozens of scenarios. A facility doesn't have to be large or in a large market to be complex these days; even smaller stations have HD Radio channels and significant online presences. Where there was at one time a single program stream feeding a transmitter, there are now easily a half-dozen or more individual streams, each important to its listeners.

It's difficult to build a workable plan for each scenario based on individual causes, and in the early stages of an incident, staff faces the added challenge of categorizing the scenario properly and selecting an appropriate response plan.

Another way to look at the situation is to think about what the staff response to an incident looks like, not the specific cause. For example, it really doesn't matter whether there's a fire in the building or a storm has broken windows and killed power sources; at the start, staff has to do largely the same things: get out safely and move services to a backup site.

Separating causes from incident responses radically reduces the number of scenarios and the complexity of the responses. This is important for staffers trying to respond to an incident when faced with incomplete or confusing information.


What do response-focused scenarios look like? Here are a few examples I've used in working on disaster recovery projects:

Evacuate Building Immediately – Operations Offline: This is the worst-case scenario. The main production/studio site is offline and staff has to leave immediately. This could be due to fire, power issues that can't be mitigated, severe weather, earthquakes, building collapse or environmental issues. In this case, service interruption is almost inevitable and noticeable to listeners. Staff focus at this point is getting everyone to a safe place and starting up a backup site (if one is available). Communications, especially anything that goes through the main facility, could be interrupted.

Evacuate Building Soon – Transition to Backup: This is the most likely scenario for broadcast operations. Something has happened that will more than likely drive staff from the main production/studio site, but not immediately. This could be due to non-catastrophic building damage, power issues (including running out of generator fuel), predicted bad weather (especially events like hurricanes), HVAC failure or serious telecom issues. Staff focus is to start up and cleanly transition to backup site(s) with minimal interruption, and to make sure the content caches at the backup sites are current.

Lights On, Nobody Home – Evacuate Operating Facility: This is on the surface an unlikely scenario, but it's not unheard of. In this case, something has driven the people from the production/studio site, but the infrastructure is largely or completely operational. Causes for this scenario include environmental issues, external unrest or threats, or weather that keeps staff from making it to work. This scenario is heavy on remote access to internal systems at the site.

I'm sure you can think of other scenarios. Notice how high-level the scenarios are; they're expressed as operational actions, not as triggering incidents. They're distinct enough that determining which path to follow after a triggering incident is a fairly straightforward process. It shouldn't take more than a few minutes to figure out whether your facility needs to be evacuated immediately, whether staff can wait (and work) for a few hours or whether the technical core is operational. They should also be easy to communicate to managers who may be called on to make important decisions about how to proceed in a very short time.
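As a sketch of that triage, the questions staff answer in those first few minutes could map to scenarios along these lines. The mapping and names here are illustrative, not a recommended decision tree; your plan's scenarios and criteria will differ:

```python
# Hypothetical triage helper: maps two quick yes/no questions, answered at
# the start of an incident, to one of the response-focused scenarios above.
def select_scenario(evacuate_immediately: bool, core_operational: bool) -> str:
    """Pick a response-focused scenario from two triage questions."""
    if evacuate_immediately:
        # No time for a clean transition; get out and start the backup site.
        return "Evacuate Building Immediately - Operations Offline"
    if core_operational:
        # People can't stay, but the plant still runs; operate it remotely.
        return "Lights On, Nobody Home - Evacuate Operating Facility"
    # Staff has some time; transition cleanly to the backup site.
    return "Evacuate Building Soon - Transition to Backup"
```

The point is not the code itself but that the decision fits in a few lines: a manager with incomplete information can still pick a path quickly.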

There are a few important exceptions to the response-focused scenarios. In almost every part of the country, there's a well-known threat: in the West, it's earthquakes or fires; on the Gulf Coast and Eastern Seaboard, it's hurricanes; in the Midwest, there's Tornado Alley; etc. Each of those threats has unique features that need to be addressed. For example, if your station is in a location that could be affected by an earthquake, you need to be prepared for the possibility that your backup site is affected by the same incident that forced you from your studios, and that staff mobility may be severely curtailed. In those cases, a specific scenario for that incident is absolutely appropriate.

Another exception would be for nonstandard but manageable operations. The prime example here is "Running safely on UPS and generator." There may be actions staff has to take, like shutting down unused equipment to shed load, but the staff is otherwise operating in place. Of course, this could evolve into "Evacuate Building Soon" as an incident threatens to extend past the generator's fuel supply.


To build out a straightforward disaster recovery plan, start with five or six high-level scenarios, then add three lists to each scenario:

First, make a list of assumptions. What systems and facilities need to be active and available to execute your plan properly? That list would have items like:

• Backup audio codec at the DR site has the proper setup information

• There's power at the DR site and systems there respond to pings

• There's a bailout bag with laptop, batteries and a phone at the main site, ready to go

These are great items to check weekly or daily.
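Checks like these lend themselves to light automation. Here's a minimal sketch in Python; the hostnames and ports are placeholders for illustration, not real systems, and it only covers the checks a script can actually make (it can't verify the bailout bag):

```python
# Sketch of a routine "assumptions" check for a DR plan. The DR-site
# hostnames and ports below are hypothetical placeholders.
import socket

# Hypothetical DR-site systems and the TCP ports they should answer on.
DR_SYSTEMS = {
    "backup-codec.example.net": 5004,   # assumed codec control port
    "dr-automation.example.net": 22,    # remote access to automation host
}

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_assumptions(systems: dict) -> list:
    """Return a warning message for each system that doesn't answer."""
    return [
        f"WARNING: {host}:{port} is not responding"
        for host, port in systems.items()
        if not is_reachable(host, port)
    ]

if __name__ == "__main__":
    for warning in check_assumptions(DR_SYSTEMS):
        print(warning)
```

A cron job running something like this daily, with the warnings emailed to engineering, turns the assumptions list from a document into a standing check.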

Second, make a list of procedures to execute to respond to the incident. That list has items like:

• Switch to the backup codec feeding the transmitter

• Switch online streams to their backup systems

• Any other procedures to switch to backup systems

Third, make a list of procedures to execute to return to normal operations, items like:

• Switch the transmitter feed back to the main studio

• Switch online streams back to their main systems

• Any other procedures to switch back to main systems

The third list is especially important if your incident response includes systems that have frequently updated databases, like content management and automation systems. If you're not careful, you can sometimes cause more damage restoring normal operations than the original incident did.

That's the core of an effective response-based disaster recovery plan. It's still a lot of work to construct (especially from scratch) and must be tested and maintained, but this architecture is much more straightforward than trying to write and curate dozens of "if this breaks" scenarios. It's not a complete plan yet; you still need communications and decision-making sections. I'll have more on those later.

Next time, I'll talk about using IT planning tools, especially Recovery Time Objective (RTO) and Recovery Point Objective (RPO), to help you figure out recovery priorities. They provide a great way to think about your broadcast plant and give insight into your architecture and recovery procedures.
