You can’t stop the unthinkable from happening, but a well-thought-out IT disaster recovery plan minimizes the fallout. IT has global standards for disaster recovery: “Strategies should define the approaches to implement the required resilience so that the principles of incident prevention, detection, response, recovery and restoration are put in place,” is how the ISO/IEC standard reads. Let’s dig into what that means and look at practical ways to achieve it.
Start by taking inventory
Every good disaster recovery plan starts by knowing what you have, where you have it, and how it’s configured. Start your inventory by assessing physical space. Server rooms, network operation centers, a data center—anywhere IT equipment is stored. Note the hardware in each space.
Start with your main racks, and work outward.
- Physical servers
- Network gateways, routers, and switches
- NAS and shared storage
- Power supply equipment
Move on to your endpoints, and go room by room.
- Access points, secondary Ethernet switches
- Workstations, desktop PCs, monitors, laptops
- VoIP and phones
- Printers and peripherals
As you’re moving along, note any special configurations in your setup. Imagine you have to start from scratch: what should someone know to have your IT up and running in a timely fashion?
Move on to virtual machines, applications, and software
Record which operating systems and virtualization technology your systems rely on to function.
- Server software
- Locally-hosted applications
- Cloud-hosted applications + contact for vendor support
- Any configurations that would help start over after a catastrophic event.
Organize your inventory by tying company data to specific machines that store it
Where do specific types of data exist in your infrastructure? Label each type by function: decision support systems, transactional systems, management information systems, and knowledge management systems are the big four in traditional business IT. You’ll need this information to prioritize restoration should something knock out multiple systems. Segment anything related to regulatory matters like HIPAA, PCI, and industry-specific IT governance.
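One way to keep this mapping usable is to store each inventory entry as a small record that ties a machine to the data it holds. The sketch below is only an illustration: the `Asset` fields, hostnames, and locations are hypothetical, and a spreadsheet or asset-management tool would serve the same purpose.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Asset:
    """One inventory entry: a machine and the company data it stores."""
    name: str       # hypothetical hostname
    location: str   # room or site where the asset lives
    kind: str       # e.g. "physical server", "NAS", "virtual machine"
    data_functions: List[str] = field(default_factory=list)
    regulations: List[str] = field(default_factory=list)


# Hypothetical sample entries, for illustration only.
inventory = [
    Asset("db-01", "Server room A", "physical server", ["transactional"], ["PCI"]),
    Asset("nas-01", "Server room A", "NAS", ["knowledge management"]),
    Asset("bi-vm", "Data center", "virtual machine", ["decision support"]),
    Asset("ehr-01", "Data center", "physical server", ["transactional"], ["HIPAA"]),
]


def assets_by_function(assets: List[Asset], function: str) -> List[str]:
    """Names of the machines that store data serving the given function."""
    return [a.name for a in assets if function in a.data_functions]


def regulated_assets(assets: List[Asset]) -> Dict[str, List[str]]:
    """Machines that hold regulated data, mapped to the rules that apply."""
    return {a.name: a.regulations for a in assets if a.regulations}


print(assets_by_function(inventory, "transactional"))
print(regulated_assets(inventory))
```

Keeping regulated assets in a separate view makes it easier to segment them when you plan restoration.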
Rank systems by importance to running the business
Identify your critical systems and rate them in order of importance. For example, if you’re an online retailer, the systems that keep your website transacting securely are high priority. If you’re a manufacturer, you’ll place emphasis on CNC machine tools, CAMs, and process-related machinery. Systems for inventory and logistics, safety, and security will be of paramount importance. Environmental and sanitation systems are critical as well.
If a downed system shuts down business, or worse, produces more chaos in the absence of proper functioning, the goal is to produce a solution as quickly as possible. That is the essence of a formal IT disaster recovery plan.
Before disaster strikes, you should know:
- What you need to restore your systems
- How long it will take
- Who performs each task
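The three questions above, together with the ranking, can be captured in one record per critical system. This is a minimal sketch under stated assumptions: the `CriticalSystem` fields, the sample systems, the hour estimates, and the owner names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CriticalSystem:
    name: str
    priority: int             # 1 = restore first
    restore_needs: List[str]  # what you need to restore the system
    estimated_hours: float    # how long restoration should take
    owner: str                # who performs the restoration


# Hypothetical entries for an online retailer.
systems = [
    CriticalSystem("Inventory and logistics", 2, ["database backup"], 6.0, "Sam"),
    CriticalSystem("E-commerce website", 1, ["backup server", "site backup"], 4.0, "Alex"),
    CriticalSystem("Office printers", 3, ["spare printer"], 2.0, "Jordan"),
]


def restoration_order(entries: List[CriticalSystem]) -> List[CriticalSystem]:
    """Return systems sorted so the most important one comes first."""
    return sorted(entries, key=lambda s: s.priority)


for system in restoration_order(systems):
    print(f"{system.priority}. {system.name}: {system.estimated_hours} h, owner {system.owner}")
```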
Measure cost of downtime in a business impact analysis (BIA)
Having a sense of prioritization for your business systems is the nuts and bolts of your business impact analysis (BIA). A good BIA will note several other important items, primarily the costs associated with the failure of each system. Why is BIA so important for IT? Calculating cost of failure helps justify spending for preventative measures in terms that financial decision-makers understand.
How you calculate the cost of downtime will be different for each organization, but here’s a baseline measure that accounts for several common factors:
Cost of downtime = Revenue lost + Productivity lost + Replacement cost for lost equipment + Intangible cost
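As a rough sketch, the baseline measure is a sum of four estimated dollar amounts. The `DowntimeEstimate` class and every figure below are hypothetical placeholders, not benchmarks.

```python
from dataclasses import dataclass


@dataclass
class DowntimeEstimate:
    """The four cost components of one outage, all in dollars."""
    revenue_lost: float
    productivity_lost: float
    replacement_cost: float
    intangible_cost: float

    def total(self) -> float:
        """Cost of downtime as the sum of the four components."""
        return (self.revenue_lost + self.productivity_lost
                + self.replacement_cost + self.intangible_cost)


# Hypothetical figures for a single outage.
estimate = DowntimeEstimate(
    revenue_lost=12_000.0,
    productivity_lost=3_500.0,
    replacement_cost=8_000.0,
    intangible_cost=5_000.0,
)
print(f"Estimated cost of downtime: ${estimate.total():,.2f}")
```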
Estimate lost revenue
Using very basic business intelligence principles, you can calculate the revenue a company makes in an hour, or a week, or a month. How much of this revenue depends on the uptime of systems X, Y, and Z? Calculate this as a percentage of total revenue in a BIA report.
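A simple way to turn that percentage into a dollar figure is to multiply hourly revenue by the share that depends on the system and by the length of the outage. The function name, parameters, and sample numbers here are assumptions for illustration.

```python
def lost_revenue(revenue_per_hour: float, dependency_share: float, hours_down: float) -> float:
    """Estimate revenue lost while one system is down.

    dependency_share is the fraction (0.0 to 1.0) of total revenue
    that depends on the system being up.
    """
    if not 0.0 <= dependency_share <= 1.0:
        raise ValueError("dependency_share must be between 0.0 and 1.0")
    return revenue_per_hour * dependency_share * hours_down


# Hypothetical: $2,000 of revenue per hour, 25% of it tied to the system, 4 hours down.
print(lost_revenue(2_000.0, 0.25, 4.0))
```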
Estimate lost productivity
Think of lost productivity as the cost of paying employees for their time but getting nothing in return because they cannot work during downtime. How does system downtime affect worker productivity in percentage terms? If you have a machinist making $35 an hour and the CNC gets ransomware, that is a 100 percent loss for the time it takes to restore the system. If a server crashes and the creative department can’t use a collaboration tool, that might be a 35 percent productivity loss across the department. It won’t be an exact science, but reasonable estimates will suffice. It’s advisable to work with department heads to get an accurate feel for how systems relate to productivity.
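The two examples above can be worked through with a small helper. This sketch assumes the machinist’s $35 is an hourly wage, and the department size, its hourly rate, and the outage durations are hypothetical.

```python
from typing import List, Tuple


def lost_productivity(groups: List[Tuple[float, float]], hours_down: float) -> float:
    """Estimate wages paid for work that could not be done.

    Each group is (hourly labor cost, fraction of productivity lost).
    """
    return sum(cost * loss for cost, loss in groups) * hours_down


# Machinist at an assumed $35 per hour, fully idle while the CNC is restored.
machinist = (35.0, 1.0)
# Hypothetical creative department: 8 people at $40 per hour, 35% less productive.
creative_department = (8 * 40.0, 0.35)

print(lost_productivity([machinist], hours_down=6.0))
print(lost_productivity([creative_department], hours_down=3.0))
```

Passing several groups in one call gives a combined estimate for an outage that affects more than one team.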
Estimate replacement costs
For IT, typical replacement costs would include outside services for recovering lost data, and any physical items that need replacing. Make sure to factor in immediate costs associated with data that cannot be recovered, and any repercussions of that data loss over time. You might say company leadership becomes 20% less effective if they lose BI they can’t recover, or something to that effect.
Understand intangible costs, and give them a dollar amount
Intangible costs often affect the brand. A data breach resulting in a loss of trust tends to affect sales. A company might be forced into upping the marketing or PR budget to counter that loss of trust. Customer-facing assets are highly sensitive. Don’t leave out intangible costs of those systems going down.
Define the Response Strategy and Recovery Strategy
Now that you’ve identified your critical systems, and assigned costs and prioritization, it’s time to solve for potential threats using a response strategy.
|Critical System||Threat||Response Strategy||Response Workflow|
|E-commerce website||Server failure||Switch to backup server, validate power source||Verify backups are present. Test backup server.|
You’ll want to have response strategies for every critical system.
- If your business security cameras stop functioning, the response strategy might be to call in extra security personnel.
- If your power goes out, the response strategy might be to check that uninterruptible power supply equipment and backup generators have engaged, call the electric company, allocate power to high-priority systems, and base further decisions on the severity of the outage.
And so forth. Next, consider a Recovery Strategy to address losses stemming from a disaster.
|Critical System||Threat||Recovery Strategy||Recovery Workflow|
|E-commerce website||Server failure||Deploy backups onto secondary hardware. Transfer production to that site.||Obtain backups from remote location / cloud provider. Restore / test SQL Server, Exchange Server, SharePoint Server.|
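If you keep these tables in a structured form, you can look up a plan quickly and check that every critical system has one. The sketch below reuses the e-commerce rows from the tables. The `Plan` type, the function names, and the security camera threat wording are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Plan(NamedTuple):
    strategy: str
    workflow: List[str]


Key = Tuple[str, str]  # (critical system, threat)

RESPONSE_PLANS: Dict[Key, Plan] = {
    ("E-commerce website", "Server failure"): Plan(
        "Switch to backup server, validate power source",
        ["Verify backups are present", "Test backup server"],
    ),
}

RECOVERY_PLANS: Dict[Key, Plan] = {
    ("E-commerce website", "Server failure"): Plan(
        "Deploy backups onto secondary hardware, transfer production to that site",
        ["Obtain backups from remote location / cloud provider",
         "Restore / test SQL Server, Exchange Server, SharePoint Server"],
    ),
}


def find_plan(plans: Dict[Key, Plan], system: str, threat: str) -> Optional[Plan]:
    """Return the plan for a system and threat, or None if none is defined."""
    return plans.get((system, threat))


def missing_plans(critical: List[Key], plans: Dict[Key, Plan]) -> List[Key]:
    """List the (system, threat) pairs that still lack a plan."""
    return [key for key in critical if key not in plans]


critical_pairs = [
    ("E-commerce website", "Server failure"),
    ("Security cameras", "Cameras stop functioning"),  # hypothetical threat label
]
print(find_plan(RESPONSE_PLANS, "E-commerce website", "Server failure"))
print(missing_plans(critical_pairs, RESPONSE_PLANS))
```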
Make sure everyone knows their role
Chaos prevails when disaster strikes. The more organized you are, the faster you can get systems back up and running. Speed is important when time is literally money. Even with a small staff, one of the best things you can do is assign every point of your Response & Recovery workflow to one specific team member.
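A quick check like the following can confirm that no workflow step is left without an owner. It is a sketch only: the step list borrows from the example workflows, and the team member names are hypothetical.

```python
from typing import Dict, List


def unassigned_steps(workflow: List[str], owners: Dict[str, str]) -> List[str]:
    """Return the workflow steps that have no team member assigned."""
    return [step for step in workflow if not owners.get(step)]


workflow = [
    "Verify backups are present",
    "Test backup server",
    "Obtain backups from remote location",
    "Restore and test SQL Server",
]

# Hypothetical assignments: one named person per step.
owners = {
    "Verify backups are present": "Alex",
    "Test backup server": "Sam",
    "Obtain backups from remote location": "Jordan",
}

print(unassigned_steps(workflow, owners))
```

Mapping each step to a single name, rather than a team, keeps accountability clear when several people respond at once.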
Many businesses have an overarching disaster recovery plan, and the IT plan fits inside it. Here IT might be asked to account for its own assets and structure its roles and actions as they pertain to the company as a whole. Make sure that IT has enough personnel support so that systems can be restored in a timely manner.
Make sure that whoever controls IT procurement is aware of purchasing and logistics timelines. Should you need to buy computer hardware to restore your systems, it should be done with the utmost urgency.
Rehearse and update every time systems and personnel change
The IT disaster recovery plan is a living document. It should be tested and tuned continuously. The most important time to revisit the plan is when your toolset changes, and when your staffing changes. New hardware may alter your procedures; new personnel must be acclimated to their role in disaster recovery. Better to do this in a controlled dry-run.
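One lightweight way to enforce this is to compare the date of the last rehearsal with the dates of hardware and staffing changes. The function name and all dates below are hypothetical.

```python
from datetime import date
from typing import List


def plan_needs_review(last_rehearsal: date, change_dates: List[date]) -> bool:
    """True if any system or staffing change happened after the last rehearsal."""
    return any(changed > last_rehearsal for changed in change_dates)


# Hypothetical dates: the last dry run, then a hardware swap and a new hire.
last_rehearsal = date(2024, 3, 1)
changes = [date(2024, 1, 15), date(2024, 4, 10)]

print(plan_needs_review(last_rehearsal, changes))
```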
Additionally, test your backup systems regularly. There is no point in discovering that secondary hardware isn’t up to handling a business workload when it’s too late. That defeats the purpose of the exercise. Even if you execute like a well-oiled machine during practice runs, expect a real-life disaster to bring about unexpected problems you’ll have to work through. Having a solid IT disaster recovery plan in place is your best hope for survival.