(Disaster Recovery Plan)
IT disasters come in many forms, ranging from the loss of network connectivity to the loss of an entire data center. The simplest components are often the culprits: data centers don’t normally go down, power supplies do. Network providers don’t normally go down, network cards do. Fortify yourself against the small issues and you will have protected yourself against 90% of the issues that may impact your customer base.
Disaster Recovery Plan Step 1: Business Impact Analysis
The first step is to define what your company cannot live without, otherwise known as a Business Impact Analysis (BIA). This step of the DR process will involve the leaders of your company, as they are the ones who will define which applications must stay up and running for the company to do business: the “mission critical applications.”
Once the mission critical applications have been identified, you must then agree on how much downtime is acceptable for each of them, known as the recovery time objective (RTO). The difference between zero downtime and fifteen minutes of downtime is significant from a cost perspective, so set the RTO deliberately rather than defaulting to zero.
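To make the cost conversation concrete, here is a minimal sketch of how you might compare RTO options. It assumes a simple worst-case model in which every incident takes the full RTO to recover from. The function name and all of the figures (incident count, cost per minute) are hypothetical placeholders, not numbers from any real environment.

```python
def expected_downtime_cost(rto_minutes, incidents_per_year, cost_per_minute):
    """Worst-case yearly downtime cost, assuming each incident uses the full RTO."""
    if rto_minutes < 0 or incidents_per_year < 0 or cost_per_minute < 0:
        raise ValueError("inputs must be non-negative")
    return rto_minutes * incidents_per_year * cost_per_minute


if __name__ == "__main__":
    # Hypothetical inputs: 4 incidents a year, 500 (in your currency) lost per minute.
    for rto in (0, 15, 60):
        cost = expected_downtime_cost(rto, incidents_per_year=4, cost_per_minute=500)
        print(f"RTO of {rto} minutes: worst-case downtime cost per year = {cost}")
```

Setting the result for each RTO next to the price of the infrastructure needed to achieve it gives the leadership team a number to weigh, rather than a guess.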
Disaster Recovery Plan Step 2: Risk Assessment
Once your mission critical applications have been identified and your RTO has been defined, you can begin to architect your disaster recovery strategy. Begin looking at your infrastructure from two vantage points:
1. The infrastructure that you control.
2. The infrastructure that you don’t control.
In regard to the infrastructure that you control, look for single points of failure. This, by far, is the number one cause of disruption to your customers. Have your IT team map out the underlying infrastructure and identify the single points of failure. As an old boss used to tell me, in the IT business, “two equals one and one equals none.” If you have a single network card in a server and it goes down, you have none. If you have your data stored in one location and it goes down, you have none. If you only have one network provider and it goes down, you have none.
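The “two equals one and one equals none” rule can be applied mechanically once the infrastructure has been mapped. The sketch below assumes the map has been reduced to a simple inventory of component names and the number of independent instances of each. The inventory format, the function name, and the sample entries are illustrative assumptions only.

```python
from typing import Dict, List


def single_points_of_failure(inventory: Dict[str, int]) -> List[str]:
    """Return the components that have fewer than two independent instances."""
    return sorted(name for name, count in inventory.items() if count < 2)


if __name__ == "__main__":
    # Hypothetical inventory: component name -> number of independent instances.
    inventory = {
        "network card (app server)": 1,
        "power supply (app server)": 2,
        "network provider": 1,
        "data copy": 2,
    }
    for component in single_points_of_failure(inventory):
        print(f"Single point of failure: {component}")
```

Anything the function reports is a candidate for a second card, a second feed, or a second provider.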
In regard to the infrastructure you don’t control, begin looking at your external partnerships. Since you cannot control their infrastructure, you will need to find ways to mitigate issues should they encounter problems. For example, store your primary copy of data in your data center, but put the secondary copy into the cloud. Talk with different network providers and bring a second link into your data center.
Disaster Recovery Plan Step 3: Risk Management
Once you’ve defined the risks, it’s time to take action to mitigate them. Add a second network card to your systems. Buy servers with dual power supplies. Have dual power feeds brought into your rack and then plug your systems into different power sources. Set up mission critical servers in an active-passive configuration. Build your environment in such a way that you can deal with the most common failure scenarios.
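As a rough illustration of the active-passive idea, the sketch below picks which node should serve traffic based on health checks tried in priority order. It is a simplified model rather than a production failover mechanism: the node names and the health-check callables are hypothetical, and real setups usually rely on a load balancer or cluster manager to make this decision.

```python
from typing import Callable, Sequence, Tuple


def choose_active(nodes: Sequence[Tuple[str, Callable[[], bool]]]) -> str:
    """Return the first node, in priority order, whose health check passes."""
    for name, is_healthy in nodes:
        if is_healthy():
            return name
    raise RuntimeError("no healthy node available")


if __name__ == "__main__":
    # Hypothetical health checks: the primary is down, the standby is up.
    nodes = [
        ("primary", lambda: False),
        ("standby", lambda: True),
    ]
    print(f"Serving traffic from: {choose_active(nodes)}")
```

The standby only takes over when the primary fails its check, which is what separates an active-passive pair from two servers that both serve traffic all the time.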
Disaster Recovery Plan Step 4: Testing
The last step is to test your failure scenarios under controlled circumstances. It is better to uncover a shortcoming in your infrastructure during a planned test than during a real emergency. In a controlled test, if you discover that one of your network cards is not working properly, you can abort the test, buy and install a new card, and then run the test again. If you discover this during a real emergency, it will take you time to purchase and install the new card, and you will miss your RTO. As I mentioned at the outset, following these steps will allow you to withstand 90% of the common issues that bring down your site.
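One way to score a controlled test is to record when the failure was triggered and when service was restored, then compare the gap against the agreed RTO. The sketch below assumes you capture those two timestamps yourself during the drill. The function name and the sample times are hypothetical.

```python
from datetime import datetime, timedelta


def drill_meets_rto(failure_at: datetime, recovered_at: datetime, rto: timedelta) -> bool:
    """Return True if the measured recovery time is within the RTO."""
    if recovered_at < failure_at:
        raise ValueError("recovered_at must not be earlier than failure_at")
    return (recovered_at - failure_at) <= rto


if __name__ == "__main__":
    # Hypothetical drill: failure triggered at 02:00, service restored at 02:12.
    failure_at = datetime(2024, 1, 1, 2, 0)
    recovered_at = datetime(2024, 1, 1, 2, 12)
    rto = timedelta(minutes=15)
    result = "met" if drill_meets_rto(failure_at, recovered_at, rto) else "missed"
    print(f"RTO {result}: recovery took {recovered_at - failure_at}")
```

Keeping the results of each drill makes it easier to see whether recovery times are drifting toward the RTO before a real emergency exposes the problem.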