Early in my career, I worked in a manufacturing environment. Though we were not yet a disciplined Six Sigma shop, customer requirements obligated us to conduct a monthly 7-Step Root Cause Analysis on the top 10 issues causing downtime or product defects on our assembly lines.
I learned two important things from this process that I believe apply to business continuity and disaster recovery exercises and testing. First, what appears on the surface to be the cause of a failure rarely is. Second, even if we identify the true root cause of a failure, the entire process is a waste if we don’t take measures to remedy that root cause.
Different organizations have developed a number of root cause analysis methodologies that meet their own specific needs. Particularly in manufacturing, where numerous tiers exist in the supply chain, it makes sense for those at the top to develop standards and pass those down the line to ensure some consistency in monitoring, reporting and controlling defects.
With respect to using a formal process for BC/DR testing, unless your company is in a regulated industry or is at the top of a supply chain where some vendors are extremely critical, I recommend taking time to understand the fundamental concepts of root cause analysis and then modifying the steps to fit your own company culture and existing processes.
Root Cause Analysis and DMAIC
My experience tells me the best starting point for ensuring test results find their way back into making the plan better is the DMAIC process.
D: Define
M: Measure
A: Analyze
I: Improve
C: Control
Each one of these words/concepts/phases in root cause analysis is a separate discipline in itself, and each one is key. Don’t skip steps!
Defining the Problem
Much of the time, especially in meetings and brainstorming sessions, we tend to jump right to throwing out solutions before we really understand what the problem is. It is imperative at this stage to focus specifically on defining the issue. We define it objectively, without placing blame. Further, we want to define it in quantitative terms, as best we can, so that we can measure the extent of the problem.
It’s also important to focus on the scope of the problem: identify which other systems or processes are impacted and which are not.
The final step, or it could be the first, is to designate a problem owner and stakeholders. Just like managing a project, the process is much more effective with a single champion who is accountable for solving the problem.
Measuring the Problem
If we have defined the problem well, the results should be predictable and repeatable, and therefore measurable. In BC/DR we may measure business process resumption simply with a pass/fail, worked/didn’t work metric, or we may measure customer calls missed or revenue lost if the data is available. If we’re talking about data and IT, we may measure the number of accounts affected, users impacted, records recovered versus lost, or time relative to RTOs and RPOs.
Again, the trick in defining and measuring the problem is to narrow it down to a point where we can measure the extent to which the problem impacts recovery efforts so that we can, in turn, measure the extent to which corrective actions actually improve those efforts.
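As a rough illustration of measuring against RTOs and RPOs, the sketch below compares the recovery time and data loss observed in a test with the targets in the plan. The class name, field names and sample figures are hypothetical, made up for this example; substitute whatever your own plan defines.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryResult:
    """One system's result from a DR test (all names are illustrative)."""
    system: str
    rto: timedelta               # target: maximum acceptable time to restore
    rpo: timedelta               # target: maximum acceptable window of data loss
    actual_recovery: timedelta   # measured during the test
    actual_data_loss: timedelta  # measured during the test

    def rto_gap(self) -> timedelta:
        # A positive value means the RTO was missed by that much.
        return self.actual_recovery - self.rto

    def rpo_gap(self) -> timedelta:
        # A positive value means the RPO was missed by that much.
        return self.actual_data_loss - self.rpo

    def passed(self) -> bool:
        zero = timedelta(0)
        return self.rto_gap() <= zero and self.rpo_gap() <= zero


# Made-up figures for a single test run.
result = RecoveryResult(
    system="order-entry",
    rto=timedelta(hours=4),
    rpo=timedelta(minutes=15),
    actual_recovery=timedelta(hours=6, minutes=30),
    actual_data_loss=timedelta(minutes=10),
)

print(result.system, "passed" if result.passed() else "failed")
print("RTO gap:", result.rto_gap())
```

Recording the gap rather than only pass/fail gives a number to compare against after a corrective action is in place.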
Analyzing the Problem
Now, for the first time, we get to try to figure out what’s causing the problem. There are several different methods to do this, from simple brainstorming to drawing diagrams. I like the 5 Whys approach because most organizations are not using statistical process control on BC/DR planning, and trying to is probably overkill. The 5 Whys approach is simple and it generally works.
The idea behind 5 Whys is simply to ask “Why?” through five iterations. Start with “Why did we have this problem?” (referring to the problem identified earlier). Here’s an example:
1. Why did we have the problem?
A. Because the DR plan was not up-to-date.
2. Why was the DR plan not up-to-date?
A. Because Joe didn’t update it.
3. Why didn’t Joe update it?
A. Because he was busy with other priorities.
4. Why does Joe have other priorities over updating his DR plan?
A. Because his performance metrics are based on activities not related to his DR plan.
5. Why are performance metrics not tied to DR planning?
A. Because we don’t have a policy in place that tells managers to hold employees accountable for DR-related activities.
OK, so you see how it works. The problem might be related to the plan not being up-to-date, but by drilling down we can see there is a deeper root cause. And what we typically find is that this root cause is not just the cause of this single failure, but is generally a fundamental failure with the potential to impact other areas, processes or systems.
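If you want to keep the chain of questions with the test report, a small structure is enough. The following is a minimal sketch, not a prescribed tool: the function names are hypothetical, and it assumes the answer to the last "why" asked is the one the team treats as the root cause.

```python
from typing import List, Tuple

# Each entry pairs a "why" question with the answer the team agreed on.
WhyChain = List[Tuple[str, str]]


def root_cause(chain: WhyChain) -> str:
    """Return the answer to the last 'why' asked, treated here as the root cause."""
    if not chain:
        raise ValueError("ask at least one 'why' before naming a root cause")
    return chain[-1][1]


def format_chain(chain: WhyChain) -> str:
    """Render the chain as numbered question/answer pairs for a test report."""
    lines = []
    for number, (question, answer) in enumerate(chain, start=1):
        lines.append(f"{number}. {question}")
        lines.append(f"A. {answer}")
    return "\n".join(lines)


# The example from the text above.
chain: WhyChain = [
    ("Why did we have the problem?",
     "Because the DR plan was not up-to-date."),
    ("Why was the DR plan not up-to-date?",
     "Because Joe didn't update it."),
    ("Why didn't Joe update it?",
     "Because he was busy with other priorities."),
    ("Why does Joe have other priorities over updating his DR plan?",
     "Because his performance metrics are based on activities not related to his DR plan."),
    ("Why are performance metrics not tied to DR planning?",
     "Because we don't have a policy in place that tells managers to hold "
     "employees accountable for DR-related activities."),
]

print(format_chain(chain))
print("Root cause:", root_cause(chain))
```

Five is a rule of thumb; the list can hold fewer or more entries if the drilling-down stops earlier or needs to go further.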
Improving the Situation
Sometimes the Analysis phase will point very closely to a good solution. But even so, it’s good to brainstorm and work out the details of several options. For each, understand the time required and conduct a cost/benefit analysis.
Also evaluate each proposal to ensure that implementing a particular solution does not break or cause problems elsewhere. Especially when we are talking about systems and applications, changes can have significant impact and should be passed through a formal change control process (a subject for another time).
The final component of Improvement goes back to the metrics identified in the Measure phase. Know ahead of time how and to what extent the improvement is expected to address the problem. This will be important in future testing to ensure the improvement was implemented correctly and yielded the expected positive results.
Once the best solution is selected, depending on the size and scale, it should be implemented either through a simple Corrective Action process or, if needed, a separate, stand-alone project.
Control
Even if we implement the right solution to the true root cause of a problem, if we neglect it over time, the problems resurface. Hence, especially in BC/DR where many of our solutions are going to be based on changes to processes, the organization needs a mechanism to verify that the process remains under control. This frequently takes the form of an internal process audit.
In designing our improvement, then, implementing the solution is not enough on its own. We need to additionally define how we will continuously measure the process going forward and verify that the solution continues to prevent the original problem or reduces the frequency of occurrence.
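One simple way to picture this ongoing verification is to record the same metric after every test and flag any run where it drifts past the target. The function name, the metric (hours to recover) and the numbers below are assumptions for illustration only, and this is a plain threshold check rather than statistical process control.

```python
from typing import Dict, List


def out_of_control(history: Dict[str, float], target: float) -> List[str]:
    """Return the labels of tests whose measured value exceeded the target."""
    return [label for label, value in history.items() if value > target]


# Hypothetical recovery times (in hours) from successive tests after the fix.
recovery_hours = {
    "test-1": 3.5,
    "test-2": 3.8,
    "test-3": 4.6,  # drifted past the target
    "test-4": 3.9,
}
RTO_HOURS = 4.0  # assumed target taken from the plan

slipped = out_of_control(recovery_hours, RTO_HOURS)
if slipped:
    print("Review needed for:", ", ".join(slipped))
else:
    print("Process remains under control")
```

A flagged test is a prompt to go back through the Define and Analyze steps for that run, not proof that the original fix failed.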
Conclusion
In business continuity and disaster recovery testing, many organizations take the findings and results from tests and exercises and fix the immediate issues without addressing the root cause. Doing so means the organization continues to face many of the same related issues repeatedly. Incorporating a semi-formal root cause analysis process following each test, and for each problem identified, can result in better and more reliable plans, and a more efficient and cost-effective test and exercise program.

Had to tweet a from @risk_reward
“Disaster Recovery Test Results and What to do with Them” – http://bit.ly/9y2vIU is a neat and practical blog by Chad Goode – worth a peek.
Because it is – well done and thanks. There is too little fundamentally sensible stuff out there.