Tuesday, June 03, 2014

Disaster Recovery Rehearsals, A Guide To Avoiding Failure

SRA command 'testFailoverStart' failed for device '/vol/'
Who left the tap on?

How confident are you that, if or when you need to enact your IT Disaster Recovery (DR) process, you will meet or exceed the expectations of management, customers, and shareholders?

Someone, somewhere, is in the middle of a disaster right now. It happens every day. Yet even though the price and effort-cost of providing DR have shrunk over the last 20 years, many organisations, even large and well-funded ones, continue to fail at it. Perhaps part of the problem is the word disaster itself. We hear about disasters daily: typhoons, earthquakes, tsunami, reactor melt-down, pandemics, civil war. These things seem pretty remote to the inhabitants of London's Old Street or The City. What about a fire? Or a burst Victorian water pipe? Or perhaps a small construction accident? What about a malicious mouse click by an irked systems administrator? Most IT disasters do not make good candidates for Hollywood movies, but that doesn't mean they don't happen.

To download our guide to the 7 most common reasons for failure of DR exercises, skip down to the end of this posting.

360is consultants recently took part in a Disaster Recovery (DR) exercise for a large London-based multi-national client. While nothing about the exercise was particularly new or remarkable, it serves as a reminder that even meticulously planned Disaster Recovery rehearsals can still hit snags on the day. We thought we'd share with you part of the briefing we gave to our client before the day of the test, so that you and your organisation may be better prepared for those unforeseeable problems.


The Project Goals
  • Work through a failover of all applications, public network connections, and user remote access from the primary site to the DR site (a rough runbook sketch follows this list).
  • The end-user team was to complete a battery of tests to determine the level of functionality at the DR site.
  • The entire exercise was to be conducted over the weekend, with everything back to normal for Monday morning.
  • As the client operates an international business working around the clock, real users needed to be able to continue working at the primary site without any disturbance during the exercise.
  • At the end of the exercise the DR site workloads and data were to be destroyed, or rather reset to the position they were in before the exercise began.
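To make that plan concrete, here is a minimal sketch of a weekend rehearsal expressed as an ordered runbook with time budgets. The phase names, durations, and the overall window are illustrative assumptions, not the client's actual plan; in practice each phase maps onto the recovery plans, replication jobs, and network changes described later in this post.

```python
# Illustrative only: a weekend DR rehearsal modelled as an ordered runbook.
# Phase names, durations, and the time window are hypothetical.
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    estimated_hours: float
    reversible: bool  # must be undone before Monday morning


RUNBOOK = [
    Phase("Confirm replication is up to date", 1.0, reversible=False),
    Phase("Test-failover workloads to the DR site", 4.0, reversible=True),
    Phase("Bring up public network connections and remote access", 2.0, reversible=True),
    Phase("End-user acceptance test battery", 5.0, reversible=False),
    Phase("Reset DR site to its pre-exercise state", 2.0, reversible=True),
]

WEEKEND_BUDGET_HOURS = 40  # hypothetical window between Friday night and Monday morning


def check_budget(runbook, budget_hours):
    """Fail early if the estimated phases cannot fit the weekend window."""
    total = sum(p.estimated_hours for p in runbook)
    if total > budget_hours:
        raise RuntimeError(f"Plan needs {total}h but only {budget_hours}h are available")
    return budget_hours - total  # slack left over for the inevitable snags


if __name__ == "__main__":
    print(f"Slack remaining: {check_budget(RUNBOOK, WEEKEND_BUDGET_HOURS)}h")
```

Even a toy model like this forces two useful questions before the day: which phases can be rolled back if the exercise is abandoned, and how much slack is left once every phase estimate is added up.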

The Risk Factors
  • The client scheduled the test for the day of a major national event during which most major roads in London are closed.
  • 50% of previous rehearsals had hit problems that halted the exercise and caused it to be abandoned.
  • There was a tight time window given the number of workloads, the acceptance testing, and the need to get everything back to its starting position for Monday.
  • The technologies used were well known but complex, and not without their own quirks/bugs.
  • 360is ultimately had only 36 hours' notice of the project and no prior involvement in building any of the infrastructure.
  • People from 3 organisations across 4 physical locations would be required to take part in the exercise.

The Outcome
While the client's data was never endangered, this project was a relatively high-risk one given the factors above. Although minor snags were encountered during the test, they were quickly overcome using our specialist product knowledge and multi-vendor experience across the technologies involved. The total test duration was approximately 15 hours wall-clock time, with 1 out of 200 workloads failing to come up satisfactorily. The exercise was pronounced a success by the client.


So why was the exercise a success this time compared to previous failures?

Success Factors
There were three reasons why this exercise succeeded where previous attempts had failed.
  • Preparation
Through previous exercises, the team had built up a relatively high degree of preparation. The order of events was well rehearsed (and documented) and everyone had a clear view of their own responsibilities. Checkpoints were established, and estimates of how long each of the 20+ phases would take proved relatively accurate.
  • Automation
A high degree of automation was achieved using the virtualisation and storage platforms available. In fact, the only significant snag was caused by a small part of the replication process which someone had implemented manually long ago, rather than leaving it to the automated mechanism; a simple coverage check of the kind sketched after this list can catch such gaps before the day.
  • Availability of Expertise
No matter how well prepared you are, every complex DR rehearsal will encounter some problems on the day, whether it be a bug in software, an oversight in the procedure, or simply something that was not anticipated because full-scale testing was never done. When this happens there is no time to open a call with the vendor, wade through Internet forums, or build a test rig; you need expertise in the room, on the phone, there and then. This is the area where 360is was able to intercede and directly influence the outcome of this project. Once a partial failure had been reported, we were able to provide our client with an explanation of what had gone wrong, why, what its significance was to the rest of the test, and what to do about it to prevent this becoming another aborted test.
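Returning to the Automation point above, the gap that caused the snag, a piece of replication handled outside the automated mechanism, is exactly the kind of thing a pre-rehearsal consistency check can surface. The sketch below assumes you can extract two inventories, one of datastores the recovery plans expect and one of volumes the scheduled replication actually covers; the names are hypothetical and the set comparison is the only point being made, not any particular vendor's API.

```python
# Illustrative pre-rehearsal check: every datastore a recovery plan expects
# must be covered by the automated replication schedule. The two inventories
# are hard-coded here; in practice they would be pulled from the
# virtualisation and storage platforms' own tooling.

# Hypothetical: datastores referenced by the protection groups / recovery plans.
expected_by_recovery_plans = {
    "finance_db", "exchange", "file_services", "remote_access_vdi",
}

# Hypothetical: volumes actually covered by the scheduled, automated replication.
covered_by_automated_replication = {
    "finance_db", "exchange", "remote_access_vdi",
}

missing = expected_by_recovery_plans - covered_by_automated_replication
orphaned = covered_by_automated_replication - expected_by_recovery_plans

if missing:
    # Candidates for a nasty surprise on the day: volumes replicated (or not)
    # by hand, outside the automated mechanism.
    print("Not covered by automated replication:", ", ".join(sorted(missing)))
if orphaned:
    print("Replicated but not in any recovery plan:", ", ".join(sorted(orphaned)))
if not missing and not orphaned:
    print("Replication coverage matches the recovery plans.")
```

Run as part of the preparation checklist, a check like this turns a mid-exercise failure into a line item fixed the week before.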

360is Consultancy Engagement Profile

Duration:
  • 3 man-days, including a preparatory project management meeting.

The Client:
  • A division of one of the Big 4 audit firms, with approximately 200 production workloads.

The Team:
  • Client: System Administrator, Project Manager, Offshore Application Testing Team, Programme Director.
  • Service Provider: Network, Security, Project Manager, System Administrator, not forgetting the NOC.
  • 360is: 1x Senior Consultant and 1x VMware Specialist.

The Technologies
  • NetApp (storage and replication)
  • VMware (Site Recovery Manager)
  • Juniper & Cisco (BGP routing & security)
  • Data centre operator facilities in east and west London

360is Guide To Avoiding Failure In Disaster Recovery
At some point or other we end up issuing the following guidance memo to all of our clients embarking on either building a DR infrastructure or conducting a DR rehearsal. It applies to all OS and application vendors and to all network and storage products, and while most of our clients are wholly or largely virtualised, it applies equally to non-virtual environments. To learn how to avoid the 7 most popular causes of failure in Disaster Recovery exercises, download the PDF [74KB].