Preparing for Disaster
Volume Number: 15 (1999)
Issue Number: 9
Column Tag: Systems
by Paul Shields
Developing a Disaster Recovery Plan for Your Organization
What is a disaster? The dictionary defines a disaster as "an event resulting in great loss and misfortune." For business owners, this phrase translates into: A disaster is any event that affects the ability of a company to meet financial or contractual obligations.
Are your systems adequately protected from disaster? What kinds of disasters are you prepared for? Environmental events, theft, and accidental user errors are all forms of disasters that can directly impact a company's ability to survive. Most organizations think too big when they hear the words "disaster recovery" and thus fail to deal with some of the most basic issues to ensure that they can remain operational in case of a minor disaster.
Two other factors should influence your network administrator's concern over being properly prepared. First, the year 2000 is quickly approaching, and every day new reports of potential problems make their way into the media. If the power grid in your region were to fail on January 1, 2000, would your company be prepared and able to survive? The second issue is the Internet's effect of making 24x7 support for business transactions mandatory. Customers expect your service to be on-line at all times and, with a worldwide presence, the middle of the night for you may be prime time for a customer or potential customer. Every moment of downtime not only translates to lost productivity internally, but may also affect your company's competitive position.
Most business owners make two potentially fatal assumptions when developing a disaster recovery plan. First, they assume they are too small to justify the time and expense of developing a plan. This is a bad assumption because the smaller the business, the more catastrophic the impact of even the smallest disasters. Small businesses are concentrated at a single location and thus have all their eggs in one basket. A large multi-national firm, while impacted by the loss of a single office, could survive with the staff and resources at other locations during the recovery process. The second fatal assumption is that disaster recovery plans only cover large-scale disasters like hurricanes, floods, or terrorist attacks, which many business owners believe are unlikely to happen. This kind of thinking does not take into account the small-scale disasters that are just as devastating. How many small businesses have a plan to deal with a fire that destroys the office or a theft that results in the significant loss of equipment? Or a daylong regional power failure like the one affecting more than a million people in San Francisco late last year?
A disaster recovery plan should be comprehensive. Ideally, you should write the plan in such a way that it covers the recovery process generically, ensuring survivability and recovery, no matter how large or small the disaster.
The most vital thing the network administrator can do is to document everything. This includes documenting serial numbers and configurations of all computers, hubs, switches, routers, servers, printers, desktops, and installed software. This data should be stored in a secure off-site location. Ideally, this data, along with the disaster recovery plan, should be stored off-site in paper form so that no time is lost retrieving configuration data and recovery plans from tapes or broken computers. Multi-site companies may be able to store disaster plans and configuration information in hard copy form or as a database in another location.
After capturing the information on what to restore, you must develop a plan for recovery. This plan would be the typical "disaster recovery" plan common to most businesses. A typical plan should include several components, such as supplier information, inventory data, assigned roles & responsibilities, and a priority scheme for restoring specific components and processes. The disaster recovery plan should also document, in advance, the process used to prepare for disasters, including backups, inventory control, and change management.
Insurance & Supplier Information
In a major disaster, one of the first contacts is the insurance company. Once funds are available to replace lost or broken equipment, contact vendors and make arrangements for equipment purchases or leases depending on your immediate requirements. Use the configuration information as a basis for ordering replacement equipment.
Document the network layout, including the model and part numbers for all network devices. If possible, have configuration files from the network devices both in printed form and saved to electronic media. Most importantly, update these every time there are changes. That is why a change-management process is so critical.
Server and Software Configurations
Servers play as vital a role as network components in most businesses. The servers may be running your order management system, web services, or acting as data repositories. Regardless, these servers and their data may need to be back on-line as quickly as possible. If an Internet hacker deletes the contents of a web server, how quickly can you rebuild the server and restore your company's web site? If the web site is the foundation of your business, how much time do you have?
Most desktop and server computers are not of much use without the software that runs the business. An inventory of server applications in use, including serial numbers or proof-of-ownership records, should exist. This allows for easy re-installation of software licenses or acquisition of replacement media from the software vendors.
Building a comprehensive disaster recovery plan can be a challenging task for one person to complete. For the best results, build a disaster recovery team. Assign one individual as the disaster recovery prime who handles coordinating the team's documentation efforts and the recovery process in an emergency. Make sure you assign a backup prime to cover the possibility of the prime not being available. Include on the team experts from the network (LAN & WAN), server, and desktop support groups. This team needs management oversight to ensure that the team understands the business operations and their critical nature. Once developed, review the disaster recovery plan on a quarterly basis to ensure compliance and validate business needs.
Performing backups is the first step in protecting company data from loss. Unfortunately, most administrators fail to do backups properly. How many network administrators store backup tapes off-site? Even worse, how many store the backup tapes on the overhead bookshelf adjacent to a fluorescent light, which emits electrical and magnetic fields potentially harmful to magnetic media? Both practices put the tape contents at risk and negatively affect the administrator's ability to restore data in case of a disaster.
When you develop a backup schedule you must consider several factors. First, how critical is each day's data, and how much of the data turns over each day? If it is critical that you be able to restore without losing more than a single day's work, then the backup schedule and tape rotation process is slightly more complex.
The minimal recommendation is a backup strategy that rotates 3-4 sets of tapes on a weekly basis. Each tape set contains one full backup and nightly incremental backups from the remainder of the week. At the end of the week, the administrator removes the tapes from the drive and sends them off-site for storage. This ensures that no more than one week's data is at risk. A problem with shipping tapes off-site immediately after pulling them from the drives is the retrieval time for file restores in non-critical situations. You can overcome this problem by making duplicates of the tapes before sending them off-site, thus retaining a copy on-site for easy restores.
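The weekly rotation described above can be sketched in a few lines of Python. This is only an illustration: the function name, the choice of four tape sets, and the assumption that the full backup runs on the first night of each seven-day cycle are ours, not part of any particular backup product.

```python
from datetime import date, timedelta


def rotation_schedule(start, weeks, tape_sets=4):
    """Yield (day, tape_set, backup_type) for a weekly tape rotation.

    Assumption: each 7-day cycle begins with a full backup and the
    remaining six nights are incremental backups on the same tape set.
    """
    for offset in range(weeks * 7):
        day = start + timedelta(days=offset)
        week_index = offset // 7
        tape_set = week_index % tape_sets + 1  # sets are reused in order
        backup_type = "full" if offset % 7 == 0 else "incremental"
        yield day, tape_set, backup_type


if __name__ == "__main__":
    # Print two weeks of the schedule starting on a Monday.
    for day, tape_set, kind in rotation_schedule(date(1999, 9, 6), weeks=2):
        print(day.isoformat(), f"set {tape_set}", kind)
```

With four sets, the first set returns to the drive in week five, so three completed weeks of tapes can sit off-site at any time.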
Other strategies that accommodate a tighter schedule are possible. You can rotate tapes daily or every other day to minimize the amount of data at risk. The most advanced strategy, one which several companies are now experimenting with, is to back up to an off-site server over either a private line or the Internet. This involves contracting space for the placement of a backup server and the installation of a connection (T1, T3, etc.) to facilitate backups. To effectively implement such a strategy you would need to invest in a tape library so that the administrator does not have to travel to the site on a regular basis to manage tape rotations. Crystalogic <http://www.crystalogic.com> offers off-site backups for a variety of platforms and several levels of service. For a Macintosh/Retrospect-centric solution check out Digital Forest, which is a Dantz-certified Internet Backup Service (IBS), at <http://www.forest.net/backup/recover-it.html>.
Developing a tape rotation scheme that gets the data off-site on a regular basis is important. The backup and the source material should be at least 10 km apart, allowing enough distance to protect against natural disasters that have a large geographical spread. Contract with a local service provider in your area that has secure, environmentally controlled facilities for storing electronic media. Most off-site storage companies will even arrange for drop-off and pick-up service, and guarantee the safety and security of the media. Ensure that the security measures of your off-site storage vendor are effective. At a minimum, the vendor should not publicize or allow general access to the storage location.
When storing material off-site be aware of the shelf life of the media on which the data is stored. Included is a chart with typical media life.
Media | Shelf Life (years)
Floppies, Zip, Jaz, SyJet | Do not use at all. These are too unreliable for long-term storage of critical data.
8 mm Tape |
Table 1. Average Media Life.
Most companies maintain off-site data for 1-5 years depending on need and legal liabilities. Contemplate the consequences before you destroy any backup media. The shelf life of magnetic media is 3-5 years, so for storage needs that are more permanent consider CD-ROM or other optical media, such as the PinnacleMicro Magneto Optical drives which store 5.2 GB and are sometimes called Write-Once-Read-Many (WORM) drives.
If the business cannot tolerate downtime, then alternative solutions for ensuring server availability are critical. Critical resources such as on-line transaction processes, customer service sites, and ordering systems may need to remain on-line even if there is a major disaster. While increasing server availability and resiliency comes at a price, the costs of lost business can quickly grow past those implementation costs.
High Availability (HA)
The buzzwords in the Windows and UNIX server markets are "clustering," "HA," and "fault-tolerance." A typical HA environment in the Windows NT or UNIX world consists of a two-machine cluster. In the simplest form, one machine (primary) handles all the transactions and processing load while the second (secondary) monitors the first, waiting for a failure. The two systems share data disks and sometimes root disks across either a dual-ported SCSI chain or the newer Fibre-Channel architecture.
When the primary server fails, the secondary server assumes the identity of the primary server. The secondary replicates everything about the primary including IP addresses, MAC addresses, disk configurations, application processes, and computer identities. For clients accessing the server, a temporary pause of service happens during the fail-over process and then service continues as if there were no outage. Most systems take about three minutes to fail-over depending on the nature of the component that failed and the services that must fail-over.
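The monitoring role of the secondary can be illustrated with a small state machine. This is a sketch under our own assumptions: the class name, the heartbeat idea, and the limit of three missed checks are hypothetical and do not describe how any specific cluster product decides to fail over.

```python
class StandbyMonitor:
    """Secondary node's view of the primary, driven by heartbeat results."""

    def __init__(self, missed_limit=3):
        self.missed_limit = missed_limit  # hypothetical threshold
        self.missed = 0                   # consecutive missed heartbeats
        self.active = False               # True once the secondary took over

    def observe(self, heartbeat_ok):
        """Record one heartbeat check; return True if it triggers fail-over."""
        if self.active:
            return False  # already serving clients, nothing left to trigger
        self.missed = 0 if heartbeat_ok else self.missed + 1
        if self.missed >= self.missed_limit:
            # A real cluster would now claim the primary's IP and MAC
            # addresses, mount the shared disks, and start the services.
            self.active = True
            return True
        return False
```

Requiring several consecutive misses, rather than one, keeps a single dropped heartbeat from causing an unnecessary fail-over; the cost is a longer pause before the secondary takes over.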
Not all services can fail-over properly. Client/Server connections that are connection-oriented will break during the transition. Such services include Macintosh file services (AppleShare logons), telnet connections, X-Windows connections, and FTP connections. Other services that are either connectionless or store state information can survive. Typical services that can survive are NFS, Windows file sharing, web services, and databases in some instances.
Neither Apple nor any third-party developer offers a high-availability solution for the Mac OS that would do automatic fail-over of system identities and application processes. In fact, Apple does not currently offer a utility or Application Programming Interface (API) for application developers to modify the MAC address of the built-in Ethernet interface, a prerequisite for fail-over of some server applications. With the release of Mac OS X Server, Apple may open the possibility, as it is standard practice in the UNIX environment and a requirement for fail-over of NFS services.
On the Windows platform, Microsoft set the baseline for HA with Microsoft Cluster Services (MSCS) released in early 1998. Other vendors such as Veritas and Full Time Software offer competitive products that can scale to larger numbers of servers and support a variety of Windows and UNIX platforms. Windows NT 5 will integrate an updated version of MSCS that adds better scalability and more control over the failover process. Most current HA solutions are two-node clusters with SCSI-attached disks. As Storage Area Networks (SANs) evolve and mature in 1999, the typical cluster will move to N-way clusters (up to 16 nodes) with Fibre-Channel disks located remotely from the servers.
High Availability (HA) for the Mac OS
While industry-standard HA solutions are not available for the Mac, there are several steps you can take to improve the availability of Mac OS servers. The three most common techniques for improving server availability in the Mac world are RAID, data replication, and load balancing/distribution.
Redundant Arrays of Independent/Inexpensive Disks (RAID) should be the most basic thing you implement for all critical servers. Downtime from disk failures is unacceptable and RAID protects you from this. Without RAID, a disk failure results in downtime while the administrator acquires replacement parts and restores the data, both of which can take hours if not days. Choosing between RAID 1 (disk mirroring) and RAID 4/5 (striping with parity) is an application-dependent decision. Most RAID controllers are now fast enough or have enough cache that RAID 5 does not suffer from major performance disadvantages as in the past.
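To see why striping with parity survives a single disk failure, consider the XOR parity used by RAID 4/5. The sketch below is a toy model: the three eight-byte blocks stand in for stripes on three data disks, and the block contents are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread across three data disks (hypothetical contents).
data_blocks = [b"ORDERS__", b"CLIENTS_", b"INVOICES"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data_blocks)

# Simulate losing the second disk, then rebuild its block by XOR-ing
# the surviving data blocks with the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR-ing a value with itself cancels out, the rebuilt block equals the lost one. The same property is why a second simultaneous disk failure in the stripe cannot be recovered from a single parity block.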
Data replication is another alternative, which involves mirroring data to a secondary server in another location. Replication allows for a rapid recovery if the primary server goes completely off-line. The critical part is ensuring these servers stay in synch. Once the servers are significantly out of synch then fail-over from the primary to the secondary server is meaningless. Keep the rate of data change in mind when planning for the mirroring of data to alternative sites, as such an endeavor may require a significant investment in network services. There are various backup and restore techniques, along with live data transfers, that can keep a pair of servers within a few minutes of each other. The users only access one server. Scripts or other tools keep the secondary server's image of the data on the first server current. The difference between replication and true clustering is that when the primary server fails the user or administrator must reconfigure each client to access the secondary server.
One final solution for web servers and some application services is a technique called round-robin Domain Name System (DNS). The technique involves registering multiple IP addresses under the same DNS name. If the DNS server supports round-robin address resolution, it will hand out alternating IP addresses when it receives name resolution requests. If one server fails, the user will fail to connect but, when they hit reload, they will probably receive the alternate address and be able to get to the server. While this works, it has limitations. There is no guarantee that the DNS server will redirect the user to the working server and some DNS servers may not support round-robin behavior.
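The behavior, and its main limitation, can be shown with a toy resolver. Everything here is illustrative: the class and function names are hypothetical, the addresses come from the documentation range 192.0.2.0/24, and a real DNS server adds caching and time-to-live effects that this sketch ignores.

```python
from itertools import cycle


class RoundRobinResolver:
    """Toy resolver that hands out registered addresses in rotation."""

    def __init__(self, records):
        # records maps a DNS name to the list of addresses registered for it.
        self._rotations = {name: cycle(addrs) for name, addrs in records.items()}

    def resolve(self, name):
        return next(self._rotations[name])


def connect(resolver, name, failed_servers):
    """Return (address, reachable) for one connection attempt."""
    address = resolver.resolve(name)
    return address, address not in failed_servers
```

With two addresses registered and the first server down, the first attempt fails, the reload succeeds, and the third attempt is handed the failed address again. The resolver never learns that a server is unavailable, which is the gap redirectors are meant to close.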
The better alternative is a class of devices known as redirectors. A redirector, like the Cisco DistributedDirector, uses the round-robin capabilities of DNS for load balancing, but adds extra capabilities that enhance the load balancing, performance, and availability issues not directly addressed by round-robin DNS.
To improve load balancing and performance, when the redirector receives a service request from a client, it starts with the list of servers as configured in DNS. The redirector parses the list and makes a determination on which server entry is closest to the client and redirects the client to that server. The redirector transparently redirects end-user service requests to the closest server as determined by client-to-server topological proximity and client-to-server link latency (round-trip times). The result is increased performance seen by the end user and reduced transmission costs in dial-on-demand routing environments.
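The selection step can be reduced to "pick the lowest round-trip time." The sketch below is a simplification under our own assumptions: it uses only latency, the function name and the measurements are hypothetical, and it is not a description of how the Cisco DistributedDirector actually weighs topology and latency.

```python
def closest_server(servers, rtt_ms):
    """Pick the server with the lowest measured round-trip time.

    servers: addresses configured under one DNS name.
    rtt_ms:  hypothetical measurements, server -> round-trip time in ms.
    Servers with no measurement are treated as infinitely far away.
    """
    if not servers:
        raise ValueError("no servers configured")
    return min(servers, key=lambda s: rtt_ms.get(s, float("inf")))
```

For example, given three configured servers and measurements of 180 ms and 35 ms for the first two, the function returns the second server; the unmeasured third server is never preferred over a measured one.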
The second function of a redirector is monitoring server availability. Some redirectors offer a "Server Availability" option, which provides the capability to empirically verify that a server is available and thus prevent redirecting clients to servers that do not respond. When enabled, this parameter causes the redirector to attempt to create a connection to each of the servers using a specified port (e.g. port 80 for HTTP). The administrator sets a time interval (e.g. every 10 seconds, 1-minute, 10 minutes, etc.) for the check. Servers that yield unsuccessful connection attempts are marked as unavailable. Subsequent successful connections to the server will reinstate it as available. A redirector gives the administrator the best combination of availability and load balancing.
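The availability check described above amounts to a timed TCP connection attempt. The following is a minimal sketch using Python's standard socket module; the function names and the three-second timeout are our assumptions, and the periodic scheduling (every 10 seconds, every minute, and so on) is left to whatever timer the administrator prefers.

```python
import socket


def is_available(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def refresh_status(servers, port=80, timeout=3.0):
    """Check every server once and return a host -> available mapping."""
    return {host: is_available(host, port, timeout) for host in servers}
```

Calling refresh_status at each interval and redirecting only to hosts marked True reproduces the behavior described: a server that fails a check drops out, and it is reinstated as soon as a later check succeeds. Note that a successful connection only proves the port is open, not that the application behind it is healthy.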
Notifying & Staffing the Recovery Team
Whether you lease the entire building or just a small section, the IT and facilities people tend to work separately. Make sure that the facilities and security staff at your site know whom to contact when there has been a major building disaster (power outages, fires, air conditioning failure, etc.). The contact person must be someone who is available 24-hours a day/7 days a week (or better yet, a team with rotating coverage). Given this, the safest bet is to provide at least two contacts in the event that one is unreachable for any reason. If an internal employee discovers the problem, they should have a process for notifying the appropriate escalation points within your organization and notifying any needed external resources (facilities, security, emergency personnel, etc.). Each emergency contact should have a paper copy of the disaster recovery plan. Make sure you provide them with an updated copy every time the disaster plan changes.
The emergency contact will react according to the disaster recovery plan by notifying the appropriate team members. The disaster recovery prime should form teams to deal with issues such as damage assessment, hardware recovery, software recovery, and services availability. Depending on the size of your organization, individuals may be members of more than one team. Each team should work according to the process as outlined in the disaster recovery plan. Most major disaster recovery operations will be a multi-stage process that involves bringing the teams together as each stage progresses.
One issue most large companies fail to account for in their disaster recovery plan is acquiring replacement equipment. Companies with hundreds of desktops and a multitude of servers and printers will have difficulty finding a local reseller that can supply replacement equipment on short notice. This can slow the restoration process when large amounts of equipment are affected such as in a building fire.
Investigate local equipment leasing options and document the process and costs in the disaster recovery plan. A combination of leased and purchased equipment should be enough to restore core business functions immediately, allowing time for the restoration of the remainder of the business such as non-essential desktops and servers. This area of planning clearly demonstrates the need for a comprehensive inventory and priority assignment to restoring services.
Even some of the most thorough disaster recovery plans do not adequately accommodate or plan for the acquisition of replacement office space. While developing a disaster recovery plan, most administrators fail to consider the question of "where will the employees work?" Without office space, the company cannot restore operations. This is a minor issue for small companies where work can continue from homes or small offices. Large corporations with hundreds of employees need a plan. While the destruction of a major office building is rare, the executive staff and facilities planners should at least contemplate the possibilities.
The best solution in this case will depend on the office space situation in your area. If office space is at a premium, negotiating contingency plans may be difficult and very expensive. You may want to explore alternatives such as a partial company shutdown or having employees work from home. The important part is documenting your priorities and processes for dealing with these issues.
At What Price?
When you develop a disaster recovery plan, you need to find a balance between costs and benefits to achieve a worthwhile Return on Investment (ROI). If there is concern that developing a disaster recovery plan is too time-consuming or expensive, consider the risks you are taking with the company's future. When fire destroys a small business' office, it will be hard enough to recover all of the equipment costs and physically rebuild. More importantly, if there is no current backup of client databases, current projects, work in progress, orders, or other critical documents, there is no foundation on which to rebuild. The business owner is back at day one with no client database and no work in progress to sustain the employees or business in general.
The opposite extreme is over-planning because of a lack of understanding of business needs. The first step in building the recovery plan is to identify which assets and services are critical to business operations. A table comparing the costs and benefits of each stage is included so that an administrator can balance investment with risks.
Calculating the costs of a disaster is difficult. The costs are a combination of the time to recreate lost data, equipment/property loss, productivity loss during the restoration process, and support staff time and effort to restore systems and data. Included is a chart that documents some of these costs independently, allowing a quick calculation of how much is at risk in your environment.
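The four cost components listed above can be added up in a short worksheet. This is a rough sketch: the class name, the field names, and every figure in the usage note are hypothetical placeholders to be replaced with numbers from your own environment.

```python
from dataclasses import dataclass


@dataclass
class DisasterEstimate:
    data_recreation_hours: float  # hours needed to recreate lost data
    data_recreation_rate: float   # hourly cost of the people recreating it
    equipment_loss: float         # equipment/property replacement cost
    idle_staff: int               # employees unable to work during recovery
    downtime_hours: float         # length of the restoration process
    staff_rate: float             # average hourly cost per idle employee
    support_hours: float          # support staff effort to restore systems
    support_rate: float           # hourly cost of support staff

    def breakdown(self):
        """Return each cost component separately."""
        return {
            "data recreation": self.data_recreation_hours * self.data_recreation_rate,
            "equipment/property": self.equipment_loss,
            "lost productivity": self.idle_staff * self.downtime_hours * self.staff_rate,
            "support effort": self.support_hours * self.support_rate,
        }

    def total(self):
        """Return the sum of all cost components."""
        return sum(self.breakdown().values())
```

As an illustration, DisasterEstimate(40, 50, 25000, 10, 16, 40, 30, 60) breaks down into 2,000 for data recreation, 25,000 for equipment, 6,400 for lost productivity, and 1,800 for support effort, for a total of 35,200. Lost business and damaged customer relationships are not modeled here and would add to that figure.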
Every company develops a different disaster recovery plan depending on the business needs, but no company should be without a plan. Failing to accommodate for the basics of data backups and restores puts the business in a high-risk state. Even the smallest incident can mean the loss of clients and money.
Developing the recovery plan is a cooperative effort between management, the support staff, and facilities planners. This may be one person for a small company or hundreds in a large organization. No matter what the size of the planning team, ensure that its leadership is solid, has a clear understanding of the business needs, and can balance those needs with existing budgets.
Implement the disaster recovery plan in stages, starting with the inventory and backup processes. After the backups and file restores have been working for a while, begin developing plans that prioritize services and restoration processes. As the company grows and becomes more dependent on specific services and infrastructure, the priorities will expand and change. Review the plan on a yearly basis to ensure that business priorities or requirements have not changed drastically and to ensure that inventories and configuration documents are up to date. If possible, simulate a small disaster and walk through the restoration process to validate the plan.
Paul Shields <firstname.lastname@example.org> is an Information Technology Advisor for a telecommunications firm in Dallas, TX. He is also Editor of The Mac Report, a weekly magazine for Macintosh professionals. Paul helps companies prepare disaster recovery plans that minimize downtime and maximize recoverability.