
Highlights of the Disaster-Recovery RFP
You can't implement a disaster-recovery plan until you have a comprehensive redundancy plan in place. Simply stated, you cannot recover from a sitewide disaster until you can recover from a local, isolated component outage. DCH's prospects for establishing an effective disaster-recovery plan are bolstered by its ability to recover from any outage of carrier, server, database or software component. Its network already features redundancy between the carriers, replication between its e-commerce servers and databases, and switchable SCSI interfaces among its file servers. Key parts of the architecture that accomplish this are dual carriers or local loops for all traffic, dedicated or shared standby servers, hot backups of all transaction data, robust security and distributed application intelligence (see "DCH's Existing Infrastructure" on page 42).
Because of high customer visibility and the lack of a manual system to fall back on, the following DCH systems have a low tolerance for recovery time (less than 30 minutes) and a low threshold for data loss (no missed transactions): e-commerce/trading infrastructure; Web, application and database servers; trading partner links; and 800-number IVR (interactive voice response). The Web server's file systems and the database server's transaction logs are replicated to their standby units in real time. DCH's management is willing to invest enough in its network to ensure that the hot site's servers maintain parity with the local standby servers.
Because of the high cost of downtime, NT servers and local intranet Web servers have a middling tolerance for recovery time (less than four hours) and data loss (last backup within about 24 hours). The intranet Web server's file systems and the NT server's file systems are backed up nightly to tape. DCH maintains a standby server that can take over for any of the other servers, and it has configured the SCSI switches to boot with either the file server or the intranet Web server boot image.
Finally, the network connections to the internal frame relay network and the remote-access features from the work-area LANs (i.e., remote site frame relay and local LANs) have a high tolerance for recovery time.
Layering high availability over a single-threaded architecture requires a significant expenditure; such an endeavor may easily run into the hundreds of thousands of dollars. Because DCH has already made this investment, it has in effect simplified the disaster-recovery vendor's job. Essentially, all the vendor needs to do is identify how to project the current availability levels across a WAN connection, and determine how to switch the client feeds from DCH's computer room to the disaster-recovery site.
Another way DCH made it easier for vendors to create a disaster-recovery plan was to identify the recovery-point objectives (how much missing data is tolerated) and recovery-time objectives (length of time from the declaration of the disaster to the restoration of service, including time for re-creation, synchronization and cutover of clients to ready status) for all of its identified critical systems.
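To make the two objectives concrete, the tiers described above can be written down as data and checked against a recovery-drill result. The following is only a minimal sketch in Python: the class name, the tier keys and the `is_met` helper are hypothetical, and the thresholds are taken from the figures quoted in this article (less than 30 minutes with no missed transactions; less than four hours with a backup no older than about 24 hours). The high-tolerance tier is left out because no figures are given for it.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTier:
    """Hypothetical record of one tier's recovery objectives."""
    name: str
    rto: timedelta  # recovery-time objective: service must be back in less than this
    rpo: timedelta  # recovery-point objective: maximum tolerated window of lost data

    def is_met(self, downtime: timedelta, data_loss: timedelta) -> bool:
        # Downtime must be strictly less than the RTO ("less than 30 minutes");
        # data loss may be at most the RPO (zero means no missed transactions).
        return downtime < self.rto and data_loss <= self.rpo


# Illustrative tier table; the keys are assumptions, the numbers come from the article.
TIERS = {
    "ecommerce": RecoveryTier(
        name="e-commerce/trading, partner links, IVR",
        rto=timedelta(minutes=30),
        rpo=timedelta(0),
    ),
    "intranet": RecoveryTier(
        name="NT servers and intranet Web servers",
        rto=timedelta(hours=4),
        rpo=timedelta(hours=24),
    ),
}

if __name__ == "__main__":
    # Example drill: service restored in 20 minutes with nothing lost.
    drill_ok = TIERS["ecommerce"].is_met(timedelta(minutes=20), timedelta(0))
    print("e-commerce tier objectives met:", drill_ok)
```

Keeping the objectives in one table like this lets a vendor proposal or a drill result be compared against each tier mechanically instead of by rereading the prose.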