|
Unix Disaster Recovery Techniques and Planning

Know what else is a disaster? A security compromise. What happens when a server needs to be reloaded due to a security incident? Many businesses could be crippled for at least a day while the kinks are worked out, but this needn't be the case. Proper disaster recovery planning should cover security incidents as well. Many sites have mechanisms for suddenly reinstalling a machine, but even they are likely to have a few servers that can't be reinstalled quickly.
Different approaches to backups and system management can make recovering from a compromise much more pleasant. The main tenets we'll focus on are:

- Systems should be automated, using configuration management tools, such that every server can be reinstalled on a whim and brought back to working order without any manual intervention.
- Backups should consider the need for total system recovery, including disk images of the most important servers.
- To verify that the procedures and infrastructure are conducive to success, practice recovering from a compromise of the most important servers using spare hardware, and then put the rebuilt machines into production for a short test period.

Managed Servers

Cookie-cutter servers are the name of the game. Any divergence from a standard OS load absolutely must be documented and automated. If you aren't already immersed in the wonderful world of configuration management, take a serious look at Puppet or CFEngine.

Even something as simple as a disk failure can be a major disaster if you have servers running unknown and undocumented configurations. If you're in this position, your situation is dire. Hopefully there's some documentation available to aid in converting these servers to some type of automated configuration system. Frequently there won't be; in fact, simply rebooting a server may cause it to stop functioning because services aren't configured to start at boot time. Frankly, if you've ever experienced this, you aren't doing things properly.

Most sites are somewhere in between. Perhaps they have a half-implemented configuration management infrastructure, or just a few machines that are completely divergent. It may be too much work to bring them in line with the standard server load. That's OK in small doses, but the oddball servers must get some special treatment, since they aren't completely automated.

Backup Considerations

In an ideal situation, you'll only be backing up attached storage, SAN or otherwise, because the OS data doesn't matter. In this case you can pull data directly off the storage gear for full backups, and the OS doesn't even have to be involved (assuming a SAN infrastructure). Very few servers will have local storage that's vital, because all divergent information is stored in your configuration management software or perhaps mounted over NFS.

For the not-so-lucky, or perhaps for the host that holds the configuration management data, there needs to be a sane mechanism for backing up the data. Backing it up is the easy part; what matters is backing it up in a restorable manner.
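
One way to check that a backup really is restorable is to record file checksums at backup time and compare them after a test restore onto spare hardware. The following is a minimal sketch in Python, not a substitute for your backup software; the function names (`sha256_of`, `build_manifest`, `compare_restore`) are hypothetical and chosen only for illustration, and the sketch assumes the restored tree is mounted somewhere readable.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file's path (relative to root) to its SHA-256."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files are hashed.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def compare_restore(source_manifest, restored_root):
    """Return (missing, changed) lists of relative paths after a test restore."""
    restored = build_manifest(restored_root)
    missing = sorted(p for p in source_manifest if p not in restored)
    changed = sorted(
        p for p, digest in source_manifest.items()
        if p in restored and restored[p] != digest
    )
    return missing, changed
```

In practice the manifest would be saved alongside the backup, and `compare_restore` run during each practice recovery; two empty lists mean every file recorded at backup time came back intact.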

The most common backup methods will spread at least a week's worth of data across many tapes, making it a royal pain to completely restore an entire file system. There are virtual tape libraries that make this a bit more tolerable, but the restoration process still isn't quick when you need to completely rebuild a server that requires tons of customizing. A fast rebuild of that kind requires disk images.

Any critical server should have two mirrored OS disks; that's a given. What we're talking about here is creating an image of the entire disk and backing that up as well. Storing a week's worth of those is certainly handy when you need to roll back a few days. It works just like VMware snapshots, in fact, but for full servers. Ideally you won't need to do that, but for small shops or disastrous shops, it sure beats reconstructing a server from memory and 5-20 tapes' worth of backups.

The Security Aspect

As was discussed in "How Do You Know When You've Been Owned?", you can't always be 100% certain that you've been hacked. In situations where you need to reinstall the server, i.e., root was compromised, it's usually clearer. But wouldn't it be better if a suspected security incident could be cleaned up with a simple reload and no manual intervention? Some say yes, some say no. You'll certainly want to know how someone was able to breach your perimeter, which usually leads to the full determination of whether or not you're in danger. But even a hacked website can pose risks later on down the road, even if it's cleaned up. The choice is yours.

The absolute best solution is to have every server configuration well documented and automated. The system disk can also be archived for security-related or other disaster recovery needs. Many servers nowadays (HP started it) come with internal USB connections. The idea is to flash a Linux boot image onto a USB drive, so that if the need arises, you can boot off the USB disk and 'dd' over the OS image you need.
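
The 'dd' step can be sketched in Python for sites that want a checksum along with the copy. This is an illustrative sketch under stated assumptions, not the procedure any vendor ships: it assumes the image is reachable as a file (for example on an NFS mount), that the operator supplies the correct target device path, and that the script runs as root from the rescue environment. The names `write_image` and `verify_target` are hypothetical.

```python
import hashlib
import os


def write_image(image_path, target_path, block_size=4 * 1024 * 1024):
    """Copy image_path over target_path block by block, much like dd.

    The target is opened without truncation ("r+b"), so it must already
    exist, as a block device does. Returns the SHA-256 of the bytes written.
    """
    digest = hashlib.sha256()
    with open(image_path, "rb") as src, open(target_path, "r+b") as dst:
        for block in iter(lambda: src.read(block_size), b""):
            dst.write(block)
            digest.update(block)
        dst.flush()
        os.fsync(dst.fileno())  # ask the kernel to push the data to the device
    return digest.hexdigest()


def verify_target(target_path, length, expected_sha256,
                  block_size=4 * 1024 * 1024):
    """Re-read the first `length` bytes of the target and compare hashes."""
    digest = hashlib.sha256()
    remaining = length
    with open(target_path, "rb") as dst:
        while remaining > 0:
            block = dst.read(min(block_size, remaining))
            if not block:
                break  # target is shorter than the image
            digest.update(block)
            remaining -= len(block)
    return remaining == 0 and digest.hexdigest() == expected_sha256
```

A call might look like `write_image("/mnt/images/web01.img", "/dev/sda")`, where both paths are placeholders. Note that a read-back immediately after writing may be served from the page cache, so a stricter check would verify again after a reboot.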
Limited only by the speed of your network, this image-based restore is the fastest method of disaster recovery.

In short, you need to work yourself into a position where every server is like the others. Divergent cases should be automated such that they resume their prior configuration shortly after being reinstalled, without manual intervention. In addition to automated servers, your most critical servers should be doubly, nay, triply backed up: their configurations (IP addresses, files, everything) documented and automated, the normal tape backup rotations in place, and frequent OS disk images taken to ensure fast recovery in the event of any disaster. Your job will be easier, your company will be happier, and therefore your life will be easier. |