Disclaimer: I am not responsible for anything the material or the included code/patches may cause, including loss of data, physical damage, service disruption, or damage of any kind. Use at your own risk!

Booting in RAID-only environments

The main problem with grub (and booting in general) is that it needs a plain ext2 partition to read the kernel image from. The device that holds the partition needs to be readable by using BIOS calls (we cannot use drivers at boot time, can we?).

RAID devices are internally managed by the kernel, so you typically need a non-RAID device to boot from. However, software RAID 1 with linux is special, because the data is not interleaved. This means that if you have an ext2 filesystem on a RAID 1 array, you'll have the same data blocks in the same order on both members of the array, and thus a valid ext2 filesystem on each of them.

In this particular case it's possible to boot even if the /boot directory resides on the array. Because the information is duplicated, you only need one of the two array members, and you can use any of them to boot.

You should REMEMBER that this is just a trick. It only works with software RAID level 1, and in general you need a separate hard drive to hold the /boot directory and boot from it.

Tricking grub to set up and boot correctly

We'll assume that you have two disks (hda and hdc) with two partitions each. hda1 and hdc1 are equally sized, and members of the md0 RAID 1 array. Similarly, hda2 and hdc2 are equally sized, and members of the md1 array. The md0 device is smaller and holds the /boot filesystem, and md1 is larger and holds the / filesystem.

If you use anaconda at install time to configure the arrays, it will also set up grub at the end of the install process, and by some magic it will work. However, if at some point in the future the primary hard drive (hda) fails and you replace it, you'll have no boot sector to boot from, and eventually you'll have to reinstall grub. Surprisingly, although anaconda managed to install grub correctly, you won't manage to do it with the configuration files anaconda created: if you try to grub-install /dev/hda, it will claim that "md0 does not have a corresponding BIOS drive" and abort the installation.

The problem is that at boot time grub needs to read the stage1 and stage2 files, which are located in /boot/grub. But at boot time no kernel is running, so grub will have to use BIOS calls to read directly from the disk. However, it needs to know exactly what disk (and partition) to read from. This piece of information is determined at install time, and then hard-coded into the boot sector.

At install time the kernel is running, and some device is mounted under /boot. But grub needs to find what physical device that is, so it can properly read from it later at boot time. To do this, it first determines the device that is mounted on /boot, then it tries to figure out how it can be accessed through BIOS calls. Fortunately, it looks for the mounted device in /etc/mtab rather than /proc/mounts.

The /dev/md0 device (which is mounted on /boot) will never have a corresponding BIOS drive, because it's not a real (physical) device; it's a virtual device managed by the linux kernel. But the same information resides on both physical members of the array, so we need to trick grub into thinking that /boot is mounted from one of the two members (typically the one on the first disk - /dev/hda1 in our example). To do this, edit /etc/mtab and manually change the /boot entry so that /dev/hda1 is listed as the device.
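
For example, the relevant line of /etc/mtab would change from something like this (the exact options and fields will differ on your system):

/dev/md0 /boot ext2 rw 0 0

to this:

/dev/hda1 /boot ext2 rw 0 0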

At this point grub knows that /boot is on /dev/hda1, but it still needs a hint about accessing it using only BIOS calls. The hint comes from the /boot/grub/device.map file, which maps a logical name such as /dev/hda to a physical device such as (hd0), which means (to grub) the first hard drive detected by the BIOS. So you need to make sure you have a correct mapping for the device you used.
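
In our two-disk example, device.map would typically look something like this (the exact contents depend on your hardware and on how the BIOS enumerates the drives):

(hd0)   /dev/hda
(hd1)   /dev/hdc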

Now you can safely do grub-install /dev/hda and it should work.

Recovery when the secondary drive fails

This is the easiest case, because all the boot data resides on the healthy drive. The system will boot normally, but the arrays will start in degraded mode.

You can replace the damaged drive, and the system will still boot, again with the arrays in degraded mode. Now all you have to do is create the same partitions (or larger ones) on the new drive as you had on the old one. Then you can simply "hot" add the newly created partitions to the degraded arrays.
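
If you have sfdisk installed, an easy way to create identical partitions is to copy the partition table from the healthy drive. The following is only a sketch for our hda/hdc example; double-check the device names before running it, because writing a partition table to the wrong disk is destructive:

sfdisk -d /dev/hda | sfdisk /dev/hdc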

Assume the same setup as in the previous section. In this scenario /dev/hdc failed, and you replaced it with a new drive. On the new drive you created the hdc1 and hdc2 partitions, which you'll have to add to the md0 and md1 arrays respectively. This is very simple, and all you have to do is:

raidhotadd /dev/md0 /dev/hdc1
raidhotadd /dev/md1 /dev/hdc2

Now you can watch the arrays being reconstructed by looking at /proc/mdstat. If you're anxious about the progress of the job, you can even run watch "cat /proc/mdstat" to see it update continuously. Note that the two arrays won't both be rebuilt at the same time. That is because they both involve the same disks, and rebuilding them simultaneously would force the disk heads to move very often from one partition to the other. This would result in a severe performance loss, so the md driver avoids it by rebuilding the second array only after the first completes.

Recovery when the primary drive fails

The first thing you should do is save a copy of your /etc/raidtab file. You'll need this later to get things to work. A floppy disk would do, but even better, make a hardcopy of the file. It's very small, but also very important.
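
For example, to copy it to a floppy (assuming the drive is /dev/fd0 and the /mnt/floppy mount point exists):

mount /dev/fd0 /mnt/floppy
cp /etc/raidtab /mnt/floppy/
umount /mnt/floppy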

The next thing you should do is replace the damaged disk. This is a bit tricky, because now you don't have anything left to boot from. Well, not really :) You can still boot from a rescue disk. So get Fedora disk 1 (or RedHat 9 disk 1 or... whatever) and boot in rescue mode (that's "linux rescue" at the boot prompt with RedHat & friends). Don't let the rescue disk mount anything from your hard drives; you'll do the mounting yourself later.

Use fdisk to create partitions on the new disk. The new partitions must be the same size as, or larger than, the old ones. Don't forget to change their type to 'fd' (Linux Raid Autodetect).
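
If sfdisk is available on the rescue image, the partition-table copying trick from the previous section works here as well, just with the roles reversed (hdc is now the healthy drive). Note that sfdisk copies the partition types too, so the 'fd' type comes along for free:

sfdisk -d /dev/hdc | sfdisk /dev/hda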

Now all you have to do is initialize the raid superblocks on the new partitions and restore the arrays. The only way I know to do this is to start the arrays in degraded mode and then "hot" add the new partitions. The funny part is that anaconda won't start any array in degraded mode because "it's dangerous" (guys, why is it dangerous, and how the heck are you supposed to restore the arrays, since they need to be running before you can add a new member?). Moreover, raidstart (and raidstop too) on the rescue image is some kind of anaconda "thingie" (actually a python program) that will never start the arrays. You need the original raidstart (the one from the raidtools package).

If your array is anything other than software RAID 1, you're on your own here. But if it is software RAID 1, you can pull a nice trick. As I explained earlier, the two members of the array are identical; more than that, each of them holds a valid filesystem, because the data is not interleaved. This means you can mount the corresponding "/" partition from the healthy drive as if the filesystem had been created directly on the partition (and not on the RAID device). Use this very carefully and keep in mind that mounting the partition read-write is a very bad idea.

Mount the "/" partition from the healthy drive read-only and copy the raidtab file from it to /etc. Change the (newly created) /etc/raidtab file as if the arrays did not contain the partitions on the damaged drive. Remove the corresponding "device" and "raid-disk" entries, and adjust the remaining "nr-raid-disks" and "raid-disk" entries accordingly. Now you should be able to start the arrays (in degraded mode, of course) if you use the mounted partition's copy of raidstart. In our example, the modified raidtab file should look like this:

raiddev             /dev/md1
    raid-level                  1
    nr-raid-disks               1
    persistent-superblock       1
    nr-spare-disks              0
    device                      /dev/hdc2
    raid-disk                   0

raiddev             /dev/md0
    raid-level                  1
    nr-raid-disks               1
    persistent-superblock       1
    nr-spare-disks              0
    device                      /dev/hdc1
    raid-disk                   0
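
With a trimmed-down raidtab like the one above, the whole step might look roughly like this. This is only a sketch: the /mnt/old mount point is an arbitrary choice, and the path to raidstart assumes the usual /sbin location on the healthy drive's root filesystem.

mkdir /mnt/old
mount -o ro /dev/hdc2 /mnt/old            # the healthy drive's "/" partition, read-only
cp /mnt/old/etc/raidtab /etc/raidtab      # then edit it as described above
/mnt/old/sbin/raidstart /dev/md0
/mnt/old/sbin/raidstart /dev/md1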

The original raidtab (in case you might need an example to write one from scratch) should look like this:

raiddev             /dev/md1
    raid-level                  1
    nr-raid-disks               2
    persistent-superblock       1
    nr-spare-disks              0
    device                      /dev/hda2
    raid-disk                   0
    device                      /dev/hdc2
    raid-disk                   1

raiddev             /dev/md0
    raid-level                  1
    nr-raid-disks               2
    persistent-superblock       1
    nr-spare-disks              0
    device                      /dev/hda1
    raid-disk                   0
    device                      /dev/hdc1
    raid-disk                   1

Now you can mount the RAID devices (this time it's safe to mount them read-write) and chroot into the mounted root filesystem. Use raidhotadd to add the partitions from the new hard disk to the arrays. Note that the RAID driver will synchronize only one array at a time, so start with the /boot array: you'll need it to be completely synchronized before you can boot.
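
A rough sketch of this step (the /mnt/sysimage mount point is just a convention borrowed from the rescue environment; any empty directory will do):

mkdir -p /mnt/sysimage
mount /dev/md1 /mnt/sysimage              # the "/" array, read-write this time
mount /dev/md0 /mnt/sysimage/boot         # the /boot array
chroot /mnt/sysimage
raidhotadd /dev/md0 /dev/hda1             # let /boot synchronize first
raidhotadd /dev/md1 /dev/hda2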

Trick grub and re-install it, as previously described. Cleanly unmount the arrays, do a "sync" (just to be sure), and reboot. Your system should start cleanly.

Using a previously used disk as a replacement

This section should explain what to take into account when using a previously used disk as a replacement, especially if it had RAID partitions on it (in particular, how to destroy the old RAID superblock). To be written :)