09 February, 2005

Software RAID1 Recovery

Disclaimer: I am not responsible for anything the material or the included code/patches may cause, including loss of data, physical damage, service disruption, or damage of any kind. Use at your own risk!

Booting in RAID-only environments

The main problem with grub (and booting in general) is that it needs a plain ext2 partition to read the kernel image from. The device that holds the partition needs to be readable by using BIOS calls (we cannot use drivers at boot time, can we?).

RAID devices are internally managed by the kernel, so you tipically need a non-RAID device to boot from. However, software RAID 1 with linux is special, because the data is not interleaved. This means that if you have an ext2 filesystem on a RAID 1 array, you'll have the same data blocks in the same order on both members of the array, and thus a valid ext2 filesystem on both of them.

In this particular case it's possible to boot even if the /boot directory resides on the array. Because the information is duplicated, you only need one of the two array members, and you can use any of them to boot.

You should REMEMBER that this is just a trick. It only works with software RAID level 1, and in general you need a separate hard drive to hold the /boot directory and boot from it.

Tricking grub to setup and boot correctly

We'll assume that you have two disks (hda and hdc) with two partitions each. hda1 and hdc1 are equally sized, and members of the md0 RAID 1 array. Similarly, hda2 and hdc2 are equally sized, and members of the md1 array. The md0 device is smaller and holds the /boot filesystem, and md1 is larger and holds the / filesystem.

If you use anaconda at install time to configure the arrays, it will also setup grub at the end of the install process, and by some magic it will work. However, if, at some point in the future, the primary hard drive (hda) fails and you replace it, you'll have no boot sector to boot from. Eventually, you'll have to reinstall grub and, surprisingly, although anaconda managed to install grub correctly, you won't manage to do it with the configuration files created by anaconda. If you try to grub-install /dev/hda, it will claim that "md0 does not have a corresponding BIOS drive" and abort the installation.

The problem is that at boot time grub needs to read the stage1 and stage2 files, which are located in /boot/grub. But at boot time no kernel is running, so grub will have to use BIOS calls to read directly from the disk. However, it needs to know exactly what disk (and partition) to read from. This piece of information is determined at install time, and then hard-coded into the boot sector.

At install time the kernel is running, and some device is mounted under /boot. But grub needs to find what physical device that is, so it can properly read from it later at boot time. To do this, it first determines the device that is mounted on /boot, then it tries to figure out how it can be accessed through BIOS calls. Fortunately, it looks for the mounted device in /etc/mtab rather than /proc/mounts.

The /dev/md0 device (which is mounted on /boot) will never have a corresponding BIOS drive, because it's not a real (physical) device. It's a virtual device managed by the linux kernel. But the same information resides on both physical members of the array, so we need to trick grub into thinking that /boot is mounted from one of the two members (typically the one on the first disk - /dev/hda1 in our example). To do this, you need to edit /etc/mtab and manually change the entry for /boot with /dev/hda1 as the device.

At this point grub knows that /boot is on /dev/hda1, but it still needs a hint about accessing it using only BIOS calls. The hint comes from the /boot/grub/device.map, which maps a logical name such as /dev/hda to a physical device such as (hd0) which means (to grub) the first hard drive detected by the BIOS. So you need to make sure you have a correct mapping for the device you used.

Now you can safely do grub-install /dev/hda and it should work.

Recovery when the secondary drive fails

This is the easiest case, because all the boot data resides on the healthy drive. The system will boot normally, but the arrays will start in degraded mode.

You can replace the damaged drive, and the system will still boot, again with the arrays in degraded mode. Now all you have to do is create the same partitions (or larger ones) on the new drive as you had on the old one. Then you can simply "hot" add the newly created partitions to the degraded arrays.

Suppose the same example as in the previos section. In this scenario /dev/hdc failed, and you replaced it with a new drive. On the new drive you created the hdc1 and hdc2 partitions, which you'll have to add to the md0 and md1 arrays respectively. This is very simple, and all you have to do is:

raidhotadd /dev/md0 /dev/hdc1
raidhotadd /dev/md1 /dev/hdc2

Now you can watch the arrays being reconstructed by looking at /proc/mdstat. If you're anxious about the progress of the job, you can even watch "cat /proc/mdstat". Note that the two arrays won't be both rebuilt at the same time. That is because they both involve the same disks and rebuilding them at the same time would cause the disk heads to be moved very often from one partition to another. This would result in a severe performance loss and the md driver avoids it by rebuilding the second array only after the first completes.

Recovery when the primary drive fails

The first thing you should do is save a copy of your /etc/raidtab file. You'll need this later to get things to work. A floppy disk would do, but even better, make a hardcopy of the file. It's very small, but also very important.

The next thing you should do is replace the damaged disk. This is a bit tricky, because now you don't have anything left to boot from. Well, not really :) You can still boot from a rescue disk. So get Fedora disk 1 (or RedHat 9 disk 1 or... whatever) and boot in rescue mode (that's "linux rescue" at the boot prompt with RedHat & friends). Don't let the rescue disk mount anything from your hard disk. You'll mount them later.

Use fdisk to create partitions on the new disk. The new partitions must be the same size or larger than the old ones. Don't forget to change their type to 'fd' (Linux Raid Autodetect).

Now all you have to do is initialize the raid superblock on the new partitions and restore the arrays. But the only way I know to do this is start the arrays in degraded mode and then "hot" add the new partitions. The funny part is that anaconda won't start any array in degraded mode because "it's dangerous" (guys, why is it dangerous and how the heck are you supposed to restore the arrays first since they need to be running to add a new member?). Moreover, raidstart (and raidstop too) from the rescue image is some kind of anaconda "thingie" (actually a python program) that would never start the arrays. You need the original raidstart (the one from the raid tools package).

If your array is anything else but software RAID 1, you'll be on your own on this one. But if it is software RAID 1, you can do a nice trick. As I previously explained, the two members of the array are identical, and more, they are valid filesystems because the data is not interleaved. This means you can mount the corresponding "/" partition from the healthy drive as if the filesystem were created directly on the partition (and not on the RAID device). Use this very carefully and keep in mind that mounting the partition read-write is a very bad idea.

Mount the "/" partition from the healthy drive read-only and copy the raidtab file from it to /etc. Change the (newly created) /etc/raidtab file as if the arrays did not contain the partitions on the damaged drive. Remove the corresponding "device" and "raid-disk" entries, and adjust the remaining "nr-raid-disks" and "raid-disk" entries accordingly. Now you should be able to start the arrays (in degraded mode, of course) if you use the mounted partition's copy of raidstart. In our example, the modified raidtab file should look like this:

raiddev             /dev/md1
raid-level                  1
nr-raid-disks               1
persistent-superblock       1
nr-spare-disks              0
    device          /dev/hdc2
    raid-disk     0
raiddev             /dev/md0
raid-level                  1
nr-raid-disks               1
persistent-superblock       1
nr-spare-disks              0
    device          /dev/hdc1
    raid-disk     0

The original raidtab (in case you might need an example to write one from scratch) should look like this:

raiddev             /dev/md1
raid-level                  1
nr-raid-disks               2
persistent-superblock       1
nr-spare-disks              0
    device          /dev/hda2
    raid-disk     0
    device          /dev/hdc2
    raid-disk     1
raiddev             /dev/md0
raid-level                  1
nr-raid-disks               2
persistent-superblock       1
nr-spare-disks              0
    device          /dev/hda1
    raid-disk     0
    device          /dev/hdc1
    raid-disk     1

Now you can mount the RAID devices (this time it's safe to mount them r-w) and chroot into them. Use raidhotadd to add the partitions from the new hard disk to the arrays. Note that the RAID driver will synchronize only one array at a time, so start with the /boot array. You'll need it to be completely synchronized before you can boot.

Trick grub and re-install it, as previously described. Cleanly unmount the arrays, do a "sync" (just to be sure), and reboot. Your system should start cleanly.

Using a previously used disk as a replacement

This section should explain what you should take into account when using a previously used disk as a replacement, if it had RAID partitions on it (particularly this explains how to destroy the RAID superblock). To be written :)

11 May, 2003

Realtek 8139-C Cardbus

Disclaimer: I am not responsible for anything may be caused by applying procedures described in this material, including loss of data, physical damage, service disruption, or damage of any kind. Use at your own risk!

Background: I have recently bought a cardbus pcmcia Realtek 8139 card, and tryed to set it up on my RedHat 7.3 box. It works perfectly with the 8139too module, but the init scripts "insist" on loading 8139cp instead.

The 8139cp driver claims the card chip is not supported:

8139cp 10/100 PCI Ethernet driver v0.0.7 (Feb 27, 2002)
8139cp: pci dev 02:00.0 (id 10ec:8139 rev 10) is not an 8139C+ compatible chip
8139cp: Try the "8139too" driver instead.

After two hours of digging through config files and scripts, I figured out what really happened. Cardbus pcmcia cards are in fact 32-bit pci devices. That means they will be reachable through the pci bus (actually through a pci-cardbus bridge), and listed by 'lspci'.

However, pcmcia cards are hotpluggable devices. When the kernel detects a new hotpluggable device, it invokes an utility that is responsible for loading the appropriate modules into the kernel. The utility is provided by the "hotplug" package, and it is used for all the hotpluggable devices (including usb).

The funny thing is that cardmgr would never load the modules itself for a cardbus card, even if the correct manufacturer id (or any other identification means) is present in the '/etc/pcmcia/config' file. Instead /etc/hotplug/pci.agent is invoked to load the appropriate modules.

The correct module is identified through the pci id (two 4-digit hex numbers). The id is looked up in a mapping table, and then the module is loaded into the kernel. The mapping table is located in /lib/modules/<kernel version>/modules.pcimap. In my modules.pcimap file there was a single line containing 8139cp followed by many lines containing 8139too. So I removed the line with 8139cp and... surprise! Everything worked fine.

But... the modules.pcimap is generated by depmod, and depmod is run by the startup scripts at every boot. My "quick hack" solution was to add the following line at the beginning of /etc/modules.conf:

pcimapfile=/tmp/modules.pcimap

That line tells depmod to create the file as "/tmp/modules.pcimap", so the real map file is not overwritten. That's all.

If you know how to make depmod exclude 8139cp from the pcimap file, please drop me an e-mail :)

15 July, 2002

FrontPage 2002 extensions on Apache

This document assumes that you have basic knowledge of:

*NIX;
Apache administration;
Program compilation and installation.

OK, here's what you have to do:

1. Get the files you need. I assume you save all these files in /root:

FrontPage extensions v5.0 for Linux from Ready-to-Run Software (fp50.linux.tar.gz), available at http://www.rtr.com/.
FrontPage patch for Apache 1.3.22 (fp-patch-apache_1.3.22.gz), available at http://www.rtr.com/.
(optional) patch for SuEXEC (if you want to use Apache SuEXEC with FrontPage), available here.
Apache 1.3.23 from Apache (apache_1.3.23.tar.gz), available at http://httpd.apache.org/.

2. CWD to /usr/src. Unzip/untar both the Apache and FrontPage files:

cd /usr/src
tar zxf /root/apache_1.3.23.tar.gz
tar zxf /root/fp50.linux.tar.gz

3. Patch Apache. The patch was made against Apache 1.3.22, but it seems to work fine for 1.3.23.

cd apache_1.3.23
zcat /root/fp-patch-apache_1.3.22.gz | patch -p0

4. (optional) Apply the SuEXEC patch (if you want to use Apache SuEXEC with FrontPage):

patch -p0 < /root/fp-suexec.patch

5. Compile and install Apache. I recommend compiling mod_frontpage as a static module. All other modules may be compiled as DSO.

./configure --prefix=/usr/local/apache --add-module=mod_frontpage.c
make
make install
mkdir /usr/local/apache/webs

6. Setup the FrontPage extension files:

cd /usr/src
mv frontpage /usr/local
cd /usr/local/frontpage/version5.0
# setup the suid key
cd apache-fp
dd if=/dev/random of=suidkey bs=8 count=1
# setup file ownership and permissions
cd ..
./set_default_perms.sh

7. Setup a simple Apache configuration. The following configuration directives should be present in your httpd.conf file. I assume you know how to configure apache and where to place these directives in httpd.conf:

NameVirtualHost *

<Directory /usr/local/apache/webs>
    AllowOverride All
</Directory>

<VirtualHost *>
    ServerName testsite.yourdomain.com
    DocumentRoot /usr/local/apache/webs/testsite.yourdomain.com
</VirtualHost>

8. Install the FrontPage extensions for your test virtual host. I assume user www already exists on your system and he has login group www.

mkdir /usr/local/apache/webs/testsite.yourdomain.com
/usr/local/frontpage/version5.0/bin/owsadm.exe -o install -p 80 \
    -s /usr/local/apache/conf/httpd.conf -xu www -xg www \
    -u yourusername -pw yourpassword -m testsite.yourdomain.com

Well, that's it. You should have a working sample virtual host with FrontPage extensions. This document covers just the basics. If you are a good Apache administrator, it should be enough to set up much more complex configurations, with any number of virtual hosts, different ip's and/or ports, PHP and whatever you can think of :)

Web-based administration didn't work for me. I couldn't authenticate to the interface. Moreover, many users have reported that web-based administration only works with Internet Explorer (on other browsers some CGI's are downloaded instead of being executed). Since I can't run Explorer on my Linux box, I didn't insist on getting web administration to work. Command-line administration works just fine and the documentation from Microsoft seems clear enough to me.

15 July, 2002

More than 32 groups/user

Background: Most linux distributions don't allow more than 32 groups/user. That means one user cannot belong to more than 32 groups. Unfortunately, this limit is hard coded into the linux kernel, glibc, and a few utilities including shadow.

1. Patching the kernel

I only tried this on 2.4.x kernels. However, things should be the same with 2.2.x. Be careful when choosing the new limit. The kernel behaves strangely with large limits because the groups structure per process is held on an 8K stack which seems to overflow. A 2.4.2 kernel with a limit of 1024 crashed during boot. However, I successfully used a 255 limit on a 2.4.8 kernel.

The group limit is set from two header files in the linux kernel source:

include/asm/param.h

This file should contain something like this:

#ifndef NGROUPS
#define NGROUPS         32
#endif

Simply replace 32 with the limit you want. If your param.h doesn't contain these lines, just add them.

include/linux/limits.h

Look for a line that looks like

#define NGROUPS_MAX 32

and change the limit.

Now the kernel must be recompiled. There are some howtos that explain how this is done.

2. Recompiling glibc

This applies to glibc-2.2.2 (this is the version which I used). It may also apply to other versions, but I didn't test it.

The __sysconf function in glibc is affected by the limits defined in the system header files. Other functions (initgroups and setgroups) in glibc rely on __sysconf rather than using the limits defined in the header files. You'll have to modify two header files. Please note that this limit will be used by glibc and all programs that you compile. Choose a reasonable limit. However, it's safe to use a larger limit than you used for the kernel. I successfully compiled and ran glibc with a limit of 1024.

/usr/include/asm/param.h

Make sure the file contains something like

#ifndef NGROUPS
#define NGROUPS         1024
#endif

/usr/include/linux/limits.h

It should contain a line like this:

#define NGROUPS_MAX     1024    /* supplemental group IDs are available */

Now you have to recompile glibc. I hope there are some howtos that explain how this is properly done. I only did it twice and I got into trouble both times. Glibc compiles cleanly, but the actual problem is installing the new libraries. A 'make install' won't do it, at least not with bash (some people suggested it would work if I used a statically linked shell, but I didn't try). This happened on RedHat, where the distribution glibc was placed under a subdirectory of /lib rather than directly under /lib. 'make install' copies libraries one at a time. After glibc is copied, paths stored inside the new glibc binary won't match those from the old ld-linux.so, causing ld not to be able to dinamically link any program. So 'make install' won't be able to run /usr/bin/install, which is needed to copy the new binaries, and it will fail. I had to reset the machine (/sbin/shutdown could no longer be run), boot from a bootable cd and manually copy glibc-2.2.2.so, libm-2.2.2.so, and ld-2.2.2.so, sync and reboot. Then, everything seemed to be normal.

3. Recompiling shadow utils

Before I recompiled glibc, I had manually put a user in more than 32 groups (that means it already belonged to 32 groups and I manually modified /etc/groups). Proper permissions were granted for groups above 32, but usermod failed to add a user to more than 32 groups. I began browsing the shadow utils source and found that it uses the system headers at compile time to set the limit. This means that it had to be recompiled, because the old limits were hard coded into the binaries. A simple recompilation will do. However I made a patch against shadow-20000826, that will dinamically allocate space for the group structure using __sysconf(). This means it won't have to be recompiled if glibc is recompiled with a different limit.

4. Fixing process tools

Once again I thought everything was fine. However I ran apache webserver as user www1, which belonged to more than 100 groups (that was a security measure for massive virtual hosting). The message 'Internal error' appeared (apparently) at random while running different programs. After a few grep's I figured out the message came from libproc. I began browsing the procps sources and found a terrible bug. Process information is read from the kernel and concatenated into one string which is then parsed to get a dynamic list. The problem is that the string was blindly dimensioned to 512 bytes, which was not enough to hold information for so many groups. I made a patch against procps-2.0.7, which only defines a symbolic constant in readproc.c and allocates the string with the size given by that constant. Of course, I used a larger value, such as 4096. You'll have to apply this patch and recompile procps.

Pages

Latest posts

Tag Cloud