Bug 134736 - kernel panic in md driver (md lacks proper locking of device lists)
Summary: kernel panic in md driver (md lacks proper locking of device lists)
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: Doug Ledford
QA Contact: Brian Brock
Depends On:
Blocks: 170417 RHEL3U8CanFix
Reported: 2004-10-05 20:30 UTC by Paul Clements
Modified: 2007-11-30 22:07 UTC (History)
4 users

Fixed In Version: RHSA-2006-0437
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2006-07-20 13:17:21 UTC
Target Upstream Version:

Attachments
locking patch against 2.4.20-28.7 (deleted)
2005-12-29 17:42 UTC, James Bottomley

System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2006:0437 normal SHIPPED_LIVE Important: Updated kernel packages for Red Hat Enterprise Linux 3 Update 8 2006-07-20 13:11:00 UTC

Description Paul Clements 2004-10-05 20:30:32 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040116

Description of problem:
Unable to handle kernel NULL pointer dereference at virtual address
 printing eip:
*pde = 00000000
Oops: 0000
nbd raid1 mousedev input parport_pc lp parport autofs audit e100
floppy sg micr
CPU:    0
EIP:    0060:[<c01fbb68>]    Not tainted
EFLAGS: 00010292
EIP is at md_do_recovery [kernel] 0x78 (2.4.21-15.EL/i686)
eax: 00000016   ebx: de634d80   ecx: c171e000   edx: 00000073
esi: 00000000   edi: d54ca000   ebp: fffffffc   esp: c171ff6c
ds: 0068   es: 0068   ss: 0068
Process mdrecoveryd (pid: 8, stackpage=c171f000)
Stack: d7154580 c171ff7c 00000000 d71545d4 d54ca300 c171e000 c25aadc0
       c25aadc8 c01fa877 00000000 c029b4aa 00000000 00000000 c171e000
       00000000 00000000 c171e000 c0341820 dffedfb0 00000000 c171e000
Call Trace:   [<c01fa877>] md_thread [kernel] 0xe7 (0xc171ff90)
[<c01fa790>] md_thread [kernel] 0x0 (0xc171ffe0)
[<c010945d>] kernel_thread_helper [kernel] 0x5 (0xc171fff0)
Code: 81 7e 04 10 33 38 c0 75 9f 83 c4 14 5b 5e 5f 5d c3 8d b4 26
Kernel panic: Fatal exception

The above panic occurs at linux-2.4.21-15.EL/drivers/md/md.c, line 3694:

void md_do_recovery(void *data)
{
        int err;
        mddev_t *mddev;
        mdp_super_t *sb;
        mdp_disk_t *spare;
        struct md_list_head *tmp;

        dprintk(KERN_INFO "md: recovery thread got woken up ...\n");
        ITERATE_MDDEV(mddev,tmp) {


ITERATE_MDDEV does not contain the necessary locking to ensure that
the device list does not change while it's being iterated over.

The panic is easily reproducible if 2 md devices are configured on a
system and they are stopped while one of the devices is in recovery.
There are, no doubt, other related panics/oopses that can occur due to
the lack of locking around access to the device lists in the 2.4
kernel md driver. 

Version-Release number of selected component (if applicable):
kernel-2.4.21-15.EL

How reproducible:
Easily (see steps below)

Steps to Reproduce:
1. create 2 md raid1 devices
2. allow the first device to complete recovery
3. allow the second device to begin recovery
4. stop the first device while the second device is in recovery
5. stop the second device while it is in recovery
6. the kernel panics in md_do_recovery()

Actual Results:  kernel panic in md_do_recovery()

Expected Results:  md device stops

Additional info:

The 2.6 md driver does not have this problem. I believe there have
also been patches circulated on the linux kernel/raid mailing lists
that attempt to correct this problem.

Comment 2 David Woodhouse 2005-12-21 18:03:42 UTC
David, it's not entirely clear -- has the patch been tested at
the customer site, or was it another patch of your own?

Comment 3 David Milburn 2005-12-21 18:21:37 UTC
David, no, I only tried to add locking to ITERATE_MDDEV(). I think the
patch didn't apply cleanly to RHEL3 and there were some fix-ups to be made.

Comment 5 James Bottomley 2005-12-29 17:38:56 UTC
We've applied the fix in Neil Brown's email to a Red Hat 8 kernel 2.4.20-28.7
and ran into a locking failure in the seq_file interface.

The down_read(&all_mddevs_sem); needs to be moved from md_seq_next() to
md_seq_start() in order to avoid a proc file bug which will cause the system to
hang.  I've attached our complete patch.

With these locking changes, the system is stable for us and no longer oopses.

Comment 6 James Bottomley 2005-12-29 17:42:36 UTC
Created attachment 122629 [details]
locking patch against 2.4.20-28.7

This patch is modified from Neil Brown's original to apply against 2.4.20-28.7
and also has the proc file locking problem fixed.

Comment 7 Doug Ledford 2006-04-18 21:15:12 UTC
I modified the patch to work with a RHEL3 kernel and with the md event interface
we have in RHEL3.  It then passed my testing, and I've submitted it internally
for review.

Comment 8 Ernie Petrides 2006-04-25 03:30:36 UTC
A fix for this problem has just been committed to the RHEL3 U8
patch pool this evening (in kernel version 2.4.21-40.10.EL).

Comment 9 Ernie Petrides 2006-04-28 21:54:04 UTC
Adding a couple dozen bugs to CanFix list so I can complete the stupid advisory.

Comment 11 Joshua Giles 2006-05-30 15:24:33 UTC
A kernel has been released that contains a patch for this problem.  Please
verify if your problem is fixed with the latest available kernel from the RHEL3
public beta channel at

Comment 12 Ernie Petrides 2006-05-30 20:25:40 UTC
Reverting to ON_QA.

Comment 14 Red Hat Bugzilla 2006-07-20 13:17:22 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
