Bug 162759 - System occasionally experienced system hangs.
Summary: System occasionally experienced system hangs.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Doug Ledford
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 168429
 
Reported: 2005-07-08 13:22 UTC by Jeff Burke
Modified: 2007-11-30 22:07 UTC

Fixed In Version: RHSA-2006-0132
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-03-07 19:16:47 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2006:0132 qe-ready SHIPPED_LIVE Moderate: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 3 2006-03-09 16:31:00 UTC

Description Jeff Burke 2005-07-08 13:22:30 UTC
Description of problem:
  While running RHEL4 stress testing, an x86_64 system occasionally
  hung.  Forced netdumps made the sequence of events leading up to
  the hangs readily evident.

Version-Release number of selected component (if applicable):


How reproducible:
  Sometimes, but not always.


Steps to Reproduce:
  1. Install the RHEL4 2.6.9-11.16 (or higher) x86_64 kernel.
  2. Download and install ltp-full-20050307.tgz.
  3. Run the ltpstress.sh script.
  
  
Actual results:
  System occasionally experienced system hangs.


Expected results:
  The system should remain relatively responsive, considering that it is under a stress load.

Additional info:

The problem is in the drivers/md/md.c sync_page_io() function,
where GFP_KERNEL is used as an argument to bio_alloc():

static int sync_page_io(struct block_device *bdev, sector_t sector, int size,
                   struct page *page, int rw)
{
        struct bio *bio = bio_alloc(GFP_KERNEL, 1);
        struct completion event;
        int ret;
...

Because of the use of GFP_KERNEL, under tight memory constraints,
it is possible that the same process may call the drivers/md/md.c
md_write_start() function recursively; the second time it gets
called, it will block forever in mddev_lock_uninterruptible():

void md_write_start(mddev_t *mddev)
{
        if (!atomic_read(&mddev->writes_pending)) {
                mddev_lock_uninterruptible(mddev);
                if (mddev->in_sync) {
                        mddev->in_sync = 0;
                        del_timer(&mddev->safemode_timer);
                        md_update_sb(mddev);
                }
                atomic_inc(&mddev->writes_pending);
                mddev_unlock(mddev);
        } else
                atomic_inc(&mddev->writes_pending);
}

This can happen when the bio_alloc(GFP_KERNEL) call in
sync_page_io() results in:

 (1) page reclamation being performed due to low memory, and
 (2) the page selected for reclamation is on the same device as the
     original page, leading to recursion back into md_write_start()

Here's an example x86_64 trace; while the "pdflush" daemon was doing a page write,
it made the first call to md_write_start() at frame #29, took the semaphore
via mddev_lock_uninterruptible(), called md_update_sb() in frame #28, which
called sync_page_io() in frame #27.  The bio_alloc(GFP_KERNEL) call required
page reclamation to satisfy the request, eventually leading to the call to
make_request() in frame #4, which recursively called md_write_start() in
frame #3, where the ".text.lock.md at ffffffff80293b80" reference is the
mddev_lock_uninterruptible() call in md_write_start():

crash> bt
PID: 8108   TASK: 100281897f0       CPU: 2   COMMAND: "pdflush"
 #0 [10038ea3058] schedule at ffffffff802f9a4a
 #1 [10038ea3130] __sched_text_start at ffffffff802f8d53
 #2 [10038ea3190] __down_failed at ffffffff802fa32f
 #3 [10038ea31e0] .text.lock.md at ffffffff80293b80
 #4 [10038ea3200] make_request at ffffffffa00619b5
 #5 [10038ea3218] mempool_alloc at ffffffff80156bb0
 #6 [10038ea3220] cache_alloc_debugcheck_after at ffffffff8015bf71
 #7 [10038ea3280] generic_make_request at ffffffff8024079e
 #8 [10038ea32d0] submit_bio at ffffffff802408aa
 #9 [10038ea3300] bio_alloc at ffffffff80177db0
#10 [10038ea3330] submit_bh at ffffffff80175ce2
#11 [10038ea3360] __block_write_full_page at ffffffff80176bb9
#12 [10038ea33b0] shrink_zone at ffffffff8016013b
#13 [10038ea34a0] ip_local_deliver at ffffffff802b8e6c
#14 [10038ea34d0] free_pages_bulk at ffffffff8015769c
#15 [10038ea3660] try_to_free_pages at ffffffff80160725
#16 [10038ea36c0] kmem_cache_alloc at ffffffff8015c114
#17 [10038ea3700] __alloc_pages at ffffffff8015802e
#18 [10038ea3730] recalc_task_prio at ffffffff80130649
#19 [10038ea3770] alloc_page_interleave at ffffffff8016e0bc
#20 [10038ea3780] __get_free_pages at ffffffff801581bf
#21 [10038ea3790] kmem_getpages at ffffffff8015b107
#22 [10038ea37b0] cache_alloc_refill at ffffffff8015c3f8
#23 [10038ea37f0] kmem_cache_alloc at ffffffff8015c0fd
#24 [10038ea3810] mempool_alloc at ffffffff80156be9
#25 [10038ea3840] autoremove_wake_function at ffffffff80133983
#26 [10038ea38a0] bio_alloc at ffffffff80177cac
#27 [10038ea38d0] sync_page_io at ffffffff8028e7e1
#28 [10038ea3930] md_update_sb at ffffffff8028ff08
#29 [10038ea39a0] md_write_start at ffffffff80292e25
#30 [10038ea39c0] make_request at ffffffffa00619b5
#31 [10038ea39d8] mempool_alloc at ffffffff80156bb0
#32 [10038ea39e0] cache_alloc_debugcheck_after at ffffffff8015bf71
#33 [10038ea3a40] generic_make_request at ffffffff8024079e
#34 [10038ea3a90] submit_bio at ffffffff802408aa
#35 [10038ea3ac0] bio_alloc at ffffffff80177db0
#36 [10038ea3af0] submit_bh at ffffffff80175ce2
#37 [10038ea3b20] __block_write_full_page at ffffffff80176bb9
#38 [10038ea3b70] ext3_ordered_writepage at ffffffffa007cb26
#39 [10038ea3ba0] mpage_writepages at ffffffff80192b8d
#40 [10038ea3c80] thread_return at ffffffff802f9a74
#41 [10038ea3d20] del_timer at ffffffff8013dcbb
#42 [10038ea3d40] del_singleshot_timer_sync at ffffffff8013dd78
#43 [10038ea3d50] schedule_timeout at ffffffff802fa0d7
#44 [10038ea3db0] __writeback_single_inode at ffffffff801919d5
#45 [10038ea3df0] sync_sb_inodes at ffffffff80192049
#46 [10038ea3e30] writeback_inodes at ffffffff801922e0
#47 [10038ea3e50] background_writeout at ffffffff80159280
#48 [10038ea3ed0] pdflush at ffffffff80159db0
#49 [10038ea3f20] kthread at ffffffff80149003
#50 [10038ea3f50] kernel_thread at ffffffff80110c8f

Eventually things start backing up, waiting for the pdflush
thread to complete its requested task.

Note that the bio_alloc(GFP_KERNEL) call was added between
2.6.10 and 2.6.11, and is included in the RHEL4
"linux-2.6.10-ac-selected-bits.patch" patch.

If the argument were GFP_NOIO, the recursion would be avoided
because page reclamation would not initiate writeback:

--- linux-2.6.9/drivers/md/md.c.orig
+++ linux-2.6.9/drivers/md/md.c
@@ -364,7 +364,7 @@ static int bi_complete(struct bio *bio,
 static int sync_page_io(struct block_device *bdev, sector_t sector, int size,
                   struct page *page, int rw)
 {
-       struct bio *bio = bio_alloc(GFP_KERNEL, 1);
+       struct bio *bio = bio_alloc(GFP_NOIO, 1);
        struct completion event;
        int ret;

We used the less restrictive GFP_NOFS and never saw the hang after
that, but it appears that a make_request() could still be issued
recursively to the same device with GFP_NOFS, so GFP_NOIO would
be safer.

Comment 1 Dave Anderson 2005-08-18 15:44:06 UTC
Update from a couple of days ago: Jeff Burke gathered the sysrq-t output from
another 2.6.9-16 deadlock, in which dozens of tasks were blocked in I/O
transactions in mddev_lock_uninterruptible().  There's no vmcore for this one,
but the *recursive* mddev_lock_uninterruptible() caller appears to be the
md1_raid1 task, which first takes the semaphore in md_handle_safemode(), and
then subsequently via the sync_page_io() route, calls bio_alloc(GFP_KERNEL),
leading to the deadlocking mddev_lock_uninterruptible() call:

  md1_raid1 D 0000010037e108d8 0 273 1 274 250 (L-TLB)
  000001013f781588 0000000000000046 7974742f7665642f 0000001900000073
  000001000a3167f0 0000000000000073 000001000565d840 0000000300000246
  000001013f64c7f0 0000000000001eec
  Call Trace:<ffffffff80178420>{__find_get_block_slow+255}
  <ffffffff80302183>{__down+147}
  <ffffffff80132ef9>{default_wake_function+0}
  <ffffffff8030375f>{__down_failed+53}
  <ffffffff8029ce28>{.text.lock.md+155}       <= mddev_lock_uninterruptible()
  <ffffffffa00629b5>{:raid1:make_request+622}
  <ffffffff8024959e>{generic_make_request+355}
  <ffffffff8013478e>{autoremove_wake_function+0}
  <ffffffff802496aa>{submit_bio+247} 
  <ffffffff8017b224>{bio_alloc+288}
  <ffffffff80179156>{submit_bh+255}
  <ffffffff8017a02d>{__block_write_full_page+440}
  <ffffffffa007f40a>{:ext3:ext3_get_block+0}
  <ffffffffa007db46>{:ext3:ext3_ordered_writepage+245}
  <ffffffff80163553>{shrink_zone+3095} 
  <ffffffff801cdf24>{avc_has_perm+70}
  <ffffffff80163b3d>{try_to_free_pages+303}
  <ffffffff8015c4c6>{__alloc_pages+596}
  <ffffffff8015c657>{__get_free_pages+11} 
  <ffffffff8015f4ec>{kmem_getpages+36}
  <ffffffff8015fc81>{cache_alloc_refill+609}
  <ffffffff8015f9bf>{kmem_cache_alloc+90}
  <ffffffff8015b049>{mempool_alloc+186}
  <ffffffff8013478e>{autoremove_wake_function+0}
  <ffffffff80134797>{autoremove_wake_function+9}
  <ffffffff8013478e>{autoremove_wake_function+0}
  <ffffffff8017b120>{bio_alloc+28} 
  <ffffffff80297a89>{sync_page_io+48}
  <ffffffff802991b0>{md_update_sb+263}         
  <ffffffff80131cd9>{finish_task_switch+55}
  <ffffffffa0062ffc>{:raid1:raid1d+0}
  <ffffffff8029c237>{md_handle_safemode+244}  <= mddev_lock_uninterruptible()
  <ffffffffa0063022>{:raid1:raid1d+38} 
  <ffffffff80302ee8>{thread_return+110}
  <ffffffffa0062ffc>{:raid1:raid1d+0} 
  <ffffffff80299c91>{md_thread+392}
  <ffffffff8013478e>{autoremove_wake_function+0}
  <ffffffff80131cd9>{finish_task_switch+55}
  <ffffffff8013478e>{autoremove_wake_function+0}
  <ffffffff80131d28>{schedule_tail+11}
  <ffffffff80110ca3>{child_rip+8} <ffffffffa0062ffc>{:raid1:raid1d+0}
  <ffffffff80299b09>{md_thread+0} <ffffffff80110c9b>{child_rip+0}

 

Comment 4 Red Hat Bugzilla 2006-03-07 19:16:47 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html


