Bug 156576 - Panic due to outstanding pg_init completion after multipath mapped device is suspended and the mapped device's multipath structure is destroyed via its destructor.
Keywords:
Status: CLOSED DUPLICATE of bug 154442
Alias: None
Product: Fedora
Classification: Fedora
Component: device-mapper-multipath
Version: rawhide
Hardware: All
OS: Linux
medium
high
Target Milestone: ---
Assignee: Alasdair Kergon
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2005-05-02 00:53 UTC by Ed Goggin
Modified: 2007-11-30 22:11 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-07-02 19:56:59 UTC


Description Ed Goggin 2005-05-02 00:53:10 UTC
Description of problem:

dm-mpath panicked at dm_pg_init_complete+0x10 while I was testing multipath's
reaction to a CLARiion CX300 non-destructive ucode upgrade (NDU).

I am using multipath tools version 0.4.3-pre5 and Linux kernel version
2.6.11-rc3-udm2.  I see no code in place that would prevent this problem
from occurring on Red Hat AS 4 Update 1.

The panic is occurring due to corrupted memory in the path structure
for a multipath pg_init i/o completion.  I suspect that the memory for
the path structure (and its encompassing path group and multipath structure)
has been freed by the multipath destructor, subsequently re-allocated
for other use, and written upon.

I suspect that the problem is caused by having a pg_init request outstanding
while the pending count on a multipath mapped device is zero at the time the
multipath mapped device is suspended.

Since pg_init requests are not accounted for in the pending
count of the multipath mapped device structure, it is possible
to have outstanding pg_init requests awaiting i/o completion
when the pending count is zero.  If the multipath table
is destroyed via dm_table_destroy() before the pg_init i/o
completion arrives, dm_pg_init_complete() can reference
corrupted memory.  This can happen either from the swap-in
of a new dm table or the closing of the dm mapped device.
My panic involves the former use case.
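
The sequence I suspect can be sketched in user space as follows.  This is
only a model of my reading of the problem, not the dm-mpath code: the
struct mp_model type, its fields, and the three functions are hypothetical
names, and a "destroyed" flag stands in for the freed path, path group and
multipath structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a multipath mapped device; not the kernel structures. */
struct mp_model {
    int pending;            /* in-flight regular i/o counted by the mapped device */
    int pg_init_in_flight;  /* outstanding pg_init requests, NOT part of pending */
    bool destroyed;         /* stands in for memory freed by the destructor */
};

/* Suspension only waits for the pending count to drain to zero. */
static bool suspend_may_proceed(const struct mp_model *m)
{
    return m->pending == 0;
}

/* Models the table being destroyed after the suspend (table swap or close). */
static void table_destroy(struct mp_model *m)
{
    m->destroyed = true;
}

/*
 * Models the pg_init i/o completion.  Returns true when the completion
 * arrives after the destructor ran, which in the real code would be a
 * reference to freed (and possibly reused) memory.
 */
static bool pg_init_complete(struct mp_model *m)
{
    assert(m->pg_init_in_flight > 0);
    m->pg_init_in_flight--;
    return m->destroyed;
}
```

With pending at 0 and one pg_init in flight, suspend_may_proceed() returns
true, table_destroy() runs, and the later pg_init_complete() reports a stale
reference.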

I suspect that the prerequisite state of having a pg_init request
outstanding while having a 0 pending count for a multipath during
a multipath device suspension is reached in one of two possible ways.
First, a multipath has no pending requests, and a call from user space
to switch_pg_num() sends a pg_init request just before the multipath
device is suspended.  Second, a multipath mapped device that has one or
more pending requests and a pg_init request outstanding is suspended.

Reasonable fixes for both cases (which I tried) are (1) to only send a pg_init
request if there are one or more pending requests queued for the multipath and
(2) to not dispatch pending requests if there is a pg_init request outstanding.
With these changes in place, the problem no longer occurred.
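
A rough user-space sketch of the two rules, to show the invariant they are
meant to maintain (a pg_init in flight implies a non-zero pending count).
This is not the patch I tested: struct mpath_fix_model and all function
names are hypothetical, and the model assumes that requests queued in the
multipath still count toward the mapped device's pending count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; field and function names are illustrative only. */
struct mpath_fix_model {
    int queued;              /* requests held in the multipath queue */
    int dispatched;          /* requests sent down a path, not yet completed */
    bool pg_init_in_flight;  /* a pg_init request is outstanding */
};

/* Assumption: queued and dispatched requests both count as pending. */
static int pending_count(const struct mpath_fix_model *m)
{
    return m->queued + m->dispatched;
}

/* Rule (1): only send a pg_init when at least one request is queued. */
static bool try_send_pg_init(struct mpath_fix_model *m)
{
    if (m->queued == 0 || m->pg_init_in_flight)
        return false;
    m->pg_init_in_flight = true;
    return true;
}

/* Rule (2): do not dispatch queued requests while a pg_init is outstanding. */
static int dispatch_queued(struct mpath_fix_model *m)
{
    if (m->pg_init_in_flight)
        return 0;
    int n = m->queued;
    m->dispatched += n;
    m->queued = 0;
    return n;
}

/* Models the pg_init i/o completion. */
static void pg_init_done(struct mpath_fix_model *m)
{
    assert(m->pg_init_in_flight);
    m->pg_init_in_flight = false;
}

/* The property a suspend relies on: no pg_init in flight when pending is 0. */
static bool invariant_holds(const struct mpath_fix_model *m)
{
    return !m->pg_init_in_flight || pending_count(m) > 0;
}
```

In this model the queued requests that justified the pg_init stay queued
until pg_init_done() runs, so the pending count cannot drain to zero while
the pg_init is still outstanding.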


Comment 1 Lars Marowsky-Bree 2005-05-03 10:19:11 UTC
I think this is fixed by the updated patch I attached for bug #155428 and maybe
bug #15442 also has an impact here.

Does the bug still occur with these updates in place?


Comment 2 Ed Goggin 2005-05-03 13:06:05 UTC
(In reply to comment #1)
> I think this is fixed by the updated patch I attached for bug #155428 and
> maybe bug #15442 also has an impact here.
> Does the bug still occur with these updates in place?

From reading the description for bug #155428, I cannot see how that bug
and this one are at all related.  Also, I am not able to access bug
#15442 (or #155442???).


Comment 3 Alasdair Kergon 2005-05-04 14:57:28 UTC
bug 154442

Comment 6 Alasdair Kergon 2005-07-02 19:56:59 UTC
Marking as duplicate of bug 154442 as the fix is believed to be the same.

*** This bug has been marked as a duplicate of 154442 ***

