Bug 233521 - cman panic in start_transition after cmirror device failure caused node to shut down
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cman-kernel
Version: 4
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
Depends On:
Reported: 2007-03-22 21:26 UTC by Corey Marthaler
Modified: 2009-04-16 20:01 UTC

Fixed In Version: RHBA-2007-0990
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2007-11-21 21:54:06 UTC


System | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2007:0990 | normal | SHIPPED_LIVE | cman-kernel bug fix update | 2007-11-29 15:17:57 UTC

Description Corey Marthaler 2007-03-22 21:26:23 UTC
Description of problem:
I was running on a 4 node x86_64 cluster (link-02,04,07,08) and I created one
cluster mirror as such:

[root@link-02 ~]# lvs -a -o +devices
  LV              VG         Attr   LSize  Origin Snap%  Move Log       Copy% 
  LogVol00        VolGroup00 -wi-ao 35.16G                                    
  LogVol01        VolGroup00 -wi-ao  1.94G                                    
  test            vg         mwi-a-  1.00G                    test_mlog 100.00
  [test_mimage_0] vg         iwi-ao  1.00G                                    
  [test_mimage_1] vg         iwi-ao  1.00G                                    
  [test_mlog]     vg         lwi-ao  4.00M                                    

I then started a small amount of I/O with the following:
[root@link-02 ~]# dd if=/dev/zero of=/dev/vg/test bs=4M

And then failed the primary leg of that cmirror on all nodes:
[root@link-02 ~]# echo offline > /sys/block/sda/device/state

This then caused link-02 to appear hung for a while, and then link-04 withdrew
from the cluster, and link-08 finally panicked.

Mar 22 11:18:01 link-04 kernel: scsi1 (1:1): rejecting I/O to offline device
Mar 22 11:19:35 link-04 kernel: CMAN: Being told to leave the cluster by node 3
Mar 22 11:19:35 link-04 kernel: CMAN: we are leaving the cluster.
Mar 22 11:19:35 link-04 kernel: WARNING: dlm_emergency_shutdown
Mar 22 11:19:35 link-04 kernel: WARNING: dlm_emergency_shutdown
Mar 22 11:19:35 link-04 kernel: SM: 00000003 sm_stop: SG still joined
Mar 22 11:19:35 link-04 kernel: SM: 01000179 sm_stop: SG still joined

Mar 22 16:30:25 link-07 kernel: scsi3 (0:1): rejecting I/O to offline device
Mar 22 16:30:25 link-07 last message repeated 31 times
Mar 22 16:30:41 link-07 kernel: CMAN: node link-04 has been removed from the cluster : Shutdown
Mar 22 16:31:03 link-07 kernel: CMAN: node link-02 has been removed from the cluster : No response to messages
Mar 22 16:31:04 link-07 kernel: CMAN: quorum lost, blocking activity
Mar 22 16:31:37 link-07 kernel: CMAN: removing node link-08 from the cluster : Missed too many heartbeats

CMAN: removing node link-04 from the cluster : Shutdown
CMAN: removing node link-02 from the cluster : No response to messages
CMAN: quorum lost, blocking activity
Unable to handle kernel NULL pointer dereference at 000000000000003c RIP:
PML4 1a490067 PGD 1a484067 PMD 0
Oops: 0000 [1] SMP
Modules linked in: dm_cmirror(U) dlm(U) cman(U) qla2300 qla2xxx
scsi_transport_fc md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket
pcmcia_core button battery ac ohci_hcd hw_random k8_edac edac_mc tg3 floppy
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod mptscsih mptsas mptspi mptscsi
mptbase sd_mod scsi_mod
Pid: 2616, comm: cman_memb Not tainted 2.6.9-50.ELsmp
RIP: 0010:[<ffffffffa0239ff1>] <ffffffffa0239ff1>{:cman:start_transition+447}
RSP: 0018:0000010015807d08  EFLAGS: 00010246
RAX: 0000000000000001 RBX: 0000000000000005 RCX: 0000000000000007
RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000005
RBP: 0000000000000000 R08: 0000000800030304 R09: 0000000000000040
R10: 00000000000005dc R11: 000001001efc3060 R12: 0000010015807da8
R13: ffffffffa02547a0 R14: 0000000000000000 R15: 0000010015807dc8
FS:  0000002a95562b00(0000) GS:ffffffff804ee200(0000) knlGS:00000000f7ff1b00
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000000003c CR3: 0000000000101000 CR4: 00000000000006e0
Process cman_memb (pid: 2616, threadinfo 0000010015806000, task 00000100302e8030)
Stack: 00000000000005dc 0000000000000000 0000010015807e38 0000010015807da8
       0000000000000008 ffffffffa0253e60 0000010015807dc8 ffffffffa023c7ac
       00000100302e8030 0000010001008c60
Call Trace:<ffffffffa023c7ac>{:cman:dispatch_messages+5705}

Code: 8b 45 3c 41 bc 14 00 00 00 41 bf 00 00 20 00 41 88 45 03 8b
RIP <ffffffffa0239ff1>{:cman:start_transition+447} RSP <0000010015807d08>
CR2: 000000000000003c
 <0>Kernel panic - not syncing: Oops

Version-Release number of selected component (if applicable):
[root@link-07 ~]# uname -ar
Linux link-07 2.6.9-50.ELsmp #1 SMP Tue Mar 6 18:04:58 EST 2007 x86_64 x86_64
x86_64 GNU/Linux
[root@link-07 ~]# rpm -q cman
[root@link-07 ~]# rpm -q cman-kernel

Comment 1 Corey Marthaler 2007-03-22 21:37:45 UTC
Here's a little more info from the only node still "in" the cluster

[root@link-07 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    4   X   link-08
   2    1    4   M   link-07
   3    1    4   X   link-02
   4    1    4   X   link-04
[root@link-07 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           3   2 recover 0 -

DLM Lock Space:  "clvmd"                           377 257 recover 0 -

DLM Lock Space:  "clustered_log"                   379 259 recover 0 -

[root@link-07 cluster]# cat status
Protocol version: 5.0.1
Config version: 2
Cluster name: LINK_128
Cluster ID: 19208
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 4
Total_votes: 1
Quorum: 3  Activity blocked
Active subsystems: 4
Node name: link-07
Node ID: 2
Node addresses:

[root@link-07 cluster]# cat sm_debug
count 3
00000003 remove node 4 count 2
0100017b remove node 3 count 3
0100017b remove node 4 count 2
01000179 remove node 3 count 3
01000179 remove node 4 count 2
00000003 remove node 1 count 1
0100017b remove node 1 count 1
01000179 remove node 1 count 1
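
The status output above shows why link-07 reports "Activity blocked": with Expected_votes at 4 the quorum is 3, and only one vote remains. A minimal sketch of that arithmetic, assuming a simple majority rule (quorum = expected_votes // 2 + 1), which matches the numbers reported here. The function name is illustrative and is not taken from the cman source.

```python
def quorum_for(expected_votes):
    """Majority threshold (assumed rule): more than half of the expected votes."""
    return expected_votes // 2 + 1

# Values copied from the `cat status` output on link-07.
expected_votes = 4
total_votes = 1

quorum = quorum_for(expected_votes)
# Activity is blocked whenever the votes present fall below the quorum.
blocked = total_votes < quorum

print(f"quorum={quorum} total_votes={total_votes} blocked={blocked}")
```

Under this rule a 4-node, one-vote-per-node cluster can lose only one member before activity is blocked, so losing link-04 and link-02 was already enough to block link-07 before link-08 was removed.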

Comment 2 Corey Marthaler 2007-03-22 21:40:53 UTC
Just another note... I have automated tests that do this sort of thing over and
over, so it's surprising that when I did it once by hand this occurred.

To date, this has only been seen one time.

Comment 3 Christine Caulfield 2007-03-23 11:48:00 UTC
I think it's one of those fluke occurrences...

Looking at the code it seems most likely that a NULL node structure address was
passed into start_transition() but it's really not clear to me how that can happen.

Putting extra debugging into the code is almost certainly going to hide the
problem, if it's even reproducible at all - which seems unlikely given what
you've said.
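
The register dump in the description appears consistent with a NULL structure pointer. The following is a minimal sketch, assuming the `Code:` line begins at the faulting instruction (the dump carries no marker, so this is an assumption) and that RBP held the node pointer (an inference, not confirmed by the source). It decodes the first three bytes, `8b 45 3c`, as a load from RBP plus an 8-bit displacement and compares the resulting address with CR2. The helper name is hypothetical.

```python
# Values copied from the oops in the description.
CODE = bytes.fromhex("8b 45 3c")  # first three bytes of the Code: line
RBP = 0x0000000000000000
CR2 = 0x000000000000003c

REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_mov_load_disp8(code):
    """Decode `8b /r` (mov r32, r/m32) in the base+disp8 form only.

    Handles just mod=01 without a SIB byte or REX prefix, which is all
    this example needs; anything else raises ValueError.
    """
    opcode, modrm, disp = code[0], code[1], code[2]
    if opcode != 0x8B:
        raise ValueError("not a mov r32, r/m32 instruction")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only base + disp8 without SIB is handled")
    if disp >= 0x80:  # disp8 is sign-extended
        disp -= 0x100
    return REG32[reg], REG64[rm], disp

dest, base, disp = decode_mov_load_disp8(CODE)
fault_addr = RBP + disp

print(f"mov {disp:#x}(%{base}),%{dest}")
print(f"computed address {fault_addr:#x}, CR2 {CR2:#x}")
```

If those assumptions hold, the faulting access is a read of a field at offset 0x3c from a base pointer of zero, which is what a NULL node structure passed into start_transition() would produce.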

Comment 4 Christine Caulfield 2007-05-03 10:20:23 UTC
I found one unchecked use of the node structure being passed into
start_transition(). It's pretty unlikely, but this doesn't seem to be a common bug
so they might be related!

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v  <--  membership.c
new revision:; previous revision:

Feel free, of course, to reopen this if it happens again.

Comment 5 Chris Feist 2007-08-17 21:34:21 UTC
Setting flags for 4.6.

Comment 7 Corey Marthaler 2007-11-05 15:32:35 UTC
Marking this bug verified as it hasn't been seen in over 7 months.

Comment 9 errata-xmlrpc 2007-11-21 21:54:06 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
