Bug 158592 - cman master confusion due to recovery
Summary: cman master confusion due to recovery
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cman
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2005-05-23 19:54 UTC by Corey Marthaler
Modified: 2009-04-16 19:59 UTC (History)

Fixed In Version: RHBA-2005-734
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-10-07 16:46:16 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2005:734 normal SHIPPED_LIVE cman-kernel bug fix update 2005-10-07 04:00:00 UTC

Description Corey Marthaler 2005-05-23 19:54:46 UTC
Description of problem:
I have been seeing this quite a bit lately while running revolver. Revolver will
shoot its nodes, and when bringing them back up the cman join ends up deadlocking.

In this case, tank-03 and tank-05 were shot, they came back up, had ccsd started
on them, and then cman_tool join attempted.

For whatever reason, tank-01 thinks tank-02 is the master and vice versa:

[root@tank-01 ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: tank-cluster
Cluster ID: 46516
Cluster Member: Yes
Membership state: State-Transition: Master is tank-02
Nodes: 3
Expected_votes: 5
Total_votes: 3
Quorum: 3
Active subsystems: 9
Node name: tank-01
Node addresses: 10.15.84.91


[root@tank-02 ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: tank-cluster
Cluster ID: 46516
Cluster Member: Yes
Membership state: State-Transition: Master is tank-01
Nodes: 3
Expected_votes: 5
Total_votes: 3
Quorum: 3
Active subsystems: 9
Node name: tank-02
Node addresses: 10.15.84.92

[root@tank-03 ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: tank-cluster
Cluster ID: 46516
Cluster Member: No
Membership state: Join-Wait


[root@tank-04 ~]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: tank-cluster
Cluster ID: 46516
Cluster Member: Yes
Membership state: State-Transition: Master is tank-01
Nodes: 3
Expected_votes: 5
Total_votes: 3
Quorum: 3
Active subsystems: 9
Node name: tank-04
Node addresses: 10.15.84.94
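
The mutual-master condition shown in the status outputs above can be checked
mechanically. The following is a minimal sketch, not part of cman: it assumes
the text of /proc/cluster/status has already been collected from each member,
and the function names (transition_master, find_master_cycles) are hypothetical
helpers introduced only for illustration.

```python
import re

# Matches the transition line as it appears in the status outputs above.
MASTER_RE = re.compile(
    r"^Membership state:\s*State-Transition: Master is (\S+)\s*$", re.M
)


def transition_master(status_text):
    """Return the master named in a State-Transition line, or None."""
    match = MASTER_RE.search(status_text)
    return match.group(1) if match else None


def find_master_cycles(statuses):
    """Find pairs of nodes that each name the other as transition master.

    statuses maps a node name to the text of its /proc/cluster/status.
    """
    claimed = {node: transition_master(text) for node, text in statuses.items()}
    pairs = []
    for node, master in sorted(claimed.items()):
        # node < master reports each pair once and skips self-mastery.
        if master and node < master and claimed.get(master) == node:
            pairs.append((node, master))
    return pairs


# Trimmed copies of the outputs in this report (hypothetical collection step).
SAMPLE = {
    "tank-01": "Membership state: State-Transition: Master is tank-02\n",
    "tank-02": "Membership state: State-Transition: Master is tank-01\n",
    "tank-03": "Membership state: Join-Wait\n",
    "tank-04": "Membership state: State-Transition: Master is tank-01\n",
}

if __name__ == "__main__":
    print(find_master_cycles(SAMPLE))
```

With the values from this report, the only pair found is tank-01 and tank-02.
tank-04 names tank-01 but is not itself named, so it is not part of a cycle.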



[root@tank-02 ~]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    5   M   tank-01
   2    1    5   X   tank-03
   3    1    5   M   tank-02
   4    1    5   M   tank-04
   5    1    5   X   tank-05
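
As a cross-check, the node table above can be reconciled with the vote counts in
the status output (Expected_votes 5, Total_votes 3, Quorum 3). This is a sketch
under stated assumptions: the "M" state is taken to mean a current member, and
the quorum is assumed to be a simple majority of the expected votes, which
matches the numbers in this report but may not be cman's exact calculation. The
helper names are hypothetical.

```python
NODES_TABLE = """\
Node  Votes Exp Sts  Name
   1    1    5   M   tank-01
   2    1    5   X   tank-03
   3    1    5   M   tank-02
   4    1    5   M   tank-04
   5    1    5   X   tank-05
"""


def parse_nodes(table_text):
    """Parse /proc/cluster/nodes style text into a list of dicts."""
    nodes = []
    for line in table_text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a five-column data row.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        node_id, votes, expected, state, name = fields
        nodes.append({
            "id": int(node_id),
            "votes": int(votes),
            "expected": int(expected),
            "state": state,
            "name": name,
        })
    return nodes


def vote_summary(nodes):
    """Summarise votes; the quorum formula here is an assumption."""
    total = sum(n["votes"] for n in nodes if n["state"] == "M")
    expected = max(n["expected"] for n in nodes)
    quorum = expected // 2 + 1  # assumed: simple majority of expected votes
    return {
        "total_votes": total,
        "expected_votes": expected,
        "quorum": quorum,
        "quorate": total >= quorum,
    }


if __name__ == "__main__":
    print(vote_summary(parse_nodes(NODES_TABLE)))
```

For the table above this gives 3 total votes against a quorum of 3, so the
three remaining members are still quorate while tank-03 and tank-05 wait to
rejoin.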


[root@tank-02 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[3 4 2 5 1]

DLM Lock Space:  "clvmd"                             3   3 run       -
[3 4 2 5 1]

DLM Lock Space:  "corey0"                            4   4 run       -
[3 4 2 5 1]

DLM Lock Space:  "corey1"                            6   6 run       -
[3 4 2 5 1]

GFS Mount Group: "corey0"                            5   5 run       -
[3 4 2 5 1]

GFS Mount Group: "corey1"                            7   7 run       -
[3 4 2 5 1]



[root@tank-01 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:    2862611
Unlock operations:  2853855
Convert operations: 11399480
Completion ASTs:    17115870
Blocking ASTs:            2

Lockqueue        num  waittime   ave
WAIT_RSB       21573     99436     4
WAIT_GRANT      5606      1597     0
WAIT_UNLOCK       30        50     1
Total          27209    101083     3


[root@tank-02 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:    4235278
Unlock operations:  4226584
Convert operations: 14944074
Completion ASTs:    23405844
Blocking ASTs:          115

Lockqueue        num  waittime   ave
WAIT_RSB      1197641   18662165    15
WAIT_CONV         31       501    16
WAIT_GRANT      6009     23749     3
WAIT_UNLOCK      353      3951    11
Total         1204034   18690366    15


[root@tank-04 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:     574693
Unlock operations:   562742
Convert operations: 2112002
Completion ASTs:    3249411
Blocking ASTs:           18

Lockqueue        num  waittime   ave
WAIT_RSB      534095   17185688    32
WAIT_GRANT      5669      5905     1
WAIT_UNLOCK       78      1555    19
Total         539842   17193148    31
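
For reading the Lockqueue tables above: in all three outputs the "ave" column is
consistent with waittime divided by num, truncated to an integer, and the Total
row is the sum of the rows above it. The sketch below checks that relationship
against the tank-04 table. It is an observation about these numbers, not a
statement about how the DLM computes them, and the helper names are
hypothetical.

```python
LOCKQUEUE_TANK04 = """\
Lockqueue        num  waittime   ave
WAIT_RSB      534095   17185688    32
WAIT_GRANT      5669      5905     1
WAIT_UNLOCK       78      1555    19
Total         539842   17193148    31
"""


def parse_lockqueue(text):
    """Map each row label to a (num, waittime, ave) tuple of ints."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have a label followed by three integer columns.
        if len(fields) == 4 and all(f.isdigit() for f in fields[1:]):
            rows[fields[0]] = tuple(int(f) for f in fields[1:])
    return rows


def check_lockqueue(rows):
    """Return (sums_ok, aves_ok) for a parsed Lockqueue table."""
    parts = [values for label, values in rows.items() if label != "Total"]
    total = rows["Total"]
    sums_ok = (
        sum(p[0] for p in parts) == total[0]
        and sum(p[1] for p in parts) == total[1]
    )
    # Assumed relationship: ave == waittime // num (integer division).
    aves_ok = all(
        ave == (wait // num if num else 0) for num, wait, ave in rows.values()
    )
    return sums_ok, aves_ok


if __name__ == "__main__":
    print(check_lockqueue(parse_lockqueue(LOCKQUEUE_TANK04)))
```

The same check holds for the tank-01 and tank-02 tables. The notable difference
between nodes is the WAIT_RSB row, where tank-02 and tank-04 show far more
waiting requests and a much higher average than tank-01.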



[root@tank-01 ~]# cat /proc/cluster/dlm_debug
clvmd move flags 0,1,0 ids 0,2,0
clvmd move use event 2
clvmd recover event 2 (first)
clvmd add nodes
clvmd total nodes 5
clvmd rebuild resource directory
clvmd rebuilt 0 resources
clvmd recover event 2 done
clvmd move flags 0,0,1 ids 0,2,2
clvmd process held requests
clvmd processed 0 requests
clvmd recover event 2 finished
corey0 move flags 0,1,0 ids 0,3,0
corey0 move use event 3
corey0 recover event 3 (first)
corey0 add nodes
corey0 total nodes 5
corey0 rebuild resource directory
corey0 rebuilt 5812 resources
corey0 recover event 3 done
corey0 move flags 0,0,1 ids 0,3,3
corey0 process held requests
corey0 processed 0 requests
corey0 recover event 3 finished
corey1 move flags 0,1,0 ids 0,5,0
corey1 move use event 5
corey1 recover event 5 (first)
corey1 add nodes
corey1 total nodes 5
corey1 rebuild resource directory
corey1 rebuilt 5870 resources
corey1 recover event 5 done
corey1 move flags 0,0,1 ids 0,5,5
corey1 process held requests
corey1 processed 0 requests
corey1 recover event 5 finished



[root@tank-02 ~]# cat /proc/cluster/dlm_debug
00000 node -1/-1 "       7
corey0 resent 4 requests
corey0 recover event 87 finished
corey1 move flags 1,0,0 ids 85,85,85
corey1 move flags 0,1,0 ids 85,89,85
corey1 move use event 89
corey1 recover event 89
corey1 add node 1
corey1 total nodes 5
corey1 rebuild resource directory
corey1 rebuilt 5952 resources
corey1 purge requests
corey1 purged 0 requests
corey1 mark waiting requests
corey1 mark 2be008e lq 1 nodeid -1
corey1 mark 2bb029e lq 1 nodeid -1
corey1 mark 2b20362 lq 1 nodeid -1
corey1 mark 2c70149 lq 1 nodeid -1
corey1 marked 4 requests
corey1 recover event 89 done
corey1 move flags 0,0,1 ids 85,89,89
corey1 process held requests
corey1 processed 0 requests
corey1 resend marked requests
corey1 resend 2be008e lq 1 flg 200000 node -1/-1 "       7
corey1 resend 2bb029e lq 1 flg 200000 node -1/-1 "      11
corey1 resend 2b20362 lq 1 flg 200000 node -1/-1 "       7
corey1 resend 2c70149 lq 1 flg 200000 node -1/-1 "      11
corey1 resent 4 requests
corey1 recover event 89 finished



Version-Release number of selected component (if applicable):
[root@tank-01 ~]# rpm -qa | grep cman
cman-1.0-0.pre33.14
cman-kernheaders-2.6.9-34.3
cman-kernel-smp-2.6.9-34.3


How reproducible:
Revolver appears to always hit this eventually.

Comment 1 Christine Caulfield 2005-05-24 15:03:11 UTC
This might happen if two nodes go into a CHECK transition at slightly different
(but still overlapping) times. This checkin fixes that problem. I hope it also
fixes this problem!

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.44.2.19; previous revision: 1.44.2.18
done


Comment 3 Red Hat Bugzilla 2005-10-07 16:46:16 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-734.html


