Bug 456403 - cluster will recover even if a fence device failed
Summary: cluster will recover even if a fence device failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: David Teigland
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-07-23 12:02 UTC by David Juran
Modified: 2009-04-16 23:03 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 21:53:12 UTC
Target Upstream Version:


Attachments
example cluster configuration (deleted), 2008-07-23 12:02 UTC, David Juran
Pass 1 fixing order bug. Untested. (deleted), 2008-07-23 15:31 UTC, Lon Hohberger


Links
Red Hat Product Errata RHBA-2009:0189 (normal, SHIPPED_LIVE): cman bug-fix and enhancement update, last updated 2009-01-20 16:05:55 UTC

Description David Juran 2008-07-23 12:02:05 UTC
Description of problem:
If one has multiple fence devices in a fence level (necessary e.g. if a node has
redundant power supplies), one of the fence devices can fail but the cluster
will still recover and reclaim cluster locks. This is obviously bad, since e.g.
GFS locks will be reclaimed without the offending node having been power cycled.

Version-Release number of selected component (if applicable):
cman-2.03.04-1.fc9

How reproducible:
Every time

Steps to Reproduce:
1. Create a cluster with the attached cluster configuration (never mind the
somewhat unorthodox fencing agents...). Note that the fence device f2 will always
fail. An illustrative configuration of this shape is sketched after these steps.
2. run "service network stop" on one of the nodes


Actual results:

The other node will print in its syslog:

Jul 23 14:41:48 red fenced[2720]: hat not a cluster member after 0 sec
post_fail_delay
Jul 23 14:41:48 red fenced[2720]: fencing node "hat"
Jul 23 14:41:48 red fenced[2720]: fence "hat" failed
Jul 23 14:41:53 red kernel: GFS2: fsid=juran23:sda.0: jid=1: Trying to acquire
journal lock...
Jul 23 14:41:53 red kernel: GFS2: fsid=juran23:sda.0: jid=1: Looking at journal...
Jul 23 14:41:53 red kernel: GFS2: fsid=juran23:sda.0: jid=1: Acquiring the
transaction lock...
Jul 23 14:41:53 red kernel: GFS2: fsid=juran23:sda.0: jid=1: Replaying journal...
Jul 23 14:41:53 red kernel: GFS2: fsid=juran23:sda.0: jid=1: Replayed 1 of 1 blocks
Jul 23 14:41:53 red kernel: GFS2: fsid=juran23:sda.0: jid=1: Found 0 revoke tags
Jul 23 14:41:53 red kernel: GFS2: fsid=juran23:sda.0: jid=1: Journal replayed in 1s
Jul 23 14:41:53 red kernel: GFS2: fsid=juran23:sda.0: jid=1: Done

The node "red" has now recovered the cluster and reclaimed GFS2 locks _although
fencing failed_ 
 [root@red ~]# cman_tool services
type             level name     id       state       
fence            0     default  00010001 none        
[1 2]
dlm              1     sda      00040001 none        
[1 2]
gfs              2     sda      00030001 none        

Expected results:
Fencing should be retried and cluster operation suspended until fencing succeeds,
which is what happens if one has only a single fencing device in each level.


Additional info:
The same behavior occurs on RHEL5 with cman-2.0.84-2.el5.

Comment 1 David Juran 2008-07-23 12:02:05 UTC
Created attachment 312467 [details]
example cluster configuration

Comment 2 David Teigland 2008-07-23 14:28:55 UTC
This sounds similar to something we fixed a long time ago, will check.


Comment 3 Lon Hohberger 2008-07-23 15:28:03 UTC
update_cman() is called after the first device succeeds.

Ouch.
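
To make the ordering problem concrete, here is a schematic, compilable C sketch
of the two shapes involved. It is not the actual fenced source; struct
fence_device, the agent stubs and the update_cman() stub below are illustrative
stand-ins. With two devices in one fence level, telling cman the victim is
fenced as soon as the first device succeeds lets recovery proceed even though
the level as a whole fails; the fixed shape reports success only after every
device in the level has succeeded.

/*
 * Schematic sketch of the ordering bug; all names here are illustrative
 * stand-ins, not the actual fenced/cman code.
 */
#include <stdio.h>

struct fence_device {
	const char *name;
	int (*agent)(const char *victim);   /* 0 = success, non-zero = failure */
};

/* Stand-in for telling cman that the victim has been fenced, which is what
 * allows DLM/GFS recovery to proceed. */
static void update_cman(const char *victim)
{
	printf("update_cman: %s reported as fenced\n", victim);
}

/* As in the reproduction config: f1 always succeeds, f2 always fails. */
static int agent_f1(const char *victim) { (void)victim; return 0; }
static int agent_f2(const char *victim) { (void)victim; return 1; }

/* Buggy shape: cman is updated as soon as the first device succeeds, so by the
 * time f2 fails, recovery has already been unblocked even though the function
 * still returns an error. */
static int fence_level_buggy(const char *victim, struct fence_device *dev, int ndev)
{
	int error = 0, i;

	for (i = 0; i < ndev; i++) {
		if (dev[i].agent(victim)) {
			error = -1;
			continue;
		}
		update_cman(victim);   /* too early: later devices are unchecked */
	}
	return error;
}

/* Fixed shape: every device in the level must succeed before cman is told the
 * victim is fenced; otherwise the whole level fails and fencing is retried. */
static int fence_level_fixed(const char *victim, struct fence_device *dev, int ndev)
{
	int i;

	for (i = 0; i < ndev; i++)
		if (dev[i].agent(victim))
			return -1;     /* do not update cman on a partial success */
	update_cman(victim);
	return 0;
}

int main(void)
{
	struct fence_device level[] = { { "f1", agent_f1 }, { "f2", agent_f2 } };

	printf("buggy returned %d\n", fence_level_buggy("hat", level, 2));
	printf("fixed returned %d\n", fence_level_fixed("hat", level, 2));
	return 0;
}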



Comment 4 Lon Hohberger 2008-07-23 15:31:56 UTC
Created attachment 312490 [details]
Pass 1 fixing order bug. Untested.

Comment 5 RHEL Product and Program Management 2008-07-23 18:21:04 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 David Teigland 2008-07-23 18:24:11 UTC
commit in RHEL5 branch 9567fe17bf33eb0008831551b76c7f46c55ba40b

I've not tested the fix yet since I don't have either a RHEL5 or STABLE2
cluster readily available.  If no one else can do a quick test to verify
the patch, I'll get a cluster set up.


Comment 7 David Juran 2008-07-24 07:53:35 UTC
I've tested Lon's patch from #4 on my F-9 cluster (cman-2.03.05-1) and it seems
to solve the issue.

Also, I can confirm that the fencing agents are executed in the order they are
listed in cluster.conf, which is good, since doing power-on followed by
power-off is not quite the same as off followed by on (-:
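
That ordering matters because the usual way to fence a node with redundant power
supplies is to list the off action for both power ports before either on action,
so that both supplies are down at the same time. A hedged fragment of that
pattern is sketched below; the device names apc1/apc2 and the port numbers are
placeholders, and depending on the agent and cluster version the attribute may
be option or action.

<clusternode name="red" nodeid="1">
  <fence>
    <method name="1">
      <!-- turn both power ports off before turning either back on -->
      <device name="apc1" port="1" option="off"/>
      <device name="apc2" port="1" option="off"/>
      <device name="apc1" port="1" option="on"/>
      <device name="apc2" port="1" option="on"/>
    </method>
  </fence>
</clusternode>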

Comment 10 errata-xmlrpc 2009-01-20 21:53:12 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0189.html

