Bug 159628 - sm join/leave bug
Summary: sm join/leave bug
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cman
Version: 4
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: David Teigland
QA Contact: Cluster QE
Depends On:
Reported: 2005-06-06 09:26 UTC by David Teigland
Modified: 2009-04-16 20:30 UTC

Fixed In Version: RHBA-2005-734
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2005-10-07 16:46:34 UTC

System | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2005:734 | normal | SHIPPED_LIVE | cman-kernel bug fix update | 2005-10-07 04:00:00 UTC

Description David Teigland 2005-06-06 09:26:54 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050302 Firefox/1.0.1 Fedora/1.0.1-1.3.2

Description of problem:
There's a bug in SM's processing of joins/leaves.
A node sends out a request to join or leave a
group, and the replies it gets appear to be
treated as invalid and discarded, for a reason
not yet known.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Run a looping clvmd stop/start script. We've often run it for days before
the problem appears.

Additional info:

Comment 1 David Teigland 2005-06-08 15:52:32 UTC
This is a bug with SM's internal event ids.  The event id is
a uint32, but it is cast to a uint16 when it's passed between
machines in an SM message.  Once the id exceeds 65535 (the
largest value a uint16 can hold), the event id in the messages
no longer matches the id used within SM, so replies are
rejected and things stall.
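
The mismatch can be illustrated with a minimal C sketch. It is not the actual SM code: the struct sm_msg_narrow and the functions pack_narrow() and reply_matches() are hypothetical names made up for this illustration, and only the uint32-to-uint16 cast comes from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wire format: the event id field is only 16 bits wide. */
struct sm_msg_narrow {
    uint16_t event_id;
};

/* Sender side: the cast silently drops the upper 16 bits of the local id. */
struct sm_msg_narrow pack_narrow(uint32_t local_event_id)
{
    struct sm_msg_narrow msg = { .event_id = (uint16_t)local_event_id };
    return msg;
}

/* Receiver side: a reply counts as valid only if its id equals the
 * full 32-bit id tracked locally. */
bool reply_matches(uint32_t local_event_id, struct sm_msg_narrow reply)
{
    return reply.event_id == local_event_id;
}
```

With these definitions, reply_matches(65535, pack_narrow(65535)) is true, but reply_matches(65536, pack_narrow(65536)) is false, because the truncated id carried in the message is 0.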

When SM stalls, all the nodes need to be reset.

This bug takes a while to see because your cluster needs to
go past 65535 SM events.  Everyone will eventually hit
it, though, given enough events in a long running cluster.
(Starting/stopping clvmd, mounting/unmounting gfs, and
joining/leaving the fence domain are all SM events.)

There is a simple fix for this that changes the event ids
in messages to uint32s.  I'll check this in after some
basic tests.
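
As a rough sketch of what such a change amounts to (again with hypothetical names, sm_msg_wide, pack_wide() and reply_matches_wide(), and not the patch that was actually checked in), widening the message field to 32 bits lets the id survive the round trip unchanged:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wire format after the fix: the field is as wide as the
 * id used internally, so no cast is needed. */
struct sm_msg_wide {
    uint32_t event_id;
};

/* Sender side: the full 32-bit id is copied into the message. */
struct sm_msg_wide pack_wide(uint32_t local_event_id)
{
    struct sm_msg_wide msg = { .event_id = local_event_id };
    return msg;
}

/* Receiver side: same comparison as before, now on equal-width values. */
bool reply_matches_wide(uint32_t local_event_id, struct sm_msg_wide reply)
{
    return reply.event_id == local_event_id;
}
```

Here reply_matches_wide(65536, pack_wide(65536)) is true. This sketch ignores byte order and message layout compatibility between nodes, which a real wire-format change would also have to consider.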

Comment 2 Red Hat Bugzilla 2005-10-07 16:46:34 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
