Note: This is a beta release of Red Hat Bugzilla 5.0. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Also email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback here.
Bug 592999 - clustered persistant broker leaves cluster with 'critical Unexpected error: Cluster-ID mismatch. Stores belong to different clusters.' when cluster nodes started from clean store
Summary: clustered persistant broker leaves cluster with 'critical Unexpected error: C...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: Development
Hardware: All
OS: Linux
high
high
Target Milestone: 1.3
: ---
Assignee: Alan Conway
QA Contact: Frantisek Reznicek
URL:
Whiteboard:
Depends On:
Blocks: 494393
TreeView+ depends on / blocked
 
Reported: 2010-05-17 15:10 UTC by Frantisek Reznicek
Modified: 2015-11-16 01:12 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-10-20 11:28:42 UTC
Target Upstream Version:


Attachments (Terms of Use)

Description Frantisek Reznicek 2010-05-17 15:10:23 UTC
Description of problem:

There are multiple correlated observations which prove that
starting multiple brokers at once leads to issue:
  2010-05-17 10:35:49 debug cluster(10.34.45.2:6880 INIT) initial status map complete.
  2010-05-17 10:35:49 debug Exception constructed: Cluster-ID mismatch. Stores belong to different clusters.
  2010-05-17 10:35:49 critical Unexpected error: Cluster-ID mismatch. Stores belong to different clusters.

even when all brokers are initialy started on clean store (on non existent data-dir).

This behavior is most probably bug or at least ugliness which needs to be documented.


I triggered this while trying to probe bug 494393 is fixed, so this bug/behavior is blocking bug 494393 verification.



Version-Release number of selected component (if applicable):
openais-0.80.6-16.el5_5.1
openais-debuginfo-0.80.6-16.el5_5.1
openais-devel-0.80.6-16.el5_5.1
python-qmf-0.7.934605-1.el5
python-qpid-0.7.934605-1.el5
qmf-0.7.935473-1.el5
qmf-devel-0.7.935473-1.el5
qpid-cpp-client-0.7.935473-1.el5
qpid-cpp-client-devel-0.7.935473-1.el5
qpid-cpp-client-devel-docs-0.7.935473-1.el5
qpid-cpp-client-ssl-0.7.935473-1.el5
qpid-cpp-mrg-debuginfo-0.7.935473-1.el5
qpid-cpp-server-0.7.935473-1.el5
qpid-cpp-server-cluster-0.7.935473-1.el5
qpid-cpp-server-devel-0.7.935473-1.el5
qpid-cpp-server-ssl-0.7.935473-1.el5
qpid-cpp-server-store-0.7.935473-1.el5
qpid-cpp-server-xml-0.7.935473-1.el5
qpid-java-client-0.7.934605-1.el5
qpid-java-common-0.7.934605-1.el5
qpid-tests-0.7.934605-1.el5
qpid-tools-0.7.934605-2.el5
ruby-qpid-0.7.935473-1.el5


How reproducible:
100%

Steps to Reproduce:
0. install qpidd & openais
1. configure openais
2. run  ./looper.sh
3. wait for error
  
Actual results:
Cluster of persistant brokers does not always start up in expected cluster width.

Expected results:
Cluster of persistant brokers should always start up in expected cluster width.

Additional info:

RHEL 5.5 i386 case:
~~~~~~~~~~~~~~~~~~
[root@mrg-qe-01 bz494393]# ./looper.sh
8:0;4:0;8:0;8:2;
Last log:
aisexec (pid 9750) is running...
client[s] ready
starting brokers in the cluster:........done
A:ERROR:broker[s] not running! [exp. width:8] (ports:53526 39256 50745 ,#:3, pids:12968 12974 12964 ,#:3)
tcp        0      0 0.0.0.0:53526               0.0.0.0:*                   LISTEN      12968/qpidd
tcp        0      0 0.0.0.0:39256               0.0.0.0:*                   LISTEN      12974/qpidd
tcp        0      0 0.0.0.0:50745               0.0.0.0:*                   LISTEN      12964/qpidd
root     12964  0.5  0.1 143464 11232 pts/2    Sl+  10:44   0:00 qpidd -p 0 --auth no --cluster-name mrg-qe-01.lab.eng.brq.redhat.com_cluster_bz494393 --log-enable trace+ --data-dir data_0_4
root     12968  0.8  0.1 165944 12060 pts/2    Sl+  10:44   0:00 qpidd -p 0 --auth no --cluster-name mrg-qe-01.lab.eng.brq.redhat.com_cluster_bz494393 --log-enable trace+ --data-dir data_0_5
root     12974  0.6  0.1 142716 11708 pts/2    Sl+  10:44   0:00 qpidd -p 0 --auth no --cluster-name mrg-qe-01.lab.eng.brq.redhat.com_cluster_bz494393 --log-enable trace+ --data-dir data_0_7
test-core:1883.42       1510.96 2896.68 2.82879
  Cluster Name: mrg-qe-01.lab.eng.brq.redhat.com_cluster_bz494393
Cluster Status: ACTIVE
  Cluster Size: 3
       Members: ID=10.34.45.1:12964 URL=amqp:tcp:10.34.33.54:50745,tcp:10.34.45.1:50745
              : ID=10.34.45.1:12968 URL=amqp:tcp:10.34.33.54:53526,tcp:10.34.45.1:53526
              : ID=10.34.45.1:12974 URL=amqp:tcp:10.34.33.54:39256,tcp:10.34.45.1:39256
.ecodes:00
B:ERROR:broker[s] not running! [exp. width:8] (ports:53526 39256 50745 ,#:3, pids:12968 12974 12964 ,#:3)
stopping brokers...
Test Summary:
TEST FAILED !!! (2 failures)
err_cnt:1; dur:36 sec[s]

RHEL 5.5 x86_64 case:
~~~~~~~~~~~~~~~~~~~~
[root@mrg-qe-02 bz494393]# ./looper.sh
9:0;8:0;5:0;7:0;8:0;9:0;4:0;4:0;9:0;7:0;7:0;8:0;7:0;9:0;6:0;5:0;8:0;6:0;9:0;9:0;4:0;8:0;8:0;8:0;9:2;
Last log:
aisexec (pid 871) is running...
client[s] ready
starting brokers in the cluster:.........done
A:ERROR:broker[s] not running! (ports:34248 ,#:1, pids:6868 ,#:1)
tcp        0      0 0.0.0.0:34248               0.0.0.0:*                   LISTEN      6868/qpidd
root      6868  0.5  0.1 353908 12424 pts/2    Sl+  10:35   0:00 qpidd -p 0 --auth no --cluster-name mrg-qe-02.lab.eng.brq.redhat.com_cluster_bz494393 --log-enable trace+ --data-dir data_0_4
test-core:4241.08       2435.35 4978.36 4.86168
B:ERROR:broker[s] not running! (ports:34248 ,#:1, pids:6868 ,#:1)
stopping brokers...
Test Summary:
TEST FAILED !!! (2 failures)
err_cnt:1; dur:161 sec[s]

Comment 2 Frantisek Reznicek 2010-05-17 15:28:24 UTC
The issue is seen on both RHEL 5.5 i386 / x86_64. The test runs multiple
brokers on one machine with separate logging and data-dirs.

To be more clear on reproducibility, I was able to trigger the issue while
running attached reproducer within few iterations (5-25 loops).

Comment 3 Alan Conway 2010-05-25 18:16:16 UTC
Fixed on trunk r948143 and on 1.3 release branch:
http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=87a548b4b5e525bd887fd486469e63ed9bd79c03

Comment 4 Frantisek Reznicek 2010-06-02 14:12:54 UTC
The issue has been fixed, verified in long test run (few hundreds of cluster restarts, qpidd min/max logging, various cluster widths) on RHEL 5.5 i386 / x86_64 using packages:
openais-0.80.6-16.el5_5.1
openais-debuginfo-0.80.6-16.el5_5.1
openais-devel-0.80.6-16.el5_5.1
python-qpid-0.7.946106-1.el5
qpid-cpp-client-0.7.946106-2.el5
qpid-cpp-client-devel-0.7.946106-2.el5
qpid-cpp-client-devel-docs-0.7.946106-2.el5
qpid-cpp-client-ssl-0.7.946106-2.el5
qpid-cpp-mrg-debuginfo-0.7.946106-2.el5
qpid-cpp-server-0.7.946106-2.el5
qpid-cpp-server-cluster-0.7.946106-2.el5
qpid-cpp-server-devel-0.7.946106-2.el5
qpid-cpp-server-ssl-0.7.946106-2.el5
qpid-cpp-server-store-0.7.946106-2.el5
qpid-cpp-server-xml-0.7.946106-2.el5
qpid-java-client-0.7.946106-3.el5
qpid-java-common-0.7.946106-3.el5
qpid-tests-0.7.946106-1.el5
qpid-tools-0.7.946106-4.el5
ruby-qpid-0.7.946106-1.el5

-> VERIFIED


Note You need to log in before you can comment on or make changes to this bug.