Bug 1063700 - [linearstore] broker restart fails under stress test
Summary: [linearstore] broker restart fails under stress test
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 3.0
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: 3.0
Target Release: ---
Assignee: Kim van der Riet
QA Contact: Zdenek Kraus
URL:
Whiteboard:
Depends On:
Blocks: 709325
Reported: 2014-02-11 09:56 UTC by Pavel Moravec
Modified: 2015-01-21 12:55 UTC
CC List: 4 users

Fixed In Version: qpid-cpp-0.22-36
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-01-21 12:55:59 UTC


Attachments
reproducer script (deleted), 2014-02-11 09:58 UTC, Pavel Moravec
/var/lib/qpidd from failing broker start (deleted), 2014-02-11 10:16 UTC, Pavel Moravec
/var/lib/qpidd_bckp before starting the broker and /var/lib/qpidd after the broker startup failure (deleted), 2014-02-12 08:14 UTC, Pavel Moravec


Links
Apache JIRA QPID-5603 (Priority: None, Status: None, Summary: None, Last Updated: Never)

Description Pavel Moravec 2014-02-11 09:56:38 UTC
Description of problem:
When running a stress test in which many messages are purged or consumed from a queue at the same time (to try to return many EPL together) while new messages keep arriving, a broker restart sometimes fails with JERR_JREC_BADRECTAIL.


Version-Release number of selected component (if applicable):
qpid-cpp-server-linearstore-0.22-35.el6.x86_64


How reproducible:
100% within 15 minutes, although the scenario itself is non-deterministic.


Steps to Reproduce:
Run the attached script.


Actual results:
The script terminates after several broker restart attempts (usually within 5, sometimes many more) with the following error log:

2014-02-11 09:19:45 [Broker] critical Unexpected error: Daemon startup failed: Queue testQueue: recoverMessages() failed: jexception 0x0701 RecoveryManager::readNextRemainingRecord() threw JERR_JREC_BADRECTAIL: Invalid record tail. (Bad record tail:
  Magic: expected 0x9aacb3ae; found 0xae
  Serial: expected 0xdfd9ef00f95ec0e8; found 0x3bd597e6c9
  Record Id: expected 0x200422; found 0x0
  Checksum: expected 0xfa45392f; found 0x10) (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1004)
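
The JERR_JREC_BADRECTAIL message lists four record-tail fields (Magic, Serial, Record Id, Checksum), each compared against the value expected for the record being recovered. A minimal sketch of that comparison follows, assuming a hypothetical 24-byte little-endian tail layout (inverted magic, checksum, serial, record id). The names `pack_tail` and `check_tail` are illustrative only and are not part of linearstore. The expected magic 0x9aacb3ae in the log is the bitwise inverse of 0x65534c51, so the sketch assumes the tail stores the inverted header magic.

```python
import struct
from typing import Dict, Tuple

# Hypothetical tail layout (an assumption, not taken from the linearstore sources):
# xmagic (uint32), checksum (uint32), serial (uint64), rid (uint64), little-endian.
TAIL_FORMAT = "<IIQQ"
TAIL_SIZE = struct.calcsize(TAIL_FORMAT)  # 24 bytes, no padding with "<"


def pack_tail(header_magic: int, checksum: int, serial: int, rid: int) -> bytes:
    """Build a tail whose magic is the bitwise inverse of the header magic."""
    return struct.pack(TAIL_FORMAT, ~header_magic & 0xFFFFFFFF, checksum, serial, rid)


def check_tail(tail: bytes, header_magic: int, checksum: int,
               serial: int, rid: int) -> Dict[str, Tuple[int, int]]:
    """Return {field: (expected, found)} for every field that does not match."""
    xmagic, found_cksum, found_serial, found_rid = struct.unpack(TAIL_FORMAT, tail)
    expected = {
        "Magic": ~header_magic & 0xFFFFFFFF,
        "Serial": serial,
        "Record Id": rid,
        "Checksum": checksum,
    }
    found = {
        "Magic": xmagic,
        "Serial": found_serial,
        "Record Id": found_rid,
        "Checksum": found_cksum,
    }
    return {k: (expected[k], found[k]) for k in expected if expected[k] != found[k]}
```

Passing bytes that are not the record's real tail (for example 24 zero bytes) reports all four fields as mismatches, the same pattern as in the log above, where every field differs.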


Expected results:
The script never terminates.


Additional info:
Also attaching a /var/lib/qpidd directory from which the broker can't restart.

Comment 1 Pavel Moravec 2014-02-11 09:58:43 UTC
Created attachment 861729 [details]
reproducer script

Comment 2 Pavel Moravec 2014-02-11 10:16:34 UTC
Created attachment 861741 [details]
/var/lib/qpidd from failing broker start

Broker logs:

2014-02-11 10:13:46 [Broker] critical Unexpected error: Queue testQueue: recoverMessages() failed: jexception 0x0701 RecoveryManager::readNextRemainingRecord() threw JERR_JREC_BADRECTAIL: Invalid record tail. (Bad record tail:
  Serial: expected 0xb0b6a3e94bf87441; found 0x3bd597e6c9
  Record Id: expected 0x1978c; found 0x0
  Checksum: expected 0x42a237f9; found 0x10) (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1004)

Comment 3 Kim van der Riet 2014-02-11 18:12:42 UTC
If you have the store files from the failure, this would be most valuable. Presumably your script preserved the store before the recovery was attempted; replacing the qls store directory with the saved store and restarting the broker should make the error 100% reproducible.

If you have the store, please send it to me or attach it to this bug.
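
The save-and-restore approach described above can be sketched as follows. The paths /var/lib/qpidd and /var/lib/qpidd_bckp come from this bug; the helper names `snapshot_store` and `restore_store` are hypothetical. The sketch assumes the broker is stopped while the directory is copied and leaves starting the broker (e.g. `service qpidd start`) to the caller.

```python
import shutil
from pathlib import Path


def snapshot_store(store_dir: Path, backup_dir: Path) -> None:
    """Copy the whole store directory aside before the broker is started."""
    if backup_dir.exists():
        shutil.rmtree(backup_dir)
    # copytree keeps permission bits and timestamps but NOT file ownership,
    # so files copied as root may need to be chowned back to the broker user.
    shutil.copytree(store_dir, backup_dir)


def restore_store(backup_dir: Path, store_dir: Path) -> None:
    """Replace the store with the saved copy so recovery replays the same files."""
    if store_dir.exists():
        shutil.rmtree(store_dir)
    shutil.copytree(backup_dir, store_dir)
```

Calling `snapshot_store(Path("/var/lib/qpidd"), Path("/var/lib/qpidd_bckp"))` before each start, and `restore_store` with the same paths after a failed start, should let the failing recovery be replayed as often as needed.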

Comment 4 Pavel Moravec 2014-02-12 08:14:24 UTC
Created attachment 862147 [details]
/var/lib/qpidd_bckp before starting the broker and /var/lib/qpidd after the broker startup failure

/var/lib/qpidd_bckp is the directory as backed up before invoking "service qpidd start".

Meanwhile, the broker logged:

2014-02-12 08:06:07 [Broker] critical Unexpected error: Daemon startup failed: Queue testQueue: recoverMessages() failed: jexception 0x0701 RecoveryManager::readNextRemainingRecord() threw JERR_JREC_BADRECTAIL: Invalid record tail. (Bad record tail:
  Magic: expected 0x9aacb3ae; found 0x6
  Serial: expected 0x3e8b1a565b177139; found 0x3bd597e6c9
  Record Id: expected 0x1978b; found 0x0
  Checksum: expected 0x7f893667; found 0x10) (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1004)

Comment 5 Kim van der Riet 2014-03-05 13:56:45 UTC
Upstream bug at https://issues.apache.org/jira/browse/QPID-5603

Comment 6 Kim van der Riet 2014-03-05 14:00:15 UTC
The attached store journal contains a record in which the record tail is exactly divided by a file boundary from the rest of the record. This meets the conditions of the upstream bug in comment #5.
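
To illustrate the boundary condition, here is a minimal sketch of a reader that treats a record as a logical byte stream spanning journal files, continuing in the next file's data area (after its file header) whenever the current file is exhausted, including the case where exactly zero bytes of the current file remain after the record body. `FILE_HEADER_SIZE` and `read_span` are hypothetical; this does not reproduce the actual defect or the fix in r1574513 (see QPID-5603).

```python
from typing import List, Tuple

FILE_HEADER_SIZE = 16  # hypothetical; chosen small for illustration only


def read_span(files: List[bytes], file_idx: int, offset: int,
              n: int) -> Tuple[bytes, int, int]:
    """Read n bytes of logical journal data starting at (file_idx, offset).

    When the current file is exhausted (even if the previous read ended
    exactly at its last byte), reading continues right after the next
    file's header. Returns (data, new_file_idx, new_offset).
    """
    out = bytearray()
    while len(out) < n:
        if offset >= len(files[file_idx]):
            # Current file exhausted: skip to the data area of the next file.
            file_idx += 1
            if file_idx >= len(files):
                raise EOFError("record continues past the last journal file")
            offset = FILE_HEADER_SIZE
            continue
        take = min(n - len(out), len(files[file_idx]) - offset)
        out += files[file_idx][offset:offset + take]
        offset += take
    return bytes(out), file_idx, offset
```

Reading the record body first and then the tail from the returned position finds the tail at the start of the next file's data area when the body ends exactly at the end of a file, and handles a tail split mid-way across two files the same way.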

Comment 7 Kim van der Riet 2014-03-05 14:53:19 UTC
Fixed in r.1574513. See upstream bug for further comments.

Comment 8 Zdenek Kraus 2014-03-18 08:21:35 UTC
This issue was tested on RHEL 6.5 i686 and x86_64 with the following packages:

perl-qpid-0.22-11.el6
python-qpid-0.22-12.el6
python-qpid-qmf-0.22-28.el6
qpid-cpp-client-0.22-36.el6
qpid-cpp-client-devel-0.22-36.el6
qpid-cpp-client-devel-docs-0.22-36.el6
qpid-cpp-debuginfo-0.22-36.el6
qpid-cpp-server-0.22-36.el6
qpid-cpp-server-devel-0.22-36.el6
qpid-cpp-server-ha-0.22-36.el6
qpid-cpp-server-linearstore-0.22-36.el6
qpid-cpp-server-xml-0.22-36.el6
qpid-java-client-0.22-6.el6
qpid-java-common-0.22-6.el6
qpid-java-example-0.22-6.el6
qpid-jca-0.22-2.el6
qpid-jca-xarecovery-0.22-2.el6
qpid-proton-c-0.6-1.el6
qpid-proton-c-devel-0.6-1.el6
qpid-proton-debuginfo-0.6-1.el6
qpid-qmf-0.22-28.el6
qpid-qmf-debuginfo-0.22-28.el6
qpid-snmpd-1.0.0-16.el6
qpid-snmpd-debuginfo-1.0.0-16.el6
qpid-tools-0.22-9.el6
ruby-qpid-qmf-0.22-28.el6


-> VERIFIED

