Note: This is a beta release of Red Hat Bugzilla 5.0. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Also email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback here.
Bug 1098118 - [linearstore] broker segfaults when recovering journal file with damaged header [NEEDINFO]
Summary: [linearstore] broker segfaults when recovering journal file with damaged header
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 3.0
Hardware: All
OS: Linux
medium
medium
Target Milestone: 3.0
: ---
Assignee: Pavel Moravec
QA Contact: Zdenek Kraus
URL:
Whiteboard:
Depends On:
Blocks: 709325
TreeView+ depends on / blocked
 
Reported: 2014-05-15 10:47 UTC by Pavel Moravec
Modified: 2015-01-21 12:56 UTC (History)
5 users (show)

Fixed In Version: qpid-cpp-0.22-42
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-01-21 12:56:17 UTC
Target Upstream Version:
pmoravec: needinfo? (iboverma)


Attachments (Terms of Use)
Patch proposal (deleted)
2014-05-15 13:59 UTC, Pavel Moravec
no flags Details | Diff
Proposed alternative patch (deleted)
2014-05-19 21:05 UTC, Kim van der Riet
no flags Details | Diff
Proposed alternative patch #2 (deleted)
2014-05-20 16:37 UTC, Kim van der Riet
no flags Details | Diff


Links
System ID Priority Status Summary Last Updated
Apache JIRA QPID-5767 None None None Never

Description Pavel Moravec 2014-05-15 10:47:22 UTC
Description of problem:
If journal file has corrupted header (example below), especially if EFP identity is invalid, broker segfaults.

The reason is:

EmptyFilePool* EmptyFilePoolManager::getEmptyFilePool(const efpPartitionNumber_t partitionNumber,
                                                      const efpDataSize_kib_t efpDataSize_kib) {
    EmptyFilePoolPartition* efppp = getEfpPartition(partitionNumber);
    if (efppp != 0)
        return efppp->getEmptyFilePool(efpDataSize_kib);
    return 0;
}

returns null pointer that is taken in:

void RecoveryManager::analyzeJournals(const std::vector<std::string>* preparedTransactionListPtr,
                                      EmptyFilePoolManager* emptyFilePoolManager,
                                      EmptyFilePool** emptyFilePoolPtrPtr) {
    // Analyze file headers of existing journal files
    efpIdentity_t efpIdentity;
    analyzeJournalFileHeaders(efpIdentity);
    *emptyFilePoolPtrPtr = emptyFilePoolManager->getEmptyFilePool(efpIdentity);
    efpFileSize_kib_ = (*emptyFilePoolPtrPtr)->fileSize_kib();


One real world example to get damaged journal (header): disk gets out of free space just when generating a new journal file, such that the file is created as truncated and contains just zeroes.


Version-Release number of selected component (if applicable):
qpid-cpp 0.22-38


How reproducible:
100%


Steps to Reproduce:
0. Start broker with empty provisioning

1. Create a durable queue:
qpid-config add queue q --durable

2. Create an empty "damaged" journal file and set its permissions:
dd if=/dev/zero of=/var/lib/qpidd/qls/jrnl/q/$(uuidgen).jrnl bs=1024 count=2052
chown qpidd:qpidd /var/lib/qpidd/qls/jrnl/q/*

3. Restart broker


Actual results:
Broker segfaults, logging:
2014-05-15 12:42:00.421721427 [Store] warning Linear Store: Journal "q": Journal file /var/lib/qpidd/qls/jrnl/q/d09e7bfb-5efa-4d96-9071-ad1053377d36.jrnl is uninitialized
2014-05-15 12:42:00.616301329 [System] debug Exception constructed: Cannot read from child process.
2014-05-15 12:42:00.616558206 [Broker] critical Unexpected error: Cannot read from child process.


Expected results:
Broker raises critical exception (stopping its kick-off) with a message like "Journal file .. has unknown EFP identity (pn=0,ds=0), please fix it".


Additional info:

Comment 1 Pavel Moravec 2014-05-15 11:28:10 UTC
(gdb) bt
#0  0x00007ffff77a1af0 in qpid::linearstore::journal::EmptyFilePool::fileSize_kib() const ()
   from /usr/lib64/qpid/daemon/linearstore.so
#1  0x00007ffff77c8e0b in qpid::linearstore::journal::RecoveryManager::analyzeJournals(std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*, qpid::linearstore::journal::EmptyFilePoolManager*, qpid::linearstore::journal::EmptyFilePool**) ()
   from /usr/lib64/qpid/daemon/linearstore.so
#2  0x00007ffff77addc5 in qpid::linearstore::journal::jcntl::recover(qpid::linearstore::journal::EmptyFilePoolManager*, unsigned short, unsigned int, qpid::linearstore::journal::aio_callback*, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*, unsigned long&) ()
   from /usr/lib64/qpid/daemon/linearstore.so
#3  0x00007ffff77db0f1 in qpid::linearstore::JournalImpl::recover(boost::shared_ptr<qpid::linearstore::journal::EmptyFilePoolManager>, unsigned short, unsigned int, qpid::linearstore::journal::aio_callback*, boost::ptr_list<qpid::linearstore::PreparedTransaction, boost::heap_clone_allocator, std::allocator<void*> >*, unsigned long&, unsigned long) () from /usr/lib64/qpid/daemon/linearstore.so
#4  0x00007ffff77efb18 in qpid::linearstore::MessageStoreImpl::recoverQueues(qpid::linearstore::TxnCtxt&, qpid::broker::RecoveryManager&, std::map<unsigned long, boost::shared_ptr<qpid::broker::RecoverableQueue>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, boost::shared_ptr<qpid::broker::RecoverableQueue> > > >&, boost::ptr_list<qpid::linearstore::PreparedTransaction, boost::heap_clone_allocator, std::allocator<void*> >&, std::map<unsigned long, boost::shared_ptr<qpid::broker::RecoverableMessage>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, boost::shared_ptr<qpid::broker::RecoverableMessage> > > >&) ()
   from /usr/lib64/qpid/daemon/linearstore.so
#5  0x00007ffff77f5663 in qpid::linearstore::MessageStoreImpl::recover(qpid::broker::RecoveryManager&) ()
   from /usr/lib64/qpid/daemon/linearstore.so
#6  0x000000339a296522 in qpid::broker::MessageStoreModule::recover(qpid::broker::RecoveryManager&) ()
   from /usr/lib64/libqpidbroker.so.9
#7  0x000000339a1f7df1 in qpid::broker::Broker::Broker(qpid::broker::Broker::Options const&) () from /usr/lib64/libqpidbroker.so.9
#8  0x0000000000407376 in qpid::broker::QpiddBroker::execute(qpid::broker::QpiddOptions*) ()
#9  0x000000000040cba4 in qpid::broker::run_broker(int, char**, bool) ()
#10 0x00000032c441ed1d in __libc_start_main () from /lib64/libc.so.6
#11 0x0000000000406bf9 in _start ()
(gdb)

Comment 2 Pavel Moravec 2014-05-15 11:53:18 UTC
IMPORTANT FOR REPRODUCER:

The EFP identity is retrieved from the latest file returned by readdir (or latest file in "ls -U" output). As there is _no_ guarantee what file will it be, to have 100% reliable reproducer, simply:

1) Create durable queue
2) Stop broker
3) Replace the journal file by some with zeroes inside

Then "the latest file" will be everytime the only one with 0s in.

Comment 3 Pavel Moravec 2014-05-15 13:15:38 UTC
(In reply to Pavel Moravec from comment #2)
> IMPORTANT FOR REPRODUCER:
> 
> The EFP identity is retrieved from the latest file returned by readdir (or
> latest file in "ls -U" output). As there is _no_ guarantee what file will it
> be, to have 100% reliable reproducer, simply:
> 
> 1) Create durable queue
> 2) Stop broker
> 3) Replace the journal file by some with zeroes inside
> 
> Then "the latest file" will be everytime the only one with 0s in.

OK, still not yet enough, as to successfully run the broker (after my patch will be ready), there has to be one valid journal file. So really reliable reproducer is:

0. Start broker with empty provisioning

1. Create a durable queue and _optionally_ (it is worth testing once with qpid-send and once without, both should work) send there some message:
qpid-config add queue q --durable
qpid-send -a q -m 1 --durable=yes

2. Stop the broker

3a. Remember the filename in journal for "q";
3. Create two empty "damaged" journal files and set their permissions:
for i in 1 2; do
  dd if=/dev/zero of=/var/lib/qpidd/qls/jrnl/q/$(uuidgen).jrnl bs=1024 count=2052
done
chown qpidd:qpidd /var/lib/qpidd/qls/jrnl/q/*

4. Ensure the latest journal file is one of the "damaged". Use "ls -U1" output. If the latest file is the one from step 3a, and the output is e.g.:

first.jrnl
second.jrnl
third.jrnl

(where third is the file from step 3a while first and second are the damaged) then do:
/bin/cp third.jrnl second.jrnl
/bin/cp first.jrnl third.jrnl

Now, ordering should be the same ("third.jrnl" is the latest) but the latest file is some damaged one.

5. Start broker

Comment 4 Pavel Moravec 2014-05-15 13:59:01 UTC
Created attachment 895945 [details]
Patch proposal

To be posted to reviewboard.apache.org as well, before committing to upstream. As some tests might be over-save, I guess.

Comment 5 Kim van der Riet 2014-05-19 21:04:42 UTC
There are two issues here:

1. The logic in RecoveryManager::analyzeJournals() detects uninitialized journal files, but if the file happens to be the last one read, it incorrectly allows the file header from this file to be used to set the EFP identity, which is usually partition=0, size=0. This is compounded by the fact that no check is made on the EFP pointer which is returned as a null if the identity is invalid. In these cases, the prior file header (if it exists) should be used and the invalid file is sent to the EFP to be re-used if the size is correct, deleted otherwise.

2. If the EFP identity should be invalid but not null (ie the file header is valid but indicates a non-existent partition or size, then the broker also cores. In this case, the EFP pointer is also null, which is not checked.

The proposed fix addresses both of these cases. I have another patch proposal which is simpler and a little less intrusive. Case 1 is handled correctly with the bad file being sent to the EFP; case 2 results in a JERR_RCVM_INVALIDEFPID exception which is appropriate for now. Once multiple partitions and mixed EFP capabilities are introduced, this will likely change. There are no current scenarios in which an unknown partition/size combination should be encountered or handled.

Comment 6 Kim van der Riet 2014-05-19 21:05:55 UTC
Created attachment 897321 [details]
Proposed alternative patch

Comment 7 Pavel Moravec 2014-05-20 10:30:01 UTC
(In reply to Kim van der Riet from comment #6)
> Created attachment 897321 [details]
> Proposed alternative patch

I would be fine with the patch but noticed another segfault very similar to the original. When the journal contains only zeroed journal file, qpidd coredumps.

1) start broker
2) qpid-config add queue q --durable
3) stop broker
4) dd if=/dev/zero of=/var/lib/qpidd/qls/jrnl/q/* bs=1024 count=2052
5) start broker

Below is backtrace of segfault:

#0  0x00007ffff6ced81b in qpid::linearstore::journal::EmptyFilePool::getIdentity() const () from src/linearstore.so
#1  0x00007ffff6d0beb7 in qpid::linearstore::journal::LinearFileController::restoreEmptyFile(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from src/linearstore.so
#2  0x00007ffff6d12cfb in qpid::linearstore::journal::RecoveryManager::setLinearFileControllerJournals(void (qpid::linearstore::journal::LinearFileController::*)(qpid::linearstore::journal::JournalFile*, unsigned int, bool), qpid::linearstore::journal::LinearFileController*) () from src/linearstore.so
#3  0x00007ffff6cfe05f in qpid::linearstore::journal::jcntl::recover(qpid::linearstore::journal::EmptyFilePoolManager*, unsigned short, unsigned int, qpid::linearstore::journal::aio_callback*, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*, unsigned long&) ()
   from src/linearstore.so
#4  0x00007ffff6d33a26 in qpid::linearstore::JournalImpl::recover(boost::shared_ptr<qpid::linearstore::journal::EmptyFilePoolManager>, unsigned short, unsigned int, qpid::linearstore::journal::aio_callback*, boost::ptr_list<qpid::linearstore::PreparedTransaction, boost::heap_clone_allocator, std::allocator<void*> >*, unsigned long&, unsigned long) () from src/linearstore.so
#5  0x00007ffff6d57fbd in qpid::linearstore::JournalImpl::recover(boost::shared_ptr<qpid::linearstore::journal::EmptyFilePoolManager>, unsigned short, unsigned int, boost::ptr_list<qpid::linearstore::PreparedTransaction, boost::heap_clone_allocator, std::allocator<void*> >*, unsigned long&, unsigned long) () from src/linearstore.so
..

Again, a malformed journal leads to segfault due to EFP identity.

/me not having time today to propose a fix for this.

Comment 8 Kim van der Riet 2014-05-20 16:37:40 UTC
Created attachment 897664 [details]
Proposed alternative patch #2

Good find (again), I have proposed another patch which answers this issue also.

Comment 9 Pavel Moravec 2014-05-21 08:27:14 UTC
(In reply to Kim van der Riet from comment #8)
> Created attachment 897664 [details]
> Proposed alternative patch #2
> 
> Good find (again), I have proposed another patch which answers this issue
> also.

Patch tested, I am fine with that. Great fast feedback.

Comment 10 Pavel Moravec 2014-05-21 08:36:08 UTC
Committed to upstream as https://svn.apache.org/r1596509

Comment 11 Kim van der Riet 2014-05-21 12:47:55 UTC
Upstream bug at https://issues.apache.org/jira/browse/QPID-5767

Comment 14 Zdenek Kraus 2014-05-30 08:49:23 UTC
Previous reproducing scenarios was fixed, but I go further to damaging pages and in 0.22-41 there is still some issue:

damaging reproducer:
1. start
2. send one durable msg
3. stop
4. damage first record (with random data) 
for i in %s/*; do dd if=/dev/urandom of=${i} bs=4 count=8 seek=12 conv=block,notrunc; done
5. start -> crash

# qpidd backtrace
Thread 2 (Thread 0xb7701b70 (LWP 27623)):
#0  0x00469416 in __kernel_vsyscall ()
#1  0x00c64794 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x034037bf in wait (this=0x908b3c0) at /usr/src/debug/qpid-0.22/cpp/src/qpid/sys/posix/Condition.h:69
#3  wait (this=0x908b3c0) at /usr/src/debug/qpid-0.22/cpp/src/qpid/sys/Monitor.h:45
#4  qpid::sys::Timer::run (this=0x908b3c0) at /usr/src/debug/qpid-0.22/cpp/src/qpid/sys/Timer.cpp:186
#5  0x03392512 in qpid::sys::(anonymous namespace)::runRunnable (p=0x908b3c0) at /usr/src/debug/qpid-0.22/cpp/src/qpid/sys/posix/Thread.cpp:35
#6  0x00c60b39 in start_thread () from /lib/libpthread.so.0
#7  0x00b9cd8e in clone () from /lib/libc.so.6

Thread 1 (Thread 0xb7702960 (LWP 27622)):
#0  0x00b35541 in __memcpy_ia32 () from /lib/libc.so.6
#1  0x003ec6ad in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace_safe(unsigned int, unsigned int, char const*, unsigned int) ()
   from /usr/lib/libstdc++.so.6
#2  0x003ec73f in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign(char const*, unsigned int) () from /usr/lib/libstdc++.so.6
#3  0x01018a12 in qpid::linearstore::journal::RecoveryManager::readJournalFileHeader (journalFileName=
    "/var/lib/qpidd//qls/jrnl/test_restore_damaged_pages_first_record/b260913b-c8e4-4db0-9ef1-0a52ffb487d9.jrnl", fileHeaderRef=..., queueName=
    "\276Q\365\346J\277estore_damaged_pages_first_record", '\000' <repeats 3983 times>, "l\361C\000\000\020\000\000\350\361C", '\000' <repeats 25 times>, "|>D", '\000' <repeats 29 times>, "\001\373=", '\000' <repeats 34 times>, " ", '\000' <repeats 18 times>"\200, \063D", '\000' <repeats 17 times>"\200, \361C\000\006\000\000\000\000\000\000\000\002\020", '\000' <repeats 86 times>, "\b\000\000\000\370\205\265\277|>D\000\000\000\000\000\000\000@\000L\205\265\277`1D\000\034\064D\000\024\064D\000\066'\237\"\r\002\000\000\370\003\000\000\033\205\001\001\364\177\t\001\270N\t\t\210\n\t\tX\210\265\277$\020\002\001P\021\t\t\302\207\265\277\064\210\265\277\001\000\000\000\000\000\000\000"...) at /usr/src/debug/qpid-0.22/cpp/src/qpid/linearstore/journal/RecoveryManager.cpp:931
#4  0x01021024 in qpid::linearstore::journal::RecoveryManager::analyzeJournalFileHeaders (this=0x9095170, efpIdentity=...)
    at /usr/src/debug/qpid-0.22/cpp/src/qpid/linearstore/journal/RecoveryManager.cpp:402
#5  0x010218c4 in qpid::linearstore::journal::RecoveryManager::analyzeJournals (this=0x9095170, preparedTransactionListPtr=0xbfb58d80, emptyFilePoolManager=0x9090a88, 
    emptyFilePoolPtrPtr=0x9094eb8) at /usr/src/debug/qpid-0.22/cpp/src/qpid/linearstore/journal/RecoveryManager.cpp:101
#6  0x00fff457 in qpid::linearstore::journal::jcntl::recover (this=0x9094dc4, efpmp=0x9090a88, wcache_num_pages=32, wcache_pgsize_sblks=8, cbp=0x9095340, 
    prep_txn_list_ptr=0xbfb58d80, highest_rid=@0xbfb59130) at /usr/src/debug/qpid-0.22/cpp/src/qpid/linearstore/journal/jcntl.cpp:118
#7  0x01036e8e in qpid::linearstore::JournalImpl::recover (this=0x9094dc0, efpm=..., wcache_num_pages=32, wcache_pgsize_sblks=8, cbp=0x9095340, 
    prep_tx_list_ptr=0xbfb595dc, highest_rid=@0xbfb59130, queue_id=2) at /usr/src/debug/qpid-0.22/cpp/src/qpid/linearstore/JournalImpl.cpp:204
#8  0x0104dc08 in recover (this=0x908de58, txn=..., registry=..., queue_index=std::map with 0 elements, prepared=..., messages=std::map with 0 elements)
    at /usr/src/debug/qpid-0.22/cpp/src/qpid/linearstore/JournalImpl.h:136
#9  qpid::linearstore::MessageStoreImpl::recoverQueues (this=0x908de58, txn=..., registry=..., queue_index=std::map with 0 elements, prepared=..., messages=
    std::map with 0 elements) at /usr/src/debug/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:739
#10 0x0105457a in qpid::linearstore::MessageStoreImpl::recover (this=0x908de58, registry_=...)
    at /usr/src/debug/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:608
#11 0x0375fba8 in qpid::broker::MessageStoreModule::recover (this=0x9092de0, registry=...) at /usr/src/debug/qpid-0.22/cpp/src/qpid/broker/MessageStoreModule.cpp:89
#12 0x036b3d22 in qpid::broker::Broker::Broker (this=0x908af38, conf=...) at /usr/src/debug/qpid-0.22/cpp/src/qpid/broker/Broker.cpp:317
#13 0x0804e07a in qpid::broker::QpiddBroker::execute (this=0xbfb5a34d, options=0x9084ed0) at /usr/src/debug/qpid-0.22/cpp/src/posix/QpiddBroker.cpp:196
#14 0x080539a4 in qpid::broker::run_broker (argc=3, argv=0xbfb5a434, hidden=false) at /usr/src/debug/qpid-0.22/cpp/src/qpidd.cpp:108
#15 0x0804d9c4 in main (argc=3, argv=0xbfb5a434) at /usr/src/debug/qpid-0.22/cpp/src/posix/QpiddBroker.cpp:215


## this could be reproduced on the same damaged pagefile in 60-80% of broker starts, in other starts broker starts successfully with the escaped binary of the damaged page and then "ignoring"

2014-05-30 10:38:58 [Store] warning Linear Store: Journal "test_restore_damaged_pages_first_record": Journal file /var/lib/qpidd//qls/jrnl/test_restore_damaged_pages_first_record/b260913b-c8e4-4db0-9ef1-0a52ffb487d9.jrnl belongs to queue                                                                     "\xBEQ\xF5\xE6J\xBFestore_damaged_pages_first_record\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0 ... <SNIP>...": ignoring
2014-05-30 10:38:58 [Store] info Linear Store: Journal "test_restore_damaged_pages_first_record": 
     Journal recovery analysis (jid="test_restore_damaged_pages_first_record"):
     <Journal empty, no journal files found>

->ASSIGNED

Comment 15 Kim van der Riet 2014-06-02 19:31:45 UTC
Fixed in r.1599243.

Good find, I improved the inspection of the file header after reading to catch this case.

Comment 19 Zdenek Kraus 2014-06-10 12:48:53 UTC
This was tested on RHEL 6.5 x86_64 and i686 with following packages:
perl-qpid-0.22-13.el6
perl-qpid-debuginfo-0.22-13.el6
python-qpid-0.22-15.el6
python-qpid-qmf-0.22-33.el6
qpid-cpp-client-0.22-42.el6
qpid-cpp-client-devel-0.22-42.el6
qpid-cpp-client-devel-docs-0.22-42.el6
qpid-cpp-debuginfo-0.22-42.el6
qpid-cpp-server-0.22-42.el6
qpid-cpp-server-devel-0.22-42.el6
qpid-cpp-server-ha-0.22-42.el6
qpid-cpp-server-linearstore-0.22-42.el6
qpid-cpp-server-xml-0.22-42.el6
qpid-java-client-0.22-6.el6
qpid-java-common-0.22-6.el6
qpid-java-example-0.22-6.el6
qpid-jca-0.22-2.el6
qpid-jca-xarecovery-0.22-2.el6
qpid-proton-c-0.7-2.el6
qpid-proton-c-devel-0.7-2.el6
qpid-qmf-0.22-33.el6
qpid-snmpd-1.0.0-16.el6
qpid-tools-0.22-13.el6
ruby-qpid-qmf-0.22-33.el6

works as expected.

-> VERIFIED


Note You need to log in before you can comment on or make changes to this bug.