Bug 1353972 - Master zone radosgw process segfaults during I/O and sync operations
Summary: Master zone radosgw process segfaults during I/O and sync operations
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RGW
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 2.0
Assignee: Casey Bodley
QA Contact: shilpa
URL:
Whiteboard:
Depends On: 1354156
Blocks:
 
Reported: 2016-07-08 14:50 UTC by shilpa
Modified: 2017-07-31 14:15 UTC
CC: 10 users

Fixed In Version: RHEL: ceph-10.2.2-18.el7cp Ubuntu: ceph_10.2.2-14redhat1xenial
Doc Type:
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-23 19:43:47 UTC
Target Upstream Version:


Attachments: none


Links
System ID Priority Status Summary Last Updated
Ceph Project Bug Tracker 16603 None None None 2016-07-08 14:51:21 UTC
Red Hat Product Errata RHBA-2016:1755 normal SHIPPED_LIVE Red Hat Ceph Storage 2.0 bug fix and enhancement update 2016-08-23 23:23:52 UTC

Description shilpa 2016-07-08 14:50:01 UTC
Description of problem:
Uploaded large amounts of data from both RGW nodes: around 50 buckets holding roughly 100 GB of data in total. During the subsequent sync operations, the radosgw process segfaulted on the master zone.
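As a rough sketch of the kind of workload described above (not taken from the bug report), the following splits ~100 GB across ~50 buckets and uploads it through any S3 client pointed at an RGW endpoint. The endpoint URL, credentials, bucket-name prefix, and 64 MiB object size are all assumptions for illustration:

```python
# Hypothetical reproduction sketch: build a multi-bucket upload plan similar
# to the workload in this bug, then (optionally) push it to an RGW endpoint.
# Endpoint, credentials, and object size are assumptions, not bug data.

def make_workload(num_buckets=50, total_gb=100, object_mb=64):
    """Split ~total_gb of data evenly across num_buckets buckets.

    Returns a list of (bucket_name, object_count) pairs.
    """
    total_mb = total_gb * 1024
    per_bucket_mb = total_mb // num_buckets
    objects_per_bucket = max(1, per_bucket_mb // object_mb)
    return [("sync-test-%03d" % i, objects_per_bucket)
            for i in range(num_buckets)]

if __name__ == "__main__":
    import boto3  # any S3-compatible client works against RGW

    s3 = boto3.client(
        "s3",
        endpoint_url="http://rgw-master:8080",   # hypothetical endpoint
        aws_access_key_id="ACCESS",              # hypothetical credentials
        aws_secret_access_key="SECRET",
    )
    payload = b"\0" * (64 * 1024 * 1024)         # 64 MiB per object
    for bucket, count in make_workload():
        s3.create_bucket(Bucket=bucket)
        for n in range(count):
            s3.put_object(Bucket=bucket, Key="obj-%05d" % n, Body=payload)
```

Running the same script against both zones' endpoints concurrently approximates the bidirectional I/O that triggered the sync activity here.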


Version-Release number of selected component (if applicable):
ceph-radosgw-10.2.2-15.el7cp.x86_64

Actual results:

    0> 2016-07-08 09:38:32.318800 7fc485ffb700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fc485ffb700 thread_name:radosgw

 ceph version 10.2.2-15.el7cp (60cd52496ca02bdde9c2f4191e617f75166d87b6)
 1: (()+0x54e22a) [0x7fc59936d22a]
 2: (()+0xf100) [0x7fc59879e100]
 3: (std::__detail::_List_node_base::_M_transfer(std::__detail::_List_node_base*, std::__detail::_List_node_base*)+0x10) [0x7fc5982f1010]
 4: (RGWOmapAppend::flush_pending()+0x23) [0x7fc5990f0993]
 5: (RGWOmapAppend::append(std::string const&)+0x98) [0x7fc5990f0a48]
 6: (RGWDataSyncSingleEntryCR::operate()+0x82a) [0x7fc5991ad94a]
 7: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7fc5990e8a2e]
 8: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f1) [0x7fc5990ea9d1]
 9: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7fc5990eb590]
 10: (RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&)+0x352) [0x7fc599192912]
 11: (RGWDataSyncProcessorThread::process()+0x49) [0x7fc599255289]
 12: (RGWRadosThread::Worker::entry()+0x133) [0x7fc5991fa083]
 13: (()+0x7dc5) [0x7fc598796dc5]
 14: (clone()+0x6d) [0x7fc597da0ced]



Additional info:

Currently unable to run RGW I/O on either zone.

Comment 2 Casey Bodley 2016-07-08 15:06:46 UTC
upstream fix pending review: https://github.com/ceph/ceph/pull/10157

Comment 6 shilpa 2016-07-09 10:29:39 UTC
While continuing testing after an rgw restart on both nodes, I noticed that the non-master zone segfaults with a different stack trace a few seconds after the master zone segfaults.

2016-07-09 08:06:39.522101 7fc10d7e2700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fc10d7e2700 thread_name:radosgw

 ceph version 10.2.2-15.el7cp (60cd52496ca02bdde9c2f4191e617f75166d87b6)
 1: (()+0x54e22a) [0x7fc192d5d22a]
 2: (()+0xf100) [0x7fc19218e100]
 3: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)+0x1b) [0x7fc191d30f3b]
 4: (rgw_bucket::rgw_bucket(rgw_bucket const&)+0x23) [0x7fc192ae60e3]
 5: (RGWRadosRemoveOmapKeysCR::RGWRadosRemoveOmapKeysCR(RGWRados*, rgw_bucket const&, std::string const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&)+0x128) [0x7fc192ae2968]
 6: (RGWDataSyncSingleEntryCR::operate()+0xa96) [0x7fc192b9dbb6]
 7: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7fc192ad8a2e]
 8: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f1) [0x7fc192ada9d1]
 9: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7fc192adb590]
 10: (RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&)+0x352) [0x7fc192b82912]
 11: (RGWDataSyncProcessorThread::process()+0x49) [0x7fc192c45289]
 12: (RGWRadosThread::Worker::entry()+0x133) [0x7fc192bea083]
 13: (()+0x7dc5) [0x7fc192186dc5]
 14: (clone()+0x6d) [0x7fc191790ced]

Not sure if this is related to the original segfault on master.

Comment 7 Casey Bodley 2016-07-11 13:42:20 UTC
(In reply to shilpa from comment #6)
> Not sure if this is related to the original segfault on master.

The fix will address this segfault as well.

Comment 8 shilpa 2016-07-12 13:37:47 UTC
Running 10.2.2-18, I hit this stack trace again, this time on the non-master node during object upload and sync operations.

2016-07-12 13:07:27.107029 7f3879ffb700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f3879ffb700 thread_name:radosgw

 ceph version 10.2.2-18.el7cp (408019449adec8263014b356737cf326544ea7c6)
 1: (()+0x54e2ba) [0x7f39102ab2ba]
 2: (()+0xf100) [0x7f390f6dc100]
 3: (RGWCoroutinesStack::wakeup()+0xe) [0x7f39100274ce]
 4: (RGWBucketShardIncrementalSyncCR::operate()+0xfed) [0x7f39100d639d]
 5: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x7f3910026a4e]
 6: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3f1) [0x7f39100289f1]
 7: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x7f39100295b0]
 8: (RGWRemoteDataLog::run_sync(int, rgw_data_sync_status&)+0x352) [0x7f39100d0932]
 9: (RGWDataSyncProcessorThread::process()+0x49) [0x7f3910193319]
 10: (RGWRadosThread::Worker::entry()+0x133) [0x7f3910138113]
 11: (()+0x7dc5) [0x7f390f6d4dc5]
 12: (clone()+0x6d) [0x7f390ecdeced]

Comment 18 shilpa 2016-07-26 06:19:32 UTC
I haven't seen this occur since 10.2.2-23. Moving to verified.

Comment 20 errata-xmlrpc 2016-08-23 19:43:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1755.html

