Note: This is a beta release of Red Hat Bugzilla 5.0. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Also email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback here.
Bug 1360122 - Tiering related core observed with "uuid_is_null () message".
Summary: Tiering related core observed with "uuid_is_null () message".
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: tiering
Version: 3.8.1
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ---
Assignee: Nithya Balachandran
QA Contact: bugs@gluster.org
URL:
Whiteboard:
Depends On: 1332542 1358196 1360125
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-07-26 04:54 UTC by Nithya Balachandran
Modified: 2016-08-12 09:47 UTC (History)
11 users (show)

Fixed In Version: glusterfs-3.8.2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1358196
Environment:
Last Closed: 2016-08-12 09:47:50 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments (Terms of Use)

Description Nithya Balachandran 2016-07-26 04:54:52 UTC
+++ This bug was initially created as a clone of Bug #1358196 +++

+++ This bug was initially created as a clone of Bug #1332542 +++

Description of problem:

Tiering related core observed with "uuid_is_null () message".

Version-Release number of selected component (if applicable):

glusterfs-3.7.9-2

How reproducible:

Once

Steps to Reproduce:

Not sure of the exact steps which generated this core but this was observed when tiering related cases were executed.

bt as below:

#0  0x00007fa21e1543fc in uuid_is_null () from /lib64/libuuid.so.1
#1  0x00007fa21070fc09 in ctr_delete_hard_link_from_db.isra.1.constprop.4 ()
   from /usr/lib64/glusterfs/3.7.9/xlator/features/changetimerecorder.so
#2  0x00007fa210716bd1 in ctr_rename_cbk ()
   from /usr/lib64/glusterfs/3.7.9/xlator/features/changetimerecorder.so
#3  0x00007fa21092cccd in trash_common_rename_cbk ()
   from /usr/lib64/glusterfs/3.7.9/xlator/features/trash.so
#4  0x00007fa211169024 in posix_rename ()
   from /usr/lib64/glusterfs/3.7.9/xlator/storage/posix.so
#5  0x00007fa210934c27 in trash_rename ()
   from /usr/lib64/glusterfs/3.7.9/xlator/features/trash.so
#6  0x00007fa21071219d in ctr_rename ()
   from /usr/lib64/glusterfs/3.7.9/xlator/features/changetimerecorder.so
#7  0x00007fa21003449e in changelog_rename ()
   from /usr/lib64/glusterfs/3.7.9/xlator/features/changelog.so
#8  0x00007fa21e9cbb1a in default_rename () from /lib64/libglusterfs.so.0
#9  0x00007fa20b9cd4fa in ?? ()
   from /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so
#10 0x00007fa21ea3ef90 in graphyylex_destroy () from /lib64/libglusterfs.so.0
#11 0x00007fa21c4d7a54 in ?? ()
---Type <return> to continue, or q <return> to quit---
#12 0x00007fa21e9fa009 in __gf_calloc () from /lib64/libglusterfs.so.0
#13 0x00007fa20c011bd0 in ?? ()
#14 0x00007fa21c4d7a54 in ?? ()
#15 0x00007fa21e9cbb1a in default_rename () from /lib64/libglusterfs.so.0
#16 0x00007fa20b5a0dd8 in up_rename ()
   from /usr/lib64/glusterfs/3.7.9/xlator/features/upcall.so
#17 0x00007fa21e9d8542 in default_rename_resume () from /lib64/libglusterfs.so.0
#18 0x00007fa21e9f73cd in call_resume () from /lib64/libglusterfs.so.0
#19 0x00007fa20c054c70 in ?? ()
#20 0x00007fa20c054c98 in ?? ()
#21 0x00007fa1ea4eae70 in ?? ()
#22 0x00007fa20c054c98 in ?? ()
#23 0x00007fa1ea4eae70 in ?? ()
#24 0x00007fa20b393363 in iot_worker ()
   from /usr/lib64/glusterfs/3.7.9/xlator/performance/io-threads.so
#25 0x00007fa21d82fdc5 in start_thread () from /lib64/libpthread.so.0
#26 0x00007fa21d1761cd in clone () from /lib64/libc.so.6

Actual results:

Tiering related core observed with "uuid_is_null () message".

Expected results:

there should not be any core generated.

Additional info:

Dont have the exact steps to reproduce, however filing this bug so that we don't miss this issue.



--- Additional comment from Joseph Elwin Fernandes on 2016-05-09 01:08:01 EDT ---

As we dont have exact steps to reproduce, We dont know the cause of a NULL uuid or GFID. But yes handling the exception properly is important. Will send a patch with the necessary exception handling in ctr code.

Is there any chance of getting access to the brick logs?

--- Additional comment from Shashank Raj on 2016-05-09 12:24:03 EDT ---

Since i was not sure of the exact scenario which generated it and unfortunately there were lot many tests done on the same setup from the time the core got generated, i don't think i will be able to provide the exact brick logs.

but will surely keep a check on this issue and update the bugzilla with details if i hit it again.


--- Additional comment from Nithya Balachandran on 2016-07-20 02:55:33 EDT ---

RCA:

All the cores have the same backtrace


#root>gdb -ex 'set sysroot ./' -ex 'set solib-search-path ./'  -ex 'core-file ../core.18086'  ./usr/sbin/glusterfs

...

Program terminated with signal 11, Segmentation fault.
#0  0x00007fce4e4323fc in ?? ()
(gdb) bt
#0  0x00007fce4e4323fc in ?? ()
#1  0x00007fce409edc09 in gf_uuid_is_null (uuid=0x10 <Address 0x10 out of bounds>) at ../../../../libglusterfs/src/compat-uuid.h:45
#2  ctr_delete_hard_link_from_db (gfid=gfid@entry=0x10 <Address 0x10 out of bounds>, pargfid=pargfid@entry=0x20 <Address 0x20 out of bounds>, 
    basename=basename@entry=0x40 <Address 0x40 out of bounds>, fop_path=GFDB_FOP_UNDEL, this=0x7fce3c00a8e0, this=0x7fce3c00a8e0, fop_type=9)
    at ctr-helper.h:634
#3  0x00007fce409f4bd1 in ctr_rename_cbk (frame=0x7fce4c7b36b8, cookie=<optimized out>, this=0x7fce3c00a8e0, op_ret=0, op_errno=0, buf=0x7fce202b2760, 
    preoldparent=0x7fce202b27d0, postoldparent=0x7fce202b2840, prenewparent=0x7fce202b28b0, postnewparent=0x7fce202b2920, xdata=0x7fce4ef80a58)
    at changetimerecorder.c:860
#4  0x00007fce40c0accd in trash_common_rename_cbk (frame=0x7fce4c7b289c, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, 
    buf=0x7fce202b2760, preoldparent=0x7fce202b27d0, postoldparent=0x7fce202b2840, prenewparent=0x7fce202b28b0, postnewparent=0x7fce202b2920, 
    xdata=0x7fce4ef80a58) at trash.c:544
#5  0x00007fce41447024 in posix_rename (frame=0x7fce4c7b2aa0, this=<optimized out>, oldloc=<optimized out>, newloc=<optimized out>, xdata=<optimized out>)
    at posix.c:2340
#6  0x00007fce40c12c27 in trash_rename (frame=0x7fce4c7b289c, this=0x7fce3c009310, oldloc=0x7fce4c25ba30, newloc=0x7fce4c25ba70, xdata=0x7fce4ef80684)
    at trash.c:1900
#7  0x00007fce409f019d in ctr_rename (frame=0x7fce4c7b36b8, this=0x7fce3c00a8e0, oldloc=0x7fce4c25ba30, newloc=0x7fce4c25ba70, xdata=0x7fce4ef80684)
    at changetimerecorder.c:984
#8  0x00007fce4031249e in changelog_rename (frame=0x7fce4c7b32b0, this=0x7fce3c00d410, oldloc=0x7fce4c25ba30, newloc=0x7fce4c25ba70, xdata=0x7fce4ef80684)
    at changelog.c:440
#9  0x00007fce4eca9b1a in ?? ()
#10 0x00007fce4c7b3a14 in ?? ()
#11 0x00007fce4ef80684 in ?? ()
#12 0x00007fce4c7b3a14 in ?? ()
#13 0x00007fce4c25ba30 in ?? ()
#14 0x00007fce4c25ba70 in ?? ()
#15 0x00007fce3c010820 in ?? ()
#16 0x00007fce4ef80684 in ?? ()
#17 0x00007fce3bbe34fa in posix_acl_rename (frame=0x7fce4c7b32b0, this=0x7fce4c25ba70, old=0x7fce3c00f3e0, new=0x7fce4c25ba30, xdata=0x7fce3c00f3e0)
    at posix-acl.c:1531




(gdb) f 6
#6  0x00007fce40c12c27 in trash_rename (frame=0x7fce4c7b289c, this=0x7fce3c009310, oldloc=0x7fce4c25ba30, newloc=0x7fce4c25ba70, xdata=0x7fce4ef80684)
    at trash.c:1900
1900	                STACK_WIND (frame, trash_common_rename_cbk, FIRST_CHILD(this),
(gdb) p *oldloc
$1 = {path = 0x7fce10029cf0 "/tmp/RNM13/dir1", name = 0x7fce10029cfb "dir1", inode = 0x7fce205ee1f4, parent = 0x7fce2055148c, 
  gfid = "\036zd\216MwG\037\252\242j\204\233\003D\211", pargfid = "80\311\321I\002N\376\260\263WzB\352x\255"}
(gdb) p *newloc
$2 = {path = 0x7fce10029550 "/tmp/RNM13/dir2", name = 0x7fce1002955b "dir2", inode = 0x7fce20597224, parent = 0x7fce2055148c, 
  gfid = '\000' <repeats 15 times>, pargfid = "80\311\321I\002N\376\260\263WzB\352x\255"}



(gdb) f 3
#3  0x00007fce409f4bd1 in ctr_rename_cbk (frame=0x7fce4c7b36b8, cookie=<optimized out>, this=0x7fce3c00a8e0, op_ret=0, op_errno=0, buf=0x7fce202b2760, 
    preoldparent=0x7fce202b27d0, postoldparent=0x7fce202b2840, prenewparent=0x7fce202b28b0, postnewparent=0x7fce202b2920, xdata=0x7fce4ef80a58)
    at changetimerecorder.c:860
860	        ret = ctr_delete_hard_link_from_db (this,
(gdb) l
855	                        CTR_MSG_INSERT_RENAME_UNWIND_FAILED,
856	                        "Invalid link count from posix");
857	                goto out;
858	        }
859	
860	        ret = ctr_delete_hard_link_from_db (this,
861	                                    CTR_DB_REC(ctr_local).old_gfid,
862	                                    CTR_DB_REC(ctr_local).pargfid,
863	                                    CTR_DB_REC(ctr_local).file_name,
864	                                    fop_type, fop_path);
(gdb) p ctr_local
$3 = (gf_ctr_local_t *) 0x0
(gdb) 

860	        ret = ctr_delete_hard_link_from_db (this,
861	                                    CTR_DB_REC(ctr_local).old_gfid,
862	                                    CTR_DB_REC(ctr_local).pargfid,
863	                                    CTR_DB_REC(ctr_local).file_name,
864	                                    fop_type, fop_path);



#define CTR_DB_REC(ctr_local)\
        (ctr_local->gfdb_db_record)



From the oldloc and newloc values, this seems to be a directory rename.
ctr_rename () calls 
ret = ctr_insert_wind(frame, this, _inode_cx); 

static inline int
ctr_insert_wind (call_frame_t                    *frame,
                xlator_t                        *this,
                gf_ctr_inode_context_t          *ctr_inode_cx)
{

...

 if (_priv->ctr_record_wind && ctr_inode_cx->ia_type != IA_IFDIR) {
                frame->local = init_ctr_local_t (this);

--> inits local only for files.

The process crashes in the ctr_rename_cbk, when it tries to access ctr_local which is NULL.

--- Additional comment from Nithya Balachandran on 2016-07-20 05:29:20 EDT ---

From ctr_rename_cbk:


832	        ret = dict_get_uint32 (xdata , GF_RESPONSE_LINK_COUNT_XDATA,
833	                                &remaining_links);
834	        if (ret) {
835	                gf_msg (this->name, GF_LOG_ERROR, 0,
836	                        CTR_MSG_GET_CTR_RESPONSE_LINK_COUNT_XDATA_FAILED,
(gdb) 
837	                        "Failed to getting GF_RESPONSE_LINK_COUNT_XDATA");
838	                remaining_links = -1;
839	                goto out;
840	        }
841	
842	        ctr_local = frame->local;

(gdb) p remaining_links
$24 = 2



posix_rename sets GF_RESPONSE_LINK_COUNT_XDATA only if GF_REQUEST_LINK_COUNT_XDATA is set in xdata.

#6  0x00007fce40c12c27 in trash_rename (frame=0x7fce4c7b289c, this=0x7fce3c009310, oldloc=0x7fce4c25ba30, newloc=0x7fce4c25ba70, xdata=0x7fce4ef80684)
    at trash.c:1900


(gdb) p* (((dict_t*)0x00007fce4ef80684)->members_internal)
$17 = {hash_next = 0x7fce4ef806c4, prev = 0x0, next = 0x7fce4ef806c4, value = 0x7fce4c086f70, key = 0x7fce3c1fb230 "gf_request_link_count"}
(gdb) p* (((dict_t*)0x00007fce4ef80684)->members_internal->next)
$18 = {hash_next = 0x0, prev = 0x7fce4c14c408, next = 0x0, value = 0x7fce4c092d5c, key = 0x7fce3c1f1e80 "glusterfs.inodelk-dom-count"}
(gdb) p* (((dict_t*)0x00007fce4ef80684)->members_internal->value)
$19 = {is_static = 0 '\000', is_const = 0 '\000', is_stdalloc = 0 '\000', len = 2, data = 0x7fce3c11afc0 "1", refcount = 1, lock = 1}
(gdb) bt




The key "gf_request_link_count" is set in the xdata passed to posix_rename. However this should not have happened as the ctr_rename () function should have 
failed in:
ret = update_hard_link_ctx (frame, this, oldloc->inode);

because ctr_local is NULL, causing it to skip the code to set this key.


I tried to reproduce the crash using 
mv -T dir1 dir2  

where both dir1 and dir2 exist and are empty. posix_rename consistently failed to set the GF_RESPONSE_LINK_COUNT_XDATA key value in the dict so ctr_rename_cbk always exits at line 838 in the code snippet above. I was unable to reproduce the crash.

However the original bt in the description lists graphyylex_destroy in the stack. This looks like a graph change occurred but without the logs there is no way to confirm this.

--- Additional comment from Vijay Bellur on 2016-07-20 05:41:36 EDT ---

REVIEW: http://review.gluster.org/14964 (features/ctr: Check for NULL local) posted (#1) for review on master by N Balachandran (nbalacha@redhat.com)

--- Additional comment from Vijay Bellur on 2016-07-20 11:22:47 EDT ---

REVIEW: http://review.gluster.org/14964 (features/ctr: Check for NULL local) posted (#2) for review on master by N Balachandran (nbalacha@redhat.com)

--- Additional comment from Vijay Bellur on 2016-07-25 10:53:58 EDT ---

REVIEW: http://review.gluster.org/14964 (features/ctr: Check for NULL local) posted (#3) for review on master by N Balachandran (nbalacha@redhat.com)

--- Additional comment from Vijay Bellur on 2016-07-25 16:14:13 EDT ---

COMMIT: http://review.gluster.org/14964 committed in master by Vijay Bellur (vbellur@redhat.com) 
------
commit fc2a0f5eb8caae4f4fd0c023619954ed4b6ca760
Author: N Balachandran <nbalacha@redhat.com>
Date:   Wed Jul 20 15:10:05 2016 +0530

    features/ctr: Check for NULL local
    
    This is a defensive fix to prevent a crash reported
    during a rename operation. This is not reproducible
    under normal circumstances.
    
    This patch also moves ctr-messages.h to the src dir
    of the changetimerecorder xlator.
    
    Change-Id: I46eb926d67bf4c19387c8b26e354c635a5fb284c
    BUG: 1358196
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
    Reviewed-on: http://review.gluster.org/14964
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    Tested-by: Zhou Zhengping <johnzzpcrystal@gmail.com>
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Vijay Bellur <vbellur@redhat.com>

Comment 1 Vijay Bellur 2016-07-26 06:29:04 UTC
REVIEW: http://review.gluster.org/15009 (features/ctr: Check for NULL local) posted (#1) for review on release-3.8 by N Balachandran (nbalacha@redhat.com)

Comment 2 Vijay Bellur 2016-07-26 15:14:06 UTC
COMMIT: http://review.gluster.org/15009 committed in release-3.8 by Niels de Vos (ndevos@redhat.com) 
------
commit ed1c986bff37e783a19fc43e38066a74ece13cc5
Author: N Balachandran <nbalacha@redhat.com>
Date:   Tue Jul 26 11:57:18 2016 +0530

    features/ctr: Check for NULL local
    
    This is a defensive fix to prevent a crash reported
    during a rename operation. This is not reproducible
    under normal circumstances.
    
    This patch also moves ctr-messages.h to the src dir
    of the changetimerecorder xlator.
    
    Backported from master: http://review.gluster.org/#/c/14964/
    
    Change-Id: If459f46d3ce9258f595d0ca7ef55942bf466d767
    BUG: 1360122
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
    Reviewed-on: http://review.gluster.org/15009
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Niels de Vos <ndevos@redhat.com>

Comment 3 Niels de Vos 2016-08-12 09:47:50 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.2, please open a new bug report.

glusterfs-3.8.2 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://www.gluster.org/pipermail/announce/2016-August/000058.html
[2] https://www.gluster.org/pipermail/gluster-users/


Note You need to log in before you can comment on or make changes to this bug.