Bug 1354250 - Gluster fuse client crashed generating core dump
Summary: Gluster fuse client crashed generating core dump
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: transport
Version: 3.8.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Nithya Balachandran
QA Contact:
URL:
Whiteboard:
Depends On: 1343320 1343374
Blocks:
Reported: 2016-07-11 04:21 UTC by Nithya Balachandran
Modified: 2016-08-12 09:46 UTC (History)

Fixed In Version: glusterfs-3.8.2
Doc Text:
Clone Of: 1343374
Environment:
Last Closed: 2016-08-12 09:46:59 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Nithya Balachandran 2016-07-11 04:21:47 UTC
+++ This bug was initially created as a clone of Bug #1343374 +++

+++ This bug was initially created as a clone of Bug #1343320 +++

Description of problem:
Client crash with core dump due to excessive memory consumption


Version-Release number of selected component (if applicable):

3.7.5-19.el7rhgs.x86_64
RHEL 5

Additional info:
Lots of DNS resolution errors were found in the client logs.

The client logs show continuous error messages:
[2016-04-27 10:33:29.833969] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:32.843124] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:35.850581] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:38.858181] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:41.865251] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
The message "E [MSGID: 101075] [common-utils.c:306:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)" repeated 39 times between [2016-04-27 10:31:44.561995] and [2016-04-27 10:33:41.865245]
[2016-04-27 10:33:44.873510] E [MSGID: 101075] [common-utils.c:306:gf_resolve_ip6] 0-resolver: getaddrinfo failed (Name or service not known)
[2016-04-27 10:33:44.873599] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:47.881687] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
[2016-04-27 10:33:50.890768] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1
.
.
.
.
.
 
and after some time (about 27 hours):
 
[2016-04-28 13:47:23.002272] E [socket.c:3124:socket_connect] 0-vol01-client-1: pthread_createfailed: Cannot allocate memory
[2016-04-28 13:47:23.002528] E [socket.c:3126:socket_connect] (-->/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xf5) [0x3bb8046e65] -->/usr/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0xea) [0x3bb7c0e67a] -->/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so [0x2b2a7319f0ca] ) 0-: Assertion failed: 0
[2016-04-28 13:47:26.008762] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1

[2016-04-28 13:47:26.008933] E [socket.c:3124:socket_connect] 0-vol01-client-1: pthread_createfailed: Cannot allocate memory
[2016-04-28 13:47:26.009134] E [socket.c:3126:socket_connect] (-->/usr/lib64/libglusterfs.so.0(gf_timer_proc+0xf5) [0x3bb8046e65] -->/usr/lib64/libgfrpc.so.0(rpc_clnt_reconnect+0xea) [0x3bb7c0e67a] -->/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so [0x2b2a7319f0ca] ) 0-: Assertion failed: 0
[2016-04-28 13:47:29.015862] E [name.c:242:af_inet_client_get_remote_sockaddr] 0-vol01-client-1: DNS resolution failed on host server1



.
.
This continued for almost a week, followed by:
.
.

[2016-05-15 04:12:17.272132] A [MSGID: 0] [mem-pool.c:120:__gf_calloc] : no memory available for size (2097224) [call stack follows]
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(_gf_msg_nomem+0x42e)[0x3bb802984e]
/usr/lib64/libglusterfs.so.0(__gf_calloc+0x100)[0x3bb805bda0]
/usr/lib64/libglusterfs.so.0(synctask_create+0x3a1)[0x3bb806cf21]
/usr/lib64/libglusterfs.so.0(synctask_new1+0x9)[0x3bb806d4f9]
[2016-05-15 04:12:18.863904] A [MSGID: 0] [mem-pool.c:120:__gf_calloc] : no memory available for size (2097224) [call stack follows]
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(_gf_msg_nomem+0x42e)[0x3bb802984e]
/usr/lib64/libglusterfs.so.0(__gf_calloc+0x100)[0x3bb805bda0]
/usr/lib64/libglusterfs.so.0(synctask_create+0x3a1)[0x3bb806cf21]
/usr/lib64/libglusterfs.so.0(synctask_new1+0x9)[0x3bb806d4f9]
.
.
.
.
.
.
[2016-05-15 04:12:31.572526] A [MSGID: 0] [mem-pool.c:120:__gf_calloc] : no memory available for size (124) [call stack follows]
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(_gf_msg_nomem+0x42e)[0x3bb802984e]
/usr/lib64/libglusterfs.so.0(__gf_calloc+0x100)[0x3bb805bda0]
/usr/lib64/libglusterfs.so.0(mem_get+0xb8)[0x3bb805be98]
/usr/lib64/libglusterfs.so.0(mem_get0+0x1b)[0x3bb805bf0b]
pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)
.
.
.
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2016-05-15 04:12:31
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.5
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb5)[0x3bb8025395]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x338)[0x3bb8042378]
/lib64/libc.so.6[0x34f2030030]
/usr/lib64/libglusterfs.so.0(mem_get+0x6e)[0x3bb805be4e]
/usr/lib64/libglusterfs.so.0(mem_get0+0x1b)[0x3bb805bf0b]
/usr/lib64/libglusterfs.so.0(get_new_data+0x20)[0x3bb801f260]
/usr/lib64/libglusterfs.so.0(dict_unserialize+0xf4)[0x3bb801f374]
/usr/lib64/glusterfs/3.7.5/xlator/protocol/client.so(client3_3_lookup_cbk+0x7bc)[0x2b2a741f5acc]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa0)[0x3bb7c0fa70]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1b4)[0x3bb7c0fd34]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x27)[0x3bb7c0b517]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so[0x2b2a731a3f68]
/usr/lib64/glusterfs/3.7.5/rpc-transport/socket.so[0x2b2a731a4994]
/usr/lib64/libglusterfs.so.0[0x3bb808b363]
/lib64/libpthread.so.0[0x34f280683d]
/lib64/libc.so.6(clone+0x6d)[0x34f20d4fcd]

--- Additional comment from Nithya Balachandran on 2016-06-07 04:50:10 EDT ---

RCA:

There is a memory leak in the socket_connect code in case of failure. 

In socket_connect ():

        /* if sock != -1, then cleanup is done from the event handler */
        if (ret == -1 && sock == -1) {
                /* Cleaup requires to send notification to upper layer which
                   intern holds the big_lock. There can be dead-lock situation
                   if big_lock is already held by the current thread. 
                   So transfer the ownership to seperate thread for cleanup.
                */      
                arg = GF_CALLOC (1, sizeof (*arg), 
                                 gf_sock_connect_error_state_t);
                arg->this = THIS; 
                arg->trans = this; 
                arg->refd = refd; 
                th_ret = pthread_create (&th_id, NULL, socket_connect_error_cbk,
                                         arg);   
                if (th_ret) {
                       gf_log (this->name, GF_LOG_ERROR, "pthread_create"
                               "failed: %s", strerror(errno));
                        GF_FREE (arg);
                        GF_ASSERT (0);
                }       
        }       


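pthread_create is called with NULL attributes, so it creates a joinable thread rather than a detached one. Since pthread_join is never called on that thread, its resources are not released when it exits. socket_connect is retried at 3-second intervals while the connection keeps failing, so the leaked thread resources quickly add up until the process runs out of memory.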
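As a small illustration of the point above: POSIX specifies that default thread attributes (what passing NULL to pthread_create selects) give a joinable thread. The sketch below only queries that default; the function names are made up for this example and are not part of GlusterFS.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical helper (not GlusterFS code): returns 1 if the default
 * thread attributes describe a joinable thread, 0 if they describe a
 * detached thread, and -1 if the attributes could not be queried. */
static int default_thread_is_joinable(void)
{
        pthread_attr_t attr;
        int            state = 0;
        int            ret;

        if (pthread_attr_init(&attr) != 0)
                return -1;

        ret = pthread_attr_getdetachstate(&attr, &state);
        pthread_attr_destroy(&attr);
        if (ret != 0)
                return -1;

        return state == PTHREAD_CREATE_JOINABLE;
}

/* Self-check: POSIX defines joinable as the default detach state. */
static void check_default_is_joinable(void)
{
        assert(default_thread_is_joinable() == 1);
}
```

A joinable thread keeps its stack and bookkeeping until some other thread calls pthread_join (or pthread_detach) on it, which is why one such thread per failed reconnect attempt accumulates over time.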

--- Additional comment from Vijay Bellur on 2016-06-07 04:56:31 EDT ---

REVIEW: http://review.gluster.org/14661 (rpc/socket: pthread resources are not cleanup up) posted (#1) for review on master by N Balachandran (nbalacha@redhat.com)

--- Additional comment from Nithya Balachandran on 2016-06-07 05:01:17 EDT ---

Fix:

Create a detached thread so all thread resources are cleaned up automatically.
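A minimal sketch of what creating a detached thread can look like with the standard pthread attribute API. This is not the actual patch from the review (its diff is not shown in this report); spawn_detached and noop_worker are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Example thread body: does nothing and exits. Because the thread is
 * detached, the system reclaims its resources without a pthread_join. */
static void *noop_worker(void *arg)
{
        (void)arg;
        return NULL;
}

/* Hypothetical helper (not GlusterFS code): start fn(arg) on a detached
 * thread. Returns 0 on success or the pthread error number on failure. */
static int spawn_detached(void *(*fn)(void *), void *arg)
{
        pthread_attr_t attr;
        pthread_t      th_id;
        int            ret;

        assert(fn != NULL);

        ret = pthread_attr_init(&attr);
        if (ret != 0)
                return ret;

        /* Ask for a detached thread up front instead of the joinable default. */
        ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
        if (ret == 0)
                ret = pthread_create(&th_id, &attr, fn, arg);

        pthread_attr_destroy(&attr);

        /* pthread functions return the error number rather than setting
         * errno, so the return value is what gets passed to strerror. */
        if (ret != 0)
                fprintf(stderr, "pthread_create failed: %s\n", strerror(ret));

        return ret;
}
```

A caller would use it as spawn_detached (noop_worker, NULL). An alternative with the same effect is to create the thread with default attributes and then call pthread_detach on it; setting the attribute first avoids any window in which the thread is still joinable.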

--- Additional comment from Vijay Bellur on 2016-06-07 05:09:38 EDT ---

REVIEW: http://review.gluster.org/14661 (rpc/socket: pthread resources are not cleaned up) posted (#2) for review on master by N Balachandran (nbalacha@redhat.com)

--- Additional comment from Vijay Bellur on 2016-07-08 01:18:43 EDT ---

REVIEW: http://review.gluster.org/14875 (rpc/socket: pthread resources are not cleaned up) posted (#1) for review on master by N Balachandran (nbalacha@redhat.com)

--- Additional comment from Vijay Bellur on 2016-07-08 01:22:40 EDT ---

REVIEW: http://review.gluster.org/14875 (rpc/socket: pthread resources are not cleaned up) posted (#2) for review on master by N Balachandran (nbalacha@redhat.com)

--- Additional comment from Vijay Bellur on 2016-07-08 01:54:26 EDT ---

REVIEW: http://review.gluster.org/14875 (rpc/socket: pthread resources are not cleaned up) posted (#3) for review on master by N Balachandran (nbalacha@redhat.com)

--- Additional comment from Vijay Bellur on 2016-07-08 16:17:16 EDT ---

COMMIT: http://review.gluster.org/14875 committed in master by Jeff Darcy (jdarcy@redhat.com) 
------
commit 9886d568a7a8839bf3acc81cb1111fa372ac5270
Author: N Balachandran <nbalacha@redhat.com>
Date:   Fri Jul 8 10:46:46 2016 +0530

    rpc/socket: pthread resources are not cleaned up
    
    A socket_connect failure creates a new pthread which
    is not a detached thread. As no pthread_join is called,
    the thread resources are not cleaned up causing a memory leak.
    
    Now, socket_connect creates a detached thread to handle failure.
    
    Change-Id: Idbf25d312f91464ae20c97d501b628bfdec7cf0c
    BUG: 1343374
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
    Reviewed-on: http://review.gluster.org/14875
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Atin Mukherjee <amukherj@redhat.com>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Jeff Darcy <jdarcy@redhat.com>

Comment 1 Nithya Balachandran 2016-07-12 13:32:35 UTC
Moving this back to Assigned as the fix has not been posted for release-3.8.

Comment 2 Vijay Bellur 2016-07-22 04:58:51 UTC
REVIEW: http://review.gluster.org/14979 (rpc/socket: pthread resources are not cleaned up) posted (#1) for review on release-3.8 by N Balachandran (nbalacha@redhat.com)

Comment 3 Vijay Bellur 2016-07-22 12:47:15 UTC
COMMIT: http://review.gluster.org/14979 committed in release-3.8 by Raghavendra G (rgowdapp@redhat.com) 
------
commit 96befcf40767cb4ff67868af46637acfabe40bc7
Author: N Balachandran <nbalacha@redhat.com>
Date:   Fri Jul 22 10:28:14 2016 +0530

    rpc/socket: pthread resources are not cleaned up
    
    A socket_connect failure creates a new pthread which
    is not a detached thread. As no pthread_join is called,
    the thread resources are not cleaned up causing a memory leak.
    
    Now, socket_connect creates a detached thread to handle failure.
    
    > Change-Id: Idbf25d312f91464ae20c97d501b628bfdec7cf0c
    > BUG: 1343374
    > Signed-off-by: N Balachandran <nbalacha@redhat.com>
    > Reviewed-on: http://review.gluster.org/14875
    > Smoke: Gluster Build System <jenkins@build.gluster.org>
    > Reviewed-by: Atin Mukherjee <amukherj@redhat.com>
    > NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    > CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    > Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
    (cherry picked from commit 9886d568a7a8839bf3acc81cb1111fa372ac5270)
    
    Change-Id: I69ef46013c8dbc70cbda2695f12be1f6d3720055
    BUG: 1354250
    Signed-off-by: N Balachandran <nbalacha@redhat.com>
    Reviewed-on: http://review.gluster.org/14979
    Reviewed-by: Atin Mukherjee <amukherj@redhat.com>
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>

Comment 4 Niels de Vos 2016-08-12 09:46:59 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.2, please open a new bug report.

glusterfs-3.8.2 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://www.gluster.org/pipermail/announce/2016-August/000058.html
[2] https://www.gluster.org/pipermail/gluster-users/

