Bug 1599769 - seeing a lot of sockets which are possibly stale under the brick process
Summary: seeing a lot of sockets which are possibly stale under the brick process
Keywords:
Status: POST
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rpc
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Mohit Agrawal
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks: 1662629
 
Reported: 2018-07-10 14:33 UTC by nchilaka
Modified: 2019-04-11 08:18 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1662629
Environment:
Last Closed:
Target Upstream Version:


Attachments

Description nchilaka 2018-07-10 14:33:47 UTC
Description of problem:
========================
I see more than 700 sockets under /proc/$(pgrep glusterfsd)/fd which are possibly stale.
I don't think this is expected.
Below is a sample set:

lrwx------. 1 root root 64 Jul 10 17:44 12385 -> socket:[3399901]
lrwx------. 1 root root 64 Jul 10 17:44 12386 -> socket:[3399902]
lrwx------. 1 root root 64 Jul 10 17:44 12387 -> socket:[3399903]
lrwx------. 1 root root 64 Jul 10 17:44 12388 -> socket:[3399904]
lrwx------. 1 root root 64 Jul 10 17:44 12389 -> socket:[3399916]
lrwx------. 1 root root 64 Jul 10 17:44 12390 -> socket:[3399917]
lrwx------. 1 root root 64 Jul 10 17:44 12391 -> socket:[3399918]
lrwx------. 1 root root 64 Jul 10 17:44 12392 -> socket:[3399919]
lrwx------. 1 root root 64 Jul 10 17:44 12393 -> socket:[3399920]
lrwx------. 1 root root 64 Jul 10 17:44 12394 -> socket:[3399921]
lrwx------. 1 root root 64 Jul 10 17:44 12395 -> socket:[3399922]
lrwx------. 1 root root 64 Jul 10 17:44 12396 -> socket:[3399923]
lrwx------. 1 root root 64 Jul 10 17:44 12397 -> socket:[3399924]
lrwx------. 1 root root 64 Jul 10 17:44 12398 -> socket:[3399925]
lrwx------. 1 root root 64 Jul 10 17:44 12399 -> socket:[3399926]
lrwx------. 1 root root 64 Jul 10 17:44 12400 -> socket:[3399927]
lrwx------. 1 root root 64 Jul 10 17:44 12401 -> socket:[3399928]
lrwx------. 1 root root 64 Jul 10 17:44 12402 -> socket:[3399929]
lrwx------. 1 root root 64 Jul 10 17:44 12403 -> socket:[3399930]
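
For reference, a quick way to count these socket fds per brick process (a sketch; on a node with several bricks, pgrep glusterfsd returns multiple PIDs, so iterate over all of them):

for pid in $(pgrep glusterfsd); do
    echo -n "glusterfsd pid $pid: "
    ls -l /proc/$pid/fd | grep -c 'socket:'
done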





Version-Release number of selected component (if applicable):
----------------
3.8.4-54.14


How reproducible:
==============
Ran the below steps once and am seeing it on all nodes.

Steps to Reproduce:
=====================
1. Have a 6-node cluster with an 8x3 volume started.
2. Run glusterd restart in a loop from one terminal of n1, and run the gluster v heal command from another terminal of n1 (as part of bz#1595752 verification).
3. Now, from 3 clients simultaneously, keep mounting the same volume using the n1 IP on 1000 directories,
   i.e. from each node: for i in {1..1000};do mkdir /mnt/vol.$i; mount -t glusterfs n1:vol /mnt/vol.$i;done
4. now the loops in step#3
5. Now, in a loop, restart glusterd from t1 of n1 and run quota enable/disable from t2 of n1 (a sketch of these two loops follows the steps below).

You will hit bz#1599702 on n1.


6. Now unmount all the mounts on the 3 clients simultaneously,
   i.e. for i in {1..1000};do umount /mnt/vol.$i; done
   This will succeed.
7. Stop step 5 and bring the cluster to an idle state.
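
A sketch of the two step-5 loops (assuming the volume is named "vol", as in the mount command above, a systemd-managed glusterd, and arbitrary sleep intervals):

Terminal t1 on n1:
while true; do systemctl restart glusterd; sleep 5; done

Terminal t2 on n1:
while true; do gluster volume quota vol enable; sleep 2; gluster volume quota vol disable; sleep 2; done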


Actual results:
=============
Check /proc/$(pgrep glusterfsd)/fd
and you will see many sockets, i.e. more than 700.
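
One rough way to see which of those fds are candidates for being stale is to cross-check the socket inodes in /proc against the inodes ss reports for live TCP connections (a sketch under assumptions not stated in the report: a single glusterfsd PID per node, and TCP sockets only; unix-domain sockets would need a similar pass over ss -x output):

pid=$(pgrep glusterfsd)
ls -l /proc/$pid/fd | awk -F'[][]' '/socket:/{print $2}' | sort -u > /tmp/brick_fd_inodes
ss -tane | grep -o 'ino:[0-9]*' | cut -d: -f2 | sort -u > /tmp/live_tcp_inodes
comm -23 /tmp/brick_fd_inodes /tmp/live_tcp_inodes    # fd inodes not known to ss: stale candidates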


Workaround:
==========
Reboot the node, or kill all gluster processes and restart glusterd.
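
A sketch of the second workaround on one node, assuming a systemd-managed glusterd (the exact kill sequence is not spelled out in the report):

systemctl stop glusterd      # stop the management daemon so it does not respawn bricks mid-kill
pkill glusterfs              # matches the glusterfsd brick processes and glusterfs client/self-heal processes
systemctl start glusterd     # glusterd brings the bricks back up with fresh fd tables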

Comment 3 nchilaka 2018-07-10 14:40:24 UTC
The lists are available under each server's log directory as
glusterfsd.proc.fd.list and
lsof_fd.list.
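
The exact commands used to generate those lists are not shown here; a sketch of how equivalent lists could be captured on each server (assuming a single glusterfsd PID and the default /var/log/glusterfs log directory):

pid=$(pgrep glusterfsd)
ls -l /proc/$pid/fd > /var/log/glusterfs/glusterfsd.proc.fd.list
lsof -p $pid        > /var/log/glusterfs/lsof_fd.list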

Comment 6 Amar Tumballi 2018-08-13 04:58:20 UTC
Similar tests are required on top of RHGS 3.4.0 (once released), to see if we should consider this for a batch update. I am not aware if this is still happening.

Comment 7 Amar Tumballi 2018-08-13 04:58:21 UTC
Similar tests are required on top of RHGS 3.4.0 (once released), to see if we should consider this for a batch update. I am not aware if this is still happening.

Comment 8 Atin Mukherjee 2018-11-10 08:12:01 UTC
Setting a needinfo on Nag based on comment 7.

Comment 9 nchilaka 2018-11-30 13:31:09 UTC
Cleared the needinfo accidentally; placing it back.

Comment 10 nchilaka 2018-12-03 07:39:44 UTC
The problem still exists even on 3.12.2-29.

I am seeing anywhere between 15 and 75 stale sockets for the same steps as mentioned in the description (the only difference is that I reduced the number of mounts to 500).

Comment 11 nchilaka 2018-12-03 07:52:21 UTC
sosreports and health reports @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1599769

Comment 14 Atin Mukherjee 2019-01-02 02:41:33 UTC
Upstream patch : https://review.gluster.org/#/c/glusterfs/+/21966/

