Bug 1597463 - Heals pending after brick kill during an ongoing block-delete
Summary: Heals pending after brick kill during an ongoing block-delete
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: gluster-block
Version: cns-3.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Pranith Kumar K
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-03 01:17 UTC by Sweta Anandpara
Modified: 2019-02-08 10:24 UTC
CC: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-08 10:23:55 UTC
Target Upstream Version:


Attachments

Description Sweta Anandpara 2018-07-03 01:17:28 UTC
Description of problem:
======================
Had a 4-node cluster with 2 volumes, 'ozone' and 'nash'. Had about 80-100 blocks created in ozone, and was testing failover scenarios in and around firewall-port-disable, blockd service down, and brick-kill.

Executed a block-delete loop, and killed one of the bricks hosted on an HA peer node. Ideally the block deletes should have passed (as tested earlier), and the heals left pending should have dropped to 0 once the brick was brought back up. This had been executed earlier and the expected behaviour was seen.

However, in one such attempt the block delete failed as soon as the brick was brought down, with the error "Transport endpoint is not connected". The specific block that was being deleted at the time of the brick kill showed up in the pending heal entries. Even after bringing the brick back up and explicitly triggering a heal of the volume, the volume continued to show a few heals pending.
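For reference, a minimal sketch of how a brick is typically killed and restored in this kind of test, assuming the brick PID is taken from 'gluster volume status' (the exact method used in this run is not recorded here):

# On the HA peer node hosting the brick: find the brick PID, then kill it to simulate failure
gluster volume status ozone
kill -9 <brick-pid>
# Later, bring the killed brick back online
gluster volume start ozone force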

Raising this as a medium-severity bug for now. Will increase the severity if I happen to see it again.
Sosreports and gluster-block logs will be copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/swetas/<bugnumber>


Version-Release number of selected component (if applicable):
=============================================================
gluster-block-0.2.1-20
tcmu-runner-1.2.0-20
glusterfs-3.8.4-54.12


How reproducible:
=================
1:1


Steps to Reproduce:
===================
1. Have a brick-mux-enabled cluster, with a replica 3 volume that has the group profile set to 'gluster-block'
2. Create 10 blocks in a loop on the volume, and kill one of the bricks in the middle
3. Check for pending heals with 'gluster volume heal <volname> info'. Force-start the volume to get the brick back up
4. Delete the 10 blocks in a loop, and kill one of the bricks in the middle
5. Check for pending heals with 'gluster volume heal <volname> info'. Force-start the volume to get the brick back up (a rough command sketch of these steps follows below)
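A rough shell sketch of the above steps (the volume name 'ozone' and the block parameters are assumptions for illustration; the hosts and sizes used in the actual run are not recorded here):

# Step 2: create 10 blocks in a loop; kill one brick partway through from another terminal
for i in {1..10}; do gluster-block create ozone/blk$i ha 3 <host1>,<host2>,<host3> 1GiB; done

# Step 3: check pending heals, then force-start the volume to bring the killed brick back up
gluster volume heal ozone info
gluster volume start ozone force

# Step 4: delete the 10 blocks in a loop; kill one brick partway through again
for i in {1..10}; do gluster-block delete ozone/blk$i; done

# Step 5: check pending heals again, then force-start the volume
gluster volume heal ozone info
gluster volume start ozone force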

Actual results:
===============
Block creates and deletes fail in steps 2 and 4 with the error "Transport endpoint is not connected".
Heals pending in step 5 remain even after triggering an explicit heal.


Expected results:
==================
Block creates should succeed in step 2.
Step 3 should show pending heals, which should all get cleaned up as soon as the brick is brought back online.
Step 4 block deletes should succeed.
Step 5 should show pending heals, which should all get cleaned up as soon as the brick is brought back online.


Additional info:
================

[root@dhcp35-120 ~]# for i in {30..40}; do echo "==== $i ===="; time  gluster-block delete ozone/ozone$i; done
==== 30 ====
block ozone/ozone30 doesn't exist
RESULT:FAIL

real	0m0.012s
user	0m0.001s
sys	0m0.002s
==== 31 ====
SUCCESSFUL ON:   10.70.35.141 10.70.35.147 10.70.35.120
RESULT: SUCCESS

real	0m5.033s
user	0m0.001s
sys	0m0.002s
==== 32 ====
SUCCESSFUL ON:   10.70.35.141 10.70.35.147 10.70.35.120
RESULT: SUCCESS

real	0m4.736s
user	0m0.001s
sys	0m0.002s
==== 33 ====
Not able to acquire lock on ozone[Transport endpoint is not connected]
RESULT:FAIL

real	0m4.519s
user	0m0.001s
sys	0m0.003s

[root@dhcp35-147 ~]# gluster v heal ozone info
Brick 10.70.35.120:/bricks/brick0/ozone0
/block-meta 
/block-meta/prio.info 
/block-meta/ozone33 
/block-store 
Status: Connected
Number of entries: 4

Brick 10.70.35.141:/bricks/brick0/ozone1
/block-meta 
/block-meta/prio.info 
/block-meta/ozone33 
/block-store 
Status: Connected
Number of entries: 4

Brick 10.70.35.147:/bricks/brick0/ozone2
Status: Connected
Number of entries: 0

[root@dhcp35-147 ~]# 
// After about 15-20 mins
[root@dhcp35-147 ~]# gluster v heal ozone info
Brick 10.70.35.120:/bricks/brick0/ozone0
/block-meta/prio.info 
/block-store 
Status: Connected
Number of entries: 2

Brick 10.70.35.141:/bricks/brick0/ozone1
/block-meta/prio.info 
/block-store 
Status: Connected
Number of entries: 2

Brick 10.70.35.147:/bricks/brick0/ozone2
Status: Connected
Number of entries: 0

[root@dhcp35-147 ~]#


glusterfs-fuse-3.8.4-54.12.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.2.el7rhgs.noarch
glusterfs-api-3.8.4-54.12.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.12.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.12.el7rhgs.x86_64
glusterfs-server-3.8.4-54.12.el7rhgs.x86_64
gluster-block-0.2.1-20.el7rhgs.x86_64
glusterfs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.12.el7rhgs.x86_64
python-gluster-3.8.4-54.12.el7rhgs.noarch
glusterfs-libs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-rdma-3.8.4-54.12.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.5.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
targetcli-2.1.fb46-6.el7_5.noarch
python-rtslib-2.1.fb63-12.el7_5.noarch
python-configshell-1.1.fb23-4.el7_5.noarch
tcmu-runner-1.2.0-20.el7rhgs.x86_64
libtcmu-1.2.0-20.el7rhgs.x86_64

Comment 3 Pranith Kumar K 2018-07-04 08:59:10 UTC
As per the logs, quorum is not met. Even the connection to the 3rd brick has a problem when the lk fop failures are seen.

=======================================================
...
[2018-07-03 00:25:06.043613] E [socket.c:2360:socket_connect_finish] 0-ozone-client-2: connection to 10.70.35.147:24007 failed (No route to host); disconnecting socket
...
[2018-07-03 00:35:31.922955] W [socket.c:595:__socket_rwv] 0-ozone-client-1: readv on 10.70.35.141:49152 failed (No data available)
...
[2018-07-03 00:35:31.923029] W [MSGID: 108001] [afr-common.c:4673:afr_notify] 0-ozone-replicate-0: Client-quorum is not met
[2018-07-03 00:35:36.103636] W [MSGID: 114061] [client-common.c:729:client_pre_lk] 0-ozone-client-1:  (a370f495-49a3-485b-9ed7-e7b3edeec897) remote_fd is -1. EBADFD [File descriptor in bad state]
[2018-07-03 00:35:36.112062] W [MSGID: 114031] [client-rpc-fops.c:2529:client3_3_lk_cbk] 0-ozone-client-1: remote operation failed [Transport endpoint is not connected]
...
=======================================================
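For context, client quorum on a replica 3 volume can be inspected with the volume options below; this is only a generic sketch for reference, not something taken from the sosreports:

# Check the quorum-related settings in effect on the volume
gluster volume get ozone cluster.quorum-type
gluster volume get ozone cluster.quorum-count
# With 'auto' quorum on a replica 3 volume, writes start failing with ENOTCONN once 2 of the 3 bricks are unreachable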

Comment 4 Pranith Kumar K 2018-07-04 09:01:51 UTC
Sweta,
     Do you happen to have the files still in this state, so we can look at the extended attributes of the files that need heal?

Pranith
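If the files are still available, the AFR changelog extended attributes can be dumped directly from the brick backends; a minimal sketch using the entries reported in the heal-info output above (the brick paths are taken from that output, the rest is standard getfattr usage):

# Run on the nodes hosting the bricks that show pending entries
getfattr -d -m . -e hex /bricks/brick0/ozone0/block-meta/ozone33
getfattr -d -m . -e hex /bricks/brick0/ozone0/block-meta/prio.info
getfattr -d -m . -e hex /bricks/brick0/ozone1/block-meta/ozone33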

Comment 9 Amar Tumballi 2018-11-19 08:50:52 UTC
Should we close this bug if it is not seen in recent testing, especially after RHGS 3.4?

Comment 10 Sweta Anandpara 2018-11-23 04:26:04 UTC
Passing the needinfo to Karthick for his assessment of whether we have happened to hit this in recent OCS releases.

Comment 12 RamaKasturi 2019-01-23 11:27:00 UTC
Sure Karthick, will try that and keep this bug updated. Keeping the needinfo on me till then.

Comment 13 Prasanna Kumar Kalever 2019-02-07 10:47:32 UTC
@Kasturi, any update here?

In case we don't happen to see this in OCS 3.11.1 testing, I would move the bug to CLOSED without delaying it any further.

The bug was opened almost a year back and has been waiting for info since then. We can always reopen it, or open a new bug, when we have sufficient data.

Thanks!

Comment 14 RamaKasturi 2019-02-07 10:58:13 UTC
Hello prasanna,

   I was not able to test this due to the release work for 3.11.1. I would be able to provide the requested details by tomorrow. Keeping the needinfo on me till then.

Thanks
kasturi

Comment 15 RamaKasturi 2019-02-08 10:07:08 UTC
Hello prasanna,

   Below are the steps I performed to check whether the issue persists with the new builds as well. I see that the issue no longer happens with the latest version of OCS 3.11.1.

1) Killed one of the bricks of the BHV (block hosting volume); see the sketch below the steps.
2) Created block volumes using heketi-cli.
3) Made sure that all block volumes went to the BHV created above.
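For step 1, a rough sketch of how a brick of the block hosting volume can be killed in a containerized OCS setup; the pod and volume names are placeholders, and killing the brick PID from inside the gluster pod is an assumption about how the step was done:

# From inside the glusterfs pod hosting the brick (e.g. via 'oc rsh <glusterfs-pod>')
gluster volume status <block-hosting-volume>   # note the PID of the target brick
kill -9 <brick-pid>                            # simulate the brick failure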

Result: Even after killing the brick of the BHV, block volume creation was successful, and I did not see any "Transport endpoint is not connected" error as mentioned above.

[root@dhcp46-112 ~]# heketi-cli blockvolume create --size 1
Name: blockvol_dc9b1c9b141b857a471933fd03848cef
Size: 1
Volume Id: dc9b1c9b141b857a471933fd03848cef
Cluster Id: 5f1ab6259776ad529416665ce11442c2
Hosts: [10.70.46.33 10.70.46.210 10.70.46.180]
IQN: iqn.2016-12.org.gluster-block:9feffc85-b177-4b75-b1a9-9eb8d92cdd2e
LUN: 0
Hacount: 3
Username: 
Password: 
Block Hosting Volume: 496dc87f2d11a47472862adbb70410c2
[root@dhcp46-112 ~]# 
[root@dhcp46-112 ~]# for i in {1..10}; do heketi blockvolume create --size 2; sleep 10; done
bash: heketi: command not found
^C
[root@dhcp46-112 ~]# for i in {1..10}; do heketi-cli blockvolume create --size 2; sleep 10; done
Name: blockvol_83731e140fee47266f5cfb93f6f936e4
Size: 2
Volume Id: 83731e140fee47266f5cfb93f6f936e4
Cluster Id: 5f1ab6259776ad529416665ce11442c2
Hosts: [10.70.46.33 10.70.46.210 10.70.46.180]
IQN: iqn.2016-12.org.gluster-block:74e0c7f6-4221-4a7a-aa94-b107a453157f
LUN: 0
Hacount: 3
Username: 
Password: 
Block Hosting Volume: 496dc87f2d11a47472862adbb70410c2
Name: blockvol_4ffa8a133b30846b830208530fffdb8e
Size: 2
Volume Id: 4ffa8a133b30846b830208530fffdb8e
Cluster Id: 5f1ab6259776ad529416665ce11442c2
Hosts: [10.70.46.33 10.70.46.210 10.70.46.180]
IQN: iqn.2016-12.org.gluster-block:7275ff7f-efca-4b47-96d5-c9c8c1108fd3
LUN: 0
Hacount: 3
Username: 
Password: 
Block Hosting Volume: 496dc87f2d11a47472862adbb70410c2
Name: blockvol_baf36e91727341161c4e2d88c76df97d
Size: 2
Volume Id: baf36e91727341161c4e2d88c76df97d
Cluster Id: 5f1ab6259776ad529416665ce11442c2
Hosts: [10.70.46.33 10.70.46.210 10.70.46.180]
IQN: iqn.2016-12.org.gluster-block:98afe4c1-37c6-4927-aa57-35dc70ec62e2
LUN: 0
Hacount: 3
Username: 
Password: 
Block Hosting Volume: 496dc87f2d11a47472862adbb70410c2
Name: blockvol_028fb64abcc841cc445078c16fefeee0
Size: 2
Volume Id: 028fb64abcc841cc445078c16fefeee0
Cluster Id: 5f1ab6259776ad529416665ce11442c2
Hosts: [10.70.46.33 10.70.46.210 10.70.46.180]
IQN: iqn.2016-12.org.gluster-block:b45c6024-3d96-4ab7-9988-ad96d720e58a
LUN: 0
Hacount: 3
Username: 
Password: 
Block Hosting Volume: 496dc87f2d11a47472862adbb70410c2
Name: blockvol_a4606ef5439b583a26bc261dc9d343a4
Size: 2
Volume Id: a4606ef5439b583a26bc261dc9d343a4
Cluster Id: 5f1ab6259776ad529416665ce11442c2
Hosts: [10.70.46.33 10.70.46.210 10.70.46.180]
IQN: iqn.2016-12.org.gluster-block:54fd238f-6622-4231-af41-8aa5e139d29f
LUN: 0
Hacount: 3
Username: 
Password: 
Block Hosting Volume: 496dc87f2d11a47472862adbb70410c2
Name: blockvol_88ef67b977a07c6633bf4bf3f69c5c5a
Size: 2
Volume Id: 88ef67b977a07c6633bf4bf3f69c5c5a
Cluster Id: 5f1ab6259776ad529416665ce11442c2
Hosts: [10.70.46.33 10.70.46.210 10.70.46.180]
IQN: iqn.2016-12.org.gluster-block:49f6b756-cab6-4dec-8161-7714c4d9e990
LUN: 0
Hacount: 3
Username: 
Password: 
Block Hosting Volume: 496dc87f2d11a47472862adbb70410c2
Name: blockvol_9ac02fec547b343ee63c3666035cfbb5
Size: 2
Volume Id: 9ac02fec547b343ee63c3666035cfbb5
Cluster Id: 5f1ab6259776ad529416665ce11442c2
Hosts: [10.70.46.33 10.70.46.210 10.70.46.180]
IQN: iqn.2016-12.org.gluster-block:0d6bbbbf-15ed-4a83-a8d5-707be457466e
LUN: 0
Hacount: 3
Username: 
Password: 
Block Hosting Volume: 496dc87f2d11a47472862adbb70410c2
Name: blockvol_a498a2c231f9d9cc54ff7dde66cdbcf3
Size: 2
Volume Id: a498a2c231f9d9cc54ff7dde66cdbcf3
Cluster Id: 5f1ab6259776ad529416665ce11442c2
Hosts: [10.70.46.33 10.70.46.210 10.70.46.180]
IQN: iqn.2016-12.org.gluster-block:a65560df-d871-499e-8c86-0c47e49ad767
LUN: 0
Hacount: 3
Username: 
Password: 
Block Hosting Volume: 496dc87f2d11a47472862adbb70410c2
Name: blockvol_354eb41b6e59ede74af63c1e911b1ff8
Size: 2
Volume Id: 354eb41b6e59ede74af63c1e911b1ff8
Cluster Id: 5f1ab6259776ad529416665ce11442c2
Hosts: [10.70.46.33 10.70.46.210 10.70.46.180]
IQN: iqn.2016-12.org.gluster-block:bf157a44-fce5-4865-b124-8c99ebd33e81
LUN: 0
Hacount: 3
Username: 
Password: 
Block Hosting Volume: 496dc87f2d11a47472862adbb70410c2
Name: blockvol_07096d6cec709984d7b95273cafea45f
Size: 2
Volume Id: 07096d6cec709984d7b95273cafea45f
Cluster Id: 5f1ab6259776ad529416665ce11442c2
Hosts: [10.70.46.33 10.70.46.210 10.70.46.180]
IQN: iqn.2016-12.org.gluster-block:b8d3424b-48a0-4a41-950a-05f049435860
LUN: 0
Hacount: 3
Username: 
Password: 
Block Hosting Volume: 496dc87f2d11a47472862adbb70410c2

Block Hosting volume status:
=================================
sh-4.2# gluster volume status vol_496dc87f2d11a47472862adbb70410c2
Status of volume: vol_496dc87f2d11a47472862adbb70410c2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.33:/var/lib/heketi/mounts/vg
_c44f71bbf3f2035e8936111fc6ce6244/brick_194
d5d0e1971933856da25224f565a1e/brick         49153     0          Y       8349 
Brick 10.70.46.210:/var/lib/heketi/mounts/v
g_dc74b76a0b18037621224ed50cd88159/brick_a9
4b7097023bb6d006820c73d3dfe19a/brick        N/A       N/A        N       N/A  
Brick 10.70.46.180:/var/lib/heketi/mounts/v
g_43a5dc54998979b278743f0b2813b82e/brick_88
fc31606378938852a7aac8a5ddf7e7/brick        49153     0          Y       8282 
Self-heal Daemon on localhost               N/A       N/A        Y       8566 
Self-heal Daemon on 10.70.46.180            N/A       N/A        Y       8567 
Self-heal Daemon on dhcp46-33.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       8614 
 
Task Status of Volume vol_496dc87f2d11a47472862adbb70410c2
------------------------------------------------------------------------------
There are no active volume tasks
 

Deleted the gluster pod to bring the brick on that node back up, and both the pod and the brick came back successfully.
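For reference, the pod restart was done along these lines (the pod name is a placeholder; in OCS the glusterfs DaemonSet recreates the deleted pod, which brings the brick process back up):

# Find and delete the glusterfs pod whose brick was killed; the DaemonSet recreates it
oc get pods -o wide | grep glusterfs
oc delete pod <glusterfs-storage-pod>
# Once the new pod is Running, the brick shows Online again in 'gluster volume status'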

sh-4.2# gluster volume status vol_496dc87f2d11a47472862adbb70410c2     
Status of volume: vol_496dc87f2d11a47472862adbb70410c2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.33:/var/lib/heketi/mounts/vg
_c44f71bbf3f2035e8936111fc6ce6244/brick_194
d5d0e1971933856da25224f565a1e/brick         49153     0          Y       8349 
Brick 10.70.46.210:/var/lib/heketi/mounts/v
g_dc74b76a0b18037621224ed50cd88159/brick_a9
4b7097023bb6d006820c73d3dfe19a/brick        49153     0          Y       586  
Brick 10.70.46.180:/var/lib/heketi/mounts/v
g_43a5dc54998979b278743f0b2813b82e/brick_88
fc31606378938852a7aac8a5ddf7e7/brick        49153     0          Y       8282 
Self-heal Daemon on localhost               N/A       N/A        Y       537  
Self-heal Daemon on 10.70.46.180            N/A       N/A        Y       8567 
Self-heal Daemon on 10.70.46.33             N/A       N/A        Y       8614 
 
Task Status of Volume vol_496dc87f2d11a47472862adbb70410c2
------------------------------------------------------------------------------
There are no active volume tasks

sh-4.2# gluster volume heal vol_496dc87f2d11a47472862adbb70410c2 info
Brick 10.70.46.33:/var/lib/heketi/mounts/vg_c44f71bbf3f2035e8936111fc6ce6244/brick_194d5d0e1971933856da25224f565a1e/brick
Status: Connected
Number of entries: 0

Brick 10.70.46.210:/var/lib/heketi/mounts/vg_dc74b76a0b18037621224ed50cd88159/brick_a94b7097023bb6d006820c73d3dfe19a/brick
Status: Connected
Number of entries: 0

Brick 10.70.46.180:/var/lib/heketi/mounts/vg_43a5dc54998979b278743f0b2813b82e/brick_88fc31606378938852a7aac8a5ddf7e7/brick
Status: Connected
Number of entries: 0



Started deleting the block volumes in a loop and killed one of the bricks of the block hosting volume. When the brick of the block hosting volume was killed, I did not see any errors related to "Transport endpoint is not connected", and block volume deletion was successful.


sh-4.2# gluster volume status testbug 
Status of volume: testbug
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.115:/var/lib/heketi/mounts/v
g_0f0b820a8ce7358d933692c82db220fe/brick_58
b2c2bd4dac2a00456a4d2ae4120f86/brick        49153     0          Y       1081 
Brick 10.70.47.37:/var/lib/heketi/mounts/vg
_b50ab37d73d2af28ae5ecd7346e44d52/brick_54a
6e4353e50fe6beef36072b7184831/brick         49153     0          Y       8742 
Brick 10.70.47.2:/var/lib/heketi/mounts/vg_
0396ecafc0f1b5976a6b9663a9130a4c/brick_bd60
829b762b096d946867afa9174111/brick          N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       8883 
Self-heal Daemon on dhcp46-115.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       1102 
Self-heal Daemon on 10.70.47.37             N/A       N/A        Y       8763 
 
Task Status of Volume testbug
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp46-138 ~]# for i in `heketi-cli blockvolume list | cut -d " " -f1 | cut -d ":" -f2`; do heketi-cli blockvolume delete $i;sleep 5;done
Volume 02667d11a02f48a762da9612da4c5a9d deleted
Volume 216fda528142b403a0b442eaee75e22d deleted
Volume 3cd7d01ec0c311e809d59de54f6f327d deleted
Volume 56149b16e94215ca3f406b37a76d0bce deleted
Volume 7ab23813b5a13fe514010d042a3b9b01 deleted
Volume 7de376ec071a5775d2cd085f1c03ae88 deleted
Volume 89e4d92fd9208ef592344db87480ead9 deleted
Volume b9478a6189b817cd2c4111d0e536a570 deleted
Volume c829b46c8d4a02c6da2012421884bcba deleted
Volume d251f96011b17bf8960ef4fccd849c17 deleted

sh-4.2# gluster volume heal testbug info
Brick 10.70.46.115:/var/lib/heketi/mounts/vg_0f0b820a8ce7358d933692c82db220fe/brick_58b2c2bd4dac2a00456a4d2ae4120f86/brick
Status: Connected
Number of entries: 0

Brick 10.70.47.37:/var/lib/heketi/mounts/vg_b50ab37d73d2af28ae5ecd7346e44d52/brick_54a6e4353e50fe6beef36072b7184831/brick
Status: Connected
Number of entries: 0

Brick 10.70.47.2:/var/lib/heketi/mounts/vg_0396ecafc0f1b5976a6b9663a9130a4c/brick_bd60829b762b096d946867afa9174111/brick
Status: Connected
Number of entries: 0

Comment 16 Prasanna Kumar Kalever 2019-02-08 10:16:18 UTC
Given that this is a year-old bug, that a lot of improvements have been made to healing in core glusterfs, that the issue is no longer reproducible in the attempt to capture the requested details, and that it works as expected with the latest OCS release, IMO we should close this bug now.

@Kasturi, please feel free to close this bug as WORKSFORME, since it worked as expected according to comment 15.

Appreciate all your effort in attempting to reproduce this bug.


Thanks!

Comment 17 RamaKasturi 2019-02-08 10:23:55 UTC
According to comment 15 and comment 16, closing this bug.

