
Bug 1066960

Summary: DHT: REBALANCE - Remove-brick without commit followed by rebalance causes migration failures
Product: Red Hat Gluster Storage
Reporter: shylesh <shmohan>
Component: distribute
Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED DEFERRED
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 2.1
CC: nlevinki, spalai, vbellur
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1286123
Environment:
Last Closed: 2015-11-27 11:35:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1286123

Description shylesh 2014-02-19 11:42:08 UTC
Description of problem:
Starting a remove-brick operation, stopping it partway through (without committing), and then running add-brick followed by a rebalance causes migration failures for some of the files.

Version-Release number of selected component (if applicable):

How reproducible:
Tried once

Steps to Reproduce:
1. Create a distribute volume with 6 bricks.
2. Create some directories and files on the mount.
3. Start removing 2 bricks:
gluster v remove-brick <vol> b1 b2 start
4. While migration is in progress, stop the remove-brick operation:
gluster v remove-brick <vol> b1 b2 stop
5. Add 2 more bricks to the volume and start a rebalance.
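
For anyone retrying this, steps 3 to 5 can be scripted roughly as below. This is only a sketch: the `run` and `repro` helper names, the `DRY_RUN` and `MIGRATION_WAIT` variables, and the brick arguments are all hypothetical and not taken from this report. The `gluster volume` subcommands are the long forms of the `gluster v` commands in the steps. It prints the commands by default instead of running them.

```shell
# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 only on a disposable test cluster.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# repro VOL "BRICKS_TO_REMOVE" "BRICKS_TO_ADD"
# Brick lists are space-separated host:/path strings (hypothetical values).
repro() {
    vol=$1
    remove_bricks=$2
    new_bricks=$3

    # Steps 3 and 4: start decommissioning, then stop before it completes.
    # The brick lists are deliberately unquoted so they split into words.
    run gluster volume remove-brick "$vol" $remove_bricks start
    run sleep "${MIGRATION_WAIT:-10}"   # arbitrary wait so some files migrate
    run gluster volume remove-brick "$vol" $remove_bricks stop

    # Step 5: add new bricks and rebalance.
    run gluster volume add-brick "$vol" $new_bricks
    run gluster volume rebalance "$vol" start
    run gluster volume rebalance "$vol" status
}
```

Calling `repro dt "host1:/bricks/b1 host2:/bricks/b2" "host1:/bricks/b7 host2:/bricks/b8"` with placeholder bricks prints the six commands in order, which makes it easy to review the sequence before running it for real.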

Actual results:
Some files fail to migrate during the rebalance (see the failures column in the status output below).

Additional info:

[root@rhs-client9 ~]# gluster v info dt
Volume Name: dt
Type: Distribute
Volume ID: b3bc1409-8f46-48dc-8b50-10c1e66d9528
Status: Started
Number of Bricks: 8
Transport-type: tcp
Brick5:  * decommissioned brick
Brick6:   * decommissioned brick

[root@rhs-client9 mnt]# #gluster v remove-brick dt stop
[root@rhs-client9 mnt]# gluster v rebalance dt status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost              431         4.1GB          1811             0             0            completed             166.00
                                                      240         1.1GB          1650             9             0            completed             128.00
                                                      274         1.1GB          1667             4             0            completed             129.00
volume rebalance: dt: success: 
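
To total the failures column across nodes, something like the following awk filter could be used. It is a sketch, not part of the gluster CLI: the function name `sum_rebalance_failures` is hypothetical, and it assumes the size column looks like `4.1GB` or `0Bytes`. It finds the size field in each row and takes the value two columns to its right, so it works whether or not the node name is present.

```shell
# Sum the "failures" column of `gluster v rebalance <vol> status` output
# read from stdin. Column order after size: scanned, failures, skipped, ...
sum_rebalance_failures() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                # Locate the size field (assumed format: number plus unit).
                if ($i ~ /^[0-9.]+(Bytes|KB|MB|GB|TB|PB)$/) {
                    if (i + 2 <= NF) total += $(i + 2)
                    break
                }
            }
        }
        END { print total + 0 }
    '
}
```

Piping the status output above through it (`gluster v rebalance dt status | sum_rebalance_failures`) should print 13, that is 0 + 9 + 4.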

The rebalance log says:
[2014-02-19 11:07:35.678282] I [dht-layout.c:646:dht_layout_normalize] 0-dt-dht: found anomalies in /another/2/2/1/0/2/1. holes=4 overlaps=1 missing=2 down=0 misc=0
[2014-02-19 11:07:35.679255] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-dt-client-6: remote operation failed: File exists. Path: /another/2/2/1/0/2/1
[2014-02-19 11:07:35.689081] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-dt-client-7: remote operation failed: File exists. Path: /another/2/2/1/0/2/1
[2014-02-19 11:07:35.730577] I [dht-common.c:2646:dht_setxattr] 0-dt-dht: fixing the layout of /another/2/2/1/0/2/1

[2014-02-19 11:07:37.574121] I [dht-common.c:1142:dht_lookup_linkfile_cbk] 0-dt-dht: lookup of /another/2/2/2/0/file.0 on dt-client-6 (following linkfile) reached link

[2014-02-19 11:07:37.574926] W [dht-common.c:1022:dht_lookup_everywhere_cbk] 0-dt-dht: multiple subvolumes (dt-client-2 and dt-client-6) have file /another/2/2/2/0/file.0 (preferably rename the file in the backend, and do a fresh lookup)
[2014-02-19 11:07:37.575420] W [client-rpc-fops.c:256:client3_3_mknod_cbk] 0-dt-client-6: remote operation failed: File exists. Path: /another/2/2/2/0/file.0
[2014-02-19 11:07:37.575907] W [dht-linkfile.c:44:dht_linkfile_lookup_cbk] 0-dt-dht: got non-linkfile dt-client-6:/another/2/2/2/0/file.0

[2014-02-19 11:07:40.992898] E [dht-rebalance.c:1276:gf_defrag_migrate_data] 0-dt-dht: /another/2/2/2/2/0/file.0: failed to get trusted.distribute.linkinfo key - No such file or directory
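
To list every file that hit this linkinfo error, the rebalance log can be filtered with sed as sketched below. The function name `failed_linkinfo_paths` is hypothetical, and the pattern is matched only against the message format shown in the log line above, so other gluster versions may word it differently.

```shell
# Print the path from each "failed to get trusted.distribute.linkinfo key"
# error. Reads the log files given as arguments, or stdin if none are given.
failed_linkinfo_paths() {
    sed -n 's/.*gf_defrag_migrate_data] [^ ]*: \(.*\): failed to get trusted\.distribute\.linkinfo key.*/\1/p' "$@"
}
```

Run against the rebalance log, it prints one path per failed file, for example `/another/2/2/2/2/0/file.0` for the entry above.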

[root@rhs-client9 mnt]# getfattr -d -m . -e hex /home/dt*/another/2/2/2/0
getfattr: Removing leading '/' from absolute path names
# file: home/dt0/another/2/2/2/0

# file: home/dt3/another/2/2/2/0

# file: home/dt6/another/2/2/2/0

[root@rhs-client39 dt4]# getfattr -d -m . -e hex /home/dt*/another/2/2/2/0
getfattr: Removing leading '/' from absolute path names
# file: home/dt1/another/2/2/2/0

# file: home/dt4/another/2/2/2/0

# file: home/dt7/another/2/2/2/0

[root@rhs-client4 ~]# getfattr -d -m . -e hex /home/dt*/another/2/2/2/0
getfattr: Removing leading '/' from absolute path names
# file: home/dt2/another/2/2/2/0

# file: home/dt5/another/2/2/2/0

Comment 3 Susant Kumar Palai 2015-11-27 11:35:29 UTC
Cloning this to 3.1. To be fixed in a future release.