Bug 1063606 - DHT: REBALANCE- spurious data movements upon starting rebalance
Summary: DHT: REBALANCE- spurious data movements upon starting rebalance
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: distribute
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nithya Balachandran
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1286129
 
Reported: 2014-02-11 06:31 UTC by shylesh
Modified: 2015-11-27 11:40 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned To: 1286129
Environment:
Last Closed: 2015-11-27 11:40:33 UTC


Attachments

Description shylesh 2014-02-11 06:31:31 UTC
Description of problem:
Rebalance tries to migrate some of the files from source ---> dest ----> source; that is, a file that has already been moved to the new brick is then moved back again.

Version-Release number of selected component (if applicable):
3.4.0.59rhs-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Created a single-brick distribute volume.
2. Created 10 files on the mount point, named 1 to 10 (all the files will land on the same brick).
3. Added one more brick and started rebalance:
gluster volume rebalance <vol> start
4. Once the migration is complete (some files may be skipped because of space issues), check the file distribution on the backend bricks.


Actual results:
A file was moved from brick0 to brick1, and then the rebalance process on brick1's node tried to move the same file back to brick0's node.
 


Additional info:
[root@rhs-client9 ser]# gluster v info ser
 
Volume Name: ser
Type: Distribute
Volume ID: 186b3c85-81d4-4d09-81bd-847cbd4178d3
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: rhs-client9.lab.eng.blr.redhat.com:/home/ser0  -->initial brick
Brick2: rhs-client39.lab.eng.blr.redhat.com:/home/ser1 --> brick added later

Initially ser0 held all 10 files. After rebalance, the ls output from each brick is shown below.


brick0
--------
[root@rhs-client9 ser]# ll /home/ser0
total 5120
---------T 2 root root 1048576 Feb 11 11:04 1
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 10
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 2
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 3
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 4
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 6


brick1
=======
[root@rhs-client39 ser1]# ll
total 5120
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 1
---------T 2 root root       0 Feb 11 11:04 10
---------T 2 root root       0 Feb 11 11:04 2
---------T 2 root root       0 Feb 11 11:04 3
---------T 2 root root       0 Feb 11 11:04 4
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 5
---------T 2 root root       0 Feb 11 11:04 6
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 7
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 8
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 9

Why are all the files shown on brick1 when it is supposed to hold only a few of them? (The zero-byte ---------T entries are DHT link files that point to the data on the other brick, not regular data files.)
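
For reference, here is a minimal sketch of how such entries can be identified from the backend. It is illustrative only, not code from the GlusterFS tree: the helper name is made up, and it assumes the standard DHT convention that link files are zero-byte, sticky-bit-only files carrying the trusted.glusterfs.dht.linkto xattr.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/xattr.h>

/* Hypothetical helper: returns 1 if path looks like a DHT link file. */
static int is_dht_link_file(const char *path)
{
    struct stat st;
    char target[256] = {0};

    if (lstat(path, &st) != 0 || !S_ISREG(st.st_mode))
        return 0;

    /* Mode ---------T: sticky bit set, no rwx bits, zero size. */
    if (st.st_size != 0 || !(st.st_mode & S_ISVTX) || (st.st_mode & 0777))
        return 0;

    /* The linkto xattr names the subvolume that holds the data. */
    if (lgetxattr(path, "trusted.glusterfs.dht.linkto",
                  target, sizeof(target) - 1) <= 0)
        return 0;

    printf("%s -> data on %s\n", path, target);
    return 1;
}

int main(void)
{
    /* e.g. one of the ---------T entries listed above */
    is_dht_link_file("/home/ser1/2");
    return 0;
}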

Rebalance log from rhs-client9.lab.eng.blr.redhat.com, which hosts the first brick
=======================================

[2014-02-11 05:34:01.030319] I [dht-rebalance.c:672:dht_migrate_file] 0-ser-dht: /1: attempting to move from ser-client-0 to ser-client-1
[2014-02-11 05:34:01.115913] I [dht-rebalance.c:881:dht_migrate_file] 0-ser-dht: completed migration of /1 from subvolume ser-client-0 to ser-client-1


This shows that file 1 was migrated to the other brick.


Rebalance log from rhs-client39.lab.eng.blr.redhat.com, which hosts the second brick
---------------------------------------
[2014-02-11 05:34:05.037926] I [dht-rebalance.c:1121:gf_defrag_migrate_data] 0-ser-dht: migrate data called on /
[2014-02-11 05:34:05.117622] I [dht-rebalance.c:672:dht_migrate_file] 0-ser-dht: /1: attempting to move from ser-client-1 to ser-client-0
[2014-02-11 05:34:05.121093] W [dht-rebalance.c:374:__dht_check_free_space] 0-ser-dht: data movement attempted from node (ser-client-1) with higher disk space to a node (ser-client-0) with lesser disk space (/1)


Once file 1 has been migrated, why is rebalance trying to migrate it back again?
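
The warning above comes from __dht_check_free_space, which in this build evidently rejects a migration when the destination has less free space than the source; that guard appears to be the only reason the spurious second attempt did not actually move the file back. A minimal sketch of that kind of guard, using made-up names and statvfs (this is not the actual GlusterFS implementation):

#include <stdio.h>
#include <sys/statvfs.h>

/* Hypothetical guard: allow migration only if the destination
 * brick has at least as much free space as the source brick. */
static int check_free_space(const char *src_brick, const char *dst_brick)
{
    struct statvfs src, dst;

    if (statvfs(src_brick, &src) != 0 || statvfs(dst_brick, &dst) != 0)
        return 0;

    unsigned long long src_free =
        (unsigned long long)src.f_bavail * src.f_frsize;
    unsigned long long dst_free =
        (unsigned long long)dst.f_bavail * dst.f_frsize;

    if (dst_free < src_free) {
        fprintf(stderr, "data movement attempted from a node with "
                "higher disk space to a node with lesser disk space\n");
        return 0;
    }
    return 1;
}

int main(void)
{
    printf("allowed: %d\n", check_free_space("/home/ser1", "/home/ser0"));
    return 0;
}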


cluster info
==============
rhs-client9.lab.eng.blr.redhat.com
rhs-client39.lab.eng.blr.redhat.com

mount point
----------
rhs-client9.lab.eng.blr.redhat.com:/ser


Attaching the sosreports.

Comment 3 vsomyaju 2014-02-12 08:24:05 UTC
From debugging, I found the following:

Link-file creation happens because the maximum overlap cannot be stored in the variable used for it: if the overlap range runs from 0x00000000 to 0xffffffff, its total length, 0xffffffff + 1, cannot be represented in a uint32_t. Different rebalance processes therefore compute and set different layouts, which ends up in link-file creation.
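
To make the arithmetic concrete: the length of the full 32-bit hash range overflows a 32-bit counter, so the computed "maximum overlap" comes out wrong and the processes disagree on the layout. A standalone demonstration of just the overflow (not GlusterFS code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t start = 0x00000000;
    uint32_t stop  = 0xffffffff;

    /* 0xffffffff + 1 wraps to 0 in 32-bit arithmetic. */
    uint32_t overlap32 = stop - start + 1;

    /* Widening to 64 bits preserves the true length, 2^32. */
    uint64_t overlap64 = (uint64_t)stop - start + 1;

    printf("uint32_t overlap: %u\n", overlap32);       /* prints 0 */
    printf("uint64_t overlap: %llu\n",
           (unsigned long long)overlap64);             /* prints 4294967296 */
    return 0;
}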

Comment 4 shylesh 2014-07-30 17:41:24 UTC
I could reproduce this bug easily on a 50-node setup.

Created a 46-brick distribute volume, created 100 files on the mount point with touch f{1..100}, added 4 more bricks, and ran rebalance.

log snippet
===========
Node 1
=======
[2014-07-30 10:30:33.793257] I [dht-rebalance.c:672:dht_migrate_file] 0-newvol-dht: /f28: attempting to move from newvol-client-28 to newvol-client-4
[2014-07-30 10:30:33.820710] I [dht-rebalance.c:881:dht_migrate_file] 0-newvol-dht: completed migration of /f28 from subvolume newvol-client-28 to newvol-client-4


Node2
=====
[2014-07-30 10:30:32.425980] I [dht-rebalance.c:672:dht_migrate_file] 0-newvol-dht: /f28: attempting to move from newvol-client-29 to newvol-client-28
[2014-07-30 10:30:32.453402] I [dht-rebalance.c:881:dht_migrate_file] 0-newvol-dht: completed migration of /f28 from subvolume newvol-client-29 to newvol-client-28

Node2 migrated the file from client-29 to client-28; later, Node1 migrated the same file again, from client-28 to client-4.

The logs above show that a file can be migrated again and again, which has a serious performance impact.

Comment 5 shylesh 2014-07-30 17:42:28 UTC
The build used for the testing in comment 4 is 3.4.0.59rhs-1.el6rhs.x86_64.

Comment 6 Susant Kumar Palai 2015-11-27 11:40:33 UTC
Cloning this to 3.1. To be fixed in a future release.

