Bug 1366813 - Second gluster volume is offline after daemon restart or server reboot
Summary: Second gluster volume is offline after daemon restart or server reboot
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.8.2
Hardware: x86_64
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Samikshan Bairagya
QA Contact:
URL:
Whiteboard:
Duplicates: 1368347 (view as bug list)
Depends On: 1367478
Blocks: glusterfs-3.8.3
 
Reported: 2016-08-13 00:51 UTC by Daniel
Modified: 2016-08-24 10:21 UTC
CC List: 6 users

Fixed In Version: glusterfs-3.8.3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1367478 (view as bug list)
Environment:
Last Closed: 2016-08-24 10:21:04 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments
glustershd.log - VolumeB offline and no PID (deleted) - 2016-08-13 00:51 UTC, Daniel
glustershd.log with stopped VolumeA and working VolumeB (deleted) - 2016-08-13 00:52 UTC, Daniel

Description Daniel 2016-08-13 00:51:50 UTC
Created attachment 1190594 [details]
glustershd.log - VolumeB offline and no PID

Description of problem:

When two volumes are in use, only the first one comes online and receives a PID after a GlusterFS daemon restart or a server reboot. Tested with replicated volumes only.

Version-Release number of selected component (if applicable): 

Debian Jessie, GlusterFS 3.8.2

How reproducible:

Every time.

Steps to Reproduce:

1. Create replicated volumes VolumeA and VolumeB, whose bricks are on Node1 and Node2.
2. Start both volumes.
3. Restart glusterfs-server.service on Node2 or reboot Node2 (see the command sketch below).
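
For reference, the steps above correspond roughly to the following commands (a sketch only; the replica count and create options are assumptions inferred from the status output further down, not a verbatim transcript):

gluster volume create VolumeA replica 2 node1:/glusterfs/VolumeA node2:/glusterfs/VolumeA
gluster volume create VolumeB replica 2 node1:/glusterfs/VolumeB node2:/glusterfs/VolumeB
gluster volume start VolumeA
gluster volume start VolumeB
# on Node2:
systemctl restart glusterfs-server.service    # or: reboot
gluster volume status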

Actual results:

VolumeA is fine, but VolumeB is offline and does not get a PID on Node2.

Expected results:

Both VolumeA and VolumeB are online with a PID.

Additional info:

A "gluster volume start VolumeB force" fixes it.

When VolumeA is stopped and the test is repeated by rebooting Node2 again, VolumeB works as expected (online and with a PID).
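
That secondary test is roughly the following (a hypothetical command sequence, assuming the same two-node setup):

gluster volume stop VolumeA
# reboot Node2, then on Node2:
gluster volume status VolumeB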

Logfiles are attached.


Status output of node2 after the reboot:

Status of volume: VolumeA
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick node1:/glusterfs/VolumeA              49155     0          Y       1859 
Brick node2:/glusterfs/VolumeA              49153     0          Y       1747 
Self-heal Daemon on localhost               N/A       N/A        Y       26188
Self-heal Daemon on node1                   N/A       N/A        Y       21770
 
Task Status of Volume VolumeA
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: VolumeB
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick node1:/glusterfs/VolumeB              49154     0          Y       1973 
Brick node2:/glusterfs/VolumeB              N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       26188
Self-heal Daemon on node1                   N/A       N/A        Y       21770
 
Task Status of Volume VolumeB
------------------------------------------------------------------------------
There are no active volume tasks

Comment 1 Daniel 2016-08-13 00:52:55 UTC
Created attachment 1190595 [details]
glustershd.log with stopped VolumeA and working VolumeB

Comment 2 Atin Mukherjee 2016-08-16 04:35:23 UTC
Thank you for reporting this issue. It's a regression caused by http://review.gluster.org/14758, which was backported to 3.8.2. We will work on a fix for 3.8.3. Keep testing :)

Comment 3 Vijay Bellur 2016-08-17 10:13:45 UTC
REVIEW: http://review.gluster.org/15186 (glusterd: Fix volume restart issue upon glusterd restart) posted (#1) for review on release-3.8 by Samikshan Bairagya (samikshan@gmail.com)

Comment 4 Vijay Bellur 2016-08-18 08:39:37 UTC
COMMIT: http://review.gluster.org/15186 committed in release-3.8 by Atin Mukherjee (amukherj@redhat.com) 
------
commit 24b499447a69c5e2979e15a99b16d5112be237d0
Author: Samikshan Bairagya <samikshan@gmail.com>
Date:   Tue Aug 16 16:46:41 2016 +0530

    glusterd: Fix volume restart issue upon glusterd restart
    
    http://review.gluster.org/#/c/14758/ introduces a check in
    glusterd_restart_bricks that makes sure that if server quorum is
    enabled and if the glusterd instance has been restarted, the bricks
    do not get started. This prevents bricks which have been brought
    down purposely, say for maintenance, from getting started
    upon a glusterd restart. However, this change introduced a regression
    for a situation that involves multiple volumes. The bricks from
    the first volume get started, but then for the subsequent volumes
    the bricks do not get started. This patch fixes that by setting
    the value of conf->restart_done to _gf_true only after bricks are
    started correctly for all volumes.
    
    > Reviewed-on: http://review.gluster.org/15183
    > Smoke: Gluster Build System <jenkins@build.gluster.org>
    > NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    > CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    > Reviewed-by: Atin Mukherjee <amukherj@redhat.com>
    
    (cherry picked from commit dd8d93f24a320805f1f67760b2d3266555acf674)
    
    Change-Id: I2c685b43207df2a583ca890ec54dcccf109d22c3
    BUG: 1366813
    Signed-off-by: Samikshan Bairagya <samikshan@gmail.com>
    Reviewed-on: http://review.gluster.org/15186
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Atin Mukherjee <amukherj@redhat.com>

Comment 5 Atin Mukherjee 2016-08-19 07:35:33 UTC
*** Bug 1368347 has been marked as a duplicate of this bug. ***

Comment 6 Kaushal 2016-08-22 10:02:18 UTC
(Adding a hopefully friendlier description of the problem)

On restarting GlusterD or rebooting a GlusterFS server, only the bricks of the first volume get started. The bricks of the remaining volumes are not started.
This is a regression caused by a change in GlusterFS-3.8.2.

Because of this regression, GlusterFS volumes will be left in an inoperable state after upgrading to 3.8.2, as upgrading involves restarting GlusterD. Users can forcefully start the remaining volumes by running the `gluster volume start <name> force` command.

This also breaks the automatic start of volumes when servers are rebooted, leaving the volumes inoperable.
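
As a stopgap until 3.8.3, the volumes that failed to start can be force-started on the affected node, for example with something like the following (a sketch, assuming a bash shell and the gluster CLI; note that it force-starts every volume, including any that were stopped on purpose):

# force-start all volumes so that any bricks that failed to come up are started
for vol in $(gluster volume list); do
    gluster volume start "$vol" force
done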

Comment 7 Daniel 2016-08-23 09:26:13 UTC
I just retested the issue with GlusterFS-3.8.3 and it seems to be solved.
After a reboot or a manual daemon restart, all volumes are online again.

Thanks a lot for the fix! :)

Comment 8 Niels de Vos 2016-08-24 10:21:04 UTC
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.8.3, please open a new bug report.

glusterfs-3.8.3 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://www.gluster.org/pipermail/announce/2016-August/000059.html
[2] https://www.gluster.org/pipermail/gluster-users/

