Bug 1056997 - [RHSC] Host does not move to non-operational even after glusterd is brought down
Summary: [RHSC] Host does not move to non-operational even after glusterd is brought down
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rhsc
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: RHGS 2.1.2
Assignee: Kanagaraj
QA Contact: RamaKasturi
Depends On:
Reported: 2014-01-23 10:12 UTC by RamaKasturi
Modified: 2016-04-18 10:06 UTC (History)
9 users

Fixed In Version: cb16
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2014-02-25 08:15:22 UTC
Target Upstream Version:

Attachments (Terms of Use)

System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2014:0208 normal SHIPPED_LIVE Red Hat Storage 2.1 enhancement and bug fix update #2 2014-02-25 12:20:30 UTC
oVirt gerrit 23737 None None None Never
oVirt gerrit 27838 ovirt-engine-3.4 MERGED gluster: check gluster daemon status in sync-job Never

Description RamaKasturi 2014-01-23 10:12:14 UTC
Description of problem:
The engine fires the gluster tasks query on a node even when glusterd is down.
2014-01-23 20:57:56,595 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (DefaultQuartzScheduler_Worker-68) START, GlusterTasksListVDSCommand(HostName =, HostId = 6b7ec009-ef68-4110-aa2a-a12fb9f1daaf), log id: 1402ad12

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Have a 4-node cluster.
2. Stop glusterd on the first node by executing the command "service glusterd stop".

Actual results:
The host does not move to non-operational.

Expected results:
The host should move to non-operational once glusterd is down on that node.
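The fix tracked above (oVirt gerrit 27838, "gluster: check gluster daemon status in sync-job") checks the daemon status before firing the query. A minimal sketch of that idea in Python, with illustrative function names that are not the actual ovirt-engine API:

```python
def sync_gluster_tasks(host, is_glusterd_up, run_tasks_query, set_status):
    """Run the tasks query only when glusterd responds on the host.

    Hypothetical sketch: if glusterd is down, mark the host
    non-operational instead of firing the query against it.
    """
    if not is_glusterd_up(host):
        set_status(host, "NonOperational")
        return None
    return run_tasks_query(host)
```

With glusterd down on a node, the query is skipped and the host status is set to NonOperational, which is the behaviour expected above.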

Additional info:

Comment 1 RamaKasturi 2014-01-23 10:13:28 UTC
Logs :

2014-01-23 20:58:04,600 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (DefaultQuartzScheduler_Worker-68) FINISH, GlusterTasksListVDSCommand, log id: 1402ad12
2014-01-23 20:58:04,600 ERROR [org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService] (DefaultQuartzScheduler_Worker-68) org.ovirt.engine.core.common.errors.VDSError@6e0a71a5
2014-01-23 20:58:04,601 ERROR [org.ovirt.engine.core.bll.gluster.GlusterTasksSyncJob] (DefaultQuartzScheduler_Worker-68) Error updating tasks from CLI: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Command execution failed
error: Connection failed. Please check if gluster daemon is operational.
return code: 1 (Failed with error GlusterVolumeStatusAllFailedException and code 4161)
	at org.ovirt.engine.core.bll.gluster.tasks.GlusterTasksService.getTaskListForCluster( [bll.jar:]
	at org.ovirt.engine.core.bll.gluster.GlusterTasksSyncJob.updateGlusterAsyncTasks( [bll.jar:]
	at sun.reflect.GeneratedMethodAccessor96.invoke(Unknown Source) [:1.7.0_45]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke( [rt.jar:1.7.0_45]
	at java.lang.reflect.Method.invoke( [rt.jar:1.7.0_45]
	at org.ovirt.engine.core.utils.timer.JobWrapper.execute( [scheduler.jar:]
	at [quartz.jar:]
	at org.quartz.simpl.SimpleThreadPool$ [quartz.jar:]

Comment 3 RamaKasturi 2014-01-23 10:19:45 UTC
Please find the sosreports at the link below:

Comment 6 RamaKasturi 2014-01-28 12:06:40 UTC
Following are my observations when I tried verifying the bug:

1) When glusterd is brought down, the host immediately moves to non-operational.

2) When glusterd is brought back up and the sync happens, I see that the brick status becomes UP, but the host is still non-operational.

3) Are the sync jobs for bricks and servers different? If yes, can someone explain why?

If not, can someone tell me why this is happening, and whether it is the expected behaviour?

Comment 7 Kanagaraj 2014-01-28 12:14:56 UTC
The engine tries to recover a non-operational host at fixed intervals (I think every few minutes), so you might see a delay between starting glusterd and the host coming UP. We don't have any sync job that moves a host from Non-Operational to UP; this is handled by the engine framework itself.

But volume refresh happens every 5 seconds, and all the brick details are updated. This is why the bricks come to the UP state before the hosts.

Comment 8 Kanagaraj 2014-01-28 12:21:07 UTC
Just a correction to the above comment: brick status refresh happens every 5 minutes.

Comment 9 Kanagaraj 2014-01-28 12:39:16 UTC
There is an AutoRecoveryManager which tries to bring UP (activate) the host every 5 minutes. It is scheduled as a cron job with the expression "0 0/5 * * * ?".
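For clarity on the schedule semantics: in Quartz cron syntax, "0 0/5 * * * ?" means "fire at second 0 of every 5th minute". This is not the engine's actual AutoRecoveryManager code, just a small sketch computing the next fire time under that schedule:

```python
from datetime import datetime, timedelta

def next_fire(now):
    """Next fire time for a Quartz-style "0 0/5 * * * ?" schedule:
    second 0 of the next minute divisible by 5, strictly after `now`."""
    base = now.replace(second=0, microsecond=0)
    # advance to the next minute boundary that is a multiple of 5
    minutes_ahead = 5 - (base.minute % 5)
    return base + timedelta(minutes=minutes_ahead)
```

So if glusterd is restarted at, say, 12:03:10, the next recovery attempt lands at 12:05:00, which explains the delay between starting glusterd and the host coming UP.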

Comment 10 RamaKasturi 2014-01-29 10:26:37 UTC
Verified and works fine in the CB16 build rhsc-2.1.2-0.35.el6rhs.noarch.

When glusterd is brought down on a node, the host goes to non-operational and all the brick statuses are shown as down.

Test performed to verify this bug: brought glusterd down on all the nodes and then brought it back up, one node after the other.

Comment 12 errata-xmlrpc 2014-02-25 08:15:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.
