Bug 1059662 - [RHSC] Status of remove-brick changed from commit-pending to unknown (?) when left untouched for a few hours
Summary: [RHSC] Status of remove-brick changed from commit-pending to unknown (?) when...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rhsc
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 2.1.2
Assignee: Sahina Bose
QA Contact: Shruti Sampat
URL:
Whiteboard:
Depends On:
Blocks: 1060434
 
Reported: 2014-01-30 10:23 UTC by Shruti Sampat
Modified: 2015-05-15 18:20 UTC
CC List: 13 users

Fixed In Version: glusterfs-3.4.0.59rhs-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1060434 (view as bug list)
Environment:
Last Closed: 2014-02-25 08:15:45 UTC
Target Upstream Version:


Attachments
engine logs (deleted)
2014-01-30 10:23 UTC, Shruti Sampat
core file generated due to issue #1. (deleted)
2014-02-01 18:35 UTC, Krutika Dhananjay


Links:
Red Hat Product Errata RHEA-2014:0208 (Priority: normal, Status: SHIPPED_LIVE): Red Hat Storage 2.1 enhancement and bug fix update #2 (last updated 2014-02-25 12:20:30 UTC)

Description Shruti Sampat 2014-01-30 10:23:45 UTC
Created attachment 857421 [details]
engine logs

Description of problem:
-------------------------

After starting remove-brick on a volume, the setup was left untouched for a few hours in the commit-pending state. Later, on trying to retain the bricks, the retain operation failed, and the remove-brick icon changed to unknown (?).

From the events log - 

Could not find information for remove brick on volume dis_rep_vol of Cluster test_2 from CLI. Marking it as unknown.

Version-Release number of selected component (if applicable):
Red Hat Storage Console Version: 2.1.2-0.35.el6rhs 
glusterfs 3.4.0.58rhs

How reproducible:
Saw it once.

Steps to Reproduce:
1. Start remove-brick on a distributed-replicate volume.
2. Let the data migration finish and the icon change to commit-pending.
3. Leave the setup untouched for a few hours (I had other clusters managed by the same engine, which I continued to use).
4. Try to retain the bricks.

Actual results:
Retain brick failed and the remove-brick icon changed to unknown (?).

Expected results:
Retain should have succeeded and the remove brick icon should have changed to remove-brick-stopped icon.

Additional info:

Comment 2 Shruti Sampat 2014-01-30 10:41:38 UTC
Hi,

I am seeing that after a while, the icon changes to commit-pending again and the events log says the following - 

Data Migration finished for brick(s) on volume dis_rep_vol. Please review to abort or commit.

Comment 4 Sahina Bose 2014-01-30 11:51:27 UTC
This is due to inconsistency in the glusterTasksList return value, as per the logs below.
For a period in between, the task list returned no tasks, and then started returning the task again (no errors in vdsm):

2014-01-30 08:34:50,634 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (DefaultQuartzScheduler_Worker-51) START, GlusterTasksListVDSCommand(HostName = rb_host4, HostId = 9ec85f51-af07-49f8-a0b5-bee6e829f37e), log id: 52c5e91e
2014-01-30 08:34:51,173 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (DefaultQuartzScheduler_Worker-51) FINISH, GlusterTasksListVDSCommand, return: [GlusterAsyncTask[4518561a-4267-46c9-a480-80c373e983f4-REMOVE_BRICK-FINISHED]

[org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (DefaultQuartzScheduler_Worker-11) [7cc81051] FINISH, GlusterTasksListVDSCommand, return: [], log id: 15240328
2014-01-30 08:35:51,513 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-11) [7cc81051] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Could not find information for remove brick on volume dis_rep_vol of Cluster test_2 from CLI. Marking it as unknown
.
.
.
.
2014-01-30 03:13:34,756 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (DefaultQuartzScheduler_Worker-57) [4166d263] START, GlusterTasksListVDSCommand(HostName = rb_host1, HostId = b3b4550a-7532-4ce2-b35e-7df1562cafa9), log id: 4716d22a
2014-01-30 03:13:35,548 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterTasksListVDSCommand] (DefaultQuartzScheduler_Worker-57) [4166d263] FINISH, GlusterTasksListVDSCommand, return: [GlusterAsyncTask[a980a28b-7542-43e3-b3a5-54918ecf09aa-REMOVE_BRICK-FINISHED], GlusterAsyncTask[4518561a-4267-46c9-a480-80c373e983f4-REMOVE_BRICK-FINISHED]], log id: 4716d22a

Comment 5 RamaKasturi 2014-01-31 09:39:53 UTC
I happened to see this issue on my scale config. Following are the steps to reproduce it.

1) Have four volumes: 2 distribute and 2 distribute-replicate.

2) Started remove-brick on all four volumes.

3) After a few seconds, I see that a question mark appears in the Volume Activities column, along with an event message that says "Could not find information for remove brick on volume vol_dis_1brick_8hosts of Cluster cluster_test_setup from CLI. Marking it as unknown."

Please find the sos reports here:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/rhsc/1059662/

Comment 6 Shruti Sampat 2014-01-31 10:06:05 UTC
I saw it again too. The remove-brick operation failed on a volume as some of its bricks were down. The icon was in the failed state. I tried to start remove-brick again, but the icon changed to ?.

Opening the status dialog shows an empty dialog, i.e., with zero rows in the table.

Comment 8 Krutika Dhananjay 2014-02-01 18:35:27 UTC
Created attachment 858091 [details]
core file generated due to issue #1.

Comment 13 Sahina Bose 2014-02-03 11:53:55 UTC
The issue in Comment 5 is a duplicate of Bug 1060138

Comment 14 Kaushal 2014-02-04 11:40:43 UTC
From the logs, I was able to build the following sequence of events which led to the bug.

1. A remove-brick stop command was issued on the volume dis_rep_vol from RHSC. This led to vdsm running the command on a peer. When running such a command, glusterd stores the running op in a global variable.

2. While the remove-brick was in progress, a routine poll for volume status of all tasks was performed. In this case, however, the tasks status command was launched on the same peer that was running the remove-brick command.

3. The status tasks command isn't able to obtain the cluster lock in glusterd and fails, leading to the unknown status symbol in RHSC.
While cleaning up after the failure, the global op (which was set by the remove-brick command) is also cleared, as described by Krutika in comment 7.

4. Meanwhile, the remove-brick command continues and completes, with some assertion logs, as shown in comment 7.

Before sending a response to the CLI, glusterd performs some modifications to the response based on the op, mainly to convert UUIDs to hostnames.

5. In this case, since the op was erased, no response modification was done before responding.

6. The CLI remove-brick status code for XML output expects hostnames to be present in the response. Since the response modification wasn't done and hostnames aren't present, the generation of the XML output fails. This is taken as a failure of the remove-brick stop command, when in fact the command succeeded.

7. The next routine tasks status command expects the remove-brick task to still be present, as RHSC assumes the remove-brick stop command to have failed. Since it cannot find the task, the task is marked as unknown.

Krutika's fix of correctly clearing (or not clearing) the global op addresses this. With the fix in place, in a situation similar to the one described above, the global op would not be cleared by some other command; a command already in execution would complete successfully and perform all the necessary response modifications, leading to the CLI correctly generating the XML document.

TL;DR, Krutika's fix in https://code.engineering.redhat.com/gerrit/#/c/19201/ fixes the issue.
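
To make the race concrete, here is a minimal, self-contained sketch of the failure mode in points 1-5. It is not the actual glusterd code: all names, constants and response strings are hypothetical, and it only illustrates how a concurrent command that fails to take the cluster lock can erase the shared global op, so that the remove-brick response is later sent without the UUID-to-hostname conversion.

/* Minimal illustrative sketch (NOT the actual glusterd code; all names,
 * values and messages are hypothetical) of the race described above. */
#include <stdio.h>
#include <string.h>

#define OP_NONE         0
#define OP_REMOVE_BRICK 1

static int global_op = OP_NONE;     /* shared "current op" state in glusterd */
static int cluster_lock_held = 0;

/* remove-brick stop: takes the cluster lock and records the running op */
static void remove_brick_start(void)
{
        cluster_lock_held = 1;
        global_op = OP_REMOVE_BRICK;
}

/* routine "volume status tasks" poll: cannot take the lock and fails,
 * but its cleanup path unconditionally clears the global op (the bug) */
static void status_tasks_poll(void)
{
        if (cluster_lock_held) {
                global_op = OP_NONE;   /* wrongly erases remove-brick's op */
                return;
        }
        /* ... normal task status handling ... */
}

/* remove-brick completion: the UUID-to-hostname conversion in the
 * response only happens when the op is still seen as set */
static void remove_brick_finish(char *resp, size_t len)
{
        if (global_op == OP_REMOVE_BRICK)
                snprintf(resp, len, "node=rb_host4");      /* hostname present */
        else
                snprintf(resp, len, "node=9ec85f51-...");   /* raw UUID only */
        global_op = OP_NONE;
        cluster_lock_held = 0;
}

int main(void)
{
        char resp[64];

        remove_brick_start();
        status_tasks_poll();            /* concurrent poll erases the op */
        remove_brick_finish(resp, sizeof(resp));

        /* A missing hostname breaks the CLI's XML generation, which RHSC
         * then interprets as a failed remove-brick stop. */
        printf("response: %s\n", resp);
        return 0;
}

In this hypothetical model, the fix corresponds to the failing status command not clearing an op it did not set, so remove_brick_finish still sees OP_REMOVE_BRICK and produces the hostname-bearing response.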

Comment 16 RamaKasturi 2014-02-10 13:05:35 UTC
Hi Krutika,

  Has the issue mentioned in comment 6 been addressed?

  1) When I tried removing bricks from a volume after bringing glusterd down, I see that a "?" appears in the Volume Activities column, and opening the status dialog shows an empty dialog, i.e., with zero rows in the table.

  2) Running gluster volume remove-brick <volname> <brickname> status --xml gives me the following output.

[root@localhost ~]# gluster volume remove-brick vol_dis_rep_one 10.70.37.44:/rhs/brick4/b3 status --xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cliOutput>
  <opRet>0</opRet>
  <opErrno>0</opErrno>
  <opErrstr/>
  <volRemoveBrick>
    <task-id>726a21cc-59d0-4456-bd14-77bebacec762</task-id>
    <nodeCount>4</nodeCount>
    <aggregate>
      <files>0</files>
      <size>0</size>
      <lookups>0</lookups>
      <failures>0</failures>
      <skipped>0</skipped>
      <status>-1</status>
      <statusStr>(null)</statusStr>
      <runtime>0.00</runtime>
    </aggregate>
  </volRemoveBrick>
</cliOutput>

3) When I query "gluster vol status <volname>", it gives the following error:

 
Commit failed on 10.70.37.50. Please check log file for details.
Commit failed on 10.70.37.44. Please check log file for details.
 
4) The following is what I see in the gluster logs.

[2014-02-10 18:37:59.689374] E [glusterd-op-sm.c:2170:glusterd_aggregate_task_status] 0-management: Failed to add task details to dict
[2014-02-10 18:37:59.689384] E [glusterd-syncop.c:1007:gd_commit_op_phase] 0-management: Commit of operation 'Volume Status' failed on localhost    
[2014-02-10 18:37:59.691389] I [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_dis
[2014-02-10 18:38:10.404927] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:38:10.405025] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:38:10.405045] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:39:48.528147] I [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_dis_rep
[2014-02-10 18:39:48.528362] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:39:48.528451] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:39:48.528517] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:39:48.530151] I [glusterd-ping.c:297:glusterd_ping_cbk] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:39:48.534924] I [glusterd-ping.c:297:glusterd_ping_cbk] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:39:48.540503] I [glusterd-ping.c:297:glusterd_ping_cbk] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:39:48.792217] I [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_dis_rep_one
[2014-02-10 18:39:48.797930] E [glusterd-op-sm.c:2021:_add_remove_bricks_to_dict] 0-management: Failed to get brick count
[2014-02-10 18:39:48.797955] E [glusterd-op-sm.c:2085:_add_task_to_dict] 0-management: Failed to add remove bricks to dict
[2014-02-10 18:39:48.797966] E [glusterd-op-sm.c:2170:glusterd_aggregate_task_status] 0-management: Failed to add task details to dict
[2014-02-10 18:39:48.797975] E [glusterd-syncop.c:1007:gd_commit_op_phase] 0-management: Commit of operation 'Volume Status' failed on localhost    
[2014-02-10 18:39:49.033752] I [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_dis
[2014-02-10 18:39:49.300819] I [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_dis_one
[2014-02-10 18:39:59.427201] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:39:59.427314] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:39:59.427334] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:40:00.891546] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:40:00.891626] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:40:00.891699] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:40:00.893019] I [glusterd-ping.c:297:glusterd_ping_cbk] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:40:00.893199] I [glusterd-ping.c:297:glusterd_ping_cbk] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:40:00.893722] I [glusterd-ping.c:297:glusterd_ping_cbk] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:40:00.897399] E [glusterd-syncop.c:747:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2014-02-10 18:40:00.897803] E [glusterd-syncop.c:747:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2014-02-10 18:40:00.897951] E [glusterd-syncop.c:747:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2014-02-10 18:40:00.901857] I [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_dis_rep
[2014-02-10 18:40:00.911947] I [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_dis_one
[2014-02-10 18:40:00.922112] I [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_dis_rep_one
[2014-02-10 18:40:00.925352] E [glusterd-op-sm.c:2021:_add_remove_bricks_to_dict] 0-management: Failed to get brick count
[2014-02-10 18:40:00.925378] E [glusterd-op-sm.c:2085:_add_task_to_dict] 0-management: Failed to add remove bricks to dict
[2014-02-10 18:40:00.925389] E [glusterd-op-sm.c:2170:glusterd_aggregate_task_status] 0-management: Failed to add task details to dict
[2014-02-10 18:40:00.925399] E [glusterd-syncop.c:1007:gd_commit_op_phase] 0-management: Commit of operation 'Volume Status' failed on localhost    
[2014-02-10 18:40:00.927880] I [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management: Received status volume req for volume vol_dis
[2014-02-10 18:40:11.429731] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:40:11.429846] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s
[2014-02-10 18:40:11.430023] I [glusterd-ping.c:181:glusterd_start_ping] 0-management: defaulting ping-timeout to 10s

Is this the expected behaviour?

Comment 17 Krutika Dhananjay 2014-02-11 05:02:29 UTC
(In reply to RamaKasturi from comment #16)
> Hi krutika,
> 
>   is the issue mentioned in comment 6 has been addressed?
> 
>   1) when i tried removing bricks from a volume after making glusterd down,
> i see that in my volume activities column a "?" appears and Opening the
> status dialog shows an empty dialog, i.e with zero rows in the table.
> 
>   2) when i looked at gluster volume remove-brick <volname>  <brickname>
> status --xml gives me the follwoing output.
> 
> [root@localhost ~]# gluster volume remove-brick vol_dis_rep_one
> 10.70.37.44:/rhs/brick4/b3 status --xml
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <cliOutput>
>   <opRet>0</opRet>
>   <opErrno>0</opErrno>
>   <opErrstr/>
>   <volRemoveBrick>
>     <task-id>726a21cc-59d0-4456-bd14-77bebacec762</task-id>
>     <nodeCount>4</nodeCount>
>     <aggregate>
>       <files>0</files>
>       <size>0</size>
>       <lookups>0</lookups>
>       <failures>0</failures>
>       <skipped>0</skipped>
>       <status>-1</status>
>       <statusStr>(null)</statusStr>
>       <runtime>0.00</runtime>
>     </aggregate>
>   </volRemoveBrick>
> </cliOutput>
> 
> 3) when i try querying for the "gluster vol status <volname> " it gives the
> following error:
> 
>  
> Commit failed on 10.70.37.50. Please check log file for details.
> Commit failed on 10.70.37.44. Please check log file for details.
>  
> 4) The following is what i see in the gluster logs.
> 
> [2014-02-10 18:37:59.689374] E
> [glusterd-op-sm.c:2170:glusterd_aggregate_task_status] 0-management: Failed
> to add task details to dict
> [2014-02-10 18:37:59.689384] E [glusterd-syncop.c:1007:gd_commit_op_phase]
> 0-management: Commit of operation 'Volume Status' failed on localhost    
> [2014-02-10 18:37:59.691389] I
> [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume vol_dis
> [2014-02-10 18:38:10.404927] I [glusterd-ping.c:181:glusterd_start_ping]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:38:10.405025] I [glusterd-ping.c:181:glusterd_start_ping]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:38:10.405045] I [glusterd-ping.c:181:glusterd_start_ping]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:39:48.528147] I
> [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume vol_dis_rep
> [2014-02-10 18:39:48.528362] I [glusterd-ping.c:181:glusterd_start_ping]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:39:48.528451] I [glusterd-ping.c:181:glusterd_start_ping]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:39:48.528517] I [glusterd-ping.c:181:glusterd_start_ping]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:39:48.530151] I [glusterd-ping.c:297:glusterd_ping_cbk]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:39:48.534924] I [glusterd-ping.c:297:glusterd_ping_cbk]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:39:48.540503] I [glusterd-ping.c:297:glusterd_ping_cbk]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:39:48.792217] I
> [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume vol_dis_rep_one
> [2014-02-10 18:39:48.797930] E
> [glusterd-op-sm.c:2021:_add_remove_bricks_to_dict] 0-management: Failed to
> get brick count
> [2014-02-10 18:39:48.797955] E [glusterd-op-sm.c:2085:_add_task_to_dict]
> 0-management: Failed to add remove bricks to dict
> [2014-02-10 18:39:48.797966] E
> [glusterd-op-sm.c:2170:glusterd_aggregate_task_status] 0-management: Failed
> to add task details to dict
> [2014-02-10 18:39:48.797975] E [glusterd-syncop.c:1007:gd_commit_op_phase]
> 0-management: Commit of operation 'Volume Status' failed on localhost    
> [2014-02-10 18:39:49.033752] I
> [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume vol_dis
> [2014-02-10 18:39:49.300819] I
> [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume vol_dis_one
> [2014-02-10 18:39:59.427201] I [glusterd-ping.c:181:glusterd_start_ping]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:39:59.427314] I [glusterd-ping.c:181:glusterd_start_ping]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:39:59.427334] I [glusterd-ping.c:181:glusterd_start_ping]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:40:00.891546] I [glusterd-ping.c:181:glusterd_start_ping]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:40:00.891626] I [glusterd-ping.c:181:glusterd_start_ping]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:40:00.891699] I [glusterd-ping.c:181:glusterd_start_ping]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:40:00.893019] I [glusterd-ping.c:297:glusterd_ping_cbk]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:40:00.893199] I [glusterd-ping.c:297:glusterd_ping_cbk]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:40:00.893722] I [glusterd-ping.c:297:glusterd_ping_cbk]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:40:00.897399] E
> [glusterd-syncop.c:747:_gd_syncop_commit_op_cbk] 0-management: Failed to
> aggregate response from  node/brick
> [2014-02-10 18:40:00.897803] E
> [glusterd-syncop.c:747:_gd_syncop_commit_op_cbk] 0-management: Failed to
> aggregate response from  node/brick
> [2014-02-10 18:40:00.897951] E
> [glusterd-syncop.c:747:_gd_syncop_commit_op_cbk] 0-management: Failed to
> aggregate response from  node/brick
> [2014-02-10 18:40:00.901857] I
> [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume vol_dis_rep
> [2014-02-10 18:40:00.911947] I
> [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume vol_dis_one
> [2014-02-10 18:40:00.922112] I
> [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume vol_dis_rep_one
> [2014-02-10 18:40:00.925352] E
> [glusterd-op-sm.c:2021:_add_remove_bricks_to_dict] 0-management: Failed to
> get brick count
> [2014-02-10 18:40:00.925378] E [glusterd-op-sm.c:2085:_add_task_to_dict]
> 0-management: Failed to add remove bricks to dict
> [2014-02-10 18:40:00.925389] E
> [glusterd-op-sm.c:2170:glusterd_aggregate_task_status] 0-management: Failed
> to add task details to dict
> [2014-02-10 18:40:00.925399] E [glusterd-syncop.c:1007:gd_commit_op_phase]
> 0-management: Commit of operation 'Volume Status' failed on localhost    
> [2014-02-10 18:40:00.927880] I
> [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> Received status volume req for volume vol_dis
> [2014-02-10 18:40:11.429731] I [glusterd-ping.c:181:glusterd_start_ping]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:40:11.429846] I [glusterd-ping.c:181:glusterd_start_ping]
> 0-management: defaulting ping-timeout to 10s
> [2014-02-10 18:40:11.430023] I [glusterd-ping.c:181:glusterd_start_ping]
> 0-management: defaulting ping-timeout to 10s
> 
> Is this the expected behaviour?

Kasturi,

The 4 observations mentioned in comment #16 seem to be a manifestation of the bug https://bugzilla.redhat.com/show_bug.cgi?id=1048765, with the exact same steps executed, albeit from the CLI instead of the UI.

The fix made as part of this bug is for the issue stated in the Description of this bug, and is based on my observations from the sos-reports filed in comment #1.

While we are on the issue of (?) appearing in the UI, I have a question for Sahina:

Sahina,

Could you please explain under what circumstances the UI displays the "Unknown (?)" status for 'remove-brick status'? That would help us narrow down the areas/code/bugs in cli/glusterd which could cause the UI to display this message.

Thanks in advance,
Krutika

Comment 18 Krutika Dhananjay 2014-02-11 05:18:48 UTC
Moving the state of the bug back to ON_QA, as per discussion with Kasturi, since the patch in comment #10 does fix the bug described in the "Description" section.

Comment 19 RamaKasturi 2014-02-11 08:45:47 UTC
Verified in 

RHSC : rhsc-2.1.2-0.36.el6rhs.noarch

glusterfs : glusterfs-3.4.0.59rhs-1.el6rhs.x86_64

vdsm : vdsm-4.13.0-24.el6rhs.x86_64

Performed the following steps:

1. Start remove-brick on a distributed-replicate volume.
2. Let the data migration finish and the icon change to commit-pending.
3. Leave the setup untouched for a few hours (I had other clusters managed by the same engine, which I continued to use).
4. Try to retain the bricks.
5. Retaining the bricks worked successfully with no errors.

Marking this as verified.

Comment 20 Sahina Bose 2014-02-18 08:41:12 UTC
(In reply to Krutika Dhananjay from comment #17)
> (In reply to RamaKasturi from comment #16)
> > Hi krutika,
> > 
> >   is the issue mentioned in comment 6 has been addressed?
> > 
> >   1) when i tried removing bricks from a volume after making glusterd down,
> > i see that in my volume activities column a "?" appears and Opening the
> > status dialog shows an empty dialog, i.e with zero rows in the table.
> > 
> >   2) when i looked at gluster volume remove-brick <volname>  <brickname>
> > status --xml gives me the follwoing output.
> > 
> > [root@localhost ~]# gluster volume remove-brick vol_dis_rep_one
> > 10.70.37.44:/rhs/brick4/b3 status --xml
> > <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> > <cliOutput>
> >   <opRet>0</opRet>
> >   <opErrno>0</opErrno>
> >   <opErrstr/>
> >   <volRemoveBrick>
> >     <task-id>726a21cc-59d0-4456-bd14-77bebacec762</task-id>
> >     <nodeCount>4</nodeCount>
> >     <aggregate>
> >       <files>0</files>
> >       <size>0</size>
> >       <lookups>0</lookups>
> >       <failures>0</failures>
> >       <skipped>0</skipped>
> >       <status>-1</status>
> >       <statusStr>(null)</statusStr>
> >       <runtime>0.00</runtime>
> >     </aggregate>
> >   </volRemoveBrick>
> > </cliOutput>
> > 
> > 3) when i try querying for the "gluster vol status <volname> " it gives the
> > following error:
> > 
> >  
> > Commit failed on 10.70.37.50. Please check log file for details.
> > Commit failed on 10.70.37.44. Please check log file for details.
> >  
> > 4) The following is what i see in the gluster logs.
> > 
> > [2014-02-10 18:37:59.689374] E
> > [glusterd-op-sm.c:2170:glusterd_aggregate_task_status] 0-management: Failed
> > to add task details to dict
> > [2014-02-10 18:37:59.689384] E [glusterd-syncop.c:1007:gd_commit_op_phase]
> > 0-management: Commit of operation 'Volume Status' failed on localhost    
> > [2014-02-10 18:37:59.691389] I
> > [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> > Received status volume req for volume vol_dis
> > [2014-02-10 18:38:10.404927] I [glusterd-ping.c:181:glusterd_start_ping]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:38:10.405025] I [glusterd-ping.c:181:glusterd_start_ping]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:38:10.405045] I [glusterd-ping.c:181:glusterd_start_ping]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:39:48.528147] I
> > [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> > Received status volume req for volume vol_dis_rep
> > [2014-02-10 18:39:48.528362] I [glusterd-ping.c:181:glusterd_start_ping]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:39:48.528451] I [glusterd-ping.c:181:glusterd_start_ping]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:39:48.528517] I [glusterd-ping.c:181:glusterd_start_ping]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:39:48.530151] I [glusterd-ping.c:297:glusterd_ping_cbk]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:39:48.534924] I [glusterd-ping.c:297:glusterd_ping_cbk]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:39:48.540503] I [glusterd-ping.c:297:glusterd_ping_cbk]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:39:48.792217] I
> > [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> > Received status volume req for volume vol_dis_rep_one
> > [2014-02-10 18:39:48.797930] E
> > [glusterd-op-sm.c:2021:_add_remove_bricks_to_dict] 0-management: Failed to
> > get brick count
> > [2014-02-10 18:39:48.797955] E [glusterd-op-sm.c:2085:_add_task_to_dict]
> > 0-management: Failed to add remove bricks to dict
> > [2014-02-10 18:39:48.797966] E
> > [glusterd-op-sm.c:2170:glusterd_aggregate_task_status] 0-management: Failed
> > to add task details to dict
> > [2014-02-10 18:39:48.797975] E [glusterd-syncop.c:1007:gd_commit_op_phase]
> > 0-management: Commit of operation 'Volume Status' failed on localhost    
> > [2014-02-10 18:39:49.033752] I
> > [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> > Received status volume req for volume vol_dis
> > [2014-02-10 18:39:49.300819] I
> > [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> > Received status volume req for volume vol_dis_one
> > [2014-02-10 18:39:59.427201] I [glusterd-ping.c:181:glusterd_start_ping]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:39:59.427314] I [glusterd-ping.c:181:glusterd_start_ping]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:39:59.427334] I [glusterd-ping.c:181:glusterd_start_ping]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:40:00.891546] I [glusterd-ping.c:181:glusterd_start_ping]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:40:00.891626] I [glusterd-ping.c:181:glusterd_start_ping]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:40:00.891699] I [glusterd-ping.c:181:glusterd_start_ping]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:40:00.893019] I [glusterd-ping.c:297:glusterd_ping_cbk]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:40:00.893199] I [glusterd-ping.c:297:glusterd_ping_cbk]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:40:00.893722] I [glusterd-ping.c:297:glusterd_ping_cbk]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:40:00.897399] E
> > [glusterd-syncop.c:747:_gd_syncop_commit_op_cbk] 0-management: Failed to
> > aggregate response from  node/brick
> > [2014-02-10 18:40:00.897803] E
> > [glusterd-syncop.c:747:_gd_syncop_commit_op_cbk] 0-management: Failed to
> > aggregate response from  node/brick
> > [2014-02-10 18:40:00.897951] E
> > [glusterd-syncop.c:747:_gd_syncop_commit_op_cbk] 0-management: Failed to
> > aggregate response from  node/brick
> > [2014-02-10 18:40:00.901857] I
> > [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> > Received status volume req for volume vol_dis_rep
> > [2014-02-10 18:40:00.911947] I
> > [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> > Received status volume req for volume vol_dis_one
> > [2014-02-10 18:40:00.922112] I
> > [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> > Received status volume req for volume vol_dis_rep_one
> > [2014-02-10 18:40:00.925352] E
> > [glusterd-op-sm.c:2021:_add_remove_bricks_to_dict] 0-management: Failed to
> > get brick count
> > [2014-02-10 18:40:00.925378] E [glusterd-op-sm.c:2085:_add_task_to_dict]
> > 0-management: Failed to add remove bricks to dict
> > [2014-02-10 18:40:00.925389] E
> > [glusterd-op-sm.c:2170:glusterd_aggregate_task_status] 0-management: Failed
> > to add task details to dict
> > [2014-02-10 18:40:00.925399] E [glusterd-syncop.c:1007:gd_commit_op_phase]
> > 0-management: Commit of operation 'Volume Status' failed on localhost    
> > [2014-02-10 18:40:00.927880] I
> > [glusterd-handler.c:3498:__glusterd_handle_status_volume] 0-management:
> > Received status volume req for volume vol_dis
> > [2014-02-10 18:40:11.429731] I [glusterd-ping.c:181:glusterd_start_ping]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:40:11.429846] I [glusterd-ping.c:181:glusterd_start_ping]
> > 0-management: defaulting ping-timeout to 10s
> > [2014-02-10 18:40:11.430023] I [glusterd-ping.c:181:glusterd_start_ping]
> > 0-management: defaulting ping-timeout to 10s
> > 
> > Is this the expected behaviour?
> 
> Kasturi,
> 
> The 4 observations mentioned in comment #16 seem to be a manifestation of
> the bug https://bugzilla.redhat.com/show_bug.cgi?id=1048765, with the exact
> same steps executed, albeit from the CLI instead of the UI.
> 
> The fix made as part of this bug is for the issue stated in Description for
> this bug and based on my observations in the sos-reports filed in comment #1.
> 
> While we are the issue of (?) appearing in the UI, I have a question for
> Sahina:
> 
> Sahina,
> 
> Could you please explain under what circumstances the UI displays "Unknown
> (?)" status for 'remove-brick status', as that would help us in
> narrowing down the areas/code/bugs in cli/glusterd which could cause the UI
> to display this message?

Whenever the gluster CLI for "gluster volume status tasks" either:
1. Returns no task-id corresponding to the task id in the engine, or
2. Returns a status of "Not Started" or a status which is not recognized by the RHSC engine (not one of COMPLETED, STARTED, STOPPED, FAILED, IN PROGRESS)
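
Stated as a hedged sketch (hypothetical only; the actual RHSC engine is written in Java, and none of the names below come from it), the rule above amounts to: a task id missing from the CLI task list, or a status string the engine does not recognize, is shown as UNKNOWN.

/* Hypothetical illustration of the rule described above, not engine code. */
#include <stdio.h>
#include <string.h>

static const char *known_statuses[] = {
        "COMPLETED", "STARTED", "STOPPED", "FAILED", "IN PROGRESS", NULL
};

/* cli_status == NULL models "no matching task-id returned by the CLI" */
static const char *engine_task_status(const char *cli_status)
{
        if (!cli_status)
                return "UNKNOWN";               /* task id not found */
        for (int i = 0; known_statuses[i]; i++)
                if (!strcmp(cli_status, known_statuses[i]))
                        return cli_status;      /* recognized status */
        return "UNKNOWN";                       /* e.g. "Not Started" */
}

int main(void)
{
        printf("%s\n", engine_task_status(NULL));           /* UNKNOWN   */
        printf("%s\n", engine_task_status("Not Started"));  /* UNKNOWN   */
        printf("%s\n", engine_task_status("COMPLETED"));    /* COMPLETED */
        return 0;
}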

> 
> Thanks in advance,
> Krutika

Comment 22 errata-xmlrpc 2014-02-25 08:15:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html

