Bug 1684350 - VMs fail to shut down after rebalance
Summary: VMs fail to shut down after rebalance
Keywords:
Status: NEW
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rhhi
Version: rhhiv-1.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Sahina Bose
QA Contact: bipin
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-01 03:26 UTC by bipin
Modified: 2019-04-11 07:59 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Attachments

Description bipin 2019-03-01 03:26:03 UTC
Description of problem:
======================
In an RHHI-1.6 environment, after a rebalance completed, the VMs running on the same brick failed to power off. The VMs remain stuck in the 'powering down' state.


Version-Release number of selected component (if applicable):
============================================================
glusterfs-3.12.2-45.el7rhgs.x86_64
rhv-4.3.0.5-0
ovirt-engine-4.3.1.1-0.1.el7.noarch

How reproducible:
================
1/1

Steps to Reproduce:
==================
1. Create a replica 3 (1x3) volume
2. Host 2 or more VMs on it and pump I/O to the additional disk
3. Add bricks to the volume mentioned above, making it 2x3
4. Start the rebalance
5. Once the rebalance is completed, power off the VMs
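For reference, the repro steps above correspond to roughly the following Gluster CLI sequence. This is a sketch only: the volume name `vmstore` and the host/brick paths are placeholders, and the commands assume a working RHGS trusted storage pool.

```shell
# Step 1: create a replica 3 (1x3) volume (hosts and brick paths are placeholders)
gluster volume create vmstore replica 3 \
    host1:/gluster_bricks/vmstore/b1 \
    host2:/gluster_bricks/vmstore/b2 \
    host3:/gluster_bricks/vmstore/b3
gluster volume start vmstore

# Step 2: host VMs on the volume and generate I/O on their additional disks
# (done from the virtualization layer, not shown here)

# Step 3: expand the volume to 2x3 by adding a second replica set
gluster volume add-brick vmstore replica 3 \
    host1:/gluster_bricks/vmstore/b4 \
    host2:/gluster_bricks/vmstore/b5 \
    host3:/gluster_bricks/vmstore/b6

# Steps 4-5: start the rebalance, wait for it to finish, then power off the VMs
gluster volume rebalance vmstore start
gluster volume rebalance vmstore status
```

These commands require a live Gluster cluster, so they are shown only to pin down which operations the repro steps refer to.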

Actual results:
==============
The VMs never powered off. A reboot was also attempted, but an SSH connection to the VMs could never be established.

Expected results:
================
The VMs should power off without any corruption.

Additional info:
===============
Looking through the logs, I don't see any clear indication that the FUSE mount is the culprit here. The logs look clean apart from a few layout-mismatch messages, but the engine log shows that the DestroyVDS command failed.

Additionally, creating new VMs on those bricks works fine.

Comment 2 bipin 2019-03-01 03:27:56 UTC
Engine log:
===========
2019-02-28 11:13:55,444+05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] Failed in 'DestroyVDS' method
2019-02-28 11:13:55,453+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM rhsqa-grafton9-nic2.lab.eng.blr.redhat.com command DestroyVDS failed: Virtual machine destroy error
2019-02-28 11:13:55,454+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand' return value 'StatusOnlyReturn [status=Status [code=42, message=Virtual machine destroy error]]'
2019-02-28 11:13:55,454+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] HostName = rhsqa-grafton9-nic2.lab.eng.blr.redhat.com
2019-02-28 11:13:55,454+05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] Command 'DestroyVDSCommand(HostName = rhsqa-grafton9-nic2.lab.eng.blr.redhat.com, DestroyVmVDSCommandParameters:{hostId='a37bef1d-9214-4476-a22b-2e556f71949f', vmId='ceefada7-bda7-4c7e-b7d8-1b72bd131e61', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='false'})' execution failed: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42
2019-02-28 11:13:55,454+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] FINISH, DestroyVDSCommand, return: , log id: 6977bc42
2019-02-28 11:13:55,454+05 ERROR [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] Failed to destroy VM 'ceefada7-bda7-4c7e-b7d8-1b72bd131e61' in VDS = 'a37bef1d-9214-4476-a22b-2e556f71949f' , error = 'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42'
2019-02-28 11:13:55,454+05 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] FINISH, DestroyVmVDSCommand, return: , log id: 58eabca
2019-02-28 11:13:55,454+05 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] Command 'org.ovirt.engine.core.bll.StopVmCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = Virtual machine destroy error, code = 42 (Failed with error destroyErr and code 42)
2019-02-28 11:13:55,464+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-195506) [97b743be-e2d4-4f38-b986-19cf609c5a9e] EVENT_ID: USER_FAILED_STOP_VM(56), Failed to power off VM vm1_Vol_expansion (Host: rhsqa-grafton9-nic2.lab.eng.blr.redhat.com, User: admin@internal-authz)



FUSE mount log:
==============
[2019-02-27 11:28:04.401930] I [MSGID: 109064] [dht-layout.c:837:dht_layout_dir_mismatch] 2-Vol1_Expansion-dht: subvol: Vol1_Expansion-replicate-0; inode layout - 0 - 4294967295 - 1; disk layout - 2147483647 - 4294967295 - 3820200388
[2019-02-27 11:28:04.402050] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c, gfid = 44063cb5-d123-423e-bef4-3b51bbe63ae6
[2019-02-27 11:28:04.402073] I [MSGID: 109064] [dht-layout.c:837:dht_layout_dir_mismatch] 2-Vol1_Expansion-dht: subvol: Vol1_Expansion-replicate-1; inode layout - 0 - 0 - 0; disk layout - 0 - 2147483646 - 3820200388
[2019-02-27 11:28:04.403391] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c/images, gfid = 4986b231-207d-42a2-8eae-6d4dd09d1482
[2019-02-27 11:28:04.404863] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c/images/58b4f4b2-f28a-4eae-90b5-7b576ad07874, gfid = 324c942a-6fcb-43e7-94f1-485ba0d9db3c
The message "I [MSGID: 109064] [dht-layout.c:837:dht_layout_dir_mismatch] 2-Vol1_Expansion-dht: subvol: Vol1_Expansion-replicate-0; inode layout - 0 - 4294967295 - 1; disk layout - 2147483647 - 4294967295 - 3820200388" repeated 3 times between [2019-02-27 11:28:04.401930] and [2019-02-27 11:28:10.331865]
[2019-02-27 11:28:10.331869] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c/dom_md, gfid = c1d3d32a-80af-40b3-8775-c22e23b95cb1
[2019-02-27 11:28:10.331944] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c/dom_md, gfid = c1d3d32a-80af-40b3-8775-c22e23b95cb1
[2019-02-27 11:28:22.392578] W [MSGID: 114031] [client-rpc-fops.c:2865:client3_3_lookup_cbk] 2-Vol1_Expansion-client-0: remote operation failed. Path: <gfid:a24c1d8e-ad3f-43f3-8c4a-96be4cf22496> (a24c1d8e-ad3f-43f3-8c4a-96be4cf22496) [No such file or directory]
[2019-02-27 11:28:22.392760] W [MSGID: 114031] [client-rpc-fops.c:2865:client3_3_lookup_cbk] 2-Vol1_Expansion-client-2: remote operation failed. Path: <gfid:a24c1d8e-ad3f-43f3-8c4a-96be4cf22496> (a24c1d8e-ad3f-43f3-8c4a-96be4cf22496) [No such file or directory]
[2019-02-27 11:28:22.392764] W [MSGID: 114031] [client-rpc-fops.c:2865:client3_3_lookup_cbk] 2-Vol1_Expansion-client-1: remote operation failed. Path: <gfid:a24c1d8e-ad3f-43f3-8c4a-96be4cf22496> (a24c1d8e-ad3f-43f3-8c4a-96be4cf22496) [No such file or directory]
The message "I [MSGID: 109064] [dht-layout.c:837:dht_layout_dir_mismatch] 2-Vol1_Expansion-dht: subvol: Vol1_Expansion-replicate-1; inode layout - 0 - 0 - 0; disk layout - 0 - 2147483646 - 3820200388" repeated 4 times between [2019-02-27 11:28:04.402073] and [2019-02-27 11:28:33.071945]
[2019-02-27 11:28:33.071948] I [MSGID: 109018] [dht-common.c:1790:dht_revalidate_cbk] 2-Vol1_Expansion-dht: Mismatching layouts for /52029ff1-9890-4c96-9edd-c5ff7052545c/images/86d94115-6969-41a9-afc2-f8453820370f, gfid = 5384fdfe-38b9-4ebb-9b97-02ac02f73f81
[2019-02-27 11:28:33.072133] I [MSGID: 109064] [dht-layout.c:837:dht_layout_dir_mismatch] 2-Vol1_Expansion-dht: subvol: Vol1_Expansion-replicate-0; inode layout - 0 - 4294967295 - 1; disk layout - 2147483647 - 4294967295 - 3820200388

Comment 4 Gobinda Das 2019-03-01 09:53:12 UTC
Hi Bipin,
 I was checking the vdsm logs and could not find any error that explains the reason.
2019-02-28 13:24:45,811+0530 WARN  (jsonrpc/3) [virt.vm] (vmId='6998fc6d-79e3-46ca-a741-5d5f0322679e') Failed to destroy VM '6998fc6d-79e3-46ca-a741-5d5f0322679e' forcefully (error=38) (vm:5212)
2019-02-28 13:24:45,812+0530 INFO  (jsonrpc/3) [api.virt] FINISH destroy return={'status': {'message': 'Virtual machine destroy error', 'code': 42}} from=::ffff:10.70.36.247,57014, flow_id=2bfe5929-1906-44a2-bc3c-4d3520fd4070, vmId=6998fc6d-79e3-46ca-a741-5d5f0322679e (api:54)

Can you please confirm that the engine log and the vdsm log were captured at the same time?

Comment 5 bipin 2019-03-01 10:37:05 UTC
Gobinda,

I have updated the bug with the proper vdsm log now. Could you please check?

Comment 6 Gobinda Das 2019-03-08 05:49:56 UTC
Bipin,
 I don't find anything related in the given logs. Could you please attach the libvirt log as well?

Comment 7 SATHEESARAN 2019-03-08 07:34:26 UTC
As a side note, I have also hit a similar problem in a qemu/kvm environment.

1. Used a RHEL 7.6 server as the hypervisor
2. Created a replica 3 volume and FUSE mounted it on this RHEL 7.6 host
3. Created a few VMs and started installing an OS on them
4. While the OS installation was still in progress, scaled the volume from 1x3 to 2x3
5. Performed rebalance
6. Expanded the volume once again from 2x3 to 3x3, followed by a successful rebalance
7. Removed a replica set to convert the volume back to 2x3
8. At this point, powered down a VM and observed that it never powered down, remaining stuck in the 'in shutdown' state
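The scale-out and scale-in in steps 6-7 above can be sketched with the Gluster CLI as follows. Again a sketch only: `vmstore` and the host/brick paths are placeholder names, and remove-brick must migrate data before being committed.

```shell
# Step 6: expand from 2x3 to 3x3 by adding a third replica set, then rebalance
gluster volume add-brick vmstore replica 3 \
    host1:/gluster_bricks/vmstore/b7 \
    host2:/gluster_bricks/vmstore/b8 \
    host3:/gluster_bricks/vmstore/b9
gluster volume rebalance vmstore start
gluster volume rebalance vmstore status

# Step 7: shrink back to 2x3 by removing one replica set.
# 'start' begins data migration off the bricks; poll 'status' until complete.
gluster volume remove-brick vmstore replica 3 \
    host1:/gluster_bricks/vmstore/b7 \
    host2:/gluster_bricks/vmstore/b8 \
    host3:/gluster_bricks/vmstore/b9 start
gluster volume remove-brick vmstore replica 3 \
    host1:/gluster_bricks/vmstore/b7 \
    host2:/gluster_bricks/vmstore/b8 \
    host3:/gluster_bricks/vmstore/b9 status

# Once migration completes, commit the removal.
gluster volume remove-brick vmstore replica 3 \
    host1:/gluster_bricks/vmstore/b7 \
    host2:/gluster_bricks/vmstore/b8 \
    host3:/gluster_bricks/vmstore/b9 commit
```

These commands require a live Gluster cluster and are shown only to make the repro sequence unambiguous.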

Comment 8 bipin 2019-03-11 12:55:52 UTC
Gobinda,

I tried reproducing this twice but could not hit it; everything worked fine. Since Sas also hit the issue, let's not close the bug.
I am reducing the severity and will update the bug if we see it again.

