Bug 1364198 - Cannot resume a suspended VM
Summary: Cannot resume a suspended VM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.0.1.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: nobody
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-04 17:03 UTC by nicolas
Modified: 2016-08-19 08:35 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-19 08:35:43 UTC
oVirt Team: Virt
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments
Engine log and supposed running host's VDSM log (deleted)
2016-08-04 17:03 UTC, nicolas
libvirtd log (deleted)
2016-08-09 18:24 UTC, nicolas

Description nicolas 2016-08-04 17:03:39 UTC
Created attachment 1187577 [details]
Engine log and supposed running host's VDSM log

Description of problem:

We've had a VM reporting the following in the engine every few seconds:

    VM XXX SLA Policy was set. Storage policy changed for disks: [...]

As it was impossible to stop these events, we tried shutting down the VM (right-click on it > Shutdown). It apparently started shutting down, but the status was immediately set back to Up, with the message above still showing every few seconds.

We then decided to suspend the VM, which actually worked, and the message above stopped showing up. This happened at 17:14:05 (log local time). We then tried resuming the VM, but it is stuck in a "Restoring State", with the hourglass icon, and it's not possible to resume it.

I also tried changing the status value in the vm_dynamic table from 12 to 0 (Down). The engine immediately determined that the VM should be taken to the "Resuming State" again (that happened at 17:44:49, log local time).
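
For illustration only, a minimal sketch of that kind of manual change, assuming direct access to the engine database with psycopg2; the database credentials and the VM GUID below are placeholders, not the actual values used:

    # Hypothetical sketch: set the VM's status in vm_dynamic to 0 (Down).
    # Connection details and the VM GUID are placeholders.
    import psycopg2

    conn = psycopg2.connect(dbname="engine", user="engine",
                            password="...", host="localhost")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE vm_dynamic SET status = %s WHERE vm_guid = %s",
                (0, "00000000-0000-0000-0000-000000000000"),
            )
        conn.commit()
    finally:
        conn.close()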

Right now I can't do anything with the machine: Can't power it off, nor start it.

Comment 1 Tomas Jelinek 2016-08-05 06:40:46 UTC
Looking at the VDSM logs, the <ovirt:qos> part looks very suspicious. Moving to SLA for further investigation.

Comment 2 Roy Golan 2016-08-08 09:59:04 UTC
It's true that the qos element is filled with line breaks, but it doesn't appear to stop the VM from going into the Restoring state.

What is more concerning is that the VM is always in succeededToRunVMs in the VMs monitoring, which leads to the call to refresh the VM SLA policies. It appears there is no state change, which explains the weird behaviour around the shutdown.

The logs also show psql errors for the host:

 ERROR [org.ovirt.engine.core.bll.hostdev.RefreshHostDevicesCommand] (org.ovirt.thread.pool-8-thread-7) [7b53aaee] Command 'org.ovirt.engine.core.bll.hostdev.RefreshHostDevicesCommand' failed: Failed managing transaction


Please supply some more details about the history of the setup

Comment 3 nicolas 2016-08-08 18:01:04 UTC
There's not much I can say. This host was deployed when we installed oVirt at version 3.6.3 (circa a year ago), and we've been updating all of the hosts (including the engine) in the same cycle as the 5 other hosts we have; there's nothing special about this one in particular.

We've recently upgraded to oVirt 4.0.0 and we also updated vdsmd on all hosts, but this one is the only one with issues.

Please let me know if I can provide any specific details about the setup.

Comment 4 nicolas 2016-08-09 18:23:51 UTC
Some additional info. I decided to restart libvirtd + vdsmd [1]; when the host recovers, it seems to do so with all running VMs, including the one that fails to resume (let's call it just VM).

After this process took place, I could see this entry in the log:

    Highly Available VM VM failed. It will be restarted automatically.

So I suspected this machine was configured as Highly Available, and indeed it is. I tried to remove that setting from the machine, but I couldn't do it in the engine, since the "Highly Available" checkbox wouldn't deactivate. I decided to do it from the Python SDK, and it worked, but the machine was still in an "eternal" restoring state.
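
For illustration, a minimal sketch of disabling that flag through the Python SDK (ovirtsdk4); the engine URL, credentials and VM name below are placeholders, not the actual values used:

    # Hypothetical sketch: clear the "Highly Available" flag via ovirtsdk4.
    # URL, credentials and the VM name are placeholders.
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url="https://engine.example.com/ovirt-engine/api",
        username="admin@internal",
        password="...",
        insecure=True,  # a real setup would use ca_file=... instead
    )
    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search="name=myvm")[0]
    vms_service.vm_service(vm.id).update(
        types.Vm(high_availability=types.HighAvailability(enabled=False))
    )
    connection.close()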

So I restarted libvirtd and vdsmd again; the host recovered, and all machines recovered except VM. Now I could see this entry in the log:

    VM VM is down with error. Exit message: Wake up from hibernation failed:Failed to find the libvirt domain.

So basically right now the machine is down, but if I try to start it, it enters the restoring state again and stalls there. I also checked the state in the Python SDK and in the DB, and both showed the machine being 'down' (0 in the vm_dynamic table, status field).
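
For illustration, a minimal sketch of that kind of status check through the Python SDK (ovirtsdk4); the engine URL, credentials and VM name are placeholders:

    # Hypothetical sketch: read the VM status via the oVirt Python SDK (ovirtsdk4).
    # URL, credentials and the VM name are placeholders.
    import ovirtsdk4 as sdk

    connection = sdk.Connection(
        url="https://engine.example.com/ovirt-engine/api",
        username="admin@internal",
        password="...",
        insecure=True,  # a real setup would use ca_file=... instead
    )
    vm = connection.system_service().vms_service().list(search="name=myvm")[0]
    print(vm.status)  # e.g. VmStatus.DOWN
    connection.close()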

 [1]: I'm attaching the libvirtd log from that time, which shows several errors from some time before.

Comment 5 nicolas 2016-08-09 18:24:22 UTC
Created attachment 1189371 [details]
libvirtd log

Comment 6 Roy Golan 2016-08-09 20:31:09 UTC
Can you restore this VM to a different snapshot? Look at the VM Snapshots subtab in the webui.

Comment 7 nicolas 2016-08-10 07:27:47 UTC
I didn't have a snapshot of this VM, but I could copy its disk, create a new VM and attach the disk to it. The VM started correctly. I suspect I could do the same by creating a snapshot and cloning it into a different VM, if you wish.
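
For illustration, a minimal sketch of that workaround through the Python SDK (ovirtsdk4), assuming the disk has already been copied and its id is known; all names and ids below are placeholders:

    # Hypothetical sketch: create a replacement VM and attach the copied disk.
    # Cluster name, VM name and disk id are placeholders.
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url="https://engine.example.com/ovirt-engine/api",
        username="admin@internal",
        password="...",
        insecure=True,
    )
    vms_service = connection.system_service().vms_service()

    # New VM based on the Blank template.
    new_vm = vms_service.add(
        types.Vm(
            name="myvm-copy",
            cluster=types.Cluster(name="Default"),
            template=types.Template(name="Blank"),
        )
    )

    # Attach the previously copied disk and make it bootable.
    vms_service.vm_service(new_vm.id).disk_attachments_service().add(
        types.DiskAttachment(
            disk=types.Disk(id="copied-disk-id"),
            interface=types.DiskInterface.VIRTIO,
            bootable=True,
            active=True,
        )
    )
    connection.close()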

I kept the original VM in case you need to do some additional tests.

Comment 8 Tomas Jelinek 2016-08-19 08:35:43 UTC
We have fixed some issues around the update flows in 4.0.z, so I suspect that you have hit one of them.
Considering you have succeeded in saving your VM by copying its disks to another VM, I will close this issue - if you hit it again, please reopen.

