Bug 1064860 - VMs get stuck in 'Unknown' state when power management is not working.
Summary: VMs get stuck in 'Unknown' state when power management is not working.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.5.0
Assignee: Eli Mesika
QA Contact: sefi litmanovich
URL:
Whiteboard: infra
Depends On:
Blocks: rhev3.5beta 1156165
 
Reported: 2014-02-13 12:39 UTC by Roman Hodain
Modified: 2016-02-10 19:37 UTC (History)
14 users

Fixed In Version: ovirt-engine-3.5.0_beta
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-17 17:07:54 UTC
oVirt Team: Infra
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 26466 master ABANDONED core: VMs moved to UNKNOWN after set to DOWN Never
oVirt gerrit 28114 master MERGED core: move VM status handling to VdsManager. Never

Description Roman Hodain 2014-02-13 12:39:00 UTC
Description of problem:
	
	VMs get stuck in the "Unknown" state when the hypervisor is rebooted and
fencing is not working, even if the hypervisor comes back up and the engine
detects that the VMs are down.


Version-Release number of selected component (if applicable):

	rhevm-3.3.0-0.46.el6ev.noarch

How reproducible:

	100%

Steps to Reproduce:

	1. Create a new DC with just one hypervisor (local storage)
	2. Start a VM on it
	3. Reboot it

Actual results:

	The VM stays in the Unknown state forever.

Expected results:

	The VM is marked as Unknown temporarily, and later as Down when the
hypervisor comes up.

Additional info:
	
	It seems that this is caused by defunct fencing. When fencing does not
succeed three times, the VMs are marked as Unknown, but by then the rerun
treatment has already happened, as the hypervisor has already come back up.

2014-02-12 13:42:46,359 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-63) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: State was set to Up for host dhcp-1-146.brq.redhat.com.
2014-02-12 13:42:46,531 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-63) vm TestVM running in db and not running in vds - add to rerun treatment. vds rhev-h.example.com
...
2014-02-12 13:43:16,669 INFO  [org.ovirt.engine.core.bll.FenceExecutor] (pool-4-thread-47) Attempt 3 to find fence proxy host failed...
2014-02-12 13:43:46,670 ERROR [org.ovirt.engine.core.bll.FenceExecutor] (pool-4-thread-47) Failed to run Power Management command on Host rhev-h.example.com, no running proxy Host was found.
2014-02-12 13:43:46,684 INFO  [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (pool-4-thread-47) START, SetVmStatusVDSCommand( vmId = 4785f791-c535-4f64-97ef-fbd6a11bf8fd, status = Unknown), log id: 21cd9e52
2014-02-12 13:43:46,687 INFO  [org.ovirt.engine.core.vdsbroker.SetVmStatusVDSCommand] (pool-4-thread-47) FINISH, SetVmStatusVDSCommand, log id: 21cd9e52
2014-02-12 13:43:46,724 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-4-thread-47) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM TestVM was set to the Unknown status.

Comment 1 Michal Skrivanek 2014-02-28 10:03:47 UTC
The fencing operation should be aborted when the host comes up in the meantime. Then the rerun treatment should work properly and not get overwritten by failed fencing afterwards.
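The abort-on-recovery behaviour suggested in this comment can be sketched roughly as follows. This is a simplified, hypothetical illustration, not the actual ovirt-engine code; every class, enum, and method name here is invented. The idea is only that the fence-failure handler checks whether the host has come back up before it touches VM state.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the guard described in Comment 1: a fence-failure
// handler that refuses to overwrite VM statuses once the host is back up.
// None of these names correspond to real ovirt-engine classes.
public class FenceCoordinator {
    enum HostStatus { UP, NON_RESPONSIVE }
    enum VmStatus { UP, DOWN, UNKNOWN }

    private HostStatus hostStatus = HostStatus.NON_RESPONSIVE;
    private final Map<String, VmStatus> vms = new HashMap<>();

    void addVm(String vmId, VmStatus s) { vms.put(vmId, s); }

    void hostCameUp() { hostStatus = HostStatus.UP; }

    // Rerun treatment: the recovered host reported the VM is not running.
    void rerunTreatment(String vmId) { vms.put(vmId, VmStatus.DOWN); }

    // Called after the final failed fence attempt. The guard: if the host
    // recovered in the meantime, the failed fencing must not clobber the
    // statuses the rerun treatment already set.
    void onFenceFailed() {
        if (hostStatus == HostStatus.UP) {
            return; // abort: host is back, rerun treatment is authoritative
        }
        vms.replaceAll((id, s) -> VmStatus.UNKNOWN);
    }

    VmStatus statusOf(String vmId) { return vms.get(vmId); }
}
```

With this guard, the ordering from the bug description (host up, VM marked Down, fencing fails afterwards) would leave the VM Down instead of Unknown.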

Comment 2 Eli Mesika 2014-03-18 13:05:12 UTC
Roman, you should right-click on the host in the Hosts list in the web admin UI and select "Confirm Host has been rebooted".

Please recheck with the above

Comment 3 Arthur Berezin 2014-03-20 15:02:32 UTC
There's a user experience problem here: the user reboots a host with running VMs, so the VMs go to the Unknown state. There's no indication in the VMs tab (it is hidden in the Events section) telling the user to go back to the host level and confirm that the host has been rebooted.
Adding User Experience keyword.

Comment 4 Roman Hodain 2014-03-21 11:02:21 UTC
(In reply to Arthur Berezin from comment #3)
> There's a user experience problem here, the user reboot a host with running
> VMs thus VMs are going to unknown state. There's not indication in the VMs
> tab(only hidden in the events section) for the user that he should go back
> to host level and confirm that the has been rebooted.
> Adding User Experience keyword.

I do not think that this is the problem here. The problem is that the hypervisor where the VM was running is already up and the VM is still in the Unknown state. Why would I mark a hypervisor which is up as rebooted?

Comment 5 Eli Mesika 2014-03-24 22:09:22 UTC
(In reply to Roman Hodain from comment #4)
> (In reply to Arthur Berezin from comment #3)
> > There's a user experience problem here, the user reboot a host with running
> > VMs thus VMs are going to unknown state. There's not indication in the VMs
> > tab(only hidden in the events section) for the user that he should go back
> > to host level and confirm that the has been rebooted.
> > Adding User Experience keyword.
> 
> I do not thing that this is the problem here. The problem is that the
> hypervisor where the VM was running is already up and the VM is still in the
> unknown state. Why would I mark hypervisor which is up as rebooted?

I am just copy/pasting from your bug description:

Steps to Reproduce:

	1. Create a new DC with just one hypervisor (local storage)
	2. Start a VM on it
	3. Reboot it

So, you had rebooted the host manually, right? If so, please test again: after you reboot the host, also right-click on it and select "Confirm host has been rebooted".

BTW there is no fencing issue here, since fencing cannot work when there is only one host in the DC (no proxy host available...).

Comment 6 Roman Hodain 2014-03-27 17:15:43 UTC
(In reply to Eli Mesika from comment #5)
> (In reply to Roman Hodain from comment #4)
> > (In reply to Arthur Berezin from comment #3)
> > > There's a user experience problem here, the user reboot a host with running
> > > VMs thus VMs are going to unknown state. There's not indication in the VMs
> > > tab(only hidden in the events section) for the user that he should go back
> > > to host level and confirm that the has been rebooted.
> > > Adding User Experience keyword.
> > 
> > I do not thing that this is the problem here. The problem is that the
> > hypervisor where the VM was running is already up and the VM is still in the
> > unknown state. Why would I mark hypervisor which is up as rebooted?
> 
> I am just copy/past from your bug description :
> 
> Steps to Reproduce:
> 
> 	1. Create a new DC with just one hyperviosr (local storage)
> 	2. Start a VM on it
> 	3. Reboot it
> 
> So, you had rebooted the Host manually right? If so , please test again
> while after you reboot the host you also right click on it as "Confirm host
> has been rebooted"
> 
> BTW there is no fencing issue here since fencing can not work when there is
> only one Host in the DC (no proxy host available...)

Hi,

I have tested your suggestion, but this is not possible. At the time when the VM is in the Unknown state, the hypervisor is already up:
	
Error while executing action: Cannot confirm 'Host has been rebooted' Host. Valid Host statuses are "Non operational", "Maintenance" or "Connecting".

let me repeat what happens:

 - VM is up
 - host is up
 - host is down
 - fencing is triggered
 - fencing in progress (not working)
 - hypervisor is up
 - Vm is marked as down
 - Fencing failed
 - Vm is marked as in Unknown state.
 - Mark the hypervisor as rebooted (not possible)

I still think that this is a fencing issue. Fencing is triggered, and if it fails it marks the VMs as Unknown even if they have already been marked as Down by the hypervisor, which is already up.
It is not related only to local storage, but to any setup where fencing is not working.

Roman
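The sequence listed in this comment boils down to a last-writer-wins race: the rerun treatment marks the VM Down first, and the late fence-failure result then unconditionally overwrites it with Unknown. A minimal reproduction of that ordering, with invented names only (this is not ovirt-engine code):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Hypothetical sketch of the race in Comment 6: two engine events are
// processed in arrival order, and the second (failed fencing) blindly
// overwrites the status the first (rerun treatment) already set.
public class StatusRace {
    enum VmStatus { UP, DOWN, UNKNOWN }

    VmStatus vmStatus = VmStatus.UP;
    private final Deque<Consumer<StatusRace>> events = new ArrayDeque<>();

    void enqueue(Consumer<StatusRace> e) { events.add(e); }

    void drain() {
        while (!events.isEmpty()) {
            events.poll().accept(this);
        }
    }

    static VmStatus reproduce() {
        StatusRace r = new StatusRace();
        // Host comes back up: rerun treatment marks the VM Down.
        r.enqueue(s -> s.vmStatus = VmStatus.DOWN);
        // The earlier fence operation finally fails: marks the VM Unknown
        // without checking that the host already recovered.
        r.enqueue(s -> s.vmStatus = VmStatus.UNKNOWN);
        r.drain();
        return r.vmStatus; // ends as Unknown despite the Down report
    }
}
```

The fix direction discussed in this bug amounts to making the second event conditional on host state instead of unconditional.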

Comment 9 Einav Cohen 2014-06-02 17:55:06 UTC
(In reply to Arthur Berezin from comment #3)
> There's a user experience problem here, the user reboot a host with running
> VMs thus VMs are going to unknown state. There's not indication in the VMs
> tab(only hidden in the events section) for the user that he should go back
> to host level and confirm that the has been rebooted.
> Adding User Experience keyword.

Is this what this bug is about? I see that this BZ is in POST, so the problem reported here was solved. What you are saying is that we have a user-experience problem that, if I understand correctly, should be tracked separately from this issue; if so, please open a separate RFE for that. For now I have removed the UserExperience keyword from this BZ.
My hunch is that this should be solved via a notification center or something similar that we can plan for 4.0; definitely not 3.5 material.
Thanks.

Comment 10 Arthur Berezin 2014-06-05 08:29:46 UTC
(In reply to Einav Cohen from comment #9)
> (In reply to Arthur Berezin from comment #3)
> > There's a user experience problem here, the user reboot a host with running
> > VMs thus VMs are going to unknown state. There's not indication in the VMs
> > tab(only hidden in the events section) for the user that he should go back
> > to host level and confirm that the has been rebooted.
> > Adding User Experience keyword.
> 
> is this what this bug is about? I see that this BZ is in POST, so the
> problem reported here was solved; what you are saying is that we have a
> user-experience problem that, if I understand correctly, should be tracked
> separately from this issue. if so - please open a separate RFE for that. For
> now I removed the UserExperience keyword from this BZ. 
> My hunch is that this should be solved via a notification-center or
> something similar that we can plan for 4.0, definitely not 3.5 material. 
> thanks.

There are two issues here. The first is fixed by Eli's patch: VMs were marked as Unknown after the host was rebooted and fencing failed. The other is that there's no "call for action" in the VMs tab when the user is expected to manually confirm a host was rebooted. I'll open a separate RFE for the second issue.

Comment 11 sefi litmanovich 2014-09-04 07:38:09 UTC
Verified with ovirt-engine-3.5.0-0.0.master.20140821064931.gitb794d66.el6.noarch.
vdsm-4.16.2-1.gite8cba75.el6.x86_64.

1. single host in datacenter is up (host has no power management configured).
2. create vm.
3. vm is up.
4. manually reboot the host.
5. host state connecting.
6. fencing failed for SPM host in DC, setting DC to non-operational
7. host state non-responsive.
8. vm state unknown.
9. host up.
10. vm down.
11. host is contending for SPM.
12. DC up host is SPM.

Comment 12 Eyal Edri 2015-02-17 17:07:54 UTC
rhev 3.5.0 was released. closing.

