Bug 1093366 - Migration of hosted-engine vm put target host score to zero
Summary: Migration of hosted-engine vm put target host score to zero
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 3.4.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Doron Fediuck
QA Contact: Artyom
URL:
Whiteboard: sla
Depends On: 1123285
Blocks: 1107968 1119763 rhev3.5beta 1156165
 
Reported: 2014-05-01 14:26 UTC by Artyom
Modified: 2018-12-06 16:24 UTC
CC: 13 users

Fixed In Version: ovirt-3.5.0-beta2
Doc Type: Bug Fix
Doc Text:
Previously, in Hosted Engine environments with more than two host nodes, the host with the lowest engine status score was being chosen as the best alternative host for migration of the Hosted Engine virtual machine. This prevented the virtual machine from being migrated. Now, the host with the highest engine status score is chosen as the best alternative and migration occurs as expected.
Clone Of:
Cloned To: 1119763
Environment:
Last Closed: 2015-02-11 21:08:25 UTC
oVirt Team: SLA
Target Upstream Version:


Attachments (Terms of Use)
target host agent log (deleted), 2014-05-01 14:26 UTC, Artyom
engine logs (deleted), 2014-06-10 21:10 UTC, Joop van de Wege
host01 logs (deleted), 2014-06-10 21:10 UTC, Joop van de Wege
host02 logs (deleted), 2014-06-10 21:11 UTC, Joop van de Wege
host03 logs (deleted), 2014-06-10 21:11 UTC, Joop van de Wege


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:0194 normal SHIPPED_LIVE ovirt-hosted-engine-ha bug fix and enhancement update 2015-02-12 01:35:33 UTC
oVirt gerrit 29580 master MERGED Use max instead of min when computing the best score Never
oVirt gerrit 29787 master MERGED Treat migration state as Up and check engine health normally Never
oVirt gerrit 29902 ovirt-hosted-engine-ha-1.2 MERGED Treat migration state as Up and check engine health normally Never
oVirt gerrit 29903 ovirt-hosted-engine-ha-1.1 MERGED Treat migration state as Up and check engine health normally Never
oVirt gerrit 29920 ovirt-hosted-engine-ha-1.2 MERGED use max not min when computing the best score Never
oVirt gerrit 29922 ovirt-hosted-engine-ha-1.1 MERGED use max not min when computing the best score Never

Description Artyom 2014-05-01 14:26:49 UTC
Created attachment 891547 [details]
target host agent log

Description of problem:
Migration of the hosted-engine VM puts the target host's score to zero; in my case it also does not change the source host's score.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.1.2-2.el6ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Set up a hosted-engine environment with two hosts; on the host where the engine VM runs, block the connection to the storage domain (via iptables -I INPUT -s sd_ip -j DROP)
2. Wait until the VM starts on the second host and check the host score

Actual results:
--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : 10.35.64.85
Host ID                            : 1
Engine status                      : {'reason': 'bad vm status', 'health': 'bad', 'vm': 'up', 'detail': 'waitforlaunch'}
Score                              : 0
Local maintenance                  : False
Host timestamp                     : 1398953165
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1398953165 (Thu May  1 17:06:05 2014)
        host-id=1
        score=0
        maintenance=False
        state=EngineUpBadHealth
        timeout=Thu May  1 17:10:34 2014


--== Host 2 status ==--

Status up-to-date                  : False
Hostname                           : 10.35.97.36
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 2400
Local maintenance                  : False
Host timestamp                     : 1398953033
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1398953033 (Thu May  1 17:03:53 2014)
        host-id=2
        score=2400
        maintenance=False
        state=EngineUp


Expected results:
I expect the target host to keep its score of 2400 (as it was before the migration started) and the source host to have a low score (because it has problems connecting to the storage domain).

Additional info:

Comment 1 Artyom 2014-05-01 15:38:14 UTC
The target host receives engine state=EngineUnexpectedlyDown, and vdsm.log shows:
Thread-7780::ERROR::2014-05-01 18:28:25,314::vm::2285::vm.Vm::(_startUnderlyingVm) vmId=`a8d328ea-991a-4a06-ac3a-cf2c11d4f264`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 2245, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/vm.py", line 3172, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 92, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2665, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: internal error Failed to acquire lock: error -243
Thread-7780::DEBUG::2014-05-01 18:28:25,321::vm::2727::vm.Vm::(setDownStatus) vmId=`a8d328ea-991a-4a06-ac3a-cf2c11d4f264`::Changed state to Down: internal error Failed to acquire lock: error -243

For this reason the host score is zero.

Comment 2 Artyom 2014-05-01 15:40:02 UTC
sanlock client status of source host
daemon d341e1a7-1277-492e-a7b7-b1c6649427f6.rose05.qa.
p -1 helper
p -1 listener
p -1 status
s 21caf848-8e2c-4d24-b709-c4e189fa5f4b:2:/rhev/data-center/mnt/10.35.160.108\:_RHEV_artyom__hosted__engine/21caf848-8e2c-4d24-b709-c4e189fa5f4b/dom_md/ids:0
s hosted-engine:2:/rhev/data-center/mnt/10.35.160.108\:_RHEV_artyom__hosted__engine/21caf848-8e2c-4d24-b709-c4e189fa5f4b/ha_agent/hosted-engine.lockspace:0

Comment 3 Andrew Lau 2014-06-07 23:42:49 UTC
In my case, this seems to happen after adding a third host to the cluster and it happens even without a migration.

Putting a host into local maintenance will jump its score back up to 2400, but it will drop back down to 0 about 5 minutes after setting maintenance --mode=none.

Comment 4 Joop van de Wege 2014-06-10 21:04:29 UTC
I have 3 hosts too and have the same problem. I'll upload logs of all hosts + engine (vdsm, supervdsm, sanlock, agent-ha, agent-broker).
Sequence:
Started ovirt01, which started engine01, while ovirt02 and 03 were powered off.
Then started ovirt02 and waited until stable, meaning hosted-engine --vm-status gave a correct status.
Then started ovirt03 and waited until the error -243 showed up.
Collected the logs.

Comment 5 Joop van de Wege 2014-06-10 21:10:24 UTC
Created attachment 907423 [details]
engine logs

Comment 6 Joop van de Wege 2014-06-10 21:10:58 UTC
Created attachment 907424 [details]
host01 logs

Comment 7 Joop van de Wege 2014-06-10 21:11:22 UTC
Created attachment 907425 [details]
host02 logs

Comment 8 Joop van de Wege 2014-06-10 21:11:47 UTC
Created attachment 907426 [details]
host03 logs

Comment 9 Benoit Laniel 2014-07-03 20:05:08 UTC
I had the same problem when adding a third host.

According to hosted_engine.py, engine_status_score is

    engine_status_score_lookup = {
        'None': 0,
        'vm-down': 1,
        'vm-up bad-health-status': 2,
        'vm-up good-health-status': 3,
    }
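To make the lookup concrete, here is a hedged sketch of how an engine status dict, like the ones printed by --vm-status in the description, could map to one of those scores. The classify helper and its handling of the status fields are my reading of the table above, not the real hosted_engine.py code:

```python
# Score table quoted from hosted_engine.py above.
engine_status_score_lookup = {
    'None': 0,
    'vm-down': 1,
    'vm-up bad-health-status': 2,
    'vm-up good-health-status': 3,
}

def classify(engine_status):
    """Map an engine status dict (as shown by --vm-status) to a lookup key."""
    if engine_status is None:
        return 'None'
    if engine_status.get('vm') != 'up':
        return 'vm-down'
    if engine_status.get('health') == 'good':
        return 'vm-up good-health-status'
    return 'vm-up bad-health-status'

# Host 1 from the report: engine VM up but with bad health.
host1 = {'reason': 'bad vm status', 'health': 'bad', 'vm': 'up',
         'detail': 'waitforlaunch'}
score1 = engine_status_score_lookup[classify(host1)]  # 2: vm-up bad-health-status
```

Under this mapping a host with no engine VM at all scores 0, which is what makes the min-vs-max selection described below matter.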

It seems that in state_machine.py, the refresh function of the EngineStateMachine class sets best_engine to the host with the lowest engine_status_score.

The problem is that, in the consume function of the EngineDown class, new_data.best_engine_status["vm"] can then never be 'up'.

Here's what I understood:

node1 is running the hosted engine, so it has the highest engine_status_score (vm-up good-health-status).

When node2 refreshes its data, it becomes best_engine since it has the lowest engine_status_score (None). It then tries to start the engine. The same applies to node3.

They cannot do this, since the engine is up and running on node1 and (I think) is locked. They finally transition to state EngineUnexpectedlyDown.

I think best_engine should be the host with the highest engine_status_score.

Changing line 124 of state_machine.py from "best_engine = min(alive_hosts," to "best_engine = max(alive_hosts," solved the problem for me. It never happened again, and I could migrate the engine from one node to another without issue (which was not the case before this change).
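The effect of that one-line change can be sketched as follows. The host names and scores are illustrative; only the min-vs-max selection mirrors the patched line in state_machine.py:

```python
# Each host carries the engine_status_score for its engine status:
# 0 = no engine VM, 3 = engine VM up and healthy.
alive_hosts = [
    ('node1', 3),  # running the engine, good health
    ('node2', 0),  # no engine VM
    ('node3', 0),  # no engine VM
]

# Buggy selection: min() picks a host with score 0, so
# best_engine_status["vm"] is never 'up' and the other hosts
# try to start a second engine VM (failing with sanlock error -243).
buggy_best = min(alive_hosts, key=lambda host: host[1])

# Fixed selection (gerrit 29580): max() picks the host that is
# actually running a healthy engine.
fixed_best = max(alive_hosts, key=lambda host: host[1])
```

With min(), node2 is chosen as "best"; with max(), node1 is, and the spurious start attempts stop.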

It may not be the ideal solution as I'm just starting with oVirt but I hope it will help solving this bug.

Comment 10 Jiri Moskovcak 2014-07-04 10:32:24 UTC
(In reply to Benoit Laniel from comment #9)
> 

- nice catch and good analysis Benoit! Thanks and welcome aboard!

Comment 11 Scott Herold 2014-07-07 14:22:02 UTC
Meital,

Can you check if you have capacity to test this for 3.4.1?  This is a pretty serious bug that we'd really like to get in.

Comment 13 Stefano Stagnaro 2014-07-08 15:58:06 UTC
I can confirm the bug on a freshly installed oVirt 3.4.1 with only two hosts in HA, relying on external NFSv4.

Comment 14 Martin Sivák 2014-07-09 10:21:26 UTC
I think there might be an issue in the host score calculation when a VM is migrating away. My guess is that once the status changes to something other than Up, we drop the score.

Comment 15 Jason Brooks 2014-07-10 22:22:11 UTC
I have a three node setup w/ hosted engine, using gluster nfs fronted by ctdb for the engine's storage. Every hosted engine migration triggers the error: 

VM HostedEngine is down. Exit message: internal error: Failed to acquire lock: error -243.

As with other reports here, the HostedEngine never actually goes down.

This is on ovirt 3.4.2 w/ 3 F20 hosts and a CentOS 6.5 hosted engine.

I can upload logs, etc. if needed.

Comment 16 Artyom 2014-07-11 05:51:23 UTC
I can check it; we have three hosts for hosted engine, and I can also change this line in state_machine.py (from min to max).

Comment 17 Jiri Moskovcak 2014-07-11 07:13:16 UTC
(In reply to Jason Brooks from comment #15)
> I have a three node setup w/ hosted engine, using gluster nfs fronted by
> ctdb for the engine's storage. Every hosted engine migration triggers the
> error: 
> 
> VM HostedEngine is down. Exit message: internal error: Failed to acquire
> lock: error -243.
> 
> As with other reports here, the HostedEngine never actually goes down.
> 
> This is on ovirt 3.4.2 w/ 3 F20 hosts and a CentOS 6.5 hosted engine.
> 
> I can upload logs, etc. if needed.

This is a different bug; please create a separate ticket for it, upload the logs, and describe how you ran the migration.

Comment 18 Jiri Moskovcak 2014-07-14 07:14:49 UTC
*** Bug 1093638 has been marked as a duplicate of this bug. ***

Comment 20 Artyom 2014-08-07 13:48:39 UTC
Verified on ovirt-hosted-engine-ha-1.2.1-0.2.master.20140805072346.el6.noarch.
Checked with 3 hosts; all works fine. Also checked the scenario from the description: the VM migrated without dropping the destination host's score to zero.

Comment 25 errata-xmlrpc 2015-02-11 21:08:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0194.html

