
Bug 1365242

Summary: 3.4->3.5->3.6->4.0 SHE migration: ovirt-ha-agent not working correctly / state=AgentStopped
Product: [oVirt] ovirt-hosted-engine-ha
Reporter: Jiri Belka <jbelka>
Component: Agent
Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED CURRENTRELEASE
QA Contact: Jiri Belka <jbelka>
Severity: medium
Docs Contact:
Priority: medium
Version: 2.0.1
CC: bugs, mavital, sbonazzo, stirabos, ylavi
Target Milestone: ovirt-4.0.4
Flags: rule-engine: ovirt-4.0.z+, rule-engine: exception+, ylavi: planning_ack+, sbonazzo: devel_ack+, mavital: testing_ack+
Target Release: 2.0.4
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text: With 4.0 we moved to the jsonrpc protocol; adding additional checks on jsonrpc responses.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-09-26 12:35:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Jiri Belka 2016-08-08 17:04:44 UTC
Description of problem:

After completing the SHE migration path 3.4->3.5->3.6->4.0 and ending global maintenance, the HE VM was not started automatically, and hosted-engine --vm-status showed the agent in 'state=AgentStopped'.

Manually starting the HE VM with hosted-engine --vm-start worked fine.

~~~
# hosted-engine --vm-status | sed 's/rhev.lab.eng.brq.redhat/example.com/'
/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py:15: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
  import vdsm.vdscli


--== Host 1 status ==--

Status up-to-date                  : False
Hostname                           : 10-34-60-151.example.com.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : e9f3cf55
Host timestamp                     : 231104
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=231104 (Mon Aug  8 15:53:34 2016)
        host-id=1
        score=0
        maintenance=False
        state=AgentStopped
        stopped=True


--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : 10-34-60-215.example.com.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : True
crc32                              : 2e113351
Host timestamp                     : 234997
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=234997 (Mon Aug  8 18:39:40 2016)
        host-id=2
        score=0
        maintenance=True
        state=LocalMaintenance
        stopped=False
~~~
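
The per-host blocks above can be checked mechanically. The following is only a minimal Python sketch, assuming the plain-text layout shown in this report; the names `parse_vm_status` and `stopped_agents` are hypothetical helpers and not part of ovirt-hosted-engine-ha.

```python
import re

# Shortened sample in the same layout as the output quoted in this report.
SAMPLE = """
--== Host 1 status ==--

Status up-to-date                  : False
Host ID                            : 1
Score                              : 0
Extra metadata (valid at timestamp):
        timestamp=231104 (Mon Aug  8 15:53:34 2016)
        state=AgentStopped
        stopped=True

--== Host 2 status ==--

Status up-to-date                  : True
Host ID                            : 2
Extra metadata (valid at timestamp):
        state=LocalMaintenance
        stopped=False
"""


def parse_vm_status(text):
    """Split the status text into one dict of fields per host ID."""
    hosts = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"--== Host (\d+) status ==--$", line)
        if header:
            current = hosts.setdefault(int(header.group(1)), {})
        elif current is None or not line:
            continue
        elif " : " in line:
            # Summary rows such as "Score    : 0"
            key, _, value = line.partition(" : ")
            current[key.strip()] = value.strip()
        elif "=" in line:
            # Extra metadata rows such as "state=AgentStopped"
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return hosts


def stopped_agents(hosts):
    """Return the IDs of hosts whose metadata reports state=AgentStopped."""
    return sorted(
        host_id
        for host_id, fields in hosts.items()
        if fields.get("state") == "AgentStopped"
    )


if __name__ == "__main__":
    print(stopped_agents(parse_vm_status(SAMPLE)))
```

Lines wrapped by the terminal (as in the original paste) are not rejoined by this sketch, so it is best fed unwrapped output.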

The log contains many "ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: ''items'' - trying to restart agent" entries.

Both hosts were EL7 with 4.0 rpms.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.0.1-1.el7ev.noarch

How reproducible:
hard to reproduce, if at all possible

Steps to Reproduce:
1. discovered as part of 3.4->3.5->3.6->4.0 SHE migration

Actual results:
HE VM was not started after ending global maintenance

Expected results:
HE VM should be started automatically.

Additional info:

Comment 2 Simone Tiraboschi 2016-08-09 15:43:45 UTC
The issue is here:

MainThread::WARNING::2016-08-08 15:51:03,712::hosted_engine::480::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 445, in start_monitoring
    self._initialize_storage_images()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 667, in _initialize_storage_images
    img.prepare_images()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/image.py", line 141, in prepare_images
    for volUUID in vm_vol_uuid_list['items']:
KeyError: 'items'
MainThread::INFO::2016-08-08 15:51:05,328::hosted_engine::496::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 60 seconds
MainThread::INFO::2016-08-08 15:52:05,455::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1470664325.46 type=state_transition detail=GlobalMaintenance-ReinitializeFSM hostname='10-34-60-151.rhev.lab.eng.brq.redhat.com'

It seems that at a certain time you got an image without a volume (still not sure how), and our code failed while scanning it.
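
To illustrate the failure mode in the traceback: iterating over `response['items']` raises KeyError whenever the response carries no 'items' key. The following Python sketch shows one defensive way to read such a response. It is not the fix shipped for this bug; `volume_uuids` is a hypothetical helper, and the shape of the error response in the usage part is an assumption.

```python
def volume_uuids(response):
    """Return the volume UUIDs from a volume-list style response.

    Assumption: a successful response is a dict with the list under
    'items', as the traceback implies. Anything else is treated as
    "no volumes" instead of raising KeyError.
    """
    if not isinstance(response, dict):
        return []
    items = response.get("items")
    if not isinstance(items, (list, tuple)):
        return []
    return list(items)


if __name__ == "__main__":
    # Hypothetical response without 'items' (image without a volume).
    empty = {"status": {"code": 0, "message": "OK"}}
    for vol_uuid in volume_uuids(empty):
        print("would prepare volume", vol_uuid)
    print("volumes found:", len(volume_uuids(empty)))
```

Whether an empty result should be skipped silently or logged as a warning depends on the caller; the sketch only shows that the lookup itself need not crash the monitoring loop.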

Comment 3 Jiri Belka 2016-09-19 23:27:31 UTC
OK with ovirt-hosted-engine-ha-2.0.4-1.el7ev.noarch.

I can no longer see the issue described in comment #2.