Bug 1599110 - Hosted-Engine HA failed and started looping, not recovering itself.
Keywords:
Status: CLOSED DUPLICATE of bug 1583712
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 4.2.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Martin Sivák
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-09 00:35 UTC by Germano Veit Michel
Modified: 2018-07-09 06:27 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-09 06:05:27 UTC
oVirt Team: SLA
Target Upstream Version:




Links
Red Hat Knowledge Base (Solution) 3238221 (last updated 2018-07-09 06:27:19 UTC)

Description Germano Veit Michel 2018-07-09 00:35:49 UTC
Description of problem:

There was a problem with the hosted-engine metadata LV (somehow it got deactivated). After this, the HA mechanism was unable to recover: it looped through states, sending a flood of notification emails, and never attempted prepareImages or a storage domain refresh to bring that LV up again. Storage was available the whole time.

See the details:

1. At 15:58:25, the hosted-engine metadata volume LV was somehow deactivated. The VDSM logs from this time have rotated; all I could find was that vdsm restarted 1 minute earlier, and it was not through soft fencing.

MainThread::INFO::2018-07-04 15:57:55,791::hosted_engine::506::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Required service vdsmd is not up.

StatusStorageThread::ERROR::2018-07-04 15:58:25,676::status_broker::92::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to read state.                           
Traceback (most recent call last):                                                                                                                                                             
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 88, in run                                                                                      
    self._storage_broker.get_raw_stats()                                                                                                                                                       
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 163, in get_raw_stats                                                                          
    .format(str(e)))                                                                                                                                                                           
RequestError: failed to read metadata: [Errno 2] No such file or directory: '/var/run/vdsm/storage/2fc4a59d-f27b-41be-a851-940694b94314/e14df275-5499-49fe-a158-4460948e54aa/f3efb31c-ee23-4a24-9c96-c695831409d6'

$ grep f3efb31c-ee23-4a24-9c96-c695831409d6 etc/ovirt-hosted-engine/hosted-engine.conf 
metadata_volume_UUID=f3efb31c-ee23-4a24-9c96-c695831409d6
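The path the broker failed to open lives under /var/run/vdsm/storage and only resolves while the LV is active, so a deactivated LV surfaces as ENOENT. A minimal sketch of the failing open, assuming an illustrative path (the real get_raw_stats also deals with O_DIRECT buffer alignment, omitted here):

```python
import errno
import os

def read_metadata(path, block_size=4096):
    """Open the metadata volume the way the broker does (O_DIRECT where
    available) and read one block; a dangling run-dir link raises ENOENT."""
    direct_flag = getattr(os, "O_DIRECT", 0)  # not defined on all platforms
    try:
        fd = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
    except OSError as e:
        if e.errno == errno.ENOENT:
            # Matches the RequestError text seen in the agent logs
            raise RuntimeError("failed to read metadata: %s" % e)
        raise
    try:
        return os.read(fd, block_size)
    finally:
        os.close(fd)

# A nonexistent path (hypothetical UUIDs) reproduces the logged error:
try:
    read_metadata("/var/run/vdsm/storage/no-such-sd/no-such-img/no-such-vol")
except RuntimeError as e:
    print(e)
```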

2. Then the agent restarts (due to the vdsm restart, I understand):

MainThread::INFO::2018-07-04 15:58:38,094::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2018-07-04 15:58:49,824::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.2.11 started

3. The HA mechanism never recovered from this. After the ha-agent restart it loops through the states below while the HE VM was running on another host (HE disk lock acquired):

EngineDown-EngineStart-EngineStarting-EngineForceStop-ReinitializeFSM

This sent dozens of emails, as this specific host was unable to read the metadata LV to see the other hosts' states and therefore kept trying (and failing) to start the HE VM.

For example, after 15:58, if one looks at the host id 1 timestamp (the logs above are from host id 5), it is always 2265573 from the point of view of this failed host. But the host doesn't try to refresh the storage domain or prepare images to bring up the metadata LV and read the newer status of the other hosts. It just keeps looping.

This is a 5-host HE cluster; only host id 5 faced this problem, the other 4 hosts were fine.
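The loop above can be modeled as a tiny state machine. With the metadata LV unreadable, the host can never see that another host already owns the engine, so the force-stop path leads straight back to reinitialization. This is a hypothetical simplification for illustration, not the real ovirt-hosted-engine-ha FSM:

```python
# Simplified model of the observed state cycle; state names are taken
# from the agent logs, the transitions are an illustrative reduction.
TRANSITIONS = {
    "EngineDown": "EngineStart",
    "EngineStart": "EngineStarting",
    "EngineStarting": "EngineForceStop",  # sanlock lease held by another host
    "EngineForceStop": "ReinitializeFSM",
    "ReinitializeFSM": "EngineDown",      # metadata still unreadable
}

def run(start, steps):
    """Follow the transition table for a fixed number of steps."""
    state, seen = start, [start]
    for _ in range(steps):
        state = TRANSITIONS[state]
        seen.append(state)
    return seen

path = run("EngineDown", 10)
print(" -> ".join(path))
```

Because nothing in the cycle refreshes the storage, the sequence repeats with period 5 and never terminates on its own.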

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.11-1.el7ev.noarch

How reproducible:
Unknown

Actual results:
The HA agent of a single host keeps looping and does not recover.

Expected results:
Bring up the missing LV (prepareImages/refresh storage) and recover.
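The expected behavior could look something like the sketch below: on ENOENT, ask the host to re-prepare the volume before retrying, instead of looping forever. The prepare_fn here is a stand-in for whatever prepareImages/refresh call would reactivate the LV; the stubs are purely for demonstration:

```python
import errno

def read_with_recovery(read_fn, prepare_fn, retries=1):
    """On ENOENT, re-prepare the volume (stand-in for a prepareImages /
    storage refresh) and retry the read, rather than failing repeatedly."""
    for attempt in range(retries + 1):
        try:
            return read_fn()
        except OSError as e:
            if e.errno != errno.ENOENT or attempt == retries:
                raise
            prepare_fn()  # reactivate LV / recreate the run-dir link

# Demo with stubs: the first read fails, prepare "fixes" the volume.
state = {"ready": False}

def fake_read():
    if not state["ready"]:
        raise OSError(errno.ENOENT, "No such file or directory")
    return b"metadata"

def fake_prepare():
    state["ready"] = True

print(read_with_recovery(fake_read, fake_prepare))
```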

Comment 3 Germano Veit Michel 2018-07-09 05:21:30 UTC
(In reply to Germano Veit Michel from comment #0)
> There was a problem with the hosted-engine metadata LV (somehow it got
> deactivated).

I think this could have been caused by the vdsm restart, as it cleans up "stale" LVs on start-up. Will try to reproduce.

Comment 4 Germano Veit Michel 2018-07-09 06:05:27 UTC
I reproduced the LV deactivation by vdsm, but the problem does not fully reproduce in 4.2.4: it recovers automatically because the HA broker restarts.

# date && lvs | grep 8a9577f7-2e95-49a2-8ab8-8447c935446f
Mon Jul  9 15:58:40 AEST 2018
  8a9577f7-2e95-49a2-8ab8-8447c935446f 6c873880-0244-4823-90d5-03e4ec562cd8 -wi-a-----   1.00g

# systemctl restart vdsmd

2018-07-09 15:59:02,877+1000 INFO  (hsm/init) [storage.LVM] Deactivating lvs: vg=6c873880-0244-4823-90d5-03e4ec562cd8 lvs=['8a9577f7-2e95-49a2-8ab8-8447c935446f', 'ad62b2ec-3f99-47cd-a8e0-d22d4b1b3d69', '3fc2c155-65b7-4b2c-988a-b8beab04a01a', 'c9042982-1ec4-4b74-8c97-b7ab119ac8d6'] (lvm:684)

# date && lvs | grep 8a9577f7-2e95-49a2-8ab8-8447c935446f
Mon Jul  9 15:59:23 AEST 2018
  8a9577f7-2e95-49a2-8ab8-8447c935446f 6c873880-0244-4823-90d5-03e4ec562cd8 -wi-------   1.00g
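In the lvs output above, the fifth character of the lv_attr field is the activation state, which flips from a (active) before the vdsmd restart to - (inactive) after it. A trivial check of that field:

```python
def is_active(lv_attr):
    """The fifth character of LVM's lv_attr field is the activation
    state: 'a' means active, '-' means deactivated (see lvs(8))."""
    return lv_attr[4] == "a"

print(is_active("-wi-a-----"))  # before the vdsmd restart
print(is_active("-wi-------"))  # after the vdsmd restart
```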

StatusStorageThread::ERROR::2018-07-09 15:59:09,945::storage_broker::161::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(get_raw_stats) Failed to read metadata from /var/run/vdsm/storage/6c873880-0244-4823-90d5-03e4ec562cd8/8be3e565-99ee-4a07-b515-58ca38c2e8ec/3fc2c155-65b7-4b2c-988a-b8beab04a01a
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 152, in get_raw_stats
    f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
OSError: [Errno 2] No such file or directory: '/var/run/vdsm/storage/6c873880-0244-4823-90d5-03e4ec562cd8/8be3e565-99ee-4a07-b515-58ca38c2e8ec/3fc2c155-65b7-4b2c-988a-b8beab04a01a'
StatusStorageThread::ERROR::2018-07-09 15:59:09,946::status_broker::98::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to read state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 94, in run
    self._storage_broker.get_raw_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 163, in get_raw_stats
    .format(str(e)))
RequestError: failed to read metadata: [Errno 2] No such file or directory: '/var/run/vdsm/storage/6c873880-0244-4823-90d5-03e4ec562cd8/8be3e565-99ee-4a07-b515-58ca38c2e8ec/3fc2c155-65b7-4b2c-988a-b8beab04a01a'
StatusStorageThread::ERROR::2018-07-09 15:59:09,947::status_broker::70::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(trigger_restart) Trying to restart the broker

Tracked it down to this: BZ1583712

*** This bug has been marked as a duplicate of bug 1583712 ***

