Bug 1516203 - [downstream clone - 4.1.8] hosted engine agent is not able to refresh hosted engine status when iso domain is not available after network outage
Summary: [downstream clone - 4.1.8] hosted engine agent is not able to refresh hosted engine status when iso domain is not available after network outage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 4.0.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ovirt-4.1.8
Target Release: ---
Assignee: Martin Sivák
QA Contact: Artyom
URL:
Whiteboard:
Depends On: 1485883
Blocks: 1516964
 
Reported: 2017-11-22 09:29 UTC by rhev-integ
Modified: 2017-12-12 09:26 UTC
CC List: 12 users

Fixed In Version: 2.1.8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1485883
Environment:
Last Closed: 2017-12-12 09:26:02 UTC
oVirt Team: SLA


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:3435 normal SHIPPED_LIVE ovirt-hosted-engine-ha bug fix update for 4.1.8 2017-12-12 14:18:05 UTC
oVirt gerrit 84017 v2.1.z MERGED Use /run/vdsm to lookup volume paths 2017-11-22 09:31:49 UTC
oVirt gerrit 84018 v2.1.z MERGED Prepare symlinks for OVF store and cache the path 2017-11-22 09:31:49 UTC

Description rhev-integ 2017-11-22 09:29:34 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1485883 +++
======================================================================

Description of problem:
hosted engine agent is not able to refresh hosted engine status when iso domain is not available after network outage

Version-Release number of selected component (if applicable):
rhevm-4.0.7.4-0.1.el7ev.noarch


How reproducible:
every time

Steps to Reproduce:
1. install hosted engine
2. add iso storage domain
3. power off hosted engine vm
4. make iso storage domain unavailable
5. start the hosted engine vm


Actual results:
hosted engine agent is not able to retrieve data about hosted engine status

--== Host 3 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : hosted_engine3
Host ID                            : 3
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : abdcbb9b
local_conf_timestamp               : 956778
Host timestamp                     : 956762
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=956762 (Tue Aug 22 13:05:51 2017)
        host-id=3
        score=3400
        vm_conf_refresh_time=956778 (Tue Aug 22 13:06:07 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineStop
        stopped=False
        timeout=Mon Jan 12 01:48:56 1970

Expected results:
The agent will be able to determine the engine status; it should ignore the ISO domain status.

Additional info:

The SPM is not able to get the status of the unreachable ISO storage domain.

(Originally by Marian Jankular)

Comment 1 rhev-integ 2017-11-22 09:29:41 UTC
Where is the ISO domain located? Is it a separate storage server, or is it placed directly on the engine VM as used to be possible in the past?

(Originally by Martin Sivak)

Comment 3 rhev-integ 2017-11-22 09:29:45 UTC
Nir, can an unresponsive ISO domain cause something like that on the vdsm side? This is not the first time we have seen something like this.

(Originally by Martin Sivak)

Comment 4 rhev-integ 2017-11-22 09:29:51 UTC
Yes, it can. Why do you have an ISO attached to the HE VM?

(Originally by michal.skrivanek)

Comment 5 rhev-integ 2017-11-22 09:29:56 UTC
Hello Michal,

I mean an ISO storage domain hosted on the HE VM, not an ISO image attached to the HE VM.

(Originally by Marian Jankular)

Comment 8 rhev-integ 2017-11-22 09:30:13 UTC
We are currently trying to reproduce this issue in our test environments to be able to find out what the root cause might be. We will try with ISO domain inside the engine VM itself, on a separate server and with standard storage just to be sure we cover all possible paths.

(Originally by Martin Sivak)

Comment 9 rhev-integ 2017-11-22 09:30:18 UTC
To be honest I do not really understand the reproduce steps

1) Configure HE environment
2) Configure ISO domain on the HE VM
3) I assume that you also have a master storage domain configured on the engine
4) Add ISO domain to the engine
5) Power off the hosted engine VM (this will also make the ISO domain unavailable, as it is placed on the HE VM)
6) I do not understand the step "make iso storage domain unavailable"; how do you make it unavailable?
7) The step "start the hosted engine VM" is also not clear to me; ovirt-ha-agent must start it by itself on another host that is in the UP state, without any interaction from the user. Why do you start it?

(Originally by Artyom Lukianov)

Comment 10 rhev-integ 2017-11-22 09:30:23 UTC
Hi Artyom,

1-4 correct 
5. Before powering off, run "firewall-cmd --permanent --remove-service=nfs"
6. Power off the VM
7. Power on the VM

I am sorry for the initial steps; they were supposed to be:


Steps to Reproduce:
1. install hosted engine
2. add iso storage domain
3. make iso storage domain unavailable
4. power off hosted engine vm
5. start the hosted engine vm

(Originally by Marian Jankular)

Comment 11 rhev-integ 2017-11-22 09:30:30 UTC
Thanks for the clarification.

(Originally by Artyom Lukianov)

Comment 12 rhev-integ 2017-11-22 09:30:35 UTC
And, I hope, the last question: do you block the ISO domain from the engine or from the host where the HE VM runs?

(Originally by Artyom Lukianov)

Comment 13 rhev-integ 2017-11-22 09:30:40 UTC
Hi,

I removed the "allow rules" on the HE VM so the host cannot access it.

Marian

(Originally by Marian Jankular)

Comment 14 rhev-integ 2017-11-22 09:30:45 UTC
Any update about the test results?

(Originally by Martin Sivak)

Comment 15 rhev-integ 2017-11-22 09:30:49 UTC
Checked on:
ovirt-hosted-engine-setup-2.2.0-0.0.master.20171009203744.gitd01cc03.el7.centos.noarch
ovirt-hosted-engine-ha-2.2.0-0.0.master.20171013115034.20171013115031.gitc8edb37.el7.centos.noarch
ovirt-engine-appliance-4.2-20171016.1.el7.centos.noarch
=====================================================================
Steps:
1) Deploy HE environment with two hosts
2) Configure NFS storage on the HE VM
3) Add ISO domain from the HE VM to the engine 
4) Remove NFS firewall rule from the HE VM
# firewall-cmd --permanent --remove-service=nfs
# firewall-cmd --reload
5) Poweroff HE VM
# hosted-engine --vm-poweroff
6) Wait until the agent starts the HE VM - FAILED

For some reason, the HE status command also shows the following:
--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : cyan-vdsf.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 710c118a
local_conf_timestamp               : 12487
Host timestamp                     : 12487
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=12487 (Mon Oct 16 19:25:16 2017)
        host-id=1
        score=3400
        vm_conf_refresh_time=12487 (Mon Oct 16 19:25:16 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : rose05.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : unknown stale-data
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : a8512249
local_conf_timestamp               : 3307341
Host timestamp                     : 3307341
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=3307341 (Mon Oct 16 19:25:32 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=3307341 (Mon Oct 16 19:25:32 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False

(Originally by Artyom Lukianov)

Comment 16 rhev-integ 2017-11-22 09:30:56 UTC
Created attachment 1339376 [details]
agent, broker and vdsm logs from both hosts

(Originally by Artyom Lukianov)

Comment 17 rhev-integ 2017-11-22 09:31:01 UTC
Manually starting the HE VM via "# hosted-engine --vm-start" brought the whole environment back to a normal state.

(Originally by Artyom Lukianov)

Comment 18 rhev-integ 2017-11-22 09:31:06 UTC
Both agents seem to be stuck on OVF extraction. Would it be possible to reproduce this with DEBUG log enabled?

(Originally by Martin Sivak)

Comment 19 rhev-integ 2017-11-22 09:31:11 UTC
Created attachment 1340257 [details]
agent, broker and vdsm logs from both hosts(DEBUG)

(Originally by Artyom Lukianov)

Comment 20 rhev-integ 2017-11-22 09:31:16 UTC
So I think I found the culprit here:

The code here https://gerrit.ovirt.org/gitweb?p=ovirt-hosted-engine-ha.git;a=blob;f=ovirt_hosted_engine_ha/lib/heconflib.py;h=9e1996b9b0355cf3e5c9560e6f59679790ec7e8f;hb=5985fc70c4d5198d2ae3d8a3682fb85cdc3a2d35#l362 uses glob on top of /rhev/data-center/mnt:

    volume_path = os.path.join(
        volume_path,
        '*',
        sd_uuid,
        'images',
        img_uuid,
        vol_uuid,
    )
    volumes = glob.glob(volume_path)

Notice the asterisk position: the glob basically scans the directories of all mounted storage domains, and if one of the domains is NFS-based and unavailable, we get stuck there.

I wonder if we can avoid the glob call.

(Originally by Martin Sivak)
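
A minimal sketch (not part of the original bug; UUIDs are hypothetical, for illustration only) of why the wildcard makes this lookup hang: to expand the '*', glob has to check every mount point under /rhev/data-center/mnt, including storage domains the agent does not care about, and a hard NFS mount whose server is unreachable blocks those checks indefinitely.

    import glob
    import os

    MNT_ROOT = '/rhev/data-center/mnt'  # every file-based storage domain is mounted here

    # Hypothetical UUIDs, for illustration only.
    sd_uuid = 'd6e4a622-bd31-4d8f-904d-1e26b7286757'
    img_uuid = 'a6f96cf8-ffd9-4b14-ac7a-5f1fa8e80bb7'
    vol_uuid = '58adc0fb-c658-4ed1-a1b2-924b320477cb'

    pattern = os.path.join(MNT_ROOT, '*', sd_uuid, 'images', img_uuid, vol_uuid)

    # glob.glob() lists MNT_ROOT and then checks for sd_uuid under every
    # entry, so it touches each mounted storage domain; a single
    # unresponsive NFS mount makes that check block and the call never
    # returns.
    volumes = glob.glob(pattern)
    print(volumes)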

Comment 21 rhev-integ 2017-11-22 09:31:21 UTC
(In reply to Martin Sivák from comment #19)
> I wonder if we can avoid the glob call.

hosted engine should not look inside /rhev/data-center/mnt. It should use

    /run/vdsm/storage/sd-id/img-id/vol-id

See this example - I have a vm with one block-based disk:
/dev/6ffbc483-0031-403a-819b-3bb2f0f8de0a/58adc0fb-c658-4ed1-a1b2-924b320477cb

And one file based disk:
/rhev/data-center/mnt/dumbo.tlv.redhat.com:_voodoo_40/d6e4a622-bd31-4d8f-904d-1e26b7286757/images/a6f96cf8-ffd9-4b14-ac7a-5f1fa8e80bb7

# tree /run/vdsm/storage/
/run/vdsm/storage/
├── 373e8c55-283f-41d4-8433-95c1ef1bbd1a
├── 6ffbc483-0031-403a-819b-3bb2f0f8de0a
│   └── e54681ee-01d7-46a9-848f-2da2a38b8f1e
│       ├── 58adc0fb-c658-4ed1-a1b2-924b320477cb -> /dev/6ffbc483-0031-403a-819b-3bb2f0f8de0a/58adc0fb-c658-4ed1-a1b2-924b320477cb
│       └── 93331705-46be-4cb8-9dc2-c1559843fd4a -> /dev/6ffbc483-0031-403a-819b-3bb2f0f8de0a/93331705-46be-4cb8-9dc2-c1559843fd4a
└── d6e4a622-bd31-4d8f-904d-1e26b7286757
    └── a6f96cf8-ffd9-4b14-ac7a-5f1fa8e80bb7 -> /rhev/data-center/mnt/dumbo.tlv.redhat.com:_voodoo_40/d6e4a622-bd31-4d8f-904d-1e26b7286757/images/a6f96cf8-ffd9-4b14-ac7a-5f1fa8e80bb7

But it is best to use a vdsm API instead of duplicating knowledge about the file system
layout in hosted engine.

(Originally by Nir Soffer)
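
A minimal sketch of the direct lookup described above, assuming the image has already been prepared so the /run/vdsm/storage symlinks exist (UUIDs are hypothetical, for illustration only):

    import os

    RUN_VDSM_STORAGE = '/run/vdsm/storage'

    def volume_path(sd_uuid, img_uuid, vol_uuid):
        # /run/vdsm/storage/<sd>/<img>/<vol> is a symlink created by vdsm
        # when the image is prepared; it points at the real file or block
        # device, so no scan of /rhev/data-center/mnt is needed.
        path = os.path.join(RUN_VDSM_STORAGE, sd_uuid, img_uuid, vol_uuid)
        if not os.path.exists(path):
            raise RuntimeError('volume not prepared or missing: %s' % path)
        return path

    print(volume_path('d6e4a622-bd31-4d8f-904d-1e26b7286757',
                      'a6f96cf8-ffd9-4b14-ac7a-5f1fa8e80bb7',
                      '58adc0fb-c658-4ed1-a1b2-924b320477cb'))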

Comment 22 rhev-integ 2017-11-22 09:31:25 UTC
Thanks Nir, we were wondering about those symlinks.

Are those created for all present volumes/images or do we need to call prepareImage to get them? I am asking because we are interested in the OVF store for example and we do not mount that one.

We are considering using the API as well now as this is pretty old code. We might not have had the necessary APIs when it was written.

(Originally by Martin Sivak)

Comment 23 rhev-integ 2017-11-22 09:31:30 UTC
(In reply to Martin Sivák from comment #21)
> Thanks Nir, we were wondering about those symlinks.
> 
> Are those created for all present volumes/images or do we need to call
> prepareImage to get them? I am asking because we are interested in the OVF
> store for example and we do not mount that one.

These are created when preparing an image, so this is not a way to locate
volumes you don't know about.

> We are considering using the API as well now as this is pretty old code. We
> might not have had the necessary APIs when it was written.

We don't have an API for locating OVF_STORE volumes; these are a private
implementation detail managed by the engine. I think the right solution would be
to register the OVF_STORE disks in the domain metadata and provide an API to
fetch the disk UUIDs.

(Originally by Nir Soffer)

Comment 24 rhev-integ 2017-11-22 09:31:35 UTC
Right, but we know how to get the right UUIDs, so that might be a way. We just have to call prepareImages with the right IDs and then access the /run structure or use some reasonable API that would give us the path (any hint?).

(Originally by Martin Sivak)

Comment 25 rhev-integ 2017-11-22 09:31:40 UTC
(In reply to Martin Sivák from comment #23)
> Right, but we know how to get the right UUIDs, so that might be a way. We
> just have to call prepareImages with the right IDs and then access the /run
> structure or use some reasonable API that would give us the path (any hint?).

If you know the uuid of the image, prepare it and get the path to the
volume from the response, see
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/api/vdsm-api.yml#L2922

(Originally by Nir Soffer)
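
A rough sketch of that approach, assuming vdsm's jsonrpc client (vdsm.client) and the Image.prepare verb referenced in the linked schema; the parameter names and the shape of the response are taken from that schema and may differ between vdsm versions, and all UUIDs are hypothetical:

    from vdsm import client

    # Connect to the local vdsmd jsonrpc endpoint (default port 54321).
    cli = client.connect('localhost', 54321)

    # Hypothetical UUIDs, for illustration only.
    SP_UUID = '00000000-0000-0000-0000-000000000000'
    SD_UUID = 'd6e4a622-bd31-4d8f-904d-1e26b7286757'
    IMG_UUID = 'a6f96cf8-ffd9-4b14-ac7a-5f1fa8e80bb7'
    VOL_UUID = '58adc0fb-c658-4ed1-a1b2-924b320477cb'

    # Image.prepare activates the volume and creates the /run/vdsm/storage
    # symlinks; the response is expected to carry the resolved path (an
    # assumption based on the linked vdsm-api.yml, not verified here).
    res = cli.Image.prepare(
        storagepoolID=SP_UUID,
        storagedomainID=SD_UUID,
        imageID=IMG_UUID,
        volumeID=VOL_UUID,
    )
    print(res.get('path'))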

Comment 27 Artyom 2017-11-26 13:20:00 UTC
Verified on ovirt-hosted-engine-ha-2.1.8-1.el7ev.noarch

Comment 30 errata-xmlrpc 2017-12-12 09:26:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3435

