Note: This is a beta release of Red Hat Bugzilla 5.0. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Also email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback here.
Bug 1514853 - Vm run fails due to NPE when host statistics are missing.
Summary: Vm run fails due to NPE when host statistics are missing.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.1.0.2
Hardware: Unspecified
OS: Unspecified
high
high vote
Target Milestone: ovirt-4.2.2
: ---
Assignee: Ravi Nori
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-11-18 23:00 UTC by Leon Goldberg
Modified: 2018-03-29 11:03 UTC (History)
6 users (show)

Fixed In Version: ovirt-engine-4.2.2.4
Doc Type: No Doc Update
Doc Text:
undefined
Clone Of:
Environment:
Last Closed: 2018-03-29 11:03:28 UTC
oVirt Team: Infra
rule-engine: ovirt-4.2+


Attachments (Terms of Use)
NPE traceback (deleted)
2017-11-18 23:00 UTC, Leon Goldberg
no flags Details


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 87990 master MERGED engine : Vm run fails due to NPE when host statistics are missing 2018-03-05 15:26:31 UTC
oVirt gerrit 88492 ovirt-engine-4.2 MERGED engine : Vm run fails due to NPE when host statistics are missing 2018-03-05 17:05:29 UTC

Description Leon Goldberg 2017-11-18 23:00:29 UTC
Created attachment 1354986 [details]
NPE traceback

Description of problem:

Start a VM quick enough post host addition fails due to missing statistics (more specifically, CPU usage) with a NPE.

How reproducible:

I'm not sure if it's reproducible via the UI. A test introduced in https://gerrit.ovirt.org/#/c/83881/ fails at all times (test_live_vm_migration) without sleeping before running the VM.


Actual results:

Running the VM fails with a NPE.


Expected results:

Running the VM is blocked until possible (host status shouldn't be "UP"?)

Comment 1 Martin Sivák 2017-11-22 11:30:39 UTC
The scheduler uses getVdsDao()        .getAllForClusterWithStatus(cluster.getId(), VDSStatus.Up);

I think the host should not be UP until its db records are fully populated. Can someone from the Infra team please take a look?

Comment 2 Oved Ourfali 2017-11-27 13:06:03 UTC
(In reply to Martin Sivák from comment #1)
> The scheduler uses getVdsDao()       
> .getAllForClusterWithStatus(cluster.getId(), VDSStatus.Up);
> 
> I think the host should not be UP until its db records are fully populated.
> Can someone from the Infra team please take a look?

Why host shouldn't be up if it doesn't have statistics?
It can move to up and gather statistics as we go.

I agree it is a race, but I wouldn't tie the statistics with the host status. Imagine that in the future the statistics might come from a different external service. Would we leave the host down until statistics are gathered?

Comment 3 Martin Sivák 2017-11-27 13:11:33 UTC
The scheduler needs to be able to get a list of hosts with all the crucial info present (like free memory). It just can't work without the data.

So we either not put the hosts to UP before that is available (because you can't start any VMs there anyway) or we need API that checks whether stats were collected already. The scheduler is not the right place to put the logic to, it needs (best case) a bool returning filter function.

Comment 4 Oved Ourfali 2017-11-27 13:43:41 UTC
Martin S - Anyhow you should fail with an error rather than get an NPE.
Martin P - your thoughts?

Comment 5 Yaniv Kaul 2017-12-18 07:16:46 UTC
(In reply to Oved Ourfali from comment #2)
> (In reply to Martin Sivák from comment #1)
> > The scheduler uses getVdsDao()       
> > .getAllForClusterWithStatus(cluster.getId(), VDSStatus.Up);
> > 
> > I think the host should not be UP until its db records are fully populated.
> > Can someone from the Infra team please take a look?
> 
> Why host shouldn't be up if it doesn't have statistics?
> It can move to up and gather statistics as we go.
> 
> I agree it is a race, but I wouldn't tie the statistics with the host
> status. Imagine that in the future the statistics might come from a
> different external service. Would we leave the host down until statistics
> are gathered?

We will never require critical statistics from an external service (i.e., not VDSM). We may ask for additional, non-scheduling related stats from an external service, so I agree with Martin Sivak here that we should require the relevant stats prior to setting the host as UP.

Comment 6 Martin Perina 2018-01-17 14:49:04 UTC
Ravi, could you please take a look where exactly we have this issue?

Comment 7 Ravi Nori 2018-02-20 18:32:41 UTC
The status of the host is set to UP before the VdsStatistics for the host is updated. This can cause a NPE if the scheduler tries to run a VM on the host before the VdsStatistics for the host have been updated. 

Setting the status to UP only after the statistics have been updated, fixes the issue.

Comment 8 Israel Pinto 2018-03-20 15:22:58 UTC
Verify with:

Steps:
Case 1:
Set host to maintenance and activate host.
During the activation try to start VM via REST.
Case 2:
Add Host, during the installation try to start VM.


Results:
On both cases, VM failed to start. No NullPointerException found.

Comment 9 Sandro Bonazzola 2018-03-29 11:03:28 UTC
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.