Bug 1511037 - GetUnregisteredDiskQuery fails with NullPointerException after domain import between 4.1 and 4.2 envs
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.1.7.6
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.1
Target Release: 4.2.2.1
Assignee: Maor
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-08 15:14 UTC by Elad
Modified: 2018-03-29 10:47 UTC (History)

Fixed In Version: ovirt-engine-4.2.1.2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-29 10:47:18 UTC
oVirt Team: Storage
rule-engine: ovirt-4.2+
ylavi: exception+


Attachments
logs (deleted), 2017-11-08 15:14 UTC, Elad
logs from hypervisors (deleted), 2017-11-09 09:02 UTC, Elad
export domain scenario - engine.log (DEBUG) (deleted), 2017-12-03 14:42 UTC, Elad


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 86328 master MERGED core: return verbal message on prepareImage failure 2018-01-15 13:25:48 UTC

Description Elad 2017-11-08 15:14:06 UTC
Created attachment 1349482 [details]
logs

Description of problem:
VM registration sometimes fails with the following exception:

2017-11-08 15:48:03,638+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.GetUnregisteredDiskQuery] (default task-7) [14829135-429c-47e6-865a-449039821c97] Query 'GetUnregisteredDiskQuery' failed: EngineException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)
2017-11-08 15:48:03,638+02 ERROR [org.ovirt.engine.core.bll.storage.disk.image.GetUnregisteredDiskQuery] (default task-7) [14829135-429c-47e6-865a-449039821c97] Exception: org.ovirt.engine.core.common.errors.EngineException: EngineException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:118) [bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.ImagesHandler.prepareImage(ImagesHandler.java:1005) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.ImagesHandler.getQemuImageInfoFromVdsm(ImagesHandler.java:844) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.GetUnregisteredDiskQuery.fetchQcowCompat(GetUnregisteredDiskQuery.java:126) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.GetUnregisteredDiskQuery.executeQueryCommand(GetUnregisteredDiskQuery.java:98) [bll.jar:]
        at org.ovirt.engine.core.bll.QueriesCommandBase.executeCommand(QueriesCommandBase.java:110) [bll.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
        at org.ovirt.engine.core.bll.executor.DefaultBackendQueryExecutor.execute(DefaultBackendQueryExecutor.java:14) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runQueryImpl(Backend.java:587) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runInternalQuery(Backend.java:550) [bll.jar:]
        at sun.reflect.GeneratedMethodAccessor241.invoke(Unknown Source) [:1.8.0_151]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_151]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_151]


Version-Release number of selected component (if applicable):
rhevm-4.1.7.6-0.1.el7.noarch

How reproducible:
Not always

Steps to Reproduce:
1. Had more than 50 VMs backed up to an iSCSI domain's OVF_STORE, then destroyed the domain.
2. Imported the domain to 4.2 DC (4.2 setup) and registered all the VMs
3. Destroyed the domain and imported it to a 4.1 DC in 4.1.7 setup
4. Registered all the VMs 

Actual results:
Import for some of the VMs failed with the mentioned exception

Expected results:
Import should succeed.

Additional info:
logs from engine

Comment 1 Maor 2017-11-09 03:37:21 UTC
The exception mentioned above is not part of VM registration; it is part of the attach process, raised when trying to fetch the OVF_STORE disk.
It looks like some of the disks failed to be prepared for some reason. Can you please attach the vdsm log so we can analyze why?

From the logs, registration seems to have succeeded for all VMs except one, storage-jenkins-ge4-vdsm3, which hit a problem in validateGraphicsAndDisplay (see [1]). Can you please open a separate bug for it?

You mentioned in the bug that some of the VMs failed to be imported.
Did any other VMs fail to register?
Can you please share more information about that?

Thanks.



[1]

        at org.ovirt.engine.core.bll.exportimport.ImportVmCommandBase.validateGraphicsAndDisplay(ImportVmCommandBase.java:212) [bll.jar:]
        at org.ovirt.engine.core.bll.exportimport.ImportVmCommand.validateAfterCloneVm(ImportVmCommand.java:469) [bll.jar:]

Comment 2 Elad 2017-11-09 09:02:31 UTC
Created attachment 1349826 [details]
logs from hypervisors

(In reply to Maor from comment #1)
> The Exception mentioned above is not part of the VM registration but it is
> part of the attach process when trying to fetch the OVF_STORE disk.
> It looks like part of the disks failed to be prepared for some reason, can
> you please add the vdsm log to analyze the reason.
Attached
> From the logs it seems that the registration seems to end successfully
> except one VM called storage-jenkins-ge4-vdsm3 which had a problem with
> validateGraphicsAndDisplay (see [1]), can you please open a separate bug on
> it

> You mentioned in the bug that part of the VMs were failed to be imported.
> Were there any other VMs which were failed to get registered?

Actually no, only the VM called storage-jenkins-ge4-vdsm3 failed to be imported



prepareImage in vdsm.log for image id 1af3a5a7-aa7e-413c-bece-2e7d6d597484 (puma45-vdsm.log):

2017-11-08 15:44:05,969+0200 INFO  (jsonrpc/4) [vdsm.api] START prepareImage(sdUUID=u'f49ce95f-1ec2-42ec-997a-62efbce917fc', spUUID=u'd633f884-45f6-4f9e-884d-85821cefd033', imgUUID=u'1af3a5a7-aa7e-413c-bece-2e7d6d597484', leafUUID=u'7574c202-97d6-47dd-b046-3f615fe7a84f', allowIllegal=True) from=::ffff:10.35.161.54,51448, flow_id=14829135-429c-47e6-865a-449039821c97, task_id=2eb8092f-cf28-4c94-afa9-23da14d1060c (api:46)
2017-11-08 15:44:06,157+0200 INFO  (jsonrpc/4) [storage.LVM] Activating lvs: vg=f49ce95f-1ec2-42ec-997a-62efbce917fc lvs=['7574c202-97d6-47dd-b046-3f615fe7a84f'] (lvm:1295)
2017-11-08 15:44:06,513+0200 INFO  (jsonrpc/4) [storage.LVM] Refreshing lvs: vg=f49ce95f-1ec2-42ec-997a-62efbce917fc lvs=['leases'] (lvm:1291)
2017-11-08 15:44:06,514+0200 INFO  (jsonrpc/4) [storage.LVM] Refreshing LVs (vg=f49ce95f-1ec2-42ec-997a-62efbce917fc, lvs=['leases']) (lvm:1319)
2017-11-08 15:44:06,682+0200 INFO  (jsonrpc/4) [vdsm.api] FINISH prepareImage return={'info': {'path': u'/rhev/data-center/mnt/blockSD/f49ce95f-1ec2-42ec-997a-62efbce917fc/images/1af3a5a7-aa7e-413c-bece-2e7d6d597484/7574c202-97d6-47dd-b046-3f615fe7a84f', 'volType': 'path'}, 'path': u'/rhev/data-center/d633f884-45f6-4f9e-884d-85821cefd033/f49ce95f-1ec2-42ec-997a-62efbce917fc/images/1af3a5a7-aa7e-413c-bece-2e7d6d597484/7574c202-97d6-47dd-b046-3f615fe7a84f', 'imgVolumesInfo': [{'domainID': u'f49ce95f-1ec2-42ec-997a-62efbce917fc', 'leaseOffset': 132120576, 'path': u'/rhev/data-center/mnt/blockSD/f49ce95f-1ec2-42ec-997a-62efbce917fc/images/1af3a5a7-aa7e-413c-bece-2e7d6d597484/7574c202-97d6-47dd-b046-3f615fe7a84f', 'volumeID': '7574c202-97d6-47dd-b046-3f615fe7a84f', 'leasePath': '/dev/f49ce95f-1ec2-42ec-997a-62efbce917fc/leases', 'imageID': u'1af3a5a7-aa7e-413c-bece-2e7d6d597484'}]} from=::ffff:10.35.161.54,51448, flow_id=14829135-429c-47e6-865a-449039821c97, task_id=2eb8092f-cf28-4c94-afa9-23da14d1060c (api:52)

Comment 3 Michal Skrivanek 2017-11-10 08:29:17 UTC
Elad, you cannot import a 4.2 DC into 4.1. VM's OVFs are in 4.2 format and there's no backward compatibility like that. Does import to a 4.2 DC work and does a 4.1 DC in 4.2 setup work when imported to 4.1 setup?

Maor, why/how is it allowed to import a 4.2 DC into 4.1, shouldn't it be blocked?

Comment 4 Michal Skrivanek 2017-11-10 08:30:27 UTC
*** Bug 1511522 has been marked as a duplicate of this bug. ***

Comment 5 Maor 2017-11-11 01:02:24 UTC
(In reply to Michal Skrivanek from comment #3)
> Elad, you cannot import a 4.2 DC into 4.1. VM's OVFs are in 4.2 format and
> there's no backward compatibility like that. Does import to a 4.2 DC work
> and does a 4.1 DC in 4.2 setup work when imported to 4.1 setup?
> 
> Maor, why/how is it allowed to import a 4.2 DC into 4.1, shouldn't it be
> blocked?

We never had any explicit validation for different cluster versions as part of import of VM/Template.
The cluster version is being validated only on specific attributes, such as balloon device,  GraphicsAndDisplay, SoundDevice 
(Actually the cluster level was introduced in the OVF only recently at https://gerrit.ovirt.org/#/c/52751/)

I can't recall any specific reason why it was decided to be done like that, but I think that this behavior does have a benefit for VMs to be manipulated easily between backup and restore setups.

Regarding Elad's bug, maybe we can add a null check for getOsRepository().getGraphicsAndDisplays().get(osId), and return a validation error in that case.
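
A minimal sketch of the null check suggested above, in plain Java. The class name, the method name and the flattened Map<Integer, List<String>> shape are assumptions made for illustration; they are not the actual ovirt-engine OsRepository API, whose lookup is more deeply nested. The point is only that a missing entry for an OS id becomes a validation message rather than a NullPointerException.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the engine-side check; not the real ovirt-engine code.
final class GraphicsDisplayCheck {

    /**
     * Returns an error message if the requested graphics/display combination
     * cannot be validated for the given OS id, or an empty Optional if it is supported.
     */
    static Optional<String> validationError(Map<Integer, List<String>> supportedByOs,
                                            int osId,
                                            String requested) {
        List<String> supported = supportedByOs.get(osId);
        if (supported == null) {
            // The OS id is unknown to this setup (for example, a VM defined on a newer version).
            return Optional.of("No graphics/display configuration found for OS id " + osId);
        }
        if (!supported.contains(requested)) {
            return Optional.of("Graphics/display '" + requested + "' is not supported for OS id " + osId);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> supportedByOs = Map.of(1, List.of("SPICE/QXL", "VNC/VGA"));
        // OS id 99 is absent from the map, so a message is printed instead of an NPE being thrown.
        System.out.println(validationError(supportedByOs, 99, "SPICE/QXL").orElse("supported"));
    }
}
```

With a check of this kind, the import would fail validation with a readable reason for the affected VM.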

Comment 6 Maor 2017-11-11 01:03:29 UTC
Adding back Michal's needinfo, which was removed by mistake.

Comment 7 Michal Skrivanek 2017-11-11 09:33:31 UTC
(In reply to Maor from comment #5)

> I can't recall any specific reason why it was decided to be done like that,
> but I think that this behavior does have a benefit for VMs to be manipulated
> easily between backup and restore setups.
> Regarding Elad's bug, maybe we can add a null check for
> getOsRepository().getGraphicsAndDisplays().get(osId), and return a
> validation error in the case.

We can, but that doesn't make it supported. Downgrading VMs is not supported anywhere else, and it never was, so I'd rather suggest adding the validation.

Comment 8 Elad 2017-11-12 09:11:27 UTC
(In reply to Michal Skrivanek from comment #3)
> Elad, you cannot import a 4.2 DC into 4.1. VM's OVFs are in 4.2 format and
> there's no backward compatibility like that. Does import to a 4.2 DC work
> and does a 4.1 DC in 4.2 setup work when imported to 4.1 setup?

The domain was originally created in a 4.1 setup and DC, then imported to a 4.2 setup, and then to a 4.1 setup and DC.

Comment 9 Maor 2017-11-14 01:22:39 UTC
I believe that the exception mentioned in BZ1511522 and the exception described here are two different bugs.

BZ1511037 is related to the attach operation of a storage domain with a different NPE, while BZ1511522 exception is related to the VM import process.

Regarding the exception mentioned in Comment 1, which is related to the import VM scenario: it could also reproduce when importing a VM from an export storage domain.
While the scenario the bug describes is somewhat unrealistic for a user (I can't find any reason a user would want to import a 4.2 VM into 4.1), I believe we do want to avoid complicated solutions such as using predefined default values when importing entities, or blocking the import based on the compatibility level of the entity, since such validation would also affect the export storage domain behavior.

Comment 10 Michal Skrivanek 2017-11-14 08:25:30 UTC
It should behave the same on export domain. That should be properly blocked already.
We cannot support importing newer VMs into older CL.
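
For illustration only, here is one possible shape of the compatibility validation argued for in this comment. Whether such a check should be added at all is still under discussion in this bug, and every name below is hypothetical; compatibility levels are assumed to be plain "major.minor" strings such as "4.1" and "4.2".

```java
import java.util.Optional;

// Hypothetical sketch; not an existing ovirt-engine validator.
final class CompatibilityLevelCheck {

    /** Parses a "major.minor" string into a two-element array {major, minor}. */
    static int[] parse(String level) {
        String[] parts = level.trim().split("\\.");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected major.minor but got: " + level);
        }
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    /** True if the VM's level is strictly newer than the target cluster's level. */
    static boolean isNewer(String vmLevel, String clusterLevel) {
        int[] vm = parse(vmLevel);
        int[] cluster = parse(clusterLevel);
        if (vm[0] != cluster[0]) {
            return vm[0] > cluster[0];
        }
        return vm[1] > cluster[1];
    }

    /** Returns a validation message if the import should be blocked, otherwise empty. */
    static Optional<String> validateImport(String vmLevel, String clusterLevel) {
        if (isNewer(vmLevel, clusterLevel)) {
            return Optional.of("Cannot import a VM with compatibility level " + vmLevel
                    + " into a cluster with compatibility level " + clusterLevel);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A 4.2 VM imported into a 4.1 cluster is rejected with a message.
        System.out.println(validateImport("4.2", "4.1").orElse("import allowed"));
    }
}
```

Comparing the numeric components avoids the pitfall of comparing the strings lexically, which would order "4.10" before "4.2".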

Comment 11 Maor 2017-11-14 09:45:44 UTC
(In reply to Michal Skrivanek from comment #10)
> It should behave the same on export domain. That should be properly blocked
> already.
> We cannot support importing newer VMs into older CL.

The validation logic used in ImportVmFromConfigurationCommand is based on the validation logic of importVmCommand. I don't think that it is blocked like this today.

Elad, would it be possible to reproduce the same scenario through the export storage domain? Can you please also enable debug mode in the log?

Comment 12 Elad 2017-12-03 14:42:20 UTC
Created attachment 1362267 [details]
export domain scenario - engine.log (DEBUG)

The same occurs while importing the VM from an export domain.
Reproduced with the same OVF that was used in the original bug occurrence.

2017-12-03 16:39:19,202+02 ERROR [org.ovirt.engine.core.bll.exportimport.ImportVmCommand] (default task-9) [] Error during ValidateFailure.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.validator.VmValidationUtils.isGraphicsAndDisplaySupported(VmValidationUtils.java:75) [bll.jar:]
        at org.ovirt.engine.core.bll.VmHandler.isGraphicsAndDisplaySupported(VmHandler.java:632) [bll.jar:]
        at org.ovirt.engine.core.bll.VmHandler$Proxy$_$$_WeldClientProxy.isGraphicsAndDisplaySupported(Unknown Source) [bll.jar:]
        at org.ovirt.engine.core.bll.exportimport.ImportVmCommandBase.validateGraphicsAndDisplay(ImportVmCommandBase.java:212) [bll.jar:]
        at org.ovirt.engine.core.bll.exportimport.ImportVmCommand.validateAfterCloneVm(ImportVmCommand.java:469) [bll.jar:]
        at org.ovirt.engine.core.bll.exportimport.ImportVmCommand.validate(ImportVmCommand.java:190) [bll.jar:]


================================
rhevm-4.1.8.1-0.1.el7.noarch

Comment 13 Maor 2018-01-14 22:22:21 UTC
The bug here is the NPE thrown from GetUnregisteredDiskQuery, where a proper message should be returned as for any other failure.
The failure in prepareImage might occur in several use cases (for example, see https://bugzilla.redhat.com/show_bug.cgi?id=1417903#c1), although we cannot know its origin here since the puma47-vdsm log is not readable.

The solution for this bug is that we should not get an NPE from GetUnregisteredDiskQuery when prepareImage fails; instead, a clearer, more descriptive error should be logged.
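
As a rough sketch of that idea (not the merged gerrit change, whose contents are not shown in this bug), the caller can inspect the host command result before dereferencing it and raise an error that names the image and the reason. HostResult, ImagePreparationException and requirePreparedPath are hypothetical names invented for this example.

```java
// Hypothetical sketch; the types below do not exist in ovirt-engine.
final class PrepareImageHandling {

    /** Simplified result of a host-side command: success flag, prepared path, error text. */
    static final class HostResult {
        final boolean succeeded;
        final String path;
        final String errorMessage;

        HostResult(boolean succeeded, String path, String errorMessage) {
            this.succeeded = succeeded;
            this.path = path;
            this.errorMessage = errorMessage;
        }
    }

    /** Raised with a descriptive message when an image could not be prepared. */
    static final class ImagePreparationException extends RuntimeException {
        ImagePreparationException(String message) {
            super(message);
        }
    }

    /** Returns the prepared path, or fails with a readable message instead of an NPE. */
    static String requirePreparedPath(HostResult result, String imageId) {
        if (result == null) {
            throw new ImagePreparationException("prepareImage returned no result for image " + imageId);
        }
        if (!result.succeeded || result.path == null) {
            String reason = result.errorMessage != null ? result.errorMessage : "unknown reason";
            throw new ImagePreparationException("Failed to prepare image " + imageId + ": " + reason);
        }
        return result.path;
    }

    public static void main(String[] args) {
        HostResult failed = new HostResult(false, null, "volume could not be activated");
        try {
            requirePreparedPath(failed, "example-image-id");
        } catch (ImagePreparationException e) {
            // The message identifies the image and the reason, so it can be logged as-is.
            System.out.println(e.getMessage());
        }
    }
}
```

The query can then log the exception message and return a failure result, so the attach flow reports which disk could not be prepared.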

Comment 14 Maor 2018-01-15 20:04:36 UTC
(In reply to Maor from comment #13)
> The bug here is the NPE exception which is thrown from
> GetUnregisteredDiskQuery instead of a proper message which should be
> returned like any other failure which occurs.
> The failure in prepareImage might occur in several use cases (for example
> see https://bugzilla.redhat.com/show_bug.cgi?id=1417903#c1) although, we can
> not know the origin of it since puma47-vdsm log is not readable.

Elad found another use case in which prepareImage failed; we should open a separate bug to analyze it.

> 
> The solution for this bug is that we should not get an NPE on prepareImage
> failure from GetUnregisteredDiskQuery, instead a more verbal and clear error
> should be logged.

Comment 15 Elad 2018-03-04 17:06:57 UTC
Executed https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/workitem?id=RHEVM3-5297 where the bug was found in earlier versions.

Checked against rhevm-4.1.10.1-0.1.el7.noarch


Used:
rhvm-4.2.2.1-0.1.el7.noarch
vdsm-4.20.19-1.el7ev.x86_64

Comment 16 Sandro Bonazzola 2018-03-29 10:47:18 UTC
This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

