Bug 1516909 - [RFE] Engine should not allow VM migration to destination host which is not visible to all the VM's LUNs
Summary: [RFE] Engine should not allow VM migration to destination host which is not visible to all the VM's LUNs
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ---
Assignee: Eyal Shenitzky
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On: 1457087
Blocks:
 
Reported: 2017-11-23 14:38 UTC by Eyal Shenitzky
Modified: 2017-11-27 12:42 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-27 12:42:59 UTC
oVirt Team: Storage
amureini: ovirt-future?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments
engine and vdsm logs (deleted)
2017-11-23 14:38 UTC, Eyal Shenitzky
no flags

Description Eyal Shenitzky 2017-11-23 14:38:53 UTC
Created attachment 1358247 [details]
engine and vdsm logs

Description of problem:

The engine should not allow VM migration to a destination host that cannot see all of the VM's LUNs.

Relevant log after migration failed:

2017-11-23 16:35:28,939+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-13) [] Failed in 'DestroyVDS' method
2017-11-23 16:35:28,956+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM h2 command DestroyVDS failed: General Exception: ("VM 'ee500acc-86d2-4042-92e6-14b4745ff64f' was not defined yet or was undefined",)
2017-11-23 16:35:28,957+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-13) [] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand' return value 'StatusOnlyReturn [status=Status [code=100, message=General Exception: ("VM 'ee500acc-86d2-4042-92e6-14b4745ff64f' was not defined yet or was undefined",)]]'
2017-11-23 16:35:28,959+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-13) [] HostName = h2
2017-11-23 16:35:28,959+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-13) [] Command 'DestroyVDSCommand(HostName = h2, DestroyVmVDSCommandParameters:{hostId='b6102b38-ca5a-4094-9ee4-18088d2d4579', vmId='ee500acc-86d2-4042-92e6-14b4745ff64f', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'})' execution failed: VDSGenericException: VDSErrorException: Failed to DestroyVDS, error = General Exception: ("VM 'ee500acc-86d2-4042-92e6-14b4745ff64f' was not defined yet or was undefined",), code = 100
2017-11-23 16:35:28,961+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-13) [] FINISH, DestroyVDSCommand, log id: 5ad1ef42
2017-11-23 16:35:28,962+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'ee500acc-86d2-4042-92e6-14b4745ff64f'(vm) was unexpectedly detected as 'Down' on VDS 'b6102b38-ca5a-4094-9ee4-18088d2d4579'(h2) (expected on '574a3fc9-a667-4863-9455-973bbad28d58')
2017-11-23 16:35:28,965+02 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (ForkJoinPool-1-worker-13) [] VMs initialization finished for Host: 'h2:b6102b38-ca5a-4094-9ee4-18088d2d4579'
2017-11-23 16:36:07,918+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-19) [] VM 'ee500acc-86d2-4042-92e6-14b4745ff64f'(vm) moved from 'MigratingFrom' --> 'Up'
2017-11-23 16:36:07,919+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-19) [] Adding VM 'ee500acc-86d2-4042-92e6-14b4745ff64f'(vm) to re-run list
2017-11-23 16:36:07,941+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-19) [] Rerun VM 'ee500acc-86d2-4042-92e6-14b4745ff64f'. Called from VDS 'h1'
2017-11-23 16:36:07,988+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-33) [] START, MigrateStatusVDSCommand(HostName = h1, MigrateStatusVDSCommandParameters:{hostId='574a3fc9-a667-4863-9455-973bbad28d58', vmId='ee500acc-86d2-4042-92e6-14b4745ff64f'}), log id: 47264d3e
2017-11-23 16:36:07,991+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MigrateStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-33) [] FINISH, MigrateStatusVDSCommand, log id: 47264d3e
2017-11-23 16:36:08,007+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-33) [] EVENT_ID: VM_MIGRATION_TRYING_RERUN(128), Failed to migrate VM vm to Host h2 . Trying to migrate to another Host.
2017-11-23 16:36:08,082+02 WARN  [org.ovirt.engine.core.bll.MigrateVmCommand] (EE-ManagedThreadFactory-engine-Thread-33) [] Validation of action 'MigrateVm' failed for user admin@internal-authz. Reasons: VAR__ACTION__MIGRATE,VAR__TYPE__VM,VAR__ACTION__MIGRATE,VAR__TYPE__VM,VAR__ACTION__MIGRATE,VAR__TYPE__VM,SCHEDULING_NO_HOSTS
2017-11-23 16:36:08,082+02 INFO  [org.ovirt.engine.core.bll.MigrateVmCommand] (EE-ManagedThreadFactory-engine-Thread-33) [] Lock freed to object 'EngineLock:{exclusiveLocks='[ee500acc-86d2-4042-92e6-14b4745ff64f=VM]', sharedLocks=''}'
2017-11-23 16:36:08,092+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-33) [] EVENT_ID: VM_MIGRATION_NO_VDS_TO_MIGRATE_TO(166), No available host was found to migrate VM vm to.
2017-11-23 16:36:08,096+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-33) [] EVENT_ID: VM_MIGRATION_FAILED(65), Migration failed  (VM: vm, Source: h1).


Version-Release number of selected component (if applicable):
4.2.0_master - from commit 80b3e222b0162a481414601808c4f3da1dbccaed

How reproducible:
100%

Steps to Reproduce:
1. Create a DC with 2 hosts; the first host can connect to a LUN, the second host cannot.
2. Create a VM with a direct LUN disk (visible only to the first host).
3. Run the VM on the first host.
4. Migrate the VM to the second host, which cannot see the LUN (a scripted example of this step is sketched below).

Actual results:
Migration failed on the VDSM side

Expected results:
Migration should be blocked by the engine

Additional info:
Engine and VDSM log attached
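For convenience, a minimal sketch of triggering step 4 from the Python SDK (ovirtsdk4, from the ovirt-engine-sdk-python package). The engine URL, credentials, and the VM/host names ('vm', 'h2') are placeholders taken from the reproduction steps above, not a verified script:

# Minimal reproduction sketch (placeholders: engine URL, credentials,
# VM named 'vm', destination host 'h2' from the steps above).
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder URL
    username='admin@internal',
    password='password',
    insecure=True,  # skip CA verification in this sketch only
)

vms_service = connection.system_service().vms_service()
# Assumes a single VM named 'vm' exists, as in the reproduction steps.
vm = vms_service.list(search='name=vm')[0]

# Step 4: ask the engine to migrate the VM to h2, which cannot see the
# direct LUN. The command currently passes engine validation and only
# fails later on the VDSM side, as shown in the attached logs.
vms_service.vm_service(vm.id).migrate(host=types.Host(name='h2'))

connection.close()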

Comment 1 Eyal Shenitzky 2017-11-27 12:42:59 UTC
After further discussion (see below), the decision is to keep the behavior as it was and only improve the log message (bug 1457087 - https://bugzilla.redhat.com/show_bug.cgi?id=1457087).

On Mon, Nov 27, 2017 at 12:59 PM, Eyal Shenitzky <eshenitz@redhat.com> wrote:
> What you described is the way we currently handle it in the scheduler (the
> only problem is that the log is insufficient in this case).
>
> I think we have a trade-off here: we can choose to leave it as is and pay
> the cost when the LUN isn't visible to the host (the failure occurs on the
> VDSM side), or prevent it in the first place, but then we pay for every
> migration / run VM operation.
>
> IMHO it is better to leave the mechanism as is; it is worth paying the cost
> only for this specific use case.
>
>
>
>
> On Mon, Nov 27, 2017 at 1:09 PM, Martin Sivak <msivak@redhat.com> wrote:
>>
>> We can use the optimistic approach too:
>>
>> - assume it is visible until proven otherwise (how common is it to
>> have LUNs that are, or become, not visible?)
>> - try starting the VM as usual
>> - detect that it failed for this reason, mark the host as such
>> - let the retry mechanism do its job
>>
>> Martin
>>
>> On Mon, Nov 27, 2017 at 6:21 AM, Eyal Shenitzky <eshenitz@redhat.com> wrote:
>> > In order to know whether a LUN is visible to a host or not, we do not
>> > have any precomputed data. The host must try to connect to the storage
>> > server and get the device list.
>> >
>> > This operation is slow and heavy, but it must take place for the
>> > verification.
>> >
>> >
>> >
>> > On Sun, Nov 26, 2017 at 11:15 PM, Martin Sivak <msivak@redhat.com> wrote:
>> >>
>> >> > So I believe that should be implemented as a filtering done by the
>> >> > scheduler.
>> >>
>> >> Agreed.
>> >>
>> >> > Ideally, a host that cannot connect to a LUN server would be marked
>> >> > as such so the scheduler could filter it out without executing too
>> >> > many DB queries, just like a host that cannot connect to a storage
>> >> > domain is set as non-operational.
>> >>
>> >> This is indeed the right way. The scheduler should just use precomputed
>> >> data; it should not contain too much logic.
>> >>
>> >> Martin
>> >>
>> >> On Sun, Nov 26, 2017 at 1:55 PM, Arik Hadas <ahadas@redhat.com> wrote:
>> >> >
>> >> >
>> >> > On Sun, Nov 26, 2017 at 2:39 PM, Eyal Shenitzky <eshenitz@redhat.com> wrote:
>> >> >>
>> >> >> Hi all,
>> >> >>
>> >> >> I want to open $SUBJECT[1] to a discussion because there are a few
>> >> >> suggestions on where to implement it, as you can see in
>> >> >> https://gerrit.ovirt.org/#/c/84512/
>> >> >>
>> >> >> The first option is to add a verification to the command validation,
>> >> >> as it appears in the patch.
>> >> >> Cons - it runs a "heavy/slow" operation (GetDeviceList) in the
>> >> >> validation step.
>> >> >>
>> >> >> The second option that was suggested is to add the validation to the
>> >> >> scheduler.
>> >> >> Pros - the validation will also occur on run VM.
>> >> >> Cons - it runs the "heavy/slow" operation (GetDeviceList) in the
>> >> >> scheduler.
>> >> >>
>> >> >> Please share your thoughts and ideas.
>> >> >
>> >> >
>> >> > I don't think the first option is really an option:
>> >> > Imagine you have 5 hosts in your cluster, the VM is running on host
>> >> > A, and the user asks to migrate it.
>> >> > We expect the regular migration flow to happen, i.e., the scheduler
>> >> > should find us a candidate host, we'll try to migrate the VM to that
>> >> > host, and if we fail we will try 2 more times.
>> >> > However, by adding the validation that is introduced in
>> >> > https://gerrit.ovirt.org/#/c/84512/ to MigrateVm, what would happen
>> >> > is that the engine will ask the scheduler to pick a candidate host in
>> >> > the validation phase. Not only may this candidate be different from
>> >> > the one chosen later by the scheduler in the execution phase, but if
>> >> > this candidate is not able to connect to the LUN server, the migrate
>> >> > command will fail - without trying to migrate to one of the other
>> >> > hosts that may be able to connect to the LUN server (+ rerun
>> >> > attempts).
>> >> >
>> >> > So I believe that should be implemented as a filtering done by the
>> >> > scheduler.
>> >> > Ideally, a host that cannot connect to a LUN server would be marked
>> >> > as such so the scheduler could filter it out without executing too
>> >> > many DB queries, just like a host that cannot connect to a storage
>> >> > domain is set as non-operational.
>> >> >
>> >> >>
>> >> >>
>> >> >> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1516909
>> >> >>
>> >> >> --
>> >> >> Regards,
>> >> >> Eyal Shenitzky
>> >> >
>> >> >
>> >
>> >
>> >
>> >
>> > --
>> > Regards,
>> > Eyal Shenitzky
>
>
>
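For illustration only, a minimal sketch of the scheduler-filtering idea discussed in the thread above. This is not the oVirt scheduling API; all names are hypothetical, and this approach was ultimately not implemented (the existing behavior was kept and only the log was improved):

# Hypothetical sketch of a scheduler filter that relies on precomputed
# per-host LUN visibility, so the filter itself stays cheap (no inline
# GetDeviceList call). Not the actual oVirt scheduling API.
from typing import Dict, List, Set


def filter_hosts_by_lun_visibility(candidate_hosts: List[str],
                                   vm_lun_ids: Set[str],
                                   host_visible_luns: Dict[str, Set[str]]) -> List[str]:
    """Keep only hosts that already report visibility of all the VM's LUNs."""
    return [
        host for host in candidate_hosts
        if vm_lun_ids <= host_visible_luns.get(host, set())
    ]


if __name__ == '__main__':
    # The two hosts from the reproduction steps: only h1 can see the LUN.
    visible = {'h1': {'lun-1'}, 'h2': set()}
    print(filter_hosts_by_lun_visibility(['h1', 'h2'], {'lun-1'}, visible))
    # -> ['h1']; h2 is filtered out before any migration attempt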

