Bug 1360970 - [RFE] SR-IOV live migration
Summary: [RFE] SR-IOV live migration
Keywords:
Status: ON_DEV
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 7.0 (Kilo)
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: smooney
QA Contact: nova-maint
URL:
Whiteboard:
Duplicates: 1631723
Depends On: 1631723
Blocks:
 
Reported: 2016-07-28 05:13 UTC by VIKRANT
Modified: 2019-04-14 05:47 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-07 16:58:16 UTC


Attachments


Links
System ID Priority Status Summary Last Updated
OpenStack gerrit 605116 None None None 2019-01-11 10:37:32 UTC
Red Hat Knowledge Base (Solution) 2474871 None None None 2016-07-28 05:14:32 UTC

Description VIKRANT 2016-07-28 05:13:15 UTC
Description of problem:

In an OpenStack environment with SR-IOV configured and compute HA enabled, create an SR-IOV VM with a direct-type vNIC, then shut down the VM's host node. The VM is rebuilt on another compute node, but the rebuild fails because a VF with the same PCI address is already occupied on the destination compute node.

Version-Release number of selected component (if applicable):
RHEL OSP 7

How reproducible:
Every time, for the customer.

Steps to Reproduce:
1. Spawn an instance using a VF whose pci_slot ID matches one already used by an instance running on the destination compute node.
2. Shut down the compute node. Because instance HA is configured, Nova tries to rebuild the instance on the destination compute node using the same pci_slot ID, which is already in use by the instance running there.
3. The instance ends up in an error state with the message below.
~~~
| fault                                | {"message": "Requested operation is not valid: PCI device 0000:04:07.3 is in use by driver QEMU
, domain instance-0000019d", "code": 500, "details": "  File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 357, in 
decorated_function |
~~~

Actual results:
The instance fails to spawn on the destination compute node when a VF with the same pci_slot ID is already in use there.

Expected results:
The instance should spawn successfully. Nova's PCI fitting logic should choose another VF if the VF the instance tries to claim is already in use.

Additional info:

This looks like the same issue seen with CPU-pinned instances, described in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1319385
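The fitting behaviour the reporter asks for can be illustrated with a small sketch. This is not Nova code; the function name, data shapes, and addresses (other than 0000:04:07.3 from the error message) are made up for illustration. The idea is simply that the scheduler should treat the source host's VF address as a preference, not a requirement, and fall back to any free VF on the destination:

```python
# Illustrative sketch (not Nova's actual PCI manager): given the VFs on a
# destination host and the set of PCI addresses already claimed there,
# pick any free VF instead of insisting on the source host's address.

def pick_free_vf(host_vfs, in_use, preferred=None):
    """Return a free VF PCI address, preferring `preferred` when available."""
    free = [addr for addr in host_vfs if addr not in in_use]
    if not free:
        raise RuntimeError("no free VF on destination host")
    if preferred in free:
        return preferred
    return free[0]

host_vfs = ["0000:04:07.3", "0000:04:07.4", "0000:04:07.5"]
in_use = {"0000:04:07.3"}  # the address from the bug's libvirt error
print(pick_free_vf(host_vfs, in_use, preferred="0000:04:07.3"))
# prints 0000:04:07.4 -- the conflicting address is skipped
```

With logic like this, the rebuild in the reproduction steps would land on a different VF rather than failing with "PCI device ... is in use".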

Comment 12 Stephen Finucane 2018-02-07 16:58:16 UTC

*** This bug has been marked as a duplicate of bug 1222414 ***

Comment 13 Sahid Ferdjaoui 2018-02-08 10:24:32 UTC
This issue is about live-migrating instances with SR-IOV and should not be considered the same as live-migrating with CPU pinning. The two can work independently, and even if a single fix handles both situations (which I doubt), QA will have to test them independently.

Comment 14 Stephen Finucane 2018-02-08 13:52:29 UTC
(In reply to Sahid Ferdjaoui from comment #13)
> This issue is related to live-migrate instances with SRIOV and should not be
> considered as the same as live-migrating with CPU pinning. Both can work
> independently and even is the fix handle both situations (which i have some
> doubts) QA will have to test them independently.

OK, thanks for the context. I've updated the title accordingly.

Comment 15 Stephen Gordon 2018-03-07 16:00:31 UTC
Requirements:

- Support live migration with passthrough of full PCI NIC.
- Support live migration with passthrough of PF.
- Support live migration with passthrough of VF.
- In all cases, networking performance during the normal VM lifecycle should not be impacted. Performance degradation during live migration is acceptable.

Comment 16 Daniel Berrange 2018-03-07 16:07:13 UTC
(In reply to Stephen Gordon from comment #15)
> Requirements:
> 
> - Support live migration with passthrough of full PCI NIC.
> - Support live migration with passthrough of PF.
> - Support live migration with passthrough of VF.
> - In all cases, performance of networking in general VM lifecycle should not
> be impacted. Performance degradation during live migration is acceptable.

Achieving this from a technical POV would require a multi-NIC setup in the guest with bonding/teaming, i.e. every guest would need two NICs, one SR-IOV based and one emulated, both connected to the same host network. At migration time the SR-IOV device would have to be hot-removed, and a new one added afterwards.

IOW, as well as impacting guest network performance, you need to mandate a special guest setup and guest cooperation for hot-unplug at the start of migration. The implication is that if the guest OS has crashed, is in its early boot phase, or is otherwise non-responsive, live migration still won't be possible, as the guest won't respond to the initial hot-unplug request. Not a showstopper, though; it is largely a documentation / expectation-setting problem.
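The sequence Daniel describes can be sketched as a toy state model. This is not the libvirt or Nova API; the dict layout, function name, and PCI addresses are invented purely to show the ordering of the three steps: hot-unplug the VF (the guest bond fails over to the emulated NIC), live-migrate with only the emulated NIC, then hot-plug a free VF on the destination:

```python
# Toy model of SR-IOV live migration via guest NIC bonding (not real
# libvirt/Nova calls; names and addresses are illustrative only).

def migrate_with_sriov(vm, dest_free_vfs):
    # Step 1: hot-unplug the VF; the guest's bond fails over to emulated0.
    vm["nics"].remove(vm["vf"])
    vm["vf"] = None
    # Step 2: live-migrate; only the emulated NIC is attached, so the
    # guest has no host-specific PCI state blocking the migration.
    vm["host"] = "destination"
    # Step 3: hot-plug a free VF on the destination and rejoin the bond.
    new_vf = dest_free_vfs.pop(0)
    vm["nics"].append(new_vf)
    vm["vf"] = new_vf
    return vm

vm = {"host": "source", "vf": "0000:04:07.3",
      "nics": ["0000:04:07.3", "emulated0"]}
vm = migrate_with_sriov(vm, dest_free_vfs=["0000:81:00.2"])
print(vm["host"], vm["vf"])  # prints: destination 0000:81:00.2
```

Note that step 1 is exactly where the guest-cooperation caveat bites: a crashed or booting guest never acknowledges the hot-unplug, so the sequence cannot start.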

Comment 20 Lee Yarwood 2019-01-09 16:57:42 UTC
*** Bug 1631723 has been marked as a duplicate of this bug. ***

Comment 22 smooney 2019-03-21 17:01:21 UTC
Note this feature is being targeted for OSP 16 and will not be backportable.

