Note: This is a beta release of Red Hat Bugzilla 5.0. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Also email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback here.
Bug 1692926 - libvirt sends VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED event too soon (before it has finished tearing down the device's resources and objects)
Summary: libvirt sends VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED event too soon (before it ha...
Keywords:
Status: POST
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.0
Hardware: x86_64
OS: Linux
unspecified
urgent
Target Milestone: rc
: 8.1
Assignee: Laine Stump
QA Contact: yalzhang@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-03-26 17:05 UTC by Laine Stump
Modified: 2019-03-27 10:15 UTC (History)
10 users (show)

Fixed In Version: libvirt-5.2.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1658198
Environment:
Last Closed:
Type: Bug
Target Upstream Version:


Attachments (Terms of Use)

Description Laine Stump 2019-03-26 17:05:26 UTC
+++ This bug was initially created as a clone of Bug #1658198 +++

Description of problem:
VFs are not released after unplug and plug vNIC with Passthrough

Version-Release number of selected component (if applicable):
4.3.0-0.2.master.20181121071050.gita8fcd23.el7

How reproducible:
100%

Steps to Reproduce:
1. Enable 2 VFs at the host
2. Create a logical network
3. Enable Passthrough at the assigned vNIC profile
4. Attached the vNIC profile to a VM
5. Start the VM
6. Unplug & Plug the vNIC
7. Delete the vNIC from the VM
8. Verify VFs at the source host are released as followed:
   - Go to Setup Host Networks | Show virtual functions
     and verify that 2 VFs are displayed
   - Try to reduce the number of VFs from 2 to 0

Actual results:
- One VF is displayed instead of 2
- Can't reduce the number of VFs to zero, got the error:
  "Cannot edit host NIC VFs configuration. The selected network interface 
  enp5s0f0 has VFs that are in use"

Expected results:
- Two VFs should be displayed
- Reduce the number of VFs from2 to 0 should success

Additional info:
- To release the VFs it is required to reboot the host.
- See attached error message
- This issue occurs also at the source host when migrating a VM
- See also https://bugzilla.redhat.com/show_bug.cgi?id=1655276

[red herring suggestions removed for brevity]
[...]

--- Additional comment from Laine Stump on 2019-03-14 02:04:38 UTC ---

[...]

Once I actually look into the code (rather than just relying on faulty memory and assumptions), and looked back to the original vdsm BZ (Bug 1655276) for the *libvirt* error message that was returned when this problem occurred (it can't be found in the libvirt logs attached to this bug, so they must have been collected at the wrong time), I found the following:

* The error message:

  libvirtError: Requested operation is not valid: PCI device 0000:08:10.0 is in use by driver QEMU, domain golden_env_mixed_virtio_6

is logged by virHostdevIsPCINodeDeviceUsed() when it finds that the device you want to operate on is still in the activePCIHostdevs list.

The VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED event is being queued by qemuDomainRemoveHostDevice() calling virDomainEventDeviceRemovedNewFromObj() at line qemu_hotplug.c:4572 (in current upstream sources), but the device isn't being taken off the activePCIHostdevs list until qemuDomainRemoveHostDevice() calls qemuDomainRemovePCIHostDevice() (which eventually calls virHostdevReAttachPCIDevices() - this is at qemu_hotplug.c:4607. And several lengthy things happen in between (logging device removal audit messages, cgroup ACL removal, deleting node from /dev)

I tried outputting log messages at the point the event is dispatched and the point the device is removed from the activePCIHostdevs list, and found that the event is dispatched prior to removing the device from the list. We need to queue the event later. I would say we should not queue it until *all* the teardown is finished, but hopefully there won't be existing code that somehow depends on some of the device teardown not being done yet.

NB: It looks like it is the case for *all* device types that we queue the event prior to removing the device from the virDomainDef. That seems inherently racy.

I'll work on a patch for this tomorrow.

--- Additional comment from Laine Stump on 2019-03-23 03:57:44 UTC ---

I've ended up with a 21 patch series that fixes this bug in the final patch:

https://www.redhat.com/archives/libvir-list/2019-March/msg01455.html

--- Additional comment from Laine Stump on 2019-03-25 16:40:34 UTC ---

The first 12 patches of that series are posted. This does *not* solve the issue (that requires the rest of the series) but these patches will be prerequisites for the fix if it is backported:

commit 143291698358f5a1e693c768893d89038420af25
Author: Laine Stump <laine@laine.org>
Date:   Tue Mar 19 18:51:09 2019 -0400

    qemu_hotplug: remove erroneous call to qemuDomainDetachExtensionDevice()
    
commit e18e9b72a99f93cd4b14f39c60baa7c5ea35e5db
Author: Laine Stump <laine@laine.org>
Date:   Tue Mar 19 18:55:15 2019 -0400

    qemu_hotplug: remove another erroneous qemuDomainDetachExtensionDevice() call
    
commit 155064e0ed53b13701ff176c3f92605b1a850a9d
Author: Laine Stump <laine@laine.org>
Date:   Tue Mar 19 18:57:45 2019 -0400

    qemu_hotplug: remove unnecessary check for valid PCI address
    
commit 1c2866a1f6087837688c3c2beea08753dc6871d0
Author: Laine Stump <laine@laine.org>
Date:   Mon Mar 18 14:02:55 2019 -0400

    qemu_hotplug: rename a virDomainDeviceInfoPtr to avoid confusion
    
commit 287415e219fa2e477ae011ece275ab15a4be1d73
Author: Laine Stump <laine@laine.org>
Date:   Mon Mar 18 14:54:37 2019 -0400

    qemu_hotplug: eliminate multiple identical qemuDomainDetachHost*Device() functions
    
commit 1ed46f3a22fe8570b4237477de5d5adb5a05f455
Author: Laine Stump <laine@laine.org>
Date:   Mon Mar 18 16:41:03 2019 -0400

    qemu_hotplug: eliminate unnecessary call to qemuDomainDetachNetDevice()
    
commit ac442713e6aa1b1087d095796f9c35fd372a0511
Author: Laine Stump <laine@laine.org>
Date:   Mon Mar 18 18:14:11 2019 -0400

    qemu_hotplug: refactor qemuDomainDetachDiskLive and qemuDomainDetachDiskDevice
    
commit 48a2668151c9ba3f4c94c0a4c0412a5140885ad4
Author: Laine Stump <laine@laine.org>
Date:   Tue Mar 19 11:43:14 2019 -0400

    qemu_hotplug: don't call DetachThisHostDevice for hostdev network devices
    
commit 6be2414820a23663f9e6b7b4ed510ebbf3126307
Author: Laine Stump <laine@laine.org>
Date:   Tue Mar 19 13:33:10 2019 -0400

    qemu_hotplug: merge qemuDomainDetachThisHostDevice into qemuDomainDetachHostDevice
    
commit 036a4521f3c539c58bb5706b4710db0f1a16eec6
Author: Laine Stump <laine@laine.org>
Date:   Tue Mar 19 13:35:31 2019 -0400

    qemu_hotplug: move qemuDomainChangeGraphicsPasswords()
    
commit 5a8ffaec768ce25ef74eb398968e0b84b878a249
Author: Laine Stump <laine@laine.org>
Date:   Tue Mar 19 13:39:07 2019 -0400

    qemu_hotplug: move (almost) all qemuDomainDetach*() functions together
    
commit 015e71c54ddf8d133905a85514239b21bc7e552e
Author: Laine Stump <laine@laine.org>
Date:   Tue Mar 19 13:42:55 2019 -0400

    qemu_hotplug: move (Attach|Detach)Lease functions with others of same type

--- Additional comment from Laine Stump on 2019-03-26 15:14:19 UTC ---

And here are the upstream commit IDs for the remaining 14 patches. The final patch isn't required, but fits in nicely with this series.

commit 6a9c3fbade1d5479f4893a0f31293f13ad8bf229
Author: Laine Stump <laine@laine.org>
Date:   Tue Mar 19 13:55:13 2019 -0400

    qemu_hotplug: move qemuDomainDetachDeviceLive() to qemu_hotplug.c

commit e4d96324b48b8aab864212382390a5c4a40970d2
Author: Laine Stump <laine@laine.org>
Date:   Tue Mar 19 17:37:55 2019 -0400

    qemu_hotplug: remove extra function in middle of DetachController call chain
    
commit b20494186578fb779547b714fff77f07e2ca03ea
Author: Laine Stump <laine@laine.org>
Date:   Tue Mar 19 14:06:05 2019 -0400

    qemu_hotplug: pull qemuDomainUpdateDeviceList out of qemuDomainDetachDeviceLive
    
commit d3aab99096bb8f81600437682398235c17084d22
Author: Laine Stump <laine@laine.org>
Date:   Tue Mar 19 14:47:40 2019 -0400

    test: replace calls to individual detach functions with one call to main detach
    
commit 637d72f985e2700e88c3b3a4d4a83df9b8d6d35d
Author: Laine Stump <laine@laine.org>
Date:   Sat Mar 23 12:29:23 2019 -0400

    qemu_hotplug: make Detach functions called only from qemu_hotplug.c static
    
commit c4d6a121a8e903dc8d012d6737fd308b3ddec31a
Author: Laine Stump <laine@laine.org>
Date:   Tue Mar 19 17:15:00 2019 -0400

    qemu_hotplug: rename dev to match in qemuDomainDetachDeviceLive
    
commit 2ec6faea798b2a7d8986b7a958e781b54d8a8070
Author: Laine Stump <laine@laine.org>
Date:   Sun Mar 24 20:59:21 2019 -0400

    qemu_hotplug: separate Chr|Lease from other devices in DetachDevice switch
    
commit b6a53bf9079bc9ef2dc3f8b85ff5c84da14b9a0a
Author: Laine Stump <laine@laine.org>
Date:   Wed Mar 20 14:32:53 2019 -0400

    qemu_hotplug: standardize the names/args/calling of qemuDomainDetach*()
    
commit e1949c7045377e7b32f0f8ed7c76539b1b5d50da
Author: Laine Stump <laine@laine.org>
Date:   Sun Mar 24 21:29:49 2019 -0400

    qemu_hotplug: rename Chr and Lease Detach functions
    
commit b914e0eca385b52ede39b1b046bc9bf7a4fbbc2a
Author: Laine Stump <laine@laine.org>
Date:   Wed Mar 20 19:44:05 2019 -0400

    qemu_hotplug: new function qemuDomainRemoveAuditDevice()
    
commit 444c5e7c432961c7ecd59b2627dfb69ce787a22a
Author: Laine Stump <laine@laine.org>
Date:   Mon Mar 25 10:23:51 2019 -0400

    qemu_hotplug: audit *all* auditable device types in qemuDomainRemoveAuditDevice
    
commit dd60bd62d3ad60b564168f56b05f4c9354af4bd3
Author: Laine Stump <laine@laine.org>
Date:   Wed Mar 20 21:44:00 2019 -0400

    qemu_hotplug: consolidate all common detach code in qemuDomainDetachDeviceLive
    
commit 78b03a7770f1822458be3e0769538dfc92b34803
Author: Laine Stump <laine@laine.org>
Date:   Thu Mar 21 12:54:10 2019 -0400

    qemu_hotplug: delay sending DEVICE_REMOVED event until after *all* teardown
    
commit 34086fc59e7c59148409d6780176e84d0f1dbfb4
Author: Laine Stump <laine@laine.org>
Date:   Mon Mar 25 10:46:56 2019 -0400

    qemu_hotplug: don't shutdown net device until the guest has released it


Note You need to log in before you can comment on or make changes to this bug.