Bug 1689074 - Failed to restart VMI in cnv - Failed to terminate process 179 with SIGTERM: Device or resource busy
Summary: Failed to restart VMI in cnv - Failed to terminate process 179 with SIGTERM: ...
Keywords:
Status: NEW
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: RFE
Version: 1.4
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Stephen Gordon
QA Contact: Stephen Gordon
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-15 05:44 UTC by chhu
Modified: 2019-03-19 03:27 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-18 06:46:45 UTC
Target Upstream Version:


Attachments
test.yaml (deleted) - 2019-03-15 05:46 UTC, chhu
libvirtd_r7_6.log (deleted) - 2019-03-15 06:04 UTC, chhu
libvirtd_r8av.log (deleted) - 2019-03-15 06:07 UTC, chhu

Description chhu 2019-03-15 05:44:07 UTC
Description of problem:
Create a VM in CNV with NFS storage and use virtctl start to start the VMI; this succeeds. If the NFS storage then breaks, virtctl restart hangs: the Pod: virt-launcher-* stays in Terminating and cannot be deleted.

Version-Release number of selected component (if applicable):
libvirt-daemon-kvm-4.5.0-10.el7_6.4.x86_64
kubevirt-virtctl-0.13.3-1.g573d572.79a47ba.x86_64
kubevirt-ansible-0.12.3-1.f9f3ed4.noarch
kubevirt-cdi-manifests-1.5.1-2.c2a2fb4

How reproducible:
100%

Steps to Reproduce:
1. Create PV successfully
# cat pv-nfs.yaml
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-rhel
  labels:
    kubevirt.io: ""
    os: "rhel"
spec:
  capacity:
    storage: 12Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.*****
    path: /home/nfs

# oc create -f pv-nfs.yaml

2. Create VM from DV
# oc create -f test.yaml
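
(The test.yaml attachment has since been deleted. As a rough, hypothetical sketch only, a VM-from-DataVolume manifest of the kind used here generally looks like the following; the DataVolume name, image URL, memory and disk names below are illustrative assumptions, not the contents of the original file.)

apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: vm-dvt-rhel
spec:
  running: false
  dataVolumeTemplates:
  - metadata:
      name: dv-rhel                    # hypothetical DataVolume name
    spec:
      source:
        http:
          url: http://.../rhel.qcow2   # placeholder image source
      pvc:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 12Gi              # sized to bind the pv-rhel PV from step 1
  template:
    metadata:
      labels:
        kubevirt.io/vm: vm-dvt-rhel
    spec:
      domain:
        devices:
          disks:
          - name: dv-disk              # illustrative disk/volume name
            disk:
              bus: virtio
        resources:
          requests:
            memory: 1Gi                # illustrative memory request
      volumes:
      - name: dv-disk
        dataVolume:
          name: dv-rhel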

3. Check that the PVC, DV and VM are created
# oc get vm
NAME          AGE       RUNNING   VOLUME
vm-dvt-rhel   28s       false

4. Create the VMI using virtctl
# virtctl start vm-dvt-rhel
VM vm-dvt-rhel was scheduled to start

5. Check the Pod: virt-launcher-* and the VMI: vm-dvt-rhel are running.
   Log in to the Pod and verify with virsh that test_vm-dvt-rhel is running

# oc rsh virt-launcher-vm-dvt-rhel-s6d9b
sh-4.2# virsh list --all
 Id    Name                           State
----------------------------------------------------
 1     test_vm-dvt-rhel               running

6. Stop the nfs service on the nfs server, then run virtctl restart VM; the Pod stays in Terminating and the VMI fails
# virtctl restart vm-dvt-rhel
VM vm-dvt-rhel was scheduled to restart

In the Pod: virt-launcher-***, the instance status changed from running to "in shutdown",
and the rsh session then terminated with exit code 137.

sh-4.2# virsh list --all
 Id    Name                           State
----------------------------------------------------
 1     test_vm-dvt-rhel               running

sh-4.2# virsh list --all
 Id    Name                           State
----------------------------------------------------
 1     test_vm-dvt-rhel               in shutdown
sh-4.2# command terminated with exit code 137

# oc get pods
NAME                              READY     STATUS        RESTARTS   AGE
virt-launcher-vm-dvt-rhel-s6d9b   1/1       Terminating   0          3h

# oc describe pod virt-launcher-vm-dvt-rhel-s6d9b
Name:               virt-launcher-vm-dvt-rhel-s6d9b
Namespace:          test
Priority:           0
......
Events:
  Type     Reason     Age                 From                                             Message
  ----     ------     ----                ----                                             -------
  Warning  Unhealthy  26m (x17 over 26m)  kubelet, dell-per415-03.lab.eng.pek2.redhat.com  Readiness probe errored: rpc error: code = Unknown desc = command error: nsenter: failed to open /proc/86017/ns/ipc: No such file or directory
exec failed: container_linux.go:336: starting container process caused "process_linux.go:86: executing setns process caused \"exit status 21\""
, stdout: , stderr: , exit code -1
  Normal   Killing        4m (x5 over 26m)  kubelet, dell-per415-03.lab.eng.pek2.redhat.com  Killing container with id cri-o://compute:Need to kill Pod
  Warning  FailedKillPod  2m (x5 over 24m)  kubelet, dell-per415-03.lab.eng.pek2.redhat.com  error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = Unknown desc = failed to get container 'stopped' status 6bc6733d1bf5541b4a4095537afe9ea14b6623a71ffc6b38d694b3458334b19c: failed to get container stopped status: 30s timeout reached"
, failed to "KillPodSandbox" for "084d31aa-460b-11e9-9a8f-a0b3cce8cf98" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to get container 'stopped' status k8s_compute_virt-launcher-vm-dvt-rhel-s6d9b_test_084d31aa-460b-11e9-9a8f-a0b3cce8cf98_0 in pod sandbox 80eb770060437aca5c0f9151a699e26358838776cbddeef77494e5db785e660c: failed to get container stopped status: 10s timeout reached"


# oc get vmi
NAME          AGE       PHASE     IP               NODENAME
vm-dvt-rhel   3h        Failed    10.****/23   dell-per415.****

# oc describe vmi vm-dvt-rhel
Name:         vm-dvt-rhel
Namespace:    test
Labels:       kubevirt.io/nodeName=dell-per415.****
              kubevirt.io/vm=vm-dvt-rhel
Annotations:  <none>
API Version:  kubevirt.io/v1alpha3
Kind:         VirtualMachineInstance
......
Events:
  Type     Reason            Age                 From                                                  Message
  ----     ------            ----                ----                                                  -------
  Normal   Created           30m (x27 over 3h)   virt-handler, dell-per415-03.lab.eng.pek2.redhat.com  VirtualMachineInstance defined.
  Normal   SuccessfulDelete  29m                 virtualmachine-controller                             Deleted virtual machine pod virt-launcher-vm-dvt-rhel-s6d9b
  Warning  SyncFailed        29m (x2 over 29m)   virt-handler, dell-per415-03.lab.eng.pek2.redhat.com  server error. command Launcher.Kill failed: "LibvirtError(Code=38, Domain=0, Message='Failed to terminate process 179 with SIGTERM: Device or resource busy')"
  Normal   Deleted           29m                 virt-handler, dell-per415-03.lab.eng.pek2.redhat.com  VirtualMachineInstance stopping
  Warning  SyncFailed        28m (x13 over 29m)  virt-handler, dell-per415-03.lab.eng.pek2.redhat.com  connection is shut down
  Warning  Stopped           28m                 virt-handler, dell-per415-03.lab.eng.pek2.redhat.com  The VirtualMachineInstance crashed.

7. Delete the VM; the VM is deleted, but the Pod: virt-launcher-* and the VMI still exist
# oc delete vm vm-dvt-rhel
virtualmachine.kubevirt.io "vm-dvt-rhel" deleted

8. Try to delete the Pod: virt-launcher-*; it hangs.
# oc delete pod virt-launcher-*


Actual results:
In step 6: running virtctl restart VM leaves the Pod stuck in Terminating and the VMI in Failed.
In step 8: deleting the pod hangs.

Expected results:
In step 6: virtctl restart VM restarts the VM successfully.
In step 8: the pod is deleted successfully.

Additional info:
- test.yaml
- When the nfs storage is restored, the Pod: virt-launcher-* can be deleted.

Comment 1 chhu 2019-03-15 05:46:03 UTC
Created attachment 1544279 [details]
test.yaml

Comment 2 chhu 2019-03-15 05:55:52 UTC
Test on the libvirt side:
Starting the VM with nfs storage succeeds. If the nfs storage is broken, virsh destroy VM fails
and the VM changes to state "in shutdown"; it is most likely this state that keeps the pod:
virt-launcher-* stuck in "Terminating" and prevents it from being deleted.

But I'm not sure whether this issue should be fixed on the libvirt side or the cnv side.

Test on packages:
RHEL8.0.0:
libvirt-daemon-kvm-5.0.0-6.module+el8+2860+4e0fe96a.x86_64
qemu-kvm-2.12.0-63.module+el8+2833+c7d6d092.x86_64

RHEL7.6:
qemu-kvm-rhev-2.12.0-24.el7.x86_64.rpm
libvirt-4.5.0-10.el7_6.6.x86_64

Steps:
1. Create VM with nfs disk
   # virsh define r8.xml
   Domain r8 defined from r8.xml
   # virsh start r8
   Domain r8 started
   # virsh console r8
   Connected to domain r8
   # virsh list --all
    Id   Name   State
    ----------------------
    1    r8     running
2. Stop the nfs service on nfs server
3. Destroy the VM with virsh; the VM state is "in shutdown".
Check the libvirtd log, which shows the error:
--------------------------------------------------
2019-03-15 03:32:01.035+0000: 4204: debug : virProcessKillPainfullyDelay:375 : Timed out waiting after SIGTERM to process 4386, sending SIGKILL
2019-03-15 03:32:31.049+0000: 4204: error : virProcessKillPainfullyDelay:405 : Failed to terminate process 4386 with SIGKILL: Device or resource busy

---------------------------------------------------
# virsh destroy r8
error: Disconnected from qemu:///system due to keepalive timeout
error: Failed to destroy domain r8
error: internal error: connection closed due to keepalive timeout

# virsh list --all
 Id   Name   State
--------------------------
 1    r8     in shutdown

On rhel7.6:
----------------------------------------------------------------------
# virsh destroy r7
error: Failed to destroy domain r7
error: Failed to terminate process 49234 with SIGKILL: Device or resource busy

# virsh list --all
 Id    Name                           State
----------------------------------------------------
 1     r7                             in shutdown

Additional info:
- libvirtd_r7_6.log
- libvirtd_r8av.log

Comment 3 chhu 2019-03-15 06:04:56 UTC
Created attachment 1544280 [details]
libvirtd_r7_6.log

Comment 4 chhu 2019-03-15 06:07:16 UTC
Created attachment 1544282 [details]
libvirtd_r8av.log

Comment 5 Peter Krempa 2019-03-15 09:36:56 UTC
Could you please post how the NFS share is mounted on the host (mount options)? The default mount options for NFS don't allow killing the process if the NFS server is not accessible; the process will end up in uninterruptible sleep (D-state).

Comment 6 chhu 2019-03-18 00:47:28 UTC
On the nfs server:
# cat /etc/exports
/home/nfs *(rw,no_root_squash,sync)

On the CNV node (the directory is mounted by CNV when the VM is created):
# mount|grep nfs
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
10.**.**.49:/home/nfs on /var/lib/origin/openshift.local.volumes/pods/6e11869b-4915-11e9-9a8f-a0b3cce8cf98/volumes/kubernetes.io~nfs/pv-rhel type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.**.**.183,local_lock=none,addr=10.**.**.49)

Comment 7 Peter Krempa 2019-03-18 06:46:45 UTC
The mount options on the client include 'hard', which means that processes accessing the NFS mount point are frozen until the share becomes available again, to prevent data loss for the writer.

Libvirt can't do anything about this. Closing as NOTABUG, as the described behaviour is consistent with accessing a 'hard'-mounted NFS share.

Comment 8 chhu 2019-03-19 02:41:24 UTC
Thanks Peter!

Since in cnv the Pod: virt-launcher-* (Terminating) and the VMI (Failed) cannot be deleted until the nfs storage is restored,
and the mount options are set by cnv so the customer can't change them when creating the pv/pvc/dv,
I'll change the bug product to 'CNV' to see if they would change the mount option to soft.
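
For reference, Kubernetes does allow NFS mount options to be listed on a PersistentVolume via spec.mountOptions. Below is a minimal sketch, reusing the pv-nfs.yaml from step 1, of how a 'soft' mount could be expressed; the timeo/retrans values are illustrative assumptions only, and whether CNV should set or permit this for virt-launcher volumes is exactly the question for the CNV team.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-rhel
  labels:
    kubevirt.io: ""
    os: "rhel"
spec:
  capacity:
    storage: 12Gi
  accessModes:
    - ReadWriteMany
  mountOptions:          # options kubelet passes when mounting the NFS share
    - soft               # return an I/O error after retries instead of blocking in D-state
    - timeo=100          # illustrative retry timing, not a recommendation
    - retrans=3
  nfs:
    server: 10.*****
    path: /home/nfs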

- Additional information: RHV uses soft mount:
https://github.com/oVirt/vdsm/blob/e2910145e6ebdcc6733eeaff4f49bb102ec65260/lib/vdsm/common/config.py.in#L388


Regards,
Chenli

Comment 9 chhu 2019-03-19 03:23:40 UTC
(In reply to chhu from comment #8)
> Thanks Peter!
> 
> As in cnv the Pod: virt-handler-*(Terminating)/ VMI(Failed) can not be
> deleted until the nfs storage is restored.
> And the mount options are given by cnv, customer can't change mount option
> when creating pv/pvc/dv,
> I'll change the bug product to 'CNV' to see if they would change the mount
> option to soft.
> 
> - Additional information: RHV use the soft mount
> https://github.com/oVirt/vdsm/blob/e2910145e6ebdcc6733eeaff4f49bb102ec65260/
> lib/vdsm/common/config.py.in#L388
> 
 - RHOS fixed a similar issue in nova:
   Bug 1205647 - nova.virt.libvirt.driver fails to shutdown reboot instance with error 'Code=38 Error=Failed to terminate process 4260 with SIGKILL: Device or resource busy'
   Bug 1188609 - Error from libvirt during destroy Code=38 - Device or resource busy
> 
> Regards,
> Chenli

