Bug 1364924

Summary: VMs flip to non-responsive state for ever.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Michal Skrivanek <michal.skrivanek>
Component: vdsm
Assignee: Francesco Romani <fromani>
Status: CLOSED ERRATA
QA Contact: Israel Pinto <ipinto>
Severity: high
Docs Contact:
Priority: medium
Version: 3.6.7
CC: bazulay, bgraveno, fromani, gklein, lsurette, mavital, mgoldboi, michal.skrivanek, mkalinin, obockows, rhodain, srevivo, ycui, ykaul
Target Milestone: ovirt-4.0.4
Keywords: ZStream
Target Release: 4.0.4
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
This update fixes an issue in the monitoring code that caused VDSM to fail to detect that a stuck QEMU process had become responsive again.
Story Points: ---
Clone Of: 1361028
Clones: 1364925
Environment:
Last Closed: 2016-09-28 22:17:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 1361028
Bug Blocks: 1364925

Description Michal Skrivanek 2016-08-08 08:25:06 UTC
clone job doesn't work, so doing that manually

+++ This bug was initially created as a clone of Bug #1361028 +++

Description of problem:
   When VMs become non-responsive, the executor task queue can fill up, and a TooManyTasks exception is raised. As a result, the affected operation is never scheduled again.

Version-Release number of selected component (if applicable):
     vdsm-4.17.31-0.el7ev.noarch

How reproducible:
     Under heavy load combined with QEMU responsiveness issues

Steps to Reproduce:
     Not completely clear

Actual results:
     All VMs are marked as non-responding

Expected results:
     The task is scheduled as soon as there is space in the task queue

Additional info:

vdsm.Scheduler::ERROR::2016-07-16 16:15:28,745::schedule::213::Scheduler::(_execute) Unhandled exception in <bound method Operation._try_to_dispatch of <virt.periodic.Operation object at 0x45ea390>>
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/schedule.py", line 211, in _execute
    self._callable()
  File "/usr/share/vdsm/virt/periodic.py", line 190, in _try_to_dispatch
    self._dispatch()
  File "/usr/share/vdsm/virt/periodic.py", line 197, in _dispatch
    self._executor.dispatch(self, self._timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 101, in dispatch
    self._tasks.put((callable, timeout))
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 256, in put
    raise TooManyTasks()
TooManyTasks



vdsm/virt/periodic.py:
...
class Operation(object):
...
     def _dispatch(self):
         """
         Send `func' to Executor to be run as soon as possible.
         """
         self._call = None
         self._executor.dispatch(self, self._timeout)
         self._step()

The exception comes from 

    self._executor.dispatch(self, self._timeout)

Because the exception propagates out of _dispatch(), self._step() is never executed and the operation is never rescheduled.
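
The failure mode above, and one way to avoid it, can be sketched in a few lines. This is a minimal, self-contained illustration and not the actual vdsm patch: TinyExecutor, the rescheduled counter and the simplified _step() are hypothetical stand-ins, and only the names Operation, _dispatch, _step and TooManyTasks come from the excerpt and traceback above. The idea is to treat a full queue as a skipped cycle and still re-arm the operation.

```python
import logging


class TooManyTasks(Exception):
    """Stand-in for the exception seen in the traceback."""


class TinyExecutor(object):
    """Hypothetical bounded executor; not the vdsm Executor class."""

    def __init__(self, max_tasks):
        self._max_tasks = max_tasks
        self._tasks = []

    def dispatch(self, task, timeout=None):
        # Refuse new work when the queue is full, as in the report.
        if len(self._tasks) >= self._max_tasks:
            raise TooManyTasks()
        self._tasks.append((task, timeout))

    def drain(self):
        # Pretend the workers finished everything that was queued.
        self._tasks = []


class Operation(object):
    def __init__(self, func, executor, timeout=None):
        self._func = func
        self._executor = executor
        self._timeout = timeout
        self.rescheduled = 0  # hypothetical counter, for illustration

    def __call__(self):
        self._func()

    def _step(self):
        # The real code re-arms a timer; this sketch only counts.
        self.rescheduled += 1

    def _dispatch(self):
        try:
            self._executor.dispatch(self, self._timeout)
        except TooManyTasks:
            # Skip this cycle instead of letting the exception escape.
            logging.warning('could not run %s, executor queue full',
                            self._func)
        finally:
            # Always schedule the next cycle, even if this one was skipped.
            self._step()
```

With a queue of size one, a second _dispatch() call logs a warning and still increments rescheduled, so the operation gets another chance once the queue drains.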

Comment 1 Michal Skrivanek 2016-08-08 08:30:07 UTC
cloned to zstream bug 1364925

Comment 4 Israel Pinto 2016-09-11 08:41:19 UTC
Verified with 4.0.4.2-0.1
HOST:
OS Version: RHEL - 7.2 - 9.el7_2.1
OS Description: Red Hat Enterprise Linux Server 7.2 (Maipo)
Kernel Version: 3.10.0 - 327.30.1.el7.x86_64
KVM Version: 2.3.0 - 31.el7_2.21
LIBVIRT Version: libvirt-1.2.17-13.el7_2.5
VDSM Version: vdsm-4.18.13-1.el7ev
SPICE Version: 0.12.4 - 15.el7_2.2

Steps:
1. Modified /etc/vdsm/vdsm.conf and added the lines:

[sampling]
periodic_workers = 1
periodic_task_per_worker = 1

2. restarted vdsm.
3. Created a vm pool with 8 vms.
4. Started all 8 vms.
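
For repeated test runs, the configuration change in step 1 could be scripted. The following is a rough Python 3 sketch that assumes vdsm.conf is plain INI text; the helper name with_small_executor is hypothetical, and only the [sampling] section and the two option names come from the steps above. It works on a string rather than on /etc/vdsm/vdsm.conf directly, and configparser drops comments when it rewrites the text.

```python
import configparser
import io


def with_small_executor(conf_text):
    """Return conf_text with the periodic executor shrunk to one slot."""
    # interpolation=None keeps any '%' characters in values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section('sampling'):
        parser.add_section('sampling')
    # Values taken from step 1 of the verification procedure.
    parser.set('sampling', 'periodic_workers', '1')
    parser.set('sampling', 'periodic_task_per_worker', '1')
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

The returned text would still need to be written back to the file and vdsm restarted, as in step 2.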

result:
In vdsm.log:
vdsm.Scheduler::WARNING::2016-09-11 11:22:02,721::periodic::211::virt.periodic.Operation::(_dispatch) could not run <vdsm.virt.sampling.VMBulkSampler object at 0x21ed290>, executor queue full
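
To check a whole vdsm.log for this warning, a small parser can help. The sketch below assumes the "::"-separated layout seen in the two log lines quoted in this bug (thread, level, timestamp, module, line number, logger, then the function name in parentheses followed by the message). The names LogRecord, parse_vdsm_line and is_queue_full_warning are hypothetical helpers and not part of vdsm.

```python
import re
from collections import namedtuple

LogRecord = namedtuple(
    'LogRecord',
    'thread level timestamp module lineno logger func message')

# The last field looks like "(_dispatch) could not run ...".
_FUNC_AND_MESSAGE = re.compile(r'^\((?P<func>[^)]*)\)\s?(?P<message>.*)$')


def parse_vdsm_line(line):
    """Split one log line into a LogRecord, or return None if it does not fit."""
    parts = line.rstrip('\n').split('::', 6)
    if len(parts) != 7:
        return None
    thread, level, timestamp, module, lineno, logger, rest = parts
    match = _FUNC_AND_MESSAGE.match(rest)
    if match is None or not lineno.isdigit():
        return None
    return LogRecord(thread, level, timestamp, module, int(lineno),
                     logger, match.group('func'), match.group('message'))


def is_queue_full_warning(line):
    """True for WARNING lines whose message ends with 'executor queue full'."""
    record = parse_vdsm_line(line)
    return (record is not None
            and record.level == 'WARNING'
            and record.message.endswith('executor queue full'))
```

Counting the lines for which is_queue_full_warning() returns True gives a quick measure of how often the executor queue overflowed during a test run.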

Engine:
In the engine, some of the VMs are set to the 'not responding' state for a short time. Once the previous tasks complete, those VMs start as well, until eventually all of them are up as expected.

Comment 6 errata-xmlrpc 2016-09-28 22:17:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1950.html