Note: This is a beta release of Red Hat Bugzilla 5.0. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Also email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback here.
Bug 1361028 - VMs flip to non-responsive state for ever.
Summary: VMs flip to non-responsive state for ever.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.6.7
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ovirt-4.1.0-alpha
: ---
Assignee: Francesco Romani
QA Contact: sefi litmanovich
URL:
Whiteboard:
Depends On:
Blocks: 1364924 1364925
TreeView+ depends on / blocked
 
Reported: 2016-07-28 08:00 UTC by Roman Hodain
Modified: 2017-04-25 00:42 UTC (History)
13 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
This update fixes an error in the monitoring code that caused VDSM to incorrectly report that a QEMU process had recovered and was responsive after being unavailable for a short amount of time, while it was actually unresponsive.
Clone Of:
: 1364924 (view as bug list)
Environment:
Last Closed: 2017-04-25 00:42:08 UTC
oVirt Team: Virt
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 61685 master MERGED periodic: always re-schedule operations 2016-08-02 10:17:19 UTC
oVirt gerrit 61769 ovirt-4.0 MERGED virt: Limit the number of workers in executor 2016-08-17 07:10:04 UTC
oVirt gerrit 61770 ovirt-4.0 MERGED virt: Fix of Executor._active_workers crash on modification 2016-08-17 07:22:11 UTC
oVirt gerrit 61889 ovirt-4.0 MERGED periodic: always re-schedule operations 2016-08-17 07:22:16 UTC
Red Hat Knowledge Base (Solution) 2598771 None None None 2016-09-02 06:52:56 UTC
Red Hat Product Errata RHEA-2017:0998 normal SHIPPED_LIVE VDSM bug fix and enhancement update 4.1 GA 2017-04-18 20:11:39 UTC

Description Roman Hodain 2016-07-28 08:00:58 UTC
Description of problem:
   When VMs get non responding. It can happen that in some cases the executor tasks queue get full and exception TooManyTasks is raised. This causes the operation not being scheduled any more.

Version-Release number of selected component (if applicable):
     vdsm-4.17.31-0.el7ev.noarch

How reproducible:
     Under heavy load and isue with qemu responsivness 

Steps to Reproduce:
     Not completely clear

Actual results:
     All VMs are marked as non-responding

Expected results:
     The task is scheduled as soon as there is some space in the tasks queue is 

Additional info:

vdsm.Scheduler::ERROR::2016-07-16 16:15:28,745::schedule::213::Scheduler::(_execute) Unhandled exception in <bound method Operation._try_to_dispatch of <virt.periodic.Operation object at 0x45ea390>>
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/schedule.py", line 211, in _execute
    self._callable()
  File "/usr/share/vdsm/virt/periodic.py", line 190, in _try_to_dispatch
    self._dispatch()
  File "/usr/share/vdsm/virt/periodic.py", line 197, in _dispatch
    self._executor.dispatch(self, self._timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 101, in dispatch
    self._tasks.put((callable, timeout))
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 256, in put
    raise TooManyTasks()
TooManyTasks



vdsm/virt/periodic.py:
...
class Operation(object):
...
     def _dispatch(self):
         """
         Send `func' to Executor to be run as soon as possible.
         """
         self._call = None
         self._executor.dispatch(self, self._timeout)
         self._step()

The exception comes from 

    self._executor.dispatch(self, self._timeout)

So self._step() is not executed and the operation is not scheduled

Comment 2 Francesco Romani 2016-07-29 08:44:15 UTC
(In reply to Roman Hodain from comment #0)
> Description of problem:
>    When VMs get non responding. It can happen that in some cases the
> executor tasks queue get full and exception TooManyTasks is raised. This
> causes the operation not being scheduled any more.
> 
> Version-Release number of selected component (if applicable):
>      vdsm-4.17.31-0.el7ev.noarch
> 
> How reproducible:
>      Under heavy load and isue with qemu responsivness 
> 
> Steps to Reproduce:
>      Not completely clear

We probably need slow, but not stuck, QEMU.

> Actual results:
>      All VMs are marked as non-responding

Yes, because sampled data is too old, in turn because we skip sampling.

> Expected results:
>      The task is scheduled as soon as there is some space in the tasks queue
> is 
> 
> Additional info:
[...]
> So self._step() is not executed and the operation is not scheduled

Ooops. Good catch, thanks!

Comment 3 Francesco Romani 2016-07-29 09:51:53 UTC
taking care of this bug.
The missing rescheduling bug is obvious and easily fixable (see patch 61685). The backport for 3.6.z is simple and feasible.

However there may be other issue underlying which caused trouble (stuck or slow QEMU). These issue could be hard to debug, so let's address them separately.

I recommend making sure the customer is running at least libvirt-1.2.17-13.el7_2.5 , because this package contains one important fix in this area.

Comment 7 Francesco Romani 2016-08-08 08:25:44 UTC
master branch has all the needed patch. Please clone to 4.0.z and/or 3.6.z if needed.

Comment 8 Michal Skrivanek 2016-08-08 08:28:32 UTC
Cloned to z-stream bug 1364924

Comment 13 sefi litmanovich 2017-02-06 12:47:45 UTC
Verified with rhevm-4.1.0.4-0.1.el7.
HOST: vdsm-4.19.4-1.el7ev.x86_64


1. Modified /etc/vdsm/vdsm.conf and added the lines:

[sampling]
periodic_workers = 1
periodic_task_per_worker = 1

2. restarted vdsm.
3. Created a vm pool with 8 vms.
4. Started all 8 vms.

result:

in vdsm log the new warning message wrapping the TooManyTask exception is issues in vdsm.log:

2017-02-06 14:20:13,790 WARN  (vdsm.Scheduler) [virt.periodic.Operation] could not run <VmDispatcher operation=<class 'vdsm.virt.periodic.BlockjobMonitor'> at 0x2914390>, executor queue full (periodic:218)
2017-02-06 14:20:13,791 WARN  (vdsm.Scheduler) [virt.periodic.Operation] could not run <function <lambda> at 0x1531e60>, executor queue full (periodic:218)

Did not see and vms in the engine set to 'not responding' state. When previous tasks are done the vms are started as well until eventually all are up as expected.


Note You need to log in before you can comment on or make changes to this bug.