
Bug 1366786

Summary: [InClusterUpgrade] Possible race condition with large amount of VMs in cluster
Product: Red Hat Enterprise Virtualization Manager
Reporter: Javier Coscia <jcoscia>
Component: ovirt-engine
Assignee: Arik <ahadas>
Status: CLOSED ERRATA
QA Contact: sefi litmanovich <slitmano>
Severity: high
Priority: high
Docs Contact:
Version: 3.6.8
CC: achareka, ahadas, baptiste.agasse, bgraveno, flo_bugzilla, gklein, jcoscia, kmashalk, kshukla, lsurette, mgoldboi, michal.skrivanek, mlibra, mtessun, rbalakri, rgolan, Rhev-m-bugs, srevivo, ykaul
Target Milestone: ovirt-4.1.0-alpha
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, updating the compatibility version of a cluster with many running virtual machines that have the guest agent installed could cause a database deadlock, which made the update fail; in some cases, these clusters could not be upgraded to a newer compatibility version. Now the deadlock is prevented, so a cluster with many running virtual machines that have the guest agent installed can be upgraded to a newer compatibility version.
Story Points: ---
Clone Of:
Clones: 1369415 1369418 (view as bug list)
Environment:
Last Closed: 2017-04-25 00:54:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1369415, 1369418

Description Javier Coscia 2016-08-12 19:54:27 UTC
Description of problem:

Following https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.6/html/Upgrade_Guide/Upgrading_a_Red_Hat_Enterprise_Linux_6_Cluster_to_Red_Hat_Enterprise_Linux_7.html, step #3 ("Set the InClusterUpgrade scheduling policy") fails. We suspect a race condition when there are more than 100 VMs in the cluster.

Version-Release number of selected component (if applicable):

Engine: 
rhevm-3.6.8.1-0.1.el6.noarch
postgresql-server-8.4.20-6.el6.x86_64

Hypervisors: 
Red Hat Enterprise Virtualization Hypervisor release 6.7 (20160219.0.el6ev)


How reproducible:
100% in customer's environment

Steps to Reproduce:
1. Have a cluster with more than 100 VMs (in the customer's case, 310 VMs)
2. Follow steps #1 and #2 from the documentation
3. Set the scheduling policy to InClusterUpgrade for the cluster you want to upgrade

Actual results:

Fails with:

2016-08-12 11:11:18,604 ERROR [org.ovirt.engine.core.bll.UpdateVmCommand] (ajp-/127.0.0.1:8702-8) [17d532c7] Command 'org.ovirt.engine.core.bll.UpdateVmCommand' failed: CallableStatementCallback; SQL [{call incrementdbgeneration(?)}]; ERROR: deadlock detected

Expected results:

The InClusterUpgrade policy should be set successfully for the cluster

Additional info:

This happened on another setup where the customer had 172 VMs in the cluster; on a different cluster with only 11 VMs, there were no issues.

Will attach relevant logs.

Comment 5 Roy Golan 2016-08-14 08:07:46 UTC
According to your log statements, this engine is still 3.5 (VdsUpdateRunTimeInfo doesn't exist in the 3.6 codebase), so this setup seems unfinished.

Comment 6 Javier Coscia 2016-08-14 21:32:01 UTC
In the engine.log I can't see any reference to VdsUpdateRunTimeInfo after 8/12 @ 09:03:15,629

According to the setup log, the system was updated to 3.6.8 on 8/12 @ 09:58:20. ovirt-engine restart was on 8/12 @ 11:09:04,968

Attaching the setup logs in priv

Comment 8 Roy Golan 2016-08-15 12:58:56 UTC
Definitely a race between incrementdbgeneration, which is called by an update to a VM, and insertvmguestagentinterface.

The cluster update does an UpdateVM for all the VMs in the cluster, including the Up VMs of course. This collides with a batch update from the VMs monitoring, and apparently the updates are not made in the same order, so we hit the deadlock.

looking further.
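
To illustrate the collision, here is a minimal standalone sketch (not oVirt code; the vm_demo table, credentials, and JDBC URL are made up for the demo): two transactions that update the same pair of rows in opposite orders will make PostgreSQL abort one of them with the same "deadlock detected" error shown in the description.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class DeadlockDemo {
        // Hypothetical local database; assumes:
        //   CREATE TABLE vm_demo (id int PRIMARY KEY, gen int);
        //   INSERT INTO vm_demo VALUES (1, 0), (2, 0);
        static final String URL = "jdbc:postgresql://localhost/demo";

        // Update two rows inside a single transaction, in the given order.
        static void updateInOrder(int first, int second) throws Exception {
            try (Connection c = DriverManager.getConnection(URL, "demo", "demo")) {
                c.setAutoCommit(false);
                try (PreparedStatement ps = c.prepareStatement(
                        "UPDATE vm_demo SET gen = gen + 1 WHERE id = ?")) {
                    ps.setInt(1, first);
                    ps.executeUpdate();   // takes a row lock on `first`
                    Thread.sleep(500);    // widen the race window
                    ps.setInt(1, second);
                    ps.executeUpdate();   // blocks while the peer holds `second`
                }
                c.commit();
            }
        }

        public static void main(String[] args) throws Exception {
            // Thread A locks row 1 then 2; thread B locks row 2 then 1.
            // Each ends up waiting on the other's lock, and PostgreSQL
            // aborts one of them with: ERROR: deadlock detected
            Thread a = new Thread(() -> { try { updateInOrder(1, 2); } catch (Exception e) { e.printStackTrace(); } });
            Thread b = new Thread(() -> { try { updateInOrder(2, 1); } catch (Exception e) { e.printStackTrace(); } });
            a.start(); b.start();
            a.join(); b.join();
        }
    }

Making both writers touch the rows in the same order removes the deadlock, which is the essence of the fix discussed in comment 14.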

Comment 9 Ameya Charekar 2016-08-15 16:04:39 UTC
This problem will not be limited to any particular cluster parameter change (like InClusterUpgrade) but would be applicable to any update to the cluster where incrementdbgeneration is called, right?

e.g., can we hit this race condition while changing the cluster compatibility level?

Comment 14 Roy Golan 2016-08-16 08:48:18 UTC
Full analysis of the cause:

1. UpdateClusterCommand calls UpdateVm in a loop, without sorting, on *every* call, without any condition. The fix should be to call it *only* if the cluster compatibility level changed AND to sort the VMs before looping over them (see the sketch after this list).

2. The batch update of guest interfaces should also be sorted by VM id to prevent it from deadlocking with the loop update from UpdateCluster.

3. Nice to have: @ahadas suggests maybe not calling the guest interfaces update in a TX.
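
A minimal sketch of the ordering fix described in points 1 and 2 (this is not the actual oVirt patch; the VM record, updateVm, and updateClusterVms below are hypothetical stand-ins for the engine's real types):

    import java.util.Comparator;
    import java.util.List;
    import java.util.UUID;

    class ClusterUpgradeSketch {
        // Stand-in for the engine's VM business entity.
        record VM(UUID id, String name) {}

        // Stand-in for UpdateVmCommand / the guest-interfaces batch update.
        void updateVm(VM vm) { /* ... */ }

        void updateClusterVms(List<VM> vms, boolean compatLevelChanged) {
            // Point 1: skip the per-VM updates entirely unless the cluster
            // compatibility level actually changed.
            if (!compatLevelChanged) {
                return;
            }
            // Points 1 and 2: sort by VM id so this loop and the monitoring's
            // batch update acquire row locks in the same canonical order.
            vms.stream()
               .sorted(Comparator.comparing(VM::id))
               .forEach(this::updateVm);
        }
    }

Taking locks in one canonical order is the standard deadlock-avoidance technique; point 3 (moving the guest-interfaces update out of the transaction) would presumably shrink the contention window further.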

Comment 16 Marek Libra 2016-08-16 09:23:22 UTC
In response to comment #14: 
UpdateVm is called if the cluster version has changed (please see UpdateClusterCommand.getSharedLocks()).

Comment 21 Arik 2016-08-17 15:14:30 UTC
Update: this deadlock cannot be reproduced with postgresql 9.5.3 (Fedora 24) but can be reproduced with postgresql 8.4.20 (RHEL 6).
I'll check with postgres 9.2 (RHEL 7).

Comment 23 Arik 2016-08-18 08:35:08 UTC
It can happen with postgres 9.2 as well, so a proper fix will be done for 4.0.

Comment 35 sefi litmanovich 2017-02-05 16:40:20 UTC
Verified with rhevm-4.1.0.4-0.1.el7.noarch.

Had a cluster with 120 VMs running, with the guest agent working and reporting IPs.
Changed the cluster compatibility version from 4.0 to 4.1 (hosts were 4.1 all the time) and monitored the UpdateVm calls with tail on the engine log.
Repeated this upgrade flow several times; no race occurred in any of the iterations.