Bug 1064499 - [engine-backend] SQL exception when starting a VM which is known as 'down' to vdsm but is actually 'paused'
Summary: [engine-backend] SQL exception when starting a VM which is known as 'down' to...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.4
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.4.1
Assignee: Francesco Romani
QA Contact: bugs@ovirt.org
URL:
Whiteboard: virt
Depends On:
Blocks:
Reported: 2014-02-12 17:38 UTC by Elad
Modified: 2014-03-31 06:31 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-03-31 06:31:50 UTC
oVirt Team: ---


Attachments
engine, vdsm, libvirt and qemu logs (deleted)
2014-02-12 17:38 UTC, Elad

Description Elad 2014-02-12 17:38:39 UTC
Created attachment 862464 [details]
engine, vdsm, libvirt and qemu logs

Description of problem:
I have a scenario in which two VMs failed to resume from a 'paused' state due to an AttributeError (reported in https://bugzilla.redhat.com/show_bug.cgi?id=1064471). I stopped the VMs from the UI, and after I powered them off, engine and vdsm reported them as 'down' although libvirt still reported them as 'paused' (reported in https://bugzilla.redhat.com/show_bug.cgi?id=1063336).
When I tried to start the VMs again, the engine failed with an SQL exception.


Version-Release number of selected component (if applicable):
ovirt-engine-3.4.0-0.7.beta2.el6.noarch
postgresql-8.4.18-1.el6_4.x86_64


Steps to Reproduce:
On a shared-storage DC with an iSCSI domain,
have a running VM with a disk.
1. Block connectivity from the host to the storage server and wait for the domain to become 'inactive' and the VM to enter the 'paused' state.
2. Restore connectivity to the storage server and wait for the domain to become 'active'. Try to resume the VM; it will fail.
3. Power off the VM and try to start it again.


Actual results:
Resuming the VM from the 'paused' state fails with an AttributeError, as reported in https://bugzilla.redhat.com/show_bug.cgi?id=1064471.
After stopping the VMs, engine and vdsm report them as down:

## select vm_name,status from vm

 vm_name  | status
----------+--------
 shared-2 |      0
 shared-1 |      0

## vdsClient -s 0 list table

6ec3beea-c5a4-4e1b-93cb-37879754f0b7  29262  1                    Up
455425c8-d79f-4f18-85aa-48b516e1751a   5567  shared-3             Up

But libvirt reports the VMs as 'paused':

## virsh -r list

 Id    Name                           State
----------------------------------------------------
 6     shared-1                       paused
 7     shared-2                       paused
 8     1                              running
 10    shared-3                       running


QEMU processes are still running on the host:

## pgrep qemu

5567
20552
21055
29262
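
The mismatch shown above (engine says 'down', libvirt says 'paused') can be checked mechanically. The following is only a minimal sketch, not part of engine or vdsm: it parses the `virsh -r list` table quoted above and flags VMs that the engine considers down but libvirt still lists. The function names `parse_virsh_list` and `find_out_of_sync` are hypothetical, and the parser assumes VM names contain no spaces.

```python
def parse_virsh_list(output):
    """Return {vm_name: state} parsed from `virsh list` table output.

    Assumption: VM names contain no whitespace. Header and separator
    lines are skipped because their first column is not a domain Id.
    """
    states = {}
    for line in output.splitlines():
        parts = line.split()
        # Data rows start with a numeric Id ('-' for inactive domains).
        if len(parts) >= 3 and (parts[0].isdigit() or parts[0] == "-"):
            # The state may be multi-word (e.g. "shut off"), so rejoin it.
            states[parts[1]] = " ".join(parts[2:])
    return states


def find_out_of_sync(engine_down, libvirt_states):
    """Return {vm_name: libvirt_state} for VMs the engine reports as
    down but which libvirt still lists."""
    return {
        name: libvirt_states[name]
        for name in sorted(engine_down)
        if name in libvirt_states
    }


# Sample data copied from the outputs quoted in this report.
VIRSH_OUTPUT = """\
 Id    Name                           State
----------------------------------------------------
 6     shared-1                       paused
 7     shared-2                       paused
 8     1                              running
 10    shared-3                       running
"""

# VMs with status 0 in the `select vm_name,status from vm` output.
ENGINE_DOWN = {"shared-1", "shared-2"}

if __name__ == "__main__":
    for name, state in find_out_of_sync(
        ENGINE_DOWN, parse_virsh_list(VIRSH_OUTPUT)
    ).items():
        print(f"{name}: engine=down libvirt={state}")
```

With the sample data, both shared-1 and shared-2 are flagged, which matches the inconsistency described in this report.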

When I started the VMs again, I got this exception in engine.log:

2014-02-12 18:37:46,766 ERROR [org.ovirt.engine.core.bll.job.JobRepositoryImpl] (org.ovirt.thread.pool-6-thread-46) Failed to save step 533c8d4c-19df-46af-8a70-f031b99e964c, VALIDATING.: org.springframework.dao.DataIntegrityViolationException: CallableStatementCallback; SQL [{call insertstep(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)}]; ERROR: insert or update on table "step" violates foreign key constraint "fk_step_job"
  Detail: Key (job_id)=(ab401dcd-1103-4dbb-ae46-a14438616647) is not present in table "job".
  Where: SQL statement "INSERT INTO step( step_id, parent_step_id, job_id, step_type, description, step_number, status, start_time, end_time, correlation_id, external_id, external_system_type, is_external) VALUES (  $1 ,  $2 ,  $3 ,  $4 ,  $5 ,  $6 ,  $7 ,  $8 ,  $9 ,  $10 ,  $11 ,  $12 ,  $13 )"
PL/pgSQL function "insertstep" line 2 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: insert or update on table "step" violates foreign key constraint "fk_step_job"
  Detail: Key (job_id)=(ab401dcd-1103-4dbb-ae46-a14438616647) is not present in table "job".
  Where: SQL statement "INSERT INTO step( step_id, parent_step_id, job_id, step_type, description, step_number, status, start_time, end_time, correlation_id, external_id, external_system_type, is_external) VALUES (  $1 ,  $2 ,  $3 ,  $4 ,  $5 ,  $6 ,  $7 ,  $8 ,  $9 ,  $10 ,  $11 ,  $12 ,  $13 )"
PL/pgSQL function "insertstep" line 2 at SQL statement
        at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:245) [spring-jdbc.jar:3.1.1.RELEASE]
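
For readers unfamiliar with this class of error: the exception says a row was inserted into "step" whose job_id has no matching row in "job". The sketch below reproduces the same kind of foreign key violation in isolation. It is an illustration only: it uses an in-memory SQLite database instead of PostgreSQL, a two-column stand-in for the real engine schema, and a hypothetical helper `insert_step` in place of the engine's `insertstep` stored procedure. The IDs are the ones from the log above.

```python
import sqlite3


def make_db():
    """Create an in-memory database with simplified job/step tables.

    Assumption: these tables are a stand-in, not the engine schema.
    """
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE job (job_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE step ("
        " step_id TEXT PRIMARY KEY,"
        " job_id TEXT NOT NULL REFERENCES job(job_id))"
    )
    return conn


def insert_step(conn, step_id, job_id):
    """Try to insert a step; return False on a constraint violation."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO step (step_id, job_id) VALUES (?, ?)",
                (step_id, job_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


STEP_ID = "533c8d4c-19df-46af-8a70-f031b99e964c"
JOB_ID = "ab401dcd-1103-4dbb-ae46-a14438616647"

if __name__ == "__main__":
    conn = make_db()
    # No job row exists yet, so the insert is rejected.
    print("without job row:", insert_step(conn, STEP_ID, JOB_ID))
    with conn:
        conn.execute("INSERT INTO job (job_id) VALUES (?)", (JOB_ID,))
    # With the parent row present, the same insert is accepted.
    print("with job row:", insert_step(conn, STEP_ID, JOB_ID))
```

The sketch only shows what the constraint checks; it says nothing about why the job row was missing in the engine at that point.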

 


Additional info: engine, vdsm, libvirt and qemu logs

Comment 1 Sandro Bonazzola 2014-03-04 09:30:27 UTC
This is an automated message.
Re-targeting all non-blocker bugs still open on 3.4.0 to 3.4.1.

Comment 2 Michal Skrivanek 2014-03-09 14:12:13 UTC
it might be a consequence of bug 1063336 & bug 1064471

Comment 3 Francesco Romani 2014-03-17 15:57:09 UTC
I recommend retesting this scenario with the fix for bz1063336 applied, since that seems to be the root cause of all the wrongdoing. Once the state has gone out of sync, e.g. when resuming the VM from the 'paused' state failed with an AttributeError, everything else snowballs.

That said, I'll investigate the VDSM logs to see if we can make VDSM more robust in those scenarios.

Comment 4 Francesco Romani 2014-03-28 11:01:23 UTC
Besides the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1063336 mentioned above, there is not much more we can do about this on the VDSM side.

Comment 5 Michal Skrivanek 2014-03-31 06:31:50 UTC
yeah, we believe this is fixed by vdsm recovering properly on restart, which is fixed in 3.4

