Bug 1597621 - inconsistent guest index found on target host when rebooting a guest with multiple virtio videos during migration
Summary: inconsistent guest index found on target host when rebooting a guest with multiple virtio videos during migration
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Gerd Hoffmann
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-03 10:09 UTC by yafu
Modified: 2019-03-29 03:10 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Attachments
domain xml (deleted) - 2018-08-23 02:58 UTC, yafu
console log (deleted) - 2018-08-30 03:57 UTC, yafu
boot log -2 (deleted) - 2018-08-30 08:04 UTC, yafu
boot log -3 (deleted) - 2018-08-31 05:56 UTC, yafu

Description yafu 2018-07-03 10:09:16 UTC
Description of problem:
Inconsistent guest index found on the target host when rebooting a guest with multiple virtio videos during migration.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.12.0-5.el7.x86_64
libvirt-4.4.0-2.el7.x86_64

How reproducible:
10%

Steps to Reproduce:
1. Start a guest with multiple virtio videos:
#virsh dumpxml iommu1
<os>
    <type arch='x86_64' machine='pc-q35-rhel7.5.0'>hvm</type>
    <boot dev='hd'/>
  </os>
...
    <video>
      <model type='virtio' heads='1' primary='yes'>
        <acceleration accel3d='no'/>
      </model>
      <alias name='ua-04c2decd-4e33-4023-84de-12205c777af6'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </video>
    <video>
      <model type='virtio' heads='1'>
        <acceleration accel3d='no'/>
      </model>
      <alias name='ua-04c2decd-4e35-4023-84de-12205c777af6'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </video>

2. Do the migration while rebooting the guest:
#virsh reboot iommu1; virsh migrate iommu1 qemu+ssh://10.66.4.101/system --live --verbose --p2p --tunnelled
Migration: [ 98 %]error: internal error: qemu unexpectedly closed the monitor: 
2018-06-27T12:38:48.500703Z qemu-kvm: VQ 0 size 0x40 Guest index 0xbf55 inconsistent with Host index 0x43a: delta 0xbb1b
2018-06-27T12:38:48.500726Z qemu-kvm: Failed to load virtio-gpu:virtio
2018-06-27T12:38:48.500737Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:01.4:00.0/virtio-gpu'
2018-06-27T12:38:48.500779Z qemu-kvm: warning: TSC frequency mismatch between VM (2099996 kHz) and host (2593995 kHz), and TSC scaling unavailable
2018-06-27T12:38:48.501780Z qemu-kvm: warning: TSC frequency mismatch between VM (2099996 kHz) and host (2593995 kHz), and TSC scaling unavailable
2018-06-27T12:38:48.501873Z qemu-kvm: warning: TSC frequency mismatch between VM (2099996 kHz) and host (2593995 kHz), and TSC scaling unavailable
2018-06-27T12:38:48.501940Z qemu-kvm: warning: TSC frequency mismatch between VM (2099996 kHz) and host (2593995 kHz), and TSC scaling unavailable
2018-06-27T12:38:48.502026Z qemu-kvm: load of migration failed: Operation not permitted
2018-06-27 12:38:48.715+0000: shutting down, reason=failed

Actual results:
Migration fails when rebooting a guest with multiple virtio videos.

Expected results:
Migration should complete successfully.

Additional info:
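Since the race only reproduces ~10% of the time, looping the reproducer helps to hit it. A rough sketch only (SOURCE_HOST is a placeholder for the source machine's address and the iteration count is arbitrary; the destination URI matches step 2 above):

#!/bin/bash
# repeatedly reboot the guest and immediately start a live migration;
# stop as soon as a migration attempt fails
for i in $(seq 1 50); do
    echo "=== iteration $i ==="
    virsh reboot iommu1
    if ! virsh migrate iommu1 qemu+ssh://10.66.4.101/system --live --verbose --p2p --tunnelled; then
        echo "hit the failure on iteration $i"
        break
    fi
    # the guest now runs on the destination; move it back before the next round
    ssh 10.66.4.101 virsh migrate iommu1 qemu+ssh://SOURCE_HOST/system --live --p2p --tunnelled
done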

Comment 2 Gerd Hoffmann 2018-08-15 11:31:41 UTC
Can you attach the complete domain xml please?

Comment 3 yafu 2018-08-23 02:58:46 UTC
Created attachment 1478026 [details]
domain xml

Comment 4 yafu 2018-08-23 02:59:42 UTC
(In reply to Gerd Hoffmann from comment #2)
> Can you attach the complete domain xml please?

Please see the domain xml in the attachment.

Comment 5 Gerd Hoffmann 2018-08-23 07:53:02 UTC
<domain type='kvm' id='5'>
  <name>iommu1</name>
  <uuid>1b3268d6-b59c-406b-a14c-33b000b15b6c</uuid>

    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x9'/>
      <alias name='pci.2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>

    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0xc'/>
      <alias name='pci.7'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x4'/>
    </controller>

    <video>
      <model type='virtio' heads='1' primary='yes'>
        <acceleration accel3d='no'/>
      </model>
      <alias name='ua-04c2decd-4e33-4023-84de-12205c777af6'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </video>

    <video>
      <model type='virtio' heads='1'>
        <acceleration accel3d='no'/>
      </model>
      <alias name='ua-04c2decd-4e35-4023-84de-12205c777af6'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </video>

Ok, the primary is bus 7, which is root port 00:1.4
The secondary is bus 2, which is root port 00:1.1

So the secondary comes first in pci scan order.
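To double-check that mapping (just a sketch, not output from this bug):

Inside the guest (PCI tree, i.e. the scan order of the root ports):
#lspci -tv

On the host (root port controllers and their 00:01.x addresses):
#virsh dumpxml iommu1 | grep -A4 pcie-root-port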

Comment 6 Gerd Hoffmann 2018-08-23 07:57:12 UTC
Can you configure a serial console for the guest, log the serial console output on the source host to a file, then try to reproduce it?

The kernel log hopefully gives us a clue where exactly in the shutdown or boot process the guest kernel is when this bug happens.
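One possible setup (just a sketch; the log file path is a placeholder): a file-backed serial device in the domain XML, plus "console=ttyS0,115200" on the guest kernel command line so the kernel actually logs to that port.

    <serial type='file'>
      <source path='/var/log/libvirt/qemu/iommu1-console.log'/>
      <target port='0'/>
    </serial>

Inside the guest, the console argument can be added with e.g. grubby --update-kernel=ALL --args="console=ttyS0,115200", followed by a reboot.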

Comment 7 yafu 2018-08-30 03:57:54 UTC
Created attachment 1479680 [details]
console log

Comment 8 yafu 2018-08-30 03:59:04 UTC
(In reply to Gerd Hoffmann from comment #6)
> Can you configure a serial console for the guest, log the serial console
> output on the source host to a file, then try to reproduce it?
> 
> The kernel log hopefully gives us a clue where exactly in the shutdown or
> boot process the guest kernel is when this bug happens.

Please see the log in the attachment.

Comment 9 Gerd Hoffmann 2018-08-30 06:18:06 UTC
(In reply to yafu from comment #8)
> (In reply to Gerd Hoffmann from comment #6)
> > Can you configure a serial console for the guest, log the serial console
> > output on the source host to a file, then try to reproduce it?
> > 
> > The kernel log hopefully gives us a clue where exactly in the shutdown or
> > boot process the guest kernel is when this bug happens.
> 
> Please see the log in the attachment.

Can you please remove the "quiet" from the kernel command line so all the kernel messages are in the log too?

The log looks like the guest is fully booted.  Is this the log of a migration failure?  The initial comment says 10% reproducible, so I assumed you have to hit the right moment in the shutdown or boot process to actually hit it.  Is this correct?
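One way to drop "quiet" inside a RHEL guest (just a sketch), followed by a reboot so it takes effect:

#grubby --update-kernel=ALL --remove-args="quiet"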

Comment 10 yafu 2018-08-30 06:35:20 UTC
(In reply to Gerd Hoffmann from comment #9)
> (In reply to yafu from comment #8)
> > (In reply to Gerd Hoffmann from comment #6)
> > > Can you configure a serial console for the guest, log the serial console
> > > output on the source host to a file, then try to reproduce it?
> > > 
> > > The kernel log hopefully gives us a clue where exactly in the shutdown or
> > > boot process the guest kernel is when this bug happens.
> > 
> > Please see the log in the attachment.
> 
> Can you please remove the "quiet" from the kernel command line so all the
> kernel messages are in the log too?
> 
> The log looks like the guest is fully booted.  Is this the log of a
> migration failure?  The initial comment says 10% reproducible, so I assumed
> you have to hit the right moment in the shutdown or boot process to actually
> hit it.  Is this correct?

Yes, it's the log of a migration failure. The guest can still boot successfully even if the migration failed.

I will attach a log without "quiet" in the kernel command line.

Comment 11 Gerd Hoffmann 2018-08-30 06:49:14 UTC
> Yes, It's the log of a migration failure. The guest can boot successfully
> even migration failed.

Ah, right, the guest is restarted on the source host then, so the log does not stop at the point where the migration was tried.  But that is exactly what I want to know:  where in the boot process Linux is when the migration fails.  Hmm ...

Comment 12 Gerd Hoffmann 2018-08-30 07:00:14 UTC
Ok, a stop & go approach should help debug this.  Can you try this:

(1) reboot the guest.
(2) pause the guest.
(3) try migrate the guest.
(4a) when migration fails: found the guest state which breaks migration, done ;)
(4b) when migration succeeds: unpause, let it run for a moment, pause again, continue with (3).
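A rough shell sketch of that loop, run on the source host (SOURCE_HOST and the sleep interval are placeholders; adjust as needed):

#!/bin/bash
# stop & go: pause the guest at successive points of the boot and try to
# migrate the paused guest, until one of the attempts fails
virsh reboot iommu1
while true; do
    virsh suspend iommu1
    if ! virsh migrate iommu1 qemu+ssh://10.66.4.101/system --live --verbose --p2p --tunnelled; then
        echo "hit the failing guest state"
        break
    fi
    # migration succeeded: the guest is now paused on the destination,
    # so move it back before unpausing and trying again
    ssh 10.66.4.101 virsh migrate iommu1 qemu+ssh://SOURCE_HOST/system --live --p2p --tunnelled
    virsh resume iommu1
    sleep 1
done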

Comment 13 yafu 2018-08-30 08:04:01 UTC
Created attachment 1479716 [details]
boot log -2

Comment 14 yafu 2018-08-30 08:05:59 UTC
(In reply to Gerd Hoffmann from comment #12)
> Ok, a stop & go approach should help debug this.  Can you try this:
> 
> (1) reboot the guest.
> (2) pause the guest.
> (3) try migrate the guest.
> (4a) when migration fails: found the guest state which breaks migration,
> done ;)
> (4b) when migration succeeds: unpause, let it run for a moment, pause again,
> continue with (3).

Please see the log in attachment 'boot log -2'. I paused the guest after migration failed.

Comment 15 Gerd Hoffmann 2018-08-30 09:18:10 UTC
> Please see the log in attachment 'boot log -2'. I paused the guest after
> migration failed.

Ok, so it happens after booting the kernel but before loading the virtio-gpu driver.

Comment 16 Gerd Hoffmann 2018-08-30 09:52:12 UTC
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=18106944

Can you try whether this build works?

Comment 17 yafu 2018-08-31 05:56:36 UTC
Created attachment 1479959 [details]
boot log -3

Comment 18 yafu 2018-08-31 05:57:58 UTC
(In reply to Gerd Hoffmann from comment #16)
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=18106944
> 
> Can you try whether this build works?

I can still reproduce the issue with this build.

Comment 19 Gerd Hoffmann 2018-11-27 10:10:00 UTC
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19290458
How about this one?

Comment 20 yafu 2018-11-30 05:24:14 UTC
(In reply to Gerd Hoffmann from comment #19)
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19290458
> How about this one?

I can still reproduce the issue with this build.

Comment 21 Gerd Hoffmann 2019-03-04 12:16:37 UTC
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20430769
Can you test please?

Comment 22 yafu 2019-03-07 03:12:49 UTC
(In reply to Gerd Hoffmann from comment #21)
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20430769
> Can you test please?

I ran the test in a loop of 200 iterations and cannot reproduce the issue any more.

Comment 23 Gerd Hoffmann 2019-03-13 07:38:08 UTC
patches merged upstream:
8ea90ee690eb78bbe6644cae3a7eff857f8b4569
3912e66a3febdea3b89150f923ca9be3f02f7ae3
0be00346d1e3d96b839832809d7042db8c7d4300 (optional cleanup)
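To check whether a given qemu tree already contains these fixes (just a sketch, run inside a qemu git checkout):

#git merge-base --is-ancestor 8ea90ee690eb78bbe6644cae3a7eff857f8b4569 HEAD && echo "first fix present"
#git merge-base --is-ancestor 3912e66a3febdea3b89150f923ca9be3f02f7ae3 HEAD && echo "second fix present"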

