Bug 1358992 - multi-threaded compilation failed inside guest VM
Summary: multi-threaded compilation failed inside guest VM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel-aarch64
Version: 7.3
Hardware: aarch64
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Wei Huang
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-07-22 04:10 UTC by Wei Huang
Modified: 2016-07-29 19:47 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-29 19:47:11 UTC



Description Wei Huang 2016-07-22 04:10:44 UTC
Description of problem:
While compiling the Linux kernel source inside an SMP guest, the compilation stalls and cannot be stopped.

Version-Release number of selected component:
 * kernel-4.5.0-0.46.el7.aarch64
 * qemu-kvm-common-rhev-2.6.0-14.el7.aarch64
 * libvirt-daemon-2.0.0-1.el7.aarch64
 * AAVMF-20160608-2.git988715a.el7.noarch

How reproducible:
 Always

Steps to Reproduce:
This problem can be reproduced on both Mustang (with latest fw) and Seattle:
 1) provision system with RHEL-7.3-20160719.n.0 compose
 2) install the guest VM with the same (or latest) RHELSA release
   virt-install -n rhelsa-73 -l http://download.eng.bos.redhat.com/nightly/latest-RHEL-7/compose/Server/aarch64/os/ --vcpus=8 --memory 4096 --disk size=8 
 3) download the latest linux kernel source inside guest VM and compile it
    # make distclean
    # make defconfig
    # make -j8
The compilation becomes very slow and eventually stalls. Past a certain point, it can no longer be stopped with "CTRL+C".
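
For anyone re-running step 3, a small watchdog can record a stall automatically instead of relying on someone watching the terminal. The following is a minimal sketch and not part of the original report: the function name `run_with_stall_watchdog` and the idea of treating a long gap in build output as a stall are assumptions made for illustration. Because it runs inside the affected guest, the watchdog itself may also be slowed down or hung by the same problem.

```python
import os
import queue
import signal
import subprocess
import threading


def run_with_stall_watchdog(cmd, stall_seconds):
    """Run cmd and report whether its output went silent.

    Hypothetical helper. Returns (stalled, returncode). stalled is True
    when no output line arrived for stall_seconds; returncode may be None
    if the process could not be reaped after being killed.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # own process group, so child jobs can be killed too
    )
    lines = queue.Queue()

    def pump():
        # Forward each output line to the main thread; None marks end of output.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)

    threading.Thread(target=pump, daemon=True).start()

    while True:
        try:
            line = lines.get(timeout=stall_seconds)
        except queue.Empty:
            # Silence for stall_seconds: treat it as a stall and kill the group.
            try:
                os.killpg(proc.pid, signal.SIGKILL)
            except ProcessLookupError:
                pass
            try:
                proc.wait(timeout=5)
            except subprocess.TimeoutExpired:
                pass  # process did not exit; returncode stays None
            return True, proc.returncode
        if line is None:
            return False, proc.wait()
```

A call such as `run_with_stall_watchdog(["make", "-j8"], 600)` would flag the build as stalled after ten minutes without output. The 600-second threshold is an arbitrary choice, and the sketch is POSIX-only because it uses process groups.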

Comment 2 Wei Huang 2016-07-22 16:12:37 UTC
Got one more data point. If I install a 7.2 guest VM (guest kernel 4.2.0-0.21.el7.aarch64), the test passes on Mustang.

Given that the rest of the components are identical, this appears to be a guest kernel issue.

-Wei

Comment 3 Wei Huang 2016-07-29 14:30:28 UTC
I now have a reliable configuration for reproducing this problem.

=== Test 1 ===
* Install a guest VM with the 4.5-0.44 kernel, for example from the RHEL-7.3-20160713.n.0 compose.
* Attach an additional disk with the following config:
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/home/images/guest-extra-kernel.img'/>
      <target dev='sdb' bus='scsi'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
The file system I used was ext4.
* Download the Linux kernel source and do a multi-threaded compilation. It will stall within 10 minutes.

=== Test 2 ===
Another test I did was to try the 0.45 kernel (RHEL-7.3-20160715.n.0). Using the exact configuration from Test 1, this guest VM passed without any problem.

I then installed the 0.44 kernel on the same guest VM. The compilation failed on this kernel.

Comment 4 Andrew Jones 2016-07-29 14:57:23 UTC
$ git log --oneline kernel-4.5.0-0.44.el7..kernel-4.5.0-0.45.el7
2d731213abf1 [redhat] kernel-4.5.0-0.45.el7
ddba4640ce45 smsc911x: Fix bug where PHY interrupts are overwritten by 0
c7062ffec8dc arm64: mm: always take dirty state from new pte in ptep_set_access_flags
5be4d393fd68 pci: xgene: Match ECAM quirk using OEM ID and OEM Table ID
b5e0fae3ba81 pci: xgene: Enable fix-up for all PCIe domains

Can you try a .45 with c7062ffec8dc reverted?

Comment 5 Wei Huang 2016-07-29 16:48:36 UTC
(In reply to Andrew Jones from comment #4)
> $ git log --oneline kernel-4.5.0-0.44.el7..kernel-4.5.0-0.45.el7
> 2d731213abf1 [redhat] kernel-4.5.0-0.45.el7
> ddba4640ce45 smsc911x: Fix bug where PHY interrupts are overwritten by 0
> c7062ffec8dc arm64: mm: always take dirty state from new pte in
> ptep_set_access_flags
> 5be4d393fd68 pci: xgene: Match ECAM quirk using OEM ID and OEM Table ID
> b5e0fae3ba81 pci: xgene: Enable fix-up for all PCIe domains
> 
> Can you try a .45 with c7062ffec8dc reverted?

I did, and this is the cause of the BZ. I added this patch on top of the 0.44 kernel and re-ran the test. The VM survived 2 hrs of stress testing.

Comment 6 Wei Huang 2016-07-29 19:47:11 UTC
On the 4.5-0.44 kernel, I saw spikes in CPU load that brought system performance to its knees. This symptom is exactly what c7062ffec8dc fixes. The symptom as reported on the old kernel is below:

"
| This patch breaks swapping for me.
| In the broken case, you'll see either systemd cpu time spike (because
| it's stuck in a page fault loop) or the system hang (because the
| application owning the screen is stuck in a page fault loop).
"

This matches what I saw inside the guest VM. We can close this BZ now, given that 4.5-0.45 includes c7062ffec8dc.

-Wei
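
To check whether a given guest is running a 4.5.0 build that already carries c7062ffec8dc, the release string from `uname -r` can be compared against 0.45. This is a rough sketch and not something from the bug itself: the helper name `includes_fix` is hypothetical, it only understands el7 release strings of the form seen in this report, it assumes builds after 0.45 keep the patch, and it is not a substitute for a real RPM version comparison.

```python
import re

# First 4.5.0 build that contains c7062ffec8dc, per this bug: kernel-4.5.0-0.45.el7
FIRST_FIXED_RELEASE = (0, 45)


def includes_fix(kernel_release):
    """Return True if a 4.5.0-0.N.el7 kernel release is at or after 0.45.

    Hypothetical helper. Raises ValueError for anything that is not a
    4.5.0 el7 build, since this bug says nothing about other series.
    """
    match = re.match(r"4\.5\.0-(\d+)\.(\d+)\.el7", kernel_release)
    if match is None:
        raise ValueError("not a 4.5.0 el7 kernel release: %r" % kernel_release)
    release = tuple(int(part) for part in match.groups())
    return release >= FIRST_FIXED_RELEASE
```

For example, `includes_fix("4.5.0-0.44.el7.aarch64")` is False and `includes_fix("4.5.0-0.45.el7.aarch64")` is True, matching the results of Test 1 and Test 2 in comment 3.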

