Bug 234375 - PV guests can crash at boot w/ >4GB memory
Summary: PV guests can crash at boot w/ >4GB memory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.0
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Chris Lalancette
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2007-03-28 19:03 UTC by Stephen Tweedie
Modified: 2018-10-19 23:13 UTC

Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-11-07 19:45:39 UTC
Target Upstream Version:


Links:
System: Red Hat Product Errata
ID: RHBA-2007:0959
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated kernel packages for Red Hat Enterprise Linux 5 Update 1
Last Updated: 2007-11-08 00:47:37 UTC

Description Stephen Tweedie 2007-03-28 19:03:25 UTC
XenSource reports:

"We're seeing some issues with the RHEL5 32b xen kernel that are leading
to frequent XenRT failures. Guests crash during boot when the host has
>4GB of RAM with alarmingly high probability.

The problem is that trying to clear a pte by setting it to PFN 0 can
potentially cause the entry to be temporarily invalid since it writes
the upper word first (i.e. the PTE remains present). It is also not
correct to launder the 0 through p2m, which set_pte will do. The
combination of these causes a crash when swapper_pg_dir and PFN 0 have
MFNs on opposite sides of the 4G boundary."
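
The write-ordering hazard described in the report can be illustrated with a small sketch. This is not the RHEL or Xen source: the struct and function names below are hypothetical, and the PTE is modelled as two plain 32-bit words purely to show what value is visible between the two writes.

```c
#include <stdint.h>

#define PAGE_PRESENT 0x1u

/* A PAE PTE is 64 bits wide, but a 32-bit guest updates it as two words.
 * (Hypothetical model type, not a kernel structure.) */
struct pae_pte {
    uint32_t low;   /* flags (including PRESENT) and frame bits 12..31 */
    uint32_t high;  /* frame bits 32 and above */
};

static uint64_t pte_value(const struct pae_pte *p)
{
    return ((uint64_t)p->high << 32) | p->low;
}

/* Problem order: the upper word is zeroed first. The returned intermediate
 * value is still PRESENT but now names a frame below 4G that the guest
 * may not own. */
static uint64_t clear_high_first(struct pae_pte *p)
{
    uint64_t intermediate;

    p->high = 0;
    intermediate = pte_value(p);
    p->low = 0;
    return intermediate;
}

/* Safer order for clearing: the lower word goes first, so PRESENT is
 * already gone by the time the frame bits become inconsistent. A real
 * kernel would also need a write barrier between the two stores. */
static uint64_t clear_low_first(struct pae_pte *p)
{
    uint64_t intermediate;

    p->low = 0;
    intermediate = pte_value(p);
    p->high = 0;
    return intermediate;
}
```

With an entry whose frame lies above 4G, clear_high_first() exposes a present entry with the upper frame bits missing, while clear_low_first() only ever exposes a non-present one.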

Version-Release number of selected component (if applicable):
RHEL-5.0.0

Steps to Reproduce:

"while true; do xm reboot <name>; done" on a machine with plenty of RAM
will result in a guest crash fairly soon. Of course, you'll need to set
preserve on crash to easily see that it has happened.

Additional info:

Should be fixed by xen-unstable cset 12381:

http://xenbits.xensource.com/xen-unstable.hg?cs=c6efd6c2feaa

Comment 2 RHEL Product and Program Management 2007-04-25 20:17:45 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Ian Campbell 2007-07-03 08:16:21 UTC
It turns out that there has been a hypervisor-side workaround for this issue
for a while now, but it was broken in xen-unstable.hg between
13392:0fd65225e4c6 (17 Jan 2007) and 15433:a5360bf18668 (28 June 2007).
The workaround is in xen/arch/x86/mm.c with the comment (line 3297 in current
xen-unstable):
        /*
         * If this is an upper-half write to a PAE PTE then we assume that
         * the guest has simply got the two writes the wrong way round. We
         * zap the PRESENT bit on the assumption that the bottom half will
         * be written immediately after we return to the guest.
         */

I suspect that the RHEL5 hypervisor has the workaround but doesn't have the
breakage, in which case this can be closed.
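
As a rough illustration of the workaround that comment describes, here is a minimal sketch. It is not the code from xen/arch/x86/mm.c: struct guest, frame_ok() and emulate_pte_write() are hypothetical names, and frame ownership is reduced to a single contiguous range so the example stays self-contained.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_PRESENT   0x1ull
#define PTE_FRAME_MASK 0x000FFFFFFFFFF000ull

/* Hypothetical stand-in for the hypervisor's ownership check: the guest
 * owns the machine frames in [first_mfn, first_mfn + nr_mfns). */
struct guest {
    uint64_t first_mfn;
    uint64_t nr_mfns;
};

static bool frame_ok(const struct guest *g, uint64_t pte)
{
    uint64_t mfn = (pte & PTE_FRAME_MASK) >> 12;

    return mfn >= g->first_mfn && mfn - g->first_mfn < g->nr_mfns;
}

/* Emulate a 4-byte guest write to one half of a 64-bit PTE.
 * offset is 0 for the lower half and 4 for the upper half.
 * Returns false if the write is rejected; *pte is then unchanged. */
static bool emulate_pte_write(const struct guest *g, uint64_t *pte,
                              unsigned offset, uint32_t val)
{
    uint64_t new_pte;

    if (offset == 4)
        new_pte = (*pte & 0x00000000FFFFFFFFull) | ((uint64_t)val << 32);
    else if (offset == 0)
        new_pte = (*pte & 0xFFFFFFFF00000000ull) | val;
    else
        return false;

    if ((new_pte & PAGE_PRESENT) && !frame_ok(g, new_pte)) {
        if (offset != 4)
            return false;          /* genuinely bad entry: refuse it */
        /* Upper-half write: assume the guest wrote the halves in the
         * wrong order and that the lower half follows immediately. */
        new_pte &= ~PAGE_PRESENT;
    }

    *pte = new_pte;
    return true;
}
```

The design point is that only the upper-half case is forgiven; an invalid present entry produced by a lower-half write is still refused.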

Comment 8 Red Hat Bugzilla 2007-07-25 00:44:41 UTC
change QA contact

Comment 9 Chris Lalancette 2007-07-25 18:03:15 UTC
Crap.  I finally was able to reproduce this, just not in the way specified
originally.  I have a 2 CPU Intel SDV here, running the RHEL 5.1 dom0 bits,
i386.  Then I have 1 RHEL-5.0 i386 PV guest running an FTP test that is
saturating the networking.  Finally, I have a 2nd RHEL-5.0 i386 PV guest just
rebooting in a loop (init 6 in /etc/rc.local).  Very often, that 2nd guest will
fail to boot, with this in xm dmesg:

(XEN) mm.c:3267:d6 ptwr_emulate: could not get_page_from_l1e()

After applying c/s 15433 to the HV from xen-unstable, however, I now see this:

(XEN) mm.c:3263:d14 ptwr_emulate: fixing up invalid PAE PTE 0000000149f12025

and the domain successfully reboots.  I believe we are going to need that c/s
for our HV, to support older 5.0 guests.

Chris Lalancette
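
For readers decoding the value in the "fixing up invalid PAE PTE" message above, the following sketch splits a 64-bit PAE PTE into its frame number and low flag bits. The macro and function names are illustrative only; the bit positions are the standard x86 PAE ones, and the frame mask assumes physical address bits 12 to 51.

```c
#include <stdint.h>

/* Standard x86 page-table flag bits (names here are illustrative). */
#define PTE_PRESENT    (1ull << 0)
#define PTE_RW         (1ull << 1)
#define PTE_USER       (1ull << 2)
#define PTE_ACCESSED   (1ull << 5)

/* Physical address bits 12..51 of a PAE entry. */
#define PTE_FRAME_MASK 0x000FFFFFFFFFF000ull
#define FOUR_GIB       (1ull << 32)

/* Frame number held in the entry (a machine frame for a Xen PV guest). */
static uint64_t pte_frame(uint64_t pte)
{
    return (pte & PTE_FRAME_MASK) >> 12;
}

/* Byte address of the start of that frame. */
static uint64_t pte_addr(uint64_t pte)
{
    return pte & PTE_FRAME_MASK;
}

/* Non-zero if the frame lies at or above the 4G boundary, i.e. the
 * upper 32-bit word of the entry carries frame bits. */
static int pte_above_4g(uint64_t pte)
{
    return pte_addr(pte) >= FOUR_GIB;
}
```

Applied to 0x0000000149f12025, this gives frame 0x149f12, which is above the 4G boundary, with the present, user and accessed bits set and the writable bit clear.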

Comment 10 Don Zickus 2007-07-31 21:14:50 UTC
in 2.6.18-37.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 11 Issue Tracker 2007-08-17 07:26:04 UTC
Fujitsu tested with 5.1 beta, and it worked fine.
----------------------------------
We tested this issue with kernel-xen-2.6.18-37.el5,
the result is OK. We could boot 4 domains.


This event sent from IssueTracker by mmatsuya 
 issue 128803

Comment 14 errata-xmlrpc 2007-11-07 19:45:39 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html


