Bug 453574 - virtual ethernet device stops working on reception of duplicate backend state change signals
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.2
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Markus Armbruster
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks: 449772 RHEL5u3_relnotes 464676
 
Reported: 2008-07-01 12:59 UTC by Alex Zeffertt
Modified: 2010-03-14 21:31 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
A race condition could occur when creating and destroying virtual network devices. In some circumstances — especially high load situations — this would cause the virtual device to not respond. In this update, the state of the virtual device is checked to prevent the race condition from occurring.
Clone Of:
Environment:
Last Closed: 2009-01-20 20:21:23 UTC
Target Upstream Version:


Attachments
[NET] front: Fix crashes when xenstore watches fire multiple times. (deleted)
2008-07-01 12:59 UTC, Alex Zeffertt
no flags


Links
System: Red Hat Product Errata
ID: RHSA-2009:0225
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update
Last Updated: 2009-01-20 16:06:24 UTC

Description Alex Zeffertt 2008-07-01 12:59:10 UTC
Description of problem:

A virtual interface is created and plugged into the VM.  Within the VM it is
given an IP address using ifconfig.  A ping is then attempted but there appears
to be no network connectivity.

This happens infrequently during stress testing.  /var/log/messages shows that
these occurrences coincide with drivers/xen/netfront/netfront.c:network_connect()
being called twice.  This suggests that the netfront driver is receiving a
duplicate "backend_changed" signal, the second of which it should ignore.

The duplicate "backend_changed" signals are a known issue, and there is a
xen-3.1 guest kernel patch to protect against them.  However, it looks like this
hasn't been applied in either the RHEL4.6 or RHEL5.2 kernel-xen packages.

I'll attach the patch to this bug report.


Version-Release number of selected component (if applicable):

RHEL4.6: kernel-2.6.9-67.0.20.EL
RHEL5.2: kernel-2.6.18-92.1.6.el5

How reproducible:

Sporadic; difficult to reproduce.

Steps to Reproduce:
1. Create a VIF (in the VM management tool).
2. Assign the virtual interface an IP address (in the VM).
3. Try to ping a known IP address on the same network.
  
Actual results:

ping reports the host as "Unreachable".

Expected results:

ping reaches the host.

Additional info:

Comment 1 Alex Zeffertt 2008-07-01 12:59:10 UTC
Created attachment 310658 [details]
[NET] front: Fix crashes when xenstore watches fire multiple times.

Comment 2 Markus Armbruster 2008-07-16 22:04:46 UTC
Alex,

Many thanks for the patch.  I can see the first and the third patch hunk in the
drivers I got from http://xenbits.xensource.com/linux-2.6.18-xen.hg, but not the
second.  How come?  Could you point me to the relevant upstream changeset(s)?

Comment 3 Ian Campbell 2008-07-21 14:14:23 UTC
Alex is away at the moment but let me try and answer.

The upstream changeset is
http://xenbits.xensource.com/xen-unstable.hg?rev/79315be2c9b9

The second hunk is indeed not present any longer. I had a dig and found that it
was subsequently removed by
http://xenbits.xensource.com/xen-unstable.hg?rev/e99ba0c6c046
which came out of
http://lists.xensource.com/archives/html/xen-devel/2006-12/msg00843.html

We haven't observed that failure, though I don't know whether we test for it.

Comment 4 RHEL Product and Program Management 2008-07-31 13:01:17 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 7 Don Zickus 2008-09-15 14:17:59 UTC
in kernel-2.6.18-115.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 10 Ryan Lerch 2008-11-07 00:15:14 UTC
This bug has been marked for inclusion in the Red Hat Enterprise Linux 5.3
Release Notes.

To aid in the development of relevant and accurate release notes, please fill
out the "Release Notes" field above with the following 4 pieces of information:


Cause:   What actions or circumstances cause this bug to present.

Consequence:  What happens when the bug presents.

Fix:   What was done to fix the bug.

Result:  What now happens when the actions or circumstances above occur. (NB:
this is not the same as 'the bug doesn't present anymore')

Comment 12 Markus Armbruster 2008-11-07 19:45:39 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: A race condition exists in the Xenbus protocols for device
creation and destruction.  The netfront driver didn't cope with it.

Consequence: Network device creation can result in a device that is
hung.  Happens rarely, typically when stress testing.

Fix: Backport fix from upstream.

Result: Network device creation works reliably, even when stress
testing.

Comment 14 Ryan Lerch 2008-11-17 01:42:55 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,10 +1 @@
-Cause: A race condition exists in the Xenbus protocols for device
+A race condition could occur when creating and destroying virtual network devices. In some circumstances — especially high load situations — this would cause the virtual device to not respond. In this update, the state of the virtual device is checked to prevent the race condition from occurring.
-creation and destruction.  The netfront driver didn't cope with it.
-
-Consequence: Network device creation can result in a device that is
-hung.  Happens rarely, typically when stress testing.
-
-Fix: Backport fix from upstream.
-
-Result: Network device creation works reliably, even when stress
-testing.

Comment 17 errata-xmlrpc 2009-01-20 20:21:23 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

