Bug 1600668 - ipxe fails to function in presence of high amounts of flooded unicast and broadcast traffic [NEEDINFO]
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: ipxe
Version: 7.5
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: pre-dev-freeze
: ---
Assignee: Neil Horman
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-07-12 17:56 UTC by Joe Antkowiak
Modified: 2019-01-02 13:22 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-02 13:22:38 UTC
Target Upstream Version:
nhorman: needinfo? (joea)


Attachments (Terms of Use)
node console screenshot of ipxe failing to load the next image file (deleted)
2018-07-12 18:01 UTC, Joe Antkowiak
node console screenshot of ipxe failing to load the next image file (second failure type) (deleted)
2018-07-12 18:02 UTC, Joe Antkowiak

Description Joe Antkowiak 2018-07-12 17:56:52 UTC
Description of problem:
While attempting to introspect and deploy with OSP10 on multiple Supermicro nodes, the iPXE load process consistently failed to complete on 1-5 nodes per attempt, though never on the same nodes.

We later determined that the upstream switch was not correctly using its switching table and was flooding some unicast traffic unrelated to the target NICs, as well as the usual broadcast traffic.

Our guess is that this irrelevant traffic fills up a buffer somewhere in iPXE.

As soon as we removed the irrelevant flooded unicast traffic from the network, iPXE worked fine from then on.

Console screenshots will be attached.

Version-Release number of selected component (if applicable):
OSP10 and Director

How reproducible:
Very reproducible: flood an interface with irrelevant traffic (addressed to MAC addresses other than the NIC's) and try to introspect or deploy a node.  PXE boots and chainloads iPXE; iPXE then starts to work and fails in various ways.

Steps to Reproduce:
1. Add nodes to director.
2. Reproduce unicast traffic flooding on the switch port facing the PXE boot NIC (one way to do this is to configure it as a SPAN port, as is done for packet captures).
3. Attempt to introspect the node.

Actual results:
iPXE fails to load the rest of the ramdisk/agent/kernel.

Expected results:
ipxe should complete, ignoring all irrelevant unicast traffic, and continue loading the OS

Additional info:

Comment 1 Joe Antkowiak 2018-07-12 18:01:28 UTC
Created attachment 1458484 [details]
node console screenshot of ipxe failing to load the next image file

Comment 2 Joe Antkowiak 2018-07-12 18:02:48 UTC
Created attachment 1458485 [details]
node console screenshot of ipxe failing to load the next image file (second failure type)

Comment 3 Dmitry Tantsur 2018-08-14 13:15:46 UTC
Hi!

What could we do to improve the situation, considering that it's way outside the control of ironic itself? Maybe you could publish a KB article explaining the issue? I'm really not sure what we can do in the code to fix it.

Comment 4 Joe Antkowiak 2018-08-20 17:02:39 UTC
I agree, this is not an ironic issue; it is an issue with the iPXE image.

The ipxe network stack should operate like a standard network stack and ignore all unicast traffic received on the nic that is not destined for its mac address.  Receiving flooded unicast traffic (just like being on a hub vs a switch) should not break functionality of ipxe.

This isn't a broadcast storm instance, this is just normal ethernet unknown unicast flooding that occurs on hubs or when a switch hasn't identified where a mac address is.
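
For illustration, the filtering behaviour described in this comment can be sketched in a few lines of Python. This is only a model of the expected destination-MAC decision, not iPXE or NIC driver code; the names `parse_mac`, `classify`, and `accepts` are hypothetical and exist only for this sketch.

```python
BROADCAST = bytes.fromhex("ffffffffffff")


def parse_mac(text: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' (or dash-separated) into 6 raw bytes."""
    raw = bytes.fromhex(text.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 48-bit MAC address: {text!r}")
    return raw


def classify(dst: bytes, own: bytes) -> str:
    """Classify a frame by its destination MAC relative to our own MAC."""
    if dst == BROADCAST:
        return "broadcast"
    if dst[0] & 0x01:  # group bit set in the first octet means multicast
        return "multicast"
    if dst == own:
        return "own-unicast"
    return "foreign-unicast"  # e.g. unknown-unicast flooding from a switch


def accepts(dst: bytes, own: bytes, promiscuous: bool = False,
            multicast_groups: frozenset = frozenset()) -> bool:
    """Model of the filter: True if the frame should reach the network stack."""
    if promiscuous:
        return True  # promiscuous mode bypasses the filter entirely
    kind = classify(dst, own)
    if kind in ("broadcast", "own-unicast"):
        return True
    if kind == "multicast":
        return dst in multicast_groups
    return False  # flooded unicast for another station is dropped
```

Under this model, flooded unicast frames only reach the stack when `promiscuous` is set, which is the condition the later comments in this bug speculate about.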

Comment 5 Joe Antkowiak 2018-08-20 17:08:34 UTC
Are we able to fix an issue in ipxe?

Comment 6 Dmitry Tantsur 2018-09-19 09:28:23 UTC
We get the iPXE image from RHEL, redirecting to RHEL folks for their input.

Comment 9 Neil Horman 2018-09-20 13:07:46 UTC
Can you mirror a port on the peer switch to a failing iPXE node and capture the traffic during a failure, please?  To be perfectly honest, the problem description doesn't make any sense to me.  If the incorrectly flooded traffic is unicast, then it should be squashed by the NIC itself via the MAC filter table, which should only let in traffic addressed to its own MAC or to the broadcast MAC (unless you have a very, very old NIC that has no filtering capabilities).  Getting a network trace (in pcap format) would help me root-cause this issue and develop an appropriate fix.
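
Once such a capture exists, a quick way to check how much of it is flooded traffic is to tally frames by destination MAC. The following is a minimal sketch using only the Python standard library. It assumes a classic libpcap file (not pcapng) with an Ethernet link type; the function name `summarize_pcap` and the category labels are hypothetical.

```python
import struct
from collections import Counter

BROADCAST = b"\xff" * 6


def summarize_pcap(data: bytes, own_mac: bytes) -> Counter:
    """Count frames in a classic pcap byte string by destination-MAC class."""
    if len(data) < 24:
        raise ValueError("too short to contain a pcap global header")
    magic = data[:4]
    # Microsecond and nanosecond magic numbers, in both byte orders.
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        endian = "<"
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        endian = ">"
    else:
        raise ValueError("not a classic pcap file (pcapng is not handled)")
    (linktype,) = struct.unpack_from(endian + "I", data, 20)
    if linktype != 1:  # 1 is the Ethernet link type
        raise ValueError(f"unsupported link type {linktype}")

    counts = Counter()
    offset = 24
    while offset + 16 <= len(data):
        # Per-record header: ts_sec, ts_frac, captured length, original length.
        _sec, _frac, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", data, offset)
        offset += 16
        frame = data[offset:offset + incl_len]
        offset += incl_len
        if len(frame) < 6:
            continue  # truncated record, no full destination MAC
        dst = frame[:6]
        if dst == BROADCAST:
            counts["broadcast"] += 1
        elif dst[0] & 0x01:
            counts["multicast"] += 1
        elif dst == own_mac:
            counts["own-unicast"] += 1
        else:
            counts["foreign-unicast"] += 1
    return counts
```

A call might look like `summarize_pcap(open("capture.pcap", "rb").read(), bytes.fromhex("525400123456"))`, where both the file name and the MAC address are placeholders. A large `foreign-unicast` count relative to `own-unicast` would be consistent with the flooding described in this report.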

Comment 10 Joe Antkowiak 2018-09-20 14:06:14 UTC
I'll do my best to get that capture, but the failure can also be reproduced by doing a PXE-to-iPXE boot through a switch port that is configured as a SPAN port carrying a lot of extra unicast traffic.

You are exactly correct, the nic -should- be filtering that out, but for some reason it doesn't seem to be when it's running the ipxe image.  PXE works fine to chainload ipxe, but once ipxe is up that's when it fails

Perhaps ipxe is doing something to the NIC to make it process and forward all traffic to the ipxe network stack?  (promiscuous mode?)

Comment 11 Neil Horman 2018-10-31 11:16:30 UTC
It's certainly possible, but I don't see anything in any of the drivers that explicitly forces promisc mode.  There are some higher-layer cases where that appears to happen, but nothing that you would hit in normal operation.

Any luck on the capture?  Also, what is the PCI ID tuple of the NIC in question here?

Comment 12 Neil Horman 2019-01-02 13:22:38 UTC
This has gone unanswered for over two months now.  Closing for lack of response.

