Bug 1365561 - High Transmit tx error count in OSP8 environment
Summary: High Transmit tx error count in OSP8 environment
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.2
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Flavio Leitner
QA Contact: Rick Alongi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-09 14:30 UTC by Jeremy Eder
Modified: 2016-08-18 15:28 UTC (History)
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-18 15:28:30 UTC


Attachments
plotnetcfg raw (deleted) - 2016-08-09 18:35 UTC, Jeremy Eder
plotnetcfg pdf (deleted) - 2016-08-09 18:35 UTC, Jeremy Eder
sosreport from ospd node (deleted) - 2016-08-11 19:00 UTC, Jeremy Eder

Description Jeremy Eder 2016-08-09 14:30:25 UTC
We have reports from two environments, both using RHEL 7.2 and LACP, of excessive error counts on the transmit side of all network devices participating in the bond.

Environment 1 uses an i40e X710 with Intel's 1.5.19 out-of-tree driver (we need this to fix another issue in 7.2 wrt i40e link flapping).
Environment 2 uses an ixgbe X540-AT2 with the RHEL 7.2 inbox driver.
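
(For reference, a quick way to confirm which driver build each environment is actually running; the interface name em1 is taken from the configs below, and these are sketch commands rather than output captured for this bug:)

# driver name, version, and firmware reported for a given port
ethtool -i em1

# for the out-of-tree i40e build, the module path typically shows whether the
# inbox copy or the extra/updates copy is the one currently loaded
modinfo i40e | grep -E '^(filename|version):'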

[root@overcloud-controller-2 lib]# uname -r
3.10.0-327.18.2.el7.x86_64

[root@overcloud-controller-2 lib]# cat /etc/sysconfig/network-scripts/ifcfg-bond1 
# This file is autogenerated by os-net-config
DEVICE=bond1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-ex
MACADDR="f8:bc:12:11:62:00"
BONDING_OPTS="mode=802.3ad"
[root@overcloud-controller-2 lib]# cat /etc/sysconfig/network-scripts/ifcfg-em1 
# This file is autogenerated by os-net-config
DEVICE=em1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
MASTER=bond1
SLAVE=yes
BOOTPROTO=none


Description of problem:
[root@overcloud-controller-2 lib]# cat /proc/net/dev|column -t
Inter-|      Receive         |            Transmit
face         |bytes          packets      errs      drop      fifo  frame  compressed  multicast|bytes  packets         errs         drop  fifo  colls  carrier  compressed
bond1:       16474155244918  18533591307  0         1992      0     0      0           1942205          19839302940640  21273701002  0     0     0      0        0           0
ovs-system:  0               0            0         0         0     0      0           0                0               0            0     0     0      0        0           0
br-int:      0               0            0         11569175  0     0      0           0                0               0            0     0     0      0        0           0
vlan165:     55248797977     132499349    0         130       0     0      0           0                94573805691     518522292    0     0     0      0        0           0
vlan164:     1523320404866   853400420    0         2         0     0      0           0                1691660232120   842398453    0     0     0      0        0           0
vlan163:     10781247902153  1427492639   0         0         0     0      0           0                14352613898502  1193573818   0     0     0      0        0           0
vlan162:     3480684362252   6934899255   0         251       0     0      0           0                2907865627241   6869372956   0     0     0      0        0           0
br-tun:      0               0            0         0         0     0      0           0                0               0            0     0     0      0        0           0
bond0:       0               0            0         0         0     0      0           0                0               0            0     0     0      0        0           0
vlan7:       103886420       674726       0         1         0     0      0           0                25444489        179810       0     0     0      0        0           0
lo:          4852845881460   1065612644   0         0         0     0      0           0                4852845881460   1065612644   0     0     0      0        0           0
br-ex:       28627899125     41779957     0         905838    0     0      0           0                9443947180      21565994     0     0     0      0        0           0
em1:         4639698518981   5714725741   0         631       0     0      0           56144            1758616323942   1547761055   0     0     0      0        0           0
em2:         4546850873376   5227603612   0         46        0     0      0           73686            3925003379313   5280462030   0     0     0      0        0           0
em3:         3906061490094   4106837946   0         780       0     0      0           1756592          7945427123920   9292150062   0     0     0      0        0           0
em4:         3381551803943   3484446531   0         535       0     0      0           55793            6210262472537   5153351640   0     0     0      0        0           0
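
(For anyone re-checking these numbers: in /proc/net/dev the transmit errs and drop counters are the 12th and 13th whitespace-separated fields, so a minimal way to pull them per interface is:)

# print per-interface transmit errs/drop from /proc/net/dev (skip the two header lines)
awk 'NR>2 {printf "%-12s tx_errs=%s tx_drop=%s\n", $1, $12, $13}' /proc/net/dev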

This issue manifests in packet loss during ping tests like this:

root@ansible22-j: ~ # ping 192.2.8.15
PING 192.2.8.15 (192.2.8.15) 56(84) bytes of data.
64 bytes from 192.2.8.15: icmp_seq=2 ttl=64 time=0.540 ms
64 bytes from 192.2.8.15: icmp_seq=3 ttl=64 time=0.492 ms
64 bytes from 192.2.8.15: icmp_seq=4 ttl=64 time=0.504 ms
64 bytes from 192.2.8.15: icmp_seq=6 ttl=64 time=0.305 ms
64 bytes from 192.2.8.15: icmp_seq=7 ttl=64 time=0.443 ms
64 bytes from 192.2.8.15: icmp_seq=8 ttl=64 time=0.448 ms
64 bytes from 192.2.8.15: icmp_seq=10 ttl=64 time=0.363 ms
64 bytes from 192.2.8.15: icmp_seq=11 ttl=64 time=0.408 ms
64 bytes from 192.2.8.15: icmp_seq=12 ttl=64 time=0.403 ms
64 bytes from 192.2.8.15: icmp_seq=13 ttl=64 time=0.347 ms
64 bytes from 192.2.8.15: icmp_seq=14 ttl=64 time=0.527 ms
^C
--- 192.2.8.15 ping statistics ---
14 packets transmitted, 11 received, 21% packet loss, time 12999ms
rtt min/avg/max/mdev = 0.305/0.434/0.540/0.076 ms
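
(A crude way to list which sequence numbers went missing in a run like the one above, assuming the ping output was saved to a file; ping.log is a hypothetical name:)

# report gaps in icmp_seq from saved ping output
grep -o 'icmp_seq=[0-9]*' ping.log | cut -d= -f2 | \
    awk 'NR>1 && $1 != prev+1 {print "lost:", (prev+1) ".." ($1-1)} {prev=$1}'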

Or dropped ssh sessions with errors such as
buffer_get_string_ret: bad string length 80034234

Or

root@ansible22-j: ~ # ssh root@192.2.7.171
root@ansible22-n-1: ~ # Corrupted MAC on input.
Disconnecting: Packet corrupt
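
(One possible triage step, not something established in this report: the ssh failures above point at payload corruption in flight, and a common way to rule the NIC offload paths in or out is to temporarily disable checksum/segmentation offloads on a slave and see whether the corruption persists:)

# temporarily turn off checksum/TSO/GSO/GRO offloads on one slave (test only)
ethtool -K em1 tx off rx off tso off gso off gro off
# show the resulting offload settings
ethtool -k em1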

Comment 1 Jarod Wilson 2016-08-09 14:37:52 UTC
First steps I'd like to see:

1) Does the issue reproduce with a current RHEL-7.3 development kernel? We had significant amounts of 802.3ad bonding code updates backported from upstream for 7.3.

2) Does the issue reproduce with an upstream kernel? There are some additional bonding fixes that have gone upstream since the backport work for 7.3 was done.

If one of those works, then we start chasing which patch(es) fixed the problem. If not, we have some new upstream development work to do here.

Comment 2 Jeremy Eder 2016-08-09 17:22:38 UTC
3.10.0-489 -- still errors
4.7.0 -- still errors
3.10.0-327.18.2 + test patch from Intel -- still errors

Comment 3 Alexander Duyck 2016-08-09 18:22:24 UTC
More information is definitely needed on this.  Also, do we know whether it is only the VMs that are dropping packets, or is the host also seeing these drops?

From what I can tell it looks like this is a setup with OVS running and I assume just VLANs.  Can you provide me with a plotnetcfg dump for this?

Also can you provide the ethtool -S output for the ports so that we can try to narrow down the type of errors being seen?

Comment 4 Jeremy Eder 2016-08-09 18:34:42 UTC
Everything I've reported so far has been from the bare metal hosts.
Yes, OVS and VXLAN actually (openstack).

plotnetcfg output is attached.

[root@overcloud-compute-10 Linux_x64]# ethtool -S em1|grep err
     rx_errors: 0
     tx_errors: 0
     rx_length_errors: 0
     rx_crc_errors: 0
     tx_lost_interrupt: 0
     port.tx_errors: 0
     port.rx_crc_errors: 0
     port.rx_length_errors: 0
[root@overcloud-compute-10 Linux_x64]# ethtool -S em2|grep err
     rx_errors: 0
     tx_errors: 0
     rx_length_errors: 0
     rx_crc_errors: 0
     tx_lost_interrupt: 0
     port.tx_errors: 0
     port.rx_crc_errors: 0
     port.rx_length_errors: 0
[root@overcloud-compute-10 Linux_x64]# ethtool -S em3|grep err
     rx_errors: 0
     tx_errors: 0
     rx_length_errors: 0
     rx_crc_errors: 0
     tx_lost_interrupt: 0
     port.tx_errors: 0
     port.rx_crc_errors: 0
     port.rx_length_errors: 0
[root@overcloud-compute-10 Linux_x64]# ethtool -S em4|grep err
     rx_errors: 0
     tx_errors: 0
     rx_length_errors: 0
     rx_crc_errors: 0
     tx_lost_interrupt: 0
     port.tx_errors: 0
     port.rx_crc_errors: 0
     port.rx_length_errors: 0
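
(The same check can be run across all four ports in one pass; a trivial loop over the slave names used above:)

# dump error counters for every bond slave in one go
for nic in em1 em2 em3 em4; do
    echo "== $nic =="
    ethtool -S "$nic" | grep -i err
done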

Comment 5 Jeremy Eder 2016-08-09 18:35:13 UTC
Created attachment 1189373 [details]
plotnetcfg raw

Comment 6 Jeremy Eder 2016-08-09 18:35:36 UTC
Created attachment 1189374 [details]
plotnetcfg pdf

Comment 7 Alexander Duyck 2016-08-09 20:11:56 UTC
So this seems to imply the NICs themselves are not seeing any errors.  Any errors being reported therefore appear to be occurring up at the stack level, before the traffic ever reaches the NICs.

From what I can tell, it looks like I was pulled in to address the i40e link flapping mentioned in comment 1 ("we need this to fix another issue in 7.2 wrt i40e link flapping").  I believe that is what the patch I provided was meant to address: specifically, there were TSO layouts that could cause the Tx queue to be disabled, which would cause the port to reset, resulting in a link flap.

Beyond that it appears that we are looking at some issue in the OVS, Bridge, Bonding, or VXLAN code that is unrelated to the Intel drivers.
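
(To narrow down which layer is accumulating the errors, it may help to compare counters at each level of the stack; a minimal sketch using the device and bridge names from this report, not commands that were actually run here:)

# kernel-level counters for the bond and one of its slaves
ip -s link show bond1
ip -s link show em1

# per-port rx/tx error counters as seen by Open vSwitch on the external bridge
ovs-ofctl dump-ports br-ex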

Comment 8 Jeremy Eder 2016-08-09 20:16:31 UTC
Oh -- the link flapping hasn't occurred since we updated to 1.5.19 driver.

Comment 9 Jeremy Eder 2016-08-11 13:20:07 UTC
Moving to OVS component...

Comment 10 Jarod Wilson 2016-08-11 16:47:13 UTC
(In reply to Jeremy Eder from comment #9)
> Moving to OVS component...

I still see this set as a kernel bonding issue at the moment, but I'd guess it's not a bonding issue, given the errors have *also* been witnessed on a single NIC link with no bonding in the mix (IIRC from our IRC conversation).

Comment 11 Jeremy Eder 2016-08-11 16:52:24 UTC
Yes, exactly -- and also, I noticed a significant number of errors (2 million) within one of the KVM guests (the host had a lot more).

So, I don't see how this is a bonding driver issue -- hence move to OVS.
Also, Intel reports back that the switch ports are clean and the NIC firmware reports no errors in the ethtool output.

Comment 12 Flavio Leitner 2016-08-11 18:38:44 UTC
Please capture the sosreport from the host reproducing the issue using the sos from upstream:

$ git clone https://github.com/sosreport/sos.git
$ cd sos
$ ./sosreport

and provide the resulting tarball.

Thanks,
fbl

Comment 14 Jeremy Eder 2016-08-11 19:00:38 UTC
Created attachment 1190123 [details]
sosreport from ospd node

