Bug 1360420 - openvswitch randomly causing loopback errors on cisco switch port [NEEDINFO]
Summary: openvswitch randomly causing loopback errors on cisco switch port
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 8.0 (Liberty)
Assignee: Matteo Croce
QA Contact: Ofer Blaut
URL:
Whiteboard:
Depends On:
Blocks: 1552603
 
Reported: 2016-07-26 16:52 UTC by Andreas Karis
Modified: 2019-04-13 05:25 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1552603 (view as bug list)
Environment:
Last Closed: 2016-10-24 13:47:51 UTC
amuller: needinfo? (jbenc)
mnadeem: needinfo? (jbenc)



Description Andreas Karis 2016-07-26 16:52:10 UTC
A customer opened a ticket reporting that the switch ports connected to compute nodes randomly go into the err-disabled state with the reason "loopback detected". The issue occurs very rarely (every few weeks), but it seems that openvswitch loops Cisco's keepalive messages back to the originating switch. This happens on multiple nodes, so an L1 issue (crosstalk, etc.) is very unlikely.

The whole issue is also discussed here (in the context of Debian, but with the same version of openvswitch):
http://openvswitch.org/pipermail/discuss/2016-March/020362.html


Here is the full text of the above link (once again, keep in mind that it talks about Debian, but the issue is the same, and it's the same version of openvswitch):
==============================================================

We have found a very strange bug in Open vSwitch, when it is connected to a
Cisco Switch port, the port will randomly get err-disabled.

So we have 76 Debian servers installed with Open vSwitch (2.4.0), each
connected to a port on a Cisco 3110 switch. Every week or two, a port on
the Cisco switch ends up err-disabled. From the Cisco switch's
perspective, the port was disabled because it detected a loopback by
receiving a keepalive message that had originated from that same switch
port.

Basically the keepalive message was like below:

11:37:01.749102 e8:04:62:c8:6e:81 > e8:04:62:c8:6e:81, ethertype Loopback
(0x9000), length 60: Loopback, skipCount 0, Reply, receipt number 0, data
(40 octets)
0x0000:  0000 0100 0000 0000 0000 0000 0000 0000  ................
0x0010:  0000 0000 0000 0000 0000 0000 0000 0000  ................
0x0020:  0000 0000 0000 0000 0000 0000 0000       ..............

Our first guess was that Open vSwitch accidentally sends the keepalive
message it received back to the port and leads to err-disabled state.
Normally the Open vSwitch will discard this message, but once a week or two
in 76 servers, it will get back to the port on the cisco switch and the
port will be err-disabled.

The work around we are using now are either disabling sending keepalive
message on cisco switch or explicitly add a flow rule for discarding that
keepalive message on Open vSwitch.

The Open vSwitch version is:
ovs-vswitchd (Open vSwitch) 2.4.0
Compiled Aug 31 2015 16:53:51

The configuration of the switch is:
    Bridge "acc_10064"
        Port "acc_10064"
            Interface "acc_10064"
                type: internal
        Port "vxnet2"
            Interface "vxnet2"
        Port "10064_88ad7aaa"
            Interface "10064_88ad7aaa-02"
                type: vxlan
                options: {key="10064", local_ip="IP1", remote_ip="IP2"}
            Interface "10064_88ad7aaa-01"
                type: vxlan
                options: {key="10064", local_ip="IP1", remote_ip="IP3"}
    Bridge "acc_10050"
        Port "10050_0977455a"
            Interface "10050_0977455a-01"
                type: vxlan
                options: {key="10050", local_ip="IP1", remote_ip="IP4"}
            Interface "10050_0977455a-02"
                type: vxlan
                options: {key="10050", local_ip="IP1", remote_ip="IP5"}
        Port "vxnet0"
            Interface "vxnet0"
        Port "acc_10050"
            Interface "acc_10050"
                type: internal
        Port "vxnet1"
            Interface "vxnet1"
    Bridge "br0"
        Port "eth0"
            Interface "eth0"
        Port "br0"
            Interface "br0"
                type: internal
    ovs_version: "2.4.0"

The kernel version is:
Linux version 3.16.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version
4.8.4 (Debian 4.8.4-1) ) #1 SMP Debian 3.16.7-ckt11-1+deb8u3 (2015-08-04)

The ovs-dpctl show output is:
system@ovs-system:
lookups: hit:536177037 missed:17196786 lost:0
flows: 182
masks: hit:1130706939 total:9 hit/pkt:2.04
port 0: ovs-system (internal)
port 1: acc_10050 (internal)
port 2: vxlan_sys_4789 (vxlan)
port 3: eth0
port 4: br0 (internal)
port 5: vxnet0
port 6: vxnet1
port 7: acc_10064 (internal)
port 8: vxnet2

Open vSwitch does not have a controller connected and is configured
as a normal L2 switch.

We found a similar case via Google, but it is unanswered:
https://forums.gentoo.org/viewtopic-p-7884924.html?sid=12abe544bda8782c840fa5c70df6e65e

============================================================================
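
The keepalive frame in the quoted capture has two recognizable properties: the ethertype is 0x9000 (Loopback) and the source and destination MAC addresses are identical. A minimal Python sketch of a check for such frames, for example when post-processing a packet capture, could look like the following. The function names are hypothetical, and the example frame is rebuilt by hand from the values in the capture above (14-byte header plus 46 payload bytes); this is an illustration, not part of any OVS or Cisco tooling.

```python
import struct

ETHERTYPE_LOOPBACK = 0x9000  # ethertype shown in the capture ("Loopback")


def mac_to_bytes(mac: str) -> bytes:
    """Convert a MAC such as 'e8:04:62:c8:6e:81' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def is_cisco_keepalive(frame: bytes) -> bool:
    """Return True if the frame looks like the keepalive from the capture.

    Heuristic (an assumption based on the capture, not a Cisco spec):
    ethertype 0x9000 and source MAC equal to destination MAC.
    """
    if len(frame) < 14:
        return False  # too short to hold an Ethernet header
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    return ethertype == ETHERTYPE_LOOPBACK and src == dst


# Rebuild the frame from the capture: same MAC twice, ethertype 0x9000,
# then the payload bytes 00 00 01 00 followed by zeros (46 bytes total).
mac = mac_to_bytes("e8:04:62:c8:6e:81")
payload = bytes.fromhex("00000100") + bytes(42)
example_frame = mac + mac + struct.pack("!H", ETHERTYPE_LOOPBACK) + payload

print(len(example_frame), is_cisco_keepalive(example_frame))
```

Seeing such a frame leave the host on the uplink interface (rather than only arrive on it) would support the theory that the keepalive is being sent back to the switch.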

And a few more details:

http://www.cisco.com/c/en/us/support/docs/lan-switching/spanning-tree-protocol/69980-errdisable-recovery.html

"Loopback error

A loopback error occurs when the keepalive packet is looped back to the port that sent the keepalive. The switch sends keepalives out all the interfaces by default. A device can loop the packets back to the source interface, which usually occurs because there is a logical loop in the network that the spanning tree has not blocked. The source interface receives the keepalive packet that it sent out, and the switch disables the interface (errdisable). This message occurs because the keepalive packet is looped back to the port that sent the keepalive:

%PM-4-ERR_DISABLE: loopback error detected on Gi4/1, putting Gi4/1 in
err-disable state

Keepalives are sent on all interfaces by default in Cisco IOS Software Release 12.1EA-based software. In Cisco IOS Software Release 12.2SE-based software and later, keepalives are not sent by default on fiber and uplink interfaces. For more information, refer to Cisco bug ID CSCea46385 (registered customers only) ."

https://supportforums.cisco.com/discussion/12022291/3750-x-2960-x-strange-portchannel-behaviour
"The loopback error you are seeing is caused by a port receiving its own LOOP frames that are sent each 10 seconds. Normally, this should never happen because a correctly behaved neighboring switch would never send a frame back through the port it came in. However, if the MAC learning is disabled, or if there is an unblocked loop in the network, the frame may eventually loop back to the port where it was originated. Such a port will be err-disabled. I am therefore looking for any reason that would allow the LOOP frame to get back to its originating port."

[searched on Google for "openvswitch cisco loopback"]
https://forums.gentoo.org/viewtopic-p-7884924.html?sid=12abe544bda8782c840fa5c70df6e65e
and
http://openvswitch.org/pipermail/discuss/2016-March/020362.html

"The work around we are using now are either disabling sending keepalive
message on cisco switch or explicitly add a flow rule for discarding that
keepalive message on Open vSwitch."

ovs_version: "2.4.0", which is the same version as in the mailing list post above

[workarounds on cisco side]
! disable errdisable loopback
no errdisable detect cause loopback

int gi1/0/x
no keepalive
(http://www.cisco.com/c/en/us/td/docs/ios-xml/ios/interface/command/ir-cr-book/ir-i1.html#wp1133860665)
(https://supportforums.cisco.com/document/20111/error-ethcntr-3-loopbackdetected-catalyst-switch-runs-cisco-iosr-software)

==> The network should be monitored closely so that any real network loops are caught. If there really is a loop, "no keepalive" keeps the interface from being err-disabled, but it will _not_ prevent the loop.

[workarounds on ovs side]
- Explicitly add a flow rule on Open vSwitch that discards the keepalive message.
However, this may be more complicated, because the OVS flows are programmed by neutron.
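
As a rough illustration of the OVS-side workaround, the sketch below builds an ovs-ofctl add-flow command that drops frames with ethertype 0x9000. The bridge name br0 is taken from the configuration in the mailing list post; the priority value and the helper names are assumptions for illustration only. The script only prints the command and does not run it. Because neutron manages the flows on its bridges, a manually added flow may not persist, so this should be read as a starting point rather than a tested fix.

```python
import shlex
from typing import List


def keepalive_drop_flow(priority: int = 1000) -> str:
    """Flow spec matching ethertype 0x9000 and dropping it.

    The priority is an assumed value; it only needs to be higher than
    the flow that would otherwise forward the frame.
    """
    return f"priority={priority},dl_type=0x9000,actions=drop"


def add_flow_command(bridge: str, flow: str) -> List[str]:
    """Build the argument list for `ovs-ofctl add-flow <bridge> <flow>`."""
    return ["ovs-ofctl", "add-flow", bridge, flow]


cmd = add_flow_command("br0", keepalive_drop_flow())
print(shlex.join(cmd))  # printed for review, not executed
```

After adding such a flow, `ovs-ofctl dump-flows br0` can be used to confirm it is present and to watch its packet counter.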

Comment 2 Jiri Benc 2016-09-08 13:13:51 UTC
Given there's no reply from the customer to the last question in the case (and the case itself seems to be closed now), is there anything we can do?

Or should we just close this bug?

Comment 3 Flavio Leitner 2016-09-29 20:05:28 UTC
From the ticket:
---8<---
I have not yet enabled the capture on all nodes; however, I thought I would mention that the other day we had this happen again.  Looking at our monitoring tool, it seems as though an instance received the same MAC address as the physical network card that the bridge is running on.  This in turn caused the loopback error state on the interface.  I am still looking into that to see if I can get any logs from the occurrence.
---8<---

It looks like there is a problem with MAC address allocation for instances, which is not part of OVS.

Andreas, could you please confirm?
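
The observation quoted in this comment (an instance receiving the same MAC address as the physical NIC under the bridge) could be checked with a small script. The following is a minimal Python sketch, assuming the MAC addresses of the physical NICs and of the instance ports have already been collected by some other means. The function names, interface names, and sample addresses are all hypothetical.

```python
from typing import Dict, List, Tuple


def normalize_mac(mac: str) -> str:
    """Lower-case the MAC and use ':' separators so comparisons are reliable."""
    return mac.strip().lower().replace("-", ":")


def find_mac_collisions(
    physical_macs: Dict[str, str], instance_macs: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (instance, nic, mac) for every instance MAC that equals a NIC MAC."""
    by_mac = {normalize_mac(mac): nic for nic, mac in physical_macs.items()}
    collisions = []
    for instance, mac in instance_macs.items():
        nic = by_mac.get(normalize_mac(mac))
        if nic is not None:
            collisions.append((instance, nic, normalize_mac(mac)))
    return collisions


# Hypothetical sample data: one instance port reuses the MAC of NIC "nic0".
physical = {"nic0": "52:54:00:AA:BB:01", "nic1": "52:54:00:aa:bb:02"}
instances = {"vm-a": "fa:16:3e:00:00:01", "vm-b": "52-54-00-aa-bb-01"}

for instance, nic, mac in find_mac_collisions(physical, instances):
    print(f"{instance} shares MAC {mac} with physical NIC {nic}")
```

A non-empty result would point at MAC allocation rather than at OVS forwarding as the cause of the looped keepalive.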

Comment 5 Andreas Karis 2016-09-29 20:08:22 UTC
Please keep this bug open, we are gathering further data for it.

Comment 7 Flavio Leitner 2016-10-24 13:47:51 UTC
The support case is closed due to inactivity and it has been almost a month without updates here. Having said that I will close this bz. However, feel free to re-open when you have additional information to provide and we will continue helping as before.

Thanks!

Comment 11 Jiri Benc 2017-07-13 16:34:43 UTC
This upstream bug report contains some interesting data:
https://mail.openvswitch.org/pipermail/ovs-dev/2016-July/319418.html

Comment 12 Shinobu KINJO 2017-07-20 05:02:16 UTC
(In reply to Jiri Benc from comment #11)
> This upstream bug report contains some interesting data:
> https://mail.openvswitch.org/pipermail/ovs-dev/2016-July/319418.html

That's good to read.

Comment 13 Jiri Benc 2017-07-20 16:21:15 UTC
I'm confused about the sosreport in comment 8. I can't make sense of it. Are you sure this was captured with the problem reproduced, i.e. the switch port to which em1 was connected being err-disabled at the point the sosreport was run?

Comment 14 Jiri Benc 2017-08-18 15:20:50 UTC
Ping? (See comment 13.)

