Bug 1518684 - "ovs-vsctl show" on OCP nodes returns multiple "No such device" messages [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.9.0
Assignee: Dan Williams
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks: 1542093
 
Reported: 2017-11-29 13:06 UTC by Thom Carlin
Modified: 2019-01-19 01:13 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-28 14:13:03 UTC
knakayam: needinfo? (dcbw)


Attachments (Terms of Use)
Log from ovs-vsctl and ovs-ofctl commands (deleted)
2017-11-30 16:18 UTC, Weibin Liang
no flags Details
node log and OPTIONS=--loglevel=5 (deleted)
2017-11-30 18:47 UTC, Weibin Liang
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:0489 None None None 2018-03-28 14:13:48 UTC
Origin (Github) 18166 None None None 2018-01-19 14:23:05 UTC

Description Thom Carlin 2017-11-29 13:06:27 UTC
Description of problem:

On a fully patched OCP 3.6/CNS 3.6 cluster, receiving "No such device" messages on the nodes

Version-Release number of selected component (if applicable):

3.6

How reproducible:

100% on this cluster

Steps to Reproduce:
1. On each node: ovs-vsctl show

Actual results:

[...]
      Port "vethbcdb039b"
            Interface "vethbcdb039b"
                error: "could not open network device vethbcdb039b (No such device)"
[...]



Expected results:

Listing of the Open vSwitch database contents without errors

Additional info:

sosreports will be added in private attachments

Comment 1 Thom Carlin 2017-11-29 14:09:03 UTC
sosreports are too large for attachments

Comment 2 Weibin Liang 2017-11-30 15:41:24 UTC
Saw the same error in v3.7.9:

[root@host-172-16-120-67 ~]# ovs-vsctl show
8e6c5352-1338-4e22-ad1a-5e3a905b4159
    Bridge "br0"
        fail_mode: secure
        Port "veth6cf0fa55"
            Interface "veth6cf0fa55"
        Port "veth0bf8145d"
            Interface "veth0bf8145d"
        Port "vethe68eec9b"
            Interface "vethe68eec9b"
                error: "could not open network device vethe68eec9b (No such device)"
        Port "veth5dabac94"
            Interface "veth5dabac94"
        Port "br0"
            Interface "br0"
                type: internal
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key=flow, remote_ip=flow}
        Port "vethd6279c2b"
            Interface "vethd6279c2b"
        Port "tun0"
            Interface "tun0"
                type: internal
        Port "veth98c02cf9"
            Interface "veth98c02cf9"
    ovs_version: "2.7.3"
[root@host-172-16-120-67 ~]# oc version
oc v3.7.9
kubernetes v1.7.6+a08f5eeb62
features: Basic-Auth GSSAPI Kerberos SPNEGO
[root@host-172-16-120-67 ~]#

Comment 3 Dan Winship 2017-11-30 16:00:13 UTC
Weibin: can you attach the result of "ovs-ofctl -O OpenFlow13 show br0" and "ovs-ofctl -O OpenFlow13 dump-flows br0" as well?

Comment 4 Weibin Liang 2017-11-30 16:18:46 UTC
Created attachment 1360987 [details]
Log from ovs-vsctl and ovs-ofctl commands

Comment 5 Dan Winship 2017-11-30 16:50:22 UTC
OK, so "ovs-ofctl show" shows veths attached to ports 4, 8, 10, 12, and 13, but "ovs-ofctl dump" shows flows for ports 4, 7, 8, 10, 12, and 13. Meaning, we still have a flow for port 7 despite not having a veth attached to it, presumably corresponding to the missing veth in the "ovs-vsctl" output.

So, this is some sort of pod cleanup error. Possibly related to bug 1518912.

Weibin: can you put the atomic-openshift-node logs for this node somewhere? As far back as they go on this node. (And let me know what loglevel they're at.)
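The stale-flow check described in this comment can be scripted. A minimal sketch, assuming the output of the two `ovs-ofctl` commands has been saved to files first (the function and file names are hypothetical):

```shell
# stale_flow_ports: print OpenFlow port numbers that appear in flows but
# have no attached interface on the bridge.
#   $1: file holding "ovs-ofctl -O OpenFlow13 show br0" output
#   $2: file holding "ovs-ofctl -O OpenFlow13 dump-flows br0" output
stale_flow_ports() {
  # Port numbers of attached interfaces, e.g. " 4(veth...): addr:..."
  sed -n 's/^ *\([0-9][0-9]*\)(.*/\1/p' "$1" | sort -u > /tmp/attached.$$
  # Port numbers referenced by flows via "in_port=N"
  grep -o 'in_port=[0-9]*' "$2" | sed 's/.*=//' | sort -u > /tmp/flows.$$
  # Lines only in the flows list = flows for ports with no interface
  comm -13 /tmp/attached.$$ /tmp/flows.$$
  rm -f /tmp/attached.$$ /tmp/flows.$$
}
```

For the data described in this comment (interfaces on ports 4, 8, 10, 12, 13 but flows for 4, 7, 8, 10, 12, 13) this should print only port 7.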

Comment 6 Thom Carlin 2017-11-30 18:29:53 UTC
Although there is no evidence either way as to whether this error causes other issues,
a workaround supplied by Dan removes these messages:

1) oadm drain <node_name>
2) Reboot node
3) oadm uncordon <node_name>

Note that you must have sufficient capacity in your cluster to absorb the containers evacuated from the node.
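Scripted, the steps above might look like the following sketch; the ssh-based reboot is an assumption, and in practice you would wait for the node to report Ready again before uncordoning:

```shell
# Sketch of the drain/reboot/uncordon workaround from this comment.
# "oadm" is the 3.6-era admin CLI used above; the ssh reboot is an assumption.
evacuate_and_reboot() {
  node="$1"
  oadm drain "$node" || return 1   # evacuate pods; needs spare cluster capacity
  ssh "$node" reboot               # rebooting clears the stale OVS ports
  # ...wait here for the node to become Ready again...
  oadm uncordon "$node"            # make the node schedulable once more
}
```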

Comment 7 Weibin Liang 2017-11-30 18:47:39 UTC
Created attachment 1361101 [details]
node log and OPTIONS=--loglevel=5

Comment 12 Weibin Liang 2018-02-09 14:52:27 UTC
Tested and verified on v3.9.0-0.41.0

[root@host-172-16-120-139 Sanity-Test]# ovs-vsctl show
451601d1-2b65-4e88-8be4-189491cdd333
    Bridge "br0"
        fail_mode: secure
        Port "vethf90cbbbf"
            Interface "vethf90cbbbf"
        Port "veth06984ca2"
            Interface "veth06984ca2"
        Port "vxlan0"
            Interface "vxlan0"
                type: vxlan
                options: {key=flow, remote_ip=flow}
        Port "tun0"
            Interface "tun0"
                type: internal
        Port "vethf35a42c9"
            Interface "vethf35a42c9"
        Port "vethe1ee7155"
            Interface "vethe1ee7155"
        Port "br0"
            Interface "br0"
                type: internal
        Port "veth65346a6c"
            Interface "veth65346a6c"
        Port "veth65a33588"
            Interface "veth65a33588"
        Port "veth573462cb"
            Interface "veth573462cb"
    ovs_version: "2.7.3"
[root@host-172-16-120-139 Sanity-Test]# 
[root@host-172-16-120-139 Sanity-Test]# 
[root@host-172-16-120-139 Sanity-Test]# oc version
oc v3.9.0-0.41.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://172.16.120.139:8443
openshift v3.9.0-0.41.0
kubernetes v1.9.1+a0ce1bc657

Comment 15 errata-xmlrpc 2018-03-28 14:13:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

Comment 16 xingweiyang 2018-05-30 01:08:02 UTC
Still found in OCP 3.9.14.

Comment 17 Kenjiro Nakayama 2018-12-18 04:18:41 UTC
@Dan @Ben,

One of our TAM partners has requested a backport of this fix to 3.7.x. Although they know that rebooting the host is a workaround, it is difficult for them to accept.
Could you please consider backporting the fix to 3.7.x? If that is not possible, we need to explain that this issue is completely harmless, so can you advise us? (e.g. they have already observed that a bunch of useless Open vSwitch port/flow rules remain on each node. Will this hit any limit?)

Comment 18 Ryan Howe 2019-01-19 01:13:17 UTC
Adding more info regarding the issue.


Hit the issue in OCP 3.7.72-1, where we see about 666 ports showing this:

        Port "veth942fc505"
            Interface "veth942fc505"
                error: "could not open network device veth942fc505 (No such device)"
        Port "veth14fb4836"
            Interface "veth14fb4836"
                error: "could not open network device veth14fb4836 (No such device)"
    ovs_version: "2.9.0"


What ended up happening: the SDN created an eth0 and failed to place it in the container.

# ip -s link | grep 960
1199: veth34fe960f@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP mode DEFAULT 
1200: eth0@veth34fe960f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT 

From there the default route was set to eth0 and its IP (an address from the SDN CIDR). The node became NotReady due to failing to connect to the master.

Rebooting the machine works around the issue.
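To gauge how widespread the leak is on a node (e.g. the ~666 ports above), the error lines can be pulled out of captured "ovs-vsctl show" output; a small sketch, with the function and capture file names being hypothetical:

```shell
# missing_ovs_devices: print the interface names that "ovs-vsctl show"
# reports as "No such device". $1 is a file holding that output.
missing_ovs_devices() {
  sed -n 's/.*could not open network device \([^ ]*\) (No such device).*/\1/p' "$1"
}

# Count of stale ports, e.g.:
#   missing_ovs_devices vsctl-show.txt | wc -l
```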

