Bug 1595092 - Creating and deleting a large number of pods causes ovs errors
Summary: Creating and deleting a large number of pods causes ovs errors
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: jtanenba
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-26 06:05 UTC by sfu@redhat.com
Modified: 2019-03-26 18:18 UTC (History)
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-26 18:18:08 UTC
Target Upstream Version:


Attachments
Script and deployment config (deleted), 2018-07-16 23:43 UTC, Mike Fiedler

Description sfu@redhat.com 2018-06-26 06:05:33 UTC
Description of problem:
Creating and deleting a large number of pods causes OVS errors; OVS becomes unable to create and delete virtual ports.


Version-Release number of selected component (if applicable):
ocp 3.9.14
openvswitch-2.7.3-3.git20180112.el7fdp.x86_64

How reproducible:
always

Steps to Reproduce:
1. Create a project.
2. Create a tomcat app.
3. Scale the pod count up to 500, delete the pods immediately, then repeat the create/delete cycle; the error appears after a while (see the sketch below).
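For reference, the create/delete cycle above can be scripted roughly like this (the dc name "tomcat", the project name and the sleep intervals are placeholders, not taken from the actual test):

  # rough reproduction loop; adjust dc/project names and timings to the environment
  while true; do
      oc scale dc/tomcat --replicas=500 -n stress-test
      sleep 30
      oc scale dc/tomcat --replicas=0 -n stress-test
      sleep 30
  done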

Actual results:
Pods can no longer be created or deleted.

Expected results:
Pod creation and deletion continue to work normally.

Additional info:

Creating and deleting a large number of pods causes OVS errors; OVS becomes unable to create and delete virtual ports.

The customer is using OCP 3.9.14 for a stress test. The test steps are: use a dc to scale up to 500 pods, then delete them; after a while the error occurred. At first the problem looked like OCP being unable to delete OVS ports, but after that, creating ports started failing too. The problem seems to lie in the rapid creation and deletion.

OVS version: openvswitch-2.7.3-3.git20180112.el7fdp.x86_64


1. In ovs-vswitchd.log there are many failure-to-create messages:
2018-05-29T12:11:42.593Z|7139362|bridge|WARN|could not add network device veth50e17307 to ofproto (No such device)
2018-05-29T12:11:42.595Z|7139363|bridge|WARN|could not open network device veth497797fd (No such device)

2. OVS keeps retrying the open or add operation on the same port and never completes it:
 # cat  ovs-vswitchd.log |grep veth2d159069
2018-05-29T12:11:42.490Z|7139314|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:11:47.415Z|7141594|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:11:52.296Z|7143876|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:11:57.534Z|7146163|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:12:02.446Z|7148445|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:12:07.371Z|7150725|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:12:12.627Z|7153012|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:12:17.534Z|7155294|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:12:22.435Z|7157574|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:12:27.771Z|7159865|bridge|WARN|could not open network device veth2d159069 (No such device)

 # grep veth61060184 ovs-vswitchd.log
2018-05-29T12:11:45.489Z|7140717|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:11:50.426Z|7142999|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:11:55.293Z|7145279|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:00.536Z|7147566|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:05.473Z|7149848|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:10.378Z|7152128|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:15.643Z|7154419|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:20.561Z|7156697|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:25.493Z|7158979|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:30.784Z|7161268|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:35.682Z|7163550|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:40.581Z|7165830|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:45.833Z|7168119|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:50.735Z|7170401|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:55.640Z|7172681|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:00.903Z|7174970|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:05.783Z|7177250|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:10.669Z|7179530|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:15.916Z|7181819|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:20.849Z|7184099|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:25.733Z|7186381|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:30.955Z|7188668|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:35.849Z|7190950|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:40.737Z|7193232|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:45.994Z|7195519|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)

3. The above OVS problem causes the OpenShift CNI call to fail, which in turn breaks pod creation:

#cat messages |grep -i veth2d319e8e
May 29 19:31:07 node1 NetworkManager[1049]: <info>  [1527593467.0108] device (veth2d319e8e): carrier: link connected
May 29 19:31:07 node1 NetworkManager[1049]: <info>  [1527593467.0116] manager: (veth2d319e8e): new Veth device (/org/freedesktop/NetworkManager/Devices/10678)
May 29 19:34:03 node1 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br0 veth2d319e8e -- set Interface veth2d319e8e "external-ids=sandbox=\"c404ff9d3536bb791246e601f84dce09aa809f0c340acd61a6d7f2a6394b4125\",ip=\"10.130.0.18\""
May 29 19:34:08 node1 atomic-openshift-node: E0529 19:34:08.928553   17092 cni.go:259] Error adding network: CNI request failed with status 400: 'error on port veth2d319e8e: "could not open network device veth2d319e8e (No such device)"
May 29 19:34:08 node1 atomic-openshift-node: E0529 19:34:08.928579   17092 cni.go:227] Error while adding to cni network: CNI request failed with status 400: 'error on port veth2d319e8e: "could not open network device veth2d319e8e (No such device)"
May 29 19:34:08 node1 atomic-openshift-node: E0529 19:34:08.930176   17092 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "qperf-server-13-swlwr_yjfns" network: CNI request failed with status 400: 'error on port veth2d319e8
: "could not open network device veth2d319e8e (No such device)"
May 29 19:34:08 node1 atomic-openshift-node: E0529 19:34:08.930210   17092 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "qperf-server-13-swlwr_yjfns(6d67c53e-627b-11e8-b361-005056947c2e)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "qperf-server-13-swlwr_yjfns" network: CNI request failed with status 400: 'error on port veth2d319e8e: "could not open network device veth2d319e8e (No such device)"
May 29 19:34:08 node1 atomic-openshift-node: E0529 19:34:08.930224   17092 kuberuntime_manager.go:647] createPodSandbox for pod "qperf-server-13-swlwr_yjfns(6d67c53e-627b-11e8-b361-005056947c2e)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "qperf-server-13-swlwr_yjfns" network: CNI request failed with status 400: 'error on port veth2d319e8e: "could not open network device veth2d319e8e (No such device)"
May 29 19:34:08 node1 atomic-openshift-node: E0529 19:34:08.930274   17092 pod_workers.go:186] Error syncing pod 6d67c53e-627b-11e8-b361-005056947c2e ("qperf-server-13-swlwr_yjfns(6d67c53e-627b-11e8-b361-005056947c2e)"), skipping: failed to "CreatePodSandbox" for "qperf-server-13-swlwr_yjfns(6d67c53e-627b-11e8-b361-005056947c2e)" with CreatePodSandboxError: "CreatePodSandbox for pod \"qperf-server-13-swlwr_yjfns(6d67c53e-627b-11e8-b361-005056947c2e)\" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod \"qperf-server-13-swlwr_yjfns\" network: CNI request failed with status 400: 'error on port
veth2d319e8e: \"could not open network device veth2d319e8e (No such device)\"\n'"


4. All of the above issues disappear after rebooting the node.

Comment 6 Mike Fiedler 2018-07-16 23:41:45 UTC
I was able to reproduce the error messages in the description:  

|WARN|could not open network device veth389bfe8b (No such device)

|WARN|could not add network device veth90f0d4ed to ofproto (No such device)

NetworkPlugin cni failed to set up pod "deploymentconfig1-1-wvmnl_svt" network: CNI request failed with status 400: 'error on port vethd9ba5ea3: "could not open network device vethd9ba5ea3 (No such device)"

These messages are not the core problem - they are a result of deleting the dc before the pods/containers are fully initialized. As long as the ports are cleaned up, things are operating correctly.

However, I was not able to reproduce the reported condition where ovs failed to delete the container interface ports.   In my test, each time the dc was deleted, all ports were removed.   I will attach the script and dc used to test this.

OCP 3.9.33  with openvswitch 2.8 and 2.7

openvswitch.x86_64     1:2.8.0-4.el7fdb 
openvswitch.x86_64     1:2.7.0-8.git20170530.el7fdb

Going to let this run overnight, but so far, the condition where ports are not removed has not been reproduced.

Comment 7 Mike Fiedler 2018-07-16 23:43:16 UTC
Created attachment 1459283 [details]
Script and deployment config

Comment 8 Mike Fiedler 2018-07-17 16:51:31 UTC
The test has been running for about 20 hours now without seeing ports left in a bad state in OVS - they continue to be cleaned up with each deletion of the dc.   I do still see the error messages in the description and in comment 6, but they are not affecting cluster operation.
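A quick way to confirm that cleanup keeps up between iterations (br0 is the OpenShift SDN bridge; the check itself is just illustrative):

  # watch the number of veth ports OVS still has on the bridge; it should
  # drop back toward zero shortly after each dc deletion
  watch -n 10 'ovs-vsctl list-ports br0 | grep -c veth'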

Comment 10 chen zhen 2018-07-18 03:15:12 UTC
(In reply to Mike Fiedler from comment #8)
> The test has been running for about 20 hours now without seeing ports left
> in a bad state in OVS - they continue to be cleaned up with each deletion of
> the dc.   I do still see the error messages in the description and in
> comment 6, but they are not affecting cluster operation.

Thanks for taking the time to reproduce this. If you look at the sosreport from the customer's production server:

sosreport-node1.szcsmanager.cmbchina.net

ovs-vsctl show reports 2273 ports on the bridge; most of them are in an error state like the two below, because the containers are not running anymore.

----------
        Port "vethb66c0018"
            Interface "vethb66c0018"
                error: "could not add network device vethb66c0018 to ofproto (No such device)"
        Port "vethfd31e168"
            Interface "vethfd31e168"
                error: "could not open network device vethfd31e168 (No such device)"

--------

It looks like ovsdb performance might not be sufficient, but I'm not sure about that.


I found that CNI is using the commands below to create / delete OVS ports.

Jul 17 22:40:39 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br0 veth75304f6a -- set Interface veth75304f6a "external-ids=sandbox=\"5b984c16b456ac76698dc93566b8a84ff053d8646467b848551c3f67ccc1953e\",ip=\"10.128.0.34\""
Jul 17 22:40:39 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists clear port veth75304f6a qos
Jul 17 22:40:39 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists destroy qos
Jul 17 22:40:40 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br0 veth3c30dd21 -- set Interface veth3c30dd21 "external-ids=sandbox=\"25c5d987c5f7ec0e44546723b2bc5d176b983a45b0c32427a916c9770e335e60\",ip=\"10.128.0.35\""
Jul 17 22:40:40 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists clear port veth3c30dd21 qos
Jul 17 22:40:40 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists destroy qos
Jul 17 22:40:43 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br0 veth00c05ab7 -- set Interface veth00c05ab7 "external-ids=sandbox=\"c7a3dcd25910075bafb0e572e5fff353dfa45a28432942631cd2f9a633907e1f\",ip=\"10.128.0.36\""
Jul 17 22:40:43 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists clear port veth00c05ab7 qos
Jul 17 22:40:43 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists destroy qos
Jul 17 22:40:54 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br0 veth847a495b -- set Interface veth847a495b "external-ids=sandbox=\"38a310de0687703edbfd667e08a6b55e5a0b9af363fa90149fc6f17899b1a924\",ip=\"10.128.0.37\""
Jul 17 22:40:54 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists clear port veth847a495b qos
Jul 17 22:40:54 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists destroy qos
Jul 17 22:50:41 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists clear port veth75304f6a qos
Jul 17 22:50:41 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists destroy qos
Jul 17 22:50:41 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port br0 veth75304f6a
Jul 17 22:50:42 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists clear port veth3c30dd21 qos
Jul 17 22:50:42 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists destroy qos
Jul 17 22:50:42 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port br0 veth3c30dd21
Jul 17 22:50:43 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists clear port veth00c05ab7 qos
Jul 17 22:50:43 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists destroy qos
Jul 17 22:50:43 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port br0 veth00c05ab7

I'm trying to write a script that fires these commands directly at the vswitch to see what happens.
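Something along these lines should replay the same add/clear/del sequence seen in the journal above (the port names, sandbox ids and iteration count are made up for the test, and the external-ids setting is simplified compared to what CNI passes):

  #!/bin/bash
  # hammer the vswitch with the CNI-style add/clear/del sequence
  for i in $(seq 1 500); do
      port="testveth$i"
      ovs-vsctl --may-exist add-port br0 "$port" -- \
          set Interface "$port" external-ids:sandbox="test-$i"
      ovs-vsctl --if-exists clear port "$port" qos
      ovs-vsctl --if-exists del-port br0 "$port"
  done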

Comment 11 Dan Winship 2018-07-30 18:09:27 UTC
OK, I think there may be an actual OVS problem here (as opposed to an OpenShift's-use-of-OVS problem).

For veth2d159069, in the OpenShift logs we see:

May 29 16:41:14 node1... atomic-openshift-node[17092]: I0529 16:41:14.829786   17092 kubelet.go:1837] SyncLoop (ADD, "api"): "csdemo-7fdf8fdb64-748jr_yxw-prod(0c14beb3-631c-11e8-b361-005056947c2e)"
May 29 16:41:15 node1... atomic-openshift-node[17092]: I0529 16:41:15.398903   17092 kubelet.go:1853] SyncLoop (DELETE, "api"): "csdemo-7fdf8fdb64-748jr_yxw-prod(0c14beb3-631c-11e8-b361-005056947c2e)"
May 29 16:41:15 node1.... ovs-vsctl[90114]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br0 veth2d159069 -- set Interface veth2d159069 "external-ids=sandbox=\"ccad598ff4218127ef969954c848e263cdc19d00d26d52414e72c71281007e5f\",ip=\"10.130.0.99\""
May 29 16:41:15 node1... atomic-openshift-node[17092]: I0529 16:41:15.948196   17092 kubelet_pods.go:1118] Killing unwanted pod "csdemo-7fdf8fdb64-748jr"
May 29 16:41:25 node1... atomic-openshift-node[17092]: E0529 16:41:25.586803   17092 cni.go:259] Error adding network: CNI request failed with status 400: 'error on port veth2d159069: "could not open network device veth2d159069 (No such device)"

So, kubernetes starts creating the pod, then decides it needs to destroy it instead, and starts destroying it while it's in the middle of creating it, because kubernetes is like that. The "ovs-vsctl add-port" call fails because by the time it reaches OVS, the veth device has already been destroyed. Because "ovs-vsctl add-port" failed, OpenShift does not call "ovs-vsctl del-port" or do any other sort of cleanup on the OVS side.
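The sequence can be imitated by hand (throwaway veth name; assumes a bridge br0 already exists):

  # create a veth pair, then delete it before OVS gets to it -- mimicking the teardown race
  ip link add vethtest type veth peer name vethtestpeer
  ip link del vethtest
  # this add-port reports "No such device", but the Port/Interface records stay in ovsdb
  ovs-vsctl --may-exist add-port br0 vethtest
  ovs-vsctl show     # vethtest shows up on br0 with an error on its Interface
  # the cleanup the caller is expected to do when add-port fails
  ovs-vsctl --if-exists del-port br0 vethtest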

But the "ovs-vsctl show" output shows that interface as being attached to the bridge (in an error state), and ovs-vswitchd is still spewing errors about that device to ovs-vswitchd.log multiple times a second when the sosreport is generated. (Note that ovs-vswitchd.log and the journal output seem to be in different timezones.)


Likewise for veth61060184 which has a different error ("could not add network device veth61060184 to ofproto (No such device)"). Kubernetes started creating the pod and then started deleting it while it was still being created. It looks like the time between creation and deletion was a bit longer in that case, which may explain the different error. In any case, from OpenShift's perspective, it's still the same: ovs-vsctl returned an error, so we don't do any cleanup, and we expect OVS to have forgotten about the interface, but apparently it hasn't. (Are we wrong?)


(For veth2d319e8e, that is completely unrelated and isn't really an OVS issue at all; the problem there is that docker got restarted for some reason, which kills all of the containers on the node, and in the process of ensuring that everything is cleaned up correctly, kubernetes attempts to delete a network interface that the kernel already deleted. But then it figures things out and successfully recreates the pod; there's no persistent problem in that case.)

Comment 12 Flavio Leitner 2018-08-15 23:38:46 UTC
Please attach the output of 'ovsdb-tool -m show-log' after you have reproduced the issue. That output will tell us what OVS was instructed to do, how, and when.

Also note that ovs-vsctl by default waits for vswitchd to reconfigure before returning, so OpenShift (if it is using the ovs-vsctl tool) would be executing synchronous operations. However, if OpenShift is passing --no-wait, then the command returns right after the database is updated, not when the operation has actually completed.
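To illustrate the difference (names are arbitrary; assumes a bridge br0 exists and the device does not):

  # default: blocks until ovs-vswitchd has tried to apply the change,
  # so the "could not open network device" error is reported to the caller
  ovs-vsctl add-port br0 vethmissing
  # --no-wait: returns as soon as the database commit succeeds, so the caller
  # sees success even though vswitchd has not (and here cannot) set up the port
  ovs-vsctl --no-wait add-port br0 vethmissing2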

fbl

Comment 13 Flavio Leitner 2018-08-15 23:59:23 UTC
BTW, regarding the OVS operation itself, ovs-vsctl's task is to configure the database. It has no knowledge of the interfaces in the system.

It does echo ovs-vswitchd error reports when --no-wait is not used, but it is up to the caller to fix the database or do whatever is necessary.

See this example:

[root@plex ~]# ovs-vsctl add-port ovsbr0 foo 
ovs-vsctl: Error detected while setting up 'foo': could not open network device foo (No such device).  See ovs-vswitchd log for details.
ovs-vsctl: The default log directory is "/var/log/openvswitch".

[root@plex ~]# ovs-vsctl show 
36a1b8c2-9512-4f38-a9b1-2328ea8832e2
    Bridge "ovsbr0"
        Port foo
            Interface foo
                error: "could not open network device foo (No such device)"
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal

[root@plex ~]# ip link add foo type veth peer name veth10

[root@plex ~]# ovs-vsctl show 
36a1b8c2-9512-4f38-a9b1-2328ea8832e2
    Bridge "ovsbr0"
        Port foo
            Interface foo
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal

fbl

Comment 14 Dan Winship 2018-08-16 13:41:29 UTC
(In reply to Flavio Leitner from comment #13)
> It does echo ovs-vswitchd error reports when --no-wait is not used, but it
> is up to the caller to fix the database or do whatever is necessary.

Huh... OK, I can see where being able to add the interface to the bridge before it exists might be useful, but it's pretty confusing that it returns an error despite having "succeeded".

So I guess we need to fix that part in OpenShift, but there still seems to be some problem with OVS getting stuck in some sort of loop constantly re-logging errors about them.

Comment 15 Dan Winship 2018-08-16 13:51:16 UTC
(In reply to Dan Winship from comment #14)
> So I guess we need to fix that part in OpenShift

https://github.com/openshift/origin/pull/20670

Comment 16 Aaron Conole 2018-08-16 13:59:28 UTC
OVS allows you to add ports to the database that may only become active in the future. The vswitchd will complain as long as the underlying devices do not exist. Perhaps you mean that this error should be rate-limited? Regardless, there isn't anything wrong with OvS here - it allows those non-existent ports to exist in the database because:

1. They could disappear at any time and the software needs to handle that
2. The initialization order could be such that ovs is up before the networking is up (although, that shouldn't be the case on rhel).

The issue is that an 'add' was not followed up with a 'delete' - and that means the port will live on in the configuration forever.  There might be ways we can change the software to mitigate this kind of behaviour (for instance, maybe a task that deletes non-existent ports) and/or make the log a bit less spew-y.

As an example, here's a quick and dirty python script that will clean up non-existent interfaces.  I haven't really tested it well.

---- 8< ----
import netifaces
import sys
import time
from ovs.db import idl
from ovs import jsonrpc
from ovs.poller import Poller
from ovs.stream import Stream


class OVSDBException(Exception):
    pass


class OVSDB(object):
    @staticmethod
    def wait_for_db_change(idl):
        seq = idl.change_seqno
        stop = time.time() + 10
        while idl.change_seqno == seq and not idl.run():
            poller = Poller()
            idl.wait(poller)
            poller.block()
            if time.time() >= stop:
                raise Exception('Retry Timeout')

    def __init__(self, db_sock):
        self._db_sock = db_sock
        self._txn = None
        schema = self._get_schema()
        schema.register_all()
        self._idl_conn = idl.Idl(db_sock, schema)
        OVSDB.wait_for_db_change(self._idl_conn)  # Initial Sync with DB

    def _get_schema(self):
        error, strm = Stream.open_block(Stream.open(self._db_sock))
        if error:
            raise Exception("Unable to connect to %s" % self._db_sock)
        rpc = jsonrpc.Connection(strm)
        req = jsonrpc.Message.create_request('get_schema', ['Open_vSwitch'])
        error, resp = rpc.transact_block(req)
        rpc.close()

        if error or resp.error:
            raise Exception('Unable to retrieve schema.')
        return idl.SchemaHelper(None, resp.result)

    def get_table(self, table_name):
        return self._idl_conn.tables[table_name]

    def _start_txn(self):
        if self._txn is not None:
            raise OVSDBException("ERROR: A transaction was started already")
        self._idl_conn.change_seqno += 1
        self._txn = idl.Transaction(self._idl_conn)
        return self._txn

    def _complete_txn(self, try_again_fn):
        if self._txn is None:
            raise OVSDBException("ERROR: Not in a transaction")
        status = self._txn.commit_block()
        if status is idl.Transaction.TRY_AGAIN:
            if self._idl_conn._session.rpc.status != 0:
                self._idl_conn.force_reconnect()
                OVSDB.wait_for_db_change(self._idl_conn)
            return try_again_fn(self)
        elif status is idl.Transaction.ERROR:
            return False

    def _find_row(self, table_name, find):
        return next(
            (row for row in self.get_table(table_name).rows.values()
             if find(row)), None)

    def _find_row_by_name(self, table_name, value):
        return self._find_row(table_name, lambda row: row.name == value)

    def Ports(self):
        return self.get_table('Port').rows.values()

    def Bridges(self):
        return self.get_table('Bridge').rows.values()

    def destroy_port(self, port_name):
        txn = self._start_txn()
        br = None
        for br_row in self.Bridges():
            for port in br_row.ports:
                if port.name == port_name:
                    br = br_row

        if br is None:
            print("unable to find bridge for port '%s'" % port_name)
            # abort the transaction started above so a later destroy_port()
            # call does not trip over the leftover transaction object
            self._txn.abort()
            self._txn = None
            return

        ports = [port for port in br.ports if port.name != port_name]
        br.ports = ports

        def try_again(db_entity):
            db_entity.destroy_port(port_name)

        txn.add_comment("ovs-purge: destroy_port=%s" % (port_name))
        status = self._complete_txn(try_again)
        if status is False:
            raise OVSDBException('unable to delete Port %s: %s' %
                                 (port_name, txn.get_error()))
        self._txn = None


def argv_tuples(lst):
    cur, nxt = iter(lst), iter(lst)
    next(nxt, None)

    try:
        while True:
            yield next(cur), next(nxt, None)
    except StopIteration:
        pass


def main():

    db_sock = 'unix:/var/run/openvswitch/db.sock'

    for cur, nxt in argv_tuples(sys.argv[1:]):
        if cur in ['--db-sock']:
            db_sock = nxt

    ovsdb = OVSDB(db_sock)
    ifs = netifaces.interfaces()
    print(ifs)
    for port in ovsdb.Ports():
        if port.name not in ifs:
            print("Found bad port '%s'" % port.name)
            ovsdb.destroy_port(port.name)


if __name__ == '__main__':
    main()
---- >8 ----
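Assuming the snippet is saved as ovs-purge.py (the name is arbitrary) and the netifaces module plus the OVS Python bindings are installed, it could be run on the affected node like this:

  # --db-sock shown with its default value, just to make the option explicit
  python ovs-purge.py --db-sock unix:/var/run/openvswitch/db.sock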

Does it make sense?

Comment 17 Dan Winship 2018-08-16 14:02:59 UTC
(In reply to Aaron Conole from comment #16)
> OVS allows you to add ports that may become active in the future to the
> database.  The vswitchd will complain that they exist.  Perhaps you mean
> that this error should be ratelimited?  Regardless there isn't anything
> wrong with OvS here

To clarify: it is apparently spewing errors about every non-existent port, multiple times a second, apparently forever (see the ovs-vswitchd.log greps in the initial comment)

Comment 18 Flavio Leitner 2018-08-16 14:39:18 UTC
(In reply to Dan Winship from comment #17)
> To clarify: it is apparently spewing errors about every non-existent port,
> multiple times a second, apparently forever (see the ovs-vswitchd.log greps
> in the initial comment)

Because each time vswitchd has to reconfigure itself, it will try to rebuild its state according to what is in the database. If the problematic interface is still there, the error will be logged again.
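A quick way to see which interface records are carrying such an error (same error text as in the ovs-vswitchd.log excerpts earlier in this bug):

  # dump the name and error columns for every interface known to OVS; stuck
  # entries carry the "could not open/add network device ... (No such device)" text
  ovs-vsctl --columns=name,error list Interface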

fbl

Comment 20 zhaozhanqi 2018-09-12 10:30:12 UTC
I tested this bug on v3.11.0-0.32.0; the issue can still be reproduced.

Steps:

1. Create an rc with this json:
oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/list_for_pods.json
  oc scale rc test-rc --replicas=100

2. SSH into the node where the pods are scheduled and repeatedly delete the docker containers by hand:
  while true ; do docker ps | grep test-rc | awk '{ print $1 }' | xargs docker rm -f ; done

3. After running for about 10 minutes, check the OVS ports against the pod containers; they no longer match:
  sh-4.2# ovs-vsctl list-ports br0  | grep veth | wc -l
28

# ip a | grep veth | awk '{ print $2}' | awk -F@ '{ print $1}' | sort | wc -l
23

4. Check the log on the node:

  cat /var/log/messages | grep "veth57be0e6f"
Sep 12 05:56:57 ip-172-18-8-246 NetworkManager[5100]: <info>  [1536746217.8748] device (veth57be0e6f): carrier: link connected
Sep 12 05:56:57 ip-172-18-8-246 NetworkManager[5100]: <info>  [1536746217.8760] manager: (veth57be0e6f): new Veth device (/org/freedesktop/NetworkManager/Devices/188)
Sep 12 05:57:03 ip-172-18-8-246 atomic-openshift-node: E0912 05:57:03.425374    8149 cni.go:260] Error adding network: CNI request failed with status 400: 'error on port veth57be0e6f: "could not open network device veth57be0e6f (No such device)"
Sep 12 05:57:03 ip-172-18-8-246 atomic-openshift-node: E0912 05:57:03.425415    8149 cni.go:228] Error while adding to cni network: CNI request failed with status 400: 'error on port veth57be0e6f: "could not open network device veth57be0e6f (No such device)"
Sep 12 05:57:05 ip-172-18-8-246 atomic-openshift-node: E0912 05:57:05.579413    8149 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = [failed to set up sandbox container "69de5646d393d0715a425e54dd50f939fc002e8fecf92944a33ae6803caf052f" network for pod "test-rc-v6c69": NetworkPlugin cni failed to set up pod "test-rc-v6c69_z1" network: CNI request failed with status 400: 'error on port veth57be0e6f: "could not open network device veth57be0e6f (No such device)"
Sep 12 05:57:05 ip-172-18-8-246 atomic-openshift-node: E0912 05:57:05.579470    8149 kuberuntime_sandbox.go:56] CreatePodSandbox for pod "test-rc-v6c69_z1(061b9cc1-b671-11e8-b93b-0ea178bb84b2)" failed: rpc error: code = Unknown desc = [failed to set up sandbox container "69de5646d393d0715a425e54dd50f939fc002e8fecf92944a33ae6803caf052f" network for pod "test-rc-v6c69": NetworkPlugin cni failed to set up pod "test-rc-v6c69_z1" network: CNI request failed with status 400: 'error on port veth57be0e6f: "could not open network device veth57be0e6f (No such device)"
Sep 12 05:57:05 ip-172-18-8-246 atomic-openshift-node: E0912 05:57:05.579498    8149 kuberuntime_manager.go:646] createPodSandbox for pod "test-rc-v6c69_z1(061b9cc1-b671-11e8-b93b-0ea178bb84b2)" failed: rpc error: code = Unknown desc = [failed to set up sandbox container "69de5646d393d0715a425e54dd50f939fc002e8fecf92944a33ae6803caf052f" network for pod "test-rc-v6c69": NetworkPlugin cni failed to set up pod "test-rc-v6c69_z1" network: CNI request failed with status 400: 'error on port veth57be0e6f: "could not open network device veth57be0e6f (No such device)"
Sep 12 05:57:05 ip-172-18-8-246 atomic-openshift-node: E0912 05:57:05.579589    8149 pod_workers.go:186] Error syncing pod 061b9cc1-b671-11e8-b93b-0ea178bb84b2 ("test-rc-v6c69_z1(061b9cc1-b671-11e8-b93b-0ea178bb84b2)"), skipping: failed to "CreatePodSandbox" for "test-rc-v6c69_z1(061b9cc1-b671-11e8-b93b-0ea178bb84b2)" with CreatePodSandboxError: "CreatePodSandbox for pod \"test-rc-v6c69_z1(061b9cc1-b671-11e8-b93b-0ea178bb84b2)\" failed: rpc error: code = Unknown desc = [failed to set up sandbox container \"69de5646d393d0715a425e54dd50f939fc002e8fecf92944a33ae6803caf052f\" network for pod \"test-rc-v6c69\": NetworkPlugin cni failed to set up pod \"test-rc-v6c69_z1\" network: CNI request failed with status 400: 'error on port veth57be0e6f: \"could not open network device veth57be0e6f (No such device)\"\n', failed to stop sandbox container \"69de5646d393d0715a425e54dd50f939fc002e8fecf92944a33ae6803caf052f\" for pod \"test-rc-v6c69\": Error response from daemon: No such container: 69de5646d393d0715a425e54dd50f939fc002e8fecf92944a33ae6803caf052f]"
Sep 12 05:57:05 ip-172-18-8-246 atomic-openshift-node: I0912 05:57:05.579936    8149 server.go:470] Event(v1.ObjectReference{Kind:"Pod", Namespace:"z1", Name:"test-rc-v6c69", UID:"061b9cc1-b671-11e8-b93b-0ea178bb84b2", APIVersion:"v1", ResourceVersion:"91472", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "69de5646d393d0715a425e54dd50f939fc002e8fecf92944a33ae6803caf052f" network for pod "test-rc-v6c69": NetworkPlugin cni failed to set up pod "test-rc-v6c69_z1" network: CNI request failed with status 400: 'error on port veth57be0e6f: "could not open network device veth57be0e6f (No such device)"

Comment 21 Dan Winship 2018-09-12 11:40:57 UTC
(In reply to zhaozhanqi from comment #20)
> Sep 12 05:57:03 ip-172-18-8-246 atomic-openshift-node: E0912 05:57:03.425374
> 8149 cni.go:260] Error adding network: CNI request failed with status 400:
> 'error on port veth57be0e6f: "could not open network device veth57be0e6f (No
> such device)"

The patch does not get rid of these errors, it just tries to ensure that if this error happens, then the veth device mentioned gets removed from the OVS bridge properly afterward.

So it should be getting rid of the errors about veths in "ovs-vsctl show", but this suggests it may not be doing that entirely:

> 3. after running 10 mins and check the ovs port and pod container did not
> match 
>   sh-4.2# ovs-vsctl list-ports br0  | grep veth | wc -l
> 28
> 
> # ip a | grep veth | awk '{ print $2}' | awk -F@ '{ print $1}' | sort | wc -l
> 23
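Listing the names rather than just counting them shows exactly which ports are the leftovers; combining the two commands from comment 20 with comm is one way to do it (bash is needed for the process substitution):

  # ports present in OVS but with no matching kernel veth device
  comm -23 \
      <(ovs-vsctl list-ports br0 | grep veth | sort) \
      <(ip a | grep veth | awk '{print $2}' | awk -F@ '{print $1}' | sort)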

