
Bug 1514627

Summary: Builds fail with "Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock)"
Product: OpenShift Container Platform Reporter: Steven Walter <stwalter>
Component: Networking    Assignee: Dan Williams <dcbw>
Status: CLOSED CURRENTRELEASE QA Contact: Meng Bo <bmeng>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.4.1    CC: aos-bugs, bbennett, clichybi, dcbw, dzhukous, knakai, misalunk, stwalter, vwalek, weliang
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-12-07 14:00:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Steven Walter 2017-11-17 23:39:43 UTC
Description of problem:
Builds fail because iptables-restore cannot run while another process holds the xtables lock.

This has been a known issue in the past (see "Additional info"); however, even after applying the errata, the issue still occurs. Although it is expected for containers to be blocked for some time, builds should eventually finish: the iptables wait flag should allow us to wait until the lock is available rather than exiting with an error.
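To illustrate the wait behavior referred to above, a minimal sketch (the rules file path is illustrative; this assumes an iptables build in which iptables-restore supports the wait option, which is what the referenced errata were meant to provide):

```shell
# Without --wait, a concurrent iptables invocation makes iptables-restore
# fail immediately with exit status 4 ("Another app is currently holding
# the xtables lock").
iptables-restore < /tmp/rules.txt

# With --wait, iptables-restore blocks until the xtables lock is released
# instead of exiting, which is the behavior expected of the node here.
iptables-restore --wait < /tmp/rules.txt
```

This must run as root on a live node; it is shown only to clarify the difference between the failing and expected invocations.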

Version-Release number of selected component (if applicable):
iptables-1.4.21-18.el7.x86_64
atomic-openshift-node-3.4.1.44.26-1.git.0.a62e88b.el7.x86_64
kernel-3.10.0-693.2.2.el7.x86_64
Red Hat Enterprise Linux Server release 7.4 (Maipo)

How reproducible:
Unconfirmed

Steps to Reproduce:
1. Kick off many builds

Actual results:
Nov  9 11:36:30 njrarltapp001c7 atomic-openshift-node: E1109 11:36:30.685452   25053 cni.go:273] Error deleting network: CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Nov  9 11:36:30 njrarltapp001c7 atomic-openshift-node: E1109 11:36:30.685515   25053 docker_manager.go:1434] Failed to teardown network for pod "84aeeea2-c565-11e7-8f20-005056a97aae" using network plugins "cni": CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Nov  9 11:36:32 njrarltapp001c7 atomic-openshift-node: E1109 11:36:32.850516   25053 kubelet.go:2092] Failed killing the pod "pcis-integration-1-hz0zp": failed to "TeardownNetwork" for "pcis-integration-1-hz0zp_cipe-c2811c-1" with TeardownNetworkError: "Failed to teardown network for pod \"84aeeea2-c565-11e7-8f20-005056a97aae\" using network plugins \"cni\": CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?\n)\n'"
Nov  9 11:36:35 njrarltapp001c7 atomic-openshift-node: E1109 11:36:34.955050   25053 cni.go:273] Error deleting network: CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?


Expected results:
Builds succeed

Additional info:
The customer applied the erratum that fixed: https://bugzilla.redhat.com/show_bug.cgi?id=1484133
Which is related to: https://bugzilla.redhat.com/show_bug.cgi?id=1438597

Comment 2 Vladislav Walek 2017-11-27 11:54:03 UTC
Hello,

a customer is hitting the same issue. Because of this error, iptables is not updated with the correct endpoint IP.
As a result the service is not reachable; production is down.

Issue happening on hawkular-cassandra.

Comment 20 Miheer Salunke 2017-11-30 14:13:50 UTC
Hi Dan,

I think collecting goroutine dumps from the OpenShift node could help us understand the issue, right?

For example:
Set the OpenShift node's log level to debug and enable the profiler by adding or editing these lines in /etc/sysconfig/atomic-openshift-node:


OPTIONS='--loglevel=8'
OPENSHIFT_PROFILE=web

Then restart the node with "systemctl restart atomic-openshift-node".
Let it run long enough that you would expect the issue to reproduce, and then run

curl http://localhost:6060/debug/pprof/goroutine?debug=2  

and attach the goroutine dump here.
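Alongside the goroutine dump, it may also help to capture which process is holding the xtables lock at the moment the failures occur. A hedged sketch (the lock file path is the iptables default on RHEL 7; run as root on the affected node while the errors are being logged):

```shell
# List processes that currently have the xtables lock file open
lsof /run/xtables.lock

# Snapshot any iptables invocations in flight at the same moment
ps aux | grep -E '[i]ptables(-restore|-save)?'
```

Correlating the holder of the lock with the timestamps in the node log should show whether the contention comes from the node process itself or from another iptables user on the host.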


Thanks and regards,
Miheer

Comment 26 Ben Bennett 2017-12-07 14:00:39 UTC
This appears to be resolved by the updated kernel, the changed iptables wait times, and the reduction in the frequency with which we call iptables on pod creation.