Bug 1514627 - Builds fail with Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Dan Williams
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-17 23:39 UTC by Steven Walter
Modified: 2017-12-07 14:00 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-07 14:00:39 UTC


Attachments: none

Description Steven Walter 2017-11-17 23:39:43 UTC
Description of problem:
Builds fail because iptables-restore cannot run while another process holds the xtables lock.

This has been a known issue in the past (see "additional info"); however, even after applying the errata, the issue still occurs. Although containers are expected to be blocked for some time, the builds should eventually finish: the iptables wait flag should make the call wait until the lock is available rather than exit.
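For illustration, the contention can be reproduced by hand. This is a hypothetical sketch, not from the report: it assumes root on the node, that the lock is the flock on /run/xtables.lock (as used by iptables 1.4.21 and later), and an iptables-restore build that supports -w/--wait:

# Hold the xtables lock in the background for 30 seconds:
flock /run/xtables.lock sleep 30 &

# Without a wait option, iptables-restore bails out with exit status 4:
iptables-save > /tmp/rules.txt
iptables-restore < /tmp/rules.txt; echo "exit status: $?"

# With -w (where supported), it waits for the lock instead of failing:
iptables-restore -w 5 < /tmp/rules.txt; echo "exit status: $?"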

Version-Release number of selected component (if applicable):
iptables-1.4.21-18.el7.x86_64
atomic-openshift-node-3.4.1.44.26-1.git.0.a62e88b.el7.x86_64
kernel-3.10.0-693.2.2.el7.x86_64
Red Hat Enterprise Linux Server release 7.4 (Maipo)

How reproducible:
Unconfirmed

Steps to Reproduce:
1. Kick off many builds

Actual results:
Nov  9 11:36:30 njrarltapp001c7 atomic-openshift-node: E1109 11:36:30.685452   25053 cni.go:273] Error deleting network: CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Nov  9 11:36:30 njrarltapp001c7 atomic-openshift-node: E1109 11:36:30.685515   25053 docker_manager.go:1434] Failed to teardown network for pod "84aeeea2-c565-11e7-8f20-005056a97aae" using network plugins "cni": CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Nov  9 11:36:32 njrarltapp001c7 atomic-openshift-node: E1109 11:36:32.850516   25053 kubelet.go:2092] Failed killing the pod "pcis-integration-1-hz0zp": failed to "TeardownNetwork" for "pcis-integration-1-hz0zp_cipe-c2811c-1" with TeardownNetworkError: "Failed to teardown network for pod \"84aeeea2-c565-11e7-8f20-005056a97aae\" using network plugins \"cni\": CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?\n)\n'"
Nov  9 11:36:35 njrarltapp001c7 atomic-openshift-node: E1109 11:36:34.955050   25053 cni.go:273] Error deleting network: CNI request failed with status 400: 'Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?


Expected results:
Builds succeed

Additional info:
Customer implemented the errata that fixed https://bugzilla.redhat.com/show_bug.cgi?id=1484133,
which is related to https://bugzilla.redhat.com/show_bug.cgi?id=1438597
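Since the xtables lock is a plain flock on /run/xtables.lock, the process holding it when the error fires can be identified on the node. A diagnostic sketch, not part of the original report:

# Show any process that currently has the lock file open:
lsof /run/xtables.lock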

Comment 2 Vladislav Walek 2017-11-27 11:54:03 UTC
Hello,

customer is hitting the same issue. Because of this error, iptables is not updated with the correct endpoint IP, so the service is not reachable: production is down.

The issue is occurring on hawkular-cassandra.

Comment 20 Miheer Salunke 2017-11-30 14:13:50 UTC
Hi Dan,

I think collecting goroutine dumps from the OpenShift node could help us understand the issue, right?

For example: set the node's log level to debug and enable the profiler by adding or editing these lines in /etc/sysconfig/atomic-openshift-node:

OPTIONS='--loglevel=8'
OPENSHIFT_PROFILE=web

then restart the node with "systemctl restart atomic-openshift-node". Let it run long enough that you'd expect the issue to have reproduced, then run

curl http://localhost:6060/debug/pprof/goroutine?debug=2

and attach the goroutine dump here.
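If several samples are useful, the dump can be captured to timestamped files instead; a small convenience sketch around the same endpoint (the :6060 pprof listener is the one OPENSHIFT_PROFILE=web enables):

# Take three goroutine dumps, one minute apart:
for i in 1 2 3; do
    curl -s "http://localhost:6060/debug/pprof/goroutine?debug=2" \
        > goroutines-$(date +%Y%m%d-%H%M%S).txt
    sleep 60
done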


Thanks and regards,
Miheer

Comment 26 Ben Bennett 2017-12-07 14:00:39 UTC
This appears to be resolved by the updated kernel, the changed iptables wait times, and the reduction in the frequency with which we call iptables on pod creation.
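For anyone tuning this on a 3.x node: the periodic iptables resync interval (related to, though not the same as, the per-pod-creation calls mentioned above) is set by iptablesSyncPeriod in node-config.yaml. A quick check, with 30s purely as an example value:

grep iptablesSyncPeriod /etc/origin/node/node-config.yaml
# e.g. iptablesSyncPeriod: "30s"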

