Bug 1362519 - Too many open files on a node
Summary: Too many open files on a node
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Seth Jennings
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-02 12:24 UTC by Miheer Salunke
Modified: 2016-10-26 20:01 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-26 20:01:31 UTC


Attachments

Description Miheer Salunke 2016-08-02 12:24:19 UTC
Description of problem:

It seems that we have an OpenShift node which becomes unstable (we don't see a direct problem, but a lot of error logs) like this:

May 17 11:27:06 hostname atomic-openshift-node: E0517 11:27:06.061459    2915 manager.go:212] Registration of the raw container factory failed: inotify_init: too many open files
or
[root@hostname ~]# systemctl restart docker
Error: Too many open files
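
For reference, here is roughly how we check how close the node is to its limits (a rough sketch; the exact thresholds depend on the host configuration):

[root@hostname ~]# cat /proc/sys/fs/file-nr
[root@hostname ~]# sysctl fs.file-max fs.inotify.max_user_instances fs.inotify.max_user_watches

The first command prints allocated / unused / maximum system-wide file handles; the "inotify_init: too many open files" message usually points at the fs.inotify.max_user_instances limit rather than only the global handle count.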

When we count all open file handles we got this result:

[root@hostname ~]# lsof | wc -l
762521

If we check another node we also get a high value (roughly 250,000, but not as many as on this node).
I have attached a sosreport of the node and also the lsof listing output so you can look into it in more detail.
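
In case it is useful, a rough way to see which processes hold the most file descriptors (note that plain lsof also lists memory-mapped files and libraries, so its total is higher than the raw fd count):

[root@hostname ~]# for p in /proc/[0-9]*; do echo "$(ls $p/fd 2>/dev/null | wc -l) $(cat $p/comm 2>/dev/null) $p"; done | sort -rn | head

That should show whether the descriptors are concentrated in the openshift node process, docker, or spread across the containers.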

I suppose that a reboot will probably resolve the problem, but here we want:
1) a solution to fix this problem
2) to understand the problem and take measures to avoid it in the future

Version-Release number of selected component (if applicable):
OpenShift Enterprise 3.1.1

How reproducible:
Always, on the customer's test environment

Steps to Reproduce:
1. Mentioned in the description above.

Actual results:
The OpenShift node becomes unstable.

Expected results:
The OpenShift node should not become unstable.

Additional info:

Comment 3 Andy Goldstein 2016-08-02 14:31:05 UTC
Could we please get the output of `lsof -a -p <pid of openshift>`?

Comment 6 Andy Goldstein 2016-08-03 13:42:58 UTC
Could we please get both the full `lsof` output as well as the `lsof -a -p <openshift pid>`? It will be helpful to have both data points taken at approximately the same time. Thanks.
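
Something along these lines should capture both at roughly the same moment (a sketch; it assumes the node runs under the atomic-openshift-node systemd unit, and the output paths are just examples):

[root@hostname ~]# pid=$(systemctl show -p MainPID atomic-openshift-node | cut -d= -f2)
[root@hostname ~]# lsof > /tmp/lsof-full.out; lsof -a -p "$pid" > /tmp/lsof-node.out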

Comment 13 Seth Jennings 2016-08-09 16:16:46 UTC
The pod density on the nodes seems high, around 150-200 pods per node.  The kubelet is making a LOT of connections to the pods, and those sockets don't seem to be closing properly.

Are the pods being deployed with liveness probes?  We need to figure out why the kubelet is making so many connections to the pods and why those sockets aren't closing properly (a bad liveness endpoint implementation in the app?).
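
One quick way to check for the leak pattern is to count the node process's TCP sockets by state; a large pile of CLOSE_WAIT entries would suggest connections that are never closed on the kubelet side (a sketch, substitute the actual node pid):

[root@hostname ~]# lsof -a -p <node pid> -i TCP | awk '{print $NF}' | sort | uniq -c | sort -rn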

Please provide the pod specs for the applications being deployed.  Any additional context you can provide on the nature of the app would be helpful.
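
As a first pass, the probe configuration can be read straight from a running pod of each application (a sketch; <pod name> and the -A context length are just examples):

[root@hostname ~]# oc get pod <pod name> -o yaml | grep -A 8 livenessProbe

The interesting parts are whether it is an httpGet, tcpSocket, or exec probe and its initialDelaySeconds/timeoutSeconds settings.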

