Bug 1511896 - [DOCKER] Jenkins slave pods are killed off
Summary: [DOCKER] Jenkins slave pods are killed off
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.8.0
Assignee: Jim Minter
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-10 11:35 UTC by Gabor Burges
Modified: 2017-11-21 23:11 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-21 23:11:57 UTC


Attachments
docker logs (deleted) - 2017-11-10 12:41 UTC, Gabor Burges
master api (deleted) - 2017-11-10 13:05 UTC, Gabor Burges
node logs (deleted) - 2017-11-10 13:06 UTC, Gabor Burges
jenkins (deleted) - 2017-11-10 17:01 UTC, Gabor Burges
error of jenkins (deleted) - 2017-11-10 17:02 UTC, Gabor Burges
stacktrace from afternoon build (deleted) - 2017-11-10 17:04 UTC, Gabor Burges
stacktrace from a successful afternoon build (deleted) - 2017-11-10 17:05 UTC, Gabor Burges

Description Gabor Burges 2017-11-10 11:35:10 UTC
Description of problem:

One of our dedicated customers is experiencing issues when running Jenkins: their maven slave pod is being killed off for an unknown reason


Version-Release number of selected component (if applicable):

oc version
oc v3.5.5.31
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO
rpm -qa |grep kernel
kernel-tools-3.10.0-514.32.3.el7.x86_64
kernel-tools-libs-3.10.0-514.32.3.el7.x86_64
kernel-3.10.0-514.26.2.el7.x86_64
kernel-3.10.0-514.el7.x86_64
kernel-3.10.0-514.32.3.el7.x86_64
rpm -qa |grep docker
docker-client-1.12.6-48.git0fdc778.el7.x86_64
docker-common-1.12.6-48.git0fdc778.el7.x86_64
docker-rhel-push-plugin-1.12.6-48.git0fdc778.el7.x86_64
docker-1.12.6-48.git0fdc778.el7.x86_64
python-docker-py-1.10.6-1.el7.noarch
atomic-openshift-docker-excluder-3.5.5.31-1.git.0.b6f55a2.el7.noarch
python-docker-pycreds-1.10.6-1.el7.noarch


How reproducible:



Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 4 Gabor Burges 2017-11-10 12:41:29 UTC
Created attachment 1350458 [details]
docker logs

Comment 5 Gabor Burges 2017-11-10 13:05:09 UTC
Created attachment 1350465 [details]
master api

Comment 6 Gabor Burges 2017-11-10 13:06:30 UTC
Created attachment 1350466 [details]
node logs

Comment 7 Ruben Romero Montes 2017-11-10 15:28:52 UTC
I would ask for the description of the job and (if possible) the build logs, e.g.:
 $ oc rsync $pod:/var/lib/jenkins/jobs/$jobname .
 $ tar -zcvf $jobname.tar.gz $jobname

Comment 8 Gabor Burges 2017-11-10 17:01:26 UTC
Created attachment 1350592 [details]
jenkins

Comment 9 Gabor Burges 2017-11-10 17:02:37 UTC
Created attachment 1350593 [details]
error of jenkins

Comment 10 Gabor Burges 2017-11-10 17:04:00 UTC
Created attachment 1350595 [details]
stacktrace from afternoon build

Comment 11 Gabor Burges 2017-11-10 17:05:05 UTC
Created attachment 1350596 [details]
stacktrace from a successful afternoon build

Comment 12 Jim Minter 2017-11-10 19:31:44 UTC
Exit code 137 implies SIGKILL (128 + 9).

There is no shortage of the following in dmesg.log.gz:

> [3786123.702425] java invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=935
> [3786123.710679] java cpuset=docker-db049ca4d0a8301081ef7622d5f3da3b014cec5fae11d76ccc5ca65076f2ae24.scope mems_allowed=0
...
> [3786123.843937] memory: usage 2097152kB, limit 2097152kB, failcnt 37577
> [3786123.848627] memory+swap: usage 2097152kB, limit 9007199254740988kB, failcnt 0
> [3786123.855292] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0

Sounds like mistuned Java memory management parameters.

Separate follow-up: we need to make it much easier to tell that an exit is due to an OOM, probably both in K8s and in OpenShift.
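
In the meantime, while the pod object still exists, the container's last termination state gives a quick confirmation (a sketch; the pod name is a placeholder, the field paths are standard Kubernetes API fields):

 $ oc get pod <slave-pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode} {.status.containerStatuses[0].lastState.terminated.reason}'

For a cgroup OOM kill this typically prints 137 and OOMKilled.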

Comment 13 Jim Minter 2017-11-10 19:33:21 UTC
I haven't seen anything to suggest that the JNLP runner itself is misconfigured, but it looks like the Jenkins script is manually running ./gradlew clean build via a shell script.

I think it is likely that the Java memory tuning used by the JNLP runner is not being inherited by the gradle process.  Unfortunately, if left untuned, Java by default calculates its maximum heap size with reference to the node's memory, not the container's memory limit.  With a non-trivial workload, such a Java process may exceed the cgroup memory limit, triggering a kill from the kernel OOM killer.  I think this is what is happening here.

The recommended course of action is for appropriate parameters to be passed into gradle to manage its maximum memory consumption.  This should be achievable via JAVA_OPTS or GRADLE_OPTS environment variables.
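
For example, something along these lines in the job's shell step (a minimal sketch only; the -Xmx value is illustrative and needs to be sized against the container's 2GiB memory limit):

 $ export GRADLE_OPTS="-Xmx1536m"   # cap the gradle launcher JVM heap; illustrative value
 $ ./gradlew clean build

Note that if the gradle daemon is in use, the build JVM is controlled by org.gradle.jvmargs rather than GRADLE_OPTS.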

A starting point might be to try parameter values/logic equivalent to those used by the JNLP runner [1], but note that these may need to be adapted further according to the workload (e.g. if gradle forks child processes that consume additional memory, that may also need to be managed).

[1] https://github.com/openshift/jenkins/blob/master/slave-base/contrib/bin/run-jnlp-client#L68-L100

A second point to follow up on is providing some sort of out-of-the-box starting-point memory tuning for gradle, to help prevent this issue in future.

Comment 14 Jim Minter 2017-11-10 23:08:41 UTC
https://github.com/kubernetes/kubernetes/pull/45682 is a pre-existing PR which would make Kubernetes create events upon container death.  It doesn't indicate whether the death was due to an OOM, but it's a starting point.

Comment 15 Jim Minter 2017-11-10 23:23:03 UTC
It may additionally make sense to have the OOMs logged somewhere to make debugging more obvious.

Comment 16 Jim Minter 2017-11-10 23:48:50 UTC
https://github.com/openshift/jenkins/pull/417 is a first cut at ensuring that a maven or gradle process running under the Jenkins slave image automatically gets the requisite JVM options to avoid being OOM-killed.

Comment 17 Jim Minter 2017-11-10 23:49:31 UTC
We also need to look at whether the Jenkins k8s plugin could warn if/when its pods get OOM-killed, before deleting them outright...

Comment 18 Steve Speicher 2017-11-12 14:30:21 UTC
Jim, so do we have confirmation that it is getting OOM-killed, or is gradle just terminating due to hitting its heap limits (i.e. improper settings)?

Regarding comment #14, I have noticed that on a 3.6 instance the OOMKilled status showed. One issue is that terminated pods disappear pretty quickly, so it is hard to get the logs or other data.
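
While the pod object is still around, the previous container's output can at least be pulled (pod name is a placeholder):

 $ oc logs --previous <slave-pod>

Once the plugin has deleted the pod, the kernel OOM messages in the node's dmesg are about the only evidence left.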

Comment 19 Jim Minter 2017-11-12 15:42:41 UTC
The OOM killer operates in two cases: 1) node out of memory; 2) cgroup processes exceed their memory limit.  From the process point of view, the two are nearly indistinguishable, but I'm pretty confident that what's happening here is 2).
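
To double-check case 2 on the node: the dmesg excerpt above names the container's cgroup, and the container's memory limit can be read back with docker inspect (a sketch; use the id of the slave container):

 $ docker inspect --format '{{.HostConfig.Memory}}' <container-id>   # limit in bytes; 2147483648 = the 2GiB seen in dmesg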

If the kubernetes plugin didn't delete the slave pod immediately, you should see the OOMKilled status.  I'm working on a test patch that will make this happen.

