Note: This is a beta release of Red Hat Bugzilla 5.0. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Also email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback here.
Bug 1367141 - docker not removing cgroups for exited containers
Summary: docker not removing cgroups for exited containers
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: docker
Version: 7.2
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: Matthew Heon
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1496245 1328913 1357081
TreeView+ depends on / blocked
 
Reported: 2016-08-15 17:03 UTC by Seth Jennings
Modified: 2019-03-06 01:14 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-30 15:01:37 UTC


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1328913 None CLOSED Long running reliability tests show network errors on nodes 2019-03-04 05:32:31 UTC
Red Hat Bugzilla 1357081 None CLOSED openshift node logs thousands of "du: cannot access" 2019-03-04 05:32:31 UTC

Internal Links: 1328913 1357081

Description Seth Jennings 2016-08-15 17:03:27 UTC
Description of problem:

Repeating errors in cadvisor/kubelet/openshift-node process

Either:

Unable to get network stats from pid 128265: couldn't read network stats: failure opening /proc/128265/net/dev: open /proc/128265/net/dev: no such file or directory

or

failed to collect filesystem stats - du command failed on /rootfs/var/lib/docker/containers/a5febc665c196470ec125d7354103b707604      476ad8c31255d54a53cdb9352b41 with output stdout: , stderr: du: cannot access '/rootfs/var/lib/docker/containers/a5febc665c196470ec125d7354103b707604476ad8c31255d54a53cdb9352b41': No such file or directory

Follows this error:

Failed to remove orphaned pod "c134c5fe-4d6c-11e6-b1cc-0e227273c3bd" dir; err: remove /var/lib/origin/openshift.local.volumes/pods/c134c5fe-4d6c-11e6-b1cc-0e227273c3bd/volumes/kubernetes.io~secret/deployer-token-0hr8m: device or resource busy

Or this error:

kernel: XFS (dm-95): Unmounting Filesystem
kernel: device-mapper: thin: Deletion of thin device 68560 failed.
forward-journal[18687]: time="2016-07-15T04:55:27.490877319Z" level=error msg="Handler for DELETE /v1.21/containers/5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436 returned error: Driver devicemapper failed to remove root filesystem 5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436: Device is Busy"
forward-journal[18687]: time="2016-07-15T04:55:27.491095598Z" level=error msg="HTTP Error" err="Driver devicemapper failed to remove root filesystem 5cec00e7355de03fdc8966895223776c1163875b92a7ed05734ea80ce6527436: Device is Busy" statusCode=500

Version-Release number of selected component (if applicable):
Openshift 3.1

How reproducible:
No reproduction procedure.  Seems to be a race.

Steps to Reproduce:
1.
2.
3.

Actual results:

The device mapper thin device fails to unmount or remove pod directories due to a "device busy" error.  This causes the container's cgroup to not be cleaned up properly.  This causes cadvisor (and therefore kubelet and openshift-node) to continue monitoring for a container whose pid 1 has exited, leading to failure messages in the logs.

Expected results:

Ideally, we would not encounter the "device busy" errors which leave orphaned thin devices everywhere.  But in the case the error does occur (due to a kernel bug, race in docker, or whatever), docker should _always_ clean up the container cgroup so that cadvisor will stop monitoring the container.

Additional info:
https://bugzilla.redhat.com/show_bug.cgi?id=1328913
https://bugzilla.redhat.com/show_bug.cgi?id=1357081

Comment 1 Seth Jennings 2016-08-15 17:15:35 UTC
Should be noted that the device mapper error causing the orphaned cgroup issue is a theory.  Might not be right.  The real problem is the orphaned cgoup which was confirmed here:

https://bugzilla.redhat.com/show_bug.cgi?id=1328913#c9

The reason the cgroup is not being removed is really unknown at this point.

Comment 3 Daniel Walsh 2016-10-18 13:51:21 UTC
Mrunal have any ideas here?

Comment 4 Mrunal Patel 2016-10-18 17:53:06 UTC
I think that this is related to the mounts leaking into machined issue.

Comment 5 Daniel Walsh 2017-06-30 15:01:37 UTC
Should be fixed by oci-umount I believe.


Note You need to log in before you can comment on or make changes to this bug.