Bug 1357081 - openshift node logs thousands of "du: cannot access"
Summary: openshift node logs thousands of "du: cannot access"
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Seth Jennings
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On: 1367141
Blocks: OSOPS_V3
 
Reported: 2016-07-15 16:17 UTC by Sten Turpin
Modified: 2017-05-24 14:10 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-24 14:10:14 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1367141 None CLOSED docker not removing cgroups for exited containers 2019-02-15 14:00:19 UTC

Internal Links: 1367141

Description Sten Turpin 2016-07-15 16:17:28 UTC
Description of problem: openshift node logs thousands of "du: cannot access"


Version-Release number of selected component (if applicable): atomic-openshift-node-3.2.0.44-1.git.0.a4463d9.el7.x86_64


How reproducible: always


Steps to Reproduce:
1. Run pods on an OpenShift node with the version above (also observed on earlier versions); a hypothetical sequence is sketched below.
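
A hypothetical reproduction sequence (the project name and pods.yaml are placeholders): create a batch of pods, delete them, and watch the node log for the du errors.

oc create -f pods.yaml -n <project>     # any batch of ~10 short-lived pods
oc delete pods --all -n <project>
journalctl -u atomic-openshift-node -f | grep 'du: cannot access'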

Actual results:
the below error is logged roughly 6000 times/day on a low-utilization node

Expected results:
this error should be handled gracefully rather than logged thousands of times per day

Additional info:
example of a complete log entry:
Jul 15 14:11:52 ip-172-31-3-96.ec2.internal atomic-openshift-node[7724]: E0715 14:11:52.590669    7724 fsHandler.go:106] failed to collect filesystem stats - du command failed on /var/lib/docker/containers/cc8af30222fa33248476635981347a56bb10e7014cd8546f75fe2d6fcb301740 with output stdout: , stderr: du: cannot access ‘/var/lib/docker/containers/cc8af30222fa33248476635981347a56bb10e7014cd8546f75fe2d6fcb301740’: No such file or directory
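
The expectation above amounts to tolerating a container directory that disappears between the time a filesystem scan is scheduled and the time du runs. A minimal shell sketch of that idea (the container ID is a placeholder; this is not the actual cAdvisor code):

dir=/var/lib/docker/containers/<container-id>
if ! du -s "$dir" 2>/dev/null; then
    # If the directory is already gone, the container was removed while
    # the scan was pending; skip quietly instead of logging an error.
    [ -d "$dir" ] && echo "du genuinely failed on $dir" >&2
fi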

Comment 1 Andy Goldstein 2016-07-15 16:19:11 UTC
I believe this is fixed in 3.3.

Comment 2 DeShuai Ma 2016-07-18 08:53:53 UTC
Tested on OpenShift 3.3; this error still occurs, but far less often. On one of my nodes there are two related errors.
openshift v3.3.0.6
kubernetes v1.3.0+57fb9ac
etcd 2.3.0+git

Jul 18 03:51:49 ip-172-18-10-95 atomic-openshift-node: E0718 03:51:49.519355   17412 fsHandler.go:106] failed to collect filesystem stats - du command failed on /rootfs/var/lib/docker/containers/a5febc665c196470ec125d7354103b707604476ad8c31255d54a53cdb9352b41 with output stdout: , stderr: du: cannot access '/rootfs/var/lib/docker/containers/a5febc665c196470ec125d7354103b707604476ad8c31255d54a53cdb9352b41': No such file or directory

Jul 18 04:48:48 ip-172-18-10-95 atomic-openshift-node: E0718 04:48:48.245490   17412 fsHandler.go:106] failed to collect filesystem stats - du command failed on /rootfs/var/lib/docker/containers/27d314be763347414adef76078fc057cc0490aa6a2e3b818df6deef883d46c41 with output stdout: , stderr: du: cannot access '/rootfs/var/lib/docker/containers/27d314be763347414adef76078fc057cc0490aa6a2e3b818df6deef883d46c41': No such file or directory

Comment 3 Andy Goldstein 2016-07-18 13:44:06 UTC
Could you please paste the output of `sudo docker info`?

Also, do you know what those 2 containers are - what were they running, did OpenShift start them, etc?

Comment 4 DeShuai Ma 2016-07-19 07:40:09 UTC
(In reply to Andy Goldstein from comment #3)
> Could you please paste the output of `sudo docker info`?
> 
> Also, do you know what those 2 containers are - what were they running, did
> OpenShift start them, etc?

Those containers were created by OpenShift. For example, I created 10 pods from the same image and then deleted them; only this one hit the error:

Jul 19 00:53:53 ip-172-18-4-118 atomic-openshift-node: E0719 00:53:53.669042    8978 fsHandler.go:106] failed to collect filesystem stats - du command failed on /var/lib/docker/containers/319d3d8950094b7c1d7147ca74e6663b839a3484752e0e154b1dbff71f6c6e25 with output stdout: , stderr: du: cannot access ‘/var/lib/docker/containers/319d3d8950094b7c1d7147ca74e6663b839a3484752e0e154b1dbff71f6c6e25’: No such file or directory


[root@ip-172-18-4-118 containers]# docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 5
Server Version: 1.10.3
Storage Driver: devicemapper
 Pool Name: docker-202:2-67110041-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 1.692 GB
 Data Space Total: 107.4 GB
 Data Space Available: 22.64 GB
 Metadata Space Used: 2.13 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.145 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.107-RHEL7 (2016-06-09)
Execution Driver: native-0.2
Logging Driver: json-file
Plugins: 
 Volume: local
 Network: null host bridge
 Authorization: rhel-push-plugin
Kernel Version: 3.10.0-327.22.2.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.2 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 1
Total Memory: 3.518 GiB
Name: ip-172-18-4-118.ec2.internal
ID: FHHR:XIJ3:XQV7:U5I2:VMF5:PNF6:OJYZ:MV33:PR4T:T3U6:SQ74:4WJV
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Registries: registry.qe.openshift.com (insecure), registry.access.redhat.com (secure), docker.io (secure)
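
Unrelated to the du messages, but the loopback warning in the docker info output above matters on production nodes. A hedged example of the remedy the warning points at (volume group and pool names are hypothetical), via /etc/sysconfig/docker-storage on RHEL 7:

DOCKER_STORAGE_OPTIONS="--storage-opt dm.thinpooldev=/dev/mapper/docker--vg-docker--pool"

This file is normally generated by docker-storage-setup rather than edited by hand; the point is to back devicemapper with a real LVM thin pool instead of the /dev/loop0 and /dev/loop1 files shown above.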

Comment 5 DeShuai Ma 2016-07-19 07:45:12 UTC
Sometimes when deleting all the pods in a project, I also hit a "device or resource busy" error.
1. Delete the pods in the project:
[root@ip-172-18-1-64 ~]# oc delete `oc get pod -n dma -o name` -n dma
pod "besteffort-3iap2" deleted
pod "besteffort-95ct6" deleted
pod "besteffort-9n99g" deleted
pod "besteffort-bpzy0" deleted
pod "besteffort-hdxs8" deleted
pod "besteffort-jkyrd" deleted
pod "besteffort-lqmga" deleted
pod "besteffort-n72uy" deleted
pod "besteffort-qc3xu" deleted
pod "besteffort-ttt3j" deleted

2. Check the node logs:
Jul 19 03:40:57 ip-172-18-4-118 atomic-openshift-node: I0719 03:40:57.451528   35151 kubelet.go:2043] Failed to remove orphaned pod "c134c5fe-4d6c-11e6-b1cc-0e227273c3bd" dir; err: remove /var/lib/origin/openshift.local.volumes/pods/c134c5fe-4d6c-11e6-b1cc-0e227273c3bd/volumes/kubernetes.io~secret/deployer-token-0hr8m: device or resource busy

http://pastebin.test.redhat.com/393492
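
A hypothetical way to confirm what keeps the pod directory busy (pod UID taken from the log line above): the secret volume is a tmpfs mount, so a lingering entry in /proc/mounts would explain the "device or resource busy":

pod=c134c5fe-4d6c-11e6-b1cc-0e227273c3bd
grep "$pod" /proc/mounts
# a tmpfs still mounted at .../volumes/kubernetes.io~secret/deployer-token-0hr8m
# after the pod is deleted would keep the directory busy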

Comment 6 Seth Jennings 2016-08-12 19:27:11 UTC
DeShuai, looking for some clarity here.  Does the du error message continue indefinitely for a particular pod or does it go away within a short period of time?

Comment 7 Seth Jennings 2016-08-12 19:49:13 UTC
I am pretty sure this is related to the orphaning of docker cgroups, see 1328913.

The suspected cause of the orphaning is the DM error mentioned in comment #5.

This can be confirmed using the following method:

Take the error described in comment #4:

Jul 19 00:53:53 ip-172-18-4-118 atomic-openshift-node: E0719 00:53:53.669042    8978 fsHandler.go:106] failed to collect filesystem stats - du command failed on /var/lib/docker/containers/319d3d8950094b7c1d7147ca74e6663b839a3484752e0e154b1dbff71f6c6e25 with output stdout: , stderr: du: cannot access ‘/var/lib/docker/containers/319d3d8950094b7c1d7147ca74e6663b839a3484752e0e154b1dbff71f6c6e25’: No such file or directory

If the path

/sys/fs/cgroup/cpu,cpuacct/system.slice/319d3d8950094b7c1d7147ca74e6663b839a3484752e0e154b1dbff71f6c6e25

exists, then it is the same problem.

A further confirmation is that /sys/fs/cgroup/cpu,cpuacct/system.slice/319d3d8950094b7c1d7147ca74e6663b839a3484752e0e154b1dbff71f6c6e25/tasks is empty (no tasks in the cgroup).
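
The check can be scripted; a small sketch using the container ID from the quoted error:

cid=319d3d8950094b7c1d7147ca74e6663b839a3484752e0e154b1dbff71f6c6e25
cg=/sys/fs/cgroup/cpu,cpuacct/system.slice/$cid
[ -d "$cg" ] && echo "orphaned cgroup still present"
[ -d "$cg" ] && [ ! -s "$cg/tasks" ] && echo "tasks file empty: no processes left in the cgroup"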

Comment 8 Seth Jennings 2017-05-05 14:37:58 UTC
Still blocked on
https://bugzilla.redhat.com/show_bug.cgi?id=1367141

Comment 9 Seth Jennings 2017-05-24 14:10:14 UTC
Trail has gone cold on this.  Docker 1.10 is old.  No associated customer issue.  Closing.

