Bug 1510167 - [GSS] Orphaned pod "<<pod_uuid>>" found, but volume paths are still present on disk. [NEEDINFO]
Summary: [GSS] Orphaned pod "<<pod_uuid>>" found, but volume paths are still present o...
Keywords:
Status: MODIFIED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: kubernetes
Version: cns-3.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Assignee: Humble Chirammal
QA Contact: Rachael
URL:
Whiteboard:
Depends On: 1558600
Blocks: 1542093 OCS-3.11.1-devel-triage-done 1642792 1573420 1622458
 
Reported: 2017-11-06 20:26 UTC by Thom Carlin
Modified: 2019-04-11 08:25 UTC
CC List: 26 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
bkunal: needinfo? (andcosta)
bkunal: needinfo? (knakayam)




Links
Github kubernetes/kubernetes issue 60645 (Last Updated: 2018-06-20 17:45:58 UTC)

Description Thom Carlin 2017-11-06 20:26:24 UTC
Description of problem:

While investigating FailedSync errors, I found about 10 "kubelet_volumes.go:114] Orphaned pod "<<pod_uuid>>" found, but volume paths are still present on disk." messages in the output of "systemctl status -l atomic-openshift-node" for CNS 3.6-backed pods.

Version-Release number of selected component (if applicable):

3.6

How reproducible:

Uncertain

Steps to Reproduce: [Uncertain]
1. systemctl status -l atomic-openshift-node

Actual results:

kubelet_volumes.go:114] Orphaned pod "<<pod_uuid>>" found, but volume paths are still present on disk.

Expected results:

No errors
No remnants of pods left

Additional info:

Tentative workaround (for each orphaned pod):
1) Note pod_uuid
2) ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs
  * Note pvc_uuid
3) ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/<<pvc_uuid>>
  * Directory should be empty
4) rmdir /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs
   * Directory should be removed
   * All parent directories up to /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs (inclusive) should also disappear
   * Orphan message for this pod no longer appears for this pod_uuid

Comment 2 Thom Carlin 2017-11-06 20:51:49 UTC
Correction:
4) rmdir /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/<<pvc_uuid>>

Additionally
oc get pvc --all-namespaces | egrep <<pvc_uuid>> should not return anything
oc get pv | egrep <<pvc_uuid>> should not return anything
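
Putting the corrected steps together, a rough per-pod cleanup could look like this (a sketch only, not an official procedure; pod_uuid and pvc_uuid are placeholders you fill in by hand, and rmdir will refuse to remove anything non-empty):

# Sketch only: consolidates steps 1-4 above with the step 4 correction applied.
pod_uuid="<<pod_uuid>>"   # placeholder: UUID from the "Orphaned pod" message
base="/var/lib/origin/openshift.local.volumes/pods/${pod_uuid}/volumes/kubernetes.io~glusterfs"

# 2) list the glusterfs volume directory and note the pvc directory name(s)
ls -l -a "${base}"

# 3)/4) each pvc directory should be empty; rmdir only removes empty directories
for pvc_dir in "${base}"/*; do
    [ -e "${pvc_dir}" ] || continue   # skip if the glob matched nothing
    ls -l -a "${pvc_dir}"
    rmdir "${pvc_dir}"
done

# cross-checks: neither command should return anything for the noted pvc_uuid
oc get pvc --all-namespaces | egrep "<<pvc_uuid>>"
oc get pv | egrep "<<pvc_uuid>>"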

Comment 3 Humble Chirammal 2018-01-11 11:54:49 UTC
We are trying to come up with a common mechanism to resolve the stale mounts across different filesystems such as NFS, GlusterFS, etc. I will keep you posted. The patch is currently in review.

Comment 7 Humble Chirammal 2018-02-05 17:01:41 UTC
This is fixed in OCP 3.9 builds. Moving to ON_QA.

Comment 8 Humble Chirammal 2018-02-07 07:28:08 UTC
I have tried to reproduce this issue with the above patches in place, and the issue is not reproducible.

The volumes are gone after pod deletion, even in the case of an unsuccessful pod launch.

Comment 13 Rachael 2018-03-08 08:02:11 UTC
How can this issue be reproduced to test the fix?

Comment 14 Humble Chirammal 2018-03-12 16:05:21 UTC
(In reply to Rachael from comment #13)
> How can this issue be reproduced to test the fix?

One verification approach is to go through the logs for `Orphaned pod` messages and check whether the pod UUID matches a pod that used a Gluster PVC claim. This is a generic error message and can appear on a system for any volume type; we only need to check the pods that used a GlusterFS PVC.

Also, looking under `/var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs` (via `ls -l -a`) for pod UUIDs that are no longer present or running on the system can help verify the bug.
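
If it helps, one rough way to script that check (a sketch under the assumptions above; the journal match pattern and the jsonpath query are mine, not from this bug, and it assumes the node also has oc access, otherwise run the two halves separately):

# Sketch: collect orphaned-pod UUIDs from the node journal, keep the ones that still
# have a glusterfs volume directory on disk, and confirm the UUID is not a live pod.
live_uids=$(oc get pods --all-namespaces -o jsonpath='{.items[*].metadata.uid}')
for uuid in $(sudo journalctl -u atomic-openshift-node --since yesterday \
                | grep -o 'Orphaned pod "[^"]*"' | tr -d '"' | awk '{print $3}' | sort -u); do
    dir="/var/lib/origin/openshift.local.volumes/pods/${uuid}/volumes/kubernetes.io~glusterfs"
    if [ -d "${dir}" ] && ! echo "${live_uids}" | grep -q "${uuid}"; then
        echo "orphaned glusterfs-backed pod ${uuid}:"
        ls -l -a "${dir}"
    fi
done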

Comment 20 Thom Carlin 2018-05-01 13:37:01 UTC
More information for end-users:

On OCP node running the pod:

1) df | grep "<<pod_uuid>>"
df: ‘/var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>’: Transport endpoint is not connected

2) ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs
cannot access /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>: Transport endpoint is not connected
Total 0
drwxr-x---. 3 root root 54 <<datestamp>> .
drwxr-x---. 5 root root 96 <<datestamp>> ..
d?????????? ? ?    ?     ?            ? pvc-<<pvc_uuid>>
  * Note pvc_uuid

3) ls -l -a /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>
ls: cannot access /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>: Transport endpoint is not connected
  * Directory should be empty

4) rmdir /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>


Additionally
A) oc get pvc --all-namespaces | egrep <<pvc_uuid>> should not return anything
B) oc get pv | egrep <<pvc_uuid>> should not return anything
C) lsof | grep /var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>> may help to isolate the root cause
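
To sweep a whole node for these, something along the following lines may work (a sketch only; it assumes the path layout above and that a disconnected FUSE mount point fails stat, which is how the "Transport endpoint is not connected" state shows up; lsof can be slow if many stale mounts are present):

#!/bin/bash
# Sketch: list pvc mount points under the pods directory that can no longer be stat'ed.
shopt -s nullglob
for pvc_dir in /var/lib/origin/openshift.local.volumes/pods/*/volumes/kubernetes.io~glusterfs/pvc-*; do
    if ! stat "${pvc_dir}" >/dev/null 2>&1; then
        echo "stale mount: ${pvc_dir}"
        lsof 2>/dev/null | grep "${pvc_dir}"   # step C): find the process still holding it
    fi
done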

Comment 21 Thom Carlin 2018-05-01 13:46:53 UTC
D) Using the PID(s) from C), ps -fp <<glusterfs_pid>>
   * Note the log file path
E) less <<glusterfs_logfile_path>>
   * Note the errors which led to "Transport endpoint is not connected"
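
Roughly, steps C) through E) can be chained like this (a sketch; whether the log path shows up as a --log-file argument on the glusterfs client command line depends on how the mount was created, so treat that part as an assumption):

# Sketch for C)-E): from the stale mount path, find the glusterfs client PID(s) and its log file.
mount_path="/var/lib/origin/openshift.local.volumes/pods/<<pod_uuid>>/volumes/kubernetes.io~glusterfs/pvc-<<pvc_uuid>>"
for pid in $(lsof 2>/dev/null | grep "${mount_path}" | awk '{print $2}' | sort -u); do
    ps -fp "${pid}"                                                  # D) full command line
    ps -o args= -p "${pid}" | tr ' ' '\n' | grep -- '--log-file'     # E) pull out the log path, if present
done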

Comment 30 Ilkka Tengvall 2018-07-13 06:47:12 UTC
This would need a cleanup script. In my customer's environment there are, for example, 86 of these per node, and that adds up quickly. Could we deliver a cleanup script for anyone to run, e.g. from cron or Tower, until this is fixed?

Here's what I used to find these:

export pod=`sudo /bin/journalctl -u atomic-openshift-node --since "1 hour ago" | grep "Orphaned pod" | tail -1 | sed 's/.*Orphaned pod "\([^"]*\)".*/\1/'`; echo $pod

-> Will output:
000d0d5b-47de-11e8-8a1d-001dd8b71e6f

To see the dirs, especially the important volumes/ dir under that:

sudo ls -la /var/lib/origin/openshift.local.volumes/pods/$pod/

Also see if it's mounted: 
df | grep $pod

Now, that would need to be enhanced further, but I'm leaving on PTO and don't have the time now. The problem with the above is that it only shows one pod at a time.
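
For reference, a rough extension that shows all of them at once could look like this (a sketch only; same journal pattern as above, and it assumes root/sudo on the node):

# Sketch: iterate over every distinct orphaned-pod UUID in the journal instead of
# only the last one, printing its volume directory and any lingering mount.
sudo /bin/journalctl -u atomic-openshift-node --since "1 hour ago" \
    | grep -o 'Orphaned pod "[^"]*"' | tr -d '"' | awk '{print $3}' | sort -u \
    | while read -r pod; do
        echo "== ${pod} =="
        sudo ls -la "/var/lib/origin/openshift.local.volumes/pods/${pod}/" 2>/dev/null
        df 2>/dev/null | grep "${pod}"
      done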

Comment 31 Ilkka Tengvall 2018-07-13 07:45:30 UTC
And here, by the way, is an Ansible command to find the affected nodes in your cluster:

ansible -i ocp/hosts.yml nodes -b -m shell -a '/bin/journalctl -u atomic-openshift-node --since yesterday | /bin/grep "Orphaned pod" | tail -1 ' -f 15

It will output the affected nodes like this:

xxx.yyy.local | SUCCESS | rc=0 >>
Jul 13 10:24:16 xxx.yyy.local atomic-openshift-node[28429]: E0713 10:24:16.909897   28429 kubelet_volumes.go:128] Orphaned pod "000d0d5b-47de-11e8-8a1d-001dd8b71e6f" found, but volume paths are still present on disk : There were a total of 86 errors similar to this. Turn up verbosity to see them.

Comment 34 Michael Adam 2018-09-19 22:07:09 UTC
Is this completely fixed by BZ #1558600 or is there more left?

If this is tracking #1558600, then we should close this as CURRENTRELEASE, since it's already fixed in OCP 3.10.

Comment 35 Michael Adam 2018-09-20 13:44:21 UTC
Even if an additional fix is needed, we cannot fix it in 3.11.0.
==> moving out.

And adapting severity, since this is mostly cosmetic.

Leaving needinfo on Humble to verify whether a fix is needed.

Comment 36 Humble Chirammal 2018-09-20 13:52:37 UTC
(In reply to Michael Adam from comment #34)
> Is this completely fixed by BZ #1558600 or is there more left?
> 
> If this is tracking #1558600, then we should close this as CURRENTRELEASE,
> since it's already fixed in OCP 3.10.

Yes, the OCP bug has been closed with the recent errata:
https://access.redhat.com/errata/RHBA-2018:1816

However, I doubt all the corner cases are fixed, given the issues I have seen upstream. For example: https://github.com/kubernetes/kubernetes/issues/45464


We can retest this with OCP 3.11 and proceed accordingly. That's the best thing I can think of now.

