
Bug 1596819

Summary: Build controller does a poor job at reacting to dynamic conditions on the cluster - some builds fail that could have been retried
Product: OpenShift Container Platform
Component: RFE
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Status: NEW
Severity: medium
Priority: unspecified
Target Milestone: ---
Target Release: ---
Reporter: Clayton Coleman <ccoleman>
Assignee: Ben Parees <bparees>
QA Contact: Xiaoli Tian <xtian>
Docs Contact:
CC: aos-bugs, ccoleman, jokerman, mmccomas, slaterx
Flags: bparees: needinfo? (ccoleman)
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Clayton Coleman 2018-06-29 18:29:22 UTC
The build controller is fairly naive about events that commonly occur during normal operation - eviction, OOM kills, and failure to admit the pod.

This means that on busy clusters, builds fail that don't need to fail - they could be safely retried, analogous to how being unable to schedule a pod for a while isn't treated as a failure condition but as something we hide from users.

In the case we hit on CI, the build controller scheduled a build, but the node was under disk pressure and refused to admit the pod.  The pod hadn't started running yet, so we could have safely retried, but the build controller didn't know that.  Not all cases like this will be safe to retry, but as it stands things randomly fail on busy clusters.

There may be additional tuning the nodes should do - I'm not yet clear whether the node should have started GC *before* it put up the DiskPressure rejection.  I.e. if we hit 85% disk and immediately start rejecting pods before we've cleaned anything up, that's bad.

Jun 29 17:57:45 origin-ci-ig-n-nbnl origin-node[13300]: W0629 17:57:45.977393   13300 eviction_manager.go:343] eviction manager: attempting to reclaim imagefs
Jun 29 17:57:45 origin-ci-ig-n-nbnl origin-node[13300]: I0629 17:57:45.981974   13300 container_gc.go:85] attempting to delete unused containers
Jun 29 17:57:46 origin-ci-ig-n-nbnl origin-node[13300]: I0629 17:57:46.631605   13300 kubelet.go:1923] SyncLoop (PLEG): "os-build_ci-op-n6ppr6fr(da059406-7bc5-11e8-9efa-42010a8e0004)", event: &pleg.PodLifecycleEvent{ID:"da059406-7bc5-11e8-9efa-42010a8e0004", Type:"ContainerDied", Data:"f03e1f20877e393a86d3275fee44e
Jun 29 17:57:46 origin-ci-ig-n-nbnl origin-node[13300]: I0629 17:57:46.730088   13300 reconciler.go:181] operationExecutor.UnmountVolume started for volume "builder-dockercfg-tkrgp-push" (UniqueName: "kubernetes.io/secret/da059406-7bc5-11e8-9efa-42010a8e0004-builder-dockercfg-tkrgp-push") pod "da059406-7bc5-11e8-9e
Jun 29 17:57:46 origin-ci-ig-n-nbnl origin-node[13300]: I0629 17:57:46.730155   13300 reconciler.go:181] operationExecutor.UnmountVolume started for volume "crio-socket" (UniqueName: "kubernetes.io/host-path/da059406-7bc5-11e8-9efa-42010a8e0004-crio-socket") pod "da059406-7bc5-11e8-9efa-42010a8e0004" (UID: "da05940
Jun 29 17:57:46 origin-ci-ig-n-nbnl origin-node[13300]: I0629 17:57:46.730203   13300 reconciler.go:181] operationExecutor.UnmountVolume started for volume "buildworkdir" (UniqueName: "kubernetes.io/empty-dir/da059406-7bc5-11e8-9efa-42010a8e0004-buildworkdir") pod "da059406-7bc5-11e8-9efa-42010a8e0004" (UID: "da059
Jun 29 17:57:46 origin-ci-ig-n-nbnl origin-node[13300]: I0629 17:57:46.730235   13300 reconciler.go:181] operationExecutor.UnmountVolume started for volume "docker-socket" (UniqueName: "kubernetes.io/host-path/da059406-7bc5-11e8-9efa-42010a8e0004-docker-socket") pod "da059406-7bc5-11e8-9efa-42010a8e0004" (UID: "da0
Jun 29 17:57:46 origin-ci-ig-n-nbnl origin-node[13300]: I0629 17:57:46.730270   13300 reconciler.go:181] operationExecutor.UnmountVolume started for volume "builder-dockercfg-tkrgp-pull" (UniqueName: "kubernetes.io/secret/da059406-7bc5-11e8-9efa-42010a8e0004-builder-dockercfg-tkrgp-pull") pod "da059406-7bc5-11e8-9e
Jun 29 17:57:46 origin-ci-ig-n-nbnl origin-node[13300]: I0629 17:57:46.730313   13300 reconciler.go:181] operationExecutor.UnmountVolume started for volume "builder-token-gblzd" (UniqueName: "kubernetes.io/secret/da059406-7bc5-11e8-9efa-42010a8e0004-builder-token-gblzd") pod "da059406-7bc5-11e8-9efa-42010a8e0004" (
Jun 29 17:57:46 origin-ci-ig-n-nbnl origin-node[13300]: I0629 17:57:46.731001   13300 operation_generator.go:684] UnmountVolume.TearDown succeeded for volume "kubernetes.io/empty-dir/da059406-7bc5-11e8-9efa-42010a8e0004-buildworkdir" (OuterVolumeSpecName: "buildworkdir") pod "da059406-7bc5-11e8-9efa-42010a8e0004" (
Jun 29 17:57:46 origin-ci-ig-n-nbnl origin-node[13300]: I0629 17:57:46.731019   13300 operation_generator.go:684] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/da059406-7bc5-11e8-9efa-42010a8e0004-crio-socket" (OuterVolumeSpecName: "crio-socket") pod "da059406-7bc5-11e8-9efa-42010a8e0004" (UI
Jun 29 17:57:46 origin-ci-ig-n-nbnl origin-node[13300]: I0629 17:57:46.731107   13300 operation_generator.go:684] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/da059406-7bc5-11e8-9efa-42010a8e0004-docker-socket" (OuterVolumeSpecName: "docker-socket") pod "da059406-7bc5-11e8-9efa-42010a8e0004"
Jun 29 17:57:46 origin-ci-ig-n-nbnl origin-node[13300]: I0629 17:57:46.747489   13300 operation_generator.go:684] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/da059406-7bc5-11e8-9efa-42010a8e0004-builder-token-gblzd" (OuterVolumeSpecName: "builder-token-gblzd") pod "da059406-7bc5-11e8-9efa-4201
Jun 29 17:57:46 origin-ci-ig-n-nbnl origin-node[13300]: I0629 17:57:46.754409   13300 operation_generator.go:684] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/da059406-7bc5-11e8-9efa-42010a8e0004-builder-dockercfg-tkrgp-push" (OuterVolumeSpecName: "builder-dockercfg-tkrgp-push") pod "da059406-7
Jun 29 17:57:46 origin-ci-ig-n-nbnl origin-node[13300]: I0629 17:57:46.755309   13300 operation_generator.go:684] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/da059406-7bc5-11e8-9efa-42010a8e0004-builder-dockercfg-tkrgp-pull" (OuterVolumeSpecName: "builder-dockercfg-tkrgp-pull") pod "da059406-7
Jun 29 17:57:46 origin-ci-ig-n-nbnl origin-node[13300]: I0629 17:57:46.770526   13300 kubelet.go:1878] SyncLoop (ADD, "api"): "base-build_ci-op-n6ppr6fr(ee0b3142-7bc5-11e8-9efa-42010a8e0004)"
Jun 29 17:57:46 origin-ci-ig-n-nbnl origin-node[13300]: W0629 17:57:46.770666   13300 eviction_manager.go:143] Failed to admit pod base-build_ci-op-n6ppr6fr(ee0b3142-7bc5-11e8-9efa-42010a8e0004) - node has conditions: [DiskPressure]
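As a rough sketch of the kind of check the build controller could do for the rejection shown in the log above (entirely hypothetical - the package and function names below are mine, not part of the actual controller): a pod that reached the Failed phase without ever recording a container status did no build work, so it would be a candidate for retry. The string match on the rejection reason is only illustrative; the exact reason/message set on a rejected pod varies by version.

package buildretry

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// podRejectedBeforeRunning reports whether a build pod failed without any of
// its containers ever starting, e.g. because the kubelet refused admission
// while the node reported DiskPressure. If so, no build work was done and
// recreating the pod is safe.
func podRejectedBeforeRunning(pod *corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodFailed {
		return false
	}
	// No container or init container statuses means nothing ever ran on the node.
	if len(pod.Status.ContainerStatuses) > 0 || len(pod.Status.InitContainerStatuses) > 0 {
		return false
	}
	// Illustrative only: the reason/message wording differs across releases.
	details := pod.Status.Reason + " " + pod.Status.Message
	return strings.Contains(details, "DiskPressure") ||
		strings.Contains(details, "Evicted")
}

The absence of container statuses, rather than the reason string, is the more reliable signal that the pod took no actions.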

Comment 1 Ben Parees 2018-06-29 18:38:43 UTC
Seems like failure to admit is the only one we could reasonably retry. My question would be: why isn't that a platform feature for all pods? Or why can't I decide how long I want the cluster to try to admit my pod before giving up? Why should I be responsible for retrying my pod when its "failure" was due to temporary cluster conditions?

Doesn't the same problem apply to Jobs and possibly other workloads? It doesn't seem like every controller should implement its own handling for this scenario.

Comment 2 Ben Parees 2018-07-18 21:37:54 UTC
Some additional detail here: the scenario is that the scheduler picked the node for the pod, so the pod was "admitted", and then by the time the pod landed on the node, the node no longer had capacity, so the node held the pod and ultimately failed(?) it.

In this scenario the build controller would need to detect what happened to the pod, recognize that this means the pod took no actions, then delete it and create a new build pod.
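A hedged sketch of what that delete-and-recreate step might look like, continuing the hypothetical buildretry package above (the helper name is mine, the client-go calls use current signatures, and the real build controller would recreate the pod through its own build-pod creation path):

package buildretry

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// retryBuildPod deletes a rejected build pod and recreates a clean copy of it,
// assuming podRejectedBeforeRunning (sketched above) already confirmed that no
// container ever ran.
func retryBuildPod(ctx context.Context, client kubernetes.Interface, failed *corev1.Pod) error {
	ns, name := failed.Namespace, failed.Name

	// Remove the failed pod object so the name can be reused.
	if err := client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
		return err
	}

	// Recreate the pod from its spec, clearing the old scheduling decision so
	// the scheduler can pick a node that is not under DiskPressure.
	retry := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Namespace:   ns,
			Name:        name,
			Labels:      failed.Labels,
			Annotations: failed.Annotations,
		},
		Spec: *failed.Spec.DeepCopy(),
	}
	retry.Spec.NodeName = ""

	_, err := client.CoreV1().Pods(ns).Create(ctx, retry, metav1.CreateOptions{})
	return err
}

In practice the controller would also need to cap the number of retries and wait for the old pod to actually disappear before reusing the name.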

Comment 3 Gleidson Nascimento 2018-09-06 00:45:52 UTC
This is reproducible on OKD 3.10.0.

I hit this scenario on one of the dedicated glusterfs nodes. Since all pods there come from daemonsets, eviction doesn't work properly: it can't ensure that the daemonset pod will be rescheduled back onto the node after eviction.

Worst of all, this constant scheduling/eviction churn generates so much journal logging that it ends up triggering a 'NodeHasDiskPressure' status.

Thanks,
Gleidson

Comment 4 Ben Parees 2018-09-06 01:47:20 UTC
That sounds like a different/tangential issue to the one being described here.

Comment 5 Gleidson Nascimento 2018-09-06 09:56:15 UTC
Here are example logs for that scenario:

Sep 06 10:57:40 storage-03 origin-node[16213]: W0906 10:57:40.808665   16213 eviction_manager.go:343] eviction manager: attempting to reclaim nodefs
Sep 06 10:57:40 storage-03 origin-node[16213]: I0906 10:57:40.808716   16213 eviction_manager.go:357] eviction manager: must evict pod(s) to reclaim nodefs
Sep 06 10:57:40 storage-03 origin-node[16213]: I0906 10:57:40.808865   16213 eviction_manager.go:375] eviction manager: pods ranked for eviction: sync-8ldb4_openshift-node(e0ecdcd7-b15e-11e
Sep 06 10:57:40 storage-03 origin-node[16213]: I0906 10:57:40.809138   16213 kuberuntime_container.go:547] Killing container "docker://26e16f9d2eadbf12df353d91c5c1083f29b1600e10c6454d597685
Sep 06 10:57:40 storage-03 dockerd-current[2670]: time="2018-09-06T10:57:40.812764689+12:00" level=info msg="Container 26e16f9d2eadbf12df353d91c5c1083f29b1600e10c6454d597685e880c9b738 faile
Sep 06 10:57:40 storage-03 origin-node[16213]: I0906 10:57:40.830757   16213 kubelet_pods.go:1127] Killing unwanted pod "sync-8ldb4"
Sep 06 10:57:40 storage-03 origin-node[16213]: I0906 10:57:40.838151   16213 kuberuntime_container.go:547] Killing container "docker://26e16f9d2eadbf12df353d91c5c1083f29b1600e10c6454d597685
Sep 06 10:57:40 storage-03 dockerd-current[2670]: time="2018-09-06T10:57:40.839963370+12:00" level=warning msg="container kill failed because of 'container not found' or 'no such process':
Sep 06 10:57:40 storage-03 dockerd-current[2670]: time="2018-09-06T10:57:40.926374087+12:00" level=error msg="containerd: deleting container" error="exit status 1: \"container 26e16f9d2eadb
Sep 06 10:57:40 storage-03 kernel: XFS (dm-6): Unmounting Filesystem
Sep 06 10:57:40 storage-03 dockerd-current[2670]: time="2018-09-06T10:57:40.973830415+12:00" level=warning msg="26e16f9d2eadbf12df353d91c5c1083f29b1600e10c6454d597685e880c9b738 cleanup: fai
Sep 06 10:57:40 storage-03 origin-node[16213]: E0906 10:57:40.982769   16213 kuberuntime_container.go:65] Can't make a ref to pod "sync-8ldb4_openshift-node(e0ecdcd7-b15e-11e8-a976-001a4a16
Sep 06 10:57:40 storage-03 origin-node[16213]: I0906 10:57:40.997914   16213 kubelet.go:1919] SyncLoop (PLEG): "sync-8ldb4_openshift-node(e0ecdcd7-b15e-11e8-a976-001a4a160158)", event: &ple
Sep 06 10:57:41 storage-03 dockerd-current[2670]: time="2018-09-06T10:57:41.091271206+12:00" level=error msg="containerd: deleting container" error="exit status 1: \"container 7345bd6f4ad05
Sep 06 10:57:41 storage-03 kernel: XFS (dm-3): Unmounting Filesystem
Sep 06 10:57:41 storage-03 origin-node[16213]: I0906 10:57:41.135834   16213 kubelet.go:1890] SyncLoop (DELETE, "api"): "glusterfs-storage-2nwx8_dynamic-storage(16ede562-b15f-11e8-a976-001a
Sep 06 10:57:41 storage-03 origin-node[16213]: I0906 10:57:41.138914   16213 kubelet.go:1884] SyncLoop (REMOVE, "api"): "glusterfs-storage-2nwx8_dynamic-storage(16ede562-b15f-11e8-a976-001a
Sep 06 10:57:41 storage-03 origin-node[16213]: I0906 10:57:41.138972   16213 kubelet.go:2078] Failed to delete pod "glusterfs-storage-2nwx8_dynamic-storage(16ede562-b15f-11e8-a976-001a4a160
Sep 06 10:57:41 storage-03 dockerd-current[2670]: time="2018-09-06T10:57:41.140823337+12:00" level=warning msg="7345bd6f4ad0594740e865e78913d067a5fb407d0367c148cbf3bbd960ed7129 cleanup: fai
Sep 06 10:57:41 storage-03 origin-node[16213]: I0906 10:57:41.145953   16213 eviction_manager.go:157] eviction manager: pods sync-8ldb4_openshift-node(e0ecdcd7-b15e-11e8-a976-001a4a160158)
Sep 06 10:57:41 storage-03 origin-node[16213]: I0906 10:57:41.160764   16213 kubelet.go:1874] SyncLoop (ADD, "api"): "glusterfs-storage-vs5jr_dynamic-storage(17a5e474-b15f-11e8-a976-001a4a1
Sep 06 10:57:41 storage-03 origin-node[16213]: W0906 10:57:41.160835   16213 eviction_manager.go:143] Failed to admit pod glusterfs-storage-vs5jr_dynamic-storage(17a5e474-b15f-11e8-a976-001
Sep 06 10:57:41 storage-03 origin-node[16213]: I0906 10:57:41.727741   16213 kubelet.go:1890] SyncLoop (DELETE, "api"): "prometheus-node-exporter-z68mb_openshift-metrics(174b83c2-b15f-11e8-
Sep 06 10:57:41 storage-03 origin-node[16213]: I0906 10:57:41.733520   16213 kubelet.go:1884] SyncLoop (REMOVE, "api"): "prometheus-node-exporter-z68mb_openshift-metrics(174b83c2-b15f-11e8-
Sep 06 10:57:41 storage-03 origin-node[16213]: I0906 10:57:41.737574   16213 kubelet.go:2078] Failed to delete pod "prometheus-node-exporter-z68mb_openshift-metrics(174b83c2-b15f-11e8-a976-
Sep 06 10:57:41 storage-03 origin-node[16213]: I0906 10:57:41.757900   16213 kubelet.go:1874] SyncLoop (ADD, "api"): "prometheus-node-exporter-v5chh_openshift-metrics(18007882-b15f-11e8-a97
Sep 06 10:57:41 storage-03 origin-node[16213]: W0906 10:57:41.757977   16213 eviction_manager.go:143] Failed to admit pod prometheus-node-exporter-v5chh_openshift-metrics(18007882-b15f-11e8
Sep 06 10:57:42 storage-03 origin-node[16213]: I0906 10:57:42.009742   16213 kubelet.go:1919] SyncLoop (PLEG): "sync-8ldb4_openshift-node(e0ecdcd7-b15e-11e8-a976-001a4a160158)", event: &ple
Sep 06 10:57:42 storage-03 origin-node[16213]: W0906 10:57:42.009871   16213 pod_container_deletor.go:77] Container "7345bd6f4ad0594740e865e78913d067a5fb407d0367c148cbf3bbd960ed7129" not fo
Sep 06 10:57:42 storage-03 origin-node[16213]: I0906 10:57:42.139312   16213 kubelet.go:1890] SyncLoop (DELETE, "api"): "sync-8ldb4_openshift-node(e0ecdcd7-b15e-11e8-a976-001a4a160158)"
Sep 06 10:57:42 storage-03 origin-node[16213]: I0906 10:57:42.145085   16213 kubelet.go:1884] SyncLoop (REMOVE, "api"): "sync-8ldb4_openshift-node(e0ecdcd7-b15e-11e8-a976-001a4a160158)"
Sep 06 10:57:42 storage-03 origin-node[16213]: I0906 10:57:42.145170   16213 kubelet.go:2078] Failed to delete pod "sync-8ldb4_openshift-node(e0ecdcd7-b15e-11e8-a976-001a4a160158)"