Bug 1510174

Summary: occasional restart of atomic-openshift-master-controllers.service due to scheduler cache corruption
Product: OpenShift Container Platform
Component: Master
Version: 3.7.0
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Target Milestone: ---
Target Release: 3.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: aos-scalability-37
Reporter: Vikas Laad <vlaad>
Assignee: Jordan Liggitt <jliggitt>
QA Contact: Vikas Laad <vlaad>
CC: aos-bugs, eparis, jokerman, mfojtik, mifiedle, mmccomas
Doc Type: No Doc Update
Type: Bug
Last Closed: 2018-03-28 14:11:22 UTC

Description Vikas Laad 2017-11-06 20:44:39 UTC
Description of problem:
I am running reliability tests on a 3.7 cluster and see occasional restarts of atomic-openshift-master-controllers.service in the following pattern:

Nov 02 04:15:33 ip-172-31-29-26.us-west-2.compute.internal atomic-openshift-master-controllers[93589]: F1102 04:15:33.881293   93600 cache.go:264] Schedulercache is corrupted and can badly affect scheduling decisions
Nov 02 04:15:33 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: atomic-openshift-master-controllers.service: main process exited, code=exited, status=255/n/a
Nov 02 04:15:33 ip-172-31-29-26.us-west-2.compute.internal atomic-openshift-master-controllers[16013]: container "atomic-openshift-master-controllers" does not exist
Nov 02 04:15:33 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: atomic-openshift-master-controllers.service: control process exited, code=exited status=1
Nov 02 04:15:34 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: Unit atomic-openshift-master-controllers.service entered failed state.
Nov 02 04:15:34 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: atomic-openshift-master-controllers.service failed.
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: atomic-openshift-master-controllers.service holdoff time over, scheduling restart.
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: Starting atomic-openshift-master-controllers.service...
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal systemd[1]: Started atomic-openshift-master-controllers.service.
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal atomic-openshift-master-controllers[16049]: I1102 04:15:39.427994   16060 plugins.go:77] Registered admission plugin "NamespaceLifecycle"
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal atomic-openshift-master-controllers[16049]: W1102 04:15:39.429207   16060 start_master.go:290] Warning: assetConfig.loggingPublicURL: Invalid value: "": required to view aggregated container logs in the console, master start will continue.
Nov 02 04:15:39 ip-172-31-29-26.us-west-2.compute.internal atomic-openshift-master-controllers[16049]: W1102 04:15:39.429234   16060 start_master.go:290] Warning: assetConfig.metricsPublicURL: Invalid value: "": required to view cluster metrics in the console, master start will continue.

Version-Release number of selected component (if applicable):
openshift v3.7.0-0.178.0
kubernetes v1.7.6+a08f5eeb62
etcd 3.2.8


How reproducible:
Always

Steps to Reproduce:
1. Keep creating/updating/building/scaling quickstart apps on the cluster
2. Watch the master logs

Actual results:
The master controllers restart occasionally.

Expected results:
The master controllers should not restart.

Additional info:
See master logs attached.

Comment 2 Mike Fiedler 2017-11-07 03:11:40 UTC
In the logs referenced in comment 1, master-controllers restarted 15 times in 5 days due to the fatal scheduler cache corruption error.
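For anyone who wants to reproduce a count like this from their own master logs, here is a minimal sketch. It assumes the log lines have the same shape as the journal excerpt in the description (a klog fatal line from cache.go containing "Schedulercache is corrupted"). The function name and the per-day grouping are illustrative assumptions, not part of any OpenShift tooling.

```python
import re
from collections import Counter

# Matches the klog fatal line shown in the description, for example:
# "... F1102 04:15:33.881293   93600 cache.go:264] Schedulercache is corrupted ..."
# The two captured groups are the month and day from the klog prefix (F<MM><DD>).
FATAL_RE = re.compile(
    r"\bF(\d{2})(\d{2}) \d{2}:\d{2}:\d{2}\.\d+\s+\d+ "
    r"cache\.go:\d+\] Schedulercache is corrupted"
)


def count_corruption_fatals(lines):
    """Return a Counter mapping 'MM-DD' to the number of fatal lines seen that day.

    Hypothetical helper: `lines` is any iterable of log lines, such as an open
    file or a list of strings.
    """
    per_day = Counter()
    for line in lines:
        match = FATAL_RE.search(line)
        if match:
            per_day[f"{match.group(1)}-{match.group(2)}"] += 1
    return per_day
```

The function takes any iterable of lines, so it can be fed an open log file or the saved output of `journalctl -u atomic-openshift-master-controllers`. Summing the counter values gives the total number of corruption fatals, each of which corresponds to one restart in the pattern shown in the description.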

Comment 3 Jordan Liggitt 2017-11-07 14:10:08 UTC
https://github.com/kubernetes/kubernetes/issues/50916

Comment 4 Jordan Liggitt 2017-11-07 20:15:49 UTC
Fix in https://github.com/kubernetes/kubernetes/pull/55262

Comment 5 Michal Fojtik 2017-12-07 09:16:45 UTC
Pick: https://github.com/openshift/origin/pull/17656

Comment 6 Jordan Liggitt 2018-01-09 20:04:48 UTC
Will be fixed by 1.9.1 rebase in https://github.com/openshift/origin/pull/18003

Comment 8 Mike Fiedler 2018-01-15 20:43:40 UTC
Assigning QA to @vlaad. This will be verified in the 3.9 reliability runs.

Comment 9 Vikas Laad 2018-01-22 16:08:25 UTC
Verified in the following version:

openshift v3.9.0-0.20.0
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.8


I do not see restarts of the master-controllers process anymore.

Comment 12 errata-xmlrpc 2018-03-28 14:11:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489