
Bug 1690043

Summary: APIServer should return a structured error and retry-after for graceful shutdown errors
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Master
Assignee: Michal Fojtik <mfojtik>
Status: VERIFIED
QA Contact: zhou ying <yinzhou>
Severity: medium
Priority: low
Version: 4.1
CC: aos-bugs, bparees, jokerman, mfojtik, mmccomas, xxia, yinzhou
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Type: Bug

Description Clayton Coleman 2019-03-18 16:10:28 UTC
CMO reported a hard error (failing=true) on its cluster operator. This is an error it should handle by ignoring or retrying:

Mar 18 14:54:32.602 E clusteroperator/monitoring changed Failing to True: Failed to rollout the stack. Error: running task Updating Prometheus-k8s failed: reconciling Prometheus ClusterRoleBinding failed: updating ClusterRoleBinding object failed: an error on the server ("apiserver is shutting down.") has prevented the request from succeeding (put clusterrolebindings.rbac.authorization.k8s.io prometheus-k8s)

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5804

Depends on https://github.com/kubernetes/kubernetes/pull/75368 which should make it automatic.

For 4.1 we want the server to return a structured error and have client stacks gracefully retry the error, to minimize the churn caused by API restarts. This blocks GA.

Comment 1 Clayton Coleman 2019-03-18 16:10:51 UTC
Related to https://bugzilla.redhat.com/show_bug.cgi?id=1684547

Need to ensure all components are protected.

Comment 2 Michal Fojtik 2019-03-19 09:06:23 UTC
*** Bug 1690167 has been marked as a duplicate of this bug. ***

Comment 3 Michal Fojtik 2019-03-19 10:12:07 UTC
To mitigate: https://github.com/openshift/origin/pull/22355

Stefan believes we have a bug in the shutdown order, so we still need to look at that. The pick above should make the error less disturbing.

Comment 7 Michal Fojtik 2019-04-09 11:29:34 UTC
To match with upstream: https://github.com/openshift/origin/pull/22511

Comment 8 zhou ying 2019-04-10 02:38:10 UTC
No 'apiserver is shutting down' error, but there is a related error: ClusterOperatorNotAvailable: Cluster operator openshift-apiserver has not yet reported success. Not sure whether it is the same issue.

Comment 9 Michal Fojtik 2019-04-10 18:02:44 UTC
That is a different error, and it was fixed today.

Comment 10 zhou ying 2019-04-11 07:59:35 UTC
Checked the latest e2e test logs: no 'apiserver is shutting down' error. Will verify this.