
Bug 1694216

Summary: openshift controller manager failed to upgrade
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Build
Assignee: Corey Daley <cdaley>
Status: VERIFIED ---
QA Contact: wewang <wewang>
Severity: unspecified
Priority: unspecified
Docs Contact:
Version: 4.1
CC: adam.kaplan, aos-bugs, wzheng
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Ben Parees 2019-03-29 19:46:42 UTC
Description of problem:
Mar 29 18:11:08.848: INFO: cluster upgrade is failing: Cluster operator openshift-controller-manager is still updating

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/753


How reproducible:
flake


Note: We are seeing this form of failure for several operators; I will cross-link them as I open the bugs.

I have not dug into whether this operator specifically failed to upgrade, or if something earlier in the process took so long that this operator was the "victim" of the eventual timeout.  As you investigate the job that failed, feel free to reassign this if you think there is a root cause that is independent of your operator.

If your operator currently lacks sufficient events/logs/etc to determine when it started upgrading and what it was doing when we timed out, consider using this bug to introduce that information.
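For illustration only (this is not code from the operator; the event reason string and the FakeRecorder setup are assumptions made to keep the snippet self-contained), one way to surface that information is to emit a Kubernetes event as soon as the rollout of a new daemonset generation is detected, so the upgrade timeline shows up in oc get events:

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/record"
)

func main() {
	// A daemonset whose spec was just bumped by the upgrade; the controller
	// has not observed the new generation yet.
	ds := &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:       "controller-manager",
			Namespace:  "openshift-controller-manager",
			Generation: 7,
		},
	}
	ds.Status.ObservedGeneration = 6

	// A real operator would wire an EventRecorder to the API server; the
	// fake recorder keeps this sketch runnable on its own.
	recorder := record.NewFakeRecorder(10)
	recorder.Eventf(ds, corev1.EventTypeNormal, "UpgradeRolloutStarted",
		"daemonset/%s: rolling out generation %d (observed %d)",
		ds.Name, ds.Generation, ds.Status.ObservedGeneration)

	fmt.Println(<-recorder.Events)
}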

Comment 1 Ben Parees 2019-03-29 19:54:10 UTC
Related "operator failed to upgrade" bug: https://bugzilla.redhat.com/show_bug.cgi?id=1694220

Comment 2 Ben Parees 2019-03-29 19:56:24 UTC
Related "operator failed to upgrade" bug: https://bugzilla.redhat.com/show_bug.cgi?id=1694222

Comment 4 Adam Kaplan 2019-04-01 15:01:28 UTC
Potential related/duplicate bug: https://bugzilla.redhat.com/show_bug.cgi?id=1691505

Comment 5 wewang 2019-04-02 07:56:12 UTC
I updated the version from 4.0.0-0.nightly-2019-03-25-180911 to 4.0.0-0.nightly-2019-03-28-030453 and did not hit "Cluster operator openshift-controller-manager is still updating"; the pods in openshift-controller-manager are running. So how can I reproduce it?
$ oc get clusteroperator/openshift-controller-manager  
NAME                           VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
openshift-controller-manager   4.0.0-0.nightly-2019-03-28-030453   True        False         False     23m

Comment 6 wewang 2019-04-02 08:08:45 UTC
FYI, pods are also running in the namespace openshift-controller-manager-cluster, and the images are updated. I saw it's a flake bug, so it's hard to reproduce, right?

Comment 7 Adam Kaplan 2019-04-03 20:39:19 UTC
@wewang it may not be consistent - it depends on how long it takes the openshift-controller-manager daemonset to roll out. We've observed this enough to know that it is a bug - at minimum we must have Progressing=true on the cluster operator instance during the daemonset rollout.
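As a rough sketch of that minimum fix (illustrative only, not the actual operator change; the function name and field choices here are assumptions), the operator could report Progressing=true whenever the controller-manager daemonset's observed generation lags its desired generation or its pods are not yet fully updated, matching the generation comparison that later appears in the Progressing message in comment 11:

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// daemonSetProgressing reports whether the daemonset rollout is still in
// flight and returns a human-readable message for the Progressing condition.
func daemonSetProgressing(ds *appsv1.DaemonSet) (bool, string) {
	if ds.Status.ObservedGeneration < ds.Generation {
		return true, fmt.Sprintf("daemonset/%s: observed generation is %d, desired generation is %d.",
			ds.Name, ds.Status.ObservedGeneration, ds.Generation)
	}
	if ds.Status.UpdatedNumberScheduled < ds.Status.DesiredNumberScheduled {
		return true, fmt.Sprintf("daemonset/%s: %d of %d scheduled pods are updated.",
			ds.Name, ds.Status.UpdatedNumberScheduled, ds.Status.DesiredNumberScheduled)
	}
	return false, ""
}

func main() {
	// A daemonset captured mid-rollout: the controller has not yet observed
	// the latest spec generation.
	ds := &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "controller-manager", Generation: 7},
	}
	ds.Status.ObservedGeneration = 6

	progressing, msg := daemonSetProgressing(ds)
	fmt.Println(progressing, msg)
	// Output: true daemonset/controller-manager: observed generation is 6, desired generation is 7.
}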

Comment 9 Corey Daley 2019-04-15 01:39:41 UTC
Pull request has merged.

Comment 11 wewang 2019-04-16 05:38:57 UTC
Upgraded the clusterversion from 4.1.0-0.ci-2019-04-11-015355 to 4.1.0-0.ci-2019-04-15-224704. During the clusteroperator/openshift-controller-manager upgrade, while the daemonset was rolling out, Progressing was set to true, and the clusteroperator was updated to 4.1.0-0.ci-2019-04-15-224704. See the following info:


$ oc get clusteroperator/openshift-controller-manager
NAME                           VERSION                        AVAILABLE   PROGRESSING   FAILING   SINCE
openshift-controller-manager   4.1.0-0.ci-2019-04-15-224704   True        False         False     118m

$ oc logs -f pod/openshift-controller-manager-operator-856b59c84c-wb9dx -n openshift-controller-manager-operator --loglevel=5 | grep -i PROGRESSING

I0416 03:35:30.231803       1 status_controller.go:152] clusteroperator/openshift-controller-manager diff {"status":{"conditions":[{"lastTransitionTime":"2019-04-16T03:35:29Z","reason":"AsExpected","status":"False","type":"Failing"},{"lastTransitionTime":"2019-04-16T03:35:30Z","message":"Progressing: daemonset/controller-manager: observed generation is 6, desired generation is 7.\nProgressing: openshiftcontrollermanagers.operator.openshift.io/cluster: observed generation is 3, desired generation is 4.","reason":"Progressing","status":"True","type":"Progressing"},{"lastTransitionTime":"2019-04-16T03:35:29Z","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2019-04-16T03:35:29Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}],"versions":[{"name":"operator","version":"4.1.0-0.ci-2019-04-15-224704"}]}}
I0416 03:35:30.237320       1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-controller-manager-operator", Name:"openshift-controller-manager-operator", UID:"996a0c7e-5fec-11e9-bae2-0279093214fc", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for operator openshift-controller-manager changed: Progressing changed from False to True ("Progressing: daemonset/controller-manager: observed generation is 6, desired generation is 7.\nProgressing: openshiftcontrollermanagers.operator.openshift.io/cluster: observed generation is 3, desired generation is 4.")