Bug 1690085 - network operator failed to upgrade
Summary: network operator failed to upgrade
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.1.0
Assignee: Dan Winship
QA Contact: Meng Bo
Depends On:
Reported: 2019-03-18 18:21 UTC by Ben Parees
Modified: 2019-04-09 14:14 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2019-04-09 14:14:18 UTC
Target Upstream Version:


Description Ben Parees 2019-03-18 18:21:13 UTC
Description of problem:

Network operator failed during upgrade testing.

Version-Release number of selected component (if applicable):

How reproducible:
Seems to flake in the upgrade test

Mar 18 16:48:11.708 E clusterversion/version changed Failing to True: ClusterOperatorNotAvailable: Cluster operator network is still updating

As seen in:

Comment 2 Meng Bo 2019-03-21 11:13:46 UTC
We did not hit this issue during our upgrade tests.

Comment 3 W. Trevor King 2019-03-22 05:46:27 UTC
Created attachment 1546774 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This has caused 26 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'clusterversion/version changed Failing to True: .*Cluster operator network is still updating'
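For readers without that tool, here is a minimal Python sketch of the same kind of match, assuming the build log is available as a list of text lines. The pattern is copied from the command above and the first sample line comes from the bug description. The function name `count_matches` and the second sample line are hypothetical. This is not how deck-build-log-plot itself works.

```python
import re

# Pattern copied from the deck-build-log-plot invocation in this comment.
PATTERN = re.compile(
    r"clusterversion/version changed Failing to True: "
    r".*Cluster operator network is still updating"
)


def count_matches(lines):
    """Return the number of log lines that match PATTERN (hypothetical helper)."""
    return sum(1 for line in lines if PATTERN.search(line))


# First line is taken from the bug description; the second is a made-up non-match.
sample_log = [
    "Mar 18 16:48:11.708 E clusterversion/version changed Failing to True: "
    "ClusterOperatorNotAvailable: Cluster operator network is still updating",
    "Mar 18 16:48:12.000 I some unrelated event (hypothetical)",
]

print(count_matches(sample_log))
```

A count above one within a single run would correspond to the "logged multiple times" case, as opposed to the single occurrence seen in a successful run.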


Comment 4 Dan Winship 2019-03-25 21:02:00 UTC
That "error" message is not really an error; it gets reported even in successful runs. It seems that the CVO has some arbitrary idea of how long it's supposed to take an operator to upgrade, and will start logging "errors" once it goes beyond that. Several operators seem to reliably take longer than that.

OTOH, in that successful run, it only logged that "network is still updating" once, whereas in the failed run here, it logged that multiple times. But it *did* eventually complete;  clusteroperators.json shows that the network operator did update successfully. Unfortunately the e2e-aws-upgrade job doesn't capture artifacts at all stages of the update, and the only cluster-network-operator log in the artifacts for that run started at 17:26:53, which is 20 minutes after the last time the build log shows the "Cluster operator network is still updating" message. So there's no information about why it took longer than normal.

Looking at the sdn logs, I see that openshift-sdn_sdn-2z5gt_sdn_previous.log starts at 17:10:19, which is shortly after the last "network is still updating" warning. So, yup, it was still updating at that point, but we don't really know why. Maybe the nodes were being slow to reboot, so it was slow to redeploy the SDN to them? The workers-journal/masters-journal files don't go back that far...
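The timeline comparison above can be sketched roughly as follows, assuming build-log lines start with a timestamp in the format shown in the description ("Mar 18 16:48:11.708"). The helper `last_seen`, the assumed year, and the second sample line (including its 17:06:53 time) are hypothetical and only there to illustrate the arithmetic. Only the 17:26:53 operator log start time comes from this comment.

```python
import re
from datetime import datetime

MESSAGE = "Cluster operator network is still updating"
# Build-log lines in the description begin like "Mar 18 16:48:11.708".
STAMP = re.compile(r"^(\w{3} +\d{1,2} \d{2}:\d{2}:\d{2})\.\d+")


def last_seen(lines, year=2019):
    """Return the datetime of the last line containing MESSAGE, or None.

    The build log carries no year, so one is assumed (hypothetical helper).
    """
    latest = None
    for line in lines:
        m = STAMP.match(line)
        if m and MESSAGE in line:
            latest = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return latest


build_log = [
    # From the bug description.
    "Mar 18 16:48:11.708 E clusterversion/version changed Failing to True: "
    "ClusterOperatorNotAvailable: Cluster operator network is still updating",
    # Hypothetical later occurrence, for illustration only.
    "Mar 18 17:06:53.000 E clusterversion/version changed Failing to True: "
    "ClusterOperatorNotAvailable: Cluster operator network is still updating",
]

# Start of the only cluster-network-operator log in the artifacts (from this comment).
operator_log_start = datetime(2019, 3, 18, 17, 26, 53)

gap = operator_log_start - last_seen(build_log)
print(gap)
```

A positive gap like this means the captured operator log starts only after the last warning, so it cannot explain the slow update.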

Comment 5 Dan Winship 2019-03-26 12:30:34 UTC
Filed a GitHub issue about the upgrade logs.

Comment 7 Ben Bennett 2019-04-09 13:12:25 UTC
Dan, do you think there is anything to do here?  Should we close it?  Or assign to a later release?

Comment 8 Dan Winship 2019-04-09 14:14:18 UTC
Close, I guess. There's nothing actionable here besides maybe "e2e-aws-upgrade needs better artifacts", which I filed as a GitHub issue.
