Bug 1690087 - Could not reach HTTP service through after 2m0s
Summary: Could not reach HTTP service through adf61ce70497f11e99af60e4be0b8886-6762599...
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.2.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
Depends On:
Reported: 2019-03-18 18:42 UTC by Ben Parees
Modified: 2019-04-11 13:07 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed:
Target Upstream Version:

Attachments:
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC (deleted)
2019-03-22 05:49 UTC, W. Trevor King

Description Ben Parees 2019-03-18 18:42:36 UTC
Description of problem:
ELB became inaccessible during upgrade testing

fail []: Mar 18 13:37:27.884: Could not reach HTTP service through after 2m0s

Version-Release number of selected component (if applicable):

How reproducible:
appears to be a flake

Comment 2 Nick Stielau 2019-03-21 19:43:43 UTC
If this is a flake, as noted above, we should know what the flake rate is. When interacting with external services, we will encounter things outside of our control. If this happens infrequently, I think we can move forward with the beta and consider mitigating this by recreating or waiting longer.
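
The "waiting longer" mitigation amounts to polling the endpoint until a deadline passes. A minimal Python sketch of that idea follows. It is not the e2e test's code: the names `wait_until_reachable`, `probe`, `timeout_s`, and `interval_s` are hypothetical, and the 2-second polling interval is an assumption. Only the 2m0s default comes from the failure message.

```python
import time
from typing import Callable


def wait_until_reachable(
    probe: Callable[[], bool],
    timeout_s: float = 120.0,  # 2m0s, as in the failure message
    interval_s: float = 2.0,   # assumed polling interval, not from the test
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused, DNS failure, and similar errors are
            # treated as "not reachable yet" and retried.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A real probe could wrap `urllib.request.urlopen(url, timeout=5)`; `urllib.error.URLError` is a subclass of `OSError`, so the handler above would catch it. Raising `timeout_s` is the "waiting longer" option, though it only hides the flake rather than diagnosing it.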

Comment 4 Ben Parees 2019-03-21 19:48:14 UTC
> I do not think this is not an issue of interacting w/ external services outside of our control.

Double negatives... I meant to say: I do not think this is an issue of interacting with external services outside of our control.

Comment 5 W. Trevor King 2019-03-22 05:49:36 UTC
Created attachment 1546775 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 5 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'Could not reach HTTP service through .* after'
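
For readers without access to the tool, here is a rough Python sketch of what the pattern matches and of the share implied by the counts above. The log lines and hostname are hypothetical placeholders. Only the regular expression and the numbers 5 and 861 come from this comment.

```python
import re

# Pattern copied from the deck-build-log-plot invocation above.
PATTERN = re.compile(r"Could not reach HTTP service through .* after")

# Hypothetical log lines; the hostname is a placeholder, not a real ELB name.
lines = [
    "fail []: Could not reach HTTP service through elb.example.invalid after 2m0s",
    "fail []: Could not reach HTTP service through  after 2m0s",  # empty host, two spaces
    "ok: some unrelated log line",
]
matches = [line for line in lines if PATTERN.search(line)]

# Share of failing *-e2e-aws* jobs that hit this error, from the counts above.
occurrences, failures = 5, 861
share = occurrences / failures
print(f"{len(matches)} matching lines; {share:.2%} of failures")
# prints "2 matching lines; 0.58% of failures"
```

Note that this is the share of failed jobs, not of all job runs. The total number of runs is not given here, so the overall flake rate cannot be derived from these two numbers alone.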


Comment 6 Nick Stielau 2019-03-22 16:09:14 UTC
At a 0.05% failure rate, I think we can take this off the TestBlocker and BetaBlocker lists. Can anyone +1?

But it is definitely still a bug.

Comment 7 Ben Parees 2019-03-22 17:04:16 UTC
I'm inclined to agree, but I think someone from the networking team needs to make the final call based on their understanding of the potential cause and implications.

Comment 8 Miciah Dashiel Butler Masters 2019-03-22 17:31:42 UTC
I have not reproduced the issue yet, and CI does not have pod logs until after the initial point of failure.  I did find an upstream report that looks related, indicating that this is a known, as-yet undiagnosed flake upstream:

Comment 9 Dan Mace 2019-03-22 17:32:47 UTC
+1 to removing it from the blocker list. We need to stress test the upstream LB bits from `test/e2e/upgrades/services.go` in our AWS environments to reproduce and diagnose. I don't have enough data at this time to attribute the problem to external DNS, the SDN, etc.

The failure does not seem directly related to ingress controllers: the test creates a bare LoadBalancer Service backed by custom endpoints that it sets up in a temporary namespace. However, we must understand the problem, since upstream LoadBalancer Services are key to our ingress controller implementation, and any flakiness around that "primitive" will introduce downstream instability in our ingress solution.

Comment 10 Nick Stielau 2019-03-22 17:34:09 UTC
Removing BetaBlocker
