Bug 1691516 - aws_route.to_nat_gw.1: Error creating route: timeout while waiting for state to become 'success'
Summary: aws_route.to_nat_gw.1: Error creating route: timeout while waiting for state to become 'success'
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Abhinav Dahiya
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-21 19:25 UTC by bpeterse
Modified: 2019-04-01 06:26 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Attachments
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC (deleted), attached 2019-03-22 06:31 UTC by W. Trevor King


Links
Github openshift/installer pull 1442: last updated 2019-03-28 22:14:07 UTC
Github openshift/installer pull 1481: last updated 2019-03-28 22:14:07 UTC
Github openshift/release pull 3285: last updated 2019-03-28 22:14:07 UTC

Description bpeterse 2019-03-21 19:25:52 UTC
Description of problem:

The upgrade job release-openshift-origin-installer-e2e-aws-upgrade-4.0 failed.

A snippet of the error logs:

```
Installing from initial release registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-03-21-152907
level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating infrastructure resources..."
level=error
level=error msg="Error: Error applying plan:"
level=error
level=error msg="1 error occurred:"
level=error msg="\t* module.vpc.aws_route.to_nat_gw[1]: 1 error occurred:"
level=error msg="\t* aws_route.to_nat_gw.1: Error creating route: timeout while waiting for state to become 'success' (timeout: 2m0s)"
level=error
level=error
```
as seen in https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/437

Version-Release number of the following components:
4.1

How reproducible:
Upgrade Test

Comment 1 Abhinav Dahiya 2019-03-21 22:40:12 UTC
The installer failed because of rate limiting of the AWS APIs.

from https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/437/artifacts/e2e-aws-upgrade/installer/.openshift_install.log

There are a bunch of rate-limit-exceeded errors; a rough counting sketch follows the excerpt:
```
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.098Z [DEBUG] plugin.terraform-provider-aws: 2019/03/21 17:41:23 [DEBUG] [aws-sdk-go] DEBUG: Retrying Request ec2/CreateRoute, attempt 4"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: 2019/03/21 17:41:23 [DEBUG] [aws-sdk-go] DEBUG: Request ec2/CreateRoute Details:"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: ---[ REQUEST POST-SIGN ]-----------------------------"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: POST / HTTP/1.1"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Host: ec2.us-east-1.amazonaws.com"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: User-Agent: aws-sdk-go/1.16.6 (go1.10.3; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.10"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Content-Length: 140"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Authorization: AWS4-HMAC-SHA256 Credential=AKIAJ7WKPAWT2BW5KB2A/20190321/us-east-1/ec2/aws4_request, SignedHeaders=content-length;content-type;host;x-amz-date, Signature=c148d75dc7e79addd523dfaaafcb168ed64afb820efd9819f1134e75ba155d9d"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Content-Type: application/x-www-form-urlencoded; charset=utf-8"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: X-Amz-Date: 20190321T174123Z"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Accept-Encoding: gzip"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: Action=CreateRoute&DestinationCidrBlock=0.0.0.0%2F0&NatGatewayId=nat-0092145b40c62f90d&RouteTableId=rtb-0cc2b71acf7913565&Version=2016-11-15"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.099Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: 2019/03/21 17:41:23 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/CreateRoute Details:"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: ---[ RESPONSE ]--------------------------------------"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: HTTP/1.1 503 Service Unavailable"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: Connection: close"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: Transfer-Encoding: chunked"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: Date: Thu, 21 Mar 2019 17:41:23 GMT"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: Server: AmazonEC2"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: "
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: -----------------------------------------------------"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: 2019/03/21 17:41:23 [DEBUG] [aws-sdk-go] <?xml version=\"1.0\" encoding=\"UTF-8\"?>"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: <Response><Errors><Error><Code>RequestLimitExceeded</Code><Message>Request limit exceeded.</Message></Error></Errors><RequestID>4e4a313a-0f6a-44d2-a5f4-a30e85bc38e9</RequestID></Response>"
time="2019-03-21T17:41:23Z" level=debug msg="2019-03-21T17:41:23.255Z [DEBUG] plugin.terraform-provider-aws: 2019/03/21 17:41:23 [DEBUG] [aws-sdk-go] DEBUG: Validate Response ec2/CreateRoute failed, not retrying, error RequestLimitExceeded: Request limit exceeded."
```
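
As a rough, hypothetical illustration of counting those throttled calls (the log path and line formats are assumptions based on the excerpt above, not part of the installer):

```
#!/usr/bin/env python3
# Hypothetical helper: count aws-sdk-go retries per EC2 action and total
# RequestLimitExceeded responses in a .openshift_install.log debug log.
import collections
import re
import sys

retry_re = re.compile(r'Retrying Request (ec2/\w+), attempt \d+')

retries = collections.Counter()
throttled = 0
with open(sys.argv[1] if len(sys.argv) > 1 else '.openshift_install.log') as log:
    for line in log:
        if 'RequestLimitExceeded' in line:
            throttled += 1
        match = retry_re.search(line)
        if match:
            retries[match.group(1)] += 1

print(f'{throttled} lines mentioning RequestLimitExceeded')
for action, count in retries.most_common():
    print(f'{count}\t{action} retries')
```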

Comment 2 W. Trevor King 2019-03-22 06:27:27 UTC
Previous discussion of AWS rate limiting vs. Terraform in bug 1690069.

Comment 3 W. Trevor King 2019-03-22 06:31:53 UTC
Created attachment 1546785 [details]
Occurrences of this error in CI from 2019-03-19T12:28 to 2019-03-21T20:06 UTC

This occurred in 73 of our 861 failures in *-e2e-aws* jobs across the whole CI system over the past 55 hours.  Generated with [1]:

  $ deck-build-log-plot 'aws_route.to_nat_gw.* Error creating route: timeout while waiting for state'

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log
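
The linked deck-build-log scripts do the real work; purely as a hypothetical sketch of the counting step (the directory layout and build-log.txt filenames are assumptions):

```
#!/usr/bin/env python3
# Hypothetical sketch: count failed e2e-aws job runs whose build log matches
# the route-creation timeout, given a local mirror of build logs.
import pathlib
import re
import sys

pattern = re.compile(
    r"aws_route\.to_nat_gw.* Error creating route: timeout while waiting for state")

root = pathlib.Path(sys.argv[1] if len(sys.argv) > 1 else 'failed-jobs')
logs = sorted(root.glob('**/build-log.txt'))
hits = [path for path in logs if pattern.search(path.read_text(errors='replace'))]

print(f'{len(hits)} of {len(logs)} failed job logs matched')
for path in hits:
    print(path)
```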

Comment 4 W. Trevor King 2019-03-28 22:14:07 UTC
Today we landed a few PRs to help mitigate this:

* installer#1442 bumped our Terraform AWS provider up to v2.2, which may have brought along some efficiency improvements.  I don't have anything specific here about what those might be, beyond thinking "surely they must have fixed some useful bugs since v1.52 went out on 2018-12-13" :p.
* installer#1481 and release#3285 together should have halved our per-zone resource needs, with an associated reduction in AWS API calls and in the likelihood of a given cluster hitting an overly-throttled resource.

It's hard to measure the impact separately, since we didn't get a lot of soak time on the Terraform bump before the others landed, but grabbing the last 40 minutes of API consumers, we have:

$ agents-and-events.py <last-40m.csv | grep '^[0-9]'
13172	aws-sdk-go/1.14.12:
11062	terraform:
6828	destroyer:
6319	openshift-machine-api:
3096	cloud-credential-operator-iam-ro:
3051	aws-elasticloadbalancing:
2521	cloud-credential-operator:
...small fry...

That makes Kubernetes our largest AWS API consumer, with the installer (via 'terraform' for creation and 'destroyer' for deletion) seated firmly in second.  I think that's sufficient to mark this MODIFIED, although I'm not sure what the process is for getting it closed from there.  Do we need to get below a certain fraction of these errors over a certain duration of CI jobs?
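
For reference, a minimal sketch of the per-agent tally above (the CloudTrail-style CSV, its 'userAgent' column name, and the first-token truncation are all assumptions; this is not the actual agents-and-events.py):

```
#!/usr/bin/env python3
# Hypothetical sketch: tally AWS API events per user agent from a
# CloudTrail-style CSV on stdin, e.g. `tally.py <last-40m.csv`.
import collections
import csv
import sys

counts = collections.Counter()
for row in csv.DictReader(sys.stdin):
    # Keep only the first token, e.g. "aws-sdk-go/1.14.12" or "terraform".
    tokens = (row.get('userAgent') or 'unknown').split()
    counts[tokens[0] if tokens else 'unknown'] += 1

for agent, count in counts.most_common():
    print(f'{count}\t{agent}')
```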

Comment 6 W. Trevor King 2019-03-29 19:45:42 UTC
Checking rates today, there is only a small decrease in the rate of NAT-gateway creation:

$ grep 'CreateNatGateway\|DescribeNatGateways\|CreateVpc$' api-consumers-2019-03-2*
api-consumers-2019-03-25.txt:	15585	ec2.DescribeNatGateways
api-consumers-2019-03-25.txt:	1085	ec2.CreateNatGateway
api-consumers-2019-03-25.txt:	152	ec2.CreateVpc
api-consumers-2019-03-26.txt:	66645	ec2.DescribeNatGateways
api-consumers-2019-03-26.txt:	5370	ec2.CreateNatGateway
api-consumers-2019-03-26.txt:	978	ec2.CreateVpc
api-consumers-2019-03-29.txt:	61463	ec2.DescribeNatGateways
api-consumers-2019-03-29.txt:	4466	ec2.CreateNatGateway
api-consumers-2019-03-29.txt:	886	ec2.CreateVpc

with API-consumer files generated with [1].  That's 7.1, 5.5, and 5.0 CreateNatGateway calls per VPC for each of the days where I happened to sync down the nightly 24-hour report.  5.5 to 5.0 is not a particularly inspiring reduction, when I had expected 6 -> 3.  Spot-checking a recent installer PR, we are only creating three NAT gateways there:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1483/pull-ci-openshift-installer-master-e2e-aws/4798/artifacts/e2e-aws/installer/.openshift_install.log | grep 'nat_gateway.*Creation'
time="2019-03-29T17:27:36Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[1]: Creation complete after 1m36s (ID: nat-02e3e9699e4cda551)"
time="2019-03-29T17:27:57Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[2]: Creation complete after 1m56s (ID: nat-065ed68f2030ecf80)"
time="2019-03-29T17:28:08Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[0]: Creation complete after 2m7s (ID: nat-08101d95f8e9200f6)"

Same for origin release-promotion gates:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6317/artifacts/e2e-aws/installer/.openshift_install.log | grep 'nat_gateway.*Creation'
time="2019-03-29T18:30:25Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[2]: Creation complete after 1m58s (ID: nat-0851535a532052f00)"
time="2019-03-29T18:30:43Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[1]: Creation complete after 2m16s (ID: nat-0d79a329860f3df66)"
time="2019-03-29T18:30:44Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[0]: Creation complete after 2m16s (ID: nat-0123c2fbfdb2098de)"

I'm hunting around to see if I can figure out where all the unexpected CreateNatGateway calls are coming from.

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/agents-and-events
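
And a small hypothetical script for the calls-per-VPC arithmetic above (the two-column count/action line format is inferred from the grep output; 1085 / 152 ≈ 7.1, and so on):

```
#!/usr/bin/env python3
# Hypothetical sketch: CreateNatGateway calls per CreateVpc call for each
# nightly api-consumers-*.txt report, assuming "count<TAB>ec2.Action" lines.
import glob
import re

line_re = re.compile(r'^\s*(\d+)\s+ec2\.(\w+)\s*$')

for path in sorted(glob.glob('api-consumers-2019-03-*.txt')):
    counts = {}
    with open(path) as report:
        for line in report:
            match = line_re.match(line)
            if match:
                counts[match.group(2)] = int(match.group(1))
    vpcs = counts.get('CreateVpc', 0)
    if vpcs:
        ratio = counts.get('CreateNatGateway', 0) / vpcs
        print(f'{path}: {ratio:.1f} CreateNatGateway calls per VPC')
```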

Comment 7 W. Trevor King 2019-03-29 21:26:10 UTC
OK, after we exclude errors, the rate is down to ~4 CreateNatGateway calls per VPC [1].  I'm still not sure why that isn't three...

[1]: https://raw.githubusercontent.com/wking/openshift-release/91cde42964c0764985116dd447fe14a4861d0900/vpcs/calls-per-vpc.png

Comment 8 W. Trevor King 2019-03-29 22:13:53 UTC
The previous version of the plot script was confused by the stacked histogram.  Here's a fixed graph [1], generated with [2], showing that the drop is from 6 to 3.


[1]: https://raw.githubusercontent.com/wking/openshift-release/46a3de9c3a8ef2e15d4c1e1f84e80a4e95d70105/vpcs/calls-per-vpc.png
[2]: https://github.com/wking/openshift-release/commit/46a3de9c3a8ef2e15d4c1e1f84e80a4e95d70105
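
For anyone re-plotting locally, a minimal sketch (not the script from [2]) using the raw per-day counts quoted in comment 6, which still include errored and retried calls:

```
#!/usr/bin/env python3
# Hypothetical plotting sketch: raw CreateNatGateway calls per CreateVpc call
# for the three nightly reports quoted in comment 6 (errors included).
import matplotlib.pyplot as plt

days = ['2019-03-25', '2019-03-26', '2019-03-29']
calls_per_vpc = [1085 / 152, 5370 / 978, 4466 / 886]

plt.plot(days, calls_per_vpc, marker='o')
plt.xlabel('nightly 24-hour report')
plt.ylabel('CreateNatGateway calls per CreateVpc')
plt.title('NAT gateway API calls per VPC (errors included)')
plt.tight_layout()
plt.savefig('calls-per-vpc-sketch.png')
```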

Comment 9 Johnny Liu 2019-04-01 06:26:12 UTC
Verified this bug with openshift-install v4.0.22-201903272149-dirty (extracted from 4.0.0-0.nightly-2019-03-29-144824) plus the 4.0.0-0.nightly-2019-03-28-030453 release payload.  The issue is not reproduced, so moving this to VERIFIED.

