Bug 1354031 - Openshift master scheduler cache had bad info, scheduler was unable to assign pod with error PodFitsPorts [NEEDINFO]
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 3.2.1
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Andy Goldstein
QA Contact: DeShuai Ma
Depends On:
Blocks: OSOPS_V3
Reported: 2016-07-08 18:29 UTC by Matt Woodson
Modified: 2016-08-09 14:22 UTC
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2016-08-09 14:22:17 UTC
Target Upstream Version:
agoldste: needinfo? (mwoodson)


Description Matt Woodson 2016-07-08 18:29:15 UTC
Description of problem:

We were trying to deploy a router after upgrading a system via a blue/green upgrade. We had an old set of nodes (blue, 3.1) and were upgrading to 3.2.1. We installed the new set of nodes (green, 3.2.1), then cut over to them. This process happened over 2-3 days and was done by multiple people, so the details are fuzzy (sorry).

We then tried to deploy the router to our infra nodes (via a node selector and node labels). When the pod tried to deploy, it went straight to an error state. We ran "oc describe pod router-1-xxxx -n default" and saw this error blocking the pod from being scheduled onto the node:

fit failure on node (ip-172-31-51-239.ec2.internal): PodFitsPorts

The node, ip-172-31-51-239.ec2.internal, had nothing running on it. "oc get pods --all-namespaces -o wide" did NOT show any other pods scheduled to this node.
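For reference, PodFitsPorts is the scheduler predicate that rejects a node when a host port requested by the pod is already claimed there, which is why an empty node failing it points at stale scheduler state. A sketch of the checks described above (the `router-1-xxxx` pod name is the truncated placeholder from this report; these commands require a cluster login):

```shell
# Inspect the failing pod's events to see the scheduler's fit failure
oc describe pod router-1-xxxx -n default

# List every pod in the cluster with its assigned node; if no pod is
# shown on the node in question, nothing should be holding its host ports
oc get pods --all-namespaces -o wide
```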

We tried multiple things, but what worked was restarting the atomic-openshift-master-controllers service. We believe this is what fixed it, although the api service was also restarted. Once the controllers service was restarted, the router deployed immediately.
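The workaround above amounts to restarting the master services so the scheduler rebuilds its cache. On a 3.2-era master this would look roughly like the following (the api unit name is assumed from the description, and its restart may not have been the actual fix):

```shell
# Restart the controllers service (the scheduler runs here); this is
# what we believe cleared the stale scheduler cache
systemctl restart atomic-openshift-master-controllers

# The api service was also restarted during our troubleshooting
systemctl restart atomic-openshift-master-api
```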

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

See description

Actual results:

router did not deploy

Expected results:

router should have deployed.

Additional info:

Sorry, the details are fuzzy here. We have collected logs and sent them to engineering. We wanted to file this bug to track the issue for future reference.

Comment 1 Matt Woodson 2016-07-08 18:37:11 UTC
Some additional info: we had 4 infra nodes (2 blue on 3.1, 2 green on 3.2), and this was accidentally scaled to 6. Something odd may have happened as we went back and forth between the blue and green nodes (we made them schedulable, and then unschedulable).

We also suspect the scheduler cache is what was incorrect.

Comment 2 Andy Goldstein 2016-07-08 18:56:47 UTC
It looks like we don't have logs going back far enough to diagnose what started this problem. Hopefully, if it happens again, we'll have increased log retention enough to be able to triage it.

Comment 3 Andy Goldstein 2016-08-04 10:33:01 UTC
Matt, have you ever run into this again?

Comment 4 Andy Goldstein 2016-08-09 14:22:17 UTC
Matt says he hasn't seen this again. I'm going to close this for now, but if it happens again, please reopen.
