Bug 1684905 - kube-scheduler operator reaches a permanent failure state
Summary: kube-scheduler operator reaches a permanent failure state
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: ravig
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-03 19:00 UTC by Derek Carr
Modified: 2019-04-01 21:01 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-01 21:01:51 UTC
Target Upstream Version:



Description Derek Carr 2019-03-03 19:00:39 UTC
Description of problem:

Installed cluster with level 4.0.0-0.alpha-2019-03-02-023844.

Came back to the cluster 24 hours later, and the kube-scheduler operator is failing with the following message:

"Failing: request declared a Content-Length of 2342 but only wrote 0 bytes"

Attempted to upgrade the cluster to level 4.0.0-0.alpha-2019-03-02-043149 to resolve the issue. The kube-scheduler-operator is stuck reporting that message, with no clear idea why.
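
For triage purposes, a minimal sketch of how the operator's failing condition and its message could be read programmatically, assuming the ClusterOperator object is named "kube-scheduler", the config.openshift.io/v1 API, and a recent client-go where the dynamic client calls take a context; `oc get clusteroperator kube-scheduler -o yaml` shows the same conditions.

// Sketch: read the kube-scheduler ClusterOperator's status conditions with the
// dynamic client. The kubeconfig source and the object name are assumptions
// made for illustration only.
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	gvr := schema.GroupVersionResource{Group: "config.openshift.io", Version: "v1", Resource: "clusteroperators"}
	co, err := dyn.Resource(gvr).Get(context.Background(), "kube-scheduler", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// Each condition carries a type (e.g. the failing condition quoted above),
	// a status, and the message string reported by the operator.
	conds, _, err := unstructured.NestedSlice(co.Object, "status", "conditions")
	if err != nil {
		panic(err)
	}
	for _, c := range conds {
		m := c.(map[string]interface{})
		fmt.Printf("%v=%v: %v\n", m["type"], m["status"], m["message"])
	}
}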

Comment 2 ravig 2019-03-06 17:23:52 UTC
I looked into the scheduler logs and see the following error continuously:

k8s.io/client-go/informers/factory.go:131: Failed to list *v1.ReplicationController: Get https://localhost:6443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: dial tcp [::1]:6443: connect: connection refused

When I looked at the API server logs, the API server is not available, which I think is the issue here. I faced a similar issue when testing in CI; David and I thought it was a flake and ignored it. Perhaps we need to investigate this further.

18:27:45 is the time when the scheduler gave up trying to communicate with the API server. All 3 API servers (when I looked at the logs) seem to have been unavailable during that time. I am curious about the state of the controller-manager at that time: was it failing at the same time? I will set up a cluster to see if I can reproduce it.
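
For reference, the failure mode described above can be confirmed with a plain dial against the same node-local endpoint the scheduler's informers use; this is a minimal sketch, assuming localhost:6443 as in the quoted log line.

// Sketch: dial the node-local apiserver endpoint from a master. A refused
// connection matches the informer error above and means no apiserver instance
// is listening there at that moment.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	conn, err := net.DialTimeout("tcp", "localhost:6443", 5*time.Second)
	if err != nil {
		// e.g. "dial tcp [::1]:6443: connect: connection refused"
		fmt.Println("dial error:", err)
		return
	}
	conn.Close()
	fmt.Println("apiserver endpoint is accepting connections")
}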

Comment 3 Seth Jennings 2019-04-01 21:01:51 UTC
We haven't seen this in a long time and have been unable to recreate it or develop a theory on how it happens. Please reopen if it is recreated, and make sure to get the kube-scheduler logs from all masters.
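
One possible way to collect those logs, as a sketch assuming the scheduler pods run in the openshift-kube-scheduler namespace (one per master) and a recent client-go where List/GetLogs take a context; running `oc logs` against each scheduler pod gathers the same output.

// Sketch: dump the log of every kube-scheduler pod so the per-master failure
// timelines can be compared. Namespace and kubeconfig source are assumptions.
package main

import (
	"context"
	"fmt"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()
	pods, err := client.CoreV1().Pods("openshift-kube-scheduler").List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("==== %s (node %s) ====\n", p.Name, p.Spec.NodeName)
		stream, err := client.CoreV1().Pods(p.Namespace).GetLogs(p.Name, &corev1.PodLogOptions{}).Stream(ctx)
		if err != nil {
			fmt.Println("log error:", err)
			continue
		}
		io.Copy(os.Stdout, stream)
		stream.Close()
	}
}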

