Bug 1688321 - Unhealthy status of machineconfiguration.openshift.io/state=Degraded
Summary: Unhealthy status of machineconfiguration.openshift.io/state=Degraded
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.1.0
Assignee: Steve Milner
QA Contact: Micah Abbott
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-13 14:35 UTC by Oleg Nesterov
Modified: 2019-03-14 18:47 UTC (History)
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-14 18:47:01 UTC
Target Upstream Version:




Links
System   ID                        Priority  Status  Summary  Last Updated
Github   openshift pivot pull 41   None      None    None     2019-03-13 16:54:59 UTC

Description Oleg Nesterov 2019-03-13 14:35:08 UTC
Description of problem:
Nodes report the unhealthy annotation machineconfiguration.openshift.io/state=Degraded, and it is not possible to establish an SSH connection to the cluster after deployment.

How reproducible:
1. Build 47.330 (ami-0a0438a5e62dc0f9f) + registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-06-074438 (always)
2. Released version v0.14.0 (always)
3. Build 400.7.20190312.0 (ami-0eecbb884c8b35b1e) + registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-06-074438 (partially; machineconfiguration.openshift.io/state=Done, but unable to establish an SSH connection)

Steps to Reproduce:
1. Install cluster
2. Login to cluster
3. Run "oc describe node|grep machineconfig"
4. Create the following YAML file to add your public SSH key to the remote machines:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 00-master-ssh-02
spec:
  config:
    passwd:
      users:
      - name: core
        sshAuthorizedKeys:
        - ssh-rsa 

Run "oc create -f 00-master-ssh-update-02.yaml"
5. Create bastion ssh pod: "curl https://raw.githubusercontent.com/eparis/ssh-bastion/master/deploy/deploy.sh | bash"

6. Try to establish a connection:
ssh -i ~/id_rsa -o StrictHostKeyChecking=no -o ProxyCommand='ssh -i ~/id_rsa -A -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -W %h:%p core@$(oc get service -n openshift-ssh-bastion ssh-bastion -o jsonpath="{.status.loadBalancer.ingress[0]..hostname}")' core@iip-10-0-136-218.us-east-2.compute.internal

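The state check in step 3 can be reduced to a single count. A minimal shell sketch (the `count_degraded` helper name is made up for illustration; a live run additionally requires a logged-in `oc`):

```shell
# Count nodes whose machine-config daemon reports the Degraded state,
# reading `oc describe node` output on stdin. The grep pattern is the
# literal annotation key/value pair the daemon sets.
count_degraded() {
  grep -c 'machineconfiguration.openshift.io/state=Degraded'
}

# Against a live cluster:
#   oc describe node | count_degraded
```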

Actual results:
3. [cloud-user@preserve-olnester-workstation openshift]$ oc describe node|grep machineconfig
                    machineconfiguration.openshift.io/currentConfig=worker-17a762de663485fd7cca47e768e3c31e
                    machineconfiguration.openshift.io/desiredConfig=worker-17a762de663485fd7cca47e768e3c31e
                    machineconfiguration.openshift.io/state=Degraded
                    machineconfiguration.openshift.io/currentConfig=master-a9186123f0b543f64cc430e1c11a9f67
                    machineconfiguration.openshift.io/desiredConfig=master-a9186123f0b543f64cc430e1c11a9f67
                    machineconfiguration.openshift.io/state=Degraded
                    machineconfiguration.openshift.io/currentConfig=worker-17a762de663485fd7cca47e768e3c31e
                    machineconfiguration.openshift.io/desiredConfig=worker-17a762de663485fd7cca47e768e3c31e
                    machineconfiguration.openshift.io/state=Degraded
                    machineconfiguration.openshift.io/currentConfig=master-a9186123f0b543f64cc430e1c11a9f67
                    machineconfiguration.openshift.io/desiredConfig=master-a9186123f0b543f64cc430e1c11a9f67
                    machineconfiguration.openshift.io/state=Degraded
                    machineconfiguration.openshift.io/currentConfig=master-a9186123f0b543f64cc430e1c11a9f67
                    machineconfiguration.openshift.io/desiredConfig=master-a9186123f0b543f64cc430e1c11a9f67
                    machineconfiguration.openshift.io/state=Degraded
                    machineconfiguration.openshift.io/currentConfig=worker-17a762de663485fd7cca47e768e3c31e
                    machineconfiguration.openshift.io/desiredConfig=worker-17a762de663485fd7cca47e768e3c31e
                    machineconfiguration.openshift.io/state=Degraded

6. [cloud-user@preserve-olnester-workstation ~]$ ssh -i ~/id_rsa -o StrictHostKeyChecking=no -o ProxyCommand='ssh -i ~/id_rsa -A -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -W %h:%p core@$(oc get service -n openshift-ssh-bastion ssh-bastion -o jsonpath="{.status.loadBalancer.ingress[0]..hostname}")' core@iip-10-0-136-218.us-east-2.compute.internal
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host


Expected results:
3. machineconfiguration.openshift.io/state=Done
6. SSH connection is established 

Additional info:
Here are the logs from the machine-config-daemon:

[cloud-user@preserve-olnester-workstation openshift]$ oc logs machine-config-daemon-4tm5l -n openshift-machine-config-operator
I0313 11:17:05.160638    5791 start.go:52] Version: 4.0.16-1-dirty
I0313 11:17:05.161210    5791 start.go:88] starting node writer
I0313 11:17:05.166939    5791 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0313 11:17:05.246718    5791 daemon.go:175] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 (47.330)
I0313 11:17:05.249128    5791 daemon.go:247] Managing node: ip-10-0-133-188.us-east-2.compute.internal
I0313 11:17:05.261483    5791 node.go:39] Setting initial node config: worker-f774332d8e959cfd954ac6f98faa3660
I0313 11:17:05.277596    5791 start.go:146] Calling chroot("/rootfs")
I0313 11:17:05.277632    5791 run.go:22] Running captured: rpm-ostree status
I0313 11:17:05.305453    5791 daemon.go:577] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
              CustomOrigin: Managed by pivot tool
                   Version: 47.330 (2019-02-23T04:17:13Z)

  pivot://docker-registry-default.cloud.registry.upshift.redhat.com/redhat-coreos/maipo@sha256:aff3ff28e3caa6ffe171cabad0048ab140825059b53c5da1587f14c5597f6823
              CustomOrigin: Provisioned from oscontainer
                   Version: 400.7.20190312.0 (2019-03-12T14:19:09Z)
I0313 11:17:05.305504    5791 daemon.go:477] In bootstrap mode
I0313 11:17:05.316144    5791 daemon.go:505] Current+desired config: worker-f774332d8e959cfd954ac6f98faa3660
I0313 11:17:05.345993    5791 daemon.go:613] No bootstrap pivot required; unlinking bootstrap node annotations
I0313 11:17:05.347989    5791 daemon.go:640] Validated on-disk state
I0313 11:17:05.348008    5791 daemon.go:672] In desired config worker-f774332d8e959cfd954ac6f98faa3660
I0313 11:17:05.348031    5791 start.go:165] Starting MachineConfigDaemon
I0313 11:17:05.348041    5791 daemon.go:314] Enabling Kubelet Healthz Monitor
E0313 11:18:00.502169    5791 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=11, ErrCode=NO_ERROR, debug=""
E0313 13:01:28.406586    5791 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=31, ErrCode=NO_ERROR, debug=""
E0313 13:09:00.971241    5791 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=NO_ERROR, debug=""
E0313 13:11:22.625361    5791 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=NO_ERROR, debug=""

Comment 1 Johnny Liu 2019-03-13 16:36:26 UTC
I hit a similar issue in my cluster (47.330 + 4.0.0-0.nightly-2019-03-04-234414).

# oc logs machine-config-daemon-kl5f5 -n openshift-machine-config-operator
I0313 09:16:14.520551   21098 start.go:52] Version: 4.0.15-1-dirty
I0313 09:16:14.520825   21098 start.go:88] starting node writer
I0313 09:16:14.526389   21098 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0313 09:16:14.596006   21098 daemon.go:175] Booted osImageURL: registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b (47.330)
I0313 09:16:14.596812   21098 daemon.go:247] Managing node: ip-10-0-151-157.us-east-2.compute.internal
I0313 09:16:14.621930   21098 node.go:39] Setting initial node config: master-1fd62473a6bce230df3d90a1f6109081
I0313 09:16:14.644060   21098 start.go:146] Calling chroot("/rootfs")
I0313 09:16:14.644095   21098 run.go:22] Running captured: rpm-ostree status
I0313 09:16:14.671146   21098 daemon.go:577] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b
              CustomOrigin: Provisioned from oscontainer
                   Version: 47.330 (2019-02-23T04:17:13Z)
I0313 09:16:14.671197   21098 daemon.go:477] In bootstrap mode
I0313 09:16:14.683197   21098 daemon.go:505] Current+desired config: master-1fd62473a6bce230df3d90a1f6109081
I0313 09:16:14.686404   21098 daemon.go:609] Bootstrap pivot required to: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
I0313 09:16:14.686534   21098 update.go:674] Updating OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: I0313 09:16:20.041972   21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: time="2019-03-13T09:16:20Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"
pivot.service: W0313 09:16:20.390391   21151 run.go:45] skopeo failed: exit status 1; retrying...
pivot.service: I0313 09:16:30.390660   21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: time="2019-03-13T09:16:30Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"
pivot.service: W0313 09:16:30.693978   21151 run.go:45] skopeo failed: exit status 1; retrying...
pivot.service: I0313 09:16:50.694228   21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: time="2019-03-13T09:16:50Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"
pivot.service: W0313 09:16:50.995356   21151 run.go:45] skopeo failed: exit status 1; retrying...
pivot.service: I0313 09:17:30.995530   21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: time="2019-03-13T09:17:31Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"
pivot.service: W0313 09:17:31.315975   21151 run.go:45] skopeo failed: exit status 1; retrying...
pivot.service: I0313 09:18:51.316209   21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: time="2019-03-13T09:18:51Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"
pivot.service: W0313 09:18:51.664982   21151 run.go:45] skopeo failed: exit status 1; retrying...
pivot.service: F0313 09:18:51.665013   21151 run.go:53] skopeo: timed out waiting for the condition
E0313 09:18:51.674651   21098 daemon.go:446] Fatal error checking initial state of node: Checking initial state: Failed to run pivot: error queuing start job; got failed
E0313 09:18:51.674908   21098 writer.go:91] Marking degraded due to: Checking initial state: Failed to run pivot: error queuing start job; got failed
I0313 09:18:51.717659   21098 daemon.go:448] Entering degraded state; going to sleep


This makes all machines enter the `Degraded` state, and the cluster becomes unhealthy.
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-04-234414   True        False         7h9m    Error while reconciling 4.0.0-0.nightly-2019-03-04-234414: the cluster operator machine-config is failing

Comment 2 W. Trevor King 2019-03-13 16:50:20 UTC
> pivot.service: I0313 09:17:30.995530   21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
> pivot.service: time="2019-03-13T09:17:31Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"

Sounds like a pull-secret issue.  Maybe like bug 1686556, where non-kubelet tooling isn't looking in the right place for the pull secret.
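One way to test the pull-secret hypothesis is to rerun the failing skopeo inspect by hand and check whether the registry's error is specifically an auth rejection rather than, say, a network failure. A hedged sketch (the `is_auth_failure` helper is made up here, and the /var/lib/kubelet/config.json pull-secret path is an assumption -- its location is exactly what bug 1686556 questions):

```shell
# Match the registry's "unauthorized" message, as seen in the daemon logs,
# to confirm the skopeo failure is an auth problem.
is_auth_failure() {
  grep -q 'unauthorized: access to the requested resource is not authorized'
}

# On an affected node:
#   skopeo inspect --authfile /var/lib/kubelet/config.json \
#     docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 \
#     2>&1 | is_auth_failure && echo "pull-secret problem"
```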

Comment 3 W. Trevor King 2019-03-13 16:54:59 UTC
> I0313 09:16:14.596006   21098 daemon.go:175] Booted osImageURL: registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b (47.330)

RHCOS 47.330 is too old.  Try running with a newer RHCOS (I haven't looked up exactly which version pivot#41 landed in though).
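To see which RHCOS build a node actually booted, the version can be read straight out of the daemon's "Booted osImageURL" log line. A sketch (the `booted_rhcos_version` helper name is made up for illustration; the pod name in the usage comment will differ per cluster):

```shell
# Extract the parenthesized RHCOS build version from a machine-config-daemon
# "Booted osImageURL: <image> (VERSION)" log line read on stdin.
booted_rhcos_version() {
  sed -n 's/.*Booted osImageURL: .*(\(.*\))$/\1/p'
}

# Against a live cluster (requires a logged-in `oc`):
#   oc logs machine-config-daemon-4tm5l -n openshift-machine-config-operator | booted_rhcos_version
```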

Comment 4 Johnny Liu 2019-03-13 17:27:17 UTC
(In reply to W. Trevor King from comment #3)
> > I0313 09:16:14.596006   21098 daemon.go:175] Booted osImageURL: registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b (47.330)
> 
> RHCOS 47.330 is too old.  Try running with a newer RHCOS (I haven't looked
> up exactly which version pivot#41 landed in though).

Just like the initial report, RHCOS 400.7.20190312.0 does not hit this error. This issue probably would not happen with a newer RHCOS, but after talking with Oleg, we can still reproduce this bug with the released beta2. How do we fix it for the released beta2? As I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1677198#c8, QE is running beta2 released version -> beta3 pre-release candidate upgrade testing, and this issue looks like big trouble for that testing.

