Bug 1696774 - Machine-os-content has not been promoted in the last 32 hours
Summary: Machine-os-content has not been promoted in the last 32 hours
Keywords:
Status: ON_QA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.1.0
Assignee: Casey Callendrello
QA Contact: Meng Bo
URL:
Whiteboard:
Duplicates: 1697213
Depends On:
Blocks:
Reported: 2019-04-05 15:30 UTC by Clayton Coleman
Modified: 2019-04-16 03:38 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-09 19:54:09 UTC
Target Upstream Version:



Comment 1 Colin Walters 2019-04-05 15:35:27 UTC
https://jira.coreos.com/browse/RHCOS-120

Comment 4 Colin Walters 2019-04-06 12:44:55 UTC
Looks like we have a new failure mode, dying during install:

 https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.0/91/?log#log

One thing I notice in a quick glance is workers aren't coming up.
And the machine-api-operator logs are empty...not sure why yet.

Comment 5 Colin Walters 2019-04-07 20:09:22 UTC
In order to make this easier to debug, I created a release image: registry.svc.ci.openshift.org/rhcos/release:latest

Which you can pass to the installer as usual.

Looks like openshift-apiserver is in CrashLoopBackOff:

Events:
  Type     Reason     Age                      From                            Message
  ----     ------     ----                     ----                            -------
  Normal   Pulled     7m19s (x29 over 6h24m)   kubelet, osiris-5rh5d-master-0  Container image "registry.svc.ci.openshift.org/openshift/origin-v4.0@sha256:ecacd82ccd9f2631fb5e184fc22cf79b3efe60896f29ed0e011f7f4a8589d6ae" already present on machine
  Normal   Created    7m18s (x29 over 6h24m)   kubelet, osiris-5rh5d-master-0  Created container openshift-apiserver
  Normal   Started    7m17s (x29 over 6h24m)   kubelet, osiris-5rh5d-master-0  Started container openshift-apiserver
  Warning  Unhealthy  6m57s (x57 over 6h24m)   kubelet, osiris-5rh5d-master-0  Readiness probe failed: Get https://10.88.0.43:8443/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  BackOff    2m10s (x566 over 6h23m)  kubelet, osiris-5rh5d-master-0  Back-off restarting failed container

$ oc logs pods/apiserver-b44dk
I0407 20:02:44.505348       1 clientca.go:92] [0] "/var/run/configmaps/aggregator-client-ca/ca-bundle.crt" client-ca certificate: "aggregator-signer" [] issuer="<self>" (2019-04-07 13:26:11 +0000 UTC to 2019-04-08 13:26:11 +0000 UTC (now=2019-04-07 20:02:44.505327761 +0000 UTC))
I0407 20:02:44.505780       1 clientca.go:92] [0] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kube-ca" [] issuer="<self>" (2019-04-07 13:26:08 +0000 UTC to 2029-04-04 13:26:08 +0000 UTC (now=2019-04-07 20:02:44.505770769 +0000 UTC))
I0407 20:02:44.505812       1 clientca.go:92] [1] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "admin-kubeconfig-signer" [] issuer="<self>" (2019-04-07 13:26:08 +0000 UTC to 2029-04-04 13:26:08 +0000 UTC (now=2019-04-07 20:02:44.50579267 +0000 UTC))
I0407 20:02:44.505822       1 clientca.go:92] [2] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kubelet-signer" [] issuer="<self>" (2019-04-07 13:26:13 +0000 UTC to 2019-04-08 13:26:13 +0000 UTC (now=2019-04-07 20:02:44.505817243 +0000 UTC))
I0407 20:02:44.505831       1 clientca.go:92] [3] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kube-control-plane-signer" [] issuer="<self>" (2019-04-07 13:26:13 +0000 UTC to 2020-04-06 13:26:13 +0000 UTC (now=2019-04-07 20:02:44.505826169 +0000 UTC))
I0407 20:02:44.505842       1 clientca.go:92] [4] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "system:kube-apiserver" [client] groups=[kube-master] issuer="kube-apiserver-to-kubelet-signer" (2019-04-07 13:26:13 +0000 UTC to 2020-04-06 13:26:14 +0000 UTC (now=2019-04-07 20:02:44.505836365 +0000 UTC))
I0407 20:02:44.505865       1 clientca.go:92] [5] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kubelet-bootstrap-kubeconfig-signer" [] issuer="<self>" (2019-04-07 13:26:09 +0000 UTC to 2029-04-04 13:26:09 +0000 UTC (now=2019-04-07 20:02:44.505860046 +0000 UTC))
I0407 20:02:44.505874       1 clientca.go:92] [6] "/var/run/configmaps/client-ca/ca-bundle.crt" client-ca certificate: "kube-csr-signer_@1554644021" [] issuer="kubelet-signer" (2019-04-07 13:33:40 +0000 UTC to 2019-04-08 13:26:13 +0000 UTC (now=2019-04-07 20:02:44.505869248 +0000 UTC))
I0407 20:02:44.508408       1 audit.go:362] Using audit backend: ignoreErrors<log>
I0407 20:02:44.517343       1 plugins.go:84] Registered admission plugin "NamespaceLifecycle"
I0407 20:02:44.517359       1 plugins.go:84] Registered admission plugin "Initializers"
I0407 20:02:44.517364       1 plugins.go:84] Registered admission plugin "ValidatingAdmissionWebhook"
I0407 20:02:44.517368       1 plugins.go:84] Registered admission plugin "MutatingAdmissionWebhook"
I0407 20:02:44.518190       1 glog.go:79] Admission plugin "project.openshift.io/ProjectRequestLimit" is not configured so it will be disabled.
I0407 20:02:44.518358       1 glog.go:79] Admission plugin "scheduling.openshift.io/PodNodeConstraints" is not configured so it will be disabled.
I0407 20:02:44.518754       1 plugins.go:158] Loaded 5 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,build.openshift.io/BuildConfigSecretInjector,image.openshift.io/ImageLimitRange,image.openshift.io/ImagePolicy,MutatingAdmissionWebhook.
I0407 20:02:44.518766       1 plugins.go:161] Loaded 8 validating admission controller(s) successfully in the following order: OwnerReferencesPermissionEnforcement,build.openshift.io/BuildConfigSecretInjector,build.openshift.io/BuildByStrategy,image.openshift.io/ImageLimitRange,image.openshift.io/ImagePolicy,quota.openshift.io/ClusterResourceQuota,ValidatingAdmissionWebhook,ResourceQuota.
I0407 20:02:44.524149       1 clientconn.go:551] parsed scheme: ""
I0407 20:02:44.524165       1 clientconn.go:557] scheme "" not registered, fallback to default scheme
I0407 20:02:44.524209       1 resolver_conn_wrapper.go:116] ccResolverWrapper: sending new addresses to cc: [{etcd.kube-system.svc:2379 0  <nil>}]
I0407 20:02:44.524289       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd.kube-system.svc:2379 <nil>}]
F0407 20:03:04.524406       1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 openshift.io [https://etcd.kube-system.svc:2379] /var/run/secrets/etcd-client/tls.key /var/run/secrets/etcd-client/tls.crt /var/run/configmaps/etcd-serving-ca/ca-bundle.crt true {0xc420bfa990 0xc420bfaa20} <nil> 5m0s 1m0s}), err (context deadline exceeded)
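
The last entry above is the one that matters: glog prefixes each line with a severity letter (I, W, E or F), and the F line shows the apiserver giving up on etcd 20 seconds after it started dialing. A minimal sketch for pulling only the fatal entries out of a long pod log, assuming the standard glog text format; the function name `glog_fatal` is made up for illustration:

```shell
# Print only fatal (F-severity) glog entries read from stdin.
# glog lines start with a severity letter followed by month and day (mmdd).
glog_fatal() {
  grep -E '^F[0-9]{4} '
}

# Example against a live cluster (pod name taken from the log above):
#   oc logs pods/apiserver-b44dk | glog_fatal
```

Filtering on the prefix rather than on the word "error" avoids matching informational lines that merely mention errors, such as the `ignoreErrors<log>` audit backend message.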

Comment 13 Colin Walters 2019-04-08 16:45:12 UTC
Not sure about cause and effect here, but I noticed:

```
oc get csr
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-2bgjq   11m   system:node:ip-10-0-157-133.us-east-2.compute.internal                      Pending
csr-55sdk   11m   system:node:ip-10-0-137-197.us-east-2.compute.internal                      Pending
csr-9bjw6   20m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-d76nz   20m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-k8t4r   20m   system:node:ip-10-0-137-197.us-east-2.compute.internal                      Pending
csr-qj28q   20m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-qzd8x   11m   system:node:ip-10-0-164-115.us-east-2.compute.internal                      Pending
csr-vvvmx   20m   system:node:ip-10-0-157-133.us-east-2.compute.internal                      Pending
csr-x5rwh   19m   system:node:ip-10-0-164-115.us-east-2.compute.internal                      Pending
```

And:

```
oc -n openshift-service-ca logs deploy/service-serving-cert-signer
Error from server: Get https://ip-10-0-157-133.us-east-2.compute.internal:10250/containerLogs/openshift-service-ca/service-serving-cert-signer-5788d67844-jlnts/service-serving-cert-signer-controller: remote error: tls: internal error
```

I just tried `oc get csr | xargs oc adm certificate approve`.
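
Piping the whole table into `xargs` also passes the header row and every other column as arguments, so a filter in between may help. A rough sketch that keeps only the names of CSRs still in `Pending`, assuming the four-column table layout shown above; the helper name `pending_csr_names` is hypothetical:

```shell
# Print the NAME column for every CSR whose CONDITION (last column) is
# exactly "Pending". Reads `oc get csr` table output on stdin and skips
# the header row.
pending_csr_names() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Example against a live cluster:
#   oc get csr | pending_csr_names | xargs oc adm certificate approve
```

Matching on the last field works here because `Approved,Issued` contains no spaces and so stays a single field.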

Comment 21 Colin Walters 2019-04-08 18:51:30 UTC
Dan Winship [2:48 PM]
`10.88.0.0/16 dev cni0 proto kernel scope link src 10.88.0.1`
no part of openshift creates a `cni0`
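
The subnet in that route also contains the pod IP from the failing readiness probe in comment 5 (10.88.0.43), which would be consistent with the pod having been attached to `cni0` rather than to the cluster network. A small sketch for checking whether an address falls inside a CIDR range; the helper names `ip_to_int` and `in_cidr` are illustrative, and the code assumes well-formed IPv4 input without leading zeros:

```shell
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
  local IFS=.
  set -- $1            # split the address into four octets on "."
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed (exit 0) if address $1 lies inside CIDR range $2, e.g. 10.88.0.0/16.
in_cidr() {
  local ip=$1 net=${2%/*} bits=${2#*/}
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# The probe address from comment 5 against the cni0 route:
in_cidr 10.88.0.43 10.88.0.0/16 && echo "10.88.0.43 is in the cni0 subnet"
```

The node addresses seen elsewhere in this bug (for example 10.0.157.133) fail the same check.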

Comment 22 Colin Walters 2019-04-08 18:58:58 UTC
For reference, to build the release payload here, I ran:

oc adm release new --from-image-stream=origin-v4.0 -n openshift --to-image registry.svc.ci.openshift.org/rhcos/release:latest machine-os-content=registry.svc.ci.openshift.org/rhcos/machine-os-content@sha256:e808a43cd9ef88cbf1c805179ff7672c03907aeff184bc1e902c82d9ab5aa9c8

Then: xokdinst launch -p aws -I registry.svc.ci.openshift.org/rhcos/release:latest horus
(That uses my installer wrapper; with the stock installer, the equivalent is:
 env OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/rhcos/release:latest openshift-install ...)

Now, if fixes for this land in the kubelet (or anything else host-side), we'll need to re-push machine-os-content and rerun the above.
If fixes land in some of the SDN pods instead, we can generate a custom payload that overrides those as well in order to test them.
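
A dry-run sketch of how such a custom payload command could be assembled. It only prints the command and runs nothing. The helper name `release_new_cmd` is hypothetical, and passing more than one `component=pullspec` pair to `oc adm release new` is an assumption that extends the single override shown above:

```shell
# Print (do not run) an `oc adm release new` command line.
#   $1      target release image
#   $2...   component=pullspec overrides, in the same form as the
#           machine-os-content override used above
release_new_cmd() {
  local to_image=$1
  shift
  printf 'oc adm release new --from-image-stream=origin-v4.0 -n openshift --to-image %s' "$to_image"
  if [ "$#" -gt 0 ]; then
    printf ' %s' "$@"
  fi
  printf '\n'
}

# Reproduces the command from this comment:
release_new_cmd registry.svc.ci.openshift.org/rhcos/release:latest \
  machine-os-content=registry.svc.ci.openshift.org/rhcos/machine-os-content@sha256:e808a43cd9ef88cbf1c805179ff7672c03907aeff184bc1e902c82d9ab5aa9c8
```

Printing first makes it easy to review the pullspecs before anything is pushed to the shared registry.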

Comment 24 Seth Jennings 2019-04-08 20:35:50 UTC
*** Bug 1697213 has been marked as a duplicate of this bug. ***

Comment 25 Colin Walters 2019-04-08 21:54:53 UTC
I am quite confident https://github.com/openshift/machine-config-operator/pull/608 (which just merged) will fix this.  We also have a crio fix coming that should work too.

Comment 26 Colin Walters 2019-04-09 12:59:51 UTC
OK, latest promotion job is failing in e2e now:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.0/99/

Failing tests:

[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should start and expose a secured proxy and unsecured metrics [Suite:openshift/conformance/parallel/minimal]
[sig-storage] In-tree Volumes [Driver: hostPathSymlink] [Testpattern: Inline-volume (ext3)] volumes should allow exec of files on the volume [Suite:openshift/conformance/parallel] [Suite:k8s]

Writing JUnit report to /tmp/artifacts/junit/junit_e2e_20190409-001133.xml

error: 2 fail, 569 pass, 547 skip (26m22s)

Those look like the usual flakes?  Going to ask for retests.

Comment 28 Mike Fiedler 2019-04-10 18:25:18 UTC
Kubelet is still 1.12.4 in the 4.0.0-0.nightly-2019-04-10-141956 build.

Moving this to MODIFIED until the fix is available in a build.  Then it can go to ON_QA.

Comment 29 Mike Fiedler 2019-04-10 18:26:01 UTC
Adding TestBlocker since it blocks Kube 1.13 regression testing

Comment 30 Seth Jennings 2019-04-10 19:28:40 UTC
I just installed a cluster and it is running the 1.13 kubelet:

$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-129-243.us-west-1.compute.internal   Ready    worker   25h   v1.13.4+1ad602308
ip-10-0-131-69.us-west-1.compute.internal    Ready    master   25h   v1.13.4+1ad602308
ip-10-0-139-220.us-west-1.compute.internal   Ready    worker   25h   v1.13.4+1ad602308
ip-10-0-143-227.us-west-1.compute.internal   Ready    master   25h   v1.13.4+1ad602308
ip-10-0-156-146.us-west-1.compute.internal   Ready    worker   25h   v1.13.4+1ad602308
ip-10-0-158-107.us-west-1.compute.internal   Ready    master   25h   v1.13.4+1ad602308

$ oc get clusterversion
NAME      VERSION                           AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.alpha-2019-04-09-164546   True        False         25h     Cluster version is 4.0.0-0.alpha-2019-04-09-164546
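
For larger clusters, the same check can be condensed to the set of distinct kubelet versions. A minimal sketch, assuming the default `oc get nodes` table with VERSION as the last column; the helper name `kubelet_versions` is made up for illustration:

```shell
# Print each distinct kubelet version found in `oc get nodes` output.
# Reads the table on stdin, skips the header row, and takes the last column.
kubelet_versions() {
  awk 'NR > 1 { print $NF }' | sort -u
}

# Example against a live cluster:
#   oc get nodes | kubelet_versions
```

A single line of output means every node reports the same kubelet version; more than one line points to a mixed cluster.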

Comment 31 Mike Fiedler 2019-04-11 00:00:46 UTC
Re: comment 30, did you install an ART-produced OCP build or a CI OKD build?

Comment 32 Seth Jennings 2019-04-11 04:27:45 UTC
OKD CI build

