
Bug 1693951

Summary: TLS errors due to expired kubelet certificates after node was shutdown
Product: OpenShift Container Platform
Reporter: Gerard Braad (Red Hat) <gbraad>
Component: Master
Assignee: Maciej Szulik <maszulik>
Status: CLOSED DUPLICATE
QA Contact: Xingxing Xia <xxia>
Severity: urgent
Docs Contact:
Priority: high
Version: unspecified
CC: aos-bugs, cfergeau, jokerman, lmohanty, maszulik, mmccomas, prkumar, trking, veillard, yinzhou
Target Milestone: ---
Keywords: Reopened
Target Release: 4.1.0
Hardware: All
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-04-09 14:27:20 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Gerard Braad (Red Hat) 2019-03-29 06:52:28 UTC
Description of problem:

After the VM has been shut down for a period of time, the kubelet is unable to communicate with the API server and logs TLS errors.

Version-Release number of the following components:
openshift 4.x, installer 0.15.0

How reproducible:

Steps to Reproduce:
1. Install OpenShift, preferably with a single master and a worker.
2. After the install, shut down the VM for a long enough period (3+ hours, perhaps).
3. Restart the cluster.


Actual results:
[root@test1-svtfv-master-0 ~]# cd /var/lib/kubelet/pki/
[root@test1-svtfv-master-0 pki]# openssl x509 -in kubelet-client-current.pem -noout -text > kubelet-client-current.txt
[root@test1-svtfv-master-0 pki]# openssl x509 -in kubelet-server-current.pem -noout -text > kubelet-server-current.txt
[root@test1-svtfv-master-0 pki]# less kubelet-client-current.txt 
[root@test1-svtfv-master-0 pki]# cat kubelet-client-current.txt | grep Not
            Not Before: Mar 27 06:57:00 2019 GMT
            Not After : Mar 28 06:39:47 2019 GMT
[root@test1-svtfv-master-0 pki]# cat kubelet-server-current.txt | grep Not
            Not Before: Mar 27 07:02:00 2019 GMT
            Not After : Mar 28 06:39:39 2019 GMT
[root@test1-svtfv-master-0 pki]# date
vr 29 mrt 2019  5:21:18 UTC
[root@test1-svtfv-master-0 pki]# 

This causes "tls: internal error" to be logged.
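For what it's worth, the expiry check from the transcript above can be scripted; a minimal sketch (check_cert is a name I made up, and the kubelet paths are the ones shown in the transcript):

```shell
# Report whether a PEM certificate has already expired.
# "-checkend 0" asks openssl whether the cert is valid for
# at least 0 more seconds, i.e. valid right now.
check_cert() {
  if openssl x509 -in "$1" -noout -checkend 0 >/dev/null; then
    echo "$1: still valid"
  else
    echo "$1: EXPIRED"
  fi
}

for pem in /var/lib/kubelet/pki/kubelet-client-current.pem \
           /var/lib/kubelet/pki/kubelet-server-current.pem; do
  if [ -f "$pem" ]; then check_cert "$pem"; fi
done
```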


Expected results:
No TLS error

Comment 1 Gerard Braad (Red Hat) 2019-03-29 06:52:57 UTC
Also filed as: https://github.com/openshift/installer/issues/1494

Comment 2 W. Trevor King 2019-03-29 11:29:03 UTC
So it looks like those were one-day certs.  I don't know how often they are rotated; what installer and release image were you using?  The 0.15.0 installer sets up the kubelet client with a one-day cert [1] (and probably the kubelet server too, although I haven't tracked down a link).  Then the cluster rotates the certs with... something.  For example, see [2] for the Kubernetes API server and related rotations.  But nothing in [3] is jumping out at me as the kubelet certs you're having issues with.  Moving to the auth team, since they'll probably know, although the code itself may live in a master-team repo.

But if you want to shut down nodes, you'll certainly want to wait after the initial install, for a whole day or however long it takes for the first in-cluster rotations to go through, to get certs with longer validity times before shutting down nodes.  The auth/master teams may also have some advice for monitoring those rotations, even if it's just "grep the kube-apiserver-operator logs".  Maybe there are Kubernetes Events you can watch for?  I dunno.

Alternatively, you can just let the certs expire, and when the cluster comes back up, use SSH (which we don't expire/rotate) to go through and rebuild the x.509 chains.  I know that sort of thing has been discussed before, but don't have a reference handy at the moment.  I'll see if I can dig up a link later.

[1]: https://github.com/openshift/installer/blob/v0.15.0/pkg/asset/tls/kubelet.go#L184
[2]: https://github.com/openshift/cluster-kube-apiserver-operator/pull/342
[3]: https://github.com/openshift/cluster-kube-apiserver-operator/blob/0b686ff00295c382f245b0b4103a566d672498c8/pkg/operator/certrotationcontroller/certrotationcontroller.go

Comment 3 Gerard Braad (Red Hat) 2019-03-29 11:52:15 UTC
>  But if you want to shut down nodes, you'll certainly want to wait after the initial install, for a whole day or however long it takes for the first in-cluster rotations to go through,

That is unacceptable, as this would be part of a delivery pipeline.


> use SSH (which we don't expire/rotate) to go through and rebuild the x.509 chains.

Pre-provisioning the certificates is what we would prefer, as it also allows creating certs with a longer validity period. So far we have not seen or received any instructions for this.

Comment 4 W. Trevor King 2019-03-29 12:21:07 UTC
For recovery, I may have been remembering this ask [1], although there was no further discussion there.  There's another ask for a manual rotation trigger in [2].  I'm not aware of procedures for either, but there may be more-specific trackers somewhere that I just haven't turned up (or maybe not :p).

[1]: https://github.com/openshift/installer/blob/v0.15.0/docs/user/troubleshooting.md#unable-to-ssh-into-master-nodes
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1684547#c27

Comment 5 W. Trevor King 2019-03-29 12:22:50 UTC
Oops, stale paste.  [1] above should have been:

https://github.com/openshift/api/pull/199#discussion_r261689426

Comment 6 W. Trevor King 2019-03-29 12:56:25 UTC
I hear kubelet certs are the Pod team, so reassigning to see if they can link the rotation code and/or have ideas about triggering, monitoring, or recovering cert rotation.

Comment 7 W. Trevor King 2019-03-29 13:09:12 UTC
Recovery tool now has a tracker in bug 1694079.

Comment 8 Praveen Kumar 2019-03-29 13:50:55 UTC
In my case I shut down the VM for less than 2 hours and started it again, and that worked. But when I checked the kubelet cert, it was valid for only about 3 hours. Does that mean that during the first 24 hours, until the kubelet gets a proper 30-day cert, it rotates every 2-3 hours?

```
$ oc get nodes
NAME                   STATUS   ROLES           AGE   VERSION
test1-svtfv-master-0   Ready    master,worker   47h   v1.12.4+30e6a0f55

# cat kubelet-client-current.txt | grep Not
            Not Before: Mar 29 05:58:00 2019 GMT
            Not After : Mar 29 08:43:34 2019 GMT

# date
Fri Mar 29 06:27:12 UTC 2019
```
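The validity window in the paste above can be computed directly from the two timestamps; a sketch using GNU date (validity_hours is a hypothetical helper name):

```shell
# Whole hours between a cert's Not Before and Not After timestamps,
# as printed by "openssl x509 -text" (e.g. "Mar 29 05:58:00 2019 GMT").
validity_hours() {
  local start end
  start=$(date -u -d "$1" +%s)   # GNU date parses openssl's date format
  end=$(date -u -d "$2" +%s)
  echo $(( (end - start) / 3600 ))
}

validity_hours "Mar 29 05:58:00 2019 GMT" "Mar 29 08:43:34 2019 GMT"   # → 2
```

So the cert in that paste was valid for between 2 and 3 hours, consistent with a short pre-rotation window.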

Comment 9 Seth Jennings 2019-03-29 14:14:35 UTC
This bug is strange because you can't have a single-master setup in OCP 4.x.  HA is required; 3 masters minimum.

This is a known issue with rapidly rotating the kubelet client/server certs.  If the kubelet is down during the time it would normally do the rotation and doesn't come back up before the existing certs expire, then the kubelet is unable to connect to the apiserver after that.

I think I'm going to dup this to the recovery tool tracker because it does take external intervention to resolve this situation.

*** This bug has been marked as a duplicate of bug 1694079 ***

Comment 11 Seth Jennings 2019-04-01 19:20:48 UTC
This was changed 5 days ago to a 60-day validity time and a 30-day rotation time:
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/203
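Under that 60-day/30-day scheme, a cert's remaining validity can be checked against the rotation window; a sketch (days_left is a name I made up; uses GNU date):

```shell
# Whole days of validity a PEM certificate has left.
days_left() {
  local end now
  end=$(date -u -d "$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)" +%s)
  now=$(date -u +%s)
  echo $(( (end - now) / 86400 ))
}
```

A freshly rotated cert should report close to 60; presumably anything well under 30 suggests the in-cluster rotation has not kicked in yet.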

Comment 14 W. Trevor King 2019-04-02 03:05:36 UTC
I dunno if the commit is available in a nightly yet, but PR 203 has certainly landed, so this should be at least MODIFIED.

Comment 15 Praveen Kumar 2019-04-02 05:47:31 UTC
So I tried installer master, whose payload includes this PR for the controller operator, but I still had to wait around 24 hours until the kubelet client/server certs actually rotated to a month's validity.


```
$ openshift-install version
openshift-install unreleased-master-663-g086a88534ad03776c97e31a843658e53e0088e78
built from commit 086a88534ad03776c97e31a843658e53e0088e78

[root@test1-sbgjm-master-0 pki]# uptime -p
up 14 hours, 32 minutes

[root@test1-sbgjm-master-0 pki] # openssl x509 -in kubelet-client-2019-03-31-13-57-00.pem -noout -text | grep Not
            Not Before: Mar 31 13:52:00 2019 GMT
            Not After : Apr  1 13:34:52 2019 GMT
            
[root@test1-sbgjm-master-0 pki]# uptime -p
up 1 day, 1 hours, 51 minutes

[root@test1-sbgjm-master-0 pki] # openssl x509 -in kubelet-client-2019-04-01-11-04-04.pem -noout -text | grep Not
            Not Before: Apr  1 10:59:00 2019 GMT
            Not After : May  1 08:50:40 2019 GMT
			
[root@test1-sbgjm-master-0 pki] # openssl x509 -in kubelet-server-2019-04-01-09-03-42.pem  -noout -text | grep Not
            Not Before: Apr  1 08:59:00 2019 GMT
            Not After : May  1 08:51:02 2019 GMT

```

So I am still wondering about the question from https://bugzilla.redhat.com/show_bug.cgi?id=1693951#c8: during the first 24 hours, until the kubelet gets a proper 30-day cert, does it rotate every 2-3 hours?

Comment 16 Praveen Kumar 2019-04-02 05:48:58 UTC
(In reply to Praveen Kumar from comment #15)

Forgot to add the payload info for this installer binary.

```
$ oc adm release info --commits | grep cluster-kube-controller-manager-operator
  cluster-kube-controller-manager-operator      https://github.com/openshift/cluster-kube-controller-manager-operator      4e5073f837b1262db0c390c30b275b293db0b469
```

Comment 17 Maciej Szulik 2019-04-09 14:27:20 UTC
The solution will be provided in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1694079

*** This bug has been marked as a duplicate of bug 1694079 ***