Bug 1518921 - etcd v3 migration fails on unavailable NTP
Summary: etcd v3 migration fails on unavailable NTP
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.1
Hardware: All
OS: Linux
Target Milestone: ---
Target Release: 3.6.z
Assignee: Scott Dodson
QA Contact: Johnny Liu
Depends On:
Reported: 2017-11-29 18:40 UTC by Matthew Robson
Modified: 2018-08-08 16:15 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2018-08-08 16:15:18 UTC
Target Upstream Version:


Description Matthew Robson 2017-11-29 18:40:07 UTC
Description of problem:

The etcd v3 migration attempted to start and enable ntpd after a significant portion of the migration had already run.

Due to an SELinux issue, that step failed and caused the migration to abort, leaving the cluster in a down state.

TASK [openshift_clock : Start and enable ntpd/chronyd] *************************************************
fatal: [server-d-100.dmz]: FAILED! => {
    "changed": true,
    "cmd": "timedatectl set-ntp true",
    "delta": "0:00:25.046517",
    "end": "2017-11-27 12:24:10.275760",
    "failed": true,
    "rc": 1,
    "start": "2017-11-27 12:23:45.229243"


Failed to set ntp: Connection timed out

Version-Release number of selected component (if applicable):

[root@~]# oc version
oc v3.
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

openshift v3.
kubernetes v1.6.1+5115d708d7

How reproducible:

This would reproduce 100% of the time with an NTP issue.

Steps to Reproduce:
1. Run the migration when there is an NTP client issue

Actual results:

Entire migration fails

Expected results:

Validate NTP as a prerequisite before starting, or remove it from the migration.

Additional info:


Comment 1 Vadim Rutkovsky 2018-01-17 13:44:43 UTC
NTP is required for etcd migration, as it protects against generating invalid certificates.

It seems that a later re-run of this playbook got past the clock task and then stumbled upon an unreachable host.

Would you mind if I close this issue? The failure seems transient, and the NTP step cannot be skipped.

Comment 2 Matthew Robson 2018-01-17 13:57:42 UTC
The unavailability of NTP leads to an almost unrecoverable situation. When this happens, there are already v3 elements created in etcd, so rerunning the playbook errors out on detecting v3 elements. 

I understand the necessity, but NTP should be validated pre-flight and not lead to an unrecoverable failure part way through the migration.
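
A pre-flight check of the kind requested here could look roughly like the sketch below. It is not part of openshift-ansible; the function names (`parse_timedatectl`, `clock_is_synchronized`, `preflight_ntp_check`) and the 10-second timeout are assumptions made for illustration. It reads `timedatectl status` and accepts either the `NTP synchronized` field or the `System clock synchronized` field, since the name differs between systemd versions.

```python
import subprocess

# Field name differs between systemd versions; accept either spelling.
SYNC_KEYS = ("NTP synchronized", "System clock synchronized")


def parse_timedatectl(output):
    """Turn `timedatectl status` text into a {field: value} dict."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, so timestamps in values survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def clock_is_synchronized(output):
    """Return True if the status text reports a synchronized clock."""
    fields = parse_timedatectl(output)
    return any(fields.get(key) == "yes" for key in SYNC_KEYS)


def preflight_ntp_check(timeout=10):
    """Hypothetical gate to run before any etcd data is touched."""
    try:
        result = subprocess.run(
            ["timedatectl", "status"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary or a hung call counts as a failed check.
        return False
    return result.returncode == 0 and clock_is_synchronized(result.stdout)
```

Calling `preflight_ntp_check()` before the first migration task and aborting on `False` would surface the NTP problem while the cluster is still untouched.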

Comment 3 Vadim Rutkovsky 2018-01-18 13:44:05 UTC
(In reply to Matthew Robson from comment #2)

> I understand the necessity, but NTP should be validated pre-flight and not
> lead to an unrecoverable failure part way through the migration.

It is validated fairly early in the run:

   2017-11-27 12:02:14,406 p=74077 u=root |  PLAY [Create initial host groups for localhost] ********************************************************
   2017-11-27 12:17:47,351 p=85008 u=root |  TASK [openshift_clock : Start and enable ntpd/chronyd] *************************************************
   2017-11-27 12:18:20,159 p=85008 u=root |  TASK [etcd_ca : Create etcd CA certificate] ************************************************************

It seems the failure occurred when NTP was set up for the second time (this part should probably be reworked to be idempotent), so the failure is definitely transient.
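
As an illustration of what "idempotent" could mean for this step, the sketch below only calls `timedatectl set-ntp true` when the status output does not already report NTP as enabled. This is not the code of the `openshift_clock` role; `ntp_already_enabled`, `ensure_ntp_enabled`, and `run_command` are hypothetical names, and the sketch assumes the status output contains either an `NTP enabled: yes` or an `NTP service: active` line, depending on the systemd version.

```python
import subprocess


def run_command(argv):
    """Default runner: return stdout, raise CalledProcessError on failure."""
    return subprocess.check_output(argv, universal_newlines=True)


def ntp_already_enabled(status_text):
    """Check `timedatectl status` text for an enabled NTP client."""
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("NTP enabled", "NTP service"):
            return value.strip() in ("yes", "active")
    return False


def ensure_ntp_enabled(run=run_command):
    """Enable NTP only if needed; return True when a change was made."""
    if ntp_already_enabled(run(["timedatectl", "status"])):
        return False  # already enabled, so a second run is a no-op
    run(["timedatectl", "set-ntp", "true"])
    return True
```

With this shape, a second pass over the clock setup on a host where NTP is already enabled never reaches the `set-ntp` call that timed out in the report.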

Comment 5 Stephen Cuppett 2018-08-08 16:15:18 UTC
The original customer case is closed and contained two parts. The other part was migration of certificates prior to this step. With the errata, the NTP check should now be positioned correctly to prevent further bad states in the etcd data during migration.

There are KB and errata articles to help prevent (or remove) v3 content that was created when etcd updates were interrupted as part of 3.5->3.6 migrations. The upgrades can be retried after this.

Based on the above and previous comments, marking WONTFIX.
