Bug 1686687 - [3.9] upgrade failed due to node name issue on OSP
Summary: [3.9] upgrade failed due to node name issue on OSP
Keywords:
Status: ASSIGNED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Upgrade
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.9.z
Assignee: Patrick Dillon
QA Contact: Weihua Meng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-08 02:08 UTC by Weihua Meng
Modified: 2019-04-11 00:52 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Attachments
Running Openshift facts on the failed node shows different results than the upgrade run. (deleted)
2019-04-04 19:53 UTC, Patrick Dillon
no flags

Description Weihua Meng 2019-03-08 02:08:12 UTC
Description of problem:
Upgrade failed due to a node name issue on OSP.
No such issue was found on the GCE platform.

Version-Release number of the following components:
openshift-ansible-3.9.71-1.git.0.abb258d.el7

How reproducible:
N/A

Steps to Reproduce:
1. setup OCP v3.7 cluster on OSP
2. upgrade to v3.9 with openshift-ansible-3.9.71-1.git.0.abb258d.el7


Actual results:
upgrade failed.

fatal: [dhcp-89-135.sjc.redhat.com -> dhcp-89-148.sjc.redhat.com]: FAILED! => {
    "attempts": 10, 
    "changed": false, 
    "failed": true, 
    "invocation": {
        "module_args": {
            "debug": false, 
            "dry_run": false, 
            "evacuate": false, 
            "force": false, 
            "grace_period": null, 
            "kubeconfig": "/etc/origin/master/admin.kubeconfig", 
            "list_pods": false, 
            "node": [
                "wmengshare2ug37-nrri-2.openshift-snvl2.internal"
            ], 
            "pod_selector": null, 
            "schedulable": false, 
            "selector": null
        }
    }, 
    "msg": {
        "results": [
            {
                "cmd": "/usr/local/bin/oc get node wmengshare2ug37-nrri-2.openshift-snvl2.internal -o json", 
                "results": [
                    {}
                ], 
                "returncode": 1, 
                "stderr": "Error from server (NotFound): nodes \"wmengshare2ug37-nrri-2.openshift-snvl2.internal\" not found\n", 
                "stdout": ""
            }
        ], 
        "returncode": 1
    }
}


more info:
[root@wmengshare2ug37-master-etcd-1 ~]# oc get node
NAME                             STATUS    ROLES     AGE       VERSION
wmengshare2ug37-master-etcd-1    Ready     master    8h        v1.9.1+a0ce1bc657
wmengshare2ug37-master-etcd-2    Ready     master    8h        v1.9.1+a0ce1bc657
wmengshare2ug37-master-etcd-3    Ready     master    8h        v1.9.1+a0ce1bc657
wmengshare2ug37-node-primary-1   Ready     compute   8h        v1.7.6+a08f5eeb62
wmengshare2ug37-node-primary-2   Ready     compute   8h        v1.7.6+a08f5eeb62
wmengshare2ug37-nrri-1           Ready     <none>    8h        v1.9.1+a0ce1bc657
wmengshare2ug37-nrri-2           Ready     <none>    8h        v1.7.6+a08f5eeb62

[root@wmengshare2ug37-master-etcd-1 ~]# /usr/local/bin/oc get node wmengshare2ug37-nrri-2.openshift-snvl2.internal
Error from server (NotFound): nodes "wmengshare2ug37-nrri-2.openshift-snvl2.internal" not found
[root@wmengshare2ug37-master-etcd-1 ~]# /usr/local/bin/oc get node wmengshare2ug37-nrri-2
NAME                     STATUS    ROLES     AGE       VERSION
wmengshare2ug37-nrri-2   Ready     <none>    8h        v1.7.6+a08f5eeb62

Expected results:
upgrade successful.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 4 Patrick Dillon 2019-04-03 18:01:17 UTC
You mentioned that you preserved the cluster, but I think it is gone now.

Would you be able to recreate this and preserve the cluster again? It would also be extremely helpful if you could set ANSIBLE_KEEP_REMOTE_FILES=1 when running the upgrade.

Comment 5 Weihua Meng 2019-04-04 00:56:38 UTC
The cluster is still alive.

[root@wmengshare2ug37-master-etcd-1 ~]# oc get node
NAME                             STATUS    ROLES     AGE       VERSION
wmengshare2ug37-master-etcd-1    Ready     master    28d       v1.9.1+a0ce1bc657
wmengshare2ug37-master-etcd-2    Ready     master    28d       v1.9.1+a0ce1bc657
wmengshare2ug37-master-etcd-3    Ready     master    28d       v1.9.1+a0ce1bc657
wmengshare2ug37-node-primary-1   Ready     compute   28d       v1.7.6+a08f5eeb62
wmengshare2ug37-node-primary-2   Ready     compute   28d       v1.7.6+a08f5eeb62
wmengshare2ug37-nrri-1           Ready     <none>    28d       v1.9.1+a0ce1bc657
wmengshare2ug37-nrri-2           Ready     <none>    28d       v1.7.6+a08f5eeb62

Comment 6 Patrick Dillon 2019-04-04 19:52:21 UTC
I am struggling to reproduce this issue. 

As best I can tell, the failure in the logs is happening because OpenShift Facts is not finding a provider when it is run on the node wmengshare2ug37-nrri-2. When a provider is found, the hostname and the nodename are updated to the shorter, correct form. Because the provider is not found, the hostname and nodename are left in the longer form, which eventually causes the error.
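
To make the failure mode concrete, here is a minimal sketch (my own illustration in Python, not the actual openshift_facts code; the function name and the provider_facts shape are hypothetical) of the behavior described above:

import socket

def guess_nodename(provider_facts):
    # provider_facts: metadata from provider detection, or an empty
    # dict when no provider is found (hypothetical shape).
    # FQDN as seen on the host, e.g.
    # wmengshare2ug37-nrri-2.openshift-snvl2.internal
    fqdn = socket.getfqdn()
    if provider_facts:
        # With a provider, the short hostname from the metadata is used,
        # which matches the name the node registered with the cluster.
        return provider_facts.get("hostname", fqdn.split(".")[0])
    # Without a provider the FQDN is kept, so a later
    # "oc get node <fqdn>" fails with NotFound, as in the task output above.
    return fqdn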

Also note in the logs:
- the other infrastructure node wmengshare2ug37-nrri-1 succeeds
- the master wmengshare2ug37-master-etcd-3 does not find a provider in the Gather Cluster facts task. This is interesting because wmengshare2ug37-master-etcd-3 finds a provider in all other tasks, including when it is upgraded earlier in the playbook.

This seeming randomness about which nodes can or cannot find a provider makes me wonder whether some sort of network issue was preventing the node from connecting to the provider metadata URL: https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/openshift_facts/library/openshift_facts.py#L1474
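
One way to check this theory on the affected node (my own sketch, not part of openshift-ansible; it probes the standard OpenStack metadata endpoint, while the exact URL openshift_facts queries is in the file linked above):

import requests

METADATA_URL = "http://169.254.169.254/openstack/latest/meta_data.json"

try:
    resp = requests.get(METADATA_URL, timeout=5)
    resp.raise_for_status()
    print("metadata reachable, instance name:", resp.json().get("name"))
except requests.RequestException as exc:
    print("metadata endpoint not reachable:", exc)

If this times out or fails intermittently on wmengshare2ug37-nrri-2, that would explain why provider detection sometimes comes back empty.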

When I run OpenShift facts on the node that failed with the following command, the provider is found. I am attaching the logs.

sudo ansible-playbook -vvv -i inventory/inventory openshift-ansible/playbooks/init/cluster_facts.yml --step --extra-vars "l_init_fact_hosts=dhcp-89-135.sjc.redhat.com" --key-file="~/.ssh/libra.pem" 

When you have reproduced this issue, is it always the same node that fails? Any tips for reproducing this would greatly help. Ideally, you could run a new upgrade to reproduce the issue with ANSIBLE_KEEP_REMOTE_FILES=1 set and keep the cluster alive. It would be fine to replace the existing cluster.

Comment 7 Patrick Dillon 2019-04-04 19:53:27 UTC
Created attachment 1552069
Running Openshift facts on the failed node shows different results than the upgrade run.

