Bug 1684353 - 3.11.82 upgrade playbook selecting evicted pods
Summary: 3.11.82 upgrade playbook selecting evicted pods
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Management Console
Version: 3.11.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: Samuel Padgett
QA Contact: Yadan Pei
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-01 03:34 UTC by Mitchell Rollinson
Modified: 2019-03-07 22:00 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-07 22:00:28 UTC
Target Upstream Version:


Attachments:

Description Mitchell Rollinson 2019-03-01 03:34:37 UTC
Description of problem:

When running the 3.11.82 upgrade playbook, the following task in ./openshift-ansible/roles/openshift_web_console/tasks/start.yml executes:

TASK [openshift_web_console : Get pods in the openshift-web-console namespace]

Issue: The playbook task checks are possibly made against pods that are in a state other than Running, e.g. Evicted.
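
For reference, a roughly equivalent manual listing of those pods (my approximation, not the literal task code) would be:

  oc get pods -n openshift-web-console --config=/etc/origin/master/admin.kubeconfig -o name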
 

Version-Release number of the following components:
rpm -q openshift-ansible


rpm -q ansible
ansible --version

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

This causes the installation to fail.

Expected results:

The playbook task should not continue to target a pod that is not in a Running state.

Additional logic could do one or more of the following (see the sketch after this list):
* verify pod status and skip pods that are Evicted
* wait a set number of seconds for the Evicted state to change
* if there are multiple pods, move on to the next pod until a Running pod is identified
* fail if no running pods are located

Successful installation.
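
For illustration only (hypothetical logic, not taken from the playbook), a guard along these lines could restrict the check to running pods and fail early if none exist:

  # List only pods whose phase is Running (field-selector support may vary by oc version)
  running=$(oc get pods -n openshift-web-console \
      --field-selector=status.phase=Running -o name | wc -l)
  if [ "$running" -eq 0 ]; then
      echo "No running webconsole pods found" >&2
      exit 1
  fi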

Please include the entire output from the last TASK line through the end of output if an error is generated

<peek-boo> (1, '\r\n\r\n{"changed": true, "end": "2019-02-27 14:31:32.279069", "stdout": "", "cmd": ["oc", "logs", "deployment/webconsole", "--tail=50", "--config=/etc/origin/master/admin.kubeconfig", "-n", "openshift-web-console"][15/9282]
": true, "delta": "0:00:00.282914", "stderr": "Found 9 pods, using pod/webconsole-7659dcd487-4454z\\nError from server (BadRequest): container \\"webconsole\\" in pod \\"webconsole-7659dcd487-4454z\\" is not available", "rc": 1, "invocation": {"mo
dule_args": {"warn": true, "executable": null, "_uses_shell": false, "_raw_params": "oc logs deployment/webconsole --tail=50 --config=/etc/origin/master/admin.kubeconfig -n openshift-web-console", "removes": null, "argv": null, "creates": null, "c
hdir": null, "stdin": null}}, "start": "2019-02-27 14:31:31.996155", "msg": "non-zero return code"}\r\n', 'Shared connection to peeka-boo closed.\r\n')
<peek-boo> Failed to connect to the host via ssh: Shared connection to peek-boo closed.
<peek-boo> ESTABLISH SSH CONNECTION FOR USER: thatsme
<peek-boo> SSH: EXEC sshpass -d7 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o User=thatsme -o ConnectTimeout=10 -o ControlPath=/home/thatsme/.ansible/cp/48bf060d0a peek-boo '/bin/sh -c '"'"'rm -f -r /home/thatsme/.ansible/tmp/ansi
ble-tmp-1551231091.53-224885511603996/ > /dev/null 2>&1 && sleep 0'"'"''
<peek-boo> (0, '', '')
fatal: [peek-boo]: FAILED! => {
    "changed": true,
    "cmd": [
        "oc",
        "logs",
        "deployment/webconsole",
        "--tail=50",
        "--config=/etc/origin/master/admin.kubeconfig",
        "-n",
        "openshift-web-console"
    ],
    "delta": "0:00:00.282914",
    "end": "2019-02-27 14:31:32.279069",
    "invocation": {
        "module_args": {
            "_raw_params": "oc logs deployment/webconsole --tail=50 --config=/etc/origin/master/admin.kubeconfig -n openshift-web-console",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2019-02-27 14:31:31.996155",
    "stderr": "Found 9 pods, using pod/webconsole-7659dcd487-4454z\nError from server (BadRequest): container \"webconsole\" in pod \"webconsole-7659dcd487-4454z\" is not available",
    "stderr_lines": [
        "Found 9 pods, using pod/webconsole-7659dcd487-4454z",
        "Error from server (BadRequest): container \"webconsole\" in pod \"webconsole-7659dcd487-4454z\" is not available"
    ],
    "stdout": "",
    "stdout_lines": []
}
...ignoring

TASK [openshift_web_console : debug] ******************************************************************************************************************************************************************************************************************
task path: /home/thatsme/ansible/tp_openshift_setup/openshift-ansible-3.11.82/roles/openshift_web_console/tasks/start.yml:47
ok: [peek-boo] => {
    "msg": []
}

TASK [openshift_web_console : Report console errors] **************************************************************************************************************************************************************************************************
task path: /home/thatsme/ansible/tp_openshift_setup/openshift-ansible-3.11.82/roles/openshift_web_console/tasks/start.yml:52
fatal: [peek-boo]: FAILED! => {
    "changed": false,
    "msg": "Console install failed."
}
        to retry, use: --limit @/home/thatsme/ansible/tp_openshift_setup/openshift-ansible-3.11.82/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade.retry

###

The part of interest is: "Found 9 pods, using pod/webconsole-7659dcd487-4454z"

If I check the pods in that namespace;

[root@peek-boo ~]# oc get pods
NAME                          READY     STATUS    RESTARTS   AGE
webconsole-7659dcd487-4454z   0/1       Evicted   0          2d
webconsole-7659dcd487-6j46p   0/1       Evicted   0          2d
webconsole-7659dcd487-87qgz   1/1       Running   0          31m
webconsole-7659dcd487-9hhh2   0/1       Evicted   0          1h
webconsole-7659dcd487-bpdmf   0/1       Evicted   0          1h
webconsole-7659dcd487-bw66h   0/1       Evicted   0          2d
webconsole-7659dcd487-lbh7z   0/1       Evicted   0          1h
webconsole-7659dcd487-rvf44   1/1       Running   0          31m
webconsole-7659dcd487-vvfrz   1/1       Running   0          31m


Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 1 Mitchell Rollinson 2019-03-01 03:44:46 UTC
Version-Release number of the following components:

rpm -q openshift-ansible

Name        : openshift-ansible
Arch        : noarch
Version     : 3.11.82
Release     : 3.git.0.9718d0a.el7

rpm -q ansible

ansible 2.7.8

Comment 2 Mitchell Rollinson 2019-03-01 03:54:06 UTC
Tidied up all evicted pods:

eval "$(oc get pods -o json --all-namespaces | jq -r '.items[] | select(.status.phase == "Failed" and .status.reason == "Evicted") | "oc delete pod --namespace " + .metadata.namespace + " " + .metadata.name')"

Then re-ran the upgrade, and no issues were observed.

Naturally, ensuring a clean, tidy cluster before upgrading is ideal.

However, in the event that a pod is evicted during an upgrade, this could still be an issue.

Comment 3 Scott Dodson 2019-03-01 14:17:10 UTC
Do you by chance have the complete logs at the verbosity shown in the description? These tasks should've only happened if the deployment was marked unsuccessful, and it's not clear to me what specifically triggered that condition.

Comment 4 Mitchell Rollinson 2019-03-03 22:04:17 UTC
@Scott
The cu inadvertently overwrote the original 'tee-ed output' log file when re-running the playbook.

They have advised that they will attempt to replicate in their dev environment. I am not sure what the selection process is (which pod the playbook checks) after the pod listing completes, but it is probably the first pod in the list.
As such, if they can force an eviction of that pod, they should be able to replicate.

Re: "These tasks should've only happened if the deployment was marked unsuccessful" - my understanding is that the initial deployment (there were pods in an Evicted state) was unsuccessful.

Only after removing the evicted pods, did the upgrade succeed.

So just to clarify.

1) Evicted pods existed when commencing the upgrade. Absolutely one should verify cluster health prior to upgrading, but the thinking is that a pod may change state during an upgrade.
2) Playbook task generates a list of pods in the namespace
3) The playbook then performs checks on one of the pods in the namespace; it did not identify that the selected pod was evicted (does the playbook need additional logic to verify pod status? That appears to be missing).
4) Upgrade fails
5) CU deletes evicted pods, and re-runs the upgrade successfully.

Mitch

Comment 5 Samuel Padgett 2019-03-04 15:33:38 UTC
This task is simply printing out the pods in the namespace after an upgrade fails to help troubleshoot. It's not the cause of the upgrade failure. We *want* to see evicted pods in case it's related to the upgrade failure.

To verify that the upgrade succeeded, we check ready replicas on the console deployment:

https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_web_console/tasks/start.yml#L2-L16

This logic should handle evicted pods. It looks like the rollout did not complete inside the 10-minute timeout, and the installer correctly failed because it could not update the console.
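
As a rough sketch of that kind of check (not the exact logic from start.yml), the ready replica count on the deployment can be compared against the desired count:

  # Compare desired vs. ready replicas on the console deployment (sketch only)
  desired=$(oc -n openshift-web-console get deployment webconsole -o jsonpath='{.spec.replicas}')
  ready=$(oc -n openshift-web-console get deployment webconsole -o jsonpath='{.status.readyReplicas}')
  if [ "${ready:-0}" -lt "$desired" ]; then
    echo "webconsole rollout not complete: ${ready:-0}/${desired} ready" >&2
  fi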

Note that in the pod listing you provided, the running pods are much newer than the evicted ones, which does seem to indicate that they were created sometime after the 10-minute timeout. If you are able to include the full log, it would help confirm.

Comment 6 Mitchell Rollinson 2019-03-04 21:27:00 UTC
Thanks for the clarification Sam.

I have requested the full logs. The cu will attempt to provide these.

Comment 7 Mitchell Rollinson 2019-03-07 21:56:31 UTC
Hi Sam. The cu has not been able to replicate the issue, and consequently the full set of logs is not available to further this investigation.
The cu is happy with the explanations provided, and now believes that the root cause was a different issue.
Thanks for your help.

