Bug 1691551 - Director deployed OCP 3.11 fails while Ironic is deploying the worker nodes
Summary: Director deployed OCP 3.11 fails while Ironic is deploying the worker nodes
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: RHOS Maint
QA Contact: Sasha Smolyak
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-21 21:25 UTC by Marius Cornea
Modified: 2019-03-29 20:31 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Attachments
ironic.tar.gz (deleted), 2019-03-21 21:25 UTC, Marius Cornea

Description Marius Cornea 2019-03-21 21:25:33 UTC
Created attachment 1546708 [details]
ironic.tar.gz

Description of problem:

Director-deployed OCP 3.11 fails while Ironic is deploying the worker nodes:

(undercloud) [stack@undercloud-0 ~]$ openstack server list
+--------------------------------------+--------------------+--------+------------------------+----------------+--------+
| ID                                   | Name               | Status | Networks               | Image          | Flavor |
+--------------------------------------+--------------------+--------+------------------------+----------------+--------+
| 91f0adcf-ac6f-4283-8ead-87def0ab7fe0 | openshift-worker-1 | BUILD  |                        | overcloud-full | worker |
| 233daa2d-523e-453b-b6bd-2a1e62561920 | openshift-worker-2 | BUILD  |                        | overcloud-full | worker |
| 8d70b9ad-7474-43af-b529-eb834e818949 | openshift-infra-0  | ACTIVE | ctlplane=192.168.24.8  | overcloud-full | infra  |
| ce04f408-8a1d-4fd3-9a59-c5bec1fdeecf | openshift-master-1 | ACTIVE | ctlplane=192.168.24.20 | overcloud-full | master |
| e2b76597-7551-49e5-bc97-022c6dc81db2 | openshift-master-0 | ACTIVE | ctlplane=192.168.24.19 | overcloud-full | master |
| f90e7b89-72d1-496b-8ad3-faa39a25b6f9 | openshift-infra-1  | ACTIVE | ctlplane=192.168.24.11 | overcloud-full | infra  |
| 6022f7a2-c9d2-4da2-9422-3aeec0003e4b | openshift-worker-0 | BUILD  |                        | overcloud-full | worker |
| 22046fab-8ae4-45f1-83dd-9bf624bae02b | openshift-infra-2  | ACTIVE | ctlplane=192.168.24.21 | overcloud-full | infra  |
| f982330b-982e-4596-964e-bb51c04e5ffe | openshift-master-2 | ACTIVE | ctlplane=192.168.24.12 | overcloud-full | master |
+--------------------------------------+--------------------+--------+------------------------+----------------+--------+

Deployment of openshift-worker nodes never succeeds.


Version-Release number of selected component (if applicable):
openstack-ironic-api:20190320.1 

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP 3.11 overcloud with 3 x master + 3 x infra + 3 x worker nodes. Worker nodes have 2 additional disks for Gluster storage.

Actual results:
Ironic fails to deploy the worker nodes.

Expected results:
No failures.

Additional info:
Attaching Ironic logs.

Comment 1 Dmitry Tantsur 2019-03-25 10:49:10 UTC
2019-03-21 17:01:54.094 1 ERROR ironic.drivers.modules.agent_client [req-4f1bc06c-2bbb-4b8b-9608-0abfc4b7ca9e - - - - -] Failed to connect to the agent running on node 7e542470-ed6a-4b9c-9a19-c47215d57407 for invoking command image.install_bootloader. Error: HTTPConnectionPool(host='192.168.24.17', port=9999): Read timed out. (read timeout=60): requests.exceptions.ReadTimeout: HTTPConnectionPool(host='192.168.24.17', port=9999): Read timed out. (read timeout=60)

Could this be a firewall or DHCP error? Could you check what is going on on the nodes themselves? The deploy logs you have seem to show success for the nodes that have failures in the ironic-conductor logs.
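
If the 60-second read timeout is simply the conductor giving up on a slow but otherwise healthy agent, one diagnostic is to raise it. A minimal sketch, assuming a containerized undercloud, that crudini is available, and that this release supports the [agent] command_timeout option (whose 60s default matches the error above); verify all of that before relying on it:

# Hypothetical diagnostic, run on the undercloud: raise the conductor's
# IPA command timeout, then restart ironic-conductor. The config path
# and systemd unit name assume a podman-based containerized undercloud.
sudo crudini --set \
    /var/lib/config-data/puppet-generated/ironic/etc/ironic/ironic.conf \
    agent command_timeout 300
sudo systemctl restart tripleo_ironic_conductor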

Comment 2 Bob Fournier 2019-03-27 12:37:05 UTC
Hi Marius - any update on this?

Comment 3 Marius Cornea 2019-03-28 01:05:59 UTC
(In reply to Dmitry Tantsur from comment #1)
> 2019-03-21 17:01:54.094 1 ERROR ironic.drivers.modules.agent_client
> [req-4f1bc06c-2bbb-4b8b-9608-0abfc4b7ca9e - - - - -] Failed to connect to
> the agent running on node 7e542470-ed6a-4b9c-9a19-c47215d57407 for invoking
> command image.install_bootloader. Error:
> HTTPConnectionPool(host='192.168.24.17', port=9999): Read timed out. (read
> timeout=60): requests.exceptions.ReadTimeout:
> HTTPConnectionPool(host='192.168.24.17', port=9999): Read timed out. (read
> timeout=60)
> 
> May be a firewall or DHCP error? Could you check what is going on the nodes
> themselves? The deploy logs you have seem to show success for the nodes that
> have failures in ironic-conductor logs.

I tried calling the agent URL while deploying and it took longer than 60s to return, but I don't know how to live-debug what is going on on the node:

time curl 192.168.24.14:9999
{"name": "OpenStack Ironic Python Agent API", "description": "Ironic Python Agent is a provisioning agent for OpenStack Ironic", "versions": [{"id": "v1", "links": [{"href": "http://192.168.24.14:9999/v1/",
 "rel": "self"}]}], "default_version": {"id": "v1", "links": [{"href": "http://192.168.24.14:9999/v1/", "rel": "self"}]}}
real 2m21.189s
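
One way to get eyes on the node while it deploys in a virt environment; a sketch, assuming the overcloud nodes are libvirt guests and the IPA image permits console login (a stock ramdisk does not; it needs a devuser/dynamic-login enabled build):

# On the hypervisor backing the virtual overcloud nodes; the domain
# name here is hypothetical, list the real ones with `virsh list`.
virsh console openshift-worker-0
# Once logged in on the ramdisk, follow the agent's journal:
journalctl -f -u ironic-python-agent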

What is odd is that this issue occurs only for nodes with 4 disks. Deployment of nodes with 1 or 2 disks works fine.

Looking at the deploy logs for the failed node, I can see that it tried running grub2-mkconfig 3 times:

 [root@undercloud-0 9516cf39-4aa6-4e66-ae0d-37add36b8d15]# grep grub2-mk journal  | grep CMD
Mar 27 20:04:29 host-192-168-24-9 ironic-python-agent[705]: 2019-03-27 20:04:29.684 705 DEBUG oslo_concurrency.processutils [-] CMD "chroot /tmp/tmp1kv7n9k8 /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg"" returned: 0 in 91.550s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409
Mar 27 20:06:01 host-192-168-24-9 ironic-python-agent[705]: 2019-03-27 20:06:01.674 705 DEBUG oslo_concurrency.processutils [-] CMD "chroot /tmp/tmp00fqhe1u /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg"" returned: 0 in 91.340s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409
Mar 27 20:07:33 host-192-168-24-9 ironic-python-agent[705]: 2019-03-27 20:07:33.615 705 DEBUG oslo_concurrency.processutils [-] CMD "chroot /tmp/tmpquhri5xn /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg"" returned: 0 in 91.277s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409


Compared to a node with 2 disks:

 [root@undercloud-0 e2761690-9cf0-445a-8456-d76205661d9f]# grep grub2-mk journal  | grep CMD
Mar 27 20:03:49 host-192-168-24-22 ironic-python-agent[681]: 2019-03-27 20:03:49.506 681 DEBUG oslo_concurrency.processutils [-] CMD "chroot /tmp/tmp28_55k_m /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg"" returned: 0 in 51.376s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409


and a node with 1 disk:

 [root@undercloud-0 cb458d2c-66b9-4108-ad7b-8671a953c3f2]# grep grub2-mk journal  | grep CMD
Mar 27 20:03:29 host-192-168-24-24 ironic-python-agent[702]: 2019-03-27 20:03:29.209 702 DEBUG oslo_concurrency.processutils [-] CMD "chroot /tmp/tmp8owj_d0u /bin/sh -c "grub2-mkconfig -o /boot/grub2/grub.cfg"" returned: 0 in 31.407s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:409
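
For completeness, a quick way to pull those durations out of every node's deploy log at once; a sketch, assuming one journal file per node-UUID directory as in the prompts above:

# Hypothetical helper, run from the directory holding the per-node
# UUID subdirectories: print each node's grub2-mkconfig durations.
for d in */; do
  printf '%s: ' "$d"
  grep grub2-mk "$d/journal" | grep CMD | grep -o 'in [0-9.]*s' | tr '\n' ' '
  echo
done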

Comment 5 Dmitry Tantsur 2019-03-29 15:55:07 UTC
If it blocks eventlet (it shouldn't, but who knows), that could explain the timeout. However, even 30 seconds sounds insane to me.

Comment 6 Dmitry Tantsur 2019-03-29 16:06:14 UTC
Is there any previous data on the disks? If yes, could you try with metadata cleaning enabled?
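
For reference, a minimal sketch of enabling metadata cleaning; the option and clean-step names are assumptions to verify against this release:

# Hypothetical: enable automated node cleaning undercloud-wide by
# setting clean_nodes = true in undercloud.conf and re-running:
openstack undercloud install
# Or run only the metadata-erase clean step on one node (node must be
# in the manageable state; <node-uuid> is a placeholder):
openstack baremetal node clean \
  --clean-steps '[{"interface": "deploy", "step": "erase_devices_metadata"}]' \
  <node-uuid>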

Comment 7 Marius Cornea 2019-03-29 20:31:25 UTC
(In reply to Dmitry Tantsur from comment #6)
> Is there any previous data on the disks? If yes, could you try with
> metadata cleaning enabled?

No, this is an initial deployment in a virt environment where the disk files have just been created.

