Bug 1354627 - Existing nodes get rebuilt during scale out after 8->9 upgrade
Summary: Existing nodes get rebuilt during scale out after 8->9 upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ga
Target Release: 9.0 (Mitaka)
Assignee: Brad P. Crochet
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On:
Blocks: 1362612 1409851
Reported: 2016-07-11 19:08 UTC by Marius Cornea
Modified: 2017-01-03 16:02 UTC (History)

Fixed In Version: openstack-tripleo-common-2.0.0-8.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1409851
Environment:
Last Closed: 2016-08-11 11:36:00 UTC




Links
System | ID | Priority | Status | Summary | Last Updated
Launchpad | 1609020 | None | None | None | 2016-08-02 16:08:30 UTC
Red Hat Product Errata | RHEA-2016:1599 | normal | SHIPPED_LIVE | Red Hat OpenStack Platform 9 director Release Candidate Advisory | 2016-08-11 15:25:37 UTC

Description Marius Cornea 2016-07-11 19:08:27 UTC
Description of problem:

During the first scale-out attempt (adding an additional compute node) on an upgraded deployment, all the existing nodes get rebuilt:

[stack@undercloud ~]$ nova list


+--------------------------------------+-------------------------+---------+------------------+-------------+-----------------------+
| ID                                   | Name                    | Status  | Task State       | Power State | Networks              |
+--------------------------------------+-------------------------+---------+------------------+-------------+-----------------------+
| bef41d32-bfed-42c8-9839-bd07f8ad2d93 | overcloud-cephstorage-0 | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.21 |
| 5f7fe8b4-54ee-4790-8fd4-c9ab1fad5cf8 | overcloud-cephstorage-1 | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.20 |
| 3772244f-e332-458a-b1b9-6f466a6d7411 | overcloud-cephstorage-2 | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.24 |
| 8a5dbbfd-696e-4ed3-be94-4f77c0f871b5 | overcloud-controller-0  | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.25 |
| ebcdcdb5-de05-45c6-b430-b0119ae04a60 | overcloud-controller-1  | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.23 |
| 4a118c5b-de23-48e8-8dc1-f3e1aaea2db6 | overcloud-controller-2  | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.26 |
| 692099f5-1ddd-4e0a-9fe4-e7c47d1f4d36 | overcloud-novacompute-0 | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.22 |
| 90795dbb-a353-4203-a820-108e3015b499 | overcloud-novacompute-1 | BUILD   | spawning         | NOSTATE     | ctlplane=192.168.0.11 |
+--------------------------------------+-------------------------+---------+------------------+-------------+-----------------------+ 

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-2.0.0-12.el7ost.noarch

How reproducible:


Steps to Reproduce:
1. Do initial deployment
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

2. Upgrade undercloud 
sudo yum update -y
openstack undercloud upgrade

3. Apply THT patches
pushd /usr/share/openstack-tripleo-heat-templates
curl -4 'https://review.openstack.org/gitweb?p=openstack/tripleo-heat-templates.git;a=patch;h=947ed53bc01ad25c90b403e9ad6cef4673a2e71f' | sudo patch -p1
curl -4 'https://review.openstack.org/gitweb?p=openstack/tripleo-heat-templates.git;a=patch;h=65efc468db16a28c623c428dd205f100809c73a1' | sudo patch -p1
curl -4 'https://review.openstack.org/gitweb?p=openstack/tripleo-heat-templates.git;a=patch;h=a6cd51c0c987dec1f391438e87511c5285b07124' | sudo patch -p1
popd

4. Add OSP8 repos on overcloud nodes

5. major-upgrade-aodh.yaml

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-aodh.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

6. major-upgrade-keystone-liberty-mitaka.yaml 

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-keystone-liberty-mitaka.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

7. Add OSP9 repos on overcloud nodes

8. major-upgrade-pacemaker-init.yaml
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e  /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

9. Update os-collect-config and resource-agents on overcloud nodes

10. major-upgrade-pacemaker.yaml
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e  /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

11. Start rabbitmq on controller-1 and controller-2
systemctl start rabbitmq-server.service
pcs resource cleanup

12. 
upgrade-non-controller.sh --upgrade overcloud-novacompute-0

13. 
upgrade-non-controller.sh --upgrade overcloud-cephstorage-0
upgrade-non-controller.sh --upgrade overcloud-cephstorage-1
upgrade-non-controller.sh --upgrade overcloud-cephstorage-2

14. major-upgrade-pacemaker-converge.yaml
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e  /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

15. Update images
openstack overcloud image upload --update-existing
openstack baremetal configure boot

16. Scale out
source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 2 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--libvirt-type qemu

Actual results:
All the nodes get reprovisioned.

Expected results:
The existing nodes remain running.

Additional info:
I passed the overcloud-resource-registry-puppet.yaml environment file only in the last upgrade step, major-upgrade-pacemaker-converge.yaml (I missed it in the earlier commands). Could this be the cause of this behavior?

Comment 2 Marius Cornea 2016-07-12 12:34:58 UTC
I wasn't able to reproduce this issue with the latest build. I'm going to reopen it if I see it again.

Comment 3 Marius Cornea 2016-07-14 08:18:42 UTC
Reopening it - I was able to reproduce this only when I update the images after the last upgrade step (major-upgrade-pacemaker-converge.yaml).

Note that if I update the images right after upgrading the undercloud this issue doesn't show up. Also note that the enable-tls.yaml environment was changed during the overcloud upgrade process in order to overcome BZ#1353079#c6

Nevertheless the result is destructive as all the nodes get recreated so we should make sure a user doesn't end up in this situation.

Comment 11 Alexander Chuzhoy 2016-07-25 20:16:13 UTC
Just checked that the issue doesn't reproduce on clean deployment of 9 + scale out.

Comment 12 Alexander Chuzhoy 2016-07-26 17:11:24 UTC
Ran into https://bugzilla.redhat.com/show_bug.cgi?id=1360421, which probably confirms this bug.

Comment 13 Brad P. Crochet 2016-07-28 12:01:45 UTC
I was able to reproduce this without the Ceph nodes, so the ordering of the image update is looking like a good candidate. I will investigate why that is.

Comment 14 Brad P. Crochet 2016-07-29 11:40:56 UTC
I have now reproduced this with a single controller (running pacemaker) and a single compute. Still investigating why the ordering of the image upload makes a difference.

Comment 15 Brad P. Crochet 2016-07-29 11:42:30 UTC
@mcornea Do you run 'openstack baremetal configure boot' when you update the images right after upgrading the undercloud?

Comment 16 Marius Cornea 2016-07-29 12:25:33 UTC
(In reply to Brad P. Crochet from comment #15)
> @mcornea Do you run the 'openstack baremetal configure boot' when you update
> the images right after the upgrading the undercloud?

Yes, I do. What I did was:

source ~/stackrc; 
openstack overcloud image upload --update-existing
openstack baremetal configure boot

Comment 17 Marius Cornea 2016-07-29 13:33:10 UTC
I hit the nodes getting rebuilt in a different scenario:

1. update images
2. upgrade overcloud
3. scale out additional compute node

4. remove the added node:

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud node delete --stack overcloud --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
fe07a5fb-52ff-4736-acda-64e4267301ff

resulting in:
[stack@undercloud ~]$ nova list
+--------------------------------------+-------------------------+---------+------------------+-------------+-----------------------+
| ID                                   | Name                    | Status  | Task State       | Power State | Networks              |
+--------------------------------------+-------------------------+---------+------------------+-------------+-----------------------+
| 063f9adc-626f-4735-96ca-471e232f90c7 | overcloud-cephstorage-0 | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.20 |
| 6b1ab8c3-835d-427f-bb4a-0c831313d098 | overcloud-compute-0     | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.21 |
| e4c7970b-d43d-46fe-b959-44e367e76b16 | overcloud-controller-0  | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.23 |
| a88fd352-532b-4c23-a845-e53ece208811 | overcloud-controller-1  | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.22 |
| 23810c05-cdf0-4063-86d6-6ed9797e189f | overcloud-controller-2  | REBUILD | rebuild_spawning | Running     | ctlplane=192.168.0.24 |
+--------------------------------------+-------------------------+---------+------------------+-------------+-----------------------+

Comment 18 Brad P. Crochet 2016-08-01 23:07:41 UTC
Progress... I tried doing the upgrade, but manually changing the deploy_kernel and deploy_ramdisk on only unused nodes, leaving the already installed nodes alone. The old nodes were not rebuilt. It's probably a bad idea to have the old images updated like that anyway. So, the fix may need to come in the 'configure boot' command, and have it ignore already installed nodes.
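
The manual experiment described in the comment above could be sketched roughly as follows. This is only an illustration, not the fix that shipped: the helper names unused_nodes and plan_updates are hypothetical, and the input format (one "node_uuid instance_uuid" pair per line, with "None" for a node that has no instance) is an assumption about how the Ironic node listing has been reduced beforehand. The sketch only prints the ironic node-update commands and does not run them.

```shell
# Sketch only: helper names and input format are assumptions, not part of
# tripleo-common or python-tripleoclient.

# Read "node_uuid instance_uuid" pairs on stdin and print the UUIDs of
# nodes that have no instance deployed on them.
unused_nodes() {
    awk 'NF >= 1 && ($2 == "None" || $2 == "") { print $1 }'
}

# Print (but do not execute) one update command per unused node, pointing
# its deploy_kernel and deploy_ramdisk at the given Glance image IDs.
plan_updates() {
    local kernel_id="$1" ramdisk_id="$2" node
    unused_nodes | while read -r node; do
        echo "ironic node-update $node replace driver_info/deploy_kernel=$kernel_id driver_info/deploy_ramdisk=$ramdisk_id"
    done
}

# Example with placeholder identifiers: only node-b has no instance, so
# only node-b gets an update command.
printf 'node-a instance-1\nnode-b None\n' | plan_updates KERNEL_ID RAMDISK_ID
```

Printing the commands first makes it possible to review exactly which nodes would be touched before any of them is changed, which matters here because touching an already deployed node is what the experiment was trying to avoid.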

Comment 21 Jiri Stransky 2016-08-02 15:12:40 UTC
Just a data point -- (assuming this is not an intermittent issue) i was able to prevent the rebuild from happening by editing heat-engine code this way:

https://paste.fedoraproject.org/399907/1497781/raw/

Obviously this is not a solution, but maybe it could help us narrow down the search for the cause. I wonder why doing the above is necessary, even though we have a Heat plugin to ignore property changes on OS::Nova::Server, which seemed to previously prevent OS::Nova::Server replacement successfully:

https://github.com/openstack/tripleo-common/blob/stable/mitaka/undercloud_heat_plugins/server_update_allowed.py

Is the difference here rebuild vs. replace perhaps? Previously we've seen issues where 2nd instance of the server was deployed, while now we see them rebuilding instead.



Another data point -- i managed to reproduce the issue like this:

# ... finish upgrade ...

tar -xvf overcloud-full.tar
openstack overcloud image upload --update-existing

# ... and now do the scale up ...

^^ the point being i didn't download updated ironic agent image and i didn't run the `configure boot` command, but the issue still reproduced.

Comment 24 Zane Bitter 2016-08-02 15:22:26 UTC
(In reply to Jiri Stransky from comment #21)
> Just a data point -- (assuming this is not an intermittent issue) i was able
> to prevent the rebuild from happening by editing heat-engine code this way:
> 
> https://paste.fedoraproject.org/399907/1497781/raw/
> 
> Obviously this is not a solution, but maybe it could help us narrow down the
> search for the cause.

Yes, the proximate cause is clearly that the image name is changing. So we need to figure out why.

> I wonder why doing the above is necessary, even though
> we have a Heat plugin to ignore property changes on OS::Nova::Server, which
> seemed to previously prevent OS::Nova::Server replacement successfully:
> 
> https://github.com/openstack/tripleo-common/blob/stable/mitaka/
> undercloud_heat_plugins/server_update_allowed.py
> 
> Is the difference here rebuild vs. replace perhaps? Previously we've seen
> issues where 2nd instance of the server was deployed, while now we see them
> rebuilding instead.

Correct, that custom plugin is designed to prevent any changes to properties triggering a replacement, not to ignore all changes.

Comment 25 Brad P. Crochet 2016-08-02 15:29:29 UTC
(In reply to Zane Bitter from comment #24)
> (In reply to Jiri Stransky from comment #21)
> > Just a data point -- (assuming this is not an intermittent issue) i was able
> > to prevent the rebuild from happening by editing heat-engine code this way:
> > 
> > https://paste.fedoraproject.org/399907/1497781/raw/
> > 
> > Obviously this is not a solution, but maybe it could help us narrow down the
> > search for the cause.
> 
> Yes, the proximate cause is clearly that the image name is changing. So we
> need to figure out why.
> 

It does seem to be only on scale up/down, rather than a "simple" stack update.

Comment 26 Zane Bitter 2016-08-02 15:45:37 UTC
This change in Mitaka:

https://review.openstack.org/#/c/287834/11

added a translation rule that causes the 'image' property passed to OS::Nova::Server to be automatically translated to a UUID prior to the properties being assembled. The result is that TripleO's previous trick of uploading a new image and keeping the same name (but getting a new UUID) no longer works to prevent Heat from rebuilding the server when the image changes.
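
A minimal sketch of what this means in practice, assuming the server's image field is available as text containing the UUID it was built from (for example "overcloud-full (<uuid>)"). The helpers extract_uuid and image_verdict are hypothetical names introduced for illustration. They only model the comparison described above: once the name is resolved to a UUID, a re-uploaded image with the same name no longer looks unchanged.

```shell
# Sketch only: models the name-to-UUID comparison; it is not Heat's code.

# Pull the first UUID out of a string such as "overcloud-full (<uuid>)".
extract_uuid() {
    printf '%s\n' "$1" |
        grep -oiE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' |
        head -n 1
}

# Print "keep" when the server was built from the UUID the image name
# currently resolves to, and "rebuild" when the two UUIDs differ.
image_verdict() {
    local current_id="$1" server_image_field="$2" server_id
    server_id="$(extract_uuid "$server_image_field")"
    if [ "$server_id" = "$current_id" ]; then
        echo keep
    else
        echo rebuild
    fi
}

# Possible use on the undercloud (not executed here; assumes stackrc is
# sourced and that the image field contains the UUID):
#   current="$(openstack image show overcloud-full -f value -c id)"
#   field="$(openstack server show overcloud-controller-0 -f value -c image)"
#   image_verdict "$current" "$field"
```

A "rebuild" verdict for existing nodes before a scale out would suggest that the images were re-uploaded after those nodes were deployed, which matches the reproduction reported in the earlier comments.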

Comment 27 Rabi Mishra 2016-08-02 16:44:51 UTC
I assume the update_restrict feature can also be used to avoid update/replacement.

http://docs.openstack.org/developer/heat/template_guide/environment.html#restrict-update-or-replace-of-a-given-resource

Comment 29 Marius Cornea 2016-08-05 15:26:52 UTC
openstack-tripleo-common-2.0.0-8.el7ost.noarch

Comment 31 errata-xmlrpc 2016-08-11 11:36:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1599.html

