Bug 1364241 - rhel-osp-director: Update 7.2->7.3Async fails: Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining Error performing operation: Timer expired
Summary: rhel-osp-director: Update 7.2->7.3Async fails: Error: Could not complete sh...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: async
Target Release: 7.0 (Kilo)
Assignee: Michele Baldessari
QA Contact: Omri Hochman
URL:
Whiteboard:
Duplicates: 1384068
Depends On:
Blocks: 1344794
 
Reported: 2016-08-04 19:07 UTC by Alexander Chuzhoy
Modified: 2018-08-13 13:26 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-20 08:18:36 UTC


Attachments
journalctl errors (deleted), attached 2016-08-05 07:53 UTC by Marios Andreou


Links
Red Hat Bugzilla 1381356 (Priority: None, Status: None, Summary: None, Last Updated: Never)

Internal Links: 1381356

Description Alexander Chuzhoy 2016-08-04 19:07:31 UTC
rhel-osp-director:   Update 7.2->7.3Async fails: Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining Error performing operation: Timer expired


Environment:
openstack-tripleo-heat-templates-0.8.6-128.el7ost.noarch
openstack-puppet-modules-2015.1.8-51.el7ost.noarch
instack-undercloud-2.1.2-39.el7ost.noarch


Steps to reproduce:
1. Deploy 7.2GA with:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server clock.redhat.com --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml --ceph-storage-scale 1

2. Populate the setup.
3. Try to update to 7.3Async
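
For reference, the minor-update step at this point is typically run interactively from the undercloud with something like the following (a sketch only; the environment files are assumed to match the deploy command in step 1):

openstack overcloud update stack overcloud -i --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e network-environment.yaml

The -i flag polls the update and prints the IN_PROGRESS/FAILED states shown below.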

Result:

IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
FAILED
update finished with status FAILED
[stack@instack ~]$ heat resource-list -n5 overcloud|grep -v COMPLE
You must provide a username via either --os-username or env[OS_USERNAME] or a token via --os-auth-token or env[OS_AUTH_TOKEN]
[stack@instack ~]$ . stackrc
[stack@instack ~]$ heat resource-list -n5 overcloud|grep -v COMPLE
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
| resource_name                                 | physical_resource_id                          | resource_type                                     | resource_status | updated_time         | parent_resource                               |
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
| ControllerNodesPostDeployment                 | 7a26168f-6a06-4f0f-bee1-170e1efb4967          | OS::TripleO::ControllerPostDeployment             | UPDATE_FAILED   | 2016-08-04T17:43:14Z |                                               |
| ControllerPostPuppet                          | 7db75a4c-4d71-428f-a014-2e5ba3403839          | OS::TripleO::Tasks::ControllerPostPuppet          | UPDATE_FAILED   | 2016-08-04T18:03:01Z | ControllerNodesPostDeployment                 |
| ControllerPostPuppetRestartDeployment         | 61534b6e-f54f-425c-b307-ddcede267abf          | OS::Heat::SoftwareDeployments                     | UPDATE_FAILED   | 2016-08-04T18:04:23Z | ControllerPostPuppet                          |
| 0                                             | bc9d8d72-9c7a-4c2d-93e9-a74502be60c0          | OS::Heat::SoftwareDeployment                      | UPDATE_FAILED   | 2016-08-04T18:04:27Z | ControllerPostPuppetRestartDeployment         |
+-----------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+----------------------+-----------------------------------------------+
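
The JSON fragment further below appears to come from inspecting the failed SoftwareDeployment; a minimal sketch of pulling it (UUID taken from the table above, stackrc already sourced):

heat deployment-show bc9d8d72-9c7a-4c2d-93e9-a74502be60c0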


++ grep -v Clone
+ node_states=' openstack-keystone      (systemd:openstack-keystone):   (target-role:Stopped) Stopped
 openstack-keystone     (systemd:openstack-keystone):   (target-role:Stopped) Stopped
 openstack-keystone     (systemd:openstack-keystone):   (target-role:Stopped) Stopped'
+ echo ' openstack-keystone     (systemd:openstack-keystone):   (target-role:Stopped) Stopped
 openstack-keystone     (systemd:openstack-keystone):   (target-role:Stopped) Stopped
 openstack-keystone     (systemd:openstack-keystone):   (target-role:Stopped) Stopped'
+ grep -q Started
+ echo 'openstack-keystone has stopped'
+ return
+ pcs status
+ grep haproxy-clone
+ pcs resource restart haproxy-clone
+ pcs resource restart redis-master
+ pcs resource restart mongod-clone
+ pcs resource restart rabbitmq-clone
Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining
Error performing operation: Timer expired

Set 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role set=rabbitmq-clone-meta_attributes name=target-role=stopped
Waiting for 1 resources to stop:
 cirros-0.3.4-x86_64-disk.img deploy_command deploy-ramdisk-ironic.initramfs deploy-ramdisk-ironic.kernel deploy-ramdisk-ironic.tar discovery-ramdisk.initramfs discovery-ramdisk.kernel discovery-ramdisk.tar instackenv.json keystonerc_master netiso-virt-7.x.tgz network-environment.yaml nic-configs oskey.priv overcloud-env.json overcloud-full.initrd overcloud-full.qcow2 overcloud-full.tar overcloud-full.vmlinuz overcloudrc rhel-guest-image-7.2-20160302.0.x86_64.qcow2 rhos-qe-core-installer stackrc tempest-deployer-input.conf tripleo-overcloud-passwords undercloud.conf undercloud-passwords.conf rabbitmq-clone
 cirros-0.3.4-x86_64-disk.img deploy_command deploy-ramdisk-ironic.initramfs deploy-ramdisk-ironic.kernel deploy-ramdisk-ironic.tar discovery-ramdisk.initramfs discovery-ramdisk.kernel discovery-ramdisk.tar instackenv.json keystonerc_master netiso-virt-7.x.tgz network-environment.yaml nic-configs oskey.priv overcloud-env.json overcloud-full.initrd overcloud-full.qcow2 overcloud-full.tar overcloud-full.vmlinuz overcloudrc rhel-guest-image-7.2-20160302.0.x86_64.qcow2 rhos-qe-core-installer stackrc tempest-deployer-input.conf tripleo-overcloud-passwords undercloud.conf undercloud-passwords.conf rabbitmq-clone
Deleted 'rabbitmq-clone' option: id=rabbitmq-clone-meta_attributes-target-role name=target-role

", "deploy_status_code": 1 }, "creation_time": "2016-08-04T16:20:38Z", "updated_time": "2016-08-04T18:09:24Z", "input_values": {}, "action": "UPDATE", "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 1", "id": "bc9d8d72-9c7a-4c2d-93e9-a74502be60c0" }

Comment 3 Mike Orazi 2016-08-04 20:41:41 UTC
Rabbit shutdown on controller-1 is taking ~44 seconds.  Can this be host load related?

Comment 4 Andrew Beekhof 2016-08-05 00:20:45 UTC
There are a few stop/start cycles, but I think you're talking about this one, right?

Aug 04 14:08:54 [31393] overcloud-controller-1.localdomain       crmd:     info: do_lrm_rsc_op:	Performing key=124:48:0:31a2b3c6-bcce-4a1c-b9e3-5eefa5a12b25 op=rabbitmq_stop_0
Aug 04 14:09:56 [31393] overcloud-controller-1.localdomain       crmd:   notice: process_lrm_event:	Operation rabbitmq_stop_0: ok (node=overcloud-controller-1, call=476, rc=0, cib-update=271, confirmed=true)

Which is closer to 62s.
Regardless, as far as the cluster is concerned, everything was fine.

Which timeout was not satisfied, and what is it set to? It's not clear from the description.

Seems to me it should really be the start timeout + stop timeout + a second or so of buffer.
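
For reference, a quick way to see what start/stop timeouts are actually configured on the rabbitmq resource (a sketch; assumes pcs 0.9 on a RHEL 7 controller):

pcs resource show rabbitmq     # prints the resource config, including op start/stop timeouts if set
pcs resource op defaults       # cluster-wide operation defaults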


Also I think the clock on overcloud-controller-0.localdomain is off by a couple of hours which makes log analysis rather challenging.

Comment 5 Marios Andreou 2016-08-05 07:53:02 UTC
Created attachment 1187792 [details]
journalctl errors

Just trying to understand what's going on here. From these errors I see a 'failed to push cib' on controller-2, "Error: Could not complete shutdown of rabbitmq-clone, 1 resources remaining" on controller-0, and "Error: {not_a_cluster_node," ("The node selected is not in the cluster.") on controller-1.

Comment 6 Marios Andreou 2016-08-05 16:20:41 UTC
As Andrew says in comment #4, the clock skew makes it hard to understand what is going on here. I poked some more; one note of possible interest is from controller-0:

Aug  4 14:09:24 overcloud-controller-0 os-collect-config: [2016-08-04 18:09:24,766] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/4ad7e76b-bd9a-4cde-afa8-321f0c87d04d. [1]


I confirmed that this file ^^^ is extraconfig/tasks/pacemaker_resource_restart.sh, but that is as far as I got. It could be a legitimate timeout bringing a service down+up on oversubscribed hardware (rabbit in this case), or it could be a regression. Is this reproducible on multiple setups, @sasha, or just once so far?
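
A minimal sketch of how the failing script can be tracked down on a controller (the UUID path is the one from the log line above):

sudo journalctl -u os-collect-config | grep 'Error running'
sudo head -n 20 /var/lib/heat-config/heat-config-script/4ad7e76b-bd9a-4cde-afa8-321f0c87d04d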

Comment 7 Andrew Beekhof 2016-08-08 03:29:19 UTC
This one is a bit concerning:

34156:Aug 04 17:40:30 overcloud-controller-2.localdomain os-collect-config[3603]: Error: unable to push cib


but I don't know what os-collect-config could possibly be trying to push into the cib

Comment 8 Marios Andreou 2016-08-08 11:54:40 UTC
(In reply to Andrew Beekhof from comment #7)
> This one is a bit concerning:
> 
> 34156:Aug 04 17:40:30 overcloud-controller-2.localdomain
> os-collect-config[3603]: Error: unable to push cib
> 
> 
> but i don't know what os-collect-config could possibly be trying to push
> into the cib


o/ Andrew fyi: on current master this is happening in the pacemaker_migrations.sh https://github.com/openstack/tripleo-heat-templates/blob/292fdf87e0fdcbd66664afc4c463f2f0e9a354fa/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh#L92 

However, the environment this bug is for is rhos7, so looking at the rhos-7.0-patches branch to confirm: this happened during yum_update.sh, which is what gets executed here for the 7.2 -> 7.3 update since it is rhos 7. Downstream link to the relevant code: https://code.engineering.redhat.com/gerrit/gitweb?p=openstack-tripleo-heat-templates.git;a=blob;f=extraconfig/tasks/yum_update.sh;h=98ad25c764fc9948567253444d882d45a6b4b31d;hb=refs/heads/rhos-7.0-patches#l136
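
For context, that section of yum_update.sh follows the usual offline-CIB pattern, roughly as sketched below (the variable name is assumed):

pacemaker_dumpfile=$(mktemp)
pcs cluster cib $pacemaker_dumpfile                                     # dump the current CIB to a file
pcs -f $pacemaker_dumpfile resource update rabbitmq meta notify=true   # edit the offline copy
pcs cluster cib-push $pacemaker_dumpfile                                # push it back; "unable to push cib" means this step failed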

@sasha, did you manage to duplicate this on other hardware? I get the impression from comment #3 and some brief mentions on IRC that this might be a timeout because of insufficient resources, and that you were going to try and see if it happens again, especially elsewhere?

thanks, marios

Comment 9 Andrew Beekhof 2016-08-09 01:14:10 UTC
(In reply to marios from comment #8)
> (In reply to Andrew Beekhof from comment #7)
> > This one is a bit concerning:
> > 
> > 34156:Aug 04 17:40:30 overcloud-controller-2.localdomain
> > os-collect-config[3603]: Error: unable to push cib
> > 
> > 
> > but i don't know what os-collect-config could possibly be trying to push
> > into the cib
> 
> 
> o/ Andrew fyi: on current master this is happening in the
> pacemaker_migrations.sh
> https://github.com/openstack/tripleo-heat-templates/blob/
> 292fdf87e0fdcbd66664afc4c463f2f0e9a354fa/extraconfig/tasks/
> major_upgrade_pacemaker_migrations.sh#L92 

Oh, I didn't realize those actions would be performed under the os-collect-config tag.

> 
> However, the environment this bug is for is rhos7 so looking at the
> rhos-7.0-patches branch to confirm, this was during the yum_update.sh which
> is what is being executed here (for update of 7.2 -> 7.3) since it is rhos 7
> downstream link to the relevant code
> https://code.engineering.redhat.com/gerrit/gitweb?p=openstack-tripleo-heat-
> templates.git;a=blob;f=extraconfig/tasks/yum_update.sh;
> h=98ad25c764fc9948567253444d882d45a6b4b31d;hb=refs/heads/rhos-7.0-
> patches#l136

That is a big problem. It means none of the cluster updates went in.

Comment 10 Alexander Chuzhoy 2016-10-12 17:16:43 UTC
I just reproduced this bug on an attempt to update 8 to the latest version.

Environment:
instack-undercloud-2.2.7-7.el7ost.noarch
openstack-puppet-modules-7.1.3-1.el7ost.noarch
openstack-tripleo-heat-templates-kilo-0.8.14-18.el7ost.noarch
openstack-tripleo-heat-templates-0.8.14-18.el7ost.noarch

Comment 11 Alexander Chuzhoy 2016-10-12 18:01:16 UTC
No clock skew:
[stack@director ~]$  for i in `nova list|awk -F'|' '/Running/ {print $(NF-1)}'|awk -F"=" '{print $NF}'`; do ssh heat-admin@$i "hostname; date"; done
overcloud-cephstorage-0.localdomain
Wed Oct 12 17:57:47 UTC 2016
overcloud-cephstorage-1.localdomain
Wed Oct 12 17:57:47 UTC 2016
overcloud-cephstorage-2.localdomain
Wed Oct 12 17:57:47 UTC 2016
overcloud-compute-0.localdomain
Wed Oct 12 17:57:47 UTC 2016
overcloud-compute-1.localdomain
Wed Oct 12 17:57:48 UTC 2016
overcloud-compute-2.localdomain
Wed Oct 12 17:57:48 UTC 2016
overcloud-controller-0.localdomain
Wed Oct 12 17:57:48 UTC 2016
overcloud-controller-1.localdomain
Wed Oct 12 17:57:49 UTC 2016
overcloud-controller-2.localdomain
Wed Oct 12 17:57:49 UTC 2016

Comment 12 Alexander Chuzhoy 2016-10-12 18:03:47 UTC
The system was idle after the deploy (5 days ago); I didn't run anything on it.

Comment 13 Michele Baldessari 2016-10-12 22:35:58 UTC
(In reply to Alexander Chuzhoy from comment #10)
> I just reproduced this bug on attempt to update 8 to latest.
> 
> Environment:
> instack-undercloud-2.2.7-7.el7ost.noarch
> openstack-puppet-modules-7.1.3-1.el7ost.noarch
> openstack-tripleo-heat-templates-kilo-0.8.14-18.el7ost.noarch
> openstack-tripleo-heat-templates-0.8.14-18.el7ost.noarch

Sasha,

can I get access to the system or sosreports from the controllers?
I wonder if this is simply a matter of increasing the rabbitmq timeouts (aka https://bugzilla.redhat.com/show_bug.cgi?id=1378391). If you can reproduce it at will, can we sync up online and try a quick patch?

Comment 14 Michele Baldessari 2016-10-13 18:14:35 UTC
Sasha,

can you try the following patch on the undercloud before launching the update command? (Patch needs to be applied in /usr/share/openstack-tripleo-heat-templates):
diff -up tripleo-heat-templates-0.8.14/extraconfig/tasks/yum_update.sh.orig tripleo-heat-templates-0.8.14/extraconfig/tasks/yum_update.sh
--- tripleo-heat-templates-0.8.14/extraconfig/tasks/yum_update.sh.orig	2016-10-13 20:09:48.607711524 +0200
+++ tripleo-heat-templates-0.8.14/extraconfig/tasks/yum_update.sh	2016-10-13 20:10:54.237854605 +0200
@@ -130,6 +130,7 @@ openstack-nova-scheduler"
 
     echo "Making sure rabbitmq has the notify=true meta parameter"
     pcs -f $pacemaker_dumpfile resource update rabbitmq meta notify=true
+    pcs -f $pacemaker_dumpfile resource update rabbitmq op start timeout=200s stop timeout=200s
 
     echo "Applying new Pacemaker config"
     if ! pcs cluster cib-push $pacemaker_dumpfile; then


My analysis so far of the reports I got about this came to the conclusion that rabbitmq simply needs more time. The reliably broken deployment I got my hands on was fixed with this. Let me know if you need help applying/testing.
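
A minimal sketch of applying it, assuming the diff has been saved to a file on the undercloud (the patch file name is hypothetical):

cd /usr/share/openstack-tripleo-heat-templates
sudo patch -p1 < ~/rabbitmq-timeouts.patch
grep -n 'timeout=200s' extraconfig/tasks/yum_update.sh   # verify the new op line is in place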

cheers,
Michele

Comment 15 Alexander Chuzhoy 2016-10-13 21:47:53 UTC
Michele,
Can I apply the patch on the setup with the failed update, or should I start from scratch at this point?

Comment 16 Michele Baldessari 2016-10-14 06:48:16 UTC
Hi Sasha,

it would be best to launch it from scratch at this point. I haven't audited the code to see what happens if you launch it against a failed deployment.

The other option would be to bring up the cluster by hand and clean up any failed actions. After that, relaunching the update *should* work as well, although I have not tested it.
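
A rough sketch of that manual recovery, run from one of the controllers:

pcs cluster start --all    # bring corosync/pacemaker up on all controllers
pcs resource cleanup       # clear failed actions so resources can be rescheduled
pcs status                 # confirm everything is Started before relaunching the update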

Comment 17 Michele Baldessari 2016-11-15 09:00:38 UTC
*** Bug 1384068 has been marked as a duplicate of this bug. ***

