Bug 1360653 - OSPD-9 virt setup HA deployment 3 controllers 1 compute failed on step 4
Summary: OSPD-9 virt setup HA deployment 3 controllers 1 compute failed on step 4
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ga
Target Release: 9.0 (Mitaka)
Assignee: John Eckersberg
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-07-27 09:04 UTC by Udi Shkalim
Modified: 2018-09-07 19:15 UTC
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-05 12:02:49 UTC


Attachments
corosync and rabbitmq logs from controllers (shows the original failure; all nodes successfully joined after one hour once "pcs resource cleanup" was performed) (deleted)
2016-08-04 13:08 UTC, Marian Krcmarik

Description Udi Shkalim 2016-07-27 09:04:23 UTC
Description of problem:
Overcloud deployment failed. The problem appears to be a race condition causing the failure, since the deployment passes with external/internal Ceph.
[overcloud]: CREATE_FAILED Resource CREATE failed: Error: resources.ControllerNodesPostDeployment.resources.ControllerOvercloudServicesDeployment_Step4.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 6

Logging in to the controllers, I noticed that the RabbitMQ cluster failed to start:
[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Last updated: Wed Jul 27 08:58:43 2016		Last change: Tue Jul 26 16:30:05 2016 by root via crm_attribute on controller-2
Stack: corosync
Current DC: controller-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
3 nodes and 27 resources configured

Online: [ controller-0 controller-1 controller-2 ]

Full list of resources:

 ip-172.17.1.10	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-192.0.2.6	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.4.10	(ocf::heartbeat:IPaddr2):	Started controller-2
 Clone Set: haproxy-clone [haproxy]
     Started: [ controller-0 controller-1 controller-2 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ controller-0 controller-1 controller-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ controller-0 controller-1 controller-2 ]
 ip-172.17.3.10	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-10.0.0.101	(ocf::heartbeat:IPaddr2):	Started controller-1
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ controller-2 ]
     Stopped: [ controller-0 controller-1 ]
 Clone Set: openstack-core-clone [openstack-core]
     Started: [ controller-0 controller-1 controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ controller-2 ]
     Slaves: [ controller-0 controller-1 ]
 ip-172.17.1.11	(ocf::heartbeat:IPaddr2):	Started controller-2
 Clone Set: mongod-clone [mongod]
     Started: [ controller-0 controller-1 controller-2 ]

Failed Actions:
* rabbitmq_start_0 on controller-1 'unknown error' (1): call=80, status=complete, exitreason='none',
    last-rc-change='Tue Jul 26 16:29:25 2016', queued=0ms, exec=31177ms
* rabbitmq_start_0 on controller-0 'unknown error' (1): call=78, status=Timed Out, exitreason='none',
    last-rc-change='Tue Jul 26 16:27:07 2016', queued=0ms, exec=100004ms


PCSD Status:
  controller-0: Online
  controller-1: Online
  controller-2: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled



Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-2.0.0-18.el7ost.noarch
rabbitmq-server-3.6.3-5.el7ost.noarch
resource-agents-3.9.5-54.el7_2.10.x86_64

How reproducible:
80%

Steps to Reproduce:
1. Deploy the overcloud with 3 controllers and 1 compute using local LVM (an illustrative deploy command sketch follows after this list)
2.
3.
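
A minimal sketch of what such a deployment command looks like on a Mitaka/OSP9 undercloud (illustrative only - the actual templates, environment files and node counts come from the CI job, which is not reproduced in this bug):

[stack@undercloud ~]$ source stackrc
[stack@undercloud ~]$ openstack overcloud deploy --templates \
    --control-scale 3 --compute-scale 1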

Actual results:
deployment failed

Expected results:
deployment passed

Additional info:
[root@controller-0 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         


[root@controller-0 ~]# getenforce 
Enforcing
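
For triage on a node where the rabbitmq start failed, a minimal set of checks would be something like the following (the log file name assumes the default rabbit@<hostname> node name; debug-start is only a one-off triage step, not something to run routinely on a managed resource):

[root@controller-0 ~]# pcs resource failcount show rabbitmq
[root@controller-0 ~]# pcs resource debug-start rabbitmq --full
[root@controller-0 ~]# tail -n 100 /var/log/rabbitmq/rabbit@controller-0.log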

Comment 2 Jiri Stransky 2016-07-27 15:30:31 UTC
I couldn't reproduce this issue using the latest poodle. I deployed 3 controllers + 1 compute with network isolation. Udi, were you or someone else able to reproduce this?

Comment 3 Udi Shkalim 2016-07-27 15:34:29 UTC
(In reply to Jiri Stransky from comment #2)
> I couldn't reproduce this issue using the latest poodle. I deployed 3
> controllers + 1 compute with network isolation. Udi, were you or someone
> else able to reproduce this?

Yes, it happened on 3 environments, more than 8 times.

Comment 4 Jiri Stransky 2016-07-28 12:12:19 UTC
This looks like it probably shares the root cause with bug 1343905. The rabbitmq service is reported as started on controller-2 in pacemaker, but the rabbit app is in fact stopped there. It looks like it could be the same pause_minority situation, which makes me hope eck's fix for bug 1343905 would fix this problem too.


[root@controller-2 cluster]# pcs status | grep rabbit -A2
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ controller-2 ]
     Stopped: [ controller-0 controller-1 ]
--
* rabbitmq_start_0 on controller-0 'unknown error' (1): call=76, status=Timed Out, exitreason='none',
    last-rc-change='Wed Jul 27 18:05:45 2016', queued=0ms, exec=100003ms
* rabbitmq_start_0 on controller-1 'unknown error' (1): call=80, status=complete, exitreason='none',
    last-rc-change='Wed Jul 27 18:07:42 2016', queued=0ms, exec=31321ms


[root@controller-2 cluster]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@controller-2' ...
[{nodes,[{disc,['rabbit@controller-0','rabbit@controller-2']}]},{alarms,[]}]


Excerpt from rabbitmq log on controller-2 (the bootstrap node):

=INFO REPORT==== 27-Jul-2016::18:06:27 ===
node 'rabbit@controller-0' up

=INFO REPORT==== 27-Jul-2016::18:07:42 ===
node 'rabbit@controller-0' down: connection_closed

=WARNING REPORT==== 27-Jul-2016::18:07:42 ===
Cluster minority/secondary status detected - awaiting recovery

=INFO REPORT==== 27-Jul-2016::18:07:42 ===
Stopping RabbitMQ
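
If the pause_minority theory applies here, a quick read-only check of the partition handling mode on the surviving node would be the following (this assumes the standard RabbitMQ setting name; it should print {ok,pause_minority} if the templates configure that mode):

[root@controller-2 ~]# rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'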

Comment 6 Udi Shkalim 2016-07-31 08:33:36 UTC
Tried running a deployment with the latest resource-agents package:
resource-agents-3.9.5-80.el7.x86_64

The deployment failed for the same reason.

Comment 7 Mike Orazi 2016-08-01 14:36:24 UTC
Moving to PIDONE for now. This has been reproduced on an initial install rather than an update/upgrade. The suspicion is that it is actually more of a VM size/resource issue, but I'm hoping Petro can help confirm that.

Comment 9 Fabio Massimo Di Nitto 2016-08-02 10:36:54 UTC
Udi, as discussed on the weekly call, we would need full logs in the event the issue is still reproducible.

Comment 10 Udi Shkalim 2016-08-03 08:57:39 UTC
I have increased the VM resources (cpu/mem/disk) and indeed the deployment passed.

In all the environments we used the same resource allocation (Jenkins job); that is why it happened in more than one setup.

Strangely, in OSPD-8 we used the same resource allocation and did not encounter this problem.
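
For reference, on a libvirt-based virt setup the controller VMs can be grown roughly like this (the domain name and sizes below are examples, not the ones from this environment; the guests must be restarted for the --config changes to take effect):

virsh setmaxmem overcloud-controller-0 16G --config
virsh setmem overcloud-controller-0 16G --config
virsh setvcpus overcloud-controller-0 8 --maximum --config
virsh setvcpus overcloud-controller-0 8 --config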

Comment 11 Marian Krcmarik 2016-08-04 13:03:15 UTC
I am reopening this bug - 5 jobs (various topologies) failed on this issue on the latest puddle, which contains the latest resource-agents rpm as well as the latest rabbitmq-server rpm. I also tried several more runs, even with controller VMs with 12GB RAM and 8 CPU cores assigned, and the problem still occurs.

I will attach some logs, but the situation from the logs (corosync, rabbitmq) is the following:
- The rabbitmq cluster on controller-2 gets bootstrapped.
- The rabbitmq resource on controller-0 is started and attempts to join the cluster, but it times out because the resource start timeout is set to 100 seconds.
- The rabbitmq resource on controller-1 is started but fails to join the cluster.

The end result is that pacemaker shows the rabbitmq resource as started on controller-2 and stopped on controller-1 and controller-0.

The return value of the rmq_is_clustered function added in the latest patch is false, because I cannot see the message "Successfully re-joined existing rabbitmq cluster automatically" anywhere in the log.
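
A possible mitigation to test (my assumption, not something verified in this bug) would be to raise the start timeout of the rabbitmq resource above the observed ~100s, e.g.:

[root@controller-0 ~]# pcs resource update rabbitmq op start timeout=200s
[root@controller-0 ~]# pcs resource show rabbitmq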

Comment 12 Marian Krcmarik 2016-08-04 13:08:29 UTC
Created attachment 1187490 [details]
corosync and rabbitmq logs from controllers (shows the original failure; all nodes successfully joined after one hour once "pcs resource cleanup" was performed)
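
(For context, the cleanup mentioned above clears the failed actions so pacemaker retries the start; roughly:)

[root@controller-0 ~]# pcs resource cleanup rabbitmq-clone
[root@controller-0 ~]# pcs status | grep rabbit -A2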

Comment 14 Andrew Beekhof 2016-08-05 03:15:47 UTC
From the description it sounds like the cluster did the right thing. Petr, can you check this from a rabbit perspective, please?

Comment 16 Marian Krcmarik 2016-08-05 10:26:04 UTC
I reran the deployment on a much more powerful machine, with 16GB RAM and 8 CPU cores assigned to each controller node. The result is that all 5 deploys were successful, so this does look like a sizing problem. What is interesting, though, is that the previously failed deployments on controllers with 12GB always failed on the same step, and the cluster was always in the same state (one rabbitmq resource started, one timed out and stopped, and one failed on the rabbitmq node joining the cluster), and RHOS8 was able to be deployed on machines with only 8GB controllers.

To sum up: the problem does not occur in my testing env, which has controllers with 16GB RAM (on 12GB it fails in my env). If such an increase of resource requirements between RHOS8 and RHOS9 is justified, please close the bug as NOTABUG again (and sorry for the noise).
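
To make the sizing comparison concrete, memory headroom on the controllers around step 4 can be sampled with something like this (run from the undercloud; the IPs are placeholders, the real ctlplane addresses come from "nova list"):

for ip in 192.0.2.8 192.0.2.9 192.0.2.10; do
    ssh heat-admin@$ip 'hostname; free -m'
done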

Comment 17 Fabio Massimo Di Nitto 2016-08-05 12:02:49 UTC
(In reply to Marian Krcmarik from comment #16)
> I reran the deployment on a much more powerful machine, with 16GB RAM and 8
> CPU cores assigned to each controller node. The result is that all 5 deploys
> were successful, so this does look like a sizing problem. What is
> interesting, though, is that the previously failed deployments on
> controllers with 12GB always failed on the same step, and the cluster was
> always in the same state (one rabbitmq resource started, one timed out and
> stopped, and one failed on the rabbitmq node joining the cluster), and RHOS8
> was able to be deployed on machines with only 8GB controllers.
> 
> To sum up: the problem does not occur in my testing env, which has
> controllers with 16GB RAM (on 12GB it fails in my env). If such an increase
> of resource requirements between RHOS8 and RHOS9 is justified, please close
> the bug as NOTABUG again (and sorry for the noise).

good and bad.

Perhaps we should reassign this bug to documentation to make sure that the sizing of the VMs is reflected properly.

