Bug 1595753 - [UPGRADES][10]RMQ resource-agent should handle stopped node
Summary: [UPGRADES][10]RMQ resource-agent should handle stopped node
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: pre-dev-freeze
Target Release: 7.7
Assignee: Oyvind Albrigtsen
QA Contact: pkomarov
Docs Contact: Marek Suchanek
URL:
Whiteboard: GSSApproved
Duplicates: 1628524 (view as bug list)
Depends On:
Blocks: 1639826 1641944 1641945
 
Reported: 2018-06-27 13:27 UTC by Yurii Prokulevych
Modified: 2019-03-21 11:57 UTC
CC List: 23 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Previously, the rabbitmqctl cluster_status command read cached cluster status from disk and returned 0 when mnesia service was not running. For example, this happened if rabbitmqctl stop_app was called, or the service paused during partition due to the pause_minority strategy. As a consequence, RabbitMQ might have returned cached status from disk. With this update, RabbitMQ now gets cluster status from mnesia during monitor-action. As a result, the described problem no longer occurs.
Clone Of:
Clones: 1641944 1641945 (view as bug list)
Environment:
Last Closed: 2019-03-21 11:57:59 UTC
Target Upstream Version:




Links
Red Hat Knowledge Base (Solution) 3619471 (Last Updated: 2018-09-21 13:33:21 UTC)

Description Yurii Prokulevych 2018-06-27 13:27:40 UTC
Description of problem:
-----------------------
RabbitMQ (RMQ) failed to come up when a controller that had been upgraded to RHOS-10 was rebooted.
Excerpt from logs:

Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ rabbit@controller-2: ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [   * connected to epmd (port 4369) on controller-2 ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [   * epmd reports node 'rabbit' running on port 35672 ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [   * can't establish TCP connection, reason: timeout (timed out) ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [   * suggestion: blocked by firewall? ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [  ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ current node details: ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ - node name: 'rabbitmq-cli-42@controller-0' ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ - home dir: /var/lib/rabbitmq ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ - cookie hash: T6QXguOieCu9zpMaS+Vohw== ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [  ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ Error: unable to connect to nodes ['rabbit@controller-1']: nodedown ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [  ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ DIAGNOSTICS ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ =========== ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [  ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ attempted to contact: ['rabbit@controller-1'] ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [  ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ rabbit@controller-1: ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [   * connected to epmd (port 4369) on controller-1 ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [   * epmd reports node 'rabbit' running on port 35672 ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [   * can't establish TCP connection, reason: timeout (timed out) ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [   * suggestion: blocked by firewall? ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [  ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ current node details: ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ - node name: 'rabbitmq-cli-55@controller-0' ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ - home dir: /var/lib/rabbitmq ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [ - cookie hash: T6QXguOieCu9zpMaS+Vohw== ]
Jun 27 12:39:38 [3300] controller-0       lrmd:   notice: operation_finished:   rabbitmq_start_0:17517:stderr [  ]
Jun 27 12:39:38 [3300] controller-0       lrmd:     info: log_finished: finished - rsc:rabbitmq action:start call_id:67 pid:17517 exit-code:1 exec-time:94900ms queue-time:0ms
Jun 27 12:39:38 [3306] controller-0       crmd:   notice: process_lrm_event:    Result of start operation for rabbitmq on controller-0: 1 (unknown error) | call=67 key=rabbitmq_start_0 confirmed=true cib-update=63
Jun 27 12:39:38 [3306] controller-0       crmd:   notice: process_lrm_event:    controller-0-rabbitmq_start_0:67 [ Error: unable to connect to node 'rabbit@controller-0': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@controller-0']\n\nrabbit@controller-0:\n  * connected to epmd (port 4369) on controller-0\n  * epmd reports: node 'rabbit' not running at all\n                  no other nodes on controller-0\n  * suggestion: start the node\n\ncurrent node details:\n- node name: 'rabbitmq

Restarting the service before reboot helped:
  pcs resource restart rabbitmq-clone

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
puppet-rabbitmq-5.6.0-2.el7ost.noarch
rabbitmq-server-3.6.3-9.el7ost.noarch
pcs-0.9.162-5.el7_5.1.x86_64
resource-agents-3.9.5-124.el7.x86_64
python-testresources-0.2.7-6.el7ost.noarch

Steps to Reproduce:
-------------------
1. Upgrade RHOS-9 to RHOS-10
2. Reboot controller-0 and observe rmq's status

Comment 2 John Eckersberg 2018-08-28 15:47:58 UTC
Here's what's going on here...

We changed the clustering port out of the ephemeral range here (35672->25672):

  https://review.openstack.org/#/c/345851/6/puppet/services/rabbitmq.yaml

So rabbitmq is originally up and bound to 35672.

The upgrade runs, and the firewall rules change so 35672 closes and 25672 opens.

On each controller, when 35672 closes, the existing clustered connection dies.  So the pause_minority code kicks in and the rabbit app stops.
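For reference, the pause_minority behavior comes from RabbitMQ's cluster_partition_handling setting. A quick way to confirm it on a controller (the file path and exact formatting here are assumptions, not captured from this environment):

  grep cluster_partition_handling /etc/rabbitmq/rabbitmq.config
  # expected to show something like: {cluster_partition_handling, pause_minority},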

This happens on all three controllers.  Each is paused waiting for the others to return (they never do).  You can see this in the status because the rabbit app is not running:

[root@controller-0 ~]# rabbitmqctl status
Status of node 'rabbit@controller-0' ...
[{pid,20759},
 {running_applications,[{xmerl,"XML parser","1.3.10"},
                        {ranch,"Socket acceptor pool for TCP protocols.",
                               "1.2.1"},
                        {sasl,"SASL  CXC 138 11","2.7"},
                        {stdlib,"ERTS  CXC 138 10","2.8"},
                        {kernel,"ERTS  CXC 138 10","4.2"}]},
[...snip...]
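A quick way to tell the paused state apart is to look for the 'rabbit' application entry itself in that output; a minimal sketch, assuming that grepping for the "{rabbit," entry is sufficient:

  # rabbit app present in running_applications -> running; absent -> stopped/paused
  if rabbitmqctl status 2>/dev/null | grep -q '{rabbit,'; then
      echo "rabbit application is running"
  else
      echo "rabbit application is NOT running (stopped or paused)"
  fi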

Also, and more importantly, the resource agent health check is buggy in this situation. It reports the service as up:

[root@controller-0 ~]# pcs status | grep -A1 rabbit
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ controller-0 controller-1 controller-2 ]

That is because the health check only runs cluster_status, which returns success:

[root@controller-0 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@controller-0' ...
[{nodes,[{disc,['rabbit@controller-0','rabbit@controller-1',
                'rabbit@controller-2']}]},
 {alarms,[]}]
[root@controller-0 ~]# echo $?
0
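In other words, the monitor logic boils down to something like this (a rough paraphrase, not the agent's literal code):

  # buggy check: trust the exit code of cluster_status
  if rabbitmqctl cluster_status >/dev/null 2>&1; then
      exit 0    # OCF_SUCCESS - reported even while the rabbit app is paused
  fi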

However, this isn't accurate when the node is paused. When mnesia is stopped, cluster_status reads the cached status from disk:

[root@controller-0 ~]# cat /var/lib/rabbitmq/mnesia/rabbit@controller-0/cluster_nodes.config 
{['rabbit@controller-0','rabbit@controller-1','rabbit@controller-2'],['rabbit@controller-0','rabbit@controller-1','rabbit@controller-2']}.
[root@controller-0 ~]# cat /var/lib/rabbitmq/mnesia/rabbit@controller-0/nodes_running_at_shutdown 
['rabbit@controller-0'].

But if you force it to read from mnesia, you can see things are not ok:

[root@controller-0 ~]# rabbitmqctl eval 'rabbit_mnesia:cluster_status_from_mnesia().'
{error,mnesia_not_running}

So this is not really a doc bug; we should fix the resource agent to handle this better.
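A minimal sketch of what a stricter monitor check could look like, based on the eval call above (the function name and exact error matching are illustrative, not the actual patch):

  rmq_monitor_mnesia() {
      local out
      out=$(rabbitmqctl eval 'rabbit_mnesia:cluster_status_from_mnesia().' 2>&1)
      if echo "$out" | grep -q 'mnesia_not_running'; then
          return 7    # OCF_NOT_RUNNING - the rabbit app is stopped/paused
      fi
      return 0        # OCF_SUCCESS
  }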

Comment 5 John Eckersberg 2018-09-13 14:04:52 UTC
*** Bug 1628524 has been marked as a duplicate of this bug. ***

