Bug 1229822 - [RFE] make "cluster setup --start", "cluster start" and "cluster standby" support --wait as well
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pcs
Version: 7.2
Hardware: Unspecified
OS: Unspecified
medium
unspecified
Target Milestone: rc
: ---
Assignee: Tomas Jelinek
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
: 1238874 (view as bug list)
Depends On:
Blocks: 1229826 1251196
 
Reported: 2015-06-09 17:42 UTC by Jan Pokorný [poki]
Modified: 2019-01-02 12:44 UTC (History)

Fixed In Version: pcs-0.9.151-1.el7
Doc Type: Enhancement
Doc Text:
Feature: Add an option which makes pcs wait for nodes to start or go to/from standby mode.
Reason: If one wants to be sure nodes have fully started or gone to/from standby mode, it is required to periodically check output of 'pcs status', which cannot be easily done e.g. in scripts which call pcs.
Result: --wait option added to 'pcs cluster start', 'pcs cluster setup --start', 'pcs cluster node add --start', 'pcs node standby' and 'pcs node unstandby' commands.
Clone Of: 1195703
: 1229826 (view as bug list)
Environment:
Last Closed: 2016-11-03 20:54:29 UTC
Target Upstream Version:


Attachments
proposed fix - standby (deleted), 2016-03-04 15:30 UTC, Tomas Jelinek
proposed fix - start (deleted), 2016-03-04 15:30 UTC, Tomas Jelinek


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1284404 None CLOSED make restarting pcsd a synchronous operation 2019-01-02 12:40:50 UTC
Red Hat Bugzilla 1463327 None CLOSED Starting a larger cluster times out 2019-01-02 12:40:49 UTC
Red Hat Product Errata RHSA-2016:2596 normal SHIPPED_LIVE Moderate: pcs security, bug fix, and enhancement update 2016-11-03 12:11:34 UTC

Internal Links: 1284404 1463327

Description Jan Pokorný [poki] 2015-06-09 17:42:51 UTC
In clufter's pcs-commands output (meant for little-to-no-edit execution)
an obstacle was found with "cluster setup --start" followed immediately
by "cluster cib-push" because -- apparently -- the cluster stack is not
yet up and running at that point.

Workaround is to put a fixed-time sleep in between:
https://github.com/jnpkrn/clufter/commit/a197f61d4ebe404902a8693473ec2c71bd0967c0

A better solution would be to make pcs do a proper job and *not* return
until the cluster has started (or return with an appropriate exit status
as soon as a failure is detected).
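
As an illustration of the requested behaviour (not the pcs implementation),
a caller-side replacement for the fixed-time sleep could poll a readiness
check against a deadline. A minimal Python sketch follows; wait_until and its
parameters are hypothetical names, and what counts as "ready" is left to the
caller:

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool],
               timeout: float = 60.0,
               interval: float = 1.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll is_ready() until it returns True or `timeout` seconds elapse.

    Returns True as soon as the check succeeds and False once the deadline
    has passed. `clock` and `sleep` are injectable so the loop can be
    exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller might pass a function that runs 'cibadmin -Q' and reports whether it
exited successfully; whether that is a sufficient readiness test before a
following "cluster cib-push" is an assumption, not something established in
this report.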


Original bug that requested traceability of the "Pacemaker being started"
state on the node, which might be helpful internally for pcs to
implement this feature as well:

+++ This bug was initially created as a clone of Bug #1195703 +++

Comment 2 Jan Pokorný [poki] 2015-07-27 16:31:21 UTC
Also "pcs cluster standby <node>" might deserve --wait, as per the wish
expressed in the #clusterlabs channel on Freenode (times in CEST):

17:02 < larsks> After putting a node into standby mode with 'pcs cluster
                standby <node>', is there a way to confirm that
                resources are no longer running on that node?
                Specifically, a way that could be used in an automated
                script (as opposed to, say, "visual inspection of pcs
                status output")...
17:04 < lge> depends on your resources, right?
17:04 < lge> confirm as in how: ask pacemaker if it believes it is done?
17:04 < lge> or double check all resources?
17:05 < larsks> lge: confirm as in "have pacemaker confirm that there are
                no longer any active resources on the node that was put
                into standby mode"
17:05 < lge> "poll" crm_mon.
17:06 < larsks> Hmm.  crm_mon produces largely human-readable output, as
                opposed to something machine-parseable.  I guess I can
                run 'pcs status xml' or 'cibadmin -Q' through an XML
                parser...
17:06 < larsks> All I really want is a --wait flag to
                'pcs cluster standby' :)
>               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
17:06 < lge> once crmadmin -S returns S_IDLE, pacemaker thinks it has
             nothing more to do. if that works for you...
17:07 < larsks> lge: huh, interesting.  I'll try that out and see how
                that works.
17:08 < lge> need to point it against the DC, though.
17:11 < lge> and that does not tell you if all stop operations have been
             successful.
17:12 < lge> though, if stop fails, that is supposed to be escalated to
             fencing... and once it goes S_IDLE after fencing, you can
             be sure everything is stopped ;-)
17:12 < lge> that's the whole point of fencing, after all.
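
The polling approach lge describes could be sketched as below. This is a
hypothetical helper, not part of pcs or Pacemaker: it only extracts the state
token from text such as 'crmadmin -S <DC node>' prints, and the sample line in
the usage note is an assumed output format that may differ between Pacemaker
versions.

```python
import re
from typing import Optional

# Controller states are reported as tokens such as S_IDLE or S_POLICY_ENGINE.
_STATE_RE = re.compile(r"\bS_[A-Z_]+\b")


def controller_state(crmadmin_output: str) -> Optional[str]:
    """Return the first S_* state token found in the text, or None."""
    match = _STATE_RE.search(crmadmin_output)
    return match.group(0) if match else None


def is_idle(crmadmin_output: str) -> bool:
    """True when the reported controller state is S_IDLE."""
    return controller_state(crmadmin_output) == "S_IDLE"
```

For example, is_idle("Status of crmd@node1: S_IDLE (ok)") is true for that
assumed line. As noted in the log, the query has to be pointed at the DC, and
S_IDLE alone does not show that every stop operation succeeded.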

Comment 3 Jan Pokorný [poki] 2015-08-13 08:05:10 UTC
Another data point (IMHO) supporting "cluster start --wait":
http://clusterlabs.org/pipermail/users/2015-August/001029.html

Comment 4 Chris Feist 2015-09-23 22:12:37 UTC
We also probably want to add a cluster stop --wait as well. (from bz1238874).

Comment 5 Radek Steiger 2016-02-03 15:49:37 UTC
*** Bug 1238874 has been marked as a duplicate of this bug. ***

Comment 6 Tomas Jelinek 2016-03-04 15:30:17 UTC
Created attachment 1133201 [details]
proposed fix - standby

Comment 7 Tomas Jelinek 2016-03-04 15:30:40 UTC
Created attachment 1133202 [details]
proposed fix - start

Comment 8 Tomas Jelinek 2016-03-04 15:46:44 UTC
For standby:

[root@rh72-node1:~]# pcs resource create delay delay startdelay=0 stopdelay=10
[root@rh72-node1:~]# pcs resource
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 delay  (ocf::heartbeat:Delay): Started rh72-node2
[root@rh72-node1:~]# time pcs node standby rh72-node2; pcs resource
real    0m0.172s
user    0m0.109s
sys     0m0.028s
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 delay  (ocf::heartbeat:Delay): Started rh72-node2
[root@rh72-node1:~]# pcs node unstandby --all
[root@rh72-node1:~]# pcs resource
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 delay  (ocf::heartbeat:Delay): Started rh72-node2
[root@rh72-node1:~]# time pcs node standby rh72-node2 --wait; pcs resource
real    0m12.226s
user    0m0.135s
sys     0m0.039s
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 delay  (ocf::heartbeat:Delay): Started rh72-node1

Similarly for unstandby.
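
For scripts that want to double check the result, the 'pcs resource' listing
above can be parsed to see which resources are still started on a node. A
rough Python sketch; resources_on_node is a hypothetical helper and it assumes
the flat one-line-per-resource format shown in this comment (groups, clones
and other layouts are not handled):

```python
import re
from typing import List

# Matches lines like: " delay  (ocf::heartbeat:Delay): Started rh72-node2"
_LINE_RE = re.compile(r"^\s*(\S+)\s+\(([^)]+)\):\s+(\S+)(?:\s+(\S+))?\s*$")


def resources_on_node(pcs_resource_output: str, node: str) -> List[str]:
    """Return names of resources reported as Started on `node`."""
    found = []
    for line in pcs_resource_output.splitlines():
        match = _LINE_RE.match(line)
        if match and match.group(3) == "Started" and match.group(4) == node:
            found.append(match.group(1))
    return found
```

With the two listings above, the helper would report 'delay' on rh72-node2
before the standby with --wait and nothing on that node afterwards.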



For start:

[root@rh72-node1:~]# pcs cluster stop --all
rh72-node1: Stopping Cluster (pacemaker)...
rh72-node2: Stopping Cluster (pacemaker)...
rh72-node2: Stopping Cluster (corosync)...
rh72-node1: Stopping Cluster (corosync)...
[root@rh72-node1:~]# time pcs cluster start --all; pcs status nodes
rh72-node2: Starting Cluster...
rh72-node1: Starting Cluster...
real    0m1.264s
user    0m0.404s
sys     0m0.068s
Pacemaker Nodes:
 Online: 
 Standby: 
 Offline: rh72-node1 rh72-node2 
Pacemaker Remote Nodes:
 Online: 
 Standby: 
 Offline: 
[root@rh72-node1:~]# pcs cluster stop --all
rh72-node2: Stopping Cluster (pacemaker)...
rh72-node1: Stopping Cluster (pacemaker)...
rh72-node2: Stopping Cluster (corosync)...
rh72-node1: Stopping Cluster (corosync)...
[root@rh72-node1:~]# time pcs cluster start --all --wait; pcs status nodes
rh72-node1: Starting Cluster...
rh72-node2: Starting Cluster...
Waiting for node(s) to start...
rh72-node2: Started
rh72-node1: Started
real    0m24.463s
user    0m4.943s
sys     0m0.796s
Pacemaker Nodes:
 Online: rh72-node1 rh72-node2 
 Standby: 
 Offline: 
Pacemaker Remote Nodes:
 Online: 
 Standby: 
 Offline:

Similarly for 'pcs cluster start --wait', 'pcs cluster start node --wait', 'pcs cluster setup --start --wait' and 'pcs cluster node add node --start --wait'.
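
To check the outcome from a script rather than by eye, the 'pcs status nodes'
output shown above can be turned into a dictionary. A minimal Python sketch;
parse_status_nodes is a hypothetical helper and it assumes exactly the layout
printed in this comment (unindented section headers, indented "State: names"
lines):

```python
from typing import Dict, List


def parse_status_nodes(output: str) -> Dict[str, Dict[str, List[str]]]:
    """Map each section header to {state: [node names]}."""
    sections: Dict[str, Dict[str, List[str]]] = {}
    current = None
    for line in output.splitlines():
        if not line.strip():
            continue
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "label: value" line, ignore it
        if line[0].isspace():
            # Indented line: a state within the current section.
            if current is not None:
                current[label.strip()] = rest.split()
        else:
            # Unindented line: start of a new section.
            current = sections.setdefault(label.strip(), {})
    return sections
```

A script could then require that the "Offline" list under "Pacemaker Nodes" is
empty before it continues.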


In all cases it is possible to specify timeout like this: --wait=<timeout>.
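
A script that assembles pcs command lines might add the option as sketched
below. pcs_argv is a hypothetical helper, not a pcs API; it rejects negative
timeouts on the caller side, and how pcs itself treats a timeout of 0 is not
covered here.

```python
from typing import List, Optional


def pcs_argv(subcommand: List[str], wait: bool = False,
             timeout: Optional[int] = None) -> List[str]:
    """Build a pcs argument list, optionally with --wait or --wait=<timeout>."""
    argv = ["pcs", *subcommand]
    if timeout is not None:
        if timeout < 0:
            raise ValueError(f"{timeout} is not a valid number of seconds to wait")
        argv.append(f"--wait={timeout}")
    elif wait:
        argv.append("--wait")
    return argv
```

For example, pcs_argv(["cluster", "start", "--all"], timeout=60) yields the
argument list for 'pcs cluster start --all --wait=60'.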



For stop:
pcs stops the pacemaker service using a systemd or service call, which does not return until pacemaker is fully stopped. So I believe there is nothing to be done. If it does not work, please file another bz with a reproducer.

Comment 9 Jan Pokorný [poki] 2016-03-07 18:18:10 UTC
Great to see this addressed :)

Upstream references for posterity:
https://github.com/feist/pcs/commit/9dc37994c94c6c8d03accd52c3c2f85df431f3ea
https://github.com/feist/pcs/commit/5ccd54fe7193683d4c161e1d2ce4ece66d5d881e

Can you advise on how to figure out, in a non-intrusive way, that
"cluster start" (which, I hope, also implies the same handling for
"cluster setup --start") indeed supports --wait?
Unfortunately "pcs cluster start --wait" is pretty dumb about being passed
an unsupported switch, so I cannot look for a possible error exit code.

Comment 10 Jan Pokorný [poki] 2016-03-07 18:28:56 UTC
If no such compatibility mechanism is easily achieved, I'd request that
specifying --wait=-1 (or any negative number) fail immediately so
that I can do at least some reasoning about what to use in a backward
compatible way.

Comment 11 Tomas Jelinek 2016-03-08 08:12:13 UTC
I think the easiest thing to do is to check pcs version by running 'pcs --version'. Any version higher than 0.9.149 should support this.

Specifying --wait=-1 (or any invalid timeout) fails immediately:
[root@rh72-node1:~]# pcs cluster start --all --wait=-1
Error: -1 is not a valid number of seconds to wait
[root@rh72-node1:~]# echo $?
1
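
The version check suggested here could look roughly like the following Python
sketch. parse_version and supports_wait are hypothetical helper names, and the
code assumes that the output of 'pcs --version' contains a dotted numeric
version such as 0.9.151:

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the first dotted numeric version from text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def supports_wait(version_output: str) -> bool:
    """True for any version higher than 0.9.149, per the comment above."""
    return parse_version(version_output) > (0, 9, 149)
```

Comparing integer tuples rather than strings keeps the ordering numeric, so a
component such as 10 sorts after 9.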

Comment 12 Jan Pokorný [poki] 2016-03-09 19:39:45 UTC
Thanks, here's the reflection on clufter side:
https://pagure.io/clufter/bfb870495ca4619d9f858cd9483cea83642dca58

Does a change in emitted pcs commands akin to
https://pagure.io/clufter/bfb870495ca4619d9f858cd9483cea83642dca58#diff-file-1
look sane in your opinion (double backslashes are just a matter of notation)?

Comment 13 Tomas Jelinek 2016-03-10 13:57:24 UTC
Yes, that looks fine to me.

Comment 14 Mike McCune 2016-03-28 23:40:50 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation; please contact mmccune@redhat.com with any questions.

Comment 15 Ivan Devat 2016-05-31 12:27:08 UTC
Setup:
[vm-rhel72-1 ~] $ pcs resource create delay delay startdelay=0 stopdelay=10

Before fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.143-15.el7.x86_64

A) standby
A1) pcs node standby

There is no command "pcs node".

A2) pcs cluster standby

[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1
[vm-rhel72-1 ~] $ time pcs cluster standby vm-rhel72-1; pcs resource

real    0,23s
user    0,12s
sys     0,07s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1
[vm-rhel72-1 ~] $ pcs cluster unstandby --all
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1
[vm-rhel72-1 ~] $ time pcs cluster standby vm-rhel72-1 --wait; pcs resource

real    0,18s
user    0,10s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3

B) start

[vm-rhel72-1 ~] $ pcs cluster stop --all
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (corosync)...
vm-rhel72-3: Stopping Cluster (corosync)...
[vm-rhel72-1 ~] $ time pcs cluster start --all; pcs status nodes
vm-rhel72-1: Starting Cluster...
vm-rhel72-3: Starting Cluster...

real    1,40s
user    0,39s
sys     0,07s
Pacemaker Nodes:
 Online:
 Standby:
 Offline: vm-rhel72-1 vm-rhel72-3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Offline:
[vm-rhel72-1 ~] $ pcs cluster stop --all
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (corosync)...
vm-rhel72-3: Stopping Cluster (corosync)...
[vm-rhel72-1 ~] $ time pcs cluster start --all --wait; pcs status nodes
vm-rhel72-1: Starting Cluster...
vm-rhel72-3: Starting Cluster...

real    1,43s
user    0,36s
sys     0,09s
Pacemaker Nodes:
 Online:
 Standby:
 Offline: vm-rhel72-1 vm-rhel72-3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Offline:

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.151-1.el7.x86_64

A) standby
A1) pcs node standby

[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ time pcs node standby vm-rhel72-3; pcs resource

real    0,24s
user    0,13s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ pcs node unstandby --all --wait
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ time pcs node standby vm-rhel72-3 --wait; pcs resource

real    12,27s
user    0,16s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1


A2) pcs cluster standby

[vm-rhel72-1 ~] $ pcs node unstandby --all --wait
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ time pcs cluster standby vm-rhel72-3; pcs resource

real    0,24s
user    0,14s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ pcs cluster unstandby --all --wait
[vm-rhel72-1 ~] $ pcs resource
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-3
[vm-rhel72-1 ~] $ time pcs cluster standby vm-rhel72-3 --wait; pcs resource

real    12,28s
user    0,16s
sys     0,05s
 delay  (ocf::heartbeat:Delay): Started vm-rhel72-1


B) start
[vm-rhel72-1 ~] $ pcs cluster stop --all
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (corosync)...
vm-rhel72-3: Stopping Cluster (corosync)...
[vm-rhel72-1 ~] $ time pcs cluster start --all; pcs status nodes
vm-rhel72-1: Starting Cluster...
vm-rhel72-3: Starting Cluster...

real    1,70s
user    0,54s
sys     0,12s
Pacemaker Nodes:
 Online:
 Standby:
 Maintenance:
 Offline: vm-rhel72-1 vm-rhel72-3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:
[vm-rhel72-1 ~] $ pcs cluster stop --all
vm-rhel72-1: Stopping Cluster (pacemaker)...
vm-rhel72-3: Stopping Cluster (pacemaker)...
vm-rhel72-1: Stopping Cluster (corosync)...
vm-rhel72-3: Stopping Cluster (corosync)...
[vm-rhel72-1 ~] $ time pcs cluster start --all --wait; pcs status nodes
vm-rhel72-1: Starting Cluster...
vm-rhel72-3: Starting Cluster...
Waiting for node(s) to start...
vm-rhel72-3: Started
vm-rhel72-1: Started

real    26,08s
user    4,44s
sys     0,77s
Pacemaker Nodes:
 Online: vm-rhel72-1
 Standby: vm-rhel72-3
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

Comment 22 errata-xmlrpc 2016-11-03 20:54:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2596.html

