Bug 1360234 - ClusterMon will not kill crm_mon process correctly
Summary: ClusterMon will not kill crm_mon process correctly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: pacemaker
Version: 6.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 6.9
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1378817 1385753
 
Reported: 2016-07-26 10:07 UTC by Josef Zimek
Modified: 2017-03-21 09:52 UTC
CC List: 7 users

Fixed In Version: pacemaker-1.1.15-3.el6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1378817 1385753
Environment:
Last Closed: 2017-03-21 09:52:10 UTC


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:0629 normal SHIPPED_LIVE pacemaker bug fix update 2017-03-21 12:29:32 UTC

Description Josef Zimek 2016-07-26 10:07:49 UTC
Description of problem:


After putting the cluster into maintenance mode, rebooting the nodes, and clearing the maintenance status, the ClusterMon resource sometimes keeps reporting as running even though "crm_mon" is NOT running. Normally ClusterMon should detect the missing "crm_mon" in a reliable way and report the node as "bad".


The mechanism used by the "ClusterMon" script delivered in "/usr/lib/ocf/resource.d/pacemaker/ClusterMon" is as follows:

The script reads the PID from the PID file created by "crm_mon" and tries to send a signal ("kill -s 0 <PID>") to that PID.

If any process is running with that PID, the resource agent script reports that "crm_mon" is running. However, that can be any process that happens to have this PID (e.g. after a reboot of the machine it could be another daemon process), in which case the value reported by the agent is wrong.

Actual implementation:

ClusterMon_monitor() {
    if [ -f $OCF_RESKEY_pidfile ]; then
        pid=`cat $OCF_RESKEY_pidfile`
        if [ ! -z $pid ]; then
            kill -s 0 $pid >/dev/null 2>&1; rc=$?
            case $rc in
                0) exit $OCF_SUCCESS;;
                1) exit $OCF_NOT_RUNNING;;
                *) exit $OCF_ERR_GENERIC;;
            esac
        fi
    fi
    exit $OCF_NOT_RUNNING
}


As a result, the "kill -s 0 $pid" command sometimes returns "success" even though "crm_mon" is not running.
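
To illustrate (an example added for this report, not code from the agent): "kill -s 0" only tests whether some process with the given PID exists; it says nothing about which program owns that PID, so a stale PID that has been reused by an unrelated process still passes the check.

# kill -0 only checks PID existence, not process identity
kill -s 0 1; echo $?   # as root this prints 0 -- PID 1 (init) is always alive
# so if the stale PID from the PID file has been reused by any other process,
# ClusterMon_monitor above still reports $OCF_SUCCESS even though crm_mon is dead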



Version-Release number of selected component (if applicable):
pacemaker.x86_64 - 1.1.12-8.el6_7.2


How reproducible:
Problem depends on the actual running processes and their PIDs.

Steps to Reproduce:
1. Configure a ClusterMon resource
2. Put the node into maintenance mode
3. Reboot the node
4. Either find a way to force the PID originally assigned to crm_mon onto another process, or repeat the procedure a few times until it happens by chance (see the simulation sketched after this list)
5. Disable maintenance mode
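
One way to simulate step 4 deterministically (a sketch that mirrors the verification steps in comment 9 below; the PID file path depends on the resource name, here "cmon"): kill crm_mon while the node is in maintenance mode and write the PID of another long-lived process into its PID file.

# simulate PID reuse on a cluster node (<node> is a placeholder for the node name)
pcs node maintenance <node>
kill -9 $(cat /tmp/ClusterMon_cmon.pid)   # stop crm_mon behind the cluster's back
echo 1 > /tmp/ClusterMon_cmon.pid         # PID 1 always exists, so "kill -s 0" will succeed
pcs node unmaintenance <node>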


Actual results:
If the PID of crm_mon changes after the reboot, ClusterMon_monitor still checks the old PID and the agent can end up killing another process instead.


Expected results:
ClusterMon_monitor checks the PID of the currently running crm_mon, not the stale PID from before the reboot.



Additional info:

The customer proposed the following patch:

ClusterMon_monitor() {
    if [ -f $OCF_RESKEY_pidfile ]; then
        pid=`cat $OCF_RESKEY_pidfile`
        if [ ! -z $pid ]; then
            kill -s 0 $pid >/dev/null 2>&1 \
                  && ps -fp $pid | grep $OCF_RESKEY_pidfile >/dev/null 2>&1
            rc=$?
            case $rc in
                0) exit $OCF_SUCCESS;;
                1) exit $OCF_NOT_RUNNING;;
                *) exit $OCF_ERR_GENERIC;;
            esac
        fi
    fi
    exit $OCF_NOT_RUNNING
}

This filters out the case where the process with the reported PID is not actually "crm_mon".

(However, the stop action might still kill another process.)
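
As a further sketch (an assumption for discussion, not the fix that was eventually shipped; see the tested patch linked in comment 4), the monitor could check the command name of the PID instead of grepping the full "ps -f" output, so that a reused PID owned by another program is never treated, or killed, as crm_mon:

ClusterMon_monitor() {
    if [ -f "$OCF_RESKEY_pidfile" ]; then
        pid=$(cat "$OCF_RESKEY_pidfile")
        if [ -n "$pid" ]; then
            # Only treat the PID as crm_mon if its command name matches;
            # a reused PID owned by another process falls through to
            # OCF_NOT_RUNNING instead of being reported as crm_mon.
            if ps -p "$pid" -o comm= 2>/dev/null | grep -q '^crm_mon$'; then
                exit $OCF_SUCCESS
            fi
            exit $OCF_NOT_RUNNING
        fi
    fi
    exit $OCF_NOT_RUNNING
}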

Comment 4 Oyvind Albrigtsen 2016-09-23 10:11:09 UTC
Tested and working patch: https://github.com/ClusterLabs/pacemaker/pull/1147

Comment 9 michal novacek 2017-01-16 14:22:24 UTC
I have verified that the ClusterMon resource agent correctly recognizes crm_mon as failed
in pacemaker-1.1.15-4.

---

Have a running pacemaker cluster configured (1).

before the fix pacemaker-1.1.14-8.el6_8.2.x86_64
================================================
crm_mon is recognized as running even though it is not

[root@virt-157 ~]# pcs resource
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-157

[root@virt-157 ~]# ps axf | grep crm_mon
10381 pts/0    S+     0:00          \_ grep crm_mon
10337 ?        S      0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html
[root@virt-157 ~]# cat /tmp/ClusterMon_cmon.pid 
     10337
[root@virt-157 ~]# pcs node maintenance virt-157
[root@virt-157 ~]# kill -9 10337
[root@virt-157 ~]# echo 1 > /tmp/ClusterMon_cmon.pid
[root@virt-157 ~]# pcs node unmaintenance virt-157

[root@virt-157 ~]# pcs resource
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-157

[root@virt-157 ~]# pcs resource debug-monitor cmon
Operation monitor for cmon (ocf:pacemaker:ClusterMon) returned 0

[root@virt-157 ~]# ps axf | grep crm_mon
 3273 pts/0    S+     0:00          \_ grep crm_mon

after the patch pacemaker-1.1.15-4.el6.x86_64
=============================================

[root@virt-157 ~]# pcs resource
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-157
[root@virt-157 ~]# ps axf | grep crm_mon
 4887 pts/1    S+     0:00          \_ grep crm_mon
 4539 ?        S      0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html
[root@virt-157 ~]# cat /tmp/ClusterMon_cmon.pid
      4539

[root@virt-157 ~]# pcs node maintenance virt-157
[root@virt-157 ~]# kill -9 4539
[root@virt-157 ~]# echo 1 > /tmp/ClusterMon_cmon.pid 
[root@virt-157 ~]# pcs node unmaintenance virt-157

[root@virt-157 ~]# pcs resource
 cmon   (ocf::pacemaker:ClusterMon):    Started virt-157

[root@virt-157 ~]# pcs resource debug-monitor cmon
Operation monitor for cmon (ocf:pacemaker:ClusterMon) returned 0

[root@virt-157 ~]# ps axf | grep crm_mon
 5327 pts/1    S+     0:00          \_ grep crm_mon
 5102 ?        S      0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_cmon.pid -d -i 15 -h /tmp/ClusterMon_cmon.html

 --

 > (1) pcs config
[root@virt-157 ~]# pcs config
Cluster Name: STSRHTS14109
Corosync Nodes:
 virt-157 virt-159
Pacemaker Nodes:
 virt-157 virt-159

Resources:
 Resource: cmon (class=ocf provider=pacemaker type=ClusterMon)
  Operations: start interval=0s timeout=20 (cmon-start-interval-0s)
              stop interval=0s timeout=20 (cmon-stop-interval-0s)
              monitor interval=10 timeout=20 (cmon-monitor-interval-10)

Stonith Devices:
 Resource: fence-virt-157 (class=stonith type=fence_xvm)
  Attributes: delay=5 pcmk_host_check=static-list pcmk_host_list=virt-157 pcmk_host_map=virt-157:virt-157.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-157-monitor-interval-60s)
 Resource: fence-virt-159 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-159 pcmk_host_map=virt-159:virt-159.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-159-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: cmon
    Enabled on: virt-157 (score:INFINITY) (role: Started) (id:cli-prefer-cmon)
Ordering Constraints:
Colocation Constraints:

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.15-4.el6-e174ec8
 have-watchdog: false
 last-lrm-refresh: 1484574749

Comment 11 errata-xmlrpc 2017-03-21 09:52:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2017-0629.html

