Note: This is a beta release of Red Hat Bugzilla 5.0. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Also email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback here.
Bug 1414139 - Potentially incorrect/undefined parsing results of time-related configuration parameters such as action intervals/timeouts
Summary: Potentially incorrect/undefined parsing results of time-related configuration...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: rgmanager
Version: 6.9
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: rc
: ---
Assignee: Jan Pokorný [poki]
QA Contact: cluster-qe@redhat.com
Steven J. Levine
URL:
Whiteboard:
Depends On:
Blocks: 1310529
TreeView+ depends on / blocked
 
Reported: 2017-01-17 21:04 UTC by Jan Pokorný [poki]
Modified: 2017-03-21 10:41 UTC (History)
4 users (show)

Fixed In Version: rgmanager-3.0.12.1-33.el6
Doc Type: Release Note
Doc Text:
Time-related values of cluster resource configuration are now evaluated properly Previously, time-related resource values in actual use could differ from the values configured in the `cluster.conf` file, especially at the initial configuration load. This could cause the `rgmanager` daemon to behave unpredictably. With this fix, `rgmanager` behaves exactly as configured with regards to resources and respective time-related values.
Clone Of:
Environment:
Last Closed: 2017-03-21 10:41:24 UTC
Target Upstream Version:


Attachments (Terms of Use)
Proposed fix (deleted)
2017-01-17 21:58 UTC, Jan Pokorný [poki]
no flags Details | Diff


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0684 normal SHIPPED_LIVE rgmanager bug fix update 2017-03-21 12:37:12 UTC
Red Hat Bugzilla 1036652 None CLOSED Apply upstream patch to prevent rgmanager crash on misconfigured "time_t" values (e.g., __restart_expire_time) in cluste... 2018-12-14 07:20:09 UTC

Internal Links: 1036652

Description Jan Pokorný [poki] 2017-01-17 21:04:55 UTC
Symptoms
========

Using following cluster.conf (original indented with tabs):

<?xml version="1.0"?>
<cluster config_version="5" name="test">
 <clusternodes>
  <clusternode name="virt-122.cluster-qe.lab.eng.brq.redhat.com" nodeid="1"/>
 </clusternodes>
 <totem consensus="200" join="100" token="5000" token_retransmits_before_loss_const="4"/>
 <logging>
  <logging_daemon debug="on" name="corosync" subsys="CONFDB"/>
 </logging>
 <rm>
  <resources>
   <fs device="/dev/vda1" mountpoint="/boot" name="Reference" options="ro,remount">
    <action depth="20" interval="0" name="status"/>
   </fs>
  </resources>
  <service autostart="1" max_restarts="321" name="DummyRef" recovery="relocate" restart_expire_time="654">
   <fs ref="Reference"/>
  </service>
 </rm>
</cluster>

I can reliably, i.e.
- behavior surviving reboot of a given VM
- behavior very likely surviving a bit different deployment of VM
  from scratch on the same archicture architecture (see below)
(- with glibc-2.12-1.208.el6.x86_64, libxml2-2.7.6-21.el6_8.1.x86_64),
reproduce following bad behavior of rgmanager (that I came across in
relation to [bug 1310529 comment 27]):

# service cman start
# rgmanager -f
> [...]
> Building Resource Trees                            
> Replacing action 'status' depth 20: interval: 60->2
> [...]

This does not make any sense as we are overriding status action of
Reference resource at depth 20 because we want to disable (interval=0)
the checks at that level because otherwise they are implicitly (through
metadata of fs.sh) scheduled to happen every 1 minute (60 seconds)
-- this is a use case of referred [bug 1310529].
And apparently, the status action at depth 20 did not get disabled,
but rather happens 30x faster!

The expected message is rather:

> Replacing action 'status' depth 20: interval: 60->0

and subsequent behavior should be really no such depth for a status
based monitoring is ever used.

Note that we are now discussing the startup behavior.  A config-reload
(e.g. ccs --incversion --sync --activate) action does not (again,
reliably) this issue for me, I can observe the expected message above.


Root cause
==========

It was found out that the issue was introduced with upstream commit:
https://git.fedorahosted.org/cgit/cluster.git/commit/?id=9d0ecbac1ff8ac1f045bd19bb73ca063a5351014
and the issue can only happen with a time specification ending with
a digit (as opposed to, e.g., ending the implicit entry in seconds
with explicit 's').  In that case, the logic of expand_time function
skips the nul-terminator after this last digit and continues with
a subsequent byte, which in general has unspecified value, just as any
other in row (this may also serve potential attackers, but because
cluster.conf is supposed to come from trusted sources, this line is
disregarded).

Note that first affected RHEL release is 6.6: [bug 1036652].


Workaround
==========

Always suffix the entry in seconds with explicit 's'.

Comment 1 Jan Pokorný [poki] 2017-01-17 21:58:29 UTC
Created attachment 1241957 [details]
Proposed fix

Affected attributes
===================

Following is hopefully an exhustive list of affected values
as expressed with fitting XPath expressions:

/cluster/rm/*//action/@timeout
/cluster/rm/*//action/@interval
/cluster/rm/*[name() != 'resources']/*//@__restart_expire_time
/cluster/rm/*[name() != 'resources']/@restart_expire_time
/cluster/rm/*[name() != 'resources']//@__failure_expire_time

In my observation, only explicit values in cluster.conf were affected
but let's suppose that even the implicit ones (e.g., encoded in
metadata of particular resource agents) are endangered.


Fix
===

See the attached patch.  Simplest solution is simply not to allow
nul-terminator to slip through the checks as detailed in Root cause
[comment 0].

Comment 9 errata-xmlrpc 2017-03-21 10:41:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0684.html


Note You need to log in before you can comment on or make changes to this bug.