Note: This is a beta release of Red Hat Bugzilla 5.0. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Also email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback here.

Bug 1600640

Summary: Improve fence agent meta-data failure logs
Product: Red Hat Enterprise Linux 7 Reporter: Andrew Price <anprice>
Component: pacemakerAssignee: Ken Gaillot <kgaillot>
Status: NEW --- QA Contact: cluster-qe <cluster-qe>
Severity: low Docs Contact:
Priority: medium    
Version: 7.6CC: abeekhof, cluster-maint, cmackows, gfs2-maint, jpokorny
Target Milestone: pre-dev-freezeKeywords: RFE
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
crm report
none
The corosync.log containing the failure none

Description Andrew Price 2018-07-12 16:28:54 UTC
Created attachment 1458468 [details]
crm report

Description of problem:

QE hit a revolver test failure on s390x where a node was fenced off and wasn't turned on again as expected.

Version-Release number of selected component (if applicable):

pacemaker-1.1.19-1.el7.s390x
fence-agents-zvm-4.2.1-1.el7.s390x

How reproducible:
Only hit once so far

Steps to Reproduce:
1. Get QE to run revolver on an s390x cluster

Actual results:

Waiting up to 600 seconds for cluster to recover.
Killing network interfaces on qe-c01-m03...
Jul 11 17:38:40 qe-c01-m01: Saw victim qe-c01-m03 leave corosync
Jul 11 17:38:40 qe-c01-m02: Saw victim qe-c01-m03 leave corosync
Jul 11 17:38:41 qe-c01-m01.s390.bos.redhat.com pengine[1551]:  warning: Scheduling Node qe-c01-m03 for STONITH
Jul 11 17:38:43 qe-c01-m01: Recovered s390-cluster1:s390-cluster11.1 jid 2
Jul 11 17:38:43 qe-c01-m01: Recovered s390-cluster1:s390-cluster12.1 jid 2
Jul 11 17:38:43 qe-c01-m01: Recovered s390-cluster1:s390-cluster10.0 jid 2
Didn't receive heartbeat from qe-c01-m03 for 2 seconds
Victims left to recover: qe-c01-m03
[...]
Victims left to recover: qe-c01-m03 
qe-c01-m03 has not rebooted in 600 seconds, down
qe-c01-m03 returned 1 

Expected results:

Test passes

Additional info:

crm_report attached. Significant log entry:

Jul 11 17:38:40 [1548] qe-c01-m01.s390.bos.redhat.com stonith-ng: warning: stonith_device_execute:     Agent 'fence_zvmip' does not advertise support for 'reboot', performing 'off' action instead

I don't see this in the logs anywhere else.

[root@qe-c01-m01 qe-c01-m01]# crm_resource --show-metadata stonith:fence_zvmip | grep 'action.*reboot'
    <action name="reboot"/>

Comment 2 Andrew Price 2018-07-12 16:44:06 UTC
Created attachment 1458473 [details]
The corosync.log containing the failure

Comment 3 Ken Gaillot 2018-07-12 17:07:39 UTC
As discussed elsewhere, the logs contain this earlier:

Jul 11 12:12:48 [1514] qe-c01-m01.s390.bos.redhat.com stonith-ng: warning: stonith__rhcs_metadata:     Could not execute metadata action for fence_zvmip: Software caused connection abort | rc=-103

Jul 11 12:12:48 [1514] qe-c01-m01.s390.bos.redhat.com stonith-ng: error: get_agent_metadata: Could not retrieve metadata for fencing agent fence_zvmip

which indicates that this is not a bug but a local failure of the agent for a meta-data call.

The rather unhelpful "connection abort" message indicates that the agent call either timed out (which might not be obvious in the logs, just compare the timestamps of when the meta-data call was initiated vs when the failure was reported), was interrupted by a signal (in which case there should also be an "... exited due to signal" log message), or failed in an unexpected way ("call ... returned unexpected status").

To solve the immediate problem, you'll have to figure out why the agent call failed and fix that.

If the underlying issue is transient, it might be sufficient to set the pcmk_meta-data_retries parameter to 2 or 3. This parameter is not advertised, so it may require a --force, but I believe it is supported. If the first meta-data call fails, it would immediately try again.

If the underlying issue is a timeout, it might be sufficient to set pcmk_meta-data_timeout appropriately. Again this is not an advertised parameter but should work.

I'll leave this BZ open as a feature request to improve the error messages, and advertise the pcmk_meta-data_* parameters if supported. Specifically if meta-data fails, we should give a more descriptive failure message, and we should warn about the specific assumptions that will be made without the benefit of meta-data.

Also, pacemaker currently tries to get the meta-data when parsing the device configuration from the CIB; we could consider whether it might be useful to retry a failed meta-data command at some later point, but that might not be practical.

Comment 4 Andrew Price 2018-07-12 17:23:47 UTC
Thanks, Ken. There isn't a timestamp for the metadata call initiation so I can't tell how long it took unfortunately. I don't see the other forms of log messages though so it seems most likely a timeout. I'll take a look at the system configs for any reasons it might be slow and talk to Chris M about trying those workarounds. What are the defaults for those two undocumented settings?

Comment 5 Ken Gaillot 2018-07-12 18:54:20 UTC
The default meta-data timeout for fence agents is 5 seconds. (As an aside, we use 30 seconds as the default for resource agents, so it might be a good idea to make this the same.)

Looking at the log, it does look like a timeout:

Jul 11 12:12:42 [1514] qe-c01-m01.s390.bos.redhat.com stonith-ng:     info: init_cib_cache_cb:  Updating device list from the cib: init
...
Jul 11 12:12:48 [1514] qe-c01-m01.s390.bos.redhat.com stonith-ng:  warning: stonith__rhcs_metadata:     Could not execute metadata action for fence_zvmip: Software caused connection abort | rc=-103


I mis-remembered the default retries as only one attempt, but actually the default retries for all actions is 2. The timeout is for the entire action, so the cluster doesn't retry after a timeout. In other words, there's no way currently to put a shorter timeout on the individual attempts and a larger timeout on the entire operation, which might help if the first attempt were hung and would never complete.

Comment 6 Andrew Price 2018-07-13 09:51:12 UTC
6 seconds does seem like a long time just to output some static(?) data so either the node was pretty heavily loaded at that point or the agent is doing something sub-optimal. I might open a new bz against the fence-agents to get that figured out. Thanks again.

Comment 7 Jan Pokorný [poki] 2018-07-13 10:47:48 UTC
Especially if this was indeed caused by an elapsed timeout and
considering fence_zvmip boils down to Python implementation
(note there's also a C implementation along!), I'd note several
points:

0. Python is a poor choice of the tool for the task, to start with
   (migrating whatever eligible to a subset of MicroPython could
   be a reasonable tradeoff)

1. Python-based fence-agents appear to be launched in a poor
   way, meaning it is fairly easy to inflict any sort of issues,
   like unindented delay because of running some kind magic
   for any default Python interpreter runs (e.g. if root user
   manages custom site-packages directory, with *.pth files
   that get interpreted prior to any default run there,
   for instance): filed a bug to that effect for pcs a while
                  back, though only mentioned that internally
                  in the context of cluster Python-based
                  components on the whole, so rectifying that
                  now with [bug 1600893]

Next, perhaps it might make sense, as some kind of an optional
OCF extension, to declare a mechanism of explicit meta-data
caching by compliant resource managers:

  a. dedicate a directory on presumably a volatile partition,
     likewise presumably OCF-namespaced, for this meta-data cache

  b. whenever meta-data (of actively used agents?) are required:
     - if cached version of meta-data doesn't exist in cache,
       or is older than the agent itself, rerun caching process,
       when the same content is obtained, just refresh the
       timestamp, otherwise store anew, then ...
     - otherwise or in addition, return cached content, do not
       invoke agent again needlessly

There are plentiful advantages with this scheme, beside not draining
resources with new processes needlessly, it'd be a good addition
for crm_report to grab these cached files, especially if only
actively used resources (at some point, that is) were a subject
to this externalized caching.

Comment 8 Jan Pokorný [poki] 2018-07-13 11:00:30 UTC
re [comment 6]:

> 6 seconds does seem like a long time just to output some static(?)
> data

Yes, those data are expected to be invariant, not dependent on the
temperature of the CPU, that's why the caching per [comment 7]
would be feasible at all.

Unrelated to fence_zvmip, sadly, there is very low discipline about
the implementation of the agents, meaning people do all kind of crazy
things, including talking back to the resource manager even for
the purpose of getting meta-data (!):

https://bugs.clusterlabs.org/show_bug.cgi?id=5357#c16

Comment 9 Andrew Price 2018-07-13 11:12:46 UTC
Well in gfs2 workloads in particular we need the cluster stack to be robust in cases where a lot of memory is being used for dirty pages. In that case anything requiring memory allocations can be slow, as pages might need to be flushed back to disk to free up memory, and so memory allocations can briefly become dependent on disk speed. Perhaps that's what we're seeing here when the python fence agent is executed.

Comment 10 Andrew Price 2018-07-17 11:02:33 UTC
(In reply to Andrew Price from comment #6)
> 6 seconds does seem like a long time just to output some static(?) data so
> either the node was pretty heavily loaded at that point or the agent is
> doing something sub-optimal. I might open a new bz against the fence-agents
> to get that figured out. Thanks again.

https://bugzilla.redhat.com/show_bug.cgi?id=1601820

Comment 12 Chris Mackowski 2018-07-18 21:46:22 UTC
I have been recording the 'time' output of -o metadata at an interval of 60 seconds. Below are a few of the slowest times. These were recorded during GFS2 testing.   

qe-c01-m01:
Jul 18 13:45:00	0.03user 0.02system 0:00.59elapsed 9%CPU (0avgtext+0avgdata 11232maxresident)k
Jul 18 14:45:07 0.03user 0.02system 0:00.71elapsed 7%CPU (0avgtext+0avgdata 11232maxresident)k
Jul 18 14:51:40 0.03user 0.02system 0:00.78elapsed 7%CPU (0avgtext+0avgdata 11236maxresident)k

qe-c01-m02:
Jul 18 13:38:22 0.04user 0.02system 0:02.19elapsed 3%CPU (0avgtext+0avgdata 11232maxresident)k
Jul 18 14:38:32 0.03user 0.02system 0:00.67elapsed 9%CPU (0avgtext+0avgdata 11236maxresident)k

qe-c01-m03:
Jul 18 13:51:48 0.03user 0.02system 0:01.08elapsed 5%CPU (0avgtext+0avgdata 11236maxresident)k
Jul 18 13:58:23 0.03user 0.02system 0:00.96elapsed 6%CPU (0avgtext+0avgdata 11236maxresident)k
Jul 18 15:31:56 0.03user 0.02system 0:00.54elapsed 10%CPU (0avgtext+0avgdata 11236maxresident)k
Jul 18 15:38:15 0.03user 0.02system 0:00.89elapsed 7%CPU (0avgtext+0avgdata 11236maxresident)k
Jul 18 15:44:32 0.04user 0.02system 0:00.85elapsed 7%CPU (0avgtext+0avgdata 11236maxresident)k

I also observed sluggish performance from 'rpm -qa'. Unfortunately I did not grab a datestamp, approx July 18 16:15 CDT. There was no GFS2 testing taking place, systems essentially idle.        

[root@qe-c01-m02 ~]# time rpm -qa | grep pcp
snip.....
real	0m41.565s
user	0m0.697s
sys	0m0.298s

[root@qe-c01-m03 ~]# time rpm -qa | grep pcp
snip....
real	0m40.453s
user	0m0.645s
sys	0m0.264s

-Chris Mackowski

Comment 13 Andrew Price 2018-07-25 10:27:27 UTC
I've been adding better monitoring to the cluster and noticed a low level of entropy available for rng, which really should be expected as this is virt. Theoretically that could cause processes to block, and not only when using /dev/random. To give it a boost I've enabled rngd to consume from /dev/prandom (which is a z specific randomness source provided by the prng kernel module).

Chris, could you set up the revolver test to run for a long time with the new tweaks?

Comment 15 Andrew Price 2018-08-02 15:45:54 UTC
Just to be clear on the workaround, since it's undocumented, is this the right way to set those parameters?

pcs stonith update --force zfence1 pcmk_meta-data_timeout=20 pcmk_monitor_timeout=60s

Comment 16 Ken Gaillot 2018-08-02 16:13:48 UTC
(In reply to Andrew Price from comment #15)
> Just to be clear on the workaround, since it's undocumented, is this the
> right way to set those parameters?
> 
> pcs stonith update --force zfence1 pcmk_meta-data_timeout=20
> pcmk_monitor_timeout=60s

Yes, that should work.