Summary: Improve fence agent meta-data failure logs
Product: Red Hat Enterprise Linux 7
Reporter: Andrew Price <anprice>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: NEW
QA Contact: cluster-qe <cluster-qe>
Version: 7.6
CC: abeekhof, cluster-maint, cmackows, gfs2-maint, jpokorny
Description Andrew Price 2018-07-12 16:28:54 UTC
Created attachment 1458468 [details]
crm report

Description of problem:
QE hit a revolver test failure on s390x where a node was fenced off and wasn't turned on again as expected.

Version-Release number of selected component (if applicable):
pacemaker-1.1.19-1.el7.s390x
fence-agents-zvm-4.2.1-1.el7.s390x

How reproducible:
Only hit once so far

Steps to Reproduce:
1. Get QE to run revolver on an s390x cluster

Actual results:
Waiting up to 600 seconds for cluster to recover.
Killing network interfaces on qe-c01-m03...
Jul 11 17:38:40 qe-c01-m01: Saw victim qe-c01-m03 leave corosync
Jul 11 17:38:40 qe-c01-m02: Saw victim qe-c01-m03 leave corosync
Jul 11 17:38:41 qe-c01-m01.s390.bos.redhat.com pengine: warning: Scheduling Node qe-c01-m03 for STONITH
Jul 11 17:38:43 qe-c01-m01: Recovered s390-cluster1:s390-cluster11.1 jid 2
Jul 11 17:38:43 qe-c01-m01: Recovered s390-cluster1:s390-cluster12.1 jid 2
Jul 11 17:38:43 qe-c01-m01: Recovered s390-cluster1:s390-cluster10.0 jid 2
Didn't receive heartbeat from qe-c01-m03 for 2 seconds
Victims left to recover: qe-c01-m03
[...]
Victims left to recover: qe-c01-m03
qe-c01-m03 has not rebooted in 600 seconds, down
qe-c01-m03 returned 1

Expected results:
Test passes

Additional info:
crm_report attached. Significant log entry:

Jul 11 17:38:40 qe-c01-m01.s390.bos.redhat.com stonith-ng: warning: stonith_device_execute: Agent 'fence_zvmip' does not advertise support for 'reboot', performing 'off' action instead

I don't see this in the logs anywhere else.

[root@qe-c01-m01 qe-c01-m01]# crm_resource --show-metadata stonith:fence_zvmip | grep 'action.*reboot'
<action name="reboot"/>
Comment 2 Andrew Price 2018-07-12 16:44:06 UTC
Created attachment 1458473 [details] The corosync.log containing the failure
Comment 3 Ken Gaillot 2018-07-12 17:07:39 UTC
As discussed elsewhere, the logs contain this earlier:

Jul 11 12:12:48 qe-c01-m01.s390.bos.redhat.com stonith-ng: warning: stonith__rhcs_metadata: Could not execute metadata action for fence_zvmip: Software caused connection abort | rc=-103
Jul 11 12:12:48 qe-c01-m01.s390.bos.redhat.com stonith-ng: error: get_agent_metadata: Could not retrieve metadata for fencing agent fence_zvmip

which indicates that this is not a bug but a local failure of the agent for a meta-data call. The rather unhelpful "connection abort" message indicates that the agent call either timed out (which might not be obvious in the logs; compare the timestamp of when the meta-data call was initiated with when the failure was reported), was interrupted by a signal (in which case there should also be an "... exited due to signal" log message), or failed in an unexpected way ("call ... returned unexpected status").

To solve the immediate problem, you'll have to figure out why the agent call failed and fix that. If the underlying issue is transient, it might be sufficient to set the pcmk_meta-data_retries parameter to 2 or 3. This parameter is not advertised, so setting it may require a --force, but I believe it is supported. If the first meta-data call fails, the cluster would immediately try again. If the underlying issue is a timeout, it might be sufficient to set pcmk_meta-data_timeout appropriately. Again, this is not an advertised parameter but should work.

I'll leave this BZ open as a feature request to improve the error messages and advertise the pcmk_meta-data_* parameters if supported. Specifically, if meta-data fails, we should give a more descriptive failure message, and we should warn about the specific assumptions that will be made without the benefit of meta-data.
Also, pacemaker currently tries to get the meta-data when parsing the device configuration from the CIB; we could consider whether it might be useful to retry a failed meta-data command at some later point, but that might not be practical.
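The three failure modes described above (timeout, signal, unexpected exit status) can be illustrated with a small sketch. This is not stonith-ng's actual logic (that lives in Pacemaker's C services library); it just shows, using Python's subprocess semantics, how an external agent call can be classified the way the log messages distinguish them:

```python
import signal
import subprocess

def classify_agent_call(cmd, timeout_s):
    """Run an external command and classify the outcome the way the
    comment above describes: timeout, killed-by-signal, or unexpected
    exit status. Illustrative sketch only, not Pacemaker code."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # A timeout may only be visible as a gap between the start and
        # failure timestamps in the logs.
        return "timeout"
    if proc.returncode < 0:
        # A negative return code means the process was terminated by a
        # signal ("... exited due to signal").
        return "signal " + signal.Signals(-proc.returncode).name
    if proc.returncode != 0:
        # "call ... returned unexpected status"
        return "unexpected status %d" % proc.returncode
    return "ok"
```

The rc=-103 ("Software caused connection abort") in the log is the generic wrapper error that hides which of these three cases actually occurred, which is exactly what this BZ asks to improve.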
Comment 4 Andrew Price 2018-07-12 17:23:47 UTC
Thanks, Ken. There isn't a timestamp for the metadata call initiation so I can't tell how long it took unfortunately. I don't see the other forms of log messages though so it seems most likely a timeout. I'll take a look at the system configs for any reasons it might be slow and talk to Chris M about trying those workarounds. What are the defaults for those two undocumented settings?
Comment 5 Ken Gaillot 2018-07-12 18:54:20 UTC
The default meta-data timeout for fence agents is 5 seconds. (As an aside, we use 30 seconds as the default for resource agents, so it might be a good idea to make them the same.)

Looking at the log, it does look like a timeout:

Jul 11 12:12:42 qe-c01-m01.s390.bos.redhat.com stonith-ng: info: init_cib_cache_cb: Updating device list from the cib: init
...
Jul 11 12:12:48 qe-c01-m01.s390.bos.redhat.com stonith-ng: warning: stonith__rhcs_metadata: Could not execute metadata action for fence_zvmip: Software caused connection abort | rc=-103

I mis-remembered the default retries as only one attempt, but actually the default retries for all actions is 2. The timeout applies to the entire action, so the cluster doesn't retry after a timeout. In other words, there's currently no way to put a shorter timeout on the individual attempts and a larger timeout on the entire operation, which might help if the first attempt were hung and would never complete.
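The per-attempt-timeout-under-an-overall-deadline scheme that Pacemaker currently lacks could look roughly like the sketch below. This is a hypothetical illustration of the idea, not proposed Pacemaker code; the function name and parameters are invented for the example:

```python
import subprocess
import time

def run_with_retries(cmd, attempt_timeout, overall_timeout, retries=2):
    """Retry a command with a short per-attempt timeout while an overall
    deadline remains. A hung first attempt is killed and retried instead
    of consuming the whole budget, which is the behavior the comment
    above notes is not currently possible."""
    deadline = time.monotonic() + overall_timeout
    last_err = None
    for _ in range(retries):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # overall budget exhausted
        try:
            return subprocess.run(
                cmd, capture_output=True,
                timeout=min(attempt_timeout, remaining), check=True)
        except (subprocess.TimeoutExpired,
                subprocess.CalledProcessError) as err:
            last_err = err  # this attempt failed; retry if budget remains
    raise RuntimeError("all attempts failed: %r" % last_err)
```

With today's semantics, by contrast, the single action-wide timeout means one hung attempt exhausts the budget and pcmk_meta-data_retries never gets a chance to fire.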
Comment 6 Andrew Price 2018-07-13 09:51:12 UTC
6 seconds does seem like a long time just to output some static(?) data, so either the node was pretty heavily loaded at that point or the agent is doing something sub-optimal. I might open a new BZ against fence-agents to get that figured out. Thanks again.
Comment 7 Jan Pokorný [poki] 2018-07-13 10:47:48 UTC
Especially if this was indeed caused by an elapsed timeout, and considering fence_zvmip boils down to a Python implementation (note there's also a C implementation alongside!), I'd note several points:

0. Python is a poor choice of tool for the task to start with (migrating whatever is eligible to a subset of MicroPython could be a reasonable tradeoff).

1. Python-based fence-agents appear to be launched in a poor way, meaning it is fairly easy to inflict all sorts of issues, like unintended delay caused by the magic that runs on any default Python interpreter start (e.g. if the root user manages a custom site-packages directory with *.pth files that get interpreted prior to any default run). I filed a bug to that effect for pcs a while back, though I only mentioned it internally in the context of cluster Python-based components as a whole, so rectifying that now with [bug 1600893].

Next, perhaps it might make sense, as some kind of optional OCF extension, to declare a mechanism of explicit meta-data caching by compliant resource managers:

a. Dedicate a directory, presumably on a volatile partition and presumably OCF-namespaced, for this meta-data cache.
b. Whenever meta-data (of actively used agents?) are required:
   - if a cached version of the meta-data doesn't exist in the cache, or is older than the agent itself, rerun the caching process; when the same content is obtained, just refresh the timestamp, otherwise store it anew, then ...
   - otherwise, or in addition, return the cached content; do not invoke the agent again needlessly.

There are plenty of advantages to this scheme. Besides not draining resources with needless new processes, it would be a good addition for crm_report to grab these cached files, especially if only actively used resources (at some point, that is) were subject to this externalized caching.
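The caching scheme in points (a) and (b) above can be sketched as follows. This is a hypothetical illustration, not part of any OCF specification: the cache directory path and function name are invented, and the invalidation rule is simply "cache file older than the agent executable":

```python
import os
import subprocess

def cached_metadata(agent_path, cache_dir="/run/ocf-metadata-cache"):
    """Return an agent's meta-data, re-running the agent only when the
    cached copy is missing or older than the agent itself. Sketch of
    the scheme proposed above; paths and names are hypothetical."""
    os.makedirs(cache_dir, exist_ok=True)
    cache = os.path.join(cache_dir, os.path.basename(agent_path) + ".xml")
    try:
        if os.path.getmtime(cache) >= os.path.getmtime(agent_path):
            # Cache is at least as new as the agent: serve it without
            # spawning a new process.
            with open(cache) as f:
                return f.read()
    except FileNotFoundError:
        pass  # no cached copy yet
    # Cache miss or stale: invoke the agent once and store the result.
    out = subprocess.run([agent_path, "-o", "metadata"],
                         capture_output=True, text=True, check=True).stdout
    with open(cache, "w") as f:
        f.write(out)
    return out
```

Since meta-data is expected to be invariant for a given agent build, invalidating on the agent file's mtime is enough; a resource manager adopting this would pay the agent-startup cost once per agent update rather than on every CIB refresh.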
Comment 8 Jan Pokorný [poki] 2018-07-13 11:00:30 UTC
re [comment 6]:
> 6 seconds does seem like a long time just to output some static(?) data

Yes, those data are expected to be invariant (not dependent on the temperature of the CPU); that's why the caching per [comment 7] would be feasible at all.

Unrelated to fence_zvmip: sadly, there is very little discipline about the implementation of the agents, meaning people do all kinds of crazy things, including talking back to the resource manager even for the purpose of getting meta-data (!):
https://bugs.clusterlabs.org/show_bug.cgi?id=5357#c16
Comment 9 Andrew Price 2018-07-13 11:12:46 UTC
Well, in gfs2 workloads in particular, we need the cluster stack to be robust in cases where a lot of memory is being used for dirty pages. In that situation anything requiring memory allocations can be slow, as pages might need to be flushed back to disk to free up memory, so memory allocations can briefly become dependent on disk speed. Perhaps that's what we're seeing here when the Python fence agent is executed.
Comment 10 Andrew Price 2018-07-17 11:02:33 UTC
(In reply to Andrew Price from comment #6)
> 6 seconds does seem like a long time just to output some static(?) data so
> either the node was pretty heavily loaded at that point or the agent is
> doing something sub-optimal. I might open a new bz against the fence-agents
> to get that figured out. Thanks again.

https://bugzilla.redhat.com/show_bug.cgi?id=1601820
Comment 12 Chris Mackowski 2018-07-18 21:46:22 UTC
I have been recording the 'time' output of -o metadata at an interval of 60 seconds. Below are a few of the slowest times. These were recorded during GFS2 testing.

qe-c01-m01:
Jul 18 13:45:00 0.03user 0.02system 0:00.59elapsed 9%CPU (0avgtext+0avgdata 11232maxresident)k
Jul 18 14:45:07 0.03user 0.02system 0:00.71elapsed 7%CPU (0avgtext+0avgdata 11232maxresident)k
Jul 18 14:51:40 0.03user 0.02system 0:00.78elapsed 7%CPU (0avgtext+0avgdata 11236maxresident)k

qe-c01-m02:
Jul 18 13:38:22 0.04user 0.02system 0:02.19elapsed 3%CPU (0avgtext+0avgdata 11232maxresident)k
Jul 18 14:38:32 0.03user 0.02system 0:00.67elapsed 9%CPU (0avgtext+0avgdata 11236maxresident)k

qe-c01-m03:
Jul 18 13:51:48 0.03user 0.02system 0:01.08elapsed 5%CPU (0avgtext+0avgdata 11236maxresident)k
Jul 18 13:58:23 0.03user 0.02system 0:00.96elapsed 6%CPU (0avgtext+0avgdata 11236maxresident)k
Jul 18 15:31:56 0.03user 0.02system 0:00.54elapsed 10%CPU (0avgtext+0avgdata 11236maxresident)k
Jul 18 15:38:15 0.03user 0.02system 0:00.89elapsed 7%CPU (0avgtext+0avgdata 11236maxresident)k
Jul 18 15:44:32 0.04user 0.02system 0:00.85elapsed 7%CPU (0avgtext+0avgdata 11236maxresident)k

I also observed sluggish performance from 'rpm -qa'. Unfortunately I did not grab a datestamp; it was approximately July 18 16:15 CDT. There was no GFS2 testing taking place; the systems were essentially idle.

[root@qe-c01-m02 ~]# time rpm -qa | grep pcp
snip.....
real 0m41.565s
user 0m0.697s
sys 0m0.298s

[root@qe-c01-m03 ~]# time rpm -qa | grep pcp
snip....
real 0m40.453s
user 0m0.645s
sys 0m0.264s

-Chris Mackowski
Comment 13 Andrew Price 2018-07-25 10:27:27 UTC
I've been adding better monitoring to the cluster and noticed a low level of entropy available to the RNG, which really should be expected since this is virt. Theoretically that could cause processes to block, and not only when using /dev/random. To give it a boost I've enabled rngd to consume from /dev/prandom (a z-specific randomness source provided by the prng kernel module). Chris, could you set up the revolver test to run for a long time with the new tweaks?
Comment 15 Andrew Price 2018-08-02 15:45:54 UTC
Just to be clear on the workaround, since it's undocumented: is this the right way to set those parameters?

pcs stonith update --force zfence1 pcmk_meta-data_timeout=20 pcmk_monitor_timeout=60s
Comment 16 Ken Gaillot 2018-08-02 16:13:48 UTC
(In reply to Andrew Price from comment #15)
> Just to be clear on the workaround, since it's undocumented, is this the
> right way to set those parameters?
>
> pcs stonith update --force zfence1 pcmk_meta-data_timeout=20
> pcmk_monitor_timeout=60s

Yes, that should work.