Bug 596398 - [RFE] NOT_RESPONDING_WANT_CORE =TRUE: follow SIGABRT with SIGKILL
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.2
Hardware: All
OS: Linux
low
medium
Target Milestone: 1.3.2
: ---
Assignee: Matthew Farrellee
QA Contact: Luigi Toscano
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-05-26 17:38 UTC by Matthew Farrellee
Modified: 2018-11-14 19:11 UTC

Fixed In Version: condor-7.4.5-0.1
Doc Type: Enhancement
Doc Text:
C: condor_master uses SIGABRT to terminate children when NOT_RESPONDING_WANT_CORE=TRUE
C: If the child does not exit in response to SIGABRT, the child may never exit.
F: The condor_master now follows up the SIGABRT with a SIGKILL after 600 seconds (not configurable).
R: The hung child will eventually be terminated by the condor_master.
Clone Of:
Environment:
Last Closed: 2011-02-15 12:16:33 UTC
Target Upstream Version:


Links
System: Red Hat Product Errata
ID: RHBA-2011:0217
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Enterprise MRG Messaging and Grid bug fix and enhancement update
Last Updated: 2011-02-15 12:10:15 UTC

Description Matthew Farrellee 2010-05-26 17:38:14 UTC
Upstream at https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=746

During development or during deployment when trying to capture a rare event, it is useful to have the master send a SIGABRT instead of SIGKILL when a daemon stops responding. However, the daemon may be in a state where the SIGABRT does not cause the daemon to exit. In such a situation, the master should follow up the SIGABRT with a SIGKILL.

Comment 1 Matthew Farrellee 2010-07-01 12:21:13 UTC
Description of problem:

We've set NOT_RESPONDING_WANT_CORE=True because we want to collect condor core
dumps for support purposes:

http://www.cs.wisc.edu/condor/manual/v7.4/3_3Configuration.html#SECTION00435000000000000000

This parameter tells a parent process (e.g. condor_master) to send a kill -ABRT
instead of a kill -KILL to a hung child.

What I've discovered when conducting HA testing is that if you kill -STOP a
process, condor_master will recognize the process as hung and send it a kill
-ABRT exactly as documented.  However, a stopped process can't handle the ABRT,
so the hung process stays hung and remains so forever.

DaemonCore should escalate the signals it sends to a child daemon: if a
child doesn't exit after an ABRT, a timeout should apply, and once it
expires the child should be sent a KILL.
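
The escalation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not DaemonCore's actual code: escalate_kill, spawn_abort_ignoring_child, grace_seconds and poll_interval are hypothetical names chosen for the example, and the child is an ordinary subprocess rather than a Condor daemon.

```python
import signal
import subprocess
import sys
import time


def escalate_kill(proc, grace_seconds, poll_interval=0.05):
    """Send SIGABRT to proc; if it is still alive after grace_seconds, send SIGKILL.

    proc is a subprocess.Popen object. Returns the names of the signals sent.
    """
    sent = ["SIGABRT"]
    proc.send_signal(signal.SIGABRT)

    # Give the child a bounded amount of time to exit (and dump core).
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return sent  # the child exited in response to SIGABRT
        time.sleep(poll_interval)

    # Still alive after the grace period: follow up with SIGKILL.
    if proc.poll() is None:
        proc.kill()
        sent.append("SIGKILL")
        proc.wait()
    return sent


def spawn_abort_ignoring_child():
    """Start a child that ignores SIGABRT (a stand-in for a daemon that will not exit)."""
    code = (
        "import signal, time\n"
        "signal.signal(signal.SIGABRT, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    proc = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    proc.stdout.readline()  # block until the child has installed its handler
    return proc
```

Calling escalate_kill(spawn_abort_ignoring_child(), grace_seconds=0.3) sends both signals, because the child never reacts to the SIGABRT. A real master would use a much longer grace period so that a core dump has time to complete.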

How reproducible:

100%

Steps to Reproduce:

Set NOT_RESPONDING_WANT_CORE=true.  Set NOT_RESPONDING_TIMEOUT to something
more sane than 1 hour for your testing purposes.  kill -STOP a process like
condor_negotiator.  Wait for the NOT_RESPONDING_TIMEOUT duration.  Monitor the
MasterLog to see condor_master say something like:

05/21 15:28:41 ERROR: Child pid 28198 appears hung! Killing it hard.

Note that the child process is *not* killed hard, and never will be.
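
The underlying signal behavior can be reproduced without Condor. The following sketch assumes a Linux (or similar POSIX) host: it stops a throwaway child with SIGSTOP, sends it SIGABRT, and checks that the child is still present afterwards; only the final SIGKILL removes it. The function name and wait_seconds are made up for illustration.

```python
import os
import signal
import subprocess
import sys
import time


def stopped_child_survives_abort(wait_seconds=0.5):
    """Return (alive_after_abrt, returncode_after_kill) for a SIGSTOPped child."""
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    try:
        proc.send_signal(signal.SIGSTOP)
        # Wait until the kernel reports the child as stopped, so the SIGABRT
        # below cannot be delivered before the stop takes effect.
        _, status = os.waitpid(proc.pid, os.WUNTRACED)
        if not os.WIFSTOPPED(status):
            raise RuntimeError("child did not stop as expected")

        # The SIGABRT stays pending while the child is stopped.
        proc.send_signal(signal.SIGABRT)
        time.sleep(wait_seconds)
        alive_after_abrt = proc.poll() is None
    finally:
        # SIGKILL is acted on even while the process is stopped.
        proc.kill()
        returncode = proc.wait()
    return alive_after_abrt, returncode
```

The first element of the result shows whether the stopped child outlived the SIGABRT; the second is the exit status recorded after the SIGKILL.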

Actual results:

Expected results:

I'd expect a tunable called something like NOT_RESPONDING_CORE_TIMEOUT: if
NOT_RESPONDING_WANT_CORE=true, a hung process is sent an ABRT and then, once
NOT_RESPONDING_CORE_TIMEOUT has elapsed, a KILL.
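
A sketch of how such a tunable might map to a signal schedule follows. NOT_RESPONDING_CORE_TIMEOUT is only the name proposed in this comment, not an existing Condor parameter; kill_plan and the plain dict standing in for the configuration are likewise hypothetical, and the boolean parsing is deliberately simplified. The 600 second default mirrors the fixed delay mentioned in this bug's Doc Text.

```python
import signal

# Assumed default, taken from the 600 second delay in this bug's Doc Text.
DEFAULT_CORE_TIMEOUT = 600


def kill_plan(config):
    """Return a list of (delay_seconds, signal) steps for a hung child.

    config is a dict of string settings; the keys are illustrative only.
    """
    raw = str(config.get("NOT_RESPONDING_WANT_CORE", "false"))
    want_core = raw.strip().lower() == "true"
    if not want_core:
        # No core wanted: kill the hung child immediately.
        return [(0, signal.SIGKILL)]

    core_timeout = int(config.get("NOT_RESPONDING_CORE_TIMEOUT", DEFAULT_CORE_TIMEOUT))
    # Core wanted: abort first, then kill once the timeout has elapsed.
    return [(0, signal.SIGABRT), (core_timeout, signal.SIGKILL)]
```

With an empty configuration the plan is a single immediate SIGKILL; with NOT_RESPONDING_WANT_CORE set to true it becomes SIGABRT followed by SIGKILL after the timeout.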

Comment 2 Matthew Farrellee 2010-07-01 12:21:31 UTC
*** Bug 609692 has been marked as a duplicate of this bug. ***

Comment 3 Matthew Farrellee 2010-10-05 02:25:50 UTC
Resolved in 7.5.5:

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1688

Comment 4 Matthew Farrellee 2010-11-18 17:59:42 UTC
GT1688 pulled into V7_4-BZ596398-sigabrt-escalation-backport-branch, to be merged for condor post 7.4.4-0.17

Comment 5 Matthew Farrellee 2010-11-18 21:51:11 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: condor_master uses SIGABRT to terminate children when NOT_RESPONDING_WANT_CORE=TRUE
C: If the child does not exit in response to SIGABRT, the child may never exit.
F: The condor_master now follows up the SIGABRT with a SIGKILL after 600 seconds (not configurable).
R: The hung child will eventually be terminated by the condor_master.

Comment 8 Luigi Toscano 2011-02-01 16:02:12 UTC
The behavior described in #c5 has been tested and verified on RHEL 4.9 beta (20110127) / RHEL 5.6, i386/x86_64.
condor-7.4.5-0.7

Comment 9 errata-xmlrpc 2011-02-15 12:16:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html

