Bug 1120829 - [RFE] Do not fence hosts when more than X% of hosts are in a Non-Responding or Connecting state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.5.0
Assignee: Eli Mesika
QA Contact: sefi litmanovich
URL:
Whiteboard: infra
Depends On:
Blocks: 1084611 rhev3.5beta 1156165
Reported: 2014-07-17 19:37 UTC by Scott Herold
Modified: 2018-12-09 18:13 UTC (History)
CC List: 14 users

Fixed In Version: ovirt-3.5.0_rc1.1
Doc Type: Enhancement
Doc Text:
A new option in the 'Fencing Policy' tab of the 'New/Edit Cluster' window allows users to disable fencing of hosts in the cluster if more than a user-defined percentage of hosts have connectivity issues. This can prevent hosts being fenced in scenarios where hosts are in a 'Non-Responding' or 'Connecting' state due to a general network connectivity error, rather than a host error.
Clone Of:
Environment:
Last Closed: 2015-02-11 18:06:13 UTC
oVirt Team: Infra
Target Upstream Version:


Links
System | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2015:0158 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Virtualization Manager 3.5.0 | 2015-02-11 22:38:50 UTC
oVirt gerrit | 31303 | master | MERGED | core: Skip fencing if host has connectivity issues | Never
oVirt gerrit | 31304 | master | ABANDONED | core: Skip hard fencing if host has connectivity.. | Never
oVirt gerrit | 31305 | master | MERGED | UI:Adding support for host with connectivity.. | Never
oVirt gerrit | 31615 | ovirt-engine-3.5 | MERGED | core: Skip fencing if host has connectivity issues | Never
oVirt gerrit | 31616 | ovirt-engine-3.5 | MERGED | UI:Adding support for host with connectivity.. | Never
oVirt gerrit | 31640 | master | MERGED | core: Fix FencingPolicy copy constructor | Never
oVirt gerrit | 31641 | master | MERGED | webadmin: Fix Fencing Policy look | Never
oVirt gerrit | 31654 | ovirt-engine-3.5 | MERGED | core: Fix FencingPolicy copy constructor | Never
oVirt gerrit | 31655 | ovirt-engine-3.5 | MERGED | webadmin: Fix Fencing Policy look | Never

Description Scott Herold 2014-07-17 19:37:28 UTC
Infra/Engine component to improve fencing logic

When Triggered
--------------
This action is triggered once RHEV-M has decided that a target host may need to be fenced.  Before a fencing command is issued, the user may specify an optional wait time during which the engine checks what percentage or number of all hosts have moved to a non-operational state.

In this workflow, a calculation will be done over a short period of time to determine whether one or both of two optional conditions are met before fencing commands are sent to proxy hosts:

1) If more than X% of hosts are non-operational, do not fence any host
2) If more than Y# of hosts are non-operational, do not fence any host

This will help prevent fence storms from leaving the engine, where they could be picked up in retries and introduce a race condition on fence requests.
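
The two conditions above could be sketched roughly as follows. This is only an illustration in Python, not the ovirt-engine code: the function name `should_skip_fencing` and its parameters are hypothetical, and treating a threshold of `None` as "condition not configured" is an assumption.

```python
from typing import Optional


def should_skip_fencing(
    total_hosts: int,
    failed_hosts: int,
    max_failed_percent: Optional[float] = None,  # hypothetical "X%" setting
    max_failed_count: Optional[int] = None,      # hypothetical "Y#" setting
) -> bool:
    """Return True if no host should be fenced under the proposed rules."""
    if total_hosts <= 0:
        # Nothing to evaluate; do not block fencing.
        return False

    # Condition 1: more than X% of hosts are non-operational.
    # Integer-style comparison avoids floating-point division.
    if max_failed_percent is not None:
        if failed_hosts * 100 > max_failed_percent * total_hosts:
            return True

    # Condition 2: more than Y hosts are non-operational.
    if max_failed_count is not None and failed_hosts > max_failed_count:
        return True

    return False
```

Both comparisons are strict ("more than"), so exactly 50% of hosts failing with X = 50 would still allow fencing in this sketch.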

UX
--
There will be an option in the Fencing Policy sub menu (Defined by BZ 1118879) to set the following options:

[Boolean]
"Do not fence if the following conditions are met" - Turns entire check on or off.  One or both % or # values must be specified if this option is enabled.

"Time of Delay" - Amount of time to wait for multiple hosts to enter non-responsive state
DEFAULT: 60 - Seconds

"Greater than X% of hosts fail"
DEFAULT: 50%

"More than Y# of hosts fail"
DEFAULT: 0/Disabled
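
For illustration, the proposed options and their defaults could be modelled as a small settings object. This is a sketch under stated assumptions, not the actual Fencing Policy data model: the class and field names are hypothetical, the default of the boolean switch is not given in this description (off is assumed here), and treating a percentage of 0 as "not specified" is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkipFencingOptions:
    """Hypothetical container for the options proposed in this RFE."""

    enabled: bool = False     # assumed default; not stated in the proposal
    delay_seconds: int = 60   # "Time of Delay", default 60 seconds
    failed_percent: int = 50  # "Greater than X% of hosts fail", default 50%
    failed_count: int = 0     # "More than Y# of hosts fail", 0 means disabled

    def validate(self) -> None:
        """Raise ValueError if the options are inconsistent."""
        if not self.enabled:
            return  # thresholds are ignored while the check is off
        if not 0 <= self.failed_percent <= 100:
            raise ValueError("failed_percent must be between 0 and 100")
        if self.failed_count < 0 or self.delay_seconds < 0:
            raise ValueError("failed_count and delay_seconds must not be negative")
        # One or both of the % and # values must be specified when enabled.
        if self.failed_percent == 0 and self.failed_count == 0:
            raise ValueError("specify a percentage, a host count, or both")
```

With these defaults, `SkipFencingOptions(enabled=True)` passes validation because the 50% threshold counts as specified.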

Comment 1 Barak 2014-07-28 11:09:06 UTC
(In reply to Scott Herold from comment #0)
> Infra/Engine component to improve fencing logic
> 
> When Triggered
> --------------
> This action is triggered once RHEV-M had made the decision that a target
> host may need to be fenced.  Prior to issuing a fencing command, a user may
> specify an optional time value to wait to investigate a percentage or number
> of total hosts that have moved to a non-operational state.

The additional timeout is redundant; there are already too many timeout parameters in the fencing flow.
This decision should be made after the SSH VDSM restart that happens near the beginning of the fencing flow.

> 
> In this workflow, a calculation will be done over a short period of time
> that will determine if one or two optional conditions are met before sending
> fencing commands to proxy hosts:
> 
> 1) If more than X% of hosts are non-operational, do not fence any host
> 2) If more than Y# of hosts are non-operational, do not fence any host

I do not understand #2. Why define it? What use case will it cover?

> 
> This will help prevent fence storms from leaving the engine to be
> potentially picked up in retries and introducing a potential race condition
> on fence requests.
> 
> UX
> --
> There will be an option in the Fencing Policy sub menu (Defined by BZ
> 1118879) to set the following options:
> 
> [Boolean]
> "Do not fence if the following conditions are met" - Turns entire check on
> or off.  One or both % or # values must be specified if this option is
> enabled.

The above is very confusing... one or both?
I think the # is redundant.

> 
> "Time of Delay" - Amount of time to wait for multiple hosts to enter
> non-responsive state
> DEFAULT: 60 - Seconds

The above is redundant.

> 
> "Greater than X% of hosts fail"
> DEFAULT: 50%
> 
> "More than Y# of hosts fail"
> DEFAULT: 0/Disabled

Comment 2 Scott Herold 2014-07-30 15:28:16 UTC
OK to limit the scope to the % of hosts that have failed and to remove the number of hosts from the rule.  Also OK to remove the timeout.

Barak/Oved, with the simpler scope, realistically, is this a target for 3.5 or 3.6?

Comment 3 Oved Ourfali 2014-07-31 07:40:05 UTC
(In reply to Scott Herold from comment #2)
> OK to limit scope to % of hosts that have failed and removing number of
> hosts from the rule.  Also OK to remove the timeout.  
> 
> Barak/Oved, with the simpler scope, realistically, is this a target for 3.5
> or 3.6?

Targeting for 3.5.0.
Still waiting for an answer on what the deadline is for exception RFEs to be MODIFIED... the answer might change the target release, but we're targeting 3.5.0 for now.

Comment 4 Eli Mesika 2014-08-05 12:01:14 UTC
F

Comment 5 Eli Mesika 2014-08-05 12:06:08 UTC
I have changed the BZ title to refer to the Non-Responding state, which is network-based, instead of the Non-Operational state, which is storage-based.

Also, I have agreed with Scott that for the percentage calculation we will seraph for all hosts in the Non-Responding or Connecting state, since on a network issue a host first goes to the Connecting state for a while before it is marked as Non-Responding.

Comment 6 Eli Mesika 2014-08-05 12:08:16 UTC
(In reply to Eli Mesika from comment #5)
> will seraph
seraph=>search

Comment 7 Scott Herold 2014-08-05 12:10:00 UTC
Eli - Agree on the "Connecting" state.  It will be the first indication of a potential networking problem if half of the hosts go into a non-responsive state.  While we don't take action on "Connecting", there may be overlap where we see a mix of non-responsive and connecting hosts, and we want to include these in the total counts.
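
As a rough illustration of the narrowed scope agreed in the comments above (percentage only, counting hosts in either the Connecting or the Non-Responding state), the check might look like the sketch below. It is written in Python for brevity and is not the ovirt-engine implementation; the `HostStatus` enum, its members, and the function name are hypothetical.

```python
from enum import Enum
from typing import Iterable


class HostStatus(Enum):
    """Hypothetical subset of host states, for illustration only."""

    UP = "Up"
    CONNECTING = "Connecting"
    NON_RESPONDING = "Non-Responding"
    NON_OPERATIONAL = "Non-Operational"


# States counted as connectivity issues for the percentage check.
CONNECTIVITY_ISSUE_STATES = frozenset(
    {HostStatus.CONNECTING, HostStatus.NON_RESPONDING}
)


def skip_fencing_for_cluster(
    statuses: Iterable[HostStatus], threshold_percent: int = 50
) -> bool:
    """Return True if more than threshold_percent of hosts have connectivity issues."""
    hosts = list(statuses)
    if not hosts:
        return False
    affected = sum(1 for status in hosts if status in CONNECTIVITY_ISSUE_STATES)
    # Strict comparison: "more than X%" of the hosts.
    return affected * 100 > threshold_percent * len(hosts)
```

Non-Operational hosts are deliberately left out of the count here, since that state was described in comment 5 as storage-based rather than network-based.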

Comment 9 errata-xmlrpc 2015-02-11 18:06:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html

