Bug 1364553 - failed 'reboot' action with paired switched PDUs leaves no power to the node
Summary: failed 'reboot' action with paired switched PDUs leaves no power to the node
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: Marek Grac
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-05 17:01 UTC by digimer
Modified: 2016-09-06 08:34 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-06 08:34:38 UTC



Description digimer 2016-08-05 17:01:35 UTC
Description of problem:

A client of ours was doing some failure testing and hit upon a strange issue. I will admit right up front that this is a real corner case, but it does leave things in a bad state, so I wanted to file it. I suspect the fix might be simple. If it is not simple, then it might be worth ignoring (though I will check if the same thing happens with pacemaker on RHEL 7 later).

Condition:

If a node loses one PDU, and if the IPMI fencing fails, the PDU-based fence method will be attempted and will also fail, as expected. The problem is that the outlet feeding the node on the good PDU is left in the 'Off' state. I *assume* that this is because fenced did 'off -> off -> verify (ok) -> verify (failed)' and exited. 

This leaves the node unbootable until the user realizes that the PDU outlet is off and re-enables it. However, if the fence is still pending, the outlet gets turned off again quickly (the IPMI BMC doesn't have time to boot, so IPMI fencing keeps failing).
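For reference, recovering by hand looks roughly like the following, assuming an APC-style switched PDU driven by fence_apc; the address, credentials and plug number below are placeholders, not values from this report:

  # Check the state of the outlet feeding the node on the surviving PDU
  fence_apc -a pdu-good.example.com -l apc -p secret -n 3 -o status
  # If it reports OFF, turn it back on so the node can be powered up again
  fence_apc -a pdu-good.example.com -l apc -p secret -n 3 -o on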



Version-Release number of selected component (if applicable):

cman-3.0.12.1-78.el6.x86_64



How reproducible:

100%


Steps to Reproduce:
1. Unplug a PDU.
2. Temporarily cut power to a node so that fencing initiates and IPMI fencing fails.
3. Plug the node's PSUs back in.
4. Observe that the PDU outlet feeding the node is off.


Actual results:

Node is left entirely without power.


Expected results:

An 'on' call is made to the method's devices that successfully verified 'off', even though the method itself failed from a fencing perspective.


Additional info:

I am not sure if the same behaviour would be observed in RHEL 7 with pure pacemaker, but it might be worth checking.

Comment 3 Marek Grac 2016-08-12 08:34:36 UTC
Hi,

I'm not sure if I understand it correctly, so the situation is:

* node is fenced by IPMI and if IPMI fails then PDU fencing is used

* scenario:
  1) node is ON, IPMI responds ON, fence-PDU plug is ON
  2) unplug node from PDU ==> node is ON, IPMI is working, fence-PDU is working but its state is independent of the node
  3) cut power to node ==> node is OFF, IPMI fails (connection timeout), fence-PDU reboots the plug
  4) plug PDU back in ==> node is OFF, IPMI is failing, fence-PDU is able to connect

In a pacemaker cluster, we will notice that the fence device is failing (via the monitor action), and it requires manual intervention to put it back into an online state.

IMHO there is a false assumption about how we check the ON state of a device. It is tested purely on the fence device, so if the device can turn an empty plug ON/OFF, we will not notice any problem.
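For example (placeholder address, credentials and plug number), a status check answers from the PDU alone, whether or not the node is actually drawing power from that plug:

  # Reports the plug state as recorded by the PDU; an unplugged or empty
  # plug gives the same answer as one feeding a live node
  fence_apc -a pdu-a.example.com -l apc -p secret -n 3 -o status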

Comment 4 digimer 2016-08-12 08:47:39 UTC
The fundamental issue can be boiled down to a simpler example, I think.

Ignore IPMI fencing; let's assume the node has two PDUs for fencing (one for each PSU).

At some point, a PDU is lost (destroyed, removed, whatever). At this point, the node is still operational, but the ability to fence it has been lost. This requires human intervention to resolve, no debate there at all.

We have a scenario, though, where the system may be entirely off the internet and physically inaccessible, so realising that the problem exists and/or repairing it is delayed. All of this is not a concern for developers, of course, but it sets the stage for how the issue can occur in the real world.

So, later, the degraded node fails and a fence is issued. Of course, the fence will fail and the cluster will block, as it should. This is also not a concern.

The issue is that, when the PDU fence occurs, it asks both PDUs to turn off the given outlets. The good PDU does this, but the other PDU is dead. This is where the issue arises. The good PDU is left in the 'off' position, so if the admin tries to manually (re)boot the node, s/he can't because there is no power from the good PDU.

My proposal/request is that, if one device in a given method fails and the fence action was 'reboot', then re-enabling the port before moving on to the next fence method (if any) would help, specifically because 'fence_ack_manual' (or the pcmk equivalent) would not call an 'on'.
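As a rough sketch of the requested behaviour (this is not what fenced does today, only an illustration; the addresses, credentials and plug numbers are placeholders):

  # Turn off the plug on each PDU in the method, remembering which calls worked
  ok_pdus=""
  failed=no
  for pdu in pdu-a.example.com pdu-b.example.com; do
      if fence_apc -a "$pdu" -l apc -p secret -n 3 -o off; then
          ok_pdus="$ok_pdus $pdu"
      else
          failed=yes
      fi
  done
  # If any device failed, the method has failed; before moving on to the next
  # fence method, re-enable the outlets that did verify 'off' so the node is
  # not left entirely without power
  if [ "$failed" = yes ]; then
      for pdu in $ok_pdus; do
          fence_apc -a "$pdu" -l apc -p secret -n 3 -o on
      done
  fi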

This is fundamentally not a cluster software problem, but more of a feature request, I suppose. It would save the admin some head-scratching. :)

Comment 5 Marek Grac 2016-08-12 09:00:54 UTC
Great explanation, I got it.

* 2x PSU/PDU (A,B) for one node
* remove A -> connection timeout -> every action will fail
* attempt to fence node -> planned action A-off,A-verify,B-off,B-verify,A-on,A-verify,B-on,B-verify
* A-off will fail; B-off will turn PSU off

and in such a case you want to turn B back on.
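Concretely, with fence_apc and placeholder values, today's sequence stops once A fails and the 'on' steps are never reached:

  fence_apc -a pdu-a.example.com -l apc -p secret -n 3 -o off   # A-off: times out, the method fails here
  fence_apc -a pdu-b.example.com -l apc -p secret -n 3 -o off   # B-off: succeeds, the only live feed is now off
  # A-on / B-on and their verifies never run, so B stays off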

From a fencing perspective, removing one PSU is not enough to consider the machine fenced, so it does not matter if B is on or off. If this were a configurable option, it should be fine.

When you have one PDU with two different plugs for that node, this situation can be handled completely by the fence agent; in such a case, the first failure will end the fence agent. So such a fallback action should probably be done by something at a higher level (fenced / stonithd).

Comment 6 digimer 2016-08-12 09:06:41 UTC
> From a fencing perspective, removing one PSU is not enough to consider the machine fenced, so it does not matter if B is on or off.

From a fencing perspective, I 100% agree. It's a failed fence attempt. The request is only to send an 'on' command to the device(s) that worked before moving on to the next fence method.

This is simply to leave the target node in a more recoverable state after, say, the user confirms the node's death via fence_ack_manual. As it stands now, the admin will have to first realize the PDU outlet is off and second, turn it back on (which the admin might not know how to do).

Comment 7 Marek Grac 2016-09-05 08:55:09 UTC
@Andrew: What do you think?

Comment 8 Marek Grac 2016-09-05 08:55:21 UTC
@Andrew: What do you think?

Comment 9 Andrew Beekhof 2016-09-06 02:33:13 UTC
Agreed that it would have to happen at a higher level; the agent has nowhere near enough context.

A Pacemaker-based system would at least have noticed and notified the admin that one of the PDUs was dead, so hopefully there is time to rectify that problem before a node dies anyway.
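For comparison, an equivalent RHEL 7 / pacemaker setup might look roughly like this (device names, addresses and credentials are made up); the recurring monitor on each stonith resource is what would flag the dead PDU:

  pcs stonith create ipmi-node1 fence_ipmilan ipaddr=bmc-node1.example.com login=admin passwd=secret pcmk_host_list=node1
  pcs stonith create pdu-a fence_apc ipaddr=pdu-a.example.com login=apc passwd=secret port=3 pcmk_host_list=node1
  pcs stonith create pdu-b fence_apc ipaddr=pdu-b.example.com login=apc passwd=secret port=3 pcmk_host_list=node1
  # Try IPMI first; fall back to cutting power via both PDUs together
  pcs stonith level add 1 node1 ipmi-node1
  pcs stonith level add 2 node1 pdu-a,pdu-b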

There is also the question as to whether a node that was fenced should be allowed to come back with only one PDU.

On balance, I would not be in favor of adding another error path to cover this scenario. Fencing is primarily optimized around taking nodes down (or cutting them off) for safety/consistency; having them come back up again has to take a lower priority. I worry more about scenarios that could lead to a node being on when it should be off than the reverse :-)

In an appliance model, could you not add "turn all the PDUs on" to your "start node X" command?
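Something along those lines could be as simple as the following wrapper (a hypothetical script; the addresses, credentials and plug numbers are placeholders):

  # 'start node X': restore power from both PDUs, then power the node on via its BMC
  fence_apc -a pdu-a.example.com -l apc -p secret -n 3 -o on
  fence_apc -a pdu-b.example.com -l apc -p secret -n 3 -o on
  ipmitool -I lanplus -H bmc-node1.example.com -U admin -P secret chassis power on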

Comment 10 digimer 2016-09-06 02:52:00 UTC
I'm inclined to agree with Andrew on this. In HA, nothing should trump simplicity, particularly in the core. For this reason, I think we should close this as either WONTFIX or NOTABUG. If I want to, I can have our system handle this.

Comment 11 Marek Grac 2016-09-06 08:34:38 UTC
ok, closing then.

