Bug 171407 - multipath-tools: Ability to NOT switch to the group with the most paths
Summary: multipath-tools: Ability to NOT switch to the group with the most paths
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: device-mapper-multipath
Version: 4.0
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: Ben Marzinski
QA Contact:
Depends On: 155738
Reported: 2005-10-21 16:01 UTC by Jonathan Earl Brassow
Modified: 2010-05-12 15:30 UTC

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Last Closed: 2010-05-12 15:30:57 UTC
Target Upstream Version:


Description Jonathan Earl Brassow 2005-10-21 16:01:19 UTC
+++ This bug was initially created as a clone of Bug #155738 +++

Some scenarios might wish to minimize the switch of priority groups as far as
possible and not have multipath-tools switch priority groups automatically; this
behaviour must be configurable.

Also, switching back the priority group to the preferred one if paths in it
become available again should be configurable.

In both scenarios, it might be desirable to not only have the option to fully
disable it, but also to have a timer before actually switching, and allow the
situation to stabilize first.

-- Additional comment from on 2005-04-22 18:45 EST --
What we have today as a switch pg policy is what could be described as
"opportunistic", i.e. if another path group is seen as better, switch to it
even if the current one is in a working state.

What you suggest to add would be a "last_resort" policy.

I would need to add two new struct multipath fields:

- int switchover_policy (opportunistic or last_resort)
- int switchback_policy (true or false)

These properties would be definable in the devices{} and multipaths{} config blobs.

That is definitely doable. Would it fit your needs?

-- Additional comment from on 2005-04-25 08:27 EST --

However I'd propose the following scheme:

pg_select_policy (max_paths|highest_prio)

max_paths would select the PG with the most paths (like right now); highest_prio
would switch to the one with the highest priority and available paths.

pg_switching_policy (<timer>|controlled)

<timer> would switch automatically when the pg_select_policy suggests a new PG
(in response to some event) after <timer> seconds; "0" would switch immediately.
This is to give the system some time to stabilize - ie, most of the time you'd
not want to switch _immediately_, but give it 2-3 path checker iterations to
make sure the path is there to stay.

"controlled" would only switch to the selected PG (or in fact to any other PG)
when the admin tells us to (or an error forces us). (Which relates to bug
#155546 ;-)
(automated and controlled failback/switch-over are well established terms in the
HA clustering world for resource migration behaviour, so it makes sense to reuse
the concepts here.)

Does that make sense?
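
The scheme proposed above can be sketched roughly as follows. This is only an
illustrative model, not multipath-tools code: the PathGroup class, select_pg(),
should_switch() and the stable_for argument are hypothetical names, and the
rule that a path group with no usable paths forces a switch is an assumption
drawn from "or an error forces us".

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class PathGroup:
    name: str
    priority: int
    active_paths: int


def select_pg(groups: List[PathGroup], policy: str) -> Optional[PathGroup]:
    """Return the PG that pg_select_policy would suggest, or None."""
    usable = [g for g in groups if g.active_paths > 0]
    if not usable:
        return None
    if policy == "max_paths":
        # Current behaviour: prefer the group with the most paths.
        return max(usable, key=lambda g: g.active_paths)
    if policy == "highest_prio":
        # Proposed: prefer the highest priority group that has paths.
        return max(usable, key=lambda g: g.priority)
    raise ValueError(f"unknown pg_select_policy: {policy}")


def should_switch(current: PathGroup,
                  suggested: Optional[PathGroup],
                  switching_policy: Union[int, str],
                  stable_for: float) -> bool:
    """Decide whether to switch now.

    switching_policy is "controlled" or a timer in seconds; stable_for is
    how long (seconds) the suggestion has been unchanged.
    """
    if suggested is None or suggested is current:
        return False
    if current.active_paths == 0:
        # Assumption: an error on the current PG forces the switch.
        return True
    if switching_policy == "controlled":
        # Only the admin may trigger the switch.
        return False
    return stable_for >= switching_policy
```

With a timer of 0 the switch happens as soon as a different PG is suggested;
with a timer covering 2-3 path checker iterations, a flapping path would not
trigger a switch.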

Comment 1 Tore Anderson 2006-03-23 21:40:13 UTC
That dm-multipath gratuitously switches paths is, in my opinion, a rather
important bug.  I have what I imagine to be a fairly common setup with dual
independent fabrics, looking like this:

  HOST 1                 HOST 2
   |  |                   |  |
   |  | +------C----------+  |
   A  | |                    D
   |  +--------B--------+    |
   |    |               |    |
  SWITCH 1             SWITCH 2  
     |                     |
     +- SP A (ARRAY) SP B -+

Hosts are SunFire X4100es, running RHEL4 U3, with a dual-port QLogic HBA
identifying itself as a QLA2422 (but knowing QLogic, that's probably a lie),
switches are McData Sphereon 4x00es, and the array is an EMC² CX200, which is
an active/passive unit - only one SP (controller) accepts I/O at one time.
Oh, and:


(Both the most recent at the time of writing.)

multipath.conf contains:

        device {
                vendor                  "DGC     "
                product                 "*"
                hardware_handler        "1 emc"
                features                "1 queue_if_no_path"
                path_checker            emc_clariion
                path_grouping_policy    failover
                failback                manual
        }

I have been unable to make it NOT switch paths at startup.  It insists on
changing to SP A (possibly because that one has the lesser minor number?
SP A is sdb/8:16 and SP B sdc/8:32).  Even when SP A was already the
active controller, it still logs that it sends the "trespass" command to
move the LU (I assume it's to SP A, because the LU doesn't actually move
anywhere).  Output from multipath -ll, in case it's interesting:

fjas (36006017cd00e0000414744ed98e1d911)
[size=1 GB][features="1 queue_if_no_path"][hwhandler="1 emc"]
\_ round-robin 0 [prio=1][active]
 \_ 21:0:0:0 sdb 8:16 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 22:0:0:0 sdc 8:32 [active][ready]

Switching paths at startup isn't really problematic in a one-host scenario,
and I could have used prio_callout "/sbin/mpath_prio_emc /dev/%n" in order to
make it prefer the PG corresponding to the SP marked as "default" in the
array (curiously enough even when testing using this setup dm-multipath ends
up ALWAYS moving the LU to SP A, just like without any prio_callout, but moves
it back again to SP B at first I/O if that is the SP marked as default in the
array configuration).

However, this behaviour is quite problematic for clusters with shared storage,
like mine.  Imagine that path C in my drawing fails.  dm-multipath on host 2
will correctly fail the path and move the LU to SP B if necessary, and host 1
will follow.  However, at this point I cannot reboot host 1 without adversely
affecting host 2, because host 1 will move the LU to SP A at init - in effect
yanking the only healthy path host 2 has left.  Bad, bad, bad.  If you're
lucky enough that queue_if_no_path is the desired behaviour, what you get is
a few seconds (that feel like an eternity) where everything is blocked,
before multipathd realizes that the path is alive after all, and reinstates
it.  (I have not even dared to find out how Cluster Manager will handle this,
if its shared state or tie breaker is affected by this.)  If queue_if_no_path
isn't for you, well, you're screwed.  So much for fault tolerance.

Please someone tell me that I'm wrong, that this is UBD - me being unable to
configure it correctly.  There must be a way to avoid this behaviour...right?

Tore Anderson

Comment 2 Ed Goggin 2006-03-23 23:06:47 UTC
You should only use the group_by_prio path grouping policy with a
CLARiiON; otherwise you run the risk of (1) incurring unnecessary path
reassignment operations and (2) missing out on potential path load sharing
opportunities (if you had multiple active paths).  This is your initial problem.

You are getting the behavior you are seeing with the failover pg policy because 
the path group selection code chooses the first path group it finds (which is 
also the first one created) amongst path groups which have the same priority.  
This path group may or may not be associated with the current service processor 
owning the logical unit (I assume the current SP is already associated with the 
default owning SP).  So when you actually try to do IO, the odds are 50/50 that 
you will need to switch to the other path group.  These are not good odds.
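
A rough model of the tie-break described above, assuming nothing about the
real implementation beyond "first path group found wins among equal
priorities". The function names and the dictionary layout are hypothetical.

```python
from typing import Dict, List


def pick_initial_group(groups: List[Dict]) -> Dict:
    """Pick the highest priority group; ties keep the first one created."""
    best = groups[0]
    for g in groups[1:]:
        if g["prio"] > best["prio"]:  # strict comparison: earlier group wins ties
            best = g
    return best


def needs_trespass(chosen: Dict, owning_sp: str) -> bool:
    """True if the chosen group is not on the SP that currently owns the LU."""
    return chosen["sp"] != owning_sp


# Two single-path groups with equal priority, created in device order.
groups = [
    {"sp": "A", "dev": "sdb", "prio": 1},
    {"sp": "B", "dev": "sdc", "prio": 1},
]
chosen = pick_initial_group(groups)
```

In this model the group on SP A is always chosen, so a trespass is needed
whenever SP B owns the logical unit.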

As far as the larger problem is concerned, no solution I've seen yet addresses 
the need to adapt the failback policy based on the current state.  I see the 
merit in a new adaptive pg_failback behavior where an automated path group 
failback operation would only be performed by the multipath-tools software when 
the most recent path group failover (pg_init of a new path group) operation was 
also initiated by the multipath-tools software.  Conversely, the automated path 
group failback operation would not be initiated by multipath-tools if the path 
group failover operation was initiated by another host or software on that host 
which is external to the multipath-tools software.

None of the current path group failback policies nor the ones described so far 
on this bugzilla provide this adaptive behavior -- they are all static 
policies.  For active/passive storage arrays like the CLARiiON, there is real 
benefit (a static load balancing/sharing of active paths to logical units 
across the two service processors) in always seeking the opportunistic policy 
UNLESS the path group was changed manually or via another host in the first 
place.  It is best if our code can detect and adapt to this situation and not 
require that the failback policy be changed beforehand in these situations.
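
As a sketch only, the adaptive rule could be expressed like this. How
multipath-tools would learn who initiated the last failover is not specified
here, so the sketch simply takes it as an input; the Initiator enum and
should_auto_failback() are hypothetical names.

```python
from enum import Enum


class Initiator(Enum):
    MULTIPATH_TOOLS = "multipath-tools"
    EXTERNAL = "external"  # another host, or other software on this host


def should_auto_failback(on_preferred_pg: bool,
                         preferred_pg_usable: bool,
                         last_failover_by: Initiator) -> bool:
    """Fail back automatically only if we caused the last failover ourselves."""
    if on_preferred_pg or not preferred_pg_usable:
        return False
    return last_failover_by is Initiator.MULTIPATH_TOOLS
```

Under this rule a host that merely followed another host's failover would
leave the logical unit where it is.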

I've been trying to develop a solution which will prevent unnecessary trespass 
operations in multi-host configurations.  At least for a CLARiiON, these 
configurations are not restricted to clusters but also involve (1) remote 
replication use cases and (2) cases where a storage system's default owning
service processor is changed by storage management software (e.g., EMC
Navisphere or Control Center for the CLARiiON).

I've submitted code upstream to Christophe in the form of a new failback
policy which attempted to approximate this behavior without requiring kernel
changes.  The new failback policy was similar to the current opportunistic one
but would only consider initiating failback when either (1) the priority of a
path changed or (2) a path group with all failed paths had a path transition 
from failed to active.  This policy would only fail to handle the use case you 
have described if path B for host 1 also failed and got reactivated after the 
failure of path C on host 2.  Since this use case involves a double failure, it 
may be reasonable to live with.
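
The two triggers of that policy can be modelled as below. This is a hedged
sketch, not the submitted patch: the Event class, its field names and the
event kinds are hypothetical, and treating every other event (such as paths
being discovered at boot) as a non-trigger is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str  # "prio_changed", "path_reinstated", or anything else
    group_all_failed_before: bool = False  # were all paths in the PG failed?


def may_trigger_failback(ev: Event) -> bool:
    """Return True if this event is allowed to start a failback evaluation."""
    if ev.kind == "prio_changed":
        return True  # trigger (1): a path's priority changed
    if ev.kind == "path_reinstated":
        # trigger (2): only when the whole group had been failed
        return ev.group_all_failed_before
    return False  # assumption: no other event triggers failback
```

The double-failure case mentioned above corresponds to a "path_reinstated"
event on a group whose paths had all failed, which this model still treats
as a trigger.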

I think Christophe is rejecting this code submission because he thinks the
problem is specific to CLARiiON and should be solved in CLARiiON specific code 
(e.g., path priority handler).  I do not agree.

Comment 3 Tore Anderson 2006-03-24 15:43:33 UTC

Thanks for your feedback.  I tried:

        device {
                vendor                  "DGC     "
                product                 "*"
                hardware_handler        "1 emc"
                features                "1 queue_if_no_path"
                path_checker            emc_clariion
                path_grouping_policy    group_by_prio
                failback                manual
        }

As expected, this resulted in one PG with both paths, with the path to the
passive controller flapping (ie. multipathd reinstates it because it's
healthy, but I/O to it fails, so dm-multipath fails it.  Repeat ad
infinitum).  So I added «prio_callout "/sbin/mpath_prio_emc /dev/%n"», which
I expect needs to be there anyway.  multipath -ll from when SP A is
marked as being the default:

fjas (36006017cd00e0000414744ed98e1d911)
[size=1 GB][features="1 queue_if_no_path"][hwhandler="1 emc"]
\_ round-robin 0 [prio=1][enabled]
 \_ 25:0:0:0 sdb 8:16 [active][ready]
\_ round-robin 0 [enabled]
 \_ 26:0:0:0 sdc 8:32 [active][ready]

I've tested various situations to see what happened:

Initial SP      Default SP      Trespassed at init?     Trespassed at first I/O?
A               A               Yes (no effect)         Yes (no effect)
B               A               Yes (moved to SP A)     Yes (no effect)
A               B               Yes (no effect)         Yes (moved to SP B)
B               B               Yes (moved to SP A)     Yes (moved to SP B)

So no matter what I do the LU ends up getting moved to SP A, but moved back
again to SP B if that's the default when I send some I/O there.  Hardly
desirable behaviour.  Just for the heck of it, I tried filling out the table
using the failover grouping policy instead, and the results are exactly the
same (should I have expected any difference?).

I think the reason why it always winds up at SP A at first is how the RH
scripts integrate with udev.  In /etc/dev.d/block/ I see:

if /sbin/lsmod | /bin/grep "^dm_multipath" > /dev/null 2>&1 && \
   [ ${mpath_ctl_fail} -eq 0 ] && \
   [ "${ACTION}" = "add" -a "${DEVPATH:7:3}" != "dm-" ] ; then
        /sbin/multipath -v0 ${DEVNAME} > /dev/null 2>&1
fi

I'm assuming that this /sbin/multipath invocation always happens for /dev/sdb
before it does for /dev/sdc.  That probably means that the path to SP A is
the only path to the LU for a brief period, and that is perhaps enough to
(bogusly) trigger a trespass - even though no I/O has been sent to the
resulting DM device yet.  Do you agree?  If so I guess that this needs to be
handled either by making the multipath utility not cause any trespasses on its
own (only a failing path should), or that RH need to rework their scripts so
that the multipath utility is not called until the HBA driver is finished
registering all the block devices so that both paths are made available to
dm-multipath simultaneously.

Tore Anderson

Comment 6 Ben Marzinski 2010-05-12 15:30:57 UTC
RHEL6 has a new failback option, "followover", that allows you to do this.  There are no plans to backport this. If this is a problem, please reopen this bug with a comment.
