Bug 454093 - Install successfully but fail to failover
Summary: Install successfully but fail to failover
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: device-mapper-multipath
Version: 5.0
Hardware: i386
OS: Linux
Priority: low
Severity: urgent
Target Milestone: rc
: ---
Assignee: Ben Marzinski
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-07-04 15:43 UTC by I-Chung Ho
Modified: 2010-05-12 17:24 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-05-12 17:24:40 UTC
Target Upstream Version:


Attachments
System log when we perform hot plug miniSAS cable (deleted)
2008-07-09 07:20 UTC, I-Chung Ho
system log (deleted)
2008-08-19 11:15 UTC, I-Chung Ho
System log, one disk's path missing (deleted)
2008-09-05 16:41 UTC, I-Chung Ho

Description I-Chung Ho 2008-07-04 15:43:41 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.0.04506; Tablet PC 2.0)

Description of problem:
We have a Sun Storage J4200 and an LSI 1068E HBA card on RHEL5.
We are trying to use device-mapper to handle multipathing.
Installing and enabling multipath works without problems.
Creating a software RAID (mdadm) on the merged devices (/dev/dm*) and running IO on the RAID also works.

The problem occurs when we try to fail over.
We hot-plug the miniSAS cables in and out on the host.
Before unplugging a miniSAS cable we check that all paths are ready.
But after several hot-plug cycles, the IO fails when we unplug a miniSAS cable, even though the other path remains ready and active and its link LED indicates normal.

We have tried different multipath settings, such as path_grouping_policy=failover and different path_checker values.
The symptom remains the same.

Version-Release number of selected component (if applicable):
device-mapper-multipath-0.4.7-8.el5.i386

How reproducible:
Always


Steps to Reproduce:
1. Connect one host to the Sun Storage J4200 with 2 paths.
2. Enable multipath.
3. Create a RAID (any level: 1, 5, 6) with mdadm on the merged devices.
4. Start IO.
5. Unplug one miniSAS cable on the host.
6. Make sure that half of the paths are gone.
7. Plug the miniSAS cable back in.
8. Make sure that all the paths are back.
9. Unplug the other miniSAS cable on the host.
10. Make sure that half of the paths are gone.
11. Plug the miniSAS cable back in.
12. Make sure that all the paths are back.
13. Repeat steps 5 to 12.
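
The "make sure the paths are gone / back" checks in the steps above could be scripted roughly as follows. This is only a sketch, not part of the original report: the helper names count_ready_paths and paths_ok are hypothetical, and it assumes that `multipath -ll` prints one line per path and marks a usable path with the literal text "[active][ready]", as the RHEL 5 version of the tool does.

```shell
#!/bin/sh
# Count the paths that are reported as active and ready.
# Reads `multipath -ll` style output on stdin, so it also works on saved output.
# Assumption: every usable path line contains the literal "[active][ready]".
count_ready_paths() {
    awk 'index($0, "[active][ready]") { n++ } END { print n + 0 }'
}

# Hypothetical helper: succeed only if the number of ready paths on stdin
# equals the expected number given as $1.
paths_ok() {
    expected="$1"
    actual=$(count_ready_paths)
    [ "$actual" -eq "$expected" ]
}
```

Between cable pulls this might be used as `multipath -ll | paths_ok "$EXPECTED" || echo "not all paths are back"`, where EXPECTED is the total path count for the setup under test (a value the report does not give).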

Actual Results:
After 2~3 cycles of hot plugging the miniSAS cable, the IO hangs.

Expected Results:
When we unplug one miniSAS cable, the other cable should keep the IO going. IO should keep running no matter how many hot-plug cycles of the miniSAS cable we have performed.

Additional info:
HBA driver: 4.00.21.00-1

multipath.conf:
device {
	vendor				"SEAGATE"
	product				"ST314655SSUN146G"
	path_grouping_policy	        multibus 
	getuid_callout 			"/sbin/scsi_id -g -u -s /block/%n"
	prio_callout			"none"
	path_checker			tur
	path_selector			"round-robin 0"
	failback			immediate
	rr_weight			uniform
	hardware_handler		"0"
	no_path_retry			fail
	user_friendly_name		no
}

Comment 1 Ben Marzinski 2008-07-07 19:57:01 UTC
First off, I trust you have multipathd running.  If it's not, please run

# service multipathd start
# chkconfig multipathd on

and see if that fixes your problem.

Could you please send me the output of /var/log/messages while you are pulling
the cables?

You could also try setting no_path_retry to something other than "fail". For
instance, if you have

no_path_retry    10

multipath will not fail the IO until it checks for active paths 10 times in a
row, and doesn't find any (With the default polling_interval of 5 seconds, this
should allow multipath to deal with all paths being down for up to 50 seconds).
 However, if this fixes the problem, please let me know.  If you are making sure
that the paths lost during the last cable pull have come back up before pulling
the next cable, like you mentioned in your steps to reproduce, then I don't see
why this should be necessary.
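
As a rough illustration of the arithmetic in this comment (a sketch only; the function name queue_window_seconds is made up for this example), the time multipath keeps retrying before failing the IO is approximately no_path_retry multiplied by polling_interval:

```shell
#!/bin/sh
# Estimate how long multipath keeps retrying once every path is down.
# $1 = no_path_retry (number of checks), $2 = polling_interval in seconds.
queue_window_seconds() {
    echo $(( $1 * $2 ))
}

# Values from the comment: no_path_retry 10, default polling_interval 5.
queue_window_seconds 10 5   # prints 50
```

This treats the window as a simple product, which is an approximation; the actual timing depends on when the checker runs relative to the failure.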

Comment 2 I-Chung Ho 2008-07-09 07:20:30 UTC
Created attachment 311349 [details]
System log when we perform hot plug miniSAS cable

Comment 3 Ben Marzinski 2008-08-13 16:29:14 UTC
Looking at your system log, there is no multipathd output at all. Are you sure that multipathd is running?

Can you please send me the output of
# multipath -v3 -ll

from before the start of your test.  Then run
# service multipathd stop
# multipathd -v3

Then start your test, and collect the output to /var/log/messages for the entire length of your test.
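
A quick way to confirm that the collected log actually contains multipathd output before attaching it might look like the sketch below. It is not from the original comment: the function name count_multipathd_lines is hypothetical, and it assumes the daemon's syslog lines are tagged "multipathd:" or "multipathd[PID]:".

```shell
#!/bin/sh
# Count the syslog lines written by multipathd. Reads the log on stdin.
# Assumption: the daemon tags its lines "multipathd:" or "multipathd[PID]:".
count_multipathd_lines() {
    awk '/multipathd(\[[0-9]+\])?:/ { n++ } END { print n + 0 }'
}
```

For example, `count_multipathd_lines < /var/log/messages` printing 0 after a test run would suggest that the daemon was not running or not logging during the test.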

Comment 4 I-Chung Ho 2008-08-19 11:15:31 UTC
Created attachment 314532 [details]
system log

System log when we perform hot plug of the miniSAS cable; one path of dm-53 didn't come back.

Comment 5 I-Chung Ho 2008-08-19 11:18:10 UTC
We have tried the command you suggested:
# service multipathd stop
# multipathd -v3

It passed 5 cycles of hot-plugging the miniSAS cable.
(What is the difference between:
# service multipathd stop
and
# multipathd -v3)

But one path of dm-53 didn't come back after we plugged the miniSAS cable back in.
It's a SAS disk.

We will try on SATA disk.

Comment 6 I-Chung Ho 2008-09-05 16:41:31 UTC
Created attachment 315917 [details]
System log, one disk's path missing

We still have an issue with hot-plugging the SAS cable.
A disk's path won't come back after several hot-plug cycles.

Comment 8 Ben Marzinski 2010-05-05 19:31:54 UTC
Is this still an issue?  There have been multiple changes that may have solved this issue.  If this is still reproducible, please let me know; otherwise I'm going to close this bug.

Comment 9 Ben Marzinski 2010-05-12 17:24:40 UTC
Numerous SAS fixes have happened since 5.0.  This bug should be solved, and there is not enough information for me to debug in this bugzilla.

