|Summary:||[NetApp-S 4.7 bug] DM-MP fails to configure devices due to stale sd entries in the sysfs|
|Product:||Red Hat Enterprise Linux 4||Reporter:||Martin George <marting>|
|Component:||device-mapper-multipath||Assignee:||Ben Marzinski <bmarzins>|
|Status:||CLOSED INSUFFICIENT_DATA||QA Contact:||Corey Marthaler <cmarthal>|
|Version:||4.7||CC:||agk, andriusb, atodorov, bmarzins, christophe.varoqui, coughlan, dwysocha, egoggin, j-nomura, k-ueda, lmb, mbroz, prockai, tranlan, xdl-redhat-bugzilla|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2008-06-24 14:59:27 UTC||Type:||---|
Description Martin George 2007-02-07 11:09:52 UTC
Description of problem:
While configuring dm-mp devices on a RHEL4 U3 host, there have been cases where the dm-mp driver fails to create the appropriate device maps if stale sd entries are present in sysfs, i.e. configuring dm-mp devices fails on the host. As a result, no dm-mp entries show up in the /dev/mapper/ directory or in the "multipath -l/-ll" output. In such cases, the scsi_id command fails for the affected sd entry. For example, suppose sdc is one such device. The "scsi_id -gus /block/sdc" command then gives the following output:

"3:0:0:0: page 0 not available"

A workaround is to blacklist the corresponding sd entry in the multipath.conf file. This allows dm-mp devices to be configured properly on the host.

Version-Release number of selected component (if applicable):
device-mapper-multipath-0.4.5-12.0.RHEL4

How reproducible:
Not always, but regularly.

Steps to Reproduce:
1. Configure dm-mp devices on any host where "scsi_id -gus /block/<sd>" fails on an sd entry in sysfs.

Actual results:
dm-mp fails to configure devices in the above scenario. Correspondingly, no entries are seen in /dev/mapper/ or in the "multipath -l/-ll" output.

Expected results:
dm-mp should configure devices properly in the above scenario.

Additional info:
Comment 1 Ben Marzinski 2007-03-29 23:16:10 UTC
Can you run

# multipath -v6

and

# multipath -ll -v6

and copy the results into this bug? I'm not sure exactly where this is failing. Also, do you know of any way to reliably create a stale sysfs entry?
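One way to collect the requested output so that the order and timing of the two runs is recorded alongside it is sketched below. The helper name `capture` and the log file names are illustrative assumptions, not part of multipath or of this report; only the two multipath commands come from the comment above.

```shell
#!/bin/sh
# capture FILE COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND, saving stdout and stderr to FILE,
# bracketed by UTC timestamps and followed by the command's exit status.
capture() {
    out=$1
    shift
    {
        echo "### start: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
        echo "### command: $*"
        "$@" 2>&1
        echo "### exit status: $?"
        echo "### end: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
    } >"$out"
}

# Intended use on an affected host (as root), run back to back:
#   capture multipath-v6.log    multipath -v6
#   capture multipath-ll-v6.log multipath -ll -v6
```

The timestamps make it possible to tell afterwards whether the two logs were taken one right after the other, and the recorded exit status shows whether either command failed outright.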
Comment 2 Martin George 2007-04-04 14:41:41 UTC
This issue occurs intermittently. Right now, I don't have a host that exhibits this behavior, so I am unable to provide you with the multipath output as requested.

By stale sysfs entry, I meant an sd entry that does not respond to the "scsi_id -gus /block/<sd>" command. I am not sure how this entry came into being in the first place, but the sd entry name kept shifting across reboots.

What's evident here is that dm-mp does not configure any devices if the scsi_id command fails on a sysfs sd entry (if it's not blacklisted). Does this mean that dm-mp always expects scsi_id to pass for all corresponding sd entries?
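Since the affected sd name shifts across reboots, it may help to scan for entries matching the definition above (an sd entry for which the scsi_id command fails). The following is a minimal sketch under that assumption. The function names and the `SYSFS_BLOCK` and `PROBE` variables are hypothetical; the scsi_id invocation is the one quoted in this report, and a failing exit status is assumed to indicate a stale entry.

```shell
#!/bin/sh
# Directory holding the block device entries; overridable for testing.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

# Default probe: the scsi_id invocation quoted in this report.
# Assumption: a non-zero exit status marks the entry as stale.
probe_id() {
    scsi_id -gus "/block/$1" >/dev/null 2>&1
}

# Print the name of every sd entry for which the probe fails.
# PROBE may name a different probe function or command.
list_failing_sd() {
    for path in "$SYSFS_BLOCK"/sd*; do
        [ -e "$path" ] || continue      # glob matched nothing
        dev=${path##*/}                 # strip the directory part
        "${PROBE:-probe_id}" "$dev" || echo "$dev"
    done
}

# Intended use on an affected host (as root):
#   list_failing_sd
```

Any name printed is a candidate for the blacklist workaround described in this bug.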
Comment 3 Ben Marzinski 2007-04-05 18:13:46 UTC
No. Failing the getuid callout (usually scsi_id) will not cause multipath to fail in this way. However, multipath relies on sysfs for multiple pieces of information. Obviously, the stale sd entry is messing with one of these checks, and multipath isn't handling the failure correctly. I was hoping that the multipath -v6 output would point to where the failure was happening. There aren't that many sysfs interactions in multipath. Even without any hints from the debugging output, I should be able to track this down fairly easily. However, if you do see this again, please run those commands and put the output in the bugzilla.
Comment 4 RHEL Product and Program Management 2007-05-09 07:53:43 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Comment 6 Andrius Benokraitis 2007-06-28 13:54:42 UTC
Setting to NEEDINFO on NetApp to report debuginfo if and when it can be reproduced. This is ongoing.
Comment 8 Martin George 2007-06-29 11:23:20 UTC
Created attachment 158195 [details] multipath -ll -v6 output as requested
Comment 9 Martin George 2007-06-29 11:27:09 UTC
Created attachment 158197 [details] multipath -ll -v6 & multipath -v6 outputs as requested
Comment 10 Martin George 2007-06-29 11:37:05 UTC
Ben, We were able to reproduce the issue on a RHEL 4.4 host. Attaching the logs as requested. In this case, the "scsi_id -gus /block/sdb" command failed with the following error: "4:0:0:0: page 0 not available" This eventually caused dm-mp to fail configuring devices (multipath -ll gave a blank output). Once sdb was blacklisted using the devnode method in the multipath.conf file, things came back to normal with the successful configuration of dm-mp devices.
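The devnode blacklist mentioned above could look roughly like the stanza printed by the sketch below. The helper `blacklist_stanza` is hypothetical, and the section keyword is an assumption: the sketch defaults to `devnode_blacklist`, while other multipath-tools versions use `blacklist`, so check the multipath.conf shipped with the installed package before copying it.

```shell
#!/bin/sh
# blacklist_stanza DEVICE [SECTION]
# Hypothetical helper: prints a multipath.conf stanza that blacklists
# exactly one device node by name. SECTION defaults to devnode_blacklist,
# which is an assumption about the installed multipath-tools version.
blacklist_stanza() {
    section=${2:-devnode_blacklist}
    # The devnode value is a regular expression, so anchor it to avoid
    # also matching names such as sdba.
    printf '%s {\n        devnode "^%s$"\n}\n' "$section" "$1"
}

# Intended use: review the output, then merge it into /etc/multipath.conf
#   blacklist_stanza sdb
```

Because the failing sd name was reported to shift across reboots, a stanza keyed on the device node name may need to be revisited after each reboot.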
Comment 12 Ben Marzinski 2007-06-29 15:02:09 UTC
Thanks. That should be all I need.
Comment 13 Ben Marzinski 2007-07-19 17:48:56 UTC
Looking at the output from these two commands, I'm confused. Both outputs seem correct on their own. The only issue is that they don't agree with each other. The multipath -v6 -ll output looks exactly like what you would expect if you were trying to list the multipath maps and you had none configured. The multipath -v6 output looks exactly like what you would expect if you ran this command but already had the maps configured. If these commands were run one right after the other (in either order), I cannot see how you would get this output.

Looking at the output for the multipath -v6 command, right after the "# all paths :" section, it lists the parameters of the multipath maps that are already known to device-mapper. The code paths for the two commands do not diverge until after this point; however, this listing is never in the multipath -v6 -ll command output (which is exactly what should happen if there are no multipath maps known to device-mapper).

Do you know if these commands were run back to back? Further, it seems from the multipath -v6 output that the device was already created, according to device-mapper. Is it possible that the device is getting created, but the device node is not? Of course, if the multipath -v6 -ll command was in fact run immediately after, I cannot account for why it did not list the device. The only answer that seems possible (but not at all likely) is that, for some reason, multipath -v6 -ll failed when talking to device-mapper. This is very odd, since the calls to device-mapper were exactly the same as with the multipath -v6 command.

By the way, since you recreated this on RHEL 4.4, I looked at the device-mapper-multipath-0.4.5-16.RHEL4 package (which is the same as the device-mapper-multipath-0.4.5-16.1.RHEL4 package, minus some minor changes to some EMC-specific code). If you are not using one of these two packages, please upgrade multipath to 0.4.5-16.1.RHEL4, as this is the latest RHEL 4.4 package.
I can stick some error messages in where the device-mapper code could fail. But if this is where it is failing, there is no way for multipath to recover. There may be a bug I can't see here, or it may be in device-mapper itself, but until I can find out exactly what's failing, I can't really debug it.

If you see this again, can you check whether the device was actually created by running

dmsetup table --target multipath

If it was, and you still can't list it with multipath -v6 -ll, try running that command under gdb and see if it is crashing. If the command is not crashing, and the paths get listed in the debug output but maps are not being listed, then it must be silently failing while trying to communicate with device-mapper.
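The check described above can be reduced to a small decision helper, sketched here. The function name and the three result labels are hypothetical. Treating empty output or the text "No devices found" as "no maps" is an assumption about what dmsetup prints when nothing matches, and empty `multipath -ll` output is taken to mean no maps were listed, as reported earlier in this bug.

```shell
#!/bin/sh
# classify_state DMSETUP_OUTPUT MULTIPATH_LL_OUTPUT
# Hypothetical helper that compares two captured outputs:
#   $1: output of `dmsetup table --target multipath`
#   $2: output of `multipath -ll`
classify_state() {
    dm_out=$1
    ll_out=$2
    # Assumption: empty output or "No devices found" means device-mapper
    # knows of no multipath maps.
    case "$dm_out" in
        ""|"No devices found") dm_has_maps=no ;;
        *) dm_has_maps=yes ;;
    esac
    if [ "$dm_has_maps" = no ]; then
        echo "no-maps-in-device-mapper"
    elif [ -z "$ll_out" ]; then
        # Maps exist but were not listed: the case to rerun under gdb.
        echo "maps-exist-but-not-listed"
    else
        echo "maps-listed"
    fi
}

# Intended use on an affected host (as root):
#   classify_state "$(dmsetup table --target multipath)" "$(multipath -ll)"
```

A result of `maps-exist-but-not-listed` corresponds to the situation where rerunning the listing command under gdb was suggested.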
Comment 14 Martin George 2007-07-23 09:41:34 UTC
Ben, I'll get back to you on this.
Comment 15 Ben Marzinski 2007-08-03 02:08:24 UTC
There are a bunch of new printouts going into 4.6 to help locate this problem, but the fix will not make 4.6.
Comment 17 Ben Marzinski 2007-10-10 19:14:11 UTC
Please let me know when you recreate this problem.
Comment 18 Martin George 2007-10-10 19:29:20 UTC
Comment 21 Tom Coughlan 2008-01-28 14:33:15 UTC
(In reply to comment #15)
> There are a bunch of new printouts going into 4.6 to help locate this problem,
> but the fix will not make 4.6.

NetApp has not been able to reproduce this so far. They will test 4.7 beta. If the problem is not seen there, this BZ will be closed.
Comment 24 Andrius Benokraitis 2008-06-03 03:12:15 UTC
NETAPP: Has this been tested on RHEL 4.7? This needs to be tested ASAP.
Comment 25 Martin George 2008-06-05 07:34:08 UTC
We'll test this on RHEL 4.7 and update the bugzilla accordingly. Thanks.
Comment 26 Martin George 2008-06-24 14:59:27 UTC
I've not been able to reproduce this issue, so I'm closing this for now.