Bug 1365710 - multipath daemon crashes
Summary: multipath daemon crashes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: device-mapper-multipath
Version: 6.9
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Ben Marzinski
QA Contact: Lin Li
URL:
Whiteboard:
Depends On:
Blocks: 1400664 1405173
 
Reported: 2016-08-10 02:17 UTC by George Beshers
Modified: 2017-03-21 10:52 UTC
CC List: 16 users

Fixed In Version: device-mapper-multipath-0.4.9.96.el6
Doc Type: Bug Fix
Doc Text:
Cause: If multipathd was unable to update a multipath device during path removal, it would free the device structures twice.
Consequence: multipathd could crash while removing path devices.
Fix: multipathd no longer deletes the device structure twice.
Result: multipathd no longer crashes occasionally during path removal.
(A minimal sketch of this double-free pattern follows the metadata block below.)
Clone Of:
Environment:
Last Closed: 2017-03-21 10:52:39 UTC
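
As promised above, here is a minimal C sketch of the double-free pattern described in the Doc Text. This is an illustration only, not the multipathd source; all names (struct path_device, update_mpp, remove_path) are hypothetical.

/* Sketch of the double free described in the Doc Text -- hypothetical
 * names, not the actual multipathd code. */
#include <stdlib.h>
#include <string.h>

struct path_device {
    char *alias;
};

/* On failure this helper frees the device itself.  Taking a pointer to
 * the caller's pointer and NULLing it records that ownership is gone,
 * which is what prevents the second free. */
static int update_mpp(struct path_device **devp)
{
    /* ... imagine the update fails here ... */
    free((*devp)->alias);
    free(*devp);
    *devp = NULL;           /* without this, the caller frees dev again */
    return -1;
}

static void remove_path(struct path_device *dev)
{
    if (update_mpp(&dev) < 0) {
        free(dev);          /* dev is NULL now, so free() is a no-op */
        return;
    }
    free(dev->alias);       /* normal teardown on the success path */
    free(dev);
}

int main(void)
{
    struct path_device *dev = calloc(1, sizeof(*dev));
    if (!dev)
        return 1;
    dev->alias = strdup("mpatha");
    remove_path(dev);       /* would double-free without the NULLing */
    return 0;
}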


Attachments
tar.bz2 of a crash directory, has sosreport. (deleted)
2016-10-27 15:31 UTC, George Beshers
no flags Details
tar.xz of 4 crash directories (deleted)
2016-10-27 15:49 UTC, George Beshers
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0697 normal SHIPPED_LIVE device-mapper-multipath bug fix and enhancement update 2017-03-21 12:39:06 UTC

Description George Beshers 2016-08-10 02:17:13 UTC
Description of problem:

    We believe the crash is triggered as a result of powering JBOD disks on and
    off.  A standalone test wasn't successful in recreating the crash, although
    it may not have been run for long enough.


Version-Release number of selected component (if applicable):

[root@perf3094 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.7 (Santiago)
[root@perf3094 ~]#
[root@perf3094 ~]# rpm -qiv device-mapper-multipath
Name        : device-mapper-multipath      Relocations: (not relocatable)
Version     : 0.4.9                             Vendor: Red Hat, Inc.
Release     : 93.el6                        Build Date: Wed 09 Mar 2016 02:18:03 PM PST
Install Date: Sun 05 Jun 2016 11:02:07 PM PDT      Build Host: x86-032.build.eng.bos.redhat.com
Group       : System Environment/Base       Source RPM: device-mapper-multipath-0.4.9-93.el6.src.rpm
Size        : 223236                           License: GPL+
Signature   : RSA/8, Tue 05 Apr 2016 02:28:34 AM PDT, Key ID 199e2f91fd431d51
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
URL         : http://christophe.varoqui.free.fr/
Summary     : Tools to manage multipath devices using device-mapper
Description :
device-mapper-multipath provides tools to manage multipath devices by
instructing the device-mapper multipath kernel module what to do.
The tools are :
* multipath - Scan the system for multipath devices and assemble them.
* multipathd - Detects when paths fail and execs multipath to update things.

How reproducible:
    Script can usually reproduce within an hour.


Steps to Reproduce:
1.
2.
3.

Actual results:

Here's the latest traceback and core location:

(gdb) where
#0  0x00000038460179d1 in free_pgvec () from /lib64/libmultipath.so
#1  0x0000003846017acb in free_multipath () from /lib64/libmultipath.so
#2  0x00000000004089d2 in ?? ()
#3  0x00000000004091ed in uev_trigger ()
#4  0x0000003846029ad4 in service_uevq () from /lib64/libmultipath.so
#5  0x0000003846029b8f in uevent_dispatch () from /lib64/libmultipath.so
#6  0x00000000004058cb in ?? ()
#7  0x0000003844007aa1 in start_thread () from /lib64/libpthread.so.0
#8  0x0000003843ce8aad in clone () from /lib64/libc.so.6
(gdb) quit

(gdb) where
#0  0x00000038b4534cfc in __strncpy_ssse3 () from /lib64/libc.so.6
#1  0x00000038b68329ff in find_existing_alias (vecs=0x7f0e9cc306c0,
    pp=0x7f0e9b831e00, add_vec=1) at /usr/include/bits/string3.h:121
#2  add_map_with_path (vecs=0x7f0e9cc306c0, pp=0x7f0e9b831e00, add_vec=1)
    at structs_vec.c:479
#3  0x00000000004085aa in ev_add_path (devname=0x7f0e9bbcd018 "sdp",
    vecs=0x7f0e9cc306c0) at main.c:513
#4  0x0000000000409198 in uev_add_path (uev=0x7f0e9bc51800,
    trigger_data=0x7f0e9cc306c0) at main.c:431
#5  uev_trigger (uev=0x7f0e9bc51800, trigger_data=0x7f0e9cc306c0) at main.c:819
#6  0x00000038b6829ad4 in service_uevq () at uevent.c:113
#7  0x00000038b6829b8f in uevent_dispatch (uev_trigger=<value optimized out>,
    trigger_data=<value optimized out>) at uevent.c:143
#8  0x00000000004058cb in uevqloop (ap=0x7f0e9cc306c0) at main.c:842
#9  0x00000038b4c07aa1 in start_thread () from /lib64/libpthread.so.0
#10 0x00000038b44e8aad in clone () from /lib64/libc.so.6
Expected results:


Additional info:

Comment 2 Joseph Kachuck 2016-08-10 16:22:38 UTC
Hello,
Please attach a sosreport from a system after seeing this issue.

Please confirm steps to recreate this issue locally.

Thank You
Joe Kachuck

Comment 4 George Beshers 2016-09-12 19:08:38 UTC
    fmt=0x38b683afea "unloading %s prioritizer\n", ap=0x7ffd67df4360)
    at log_pthread.c:23
#2  0x00000038b6823b27 in dlog (sink=<value optimized out>,
    prio=<value optimized out>, fmt=0x38b683afea "unloading %s prioritizer\n")
    at debug.c:36
#3  0x00000038b6833aa4 in free_prio (p=0x73e7d0) at prio.c:47
#4  0x00000038b6833f58 in cleanup_prio () at prio.c:69
#5  0x0000000000406d25 in child (argc=<value optimized out>,
    argv=<value optimized out>) at main.c:1763
#6  main (argc=<value optimized out>, argv=<value optimized out>)
    at main.c:1919

pebble ccpp-2016-08-09-22:16:34-31304
(gdb) where
#0  0x0000003eb1e179d1 in free_pgvec () from /lib64/libmultipath.so
#1  0x0000003eb1e17acb in free_multipath () from /lib64/libmultipath.so
#2  0x00000000004089d2 in ?? ()
#3  0x00000000004091ed in uev_trigger ()
#4  0x0000003eb1e29ad4 in service_uevq () from /lib64/libmultipath.so
#5  0x0000003eb1e29b8f in uevent_dispatch () from /lib64/libmultipath.so
#6  0x00000000004058cb in ?? ()
#7  0x0000003eb0607aa1 in start_thread () from /lib64/libpthread.so.0
#8  0x0000003eafee8aad in clone () from /lib64/libc.so.6

pebble ccpp-2016-08-14-14:43:19-3295
(gdb) where
#0  0x0000003eb060b68c in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x0000000000406c33 in ?? ()
#2  0x0000003eafe1ed1d in __libc_start_main () from /lib64/libc.so.6
#3  0x00000000004052f9 in ?? ()
#4  0x00007ffcce150578 in ?? ()
#5  0x000000000000001c in ?? ()
#6  0x0000000000000001 in ?? ()
#7  0x00007ffcce150f76 in ?? ()
#8  0x0000000000000000 in ?? ()

pebble ccpp-2016-08-17-09:00:21-19774
(gdb) where
#0  0x0000003eafe325e5 in raise () from /lib64/libc.so.6
#1  0x0000003eafe33dc5 in abort () from /lib64/libc.so.6
#2  0x0000003eafe704f7 in __libc_message () from /lib64/libc.so.6
#3  0x0000003eafe75f3e in malloc_printerr () from /lib64/libc.so.6
#4  0x0000003eafe78dd0 in _int_free () from /lib64/libc.so.6
#5  0x0000003eb1e108c7 in vector_free () from /lib64/libmultipath.so
#6  0x0000003eb1e17abd in free_multipath () from /lib64/libmultipath.so
#7  0x00000000004089d2 in ?? ()
#8  0x00000000004091ed in uev_trigger ()
#9  0x0000003eb1e29ad4 in service_uevq () from /lib64/libmultipath.so
#10 0x0000003eb1e29b8f in uevent_dispatch () from /lib64/libmultipath.so
#11 0x00000000004058cb in ?? ()
#12 0x0000003eb0607aa1 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003eafee8aad in clone () from /lib64/libc.so.6

Comment 5 Ben Marzinski 2016-10-12 22:56:54 UTC
Is there any chance of getting an sosreport/coredump from this system? Without that, this is going to be really hard to debug. Simply posting a backtrace of the crash doesn't give me enough information to go on.

Comment 6 George Beshers 2016-10-27 15:29:14 UTC
I have asked to get time on the system.
This is the hardware
    https://www.supermicro.com/products/chassis/4u/946/sc946ed-r2kjbod.cfm

Some coredump information to follow in a minute.

Comment 7 George Beshers 2016-10-27 15:31:40 UTC
Created attachment 1214658 [details]
tar.bz2 of a crash directory, has sosreport.

Comment 8 George Beshers 2016-10-27 15:49:38 UTC
Created attachment 1214665 [details]
tar.xz of 4 crash directories

Comment 9 Ben Marzinski 2016-11-04 03:45:45 UTC
I've definitely fixed the cause of three of the five crashes that I got coredumps for. The coredumps in ccpp-2016-08-08-07:37:39-3400 and ccpp-2016-08-08-08:00:12-20573 both involve the string "multipath" being written over the list of paths from a multipath pathgroup. This looks like a use-after-free issue. There are a couple of places where that string could be written to memory, but the most likely cause is running device-mapper commands that grab the device type, which is "multipath". I wasn't able to definitively figure out where the memory was getting incorrectly freed, which allowed the device-mapper command to write over it. However, there is a chance that it was caused by the double free that was definitely causing another of the crashes, and that I did fix. I also went through and made the code set a number of pointers to NULL after the memory they were pointing to was freed. Hopefully this will either fix the issue or make it more obvious where things are going wrong.
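
For reference, this is the general shape of the free-and-NULL hardening described above. Illustrative C only, with hypothetical names (FREE_AND_NULL, struct pathgroup, free_pg); it is not the libmultipath source.

/* Generic free-and-NULL idiom.  After the macro runs, a stale second
 * free becomes free(NULL), a defined no-op, and a stale read faults
 * immediately at NULL instead of silently reading heap memory the
 * allocator may have reused (e.g. for a "multipath" type string). */
#include <stdlib.h>

#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

struct pathgroup {
    void **paths;           /* stand-in for the vector of path pointers */
};

static void free_pg(struct pathgroup *pgp)
{
    if (!pgp)
        return;
    FREE_AND_NULL(pgp->paths);  /* pgp->paths can no longer dangle */
    free(pgp);
}

int main(void)
{
    struct pathgroup *pgp = calloc(1, sizeof(*pgp));
    if (!pgp)
        return 1;
    pgp->paths = calloc(4, sizeof(void *));
    free_pg(pgp);
    return 0;
}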

You can get packages with these fixes at:
http://people.redhat.com/~bmarzins/device-mapper-multipath/rpms/RHEL6/1365710/

Comment 11 George Beshers 2016-12-08 16:21:29 UTC
The new code has been tested overnight internally at SGI and did not crash.

Also while testing power cycling expanders we haven't seen any hung task
timeouts, which used to occur frequently with the older code.

Further testing exposed a long latency issue, see BZ1402099.

Comment 12 George Beshers 2016-12-22 16:18:55 UTC
Getting ready for OnBoarding in the first week of January.

Comment 13 Lin Li 2016-12-27 07:29:14 UTC
The fix is present in the spec file and source code, and the customer has verified it. Changing the status to VERIFIED + SanityOnly.

Comment 15 errata-xmlrpc 2017-03-21 10:52:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0697.html

