Bug 1357133 - Multithreaded application hanging while attempting to open a file.
Summary: Multithreaded application hanging while attempting to open a file.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: glibc
Version: 5.5
Hardware: i686
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Carlos O'Donell
QA Contact: qe-baseos-tools
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-07-15 20:46 UTC by Alan Matsuoka
Modified: 2017-10-30 15:30 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-29 19:32:13 UTC



Comment 1 Alan Matsuoka 2016-07-15 20:48:03 UTC
 (1) The current action items to continue troubleshooting this matter are as follows:
     - Reboot the system to ensure that the newer glibc package (which has been installed, but not yet picked up
       because the system has not been rebooted) is actually in use by the application and everywhere else.
     - Continue testing on RHEL 5.11 and 6.8.
          . We certainly understand the difficulty your team has had reproducing the issue in house. The intent
            here is to drastically reduce the problem's surface area and to place the system in a far better
            supported state.
     - We will need details of what the application's signal handlers do and whether they use any syscalls.
          . The intent behind this point is the possibility that a thread begins waiting on a futex, is then
            interrupted to run a signal handler, and within that handler is cancelled via another thread's
            pthread_cancel() call, resulting in the state we are seeing.
          . This request comes directly from our glibc engineering team as an approach to troubleshooting this
            matter from the development side.
     - Investigate and implement the gdb watchpoints outlined by Vern on both the production system and the test system.
          . Again, we realize this will require the issue to be reproduced in production, and the watchpoints will
            temporarily pause the application, but the intent here is to watch the application at extremely high
            resolution to see its behavior when the lock reaches the invalid state (cnt == -1). The intent of the
            watchpoints on the test system is, as Vern noted, not to catch the culprit in the act (the issue seems
            unreproducible on the test systems), but rather to observe the application's behavior around the time
            of the issue.
          . This will give us the historical data we were looking for from the rolling strace capture, which we
            could not gather due to its effects on the system.

Comment 2 Alan Matsuoka 2016-07-15 20:50:13 UTC
// INTERNAL



I've been doing a bit of reverse engineering to see if in fact pthread_cancel is called anywhere 
in cservr and I can't find it.

The VM now has the shared libs that the customer provided in the correct place and the core dump
gives me this:

(gdb) info sh
From        To          Syms Read   Shared Object Library
0xf79c0d90  0xf7ac65b4  Yes         /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libotrans.so
0xf790ec50  0xf7976884  Yes         /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libxml_server.so
0xf772ec20  0xf777c204  Yes         /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libaidb.so
0xf7710800  0xf7710be4  Yes         /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libotdf.5.so
0xf770d800  0xf770dbe4  Yes         /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libueeapi.5.so
0xf760abc0  0xf76bfcc4  Yes         /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libACE-5.6.so
0xf71e4df4  0xf74d8b38  Yes (*)     /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libxalan-c.so.111
0xf70c3b04  0xf70c3c28  Yes (*)     /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libxalanMsg.so.111
0xf6e5cce8  0xf6fef4fc  Yes (*)     /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libxerces-c-3.1.so
0xf6c98670  0xf6d13b54  Yes (*)     /usr/lib/libstdc++.so.6
0x00b3c410  0x00b575a4  Yes (*)     /lib/libm.so.6
0xf6c50660  0xf6c57f34  Yes (*)     /lib/libgcc_s.so.1
0x00b09210  0x00b14bc4  Yes (*)     /lib/libpthread.so.0
0x0099acc0  0x00a99ad0  Yes (*)     /lib/libc.so.6
0xf6c4bee0  0xf6c4d954  Yes (*)     /lib/libuuid.so.1
0xf6c418d0  0xf6c479e0  Yes         /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libjansson.so.4
0x00c10880  0x00c14c44  Yes (*)     /lib/librt.so.1
0xf4e79c70  0xf5960c48  Yes (*)     /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libclntsh.so.11.1
0xf4b6fd40  0xf4b9bf34  Yes (*)     /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libocci.so.11.1
0x00ae3a70  0x00ae4aa4  Yes (*)     /lib/libdl.so.2
0xf49d1984  0xf4a97be0  Yes (*)     /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libicuuc.so.53
0xf47d8e04  0xf4937efc  Yes (*)     /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libicui18n.so.53
0xf32c31fc  0xf32c31fc  Yes (*)     /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libicudata.so.53
0x009667f0  0x0097c12f  Yes (*)     /lib/ld-linux.so.2
0xf3092590  0xf30c7294  Yes (*)     /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libnnz11.so
0x00aed140  0x00afaca4  Yes (*)     /lib/libnsl.so.1
0xf306d390  0xf306d6d0  Yes (*)     /usr/lib/libaio.so.1
0xf06b2614  0xf06b2700  Yes (*)     /prod/pms50ai1/ai/NESAM/aiprodNESAM/bin/libociicus.so
                        No          /usr/lib/libnuma.so
(*): Shared library is missing debugging information.
(gdb) info thr
  36 Thread 0xf306b700 (LWP 12694)  0xffffe410 in __kernel_vsyscall ()
  35 Thread 0xdabf5b90 (LWP 12740)  0xffffe410 in __kernel_vsyscall ()
  34 Thread 0xdb5f6b90 (LWP 12739)  0xffffe410 in __kernel_vsyscall ()
  33 Thread 0xdbff7b90 (LWP 12738)  0xffffe410 in __kernel_vsyscall ()
  32 Thread 0xdc9f8b90 (LWP 12737)  0xffffe410 in __kernel_vsyscall ()
  31 Thread 0xdd3f9b90 (LWP 12736)  0xffffe410 in __kernel_vsyscall ()
  30 Thread 0xdddfab90 (LWP 12735)  0xffffe410 in __kernel_vsyscall ()
  29 Thread 0xde7fbb90 (LWP 12734)  0xffffe410 in __kernel_vsyscall ()
  28 Thread 0xdf1fcb90 (LWP 12733)  0xffffe410 in __kernel_vsyscall ()
  27 Thread 0xdfbfdb90 (LWP 12732)  0xffffe410 in __kernel_vsyscall ()
  26 Thread 0xe05feb90 (LWP 12731)  0xffffe410 in __kernel_vsyscall ()
  25 Thread 0xe0fffb90 (LWP 12730)  0xffffe410 in __kernel_vsyscall ()
  24 Thread 0xe1bf5b90 (LWP 12729)  0xffffe410 in __kernel_vsyscall ()
  23 Thread 0xe25f6b90 (LWP 12728)  0xffffe410 in __kernel_vsyscall ()
  22 Thread 0xe2ff7b90 (LWP 12727)  0xffffe410 in __kernel_vsyscall ()
  21 Thread 0xe39f8b90 (LWP 12726)  0xffffe410 in __kernel_vsyscall ()
  20 Thread 0xe43f9b90 (LWP 12725)  0xffffe410 in __kernel_vsyscall ()
  19 Thread 0xe4dfab90 (LWP 12724)  0xffffe410 in __kernel_vsyscall ()
  18 Thread 0xe57fbb90 (LWP 12723)  0xffffe410 in __kernel_vsyscall ()
  17 Thread 0xe61fcb90 (LWP 12722)  0xffffe410 in __kernel_vsyscall ()
  16 Thread 0xe6bfdb90 (LWP 12721)  0xffffe410 in __kernel_vsyscall ()
  15 Thread 0xe75feb90 (LWP 12720)  0xffffe410 in __kernel_vsyscall ()
  14 Thread 0xe7fffb90 (LWP 12719)  0xffffe410 in __kernel_vsyscall ()
  13 Thread 0xe8de4b90 (LWP 12718)  0xffffe410 in __kernel_vsyscall ()
  12 Thread 0xe97e5b90 (LWP 12717)  0xffffe410 in __kernel_vsyscall ()
  11 Thread 0xea1e6b90 (LWP 12716)  0xffffe410 in __kernel_vsyscall ()
  10 Thread 0xeabe7b90 (LWP 12715)  0xffffe410 in __kernel_vsyscall ()
  9 Thread 0xeb5e8b90 (LWP 12714)  0xffffe410 in __kernel_vsyscall ()
  8 Thread 0xebfe9b90 (LWP 12713)  0xffffe410 in __kernel_vsyscall ()
  7 Thread 0xec9eab90 (LWP 12712)  0xffffe410 in __kernel_vsyscall ()
  6 Thread 0xed3ebb90 (LWP 12711)  0xffffe410 in __kernel_vsyscall ()
  5 Thread 0xeddecb90 (LWP 12710)  0xffffe410 in __kernel_vsyscall ()
  4 Thread 0xee7edb90 (LWP 12709)  0xffffe410 in __kernel_vsyscall ()
  3 Thread 0xef1eeb90 (LWP 12708)  0xffffe410 in __kernel_vsyscall ()
  2 Thread 0xefbefb90 (LWP 12707)  0xffffe410 in __kernel_vsyscall ()
* 1 Thread 0xf05f0b90 (LWP 12706)  0xffffe410 in __kernel_vsyscall ()
(gdb) 
So there are 36 threads.

(gdb) thr 36
[Switching to thread 36 (Thread 0xf306b700 (LWP 12694))]#0  0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x00b0ecc5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0xf762797e in cond_timedwait (this=0xaa620b0, mutex=..., abstime=0x0) at ../../ace/OS_NS_Thread.inl:410
#3  ACE_Condition_Thread_Mutex::wait (this=0xaa620b0, mutex=..., abstime=0x0) at ../../ace/Condition_Thread_Mutex.cpp:98
#4  0xf76279a6 in ACE_Condition_Thread_Mutex::wait (this=0xfffffffc, abstime=0x0) at ../../ace/Condition_Thread_Mutex.cpp:107
#5  0xf76b7e2a in ACE_Thread_Manager::wait (this=0xaa62068, timeout=0x0, abandon_detached_threads=false, use_absolute_time=true) at ../../ace/Thread_Manager.cpp:1697
#6  0x08063cb3 in ai_main (param=0x0) at ../../../../appint/cservr/ai_main.cpp:343
#7  0x080705da in main (argc=11, argv=0xffe03d34) at ../../../../appint/cservr/main.cpp:124

Thread 36 has main in it.

All of the threads except thread 1 and thread 7 look like this:

(gdb) bt
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x00a66e13 in __lll_lock_wait_private () from /lib/libc.so.6
#2  0x009eb3c5 in _L_lock_338 () at genops.c:241
#3  0x009e99b1 in _IO_link_in (fp=0xe81ac348) at genops.c:104
#4  0x009e9256 in _IO_new_file_init (fp=0xe81ac348) at fileops.c:153
#5  0x009dd940 in __fopen_internal (filename=0xf20abb50 "/prod/pms50ai1/ai/NESAM/aiprodNESAM/logs/n6.tr120172270.log", mode=0xf7ad2b4b "w", is32=0) at iofopen.c:89
#6  0x009dfe9c in _IO_fopen64 (filename=0xf20abb50 "/prod/pms50ai1/ai/NESAM/aiprodNESAM/logs/n6.tr120172270.log", mode=0xf7ad2b4b "w") at iofopen64.c:39
#7  0xf7a79cf8 in init_thr_res (otgbls=0xf20ab920, gbl_ifthr=0xf05e9810, max_pg_buf_p=6400) at ../../../trans_lib/dispatch.c:1143
#8  0x08078bae in AI_Ige_UATrans::init_ifthr (this=0xab79850, ige_session=0xe81abe98) at ../../../../../appint/cservr/ige/../AI_Globals.h:104
#9  0x080780f1 in AI_Ige_UATrans::service_ (this=0xab79850, session=..., follow_req=0xe81ac118) at ../../../../../appint/cservr/ige/AI_Ige_UATrans.cpp:203
#10 0x080707ff in AI_Ige_Request::service (this=0xab79850, session=..., follow_req=0xe81ac118) at ../../../../../appint/cservr/ige/AI_Ige_Request.cpp:554
#11 0x0806093b in AI_Ige_Request_Handler::service_request (session=0xf2d7b494, req1=0xab79850, req2=0xe81ac118)
    at ../../../../appint/cservr/AI_Ige_Request_Handler.cpp:444
#12 0x08060fbb in AI_Ige_Request_Handler::service_request (job=0xf05eff10) at ../../../../appint/cservr/AI_Job.h:33
#13 0x0806073e in AI_Ige_Request_Handler::handle_input (this=0xad079d8, fd=62) at ../../../../appint/cservr/AI_Ige_Request_Handler.cpp:401
#14 0xf76b0571 in ACE_TP_Reactor::dispatch_socket_event (this=0xffdfe260, dispatch_info=...) at ../../ace/TP_Reactor.cpp:591
#15 0xf76b064f in ACE_TP_Reactor::handle_socket_events (this=0xffdfe260, event_count=@0xf05f01d8, guard=...) at ../../ace/TP_Reactor.cpp:460
#16 0xf76b092f in ACE_TP_Reactor::dispatch_i (this=0xffdfe260, max_wait_time=0xf05f02d0, guard=...) at ../../ace/TP_Reactor.cpp:249
#17 0xf76b09c5 in ACE_TP_Reactor::handle_events (this=0xffdfe260, max_wait_time=0xf05f02d0) at ../../ace/TP_Reactor.cpp:173
#18 0xf76b06ee in ACE_TP_Reactor::handle_events (this=0xfffffffc, max_wait_time=...) at ../../ace/TP_Reactor.cpp:664
#19 0xf768cc89 in ACE_Reactor::handle_events (this=0x0, max_wait_time=...) at ../../ace/Reactor.cpp:432
#20 0x080631f6 in svr_worker () at ../../../../appint/cservr/ai_main.cpp:82
#21 0xf76b23fa in ACE_Thread_Adapter::invoke_i (this=0xaa61500) at ../../ace/Thread_Adapter.cpp:146
#22 0xf76b2361 in ACE_Thread_Adapter::invoke (this=0xaa61500) at ../../ace/Thread_Adapter.cpp:95
#23 0xf761fd97 in ace_thread_adapter (args=0xfffffffc) at ../../ace/Base_Thread_Adapter.cpp:116
#24 0x00b0a912 in start_thread (arg=0xf05f0b90) at pthread_create.c:301
#25 0x00a5a4ae in clone () from /lib/libc.so.6

Thread 7 looks like this:
(gdb) thr 7
[Switching to thread 7 (Thread 0xec9eab90 (LWP 12712))]#0  0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x00a52f71 in select () from /lib/libc.so.6
#2  0x0806a171 in select (this=0xffdfe260, dispatch_set=..., max_wait_time=0xec9ea2d0) at /AIDev/tc/projects/dev52/cserver/deps_root/include/ace/OS_NS_sys_select.inl:45
#3  ACE_Select_Reactor_T<ACE_Reactor_Token_T<ACE_Token> >::wait_for_multiple_events (this=0xffdfe260, dispatch_set=..., max_wait_time=0xec9ea2d0)
    at /AIDev/tc/projects/dev52/cserver/deps_root/include/ace/Select_Reactor_T.cpp:1104
#4  0xf76afc2e in ACE_TP_Reactor::get_event_for_dispatching (this=0xffdfe260, max_wait_time=0xec9ea2d0) at ../../ace/TP_Reactor.cpp:491
#5  0xf76b08df in ACE_TP_Reactor::dispatch_i (this=0xffdfe260, max_wait_time=0xec9ea2d0, guard=...) at ../../ace/TP_Reactor.cpp:182
#6  0xf76b09c5 in ACE_TP_Reactor::handle_events (this=0xffdfe260, max_wait_time=0xec9ea2d0) at ../../ace/TP_Reactor.cpp:173
#7  0xf76b06ee in ACE_TP_Reactor::handle_events (this=0xfffffdfe, max_wait_time=...) at ../../ace/TP_Reactor.cpp:664
#8  0xf768cc89 in ACE_Reactor::handle_events (this=0x0, max_wait_time=...) at ../../ace/Reactor.cpp:432
#9  0x080631f6 in svr_worker () at ../../../../appint/cservr/ai_main.cpp:82
#10 0xf76b23fa in ACE_Thread_Adapter::invoke_i (this=0xaa6f1f0) at ../../ace/Thread_Adapter.cpp:146
#11 0xf76b2361 in ACE_Thread_Adapter::invoke (this=0xaa6f1f0) at ../../ace/Thread_Adapter.cpp:95
#12 0xf761fd97 in ace_thread_adapter (args=0xfffffdfe) at ../../ace/Base_Thread_Adapter.cpp:116
#13 0x00b0a912 in start_thread (arg=0xec9eab90) at pthread_create.c:301
#14 0x00a5a4ae in clone () from /lib/libc.so.6


So the problem here, as Vern says, is in list_all_lock where cnt == -1.

I've gone over the symbols defined and referenced by cservr and its dependent libraries, and I can't find anywhere that pthread_cancel is called.

I went and built ACE-5.6 on a RHEL 5.5 VM and started digging around.
pthread_cancel is wrapped in a C++ method called ACE_OS::thr_cancel.

Doing an nm -g -C on all of the binaries, there isn't a single reference to that method.

=============================================================================

1) I wanted to get a closer look at ACE, so while building and running its own test bucket I noticed that
there is an open bug against the 5.6 version they are running. It may or may not be relevant,
since there has been an ongoing discussion there regarding thread cancellation and signals.

http://deuce.doc.wustl.edu/bugzilla/show_bug.cgi?id=2368

2) The only other relevant point I can think of is that this is a 32-bit process. Is it even possible that cnt is
wrapping around? Highly unlikely, but just a thought.

3) One more thing to try would be to run this under valgrind. It's possible that there is
   a) memory corruption
   b) a race condition
   c) both a) and b)

Given how slow valgrind is, this may not be feasible, but there is a chance it will pick up any problems quickly.
As a caveat, if they do want to try valgrind, they should use the DTS version instead of the stock RHEL 5 version,
because its helgrind support is so much better.

4) Something else that I missed entirely:
   Does this application use asynchronous I/O (aio)?

0xf306d390  0xf306d6d0  Yes (*)     /usr/lib/libaio.so.1

The facility is there in libACE, but I don't see any indication that it's actually being used.


------------------------------------------------------------------------------
// INTERNAL

More thoughts ..

We know nothing about the structure of the customer application.
From what I can tell, they are using a number of worker threads in a pool to service incoming
requests.

1)
 a) Is the number of threads fixed?
 b) Or are threads created and destroyed dynamically?
 c) Are there any dynamically created and destroyed helper threads?
2) Do any of the threads call fork or popen, either directly or indirectly through other shared libs that the application uses?


3) Is it possible to create a smaller reproducer in C++ that essentially replicates the thread structure of the application?


4) Given how long it takes for this server process to run into this condition, this is almost certainly due to a race
condition, and both cores show that the process is deadlocked attempting to open 32 different log files at once, with
this peculiar set of values in list_all_lock:

(gdb) p list_all_lock
$1 = {lock = 2, cnt = -1, owner = 0xec9eab90}

Comment 3 Alan Matsuoka 2016-07-15 20:51:37 UTC
As Vern stated, he would like to set two watch points:

1. When a call is made to either increment or decrement the lock count, log the following to a file:
a. All active threads.
b. The contents of three 32-bit integers, printed as hex values, starting at the address of the list_all_lock memory location.
c. A backtrace of the thread that was interrupted by the watchpoint.
2. When the lock count hits -1, stop cservr execution. The first time the count reaches -1 represents the top of the hill for the downhill slide into the futex hang.

The first watchpoint above could create a very large log file, depending on how long it takes to reach the second watchpoint (the beginning of the futex hang). While gdb is writing the log entries, all threads of cservr will be paused. Any translations in flight will continue to run, but new ones will not be accepted or dispatched.

At the failure point, instead of stopping we could:
1.	Produce a backtrace of all threads.
2.	Continue cservr execution and let the strace script that is currently in place discover the hang situation.

