Note: This is a beta release of Red Hat Bugzilla 5.0. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Also email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback here.
Bug 589675 - job_server: QMF init/connection failure from Qpid "Invalid URL"
Summary: job_server: QMF init/connection failure from Qpid "Invalid URL"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: Development
Hardware: All
OS: Linux
high
high
Target Milestone: 1.3
: ---
Assignee: Alan Conway
QA Contact: Frantisek Reznicek
URL:
Whiteboard:
Depends On: 560693
Blocks: 586348
TreeView+ depends on / blocked
 
Reported: 2010-05-06 16:39 UTC by Pete MacKinnon
Modified: 2015-11-16 01:12 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Attachments (Terms of Use)
condor_job_server connection failure trace (deleted)
2010-06-03 16:20 UTC, Pete MacKinnon
no flags Details
condor_master plug-in successful connection (deleted)
2010-06-03 16:21 UTC, Pete MacKinnon
no flags Details

Description Pete MacKinnon 2010-05-06 16:39:45 UTC
05/06 12:37:57 (fd:2) (pid:4002) === Current Probing Information ===
05/06 12:37:57 (fd:2) (pid:4002) fsize: 80		mtime: 1273150920
05/06 12:37:57 (fd:2) (pid:4002) first log entry: 2 CreationTimestamp 1273105563
05/06 12:37:57 (fd:2) (pid:4002) ERROR "Assertion ERROR on (hash)" at line 1001 in file attrlist.cpp
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_S_create
Stack dump for process 4002 at timestamp 1273163887 (23 frames)
./condor_job_server(dprintf_dump_stack+0xc7)[0x80f9b63]
./condor_job_server[0x80f9d3a]
[0xe24400]
[0xe24422]
/lib/libc.so.6(gsignal+0x51)[0x7d9781]
/lib/libc.so.6(abort+0x17a)[0x7db04a]
/usr/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x14f)[0x56df44f]
/usr/lib/libstdc++.so.6[0x56dd385]
/usr/lib/libstdc++.so.6[0x56dd3c2]
/usr/lib/libstdc++.so.6[0x56dd501]
/usr/lib/libstdc++.so.6(_ZSt20__throw_length_errorPKc+0x7f)[0x5677d1f]
/usr/lib/libstdc++.so.6(_ZNSs4_Rep9_S_createEjjRKSaIcE+0x85)[0x56b8b95]
/usr/lib/libstdc++.so.6(_ZNSs4_Rep8_M_cloneERKSaIcEj+0x38)[0x56b97e8]
/usr/lib/libstdc++.so.6(_ZNSs6assignERKSs+0x6f)[0x56ba57f]
/usr/lib/libqpidcommon.so.2(_ZN4qpid7framing18DeliveryProperties13setRoutingKeyERKSs+0x2c)[0x25534c]
/usr/lib/libqmf.so.1(_ZN4qpid10management19ManagementAgentImpl16ConnectionThread11sendMessageENS_6client7MessageERKSsS6_+0xc4)[0x712f84]
/usr/lib/libqmf.so.1(_ZN4qpid10management19ManagementAgentImpl16ConnectionThread10sendBufferERKSsS4_St3mapISsNS_5types7VariantESt4lessISsESaISt4pairIS3_S7_EEES4_S4_S4_+0x171)[0x713681]
/usr/lib/libqmf.so.1(_ZN4qpid10management19ManagementAgentImpl13sendHeartbeatEv+0x4ff)[0x713c6f]
/usr/lib/libqmf.so.1(_ZN4qpid10management19ManagementAgentImpl18periodicProcessingEv+0x7ba)[0x7165aa]
/usr/lib/libqmf.so.1(_ZN4qpid10management19ManagementAgentImpl13PublishThread3runEv+0x2a)[0x7170ca]
/usr/lib/libqpidcommon.so.2[0x21de31]
/lib/libpthread.so.0[0x1158f5]
/lib/libc.so.6(clone+0x5e)[0x88bfce]
Aborted

Comment 1 Pete MacKinnon 2010-05-06 19:26:03 UTC
(gdb) where
#0  AttrList::Lookup (this=0xbfffe5fc, name=0x825a800 "MyType")
    at attrlist.cpp:1001
#1  0x08156d44 in AttrList::Lookup (this=0xbfffe5fc, name=0x825a800 "MyType")
    at attrlist.cpp:994
#2  0x08156eda in AttrList::Lookup (this=0xbfffe5fc, attr=0x825a858)
    at attrlist.cpp:1037
#3  0x08156663 in AttrList::Insert (this=0xbfffe5fc, expr=0x8259e48, 
    check_for_dups=true) at attrlist.cpp:705
#4  0x08154ebe in AttrList::AttrList (this=0xbfffe5fc, file=0x825a5c8, 
    delimitor=0x81ae2e4 "***", isEOF=@0xbfffe728, error=@0xbfffe724, 
    empty=@0xbfffe720) at attrlist.cpp:222
#5  0x0815b3f8 in ClassAd::ClassAd (this=0xbfffe5fc, f=0x825a5c8, 
    d=0x81ae2e4 "***", i=@0xbfffe728, err=@0xbfffe724, empty=@0xbfffe720)
    at classad.cpp:102
#6  0x080d1679 in HistoryFile::poll (this=0x8217300) at HistoryFile.cpp:205
#7  0x080d4620 in ProcessCurrentHistory () at HistoryProcessingUtils.cpp:234
#8  0x080c63ae in ProcessHistoryTimer () at job_server_main.cpp:245
#9  0x080f7553 in TimerManager::Timeout (this=0x8217348)
    at timer_manager.cpp:397
#10 0x080de60f in DaemonCore::Driver (this=0x8235898) at daemon_core.cpp:2900
#11 0x080f400f in main (argc=1, argv=0xbffff0fc) at daemon_core_main.cpp:2281

Comment 2 Pete MacKinnon 2010-05-06 20:08:16 UTC
Culprit...bulk load?

(gdb) p hash
$19 = (HashTable<YourString, AttrListElem*> *) 0x0
(gdb) where
#0  AttrList::Lookup (this=0x8259a70, name=0x81bf388 "JobStatus")
    at attrlist.cpp:1001
#1  0x080c7cec in LiveJob::Get (this=0x8233ac0, _name=0x81bf388 "JobStatus", 
    _attribute=@0xbfffe79c) at Job.cpp:176
#2  0x080c8013 in LiveJob::GetStatus (this=0x8233ac0) at Job.cpp:236
#3  0x080cb90a in SubmissionObject::Decrement (this=0x0, job=0x8233ac0)
    at SubmissionObject.cpp:117
#4  0x080c782f in Job::DecrementSubmission (this=0x8233ac0) at Job.cpp:107
#5  0x080c7aa4 in LiveJob::~LiveJob (this=0x8233ac0, 
    __in_chrg=<value optimized out>) at Job.cpp:127
#6  0x080cd5be in JobServerJobLogConsumer::Reset (this=0x8258478)
    at JobServerJobLogConsumer.cpp:50
#7  0x0812c925 in JobLogReader::BulkLoad (this=0x8258994)
    at JobLogReader.cpp:97
#8  0x0812c8b7 in JobLogReader::Poll (this=0x8258994) at JobLogReader.cpp:75
#9  0x080cf1ee in JobLogMirror::TimerHandler_JobLogPolling (this=0x8258990)
    at JobLogMirror.cpp:79
#10 0x080f753a in TimerManager::Timeout (this=0x8217348)
    at timer_manager.cpp:394
#11 0x080de60f in DaemonCore::Driver (this=0x8235898) at daemon_core.cpp:2900
#12 0x080f400f in main (argc=1, argv=0xbffff0fc) at daemon_core_main.cpp:2281

Comment 3 Pete MacKinnon 2010-05-11 14:22:42 UTC
Fixed an order problem in LiveJob dtor. Delete the ad AFTER decrementing the submission, due to a lookup in the decrement. :-)

Comment 4 Pete MacKinnon 2010-05-21 21:07:30 UTC
Same symptom but different trace...

05/21 16:57:50 ******************************************************
05/21 16:57:50 ** condor_job_server (CONDOR_JOB_SERVER) STARTING UP
05/21 16:57:50 ** /usr/sbin/condor_job_server
05/21 16:57:50 ** SubsystemInfo: name=JOB_SERVER type=DAEMON(11) class=DAEMON(1)
05/21 16:57:50 ** Configuration: subsystem:JOB_SERVER local:<NONE> class:DAEMON
05/21 16:57:50 ** $CondorVersion: 7.4.3 May 19 2010 BuildID: RH-7.4.3-0.14.el5 PRE-RELEASE $
05/21 16:57:50 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
05/21 16:57:50 ** PID = 15135
05/21 16:57:50 ** Log last touched 5/21 15:57:50
05/21 16:57:50 ******************************************************
05/21 16:57:50 Using config source: /etc/condor/condor_config
05/21 16:57:50 Using local config sources:
05/21 16:57:50    /var/lib/condor/condor_config.local
05/21 16:57:50    /var/lib/condor/config/99configd.config
05/21 16:57:50    /var/lib/condor/config/ZQMF_BROKER_HOST
05/21 16:57:50    /var/lib/condor/config/qmf_condor_config
05/21 16:57:50    /var/lib/condor/config/zmaster
05/21 16:57:50    /var/lib/condor/config/zmatt
05/21 16:57:50    /var/lib/condor/config/zpmackinn
05/21 16:57:50 DaemonCore: Command Socket at <10.16.44.232:31118>
05/21 16:57:50 main_init() called
05/21 16:57:50 error reading /mnt/pool/spool/job_queue.log: no such job '025819.-1' for 'ClusterId = 25819'
05/21 16:57:50 error reading /mnt/pool/spool/job_queue.log: Failed to process log entry.
05/21 16:58:00 error reading /mnt/pool/spool/job_queue.log: no such job '025819.-1' for 'ClusterId = 25819'
05/21 16:58:00 error reading /mnt/pool/spool/job_queue.log: Failed to process log entry.
Stack dump for process 15135 at timestamp 1274475484 (15 frames)
condor_job_server(dprintf_dump_stack+0x4e)[0x4a0f9e]
condor_job_server[0x4a2d72]
/lib64/libpthread.so.0[0x320f80e930]
/lib64/libc.so.6(gsignal+0x35)[0x320ec30265]
/lib64/libc.so.6(abort+0x110)[0x320ec31d10]
/lib64/libc.so.6[0x320ec6a84b]
/lib64/libc.so.6[0x320ec723e5]
/lib64/libc.so.6(cfree+0x4b)[0x320ec7273b]
/usr/lib64/libqpidclient.so.2(_ZN4qpid6client12TCPConnectorD0Ev+0xff)[0x2b7bbc40329f]
/usr/lib64/libqpidclient.so.2(_ZN4qpid6client14ConnectionImplD0Ev+0xe8)[0x2b7bbc3b9248]
/usr/lib64/libqpidclient.so.2(_ZN4qpid6client10Connection4openERKNS0_18ConnectionSettingsE+0x569)[0x2b7bbc3acdc9]
/usr/lib64/libqmf.so.1(_ZN4qpid10management19ManagementAgentImpl16ConnectionThread3runEv+0x25b)[0x2b7bbc65051b]
/usr/lib64/libqpidcommon.so.2[0x2b7bbbfe458a]
/lib64/libpthread.so.0[0x320f806617]
/lib64/libc.so.6(clone+0x6d)[0x320ecd3c2d]

Comment 5 Pete MacKinnon 2010-05-22 13:14:21 UTC
Running condor_job_server -t -f tells me...

2010-05-22 09:01:54 warning Closing connection due to Invalid URL: amqp:tcp:20.0.10.31:5672,tcp:192.168.122.1:5672,tcp:10.16.44.236:5672 (qpid/Url.cpp:211)
2010-05-22 09:01:56 warning Closing connection due to Invalid URL: amqp:tcp:20.0.10.31:5672,tcp:192.168.122.1:5672,tcp:10.16.44.236:5672 (qpid/Url.cpp:211)
2010-05-22 09:02:00 warning Closing connection due to Invalid URL: amqp:tcp:20.0.10.31:5672,tcp:192.168.122.1:5672,tcp:10.16.44.236:5672 (qpid/Url.cpp:211)


Don't see this in the plug-in daemons. ld issue?

Comment 6 Pete MacKinnon 2010-05-25 14:04:20 UTC
Think we need a reproducer with the sample or test C++ agent in a pkg we can try in the pool.

Comment 7 Pete MacKinnon 2010-05-26 15:31:26 UTC
condor_job_server failure trace...

gdb) where
#0  qpid::Url::parse (this=<value optimized out>, url=<value optimized out>)
    at qpid/Url.cpp:211
#1  0x00000031a9c5b10c in qpid::client::ConnectionHandler::openOk (
    this=<value optimized out>, knownBrokers=<value optimized out>)
    at ../include/qpid/Url.h:54
#2  0x00000031a9763352 in qpid::framing::AMQP_ClientOperations::ConnectionHandler::Invoker::visit (this=<value optimized out>, body=<value optimized out>)
    at qpid/framing/ConnectionOpenOkBody.h:63
#3  0x00000031a9c60244 in qpid::framing::invoke<qpid::framing::AMQP_ClientOperations::ConnectionHandler> (target=<value optimized out>, 
    body=<value optimized out>) at qpid/framing/Invoker.h:80
#4  0x00000031a9c5b907 in qpid::client::ConnectionHandler::incoming (
    this=<value optimized out>, frame=<value optimized out>)
    at qpid/client/ConnectionHandler.cpp:105
#5  0x00000031a9cb07bd in qpid::client::TCPConnector::decode (
    this=<value optimized out>, buffer=<value optimized out>, 
    size=<value optimized out>) at qpid/client/TCPConnector.cpp:299
#6  0x00000031a9cacea0 in qpid::client::TCPConnector::readbuff (
    this=<value optimized out>, aio=<value optimized out>, 
    buff=<value optimized out>) at qpid/client/TCPConnector.cpp:267
#7  0x00000031a9720fca in boost::function2<void, qpid::sys::AsynchIO&, qpid::sys::AsynchIOBufferBase*, std::allocator<boost::function_base> >::operator() (
    this=<value optimized out>, a0=<value optimized out>, 
    a1=<value optimized out>)
    at /usr/include/boost/function/function_template.hpp:576
#8  0x00000031a971e753 in qpid::sys::posix::AsynchIO::readable (
    this=<value optimized out>, h=<value optimized out>)
    at qpid/sys/posix/AsynchIO.cpp:418
#9  0x00000031a97f6067 in boost::function1<void, qpid::sys::DispatchHandle&, std::allocator<boost::function_base> >::operator() (this=<value optimized out>, 
    a0=<value optimized out>)
    at /usr/include/boost/function/function_template.hpp:576
#10 0x00000031a97f2637 in qpid::sys::DispatchHandle::processEvent (
    this=<value optimized out>, type=<value optimized out>)
    at qpid/sys/DispatchHandle.cpp:278
#11 0x00000031a972c9ff in qpid::sys::Poller::run (this=<value optimized out>)
    at qpid/sys/Poller.h:123
#12 0x00000031a972358a in runRunnable (p=<value optimized out>)
    at qpid/sys/posix/Thread.cpp:35
#13 0x000000320f806617 in start_thread () from /lib64/libpthread.so.0
#14 0x000000320ecd3c2d in clone () from /lib64/libc.so.6

Comment 8 Pete MacKinnon 2010-06-02 22:35:45 UTC
For some reason the Address object is not added to the URL vector. Seems like some kind of initialization gap that doesn't affect the plug-ins which are dlopened.

Comment 9 Pete MacKinnon 2010-06-03 16:20:07 UTC
Created attachment 419436 [details]
condor_job_server connection failure trace

Note the absence of the line found in the good trace...

2010-06-02 12:20:47 debug Created IO thread: 0

Comment 10 Pete MacKinnon 2010-06-03 16:21:26 UTC
Created attachment 419438 [details]
condor_master plug-in successful connection

Comment 11 Alan Conway 2010-06-04 15:34:03 UTC
Fix in trunk r951441, release repo
http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=4b96352abe88aa513434d59812a6d44679d2acd8

Was not able to reproduce independently, fix was tested directly on the system that showed the problem.

Comment 12 Pete MacKinnon 2010-06-04 16:00:14 UTC
Fix looks good to me. No error on connection to broker by condor daemon.

Comment 14 Frantisek Reznicek 2010-06-16 13:51:18 UTC
The issue has been fixed, tested on RHEL 4.8 / 5.5, i386 / x86_64 on packages:

python-qmf-0.7.946106-3.el5
python-qpid-0.7.946106-1.el5
qmf-0.7.946106-3.el5
qmf-devel-0.7.946106-3.el5
qpid-cpp-client-0.7.946106-3.el5
qpid-cpp-client-devel-0.7.946106-3.el5
qpid-cpp-client-devel-docs-0.7.946106-3.el5
qpid-cpp-client-rdma-0.7.946106-3.el5
qpid-cpp-client-ssl-0.7.946106-3.el5
qpid-cpp-mrg-debuginfo-0.7.946106-3.el5
qpid-cpp-server-0.7.946106-3.el5
qpid-cpp-server-cluster-0.7.946106-3.el5
qpid-cpp-server-devel-0.7.946106-3.el5
qpid-cpp-server-rdma-0.7.946106-3.el5
qpid-cpp-server-ssl-0.7.946106-3.el5
qpid-cpp-server-store-0.7.946106-3.el5
qpid-cpp-server-xml-0.7.946106-3.el5
qpid-java-client-0.7.946106-3.el5
qpid-java-common-0.7.946106-3.el5
qpid-tools-0.7.946106-4.el5
rh-qpid-cpp-tests-0.7.946106-3.el5
ruby-qmf-0.7.946106-3.el5


-> VERIFIED


Note You need to log in before you can comment on or make changes to this bug.