Bug 1062094 - lrmd eats memory, dumps core
Summary: lrmd eats memory, dumps core
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: pacemaker
Version: 19
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Andrew Beekhof
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-02-06 06:23 UTC by lav
Modified: 2015-02-18 11:15 UTC (History)
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-18 11:15:03 UTC



Description lav 2014-02-06 06:23:12 UTC
Description of problem:
Occasionally lrmd starts to eat memory, which causes paging. corosync then complains about not being scheduled for several seconds, and other bad things happen with the cluster, e.g. op monitor times out or fails. On the other cluster node, lrmd starts to abort and dump core continuously, filling up /var/lib/pacemaker/cores.

Version-Release number of selected component (if applicable):
pacemaker-1.1.9-3.fc19.x86_64


How reproducible:
randomly

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 lav 2014-02-06 06:30:23 UTC
Here is a sample of lrmd stack trace:

#0  0xb7709424 in __kernel_vsyscall ()
#1  0x4fef8936 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#2  0x4fefa173 in __GI_abort () at abort.c:90
#3  0x417ad50f in crm_abort (file=file@entry=0x417d4b59 "logging.c", 
    function=function@entry=0x417d5dca <__PRETTY_FUNCTION__.20433> "crm_glib_handler", line=line@entry=63, 
    assert_condition=assert_condition@entry=0x996e670 "g_error_free: assertion `error != NULL' failed", do_core=do_core@entry=1, do_fork=do_fork@entry=1)
    at utils.c:1118
#4  0x417cb50e in crm_glib_handler (
    log_domain=log_domain@entry=0x412794ee "GLib", 
    flags=flags@entry=G_LOG_LEVEL_CRITICAL, 
    message=message@entry=0x996e670 "g_error_free: assertion `error != NULL' failed", user_data=user_data@entry=0x0) at logging.c:63
#5  0x41213977 in g_logv (log_domain=log_domain@entry=0x412794ee "GLib", 
    log_level=log_level@entry=G_LOG_LEVEL_CRITICAL, 
    format=format@entry=0x41281a86 "%s: assertion `%s' failed", 
    args=args@entry=0xbf8ff69c "\025\317'A\352\315'A") at gmessages.c:949
#6  0x41213b34 in g_log (log_domain=log_domain@entry=0x412794ee "GLib", 
    log_level=log_level@entry=G_LOG_LEVEL_CRITICAL, 
    format=format@entry=0x41281a86 "%s: assertion `%s' failed")
    at gmessages.c:1010
#7  0x41213b7e in g_return_if_fail_warning (
    log_domain=log_domain@entry=0x412794ee "GLib", 
    pretty_function=pretty_function@entry=0x4127cf15 <__PRETTY_FUNCTION__.3351> "g_error_free", expression=expression@entry=0x4127cdea "error != NULL")
    at gmessages.c:1019
#8  0x411f7952 in g_error_free (error=0x0) at gerror.c:474
#9  0x419563f5 in systemd_unit_exec (op=0x9997b70, synchronous=0)
    at systemd.c:481
#10 0x4195094d in services_action_async (op=op@entry=0x9997b70, 
    action_callback=0x0, action_callback@entry=0x804c3f0 <action_complete>)
    at services.c:410
#11 0x0804bf00 in lrmd_rsc_execute_service_lib (cmd=0x9a3ff00, rsc=0x9995d70)
    at lrmd.c:879
#12 lrmd_rsc_execute (rsc=0x9995d70) at lrmd.c:941
#13 lrmd_rsc_dispatch (user_data=0x9995d70) at lrmd.c:950
#14 0x417c98c1 in crm_trigger_dispatch (source=source@entry=0x9995cf8, 
    callback=0x804bb90 <lrmd_rsc_dispatch>, userdata=0x9995cf8)
    at mainloop.c:105
#15 0x4120c0e6 in g_main_dispatch (context=0x996e380, context@entry=0x996b7b0)
    at gmain.c:3054
#16 g_main_context_dispatch (context=context@entry=0x996e380) at gmain.c:3630
#17 0x4120c498 in g_main_context_iterate (context=0x996e380, 
    block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>)
    at gmain.c:3701
#18 0x4120c913 in g_main_loop_run (loop=0x996fd88) at gmain.c:3895
#19 0x08049b00 in main (argc=1, argv=0xbf8ff9e4) at main.c:314

Comment 2 Andrew Beekhof 2014-02-17 05:25:54 UTC
"occasionally lrmd starts to eat memory which causes paging"

Maybe this is indeed happening, but you've also not shown it here.

Comment 3 lav 2014-02-17 07:45:21 UTC
(In reply to Andrew Beekhof from comment #2)
> "occasionally lrmd starts to eat memory which causes paging"
> 
> Maybe this is indeed happening, but you've also not shown it here.

I have found out that the primary reason for my cluster problems was a glitch in Linux kernel 3.12.6, which caused scheduling starvation during disk I/O. It was fixed in Linux kernel 3.12.7.

But the problem with core dumps still persists.

Also, the lrmd process size seems to grow slowly on some nodes.

cluster1 on i686:
 1018 root      20   0 1842620 964184   1396 S   0.0 49.4  33:34.83 lrmd
 1066 root      20   0   21804   3952   3224 S   0.0  0.2   0:22.51 lrmd

cluster2 on x86_64:
 9003 root      20   0  942496 844776   2624 S   0.0 10.3   9:29.38 lrmd
16404 root      20   0  580140 193480   1744 S   0.0  2.4   0:43.82 lrmd

cluster3 on i686:
18013 root      20   0 2108488 1.698g   2028 S   3.0 57.4  64:40.06 lrmd
17240 root      20   0  555852 538896   3196 S   0.0 17.4   4:19.56 lrmd

As you can see, the process size varies greatly and for no apparent reason. I see a correlation with CPU time used; the smaller process on each cluster was probably restarted later due to a core dump.
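
To make the correlation between resident size and CPU time easier to check, here is a rough Python sketch that normalises the top lines above. It assumes the default top field order (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND), that unsuffixed memory values are in KiB, and that TIME+ is minutes:seconds.hundredths. The helper names (to_kib, cpu_seconds, parse_top_line) are made up for this illustration and are not part of any pacemaker tooling.

```python
# Sketch only: parse `top` lines and compare resident memory with CPU time.
# Assumes default top columns and KiB as the unit for unsuffixed values.

SUFFIX_TO_KIB = {"k": 1, "m": 1024, "g": 1024 ** 2, "t": 1024 ** 3}


def to_kib(field):
    """Convert a top memory field such as '964184' or '1.698g' to KiB."""
    suffix = field[-1].lower()
    if suffix in SUFFIX_TO_KIB:
        return float(field[:-1]) * SUFFIX_TO_KIB[suffix]
    return float(field)  # no suffix: assumed to be KiB already


def cpu_seconds(field):
    """Convert a TIME+ value such as '33:34.83' to seconds."""
    minutes, seconds = field.split(":")
    return int(minutes) * 60 + float(seconds)


def parse_top_line(label, line):
    """Pick PID, RES, TIME+ and COMMAND out of one top line."""
    f = line.split()
    return {
        "cluster": label,
        "pid": int(f[0]),
        "res_kib": to_kib(f[5]),
        "cpu_s": cpu_seconds(f[10]),
        "command": f[11],
    }


# The samples quoted in this comment, labelled by cluster.
SAMPLES = [
    ("cluster1", " 1018 root      20   0 1842620 964184   1396 S   0.0 49.4  33:34.83 lrmd"),
    ("cluster1", " 1066 root      20   0   21804   3952   3224 S   0.0  0.2   0:22.51 lrmd"),
    ("cluster2", " 9003 root      20   0  942496 844776   2624 S   0.0 10.3   9:29.38 lrmd"),
    ("cluster2", "16404 root      20   0  580140 193480   1744 S   0.0  2.4   0:43.82 lrmd"),
    ("cluster3", "18013 root      20   0 2108488 1.698g   2028 S   3.0 57.4  64:40.06 lrmd"),
    ("cluster3", "17240 root      20   0  555852 538896   3196 S   0.0 17.4   4:19.56 lrmd"),
]

rows = [parse_top_line(label, line) for label, line in SAMPLES]

# Print each process ordered by accumulated CPU time within its cluster.
for r in sorted(rows, key=lambda r: (r["cluster"], r["cpu_s"])):
    print(f"{r['cluster']} pid {r['pid']:>5}: "
          f"{r['res_kib'] / 1024:8.1f} MiB RES after {r['cpu_s']:8.2f} s CPU")
```

Within each cluster the process with more CPU time also has the larger resident size, which fits the restart theory but does not prove a leak on its own; periodic samples of a single PID would be more convincing.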

Comment 4 Andrew Beekhof 2014-02-18 03:12:03 UTC
Ouch. Those numbers are pretty bad.

Happily, I just submitted a build of 1.1.11 to Bodhi, which git reliably informs me has plenty of memory leak fixes:

# git slog Pacemaker-1.1.9..Pacemaker-1.1.11 | grep leak | wc -l
41

Do you think you could give that a go?

Comment 5 lav 2014-03-05 07:51:52 UTC
pacemaker-1.1.11-1.fc21.i686 does not work for me:

Mar  5 11:04:56 bbox1 kernel: [  231.893990] lrmd[1944]: segfault at 0 ip b7756964 sp bfba5020 error 4 in libcrmservice.so.1.0.0[b774c000+10000]
Mar  5 11:04:56 bbox1 pacemakerd[1837]:    error: child_death_dispatch: Managed process 1944 (lrmd) dumped core

and all sorts of weird problems follow.
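
For anyone triaging the kernel line above, the following Python sketch decodes it. It relies on the usual x86 page-fault error code bits (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode) and subtracts the mapping base from the instruction pointer to get an offset into the library. The function name decode_segfault is made up for this illustration, and the offset is only meaningful for symbol lookup if the reported mapping starts at the beginning of the shared object, which is an assumption here.

```python
import re

# Sketch only: decode a kernel "segfault at ..." log line on x86.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return the decoded fields of a segfault log line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"))
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "command": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "object": m.group("obj"),
        "offset": ip - base,                 # offset of the faulting instruction in the mapping
        "page_present": bool(err & 0x1),     # False: the page was not mapped at all
        "write": bool(err & 0x2),            # False: the access was a read
        "user_mode": bool(err & 0x4),        # True: fault happened in user space
    }


LINE = ("Mar  5 11:04:56 bbox1 kernel: [  231.893990] lrmd[1944]: segfault at 0 "
        "ip b7756964 sp bfba5020 error 4 in libcrmservice.so.1.0.0[b774c000+10000]")

info = decode_segfault(LINE)
print(f"{info['command']}[{info['pid']}] faulted at address {info['fault_addr']:#x} "
      f"in {info['object']} at offset {info['offset']:#x}")
print(f"user mode: {info['user_mode']}, write: {info['write']}, "
      f"page present: {info['page_present']}")
```

Read this way, the line describes a user-mode read of address 0, i.e. most likely a NULL pointer dereference inside libcrmservice. With the pacemaker-debuginfo package installed, something like `addr2line -e` on the library with the computed offset may point at the source line, subject to the mapping assumption above.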

Comment 6 Andrew Beekhof 2014-03-06 23:17:38 UTC
Did you grab the new libqb too?

Comment 7 lav 2014-03-07 09:40:34 UTC
I also get this error with libqb-0.17.0-2.fc21.i686. Are there any other dependencies? My current installation is fc19; I have tried updating pacemaker and now libqb as well.

Comment 8 Andrew Beekhof 2014-03-11 22:36:07 UTC
Could you run crm_report for the period starting prior to the lrmd dumping core?
It will grab the necessary logs and backtraces we need.

Installing the pacemaker-debuginfo and libqb-debuginfo packages first would definitely help too.

Comment 9 Fedora End Of Life 2015-01-09 22:37:01 UTC
This message is a notice that Fedora 19 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 19. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained. Approximately 4 (four) weeks from now this bug will
be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 19 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 10 Fedora End Of Life 2015-02-18 11:15:03 UTC
Fedora 19 changed to end-of-life (EOL) status on 2015-01-06. Fedora 19 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

