Bug 455747 - Oops when running oprofile
Summary: Oops when running oprofile
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: beta
Hardware: x86_64
OS: All
Priority: low
Severity: medium
Target Milestone: 1.0.1
Target Release: ---
Assignee: Red Hat Real Time Maintenance
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-07-17 15:20 UTC by IBM Bug Proxy
Modified: 2008-08-26 19:57 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-08-26 19:57:37 UTC


Attachments (Terms of Use)
Patch that introduces oops with oprofile (deleted)
2008-07-22 06:10 UTC, IBM Bug Proxy
Initial patch from peterz (deleted)
2008-07-22 23:30 UTC, IBM Bug Proxy
updated patch (deleted)
2008-07-23 16:53 UTC, Clark Williams


Links
System ID Priority Status Summary Last Updated
IBM Linux Technology Center 46482 None None None Never
Red Hat Product Errata RHSA-2008:0585 normal SHIPPED_LIVE Important: kernel security and bug fix update 2008-08-26 19:56:57 UTC

Description IBM Bug Proxy 2008-07-17 15:20:23 UTC
=Comment: #0=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-17 04:34 EDT
Problem description:

On starting oprofile on the latest R2 kernel,

#opcontrol --setup --vmlinux=/usr/lib/debug/lib/modules/2.6.24.7-72ibmrt2.5/vmlinux
#opcontrol --start 

I get a kernel oops. I tried several times, and each time I got a different oops
(though the origin might be the same and it only looks different). Sadly, I could
not capture a dump: the kdump kernel starts booting and then drops into a shell,
and a manual copy of the core runs out of memory. Pasting a few of the
oopses here:


rt-oak.austin.ibm.com login: Unable to handle kernel paging request at
fffffffffffffff8 RIP:
 [<ffffffff8113be87>] clear_page_c+0x7/0x10
PGD 203067 PUD 204067 PMD 0
Oops: 0002 [1] PREEMPT SMP
CPU 0
Modules linked in: oprofile ipmi_devintf ipmi_si ipmi_msghandler ibm_rtl
dm_multipath scsi_dh iTCO_wdt pata_acpi ata_generic iTCO_vendor_support
i5000_edac edac_core usb_storage serio_raw shpchp bnx2 lp parport_pc parport
ac battery button i2c_core sbs sbshc video output dm_mirror dm_mod xt_tcpudp
ip6t_REJECT ipv6 nfnetlink nf_conntrack_ipv4 xt_state ipt_REJECT x_tables
nf_conntrack_netbios_ns nf_conntrack sunrpc rfcomm hidp l2cap bluetooth
autofs4 sg mptsas mptscsih mptbase scsi_transport_sas pcspkr ata_piix libata
sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Pid: 4311, comm: opcontrol Not tainted 2.6.24.7-72ibmrt2.5 #1
RIP: 0010:[<ffffffff8113be87>]  [<ffffffff8113be87>] clear_page_c+0x7/0x10
RSP: 0018:ffff81042b5f5bb0  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000018f495c0 RCX: 0000000000000003
RDX: 0000000000000000 RSI: 00000000ffffffff RDI: ffff810428c39fe8
RBP: ffff81042b5f5c48 R08: ffff81042fc01d14 R09: ffff81042b5f5bc8
R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000001
R13: ffff81000001b100 R14: ffffe20018f49560 R15: 0000000000000000
FS:  00002b65ebd31f00(0000) GS:ffffffff813ef100(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: fffffffffffffff8 CR3: 0000000428498000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process opcontrol (pid: 4311, threadinfo ffff81042b5f4000, task
ffff81042c9fc140)
Stack:  ffffffff8108a191 ffff81042b5f5bc8 000000448110cd5d ffff81000001d670
 00000000810b9191 00000040000284d0 ffff81000001d678 0000000200000000
 0000000000000000 000000002b5f5c58 ffffffff00000000 0000000100000000
Call Trace:
 [<ffffffff8108a191>] ? get_page_from_freelist+0x496/0x60b
 [<ffffffff8108a3c4>] __alloc_pages+0x68/0x312
 [<ffffffff8109188d>] ? zone_statistics+0x64/0x69
 [<ffffffff81087fc4>] ? put_zone_pcp+0x1f/0x21
 [<ffffffff810a439e>] alloc_pages_current+0xa8/0xb1
 [<ffffffff81089cbe>] get_zeroed_page+0x11/0x4e
 [<ffffffff81092c7d>] __pud_alloc+0x1d/0x7a
 [<ffffffff81095a78>] copy_page_range+0x152/0x6d9
 [<ffffffff81287193>] ? rt_spin_lock+0x9/0xb
 [<ffffffff810a8afb>] ? _slab_irq_disable+0x37/0x5c
 [<ffffffff8103b03d>] copy_process+0xcf8/0x1626
 [<ffffffff8103bb0a>] do_fork+0x75/0x20e
 [<ffffffff81077795>] ? audit_syscall_entry+0x148/0x17e
 [<ffffffff8100c37e>] ? traceret+0x0/0x5
 [<ffffffff8100a630>] sys_clone+0x23/0x25
 [<ffffffff8100c517>] ptregscall_common+0x67/0xb0


<The below is from a different machine>

llm49.in.ibm.com login: Unable to handle kernel paging request at
fffffffffffffff8 RIP:
 [<ffffffff811f598c>] acpi_pm_read+0xc/0x13
PGD 203067 PUD 204067 PMD 0
Oops: 0002 [1] PREEMPT SMP
CPU 0
Modules linked in: oprofile usb_storage i2c_amd756 i2c_core tg3 serio_raw
amd_rng shpchp joydev lp parport_pc parport ac battery button sbs sbshc video
output dm_multipath scsi_dh dm_mirror dm_mod xt_tcpudp ip6t_REJECT ipv6
nfnetlink nf_conntrack_ipv4 xt_state ipt_REJECT x_tables
nf_conntrack_netbios_ns nf_conntrack sunrpc rfcomm hidp l2cap bluetooth
autofs4 sg mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod pcspkr
k8temp hwmon k8_edac edac_core ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.24.7-72ibmrt2.5 #1
RIP: 0010:[<ffffffff811f598c>]  [<ffffffff811f598c>] acpi_pm_read+0xc/0x13
RSP: 0018:ffffffff81504e88  EFLAGS: 00010086
RAX: 00000000088757e5 RBX: 0000000001b89700 RCX: ffff81010f91dd28
RDX: 0000000000002208 RSI: 0000000000000000 RDI: ffffffff81504ed8
RBP: ffffffff81504e88 R08: 0000000000000010 R09: 0000000000000000
R10: ffff810087b0e000 R11: 0000000000000000 R12: ffffffff81504ed8
R13: 0000000000000000 R14: 000006909ad975c0 R15: 0000000000000002
FS:  00002b90582d3f00(0000) GS:ffffffff813ef100(0000) knlGS:00000000f7f116d0
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: fffffffffffffff8 CR3: 000000010f1f2000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff8149a000, task ffffffff813aa7a0)
Stack:  ffffffff81504ea8 ffffffff81056a44 0000000001b89700 ffffffff81504ed8
 ffffffff81504ec8 ffffffff81054983 000006909ad975c0 ffff8100090045a0
 ffffffff81504ee8 ffffffff810549c7 00000000487efa76 00000000363c8f37
Call Trace:
 <IRQ>  [<ffffffff81056a44>] getnstimeofday+0x31/0x88
 [<ffffffff81054983>] ktime_get_ts+0x18/0x4b
 [<ffffffff810549c7>] ktime_get+0x11/0x42
 [<ffffffff810599cb>] tick_program_event+0x2c/0x5a
 [<ffffffff81054db9>] hrtimer_interrupt+0x183/0x1aa
 [<ffffffff8101ffd3>] smp_local_timer_interrupt+0x5a/0x5e
 [<ffffffff81020635>] smp_apic_timer_interrupt+0x3a/0x51
 [<ffffffff8100ada1>] ? default_idle+0x0/0x56
 [<ffffffff8100ce66>] apic_timer_interrupt+0x66/0x70
 <EOI>  [<ffffffff8100ade2>] ? default_idle+0x41/0x56
 [<ffffffff8100abb1>] ? enter_idle+0x22/0x24
 [<ffffffff8100ae90>] ? cpu_idle+0x99/0xf8
 [<ffffffff81283e5e>] ? rest_init+0x82/0x84
 [<ffffffff814a4b68>] ? start_kernel+0x31e/0x329
 [<ffffffff814a4119>] ? _sinittext+0x119/0x120


Code: 0f 97 c0 39 cf 0f 92 c2 21 d0 a8 01 75 bd c9 89 c8 c3 55 48 89 e5 e8 a6
ff ff ff c9 89 c0 c3 55 0f b7 15 78 0f 2a 00 48 89 e5 ed <c9> 25 ff ff ff 00
c3 83 3d 76 34 3c 00 00 55 48 89 e5 75 29 80
RIP  [<ffffffff811f598c>] acpi_pm_read+0xc/0x13
 RSP <ffffffff81504e88>


RIP: 0010:[<ffffffff810bd2b5>]  [<ffffffff810bd2b5>] do_select+0x2af/0x4f2
RSP: 0018:ffff81021003ba68  EFLAGS: 00010202
RAX: 0000000000010092 RBX: 0000000000000300 RCX: 0000000000000304
RDX: ffff81021091e090 RSI: ffff81021003bd44 RDI: 0000000000000012
RBP: ffff81021003bd78 R08: 0000000000020000 R09: ffff81021003ba58
R10: 0000000000000003 R11: 0000000000000003 R12: ffff810208cd4080
R13: 0000000000000012 R14: 0000000000040000 R15: ffff81021003bea8
FS:  00002ac6f7c3c080(0000) GS:ffff810211ae4640(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000100ca CR3: 000000020f5a7000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ntpd (pid: 2660, threadinfo ffff81021003a000, task ffff8102107194c0)
Stack:  ffff81021003bad8 ffff81021003bf40 0000000000000000 00000000080e0a80
 ffff81021003bdc0 ffff81021003bdc8 ffff81021003bdd0 ffff81021003bda8
 ffff81021003bdb0 ffff81021003bdb8 00000000003f0000 0000000000000000
Call Trace:
 [<ffffffff810bd972>] ? __pollwait+0x0/0xdf
 [<ffffffff81033423>] ? default_wake_function+0x0/0x14
 [<ffffffff81033423>] ? default_wake_function+0x0/0x14
 [<ffffffff81033423>] ? default_wake_function+0x0/0x14
 [<ffffffff81033423>] ? default_wake_function+0x0/0x14
 [<ffffffff81033423>] ? default_wake_function+0x0/0x14
 [<ffffffff81033423>] ? default_wake_function+0x0/0x14
 [<ffffffff810a8c60>] ? __cache_free+0x3b/0x203
 [<ffffffff81287193>] ? rt_spin_lock+0x9/0xb
 [<ffffffff81053d8e>] ? enqueue_hrtimer+0xda/0xe8
 [<ffffffff81054753>] ? hrtimer_start+0x136/0x17f
 [<ffffffff810549b1>] ? ktime_get_ts+0x46/0x4b
 [<ffffffff810bd6be>] core_sys_select+0x1c6/0x275
 [<ffffffff81047966>] ? recalc_sigpending+0x12/0x41
 [<ffffffff8100bc83>] ? do_notify_resume+0x71e/0x7db
 [<ffffffff810bdb05>] sys_select+0xb4/0x176
 [<ffffffff8100f8c4>] ? syscall_trace_enter+0xb7/0xbb
 [<ffffffff8100c37e>] traceret+0x0/0x5


Code: 58 fd ff ff 0f 84 b0 00 00 00 48 8d 75 cc 44 89 ef e8 e8 3c ff ff 48 85
c0 49 89 c4 0f 84 93 00 00 00 48 8b 40 68 48 85 c0 74 23 <48> 8b 40 38 48 85
c0 74 1a 31 f6 83 bd 78 fd ff ff 00 4c 89 e7
RIP  [<ffffffff810bd2b5>] do_select+0x2af/0x4f2
 RSP <ffff81021003ba68>



llm49.in.ibm.com login: Unable to handle kernel paging request at
fffffffffffffff8 RIP:
 [<ffffffff81083f18>] find_get_page+0x5a/0x173
PGD 203067 PUD 204067 PMD 0
Oops: 0002 [1] PREEMPT SMP
CPU 0
Modules linked in: oprofile usb_storage i2c_amd756 i2c_core tg3 serio_raw
amd_rng shpchp joydev lp parport_pc parport ac battery button sbs sbshc video
output dm_multipath scsi_dh dm_mirror dm_mod xt_tcpudp ip6t_REJECT ipv6
nfnetlink nf_conntrack_ipv4 xt_state ipt_REJECT x_tables
nf_conntrack_netbios_ns nf_conntrack sunrpc rfcomm hidp l2cap bluetooth
autofs4 sg mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod pcspkr
k8temp hwmon k8_edac edac_core ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Pid: 3357, comm: opcontrol Not tainted 2.6.24.7-72ibmrt2.5 #1
RIP: 0010:[<ffffffff81083f18>]  [<ffffffff81083f18>] find_get_page+0x5a/0x173
RSP: 0000:ffff81020f96dc48  EFLAGS: 00010283
RAX: ffffe2000c63ffbf RBX: ffffe2000c63ffc0 RCX: 000000000000002b
RDX: ffff810210a84368 RSI: ffffe2000c63ffc8 RDI: 0000000000000000
RBP: ffff81020f96dcb8 R08: ffff81021114b480 R09: 0000000000000000
R10: 00002b2f1344ff90 R11: 0000000000000246 R12: ffffe2000c63ffc0
R13: ffff810210a844f8 R14: 000000000000006f R15: ffff810211437840
FS:  00002b2f1344ff00(0000) GS:ffffffff813ef100(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: fffffffffffffff8 CR3: 000000020f5f5000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process opcontrol (pid: 3357, threadinfo ffff81020f96c000, task
ffff81021114b480)
Stack:  ffff81020f96dc58 0000004481087fc4 ffff810000016231 000000008108a196
 00000040001200d2 ffff810000016239 00000002000157f9 0000000000000000
 00000000000284d0 ffff81020c0187a0 ffff810211437840 000000000000006f
Call Trace:
 [<ffffffff810843a5>] find_lock_page+0x1e/0x5d
 [<ffffffff810863e9>] filemap_fault+0x79/0x399
 [<ffffffff81092dae>] __do_fault+0x65/0x3b6
 [<ffffffff81093564>] ? do_wp_page+0x465/0x4e5
 [<ffffffff810949a5>] handle_mm_fault+0x202/0x764
 [<ffffffff812895d0>] do_page_fault+0x3ba/0x76d
 [<ffffffff81077adc>] ? audit_syscall_exit+0x311/0x332
 [<ffffffff812879e9>] error_exit+0x0/0x51


Code: 84 25 01 00 00 48 8b 18 49 83 cc ff f6 c3 01 4c 0f 44 e3 49 8d 44 24 ff
4c 89 e3 48 83 f8 fd 77 cc 41 8b 4c 24 08 49 8d 74 24 08 <85> c9 74 be 8d 41
01 48 63 d1 48 63 f8 48 89 d0 f0 0f b1 3e 39
RIP  [<ffffffff81083f18>] find_get_page+0x5a/0x173
 RSP <ffff81020f96dc48>


I am yet to try it on HS21 and on mainline kernel.
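
For readers of the traces above, a minimal sketch of how the `Oops: 0002` field can be read. The number is the x86 page-fault error code, printed in hexadecimal. The function name `decode_oops_code` and the table `PF_FLAGS` are made up for this illustration; only the bit meanings come from the x86 architecture.

```python
# Bit layout of the x86 page-fault error code (low five bits).
# Each entry: (mask, meaning when set, meaning when clear or None).
PF_FLAGS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]

def decode_oops_code(code_hex):
    """Return human-readable facts for a code such as '0002' (illustrative helper)."""
    code = int(code_hex, 16)
    facts = []
    for mask, if_set, if_clear in PF_FLAGS:
        if code & mask:
            facts.append(if_set)
        elif if_clear is not None:
            facts.append(if_clear)
    return facts

print(decode_oops_code("0002"))
# ['page not present', 'write access', 'kernel mode']
```

Read this way, every oops pasted above is a kernel-mode write to an unmapped page.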
=Comment: #1=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-17 05:15 EDT
Could not reproduce the issue with mainline rt kernel 2.6.25.8-rt7
=Comment: #2=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-17 05:27 EDT
Will quickly try and verify if the issue is reproducible on MRG kernel and will
then mirror bug to RH.
=Comment: #3=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-17 05:54 EDT
As expected, can reproduce the bug with MRG kernel as well. Here is the oops I got:

rt-beech.austin.ibm.com login: Unable to handle kernel paging request at
fffffffffffffff8 RIP: 
 [<ffffffff810b3af1>] copy_strings_kernel+0x32/0x3d
PGD 203067 PUD 204067 PMD 0 
Oops: 0002 [1] PREEMPT SMP 
CPU 0 
Modules linked in: oprofile ipmi_devintf ipmi_si ipmi_msghandler ibm_rtl ipv6
autofs4 i2c_dev i2c_core hidp rfcomm l2cap bluetooth sunrpc dm_mirror
dm_multipath dm_mod video output sbs sbshc battery ac parport_pc lp parport sg
bnx2 button serio_raw k8temp k8_edac edac_core hwmon shpchp pcspkr usb_storage
mptsas mptscsih scsi_transport_sas mptbase sd_mod scsi_mod ext3 jbd mbcache
ehci_hcd ohci_hcd uhci_hcd
Pid: 3416, comm: opcontrol Not tainted 2.6.24.7-72.el5rt #1
RIP: 0010:[<ffffffff810b3af1>]  [<ffffffff810b3af1>] copy_strings_kernel+0x32/0x3d
RSP: 0018:ffff81021fccfeb8  EFLAGS: 00010292
RAX: 0000000000000000 RBX: ffff810000000000 RCX: 0000000000000000
RDX: ffff81021fccffd8 RSI: ffff81022d033009 RDI: ffffe2000cc253e0
RBP: ffff81021fccfec8 R08: 0000000000000001 R09: ffff810000015670
R10: 0000000000000000 R11: 0000000000000002 R12: ffff81022b98f080
R13: 0000000000000080 R14: ffff81022d033000 R15: 000000000071b370
FS:  00002b1da6050f10(0000) GS:ffffffff813ef100(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: fffffffffffffff8 CR3: 000000022cd72000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process opcontrol (pid: 3416, threadinfo ffff81021fcce000, task ffff81022ccf0b60)
Stack:  ffff81021fccfec8 ffff81022c0b1dc0 ffff81021fccff18 ffffffff810b5334
 00000000007293a0 ffff81021fccff58 0000000000720ec0 ffff81022d033000
 000000000071b370 0000000000720ec0 ffff81022d033000 0000000000000000
Call Trace:
 [<ffffffff810b5334>] do_execve+0x108/0x205
 [<ffffffff8100ac30>] sys_execve+0x36/0x8a
 [<ffffffff8100c5c7>] stub_execve+0x67/0xb0


Code: ec 08 65 48 8b 04 25 10 00 00 00 48 8b 98 48 e0 ff ff 48 c7 80 48 e0 ff ff
ff ff ff ff e8 14 fe ff ff 65 48 8b 14 25 10 00 00 00 <48> 89 9a 48 e0 ff ff 59
5b c9 c3 55 48 89 e5 41 57 41 56 41 55 
RIP  [<ffffffff810b3af1>] copy_strings_kernel+0x32/0x3d
 RSP <ffff81021fccfeb8>
CR2: fffffffffffffff8

=Comment: #5=================================================
Ankita Garg <ankigarg@in.ibm.com> - 2008-07-17 07:09 EDT
So this is weird. Every time the machine panics, on the second reboot, it fails
to come up, with the following error:

Waiting on MM for permission to boot.....       CP: CD

I9990306  Timed Out Waiting on MM:System Halted in POST

This is a known issue; however, because of it, the machine uptime is rather limited!

Comment 1 IBM Bug Proxy 2008-07-18 08:40:29 UTC
------- Comment From ankigarg@in.ibm.com 2008-07-18 04:36 EDT-------
Could not reproduce this issue with the alpha01 - R2 (2.6.24.3-29ibmrt2.0)
kernel. So, this issue came in at a later point in time.

Comment 2 IBM Bug Proxy 2008-07-18 09:51:20 UTC
------- Comment From ankigarg@in.ibm.com 2008-07-18 05:40 EDT-------
Can reproduce the bug on alpha18 kernel (after alpha14, I tried alpha18). This
is the -65 kernel. I get the following oops.

rt-ipe.austin.ibm.com login: Unable to handle kernel paging request at
fffffffffffffff8 RIP:
[<ffffffff81139a82>] rb_insert_color+0x6/0xe3
PGD 203067 PUD 204067 PMD 0
Oops: 0002 [1] PREEMPT SMP
CPU 0
Modules linked in: oprofile ipmi_devintf ipmi_si ipmi_msghandler ibm_rtl ipv6
autofs4 i2c_dev i2c_core hidp rfcomm l2cap bluetooth sunrpc dm_mirror
dm_multipath dm_mod video output sbs sbshc battery ac parport_pc lp parport sg
bnx2 button i5000_edac edac_core pata_acpi pcspkr ata_generic iTCO_wdt
iTCO_vendor_support shpchp ata_piix libata mptsas mptscsih scsi_transport_sas
mptbase sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.24.7-65ibmrt2.4 #1
RIP: 0010:[<ffffffff81139a82>]  [<ffffffff81139a82>] rb_insert_color+0x6/0xe3
RSP: 0018:ffffffff81509ee0  EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff810001008780 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff8100010086a8 RDI: ffff810001008780
RBP: ffffffff81509ee8 R08: 0000000000000010 R09: 0000000000000000
R10: ffff81007fb09000 R11: 0000000000000000 R12: ffff810001008698
R13: ffff81042a8d44f0 R14: ffff81042a8d44e0 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffffffff813f3100(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: fffffffffffffff8 CR3: 000000041956e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff8149e000, task ffffffff813ae7a0)
Stack:  0000000000000001 ffffffff81509f18 ffffffff81053e48 ffff810001008780
ffff810001008698 ffff810001008640 7fffffffffffffff ffffffff81509f68
ffffffff81054df7 000000152c077bb2 000000152c077bb2 0000000000000000
Call Trace:
<IRQ>  [<ffffffff81053e48>] enqueue_hrtimer+0xda/0xe8
[<ffffffff81054df7>] hrtimer_interrupt+0x136/0x1ab
[<ffffffff8101ff97>] smp_local_timer_interrupt+0x5a/0x5e
[<ffffffff810205f9>] smp_apic_timer_interrupt+0x3a/0x51
[<ffffffff8100af61>] ? mwait_idle+0x0/0x73
[<ffffffff8100ce46>] apic_timer_interrupt+0x66/0x70
<EOI>  [<ffffffff8100afcf>] ? mwait_idle+0x6e/0x73
[<ffffffff8100abb1>] ? enter_idle+0x22/0x24
[<ffffffff8100ae90>] ? cpu_idle+0x99/0xf8
[<ffffffff8128648e>] ? rest_init+0x82/0x84
[<ffffffff814a8b70>] ? start_kernel+0x31e/0x329
[<ffffffff814a8119>] ? _sinittext+0x119/0x120

Code: 74 12 49 3b 78 08 75 06 49 89 48 08 eb 09 49 89 48 10 eb 03 48 89 0e 48 8b
07 83 e0 03 48 09 c1 48 89 0f c9 c3 55 48 89 e5 41 57 <49> 89 f7 41 56 41 55 49
89 fd 41 54 53 e9 a1 00 00 00 49 89 c4
RIP  [<ffffffff81139a82>] rb_insert_color+0x6/0xe3
RSP <ffffffff81509ee0>
Initializing cgroup subsys cpuset

Comment 3 IBM Bug Proxy 2008-07-18 10:10:43 UTC
------- Comment From ankigarg@in.ibm.com 2008-07-18 06:02 EDT-------
(In reply to comment #11)
> Can reproduce the bug on alpha18 kernel (after alpha14, I tried alpha18). This
> is the -65 kernel. I get the following oops.
>

So luckily this time I was able to capture a dump !! The vmcore can be found
here if anyone wants to take a look:

http://kernel.beaverton.ibm.com/jtcltc/kdump_cores/bz46482/vmcore.bz2

Will investigate the dump and also try another R2 kernel between alpha14 & 18

Comment 4 IBM Bug Proxy 2008-07-18 12:00:46 UTC
------- Comment From ankigarg@in.ibm.com 2008-07-18 07:59 EDT-------
So looks like the issue originated in -65 mrg kernel. On -60, oprofile is
working fine. So, am looking at finding the changelog to see the changes and
also got to look at the kdump.
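
The narrowing-down between kernel builds described in this thread is a bisection. A minimal sketch follows, assuming the builds can be ordered and that every build after the first bad one is also bad. `first_bad_build` and `reproduces` are hypothetical names; the `observed` table only replays results reported in this bug (-60 and -62 work, -65 oopses) instead of booting anything.

```python
def first_bad_build(builds, reproduces):
    """Binary-search an ordered list of builds for the first failing one.

    Assumes builds[0] is known good and builds[-1] is known bad.
    """
    lo, hi = 0, len(builds) - 1      # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if reproduces(builds[mid]):
            hi = mid                 # failure is at mid or earlier
        else:
            lo = mid                 # failure is after mid
    return builds[hi]

# Results reported in this bug; a real run would boot each kernel,
# start oprofile, and watch for an oops (hypothetical procedure).
observed = {60: False, 62: False, 65: True}

def reproduces(build):
    return observed[build]

print(first_bad_build([60, 62, 65], reproduces))
# 65
```

The same loop applies one level down, to the list of patches added between the last good and first bad build.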

Comment 5 IBM Bug Proxy 2008-07-21 11:40:41 UTC
------- Comment From ankigarg@in.ibm.com 2008-07-21 07:36 EDT-------
So I have been trying to figure out which of the patches between -61 and -65
could have led to the issue. Looking at the patches, there could be a few
candidates. I tried removing several patches, but with no gain! Only to realize
that the process I was using to generate kernels with some patches removed was
erroneous!!!!!!! *grrr* :-( Kicking it off again! And in the process, rt-ipe
went down too!

Comment 6 IBM Bug Proxy 2008-07-22 06:10:41 UTC
------- Comment From ankigarg@in.ibm.com 2008-07-22 02:05 EDT-------
I cannot reproduce this issue with the -62 kernel. Also, all the oops seen are
related to "Unable to handle kernel paging request at fffffffffffffff8". Taking
the cue from this, I removed the patch slab-fix-rt-v2.patch from -65 kernel and
built a new kernel. With this patch removed, oprofile is working fine on -65
kernel. So, this could point to this particular patch as faulty. I will spend
some time looking at the patch and the dump. However, it might be faster if Peter
takes a look at it. Attaching the patch here for ref.
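
The observation that every oops faults at the same address can be checked mechanically. A minimal sketch, assuming the console log keeps the header layout seen in this report; the regular expression and the name `group_by_fault_address` are illustrative, and `SAMPLE` is copied from two of the oopses above.

```python
import re
from collections import defaultdict

# Matches the oops header as it appears in this report:
#   "Unable to handle kernel paging request at"
#   "<16 hex digits> RIP:"
#   " [<address>] symbol+0xoff/0xlen"
HEADER = re.compile(
    r"Unable to handle kernel paging request at\s+([0-9a-f]{16}) RIP:\s*"
    r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

def group_by_fault_address(log_text):
    """Map each faulting address to the functions in which it was hit."""
    groups = defaultdict(list)
    for address, symbol in HEADER.findall(log_text):
        groups[address].append(symbol)
    return dict(groups)

SAMPLE = """\
rt-oak.austin.ibm.com login: Unable to handle kernel paging request at
fffffffffffffff8 RIP:
 [<ffffffff8113be87>] clear_page_c+0x7/0x10
llm49.in.ibm.com login: Unable to handle kernel paging request at
fffffffffffffff8 RIP:
 [<ffffffff811f598c>] acpi_pm_read+0xc/0x13
"""

print(group_by_fault_address(SAMPLE))
# {'fffffffffffffff8': ['clear_page_c', 'acpi_pm_read']}
```

One address hit from unrelated functions suggests the faulting code path is not the function named in RIP but something that interrupts it.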

Comment 7 IBM Bug Proxy 2008-07-22 06:10:43 UTC
Created attachment 312310 [details]
Patch that introduces oops with oprofile

Comment 8 IBM Bug Proxy 2008-07-22 08:00:32 UTC
------- Comment From ankigarg@in.ibm.com 2008-07-22 03:51 EDT-------
http://lkml.org/lkml/2008/6/10/133 introduces these patches. So this series
provides fixes for hotplug; not sure how it is affecting us here.

Comment 9 IBM Bug Proxy 2008-07-22 10:42:30 UTC
------- Comment From ankigarg@in.ibm.com 2008-07-22 06:30 EDT-------
Pasting some content from the dump -

# crash /usr/lib/debug/lib/modules/2.6.24.7-65ibmrt2.4/vmlinux /test/ankita/vmcore

crash 4.0-5.0.3
Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: cpu 0 first exception stack: 0
boot_exception_stacks: ffffffff8150a000

KERNEL: /usr/lib/debug/lib/modules/2.6.24.7-65ibmrt2.4/vmlinux
DUMPFILE: /test/ankita/vmcore
CPUS: 8
DATE: Fri Jul 18 01:39:31 2008
UPTIME: 00:01:30
LOAD AVERAGE: 1.66, 0.69, 0.25
TASKS: 257
NODENAME: rt-ipe.austin.ibm.com
RELEASE: 2.6.24.7-65ibmrt2.4
VERSION: #1 SMP PREEMPT RT Fri Jun 6 20:06:47 EDT 2008
MACHINE: x86_64  (3000 Mhz)
MEMORY: 16 GB
PANIC: "Oops: 0002 [1] PREEMPT SMP " (check log for details)
PID: 0
COMMAND: "swapper"
TASK: ffffffff813ae7a0  (1 of 8)  [THREAD_INFO: ffffffff8149e000]
CPU: 0
STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 0      TASK: ffffffff813ae7a0  CPU: 0   COMMAND: "swapper"
#0 [ffffffff81509b80] machine_kexec at ffffffff81022dcd
#1 [ffffffff81509c60] crash_kexec at ffffffff8106abd3
#2 [ffffffff81509d20] __die at ffffffff8128a48b
#3 [ffffffff81509d50] do_page_fault at ffffffff8128bca1
#4 [ffffffff81509e30] error_exit at ffffffff81289e19
[exception RIP: rb_insert_color+6]
RIP: ffffffff81139a82  RSP: ffffffff81509ee0  RFLAGS: 00010046
RAX: 0000000000000000  RBX: ffff810001008780  RCX: 0000000000000001
RDX: 0000000000000000  RSI: ffff8100010086a8  RDI: ffff810001008780
RBP: ffffffff81509ee8   R8: 0000000000000010   R9: 0000000000000000
R10: ffff81007fb09000  R11: 0000000000000000  R12: ffff810001008698
R13: ffff81042a8d44f0  R14: ffff81042a8d44e0  R15: 0000000000000001
ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#5 [ffffffff81509ef0] enqueue_hrtimer at ffffffff81053e48
#6 [ffffffff81509f20] hrtimer_interrupt at ffffffff81054df7
#7 [ffffffff81509f70] smp_local_timer_interrupt at ffffffff8101ff97
#8 [ffffffff81509f90] smp_apic_timer_interrupt at ffffffff810205f9
#9 [ffffffff81509fb0] apic_timer_interrupt at ffffffff8100ce46
--- <IRQ stack> ---
#10 [ffffffff8149fe98] apic_timer_interrupt at ffffffff8100ce46
[exception RIP: mwait_idle+110]
RIP: ffffffff8100afcf  RSP: ffffffff8149ff48  RFLAGS: 00000246
RAX: 0000000000000000  RBX: ffffffff8149ff48  RCX: 0000000000000000
RDX: 0000000000000000  RSI: ffffffff8149e010  RDI: ffffffff813b2230
RBP: 0000000000000001   R8: 0000000000000000   R9: ffff81042e4a9e60
R10: 0000000000000000  R11: ffff81042e4a9e90  R12: ffffffff8128bddc
R13: ffffffff8149fee8  R14: ffff81000100b980  R15: ffff81041954f080
ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#11 [ffffffff8149ff40] enter_idle at ffffffff8100abb1
#12 [ffffffff8149ff50] cpu_idle at ffffffff8100ae90

crash> dis 0xffffffff81139a82
0xffffffff81139a82 <rb_insert_color+6>: mov    %rsi,%r15

crash> dis 0xffffffff81509ee0
0xffffffff81509ee0 <boot_cpu_stack+16096>:      add    %eax,(%rax)

crash> rd boot_cpu_stack
ffffffff81506000:  0000000000000000                    ........

Note that the slab fix patch makes changes to the stack allocation code. Since
oprofile here is using NMI, the oops happens when the switch to the other stack
happens. While I am not sure what the "first exception stack" is supposed to be,
in the WARNING below,

WARNING: cpu 0 first exception stack: 0
boot_exception_stacks: ffffffff8150a000

it appears to be zero. My guess would be that the stack allocation code might
need some attention. I can provide more output from crash if required.
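
A small arithmetic check on the fault address, offered as one possible reading rather than something established in this thread: `fffffffffffffff8` is -8 as a signed 64-bit value, which is the address an 8-byte push would write to if the stack top were 0. That would be consistent with the zero "first exception stack" reported by crash. The helper name `as_signed64` is made up for this sketch.

```python
def as_signed64(hex_address):
    """Interpret a 64-bit hex address as a two's-complement signed integer."""
    value = int(hex_address, 16)
    return value - (1 << 64) if value & (1 << 63) else value

# The address reported in CR2 by every oops in this bug.
print(as_signed64("fffffffffffffff8"))
# -8

# Address touched by an 8-byte push when the stack pointer starts at 0
# (assumption for illustration: the stack top really is 0).
stack_top = 0
print(format((stack_top - 8) % (1 << 64), "016x"))
# fffffffffffffff8
```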

Comment 10 IBM Bug Proxy 2008-07-22 23:30:43 UTC
Created attachment 312406 [details]
Initial patch from peterz

I applied this patch and started oprofile as Ankita did in the opening comment
on an LS21.  The box panic'd within a few seconds.  So this patch doesn't
appear to resolve the issue entirely.

Comment 11 Clark Williams 2008-07-23 16:53:06 UTC
Created attachment 312494 [details]
updated patch

move alloc stacks to trap_init()

Comment 12 IBM Bug Proxy 2008-07-23 19:20:56 UTC
------- Comment From dvhltc@us.ibm.com 2008-07-23 15:19 EDT-------
I've built with the patch in comment #27, and kicked off oprofile per the
opening comment.  Been profiling for several minutes and haven't seen a crash.
I think this is fixed.

Comment 13 IBM Bug Proxy 2008-07-24 00:40:52 UTC
------- Comment From jstultz@us.ibm.com 2008-07-23 20:35 EDT-------
Issue should be fixed in -74, which is available for testing.

Comment 14 IBM Bug Proxy 2008-07-24 04:50:44 UTC
------- Comment From ankigarg@in.ibm.com 2008-07-24 00:42 EDT-------
Yes, I too confirm that the patch fixes this issue. Tested the -74 kernel. So,
marking this fixed.

Comment 18 errata-xmlrpc 2008-08-26 19:57:37 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0585.html

