Bug 229516 - BUG: spinlock lockup on CPU#3, events/3/17, ffff81007d611ee8 (Not tainted)
Summary: BUG: spinlock lockup on CPU#3, events/3/17, ffff81007d611ee8 (Not tainted)
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 6
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-02-21 18:05 UTC by Orion Poplawski
Modified: 2007-11-30 22:11 UTC
CC: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-10-17 19:55:33 UTC


Attachments
var log messages from March 25th 2007 where Kernel error occurs. (deleted)
2007-03-26 02:50 UTC, James Collins

Description Orion Poplawski 2007-02-21 18:05:07 UTC
Description of problem:

Machine locked up.  On console:

BUG: spinlock lockup on CPU#3, events/3/17, ffff81007d611ee8 (Not tainted)
Call Trace:
  show_trace+0x34/0x47
  dump_stack+0x12/0x17
  _raw_spin_lock+0xc4/0xe5
  __drain_alien_cache+0x2d/0x71
  cache_reap+0x244/0x251
  run_workqueue+0xb2/0x10c
  worker_thread+0xec/0x11e
  kthread+0xd0/0x100
  child_rip+0xa/0x12

This is with 2.6.19-1.2288.fc5debug

Above this was the tail end of another trace of some sort, perhaps a recursive
lockdep message.  It didn't get flushed to /var/log/messages either.

  :xfs:xfs_iflush+0x168/0x51e
  :xfs:xfs_inode_flush+0x13c/0x165
  :xfs:xfs_fs_write_inode+0x28/0x50
  __writeback_single_inode+0x213/0x352
  sync_sb_inodes+0x1c1/0x282
  writeback_inodes+0x93/0xee
  background_writeout+0x82/0xb5
  pdflush+0x134/0x1df
  kthread+0xd0/0x100
  child_rip+0xa/0x12

Dual Opteron 275, Tyan S2882-D MB

Machine has been locking up and spontaneously rebooting.  Had gotten some EDAC
chipkill errors before, but that seems to be gone after removing one on the DIMMs.

Comment 1 Chuck Ebbert 2007-02-21 19:24:40 UTC
Are you using XFS on top of LVM and/or RAID?


Comment 2 Orion Poplawski 2007-02-21 20:28:22 UTC
(In reply to comment #1)
> Are you using XFS on top of LVM and/or RAID?
> 

Nope, straight partition.

# mount
/dev/hdc6 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/hdc1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
/dev/hdc7 on /scratch type xfs (rw)  *********
/dev/hdc3 on /usr type ext3 (rw)
/dev/hdc5 on /var type ext3 (rw)
tmpfs on /tmp type tmpfs (rw)


Do you think xfs could be causing the lockups?  Could easily make it an ext3 fs
instead...
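
As an aside, the xfs mounts can be pulled out of a mount table mechanically rather than by eye. A minimal sketch (not from the original report) that filters a /proc/mounts-style table, where field 3 is the filesystem type:

```shell
# Print the mount points of all xfs filesystems.  Defaults to the
# live /proc/mounts, but accepts any file in the same format.
list_xfs_mounts() {
    awk '$3 == "xfs" { print $2 }' "${1:-/proc/mounts}"
}
```

Run against the equivalent of the table above, this would print only /scratch.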

Comment 3 Orion Poplawski 2007-02-22 18:24:19 UTC
System ran fine overnight with a couple CPU intensive loads.  Started up a
memory intensive load that caused swapping and it crashed fairly quickly (few
minutes).

I'm going to reinstall with FC6 and no xfs /scratch partition and see what
happens there.

Screen from today's crash:

  show_trace+0x34/0x47
  dump_stack+0x12/0x17
  _raw_spin_lock+0xc4/0xe5
  cache_alloc_refill+0x8b/0x219
  kmem_cache_alloc+0xb1/0x107
  copy_process+0xa8/0x1748
  do_fork+0xb41/0x1af
  ptregscall_common+0x67/0xb0
  system_call+0x7e/0x83

BUG: spinlock lockup on CPU#2, events/2/16 .... (Not tainted)

Call Trace:
  show_trace+0x34/0x47
  dump_stack+0x12/0x17
  _raw_spin_lock+0xc4/0xe5
  drain_freelist+0x29/0xa9
  cache_reap+0x12d/0x251
  run_workqueue+0xb2/0x10c
  worker_thread+0xec/0x11e
  kthread+0xd0/0x100
  child_rip+0xa/0x12

Tried to collect vmcore with kdump but ran out of space.  Will make sure there
is enough next time...

Comment 4 Orion Poplawski 2007-02-22 20:44:41 UTC
With FC6:

Feb 22 12:51:37 apapane kernel: BUGging on (!spin_is_locked(&cachep->nodelists[({ typeof(_proxy_pda.nodenumber) ret__; switch (sizeof(_proxy_pda.nodenumber)) { case 2: asm("mov" "w %%gs:%c1,%0" : "=r" (ret__) : "i" (__builtin_offsetof(struct x8664_pda,nodenumber)), "m" (_proxy_pda.nodenumber)); break; case 4: asm("mov" "l %%gs:%c1,%0": "=r" (ret__): "i" (__builtin_offsetof(struct x8664_pda,nodenumber)), "m" (_proxy_pda.nodenumber)); break; case 8: asm("mov" "q %%gs:%c1,%0": "=r" (ret__) : "i" (__builtin_offsetof(struct x8664_pda,nodenumber)), "m" (_proxy_pda.nodenumber)); break; default: __bad_pda_field(); } ret__; })]->list_lock))
Feb 22 12:51:37 apapane kernel: ----------- [cut here ] --------- [please bite here ] ---------
Feb 22 12:51:37 apapane kernel: Kernel BUG at mm/slab.c:2350
Feb 22 12:51:37 apapane kernel: invalid opcode: 0000 [1] SMP
Feb 22 12:51:37 apapane kernel: last sysfs file: /devices/pci0000:00/0000:00:06.0/irq
Feb 22 12:51:37 apapane kernel: CPU 3
Feb 22 12:51:37 apapane kernel: Modules linked in: nfs lockd nfs_acl autofs4 w83627hf eeprom lm85 hwmon_vid i2c_isa sunrpc ipv6 cpufreq_ondemand dm_mirror dm_multipath dm_mod video sbs i2c_ec button battery asus_acpi ac parport_pc lp parport amd_rng i2c_amd8111 i2c_amd756 k8_edac pcspkr shpchp i2c_core edac_mc k8temp hwmon e100 mii serio_raw tg3 ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Feb 22 12:51:37 apapane kernel: Pid: 2744, comm: cupsd Not tainted 2.6.19-1.2911.fc6debug #1
Feb 22 12:51:37 apapane kernel: RIP: 0010:[<ffffffff8025d136>]  [<ffffffff8025d136>] cache_alloc_refill+0x112/0x219
Feb 22 12:51:37 apapane kernel: RSP: 0018:ffff8100bfa4ba08  EFLAGS: 00010086
Feb 22 12:51:37 apapane kernel: RAX: 0000000000000263 RBX: 0000000000000000 RCX: ffffffff80586390
Feb 22 12:51:37 apapane kernel: RDX: ffff8100bfec7100 RSI: 0000000000000000 RDI: ffffffff80586360
Feb 22 12:51:37 apapane kernel: RBP: ffff8100bfc81c78 R08: ffffffff80586390 R09: ffffffff80497c39
Feb 22 12:51:37 apapane kernel: R10: ffff81005a3ab590 R11: ffff810037e90300 R12: ffff810075f76da0
Feb 22 12:51:37 apapane kernel: R13: ffff810037f5e5c0 R14: ffff810037ec4700 R15: 000000000000000c
Feb 22 12:51:37 apapane kernel: FS:  00002aaaaefce560(0000) GS:ffff8100bff54470(0000) knlGS:0000000000000000
Feb 22 12:51:37 apapane kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 22 12:51:37 apapane kernel: CR2: 0000000012331000 CR3: 00000000bece7000 CR4: 00000000000006e0
Feb 22 12:51:37 apapane kernel: Process cupsd (pid: 2744, threadinfo ffff8100bfa4a000, task ffff8100bfec7100)
Feb 22 12:51:38 apapane kernel: Stack:  0000000000000850 0000000000000000 ffff810037f5e5c0 0000000000000246
Feb 22 12:51:38 apapane kernel:  0000000000000850 ffffffff88022c3d 0000000000000001 ffffffff8020abf8
Feb 22 12:51:38 apapane kernel:  0000000000000000 ffff81007f7e93b0 ffff810069bad6d0 0000000000000000
Feb 22 12:51:39 apapane kernel: Call Trace:

Seems to point to hardware corruption?  How else would we generate an invalid
opcode?  memtest86 runs fine with no errors.

Comment 5 Orion Poplawski 2007-02-22 23:54:46 UTC
Downgraded to 2.6.18-1.2798.fc6 and the system is much more stable now.  Need to
let it run for a few days to be positive, but before it was crashing in minutes.
Up for 3 hours now.

Comment 6 Chuck Ebbert 2007-02-23 14:27:02 UTC
(In reply to comment #4)
> With FC6:
> 
> <SNIP>
> Feb 22 12:51:39 apapane kernel: Call Trace:

Where is the call trace from the syslog? Is it missing, or did you
just not post the entire oops message?

> seems to point to hardware corruption?  How else would we generate an invalid
> opcode?  memtest86 does run fine with no errors.

We deliberately execute an invalid instruction to force the oops message
when we hit BUG().

Comment 7 Orion Poplawski 2007-02-23 15:43:51 UTC
(In reply to comment #6)
> 
> Where is the call trace from the syslog? Is it missing, or did you
> just not post the entire oops message?

It was missing from syslog.  alt-sysrq-c didn't trigger the kdump kernel either.
 Machine is still happily running this morning with the older kernel.
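
As a side note, whether SysRq-C can hand off to a kdump kernel at all is visible beforehand in /sys/kernel/kexec_crash_loaded ("1" means a crash kernel is loaded). A small sketch (not from the original report), with the path parameterized so it can be exercised against any file:

```shell
# Succeeds iff a crash (kdump) kernel is loaded, i.e. the sysfs
# flag reads "1"; Alt-SysRq-C can only dump when this holds.
kdump_ready() {
    [ "$(cat "${1:-/sys/kernel/kexec_crash_loaded}" 2>/dev/null)" = "1" ]
}
```

Typical use: `kdump_ready || echo "crash kernel not loaded"` before relying on a panic to capture a vmcore.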

Comment 8 Orion Poplawski 2007-02-26 17:08:58 UTC
System has been running fine for almost 4 days now on the old kernel.  Looks
like it may be a kernel issue.  Let me know if there is anything else I can send
that would be useful.  Unfortunately, it's in heavy use at the moment, but we
may be able to do some testing later.

Comment 9 Orion Poplawski 2007-03-09 16:30:38 UTC
System will crash with old kernel (2.6.18-1.2798.fc6), though not as frequently.
 Console will lock up and it will start spewing *something* out of the gigE
ports that shuts down those networks.  Hooked up a serial console and got *tons*
of the following:

BUG: soft lockup detected on CPU#2!

Call Trace:
 [<ffffffff8026929b>] show_trace+0x34/0x47
 [<ffffffff802692c0>] dump_stack+0x12/0x17
 [<ffffffff802b4900>] softlockup_tick+0xdb/0xf6
 [<ffffffff80293837>] update_process_times+0x42/0x68
 [<ffffffff80273e48>] smp_local_timer_interrupt+0x23/0x47
 [<ffffffff80274504>] smp_apic_timer_interrupt+0x41/0x47
 [<ffffffff8025cb82>] apic_timer_interrupt+0x66/0x6c
DWARF2 unwinder stuck at apic_timer_interrupt+0x66/0x6c
Leftover inexact backtrace:
 <IRQ>  <EOI>  [<ffffffff80400d55>] pci_conf1_read+0x0/0xc0
 [<ffffffff80272a5c>] __smp_call_function+0x64/0x8b
 [<ffffffff8026f4fe>] mcheck_check_cpu+0x0/0x2f
 [<ffffffff80272b0f>] smp_call_function+0x32/0x49
 [<ffffffff8026f4fe>] mcheck_check_cpu+0x0/0x2f
 [<ffffffff8028e4b2>] on_each_cpu+0x10/0x22
 [<ffffffff8026e86f>] mcheck_timer+0x1c/0x6c
 [<ffffffff8024bf5f>] run_workqueue+0x9a/0xed
 [<ffffffff8024884a>] worker_thread+0x0/0x122
 [<ffffffff8024884a>] worker_thread+0x0/0x122
 [<ffffffff8024893a>] worker_thread+0xf0/0x122
 [<ffffffff80287150>] default_wake_function+0x0/0xe
 [<ffffffff80232843>] kthread+0xf6/0x12a
 [<ffffffff8025cea5>] child_rip+0xa/0x11
 [<ffffffff8023274d>] kthread+0x0/0x12a
 [<ffffffff8025ce9b>] child_rip+0x0/0x11



Comment 10 James Collins 2007-03-26 02:50:43 UTC
Created attachment 150867 [details]
var log messages from March 25th 2007 where Kernel error occurs.

Comment 11 James Collins 2007-03-26 02:54:41 UTC
Upgrading to kernel-xen-2.6.20-1.2933.fc6 caused very flaky operation on a
PostgreSQL server. Errors started at 21:02:07 with "kernel: no vm86_info: BAD".
This caused postmaster to roll back, and during that process the machine
spontaneously rebooted, so I didn't get any more error info. Something else
happened the next morning at 1:20 AM, but there was no error information in
the log; it just rebooted. Postgres died again at 2:04 AM, kernel errors
streamed by on the screen, eventually locking solid with "BUG: spinlock lockup
on CPU#0". Reverted to the previous kernel, kernel-xen-2.6.19-1.2911.6.5.fc6,
and since then everything has been smooth sailing.



Comment 12 Orion Poplawski 2007-10-17 19:55:33 UTC
System has been running RHEL now for 73 days without issue.  Closing.

