Bug 584153 - RT tests fail due to watchdog timeout
Summary: RT tests fail due to watchdog timeout
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: Development
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: 1.3
Assignee: Luis Claudio R. Goncalves
QA Contact: David Sommerseth
URL:
Whiteboard:
Duplicates: 596705
Depends On:
Blocks:
 
Reported: 2010-04-20 20:31 UTC by D. Marlin
Modified: 2016-05-22 23:30 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-09-28 18:42:40 UTC
Target Upstream Version:


Attachments
trace from rteval hang on the tyan-gt24-05 (deleted)
2010-06-24 14:19 UTC, Beth Uptagrafft
console from hung dell-per905-01 (deleted)
2010-06-25 14:40 UTC, Beth Uptagrafft
console from hung ibm-hs22-01 (deleted)
2010-06-25 14:41 UTC, Beth Uptagrafft
console from hung tyan-gt24-08 (deleted)
2010-06-25 14:42 UTC, Beth Uptagrafft
console from hung dell-pem610-01 (deleted)
2010-06-25 19:24 UTC, Beth Uptagrafft
consoles for Jun25 hangs on dell-pem610-01, tyan-gt24-08, and ibm-hs22-01 (deleted)
2010-06-28 15:08 UTC, Beth Uptagrafft
consoles for Jun29 hangs on pem610, tyan-gt24-08, hs22, ls42, and per905 (deleted)
2010-06-30 14:35 UTC, Beth Uptagrafft

Comment 1 D. Marlin 2010-04-26 16:16:39 UTC
At Clark's suggestion I've set up another test that adds the --verbose option and only runs for 4 hours.

  tests/kernel/rt-tests/rt_eval_debug

It uses the following command line to run the test:

  rteval --xmlrpc-submit=rtserver.farm.hsv.redhat.com --duration=4h --verbose 

I queued a job on one of the systems known to fail (watchdog timeout), and it did terminate with a timeout:

  https://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=151715

Unfortunately, the logs do not contain any additional information.

Comment 2 D. Marlin 2010-04-29 19:35:32 UTC
We have added more debugging info to the test, including:

  the --debug option
  sysrq's every hour of the test

In addition, Luis has patched rteval to send output to the console to assist in debugging.
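
For reference, a minimal sketch of the hourly sysrq capture (assuming a plain background loop; the actual RHTS test wrapper may differ):

  # dump task state (sysrq-t, as one example) to the console/dmesg
  # once an hour while the test runs
  while :; do sleep 3600; echo t > /proc/sysrq-trigger; done &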

We performed a test on ibm-ls21-7972-01.rhts.eng.bos.redhat.com with this new version, but it did not fail (no watchdog).  

  http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=152894

We have queued several additional jobs running this version to see if it is reproducible.

Comment 3 D. Marlin 2010-04-30 12:12:52 UTC
Results are still coming in, but the following jobs aborted (watchdog):

  http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=153308
  http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=153309
  http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=153320
  

However, the following job completed successfully (no watchdog):

  http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=153318

More results later today.

Comment 4 David Sommerseth 2010-05-27 13:13:24 UTC
*** Bug 596705 has been marked as a duplicate of this bug. ***

Comment 5 Beth Uptagrafft 2010-06-22 15:19:25 UTC
All recent manual rteval runs (8hr, 12hr, 18hr) in the farm (systems not in RHTS) on the following kernels have completed:
	rt23.23 
	rt23.22 
	rt22.21

Comment 6 Beth Uptagrafft 2010-06-22 15:21:29 UTC
June 21, 2010 overnight runs:
- 2.6.33.5-rt22.21.1.el5rt debug kernel
- rteval-1.22-1.el5rt, rt-tests-0.71-1.el5rt, rt-setup-1.7-2.el5rt, rtctl-1.8-2.el5rt
- Red Hat Enterprise Linux Server release 5.5 (Tikanga)
- additional debug
	echo 9 > /proc/sysrq-trigger                          # sysrq '9': set console loglevel to 9
	echo 1 > /proc/sys/kernel/panic_on_io_nmi             # panic on I/O-error NMIs
	echo 1 > /proc/sys/kernel/panic_on_oops               # panic on oops
	echo 1 > /proc/sys/kernel/panic_on_unrecovered_nmi    # panic on unrecovered NMIs
	echo 1 > /proc/sys/kernel/print-fatal-signals         # log fatal signals
	echo 1 > /proc/sys/kernel/unknown_nmi_panic           # panic on unknown NMIs
	echo 900 > /proc/sys/kernel/hung_task_timeout_secs    # report tasks blocked > 15 min
	echo 0 > /proc/sys/kernel/hung_task_panic             # warn only, don't panic on hung tasks
	echo 60 > /proc/sys/kernel/softlockup_thresh          # 60s soft-lockup threshold
	echo 1 > /proc/sys/kernel/softlockup_panic            # panic on soft lockup
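
A quick way to confirm the knobs took effect (a convenience check, not part of the original runs; with multiple files, grep prefixes each value with its filename):

	grep . /proc/sys/kernel/hung_task_timeout_secs \
	       /proc/sys/kernel/softlockup_thresh \
	       /proc/sys/kernel/softlockup_panic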

Tests:
- Hang on a 16hr rteval run on dell-pem610-01. Nothing on the ssh session. Info from the console: http://pastebin.test.redhat.com/25348
Note: scrollback was not set high enough, so that's all the info from the console.

- tyan-gt24-08 hung on a 16hr run. Nothing on the console. Here is all I could see from the ssh session: http://pastebin.test.redhat.com/25349

- A second Tyan system, tyan-gt24-05, hung on a 16hr run with 9hr 19min to go, but there was nothing on the console and nothing over ssh.

- The IBM ls42 and hs22 systems completed their 15hr runs with no problems.

Comment 8 Beth Uptagrafft 2010-06-24 14:19:49 UTC
Created attachment 426589 [details]
trace from rteval hang on the tyan-gt24-05

Comment 9 Beth Uptagrafft 2010-06-24 14:20:55 UTC
June 23, 2010 testing using the 2.6.33.5-rt23.23.1.el5rt debug kernel on tyan-gt24-05.rhts.eng.bos.redhat.com

> rteval --xmlrpc-submit=rtserver.farm.hsv.redhat.com --workdir=./bhu-runs --duration=15h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Wed Jun 23 17:48:19 2010
started 2 loads on 8 cores  with 2 numa nodes
...
rteval time remaining: 0 days, 5 hours, 19 minutes, 30 seconds


Nothing else was printed in the ssh window. See tyan-gt24-05-2010jun23 for the trace from the console.

Comment 11 Beth Uptagrafft 2010-06-25 14:40:19 UTC
Created attachment 426899 [details]
console from hung dell-per905-01

Comment 12 Beth Uptagrafft 2010-06-25 14:41:16 UTC
Created attachment 426900 [details]
console from hung ibm-hs22-01

Comment 13 Beth Uptagrafft 2010-06-25 14:42:21 UTC
Created attachment 426901 [details]
console from hung tyan-gt24-08

Comment 14 Beth Uptagrafft 2010-06-25 15:04:14 UTC
June 24, 2010 testing 
Configuration:
- 2.6.33.5-rt23.23.1.el5rt debug kernel
- added 'nmi_watchdog=panic,1' to the kernel boot line
- proc settings:
	echo 9 > /proc/sysrq-trigger
	echo 0 > /proc/sys/kernel/panic_on_io_nmi
	echo 0 > /proc/sys/kernel/panic_on_oops
	echo 0 > /proc/sys/kernel/panic_on_unrecovered_nmi
	echo 1 > /proc/sys/kernel/print-fatal-signals
	echo 0 > /proc/sys/kernel/unknown_nmi_panic
	echo 900 > /proc/sys/kernel/hung_task_timeout_secs
	echo 0 > /proc/sys/kernel/hung_task_panic
	echo 60 > /proc/sys/kernel/softlockup_thresh
	echo 0 > /proc/sys/kernel/softlockup_panic
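
One way to make the boot parameter persistent on RHEL 5 is grubby (a sketch; the kernel image path is assumed and will differ per host):

	# append the watchdog parameter to this kernel's entry in grub.conf
	grubby --update-kernel=/boot/vmlinuz-2.6.33.5-rt23.23.1.el5rt \
	       --args="nmi_watchdog=panic,1"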


tyan-gt24-08.rhts.eng.bos.redhat.com
-----
  rteval --workdir=./bhu-runs --duration=18h
  rteval run on 2.6.33.5-rt23.23.1.el5rt started at Thu Jun 24 16:22:57 2010
  started 2 loads on 8 cores  with 2 numa nodes
  ...
  rteval time remaining: 0 days, 15 hours, 39 minutes, 52 seconds


Can't ssh back into it; the console appears hung.
Console output: https://bugzilla.redhat.com/attachment.cgi?id=426901



ibm-hs22-01.lab.bos.redhat.com
--------
  rteval --workdir=./bhu-runs --duration=18h
  rteval run on 2.6.33.5-rt23.23.1.el5rt started at Thu Jun 24 16:06:48 2010
  started 2 loads on 4 cores  with 2 numa nodes
  ...
  rteval time remaining: 0 days, 16 hours, 9 minutes, 55 seconds

Can't ssh back into it; the console appears hung.
Console output: https://bugzilla.redhat.com/attachment.cgi?id=426900


dell-per905-01.rhts.eng.bos.redhat.com
----
  rteval --workdir=./bhu-runs --duration=18h
  rteval run on 2.6.33.5-rt23.23.1.el5rt started at Thu Jun 24 16:01:14 2010
  started 2 loads on 16 cores  with 4 numa nodes
  ...
  rteval time remaining: 0 days, 15 hours, 40 minutes, 12 seconds


Can't ssh back in; the console appears hung.
Console output: https://bugzilla.redhat.com/attachment.cgi?id=426899

Comment 15 Beth Uptagrafft 2010-06-25 19:24:08 UTC
Created attachment 426964 [details]
console from hung dell-pem610-01

dell-pem610-01.rhts.eng.bos.redhat.com
---------
  rteval --workdir=./bhu-runs --duration=18h
  rteval run on 2.6.33.5-rt23.23.1.el5rt started at Thu Jun 24 15:57:42 2010
  started 2 loads on 16 cores  with 2 numa nodes
  ...
  rteval time remaining: 0 days, 0 hours, 5 minutes, 54 seconds


Can't ssh into it; ping still works, but the ssh window is not responding.
Output is still going to the console.

Comment 16 Beth Uptagrafft 2010-06-28 15:08:02 UTC
Created attachment 427439 [details]
consoles for Jun25 hangs on dell-pem610-01,  tyan-gt24-08, and ibm-hs22-01

June 25, 2010 testing using the 2.6.33.5-rt23.23.1.el5rt debug kernel
Configuration:
- added 'nmi_watchdog=panic,1' to the kernel boot line
- proc settings
	echo 9 > /proc/sysrq-trigger
	echo 1 > /proc/sys/kernel/print-fatal-signals
	echo 900 > /proc/sys/kernel/hung_task_timeout_secs


dell-pem610-01.rhts.eng.bos.redhat.com
---------
  rteval --workdir=./bhu-runs --duration=18h
  rteval run on 2.6.33.5-rt23.23.1.el5rt started at Fri Jun 25 18:29:32 2010
  started 2 loads on 16 cores  with 2 numa nodes
  ...
  rteval time remaining: 0 days, 12 hours, 15 minutes, 42 seconds

Hung. Can't ssh into the system; console not responding.
See jun25-dell-pem610-01-console.

tyan-gt24-08.rhts.eng.bos.redhat.com
-----
  rteval --workdir=./bhu-runs --duration=18h
  rteval run on 2.6.33.5-rt23.23.1.el5rt started at Fri Jun 25 18:31:09 2010
  started 2 loads on 8 cores  with 2 numa nodes
  ...
  rteval time remaining: 0 days, 16 hours, 19 minutes, 54 seconds  

Hung. Can't ssh into the system; console not responding.
See jun24-tyan-gt24-08-console.

ibm-hs22-01.lab.bos.redhat.com
--------
  rteval  --workdir=./bhu-runs --duration=18h
  rteval run on 2.6.33.5-rt23.23.1.el5rt started at Fri Jun 25 18:34:58 2010
  started 2 loads on 4 cores  with 2 numa nodes
  ...
  rteval time remaining: 0 days, 17 hours, 49 minutes, 59 seconds @ 7:17

Hung. Can't ssh into the system; console not responding.
See jun25-ibm-hs22-01-console.

Comment 17 John Kacur 2010-06-29 14:54:57 UTC
I produced the following trace with an upstream -rt kernel while running:
sudo ./cyclictest -t32 -p99
while :; do ./hackbench 20 ; done

make O=/bld/tmp -j19 (kernel compile)
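
(As one script, the same load mix would look roughly like this; it assumes the cyclictest and hackbench binaries in the current directory and a configured kernel tree, as above:)

sudo ./cyclictest -t32 -p99 &         # latency measurement threads at prio 99
while :; do ./hackbench 20; done &    # scheduler/IPC load
make O=/bld/tmp -j19                  # parallel kernel compile load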

INFO: task kswapd0:547 blocked for more than 900 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd0       D 00000000ffffffff     0   547      2 0x00000008
 ffff88018d033950 0000000000000046 ffff88018d033918 0000000000000000
 ffff880100000000 000000000000af98 ffff8801a0131470 00000001014f8772
 ffff8800973902c0 ffff8801a01310c0 00000000000001f5 0000000000000000
Call Trace:
 [<ffffffff81511c2f>] schedule+0x20/0x52
 [<ffffffff81512fbc>] rt_spin_lock_slowlock+0x236/0x324
 [<ffffffff810af7ec>] rt_spin_lock_fastlock.clone.1+0xcf/0xd4
 [<ffffffff81513e25>] rt_spin_lock+0x16/0x3d
 [<ffffffff811953e9>] generic_detach_inode+0x69/0x29c
 [<ffffffff811971a7>] generic_drop_inode+0x39/0x9e
 [<ffffffff81194dfb>] iput+0xd7/0xe6
 [<ffffffff8118fa1e>] dentry_iput+0x11f/0x12b
 [<ffffffff8118fc37>] d_kill+0xa9/0xb9
 [<ffffffff81190136>] __shrink_dcache_sb+0x4ef/0x771
 [<ffffffff81190538>] ? shrink_dcache_memory+0x155/0x271
 [<ffffffff811905aa>] shrink_dcache_memory+0x1c7/0x271
 [<ffffffff811301a2>] ? shrink_slab+0x53/0x208
 [<ffffffff811302a0>] shrink_slab+0x151/0x208
 [<ffffffff81130bbe>] balance_pgdat+0x4a0/0x785
 [<ffffffff8112cfe2>] ? isolate_pages_global+0x0/0x324
 [<ffffffff81513ec8>] ? rt_spin_unlock+0x2b/0x30
 [<ffffffff81131182>] kswapd+0x2df/0x2f5
 [<ffffffff8108c39a>] ? autoremove_wake_function+0x0/0x4f
 [<ffffffff81044c0e>] ? need_resched+0x3f/0x46
 [<ffffffff81130ea3>] ? kswapd+0x0/0x2f5
 [<ffffffff8108beaa>] kthread+0xa4/0xac
 [<ffffffff810045d4>] kernel_thread_helper+0x4/0x10
 [<ffffffff81514f80>] ? restore_args+0x0/0x30
 [<ffffffff8108be06>] ? kthread+0x0/0xac
 [<ffffffff810045d0>] ? kernel_thread_helper+0x0/0x10
INFO: lockdep is turned off.
Kernel panic - not syncing: hung_task: blocked tasks
Dumping ftrace buffer:
   (ftrace buffer empty)

Comment 18 John Kacur 2010-06-30 09:36:19 UTC
INFO: task jbd2/md127p5-8:1737 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/md127p5- D 00000000ffffffff     0  1737      2 0x00000000
 ffff88018c12da40 0000000000000046 ffff88018c12d9f0 ffffffff815147d4
 ffffffff810aed78 00000000955a4740 ffff88018b3ec4f0 ffff88018c54fa70
 ffff8800b63c5640 ffff88018b3ec140 ffff880100000000 ffffffff8104f1f7
Call Trace:
 [<ffffffff815147d4>] ? _raw_spin_lock_irq+0x89/0x90
 [<ffffffff810aed78>] ? task_blocks_on_rt_mutex+0x1d7/0x1e9
 [<ffffffff8104f1f7>] ? get_parent_ip+0x11/0x6f
 [<ffffffff81511c2f>] schedule+0x20/0x52
 [<ffffffff81512fbc>] rt_spin_lock_slowlock+0x236/0x324
 [<ffffffff810e9fab>] ? __rcu_read_unlock+0x57/0x318
 [<ffffffff810af7ec>] rt_spin_lock_fastlock.clone.1+0xcf/0xd4
 [<ffffffff81513e25>] rt_spin_lock+0x16/0x3d
 [<ffffffff811a4a34>] __mark_inode_dirty+0xdc/0x29c
 [<ffffffff811adce1>] __set_page_dirty+0x17d/0x18a
 [<ffffffff811ade60>] mark_buffer_dirty+0x172/0x17a
 [<ffffffffa01d75d3>] __jbd2_journal_temp_unlink_buffer+0x202/0x20b [jbd2]
 [<ffffffffa01d75f2>] __jbd2_journal_unfile_buffer+0x16/0x22 [jbd2]
 [<ffffffffa01d7a21>] __jbd2_journal_refile_buffer+0xa4/0x13c [jbd2]
 [<ffffffffa01dc27d>] jbd2_journal_commit_transaction+0x16fb/0x1d5a [jbd2]
 [<ffffffff8108c39a>] ? autoremove_wake_function+0x0/0x4f
 [<ffffffff81075f31>] ? del_timer_sync+0x0/0xce
 [<ffffffffa01e3ae3>] kjournald2+0x10b/0x311 [jbd2]
 [<ffffffff8108c39a>] ? autoremove_wake_function+0x0/0x4f
 [<ffffffff81044c0e>] ? need_resched+0x3f/0x46
 [<ffffffffa01e39d8>] ? kjournald2+0x0/0x311 [jbd2]
 [<ffffffff8108beaa>] kthread+0xa4/0xac
 [<ffffffff810045d4>] kernel_thread_helper+0x4/0x10
 [<ffffffff81514f80>] ? restore_args+0x0/0x30
 [<ffffffff8108be06>] ? kthread+0x0/0xac
 [<ffffffff810045d0>] ? kernel_thread_helper+0x0/0x10
INFO: lockdep is turned off.
Kernel panic - not syncing: hung_task: blocked tasks

Comment 19 Beth Uptagrafft 2010-06-30 14:35:14 UTC
Created attachment 428008 [details]
consoles for Jun29 hangs on pem610, tyan-gt24-08, hs22, ls42, and per905

June 29, 2010 testing using rostedt's 2.6.33.5-rt23.23.1.el5rt debug kernel built on June 28
Settings:
- added 'nmi_watchdog=1' and 'earlyprintk=xxx' to the kernel boot line
- proc settings
	echo 9 > /proc/sysrq-trigger


dell-pem610-01.rhts.eng.bos.redhat.com
---------
  rteval --workdir=./bhu-runs --duration=18h
  rteval run on 2.6.33.5-rt23.23.1.el5rt started at Tue Jun 29 15:54:50 2010 (Eastern)
  started 2 loads on 16 cores  with 2 numa nodes
  ...
  rteval time remaining: 0 days, 10 hours, 39 minutes, 34 seconds


Hung. Can't ssh into the system; console not responding.
Last update to the console log: Jun 29 22:41 screenlog.2 (Central)
See jun29-dell-pem610-01-console.


tyan-gt24-08.rhts.eng.bos.redhat.com
-----
  rteval --workdir=./bhu-runs --duration=18h
  rteval run on 2.6.33.5-rt23.23.1.el5rt started at Tue Jun 29 15:37:46 2010 (Eastern)
  started 2 loads on 8 cores  with 2 numa nodes
  ...
  rteval time remaining: 0 days, 16 hours, 9 minutes, 54 seconds


Hung. Can't ssh into the system; console not responding.
Last update to the console log: Jun 29 16:57 screenlog.4 (Central)
See jun29-tyan-gt24-08-console.


ibm-ls42-01.lab.bos.redhat.com
-----------
  rteval --workdir=./bhu-runs --duration=18h
  rteval run on 2.6.33.5-rt23.23.1.el5rt started at Tue Jun 29 15:57:17 2010 (Eastern)
  started 2 loads on 8 cores  with 2 numa nodes
  ...
  rteval time remaining: 0 days, 16 hours, 19 minutes, 56 seconds

Hung. Can't ssh into the system; console not responding.
Last update to the console log: Jun 29 17:33 screenlog.6 (Central)
See jun29-ibm-ls42-01-console.


ibm-hs22-01.lab.bos.redhat.com
--------
  rteval --workdir=./bhu-runs --duration=18h
  rteval run on 2.6.33.5-rt23.23.1.el5rt started at Tue Jun 29 16:12:19 2010 (Eastern)
  started 2 loads on 4 cores  with 2 numa nodes
  ...
  rteval time remaining: 0 days, 4 hours, 55 minutes, 30 seconds

Hung. Can't ssh into the system; console not responding.
Last update to the console log: Jun 30 04:28 screenlog.8 (Central)
See jun29-ibm-hs22-01-console.


dell-per905-01.rhts.eng.bos.redhat.com
----
  rteval --workdir=./bhu-runs --duration=18h
  rteval run on 2.6.33.5-rt23.23.1.el5rt started at Tue Jun 29 15:12:23 2010 (Eastern)
  started 2 loads on 16 cores  with 4 numa nodes
  ...
  rteval time remaining: 0 days, 17 hours, 9 minutes, 57 seconds

Hung. Can't ssh into the system; console not responding.
Last update to the console log: Jun 29 16:30 screenlog.10 (Central)
See jun29-dell-per905-01-console.

Comment 20 Steven Rostedt 2010-06-30 14:52:22 UTC
The console ends with this:

  .bkl_count                     : 1017

cfs_rq[0]:
  .exec_clock                    : 6712126.752263

You hit the bug in the NMI dump code that I just fixed. The sched_debug code that prints this output takes the rq spinlock. If the NMI preempted a section where the rq spinlock was already held, taking it again locks up the box.

Thus, this is just a lockup due to the debug output, not the lockup we are looking for.

I'm currently building a new kernel with the debug fix.

Comment 21 D. Marlin 2010-07-05 15:19:48 UTC
I set up 18-hour runs of the 2.6.33.5-rt23.23.4.el5rt kernel on:

  dell-pem610-01
  dell-per905-01
  tyan-gt24-08
  ibm-ls42-01
  ibm-hs22-01

With the exception of ibm-ls42-01, all runs completed. ibm-ls42-01 apparently dropped its console and ssh session about 8 hours into the test; I'm not sure whether the system reset or something else caused the connection loss.

The log is in the farm:

  rhel5.farm.hsv.redhat.com:/es/scratch/dmarlin/RT/screenlog.4.20100603

Comment 22 Clark Williams 2010-09-13 15:21:07 UTC
Dropping MRG 1.3 Blocker status. 

We'll tell customers that if they want to run a system with more than 4GB of RAM, they need to run a 64-bit kernel.

Comment 23 Clark Williams 2010-09-28 18:42:40 UTC
This looks like an issue with HIGHMEM/PAE being used on very large memory servers. We've decided to ship our 32-bit kernels without PAE support, limiting them to 4GB of RAM.

Closing.

