|Summary:||Weird partial SMP kernel 2.4.18-17.7.x hang on HP netserver lp2000r|
|Product:||[Retired] Red Hat Linux||Reporter:||ville.sulko|
|Component:||kernel||Assignee:||Arjan van de Ven <arjanv>|
|Status:||CLOSED CURRENTRELEASE||QA Contact:||Brian Brock <bbrock>|
|Version:||7.3||CC:||chpruc, erik.bennett, jason.piszcyk, johnsom, p.dania|
|Fixed In Version:||Doc Type:||Bug Fix|
|Last Closed:||2004-09-30 15:40:14 UTC||Type:||---|
Description ville.sulko 2002-11-25 13:42:08 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Q312461)

Description of problem:
On a dual-processor HP lp2000r with kernel 2.4.18-17.7.xsmp, we have twice experienced a mysterious partial hang (see below) of the kernel. Network and console logins are usually still possible, but result in mostly unusable sessions.

Version-Release number of selected component (if applicable):

How reproducible:
Didn't try

Steps to Reproduce:
Don't know of any systematic way to reproduce it, except for leaving the system on and waiting... It has, however, happened twice, about two weeks apart.

Additional info:
These hangs are almost too weird to describe :)

The machine runs an apache and a tomcat process. Both crashes were noticed because the web service was no longer responding. The machine still responded to ping, and I was actually able to ssh into it too. The ssh login was a bit sluggish, and once I was logged in, typing was difficult; the display always seemed to lag one character (maybe one network packet) behind. So, when I entered 'ls', only 'l' was displayed; then I pressed enter and 'ls' was displayed; then I pressed enter again and the ls listing was shown, and so on. The same thing happened on the console too, so this wasn't a network error.

Another thing I noticed was that the system clock seemed to be hung: date always returned the same time. From the logs we saw that cron had stopped executing tasks between 00:40 and 00:50 (sar was no longer run), and when we gave 'shutdown -r 0' to the system, the logs showed the reboot time as about 00:43, even though it was actually already the following day. And no, the shutdown also hung, and we were forced to use the main switch...

The whole system was very sluggish overall, and I couldn't run top, for example. The ps listing completed ok but showed nothing too unusual (sorry, no saved data this time). free showed that the machine wasn't swapping, so that's not the problem. According to 'w', the machine was also idle, so no processes were running wild. There were also no panics or other kernel messages in dmesg or the messages file, so there is no further info I can provide at this time.

So, overall this seemed like a partial kernel crash, leaving the kernel unable to function properly. I know the description above most likely isn't enough to isolate the problem, so is there something that I should especially try or write down the next time it happens?

BTW, the system had been running kernel 2.4.18-10 for a couple of months, and there were no crashes as far as I can recall. There seems to be a newer errata kernel, 2.4.18-18.7.x; I might upgrade to that too, even if the bugs fixed don't seem to match this problem.

About the system:
HP netserver lp2000r
Dual PIII 1GHz
HP NetRAID 2M with 6 disks (raid1)
RedHat 7.3, with the latest patches up to 11-Nov-2002.
Comment 1 ville.sulko 2003-01-02 14:16:44 UTC
Ok, more similar problems on two different HP lp2000r machines. The same symptoms, sluggish machine etc. One machine was running 2.4.18-18.7.x and the other was running 2.4.18-19.7.x.

This time, however, I noticed one very interesting thing. On both machines, grepping from /proc/interrupts, the timer interrupts (#0) seemed to be happening way too infrequently. I rebooted one of the machines (after which it seems to run just fine) and calculated the timer rate from a 10 second sample (external timing :), and got about 512 Hz. On the other machine (the misbehaving one), using the same method I got about 9 Hz... That is, a total of 92 interrupts (46 + 46) in ten seconds... I guess this explains why "sleep 1" lasts over 10 seconds and "vmstat 1" and "top" seem to freeze.

And one more note: on the other machine, there was the following in dmesg. It may or may not be relevant...

Jan 1 14:12:10 extra1 kernel: set_rtc_mmss: can't update from 59 to 12
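The measurement described above can be sketched in a few lines of Python. This is only an illustration, not the reporter's actual method: the function names `timer_counts` and `timer_rate_hz` are hypothetical, and the elapsed time is passed in by the caller because, as noted above, the machine's own clock (and therefore `sleep`) cannot be trusted once the problem has started.

```python
def timer_counts(interrupts_text):
    """Return the per-CPU counts on the IRQ 0 (timer) line of /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "0:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # reached the controller/device name columns
                counts.append(int(field))
            return counts
    raise ValueError("no IRQ 0 line found")


def timer_rate_hz(before_text, after_text, elapsed_seconds):
    """Timer interrupts per second, summed over all CPUs.

    elapsed_seconds must come from an external clock (for example a
    wristwatch), not from the machine under test.
    """
    delta = sum(timer_counts(after_text)) - sum(timer_counts(before_text))
    return delta / elapsed_seconds


# Made-up absolute counts; only the 46 + 46 difference comes from the report.
before = "  0:   1000   2000   IO-APIC-edge  timer"
after = "  0:   1046   2046   IO-APIC-edge  timer"
print(timer_rate_hz(before, after, 10))
```

On a live system the two text arguments would be the contents of /proc/interrupts read at the start and end of the externally timed interval.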
Comment 2 Arjan van de Ven 2003-01-02 14:20:10 UTC
this phenomenon seems to happen with a lot of kernels (not just RH's) and only on hp lp2000r's.... hardware/bios bug?
Comment 3 ville.sulko 2003-01-02 21:00:44 UTC
Created attachment 89071: Bootup messages (dmesg)

Just for the record, here's the dmesg listing of the kernel bootup sequence.
Comment 4 ville.sulko 2003-01-03 10:52:41 UTC
One more observation: the timer interrupt rate seems to be decreasing all the time. Yesterday it was at about 9 Hz, six hours later at 8 Hz, and now (the following day) it seems to be around 6 Hz...

Is there anything to check to determine the reason for the timer interrupt slowdown? Is it possible to verify the timer circuit settings and/or reset the timer frequency? Or could there be some other reason for the interrupts not being delivered, in case the timer itself works ok?

BTW, I checked the release notes for the latest BIOS update, and there was nothing related to this kind of problem. As for the question of whether this could be a HW problem: it could, of course, but it would have to be a generic lp2000r HW bug. Furthermore, we had no problems with some earlier kernels (<= 2.4.18-10 ?). Were those (RH) kernels running with CONFIG_HZ=100 or 512? Could this be a result of a timer wraparound or something like that?
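As a rough aid for judging the wraparound question, the sketch below computes how long a tick counter would take to wrap at each of the two tick rates mentioned. It assumes a 32-bit unsigned counter incremented once per tick, which is an assumption made here for illustration and is not confirmed anywhere in this report; the name `days_until_wrap` is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def days_until_wrap(hz, counter_bits=32):
    """Days until a counter of counter_bits bits, incremented hz times per second, wraps."""
    return (2 ** counter_bits) / hz / SECONDS_PER_DAY


for hz in (100, 512):
    print(f"HZ={hz}: wraps after {days_until_wrap(hz):.1f} days")
```

Under that assumption the counter would wrap after roughly 497 days at 100 Hz and roughly 97 days at 512 Hz. These figures can be compared against the uptime at which a machine fails; they do not by themselves show that a wraparound is the cause.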
Comment 5 Michael Johnson 2003-01-04 00:46:40 UTC
I can confirm this issue. I have had four HP lp2000r servers get stuck looping the clock. I am also unable to shut down the boxes without a force flag. All of the servers were running the 2.4.18-18.7xsmp Red Hat errata kernel. I think we had a similar case to this on a previous kernel release as well. A reboot of the box does restore correct clock operation.
Comment 6 Erik Bennett 2003-01-23 22:00:52 UTC
We've also seen this behavior on a dozen lp2000r machines as well as one lh3r. All of them had dual procs. This happened on kernels from 2.4.18-17.7xsmp through 2.4.18-19.7xsmp.

Also, these machines won't boot if you install an SMP kernel on a machine with only one CPU. It hangs right after the line:

Configuring 256 Unix98 ttys

This wasn't the case with the 2.4.9 series.
Comment 7 Erik Bennett 2003-01-23 23:40:13 UTC
I too did some timings, and the interrupts on this machine have gone to 0 (zero) in a ten second period. Also, this machine has a relatively high number of them. It takes 29 bits to hold this number (#0), for what that's worth. Also, this clock is in a 6 second loop. This is one of two failure modes we're seeing. The other is the slowdown.

rssv05 ~ 7# cat /proc/interrupts
           CPU0       CPU1
  0:  309234823  309250674    IO-APIC-edge  timer
  1:          2          2    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  8:          1          0    IO-APIC-edge  rtc
 12:         11          9    IO-APIC-edge  PS/2 Mouse
 14:          0          2    IO-APIC-edge  ide0
 17:     140090     140329   IO-APIC-level  eth0
 18:     680127     686096   IO-APIC-level  sym53c8xx
 19:     535511     535621   IO-APIC-level  eth1
NMI:          0          0
LOC:  625276847  625276870
ERR:          0
MIS:          0

rssv05 ~ 8# cat /proc/interrupts
           CPU0       CPU1
  0:  309234823  309250674    IO-APIC-edge  timer
  1:          2          2    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  8:          1          0    IO-APIC-edge  rtc
 12:         11          9    IO-APIC-edge  PS/2 Mouse
 14:          0          2    IO-APIC-edge  ide0
 17:     140090     140342   IO-APIC-level  eth0
 18:     680127     686096   IO-APIC-level  sym53c8xx
 19:     535521     535621   IO-APIC-level  eth1
NMI:          0          0
LOC:  625282370  625282393
ERR:          0
MIS:          0
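The two listings above are easier to compare as per-line differences. The following is a minimal sketch, assuming the usual /proc/interrupts layout of a label followed by one count per CPU; the function names `parse_interrupts` and `diff_interrupts` are hypothetical, and only the lines that matter for the comparison are copied from the listings.

```python
def parse_interrupts(text):
    """Map each /proc/interrupts label (e.g. '0', 'LOC') to its list of per-CPU counts."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue  # skip blank lines and the CPU0/CPU1 header
        counts = []
        for field in fields[1:]:
            if not field.isdigit():
                break  # reached the controller/device name columns
            counts.append(int(field))
        table[fields[0].rstrip(":")] = counts
    return table


def diff_interrupts(before_text, after_text):
    """Per-CPU count increase for every label present in both snapshots."""
    before = parse_interrupts(before_text)
    after = parse_interrupts(after_text)
    return {
        label: [new - old for old, new in zip(before[label], after[label])]
        for label in after
        if label in before
    }


# Selected lines from the two listings in this comment.
first = """
  0: 309234823 309250674 IO-APIC-edge timer
 17: 140090 140329 IO-APIC-level eth0
 19: 535511 535621 IO-APIC-level eth1
LOC: 625276847 625276870
"""
second = """
  0: 309234823 309250674 IO-APIC-edge timer
 17: 140090 140342 IO-APIC-level eth0
 19: 535521 535621 IO-APIC-level eth1
LOC: 625282370 625282393
"""

for label, delta in diff_interrupts(first, second).items():
    print(label, delta)
```

For these two snapshots the timer line (0) does not advance on either CPU, while LOC advances by 5523 on each CPU and the network interfaces still receive a few interrupts. If the local APIC timer runs at 512 Hz (an assumption, not something stated in this comment), 5523 ticks correspond to roughly 10.8 seconds, which would be consistent with the ten second sampling period mentioned above.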
Comment 8 Chad 2003-02-19 19:45:15 UTC
I have reproduced this problem consistently now 5 times in addition to watching several servers fail with this problem in the wild. It affects only SMP boxes with any i386 kernel over 2.4.10 and does not affect RHAS. It seems particularly problematic with HP servers. To reproduce the problem, the server need only be installed (we install over the network using kickstart) with 7.x and left alone while still on the network. Idleness is the common denominator. Also, going from high demand to low demand or disuse speeds up the appearance of this issue. I can get a hang with these symptoms in 1.5 to 4 days. I can find no evidence in my research that indicates Asus CUR-DLS or CUR-DLSR servers (which are identical to HP LP2000r and LP1000r servers in nearly every way right down to the case design) are affected in the same way as HP hardware. The primary difference between the two boxes is Asus uses Award BIOS while HP uses a modified PhoenixBIOS.
Comment 9 Jason Piszcyk 2003-05-01 03:57:08 UTC
I have also experienced the same problem. The server is an HP LH4R with 4 CPUs. I am running kernel 2.4.18-24.7.x. The server was previously running Red Hat 6.2 for a year and a half with no problems. It was recently upgraded to 7.3, and ran fine for a couple of months with no issues. It has now experienced this problem twice in the last fortnight. I have disabled NTP and configured the system to start in run-level 3, and the server has been running fine for about a week. I am waiting to see how this goes.
Comment 10 ville.sulko 2003-05-02 04:24:52 UTC
Just for the record, as a workaround, I have recompiled the kernel with CONFIG_HZ=100, and haven't had any problems since. The problem most likely is still there, but at least it seems to be much less frequent.
Comment 11 Jason Piszcyk 2003-06-19 02:37:27 UTC
I have rebuilt the Kernel with 'CONFIG_HZ=100', and the server has been up for almost 3 weeks with no problems, although I am experiencing some slight pain from keeping my fingers and toes crossed. The server has about 50 users on it during the day, so it is getting a decent workout. Thanks Ville (I hope that's right) for the suggestion, it has definitely helped. Any ideas if this is the long term solution, or am I likely to see the 'hang' again?
Comment 12 Arjan van de Ven 2003-06-19 07:40:11 UTC
newer errata kernels have HZ=100 again for other reasons (basically on some machines it caused time skew when burning cds)
Comment 13 Pietro Dania 2003-07-14 12:32:40 UTC
Same problem on 3 out of about 20 lp2000r machines. Machines are 1.4 and 1.133 GHz SMP / 1 GB RAM / NetRAID 1M. BIOS versions are spread from 4.6.06 to 4.6.16. Kernel is 2.4.20-18.7 with all errata installed.

I can reproduce the problem by starting a stress test procedure that runs 1000 processes of 3 threads each. I kill all of them and soon the problem arises. In addition, I get a continuous sound; maybe the machine has flatlined: "BIIIIIIIIIIIIIIII" :-) ?!?

Since the machine is RH certified on 7.2, I'll stay with that distro.
Comment 14 Chad 2003-09-18 21:43:44 UTC
We are calling this phenomenon "TimeWarp." It is not fully understood, but I have spent a good while exploring and experimenting with affected servers. Here's what my group knows:

1) The problem is indeed a skewing problem between the two CPUs.
2) CONFIG_HZ = 100 is just a delaying tactic. TimeWarp still occurs at about 330 days on AS 2.1 with CONFIG_HZ = 100 kernels.
3) Typical failure time is 30 days.
4) The "trigger" is heavy or sustained activity followed by an abrupt cessation of activity. Within three days of idleness, TimeWarp occurs. The method for reproducing the failure described above is correct.
5) BIOS alterations using the F11 method are not fruitful.
6) Building/installing a custom kernel that turns off ALL elements of power management (APM and ACPI) and other superfluous functionality results in at least 180 days (and counting) of uptime, even with CONFIG_HZ = 512.
Comment 15 Bugzilla owner 2004-09-30 15:40:14 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/