|Summary:||LONG scheduling pauses|
|Product:||[Retired] Red Hat Linux||Reporter:||Alan Robertson <alanr>|
|Component:||kernel||Assignee:||Dave Jones <davej>|
|Status:||CLOSED WONTFIX||QA Contact:||Brian Brock <bbrock>|
|Version:||7.3||CC:||ccoy, dedourek, francois-xavier.kowalski, lars, marc.schmitt, mingo, pfrields, yuval|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2004-05-27 14:09:57 UTC||Type:||---|
Description Alan Robertson 2002-10-31 15:46:56 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.8) Gecko/20020204

Description of problem:
There are occasions (a few times a day) where the scheduler fails to schedule a soft realtime process for up to 30 seconds for no apparent reason. This is observed by the 'heartbeat' program, which is a high-availability package for Linux. It has been observed by three different users of heartbeat in three different environments.

Heartbeat tracks how long it takes systems to send out heartbeats. It is normal for heartbeat times to go up somewhat under load. However, in this case, things run along just as smoothly as you like until suddenly one VERY long heartbeat interval occurs, without any prior warnings concerning delayed heartbeats.

This problem has been observed by Alex Kramarov of incredimail, Steven Wilson of NCD health, and Brian Tinsley of emageon. The kernel was kernel-2.4.18-17.7.x. Alex Kramarov reports he was running it on an SMP machine. I do not yet know about the other two. The users report that the systems were either completely idle or nearly so.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. I'm unsure of some of the exact initial conditions
2. Install heartbeat on two systems and configure them over a couple of ethernets. Set keepalive to 1, deadtime to 10 and warntime to 2.
3. Run heartbeat for a few days. It will falsely report that one or both machines have died, without any warnings about successively longer heartbeat delays like those which occur when the machine is under heavy load. This may happen within a few hours, and takes no more than a day or two.

Actual Results:
The users in question report that when they upgraded to this kernel, they began to experience these problems. They reproduced them several times in this way, and they went away when they changed to an earlier kernel.

Expected Results:
No false takeovers should occur.
Additional info:
Until this kernel, heartbeat has worked nicely with every kernel made by any vendor. This kernel has a unique bug, unlike any seen before in the last 3 years or so.

This bug renders these machines unsuitable for high-availability work. The result is equivalent to a crash: both machines shut down their HA services and restart them. If the users have not configured everything properly, and have a shared disk, this can lead to loss of data.

There is a long, involved discussion of these problems on the linux-ha mailing list. The topics are "Sporadic split brain on Red Hat 7.2 with 2.4.18-17.7 kernel" and "heartbeat failure". The relevant list archives can be found here: http://marc.theaimsgroup.com/?l=linux-ha&r=1&b=200210&w=2
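To make the reported signature concrete, here is a minimal sketch of how gaps between heartbeats might be classified with the settings from the reproduction steps (keepalive 1, warntime 2, deadtime 10). The function names and the exact threshold comparisons are assumptions for illustration only, not heartbeat's actual code.

```python
# Hypothetical sketch: classify gaps between heartbeats using the settings
# from the reproduction steps (all values in seconds). The >= comparisons
# are an assumption, not taken from the heartbeat sources.
KEEPALIVE = 1   # expected interval between heartbeats
WARNTIME = 2    # gap that triggers a "late heartbeat" warning
DEADTIME = 10   # gap after which the peer is declared dead


def classify_gap(gap_seconds):
    """Return 'ok', 'warn', or 'dead' for one observed heartbeat gap."""
    if gap_seconds >= DEADTIME:
        return "dead"
    if gap_seconds >= WARNTIME:
        return "warn"
    return "ok"


def classify_gaps(gaps):
    """Classify a sequence of gaps between consecutive heartbeats."""
    return [classify_gap(g) for g in gaps]


# Under rising load, gaps grow gradually and warnings precede any death.
loaded = classify_gaps([1, 1, 3, 5, 8, 12])
# The reported pattern: steady 1-second gaps, then one 30-second pause.
paused = classify_gaps([1, 1, 1, 1, 30])
```

In the loaded case several "warn" results appear before "dead"; in the paused case the sequence goes straight from "ok" to "dead", which is the pattern described above.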
Comment 1 John DeDourek 2002-11-01 22:23:38 UTC
I suspect that this bug is related to bug 76499 which causes large numbers of clock ticks to be lost.
Comment 2 Alan Robertson 2002-11-04 18:52:43 UTC
As John DeDourek suggested, it is likely related to bug 76499. We have had a few more emails from the victims of this bug. These emails are in the linux-ha archives with subject lines containing "Red Hat Kernel 2.4.18-17", with the usual Re:, Fwd:, etc. prefixes. It also appears that one person may have seen similar problems on somewhat earlier RH kernels. There are at least three models of computers this has been reproduced on: some UP, some 2-CPU, and some 4-CPU.
Comment 3 Francois-Xavier 'FiX' KOWALSKI 2002-11-13 09:37:34 UTC
We also observe this problem on all dual-CPU HP Kayak desktop workstations (with or without SCSI), and on dual-CPU & 6-way HP NetServers (obsolete, but...). Our HA mechanism (a polling layer on top of TCP) gets confused as well, causing a system reboot. We observe this problem with 2.4.18-3smp, 2.4.18-1-smp & 2.4.18-17.7.xsmp (same for bigmem's).
Comment 4 Arjan van de Ven 2002-11-13 09:45:00 UTC
Is there any chance of anyone getting a sysrq-t dump of a stuck state?
Comment 5 Francois-Xavier 'FiX' KOWALSKI 2002-11-13 13:46:04 UTC
Created attachment 84804 [details] /var/log/messages section when hitting sysrq-t at lock time
Comment 6 Francois-Xavier 'FiX' KOWALSKI 2002-11-13 13:48:02 UTC
I have attached the piece of my /var/log/messages that contains a sysrq-t output. I don't know if the lock I encountered is specifically the one we are talking about here, because an NFS server went down at this time. If you feel it is un-related, please disregard this log.
Comment 7 Arjan van de Ven 2002-11-13 14:02:30 UTC
Francois-Xavier: you are using binary-only kernel modules that interact with the VFS and NFS, so your trace is not useful ;(
Comment 8 Francois-Xavier 'FiX' KOWALSKI 2002-11-18 15:30:18 UTC
Created attachment 85444 [details] /var/log/message extract on 2.4.18-3 with no proprietary module
Comment 9 Francois-Xavier 'FiX' KOWALSKI 2002-11-18 15:34:11 UTC
Here is another /var/log/messages (with 2 Alt-SysRq) from a 2.4.18-3 kernel with no proprietary module. The system on which it happens relies on NFS+NIS. In the first part of the log, the frozen application (whereas all others are fine) is XEmacs. (This may also happen with LyX, but not in this log.) The second part of the log is from the exact same situation, but it also includes a "ps lax" command that is frozen too.
Comment 10 Arjan van de Ven 2002-11-18 15:39:03 UTC
Nov 13 14:30:38 tarifa1 kernel: Call Trace: [<d09fceb2>] mvfs_rhat_rdwr [mvfs] 0x122

ehm, please be serious when you say "no proprietary modules loaded" and stop wasting my time
Comment 11 Arjan van de Ven 2002-11-18 15:42:37 UTC
also please don't use 2.4.18-3 but 2.4.18-18.7.x or 2.4.18-18.8.0
Comment 12 Francois-Xavier 'FiX' KOWALSKI 2002-11-18 16:16:33 UTC
Sorry if I gave you the impression of wasting your time. Please read below: the first log (collected last week on machine tarifa1) was from a kernel using a proprietary module. So I have uploaded another attachment from another machine (fuerteventura), which does not use a proprietary module. Since I have no means to remove my previous attachment, it was left attached to the bug record. I will try to catch the problem on 2.4.18-17.7.x. I hope the updates will not come so quickly that another one is out by then... Any chance that you could have a look at the problem in the fuerteventura log on 2.4.18-3?
Comment 13 Francois-Xavier 'FiX' KOWALSKI 2002-11-18 16:21:15 UTC
Ooops! Wrong log from the wrong machine.... :-(
Comment 14 Francois-Xavier 'FiX' KOWALSKI 2002-11-18 16:22:56 UTC
Created attachment 85450 [details] 2.4.18-3 with no proprietary module (this time...)
Comment 15 Arjan van de Ven 2002-11-18 22:44:19 UTC
looks like an NFS lockup ;( 2.4.18-18.7.x (2.4.18-17.7.x is not the latest ;) ought to have a better NFS behavior (but we know it's not perfect yet)
Comment 16 Alan Robertson 2002-11-22 17:48:43 UTC
I do not believe that all the original people reporting this have NFS issues.
Comment 17 Alan Robertson 2003-02-14 13:47:33 UTC
Recently, various people have reported to me that they've seen this problem in kernel.org kernels. The reports are *very* credible, and contain what appears to be the unique signature of this bug (at least I hope there aren't two bugs with the same signature). When I was at LinuxWorld, folks from the SuSE booth told me that they had seen this bug and fixed it. You might check with them on what the fix was. The person I heard this from was either Marcus Rex or Ralf Flaxa. So I would suppose that either Rik or Andrea might have more information.
Comment 18 Francois-Xavier 'FiX' KOWALSKI 2003-02-24 18:10:57 UTC
On our side, we have reproduced the problem with & without NFS. It was quite a pain, hence no update for a long time. Our "scheduling pause" was due to a mis-configured NTP daemon (default RH settings) that was triggering system clock jumps. When the clock went backward N seconds, our application was not "scheduled" for ~N seconds.

On my side, the fix is in /etc/sysconfig/ntpd:
- OPTIONS="-U ntp"
+ OPTIONS="-U ntp -x 0 -g"

Note: the application was invoked by the scheduler, but was not doing anything, because almost everything depends on timers in a Telco application...

Too bad that there is no man page for the ntpd startup options on Red Hat (/usr/share/doc/*.html only)... :-(
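The effect described above (a backward clock step of N seconds delaying timers by about N seconds) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the FakeClocks class and seconds_until_fire function are hypothetical names, the clocks are simulated rather than real, and nothing here is taken from the application or from ntpd.

```python
class FakeClocks:
    """Simulated clocks: 'wall' can be stepped, 'mono' only moves forward."""

    def __init__(self):
        self.wall = 1000.0
        self.mono = 0.0

    def advance(self, seconds):
        # Real time passing moves both clocks forward equally.
        self.wall += seconds
        self.mono += seconds

    def step_wall(self, seconds):
        # Models a clock step by a time daemon; negative means backward.
        self.wall += seconds


def seconds_until_fire(clocks, read_clock, delay, step_at, step_by, tick=0.5):
    """Arm a timer for `delay` seconds using `read_clock`, step the wall clock
    by `step_by` once `step_at` real seconds have passed, and return how many
    real seconds elapse before the timer fires."""
    deadline = read_clock(clocks) + delay
    elapsed = 0.0
    stepped = False
    while read_clock(clocks) < deadline:
        clocks.advance(tick)
        elapsed += tick
        if not stepped and elapsed >= step_at:
            clocks.step_wall(step_by)
            stepped = True
    return elapsed


# A 1-second timer, with the wall clock stepped back 5 seconds midway.
wall_timer = seconds_until_fire(FakeClocks(), lambda c: c.wall, 1.0, 0.5, -5.0)
mono_timer = seconds_until_fire(FakeClocks(), lambda c: c.mono, 1.0, 0.5, -5.0)
```

The timer whose deadline is computed from the wall clock fires about 5 seconds late, while the one based on the monotonic clock is unaffected. In real Python code the corresponding calls would be time.time() and time.monotonic().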
Comment 19 Steven Boger 2003-10-02 20:23:42 UTC
We are also experiencing this with 2.4.20-19.7 and 2.4.20-20.7.
Comment 20 Steven Boger 2003-10-02 20:30:33 UTC
Whoops. Sorry for the me too. Let me add a full report: We are also experiencing this with 2.4.20-19.7 and 2.4.20-20.7. We have done everything from updating to the most recent HA rpms (all ultramonkeys) to trying the two different kernels mentioned above. I am currently compiling a plain vanilla 2.4.22 kernel to see if this has any effect. We can easily reproduce this on a two-server cluster by setting the hwclock to 3:30am and rebooting both systems. While they are coming up, we start 3 test clients pounding the boxes with about 300 requests per second each (these are Apache clusters). Within ten minutes the "long heartbeat intervals" occur, causing both systems to assume master status, or worse. We are interested in working with any parties to resolve this problem. Please contact me at the above email address.