|Summary:||autofs-4.1.4-non-replicated-ping.patch backported to 4.1.3?|
|Product:||Red Hat Enterprise Linux 3||Reporter:||David Johnson <davidj>|
|Component:||autofs||Assignee:||Jeff Moyer <jmoyer>|
|Status:||CLOSED INSUFFICIENT_DATA||QA Contact:||Brock Organ <borgan>|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2006-05-23 17:45:17 UTC||Type:||---|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Cloudforms Team:||---||Target Upstream Version:|
Description David Johnson 2005-07-07 00:04:21 UTC
Description of problem: autofs-mounted NFS file systems are sometimes timing out quickly with a "file not found" error. This appears to have been introduced between autofs-4.1.3-47 and auotfs-4.1.3-130 Version-Release number of selected component (if applicable): autofs-4.1.3-130 How reproducible: Intermittent Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: This looks like the bug fixed by autofs-4.1.4-non-replicated-ping.patch. Is there a backport to autofs-4.1.3 for RHEL3 (autofs-4.1.3-non-replicated-ping looks like it's solving a different problem)? Or should I use autofs-4.1.4 on RHEL3? If so, what's the minimum kernel release of 2.4.21 that's usable with autofs-4.1.4?
Comment 2 Chris Feist 2005-07-18 18:50:32 UTC
Can you specify which version of autofs you are running? Can you also attach your automount maps and entries from /var/log/messages when you encountered this problem?
Comment 3 David Johnson 2005-07-19 21:23:34 UTC
Created attachment 116947 [details] auto.master for autofs bug
Comment 4 David Johnson 2005-07-19 21:24:13 UTC
Created attachment 116948 [details] auto.home for autofs bug
Comment 5 David Johnson 2005-07-19 21:24:47 UTC
Created attachment 116949 [details] auto.net for autofs bug
Comment 6 David Johnson 2005-07-19 21:30:41 UTC
As detailed in my initial report, the problem doesn't seem to exhibit itself in autofs-4.1.3-47 but _does_ occur when using auotfs-4.1.3-130. I've attached our automount maps. I can't find any relevant autofs messages in /var/log/messages that correspond to the failure to find files. There's a slight chance I'm missing the messages (it failed >week ago and on random machines in our compute pool) but I'm 90% sure that nothing was reported when the "file not found" occurred. There was a reasonable possibility of high network traffic or server load when the failure occurred.
Comment 7 David Johnson 2005-07-19 21:31:49 UTC
Also note that the server was running Solaris 2.8.
Comment 8 David Johnson 2005-07-21 21:40:51 UTC
We've had an additional report of the problem, but this time with autofs-4.1.3-47. We did get an automount failure message. The error occurred druing a compile, with libraries on NFS-mounted partitions not being found and then, later, reappearing (probably when network/server load changed). Jul 20 14:40:00 lentil CROND: (root) CMD (/usr/lib/sa/sa1 1 1) Jul 20 14:50:00 lentil CROND: (root) CMD (/usr/lib/sa/sa1 1 1) Jul 20 14:57:32 lentil automount: failed to mount /u/drspeech Jul 20 15:00:00 lentil CROND: (root) CMD (/usr/lib/sa/sa1 1 1) Jul 20 15:01:00 lentil CROND: (root) CMD (run-parts /etc/cron.hourly) Jul 20 15:01:01 lentil ntpdate: adjust time server 18.104.22.168 offset 0 .085569 sec Jul 20 15:01:03 lentil automount: failed to mount /u/drspeech Jul 20 15:07:23 lentil automount: failed to mount /u/drspeech Jul 20 15:08:23 lentil automount: failed to mount /u/drspeech Jul 20 15:10:00 lentil CROND: (root) CMD (/usr/lib/sa/sa1 1 1)
Comment 9 David Johnson 2005-07-21 21:47:48 UTC
Could the priority of this bug be raised to "high"? We were assuming this was due to the 4.1.3-47->4.1.3-130 upgrade and so we had a downgrade as a workaround. Now we have no workaround. We have 10's of people who rely on NFS for their work and having random files disappear, if only occasionally, is obviously a serious issue. If autofs turns out to be unreliable we may be forced to move back to Solaris 10 for our compute cluster. Thanks.
Comment 10 Chris Feist 2005-07-21 23:00:47 UTC
Can you enable debugging on autofs? (This should help me track down exactly where the problem is occurring.) Add the following line to your /etc/syslog.conf: *.* /var/log/debug Then add --debug to all of the maps in /etc/auto.master. Then run 'service syslog restart'. Then run 'service autofs restart'. Once you reproduce the problem can you post the information into the bug report.
Comment 11 David Johnson 2005-07-25 21:01:40 UTC
We're working on this. Note that we have ~100 production machines, each with nearly 10 maps, so pushing out the maps and restarting the services isn't trivial, even with our push infrastructure. Many of the hosts have multiple services on them, some of which are critical. A "*.* /var/log/debug" on every machine could well be a great way of making hosts crash with a full /var (depending on exactly what they're running). We'll try and get something out to a representative enough subset to stand a chance of seeing the problem. However, something less all-inclusive than "*.*" in /etc/syslog.conf would be preferrable... Thanks!
Comment 12 Chris Feist 2005-07-26 17:55:30 UTC
Unfortunately autofs puts data into syslog at several different priorities and having the syslog outputing everything will also allow me to see if something else is interacting with autofs causing those mounts to fail. If you can find a way to consistently replicate the problem it would make it easier for me to debug. I'm changing this bug to NEEDINFO as I need either debug logs or a method to replicate the problem.
Comment 13 David Johnson 2005-07-29 18:46:50 UTC
I think we have managed to reproduce this problem. The way we did it was to connect a test machine to our main network via a box that effectively generated 10% packet loss. We then had a second machine try and randomly log into the first machine and execute a trivial script from an autofs-mounted directory. The script ran for a few hours but then everything locked up. I'll attach the following: - the script that periodically logged in - the output from that script around the time of failure - the debug log output around the time of failure The maps are as attached above. Let me know if you want the full log output as it's 90MB.
Comment 14 David Johnson 2005-07-29 18:48:52 UTC
Created attachment 117283 [details] Test script to trigger autofs bug
Comment 15 David Johnson 2005-07-29 18:51:44 UTC
Created attachment 117284 [details] Example output from test script
Comment 16 David Johnson 2005-07-29 19:18:25 UTC
Created attachment 117286 [details] section of syslog for failed automount
Comment 17 David Johnson 2005-07-29 19:20:46 UTC
After some broswing of what happened in the logs, it appears things went something like: 00:42:20 expired 00:44:33 attempted to mount 00:45:15 expired 00:46:44 attempted to mount 00:47:51 expired 00:47:51 attempted to mount 00:47:51 failed to mount
Comment 18 David Johnson 2005-07-29 22:19:29 UTC
Note that autofs-4.1.3-47 fails in a similar manner. Reducing the max. time waited from 300 to 60, with the --timeout to autofs set to 30, results in a failure within in few minutes (assuming 10% packet loss). I'm on vacation until Thursday 4th August so won't be able to provide any more info until then.
Comment 19 David Johnson 2005-08-05 21:15:26 UTC
Is anything happening with this? We were hoping to include a fix in our next kernel patch application, which is likely to happen very soon.
Comment 20 Chris Kloiber 2005-08-10 18:31:15 UTC
Engineering is requesting you compress and attach the full log from the debug runs, eg: bzip2 -c /var/log/debug > /tmp/debug.log.bz2 Also please send a new sysreport from the same failing client you take the debug log from.
Comment 21 David Johnson 2005-08-10 19:21:38 UTC
Hmmm - it's taken nearly two weeks to go from "would you like the full debug output" to "yes we would". I'll do the sysreport and get a compressed log and attach it below. I'm 80% sure that this bug will hit any RHEL3 autofs setup if packets are lost. I'm 70% sure it's independent of the type of NFS server (Solaris in this case). I'm contemplating: - seting a system with an out-of-the-box RHEL3 client and out-of-the-box RHEL3 NFS server, fully patched with latest patch set and no local customizations other than minor test-related changes - connecting it to a packet dropper and checking to see if I can crash the client with a 5 line test script This will prove that there are many thousands of RHEL3 autofs clients out there that will eventually crash when the network is loaded. If I do this, providing all of the information necessary to reproduce the test (including how to build a simple packet dropper with free software), will this help get this problem solved sooner and result in a usable autofs package?
Comment 22 David Johnson 2005-08-11 00:39:42 UTC
Created attachment 117629 [details] sysreport output for failing client system
Comment 23 David Johnson 2005-08-11 00:40:51 UTC
Created attachment 117630 [details] full syslog output for autofs failure
Comment 24 David Johnson 2005-08-12 23:28:46 UTC
After a bunch more testing, it appears that this bug is apparent with both RHEL3 and Solaris 8 NFS servers (although it may occur less often with RHEL3). On the client side I've only managed to reproduce it with maps using -tcp, although I'm gonna run a longer test using UDP over the weekend. However, a fully-patched, out-of-the box, uncustomized RHEL3U5 install exhibits the problem with both types of server. The only change was to drop the unmount timout in /etc/sysconfig/autofs, enable the misc map and add a misc map line: test -tcp fig:/scratch/test
Comment 25 Chris Feist 2005-08-15 19:58:05 UTC
After doing some testing myself it looks like the problems happens when the mount request fails (which explains the "File not found" or "Command not found" error). However it should fail so fast from the attempting mount. It looks like it's trying to mount a directory that isn't directly exported by the nfs server for some reason. What do your maps look like on the test system? The map that you've posted earlier show a /u/davidj map, but it seems like you're trying to use a /u/davidj/src map.
Comment 26 David Johnson 2005-08-15 20:19:26 UTC
My original tests were on one of our production systems. However, on Friday I had it failing on a much less customized setup. Below I'll attach: - /etc/auto.misc - the new (simplified) test script /etc/auto.master is unchanged except for enabling the auto.misc map.
Comment 28 David Johnson 2005-08-15 20:26:39 UTC
Created attachment 117771 [details] second test script
Comment 29 Chris Feist 2005-08-15 20:31:42 UTC
Can you also post the /var/log/debug output and output from the 2nd test script?
Comment 30 David Johnson 2005-08-15 20:35:03 UTC
Just to make sure - you did see from my way-back posting that the error seems to happen when there's a timed-out unmount colliding with a mount? Also note that the problem only seems to occur with -tcp mounts. I had it running all weekend with UDP and it didn't fail. I'll switch back to tcp and get some log output for you.
Comment 31 David Johnson 2005-08-17 00:51:40 UTC
Created attachment 117819 [details] *.* syslog output from second autofs test script Here's some more log output, from the second test script. Both client and server are RHEL3. The failure is right near the end of the file at 17:44:47 or thereabouts.
Comment 32 Chris Feist 2005-09-01 16:32:33 UTC
Looking through the logs it looks like the reason that the automount fails is because the mount fails (because of RPC timeouts). If you try to run mount on a server that doesn't respond or is too busy to respond autofs can't mount and will report a failure. I don't think there is anything that can be done differently with autofs as we just run 'mount' and if it works we continue, and if it doesn't then we generate an error. I'm planning on closing this bug as NOTABUG unless you think there could be other problems causing this.
Comment 33 David Johnson 2005-09-01 22:07:14 UTC
Hmmmm. - on our production system, our Solaris clients work fine but our RHEL3 clients occasionally lose file systems, resulting in jobs crashing. - on our test setup, using the udp mount option we've never seen a failure of the sort causing problmes but using the tcp mount option results in mounts failing on a regular basis. - the failure only happens when the timeout-triggered unmount happens at about the same time as the mount. I've never seen it fail to mount except when the two are within a second or so of each other despite the fact that the majority of the test cases are many seconds apart. - a regular mount timeout takes many, many seconds. However, when it fails to do the mount incorrectly, as detailed in the logs, the mount times out in a fraction of a second. Maybe I've spent too much time doing sysadmin and not enough time programming, but this sounds very much like an umount/mount race condition when using NFS over tcp, possibly autofs-triggered or possibly in the kernel or mount/umount programs. To me it doesn't sound much like NOTABUG. Fortunately switching back to NFS over UDP looks like it might be a short-term workaround. I will also produce a google-indexed web page explaining our diagnosis as others seem to be affected by this and they may disagree with the NOTABUG conclusion too.
Comment 34 Chris Feist 2005-09-01 22:57:05 UTC
From the the debug log in comment 31, I see where the mount failed twice. Once at 17:45:25 & once at 17:46:07. Both times the mount was attempted ~20seconds before it actually failed due to an RPC timeout. If autofs were failing the mount due to a race then I would expect it to return immediately and would not have an error from mount (RPC: Timed out). So, it doesn't appear that the 2 debug log contains any issues. I've looked at the first debug log you've posted and there may be a possible bug in there. I'm not sure why autofs is trying to mount /u/davidj/src, that's not exported directory anywhere, is it? I only see /u/davidj in the maps.
Comment 35 David Johnson 2005-09-01 23:52:12 UTC
You're right - the two failures are different. Indeed, on reflection, the symptoms are different - with the first type of failure the autofs system appears to become a mess and the machine often becomes unusable, but in my simplified testing it seemed to carry on fine after the failure. The first type of symptoms are often seen on our production system too - clients often become unusable and need a reboot. You're right - /u/davidj/src/ isn't exported directly - so the automounter is doing something pretty weird. I've also had my account with a home of /u/davidj2/ try and mount my home from /u/davidj/, with resultant permission issues, although I've never been able to subsequently reproduce this. This seems consistent with some type of corruption of the mount path with other string info. Is it possible the unmount is called from a signal handler and is using e.g. non-rentrant routines that are also used as part of mounts?
Comment 36 Jeff Moyer 2005-09-02 14:59:18 UTC
OK, just jumping into this now. For the debug logs, you can configure syslog to only log daemon.*. That is the facility to which autofs logs. There is a kernel patch which addresses the mounting of a subdirectory instead of the autofs "key" when ghosting is enabled. However, it doesn't appear that ghosting is enabled in this environment. I can't verify that, since the updated auto.master was never posted, but "stock" doesn't have --ghsot enabled. Also note that /etc/sysconfig/autofs would be a helpful file to include in the bug report. The unmount is not called from a signal handler. There are calls to ioctl, which are unsafe, but I don't believe this is the code path you are hitting. There are *also* calls to syslog in a signal handler, which are very bad. We've seen these cause autofs to hang, and not perform any more mounts or unmounts. A fix for this problem exists in the RHEL3 U7 beta. As for a mount/umount race, I don't believe there is a problem in the actual mount/umount commands themselves. Any race here would be an autofs problem. Since you have a reproducer, would you be willing to test a new kernel? I'm not certain it will make a difference, but I'd like to give it a try, anyway. The kernel can be found here: http://people.redhat.com/jmoyer/.bz162636/ I've included the smp kernel. If you would like a different kernel, please let me know. I also put the U7 beta version of autofs there for you to try. NOTE: There is a chance that this will fix one of your problems: autofs attempting to mount a non-exported path, that appears to be a directory within an automounted directory. I haven't looked into the other problem in enough detail to comment at this point. Thanks for all of your help in diagnosing this issue.
Comment 37 Chris Feist 2005-09-02 15:02:31 UTC
Just a note, that kernel is part of the RHEL3U6 beta (not RHEL3U7).
Comment 38 Jeff Moyer 2006-03-07 22:06:56 UTC
Changing to NEEDINFO_REPORTER from NEEDINFO.
Comment 39 Jeff Moyer 2006-05-23 17:17:37 UTC
Were you able to try a newer kernel, and did it solve your problem?