
Bug 77553

Summary: NFS server stops honoring mount requests - stuck NFS UDP queue
Product: [Retired] Red Hat Linux
Component: kernel
Version: 7.3
Hardware: i686
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: medium
Reporter: Need Real Name <dwtrusty>
Assignee: Arjan van de Ven <arjanv>
CC: mvpel, sct
Doc Type: Bug Fix
Last Closed: 2004-09-30 15:40:10 UTC

Description Need Real Name 2002-11-08 22:22:40 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.75 [en] (X11; U; HP-UX B.11.11 9000/785)

Description of problem:
Our NFS server stopped honoring mount requests.  The UDP receive queue for
the NFS service appeared stuck, and mount requests from every client machine
timed out repeatedly.

The NFS server's log showed "rpc.mountd: authenticated mount request", but the
server apparently never completed those requests.
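
For reference, one quick probe from a client would be a null RPC call to the
NFS service over UDP, which presumably hangs or times out while the nfs
receive queue is stuck.  A sketch, with <server> standing in for the NFS
server's hostname:

  # Null RPC to the NFS service over UDP; a healthy server answers
  # "program 100003 version N ready and waiting".
  /usr/sbin/rpcinfo -u <server> nfs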

Version-Release number of selected component (if applicable):


How reproducible:
Didn't try

Steps to Reproduce:
1. Run NFS

Additional info:

The output from the netstat command showed:

>netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
...
udp   184288      0 *:nfs                   *:*                                 
udp     1728      0 *:32769                 *:*   

The permanently stuck queue was the 'nfs' queue shown above, with 184288 bytes
queued.
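
One way to watch whether the queue ever drains (a sketch; the interval is
arbitrary):

  # Re-run netstat every 5 seconds and watch the Recv-Q column for *:nfs.
  watch -n 5 'netstat -au | grep nfs'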

Note that udp port 32769 is opened by mountd.  This queue would build up for
a while and then drop to zero.  Doing an 'strace' on the mountd process showed
that it saw the mount requests, so mountd was not deadlocked.
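
The strace was along these lines (a sketch; we attached to the running
daemon):

  # Attach to the running rpc.mountd and trace only network-related syscalls.
  strace -e trace=network -p `pidof rpc.mountd`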

I noticed these messages in the log, but am not sure if they are related to the
problem:
Nov  8 14:19:59 genetics iptables: Flushing all chains: succeeded
Nov  8 14:19:59 genetics iptables: Removing user defined chains: succeeded
Nov  8 14:19:59 genetics iptables: Resetting built-in chains to the default
ACCEPT policy succeeded
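
To rule the packet filter in or out, one could list the live rules (a sketch):

  # Verify the filter table is empty and the built-in chains default to ACCEPT.
  /sbin/iptables -L -n -v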

Stopping and restarting NFS via the /etc/rc3.d/ script cleared the problem.
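
Concretely, the rc3.d entry is a symlink to the init script, so this amounts
to something like:

  /etc/rc.d/init.d/nfs stop
  /etc/rc.d/init.d/nfs start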

Comment 1 Arjan van de Ven 2002-11-11 17:07:57 UTC
OK, so what kernel version are you using, and which network card?

Comment 2 Stephen Tweedie 2002-11-11 17:25:36 UTC
Please show a "tcpdump" of the mount port, both from the client and from the
server.  You should first do a

  /usr/sbin/rpcinfo -p

on the server to see which port it is using for mountd, then 

  /usr/sbin/tcpdump port <portnum>

to capture all traffic to or from that port.

Comment 3 Need Real Name 2002-11-12 00:30:30 UTC
Thanks for helping us.

Our NIC hardware is this one:
   eth0: Intel Corp. 82557/8/9 [Ethernet Pro 100], 00:06:5B:88:1D:2A, IRQ 19.

We are using kernel version 2.4.18-10smp.

Comment 4 Need Real Name 2002-11-12 03:11:02 UTC
Here is the client info:

[root@hydra /etc]#  /usr/sbin/rpcinfo -p
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100011    1   udp    801  rquotad
    100011    2   udp    801  rquotad
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100021    1   udp  32768  nlockmgr
    100021    3   udp  32768  nlockmgr
    100021    4   udp  32768  nlockmgr
    100005    1   udp  32770  mountd
    100005    1   tcp  32768  mountd
    100005    2   udp  32770  mountd
    100005    2   tcp  32768  mountd
    100005    3   udp  32770  mountd
    100005    3   tcp  32768  mountd
    100024    1   udp  32771  status
    100024    1   tcp  32769  status


Here is the server info:

[root@genetics log]#  /usr/sbin/rpcinfo -p
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp  32768  status
    100024    1   tcp  32768  status
    391002    2   tcp  32769  sgi_fam
    100011    1   udp    985  rquotad
    100011    2   udp    985  rquotad
    100011    1   tcp    988  rquotad
    100011    2   tcp    988  rquotad
    100005    1   udp  32784  mountd
    100005    1   tcp  37175  mountd
    100005    2   udp  32784  mountd
    100005    2   tcp  37175  mountd
    100005    3   udp  32784  mountd
    100005    3   tcp  37175  mountd
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100021    1   udp  32785  nlockmgr
    100021    3   udp  32785  nlockmgr
    100021    4   udp  32785  nlockmgr


Here is the output of tcpdump for the server:

[root@genetics dtrusty]# tcpdump port 32784
tcpdump: listening on eth0
21:34:50.991743 hydra.swmed.edu.689 > genetics.swmed.edu.32784:  udp 120 (DF)
21:34:50.992573 genetics.swmed.edu.32784 > hydra.swmed.edu.689:  udp 56 (DF)

The server syslog seems happy:

Nov 11 21:34:50 genetics rpc.mountd: authenticated mount request from hydra.swmed.edu:689 for /mnt/databases (/mnt/databases)


I get nothing on the client tcpdump:

[root@hydra log]# tcpdump port 32770
Kernel filter, protocol ALL, TURBO mode (575 frames), datagram packet socket
tcpdump: listening on all devices

Comment 5 Need Real Name 2002-11-12 03:22:00 UTC
I'm finding out a few more symptoms.  

The order of the mounts seems to affect it greatly.  If I mount a filesystem
from another server first, the next mounts from the previously timing-out
server work!  It is 100% repeatable.
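
For example (hostnames and mount points here are illustrative, except
/mnt/databases, which is the export from the log above):

  # A mount from a different server succeeds...
  mount otherserver:/export /mnt/other
  # ...and afterwards the mount that previously timed out works too:
  mount genetics:/mnt/databases /mnt/databases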

Any ideas?


Comment 6 Need Real Name 2002-11-12 05:33:39 UTC
I think we should concentrate on the client, since that machine is 
where we can repeat the error.  The client is running kernel 2.4.18-smp.

It has these network interfaces:

eth0: Intel Corp. 82557 [Ethernet Pro 100], 00:D0:B7:A9:2C:73, IRQ 21.
   - internal network

eth1: NetGear GA620 Gigabit Ethernet at 0xf4000000, irq 18
   - connected to the NFS servers

Any help will be greatly appreciated.


Comment 7 Stephen Tweedie 2002-11-12 10:00:34 UTC
For the tcpdump on the client, please use the same port as on the server ---
we're looking for packets to or from the server's mountd port, not the client's
one.  We need to see if those packets are getting back to the client OK.
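
Given the rpcinfo output in comment 4, that would be something like this on
hydra, 32784 being the server's UDP mountd port:

  /usr/sbin/tcpdump port 32784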

Comment 8 Need Real Name 2002-11-18 22:38:54 UTC
We are seeing some other errors on the client today:
  Nov 17 05:20:12 hydra kernel: nfs: server morpheus not responding, still trying
  Nov 18 18:30:43 hydra kernel: nfs: task 10873 can't get a request slot
  Nov 18 18:32:43 hydra kernel: nfs_statfs: statfs error = 512
  Nov 18 18:38:16 hydra kernel: nfs_statfs: statfs error = 512

The host 'morpheus' seems alive and well.

Any ideas as to the cause?

David

Comment 9 Need Real Name 2002-11-18 23:19:33 UTC
I did a trace on the 'nfs' port (on the client) and I am getting these messages:
tcpdump 'port 2049 && host morpheus'
Kernel filter, protocol ALL, TURBO mode (575 frames), datagram packet socket
tcpdump: listening on all devices
19:30:48.767198 eth1 > hydra.swmed.edu.661373190 > morpheus.swmed.edu.nfs: 108 getattr fh 0,32/655360 (DF)
19:30:48.767577 eth1 < morpheus.swmed.edu.nfs > hydra.swmed.edu.661373190: reply ok 28 getattr ERROR: Communication error on send (DF)
19:31:48.760169 eth1 > hydra.swmed.edu.661373190 > morpheus.swmed.edu.nfs: 108 getattr fh 0,32/655360 (DF)
19:31:48.762127 eth1 < morpheus.swmed.edu.nfs > hydra.swmed.edu.661373190: reply ok 28 getattr ERROR: Communication error on send (DF)
19:32:11.157577 eth1 > hydra.swmed.edu.661373190 > morpheus.swmed.edu.nfs: 108 getattr fh 0,32/655360 (DF)
19:32:11.157972 eth1 < morpheus.swmed.edu.nfs > hydra.swmed.edu.661373190: reply ok 28 getattr ERROR: Communication error on send (DF)


Comment 10 Need Real Name 2002-11-18 23:26:41 UTC
Here is the trace from the server.  It seems to be responding properly (ok 28
message).

tcpdump: listening on lan0
17:21:53.324681 order.swmed.edu.661373190 > morpheus.swmed..nfs: 108 getattr [|nfs] (DF)
17:21:53.339951 morpheus.swmed..nfs > order.swmed.edu.661373190: reply ok 28 getattr [|nfs] (DF)
17:21:53.340014 order.swmed.edu > morpheus.swmed.edu: icmp: order.swmed.edu udp port 929 unreachable [tos 0xc0]
17:21:53.340108 order.swmed.edu.22 > morpheus.swmed.edu.54670: P 2035704829:2035704973(144) ack 1603569760 win 9648 (DF) [tos 0x10]
17:21:53.344672 order.swmed.edu.22 > morpheus.swmed.edu.54670: P 144:320(176) ack 1 win 9648 (DF) [tos 0x10]
17:21:53.344721 morpheus.swmed.edu.54670 > order.swmed.edu.22: . ack 320 win 32768 (DF) [tos 0x10]
17:22:53.317428 order.swmed.edu.661373190 > morpheus.swmed..nfs: 108 getattr [|nfs] (DF)
17:22:53.317571 morpheus.swmed..nfs > order.swmed.edu.661373190: reply ok 28 getattr [|nfs] (DF)
17:22:53.318068 order.swmed.edu.22 > morpheus.swmed.edu.54670: P 320:464(144) ack 1 win 9648 (DF) [tos 0x10]
17:22:53.318135 order.swmed.edu > morpheus.swmed.edu: icmp: order.swmed.edu udp port 929 unreachable [tos 0xc0]
17:22:53.318596 order.swmed.edu.22 > morpheus.swmed.edu.54670: P 464:640(176) ack 1 win 9648 (DF) [tos 0x10]
17:22:53.318646 morpheus.swmed.edu.54670 > order.swmed.edu.22: . ack 640 win 32768 (DF) [tos 0x10]

Any ideas?

David


Comment 11 Michael Pelletier 2004-01-29 01:37:49 UTC
Found a possibly related page, regarding write-caching:
http://article.gmane.org/gmane.linux.nfs/4249

We found similar statfs error 512 log entries on a Red Hat 8.0 system today,
kernel 2.4.18-19.8.0, shortly after increasing the wsize on TCP-based NFS
mounts.

Is the client running ClearCase, by any chance?

Comment 12 Bugzilla owner 2004-09-30 15:40:10 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/