Bug 158606 - Login nodes to cluster hang, requiring reboots. Top shows high proportion of iowait ..
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: nfs-utils
Version: 3.0
Hardware: i686
OS: Linux
Target Milestone: ---
Assignee: Steve Dickson
QA Contact: Ben Levenson
Depends On:
Reported: 2005-05-23 21:31 UTC by Gary Howell
Modified: 2008-08-02 23:40 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2006-10-17 14:39:57 UTC
Target Upstream Version:

Description Gary Howell 2005-05-23 21:31:41 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)

Description of problem:
Since the "upgrade" to RHEL 3 two weeks ago, we have had to reboot
the login nodes every few days.

They hang: user sessions stall, and new users may or may not get a
password prompt.

"top" shows usage is low, except for iowait near 100% on all processors. 
Typically, both login nodes stall and are rebooted. 

The Red Hat 7.3 fileserver node does not require rebooting. Neither
do the RHEL 3 computational nodes.

Other cluster sys admins we've talked to report a similar problem, but
with lesser frequency. 

We do not know how to identify which user job (if any) is generating
a large number of I/O requests.  Recently we have added a few more
mounted file systems which are accessible to all 130 or so computational
nodes.  Our file systems are NFS mounted, connected by GigE switches.

What other information would be helpful to you? 

Version-Release number of selected component (if applicable):

How reproducible:
Didn't try

Actual Results:  We are currently trying some IOzone tests of the file system(s) to see if 
we can reproduce the error.  Note though that the nodes stalling are not the 

Additional info:

Comment 4 Steve Dickson 2005-05-24 17:02:03 UTC
I'm assuming you're using NFS to serve a large computational cluster
and some of the clients are getting hung? If this is the case,
please post system traces (i.e. echo t > /proc/sysrq-trigger)
of both the server and client (assuming they are Linux clients).

Note: To enable system traces either edit /etc/sysctl.conf and
set kernel.sysrq to 1 or 'echo 1 > /proc/sys/kernel/sysrq'. If
configured correctly, there will be a system trace of every
process in /var/log/messages. 
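
The steps above can be sketched as a small shell helper. This is only an
illustration, not a supported tool: the variable and function names
(SYSRQ_CTL, SYSRQ_TRIGGER, enable_sysrq, dump_tasks) are made up for this
example, while the two /proc paths are the ones named in the comment above.

```shell
#!/bin/sh
# Sketch: enable SysRq and request a trace of every task.
# The /proc paths come from the comment above; all names are illustrative.
SYSRQ_CTL=/proc/sys/kernel/sysrq
SYSRQ_TRIGGER=/proc/sysrq-trigger

# Enable the SysRq interface on the running kernel (requires root).
enable_sysrq() {
    echo 1 > "$SYSRQ_CTL"
}

# Ask the kernel to log a trace of every process (requires root).
dump_tasks() {
    echo t > "$SYSRQ_TRIGGER"
}
```

Run as root on a hung client and on the server, this would be used as
"enable_sysrq && dump_tasks", after which the traces should appear in
/var/log/messages. The echo-based setting does not survive a reboot; setting
kernel.sysrq to 1 in /etc/sysctl.conf makes it persistent.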
