Bug 225515 - [patch]Kernel panic (handle kernel NULLpointer) occurred in NFSv4
Summary: [patch]Kernel panic (handle kernel NULLpointer) occurred in NFSv4
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Steve Dickson
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 245062
Reported: 2007-01-31 00:36 UTC by shichao
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-11-07 19:23:33 UTC


Attachments
nfs_update_inode_panic.patch (deleted), 2007-01-31 00:36 UTC, shichao
Proposed Upstream patch. (deleted), 2007-06-06 19:38 UTC, Steve Dickson


Links
System: Red Hat Product Errata
ID: RHBA-2007:0959
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated kernel packages for Red Hat Enterprise Linux 5 Update 1
Last Updated: 2007-11-08 00:47:37 UTC

Description shichao 2007-01-31 00:36:20 UTC
Description of problem:
  If /proc/sys/sunrpc/nfs_debug is set to 65535, closing many open files
on NFSv4 usually triggers a kernel panic (unable to handle kernel NULL
pointer dereference).

kernel in RHEL5Beta2

How reproducible:
Usually

Steps to Reproduce:

1. Mount: mount -t nfs4  192.168.0.21:/  /mnt
2. A process creates and opens 1024 files on the NFSv4 mount.
3. The process writes a test message to each file.
4. The process closes all the open files and exits.
5. Immediately after the process exits, the test unmounts the NFSv4
file system.
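
Steps 2 to 4 could be driven by a small helper along the following lines. This is only a sketch of the described workload, not the reporter's actual test program: the function name create_write_close(), the file-name pattern, and the message text are all made up for illustration. Holding 1024 files open at once may also require raising the per-process descriptor limit first, so the helper stops at the first failed open() and reports how many files it handled.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: create and open up to `count` files under `dir`,
 * write a short test message to each, then close them all.
 * Returns the number of files opened and written, or -1 if the
 * descriptor table could not be allocated.
 */
int create_write_close(const char *dir, int count)
{
    static const char msg[] = "nfs4 close test\n";
    int *fds;
    int opened = 0;

    if (count <= 0)
        return 0;

    fds = malloc(sizeof(*fds) * (size_t)count);
    if (fds == NULL)
        return -1;

    /* Steps 2 and 3: keep every file open while writing to it. */
    for (int i = 0; i < count; i++) {
        char path[4096];
        int fd;

        snprintf(path, sizeof(path), "%s/testfile.%d", dir, i);
        fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0)
            break; /* for example, the descriptor limit was reached */
        if (write(fd, msg, strlen(msg)) != (ssize_t)strlen(msg)) {
            close(fd);
            break;
        }
        fds[opened++] = fd;
    }

    /* Step 4: close all files back to back. */
    for (int i = 0; i < opened; i++)
        close(fds[i]);

    free(fds);
    return opened;
}
```

A caller would pass the mount point from step 1, for example create_write_close("/mnt", 1024), and then exit so that the umount in step 5 can follow immediately.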

Actual results:
In normal case, I haven't found the problem in the test; but when
the /proc/sys/sunrpc/nfs_debug is set to debug mode(65535), the panic
problem will usually happen.
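
For completeness, the debug setting can be applied from a test harness with a helper like the one below. This is a sketch under assumptions: the function name write_debug_mask() is invented here, and the report does not say how the reporter set the value. It simply writes a decimal number to the given file.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write `mask` as a decimal number to the
 * sysctl-style file at `path`. Returns 0 on success, -1 on failure.
 */
int write_debug_mask(const char *path, unsigned int mask)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;

    ok = fprintf(f, "%u\n", mask) > 0;
    /* fclose() flushes the buffer, so a failed write shows up here. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Run as root, write_debug_mask("/proc/sys/sunrpc/nfs_debug", 65535) would enable the debug mode used in this report, and passing 0 would turn it off again.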

Additional info:
The kernel panic log is as follows:

NFS: nfs_update_inode(0:18/2049980 ct=1 info=0xe)
BUG: unable to handle kernel NULL pointer dereference at virtual address 0000000c
 printing eip: deb82605
*pde = 001f3067
Oops: 0000 [#1]
SMP
last sysfs file: /block/hda/removable
Modules linked in: nfs fscache nfsd …. …….
CPU:    0
EIP:    0060:[<deb82605>]    Not tainted VLI
EFLAGS: 00010246   (2.6.18-1.2747.el5 #1)
EIP is at nfs_update_inode+0xb0/0x692 [nfs]
…skip….
Process rpciod/0 (pid: 1865, ti=dd3ac000 task=ddd47550 task.ti=dd3ac000)
Stack: deba0609 … …
Call Trace:
 [<deb82c1f>] nfs_refresh_inode+0x38/0x1b0 [nfs]
 [<dea9f602>] rpc_exit_task+0x1e/0x6c [sunrpc]
 [<dea9f314>] __rpc_execute+0x82/0x1b3 [sunrpc]
 [<c0433899>] run_workqueue+0x83/0xc5
 [<c0434171>] worker_thread+0xd9/0x10c
 [<c0436620>] kthread+0xc0/0xec
 [<c0404d63>] kernel_thread_helper+0x7/0x10
DWARF2 unwinder stuck at kernel_thread_helper+0x7/0x10

I have investigated the problem and found that the cause is the pointer
i_sb->s_root being NULL in nfs_update_inode() at the time of the
panic.

  In the kernel, nfs_file_release() is invoked at the end of the file
close operation.
  The call path I found in the kernel is as follows:
  nfs_file_release()
       |-- NFS_PROTO(inode)->file_release()
             |
           nfs_file_clear_open_context()
             |
            put_nfs_open_context()
               |-- nfs4_close_state
               |       |-- nfs4_close_ops
               |                 |
               |               nfs4_do_close()
               |                  |
               |                 nfs_update_inode()
               |                        |-- inode == inode->i_sb->s_root->d_inode
               |
               |-- mntput(ctx_vfsmnt)
                      |
                     atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)

  After the asynchronous RPC call "nfs4_close_ops" is started in
put_nfs_open_context(), the kernel calls mntput(), which decrements
mnt->mnt_count. In my test, sys_umount() was executed immediately after
the files were closed. Normally the asynchronous RPC call
"nfs4_close_ops" completes quickly, so sys_umount() is rarely invoked
before nfs_update_inode() has finished. But when sunrpc/nfs_debug is set
to debug mode, NFS issues a large number of printk calls. These printk
calls easily delay the RPC call "nfs4_close_ops", so sys_umount() can be
invoked before nfs_update_inode() has finished. In do_umount() (the
umount succeeds because mnt->mnt_count has already been decremented),
sb->s_root is set to NULL in shrink_dcache_for_umount(), which is
called from nfs_kill_super().
Therefore, the kernel panics on a NULL pointer dereference when
nfs_update_inode() accesses inode->i_sb->s_root.

Because umount can set sb->s_root of a super_block to NULL while
nfs4_close_ops has not yet finished in NFS, nfs_update_inode() needs to
check inode->i_sb->s_root for a NULL pointer before dereferencing it.
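
The check described above can be illustrated with a small user-space model. This is not kernel code and not the attached patch (which is no longer available here): the structures below are minimal stand-ins that keep only the fields named in this report, and is_root_inode() and model_umount() are hypothetical names introduced for the sketch.

```c
#include <stddef.h>

/* Minimal stand-ins for the kernel structures named in the report. */
struct inode;
struct dentry      { struct inode *d_inode; };
struct super_block { struct dentry *s_root; };
struct inode       { struct super_block *i_sb; };

/*
 * Guarded form of the comparison
 *     inode == inode->i_sb->s_root->d_inode
 * Returns 1 if `inode` is the root inode of its super_block, and 0
 * otherwise, including the case where s_root has already been cleared.
 */
int is_root_inode(const struct inode *inode)
{
    /* Read s_root once so the test and the dereference see the same value. */
    const struct dentry *root = inode->i_sb->s_root;

    return root != NULL && inode == root->d_inode;
}

/* Models the effect the report attributes to umount: s_root becomes NULL. */
void model_umount(struct super_block *sb)
{
    sb->s_root = NULL;
}
```

In this model, calling is_root_inode() after model_umount() returns 0 instead of dereferencing a NULL pointer. A check like this only avoids the crash; it does not by itself stop umount from proceeding while the close RPC is still in flight.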

To resolve this problem, I have attached a patch for the kernel. With
the patch applied, the problem no longer occurs in my test.

Comment 1 shichao 2007-01-31 00:36:20 UTC
Created attachment 146986
nfs_update_inode_panic.patch

Comment 2 RHEL Product and Program Management 2007-02-13 11:23:39 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 Steve Dickson 2007-03-21 14:13:46 UTC
Upstream thread:

http://www.gossamer-threads.com/lists/linux/kernel/727123


Comment 4 Martin Jenner 2007-05-02 15:39:47 UTC
QE ack for RHEL 5.1

Comment 5 Steve Dickson 2007-06-06 19:38:05 UTC
Created attachment 156385
Proposed Upstream patch.

Would it be possible to test this upstream patch? 

Unfortunately I'm having no success in reproducing
this oops, so if you could verify that this patch
fixes the race condition that would be very helpful.

Comment 6 Steve Dickson 2007-06-19 16:40:36 UTC
ping....

Comment 7 Michael Torrie 2007-06-19 17:39:37 UTC
I am interested in testing this patch, but it will be some time before I can do
so, since I'm in the process of some major server work.

Will this patch also apply to the Fedora Core 6 kernel sources?  I had this
problem also on FC6.  (not sure about FC7, haven't had a chance to try nfsv4 on it).

Comment 8 Steve Dickson 2007-06-19 18:02:11 UTC
Yes... it should apply to an FC6 kernel... Please let me know if there is a
problem... 

Comment 9 Don Zickus 2007-06-27 15:53:10 UTC
in 2.6.18-32.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 12 errata-xmlrpc 2007-11-07 19:23:33 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html


