Bug 604244 - GFS2: kernel NULL pointer dereference from dlm_astd
Summary: GFS2: kernel NULL pointer dereference from dlm_astd
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
: 6.0
Assignee: Robert Peterson
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 612608
 
Reported: 2010-06-15 17:11 UTC by Robert Peterson
Modified: 2010-11-15 14:28 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 612608 (view as bug list)
Environment:
Last Closed: 2010-11-15 14:28:06 UTC
Target Upstream Version:


Attachments
Upstream patch (deleted)
2010-06-15 17:16 UTC, Robert Peterson
Upstream patch - try #2 (deleted)
2010-06-17 20:49 UTC, Robert Peterson

Description Robert Peterson 2010-06-15 17:11:14 UTC
Description of problem:

This bug was discovered while testing for bug #595397.
The symptom is a kernel crash caused by an error path taken by
gfs2 while looking up a dinode.  I have a patch for this that
I've submitted to cluster-devel.

The problem is in an error path taken when looking up dinodes.
There are two sister functions, gfs2_inode_lookup and
gfs2_process_unlinked_inode.  Both functions acquire and hold
the i_iopen glock for the dinode being looked up.  The last
thing they try to do is hold the i_gl glock for the dinode.
If acquiring that glock fails for some reason, the error path
incorrectly called gfs2_glock_put for the i_iopen glock twice.
As a result, the glock was freed prematurely.  The "minimum
hold time" usually kept the glock itself in memory, but the
lock interface to dlm (aka lock_dlm) freed its own memory for
the glock.  In some circumstances, dlm's dlm_astd daemon would
then try to call the bast function through the freed lock_dlm
memory, which resulted in a NULL pointer dereference.
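
The double put described above can be illustrated with a small, self-contained C sketch. This is not the gfs2 code and not the attached patch (the attachments are no longer available here); struct demo_glock and every function name below are hypothetical stand-ins, and the starting reference count of 1 is an assumption chosen only to make the effect visible.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted glock (not the gfs2 struct). */
struct demo_glock {
    int refcount;  /* number of outstanding references */
    bool freed;    /* set once the last reference has been dropped */
};

/* Take an additional reference; only legal on a live object. */
static void demo_glock_hold(struct demo_glock *gl)
{
    assert(!gl->freed && gl->refcount > 0);
    gl->refcount++;
}

/* Drop one reference; the object is "freed" when the count reaches zero. */
static void demo_glock_put(struct demo_glock *gl)
{
    if (--gl->refcount == 0)
        gl->freed = true;  /* real code would release the memory here */
}

/* Buggy error path: drops the lookup's single reference twice. */
static void demo_error_path_buggy(struct demo_glock *io_gl)
{
    demo_glock_put(io_gl);
    demo_glock_put(io_gl);
}

/* Fixed error path: drops exactly the reference the lookup took. */
static void demo_error_path_fixed(struct demo_glock *io_gl)
{
    demo_glock_put(io_gl);
}

/*
 * Simulate a failed lookup. Assumption: one other holder already has a
 * reference, so the object should still be alive after the lookup fails.
 * Returns true if the glock survived.
 */
static bool demo_survives_failed_lookup(bool buggy)
{
    struct demo_glock io_gl = { .refcount = 1, .freed = false };

    demo_glock_hold(&io_gl);  /* the lookup takes its own reference */
    if (buggy)
        demo_error_path_buggy(&io_gl);
    else
        demo_error_path_fixed(&io_gl);
    return !io_gl.freed;
}
```

With the fixed error path, demo_survives_failed_lookup(false) reports that the other holder's reference is still valid. With the buggy path, the count reaches zero while that holder still expects a live object, which is the kind of premature free that dlm_astd later tripped over.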

Version-Release number of selected component (if applicable):
RHEL6

How reproducible:
Moderate on rhel6, easy on rhel5

Steps to Reproduce:
Run the qe brawl test, but especially minrg's coherency test.

Actual results:
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: 
 [<0000000000000000>] _stext+0x7ffff000/0x1000
PGD 2041d8067 PUD 20a010067 PMD 0 
Oops: 0010 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:1c.5/0000:06:00.0/irq
CPU 1 
Modules linked in: md5 sctp lock_dlm gfs2(U) dlm configfs autofs4 hidp rfcomm
l2cap bluetooth dm_log_clustered(U) lockd sunrpc be2iscsi ib_iser rdma_cm ib_cm
iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo
crypto_api uio cxgb3i cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2
scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmon
i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp
parport sr_mod joydev tg3 ide_cd serio_raw pata_sil680 i3000_edac i2c_i801
shpchp cdrom i2c_core edac_mc sg pcspkr dm_raid45 dm_message dm_region_hash
dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx
scsi_transport_fc ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd
ehci_hcd
Pid: 3191, comm: dlm_astd Tainted: G      2.6.18-200.el5 #1
RIP: 0010:[<0000000000000000>]  [<0000000000000000>] _stext+0x7ffff000/0x1000
RSP: 0018:ffff8102314bfeb8  EFLAGS: 00010247
RAX: 0000000000000000 RBX: ffff8101fbeec4d8 RCX: 0000000000000001
RDX: ffff810208b63188 RSI: 0000000000000101 RDI: 0000000000200200
RBP: 0000000000000005 R08: ffff8102314be000 R09: ffff810232b75280
R10: 0000000000000003 R11: 0000000000000000 R12: ffffffff888212c1
R13: 0000000000000002 R14: ffff81023149fca8 R15: ffffffff800a07e7
FS:  0000000000000000(0000) GS:ffff8101fffb77c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000204270000 CR4: 00000000000006e0
Process dlm_astd (pid: 3191, threadinfo ffff8102314be000, task ffff810231462080)
Stack:  ffffffff88775238 ffff81023149fcb8 0000000000000000 ffffffff88775137
 ffff81023149fcb8 0000000000000282 ffffffff80032880 0000000000000000
 ffff810231462080 ffffffffffffffff ffffffffffffffff ffffffffffffffff
Call Trace:
 [<ffffffff88775238>] :dlm:dlm_astd+0x101/0x14f
 [<ffffffff88775137>] :dlm:dlm_astd+0x0/0x14f
 [<ffffffff80032880>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff800a07e7>] keventd_create_kthread+0x0/0xc4
(kernel panic)

Expected results:
No kernel panic.

Additional info:
See RHEL5 bug #595397 for more information.

Comment 1 RHEL Product and Program Management 2010-06-15 17:13:08 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 2 Robert Peterson 2010-06-15 17:16:49 UTC
Created attachment 424231 [details]
Upstream patch

Here is the upstream patch to fix the problem.  It should
just apply to the RHEL6 branch I think, but I'll make sure.

Comment 4 Robert Peterson 2010-06-16 21:07:20 UTC
Unfortunately, I discovered a problem with the patch.  I know
what it is, so I'll create a new one and test it well before
reposting upstream.  Although Steve claims to have pushed the
patch upstream, it appears he hasn't, and that's a good thing
so I have time to rework it as needed.

Comment 5 Robert Peterson 2010-06-17 20:49:59 UTC
Created attachment 424941 [details]
Upstream patch - try #2

The previous patch worked perfectly for the failing scenario
but got into trouble when the genesis test was run.  This
version of the patch has passed more than 50 iterations of the
genesis program, with flying colors.

It was tested on RHEL6-beta system roth-08.  I could recreate
the problem with the previous version pretty reliably, so I'm
confident that problem is fixed.

I emailed the patch upstream to cluster-devel, but since Steve
is on holiday/vacation it won't be pushed upstream until he
returns in one week.

Comment 6 Robert Peterson 2010-06-18 18:33:43 UTC
The brawl test completed successfully on the west cluster.
The genesis test completed successfully on RHEL6 node west-08.
I posted the patch for inclusion into the RHEL6 kernel today.
Changing status to POST.

Comment 7 Jarod Wilson 2010-06-18 18:44:28 UTC
The initial backtrace says rhel5, but this bug is against rhel6 and mentions testing on rhel6, so I presume you actually meant to cc Aris, not me.

Comment 8 Aristeu Rozanski 2010-07-01 16:12:15 UTC
Patch(es) available on kernel-2.6.32-42.el6

Comment 11 Robert Peterson 2010-07-06 17:15:31 UTC
*** Bug 610136 has been marked as a duplicate of this bug. ***

Comment 12 Nate Straz 2010-07-08 13:45:32 UTC
I hit this BUG with kernel 2.6.32-42.el6.x86_64.  It is the same backtrace as 610136 which was dup'd to this bz.  It was hit while running brawl w/ a 1k file system block size.  The flock below corresponds to a file generated by the test program accordion.

3689713 -rw-rw-r--. 1 root root     27189 Jul  7 23:34 accrdfile2l


 G:  s:UN n:6/384cf1 f:I t:UN d:EX/0 a:0 r:0
------------[ cut here ]------------
kernel BUG at fs/gfs2/glock.c:173!
invalid opcode: 0000 [#1]
Modules linked in: sctp libcrc32c gfs2 dlm configfs sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 dm_mirror dm_region_hash dm_log dcdbas k8temp hwmon serio_raw amd64_edac_mod edac_core edac_mce_amd tg3 sg i2c_piix4 shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom qla2xxx scsi_transport_fc scsi_tgt sata_svw ata_generic pata_acpi pata_serverworks radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod [last unloaded: configfs]
Pid: 6793, comm: dlm_astd Not tainted 2.6.32-42.el6.x86_64 #1 PowerEdge SC1435
RIP: 0010:[<ffffffffa0435680>]  [<ffffffffa0435680>] gfs2_glock_hold+0x20/0x30 [gfs2]
RSP: 0018:ffff88011a075e10  EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8801fa45ba28 RCX: 000000000000264e
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000000
RBP: ffff88011a075e10 R08: ffffffff818bb9c0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000000001 R15: ffff88011a12f000
FS:  00007f1497a47700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000002886000 CR3: 00000001bdbc4000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dlm_astd (pid: 6793, threadinfo ffff88011a074000, task ffff8801186c7580)
Stack:
 ffff88011a075e40 ffffffffa0436141 0000000000000001 0000000000000000
<0> 0000000000000001 ffff880107b80000 ffff88011a075e60 ffffffffa0453a5d
<0> ffffffffa0416aa8 ffff8801d229a078 ffff88011a075ee0 ffffffffa03f93dd
Call Trace:
 [<ffffffffa0436141>] gfs2_glock_complete+0x31/0xd0 [gfs2]
 [<ffffffffa0453a5d>] gdlm_ast+0xfd/0x110 [gfs2]
 [<ffffffffa03f93dd>] dlm_astd+0x25d/0x2b0 [dlm]
 [<ffffffffa0453860>] ? gdlm_bast+0x0/0x50 [gfs2]
 [<ffffffffa0453960>] ? gdlm_ast+0x0/0x110 [gfs2]
 [<ffffffffa03f9180>] ? dlm_astd+0x0/0x2b0 [dlm]
 [<ffffffff810909e6>] kthread+0x96/0xa0
 [<ffffffff810141ca>] child_rip+0xa/0x20
 [<ffffffff81090950>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
Code: ff ff c9 c3 0f 1f 80 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 8b 47 28 85 c0 74 06 f0 ff 47 28 c9 c3 48 89 fe 31 ff e8 a0 fc ff ff <0f> 0b eb fe 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 
RIP  [<ffffffffa0435680>] gfs2_glock_hold+0x20/0x30 [gfs2]
 RSP <ffff88011a075e10>

Comment 13 Robert Peterson 2010-07-08 15:13:53 UTC
Bug #610136 was marked as a duplicate because it involved an
improperly referenced i_iopen glock, as shown by the "5/"
in the glock dump:

 G:  s:UN n:5/9d14 f:I t:UN d:EX/0 a:0 r:0

However, in this case, the glock referenced improperly is

 G:  s:UN n:6/384cf1 f:I t:UN d:EX/0 a:0 r:0

and "6/" indicates a glock for an flock: LM_TYPE_FLOCK.
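
As a rough illustration of how the type can be read from such a dump line, here is a minimal C sketch. The function names are hypothetical, the "n:<type>/<number>" layout is inferred only from the two dump lines quoted in this bug, and only the two type numbers mentioned in this comment (5 and 6) are labeled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the type and number from the "n:<type>/<number>" field of a
 * glock dump line. The number is read as hexadecimal.
 * Returns 0 on success, -1 if the field is missing or malformed.
 */
static int demo_parse_glock_name(const char *line, unsigned *type,
                                 unsigned long long *number)
{
    const char *field;

    assert(line != NULL && type != NULL && number != NULL);

    /* Search for " n:" so that other fields ending in 'n' cannot match. */
    field = strstr(line, " n:");
    if (field == NULL)
        return -1;
    if (sscanf(field, " n:%u/%llx", type, number) != 2)
        return -1;
    return 0;
}

/* Label only the two glock types discussed in this bug. */
static const char *demo_glock_type_label(unsigned type)
{
    switch (type) {
    case 5:
        return "iopen";
    case 6:
        return "flock";
    default:
        return "other";
    }
}
```

Feeding it the line from comment #12 yields type 6 ("flock"), while the line from bug #610136 yields type 5 ("iopen"), which is the distinction drawn here.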

The patch for this bug record affected only i_iopen glocks.
Therefore, although this symptom is nearly identical, the
problem is not with this patch.  This has got to be another
similar bug somewhere in the flock code.

Please open a new bugzilla record with the symptom from
comment #12 and assign it to me.  Setting this one back to
ON_QA.

Comment 14 releng-rhel@redhat.com 2010-11-15 14:28:06 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.

