Bug 450939 - panic in cluster_log_ser during resync of two legged core log mirrors
Summary: panic in cluster_log_ser during resync of two legged core log mirrors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cmirror-kernel
Version: 4
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-06-11 20:13 UTC by Corey Marthaler
Modified: 2010-01-12 02:12 UTC (History)
CC List: 1 user

Fixed In Version: RHBA-2008-0803
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-07-25 19:28:02 UTC


Attachments
log from taft-01 (deleted) (2008-06-11 20:19 UTC, Corey Marthaler)
log from taft-02 (deleted) (2008-06-11 20:21 UTC, Corey Marthaler)
log from taft-03 (deleted) (2008-06-11 20:27 UTC, Corey Marthaler)
log from taft-04 (deleted) (2008-06-11 20:28 UTC, Corey Marthaler)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2008:0803 normal SHIPPED_LIVE cmirror-kernel bug fix update 2008-07-25 19:27:56 UTC

Description Corey Marthaler 2008-06-11 20:13:14 UTC
Description of problem:
Saw this panic on taft-01 after the failed logs were added back into the mirrors
and while they were in the process of resyncing.

Scenario: Kill secondary leg of synced core log 2 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:      syncd_secondary_core_2legs
* sync:      1
* mirrors:   2
* disklog:   0
* failpv:    /dev/sdf1
* legs:      2
* pvs:       /dev/sdg1 /dev/sdf1
************************************************

Creating mirror(s) on taft-01...
taft-01: lvcreate --corelog -m 1 -n syncd_secondary_core_2legs_1 -L 800M
helter_skelter /dev/sdg1:0-1000 /dev/sdf1:0-1000
taft-01: lvcreate --corelog -m 1 -n syncd_secondary_core_2legs_2 -L 800M
helter_skelter /dev/sdg1:0-1000 /dev/sdf1:0-1000

Waiting until all mirrors become fully syncd...
        0/2 mirror(s) are fully synced: ( 1=11.50% 2=0.50% )
        0/2 mirror(s) are fully synced: ( 1=31.50% 2=20.50% )
        0/2 mirror(s) are fully synced: ( 1=52.50% 2=42.00% )
        0/2 mirror(s) are fully synced: ( 1=72.50% 2=62.00% )
        0/2 mirror(s) are fully synced: ( 1=93.00% 2=82.50% )
        2/2 mirror(s) are fully synced: ( 1=100.00% 2=100.00% )

Creating gfs on top of mirror(s) on taft-01...
Mounting mirrored gfs filesystems on taft-01...
Mounting mirrored gfs filesystems on taft-02...
Mounting mirrored gfs filesystems on taft-03...
Mounting mirrored gfs filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...

<start name="taft-01_1" pid="9519" time="Wed Jun 11 14:05:02 2008" type="cmd" />
<start name="taft-01_2" pid="9521" time="Wed Jun 11 14:05:02 2008" type="cmd" />
<start name="taft-02_1" pid="9523" time="Wed Jun 11 14:05:02 2008" type="cmd" />
<start name="taft-02_2" pid="9525" time="Wed Jun 11 14:05:02 2008" type="cmd" />
<start name="taft-03_1" pid="9527" time="Wed Jun 11 14:05:02 2008" type="cmd" />
<start name="taft-03_2" pid="9529" time="Wed Jun 11 14:05:02 2008" type="cmd" />
<start name="taft-04_1" pid="9531" time="Wed Jun 11 14:05:02 2008" type="cmd" />
<start name="taft-04_2" pid="9533" time="Wed Jun 11 14:05:02 2008" type="cmd" />
Sleeping 12 seconds to get some outstanding I/O locks before the failure

Disabling device sdf on taft-01
Disabling device sdf on taft-02
Disabling device sdf on taft-03
Disabling device sdf on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
10+0 records in
10+0 records out
Verifying the down conversion of the failed mirror(s)
  /dev/sdf1: open failed: No such device or address
Verifying LEG device /dev/sdg1 *IS* in the volume(s)
  /dev/sdf1: open failed: No such device or address
Verifying FAILED device /dev/sdf1 is *NOT* in the volume(s)
  /dev/sdf1: open failed: No such device or address
Verify the dm devices associated with /dev/sdf1 are no longer present

Verifying files (checkit) on mirror(s) on...

Enabling device sdf on taft-01
Enabling device sdf on taft-02
Enabling device sdf on taft-03
Enabling device sdf on taft-04

Recreating PV /dev/sdf1
  WARNING: Volume group helter_skelter is not consistent
  WARNING: Volume Group helter_skelter is not consistent
  WARNING: Volume group helter_skelter is not consistent
Extending the recreated PV back into VG helter_skelter
Up converting linear(s) back to mirror(s) on taft-01...
taft-01: lvconvert --corelog -m 1 helter_skelter/syncd_secondary_core_2legs_1
/dev/sdg1:0-1000 /dev/sdf1:0-1000
taft-01: lvconvert --corelog -m 1 helter_skelter/syncd_secondary_core_2legs_2
/dev/sdg1:0-1000 /dev/sdf1:0-1000
Verifying the up conversion from linear(s) to mirror(s)
Verifying device /dev/sdg1 *IS* in the mirror(s)
Verifying device /dev/sdf1 *IS* in the mirror(s)

Waiting until all mirrors become fully syncd...
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=2.00% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=23.00% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=39.00% )
        1/2 mirror(s) are fully synced: ( 1=100.00% 2=51.00% )
[PANIC]

Jun 11 14:10:18 taft-01 kernel: dm-cmirror: Someone is already recovering region
697/90GcsfRZ
Jun 11 14:10:18 taft-01 last message repeated 22 times
Jun 11 14:10:18 taft-01 kernel: dm-cmirror: Client finishing recovery: 697/90GcsfRZ
Jun 11 14:10:18 taft-01 kernel: dm-cmirror: Resync work completed by 2: 697/90GcsfRZ
Jun 11 14:10:18 taft-01 kernel: dm-cmirror: Assigning recovery work to 2:
698/90GcsfRZ
Jun 11 14:10:18 taft-01 kernel: dm-cmirror: Received recovery work from 2:
698/90GcsfRZ
Jun 11 14:10:18 taft-01 kernel: dm-cmirror: Someone is already recovering region
698/90GcsfRZ
Jun 11 14:10:18 taft-01 last message redm-cmirror: Recovery blocked by
outstanding write on region 831/90GcsfRZ
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at dm_cmirror_server:696
invalid operand: 0000 [1] SMP
CPU 2
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) dm_cmirror(U) dlm(U)
cman(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds
yenta_socket pcmcia_core cpufreq_powersave button battery ac uhci_hcd ehci_hcd
hw_random e1000 floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ata_piix
libata qla2300 qla2xxx scsi_transport_fc megaraid_mbox megaraid_mm sd_mod scsi_mod
Pid: 20086, comm: cluster_log_ser Not tainted 2.6.9-71.ELsmp
RIP: 0010:[<ffffffffa028e7f1>]
<ffffffffa028e7f1>{:dm_cmirror:cluster_log_serverd+3897}
RSP: 0000:0000010202817e08  EFLAGS: 00010216
RAX: 000000000000004c RBX: 00000000fffffffa RCX: ffffffff803f5aa8
RDX: ffffffff803f5aa8 RSI: 0000000000000246 RDI: ffffffff803f5aa0
RBP: 000001021467c4d0 R08: ffffffff803f5aa8 R09: 00000000fffffffa
R10: 0000000100000000 R11: 0000ffff80410d80 R12: 00000101fff87800
R13: 0000000000000004 R14: 000001021467c4c0 R15: 000001021467c4e0
FS:  0000000000000000(0000) GS:ffffffff8050c680(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000005a23e8 CR3: 00000000dffae000 CR4: 00000000000006e0
Process cluster_log_ser (pid: 20086, threadinfo 0000010202816000, task
0000010006af97f0)
Stack: 474a465148554247 474a465148554247 474a465148554247 474a465148554247
       474a465148554247 474a465148554247 474a465148554247 474a465148554247
       474a465148554247 474a465148554247
Call Trace:
        <ffffffff80101000>{init_level4_pgt+0}
        <ffffffff80110fd3>{child_rip+8}
        <ffffffffa028d8b8>{:dm_cmirror:cluster_log_serverd+0}
        <ffffffff80110fcb>{child_rip+0}

Code: 0f 0b 1a 13 29 a0 ff ff ff ff b8 02 e9 b8 00 00 00 31 c0 49
RIP <ffffffffa028e7f1>{:dm_cmirror:cluster_log_serverd+3897} RSP <0000010202817e08>
 ----------- [cut here ] --------- [please bite here ] ---------
Kernel panic - not syncing: Oops
 <1>Kernel BUG at spinlock:76
invalid operand: 0000 [2] SMP
CPU 1
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) dm_cmirror(U) dlm(U)
cman(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds
yenta_socket pcmcia_core cpufreq_powersave button battery ac uhci_hcd ehci_hcd
hw_random e1000 floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ata_piix
libata qla2300 qla2xxx scsi_transport_fc megaraid_mbox megaraid_mm sd_mod scsi_mod
Pid: 3645, comm: klogd Not tainted 2.6.9-71.ELsmp
RIP: 0010:[<ffffffff80318f69>] <ffffffff80318f69>{_spin_unlock_irq+27}
RSP: 0018:0000010215253e50  EFLAGS: 00010002
RAX: 0000000000000001 RBX: 000000000000006e RCX: 00000000008be539
RDX: ffffffff804861c0 RSI: 0000000000000e68 RDI: ffffffff803f5ac0
RBP: 0000000000000000 R08: 0000010215252000 R09: 0000000000000008
R10: 0000000000000042 R11: 0000000000000005 R12: 0000000000000fff
R13: 000000000000003d R14: 000000552abb093d R15: 0000000000000000
FS:  0000002a958a5b00(0000) GS:ffffffff8050c600(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000005a5b58 CR3: 00000001fffa8000 CR4: 00000000000006e0
Process klogd (pid: 3645, threadinfo 0000010215252000, task 0000010006b767f0)
Stack: ffffffff8013897d 0000000000000000 0000010006b767f0 ffffffff801360c4
       0000010215253ea0 0000010215253ea0 0000000000000000 0000000000000000
       0000010006b767f0 ffffffff801360c4
Call Trace:
        <ffffffff8013897d>{do_syslog+405}
        <ffffffff801360c4>{autoremove_wake_function+0}
        <ffffffff801360c4>{autoremove_wake_function+0}
        <ffffffff801360c4>{autoremove_wake_function+0}
        <ffffffff801b28e2>{kmsg_read+56}
        <ffffffff8017bac0>{vfs_read+207}
        <ffffffff8017bd1c>{sys_read+69}
        <ffffffff801102f6>{system_call+126}


Code: 0f 0b b1 24 33 80 ff ff ff ff 4c 00 c7 07 01 00 00 00 fb c3
RIP <ffffffff80318f69>{_spin_unlock_irq+27} RSP <0000010215253e50>
Badness in do_unblank_screen at drivers/char/vt.c:2878

Call Trace:
        <ffffffff8023b25c>{do_unblank_screen+61}
        <ffffffff80124388>{bust_spinlocks+28}
        <ffffffff80111b68>{oops_end+18}
        <ffffffff80111c95>{die+54}
        <ffffffff80112058>{do_invalid_op+145}
        <ffffffff80318f69>{_spin_unlock_irq+27}
        <ffffffff80317d24>{thread_return+0}
        <ffffffff80317d7c>{thread_return+88}
        <ffffffff80110e1d>{error_exit+0}
        <ffffffff80318f69>{_spin_unlock_irq+27}
        <ffffffff8013897d>{do_syslog+405}
        <ffffffff801360c4>{autoremove_wake_function+0}
        <ffffffff801360c4>{autoremove_wake_function+0}
        <ffffffff801360c4>{autoremove_wake_function+0}
        <ffffffff801b28e2>{kmsg_read+56}
        <ffffffff8017bac0>{vfs_read+207}
        <ffffffff8017bd1c>{sys_read+69}
        <ffffffff801102f6>{system_call+126}

Version-Release number of selected component (if applicable):
2.6.9-71.ELsmp

lvm2-2.02.37-2.el4    BUILT: Wed Jun 11 07:03:46 CDT 2008
lvm2-cluster-2.02.37-2.el4    BUILT: Wed Jun 11 08:56:04 CDT 2008
device-mapper-1.02.25-2.el4    BUILT: Mon Jun  9 09:28:41 CDT 2008
cmirror-1.0.1-1    BUILT: Tue Jan 30 17:28:02 CST 2007
cmirror-kernel-2.6.9-41.4    BUILT: Tue Jun  3 13:54:29 CDT 2008

Comment 1 Corey Marthaler 2008-06-11 20:19:11 UTC
Created attachment 308986 [details]
log from taft-01

Comment 2 Corey Marthaler 2008-06-11 20:21:06 UTC
Created attachment 308987 [details]
log from taft-02

Comment 3 Corey Marthaler 2008-06-11 20:27:01 UTC
Created attachment 308989 [details]
log from taft-03

Comment 4 Corey Marthaler 2008-06-11 20:28:16 UTC
Created attachment 308990 [details]
log from taft-04

Comment 5 Corey Marthaler 2008-06-13 13:40:06 UTC
Reproduced this again last night while running the same test case:
Kill the secondary legs of two synced core log mirrors.

lvm2-2.02.37-3.el4
lvm2-cluster-2.02.37-3.el4

Comment 6 Corey Marthaler 2008-06-13 15:58:21 UTC
This now seems fairly easy to reproduce, and since this testcase worked before,
it appears to be a regression. 

Comment 7 Corey Marthaler 2008-06-13 20:31:10 UTC
Hit this during machine recovery testing as well. Shot hayes-01, and hayes-03
panicked.

GFS: fsid=HAYES:2.0: Done
dm-cmirror: Error listening for server(2) response for KbfWJO2B: -110
dm-cmirror: Error listening for server(2) response for KbfWJO2B: -110
GFS: fsid=HAYES:2.0: jid=1: Trying to acquire journal lock...
GFS: fsid=HAYES:1.2: jid=0: Trying to acquire journal lock...
GFS: fsid=HAYES:2.0: jid=1: Looking at journal...
GFS: fsid=HAYES:1.2: jid=0: Busy
dm-cmirror: Recovery blocked by outstanding write on region 6899/KbfWJO2B
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at dm_cmirror_server:696                                            
        
[...]

Comment 8 Corey Marthaler 2008-06-16 15:02:17 UTC
FWIW, this can be reproduced by failing the primary leg as well; it doesn't have
to be the secondary one.

Comment 9 Jonathan Earl Brassow 2008-06-17 03:38:58 UTC
Judging from comment #1, you should be able to repro this bug just by doing
linear -> (corelog) mirror upconverts...  This should be true because there is
no mirror state carried along when a failed mirror goes to linear.

I'm not sure I see a way for this panic to be triggered during normal
operation...  I think I see a way during machine/disk failure; so it will be
important to know when exactly this panic is happening.

(BTW, Thanks for including the build dates of the rpms - that's helpful.)

Comment 10 Nate Straz 2008-06-17 13:16:02 UTC
I hit something like this while running a related cmirror failure case with the
following bits:

cmirror-1.0.1-1
cmirror-kernel-2.6.9-41.4
lvm2-2.02.37-3.el4
lvm2-cluster-2.02.37-3.el4

Scenario: Kill secondary leg of non synced core log 2 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:      nonsyncd_secondary_core_2legs
* sync:      0
* mirrors:   1
* disklog:   0
* failpv:    /dev/sdd1
* legs:      2
* pvs:       /dev/sde1 /dev/sdd1
************************************************

During a resync, morph-02 panicked with the following messages:

dm-cmirror: Mark attempted to recovering region by 3: 679/fA32IibI
dm-cmirror:   lc->recovering_region = 679
dm-cmirror:   ru->ru_rw             = 2
dm-cmirror:   ru->ru_nodeid         = 4
dm-cmirror:   ru->ru_region         = 679
------------[ cut here ]------------
kernel BUG at
/builddir/build/BUILD/cmirror-kernel-2.6.9-41/hugemem/src/dm-cmirror-server.c:574!
invalid operand: 0000 [#1]
SMP 
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) dm_cmirror(U) dlm(U)
cman(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc
cpufreq_powersave button battery ac uhci_hcd hw_random e1000 floppy dm_snapshot
dm_zero dm_mirror ext3 jbd dm_mod qla2300 ata_piix libata qla2xxx
scsi_transport_fc sd_mod scsi_mod
CPU:    1
EIP:    0060:[<82c15820>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-71.ELhugemem) 
EIP is at server_mark_region+0x198/0x1bb [dm_cmirror]
eax: 0000002d   ebx: 79ad8200   ecx: 75578f34   edx: 82c18b67
esi: 7961ff20   edi: 39c6c375   ebp: 00000000   esp: 75578f30
ds: 007b   es: 007b   ss: 0068
Process cluster_log_ser (pid: 24176, threadinfo=75578000 task=773a3330)
Stack: 82c18b67 000002a7 00000000 82c18b3a 00000004 82c18b0d 00000002 00000003 
       81da5180 39c6c200 75578000 fffffffa ff000000 39c6c200 82c1664a 022d6202 
       00000043 75578000 7d5d85b0 773a3330 7e692d00 81da5220 00000000 c73e0002 
Call Trace:
 [<82c1664a>] process_log_request+0x28f/0x47b [dm_cmirror]
 [<022d6202>] schedule+0x8c2/0x8ee
 [<82c169e1>] cluster_log_serverd+0x1ab/0x21f [dm_cmirror]
 [<82c16836>] cluster_log_serverd+0x0/0x21f [dm_cmirror]
 [<021041f5>] kernel_thread_helper+0x5/0xb
Code: 20 ff 73 08 68 0d 8b c1 82 e8 66 d3 50 7f ff 73 0c 68 3a 8b c1 82 e8 59 d3
50 7f ff 73 14 ff 73 10 68 67 8b c1 82 e8 49 d3 50 7f <0f> 0b 3e 02 95 8b c1 82
83 c4 1c eb 0c 8b 00 89 70 04 89 06 89 
 <0>Fatal exception: panic in 5 seconds
dm-cmirror: Client finishing recovery: 679/fA32IibI
Kernel panic - not syncing: Fatal exception

Comment 11 Corey Marthaler 2008-06-19 15:33:37 UTC
It appears that the kernel mod that Jon gave me fixes this issue.

Comment 12 Jonathan Earl Brassow 2008-06-20 18:19:45 UTC
Fix checked into CVS:
commit a0e6a6d02a4a55b98078dc874204c5555dbf74a4
Author: Jonathan Brassow <jbrassow@redhat.com>
Date:   Fri Jun 20 13:17:44 2008 -0500

    dm-cmirror.ko:  Fix for bug 450939

    - must reset 'in_sync' var upon resume
    -- Clean-up of some compile warnings
    -- Additional debugging statements


Comment 13 Corey Marthaler 2008-06-23 16:23:25 UTC
Looks like this is still reproducible after all. Moving back to 'assigned'.

Comment 14 Jonathan Earl Brassow 2008-06-24 14:28:00 UTC
There should be no need for a failure to reproduce this bug.  Let's also try:

1) Create corelog mirror - using '--nosync' is fine
2) get I/O going on all machines except the server
3) Deactivate the server (and only the server) node via 'lvchange -aln vg/lv'

This should trigger the bug.  The server will migrate, the outstanding I/O will
cause the mirror to appear 'not-in-sync', but the in_sync var will still be set.
 Once recovery collides with a write, the BUG will trigger.

The failure should happen within 10 minutes.  If it doesn't, you can try again.

Comment 15 Corey Marthaler 2008-06-25 16:57:22 UTC
My test is now hitting this on non-corelog mirrors.

Scenario: Kill primary leg of synced 3 leg mirror(s)

[...] # well after the failure was dealt with

Enabling device sdg on taft-01
Enabling device sdg on taft-02
Enabling device sdg on taft-03
Enabling device sdg on taft-04

Recreating PV /dev/sdg1
  WARNING: Volume group helter_skelter is not consistent
  WARNING: Volume Group helter_skelter is not consistent
  WARNING: Volume group helter_skelter is not consistent
Extending the recreated PV back into VG helter_skelter
Since we can't yet up convert existing mirrors, down converting to linear(s)
on taft-04 before re-converting back to original mirror(s)
Up converting linear(s) back to mirror(s) on taft-04...
taft-04: lvconvert -m 2 helter_skelter/syncd_primary_3legs_1 /dev/sdg1:0-1000
/dev/sdf1:0-1000 /dev/sde1:0-1000 /dev/sdh1:0-150
taft-04: lvconvert -m 2 helter_skelter/syncd_primary_3legs_2 /dev/sdg1:0-1000
/dev/sdf1:0-1000 /dev/sde1:0-1000 /dev/sdh1:0-150
Verifying the up conversion from linear(s) to mirror(s)
Verifying device /dev/sdg1 *IS* in the mirror(s)
Could not connect to remote host
[PANIC]

Comment 16 Jonathan Earl Brassow 2008-06-27 14:37:55 UTC
Should be fixed with the latest updates:


commit 3ab00427e3eaf45e99d5f40fed6f3b459faccb14
Author: Jonathan Brassow <jbrassow@redhat.com>
Date:   Fri Jun 27 09:36:00 2008 -0500

    dm-cmirror.ko:  Fix for bug 450939, and other minor cleanups

    - If a write-recovery conflict is detected, halt recovery
      rather than calling BUG() (the fix bug 450939)
    - Minor code style cleanups


Comment 20 errata-xmlrpc 2008-07-25 19:28:02 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0803.html


