Note: This is a beta release of Red Hat Bugzilla 5.0. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Also email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback here.
Bug 233034 - cmirror server failure/migration during GFS I/O causes metadata error and then GFS withdraw
Summary: cmirror server failure/migration during GFS I/O causes metadata error and the...
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cmirror
Version: 4
Hardware: All
OS: Linux
medium
medium
Target Milestone: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2007-03-19 23:18 UTC by Corey Marthaler
Modified: 2010-01-12 02:02 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-07-01 20:46:44 UTC


Attachments (Terms of Use)

Description Corey Marthaler 2007-03-19 23:18:30 UTC
Description of problem:
Three (3-way) cmirrors, gfs on top of all, gfs I/O ruuning to all
(xdoio/xiogen). I then failed the primary legs, and they appeared to down conver
t to 2-way cmirrors properly. I then kill one of the nodes (link-07) in the
cluster (was also the cmirror server to at least one of the cmirrors) and that
caused the following problems with GFS.


GFS: fsid=LINK_128:gfs2.1: fatal: invalid metadata block
GFS: fsid=LINK_128:gfs2.1:   bh = 1086097 (magic)
GFS: fsid=LINK_128:gfs2.1:   function = foreach_descriptor
GFS: fsid=LINK_128:gfs2.1:   file =
/builddir/build/BUILD/gfs-kernel-2.6.9-71/smp/src/gfs/recovery.c, line = 400
GFS: fsid=LINK_128:gfs2.1:   time = 1174323986
GFS: fsid=LINK_128:gfs2.1: about to withdraw from the cluster
GFS: fsid=LINK_128:gfs2.1: waiting for outstanding I/O
GFS: fsid=LINK_128:gfs2.1: telling LM to withdraw
GFS: fsid=LINK_128:gfs1.1: fatal: invalid metadata block
GFS: fsid=LINK_128:gfs1.1:   bh = 1099633 (magic)
GFS: fsid=LINK_128:gfs1.1:   function = foreach_descriptor
GFS: fsid=LINK_128:gfs1.1:   file =
/builddir/build/BUILD/gfs-kernel-2.6.9-71/smp/src/gfs/recovery.c, line = 400
GFS: fsid=LINK_128:gfs1.1:   time = 1174323986
GFS: fsid=LINK_128:gfs1.1: about to withdraw from the cluster
GFS: fsid=LINK_128:gfs1.1: waiting for outstanding I/O
GFS: fsid=LINK_128:gfs1.1: telling LM to withdraw
GFS: fsid=LINK_128:gfs3.1: jid=3: Journal replayed in 4s
GFS: fsid=LINK_128:gfs3.1: jid=3: Done
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs2.1: fatal: invalid
metadata block
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs2.1:   bh = 1086097 (magic)
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs2.1:   function =
foreach_descriptor
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs2.1:   file =
/builddir/build/BUILD/gfs-kernel-2.6.9-71/smp/src/gfs/recovery.c, line = 400
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs2.1:   time = 1174323986
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs2.1: about to withdraw
from the cluster
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs2.1: waiting for
outstanding I/O
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs2.1: telling LM to withdraw
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs1.1: fatal: invalid
metadata block
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs1.1:   bh = 1099633 (magic)
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs1.1:   function =
foreach_descriptor
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs1.1:   file =
/builddir/build/BUILD/gfs-kernel-2.6.9-71/smp/src/gfs/recovery.c, line = 400
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs1.1:   time = 1174323986
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs1.1: about to withdraw
from the cluster
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs1.1: waiting for
outstanding I/O
Mar 19 12:06:26 link-08 kernel: GFS: fsid=LINK_128:gfs1.1: telling LM to withdraw
Mar 19 12:06:27 link-08 kernel: GFS: fsid=LINK_128:gfs3.1: jid=3: Journal
replayed in 4s
Mar 19 12:06:27 link-08 kernel: GFS: fsid=LINK_128:gfs3.1: jid=3: Done


Version-Release number of selected component (if applicable):
2.6.9-50.ELsmp
cmirror-kernel-2.6.9-25.0

Comment 1 Jonathan Earl Brassow 2007-03-22 22:35:13 UTC
assigned -> post

Comment 5 Kiersten (Kerri) Anderson 2007-03-27 19:51:55 UTC
Code change is a cluster product not a rhel product - switching to cluster4.5
flag instead of rhel4.5 flag.

Comment 6 Jonathan Earl Brassow 2007-04-03 20:12:19 UTC
post -> modified

Comment 7 Corey Marthaler 2007-05-24 21:08:03 UTC
I appear to have hit this bug attempting the same scenario in the original
report, that is... start GFS i/o to all 3 cmirrors, fail a leg on all three
mirrors, wait for the down convert to work, and then fail a node in the cluster.
In this case I failed link-07, and the 'invalid metadata' happened on link-04.


May 24 15:37:28 link-04 kernel: dm-cmirror: I'm the cluster mirror log server
for PIKAqmNT
May 24 15:37:28 link-04 kernel: dm-cmirror: Not reading disk log because I'm
suspended (PIKAqmNT)
May 24 15:37:28 link-04 kernel: dm-cmirror: Performing flush to work around bug
235040
May 24 15:37:28 link-04 kernel: dm-cmirror: Log flush complete
May 24 15:37:48 link-04 kernel: dm-cmirror: LRT_MASTER_LEAVING(13): (PIKAqmNT)
May 24 15:37:48 link-04 kernel: dm-cmirror:   starter     : 1
May 24 15:37:48 link-04 kernel: dm-cmirror:   co-ordinator: 0
May 24 15:37:48 link-04 kernel: dm-cmirror:   node_count  : 2
May 24 15:37:48 link-04 kernel: dm-cmirror: LRT_ELECTION(10): (PIKAqmNT)
May 24 15:37:48 link-04 kernel: dm-cmirror:   starter     : 1
May 24 15:37:48 link-04 kernel: dm-cmirror:   co-ordinator: 57005
May 24 15:37:48 link-04 kernel: dm-cmirror:   node_count  : 2
May 24 15:37:48 link-04 kernel: dm-cmirror: LRT_SELECTION(11): (PIKAqmNT)
May 24 15:37:48 link-04 kernel: dm-cmirror:   starter     : 1
May 24 15:37:48 link-04 kernel: dm-cmirror:   co-ordinator: 57005
May 24 15:37:48 link-04 kernel: dm-cmirror:   node_count  : 2
May 24 15:37:48 link-04 kernel: dm-cmirror: LRT_MASTER_ASSIGN(12): (PIKAqmNT)
May 24 15:37:48 link-04 kernel: dm-cmirror:   starter     : 1
May 24 15:37:48 link-04 kernel: dm-cmirror:   co-ordinator: 57005
May 24 15:37:48 link-04 kernel: dm-cmirror:   node_count  : 1
May 24 15:37:48 link-04 kernel: dm-cmirror: Removing PIKAqmNT (1)
May 24 15:37:48 link-04 kernel: dm-cmirror: 0 region user structures freed
May 24 15:37:48 link-04 lvm[14010]: Monitoring mirror device vg3-mirror3 for events
May 24 15:37:48 link-04 lvm[14010]: vg3-mirror3 is now in-sync
May 24 15:44:32 link-04 kernel: CMAN: removing node link-07 from the cluster :
Missed too many hea
rtbeats
May 24 15:44:33 link-04 fenced[13193]: fencing deferred to link-02
May 24 15:45:13 link-04 kernel: dm-cmirror: A cluster mirror log member has failed.
May 24 15:45:13 link-04 kernel: dm-cmirror: Disk Resume::  PIKAqmNT (active)
May 24 15:45:13 link-04 kernel: dm-cmirror:   Live nodes        :: 3
May 24 15:45:13 link-04 kernel: dm-cmirror:   In-Use Regions    :: 1
May 24 15:45:13 link-04 kernel: dm-cmirror:   Good IUR's        :: 1
May 24 15:45:13 link-04 kernel: dm-cmirror:   Bad IUR's         :: 0
May 24 15:45:13 link-04 kernel: dm-cmirror:   Sync count        :: 10240
May 24 15:45:13 link-04 kernel: dm-cmirror:   Disk Region count :: 0
May 24 15:45:13 link-04 kernel: dm-cmirror:   Region count      :: 10240
May 24 15:45:13 link-04 kernel: dm-cmirror:   NOTE:  Mapping has changed.
May 24 15:45:13 link-04 kernel: dm-cmirror: Marked regions::
May 24 15:45:13 link-04 kernel: dm-cmirror:   Total = 0
May 24 15:45:13 link-04 kernel: dm-cmirror: Out-of-sync regions::
May 24 15:45:13 link-04 kernel: dm-cmirror:   Total = 0
May 24 15:45:13 link-04 kernel: dm-cmirror: Disk Resume::  mloZLxpj (active)
May 24 15:45:13 link-04 kernel: dm-cmirror:   Live nodes        :: 3
May 24 15:45:13 link-04 kernel: dm-cmirror:   In-Use Regions    :: 1
May 24 15:45:13 link-04 kernel: dm-cmirror:   Good IUR's        :: 1
May 24 15:45:13 link-04 kernel: dm-cmirror:   Bad IUR's         :: 0
May 24 15:45:13 link-04 kernel: dm-cmirror:   Sync count        :: 10240
May 24 15:45:13 link-04 kernel: dm-cmirror:   Disk Region count :: 0
May 24 15:45:13 link-04 kernel: dm-cmirror:   Region count      :: 10240
May 24 15:45:13 link-04 kernel: dm-cmirror:   NOTE:  Mapping has changed.
May 24 15:45:13 link-04 kernel: dm-cmirror: Marked regions::
May 24 15:45:13 link-04 kernel: dm-cmirror:   Total = 0
May 24 15:45:13 link-04 kernel: dm-cmirror: Out-of-sync regions::
May 24 15:45:13 link-04 kernel: dm-cmirror:   Total = 0
May 24 15:45:13 link-04 kernel: dm-cmirror: Disk Resume::  vbQ8ivnj (active)
May 24 15:45:13 link-04 kernel: dm-cmirror:   Live nodes        :: 3
May 24 15:45:13 link-04 kernel: dm-cmirror:   In-Use Regions    :: 1
May 24 15:45:13 link-04 kernel: dm-cmirror:   Good IUR's        :: 1
May 24 15:45:13 link-04 kernel: dm-cmirror:   Bad IUR's         :: 0
May 24 15:45:13 link-04 kernel: dm-cmirror:   Sync count        :: 10240
May 24 15:45:13 link-04 kernel: dm-cmirror:   Disk Region count :: 0
May 24 15:45:13 link-04 kernel: dm-cmirror:   Region count      :: 10240
May 24 15:45:13 link-04 kernel: dm-cmirror:   NOTE:  Mapping has changed.
May 24 15:45:13 link-04 kernel: dm-cmirror: Marked regions::
May 24 15:45:13 link-04 kernel: dm-cmirror:   Total = 0
May 24 15:45:13 link-04 kernel: dm-cmirror: Out-of-sync regions::
May 24 15:45:13 link-04 kernel: dm-cmirror:   Total = 0
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Trying to acquire
journal lock...
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:2.2: jid=1: Trying to acquire
journal lock...
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:1.2: jid=1: Trying to acquire
journal lock...
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Looking at journal...
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:2.2: jid=1: Looking at journal...
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:1.2: jid=1: Looking at journal...
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:1.2: jid=1: Acquiring the
transaction lock...
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Acquiring the
transaction lock...
May 24 15:45:13 link-04 kernel: GFS: fsid=LINK_128:2.2: jid=1: Acquiring the
transaction lock...
May 24 15:45:14 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Replaying journal...
May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Replayed 7 of 15
blocks
May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: replays = 7,
skips = 7, sames = 1
May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:1.2: jid=1: Replaying journal...
May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:2.2: jid=1: Replaying journal...
May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Journal replayed
in 2s
May 24 15:45:15 link-04 kernel: GFS: fsid=LINK_128:3.2: jid=1: Done
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2: fatal: invalid metadata
block
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2:   bh = 637377 (magic)
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2:   function =
foreach_descriptor
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2:   file =
/builddir/build/BUILD/gfs-kernel-2.6.9-72/largesmp/src/gfs/recovery.c, line = 400
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2:   time = 1180039516
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2: about to withdraw from
the cluster
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2: waiting for outstanding I/O
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:1.2: telling LM to withdraw
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2: fatal: invalid metadata
block
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2:   bh = 635057 (magic)
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2:   function =
foreach_descriptor
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2:   file =
/builddir/build/BUILD/gfs-kernel-2.6.9-72/largesmp/src/gfs/recovery.c, line = 400
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2:   time = 1180039516
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2: about to withdraw from
the cluster
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2: waiting for outstanding I/O
May 24 15:45:16 link-04 kernel: GFS: fsid=LINK_128:2.2: telling LM to withdraw
May 24 16:00:33 link-04 sshd(pam_unix)[17118]: session opened for user root by
root(uid=0)


[root@link-04 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    4   X   link-07
   2    1    4   M   link-02
   3    1    4   M   link-04
   4    1    4   M   link-08
[root@link-04 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           5   2 run       -
[2 3 4]

DLM Lock Space:  "clvmd"                            10   5 run       -
[3 2 4]

DLM Lock Space:  "clustered_log"                    11   6 run       -
[2 3 4]

DLM Lock Space:  "1"                                13   7 run       S-10,200,0
[2 3 4]

DLM Lock Space:  "2"                                17   9 run       -
[2 3 4]

DLM Lock Space:  "3"                                21  11 run       -
[2 3 4]

GFS Mount Group: "1"                                15   8 recover 2 -
[2 3 4]

GFS Mount Group: "2"                                19  10 recover 2 -
[2 3 4]

GFS Mount Group: "3"                                23  12 run       -
[2 3 4]

[root@link-04 ~]# dmsetup ls
vg1-mirror1     (253, 5)
vg3-mirror3_mimage_2    (253, 13)
vg3-mirror3_mimage_1    (253, 12)
vg3-mirror3_mlog        (253, 10)
vg2-mirror2     (253, 9)
vg1-mirror1_mimage_2    (253, 4)
vg1-mirror1_mimage_1    (253, 3)
vg3-mirror3     (253, 14)
VolGroup00-LogVol01     (253, 1)
vg2-mirror2_mimage_2    (253, 8)
VolGroup00-LogVol00     (253, 0)
vg2-mirror2_mimage_1    (253, 7)
[root@link-04 ~]# dmsetup info
Name:              vg1-mirror1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      253, 5
Number of targets: 1
UUID: LVM-jVW0p5yyy5wTagIL0aygoSMVZLemXmb8obumXSC6B3CduV0RBrFdhrGJvbQ8ivnj

Name:              vg3-mirror3_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 13
Number of targets: 1
UUID: LVM-7xY0XAC7BmKdCH3armVoGdByviEcs4xZpaLs9n0BnDV4VslNvPVcRL3d45zqssuA

Name:              vg3-mirror3_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 12
Number of targets: 1
UUID: LVM-7xY0XAC7BmKdCH3armVoGdByviEcs4xZMRSO8BF06jpxnWlWYB0WGOy56PYjrS3d

Name:              vg3-mirror3_mlog
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 10
Number of targets: 1
UUID: LVM-7xY0XAC7BmKdCH3armVoGdByviEcs4xZcc12hKC015ulgOhPIdSBVSxqPIKAqmNT

Name:              vg2-mirror2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      7
Major, minor:      253, 9
Number of targets: 1
UUID: LVM-klO57to59SCrg2sHUlIenGLcGTPah2JRo1FEv8TbkEAtSiJbPXhpunPRmloZLxpj

Name:              vg1-mirror1_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 4
Number of targets: 1
UUID: LVM-jVW0p5yyy5wTagIL0aygoSMVZLemXmb8Nb6ifzgNNMjv03zJH9BX1b2hgnqbvP1n

Name:              vg1-mirror1_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 3
Number of targets: 1
UUID: LVM-jVW0p5yyy5wTagIL0aygoSMVZLemXmb8zrRxcvrXawT7rn885SId6bvqma8h4Rb4

Name:              vg3-mirror3
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      253, 14
Number of targets: 1
UUID: LVM-7xY0XAC7BmKdCH3armVoGdByviEcs4xZOy5EfnriDVYBEO5dps4g4ySR9UEHUDfN

Name:              VolGroup00-LogVol01
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-RsBOCkblQpLicArLlocNswlochRxjsFjyYdc3o77yCGdTIILMdW0LlXkWPGcQWen

Name:              vg2-mirror2_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 8
Number of targets: 1
UUID: LVM-klO57to59SCrg2sHUlIenGLcGTPah2JRzz7RntrbEeSmb96NeIWeMxRNWXu3gxlB

Name:              VolGroup00-LogVol00
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-RsBOCkblQpLicArLlocNswlochRxjsFjRMSyxR0nBvsyde1ijAWS9nf4aX3gDTYI

Name:              vg2-mirror2_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 7
Number of targets: 1
UUID: LVM-klO57to59SCrg2sHUlIenGLcGTPah2JRThc04y0AcxA1ucoQDOV5PQNDonEmiKPQ
                                   





Comment 8 Jonathan Earl Brassow 2007-10-01 15:53:28 UTC
This should be fixed with the latest cmirror code (>= 9/28/07).

assigned -> post


Comment 9 Corey Marthaler 2008-07-01 20:46:44 UTC
This has not been seen in over a year, marking fixed.


Note You need to log in before you can comment on or make changes to this bug.