Bug 453809 - Problems with jbd error handling
Summary: Problems with jbd error handling
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Josef Bacik
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-07-02 17:41 UTC by Bryn M. Reeves
Modified: 2018-10-20 02:47 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-04-21 15:37:20 UTC
Target Upstream Version:



Description Bryn M. Reeves 2008-07-02 17:41:10 UTC
Description of problem:
Hidehiro Kawai discovered some problems with jbd's error handling that can, in
rare failure situations, lead to file system corruption during journal recovery:

http://lkml.org/lkml/2008/4/18/154

Although upstream is planning to move past the current jbd implementation in a
way that may make these changes irrelevant, users of the current code are still
vulnerable to these problems. Environments with a very large number of disks
are more likely to hit one of these problems.

Version-Release number of selected component (if applicable):
2.6.18-*

How reproducible:
Very difficult; may require hardware with fault injection capabilities. The
problems were discovered via code inspection.

Steps to Reproduce:
1. n/a
  
Additional info:
[PATCH 1/4] jbd: strictly check for write errors on data buffers
[PATCH 2/4] jbd: ordered data integrity fix
[PATCH 3/4] jbd: abort when failed to log metadata buffers
[PATCH 4/4] jbd/ext3: fix error handling for checkpoint io

Comment 2 RHEL Product and Program Management 2009-02-16 15:27:14 UTC
Updating PM score.

Comment 3 RHEL Product and Program Management 2009-02-24 17:32:43 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Debbie Johnson 2009-04-09 16:32:18 UTC
Josef,

What is the status of this BZ? Will it be going into 5.4? It is unclear from the comments, and I have a customer who needs this. I attached the IT to this.

Debbie

Errors they are seeing...

Mar  4 15:08:47 jb-601 kernel: journal_bmap: journal block not found at offset 24588 on sdb3
Mar  4 15:08:47 jb-601 kernel: JBD: bad block at offset 24588
Mar  4 15:08:47 jb-601 kernel: journal_bmap: journal block not found at offset 24588 on sdb3
Mar  4 15:08:47 jb-601 kernel: JBD: bad block at offset 24588
Mar  4 15:08:47 jb-601 kernel: JBD: Failed to read block at offset 24585
Mar  4 15:08:47 jb-601 kernel: JBD: recovery failed
Mar  4 15:08:47 jb-601 kernel: EXT3-fs: error loading journal.
Mar  4 15:08:47 jb-601 kernel: journal_bmap: journal block not found at offset 21516 on sdb4
Mar  4 15:08:47 jb-601 kernel: JBD: bad block at offset 21516
Mar  4 15:08:47 jb-601 kernel: journal_bmap: journal block not found at offset 21685 on sdb4
Mar  4 15:08:47 jb-601 kernel: JBD: bad block at offset 21685
Mar  4 15:08:47 jb-601 kernel: JBD: recovery failed
Mar  4 15:08:47 jb-601 kernel: EXT3-fs: error loading journal.

Comment 5 Josef Bacik 2009-04-09 16:39:49 UTC
I'm pretty sure Hitachi has already posted these, but it looks like they will not fix the problem your customer is having. These patches simply make sure we always abort the transaction when we are supposed to; it seems like your customer is suffering from data corruption.
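
For readers unfamiliar with the behavior under discussion, here is a minimal toy model of "abort the transaction when a buffer write fails". It is only a sketch of the general idea, not the jbd code and not the content of the patches: ToyBuffer, ToyJournal, and the write_failed flag are hypothetical names invented for this illustration, and the use of EIO as the error code is an assumption.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class ToyBuffer:
    """Hypothetical stand-in for a buffer queued for writeout."""
    name: str
    write_failed: bool = False  # simulates an I/O error seen at completion


@dataclass
class ToyJournal:
    """Hypothetical journal that tracks only abort state and commits."""
    aborted_errno: int = 0
    committed: list = field(default_factory=list)

    def abort(self, err: int) -> None:
        # Keep the first error; once aborted, the journal stays aborted.
        if self.aborted_errno == 0:
            self.aborted_errno = err

    def commit(self, tid: int, buffers: list) -> int:
        """Try to commit transaction tid; return 0 or a negative errno."""
        if self.aborted_errno:
            # An aborted journal refuses all further commits.
            return -self.aborted_errno
        # Check every buffer so that no write error is silently dropped.
        failed = [b.name for b in buffers if b.write_failed]
        if failed:
            self.abort(errno.EIO)  # assumption: report the failure as EIO
            return -errno.EIO
        # Only a fully successful writeout is recorded as committed.
        self.committed.append(tid)
        return 0
```

In this model a single failed buffer prevents the transaction from being recorded as committed, and every later commit is refused as well, so a recovery pass would never replay a transaction whose writes were incomplete.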

Comment 6 Josef Bacik 2009-04-21 15:37:20 UTC
The recovery patches referenced in c1 have largely been accepted already via other BZs. I'm closing this BZ.

