Note: This is a beta release of Red Hat Bugzilla 5.0. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Also email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback here.
Bug 1684780 - aio O_DIRECT writes to non-page-aligned file locations on ext4 can result in the overlapped portion of the page containing zeros
Summary: aio O_DIRECT writes to non-page-aligned file locations on ext4 can result in ...
Keywords:
Status: ON_QA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.6
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: rc
: ---
Assignee: Lukáš Czerner
QA Contact: Boyang Xue
URL:
Whiteboard:
Depends On:
Blocks: 1693561 1690497
TreeView+ depends on / blocked
 
Reported: 2019-03-02 12:05 UTC by Frank Sorenson
Modified: 2019-04-16 07:18 UTC (History)
15 users (show)

Fixed In Version: kernel-3.10.0-1031.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1690497 1693561 (view as bug list)
Environment:
Last Closed:
Target Upstream Version:


Attachments (Terms of Use)
reproducer (deleted)
2019-03-02 12:05 UTC, Frank Sorenson
no flags Details
updated/simplified reproducer (deleted)
2019-03-07 21:28 UTC, Frank Sorenson
no flags Details
updated updated reproducer (deleted)
2019-03-11 16:55 UTC, Frank Sorenson
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3953911 Troubleshoot None Possible Oracle corruption when using filesystemio_options=SETALL 2019-03-05 02:30:18 UTC

Description Frank Sorenson 2019-03-02 12:05:05 UTC
Created attachment 1540033 [details]
reproducer

Description of problem:

Sequential non-page-aligned aio writes can result in a portion of the shared page containing all zeros.

This only occurs on ext4 with aio + O_DIRECT


         **************** **************** ****************
         *    page 1    * *    page 2    * *    page 3    *
         **************** **************** ****************
existing 0000000000000000 0000000000000000 0000000000000000
write 1    AAAAAAAAAAAAAA AA
write 2                     BBBBBBBBBBBBBB BB

result   00AAAAAAAAAAAAAA 00BBBBBBBBBBBBBB BB00000000000000

desired  00AAAAAAAAAAAAAA AABBBBBBBBBBBBBB BB00000000000000


Version-Release number of selected component (if applicable):

RHEL 7.4+ (and quite probably earlier as well)
also appears to occur up through at least 4.19


How reproducible:

reproducer attached


Steps to Reproduce:

reproducer submits several sequential writes which are not aligned to 4 KiB (these begin 512 bytes into the page):

    ftruncate(fd, 65012224)
    io_prep_pwrite(iocbs[0], fd, buf[0], 1048576, 63963648);
    io_prep_pwrite(iocbs[1], fd, buf[1], 1048576, 65012224);

    io_prep_pwrite(iocbs[2], fd, buf[2], 1048576, 66060800);
    io_prep_pwrite(iocbs[3], fd, buf[3], 1048576, 67109376);

    io_submit(io_ctx, 1, &iocbs[0]);
    io_submit(io_ctx, 1, &iocbs[1]);
    io_submit(io_ctx, 1, &iocbs[2]);
    io_submit(io_ctx, 1, &iocbs[3]);

    io_getevents(io_ctx, 4, 4, events, NULL)


Actual results:

a portion of the page contains all 0s

result   00AAAAAAAAAAAAAA 00BBBBBBBBBBBBBB BB00000000000000


Expected results:

the writes occur to the expected file locations

desired  00AAAAAAAAAAAAAA AABBBBBBBBBBBBBB BB00000000000000



Additional info:

introducing a delay between each io_submit can prevent the writes from interfering with each other

does not occur on xfs, nfs, etc.

Comment 2 Frank Sorenson 2019-03-03 06:34:34 UTC
This particular condition is obviously known to be problematic, as shown by numerous comments in the ext4 kernel code:

for example, commit e9e3bcecf44c0, which went into 2.6.38 added:
/*
 * This tests whether the IO in question is block-aligned or not.
 * Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
 * are converted to written only after the IO is complete.  Until they are
 * mapped, these blocks appear as holes, so dio_zero_block() will assume that
 * it needs to zero out portions of the start and/or end block.  If 2 AIO
 * threads are at work on the same unwritten block, they must be synchronized
 * or one thread will zero the other's data, causing corruption.
 */

and upstream commit e142d05263a4b went into 4.6 added:

+        * Unaligned direct AIO must be serialized among each other as zeroing
+        * of partial blocks of two competing unaligned AIOs can result in data
+        * corruption.

however I've reproduced this with 4.19+, so this does not appear resolved.

Comment 15 Frank Sorenson 2019-03-07 21:28:40 UTC
Created attachment 1541975 [details]
updated/simplified reproducer

Here's a much-simplified reproducer

Comment 16 Zorro Lang 2019-03-09 13:19:29 UTC
(In reply to Frank Sorenson from comment #15)
> Created attachment 1541975 [details]
> updated/simplified reproducer
> 
> Here's a much-simplified reproducer

Can someone help to explain why the reopen is needed after ftruncate?

        ftruncate(fd, WRITE0_SIZE);
        close(fd);
        ....
        fd = open(filename, O_RDWR|O_DIRECT);

I've tried, if no this reopen, can't reproduce this bug.


Thanks,
Zorro

Comment 17 Frank Sorenson 2019-03-11 16:54:16 UTC
Okay, the close()/open() can be eliminated if you initially open the file with 'O_DIRECT' as well (the unlink() can also be removed, since the following open() call includes O_TRUNC):

        if ((fd = open(filename, O_RDWR|O_CREAT|O_TRUNC|O_DIRECT, 0660)) == -1)

after changing the initial open options like this, the bug seems to reproduce consistently again.


However, also note that the file corruption is dependent on timing of the IOs, so there is a possibility that the test may return SUCCESS, rather than FAILURE; some light testing leads me to believe this may occur once about every 150-250 runs (on my test VMs).  Separating the ftruncate() from the io_submit() with the close()/open() appears to double the number of iterations between successful runs (failures to FAIL?).

Might be best to run the test several times, just in case it doesn't reproduce the bug the first time (though I suppose a very slow machine could potentially always return SUCCESS, even though the bug is still there).

I'm attaching an updated reproducer that tests multiple times, and eliminates the reopen.  With testing 5 times, I haven't had it incorrectly report SUCCESS in hundreds of thousands of runs.

Comment 18 Frank Sorenson 2019-03-11 16:55:13 UTC
Created attachment 1542966 [details]
updated updated reproducer

Comment 19 Zorro Lang 2019-03-12 03:26:37 UTC
(In reply to Frank Sorenson from comment #17)
> Okay, the close()/open() can be eliminated if you initially open the file
> with 'O_DIRECT' as well (the unlink() can also be removed, since the
> following open() call includes O_TRUNC):
> 
>         if ((fd = open(filename, O_RDWR|O_CREAT|O_TRUNC|O_DIRECT, 0660)) ==
> -1)

Oh, sorry I didn't notice that the 1st open didn't use 'O_DIRECT'.

I've written a case for xfstests, but I still have one question which I don't know
how to comment it: why the ftruncate is necessary? And looks like must truncate
to the end of the 1st write operation (offset + 1st write size) to reproduce this bug.

Thanks,
Zorro

> 
> after changing the initial open options like this, the bug seems to
> reproduce consistently again.
> 
> 
> However, also note that the file corruption is dependent on timing of the
> IOs, so there is a possibility that the test may return SUCCESS, rather than
> FAILURE; some light testing leads me to believe this may occur once about
> every 150-250 runs (on my test VMs).  Separating the ftruncate() from the
> io_submit() with the close()/open() appears to double the number of
> iterations between successful runs (failures to FAIL?).
> 
> Might be best to run the test several times, just in case it doesn't
> reproduce the bug the first time (though I suppose a very slow machine could
> potentially always return SUCCESS, even though the bug is still there).
> 
> I'm attaching an updated reproducer that tests multiple times, and
> eliminates the reopen.  With testing 5 times, I haven't had it incorrectly
> report SUCCESS in hundreds of thousands of runs.

Comment 20 Lukáš Czerner 2019-03-12 06:35:04 UTC
Hi Zorro,

the ftruncate is absolutely necessary because this bug is about writes at the i_size boundary, or rather in the same fs block as i_size (see the description of my ext4 patch).

https://patchwork.ozlabs.org/patch/1052277/

-Lukas

Comment 21 Zorro Lang 2019-03-12 16:43:46 UTC
(In reply to Lukáš Czerner from comment #20)
> Hi Zorro,
> 
> the ftruncate is absolutely necessary because this bug is about writes at
> the i_size boundary, or rather in the same fs block as i_size (see the
> description of my ext4 patch).
> 
> https://patchwork.ozlabs.org/patch/1052277/

Thanks Lukas, finally I can understand all things by reading the patch:)
I've written a case to xfstests, please feel free to review it if you'd like.

> 
> -Lukas

Comment 27 Bruno Meneguele 2019-03-27 20:57:00 UTC
Patch(es) committed on kernel-3.10.0-1031.el7


Note You need to log in before you can comment on or make changes to this bug.