|Summary:||GFS not syncing inode mtime before freeze|
|Product:||Red Hat Enterprise Linux 5||Reporter:||Harold Miller <hamiller>|
|Component:||gfs-kmod||Assignee:||Ben Marzinski <bmarzins>|
|Status:||CLOSED ERRATA||QA Contact:||Cluster QE <mspqa-list>|
|Version:||5.8||CC:||adas, anprice, bmarzins, briang, cluster-maint, dwysocha, fsorenso, hamiller, jharriga, jruemker, pevans, rpeterso, sbradley, sergey_meirovich, smayhew, swhiteho|
|Fixed In Version:||gfs-kmod-0.1.34-22.el5||Doc Type:||Bug Fix|
Cause: gfs was not syncing the inode attributes to disk along with the data for shared mmap writes. Consequence: files did not have their mtime updated by mmapped writes Fix: gfs now syncs the inode attributes to disk along with the mmap write data when msync is called Result: files now have their mtime updated by mmap writes if msync is called.
Red Hat Enterprise Linux Server release 5.8 Booted kernel: 2.6.18-308.8.2.el5 gfs2-utils-0.1.62-34.el5.x86_64 gfs-utils-0.1.20-13.el5.x86_64
|Last Closed:||2014-09-16 00:18:58 UTC||Type:||Bug|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Cloudforms Team:||---||Target Upstream Version:|
Description Harold Miller 2014-02-17 22:58:23 UTC
Testing with meta.c (attached) Test program created numerous files using various file open() flags and truncated each to 10 MiB. The test then wrote a 512-byte block beginning at offset 0 to each file. After a 10-second delay, the next 512-byte block was written, and so forth. During the testing, both regular file write() and memcpy() to an mmapped file were tested, as well as combinations of fsync(), fdatasync(), and no file sync. While the test program was running (after 10+ hours), customer initiated their snapshot + backup process. The metadata of the various files (in particular, the 'mtime') was then compared. The result of the testing indicated that the file open() flags had no bearing on whether mtime was updated. Both fsync() and fdatasync() were confirmed to behave as expected (fsync forced both data write-out and 'mtime' update, while fdatasync() only caused write-out of the data). When the file was written to through an mmap, fdatasync() was called, and fsync() was not called, the file's 'mtime' was never updated. In these cases, the mtime remained unchanged since the file creation time. File contents appeared to be correct in all cases. In comparison, when the test program ran on a local ext4 filesystem (with the default data=ordered), the combination of mmap and not calling either fsync() or fdatasync() resulted in files where the 'mtime' would periodically lag behind the write time (and the mtime of the other files), but never by more than 20 seconds. (Note: after running 'sync', all metadata was written to disk, and the maximum mtime difference was less than 10 milliseconds). It appears that gfs does not update the on-disk mtime in file writes when: an mmap() is used, fdatasync() is called, and fsync() is not called
Comment 1 Harold Miller 2014-02-17 22:59:13 UTC
Created attachment 864329 [details] test program used to demonstrate the issue
Comment 2 Steve Whitehouse 2014-02-18 14:41:23 UTC
Well I'm not sure exactly what the problem is that you are reporting here. Why is it not expected that mtime would update during an mmap write if fsync has not been called? The man page for fdatasync says: fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modifica‐ tion; see stat(2)) do not require flushing because they are not neces‐ sary for a subsequent data read to be handled correctly. On the other hand, a change to the file size (st_size, as made by say ftruncate(2)), would require a metadata flush. I can't see the attachment in comment #1, as it seems to be corrupt or marked as the wrong type of file.
Comment 4 Sergey Meirovich 2014-02-18 14:58:16 UTC
Comment 5 Steve Whitehouse 2014-02-18 15:13:26 UTC
*** Bug 1066178 has been marked as a duplicate of this bug. ***
Comment 6 Steve Whitehouse 2014-02-18 15:14:38 UTC
*** Bug 1066179 has been marked as a duplicate of this bug. ***
Comment 7 Steve Whitehouse 2014-02-18 15:34:23 UTC
Thanks for the link in comment #4, however so far as I can tell the behaviour is as expected per the man page information for fdatasync() unless I'm missing something here?
Comment 9 Sergey Meirovich 2014-02-19 00:10:32 UTC
In-memory mtime is in fact updated. And the fact that in-memory never propagated to the disk creates a lot of confusion in case of the last close (2) SAN level snapshots creation. Though I think DRBD will be affected the same.
Comment 10 Sergey Meirovich 2014-02-19 00:13:01 UTC
I think it would be rather nice to have `gfs_tool freeze' to synchronize in-memory with on-disc. This single change will make me happy.
Comment 11 Steve Whitehouse 2014-02-19 11:40:15 UTC
Ok, so the issue is not fdatasync then, its not syncing before a freeze. I'll update the bug description accordingly.
Comment 14 RHEL Product and Program Management 2014-02-25 00:09:08 UTC
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate, in the next release of Red Hat Enterprise Linux.
Comment 15 RHEL Product and Program Management 2014-03-07 12:18:25 UTC
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in the last planned RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX. To request that Red Hat re-consider this request, please re-open the bugzilla via appropriate support channels and provide additional business and/or technical details about its importance to you.
Comment 16 Steve Whitehouse 2014-03-07 12:37:07 UTC
Please do not close this bug. It should not be closed as WONTFIX
Comment 19 Ben Marzinski 2014-03-15 05:00:03 UTC
Actually, I don't see any evidence that the mtime is getting updated at all for mmap writes in your test program. the file keeps it's creation mtime no matter how long you leave it. Unmounting and remounting doesn't update it. I'm going to do some more digging. Apparently the Single Unix Specification guarantee is that mtime will be updated at the very least by the next call to msync() or if no call to mysnc happens, by munmap(). Presumably, even if there isn't an explicit call to munmap(), when the file is closed, the mtime should be updated.
Comment 20 Ben Marzinski 2014-03-25 04:19:50 UTC
Created attachment 878304 [details] patch to make gfs write out mtime on mmap syncs This patch modifies gfs_write_inode to make it actually write out the modified vfs inode attributes. This will cause mtime to get updated when the mmap file changes are synced back to disk.
Comment 21 Steve Whitehouse 2014-03-25 09:01:07 UTC
This bit: if (ip && sync) gfs_log_flush_glock(ip->i_gl, 0); seems odd, bearing in mind that ip is dereferenced higher up the function. However that seems to be in the existing code, but maybe worth checking for certain that ip can never be NULL here. Still in dev phase for 5.11, so lets ask for blocker for this as it is something that we ought to ensure is fixed.
Comment 22 Ben Marzinski 2014-03-25 20:09:13 UTC
Created attachment 878667 [details] Version of the patch that is posted to cluster-devel This version is just like the former, without the debug printk, and with a sensible checking if ip != NULL. Now, I can't see where ip could be NULL calling this function, but the other gfs_super_ops functions also do this check, so I assume that there was at least a worry that this could happen, and I left the check in. 5.11 seems like a bad place to clean up this sort of thing.
Comment 24 Nate Straz 2014-05-23 19:38:22 UTC
Verified against kmod-gfs-0.1.34-20.el5 with the following simple python script: from mmap import * from time import sleep from random import randint maxsize = 2**25 f = open("mmapfile", "r+b") mm = mmap(f.fileno(), maxsize, MAP_SHARED) token = "hello" while True: o = randint(0, maxsize - len(token)) mm[o:o+len(token)] = token sleep(5) Running `watch stat mmapfile` on each node shows that the mtime is updated across the cluster while the above script is running. With kmod-gfs-0.1.34-18.el5, the mtime on other nodes never updates.
Comment 25 Nate Straz 2014-05-27 15:14:57 UTC
After I stopped my test case to verify this bug, my system panicked with the following assert. GFS: fsid=nate:nate0.1: fatal: assertion "gfs_glock_is_locked_by_me(gl) && gfs_glock_is_held_excl(gl)" failed GFS: fsid=nate:nate0.1: function = gfs_trans_add_gl GFS: fsid=nate:nate0.1: file = /builddir/build/BUILD/gfs-kmod-0.1.34/_kmod_build_/src/gfs/trans.c, line = 234 GFS: fsid=nate:nate0.1: time = 1400873905 GFS: fsid=nate:nate0.1: about to withdraw from the cluster ----------- [cut here ] --------- [please bite here ] --------- Kernel BUG at ...ld/BUILD/gfs-kmod-0.1.34/_kmod_build_/src/gfs/lm.c:112 invalid opcode: 0000  SMP last sysfs file: /fs/gfs/nate:nate0/lock_module/recover_done CPU 0 Modules linked in: gfs(U) dm_log_clustered(U) lock_nolock lock_dlm gfs2 dlm gnbd(U) configfs lpfc scsi_transport_fc sg be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic cxgb3i iptable_filter ip_tables x_tables sctp autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api uio libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy i2c_piix4 i2c_core pcspkr virtio_balloon virtio_net serio_raw tpm_tis tpm tpm_bios dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci ata_piix libata sd_mod scsi_mod virtio_blk virtio_pci virtio_ring virtio ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 30450, comm: python Tainted: G -------------------- 2.6.18-389.el5 #1 RIP: 0010:[<ffffffff8882f06f>] [<ffffffff8882f06f>] :gfs:gfs_lm_withdraw+0x97/0x103 RSP: 0018:ffff810030e65b88 EFLAGS: 00010202 RAX: 000000000000003e RBX: ffffc200001ce000 RCX: 0000000000000286 RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff803270dc RBP: ffffc200002069d4 R08: 000000000000000d R09: 00000000ffffffff R10: 0000000000000000 R11: 000000000000000a R12: ffff81003b31fdc8 R13: ffffc200001ce000 R14: 0000000000000001 R15: ffff810030e65e98 FS: 00002b4a72e94170(0000) GS:ffffffff80436000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000152c3000 CR3: 0000000037549000 CR4: 00000000000006e0 Process python (pid: 30450, threadinfo ffff810030e64000, task ffff810039f657b0) Stack: 0000003000000030 ffff810030e65ca0 ffff810030e65ba8 ffff810030e65c38 0000000300000000 ffffc20000206870 ffffc200002069d4 ffffffff8884e6f0 ffffc200002069d4 ffffffff88848c60 ffff810030e65cb8 000000000001725f Call Trace: [<ffffffff8002621e>] find_or_create_page+0x22/0x72 [<ffffffff8881b753>] :gfs:getbuf+0x172/0x181 [<ffffffff8881ba54>] :gfs:gfs_dreread+0x72/0xc6 [<ffffffff8884725c>] :gfs:gfs_assert_withdraw_i+0x30/0x3c [<ffffffff88845e10>] :gfs:gfs_trans_add_gl+0x82/0xbe [<ffffffff88845ef3>] :gfs:gfs_trans_add_bh+0xa7/0xd9 [<ffffffff8883d8ba>] :gfs:gfs_write_inode+0x10c/0x166 [<ffffffff80063117>] wait_for_completion+0x1f/0xa2 [<ffffffff8002ff43>] __writeback_single_inode+0x1dd/0x31c [<ffffffff88825d51>] :gfs:gfs_glock_nq+0x3aa/0x3ea [<ffffffff800f8ee8>] sync_inode+0x24/0x33 [<ffffffff88839f32>] :gfs:gfs_fsync+0x88/0xba [<ffffffff8005afe4>] do_writepages+0x29/0x2f [<ffffffff8005055a>] do_fsync+0x52/0xa4 [<ffffffff800d468e>] sys_msync+0xff/0x180 [<ffffffff8005d29e>] tracesys+0xd5/0xdf Code: 0f 0b 68 3b ad 84 88 c2 70 00 eb fe 48 89 ee 48 c7 c7 7b ad RIP [<ffffffff8882f06f>] :gfs:gfs_lm_withdraw+0x97/0x103 RSP <ffff810030e65b88> <0>Kernel panic - not syncing: Fatal exception
Comment 26 Ben Marzinski 2014-05-30 06:38:58 UTC
Created attachment 900627 [details] New version to fix the QA issues. This patch is similar to my earlier version, but includes more checks in gfs_write_inode to make sure that it can start a transaction. It now makes sure that there isn't already a transaction in progress, and that if it already has a lock, that the lock is exclusive. I also noticed that the original fix itself doesn't always work. The issue is that after the last holder of the inode glock is dropped, the vfs inode timestamps are overwritten by the gfs inode ones. During mmap, the mtime is getting updated in the vfs inode during the page fault. If either a call to stat the inode or a call to write out the vfs inode doesn't attempt to lock it while the mmap call is still holding its glock, the updated vfs timestamp will get overwritten, and be lost. To solve this, I've changed gfs_inode_attr_in() to not overwrite the vfs timestamps unless the gfs ones are newer. The timestamps still always get synced to the gfs inode when the vfs inode is first created. I have verified that it is still possible to manually reset the inode mtime to an earlier time using the touch command, and I can't think of any other possible issue with doing this. However, I'll bug Steve about this tomorrow to see if he can think of any problem with it.
Comment 27 Ben Marzinski 2014-05-30 18:14:01 UTC
Applied the above patch.
Comment 28 Ben Marzinski 2014-06-06 06:29:32 UTC
Created attachment 902772 [details] Yet another version. This one should do no harm, but it does much less good. The problem with my previous patches is that they do incorrect lock ordering. In RHEL5 gfs, the inode glock must be grabbed before inode is locked (it's opposite in RHEL5 gfs2), and __sync_single_inode() locks the inode before it calls gfs_write_inode(). This means that gfs_write_inode() can't lock the inode glock. Some times __sync_single_inode() is called when the process already has locked the inode glock exclusively. In these cases gfs_write_inode is able to write out the inode. Unfortunately, limiting gfs_write_inode() to only writing out the inode structure when it's called with the inode glock already held doesn't get the mtime updated when munmap is called, or an mmaped file is closed. The only condition that will trigger the mtime getting updated on disk is when the msync is called on the mmapped area with the MS_SYNC flag. Right now, I can't see a way to do better than this. For this to work, we'd need to be able to write out the inode in a transaction from some other gfs function that was getting called at least as often as whenever you msync or munmap and mmaped area, which hasn't already locked the inode. I don't see any function that would work for this. It's possible that this is the best we can do in gfs.
Comment 29 Ben Marzinski 2014-06-06 21:24:11 UTC
Above patch applied. This will only fix the issue when msync is called with MS_SYNC.
Comment 30 Nate Straz 2014-06-10 20:31:02 UTC
I can confirm that mtime is updated on disk (and visible on other nodes) after msync is called. Otherwise, mtime is not updated when munmap is called, the file is closed, or fsfreeze is invoked. Used: kmod-gfs-0.1.34-22.el5 Move to VERIFIED after TPS runs are complete.
Comment 32 errata-xmlrpc 2014-09-16 00:18:58 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1234.html