Message ID | 20170724145019.GG9167@eguan.usersys.redhat.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Headers | show |
On Mon, Jul 24, 2017 at 10:50:19PM +0800, Eryu Guan wrote: > Hi all, > > Currently generic/446 could trigger a warning in iomap_dio_actor() > easily, it's complaining about unexpected iomap->type (see the end for > full call trace). > > fs/iomap.c: iomap_dio_actor() > 859 default: > 860 WARN_ON_ONCE(1); > 861 return -EIO; > > It's due to the race between direct read and mmap write pagefault on the > same *sparse* file. > > direct read process mmap write process > xfs_file_dio_aio_read (take IOLOCK_SHARED) > iomap_dio_rw > iomap_apply > filemap_write_and_wait_range > invalidate_inode_pages2_range > iomap_apply > mmap > xfs_filemap_page_mkwrite (take MMAPLOCK_SHARED) > iomap_page_mkwrite > iomap_apply > xfs_file_iomap_begin > xfs_file_iomap_begin_delay (take ILOCK_EXCL) > (release ILOCK_EXCL) > ... > xfs_file_iomap_begin > (take ILOCK and read in bmap info) > iomap_dio_actor > ... > > The dio path and page_mkwrite path are taking different locks so they're > not serialized. So after dio read flushing the file range but before > taking ILOCK, the page faults from mmap write could fault in and update > the file map first with delalloc blocks. Then the dio reader sees this > delalloc block map unexpectedly. > > I'm not sure what's the best way to fix it, but a quick hack of > disabling delalloc in the write page fault path could do the work for > me, e.g. > > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -981,7 +981,7 @@ xfs_file_iomap_begin( > if (XFS_FORCED_SHUTDOWN(mp)) > return -EIO; > > - if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) && > + if (((flags & (IOMAP_WRITE|IOMAP_DIRECT|IOMAP_FAULT)) == IOMAP_WRITE) && > !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) { > /* Reserve delalloc blocks for regular writeback. */ > return xfs_file_iomap_begin_delay(inode, offset, length, flags, > > So that mmap write page fault brings in already allocated blocks, and > dio reader sees non-IOMAP_DELALLOC iomaps. But now we can't take advantage of delayed allocation for mmap writes even when the user isn't being evil by peppering us with dio reads. > I know concurrent dio and mmap io are not recommended, so is this > something that doens't need a fix at all, and the test should filter out > the warning instead? XFS no longer BUG_ON, so I guess it's fine if the test filters out the warning. It looks like the end result of a dioread/mmapwrite collision is that the dio reader gets -EIO. Would it be better to return a short read? --D > > Thanks, > Eryu > > [162097.402359] ------------[ cut here ]------------ > [162097.402745] WARNING: CPU: 1 PID: 20582 at fs/iomap.c:860 iomap_dio_actor+0xca/0x370 > [162097.403392] Modules linked in: rpcsec_gss_krb5 nfsv4(OE) dns_resolver nfs(OE) fscache btrfs xor raid6_pq ppdev i2c_piix4 parport_pc sg virtio_balloon parport i2c_core pcspkr nfsd(OE) auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ata_generic pata_acpi virtio_scsi ata_piix 8139too libata virtio_pci 8139cp floppy virtio_ring mii serio_raw virtio > [162097.406258] CPU: 1 PID: 20582 Comm: xfs_io Tainted: G W OE 4.13.0-rc1 #191 > [162097.406898] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 > [162097.407388] task: ffff880212e40000 task.stack: ffffc900012ac000 > [162097.408403] RIP: 0010:iomap_dio_actor+0xca/0x370 > [162097.409072] RSP: 0018:ffffc900012afc20 EFLAGS: 00010202 > [162097.409533] RAX: 0000000000000002 RBX: ffffc900012afcb0 RCX: 000000000241b000 > [162097.410144] RDX: 00000000000001ff RSI: ffffc900012afe30 RDI: ffffc900012afe68 > [162097.410720] RBP: ffffc900012afc90 R08: ffffc900012afcb0 R09: ffff880211a65c60 > [162097.411324] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 > [162097.411912] R13: ffffffffa016bb60 R14: ffff880211a65c60 R15: 0000000000000009 > [162097.412503] FS: 00007f94e825a740(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000 > [162097.413181] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [162097.413648] CR2: 000000000241a000 CR3: 0000000212c22000 CR4: 00000000000006e0 > [162097.414261] Call Trace: > [162097.414478] ? iomap_dio_zero+0x100/0x100 > [162097.414808] iomap_apply+0xa1/0x110 > [162097.415133] iomap_dio_rw+0x20b/0x390 > [162097.415442] ? iomap_dio_zero+0x100/0x100 > [162097.415804] xfs_file_dio_aio_read+0x6e/0xf0 [xfs] > [162097.416250] xfs_file_read_iter+0xab/0xc0 [xfs] > [162097.416629] __vfs_read+0xe0/0x150 > [162097.416925] vfs_read+0x8c/0x130 > [162097.417217] SyS_pread64+0x87/0xb0 > [162097.417512] do_syscall_64+0x67/0x150 > [162097.417818] entry_SYSCALL64_slow_path+0x25/0x25 > [162097.418234] RIP: 0033:0x7f94e7e3cf73 > [162097.418533] RSP: 002b:00007fff7f52ae90 EFLAGS: 00000293 ORIG_RAX: 0000000000000011 > [162097.419167] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f94e7e3cf73 > [162097.419740] RDX: 0000000000001000 RSI: 000000000241a000 RDI: 0000000000000003 > [162097.420339] RBP: 00007fff7f52af50 R08: 0000000000000000 R09: 0000000000000000 > [162097.420924] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000 > [162097.421516] R13: 0000000000020000 R14: 0000000000000000 R15: 0000000000000000 > [162097.422124] Code: e1 48 09 c1 48 85 ca 0f 85 82 02 00 00 0f b7 43 18 66 83 f8 03 0f 84 fb 01 00 00 66 83 f8 04 74 35 66 83 f8 01 0f 84 07 02 00 00 <0f> ff 48 c7 c0 fb ff ff ff 48 8b 4d d0 65 48 33 0c 25 28 00 00 > [162097.423661] ---[ end trace 927f98a352499782 ]--- > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 24, 2017 at 09:05:51AM -0700, Darrick J. Wong wrote: > On Mon, Jul 24, 2017 at 10:50:19PM +0800, Eryu Guan wrote: > > Hi all, > > > > Currently generic/446 could trigger a warning in iomap_dio_actor() > > easily, it's complaining about unexpected iomap->type (see the end for > > full call trace). > > > > fs/iomap.c: iomap_dio_actor() > > 859 default: > > 860 WARN_ON_ONCE(1); > > 861 return -EIO; > > > > It's due to the race between direct read and mmap write pagefault on the > > same *sparse* file. > > > > direct read process mmap write process > > xfs_file_dio_aio_read (take IOLOCK_SHARED) > > iomap_dio_rw > > iomap_apply > > filemap_write_and_wait_range > > invalidate_inode_pages2_range > > iomap_apply > > mmap > > xfs_filemap_page_mkwrite (take MMAPLOCK_SHARED) > > iomap_page_mkwrite > > iomap_apply > > xfs_file_iomap_begin > > xfs_file_iomap_begin_delay (take ILOCK_EXCL) > > (release ILOCK_EXCL) > > ... > > xfs_file_iomap_begin > > (take ILOCK and read in bmap info) > > iomap_dio_actor > > ... > > > > The dio path and page_mkwrite path are taking different locks so they're > > not serialized. So after dio read flushing the file range but before > > taking ILOCK, the page faults from mmap write could fault in and update > > the file map first with delalloc blocks. Then the dio reader sees this > > delalloc block map unexpectedly. > > > > I'm not sure what's the best way to fix it, but a quick hack of > > disabling delalloc in the write page fault path could do the work for > > me, e.g. > > > > --- a/fs/xfs/xfs_iomap.c > > +++ b/fs/xfs/xfs_iomap.c > > @@ -981,7 +981,7 @@ xfs_file_iomap_begin( > > if (XFS_FORCED_SHUTDOWN(mp)) > > return -EIO; > > > > - if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) && > > + if (((flags & (IOMAP_WRITE|IOMAP_DIRECT|IOMAP_FAULT)) == IOMAP_WRITE) && > > !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) { > > /* Reserve delalloc blocks for regular writeback. */ > > return xfs_file_iomap_begin_delay(inode, offset, length, flags, > > > > So that mmap write page fault brings in already allocated blocks, and > > dio reader sees non-IOMAP_DELALLOC iomaps. > > But now we can't take advantage of delayed allocation for mmap writes > even when the user isn't being evil by peppering us with dio reads. That's true, so I called it a hack not fix :) And I'm wondering what's the bigger problem of letting the dio path take MMAPLOCK too to serialize against mmap page faults? e.g. xfs_file_dio_aio_read() takes XFS_MMAPLOCK_SHRED and xfs_filemap_page_mkwrite() takes XFS_MMAPLOCK_EXCL. > > > I know concurrent dio and mmap io are not recommended, so is this > > something that doens't need a fix at all, and the test should filter out > > the warning instead? > > XFS no longer BUG_ON, so I guess it's fine if the test filters out the > warning. OK. > > It looks like the end result of a dioread/mmapwrite collision is that > the dio reader gets -EIO. Would it be better to return a short read? Yes, right now dio read gets EIO in this case. I can't tell which one is better, if the whole dio vs mmap is not recommended, EIO seems to be a strong signal that indicates "don't do this " :) Thanks, Eryu -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 25, 2017 at 01:51:25AM +0800, Eryu Guan wrote: > On Mon, Jul 24, 2017 at 09:05:51AM -0700, Darrick J. Wong wrote: > > On Mon, Jul 24, 2017 at 10:50:19PM +0800, Eryu Guan wrote: > > > Hi all, > > > > > > Currently generic/446 could trigger a warning in iomap_dio_actor() > > > easily, it's complaining about unexpected iomap->type (see the end for > > > full call trace). > > > > > > fs/iomap.c: iomap_dio_actor() > > > 859 default: > > > 860 WARN_ON_ONCE(1); > > > 861 return -EIO; > > > > > > It's due to the race between direct read and mmap write pagefault on the > > > same *sparse* file. > > > > > > direct read process mmap write process > > > xfs_file_dio_aio_read (take IOLOCK_SHARED) > > > iomap_dio_rw > > > iomap_apply > > > filemap_write_and_wait_range > > > invalidate_inode_pages2_range > > > iomap_apply > > > mmap > > > xfs_filemap_page_mkwrite (take MMAPLOCK_SHARED) > > > iomap_page_mkwrite > > > iomap_apply > > > xfs_file_iomap_begin > > > xfs_file_iomap_begin_delay (take ILOCK_EXCL) > > > (release ILOCK_EXCL) > > > ... > > > xfs_file_iomap_begin > > > (take ILOCK and read in bmap info) > > > iomap_dio_actor > > > ... > > > > > > The dio path and page_mkwrite path are taking different locks so they're > > > not serialized. So after dio read flushing the file range but before > > > taking ILOCK, the page faults from mmap write could fault in and update > > > the file map first with delalloc blocks. Then the dio reader sees this > > > delalloc block map unexpectedly. > > > > > > I'm not sure what's the best way to fix it, but a quick hack of > > > disabling delalloc in the write page fault path could do the work for > > > me, e.g. > > > > > > --- a/fs/xfs/xfs_iomap.c > > > +++ b/fs/xfs/xfs_iomap.c > > > @@ -981,7 +981,7 @@ xfs_file_iomap_begin( > > > if (XFS_FORCED_SHUTDOWN(mp)) > > > return -EIO; > > > > > > - if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) && > > > + if (((flags & (IOMAP_WRITE|IOMAP_DIRECT|IOMAP_FAULT)) == IOMAP_WRITE) && > > > !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) { > > > /* Reserve delalloc blocks for regular writeback. */ > > > return xfs_file_iomap_begin_delay(inode, offset, length, flags, > > > > > > So that mmap write page fault brings in already allocated blocks, and > > > dio reader sees non-IOMAP_DELALLOC iomaps. > > > > But now we can't take advantage of delayed allocation for mmap writes > > even when the user isn't being evil by peppering us with dio reads. > > That's true, so I called it a hack not fix :) > > And I'm wondering what's the bigger problem of letting the dio path take > MMAPLOCK too to serialize against mmap page faults? e.g. > xfs_file_dio_aio_read() takes XFS_MMAPLOCK_SHRED and Adds locking overhead to all directio paths to cover a scenario we already don't really support... > xfs_filemap_page_mkwrite() takes XFS_MMAPLOCK_EXCL. ...and page_mkwrite loses the ability to service multiple faults in parallel? > > > I know concurrent dio and mmap io are not recommended, so is this > > > something that doens't need a fix at all, and the test should filter out > > > the warning instead? > > > > XFS no longer BUG_ON, so I guess it's fine if the test filters out the > > warning. > > OK. > > > > > It looks like the end result of a dioread/mmapwrite collision is that > > the dio reader gets -EIO. Would it be better to return a short read? > > Yes, right now dio read gets EIO in this case. I can't tell which one is > better, if the whole dio vs mmap is not recommended, EIO seems to be a > strong signal that indicates "don't do this " :) Sure, since POSIX never bothered to make EPEBKAC official. :P (But in all seriousness I think it's sufficient if we burp back EIO for a usage model that we don't support.) --D > > Thanks, > Eryu > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 25, 2017 at 01:51:25AM +0800, Eryu Guan wrote: > On Mon, Jul 24, 2017 at 09:05:51AM -0700, Darrick J. Wong wrote: > > On Mon, Jul 24, 2017 at 10:50:19PM +0800, Eryu Guan wrote: > And I'm wondering what's the bigger problem of letting the dio path take > MMAPLOCK too to serialize against mmap page faults? e.g. > xfs_file_dio_aio_read() takes XFS_MMAPLOCK_SHRED and > xfs_filemap_page_mkwrite() takes XFS_MMAPLOCK_EXCL. direct IO calls get_user_pages() which can trigger page faults and that means we can't hold any lock that is taken in the page fault path. It's the same reason we have the MMAPLOCK in the first place - we can't use the IOLOCK in the page fault path because copy-in/copy-out in the buffered IO path can trigger page faults, hence we need some other lock that we can use to serialise page faults against extent operations (like fallocate). > > It looks like the end result of a dioread/mmapwrite collision is that > > the dio reader gets -EIO. Would it be better to return a short read? > > Yes, right now dio read gets EIO in this case. I can't tell which one is > better, if the whole dio vs mmap is not recommended, EIO seems to be a > strong signal that indicates "don't do this " :) $ man 2 open ..... .... Likewise, applications should avoid mixing mmap(2) of files with direct I/O to the same files. .... That said, EIO is extremely unfriendly - a short read would be much better as a properly written app will simply try to read the bit it didn't get again, whereas EIO tends to be an indication of severe failure to the application... Cheers, Dave.
On Tue, Jul 25, 2017 at 08:02:05AM +1000, Dave Chinner wrote: > On Tue, Jul 25, 2017 at 01:51:25AM +0800, Eryu Guan wrote: > > On Mon, Jul 24, 2017 at 09:05:51AM -0700, Darrick J. Wong wrote: > > > On Mon, Jul 24, 2017 at 10:50:19PM +0800, Eryu Guan wrote: > > And I'm wondering what's the bigger problem of letting the dio path take > > MMAPLOCK too to serialize against mmap page faults? e.g. > > xfs_file_dio_aio_read() takes XFS_MMAPLOCK_SHRED and > > xfs_filemap_page_mkwrite() takes XFS_MMAPLOCK_EXCL. > > direct IO calls get_user_pages() which can trigger page faults and > that means we can't hold any lock that is taken in the page fault > path. Ah, that's what I missed, thanks! > > It's the same reason we have the MMAPLOCK in the first place - we > can't use the IOLOCK in the page fault path because copy-in/copy-out > in the buffered IO path can trigger page faults, hence we need some > other lock that we can use to serialise page faults against extent > operations (like fallocate). > > > > It looks like the end result of a dioread/mmapwrite collision is that > > > the dio reader gets -EIO. Would it be better to return a short read? > > > > Yes, right now dio read gets EIO in this case. I can't tell which one is > > better, if the whole dio vs mmap is not recommended, EIO seems to be a > > strong signal that indicates "don't do this " :) > > $ man 2 open > ..... > .... Likewise, applications should avoid mixing mmap(2) > of files with direct I/O to the same files. > .... > > That said, EIO is extremely unfriendly - a short read would be much > better as a properly written app will simply try to read the bit it > didn't get again, whereas EIO tends to be an indication of severe > failure to the application... The problem is dio read could hit delalloc blocks on its first read and return 0, how can we tell between a short read and a real EOF? I may miss something again .. Thanks, Eryu -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 25, 2017 at 03:53:48PM +0800, Eryu Guan wrote: > On Tue, Jul 25, 2017 at 08:02:05AM +1000, Dave Chinner wrote: > > On Tue, Jul 25, 2017 at 01:51:25AM +0800, Eryu Guan wrote: > > > On Mon, Jul 24, 2017 at 09:05:51AM -0700, Darrick J. Wong wrote: > > > > On Mon, Jul 24, 2017 at 10:50:19PM +0800, Eryu Guan wrote: > > > And I'm wondering what's the bigger problem of letting the dio path take > > > MMAPLOCK too to serialize against mmap page faults? e.g. > > > xfs_file_dio_aio_read() takes XFS_MMAPLOCK_SHRED and > > > xfs_filemap_page_mkwrite() takes XFS_MMAPLOCK_EXCL. > > > > direct IO calls get_user_pages() which can trigger page faults and > > that means we can't hold any lock that is taken in the page fault > > path. > > Ah, that's what I missed, thanks! > > > > > It's the same reason we have the MMAPLOCK in the first place - we > > can't use the IOLOCK in the page fault path because copy-in/copy-out > > in the buffered IO path can trigger page faults, hence we need some > > other lock that we can use to serialise page faults against extent > > operations (like fallocate). > > > > > > It looks like the end result of a dioread/mmapwrite collision is that > > > > the dio reader gets -EIO. Would it be better to return a short read? > > > > > > Yes, right now dio read gets EIO in this case. I can't tell which one is > > > better, if the whole dio vs mmap is not recommended, EIO seems to be a > > > strong signal that indicates "don't do this " :) > > > > $ man 2 open > > ..... > > .... Likewise, applications should avoid mixing mmap(2) > > of files with direct I/O to the same files. > > .... > > > > That said, EIO is extremely unfriendly - a short read would be much > > better as a properly written app will simply try to read the bit it > > didn't get again, whereas EIO tends to be an indication of severe > > failure to the application... > > The problem is dio read could hit delalloc blocks on its first read and > return 0, how can we tell between a short read and a real EOF? I may > miss something again .. Well, stat() + lseek(fd, 0, SEEK_CUR) can be used to determine if the read() started at the current EOF. But, yeah, the read(2) man page says "return of zero means EOF" but then muddies the waters by saying "a read with count = 0 will return zero but have no effect". IOWs, a return of 0 does not /always/ mean the read was called at/beyond EOF.... Yay? POSIX defined behaviour can be a real problem at times, especially when you are using methods that are specifically designed to work around limitations of the POSIX specification. Like, for example, applications using direct IO... Cheers, Dave.
--- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -981,7 +981,7 @@ xfs_file_iomap_begin( if (XFS_FORCED_SHUTDOWN(mp)) return -EIO; - if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) && + if (((flags & (IOMAP_WRITE|IOMAP_DIRECT|IOMAP_FAULT)) == IOMAP_WRITE) && !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) { /* Reserve delalloc blocks for regular writeback. */ return xfs_file_iomap_begin_delay(inode, offset, length, flags,