diff mbox

warnings complaining IOMAP_DELALLOC blocks in iomap_dio_actor from generic/446

Message ID 20170724145019.GG9167@eguan.usersys.redhat.com (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Eryu Guan July 24, 2017, 2:50 p.m. UTC
Hi all,

Currently generic/446 could trigger a warning in iomap_dio_actor()
easily, it's complaining about unexpected iomap->type (see the end for
full call trace).

fs/iomap.c: iomap_dio_actor()
 859         default:                                                                                                                                                                          
 860                 WARN_ON_ONCE(1);                                                                                                                                                          
 861                 return -EIO;

It's due to the race between direct read and mmap write pagefault on the
same *sparse* file. 

direct read process			mmap write process
xfs_file_dio_aio_read (take IOLOCK_SHARED)
 iomap_dio_rw
  iomap_apply
   filemap_write_and_wait_range
   invalidate_inode_pages2_range
   iomap_apply
					mmap
					 xfs_filemap_page_mkwrite (take MMAPLOCK_SHARED)
					  iomap_page_mkwrite
					   iomap_apply
					    xfs_file_iomap_begin
					     xfs_file_iomap_begin_delay (take ILOCK_EXCL)
					     (release ILOCK_EXCL)
					     ...
    xfs_file_iomap_begin
    (take ILOCK and read in bmap info)
    iomap_dio_actor
    ...

The dio path and page_mkwrite path are taking different locks so they're
not serialized. So after dio read flushing the file range but before
taking ILOCK, the page faults from mmap write could fault in and update
the file map first with delalloc blocks. Then the dio reader sees this
delalloc block map unexpectedly.

I'm not sure what's the best way to fix it, but a quick hack of
disabling delalloc in the write page fault path could do the work for
me, e.g.


So that mmap write page fault brings in already allocated blocks, and
dio reader sees non-IOMAP_DELALLOC iomaps.

I know concurrent dio and mmap io are not recommended, so is this
something that doens't need a fix at all, and the test should filter out
the warning instead?

Thanks,
Eryu

[162097.402359] ------------[ cut here ]------------
[162097.402745] WARNING: CPU: 1 PID: 20582 at fs/iomap.c:860 iomap_dio_actor+0xca/0x370
[162097.403392] Modules linked in: rpcsec_gss_krb5 nfsv4(OE) dns_resolver nfs(OE) fscache btrfs xor raid6_pq ppdev i2c_piix4 parport_pc sg virtio_balloon parport i2c_core pcspkr nfsd(OE) auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ata_generic pata_acpi virtio_scsi ata_piix 8139too libata virtio_pci 8139cp floppy virtio_ring mii serio_raw virtio
[162097.406258] CPU: 1 PID: 20582 Comm: xfs_io Tainted: G        W  OE   4.13.0-rc1 #191
[162097.406898] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[162097.407388] task: ffff880212e40000 task.stack: ffffc900012ac000
[162097.408403] RIP: 0010:iomap_dio_actor+0xca/0x370
[162097.409072] RSP: 0018:ffffc900012afc20 EFLAGS: 00010202
[162097.409533] RAX: 0000000000000002 RBX: ffffc900012afcb0 RCX: 000000000241b000
[162097.410144] RDX: 00000000000001ff RSI: ffffc900012afe30 RDI: ffffc900012afe68
[162097.410720] RBP: ffffc900012afc90 R08: ffffc900012afcb0 R09: ffff880211a65c60
[162097.411324] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[162097.411912] R13: ffffffffa016bb60 R14: ffff880211a65c60 R15: 0000000000000009
[162097.412503] FS:  00007f94e825a740(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
[162097.413181] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[162097.413648] CR2: 000000000241a000 CR3: 0000000212c22000 CR4: 00000000000006e0
[162097.414261] Call Trace:
[162097.414478]  ? iomap_dio_zero+0x100/0x100
[162097.414808]  iomap_apply+0xa1/0x110
[162097.415133]  iomap_dio_rw+0x20b/0x390
[162097.415442]  ? iomap_dio_zero+0x100/0x100
[162097.415804]  xfs_file_dio_aio_read+0x6e/0xf0 [xfs]
[162097.416250]  xfs_file_read_iter+0xab/0xc0 [xfs]
[162097.416629]  __vfs_read+0xe0/0x150
[162097.416925]  vfs_read+0x8c/0x130
[162097.417217]  SyS_pread64+0x87/0xb0
[162097.417512]  do_syscall_64+0x67/0x150
[162097.417818]  entry_SYSCALL64_slow_path+0x25/0x25
[162097.418234] RIP: 0033:0x7f94e7e3cf73
[162097.418533] RSP: 002b:00007fff7f52ae90 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
[162097.419167] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f94e7e3cf73
[162097.419740] RDX: 0000000000001000 RSI: 000000000241a000 RDI: 0000000000000003
[162097.420339] RBP: 00007fff7f52af50 R08: 0000000000000000 R09: 0000000000000000
[162097.420924] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
[162097.421516] R13: 0000000000020000 R14: 0000000000000000 R15: 0000000000000000
[162097.422124] Code: e1 48 09 c1 48 85 ca 0f 85 82 02 00 00 0f b7 43 18 66 83 f8 03 0f 84 fb 01 00 00 66 83 f8 04 74 35 66 83 f8 01 0f 84 07 02 00 00 <0f> ff 48 c7 c0 fb ff ff ff 48 8b 4d d0 65 48 33 0c 25 28 00 00
[162097.423661] ---[ end trace 927f98a352499782 ]---
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Darrick J. Wong July 24, 2017, 4:05 p.m. UTC | #1
On Mon, Jul 24, 2017 at 10:50:19PM +0800, Eryu Guan wrote:
> Hi all,
> 
> Currently generic/446 could trigger a warning in iomap_dio_actor()
> easily, it's complaining about unexpected iomap->type (see the end for
> full call trace).
> 
> fs/iomap.c: iomap_dio_actor()
>  859         default:                                                                                                                                                                          
>  860                 WARN_ON_ONCE(1);                                                                                                                                                          
>  861                 return -EIO;
> 
> It's due to the race between direct read and mmap write pagefault on the
> same *sparse* file. 
> 
> direct read process			mmap write process
> xfs_file_dio_aio_read (take IOLOCK_SHARED)
>  iomap_dio_rw
>   iomap_apply
>    filemap_write_and_wait_range
>    invalidate_inode_pages2_range
>    iomap_apply
> 					mmap
> 					 xfs_filemap_page_mkwrite (take MMAPLOCK_SHARED)
> 					  iomap_page_mkwrite
> 					   iomap_apply
> 					    xfs_file_iomap_begin
> 					     xfs_file_iomap_begin_delay (take ILOCK_EXCL)
> 					     (release ILOCK_EXCL)
> 					     ...
>     xfs_file_iomap_begin
>     (take ILOCK and read in bmap info)
>     iomap_dio_actor
>     ...
> 
> The dio path and page_mkwrite path are taking different locks so they're
> not serialized. So after dio read flushing the file range but before
> taking ILOCK, the page faults from mmap write could fault in and update
> the file map first with delalloc blocks. Then the dio reader sees this
> delalloc block map unexpectedly.
> 
> I'm not sure what's the best way to fix it, but a quick hack of
> disabling delalloc in the write page fault path could do the work for
> me, e.g.
> 
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -981,7 +981,7 @@ xfs_file_iomap_begin(
>         if (XFS_FORCED_SHUTDOWN(mp))
>                 return -EIO;
> 
> -       if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
> +       if (((flags & (IOMAP_WRITE|IOMAP_DIRECT|IOMAP_FAULT)) == IOMAP_WRITE) &&
>                         !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
>                 /* Reserve delalloc blocks for regular writeback. */
>                 return xfs_file_iomap_begin_delay(inode, offset, length, flags,
> 
> So that mmap write page fault brings in already allocated blocks, and
> dio reader sees non-IOMAP_DELALLOC iomaps.

But now we can't take advantage of delayed allocation for mmap writes
even when the user isn't being evil by peppering us with dio reads.

> I know concurrent dio and mmap io are not recommended, so is this
> something that doens't need a fix at all, and the test should filter out
> the warning instead?

XFS no longer BUG_ON, so I guess it's fine if the test filters out the
warning.

It looks like the end result of a dioread/mmapwrite collision is that
the dio reader gets -EIO.  Would it be better to return a short read?

--D

> 
> Thanks,
> Eryu
> 
> [162097.402359] ------------[ cut here ]------------
> [162097.402745] WARNING: CPU: 1 PID: 20582 at fs/iomap.c:860 iomap_dio_actor+0xca/0x370
> [162097.403392] Modules linked in: rpcsec_gss_krb5 nfsv4(OE) dns_resolver nfs(OE) fscache btrfs xor raid6_pq ppdev i2c_piix4 parport_pc sg virtio_balloon parport i2c_core pcspkr nfsd(OE) auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ata_generic pata_acpi virtio_scsi ata_piix 8139too libata virtio_pci 8139cp floppy virtio_ring mii serio_raw virtio
> [162097.406258] CPU: 1 PID: 20582 Comm: xfs_io Tainted: G        W  OE   4.13.0-rc1 #191
> [162097.406898] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
> [162097.407388] task: ffff880212e40000 task.stack: ffffc900012ac000
> [162097.408403] RIP: 0010:iomap_dio_actor+0xca/0x370
> [162097.409072] RSP: 0018:ffffc900012afc20 EFLAGS: 00010202
> [162097.409533] RAX: 0000000000000002 RBX: ffffc900012afcb0 RCX: 000000000241b000
> [162097.410144] RDX: 00000000000001ff RSI: ffffc900012afe30 RDI: ffffc900012afe68
> [162097.410720] RBP: ffffc900012afc90 R08: ffffc900012afcb0 R09: ffff880211a65c60
> [162097.411324] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
> [162097.411912] R13: ffffffffa016bb60 R14: ffff880211a65c60 R15: 0000000000000009
> [162097.412503] FS:  00007f94e825a740(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
> [162097.413181] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [162097.413648] CR2: 000000000241a000 CR3: 0000000212c22000 CR4: 00000000000006e0
> [162097.414261] Call Trace:
> [162097.414478]  ? iomap_dio_zero+0x100/0x100
> [162097.414808]  iomap_apply+0xa1/0x110
> [162097.415133]  iomap_dio_rw+0x20b/0x390
> [162097.415442]  ? iomap_dio_zero+0x100/0x100
> [162097.415804]  xfs_file_dio_aio_read+0x6e/0xf0 [xfs]
> [162097.416250]  xfs_file_read_iter+0xab/0xc0 [xfs]
> [162097.416629]  __vfs_read+0xe0/0x150
> [162097.416925]  vfs_read+0x8c/0x130
> [162097.417217]  SyS_pread64+0x87/0xb0
> [162097.417512]  do_syscall_64+0x67/0x150
> [162097.417818]  entry_SYSCALL64_slow_path+0x25/0x25
> [162097.418234] RIP: 0033:0x7f94e7e3cf73
> [162097.418533] RSP: 002b:00007fff7f52ae90 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
> [162097.419167] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f94e7e3cf73
> [162097.419740] RDX: 0000000000001000 RSI: 000000000241a000 RDI: 0000000000000003
> [162097.420339] RBP: 00007fff7f52af50 R08: 0000000000000000 R09: 0000000000000000
> [162097.420924] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
> [162097.421516] R13: 0000000000020000 R14: 0000000000000000 R15: 0000000000000000
> [162097.422124] Code: e1 48 09 c1 48 85 ca 0f 85 82 02 00 00 0f b7 43 18 66 83 f8 03 0f 84 fb 01 00 00 66 83 f8 04 74 35 66 83 f8 01 0f 84 07 02 00 00 <0f> ff 48 c7 c0 fb ff ff ff 48 8b 4d d0 65 48 33 0c 25 28 00 00
> [162097.423661] ---[ end trace 927f98a352499782 ]---
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eryu Guan July 24, 2017, 5:51 p.m. UTC | #2
On Mon, Jul 24, 2017 at 09:05:51AM -0700, Darrick J. Wong wrote:
> On Mon, Jul 24, 2017 at 10:50:19PM +0800, Eryu Guan wrote:
> > Hi all,
> > 
> > Currently generic/446 could trigger a warning in iomap_dio_actor()
> > easily, it's complaining about unexpected iomap->type (see the end for
> > full call trace).
> > 
> > fs/iomap.c: iomap_dio_actor()
> >  859         default:
> >  860                 WARN_ON_ONCE(1);
> >  861                 return -EIO;
> > 
> > It's due to the race between direct read and mmap write pagefault on the
> > same *sparse* file. 
> > 
> > direct read process			mmap write process
> > xfs_file_dio_aio_read (take IOLOCK_SHARED)
> >  iomap_dio_rw
> >   iomap_apply
> >    filemap_write_and_wait_range
> >    invalidate_inode_pages2_range
> >    iomap_apply
> > 					mmap
> > 					 xfs_filemap_page_mkwrite (take MMAPLOCK_SHARED)
> > 					  iomap_page_mkwrite
> > 					   iomap_apply
> > 					    xfs_file_iomap_begin
> > 					     xfs_file_iomap_begin_delay (take ILOCK_EXCL)
> > 					     (release ILOCK_EXCL)
> > 					     ...
> >     xfs_file_iomap_begin
> >     (take ILOCK and read in bmap info)
> >     iomap_dio_actor
> >     ...
> > 
> > The dio path and page_mkwrite path are taking different locks so they're
> > not serialized. So after dio read flushing the file range but before
> > taking ILOCK, the page faults from mmap write could fault in and update
> > the file map first with delalloc blocks. Then the dio reader sees this
> > delalloc block map unexpectedly.
> > 
> > I'm not sure what's the best way to fix it, but a quick hack of
> > disabling delalloc in the write page fault path could do the work for
> > me, e.g.
> > 
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -981,7 +981,7 @@ xfs_file_iomap_begin(
> >         if (XFS_FORCED_SHUTDOWN(mp))
> >                 return -EIO;
> > 
> > -       if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
> > +       if (((flags & (IOMAP_WRITE|IOMAP_DIRECT|IOMAP_FAULT)) == IOMAP_WRITE) &&
> >                         !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> >                 /* Reserve delalloc blocks for regular writeback. */
> >                 return xfs_file_iomap_begin_delay(inode, offset, length, flags,
> > 
> > So that mmap write page fault brings in already allocated blocks, and
> > dio reader sees non-IOMAP_DELALLOC iomaps.
> 
> But now we can't take advantage of delayed allocation for mmap writes
> even when the user isn't being evil by peppering us with dio reads.

That's true, so I called it a hack not fix :)

And I'm wondering what's the bigger problem of letting the dio path take
MMAPLOCK too to serialize against mmap page faults? e.g.
xfs_file_dio_aio_read() takes XFS_MMAPLOCK_SHRED and
xfs_filemap_page_mkwrite() takes XFS_MMAPLOCK_EXCL.

> 
> > I know concurrent dio and mmap io are not recommended, so is this
> > something that doens't need a fix at all, and the test should filter out
> > the warning instead?
> 
> XFS no longer BUG_ON, so I guess it's fine if the test filters out the
> warning.

OK.

> 
> It looks like the end result of a dioread/mmapwrite collision is that
> the dio reader gets -EIO.  Would it be better to return a short read?

Yes, right now dio read gets EIO in this case. I can't tell which one is
better, if the whole dio vs mmap is not recommended, EIO seems to be a
strong signal that indicates "don't do this " :)

Thanks,
Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong July 24, 2017, 6:51 p.m. UTC | #3
On Tue, Jul 25, 2017 at 01:51:25AM +0800, Eryu Guan wrote:
> On Mon, Jul 24, 2017 at 09:05:51AM -0700, Darrick J. Wong wrote:
> > On Mon, Jul 24, 2017 at 10:50:19PM +0800, Eryu Guan wrote:
> > > Hi all,
> > > 
> > > Currently generic/446 could trigger a warning in iomap_dio_actor()
> > > easily, it's complaining about unexpected iomap->type (see the end for
> > > full call trace).
> > > 
> > > fs/iomap.c: iomap_dio_actor()
> > >  859         default:
> > >  860                 WARN_ON_ONCE(1);
> > >  861                 return -EIO;
> > > 
> > > It's due to the race between direct read and mmap write pagefault on the
> > > same *sparse* file. 
> > > 
> > > direct read process			mmap write process
> > > xfs_file_dio_aio_read (take IOLOCK_SHARED)
> > >  iomap_dio_rw
> > >   iomap_apply
> > >    filemap_write_and_wait_range
> > >    invalidate_inode_pages2_range
> > >    iomap_apply
> > > 					mmap
> > > 					 xfs_filemap_page_mkwrite (take MMAPLOCK_SHARED)
> > > 					  iomap_page_mkwrite
> > > 					   iomap_apply
> > > 					    xfs_file_iomap_begin
> > > 					     xfs_file_iomap_begin_delay (take ILOCK_EXCL)
> > > 					     (release ILOCK_EXCL)
> > > 					     ...
> > >     xfs_file_iomap_begin
> > >     (take ILOCK and read in bmap info)
> > >     iomap_dio_actor
> > >     ...
> > > 
> > > The dio path and page_mkwrite path are taking different locks so they're
> > > not serialized. So after dio read flushing the file range but before
> > > taking ILOCK, the page faults from mmap write could fault in and update
> > > the file map first with delalloc blocks. Then the dio reader sees this
> > > delalloc block map unexpectedly.
> > > 
> > > I'm not sure what's the best way to fix it, but a quick hack of
> > > disabling delalloc in the write page fault path could do the work for
> > > me, e.g.
> > > 
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -981,7 +981,7 @@ xfs_file_iomap_begin(
> > >         if (XFS_FORCED_SHUTDOWN(mp))
> > >                 return -EIO;
> > > 
> > > -       if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
> > > +       if (((flags & (IOMAP_WRITE|IOMAP_DIRECT|IOMAP_FAULT)) == IOMAP_WRITE) &&
> > >                         !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
> > >                 /* Reserve delalloc blocks for regular writeback. */
> > >                 return xfs_file_iomap_begin_delay(inode, offset, length, flags,
> > > 
> > > So that mmap write page fault brings in already allocated blocks, and
> > > dio reader sees non-IOMAP_DELALLOC iomaps.
> > 
> > But now we can't take advantage of delayed allocation for mmap writes
> > even when the user isn't being evil by peppering us with dio reads.
> 
> That's true, so I called it a hack not fix :)
> 
> And I'm wondering what's the bigger problem of letting the dio path take
> MMAPLOCK too to serialize against mmap page faults? e.g.
> xfs_file_dio_aio_read() takes XFS_MMAPLOCK_SHRED and

Adds locking overhead to all directio paths to cover a scenario we
already don't really support...

> xfs_filemap_page_mkwrite() takes XFS_MMAPLOCK_EXCL.

...and page_mkwrite loses the ability to service multiple faults in parallel?

> > > I know concurrent dio and mmap io are not recommended, so is this
> > > something that doens't need a fix at all, and the test should filter out
> > > the warning instead?
> > 
> > XFS no longer BUG_ON, so I guess it's fine if the test filters out the
> > warning.
> 
> OK.
> 
> > 
> > It looks like the end result of a dioread/mmapwrite collision is that
> > the dio reader gets -EIO.  Would it be better to return a short read?
> 
> Yes, right now dio read gets EIO in this case. I can't tell which one is
> better, if the whole dio vs mmap is not recommended, EIO seems to be a
> strong signal that indicates "don't do this " :)

Sure, since POSIX never bothered to make EPEBKAC official. :P

(But in all seriousness I think it's sufficient if we burp back EIO for
a usage model that we don't support.)

--D

> 
> Thanks,
> Eryu
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner July 24, 2017, 10:02 p.m. UTC | #4
On Tue, Jul 25, 2017 at 01:51:25AM +0800, Eryu Guan wrote:
> On Mon, Jul 24, 2017 at 09:05:51AM -0700, Darrick J. Wong wrote:
> > On Mon, Jul 24, 2017 at 10:50:19PM +0800, Eryu Guan wrote:
> And I'm wondering what's the bigger problem of letting the dio path take
> MMAPLOCK too to serialize against mmap page faults? e.g.
> xfs_file_dio_aio_read() takes XFS_MMAPLOCK_SHRED and
> xfs_filemap_page_mkwrite() takes XFS_MMAPLOCK_EXCL.

direct IO calls get_user_pages() which can trigger page faults and
that means we can't hold any lock that is taken in the page fault
path.

It's the same reason we have the MMAPLOCK in the first place - we
can't use the IOLOCK in the page fault path because copy-in/copy-out
in the buffered IO path can trigger page faults, hence we need some
other lock that we can use to serialise page faults against extent
operations (like fallocate).

> > It looks like the end result of a dioread/mmapwrite collision is that
> > the dio reader gets -EIO.  Would it be better to return a short read?
> 
> Yes, right now dio read gets EIO in this case. I can't tell which one is
> better, if the whole dio vs mmap is not recommended, EIO seems to be a
> strong signal that indicates "don't do this " :)

$ man 2 open
.....
	.... Likewise, applications should avoid  mixing  mmap(2)
	of files with direct I/O to the same files.
....

That said, EIO is extremely unfriendly - a short read would be much
better as a properly written app will simply try to read the bit it
didn't get again, whereas EIO tends to be an indication of severe
failure to the application...

Cheers,

Dave.
Eryu Guan July 25, 2017, 7:53 a.m. UTC | #5
On Tue, Jul 25, 2017 at 08:02:05AM +1000, Dave Chinner wrote:
> On Tue, Jul 25, 2017 at 01:51:25AM +0800, Eryu Guan wrote:
> > On Mon, Jul 24, 2017 at 09:05:51AM -0700, Darrick J. Wong wrote:
> > > On Mon, Jul 24, 2017 at 10:50:19PM +0800, Eryu Guan wrote:
> > And I'm wondering what's the bigger problem of letting the dio path take
> > MMAPLOCK too to serialize against mmap page faults? e.g.
> > xfs_file_dio_aio_read() takes XFS_MMAPLOCK_SHRED and
> > xfs_filemap_page_mkwrite() takes XFS_MMAPLOCK_EXCL.
> 
> direct IO calls get_user_pages() which can trigger page faults and
> that means we can't hold any lock that is taken in the page fault
> path.

Ah, that's what I missed, thanks!

> 
> It's the same reason we have the MMAPLOCK in the first place - we
> can't use the IOLOCK in the page fault path because copy-in/copy-out
> in the buffered IO path can trigger page faults, hence we need some
> other lock that we can use to serialise page faults against extent
> operations (like fallocate).
> 
> > > It looks like the end result of a dioread/mmapwrite collision is that
> > > the dio reader gets -EIO.  Would it be better to return a short read?
> > 
> > Yes, right now dio read gets EIO in this case. I can't tell which one is
> > better, if the whole dio vs mmap is not recommended, EIO seems to be a
> > strong signal that indicates "don't do this " :)
> 
> $ man 2 open
> .....
> 	.... Likewise, applications should avoid  mixing  mmap(2)
> 	of files with direct I/O to the same files.
> ....
> 
> That said, EIO is extremely unfriendly - a short read would be much
> better as a properly written app will simply try to read the bit it
> didn't get again, whereas EIO tends to be an indication of severe
> failure to the application...

The problem is dio read could hit delalloc blocks on its first read and
return 0, how can we tell between a short read and a real EOF? I may
miss something again ..

Thanks,
Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner July 25, 2017, 11:43 p.m. UTC | #6
On Tue, Jul 25, 2017 at 03:53:48PM +0800, Eryu Guan wrote:
> On Tue, Jul 25, 2017 at 08:02:05AM +1000, Dave Chinner wrote:
> > On Tue, Jul 25, 2017 at 01:51:25AM +0800, Eryu Guan wrote:
> > > On Mon, Jul 24, 2017 at 09:05:51AM -0700, Darrick J. Wong wrote:
> > > > On Mon, Jul 24, 2017 at 10:50:19PM +0800, Eryu Guan wrote:
> > > And I'm wondering what's the bigger problem of letting the dio path take
> > > MMAPLOCK too to serialize against mmap page faults? e.g.
> > > xfs_file_dio_aio_read() takes XFS_MMAPLOCK_SHRED and
> > > xfs_filemap_page_mkwrite() takes XFS_MMAPLOCK_EXCL.
> > 
> > direct IO calls get_user_pages() which can trigger page faults and
> > that means we can't hold any lock that is taken in the page fault
> > path.
> 
> Ah, that's what I missed, thanks!
> 
> > 
> > It's the same reason we have the MMAPLOCK in the first place - we
> > can't use the IOLOCK in the page fault path because copy-in/copy-out
> > in the buffered IO path can trigger page faults, hence we need some
> > other lock that we can use to serialise page faults against extent
> > operations (like fallocate).
> > 
> > > > It looks like the end result of a dioread/mmapwrite collision is that
> > > > the dio reader gets -EIO.  Would it be better to return a short read?
> > > 
> > > Yes, right now dio read gets EIO in this case. I can't tell which one is
> > > better, if the whole dio vs mmap is not recommended, EIO seems to be a
> > > strong signal that indicates "don't do this " :)
> > 
> > $ man 2 open
> > .....
> > 	.... Likewise, applications should avoid  mixing  mmap(2)
> > 	of files with direct I/O to the same files.
> > ....
> > 
> > That said, EIO is extremely unfriendly - a short read would be much
> > better as a properly written app will simply try to read the bit it
> > didn't get again, whereas EIO tends to be an indication of severe
> > failure to the application...
> 
> The problem is dio read could hit delalloc blocks on its first read and
> return 0, how can we tell between a short read and a real EOF? I may
> miss something again ..

Well, stat() + lseek(fd, 0, SEEK_CUR) can be used to determine if
the read() started at the current EOF.

But, yeah, the read(2) man page says "return of zero means EOF" but
then muddies the waters by saying "a read with count = 0 will return
zero but have no effect". IOWs, a return of 0 does not /always/ mean
the read was called at/beyond EOF....

Yay?

POSIX defined behaviour can be a real problem at times, especially
when you are using methods that are specifically designed to work
around limitations of the POSIX specification. Like, for example,
applications using direct IO...

Cheers,

Dave.
diff mbox

Patch

--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -981,7 +981,7 @@  xfs_file_iomap_begin(
        if (XFS_FORCED_SHUTDOWN(mp))
                return -EIO;

-       if (((flags & (IOMAP_WRITE | IOMAP_DIRECT)) == IOMAP_WRITE) &&
+       if (((flags & (IOMAP_WRITE|IOMAP_DIRECT|IOMAP_FAULT)) == IOMAP_WRITE) &&
                        !IS_DAX(inode) && !xfs_get_extsz_hint(ip)) {
                /* Reserve delalloc blocks for regular writeback. */
                return xfs_file_iomap_begin_delay(inode, offset, length, flags,