
iomap: fix warning in iomap_write_delalloc_release()

Message ID 20231216115559.3823359-1-leo.lilong@huawei.com (mailing list archive)
State Deferred, archived
Series iomap: fix warning in iomap_write_delalloc_release()

Commit Message

Long Li Dec. 16, 2023, 11:55 a.m. UTC
While running an fsstress + drop caches test, we get the following warning:

 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 1003 at fs/iomap/buffered-io.c:1182 iomap_file_buffered_write_punch_delalloc+0x691/0x730
 Modules linked in:
 CPU: 2 PID: 1003 Comm: fsstress Not tainted 6.7.0-rc5-06945-g3ba9b31d6bf3-dirty #256
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
 RIP: 0010:iomap_file_buffered_write_punch_delalloc+0x691/0x730
 Code: d1 0b 01 0f 0b 48 83 05 14 a2 d1 0b 01 48 89 05 35 a1 d1 0b 49 39 ec 0f 8c 09 fb ff ff e9 b6 fd ff ff 48 83 05 df a1 d1 0b 01 <0f> 0b 48 83 05 dd a1 d1 0b 01 48 39 6c 24 10 7c c0 48 89 05 07 a1
 RSP: 0018:ffffc900005b7b08 EFLAGS: 00010202
 RAX: 0000000000000001 RBX: ffff888102363d40 RCX: 0000000000000001
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888108080000
 RBP: 0000000000050000 R08: ffff888108084eb8 R09: ffff888108084eb8
 R10: 000000000000005c R11: 0000000000000059 R12: 0000000000050000
 R13: ffffffff8c978ef0 R14: 0000000000050000 R15: 000000000005a000
 FS:  00007efc04c63b40(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007efc0375c000 CR3: 0000000105a4d000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  xfs_buffered_write_iomap_end+0x40/0xb0
  iomap_iter+0x8e/0x5f0
  iomap_file_buffered_write+0xa4/0x460
  xfs_file_buffered_write+0x156/0x3d0
  xfs_file_write_iter+0xb2/0x1c0
  do_iter_readv_writev+0x19b/0x1f0
  vfs_writev+0x114/0x4f0
  do_writev+0x7f/0x1c0
  __x64_sys_writev+0x24/0x30
  do_syscall_64+0x3f/0xe0
  entry_SYSCALL_64_after_hwframe+0x62/0x6a
 RIP: 0033:0x7efc03b06610
 Code: 73 01 c3 48 8b 0d 78 88 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d d9 e0 2c 00 00 75 10 b8 14 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e 8e 01 00 48 89 04 24
 RSP: 002b:00007ffdf8f426d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
 RAX: ffffffffffffffda RBX: 000000000000007a RCX: 00007efc03b06610
 RDX: 00000000000002c4 RSI: 00000000012f5580 RDI: 0000000000000003
 RBP: 0000000000000003 R08: 00000000012f53a0 R09: 0000000000000077
 R10: 000000000000007c R11: 0000000000000246 R12: 00000000000002c4
 R13: 00000000012dba50 R14: 00000000012f5580 R15: 0000000000000094

The warning was triggered by the following check in iomap_write_delalloc_release().
After analyzing the vmcore, I found that the warning fired because data_end
was equal to start_byte.

        WARN_ON_ONCE(data_end <= start_byte);

If a delay is added between the seek for data and the seek for hole in
iomap_write_delalloc_release(), the problem can be reproduced quickly.
The root cause is that a clean data page was dropped from the page cache
between the two seeks. As a result, data_end can end up equal to
start_byte.

buffered write                        drop cache
---------------------------           ---------------------------
xfs_buffered_write_iomap_end
  iomap_file_buffered_write_punch_delalloc
    iomap_write_delalloc_release
      start_byte = mapping_seek_hole_data(SEEK_DATA)

                                      drop_pagecache_sb
                                        invalidate_mapping_pages
                                          mapping_try_invalidate
                                            mapping_evict_folio
                                              remove_mapping

      data_end = mapping_seek_hole_data(SEEK_HOLE)
      WARN_ON_ONCE(data_end <= start_byte)
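
For reference, a heavily abridged sketch of the scan loop (reconstructed
from the patch context, error handling simplified, not verbatim kernel
code) showing the two seeks that bracket the race window:

	filemap_invalidate_lock(inode->i_mapping);
	while (start_byte < scan_end_byte) {
		loff_t data_end;

		/* Find the start of the next data range in the page cache. */
		start_byte = mapping_seek_hole_data(inode->i_mapping,
				start_byte, scan_end_byte, SEEK_DATA);
		if (start_byte == -ENXIO || start_byte == scan_end_byte)
			break;

		/*
		 * Race window: drop_caches can evict the clean folio found
		 * above before the SEEK_HOLE below runs.  Inserting an
		 * artificial delay (e.g. mdelay()) here makes the warning
		 * easy to reproduce.
		 */

		/* Find where that data range ends. */
		data_end = mapping_seek_hole_data(inode->i_mapping,
				start_byte, scan_end_byte, SEEK_HOLE);
		WARN_ON_ONCE(data_end <= start_byte);	/* fires in this race */
		WARN_ON_ONCE(data_end > scan_end_byte);

		error = iomap_write_delalloc_scan(inode, &punch_start_byte,
				start_byte, data_end, punch);
		if (error)
			break;

		start_byte = data_end;
	}
	filemap_invalidate_unlock(inode->i_mapping);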

In my investigation, I found that clean data pages were allocated and added
to the page cache when reading over the file's hole. Later, when a buffered
write goes into delalloc release, the seek for data finds the start offset
of those clean data pages. If the clean data pages are dropped at that
point, the subsequent seek for hole finds the same offset as the previous
seek.

During delalloc release, we punch out all the delalloc blocks in the given
range except for those that still have dirty data pending in the page cache.
If start_byte is equal to data_end after the data and hole seeks, the
delalloc scan returns immediately, and we can continue on to find the next
data range and scan it. This does not affect the range of delalloc blocks
that needs to be punched out.

Therefore, if start_byte equals data_end, just let the loop seek for data
again.

Fixes: f43dc4dc3eff ("iomap: buffered write failure should not truncate the page cache")
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/iomap/buffered-io.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Comments

Darrick J. Wong Dec. 16, 2023, 5:39 p.m. UTC | #1
On Sat, Dec 16, 2023 at 07:55:59PM +0800, Long Li wrote:
> While fsstress + drop cache test, we get following warning:
> 
>  ------------[ cut here ]------------
>  WARNING: CPU: 2 PID: 1003 at fs/iomap/buffered-io.c:1182 iomap_file_buffered_write_punch_delalloc+0x691/0x730
>  Modules linked in:
>  CPU: 2 PID: 1003 Comm: fsstress Not tainted 6.7.0-rc5-06945-g3ba9b31d6bf3-dirty #256
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
>  RIP: 0010:iomap_file_buffered_write_punch_delalloc+0x691/0x730
>  Code: d1 0b 01 0f 0b 48 83 05 14 a2 d1 0b 01 48 89 05 35 a1 d1 0b 49 39 ec 0f 8c 09 fb ff ff e9 b6 fd ff ff 48 83 05 df a1 d1 0b 01 <0f> 0b 48 83 05 dd a1 d1 0b 01 48 39 6c 24 10 7c c0 48 89 05 07 a1
>  RSP: 0018:ffffc900005b7b08 EFLAGS: 00010202
>  RAX: 0000000000000001 RBX: ffff888102363d40 RCX: 0000000000000001
>  RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888108080000
>  RBP: 0000000000050000 R08: ffff888108084eb8 R09: ffff888108084eb8
>  R10: 000000000000005c R11: 0000000000000059 R12: 0000000000050000
>  R13: ffffffff8c978ef0 R14: 0000000000050000 R15: 000000000005a000
>  FS:  00007efc04c63b40(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 00007efc0375c000 CR3: 0000000105a4d000 CR4: 00000000000006f0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  Call Trace:
>   <TASK>
>   xfs_buffered_write_iomap_end+0x40/0xb0
>   iomap_iter+0x8e/0x5f0
>   iomap_file_buffered_write+0xa4/0x460
>   xfs_file_buffered_write+0x156/0x3d0
>   xfs_file_write_iter+0xb2/0x1c0
>   do_iter_readv_writev+0x19b/0x1f0
>   vfs_writev+0x114/0x4f0
>   do_writev+0x7f/0x1c0
>   __x64_sys_writev+0x24/0x30
>   do_syscall_64+0x3f/0xe0
>   entry_SYSCALL_64_after_hwframe+0x62/0x6a
>  RIP: 0033:0x7efc03b06610
>  Code: 73 01 c3 48 8b 0d 78 88 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d d9 e0 2c 00 00 75 10 b8 14 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e 8e 01 00 48 89 04 24
>  RSP: 002b:00007ffdf8f426d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
>  RAX: ffffffffffffffda RBX: 000000000000007a RCX: 00007efc03b06610
>  RDX: 00000000000002c4 RSI: 00000000012f5580 RDI: 0000000000000003
>  RBP: 0000000000000003 R08: 00000000012f53a0 R09: 0000000000000077
>  R10: 000000000000007c R11: 0000000000000246 R12: 00000000000002c4
>  R13: 00000000012dba50 R14: 00000000012f5580 R15: 0000000000000094
> 
> The warning occurred in the following code of iomap_write_delalloc_release().
> After analyzing vmcore, I found that the reason for the warning is that
> data_end was equal to start_byte.
> 
>         WARN_ON_ONCE(data_end <= start_byte);
> 
> If some delay is added between seeking for data and seeking for hole
> in iomap_write_delalloc_release(), the problem can be reproduced quickly.
> The root cause of the problem is that clean data page was dropped between
> two seeking in the page cache. As a result, data_end may be equal to
> start_byte.
> 
> buffered write                        drop cache
> ---------------------------           ---------------------------
> xfs_buffered_write_iomap_end
>   iomap_file_buffered_write_punch_delalloc
>     iomap_write_delalloc_release
>       start_byte = mapping_seek_hole_data(SEEK_DATA)
> 
>                                       drop_pagecache_sb
>                                         invalidate_mapping_pages
>                                           mapping_try_invalidate
>                                             mapping_evict_folio
>                                               remove_mapping
> 
>       data_end = mapping_seek_hole_data(SEEK_HOLE)
>       WARN_ON_ONCE(data_end <= start_byte)
> 
> In my investigation, I found that clean data pages was alloced and added
> to page cache when reading the file's hole. After that, while buffered
> write and goes into delalloc release, we seek for data, it will find
> the start offset of the clean data pages. If the clean data pages was
> dropped, when we seek for hole, it will find the same offset as the
> previous seek.

iomap_write_delalloc_release holds the invalidation lock, shouldn't that
be sufficient to prevent folios from being dropped?

--D

> During delalloc release, we punch out all the delalloc blocks in the range
> given except for those that have dirty data still pending in the page cache.
> If the start_byte is equal to data_end after seeking data and hole, it will
> be returned directly in the delalloc scan, and we can continue to find the
> next data, and perform delalloc scan. It does not affect the range of
> delalloc block that need to be punched out.
> 
> Therefore, if start_byte equal to data_end, just let it seek for data
> again in the loop.
> 
> Fixes: f43dc4dc3eff ("iomap: buffered write failure should not truncate the page cache")
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
>  fs/iomap/buffered-io.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 093c4515b22a..45b54f3e6f47 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1179,7 +1179,17 @@ static int iomap_write_delalloc_release(struct inode *inode,
>  			error = data_end;
>  			goto out_unlock;
>  		}
> -		WARN_ON_ONCE(data_end <= start_byte);
> +
> +		/*
> +		 * Seek for data/hole in the page cache can race with drop
> +		 * cache, if data page was dropped between seek for data and
> +		 * hole, data_end may be equal to start_byte, just let it keep
> +		 * seeking.
> +		 */
> +		if (data_end == start_byte)
> +			continue;
> +
> +		WARN_ON_ONCE(data_end < start_byte);
>  		WARN_ON_ONCE(data_end > scan_end_byte);
>  
>  		error = iomap_write_delalloc_scan(inode, &punch_start_byte,
> -- 
> 2.31.1
> 
>
Brian Foster Dec. 18, 2023, 12:29 p.m. UTC | #2
On Sat, Dec 16, 2023 at 09:39:51AM -0800, Darrick J. Wong wrote:
> On Sat, Dec 16, 2023 at 07:55:59PM +0800, Long Li wrote:
> > While fsstress + drop cache test, we get following warning:
> > 
> >  ------------[ cut here ]------------
> >  WARNING: CPU: 2 PID: 1003 at fs/iomap/buffered-io.c:1182 iomap_file_buffered_write_punch_delalloc+0x691/0x730
> >  Modules linked in:
> >  CPU: 2 PID: 1003 Comm: fsstress Not tainted 6.7.0-rc5-06945-g3ba9b31d6bf3-dirty #256
> >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> >  RIP: 0010:iomap_file_buffered_write_punch_delalloc+0x691/0x730
> >  Code: d1 0b 01 0f 0b 48 83 05 14 a2 d1 0b 01 48 89 05 35 a1 d1 0b 49 39 ec 0f 8c 09 fb ff ff e9 b6 fd ff ff 48 83 05 df a1 d1 0b 01 <0f> 0b 48 83 05 dd a1 d1 0b 01 48 39 6c 24 10 7c c0 48 89 05 07 a1
> >  RSP: 0018:ffffc900005b7b08 EFLAGS: 00010202
> >  RAX: 0000000000000001 RBX: ffff888102363d40 RCX: 0000000000000001
> >  RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888108080000
> >  RBP: 0000000000050000 R08: ffff888108084eb8 R09: ffff888108084eb8
> >  R10: 000000000000005c R11: 0000000000000059 R12: 0000000000050000
> >  R13: ffffffff8c978ef0 R14: 0000000000050000 R15: 000000000005a000
> >  FS:  00007efc04c63b40(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
> >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >  CR2: 00007efc0375c000 CR3: 0000000105a4d000 CR4: 00000000000006f0
> >  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >  Call Trace:
> >   <TASK>
> >   xfs_buffered_write_iomap_end+0x40/0xb0
> >   iomap_iter+0x8e/0x5f0
> >   iomap_file_buffered_write+0xa4/0x460
> >   xfs_file_buffered_write+0x156/0x3d0
> >   xfs_file_write_iter+0xb2/0x1c0
> >   do_iter_readv_writev+0x19b/0x1f0
> >   vfs_writev+0x114/0x4f0
> >   do_writev+0x7f/0x1c0
> >   __x64_sys_writev+0x24/0x30
> >   do_syscall_64+0x3f/0xe0
> >   entry_SYSCALL_64_after_hwframe+0x62/0x6a
> >  RIP: 0033:0x7efc03b06610
> >  Code: 73 01 c3 48 8b 0d 78 88 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d d9 e0 2c 00 00 75 10 b8 14 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e 8e 01 00 48 89 04 24
> >  RSP: 002b:00007ffdf8f426d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
> >  RAX: ffffffffffffffda RBX: 000000000000007a RCX: 00007efc03b06610
> >  RDX: 00000000000002c4 RSI: 00000000012f5580 RDI: 0000000000000003
> >  RBP: 0000000000000003 R08: 00000000012f53a0 R09: 0000000000000077
> >  R10: 000000000000007c R11: 0000000000000246 R12: 00000000000002c4
> >  R13: 00000000012dba50 R14: 00000000012f5580 R15: 0000000000000094
> > 
> > The warning occurred in the following code of iomap_write_delalloc_release().
> > After analyzing vmcore, I found that the reason for the warning is that
> > data_end was equal to start_byte.
> > 
> >         WARN_ON_ONCE(data_end <= start_byte);
> > 
> > If some delay is added between seeking for data and seeking for hole
> > in iomap_write_delalloc_release(), the problem can be reproduced quickly.
> > The root cause of the problem is that clean data page was dropped between
> > two seeking in the page cache. As a result, data_end may be equal to
> > start_byte.
> > 
> > buffered write                        drop cache
> > ---------------------------           ---------------------------
> > xfs_buffered_write_iomap_end
> >   iomap_file_buffered_write_punch_delalloc
> >     iomap_write_delalloc_release
> >       start_byte = mapping_seek_hole_data(SEEK_DATA)
> > 
> >                                       drop_pagecache_sb
> >                                         invalidate_mapping_pages
> >                                           mapping_try_invalidate
> >                                             mapping_evict_folio
> >                                               remove_mapping
> > 
> >       data_end = mapping_seek_hole_data(SEEK_HOLE)
> >       WARN_ON_ONCE(data_end <= start_byte)
> > 
> > In my investigation, I found that clean data pages was alloced and added
> > to page cache when reading the file's hole. After that, while buffered
> > write and goes into delalloc release, we seek for data, it will find
> > the start offset of the clean data pages. If the clean data pages was
> > dropped, when we seek for hole, it will find the same offset as the
> > previous seek.
> 
> iomap_write_delalloc_release holds the invalidation lock, shouldn't that
> be sufficient to prevent folios from being dropped?
> 

I believe it's the other way around. The invalidate lock prevents new
folios from being added (i.e. to serialize an invalidation sequence).
IIRC it won't necessarily prevent folios from going away.

Brian

> --D
> 
> > During delalloc release, we punch out all the delalloc blocks in the range
> > given except for those that have dirty data still pending in the page cache.
> > If the start_byte is equal to data_end after seeking data and hole, it will
> > be returned directly in the delalloc scan, and we can continue to find the
> > next data, and perform delalloc scan. It does not affect the range of
> > delalloc block that need to be punched out.
> > 
> > Therefore, if start_byte equal to data_end, just let it seek for data
> > again in the loop.
> > 
> > Fixes: f43dc4dc3eff ("iomap: buffered write failure should not truncate the page cache")
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> >  fs/iomap/buffered-io.c | 12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 093c4515b22a..45b54f3e6f47 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1179,7 +1179,17 @@ static int iomap_write_delalloc_release(struct inode *inode,
> >  			error = data_end;
> >  			goto out_unlock;
> >  		}
> > -		WARN_ON_ONCE(data_end <= start_byte);
> > +
> > +		/*
> > +		 * Seek for data/hole in the page cache can race with drop
> > +		 * cache, if data page was dropped between seek for data and
> > +		 * hole, data_end may be equal to start_byte, just let it keep
> > +		 * seeking.
> > +		 */
> > +		if (data_end == start_byte)
> > +			continue;
> > +
> > +		WARN_ON_ONCE(data_end < start_byte);
> >  		WARN_ON_ONCE(data_end > scan_end_byte);
> >  
> >  		error = iomap_write_delalloc_scan(inode, &punch_start_byte,
> > -- 
> > 2.31.1
> > 
> > 
>
Long Li Dec. 18, 2023, 12:44 p.m. UTC | #3
On Sat, Dec 16, 2023 at 09:39:51AM -0800, Darrick J. Wong wrote:
> On Sat, Dec 16, 2023 at 07:55:59PM +0800, Long Li wrote:
> > While fsstress + drop cache test, we get following warning:
> > 
> >  ------------[ cut here ]------------
> >  WARNING: CPU: 2 PID: 1003 at fs/iomap/buffered-io.c:1182 iomap_file_buffered_write_punch_delalloc+0x691/0x730
> >  Modules linked in:
> >  CPU: 2 PID: 1003 Comm: fsstress Not tainted 6.7.0-rc5-06945-g3ba9b31d6bf3-dirty #256
> >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> >  RIP: 0010:iomap_file_buffered_write_punch_delalloc+0x691/0x730
> >  Code: d1 0b 01 0f 0b 48 83 05 14 a2 d1 0b 01 48 89 05 35 a1 d1 0b 49 39 ec 0f 8c 09 fb ff ff e9 b6 fd ff ff 48 83 05 df a1 d1 0b 01 <0f> 0b 48 83 05 dd a1 d1 0b 01 48 39 6c 24 10 7c c0 48 89 05 07 a1
> >  RSP: 0018:ffffc900005b7b08 EFLAGS: 00010202
> >  RAX: 0000000000000001 RBX: ffff888102363d40 RCX: 0000000000000001
> >  RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888108080000
> >  RBP: 0000000000050000 R08: ffff888108084eb8 R09: ffff888108084eb8
> >  R10: 000000000000005c R11: 0000000000000059 R12: 0000000000050000
> >  R13: ffffffff8c978ef0 R14: 0000000000050000 R15: 000000000005a000
> >  FS:  00007efc04c63b40(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
> >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >  CR2: 00007efc0375c000 CR3: 0000000105a4d000 CR4: 00000000000006f0
> >  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >  Call Trace:
> >   <TASK>
> >   xfs_buffered_write_iomap_end+0x40/0xb0
> >   iomap_iter+0x8e/0x5f0
> >   iomap_file_buffered_write+0xa4/0x460
> >   xfs_file_buffered_write+0x156/0x3d0
> >   xfs_file_write_iter+0xb2/0x1c0
> >   do_iter_readv_writev+0x19b/0x1f0
> >   vfs_writev+0x114/0x4f0
> >   do_writev+0x7f/0x1c0
> >   __x64_sys_writev+0x24/0x30
> >   do_syscall_64+0x3f/0xe0
> >   entry_SYSCALL_64_after_hwframe+0x62/0x6a
> >  RIP: 0033:0x7efc03b06610
> >  Code: 73 01 c3 48 8b 0d 78 88 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d d9 e0 2c 00 00 75 10 b8 14 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e 8e 01 00 48 89 04 24
> >  RSP: 002b:00007ffdf8f426d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
> >  RAX: ffffffffffffffda RBX: 000000000000007a RCX: 00007efc03b06610
> >  RDX: 00000000000002c4 RSI: 00000000012f5580 RDI: 0000000000000003
> >  RBP: 0000000000000003 R08: 00000000012f53a0 R09: 0000000000000077
> >  R10: 000000000000007c R11: 0000000000000246 R12: 00000000000002c4
> >  R13: 00000000012dba50 R14: 00000000012f5580 R15: 0000000000000094
> > 
> > The warning occurred in the following code of iomap_write_delalloc_release().
> > After analyzing vmcore, I found that the reason for the warning is that
> > data_end was equal to start_byte.
> > 
> >         WARN_ON_ONCE(data_end <= start_byte);
> > 
> > If some delay is added between seeking for data and seeking for hole
> > in iomap_write_delalloc_release(), the problem can be reproduced quickly.
> > The root cause of the problem is that clean data page was dropped between
> > two seeking in the page cache. As a result, data_end may be equal to
> > start_byte.
> > 
> > buffered write                        drop cache
> > ---------------------------           ---------------------------
> > xfs_buffered_write_iomap_end
> >   iomap_file_buffered_write_punch_delalloc
> >     iomap_write_delalloc_release
> >       start_byte = mapping_seek_hole_data(SEEK_DATA)
> > 
> >                                       drop_pagecache_sb
> >                                         invalidate_mapping_pages
> >                                           mapping_try_invalidate
> >                                             mapping_evict_folio
> >                                               remove_mapping
> > 
> >       data_end = mapping_seek_hole_data(SEEK_HOLE)
> >       WARN_ON_ONCE(data_end <= start_byte)
> > 
> > In my investigation, I found that clean data pages was alloced and added
> > to page cache when reading the file's hole. After that, while buffered
> > write and goes into delalloc release, we seek for data, it will find
> > the start offset of the clean data pages. If the clean data pages was
> > dropped, when we seek for hole, it will find the same offset as the
> > previous seek.
> 
> iomap_write_delalloc_release holds the invalidation lock, shouldn't that
> be sufficient to prevent folios from being dropped?
> 

Holding the invalidate_lock in iomap_write_delalloc_release() can't
prevent folios from being dropped. The invalidate_lock was added to
block races between page cache invalidation (in the truncate / hole punch
paths) and page cache filling functions. It only protects the addition of
folios to the page cache for page faults / reads / readahead.

Thanks,
Long Li

> --D
> 
> > During delalloc release, we punch out all the delalloc blocks in the range
> > given except for those that have dirty data still pending in the page cache.
> > If the start_byte is equal to data_end after seeking data and hole, it will
> > be returned directly in the delalloc scan, and we can continue to find the
> > next data, and perform delalloc scan. It does not affect the range of
> > delalloc block that need to be punched out.
> > 
> > Therefore, if start_byte equal to data_end, just let it seek for data
> > again in the loop.
> > 
> > Fixes: f43dc4dc3eff ("iomap: buffered write failure should not truncate the page cache")
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> >  fs/iomap/buffered-io.c | 12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 093c4515b22a..45b54f3e6f47 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1179,7 +1179,17 @@ static int iomap_write_delalloc_release(struct inode *inode,
> >  			error = data_end;
> >  			goto out_unlock;
> >  		}
> > -		WARN_ON_ONCE(data_end <= start_byte);
> > +
> > +		/*
> > +		 * Seek for data/hole in the page cache can race with drop
> > +		 * cache, if data page was dropped between seek for data and
> > +		 * hole, data_end may be equal to start_byte, just let it keep
> > +		 * seeking.
> > +		 */
> > +		if (data_end == start_byte)
> > +			continue;
> > +
> > +		WARN_ON_ONCE(data_end < start_byte);
> >  		WARN_ON_ONCE(data_end > scan_end_byte);
> >  
> >  		error = iomap_write_delalloc_scan(inode, &punch_start_byte,
> > -- 
> > 2.31.1
> > 
> > 
>
Dave Chinner Dec. 18, 2023, 8:21 p.m. UTC | #4
[cc Jan Kara]

On Mon, Dec 18, 2023 at 07:29:34AM -0500, Brian Foster wrote:
> On Sat, Dec 16, 2023 at 09:39:51AM -0800, Darrick J. Wong wrote:
> > On Sat, Dec 16, 2023 at 07:55:59PM +0800, Long Li wrote:
> > > While fsstress + drop cache test, we get following warning:
> > > 
> > >  ------------[ cut here ]------------
> > >  WARNING: CPU: 2 PID: 1003 at fs/iomap/buffered-io.c:1182 iomap_file_buffered_write_punch_delalloc+0x691/0x730
> > >  Modules linked in:
> > >  CPU: 2 PID: 1003 Comm: fsstress Not tainted 6.7.0-rc5-06945-g3ba9b31d6bf3-dirty #256
> > >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> > >  RIP: 0010:iomap_file_buffered_write_punch_delalloc+0x691/0x730
> > >  Code: d1 0b 01 0f 0b 48 83 05 14 a2 d1 0b 01 48 89 05 35 a1 d1 0b 49 39 ec 0f 8c 09 fb ff ff e9 b6 fd ff ff 48 83 05 df a1 d1 0b 01 <0f> 0b 48 83 05 dd a1 d1 0b 01 48 39 6c 24 10 7c c0 48 89 05 07 a1
> > >  RSP: 0018:ffffc900005b7b08 EFLAGS: 00010202
> > >  RAX: 0000000000000001 RBX: ffff888102363d40 RCX: 0000000000000001
> > >  RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888108080000
> > >  RBP: 0000000000050000 R08: ffff888108084eb8 R09: ffff888108084eb8
> > >  R10: 000000000000005c R11: 0000000000000059 R12: 0000000000050000
> > >  R13: ffffffff8c978ef0 R14: 0000000000050000 R15: 000000000005a000
> > >  FS:  00007efc04c63b40(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
> > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >  CR2: 00007efc0375c000 CR3: 0000000105a4d000 CR4: 00000000000006f0
> > >  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > >  Call Trace:
> > >   <TASK>
> > >   xfs_buffered_write_iomap_end+0x40/0xb0
> > >   iomap_iter+0x8e/0x5f0
> > >   iomap_file_buffered_write+0xa4/0x460
> > >   xfs_file_buffered_write+0x156/0x3d0
> > >   xfs_file_write_iter+0xb2/0x1c0
> > >   do_iter_readv_writev+0x19b/0x1f0
> > >   vfs_writev+0x114/0x4f0
> > >   do_writev+0x7f/0x1c0
> > >   __x64_sys_writev+0x24/0x30
> > >   do_syscall_64+0x3f/0xe0
> > >   entry_SYSCALL_64_after_hwframe+0x62/0x6a
> > >  RIP: 0033:0x7efc03b06610
> > >  Code: 73 01 c3 48 8b 0d 78 88 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d d9 e0 2c 00 00 75 10 b8 14 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e 8e 01 00 48 89 04 24
> > >  RSP: 002b:00007ffdf8f426d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
> > >  RAX: ffffffffffffffda RBX: 000000000000007a RCX: 00007efc03b06610
> > >  RDX: 00000000000002c4 RSI: 00000000012f5580 RDI: 0000000000000003
> > >  RBP: 0000000000000003 R08: 00000000012f53a0 R09: 0000000000000077
> > >  R10: 000000000000007c R11: 0000000000000246 R12: 00000000000002c4
> > >  R13: 00000000012dba50 R14: 00000000012f5580 R15: 0000000000000094
> > > 
> > > The warning occurred in the following code of iomap_write_delalloc_release().
> > > After analyzing vmcore, I found that the reason for the warning is that
> > > data_end was equal to start_byte.
> > > 
> > >         WARN_ON_ONCE(data_end <= start_byte);
> > > 
> > > If some delay is added between seeking for data and seeking for hole
> > > in iomap_write_delalloc_release(), the problem can be reproduced quickly.
> > > The root cause of the problem is that clean data page was dropped between
> > > two seeking in the page cache. As a result, data_end may be equal to
> > > start_byte.
> > > 
> > > buffered write                        drop cache
> > > ---------------------------           ---------------------------
> > > xfs_buffered_write_iomap_end
> > >   iomap_file_buffered_write_punch_delalloc
> > >     iomap_write_delalloc_release
> > >       start_byte = mapping_seek_hole_data(SEEK_DATA)
> > > 
> > >                                       drop_pagecache_sb
> > >                                         invalidate_mapping_pages
> > >                                           mapping_try_invalidate
> > >                                             mapping_evict_folio
> > >                                               remove_mapping
> > > 
> > >       data_end = mapping_seek_hole_data(SEEK_HOLE)
> > >       WARN_ON_ONCE(data_end <= start_byte)
> > > 
> > > In my investigation, I found that clean data pages was alloced and added
> > > to page cache when reading the file's hole. After that, while buffered
> > > write and goes into delalloc release, we seek for data, it will find
> > > the start offset of the clean data pages. If the clean data pages was
> > > dropped, when we seek for hole, it will find the same offset as the
> > > previous seek.
> > 
> > iomap_write_delalloc_release holds the invalidation lock, shouldn't that
> > be sufficient to prevent folios from being dropped?
> > 
> 
> I believe it's the other way around. The invalidate lock prevents new
> folios from being added (i.e. to serialize an invalidation sequence).
> IIRC it won't necessarily prevent folios from going away.

It also serialises operations doing invalidations against other
operations doing invalidations (e.g. truncate, hole punch, direct
IO, etc) that require the invalidation lock to be held exclusively.

Having looked a little deeper, it appears we have a large number of
invalidation paths that don't actually hold the invalidation lock at
all. drop_pagecache_sb() is just one of them.  Is that a bug?

If not, then this code needs to prevent any mapping tree change from
occurring whilst it is walking the range of the write that needs to
be scanned. The write has partially failed at this point, and we're
trying to clean up the mess the partial write has left behind. We
really need exclusive access to that range of the mapping whilst
cleanup is done.

IOWs, if the invalidation lock is not sufficient to serialise
against other invalidations being run concurrently, what mapping
lock can we hold whilst we call into the filesystem, take sleeping
FS locks and modify the extent tree that will serialise against
other invalidations?

-Dave.
Darrick J. Wong Dec. 18, 2023, 10:12 p.m. UTC | #5
[cc willy and fsdevel]

On Tue, Dec 19, 2023 at 07:21:37AM +1100, Dave Chinner wrote:
> [cc Jan Kara]
> 
> On Mon, Dec 18, 2023 at 07:29:34AM -0500, Brian Foster wrote:
> > On Sat, Dec 16, 2023 at 09:39:51AM -0800, Darrick J. Wong wrote:
> > > On Sat, Dec 16, 2023 at 07:55:59PM +0800, Long Li wrote:
> > > > While fsstress + drop cache test, we get following warning:
> > > > 
> > > >  ------------[ cut here ]------------
> > > >  WARNING: CPU: 2 PID: 1003 at fs/iomap/buffered-io.c:1182 iomap_file_buffered_write_punch_delalloc+0x691/0x730
> > > >  Modules linked in:
> > > >  CPU: 2 PID: 1003 Comm: fsstress Not tainted 6.7.0-rc5-06945-g3ba9b31d6bf3-dirty #256
> > > >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> > > >  RIP: 0010:iomap_file_buffered_write_punch_delalloc+0x691/0x730
> > > >  Code: d1 0b 01 0f 0b 48 83 05 14 a2 d1 0b 01 48 89 05 35 a1 d1 0b 49 39 ec 0f 8c 09 fb ff ff e9 b6 fd ff ff 48 83 05 df a1 d1 0b 01 <0f> 0b 48 83 05 dd a1 d1 0b 01 48 39 6c 24 10 7c c0 48 89 05 07 a1
> > > >  RSP: 0018:ffffc900005b7b08 EFLAGS: 00010202
> > > >  RAX: 0000000000000001 RBX: ffff888102363d40 RCX: 0000000000000001
> > > >  RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888108080000
> > > >  RBP: 0000000000050000 R08: ffff888108084eb8 R09: ffff888108084eb8
> > > >  R10: 000000000000005c R11: 0000000000000059 R12: 0000000000050000
> > > >  R13: ffffffff8c978ef0 R14: 0000000000050000 R15: 000000000005a000
> > > >  FS:  00007efc04c63b40(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
> > > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >  CR2: 00007efc0375c000 CR3: 0000000105a4d000 CR4: 00000000000006f0
> > > >  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > >  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > >  Call Trace:
> > > >   <TASK>
> > > >   xfs_buffered_write_iomap_end+0x40/0xb0
> > > >   iomap_iter+0x8e/0x5f0
> > > >   iomap_file_buffered_write+0xa4/0x460
> > > >   xfs_file_buffered_write+0x156/0x3d0
> > > >   xfs_file_write_iter+0xb2/0x1c0
> > > >   do_iter_readv_writev+0x19b/0x1f0
> > > >   vfs_writev+0x114/0x4f0
> > > >   do_writev+0x7f/0x1c0
> > > >   __x64_sys_writev+0x24/0x30
> > > >   do_syscall_64+0x3f/0xe0
> > > >   entry_SYSCALL_64_after_hwframe+0x62/0x6a
> > > >  RIP: 0033:0x7efc03b06610
> > > >  Code: 73 01 c3 48 8b 0d 78 88 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d d9 e0 2c 00 00 75 10 b8 14 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e 8e 01 00 48 89 04 24
> > > >  RSP: 002b:00007ffdf8f426d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
> > > >  RAX: ffffffffffffffda RBX: 000000000000007a RCX: 00007efc03b06610
> > > >  RDX: 00000000000002c4 RSI: 00000000012f5580 RDI: 0000000000000003
> > > >  RBP: 0000000000000003 R08: 00000000012f53a0 R09: 0000000000000077
> > > >  R10: 000000000000007c R11: 0000000000000246 R12: 00000000000002c4
> > > >  R13: 00000000012dba50 R14: 00000000012f5580 R15: 0000000000000094
> > > > 
> > > > The warning occurred in the following code of iomap_write_delalloc_release().
> > > > After analyzing vmcore, I found that the reason for the warning is that
> > > > data_end was equal to start_byte.
> > > > 
> > > >         WARN_ON_ONCE(data_end <= start_byte);
> > > > 
> > > > If some delay is added between seeking for data and seeking for hole
> > > > in iomap_write_delalloc_release(), the problem can be reproduced quickly.
> > > > The root cause of the problem is that clean data page was dropped between
> > > > two seeking in the page cache. As a result, data_end may be equal to
> > > > start_byte.
> > > > 
> > > > buffered write                        drop cache
> > > > ---------------------------           ---------------------------
> > > > xfs_buffered_write_iomap_end
> > > >   iomap_file_buffered_write_punch_delalloc
> > > >     iomap_write_delalloc_release
> > > >       start_byte = mapping_seek_hole_data(SEEK_DATA)
> > > > 
> > > >                                       drop_pagecache_sb
> > > >                                         invalidate_mapping_pages
> > > >                                           mapping_try_invalidate
> > > >                                             mapping_evict_folio
> > > >                                               remove_mapping
> > > > 
> > > >       data_end = mapping_seek_hole_data(SEEK_HOLE)
> > > >       WARN_ON_ONCE(data_end <= start_byte)
> > > > 
> > > > In my investigation, I found that clean data pages was alloced and added
> > > > to page cache when reading the file's hole. After that, while buffered
> > > > write and goes into delalloc release, we seek for data, it will find
> > > > the start offset of the clean data pages. If the clean data pages was
> > > > dropped, when we seek for hole, it will find the same offset as the
> > > > previous seek.
> > > 
> > > iomap_write_delalloc_release holds the invalidation lock, shouldn't that
> > > be sufficient to prevent folios from being dropped?
> > > 
> > 
> > I believe it's the other way around. The invalidate lock prevents new
> > folios from being added (i.e. to serialize an invalidation sequence).
> > IIRC it won't necessarily prevent folios from going away.
> 
> It also serialises operations doing invalidations against other
> operations doing invalidations (e.g. truncate, hole punch, direct
> IO, etc) that require the invalidation lock to be held exclusively.
> 
> Having looked a little deeper, it appears we have a large number of
> invalidation paths that don't actually hold the invalidation lock at
> all. drop_pagecache_sb() is just one of them.  Is that a bug?

Clearly I think so. ;)

> If not, then this code needs to prevent any mapping tree change from
> occurring whilst it is walking the range of the write that needs to
> be scanned. The write has partially failed at this point, and we're
> trying to clean up the mess the partial write has left behind. We
> really need exclusive access to that range of the mapping whilst
> cleanup is done.
> 
> IOWs, if the invalidation lock is not sufficient to serialise
> against other invalidations being run concurrently, what mapping
> lock can we hold whilst we call into the filesystem, take sleeping
> FS locks and modify the extent tree that will serialise against
> other invalidations?

I don't know of any other locks that would provide that; also, I thought
invalidate_lock (fka MMAPLOCK_EXCL) was what xfs was supposed to take to
serialize against folios getting ripped out of the mapping and page
faults.

Though now that I think about this further -- have drop_caches and xfs
post-write error delalloc punching always lacked coordination?  I think
for most pagecache operations that old "lock and recheck mapping/offset"
strategy works ok; it's just here where we might've been touching
multiple folios that things go awry...?

--D

> -Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Dec. 18, 2023, 10:45 p.m. UTC | #6
On Mon, Dec 18, 2023 at 02:12:00PM -0800, Darrick J. Wong wrote:
> [cc willy and fsdevel]
> 
> On Tue, Dec 19, 2023 at 07:21:37AM +1100, Dave Chinner wrote:
> > [cc Jan Kara]
> > 
> > On Mon, Dec 18, 2023 at 07:29:34AM -0500, Brian Foster wrote:
> > > On Sat, Dec 16, 2023 at 09:39:51AM -0800, Darrick J. Wong wrote:
> > > > On Sat, Dec 16, 2023 at 07:55:59PM +0800, Long Li wrote:
> > > > > While fsstress + drop cache test, we get following warning:
> > > > > 
> > > > >  ------------[ cut here ]------------
> > > > >  WARNING: CPU: 2 PID: 1003 at fs/iomap/buffered-io.c:1182 iomap_file_buffered_write_punch_delalloc+0x691/0x730
> > > > >  Modules linked in:
> > > > >  CPU: 2 PID: 1003 Comm: fsstress Not tainted 6.7.0-rc5-06945-g3ba9b31d6bf3-dirty #256
> > > > >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> > > > >  RIP: 0010:iomap_file_buffered_write_punch_delalloc+0x691/0x730
> > > > >  Code: d1 0b 01 0f 0b 48 83 05 14 a2 d1 0b 01 48 89 05 35 a1 d1 0b 49 39 ec 0f 8c 09 fb ff ff e9 b6 fd ff ff 48 83 05 df a1 d1 0b 01 <0f> 0b 48 83 05 dd a1 d1 0b 01 48 39 6c 24 10 7c c0 48 89 05 07 a1
> > > > >  RSP: 0018:ffffc900005b7b08 EFLAGS: 00010202
> > > > >  RAX: 0000000000000001 RBX: ffff888102363d40 RCX: 0000000000000001
> > > > >  RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888108080000
> > > > >  RBP: 0000000000050000 R08: ffff888108084eb8 R09: ffff888108084eb8
> > > > >  R10: 000000000000005c R11: 0000000000000059 R12: 0000000000050000
> > > > >  R13: ffffffff8c978ef0 R14: 0000000000050000 R15: 000000000005a000
> > > > >  FS:  00007efc04c63b40(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
> > > > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > >  CR2: 00007efc0375c000 CR3: 0000000105a4d000 CR4: 00000000000006f0
> > > > >  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > >  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > >  Call Trace:
> > > > >   <TASK>
> > > > >   xfs_buffered_write_iomap_end+0x40/0xb0
> > > > >   iomap_iter+0x8e/0x5f0
> > > > >   iomap_file_buffered_write+0xa4/0x460
> > > > >   xfs_file_buffered_write+0x156/0x3d0
> > > > >   xfs_file_write_iter+0xb2/0x1c0
> > > > >   do_iter_readv_writev+0x19b/0x1f0
> > > > >   vfs_writev+0x114/0x4f0
> > > > >   do_writev+0x7f/0x1c0
> > > > >   __x64_sys_writev+0x24/0x30
> > > > >   do_syscall_64+0x3f/0xe0
> > > > >   entry_SYSCALL_64_after_hwframe+0x62/0x6a
> > > > >  RIP: 0033:0x7efc03b06610
> > > > >  Code: 73 01 c3 48 8b 0d 78 88 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d d9 e0 2c 00 00 75 10 b8 14 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e 8e 01 00 48 89 04 24
> > > > >  RSP: 002b:00007ffdf8f426d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
> > > > >  RAX: ffffffffffffffda RBX: 000000000000007a RCX: 00007efc03b06610
> > > > >  RDX: 00000000000002c4 RSI: 00000000012f5580 RDI: 0000000000000003
> > > > >  RBP: 0000000000000003 R08: 00000000012f53a0 R09: 0000000000000077
> > > > >  R10: 000000000000007c R11: 0000000000000246 R12: 00000000000002c4
> > > > >  R13: 00000000012dba50 R14: 00000000012f5580 R15: 0000000000000094
> > > > > 
> > > > > The warning occurred in the following code of iomap_write_delalloc_release().
> > > > > After analyzing vmcore, I found that the reason for the warning is that
> > > > > data_end was equal to start_byte.
> > > > > 
> > > > >         WARN_ON_ONCE(data_end <= start_byte);
> > > > > 
> > > > > If some delay is added between seeking for data and seeking for hole
> > > > > in iomap_write_delalloc_release(), the problem can be reproduced quickly.
> > > > > The root cause of the problem is that clean data page was dropped between
> > > > > two seeking in the page cache. As a result, data_end may be equal to
> > > > > start_byte.
> > > > > 
> > > > > buffered write                        drop cache
> > > > > ---------------------------           ---------------------------
> > > > > xfs_buffered_write_iomap_end
> > > > >   iomap_file_buffered_write_punch_delalloc
> > > > >     iomap_write_delalloc_release
> > > > >       start_byte = mapping_seek_hole_data(SEEK_DATA)
> > > > > 
> > > > >                                       drop_pagecache_sb
> > > > >                                         invalidate_mapping_pages
> > > > >                                           mapping_try_invalidate
> > > > >                                             mapping_evict_folio
> > > > >                                               remove_mapping
> > > > > 
> > > > >       data_end = mapping_seek_hole_data(SEEK_HOLE)
> > > > >       WARN_ON_ONCE(data_end <= start_byte)
> > > > > 
> > > > > In my investigation, I found that clean data pages was alloced and added
> > > > > to page cache when reading the file's hole. After that, while buffered
> > > > > write and goes into delalloc release, we seek for data, it will find
> > > > > the start offset of the clean data pages. If the clean data pages was
> > > > > dropped, when we seek for hole, it will find the same offset as the
> > > > > previous seek.
> > > > 
> > > > iomap_write_delalloc_release holds the invalidation lock, shouldn't that
> > > > be sufficient to prevent folios from being dropped?
> > > > 
> > > 
> > > I believe it's the other way around. The invalidate lock prevents new
> > > folios from being added (i.e. to serialize an invalidation sequence).
> > > IIRC it won't necessarily prevent folios from going away.
> > 
> > It also serialises operations doing invalidations against other
> > operations doing invalidations (e.g. truncate, hole punch, direct
> > IO, etc) that require the invalidation lock to be held exclusively.
> > 
> > Having looked a little deeper, it appears we have a large number of
> > invalidation paths that don't actually hold the invalidation lock at
> > all. drop_pagecache_sb() is just one of them.  Is that a bug?
> 
> Clearly I think so. ;)

+1

> > If not, then this code needs to prevent any mapping tree change from
> > occurring whilst it is walking the range of the write that needs to
> > be scanned. The write has partially failed at this point, and we're
> > trying to clean up the mess the partial write has left behind. We
> > really need exclusive access to that range of the mapping whilst
> > cleanup is done.
> > 
> > IOWs, if the invalidation lock is not sufficient to serialise
> > against other invalidations being run concurrently, what mapping
> > lock can we hold whilst we call into the filesystem, take sleeping
> > FS locks and modify the extent tree that will serialise against
> > other invalidations?
> 
> I don't know of any other locks that would provide that; also, I thought
> invalidate_lock (fka MMAPLOCK_EXCL) was what xfs was supposed to take to
> serialize against folios getting ripped out of the mapping and page
> faults.

Well, originally it was just fs operations vs page faults, but it
also had the (XFS internal) property of serialising operations against
other fs-driven invalidations and breaking layout leases (e.g.
xfs_break_dax_layouts(), xfs_ilock2_io_mmap() from
xfs_reflink_remap_prep(), etc.)

Details are a bit fuzzy, but ISTR that I wrote this iomap code with
the assumption that holding the invalidate_lock also serialises
against other invalidation operations - it was originally based on
the XFS_MMAPLOCK_EXCL being held, so that likely bled over.

> Though now that I think about this further -- have dropcaches and xfs
> post-write error delalloc punching always lacked coordination?

I don't think drop_caches has proper serialisation against anything.

If we go back to before we had background inode inactivation, drop
caches could deadlock against filesystem freeze because it calls the
shrinker, which runs transactions whilst holding the s_umount lock, so
thaw can't unfreeze the filesystem. Any filesystem that does an
operation in ->evict or ->destroy_inode, as called from memory reclaim
context, that can be blocked by a frozen fs state can trip over this.

In reality, drop_caches was only supposed to be a debug-only
interface and so it's never really been designed/coded to be
properly safe against anything.  Hence I think this is a drop_caches
bug more than it is an iomap bug...

> I think
> for most pagecache operations that old "lock and recheck mapping/offset"
> strategy works ok; it's just here where we might've been touching
> multiple folios that things go awry...?

That old mechanism never really worked reliably. Just look
at the constant stream of "we broke truncate" bugs that we used to
have to deal with. They've dropped right off now that most
filesystems use the invalidate lock for truncate serialisation
rather than the old "lock, check mapping, check i_size, unlock,
retry" loop that people were also struggling to get right.

-Dave
Jan Kara Dec. 19, 2023, 1:14 p.m. UTC | #7
On Tue 19-12-23 07:21:37, Dave Chinner wrote:
> [cc Jan Kara]
> 
> On Mon, Dec 18, 2023 at 07:29:34AM -0500, Brian Foster wrote:
> > On Sat, Dec 16, 2023 at 09:39:51AM -0800, Darrick J. Wong wrote:
> > > On Sat, Dec 16, 2023 at 07:55:59PM +0800, Long Li wrote:
> > > > While fsstress + drop cache test, we get following warning:
> > > > 
> > > >  ------------[ cut here ]------------
> > > >  WARNING: CPU: 2 PID: 1003 at fs/iomap/buffered-io.c:1182 iomap_file_buffered_write_punch_delalloc+0x691/0x730
> > > >  Modules linked in:
> > > >  CPU: 2 PID: 1003 Comm: fsstress Not tainted 6.7.0-rc5-06945-g3ba9b31d6bf3-dirty #256
> > > >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> > > >  RIP: 0010:iomap_file_buffered_write_punch_delalloc+0x691/0x730
> > > >  Code: d1 0b 01 0f 0b 48 83 05 14 a2 d1 0b 01 48 89 05 35 a1 d1 0b 49 39 ec 0f 8c 09 fb ff ff e9 b6 fd ff ff 48 83 05 df a1 d1 0b 01 <0f> 0b 48 83 05 dd a1 d1 0b 01 48 39 6c 24 10 7c c0 48 89 05 07 a1
> > > >  RSP: 0018:ffffc900005b7b08 EFLAGS: 00010202
> > > >  RAX: 0000000000000001 RBX: ffff888102363d40 RCX: 0000000000000001
> > > >  RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff888108080000
> > > >  RBP: 0000000000050000 R08: ffff888108084eb8 R09: ffff888108084eb8
> > > >  R10: 000000000000005c R11: 0000000000000059 R12: 0000000000050000
> > > >  R13: ffffffff8c978ef0 R14: 0000000000050000 R15: 000000000005a000
> > > >  FS:  00007efc04c63b40(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
> > > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >  CR2: 00007efc0375c000 CR3: 0000000105a4d000 CR4: 00000000000006f0
> > > >  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > >  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > >  Call Trace:
> > > >   <TASK>
> > > >   xfs_buffered_write_iomap_end+0x40/0xb0
> > > >   iomap_iter+0x8e/0x5f0
> > > >   iomap_file_buffered_write+0xa4/0x460
> > > >   xfs_file_buffered_write+0x156/0x3d0
> > > >   xfs_file_write_iter+0xb2/0x1c0
> > > >   do_iter_readv_writev+0x19b/0x1f0
> > > >   vfs_writev+0x114/0x4f0
> > > >   do_writev+0x7f/0x1c0
> > > >   __x64_sys_writev+0x24/0x30
> > > >   do_syscall_64+0x3f/0xe0
> > > >   entry_SYSCALL_64_after_hwframe+0x62/0x6a
> > > >  RIP: 0033:0x7efc03b06610
> > > >  Code: 73 01 c3 48 8b 0d 78 88 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d d9 e0 2c 00 00 75 10 b8 14 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 5e 8e 01 00 48 89 04 24
> > > >  RSP: 002b:00007ffdf8f426d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
> > > >  RAX: ffffffffffffffda RBX: 000000000000007a RCX: 00007efc03b06610
> > > >  RDX: 00000000000002c4 RSI: 00000000012f5580 RDI: 0000000000000003
> > > >  RBP: 0000000000000003 R08: 00000000012f53a0 R09: 0000000000000077
> > > >  R10: 000000000000007c R11: 0000000000000246 R12: 00000000000002c4
> > > >  R13: 00000000012dba50 R14: 00000000012f5580 R15: 0000000000000094
> > > > 
> > > > The warning occurred in the following code of iomap_write_delalloc_release().
> > > > After analyzing vmcore, I found that the reason for the warning is that
> > > > data_end was equal to start_byte.
> > > > 
> > > >         WARN_ON_ONCE(data_end <= start_byte);
> > > > 
> > > > If some delay is added between seeking for data and seeking for hole
> > > > in iomap_write_delalloc_release(), the problem can be reproduced quickly.
> > > > The root cause of the problem is that clean data page was dropped between
> > > > two seeking in the page cache. As a result, data_end may be equal to
> > > > start_byte.
> > > > 
> > > > buffered write                        drop cache
> > > > ---------------------------           ---------------------------
> > > > xfs_buffered_write_iomap_end
> > > >   iomap_file_buffered_write_punch_delalloc
> > > >     iomap_write_delalloc_release
> > > >       start_byte = mapping_seek_hole_data(SEEK_DATA)
> > > > 
> > > >                                       drop_pagecache_sb
> > > >                                         invalidate_mapping_pages
> > > >                                           mapping_try_invalidate
> > > >                                             mapping_evict_folio
> > > >                                               remove_mapping
> > > > 
> > > >       data_end = mapping_seek_hole_data(SEEK_HOLE)
> > > >       WARN_ON_ONCE(data_end <= start_byte)
> > > > 
> > > > In my investigation, I found that clean data pages was alloced and added
> > > > to page cache when reading the file's hole. After that, while buffered
> > > > write and goes into delalloc release, we seek for data, it will find
> > > > the start offset of the clean data pages. If the clean data pages was
> > > > dropped, when we seek for hole, it will find the same offset as the
> > > > previous seek.
> > > 
> > > iomap_write_delalloc_release holds the invalidation lock, shouldn't that
> > > be sufficient to prevent folios from being dropped?
> > > 
> > 
> > I believe it's the other way around. The invalidate lock prevents new
> > folios from being added (i.e. to serialize an invalidation sequence).
> > IIRC it won't necessarily prevent folios from going away.
> 
> It also serialises operations doing invalidations against other
> operations doing invalidations (e.g. truncate, hole punch, direct
> IO, etc) that require the invalidation lock to be held exclusively.
> 
> Having looked a little deeper, it appears we have a large number of
> invalidation paths that don't actually hold the invalidation lock at
> all. drop_pagecache_sb() is just one of them.  Is that a bug?

Not as we defined invalidate_lock semantics until now. invalidate_lock is
guarding consistency between filesystem's idea of logical_offset ->
disk_block mapping and where the page cache is loading data from. So we
require that holding invalidate_lock in shared mode is enough for any
offset->block mapping to stay valid until we release the lock. And all
operations using offset->block mapping need to hold invalidate_lock at
least in shared mode. Generally page reclaim (and drop_pagecache_sb() is
just a special case of page reclaim) does not rely on the offset->block
mapping in any way and does not modify it, so it doesn't need to acquire
invalidate_lock.

Furthermore making page reclaim grab invalidate_lock would be really
problematic due to lock ordering issues so I don't see it as something we
should be doing.
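
To make the convention concrete, here is a minimal sketch of the intended
usage (illustrative only, using the generic filemap_invalidate_lock
helpers):

	/* Changing the offset->block mapping (truncate, hole punch, ...): */
	filemap_invalidate_lock(inode->i_mapping);
	truncate_pagecache_range(inode, start, end);
	/* ... update the filesystem's block mapping ... */
	filemap_invalidate_unlock(inode->i_mapping);

	/* Filling the page cache from that mapping (read, fault, readahead): */
	filemap_invalidate_lock_shared(inode->i_mapping);
	/* ... look up the offset->block mapping and read folios in ... */
	filemap_invalidate_unlock_shared(inode->i_mapping);

	/*
	 * Reclaim of clean folios (drop_pagecache_sb() included) takes
	 * neither lock, which is why a clean folio can disappear between
	 * the two seeks in iomap_write_delalloc_release().
	 */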

> If not, then this code needs to prevent any mapping tree change from
> occurring whilst it is walking the range of the write that needs to
> be scanned. The write has partially failed at this point, and we're
> trying to clean up the mess the partial write has left behind. We
> really need exclusive access to that range of the mapping whilst
> cleanup is done.

Well, in this particular case, what's the difference if page reclaim
removed the clean page just before calling iomap_write_delalloc_release()?
Sure it won't trigger the warning in iomap_write_delalloc_release() as that
offset will already be treated as a hole by mapping_seek_hole_data() but
otherwise I don't think there will be a difference. And
iomap_write_delalloc_punch() seems to be handling that case just fine
AFAICT. So perhaps all that's needed is to make
iomap_write_delalloc_release() a bit more relaxed and reflect the fact that
page reclaim can transform a DATA to a HOLE from the point of view of
mapping_seek_hole_data()? Note that this is really only about
mapping_seek_hole_data() being imprecise. From the filesystem's point of
view invalidate_lock does indeed protect us from any real DATA->HOLE
changes.

								Honza

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 093c4515b22a..45b54f3e6f47 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1179,7 +1179,17 @@  static int iomap_write_delalloc_release(struct inode *inode,
 			error = data_end;
 			goto out_unlock;
 		}
-		WARN_ON_ONCE(data_end <= start_byte);
+
+		/*
+		 * Seek for data/hole in the page cache can race with drop
+		 * cache, if data page was dropped between seek for data and
+		 * hole, data_end may be equal to start_byte, just let it keep
+		 * seeking.
+		 */
+		if (data_end == start_byte)
+			continue;
+
+		WARN_ON_ONCE(data_end < start_byte);
 		WARN_ON_ONCE(data_end > scan_end_byte);
 
 		error = iomap_write_delalloc_scan(inode, &punch_start_byte,