diff mbox series

exfat: fix file not locking when writing zeros in exfat_file_mmap()

Message ID PUZPR04MB63168A32AB45E8924B52CBC2817B2@PUZPR04MB6316.apcprd04.prod.outlook.com (mailing list archive)
State New, archived
Series exfat: fix file not locking when writing zeros in exfat_file_mmap()

Commit Message

Yuezhang.Mo@sony.com Jan. 24, 2024, 5 a.m. UTC
inode->i_rwsem should be locked when writing file. But the lock
is missing when writing zeros to the file in exfat_file_mmap().

Fixes: 11a347fb6cef ("exfat: change to get file size from DataLength")
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
---
 fs/exfat/file.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Matthew Wilcox Jan. 24, 2024, 5:21 a.m. UTC | #1
On Wed, Jan 24, 2024 at 05:00:37AM +0000, Yuezhang.Mo@sony.com wrote:
> inode->i_rwsem should be locked when writing file. But the lock
> is missing when writing zeros to the file in exfat_file_mmap().

This is actually very weird behaviour in exfat.  This kind of "I must
manipulate the on-disc layout" is not generally done in mmap(), it's
done in ->page_mkwrite() or even delayed until we actually do writeback.
Why does exfat do this?
Yuezhang.Mo@sony.com Jan. 24, 2024, 10:05 a.m. UTC | #2
From: Matthew Wilcox <willy@infradead.org> 
Sent: Wednesday, January 24, 2024 1:21 PM
To: Mo, Yuezhang <Yuezhang.Mo@sony.com>
Subject: Re: [PATCH] exfat: fix file not locking when writing zeros in exfat_file_mmap()
> On Wed, Jan 24, 2024 at 05:00:37AM +0000, Yuezhang.Mo@sony.com wrote:
> > inode->i_rwsem should be locked when writing file. But the lock
> > is missing when writing zeros to the file in exfat_file_mmap().
> 
> This is actually very weird behaviour in exfat.  This kind of "I must
> manipulate the on-disc layout" is not generally done in mmap(), it's
> done in ->page_mkwrite() or even delayed until we actually do writeback.
> Why does exfat do this?

In exfat, "valid_size" describes how far into the data stream user data has been
written and "size" describes the file size.  Reads in the range "valid_size"~"size" return zeros.

For example,

(1) xfs_io -t -f -c "pwrite -S 0x59 0 1024" $filename
     - Write 0x59 to 0~1023
     - both "size" and "valid_size" are 1024
(2) xfs_io -t -f -c "truncate 4K" $filename
     - "valid_size" is still 1024
     - "size" is changed to 4096
     - 1024~4095 is not zeroed
     - return zeros if read 1024~4095
(3) xfs_io -t -f -c "mmap -rw 0 3072" -c  "mwrite -S 0x5a 2048 512" $filename
     (3.1) "mmap -rw 0 3072"
              -  write zeros to 1024~3071
              -  "valid_size" is changed to 3072
              - "size" is still 4096
     (3.2) "mwrite -S 0x5a 2048 512"
             - write 0x5a to 2048~2559
             - "valid_size" is still 3072
             -  "size" is still 4096

To avoid leaving 1024~2047 unzeroed, and to avoid having to update "valid_size"
in (3.2), I zero 1024~3071 in (3.1).

If you have a better solution, you are welcome to contribute it to exfat or
share your solution in detail.
Matthew Wilcox Jan. 24, 2024, 2:02 p.m. UTC | #3
On Wed, Jan 24, 2024 at 10:05:15AM +0000, Yuezhang.Mo@sony.com wrote:
> From: Matthew Wilcox <willy@infradead.org> 
> Sent: Wednesday, January 24, 2024 1:21 PM
> To: Mo, Yuezhang <Yuezhang.Mo@sony.com>
> Subject: Re: [PATCH] exfat: fix file not locking when writing zeros in exfat_file_mmap()
> > On Wed, Jan 24, 2024 at 05:00:37AM +0000, Yuezhang.Mo@sony.com wrote:
> > > inode->i_rwsem should be locked when writing file. But the lock
> > > is missing when writing zeros to the file in exfat_file_mmap().
> > 
> > This is actually very weird behaviour in exfat.  This kind of "I must
> > manipulate the on-disc layout" is not generally done in mmap(), it's
> > done in ->page_mkwrite() or even delayed until we actually do writeback.
> > Why does exfat do this?
> 
> In exfat, "valid_size" describes how far into the data stream user data has been
> written and "size" describes the file size.  Reads in the range "valid_size"~"size" return zeros.

Yes, it's like a very weak implementation of sparse files.  There can
only be one zero range and it must be at the end (as opposed to most
unix derived filesystems which allow arbitrary zero ranges anywhere in
the file).

> For example,
> 
> (1) xfs_io -t -f -c "pwrite -S 0x59 0 1024" $filename
>      - Write 0x59 to 0~1023
>      - both "size" and "valid_size" are 1024

Yes.

> (2) xfs_io -t -f -c "truncate 4K" $filename
>      - "valid_size" is still 1024
>      - "size" is changed to 4096
>      - 1024~4095 is not zeroed
>      - return zeros if read 1024~4095

Yes.

> (3) xfs_io -t -f -c "mmap -rw 0 3072" -c  "mwrite -S 0x5a 2048 512" $filename
>      (3.1) "mmap -rw 0 3072"
>               -  write zeros to 1024~3071
>               -  "valid_size" is changed to 3072
>               - "size" is still 4096

That's not what the code says you do.  Is this from a trace or were you
making up an example?

        loff_t start = ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
        loff_t end = min_t(loff_t, i_size_read(inode),
                        start + vma->vm_end - vma->vm_start);
                ret = exfat_file_zeroed_range(file, ei->valid_size, end);


vm_end - vm_start will be 4kB because Linux rounds to PAGE_SIZE even if
you ask to map 3072 bytes.

In any case, most filesystems do not do this, and I don't understand why
exfat does.  Just because a file is mmaped writable doesn't mean that
we're necessarily going to write to it.  The right thing to do here is
defer the changing of ->valid_size until you have already written back
the new data.

>      (3.2) "mwrite -S 0x5a 2048 512"
>              - write 0x5a to 2048~2559
>              - "valid_size" is still 3072
>              -  "size" is still 4096
> 
> To avoid leaving 1024~2047 unzeroed, and to avoid having to update "valid_size"
> in (3.2), I zero 1024~3071 in (3.1).
> 
> If you have a better solution, you are welcome to contribute it to exfat or
> share your solution in detail.

Update ->valid_size in the writeback path.  If I'm reading
exfat_get_block() correctly, you already correctly zero pages in the page
cache that are read after ->valid_size, so there is very little work for
you to do.


Oh!  I just figured out why you probably do it this way.

(1) xfs_io -t -f -c "pwrite -S 0x59 0 1024" $filename
(2) xfs_io -t -f -c "truncate 4T" $filename
(3) xfs_io -t -f -c "mmap -rw 3T 4096" -c  "mwrite -S 0x5a 3T 512" $filename

Now (at whatever point you're delaying writing zeroes to) you have to
write 3TB of zeroes.  And it's probably better to do that at mmap time
than at page fault time, so you can still return an error.  It's a bit
weird to return ENOSPC from mmap, but here we are.

It'd be nice to have a comment to explain this.  Also, it seems to me
that I can write a script that floods the kernel log with:

                        exfat_err(inode->i_sb,
                                  "mmap: fail to zero from %llu to %llu(%d)",
                                  start, end, ret);

That error message should probably be taken out entirely (maybe use a
tracepoint if you really want some kind of logging).
Dave Chinner Jan. 24, 2024, 9:35 p.m. UTC | #4
On Wed, Jan 24, 2024 at 10:05:15AM +0000, Yuezhang.Mo@sony.com wrote:
> From: Matthew Wilcox <willy@infradead.org> 
> Sent: Wednesday, January 24, 2024 1:21 PM
> To: Mo, Yuezhang <Yuezhang.Mo@sony.com>
> Subject: Re: [PATCH] exfat: fix file not locking when writing zeros in exfat_file_mmap()
> > On Wed, Jan 24, 2024 at 05:00:37AM +0000, mailto:Yuezhang.Mo@sony.com wrote:
> > > inode->i_rwsem should be locked when writing file. But the lock
> > > is missing when writing zeros to the file in exfat_file_mmap().
> > 
> > This is actually very weird behaviour in exfat.  This kind of "I must
> > manipulate the on-disc layout" is not generally done in mmap(), it's
> > done in ->page_mkwrite() or even delayed until we actually do writeback.
> > Why does exfat do this?
> 
> In exfat, "valid_size" describes how far into the data stream user data has been
> written and "size" describes the file size.  Reads in the range "valid_size"~"size" return zeros.
> 
> For example,
> 
> (1) xfs_io -t -f -c "pwrite -S 0x59 0 1024" $filename
>      - Write 0x59 to 0~1023
>      - both "size" and "valid_size" are 1024
> (2) xfs_io -t -f -c "truncate 4K" $filename
>      - "valid_size" is still 1024
>      - "size" is changed to 4096
>      - 1024~4095 is not zeroed

I think that's the problem right there. File extension via truncate
should really zero the bytes in the page cache in partial pages on
file extension (and likely should do it on-disk as well). See
iomap_truncate_page(), ext4_block_truncate_page(), etc.

Leaving the zeroing until someone actually accesses the data leads
to complexity in the IO path to handle this corner case and getting
that wrong leads directly to data corruption bugs. Just zero the
data in the operation that exposes that data range as zeros to the
user.

-Dave.
Namjae Jeon Jan. 25, 2024, 10:19 a.m. UTC | #5
2024-01-25 6:35 GMT+09:00, Dave Chinner <david@fromorbit.com>:
> On Wed, Jan 24, 2024 at 10:05:15AM +0000, Yuezhang.Mo@sony.com wrote:
>> From: Matthew Wilcox <willy@infradead.org>
>> Sent: Wednesday, January 24, 2024 1:21 PM
>> To: Mo, Yuezhang <Yuezhang.Mo@sony.com>
>> Subject: Re: [PATCH] exfat: fix file not locking when writing zeros in
>> exfat_file_mmap()
>> > On Wed, Jan 24, 2024 at 05:00:37AM +0000, Yuezhang.Mo@sony.com
>> > wrote:
>> > > inode->i_rwsem should be locked when writing file. But the lock
>> > > is missing when writing zeros to the file in exfat_file_mmap().
>> >
>> > This is actually very weird behaviour in exfat.  This kind of "I must
>> > manipulate the on-disc layout" is not generally done in mmap(), it's
>> > done in ->page_mkwrite() or even delayed until we actually do
>> > writeback.
>> > Why does exfat do this?
>>
>> In exfat, "valid_size" describes how far into the data stream user data
>> has been
>> written and "size" describes the file size.  Reads in the range
>> "valid_size"~"size" return zeros.
>>
>> For example,
>>
>> (1) xfs_io -t -f -c "pwrite -S 0x59 0 1024" $filename
>>      - Write 0x59 to 0~1023
>>      - both "size" and "valid_size" are 1024
>> (2) xfs_io -t -f -c "truncate 4K" $filename
>>      - "valid_size" is still 1024
>>      - "size" is changed to 4096
>>      - 1024~4095 is not zeroed
>
> I think that's the problem right there. File extension via truncate
> should really zero the bytes in the page cache in partial pages on
> file extension (and likely should do it on-disk as well). See
> iomap_truncate_page(), ext4_block_truncate_page(), etc.
>
> Leaving the zeroing until someone actually accesses the data leads
> to complexity in the IO path to handle this corner case and getting
> that wrong leads directly to data corruption bugs. Just zero the
> data in the operation that exposes that data range as zeros to the
> user.
We need to consider the case of mmap against files, created on Windows, whose
valid size and size differ. So the zeroing needed to be done in mmap.
We tried to improve this after receiving a report of a compatibility
issue with linux-exfat, where the two file sizes are set differently
from Windows.

https://github.com/exfatprogs/exfatprogs/issues/213

Yue referred to the mmap code of ntfs3, which has a valid size like exfat and
handled it in mmap.

Thanks.

>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Jan. 26, 2024, 1:22 a.m. UTC | #6
On Thu, Jan 25, 2024 at 07:19:45PM +0900, Namjae Jeon wrote:
> 2024-01-25 6:35 GMT+09:00, Dave Chinner <david@fromorbit.com>:
> > On Wed, Jan 24, 2024 at 10:05:15AM +0000, Yuezhang.Mo@sony.com wrote:
> >> From: Matthew Wilcox <willy@infradead.org>
> >> Sent: Wednesday, January 24, 2024 1:21 PM
> >> To: Mo, Yuezhang <Yuezhang.Mo@sony.com>
> >> Subject: Re: [PATCH] exfat: fix file not locking when writing zeros in
> >> exfat_file_mmap()
>> > On Wed, Jan 24, 2024 at 05:00:37AM +0000, Yuezhang.Mo@sony.com
> >> > wrote:
> >> > > inode->i_rwsem should be locked when writing file. But the lock
> >> > > is missing when writing zeros to the file in exfat_file_mmap().
> >> >
> >> > This is actually very weird behaviour in exfat.  This kind of "I must
> >> > manipulate the on-disc layout" is not generally done in mmap(), it's
> >> > done in ->page_mkwrite() or even delayed until we actually do
> >> > writeback.
> >> > Why does exfat do this?
> >>
> >> In exfat, "valid_size" describes how far into the data stream user data
> >> has been
> >> written and "size" describes the file size.  Reads in the range
> >> "valid_size"~"size" return zeros.
> >>
> >> For example,
> >>
> >> (1) xfs_io -t -f -c "pwrite -S 0x59 0 1024" $filename
> >>      - Write 0x59 to 0~1023
> >>      - both "size" and "valid_size" are 1024
> >> (2) xfs_io -t -f -c "truncate 4K" $filename
> >>      - "valid_size" is still 1024
> >>      - "size" is changed to 4096
> >>      - 1024~4095 is not zeroed
> >
> > I think that's the problem right there. File extension via truncate
> > should really zero the bytes in the page cache in partial pages on
> > file extension (and likely should do it on-disk as well). See
> > iomap_truncate_page(), ext4_block_truncate_page(), etc.
> >
> > Leaving the zeroing until someone actually accesses the data leads
> > to complexity in the IO path to handle this corner case and getting
> > that wrong leads directly to data corruption bugs. Just zero the
> > data in the operation that exposes that data range as zeros to the
> > user.
> We need to consider the case of mmap against files, created on Windows, whose
> valid size and size differ. So the zeroing needed to be done in mmap.

That's a different case - that's a "read from a hole" case, not a
"extending truncate" case. i.e. the range from 'valid size' to EOF
is a range where no data has been written and so contains zeros.
It is equivalent to either a hole in the file (no backing store) or
an unwritten range (backing store instantiated but marked as
containing no valid data).

When we consider this range as "reading from a hole/unwritten
range", it should become obvious the correct way to handle this case
is the same as every other filesystem that supports holes and/or
unwritten extents: the page cache page gets zeroed in the
readahead/readpage paths when it maps to a hole/unwritten range in
the file.

There's no special locking needed if it is done this way, and
there's no need for special hooks anywhere to zero data beyond valid
size because it is already guaranteed to be zeroed in memory if the
range is cached in the page cache.....

> We tried to improve this after receiving a report of a compatibility
> issue with linux-exfat, where the two file sizes are set differently
> from Windows.
> 
> https://github.com/exfatprogs/exfatprogs/issues/213
> 
> Yue referred to the mmap code of ntfs3, which has a valid size like exfat and
> handled it in mmap.

Saying "but someone else did the same thing" doesn't make it the
right thing to do. It just means someone else has already done it
the wrong way, and it wasn't noticed during review. :/

-Dave.
Matthew Wilcox Jan. 26, 2024, 2:54 a.m. UTC | #7
On Fri, Jan 26, 2024 at 12:22:32PM +1100, Dave Chinner wrote:
> On Thu, Jan 25, 2024 at 07:19:45PM +0900, Namjae Jeon wrote:
> > We need to consider the case of mmap against files, created on Windows, whose
> > valid size and size differ. So the zeroing needed to be done in mmap.
> 
> That's a different case - that's a "read from a hole" case, not a
> "extending truncate" case. i.e. the range from 'valid size' to EOF
> is a range where no data has been written and so contains zeros.
> It is equivalent to either a hole in the file (no backing store) or
> an unwritten range (backing store instantiated but marked as
> containing no valid data).
> 
> When we consider this range as "reading from a hole/unwritten
> range", it should become obvious the correct way to handle this case
> is the same as every other filesystem that supports holes and/or
> unwritten extents: the page cache page gets zeroed in the
> readahead/readpage paths when it maps to a hole/unwritten range in
> the file.
> 
> There's no special locking needed if it is done this way, and
> there's no need for special hooks anywhere to zero data beyond valid
> size because it is already guaranteed to be zeroed in memory if the
> range is cached in the page cache.....

but the problem is that Microsoft half-arsed their support for holes.
See my other mail in this thread.

truncate the file up to 4TB
write a byte at offset 3TB

... now we have to stream 3TB of zeroes through the page cache so that
we can write the byte at 3TB.
Yuezhang.Mo@sony.com Jan. 26, 2024, 5:43 a.m. UTC | #8
> From: Matthew Wilcox <willy@infradead.org> 
> Sent: Wednesday, January 24, 2024 10:03 PM
> To: Mo, Yuezhang <Yuezhang.Mo@sony.com>
> Cc: linkinjeon@kernel.org; sj1557.seo@samsung.com; linux-fsdevel@vger.kernel.org
> Subject: Re: [PATCH] exfat: fix file not locking when writing zeros in exfat_file_mmap()
> 
> > (3) xfs_io -t -f -c "mmap -rw 0 3072" -c  "mwrite -S 0x5a 2048 512" $filename
> >      (3.1) "mmap -rw 0 3072"
> >               -  write zeros to 1024~3071
> >               -  "valid_size" is changed to 3072
> >               - "size" is still 4096
> 
> That's not what the code says you do.  Is this from a trace or were you
> making up an example?
>
>         loff_t start = ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>         loff_t end = min_t(loff_t, i_size_read(inode),
>                         start + vma->vm_end - vma->vm_start);
>                 ret = exfat_file_zeroed_range(file, ei->valid_size, end);
> 
> vm_end - vm_start will be 4kB because Linux rounds to PAGE_SIZE even if
> you ask to map 3072 bytes.

I just gave an example to explain why exfat_file_mmap() writes zeros.
Yes, it would have been better to use traced data; the numbers in my example
were confusing.

> Update ->valid_size in the writeback path.  If I'm reading
> exfat_get_block() correctly, you already correctly zero pages in the page
> cache that are read after ->valid_size, so there is very little work for
> you to do.

Do you mean moving the valid_size update to ->page_mkwrite()?
If yes, we need to consider this case.

(1) The file size is 12KB (3 pages), valid_size is 0.
(2) mmap maps these 3 pages.
(3) Only the second page is written.
(4) Then the following needs to be done in ->page_mkwrite():
       (4.1) Mark the second page as dirty (currently implemented in filemap_page_mkwrite()).
       (4.2) Mark the first page as dirty (extra work in exfat).
       (4.3) Update valid_size to the end of the second page (extra work in exfat).

Is my understanding correct? If so, how should (4.2) be done? Are there examples of this?

> Oh!  I just figured out why you probably do it this way.
> 
> (1) xfs_io -t -f -c "pwrite -S 0x59 0 1024" $filename
> (2) xfs_io -t -f -c "truncate 4T" $filename
> (3) xfs_io -t -f -c "mmap -rw 3T 4096" -c  "mwrite -S 0x5a 3T 512" $filename
> 
> Now (at whatever point you're delaying writing zeroes to) you have to
> write 3TB of zeroes. 

From this example, it seems that the current implementation is not good.

I referred to the mmap code of ntfs3, and ntfs3 also writes zeros in ->mmap.
Is the current implementation acceptable as a workaround?

> And it's probably better to do that at mmap time
> than at page fault time, so you can still return an error.  It's a bit
> weird to return ENOSPC from mmap, but here we are.

exfat_file_zeroed_range() never returns -ENOSPC, because exfat_file_zeroed_range()
does not change the file size (i.e. it does not allocate new clusters).

> 
> It'd be nice to have a comment to explain this.  Also, it seems to me
> that I can write a script that floods the kernel log with:
>
>                       exfat_err(inode->i_sb,
>                                  "mmap: fail to zero from %llu to %llu(%d)",
>                                  start, end, ret);
>
> That error message should probably be taken out entirely (maybe use a
> tracepoint if you really want some kind of logging).

The error message can be seen directly by developers during development.
How about using pr_err_ratelimited()?
Dave Chinner Jan. 26, 2024, 10:32 p.m. UTC | #9
On Fri, Jan 26, 2024 at 02:54:24AM +0000, Matthew Wilcox wrote:
> On Fri, Jan 26, 2024 at 12:22:32PM +1100, Dave Chinner wrote:
> > On Thu, Jan 25, 2024 at 07:19:45PM +0900, Namjae Jeon wrote:
> > > We need to consider the case of mmap against files, created on Windows, whose
> > > valid size and size differ. So the zeroing needed to be done in mmap.
> > 
> > That's a different case - that's a "read from a hole" case, not a
> > "extending truncate" case. i.e. the range from 'valid size' to EOF
> > is a range where no data has been written and so contains zeros.
> > It is equivalent to either a hole in the file (no backing store) or
> > an unwritten range (backing store instantiated but marked as
> > containing no valid data).
> > 
> > When we consider this range as "reading from a hole/unwritten
> > range", it should become obvious the correct way to handle this case
> > is the same as every other filesystem that supports holes and/or
> > unwritten extents: the page cache page gets zeroed in the
> > readahead/readpage paths when it maps to a hole/unwritten range in
> > the file.
> > 
> > There's no special locking needed if it is done this way, and
> > there's no need for special hooks anywhere to zero data beyond valid
> > size because it is already guaranteed to be zeroed in memory if the
> > range is cached in the page cache.....
> 
> but the problem is that Microsoft half-arsed their support for holes.
> See my other mail in this thread.

Why does that matter?  It's exactly the same problem with any other
filesystem that doesn't support sparse files.

All I said is that IO operations beyond the "valid size" should
be treated like operating in a hole - I pass no judgement on the
filesystem design, implementation or level of sparse file support
it has. All it needs to do is treat the "not valid" size range as if
it was a hole or unwritten, regardless of whether the file is sparse
or not....

> truncate the file up to 4TB
> write a byte at offset 3TB
> 
> ... now we have to stream 3TB of zeroes through the page cache so that
> we can write the byte at 3TB.

This behaviour cannot be avoided on filesystems without sparse file
support - the hit of writing zeroes has to be taken somewhere. We
can handle this in truncate(), the write() path or in ->page_mkwrite
*if* the zeroing condition is hit.  There's no need to do it at
mmap() time if that range of the file is not actually written to by
the application...

-Dave.
Matthew Wilcox Jan. 26, 2024, 10:41 p.m. UTC | #10
On Sat, Jan 27, 2024 at 09:32:42AM +1100, Dave Chinner wrote:
> > but the problem is that Microsoft half-arsed their support for holes.
> > See my other mail in this thread.
> 
> Why does that matter?  It's exactly the same problem with any other
> filesystem that doesn't support sparse files.
> 
> All I said is that IO operations beyond the "valid size" should
> be treated like operating in a hole - I pass no judgement on the
> filesystem design, implementation or level of sparse file support
> it has. All it needs to do is treat the "not valid" size range as if
> it was a hole or unwritten, regardless of whether the file is sparse
> or not....
> 
> > truncate the file up to 4TB
> > write a byte at offset 3TB
> > 
> > ... now we have to stream 3TB of zeroes through the page cache so that
> > we can write the byte at 3TB.
> 
> This behaviour cannot be avoided on filesystems without sparse file
> support - the hit of writing zeroes has to be taken somewhere. We
> can handle this in truncate(), the write() path or in ->page_mkwrite
> *if* the zeroing condition is hit.  There's no need to do it at
> mmap() time if that range of the file is not actually written to by
> the application...

It's really hard to return -ENOSPC from ->page_mkwrite.  At best you'll
get a SIGBUS or SEGV.  So truncate() or mmap() are the only two places to
do it.  And if we do it in truncate() then we can't take any advantage
of the limited "hole" support the filesystem has.

Most files are never mmaped, much less mapped writable.  I think doing
it in mmap() is the best of several bad options.
Dave Chinner March 6, 2024, 10:31 p.m. UTC | #11
[Sorry, missed this when you sent it and I think it deserves a
response...]

On Fri, Jan 26, 2024 at 10:41:14PM +0000, Matthew Wilcox wrote:
> On Sat, Jan 27, 2024 at 09:32:42AM +1100, Dave Chinner wrote:
> > > but the problem is that Microsoft half-arsed their support for holes.
> > > See my other mail in this thread.
> > 
> > Why does that matter?  It's exactly the same problem with any other
> > filesytsem that doesn't support sparse files.
> > 
> > All I said is that IO operations beyond the "valid size" should
> > be treated like operating in a hole - I pass no judgement on the
> > filesystem design, implementation or level of sparse file support
> > it has. All it needs to do is treat the "not valid" size range as if
> > it was a hole or unwritten, regardless of whether the file is sparse
> > or not....
> > 
> > > truncate the file up to 4TB
> > > write a byte at offset 3TB
> > > 
> > > ... now we have to stream 3TB of zeroes through the page cache so that
> > > we can write the byte at 3TB.
> > 
> > This behaviour cannot be avoided on filesystems without sparse file
> > support - the hit of writing zeroes has to be taken somewhere. We
> > can handle this in truncate(), the write() path or in ->page_mkwrite
> > *if* the zeroing condition is hit.  There's no need to do it at
> > mmap() time if that range of the file is not actually written to by
> > the application...
> 
> It's really hard to return -ENOSPC from ->page_mkwrite.  At best you'll
> get a SIGBUS or SEGV.  So truncate() or mmap() are the only two places to
> do it.  And if we do it in truncate() then we can't take any advantage
> of the limited "hole" support the filesystem has.

Yes, but the entire point of ->page-mkwrite was to allow the
filesystems to abort the user data modification if they were at
ENOSPC when the page fault happened. This is way better than getting
into trouble due to space overcommit caused by ignoring ENOSPC
situations during page faults.

This was a significant problem for XFS users back in the mid-2000s
before ->page_mkwrite existed, because both delayed allocation and
writes over unwritten extents required ENOSPC to be determined at
page fault time. It was too late if ENOSPC happened at writeback
time or IO completion - this was a significant data loss vector and if
we tried to prevent data loss by redirtying the pages then we'd lock
the machine up because dirty pages could not be cleaned and memory
reclaim couldn't make progress...

IOWs, it is far better to immediately terminate the application than
it is to have silent data loss or completely lock the machine up.
Hence the defined semantics of ->page_mkwrite is to send SIGBUS to
the application on ENOSPC. 

It's not an amazing solution but, in reality, it is little different
to the OOM killer triggering when memory reclaim declares ENOMEM. We
can't allow the operation to continue, and we can't return an error
for the application to handle. So we kill it, just like the OOM
killer does in the same situation for ENOMEM.

We even have an fstest that explicitly exercises this case. i.e.
generic/619 creates this mmap() into sparse files situation and then
checks that the filesystem really is at ENOSPC when the page fault
throws a SIGBUS out because ->page_mkwrite failed due to ENOSPC....

> Most files are never mmaped, much less mapped writable.  I think doing
> it in mmap() is the best of several bad options.

Perhaps so, but that makes it unnecessarily different to the major
Linux filesystems....

-Dave.

Patch

diff --git a/fs/exfat/file.c b/fs/exfat/file.c
index d25a96a148af..473c1641d50d 100644
--- a/fs/exfat/file.c
+++ b/fs/exfat/file.c
@@ -613,7 +613,11 @@  static int exfat_file_mmap(struct file *file, struct vm_area_struct *vma)
 			start + vma->vm_end - vma->vm_start);
 
 	if ((vma->vm_flags & VM_WRITE) && ei->valid_size < end) {
+		if (!inode_trylock(inode))
+			return -EAGAIN;
+
 		ret = exfat_file_zeroed_range(file, ei->valid_size, end);
+		inode_unlock(inode);
 		if (ret < 0) {
 			exfat_err(inode->i_sb,
 				  "mmap: fail to zero from %llu to %llu(%d)",