diff mbox series

filemap: skip range writeback if end offset precedes start

Message ID 20221028125428.976549-1-bfoster@redhat.com (mailing list archive)
State New
Headers show
Series filemap: skip range writeback if end offset precedes start | expand

Commit Message

Brian Foster Oct. 28, 2022, 12:54 p.m. UTC
A call to file[map]_write_and_wait_range() with an end offset that
precedes the start offset but happens to land in the same page can
trigger writeback submission but fail to wait on the submitted page.
Writeback submission occurs because __filemap_fdatawrite_range()
passes both offsets down into write_cache_pages(), which rounds down
to page indexes before it starts processing writeback.
__filemap_fdatawait_range() immediately returns if the specified end
offset precedes the start offset, however.

I suspect these checks are primarily intended to handle overflow
conditions. I happened to notice this behavior when investigating an
unrelated problem and observed that a filemap_write_and_wait_range()
call with unexpected parameters had seemingly unpredictable latency.
That latency turned out to be the submission path occasionally
waiting on writeback state of the page (i.e. from
write_cache_pages()) before issuing the currently requested
writepage and then unconditionally failing to wait on the latter via
__filemap_fdatawait_range().

This could probably be reasonably fixed to either elide the
submission, as this patch does, or modify the fdatawait path to
check the page indexes instead of the unaligned offsets. After
poking around a bit, it seemed more consistent with various other
filemap interfaces to check the offsets in the write path and return
if the end offset is not >= the start. For example,
filemap_range_has_page() and filemap_range_has_writeback() both
include similar byte granularity checks.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 mm/filemap.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Andrew Morton Oct. 31, 2022, 4:51 a.m. UTC | #1
On Fri, 28 Oct 2022 08:54:28 -0400 Brian Foster <bfoster@redhat.com> wrote:

> A call to file[map]_write_and_wait_range() with an end offset that
> precedes the start offset but happens to land in the same page can
> trigger writeback submission but fail to wait on the submitted page.
> Writeback submission occurs because __filemap_fdatawrite_range()
> passes both offsets down into write_cache_pages(), which rounds down
> to page indexes before it starts processing writeback.
> __filemap_fdatawait_range() immediately returns if the specified end
> offset precedes the start offset, however.
> 
> I suspect these checks are primarily intended to handle overflow
> conditions. I happened to notice this behavior when investigating an
> unrelated problem and observed that a filemap_write_and_wait_range()
> call with unexpected parameters had seemingly unpredictable latency.
> That latency turned out to be the submission path occasionally
> waiting on writeback state of the page (i.e. from
> write_cache_pages()) before issuing the currently requested
> writepage and then unconditionally failing to wait on the latter via
> __filemap_fdatawait_range().
> 
> This could probably be reasonably fixed to either elide the
> submission, as this patch does, or modify the fdatawait path to
> check the page indexes instead of the unaligned offsets. After
> poking around a bit, it seemed more consistent with various other
> filemap interfaces to check the offsets in the write path and return
> if the end offset is not >= the start. For example,
> filemap_range_has_page() and filemap_range_has_writeback() both
> include similar byte granularity checks.
> 
> ...
>
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -418,6 +418,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
>  		.range_end = end,
>  	};
>  
> +	if (end < start)
> +		return 0;
> +
>  	return filemap_fdatawrite_wbc(mapping, &wbc);
>  }

Is there any way in which this condition can be triggered from
userspace?  Or from any non-buggy kernelspace?

Should we have a WARN_ON() in there to detect this?
Brian Foster Nov. 7, 2022, 4:33 p.m. UTC | #2
On Sun, Oct 30, 2022 at 09:51:00PM -0700, Andrew Morton wrote:
> On Fri, 28 Oct 2022 08:54:28 -0400 Brian Foster <bfoster@redhat.com> wrote:
> 
> > A call to file[map]_write_and_wait_range() with an end offset that
> > precedes the start offset but happens to land in the same page can
> > trigger writeback submission but fail to wait on the submitted page.
> > Writeback submission occurs because __filemap_fdatawrite_range()
> > passes both offsets down into write_cache_pages(), which rounds down
> > to page indexes before it starts processing writeback.
> > __filemap_fdatawait_range() immediately returns if the specified end
> > offset precedes the start offset, however.
> > 
> > I suspect these checks are primarily intended to handle overflow
> > conditions. I happened to notice this behavior when investigating an
> > unrelated problem and observed that a filemap_write_and_wait_range()
> > call with unexpected parameters had seemingly unpredictable latency.
> > That latency turned out to be the submission path occasionally
> > waiting on writeback state of the page (i.e. from
> > write_cache_pages()) before issuing the currently requested
> > writepage and then unconditionally failing to wait on the latter via
> > __filemap_fdatawait_range().
> > 
> > This could probably be reasonably fixed to either elide the
> > submission, as this patch does, or modify the fdatawait path to
> > check the page indexes instead of the unaligned offsets. After
> > poking around a bit, it seemed more consistent with various other
> > filemap interfaces to check the offsets in the write path and return
> > if the end offset is not >= the start. For example,
> > filemap_range_has_page() and filemap_range_has_writeback() both
> > include similar byte granularity checks.
> > 
> > ...
> >
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -418,6 +418,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
> >  		.range_end = end,
> >  	};
> >  
> > +	if (end < start)
> > +		return 0;
> > +
> >  	return filemap_fdatawrite_wbc(mapping, &wbc);
> >  }
> 

Hi Andrew,

Sorry for the delay..

> Is there any way in which this condition can be triggered from
> userspace?  Or from any non-buggy kernelspace?
> 

Hmm.. good question. Making a quick pass through the callers of
__filemap_fdatawrite_range(), I see the following situations:

- sync_file_range() includes a higher level end < start check that
results in skipping the operation. This appears to cover the
sync_file_range() family of syscalls.

- generic_fadvise(..., POSIX_FADV_DONTNEED) - sets end = -1 if less than
start. This essentially converts the range to cover through the end of
the file.

- filemap_fdatawrite_range() and file[map]_write_and_wait_range() are
called from quite a few places with non-determistic inputs. It
wouldn't surprise me a ton if somehow it were possible for some of these
callers to do the end < start thing based on unsanitized input or buggy
logic, they may just not care depending on if they wait or not.

For example, gfs2_fsync() calls filemap_fdata[write|wait]_range()
separately, so in theory if called with end < start, the write could
submit but the wait could skip similar to having called
filemap_write_and_wait_range() (assuming the conditional whole file
write in that path doesn't trigger) if the offsets land in the same
page.

> Should we have a WARN_ON() in there to detect this?
> 

That might make sense. I suppose it depends on what expected behavior
is. It certainly doesn't make much sense to write and then not wait from
write_and_wait() variants, so callers would probably want to know about
that. It might be hard to really audit all callsites to determine
whether anybody actually relies on the "end is before start but lands on
the same page" behavior for the write only case.

We could alternatively change fdatawait to compare the shifted page
indexes to match fdatawrite behavior, but that all seems a bit fragile
because of 1. the various higher level byte granularity checks and 2. I
doubt anybody actually checks for whether the range crosses a page
boundary, which is the difference between skipping just the wait or the
write as well.

So I dunno, I could see various combinations of changes being considered
reasonable. Perhaps a good starting point would be to wrap the check in
this patch with a WARN_ON_ONCE() and let it soak in -next for a while?
That would avoid excessive noise from repetitive callers [1] but still
allow those callsites to be identified/fixed. If there is some really
weird fdatawrite-only caller that conflicts, the change could always be
loosened up from there (as unlikely as that seems).. Hm?

Brian

[1] The use case that identified this problem is a wonky call from the
XFS truncate path from a workload that makes this truncate call
repeatedly. A WARN_ON_ONCE() would have most definitely been useful IMO,
but an unconditional warning would spam the logs in this particular
case.
Andrew Morton Nov. 7, 2022, 8:28 p.m. UTC | #3
On Mon, 7 Nov 2022 11:33:46 -0500 Brian Foster <bfoster@redhat.com> wrote:

> Perhaps a good starting point would be to wrap the check in
> this patch with a WARN_ON_ONCE() and let it soak in -next for a while?
> That would avoid excessive noise from repetitive callers [1] but still
> allow those callsites to be identified/fixed. If there is some really
> weird fdatawrite-only caller that conflicts, the change could always be
> loosened up from there (as unlikely as that seems).. Hm?

Sounds reasonable.

Please let's be clear in the changelog why we're adding this.  I mean,
we could add a zillion checks everywhere for misbehaving callers.  Why
choose this one place in particular?
Brian Foster Nov. 14, 2022, 5:35 p.m. UTC | #4
On Mon, Nov 07, 2022 at 12:28:02PM -0800, Andrew Morton wrote:
> On Mon, 7 Nov 2022 11:33:46 -0500 Brian Foster <bfoster@redhat.com> wrote:
> 
> > Perhaps a good starting point would be to wrap the check in
> > this patch with a WARN_ON_ONCE() and let it soak in -next for a while?
> > That would avoid excessive noise from repetitive callers [1] but still
> > allow those callsites to be identified/fixed. If there is some really
> > weird fdatawrite-only caller that conflicts, the change could always be
> > loosened up from there (as unlikely as that seems).. Hm?
> 
> Sounds reasonable.
> 
> Please let's be clear in the changelog why we're adding this.  I mean,
> we could add a zillion checks everywhere for misbehaving callers.  Why
> choose this one place in particular?
> 

Ok, so TLDR.. this patch is broken as is. I was probably thinking the
generic_fadvise() case of end == -1 was cast to unsigned by the caller,
but testing shows that is not the case and -1 is passed down as a valid
input. This obviously conflicts with the check as proposed here.

I suppose it might make some sense to drop the analogous check from
__filemap_fdatawait_range() so underlying write/wait behavior is
consistent, and perhaps consider adding a higher level check in the
write_and_wait_range() wrappers. I'd have to make a pass through some of
the callers and think about that some more. I.e.,
filemap_write_and_wait_range() documents that end = -1 is acceptable,
fdatawrite presumably does the right thing in that case, fdatawait skips
when end_byte < start_byte, and most callers seem to actually use
LLONG_MAX anyways.

Brian
diff mbox series

Patch

diff --git a/mm/filemap.c b/mm/filemap.c
index 08341616ae7a..99d8686c9f5c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -418,6 +418,9 @@  int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 		.range_end = end,
 	};
 
+	if (end < start)
+		return 0;
+
 	return filemap_fdatawrite_wbc(mapping, &wbc);
 }