Message ID | 20221028125428.976549-1-bfoster@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | filemap: skip range writeback if end offset precedes start | expand |
On Fri, 28 Oct 2022 08:54:28 -0400 Brian Foster <bfoster@redhat.com> wrote: > A call to file[map]_write_and_wait_range() with an end offset that > precedes the start offset but happens to land in the same page can > trigger writeback submission but fail to wait on the submitted page. > Writeback submission occurs because __filemap_fdatawrite_range() > passes both offsets down into write_cache_pages(), which rounds down > to page indexes before it starts processing writeback. > __filemap_fdatawait_range() immediately returns if the specified end > offset precedes the start offset, however. > > I suspect these checks are primarily intended to handle overflow > conditions. I happened to notice this behavior when investigating an > unrelated problem and observed that a filemap_write_and_wait_range() > call with unexpected parameters had seemingly unpredictable latency. > That latency turned out to be the submission path occasionally > waiting on writeback state of the page (i.e. from > write_cache_pages()) before issuing the currently requested > writepage and then unconditionally failing to wait on the latter via > __filemap_fdatawait_range(). > > This could probably be reasonably fixed to either elide the > submission, as this patch does, or modify the fdatawait path to > check the page indexes instead of the unaligned offsets. After > poking around a bit, it seemed more consistent with various other > filemap interfaces to check the offsets in the write path and return > if the end offset is not >= the start. For example, > filemap_range_has_page() and filemap_range_has_writeback() both > include similar byte granularity checks. > > ... > > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -418,6 +418,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start, > .range_end = end, > }; > > + if (end < start) > + return 0; > + > return filemap_fdatawrite_wbc(mapping, &wbc); > } Is there any way in which this condition can be triggered from userspace? Or from any non-buggy kernelspace? Should we have a WARN_ON() in there to detect this?
On Sun, Oct 30, 2022 at 09:51:00PM -0700, Andrew Morton wrote: > On Fri, 28 Oct 2022 08:54:28 -0400 Brian Foster <bfoster@redhat.com> wrote: > > > A call to file[map]_write_and_wait_range() with an end offset that > > precedes the start offset but happens to land in the same page can > > trigger writeback submission but fail to wait on the submitted page. > > Writeback submission occurs because __filemap_fdatawrite_range() > > passes both offsets down into write_cache_pages(), which rounds down > > to page indexes before it starts processing writeback. > > __filemap_fdatawait_range() immediately returns if the specified end > > offset precedes the start offset, however. > > > > I suspect these checks are primarily intended to handle overflow > > conditions. I happened to notice this behavior when investigating an > > unrelated problem and observed that a filemap_write_and_wait_range() > > call with unexpected parameters had seemingly unpredictable latency. > > That latency turned out to be the submission path occasionally > > waiting on writeback state of the page (i.e. from > > write_cache_pages()) before issuing the currently requested > > writepage and then unconditionally failing to wait on the latter via > > __filemap_fdatawait_range(). > > > > This could probably be reasonably fixed to either elide the > > submission, as this patch does, or modify the fdatawait path to > > check the page indexes instead of the unaligned offsets. After > > poking around a bit, it seemed more consistent with various other > > filemap interfaces to check the offsets in the write path and return > > if the end offset is not >= the start. For example, > > filemap_range_has_page() and filemap_range_has_writeback() both > > include similar byte granularity checks. > > > > ... > > > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -418,6 +418,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start, > > .range_end = end, > > }; > > > > + if (end < start) > > + return 0; > > + > > return filemap_fdatawrite_wbc(mapping, &wbc); > > } > Hi Andrew, Sorry for the delay.. > Is there any way in which this condition can be triggered from > userspace? Or from any non-buggy kernelspace? > Hmm.. good question. Making a quick pass through the callers of __filemap_fdatawrite_range(), I see the following situations: - sync_file_range() includes a higher level end < start check that results in skipping the operation. This appears to cover the sync_file_range() family of syscalls. - generic_fadvise(..., POSIX_FADV_DONTNEED) - sets end = -1 if less than start. This essentially converts the range to cover through the end of the file. - filemap_fdatawrite_range() and file[map]_write_and_wait_range() are called from quite a few places with non-determistic inputs. It wouldn't surprise me a ton if somehow it were possible for some of these callers to do the end < start thing based on unsanitized input or buggy logic, they may just not care depending on if they wait or not. For example, gfs2_fsync() calls filemap_fdata[write|wait]_range() separately, so in theory if called with end < start, the write could submit but the wait could skip similar to having called filemap_write_and_wait_range() (assuming the conditional whole file write in that path doesn't trigger) if the offsets land in the same page. > Should we have a WARN_ON() in there to detect this? > That might make sense. I suppose it depends on what expected behavior is. It certainly doesn't make much sense to write and then not wait from write_and_wait() variants, so callers would probably want to know about that. It might be hard to really audit all callsites to determine whether anybody actually relies on the "end is before start but lands on the same page" behavior for the write only case. We could alternatively change fdatawait to compare the shifted page indexes to match fdatawrite behavior, but that all seems a bit fragile because of 1. the various higher level byte granularity checks and 2. I doubt anybody actually checks for whether the range crosses a page boundary, which is the difference between skipping just the wait or the write as well. So I dunno, I could see various combinations of changes being considered reasonable. Perhaps a good starting point would be to wrap the check in this patch with a WARN_ON_ONCE() and let it soak in -next for a while? That would avoid excessive noise from repetitive callers [1] but still allow those callsites to be identified/fixed. If there is some really weird fdatawrite-only caller that conflicts, the change could always be loosened up from there (as unlikely as that seems).. Hm? Brian [1] The use case that identified this problem is a wonky call from the XFS truncate path from a workload that makes this truncate call repeatedly. A WARN_ON_ONCE() would have most definitely been useful IMO, but an unconditional warning would spam the logs in this particular case.
On Mon, 7 Nov 2022 11:33:46 -0500 Brian Foster <bfoster@redhat.com> wrote: > Perhaps a good starting point would be to wrap the check in > this patch with a WARN_ON_ONCE() and let it soak in -next for a while? > That would avoid excessive noise from repetitive callers [1] but still > allow those callsites to be identified/fixed. If there is some really > weird fdatawrite-only caller that conflicts, the change could always be > loosened up from there (as unlikely as that seems).. Hm? Sounds reasonable. Please let's be clear in the changelog why we're adding this. I mean, we could add a zillion checks everywhere for misbehaving callers. Why choose this one place in particular?
On Mon, Nov 07, 2022 at 12:28:02PM -0800, Andrew Morton wrote: > On Mon, 7 Nov 2022 11:33:46 -0500 Brian Foster <bfoster@redhat.com> wrote: > > > Perhaps a good starting point would be to wrap the check in > > this patch with a WARN_ON_ONCE() and let it soak in -next for a while? > > That would avoid excessive noise from repetitive callers [1] but still > > allow those callsites to be identified/fixed. If there is some really > > weird fdatawrite-only caller that conflicts, the change could always be > > loosened up from there (as unlikely as that seems).. Hm? > > Sounds reasonable. > > Please let's be clear in the changelog why we're adding this. I mean, > we could add a zillion checks everywhere for misbehaving callers. Why > choose this one place in particular? > Ok, so TLDR.. this patch is broken as is. I was probably thinking the generic_fadvise() case of end == -1 was cast to unsigned by the caller, but testing shows that is not the case and -1 is passed down as a valid input. This obviously conflicts with the check as proposed here. I suppose it might make some sense to drop the analogous check from __filemap_fdatawait_range() so underlying write/wait behavior is consistent, and perhaps consider adding a higher level check in the write_and_wait_range() wrappers. I'd have to make a pass through some of the callers and think about that some more. I.e., filemap_write_and_wait_range() documents that end = -1 is acceptable, fdatawrite presumably does the right thing in that case, fdatawait skips when end_byte < start_byte, and most callers seem to actually use LLONG_MAX anyways. Brian
diff --git a/mm/filemap.c b/mm/filemap.c index 08341616ae7a..99d8686c9f5c 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -418,6 +418,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start, .range_end = end, }; + if (end < start) + return 0; + return filemap_fdatawrite_wbc(mapping, &wbc); }
A call to file[map]_write_and_wait_range() with an end offset that precedes the start offset but happens to land in the same page can trigger writeback submission but fail to wait on the submitted page. Writeback submission occurs because __filemap_fdatawrite_range() passes both offsets down into write_cache_pages(), which rounds down to page indexes before it starts processing writeback. __filemap_fdatawait_range() immediately returns if the specified end offset precedes the start offset, however. I suspect these checks are primarily intended to handle overflow conditions. I happened to notice this behavior when investigating an unrelated problem and observed that a filemap_write_and_wait_range() call with unexpected parameters had seemingly unpredictable latency. That latency turned out to be the submission path occasionally waiting on writeback state of the page (i.e. from write_cache_pages()) before issuing the currently requested writepage and then unconditionally failing to wait on the latter via __filemap_fdatawait_range(). This could probably be reasonably fixed to either elide the submission, as this patch does, or modify the fdatawait path to check the page indexes instead of the unaligned offsets. After poking around a bit, it seemed more consistent with various other filemap interfaces to check the offsets in the write path and return if the end offset is not >= the start. For example, filemap_range_has_page() and filemap_range_has_writeback() both include similar byte granularity checks. Signed-off-by: Brian Foster <bfoster@redhat.com> --- mm/filemap.c | 3 +++ 1 file changed, 3 insertions(+)