Message ID | 2213409.1677249075@warthog.procyon.org.uk (mailing list archive)
---|---
State | New
Series | [RFC] cifs: Fix cifs_writepages_region()
On Fri, Feb 24, 2023 at 6:31 AM David Howells <dhowells@redhat.com> wrote:
>
> Here's the simplest fix for cifs_writepages_region() that gets it to work.

Hmm. The commit message for this is wrong.

> Fix the cifs_writepages_region() to just skip over members of the batch that
> have been cleaned up rather than retrying them.

It never retried them.

The "skip_write" code did that same

        start += folio_size(folio);
        continue;

that your patch does, but it *also* had that

        if (skips >= 5 || need_resched()) {

thing to just stop writing entirely.

> I'm not entirely sure why it fixes it, though.

Yes. Strange. Because it does the exact same thing as the "Oh, the
trylock worked, but it was still under writeback or fscache" case did.
I just merged all the "skip write" cases.

But the code is clearly (a) not working and (b) the whole skip count
and need_resched() logic is a bit strange to begin with.

Can you humor me, and try if just removing that skip count thing
instead? IOW, this attached patch?

Because that whole "let's stop writing if we need to reschedule"
sounds truly odd (we have a cond_resched(), although it's per folio
batch, not per-folio), and the skip count logic doesn't make much
sense to me either.

SteveF?

Linus
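[For context, a minimal sketch of the two behaviours being compared, handling one locked folio of the batch. The helper name and structure are illustrative assumptions, not the actual fs/cifs/file.c code.]

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>

/*
 * Illustrative sketch only.  Returns false if the caller should stop
 * writing the region entirely.
 */
static bool handle_one_batch_folio(struct address_space *mapping,
				   struct folio *folio, loff_t *start,
				   unsigned int *skips)
{
	if (folio_mapping(folio) != mapping || !folio_test_dirty(folio)) {
		/* Already cleaned up (e.g. by cifs_extend_writeback()):
		 * just step past it and carry on with the batch.
		 */
		*start += folio_size(folio);
		folio_unlock(folio);
		return true;
	}

	if (folio_test_writeback(folio)) {
		/* The "skip_write" path steps past the folio in the same
		 * way, but additionally abandons the whole region after
		 * too many skips or when rescheduling is needed.
		 */
		folio_unlock(folio);
		*start += folio_size(folio);
		if (++(*skips) >= 5 || need_resched())
			return false;	/* stop writing entirely */
		return true;
	}

	/* ...otherwise: clear dirty, set writeback, extend and submit. */
	return true;
}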
On Fri, Feb 24, 2023 at 02:31:15PM +0000, David Howells wrote:
> Here's the simplest fix for cifs_writepages_region() that gets it to work.
>
> Fix the cifs_writepages_region() to just skip over members of the batch that
> have been cleaned up rather than retrying them.  I'm not entirely sure why it
> fixes it, though.  It's also not the most efficient as, in the common case,
> this is going to happen a lot because cifs_extend_writeback() is going to
> clean up the contiguous pages in the batch - and then this skip will occur
> for those.

Why are you doing it this way?  What's wrong with using
write_cache_pages() to push all the contiguous dirty folios into a single
I/O object, submitting it when the folios turn out not to be contiguous,
or when we run out of a batch?

You've written an awful lot of code here and it's a different model from
every other filesystem.  Why is it better?
Matthew Wilcox <willy@infradead.org> wrote:

> Why are you doing it this way?  What's wrong with using
> write_cache_pages() to push all the contiguous dirty folios into a single
> I/O object, submitting it when the folios turn out not to be contiguous,
> or when we run out of a batch?
>
> You've written an awful lot of code here and it's a different model from
> every other filesystem.  Why is it better?

Because write_cache_pages():

 (1) Takes no account of fscache.  I can't just build knowledge of PG_fscache
     into it because PG_private_2 may be used differently by other filesystems
     (btrfs, for example).  (I'm also trying to phase out the use of
     PG_private_2 and instead use PG_writeback to cover both, with the
     difference recorded elsewhere - but that's not there yet.)

 (2) Calls ->writepage() individually for each folio - which is excessive.  In
     AFS's implementation, we locate the first folio, then race through the
     following folios without ever waiting until we hit something that's
     locked or a gap, and then stop and submit (a sketch of this follows
     below).

     write_cache_pages(), otoh, calls us with the next folio already undirtied
     and set for writeback when we find out that we don't want it yet.

 (3) Skips over holes, but at some point in the future we're going to need to
     schedule adjacent clean pages (before and after) for writeback too, to
     handle transport compression and fscache updates if the granule size for
     either is larger than the folio size.

It might be better to take what's in cifs, generalise it and replace
write_cache_pages() with it, then have a "->submit_write()" aop that takes an
ITER_XARRAY iterator to write from.

David
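[A minimal sketch of the model described in point (2): absorb the folios that immediately follow an already-locked dirty folio without ever sleeping, then submit the whole contiguous range as one write. The function name is an assumption, fscache/error handling is omitted, and filemap_get_folio() returning NULL for a hole reflects kernels of this vintage; this is not the actual fs/afs or fs/cifs code.]

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Returns the index just past the last folio absorbed into the write. */
static pgoff_t absorb_following_folios(struct address_space *mapping,
				       pgoff_t next)
{
	struct folio *folio;

	for (;;) {
		folio = filemap_get_folio(mapping, next);
		if (!folio)
			break;			/* hit a hole */
		if (!folio_trylock(folio)) {	/* never wait here */
			folio_put(folio);
			break;
		}
		if (!folio_test_dirty(folio) ||
		    folio_test_writeback(folio)) {
			folio_unlock(folio);
			folio_put(folio);
			break;
		}

		/* Take it: it becomes part of the same contiguous write. */
		folio_clear_dirty_for_io(folio);
		folio_start_writeback(folio);
		next = folio_next_index(folio);
		folio_unlock(folio);
		folio_put(folio);
	}

	/* The caller submits one write covering [first, next) here. */
	return next;
}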
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Can you humor me, and try if just removing that skip count thing
> instead? IOW, this attached patch?

That works too.

> Because that whole "let's stop writing if we need to reschedule" sounds
> truly odd (we have a cond_resched(), although it's per folio batch, not
> per-folio), and the skip count logic doesn't make much sense to me either.

The skip thing, in my code, is only used in WB_SYNC_NONE mode.  If we hit 5
things in progress or rescheduling is required, we return to the caller on the
basis that conflicting flushes appear to be happening in other threads.

David
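[A minimal sketch of that WB_SYNC_NONE gating; the helper name is an assumption for illustration, not the actual cifs code.]

#include <linux/sched.h>
#include <linux/writeback.h>

static bool should_give_up_writing(const struct writeback_control *wbc,
				   unsigned int skips)
{
	/* Data-integrity writeback (WB_SYNC_ALL) must keep going. */
	if (wbc->sync_mode != WB_SYNC_NONE)
		return false;

	/* Background writeback: if we keep hitting folios that are already
	 * being written out, or we need to reschedule, assume conflicting
	 * flushes are in progress in other threads and return to the caller.
	 */
	return skips >= 5 || need_resched();
}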
On Fri, Feb 24, 2023 at 05:15:41PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
>
> > Why are you doing it this way?  What's wrong with using
> > write_cache_pages() to push all the contiguous dirty folios into a single
> > I/O object, submitting it when the folios turn out not to be contiguous,
> > or when we run out of a batch?
> >
> > You've written an awful lot of code here and it's a different model from
> > every other filesystem.  Why is it better?
>
> Because write_cache_pages():
>
>  (1) Takes no account of fscache.  I can't just build knowledge of PG_fscache
>      into it because PG_private_2 may be used differently by other
>      filesystems (btrfs, for example).  (I'm also trying to phase out the use
>      of PG_private_2 and instead use PG_writeback to cover both, with the
>      difference recorded elsewhere - but that's not there yet.)

I don't understand why waiting for fscache here is even the right thing
to do.  The folio is dirty and needs to be written back to the server.
Why should beginning that write wait for the current write to the
fscache to complete?

>  (2) Calls ->writepage() individually for each folio - which is excessive.
>      In AFS's implementation, we locate the first folio, then race through
>      the following folios without ever waiting until we hit something that's
>      locked or a gap, and then stop and submit.
>
>      write_cache_pages(), otoh, calls us with the next folio already
>      undirtied and set for writeback when we find out that we don't want it
>      yet.

I think you're so convinced that your way is better that you don't see
what write_cache_pages() is actually doing.  Yes, the writepage callback
is called once for each folio, but that doesn't actually submit a write.
Instead, they're accumulated into the equivalent of a wdata, and the
wdata is submitted when there's a discontiguity or when we've accumulated
enough to satisfy the constraints of the wbc.

>  (3) Skips over holes, but at some point in the future we're going to need
>      to schedule adjacent clean pages (before and after) for writeback too,
>      to handle transport compression and fscache updates if the granule size
>      for either is larger than the folio size.

Then we'll need it for other filesystems too, so we should enhance
write_cache_pages().

> It might be better to take what's in cifs, generalise it and replace
> write_cache_pages() with it, then have a "->submit_write()" aop that takes an
> ITER_XARRAY iterator to write from.

->writepages _is_ ->submit_write.  Should write_cache_pages() be better?
Maybe!  But it works a whole lot better than what AFS was doing and what
you've now ladled into cifs.
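[A rough sketch of the accumulate-and-submit pattern described above: the per-folio callback issues no I/O itself, it just grows a pending range and flushes it when the folios stop being contiguous. The pending_write structure and function names are hypothetical stand-ins for a cifs wdata, and the exact writepage_t callback signature differs between kernel versions.]

#include <linux/pagemap.h>
#include <linux/writeback.h>

struct pending_write {
	pgoff_t first;		/* first folio index in the range */
	pgoff_t next;		/* index the next folio must have */
	bool active;
};

static void submit_pending(struct pending_write *pw)
{
	if (pw->active) {
		/* ...build one request covering [first, next) and send it... */
		pw->active = false;
	}
}

/* Shaped like a write_cache_pages() callback; extends or flushes the range. */
static int accumulate_folio(struct folio *folio,
			    struct writeback_control *wbc, void *data)
{
	struct pending_write *pw = data;

	if (pw->active && folio->index != pw->next)
		submit_pending(pw);		/* discontiguity: flush */

	if (!pw->active) {
		pw->first = folio->index;
		pw->active = true;
	}
	pw->next = folio_next_index(folio);
	return 0;
}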
On Fri, Feb 24, 2023 at 9:19 AM David Howells <dhowells@redhat.com> wrote:
>
> The skip thing, in my code, is only used in WB_SYNC_NONE mode.  If we hit 5
> things in progress or rescheduling is required, we return to the caller on
> the basis that conflicting flushes appear to be happening in other threads.

Ahh.  *That* is the difference, and I didn't realize.

I made all the skip-write cases the same, and I really meant for that
case to only trigger for WB_SYNC_NONE, but I stupidly didn't notice that
the whole folio_test_dirty() re-test case was done without that
WB_SYNC_NONE case that all the other cases had.

Mea culpa, mea maxima culpa.  That was just me being stupid.

So that case isn't actually a "skip write" case at all, it's actually a
"no write needed at all" case.

Your original patch is the right fix, and I was just being silly for
having not realized.

I'll apply that minimal fix for now - I think the right thing to do is
your bigger patch, but that needs more thinking (or at least splitting
up).

Sorry for the confusion, and thanks for setting me straight,

Linus
On Fri, Feb 24, 2023 at 10:58 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'll apply that minimal fix for now - I think the right thing to do is
> your bigger patch, but that needs more thinking (or at least splitting
> up).

Minimal fix applied - that way I can drop this for now, and we can
discuss the whole "maybe we can just use write_cache_pages()" instead.

Because that _would_ be lovely, even if it's possible that the generic
helper might need some extra love to work better for cifs/afs.

Linus
Matthew Wilcox <willy@infradead.org> wrote:

> I don't understand why waiting for fscache here is even the right thing
> to do.

Then why do we have to wait for PG_writeback to complete?

David
On Fri, Feb 24, 2023 at 12:14 PM David Howells <dhowells@redhat.com> wrote:
>
> Then why do we have to wait for PG_writeback to complete?

At least for PG_writeback, it's about "the _previous_ dirty write is
still under way, but - since PG_dirty is set again - the page has been
dirtied since".

So we have to start _another_ writeback, because while the current
writeback *might* have written the updated data, that is not at all
certain or clear.

I'm not sure what the fscache rules are.

Linus
On Fri, Feb 24, 2023 at 12:16:49PM -0800, Linus Torvalds wrote:
> On Fri, Feb 24, 2023 at 12:14 PM David Howells <dhowells@redhat.com> wrote:
> >
> > Then why do we have to wait for PG_writeback to complete?
>
> At least for PG_writeback, it's about "the _previous_ dirty write is
> still under way, but - since PG_dirty is set again - the page has been
> dirtied since".
>
> So we have to start _another_ writeback, because while the current
> writeback *might* have written the updated data, that is not at all
> certain or clear.

Also, we only have a writeback bit, not a writeback count, and when the
current writeback completes, it'll clear that bit.  We're also being
kind to our backing store by not writing to the same block twice at the
same time.

> I'm not sure what the fscache rules are.

My understanding is that the fscache bit is set under several
circumstances, but if the folio is dirty _and_ the fscache bit is set,
it means the folio is currently being written to the cache device.  I
don't see a conflict there; we can write to the backing store and the
cache device at the same time.
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > Then why do we have to wait for PG_writeback to complete?
>
> At least for PG_writeback, it's about "the _previous_ dirty write is
> still under way, but - since PG_dirty is set again - the page has been
> dirtied since".
>
> So we have to start _another_ writeback, because while the current
> writeback *might* have written the updated data, that is not at all
> certain or clear.

As I understand it, it's also about serialising writes from the same page to
the same backing store.  We don't want them to end up out of order.  I'm not
sure what guarantees, for instance, the block layer gives if two I/O requests
go to the same place.

> I'm not sure what the fscache rules are.

I'm now using PG_fscache in exactly the same way: it means the previous write
to the cache is still under way.  I don't want to start another DIO write to
the cache for the same pages.  Hence the waits/checks on PG_fscache I've added
anywhere we need to wait/check on PG_writeback.

As I mentioned, I'm looking at the possibility of making PG_dirty and
PG_writeback cover *both* cases and recording the difference elsewhere -
thereby returning PG_private_2 to the VM folks who'd like their bit back.
This means, for instance, that when we read from the server and find we need
to write it to the cache, we set a note in the aforementioned elsewhere, mark
the page dirty and leave it to writepages() to effect the write to the cache.

It could get tricky because we have two different places to write to, with
very different characteristics (e.g. a server ~6000km away vs a local SSD),
each with its own queueing, scheduling, bandwidth, etc. - and the local disk
might have to share with the system.

David
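[A minimal sketch of the resulting rule: wait for both kinds of in-flight write before starting another one against the same folio. The helper name is an assumption; folio_wait_writeback() and folio_wait_fscache() are existing kernel helpers.]

#include <linux/netfs.h>
#include <linux/pagemap.h>

static void wait_for_inflight_writes(struct folio *folio)
{
	/* Previous write-back to the server still in flight? */
	if (folio_test_writeback(folio))
		folio_wait_writeback(folio);

	/* Previous DIO write to the local cache still in flight? */
	if (folio_test_fscache(folio))
		folio_wait_fscache(folio);
}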
I am traveling today but will be testing this tonight or tomorrow.
Matthew Wilcox <willy@infradead.org> wrote:

> Also, we only have a writeback bit, not a writeback count, and when the
> current writeback completes, it'll clear that bit.  We're also being
> kind to our backing store by not writing to the same block twice at the
> same time.

It's not so much being kind to the backing store, I think, as avoiding the
possibility that the writes happen out of order.

> > I'm not sure what the fscache rules are.
>
> My understanding is that the fscache bit is set under several
> circumstances, but if the folio is dirty _and_ the fscache bit is set,
> it means the folio is currently being written to the cache device.  I
> don't see a conflict there; we can write to the backing store and the
> cache device at the same time.

The dirty bit has nothing to do with it.  If the fscache bit is set, then the
page is currently being written to the cache - and we need to wait before
starting another write.  Sometimes we start a write to the cache from a clean
page (e.g. we just read it from the server) and sometimes we start a write to
the cache from writepages (e.g. the data is dirty and we're writing it to the
server as well).

Things will become more 'interesting' should we ever get around to
implementing disconnected operation.  Then we might end up staging dirty data
through the cache.

David
Have been doing additional testing of these - and also verified that David's
most recent patch, now in my for-next branch ("cifs: Fix memory leak in direct
I/O"), fixes the remaining problem (the issue with xfstest 208 that Murphy
pointed out).

Will send the three small cifs fixes from David later this week, along with
some unrelated fixes from Paulo and Shyam.

-----Original Message-----
From: Linus Torvalds <torvalds@linux-foundation.org>
Sent: Friday, February 24, 2023 1:06 PM
To: David Howells <dhowells@redhat.com>
Cc: Steven French <Steven.French@microsoft.com>; Vishal Moola <vishal.moola@gmail.com>; Andrew Morton <akpm@linux-foundation.org>; Jan Kara <jack@suse.cz>; pc <pc@cjr.nz>; Matthew Wilcox <willy@infradead.org>; Huang Ying <ying.huang@intel.com>; Baolin Wang <baolin.wang@linux.alibaba.com>; Xin Hao <xhao@linux.alibaba.com>; linux-mm@kvack.org; mm-commits@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: [EXTERNAL] Re: [RFC][PATCH] cifs: Fix cifs_writepages_region()

On Fri, Feb 24, 2023 at 10:58 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'll apply that minimal fix for now - I think the right thing to do is
> your bigger patch, but that needs more thinking (or at least splitting
> up).

Minimal fix applied - that way I can drop this for now, and we can
discuss the whole "maybe we can just use write_cache_pages()" instead.

Because that _would_ be lovely, even if it's possible that the generic
helper might need some extra love to work better for cifs/afs.

Linus
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 5365a3299088..ebfcaae8c437 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2893,8 +2893,9 @@ static int cifs_writepages_region(struct address_space *mapping,
 
 			if (folio_mapping(folio) != mapping ||
 			    !folio_test_dirty(folio)) {
+				start += folio_size(folio);
 				folio_unlock(folio);
-				goto skip_write;
+				continue;
 			}
 
 			if (folio_test_writeback(folio) ||
Here's the simplest fix for cifs_writepages_region() that gets it to work.

Fix the cifs_writepages_region() to just skip over members of the batch that
have been cleaned up rather than retrying them.  I'm not entirely sure why it
fixes it, though.  It's also not the most efficient as, in the common case,
this is going to happen a lot because cifs_extend_writeback() is going to
clean up the contiguous pages in the batch - and then this skip will occur for
those.

Fixes: 3822a7c40997 ("Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")
Signed-off-by: David Howells <dhowells@redhat.com>
---