
[v4,02/28] mm: Add an unlock function for PG_private_2/PG_fscache

Message ID 161539528910.286939.1252328699383291173.stgit@warthog.procyon.org.uk (mailing list archive)
State New, archived
Series Network fs helper library & fscache kiocb API

Commit Message

David Howells March 10, 2021, 4:54 p.m. UTC
Add a function, unlock_page_private_2(), to unlock PG_private_2, analogous to
unlock_page() for PG_locked.  Add a kerneldoc comment giving an example usage
case.

A wrapper will need to be placed in the netfs header by the patch that adds
that header.

[This implements a suggestion by Linus[1] to not mix the terminology of
 PG_private_2 and PG_fscache in the mm core function]

Changes:
- Remove extern from the declaration[2].

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/1330473.1612974547@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/CAHk-=wjgA-74ddehziVk=XAEMTKswPu1Yw4uaro1R3ibs27ztw@mail.gmail.com/ [1]
Link: https://lore.kernel.org/r/20210216102659.GA27714@lst.de/ [2]
Link: https://lore.kernel.org/r/161340387944.1303470.7944159520278177652.stgit@warthog.procyon.org.uk/ # v3
---

 include/linux/pagemap.h |    1 +
 mm/filemap.c            |   20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)
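
For illustration, a minimal sketch of how a network filesystem might pair the
new helper with PG_private_2 (hypothetical helper names; the get_page()/put_page()
pairing mirrors what later patches in the series do, it is not added by this one):

/* Illustrative only: track "this page is being written to the cache". */
static void example_start_cache_write(struct page *page)
{
	get_page(page);			/* pin the page across the cache I/O */
	SetPagePrivate2(page);		/* aka PG_fscache for netfs users */
}

static void example_cache_write_done(struct page *page)
{
	unlock_page_private_2(page);	/* clears PG_private_2 and wakes waiters */
	put_page(page);			/* drop the reference taken above */
}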

Comments

Matthew Wilcox March 16, 2021, 7:07 p.m. UTC | #1
On Wed, Mar 10, 2021 at 04:54:49PM +0000, David Howells wrote:
> Add a function, unlock_page_private_2(), to unlock PG_private_2, analogous to
> unlock_page() for PG_locked.  Add a kerneldoc comment giving an example usage
> case.

This isn't a problem with this patch per se, but I'm concerned about
private2 and expected page refcounts.

static inline int is_page_cache_freeable(struct page *page)
{
        /*
         * A freeable page cache page is referenced only by the caller
         * that isolated the page, the page cache and optional buffer
         * heads at page->private.
         */
        int page_cache_pins = thp_nr_pages(page);
        return page_count(page) - page_has_private(page) == 1 + page_cache_pins;
}

static inline int page_has_private(struct page *page)
{
        return !!(page->flags & PAGE_FLAGS_PRIVATE);
}

#define PAGE_FLAGS_PRIVATE                              \
        (1UL << PG_private | 1UL << PG_private_2)

So ... a page with both flags cleared should have a refcount of N.
A page with one or both flags set should have a refcount of N+1.

How is a poor filesystem supposed to make that true?  Also btrfs has this
problem since it uses private_2 for its own purposes.
David Howells March 16, 2021, 8:38 p.m. UTC | #2
Matthew Wilcox <willy@infradead.org> wrote:

> So ... a page with both flags cleared should have a refcount of N.
> A page with one or both flags set should have a refcount of N+1.
> ...
> How is a poor filesystem supposed to make that true?  Also btrfs has this
> problem since it uses private_2 for its own purposes.

It's simpler if it's N+2 for both flags set.  Btw, patch 13 adds that - and
possibly that should be merged into an earlier patch.

David
Matthew Wilcox March 16, 2021, 11:32 p.m. UTC | #3
On Tue, Mar 16, 2021 at 08:38:00PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > So ... a page with both flags cleared should have a refcount of N.
> > A page with one or both flags set should have a refcount of N+1.
> > ...
> > How is a poor filesystem supposed to make that true?  Also btrfs has this
> > problem since it uses private_2 for its own purposes.
> 
> It's simpler if it's N+2 for both flags set.  Btw, patch 13 adds that - and
> possibly that should be merged into an earlier patch.

So ...

static inline int page_has_private(struct page *page)
{
	unsigned long flags = page->flags;
	return ((flags >> PG_private) & 1) + ((flags >> PG_private_2) & 1);
}

perhaps?
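
For concreteness, a sketch of the accounting that per-flag variant would imply
(illustrative arithmetic only, not merged code):

/*
 * Illustrative accounting, assuming each flag holder also holds a page ref:
 *
 *   base pins        : 1 (isolating caller) + thp_nr_pages(page)
 *   PG_private set   : +1 page ref, +1 from page_has_private()
 *   PG_private_2 set : +1 page ref, +1 from page_has_private()
 *
 * Then the check in is_page_cache_freeable(),
 *
 *   page_count(page) - page_has_private(page) == 1 + thp_nr_pages(page)
 *
 * balances whether zero, one or both flags are set, matching the
 * "N+2 for both flags set" rule suggested above.
 */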
Linus Torvalds March 17, 2021, 12:43 a.m. UTC | #4
[ Adding btrfs people explicitly, maybe they see this on the fs-devel
list, but maybe they don't react .. ]

On Tue, Mar 16, 2021 at 12:07 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> This isn't a problem with this patch per se, but I'm concerned about
> private2 and expected page refcounts.

Ugh. You are very right.

It would be good to just change the rules - I get the feeling nobody
actually depended on them anyway because they were _so_ esoteric.

> static inline int is_page_cache_freeable(struct page *page)
> {
>         /*
>          * A freeable page cache page is referenced only by the caller
>          * that isolated the page, the page cache and optional buffer
>          * heads at page->private.
>          */
>         int page_cache_pins = thp_nr_pages(page);
>         return page_count(page) - page_has_private(page) == 1 + page_cache_pins;

You're right, that "page_has_private()" is really really nasty.

The comment is, I think, the traditional usage case, which used to be
about page->buffers. Obviously these days it is now about
page->private with PG_private set, pointing to buffers
(attach_page_private() and detach_page_private()).

But as you point out:

> #define PAGE_FLAGS_PRIVATE                              \
>         (1UL << PG_private | 1UL << PG_private_2)
>
> So ... a page with both flags cleared should have a refcount of N.
> A page with one or both flags set should have a refcount of N+1.

Could we just remove the PG_private_2 thing in this context entirely,
and make the rule be that

 (a) PG_private means that you have some local private data in
page->private, and that's all that matters for the "freeable" thing.

 (b) PG_private_2 does *not* have the same meaning, and has no bearing
on freeability (and only the refcount matters)

I _think_ the btrfs behavior is to only use PagePrivate2() when it
has a reference to the page, so btrfs doesn't care?

I think fscache is already happy to take the page count when using
PG_private_2 for locking, exactly because I didn't want to have any
confusion about lifetimes. But this "page_has_private()" math ends up
meaning it's confusing anyway.

btrfs people? What are the semantics for PG_private_2? Is it just a
flag, and you really don't want it to have anything to do with any
page lifetime decisions? Or?

        Linus
Josef Bacik March 17, 2021, 2:12 a.m. UTC | #5
On 3/16/21 8:43 PM, Linus Torvalds wrote:
> [ Adding btrfs people explicitly, maybe they see this on the fs-devel
> list, but maybe they don't react .. ]
> 
> On Tue, Mar 16, 2021 at 12:07 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> This isn't a problem with this patch per se, but I'm concerned about
>> private2 and expected page refcounts.
> 
> Ugh. You are very right.
> 
> It would be good to just change the rules - I get the feeling nobody
> actually depended on them anyway because they were _so_ esoteric.
> 
>> static inline int is_page_cache_freeable(struct page *page)
>> {
>>          /*
>>           * A freeable page cache page is referenced only by the caller
>>           * that isolated the page, the page cache and optional buffer
>>           * heads at page->private.
>>           */
>>          int page_cache_pins = thp_nr_pages(page);
>>          return page_count(page) - page_has_private(page) == 1 + page_cache_pins;
> 
> You're right, that "page_has_private()" is really really nasty.
> 
> The comment is, I think, the traditional usage case, which used to be
> about page->buffers. Obviously these days it is now about
> page->private with PG_private set, pointing to buffers
> (attach_page_private() and detach_page_private()).
> 
> But as you point out:
> 
>> #define PAGE_FLAGS_PRIVATE                              \
>>          (1UL << PG_private | 1UL << PG_private_2)
>>
>> So ... a page with both flags cleared should have a refcount of N.
>> A page with one or both flags set should have a refcount of N+1.
> 
> Could we just remove the PG_private_2 thing in this context entirely,
> and make the rule be that
> 
>   (a) PG_private means that you have some local private data in
> page->private, and that's all that matters for the "freeable" thing.
> 
>   (b) PG_private_2 does *not* have the same meaning, and has no bearing
> on freeability (and only the refcount matters)
> 
> I _think_ the btrfs behavior is to only use PagePrivate2() when it
> has a reference to the page, so btrfs doesn't care?
> 
> I think fscache is already happy to take the page count when using
> PG_private_2 for locking, exactly because I didn't want to have any
> confusion about lifetimes. But this "page_has_private()" math ends up
> meaning it's confusing anyway.
> 
> btrfs people? What are the semantics for PG_private_2? Is it just a
> flag, and you really don't want it to have anything to do with any
> page lifetime decisions? Or?
> 

Yeah it's just a flag, we use it to tell that the page is part of a range that 
has been allocated for IO.  The lifetime of the flag is independent of the page, 
but the page is generally either dirty or under writeback, so either it goes through 
truncate and we clear PagePrivate2 there, or it actually goes through IO and is 
cleared before we drop the page in our endio.  We _always_ have PG_private set 
on the page as long as we own it, and PG_private_2 is only set in this IO 
related context, so we're safe there because of the rules around 
PG_dirty/PG_writeback.  We don't need it to have an extra ref for it being set. 
  Thanks,

Josef
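
A schematic of that lifecycle (purely illustrative placeholder functions, not
the actual btrfs code):

/* Illustrative only: PG_private_2 as a plain marker with no ref of its own;
 * the page already carries PG_private (and its ref) while the fs owns it.
 */
static void example_mark_range_for_io(struct page *page)
{
	SetPagePrivate2(page);	/* page is part of a range queued for IO */
}

static void example_range_done(struct page *page)
{
	/* reached from truncate/invalidate or from the write endio path,
	 * before the IO path drops its own reference to the page
	 */
	ClearPagePrivate2(page);
}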
Linus Torvalds March 17, 2021, 2:35 a.m. UTC | #6
On Tue, Mar 16, 2021 at 7:12 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
>
> Yeah it's just a flag, we use it to tell that the page is part of a range that
> has been allocated for IO.  The lifetime of the flag is independent of the page,
> but the page is generally either dirty or under writeback, so either it goes through
> truncate and we clear PagePrivate2 there, or it actually goes through IO and is
> cleared before we drop the page in our endio.

Ok, that's what it looked like from my very limited "looking at a
couple of grep cases", but I didn't go any further than that.

> We _always_ have PG_private set on the page as long as we own it, and
> PG_private_2 is only set in this IO related context, so we're safe
> there because of the rules around PG_dirty/PG_writeback. We don't need
> it to have an extra ref for it being set.

Perfect. That means that at least as far as btrfs is concerned, we
could trivially remove PG_private_2 from that page_has_private() math
- you'd always see the same result anyway, exactly because you have
PG_private set.

And as far as I can tell, fscache doesn't want that PG_private_2 bit
to interact with the random VM lifetime or migration rules either, and
should rely entirely on the page count. David?

There's actually a fair number of page_has_private() users, so we'd
better make sure that's the case. But it's simplified by this bit
really only being used by btrfs (which doesn't care) and fscache, so
this cleanup would basically be entirely up to the whole fscache
series.

Hmm? Objections?

            Linus
David Howells March 17, 2021, 9:04 a.m. UTC | #7
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And as far as I can tell, fscache doesn't want that PG_private_2 bit
> to interact with the random VM lifetime or migration rules either, and
> should rely entirely on the page count. David?

It's slightly complicated for fscache as there are two separate pieces of code
involved:

 (1) For the old fscache code that I'm trying to phase out, it does not take a
     ref when PG_fscache is taken (probably incorrectly), relying instead on
     releasepage, etc. getting called to strip the PG_fscache bit.  PG_fscache
     is held for the lifetime of the page, indicating that fscache knows about
     it and might access it at any time (to write to the cache in the
     background for example or to move pages around in the cache).

     Here PG_fscache should not prevent page eviction or migration and it's
     analogous to PG_private.

     That said, the old fscache code keeps its own radix trees of pages that
     are undergoing write to the cache, so to allow a page to be evicted,
     releasepage and co. have to consult those
     (__fscache_maybe_release_page()).

 (2) For the new netfs lib, PG_fscache is ignored by fscache itself and is
     used by the read helpers.  The helpers simply use it analogously to
     PG_writeback, indicating that there's I/O in progress from this page to
     the cache[*].  It's fine to take a ref here because we know we'll drop it
     shortly.

     Here PG_fscache might prevent page eviction or migration, but only
     because I/O is in progress.  If an increment on the page refcount
     suffices, that's fine.

In both cases, releasepage, etc. look at PG_fscache and decide whether to wait
or not (releasepage may tell the caller to skip the page if PG_fscache is
set).
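
A sketch of the kind of releasepage check this implies (illustrative only; not
the actual afs/fscache code, and the gfp policy shown is just an example):

static int example_releasepage(struct page *page, gfp_t gfp)
{
	if (PageFsCache(page)) {
		/* Cache I/O (or, for the old code, cache interest) is
		 * outstanding: either tell the caller to skip the page
		 * or wait for the bit to clear.
		 */
		if (!(gfp & __GFP_DIRECT_RECLAIM) || !(gfp & __GFP_FS))
			return 0;
		wait_on_page_fscache(page);
	}
	detach_page_private(page);	/* drop any PG_private data */
	return 1;
}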

[*] Willy suggested using PG_writeback to cover both write to the server and
write to the cache, and getting rid of PG_fscache entirely, but that would
require extra mechanisms.  There are three cases:

 (a) We might be writing to only the cache, e.g. because we just read from the
     server.

     Note that this may adversely affect code that does accounting associated
     with PG_writeback because we wouldn't actually be writing back a user-made
     change or dealing with a dirty page.  I'm not sure if that's an issue.

 (b) We might be writing to both, in which case we can expect both writes to
     finish at different times.

 (c) We might only be writing to the server, e.g. because there's no space in
     the cache or there is no cache.

It's something that might make sense, however, and we can look at it in the
future, but for the moment having two separate page flags is simplest.

An additional use of PG_fscache is to prevent a second write to the cache from
being started whilst one is in progress.  I guess that would be taken over by
PG_writeback if we used that.

David
David Howells March 17, 2021, 10:53 a.m. UTC | #8
David Howells <dhowells@redhat.com> wrote:

>  (1) For the old fscache code that I'm trying to phase out, it does not take a
>      ref when PG_fscache is taken (probably incorrectly), relying instead on
>      releasepage, etc. getting called to strip the PG_fscache bit.  PG_fscache
>      is held for the lifetime of the page, indicating that fscache knows about
>      it and might access it at any time (to write to the cache in the
>      background for example or to move pages around in the cache).
> 
>      Here PG_fscache should not prevent page eviction or migration and it's
>      analogous to PG_private.
> 
>      That said, the old fscache code keeps its own radix trees of pages that
>      are undergoing write to the cache, so to allow a page to be evicted,
>      releasepage and co. have to consult those
>      (__fscache_maybe_release_page()).

Note that, ideally, we'll be able to remove the old fscache I/O code in the
next merge window or the one after.

David
Matthew Wilcox March 21, 2021, 10:53 a.m. UTC | #9
On Wed, Mar 10, 2021 at 04:54:49PM +0000, David Howells wrote:
> Add a function, unlock_page_private_2(), to unlock PG_private_2, analogous to
> unlock_page() for PG_locked.  Add a kerneldoc comment giving an example usage
> case.

One of the things which confused me about this was ... where's the other
side?  Where's lock_page_private_2()?  Then I found this:

#ifdef CONFIG_AFS_FSCACHE
        if (PageFsCache(page) &&
            wait_on_page_bit_killable(page, PG_fscache) < 0)
                return VM_FAULT_RETRY;
#endif

Please respect the comment!

/*
 * This is exported only for wait_on_page_locked/wait_on_page_writeback, etc.,
 * and should not be used directly.
 */
extern void wait_on_page_bit(struct page *page, int bit_nr);
extern int wait_on_page_bit_killable(struct page *page, int bit_nr);

I think we need the exported API to be wait_on_page_private_2(), and
AFS needs to not tinker in the guts of filemap.  Otherwise you miss
out on bugfixes like c2407cf7d22d0c0d94cf20342b3b8f06f1d904e7 (see also
https://lore.kernel.org/linux-fsdevel/20210320054104.1300774-4-willy@infradead.org/T/#u
).

That also brings up that there is no set_page_private_2().  I think
that's OK -- you only set PageFsCache() immediately after reading the
page from the server.  But I feel this "unlock_page_private_2" is actually
"clear_page_private_2" -- ie it's equivalent to writeback, not to lock.

> +++ b/mm/filemap.c
> @@ -1432,6 +1432,26 @@ void unlock_page(struct page *page)
>  }
>  EXPORT_SYMBOL(unlock_page);
>  
> +/**
> + * unlock_page_private_2 - Unlock a page that's locked with PG_private_2
> + * @page: The page
> + *
> + * Unlocks a page that's locked with PG_private_2 and wakes up sleepers in
> + * wait_on_page_private_2().
> + *
> + * This is, for example, used when a netfs page is being written to a local
> + * disk cache, thereby allowing writes to the cache for the same page to be
> + * serialised.
> + */
> +void unlock_page_private_2(struct page *page)
> +{
> +	page = compound_head(page);
> +	VM_BUG_ON_PAGE(!PagePrivate2(page), page);
> +	clear_bit_unlock(PG_private_2, &page->flags);
> +	wake_up_page_bit(page, PG_private_2);
> +}
> +EXPORT_SYMBOL(unlock_page_private_2);
> +
>  /**
David Howells March 22, 2021, 10:56 a.m. UTC | #10
Matthew Wilcox <willy@infradead.org> wrote:

> That also brings up that there is no set_page_private_2().  I think
> that's OK -- you only set PageFsCache() immediately after reading the
> page from the server.  But I feel this "unlock_page_private_2" is actually
> "clear_page_private_2" -- ie it's equivalent to writeback, not to lock.

How about I do the following:

 (1) Add set_page_private_2() or mark_page_private_2() to set the PG_private_2
     bit.  It could take a ref on the page here.

 (2) Rename unlock_page_private_2() to end_page_private_2().  It could drop
     the ref on the page here, but that then means I can't use
     pagevec_release().

 (3) Add wait_on_page_private_2() as an analogue of wait_on_page_writeback()
     rather than wait_on_page_locked().

 (4) Provide fscache synonyms of the above.

David
David Howells March 23, 2021, 1:17 p.m. UTC | #11
David Howells <dhowells@redhat.com> wrote:

> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > That also brings up that there is no set_page_private_2().  I think
> > that's OK -- you only set PageFsCache() immediately after reading the
> > page from the server.  But I feel this "unlock_page_private_2" is actually
> > "clear_page_private_2" -- ie it's equivalent to writeback, not to lock.
> 
> How about I do the following:
> 
>  (1) Add set_page_private_2() or mark_page_private_2() to set the PG_private_2
>      bit.  It could take a ref on the page here.
> 
>  (2) Rename unlock_page_private_2() to end_page_private_2().  It could drop
>      the ref on the page here, but that then means I can't use
>      pagevec_release().
> 
>  (3) Add wait_on_page_private_2() as an analogue of wait_on_page_writeback()
>      rather than wait_on_page_locked().
> 
>  (4) Provide fscache synonyms of the above.

Perhaps something like the attached changes (they'll need merging back into
the other patches).

David
---
 include/linux/pagemap.h |   21 +++++++++++++++++-
 include/linux/netfs.h   |   54 ++++++++++++++++++++++++++++++++++++------------
 fs/afs/write.c          |    5 ++--
 fs/netfs/read_helper.c  |   17 +++++----------
 mm/filemap.c            |   49 +++++++++++++++++++++++++++++++++++++++----
 mm/page-writeback.c     |   25 ++++++++++++++++++++++
 6 files changed, 139 insertions(+), 32 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index bf05e99ce588..5c14a9365aae 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -591,7 +591,6 @@ extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 				unsigned int flags);
 extern void unlock_page(struct page *page);
-void unlock_page_private_2(struct page *page);
 
 /*
  * Return true if the page was successfully locked
@@ -684,11 +683,31 @@ static inline int wait_on_page_locked_killable(struct page *page)
 
 int put_and_wait_on_page_locked(struct page *page, int state);
 void wait_on_page_writeback(struct page *page);
+int wait_on_page_writeback_killable(struct page *page);
 extern void end_page_writeback(struct page *page);
 void wait_for_stable_page(struct page *page);
 
 void page_endio(struct page *page, bool is_write, int err);
 
+/**
+ * set_page_private_2 - Set PG_private_2 on a page and take a ref
+ * @page: The page.
+ *
+ * Set the PG_private_2 flag on a page and take the reference needed for the VM
+ * to handle its lifetime correctly.  This sets the flag and takes the
+ * reference unconditionally, so care must be taken not to set the flag again
+ * if it's already set.
+ */
+static inline void set_page_private_2(struct page *page)
+{
+	get_page(page);
+	SetPagePrivate2(page);
+}
+
+void end_page_private_2(struct page *page);
+void wait_on_page_private_2(struct page *page);
+int wait_on_page_private_2_killable(struct page *page);
+
 /*
  * Add an arbitrary waiter to a page's wait queue
  */
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 9d3fbed4e30a..2299e7662ff0 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -29,32 +29,60 @@
 #define TestClearPageFsCache(page)	TestClearPagePrivate2((page))
 
 /**
- * unlock_page_fscache - Unlock a page that's locked with PG_fscache
- * @page: The page
+ * set_page_fscache - Set PG_fscache on a page and take a ref
+ * @page: The page.
  *
- * Unlocks a page that's locked with PG_fscache and wakes up sleepers in
- * wait_on_page_fscache().  This page bit is used by the netfs helpers when a
- * netfs page is being written to a local disk cache, thereby allowing writes
- * to the cache for the same page to be serialised.
+ * Set the PG_fscache (PG_private_2) flag on a page and take the reference
+ * needed for the VM to handle its lifetime correctly.  This sets the flag and
+ * takes the reference unconditionally, so care must be taken not to set the
+ * flag again if it's already set.
  */
-static inline void unlock_page_fscache(struct page *page)
+static inline void set_page_fscache(struct page *page)
 {
-	unlock_page_private_2(page);
+	set_page_private_2(page);
 }
 
 /**
- * wait_on_page_fscache - Wait for PG_fscache to be cleared on a page
+ * end_page_fscache - Clear PG_fscache and release any waiters
  * @page: The page
  *
- * Wait for the PG_fscache (PG_private_2) page bit to be removed from a page.
- * This is, for example, used to handle a netfs page being written to a local
+ * Clear the PG_fscache (PG_private_2) bit on a page and wake up any sleepers
+ * waiting for this.  The page ref held for PG_private_2 being set is released.
+ *
+ * This is, for example, used when a netfs page is being written to a local
  * disk cache, thereby allowing writes to the cache for the same page to be
  * serialised.
  */
+static inline void end_page_fscache(struct page *page)
+{
+	end_page_private_2(page);
+}
+
+/**
+ * wait_on_page_fscache - Wait for PG_fscache to be cleared on a page
+ * @page: The page to wait on
+ *
+ * Wait for PG_fscache (aka PG_private_2) to be cleared on a page.
+ */
 static inline void wait_on_page_fscache(struct page *page)
 {
-	if (PageFsCache(page))
-		wait_on_page_bit(compound_head(page), PG_fscache);
+	wait_on_page_private_2(page);
+}
+
+/**
+ * wait_on_page_fscache_killable - Wait for PG_fscache to be cleared on a page
+ * @page: The page to wait on
+ *
+ * Wait for PG_fscache (aka PG_private_2) to be cleared on a page or until a
+ * fatal signal is received by the calling task.
+ *
+ * Return:
+ * - 0 if successful.
+ * - -EINTR if a fatal signal was encountered.
+ */
+static inline int wait_on_page_fscache_killable(struct page *page)
+{
+	return wait_on_page_private_2_killable(page);
 }
 
 enum netfs_read_source {
diff --git a/fs/afs/write.c b/fs/afs/write.c
index b2e03de09c24..9f82e2bb463e 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -846,7 +846,7 @@ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
 	 */
 #ifdef CONFIG_AFS_FSCACHE
 	if (PageFsCache(page) &&
-	    wait_on_page_bit_killable(page, PG_fscache) < 0)
+	    wait_on_page_fscache_killable(page) < 0)
 		return VM_FAULT_RETRY;
 #endif
 
@@ -861,7 +861,8 @@ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
 	 * details the portion of the page we need to write back and we might
 	 * need to redirty the page if there's a problem.
 	 */
-	wait_on_page_writeback(page);
+	if (wait_on_page_writeback_killable(page) < 0)
+		return VM_FAULT_RETRY | VM_FAULT_LOCKED;
 
 	priv = afs_page_dirty(page, 0, thp_size(page));
 	priv = afs_page_dirty_mmapped(priv);
diff --git a/fs/netfs/read_helper.c b/fs/netfs/read_helper.c
index 56e90e0388f2..2b23584499b2 100644
--- a/fs/netfs/read_helper.c
+++ b/fs/netfs/read_helper.c
@@ -239,13 +239,10 @@ static void netfs_rreq_unmark_after_write(struct netfs_read_request *rreq,
 					  bool was_async)
 {
 	struct netfs_read_subrequest *subreq;
-	struct pagevec pvec;
 	struct page *page;
 	pgoff_t unlocked = 0;
 	bool have_unlocked = false;
 
-	pagevec_init(&pvec);
-
 	rcu_read_lock();
 
 	list_for_each_entry(subreq, &rreq->subrequests, rreq_link) {
@@ -258,9 +255,7 @@ static void netfs_rreq_unmark_after_write(struct netfs_read_request *rreq,
 			if (have_unlocked && page->index <= unlocked)
 				continue;
 			unlocked = page->index;
-			unlock_page_fscache(page);
-			if (pagevec_add(&pvec, page) == 0)
-				pagevec_release(&pvec);
+			end_page_fscache(page);
 			have_unlocked = true;
 		}
 	}
@@ -419,10 +414,8 @@ static void netfs_rreq_unlock(struct netfs_read_request *rreq)
 				pg_failed = true;
 				break;
 			}
-			if (test_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags)) {
-				get_page(page);
-				SetPageFsCache(page);
-			}
+			if (test_bit(NETFS_SREQ_WRITE_TO_CACHE, &subreq->flags))
+				set_page_fscache(page);
 			pg_failed |= subreq_failed;
 			if (pgend < iopos + subreq->len)
 				break;
@@ -1167,7 +1160,9 @@ int netfs_write_begin(struct file *file, struct address_space *mapping,
 		goto error;
 
 have_page:
-	wait_on_page_fscache(page);
+	ret = wait_on_page_fscache_killable(page);
+	if (ret < 0)
+		goto error;
 have_page_no_wait:
 	if (netfs_priv)
 		ops->cleanup(netfs_priv, mapping);
diff --git a/mm/filemap.c b/mm/filemap.c
index 925964b67583..788b71e8a72d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1433,24 +1433,63 @@ void unlock_page(struct page *page)
 EXPORT_SYMBOL(unlock_page);
 
 /**
- * unlock_page_private_2 - Unlock a page that's locked with PG_private_2
+ * end_page_private_2 - Clear PG_private_2 and release any waiters
  * @page: The page
  *
- * Unlocks a page that's locked with PG_private_2 and wakes up sleepers in
- * wait_on_page_private_2().
+ * Clear the PG_private_2 bit on a page and wake up any sleepers waiting for
+ * this.  The page ref held for PG_private_2 being set is released.
  *
  * This is, for example, used when a netfs page is being written to a local
  * disk cache, thereby allowing writes to the cache for the same page to be
  * serialised.
  */
-void unlock_page_private_2(struct page *page)
+void end_page_private_2(struct page *page)
 {
 	page = compound_head(page);
 	VM_BUG_ON_PAGE(!PagePrivate2(page), page);
 	clear_bit_unlock(PG_private_2, &page->flags);
 	wake_up_page_bit(page, PG_private_2);
+	put_page(page);
+}
+EXPORT_SYMBOL(end_page_private_2);
+
+/**
+ * wait_on_page_private_2 - Wait for PG_private_2 to be cleared on a page
+ * @page: The page to wait on
+ *
+ * Wait for PG_private_2 (aka PG_fscache) to be cleared on a page.
+ */
+void wait_on_page_private_2(struct page *page)
+{
+	while (PagePrivate2(page))
+		wait_on_page_bit(page, PG_private_2);
+}
+EXPORT_SYMBOL(wait_on_page_private_2);
+
+/**
+ * wait_on_page_private_2_killable - Wait for PG_private_2 to be cleared on a page
+ * @page: The page to wait on
+ *
+ * Wait for PG_private_2 (aka PG_fscache) to be cleared on a page or until a
+ * fatal signal is received by the calling task.
+ *
+ * Return:
+ * - 0 if successful.
+ * - -EINTR if a fatal signal was encountered.
+ */
+int wait_on_page_private_2_killable(struct page *page)
+{
+	int ret = 0;
+
+	while (PagePrivate2(page)) {
+		ret = wait_on_page_bit_killable(page, PG_private_2);
+		if (ret < 0)
+			break;
+	}
+
+	return ret;
 }
-EXPORT_SYMBOL(unlock_page_private_2);
+EXPORT_SYMBOL(wait_on_page_private_2_killable);
 
 /**
  * end_page_writeback - end writeback against a page
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index eb34d204d4ee..b8bad275f94b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2833,6 +2833,31 @@ void wait_on_page_writeback(struct page *page)
 }
 EXPORT_SYMBOL_GPL(wait_on_page_writeback);
 
+/**
+ * wait_on_page_writeback_killable - Wait for a page to complete writeback
+ * @page: The page to wait on
+ *
+ * Wait for the writeback status of a page to clear or a fatal signal to occur.
+ *
+ * Return:
+ * - 0 on success.
+ * - -EINTR if a fatal signal was encountered.
+ */
+int wait_on_page_writeback_killable(struct page *page)
+{
+	int ret = 0;
+
+	while (PageWriteback(page)) {
+		trace_wait_on_page_writeback(page, page_mapping(page));
+		ret = wait_on_page_bit_killable(page, PG_writeback);
+		if (ret < 0)
+			break;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(wait_on_page_writeback_killable);
+
 /**
  * wait_for_stable_page() - wait for writeback to finish, if necessary.
  * @page:	The page to wait on.
Matthew Wilcox March 23, 2021, 1:51 p.m. UTC | #12
On Tue, Mar 23, 2021 at 01:17:20PM +0000, David Howells wrote:
> +++ b/fs/afs/write.c
> @@ -846,7 +846,7 @@ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
>  	 */
>  #ifdef CONFIG_AFS_FSCACHE
>  	if (PageFsCache(page) &&
> -	    wait_on_page_bit_killable(page, PG_fscache) < 0)
> +	    wait_on_page_fscache_killable(page) < 0)
>  		return VM_FAULT_RETRY;
>  #endif
>  
> @@ -861,7 +861,8 @@ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
>  	 * details the portion of the page we need to write back and we might
>  	 * need to redirty the page if there's a problem.
>  	 */
> -	wait_on_page_writeback(page);
> +	if (wait_on_page_writeback_killable(page) < 0)
> +		return VM_FAULT_RETRY | VM_FAULT_LOCKED;

You forgot to unlock the page.  Also, if you're waiting killably here,
do you need to wait before you get the page lock?  Ditto for waiting on
fscache -- do you want to do that before or after you get the page lock?

Also, I never quite understood why you needed to wait for fscache
writes to finish before allowing the page to be dirtied.  Is this a
wait_for_stable_page() kind of situation, where the cache might be
calculating a checksum on it?  Because as far as I can tell, once the
page is dirty in RAM, the contents of the on-disk cache are irrelevant ...
unless they're part of a RAID 5 checksum kind of situation.

I didn't spot any other problems ...
David Howells March 23, 2021, 2:16 p.m. UTC | #13
Matthew Wilcox <willy@infradead.org> wrote:

> On Tue, Mar 23, 2021 at 01:17:20PM +0000, David Howells wrote:
> > +++ b/fs/afs/write.c
> > @@ -846,7 +846,7 @@ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
> >  	 */
> >  #ifdef CONFIG_AFS_FSCACHE
> >  	if (PageFsCache(page) &&
> > -	    wait_on_page_bit_killable(page, PG_fscache) < 0)
> > +	    wait_on_page_fscache_killable(page) < 0)
> >  		return VM_FAULT_RETRY;
> >  #endif
> >  
> > @@ -861,7 +861,8 @@ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf)
> >  	 * details the portion of the page we need to write back and we might
> >  	 * need to redirty the page if there's a problem.
> >  	 */
> > -	wait_on_page_writeback(page);
> > +	if (wait_on_page_writeback_killable(page) < 0)
> > +		return VM_FAULT_RETRY | VM_FAULT_LOCKED;
> 
> You forgot to unlock the page.

Do I need to?  Doesn't VM_FAULT_LOCKED indicate that to the caller?  Or is it
impermissible to do it like that?

> Also, if you're waiting killably here, do you need to wait before you get
> the page lock?  Ditto for waiting on fscache -- do you want to do that
> before or after you get the page lock?

I'm waiting both before and after.  If I wait before, write() can go and
trample over the page between PG_writeback/PG_fscache being cleared and us
getting the lock here.  Probably I should only be waiting after locking the
page.

> Also, I never quite understood why you needed to wait for fscache
> writes to finish before allowing the page to be dirtied.  Is this a
> wait_for_stable_page() kind of situation, where the cache might be
> calculating a checksum on it?  Because as far as I can tell, once the
> page is dirty in RAM, the contents of the on-disk cache are irrelevant ...
> unless they're part of a RAID 5 checksum kind of situation.

Um.  I do want to add disconnected operation in the future and cache
encryption, but, as things currently stand, it isn't necessary because the
cache object is marked "in use" and will be discarded on rebinding after a
power loss or crash if it's still marked when it's opened again.

Also, the thought has occurred to me that I can make use of reflink copy to
handle the caching of local modifications to cached files, in which case I'd
rather have a clean copy to link from.

David
David Howells March 23, 2021, 10:06 p.m. UTC | #14
David Howells <dhowells@redhat.com> wrote:

> > > -	wait_on_page_writeback(page);
> > > +	if (wait_on_page_writeback_killable(page) < 0)
> > > +		return VM_FAULT_RETRY | VM_FAULT_LOCKED;
> > 
> > You forgot to unlock the page.
> 
> Do I need to?  Doesn't VM_FAULT_LOCKED indicate that to the caller?  Or is it
> impermissible to do it like that?

Looks like, yes, I do need to.  VM_FAULT_LOCKED is ignored if RETRY is given.

David
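
A sketch of how the hunk above might end up once the page is unlocked before
retrying (illustrative; not the final committed code):

	if (wait_on_page_writeback_killable(page) < 0) {
		unlock_page(page);	/* VM_FAULT_LOCKED is ignored with VM_FAULT_RETRY */
		return VM_FAULT_RETRY;
	}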

Patch

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 20225b067583..ac91b33f3873 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -591,6 +591,7 @@  extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 				unsigned int flags);
 extern void unlock_page(struct page *page);
+void unlock_page_private_2(struct page *page);
 
 /*
  * Return true if the page was successfully locked
diff --git a/mm/filemap.c b/mm/filemap.c
index 43700480d897..925964b67583 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1432,6 +1432,26 @@  void unlock_page(struct page *page)
 }
 EXPORT_SYMBOL(unlock_page);
 
+/**
+ * unlock_page_private_2 - Unlock a page that's locked with PG_private_2
+ * @page: The page
+ *
+ * Unlocks a page that's locked with PG_private_2 and wakes up sleepers in
+ * wait_on_page_private_2().
+ *
+ * This is, for example, used when a netfs page is being written to a local
+ * disk cache, thereby allowing writes to the cache for the same page to be
+ * serialised.
+ */
+void unlock_page_private_2(struct page *page)
+{
+	page = compound_head(page);
+	VM_BUG_ON_PAGE(!PagePrivate2(page), page);
+	clear_bit_unlock(PG_private_2, &page->flags);
+	wake_up_page_bit(page, PG_private_2);
+}
+EXPORT_SYMBOL(unlock_page_private_2);
+
 /**
  * end_page_writeback - end writeback against a page
  * @page: the page