mm, zswap: don't touch the XArray lock if there is no entry to free

Message ID 20241018192525.95862-1-ryncsn@gmail.com (mailing list archive)
State New
Series mm, zswap: don't touch the XArray lock if there is no entry to free

Commit Message

Kairui Song Oct. 18, 2024, 7:25 p.m. UTC
From: Kairui Song <kasong@tencent.com>

zswap_invalidate() already avoids touching the XArray if the whole
tree is empty, which is mostly beneficial when zswap is disabled.
This commit takes it further by optimizing the case where zswap is
enabled.

To reduce lock overhead, load the XArray value locklessly first and
keep the walk state, then perform a locked erase only if an entry is
found, minimizing unnecessary XArray lock acquisitions.
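
In rough outline, the invalidation path becomes the following (a sketch
of the pattern; the full diff is at the end of this page):

	rcu_read_lock();
	entry = xas_load(&xas);			/* lockless walk */
	if (entry) {
		xas_lock(&xas);
		/* the entry is pinned by the caller, it cannot go away */
		WARN_ON_ONCE(xas_reload(&xas) != entry);
		xas_store(&xas, NULL);		/* erase using the kept walk state */
		xas_unlock(&xas);
		zswap_entry_free(entry);
	}
	rcu_read_unlock();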

The tests below were done with a 4G brd swap device with the
BLK_FEAT_SYNCHRONOUS flag dropped to simulate a fast SSD, with zswap
enabled, on a 32-core system:

Swapin of 4G of mixed zero and 0x1 filled pages (avg of 12 test runs):
Before:         After (-1.6%):
2315237 us      2277721 us

Swapin of 2G of 0x1 filled pages (avg of 24 test runs):
Before:         After (-0.5%):
4623561 us      4600406 us

Linux kernel build test with a 2G memory cgroup limit (avg of 12 test
runs, make -j32):
Before:         After (-0.2%):
1334.35s        1331.63s

Swapin of 2G of 0x1 filled pages, but with zswap disabled (avg of 24 test runs):
Before:         After (+0.0%):
2513837 us      2514437 us

The zswap-enabled tests are a little faster, and the zswap-disabled
case is unchanged within noise.

Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Kairui Song <kasong@tencent.com>

---

A previous patch [1] has been Acked and is now in mm-unstable; it is a
valid optimization on its own. This patch was suggested by Yosry during
the discussion of that one. It targets a slightly different case (zswap
enabled rather than zswap disabled), so instead of sending a V2 I am
sending this as an incremental optimization, and tested it a little
differently.

Link: https://lore.kernel.org/linux-mm/20241011171950.62684-1-ryncsn@gmail.com/ [1]

 mm/zswap.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

Comments

Matthew Wilcox Oct. 18, 2024, 7:46 p.m. UTC | #1
On Sat, Oct 19, 2024 at 03:25:25AM +0800, Kairui Song wrote:
>  	if (xa_empty(tree))
>  		return;
>  
> -	entry = xa_erase(tree, offset);
> -	if (entry)
> +	rcu_read_lock();
> +	entry = xas_load(&xas);
> +	if (entry) {

You should call xas_reset() here.  And I'm not sure it's a great idea to
spin waiting for the xa lock while holding the RCU read lock?  Probably
not awful but I could easily be wrong.

> +		xas_lock(&xas);
> +		WARN_ON_ONCE(xas_reload(&xas) != entry);
> +		xas_store(&xas, NULL);
> +		xas_unlock(&xas);
>  		zswap_entry_free(entry);
> +	}
> +	rcu_read_unlock();
>  }
>  
>  int zswap_swapon(int type, unsigned long nr_pages)
> -- 
> 2.47.0
> 
>
Kairui Song Oct. 18, 2024, 8:01 p.m. UTC | #2
On Sat, Oct 19, 2024 at 3:46 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Oct 19, 2024 at 03:25:25AM +0800, Kairui Song wrote:
> >       if (xa_empty(tree))
> >               return;
> >
> > -     entry = xa_erase(tree, offset);
> > -     if (entry)
> > +     rcu_read_lock();
> > +     entry = xas_load(&xas);
> > +     if (entry) {
>
> You should call xas_reset() here.  And I'm not sure it's a great idea to
> spin waiting for the xa lock while holding the RCU read lock?  Probably
> not awful but I could easily be wrong.

Thanks for the review. I thought about it; that could cancel out this optimization.

Oh, and there is a thing I forgot to mention (maybe I should add some
comments about it?). If xas_load found an entry, that entry must be
pinned by HAS_CACHE or swap slot count right now, and one entry can
only be freed once.
So it should be safe here?

This might be a little fragile though, maybe this optimization can
better be done after some zswap invalidation path cleanup.

>
> > +             xas_lock(&xas);
> > +             WARN_ON_ONCE(xas_reload(&xas) != entry);
> > +             xas_store(&xas, NULL);
> > +             xas_unlock(&xas);
> >               zswap_entry_free(entry);
> > +     }
> > +     rcu_read_unlock();
> >  }
> >
> >  int zswap_swapon(int type, unsigned long nr_pages)
> > --
> > 2.47.0
> >
> >
>
Yosry Ahmed Oct. 18, 2024, 8:40 p.m. UTC | #3
On Fri, Oct 18, 2024 at 1:01 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Oct 19, 2024 at 3:46 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Sat, Oct 19, 2024 at 03:25:25AM +0800, Kairui Song wrote:
> > >       if (xa_empty(tree))
> > >               return;
> > >
> > > -     entry = xa_erase(tree, offset);
> > > -     if (entry)
> > > +     rcu_read_lock();
> > > +     entry = xas_load(&xas);
> > > +     if (entry) {
> >
> > You should call xas_reset() here.

Oh I thought xas_reload() is enough here to check that the entry is
still there after the lock is acquired. Do we have to start the walk
over after holding the lock?

If yes, it seems like that would be equivalent to the following:

entry = xa_load(tree, offset);
if (entry)
	xa_erase(tree, offset);

> > And I'm not sure it's a great idea to
> > spin waiting for the xa lock while holding the RCU read lock?  Probably
> > not awful but I could easily be wrong.

If we end up using xa_load() and xa_erase() then we avoid that, but
then we'd need to walk the xarray twice. I thought we could avoid the
rewalk with xas_reload(). I am not sure if the xa_load() check would
still be worth it at this point -- or maybe the second walk will be
much faster as everything will be cache hot? Idk.

Matthew, any prior experience with such patterns of lockless lookups
followed by a conditional locked operation?

>
> Thanks for the review. I thought about it; that could cancel out this optimization.
>
> Oh, and there is a thing I forgot to mention (maybe I should add some
> comments about it?). If xas_load found an entry, that entry must be
> pinned by HAS_CACHE or swap slot count right now, and one entry can
> only be freed once.
> So it should be safe here?
>
> This might be a little fragile though, maybe this optimization can
> better be done after some zswap invalidation path cleanup.

The only guarantee that we are requiring from the caller here is that
the swap entry is stable, i.e. is not freed and reused while
zswap_invalidate() is running. This seems to be a reasonable
assumption, or did I miss something here?
Matthew Wilcox Oct. 18, 2024, 8:55 p.m. UTC | #4
On Fri, Oct 18, 2024 at 01:40:18PM -0700, Yosry Ahmed wrote:
> Oh I thought xas_reload() is enough here to check that the entry is
> still there after the lock is acquired. Do we have to start the walk
> over after holding the lock?

Yes.  The entry is guaranteed to still be valid, but the node you're
looking in might have been freed, so you can't modify the node without
making sure the node is still in the tree.  We could make that cheaper
than a rewalk, but you're going to need to write that code since you're
the first to want to do something like this.
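
A rough sketch of what restarting the walk under the lock would look
like (at which point it is doing essentially the same work as
xa_erase()):

	xas_lock(&xas);
	xas_reset(&xas);		/* discard the stale walk state */
	if (xas_load(&xas) == entry)	/* rewalk from the top, under the lock */
		xas_store(&xas, NULL);
	xas_unlock(&xas);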
Yosry Ahmed Oct. 18, 2024, 9 p.m. UTC | #5
On Fri, Oct 18, 2024 at 1:55 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Oct 18, 2024 at 01:40:18PM -0700, Yosry Ahmed wrote:
> > Oh I thought xas_reload() is enough here to check that the entry is
> > still there after the lock is acquired. Do we have to start the walk
> > over after holding the lock?
>
> Yes.  The entry is guaranteed to still be valid, but the node you're
> looking in might have been freed, so you can't modify the node without
> making sure the node is still in the tree.  We could make that cheaper
> than a rewalk, but you're going to need to write that code since you're
> the first to want to do something like this.

I see, thanks for elaborating.

Could you confirm if the current patch with the xas_reset() added
would be equivalent to just checking xa_load() before using
xa_erase()?
Matthew Wilcox Oct. 18, 2024, 9:38 p.m. UTC | #6
On Fri, Oct 18, 2024 at 02:00:16PM -0700, Yosry Ahmed wrote:
> On Fri, Oct 18, 2024 at 1:55 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Oct 18, 2024 at 01:40:18PM -0700, Yosry Ahmed wrote:
> > > Oh I thought xas_reload() is enough here to check that the entry is
> > > still there after the lock is acquired. Do we have to start the walk
> > > over after holding the lock?
> >
> > Yes.  The entry is guaranteed to still be valid, but the node you're
> > looking in might have been freed, so you can't modify the node without
> > making sure the node is still in the tree.  We could make that cheaper
> > than a rewalk, but you're going to need to write that code since you're
> > the first to want to do something like this.
> 
> I see, thanks for elaborating.
> 
> Could you confirm if the current patch with the xas_reset() added
> would be equivalent to just checking xa_load() before using
> xa_erase()?

Yes, I think it would, so it's probably a poor tradeoff.
Yosry Ahmed Oct. 18, 2024, 10:28 p.m. UTC | #7
On Fri, Oct 18, 2024 at 2:38 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Oct 18, 2024 at 02:00:16PM -0700, Yosry Ahmed wrote:
> > On Fri, Oct 18, 2024 at 1:55 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, Oct 18, 2024 at 01:40:18PM -0700, Yosry Ahmed wrote:
> > > > Oh I thought xas_reload() is enough here to check that the entry is
> > > > still there after the lock is acquired. Do we have to start the walk
> > > > over after holding the lock?
> > >
> > > Yes.  The entry is guaranteed to still be valid, but the node you're
> > > looking in might have been freed, so you can't modify the node without
> > > making sure the node is still in the tree.  We could make that cheaper
> > > than a rewalk, but you're going to need to write that code since you're
> > > the first to want to do something like this.
> >
> > I see, thanks for elaborating.
> >
> > Could you confirm if the current patch with the xas_reset() added
> > would be equivalent to just checking xa_load() before using
> > xa_erase()?
>
> Yes, I think it would, so it's probably a poor tradeoff.

Thanks. Kairui, please feel free to drop this or if you want you can
check if checking xa_load() before xa_erase() helps.
Johannes Weiner Oct. 18, 2024, 10:38 p.m. UTC | #8
On Sat, Oct 19, 2024 at 04:01:18AM +0800, Kairui Song wrote:
> On Sat, Oct 19, 2024 at 3:46 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Sat, Oct 19, 2024 at 03:25:25AM +0800, Kairui Song wrote:
> > >       if (xa_empty(tree))
> > >               return;
> > >
> > > -     entry = xa_erase(tree, offset);
> > > -     if (entry)
> > > +     rcu_read_lock();
> > > +     entry = xas_load(&xas);
> > > +     if (entry) {
> >
> > You should call xas_reset() here.  And I'm not sure it's a great idea to
> > spin waiting for the xa lock while holding the RCU read lock?  Probably
> > not awful but I could easily be wrong.

Spinlocks already implicitly acquire an RCU read-side lock before
beginning to spin, so we shouldn't be worse for wear by doing this.

> Thanks for the review. I thought about it; that could cancel out this optimization.
> 
> Oh, and there is a thing I forgot to mention (maybe I should add some
> comments about it?). If xas_load found an entry, that entry must be
> pinned by HAS_CACHE or swap slot count right now, and one entry can
> only be freed once.
> So it should be safe here?
> 
> This might be a little fragile though, maybe this optimization can
> better be done after some zswap invalidation path cleanup.

This seems fine too, exclusivity during invalidation is a fundamental
property of swap. If a load were possible, we'd be freeing an entry
with ptes pointing to it (or readahead a slot whose backing space has
been discarded). If a store were possible, we could write new data
into a dead slot and lose it. Even the swapcache bypass path in
do_swap_page() must at least acquire HAS_CACHE due to this.

So from a swap POV, if we find an entry here it's guaranteed to remain
in the tree by the calling context. The xa lock is for protecting the
tree structure against concurrent changes (e.g. from adjacent entries).

With that said, is there still a way for the tree to change internally
before we acquire the lock? Such that tree + index might end up
pointing to the same contents in a different memory location?

AFAIK there are two possible ways:

- xas_split() - this shouldn't be possible because we don't do large
  entries inside the zswap trees.

- xas_shrink() - this could move the entry from a node to xa->head,
  iff it's the last entry in the tree and its index is 0. Swap offset
  0 is never a valid swap entry (swap header), but unfortunately we
  have split trees so it could happen to any offset that is a multiple
  of SWAP_ADDRESS_SPACE_PAGES. AFAICS xas_store() doesn't detect such
  a transition. And making it do that honestly sounds a bit hairy...

So this doesn't look safe to me without a reload :(
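
For reference, the variant Yosry suggested above, which keeps only the
lockless existence check and lets xa_erase() do its own locked walk,
would look roughly like:

	if (xa_empty(tree))
		return;

	/* lockless check: only take the lock if there may be work to do */
	if (xa_load(tree, offset)) {
		entry = xa_erase(tree, offset);
		if (entry)
			zswap_entry_free(entry);
	}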

Patch

diff --git a/mm/zswap.c b/mm/zswap.c
index f6316b66fb23..a5ba80ac8861 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1641,12 +1641,21 @@ void zswap_invalidate(swp_entry_t swp)
 	struct xarray *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry;
 
+	XA_STATE(xas, tree, offset);
+
 	if (xa_empty(tree))
 		return;
 
-	entry = xa_erase(tree, offset);
-	if (entry)
+	rcu_read_lock();
+	entry = xas_load(&xas);
+	if (entry) {
+		xas_lock(&xas);
+		WARN_ON_ONCE(xas_reload(&xas) != entry);
+		xas_store(&xas, NULL);
+		xas_unlock(&xas);
 		zswap_entry_free(entry);
+	}
+	rcu_read_unlock();
 }
 
 int zswap_swapon(int type, unsigned long nr_pages)