Message ID: 20240117-b4-zswap-lock-optimize-v1-0-23f6effe5775@bytedance.com (mailing list archive)
Series: mm/zswap: optimize the scalability of zswap rb-tree
On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou <zhouchengming@bytedance.com> wrote:
>
> When testing zswap performance with a kernel build -j32 in a tmpfs
> directory, I found that the scalability of the zswap rb-tree is poor:
> it is protected by a single spinlock, which causes heavy lock contention
> when multiple tasks zswap_store/load concurrently.
>
> A simple solution is to split the single zswap rb-tree into multiple
> rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This
> idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into
> 64MB trunks").
>
> Although this method can't eliminate the spinlock contention completely,
> it mitigates much of it. Below are the results of a kernel build in
> tmpfs with the zswap shrinker enabled:
>
>           linux-next   zswap-lock-optimize
> real      1m9.181s     1m3.820s
> user      17m44.036s   17m40.100s
> sys       7m37.297s    4m54.622s
>
> So there are clear improvements. And it's complementary with the ongoing
> zswap xarray conversion by Chris, so IMHO we can also merge this first.
> I am refreshing and resending it for further discussion.

The reason I think we should wait for the xarray patch(es) is that there is a chance we will see less improvement from splitting the tree once it is an xarray. If we merge this series first, there is no way to know.

Chris, do you intend to send the xarray patch(es) anytime soon?
Hi Yosry and Chengming,

On Wed, Jan 17, 2024 at 10:38 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> >
> > [...]

Sorry I have been radio silent; I was busy with a few refreshes of the xarray series on the recent kernel tree. An assertion was triggered where the xarray and the rb-tree did not agree with each other, and it took some time to debug. I have ironed that out, and I am glad the assert did catch a bug.

Currently the xarray patch should have everything it takes to use the RCU read lock. However, taking out the tree spinlock is more work than previously expected. If we are going to remove the tree spinlock, I think we should go back to doing a zswap tree lookup that returns the zswap entry with its reference count increased. The tree mapping can still be decoupled from the zswap entry's reference count dropping to zero.

Anyway, my V1 of the xarray patch will not include removing the tree spinlock.

> The reason I think we should wait for the xarray patch(es) is that
> there is a chance we will see less improvement from splitting the tree
> once it is an xarray. If we merge this series first, there is no way
> to know.
>
> Chris, do you intend to send the xarray patch(es) anytime soon?

Thanks for the heads up. Let me send it out now.

Chris
> Currently the xarray patch should have everything it takes to use the
> RCU read lock. However, taking out the tree spinlock is more work than
> previously expected. If we are going to remove the tree spinlock, I
> think we should go back to doing a zswap tree lookup that returns the
> zswap entry with its reference count increased. The tree mapping can
> still be decoupled from the zswap entry's reference count dropping to
> zero. Anyway, my V1 of the xarray patch will not include removing the
> tree spinlock.

Interesting. What do you mean by removing the tree spinlock? My assumption was that the xarray reduces lock contention because we do not need a lock to do lookups, but we still need the lock otherwise. Did you have something in mind to completely remove the tree lock?

> > The reason I think we should wait for the xarray patch(es) is that
> > [...]
> >
> > Chris, do you intend to send the xarray patch(es) anytime soon?
>
> Thanks for the heads up. Let me send it out now.

Awesome, thanks! I assume Chengming can test whether this series provides the same benefits with the xarray.
Hi Yosry,

On Wed, Jan 17, 2024 at 3:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > Currently the xarray patch should have everything it takes to use the
> > RCU read lock. However, taking out the tree spinlock is more work than
> > previously expected. [...]
>
> Interesting. What do you mean by removing the tree spinlock? My
> assumption was that the xarray reduces lock contention because we do
> not need a lock to do lookups, but we still need the lock otherwise.
> Did you have something in mind to completely remove the tree lock?

In my current xarray series, the xarray is added alongside the rb-tree. The xarray has its own internal lock as well, so effectively zswap now has two locks instead of the one it had before. The xarray lock will not see any contention, because it is taken inside the zswap rb-tree lock. The eventual goal is to reduce the two locks back to one (the xarray lock), but that is not in my V1 patch. Your understanding is correct: the xarray still needs one lock to protect the write path.

> > > The reason I think we should wait for the xarray patch(es) is that
> > > [...]
> > >
> > > Chris, do you intend to send the xarray patch(es) anytime soon?
> >
> > Thanks for the heads up. Let me send it out now.
>
> Awesome, thanks! I assume Chengming can test whether this series
> provides the same benefits with the xarray.

That would be great. The xarray patch needs more testing for sure.

Chris
On Wed, Jan 17, 2024 at 4:18 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Yosry,
>
> [...]
>
> In my current xarray series, the xarray is added alongside the rb-tree.
> The xarray has its own internal lock as well, so effectively zswap now
> has two locks instead of the one it had before. The xarray lock will
> not see any contention, because it is taken inside the zswap rb-tree
> lock. The eventual goal is to reduce the two locks back to one (the
> xarray lock), but that is not in my V1 patch. Your understanding is
> correct: the xarray still needs one lock to protect the write path.

Hmm, I don't understand. What's the point of keeping the rbtree if we have the xarray? Doesn't it end up being more expensive and bug-prone to maintain both trees?

When you say "eventual goal", do you mean what the patch would morph into in later versions (as in, v1 is just a proof of concept without removing the rbtree), or follow-up patches?
On Wed, Jan 17, 2024 at 4:18 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Yosry,
>
> [...]
>
> In my current xarray series, the xarray is added alongside the rb-tree.

Hmmm why? Is there a reason to keep the rb tree around?
Hi Yosry,

On Wed, Jan 17, 2024 at 4:35 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> Hmm, I don't understand. What's the point of keeping the rbtree if we
> have the xarray? Doesn't it end up being more expensive and bug-prone
> to maintain both trees?

Patch 2/2 removes the rb-tree code, keeping just the tree spinlock.

> When you say "eventual goal", do you mean what the patch would morph
> into in later versions (as in, v1 is just a proof of concept without
> removing the rbtree), or follow-up patches?

V1 will remove the rb-tree, but does not merge the rb-tree lock with the xarray lock.

Hi Nhat,

> Hmmm why? Is there a reason to keep the rb tree around?

No, not at all.

Chris
On Wed, Jan 17, 2024 at 5:03 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Yosry,
>
> [...]
>
> Patch 2/2 removes the rb-tree code, keeping just the tree spinlock.
>
> [...]
>
> V1 will remove the rb-tree, but does not merge the rb-tree lock with
> the xarray lock.

I see you have already posted the patches; let's move the discussion there. I will take a look at them as soon as I get the chance.
On Wed, Jan 17, 2024 at 10:37:22AM -0800, Yosry Ahmed wrote:
> On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> >
> > [...]
>
> The reason I think we should wait for the xarray patch(es) is that
> there is a chance we will see less improvement from splitting the tree
> once it is an xarray. If we merge this series first, there is no way
> to know.

I mentioned this before, but I disagree quite strongly with this general sentiment.

Chengming's patches are simple, mature, and have convincing numbers. IMO it's poor form to hold something like that for "let's see how our other experiment works out". The only exception would be if we all agree that the earlier change flies in the face of the overall direction we want to pursue, which I don't think is the case here.

With the xarray we'll still have a per-swapfile lock for writes. That lock is the reason SWAP_ADDRESS_SPACE segmentation was introduced for the swapcache in the first place. Lockless reads help of course, but read-only accesses to swap are in the minority: stores will write, and loads are commonly followed by invalidations. Somebody already went through the trouble of proving that xarrays + segmentation are worth it for swap load and store access patterns. Why dismiss that?

So my vote is that we follow the usual upstreaming process here: merge the ready patches now, and rebase future work on top of it.
On Thu, Jan 18, 2024 at 7:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Jan 17, 2024 at 10:37:22AM -0800, Yosry Ahmed wrote:
> > [...]
> >
> > The reason I think we should wait for the xarray patch(es) is that
> > there is a chance we will see less improvement from splitting the tree
> > once it is an xarray. If we merge this series first, there is no way
> > to know.
>
> I mentioned this before, but I disagree quite strongly with this
> general sentiment.
>
> Chengming's patches are simple, mature, and have convincing
> numbers. IMO it's poor form to hold something like that for "let's see
> how our other experiment works out". The only exception would be if we
> all agree that the earlier change flies in the face of the overall
> direction we want to pursue, which I don't think is the case here.

My intention was not to delay merging these patches until the xarray patches are merged in. It was only to wait until the xarray patches are *posted*, so that we can redo the testing on top of them and verify that the gains are still there. That should have been around now, but the xarray patches were posted in a form that does not allow this testing (because we still have a lock on the read path), so I am less inclined to wait.

My rationale was that if the gains from splitting the tree become minimal after we switch to an xarray, we won't know. It's more difficult to remove optimizations than to add them, because we may cause a regression. I am kind of paranoid about having code sitting around that we don't have full information about how much it's needed.

In this case, I suppose we can redo the testing (1 tree vs. split trees) once the xarray patches are in a testable form, and before we have formed any strong dependencies on the split trees (we have time until v6.9 is released, I assume).

How about that?

> With the xarray we'll still have a per-swapfile lock for writes. That
> lock is the reason SWAP_ADDRESS_SPACE segmentation was introduced for
> the swapcache in the first place. Lockless reads help of course, but
> read-only accesses to swap are in the minority: stores will write, and
> loads are commonly followed by invalidations. Somebody already went
> through the trouble of proving that xarrays + segmentation are worth
> it for swap load and store access patterns. Why dismiss that?

Fair point, although I think the swapcache lock may be more contended than the zswap tree lock.

> So my vote is that we follow the usual upstreaming process here:
> merge the ready patches now, and rebase future work on top of it.

No objections given the current state of the xarray patches as I mentioned earlier, but I prefer we redo the testing once possible with the xarray.
On Thu, Jan 18, 2024 at 09:30:12AM -0800, Yosry Ahmed wrote:
> On Thu, Jan 18, 2024 at 7:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > [...]
> >
> > Chengming's patches are simple, mature, and have convincing
> > numbers. IMO it's poor form to hold something like that for "let's see
> > how our other experiment works out". [...]
>
> My intention was not to delay merging these patches until the xarray
> patches are merged in. It was only to wait until the xarray patches
> are *posted*, so that we can redo the testing on top of them and
> verify that the gains are still there. [...]
>
> My rationale was that if the gains from splitting the tree become
> minimal after we switch to an xarray, we won't know. It's more
> difficult to remove optimizations than to add them, because we may
> cause a regression. I am kind of paranoid about having code sitting
> around that we don't have full information about how much it's needed.

Yeah, I understand that fear.

I expect the splitting to help more than the move to xarray, because it's the writes that are hot. Luckily, in this case it should be fairly easy to differential-test after it's been merged, by changing that tree lookup macro/function locally to always return &trees[type][0], right?

> In this case, I suppose we can redo the testing (1 tree vs. split
> trees) once the xarray patches are in a testable form, and before we
> have formed any strong dependencies on the split trees (we have time
> until v6.9 is released, I assume).
>
> How about that?

That sounds reasonable.

> > With the xarray we'll still have a per-swapfile lock for writes. That
> > lock is the reason SWAP_ADDRESS_SPACE segmentation was introduced for
> > the swapcache in the first place. [...]
>
> Fair point, although I think the swapcache lock may be more contended
> than the zswap tree lock.

Right, it has two updates for each transition, compared to the one for zswap. But we know that in a concurrent system under pressure, a globally shared swap lock will hurt. There is history here: Chengming's numbers, your previous patch to split the zpools, commit 235b62176712b970c815923e36b9a9cc05d4d901, etc.

> > So my vote is that we follow the usual upstreaming process here:
> > merge the ready patches now, and rebase future work on top of it.
>
> No objections given the current state of the xarray patches as I
> mentioned earlier, but I prefer we redo the testing once possible with
> the xarray.

Cool, sounds good to me.
On Thu, Jan 18, 2024 at 10:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Jan 18, 2024 at 09:30:12AM -0800, Yosry Ahmed wrote:
> > [...]
> >
> > My rationale was that if the gains from splitting the tree become
> > minimal after we switch to an xarray, we won't know. It's more
> > difficult to remove optimizations than to add them, because we may
> > cause a regression. I am kind of paranoid about having code sitting
> > around that we don't have full information about how much it's needed.
>
> Yeah, I understand that fear.
>
> I expect the splitting to help more than the move to xarray, because
> it's the writes that are hot. Luckily, in this case it should be fairly
> easy to differential-test after it's been merged, by changing that tree
> lookup macro/function locally to always return &trees[type][0], right?

Yeah, that's exactly what I had in mind. Once we have a version of the xarray patch without the locking on the read side, we can test with that. Chengming, does this sound reasonable to you?
On 2024/1/19 02:37, Yosry Ahmed wrote:
> On Thu, Jan 18, 2024 at 10:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>> [...]
>>
>> I expect the splitting to help more than the move to xarray, because
>> it's the writes that are hot. Luckily, in this case it should be fairly
>> easy to differential-test after it's been merged, by changing that tree
>> lookup macro/function locally to always return &trees[type][0], right?
>
> Yeah, that's exactly what I had in mind. Once we have a version of the
> xarray patch without the locking on the read side, we can test with
> that. Chengming, does this sound reasonable to you?

It's ok, sounds reasonable to me.

I agree with Johannes: we will need both, since the xarray still has a spinlock on the write side, so it's clearly better to split the tree. As for testing, we can always return &trees[type][0].

Thanks!
When testing zswap performance with a kernel build -j32 in a tmpfs directory, I found that the scalability of the zswap rb-tree is poor: it is protected by a single spinlock, which causes heavy lock contention when multiple tasks zswap_store/load concurrently.

A simple solution is to split the single zswap rb-tree into multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").

Although this method can't eliminate the spinlock contention completely, it mitigates much of it. Below are the results of a kernel build in tmpfs with the zswap shrinker enabled:

          linux-next   zswap-lock-optimize
real      1m9.181s     1m3.820s
user      17m44.036s   17m40.100s
sys       7m37.297s    4m54.622s

So there are clear improvements. And it's complementary with the ongoing zswap xarray conversion by Chris, so IMHO we can also merge this first. I am refreshing and resending it for further discussion.

Thanks for review and comment!

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
Chengming Zhou (2):
      mm/zswap: make sure each swapfile always have zswap rb-tree
      mm/zswap: split zswap rb-tree

 include/linux/zswap.h |  7 +++--
 mm/swapfile.c         | 10 ++++---
 mm/zswap.c            | 74 ++++++++++++++++++++++++++++++++-------------------
 3 files changed, 59 insertions(+), 32 deletions(-)
---
base-commit: ab27740f76654ed58dd32ac0ba0031c18a6dea3b
change-id: 20240117-b4-zswap-lock-optimize-44e071c13427

Best regards,