
[0/7] mm/zswap: optimize the scalability of zswap rb-tree

Message ID 20231206-zswap-lock-optimize-v1-0-e25b059f9c3a@bytedance.com

Message

Chengming Zhou Dec. 6, 2023, 9:46 a.m. UTC
Hi everyone,

This patch series is based on linux-next 20231205 and depends on the
"workload-specific and memory pressure-driven zswap writeback" series
from Nhat Pham.

When testing zswap performance using a kernel build with -j32 in a tmpfs
directory, I found that the scalability of the zswap rb-tree, which is
protected by a single spinlock, is not good. That causes heavy lock
contention when multiple tasks zswap_store()/zswap_load() concurrently.

So a simple solution is to split the single zswap rb-tree into multiple
rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64MB of swap).
This idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into
64MB trunks").

Although this method can't eliminate the spinlock contention completely,
it mitigates much of it.
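
To illustrate the idea (a simplified sketch only, not necessarily the
exact code in the patches; the helper name and array layout here are
just for illustration):

#include <linux/swap.h>		/* MAX_SWAPFILES, SWAP_ADDRESS_SPACE_SHIFT */
#include <linux/swapops.h>	/* swp_type(), swp_offset() */
#include <linux/rbtree.h>
#include <linux/spinlock.h>

struct zswap_tree {
	struct rb_root rbroot;
	spinlock_t lock;
};

/* One array of trees per swapfile, sized at swapon time. */
static struct zswap_tree *zswap_trees[MAX_SWAPFILES];

static struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
{
	/* Select the tree for this entry's 64MB swap range. */
	return &zswap_trees[swp_type(swp)][swp_offset(swp)
					   >> SWAP_ADDRESS_SPACE_SHIFT];
}

Concurrent zswap_store()/zswap_load() calls that hit different 64MB
swap ranges then take different spinlocks instead of all serializing
on a single lock.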

Another problem seen when testing zswap with our default zsmalloc is that
zswap_load() and zswap_writeback_entry() have to allocate a temporary
buffer to support !zpool_can_sleep_mapped().

Optimize this by reusing the percpu crypto_acomp_ctx->dstmem, which is
already used by zswap_store() and protected by the same percpu
crypto_acomp_ctx->mutex.
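
On the decompression side, the direction looks roughly like this (an
illustrative sketch of the refactored __zswap_load() helper from this
series, not the exact patch; error handling is omitted and some names,
e.g. zswap_find_zpool(), follow current mm/zswap.c):

static void __zswap_load(struct zswap_entry *entry, struct page *page)
{
	struct crypto_acomp_ctx *acomp_ctx;
	struct scatterlist input, output;
	struct zpool *zpool = zswap_find_zpool(entry);
	u8 *src;

	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
	mutex_lock(acomp_ctx->mutex);

	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
	if (!zpool_can_sleep_mapped(zpool)) {
		/* Reuse the percpu dstmem as the bounce buffer instead
		 * of kmalloc'ing a temporary one, then unmap before the
		 * (possibly sleeping) decompression. */
		memcpy(acomp_ctx->dstmem, src, entry->length);
		src = acomp_ctx->dstmem;
		zpool_unmap_handle(zpool, entry->handle);
	}

	sg_init_one(&input, src, entry->length);
	sg_init_table(&output, 1);
	sg_set_page(&output, page, PAGE_SIZE, 0);
	acomp_request_set_params(acomp_ctx->req, &input, &output,
				 entry->length, PAGE_SIZE);
	crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
			&acomp_ctx->wait);
	mutex_unlock(acomp_ctx->mutex);

	if (zpool_can_sleep_mapped(zpool))
		zpool_unmap_handle(zpool, entry->handle);
}

The dstmem buffer and the mutex are the same ones zswap_store() already
uses for compression, so no extra allocation is needed on the
load/writeback path.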

Thanks for your review and comments!

To: Andrew Morton <akpm@linux-foundation.org>
To: Seth Jennings <sjenning@redhat.com>
To: Dan Streetman <ddstreet@ieee.org>
To: Vitaly Wool <vitaly.wool@konsulko.com>
To: Nhat Pham <nphamcs@gmail.com>
To: Johannes Weiner <hannes@cmpxchg.org>
To: Yosry Ahmed <yosryahmed@google.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

---
Chengming Zhou (7):
      mm/zswap: make sure each swapfile always have zswap rb-tree
      mm/zswap: split zswap rb-tree
      mm/zswap: reuse dstmem when decompress
      mm/zswap: change dstmem size to one page
      mm/zswap: refactor out __zswap_load()
      mm/zswap: cleanup zswap_load()
      mm/zswap: cleanup zswap_reclaim_entry()

 include/linux/zswap.h |   4 +-
 mm/swapfile.c         |  10 ++-
 mm/zswap.c            | 233 +++++++++++++++++++++-----------------------------
 3 files changed, 106 insertions(+), 141 deletions(-)
---
base-commit: 0f5f12ac05f36f117e793656c3f560625e927f1b
change-id: 20231206-zswap-lock-optimize-06f45683b02b

Best regards,

Comments

Nhat Pham Dec. 6, 2023, 5:24 p.m. UTC | #1
+ Chris Li

Chris, I vaguely remember from our last conversation that you have
some concurrent efforts to use an xarray here, right?

On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Hi everyone,
>
> This patch series is based on the linux-next 20231205, which depends on
> the "workload-specific and memory pressure-driven zswap writeback" series
> from Nhat Pham.
>
> When testing the zswap performance by using kernel build -j32 in a tmpfs
> directory, I found the scalability of zswap rb-tree is not good, which
> is protected by the only spinlock. That would cause heavy lock contention
> if multiple tasks zswap_store/load concurrently.
>
> So a simple solution is to split the only one zswap rb-tree into multiple
> rb-trees, each corresponds to SWAP_ADDRESS_SPACE_PAGES (64M). This idea is
> from the commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
>
> Although this method can't solve the spinlock contention completely, it
> can mitigate much of that contention.
>
> Another problem when testing the zswap using our default zsmalloc is that
> zswap_load() and zswap_writeback_entry() have to malloc a temporary memory
> to support !zpool_can_sleep_mapped().
>
> Optimize it by reusing the percpu crypto_acomp_ctx->dstmem, which is also
> used by zswap_store() and protected by the same percpu crypto_acomp_ctx->mutex.
>
> Thanks for review and comment!
>
> [...]
Nhat Pham Dec. 6, 2023, 8:08 p.m. UTC | #2
On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
> When testing the zswap performance by using kernel build -j32 in a tmpfs
> directory, I found the scalability of zswap rb-tree is not good, which
> is protected by the only spinlock. That would cause heavy lock contention
> if multiple tasks zswap_store/load concurrently.
>
> So a simple solution is to split the only one zswap rb-tree into multiple
> rb-trees, each corresponds to SWAP_ADDRESS_SPACE_PAGES (64M). This idea is
> from the commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
>
> Although this method can't solve the spinlock contention completely, it
> can mitigate much of that contention.

By how much? Do you have any stats to estimate the amount of
contention and the reduction by this patch?

I do think lock contention could be a problem here, and it will be
even worse with the zswap shrinker enabled (which introduces a
theoretically unbounded number of concurrent reclaimers hammering on
the zswap rbtree and its lock). I am generally a bit wary about
architectural change though, especially if it is just a band-aid. We
have tried to reduce the lock contention somewhere else (multiple
zpools), and as predicted it just shifts the contention point
elsewhere. Maybe we need a deeper architectural re-think.

Not an outright NACK of course - just food for thought.

>
> Another problem when testing the zswap using our default zsmalloc is that
> zswap_load() and zswap_writeback_entry() have to malloc a temporary memory
> to support !zpool_can_sleep_mapped().
>
> Optimize it by reusing the percpu crypto_acomp_ctx->dstmem, which is also
> used by zswap_store() and protected by the same percpu crypto_acomp_ctx->mutex.

It'd be nice to reduce the (temporary) memory allocation on these
paths, but would this introduce contention on the per-cpu dstmem and
the mutex that protects it, if there are too many concurrent
store/load/writeback requests?
Yosry Ahmed Dec. 6, 2023, 8:41 p.m. UTC | #3
On Wed, Dec 6, 2023 at 9:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> + Chris Li
>
> Chris, I vaguely remember from our last conversation that you have
> some concurrent efforts to use xarray here right?

If I recall correctly, the xarray already reduces the lock contention
as lookups are lockless, but Chris knows more here. As you mentioned
in a different email, it would be nice to get some data so that we can
compare different solutions.

>
> On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> > [...]
Chris Li Dec. 7, 2023, 12:43 a.m. UTC | #4
Hi Nhat and Yosry,

On Wed, Dec 6, 2023 at 12:42 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Dec 6, 2023 at 9:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > + Chris Li
> >
> > Chris, I vaguely remember from our last conversation that you have
> > some concurrent efforts to use xarray here right?

Yes, I do have the zswap xarray for older versions of the kernel. The
recent mm-unstable tree has a lot of zswap-related updates. Give me 2
days to refresh and post it. The zswap invalid entry and the reference
count changes are causing a good portion of the code to be updated. That
is the price to pay for keeping out-of-tree patches. My fault for not
getting to it sooner.

>
> If I recall correctly, the xarray already reduces the lock contention
> as lookups are lockless, but Chris knows more here. As you mentioned

Yes. To be exact, the xarray can use a spinlock (same as the current RB
tree) or take the RCU read lock on the lookup path (depending on how you
call the xarray API). Not completely lockless, but the RCU read lock
should have less lock contention than a normal spinlock. +Matthew
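
For illustration only, generic xarray API usage with a hypothetical
per-tree xarray (not the actual zswap xarray patch):

	/* Lookup side: xa_load() only needs the RCU read lock; zswap
	 * would additionally take a reference on the returned entry. */
	rcu_read_lock();
	entry = xa_load(&tree->xa, offset);
	rcu_read_unlock();

	/* Update side: xa_store()/xa_erase() take the xarray's internal
	 * spinlock themselves, so writers still serialize there. */
	xa_store(&tree->xa, offset, entry, GFP_KERNEL);
	xa_erase(&tree->xa, offset);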

> in a different email, it would be nice to get some data so that we can
> compare different solutions.

Yes, it is certainly welcome to see more data points. If I recall
correctly, the zswap xarray makes the lookup similar to the swap
cache lookup. It makes a noticeable difference at the long tail end.

Stay tuned.

Chris
Chengming Zhou Dec. 7, 2023, 3:13 a.m. UTC | #5
On 2023/12/7 04:08, Nhat Pham wrote:
> On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>> When testing the zswap performance by using kernel build -j32 in a tmpfs
>> directory, I found the scalability of zswap rb-tree is not good, which
>> is protected by the only spinlock. That would cause heavy lock contention
>> if multiple tasks zswap_store/load concurrently.
>>
>> So a simple solution is to split the only one zswap rb-tree into multiple
>> rb-trees, each corresponds to SWAP_ADDRESS_SPACE_PAGES (64M). This idea is
>> from the commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
>>
>> Although this method can't solve the spinlock contention completely, it
>> can mitigate much of that contention.
> 
> By how much? Do you have any stats to estimate the amount of
> contention and the reduction by this patch?

Actually, I did some tests using linux-next 20231205 yesterday.

Testcase: memory.max = 2G, zswap enabled, make -j32 in tmpfs.

			20231205	+patchset
1. !shrinker_enabled:   156s		126s
2.  shrinker_enabled:   79s		70s

I think your zswap shrinker fix patch can solve the !shrinker_enabled case.

So I will test again today using the new mm-unstable branch.

> 
> I do think lock contention could be a problem here, and it will be
> even worse with the zswap shrinker enabled (which introduces an
> theoretically unbounded number of concurrent reclaimers hammering on
> the zswap rbtree and its lock). I am generally a bit weary about
> architectural change though, especially if it is just a bandaid. We
> have tried to reduce the lock contention somewhere else (multiple
> zpools), and as predicted it just shifts the contention point
> elsewhere. Maybe we need a deeper architectural re-think.
> 
> Not an outright NACK of course - just food for thought.
> 

Right, I think the xarray is good for the lockless read side, and
multiple trees are also complementary, since they can reduce the lock
contention on the write side too.

>>
>> Another problem when testing the zswap using our default zsmalloc is that
>> zswap_load() and zswap_writeback_entry() have to malloc a temporary memory
>> to support !zpool_can_sleep_mapped().
>>
>> Optimize it by reusing the percpu crypto_acomp_ctx->dstmem, which is also
>> used by zswap_store() and protected by the same percpu crypto_acomp_ctx->mutex.
> 
> It'd be nice to reduce the (temporary) memory allocation on these
> paths, but would this introduce contention on the per-cpu dstmem and
> the mutex that protects it, if there are too many concurrent
> store/load/writeback requests?

I think the mutex hold time is not changed, right? So the contention
on the per-cpu mutex should be the same. We just reuse the percpu dstmem more.

Thanks!
Chengming Zhou Dec. 7, 2023, 3:25 a.m. UTC | #6
On 2023/12/7 08:43, Chris Li wrote:
> Hi Nhat and Yosry,
> 
> On Wed, Dec 6, 2023 at 12:42 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> On Wed, Dec 6, 2023 at 9:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
>>>
>>> + Chris Li
>>>
>>> Chris, I vaguely remember from our last conversation that you have
>>> some concurrent efforts to use xarray here right?
> 
> Yes, I do have the zswap xarray for older versions of the kernel. The
> recent mm-unstable tree has  a lot of zswap related updates. Give me 2
> days to refresh and post it. The zswap invalid entry and the reference
> count change is causing a good portion of the code to be updated. That
> is the price to pay keeping out of tree patches. My fault is not
> getting to it sooner.
> 
>>
>> If I recall correctly, the xarray already reduces the lock contention
>> as lookups are lockless, but Chris knows more here. As you mentioned
> 
> Yes. To be exact, xarray can use spin lock (same as current RB tree)
> or take RCU read lock on the lookup path (depending on how you call
> the xarray API). Not completely lockless but the RCU read lock should
> have less lock contention than normal spinlock. +Matthew
> 

Great! Lockless lookup in zswap_load() should reduce spinlock contention.
And multiple trees (multiple xarrays) can further reduce the contention
on the concurrent zswap_store() side. So they're complementary IMHO.

>> in a different email, it would be nice to get some data so that we can
>> compare different solutions.
> 
> Yes, it is certainly welcome to see more data points. If I recall
> correctly, the zswap xarray array makes the lookup similar to the swap
> cache lookup. It has a noticeable difference in the long tail end.
> 

Right, I posted some data yesterday in another reply.
I will test again and update the data since Nhat's zswap shrinker fix patch
has been merged into mm-unstable today.

Thanks!
Chengming Zhou Dec. 7, 2023, 3:18 p.m. UTC | #7
On 2023/12/7 11:13, Chengming Zhou wrote:
> On 2023/12/7 04:08, Nhat Pham wrote:
>> On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou
>> <zhouchengming@bytedance.com> wrote:
>>> [...]
>>
>> By how much? Do you have any stats to estimate the amount of
>> contention and the reduction by this patch?
> 
> Actually, I did some test using the linux-next 20231205 yesterday.
> 
> Testcase: memory.max = 2G, zswap enabled, make -j32 in tmpfs.
> 
> 			20231205	+patchset
> 1. !shrinker_enabled:   156s		126s
> 2.  shrinker_enabled:   79s		70s
> 
> I think your zswap shrinker fix patch can solve !shrinker_enabled case.
> 
> So will test again today using the new mm-unstable branch.
> 

Updated test data based on today's mm-unstable branch:

			mm-unstable	+patchset
1. !shrinker_enabled:	86s		74s
2.  shrinker_enabled:	63s		61s

This shows much less improvement for the shrinker_enabled case, but still
a big improvement for the !shrinker_enabled case.

Thanks!
Nhat Pham Dec. 7, 2023, 6:15 p.m. UTC | #8
On Thu, Dec 7, 2023 at 7:18 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2023/12/7 11:13, Chengming Zhou wrote:
> > On 2023/12/7 04:08, Nhat Pham wrote:
> >> On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou
> >> <zhouchengming@bytedance.com> wrote:
> >>> [...]
> >>
> >> By how much? Do you have any stats to estimate the amount of
> >> contention and the reduction by this patch?
> >
> > Actually, I did some test using the linux-next 20231205 yesterday.
> >
> > Testcase: memory.max = 2G, zswap enabled, make -j32 in tmpfs.
> >
> >                       20231205        +patchset
> > 1. !shrinker_enabled:   156s          126s
> > 2.  shrinker_enabled:   79s           70s
> >
> > I think your zswap shrinker fix patch can solve !shrinker_enabled case.
> >
> > So will test again today using the new mm-unstable branch.
> >
>
> Updated test data based on today's mm-unstable branch:
>
>                         mm-unstable     +patchset
> 1. !shrinker_enabled:   86s             74s
> 2.  shrinker_enabled:   63s             61s
>
> Shows much less optimization for the shrinker_enabled case, but still
> much optimization for the !shrinker_enabled case.
>
> Thanks!

I'm gonna assume this is build time since it makes the zswap shrinker
look pretty good :)
I think this just means some of the gains between this patchset and
the zswap shrinker overlap. But on a positive note:

a) Both are complementary, i.e. enabling both (bottom right corner) gives
us the best result.
b) Each individual change improves the runtime. If you disable the
shrinker, then this patch helps tremendously, so we're onto something.
c) The !shrinker_enabled case is no longer *too* bad - once again, thanks
for noticing the regression and helping me fix it! In fact, every cell
improves compared to the last run. Woohoo!
Nhat Pham Dec. 7, 2023, 6:57 p.m. UTC | #9
On Thu, Dec 7, 2023 at 10:15 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, Dec 7, 2023 at 7:18 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> >
> > Updated test data based on today's mm-unstable branch:
> >
> >                         mm-unstable     +patchset
> > 1. !shrinker_enabled:   86s             74s
> > 2.  shrinker_enabled:   63s             61s
> >
> > Shows much less optimization for the shrinker_enabled case, but still
> > much optimization for the !shrinker_enabled case.
> >
> > Thanks!
>
> I'm gonna assume this is build time since it makes the zswap shrinker
> look pretty good :)
> I think this just means some of the gains between this patchset and
> the zswap shrinker overlaps. But on the positive note:
>
> a) Both are complementary, i.e enable both (bottom right corner) gives
> us the best result.
> b) Each individual change improves the runtime. If you disable the
> shrinker, then this patch helps tremendously, so we're onto something.
> c) The !shrinker_enabled is no longer *too* bad - once again, thanks
> for noticing the regression and help me fix it! In fact, every cell
> improves compared to the last run. Woohoo!

Oh, and another thing that might be helpful for observing the reduction
in lock contention (and comparing approaches if necessary) is this
analysis that Yosry performed for the multiple zpools change:
https://lore.kernel.org/lkml/20230620194644.3142384-1-yosryahmed@google.com/

We could look at the various paths that use the rb-tree and see how
long we're spinning on the lock(s), etc.
Chengming Zhou Dec. 8, 2023, 3:41 p.m. UTC | #10
On 2023/12/8 02:15, Nhat Pham wrote:
> On Thu, Dec 7, 2023 at 7:18 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> On 2023/12/7 11:13, Chengming Zhou wrote:
>>> On 2023/12/7 04:08, Nhat Pham wrote:
>>>> On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou
>>>> <zhouchengming@bytedance.com> wrote:
>>>>> [...]
>>>>
>>>> By how much? Do you have any stats to estimate the amount of
>>>> contention and the reduction by this patch?
>>>
>>> Actually, I did some test using the linux-next 20231205 yesterday.
>>>
>>> Testcase: memory.max = 2G, zswap enabled, make -j32 in tmpfs.
>>>
>>>                       20231205        +patchset
>>> 1. !shrinker_enabled:   156s          126s
>>> 2.  shrinker_enabled:   79s           70s
>>>
>>> I think your zswap shrinker fix patch can solve !shrinker_enabled case.
>>>
>>> So will test again today using the new mm-unstable branch.
>>>
>>
>> Updated test data based on today's mm-unstable branch:
>>
>>                         mm-unstable     +patchset
>> 1. !shrinker_enabled:   86s             74s
>> 2.  shrinker_enabled:   63s             61s
>>
>> Shows much less optimization for the shrinker_enabled case, but still
>> much optimization for the !shrinker_enabled case.
>>
>> Thanks!
> 
> I'm gonna assume this is build time since it makes the zswap shrinker
> look pretty good :)
> I think this just means some of the gains between this patchset and
> the zswap shrinker overlaps. But on the positive note:
> 
> a) Both are complementary, i.e enable both (bottom right corner) gives
> us the best result.

Right, both optimizations are complementary, to make zswap perform better :)

> b) Each individual change improves the runtime. If you disable the
> shrinker, then this patch helps tremendously, so we're onto something.
> c) The !shrinker_enabled is no longer *too* bad - once again, thanks
> for noticing the regression and help me fix it! In fact, every cell
> improves compared to the last run. Woohoo!

It's my pleasure! Thanks!
Yosry Ahmed Dec. 12, 2023, 11:26 p.m. UTC | #11
On Wed, Dec 6, 2023 at 7:25 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> [...]

Let's split the rbtree breakdown into a separate series. This series
has unrelated (and very nice) cleanups and optimizations; let's get
them in separately and defer the rbtree breakdown part until we get data
about the xarray implementation. Perhaps the tree breakdown is not
needed as much with an xarray, or at the very least the implementation
would look different on top of an xarray.
Nhat Pham Dec. 12, 2023, 11:33 p.m. UTC | #12
On Tue, Dec 12, 2023 at 3:27 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Let's split the rbtree breakdown into a separate series. This series
> has irrelevant (and very nice) cleanups and optimizations, let's get
> them separately and defer the rbtree breakdown part until we get data
> about the xarray implementation. Perhaps the tree breakdown is not
> needed as much with an xarray, or at the very least the implementation
> would look different on top of an xarray.

Actually, kinda agree - I quite like the cleanup/optimization done
w.r.t. dstmem reuse :)
Chengming Zhou Dec. 13, 2023, 2:57 a.m. UTC | #13
On 2023/12/13 07:33, Nhat Pham wrote:
> On Tue, Dec 12, 2023 at 3:27 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> Let's split the rbtree breakdown into a separate series. This series
>> has irrelevant (and very nice) cleanups and optimizations, let's get
>> them separately and defer the rbtree breakdown part until we get data

OK, I will split the series and just send the cleanups/optimizations with
the dstmem reuse.

>> about the xarray implementation. Perhaps the tree breakdown is not
>> needed as much with an xarray, or at the very least the implementation
>> would look different on top of an xarray.

Yeah, I will retest on Chris's xarray version; the implementation is
easy anyway.

> 
> Actually, kinda agree - I quite like the cleanup/optimization done
> w.r.t dstmem reuse :)

Thanks!