Message ID: 20240117-zswap-xarray-v1-0-6daa86c08fae@kernel.org
Series: RFC: zswap tree use xarray instead of RB tree
That's a long CC list for sure :)

On Wed, Jan 17, 2024 at 7:06 PM Chris Li <chrisl@kernel.org> wrote:
>
> The RB tree shows some contribution to the swap fault
> long tail latency due to two factors:
> 1) The RB tree requires re-balancing from time to time.
> 2) The zswap RB tree has a tree-level spin lock protecting
> the tree access.
>
> The swap cache is using an xarray. The breakdown of the swap
> cache access does not show a similar long tail as the zswap
> RB tree.

I think the comparison to the swap cache may not be valid as the swap
cache has many trees per swapfile, while zswap has a single tree.

> Moving the zswap entries to an xarray enables the read side
> to take only the RCU read lock.

Nice.

> The first patch adds the xarray alongside the RB tree.
> There are some debug checks asserting that the xarray agrees with
> the RB tree results.
>
> The second patch removes the zswap RB tree.

The breakdown looks like something that would be a development step,
but for patch submission I think it makes more sense to have a single
patch replacing the rbtree with an xarray.

> I expect to merge the zswap RB tree spin lock with the xarray
> lock in follow-up changes.

Shouldn't this simply be changing uses of tree->lock to use
xa_{lock/unlock}? We also need to make sure we don't try to lock the
tree when operating on the xarray if the caller is already holding the
lock, but this seems to be straightforward enough to be done as part
of this patch or this series at least.

Am I missing something?

> I can surely use some help in reviewing and testing.
>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
> Chris Li (2):
>       mm: zswap.c: add xarray tree to zswap
>       mm: zswap.c: remove RB tree
>
>  mm/zswap.c | 120 ++++++++++++++++++++++++++++++-------------------------------
>  1 file changed, 59 insertions(+), 61 deletions(-)
> ---
> base-commit: d7ba3d7c3bf13e2faf419cce9e9bdfc3a1a50905
> change-id: 20240104-zswap-xarray-716260e541e3
>
> Best regards,
> --
> Chris Li <chrisl@kernel.org>
On Wed, Jan 17, 2024 at 10:01 PM Yosry Ahmed <yosryahmed@google.com> wrote:
[...]
> Shouldn't this simply be changing uses of tree->lock to use
> xa_{lock/unlock}? We also need to make sure we don't try to lock the
> tree when operating on the xarray if the caller is already holding the
> lock, but this seems to be straightforward enough to be done as part
> of this patch or this series at least.
>
> Am I missing something?

Also, I assume we will only see performance improvements after the
tree lock in its current form is removed so that we get loads
protected only by RCU. Can we get some performance numbers to see how
the latency improves with the xarray under contention (unless
Chengming is already planning on testing this for his multi-tree
patches)?
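For reference, the xa_{lock/unlock} conversion discussed above would use the spinlock embedded in the xarray itself in place of zswap's tree->lock. A minimal sketch, assuming a zswap_tree that simply wraps a struct xarray (the type layout and names here are illustrative, not the actual zswap code):

```
#include <linux/xarray.h>

struct zswap_tree {
	struct xarray entries;	/* illustrative: replaces rbroot + tree->lock */
};

/* Erase an entry while holding the xarray's internal spinlock. */
static void *zswap_xa_erase(struct zswap_tree *tree, pgoff_t offset)
{
	void *entry;

	xa_lock(&tree->entries);	/* stands in for spin_lock(&tree->lock) */
	entry = __xa_erase(&tree->entries, offset);
	xa_unlock(&tree->entries);

	return entry;
}
```

Callers that already hold the lock would use the __xa_* variants directly, which is the double-locking caveat Yosry raises above.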
On Wed, Jan 17, 2024 at 10:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> That's a long CC list for sure :)
>
> On Wed, Jan 17, 2024 at 7:06 PM Chris Li <chrisl@kernel.org> wrote:
[...]
> I think the comparison to the swap cache may not be valid as the swap
> cache has many trees per swapfile, while zswap has a single tree.

Yes, good point. I think we can benchmark the xarray zswap vs the RB
tree zswap; that would be more of a direct comparison.

[...]
> The breakdown looks like something that would be a development step,
> but for patch submission I think it makes more sense to have a single
> patch replacing the rbtree with an xarray.

I think it makes the review easier. The code being added and removed
does not have much overlap, so combining it into a single patch does
not save patch size. Having the assert check would be useful for
bisecting to narrow down which step caused a problem. I am fine with
squashing it into one patch as well.

> Shouldn't this simply be changing uses of tree->lock to use
> xa_{lock/unlock}? We also need to make sure we don't try to lock the
> tree when operating on the xarray if the caller is already holding the
> lock, but this seems to be straightforward enough to be done as part
> of this patch or this series at least.
>
> Am I missing something?

Currently the zswap entry refcount is protected by the zswap tree spin
lock as well. We can't remove the tree spin lock without changing the
refcount code. I think the zswap entry search should just return the
entry with an atomic refcount increase, inside the RCU read section or
the xarray lock. The previous zswap code did the find_and_get entry(),
which is closer to what I want.

Chris
Hi Yosry and Chris,

On 2024/1/18 14:39, Yosry Ahmed wrote:
[...]
> Also, I assume we will only see performance improvements after the
> tree lock in its current form is removed so that we get loads
> protected only by RCU. Can we get some performance numbers to see how
> the latency improves with the xarray under contention (unless
> Chengming is already planning on testing this for his multi-tree
> patches)?

I just gave it a try: the same test of a kernel build in tmpfs with the
zswap shrinker enabled, all based on the latest mm/mm-stable branch.

        mm-stable     zswap-split-tree   zswap-xarray
real    1m10.442s     1m4.157s           1m9.962s
user    17m48.232s    17m41.477s         17m45.887s
sys     8m13.517s     5m2.226s           7m59.305s

Looks like the contention of concurrency is still there; I haven't
looked into the code yet, will review it later.
On Wed, Jan 17, 2024 at 10:57 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
[...]
> I just gave it a try: the same test of a kernel build in tmpfs with the
> zswap shrinker enabled, all based on the latest mm/mm-stable branch.
>
>         mm-stable     zswap-split-tree   zswap-xarray
> real    1m10.442s     1m4.157s           1m9.962s
> user    17m48.232s    17m41.477s         17m45.887s
> sys     8m13.517s     5m2.226s           7m59.305s
>
> Looks like the contention of concurrency is still there; I haven't
> looked into the code yet, will review it later.

I think that's expected with the current version because the tree
spin_lock is still there and we are still doing lookups with a
spinlock.
The name changes from Chris to Christopher are confusing :D

> I think it makes the review easier. The code being added and removed
> does not have much overlap, so combining it into a single patch does
> not save patch size. Having the assert check would be useful for
> bisecting to narrow down which step caused a problem. I am fine with
> squashing it into one patch as well.

I think having two patches is unnecessarily noisy, and we add some
debug code in this patch that we remove in the next patch anyway.
Let's see what others think, but personally I prefer a single patch.

> > Shouldn't this simply be changing uses of tree->lock to use
> > xa_{lock/unlock}?
[...]
> Currently the zswap entry refcount is protected by the zswap tree spin
> lock as well. We can't remove the tree spin lock without changing the
> refcount code. I think the zswap entry search should just return the
> entry with an atomic refcount increase, inside the RCU read section or
> the xarray lock. The previous zswap code did the find_and_get entry(),
> which is closer to what I want.

I think this can be done in an RCU read section surrounding xa_load()
and the refcount increment. Didn't look closely to check how much
complexity this adds to manage refcounts with RCU, but I think there
should be a lot of examples all around the kernel.

IIUC, there are no performance benefits from this conversion until we
remove the tree spinlock, right?
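The lookup-and-get pattern being discussed here would look roughly like the sketch below. It is a sketch only: the refcount_t field and the function name are assumptions; at this point zswap actually uses a plain int refcount protected by tree->lock.

```
/* Sketch: RCU-protected lookup that takes a reference on success. */
static struct zswap_entry *zswap_load_get(struct xarray *tree, pgoff_t offset)
{
	struct zswap_entry *entry;

	rcu_read_lock();
	entry = xa_load(tree, offset);
	/*
	 * An entry whose refcount already hit zero is being torn down;
	 * treat it as absent rather than resurrecting it.
	 */
	if (entry && !refcount_inc_not_zero(&entry->ref))
		entry = NULL;
	rcu_read_unlock();

	return entry;
}
```

For this to be safe, entries must be freed through an RCU grace period so a concurrent xa_load() never dereferences freed memory.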
On Wed, Jan 17, 2024 at 11:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Jan 17, 2024 at 10:57 PM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
[...]
> >         mm-stable     zswap-split-tree   zswap-xarray
> > real    1m10.442s     1m4.157s           1m9.962s
> > user    17m48.232s    17m41.477s         17m45.887s
> > sys     8m13.517s     5m2.226s           7m59.305s
> >
> > Looks like the contention of concurrency is still there; I haven't
> > looked into the code yet, will review it later.

Thanks for the quick test. Interesting to see the sys usage drop for
the xarray case even with the spin lock.
Not sure if the ~14 second saving is statistically significant or not.

We might need to have both the xarray and split trees for zswap. It is
likely that removing the spin lock wouldn't be able to make up the 35%
difference. That is just my guess; there is only one way to find out.

BTW, do you have a script I can run to replicate your results?

> I think that's expected with the current version because the tree
> spin_lock is still there and we are still doing lookups with a
> spinlock.

Right.

Chris
On Wed, Jan 17, 2024 at 11:05 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> The name changes from Chris to Christopher are confusing :D
[...]
> I think having two patches is unnecessarily noisy, and we add some
> debug code in this patch that we remove in the next patch anyway.
> Let's see what others think, but personally I prefer a single patch.
[...]
> I think this can be done in an RCU read section surrounding xa_load()

xa_load() already takes the RCU read lock inside. If you do that, you
might just as well use the XAS API to work with the lock directly.

> and the refcount increment. Didn't look closely to check how much
> complexity this adds to manage refcounts with RCU, but I think there
> should be a lot of examples all around the kernel.

The complexity is not in adding the refcount inside xa_load(). It is in
the zswap code that calls zswap_search() and zswap_{insert,erase}().
As far as I can tell, that code needs some tricky changes to go along
with the refcount change.

> IIUC, there are no performance benefits from this conversion until we
> remove the tree spinlock, right?

The original intent is helping the long-tail case. The RB tree has
worse long tails than the xarray. I expect it will help the page fault
long tail even without removing the tree spinlock.

Chris
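The XAS interface Chris mentions gives explicit control of the xarray lock around a sequence of operations; a minimal sketch of what "work with the lock directly" could look like (names illustrative):

```
/* Sketch: look up and erase in one locked section via the advanced API. */
static void *lookup_and_erase(struct xarray *tree, unsigned long offset)
{
	XA_STATE(xas, tree, offset);
	void *entry;

	xas_lock(&xas);			/* takes the xarray's xa_lock */
	entry = xas_load(&xas);
	if (entry)
		xas_store(&xas, NULL);	/* erase under the same lock hold */
	xas_unlock(&xas);

	return entry;
}
```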
On 2024/1/18 15:19, Chris Li wrote:
[...]
> Thanks for the quick test. Interesting to see the sys usage drop for
> the xarray case even with the spin lock.
> Not sure if the ~14 second saving is statistically significant or not.
>
> We might need to have both the xarray and split trees for zswap. It is
> likely that removing the spin lock wouldn't be able to make up the 35%
> difference. That is just my guess; there is only one way to find out.

Yes, I totally agree with this! IMHO, concurrent zswap_store paths
would still have to contend for the xarray spinlock even after we
convert the rb-tree to the xarray structure. So I think we should have
both.

> BTW, do you have a script I can run to replicate your results?

```
#!/bin/bash

testname="build-kernel-tmpfs"
cgroup="/sys/fs/cgroup/$testname"

tmpdir="/tmp/vm-scalability-tmp"
workdir="$tmpdir/$testname"

memory_max="$((2 * 1024 * 1024 * 1024))"

linux_src="/root/zcm/linux-6.6.tar.xz"
NR_TASK=32

swapon ~/zcm/swapfile
echo 60 > /proc/sys/vm/swappiness

echo zsmalloc > /sys/module/zswap/parameters/zpool
echo lz4 > /sys/module/zswap/parameters/compressor
echo 1 > /sys/module/zswap/parameters/shrinker_enabled
echo 1 > /sys/module/zswap/parameters/enabled

if ! [ -d $tmpdir ]; then
	mkdir -p $tmpdir
	mount -t tmpfs -o size=100% nodev $tmpdir
fi

mkdir -p $cgroup
echo $memory_max > $cgroup/memory.max
echo $$ > $cgroup/cgroup.procs

rm -rf $workdir
mkdir -p $workdir
cd $workdir

tar xvf $linux_src
cd linux-6.6
make -j$NR_TASK clean
make defconfig
time make -j$NR_TASK
```
On Wed, Jan 17, 2024 at 11:05:15PM -0800, Yosry Ahmed wrote:
> > I think it makes the review easier. The code being added and removed
> > does not have much overlap, so combining it into a single patch does
> > not save patch size. Having the assert check would be useful for
> > bisecting to narrow down which step caused a problem. I am fine with
> > squashing it into one patch as well.
>
> I think having two patches is unnecessarily noisy, and we add some
> debug code in this patch that we remove in the next patch anyway.
> Let's see what others think, but personally I prefer a single patch.

+1
On Wed, Jan 17, 2024 at 11:28 PM Chris Li <chrisl@kernel.org> wrote:
[...]
> > I think this can be done in an RCU read section surrounding xa_load()
>
> xa_load() already takes the RCU read lock inside. If you do that, you
> might just as well use the XAS API to work with the lock directly.

RCU read locks are nestable; some users of xa_load() do so in an RCU
read section, also for refcounting purposes. Also, I thought the point
was avoiding the lock in this path.

> > and the refcount increment. Didn't look closely to check how much
> > complexity this adds to manage refcounts with RCU, but I think there
> > should be a lot of examples all around the kernel.
>
> The complexity is not in adding the refcount inside xa_load(). It is in
> the zswap code that calls zswap_search() and zswap_{insert,erase}().
> As far as I can tell, that code needs some tricky changes to go along
> with the refcount change.

I don't think it should be very tricky.
https://docs.kernel.org/RCU/rcuref.html may have relevant examples, and
there should be examples all over the code.

> > IIUC, there are no performance benefits from this conversion until we
> > remove the tree spinlock, right?
>
> The original intent is helping the long-tail case. The RB tree has
> worse long tails than the xarray. I expect it will help the page fault
> long tail even without removing the tree spinlock.

I think it would be better if we can remove the tree spinlock as part
of this change.
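The release side of the rcuref pattern linked above would pair with the lookup sketch earlier in the thread; again a sketch only, assuming the entry embeds a refcount_t and a struct rcu_head (neither exists in zswap at this point):

```
/* Sketch: drop a reference; free via RCU once the last one is gone. */
static void zswap_entry_put(struct zswap_entry *entry)
{
	if (refcount_dec_and_test(&entry->ref)) {
		/*
		 * Defer the free past a grace period: a reader may have
		 * fetched this entry with xa_load() and be about to fail
		 * its refcount_inc_not_zero() check.
		 */
		kfree_rcu(entry, rcu);
	}
}
```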
On Wed, Jan 17, 2024 at 7:06 PM Chris Li <chrisl@kernel.org> wrote:
>
> The RB tree shows some contribution to the swap fault
> long tail latency due to two factors:
[...]
> Chris Li (2):
>       mm: zswap.c: add xarray tree to zswap

While I think it is pretty neat to keep the rbtree around to check if
the results agree during development stages, in the final version
please squash the patches. One patch is enough :)

>       mm: zswap.c: remove RB tree
[...]
* Christopher Li <chrisl@kernel.org> [240118 01:48]:
> On Wed, Jan 17, 2024 at 10:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
[...]
> > The breakdown looks like something that would be a development step,
> > but for patch submission I think it makes more sense to have a single
> > patch replacing the rbtree with an xarray.
>
> I think it makes the review easier. The code being added and removed
> does not have much overlap, so combining it into a single patch does
> not save patch size. Having the assert check would be useful for
> bisecting to narrow down which step caused a problem. I am fine with
> squashing it into one patch as well.

I had thought similarly when I replaced the rbtree with the maple tree
in the VMA space. That conversion was more involved and I wanted to
detect if there was ever any difference, and where I had made the error
in the multiple-patch conversion.

This became rather painful once an issue was found, as then anyone
bisecting other issues could hit this difference and either blamed the
commit pointing at the BUG_ON() or gave up (I don't blame them for
giving up, I would). With only two commits, it may be easier for people
to see a fixed tag pointing to the same commit that bisect found (if
they check), but it proved an issue with my multiple-patch conversion.

You may not experience this issue with the users of zswap, but I plan
to avoid doing this again in the future. At least a WARN_ON_ONCE() and
a comment might help?

Thanks,
Liam
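Concretely, Liam's suggestion would soften the transitional cross-check from a hard assertion to something bisect-friendly; a sketch of the idea, with placeholder names standing in for the RFC's dual-tree lookup code:

```
/* Sketch: transitional debug check while both structures coexist. */
static struct zswap_entry *zswap_search(struct zswap_tree *tree, pgoff_t offset)
{
	struct zswap_entry *rb = zswap_rb_search(&tree->rbroot, offset);
	struct zswap_entry *xa = xa_load(&tree->xa_entries, offset);

	/*
	 * Warn once instead of crashing, so people bisecting unrelated
	 * bugs are not stopped dead by a disagreement here.
	 */
	WARN_ON_ONCE(rb != xa);

	return xa;
}
```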
On Wed, Jan 17, 2024 at 11:35 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
[...]
> Yes, I totally agree with this! IMHO, concurrent zswap_store paths
> would still have to contend for the xarray spinlock even after we
> convert the rb-tree to the xarray structure. So I think we should have
> both.
>
> > BTW, do you have a script I can run to replicate your results?

Hi Chengming,

Thanks for your script.

> ```
> #!/bin/bash
>
> testname="build-kernel-tmpfs"
[...]
> swapon ~/zcm/swapfile

How big is your swapfile here?

It seems you have only one swapfile there. That can explain the
contention. Have you tried multiple swapfiles for the same test?
That should reduce the contention without using your patch.

Chris
On Thu, Jan 18, 2024 at 11:00 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
[...]
> I had thought similarly when I replaced the rbtree with the maple tree
> in the VMA space. That conversion was more involved and I wanted to
> detect if there was ever any difference, and where I had made the error
> in the multiple-patch conversion.
>
> This became rather painful once an issue was found, as then anyone
> bisecting other issues could hit this difference and either blamed the
> commit pointing at the BUG_ON() or gave up (I don't blame them for
> giving up, I would). With only two commits, it may be easier for people
> to see a fixed tag pointing to the same commit that bisect found (if
> they check), but it proved an issue with my multiple-patch conversion.

Thanks for sharing your experience. That debug assert did help me catch
issues in my own internal version after rebasing to the latest mm tree.
If the user can't do the bisect, then I agree we don't need the assert
in the official version. I can always bisect on my own internal version.

> You may not experience this issue with the users of zswap, but I plan
> to avoid doing this again in the future. At least a WARN_ON_ONCE() and
> a comment might help?

Sure. I might just merge the two patches and not have the BUG_ON() any
more.

Chris
On Thu, Jan 18, 2024 at 10:01 AM Nhat Pham <nphamcs@gmail.com> wrote:
[...]
> While I think it is pretty neat to keep the rbtree around to check if
> the results agree during development stages, in the final version
> please squash the patches. One patch is enough :)

Ack.

Chris
On 2024/1/19 12:59, Chris Li wrote:
[...]
> How big is your swapfile here?

The swapfile is big enough here; I use a 50GB swapfile.

> It seems you have only one swapfile there. That can explain the
> contention. Have you tried multiple swapfiles for the same test?
> That should reduce the contention without using your patch.

Do you mean to have many 64MB swapfiles swapped on at the same time?
Maybe it's feasible to test; I'm not sure how swapout will choose.
But in our usecase, we normally have only one swapfile.

Thanks.
On Thu, Jan 18, 2024 at 10:19 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
[...]
> > How big is your swapfile here?
>
> The swapfile is big enough here; I use a 50GB swapfile.

Thanks.

> > It seems you have only one swapfile there. That can explain the
> > contention. Have you tried multiple swapfiles for the same test?
> > That should reduce the contention without using your patch.
>
> Do you mean to have many 64MB swapfiles swapped on at the same time?

64MB is too small. There are limits to MAX_SWAPFILES; it is less than
(32 - n) swap files. If you want to use 50G of swap space, you can have
MAX_SWAPFILES swapfiles, each 50GB / MAX_SWAPFILES in size.

> Maybe it's feasible to test;

Of course it is testable; I am curious to see the test results.

> I'm not sure how swapout will choose.

It will rotate through the same-priority swap files first. See
get_swap_pages() in swapfile.c.

> But in our usecase, we normally have only one swapfile.

Is there a good reason why you can't use more than one swapfile?
One swapfile will not take full advantage of the existing code. Even if
you split the zswap trees within a swapfile, with only one swapfile you
will still have lock contention on "(struct swap_info_struct).lock",
which is one lock per swapfile. Using more than one swapfile should get
you better results.

Chris
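As an illustration of the multi-swapfile setup Chris suggests (a sketch only; the paths, file count, and sizes are arbitrary choices, not from the thread):

```
#!/bin/bash
# Split one 50GB swap area into 8 swapfiles at equal priority.
# get_swap_pages() rotates through same-priority devices, spreading
# the per-swapfile swap_info_struct lock contention.
NR=8
SIZE_MB=$((50 * 1024 / NR))

for i in $(seq 1 $NR); do
	f="/root/zcm/swapfile$i"
	dd if=/dev/zero of="$f" bs=1M count=$SIZE_MB status=none
	chmod 600 "$f"
	mkswap "$f" > /dev/null
	swapon -p 1 "$f"	# equal priority => round-robin allocation
done
```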
On 2024/1/19 18:26, Chris Li wrote:
[...]
> 64MB is too small. There are limits to MAX_SWAPFILES; it is less than
> (32 - n) swap files. If you want to use 50G of swap space, you can have
> MAX_SWAPFILES swapfiles, each 50GB / MAX_SWAPFILES in size.

Right.

[...]
> Is there a good reason why you can't use more than one swapfile?

I think no, but it seems an unneeded change/burden for our admins.
So I just tested and optimized for the normal case.

> One swapfile will not take full advantage of the existing code. Even if
> you split the zswap trees within a swapfile, with only one swapfile you
> will still have lock contention on "(struct swap_info_struct).lock",
> which is one lock per swapfile. Using more than one swapfile should get
> you better results.

IIUC, we already have the per-cpu swap entry cache so we don't contend
for this lock? And I don't see this lock being much of a hotspot in the
testing.

Thanks.
On Fri, Jan 19, 2024 at 3:12 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
[...]
> > Is there a good reason why you can't use more than one swapfile?
>
> I think no, but it seems an unneeded change/burden for our admins.
> So I just tested and optimized for the normal case.

I understand. Just saying it is not really a kernel limitation per se.
I blame the user space :-)

> > One swapfile will not take full advantage of the existing code. Even if
> > you split the zswap trees within a swapfile, with only one swapfile you
> > will still have lock contention on "(struct swap_info_struct).lock",
> > which is one lock per swapfile. Using more than one swapfile should get
> > you better results.
>
> IIUC, we already have the per-cpu swap entry cache so we don't contend
> for this lock? And I don't see this lock being much of a hotspot in the
> testing.

Yes, the swap entry cache helps. The cache batching also causes other
problems, e.g. the long tail in swap fault handling. Shameless plug: I
have a patch posted earlier to address the swap fault long-tail
latencies.

https://lore.kernel.org/linux-mm/20231221-async-free-v1-1-94b277992cb0@kernel.org/T/

Chris
The RB tree shows some contribution to the swap fault
long tail latency due to two factors:
1) The RB tree requires re-balancing from time to time.
2) The zswap RB tree has a tree-level spin lock protecting
the tree access.

The swap cache is using an xarray. The breakdown of the swap
cache access does not show a similar long tail as the zswap
RB tree.

Moving the zswap entries to an xarray enables the read side
to take only the RCU read lock.

The first patch adds the xarray alongside the RB tree.
There are some debug checks asserting that the xarray agrees with
the RB tree results.

The second patch removes the zswap RB tree.

I expect to merge the zswap RB tree spin lock with the xarray
lock in follow-up changes.

I can surely use some help in reviewing and testing.

Signed-off-by: Chris Li <chrisl@kernel.org>
---
Chris Li (2):
      mm: zswap.c: add xarray tree to zswap
      mm: zswap.c: remove RB tree

 mm/zswap.c | 120 ++++++++++++++++++++++++++++++-------------------------------
 1 file changed, 59 insertions(+), 61 deletions(-)
---
base-commit: d7ba3d7c3bf13e2faf419cce9e9bdfc3a1a50905
change-id: 20240104-zswap-xarray-716260e541e3

Best regards,
--
Chris Li <chrisl@kernel.org>