Message ID: 20240117-zswap-xarray-v1-0-6daa86c08fae@kernel.org
Series: RFC: zswap tree use xarray instead of RB tree
That's a long CC list for sure :)

On Wed, Jan 17, 2024 at 7:06 PM Chris Li <chrisl@kernel.org> wrote:
>
> The RB tree shows some contribution to the swap fault
> long tail latency due to two factors:
> 1) The RB tree requires re-balancing from time to time.
> 2) The zswap RB tree has a tree-level spin lock protecting
> the tree access.
>
> The swap cache is using an xarray. The breakdown of the swap
> cache access does not show a similar long tail as the zswap
> RB tree.

I think the comparison to the swap cache may not be valid as the swap
cache has many trees per swapfile, while zswap has a single tree.

> Moving the zswap entries to an xarray enables the read side
> to take only the RCU read lock.

Nice.

> The first patch adds the xarray alongside the RB tree.
> There are some debug checks asserting that the xarray agrees with
> the RB tree results.
>
> The second patch removes the zswap RB tree.

The breakdown looks like something that would be a development step,
but for patch submission I think it makes more sense to have a single
patch replacing the rbtree with an xarray.

> I expect to merge the zswap RB tree spin lock with the xarray
> lock in follow-up changes.

Shouldn't this simply be changing uses of tree->lock to use
xa_{lock/unlock}? We also need to make sure we don't try to lock the
tree when operating on the xarray if the caller is already holding the
lock, but this seems to be straightforward enough to be done as part
of this patch or this series at least.

Am I missing something?

> I can surely use some help in reviewing and testing.
>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
> Chris Li (2):
>       mm: zswap.c: add xarray tree to zswap
>       mm: zswap.c: remove RB tree
>
>  mm/zswap.c | 120 ++++++++++++++++++++++++++++++-------------------------------
>  1 file changed, 59 insertions(+), 61 deletions(-)
> ---
> base-commit: d7ba3d7c3bf13e2faf419cce9e9bdfc3a1a50905
> change-id: 20240104-zswap-xarray-716260e541e3
>
> Best regards,
> --
> Chris Li <chrisl@kernel.org>
On Wed, Jan 17, 2024 at 10:01 PM Yosry Ahmed <yosryahmed@google.com> wrote:
[...]
> Shouldn't this simply be changing uses of tree->lock to use
> xa_{lock/unlock}? We also need to make sure we don't try to lock the
> tree when operating on the xarray if the caller is already holding the
> lock, but this seems to be straightforward enough to be done as part
> of this patch or this series at least.
>
> Am I missing something?

Also, I assume we will only see performance improvements after the
tree lock in its current form is removed so that we get loads
protected only by RCU. Can we get some performance numbers to see how
the latency improves with the xarray under contention (unless
Chengming is already planning on testing this for his multi-tree
patches)?
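For reference, the xa_{lock/unlock} conversion discussed above would use the spinlock embedded in the xarray itself in place of zswap's tree->lock. A minimal sketch, assuming a zswap_tree that simply wraps a struct xarray (the type layout and names here are illustrative, not the actual zswap code):

```
#include <linux/xarray.h>

struct zswap_tree {
	struct xarray entries;	/* illustrative: replaces rbroot + tree->lock */
};

/* Erase an entry while holding the xarray's internal spinlock. */
static void *zswap_xa_erase(struct zswap_tree *tree, pgoff_t offset)
{
	void *entry;

	xa_lock(&tree->entries);	/* stands in for spin_lock(&tree->lock) */
	entry = __xa_erase(&tree->entries, offset);
	xa_unlock(&tree->entries);

	return entry;
}
```

Callers that already hold the lock would use the __xa_* variants directly, which is the double-locking caveat Yosry raises above.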
On Wed, Jan 17, 2024 at 10:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> That's a long CC list for sure :)
>
> On Wed, Jan 17, 2024 at 7:06 PM Chris Li <chrisl@kernel.org> wrote:
[...]
> I think the comparison to the swap cache may not be valid as the swap
> cache has many trees per swapfile, while zswap has a single tree.

Yes, good point. I think we can benchmark the xarray zswap vs the RB
tree zswap; that would be more of a direct comparison.

[...]
> The breakdown looks like something that would be a development step,
> but for patch submission I think it makes more sense to have a single
> patch replacing the rbtree with an xarray.

I think it makes the review easier. The code being added and removed
does not have much overlap, so combining it into a single patch does
not save patch size. Having the assert check would be useful for
bisecting to narrow down which step caused a problem. I am fine with
squashing it into one patch as well.

> Shouldn't this simply be changing uses of tree->lock to use
> xa_{lock/unlock}? We also need to make sure we don't try to lock the
> tree when operating on the xarray if the caller is already holding the
> lock, but this seems to be straightforward enough to be done as part
> of this patch or this series at least.
>
> Am I missing something?

Currently the zswap entry refcount is protected by the zswap tree spin
lock as well. We can't remove the tree spin lock without changing the
refcount code. I think the zswap entry search should just return the
entry with an atomic refcount increase, inside the RCU read section or
the xarray lock. The previous zswap code did the find_and_get entry(),
which is closer to what I want.

Chris
Hi Yosry and Chris,

On 2024/1/18 14:39, Yosry Ahmed wrote:
[...]
> Also, I assume we will only see performance improvements after the
> tree lock in its current form is removed so that we get loads
> protected only by RCU. Can we get some performance numbers to see how
> the latency improves with the xarray under contention (unless
> Chengming is already planning on testing this for his multi-tree
> patches)?

I just gave it a try: the same test of a kernel build in tmpfs with the
zswap shrinker enabled, all based on the latest mm/mm-stable branch.

        mm-stable     zswap-split-tree   zswap-xarray
real    1m10.442s     1m4.157s           1m9.962s
user    17m48.232s    17m41.477s         17m45.887s
sys     8m13.517s     5m2.226s           7m59.305s

Looks like the contention of concurrency is still there; I haven't
looked into the code yet, will review it later.
On Wed, Jan 17, 2024 at 10:57 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
[...]
> I just gave it a try: the same test of a kernel build in tmpfs with the
> zswap shrinker enabled, all based on the latest mm/mm-stable branch.
>
>         mm-stable     zswap-split-tree   zswap-xarray
> real    1m10.442s     1m4.157s           1m9.962s
> user    17m48.232s    17m41.477s         17m45.887s
> sys     8m13.517s     5m2.226s           7m59.305s
>
> Looks like the contention of concurrency is still there; I haven't
> looked into the code yet, will review it later.

I think that's expected with the current version because the tree
spin_lock is still there and we are still doing lookups with a
spinlock.
The name changes from Chris to Christopher are confusing :D

> I think it makes the review easier. The code being added and removed
> does not have much overlap, so combining it into a single patch does
> not save patch size. Having the assert check would be useful for
> bisecting to narrow down which step caused a problem. I am fine with
> squashing it into one patch as well.

I think having two patches is unnecessarily noisy, and we add some
debug code in this patch that we remove in the next patch anyway.
Let's see what others think, but personally I prefer a single patch.

> > Shouldn't this simply be changing uses of tree->lock to use
> > xa_{lock/unlock}?
[...]
> Currently the zswap entry refcount is protected by the zswap tree spin
> lock as well. We can't remove the tree spin lock without changing the
> refcount code. I think the zswap entry search should just return the
> entry with an atomic refcount increase, inside the RCU read section or
> the xarray lock. The previous zswap code did the find_and_get entry(),
> which is closer to what I want.

I think this can be done in an RCU read section surrounding xa_load()
and the refcount increment. Didn't look closely to check how much
complexity this adds to manage refcounts with RCU, but I think there
should be a lot of examples all around the kernel.

IIUC, there are no performance benefits from this conversion until we
remove the tree spinlock, right?
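The lookup-and-get pattern being discussed here would look roughly like the sketch below. It is a sketch only: the refcount_t field and the function name are assumptions; at this point zswap actually uses a plain int refcount protected by tree->lock.

```
/* Sketch: RCU-protected lookup that takes a reference on success. */
static struct zswap_entry *zswap_load_get(struct xarray *tree, pgoff_t offset)
{
	struct zswap_entry *entry;

	rcu_read_lock();
	entry = xa_load(tree, offset);
	/*
	 * An entry whose refcount already hit zero is being torn down;
	 * treat it as absent rather than resurrecting it.
	 */
	if (entry && !refcount_inc_not_zero(&entry->ref))
		entry = NULL;
	rcu_read_unlock();

	return entry;
}
```

For this to be safe, entries must be freed through an RCU grace period so a concurrent xa_load() never dereferences freed memory.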
On Wed, Jan 17, 2024 at 11:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Jan 17, 2024 at 10:57 PM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
[...]
> >         mm-stable     zswap-split-tree   zswap-xarray
> > real    1m10.442s     1m4.157s           1m9.962s
> > user    17m48.232s    17m41.477s         17m45.887s
> > sys     8m13.517s     5m2.226s           7m59.305s
> >
> > Looks like the contention of concurrency is still there; I haven't
> > looked into the code yet, will review it later.

Thanks for the quick test. Interesting to see the sys usage drop for
the xarray case even with the spin lock.
Not sure if the ~14 second saving is statistically significant or not.

We might need to have both the xarray and split trees for zswap. It is
likely that removing the spin lock wouldn't be able to make up the 35%
difference. That is just my guess; there is only one way to find out.

BTW, do you have a script I can run to replicate your results?

> I think that's expected with the current version because the tree
> spin_lock is still there and we are still doing lookups with a
> spinlock.

Right.

Chris
On Wed, Jan 17, 2024 at 11:05 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> The name changes from Chris to Christopher are confusing :D
[...]
> I think having two patches is unnecessarily noisy, and we add some
> debug code in this patch that we remove in the next patch anyway.
> Let's see what others think, but personally I prefer a single patch.
[...]
> I think this can be done in an RCU read section surrounding xa_load()

xa_load() already takes the RCU read lock inside. If you do that, you
might just as well use the XAS API to work with the lock directly.

> and the refcount increment. Didn't look closely to check how much
> complexity this adds to manage refcounts with RCU, but I think there
> should be a lot of examples all around the kernel.

The complexity is not in adding the refcount inside xa_load(). It is in
the zswap code that calls zswap_search() and zswap_{insert,erase}().
As far as I can tell, that code needs some tricky changes to go along
with the refcount change.

> IIUC, there are no performance benefits from this conversion until we
> remove the tree spinlock, right?

The original intent is helping the long-tail case. The RB tree has
worse long tails than the xarray. I expect it will help the page fault
long tail even without removing the tree spinlock.

Chris
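The XAS interface Chris mentions gives explicit control of the xarray lock around a sequence of operations; a minimal sketch of what "work with the lock directly" could look like (names illustrative):

```
/* Sketch: look up and erase in one locked section via the advanced API. */
static void *lookup_and_erase(struct xarray *tree, unsigned long offset)
{
	XA_STATE(xas, tree, offset);
	void *entry;

	xas_lock(&xas);			/* takes the xarray's xa_lock */
	entry = xas_load(&xas);
	if (entry)
		xas_store(&xas, NULL);	/* erase under the same lock hold */
	xas_unlock(&xas);

	return entry;
}
```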
On 2024/1/18 15:19, Chris Li wrote:
[...]
> Thanks for the quick test. Interesting to see the sys usage drop for
> the xarray case even with the spin lock.
> Not sure if the ~14 second saving is statistically significant or not.
>
> We might need to have both the xarray and split trees for zswap. It is
> likely that removing the spin lock wouldn't be able to make up the 35%
> difference. That is just my guess; there is only one way to find out.

Yes, I totally agree with this! IMHO, concurrent zswap_store paths
would still have to contend for the xarray spinlock even after we
convert the rb-tree to the xarray structure. So I think we should have
both.

> BTW, do you have a script I can run to replicate your results?

```
#!/bin/bash

testname="build-kernel-tmpfs"
cgroup="/sys/fs/cgroup/$testname"

tmpdir="/tmp/vm-scalability-tmp"
workdir="$tmpdir/$testname"

memory_max="$((2 * 1024 * 1024 * 1024))"

linux_src="/root/zcm/linux-6.6.tar.xz"
NR_TASK=32

swapon ~/zcm/swapfile
echo 60 > /proc/sys/vm/swappiness

echo zsmalloc > /sys/module/zswap/parameters/zpool
echo lz4 > /sys/module/zswap/parameters/compressor
echo 1 > /sys/module/zswap/parameters/shrinker_enabled
echo 1 > /sys/module/zswap/parameters/enabled

if ! [ -d $tmpdir ]; then
	mkdir -p $tmpdir
	mount -t tmpfs -o size=100% nodev $tmpdir
fi

mkdir -p $cgroup
echo $memory_max > $cgroup/memory.max
echo $$ > $cgroup/cgroup.procs

rm -rf $workdir
mkdir -p $workdir
cd $workdir

tar xvf $linux_src
cd linux-6.6
make -j$NR_TASK clean
make defconfig
time make -j$NR_TASK
```
On Wed, Jan 17, 2024 at 11:05:15PM -0800, Yosry Ahmed wrote:
> > I think it makes the review easier. The code being added and removed
> > does not have much overlap, so combining it into a single patch does
> > not save patch size. Having the assert check would be useful for
> > bisecting to narrow down which step caused a problem. I am fine with
> > squashing it into one patch as well.
>
> I think having two patches is unnecessarily noisy, and we add some
> debug code in this patch that we remove in the next patch anyway.
> Let's see what others think, but personally I prefer a single patch.

+1
On Wed, Jan 17, 2024 at 11:28 PM Chris Li <chrisl@kernel.org> wrote:
[...]
> > I think this can be done in an RCU read section surrounding xa_load()
>
> xa_load() already takes the RCU read lock inside. If you do that, you
> might just as well use the XAS API to work with the lock directly.

RCU read locks are nestable; some users of xa_load() do so in an RCU
read section, also for refcounting purposes. Also, I thought the point
was avoiding the lock in this path.

> > and the refcount increment. Didn't look closely to check how much
> > complexity this adds to manage refcounts with RCU, but I think there
> > should be a lot of examples all around the kernel.
>
> The complexity is not in adding the refcount inside xa_load(). It is in
> the zswap code that calls zswap_search() and zswap_{insert,erase}().
> As far as I can tell, that code needs some tricky changes to go along
> with the refcount change.

I don't think it should be very tricky.
https://docs.kernel.org/RCU/rcuref.html may have relevant examples, and
there should be examples all over the code.

> > IIUC, there are no performance benefits from this conversion until we
> > remove the tree spinlock, right?
>
> The original intent is helping the long-tail case. The RB tree has
> worse long tails than the xarray. I expect it will help the page fault
> long tail even without removing the tree spinlock.

I think it would be better if we can remove the tree spinlock as part
of this change.
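The release side of the rcuref pattern linked above would pair with the lookup sketch earlier in the thread; again a sketch only, assuming the entry embeds a refcount_t and a struct rcu_head (neither exists in zswap at this point):

```
/* Sketch: drop a reference; free via RCU once the last one is gone. */
static void zswap_entry_put(struct zswap_entry *entry)
{
	if (refcount_dec_and_test(&entry->ref)) {
		/*
		 * Defer the free past a grace period: a reader may have
		 * fetched this entry with xa_load() and be about to fail
		 * its refcount_inc_not_zero() check.
		 */
		kfree_rcu(entry, rcu);
	}
}
```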
On Wed, Jan 17, 2024 at 7:06 PM Chris Li <chrisl@kernel.org> wrote:
>
> The RB tree shows some contribution to the swap fault
> long tail latency due to two factors:
[...]
> Chris Li (2):
>       mm: zswap.c: add xarray tree to zswap

While I think it is pretty neat to keep the rbtree around to check if
the results agree during development stages, in the final version
please squash the patches. One patch is enough :)

>       mm: zswap.c: remove RB tree
[...]
* Christopher Li <chrisl@kernel.org> [240118 01:48]:
> On Wed, Jan 17, 2024 at 10:02 PM Yosry Ahmed <yosryahmed@google.com> wrote:
[...]
> > The breakdown looks like something that would be a development step,
> > but for patch submission I think it makes more sense to have a single
> > patch replacing the rbtree with an xarray.
>
> I think it makes the review easier. The code being added and removed
> does not have much overlap, so combining it into a single patch does
> not save patch size. Having the assert check would be useful for
> bisecting to narrow down which step caused a problem. I am fine with
> squashing it into one patch as well.

I had thought similarly when I replaced the rbtree with the maple tree
in the VMA space. That conversion was more involved and I wanted to
detect if there was ever any difference, and where I had made the error
in the multiple-patch conversion.

This became rather painful once an issue was found, as then anyone
bisecting other issues could hit this difference and either blamed the
commit pointing at the BUG_ON() or gave up (I don't blame them for
giving up, I would). With only two commits, it may be easier for people
to see a fixed tag pointing to the same commit that bisect found (if
they check), but it proved an issue with my multiple-patch conversion.

You may not experience this issue with the users of zswap, but I plan
to avoid doing this again in the future. At least a WARN_ON_ONCE() and
a comment might help?

Thanks,
Liam
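Concretely, Liam's suggestion would soften the transitional cross-check from a hard assertion to something bisect-friendly; a sketch of the idea, with placeholder names standing in for the RFC's dual-tree lookup code:

```
/* Sketch: transitional debug check while both structures coexist. */
static struct zswap_entry *zswap_search(struct zswap_tree *tree, pgoff_t offset)
{
	struct zswap_entry *rb = zswap_rb_search(&tree->rbroot, offset);
	struct zswap_entry *xa = xa_load(&tree->xa_entries, offset);

	/*
	 * Warn once instead of crashing, so people bisecting unrelated
	 * bugs are not stopped dead by a disagreement here.
	 */
	WARN_ON_ONCE(rb != xa);

	return xa;
}
```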
On Wed, Jan 17, 2024 at 11:35 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
[...]
> Yes, I totally agree with this! IMHO, concurrent zswap_store paths
> would still have to contend for the xarray spinlock even after we
> convert the rb-tree to the xarray structure. So I think we should have
> both.
>
> > BTW, do you have a script I can run to replicate your results?

Hi Chengming,

Thanks for your script.

> ```
> #!/bin/bash
>
> testname="build-kernel-tmpfs"
[...]
> swapon ~/zcm/swapfile

How big is your swapfile here?

It seems you have only one swapfile there. That can explain the
contention. Have you tried multiple swapfiles for the same test?
That should reduce the contention without using your patch.

Chris
On Thu, Jan 18, 2024 at 11:00 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
[...]
> I had thought similarly when I replaced the rbtree with the maple tree
> in the VMA space. That conversion was more involved and I wanted to
> detect if there was ever any difference, and where I had made the error
> in the multiple-patch conversion.
>
> This became rather painful once an issue was found, as then anyone
> bisecting other issues could hit this difference and either blamed the
> commit pointing at the BUG_ON() or gave up (I don't blame them for
> giving up, I would). With only two commits, it may be easier for people
> to see a fixed tag pointing to the same commit that bisect found (if
> they check), but it proved an issue with my multiple-patch conversion.

Thanks for sharing your experience. That debug assert did help me catch
issues in my own internal version after rebasing to the latest mm tree.
If the user can't do the bisect, then I agree we don't need the assert
in the official version. I can always bisect on my own internal version.

> You may not experience this issue with the users of zswap, but I plan
> to avoid doing this again in the future. At least a WARN_ON_ONCE() and
> a comment might help?

Sure. I might just merge the two patches and not have the BUG_ON() any
more.

Chris
On Thu, Jan 18, 2024 at 10:01 AM Nhat Pham <nphamcs@gmail.com> wrote:
[...]
> While I think it is pretty neat to keep the rbtree around to check if
> the results agree during development stages, in the final version
> please squash the patches. One patch is enough :)

Ack.

Chris
On 2024/1/19 12:59, Chris Li wrote:
[...]
> How big is your swapfile here?

The swapfile is big enough here; I use a 50GB swapfile.

> It seems you have only one swapfile there. That can explain the
> contention. Have you tried multiple swapfiles for the same test?
> That should reduce the contention without using your patch.

Do you mean to have many 64MB swapfiles swapped on at the same time?
Maybe it's feasible to test; I'm not sure how swapout will choose.
But in our usecase, we normally have only one swapfile.

Thanks.
On Thu, Jan 18, 2024 at 10:19 PM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
[...]
> > How big is your swapfile here?
>
> The swapfile is big enough here; I use a 50GB swapfile.

Thanks.

> > It seems you have only one swapfile there. That can explain the
> > contention. Have you tried multiple swapfiles for the same test?
> > That should reduce the contention without using your patch.
>
> Do you mean to have many 64MB swapfiles swapped on at the same time?

64MB is too small. There are limits to MAX_SWAPFILES; it is less than
(32 - n) swap files. If you want to use 50G of swap space, you can have
MAX_SWAPFILES swapfiles, each 50GB / MAX_SWAPFILES in size.

> Maybe it's feasible to test;

Of course it is testable; I am curious to see the test results.

> I'm not sure how swapout will choose.

It will rotate through the same-priority swap files first. See
get_swap_pages() in swapfile.c.

> But in our usecase, we normally have only one swapfile.

Is there a good reason why you can't use more than one swapfile?
One swapfile will not take full advantage of the existing code. Even if
you split the zswap trees within a swapfile, with only one swapfile you
will still have lock contention on "(struct swap_info_struct).lock",
which is one lock per swapfile. Using more than one swapfile should get
you better results.

Chris
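As an illustration of the multi-swapfile setup Chris suggests (a sketch only; the paths, file count, and sizes are arbitrary choices, not from the thread):

```
#!/bin/bash
# Split one 50GB swap area into 8 swapfiles at equal priority.
# get_swap_pages() rotates through same-priority devices, spreading
# the per-swapfile swap_info_struct lock contention.
NR=8
SIZE_MB=$((50 * 1024 / NR))

for i in $(seq 1 $NR); do
	f="/root/zcm/swapfile$i"
	dd if=/dev/zero of="$f" bs=1M count=$SIZE_MB status=none
	chmod 600 "$f"
	mkswap "$f" > /dev/null
	swapon -p 1 "$f"	# equal priority => round-robin allocation
done
```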
On 2024/1/19 18:26, Chris Li wrote:
[...]
> 64MB is too small. There are limits to MAX_SWAPFILES; it is less than
> (32 - n) swap files. If you want to use 50G of swap space, you can have
> MAX_SWAPFILES swapfiles, each 50GB / MAX_SWAPFILES in size.

Right.

[...]
> Is there a good reason why you can't use more than one swapfile?

I think no, but it seems an unneeded change/burden for our admins.
So I just tested and optimized for the normal case.

> One swapfile will not take full advantage of the existing code. Even if
> you split the zswap trees within a swapfile, with only one swapfile you
> will still have lock contention on "(struct swap_info_struct).lock",
> which is one lock per swapfile. Using more than one swapfile should get
> you better results.

IIUC, we already have the per-cpu swap entry cache so we don't contend
for this lock? And I don't see this lock being much of a hotspot in the
testing.

Thanks.
On Fri, Jan 19, 2024 at 3:12 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
[...]
> > Is there a good reason why you can't use more than one swapfile?
>
> I think no, but it seems an unneeded change/burden for our admins.
> So I just tested and optimized for the normal case.

I understand. Just saying it is not really a kernel limitation per se.
I blame the user space :-)

> > One swapfile will not take full advantage of the existing code. Even if
> > you split the zswap trees within a swapfile, with only one swapfile you
> > will still have lock contention on "(struct swap_info_struct).lock",
> > which is one lock per swapfile. Using more than one swapfile should get
> > you better results.
>
> IIUC, we already have the per-cpu swap entry cache so we don't contend
> for this lock? And I don't see this lock being much of a hotspot in the
> testing.

Yes, the swap entry cache helps. The cache batching also causes other
problems, e.g. the long tail in swap fault handling. Shameless plug: I
have a patch posted earlier to address the swap fault long-tail
latencies.

https://lore.kernel.org/linux-mm/20231221-async-free-v1-1-94b277992cb0@kernel.org/T/

Chris
The RB tree shows some contribution to the swap fault
long tail latency due to two factors:
1) The RB tree requires re-balancing from time to time.
2) The zswap RB tree has a tree-level spin lock protecting
the tree access.

The swap cache is using an xarray. The breakdown of the swap
cache access does not show a similar long tail as the zswap
RB tree.

Moving the zswap entries to an xarray enables the read side
to take only the RCU read lock.

The first patch adds the xarray alongside the RB tree.
There are some debug checks asserting that the xarray agrees with
the RB tree results.

The second patch removes the zswap RB tree.

I expect to merge the zswap RB tree spin lock with the xarray
lock in follow-up changes.

I can surely use some help in reviewing and testing.

Signed-off-by: Chris Li <chrisl@kernel.org>
---
Chris Li (2):
      mm: zswap.c: add xarray tree to zswap
      mm: zswap.c: remove RB tree

 mm/zswap.c | 120 ++++++++++++++++++++++++++++++-------------------------------
 1 file changed, 59 insertions(+), 61 deletions(-)
---
base-commit: d7ba3d7c3bf13e2faf419cce9e9bdfc3a1a50905
change-id: 20240104-zswap-xarray-716260e541e3

Best regards,
--
Chris Li <chrisl@kernel.org>