Message ID | 20231206-zswap-lock-optimize-v1-0-e25b059f9c3a@bytedance.com |
---|---|
Series | mm/zswap: optimize the scalability of zswap rb-tree |
+ Chris Li

Chris, I vaguely remember from our last conversation that you have some concurrent efforts to use an xarray here, right?

On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou <zhouchengming@bytedance.com> wrote:
>
> Hi everyone,
>
> This patch series is based on linux-next 20231205, which depends on the "workload-specific and memory pressure-driven zswap writeback" series from Nhat Pham.
>
> When testing zswap performance with a kernel build (-j32) in a tmpfs directory, I found that the scalability of the zswap rb-tree, which is protected by a single spinlock, is not good. That causes heavy lock contention when multiple tasks zswap_store()/zswap_load() concurrently.
>
> So a simple solution is to split the single zswap rb-tree into multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
>
> Although this method can't eliminate the spinlock contention completely, it can mitigate much of it.
>
> Another problem when testing zswap with our default zsmalloc is that zswap_load() and zswap_writeback_entry() have to allocate temporary memory to support !zpool_can_sleep_mapped().
>
> Optimize this by reusing the per-CPU crypto_acomp_ctx->dstmem, which is also used by zswap_store() and protected by the same per-CPU crypto_acomp_ctx->mutex.
>
> Thanks for review and comments!
>
> To: Andrew Morton <akpm@linux-foundation.org>
> To: Seth Jennings <sjenning@redhat.com>
> To: Dan Streetman <ddstreet@ieee.org>
> To: Vitaly Wool <vitaly.wool@konsulko.com>
> To: Nhat Pham <nphamcs@gmail.com>
> To: Johannes Weiner <hannes@cmpxchg.org>
> To: Yosry Ahmed <yosryahmed@google.com>
> To: Michal Hocko <mhocko@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>
> ---
> Chengming Zhou (7):
>       mm/zswap: make sure each swapfile always have zswap rb-tree
>       mm/zswap: split zswap rb-tree
>       mm/zswap: reuse dstmem when decompress
>       mm/zswap: change dstmem size to one page
>       mm/zswap: refactor out __zswap_load()
>       mm/zswap: cleanup zswap_load()
>       mm/zswap: cleanup zswap_reclaim_entry()
>
>  include/linux/zswap.h |   4 +-
>  mm/swapfile.c         |  10 ++-
>  mm/zswap.c            | 233 +++++++++++++++++++++-----------------------
>  3 files changed, 106 insertions(+), 141 deletions(-)
> ---
> base-commit: 0f5f12ac05f36f117e793656c3f560625e927f1b
> change-id: 20231206-zswap-lock-optimize-06f45683b02b
>
> Best regards,
> --
> Chengming Zhou <zhouchengming@bytedance.com>
On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou <zhouchengming@bytedance.com> wrote:
> When testing zswap performance with a kernel build (-j32) in a tmpfs directory, I found that the scalability of the zswap rb-tree, which is protected by a single spinlock, is not good. That causes heavy lock contention when multiple tasks zswap_store()/zswap_load() concurrently.
>
> So a simple solution is to split the single zswap rb-tree into multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
>
> Although this method can't eliminate the spinlock contention completely, it can mitigate much of it.

By how much? Do you have any stats to estimate the amount of contention and the reduction from this patch?

I do think lock contention could be a problem here, and it will be even worse with the zswap shrinker enabled (which introduces a theoretically unbounded number of concurrent reclaimers hammering on the zswap rb-tree and its lock). I am generally a bit wary about architectural changes though, especially if they are just a bandaid. We have tried to reduce the lock contention somewhere else (multiple zpools), and as predicted it just shifts the contention point elsewhere. Maybe we need a deeper architectural rethink.

Not an outright NACK of course - just food for thought.

> Another problem when testing zswap with our default zsmalloc is that zswap_load() and zswap_writeback_entry() have to allocate temporary memory to support !zpool_can_sleep_mapped().
>
> Optimize this by reusing the per-CPU crypto_acomp_ctx->dstmem, which is also used by zswap_store() and protected by the same per-CPU crypto_acomp_ctx->mutex.

It'd be nice to reduce the (temporary) memory allocation on these paths, but would this introduce contention on the per-CPU dstmem and the mutex that protects it, if there are too many concurrent store/load/writeback requests?
On Wed, Dec 6, 2023 at 9:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> + Chris Li
>
> Chris, I vaguely remember from our last conversation that you have some concurrent efforts to use an xarray here, right?

If I recall correctly, the xarray already reduces the lock contention as lookups are lockless, but Chris knows more here. As you mentioned in a different email, it would be nice to get some data so that we can compare the different solutions.

> On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou <zhouchengming@bytedance.com> wrote:
> >
> > Hi everyone,
> >
> > This patch series is based on linux-next 20231205, which depends on the "workload-specific and memory pressure-driven zswap writeback" series from Nhat Pham.
> >
> > When testing zswap performance with a kernel build (-j32) in a tmpfs directory, I found that the scalability of the zswap rb-tree, which is protected by a single spinlock, is not good. That causes heavy lock contention when multiple tasks zswap_store()/zswap_load() concurrently.
> >
> > So a simple solution is to split the single zswap rb-tree into multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
> >
> > Although this method can't eliminate the spinlock contention completely, it can mitigate much of it.
> >
> > Another problem when testing zswap with our default zsmalloc is that zswap_load() and zswap_writeback_entry() have to allocate temporary memory to support !zpool_can_sleep_mapped().
> >
> > Optimize this by reusing the per-CPU crypto_acomp_ctx->dstmem, which is also used by zswap_store() and protected by the same per-CPU crypto_acomp_ctx->mutex.
> >
> > Thanks for review and comments!
> >
> > To: Andrew Morton <akpm@linux-foundation.org>
> > To: Seth Jennings <sjenning@redhat.com>
> > To: Dan Streetman <ddstreet@ieee.org>
> > To: Vitaly Wool <vitaly.wool@konsulko.com>
> > To: Nhat Pham <nphamcs@gmail.com>
> > To: Johannes Weiner <hannes@cmpxchg.org>
> > To: Yosry Ahmed <yosryahmed@google.com>
> > To: Michal Hocko <mhocko@kernel.org>
> > Cc: linux-kernel@vger.kernel.org
> > Cc: linux-mm@kvack.org
> > Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> >
> > ---
> > Chengming Zhou (7):
> >       mm/zswap: make sure each swapfile always have zswap rb-tree
> >       mm/zswap: split zswap rb-tree
> >       mm/zswap: reuse dstmem when decompress
> >       mm/zswap: change dstmem size to one page
> >       mm/zswap: refactor out __zswap_load()
> >       mm/zswap: cleanup zswap_load()
> >       mm/zswap: cleanup zswap_reclaim_entry()
> >
> >  include/linux/zswap.h |   4 +-
> >  mm/swapfile.c         |  10 ++-
> >  mm/zswap.c            | 233 +++++++++++++++++++++-----------------------
> >  3 files changed, 106 insertions(+), 141 deletions(-)
> > ---
> > base-commit: 0f5f12ac05f36f117e793656c3f560625e927f1b
> > change-id: 20231206-zswap-lock-optimize-06f45683b02b
> >
> > Best regards,
> > --
> > Chengming Zhou <zhouchengming@bytedance.com>
Hi Nhat and Yosry,

On Wed, Dec 6, 2023 at 12:42 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Dec 6, 2023 at 9:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > + Chris Li
> >
> > Chris, I vaguely remember from our last conversation that you have some concurrent efforts to use an xarray here, right?

Yes, I do have the zswap xarray for older versions of the kernel. The recent mm-unstable tree has a lot of zswap-related updates. Give me 2 days to refresh and post it. The zswap invalid-entry and reference-count changes are causing a good portion of the code to be updated. That is the price to pay for keeping out-of-tree patches. My fault for not getting to it sooner.

> If I recall correctly, the xarray already reduces the lock contention as lookups are lockless, but Chris knows more here. As you mentioned

Yes. To be exact, the xarray can use a spinlock (same as the current rb-tree) or take the RCU read lock on the lookup path (depending on how you call the xarray API). Not completely lockless, but the RCU read lock should have less lock contention than a normal spinlock. +Matthew

> in a different email, it would be nice to get some data so that we can compare the different solutions.

Yes, it is certainly welcome to see more data points. If I recall correctly, the zswap xarray makes the lookup similar to the swap cache lookup. It has a noticeable difference in the long tail end.

Stay tuned.

Chris
On 2023/12/7 04:08, Nhat Pham wrote:
> On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou <zhouchengming@bytedance.com> wrote:
>> When testing zswap performance with a kernel build (-j32) in a tmpfs directory, I found that the scalability of the zswap rb-tree, which is protected by a single spinlock, is not good. That causes heavy lock contention when multiple tasks zswap_store()/zswap_load() concurrently.
>>
>> So a simple solution is to split the single zswap rb-tree into multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
>>
>> Although this method can't eliminate the spinlock contention completely, it can mitigate much of it.
>
> By how much? Do you have any stats to estimate the amount of contention and the reduction from this patch?

Actually, I did some tests using linux-next 20231205 yesterday.

Testcase: memory.max = 2G, zswap enabled, make -j32 in tmpfs.

                        20231205    +patchset
1. !shrinker_enabled:     156s        126s
2. shrinker_enabled:       79s         70s

I think your zswap shrinker fix patch can solve the !shrinker_enabled case, so I will test again today using the new mm-unstable branch.

> I do think lock contention could be a problem here, and it will be even worse with the zswap shrinker enabled (which introduces a theoretically unbounded number of concurrent reclaimers hammering on the zswap rb-tree and its lock). I am generally a bit wary about architectural changes though, especially if they are just a bandaid. We have tried to reduce the lock contention somewhere else (multiple zpools), and as predicted it just shifts the contention point elsewhere. Maybe we need a deeper architectural rethink.
>
> Not an outright NACK of course - just food for thought.

Right, I think the xarray is good for a lockless read side, and multiple trees are complementary to that, since they can also reduce lock contention on the write side.

>> Another problem when testing zswap with our default zsmalloc is that zswap_load() and zswap_writeback_entry() have to allocate temporary memory to support !zpool_can_sleep_mapped().
>>
>> Optimize this by reusing the per-CPU crypto_acomp_ctx->dstmem, which is also used by zswap_store() and protected by the same per-CPU crypto_acomp_ctx->mutex.
>
> It'd be nice to reduce the (temporary) memory allocation on these paths, but would this introduce contention on the per-CPU dstmem and the mutex that protects it, if there are too many concurrent store/load/writeback requests?

I think the mutex hold time is not changed, right? So the contention on the per-CPU mutex should be the same; we just reuse the per-CPU dstmem more.

Thanks!
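To make the dstmem reuse discussed above concrete, below is a rough C sketch of the idea for the decompression path when !zpool_can_sleep_mapped(): the compressed data is copied into the per-CPU buffer that zswap_store() already uses, under the same per-CPU mutex, instead of into a freshly allocated temporary buffer. This is only an illustration of the approach, not the actual patch; the context struct is simplified and the names are assumptions.

```c
#include <crypto/acompress.h>
#include <linux/mutex.h>
#include <linux/scatterlist.h>
#include <linux/zpool.h>

/* Simplified stand-in for zswap's per-CPU compression context. */
struct acomp_ctx {
	struct acomp_req *req;
	struct crypto_wait wait;
	u8 *dstmem;		/* per-CPU buffer, shared with the store path */
	struct mutex *mutex;	/* serializes all users of req/dstmem on this CPU */
};

/* Decompress one entry into @page, reusing ctx->dstmem as the bounce buffer. */
static int zswap_decompress_sketch(struct acomp_ctx *ctx, struct zpool *pool,
				   unsigned long handle, unsigned int length,
				   struct page *page)
{
	struct scatterlist input, output;
	u8 *src;
	int ret;

	mutex_lock(ctx->mutex);
	src = zpool_map_handle(pool, handle, ZPOOL_MM_RO);
	if (!zpool_can_sleep_mapped(pool)) {
		/* Can't sleep while mapped: copy into dstmem and unmap early. */
		memcpy(ctx->dstmem, src, length);
		src = ctx->dstmem;
		zpool_unmap_handle(pool, handle);
	}

	sg_init_one(&input, src, length);
	sg_init_table(&output, 1);
	sg_set_page(&output, page, PAGE_SIZE, 0);
	acomp_request_set_params(ctx->req, &input, &output, length, PAGE_SIZE);
	ret = crypto_wait_req(crypto_acomp_decompress(ctx->req), &ctx->wait);
	mutex_unlock(ctx->mutex);

	if (src != ctx->dstmem)
		zpool_unmap_handle(pool, handle);
	return ret;
}
```

The mutex is already taken for the whole map/decompress window on the store side, which is why reusing dstmem here should not lengthen the critical section, only remove the temporary allocation.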
On 2023/12/7 08:43, Chris Li wrote:
> Hi Nhat and Yosry,
>
> On Wed, Dec 6, 2023 at 12:42 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> On Wed, Dec 6, 2023 at 9:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
>>>
>>> + Chris Li
>>>
>>> Chris, I vaguely remember from our last conversation that you have some concurrent efforts to use an xarray here, right?
>
> Yes, I do have the zswap xarray for older versions of the kernel. The recent mm-unstable tree has a lot of zswap-related updates. Give me 2 days to refresh and post it. The zswap invalid-entry and reference-count changes are causing a good portion of the code to be updated. That is the price to pay for keeping out-of-tree patches. My fault for not getting to it sooner.
>
>> If I recall correctly, the xarray already reduces the lock contention as lookups are lockless, but Chris knows more here. As you mentioned
>
> Yes. To be exact, the xarray can use a spinlock (same as the current rb-tree) or take the RCU read lock on the lookup path (depending on how you call the xarray API). Not completely lockless, but the RCU read lock should have less lock contention than a normal spinlock. +Matthew

Great! A lockless lookup in zswap_load() should reduce the spinlock contention, and multiple trees (multiple xarrays) can further reduce the contention on the concurrent zswap_store() side. So they are complementary, IMHO.

>> in a different email, it would be nice to get some data so that we can compare the different solutions.
>
> Yes, it is certainly welcome to see more data points. If I recall correctly, the zswap xarray makes the lookup similar to the swap cache lookup. It has a noticeable difference in the long tail end.

Right, I posted some data from yesterday in another reply. I will test again and update the data, since Nhat's zswap shrinker fix patch has been merged into mm-unstable today.

Thanks!
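To illustrate the read-side/write-side distinction being discussed, here is a rough sketch of the two xarray access patterns. The names are illustrative and not taken from Chris's series.

```c
#include <linux/xarray.h>

struct zswap_entry;

/* Illustrative only: one xarray indexed by swap offset. */
static DEFINE_XARRAY(zswap_xa);

/*
 * Read side: xa_load() walks the tree under the RCU read lock it takes
 * internally, so concurrent zswap_load() callers do not serialize on a
 * spinlock the way rb-tree lookups do. The caller still needs its own
 * scheme (e.g. a refcount) to keep the returned entry alive.
 */
static struct zswap_entry *zswap_xa_lookup(pgoff_t offset)
{
	return xa_load(&zswap_xa, offset);
}

/*
 * Write side: xa_store() still serializes on the xarray's internal
 * spinlock, which is where splitting into multiple trees/xarrays can
 * further help concurrent zswap_store() callers.
 */
static int zswap_xa_insert(pgoff_t offset, struct zswap_entry *entry)
{
	return xa_err(xa_store(&zswap_xa, offset, entry, GFP_KERNEL));
}
```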
On 2023/12/7 11:13, Chengming Zhou wrote:
> On 2023/12/7 04:08, Nhat Pham wrote:
>> On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou <zhouchengming@bytedance.com> wrote:
>>> When testing zswap performance with a kernel build (-j32) in a tmpfs directory, I found that the scalability of the zswap rb-tree, which is protected by a single spinlock, is not good. That causes heavy lock contention when multiple tasks zswap_store()/zswap_load() concurrently.
>>>
>>> So a simple solution is to split the single zswap rb-tree into multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
>>>
>>> Although this method can't eliminate the spinlock contention completely, it can mitigate much of it.
>>
>> By how much? Do you have any stats to estimate the amount of contention and the reduction from this patch?
>
> Actually, I did some tests using linux-next 20231205 yesterday.
>
> Testcase: memory.max = 2G, zswap enabled, make -j32 in tmpfs.
>
>                         20231205    +patchset
> 1. !shrinker_enabled:     156s        126s
> 2. shrinker_enabled:       79s         70s
>
> I think your zswap shrinker fix patch can solve the !shrinker_enabled case, so I will test again today using the new mm-unstable branch.

Updated test data based on today's mm-unstable branch:

                        mm-unstable    +patchset
1. !shrinker_enabled:       86s           74s
2. shrinker_enabled:        63s           61s

This shows a much smaller improvement for the shrinker_enabled case, but still a large improvement for the !shrinker_enabled case.

Thanks!
On Thu, Dec 7, 2023 at 7:18 AM Chengming Zhou <zhouchengming@bytedance.com> wrote:
>
> On 2023/12/7 11:13, Chengming Zhou wrote:
> > On 2023/12/7 04:08, Nhat Pham wrote:
> >> On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou <zhouchengming@bytedance.com> wrote:
> >>> When testing zswap performance with a kernel build (-j32) in a tmpfs directory, I found that the scalability of the zswap rb-tree, which is protected by a single spinlock, is not good. That causes heavy lock contention when multiple tasks zswap_store()/zswap_load() concurrently.
> >>>
> >>> So a simple solution is to split the single zswap rb-tree into multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
> >>>
> >>> Although this method can't eliminate the spinlock contention completely, it can mitigate much of it.
> >>
> >> By how much? Do you have any stats to estimate the amount of contention and the reduction from this patch?
> >
> > Actually, I did some tests using linux-next 20231205 yesterday.
> >
> > Testcase: memory.max = 2G, zswap enabled, make -j32 in tmpfs.
> >
> >                         20231205    +patchset
> > 1. !shrinker_enabled:     156s        126s
> > 2. shrinker_enabled:       79s         70s
> >
> > I think your zswap shrinker fix patch can solve the !shrinker_enabled case, so I will test again today using the new mm-unstable branch.
>
> Updated test data based on today's mm-unstable branch:
>
>                         mm-unstable    +patchset
> 1. !shrinker_enabled:       86s           74s
> 2. shrinker_enabled:        63s           61s
>
> This shows a much smaller improvement for the shrinker_enabled case, but still a large improvement for the !shrinker_enabled case.
>
> Thanks!

I'm gonna assume this is build time, since it makes the zswap shrinker look pretty good :)

I think this just means some of the gains from this patchset and from the zswap shrinker overlap. But on the positive note:

a) Both are complementary, i.e. enabling both (the bottom-right corner) gives us the best result.
b) Each individual change improves the runtime. If you disable the shrinker, then this patch helps tremendously, so we're onto something.
c) The !shrinker_enabled case is no longer *too* bad - once again, thanks for noticing the regression and helping me fix it! In fact, every cell improves compared to the last run. Woohoo!
On Thu, Dec 7, 2023 at 10:15 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, Dec 7, 2023 at 7:18 AM Chengming Zhou <zhouchengming@bytedance.com> wrote:
> >
> > Updated test data based on today's mm-unstable branch:
> >
> >                         mm-unstable    +patchset
> > 1. !shrinker_enabled:       86s           74s
> > 2. shrinker_enabled:        63s           61s
> >
> > This shows a much smaller improvement for the shrinker_enabled case, but still a large improvement for the !shrinker_enabled case.
> >
> > Thanks!
>
> I'm gonna assume this is build time, since it makes the zswap shrinker look pretty good :)
>
> I think this just means some of the gains from this patchset and from the zswap shrinker overlap. But on the positive note:
>
> a) Both are complementary, i.e. enabling both (the bottom-right corner) gives us the best result.
> b) Each individual change improves the runtime. If you disable the shrinker, then this patch helps tremendously, so we're onto something.
> c) The !shrinker_enabled case is no longer *too* bad - once again, thanks for noticing the regression and helping me fix it! In fact, every cell improves compared to the last run. Woohoo!

Oh, and another thing that might be helpful for observing the reduction in lock contention (and comparing approaches, if necessary) is this analysis that Yosry performed for the multiple-zpools change:

https://lore.kernel.org/lkml/20230620194644.3142384-1-yosryahmed@google.com/

We could look at the various paths that use the rb-tree and see how long we're spinning on the lock(s), etc.
On 2023/12/8 02:15, Nhat Pham wrote:
> On Thu, Dec 7, 2023 at 7:18 AM Chengming Zhou <zhouchengming@bytedance.com> wrote:
>>
>> On 2023/12/7 11:13, Chengming Zhou wrote:
>>> On 2023/12/7 04:08, Nhat Pham wrote:
>>>> On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou <zhouchengming@bytedance.com> wrote:
>>>>> When testing zswap performance with a kernel build (-j32) in a tmpfs directory, I found that the scalability of the zswap rb-tree, which is protected by a single spinlock, is not good. That causes heavy lock contention when multiple tasks zswap_store()/zswap_load() concurrently.
>>>>>
>>>>> So a simple solution is to split the single zswap rb-tree into multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
>>>>>
>>>>> Although this method can't eliminate the spinlock contention completely, it can mitigate much of it.
>>>>
>>>> By how much? Do you have any stats to estimate the amount of contention and the reduction from this patch?
>>>
>>> Actually, I did some tests using linux-next 20231205 yesterday.
>>>
>>> Testcase: memory.max = 2G, zswap enabled, make -j32 in tmpfs.
>>>
>>>                         20231205    +patchset
>>> 1. !shrinker_enabled:     156s        126s
>>> 2. shrinker_enabled:       79s         70s
>>>
>>> I think your zswap shrinker fix patch can solve the !shrinker_enabled case, so I will test again today using the new mm-unstable branch.
>>
>> Updated test data based on today's mm-unstable branch:
>>
>>                         mm-unstable    +patchset
>> 1. !shrinker_enabled:       86s           74s
>> 2. shrinker_enabled:        63s           61s
>>
>> This shows a much smaller improvement for the shrinker_enabled case, but still a large improvement for the !shrinker_enabled case.
>>
>> Thanks!
>
> I'm gonna assume this is build time, since it makes the zswap shrinker look pretty good :)
>
> I think this just means some of the gains from this patchset and from the zswap shrinker overlap. But on the positive note:
>
> a) Both are complementary, i.e. enabling both (the bottom-right corner) gives us the best result.

Right, both optimizations are complementary and make zswap perform better :)

> b) Each individual change improves the runtime. If you disable the shrinker, then this patch helps tremendously, so we're onto something.
> c) The !shrinker_enabled case is no longer *too* bad - once again, thanks for noticing the regression and helping me fix it! In fact, every cell improves compared to the last run. Woohoo!

It's my pleasure! Thanks!
On Wed, Dec 6, 2023 at 7:25 PM Chengming Zhou <zhouchengming@bytedance.com> wrote:
>
> On 2023/12/7 08:43, Chris Li wrote:
> > Hi Nhat and Yosry,
> >
> > On Wed, Dec 6, 2023 at 12:42 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >>
> >> On Wed, Dec 6, 2023 at 9:24 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >>>
> >>> + Chris Li
> >>>
> >>> Chris, I vaguely remember from our last conversation that you have some concurrent efforts to use an xarray here, right?
> >
> > Yes, I do have the zswap xarray for older versions of the kernel. The recent mm-unstable tree has a lot of zswap-related updates. Give me 2 days to refresh and post it. The zswap invalid-entry and reference-count changes are causing a good portion of the code to be updated. That is the price to pay for keeping out-of-tree patches. My fault for not getting to it sooner.
> >
> >> If I recall correctly, the xarray already reduces the lock contention as lookups are lockless, but Chris knows more here. As you mentioned
> >
> > Yes. To be exact, the xarray can use a spinlock (same as the current rb-tree) or take the RCU read lock on the lookup path (depending on how you call the xarray API). Not completely lockless, but the RCU read lock should have less lock contention than a normal spinlock. +Matthew
>
> Great! A lockless lookup in zswap_load() should reduce the spinlock contention, and multiple trees (multiple xarrays) can further reduce the contention on the concurrent zswap_store() side. So they are complementary, IMHO.
>
> >> in a different email, it would be nice to get some data so that we can compare the different solutions.
> >
> > Yes, it is certainly welcome to see more data points. If I recall correctly, the zswap xarray makes the lookup similar to the swap cache lookup. It has a noticeable difference in the long tail end.
>
> Right, I posted some data from yesterday in another reply. I will test again and update the data, since Nhat's zswap shrinker fix patch has been merged into mm-unstable today.
>
> Thanks!

Let's split the rb-tree breakdown into a separate series. This series has unrelated (and very nice) cleanups and optimizations; let's take those separately, and defer the rb-tree breakdown part until we get data about the xarray implementation. Perhaps the tree breakdown is not needed as much with an xarray, or at the very least the implementation would look different on top of an xarray.
On Tue, Dec 12, 2023 at 3:27 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Let's split the rb-tree breakdown into a separate series. This series has unrelated (and very nice) cleanups and optimizations; let's take those separately, and defer the rb-tree breakdown part until we get data about the xarray implementation. Perhaps the tree breakdown is not needed as much with an xarray, or at the very least the implementation would look different on top of an xarray.

Actually, I kinda agree - I quite like the cleanup/optimization done w.r.t. dstmem reuse :)
On 2023/12/13 07:33, Nhat Pham wrote:
> On Tue, Dec 12, 2023 at 3:27 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> Let's split the rb-tree breakdown into a separate series. This series has unrelated (and very nice) cleanups and optimizations; let's take those separately, and defer the rb-tree breakdown part until we get data

OK, I will split the series and just send the cleanups/optimizations with the dstmem reuse.

>> about the xarray implementation. Perhaps the tree breakdown is not needed as much with an xarray, or at the very least the implementation would look different on top of an xarray.

Yeah, I will retest on top of Chris's xarray version; the implementation is easy anyway.

> Actually, I kinda agree - I quite like the cleanup/optimization done w.r.t. dstmem reuse :)

Thanks!
Hi everyone,

This patch series is based on linux-next 20231205, which depends on the "workload-specific and memory pressure-driven zswap writeback" series from Nhat Pham.

When testing zswap performance with a kernel build (-j32) in a tmpfs directory, I found that the scalability of the zswap rb-tree, which is protected by a single spinlock, is not good. That causes heavy lock contention when multiple tasks zswap_store()/zswap_load() concurrently.

So a simple solution is to split the single zswap rb-tree into multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").

Although this method can't eliminate the spinlock contention completely, it can mitigate much of it.

Another problem when testing zswap with our default zsmalloc is that zswap_load() and zswap_writeback_entry() have to allocate temporary memory to support !zpool_can_sleep_mapped().

Optimize this by reusing the per-CPU crypto_acomp_ctx->dstmem, which is also used by zswap_store() and protected by the same per-CPU crypto_acomp_ctx->mutex.

Thanks for review and comments!

To: Andrew Morton <akpm@linux-foundation.org>
To: Seth Jennings <sjenning@redhat.com>
To: Dan Streetman <ddstreet@ieee.org>
To: Vitaly Wool <vitaly.wool@konsulko.com>
To: Nhat Pham <nphamcs@gmail.com>
To: Johannes Weiner <hannes@cmpxchg.org>
To: Yosry Ahmed <yosryahmed@google.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

---
Chengming Zhou (7):
      mm/zswap: make sure each swapfile always have zswap rb-tree
      mm/zswap: split zswap rb-tree
      mm/zswap: reuse dstmem when decompress
      mm/zswap: change dstmem size to one page
      mm/zswap: refactor out __zswap_load()
      mm/zswap: cleanup zswap_load()
      mm/zswap: cleanup zswap_reclaim_entry()

 include/linux/zswap.h |   4 +-
 mm/swapfile.c         |  10 ++-
 mm/zswap.c            | 233 +++++++++++++++++++++-----------------------
 3 files changed, 106 insertions(+), 141 deletions(-)
---
base-commit: 0f5f12ac05f36f117e793656c3f560625e927f1b
change-id: 20231206-zswap-lock-optimize-06f45683b02b

Best regards,
--
Chengming Zhou <zhouchengming@bytedance.com>
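For readers who want a concrete picture of the split described in the cover letter, here is a minimal sketch of mapping a swap entry to its per-64MB tree, assuming one array of trees per swapfile. It is illustrative only; the names and layout may differ from the actual patches.

```c
#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/swap.h>		/* MAX_SWAPFILES, SWAP_ADDRESS_SPACE_SHIFT */
#include <linux/swapops.h>	/* swp_type(), swp_offset() */

struct zswap_tree {
	struct rb_root rbroot;
	spinlock_t lock;
};

/*
 * One array of trees per swapfile; each tree covers
 * SWAP_ADDRESS_SPACE_PAGES (64MB worth of) swap slots, so contention is
 * spread across many locks instead of one global lock.
 */
static struct zswap_tree *zswap_trees[MAX_SWAPFILES];

static struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
{
	return &zswap_trees[swp_type(swp)][swp_offset(swp) >>
					   SWAP_ADDRESS_SPACE_SHIFT];
}
```

With something like this in place, zswap_store()/zswap_load() would take only the lock of the tree returned by swap_zswap_tree() for the entry's offset, rather than a single tree-wide spinlock.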