Message ID: 20240117-b4-zswap-lock-optimize-v1-0-23f6effe5775@bytedance.com (mailing list archive)
Series: mm/zswap: optimize the scalability of zswap rb-tree
On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou <zhouchengming@bytedance.com> wrote:
>
> When testing zswap performance with a kernel build -j32 in a tmpfs
> directory, I found that the scalability of the zswap rb-tree is poor:
> it is protected by a single spinlock, which causes heavy lock contention
> when multiple tasks zswap_store/load concurrently.
>
> A simple solution is to split the single zswap rb-tree into multiple
> rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This
> idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into
> 64MB trunks").
>
> Although this method can't eliminate the spinlock contention completely,
> it mitigates much of it. Below are the results of a kernel build in
> tmpfs with the zswap shrinker enabled:
>
>           linux-next   zswap-lock-optimize
> real      1m9.181s     1m3.820s
> user      17m44.036s   17m40.100s
> sys       7m37.297s    4m54.622s
>
> So there are clear improvements. And it's complementary with the ongoing
> zswap xarray conversion by Chris, so IMHO we can also merge this first.
> I am refreshing and resending it for further discussion.

The reason I think we should wait for the xarray patch(es) is that there is a chance we will see less improvement from splitting the tree once it is an xarray. If we merge this series first, there is no way to know.

Chris, do you intend to send the xarray patch(es) anytime soon?
Hi Yosry and Chengming,

On Wed, Jan 17, 2024 at 10:38 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> >
> > [...]

Sorry I have been radio silent; I was busy with a few refreshes of the xarray series on the recent kernel tree. An assertion was triggered where the xarray and the rb-tree did not agree with each other, and it took some time to debug. I have ironed that out, and I am glad the assert did catch a bug.

Currently the xarray patch should have everything it takes to use the RCU read lock. However, taking out the tree spinlock is more work than previously expected. If we are going to remove the tree spinlock, I think we should go back to doing a zswap tree lookup that returns the zswap entry with its reference count increased. The tree mapping can still be decoupled from the zswap entry's reference count dropping to zero.

Anyway, my V1 of the xarray patch will not include removing the tree spinlock.

> The reason I think we should wait for the xarray patch(es) is that
> there is a chance we will see less improvement from splitting the tree
> once it is an xarray. If we merge this series first, there is no way
> to know.
>
> Chris, do you intend to send the xarray patch(es) anytime soon?

Thanks for the heads up. Let me send it out now.

Chris
> Currently the xarray patch should have everything it takes to use the
> RCU read lock. However, taking out the tree spinlock is more work than
> previously expected. If we are going to remove the tree spinlock, I
> think we should go back to doing a zswap tree lookup that returns the
> zswap entry with its reference count increased. The tree mapping can
> still be decoupled from the zswap entry's reference count dropping to
> zero. Anyway, my V1 of the xarray patch will not include removing the
> tree spinlock.

Interesting. What do you mean by removing the tree spinlock? My assumption was that the xarray reduces lock contention because we do not need a lock to do lookups, but we still need the lock otherwise. Did you have something in mind to completely remove the tree lock?

> > The reason I think we should wait for the xarray patch(es) is that
> > [...]
> >
> > Chris, do you intend to send the xarray patch(es) anytime soon?
>
> Thanks for the heads up. Let me send it out now.

Awesome, thanks! I assume Chengming can test whether this series provides the same benefits with the xarray.
Hi Yosry,

On Wed, Jan 17, 2024 at 3:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > Currently the xarray patch should have everything it takes to use the
> > RCU read lock. However, taking out the tree spinlock is more work than
> > previously expected. [...]
>
> Interesting. What do you mean by removing the tree spinlock? My
> assumption was that the xarray reduces lock contention because we do
> not need a lock to do lookups, but we still need the lock otherwise.
> Did you have something in mind to completely remove the tree lock?

In my current xarray series, the xarray is added alongside the rb-tree. The xarray has its own internal lock as well, so effectively zswap now has two locks instead of the one it had before. The xarray lock will not see any contention, because it is taken inside the zswap rb-tree lock. The eventual goal is to reduce the two locks back to one (the xarray lock), but that is not in my V1 patch. Your understanding is correct: the xarray still needs one lock to protect the write path.

> > > The reason I think we should wait for the xarray patch(es) is that
> > > [...]
> > >
> > > Chris, do you intend to send the xarray patch(es) anytime soon?
> >
> > Thanks for the heads up. Let me send it out now.
>
> Awesome, thanks! I assume Chengming can test whether this series
> provides the same benefits with the xarray.

That would be great. The xarray patch needs more testing for sure.

Chris
On Wed, Jan 17, 2024 at 4:18 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Yosry,
>
> [...]
>
> In my current xarray series, the xarray is added alongside the rb-tree.
> The xarray has its own internal lock as well, so effectively zswap now
> has two locks instead of the one it had before. The xarray lock will
> not see any contention, because it is taken inside the zswap rb-tree
> lock. The eventual goal is to reduce the two locks back to one (the
> xarray lock), but that is not in my V1 patch. Your understanding is
> correct: the xarray still needs one lock to protect the write path.

Hmm, I don't understand. What's the point of keeping the rbtree if we have the xarray? Doesn't it end up being more expensive and bug-prone to maintain both trees?

When you say "eventual goal", do you mean what the patch would morph into in later versions (as in, v1 is just a proof of concept without removing the rbtree), or follow-up patches?
On Wed, Jan 17, 2024 at 4:18 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Yosry,
>
> [...]
>
> In my current xarray series, the xarray is added alongside the rb-tree.

Hmmm why? Is there a reason to keep the rb tree around?
Hi Yosry,

On Wed, Jan 17, 2024 at 4:35 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> Hmm, I don't understand. What's the point of keeping the rbtree if we
> have the xarray? Doesn't it end up being more expensive and bug-prone
> to maintain both trees?

Patch 2/2 removes the rb-tree code, keeping just the tree spinlock.

> When you say "eventual goal", do you mean what the patch would morph
> into in later versions (as in, v1 is just a proof of concept without
> removing the rbtree), or follow-up patches?

V1 will remove the rb-tree, but does not merge the rb-tree lock with the xarray lock.

Hi Nhat,

> Hmmm why? Is there a reason to keep the rb tree around?

No, not at all.

Chris
On Wed, Jan 17, 2024 at 5:03 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Yosry,
>
> [...]
>
> Patch 2/2 removes the rb-tree code, keeping just the tree spinlock.
>
> [...]
>
> V1 will remove the rb-tree, but does not merge the rb-tree lock with
> the xarray lock.

I see you have already posted the patches; let's move the discussion there. I will take a look at them as soon as I get the chance.
On Wed, Jan 17, 2024 at 10:37:22AM -0800, Yosry Ahmed wrote:
> On Wed, Jan 17, 2024 at 1:23 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
> >
> > [...]
>
> The reason I think we should wait for the xarray patch(es) is that
> there is a chance we will see less improvement from splitting the tree
> once it is an xarray. If we merge this series first, there is no way
> to know.

I mentioned this before, but I disagree quite strongly with this general sentiment.

Chengming's patches are simple, mature, and have convincing numbers. IMO it's poor form to hold something like that for "let's see how our other experiment works out". The only exception would be if we all agree that the earlier change flies in the face of the overall direction we want to pursue, which I don't think is the case here.

With the xarray we'll still have a per-swapfile lock for writes. That lock is the reason SWAP_ADDRESS_SPACE segmentation was introduced for the swapcache in the first place. Lockless reads help of course, but read-only accesses to swap are in the minority: stores will write, and loads are commonly followed by invalidations. Somebody already went through the trouble of proving that xarrays + segmentation are worth it for swap load and store access patterns. Why dismiss that?

So my vote is that we follow the usual upstreaming process here: merge the ready patches now, and rebase future work on top of it.
On Thu, Jan 18, 2024 at 7:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Jan 17, 2024 at 10:37:22AM -0800, Yosry Ahmed wrote:
> > [...]
> >
> > The reason I think we should wait for the xarray patch(es) is that
> > there is a chance we will see less improvement from splitting the tree
> > once it is an xarray. If we merge this series first, there is no way
> > to know.
>
> I mentioned this before, but I disagree quite strongly with this
> general sentiment.
>
> Chengming's patches are simple, mature, and have convincing
> numbers. IMO it's poor form to hold something like that for "let's see
> how our other experiment works out". The only exception would be if we
> all agree that the earlier change flies in the face of the overall
> direction we want to pursue, which I don't think is the case here.

My intention was not to delay merging these patches until the xarray patches are merged in. It was only to wait until the xarray patches are *posted*, so that we can redo the testing on top of them and verify that the gains are still there. That should have been around now, but the xarray patches were posted in a form that does not allow this testing (because we still have a lock on the read path), so I am less inclined to wait.

My rationale was that if the gains from splitting the tree become minimal after we switch to an xarray, we won't know. It's more difficult to remove optimizations than to add them, because we may cause a regression. I am kind of paranoid about having code sitting around that we don't have full information about how much it's needed.

In this case, I suppose we can redo the testing (1 tree vs. split trees) once the xarray patches are in a testable form, and before we have formed any strong dependencies on the split trees (we have time until v6.9 is released, I assume).

How about that?

> With the xarray we'll still have a per-swapfile lock for writes. That
> lock is the reason SWAP_ADDRESS_SPACE segmentation was introduced for
> the swapcache in the first place. Lockless reads help of course, but
> read-only accesses to swap are in the minority: stores will write, and
> loads are commonly followed by invalidations. Somebody already went
> through the trouble of proving that xarrays + segmentation are worth
> it for swap load and store access patterns. Why dismiss that?

Fair point, although I think the swapcache lock may be more contended than the zswap tree lock.

> So my vote is that we follow the usual upstreaming process here:
> merge the ready patches now, and rebase future work on top of it.

No objections given the current state of the xarray patches as I mentioned earlier, but I prefer we redo the testing once possible with the xarray.
On Thu, Jan 18, 2024 at 09:30:12AM -0800, Yosry Ahmed wrote:
> On Thu, Jan 18, 2024 at 7:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > [...]
> >
> > Chengming's patches are simple, mature, and have convincing
> > numbers. IMO it's poor form to hold something like that for "let's see
> > how our other experiment works out". [...]
>
> My intention was not to delay merging these patches until the xarray
> patches are merged in. It was only to wait until the xarray patches
> are *posted*, so that we can redo the testing on top of them and
> verify that the gains are still there. [...]
>
> My rationale was that if the gains from splitting the tree become
> minimal after we switch to an xarray, we won't know. It's more
> difficult to remove optimizations than to add them, because we may
> cause a regression. I am kind of paranoid about having code sitting
> around that we don't have full information about how much it's needed.

Yeah, I understand that fear.

I expect the splitting to help more than the move to xarray, because it's the writes that are hot. Luckily, in this case it should be fairly easy to differential-test after it's been merged, by changing that tree lookup macro/function locally to always return &trees[type][0], right?

> In this case, I suppose we can redo the testing (1 tree vs. split
> trees) once the xarray patches are in a testable form, and before we
> have formed any strong dependencies on the split trees (we have time
> until v6.9 is released, I assume).
>
> How about that?

That sounds reasonable.

> > With the xarray we'll still have a per-swapfile lock for writes. That
> > lock is the reason SWAP_ADDRESS_SPACE segmentation was introduced for
> > the swapcache in the first place. [...]
>
> Fair point, although I think the swapcache lock may be more contended
> than the zswap tree lock.

Right, it has two updates for each transition, compared to the one for zswap. But we know that in a concurrent system under pressure, a globally shared swap lock will hurt. There is history here: Chengming's numbers, your previous patch to split the zpools, commit 235b62176712b970c815923e36b9a9cc05d4d901, etc.

> > So my vote is that we follow the usual upstreaming process here:
> > merge the ready patches now, and rebase future work on top of it.
>
> No objections given the current state of the xarray patches as I
> mentioned earlier, but I prefer we redo the testing once possible with
> the xarray.

Cool, sounds good to me.
On Thu, Jan 18, 2024 at 10:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Jan 18, 2024 at 09:30:12AM -0800, Yosry Ahmed wrote:
> > [...]
> >
> > My rationale was that if the gains from splitting the tree become
> > minimal after we switch to an xarray, we won't know. It's more
> > difficult to remove optimizations than to add them, because we may
> > cause a regression. I am kind of paranoid about having code sitting
> > around that we don't have full information about how much it's needed.
>
> Yeah, I understand that fear.
>
> I expect the splitting to help more than the move to xarray, because
> it's the writes that are hot. Luckily, in this case it should be fairly
> easy to differential-test after it's been merged, by changing that tree
> lookup macro/function locally to always return &trees[type][0], right?

Yeah, that's exactly what I had in mind. Once we have a version of the xarray patch without the locking on the read side, we can test with that. Chengming, does this sound reasonable to you?
On 2024/1/19 02:37, Yosry Ahmed wrote:
> On Thu, Jan 18, 2024 at 10:07 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>> [...]
>>
>> I expect the splitting to help more than the move to xarray, because
>> it's the writes that are hot. Luckily, in this case it should be fairly
>> easy to differential-test after it's been merged, by changing that tree
>> lookup macro/function locally to always return &trees[type][0], right?
>
> Yeah, that's exactly what I had in mind. Once we have a version of the
> xarray patch without the locking on the read side, we can test with
> that. Chengming, does this sound reasonable to you?

It's ok, sounds reasonable to me.

I agree with Johannes: we will need both, since the xarray still has a spinlock on the write side, so it's clearly better to split the tree. As for testing, we can always return &trees[type][0].

Thanks!
When testing zswap performance with a kernel build -j32 in a tmpfs directory, I found that the scalability of the zswap rb-tree is poor: it is protected by a single spinlock, which causes heavy lock contention when multiple tasks zswap_store/load concurrently.

A simple solution is to split the single zswap rb-tree into multiple rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").

Although this method can't eliminate the spinlock contention completely, it mitigates much of it. Below are the results of a kernel build in tmpfs with the zswap shrinker enabled:

          linux-next   zswap-lock-optimize
real      1m9.181s     1m3.820s
user      17m44.036s   17m40.100s
sys       7m37.297s    4m54.622s

So there are clear improvements. And it's complementary with the ongoing zswap xarray conversion by Chris, so IMHO we can also merge this first. I am refreshing and resending it for further discussion.

Thanks for review and comment!

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
Chengming Zhou (2):
      mm/zswap: make sure each swapfile always have zswap rb-tree
      mm/zswap: split zswap rb-tree

 include/linux/zswap.h |  7 +++--
 mm/swapfile.c         | 10 ++++---
 mm/zswap.c            | 74 ++++++++++++++++++++++++++++++++-------------------
 3 files changed, 59 insertions(+), 32 deletions(-)
---
base-commit: ab27740f76654ed58dd32ac0ba0031c18a6dea3b
change-id: 20240117-b4-zswap-lock-optimize-44e071c13427

Best regards,