mm/zswap: avoid touching XArray for unnecessary invalidation

Message ID 20241011171950.62684-1-ryncsn@gmail.com (mailing list archive)
State New
Series mm/zswap: avoid touching XArray for unnecessary invalidation

Commit Message

Kairui Song Oct. 11, 2024, 5:19 p.m. UTC
From: Kairui Song <kasong@tencent.com>

zswap_invalidate() simply calls xa_erase(), which acquires the XArray
lock first, then does a lookup. This has a higher overhead even if
zswap is not used or the tree is empty.

So instead, do a very lightweight xa_empty() check first; if there is
nothing to erase, don't touch the lock or the tree.

Using xa_empty() rather than zswap_never_enabled() is more helpful as
it covers both the case where zswap was never used and the case where
the particular range doesn't have any zswap entry. And it's safe as
the swap slot should be currently pinned by the caller with
SWAP_HAS_CACHE.

Sequential swap in/out tests with zswap disabled showed a minor
performance gain, and swap-in of zero pages with zswap enabled also
showed a performance gain (swap-out is basically unchanged, so only
one swap-out case is tested):

Swapout of 2G of zero pages using brd as SWAP, zswap disabled
(total time, 4 test runs, +0.1%):
Before: 1705013 us 1703119 us 1704335 us 1705848 us.
After:  1703579 us 1710640 us 1703625 us 1708699 us.

Swapin of 2G of zero pages using brd as SWAP, zswap disabled
(total time, 4 test runs, -3.5%):
Before: 1912312 us 1915692 us 1905837 us 1912706 us.
After:  1845354 us 1849691 us 1845868 us 1841828 us.

Swapin of 2G of zero pages using brd as SWAP, zswap enabled
(total time, 4 test runs, -3.3%):
Before: 1897994 us 1894681 us 1899982 us 1898333 us
After:  1835894 us 1834113 us 1832047 us 1833125 us

Swapin of 2G of random pages using brd as SWAP, zswap enabled
(total time, 4 test runs, -0.1%):
Before: 4519747 us 4431078 us 4430185 us 4439999 us
After:  4492176 us 4437796 us 4434612 us 4434289 us

And the performance is very slightly better or unchanged for the
kernel build test with zswap enabled or disabled.

Build the Linux kernel with defconfig and -j32 in a 1G memory cgroup,
using brd as SWAP, zswap disabled (sys time in seconds, 6 test runs, -0.1%):
Before: 1648.83 1653.52 1666.34 1665.95 1663.06 1656.67
After:  1651.36 1661.89 1645.70 1657.45 1662.07 1652.83

Build the Linux kernel with defconfig and -j32 in a 2G memory cgroup,
using brd as SWAP, zswap enabled (sys time in seconds, 6 test runs, -0.3%):
Before: 1240.25 1254.06 1246.77 1265.92 1244.23 1227.74
After:  1226.41 1218.21 1249.12 1249.13 1244.39 1233.01

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/zswap.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Yosry Ahmed Oct. 11, 2024, 5:53 p.m. UTC | #1
On Fri, Oct 11, 2024 at 10:20 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> zswap_invalidation simply calls xa_erase, which acquires the Xarray
> lock first, then does a look up. This has a higher overhead even if
> zswap is not used or the tree is empty.
>
> So instead, do a very lightweight xa_empty check first, if there is
> nothing to erase, don't touch the lock or the tree.
>
> Using xa_empty rather than zswap_never_enabled is more helpful as it
> cover both case where zswap wes never used or the particular range
> doesn't have any zswap entry. And it's safe as the swap slot should
> be currently pinned by caller with HAS_CACHE.

You actually made me wonder whether it's better to replace
zswap_never_enabled() with something like zswap_may_have_swpentry()
that checks xa_empty() as well.
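
For illustration, a minimal sketch of such a helper (the name
zswap_may_have_swpentry() and its placement are hypothetical, not an
existing zswap API; swap_zswap_tree() and zswap_never_enabled() are the
existing helpers in mm/zswap.c):

static inline bool zswap_may_have_swpentry(swp_entry_t swp)
{
	/* Cheap checks only: no XArray lock taken, no lookup done. */
	return !zswap_never_enabled() && !xa_empty(swap_zswap_tree(swp));
}

Callers such as zswap_invalidate() could then bail out early when this
returns false.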

>
> Sequential SWAP in/out tests with zswap disabled showed a minor
> performance gain, SWAP in of zero page with zswap enabled also
> showed a performance gain. (swapout is basically unchanged so
> only test one case):
>
> Swapout of 2G zero page using brd as SWAP, zswap disabled
> (total time, 4 testrun, +0.1%):
> Before: 1705013 us 1703119 us 1704335 us 1705848 us.
> After:  1703579 us 1710640 us 1703625 us 1708699 us.
>
> Swapin of 2G zero page using brd as SWAP, zswap disabled
> (total time, 4 testrun, -3.5%):
> Before: 1912312 us 1915692 us 1905837 us 1912706 us.
> After:  1845354 us 1849691 us 1845868 us 1841828 us.
>
> Swapin of 2G zero page using brd as SWAP, zswap enabled
> (total time, 4 testrun, -3.3%):
> Before: 1897994 us 1894681 us 1899982 us 1898333 us
> After:  1835894 us 1834113 us 1832047 us 1833125 us
>
> Swapin of 2G random page using brd as SWAP, zswap enabled
> (total time, 4 testrun, -0.1%):
> Before: 4519747 us 4431078 us 4430185 us 4439999 us
> After:  4492176 us 4437796 us 4434612 us 4434289 us
>
> And the performance is very slightly better or unchanged for
> build kernel test with zswap enabled or disabled.
>
> Build Linux Kernel with defconfig and -j32 in 1G memory cgroup,
> using brd SWAP, zswap disabled (sys time in seconds, 6 testrun, -0.1%):
> Before: 1648.83 1653.52 1666.34 1665.95 1663.06 1656.67
> After:  1651.36 1661.89 1645.70 1657.45 1662.07 1652.83
>
> Build Linux Kernel with defconfig and -j32 in 2G memory cgroup,
> using brd SWAP zswap enabled (sys time in seconds, 6 testrun, -0.3%):
> Before: 1240.25 1254.06 1246.77 1265.92 1244.23 1227.74
> After:  1226.41 1218.21 1249.12 1249.13 1244.39 1233.01

Nice!

Do you know how the results change if we use xa_load() instead?

Or even do something like this to avoid doing the lookup twice:

XA_STATE(xas, ..);

rcu_read_lock();
entry = xas_load(&xas);
if (entry) {
    xas_lock(&xas);
    WARN_ON_ONCE(xas_reload(&xas) != entry);
    xas_store(&xas, NULL);
    xas_unlock(&xas);
}
rcu_read_unlock();

On one hand, xa_empty() is cheaper. OTOH, xas_load() will also avoid
the lock if the tree is not empty yet the entry is not there. Actually
there's no reason why we can't check both.

I think the benefit of this would be most visible with SSD swap,
because zswap_load() will erase the entry from the xarray, and
zswap_invalidate() should always be able to avoid holding the lock.

>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Anyway, all of this is kinda orthogonal to this patch,

Acked-by: Yosry Ahmed <yosryahmed@google.com>

> ---
>  mm/zswap.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 7f00cc918e7c..f6316b66fb23 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1641,6 +1641,9 @@ void zswap_invalidate(swp_entry_t swp)
>         struct xarray *tree = swap_zswap_tree(swp);
>         struct zswap_entry *entry;
>
> +       if (xa_empty(tree))
> +               return;
> +
>         entry = xa_erase(tree, offset);
>         if (entry)
>                 zswap_entry_free(entry);
> --
> 2.47.0
>
Johannes Weiner Oct. 11, 2024, 6:28 p.m. UTC | #2
On Fri, Oct 11, 2024 at 10:53:31AM -0700, Yosry Ahmed wrote:
> On Fri, Oct 11, 2024 at 10:20 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > zswap_invalidation simply calls xa_erase, which acquires the Xarray
> > lock first, then does a look up. This has a higher overhead even if
> > zswap is not used or the tree is empty.
> >
> > So instead, do a very lightweight xa_empty check first, if there is
> > nothing to erase, don't touch the lock or the tree.

Great idea!

> XA_STATE(xas, ..);
> 
> rcu_read_lock();
> entry = xas_load(&xas);
> if (entry) {
>     xas_lock(&xas);
>     WARN_ON_ONCE(xas_reload(&xas) != entry);
>     xas_store(&xas, NULL);
>     xas_unlock(&xas);
> }
> rcu_read_unlock();

This does the optimization more reliably, and I think we should go
with this version.

First, swapcache is size-targeted to 50% of total swap capacity (see
vm_swap_full()), and swap is rarely full. Second, entries in swapcache
don't hold on to zswap copies. In combination, this means that after
pressure spikes we routinely end up with many swapcache entries and
only a few zswap entries. Those few zswapped entries would defeat the
optimization when invalidating the many swapcached entries.

So checking on a per-entry basis makes a lot of sense.
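
For reference, a simplified sketch of the threshold vm_swap_full()
encodes; the helper name below is made up for illustration and this may
not match the exact kernel definition:

static inline bool swap_more_than_half_used(void)
{
	/* nr_swap_pages counts free slots; true once >50% of swap is used. */
	return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
}

Once that is true, reclaim drops swapcache pages more aggressively,
which is what keeps the swapcache near the 50% target mentioned above.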
Kairui Song Oct. 12, 2024, 3:04 a.m. UTC | #3
Johannes Weiner <hannes@cmpxchg.org> 于 2024年10月12日周六 02:28写道:
>
> On Fri, Oct 11, 2024 at 10:53:31AM -0700, Yosry Ahmed wrote:
> > On Fri, Oct 11, 2024 at 10:20 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > zswap_invalidation simply calls xa_erase, which acquires the Xarray
> > > lock first, then does a look up. This has a higher overhead even if
> > > zswap is not used or the tree is empty.
> > >
> > > So instead, do a very lightweight xa_empty check first, if there is
> > > nothing to erase, don't touch the lock or the tree.
>
> Great idea!
>
> > XA_STATE(xas, ..);
> >
> > rcu_read_lock();
> > entry = xas_load(&xas);
> > if (entry) {
> >     xas_lock(&xas);
> >     WARN_ON_ONCE(xas_reload(&xas) != entry);
> >     xas_store(&xas, NULL);
> >     xas_unlock(&xas);
> > }
> > rcu_read_unlock();
>
> This does the optimization more reliably, and I think we should go
> with this version.

Hi Yosry and Johannes,

This is a good idea. But xa_empty is just much more lightweight; it's
just an inlined (== NULL) check, so it's unsurprising that it performs
better than xas_load.

And surprisingly it's also faster than zswap_never_enabled. So I think
it could be doable to introduce something like zswap_may_have_swpentry
as Yosry suggested.

So how about a combined version with xas_load and xa_empty? Check
xa_empty first as a fast path, then xas_load, then xas_store.
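
A minimal sketch of what that combined version might look like (an
illustration of the idea being discussed here, not the posted patch;
swap_zswap_tree() and zswap_entry_free() are the existing helpers used
in the diff below):

void zswap_invalidate(swp_entry_t swp)
{
	pgoff_t offset = swp_offset(swp);
	struct xarray *tree = swap_zswap_tree(swp);
	struct zswap_entry *entry;
	XA_STATE(xas, tree, offset);

	/* Fast path: zswap never used, or this tree has no entries at all. */
	if (xa_empty(tree))
		return;

	/* Look up under RCU; only take the xa_lock if there is an entry. */
	rcu_read_lock();
	entry = xas_load(&xas);
	if (entry) {
		xas_lock(&xas);
		WARN_ON_ONCE(xas_reload(&xas) != entry);
		xas_store(&xas, NULL);
		xas_unlock(&xas);
	}
	rcu_read_unlock();

	if (entry)
		zswap_entry_free(entry);
}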

Here is the benchmark result (time to swap in 2G of zero pages, in us):

Before:   1908944 1905870 1905322 1905627 1901667
xa_empty: 1835343 1827367 1828402 1831841 1832719
z.._enabled: 1838428 1831162 1838205 1837287 1840980
xas_load: 1874606 1878971 1870182 1875852 1873403
combined: 1845309 1832919 1831904 1836455 1842570

`combined` is xa_empty + xas_load.
Yosry Ahmed Oct. 12, 2024, 3:26 a.m. UTC | #4
On Fri, Oct 11, 2024 at 8:05 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> Johannes Weiner <hannes@cmpxchg.org> 于 2024年10月12日周六 02:28写道:
> >
> > On Fri, Oct 11, 2024 at 10:53:31AM -0700, Yosry Ahmed wrote:
> > > On Fri, Oct 11, 2024 at 10:20 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > From: Kairui Song <kasong@tencent.com>
> > > >
> > > > zswap_invalidation simply calls xa_erase, which acquires the Xarray
> > > > lock first, then does a look up. This has a higher overhead even if
> > > > zswap is not used or the tree is empty.
> > > >
> > > > So instead, do a very lightweight xa_empty check first, if there is
> > > > nothing to erase, don't touch the lock or the tree.
> >
> > Great idea!
> >
> > > XA_STATE(xas, ..);
> > >
> > > rcu_read_lock();
> > > entry = xas_load(&xas);
> > > if (entry) {
> > >     xas_lock(&xas);
> > >     WARN_ON_ONCE(xas_reload(&xas) != entry);
> > >     xas_store(&xas, NULL);
> > >     xas_unlock(&xas);
> > > }
> > > rcu_read_unlock();
> >
> > This does the optimization more reliably, and I think we should go
> > with this version.
>
> Hi Yosry and Johannes,
>
> This is a good idea. But xa_empty is just much lighweighter, it's just
> a inlined ( == NULL ) check, so unsurprising it has better performance
> than xas_load.
>
> And surprisingly it's faster than zswap_never_enabled. So I think it
> could be doable to introduce something like zswap_may_have_swpentry as
> Yosry suggested.

That is surprising indeed, but it is cleaner anyway to use the xarray
check than the static key.

>
> So how about a combined version with xas_load and xa_empty? Check
> xa_empty first as a faster path, then xas_load, then xas_store.

Yeah I think having an additional xa_empty() check before xas_load() is fine.

>
> Here is the benchmark result (time of swapin 2G zero pages in us):
>
> Before:   1908944 1905870 1905322 1905627 1901667
> xa_empty: 1835343 1827367 1828402 1831841 1832719
> z.._enabled: 1838428 1831162 1838205 1837287 1840980
> xas_load: 1874606 1878971 1870182 1875852 1873403
> combined: 1845309 1832919 1831904 1836455 1842570
>
> `combined` is xa_empty + xas_load.

Is this with SSD swap? If you are using brd, it bypasses the swapcache,
so the benefit from the xas_load() optimization won't be very visible
(see my earlier email as well as Johannes's).
Chengming Zhou Oct. 12, 2024, 3:33 a.m. UTC | #5
On 2024/10/12 11:04, Kairui Song wrote:
> Johannes Weiner <hannes@cmpxchg.org> 于 2024年10月12日周六 02:28写道:
>>
>> On Fri, Oct 11, 2024 at 10:53:31AM -0700, Yosry Ahmed wrote:
>>> On Fri, Oct 11, 2024 at 10:20 AM Kairui Song <ryncsn@gmail.com> wrote:
>>>>
>>>> From: Kairui Song <kasong@tencent.com>
>>>>
>>>> zswap_invalidation simply calls xa_erase, which acquires the Xarray
>>>> lock first, then does a look up. This has a higher overhead even if
>>>> zswap is not used or the tree is empty.
>>>>
>>>> So instead, do a very lightweight xa_empty check first, if there is
>>>> nothing to erase, don't touch the lock or the tree.
>>
>> Great idea!
>>
>>> XA_STATE(xas, ..);
>>>
>>> rcu_read_lock();
>>> entry = xas_load(&xas);
>>> if (entry) {
>>>      xas_lock(&xas);
>>>      WARN_ON_ONCE(xas_reload(&xas) != entry);
>>>      xas_store(&xas, NULL);
>>>      xas_unlock(&xas);
>>> }
>>> rcu_read_unlock();
>>
>> This does the optimization more reliably, and I think we should go
>> with this version.
> 
> Hi Yosry and Johannes,
> 
> This is a good idea. But xa_empty is just much lighweighter, it's just
> a inlined ( == NULL ) check, so unsurprising it has better performance
> than xas_load.
> 
> And surprisingly it's faster than zswap_never_enabled. So I think it

Do you have CONFIG_ZSWAP_DEFAULT_ON enabled? In your case, the CPU will
go to the unlikely branch of the static key every time, which may be
the cause.
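
For context, this is roughly the static-key pattern being described (a
sketch of zswap_never_enabled(); the key name below is illustrative,
see mm/zswap.c for the real definition). With CONFIG_ZSWAP_DEFAULT_ON=n,
static_branch_maybe() compiles the "ever enabled" case as the unlikely,
out-of-line branch, so a kernel that enables zswap at runtime takes that
branch on every call:

/* Sketch; the static key name is illustrative. */
static DEFINE_STATIC_KEY_MAYBE(CONFIG_ZSWAP_DEFAULT_ON, zswap_ever_enabled);

bool zswap_never_enabled(void)
{
	/*
	 * With CONFIG_ZSWAP_DEFAULT_ON=n this behaves like
	 * static_branch_unlikely(): once zswap is enabled at runtime,
	 * every caller takes the out-of-line branch.
	 */
	return !static_branch_maybe(CONFIG_ZSWAP_DEFAULT_ON, &zswap_ever_enabled);
}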

> could be doable to introduce something like zswap_may_have_swpentry as
> Yosry suggested.
> 
> So how about a combined version with xas_load and xa_empty? Check
> xa_empty first as a faster path, then xas_load, then xas_store.

Yeah, I also think this combined version is better.

Thanks.

> 
> Here is the benchmark result (time of swapin 2G zero pages in us):
> 
> Before:   1908944 1905870 1905322 1905627 1901667
> xa_empty: 1835343 1827367 1828402 1831841 1832719
> z.._enabled: 1838428 1831162 1838205 1837287 1840980
> xas_load: 1874606 1878971 1870182 1875852 1873403
> combined: 1845309 1832919 1831904 1836455 1842570
> 
> `combined` is xa_empty + xas_load.
Kairui Song Oct. 12, 2024, 4:48 a.m. UTC | #6
On Sat, Oct 12, 2024 at 11:33 AM Chengming Zhou
<chengming.zhou@linux.dev> wrote:
>
> On 2024/10/12 11:04, Kairui Song wrote:
> > Johannes Weiner <hannes@cmpxchg.org> 于 2024年10月12日周六 02:28写道:
> >>
> >> On Fri, Oct 11, 2024 at 10:53:31AM -0700, Yosry Ahmed wrote:
> >>> On Fri, Oct 11, 2024 at 10:20 AM Kairui Song <ryncsn@gmail.com> wrote:
> >>>>
> >>>> From: Kairui Song <kasong@tencent.com>
> >>>>
> >>>> zswap_invalidation simply calls xa_erase, which acquires the Xarray
> >>>> lock first, then does a look up. This has a higher overhead even if
> >>>> zswap is not used or the tree is empty.
> >>>>
> >>>> So instead, do a very lightweight xa_empty check first, if there is
> >>>> nothing to erase, don't touch the lock or the tree.
> >>
> >> Great idea!
> >>
> >>> XA_STATE(xas, ..);
> >>>
> >>> rcu_read_lock();
> >>> entry = xas_load(&xas);
> >>> if (entry) {
> >>>      xas_lock(&xas);
> >>>      WARN_ON_ONCE(xas_reload(&xas) != entry);
> >>>      xas_store(&xas, NULL);
> >>>      xas_unlock(&xas);
> >>> }
> >>> rcu_read_unlock();
> >>
> >> This does the optimization more reliably, and I think we should go
> >> with this version.
> >
> > Hi Yosry and Johannes,
> >
> > This is a good idea. But xa_empty is just much lighweighter, it's just
> > a inlined ( == NULL ) check, so unsurprising it has better performance
> > than xas_load.
> >
> > And surprisingly it's faster than zswap_never_enabled. So I think it
>
> Do you have CONFIG_ZSWAP_DEFAULT_ON enabled? In your case, CPU will go
> to the unlikely branch of static_key every time, which maybe the cause.

No, it's off by default. Maybe it's just noise; the performance
difference is very tiny.
Kairui Song Oct. 12, 2024, 5:08 a.m. UTC | #7
On Sat, Oct 12, 2024 at 11:27 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Oct 11, 2024 at 8:05 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > Johannes Weiner <hannes@cmpxchg.org> 于 2024年10月12日周六 02:28写道:
> > >
> > > On Fri, Oct 11, 2024 at 10:53:31AM -0700, Yosry Ahmed wrote:
> > > > On Fri, Oct 11, 2024 at 10:20 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > >
> > > > > From: Kairui Song <kasong@tencent.com>
> > > > >
> > > > > zswap_invalidation simply calls xa_erase, which acquires the Xarray
> > > > > lock first, then does a look up. This has a higher overhead even if
> > > > > zswap is not used or the tree is empty.
> > > > >
> > > > > So instead, do a very lightweight xa_empty check first, if there is
> > > > > nothing to erase, don't touch the lock or the tree.
> > >
> > > Great idea!
> > >
> > > > XA_STATE(xas, ..);
> > > >
> > > > rcu_read_lock();
> > > > entry = xas_load(&xas);
> > > > if (entry) {
> > > >     xas_lock(&xas);
> > > >     WARN_ON_ONCE(xas_reload(&xas) != entry);
> > > >     xas_store(&xas, NULL);
> > > >     xas_unlock(&xas);
> > > > }
> > > > rcu_read_unlock();
> > >
> > > This does the optimization more reliably, and I think we should go
> > > with this version.
> >
> > Hi Yosry and Johannes,
> >
> > This is a good idea. But xa_empty is just much lighweighter, it's just
> > a inlined ( == NULL ) check, so unsurprising it has better performance
> > than xas_load.
> >
> > And surprisingly it's faster than zswap_never_enabled. So I think it
> > could be doable to introduce something like zswap_may_have_swpentry as
> > Yosry suggested.
>
> That is surprising indeed, but it is cleaner anyway to use the xarray
> check than the static key.
>
> >
> > So how about a combined version with xas_load and xa_empty? Check
> > xa_empty first as a faster path, then xas_load, then xas_store.
>
> Yeah I think having an additional xa_empty() check before xas_load() is fine.
>
> >
> > Here is the benchmark result (time of swapin 2G zero pages in us):
> >
> > Before:   1908944 1905870 1905322 1905627 1901667
> > xa_empty: 1835343 1827367 1828402 1831841 1832719
> > z.._enabled: 1838428 1831162 1838205 1837287 1840980
> > xas_load: 1874606 1878971 1870182 1875852 1873403
> > combined: 1845309 1832919 1831904 1836455 1842570
> >
> > `combined` is xa_empty + xas_load.
>
> Is this with SSD swap? If you are using brd, it bypasses the swapcache
> so the benefit from the xas_load() optimization won't be much visible
> (see my earlier email as well as Johannes's).

Hi, I'm using brd indeed.

This test is trying to show the zswap disabled case, so I think the
swap cache has no effect here?

For the zswap enabled case I do believe xas_load will work better.
I'll add some test info in the combined v2 and will test some other
workloads with the brd SYNC flag removed; sequential swapin with zswap
enabled shows almost no performance change with this commit.

Patch

diff --git a/mm/zswap.c b/mm/zswap.c
index 7f00cc918e7c..f6316b66fb23 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1641,6 +1641,9 @@  void zswap_invalidate(swp_entry_t swp)
 	struct xarray *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry;
 
+	if (xa_empty(tree))
+		return;
+
 	entry = xa_erase(tree, offset);
 	if (entry)
 		zswap_entry_free(entry);