[v2] mm: zswap: shrink until can accept

Message ID: 20230526173955.781115-1-cerasuolodomenico@gmail.com
State: New
From: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, sjenning@redhat.com, ddstreet@ieee.org,
	vitaly.wool@konsulko.com, yosryahmed@google.com, hannes@cmpxchg.org,
	kernel-team@fb.com, Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Subject: [PATCH v2] mm: zswap: shrink until can accept
Date: Fri, 26 May 2023 19:39:55 +0200
Series: [v2] mm: zswap: shrink until can accept

Commit Message

Domenico Cerasuolo May 26, 2023, 5:39 p.m. UTC

This update addresses an issue with the zswap reclaim mechanism that
hinders the efficient offloading of cold pages to disk. The issue
compromises the preservation of LRU order and consequently diminishes,
if not inverts, zswap's performance benefits.

The zswap shrink worker was found to be inadequate, as shown by a basic
benchmark test. The test used a kernel build as the reference workload,
with its memory confined to 1G via a cgroup and a 5G swap file provided.
The results below are averages of three runs without zswap:

real 46m26s
user 35m4s
sys 7m37s

With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
system), the results changed to:

real 56m4s
user 35m13s
sys 8m43s

written_back_pages: 18
reject_reclaim_fail: 0
pool_limit_hit: 1478

Besides the evident regression, one thing to notice from this data is
how low both written_back_pages and pool_limit_hit are.

The pool_limit_hit counter, which is incremented in zswap_frontswap_store
when zswap is completely full, doesn't account for a particular
scenario: once zswap hits its limit, zswap_pool_reached_full is set to
true; with this flag on, zswap_frontswap_store rejects pages if zswap is
still above the acceptance threshold. Once we include the rejections due
to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
1478 to a significant 21578266.

Zswap is stuck in an undesirable state where it rejects pages because
it's above the acceptance threshold, yet fails to attempt memory
reclamation. This happens because the shrink work is only queued when
zswap_frontswap_store detects that the pool is full, and the work itself
only reclaims one page per run.

This state results in hot pages getting written directly to disk,
while cold ones remain in memory, waiting only to be invalidated. The
LRU order is completely broken and zswap ends up being just an overhead
without providing any benefits.

This commit makes two changes: a) the shrink worker now reclaims pages
until the acceptance threshold is met and b) the shrink work is also
queued when zswap is not full but still above the threshold.

With this change applied, the same test showed much better numbers:

real 36m37s
user 35m8s
sys 9m32s

written_back_pages: 10459423
reject_reclaim_fail: 12896
pool_limit_hit: 75653

V2:
- loop against == -EAGAIN rather than != -EINVAL and also break the loop
on MAX_RECLAIM_RETRIES (thanks Yosry)
- cond_resched() to ensure that the loop doesn't burn the cpu (thanks
Vitaly)

Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
---
 mm/zswap.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

Comments

Johannes Weiner May 26, 2023, 6:10 p.m. UTC | #1

On Fri, May 26, 2023 at 07:39:55PM +0200, Domenico Cerasuolo wrote:
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 59da2a415fbb..f953dceaab34 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -37,6 +37,7 @@
>  #include <linux/workqueue.h>
>  
>  #include "swap.h"
> +#include "internal.h"
>  
>  /*********************************
>  * statistics
> @@ -587,9 +588,17 @@ static void shrink_worker(struct work_struct *w)
>  {
>  	struct zswap_pool *pool = container_of(w, typeof(*pool),
>  						shrink_work);
> +	int ret, failures = 0;
>  
> -	if (zpool_shrink(pool->zpool, 1, NULL))
> -		zswap_reject_reclaim_fail++;
> +	do {
> +		ret = zpool_shrink(pool->zpool, 1, NULL);
> +		if (ret) {
> +			zswap_reject_reclaim_fail++;
> +			failures++;
> +		}
> +		cond_resched();
> +	} while (!zswap_can_accept() && ret == -EAGAIN &&
> +		 failures < MAX_RECLAIM_RETRIES);

It should also loop on !ret, right?

AFAIU Yosry's suggestion was that instead of breaking only on -EINVAL,
it should break on all failures but -EAGAIN. But it should still keep
going if the shrink was successful and the pool cannot accept yet.

Basically, something like this?

	do {
		ret = zpool_shrink(pool->zpool, 1, NULL);
		if (ret) {
			zswap_reject_reclaim_fail++;
			if (ret != -EAGAIN)
				break;
			if (++failures == MAX_RECLAIM_RETRIES)
				break;
		}
		cond_resched();
	} while (!zswap_can_accept());

Yosry Ahmed May 26, 2023, 6:15 p.m. UTC | #2

On Fri, May 26, 2023 at 11:10 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> It should also loop on !ret, right?
>
> AFAIU Yosry's suggestion was that instead of breaking only on -EINVAL,
> it should break on all failures but -EAGAIN. But it should still keep
> going if the shrink was successful and the pool cannot accept yet.
>
> Basically, something like this?
>
>         do {
>                 ret = zpool_shrink(pool->zpool, 1, NULL);
>                 if (ret) {
>                         zswap_reject_reclaim_fail++;
>                         if (ret != -EAGAIN)
>                                 break;
>                         if (++failures == MAX_RECLAIM_RETRIES)
>                                 break;
>                 }
>                 cond_resched();
>         } while (!zswap_can_accept());

Yes, that's what I meant. Otherwise if shrink is successful we end up
doing 1 page only, which is exactly what we are trying to avoid here.

Domenico Cerasuolo May 26, 2023, 6:16 p.m. UTC | #3

On Fri, May 26, 2023 at 8:10 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> It should also loop on !ret, right?
>
> AFAIU Yosry's suggestion was that instead of breaking only on -EINVAL,
> it should break on all failures but -EAGAIN. But it should still keep
> going if the shrink was successful and the pool cannot accept yet.
>
> Basically, something like this?
>
>         do {
>                 ret = zpool_shrink(pool->zpool, 1, NULL);
>                 if (ret) {
>                         zswap_reject_reclaim_fail++;
>                         if (ret != -EAGAIN)
>                                 break;
>                         if (++failures == MAX_RECLAIM_RETRIES)
>                                 break;
>                 }
>                 cond_resched();
>         } while (!zswap_can_accept());

Thanks, !ret should indeed keep it going.

diff --git a/mm/zswap.c b/mm/zswap.c
index 59da2a415fbb..f953dceaab34 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -37,6 +37,7 @@ 
 #include <linux/workqueue.h>
 
 #include "swap.h"
+#include "internal.h"
 
 /*********************************
 * statistics
@@ -587,9 +588,17 @@ static void shrink_worker(struct work_struct *w)
 {
 	struct zswap_pool *pool = container_of(w, typeof(*pool),
 						shrink_work);
+	int ret, failures = 0;
 
-	if (zpool_shrink(pool->zpool, 1, NULL))
-		zswap_reject_reclaim_fail++;
+	do {
+		ret = zpool_shrink(pool->zpool, 1, NULL);
+		if (ret) {
+			zswap_reject_reclaim_fail++;
+			failures++;
+		}
+		cond_resched();
+	} while (!zswap_can_accept() && ret == -EAGAIN &&
+		 failures < MAX_RECLAIM_RETRIES);
 	zswap_pool_put(pool);
 }
 
@@ -1188,7 +1197,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
 	if (zswap_pool_reached_full) {
 	       if (!zswap_can_accept()) {
 			ret = -ENOMEM;
-			goto reject;
+			goto shrink;
 		} else
 			zswap_pool_reached_full = false;
 	}
