Message ID | 20230304034835.2082479-1-senozhatsky@chromium.org (mailing list archive)
Series     | zsmalloc: fine-grained fullness and new compaction algorithm
On Sat, Mar 04, 2023 at 12:48:31PM +0900, Sergey Senozhatsky wrote:
> Hi,
>
> Existing zsmalloc page fullness grouping leads to suboptimal page
> selection for both zs_malloc() and zs_compact(). This patchset
> reworks zsmalloc fullness grouping/classification.
>
> Additionally, it also implements a new compaction algorithm that is
> expected to use less CPU-cycles (as it potentially does fewer
> memcpy-s in zs_object_copy()).
>
> Test (synthetic) results can be seen in patch 0003.
>
> v4:
> -- fixed classes stats loop bug (Yosry)
> -- fixed spelling errors (Andrew)
> -- dropped some unnecessary hunks from the patches
>
> v3:
> -- reworked compaction algorithm implementation (Minchan)
> -- keep existing stats and fullness enums (Minchan, Yosry)
> -- dropped the patch with new zsmalloc compaction stats (Minchan)
> -- report per inuse ratio group classes stats
>
> Sergey Senozhatsky (4):
>   zsmalloc: remove insert_zspage() ->inuse optimization
>   zsmalloc: fine-grained inuse ratio based fullness grouping
>   zsmalloc: rework compaction algorithm
>   zsmalloc: show per fullness group class stats
>
>  mm/zsmalloc.c | 358 ++++++++++++++++++++++++--------------------------
>  1 file changed, 173 insertions(+), 185 deletions(-)
>
> --

Acked-by: Minchan Kim <minchan@kernel.org>

Thanks, Sergey!
On (23/03/10 13:10), Minchan Kim wrote:
> > v4:
> > -- fixed classes stats loop bug (Yosry)
> > -- fixed spelling errors (Andrew)
> > -- dropped some unnecessary hunks from the patches
> >
> > v3:
> > -- reworked compaction algorithm implementation (Minchan)
> > -- keep existing stats and fullness enums (Minchan, Yosry)
> > -- dropped the patch with new zsmalloc compaction stats (Minchan)
> > -- report per inuse ratio group classes stats
> >
> > Sergey Senozhatsky (4):
> >   zsmalloc: remove insert_zspage() ->inuse optimization
> >   zsmalloc: fine-grained inuse ratio based fullness grouping
> >   zsmalloc: rework compaction algorithm
> >   zsmalloc: show per fullness group class stats
> >
> >  mm/zsmalloc.c | 358 ++++++++++++++++++++++++--------------------------
> >  1 file changed, 173 insertions(+), 185 deletions(-)
> >
> > --
>
> Acked-by: Minchan Kim <minchan@kernel.org>
>
> Thanks, Sergey!

Thank you!
On Fri, Mar 3, 2023 at 8:48 PM Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> Hi,
>
> Existing zsmalloc page fullness grouping leads to suboptimal page
> selection for both zs_malloc() and zs_compact(). This patchset
> reworks zsmalloc fullness grouping/classification.
>
> Additionally, it also implements a new compaction algorithm that is
> expected to use less CPU-cycles (as it potentially does fewer
> memcpy-s in zs_object_copy()).
>
> Test (synthetic) results can be seen in patch 0003.

Seeing the following crashes from mm-unstable. Please take a look. Thanks.

  list_add corruption. next is NULL.
  kernel BUG at lib/list_debug.c:26!

  Call Trace:
   <TASK>
   zs_compact+0xbf6/0xda0
   zs_shrinker_scan+0x19/0x30
   do_shrink_slab+0x1ac/0x450
   shrink_slab+0xdc/0x3d0
   shrink_one+0xe2/0x1d0
   shrink_node+0xc7f/0xea0
   do_try_to_free_pages+0x1b5/0x500
   try_to_free_pages+0x396/0x5d0
   __alloc_pages_slowpath+0x5d0/0x1030
   __alloc_pages+0x1de/0x280
   __folio_alloc+0x1e/0x40
   vma_alloc_folio+0x4c0/0x530
   shmem_alloc_and_acct_folio+0x1a6/0x3b0
   shmem_get_folio_gfp+0x689/0xf00
   shmem_fault+0x81/0x240
On (23/04/16 01:20), Yu Zhao wrote:
>
> Seeing the following crashes from mm-unstable. Please take a look. Thanks.
>

Hi,

Did you bisect it down to this series?
On Sun, Apr 16, 2023 at 9:19 AM Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> On (23/04/16 01:20), Yu Zhao wrote:
> >
> > Seeing the following crashes from mm-unstable. Please take a look. Thanks.
> >
>
> Hi,
>
> Did you bisect it down to this series?

Not exactly -- since this series was the only suspect I had, I cherry
picked it to v6.3-rc6 and verified it is the culprit.
On (23/04/16 13:27), Yu Zhao wrote:
> > Hi,
> >
> > Did you bisect it down to this series?
>
> Not exactly -- since this series was the only suspect I had, I cherry
> picked it to v6.3-rc6 and verified it is the culprit.

Can't reproduce it yet. One of the theories is that get_fullness_group()
maybe returns an invalid index, but I don't immediately see how it
would do so.

Is the problem reproducible? Do you run some specific test?
On Sun, Apr 16, 2023 at 8:44 PM Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> On (23/04/16 13:27), Yu Zhao wrote:
> > > Hi,
> > >
> > > Did you bisect it down to this series?
> >
> > Not exactly -- since this series was the only suspect I had, I cherry
> > picked it to v6.3-rc6 and verified it is the culprit.
>
> Can't reproduce it yet. One of the theories is that get_fullness_group()
> maybe returns an invalid index, but I don't immediately see how it
> would do so.
>
> Is the problem reproducible?

Whenever swapping *multithreaded* heavily.

> Do you run some specific test?

E.g.,
  tools/testing/selftests/kvm/max_guest_memory_test -c 112 -m 800 -s 800
with 112 CPUs and ~770GB DRAM + 32GB zram.
On (23/04/16 20:55), Yu Zhao wrote:
> > Do you run some specific test?
>
> E.g.,
>   tools/testing/selftests/kvm/max_guest_memory_test -c 112 -m 800 -s 800
> with 112 CPUs and ~770GB DRAM + 32GB zram.

Hmm ...

Something like this maybe?

The src zspage pointer is not NULL-ed after a non-empty zspage is
put back on the corresponding fullness list.

---

@@ -2239,8 +2241,8 @@ static unsigned long __zs_compact(struct zs_pool *pool,
 		if (fg == ZS_INUSE_RATIO_0) {
 			free_zspage(pool, class, src_zspage);
 			pages_freed += class->pages_per_zspage;
-			src_zspage = NULL;
 		}
+		src_zspage = NULL;

 		if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
 		    || spin_is_contended(&pool->lock)) {
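The effect of the missing NULL-ing can be modeled in a few lines of ordinary
userspace C. The sketch below is hypothetical -- zspage_model, putback_model
and compact_model are made-up names, and it only mirrors the shape of the
loop around the hunk above, not the actual mm/zsmalloc.c code:

#include <stdbool.h>
#include <stdio.h>

struct zspage_model {
	int inuse;
	int times_put_back;
};

/* Stand-in for putback_zspage(): returns a fullness group. */
static int putback_model(struct zspage_model *zspage)
{
	zspage->times_put_back++;
	return zspage->inuse == 0 ? 0 /* "ZS_INUSE_RATIO_0" */ : 50;
}

static void compact_model(struct zspage_model *src, bool clear_unconditionally)
{
	struct zspage_model *src_zspage = src;

	/* One iteration of the compaction loop. */
	int fg = putback_model(src_zspage);
	if (fg == 0) {
		/* free_zspage() would go here. */
		src_zspage = NULL;	/* the only place the old flow cleared it */
	}
	if (clear_unconditionally)
		src_zspage = NULL;	/* what the hunk above does */

	/* ...the loop exits early, e.g. because the pool lock is contended... */

	/* Post-loop cleanup, as in __zs_compact(). */
	if (src_zspage)
		putback_model(src_zspage);	/* second putback of the same zspage */
}

int main(void)
{
	struct zspage_model page = { .inuse = 3 };

	compact_model(&page, false);
	printf("old flow: zspage put back %d time(s)\n", page.times_put_back);

	page.times_put_back = 0;
	compact_model(&page, true);
	printf("new flow: zspage put back %d time(s)\n", page.times_put_back);
	return 0;
}

Under the old flow the same zspage is put back twice; in the real code that
second putback amounts to a second list_add() of a zspage that is already on
a fullness list, which is where the discussion below picks up.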
Hi Sergey,

On Sun, Apr 16, 2023 at 8:52 PM Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> On (23/04/16 20:55), Yu Zhao wrote:
> > > Do you run some specific test?
> >
> > E.g.,
> >   tools/testing/selftests/kvm/max_guest_memory_test -c 112 -m 800 -s 800
> > with 112 CPUs and ~770GB DRAM + 32GB zram.
>
> Hmm ...
>
> Something like this maybe?
>
> The src zspage pointer is not NULL-ed after a non-empty zspage is
> put back on the corresponding fullness list.
>
> ---
>
> @@ -2239,8 +2241,8 @@ static unsigned long __zs_compact(struct zs_pool *pool,
>  		if (fg == ZS_INUSE_RATIO_0) {
>  			free_zspage(pool, class, src_zspage);
>  			pages_freed += class->pages_per_zspage;
> -			src_zspage = NULL;
>  		}
> +		src_zspage = NULL;
>
>  		if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
>  		    || spin_is_contended(&pool->lock)) {

For my own education, how can this result in the "next is NULL" debug
error Yu Zhao is seeing?

IIUC, if we do not set src_zspage to NULL properly after putback, then
we will attempt to putback again after the main loop in some cases.
This can result in a zspage being present more than once in the
per-class fullness list, right?

I am not sure how this can lead to "next is NULL", which sounds like a
corrupted list_head, because the next ptr should never be NULL as far
as I can tell. I feel like I am missing something.
On (23/04/17 01:29), Yosry Ahmed wrote:
> > @@ -2239,8 +2241,8 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> >  		if (fg == ZS_INUSE_RATIO_0) {
> >  			free_zspage(pool, class, src_zspage);
> >  			pages_freed += class->pages_per_zspage;
> > -			src_zspage = NULL;
> >  		}
> > +		src_zspage = NULL;
> >
> >  		if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
> >  		    || spin_is_contended(&pool->lock)) {
>
> For my own education, how can this result in the "next is NULL" debug
> error Yu Zhao is seeing?
>
> IIUC, if we do not set src_zspage to NULL properly after putback, then
> we will attempt to putback again after the main loop in some cases.
> This can result in a zspage being present more than once in the
> per-class fullness list, right?
>
> I am not sure how this can lead to "next is NULL", which sounds like a
> corrupted list_head, because the next ptr should never be NULL as far
> as I can tell. I feel like I am missing something.

That's a good question to which I don't have an answer. We can list_add()
the same zspage twice, unlocking the pool after the first list_add() so
that another process (including another zs_compact()) can do something to
that zspage. The answer is somewhere between these lines, I guess.

I can see how, for example, another DEBUG_LIST check can be triggered:
"list_add double add", because we basically can do

	list_add(page, list)
	list_add(page, list)

I can also see how lockdep can be unhappy with us doing

	write_unlock(&zspage->lock);
	write_unlock(&zspage->lock);

But I don't think I see how "next is NULL" happens (I haven't observed
it).
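To see what such a repeated list_add() does to a kernel-style circular
doubly-linked list, here is a self-contained userspace example. struct
list_head and the helpers below are minimal re-implementations written for
this illustration; the names mimic include/linux/list.h, but nothing here
is kernel code:

#include <stdio.h>

struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add(struct list_head *new, struct list_head *head)
{
	struct list_head *next = head->next;

	next->prev = new;
	new->next = next;
	new->prev = head;
	head->next = new;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry->prev = NULL;	/* poison, roughly like the kernel */
}

int main(void)
{
	struct list_head fullness_list, zspage_a, zspage_b;

	list_init(&fullness_list);

	list_add(&zspage_a, &fullness_list);	/* first, correct putback */
	list_add(&zspage_b, &fullness_list);	/* someone else adds a zspage */
	list_add(&zspage_a, &fullness_list);	/* second, bogus putback */

	/*
	 * Forward traversal now never returns to the list head
	 * (a -> b -> a -> b -> ...), and once one occurrence of
	 * zspage_a is removed, its neighbour is left pointing at a
	 * poisoned node.
	 */
	list_del(&zspage_a);
	printf("zspage_a still reachable: %s\n",
	       fullness_list.next->next == &zspage_a ? "yes" : "no");
	printf("zspage_a.next: %p\n", (void *)zspage_a.next);
	return 0;
}

Note that with another zspage added in between, the second add slips past
the "double add" check (which compares the new entry against its would-be
neighbours) and corrupts the list silently; back-to-back adds would trip the
DEBUG_LIST message quoted above. Either way, this alone still does not
produce a NULL next pointer, consistent with the open question in the thread.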
On Mon, Apr 17, 2023 at 4:12 AM Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> On (23/04/17 01:29), Yosry Ahmed wrote:
> > > @@ -2239,8 +2241,8 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> > >  		if (fg == ZS_INUSE_RATIO_0) {
> > >  			free_zspage(pool, class, src_zspage);
> > >  			pages_freed += class->pages_per_zspage;
> > > -			src_zspage = NULL;
> > >  		}
> > > +		src_zspage = NULL;
> > >
> > >  		if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
> > >  		    || spin_is_contended(&pool->lock)) {
> >
> > For my own education, how can this result in the "next is NULL" debug
> > error Yu Zhao is seeing?
> >
> > IIUC, if we do not set src_zspage to NULL properly after putback, then
> > we will attempt to putback again after the main loop in some cases.
> > This can result in a zspage being present more than once in the
> > per-class fullness list, right?
> >
> > I am not sure how this can lead to "next is NULL", which sounds like a
> > corrupted list_head, because the next ptr should never be NULL as far
> > as I can tell. I feel like I am missing something.
>
> That's a good question to which I don't have an answer. We can list_add()
> the same zspage twice, unlocking the pool after the first list_add() so
> that another process (including another zs_compact()) can do something to
> that zspage. The answer is somewhere between these lines, I guess.

But the first list_add() is (in this case) the correct add, so we
expect other processes to be able to access the zspage after the first
list_add() anyway, right?

> I can see how, for example, another DEBUG_LIST check can be triggered:
> "list_add double add", because we basically can do
>
> 	list_add(page, list)
> 	list_add(page, list)
>
> I can also see how lockdep can be unhappy with us doing
>
> 	write_unlock(&zspage->lock);
> 	write_unlock(&zspage->lock);
>
> But I don't think I see how "next is NULL" happens (I haven't observed
> it).

Yeah, I reached the same conclusion. Couldn't figure out how we can
reach the NULL scenario.
On (23/04/17 04:16), Yosry Ahmed wrote:
> > That's a good question to which I don't have an answer. We can list_add()
> > the same zspage twice, unlocking the pool after the first list_add() so
> > that another process (including another zs_compact()) can do something to
> > that zspage. The answer is somewhere between these lines, I guess.
>
> But the first list_add() is (in this case) the correct add, so we
> expect other processes to be able to access the zspage after the first
> list_add() anyway, right?

Correct. Compaction can also unlock pool->lock and schedule(), so that
another process can access the source zspage; when compaction gets
scheduled again it can attempt to putback/unlock the same zspage one
more time (the zspage may not even exist at that point, I assume).
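That window can be sketched sequentially as well, with a flag standing in
for the actual freeing so the example stays well-defined; all names are
again hypothetical, not mm/zsmalloc.c code:

#include <stdbool.h>
#include <stdio.h>

struct zspage_model {
	bool exists;
	int times_put_back;
};

static void putback_model(struct zspage_model *zspage)
{
	if (!zspage->exists) {
		printf("BUG: putback of a zspage that no longer exists\n");
		return;
	}
	zspage->times_put_back++;
}

int main(void)
{
	struct zspage_model src = { .exists = true };
	struct zspage_model *src_zspage = &src;

	/* Compaction puts the source zspage back but keeps the stale pointer. */
	putback_model(src_zspage);

	/*
	 * pool->lock is dropped and compaction schedules; in this window
	 * another task (e.g. another zs_compact()) is free to empty and
	 * free the zspage.  Simulated here by clearing the flag.
	 */
	src.exists = false;

	/* Compaction resumes and hits the post-loop putback on the stale pointer. */
	putback_model(src_zspage);
	return 0;
}

In the kernel, the second putback would instead touch memory that may
already have been freed and reused, which is a plausible way to end up with
arbitrary (including NULL) list pointers.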
On Mon, Apr 17, 2023 at 4:24 AM Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> On (23/04/17 04:16), Yosry Ahmed wrote:
> > > That's a good question to which I don't have an answer. We can list_add()
> > > the same zspage twice, unlocking the pool after the first list_add() so
> > > that another process (including another zs_compact()) can do something to
> > > that zspage. The answer is somewhere between these lines, I guess.
> >
> > But the first list_add() is (in this case) the correct add, so we
> > expect other processes to be able to access the zspage after the first
> > list_add() anyway, right?
>
> Correct. Compaction can also unlock pool->lock and schedule(), so that
> another process can access the source zspage; when compaction gets
> scheduled again it can attempt to putback/unlock the same zspage one
> more time (the zspage may not even exist at that point, I assume).

Good point, that could very well be where the corruption is coming
from. Thanks for pointing this out.