Message ID | 20230304034835.2082479-1-senozhatsky@chromium.org (mailing list archive)
Series     | zsmalloc: fine-grained fullness and new compaction algorithm
On Sat, Mar 04, 2023 at 12:48:31PM +0900, Sergey Senozhatsky wrote:
> Hi,
>
> Existing zsmalloc page fullness grouping leads to suboptimal page
> selection for both zs_malloc() and zs_compact(). This patchset
> reworks zsmalloc fullness grouping/classification.
>
> Additionally, it also implements a new compaction algorithm that is
> expected to use less CPU-cycles (as it potentially does fewer
> memcpy-s in zs_object_copy()).
>
> Test (synthetic) results can be seen in patch 0003.
>
> v4:
> -- fixed classes stats loop bug (Yosry)
> -- fixed spelling errors (Andrew)
> -- dropped some unnecessary hunks from the patches
>
> v3:
> -- reworked compaction algorithm implementation (Minchan)
> -- keep existing stats and fullness enums (Minchan, Yosry)
> -- dropped the patch with new zsmalloc compaction stats (Minchan)
> -- report per inuse ratio group classes stats
>
> Sergey Senozhatsky (4):
>   zsmalloc: remove insert_zspage() ->inuse optimization
>   zsmalloc: fine-grained inuse ratio based fullness grouping
>   zsmalloc: rework compaction algorithm
>   zsmalloc: show per fullness group class stats
>
>  mm/zsmalloc.c | 358 ++++++++++++++++++++++++--------------------------
>  1 file changed, 173 insertions(+), 185 deletions(-)
>
> --

Acked-by: Minchan Kim <minchan@kernel.org>

Thanks, Sergey!
On (23/03/10 13:10), Minchan Kim wrote:
> > v4:
> > -- fixed classes stats loop bug (Yosry)
> > -- fixed spelling errors (Andrew)
> > -- dropped some unnecessary hunks from the patches
> >
> > v3:
> > -- reworked compaction algorithm implementation (Minchan)
> > -- keep existing stats and fullness enums (Minchan, Yosry)
> > -- dropped the patch with new zsmalloc compaction stats (Minchan)
> > -- report per inuse ratio group classes stats
> >
> > Sergey Senozhatsky (4):
> >   zsmalloc: remove insert_zspage() ->inuse optimization
> >   zsmalloc: fine-grained inuse ratio based fullness grouping
> >   zsmalloc: rework compaction algorithm
> >   zsmalloc: show per fullness group class stats
> >
> >  mm/zsmalloc.c | 358 ++++++++++++++++++++++++--------------------------
> >  1 file changed, 173 insertions(+), 185 deletions(-)
> >
> > --
>
> Acked-by: Minchan Kim <minchan@kernel.org>
>
> Thanks, Sergey!

Thank you!
On Fri, Mar 3, 2023 at 8:48 PM Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> Hi,
>
> Existing zsmalloc page fullness grouping leads to suboptimal page
> selection for both zs_malloc() and zs_compact(). This patchset
> reworks zsmalloc fullness grouping/classification.
>
> Additionally, it also implements a new compaction algorithm that is
> expected to use less CPU-cycles (as it potentially does fewer
> memcpy-s in zs_object_copy()).
>
> Test (synthetic) results can be seen in patch 0003.

Seeing the following crashes from mm-unstable. Please take a look. Thanks.

  list_add corruption. next is NULL.
  kernel BUG at lib/list_debug.c:26!

  Call Trace:
   <TASK>
   zs_compact+0xbf6/0xda0
   zs_shrinker_scan+0x19/0x30
   do_shrink_slab+0x1ac/0x450
   shrink_slab+0xdc/0x3d0
   shrink_one+0xe2/0x1d0
   shrink_node+0xc7f/0xea0
   do_try_to_free_pages+0x1b5/0x500
   try_to_free_pages+0x396/0x5d0
   __alloc_pages_slowpath+0x5d0/0x1030
   __alloc_pages+0x1de/0x280
   __folio_alloc+0x1e/0x40
   vma_alloc_folio+0x4c0/0x530
   shmem_alloc_and_acct_folio+0x1a6/0x3b0
   shmem_get_folio_gfp+0x689/0xf00
   shmem_fault+0x81/0x240
On (23/04/16 01:20), Yu Zhao wrote:
>
> Seeing the following crashes from mm-unstable. Please take a look. Thanks.
>

Hi,

Did you bisect it down to this series?
On Sun, Apr 16, 2023 at 9:19 AM Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> On (23/04/16 01:20), Yu Zhao wrote:
> >
> > Seeing the following crashes from mm-unstable. Please take a look. Thanks.
> >
>
> Hi,
>
> Did you bisect it down to this series?

Not exactly -- since this series was the only suspect I had, I cherry
picked it to v6.3-rc6 and verified it is the culprit.
On (23/04/16 13:27), Yu Zhao wrote:
> > Hi,
> >
> > Did you bisect it down to this series?
>
> Not exactly -- since this series was the only suspect I had, I cherry
> picked it to v6.3-rc6 and verified it is the culprit.

Can't reproduce it yet. One of the theories is that get_fullness_group()
maybe returns an invalid index, but I don't immediately see how it
would do so.

Is the problem reproducible? Do you run some specific test?
On Sun, Apr 16, 2023 at 8:44 PM Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> On (23/04/16 13:27), Yu Zhao wrote:
> > > Hi,
> > >
> > > Did you bisect it down to this series?
> >
> > Not exactly -- since this series was the only suspect I had, I cherry
> > picked it to v6.3-rc6 and verified it is the culprit.
>
> Can't reproduce it yet. One of the theories is that get_fullness_group()
> maybe returns an invalid index, but I don't immediately see how it
> would do so.
>
> Is the problem reproducible?

Whenever swapping *multithreaded* heavily.

> Do you run some specific test?

E.g.,
  tools/testing/selftests/kvm/max_guest_memory_test -c 112 -m 800 -s 800
with 112 CPUs and ~770GB DRAM + 32GB zram.
On (23/04/16 20:55), Yu Zhao wrote:
> > Do you run some specific test?
>
> E.g.,
>   tools/testing/selftests/kvm/max_guest_memory_test -c 112 -m 800 -s 800
> with 112 CPUs and ~770GB DRAM + 32GB zram.

Hmm ...

Something like this maybe?

The src zspage pointer is not NULL-ed after a non-empty zspage is
put back on the corresponding fullness list.

---

@@ -2239,8 +2241,8 @@ static unsigned long __zs_compact(struct zs_pool *pool,
 		if (fg == ZS_INUSE_RATIO_0) {
 			free_zspage(pool, class, src_zspage);
 			pages_freed += class->pages_per_zspage;
-			src_zspage = NULL;
 		}
+		src_zspage = NULL;

 		if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
 		    || spin_is_contended(&pool->lock)) {
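The effect of the missing NULL-ing can be modeled in a few lines of ordinary
userspace C. The sketch below is hypothetical -- zspage_model, putback_model
and compact_model are made-up names, and it only mirrors the shape of the
loop around the hunk above, not the actual mm/zsmalloc.c code:

#include <stdbool.h>
#include <stdio.h>

struct zspage_model {
	int inuse;
	int times_put_back;
};

/* Stand-in for putback_zspage(): returns a fullness group. */
static int putback_model(struct zspage_model *zspage)
{
	zspage->times_put_back++;
	return zspage->inuse == 0 ? 0 /* "ZS_INUSE_RATIO_0" */ : 50;
}

static void compact_model(struct zspage_model *src, bool clear_unconditionally)
{
	struct zspage_model *src_zspage = src;

	/* One iteration of the compaction loop. */
	int fg = putback_model(src_zspage);
	if (fg == 0) {
		/* free_zspage() would go here. */
		src_zspage = NULL;	/* the only place the old flow cleared it */
	}
	if (clear_unconditionally)
		src_zspage = NULL;	/* what the hunk above does */

	/* ...the loop exits early, e.g. because the pool lock is contended... */

	/* Post-loop cleanup, as in __zs_compact(). */
	if (src_zspage)
		putback_model(src_zspage);	/* second putback of the same zspage */
}

int main(void)
{
	struct zspage_model page = { .inuse = 3 };

	compact_model(&page, false);
	printf("old flow: zspage put back %d time(s)\n", page.times_put_back);

	page.times_put_back = 0;
	compact_model(&page, true);
	printf("new flow: zspage put back %d time(s)\n", page.times_put_back);
	return 0;
}

Under the old flow the same zspage is put back twice; in the real code that
second putback amounts to a second list_add() of a zspage that is already on
a fullness list, which is where the discussion below picks up.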
Hi Sergey,

On Sun, Apr 16, 2023 at 8:52 PM Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> On (23/04/16 20:55), Yu Zhao wrote:
> > > Do you run some specific test?
> >
> > E.g.,
> >   tools/testing/selftests/kvm/max_guest_memory_test -c 112 -m 800 -s 800
> > with 112 CPUs and ~770GB DRAM + 32GB zram.
>
> Hmm ...
>
> Something like this maybe?
>
> The src zspage pointer is not NULL-ed after a non-empty zspage is
> put back on the corresponding fullness list.
>
> ---
>
> @@ -2239,8 +2241,8 @@ static unsigned long __zs_compact(struct zs_pool *pool,
>  		if (fg == ZS_INUSE_RATIO_0) {
>  			free_zspage(pool, class, src_zspage);
>  			pages_freed += class->pages_per_zspage;
> -			src_zspage = NULL;
>  		}
> +		src_zspage = NULL;
>
>  		if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
>  		    || spin_is_contended(&pool->lock)) {

For my own education, how can this result in the "next is NULL" debug
error Yu Zhao is seeing?

IIUC, if we do not set src_zspage to NULL properly after putback, then
we will attempt to putback again after the main loop in some cases.
This can result in a zspage being present more than once in the
per-class fullness list, right?

I am not sure how this can lead to "next is NULL", which sounds like a
corrupted list_head, because the next ptr should never be NULL as far
as I can tell. I feel like I am missing something.
On (23/04/17 01:29), Yosry Ahmed wrote:
> > @@ -2239,8 +2241,8 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> >  		if (fg == ZS_INUSE_RATIO_0) {
> >  			free_zspage(pool, class, src_zspage);
> >  			pages_freed += class->pages_per_zspage;
> > -			src_zspage = NULL;
> >  		}
> > +		src_zspage = NULL;
> >
> >  		if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
> >  		    || spin_is_contended(&pool->lock)) {
>
> For my own education, how can this result in the "next is NULL" debug
> error Yu Zhao is seeing?
>
> IIUC, if we do not set src_zspage to NULL properly after putback, then
> we will attempt to putback again after the main loop in some cases.
> This can result in a zspage being present more than once in the
> per-class fullness list, right?
>
> I am not sure how this can lead to "next is NULL", which sounds like a
> corrupted list_head, because the next ptr should never be NULL as far
> as I can tell. I feel like I am missing something.

That's a good question to which I don't have an answer. We can list_add()
the same zspage twice, unlocking the pool after the first list_add() so
that another process (including another zs_compact()) can do something to
that zspage. The answer is somewhere between these lines, I guess.

I can see how, for example, another DEBUG_LIST check can be triggered:
"list_add double add", because we basically can do

	list_add(page, list)
	list_add(page, list)

I can also see how lockdep can be unhappy with us doing

	write_unlock(&zspage->lock);
	write_unlock(&zspage->lock);

But I don't think I see how "next is NULL" happens (I haven't observed
it).
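To see what such a repeated list_add() does to a kernel-style circular
doubly-linked list, here is a self-contained userspace example. struct
list_head and the helpers below are minimal re-implementations written for
this illustration; the names mimic include/linux/list.h, but nothing here
is kernel code:

#include <stdio.h>

struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add(struct list_head *new, struct list_head *head)
{
	struct list_head *next = head->next;

	next->prev = new;
	new->next = next;
	new->prev = head;
	head->next = new;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry->prev = NULL;	/* poison, roughly like the kernel */
}

int main(void)
{
	struct list_head fullness_list, zspage_a, zspage_b;

	list_init(&fullness_list);

	list_add(&zspage_a, &fullness_list);	/* first, correct putback */
	list_add(&zspage_b, &fullness_list);	/* someone else adds a zspage */
	list_add(&zspage_a, &fullness_list);	/* second, bogus putback */

	/*
	 * Forward traversal now never returns to the list head
	 * (a -> b -> a -> b -> ...), and once one occurrence of
	 * zspage_a is removed, its neighbour is left pointing at a
	 * poisoned node.
	 */
	list_del(&zspage_a);
	printf("zspage_a still reachable: %s\n",
	       fullness_list.next->next == &zspage_a ? "yes" : "no");
	printf("zspage_a.next: %p\n", (void *)zspage_a.next);
	return 0;
}

Note that with another zspage added in between, the second add slips past
the "double add" check (which compares the new entry against its would-be
neighbours) and corrupts the list silently; back-to-back adds would trip the
DEBUG_LIST message quoted above. Either way, this alone still does not
produce a NULL next pointer, consistent with the open question in the thread.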
On Mon, Apr 17, 2023 at 4:12 AM Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> On (23/04/17 01:29), Yosry Ahmed wrote:
> > > @@ -2239,8 +2241,8 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> > >  		if (fg == ZS_INUSE_RATIO_0) {
> > >  			free_zspage(pool, class, src_zspage);
> > >  			pages_freed += class->pages_per_zspage;
> > > -			src_zspage = NULL;
> > >  		}
> > > +		src_zspage = NULL;
> > >
> > >  		if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
> > >  		    || spin_is_contended(&pool->lock)) {
> >
> > For my own education, how can this result in the "next is NULL" debug
> > error Yu Zhao is seeing?
> >
> > IIUC, if we do not set src_zspage to NULL properly after putback, then
> > we will attempt to putback again after the main loop in some cases.
> > This can result in a zspage being present more than once in the
> > per-class fullness list, right?
> >
> > I am not sure how this can lead to "next is NULL", which sounds like a
> > corrupted list_head, because the next ptr should never be NULL as far
> > as I can tell. I feel like I am missing something.
>
> That's a good question to which I don't have an answer. We can list_add()
> the same zspage twice, unlocking the pool after the first list_add() so
> that another process (including another zs_compact()) can do something to
> that zspage. The answer is somewhere between these lines, I guess.

But the first list_add() is (in this case) the correct add, so we
expect other processes to be able to access the zspage after the first
list_add() anyway, right?

> I can see how, for example, another DEBUG_LIST check can be triggered:
> "list_add double add", because we basically can do
>
> 	list_add(page, list)
> 	list_add(page, list)
>
> I can also see how lockdep can be unhappy with us doing
>
> 	write_unlock(&zspage->lock);
> 	write_unlock(&zspage->lock);
>
> But I don't think I see how "next is NULL" happens (I haven't observed
> it).

Yeah, I reached the same conclusion. Couldn't figure out how we can
reach the NULL scenario.
On (23/04/17 04:16), Yosry Ahmed wrote:
> > That's a good question to which I don't have an answer. We can list_add()
> > the same zspage twice, unlocking the pool after the first list_add() so
> > that another process (including another zs_compact()) can do something to
> > that zspage. The answer is somewhere between these lines, I guess.
>
> But the first list_add() is (in this case) the correct add, so we
> expect other processes to be able to access the zspage after the first
> list_add() anyway, right?

Correct. Compaction can also unlock pool->lock and schedule(), so that
another process can access the source zspage; when compaction gets
scheduled again it can attempt to putback/unlock the same zspage one
more time (the zspage may not even exist at that point, I assume).
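That window can be sketched sequentially as well, with a flag standing in
for the actual freeing so the example stays well-defined; all names are
again hypothetical, not mm/zsmalloc.c code:

#include <stdbool.h>
#include <stdio.h>

struct zspage_model {
	bool exists;
	int times_put_back;
};

static void putback_model(struct zspage_model *zspage)
{
	if (!zspage->exists) {
		printf("BUG: putback of a zspage that no longer exists\n");
		return;
	}
	zspage->times_put_back++;
}

int main(void)
{
	struct zspage_model src = { .exists = true };
	struct zspage_model *src_zspage = &src;

	/* Compaction puts the source zspage back but keeps the stale pointer. */
	putback_model(src_zspage);

	/*
	 * pool->lock is dropped and compaction schedules; in this window
	 * another task (e.g. another zs_compact()) is free to empty and
	 * free the zspage.  Simulated here by clearing the flag.
	 */
	src.exists = false;

	/* Compaction resumes and hits the post-loop putback on the stale pointer. */
	putback_model(src_zspage);
	return 0;
}

In the kernel, the second putback would instead touch memory that may
already have been freed and reused, which is a plausible way to end up with
arbitrary (including NULL) list pointers.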
On Mon, Apr 17, 2023 at 4:24 AM Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> On (23/04/17 04:16), Yosry Ahmed wrote:
> > > That's a good question to which I don't have an answer. We can list_add()
> > > the same zspage twice, unlocking the pool after the first list_add() so
> > > that another process (including another zs_compact()) can do something to
> > > that zspage. The answer is somewhere between these lines, I guess.
> >
> > But the first list_add() is (in this case) the correct add, so we
> > expect other processes to be able to access the zspage after the first
> > list_add() anyway, right?
>
> Correct. Compaction can also unlock pool->lock and schedule(), so that
> another process can access the source zspage; when compaction gets
> scheduled again it can attempt to putback/unlock the same zspage one
> more time (the zspage may not even exist at that point, I assume).

Good point, that could very well be where the corruption is coming
from. Thanks for pointing this out.