| Message ID | 20231024000702.1387130-1-nphamcs@gmail.com (mailing list archive) |
| --- | --- |
| State | New |
| Series | zswap: export more zswap store failure stats |
On Mon, Oct 23, 2023 at 05:07:02PM -0700, Nhat Pham wrote:
> Since:
>
> "42c06a0e8ebe mm: kill frontswap"
>
> we no longer have a counter to tracks the number of zswap store
> failures. This makes it hard to investigate and monitor for zswap
> issues.
>
> This patch adds a global and a per-cgroup zswap store failure counter,
> as well as a dedicated debugfs counter for compression algorithm failure
> (which can happen for e.g when random data are passed to zswap).
>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>

I agree this is an issue.

> ---
>  include/linux/vm_event_item.h |  1 +
>  mm/memcontrol.c               |  1 +
>  mm/vmstat.c                   |  1 +
>  mm/zswap.c                    | 18 ++++++++++++++----
>  4 files changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 8abfa1240040..7b2b117b193d 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -145,6 +145,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> ?#ifdef CONFIG_ZSWAP
> ?	ZSWPIN,
> ?	ZSWPOUT,
> +	ZSWPOUT_FAIL,

Would the writeback stat be sufficient to determine this?

Hear me out. We already have pswpout that shows when we're hitting
disk swap. Right now we can't tell if this is because of a rejection
or because of writeback. With a writeback counter we could.

And I think we want the writeback counter anyway going forward in
order to monitor and understand the dynamic shrinker's performance.

Either way we go, one of the metrics needs to be derived from the
other(s). But I think subtle and not so subtle shrinker issues are
more concerning than outright configuration problems where zswap
doesn't work at all. The latter is easier to catch before or during
early deployment with simple functionality tests.

Plus, rejections should be rare. They are now, and they should become
even more rare or cease to exist going forward. Because every time
they happen at scale, they represent problematic LRU inversions. We
have patched, have pending patches, or discussed changes to reduce
every single one of them:

/* Store failed due to a reclaim failure after pool limit was reached */
static u64 zswap_reject_reclaim_fail;

With the shrinker this becomes less relevant. There was also the
proposal to disable the bypass to swap and just keep the page.

/* Compressed page was too big for the allocator to (optimally) store */
static u64 zswap_reject_compress_poor;

You were working on eradicating this (with zsmalloc at least).

/* Store failed because underlying allocator could not get memory */
static u64 zswap_reject_alloc_fail;
/* Store failed because the entry metadata could not be allocated (rare) */
static u64 zswap_reject_kmemcache_fail;

These shouldn't happen at all due to PF_MEMALLOC.

IOW, the fail counter is expected to stay zero in healthy,
well-configured systems. Rather than an MM event that needs counting,
this strikes me as something that could be a WARN down the line...

I agree with adding the debugfs counter though.
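To make the derivation above concrete: globally, a page reaches disk swap (pswpout) either via zswap writeback or because zswap rejected the store and it fell through to swap, so the difference between the two counters approximates the number of store failures. Below is a minimal userspace sketch of that idea, using only counters that already exist (pswpout in /proc/vmstat and zswap's written_back_pages in debugfs); it assumes debugfs is mounted at /sys/kernel/debug and that all swapout traffic goes through zswap.

```c
/*
 * Rough sketch of the derivation discussed above: pages that reach disk
 * swap (pswpout) either went through zswap writeback or were rejected by
 * zswap, so pswpout - written_back_pages approximates store failures.
 */
#include <stdio.h>
#include <string.h>

static long long read_vmstat(const char *name)
{
	char key[64];
	long long val;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return -1;
	while (fscanf(f, "%63s %lld", key, &val) == 2) {
		if (!strcmp(key, name)) {
			fclose(f);
			return val;
		}
	}
	fclose(f);
	return -1;
}

static long long read_u64_file(const char *path)
{
	long long val;
	FILE *f = fopen(path, "r");

	if (!f || fscanf(f, "%lld", &val) != 1)
		val = -1;
	if (f)
		fclose(f);
	return val;
}

int main(void)
{
	long long pswpout = read_vmstat("pswpout");
	long long writeback =
		read_u64_file("/sys/kernel/debug/zswap/written_back_pages");

	if (pswpout < 0 || writeback < 0) {
		fprintf(stderr, "counters unavailable\n");
		return 1;
	}
	printf("approx. zswap store failures: %lld\n", pswpout - writeback);
	return 0;
}
```

Doing the same per cgroup is what requires the writeback counter to be exposed per cgroup as well, which is the point picked up in the reply below.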
On Tue, Oct 24, 2023 at 9:09 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Oct 23, 2023 at 05:07:02PM -0700, Nhat Pham wrote:
> > Since:
> >
> > "42c06a0e8ebe mm: kill frontswap"
> >
> > we no longer have a counter to tracks the number of zswap store
> > failures. This makes it hard to investigate and monitor for zswap
> > issues.
> >
> > This patch adds a global and a per-cgroup zswap store failure counter,
> > as well as a dedicated debugfs counter for compression algorithm failure
> > (which can happen for e.g when random data are passed to zswap).
> >
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
>
> I agree this is an issue.
>
> > ---
> >  include/linux/vm_event_item.h |  1 +
> >  mm/memcontrol.c               |  1 +
> >  mm/vmstat.c                   |  1 +
> >  mm/zswap.c                    | 18 ++++++++++++++----
> >  4 files changed, 17 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> > index 8abfa1240040..7b2b117b193d 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -145,6 +145,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > ?#ifdef CONFIG_ZSWAP
> > ?	ZSWPIN,
> > ?	ZSWPOUT,
> > +	ZSWPOUT_FAIL,
>
> Would the writeback stat be sufficient to determine this?
>
> Hear me out. We already have pswpout that shows when we're hitting
> disk swap. Right now we can't tell if this is because of a rejection
> or because of writeback. With a writeback counter we could.

Oh I see! It's a bit of an extra step, but I suppose (pswpout -
writeback) could give us the number of zswap store failures.

>
> And I think we want the writeback counter anyway going forward in
> order to monitor and understand the dynamic shrinker's performance.

Domenico and I were talking about this, and we both agree the
writeback counter is absolutely necessary - if anything, to make sure
that the shrinker is not a) completely not working or b) going
overboard. So it is coming as part of the shrinker regardless of this.
I just didn't realize that it also solves the issue we're having here!

>
> Either way we go, one of the metrics needs to be derived from the
> other(s). But I think subtle and not so subtle shrinker issues are
> more concerning than outright configuration problems where zswap
> doesn't work at all. The latter is easier to catch before or during
> early deployment with simple functionality tests.
>
> Plus, rejections should be rare. They are now, and they should become
> even more rare or cease to exist going forward. Because every time
> they happen at scale, they represent problematic LRU inversions. We
> have patched, have pending patches, or discussed changes to reduce
> every single one of them:
>
> /* Store failed due to a reclaim failure after pool limit was reached */
> static u64 zswap_reject_reclaim_fail;
>
> With the shrinker this becomes less relevant. There was also the
> proposal to disable the bypass to swap and just keep the page.

The shrinker and that proposal sound like good ideas ;)

>
> /* Compressed page was too big for the allocator to (optimally) store */
> static u64 zswap_reject_compress_poor;
>
> You were working on eradicating this (with zsmalloc at least).
>
> /* Store failed because underlying allocator could not get memory */
> static u64 zswap_reject_alloc_fail;
> /* Store failed because the entry metadata could not be allocated (rare) */
> static u64 zswap_reject_kmemcache_fail;
>
> These shouldn't happen at all due to PF_MEMALLOC.
>
> IOW, the fail counter is expected to stay zero in healthy,
> well-configured systems. Rather than an MM event that needs counting,
> this strikes me as something that could be a WARN down the line...

Yup, I agree that it should (mostly) be at 0. It being non-zero
(especially at a higher ratio w.r.t. the total number of zswap store
counts) is an indication of something wrong - either a bug, a
misconfiguration, or a very ill-compressible workload (or, again, a
bug with the compression algorithm).

A WARN might be good too, but if it's just an ill-compressible
workload that might be too many WARNs :) But we can always just
monitor pswpout - writeback (both globally and on a cgroup basis, I
assume?).

> I agree with adding the debugfs counter though.

Then I'll send a new patch that focuses on the debugfs counter (for
the compression failure).

Thanks for the feedback, Johannes.
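As a rough worked example of the ratio mentioned above (the formula is an illustration, not something specified in the thread): with the counters in the patch as posted, ZSWPOUT only counts successful stores, so the share of failed store attempts would be zswpout_fail / (zswpout + zswpout_fail). For instance, zswpout = 950,000 and zswpout_fail = 50,000 would mean 50,000 / 1,000,000 = 5% of store attempts were rejected, which at scale would point to LRU inversions, a misconfiguration, or an ill-compressible workload.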
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 8abfa1240040..7b2b117b193d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -145,6 +145,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_ZSWAP
 	ZSWPIN,
 	ZSWPOUT,
+	ZSWPOUT_FAIL,
 #endif
 #ifdef CONFIG_X86
 	DIRECT_MAP_LEVEL2_SPLIT,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 61c0c46c2d62..0e247e72a379 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -593,6 +593,7 @@ static const unsigned int memcg_vm_event_stat[] = {
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
 	ZSWPIN,
 	ZSWPOUT,
+	ZSWPOUT_FAIL,
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	THP_FAULT_ALLOC,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 359460deb377..85cc79449355 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1401,6 +1401,7 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_ZSWAP
 	"zswpin",
 	"zswpout",
+	"zswpout_fail",
 #endif
 #ifdef CONFIG_X86
 	"direct_map_level2_splits",
diff --git a/mm/zswap.c b/mm/zswap.c
index 37d2b1cb2ecb..38e6620f8b58 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -61,6 +61,8 @@ static u64 zswap_pool_limit_hit;
 static u64 zswap_written_back_pages;
 /* Store failed due to a reclaim failure after pool limit was reached */
 static u64 zswap_reject_reclaim_fail;
+/* Store failed due to compression algorithm failure */
+static u64 zswap_reject_compress_fail;
 /* Compressed page was too big for the allocator to (optimally) store */
 static u64 zswap_reject_compress_poor;
 /* Store failed because underlying allocator could not get memory */
@@ -1213,10 +1215,10 @@ bool zswap_store(struct folio *folio)
 
 	/* Large folios aren't supported */
 	if (folio_test_large(folio))
-		return false;
+		goto out_reject;
 
 	if (!zswap_enabled || !tree)
-		return false;
+		goto out_reject;
 
 	/*
 	 * If this is a duplicate, it must be removed before attempting to store
@@ -1309,8 +1311,10 @@ bool zswap_store(struct folio *folio)
 	ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
 	dlen = acomp_ctx->req->dlen;
 
-	if (ret)
+	if (ret) {
+		zswap_reject_compress_fail++;
 		goto put_dstmem;
+	}
 
 	/* store */
 	zpool = zswap_find_zpool(entry);
@@ -1377,8 +1381,12 @@ bool zswap_store(struct folio *folio)
 freepage:
 	zswap_entry_cache_free(entry);
 reject:
-	if (objcg)
+	if (objcg) {
+		count_objcg_event(objcg, ZSWPOUT_FAIL);
 		obj_cgroup_put(objcg);
+	}
+out_reject:
+	count_vm_event(ZSWPOUT_FAIL);
 	return false;
 
 shrink:
@@ -1550,6 +1558,8 @@ static int zswap_debugfs_init(void)
 			    zswap_debugfs_root, &zswap_reject_alloc_fail);
 	debugfs_create_u64("reject_kmemcache_fail", 0444,
 			   zswap_debugfs_root, &zswap_reject_kmemcache_fail);
+	debugfs_create_u64("reject_compress_fail", 0444,
+			   zswap_debugfs_root, &zswap_reject_compress_fail);
 	debugfs_create_u64("reject_compress_poor", 0444,
 			   zswap_debugfs_root, &zswap_reject_compress_poor);
 	debugfs_create_u64("written_back_pages", 0444,
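If the patch as posted were applied, the new reject_compress_fail counter from the hunk above would appear as a plain decimal file in zswap's debugfs directory (created by zswap_debugfs_init()), alongside the existing rejection counters. A minimal sketch of reading it, assuming debugfs is mounted at the usual /sys/kernel/debug:

```c
/*
 * Minimal sketch: read the reject_compress_fail counter added by this
 * patch. Assumes the patch is applied and debugfs is mounted at
 * /sys/kernel/debug (the "zswap" directory comes from zswap_debugfs_init()).
 */
#include <stdio.h>

int main(void)
{
	unsigned long long fails;
	FILE *f = fopen("/sys/kernel/debug/zswap/reject_compress_fail", "r");

	if (!f) {
		perror("reject_compress_fail");
		return 1;
	}
	if (fscanf(f, "%llu", &fails) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("zswap compression failures: %llu\n", fails);
	return 0;
}
```

The zswpout_fail event would likewise show up in /proc/vmstat and in each cgroup's memory.stat, although per the discussion above that part is expected to be dropped in favor of the debugfs counter plus the upcoming writeback statistic.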
Since:

"42c06a0e8ebe mm: kill frontswap"

we no longer have a counter to track the number of zswap store
failures. This makes it hard to investigate and monitor for zswap
issues.

This patch adds a global and a per-cgroup zswap store failure counter,
as well as a dedicated debugfs counter for compression algorithm
failures (which can happen, e.g., when random data are passed to
zswap).

Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 include/linux/vm_event_item.h |  1 +
 mm/memcontrol.c               |  1 +
 mm/vmstat.c                   |  1 +
 mm/zswap.c                    | 18 ++++++++++++++----
 4 files changed, 17 insertions(+), 4 deletions(-)