Message ID | 20250225213200.729056-1-nphamcs@gmail.com (mailing list archive)
---|---
State | New
Series | zswap: do not crash the kernel on decompression failure
On Tue, 25 Feb 2025 13:32:00 -0800 Nhat Pham <nphamcs@gmail.com> wrote: > @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio) > sg_init_table(&output, 1); > sg_set_folio(&output, folio, PAGE_SIZE, 0); > acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE); > - BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait)); > - BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE); > + if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) || > + acomp_ctx->req->dlen != PAGE_SIZE) { > + ret = false; > + zswap_reject_decompress_fail++; > + pr_alert_ratelimited( > + "decompression failed on zswap entry with offset %08lx\n", > + entry->swpentry.val); > + } > mutex_unlock(&acomp_ctx->mutex); This mutex_unlock() isn't present in current kernels. I'd normally just fix the reject but a change in the locking environment needs some careful checking and retesting, please.
On Tue, Feb 25, 2025 at 2:25 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Tue, 25 Feb 2025 13:32:00 -0800 Nhat Pham <nphamcs@gmail.com> wrote: > > > @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio) > > sg_init_table(&output, 1); > > sg_set_folio(&output, folio, PAGE_SIZE, 0); > > acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE); > > - BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait)); > > - BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE); > > + if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) || > > + acomp_ctx->req->dlen != PAGE_SIZE) { > > + ret = false; > > + zswap_reject_decompress_fail++; > > + pr_alert_ratelimited( > > + "decompression failed on zswap entry with offset %08lx\n", > > + entry->swpentry.val); > > + } > > mutex_unlock(&acomp_ctx->mutex); > > This mutex_unlock() isn't present in current kernels. I'd normally just fix > the reject but a change in the locking environment needs some careful > checking and retesting, please. Hah strange. I could have sworn I pulled the latest mm-unstable, but maybe the git pull failed and I didn't notice. Thanks for picking this up Andrew! Lemme re-pull and rebase.
On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote: > Currently, we crash the kernel when a decompression failure occurs in > zswap (either because of memory corruption, or a bug in the compression > algorithm). This is overkill. We should only SIGBUS the unfortunate > process asking for the zswap entry on zswap load, and skip the corrupted > entry in zswap writeback. > > See [1] for a recent upstream discussion about this. > > [1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/ > > Suggested-by: Matthew Wilcox <willy@infradead.org> > Suggested-by: Yosry Ahmed <yosryahmed@google.com> > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > --- > mm/zswap.c | 85 +++++++++++++++++++++++++++++++++++++----------------- > 1 file changed, 58 insertions(+), 27 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index f6316b66fb23..31d4397eed61 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail; > static u64 zswap_reject_compress_fail; > /* Compressed page was too big for the allocator to (optimally) store */ > static u64 zswap_reject_compress_poor; > +/* Load and writeback failed due to decompression failure */ > +static u64 zswap_reject_decompress_fail; "reject" refers to "rejected store", so the name doesn't quite make sense. zswap_decompress_fail? > /* Store failed because underlying allocator could not get memory */ > static u64 zswap_reject_alloc_fail; > /* Store failed because the entry metadata could not be allocated (rare) */ > @@ -953,11 +955,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry, > return comp_ret == 0 && alloc_ret == 0; > } > > -static void zswap_decompress(struct zswap_entry *entry, struct folio *folio) > +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio) > { > struct zpool *zpool = entry->pool->zpool; > struct scatterlist input, output; > struct crypto_acomp_ctx *acomp_ctx; > + bool ret = true; > u8 *src; > > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio) > sg_init_table(&output, 1); > sg_set_folio(&output, folio, PAGE_SIZE, 0); > acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE); > - BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait)); > - BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE); > + if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) || > + acomp_ctx->req->dlen != PAGE_SIZE) { > + ret = false; > + zswap_reject_decompress_fail++; > + pr_alert_ratelimited( > + "decompression failed on zswap entry with offset %08lx\n", > + entry->swpentry.val); Since this is a pretty dramatic failure scenario, IMO it would be useful to dump as much info as possible. The exact return value of crypto_wait_req() could be useful, entry->length and req->dlen too. entry->pool->tfm_name just to make absolute sure there is no confusion, as the compressor can be switched for new stores. Maybe swp_type() and swp_offset() of entry->swpentry? Those could be easy markers to see if the entry was corrupted for example. 
> + } > mutex_unlock(&acomp_ctx->mutex); > > if (src != acomp_ctx->buffer) > zpool_unmap_handle(zpool, entry->handle); > + return ret; > } > > /********************************* > @@ -1018,6 +1028,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > struct writeback_control wbc = { > .sync_mode = WB_SYNC_NONE, > }; > + int ret = 0; > > /* try to allocate swap cache folio */ > mpol = get_task_policy(current); > @@ -1034,8 +1045,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > * and freed when invalidated by the concurrent shrinker anyway. > */ > if (!folio_was_allocated) { > - folio_put(folio); > - return -EEXIST; > + ret = -EEXIST; > + goto put_folio; > } > > /* > @@ -1048,14 +1059,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > * be dereferenced. > */ > tree = swap_zswap_tree(swpentry); > - if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) { > - delete_from_swap_cache(folio); > - folio_unlock(folio); > - folio_put(folio); > - return -ENOMEM; > + if (entry != xa_load(tree, offset)) { > + ret = -ENOMEM; > + goto fail; > } > > - zswap_decompress(entry, folio); > + if (!zswap_decompress(entry, folio)) { > + ret = -EIO; > + goto fail; > + } > + > + xa_erase(tree, offset); > > count_vm_event(ZSWPWB); > if (entry->objcg) > @@ -1071,9 +1085,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > /* start writeback */ > __swap_writepage(folio, &wbc); > - folio_put(folio); > + goto put_folio; > > - return 0; > +fail: > + delete_from_swap_cache(folio); > + folio_unlock(folio); > +put_folio: > + folio_put(folio); > + return ret; > } Nice, yeah it's time for factoring out the error unwinding. If you write it like this, you can save a jump in the main sequence: __swap_writepage(folio, &wbc); ret = 0; put: folio_put(folio); return ret; delete_unlock: delete_from_swap_cache(folio); folio_unlock(folio); goto put; > > /********************************* > @@ -1600,6 +1619,29 @@ bool zswap_load(struct folio *folio) > if (WARN_ON_ONCE(folio_test_large(folio))) > return true; > > + /* > + * We cannot invalidate the zswap entry before decompressing it. If > + * decompression fails, we must keep the entry in the tree so that > + * a future read by another process on the same swap entry will also > + * have to go through zswap. Otherwise, we risk silently reading > + * corrupted data for the other process. > + */ > + entry = xa_load(tree, offset); > + if (!entry) > + return false; The explanation in the comment makes sense in the context of this change. But fresh eyes reading this function and having never seen that this *used to* open with xa_erase() will be confused. It answers questions the reader doesn't have at this point - it's just a function called zswap_load() beginning with an xa_load(), so what? At first I was going to suggest moving it down to the swapcache branch. But honestly after reading *that* comment again, in the new sequence, I don't think the question will arise there either. It's pretty self-evident that the whole "we can invalidate when reading into the swapcache" does not apply if the read failed. > + /* > + * If decompression fails, we return true to notify the caller that the > + * folio's data were in zswap, but do not mark the folio as up-to-date. > + * This will effectively SIGBUS the calling process. > + */ It would be good to put a lampshade on this weirdness that the return value has nothing to do with success or not. 
This wasn't as important a distinction, but with actual decompression failures I think it is. Something like this? if (!zswap_decompress(entry, folio)) { /* * The zswap_load() return value doesn't indicate success or * failure, but whether zswap owns the swapped out contents. * This MUST return true here, otherwise swap_readpage() will * read garbage from the backend. * * Success is signaled by marking the folio uptodate. */ return true; } folio_mark_uptodate(folio); > + if (!zswap_decompress(entry, folio)) > + return true; > + > + count_vm_event(ZSWPIN); > + if (entry->objcg) > + count_objcg_events(entry->objcg, ZSWPIN, 1); > + > /* > * When reading into the swapcache, invalidate our entry. The > * swapcache can be the authoritative owner of the page and > @@ -1612,21 +1654,8 @@ bool zswap_load(struct folio *folio) > * files, which reads into a private page and may free it if > * the fault fails. We remain the primary owner of the entry.) > */ > - if (swapcache) > - entry = xa_erase(tree, offset); > - else > - entry = xa_load(tree, offset); > - > - if (!entry) > - return false; > - > - zswap_decompress(entry, folio); > - > - count_vm_event(ZSWPIN); > - if (entry->objcg) > - count_objcg_events(entry->objcg, ZSWPIN, 1); > - > if (swapcache) { > + xa_erase(tree, offset); > zswap_entry_free(entry); > folio_mark_dirty(folio); > }
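For reference, the error branch in zswap_decompress() with the extra diagnostics asked for above could look roughly like the sketch below. It follows the structures visible in the quoted patch (entry->swpentry, entry->length, entry->pool->tfm_name, acomp_ctx->req->dlen), uses the zswap_decompress_fail counter name suggested in this review, and assumes local int ret and unsigned int dlen variables; it is illustrative, not the final patch.

	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
			      &acomp_ctx->wait);
	dlen = acomp_ctx->req->dlen;
	if (ret || dlen != PAGE_SIZE) {
		zswap_decompress_fail++;
		/* dump type:offset, compressor name, sizes and the crypto error */
		pr_alert_ratelimited("Decompression error from zswap (%u:%lu %s %u->%u, err %d)\n",
				     swp_type(entry->swpentry),
				     swp_offset(entry->swpentry),
				     entry->pool->tfm_name,
				     entry->length, dlen, ret);
	}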
On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote: > Currently, we crash the kernel when a decompression failure occurs in > zswap (either because of memory corruption, or a bug in the compression > algorithm). This is overkill. We should only SIGBUS the unfortunate > process asking for the zswap entry on zswap load, and skip the corrupted > entry in zswap writeback. It's probably worth adding more details here. For example, how the SIGBUS is delivered (it's not super clear in the code), and the distinction between handling decompression failures for loads and writeback. > > See [1] for a recent upstream discussion about this. > > [1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/ > > Suggested-by: Matthew Wilcox <willy@infradead.org> > Suggested-by: Yosry Ahmed <yosryahmed@google.com> It's Yosry Ahmed <yosry.ahmed@linux.dev> now :P > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > --- > mm/zswap.c | 85 +++++++++++++++++++++++++++++++++++++----------------- > 1 file changed, 58 insertions(+), 27 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index f6316b66fb23..31d4397eed61 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail; > static u64 zswap_reject_compress_fail; > /* Compressed page was too big for the allocator to (optimally) store */ > static u64 zswap_reject_compress_poor; > +/* Load and writeback failed due to decompression failure */ Nit: Load or writeback? > +static u64 zswap_reject_decompress_fail; > /* Store failed because underlying allocator could not get memory */ > static u64 zswap_reject_alloc_fail; > /* Store failed because the entry metadata could not be allocated (rare) */ > @@ -953,11 +955,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry, > return comp_ret == 0 && alloc_ret == 0; > } > > -static void zswap_decompress(struct zswap_entry *entry, struct folio *folio) > +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio) > { > struct zpool *zpool = entry->pool->zpool; > struct scatterlist input, output; > struct crypto_acomp_ctx *acomp_ctx; > + bool ret = true; > u8 *src; > > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio) > sg_init_table(&output, 1); > sg_set_folio(&output, folio, PAGE_SIZE, 0); > acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE); > - BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait)); > - BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE); > + if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) || > + acomp_ctx->req->dlen != PAGE_SIZE) { > + ret = false; > + zswap_reject_decompress_fail++; > + pr_alert_ratelimited( > + "decompression failed on zswap entry with offset %08lx\n", > + entry->swpentry.val); > + } This can probably be prettier if we save the return value of crypto_wait_req() in a variable, and then do the check at the end of the function: zswap_decompress() { mutex_lock(); ... ret = crypto_wait_req(..); ... mutex_unlock(); ... 
if (ret || acomp_ctx->req->dlen != PAGE_SIZE) { /* SHIT */ return false; } return true; } > mutex_unlock(&acomp_ctx->mutex); > > if (src != acomp_ctx->buffer) > zpool_unmap_handle(zpool, entry->handle); > + return ret; > } > > /********************************* > @@ -1018,6 +1028,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > struct writeback_control wbc = { > .sync_mode = WB_SYNC_NONE, > }; > + int ret = 0; > > /* try to allocate swap cache folio */ > mpol = get_task_policy(current); > @@ -1034,8 +1045,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > * and freed when invalidated by the concurrent shrinker anyway. > */ > if (!folio_was_allocated) { > - folio_put(folio); > - return -EEXIST; > + ret = -EEXIST; > + goto put_folio; > } > > /* > @@ -1048,14 +1059,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > * be dereferenced. > */ > tree = swap_zswap_tree(swpentry); > - if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) { > - delete_from_swap_cache(folio); > - folio_unlock(folio); > - folio_put(folio); > - return -ENOMEM; > + if (entry != xa_load(tree, offset)) { > + ret = -ENOMEM; > + goto fail; > } > > - zswap_decompress(entry, folio); > + if (!zswap_decompress(entry, folio)) { > + ret = -EIO; > + goto fail; > + } > + > + xa_erase(tree, offset); > > count_vm_event(ZSWPWB); > if (entry->objcg) > @@ -1071,9 +1085,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > /* start writeback */ > __swap_writepage(folio, &wbc); > - folio_put(folio); > + goto put_folio; > > - return 0; > +fail: > + delete_from_swap_cache(folio); > + folio_unlock(folio); > +put_folio: > + folio_put(folio); > + return ret; > } > > /********************************* > @@ -1600,6 +1619,29 @@ bool zswap_load(struct folio *folio) > if (WARN_ON_ONCE(folio_test_large(folio))) > return true; > > + /* > + * We cannot invalidate the zswap entry before decompressing it. If > + * decompression fails, we must keep the entry in the tree so that > + * a future read by another process on the same swap entry will also > + * have to go through zswap. Otherwise, we risk silently reading > + * corrupted data for the other process. > + */ > + entry = xa_load(tree, offset); We are doing load + erase here and in the writeback path now, so two xarray walks instead of one. How does this affect performance? We had a similar about the possiblity of doing a lockless xas_load() followed by a conditional xas_erase() for zswap_invalidate(): https://lore.kernel.org/lkml/20241018192525.95862-1-ryncsn@gmail.com/ Unfortunately it seems like we can't trivially do that unless we keep the tree locked, which we probably don't want to do throughout decompression. How crazy would it be to remove the entry from the tree and re-add it if compression fails? Does swapcache_prepare() provide sufficient protection for us to do this without anyone else looking at this entry (seems like it)? Anyway, this is all moot if the second walk is not noticeable from a perf perspective.
On Tue, Feb 25, 2025 at 07:51:49PM -0500, Johannes Weiner wrote: > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote: > > Currently, we crash the kernel when a decompression failure occurs in > > zswap (either because of memory corruption, or a bug in the compression > > algorithm). This is overkill. We should only SIGBUS the unfortunate > > process asking for the zswap entry on zswap load, and skip the corrupted > > entry in zswap writeback. > > > > See [1] for a recent upstream discussion about this. > > > > [1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/ > > > > Suggested-by: Matthew Wilcox <willy@infradead.org> > > Suggested-by: Yosry Ahmed <yosryahmed@google.com> > > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > > --- > > mm/zswap.c | 85 +++++++++++++++++++++++++++++++++++++----------------- > > 1 file changed, 58 insertions(+), 27 deletions(-) > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > index f6316b66fb23..31d4397eed61 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail; > > static u64 zswap_reject_compress_fail; > > /* Compressed page was too big for the allocator to (optimally) store */ > > static u64 zswap_reject_compress_poor; > > +/* Load and writeback failed due to decompression failure */ > > +static u64 zswap_reject_decompress_fail; > > "reject" refers to "rejected store", so the name doesn't quite make > sense. zswap_decompress_fail? > > > /* Store failed because underlying allocator could not get memory */ > > static u64 zswap_reject_alloc_fail; > > /* Store failed because the entry metadata could not be allocated (rare) */ > > @@ -953,11 +955,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry, > > return comp_ret == 0 && alloc_ret == 0; > > } > > > > -static void zswap_decompress(struct zswap_entry *entry, struct folio *folio) > > +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio) > > { > > struct zpool *zpool = entry->pool->zpool; > > struct scatterlist input, output; > > struct crypto_acomp_ctx *acomp_ctx; > > + bool ret = true; > > u8 *src; > > > > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > > @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio) > > sg_init_table(&output, 1); > > sg_set_folio(&output, folio, PAGE_SIZE, 0); > > acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE); > > - BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait)); > > - BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE); > > + if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) || > > + acomp_ctx->req->dlen != PAGE_SIZE) { > > + ret = false; > > + zswap_reject_decompress_fail++; > > + pr_alert_ratelimited( > > + "decompression failed on zswap entry with offset %08lx\n", > > + entry->swpentry.val); > > Since this is a pretty dramatic failure scenario, IMO it would be > useful to dump as much info as possible. > > The exact return value of crypto_wait_req() could be useful, > entry->length and req->dlen too. > > entry->pool->tfm_name just to make absolute sure there is no > confusion, as the compressor can be switched for new stores. > > Maybe swp_type() and swp_offset() of entry->swpentry? Those could be > easy markers to see if the entry was corrupted for example. 
> > > + } > > mutex_unlock(&acomp_ctx->mutex); > > > > if (src != acomp_ctx->buffer) > > zpool_unmap_handle(zpool, entry->handle); > > + return ret; > > } > > > > /********************************* > > @@ -1018,6 +1028,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > struct writeback_control wbc = { > > .sync_mode = WB_SYNC_NONE, > > }; > > + int ret = 0; > > > > /* try to allocate swap cache folio */ > > mpol = get_task_policy(current); > > @@ -1034,8 +1045,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > * and freed when invalidated by the concurrent shrinker anyway. > > */ > > if (!folio_was_allocated) { > > - folio_put(folio); > > - return -EEXIST; > > + ret = -EEXIST; > > + goto put_folio; > > } > > > > /* > > @@ -1048,14 +1059,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > * be dereferenced. > > */ > > tree = swap_zswap_tree(swpentry); > > - if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) { > > - delete_from_swap_cache(folio); > > - folio_unlock(folio); > > - folio_put(folio); > > - return -ENOMEM; > > + if (entry != xa_load(tree, offset)) { > > + ret = -ENOMEM; > > + goto fail; > > } > > > > - zswap_decompress(entry, folio); > > + if (!zswap_decompress(entry, folio)) { > > + ret = -EIO; > > + goto fail; > > + } > > + > > + xa_erase(tree, offset); > > > > count_vm_event(ZSWPWB); > > if (entry->objcg) > > @@ -1071,9 +1085,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > > > /* start writeback */ > > __swap_writepage(folio, &wbc); > > - folio_put(folio); > > + goto put_folio; > > > > - return 0; > > +fail: > > + delete_from_swap_cache(folio); > > + folio_unlock(folio); > > +put_folio: > > + folio_put(folio); > > + return ret; > > } > > Nice, yeah it's time for factoring out the error unwinding. If you > write it like this, you can save a jump in the main sequence: > > __swap_writepage(folio, &wbc); > ret = 0; > put: > folio_put(folio); > return ret; > delete_unlock: (I like how you sneaked the label rename in here, I didn't like 'fail' either :P) > delete_from_swap_cache(folio); > folio_unlock(folio); > goto put; I would go even further and avoid gotos completely (and make it super clear what gets executed in the normal path vs the failure path): __swap_writepage(folio, &wbc); folio_put(folio); if (ret) { delete_from_swap_cache(folio); folio_unlock(folio); } return ret; > > > > > /********************************* > > @@ -1600,6 +1619,29 @@ bool zswap_load(struct folio *folio) > > if (WARN_ON_ONCE(folio_test_large(folio))) > > return true; > > > > + /* > > + * We cannot invalidate the zswap entry before decompressing it. If > > + * decompression fails, we must keep the entry in the tree so that > > + * a future read by another process on the same swap entry will also > > + * have to go through zswap. Otherwise, we risk silently reading > > + * corrupted data for the other process. > > + */ > > + entry = xa_load(tree, offset); > > + if (!entry) > > + return false; > > The explanation in the comment makes sense in the context of this > change. But fresh eyes reading this function and having never seen > that this *used to* open with xa_erase() will be confused. It answers > questions the reader doesn't have at this point - it's just a function > called zswap_load() beginning with an xa_load(), so what? > > At first I was going to suggest moving it down to the swapcache > branch. 
But honestly after reading *that* comment again, in the new > sequence, I don't think the question will arise there either. It's > pretty self-evident that the whole "we can invalidate when reading > into the swapcache" does not apply if the read failed. +1 > > > + /* > > + * If decompression fails, we return true to notify the caller that the > > + * folio's data were in zswap, but do not mark the folio as up-to-date. > > + * This will effectively SIGBUS the calling process. > > + */ > > It would be good to put a lampshade on this weirdness that the return > value has nothing to do with success or not. This wasn't as important > a distinction, but with actual decompression failures I think it is. +1 > > Something like this? > > if (!zswap_decompress(entry, folio)) { > /* > * The zswap_load() return value doesn't indicate success or > * failure, but whether zswap owns the swapped out contents. > * This MUST return true here, otherwise swap_readpage() will > * read garbage from the backend. > * > * Success is signaled by marking the folio uptodate. > */ We use the same trick in the folio_test_large() branch, so maybe this should be moved to above the function definition. Then we can perhaps refer to it in places where we return true wihout setting uptodate for added clarity if needed. > return true; > } > > folio_mark_uptodate(folio); > > > + if (!zswap_decompress(entry, folio)) > > + return true; > > + > > + count_vm_event(ZSWPIN); > > + if (entry->objcg) > > + count_objcg_events(entry->objcg, ZSWPIN, 1); > > + > > /* > > * When reading into the swapcache, invalidate our entry. The > > * swapcache can be the authoritative owner of the page and > > @@ -1612,21 +1654,8 @@ bool zswap_load(struct folio *folio) > > * files, which reads into a private page and may free it if > > * the fault fails. We remain the primary owner of the entry.) > > */ > > - if (swapcache) > > - entry = xa_erase(tree, offset); > > - else > > - entry = xa_load(tree, offset); > > - > > - if (!entry) > > - return false; > > - > > - zswap_decompress(entry, folio); > > - > > - count_vm_event(ZSWPIN); > > - if (entry->objcg) > > - count_objcg_events(entry->objcg, ZSWPIN, 1); > > - > > if (swapcache) { > > + xa_erase(tree, offset); > > zswap_entry_free(entry); > > folio_mark_dirty(folio); > > } >
On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote: > Currently, we crash the kernel when a decompression failure occurs in > zswap (either because of memory corruption, or a bug in the compression > algorithm). This is overkill. We should only SIGBUS the unfortunate > process asking for the zswap entry on zswap load, and skip the corrupted > entry in zswap writeback. Some relevant observations/questions, but not really actionable for this patch, perhaps some future work, or more likely some incoherent illogical thoughts : (1) It seems like not making the folio uptodate will cause shmem faults to mark the swap entry as hwpoisoned, but I don't see similar handling for do_swap_page(). So it seems like even if we SIGBUS the process, other processes mapping the same page could follow in the same footsteps. (2) A hwpoisoned swap entry results in VM_FAULT_SIGBUS in some cases (e.g. shmem_fault() -> shmem_get_folio_gfp() -> shmem_swapin_folio()), even though we have VM_FAULT_HWPOISON. This patch falls under this bucket, but unfortunately we cannot tell for sure if it's a hwpoision or a decompression bug. (3) If we run into a decompression failure, should we free the underlying memory from zsmalloc? I don't know. On one hand if we free it zsmalloc may start using it for more compressed objects. OTOH, I don't think proper hwpoison handling will kick in until the page is freed. Maybe we should tell zsmalloc to drop this page entirely and mark objects within it as invalid? Probably not worth the hassle but something to think about.
On Wed, Feb 26, 2025 at 02:45:41AM +0000, Yosry Ahmed wrote: > On Tue, Feb 25, 2025 at 07:51:49PM -0500, Johannes Weiner wrote: > > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote: > > > + } > > > mutex_unlock(&acomp_ctx->mutex); > > > > > > if (src != acomp_ctx->buffer) > > > zpool_unmap_handle(zpool, entry->handle); > > > + return ret; > > > } > > > > > > /********************************* > > > @@ -1018,6 +1028,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > > struct writeback_control wbc = { > > > .sync_mode = WB_SYNC_NONE, > > > }; > > > + int ret = 0; > > > > > > /* try to allocate swap cache folio */ > > > mpol = get_task_policy(current); > > > @@ -1034,8 +1045,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > > * and freed when invalidated by the concurrent shrinker anyway. > > > */ > > > if (!folio_was_allocated) { > > > - folio_put(folio); > > > - return -EEXIST; > > > + ret = -EEXIST; > > > + goto put_folio; > > > } > > > > > > /* > > > @@ -1048,14 +1059,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > > * be dereferenced. > > > */ > > > tree = swap_zswap_tree(swpentry); > > > - if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) { > > > - delete_from_swap_cache(folio); > > > - folio_unlock(folio); > > > - folio_put(folio); > > > - return -ENOMEM; > > > + if (entry != xa_load(tree, offset)) { > > > + ret = -ENOMEM; > > > + goto fail; > > > } > > > > > > - zswap_decompress(entry, folio); > > > + if (!zswap_decompress(entry, folio)) { > > > + ret = -EIO; > > > + goto fail; > > > + } > > > + > > > + xa_erase(tree, offset); > > > > > > count_vm_event(ZSWPWB); > > > if (entry->objcg) > > > @@ -1071,9 +1085,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > > > > > /* start writeback */ > > > __swap_writepage(folio, &wbc); > > > - folio_put(folio); > > > + goto put_folio; > > > > > > - return 0; > > > +fail: > > > + delete_from_swap_cache(folio); > > > + folio_unlock(folio); > > > +put_folio: > > > + folio_put(folio); > > > + return ret; > > > } > > > > Nice, yeah it's time for factoring out the error unwinding. If you > > write it like this, you can save a jump in the main sequence: > > > > __swap_writepage(folio, &wbc); > > ret = 0; > > put: > > folio_put(folio); > > return ret; > > delete_unlock: > > (I like how you sneaked the label rename in here, I didn't like 'fail' > either :P) > > > delete_from_swap_cache(folio); > > folio_unlock(folio); > > goto put; > > I would go even further and avoid gotos completely (and make it super > clear what gets executed in the normal path vs the failure path): > > __swap_writepage(folio, &wbc); > folio_put(folio); > if (ret) { > delete_from_swap_cache(folio); > folio_unlock(folio); > } > return ret; The !folio_was_allocated case only needs the put. I guess that could stay open-coded. And I think you still need one goto for the other two error legs to jump past the __swap_writepage. > > Something like this? > > > > if (!zswap_decompress(entry, folio)) { > > /* > > * The zswap_load() return value doesn't indicate success or > > * failure, but whether zswap owns the swapped out contents. > > * This MUST return true here, otherwise swap_readpage() will > > * read garbage from the backend. > > * > > * Success is signaled by marking the folio uptodate. > > */ > > We use the same trick in the folio_test_large() branch, so maybe this > should be moved to above the function definition. 
Then we can perhaps > refer to it in places where we return true wihout setting uptodate for > added clarity if needed. That makes sense to me. Nhat, what do you think?
On Wed, Feb 26, 2025 at 03:12:35AM +0000, Yosry Ahmed wrote: > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote: > > Currently, we crash the kernel when a decompression failure occurs in > > zswap (either because of memory corruption, or a bug in the compression > > algorithm). This is overkill. We should only SIGBUS the unfortunate > > process asking for the zswap entry on zswap load, and skip the corrupted > > entry in zswap writeback. > > Some relevant observations/questions, but not really actionable for this > patch, perhaps some future work, or more likely some incoherent > illogical thoughts : > > (1) It seems like not making the folio uptodate will cause shmem faults > to mark the swap entry as hwpoisoned, but I don't see similar handling > for do_swap_page(). So it seems like even if we SIGBUS the process, > other processes mapping the same page could follow in the same > footsteps. It's analogous to what __end_swap_bio_read() does for block backends, so it's hitchhiking on the standard swap protocol for read failures. The page sticks around if there are other users. It can get reclaimed, but since it's not marked dirty, it won't get overwritten. Another access will either find it in the swapcache and die on !uptodate; if it was reclaimed, it will attempt another decompression. If all references have been killed, zswap_invalidate() will finally drop it. Swapoff actually poisons the page table as well (unuse_pte). > (2) A hwpoisoned swap entry results in VM_FAULT_SIGBUS in some cases > (e.g. shmem_fault() -> shmem_get_folio_gfp() -> shmem_swapin_folio()), > even though we have VM_FAULT_HWPOISON. This patch falls under this > bucket, but unfortunately we cannot tell for sure if it's a hwpoision or > a decompression bug. Are you sure? Actual memory failure should replace the ptes of a mapped shmem page with TTU_HWPOISON, which turns them into special swap entries that trigger VM_FAULT_HWPOISON in do_swap_page(). Anon swap distinguishes as long as the swapfile is there. Swapoff installs poison markers, which are then handled the same in future faults (VM_FAULT_HWPOISON): /* * "Poisoned" here is meant in the very general sense of "future accesses are * invalid", instead of referring very specifically to hardware memory errors. * This marker is meant to represent any of various different causes of this. * * Note that, when encountered by the faulting logic, PTEs with this marker will * result in VM_FAULT_HWPOISON and thus regardless trigger hardware memory error * logic. */ #define PTE_MARKER_POISONED BIT(1)
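For the anon path, the "die on !uptodate" step referred to above is the existing check in do_swap_page(), which in recent mm/memory.c looks roughly like this:

	if (unlikely(!folio_test_uptodate(folio))) {
		ret = VM_FAULT_SIGBUS;
		goto out_nomap;
	}

so a decompression failure surfaces to the faulting task the same way a failed block I/O read from a regular swap device does.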
On Tue, Feb 25, 2025 at 11:57:27PM -0500, Johannes Weiner wrote: > On Wed, Feb 26, 2025 at 03:12:35AM +0000, Yosry Ahmed wrote: > > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote: > > > Currently, we crash the kernel when a decompression failure occurs in > > > zswap (either because of memory corruption, or a bug in the compression > > > algorithm). This is overkill. We should only SIGBUS the unfortunate > > > process asking for the zswap entry on zswap load, and skip the corrupted > > > entry in zswap writeback. > > > > Some relevant observations/questions, but not really actionable for this > > patch, perhaps some future work, or more likely some incoherent > > illogical thoughts : > > > > (1) It seems like not making the folio uptodate will cause shmem faults > > to mark the swap entry as hwpoisoned, but I don't see similar handling > > for do_swap_page(). So it seems like even if we SIGBUS the process, > > other processes mapping the same page could follow in the same > > footsteps. > > It's analogous to what __end_swap_bio_read() does for block backends, > so it's hitchhiking on the standard swap protocol for read failures. Right, that's also how I got the idea when I did the same for large folios handling. > > The page sticks around if there are other users. It can get reclaimed, > but since it's not marked dirty, it won't get overwritten. Another > access will either find it in the swapcache and die on !uptodate; if > it was reclaimed, it will attempt another decompression. If all > references have been killed, zswap_invalidate() will finally drop it. > > Swapoff actually poisons the page table as well (unuse_pte). Right. My question was basically why don't we also poison the page table in do_swap_page() in this case. It's like that we never swapoff. This will cause subsequent fault attempts to return VM_FAULT_HWPOISON quickly without doing through the swapcache or decompression. Probably not a big deal, but shmem does it so maybe it'd be nice to do it for consistency. > > > (2) A hwpoisoned swap entry results in VM_FAULT_SIGBUS in some cases > > (e.g. shmem_fault() -> shmem_get_folio_gfp() -> shmem_swapin_folio()), > > even though we have VM_FAULT_HWPOISON. This patch falls under this > > bucket, but unfortunately we cannot tell for sure if it's a hwpoision or > > a decompression bug. > > Are you sure? Actual memory failure should replace the ptes of a > mapped shmem page with TTU_HWPOISON, which turns them into special > swap entries that trigger VM_FAULT_HWPOISON in do_swap_page(). I was looking at the shmem_fault() path. It seems like for this path we end up with VM_SIGBUS because shmem_swapin_folio() returns -EIO and not -EHWPOISON. This seems like something that can be easily fixed though, unless -EHWPOISON is not always correct for a diffrent reason. > > Anon swap distinguishes as long as the swapfile is there. Swapoff > installs poison markers, which are then handled the same in future > faults (VM_FAULT_HWPOISON): > > /* > * "Poisoned" here is meant in the very general sense of "future accesses are > * invalid", instead of referring very specifically to hardware memory errors. > * This marker is meant to represent any of various different causes of this. > * > * Note that, when encountered by the faulting logic, PTEs with this marker will > * result in VM_FAULT_HWPOISON and thus regardless trigger hardware memory error > * logic. 
If that's the case, maybe it's better for zswap in the future if we stop relying on not marking the folio uptodate, and instead propagate an error through swap_read_folio() to the callers to make sure we always return VM_FAULT_HWPOISON and install poison markers. The handling is a bit quirky and inconsistent, but it ultimately results in VM_SIGBUS or VM_FAULT_HWPOISON which I guess is fine for now. > */ > #define PTE_MARKER_POISONED BIT(1)
On Tue, Feb 25, 2025 at 10:57:30PM -0500, Johannes Weiner wrote: > On Wed, Feb 26, 2025 at 02:45:41AM +0000, Yosry Ahmed wrote: > > On Tue, Feb 25, 2025 at 07:51:49PM -0500, Johannes Weiner wrote: > > > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote: > > > > + } > > > > mutex_unlock(&acomp_ctx->mutex); > > > > > > > > if (src != acomp_ctx->buffer) > > > > zpool_unmap_handle(zpool, entry->handle); > > > > + return ret; > > > > } > > > > > > > > /********************************* > > > > @@ -1018,6 +1028,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > > > struct writeback_control wbc = { > > > > .sync_mode = WB_SYNC_NONE, > > > > }; > > > > + int ret = 0; > > > > > > > > /* try to allocate swap cache folio */ > > > > mpol = get_task_policy(current); > > > > @@ -1034,8 +1045,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > > > * and freed when invalidated by the concurrent shrinker anyway. > > > > */ > > > > if (!folio_was_allocated) { > > > > - folio_put(folio); > > > > - return -EEXIST; > > > > + ret = -EEXIST; > > > > + goto put_folio; > > > > } > > > > > > > > /* > > > > @@ -1048,14 +1059,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > > > * be dereferenced. > > > > */ > > > > tree = swap_zswap_tree(swpentry); > > > > - if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) { > > > > - delete_from_swap_cache(folio); > > > > - folio_unlock(folio); > > > > - folio_put(folio); > > > > - return -ENOMEM; > > > > + if (entry != xa_load(tree, offset)) { > > > > + ret = -ENOMEM; > > > > + goto fail; > > > > } > > > > > > > > - zswap_decompress(entry, folio); > > > > + if (!zswap_decompress(entry, folio)) { > > > > + ret = -EIO; > > > > + goto fail; > > > > + } > > > > + > > > > + xa_erase(tree, offset); > > > > > > > > count_vm_event(ZSWPWB); > > > > if (entry->objcg) > > > > @@ -1071,9 +1085,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry, > > > > > > > > /* start writeback */ > > > > __swap_writepage(folio, &wbc); > > > > - folio_put(folio); > > > > + goto put_folio; > > > > > > > > - return 0; > > > > +fail: > > > > + delete_from_swap_cache(folio); > > > > + folio_unlock(folio); > > > > +put_folio: > > > > + folio_put(folio); > > > > + return ret; > > > > } > > > > > > Nice, yeah it's time for factoring out the error unwinding. If you > > > write it like this, you can save a jump in the main sequence: > > > > > > __swap_writepage(folio, &wbc); > > > ret = 0; > > > put: > > > folio_put(folio); > > > return ret; > > > delete_unlock: > > > > (I like how you sneaked the label rename in here, I didn't like 'fail' > > either :P) > > > > > delete_from_swap_cache(folio); > > > folio_unlock(folio); > > > goto put; > > > > I would go even further and avoid gotos completely (and make it super > > clear what gets executed in the normal path vs the failure path): > > > > __swap_writepage(folio, &wbc); > > folio_put(folio); > > if (ret) { > > delete_from_swap_cache(folio); > > folio_unlock(folio); > > } > > return ret; > > The !folio_was_allocated case only needs the put. I guess that could > stay open-coded. We also do: if (ret && ret != -EEXIST) { ... } or if (ret && !folio_was_allocated) { ... } Probably the latter is clearer. If the folio was already there, we didn't add it to the swapcache or lock it ourselves so we don't unwind that. > > And I think you still need one goto for the other two error legs to > jump past the __swap_writepage. 
Oh yeah I meant eliminate the jumps within the cleanup/return code, not entirely. Sorry for the confusion. We still need an 'out' label or similar after __swap_writepage(). > > > > Something like this? > > > > > > if (!zswap_decompress(entry, folio)) { > > > /* > > > * The zswap_load() return value doesn't indicate success or > > > * failure, but whether zswap owns the swapped out contents. > > > * This MUST return true here, otherwise swap_readpage() will > > > * read garbage from the backend. > > > * > > > * Success is signaled by marking the folio uptodate. > > > */ > > > > We use the same trick in the folio_test_large() branch, so maybe this > > should be moved to above the function definition. Then we can perhaps > > refer to it in places where we return true wihout setting uptodate for > > added clarity if needed. > > That makes sense to me. Nhat, what do you think?
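Putting the two suggestions together, the tail of zswap_writeback_entry() could end up roughly as sketched below: ret starts at 0 as in the patch, the !folio_was_allocated case keeps its open-coded folio_put() and early return, and the swapcache cleanup only runs on error. The "..." stands for the ZSWPWB accounting and folio setup from the quoted hunks; this is a sketch of the shape being discussed, not the final patch.

	tree = swap_zswap_tree(swpentry);
	if (entry != xa_load(tree, offset)) {
		ret = -ENOMEM;
		goto out;
	}

	if (!zswap_decompress(entry, folio)) {
		ret = -EIO;
		goto out;
	}

	xa_erase(tree, offset);
	...
	/* start writeback */
	__swap_writepage(folio, &wbc);
out:
	if (ret) {
		delete_from_swap_cache(folio);
		folio_unlock(folio);
	}
	folio_put(folio);
	return ret;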
On Tue, Feb 25, 2025 at 6:40 PM Yosry Ahmed <yosry.ahmed@linux.dev> wrote: > > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote: > > Currently, we crash the kernel when a decompression failure occurs in > > zswap (either because of memory corruption, or a bug in the compression > > algorithm). This is overkill. We should only SIGBUS the unfortunate > > process asking for the zswap entry on zswap load, and skip the corrupted > > entry in zswap writeback. > > It's probably worth adding more details here. For example, how the > SIGBUS is delivered (it's not super clear in the code), and the > distinction between handling decompression failures for loads and > writeback. Will do :) > > > > > See [1] for a recent upstream discussion about this. > > > > [1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/ > > > > Suggested-by: Matthew Wilcox <willy@infradead.org> > > Suggested-by: Yosry Ahmed <yosryahmed@google.com> > > It's Yosry Ahmed <yosry.ahmed@linux.dev> now :P Oops, will fix :P > > > Signed-off-by: Nhat Pham <nphamcs@gmail.com> > > --- > > mm/zswap.c | 85 +++++++++++++++++++++++++++++++++++++----------------- > > 1 file changed, 58 insertions(+), 27 deletions(-) > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > index f6316b66fb23..31d4397eed61 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail; > > static u64 zswap_reject_compress_fail; > > /* Compressed page was too big for the allocator to (optimally) store */ > > static u64 zswap_reject_compress_poor; > > +/* Load and writeback failed due to decompression failure */ > > Nit: Load or writeback? Consider it done :) > > > +static u64 zswap_reject_decompress_fail; > > /* Store failed because underlying allocator could not get memory */ > > static u64 zswap_reject_alloc_fail; > > /* Store failed because the entry metadata could not be allocated (rare) */ > > @@ -953,11 +955,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry, > > return comp_ret == 0 && alloc_ret == 0; > > } > > > > -static void zswap_decompress(struct zswap_entry *entry, struct folio *folio) > > +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio) > > { > > struct zpool *zpool = entry->pool->zpool; > > struct scatterlist input, output; > > struct crypto_acomp_ctx *acomp_ctx; > > + bool ret = true; > > u8 *src; > > > > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); > > @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio) > > sg_init_table(&output, 1); > > sg_set_folio(&output, folio, PAGE_SIZE, 0); > > acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE); > > - BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait)); > > - BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE); > > + if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) || > > + acomp_ctx->req->dlen != PAGE_SIZE) { > > + ret = false; > > + zswap_reject_decompress_fail++; > > + pr_alert_ratelimited( > > + "decompression failed on zswap entry with offset %08lx\n", > > + entry->swpentry.val); > > + } > > This can probably be prettier if we save the return value of > crypto_wait_req() in a variable, and then do the check at the end of the > function: > > zswap_decompress() > { > mutex_lock(); > ... > ret = crypto_wait_req(..); > ... > mutex_unlock(); > ... 
> if (ret || acomp_ctx->req->dlen != PAGE_SIZE) { Hmmmm, do we need the mutex to protect acomp_ctx->req->dlen? No strong opinions here from my end TBH. > /* SHIT */ > return false; > } > return true; > } [...] > > We are doing load + erase here and in the writeback path now, so two > xarray walks instead of one. How does this affect performance? We had a > similar about the possiblity of doing a lockless xas_load() followed by > a conditional xas_erase() for zswap_invalidate(): > > https://lore.kernel.org/lkml/20241018192525.95862-1-ryncsn@gmail.com/ > > Unfortunately it seems like we can't trivially do that unless we keep > the tree locked, which we probably don't want to do throughout > decompression. > > How crazy would it be to remove the entry from the tree and re-add it if > compression fails? Does swapcache_prepare() provide sufficient > protection for us to do this without anyone else looking at this entry > (seems like it)? My concern is, what happens if xa_store() in the re-add path fails because we do not have enough memory? > > Anyway, this is all moot if the second walk is not noticeable from a > perf perspective.
On Wed, Feb 26, 2025 at 7:33 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote: > > On Tue, Feb 25, 2025 at 11:57:27PM -0500, Johannes Weiner wrote: > > On Wed, Feb 26, 2025 at 03:12:35AM +0000, Yosry Ahmed wrote: > > > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote: > > > > Currently, we crash the kernel when a decompression failure occurs in > > > > zswap (either because of memory corruption, or a bug in the compression > > > > algorithm). This is overkill. We should only SIGBUS the unfortunate > > > > process asking for the zswap entry on zswap load, and skip the corrupted > > > > entry in zswap writeback. > > > > > > Some relevant observations/questions, but not really actionable for this > > > patch, perhaps some future work, or more likely some incoherent > > > illogical thoughts : > > > > > > (1) It seems like not making the folio uptodate will cause shmem faults > > > to mark the swap entry as hwpoisoned, but I don't see similar handling > > > for do_swap_page(). So it seems like even if we SIGBUS the process, > > > other processes mapping the same page could follow in the same > > > footsteps. > > > > It's analogous to what __end_swap_bio_read() does for block backends, > > so it's hitchhiking on the standard swap protocol for read failures. > > Right, that's also how I got the idea when I did the same for large > folios handling. And your handling of the large folio (along with the comment in the other thread) was how I got the idea for this patch :) > > > > > The page sticks around if there are other users. It can get reclaimed, > > but since it's not marked dirty, it won't get overwritten. Another > > access will either find it in the swapcache and die on !uptodate; if > > it was reclaimed, it will attempt another decompression. If all > > references have been killed, zswap_invalidate() will finally drop it. > > > > Swapoff actually poisons the page table as well (unuse_pte). > > Right. My question was basically why don't we also poison the page table > in do_swap_page() in this case. It's like that we never swapoff. That would require a rmap walk right? To also poison the other PTEs that point to the faulty (z)swap entry? Or am I misunderstanding your point :) > > This will cause subsequent fault attempts to return VM_FAULT_HWPOISON > quickly without doing through the swapcache or decompression. Probably > not a big deal, but shmem does it so maybe it'd be nice to do it for > consistency. > > > > > > (2) A hwpoisoned swap entry results in VM_FAULT_SIGBUS in some cases > > > (e.g. shmem_fault() -> shmem_get_folio_gfp() -> shmem_swapin_folio()), > > > even though we have VM_FAULT_HWPOISON. This patch falls under this > > > bucket, but unfortunately we cannot tell for sure if it's a hwpoision or > > > a decompression bug. > > > > Are you sure? Actual memory failure should replace the ptes of a > > mapped shmem page with TTU_HWPOISON, which turns them into special > > swap entries that trigger VM_FAULT_HWPOISON in do_swap_page(). > > I was looking at the shmem_fault() path. It seems like for this path we > end up with VM_SIGBUS because shmem_swapin_folio() returns -EIO and not > -EHWPOISON. This seems like something that can be easily fixed though, > unless -EHWPOISON is not always correct for a diffrent reason. > > > > > Anon swap distinguishes as long as the swapfile is there. 
Swapoff > > installs poison markers, which are then handled the same in future > > faults (VM_FAULT_HWPOISON): > > > > /* > > * "Poisoned" here is meant in the very general sense of "future accesses are > > * invalid", instead of referring very specifically to hardware memory errors. > > * This marker is meant to represent any of various different causes of this. > > * > > * Note that, when encountered by the faulting logic, PTEs with this marker will > > * result in VM_FAULT_HWPOISON and thus regardless trigger hardware memory error > > * logic. > > If that's the case, maybe it's better for zswap in the future if we stop > relying on not marking the folio uptodate, and instead propagate an > error through swap_read_folio() to the callers to make sure we always > return VM_FAULT_HWPOISON and install poison markers. > > The handling is a bit quirky and inconsistent, but it ultimately results > in VM_SIGBUS or VM_FAULT_HWPOISON which I guess is fine for now. Yeah I think it's OK for now. FWIW it's consistent with the way we treat swap IO error, as you pointed out :) > > > */ > > #define PTE_MARKER_POISONED BIT(1)
On Tue, Feb 25, 2025 at 7:12 PM Yosry Ahmed <yosry.ahmed@linux.dev> wrote: > > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote: > > Currently, we crash the kernel when a decompression failure occurs in > > zswap (either because of memory corruption, or a bug in the compression > > algorithm). This is overkill. We should only SIGBUS the unfortunate > > process asking for the zswap entry on zswap load, and skip the corrupted > > entry in zswap writeback. > > Some relevant observations/questions, but not really actionable for this > patch, perhaps some future work, or more likely some incoherent > illogical thoughts : > > (1) It seems like not making the folio uptodate will cause shmem faults > to mark the swap entry as hwpoisoned, but I don't see similar handling > for do_swap_page(). So it seems like even if we SIGBUS the process, > other processes mapping the same page could follow in the same > footsteps. poisoned, I think? It's the weird SWP_PTE_MARKER thing. [...] > > > (3) If we run into a decompression failure, should we free the > underlying memory from zsmalloc? I don't know. On one hand if we free it > zsmalloc may start using it for more compressed objects. OTOH, I don't > think proper hwpoison handling will kick in until the page is freed. > Maybe we should tell zsmalloc to drop this page entirely and mark > objects within it as invalid? Probably not worth the hassle but > something to think about. This might be a fun follow up :) I guess my question is - is there a chance that we might recover in the future? For example, can memory (hardware) failure somehow recover, or the decompression algorithm somehow fix itself? I suppose not? If that is the case, one thing we can do is just free the zsmalloc slot, then mark the zswap entry as corrupted somehow. We can even invalidate the zswap entry altogether, and install a (shared) ZSWAP_CORRUPT_ENTRY. Future readers can check for this and exit if they encounter a corrupted entry? It's not common enough (lol hopefully) for me to optimize right away, but I can get on with it if there are actual data of this happening IRL/in production :)

On Tue, Feb 25, 2025 at 4:51 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > Currently, we crash the kernel when a decompression failure occurs in
> > zswap (either because of memory corruption, or a bug in the compression
> > algorithm). This is overkill. We should only SIGBUS the unfortunate
> > process asking for the zswap entry on zswap load, and skip the corrupted
> > entry in zswap writeback.
> >
> > See [1] for a recent upstream discussion about this.
> >
> > [1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/
> >
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Suggested-by: Yosry Ahmed <yosryahmed@google.com>
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> > ---
> >  mm/zswap.c | 85 +++++++++++++++++++++++++++++++++++++-----------------
> >  1 file changed, 58 insertions(+), 27 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index f6316b66fb23..31d4397eed61 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail;
> >  static u64 zswap_reject_compress_fail;
> >  /* Compressed page was too big for the allocator to (optimally) store */
> >  static u64 zswap_reject_compress_poor;
> > +/* Load and writeback failed due to decompression failure */
> > +static u64 zswap_reject_decompress_fail;
>
> "reject" refers to "rejected store", so the name doesn't quite make
> sense. zswap_decompress_fail?

SGTM :)

>
> >  /* Store failed because underlying allocator could not get memory */
> >  static u64 zswap_reject_alloc_fail;
> >  /* Store failed because the entry metadata could not be allocated (rare) */
> > @@ -953,11 +955,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >  	return comp_ret == 0 && alloc_ret == 0;
> >  }
> >
> > -static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> > +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >  {
> >  	struct zpool *zpool = entry->pool->zpool;
> >  	struct scatterlist input, output;
> >  	struct crypto_acomp_ctx *acomp_ctx;
> > +	bool ret = true;
> >  	u8 *src;
> >
> >  	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >  	sg_init_table(&output, 1);
> >  	sg_set_folio(&output, folio, PAGE_SIZE, 0);
> >  	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> > -	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> > -	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> > +	if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) ||
> > +			acomp_ctx->req->dlen != PAGE_SIZE) {
> > +		ret = false;
> > +		zswap_reject_decompress_fail++;
> > +		pr_alert_ratelimited(
> > +			"decompression failed on zswap entry with offset %08lx\n",
> > +			entry->swpentry.val);
>
> Since this is a pretty dramatic failure scenario, IMO it would be
> useful to dump as much info as possible.
>
> The exact return value of crypto_wait_req() could be useful,
> entry->length and req->dlen too.
>
> entry->pool->tfm_name just to make absolute sure there is no
> confusion, as the compressor can be switched for new stores.
>
> Maybe swp_type() and swp_offset() of entry->swpentry? Those could be
> easy markers to see if the entry was corrupted for example.

I'll include them all :)

>
> > +	}
> >  	mutex_unlock(&acomp_ctx->mutex);
> >
> >  	if (src != acomp_ctx->buffer)
> >  		zpool_unmap_handle(zpool, entry->handle);
> > +	return ret;
> >  }
> >
> >  /*********************************
> > @@ -1018,6 +1028,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >  	struct writeback_control wbc = {
> >  		.sync_mode = WB_SYNC_NONE,
> >  	};
> > +	int ret = 0;
> >
> >  	/* try to allocate swap cache folio */
> >  	mpol = get_task_policy(current);
> > @@ -1034,8 +1045,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >  	 * and freed when invalidated by the concurrent shrinker anyway.
> >  	 */
> >  	if (!folio_was_allocated) {
> > -		folio_put(folio);
> > -		return -EEXIST;
> > +		ret = -EEXIST;
> > +		goto put_folio;
> >  	}
> >
> >  	/*
> > @@ -1048,14 +1059,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >  	 * be dereferenced.
> >  	 */
> >  	tree = swap_zswap_tree(swpentry);
> > -	if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) {
> > -		delete_from_swap_cache(folio);
> > -		folio_unlock(folio);
> > -		folio_put(folio);
> > -		return -ENOMEM;
> > +	if (entry != xa_load(tree, offset)) {
> > +		ret = -ENOMEM;
> > +		goto fail;
> >  	}
> >
> > -	zswap_decompress(entry, folio);
> > +	if (!zswap_decompress(entry, folio)) {
> > +		ret = -EIO;
> > +		goto fail;
> > +	}
> > +
> > +	xa_erase(tree, offset);
> >
> >  	count_vm_event(ZSWPWB);
> >  	if (entry->objcg)
> > @@ -1071,9 +1085,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >
> >  	/* start writeback */
> >  	__swap_writepage(folio, &wbc);
> > -	folio_put(folio);
> > +	goto put_folio;
> >
> > -	return 0;
> > +fail:
> > +	delete_from_swap_cache(folio);
> > +	folio_unlock(folio);
> > +put_folio:
> > +	folio_put(folio);
> > +	return ret;
> >  }
>
> Nice, yeah it's time for factoring out the error unwinding. If you
> write it like this, you can save a jump in the main sequence:
>
> 	__swap_writepage(folio, &wbc);
> 	ret = 0;
> put:
> 	folio_put(folio);
> 	return ret;
> delete_unlock:
> 	delete_from_swap_cache(folio);
> 	folio_unlock(folio);
> 	goto put;

SGTM :)

>
> >
> >  /*********************************
> > @@ -1600,6 +1619,29 @@ bool zswap_load(struct folio *folio)
> >  	if (WARN_ON_ONCE(folio_test_large(folio)))
> >  		return true;
> >
> > +	/*
> > +	 * We cannot invalidate the zswap entry before decompressing it. If
> > +	 * decompression fails, we must keep the entry in the tree so that
> > +	 * a future read by another process on the same swap entry will also
> > +	 * have to go through zswap. Otherwise, we risk silently reading
> > +	 * corrupted data for the other process.
> > +	 */
> > +	entry = xa_load(tree, offset);
> > +	if (!entry)
> > +		return false;
>
> The explanation in the comment makes sense in the context of this
> change. But fresh eyes reading this function and having never seen
> that this *used to* open with xa_erase() will be confused. It answers
> questions the reader doesn't have at this point - it's just a function
> called zswap_load() beginning with an xa_load(), so what?
>
> At first I was going to suggest moving it down to the swapcache
> branch. But honestly after reading *that* comment again, in the new
> sequence, I don't think the question will arise there either. It's
> pretty self-evident that the whole "we can invalidate when reading
> into the swapcache" does not apply if the read failed.
>
> > +	/*
> > +	 * If decompression fails, we return true to notify the caller that the
> > +	 * folio's data were in zswap, but do not mark the folio as up-to-date.
> > +	 * This will effectively SIGBUS the calling process.
> > +	 */
>
> It would be good to put a lampshade on this weirdness that the return
> value has nothing to do with success or not. This wasn't as important
> a distinction, but with actual decompression failures I think it is.
>
> Something like this?
>
> 	if (!zswap_decompress(entry, folio)) {
> 		/*
> 		 * The zswap_load() return value doesn't indicate success or
> 		 * failure, but whether zswap owns the swapped out contents.
> 		 * This MUST return true here, otherwise swap_readpage() will
> 		 * read garbage from the backend.
> 		 *
> 		 * Success is signaled by marking the folio uptodate.
> 		 */
> 		return true;
> 	}

Per Yosry's comment, I'll just move this into zswap_load()'s function
documentation.

>
> 	folio_mark_uptodate(folio);
>
> > +	if (!zswap_decompress(entry, folio))
> > +		return true;
> > +
> > +	count_vm_event(ZSWPIN);
> > +	if (entry->objcg)
> > +		count_objcg_events(entry->objcg, ZSWPIN, 1);
> > +
> >  	/*
> >  	 * When reading into the swapcache, invalidate our entry. The
> >  	 * swapcache can be the authoritative owner of the page and
> > @@ -1612,21 +1654,8 @@ bool zswap_load(struct folio *folio)
> >  	 * files, which reads into a private page and may free it if
> >  	 * the fault fails. We remain the primary owner of the entry.)
> >  	 */
> > -	if (swapcache)
> > -		entry = xa_erase(tree, offset);
> > -	else
> > -		entry = xa_load(tree, offset);
> > -
> > -	if (!entry)
> > -		return false;
> > -
> > -	zswap_decompress(entry, folio);
> > -
> > -	count_vm_event(ZSWPIN);
> > -	if (entry->objcg)
> > -		count_objcg_events(entry->objcg, ZSWPIN, 1);
> > -
> >  	if (swapcache) {
> > +		xa_erase(tree, offset);
> >  		zswap_entry_free(entry);
> >  		folio_mark_dirty(folio);
> >  	}
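As a concrete illustration of the "I'll include them all" direction above, a rough sketch of how the failure branch in zswap_decompress() might report the extra details. The zswap_decompress_fail counter name follows the rename agreed on upthread; the message format, local variable names, and continued use of pr_alert_ratelimited() are assumptions, not the final patch:

	int decomp_ret;
	unsigned int dlen;

	decomp_ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
				     &acomp_ctx->wait);
	dlen = acomp_ctx->req->dlen;
	if (decomp_ret || dlen != PAGE_SIZE) {
		ret = false;
		zswap_decompress_fail++;
		/* dump the swap location, algorithm, lengths and error code */
		pr_alert_ratelimited("Decompression error from zswap (%u:%lu %s %u->%u, err=%d)\n",
				     swp_type(entry->swpentry),
				     swp_offset(entry->swpentry),
				     entry->pool->tfm_name,
				     entry->length, dlen, decomp_ret);
	}
	mutex_unlock(&acomp_ctx->mutex);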
On Wed, Feb 26, 2025 at 03:29:06PM -0800, Nhat Pham wrote:
> On Tue, Feb 25, 2025 at 7:12 PM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
> >
> > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > > Currently, we crash the kernel when a decompression failure occurs in
> > > zswap (either because of memory corruption, or a bug in the compression
> > > algorithm). This is overkill. We should only SIGBUS the unfortunate
> > > process asking for the zswap entry on zswap load, and skip the corrupted
> > > entry in zswap writeback.
> >
> > Some relevant observations/questions, but not really actionable for this
> > patch, perhaps some future work, or more likely some incoherent
> > illogical thoughts:
> >
> > (1) It seems like not making the folio uptodate will cause shmem faults
> > to mark the swap entry as hwpoisoned, but I don't see similar handling
> > for do_swap_page(). So it seems like even if we SIGBUS the process,
> > other processes mapping the same page could follow in the same
> > footsteps.
>
> poisoned, I think? It's the weird SWP_PTE_MARKER thing.

Not sure what you mean here, I am referring to the inconsistency between
shmem_swapin_folio() and do_swap_page().

>
> [...]
>
> >
> > (3) If we run into a decompression failure, should we free the
> > underlying memory from zsmalloc? I don't know. On one hand if we free it
> > zsmalloc may start using it for more compressed objects. OTOH, I don't
> > think proper hwpoison handling will kick in until the page is freed.
> > Maybe we should tell zsmalloc to drop this page entirely and mark
> > objects within it as invalid? Probably not worth the hassle but
> > something to think about.
>
> This might be a fun follow up :) I guess my question is - is there a
> chance that we might recover in the future?
>
> For example, can memory (hardware) failure somehow recover, or the
> decompression algorithm somehow fix itself? I suppose not?
>
> If that is the case, one thing we can do is just free the zsmalloc
> slot, then mark the zswap entry as corrupted somehow. We can even
> invalidate the zswap entry altogether, and install a (shared)
> ZSWAP_CORRUPT_ENTRY. Future readers can check for this and exit if
> they encounter a corrupted entry?
>
> It's not common enough (lol hopefully) for me to optimize right away,
> but I can get on with it if there are actual data of this happening
> IRL/in production :)

I am not aware of this being a common problem, but something to keep in
mind, perhaps.
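For illustration only, the "shared corrupt entry" idea floated above might look roughly like the sketch below. ZSWAP_CORRUPT_ENTRY and zswap_mark_corrupt() are invented names, nothing here is part of the posted patch, and whether to also tell zsmalloc about the bad page is left open, as in the discussion:

	static struct zswap_entry zswap_corrupt_marker;	/* sentinel, never dereferenced */
	#define ZSWAP_CORRUPT_ENTRY	(&zswap_corrupt_marker)

	static void zswap_mark_corrupt(struct xarray *tree, pgoff_t offset,
				       struct zswap_entry *entry)
	{
		/* replace the entry with a tombstone, then drop its backing object */
		xa_store(tree, offset, ZSWAP_CORRUPT_ENTRY, GFP_KERNEL);
		zswap_entry_free(entry);
	}

	/* in zswap_load(): */
	entry = xa_load(tree, offset);
	if (entry == ZSWAP_CORRUPT_ENTRY)
		return true;	/* owned by zswap, but never marked uptodate -> SIGBUS */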
On Wed, Feb 26, 2025 at 03:20:13PM -0800, Nhat Pham wrote:
> On Wed, Feb 26, 2025 at 7:33 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
> >
> > On Tue, Feb 25, 2025 at 11:57:27PM -0500, Johannes Weiner wrote:
> > > On Wed, Feb 26, 2025 at 03:12:35AM +0000, Yosry Ahmed wrote:
> > > > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > > > > Currently, we crash the kernel when a decompression failure occurs in
> > > > > zswap (either because of memory corruption, or a bug in the compression
> > > > > algorithm). This is overkill. We should only SIGBUS the unfortunate
> > > > > process asking for the zswap entry on zswap load, and skip the corrupted
> > > > > entry in zswap writeback.
> > > >
> > > > Some relevant observations/questions, but not really actionable for this
> > > > patch, perhaps some future work, or more likely some incoherent
> > > > illogical thoughts:
> > > >
> > > > (1) It seems like not making the folio uptodate will cause shmem faults
> > > > to mark the swap entry as hwpoisoned, but I don't see similar handling
> > > > for do_swap_page(). So it seems like even if we SIGBUS the process,
> > > > other processes mapping the same page could follow in the same
> > > > footsteps.
> > >
> > > It's analogous to what __end_swap_bio_read() does for block backends,
> > > so it's hitchhiking on the standard swap protocol for read failures.
> >
> > Right, that's also how I got the idea when I did the same for large
> > folios handling.
>
> And your handling of the large folio (along with the comment in the
> other thread) was how I got the idea for this patch :)
>
> >
> > > The page sticks around if there are other users. It can get reclaimed,
> > > but since it's not marked dirty, it won't get overwritten. Another
> > > access will either find it in the swapcache and die on !uptodate; if
> > > it was reclaimed, it will attempt another decompression. If all
> > > references have been killed, zswap_invalidate() will finally drop it.
> > >
> > > Swapoff actually poisons the page table as well (unuse_pte).
> >
> > Right. My question was basically why don't we also poison the page table
> > in do_swap_page() in this case. It's like we never swapoff.
>
> That would require a rmap walk right? To also poison the other PTEs
> that point to the faulty (z)swap entry?
>
> Or am I misunderstanding your point :)

Oh I meant why not just mark the entry where the fault happened as
poisoned at least. Finding other PTEs that point to the swap entry is a
different story. I don't think we can even use the rmap here.
[..]
> > zswap_decompress()
> > {
> > 	mutex_lock();
> > 	...
> > 	ret = crypto_wait_req(..);
> > 	...
> > 	mutex_unlock();
> > 	...
> > 	if (ret || acomp_ctx->req->dlen != PAGE_SIZE) {
>
> Hmmmm, do we need the mutex to protect acomp_ctx->req->dlen? No strong
> opinions here from my end TBH.

We can copy the length to a local variable if that's the case.

>
> > 		/* SHIT */
> > 		return false;
> > 	}
> > 	return true;
> > }
> [...]
> >
> > We are doing load + erase here and in the writeback path now, so two
> > xarray walks instead of one. How does this affect performance? We had a
> > similar discussion about the possibility of doing a lockless xas_load()
> > followed by a conditional xas_erase() for zswap_invalidate():
> >
> > https://lore.kernel.org/lkml/20241018192525.95862-1-ryncsn@gmail.com/
> >
> > Unfortunately it seems like we can't trivially do that unless we keep
> > the tree locked, which we probably don't want to do throughout
> > decompression.
> >
> > How crazy would it be to remove the entry from the tree and re-add it if
> > decompression fails? Does swapcache_prepare() provide sufficient
> > protection for us to do this without anyone else looking at this entry
> > (seems like it)?
>
> My concern is, what happens if xa_store() in the re-add path fails
> because we do not have enough memory?

Hmm good point. xa_erase() is essentially xas_store(NULL), but I think at
some point if a node is made of entirely NULLs we'll free it, and it would
be theoretically possible that we hit this case.

Let's start by checking if the added xarray walk adds any noticeable
overhead and go from there. Ideally we want to test with a large (and
sparse?) xarray to try and hit the worst case.

>
> >
> > Anyway, this is all moot if the second walk is not noticeable from a
> > perf perspective.
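To make the "copy the length to a local variable" option above concrete, a minimal sketch of the structure Yosry is describing; the local variable names are made up and this is not a tested patch:

	bool ret = true;
	unsigned int dlen;
	int decomp_ret;

	mutex_lock(&acomp_ctx->mutex);
	...
	decomp_ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
				     &acomp_ctx->wait);
	/* snapshot dlen while the request is still protected by the mutex */
	dlen = acomp_ctx->req->dlen;
	mutex_unlock(&acomp_ctx->mutex);

	if (decomp_ret || dlen != PAGE_SIZE) {
		/* counting and ratelimited logging can now run unlocked */
		ret = false;
		...
	}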
diff --git a/mm/zswap.c b/mm/zswap.c
index f6316b66fb23..31d4397eed61 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail;
 static u64 zswap_reject_compress_fail;
 /* Compressed page was too big for the allocator to (optimally) store */
 static u64 zswap_reject_compress_poor;
+/* Load and writeback failed due to decompression failure */
+static u64 zswap_reject_decompress_fail;
 /* Store failed because underlying allocator could not get memory */
 static u64 zswap_reject_alloc_fail;
 /* Store failed because the entry metadata could not be allocated (rare) */
@@ -953,11 +955,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	return comp_ret == 0 && alloc_ret == 0;
 }
 
-static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
+static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 {
 	struct zpool *zpool = entry->pool->zpool;
 	struct scatterlist input, output;
 	struct crypto_acomp_ctx *acomp_ctx;
+	bool ret = true;
 	u8 *src;
 
 	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
@@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	sg_init_table(&output, 1);
 	sg_set_folio(&output, folio, PAGE_SIZE, 0);
 	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
-	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
-	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
+	if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) ||
+			acomp_ctx->req->dlen != PAGE_SIZE) {
+		ret = false;
+		zswap_reject_decompress_fail++;
+		pr_alert_ratelimited(
+			"decompression failed on zswap entry with offset %08lx\n",
+			entry->swpentry.val);
+	}
 	mutex_unlock(&acomp_ctx->mutex);
 
 	if (src != acomp_ctx->buffer)
 		zpool_unmap_handle(zpool, entry->handle);
+	return ret;
 }
 
 /*********************************
@@ -1018,6 +1028,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_NONE,
 	};
+	int ret = 0;
 
 	/* try to allocate swap cache folio */
 	mpol = get_task_policy(current);
@@ -1034,8 +1045,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	 * and freed when invalidated by the concurrent shrinker anyway.
 	 */
 	if (!folio_was_allocated) {
-		folio_put(folio);
-		return -EEXIST;
+		ret = -EEXIST;
+		goto put_folio;
 	}
 
 	/*
@@ -1048,14 +1059,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	 * be dereferenced.
 	 */
 	tree = swap_zswap_tree(swpentry);
-	if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) {
-		delete_from_swap_cache(folio);
-		folio_unlock(folio);
-		folio_put(folio);
-		return -ENOMEM;
+	if (entry != xa_load(tree, offset)) {
+		ret = -ENOMEM;
+		goto fail;
 	}
 
-	zswap_decompress(entry, folio);
+	if (!zswap_decompress(entry, folio)) {
+		ret = -EIO;
+		goto fail;
+	}
+
+	xa_erase(tree, offset);
 
 	count_vm_event(ZSWPWB);
 	if (entry->objcg)
@@ -1071,9 +1085,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 
 	/* start writeback */
 	__swap_writepage(folio, &wbc);
-	folio_put(folio);
+	goto put_folio;
 
-	return 0;
+fail:
+	delete_from_swap_cache(folio);
+	folio_unlock(folio);
+put_folio:
+	folio_put(folio);
+	return ret;
 }
 
 /*********************************
@@ -1600,6 +1619,29 @@ bool zswap_load(struct folio *folio)
 	if (WARN_ON_ONCE(folio_test_large(folio)))
 		return true;
 
+	/*
+	 * We cannot invalidate the zswap entry before decompressing it. If
+	 * decompression fails, we must keep the entry in the tree so that
+	 * a future read by another process on the same swap entry will also
+	 * have to go through zswap. Otherwise, we risk silently reading
+	 * corrupted data for the other process.
+	 */
+	entry = xa_load(tree, offset);
+	if (!entry)
+		return false;
+
+	/*
+	 * If decompression fails, we return true to notify the caller that the
+	 * folio's data were in zswap, but do not mark the folio as up-to-date.
+	 * This will effectively SIGBUS the calling process.
+	 */
+	if (!zswap_decompress(entry, folio))
+		return true;
+
+	count_vm_event(ZSWPIN);
+	if (entry->objcg)
+		count_objcg_events(entry->objcg, ZSWPIN, 1);
+
 	/*
 	 * When reading into the swapcache, invalidate our entry. The
 	 * swapcache can be the authoritative owner of the page and
@@ -1612,21 +1654,8 @@ bool zswap_load(struct folio *folio)
 	 * files, which reads into a private page and may free it if
 	 * the fault fails. We remain the primary owner of the entry.)
 	 */
-	if (swapcache)
-		entry = xa_erase(tree, offset);
-	else
-		entry = xa_load(tree, offset);
-
-	if (!entry)
-		return false;
-
-	zswap_decompress(entry, folio);
-
-	count_vm_event(ZSWPIN);
-	if (entry->objcg)
-		count_objcg_events(entry->objcg, ZSWPIN, 1);
-
 	if (swapcache) {
+		xa_erase(tree, offset);
 		zswap_entry_free(entry);
 		folio_mark_dirty(folio);
 	}
@@ -1727,6 +1756,8 @@ static int zswap_debugfs_init(void)
 			   zswap_debugfs_root, &zswap_reject_compress_fail);
 	debugfs_create_u64("reject_compress_poor", 0444,
 			   zswap_debugfs_root, &zswap_reject_compress_poor);
+	debugfs_create_u64("reject_decompress_fail", 0444,
+			   zswap_debugfs_root, &zswap_reject_decompress_fail);
 	debugfs_create_u64("written_back_pages", 0444,
 			   zswap_debugfs_root, &zswap_written_back_pages);
 	debugfs_create_file("pool_total_size", 0444,
Currently, we crash the kernel when a decompression failure occurs in
zswap (either because of memory corruption, or a bug in the compression
algorithm). This is overkill. We should only SIGBUS the unfortunate
process asking for the zswap entry on zswap load, and skip the corrupted
entry in zswap writeback.

See [1] for a recent upstream discussion about this.

[1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/

Suggested-by: Matthew Wilcox <willy@infradead.org>
Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 mm/zswap.c | 85 +++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 58 insertions(+), 27 deletions(-)