zswap: do not crash the kernel on decompression failure

Message ID	20250225213200.729056-1-nphamcs@gmail.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Nhat Pham <nphamcs@gmail.com> To: akpm@linux-foundation.org Cc: hannes@cmpxchg.org, yosryahmed@google.com, chengming.zhou@linux.dev, linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org Subject: [PATCH] zswap: do not crash the kernel on decompression failure Date: Tue, 25 Feb 2025 13:32:00 -0800 Message-ID: <20250225213200.729056-1-nphamcs@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	zswap: do not crash the kernel on decompression failure \| expand zswap: do not crash the kernel on decompression failure

Nhat Pham Feb. 25, 2025, 9:32 p.m. UTC

Currently, we crash the kernel when a decompression failure occurs in
zswap (either because of memory corruption, or a bug in the compression
algorithm). This is overkill. We should only SIGBUS the unfortunate
process asking for the zswap entry on zswap load, and skip the corrupted
entry in zswap writeback.

See [1] for a recent upstream discussion about this.

[1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/

Suggested-by: Matthew Wilcox <willy@infradead.org>
Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 mm/zswap.c | 85 +++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 58 insertions(+), 27 deletions(-)

Andrew Morton Feb. 25, 2025, 10:25 p.m. UTC | #1

On Tue, 25 Feb 2025 13:32:00 -0800 Nhat Pham <nphamcs@gmail.com> wrote:

> @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  	sg_init_table(&output, 1);
>  	sg_set_folio(&output, folio, PAGE_SIZE, 0);
>  	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> -	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> -	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> +	if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) ||
> +			acomp_ctx->req->dlen != PAGE_SIZE) {
> +		ret = false;
> +		zswap_reject_decompress_fail++;
> +		pr_alert_ratelimited(
> +			"decompression failed on zswap entry with offset %08lx\n",
> +			entry->swpentry.val);
> +	}
>  	mutex_unlock(&acomp_ctx->mutex);

This mutex_unlock() isn't present in current kernels.  I'd normally just fix
the reject but a change in the locking environment needs some careful
checking and retesting, please.

Nhat Pham Feb. 25, 2025, 10:28 p.m. UTC | #2

On Tue, Feb 25, 2025 at 2:25 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 25 Feb 2025 13:32:00 -0800 Nhat Pham <nphamcs@gmail.com> wrote:
>
> > @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >       sg_init_table(&output, 1);
> >       sg_set_folio(&output, folio, PAGE_SIZE, 0);
> >       acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> > -     BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> > -     BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> > +     if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) ||
> > +                     acomp_ctx->req->dlen != PAGE_SIZE) {
> > +             ret = false;
> > +             zswap_reject_decompress_fail++;
> > +             pr_alert_ratelimited(
> > +                     "decompression failed on zswap entry with offset %08lx\n",
> > +                     entry->swpentry.val);
> > +     }
> >       mutex_unlock(&acomp_ctx->mutex);
>
> This mutex_unlock() isn't present in current kernels.  I'd normally just fix
> the reject but a change in the locking environment needs some careful
> checking and retesting, please.

Hah strange. I could have sworn I pulled the latest mm-unstable, but
maybe the git pull failed and I didn't notice. Thanks for picking this
up Andrew! Lemme re-pull and rebase.

Johannes Weiner Feb. 26, 2025, 12:51 a.m. UTC | #3

On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> Currently, we crash the kernel when a decompression failure occurs in
> zswap (either because of memory corruption, or a bug in the compression
> algorithm). This is overkill. We should only SIGBUS the unfortunate
> process asking for the zswap entry on zswap load, and skip the corrupted
> entry in zswap writeback.
> 
> See [1] for a recent upstream discussion about this.
> 
> [1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/
> 
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Suggested-by: Yosry Ahmed <yosryahmed@google.com>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> ---
>  mm/zswap.c | 85 +++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 58 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index f6316b66fb23..31d4397eed61 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail;
>  static u64 zswap_reject_compress_fail;
>  /* Compressed page was too big for the allocator to (optimally) store */
>  static u64 zswap_reject_compress_poor;
> +/* Load and writeback failed due to decompression failure */
> +static u64 zswap_reject_decompress_fail;

"reject" refers to "rejected store", so the name doesn't quite make
sense. zswap_decompress_fail?

>  /* Store failed because underlying allocator could not get memory */
>  static u64 zswap_reject_alloc_fail;
>  /* Store failed because the entry metadata could not be allocated (rare) */
> @@ -953,11 +955,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	return comp_ret == 0 && alloc_ret == 0;
>  }
>  
> -static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  {
>  	struct zpool *zpool = entry->pool->zpool;
>  	struct scatterlist input, output;
>  	struct crypto_acomp_ctx *acomp_ctx;
> +	bool ret = true;
>  	u8 *src;
>  
>  	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  	sg_init_table(&output, 1);
>  	sg_set_folio(&output, folio, PAGE_SIZE, 0);
>  	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> -	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> -	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> +	if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) ||
> +			acomp_ctx->req->dlen != PAGE_SIZE) {
> +		ret = false;
> +		zswap_reject_decompress_fail++;
> +		pr_alert_ratelimited(
> +			"decompression failed on zswap entry with offset %08lx\n",
> +			entry->swpentry.val);

Since this is a pretty dramatic failure scenario, IMO it would be
useful to dump as much info as possible.

The exact return value of crypto_wait_req() could be useful,
entry->length and req->dlen too.

entry->pool->tfm_name just to make absolute sure there is no
confusion, as the compressor can be switched for new stores.

Maybe swp_type() and swp_offset() of entry->swpentry? Those could be
easy markers to see if the entry was corrupted for example.

> +	}
>  	mutex_unlock(&acomp_ctx->mutex);
>  
>  	if (src != acomp_ctx->buffer)
>  		zpool_unmap_handle(zpool, entry->handle);
> +	return ret;
>  }
>  
>  /*********************************
> @@ -1018,6 +1028,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	struct writeback_control wbc = {
>  		.sync_mode = WB_SYNC_NONE,
>  	};
> +	int ret = 0;
>  
>  	/* try to allocate swap cache folio */
>  	mpol = get_task_policy(current);
> @@ -1034,8 +1045,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	 * and freed when invalidated by the concurrent shrinker anyway.
>  	 */
>  	if (!folio_was_allocated) {
> -		folio_put(folio);
> -		return -EEXIST;
> +		ret = -EEXIST;
> +		goto put_folio;
>  	}
>  
>  	/*
> @@ -1048,14 +1059,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	 * be dereferenced.
>  	 */
>  	tree = swap_zswap_tree(swpentry);
> -	if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) {
> -		delete_from_swap_cache(folio);
> -		folio_unlock(folio);
> -		folio_put(folio);
> -		return -ENOMEM;
> +	if (entry != xa_load(tree, offset)) {
> +		ret = -ENOMEM;
> +		goto fail;
>  	}
>  
> -	zswap_decompress(entry, folio);
> +	if (!zswap_decompress(entry, folio)) {
> +		ret = -EIO;
> +		goto fail;
> +	}
> +
> +	xa_erase(tree, offset);
>  
>  	count_vm_event(ZSWPWB);
>  	if (entry->objcg)
> @@ -1071,9 +1085,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  
>  	/* start writeback */
>  	__swap_writepage(folio, &wbc);
> -	folio_put(folio);
> +	goto put_folio;
>  
> -	return 0;
> +fail:
> +	delete_from_swap_cache(folio);
> +	folio_unlock(folio);
> +put_folio:
> +	folio_put(folio);
> +	return ret;
>  }

Nice, yeah it's time for factoring out the error unwinding. If you
write it like this, you can save a jump in the main sequence:

	__swap_writepage(folio, &wbc);
	ret = 0;
put:
	folio_put(folio);
	return ret;
delete_unlock:
	delete_from_swap_cache(folio);
	folio_unlock(folio);
	goto put;

>  
>  /*********************************
> @@ -1600,6 +1619,29 @@ bool zswap_load(struct folio *folio)
>  	if (WARN_ON_ONCE(folio_test_large(folio)))
>  		return true;
>  
> +	/*
> +	 * We cannot invalidate the zswap entry before decompressing it. If
> +	 * decompression fails, we must keep the entry in the tree so that
> +	 * a future read by another process on the same swap entry will also
> +	 * have to go through zswap. Otherwise, we risk silently reading
> +	 * corrupted data for the other process.
> +	 */
> +	entry = xa_load(tree, offset);
> +	if (!entry)
> +		return false;

The explanation in the comment makes sense in the context of this
change. But fresh eyes reading this function and having never seen
that this *used to* open with xa_erase() will be confused. It answers
questions the reader doesn't have at this point - it's just a function
called zswap_load() beginning with an xa_load(), so what?

At first I was going to suggest moving it down to the swapcache
branch. But honestly after reading *that* comment again, in the new
sequence, I don't think the question will arise there either. It's
pretty self-evident that the whole "we can invalidate when reading
into the swapcache" does not apply if the read failed.

> +	/*
> +	 * If decompression fails, we return true to notify the caller that the
> +	 * folio's data were in zswap, but do not mark the folio as up-to-date.
> +	 * This will effectively SIGBUS the calling process.
> +	 */

It would be good to put a lampshade on this weirdness that the return
value has nothing to do with success or not. This wasn't as important
a distinction, but with actual decompression failures I think it is.

Something like this?

	if (!zswap_decompress(entry, folio)) {
		/*
		 * The zswap_load() return value doesn't indicate success or
		 * failure, but whether zswap owns the swapped out contents.
		 * This MUST return true here, otherwise swap_readpage() will
		 * read garbage from the backend.
		 *
		 * Success is signaled by marking the folio uptodate.
		 */
		return true;
	}

	folio_mark_uptodate(folio);

> +	if (!zswap_decompress(entry, folio))
> +		return true;
> +
> +	count_vm_event(ZSWPIN);
> +	if (entry->objcg)
> +		count_objcg_events(entry->objcg, ZSWPIN, 1);
> +
>  	/*
>  	 * When reading into the swapcache, invalidate our entry. The
>  	 * swapcache can be the authoritative owner of the page and
> @@ -1612,21 +1654,8 @@ bool zswap_load(struct folio *folio)
>  	 * files, which reads into a private page and may free it if
>  	 * the fault fails. We remain the primary owner of the entry.)
>  	 */
> -	if (swapcache)
> -		entry = xa_erase(tree, offset);
> -	else
> -		entry = xa_load(tree, offset);
> -
> -	if (!entry)
> -		return false;
> -
> -	zswap_decompress(entry, folio);
> -
> -	count_vm_event(ZSWPIN);
> -	if (entry->objcg)
> -		count_objcg_events(entry->objcg, ZSWPIN, 1);
> -
>  	if (swapcache) {
> +		xa_erase(tree, offset);
>  		zswap_entry_free(entry);
>  		folio_mark_dirty(folio);
>  	}

Yosry Ahmed Feb. 26, 2025, 2:40 a.m. UTC | #4

On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> Currently, we crash the kernel when a decompression failure occurs in
> zswap (either because of memory corruption, or a bug in the compression
> algorithm). This is overkill. We should only SIGBUS the unfortunate
> process asking for the zswap entry on zswap load, and skip the corrupted
> entry in zswap writeback.

It's probably worth adding more details here. For example, how the
SIGBUS is delivered (it's not super clear in the code), and the
distinction between handling decompression failures for loads and
writeback.

> 
> See [1] for a recent upstream discussion about this.
> 
> [1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/
> 
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Suggested-by: Yosry Ahmed <yosryahmed@google.com>

It's Yosry Ahmed <yosry.ahmed@linux.dev> now :P

> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> ---
>  mm/zswap.c | 85 +++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 58 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index f6316b66fb23..31d4397eed61 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail;
>  static u64 zswap_reject_compress_fail;
>  /* Compressed page was too big for the allocator to (optimally) store */
>  static u64 zswap_reject_compress_poor;
> +/* Load and writeback failed due to decompression failure */

Nit: Load or writeback?

> +static u64 zswap_reject_decompress_fail;
>  /* Store failed because underlying allocator could not get memory */
>  static u64 zswap_reject_alloc_fail;
>  /* Store failed because the entry metadata could not be allocated (rare) */
> @@ -953,11 +955,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>  	return comp_ret == 0 && alloc_ret == 0;
>  }
>  
> -static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  {
>  	struct zpool *zpool = entry->pool->zpool;
>  	struct scatterlist input, output;
>  	struct crypto_acomp_ctx *acomp_ctx;
> +	bool ret = true;
>  	u8 *src;
>  
>  	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>  	sg_init_table(&output, 1);
>  	sg_set_folio(&output, folio, PAGE_SIZE, 0);
>  	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> -	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> -	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> +	if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) ||
> +			acomp_ctx->req->dlen != PAGE_SIZE) {
> +		ret = false;
> +		zswap_reject_decompress_fail++;
> +		pr_alert_ratelimited(
> +			"decompression failed on zswap entry with offset %08lx\n",
> +			entry->swpentry.val);
> +	}

This can probably be prettier if we save the return value of
crypto_wait_req() in a variable, and then do the check at the end of the
function:

zswap_decompress()
{
	mutex_lock();
	...
	ret = crypto_wait_req(..);
	...
	mutex_unlock();
	...
	if (ret || acomp_ctx->req->dlen != PAGE_SIZE) {
		/* SHIT */
		return false;
	}
	return true;
}


>  	mutex_unlock(&acomp_ctx->mutex);
>  
>  	if (src != acomp_ctx->buffer)
>  		zpool_unmap_handle(zpool, entry->handle);
> +	return ret;
>  }
>  
>  /*********************************
> @@ -1018,6 +1028,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	struct writeback_control wbc = {
>  		.sync_mode = WB_SYNC_NONE,
>  	};
> +	int ret = 0;
>  
>  	/* try to allocate swap cache folio */
>  	mpol = get_task_policy(current);
> @@ -1034,8 +1045,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	 * and freed when invalidated by the concurrent shrinker anyway.
>  	 */
>  	if (!folio_was_allocated) {
> -		folio_put(folio);
> -		return -EEXIST;
> +		ret = -EEXIST;
> +		goto put_folio;
>  	}
>  
>  	/*
> @@ -1048,14 +1059,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	 * be dereferenced.
>  	 */
>  	tree = swap_zswap_tree(swpentry);
> -	if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) {
> -		delete_from_swap_cache(folio);
> -		folio_unlock(folio);
> -		folio_put(folio);
> -		return -ENOMEM;
> +	if (entry != xa_load(tree, offset)) {
> +		ret = -ENOMEM;
> +		goto fail;
>  	}
>  
> -	zswap_decompress(entry, folio);
> +	if (!zswap_decompress(entry, folio)) {
> +		ret = -EIO;
> +		goto fail;
> +	}
> +
> +	xa_erase(tree, offset);
>  
>  	count_vm_event(ZSWPWB);
>  	if (entry->objcg)
> @@ -1071,9 +1085,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  
>  	/* start writeback */
>  	__swap_writepage(folio, &wbc);
> -	folio_put(folio);
> +	goto put_folio;
>  
> -	return 0;
> +fail:
> +	delete_from_swap_cache(folio);
> +	folio_unlock(folio);
> +put_folio:
> +	folio_put(folio);
> +	return ret;
>  }
>  
>  /*********************************
> @@ -1600,6 +1619,29 @@ bool zswap_load(struct folio *folio)
>  	if (WARN_ON_ONCE(folio_test_large(folio)))
>  		return true;
>  
> +	/*
> +	 * We cannot invalidate the zswap entry before decompressing it. If
> +	 * decompression fails, we must keep the entry in the tree so that
> +	 * a future read by another process on the same swap entry will also
> +	 * have to go through zswap. Otherwise, we risk silently reading
> +	 * corrupted data for the other process.
> +	 */
> +	entry = xa_load(tree, offset);

We are doing load + erase here and in the writeback path now, so two
xarray walks instead of one. How does this affect performance? We had a
similar about the possiblity of doing a lockless xas_load() followed by
a conditional xas_erase() for zswap_invalidate():

https://lore.kernel.org/lkml/20241018192525.95862-1-ryncsn@gmail.com/

Unfortunately it seems like we can't trivially do that unless we keep
the tree locked, which we probably don't want to do throughout
decompression.

How crazy would it be to remove the entry from the tree and re-add it if
compression fails? Does swapcache_prepare() provide sufficient
protection for us to do this without anyone else looking at this entry
(seems like it)?

Anyway, this is all moot if the second walk is not noticeable from a
perf perspective.

Yosry Ahmed Feb. 26, 2025, 2:45 a.m. UTC | #5

On Tue, Feb 25, 2025 at 07:51:49PM -0500, Johannes Weiner wrote:
> On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > Currently, we crash the kernel when a decompression failure occurs in
> > zswap (either because of memory corruption, or a bug in the compression
> > algorithm). This is overkill. We should only SIGBUS the unfortunate
> > process asking for the zswap entry on zswap load, and skip the corrupted
> > entry in zswap writeback.
> > 
> > See [1] for a recent upstream discussion about this.
> > 
> > [1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/
> > 
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Suggested-by: Yosry Ahmed <yosryahmed@google.com>
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> > ---
> >  mm/zswap.c | 85 +++++++++++++++++++++++++++++++++++++-----------------
> >  1 file changed, 58 insertions(+), 27 deletions(-)
> > 
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index f6316b66fb23..31d4397eed61 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail;
> >  static u64 zswap_reject_compress_fail;
> >  /* Compressed page was too big for the allocator to (optimally) store */
> >  static u64 zswap_reject_compress_poor;
> > +/* Load and writeback failed due to decompression failure */
> > +static u64 zswap_reject_decompress_fail;
> 
> "reject" refers to "rejected store", so the name doesn't quite make
> sense. zswap_decompress_fail?
> 
> >  /* Store failed because underlying allocator could not get memory */
> >  static u64 zswap_reject_alloc_fail;
> >  /* Store failed because the entry metadata could not be allocated (rare) */
> > @@ -953,11 +955,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >  	return comp_ret == 0 && alloc_ret == 0;
> >  }
> >  
> > -static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> > +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >  {
> >  	struct zpool *zpool = entry->pool->zpool;
> >  	struct scatterlist input, output;
> >  	struct crypto_acomp_ctx *acomp_ctx;
> > +	bool ret = true;
> >  	u8 *src;
> >  
> >  	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >  	sg_init_table(&output, 1);
> >  	sg_set_folio(&output, folio, PAGE_SIZE, 0);
> >  	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> > -	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> > -	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> > +	if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) ||
> > +			acomp_ctx->req->dlen != PAGE_SIZE) {
> > +		ret = false;
> > +		zswap_reject_decompress_fail++;
> > +		pr_alert_ratelimited(
> > +			"decompression failed on zswap entry with offset %08lx\n",
> > +			entry->swpentry.val);
> 
> Since this is a pretty dramatic failure scenario, IMO it would be
> useful to dump as much info as possible.
> 
> The exact return value of crypto_wait_req() could be useful,
> entry->length and req->dlen too.
> 
> entry->pool->tfm_name just to make absolute sure there is no
> confusion, as the compressor can be switched for new stores.
> 
> Maybe swp_type() and swp_offset() of entry->swpentry? Those could be
> easy markers to see if the entry was corrupted for example.
> 
> > +	}
> >  	mutex_unlock(&acomp_ctx->mutex);
> >  
> >  	if (src != acomp_ctx->buffer)
> >  		zpool_unmap_handle(zpool, entry->handle);
> > +	return ret;
> >  }
> >  
> >  /*********************************
> > @@ -1018,6 +1028,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >  	struct writeback_control wbc = {
> >  		.sync_mode = WB_SYNC_NONE,
> >  	};
> > +	int ret = 0;
> >  
> >  	/* try to allocate swap cache folio */
> >  	mpol = get_task_policy(current);
> > @@ -1034,8 +1045,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >  	 * and freed when invalidated by the concurrent shrinker anyway.
> >  	 */
> >  	if (!folio_was_allocated) {
> > -		folio_put(folio);
> > -		return -EEXIST;
> > +		ret = -EEXIST;
> > +		goto put_folio;
> >  	}
> >  
> >  	/*
> > @@ -1048,14 +1059,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >  	 * be dereferenced.
> >  	 */
> >  	tree = swap_zswap_tree(swpentry);
> > -	if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) {
> > -		delete_from_swap_cache(folio);
> > -		folio_unlock(folio);
> > -		folio_put(folio);
> > -		return -ENOMEM;
> > +	if (entry != xa_load(tree, offset)) {
> > +		ret = -ENOMEM;
> > +		goto fail;
> >  	}
> >  
> > -	zswap_decompress(entry, folio);
> > +	if (!zswap_decompress(entry, folio)) {
> > +		ret = -EIO;
> > +		goto fail;
> > +	}
> > +
> > +	xa_erase(tree, offset);
> >  
> >  	count_vm_event(ZSWPWB);
> >  	if (entry->objcg)
> > @@ -1071,9 +1085,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >  
> >  	/* start writeback */
> >  	__swap_writepage(folio, &wbc);
> > -	folio_put(folio);
> > +	goto put_folio;
> >  
> > -	return 0;
> > +fail:
> > +	delete_from_swap_cache(folio);
> > +	folio_unlock(folio);
> > +put_folio:
> > +	folio_put(folio);
> > +	return ret;
> >  }
> 
> Nice, yeah it's time for factoring out the error unwinding. If you
> write it like this, you can save a jump in the main sequence:
> 
> 	__swap_writepage(folio, &wbc);
> 	ret = 0;
> put:
> 	folio_put(folio);
> 	return ret;
> delete_unlock:

(I like how you sneaked the label rename in here, I didn't like 'fail'
either :P)

> 	delete_from_swap_cache(folio);
> 	folio_unlock(folio);
> 	goto put;

I would go even further and avoid gotos completely (and make it super
clear what gets executed in the normal path vs the failure path):

	__swap_writepage(folio, &wbc);
	folio_put(folio);
	if (ret) {
		delete_from_swap_cache(folio);
		folio_unlock(folio);
	}
	return ret;

> 
> >  
> >  /*********************************
> > @@ -1600,6 +1619,29 @@ bool zswap_load(struct folio *folio)
> >  	if (WARN_ON_ONCE(folio_test_large(folio)))
> >  		return true;
> >  
> > +	/*
> > +	 * We cannot invalidate the zswap entry before decompressing it. If
> > +	 * decompression fails, we must keep the entry in the tree so that
> > +	 * a future read by another process on the same swap entry will also
> > +	 * have to go through zswap. Otherwise, we risk silently reading
> > +	 * corrupted data for the other process.
> > +	 */
> > +	entry = xa_load(tree, offset);
> > +	if (!entry)
> > +		return false;
> 
> The explanation in the comment makes sense in the context of this
> change. But fresh eyes reading this function and having never seen
> that this *used to* open with xa_erase() will be confused. It answers
> questions the reader doesn't have at this point - it's just a function
> called zswap_load() beginning with an xa_load(), so what?
> 
> At first I was going to suggest moving it down to the swapcache
> branch. But honestly after reading *that* comment again, in the new
> sequence, I don't think the question will arise there either. It's
> pretty self-evident that the whole "we can invalidate when reading
> into the swapcache" does not apply if the read failed.

+1

> 
> > +	/*
> > +	 * If decompression fails, we return true to notify the caller that the
> > +	 * folio's data were in zswap, but do not mark the folio as up-to-date.
> > +	 * This will effectively SIGBUS the calling process.
> > +	 */
> 
> It would be good to put a lampshade on this weirdness that the return
> value has nothing to do with success or not. This wasn't as important
> a distinction, but with actual decompression failures I think it is.

+1

> 
> Something like this?
> 
> 	if (!zswap_decompress(entry, folio)) {
> 		/*
> 		 * The zswap_load() return value doesn't indicate success or
> 		 * failure, but whether zswap owns the swapped out contents.
> 		 * This MUST return true here, otherwise swap_readpage() will
> 		 * read garbage from the backend.
> 		 *
> 		 * Success is signaled by marking the folio uptodate.
> 		 */

We use the same trick in the folio_test_large() branch, so maybe this
should be moved to above the function definition. Then we can perhaps
refer to it in places where we return true wihout setting uptodate for
added clarity if needed.

> 		return true;
> 	}
> 
> 	folio_mark_uptodate(folio);
> 
> > +	if (!zswap_decompress(entry, folio))
> > +		return true;
> > +
> > +	count_vm_event(ZSWPIN);
> > +	if (entry->objcg)
> > +		count_objcg_events(entry->objcg, ZSWPIN, 1);
> > +
> >  	/*
> >  	 * When reading into the swapcache, invalidate our entry. The
> >  	 * swapcache can be the authoritative owner of the page and
> > @@ -1612,21 +1654,8 @@ bool zswap_load(struct folio *folio)
> >  	 * files, which reads into a private page and may free it if
> >  	 * the fault fails. We remain the primary owner of the entry.)
> >  	 */
> > -	if (swapcache)
> > -		entry = xa_erase(tree, offset);
> > -	else
> > -		entry = xa_load(tree, offset);
> > -
> > -	if (!entry)
> > -		return false;
> > -
> > -	zswap_decompress(entry, folio);
> > -
> > -	count_vm_event(ZSWPIN);
> > -	if (entry->objcg)
> > -		count_objcg_events(entry->objcg, ZSWPIN, 1);
> > -
> >  	if (swapcache) {
> > +		xa_erase(tree, offset);
> >  		zswap_entry_free(entry);
> >  		folio_mark_dirty(folio);
> >  	}
>

Yosry Ahmed Feb. 26, 2025, 3:12 a.m. UTC | #6

On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> Currently, we crash the kernel when a decompression failure occurs in
> zswap (either because of memory corruption, or a bug in the compression
> algorithm). This is overkill. We should only SIGBUS the unfortunate
> process asking for the zswap entry on zswap load, and skip the corrupted
> entry in zswap writeback.

Some relevant observations/questions, but not really actionable for this
patch, perhaps some future work, or more likely some incoherent
illogical thoughts :

(1) It seems like not making the folio uptodate will cause shmem faults
to mark the swap entry as hwpoisoned, but I don't see similar handling
for do_swap_page(). So it seems like even if we SIGBUS the process,
other processes mapping the same page could follow in the same
footsteps.

(2) A hwpoisoned swap entry results in VM_FAULT_SIGBUS in some cases
(e.g. shmem_fault() -> shmem_get_folio_gfp() -> shmem_swapin_folio()),
even though we have VM_FAULT_HWPOISON. This patch falls under this
bucket, but unfortunately we cannot tell for sure if it's a hwpoision or
a decompression bug.

(3) If we run into a decompression failure, should we free the
underlying memory from zsmalloc? I don't know. On one hand if we free it
zsmalloc may start using it for more compressed objects. OTOH, I don't
think proper hwpoison handling will kick in until the page is freed.
Maybe we should tell zsmalloc to drop this page entirely and mark
objects within it as invalid? Probably not worth the hassle but
something to think about.

Johannes Weiner Feb. 26, 2025, 3:57 a.m. UTC | #7

On Wed, Feb 26, 2025 at 02:45:41AM +0000, Yosry Ahmed wrote:
> On Tue, Feb 25, 2025 at 07:51:49PM -0500, Johannes Weiner wrote:
> > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > > +	}
> > >  	mutex_unlock(&acomp_ctx->mutex);
> > >  
> > >  	if (src != acomp_ctx->buffer)
> > >  		zpool_unmap_handle(zpool, entry->handle);
> > > +	return ret;
> > >  }
> > >  
> > >  /*********************************
> > > @@ -1018,6 +1028,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> > >  	struct writeback_control wbc = {
> > >  		.sync_mode = WB_SYNC_NONE,
> > >  	};
> > > +	int ret = 0;
> > >  
> > >  	/* try to allocate swap cache folio */
> > >  	mpol = get_task_policy(current);
> > > @@ -1034,8 +1045,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> > >  	 * and freed when invalidated by the concurrent shrinker anyway.
> > >  	 */
> > >  	if (!folio_was_allocated) {
> > > -		folio_put(folio);
> > > -		return -EEXIST;
> > > +		ret = -EEXIST;
> > > +		goto put_folio;
> > >  	}
> > >  
> > >  	/*
> > > @@ -1048,14 +1059,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> > >  	 * be dereferenced.
> > >  	 */
> > >  	tree = swap_zswap_tree(swpentry);
> > > -	if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) {
> > > -		delete_from_swap_cache(folio);
> > > -		folio_unlock(folio);
> > > -		folio_put(folio);
> > > -		return -ENOMEM;
> > > +	if (entry != xa_load(tree, offset)) {
> > > +		ret = -ENOMEM;
> > > +		goto fail;
> > >  	}
> > >  
> > > -	zswap_decompress(entry, folio);
> > > +	if (!zswap_decompress(entry, folio)) {
> > > +		ret = -EIO;
> > > +		goto fail;
> > > +	}
> > > +
> > > +	xa_erase(tree, offset);
> > >  
> > >  	count_vm_event(ZSWPWB);
> > >  	if (entry->objcg)
> > > @@ -1071,9 +1085,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> > >  
> > >  	/* start writeback */
> > >  	__swap_writepage(folio, &wbc);
> > > -	folio_put(folio);
> > > +	goto put_folio;
> > >  
> > > -	return 0;
> > > +fail:
> > > +	delete_from_swap_cache(folio);
> > > +	folio_unlock(folio);
> > > +put_folio:
> > > +	folio_put(folio);
> > > +	return ret;
> > >  }
> > 
> > Nice, yeah it's time for factoring out the error unwinding. If you
> > write it like this, you can save a jump in the main sequence:
> > 
> > 	__swap_writepage(folio, &wbc);
> > 	ret = 0;
> > put:
> > 	folio_put(folio);
> > 	return ret;
> > delete_unlock:
> 
> (I like how you sneaked the label rename in here, I didn't like 'fail'
> either :P)
> 
> > 	delete_from_swap_cache(folio);
> > 	folio_unlock(folio);
> > 	goto put;
> 
> I would go even further and avoid gotos completely (and make it super
> clear what gets executed in the normal path vs the failure path):
> 
> 	__swap_writepage(folio, &wbc);
> 	folio_put(folio);
> 	if (ret) {
> 		delete_from_swap_cache(folio);
> 		folio_unlock(folio);
> 	}
> 	return ret;

The !folio_was_allocated case only needs the put. I guess that could
stay open-coded.

And I think you still need one goto for the other two error legs to
jump past the __swap_writepage.

> > Something like this?
> > 
> > 	if (!zswap_decompress(entry, folio)) {
> > 		/*
> > 		 * The zswap_load() return value doesn't indicate success or
> > 		 * failure, but whether zswap owns the swapped out contents.
> > 		 * This MUST return true here, otherwise swap_readpage() will
> > 		 * read garbage from the backend.
> > 		 *
> > 		 * Success is signaled by marking the folio uptodate.
> > 		 */
> 
> We use the same trick in the folio_test_large() branch, so maybe this
> should be moved to above the function definition. Then we can perhaps
> refer to it in places where we return true wihout setting uptodate for
> added clarity if needed.

That makes sense to me. Nhat, what do you think?

Johannes Weiner Feb. 26, 2025, 4:57 a.m. UTC | #8

On Wed, Feb 26, 2025 at 03:12:35AM +0000, Yosry Ahmed wrote:
> On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > Currently, we crash the kernel when a decompression failure occurs in
> > zswap (either because of memory corruption, or a bug in the compression
> > algorithm). This is overkill. We should only SIGBUS the unfortunate
> > process asking for the zswap entry on zswap load, and skip the corrupted
> > entry in zswap writeback.
> 
> Some relevant observations/questions, but not really actionable for this
> patch, perhaps some future work, or more likely some incoherent
> illogical thoughts :
> 
> (1) It seems like not making the folio uptodate will cause shmem faults
> to mark the swap entry as hwpoisoned, but I don't see similar handling
> for do_swap_page(). So it seems like even if we SIGBUS the process,
> other processes mapping the same page could follow in the same
> footsteps.

It's analogous to what __end_swap_bio_read() does for block backends,
so it's hitchhiking on the standard swap protocol for read failures.

The page sticks around if there are other users. It can get reclaimed,
but since it's not marked dirty, it won't get overwritten. Another
access will either find it in the swapcache and die on !uptodate; if
it was reclaimed, it will attempt another decompression. If all
references have been killed, zswap_invalidate() will finally drop it.

Swapoff actually poisons the page table as well (unuse_pte).

> (2) A hwpoisoned swap entry results in VM_FAULT_SIGBUS in some cases
> (e.g. shmem_fault() -> shmem_get_folio_gfp() -> shmem_swapin_folio()),
> even though we have VM_FAULT_HWPOISON. This patch falls under this
> bucket, but unfortunately we cannot tell for sure if it's a hwpoision or
> a decompression bug.

Are you sure? Actual memory failure should replace the ptes of a
mapped shmem page with TTU_HWPOISON, which turns them into special
swap entries that trigger VM_FAULT_HWPOISON in do_swap_page().

Anon swap distinguishes as long as the swapfile is there. Swapoff
installs poison markers, which are then handled the same in future
faults (VM_FAULT_HWPOISON):

/*
 * "Poisoned" here is meant in the very general sense of "future accesses are
 * invalid", instead of referring very specifically to hardware memory errors.
 * This marker is meant to represent any of various different causes of this.
 *
 * Note that, when encountered by the faulting logic, PTEs with this marker will
 * result in VM_FAULT_HWPOISON and thus regardless trigger hardware memory error
 * logic.
 */
#define  PTE_MARKER_POISONED                    BIT(1)

Yosry Ahmed Feb. 26, 2025, 3:33 p.m. UTC | #9

On Tue, Feb 25, 2025 at 11:57:27PM -0500, Johannes Weiner wrote:
> On Wed, Feb 26, 2025 at 03:12:35AM +0000, Yosry Ahmed wrote:
> > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > > Currently, we crash the kernel when a decompression failure occurs in
> > > zswap (either because of memory corruption, or a bug in the compression
> > > algorithm). This is overkill. We should only SIGBUS the unfortunate
> > > process asking for the zswap entry on zswap load, and skip the corrupted
> > > entry in zswap writeback.
> > 
> > Some relevant observations/questions, but not really actionable for this
> > patch, perhaps some future work, or more likely some incoherent
> > illogical thoughts :
> > 
> > (1) It seems like not making the folio uptodate will cause shmem faults
> > to mark the swap entry as hwpoisoned, but I don't see similar handling
> > for do_swap_page(). So it seems like even if we SIGBUS the process,
> > other processes mapping the same page could follow in the same
> > footsteps.
> 
> It's analogous to what __end_swap_bio_read() does for block backends,
> so it's hitchhiking on the standard swap protocol for read failures.

Right, that's also how I got the idea when I did the same for large
folios handling.

> 
> The page sticks around if there are other users. It can get reclaimed,
> but since it's not marked dirty, it won't get overwritten. Another
> access will either find it in the swapcache and die on !uptodate; if
> it was reclaimed, it will attempt another decompression. If all
> references have been killed, zswap_invalidate() will finally drop it.
> 
> Swapoff actually poisons the page table as well (unuse_pte).

Right. My question was basically why don't we also poison the page table
in do_swap_page() in this case. It's like that we never swapoff.

This will cause subsequent fault attempts to return VM_FAULT_HWPOISON
quickly without doing through the swapcache or decompression. Probably
not a big deal, but shmem does it so maybe it'd be nice to do it for
consistency.

> 
> > (2) A hwpoisoned swap entry results in VM_FAULT_SIGBUS in some cases
> > (e.g. shmem_fault() -> shmem_get_folio_gfp() -> shmem_swapin_folio()),
> > even though we have VM_FAULT_HWPOISON. This patch falls under this
> > bucket, but unfortunately we cannot tell for sure if it's a hwpoision or
> > a decompression bug.
> 
> Are you sure? Actual memory failure should replace the ptes of a
> mapped shmem page with TTU_HWPOISON, which turns them into special
> swap entries that trigger VM_FAULT_HWPOISON in do_swap_page().

I was looking at the shmem_fault() path. It seems like for this path we
end up with VM_SIGBUS because shmem_swapin_folio() returns -EIO and not
-EHWPOISON. This seems like something that can be easily fixed though,
unless -EHWPOISON is not always correct for a diffrent reason.

> 
> Anon swap distinguishes as long as the swapfile is there. Swapoff
> installs poison markers, which are then handled the same in future
> faults (VM_FAULT_HWPOISON):
> 
> /*
>  * "Poisoned" here is meant in the very general sense of "future accesses are
>  * invalid", instead of referring very specifically to hardware memory errors.
>  * This marker is meant to represent any of various different causes of this.
>  *
>  * Note that, when encountered by the faulting logic, PTEs with this marker will
>  * result in VM_FAULT_HWPOISON and thus regardless trigger hardware memory error
>  * logic.

If that's the case, maybe it's better for zswap in the future if we stop
relying on not marking the folio uptodate, and instead propagate an
error through swap_read_folio() to the callers to make sure we always
return VM_FAULT_HWPOISON and install poison markers.

The handling is a bit quirky and inconsistent, but it ultimately results
in VM_SIGBUS or VM_FAULT_HWPOISON which I guess is fine for now.

>  */
> #define  PTE_MARKER_POISONED                    BIT(1)

Yosry Ahmed Feb. 26, 2025, 3:42 p.m. UTC | #10

On Tue, Feb 25, 2025 at 10:57:30PM -0500, Johannes Weiner wrote:
> On Wed, Feb 26, 2025 at 02:45:41AM +0000, Yosry Ahmed wrote:
> > On Tue, Feb 25, 2025 at 07:51:49PM -0500, Johannes Weiner wrote:
> > > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > > > +	}
> > > >  	mutex_unlock(&acomp_ctx->mutex);
> > > >  
> > > >  	if (src != acomp_ctx->buffer)
> > > >  		zpool_unmap_handle(zpool, entry->handle);
> > > > +	return ret;
> > > >  }
> > > >  
> > > >  /*********************************
> > > > @@ -1018,6 +1028,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> > > >  	struct writeback_control wbc = {
> > > >  		.sync_mode = WB_SYNC_NONE,
> > > >  	};
> > > > +	int ret = 0;
> > > >  
> > > >  	/* try to allocate swap cache folio */
> > > >  	mpol = get_task_policy(current);
> > > > @@ -1034,8 +1045,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> > > >  	 * and freed when invalidated by the concurrent shrinker anyway.
> > > >  	 */
> > > >  	if (!folio_was_allocated) {
> > > > -		folio_put(folio);
> > > > -		return -EEXIST;
> > > > +		ret = -EEXIST;
> > > > +		goto put_folio;
> > > >  	}
> > > >  
> > > >  	/*
> > > > @@ -1048,14 +1059,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> > > >  	 * be dereferenced.
> > > >  	 */
> > > >  	tree = swap_zswap_tree(swpentry);
> > > > -	if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) {
> > > > -		delete_from_swap_cache(folio);
> > > > -		folio_unlock(folio);
> > > > -		folio_put(folio);
> > > > -		return -ENOMEM;
> > > > +	if (entry != xa_load(tree, offset)) {
> > > > +		ret = -ENOMEM;
> > > > +		goto fail;
> > > >  	}
> > > >  
> > > > -	zswap_decompress(entry, folio);
> > > > +	if (!zswap_decompress(entry, folio)) {
> > > > +		ret = -EIO;
> > > > +		goto fail;
> > > > +	}
> > > > +
> > > > +	xa_erase(tree, offset);
> > > >  
> > > >  	count_vm_event(ZSWPWB);
> > > >  	if (entry->objcg)
> > > > @@ -1071,9 +1085,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> > > >  
> > > >  	/* start writeback */
> > > >  	__swap_writepage(folio, &wbc);
> > > > -	folio_put(folio);
> > > > +	goto put_folio;
> > > >  
> > > > -	return 0;
> > > > +fail:
> > > > +	delete_from_swap_cache(folio);
> > > > +	folio_unlock(folio);
> > > > +put_folio:
> > > > +	folio_put(folio);
> > > > +	return ret;
> > > >  }
> > > 
> > > Nice, yeah it's time for factoring out the error unwinding. If you
> > > write it like this, you can save a jump in the main sequence:
> > > 
> > > 	__swap_writepage(folio, &wbc);
> > > 	ret = 0;
> > > put:
> > > 	folio_put(folio);
> > > 	return ret;
> > > delete_unlock:
> > 
> > (I like how you sneaked the label rename in here, I didn't like 'fail'
> > either :P)
> > 
> > > 	delete_from_swap_cache(folio);
> > > 	folio_unlock(folio);
> > > 	goto put;
> > 
> > I would go even further and avoid gotos completely (and make it super
> > clear what gets executed in the normal path vs the failure path):
> > 
> > 	__swap_writepage(folio, &wbc);
> > 	folio_put(folio);
> > 	if (ret) {
> > 		delete_from_swap_cache(folio);
> > 		folio_unlock(folio);
> > 	}
> > 	return ret;
> 
> The !folio_was_allocated case only needs the put. I guess that could
> stay open-coded.

We also do:

	if (ret && ret != -EEXIST) {
		...
	}

or

	if (ret && !folio_was_allocated) {
		...
	}

Probably the latter is clearer. If the folio was already there, we
didn't add it to the swapcache or lock it ourselves so we don't unwind
that.

> 
> And I think you still need one goto for the other two error legs to
> jump past the __swap_writepage.

Oh yeah I meant eliminate the jumps within the cleanup/return code, not
entirely. Sorry for the confusion. We still need an 'out' label or
similar after __swap_writepage().

> 
> > > Something like this?
> > > 
> > > 	if (!zswap_decompress(entry, folio)) {
> > > 		/*
> > > 		 * The zswap_load() return value doesn't indicate success or
> > > 		 * failure, but whether zswap owns the swapped out contents.
> > > 		 * This MUST return true here, otherwise swap_readpage() will
> > > 		 * read garbage from the backend.
> > > 		 *
> > > 		 * Success is signaled by marking the folio uptodate.
> > > 		 */
> > 
> > We use the same trick in the folio_test_large() branch, so maybe this
> > should be moved to above the function definition. Then we can perhaps
> > refer to it in places where we return true wihout setting uptodate for
> > added clarity if needed.
> 
> That makes sense to me. Nhat, what do you think?

Nhat Pham Feb. 26, 2025, 11:16 p.m. UTC | #11

On Tue, Feb 25, 2025 at 6:40 PM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > Currently, we crash the kernel when a decompression failure occurs in
> > zswap (either because of memory corruption, or a bug in the compression
> > algorithm). This is overkill. We should only SIGBUS the unfortunate
> > process asking for the zswap entry on zswap load, and skip the corrupted
> > entry in zswap writeback.
>
> It's probably worth adding more details here. For example, how the
> SIGBUS is delivered (it's not super clear in the code), and the
> distinction between handling decompression failures for loads and
> writeback.

Will do :)

>
> >
> > See [1] for a recent upstream discussion about this.
> >
> > [1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/
> >
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Suggested-by: Yosry Ahmed <yosryahmed@google.com>
>
> It's Yosry Ahmed <yosry.ahmed@linux.dev> now :P

Oops, will fix :P

>
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> > ---
> >  mm/zswap.c | 85 +++++++++++++++++++++++++++++++++++++-----------------
> >  1 file changed, 58 insertions(+), 27 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index f6316b66fb23..31d4397eed61 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail;
> >  static u64 zswap_reject_compress_fail;
> >  /* Compressed page was too big for the allocator to (optimally) store */
> >  static u64 zswap_reject_compress_poor;
> > +/* Load and writeback failed due to decompression failure */
>
> Nit: Load or writeback?

Consider it done :)

>
> > +static u64 zswap_reject_decompress_fail;
> >  /* Store failed because underlying allocator could not get memory */
> >  static u64 zswap_reject_alloc_fail;
> >  /* Store failed because the entry metadata could not be allocated (rare) */
> > @@ -953,11 +955,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >       return comp_ret == 0 && alloc_ret == 0;
> >  }
> >
> > -static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> > +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >  {
> >       struct zpool *zpool = entry->pool->zpool;
> >       struct scatterlist input, output;
> >       struct crypto_acomp_ctx *acomp_ctx;
> > +     bool ret = true;
> >       u8 *src;
> >
> >       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >       sg_init_table(&output, 1);
> >       sg_set_folio(&output, folio, PAGE_SIZE, 0);
> >       acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> > -     BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> > -     BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> > +     if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) ||
> > +                     acomp_ctx->req->dlen != PAGE_SIZE) {
> > +             ret = false;
> > +             zswap_reject_decompress_fail++;
> > +             pr_alert_ratelimited(
> > +                     "decompression failed on zswap entry with offset %08lx\n",
> > +                     entry->swpentry.val);
> > +     }
>
> This can probably be prettier if we save the return value of
> crypto_wait_req() in a variable, and then do the check at the end of the
> function:
>
> zswap_decompress()
> {
>         mutex_lock();
>         ...
>         ret = crypto_wait_req(..);
>         ...
>         mutex_unlock();
>         ...
>         if (ret || acomp_ctx->req->dlen != PAGE_SIZE) {

Hmmmm, do we need the mutex to protect acomp_ctx->req->dlen? No strong
opinions here from my end TBH.

>                 /* SHIT */
>                 return false;
>         }
>         return true;
> }
[...]
>
> We are doing load + erase here and in the writeback path now, so two
> xarray walks instead of one. How does this affect performance? We had a
> similar about the possiblity of doing a lockless xas_load() followed by
> a conditional xas_erase() for zswap_invalidate():
>
> https://lore.kernel.org/lkml/20241018192525.95862-1-ryncsn@gmail.com/
>
> Unfortunately it seems like we can't trivially do that unless we keep
> the tree locked, which we probably don't want to do throughout
> decompression.
>
> How crazy would it be to remove the entry from the tree and re-add it if
> compression fails? Does swapcache_prepare() provide sufficient
> protection for us to do this without anyone else looking at this entry
> (seems like it)?

My concern is, what happens if xa_store() in the re-add path fails
because we do not have enough memory?

>
> Anyway, this is all moot if the second walk is not noticeable from a
> perf perspective.

Nhat Pham Feb. 26, 2025, 11:20 p.m. UTC | #12

On Wed, Feb 26, 2025 at 7:33 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> On Tue, Feb 25, 2025 at 11:57:27PM -0500, Johannes Weiner wrote:
> > On Wed, Feb 26, 2025 at 03:12:35AM +0000, Yosry Ahmed wrote:
> > > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > > > Currently, we crash the kernel when a decompression failure occurs in
> > > > zswap (either because of memory corruption, or a bug in the compression
> > > > algorithm). This is overkill. We should only SIGBUS the unfortunate
> > > > process asking for the zswap entry on zswap load, and skip the corrupted
> > > > entry in zswap writeback.
> > >
> > > Some relevant observations/questions, but not really actionable for this
> > > patch, perhaps some future work, or more likely some incoherent
> > > illogical thoughts :
> > >
> > > (1) It seems like not making the folio uptodate will cause shmem faults
> > > to mark the swap entry as hwpoisoned, but I don't see similar handling
> > > for do_swap_page(). So it seems like even if we SIGBUS the process,
> > > other processes mapping the same page could follow in the same
> > > footsteps.
> >
> > It's analogous to what __end_swap_bio_read() does for block backends,
> > so it's hitchhiking on the standard swap protocol for read failures.
>
> Right, that's also how I got the idea when I did the same for large
> folios handling.

And your handling of the large folio (along with the comment in the
other thread) was how I got the idea for this patch :)

>
> >
> > The page sticks around if there are other users. It can get reclaimed,
> > but since it's not marked dirty, it won't get overwritten. Another
> > access will either find it in the swapcache and die on !uptodate; if
> > it was reclaimed, it will attempt another decompression. If all
> > references have been killed, zswap_invalidate() will finally drop it.
> >
> > Swapoff actually poisons the page table as well (unuse_pte).
>
> Right. My question was basically why don't we also poison the page table
> in do_swap_page() in this case. It's like that we never swapoff.

That would require a rmap walk right? To also poison the other PTEs
that point to the faulty (z)swap entry?

Or am I misunderstanding your point :)

>
> This will cause subsequent fault attempts to return VM_FAULT_HWPOISON
> quickly without doing through the swapcache or decompression. Probably
> not a big deal, but shmem does it so maybe it'd be nice to do it for
> consistency.
>
> >
> > > (2) A hwpoisoned swap entry results in VM_FAULT_SIGBUS in some cases
> > > (e.g. shmem_fault() -> shmem_get_folio_gfp() -> shmem_swapin_folio()),
> > > even though we have VM_FAULT_HWPOISON. This patch falls under this
> > > bucket, but unfortunately we cannot tell for sure if it's a hwpoision or
> > > a decompression bug.
> >
> > Are you sure? Actual memory failure should replace the ptes of a
> > mapped shmem page with TTU_HWPOISON, which turns them into special
> > swap entries that trigger VM_FAULT_HWPOISON in do_swap_page().
>
> I was looking at the shmem_fault() path. It seems like for this path we
> end up with VM_SIGBUS because shmem_swapin_folio() returns -EIO and not
> -EHWPOISON. This seems like something that can be easily fixed though,
> unless -EHWPOISON is not always correct for a diffrent reason.
>
> >
> > Anon swap distinguishes as long as the swapfile is there. Swapoff
> > installs poison markers, which are then handled the same in future
> > faults (VM_FAULT_HWPOISON):
> >
> > /*
> >  * "Poisoned" here is meant in the very general sense of "future accesses are
> >  * invalid", instead of referring very specifically to hardware memory errors.
> >  * This marker is meant to represent any of various different causes of this.
> >  *
> >  * Note that, when encountered by the faulting logic, PTEs with this marker will
> >  * result in VM_FAULT_HWPOISON and thus regardless trigger hardware memory error
> >  * logic.
>
> If that's the case, maybe it's better for zswap in the future if we stop
> relying on not marking the folio uptodate, and instead propagate an
> error through swap_read_folio() to the callers to make sure we always
> return VM_FAULT_HWPOISON and install poison markers.
>
> The handling is a bit quirky and inconsistent, but it ultimately results
> in VM_SIGBUS or VM_FAULT_HWPOISON which I guess is fine for now.

Yeah I think it's OK for now. FWIW it's consistent with the way we
treat swap IO error, as you pointed out :)

>
> >  */
> > #define  PTE_MARKER_POISONED                    BIT(1)

Nhat Pham Feb. 26, 2025, 11:29 p.m. UTC | #13

On Tue, Feb 25, 2025 at 7:12 PM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > Currently, we crash the kernel when a decompression failure occurs in
> > zswap (either because of memory corruption, or a bug in the compression
> > algorithm). This is overkill. We should only SIGBUS the unfortunate
> > process asking for the zswap entry on zswap load, and skip the corrupted
> > entry in zswap writeback.
>
> Some relevant observations/questions, but not really actionable for this
> patch, perhaps some future work, or more likely some incoherent
> illogical thoughts :
>
> (1) It seems like not making the folio uptodate will cause shmem faults
> to mark the swap entry as hwpoisoned, but I don't see similar handling
> for do_swap_page(). So it seems like even if we SIGBUS the process,
> other processes mapping the same page could follow in the same
> footsteps.

poisoned, I think? It's the weird SWP_PTE_MARKER thing.

[...]

>
>
> (3) If we run into a decompression failure, should we free the
> underlying memory from zsmalloc? I don't know. On one hand if we free it
> zsmalloc may start using it for more compressed objects. OTOH, I don't
> think proper hwpoison handling will kick in until the page is freed.
> Maybe we should tell zsmalloc to drop this page entirely and mark
> objects within it as invalid? Probably not worth the hassle but
> something to think about.

This might be a fun follow up :) I guess my question is - is there a
chance that we might recover in the future?

For example, can memory (hardware) failure somehow recover, or the
decompression algorithm somehow fix itself? I suppose not?

If that is the case, one thing we can do is just free the zsmalloc
slot, then mark the zswap entry as corrupted somehow. We can even
invalidate the zswap entry altogether, and install a (shared)
ZSWAP_CORRUPT_ENTRY. Future readers can check for this and exit if
they encounter a corrupted entry?

It's not common enough (lol hopefully) for me to optimize right away,
but I can get on with it if there are actual data of this happening
IRL/in product :)ion

Nhat Pham Feb. 26, 2025, 11:39 p.m. UTC | #14

On Tue, Feb 25, 2025 at 4:51 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > Currently, we crash the kernel when a decompression failure occurs in
> > zswap (either because of memory corruption, or a bug in the compression
> > algorithm). This is overkill. We should only SIGBUS the unfortunate
> > process asking for the zswap entry on zswap load, and skip the corrupted
> > entry in zswap writeback.
> >
> > See [1] for a recent upstream discussion about this.
> >
> > [1]: https://lore.kernel.org/all/ZsiLElTykamcYZ6J@casper.infradead.org/
> >
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Suggested-by: Yosry Ahmed <yosryahmed@google.com>
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> > ---
> >  mm/zswap.c | 85 +++++++++++++++++++++++++++++++++++++-----------------
> >  1 file changed, 58 insertions(+), 27 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index f6316b66fb23..31d4397eed61 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -62,6 +62,8 @@ static u64 zswap_reject_reclaim_fail;
> >  static u64 zswap_reject_compress_fail;
> >  /* Compressed page was too big for the allocator to (optimally) store */
> >  static u64 zswap_reject_compress_poor;
> > +/* Load and writeback failed due to decompression failure */
> > +static u64 zswap_reject_decompress_fail;
>
> "reject" refers to "rejected store", so the name doesn't quite make
> sense. zswap_decompress_fail?

SGTM :)

>
> >  /* Store failed because underlying allocator could not get memory */
> >  static u64 zswap_reject_alloc_fail;
> >  /* Store failed because the entry metadata could not be allocated (rare) */
> > @@ -953,11 +955,12 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> >       return comp_ret == 0 && alloc_ret == 0;
> >  }
> >
> > -static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> > +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >  {
> >       struct zpool *zpool = entry->pool->zpool;
> >       struct scatterlist input, output;
> >       struct crypto_acomp_ctx *acomp_ctx;
> > +     bool ret = true;
> >       u8 *src;
> >
> >       acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > @@ -984,12 +987,19 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> >       sg_init_table(&output, 1);
> >       sg_set_folio(&output, folio, PAGE_SIZE, 0);
> >       acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
> > -     BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
> > -     BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
> > +     if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait) ||
> > +                     acomp_ctx->req->dlen != PAGE_SIZE) {
> > +             ret = false;
> > +             zswap_reject_decompress_fail++;
> > +             pr_alert_ratelimited(
> > +                     "decompression failed on zswap entry with offset %08lx\n",
> > +                     entry->swpentry.val);
>
> Since this is a pretty dramatic failure scenario, IMO it would be
> useful to dump as much info as possible.
>
> The exact return value of crypto_wait_req() could be useful,
> entry->length and req->dlen too.
>
> entry->pool->tfm_name just to make absolute sure there is no
> confusion, as the compressor can be switched for new stores.
>
> Maybe swp_type() and swp_offset() of entry->swpentry? Those could be
> easy markers to see if the entry was corrupted for example.

I'll include them all :)

>
> > +     }
> >       mutex_unlock(&acomp_ctx->mutex);
> >
> >       if (src != acomp_ctx->buffer)
> >               zpool_unmap_handle(zpool, entry->handle);
> > +     return ret;
> >  }
> >
> >  /*********************************
> > @@ -1018,6 +1028,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >       struct writeback_control wbc = {
> >               .sync_mode = WB_SYNC_NONE,
> >       };
> > +     int ret = 0;
> >
> >       /* try to allocate swap cache folio */
> >       mpol = get_task_policy(current);
> > @@ -1034,8 +1045,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >        * and freed when invalidated by the concurrent shrinker anyway.
> >        */
> >       if (!folio_was_allocated) {
> > -             folio_put(folio);
> > -             return -EEXIST;
> > +             ret = -EEXIST;
> > +             goto put_folio;
> >       }
> >
> >       /*
> > @@ -1048,14 +1059,17 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >        * be dereferenced.
> >        */
> >       tree = swap_zswap_tree(swpentry);
> > -     if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) {
> > -             delete_from_swap_cache(folio);
> > -             folio_unlock(folio);
> > -             folio_put(folio);
> > -             return -ENOMEM;
> > +     if (entry != xa_load(tree, offset)) {
> > +             ret = -ENOMEM;
> > +             goto fail;
> >       }
> >
> > -     zswap_decompress(entry, folio);
> > +     if (!zswap_decompress(entry, folio)) {
> > +             ret = -EIO;
> > +             goto fail;
> > +     }
> > +
> > +     xa_erase(tree, offset);
> >
> >       count_vm_event(ZSWPWB);
> >       if (entry->objcg)
> > @@ -1071,9 +1085,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >
> >       /* start writeback */
> >       __swap_writepage(folio, &wbc);
> > -     folio_put(folio);
> > +     goto put_folio;
> >
> > -     return 0;
> > +fail:
> > +     delete_from_swap_cache(folio);
> > +     folio_unlock(folio);
> > +put_folio:
> > +     folio_put(folio);
> > +     return ret;
> >  }
>
> Nice, yeah it's time for factoring out the error unwinding. If you
> write it like this, you can save a jump in the main sequence:
>
>         __swap_writepage(folio, &wbc);
>         ret = 0;
> put:
>         folio_put(folio);
>         return ret;
> delete_unlock:
>         delete_from_swap_cache(folio);
>         folio_unlock(folio);
>         goto put;

SGTM :)

>
> >
> >  /*********************************
> > @@ -1600,6 +1619,29 @@ bool zswap_load(struct folio *folio)
> >       if (WARN_ON_ONCE(folio_test_large(folio)))
> >               return true;
> >
> > +     /*
> > +      * We cannot invalidate the zswap entry before decompressing it. If
> > +      * decompression fails, we must keep the entry in the tree so that
> > +      * a future read by another process on the same swap entry will also
> > +      * have to go through zswap. Otherwise, we risk silently reading
> > +      * corrupted data for the other process.
> > +      */
> > +     entry = xa_load(tree, offset);
> > +     if (!entry)
> > +             return false;
>
> The explanation in the comment makes sense in the context of this
> change. But fresh eyes reading this function and having never seen
> that this *used to* open with xa_erase() will be confused. It answers
> questions the reader doesn't have at this point - it's just a function
> called zswap_load() beginning with an xa_load(), so what?
>
> At first I was going to suggest moving it down to the swapcache
> branch. But honestly after reading *that* comment again, in the new
> sequence, I don't think the question will arise there either. It's
> pretty self-evident that the whole "we can invalidate when reading
> into the swapcache" does not apply if the read failed.
>
> > +     /*
> > +      * If decompression fails, we return true to notify the caller that the
> > +      * folio's data were in zswap, but do not mark the folio as up-to-date.
> > +      * This will effectively SIGBUS the calling process.
> > +      */
>
> It would be good to put a lampshade on this weirdness that the return
> value has nothing to do with success or not. This wasn't as important
> a distinction, but with actual decompression failures I think it is.
>
> Something like this?
>
>         if (!zswap_decompress(entry, folio)) {
>                 /*
>                  * The zswap_load() return value doesn't indicate success or
>                  * failure, but whether zswap owns the swapped out contents.
>                  * This MUST return true here, otherwise swap_readpage() will
>                  * read garbage from the backend.
>                  *
>                  * Success is signaled by marking the folio uptodate.
>                  */
>                 return true;
>         }

Per Yosry's comment, I'll just move this into zswap_load()'s function
documentation.

>
>         folio_mark_uptodate(folio);
>
> > +     if (!zswap_decompress(entry, folio))
> > +             return true;
> > +
> > +     count_vm_event(ZSWPIN);
> > +     if (entry->objcg)
> > +             count_objcg_events(entry->objcg, ZSWPIN, 1);
> > +
> >       /*
> >        * When reading into the swapcache, invalidate our entry. The
> >        * swapcache can be the authoritative owner of the page and
> > @@ -1612,21 +1654,8 @@ bool zswap_load(struct folio *folio)
> >        * files, which reads into a private page and may free it if
> >        * the fault fails. We remain the primary owner of the entry.)
> >        */
> > -     if (swapcache)
> > -             entry = xa_erase(tree, offset);
> > -     else
> > -             entry = xa_load(tree, offset);
> > -
> > -     if (!entry)
> > -             return false;
> > -
> > -     zswap_decompress(entry, folio);
> > -
> > -     count_vm_event(ZSWPIN);
> > -     if (entry->objcg)
> > -             count_objcg_events(entry->objcg, ZSWPIN, 1);
> > -
> >       if (swapcache) {
> > +             xa_erase(tree, offset);
> >               zswap_entry_free(entry);
> >               folio_mark_dirty(folio);
> >       }

Yosry Ahmed Feb. 26, 2025, 11:58 p.m. UTC | #15

On Wed, Feb 26, 2025 at 03:29:06PM -0800, Nhat Pham wrote:
> On Tue, Feb 25, 2025 at 7:12 PM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
> >
> > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > > Currently, we crash the kernel when a decompression failure occurs in
> > > zswap (either because of memory corruption, or a bug in the compression
> > > algorithm). This is overkill. We should only SIGBUS the unfortunate
> > > process asking for the zswap entry on zswap load, and skip the corrupted
> > > entry in zswap writeback.
> >
> > Some relevant observations/questions, but not really actionable for this
> > patch, perhaps some future work, or more likely some incoherent
> > illogical thoughts :
> >
> > (1) It seems like not making the folio uptodate will cause shmem faults
> > to mark the swap entry as hwpoisoned, but I don't see similar handling
> > for do_swap_page(). So it seems like even if we SIGBUS the process,
> > other processes mapping the same page could follow in the same
> > footsteps.
> 
> poisoned, I think? It's the weird SWP_PTE_MARKER thing.

Not sure what you mean here, I am referring to the inconsistency between
shmem_swapin_folio() and do_swap_page().

> 
> [...]
> 
> >
> >
> > (3) If we run into a decompression failure, should we free the
> > underlying memory from zsmalloc? I don't know. On one hand if we free it
> > zsmalloc may start using it for more compressed objects. OTOH, I don't
> > think proper hwpoison handling will kick in until the page is freed.
> > Maybe we should tell zsmalloc to drop this page entirely and mark
> > objects within it as invalid? Probably not worth the hassle but
> > something to think about.
> 
> This might be a fun follow up :) I guess my question is - is there a
> chance that we might recover in the future?
> 
> For example, can memory (hardware) failure somehow recover, or the
> decompression algorithm somehow fix itself? I suppose not?
> 
> If that is the case, one thing we can do is just free the zsmalloc
> slot, then mark the zswap entry as corrupted somehow. We can even
> invalidate the zswap entry altogether, and install a (shared)
> ZSWAP_CORRUPT_ENTRY. Future readers can check for this and exit if
> they encounter a corrupted entry?
> 
> It's not common enough (lol hopefully) for me to optimize right away,
> but I can get on with it if there are actual data of this happening
> IRL/in product :)ion

I am not aware of this being a common problem, but something to keep in
mind, perhaps.

Yosry Ahmed Feb. 27, 2025, midnight UTC | #16

On Wed, Feb 26, 2025 at 03:20:13PM -0800, Nhat Pham wrote:
> On Wed, Feb 26, 2025 at 7:33 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
> >
> > On Tue, Feb 25, 2025 at 11:57:27PM -0500, Johannes Weiner wrote:
> > > On Wed, Feb 26, 2025 at 03:12:35AM +0000, Yosry Ahmed wrote:
> > > > On Tue, Feb 25, 2025 at 01:32:00PM -0800, Nhat Pham wrote:
> > > > > Currently, we crash the kernel when a decompression failure occurs in
> > > > > zswap (either because of memory corruption, or a bug in the compression
> > > > > algorithm). This is overkill. We should only SIGBUS the unfortunate
> > > > > process asking for the zswap entry on zswap load, and skip the corrupted
> > > > > entry in zswap writeback.
> > > >
> > > > Some relevant observations/questions, but not really actionable for this
> > > > patch, perhaps some future work, or more likely some incoherent
> > > > illogical thoughts :
> > > >
> > > > (1) It seems like not making the folio uptodate will cause shmem faults
> > > > to mark the swap entry as hwpoisoned, but I don't see similar handling
> > > > for do_swap_page(). So it seems like even if we SIGBUS the process,
> > > > other processes mapping the same page could follow in the same
> > > > footsteps.
> > >
> > > It's analogous to what __end_swap_bio_read() does for block backends,
> > > so it's hitchhiking on the standard swap protocol for read failures.
> >
> > Right, that's also how I got the idea when I did the same for large
> > folios handling.
> 
> And your handling of the large folio (along with the comment in the
> other thread) was how I got the idea for this patch :)
> 
> >
> > >
> > > The page sticks around if there are other users. It can get reclaimed,
> > > but since it's not marked dirty, it won't get overwritten. Another
> > > access will either find it in the swapcache and die on !uptodate; if
> > > it was reclaimed, it will attempt another decompression. If all
> > > references have been killed, zswap_invalidate() will finally drop it.
> > >
> > > Swapoff actually poisons the page table as well (unuse_pte).
> >
> > Right. My question was basically why don't we also poison the page table
> > in do_swap_page() in this case. It's like that we never swapoff.
> 
> That would require a rmap walk right? To also poison the other PTEs
> that point to the faulty (z)swap entry?
> 
> Or am I misunderstanding your point :)

Oh I meant why not just mark the entry where the fault happened as
poisoned at least. Finding other PTEs that point to the swap entry is a
different story. I don't think we can even use the rmap here.

Yosry Ahmed Feb. 27, 2025, 12:03 a.m. UTC | #17

[..]
> > zswap_decompress()
> > {
> >         mutex_lock();
> >         ...
> >         ret = crypto_wait_req(..);
> >         ...
> >         mutex_unlock();
> >         ...
> >         if (ret || acomp_ctx->req->dlen != PAGE_SIZE) {
> 
> Hmmmm, do we need the mutex to protect acomp_ctx->req->dlen? No strong
> opinions here from my end TBH.

We can copy the length to a local variable if that's the case.

> 
> >                 /* SHIT */
> >                 return false;
> >         }
> >         return true;
> > }
> [...]
> >
> > We are doing load + erase here and in the writeback path now, so two
> > xarray walks instead of one. How does this affect performance? We had a
> > similar about the possiblity of doing a lockless xas_load() followed by
> > a conditional xas_erase() for zswap_invalidate():
> >
> > https://lore.kernel.org/lkml/20241018192525.95862-1-ryncsn@gmail.com/
> >
> > Unfortunately it seems like we can't trivially do that unless we keep
> > the tree locked, which we probably don't want to do throughout
> > decompression.
> >
> > How crazy would it be to remove the entry from the tree and re-add it if
> > compression fails? Does swapcache_prepare() provide sufficient
> > protection for us to do this without anyone else looking at this entry
> > (seems like it)?
> 
> My concern is, what happens if xa_store() in the re-add path fails
> because we do not have enough memory?

Hmm good point. xa_erase() is essentially xas_store(NULL), but I think
at some point if a node is made of entirely NULLs we'll free it, and it
would be theoritically possible that we hit this case.

Let's start by checking if the added xarray walk adds any noticable
overhead and go from there. Ideally we want to test with a large (and
sparse?) xarray to try and hit the worst case.

> 
> >
> > Anyway, this is all moot if the second walk is not noticeable from a
> > perf perspective.

zswap: do not crash the kernel on decompression failure

Commit Message

Comments

Patch