| Message ID | 20250226185625.2672936-1-yosry.ahmed@linux.dev (mailing list archive) |
|---|---|
| State | New |
| Series | [v2] mm: zswap: fix crypto_free_acomp() deadlock in zswap_cpu_comp_dead() |
On Wed, Feb 26, 2025 at 06:56:25PM +0000, Yosry Ahmed wrote:
> Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding
> the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock
> (through crypto_exit_scomp_ops_async()).
>
> On the other hand, crypto_alloc_acomp_node() holds the scomp_lock
> (through crypto_scomp_init_tfm()), and then allocates memory.
> If the allocation results in reclaim, we may attempt to hold the per-CPU
> acomp_ctx mutex.

The bug is in acomp. crypto_free_acomp() should never have to wait for a memory
allocation. That is what needs to be fixed.

But really the bounce buffering in acomp (which is what is causing this problem)
should not exist at all. There is really no practical use case for it; it's
just there because of the Crypto API's insistence on shoehorning everything into
scatterlists for no reason...

- Eric
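The dependency chain quoted above is a textbook ABBA inversion: one path takes scomp_lock and then, via reclaim, needs the per-CPU acomp_ctx mutex, while the other path takes the acomp_ctx mutex and then, via crypto_free_acomp(), needs scomp_lock. As an illustration only, here is a self-contained userspace sketch of that ordering; the pthread mutex names are stand-ins borrowed from the thread, and none of this is kernel code:

```c
/* Illustration of the ABBA ordering described above; not kernel code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t scomp_lock = PTHREAD_MUTEX_INITIALIZER;      /* lock A */
static pthread_mutex_t acomp_ctx_mutex = PTHREAD_MUTEX_INITIALIZER; /* lock B */

/* Stand-in for crypto_alloc_acomp_node(): takes A, then (via "reclaim") B. */
static void *alloc_path(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&scomp_lock);       /* crypto_scomp_init_tfm() */
        sleep(1);                              /* allocation enters reclaim... */
        pthread_mutex_lock(&acomp_ctx_mutex);  /* ...which wants the per-CPU mutex */
        pthread_mutex_unlock(&acomp_ctx_mutex);
        pthread_mutex_unlock(&scomp_lock);
        return NULL;
}

/* Stand-in for zswap_cpu_comp_dead(): takes B, then A. */
static void *cpu_dead_path(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&acomp_ctx_mutex);  /* per-CPU acomp_ctx mutex */
        sleep(1);
        /* crypto_free_acomp() -> crypto_exit_scomp_ops_async() */
        pthread_mutex_lock(&scomp_lock);
        pthread_mutex_unlock(&scomp_lock);
        pthread_mutex_unlock(&acomp_ctx_mutex);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, alloc_path, NULL);
        pthread_create(&b, NULL, cpu_dead_path, NULL);
        /*
         * With both sleeps in place, the joins below never return: each
         * thread holds one lock and waits forever for the other one.
         */
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        puts("no deadlock this run");
        return 0;
}
```

The patch under discussion keeps both locks but shortens the zswap-side critical section so that the scomp_lock acquisition only happens after acomp_ctx->mutex has been dropped.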
On Wed, Feb 26, 2025 at 08:00:16PM +0000, Eric Biggers wrote:
> On Wed, Feb 26, 2025 at 06:56:25PM +0000, Yosry Ahmed wrote:
> > Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding
> > the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock
> > (through crypto_exit_scomp_ops_async()).
> >
> > On the other hand, crypto_alloc_acomp_node() holds the scomp_lock
> > (through crypto_scomp_init_tfm()), and then allocates memory.
> > If the allocation results in reclaim, we may attempt to hold the per-CPU
> > acomp_ctx mutex.
>
> The bug is in acomp. crypto_free_acomp() should never have to wait for a memory
> allocation. That is what needs to be fixed.

crypto_free_acomp() does not explicitly wait for an allocation, but it
waits for scomp_lock (in crypto_exit_scomp_ops_async()), which may be
held while allocating memory from crypto_scomp_init_tfm().

Are you suggesting that crypto_exit_scomp_ops_async() should not be
holding scomp_lock?

> But really the bounce buffering in acomp (which is what is causing this problem)
> should not exist at all. There is really no practical use case for it; it's
> just there because of the Crypto API's insistence on shoehorning everything into
> scatterlists for no reason...

I am assuming this about scomp_scratch logic, which is what we need to
hold the scomp_lock for, resulting in this problem.

If this is something that can be done right away I am fine with dropping
this patch for an alternative fix, although it may be nice to reduce the
lock critical section in zswap_cpu_comp_dead() to the bare minimum
anyway.
On Wed, Feb 26, 2025 at 08:32:22PM +0000, Yosry Ahmed wrote:
> On Wed, Feb 26, 2025 at 08:00:16PM +0000, Eric Biggers wrote:
> > On Wed, Feb 26, 2025 at 06:56:25PM +0000, Yosry Ahmed wrote:
> > > Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding
> > > the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock
> > > (through crypto_exit_scomp_ops_async()).
> > >
> > > On the other hand, crypto_alloc_acomp_node() holds the scomp_lock
> > > (through crypto_scomp_init_tfm()), and then allocates memory.
> > > If the allocation results in reclaim, we may attempt to hold the per-CPU
> > > acomp_ctx mutex.
> >
> > The bug is in acomp. crypto_free_acomp() should never have to wait for a memory
> > allocation. That is what needs to be fixed.
>
> crypto_free_acomp() does not explicitly wait for an allocation, but it
> waits for scomp_lock (in crypto_exit_scomp_ops_async()), which may be
> held while allocating memory from crypto_scomp_init_tfm().
>
> Are you suggesting that crypto_exit_scomp_ops_async() should not be
> holding scomp_lock?

I think the solution while keeping the bounce buffer in place would be to do
what the patch
https://lore.kernel.org/linux-crypto/Z6w7Pz8jBeqhijut@gondor.apana.org.au/ does,
i.e. make the actual allocation and free happen outside the lock.

> > But really the bounce buffering in acomp (which is what is causing this problem)
> > should not exist at all. There is really no practical use case for it; it's
> > just there because of the Crypto API's insistence on shoehorning everything into
> > scatterlists for no reason...
>
> I am assuming this about scomp_scratch logic, which is what we need to
> hold the scomp_lock for, resulting in this problem.

Yes.

> If this is something that can be done right away I am fine with dropping
> this patch for an alternative fix, although it may be nice to reduce the
> lock critical section in zswap_cpu_comp_dead() to the bare minimum
> anyway.

Well, unfortunately the whole Crypto API philosophy of having a single interface
for software and for hardware offload doesn't really work. This is just yet
another example of that; it's a problem caused by shoehorning software
compression into an interface designed for hardware offload. zcomp really
should just use the compression libs directly (like most users of compression in
the kernel already do), and have an alternate code path specifically for
hardware offload (using acomp) for the few people who really want that.

- Eric
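To make the last paragraph concrete: "use the compression libs directly" means calling a library entry point on plain buffers, with no scatterlists, no shared scratch buffers, and no global lock in the path. A rough sketch of that style, using the in-kernel LZO helper purely as an example; the function below is hypothetical and is not a proposal for how zswap or zcomp should actually be restructured:

```c
/*
 * Hypothetical sketch only: direct use of a kernel compression library
 * (LZO here), as opposed to going through the crypto acomp interface.
 */
#include <linux/lzo.h>
#include <linux/slab.h>
#include <linux/types.h>

static int compress_buf_directly(const u8 *src, size_t src_len,
                                 u8 *dst, size_t *dst_len)
{
        /* Per-call workspace; a real user would keep one per CPU. */
        void *wrkmem = kmalloc(LZO1X_1_MEM_COMPRESS, GFP_KERNEL);
        int ret;

        if (!wrkmem)
                return -ENOMEM;

        /*
         * dst must be at least lzo1x_worst_compress(src_len) bytes.
         * No scatterlists, no bounce buffers, no scomp_lock involved.
         */
        ret = lzo1x_1_compress(src, src_len, dst, dst_len, wrkmem);

        kfree(wrkmem);
        return ret == LZO_E_OK ? 0 : -EINVAL;
}
```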
On Wed, Feb 26, 2025 at 09:16:28PM +0000, Eric Biggers wrote:
> On Wed, Feb 26, 2025 at 08:32:22PM +0000, Yosry Ahmed wrote:
> > On Wed, Feb 26, 2025 at 08:00:16PM +0000, Eric Biggers wrote:
> > > On Wed, Feb 26, 2025 at 06:56:25PM +0000, Yosry Ahmed wrote:
> > > > Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding
> > > > the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock
> > > > (through crypto_exit_scomp_ops_async()).
> > > >
> > > > On the other hand, crypto_alloc_acomp_node() holds the scomp_lock
> > > > (through crypto_scomp_init_tfm()), and then allocates memory.
> > > > If the allocation results in reclaim, we may attempt to hold the per-CPU
> > > > acomp_ctx mutex.
> > >
> > > The bug is in acomp. crypto_free_acomp() should never have to wait for a memory
> > > allocation. That is what needs to be fixed.
> >
> > crypto_free_acomp() does not explicitly wait for an allocation, but it
> > waits for scomp_lock (in crypto_exit_scomp_ops_async()), which may be
> > held while allocating memory from crypto_scomp_init_tfm().
> >
> > Are you suggesting that crypto_exit_scomp_ops_async() should not be
> > holding scomp_lock?
>
> I think the solution while keeping the bounce buffer in place would be to do
> what the patch
> https://lore.kernel.org/linux-crypto/Z6w7Pz8jBeqhijut@gondor.apana.org.au/ does,
> i.e. make the actual allocation and free happen outside the lock.

I am fine with a solution like that if Herbert is fine with it. Although
as I mentioned, I think this patch is nice to have anyway.

>
> > > But really the bounce buffering in acomp (which is what is causing this problem)
> > > should not exist at all. There is really no practical use case for it; it's
> > > just there because of the Crypto API's insistence on shoehorning everything into
> > > scatterlists for no reason...
> >
> > I am assuming this about scomp_scratch logic, which is what we need to
> > hold the scomp_lock for, resulting in this problem.
>
> Yes.
>
> > If this is something that can be done right away I am fine with dropping
> > this patch for an alternative fix, although it may be nice to reduce the
> > lock critical section in zswap_cpu_comp_dead() to the bare minimum
> > anyway.
>
> Well, unfortunately the whole Crypto API philosophy of having a single interface
> for software and for hardware offload doesn't really work. This is just yet
> another example of that; it's a problem caused by shoehorning software
> compression into an interface designed for hardware offload. zcomp really
> should just use the compression libs directly (like most users of compression in
> the kernel already do), and have an alternate code path specifically for
> hardware offload (using acomp) for the few people who really want that.

zcomp is for zram, zswap does not use it. If zswap is not going to use
the crypto API we'll want something like zcomp or maybe reuse zcomp
itself. That's a problem for another day :)
On Wed, Feb 26, 2025 at 1:23 PM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> On Wed, Feb 26, 2025 at 09:16:28PM +0000, Eric Biggers wrote:
> > On Wed, Feb 26, 2025 at 08:32:22PM +0000, Yosry Ahmed wrote:
> > > On Wed, Feb 26, 2025 at 08:00:16PM +0000, Eric Biggers wrote:
> > > > On Wed, Feb 26, 2025 at 06:56:25PM +0000, Yosry Ahmed wrote:
> > > > > Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding
> > > > > the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock
> > > > > (through crypto_exit_scomp_ops_async()).
> > > > >
> > > > > On the other hand, crypto_alloc_acomp_node() holds the scomp_lock
> > > > > (through crypto_scomp_init_tfm()), and then allocates memory.
> > > > > If the allocation results in reclaim, we may attempt to hold the per-CPU
> > > > > acomp_ctx mutex.
> > > >
> > > > The bug is in acomp. crypto_free_acomp() should never have to wait for a memory
> > > > allocation. That is what needs to be fixed.
> > >
> > > crypto_free_acomp() does not explicitly wait for an allocation, but it
> > > waits for scomp_lock (in crypto_exit_scomp_ops_async()), which may be
> > > held while allocating memory from crypto_scomp_init_tfm().
> > >
> > > Are you suggesting that crypto_exit_scomp_ops_async() should not be
> > > holding scomp_lock?
> >
> > I think the solution while keeping the bounce buffer in place would be to do
> > what the patch
> > https://lore.kernel.org/linux-crypto/Z6w7Pz8jBeqhijut@gondor.apana.org.au/ does,
> > i.e. make the actual allocation and free happen outside the lock.
>
> I am fine with a solution like that if Herbert is fine with it. Although
> as I mentioned, I think this patch is nice to have anyway.
> >
> > > > But really the bounce buffering in acomp (which is what is causing this problem)
> > > > should not exist at all. There is really no practical use case for it; it's
> > > > just there because of the Crypto API's insistence on shoehorning everything into
> > > > scatterlists for no reason...
> > >
> > > I am assuming this about scomp_scratch logic, which is what we need to
> > > hold the scomp_lock for, resulting in this problem.
> >
> > Yes.
> >
> > > If this is something that can be done right away I am fine with dropping
> > > this patch for an alternative fix, although it may be nice to reduce the
> > > lock critical section in zswap_cpu_comp_dead() to the bare minimum
> > > anyway.
> >
> > Well, unfortunately the whole Crypto API philosophy of having a single interface
> > for software and for hardware offload doesn't really work. This is just yet
> > another example of that; it's a problem caused by shoehorning software
> > compression into an interface designed for hardware offload. zcomp really
> > should just use the compression libs directly (like most users of compression in
> > the kernel already do), and have an alternate code path specifically for
> > hardware offload (using acomp) for the few people who really want that.
>
> zcomp is for zram, zswap does not use it. If zswap is not going to use
> the crypto API we'll want something like zcomp or maybe reuse zcomp
> itself. That's a problem for another day :)

I'm actually thinking whether we should expose the zcomp API and use
it for zswap. There are a couple of parameters for zstd I wanna play
with, which zcomp/zram seems to already support, but not the crypto
API (zstd level, dictionary, etc.).

But yes, a different problem for another day :)
On 2025/2/27 02:56, Yosry Ahmed wrote:
> Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding
> the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock
> (through crypto_exit_scomp_ops_async()).
>
> On the other hand, crypto_alloc_acomp_node() holds the scomp_lock
> (through crypto_scomp_init_tfm()), and then allocates memory.
> If the allocation results in reclaim, we may attempt to hold the per-CPU
> acomp_ctx mutex.
>
> The above dependencies can cause an ABBA deadlock. For example in the
> following scenario:
>
> (1) Task A running on CPU #1:
>     crypto_alloc_acomp_node()
>       Holds scomp_lock
>       Enters reclaim
>       Reads per_cpu_ptr(pool->acomp_ctx, 1)
>
> (2) Task A is descheduled
>
> (3) CPU #1 goes offline
>     zswap_cpu_comp_dead(CPU #1)
>       Holds per_cpu_ptr(pool->acomp_ctx, 1)
>       Calls crypto_free_acomp()
>       Waits for scomp_lock
>
> (4) Task A running on CPU #2:
>       Waits for per_cpu_ptr(pool->acomp_ctx, 1) // Read on CPU #1
>       DEADLOCK
>
> Since there is no requirement to call crypto_free_acomp() with the
> per-CPU acomp_ctx mutex held in zswap_cpu_comp_dead(), move it after the
> mutex is unlocked. Also move the acomp_request_free() and kfree() calls
> for consistency and to avoid any potential subtle locking dependencies
> in the future.
>
> With this, only setting acomp_ctx fields to NULL occurs with the mutex
> held. This is similar to how zswap_cpu_comp_prepare() only initializes
> acomp_ctx fields with the mutex held, after performing all allocations
> before holding the mutex.
>
> Opportunistically, move the NULL check on acomp_ctx so that it takes
> place before the mutex dereference.
>
> Fixes: 12dcb0ef5406 ("mm: zswap: properly synchronize freeing resources during CPU hotunplug")
> Reported-by: syzbot+1a517ccfcbc6a7ab0f82@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/67bcea51.050a0220.bbfd1.0096.GAE@google.com/
> Cc: <stable@vger.kernel.org>
> Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Looks good to me:

Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>

Thanks!

> ---
>
> v1 -> v2:
> - Explained the problem more clearly in the commit message.
> - Moved all freeing calls outside the lock critical section.
>
> v1: https://lore.kernel.org/all/Z72FJnbA39zWh4zS@gondor.apana.org.au/
>
> ---
>  mm/zswap.c | 30 ++++++++++++++++++++++--------
>  1 file changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index ac9d299e7d0c1..adf745c66aa1d 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -881,18 +881,32 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
>  {
>  	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
>  	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
> +	struct acomp_req *req;
> +	struct crypto_acomp *acomp;
> +	u8 *buffer;
> +
> +	if (IS_ERR_OR_NULL(acomp_ctx))
> +		return 0;
>
>  	mutex_lock(&acomp_ctx->mutex);
> -	if (!IS_ERR_OR_NULL(acomp_ctx)) {
> -		if (!IS_ERR_OR_NULL(acomp_ctx->req))
> -			acomp_request_free(acomp_ctx->req);
> -		acomp_ctx->req = NULL;
> -		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
> -			crypto_free_acomp(acomp_ctx->acomp);
> -		kfree(acomp_ctx->buffer);
> -	}
> +	req = acomp_ctx->req;
> +	acomp = acomp_ctx->acomp;
> +	buffer = acomp_ctx->buffer;
> +	acomp_ctx->req = NULL;
> +	acomp_ctx->acomp = NULL;
> +	acomp_ctx->buffer = NULL;
>  	mutex_unlock(&acomp_ctx->mutex);
>
> +	/*
> +	 * Do the actual freeing after releasing the mutex to avoid subtle
> +	 * locking dependencies causing deadlocks.
> +	 */
> +	if (!IS_ERR_OR_NULL(req))
> +		acomp_request_free(req);
> +	if (!IS_ERR_OR_NULL(acomp))
> +		crypto_free_acomp(acomp);
> +	kfree(buffer);
> +
>  	return 0;
>  }
>
On 2025/2/27 07:47, Nhat Pham wrote:
> On Wed, Feb 26, 2025 at 1:23 PM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>>
>> On Wed, Feb 26, 2025 at 09:16:28PM +0000, Eric Biggers wrote:
>>> On Wed, Feb 26, 2025 at 08:32:22PM +0000, Yosry Ahmed wrote:
>>>> On Wed, Feb 26, 2025 at 08:00:16PM +0000, Eric Biggers wrote:
>>>>> On Wed, Feb 26, 2025 at 06:56:25PM +0000, Yosry Ahmed wrote:
>>>>>> Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding
>>>>>> the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock
>>>>>> (through crypto_exit_scomp_ops_async()).
>>>>>>
>>>>>> On the other hand, crypto_alloc_acomp_node() holds the scomp_lock
>>>>>> (through crypto_scomp_init_tfm()), and then allocates memory.
>>>>>> If the allocation results in reclaim, we may attempt to hold the per-CPU
>>>>>> acomp_ctx mutex.
>>>>>
>>>>> The bug is in acomp. crypto_free_acomp() should never have to wait for a memory
>>>>> allocation. That is what needs to be fixed.
>>>>
>>>> crypto_free_acomp() does not explicitly wait for an allocation, but it
>>>> waits for scomp_lock (in crypto_exit_scomp_ops_async()), which may be
>>>> held while allocating memory from crypto_scomp_init_tfm().
>>>>
>>>> Are you suggesting that crypto_exit_scomp_ops_async() should not be
>>>> holding scomp_lock?
>>>
>>> I think the solution while keeping the bounce buffer in place would be to do
>>> what the patch
>>> https://lore.kernel.org/linux-crypto/Z6w7Pz8jBeqhijut@gondor.apana.org.au/ does,
>>> i.e. make the actual allocation and free happen outside the lock.
>>
>> I am fine with a solution like that if Herbert is fine with it. Although
>> as I mentioned, I think this patch is nice to have anyway.
>>
>>>
>>>>> But really the bounce buffering in acomp (which is what is causing this problem)
>>>>> should not exist at all. There is really no practical use case for it; it's
>>>>> just there because of the Crypto API's insistence on shoehorning everything into
>>>>> scatterlists for no reason...
>>>>
>>>> I am assuming this about scomp_scratch logic, which is what we need to
>>>> hold the scomp_lock for, resulting in this problem.
>>>
>>> Yes.
>>>
>>>> If this is something that can be done right away I am fine with dropping
>>>> this patch for an alternative fix, although it may be nice to reduce the
>>>> lock critical section in zswap_cpu_comp_dead() to the bare minimum
>>>> anyway.
>>>
>>> Well, unfortunately the whole Crypto API philosophy of having a single interface
>>> for software and for hardware offload doesn't really work. This is just yet
>>> another example of that; it's a problem caused by shoehorning software
>>> compression into an interface designed for hardware offload. zcomp really
>>> should just use the compression libs directly (like most users of compression in
>>> the kernel already do), and have an alternate code path specifically for
>>> hardware offload (using acomp) for the few people who really want that.
>>
>> zcomp is for zram, zswap does not use it. If zswap is not going to use
>> the crypto API we'll want something like zcomp or maybe reuse zcomp
>> itself. That's a problem for another day :)
>
> I'm actually thinking whether we should expose the zcomp API and use
> it for zswap. There are a couple of parameters for zstd I wanna play
> with, which zcomp/zram seems to already support, but not the crypto
> API (zstd level, dictionary, etc.).

Ah, agree! Actually I also think we should use the zcomp API in zswap,
if its API meets our requirements.

>
> But yes, a different problem for another day :)
```diff
diff --git a/mm/zswap.c b/mm/zswap.c
index ac9d299e7d0c1..adf745c66aa1d 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -881,18 +881,32 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 {
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
+	struct acomp_req *req;
+	struct crypto_acomp *acomp;
+	u8 *buffer;
+
+	if (IS_ERR_OR_NULL(acomp_ctx))
+		return 0;
 
 	mutex_lock(&acomp_ctx->mutex);
-	if (!IS_ERR_OR_NULL(acomp_ctx)) {
-		if (!IS_ERR_OR_NULL(acomp_ctx->req))
-			acomp_request_free(acomp_ctx->req);
-		acomp_ctx->req = NULL;
-		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
-			crypto_free_acomp(acomp_ctx->acomp);
-		kfree(acomp_ctx->buffer);
-	}
+	req = acomp_ctx->req;
+	acomp = acomp_ctx->acomp;
+	buffer = acomp_ctx->buffer;
+	acomp_ctx->req = NULL;
+	acomp_ctx->acomp = NULL;
+	acomp_ctx->buffer = NULL;
 	mutex_unlock(&acomp_ctx->mutex);
 
+	/*
+	 * Do the actual freeing after releasing the mutex to avoid subtle
+	 * locking dependencies causing deadlocks.
+	 */
+	if (!IS_ERR_OR_NULL(req))
+		acomp_request_free(req);
+	if (!IS_ERR_OR_NULL(acomp))
+		crypto_free_acomp(acomp);
+	kfree(buffer);
+
 	return 0;
 }
```
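For readability, this is roughly how the function reads with the patch applied (reconstructed from the hunk above rather than copied from a tree): the per-CPU fields are snapshotted and cleared while holding acomp_ctx->mutex, and all freeing, including the crypto_free_acomp() call that takes scomp_lock, happens only after the mutex is released.

```c
static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
{
	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
	struct acomp_req *req;
	struct crypto_acomp *acomp;
	u8 *buffer;

	if (IS_ERR_OR_NULL(acomp_ctx))
		return 0;

	/* Only detach the resources while holding the per-CPU mutex. */
	mutex_lock(&acomp_ctx->mutex);
	req = acomp_ctx->req;
	acomp = acomp_ctx->acomp;
	buffer = acomp_ctx->buffer;
	acomp_ctx->req = NULL;
	acomp_ctx->acomp = NULL;
	acomp_ctx->buffer = NULL;
	mutex_unlock(&acomp_ctx->mutex);

	/*
	 * Do the actual freeing after releasing the mutex to avoid subtle
	 * locking dependencies causing deadlocks.
	 */
	if (!IS_ERR_OR_NULL(req))
		acomp_request_free(req);
	if (!IS_ERR_OR_NULL(acomp))
		crypto_free_acomp(acomp);
	kfree(buffer);

	return 0;
}
```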