diff mbox series

[v2,1/2] block: prevent freeing a zone write plug too early

Message ID 20240420075811.1276893-2-dlemoal@kernel.org (mailing list archive)
State New, archived
Headers show
Series Fixes for zone write plugging | expand

Commit Message

Damien Le Moal April 20, 2024, 7:58 a.m. UTC
The submission of plugged BIOs is done using a work struct executing the
function blk_zone_wplug_bio_work(). This function gets and submits a
plugged zone write BIO and is guaranteed to operate on a valid zone
write plug (with a reference count higher than 0) on entry as plugged
BIOs hold a reference on their zone write plugs. However, once a BIO is
submitted with submit_bio_noacct_nocheck(), the BIO may complete before
blk_zone_wplug_bio_work(), with the BIO completion trigering a release
and freeing of the zone write plug if the BIO is the last write to a
zone (making the zone FULL). This potentially can result in the zone
write plug being freed while the work is still active.

Avoid this by calling flush_work() from disk_free_zone_wplug_rcu().

Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/blk-zoned.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Christoph Hellwig April 22, 2024, 6:23 a.m. UTC | #1
On Sat, Apr 20, 2024 at 04:58:10PM +0900, Damien Le Moal wrote:
> Avoid this by calling flush_work() from disk_free_zone_wplug_rcu().

Calling flush_work from a rcu callback is just asking for nasty
deadlocks.

What prevents you from just holding an extra zwplug reference while
blk_zone_wplug_bio_work is running?
Damien Le Moal April 23, 2024, 6:12 a.m. UTC | #2
On 2024/04/22 16:23, Christoph Hellwig wrote:
> On Sat, Apr 20, 2024 at 04:58:10PM +0900, Damien Le Moal wrote:
>> Avoid this by calling flush_work() from disk_free_zone_wplug_rcu().
> 
> Calling flush_work from a rcu callback is just asking for nasty
> deadlocks.
> 
> What prevents you from just holding an extra zwplug reference while
> blk_zone_wplug_bio_work is running?

Problem is that this extra reference needs to be released in
blk_zone_wplug_bio_work(), before that function returns, and that is still the
work thread context using zwplug->bio_work. So we always have a small window
between the ref drop and the zone BIO work thread completing (context switch).
If we get a BIO completion in that window and free the plug, then the BIO work
struct may go away while the work thread is still referencing it.

Given that freeing of plugs will happen only after the RCU grace periods
elapses, I think this is all very unlikely to happen, but at the same time, I do
not see any guarantee that this cannot happen...
Jens Axboe April 23, 2024, 2:21 p.m. UTC | #3
On 4/20/24 1:58 AM, Damien Le Moal wrote:
> The submission of plugged BIOs is done using a work struct executing the
> function blk_zone_wplug_bio_work(). This function gets and submits a
> plugged zone write BIO and is guaranteed to operate on a valid zone
> write plug (with a reference count higher than 0) on entry as plugged
> BIOs hold a reference on their zone write plugs. However, once a BIO is
> submitted with submit_bio_noacct_nocheck(), the BIO may complete before
> blk_zone_wplug_bio_work(), with the BIO completion trigering a release
> and freeing of the zone write plug if the BIO is the last write to a
> zone (making the zone FULL). This potentially can result in the zone
> write plug being freed while the work is still active.
> 
> Avoid this by calling flush_work() from disk_free_zone_wplug_rcu().
> 
> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>  block/blk-zoned.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 3befebe6b319..685f0b9159fd 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -526,6 +526,8 @@ static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head)
>  	struct blk_zone_wplug *zwplug =
>  		container_of(rcu_head, struct blk_zone_wplug, rcu_head);
>  
> +	flush_work(&zwplug->bio_work);
> +
>  	mempool_free(zwplug, zwplug->disk->zone_wplugs_pool);
>  }

This is totally backwards. First of all, if you actually had work that
needed flushing at this point, the kernel would bomb spectacularly.
Secondly, what's the point of using RCU to protect this, if you're now
needing to flush work from the RCU callback? That's a clear sign that
something is very wrong here with your references / RCU usage.. The work
item should hold a reference to it, trying to paper around it like this
is not going to work at all.

Why is the work item racing with RCU freeing?!
Damien Le Moal April 23, 2024, 3:16 p.m. UTC | #4
On 2024/04/24 0:21, Jens Axboe wrote:
> On 4/20/24 1:58 AM, Damien Le Moal wrote:
>> The submission of plugged BIOs is done using a work struct executing the
>> function blk_zone_wplug_bio_work(). This function gets and submits a
>> plugged zone write BIO and is guaranteed to operate on a valid zone
>> write plug (with a reference count higher than 0) on entry as plugged
>> BIOs hold a reference on their zone write plugs. However, once a BIO is
>> submitted with submit_bio_noacct_nocheck(), the BIO may complete before
>> blk_zone_wplug_bio_work(), with the BIO completion trigering a release
>> and freeing of the zone write plug if the BIO is the last write to a
>> zone (making the zone FULL). This potentially can result in the zone
>> write plug being freed while the work is still active.
>>
>> Avoid this by calling flush_work() from disk_free_zone_wplug_rcu().
>>
>> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>> ---
>>  block/blk-zoned.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index 3befebe6b319..685f0b9159fd 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -526,6 +526,8 @@ static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head)
>>  	struct blk_zone_wplug *zwplug =
>>  		container_of(rcu_head, struct blk_zone_wplug, rcu_head);
>>  
>> +	flush_work(&zwplug->bio_work);
>> +
>>  	mempool_free(zwplug, zwplug->disk->zone_wplugs_pool);
>>  }
> 
> This is totally backwards. First of all, if you actually had work that
> needed flushing at this point, the kernel would bomb spectacularly.
> Secondly, what's the point of using RCU to protect this, if you're now
> needing to flush work from the RCU callback? That's a clear sign that
> something is very wrong here with your references / RCU usage.. The work
> item should hold a reference to it, trying to paper around it like this
> is not going to work at all.
> 
> Why is the work item racing with RCU freeing?!

The work item is a field of the zone write plug. Zone write plugs have
references to them as long as BIOs are in flight and and the zone is not full.
The zone write plug freeing through rcu is triggered by the last write to a zone
that makes the zone full. But the completion of this last write BIO may happen
right after the work issued the BIO with submit_bio_noacct_nocheck() and before
blk_zone_wplug_bio_work() returns, while the work item is still active.

The actual freeing of the plug happens only after the rcu grace period, and I
was not entirely sure if this is enough to guarantee that the work thread is
finished. But checking how the workqueue code processes the work item by calling
the work function (blk_zone_wplug_bio_work() in this case), there is no issue
because the work item (struct work_struct) is not touched once the work function
is called. So there are no issues/races with freeing the zone write plug. I was
overthinking this. My bad. We can drop this patch. Apologies for the noise.
Jens Axboe April 23, 2024, 3:36 p.m. UTC | #5
On 4/23/24 9:16 AM, Damien Le Moal wrote:
> On 2024/04/24 0:21, Jens Axboe wrote:
>> On 4/20/24 1:58 AM, Damien Le Moal wrote:
>>> The submission of plugged BIOs is done using a work struct executing the
>>> function blk_zone_wplug_bio_work(). This function gets and submits a
>>> plugged zone write BIO and is guaranteed to operate on a valid zone
>>> write plug (with a reference count higher than 0) on entry as plugged
>>> BIOs hold a reference on their zone write plugs. However, once a BIO is
>>> submitted with submit_bio_noacct_nocheck(), the BIO may complete before
>>> blk_zone_wplug_bio_work(), with the BIO completion trigering a release
>>> and freeing of the zone write plug if the BIO is the last write to a
>>> zone (making the zone FULL). This potentially can result in the zone
>>> write plug being freed while the work is still active.
>>>
>>> Avoid this by calling flush_work() from disk_free_zone_wplug_rcu().
>>>
>>> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>>> ---
>>>  block/blk-zoned.c | 2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>> index 3befebe6b319..685f0b9159fd 100644
>>> --- a/block/blk-zoned.c
>>> +++ b/block/blk-zoned.c
>>> @@ -526,6 +526,8 @@ static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head)
>>>  	struct blk_zone_wplug *zwplug =
>>>  		container_of(rcu_head, struct blk_zone_wplug, rcu_head);
>>>  
>>> +	flush_work(&zwplug->bio_work);
>>> +
>>>  	mempool_free(zwplug, zwplug->disk->zone_wplugs_pool);
>>>  }
>>
>> This is totally backwards. First of all, if you actually had work that
>> needed flushing at this point, the kernel would bomb spectacularly.
>> Secondly, what's the point of using RCU to protect this, if you're now
>> needing to flush work from the RCU callback? That's a clear sign that
>> something is very wrong here with your references / RCU usage.. The work
>> item should hold a reference to it, trying to paper around it like this
>> is not going to work at all.
>>
>> Why is the work item racing with RCU freeing?!
> 
> The work item is a field of the zone write plug. Zone write plugs have
> references to them as long as BIOs are in flight and and the zone is
> not full. The zone write plug freeing through rcu is triggered by the
> last write to a zone that makes the zone full. But the completion of
> this last write BIO may happen right after the work issued the BIO
> with submit_bio_noacct_nocheck() and before blk_zone_wplug_bio_work()
> returns, while the work item is still active.
> 
> The actual freeing of the plug happens only after the rcu grace
> period, and I was not entirely sure if this is enough to guarantee
> that the work thread is finished. But checking how the workqueue code
> processes the work item by calling the work function
> (blk_zone_wplug_bio_work() in this case), there is no issue because
> the work item (struct work_struct) is not touched once the work
> function is called. So there are no issues/races with freeing the zone
> write plug. I was overthinking this. My bad. We can drop this patch.
> Apologies for the noise.

I took a closer look at the zone write plug reference handling, and it
still doesn't look very good. Why are some just atomic_dec and there's
just one spot that does dec_and_test? This again looks like janky
referencing, to be honest.

The relationship seems like it should be pretty clear. Any bio inflight
against this zone plug should have a reference to it, AND the owner
should have a reference to it, otherwise any bio completion (which can
happen at ANY time) could free it. Any dropping of the ref should use a
helper that does atomic_dec_and_test(), eg what disk_put_zone_wplug()
does.

There should be no doubt about the above at all. If the plug has been
added to a workqueue, it should be quite obvious that of course it has a
reference to it already, outside of the bio's that are in it.

I'd strongly encourage getting this sorted out before the merge window,
I'm not at all convinced it's correct as-is. It's certainly not
obviously correct, which it should be. The RCU rules are pretty simple
if the the references are done in the kernel idiomatic way, but they are
not.
Damien Le Moal April 23, 2024, 6:19 p.m. UTC | #6
On 2024/04/24 1:36, Jens Axboe wrote:
> On 4/23/24 9:16 AM, Damien Le Moal wrote:
>> On 2024/04/24 0:21, Jens Axboe wrote:
>>> On 4/20/24 1:58 AM, Damien Le Moal wrote:
>>>> The submission of plugged BIOs is done using a work struct executing the
>>>> function blk_zone_wplug_bio_work(). This function gets and submits a
>>>> plugged zone write BIO and is guaranteed to operate on a valid zone
>>>> write plug (with a reference count higher than 0) on entry as plugged
>>>> BIOs hold a reference on their zone write plugs. However, once a BIO is
>>>> submitted with submit_bio_noacct_nocheck(), the BIO may complete before
>>>> blk_zone_wplug_bio_work(), with the BIO completion trigering a release
>>>> and freeing of the zone write plug if the BIO is the last write to a
>>>> zone (making the zone FULL). This potentially can result in the zone
>>>> write plug being freed while the work is still active.
>>>>
>>>> Avoid this by calling flush_work() from disk_free_zone_wplug_rcu().
>>>>
>>>> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
>>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>>>> ---
>>>>  block/blk-zoned.c | 2 ++
>>>>  1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>>> index 3befebe6b319..685f0b9159fd 100644
>>>> --- a/block/blk-zoned.c
>>>> +++ b/block/blk-zoned.c
>>>> @@ -526,6 +526,8 @@ static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head)
>>>>  	struct blk_zone_wplug *zwplug =
>>>>  		container_of(rcu_head, struct blk_zone_wplug, rcu_head);
>>>>  
>>>> +	flush_work(&zwplug->bio_work);
>>>> +
>>>>  	mempool_free(zwplug, zwplug->disk->zone_wplugs_pool);
>>>>  }
>>>
>>> This is totally backwards. First of all, if you actually had work that
>>> needed flushing at this point, the kernel would bomb spectacularly.
>>> Secondly, what's the point of using RCU to protect this, if you're now
>>> needing to flush work from the RCU callback? That's a clear sign that
>>> something is very wrong here with your references / RCU usage.. The work
>>> item should hold a reference to it, trying to paper around it like this
>>> is not going to work at all.
>>>
>>> Why is the work item racing with RCU freeing?!
>>
>> The work item is a field of the zone write plug. Zone write plugs have
>> references to them as long as BIOs are in flight and and the zone is
>> not full. The zone write plug freeing through rcu is triggered by the
>> last write to a zone that makes the zone full. But the completion of
>> this last write BIO may happen right after the work issued the BIO
>> with submit_bio_noacct_nocheck() and before blk_zone_wplug_bio_work()
>> returns, while the work item is still active.
>>
>> The actual freeing of the plug happens only after the rcu grace
>> period, and I was not entirely sure if this is enough to guarantee
>> that the work thread is finished. But checking how the workqueue code
>> processes the work item by calling the work function
>> (blk_zone_wplug_bio_work() in this case), there is no issue because
>> the work item (struct work_struct) is not touched once the work
>> function is called. So there are no issues/races with freeing the zone
>> write plug. I was overthinking this. My bad. We can drop this patch.
>> Apologies for the noise.
> 
> I took a closer look at the zone write plug reference handling, and it
> still doesn't look very good. Why are some just atomic_dec and there's
> just one spot that does dec_and_test? This again looks like janky
> referencing, to be honest.
> 
> The relationship seems like it should be pretty clear. Any bio inflight
> against this zone plug should have a reference to it, AND the owner
> should have a reference to it, otherwise any bio completion (which can
> happen at ANY time) could free it. Any dropping of the ref should use a
> helper that does atomic_dec_and_test(), eg what disk_put_zone_wplug()
> does.
> 
> There should be no doubt about the above at all. If the plug has been
> added to a workqueue, it should be quite obvious that of course it has a
> reference to it already, outside of the bio's that are in it.
> 
> I'd strongly encourage getting this sorted out before the merge window,
> I'm not at all convinced it's correct as-is. It's certainly not
> obviously correct, which it should be. The RCU rules are pretty simple
> if the the references are done in the kernel idiomatic way, but they are
> not.

OK. I am traveling this week so I will not be able to send a well-tested cleanup
patch but I will do so first thing next week.
Jens Axboe April 24, 2024, 1:58 p.m. UTC | #7
On 4/23/24 12:19 PM, Damien Le Moal wrote:
> On 2024/04/24 1:36, Jens Axboe wrote:
>> On 4/23/24 9:16 AM, Damien Le Moal wrote:
>>> On 2024/04/24 0:21, Jens Axboe wrote:
>>>> On 4/20/24 1:58 AM, Damien Le Moal wrote:
>>>>> The submission of plugged BIOs is done using a work struct executing the
>>>>> function blk_zone_wplug_bio_work(). This function gets and submits a
>>>>> plugged zone write BIO and is guaranteed to operate on a valid zone
>>>>> write plug (with a reference count higher than 0) on entry as plugged
>>>>> BIOs hold a reference on their zone write plugs. However, once a BIO is
>>>>> submitted with submit_bio_noacct_nocheck(), the BIO may complete before
>>>>> blk_zone_wplug_bio_work(), with the BIO completion trigering a release
>>>>> and freeing of the zone write plug if the BIO is the last write to a
>>>>> zone (making the zone FULL). This potentially can result in the zone
>>>>> write plug being freed while the work is still active.
>>>>>
>>>>> Avoid this by calling flush_work() from disk_free_zone_wplug_rcu().
>>>>>
>>>>> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
>>>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>>>>> ---
>>>>>  block/blk-zoned.c | 2 ++
>>>>>  1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>>>> index 3befebe6b319..685f0b9159fd 100644
>>>>> --- a/block/blk-zoned.c
>>>>> +++ b/block/blk-zoned.c
>>>>> @@ -526,6 +526,8 @@ static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head)
>>>>>  	struct blk_zone_wplug *zwplug =
>>>>>  		container_of(rcu_head, struct blk_zone_wplug, rcu_head);
>>>>>  
>>>>> +	flush_work(&zwplug->bio_work);
>>>>> +
>>>>>  	mempool_free(zwplug, zwplug->disk->zone_wplugs_pool);
>>>>>  }
>>>>
>>>> This is totally backwards. First of all, if you actually had work that
>>>> needed flushing at this point, the kernel would bomb spectacularly.
>>>> Secondly, what's the point of using RCU to protect this, if you're now
>>>> needing to flush work from the RCU callback? That's a clear sign that
>>>> something is very wrong here with your references / RCU usage.. The work
>>>> item should hold a reference to it, trying to paper around it like this
>>>> is not going to work at all.
>>>>
>>>> Why is the work item racing with RCU freeing?!
>>>
>>> The work item is a field of the zone write plug. Zone write plugs have
>>> references to them as long as BIOs are in flight and and the zone is
>>> not full. The zone write plug freeing through rcu is triggered by the
>>> last write to a zone that makes the zone full. But the completion of
>>> this last write BIO may happen right after the work issued the BIO
>>> with submit_bio_noacct_nocheck() and before blk_zone_wplug_bio_work()
>>> returns, while the work item is still active.
>>>
>>> The actual freeing of the plug happens only after the rcu grace
>>> period, and I was not entirely sure if this is enough to guarantee
>>> that the work thread is finished. But checking how the workqueue code
>>> processes the work item by calling the work function
>>> (blk_zone_wplug_bio_work() in this case), there is no issue because
>>> the work item (struct work_struct) is not touched once the work
>>> function is called. So there are no issues/races with freeing the zone
>>> write plug. I was overthinking this. My bad. We can drop this patch.
>>> Apologies for the noise.
>>
>> I took a closer look at the zone write plug reference handling, and it
>> still doesn't look very good. Why are some just atomic_dec and there's
>> just one spot that does dec_and_test? This again looks like janky
>> referencing, to be honest.
>>
>> The relationship seems like it should be pretty clear. Any bio inflight
>> against this zone plug should have a reference to it, AND the owner
>> should have a reference to it, otherwise any bio completion (which can
>> happen at ANY time) could free it. Any dropping of the ref should use a
>> helper that does atomic_dec_and_test(), eg what disk_put_zone_wplug()
>> does.
>>
>> There should be no doubt about the above at all. If the plug has been
>> added to a workqueue, it should be quite obvious that of course it has a
>> reference to it already, outside of the bio's that are in it.
>>
>> I'd strongly encourage getting this sorted out before the merge window,
>> I'm not at all convinced it's correct as-is. It's certainly not
>> obviously correct, which it should be. The RCU rules are pretty simple
>> if the the references are done in the kernel idiomatic way, but they are
>> not.
> 
> OK. I am traveling this week so I will not be able to send a
> well-tested cleanup patch but I will do so first thing next week.

Sounds good, thanks.
diff mbox series

Patch

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 3befebe6b319..685f0b9159fd 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -526,6 +526,8 @@  static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head)
 	struct blk_zone_wplug *zwplug =
 		container_of(rcu_head, struct blk_zone_wplug, rcu_head);
 
+	flush_work(&zwplug->bio_work);
+
 	mempool_free(zwplug, zwplug->disk->zone_wplugs_pool);
 }