
[for-8.0] ide: Fix manual in-flight count for TRIM BH

Message ID 20230309114430.33684-1-hreitz@redhat.com (mailing list archive)
State New, archived
Series [for-8.0] ide: Fix manual in-flight count for TRIM BH

Commit Message

Hanna Czenczek March 9, 2023, 11:44 a.m. UTC
Commit 7e5cdb345f77d76cb4877fe6230c4e17a7d0d0ca ("ide: Increment BB
in-flight counter for TRIM BH") fixed a problem when cancelling trims:
ide_cancel_dma_sync() drains the BB to stop a potentially ongoing
request, and then asserts that no request is active anymore.  When
trying to cancel a trim, this would only drain a single
blk_aio_pdiscard() operation, but not necessarily the trim as a whole.
Said commit fixed that by counting the whole trim as an operation,
incrementing the in-flight counter for it, so the blk_drain() in
ide_cancel_dma_sync() would wait for it.

Fiona found a problem with this, specifically that draining a BB while
an IDE trim operation is going on can produce a deadlock: An ongoing
blk_aio_pdiscard() will be settled, but any further discard in the same
trim will go into the BB's queued_request list and wait there until the
drain stops.  Meanwhile, the whole trim keeps the BB's in_flight_counter
incremented, so the BB will never be considered drained, and qemu hangs.
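
In call-trace form, the hang looks roughly like this (simplified):

  ide_cancel_dma_sync()
    blk_drain(s->blk)            /* polls until blk->in_flight reaches 0 */
      -> the in-flight blk_aio_pdiscard() settles
      -> its callback ide_issue_trim_cb() submits the next discard
           -> that blk_aio_pdiscard() is parked in queued_requests
              (the BB is drained, so it waits for drained_end)
    /* the in-flight reference taken for the whole trim is still held,
     * so blk->in_flight never reaches 0 -> deadlock */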

Therefore, we cannot keep the in-flight counter incremented throughout
the trim operation.  We should still increment it before the completion
BH (ide_trim_bh_cb()) is scheduled and decrement it there, so that the
blk_drain() in ide_cancel_dma_sync() can await completion of this
scheduled BH.  With that change, however, it can take multiple
iterations of blk_drain() to really settle the whole trim operation, and
so ide_cancel_dma_sync() must run it in a loop.

Reported-by: Fiona Ebner <f.ebner@proxmox.com>
Fixes: 7e5cdb345f77d76cb4877fe6230c4e17a7d0d0ca
       ("ide: Increment BB in-flight counter for TRIM BH")
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
Tested with a small test boot sector that issues a trim operation with
any number of discards from 0 to 64.  Before this patch, doing so hangs
starting from 2 discards; and before 7e5cdb345f, doing so crashes qemu
with any number of discards.  With this patch, it always works.
---
 hw/ide/core.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

Comments

Paolo Bonzini March 9, 2023, 12:05 p.m. UTC | #1
On 3/9/23 12:44, Hanna Czenczek wrote:
> +     *
> +     * Note that TRIM operations call blk_aio_pdiscard() multiple
> +     * times (and finally increment s->blk's in-flight counter while
> +     * ide_trim_bh_cb() is scheduled), so we have to loop blk_drain()
> +     * until the whole operation is done.
>        */
>       if (s->bus->dma->aiocb) {
>           trace_ide_cancel_dma_sync_remaining();
> -        blk_drain(s->blk);
> -        assert(s->bus->dma->aiocb == NULL);
> +        while (s->bus->dma->aiocb) {
> +            blk_drain(s->blk);
> +        }

I think having to do this is problematic, because the blk_drain should 
leave no pending operation.

Here it seems okay because you do it in a controlled situation, but the 
main thread can also issue blk_drain(), or worse bdrv_drained_begin(), 
and there would be pending I/O operations when it returns.

Unfortunately I don't have a solution (I'm not considering the idea of 
using disable_request_queuing and having even more atomics magic in 
block/block-backend.c), but I'll read the thread.

Paolo
Paolo Bonzini March 9, 2023, 12:08 p.m. UTC | #2
On Thu, Mar 9, 2023 at 1:05 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> I think having to do this is problematic, because the blk_drain should
> leave no pending operation.
>
> Here it seems okay because you do it in a controlled situation, but the
> main thread can also issue blk_drain(), or worse bdrv_drained_begin(),
> and there would be pending I/O operations when it returns.
>
> Unfortunately I don't have a solution (I'm not considering the idea of
> using disable_request_queuing and having even more atomics magic in
> block/block-backend.c), but I'll read the thread.

Hmm, what about making blk_aio_prwv non-static and calling
bdrv_co_pdiscard directly from IDE?

Paolo
Hanna Czenczek March 9, 2023, 12:31 p.m. UTC | #3
On 09.03.23 13:08, Paolo Bonzini wrote:
> On Thu, Mar 9, 2023 at 1:05 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>> I think having to do this is problematic, because the blk_drain should
>> leave no pending operation.
>>
>> Here it seems okay because you do it in a controlled situation, but the
>> main thread can also issue blk_drain(), or worse bdrv_drained_begin(),
>> and there would be pending I/O operations when it returns.

Not really.  We would stop in the middle of a trim that processes a list 
of discard requests.  So I see it more like stopping in the middle of 
anything that processes guest requests.  Once drain ends, we continue 
processing them, and that’s not exactly pending I/O.

There is a pending object in s->bus->dma->aiocb on the IDE side, so 
there is a pending DMA operation, but naïvely, I don’t see that as a 
problem.

>> Unfortunately I don't have a solution (I'm not considering the idea of
>> using disable_request_queuing and having even more atomics magic in
>> block/block-backend.c), but I'll read the thread.

I wouldn’t disable request queuing, because the commit message that 
introduced it (cf3129323f900ef5ddbccbe86e4fa801e88c566e) specifically says 
it fixes IDE.  I suppose the reason might actually be this problem here, in 
that before request queuing, it was possible that IDE would continue 
issuing discard requests even while drained, because processing the list 
didn’t stop.  Maybe.

Or the issue is generally that IDE uses dma_* functions, which might 
cause I/O functions to be run from new BHs (I guess through 
reschedule_dma()?).

> Hmm, what about making blk_aio_prwv non-static and calling
> bdrv_co_pdiscard directly from IDE?

You mean transforming ide_issue_trim_cb() into an iterative coroutine 
(instead of being recursive and using AIO) and invoking it via 
blk_aio_prwv()?

It doesn’t feel right to call a bdrv_* function directly from a user 
external to the core block layer, so in this case I’d rather fall back 
to Fiona’s idea of invoking all discards concurrently.

Hanna
Paolo Bonzini March 9, 2023, 1:59 p.m. UTC | #4
On 3/9/23 13:31, Hanna Czenczek wrote:
> On 09.03.23 13:08, Paolo Bonzini wrote:
>> On Thu, Mar 9, 2023 at 1:05 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> I think having to do this is problematic, because the blk_drain should
>>> leave no pending operation.
>>>
>>> Here it seems okay because you do it in a controlled situation, but the
>>> main thread can also issue blk_drain(), or worse bdrv_drained_begin(),
>>> and there would be pending I/O operations when it returns.
> 
> Not really.  We would stop in the middle of a trim that processes a list 
> of discard requests.  So I see it more like stopping in the middle of 
> anything that processes guest requests.  Once drain ends, we continue 
> processing them, and that’s not exactly pending I/O.
> 
> There is a pending object in s->bus->dma->aiocb on the IDE side, so 
> there is a pending DMA operation, but naïvely, I don’t see that as a 
> problem.

What about the bdrv_drain_all() when a VM stops?  Would the guest 
continue to access memory and disks after bdrv_drain() returns?

Migration could also be a problem, because the partial TRIM would not be 
recorded in the s->bus->error_status field of IDEState (no surprise 
there, it's not an error).  Also, errors happening after bdrv_drain() 
might not be migrated correctly.

> Or the issue is generally that IDE uses dma_* functions, which might 
> cause I/O functions to be run from new BHs (I guess through 
> reschedule_dma()?).

Ah, you mean that you can have pending I/O operations while 
blk->in_flight is zero?  That would be a problem indeed.  We already 
have BlockDevOps for ide-cd and ide-hd, should we add a .drained_poll 
callback there?

>> Hmm, what about making blk_aio_prwv non-static and calling
>> bdrv_co_pdiscard directly from IDE?
> 
> You mean transforming ide_issue_trim_cb() into an iterative coroutine 
> (instead of being recursive and using AIO) and invoking it via 
> blk_aio_prwv()?
> 
> It doesn’t feel right to call a bdrv_* function directly from a user 
> external to the core block layer, so in this case I’d rather fall back 
> to Fiona’s idea of invoking all discards concurrently.

Yeah, honestly it doesn't feel very much right to me either.

Paolo
Kevin Wolf March 9, 2023, 5:46 p.m. UTC | #5
On 09.03.2023 at 14:59, Paolo Bonzini wrote:
> On 3/9/23 13:31, Hanna Czenczek wrote:
> > On 09.03.23 13:08, Paolo Bonzini wrote:
> > > On Thu, Mar 9, 2023 at 1:05 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > > > I think having to do this is problematic, because the blk_drain should
> > > > leave no pending operation.
> > > > 
> > > > Here it seems okay because you do it in a controlled situation, but the
> > > > main thread can also issue blk_drain(), or worse bdrv_drained_begin(),
> > > > and there would be pending I/O operations when it returns.
> > 
> > Not really.  We would stop in the middle of a trim that processes a list
> > of discard requests.  So I see it more like stopping in the middle of
> > anything that processes guest requests.  Once drain ends, we continue
> > processing them, and that’s not exactly pending I/O.
> > 
> > There is a pending object in s->bus->dma->aiocb on the IDE side, so
> > there is a pending DMA operation, but naïvely, I don’t see that as a
> > problem.
> 
> What about the bdrv_drain_all() when a VM stops?  Would the guest continue to
> access memory and disks after bdrv_drain() returns?

That one shouldn't be a problem because the devices are stopped before
the backends.

> Migration could also be a problem, because the partial TRIM would not be
> recorded in the s->bus->error_status field of IDEState (no surprise there,
> it's not an error).  Also, errors happening after bdrv_drain() might not be
> migrated correctly.

Yes, I think it would be good to have the I/O operation fully completed
on the IDE level rather than just in the block layer.

> > Or the issue is generally that IDE uses dma_* functions, which might
> > cause I/O functions to be run from new BHs (I guess through
> > reschedule_dma()?).

None of those complicated scenarios actually. The problem solved by the
request queuing is simply that nothing else stops the guest from
submitting new requests to drained nodes if the CPUs are still running.

Drain uses aio_disable_external() to disable processing of external
events, in particular the ioeventfd used by virtio-blk and virtio-scsi.
But IDE submits requests through MMIO or port I/O, i.e. the vcpu thread
exits to userspace and calls directly into the IDE code, so it's
completely unaffected by aio_disable_external().

> Ah, you mean that you can have pending I/O operations while blk->in_flight
> is zero?  That would be a problem indeed.  We already have BlockDevOps for
> ide-cd and ide-hd, should we add a .drained_poll callback there?

To be more precise, you suggested in the call that .drained_poll should
return that IDE is still busy while aiocb != NULL. Without having looked
at the code in detail yet, that seems to make sense to me. And maybe
even the blk_inc/dec_in_flight() pair can then go completely away
because IDE takes care of its drain state itself then.
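
Something like this, perhaps (untested sketch; it assumes BlockDevOps
grows a .drained_poll callback that returns true while the device is
still busy, and that the callback can get at the IDEState):

    /* hw/ide/core.c, hooked into the existing ide-hd/ide-cd dev_ops */
    static bool ide_drained_poll(void *opaque)
    {
        IDEState *s = opaque;

        /*
         * Keep the drain polling for as long as IDE still has a DMA/TRIM
         * operation pending, not just until the block layer is idle.
         */
        return s->bus->dma->aiocb != NULL;
    }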

Kevin
Fiona Ebner March 10, 2023, 1:05 p.m. UTC | #6
On 09.03.23 at 18:46, Kevin Wolf wrote:
> On 09.03.2023 at 14:59, Paolo Bonzini wrote:
>> On 3/9/23 13:31, Hanna Czenczek wrote:
>>> On 09.03.23 13:08, Paolo Bonzini wrote:
>>>> On Thu, Mar 9, 2023 at 1:05 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>>> I think having to do this is problematic, because the blk_drain should
>>>>> leave no pending operation.
>>>>>
>>>>> Here it seems okay because you do it in a controlled situation, but the
>>>>> main thread can also issue blk_drain(), or worse bdrv_drained_begin(),
>>>>> and there would be pending I/O operations when it returns.
>>>
>>> Not really.  We would stop in the middle of a trim that processes a list
>>> of discard requests.  So I see it more like stopping in the middle of
>>> anything that processes guest requests.  Once drain ends, we continue
>>> processing them, and that’s not exactly pending I/O.
>>>
>>> There is a pending object in s->bus->dma->aiocb on the IDE side, so
>>> there is a pending DMA operation, but naïvely, I don’t see that as a
>>> problem.
>>
>> What about the bdrv_drain_all() when a VM stops?  Would the guest continue to
>> access memory and disks after bdrv_drain() returns?
> 
> That one shouldn't be a problem because the devices are stopped before
> the backends.
> 
>> Migration could also be a problem, because the partial TRIM would not be
>> recorded in the s->bus->error_status field of IDEState (no surprise there,
>> it's not an error).  Also, errors happening after bdrv_drain() might not be
>> migrated correctly.
> 
> Yes, I think it would be good to have the I/O operation fully completed
> on the IDE level rather than just in the block layer.
> 
>>> Or the issue is generally that IDE uses dma_* functions, which might
>>> cause I/O functions to be run from new BHs (I guess through
>>> reschedule_dma()?).
> 
> None of those complicated scenarios actually. The problem solved by the
> request queuing is simply that nothing else stops the guest from
> submitting new requests to drained nodes if the CPUs are still running.
> 
> Drain uses aio_disable_external() to disable processing of external
> events, in particular the ioeventfd used by virtio-blk and virtio-scsi.
> But IDE submits requests through MMIO or port I/O, i.e. the vcpu thread
> exits to userspace and calls directly into the IDE code, so it's
> completely unaffected by aio_disable_external().
> 
>> Ah, you mean that you can have pending I/O operations while blk->in_flight
>> is zero?  That would be a problem indeed.  We already have BlockDevOps for
>> ide-cd and ide-hd, should we add a .drained_poll callback there?
> 
> To be more precise, you suggested in the call that .drained_poll should
> return that IDE is still busy while aiocb != NULL. Without having looked
> at the code in detail yet, that seems to make sense to me. And maybe
> even the blk_inc/dec_in_flight() pair can then go completely away
> because IDE takes care of its drain state itself then.
> 

I assume the addition of drained_poll is meant to be orthogonal to the
fix of the deadlock? At least I can't see how it would help there?

If we have the assumptions:
1. The TRIM operation should be completed on the IDE level before
draining ends.
2. Block layer requests issued after draining has begun are queued.

To me, the conclusion seems to be:
Issue all block layer requests belonging to the IDE TRIM operation up front.

The other alternative I see is to break assumption 2, introduce a way to
not queue certain requests while drained, and use it for the recursive
requests issued by ide_issue_trim_cb. But not the initial one, if that
would defeat the purpose of request queuing. Of course this can't be
done if QEMU relies on the assumption in other places already.
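
For the first option, a very rough and untested sketch of what issuing
everything up front could look like; 'pending' would be a new TrimAIOCB
field, the DSM payload is assumed to still be reachable via iocb->qiov as
in the current recursive code, and range validation, accounting and the
in-flight bookkeeping around the completion BH are elided:

    static void ide_trim_range_cb(void *opaque, int ret)
    {
        TrimAIOCB *iocb = opaque;

        if (ret < 0 && iocb->ret >= 0) {
            iocb->ret = ret;                    /* keep the first error */
        }
        if (--iocb->pending == 0 && iocb->bh) {
            replay_bh_schedule_event(iocb->bh); /* whole TRIM is done */
        }
    }

    /* in ide_issue_trim(), replacing the recursive ide_issue_trim_cb() */
    iocb->pending = 1;                  /* reference held while submitting */
    for (int j = 0; j < iocb->qiov->niov; j++) {
        const uint64_t *buf = iocb->qiov->iov[j].iov_base;

        for (size_t i = 0; i < iocb->qiov->iov[j].iov_len / 8; i++) {
            uint64_t entry = le64_to_cpu(buf[i]);
            uint64_t sector = entry & 0x0000ffffffffffffULL;
            uint16_t count = entry >> 48;

            if (count == 0) {
                continue;
            }
            iocb->pending++;
            blk_aio_pdiscard(s->blk, sector << BDRV_SECTOR_BITS,
                             count << BDRV_SECTOR_BITS,
                             ide_trim_range_cb, iocb);
        }
    }
    ide_trim_range_cb(iocb, 0);         /* drop the submission reference */

That way no new blk_aio_pdiscard() is started once a drain has begun;
everything the drain has to wait for is already in flight.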

Best Regards,
Fiona
Kevin Wolf March 10, 2023, 2:25 p.m. UTC | #7
On 10.03.2023 at 14:05, Fiona Ebner wrote:
> On 09.03.23 at 18:46, Kevin Wolf wrote:
> > On 09.03.2023 at 14:59, Paolo Bonzini wrote:
> >> On 3/9/23 13:31, Hanna Czenczek wrote:
> >>> On 09.03.23 13:08, Paolo Bonzini wrote:
> >>>> On Thu, Mar 9, 2023 at 1:05 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>>>> I think having to do this is problematic, because the blk_drain should
> >>>>> leave no pending operation.
> >>>>>
> >>>>> Here it seems okay because you do it in a controlled situation, but the
> >>>>> main thread can also issue blk_drain(), or worse bdrv_drained_begin(),
> >>>>> and there would be pending I/O operations when it returns.
> >>>
> >>> Not really.  We would stop in the middle of a trim that processes a list
> >>> of discard requests.  So I see it more like stopping in the middle of
> >>> anything that processes guest requests.  Once drain ends, we continue
> >>> processing them, and that’s not exactly pending I/O.
> >>>
> >>> There is a pending object in s->bus->dma->aiocb on the IDE side, so
> >>> there is a pending DMA operation, but naïvely, I don’t see that as a
> >>> problem.
> >>
> >> What about the bdrv_drain_all() when a VM stops?  Would the guest continue to
> >> access memory and disks after bdrv_drain() returns?
> > 
> > That one shouldn't be a problem because the devices are stopped before
> > the backends.
> > 
> >> Migration could also be a problem, because the partial TRIM would not be
> >> recorded in the s->bus->error_status field of IDEState (no surprise there,
> >> it's not an error).  Also, errors happening after bdrv_drain() might not be
> >> migrated correctly.
> > 
> > Yes, I think it would be good to have the I/O operation fully completed
> > on the IDE level rather than just in the block layer.
> > 
> >>> Or the issue is generally that IDE uses dma_* functions, which might
> >>> cause I/O functions to be run from new BHs (I guess through
> >>> reschedule_dma()?).
> > 
> > None of those complicated scenarios actually. The problem solved by the
> > request queuing is simply that nothing else stops the guest from
> > submitting new requests to drained nodes if the CPUs are still running.
> > 
> > Drain uses aio_disable_external() to disable processing of external
> > events, in particular the ioeventfd used by virtio-blk and virtio-scsi.
> > But IDE submits requests through MMIO or port I/O, i.e. the vcpu thread
> > exits to userspace and calls directly into the IDE code, so it's
> > completely unaffected by aio_disable_external().
> > 
> >> Ah, you mean that you can have pending I/O operations while blk->in_flight
> >> is zero?  That would be a problem indeed.  We already have BlockDevOps for
> >> ide-cd and ide-hd, should we add a .drained_poll callback there?
> > 
> > To be more precise, you suggested in the call that .drained_poll should
> > return that IDE is still busy while aiocb != NULL. Without having looked
> > at the code in detail yet, that seems to make sense to me. And maybe
> > even the blk_inc/dec_in_flight() pair can then go completely away
> > because IDE takes care of its drain state itself then.
> > 
> 
> I assume the addition of drained_poll is meant to be orthogonal to the
> fix of the deadlock? At least I can't see how it would help there?

You're right, it doesn't.

Basically we're running into the old problem again that draining is
overloaded with two different meanings: I want exclusive access to the
backend or I want to wait for all my requests to complete. IDE or more
generally blk_drain() wants the latter, but queuing is done for
implementing the former.

> If we have the assumptions:
> 1. The TRIM operation should be completed on the IDE level before
> draining ends.
> 2. Block layer requests issued after draining has begun are queued.
> 
> To me, the conclusion seems to be:
> Issue all block layer requests belonging to the IDE TRIM operation up
> front.
> 
> The other alternative I see is to break assumption 2, introduce a way
> to not queue certain requests while drained, and use it for the
> recursive requests issued by ide_issue_trim_cb. But not the initial
> one, if that would defeat the purpose of request queuing. Of course
> this can't be done if QEMU relies on the assumption in other places
> already.

I feel like this should be allowed because if anyone has exclusive
access in this scenario, it's IDE, so it should be able to bypass the
queuing. Of course, the queuing is still needed if someone else drained
the backend, so we can't just make TRIM bypass it in general. And if you
make it conditional on IDE being in blk_drain(), it already starts to
become ugly again...

So maybe the while loop is unavoidable.

Hmm... But could ide_cancel_dma_sync() just directly use
AIO_WAIT_WHILE(s->bus->dma->aiocb) instead of using blk_drain()?
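
I.e., something like this (untested; the usual AIO_WAIT_WHILE() context
and locking requirements are left aside here):

    if (s->bus->dma->aiocb) {
        trace_ide_cancel_dma_sync_remaining();
        AIO_WAIT_WHILE(blk_get_aio_context(s->blk),
                       s->bus->dma->aiocb != NULL);
    }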

Kevin
Paolo Bonzini March 10, 2023, 3:13 p.m. UTC | #8
On Fri, Mar 10, 2023 at 3:25 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > 1. The TRIM operation should be completed on the IDE level before
> > draining ends.
> > 2. Block layer requests issued after draining has begun are queued.
> >
> > To me, the conclusion seems to be:
> > Issue all block layer requests belonging to the IDE TRIM operation up
> > front.
> >
> > The other alternative I see is to break assumption 2, introduce a way
> > to not queue certain requests while drained, and use it for the
> > recursive requests issued by ide_issue_trim_cb. But not the initial
> > one, if that would defeat the purpose of request queuing. Of course
> > this can't be done if QEMU relies on the assumption in other places
> > already.
>
> I feel like this should be allowed because if anyone has exclusive
> access in this scenario, it's IDE, so it should be able to bypass the
> queuing. Of course, the queuing is still needed if someone else drained
> the backend, so we can't just make TRIM bypass it in general. And if you
> make it conditional on IDE being in blk_drain(), it already starts to
> become ugly again...
>
> So maybe the while loop is unavoidable.
>
> Hmm... But could ide_cancel_dma_sync() just directly use
> AIO_WAIT_WHILE(s->bus->dma->aiocb) instead of using blk_drain()?

While that should work, it would not fix other uses of
bdrv_drain_all(), for example in softmmu/cpus.c. Stopping the device
model relies on those to run *until the device model has finished
submitting requests*.

So I still think that this bug is a symptom of a problem in the design
of request queuing.

In fact, shouldn't request queuing be enabled at the _end_ of
bdrv_drained_begin (once the BlockBackend has reached a quiescent
state on its own terms), rather than at the beginning (which leads to
deadlocks like this one)?

blk->quiesce_counter becomes just a nesting counter for
drained_begin/end, with no uses outside, and blk_wait_while_drained
uses a new counter. Then you have something like this in
blk_root_drained_poll:

    if (blk->dev_ops && blk->dev_ops->drained_poll) {
        busy = blk->dev_ops->drained_poll(blk->dev_opaque);
    }
    busy |= !!blk->in_flight;
    if (!busy) {
       qatomic_set(&blk->queue_requests, true);
    }
    return busy;

And there's no need to touch IDE at all.

Paolo
Fiona Ebner March 13, 2023, 12:29 p.m. UTC | #9
On 10.03.23 at 16:13, Paolo Bonzini wrote:
> On Fri, Mar 10, 2023 at 3:25 PM Kevin Wolf <kwolf@redhat.com> wrote:
>>> 1. The TRIM operation should be completed on the IDE level before
>>> draining ends.
>>> 2. Block layer requests issued after draining has begun are queued.
>>>
>>> To me, the conclusion seems to be:
>>> Issue all block layer requests belonging to the IDE TRIM operation up
>>> front.
>>>
>>> The other alternative I see is to break assumption 2, introduce a way
>>> to not queue certain requests while drained, and use it for the
>>> recursive requests issued by ide_issue_trim_cb. But not the initial
>>> one, if that would defeat the purpose of request queuing. Of course
>>> this can't be done if QEMU relies on the assumption in other places
>>> already.
>>
>> I feel like this should be allowed because if anyone has exclusive
>> access in this scenario, it's IDE, so it should be able to bypass the
>> queuing. Of course, the queuing is still needed if someone else drained
>> the backend, so we can't just make TRIM bypass it in general. And if you
>> make it conditional on IDE being in blk_drain(), it already starts to
>> become ugly again...
>>
>> So maybe the while loop is unavoidable.
>>
>> Hmm... But could ide_cancel_dma_sync() just directly use
>> AIO_WAIT_WHILE(s->bus->dma->aiocb) instead of using blk_drain()?
> 
> While that should work, it would not fix other uses of
> bdrv_drain_all(), for example in softmmu/cpus.c. Stopping the device
> model relies on those to run *until the device model has finished
> submitting requests*.
> 
> So I still think that this bug is a symptom of a problem in the design
> of request queuing.
> 
> In fact, shouldn't request queuing be enabled at the _end_ of
> bdrv_drained_begin (once the BlockBackend has reached a quiescent
> state on its own terms), rather than at the beginning (which leads to
> deadlocks like this one)?
> 
> blk->quiesce_counter becomes just a nesting counter for
> drained_begin/end, with no uses outside, and blk_wait_while_drained
> uses a new counter. Then you have something like this in
> blk_root_drained_poll:
> 
>     if (blk->dev_ops && blk->dev_ops->drained_poll) {
>         busy = blk->dev_ops->drained_poll(blk->dev_opaque);
>     }
>     busy |= !!blk->in_flight;
>     if (!busy) {
>        qatomic_set(&blk->queue_requests, true);
>     }
>     return busy;
> 
> And there's no need to touch IDE at all.
> 
Couldn't this lead to scenarios where a busy or malicious guest, which
continues to submit new requests, slows down draining or even prevents
it from finishing?

Best Regards,
Fiona
Paolo Bonzini March 13, 2023, 1:09 p.m. UTC | #10
On 3/13/23 13:29, Fiona Ebner wrote:
>> In fact, shouldn't request queuing be enabled at the _end_ of
>> bdrv_drained_begin (once the BlockBackend has reached a quiescent
>> state on its own terms), rather than at the beginning (which leads to
>> deadlocks like this one)?
>
> Couldn't this lead to scenarios where a busy or malicious guest, which
> continues to submit new requests, slows down draining or even prevents
> it from finishing?

Possibly, but there is also a .drained_begin/.drained_end callback that 
can be defined in order to apply backpressure.  (For some other devices, 
there's also aio_disable_external/aio_enable_external that do the 
equivalent of request queuing but without the deadlocks.)

Since starting the queuing of requests at the end of bdrv_drained_begin 
wouldn't hurt correctness, and it would fix this kind of deadlock, I 
think it would be worth giving it a try.

Paolo
Kevin Wolf March 13, 2023, 4:32 p.m. UTC | #11
On 10.03.2023 at 16:13, Paolo Bonzini wrote:
> On Fri, Mar 10, 2023 at 3:25 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > > 1. The TRIM operation should be completed on the IDE level before
> > > draining ends.
> > > 2. Block layer requests issued after draining has begun are queued.
> > >
> > > To me, the conclusion seems to be:
> > > Issue all block layer requests belonging to the IDE TRIM operation up
> > > front.
> > >
> > > The other alternative I see is to break assumption 2, introduce a way
> > > to not queue certain requests while drained, and use it for the
> > > recursive requests issued by ide_issue_trim_cb. But not the initial
> > > one, if that would defeat the purpose of request queuing. Of course
> > > this can't be done if QEMU relies on the assumption in other places
> > > already.
> >
> > I feel like this should be allowed because if anyone has exclusive
> > access in this scenario, it's IDE, so it should be able to bypass the
> > queuing. Of course, the queuing is still needed if someone else drained
> > the backend, so we can't just make TRIM bypass it in general. And if you
> > make it conditional on IDE being in blk_drain(), it already starts to
> > become ugly again...
> >
> > So maybe the while loop is unavoidable.
> >
> > Hmm... But could ide_cancel_dma_sync() just directly use
> > AIO_WAIT_WHILE(s->bus->dma->aiocb) instead of using blk_drain()?
> 
> While that should work, it would not fix other uses of
> bdrv_drain_all(), for example in softmmu/cpus.c. Stopping the device
> model relies on those to run *until the device model has finished
> submitting requests*.

If so, do_vm_stop() really expects drain to do something it isn't
designed to do. It's only for quiescing backends, not for any other
activity a qdev device might still be doing. I think it's really the
vm_state_notify() that should take care of stopping device activity.

But maybe we can make it work with drain anyway.

> So I still think that this bug is a symptom of a problem in the design
> of request queuing.
> 
> In fact, shouldn't request queuing be enabled at the _end_ of
> bdrv_drained_begin (once the BlockBackend has reached a quiescent
> state on its own terms), rather than at the beginning (which leads to
> deadlocks like this one)?

No, I don't think that is ever right. As I said earlier in this thread
(and you said yourself previously), there are two different users of
drain:

1. I want to have exclusive access to the node. This one wants request
   queuing from the start to avoid losing time unnecessarily until the
   guest stops sending new requests.

2. I want to wait for my requests to complete. This one never wants
   request queuing. Enabling it at the end of bdrv_drained_begin()
   wouldn't hurt it (because it has already achieved its goal then), but
   it's also not necessary for the same reason.

IDE reset and do_vm_stop() are case 2, implemented with blk_drain*().
The request queuing was implemented for case 1, something else in the
block graph draining the BlockBackend's root node with bdrv_drain*().

So maybe what we could take from this is that request queuing should be
temporarily disabled while we're in blk_drain*() because these
interfaces are only meant for case 2. In all other cases, it should
continue to work as it does now.
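
Roughly like this (untested sketch inside block/block-backend.c; nesting
and interaction with blk_set_disable_request_queuing() are ignored):

    void blk_drain(BlockBackend *blk)
    {
        bool old = blk->disable_request_queuing;

        /*
         * blk_drain() is the "wait for my own requests" case (case 2),
         * so don't park new requests in queued_requests while waiting;
         * that queuing is what turns the IDE TRIM case into a deadlock.
         */
        blk->disable_request_queuing = true;

        /* ... existing drain logic unchanged ... */

        blk->disable_request_queuing = old;
    }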

Kevin
Paolo Bonzini March 14, 2023, 9:42 a.m. UTC | #12
On Mon, Mar 13, 2023 at 5:32 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > So I still think that this bug is a symptom of a problem in the design
> > of request queuing.
> >
> > In fact, shouldn't request queuing be enabled at the _end_ of
> > bdrv_drained_begin (once the BlockBackend has reached a quiescent
> > state on its own terms), rather than at the beginning (which leads to
> > deadlocks like this one)?
>
> 1. I want to have exclusive access to the node. This one wants request
>    queuing from the start to avoid losing time unnecessarily until the
>    guest stops sending new requests.
>
> 2. I want to wait for my requests to complete. This one never wants
>    request queuing. Enabling it at the end of bdrv_drained_begin()
>    wouldn't hurt it (because it has already achieved its goal then), but
>    it's also not necessary for the same reason.

Right, doing it at the end would be needed to avoid the deadlocks. On
the other hand, case 1 can (and I think should) be handled by
.drained_begin, or shortcut through aio_disable_external() for those
devices that use ioeventfd.

Paolo

> So maybe what we could take from this is that request queuing should be
> temporarily disabled while we're in blk_drain*() because these
> interfaces are only meant for case 2. In all other cases, it should
> continue to work as it does now.

Patch

diff --git a/hw/ide/core.c b/hw/ide/core.c
index 2d034731cf..54c51084d2 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -444,7 +444,7 @@  static void ide_trim_bh_cb(void *opaque)
     iocb->bh = NULL;
     qemu_aio_unref(iocb);
 
-    /* Paired with an increment in ide_issue_trim() */
+    /* Paired with an increment in ide_issue_trim_cb() */
     blk_dec_in_flight(blk);
 }
 
@@ -504,6 +504,14 @@  static void ide_issue_trim_cb(void *opaque, int ret)
 done:
     iocb->aiocb = NULL;
     if (iocb->bh) {
+        /*
+         * Paired with a decrement in ide_trim_bh_cb(): Ensure we have
+         * an in-flight count while this TrimAIOCB object lives.
+         * There is no ongoing blk_aio_pdiscard() operation anymore,
+         * so here we increment the counter manually before returning.
+         */
+        blk_inc_in_flight(s->blk);
+
         replay_bh_schedule_event(iocb->bh);
     }
 }
@@ -515,9 +523,6 @@  BlockAIOCB *ide_issue_trim(
     IDEState *s = opaque;
     TrimAIOCB *iocb;
 
-    /* Paired with a decrement in ide_trim_bh_cb() */
-    blk_inc_in_flight(s->blk);
-
     iocb = blk_aio_get(&trim_aiocb_info, s->blk, cb, cb_opaque);
     iocb->s = s;
     iocb->bh = qemu_bh_new(ide_trim_bh_cb, iocb);
@@ -737,11 +742,17 @@  void ide_cancel_dma_sync(IDEState *s)
      * In the future we'll be able to safely cancel the I/O if the
      * whole DMA operation will be submitted to disk with a single
      * aio operation with preadv/pwritev.
+     *
+     * Note that TRIM operations call blk_aio_pdiscard() multiple
+     * times (and finally increment s->blk's in-flight counter while
+     * ide_trim_bh_cb() is scheduled), so we have to loop blk_drain()
+     * until the whole operation is done.
      */
     if (s->bus->dma->aiocb) {
         trace_ide_cancel_dma_sync_remaining();
-        blk_drain(s->blk);
-        assert(s->bus->dma->aiocb == NULL);
+        while (s->bus->dma->aiocb) {
+            blk_drain(s->blk);
+        }
     }
 }