Message ID | 20230309114430.33684-1-hreitz@redhat.com (mailing list archive)
---|---
State | New, archived
Series | [for-8.0] ide: Fix manual in-flight count for TRIM BH
On 3/9/23 12:44, Hanna Czenczek wrote:
> +     *
> +     * Note that TRIM operations call blk_aio_pdiscard() multiple
> +     * times (and finally increment s->blk's in-flight counter while
> +     * ide_trim_bh_cb() is scheduled), so we have to loop blk_drain()
> +     * until the whole operation is done.
>       */
>      if (s->bus->dma->aiocb) {
>          trace_ide_cancel_dma_sync_remaining();
> -        blk_drain(s->blk);
> -        assert(s->bus->dma->aiocb == NULL);
> +        while (s->bus->dma->aiocb) {
> +            blk_drain(s->blk);
> +        }

I think having to do this is problematic, because the blk_drain should
leave no pending operation.

Here it seems okay because you do it in a controlled situation, but the
main thread can also issue blk_drain(), or worse bdrv_drained_begin(),
and there would be pending I/O operations when it returns.

Unfortunately I don't have a solution (I'm not considering the idea of
using disable_request_queuing and having even more atomics magic in
block/block-backend.c), but I'll read the thread.

Paolo
On Thu, Mar 9, 2023 at 1:05 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> I think having to do this is problematic, because the blk_drain should
> leave no pending operation.
>
> Here it seems okay because you do it in a controlled situation, but the
> main thread can also issue blk_drain(), or worse bdrv_drained_begin(),
> and there would be pending I/O operations when it returns.
>
> Unfortunately I don't have a solution (I'm not considering the idea of
> using disable_request_queuing and having even more atomics magic in
> block/block-backend.c), but I'll read the thread.

Hmm, what about making blk_aio_prwv non-static and calling
bdrv_co_pdiscard directly from IDE?

Paolo
On 09.03.23 13:08, Paolo Bonzini wrote:
> On Thu, Mar 9, 2023 at 1:05 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>> I think having to do this is problematic, because the blk_drain should
>> leave no pending operation.
>>
>> Here it seems okay because you do it in a controlled situation, but the
>> main thread can also issue blk_drain(), or worse bdrv_drained_begin(),
>> and there would be pending I/O operations when it returns.

Not really. We would stop in the middle of a trim that processes a list
of discard requests. So I see it more like stopping in the middle of
anything that processes guest requests. Once drain ends, we continue
processing them, and that’s not exactly pending I/O.

There is a pending object in s->bus->dma->aiocb on the IDE side, so
there is a pending DMA operation, but naïvely, I don’t see that as a
problem.

>> Unfortunately I don't have a solution (I'm not considering the idea of
>> using disable_request_queuing and having even more atomics magic in
>> block/block-backend.c), but I'll read the thread.

I wouldn’t disable request queuing, because its introducing commit
message (cf3129323f900ef5ddbccbe86e4fa801e88c566e) specifically says it
fixes IDE. I suppose the reason might actually be this problem here, in
that before request queuing, it was possible that IDE would continue
issuing discard requests even while drained, because processing the
list didn’t stop. Maybe.

Or the issue is generally that IDE uses dma_* functions, which might
cause I/O functions to be run from new BHs (I guess through
reschedule_dma()?).

> Hmm, what about making blk_aio_prwv non-static and calling
> bdrv_co_pdiscard directly from IDE?

You mean transforming ide_issue_trim_cb() into an iterative coroutine
(instead of being recursive and using AIO) and invoking it via
blk_aio_prwv()?

It doesn’t feel right to call a bdrv_* function directly from a user
external to the core block layer, so in this case I’d rather fall back
to Fiona’s idea of invoking all discards concurrently.

Hanna
On 3/9/23 13:31, Hanna Czenczek wrote:
> On 09.03.23 13:08, Paolo Bonzini wrote:
>> On Thu, Mar 9, 2023 at 1:05 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> I think having to do this is problematic, because the blk_drain should
>>> leave no pending operation.
>>>
>>> Here it seems okay because you do it in a controlled situation, but the
>>> main thread can also issue blk_drain(), or worse bdrv_drained_begin(),
>>> and there would be pending I/O operations when it returns.
>
> Not really. We would stop in the middle of a trim that processes a list
> of discard requests. So I see it more like stopping in the middle of
> anything that processes guest requests. Once drain ends, we continue
> processing them, and that’s not exactly pending I/O.
>
> There is a pending object in s->bus->dma->aiocb on the IDE side, so
> there is a pending DMA operation, but naïvely, I don’t see that as a
> problem.

What about the bdrv_drain_all() when a VM stops, would the guest
continue to access memory and disks after bdrv_drain() returns?

Migration could also be a problem, because the partial TRIM would not
be recorded in the s->bus->error_status field of IDEState (no surprise
there, it's not an error). Also, errors happening after bdrv_drain()
might not be migrated correctly.

> Or the issue is generally that IDE uses dma_* functions, which might
> cause I/O functions to be run from new BHs (I guess through
> reschedule_dma()?).

Ah, you mean that you can have pending I/O operations while
blk->in_flight is zero? That would be a problem indeed. We already have
BlockDevOps for ide-cd and ide-hd, should we add a .drained_poll
callback there?

>> Hmm, what about making blk_aio_prwv non-static and calling
>> bdrv_co_pdiscard directly from IDE?
>
> You mean transforming ide_issue_trim_cb() into an iterative coroutine
> (instead of being recursive and using AIO) and invoking it via
> blk_aio_prwv()?
>
> It doesn’t feel right to call a bdrv_* function directly from a user
> external to the core block layer, so in this case I’d rather fall back
> to Fiona’s idea of invoking all discards concurrently.

Yeah, honestly it doesn't feel very much right to me either.

Paolo
On 09.03.2023 at 14:59, Paolo Bonzini wrote:
> On 3/9/23 13:31, Hanna Czenczek wrote:
> > On 09.03.23 13:08, Paolo Bonzini wrote:
> > > On Thu, Mar 9, 2023 at 1:05 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > > > I think having to do this is problematic, because the blk_drain should
> > > > leave no pending operation.
> > > >
> > > > Here it seems okay because you do it in a controlled situation, but the
> > > > main thread can also issue blk_drain(), or worse bdrv_drained_begin(),
> > > > and there would be pending I/O operations when it returns.
> >
> > Not really. We would stop in the middle of a trim that processes a list
> > of discard requests. So I see it more like stopping in the middle of
> > anything that processes guest requests. Once drain ends, we continue
> > processing them, and that’s not exactly pending I/O.
> >
> > There is a pending object in s->bus->dma->aiocb on the IDE side, so
> > there is a pending DMA operation, but naïvely, I don’t see that as a
> > problem.
>
> What about the bdrv_drain_all() when a VM stops, would the guest continue
> to access memory and disks after bdrv_drain() returns?

That one shouldn't be a problem because the devices are stopped before
the backends.

> Migration could also be a problem, because the partial TRIM would not be
> recorded in the s->bus->error_status field of IDEState (no surprise there,
> it's not an error). Also, errors happening after bdrv_drain() might not be
> migrated correctly.

Yes, I think it would be good to have the I/O operation fully completed
on the IDE level rather than just in the block layer.

> > Or the issue is generally that IDE uses dma_* functions, which might
> > cause I/O functions to be run from new BHs (I guess through
> > reschedule_dma()?).

None of those complicated scenarios actually. The problem solved by the
request queuing is simply that nothing else stops the guest from
submitting new requests to drained nodes if the CPUs are still running.

Drain uses aio_disable_external() to disable processing of external
events, in particular the ioeventfd used by virtio-blk and virtio-scsi.
But IDE submits requests through MMIO or port I/O, i.e. the vcpu thread
exits to userspace and calls directly into the IDE code, so it's
completely unaffected by aio_disable_external().

> Ah, you mean that you can have pending I/O operations while blk->in_flight
> is zero? That would be a problem indeed. We already have BlockDevOps for
> ide-cd and ide-hd, should we add a .drained_poll callback there?

To be more precise, you suggested in the call that .drained_poll should
return that IDE is still busy while aiocb != NULL. Without having looked
at the code in detail yet, that seems to make sense to me. And maybe
even the blk_inc/dec_in_flight() pair can then go completely away
because IDE takes care of its drain state itself then.

Kevin
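For illustration, a minimal sketch of the .drained_poll idea discussed above. BlockDevOps has no drained_poll member at this point in the thread, so the member itself, the callback name, and the assumption that the opaque pointer is the IDEState are hypothetical:

    /* Hypothetical device-side poll hook: report the drain section as
     * still busy while an IDE-level DMA operation (e.g. a TRIM) has not
     * fully settled yet. */
    static bool ide_dev_drained_poll(void *opaque)
    {
        IDEState *s = opaque;

        /* A non-NULL aiocb means the DMA/TRIM operation is still ongoing. */
        return s->bus->dma->aiocb != NULL;
    }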
On 09.03.23 at 18:46, Kevin Wolf wrote:
> On 09.03.2023 at 14:59, Paolo Bonzini wrote:
>> On 3/9/23 13:31, Hanna Czenczek wrote:
>>> On 09.03.23 13:08, Paolo Bonzini wrote:
>>>> On Thu, Mar 9, 2023 at 1:05 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>>> I think having to do this is problematic, because the blk_drain should
>>>>> leave no pending operation.
>>>>>
>>>>> Here it seems okay because you do it in a controlled situation, but the
>>>>> main thread can also issue blk_drain(), or worse bdrv_drained_begin(),
>>>>> and there would be pending I/O operations when it returns.
>>>
>>> Not really. We would stop in the middle of a trim that processes a list
>>> of discard requests. So I see it more like stopping in the middle of
>>> anything that processes guest requests. Once drain ends, we continue
>>> processing them, and that’s not exactly pending I/O.
>>>
>>> There is a pending object in s->bus->dma->aiocb on the IDE side, so
>>> there is a pending DMA operation, but naïvely, I don’t see that as a
>>> problem.
>>
>> What about the bdrv_drain_all() when a VM stops, would the guest continue
>> to access memory and disks after bdrv_drain() returns?
>
> That one shouldn't be a problem because the devices are stopped before
> the backends.
>
>> Migration could also be a problem, because the partial TRIM would not be
>> recorded in the s->bus->error_status field of IDEState (no surprise there,
>> it's not an error). Also, errors happening after bdrv_drain() might not be
>> migrated correctly.
>
> Yes, I think it would be good to have the I/O operation fully completed
> on the IDE level rather than just in the block layer.
>
>>> Or the issue is generally that IDE uses dma_* functions, which might
>>> cause I/O functions to be run from new BHs (I guess through
>>> reschedule_dma()?).
>
> None of those complicated scenarios actually. The problem solved by the
> request queuing is simply that nothing else stops the guest from
> submitting new requests to drained nodes if the CPUs are still running.
>
> Drain uses aio_disable_external() to disable processing of external
> events, in particular the ioeventfd used by virtio-blk and virtio-scsi.
> But IDE submits requests through MMIO or port I/O, i.e. the vcpu thread
> exits to userspace and calls directly into the IDE code, so it's
> completely unaffected by aio_disable_external().
>
>> Ah, you mean that you can have pending I/O operations while blk->in_flight
>> is zero? That would be a problem indeed. We already have BlockDevOps for
>> ide-cd and ide-hd, should we add a .drained_poll callback there?
>
> To be more precise, you suggested in the call that .drained_poll should
> return that IDE is still busy while aiocb != NULL. Without having looked
> at the code in detail yet, that seems to make sense to me. And maybe
> even the blk_inc/dec_in_flight() pair can then go completely away
> because IDE takes care of its drain state itself then.

I assume the addition of drained_poll is meant to be orthogonal to the
fix of the deadlock? At least I can't see how it would help there?

If we have the assumptions:
1. The TRIM operation should be completed on the IDE level before
   draining ends.
2. Block layer requests issued after draining has begun are queued.

To me, the conclusion seems to be:
Issue all block layer requests belonging to the IDE TRIM operation up
front.
The other alternative I see is to break assumption 2, introduce a way
to not queue certain requests while drained, and use it for the
recursive requests issued by ide_issue_trim_cb. But not the initial
one, if that would defeat the purpose of request queuing. Of course
this can't be done if QEMU relies on the assumption in other places
already.

Best Regards,
Fiona
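For illustration, a rough sketch of the first conclusion above, issuing all block layer requests of the TRIM up front instead of chaining them from each other's completion. The TrimRange type, the pending counter in TrimAIOCB, and both helper names are made up for this sketch; only blk_aio_pdiscard() and replay_bh_schedule_event() are existing interfaces already used in hw/ide/core.c:

    /* Hypothetical per-discard completion: the TRIM as a whole finishes
     * (and its completion BH is scheduled) only when the last discard
     * settles. */
    static void ide_trim_one_cb(void *opaque, int ret)
    {
        TrimAIOCB *iocb = opaque;

        if (ret < 0) {
            iocb->ret = ret;
        }
        if (--iocb->pending == 0) {   /* hypothetical counter */
            replay_bh_schedule_event(iocb->bh);
        }
    }

    /* Hypothetical submission path: issue every discard of the TRIM at
     * once, so no further block layer request is started after a drain
     * has begun. */
    static void ide_issue_trim_all(IDEState *s, TrimAIOCB *iocb,
                                   const TrimRange *ranges, int nb_ranges)
    {
        int i;

        iocb->pending = 1;   /* guard so we cannot complete mid-loop */
        for (i = 0; i < nb_ranges; i++) {
            iocb->pending++;
            blk_aio_pdiscard(s->blk, ranges[i].offset, ranges[i].bytes,
                             ide_trim_one_cb, iocb);
        }
        ide_trim_one_cb(iocb, 0);   /* drop the guard */
    }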
On 10.03.2023 at 14:05, Fiona Ebner wrote:
> On 09.03.23 at 18:46, Kevin Wolf wrote:
> > On 09.03.2023 at 14:59, Paolo Bonzini wrote:
> >> On 3/9/23 13:31, Hanna Czenczek wrote:
> >>> On 09.03.23 13:08, Paolo Bonzini wrote:
> >>>> On Thu, Mar 9, 2023 at 1:05 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>>>> I think having to do this is problematic, because the blk_drain should
> >>>>> leave no pending operation.
> >>>>>
> >>>>> Here it seems okay because you do it in a controlled situation, but the
> >>>>> main thread can also issue blk_drain(), or worse bdrv_drained_begin(),
> >>>>> and there would be pending I/O operations when it returns.
> >>>
> >>> Not really. We would stop in the middle of a trim that processes a list
> >>> of discard requests. So I see it more like stopping in the middle of
> >>> anything that processes guest requests. Once drain ends, we continue
> >>> processing them, and that’s not exactly pending I/O.
> >>>
> >>> There is a pending object in s->bus->dma->aiocb on the IDE side, so
> >>> there is a pending DMA operation, but naïvely, I don’t see that as a
> >>> problem.
> >>
> >> What about the bdrv_drain_all() when a VM stops, would the guest continue
> >> to access memory and disks after bdrv_drain() returns?
> >
> > That one shouldn't be a problem because the devices are stopped before
> > the backends.
> >
> >> Migration could also be a problem, because the partial TRIM would not be
> >> recorded in the s->bus->error_status field of IDEState (no surprise there,
> >> it's not an error). Also, errors happening after bdrv_drain() might not be
> >> migrated correctly.
> >
> > Yes, I think it would be good to have the I/O operation fully completed
> > on the IDE level rather than just in the block layer.
> >
> >>> Or the issue is generally that IDE uses dma_* functions, which might
> >>> cause I/O functions to be run from new BHs (I guess through
> >>> reschedule_dma()?).
> >
> > None of those complicated scenarios actually. The problem solved by the
> > request queuing is simply that nothing else stops the guest from
> > submitting new requests to drained nodes if the CPUs are still running.
> >
> > Drain uses aio_disable_external() to disable processing of external
> > events, in particular the ioeventfd used by virtio-blk and virtio-scsi.
> > But IDE submits requests through MMIO or port I/O, i.e. the vcpu thread
> > exits to userspace and calls directly into the IDE code, so it's
> > completely unaffected by aio_disable_external().
> >
> >> Ah, you mean that you can have pending I/O operations while blk->in_flight
> >> is zero? That would be a problem indeed. We already have BlockDevOps for
> >> ide-cd and ide-hd, should we add a .drained_poll callback there?
> >
> > To be more precise, you suggested in the call that .drained_poll should
> > return that IDE is still busy while aiocb != NULL. Without having looked
> > at the code in detail yet, that seems to make sense to me. And maybe
> > even the blk_inc/dec_in_flight() pair can then go completely away
> > because IDE takes care of its drain state itself then.
>
> I assume the addition of drained_poll is meant to be orthogonal to the
> fix of the deadlock? At least I can't see how it would help there?

You're right, it doesn't. Basically we're running into the old problem
again that draining is overloaded with two different meanings: I want
exclusive access to the backend or I want to wait for all my requests
to complete.
IDE or more generally blk_drain() wants the latter, but queuing is done
for implementing the former.

> If we have the assumptions:
> 1. The TRIM operation should be completed on the IDE level before
> draining ends.
> 2. Block layer requests issued after draining has begun are queued.
>
> To me, the conclusion seems to be:
> Issue all block layer requests belonging to the IDE TRIM operation up
> front.
>
> The other alternative I see is to break assumption 2, introduce a way
> to not queue certain requests while drained, and use it for the
> recursive requests issued by ide_issue_trim_cb. But not the initial
> one, if that would defeat the purpose of request queuing. Of course
> this can't be done if QEMU relies on the assumption in other places
> already.

I feel like this should be allowed because if anyone has exclusive
access in this scenario, it's IDE, so it should be able to bypass the
queuing. Of course, the queuing is still needed if someone else drained
the backend, so we can't just make TRIM bypass it in general. And if you
make it conditional on IDE being in blk_drain(), it already starts to
become ugly again...

So maybe the while loop is unavoidable.

Hmm... But could ide_cancel_dma_sync() just directly use
AIO_WAIT_WHILE(s->bus->dma->aiocb) instead of using blk_drain()?

Kevin
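A sketch of that last suggestion, assuming the existing AIO_WAIT_WHILE() and blk_get_aio_context() interfaces are sufficient here (Paolo's reply below points out why this alone would not be enough):

    /* Sketch: wait for the IDE-level DMA operation itself instead of
     * draining the whole BlockBackend. */
    if (s->bus->dma->aiocb) {
        trace_ide_cancel_dma_sync_remaining();
        AIO_WAIT_WHILE(blk_get_aio_context(s->blk),
                       s->bus->dma->aiocb != NULL);
    }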
On Fri, Mar 10, 2023 at 3:25 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > 1. The TRIM operation should be completed on the IDE level before
> > draining ends.
> > 2. Block layer requests issued after draining has begun are queued.
> >
> > To me, the conclusion seems to be:
> > Issue all block layer requests belonging to the IDE TRIM operation up
> > front.
> >
> > The other alternative I see is to break assumption 2, introduce a way
> > to not queue certain requests while drained, and use it for the
> > recursive requests issued by ide_issue_trim_cb. But not the initial
> > one, if that would defeat the purpose of request queuing. Of course
> > this can't be done if QEMU relies on the assumption in other places
> > already.
>
> I feel like this should be allowed because if anyone has exclusive
> access in this scenario, it's IDE, so it should be able to bypass the
> queuing. Of course, the queuing is still needed if someone else drained
> the backend, so we can't just make TRIM bypass it in general. And if you
> make it conditional on IDE being in blk_drain(), it already starts to
> become ugly again...
>
> So maybe the while loop is unavoidable.
>
> Hmm... But could ide_cancel_dma_sync() just directly use
> AIO_WAIT_WHILE(s->bus->dma->aiocb) instead of using blk_drain()?

While that should work, it would not fix other uses of
bdrv_drain_all(), for example in softmmu/cpus.c. Stopping the device
model relies on those to run *until the device model has finished
submitting requests*.

So I still think that this bug is a symptom of a problem in the design
of request queuing.

In fact, shouldn't request queuing be enabled at the _end_ of
bdrv_drained_begin (once the BlockBackend has reached a quiescent
state on its own terms), rather than at the beginning (which leads to
deadlocks like this one)?

blk->quiesce_counter becomes just a nesting counter for
drained_begin/end, with no uses outside, and blk_wait_while_drained
uses a new counter. Then you have something like this in
blk_root_drained_poll:

    if (blk->dev_ops && blk->dev_ops->drained_poll) {
        busy = blk->dev_ops->drained_poll(blk->dev_opaque);
    }
    busy |= !!blk->in_flight;
    if (!busy) {
        qatomic_set(&blk->queue_requests, true);
    }
    return busy;

And there's no need to touch IDE at all.

Paolo
On 10.03.23 at 16:13, Paolo Bonzini wrote:
> On Fri, Mar 10, 2023 at 3:25 PM Kevin Wolf <kwolf@redhat.com> wrote:
>>> 1. The TRIM operation should be completed on the IDE level before
>>> draining ends.
>>> 2. Block layer requests issued after draining has begun are queued.
>>>
>>> To me, the conclusion seems to be:
>>> Issue all block layer requests belonging to the IDE TRIM operation up
>>> front.
>>>
>>> The other alternative I see is to break assumption 2, introduce a way
>>> to not queue certain requests while drained, and use it for the
>>> recursive requests issued by ide_issue_trim_cb. But not the initial
>>> one, if that would defeat the purpose of request queuing. Of course
>>> this can't be done if QEMU relies on the assumption in other places
>>> already.
>>
>> I feel like this should be allowed because if anyone has exclusive
>> access in this scenario, it's IDE, so it should be able to bypass the
>> queuing. Of course, the queuing is still needed if someone else drained
>> the backend, so we can't just make TRIM bypass it in general. And if you
>> make it conditional on IDE being in blk_drain(), it already starts to
>> become ugly again...
>>
>> So maybe the while loop is unavoidable.
>>
>> Hmm... But could ide_cancel_dma_sync() just directly use
>> AIO_WAIT_WHILE(s->bus->dma->aiocb) instead of using blk_drain()?
>
> While that should work, it would not fix other uses of
> bdrv_drain_all(), for example in softmmu/cpus.c. Stopping the device
> model relies on those to run *until the device model has finished
> submitting requests*.
>
> So I still think that this bug is a symptom of a problem in the design
> of request queuing.
>
> In fact, shouldn't request queuing be enabled at the _end_ of
> bdrv_drained_begin (once the BlockBackend has reached a quiescent
> state on its own terms), rather than at the beginning (which leads to
> deadlocks like this one)?
>
> blk->quiesce_counter becomes just a nesting counter for
> drained_begin/end, with no uses outside, and blk_wait_while_drained
> uses a new counter. Then you have something like this in
> blk_root_drained_poll:
>
>     if (blk->dev_ops && blk->dev_ops->drained_poll) {
>         busy = blk->dev_ops->drained_poll(blk->dev_opaque);
>     }
>     busy |= !!blk->in_flight;
>     if (!busy) {
>         qatomic_set(&blk->queue_requests, true);
>     }
>     return busy;
>
> And there's no need to touch IDE at all.

Couldn't this lead to scenarios where a busy or malicious guest, which
continues to submit new requests, slows down draining or even prevents
it from finishing?

Best Regards,
Fiona
On 3/13/23 13:29, Fiona Ebner wrote:
>> In fact, shouldn't request queuing be enabled at the _end_ of
>> bdrv_drained_begin (once the BlockBackend has reached a quiescent
>> state on its own terms), rather than at the beginning (which leads to
>> deadlocks like this one)?
>
> Couldn't this lead to scenarios where a busy or malicious guest, which
> continues to submit new requests, slows down draining or even prevents
> it from finishing?

Possibly, but there is also a .drained_begin/.drained_end callback that
can be defined in order to apply backpressure. (For some other devices,
there's also aio_disable_external/aio_enable_external that do the
equivalent of request queuing but without the deadlocks.)

Since starting the queuing of requests at the end of bdrv_drained_begin
wouldn't hurt correctness, and it would fix this kind of deadlock, I
think it would be worth giving it a try.

Paolo
On 10.03.2023 at 16:13, Paolo Bonzini wrote:
> On Fri, Mar 10, 2023 at 3:25 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > > 1. The TRIM operation should be completed on the IDE level before
> > > draining ends.
> > > 2. Block layer requests issued after draining has begun are queued.
> > >
> > > To me, the conclusion seems to be:
> > > Issue all block layer requests belonging to the IDE TRIM operation up
> > > front.
> > >
> > > The other alternative I see is to break assumption 2, introduce a way
> > > to not queue certain requests while drained, and use it for the
> > > recursive requests issued by ide_issue_trim_cb. But not the initial
> > > one, if that would defeat the purpose of request queuing. Of course
> > > this can't be done if QEMU relies on the assumption in other places
> > > already.
> >
> > I feel like this should be allowed because if anyone has exclusive
> > access in this scenario, it's IDE, so it should be able to bypass the
> > queuing. Of course, the queuing is still needed if someone else drained
> > the backend, so we can't just make TRIM bypass it in general. And if you
> > make it conditional on IDE being in blk_drain(), it already starts to
> > become ugly again...
> >
> > So maybe the while loop is unavoidable.
> >
> > Hmm... But could ide_cancel_dma_sync() just directly use
> > AIO_WAIT_WHILE(s->bus->dma->aiocb) instead of using blk_drain()?
>
> While that should work, it would not fix other uses of
> bdrv_drain_all(), for example in softmmu/cpus.c. Stopping the device
> model relies on those to run *until the device model has finished
> submitting requests*.

If so, do_vm_stop() really expects drain to do something it isn't
designed to do. It's only for quiescing backends, not for any other
activity a qdev device might still be doing. I think it's really the
vm_state_notify() that should take care of stopping device activity.
But maybe we can make it work with drain anyway.

> So I still think that this bug is a symptom of a problem in the design
> of request queuing.
>
> In fact, shouldn't request queuing be enabled at the _end_ of
> bdrv_drained_begin (once the BlockBackend has reached a quiescent
> state on its own terms), rather than at the beginning (which leads to
> deadlocks like this one)?

No, I don't think that is ever right. As I said earlier in this thread
(and you said yourself previously), there are two different users of
drain:

1. I want to have exclusive access to the node. This one wants request
   queuing from the start to avoid losing time unnecessarily until the
   guest stops sending new requests.

2. I want to wait for my requests to complete. This one never wants
   request queuing. Enabling it at the end of bdrv_drained_begin()
   wouldn't hurt it (because it has already achieved its goal then), but
   it's also not necessary for the same reason.

IDE reset and do_vm_stop() are case 2, implemented with blk_drain*().
The request queuing was implemented for case 1, something else in the
block graph draining the BlockBackend's root node with bdrv_drain*().

So maybe what we could take from this is that request queuing should be
temporarily disabled while we're in blk_drain*() because these
interfaces are only meant for case 2. In all other cases, it should
continue to work as it does now.

Kevin
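A rough sketch of that last idea, and only a sketch: it assumes the existing blk_set_disable_request_queuing() helper can simply be toggled around the drain, ignores nesting and callers that keep queuing disabled permanently, and paraphrases the body of blk_drain() from block/block-backend.c rather than quoting it:

    void blk_drain(BlockBackend *blk)
    {
        BlockDriverState *bs = blk_bs(blk);

        /* blk_drain() means "wait for my own requests", so do not park
         * requests that belong to the very operation we are waiting for
         * (e.g. the discards of an IDE TRIM) in queued_requests. */
        blk_set_disable_request_queuing(blk, true);

        if (bs) {
            bdrv_ref(bs);
            bdrv_drained_begin(bs);
        }

        /* We may have -ENOMEDIUM completions in flight */
        AIO_WAIT_WHILE(blk_get_aio_context(blk),
                       qatomic_read(&blk->in_flight) > 0);

        if (bs) {
            bdrv_drained_end(bs);
            bdrv_unref(bs);
        }

        blk_set_disable_request_queuing(blk, false);
    }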
On Mon, Mar 13, 2023 at 5:32 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > So I still think that this bug is a symptom of a problem in the design
> > of request queuing.
> >
> > In fact, shouldn't request queuing be enabled at the _end_ of
> > bdrv_drained_begin (once the BlockBackend has reached a quiescent
> > state on its own terms), rather than at the beginning (which leads to
> > deadlocks like this one)?
>
> 1. I want to have exclusive access to the node. This one wants request
> queuing from the start to avoid losing time unnecessarily until the
> guest stops sending new requests.
>
> 2. I want to wait for my requests to complete. This one never wants
> request queuing. Enabling it at the end of bdrv_drained_begin()
> wouldn't hurt it (because it has already achieved its goal then), but
> it's also not necessary for the same reason.

Right, doing it at the end would be needed to avoid the deadlocks. On
the other hand, case 1 can (and I think should) be handled by
.drained_begin, or shortcut through aio_disable_external() for those
devices that use ioeventfd.

Paolo

> So maybe what we could take from this is that request queuing should be
> temporarily disabled while we're in blk_drain*() because these
> interfaces are only meant for case 2. In all other cases, it should
> continue to work as it does now.
diff --git a/hw/ide/core.c b/hw/ide/core.c
index 2d034731cf..54c51084d2 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -444,7 +444,7 @@ static void ide_trim_bh_cb(void *opaque)
     iocb->bh = NULL;
     qemu_aio_unref(iocb);
 
-    /* Paired with an increment in ide_issue_trim() */
+    /* Paired with an increment in ide_issue_trim_cb() */
     blk_dec_in_flight(blk);
 }
 
@@ -504,6 +504,14 @@ static void ide_issue_trim_cb(void *opaque, int ret)
 done:
     iocb->aiocb = NULL;
     if (iocb->bh) {
+        /*
+         * Paired with a decrement in ide_trim_bh_cb(): Ensure we have
+         * an in-flight count while this TrimAIOCB object lives.
+         * There is no ongoing blk_aio_pdiscard() operation anymore,
+         * so here we increment the counter manually before returning.
+         */
+        blk_inc_in_flight(s->blk);
+
         replay_bh_schedule_event(iocb->bh);
     }
 }
@@ -515,9 +523,6 @@ BlockAIOCB *ide_issue_trim(
     IDEState *s = opaque;
     TrimAIOCB *iocb;
 
-    /* Paired with a decrement in ide_trim_bh_cb() */
-    blk_inc_in_flight(s->blk);
-
     iocb = blk_aio_get(&trim_aiocb_info, s->blk, cb, cb_opaque);
     iocb->s = s;
     iocb->bh = qemu_bh_new(ide_trim_bh_cb, iocb);
@@ -737,11 +742,17 @@ void ide_cancel_dma_sync(IDEState *s)
      * In the future we'll be able to safely cancel the I/O if the
      * whole DMA operation will be submitted to disk with a single
      * aio operation with preadv/pwritev.
+     *
+     * Note that TRIM operations call blk_aio_pdiscard() multiple
+     * times (and finally increment s->blk's in-flight counter while
+     * ide_trim_bh_cb() is scheduled), so we have to loop blk_drain()
+     * until the whole operation is done.
      */
     if (s->bus->dma->aiocb) {
         trace_ide_cancel_dma_sync_remaining();
-        blk_drain(s->blk);
-        assert(s->bus->dma->aiocb == NULL);
+        while (s->bus->dma->aiocb) {
+            blk_drain(s->blk);
+        }
     }
 }
Commit 7e5cdb345f77d76cb4877fe6230c4e17a7d0d0ca ("ide: Increment BB
in-flight counter for TRIM BH") fixed a problem when cancelling trims:
ide_cancel_dma_sync() drains the BB to stop a potentially ongoing
request, and then asserts that no request is active anymore. When
trying to cancel a trim, this would only drain a single
blk_aio_pdiscard() operation, but not necessarily the trim as a whole.
Said commit fixed that by counting the whole trim as an operation,
incrementing the in-flight counter for it, so the blk_drain() in
ide_cancel_dma_sync() would wait for it.

Fiona found a problem with this, specifically that draining a BB while
an IDE trim operation is going on can produce a deadlock: An ongoing
blk_aio_pdiscard() will be settled, but any further discard in the
same trim will go into the BB's queued_request list and wait there
until the drain stops. Meanwhile, the whole trim keeps the BB's
in_flight_counter incremented, so the BB will never be considered
drained, and qemu hangs.

Therefore, we cannot keep the in-flight counter incremented throughout
the trim operation. We should still increment it before the completion
BH (ide_trim_bh_cb()) is scheduled and decrement it there, so that the
blk_drain() in ide_cancel_dma_sync() can await completion of this
scheduled BH. With that change, however, it can take multiple
iterations of blk_drain() to really settle the whole trim operation,
and so ide_cancel_dma_sync() must run it in a loop.

Reported-by: Fiona Ebner <f.ebner@proxmox.com>
Fixes: 7e5cdb345f77d76cb4877fe6230c4e17a7d0d0ca
       ("ide: Increment BB in-flight counter for TRIM BH")
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
Tested with a small test boot sector that issues a trim operation with
any number of discards from 0 to 64. Before this patch, doing so hangs
starting from 2 discards; and before 7e5cdb345f, doing so crashes qemu
with any number of discards. With this patch, it always works.
---
 hw/ide/core.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)