Message ID | 20200112173006.29863-2-digetx@gmail.com (mailing list archive) |
---|---|
State | Changes Requested |
Series | NVIDIA Tegra APB DMA driver fixes and improvements |
On 12/01/2020 17:29, Dmitry Osipenko wrote:
> I was doing some experiments with I2C and noticed that Tegra APB DMA
> driver crashes sometime after I2C DMA transfer termination. The crash
> happens because tegra_dma_terminate_all() bails out immediately if pending
> list is empty, thus it doesn't release the half-completed descriptors
> which are getting re-used before ISR tasklet kicks-in.

Can you elaborate a bit more on how these are getting re-used? What is
the sequence of events which results in the panic? I believe that this
was also reported in the past [0] and so I don't doubt there is an issue
here, but would like to completely understand this.

Thanks!
Jon

[0] https://lore.kernel.org/patchwork/patch/675349/
14.01.2020 18:09, Jon Hunter wrote:
> Can you elaborate a bit more on how these are getting re-used? What is
> the sequence of events which results in the panic? I believe that this
> was also reported in the past [0] and so I don't doubt there is an issue
> here, but would like to completely understand this.
>
> Thanks!
> Jon
>
> [0] https://lore.kernel.org/patchwork/patch/675349/

In my case it happens in the touchscreen driver during the touchscreen's
interrupt handling (in a threaded IRQ handler), while the CPU is under
load and there is other interrupt activity. What happens is that the TS
driver issues one I2C transfer, which fails with an (apparently bogus)
timeout because the DMA descriptor is completed and removed from the
pending list, but the tasklet has not executed yet. The TS driver then
immediately issues another I2C transfer that re-uses the
not-yet-completed descriptor. That's my understanding.
On 14/01/2020 20:33, Dmitry Osipenko wrote:
> In my case it happens in the touchscreen driver during of the
> touchscreen's interrupt handling (in a threaded IRQ handler) + CPU is
> under load and there is other interrupts activity. So what happens here
> is that the TS driver issues one I2C transfer, which fails with
> (apparently bogus) timeout (because DMA descriptor is completed and
> removed from the pending list, but tasklet not executed yet), and then
> TS immediately issues another I2C transfer that re-uses the
> yet-incompleted descriptor. That's my understanding.

OK, but what is the exact sequence that is allowing it to re-use the
incomplete descriptor?

Thanks
Jon
15.01.2020 12:00, Jon Hunter wrote:
> OK, but what is the exact sequence that it allowing it to re-use the
> incompleted descriptor?

TDMA driver                             DMA Client

1.
                                        dmaengine_prep()

2.
tegra_dma_desc_get()
dma_desc = kzalloc()
...
tegra_dma_prep_slave_sg()
INIT_LIST_HEAD(&dma_desc->tx_list);
INIT_LIST_HEAD(&dma_desc->cb_node);
list_add_tail(sgreq->node,
              dma_desc->tx_list)

3.
                                        dma_async_issue_pending()

4.
tegra_dma_tx_submit()
list_splice_tail_init(dma_desc->tx_list,
                      tdc->pending_sg_req)

5.
tegra_dma_isr()
...
handle_once_dma_done()
...
sgreq = list_first_entry(tdc->pending_sg_req)
list_del(sgreq->node);
...
list_add_tail(dma_desc->cb_node,
              tdc->cb_desc);
list_add_tail(dma_desc->node,
              tdc->free_dma_desc);
...
tasklet_schedule(&tdc->tasklet);
...

6.
                                        timeout
                                        dmaengine_terminate_async()

7.
tegra_dma_terminate_all()
if (list_empty(tdc->pending_sg_req))
        return 0;

8.
                                        dmaengine_prep()

9.
tegra_dma_desc_get()
list_for_each_entry(dma_desc, tdc->free_dma_desc) {
        list_del(dma_desc->node);
        return dma_desc;
}
...
tegra_dma_prep_slave_sg()
INIT_LIST_HEAD(&dma_desc->tx_list);
INIT_LIST_HEAD(&dma_desc->cb_node);

*** tdc->cb_desc list is wrecked now! ***

list_add_tail(sgreq->node,
              dma_desc->tx_list)
...

10.
same actions as in #4 #5

...

11.
tegra_dma_tasklet()
dma_desc = list_first_entry(tdc->cb_desc)
list_del(dma_desc->cb_node);

eventual woopsie
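To make step 9 concrete, here is a minimal user-space sketch that replays steps 5, 9, 10 and 11 on a single cb_node. It is not the driver code: the list helpers are a simplified re-implementation of the kernel's circular list, MODEL_POISON is a stand-in for LIST_POISON1, and cb_desc_left_dangling() is a hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified user-space model of the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

/* Stand-in for the kernel's LIST_POISON1 marker (illustrative value). */
#define MODEL_POISON ((struct list_head *)0x100)

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail(struct list_head *new, struct list_head *head)
{
	struct list_head *prev = head->prev;

	head->prev = new;
	new->next = head;
	new->prev = prev;
	prev->next = new;
}

static void list_del(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = MODEL_POISON;
	entry->prev = MODEL_POISON;
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/*
 * Replays the sequence on one descriptor's cb_node and reports whether
 * cb_desc still references the node after the tasklet's list_del().
 */
static bool cb_desc_left_dangling(bool reuse_before_tasklet)
{
	struct list_head cb_desc, cb_node;

	INIT_LIST_HEAD(&cb_desc);
	INIT_LIST_HEAD(&cb_node);

	list_add_tail(&cb_node, &cb_desc);         /* step 5: ISR queues callback */
	assert(cb_desc.next == &cb_node);

	if (reuse_before_tasklet) {
		INIT_LIST_HEAD(&cb_node);          /* step 9: re-init while linked */
		list_add_tail(&cb_node, &cb_desc); /* step 10: ISR queues it again */
	}

	list_del(&cb_node);                        /* step 11: tasklet dequeues */

	return !list_empty(&cb_desc);
}
```

In this model, cb_desc_left_dangling(false) leaves the list empty, while cb_desc_left_dangling(true) leaves cb_desc.next pointing at a node whose links are already poisoned. A second list_del() of that node is the kind of access that a list-debugging check would report as "next is LIST_POISON1".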
On 16/01/2020 20:10, Dmitry Osipenko wrote:
> 5.
> tegra_dma_isr()
> ...
> handle_once_dma_done()
> ...
> sgreq = list_first_entry(tdc->pending_sg_req)
> list_del(sgreq->node);
> ...
> list_add_tail(dma_desc->cb_node,
>               tdc->cb_desc);
> list_add_tail(dma_desc->node,
>               tdc->free_dma_desc);

Isn't this the problem here, that we have placed this on the free list
before we are actually done?

It seems to me that there could still be a potential race condition
between the ISR and the tasklet running.

Jon
28.01.2020 17:02, Jon Hunter wrote:
> Isn't this the problem here, that we have placed this on the free list
> before we are actually done?
>
> It seems to me that there could still be a potential race condition
> between the ISR and the tasklet running.

Yes, this should be addressed by the patch #3 "dmaengine: tegra-apb:
Prevent race conditions of tasklet vs free list".
28.01.2020 17:51, Dmitry Osipenko wrote:
> Yes, this should be addressed by the patch #3 "dmaengine: tegra-apb:
> Prevent race conditions of tasklet vs free list".

Correction (to avoid confusion): it's actually patch #5, my bad.
On 29/01/2020 00:12, Dmitry Osipenko wrote:
> 28.01.2020 17:51, Dmitry Osipenko wrote:
>> Yes, this should be addressed by the patch #3 "dmaengine: tegra-apb:
>> Prevent race conditions of tasklet vs free list".
>
> correction (to avoid confusion): it's actually patch #5, my bad

Ah OK great! Could be worth making that patch #1 or #2 in the series as
these are somewhat related.

Cheers
Jon
diff --git a/drivers/dma/tegra20-apb-dma.c b/drivers/dma/tegra20-apb-dma.c
index 3a45079d11ec..319f31d27014 100644
--- a/drivers/dma/tegra20-apb-dma.c
+++ b/drivers/dma/tegra20-apb-dma.c
@@ -756,10 +756,6 @@ static int tegra_dma_terminate_all(struct dma_chan *dc)
 	bool was_busy;
 
 	spin_lock_irqsave(&tdc->lock, flags);
-	if (list_empty(&tdc->pending_sg_req)) {
-		spin_unlock_irqrestore(&tdc->lock, flags);
-		return 0;
-	}
 
 	if (!tdc->busy)
 		goto skip_dma_stop;
I was doing some experiments with I2C and noticed that the Tegra APB DMA
driver crashes sometime after I2C DMA transfer termination. The crash
happens because tegra_dma_terminate_all() bails out immediately if the
pending list is empty, thus it doesn't release the half-completed
descriptors, which are getting re-used before the ISR tasklet kicks in.

 tegra-i2c 7000c400.i2c: DMA transfer timeout
 elants_i2c 0-0010: elants_i2c_irq: failed to read data: -110
 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 142 at lib/list_debug.c:45 __list_del_entry_valid+0x45/0xac
 list_del corruption, ddbaac44->next is LIST_POISON1 (00000100)
 Modules linked in:
 CPU: 0 PID: 142 Comm: kworker/0:2 Not tainted 5.5.0-rc2-next-20191220-00175-gc3605715758d-dirty #538
 Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
 Workqueue: events_freezable_power_ thermal_zone_device_check
 [<c010e5c5>] (unwind_backtrace) from [<c010a1c5>] (show_stack+0x11/0x14)
 [<c010a1c5>] (show_stack) from [<c0973925>] (dump_stack+0x85/0x94)
 [<c0973925>] (dump_stack) from [<c011f529>] (__warn+0xc1/0xc4)
 [<c011f529>] (__warn) from [<c011f7e9>] (warn_slowpath_fmt+0x61/0x78)
 [<c011f7e9>] (warn_slowpath_fmt) from [<c042497d>] (__list_del_entry_valid+0x45/0xac)
 [<c042497d>] (__list_del_entry_valid) from [<c047a87f>] (tegra_dma_tasklet+0x5b/0x154)
 [<c047a87f>] (tegra_dma_tasklet) from [<c0124799>] (tasklet_action_common.constprop.0+0x41/0x7c)
 [<c0124799>] (tasklet_action_common.constprop.0) from [<c01022ab>] (__do_softirq+0xd3/0x2a8)
 [<c01022ab>] (__do_softirq) from [<c0124683>] (irq_exit+0x7b/0x98)
 [<c0124683>] (irq_exit) from [<c0168c19>] (__handle_domain_irq+0x45/0x80)
 [<c0168c19>] (__handle_domain_irq) from [<c043e429>] (gic_handle_irq+0x45/0x7c)
 [<c043e429>] (gic_handle_irq) from [<c0101aa5>] (__irq_svc+0x65/0x94)
 Exception stack(0xde2ebb90 to 0xde2ebbd8)

Cc: <stable@vger.kernel.org>
Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
---
 drivers/dma/tegra20-apb-dma.c | 4 ----
 1 file changed, 4 deletions(-)
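The effect of the removed check can be sketched with a toy model that tracks only counts. Everything here is hypothetical: struct model_chan, model_terminate_all() and cb_desc_left_after_terminate() are illustration names, and the "drop all outstanding work" step is an assumption based on the commit message's statement that the early return prevents half-completed descriptors from being released.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a DMA channel: counters only, no real lists. */
struct model_chan {
	int pending_sg_req; /* requests the hardware has not completed yet */
	int cb_desc;        /* completed descriptors still awaiting the tasklet */
};

static int model_terminate_all(struct model_chan *ch, bool early_return)
{
	/* Models the check that the patch removes. */
	if (early_return && ch->pending_sg_req == 0)
		return 0;

	/* Assumed behaviour of the remainder: release all outstanding work. */
	ch->pending_sg_req = 0;
	ch->cb_desc = 0;
	return 0;
}

/*
 * Starts from the state after step 5 of the sequence discussed in the
 * thread: nothing pending, one descriptor queued for its callback.
 * Returns how many callback descriptors survive termination.
 */
static int cb_desc_left_after_terminate(bool early_return)
{
	struct model_chan ch = { .pending_sg_req = 0, .cb_desc = 1 };
	int ret = model_terminate_all(&ch, early_return);

	assert(ret == 0);
	return ch.cb_desc;
}
```

With early_return set, the queued descriptor survives termination and remains available for re-use; with the check removed, nothing is left behind for the tasklet to trip over.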