Message ID | 20190614083959.37944-1-yibin.gong@nxp.com (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
Series | [v1] dmaengine: imx-sdma: remove BD_INTR for channel0 | expand |
Hi Robin, On Fri, Jun 14, 2019 at 5:38 AM <yibin.gong@nxp.com> wrote: > > From: Robin Gong <yibin.gong@nxp.com> > > It is possible for an irq triggered by channel0 to be received later > after clks are disabled once firmware loaded during sdma probe. If > that happens then clearing them by writing to SDMA_H_INTR won't work > and the system will hang processing infinite interrupts. Actually, > don't need interrupt triggered on channel0 since it's pollling > SDMA_H_STATSTOP to know channel0 done rather than interrupt in > current code, just clear BD_INTR to disable channel0 interrupt to > avoid the above case. > > Signed-off-by: Robin Gong <yibin.gong@nxp.com> > Reported-by: Sven Van Asbroeck <thesven73@gmail.com> According to the original report from Sven the issue started to happen on 5.0, so it would be good to add a Fixes tag and Cc stable so that this fix could be backported to 5.0/5.1 stable trees. Thanks
On Fri, Jun 14, 2019 at 6:49 AM Fabio Estevam <festevam@gmail.com> wrote: > > According to the original report from Sven the issue started to happen > on 5.0, so it would be good to add a Fixes tag and Cc stable so that > this fix could be backported to 5.0/5.1 stable trees. Good catch ! However, the issue is highly timing-dependent. It will come and go depending on the kernel version, devicetree and defconfig. If it works for me on 4.19, that doesn't mean the bug is gone on 4.19. Looking at the commit history, I think the commit below possibly introduced the issue. Until this commit, sdma_run_channel() would wait on the interrupt before proceeding. It has been there since 4.8: Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in the interrupt handler") But my knowledge of imx-sdma is non-existent, so I invite the more knowledgeable people in this thread to take a look at this commit. [Adding Michael Olbrich to the thread]
Hi Robin, see comments inline. On Fri, Jun 14, 2019 at 4:38 AM <yibin.gong@nxp.com> wrote: > > diff --git a/drivers/dma/imx-sdma.c b/drivers/dma/imx-sdma.c > index deea9aa..b5a1ee2 100644 > --- a/drivers/dma/imx-sdma.c > +++ b/drivers/dma/imx-sdma.c > @@ -742,7 +742,7 @@ static int sdma_load_script(struct sdma_engine *sdma, void *buf, int size, > spin_lock_irqsave(&sdma->channel_0_lock, flags); > > bd0->mode.command = C0_SETPM; > - bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD; > + bd0->mode.status = BD_DONE | BD_WRAP | BD_EXTD; I tested this change on its own, and it seemed sufficient to make the crash disappear. > bd0->mode.count = size / 2; > bd0->buffer_addr = buf_phys; > bd0->ext_buffer_addr = address; > @@ -1064,7 +1064,7 @@ static int sdma_load_context(struct sdma_channel *sdmac) > context->gReg[7] = sdmac->watermark_level; > > bd0->mode.command = C0_SETDM; > - bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD; > + bd0->mode.status = BD_DONE | BD_WRAP | BD_EXTD; This function isn't part of the firmware load path, so how can it be related to fixing the firmware crash? If this is an unrelated efficiency saving, maybe it should go into its own patch? Maybe we want bugfix patches to be as small and specific as possible, so they can more easily be backported to older kernels? > bd0->mode.count = sizeof(*context) / 4; > bd0->buffer_addr = sdma->context_phys; > bd0->ext_buffer_addr = 2048 + (sizeof(*context) / 4) * channel; > -- > 2.7.4 >
On Fri, Jun 14, 2019 at 09:25:51AM -0400, Sven Van Asbroeck wrote: > On Fri, Jun 14, 2019 at 6:49 AM Fabio Estevam <festevam@gmail.com> wrote: > > > > According to the original report from Sven the issue started to happen > > on 5.0, so it would be good to add a Fixes tag and Cc stable so that > > this fix could be backported to 5.0/5.1 stable trees. > > Good catch ! > > However, the issue is highly timing-dependent. It will come and go depending > on the kernel version, devicetree and defconfig. If it works for me on > 4.19, that > doesn't mean the bug is gone on 4.19. > > Looking at the commit history, I think the commit below possibly introduced the > issue. Until this commit, sdma_run_channel() would wait on the interrupt > before proceeding. It has been there since 4.8: > > Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in the > interrupt handler") I think this is correct. Starting with this commit, the interrupt status fr channel 0 is no longer cleared in sdma_run_channel0() and sdma_int_handler() is always called for channel 0. During firmware loading the interrupts are enabled again just before the clocks are disabled. The interrupt is pending at this moment so on a single core system I think this will always work as expected. If the firmware loading and the interrupt handler run on different cores then this is racy. Maybe something else changed to make this more likely? With this new change sdma_int_handler() is no longer called for channel 0 right, so you should also remove the special handling there. Michael
On 2019-06-14 at 13:35 +0000, Sven Van Asbroeck wrote: > Hi Robin, see comments inline. > > On Fri, Jun 14, 2019 at 4:38 AM <yibin.gong@nxp.com> wrote: > > > > > > diff --git a/drivers/dma/imx-sdma.c b/drivers/dma/imx-sdma.c > > index deea9aa..b5a1ee2 100644 > > --- a/drivers/dma/imx-sdma.c > > +++ b/drivers/dma/imx-sdma.c > > @@ -742,7 +742,7 @@ static int sdma_load_script(struct sdma_engine > > *sdma, void *buf, int size, > > spin_lock_irqsave(&sdma->channel_0_lock, flags); > > > > bd0->mode.command = C0_SETPM; > > - bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD; > > + bd0->mode.status = BD_DONE | BD_WRAP | BD_EXTD; > I tested this change on its own, and it seemed sufficient to make the > crash > disappear. > > > > > bd0->mode.count = size / 2; > > bd0->buffer_addr = buf_phys; > > bd0->ext_buffer_addr = address; > > @@ -1064,7 +1064,7 @@ static int sdma_load_context(struct > > sdma_channel *sdmac) > > context->gReg[7] = sdmac->watermark_level; > > > > bd0->mode.command = C0_SETDM; > > - bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD; > > + bd0->mode.status = BD_DONE | BD_WRAP | BD_EXTD; > This function isn't part of the firmware load path, so how can it be > related > to fixing the firmware crash? Yes, that's not in your firmware load case, but I want to keep the same behavior for all channel0 conditions, don't worry, harmless here. > > If this is an unrelated efficiency saving, maybe it should go into > its > own patch? > Maybe we want bugfix patches to be as small and specific as possible, > so they > can more easily be backported to older kernels? > > > > > bd0->mode.count = sizeof(*context) / 4; > > bd0->buffer_addr = sdma->context_phys; > > bd0->ext_buffer_addr = 2048 + (sizeof(*context) / 4) * > > channel; > > -- > > 2.7.4 > >
On 2019-06-14 at 13:25 +0000, Sven Van Asbroeck wrote: > On Fri, Jun 14, 2019 at 6:49 AM Fabio Estevam <festevam@gmail.com> > wrote: > > > > > > According to the original report from Sven the issue started to > > happen > > on 5.0, so it would be good to add a Fixes tag and Cc stable so > > that > > this fix could be backported to 5.0/5.1 stable trees. > Good catch ! > > However, the issue is highly timing-dependent. It will come and go > depending > on the kernel version, devicetree and defconfig. If it works for me > on > 4.19, that > doesn't mean the bug is gone on 4.19. The default imx defconfig and dts should be ok, because firmware load is delayed after rootfs mounted where firmware located in and before that, some driver which use sdma such as spi/uart/audio may have already enable sdma clock which means channel0 interrupt could be cleared immediately without interrupt storm. That's why I can't reproduce your issue at first, but catch it once I sync with your directly firmware load defconfig. So seems not very must to CC stable tree? > > Looking at the commit history, I think the commit below possibly > introduced the > issue. Until this commit, sdma_run_channel() would wait on the > interrupt > before proceeding. It has been there since 4.8: > > Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in the > interrupt handler") Yes, but Michael's patch is the right direction, at least it fix RT case and meaningless channel0 interrupt storm coming before clearing channel0 interrupt status in sdma_run_channel0(). Now, this patch could fix its minor side-effect. > > But my knowledge of imx-sdma is non-existent, so I invite the more > knowledgeable > people in this thread to take a look at this commit. > > [Adding Michael Olbrich to the thread]
On 2019-06-14 at 18:09 +0000, Michael Olbrich wrote: > On Fri, Jun 14, 2019 at 09:25:51AM -0400, Sven Van Asbroeck wrote: > > > > On Fri, Jun 14, 2019 at 6:49 AM Fabio Estevam <festevam@gmail.com> > > wrote: > > > > > > > > > According to the original report from Sven the issue started to > > > happen > > > on 5.0, so it would be good to add a Fixes tag and Cc stable so > > > that > > > this fix could be backported to 5.0/5.1 stable trees. > > Good catch ! > > > > However, the issue is highly timing-dependent. It will come and go > > depending > > on the kernel version, devicetree and defconfig. If it works for me > > on > > 4.19, that > > doesn't mean the bug is gone on 4.19. > > > > Looking at the commit history, I think the commit below possibly > > introduced the > > issue. Until this commit, sdma_run_channel() would wait on the > > interrupt > > before proceeding. It has been there since 4.8: > > > > Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in the > > interrupt handler") > I think this is correct. Starting with this commit, the interrupt > status fr > channel 0 is no longer cleared in sdma_run_channel0() and > sdma_int_handler() is always called for channel 0. > During firmware loading the interrupts are enabled again just before > the > clocks are disabled. The interrupt is pending at this moment so on a > single > core system I think this will always work as expected. If the > firmware > loading and the interrupt handler run on different cores then this is > racy. > Maybe something else changed to make this more likely? > > With this new change sdma_int_handler() is no longer called for > channel 0 > right, so you should also remove the special handling there. What's 'special handling' should be removed here? Do you mean put below pieces of your patch back? static int sdma_load_script(struct sdma_engine *sdma, void *buf, int size, @@ -727,9 +720,9 @@ static irqreturn_t sdma_int_handler(int irq, void *dev_id) unsigned long stat; stat = readl_relaxed(sdma->regs + SDMA_H_INTR); - /* not interested in channel 0 interrupts */ - stat &= ~1; writel_relaxed(stat, sdma->regs + SDMA_H_INTR); + /* channel 0 is special and not handled here, see run_channel0() */ + stat &= ~1; > > Michael >
On Mon, Jun 17, 2019 at 02:14:34AM +0000, Robin Gong wrote: > On 2019-06-14 at 18:09 +0000, Michael Olbrich wrote: > > On Fri, Jun 14, 2019 at 09:25:51AM -0400, Sven Van Asbroeck wrote: > > > > > > On Fri, Jun 14, 2019 at 6:49 AM Fabio Estevam <festevam@gmail.com> > > > wrote: > > > > > > > > > > > > According to the original report from Sven the issue started to > > > > happen > > > > on 5.0, so it would be good to add a Fixes tag and Cc stable so > > > > that > > > > this fix could be backported to 5.0/5.1 stable trees. > > > Good catch ! > > > > > > However, the issue is highly timing-dependent. It will come and go > > > depending > > > on the kernel version, devicetree and defconfig. If it works for me > > > on > > > 4.19, that > > > doesn't mean the bug is gone on 4.19. > > > > > > Looking at the commit history, I think the commit below possibly > > > introduced the > > > issue. Until this commit, sdma_run_channel() would wait on the > > > interrupt > > > before proceeding. It has been there since 4.8: > > > > > > Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in the > > > interrupt handler") > > I think this is correct. Starting with this commit, the interrupt > > status fr > > channel 0 is no longer cleared in sdma_run_channel0() and > > sdma_int_handler() is always called for channel 0. > > During firmware loading the interrupts are enabled again just before > > the > > clocks are disabled. The interrupt is pending at this moment so on a > > single > > core system I think this will always work as expected. If the > > firmware > > loading and the interrupt handler run on different cores then this is > > racy. > > Maybe something else changed to make this more likely? > > > > With this new change sdma_int_handler() is no longer called for > > channel 0 > > right, so you should also remove the special handling there. > What's 'special handling' should be removed here? Do you mean put below > pieces of your patch back? > static int sdma_load_script(struct sdma_engine *sdma, void *buf, int > size, > @@ -727,9 +720,9 @@ static irqreturn_t sdma_int_handler(int irq, void > *dev_id) > unsigned long stat; > > stat = readl_relaxed(sdma->regs + SDMA_H_INTR); > - /* not interested in channel 0 interrupts */ > - stat &= ~1; > writel_relaxed(stat, sdma->regs + SDMA_H_INTR); > + /* channel 0 is special and not handled here, see > run_channel0() */ > + stat &= ~1; I think the "stat &= ~1;" can be removed completely. This bit should never be set, now that the interrupt for channel 0 is disabled. Michael
Hello Robin, On Sun, Jun 16, 2019 at 10:02 PM Robin Gong <yibin.gong@nxp.com> wrote: > > The default imx defconfig and dts should be ok, because firmware load > is delayed after rootfs mounted where firmware located in and before > that, some driver which use sdma such as spi/uart/audio may have > already enable sdma clock which means channel0 interrupt could be > cleared immediately without interrupt storm. That's why I can't > reproduce your issue at first, but catch it once I sync with your > directly firmware load defconfig. So seems not very must to CC stable > tree? As far as I know, the bug/crash does not happen if you're loading the sdma firmware from a filesystem. So the vast majority of users would never see the crash. I agree that this is not a high-priority bugfix. But it's worthwhile for the stable trees to have it. > Yes, but Michael's patch is the right direction, at least it fix RT > case and meaningless channel0 interrupt storm coming before clearing > channel0 interrupt status in sdma_run_channel0(). Now, this patch could > fix its minor side-effect. I'm not suggesting that we should revert or change Michael's patch. Just that it would be good for the v2 patch to contain: Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in the interrupt handler") This should allow stable maintainers to pull in your patch if and only if their release already contains 1d069bfa3c78.
On 2019-06-17 at 09:27 -0400, Sven Van Asbroeck wrote: > Hello Robin, > > On Sun, Jun 16, 2019 at 10:02 PM Robin Gong <yibin.gong@nxp.com> > wrote: > > > > > > The default imx defconfig and dts should be ok, because firmware > > load > > is delayed after rootfs mounted where firmware located in and > > before > > that, some driver which use sdma such as spi/uart/audio may have > > already enable sdma clock which means channel0 interrupt could be > > cleared immediately without interrupt storm. That's why I can't > > reproduce your issue at first, but catch it once I sync with your > > directly firmware load defconfig. So seems not very must to CC > > stable > > tree? > As far as I know, the bug/crash does not happen if you're loading the > sdma firmware from a filesystem. So the vast majority of users would > never see the crash. > > I agree that this is not a high-priority bugfix. But it's worthwhile > for the > stable trees to have it. > > > > > Yes, but Michael's patch is the right direction, at least it fix RT > > case and meaningless channel0 interrupt storm coming before > > clearing > > channel0 interrupt status in sdma_run_channel0(). Now, this patch > > could > > fix its minor side-effect. > I'm not suggesting that we should revert or change Michael's patch. > Just > that it would be good for the v2 patch to contain: > > Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in the > interrupt handler") > > This should allow stable maintainers to pull in your patch if and > only if > their release already contains 1d069bfa3c78. Okay, would add Fixes tag into v2.
On 2019-06-17 at 12:15 +0200, m.olbrich@pengutronix.de wrote: > On Mon, Jun 17, 2019 at 02:14:34AM +0000, Robin Gong wrote: > > > > On 2019-06-14 at 18:09 +0000, Michael Olbrich wrote: > > > > > > On Fri, Jun 14, 2019 at 09:25:51AM -0400, Sven Van Asbroeck > > > wrote: > > > > > > > > > > > > On Fri, Jun 14, 2019 at 6:49 AM Fabio Estevam <festevam@gmail.c > > > > om> > > > > wrote: > > > > > > > > > > > > > > > > > > > > According to the original report from Sven the issue started > > > > > to > > > > > happen > > > > > on 5.0, so it would be good to add a Fixes tag and Cc stable > > > > > so > > > > > that > > > > > this fix could be backported to 5.0/5.1 stable trees. > > > > Good catch ! > > > > > > > > However, the issue is highly timing-dependent. It will come and > > > > go > > > > depending > > > > on the kernel version, devicetree and defconfig. If it works > > > > for me > > > > on > > > > 4.19, that > > > > doesn't mean the bug is gone on 4.19. > > > > > > > > Looking at the commit history, I think the commit below > > > > possibly > > > > introduced the > > > > issue. Until this commit, sdma_run_channel() would wait on the > > > > interrupt > > > > before proceeding. It has been there since 4.8: > > > > > > > > Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in > > > > the > > > > interrupt handler") > > > I think this is correct. Starting with this commit, the interrupt > > > status fr > > > channel 0 is no longer cleared in sdma_run_channel0() and > > > sdma_int_handler() is always called for channel 0. > > > During firmware loading the interrupts are enabled again just > > > before > > > the > > > clocks are disabled. The interrupt is pending at this moment so > > > on a > > > single > > > core system I think this will always work as expected. If the > > > firmware > > > loading and the interrupt handler run on different cores then > > > this is > > > racy. > > > Maybe something else changed to make this more likely? > > > > > > With this new change sdma_int_handler() is no longer called for > > > channel 0 > > > right, so you should also remove the special handling there. > > What's 'special handling' should be removed here? Do you mean put > > below > > pieces of your patch back? > > static int sdma_load_script(struct sdma_engine *sdma, void *buf, > > int > > size, > > @@ -727,9 +720,9 @@ static irqreturn_t sdma_int_handler(int irq, > > void > > *dev_id) > > unsigned long stat; > > > > stat = readl_relaxed(sdma->regs + SDMA_H_INTR); > > - /* not interested in channel 0 interrupts */ > > - stat &= ~1; > > writel_relaxed(stat, sdma->regs + SDMA_H_INTR); > > + /* channel 0 is special and not handled here, see > > run_channel0() */ > > + stat &= ~1; > I think the "stat &= ~1;" can be removed completely. This bit should > never > be set, now that the interrupt for channel 0 is disabled. Okay, but that's harmless, moreover, I like your comment '/* channel 0 is special and not handled here, see run_channel0() */' which said clearly channel0 interrupt is a special one and NOT handled in sdma_int_handler. So I'd like to keep it... > > Michael >
diff --git a/drivers/dma/imx-sdma.c b/drivers/dma/imx-sdma.c index deea9aa..b5a1ee2 100644 --- a/drivers/dma/imx-sdma.c +++ b/drivers/dma/imx-sdma.c @@ -742,7 +742,7 @@ static int sdma_load_script(struct sdma_engine *sdma, void *buf, int size, spin_lock_irqsave(&sdma->channel_0_lock, flags); bd0->mode.command = C0_SETPM; - bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD; + bd0->mode.status = BD_DONE | BD_WRAP | BD_EXTD; bd0->mode.count = size / 2; bd0->buffer_addr = buf_phys; bd0->ext_buffer_addr = address; @@ -1064,7 +1064,7 @@ static int sdma_load_context(struct sdma_channel *sdmac) context->gReg[7] = sdmac->watermark_level; bd0->mode.command = C0_SETDM; - bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD; + bd0->mode.status = BD_DONE | BD_WRAP | BD_EXTD; bd0->mode.count = sizeof(*context) / 4; bd0->buffer_addr = sdma->context_phys; bd0->ext_buffer_addr = 2048 + (sizeof(*context) / 4) * channel;