diff mbox series

[v1] dmaengine: imx-sdma: remove BD_INTR for channel0

Message ID 20190614083959.37944-1-yibin.gong@nxp.com (mailing list archive)
State Changes Requested
Headers show
Series [v1] dmaengine: imx-sdma: remove BD_INTR for channel0 | expand

Commit Message

Robin Gong June 14, 2019, 8:39 a.m. UTC
From: Robin Gong <yibin.gong@nxp.com>

It is possible for an irq triggered by channel0 to be received later
after clks are disabled once firmware loaded during sdma probe. If
that happens then clearing them by writing to SDMA_H_INTR won't work
and the system will hang processing infinite interrupts. Actually,
don't need interrupt triggered on channel0 since it's pollling
SDMA_H_STATSTOP to know channel0 done rather than interrupt in
current code, just clear BD_INTR to disable channel0 interrupt to
avoid the above case.

Signed-off-by: Robin Gong <yibin.gong@nxp.com>
Reported-by: Sven Van Asbroeck <thesven73@gmail.com>
---
 drivers/dma/imx-sdma.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Fabio Estevam June 14, 2019, 10:49 a.m. UTC | #1
Hi Robin,

On Fri, Jun 14, 2019 at 5:38 AM <yibin.gong@nxp.com> wrote:
>
> From: Robin Gong <yibin.gong@nxp.com>
>
> It is possible for an irq triggered by channel0 to be received later
> after clks are disabled once firmware loaded during sdma probe. If
> that happens then clearing them by writing to SDMA_H_INTR won't work
> and the system will hang processing infinite interrupts. Actually,
> don't need interrupt triggered on channel0 since it's pollling
> SDMA_H_STATSTOP to know channel0 done rather than interrupt in
> current code, just clear BD_INTR to disable channel0 interrupt to
> avoid the above case.
>
> Signed-off-by: Robin Gong <yibin.gong@nxp.com>
> Reported-by: Sven Van Asbroeck <thesven73@gmail.com>

According to the original report from Sven the issue started to happen
on 5.0, so it would be good to add a Fixes tag and Cc stable so that
this fix could be backported to 5.0/5.1 stable trees.

Thanks
Sven Van Asbroeck June 14, 2019, 1:25 p.m. UTC | #2
On Fri, Jun 14, 2019 at 6:49 AM Fabio Estevam <festevam@gmail.com> wrote:
>
> According to the original report from Sven the issue started to happen
> on 5.0, so it would be good to add a Fixes tag and Cc stable so that
> this fix could be backported to 5.0/5.1 stable trees.

Good catch !

However, the issue is highly timing-dependent. It will come and go depending
on the kernel version, devicetree and defconfig. If it works for me on
4.19, that
doesn't mean the bug is gone on 4.19.

Looking at the commit history, I think the commit below possibly introduced the
issue. Until this commit, sdma_run_channel() would wait on the interrupt
before proceeding. It has been there since 4.8:

Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in the
interrupt handler")

But my knowledge of imx-sdma is non-existent, so I invite the more knowledgeable
people in this thread to take a look at this commit.

[Adding Michael Olbrich to the thread]
Sven Van Asbroeck June 14, 2019, 1:35 p.m. UTC | #3
Hi Robin, see comments inline.

On Fri, Jun 14, 2019 at 4:38 AM <yibin.gong@nxp.com> wrote:
>
> diff --git a/drivers/dma/imx-sdma.c b/drivers/dma/imx-sdma.c
> index deea9aa..b5a1ee2 100644
> --- a/drivers/dma/imx-sdma.c
> +++ b/drivers/dma/imx-sdma.c
> @@ -742,7 +742,7 @@ static int sdma_load_script(struct sdma_engine *sdma, void *buf, int size,
>         spin_lock_irqsave(&sdma->channel_0_lock, flags);
>
>         bd0->mode.command = C0_SETPM;
> -       bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD;
> +       bd0->mode.status = BD_DONE | BD_WRAP | BD_EXTD;

I tested this change on its own, and it seemed sufficient to make the crash
disappear.

>         bd0->mode.count = size / 2;
>         bd0->buffer_addr = buf_phys;
>         bd0->ext_buffer_addr = address;
> @@ -1064,7 +1064,7 @@ static int sdma_load_context(struct sdma_channel *sdmac)
>         context->gReg[7] = sdmac->watermark_level;
>
>         bd0->mode.command = C0_SETDM;
> -       bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD;
> +       bd0->mode.status = BD_DONE | BD_WRAP | BD_EXTD;

This function isn't part of the firmware load path, so how can it be related
to fixing the firmware crash?

If this is an unrelated efficiency saving, maybe it should go into its
own patch?
Maybe we want bugfix patches to be as small and specific as possible, so they
can more easily be backported to older kernels?

>         bd0->mode.count = sizeof(*context) / 4;
>         bd0->buffer_addr = sdma->context_phys;
>         bd0->ext_buffer_addr = 2048 + (sizeof(*context) / 4) * channel;
> --
> 2.7.4
>
Michael Olbrich June 14, 2019, 6:09 p.m. UTC | #4
On Fri, Jun 14, 2019 at 09:25:51AM -0400, Sven Van Asbroeck wrote:
> On Fri, Jun 14, 2019 at 6:49 AM Fabio Estevam <festevam@gmail.com> wrote:
> >
> > According to the original report from Sven the issue started to happen
> > on 5.0, so it would be good to add a Fixes tag and Cc stable so that
> > this fix could be backported to 5.0/5.1 stable trees.
> 
> Good catch !
> 
> However, the issue is highly timing-dependent. It will come and go depending
> on the kernel version, devicetree and defconfig. If it works for me on
> 4.19, that
> doesn't mean the bug is gone on 4.19.
> 
> Looking at the commit history, I think the commit below possibly introduced the
> issue. Until this commit, sdma_run_channel() would wait on the interrupt
> before proceeding. It has been there since 4.8:
> 
> Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in the
> interrupt handler")

I think this is correct. Starting with this commit, the interrupt status fr
channel 0 is no longer cleared in sdma_run_channel0() and
sdma_int_handler() is always called for channel 0.
During firmware loading the interrupts are enabled again just before the
clocks are disabled. The interrupt is pending at this moment so on a single
core system I think this will always work as expected. If the firmware
loading and the interrupt handler run on different cores then this is racy.
Maybe something else changed to make this more likely?

With this new change sdma_int_handler() is no longer called for channel 0
right, so you should also remove the special handling there.

Michael
Robin Gong June 17, 2019, 1:42 a.m. UTC | #5
On 2019-06-14 at 13:35 +0000, Sven Van Asbroeck wrote:
> Hi Robin, see comments inline.
> 
> On Fri, Jun 14, 2019 at 4:38 AM <yibin.gong@nxp.com> wrote:
> > 
> > 
> > diff --git a/drivers/dma/imx-sdma.c b/drivers/dma/imx-sdma.c
> > index deea9aa..b5a1ee2 100644
> > --- a/drivers/dma/imx-sdma.c
> > +++ b/drivers/dma/imx-sdma.c
> > @@ -742,7 +742,7 @@ static int sdma_load_script(struct sdma_engine
> > *sdma, void *buf, int size,
> >         spin_lock_irqsave(&sdma->channel_0_lock, flags);
> > 
> >         bd0->mode.command = C0_SETPM;
> > -       bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD;
> > +       bd0->mode.status = BD_DONE | BD_WRAP | BD_EXTD;
> I tested this change on its own, and it seemed sufficient to make the
> crash
> disappear.
> 
> > 
> >         bd0->mode.count = size / 2;
> >         bd0->buffer_addr = buf_phys;
> >         bd0->ext_buffer_addr = address;
> > @@ -1064,7 +1064,7 @@ static int sdma_load_context(struct
> > sdma_channel *sdmac)
> >         context->gReg[7] = sdmac->watermark_level;
> > 
> >         bd0->mode.command = C0_SETDM;
> > -       bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD;
> > +       bd0->mode.status = BD_DONE | BD_WRAP | BD_EXTD;
> This function isn't part of the firmware load path, so how can it be
> related
> to fixing the firmware crash?
Yes, that's not in your firmware load case, but I want to keep the same
behavior for all channel0 conditions, don't worry, harmless here. 
> 
> If this is an unrelated efficiency saving, maybe it should go into
> its
> own patch?
> Maybe we want bugfix patches to be as small and specific as possible,
> so they
> can more easily be backported to older kernels?
> 
> > 
> >         bd0->mode.count = sizeof(*context) / 4;
> >         bd0->buffer_addr = sdma->context_phys;
> >         bd0->ext_buffer_addr = 2048 + (sizeof(*context) / 4) *
> > channel;
> > --
> > 2.7.4
> >
Robin Gong June 17, 2019, 2:02 a.m. UTC | #6
On 2019-06-14 at 13:25 +0000, Sven Van Asbroeck wrote:
> On Fri, Jun 14, 2019 at 6:49 AM Fabio Estevam <festevam@gmail.com>
> wrote:
> > 
> > 
> > According to the original report from Sven the issue started to
> > happen
> > on 5.0, so it would be good to add a Fixes tag and Cc stable so
> > that
> > this fix could be backported to 5.0/5.1 stable trees.
> Good catch !
> 
> However, the issue is highly timing-dependent. It will come and go
> depending
> on the kernel version, devicetree and defconfig. If it works for me
> on
> 4.19, that
> doesn't mean the bug is gone on 4.19.
The default imx defconfig and dts should be ok, because firmware load
is delayed after rootfs mounted where firmware located in and before
that, some driver which use sdma such as spi/uart/audio may have
already enable sdma clock which means channel0 interrupt could be
cleared immediately without interrupt storm. That's why I can't
reproduce your issue at first, but catch it once I sync with your
directly firmware load defconfig. So seems not very must to CC stable
tree?
> 
> Looking at the commit history, I think the commit below possibly
> introduced the
> issue. Until this commit, sdma_run_channel() would wait on the
> interrupt
> before proceeding. It has been there since 4.8:
> 
> Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in the
> interrupt handler")
Yes, but Michael's patch is the right direction, at least it fix RT
case and meaningless channel0 interrupt storm coming before clearing
channel0 interrupt status in sdma_run_channel0(). Now, this patch could
fix its minor side-effect.
> 
> But my knowledge of imx-sdma is non-existent, so I invite the more
> knowledgeable
> people in this thread to take a look at this commit.
> 
> [Adding Michael Olbrich to the thread]
Robin Gong June 17, 2019, 2:14 a.m. UTC | #7
On 2019-06-14 at 18:09 +0000, Michael Olbrich wrote:
> On Fri, Jun 14, 2019 at 09:25:51AM -0400, Sven Van Asbroeck wrote:
> > 
> > On Fri, Jun 14, 2019 at 6:49 AM Fabio Estevam <festevam@gmail.com>
> > wrote:
> > > 
> > > 
> > > According to the original report from Sven the issue started to
> > > happen
> > > on 5.0, so it would be good to add a Fixes tag and Cc stable so
> > > that
> > > this fix could be backported to 5.0/5.1 stable trees.
> > Good catch !
> > 
> > However, the issue is highly timing-dependent. It will come and go
> > depending
> > on the kernel version, devicetree and defconfig. If it works for me
> > on
> > 4.19, that
> > doesn't mean the bug is gone on 4.19.
> > 
> > Looking at the commit history, I think the commit below possibly
> > introduced the
> > issue. Until this commit, sdma_run_channel() would wait on the
> > interrupt
> > before proceeding. It has been there since 4.8:
> > 
> > Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in the
> > interrupt handler")
> I think this is correct. Starting with this commit, the interrupt
> status fr
> channel 0 is no longer cleared in sdma_run_channel0() and
> sdma_int_handler() is always called for channel 0.
> During firmware loading the interrupts are enabled again just before
> the
> clocks are disabled. The interrupt is pending at this moment so on a
> single
> core system I think this will always work as expected. If the
> firmware
> loading and the interrupt handler run on different cores then this is
> racy.
> Maybe something else changed to make this more likely?
> 
> With this new change sdma_int_handler() is no longer called for
> channel 0
> right, so you should also remove the special handling there.
What's 'special handling' should be removed here? Do you mean put below
pieces of your patch back?
 static int sdma_load_script(struct sdma_engine *sdma, void *buf, int
size,
@@ -727,9 +720,9 @@ static irqreturn_t sdma_int_handler(int irq, void
*dev_id)
        unsigned long stat;
 
        stat = readl_relaxed(sdma->regs + SDMA_H_INTR);
-       /* not interested in channel 0 interrupts */
-       stat &= ~1;
        writel_relaxed(stat, sdma->regs + SDMA_H_INTR);
+       /* channel 0 is special and not handled here, see
run_channel0() */
+       stat &= ~1;

> 
> Michael
>
Michael Olbrich June 17, 2019, 10:15 a.m. UTC | #8
On Mon, Jun 17, 2019 at 02:14:34AM +0000, Robin Gong wrote:
> On 2019-06-14 at 18:09 +0000, Michael Olbrich wrote:
> > On Fri, Jun 14, 2019 at 09:25:51AM -0400, Sven Van Asbroeck wrote:
> > > 
> > > On Fri, Jun 14, 2019 at 6:49 AM Fabio Estevam <festevam@gmail.com>
> > > wrote:
> > > > 
> > > > 
> > > > According to the original report from Sven the issue started to
> > > > happen
> > > > on 5.0, so it would be good to add a Fixes tag and Cc stable so
> > > > that
> > > > this fix could be backported to 5.0/5.1 stable trees.
> > > Good catch !
> > > 
> > > However, the issue is highly timing-dependent. It will come and go
> > > depending
> > > on the kernel version, devicetree and defconfig. If it works for me
> > > on
> > > 4.19, that
> > > doesn't mean the bug is gone on 4.19.
> > > 
> > > Looking at the commit history, I think the commit below possibly
> > > introduced the
> > > issue. Until this commit, sdma_run_channel() would wait on the
> > > interrupt
> > > before proceeding. It has been there since 4.8:
> > > 
> > > Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in the
> > > interrupt handler")
> > I think this is correct. Starting with this commit, the interrupt
> > status fr
> > channel 0 is no longer cleared in sdma_run_channel0() and
> > sdma_int_handler() is always called for channel 0.
> > During firmware loading the interrupts are enabled again just before
> > the
> > clocks are disabled. The interrupt is pending at this moment so on a
> > single
> > core system I think this will always work as expected. If the
> > firmware
> > loading and the interrupt handler run on different cores then this is
> > racy.
> > Maybe something else changed to make this more likely?
> > 
> > With this new change sdma_int_handler() is no longer called for
> > channel 0
> > right, so you should also remove the special handling there.
> What's 'special handling' should be removed here? Do you mean put below
> pieces of your patch back?
>  static int sdma_load_script(struct sdma_engine *sdma, void *buf, int
> size,
> @@ -727,9 +720,9 @@ static irqreturn_t sdma_int_handler(int irq, void
> *dev_id)
>         unsigned long stat;
>  
>         stat = readl_relaxed(sdma->regs + SDMA_H_INTR);
> -       /* not interested in channel 0 interrupts */
> -       stat &= ~1;
>         writel_relaxed(stat, sdma->regs + SDMA_H_INTR);
> +       /* channel 0 is special and not handled here, see
> run_channel0() */
> +       stat &= ~1;

I think the "stat &= ~1;" can be removed completely. This bit should never
be set, now that the interrupt for channel 0 is disabled.

Michael
Sven Van Asbroeck June 17, 2019, 1:27 p.m. UTC | #9
Hello Robin,

On Sun, Jun 16, 2019 at 10:02 PM Robin Gong <yibin.gong@nxp.com> wrote:
>
> The default imx defconfig and dts should be ok, because firmware load
> is delayed after rootfs mounted where firmware located in and before
> that, some driver which use sdma such as spi/uart/audio may have
> already enable sdma clock which means channel0 interrupt could be
> cleared immediately without interrupt storm. That's why I can't
> reproduce your issue at first, but catch it once I sync with your
> directly firmware load defconfig. So seems not very must to CC stable
> tree?

As far as I know, the bug/crash does not happen if you're loading the
sdma firmware from a filesystem. So the vast majority of users would
never see the crash.

I agree that this is not a high-priority bugfix. But it's worthwhile for the
stable trees to have it.

> Yes, but Michael's patch is the right direction, at least it fix RT
> case and meaningless channel0 interrupt storm coming before clearing
> channel0 interrupt status in sdma_run_channel0(). Now, this patch could
> fix its minor side-effect.

I'm not suggesting that we should revert or change Michael's patch. Just
that it would be good for the v2 patch to contain:

Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in the
interrupt handler")

This should allow stable maintainers to pull in your patch if and only if
their release already contains 1d069bfa3c78.
Robin Gong June 18, 2019, 5:50 a.m. UTC | #10
On 2019-06-17 at 09:27 -0400, Sven Van Asbroeck wrote:
> Hello Robin,
> 
> On Sun, Jun 16, 2019 at 10:02 PM Robin Gong <yibin.gong@nxp.com>
> wrote:
> > 
> > 
> > The default imx defconfig and dts should be ok, because firmware
> > load
> > is delayed after rootfs mounted where firmware located in and
> > before
> > that, some driver which use sdma such as spi/uart/audio may have
> > already enable sdma clock which means channel0 interrupt could be
> > cleared immediately without interrupt storm. That's why I can't
> > reproduce your issue at first, but catch it once I sync with your
> > directly firmware load defconfig. So seems not very must to CC
> > stable
> > tree?
> As far as I know, the bug/crash does not happen if you're loading the
> sdma firmware from a filesystem. So the vast majority of users would
> never see the crash.
> 
> I agree that this is not a high-priority bugfix. But it's worthwhile
> for the
> stable trees to have it.
> 
> > 
> > Yes, but Michael's patch is the right direction, at least it fix RT
> > case and meaningless channel0 interrupt storm coming before
> > clearing
> > channel0 interrupt status in sdma_run_channel0(). Now, this patch
> > could
> > fix its minor side-effect.
> I'm not suggesting that we should revert or change Michael's patch.
> Just
> that it would be good for the v2 patch to contain:
> 
> Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in the
> interrupt handler")
> 
> This should allow stable maintainers to pull in your patch if and
> only if
> their release already contains 1d069bfa3c78.
Okay, would add Fixes tag into v2.
Robin Gong June 18, 2019, 6:08 a.m. UTC | #11
On 2019-06-17 at 12:15 +0200, m.olbrich@pengutronix.de wrote:
> On Mon, Jun 17, 2019 at 02:14:34AM +0000, Robin Gong wrote:
> > 
> > On 2019-06-14 at 18:09 +0000, Michael Olbrich wrote:
> > > 
> > > On Fri, Jun 14, 2019 at 09:25:51AM -0400, Sven Van Asbroeck
> > > wrote:
> > > > 
> > > > 
> > > > On Fri, Jun 14, 2019 at 6:49 AM Fabio Estevam <festevam@gmail.c
> > > > om>
> > > > wrote:
> > > > > 
> > > > > 
> > > > > 
> > > > > According to the original report from Sven the issue started
> > > > > to
> > > > > happen
> > > > > on 5.0, so it would be good to add a Fixes tag and Cc stable
> > > > > so
> > > > > that
> > > > > this fix could be backported to 5.0/5.1 stable trees.
> > > > Good catch !
> > > > 
> > > > However, the issue is highly timing-dependent. It will come and
> > > > go
> > > > depending
> > > > on the kernel version, devicetree and defconfig. If it works
> > > > for me
> > > > on
> > > > 4.19, that
> > > > doesn't mean the bug is gone on 4.19.
> > > > 
> > > > Looking at the commit history, I think the commit below
> > > > possibly
> > > > introduced the
> > > > issue. Until this commit, sdma_run_channel() would wait on the
> > > > interrupt
> > > > before proceeding. It has been there since 4.8:
> > > > 
> > > > Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in
> > > > the
> > > > interrupt handler")
> > > I think this is correct. Starting with this commit, the interrupt
> > > status fr
> > > channel 0 is no longer cleared in sdma_run_channel0() and
> > > sdma_int_handler() is always called for channel 0.
> > > During firmware loading the interrupts are enabled again just
> > > before
> > > the
> > > clocks are disabled. The interrupt is pending at this moment so
> > > on a
> > > single
> > > core system I think this will always work as expected. If the
> > > firmware
> > > loading and the interrupt handler run on different cores then
> > > this is
> > > racy.
> > > Maybe something else changed to make this more likely?
> > > 
> > > With this new change sdma_int_handler() is no longer called for
> > > channel 0
> > > right, so you should also remove the special handling there.
> > What's 'special handling' should be removed here? Do you mean put
> > below
> > pieces of your patch back?
> >  static int sdma_load_script(struct sdma_engine *sdma, void *buf,
> > int
> > size,
> > @@ -727,9 +720,9 @@ static irqreturn_t sdma_int_handler(int irq,
> > void
> > *dev_id)
> >         unsigned long stat;
> >  
> >         stat = readl_relaxed(sdma->regs + SDMA_H_INTR);
> > -       /* not interested in channel 0 interrupts */
> > -       stat &= ~1;
> >         writel_relaxed(stat, sdma->regs + SDMA_H_INTR);
> > +       /* channel 0 is special and not handled here, see
> > run_channel0() */
> > +       stat &= ~1;
> I think the "stat &= ~1;" can be removed completely. This bit should
> never
> be set, now that the interrupt for channel 0 is disabled.
Okay, but that's harmless, moreover, I like your comment '/* channel 0
is special and not handled here, see run_channel0() */' which said
clearly channel0 interrupt is a special one and NOT handled in
sdma_int_handler. So I'd like to keep it...  
> 
> Michael
>
diff mbox series

Patch

diff --git a/drivers/dma/imx-sdma.c b/drivers/dma/imx-sdma.c
index deea9aa..b5a1ee2 100644
--- a/drivers/dma/imx-sdma.c
+++ b/drivers/dma/imx-sdma.c
@@ -742,7 +742,7 @@  static int sdma_load_script(struct sdma_engine *sdma, void *buf, int size,
 	spin_lock_irqsave(&sdma->channel_0_lock, flags);
 
 	bd0->mode.command = C0_SETPM;
-	bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD;
+	bd0->mode.status = BD_DONE | BD_WRAP | BD_EXTD;
 	bd0->mode.count = size / 2;
 	bd0->buffer_addr = buf_phys;
 	bd0->ext_buffer_addr = address;
@@ -1064,7 +1064,7 @@  static int sdma_load_context(struct sdma_channel *sdmac)
 	context->gReg[7] = sdmac->watermark_level;
 
 	bd0->mode.command = C0_SETDM;
-	bd0->mode.status = BD_DONE | BD_INTR | BD_WRAP | BD_EXTD;
+	bd0->mode.status = BD_DONE | BD_WRAP | BD_EXTD;
 	bd0->mode.count = sizeof(*context) / 4;
 	bd0->buffer_addr = sdma->context_phys;
 	bd0->ext_buffer_addr = 2048 + (sizeof(*context) / 4) * channel;