diff mbox series

mailbox: mtk-cmdq: Fix sleeping function called from invalid context

Message ID 20240430083934.26980-1-jason-jh.lin@mediatek.com (mailing list archive)
State New
Headers show
Series mailbox: mtk-cmdq: Fix sleeping function called from invalid context | expand

Commit Message

Jason-JH Lin (林睿祥) April 30, 2024, 8:39 a.m. UTC
When we run kernel with lockdebug option, we will get the BUG below:
[  106.692124] BUG: sleeping function called from invalid context at drivers/base/power/runtime.c:1164
[  106.692190] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 3616, name: kworker/u17:3
[  106.692226] preempt_count: 1, expected: 0
[  106.692254] RCU nest depth: 0, expected: 0
[  106.692282] INFO: lockdep is turned off.
[  106.692306] irq event stamp: 0
[  106.692331] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[  106.692376] hardirqs last disabled at (0): [<ffffffee15d37fa0>] copy_process+0xc90/0x2ac0
[  106.692429] softirqs last  enabled at (0): [<ffffffee15d37fc4>] copy_process+0xcb4/0x2ac0
[  106.692473] softirqs last disabled at (0): [<0000000000000000>] 0x0
[  106.692513] CPU: 1 PID: 3616 Comm: kworker/u17:3 Not tainted 6.1.87-lockdep-14133-g26e933aca785 #1 6839942e1cf34914b0a366137843dd2366f52aa9
[  106.692556] Hardware name: Google Ciri sku0/unprovisioned board (DT)
[  106.692586] Workqueue: imgsys_runner imgsys_runner_func
[  106.692638] Call trace:
[  106.692662]  dump_backtrace+0x100/0x120
[  106.692702]  show_stack+0x20/0x2c
[  106.692737]  dump_stack_lvl+0x84/0xb4
[  106.692775]  dump_stack+0x18/0x48
[  106.692809]  __might_resched+0x354/0x4c0
[  106.692847]  __might_sleep+0x98/0xe4
[  106.692883]  __pm_runtime_resume+0x70/0x124
[  106.692921]  cmdq_mbox_send_data+0xe4/0xb1c
[  106.692964]  msg_submit+0x194/0x2dc
[  106.693003]  mbox_send_message+0x190/0x330
[  106.693043]  imgsys_cmdq_sendtask+0x1618/0x2224
[  106.693082]  imgsys_runner_func+0xac/0x11c
[  106.693118]  process_one_work+0x638/0xf84
[  106.693158]  worker_thread+0x808/0xcd0
[  106.693196]  kthread+0x24c/0x324
[  106.693231]  ret_from_fork+0x10/0x20

We found that there is a spin_lock_irqsave protection in msg_submit()
of mailbox.c.
So when cmdq driver calls pm_runtime_get_sync() in cmdq_mbox_send_data(),
it will get this BUG report.

Add pm_runtime_irq_safe() to let pm_runtime callbacks is safe in atomic
context with interrupts disabled.

Fixes: 8afe816b0c99 ("mailbox: mtk-cmdq-mailbox: Implement Runtime PM with autosuspend")
Signed-off-by: Jason-JH.Lin <jason-jh.lin@mediatek.com>
---
 drivers/mailbox/mtk-cmdq-mailbox.c | 1 +
 1 file changed, 1 insertion(+)

Comments

AngeloGioacchino Del Regno April 30, 2024, 9:39 a.m. UTC | #1
Il 30/04/24 10:39, Jason-JH.Lin ha scritto:
> When we run kernel with lockdebug option, we will get the BUG below:
> [  106.692124] BUG: sleeping function called from invalid context at drivers/base/power/runtime.c:1164
> [  106.692190] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 3616, name: kworker/u17:3
> [  106.692226] preempt_count: 1, expected: 0
> [  106.692254] RCU nest depth: 0, expected: 0
> [  106.692282] INFO: lockdep is turned off.
> [  106.692306] irq event stamp: 0
> [  106.692331] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
> [  106.692376] hardirqs last disabled at (0): [<ffffffee15d37fa0>] copy_process+0xc90/0x2ac0
> [  106.692429] softirqs last  enabled at (0): [<ffffffee15d37fc4>] copy_process+0xcb4/0x2ac0
> [  106.692473] softirqs last disabled at (0): [<0000000000000000>] 0x0
> [  106.692513] CPU: 1 PID: 3616 Comm: kworker/u17:3 Not tainted 6.1.87-lockdep-14133-g26e933aca785 #1 6839942e1cf34914b0a366137843dd2366f52aa9
> [  106.692556] Hardware name: Google Ciri sku0/unprovisioned board (DT)
> [  106.692586] Workqueue: imgsys_runner imgsys_runner_func
> [  106.692638] Call trace:
> [  106.692662]  dump_backtrace+0x100/0x120
> [  106.692702]  show_stack+0x20/0x2c
> [  106.692737]  dump_stack_lvl+0x84/0xb4
> [  106.692775]  dump_stack+0x18/0x48
> [  106.692809]  __might_resched+0x354/0x4c0
> [  106.692847]  __might_sleep+0x98/0xe4
> [  106.692883]  __pm_runtime_resume+0x70/0x124
> [  106.692921]  cmdq_mbox_send_data+0xe4/0xb1c
> [  106.692964]  msg_submit+0x194/0x2dc
> [  106.693003]  mbox_send_message+0x190/0x330
> [  106.693043]  imgsys_cmdq_sendtask+0x1618/0x2224
> [  106.693082]  imgsys_runner_func+0xac/0x11c
> [  106.693118]  process_one_work+0x638/0xf84
> [  106.693158]  worker_thread+0x808/0xcd0
> [  106.693196]  kthread+0x24c/0x324
> [  106.693231]  ret_from_fork+0x10/0x20
> 
> We found that there is a spin_lock_irqsave protection in msg_submit()
> of mailbox.c.
> So when cmdq driver calls pm_runtime_get_sync() in cmdq_mbox_send_data(),
> it will get this BUG report.
> 
> Add pm_runtime_irq_safe() to let pm_runtime callbacks is safe in atomic
> context with interrupts disabled.
> 

I see. The problem with this is that pm_runtime_irq_safe() will raise the refcount
of the parent device "forever", which isn't the best and partially defeats what we
are trying to do here (keeping stuff off unless really needed).

At this point I wonder if it'd be just better to really fix this instead of working
around the problem.

I'd say that one of the options would be to perform a change to msg_submit() so
that it will take into account that a .send_data() callback might be using RPM
functions.

Ideas? :-)

Cheers,
Angelo

> Fixes: 8afe816b0c99 ("mailbox: mtk-cmdq-mailbox: Implement Runtime PM with autosuspend")
> Signed-off-by: Jason-JH.Lin <jason-jh.lin@mediatek.com>
> ---
>   drivers/mailbox/mtk-cmdq-mailbox.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/mailbox/mtk-cmdq-mailbox.c b/drivers/mailbox/mtk-cmdq-mailbox.c
> index ead2200f39ba..3b4f19633c83 100644
> --- a/drivers/mailbox/mtk-cmdq-mailbox.c
> +++ b/drivers/mailbox/mtk-cmdq-mailbox.c
> @@ -694,6 +694,7 @@ static int cmdq_probe(struct platform_device *pdev)
>   
>   	pm_runtime_set_autosuspend_delay(dev, CMDQ_MBOX_AUTOSUSPEND_DELAY_MS);
>   	pm_runtime_use_autosuspend(dev);
> +	pm_runtime_irq_safe(dev);
>   
>   	return 0;
>   }
Jassi Brar May 1, 2024, 12:57 a.m. UTC | #2
On Tue, Apr 30, 2024 at 4:39 AM AngeloGioacchino Del Regno
<angelogioacchino.delregno@collabora.com> wrote:
>
> Il 30/04/24 10:39, Jason-JH.Lin ha scritto:
> > When we run kernel with lockdebug option, we will get the BUG below:
> > [  106.692124] BUG: sleeping function called from invalid context at drivers/base/power/runtime.c:1164
> > [  106.692190] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 3616, name: kworker/u17:3
> > [  106.692226] preempt_count: 1, expected: 0
> > [  106.692254] RCU nest depth: 0, expected: 0
> > [  106.692282] INFO: lockdep is turned off.
> > [  106.692306] irq event stamp: 0
> > [  106.692331] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
> > [  106.692376] hardirqs last disabled at (0): [<ffffffee15d37fa0>] copy_process+0xc90/0x2ac0
> > [  106.692429] softirqs last  enabled at (0): [<ffffffee15d37fc4>] copy_process+0xcb4/0x2ac0
> > [  106.692473] softirqs last disabled at (0): [<0000000000000000>] 0x0
> > [  106.692513] CPU: 1 PID: 3616 Comm: kworker/u17:3 Not tainted 6.1.87-lockdep-14133-g26e933aca785 #1 6839942e1cf34914b0a366137843dd2366f52aa9
> > [  106.692556] Hardware name: Google Ciri sku0/unprovisioned board (DT)
> > [  106.692586] Workqueue: imgsys_runner imgsys_runner_func
> > [  106.692638] Call trace:
> > [  106.692662]  dump_backtrace+0x100/0x120
> > [  106.692702]  show_stack+0x20/0x2c
> > [  106.692737]  dump_stack_lvl+0x84/0xb4
> > [  106.692775]  dump_stack+0x18/0x48
> > [  106.692809]  __might_resched+0x354/0x4c0
> > [  106.692847]  __might_sleep+0x98/0xe4
> > [  106.692883]  __pm_runtime_resume+0x70/0x124
> > [  106.692921]  cmdq_mbox_send_data+0xe4/0xb1c
> > [  106.692964]  msg_submit+0x194/0x2dc
> > [  106.693003]  mbox_send_message+0x190/0x330
> > [  106.693043]  imgsys_cmdq_sendtask+0x1618/0x2224
> > [  106.693082]  imgsys_runner_func+0xac/0x11c
> > [  106.693118]  process_one_work+0x638/0xf84
> > [  106.693158]  worker_thread+0x808/0xcd0
> > [  106.693196]  kthread+0x24c/0x324
> > [  106.693231]  ret_from_fork+0x10/0x20
> >
> > We found that there is a spin_lock_irqsave protection in msg_submit()
> > of mailbox.c.
> > So when cmdq driver calls pm_runtime_get_sync() in cmdq_mbox_send_data(),
> > it will get this BUG report.
> >
> > Add pm_runtime_irq_safe() to let pm_runtime callbacks is safe in atomic
> > context with interrupts disabled.
> >
>
> I see. The problem with this is that pm_runtime_irq_safe() will raise the refcount
> of the parent device "forever", which isn't the best and partially defeats what we
> are trying to do here (keeping stuff off unless really needed).
>
> At this point I wonder if it'd be just better to really fix this instead of working
> around the problem.
>
> I'd say that one of the options would be to perform a change to msg_submit() so
> that it will take into account that a .send_data() callback might be using RPM
> functions.
>
> Ideas? :-)
>
@send_data:  The API asks the MBOX controller driver, in atomic
 *      context try to transmit a message on the bus. Returns 0 if
 *      data is accepted for transmission, -EBUSY while rejecting
 *      if the remote hasn't yet read the last data sent. Actual
 *      transmission of data is reported by the controller via
 *      mbox_chan_txdone (if it has some TX ACK irq). It must not
 *      sleep.

As long as send_data() does not sleep, ideas are welcome.

Cheers.
Jason-JH Lin (林睿祥) May 1, 2024, 7:09 a.m. UTC | #3
Hi Angelo,

Thanks for the reviews.

On Tue, 2024-04-30 at 11:39 +0200, AngeloGioacchino Del Regno wrote:
> Il 30/04/24 10:39, Jason-JH.Lin ha scritto:
> > When we run kernel with lockdebug option, we will get the BUG
> > below:
> > [  106.692124] BUG: sleeping function called from invalid context
> > at drivers/base/power/runtime.c:1164
> > [  106.692190] in_atomic(): 1, irqs_disabled(): 128, non_block: 0,
> > pid: 3616, name: kworker/u17:3
> > [  106.692226] preempt_count: 1, expected: 0
> > [  106.692254] RCU nest depth: 0, expected: 0
> > [  106.692282] INFO: lockdep is turned off.
> > [  106.692306] irq event stamp: 0
> > [  106.692331] hardirqs last  enabled at (0): [<0000000000000000>]
> > 0x0
> > [  106.692376] hardirqs last disabled at (0): [<ffffffee15d37fa0>]
> > copy_process+0xc90/0x2ac0
> > [  106.692429] softirqs last  enabled at (0): [<ffffffee15d37fc4>]
> > copy_process+0xcb4/0x2ac0
> > [  106.692473] softirqs last disabled at (0): [<0000000000000000>]
> > 0x0
> > [  106.692513] CPU: 1 PID: 3616 Comm: kworker/u17:3 Not tainted
> > 6.1.87-lockdep-14133-g26e933aca785 #1
> > 6839942e1cf34914b0a366137843dd2366f52aa9
> > [  106.692556] Hardware name: Google Ciri sku0/unprovisioned board
> > (DT)
> > [  106.692586] Workqueue: imgsys_runner imgsys_runner_func
> > [  106.692638] Call trace:
> > [  106.692662]  dump_backtrace+0x100/0x120
> > [  106.692702]  show_stack+0x20/0x2c
> > [  106.692737]  dump_stack_lvl+0x84/0xb4
> > [  106.692775]  dump_stack+0x18/0x48
> > [  106.692809]  __might_resched+0x354/0x4c0
> > [  106.692847]  __might_sleep+0x98/0xe4
> > [  106.692883]  __pm_runtime_resume+0x70/0x124
> > [  106.692921]  cmdq_mbox_send_data+0xe4/0xb1c
> > [  106.692964]  msg_submit+0x194/0x2dc
> > [  106.693003]  mbox_send_message+0x190/0x330
> > [  106.693043]  imgsys_cmdq_sendtask+0x1618/0x2224
> > [  106.693082]  imgsys_runner_func+0xac/0x11c
> > [  106.693118]  process_one_work+0x638/0xf84
> > [  106.693158]  worker_thread+0x808/0xcd0
> > [  106.693196]  kthread+0x24c/0x324
> > [  106.693231]  ret_from_fork+0x10/0x20
> > 
> > We found that there is a spin_lock_irqsave protection in
> > msg_submit()
> > of mailbox.c.
> > So when cmdq driver calls pm_runtime_get_sync() in
> > cmdq_mbox_send_data(),
> > it will get this BUG report.
> > 
> > Add pm_runtime_irq_safe() to let pm_runtime callbacks is safe in
> > atomic
> > context with interrupts disabled.
> > 
> 
> I see. The problem with this is that pm_runtime_irq_safe() will raise
> the refcount
> of the parent device "forever", which isn't the best and partially
> defeats what we
> are trying to do here (keeping stuff off unless really needed).
> 
> At this point I wonder if it'd be just better to really fix this
> instead of working
> around the problem.
> 
> I'd say that one of the options would be to perform a change to
> msg_submit() so
> that it will take into account that a .send_data() callback might be
> using RPM
> functions.
> 
> Ideas? :-)
> 
> Cheers,
> Angelo
> 

As Jassi mentioned, .send_data() can not use sleep, so we can not use
pm_runtime_get_sync() in cmdq_mbox_send_data().

Because we have to make sure the clocks is enabled before setting GCE
registers inside the cmdq_mbox_send_data(). I think we can't use the
ASYNC pm_runtime_get() instead of pm_runtime_get_sync() either.

And you're right, I didn't notice that raising the irq_safe flag will
make the parent device can not be suspended.

How about this:
1. Use clk_bluk_enable() + pm_runtime_get() instead of 
pm_runtime_get_sync() everywhere in mtk-cmdq-mailbox.c.

2. Use refcount++ instead of clk_bluk_enable() in
cmdq_runtime_resume().

3. Call clk_bluk_disable() when refcount > 0 in cmdq_runtime_suspend().

Is this an option?

Regards,
Jason-JH.Lin

> > Fixes: 8afe816b0c99 ("mailbox: mtk-cmdq-mailbox: Implement Runtime
> > PM with autosuspend")
> > Signed-off-by: Jason-JH.Lin <jason-jh.lin@mediatek.com>
> > ---
> >   drivers/mailbox/mtk-cmdq-mailbox.c | 1 +
> >   1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/mailbox/mtk-cmdq-mailbox.c
> > b/drivers/mailbox/mtk-cmdq-mailbox.c
> > index ead2200f39ba..3b4f19633c83 100644
> > --- a/drivers/mailbox/mtk-cmdq-mailbox.c
> > +++ b/drivers/mailbox/mtk-cmdq-mailbox.c
> > @@ -694,6 +694,7 @@ static int cmdq_probe(struct platform_device
> > *pdev)
> >   
> >   	pm_runtime_set_autosuspend_delay(dev,
> > CMDQ_MBOX_AUTOSUSPEND_DELAY_MS);
> >   	pm_runtime_use_autosuspend(dev);
> > +	pm_runtime_irq_safe(dev);
> >   
> >   	return 0;
> >   }
> 
> 
>
diff mbox series

Patch

diff --git a/drivers/mailbox/mtk-cmdq-mailbox.c b/drivers/mailbox/mtk-cmdq-mailbox.c
index ead2200f39ba..3b4f19633c83 100644
--- a/drivers/mailbox/mtk-cmdq-mailbox.c
+++ b/drivers/mailbox/mtk-cmdq-mailbox.c
@@ -694,6 +694,7 @@  static int cmdq_probe(struct platform_device *pdev)
 
 	pm_runtime_set_autosuspend_delay(dev, CMDQ_MBOX_AUTOSUSPEND_DELAY_MS);
 	pm_runtime_use_autosuspend(dev);
+	pm_runtime_irq_safe(dev);
 
 	return 0;
 }