[v3] remoteproc: Add a new remoteproc state RPROC_DEFUNCT

Message ID 20241016045546.2613436-1-quic_mojha@quicinc.com (mailing list archive)
State New

Commit Message

Mukesh Ojha Oct. 16, 2024, 4:55 a.m. UTC
Multiple calls to glink_subdev_stop() for the same remoteproc can happen
if rproc_stop() fails in Process-A, which leaves the rproc in the
RPROC_CRASHED state. A later call to recovery_store() from user space in
Process-B then triggers rproc_trigger_recovery() for the same remoteproc,
which results in a NULL pointer dereference in
qcom_glink_smem_unregister().

There is another side to this issue: fixing it by adding a NULL check on
glink->edge does not guarantee that the remoteproc will recover on the
second attempt from Process-B. The first attempt in Process-A failed
during the SMC shutdown call, the second may fail at the same call, and
the rproc cannot recover in that case.

Add a new rproc state, RPROC_DEFUNCT, i.e. a non-recoverable state of the
remoteproc from which the only way to recover is a system restart.

	Process-A                			Process-B

  fatal error interrupt happens

  rproc_crash_handler_work()
    mutex_lock_interruptible(&rproc->lock);
    ...

       rproc->state = RPROC_CRASHED;
    ...
    mutex_unlock(&rproc->lock);

    rproc_trigger_recovery()
     mutex_lock_interruptible(&rproc->lock);

      adsp_stop()
      qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
      remoteproc remoteproc3: can't stop rproc: -22
     mutex_unlock(&rproc->lock);

						echo enabled > /sys/class/remoteproc/remoteprocX/recovery
						recovery_store()
						 rproc_trigger_recovery()
						  mutex_lock_interruptible(&rproc->lock);
						   rproc_stop()
						    glink_subdev_stop()
						      qcom_glink_smem_unregister() ==|
                                                                                     |
                                                                                     V
						      Unable to handle kernel NULL pointer dereference
                                                                at virtual address 0000000000000358

Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
---
Changes in v3:
 - Fix kernel test reported error.

Changes in v2:
 - Removed NULL pointer check instead added a new state to signify
   non-recoverable state of remoteproc.

 drivers/remoteproc/remoteproc_core.c  | 3 ++-
 drivers/remoteproc/remoteproc_sysfs.c | 1 +
 include/linux/remoteproc.h            | 5 ++++-
 3 files changed, 7 insertions(+), 2 deletions(-)
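
For readers following the call flow above, here is a minimal user-space
sketch of the double-stop sequence. It is not kernel code: every model_*
name is hypothetical, the subdevice-stop-before-driver-stop ordering is
taken from the diagram, and the NULL dereference is replaced by a counter
so the model can run safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced state set; the real enum has more members. */
enum model_state { MODEL_OFFLINE, MODEL_CRASHED, MODEL_DEFUNCT };

struct model_edge { int id; };

struct model_rproc {
	enum model_state state;
	struct model_edge *edge;  /* stands in for glink->edge */
	bool stop_fails;          /* simulates the failing shutdown call */
	int null_edge_hits;       /* times the teardown saw a NULL edge */
};

/* Subdevice stop: tears the edge down and clears the pointer. */
void model_subdev_stop(struct model_rproc *r)
{
	if (!r->edge)
		r->null_edge_hits++;  /* the real path would dereference NULL */
	r->edge = NULL;
}

/* Stop path: subdevices first, then the driver's stop callback. */
int model_stop(struct model_rproc *r, bool use_defunct)
{
	model_subdev_stop(r);
	if (r->stop_fails) {
		if (use_defunct)
			r->state = MODEL_DEFUNCT;  /* what the patch proposes */
		return -22;                        /* -EINVAL, as in the log */
	}
	r->state = MODEL_OFFLINE;
	return 0;
}

/* Recovery entry point: only acts on a crashed remoteproc. */
int model_trigger_recovery(struct model_rproc *r, bool use_defunct)
{
	assert(r != NULL);
	if (r->state != MODEL_CRASHED)
		return 0;
	return model_stop(r, use_defunct);
}
```

Calling model_trigger_recovery() twice with use_defunct set to false
reaches the teardown a second time with a NULL edge. With use_defunct set
to true, the second call returns early because the state is no longer
MODEL_CRASHED.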

Comments

anish kumar Oct. 16, 2024, 4:04 p.m. UTC | #1
On Tue, Oct 15, 2024 at 9:57 PM Mukesh Ojha <quic_mojha@quicinc.com> wrote:
>
> Multiple call to glink_subdev_stop() for the same remoteproc can happen
> if rproc_stop() fails from Process-A that leaves the rproc state to
> RPROC_CRASHED state later a call to recovery_store from user space in
> Process B triggers rproc_trigger_recovery() of the same remoteproc to
> recover it results in NULL pointer dereference issue in
> qcom_glink_smem_unregister().
>
> There is other side to this issue if we want to fix this via adding a
> NULL check on glink->edge which does not guarantees that the remoteproc
> will recover in second call from Process B as it has failed in the first
> Process A during SMC shutdown call and may again fail at the same call
> and rproc can not recover for such case.

What guarantees that the second stop will also fail? I feel this should
be handled in user space: if rproc calls are failing then there is a
bigger issue, so let user space decide what to do if it keeps happening.
Also, why not set RPROC_DEFUNCT from the other callbacks too, since any
callback from the core to the rproc driver can fail?
>
> Add a new rproc state RPROC_DEFUNCT i.e., non recoverable state of

Even if this state is present, ultimately it will be up to user space to
decide what to do, right?

> remoteproc and the only way to recover from it via system restart.
>
>         Process-A                                       Process-B
>
>   fatal error interrupt happens
>
>   rproc_crash_handler_work()
>     mutex_lock_interruptible(&rproc->lock);
>     ...
>
>        rproc->state = RPROC_CRASHED;
>     ...
>     mutex_unlock(&rproc->lock);
>
>     rproc_trigger_recovery()
>      mutex_lock_interruptible(&rproc->lock);
>
>       adsp_stop()
>       qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
>       remoteproc remoteproc3: can't stop rproc: -22
>      mutex_unlock(&rproc->lock);
>
>                                                 echo enabled > /sys/class/remoteproc/remoteprocX/recovery
>                                                 recovery_store()
>                                                  rproc_trigger_recovery()
>                                                   mutex_lock_interruptible(&rproc->lock);
>                                                    rproc_stop()
>                                                     glink_subdev_stop()
>                                                       qcom_glink_smem_unregister() ==|
>                                                                                      |
>                                                                                      V
>                                                       Unable to handle kernel NULL pointer dereference
>                                                                 at virtual address 0000000000000358
>
> Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
> ---
> Changes in v3:
>  - Fix kernel test reported error.
>
> Changes in v2:
>  - Removed NULL pointer check instead added a new state to signify
>    non-recoverable state of remoteproc.
>
>  drivers/remoteproc/remoteproc_core.c  | 3 ++-
>  drivers/remoteproc/remoteproc_sysfs.c | 1 +
>  include/linux/remoteproc.h            | 5 ++++-
>  3 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index f276956f2c5c..c4e14503b971 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -1727,6 +1727,7 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
>         /* power off the remote processor */
>         ret = rproc->ops->stop(rproc);
>         if (ret) {
> +               rproc->state = RPROC_DEFUNCT;
>                 dev_err(dev, "can't stop rproc: %d\n", ret);
>                 return ret;
>         }
> @@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
>                 return ret;
>
>         /* State could have changed before we got the mutex */
> -       if (rproc->state != RPROC_CRASHED)
> +       if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
>                 goto unlock_mutex;
>
>         dev_err(dev, "recovering %s\n", rproc->name);
> diff --git a/drivers/remoteproc/remoteproc_sysfs.c b/drivers/remoteproc/remoteproc_sysfs.c
> index 138e752c5e4e..5f722b4576b2 100644
> --- a/drivers/remoteproc/remoteproc_sysfs.c
> +++ b/drivers/remoteproc/remoteproc_sysfs.c
> @@ -171,6 +171,7 @@ static const char * const rproc_state_string[] = {
>         [RPROC_DELETED]         = "deleted",
>         [RPROC_ATTACHED]        = "attached",
>         [RPROC_DETACHED]        = "detached",
> +       [RPROC_DEFUNCT]         = "defunct",
>         [RPROC_LAST]            = "invalid",
>  };
>
> diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
> index b4795698d8c2..3e4ba06c6a9a 100644
> --- a/include/linux/remoteproc.h
> +++ b/include/linux/remoteproc.h
> @@ -417,6 +417,8 @@ struct rproc_ops {
>   *                     has attached to it
>   * @RPROC_DETACHED:    device has been booted by another entity and waiting
>   *                     for the core to attach to it
> + * @RPROC_DEFUNCT:     device neither crashed nor responding to any of the
> + *                     requests and can only recover on system restart.
>   * @RPROC_LAST:                just keep this one at the end
>   *
>   * Please note that the values of these states are used as indices
> @@ -433,7 +435,8 @@ enum rproc_state {
>         RPROC_DELETED   = 4,
>         RPROC_ATTACHED  = 5,
>         RPROC_DETACHED  = 6,
> -       RPROC_LAST      = 7,
> +       RPROC_DEFUNCT   = 7,
> +       RPROC_LAST      = 8,
>  };
>
>  /**
> --
> 2.34.1
>
>
Mathieu Poirier Oct. 21, 2024, 3:12 p.m. UTC | #2
Hi Mukesh,

On Wed, Oct 16, 2024 at 10:25:46AM +0530, Mukesh Ojha wrote:
> Multiple call to glink_subdev_stop() for the same remoteproc can happen
> if rproc_stop() fails from Process-A that leaves the rproc state to
> RPROC_CRASHED state later a call to recovery_store from user space in
> Process B triggers rproc_trigger_recovery() of the same remoteproc to
> recover it results in NULL pointer dereference issue in
> qcom_glink_smem_unregister().
> 
> There is other side to this issue if we want to fix this via adding a
> NULL check on glink->edge which does not guarantees that the remoteproc
> will recover in second call from Process B as it has failed in the first
> Process A during SMC shutdown call and may again fail at the same call
> and rproc can not recover for such case.
> 
> Add a new rproc state RPROC_DEFUNCT i.e., non recoverable state of
> remoteproc and the only way to recover from it via system restart.
> 
> 	Process-A                			Process-B
> 
>   fatal error interrupt happens
> 
>   rproc_crash_handler_work()
>     mutex_lock_interruptible(&rproc->lock);
>     ...
> 
>        rproc->state = RPROC_CRASHED;
>     ...
>     mutex_unlock(&rproc->lock);
> 
>     rproc_trigger_recovery()
>      mutex_lock_interruptible(&rproc->lock);
> 
>       adsp_stop()
>       qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
>       remoteproc remoteproc3: can't stop rproc: -22
>      mutex_unlock(&rproc->lock);

Ok, that can happen.

> 
> 						echo enabled > /sys/class/remoteproc/remoteprocX/recovery
> 						recovery_store()
> 						 rproc_trigger_recovery()
> 						  mutex_lock_interruptible(&rproc->lock);
> 						   rproc_stop()
> 						    glink_subdev_stop()
> 						      qcom_glink_smem_unregister() ==|
>                                                                                      |
>                                                                                      V

I am missing some information here but I will _assume_ this is caused by
glink->edge being set to NULL [1] when glink_subdev_stop() is first called by
process A.  Instead of adding a new state to the core I think a better idea
would be to add a check for a NULL value on @smem in
qcom_glink_smem_unregister().  This is a problem that should be fixed in the
driver rather than the core.

[1]. https://elixir.bootlin.com/linux/v6.12-rc4/source/drivers/remoteproc/qcom_common.c#L213
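
A rough illustration of the driver-side alternative suggested here, as a
user-space sketch rather than the actual qcom code: the unregister helper
tolerates a NULL handle, so a repeated stop becomes a no-op. All sketch_*
names are hypothetical and do not correspond to real kernel symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct sketch_glink { int registered; };

struct sketch_subdev {
	struct sketch_glink *edge;  /* stands in for glink->edge */
};

/* Allocate a registered handle; returns NULL on allocation failure. */
struct sketch_glink *sketch_register(void)
{
	struct sketch_glink *g = calloc(1, sizeof(*g));

	if (g)
		g->registered = 1;
	return g;
}

/* Returns 1 if a handle was torn down, 0 if there was nothing to do. */
int sketch_unregister(struct sketch_glink *glink)
{
	if (!glink)
		return 0;  /* the NULL check under discussion */
	glink->registered = 0;
	free(glink);
	return 1;
}

/* Subdevice stop: safe to call more than once. */
int sketch_subdev_stop(struct sketch_subdev *sd)
{
	int did_work;

	assert(sd != NULL);
	did_work = sketch_unregister(sd->edge);
	sd->edge = NULL;
	return did_work;
}
```

This only removes the crash. Whether the remote processor can actually be
stopped on the second attempt is a separate question, which is the point
raised later in the thread.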

> 						      Unable to handle kernel NULL pointer dereference
>                                                                 at virtual address 0000000000000358
> 
> Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
> ---
> Changes in v3:
>  - Fix kernel test reported error.
> 
> Changes in v2:
>  - Removed NULL pointer check instead added a new state to signify
>    non-recoverable state of remoteproc.
> 
>  drivers/remoteproc/remoteproc_core.c  | 3 ++-
>  drivers/remoteproc/remoteproc_sysfs.c | 1 +
>  include/linux/remoteproc.h            | 5 ++++-
>  3 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index f276956f2c5c..c4e14503b971 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -1727,6 +1727,7 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
>  	/* power off the remote processor */
>  	ret = rproc->ops->stop(rproc);
>  	if (ret) {
> +		rproc->state = RPROC_DEFUNCT;
>  		dev_err(dev, "can't stop rproc: %d\n", ret);
>  		return ret;
>  	}
> @@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
>  		return ret;
>  
>  	/* State could have changed before we got the mutex */
> -	if (rproc->state != RPROC_CRASHED)
> +	if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
>  		goto unlock_mutex;

The problem is that rproc_trigger_recovery() can only be called once for a
remoteproc, something that modifies the state machine and may introduce backward
compatibility issues for other remote processor implementations.

Thanks,
Mathieu
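
A side observation on the hunk quoted above, offered as a sketch and not
as part of the review: the added RPROC_DEFUNCT comparison appears to be
implied by the existing one, since a state equal to RPROC_DEFUNCT is
already different from RPROC_CRASHED. The small check below compares both
conditions over every enum value. The S_* names are local stand-ins. The
values 4 to 8 come from the diff, and 0 to 3 are assumed to match the
upstream header, which this excerpt does not show.

```c
#include <assert.h>
#include <stdbool.h>

/* Local mirror of the state values; not the kernel's enum itself. */
enum state_model {
	S_OFFLINE   = 0,
	S_SUSPENDED = 1,
	S_RUNNING   = 2,
	S_CRASHED   = 3,
	S_DELETED   = 4,
	S_ATTACHED  = 5,
	S_DETACHED  = 6,
	S_DEFUNCT   = 7,
	S_LAST      = 8,
};

/* Condition before the patch: skip recovery unless crashed. */
bool skip_before(enum state_model s)
{
	return s != S_CRASHED;
}

/* Condition after the patch, as written in the hunk. */
bool skip_after(enum state_model s)
{
	return s == S_DEFUNCT || s != S_CRASHED;
}

/* Counts the states for which the two conditions disagree. */
int count_differences(void)
{
	int s, diff = 0;

	for (s = S_OFFLINE; s <= S_LAST; s++)
		if (skip_before((enum state_model)s) !=
		    skip_after((enum state_model)s))
			diff++;
	return diff;
}
```

If count_differences() returns 0, the behavioural change comes entirely
from the assignment in rproc_stop(), not from the modified condition.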

>  
>  	dev_err(dev, "recovering %s\n", rproc->name);
> diff --git a/drivers/remoteproc/remoteproc_sysfs.c b/drivers/remoteproc/remoteproc_sysfs.c
> index 138e752c5e4e..5f722b4576b2 100644
> --- a/drivers/remoteproc/remoteproc_sysfs.c
> +++ b/drivers/remoteproc/remoteproc_sysfs.c
> @@ -171,6 +171,7 @@ static const char * const rproc_state_string[] = {
>  	[RPROC_DELETED]		= "deleted",
>  	[RPROC_ATTACHED]	= "attached",
>  	[RPROC_DETACHED]	= "detached",
> +	[RPROC_DEFUNCT]		= "defunct",
>  	[RPROC_LAST]		= "invalid",
>  };
>  
> diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
> index b4795698d8c2..3e4ba06c6a9a 100644
> --- a/include/linux/remoteproc.h
> +++ b/include/linux/remoteproc.h
> @@ -417,6 +417,8 @@ struct rproc_ops {
>   *			has attached to it
>   * @RPROC_DETACHED:	device has been booted by another entity and waiting
>   *			for the core to attach to it
> + * @RPROC_DEFUNCT:	device neither crashed nor responding to any of the
> + * 			requests and can only recover on system restart.
>   * @RPROC_LAST:		just keep this one at the end
>   *
>   * Please note that the values of these states are used as indices
> @@ -433,7 +435,8 @@ enum rproc_state {
>  	RPROC_DELETED	= 4,
>  	RPROC_ATTACHED	= 5,
>  	RPROC_DETACHED	= 6,
> -	RPROC_LAST	= 7,
> +	RPROC_DEFUNCT	= 7,
> +	RPROC_LAST	= 8,
>  };
>  
>  /**
> -- 
> 2.34.1
>
Mukesh Ojha Oct. 25, 2024, 8:10 a.m. UTC | #3
On Mon, Oct 21, 2024 at 09:12:47AM -0600, Mathieu Poirier wrote:
> Hi Mukesh,
> 
> On Wed, Oct 16, 2024 at 10:25:46AM +0530, Mukesh Ojha wrote:
> > Multiple call to glink_subdev_stop() for the same remoteproc can happen
> > if rproc_stop() fails from Process-A that leaves the rproc state to
> > RPROC_CRASHED state later a call to recovery_store from user space in
> > Process B triggers rproc_trigger_recovery() of the same remoteproc to
> > recover it results in NULL pointer dereference issue in
> > qcom_glink_smem_unregister().
> > 
> > There is other side to this issue if we want to fix this via adding a
> > NULL check on glink->edge which does not guarantees that the remoteproc
> > will recover in second call from Process B as it has failed in the first
> > Process A during SMC shutdown call and may again fail at the same call
> > and rproc can not recover for such case.
> > 
> > Add a new rproc state RPROC_DEFUNCT i.e., non recoverable state of
> > remoteproc and the only way to recover from it via system restart.
> > 
> > 	Process-A                			Process-B
> > 
> >   fatal error interrupt happens
> > 
> >   rproc_crash_handler_work()
> >     mutex_lock_interruptible(&rproc->lock);
> >     ...
> > 
> >        rproc->state = RPROC_CRASHED;
> >     ...
> >     mutex_unlock(&rproc->lock);
> > 
> >     rproc_trigger_recovery()
> >      mutex_lock_interruptible(&rproc->lock);
> > 
> >       adsp_stop()
> >       qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
> >       remoteproc remoteproc3: can't stop rproc: -22
> >      mutex_unlock(&rproc->lock);
> 
> Ok, that can happen.
> 
> > 
> > 						echo enabled > /sys/class/remoteproc/remoteprocX/recovery
> > 						recovery_store()
> > 						 rproc_trigger_recovery()
> > 						  mutex_lock_interruptible(&rproc->lock);
> > 						   rproc_stop()
> > 						    glink_subdev_stop()
> > 						      qcom_glink_smem_unregister() ==|
> >                                                                                      |
> >                                                                                      V
> 
> I am missing some information here but I will _assume_ this is caused by
> glink->edge being set to NULL [1] when glink_subdev_stop() is first called by
> process A.  Instead of adding a new state to the core I think a better idea
> would be to add a check for a NULL value on @smem in
> qcom_glink_smem_unregister().  This is a problem that should be fixed in the
> driver rather than the core.
> 
> [1]. https://elixir.bootlin.com/linux/v6.12-rc4/source/drivers/remoteproc/qcom_common.c#L213


I did the same here [1], but after discussion with Bjorn I realized that
the remoteproc might not recover at all: it may fail on the second attempt
as well, and then the only way out is to reboot the machine.

[1]
https://lore.kernel.org/lkml/20240925103351.1628788-1-quic_mojha@quicinc.com/

> 
> > 						      Unable to handle kernel NULL pointer dereference
> >                                                                 at virtual address 0000000000000358
> > 
> > Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
> > ---
> > Changes in v3:
> >  - Fix kernel test reported error.
> > 
> > Changes in v2:
> >  - Removed NULL pointer check instead added a new state to signify
> >    non-recoverable state of remoteproc.
> > 
> >  drivers/remoteproc/remoteproc_core.c  | 3 ++-
> >  drivers/remoteproc/remoteproc_sysfs.c | 1 +
> >  include/linux/remoteproc.h            | 5 ++++-
> >  3 files changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> > index f276956f2c5c..c4e14503b971 100644
> > --- a/drivers/remoteproc/remoteproc_core.c
> > +++ b/drivers/remoteproc/remoteproc_core.c
> > @@ -1727,6 +1727,7 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
> >  	/* power off the remote processor */
> >  	ret = rproc->ops->stop(rproc);
> >  	if (ret) {
> > +		rproc->state = RPROC_DEFUNCT;
> >  		dev_err(dev, "can't stop rproc: %d\n", ret);
> >  		return ret;
> >  	}
> > @@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
> >  		return ret;
> >  
> >  	/* State could have changed before we got the mutex */
> > -	if (rproc->state != RPROC_CRASHED)
> > +	if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
> >  		goto unlock_mutex;
> 
> The problem is that rproc_trigger_recovery() an only be called once for a
> remoteproc, something that modifies the state machine and may introduce backward
> compatibility issues for other remote processor implementations.
> 

I missed one more point here, which I tried to highlight in the second
version [2]: for this case, RPROC_DEFUNCT should be set from the vendor
remoteproc driver and not from the core, and that should take care of
backward compatibility.

[2]
https://lore.kernel.org/lkml/Zw2CAbMozI8vu4SL@hu-mojha-hyd.qualcomm.com/
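
To make the driver-versus-core distinction concrete, here is a small
user-space sketch, assuming a hypothetical ops table: only the vendor
stop callback marks the device defunct, while the core and other drivers
are left unchanged. All demo_* names are invented for illustration and
are not the remoteproc API.

```c
#include <assert.h>
#include <stddef.h>

enum demo_state { DEMO_OFFLINE, DEMO_CRASHED, DEMO_DEFUNCT };

struct demo_rproc;

struct demo_ops {
	int (*stop)(struct demo_rproc *r);
};

struct demo_rproc {
	enum demo_state state;
	const struct demo_ops *ops;
	int shutdown_refused;  /* simulates a failing shutdown call */
};

/* Vendor driver: decides on its own that a failed stop is terminal. */
int demo_vendor_stop(struct demo_rproc *r)
{
	if (r->shutdown_refused) {
		r->state = DEMO_DEFUNCT;
		return -22;
	}
	return 0;
}

/* Some other driver: a failed stop leaves the state untouched. */
int demo_other_stop(struct demo_rproc *r)
{
	return r->shutdown_refused ? -22 : 0;
}

const struct demo_ops demo_vendor_ops = { .stop = demo_vendor_stop };
const struct demo_ops demo_other_ops  = { .stop = demo_other_stop };

/* Core: carries no defunct policy and only reports the error. */
int demo_core_stop(struct demo_rproc *r)
{
	int ret;

	assert(r != NULL && r->ops != NULL);
	ret = r->ops->stop(r);
	if (ret)
		return ret;
	r->state = DEMO_OFFLINE;
	return 0;
}
```

In this arrangement a driver that never sets the defunct state keeps its
current behaviour, which is the backward-compatibility argument made
above.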

-Mukesh

> Thanks,
> Mathieu
> 
> >  
> >  	dev_err(dev, "recovering %s\n", rproc->name);
> > diff --git a/drivers/remoteproc/remoteproc_sysfs.c b/drivers/remoteproc/remoteproc_sysfs.c
> > index 138e752c5e4e..5f722b4576b2 100644
> > --- a/drivers/remoteproc/remoteproc_sysfs.c
> > +++ b/drivers/remoteproc/remoteproc_sysfs.c
> > @@ -171,6 +171,7 @@ static const char * const rproc_state_string[] = {
> >  	[RPROC_DELETED]		= "deleted",
> >  	[RPROC_ATTACHED]	= "attached",
> >  	[RPROC_DETACHED]	= "detached",
> > +	[RPROC_DEFUNCT]		= "defunct",
> >  	[RPROC_LAST]		= "invalid",
> >  };
> >  
> > diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
> > index b4795698d8c2..3e4ba06c6a9a 100644
> > --- a/include/linux/remoteproc.h
> > +++ b/include/linux/remoteproc.h
> > @@ -417,6 +417,8 @@ struct rproc_ops {
> >   *			has attached to it
> >   * @RPROC_DETACHED:	device has been booted by another entity and waiting
> >   *			for the core to attach to it
> > + * @RPROC_DEFUNCT:	device neither crashed nor responding to any of the
> > + * 			requests and can only recover on system restart.
> >   * @RPROC_LAST:		just keep this one at the end
> >   *
> >   * Please note that the values of these states are used as indices
> > @@ -433,7 +435,8 @@ enum rproc_state {
> >  	RPROC_DELETED	= 4,
> >  	RPROC_ATTACHED	= 5,
> >  	RPROC_DETACHED	= 6,
> > -	RPROC_LAST	= 7,
> > +	RPROC_DEFUNCT	= 7,
> > +	RPROC_LAST	= 8,
> >  };
> >  
> >  /**
> > -- 
> > 2.34.1
> >
Mathieu Poirier Oct. 25, 2024, 3:08 p.m. UTC | #4
On Fri, Oct 25, 2024 at 01:40:45PM +0530, Mukesh Ojha wrote:
> On Mon, Oct 21, 2024 at 09:12:47AM -0600, Mathieu Poirier wrote:
> > Hi Mukesh,
> > 
> > On Wed, Oct 16, 2024 at 10:25:46AM +0530, Mukesh Ojha wrote:
> > > Multiple call to glink_subdev_stop() for the same remoteproc can happen
> > > if rproc_stop() fails from Process-A that leaves the rproc state to
> > > RPROC_CRASHED state later a call to recovery_store from user space in
> > > Process B triggers rproc_trigger_recovery() of the same remoteproc to
> > > recover it results in NULL pointer dereference issue in
> > > qcom_glink_smem_unregister().
> > > 
> > > There is other side to this issue if we want to fix this via adding a
> > > NULL check on glink->edge which does not guarantees that the remoteproc
> > > will recover in second call from Process B as it has failed in the first
> > > Process A during SMC shutdown call and may again fail at the same call
> > > and rproc can not recover for such case.
> > > 
> > > Add a new rproc state RPROC_DEFUNCT i.e., non recoverable state of
> > > remoteproc and the only way to recover from it via system restart.
> > > 
> > > 	Process-A                			Process-B
> > > 
> > >   fatal error interrupt happens
> > > 
> > >   rproc_crash_handler_work()
> > >     mutex_lock_interruptible(&rproc->lock);
> > >     ...
> > > 
> > >        rproc->state = RPROC_CRASHED;
> > >     ...
> > >     mutex_unlock(&rproc->lock);
> > > 
> > >     rproc_trigger_recovery()
> > >      mutex_lock_interruptible(&rproc->lock);
> > > 
> > >       adsp_stop()
> > >       qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
> > >       remoteproc remoteproc3: can't stop rproc: -22
> > >      mutex_unlock(&rproc->lock);
> > 
> > Ok, that can happen.
> > 
> > > 
> > > 						echo enabled > /sys/class/remoteproc/remoteprocX/recovery
> > > 						recovery_store()
> > > 						 rproc_trigger_recovery()
> > > 						  mutex_lock_interruptible(&rproc->lock);
> > > 						   rproc_stop()
> > > 						    glink_subdev_stop()
> > > 						      qcom_glink_smem_unregister() ==|
> > >                                                                                      |
> > >                                                                                      V
> > 
> > I am missing some information here but I will _assume_ this is caused by
> > glink->edge being set to NULL [1] when glink_subdev_stop() is first called by
> > process A.  Instead of adding a new state to the core I think a better idea
> > would be to add a check for a NULL value on @smem in
> > qcom_glink_smem_unregister().  This is a problem that should be fixed in the
> > driver rather than the core.
> > 
> > [1]. https://elixir.bootlin.com/linux/v6.12-rc4/source/drivers/remoteproc/qcom_common.c#L213
> 
> 
> I did the same here [1] but after discussion with Bjorn, realized that
> remoteproc might not even recover and may fail in the second attempt as
> well and only way is reboot of the machine.

Whether in RPROC_CRASHED or RPROC_DEFUNCT state, the end result is the same -
manual intervention is needed.  I don't see why another state needs to be added.

> 
> [1]
> https://lore.kernel.org/lkml/20240925103351.1628788-1-quic_mojha@quicinc.com/
> 
> > 
> > > 						      Unable to handle kernel NULL pointer dereference
> > >                                                                 at virtual address 0000000000000358
> > > 
> > > Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
> > > ---
> > > Changes in v3:
> > >  - Fix kernel test reported error.
> > > 
> > > Changes in v2:
> > >  - Removed NULL pointer check instead added a new state to signify
> > >    non-recoverable state of remoteproc.
> > > 
> > >  drivers/remoteproc/remoteproc_core.c  | 3 ++-
> > >  drivers/remoteproc/remoteproc_sysfs.c | 1 +
> > >  include/linux/remoteproc.h            | 5 ++++-
> > >  3 files changed, 7 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> > > index f276956f2c5c..c4e14503b971 100644
> > > --- a/drivers/remoteproc/remoteproc_core.c
> > > +++ b/drivers/remoteproc/remoteproc_core.c
> > > @@ -1727,6 +1727,7 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
> > >  	/* power off the remote processor */
> > >  	ret = rproc->ops->stop(rproc);
> > >  	if (ret) {
> > > +		rproc->state = RPROC_DEFUNCT;
> > >  		dev_err(dev, "can't stop rproc: %d\n", ret);
> > >  		return ret;
> > >  	}
> > > @@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
> > >  		return ret;
> > >  
> > >  	/* State could have changed before we got the mutex */
> > > -	if (rproc->state != RPROC_CRASHED)
> > > +	if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
> > >  		goto unlock_mutex;
> > 
> > The problem is that rproc_trigger_recovery() an only be called once for a
> > remoteproc, something that modifies the state machine and may introduce backward
> > compatibility issues for other remote processor implementations.
> > 
> 
> I missed one more point to add here which i tried to highlight in second
> version[2] that setting of RPROC_DEFUNCT should happen for this case
> from vendor remoteproc driver and not at the core and that should take
> care of the backward compatibility.
> 
> [2]
> https://lore.kernel.org/lkml/Zw2CAbMozI8vu4SL@hu-mojha-hyd.qualcomm.com/
> 
> -Mukesh
> 
> > Thanks,
> > Mathieu
> > 
> > >  
> > >  	dev_err(dev, "recovering %s\n", rproc->name);
> > > diff --git a/drivers/remoteproc/remoteproc_sysfs.c b/drivers/remoteproc/remoteproc_sysfs.c
> > > index 138e752c5e4e..5f722b4576b2 100644
> > > --- a/drivers/remoteproc/remoteproc_sysfs.c
> > > +++ b/drivers/remoteproc/remoteproc_sysfs.c
> > > @@ -171,6 +171,7 @@ static const char * const rproc_state_string[] = {
> > >  	[RPROC_DELETED]		= "deleted",
> > >  	[RPROC_ATTACHED]	= "attached",
> > >  	[RPROC_DETACHED]	= "detached",
> > > +	[RPROC_DEFUNCT]		= "defunct",
> > >  	[RPROC_LAST]		= "invalid",
> > >  };
> > >  
> > > diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
> > > index b4795698d8c2..3e4ba06c6a9a 100644
> > > --- a/include/linux/remoteproc.h
> > > +++ b/include/linux/remoteproc.h
> > > @@ -417,6 +417,8 @@ struct rproc_ops {
> > >   *			has attached to it
> > >   * @RPROC_DETACHED:	device has been booted by another entity and waiting
> > >   *			for the core to attach to it
> > > + * @RPROC_DEFUNCT:	device neither crashed nor responding to any of the
> > > + * 			requests and can only recover on system restart.
> > >   * @RPROC_LAST:		just keep this one at the end
> > >   *
> > >   * Please note that the values of these states are used as indices
> > > @@ -433,7 +435,8 @@ enum rproc_state {
> > >  	RPROC_DELETED	= 4,
> > >  	RPROC_ATTACHED	= 5,
> > >  	RPROC_DETACHED	= 6,
> > > -	RPROC_LAST	= 7,
> > > +	RPROC_DEFUNCT	= 7,
> > > +	RPROC_LAST	= 8,
> > >  };
> > >  
> > >  /**
> > > -- 
> > > 2.34.1
> > >
Mukesh Ojha Oct. 25, 2024, 3:39 p.m. UTC | #5
On Fri, Oct 25, 2024 at 09:08:03AM -0600, Mathieu Poirier wrote:
> On Fri, Oct 25, 2024 at 01:40:45PM +0530, Mukesh Ojha wrote:
> > On Mon, Oct 21, 2024 at 09:12:47AM -0600, Mathieu Poirier wrote:
> > > Hi Mukesh,
> > > 
> > > On Wed, Oct 16, 2024 at 10:25:46AM +0530, Mukesh Ojha wrote:
> > > > Multiple call to glink_subdev_stop() for the same remoteproc can happen
> > > > if rproc_stop() fails from Process-A that leaves the rproc state to
> > > > RPROC_CRASHED state later a call to recovery_store from user space in
> > > > Process B triggers rproc_trigger_recovery() of the same remoteproc to
> > > > recover it results in NULL pointer dereference issue in
> > > > qcom_glink_smem_unregister().
> > > > 
> > > > There is other side to this issue if we want to fix this via adding a
> > > > NULL check on glink->edge which does not guarantees that the remoteproc
> > > > will recover in second call from Process B as it has failed in the first
> > > > Process A during SMC shutdown call and may again fail at the same call
> > > > and rproc can not recover for such case.
> > > > 
> > > > Add a new rproc state RPROC_DEFUNCT i.e., non recoverable state of
> > > > remoteproc and the only way to recover from it via system restart.
> > > > 
> > > > 	Process-A                			Process-B
> > > > 
> > > >   fatal error interrupt happens
> > > > 
> > > >   rproc_crash_handler_work()
> > > >     mutex_lock_interruptible(&rproc->lock);
> > > >     ...
> > > > 
> > > >        rproc->state = RPROC_CRASHED;
> > > >     ...
> > > >     mutex_unlock(&rproc->lock);
> > > > 
> > > >     rproc_trigger_recovery()
> > > >      mutex_lock_interruptible(&rproc->lock);
> > > > 
> > > >       adsp_stop()
> > > >       qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
> > > >       remoteproc remoteproc3: can't stop rproc: -22
> > > >      mutex_unlock(&rproc->lock);
> > > 
> > > Ok, that can happen.
> > > 
> > > > 
> > > > 						echo enabled > /sys/class/remoteproc/remoteprocX/recovery
> > > > 						recovery_store()
> > > > 						 rproc_trigger_recovery()
> > > > 						  mutex_lock_interruptible(&rproc->lock);
> > > > 						   rproc_stop()
> > > > 						    glink_subdev_stop()
> > > > 						      qcom_glink_smem_unregister() ==|
> > > >                                                                                      |
> > > >                                                                                      V
> > > 
> > > I am missing some information here but I will _assume_ this is caused by
> > > glink->edge being set to NULL [1] when glink_subdev_stop() is first called by
> > > process A.  Instead of adding a new state to the core I think a better idea
> > > would be to add a check for a NULL value on @smem in
> > > qcom_glink_smem_unregister().  This is a problem that should be fixed in the
> > > driver rather than the core.
> > > 
> > > [1]. https://elixir.bootlin.com/linux/v6.12-rc4/source/drivers/remoteproc/qcom_common.c#L213
> > 
> > 
> > I did the same here [1] but after discussion with Bjorn, realized that
> > remoteproc might not even recover and may fail in the second attempt as
> > well and only way is reboot of the machine.
> 
> Whether in RPROC_CRASHED or RPROC_DEFUNCT state, the end result is the same -
> manual intervention is needed.  I don't see why another state needs to be added.

Is that really true? When recovery is disabled, any rproc crash leaves
the rproc in the RPROC_CRASHED state, and enabling recovery can bring it
back to ONLINE. If that recovery attempt is not successful, the rproc can
instead be put into the RPROC_DEFUNCT state.

-Mukesh

> 
> > 
> > [1]
> > https://lore.kernel.org/lkml/20240925103351.1628788-1-quic_mojha@quicinc.com/
> > 
> > > 
> > > > 						      Unable to handle kernel NULL pointer dereference
> > > >                                                                 at virtual address 0000000000000358
> > > > 
> > > > Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
> > > > ---
> > > > Changes in v3:
> > > >  - Fix kernel test reported error.
> > > > 
> > > > Changes in v2:
> > > >  - Removed NULL pointer check instead added a new state to signify
> > > >    non-recoverable state of remoteproc.
> > > > 
> > > >  drivers/remoteproc/remoteproc_core.c  | 3 ++-
> > > >  drivers/remoteproc/remoteproc_sysfs.c | 1 +
> > > >  include/linux/remoteproc.h            | 5 ++++-
> > > >  3 files changed, 7 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> > > > index f276956f2c5c..c4e14503b971 100644
> > > > --- a/drivers/remoteproc/remoteproc_core.c
> > > > +++ b/drivers/remoteproc/remoteproc_core.c
> > > > @@ -1727,6 +1727,7 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
> > > >  	/* power off the remote processor */
> > > >  	ret = rproc->ops->stop(rproc);
> > > >  	if (ret) {
> > > > +		rproc->state = RPROC_DEFUNCT;
> > > >  		dev_err(dev, "can't stop rproc: %d\n", ret);
> > > >  		return ret;
> > > >  	}
> > > > @@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
> > > >  		return ret;
> > > >  
> > > >  	/* State could have changed before we got the mutex */
> > > > -	if (rproc->state != RPROC_CRASHED)
> > > > +	if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
> > > >  		goto unlock_mutex;
> > > 
> > > The problem is that rproc_trigger_recovery() can then only be called once
> > > for a remoteproc, which modifies the state machine and may introduce
> > > backward compatibility issues for other remote processor implementations.
> > > 
> > 
> > I missed one more point here, which I tried to highlight in the second
> > version [2]: for this case, RPROC_DEFUNCT should be set from the vendor
> > remoteproc driver and not by the core, and that should take care of
> > backward compatibility.
> > 
> > [2]
> > https://lore.kernel.org/lkml/Zw2CAbMozI8vu4SL@hu-mojha-hyd.qualcomm.com/
> > 
> > -Mukesh
> > 
> > > Thanks,
> > > Mathieu
> > > 
> > > >  
> > > >  	dev_err(dev, "recovering %s\n", rproc->name);
> > > > diff --git a/drivers/remoteproc/remoteproc_sysfs.c b/drivers/remoteproc/remoteproc_sysfs.c
> > > > index 138e752c5e4e..5f722b4576b2 100644
> > > > --- a/drivers/remoteproc/remoteproc_sysfs.c
> > > > +++ b/drivers/remoteproc/remoteproc_sysfs.c
> > > > @@ -171,6 +171,7 @@ static const char * const rproc_state_string[] = {
> > > >  	[RPROC_DELETED]		= "deleted",
> > > >  	[RPROC_ATTACHED]	= "attached",
> > > >  	[RPROC_DETACHED]	= "detached",
> > > > +	[RPROC_DEFUNCT]		= "defunct",
> > > >  	[RPROC_LAST]		= "invalid",
> > > >  };
> > > >  
> > > > diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
> > > > index b4795698d8c2..3e4ba06c6a9a 100644
> > > > --- a/include/linux/remoteproc.h
> > > > +++ b/include/linux/remoteproc.h
> > > > @@ -417,6 +417,8 @@ struct rproc_ops {
> > > >   *			has attached to it
> > > >   * @RPROC_DETACHED:	device has been booted by another entity and waiting
> > > >   *			for the core to attach to it
> > > > + * @RPROC_DEFUNCT:	device neither crashed nor responding to any of the
> > > > + * 			requests and can only recover on system restart.
> > > >   * @RPROC_LAST:		just keep this one at the end
> > > >   *
> > > >   * Please note that the values of these states are used as indices
> > > > @@ -433,7 +435,8 @@ enum rproc_state {
> > > >  	RPROC_DELETED	= 4,
> > > >  	RPROC_ATTACHED	= 5,
> > > >  	RPROC_DETACHED	= 6,
> > > > -	RPROC_LAST	= 7,
> > > > +	RPROC_DEFUNCT	= 7,
> > > > +	RPROC_LAST	= 8,
> > > >  };
> > > >  
> > > >  /**
> > > > -- 
> > > > 2.34.1
> > > >

Patch

diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index f276956f2c5c..c4e14503b971 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -1727,6 +1727,7 @@  static int rproc_stop(struct rproc *rproc, bool crashed)
 	/* power off the remote processor */
 	ret = rproc->ops->stop(rproc);
 	if (ret) {
+		rproc->state = RPROC_DEFUNCT;
 		dev_err(dev, "can't stop rproc: %d\n", ret);
 		return ret;
 	}
@@ -1839,7 +1840,7 @@  int rproc_trigger_recovery(struct rproc *rproc)
 		return ret;
 
 	/* State could have changed before we got the mutex */
-	if (rproc->state != RPROC_CRASHED)
+	if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
 		goto unlock_mutex;
 
 	dev_err(dev, "recovering %s\n", rproc->name);
diff --git a/drivers/remoteproc/remoteproc_sysfs.c b/drivers/remoteproc/remoteproc_sysfs.c
index 138e752c5e4e..5f722b4576b2 100644
--- a/drivers/remoteproc/remoteproc_sysfs.c
+++ b/drivers/remoteproc/remoteproc_sysfs.c
@@ -171,6 +171,7 @@  static const char * const rproc_state_string[] = {
 	[RPROC_DELETED]		= "deleted",
 	[RPROC_ATTACHED]	= "attached",
 	[RPROC_DETACHED]	= "detached",
+	[RPROC_DEFUNCT]		= "defunct",
 	[RPROC_LAST]		= "invalid",
 };
 
diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
index b4795698d8c2..3e4ba06c6a9a 100644
--- a/include/linux/remoteproc.h
+++ b/include/linux/remoteproc.h
@@ -417,6 +417,8 @@  struct rproc_ops {
  *			has attached to it
  * @RPROC_DETACHED:	device has been booted by another entity and waiting
  *			for the core to attach to it
+ * @RPROC_DEFUNCT:	device neither crashed nor responding to any of the
+ * 			requests and can only recover on system restart.
  * @RPROC_LAST:		just keep this one at the end
  *
  * Please note that the values of these states are used as indices
@@ -433,7 +435,8 @@  enum rproc_state {
 	RPROC_DELETED	= 4,
 	RPROC_ATTACHED	= 5,
 	RPROC_DETACHED	= 6,
-	RPROC_LAST	= 7,
+	RPROC_DEFUNCT	= 7,
+	RPROC_LAST	= 8,
 };
 
 /**
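
A note on the guard in rproc_trigger_recovery(): the sketch below,
written as a standalone userspace program rather than kernel code,
compares the condition before and after the patch for every state
value. The `SK_*` names and helper functions are hypothetical. Values
4 to 8 are taken from the diff above; values 0 to 3 are assumed from
the mainline header and are not shown in this patch.

```c
#include <stdbool.h>

/* Sketch of enum rproc_state after the patch (names are hypothetical). */
enum sketch_state {
	SK_OFFLINE	= 0,	/* assumed, not shown in the diff */
	SK_SUSPENDED	= 1,	/* assumed, not shown in the diff */
	SK_RUNNING	= 2,	/* assumed, not shown in the diff */
	SK_CRASHED	= 3,	/* assumed, not shown in the diff */
	SK_DELETED	= 4,
	SK_ATTACHED	= 5,
	SK_DETACHED	= 6,
	SK_DEFUNCT	= 7,
	SK_LAST		= 8,
};

/* Guard as written in the patch: true means "skip recovery". */
static bool skip_recovery_patched(enum sketch_state s)
{
	return s == SK_DEFUNCT || s != SK_CRASHED;
}

/* Guard as it was before the patch. */
static bool skip_recovery_original(enum sketch_state s)
{
	return s != SK_CRASHED;
}

/* Returns true when both guards agree for every state below SK_LAST. */
static bool guards_agree(void)
{
	for (int s = SK_OFFLINE; s < SK_LAST; s++) {
		enum sketch_state st = (enum sketch_state)s;

		if (skip_recovery_patched(st) != skip_recovery_original(st))
			return false;
	}
	return true;
}
```

Under these assumptions the two conditions select the same states,
since a state equal to RPROC_DEFUNCT is already different from
RPROC_CRASHED. The behavioural change in the patch therefore appears to
come from rproc_stop() setting RPROC_DEFUNCT, not from the extra clause
in the guard.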