Message ID | 20241016045546.2613436-1-quic_mojha@quicinc.com |
---|---|
State | New |
Series | [v3] remoteproc: Add a new remoteproc state RPROC_DEFUNCT |
On Tue, Oct 15, 2024 at 9:57 PM Mukesh Ojha <quic_mojha@quicinc.com> wrote:
>
> Multiple calls to glink_subdev_stop() for the same remoteproc can happen
> if rproc_stop() fails in Process-A, which leaves the rproc in the
> RPROC_CRASHED state. A later call to recovery_store() from user space in
> Process-B then triggers rproc_trigger_recovery() on the same remoteproc,
> which results in a NULL pointer dereference in
> qcom_glink_smem_unregister().
>
> Fixing this by adding a NULL check on glink->edge has a downside: it does
> not guarantee that the remoteproc will recover on the second call from
> Process-B. The first attempt in Process-A failed during the SMC shutdown
> call, and the second may fail at the same call, in which case the rproc
> cannot recover.

What is the guarantee that the second stop will also fail? I feel this
should be handled in user space: if rproc calls are failing, then there is
a bigger issue, so let user space decide what to do if it keeps happening.

Also, why not add this DEFUNCT state handling to the other callbacks, since
all callbacks from the core to the rproc driver can fail?

>
> Add a new rproc state, RPROC_DEFUNCT, i.e. a non-recoverable state of

Even if this state is present, ultimately it will be up to user space to
decide what to do, right?

> the remoteproc from which the only way to recover is a system restart.
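The point about leaving policy to user space can be illustrated with a small sketch. It assumes a supervisor process that reads the remoteproc sysfs `state` attribute, and it assumes the "defunct" string proposed by this patch were actually exposed. `enum rproc_action`, `rproc_action_for_state()` and `rproc_read_state()` are hypothetical names made up for this example, not part of any existing tool.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical decisions a user-space supervisor might take. */
enum rproc_action {
	ACTION_NONE,		/* nothing to do for this state */
	ACTION_RETRY_RECOVERY,	/* crashed: a write to "recovery" may help */
	ACTION_REBOOT,		/* defunct: only a system restart can help */
};

/* Map the text read from the sysfs "state" attribute to an action. */
enum rproc_action rproc_action_for_state(const char *state)
{
	if (strcmp(state, "crashed") == 0)
		return ACTION_RETRY_RECOVERY;
	if (strcmp(state, "defunct") == 0)	/* string proposed by the patch */
		return ACTION_REBOOT;
	return ACTION_NONE;
}

/*
 * Read /sys/class/remoteproc/<name>/state into buf and strip the trailing
 * newline. Returns 0 on success, -1 if the attribute cannot be read.
 */
int rproc_read_state(const char *name, char *buf, size_t len)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/class/remoteproc/%s/state", name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

A supervisor could call `rproc_read_state("remoteproc3", buf, sizeof(buf))` and act on `rproc_action_for_state(buf)`. Whether a reboot is the right reaction remains a policy choice, which is the reviewer's point.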
Hi Mukesh,

On Wed, Oct 16, 2024 at 10:25:46AM +0530, Mukesh Ojha wrote:
>         Process-A                       Process-B
>
>   fatal error interrupt happens
>
>   rproc_crash_handler_work()
>     mutex_lock_interruptible(&rproc->lock);
>     ...
>
>        rproc->state = RPROC_CRASHED;
>     ...
>     mutex_unlock(&rproc->lock);
>
>     rproc_trigger_recovery()
>      mutex_lock_interruptible(&rproc->lock);
>
>       adsp_stop()
>       qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
>       remoteproc remoteproc3: can't stop rproc: -22
>      mutex_unlock(&rproc->lock);

Ok, that can happen.

>
>                                         echo enabled > /sys/class/remoteproc/remoteprocX/recovery
>                                         recovery_store()
>                                          rproc_trigger_recovery()
>                                           mutex_lock_interruptible(&rproc->lock);
>                                            rproc_stop()
>                                             glink_subdev_stop()
>                                               qcom_glink_smem_unregister() ==|
>                                                                              |
>                                                                              V

I am missing some information here but I will _assume_ this is caused by
glink->edge being set to NULL [1] when glink_subdev_stop() is first called by
process A.  Instead of adding a new state to the core, I think a better idea
would be to add a check for a NULL value on @smem in
qcom_glink_smem_unregister().  This is a problem that should be fixed in the
driver rather than the core.

[1]. https://elixir.bootlin.com/linux/v6.12-rc4/source/drivers/remoteproc/qcom_common.c#L213

>                                               Unable to handle kernel NULL pointer dereference
>                                               at virtual address 0000000000000358
>
> @@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
>  		return ret;
>
>  	/* State could have changed before we got the mutex */
> -	if (rproc->state != RPROC_CRASHED)
> +	if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
>  		goto unlock_mutex;

The problem is that, with this change, rproc_trigger_recovery() can only be
called once for a remoteproc, something that modifies the state machine and
may introduce backward compatibility issues for other remote processor
implementations.

Thanks,
Mathieu
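The driver-side alternative suggested in this reply can be sketched in isolation. What follows is a user-space model, not the actual qcom_glink code: every `fake_*` name and field is invented for illustration, and the only thing it tries to show is how a NULL check turns a repeated stop into a no-op.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for the driver structures; names and fields are illustrative. */
struct fake_edge {
	int registered;
};

struct fake_glink_subdev {
	struct fake_edge *edge;	/* NULL once the edge has been torn down */
	int unregister_calls;	/* counts real teardowns, for demonstration */
};

/* Tear the edge down, tolerating a NULL edge as the review suggests. */
static void fake_smem_unregister(struct fake_glink_subdev *glink,
				 struct fake_edge *edge)
{
	if (!edge)
		return;		/* already unregistered by an earlier stop */
	edge->registered = 0;
	glink->unregister_calls++;
	free(edge);
}

/* Stop the subdevice; clearing the pointer makes a second stop harmless. */
void fake_glink_subdev_stop(struct fake_glink_subdev *glink)
{
	fake_smem_unregister(glink, glink->edge);
	glink->edge = NULL;
}

/* Start the subdevice by allocating a fresh edge. Returns 0 on success. */
int fake_glink_subdev_start(struct fake_glink_subdev *glink)
{
	glink->edge = calloc(1, sizeof(*glink->edge));
	if (!glink->edge)
		return -1;
	glink->edge->registered = 1;
	return 0;
}
```

Calling `fake_glink_subdev_stop()` twice in a row performs one teardown and then returns early. As the commit message points out, a check of this kind avoids the crash but does not by itself make the second recovery attempt succeed.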
On Mon, Oct 21, 2024 at 09:12:47AM -0600, Mathieu Poirier wrote:
> I am missing some information here but I will _assume_ this is caused by
> glink->edge being set to NULL [1] when glink_subdev_stop() is first called by
> process A.  Instead of adding a new state to the core, I think a better idea
> would be to add a check for a NULL value on @smem in
> qcom_glink_smem_unregister().  This is a problem that should be fixed in the
> driver rather than the core.
>
> [1]. https://elixir.bootlin.com/linux/v6.12-rc4/source/drivers/remoteproc/qcom_common.c#L213

I did the same here [1], but after discussion with Bjorn I realized that the
remoteproc might not recover at all and may fail on the second attempt as
well, in which case the only way out is to reboot the machine.

[1]
https://lore.kernel.org/lkml/20240925103351.1628788-1-quic_mojha@quicinc.com/

> > @@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
> >  		return ret;
> >
> >  	/* State could have changed before we got the mutex */
> > -	if (rproc->state != RPROC_CRASHED)
> > +	if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
> >  		goto unlock_mutex;
>
> The problem is that, with this change, rproc_trigger_recovery() can only be
> called once for a remoteproc, something that modifies the state machine and
> may introduce backward compatibility issues for other remote processor
> implementations.
>

I missed one more point here, which I tried to highlight in the second
version [2]: for this case, RPROC_DEFUNCT should be set from the vendor
remoteproc driver and not in the core, and that should take care of
backward compatibility.

[2]
https://lore.kernel.org/lkml/Zw2CAbMozI8vu4SL@hu-mojha-hyd.qualcomm.com/

-Mukesh
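To make the "set it in the vendor driver, not in the core" idea concrete, here is a minimal user-space model. It is a sketch under stated assumptions, not kernel code: `model_rproc`, `vendor_stop()`, `other_stop()` and `model_core_stop()` are hypothetical names, and `shutdown_result` simply simulates the return value of the failing shutdown call.

```c
enum model_state { MODEL_OFFLINE, MODEL_RUNNING, MODEL_CRASHED, MODEL_DEFUNCT };

struct model_rproc {
	enum model_state state;
	int (*stop)(struct model_rproc *rproc);	/* driver stop callback */
	int shutdown_result;			/* simulated shutdown call result */
};

/* Vendor driver: marks the rproc defunct when its shutdown call fails. */
int vendor_stop(struct model_rproc *rproc)
{
	int ret = rproc->shutdown_result;

	if (ret)
		rproc->state = MODEL_DEFUNCT;
	return ret;
}

/* Driver for some other platform: behaviour is unchanged. */
int other_stop(struct model_rproc *rproc)
{
	return rproc->shutdown_result;
}

/* Core stop path: on failure it leaves the state as the driver set it. */
int model_core_stop(struct model_rproc *rproc)
{
	int ret = rproc->stop(rproc);

	if (ret)
		return ret;
	rproc->state = MODEL_OFFLINE;
	return 0;
}
```

In this model only a device driven by `vendor_stop()` ever reaches `MODEL_DEFUNCT`; a device using `other_stop()` stays in `MODEL_CRASHED` after a failed stop and can be retried, which is the backward-compatibility argument made above.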
On Fri, Oct 25, 2024 at 01:40:45PM +0530, Mukesh Ojha wrote:
> I did the same here [1], but after discussion with Bjorn I realized that the
> remoteproc might not recover at all and may fail on the second attempt as
> well, in which case the only way out is to reboot the machine.

Whether in RPROC_CRASHED or RPROC_DEFUNCT state, the end result is the same -
manual intervention is needed.  I don't see why another state needs to be added.

>
> [1]
> https://lore.kernel.org/lkml/20240925103351.1628788-1-quic_mojha@quicinc.com/
On Fri, Oct 25, 2024 at 09:08:03AM -0600, Mathieu Poirier wrote:
> On Fri, Oct 25, 2024 at 01:40:45PM +0530, Mukesh Ojha wrote:
> > I did the same here [1], but after discussion with Bjorn I realized that the
> > remoteproc might not recover at all and may fail on the second attempt as
> > well, in which case the only way out is to reboot the machine.
>
> Whether in RPROC_CRASHED or RPROC_DEFUNCT state, the end result is the same -
> manual intervention is needed.  I don't see why another state needs to be added.

Is that really true? When recovery is disabled, any rproc crash results in
the RPROC_CRASHED state, and enabling recovery can then bring the rproc back
online, whereas if rproc recovery is not successful it can be put into the
RPROC_DEFUNCT state.

-Mukesh
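The distinction drawn in this reply can be modelled with a tiny state machine. This is only a sketch of the argument, not the remoteproc core: the `sim_*` names are hypothetical, `stop_fails` stands in for the failing shutdown call, and the transition to a defunct state follows the behaviour proposed by the patch.

```c
#include <stdbool.h>

enum sim_state { SIM_RUNNING, SIM_CRASHED, SIM_DEFUNCT };

struct sim_rproc {
	enum sim_state state;
	bool recovery_disabled;	/* mirrors the sysfs "recovery" setting */
	bool stop_fails;	/* simulates the failing shutdown call */
};

/* Attempt recovery; only a crashed rproc is eligible. */
void sim_trigger_recovery(struct sim_rproc *r)
{
	if (r->state != SIM_CRASHED)
		return;
	r->state = r->stop_fails ? SIM_DEFUNCT : SIM_RUNNING;
}

/* Crash handler: record the crash, recover at once unless disabled. */
void sim_crash(struct sim_rproc *r)
{
	r->state = SIM_CRASHED;
	if (!r->recovery_disabled)
		sim_trigger_recovery(r);
}

/* Models `echo enabled > .../recovery` from user space. */
void sim_enable_recovery(struct sim_rproc *r)
{
	r->recovery_disabled = false;
	sim_trigger_recovery(r);
}
```

With recovery disabled, `sim_crash()` leaves the device in `SIM_CRASHED` and a later `sim_enable_recovery()` can bring it back to `SIM_RUNNING`. When the stop itself fails, the device ends up in `SIM_DEFUNCT` and further recovery requests are ignored, which is the difference between the two states that this reply argues for.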
diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index f276956f2c5c..c4e14503b971 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -1727,6 +1727,7 @@ static int rproc_stop(struct rproc *rproc, bool crashed)
 	/* power off the remote processor */
 	ret = rproc->ops->stop(rproc);
 	if (ret) {
+		rproc->state = RPROC_DEFUNCT;
 		dev_err(dev, "can't stop rproc: %d\n", ret);
 		return ret;
 	}
@@ -1839,7 +1840,7 @@ int rproc_trigger_recovery(struct rproc *rproc)
 		return ret;
 
 	/* State could have changed before we got the mutex */
-	if (rproc->state != RPROC_CRASHED)
+	if (rproc->state == RPROC_DEFUNCT || rproc->state != RPROC_CRASHED)
 		goto unlock_mutex;
 
 	dev_err(dev, "recovering %s\n", rproc->name);
diff --git a/drivers/remoteproc/remoteproc_sysfs.c b/drivers/remoteproc/remoteproc_sysfs.c
index 138e752c5e4e..5f722b4576b2 100644
--- a/drivers/remoteproc/remoteproc_sysfs.c
+++ b/drivers/remoteproc/remoteproc_sysfs.c
@@ -171,6 +171,7 @@ static const char * const rproc_state_string[] = {
 	[RPROC_DELETED] = "deleted",
 	[RPROC_ATTACHED] = "attached",
 	[RPROC_DETACHED] = "detached",
+	[RPROC_DEFUNCT] = "defunct",
 	[RPROC_LAST] = "invalid",
 };
 
diff --git a/include/linux/remoteproc.h b/include/linux/remoteproc.h
index b4795698d8c2..3e4ba06c6a9a 100644
--- a/include/linux/remoteproc.h
+++ b/include/linux/remoteproc.h
@@ -417,6 +417,8 @@ struct rproc_ops {
  *			has attached to it
  * @RPROC_DETACHED:	device has been booted by another entity and waiting
  *			for the core to attach to it
+ * @RPROC_DEFUNCT:	device neither crashed nor responding to any of the
+ *			requests and can only recover on system restart.
  * @RPROC_LAST:	just keep this one at the end
  *
  * Please note that the values of these states are used as indices
@@ -433,7 +435,8 @@ enum rproc_state {
 	RPROC_DELETED = 4,
 	RPROC_ATTACHED = 5,
 	RPROC_DETACHED = 6,
-	RPROC_LAST = 7,
+	RPROC_DEFUNCT = 7,
+	RPROC_LAST = 8,
 };
 
 /**
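Two details of the diff are easier to see in a stand-alone model. First, the state values double as indices into the string table, so the new state has to be inserted before the `LAST` sentinel in both places. Second, the patched guard in rproc_trigger_recovery() is logically equivalent to the original one, because a defunct state is already "not crashed". The sketch below is a user-space illustration: the `M_*` and `model_*` names are hypothetical, and the values 0 to 3 and their strings are not shown in the diff and are assumed here.

```c
#include <stdbool.h>

enum model_state {
	M_OFFLINE = 0,		/* values 0..3 are assumptions, not in the diff */
	M_SUSPENDED = 1,
	M_RUNNING = 2,
	M_CRASHED = 3,
	M_DELETED = 4,
	M_ATTACHED = 5,
	M_DETACHED = 6,
	M_DEFUNCT = 7,		/* new state proposed by the patch */
	M_LAST = 8,		/* sentinel, must stay at the end */
};

/* String table indexed by state value, as in the sysfs code. */
static const char * const model_state_string[] = {
	[M_OFFLINE] = "offline",
	[M_SUSPENDED] = "suspended",
	[M_RUNNING] = "running",
	[M_CRASHED] = "crashed",
	[M_DELETED] = "deleted",
	[M_ATTACHED] = "attached",
	[M_DETACHED] = "detached",
	[M_DEFUNCT] = "defunct",
	[M_LAST] = "invalid",
};

/* Clamp out-of-range values to the sentinel before indexing. */
const char *model_state_name(unsigned int state)
{
	return model_state_string[state > M_LAST ? M_LAST : state];
}

/* Guard before the patch: skip recovery unless the rproc has crashed. */
bool skip_recovery_original(enum model_state s)
{
	return s != M_CRASHED;
}

/* Guard after the patch: the extra test never changes the result. */
bool skip_recovery_patched(enum model_state s)
{
	return s == M_DEFUNCT || s != M_CRASHED;
}
```

Both guard functions return the same value for every state, so in this model the behavioural change comes entirely from rproc_stop() moving the state away from crashed on failure, not from the extra comparison.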
Multiple calls to glink_subdev_stop() for the same remoteproc can happen
if rproc_stop() fails in Process-A, which leaves the rproc in the
RPROC_CRASHED state. A later call to recovery_store() from user space in
Process-B then triggers rproc_trigger_recovery() on the same remoteproc,
which results in a NULL pointer dereference in
qcom_glink_smem_unregister().

Fixing this by adding a NULL check on glink->edge has a downside: it does
not guarantee that the remoteproc will recover on the second call from
Process-B. The first attempt in Process-A failed during the SMC shutdown
call, and the second may fail at the same call, in which case the rproc
cannot recover.

Add a new rproc state, RPROC_DEFUNCT, i.e. a non-recoverable state of
the remoteproc from which the only way to recover is a system restart.

        Process-A                       Process-B

  fatal error interrupt happens

  rproc_crash_handler_work()
    mutex_lock_interruptible(&rproc->lock);
    ...

       rproc->state = RPROC_CRASHED;
    ...
    mutex_unlock(&rproc->lock);

    rproc_trigger_recovery()
     mutex_lock_interruptible(&rproc->lock);

      adsp_stop()
      qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22
      remoteproc remoteproc3: can't stop rproc: -22
     mutex_unlock(&rproc->lock);

                                        echo enabled > /sys/class/remoteproc/remoteprocX/recovery
                                        recovery_store()
                                         rproc_trigger_recovery()
                                          mutex_lock_interruptible(&rproc->lock);
                                           rproc_stop()
                                            glink_subdev_stop()
                                              qcom_glink_smem_unregister() ==|
                                                                             |
                                                                             V
                                              Unable to handle kernel NULL pointer dereference
                                              at virtual address 0000000000000358

Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
---
Changes in v3:
 - Fix kernel test reported error.

Changes in v2:
 - Removed the NULL pointer check and instead added a new state to signify
   the non-recoverable state of the remoteproc.

 drivers/remoteproc/remoteproc_core.c  | 3 ++-
 drivers/remoteproc/remoteproc_sysfs.c | 1 +
 include/linux/remoteproc.h            | 5 ++++-
 3 files changed, 7 insertions(+), 2 deletions(-)