diff mbox series

remoteproc: k3-r5: Fix driver shutdown

Message ID bf2bd3df-902f-4cef-91fc-2e6438539a01@siemens.com (mailing list archive)
State New
Headers show
Series remoteproc: k3-r5: Fix driver shutdown | expand

Commit Message

Jan Kiszka Aug. 19, 2024, 4:47 p.m. UTC
From: Jan Kiszka <jan.kiszka@siemens.com>

When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed
first. When core 0 should then be stopped before its removal, it will
find core1->rproc as NULL already and crashes. Happens on rmmod e.g.

Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power up before core0 via sysfs")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---

There might be one more because I can still make this driver crash 
after an operator error. Were error scenarios tested at all?

 drivers/remoteproc/ti_k3_r5_remoteproc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Beleswar Prasad Padhi Aug. 20, 2024, 9:30 a.m. UTC | #1
Hi Jan,

On 19-08-2024 22:17, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
>
> When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed
> first. When core 0 should then be stopped before its removal, it will
> find core1->rproc as NULL already and crashes. Happens on rmmod e.g.


Did you check this on top of -next-20240820 tag? There was a series[0] 
which was merged recently which fixed this condition. I don't see this 
issue when trying on top of -next-20240820 tag.
[0]: https://lore.kernel.org/all/20240808074127.2688131-1-b-padhi@ti.com/

>
> Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power up before core0 via sysfs")
> CC: stable@vger.kernel.org
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> ---
>
> There might be one more because I can still make this driver crash
> after an operator error. Were error scenarios tested at all?


Can you point out what is this issue more specifically, and I can take 
this up then.

>
>   drivers/remoteproc/ti_k3_r5_remoteproc.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c b/drivers/remoteproc/ti_k3_r5_remoteproc.c
> index eb09d2e9b32a..9ebd7a34e638 100644
> --- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
> +++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
> @@ -646,7 +646,8 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
>   		/* do not allow core 0 to stop before core 1 */
>   		core1 = list_last_entry(&cluster->cores, struct k3_r5_core,
>   					elem);
> -		if (core != core1 && core1->rproc->state != RPROC_OFFLINE) {
> +		if (core != core1 && core1->rproc &&
> +		    core1->rproc->state != RPROC_OFFLINE) {
>   			dev_err(dev, "%s: can not stop core 0 before core 1\n",
>   				__func__);
>   			ret = -EPERM;
Jan Kiszka Aug. 20, 2024, 9:39 a.m. UTC | #2
On 20.08.24 11:30, Beleswar Prasad Padhi wrote:
> Hi Jan,
> 
> On 19-08-2024 22:17, Jan Kiszka wrote:
>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed
>> first. When core 0 should then be stopped before its removal, it will
>> find core1->rproc as NULL already and crashes. Happens on rmmod e.g.
> 
> 
> Did you check this on top of -next-20240820 tag? There was a series[0]
> which was merged recently which fixed this condition. I don't see this
> issue when trying on top of -next-20240820 tag.
> [0]: https://lore.kernel.org/all/20240808074127.2688131-1-b-padhi@ti.com/
> 

I didn't try those yet, I was on 6.11-rcX. But from reading them
quickly, I'm not seeing the two issues I found directly addressed there.

>>
>> Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power
>> up before core0 via sysfs")
>> CC: stable@vger.kernel.org
>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>> ---
>>
>> There might be one more because I can still make this driver crash
>> after an operator error. Were error scenarios tested at all?
> 
> 
> Can you point out what is this issue more specifically, and I can take
> this up then.

Try starting core1 before core0, and then again - system will hang or
crash while trying to wipe ATCM. I do not understand this problem yet -
same VA is used, and I cannot see where/how the region should have been
unmapped in between.

Jan

> 
>>
>>   drivers/remoteproc/ti_k3_r5_remoteproc.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c
>> b/drivers/remoteproc/ti_k3_r5_remoteproc.c
>> index eb09d2e9b32a..9ebd7a34e638 100644
>> --- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
>> +++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
>> @@ -646,7 +646,8 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
>>           /* do not allow core 0 to stop before core 1 */
>>           core1 = list_last_entry(&cluster->cores, struct k3_r5_core,
>>                       elem);
>> -        if (core != core1 && core1->rproc->state != RPROC_OFFLINE) {
>> +        if (core != core1 && core1->rproc &&
>> +            core1->rproc->state != RPROC_OFFLINE) {
>>               dev_err(dev, "%s: can not stop core 0 before core 1\n",
>>                   __func__);
>>               ret = -EPERM;
Beleswar Prasad Padhi Aug. 20, 2024, 9:48 a.m. UTC | #3
On 20-08-2024 15:09, Jan Kiszka wrote:
> On 20.08.24 11:30, Beleswar Prasad Padhi wrote:
>> Hi Jan,
>>
>> On 19-08-2024 22:17, Jan Kiszka wrote:
>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>
>>> When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed
>>> first. When core 0 should then be stopped before its removal, it will
>>> find core1->rproc as NULL already and crashes. Happens on rmmod e.g.
>>
>> Did you check this on top of -next-20240820 tag? There was a series[0]
>> which was merged recently which fixed this condition. I don't see this
>> issue when trying on top of -next-20240820 tag.
>> [0]: https://lore.kernel.org/all/20240808074127.2688131-1-b-padhi@ti.com/
>>
> I didn't try those yet, I was on 6.11-rcX. But from reading them
> quickly, I'm not seeing the two issues I found directly addressed there.

Check the comment by Andrew Davis[0], that addresses the above issue.

[0]: 
https://lore.kernel.org/all/0bba5293-a55d-4f13-887c-272a54d6e1ca@ti.com/

>
>>> Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power
>>> up before core0 via sysfs")
>>> CC: stable@vger.kernel.org
>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>> ---
>>>
>>> There might be one more because I can still make this driver crash
>>> after an operator error. Were error scenarios tested at all?
>>
>> Can you point out what is this issue more specifically, and I can take
>> this up then.
> Try starting core1 before core0, and then again - system will hang or
If you are trying to stop and then start the cores from sysfs, that is 
not yet supported. The hang is thus expected.
> crash while trying to wipe ATCM. I do not understand this problem yet -
> same VA is used, and I cannot see where/how the region should have been
> unmapped in between.
>
> Jan
>
>>>    drivers/remoteproc/ti_k3_r5_remoteproc.c | 3 ++-
>>>    1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c
>>> b/drivers/remoteproc/ti_k3_r5_remoteproc.c
>>> index eb09d2e9b32a..9ebd7a34e638 100644
>>> --- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
>>> +++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
>>> @@ -646,7 +646,8 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
>>>            /* do not allow core 0 to stop before core 1 */
>>>            core1 = list_last_entry(&cluster->cores, struct k3_r5_core,
>>>                        elem);
>>> -        if (core != core1 && core1->rproc->state != RPROC_OFFLINE) {
>>> +        if (core != core1 && core1->rproc &&
>>> +            core1->rproc->state != RPROC_OFFLINE) {
>>>                dev_err(dev, "%s: can not stop core 0 before core 1\n",
>>>                    __func__);
>>>                ret = -EPERM;
Jan Kiszka Aug. 20, 2024, 2:20 p.m. UTC | #4
On 20.08.24 11:48, Beleswar Prasad Padhi wrote:
> 
> On 20-08-2024 15:09, Jan Kiszka wrote:
>> On 20.08.24 11:30, Beleswar Prasad Padhi wrote:
>>> Hi Jan,
>>>
>>> On 19-08-2024 22:17, Jan Kiszka wrote:
>>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>>
>>>> When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed
>>>> first. When core 0 should then be stopped before its removal, it will
>>>> find core1->rproc as NULL already and crashes. Happens on rmmod e.g.
>>>
>>> Did you check this on top of -next-20240820 tag? There was a series[0]
>>> which was merged recently which fixed this condition. I don't see this
>>> issue when trying on top of -next-20240820 tag.
>>> [0]:
>>> https://lore.kernel.org/all/20240808074127.2688131-1-b-padhi@ti.com/
>>>
>> I didn't try those yet, I was on 6.11-rcX. But from reading them
>> quickly, I'm not seeing the two issues I found directly addressed there.
> 
> Check the comment by Andrew Davis[0], that addresses the above issue.
> 
> [0]:
> https://lore.kernel.org/all/0bba5293-a55d-4f13-887c-272a54d6e1ca@ti.com/
> 

OK, then someone still needs to update his patch accordingly.

>>
>>>> Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power
>>>> up before core0 via sysfs")
>>>> CC: stable@vger.kernel.org
>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>>> ---
>>>>
>>>> There might be one more because I can still make this driver crash
>>>> after an operator error. Were error scenarios tested at all?
>>>
>>> Can you point out what is this issue more specifically, and I can take
>>> this up then.
>> Try starting core1 before core0, and then again - system will hang or
> If you are trying to stop and then start the cores from sysfs, that is
> not yet supported. The hang is thus expected.

What? Then the driver is broken, even more. Why wasn't it fully implemented?

Jan
Beleswar Prasad Padhi Aug. 20, 2024, 2:59 p.m. UTC | #5
On 20-08-2024 19:50, Jan Kiszka wrote:
> On 20.08.24 11:48, Beleswar Prasad Padhi wrote:
>> On 20-08-2024 15:09, Jan Kiszka wrote:
>>> On 20.08.24 11:30, Beleswar Prasad Padhi wrote:
>>>> Hi Jan,
>>>>
>>>> On 19-08-2024 22:17, Jan Kiszka wrote:
>>>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>>>
>>>>> When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed
>>>>> first. When core 0 should then be stopped before its removal, it will
>>>>> find core1->rproc as NULL already and crashes. Happens on rmmod e.g.
>>>> Did you check this on top of -next-20240820 tag? There was a series[0]
>>>> which was merged recently which fixed this condition. I don't see this
>>>> issue when trying on top of -next-20240820 tag.
>>>> [0]:
>>>> https://lore.kernel.org/all/20240808074127.2688131-1-b-padhi@ti.com/
>>>>
>>> I didn't try those yet, I was on 6.11-rcX. But from reading them
>>> quickly, I'm not seeing the two issues I found directly addressed there.
>> Check the comment by Andrew Davis[0], that addresses the above issue.
>>
>> [0]:
>> https://lore.kernel.org/all/0bba5293-a55d-4f13-887c-272a54d6e1ca@ti.com/
>>
> OK, then someone still needs to update his patch accordingly.
That comment was addressed in the v4 series revision[1] and was merged 
to linux-next, available with tag -next-20240820. Request you to please 
check if the issue persists with -next-20240820 tag. I checked myself, 
and was not able to reproduce.
[1]: https://lore.kernel.org/all/Zr9nbWnADDB+ZOlg@p14s/
>
>>>>> Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power
>>>>> up before core0 via sysfs")
>>>>> CC: stable@vger.kernel.org
>>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>>>> ---
>>>>>
>>>>> There might be one more because I can still make this driver crash
>>>>> after an operator error. Were error scenarios tested at all?
>>>> Can you point out what is this issue more specifically, and I can take
>>>> this up then.
>>> Try starting core1 before core0, and then again - system will hang or
>> If you are trying to stop and then start the cores from sysfs, that is
>> not yet supported. The hang is thus expected.
> What? Then the driver is broken, even more. Why wasn't it fully implemented?
The driver is capable of starting a core and stopping it all well. The 
problem is, when we stop a core from sysfs (without resetting the SoC 
itself), the remotecore is powered off, but its resources are not 
relinquished. So when we start it back, there could be some memory 
corruptions. This feature of "graceful shutdown" of remotecores is 
almost implemented and will be posted to this driver soon. Request you 
to try out after that.

With graceful shutdown feature, this will be the flow:
1. We issue a core stop operation from sysfs.
2. The remoteproc driver sends a special "SHUTDOWN" mailbox message to 
the remotecore.
3. The remotecore relinquishes all of its acquired resources through 
Device Manager Firmware and sends an ACK back.
4. The remotecore enters WFI state and then is resetted through Host core.
5. Then, if we try to do the core start operation from sysfs, core 
should be up as expected.

Thanks,
Beleswar
>
> Jan
>
Beleswar Prasad Padhi Aug. 20, 2024, 3:15 p.m. UTC | #6
On 20-08-2024 20:29, Beleswar Prasad Padhi wrote:
>
> On 20-08-2024 19:50, Jan Kiszka wrote:
>> On 20.08.24 11:48, Beleswar Prasad Padhi wrote:
>>> On 20-08-2024 15:09, Jan Kiszka wrote:
>>>> On 20.08.24 11:30, Beleswar Prasad Padhi wrote:
>>>>> Hi Jan,
>>>>>
>>>>> On 19-08-2024 22:17, Jan Kiszka wrote:
>>>>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>>>>
>>>>>> When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed
>>>>>> first. When core 0 should then be stopped before its removal, it 
>>>>>> will
>>>>>> find core1->rproc as NULL already and crashes. Happens on rmmod e.g.
>>>>> Did you check this on top of -next-20240820 tag? There was a 
>>>>> series[0]
>>>>> which was merged recently which fixed this condition. I don't see 
>>>>> this
>>>>> issue when trying on top of -next-20240820 tag.
>>>>> [0]:
>>>>> https://lore.kernel.org/all/20240808074127.2688131-1-b-padhi@ti.com/
>>>>>
>>>> I didn't try those yet, I was on 6.11-rcX. But from reading them
>>>> quickly, I'm not seeing the two issues I found directly addressed 
>>>> there.
>>> Check the comment by Andrew Davis[0], that addresses the above issue.
>>>
>>> [0]:
>>> https://lore.kernel.org/all/0bba5293-a55d-4f13-887c-272a54d6e1ca@ti.com/ 
>>>
>>>
>> OK, then someone still needs to update his patch accordingly.
> That comment was addressed in the v4 series revision[1] and was merged 
> to linux-next, available with tag -next-20240820. Request you to 
> please check if the issue persists with -next-20240820 tag. I checked 
> myself, and was not able to reproduce.
> [1]: https://lore.kernel.org/all/Zr9nbWnADDB+ZOlg@p14s/
>>
>>>>>> Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power
>>>>>> up before core0 via sysfs")
>>>>>> CC: stable@vger.kernel.org
>>>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>>>>> ---
>>>>>>
>>>>>> There might be one more because I can still make this driver crash
>>>>>> after an operator error. Were error scenarios tested at all?
>>>>> Can you point out what is this issue more specifically, and I can 
>>>>> take
>>>>> this up then.
>>>> Try starting core1 before core0, and then again - system will hang or
>>> If you are trying to stop and then start the cores from sysfs, that is
>>> not yet supported. The hang is thus expected.
>> What? Then the driver is broken, even more. Why wasn't it fully 
>> implemented?


Just wanted to point out that this "graceful shutdown" feature is 
majorly dependent on the Device Manager Firmware(point 3) and minimal 
changes to the remoteproc driver (point 2 and 4). Thus, as soon as 
Firmware is capable, we will send out the patches for this feature.

> The driver is capable of starting a core and stopping it all well. The 
> problem is, when we stop a core from sysfs (without resetting the SoC 
> itself), the remotecore is powered off, but its resources are not 
> relinquished. So when we start it back, there could be some memory 
> corruptions. This feature of "graceful shutdown" of remotecores is 
> almost implemented and will be posted to this driver soon. Request you 
> to try out after that.
>
> With graceful shutdown feature, this will be the flow:
> 1. We issue a core stop operation from sysfs.
> 2. The remoteproc driver sends a special "SHUTDOWN" mailbox message to 
> the remotecore.
> 3. The remotecore relinquishes all of its acquired resources through 
> Device Manager Firmware and sends an ACK back.
> 4. The remotecore enters WFI state and then is resetted through Host 
> core.
> 5. Then, if we try to do the core start operation from sysfs, core 
> should be up as expected.
>
> Thanks,
> Beleswar
>>
>> Jan
>>
diff mbox series

Patch

diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c b/drivers/remoteproc/ti_k3_r5_remoteproc.c
index eb09d2e9b32a..9ebd7a34e638 100644
--- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
@@ -646,7 +646,8 @@  static int k3_r5_rproc_stop(struct rproc *rproc)
 		/* do not allow core 0 to stop before core 1 */
 		core1 = list_last_entry(&cluster->cores, struct k3_r5_core,
 					elem);
-		if (core != core1 && core1->rproc->state != RPROC_OFFLINE) {
+		if (core != core1 && core1->rproc &&
+		    core1->rproc->state != RPROC_OFFLINE) {
 			dev_err(dev, "%s: can not stop core 0 before core 1\n",
 				__func__);
 			ret = -EPERM;