diff mbox series

soc: qcom: mark pd-mapper as broken

Message ID 20241010074246.15725-1-johan+linaro@kernel.org (mailing list archive)
State New
Headers show
Series soc: qcom: mark pd-mapper as broken | expand

Commit Message

Johan Hovold Oct. 10, 2024, 7:42 a.m. UTC
When using the in-kernel pd-mapper on x1e80100, client drivers often
fail to communicate with the firmware during boot, which specifically
breaks battery and USB-C altmode notifications. This has been observed
to happen on almost every second boot (41%) but likely depends on probe
order:

    pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125)
    pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125

    ucsi_glink.pmic_glink_ucsi pmic_glink.ucsi.0: failed to send UCSI read request: -125

    qcom_battmgr.pmic_glink_power_supply pmic_glink.power-supply.0: failed to request power notifications

In the same setup audio also fails to probe albeit much more rarely:

    PDR: avs/audio get domain list txn wait failed: -110
    PDR: service lookup for avs/audio failed: -110

Chris Lew has provided an analysis and is working on a fix for the
ECANCELED (125) errors, but it is not yet clear whether this will also
address the audio regression.

Even if this was first observed on x1e80100 there is currently no reason
to believe that these issues are specific to that platform.

Disable the in-kernel pd-mapper for now, and make sure to backport this
to stable to prevent users and distros from migrating away from the
user-space service.

Fixes: 1ebcde047c54 ("soc: qcom: add pd-mapper implementation")
Cc: stable@vger.kernel.org	# 6.11
Link: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
---

It's now been over two months since I reported this regression, and even
if we seem to be making some progress on at least some of these issues I
think we need disable the pd-mapper temporarily until the fixes are in
place (e.g. to prevent distros from dropping the user-space service).

Johan


#regzbot introduced: 1ebcde047c54


 drivers/soc/qcom/Kconfig | 1 +
 1 file changed, 1 insertion(+)

Comments

Dmitry Baryshkov Oct. 10, 2024, 9:55 a.m. UTC | #1
On Thu, 10 Oct 2024 at 10:44, Johan Hovold <johan+linaro@kernel.org> wrote:
>
> When using the in-kernel pd-mapper on x1e80100, client drivers often
> fail to communicate with the firmware during boot, which specifically
> breaks battery and USB-C altmode notifications. This has been observed
> to happen on almost every second boot (41%) but likely depends on probe
> order:
>
>     pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125)
>     pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125
>
>     ucsi_glink.pmic_glink_ucsi pmic_glink.ucsi.0: failed to send UCSI read request: -125
>
>     qcom_battmgr.pmic_glink_power_supply pmic_glink.power-supply.0: failed to request power notifications
>
> In the same setup audio also fails to probe albeit much more rarely:
>
>     PDR: avs/audio get domain list txn wait failed: -110
>     PDR: service lookup for avs/audio failed: -110
>
> Chris Lew has provided an analysis and is working on a fix for the
> ECANCELED (125) errors, but it is not yet clear whether this will also
> address the audio regression.
>
> Even if this was first observed on x1e80100 there is currently no reason
> to believe that these issues are specific to that platform.
>
> Disable the in-kernel pd-mapper for now, and make sure to backport this
> to stable to prevent users and distros from migrating away from the
> user-space service.
>
> Fixes: 1ebcde047c54 ("soc: qcom: add pd-mapper implementation")
> Cc: stable@vger.kernel.org      # 6.11
> Link: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>

Please don't break what is working. pd_mapper is working on all
previous platforms. I suggest reverting commit bd6db1f1486e ("soc:
qcom: pd_mapper: Add X1E80100") instead.

> ---
>
> It's now been over two months since I reported this regression, and even
> if we seem to be making some progress on at least some of these issues I
> think we need disable the pd-mapper temporarily until the fixes are in
> place (e.g. to prevent distros from dropping the user-space service).
>
> Johan
>
>
> #regzbot introduced: 1ebcde047c54
>
>
>  drivers/soc/qcom/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/soc/qcom/Kconfig b/drivers/soc/qcom/Kconfig
> index 74b9121240f8..35ddab9338d4 100644
> --- a/drivers/soc/qcom/Kconfig
> +++ b/drivers/soc/qcom/Kconfig
> @@ -78,6 +78,7 @@ config QCOM_PD_MAPPER
>         select QCOM_PDR_MSG
>         select AUXILIARY_BUS
>         depends on NET && QRTR && (ARCH_QCOM || COMPILE_TEST)
> +       depends on BROKEN
>         default QCOM_RPROC_COMMON
>         help
>           The Protection Domain Mapper maps registered services to the domains
> --
> 2.45.2
>
Johan Hovold Oct. 10, 2024, 10:11 a.m. UTC | #2
On Thu, Oct 10, 2024 at 12:55:48PM +0300, Dmitry Baryshkov wrote:
> On Thu, 10 Oct 2024 at 10:44, Johan Hovold <johan+linaro@kernel.org> wrote:
> >
> > When using the in-kernel pd-mapper on x1e80100, client drivers often
> > fail to communicate with the firmware during boot, which specifically
> > breaks battery and USB-C altmode notifications. This has been observed
> > to happen on almost every second boot (41%) but likely depends on probe
> > order:
> >
> >     pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125)
> >     pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125
> >
> >     ucsi_glink.pmic_glink_ucsi pmic_glink.ucsi.0: failed to send UCSI read request: -125
> >
> >     qcom_battmgr.pmic_glink_power_supply pmic_glink.power-supply.0: failed to request power notifications
> >
> > In the same setup audio also fails to probe albeit much more rarely:
> >
> >     PDR: avs/audio get domain list txn wait failed: -110
> >     PDR: service lookup for avs/audio failed: -110
> >
> > Chris Lew has provided an analysis and is working on a fix for the
> > ECANCELED (125) errors, but it is not yet clear whether this will also
> > address the audio regression.
> >
> > Even if this was first observed on x1e80100 there is currently no reason
> > to believe that these issues are specific to that platform.
> >
> > Disable the in-kernel pd-mapper for now, and make sure to backport this
> > to stable to prevent users and distros from migrating away from the
> > user-space service.
> >
> > Fixes: 1ebcde047c54 ("soc: qcom: add pd-mapper implementation")
> > Cc: stable@vger.kernel.org      # 6.11
> > Link: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/
> > Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> 
> Please don't break what is working. pd_mapper is working on all
> previous platforms. I suggest reverting commit bd6db1f1486e ("soc:
> qcom: pd_mapper: Add X1E80100") instead.

As I tried to explain in the commit message, there is currently nothing
indicating that these issues are specific to x1e80100 (even if you may
not hit them in your setup depending on things like probe order).

Let's disable it until the underlying bugs have been addressed.

Johan
Dmitry Baryshkov Oct. 10, 2024, 10:55 a.m. UTC | #3
On Thu, 10 Oct 2024 at 13:11, Johan Hovold <johan@kernel.org> wrote:
>
> On Thu, Oct 10, 2024 at 12:55:48PM +0300, Dmitry Baryshkov wrote:
> > On Thu, 10 Oct 2024 at 10:44, Johan Hovold <johan+linaro@kernel.org> wrote:
> > >
> > > When using the in-kernel pd-mapper on x1e80100, client drivers often
> > > fail to communicate with the firmware during boot, which specifically
> > > breaks battery and USB-C altmode notifications. This has been observed
> > > to happen on almost every second boot (41%) but likely depends on probe
> > > order:
> > >
> > >     pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125)
> > >     pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125
> > >
> > >     ucsi_glink.pmic_glink_ucsi pmic_glink.ucsi.0: failed to send UCSI read request: -125
> > >
> > >     qcom_battmgr.pmic_glink_power_supply pmic_glink.power-supply.0: failed to request power notifications
> > >
> > > In the same setup audio also fails to probe albeit much more rarely:
> > >
> > >     PDR: avs/audio get domain list txn wait failed: -110
> > >     PDR: service lookup for avs/audio failed: -110
> > >
> > > Chris Lew has provided an analysis and is working on a fix for the
> > > ECANCELED (125) errors, but it is not yet clear whether this will also
> > > address the audio regression.
> > >
> > > Even if this was first observed on x1e80100 there is currently no reason
> > > to believe that these issues are specific to that platform.
> > >
> > > Disable the in-kernel pd-mapper for now, and make sure to backport this
> > > to stable to prevent users and distros from migrating away from the
> > > user-space service.
> > >
> > > Fixes: 1ebcde047c54 ("soc: qcom: add pd-mapper implementation")
> > > Cc: stable@vger.kernel.org      # 6.11
> > > Link: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/
> > > Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> >
> > Please don't break what is working. pd_mapper is working on all
> > previous platforms. I suggest reverting commit bd6db1f1486e ("soc:
> > qcom: pd_mapper: Add X1E80100") instead.
>
> As I tried to explain in the commit message, there is currently nothing
> indicating that these issues are specific to x1e80100 (even if you may
> not hit them in your setup depending on things like probe order).

I have the understanding that the issues are related to the ADSP
switching the firmware on the fly, which is only used on X1E8.

>
> Let's disable it until the underlying bugs have been addressed.
>
> Johan
Johan Hovold Oct. 10, 2024, 11:44 a.m. UTC | #4
On Thu, Oct 10, 2024 at 01:55:11PM +0300, Dmitry Baryshkov wrote:
> On Thu, 10 Oct 2024 at 13:11, Johan Hovold <johan@kernel.org> wrote:
> > On Thu, Oct 10, 2024 at 12:55:48PM +0300, Dmitry Baryshkov wrote:

> > > Please don't break what is working. pd_mapper is working on all
> > > previous platforms. I suggest reverting commit bd6db1f1486e ("soc:
> > > qcom: pd_mapper: Add X1E80100") instead.
> >
> > As I tried to explain in the commit message, there is currently nothing
> > indicating that these issues are specific to x1e80100 (even if you may
> > not hit them in your setup depending on things like probe order).
> 
> I have the understanding that the issues are related to the ADSP
> switching the firmware on the fly, which is only used on X1E8.

Is this speculation on your part or something that has recently been
confirmed to be the case? AFAIK, there is nothing SoC specific about the
ECANCELED issue, and we also still do not know what is causing the audio
regression.

The thing is, we have a working and well-tested solution in the
user-space service so there is no rush to switch to the in-kernel one
(and risk distros removing the user-space service) before this has been
fixed.

Johan
Neil Armstrong Oct. 10, 2024, 11:46 a.m. UTC | #5
On 10/10/2024 13:44, Johan Hovold wrote:
> On Thu, Oct 10, 2024 at 01:55:11PM +0300, Dmitry Baryshkov wrote:
>> On Thu, 10 Oct 2024 at 13:11, Johan Hovold <johan@kernel.org> wrote:
>>> On Thu, Oct 10, 2024 at 12:55:48PM +0300, Dmitry Baryshkov wrote:
> 
>>>> Please don't break what is working. pd_mapper is working on all
>>>> previous platforms. I suggest reverting commit bd6db1f1486e ("soc:
>>>> qcom: pd_mapper: Add X1E80100") instead.
>>>
>>> As I tried to explain in the commit message, there is currently nothing
>>> indicating that these issues are specific to x1e80100 (even if you may
>>> not hit them in your setup depending on things like probe order).
>>
>> I have the understanding that the issues are related to the ADSP
>> switching the firmware on the fly, which is only used on X1E8.
> 
> Is this speculation on your part or something that has recently been
> confirmed to be the case? AFAIK, there is nothing SoC specific about the
> ECANCELED issue, and we also still do not know what is causing the audio
> regression.
> 
> The thing is, we have a working and well-tested solution in the
> user-space service so there is no rush to switch to the in-kernel one
> (and risk distros removing the user-space service) before this has been
> fixed.

The in-kernel pd-mapper works fine on SM8550 and SM8650, please just revert
the X1E8 patch as suggested by Dmitry.

Neil

> 
> Johan
>
Johan Hovold Oct. 10, 2024, 1:24 p.m. UTC | #6
On Thu, Oct 10, 2024 at 01:46:48PM +0200, neil.armstrong@linaro.org wrote:
> >> On Thu, 10 Oct 2024 at 13:11, Johan Hovold <johan@kernel.org> wrote:

> >>> As I tried to explain in the commit message, there is currently nothing
> >>> indicating that these issues are specific to x1e80100 (even if you may
> >>> not hit them in your setup depending on things like probe order).

> The in-kernel pd-mapper works fine on SM8550 and SM8650, please just revert
> the X1E8 patch as suggested by Dmitry.

Again, you may just be lucky, we have x1e users that also don't hit
these issues due to how things are timed during boot in their setups.

If there's some actual evidence that suggests that this is limited to
x1e, then that would of course be a different matter, but I'm not aware
of anything like that currently.

Johan
Dmitry Baryshkov Oct. 10, 2024, 1:45 p.m. UTC | #7
On Thu, Oct 10, 2024 at 03:24:19PM GMT, Johan Hovold wrote:
> On Thu, Oct 10, 2024 at 01:46:48PM +0200, neil.armstrong@linaro.org wrote:
> > >> On Thu, 10 Oct 2024 at 13:11, Johan Hovold <johan@kernel.org> wrote:
> 
> > >>> As I tried to explain in the commit message, there is currently nothing
> > >>> indicating that these issues are specific to x1e80100 (even if you may
> > >>> not hit them in your setup depending on things like probe order).
> 
> > The in-kernel pd-mapper works fine on SM8550 and SM8650, please just revert
> > the X1E8 patch as suggested by Dmitry.
> 
> Again, you may just be lucky, we have x1e users that also don't hit
> these issues due to how things are timed during boot in their setups.
> 
> If there's some actual evidence that suggests that this is limited to
> x1e, then that would of course be a different matter, but I'm not aware
> of anything like that currently.

Is there an evidence that it is broken on other platforms? I have been
daily driving the pd-mapper in my testing kernels for a long period of
time.
Johan Hovold Oct. 10, 2024, 2:07 p.m. UTC | #8
On Thu, Oct 10, 2024 at 04:45:57PM +0300, Dmitry Baryshkov wrote:
> On Thu, Oct 10, 2024 at 03:24:19PM GMT, Johan Hovold wrote:

> > Again, you may just be lucky, we have x1e users that also don't hit
> > these issues due to how things are timed during boot in their setups.
> > 
> > If there's some actual evidence that suggests that this is limited to
> > x1e, then that would of course be a different matter, but I'm not aware
> > of anything like that currently.
> 
> Is there an evidence that it is broken on other platforms? I have been
> daily driving the pd-mapper in my testing kernels for a long period of
> time.

Yes, Chris's analysis of the ECANCELED issue suggests that this is not
SoC specific.

Johan
Dmitry Baryshkov Oct. 10, 2024, 2:13 p.m. UTC | #9
On Thu, 10 Oct 2024 at 17:07, Johan Hovold <johan@kernel.org> wrote:
>
> On Thu, Oct 10, 2024 at 04:45:57PM +0300, Dmitry Baryshkov wrote:
> > On Thu, Oct 10, 2024 at 03:24:19PM GMT, Johan Hovold wrote:
>
> > > Again, you may just be lucky, we have x1e users that also don't hit
> > > these issues due to how things are timed during boot in their setups.
> > >
> > > If there's some actual evidence that suggests that this is limited to
> > > x1e, then that would of course be a different matter, but I'm not aware
> > > of anything like that currently.
> >
> > Is there an evidence that it is broken on other platforms? I have been
> > daily driving the pd-mapper in my testing kernels for a long period of
> > time.
>
> Yes, Chris's analysis of the ECANCELED issue suggests that this is not
> SoC specific.

"When the firmware implements the glink channel this way...", etc.
Yes, it doesn't sound like being SoC-specific, but we don't know which
SoC use this implementation.

>
> Johan
Johan Hovold Oct. 10, 2024, 2:20 p.m. UTC | #10
On Thu, Oct 10, 2024 at 05:13:44PM +0300, Dmitry Baryshkov wrote:
> On Thu, 10 Oct 2024 at 17:07, Johan Hovold <johan@kernel.org> wrote:

> > Yes, Chris's analysis of the ECANCELED issue suggests that this is not
> > SoC specific.
> 
> "When the firmware implements the glink channel this way...", etc.
> Yes, it doesn't sound like being SoC-specific, but we don't know which
> SoC use this implementation.

So let's err on the safe side until we have more information and avoid
having distros drop the user-space daemon until these known bugs exposed
by the in-kernel pd-mapper have been fixed.

Johan
Dmitry Baryshkov Oct. 10, 2024, 2:42 p.m. UTC | #11
On Thu, Oct 10, 2024 at 04:20:40PM GMT, Johan Hovold wrote:
> On Thu, Oct 10, 2024 at 05:13:44PM +0300, Dmitry Baryshkov wrote:
> > On Thu, 10 Oct 2024 at 17:07, Johan Hovold <johan@kernel.org> wrote:
> 
> > > Yes, Chris's analysis of the ECANCELED issue suggests that this is not
> > > SoC specific.
> > 
> > "When the firmware implements the glink channel this way...", etc.
> > Yes, it doesn't sound like being SoC-specific, but we don't know which
> > SoC use this implementation.
> 
> So let's err on the safe side until we have more information and avoid
> having distros drop the user-space daemon until these known bugs exposed
> by the in-kernel pd-mapper have been fixed.

Then default n + revert X1E sounds like a better approach?

> 
> Johan
Stephan Gerhold Oct. 11, 2024, 10:01 a.m. UTC | #12
On Thu, Oct 10, 2024 at 09:42:46AM +0200, Johan Hovold wrote:
> When using the in-kernel pd-mapper on x1e80100, client drivers often
> fail to communicate with the firmware during boot, which specifically
> breaks battery and USB-C altmode notifications. This has been observed
> to happen on almost every second boot (41%) but likely depends on probe
> order:
> 
>     pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125)
>     pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125
> 
>     ucsi_glink.pmic_glink_ucsi pmic_glink.ucsi.0: failed to send UCSI read request: -125
> 
>     qcom_battmgr.pmic_glink_power_supply pmic_glink.power-supply.0: failed to request power notifications
> 
> In the same setup audio also fails to probe albeit much more rarely:
> 
>     PDR: avs/audio get domain list txn wait failed: -110
>     PDR: service lookup for avs/audio failed: -110
> 
> Chris Lew has provided an analysis and is working on a fix for the
> ECANCELED (125) errors, but it is not yet clear whether this will also
> address the audio regression.
> 
> Even if this was first observed on x1e80100 there is currently no reason
> to believe that these issues are specific to that platform.
> 
> Disable the in-kernel pd-mapper for now, and make sure to backport this
> to stable to prevent users and distros from migrating away from the
> user-space service.
> 
> Fixes: 1ebcde047c54 ("soc: qcom: add pd-mapper implementation")
> Cc: stable@vger.kernel.org	# 6.11
> Link: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/
> Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
> ---
> 
> It's now been over two months since I reported this regression, and even
> if we seem to be making some progress on at least some of these issues I
> think we need disable the pd-mapper temporarily until the fixes are in
> place (e.g. to prevent distros from dropping the user-space service).
> 

This is just a random thought, but I wonder if we could insert a delay
somewhere as temporary workaround to make the in-kernel pd-mapper more
reliable. I just tried replicating the userspace pd-mapper timing on
X1E80100 CRD by:

 1. Disabling auto-loading of qcom_pd_mapper
    (modprobe.blacklist=qcom_pd_mapper)
 2. Adding a systemd service that does nothing except running
    "modprobe qcom_pd_mapper" at the same point in time where the 
    userspace pd-mapper would usually be started.

This seems to work quite well for me, I haven't seen any of the
mentioned errors anymore in a couple of boot tests. Clearly, there is no
actual bug in the in-kernel pd-mapper, only worse timing.

Thanks,
Stephan
diff mbox series

Patch

diff --git a/drivers/soc/qcom/Kconfig b/drivers/soc/qcom/Kconfig
index 74b9121240f8..35ddab9338d4 100644
--- a/drivers/soc/qcom/Kconfig
+++ b/drivers/soc/qcom/Kconfig
@@ -78,6 +78,7 @@  config QCOM_PD_MAPPER
 	select QCOM_PDR_MSG
 	select AUXILIARY_BUS
 	depends on NET && QRTR && (ARCH_QCOM || COMPILE_TEST)
+	depends on BROKEN
 	default QCOM_RPROC_COMMON
 	help
 	  The Protection Domain Mapper maps registered services to the domains