Message ID | 20241010074246.15725-1-johan+linaro@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | soc: qcom: mark pd-mapper as broken | expand |
On Thu, 10 Oct 2024 at 10:44, Johan Hovold <johan+linaro@kernel.org> wrote: > > When using the in-kernel pd-mapper on x1e80100, client drivers often > fail to communicate with the firmware during boot, which specifically > breaks battery and USB-C altmode notifications. This has been observed > to happen on almost every second boot (41%) but likely depends on probe > order: > > pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125) > pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125 > > ucsi_glink.pmic_glink_ucsi pmic_glink.ucsi.0: failed to send UCSI read request: -125 > > qcom_battmgr.pmic_glink_power_supply pmic_glink.power-supply.0: failed to request power notifications > > In the same setup audio also fails to probe albeit much more rarely: > > PDR: avs/audio get domain list txn wait failed: -110 > PDR: service lookup for avs/audio failed: -110 > > Chris Lew has provided an analysis and is working on a fix for the > ECANCELED (125) errors, but it is not yet clear whether this will also > address the audio regression. > > Even if this was first observed on x1e80100 there is currently no reason > to believe that these issues are specific to that platform. > > Disable the in-kernel pd-mapper for now, and make sure to backport this > to stable to prevent users and distros from migrating away from the > user-space service. > > Fixes: 1ebcde047c54 ("soc: qcom: add pd-mapper implementation") > Cc: stable@vger.kernel.org # 6.11 > Link: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/ > Signed-off-by: Johan Hovold <johan+linaro@kernel.org> Please don't break what is working. pd_mapper is working on all previous platforms. I suggest reverting commit bd6db1f1486e ("soc: qcom: pd_mapper: Add X1E80100") instead. > --- > > It's now been over two months since I reported this regression, and even > if we seem to be making some progress on at least some of these issues I > think we need disable the pd-mapper temporarily until the fixes are in > place (e.g. to prevent distros from dropping the user-space service). > > Johan > > > #regzbot introduced: 1ebcde047c54 > > > drivers/soc/qcom/Kconfig | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/soc/qcom/Kconfig b/drivers/soc/qcom/Kconfig > index 74b9121240f8..35ddab9338d4 100644 > --- a/drivers/soc/qcom/Kconfig > +++ b/drivers/soc/qcom/Kconfig > @@ -78,6 +78,7 @@ config QCOM_PD_MAPPER > select QCOM_PDR_MSG > select AUXILIARY_BUS > depends on NET && QRTR && (ARCH_QCOM || COMPILE_TEST) > + depends on BROKEN > default QCOM_RPROC_COMMON > help > The Protection Domain Mapper maps registered services to the domains > -- > 2.45.2 >
On Thu, Oct 10, 2024 at 12:55:48PM +0300, Dmitry Baryshkov wrote: > On Thu, 10 Oct 2024 at 10:44, Johan Hovold <johan+linaro@kernel.org> wrote: > > > > When using the in-kernel pd-mapper on x1e80100, client drivers often > > fail to communicate with the firmware during boot, which specifically > > breaks battery and USB-C altmode notifications. This has been observed > > to happen on almost every second boot (41%) but likely depends on probe > > order: > > > > pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125) > > pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125 > > > > ucsi_glink.pmic_glink_ucsi pmic_glink.ucsi.0: failed to send UCSI read request: -125 > > > > qcom_battmgr.pmic_glink_power_supply pmic_glink.power-supply.0: failed to request power notifications > > > > In the same setup audio also fails to probe albeit much more rarely: > > > > PDR: avs/audio get domain list txn wait failed: -110 > > PDR: service lookup for avs/audio failed: -110 > > > > Chris Lew has provided an analysis and is working on a fix for the > > ECANCELED (125) errors, but it is not yet clear whether this will also > > address the audio regression. > > > > Even if this was first observed on x1e80100 there is currently no reason > > to believe that these issues are specific to that platform. > > > > Disable the in-kernel pd-mapper for now, and make sure to backport this > > to stable to prevent users and distros from migrating away from the > > user-space service. > > > > Fixes: 1ebcde047c54 ("soc: qcom: add pd-mapper implementation") > > Cc: stable@vger.kernel.org # 6.11 > > Link: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/ > > Signed-off-by: Johan Hovold <johan+linaro@kernel.org> > > Please don't break what is working. pd_mapper is working on all > previous platforms. I suggest reverting commit bd6db1f1486e ("soc: > qcom: pd_mapper: Add X1E80100") instead. As I tried to explain in the commit message, there is currently nothing indicating that these issues are specific to x1e80100 (even if you may not hit them in your setup depending on things like probe order). Let's disable it until the underlying bugs have been addressed. Johan
On Thu, 10 Oct 2024 at 13:11, Johan Hovold <johan@kernel.org> wrote: > > On Thu, Oct 10, 2024 at 12:55:48PM +0300, Dmitry Baryshkov wrote: > > On Thu, 10 Oct 2024 at 10:44, Johan Hovold <johan+linaro@kernel.org> wrote: > > > > > > When using the in-kernel pd-mapper on x1e80100, client drivers often > > > fail to communicate with the firmware during boot, which specifically > > > breaks battery and USB-C altmode notifications. This has been observed > > > to happen on almost every second boot (41%) but likely depends on probe > > > order: > > > > > > pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125) > > > pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125 > > > > > > ucsi_glink.pmic_glink_ucsi pmic_glink.ucsi.0: failed to send UCSI read request: -125 > > > > > > qcom_battmgr.pmic_glink_power_supply pmic_glink.power-supply.0: failed to request power notifications > > > > > > In the same setup audio also fails to probe albeit much more rarely: > > > > > > PDR: avs/audio get domain list txn wait failed: -110 > > > PDR: service lookup for avs/audio failed: -110 > > > > > > Chris Lew has provided an analysis and is working on a fix for the > > > ECANCELED (125) errors, but it is not yet clear whether this will also > > > address the audio regression. > > > > > > Even if this was first observed on x1e80100 there is currently no reason > > > to believe that these issues are specific to that platform. > > > > > > Disable the in-kernel pd-mapper for now, and make sure to backport this > > > to stable to prevent users and distros from migrating away from the > > > user-space service. > > > > > > Fixes: 1ebcde047c54 ("soc: qcom: add pd-mapper implementation") > > > Cc: stable@vger.kernel.org # 6.11 > > > Link: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/ > > > Signed-off-by: Johan Hovold <johan+linaro@kernel.org> > > > > Please don't break what is working. pd_mapper is working on all > > previous platforms. I suggest reverting commit bd6db1f1486e ("soc: > > qcom: pd_mapper: Add X1E80100") instead. > > As I tried to explain in the commit message, there is currently nothing > indicating that these issues are specific to x1e80100 (even if you may > not hit them in your setup depending on things like probe order). I have the understanding that the issues are related to the ADSP switching the firmware on the fly, which is only used on X1E8. > > Let's disable it until the underlying bugs have been addressed. > > Johan
On Thu, Oct 10, 2024 at 01:55:11PM +0300, Dmitry Baryshkov wrote: > On Thu, 10 Oct 2024 at 13:11, Johan Hovold <johan@kernel.org> wrote: > > On Thu, Oct 10, 2024 at 12:55:48PM +0300, Dmitry Baryshkov wrote: > > > Please don't break what is working. pd_mapper is working on all > > > previous platforms. I suggest reverting commit bd6db1f1486e ("soc: > > > qcom: pd_mapper: Add X1E80100") instead. > > > > As I tried to explain in the commit message, there is currently nothing > > indicating that these issues are specific to x1e80100 (even if you may > > not hit them in your setup depending on things like probe order). > > I have the understanding that the issues are related to the ADSP > switching the firmware on the fly, which is only used on X1E8. Is this speculation on your part or something that has recently been confirmed to be the case? AFAIK, there is nothing SoC specific about the ECANCELED issue, and we also still do not know what is causing the audio regression. The thing is, we have a working and well-tested solution in the user-space service so there is no rush to switch to the in-kernel one (and risk distros removing the user-space service) before this has been fixed. Johan
On 10/10/2024 13:44, Johan Hovold wrote: > On Thu, Oct 10, 2024 at 01:55:11PM +0300, Dmitry Baryshkov wrote: >> On Thu, 10 Oct 2024 at 13:11, Johan Hovold <johan@kernel.org> wrote: >>> On Thu, Oct 10, 2024 at 12:55:48PM +0300, Dmitry Baryshkov wrote: > >>>> Please don't break what is working. pd_mapper is working on all >>>> previous platforms. I suggest reverting commit bd6db1f1486e ("soc: >>>> qcom: pd_mapper: Add X1E80100") instead. >>> >>> As I tried to explain in the commit message, there is currently nothing >>> indicating that these issues are specific to x1e80100 (even if you may >>> not hit them in your setup depending on things like probe order). >> >> I have the understanding that the issues are related to the ADSP >> switching the firmware on the fly, which is only used on X1E8. > > Is this speculation on your part or something that has recently been > confirmed to be the case? AFAIK, there is nothing SoC specific about the > ECANCELED issue, and we also still do not know what is causing the audio > regression. > > The thing is, we have a working and well-tested solution in the > user-space service so there is no rush to switch to the in-kernel one > (and risk distros removing the user-space service) before this has been > fixed. The in-kernel pd-mapper works fine on SM8550 and SM8650, please just revert the X1E8 patch as suggested by Dmitry. Neil > > Johan >
On Thu, Oct 10, 2024 at 01:46:48PM +0200, neil.armstrong@linaro.org wrote: > >> On Thu, 10 Oct 2024 at 13:11, Johan Hovold <johan@kernel.org> wrote: > >>> As I tried to explain in the commit message, there is currently nothing > >>> indicating that these issues are specific to x1e80100 (even if you may > >>> not hit them in your setup depending on things like probe order). > The in-kernel pd-mapper works fine on SM8550 and SM8650, please just revert > the X1E8 patch as suggested by Dmitry. Again, you may just be lucky, we have x1e users that also don't hit these issues due to how things are timed during boot in their setups. If there's some actual evidence that suggests that this is limited to x1e, then that would of course be a different matter, but I'm not aware of anything like that currently. Johan
On Thu, Oct 10, 2024 at 03:24:19PM GMT, Johan Hovold wrote: > On Thu, Oct 10, 2024 at 01:46:48PM +0200, neil.armstrong@linaro.org wrote: > > >> On Thu, 10 Oct 2024 at 13:11, Johan Hovold <johan@kernel.org> wrote: > > > >>> As I tried to explain in the commit message, there is currently nothing > > >>> indicating that these issues are specific to x1e80100 (even if you may > > >>> not hit them in your setup depending on things like probe order). > > > The in-kernel pd-mapper works fine on SM8550 and SM8650, please just revert > > the X1E8 patch as suggested by Dmitry. > > Again, you may just be lucky, we have x1e users that also don't hit > these issues due to how things are timed during boot in their setups. > > If there's some actual evidence that suggests that this is limited to > x1e, then that would of course be a different matter, but I'm not aware > of anything like that currently. Is there an evidence that it is broken on other platforms? I have been daily driving the pd-mapper in my testing kernels for a long period of time.
On Thu, Oct 10, 2024 at 04:45:57PM +0300, Dmitry Baryshkov wrote: > On Thu, Oct 10, 2024 at 03:24:19PM GMT, Johan Hovold wrote: > > Again, you may just be lucky, we have x1e users that also don't hit > > these issues due to how things are timed during boot in their setups. > > > > If there's some actual evidence that suggests that this is limited to > > x1e, then that would of course be a different matter, but I'm not aware > > of anything like that currently. > > Is there an evidence that it is broken on other platforms? I have been > daily driving the pd-mapper in my testing kernels for a long period of > time. Yes, Chris's analysis of the ECANCELED issue suggests that this is not SoC specific. Johan
On Thu, 10 Oct 2024 at 17:07, Johan Hovold <johan@kernel.org> wrote: > > On Thu, Oct 10, 2024 at 04:45:57PM +0300, Dmitry Baryshkov wrote: > > On Thu, Oct 10, 2024 at 03:24:19PM GMT, Johan Hovold wrote: > > > > Again, you may just be lucky, we have x1e users that also don't hit > > > these issues due to how things are timed during boot in their setups. > > > > > > If there's some actual evidence that suggests that this is limited to > > > x1e, then that would of course be a different matter, but I'm not aware > > > of anything like that currently. > > > > Is there an evidence that it is broken on other platforms? I have been > > daily driving the pd-mapper in my testing kernels for a long period of > > time. > > Yes, Chris's analysis of the ECANCELED issue suggests that this is not > SoC specific. "When the firmware implements the glink channel this way...", etc. Yes, it doesn't sound like being SoC-specific, but we don't know which SoC use this implementation. > > Johan
On Thu, Oct 10, 2024 at 05:13:44PM +0300, Dmitry Baryshkov wrote: > On Thu, 10 Oct 2024 at 17:07, Johan Hovold <johan@kernel.org> wrote: > > Yes, Chris's analysis of the ECANCELED issue suggests that this is not > > SoC specific. > > "When the firmware implements the glink channel this way...", etc. > Yes, it doesn't sound like being SoC-specific, but we don't know which > SoC use this implementation. So let's err on the safe side until we have more information and avoid having distros drop the user-space daemon until these known bugs exposed by the in-kernel pd-mapper have been fixed. Johan
On Thu, Oct 10, 2024 at 04:20:40PM GMT, Johan Hovold wrote: > On Thu, Oct 10, 2024 at 05:13:44PM +0300, Dmitry Baryshkov wrote: > > On Thu, 10 Oct 2024 at 17:07, Johan Hovold <johan@kernel.org> wrote: > > > > Yes, Chris's analysis of the ECANCELED issue suggests that this is not > > > SoC specific. > > > > "When the firmware implements the glink channel this way...", etc. > > Yes, it doesn't sound like being SoC-specific, but we don't know which > > SoC use this implementation. > > So let's err on the safe side until we have more information and avoid > having distros drop the user-space daemon until these known bugs exposed > by the in-kernel pd-mapper have been fixed. Then default n + revert X1E sounds like a better approach? > > Johan
On Thu, Oct 10, 2024 at 09:42:46AM +0200, Johan Hovold wrote: > When using the in-kernel pd-mapper on x1e80100, client drivers often > fail to communicate with the firmware during boot, which specifically > breaks battery and USB-C altmode notifications. This has been observed > to happen on almost every second boot (41%) but likely depends on probe > order: > > pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125) > pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125 > > ucsi_glink.pmic_glink_ucsi pmic_glink.ucsi.0: failed to send UCSI read request: -125 > > qcom_battmgr.pmic_glink_power_supply pmic_glink.power-supply.0: failed to request power notifications > > In the same setup audio also fails to probe albeit much more rarely: > > PDR: avs/audio get domain list txn wait failed: -110 > PDR: service lookup for avs/audio failed: -110 > > Chris Lew has provided an analysis and is working on a fix for the > ECANCELED (125) errors, but it is not yet clear whether this will also > address the audio regression. > > Even if this was first observed on x1e80100 there is currently no reason > to believe that these issues are specific to that platform. > > Disable the in-kernel pd-mapper for now, and make sure to backport this > to stable to prevent users and distros from migrating away from the > user-space service. > > Fixes: 1ebcde047c54 ("soc: qcom: add pd-mapper implementation") > Cc: stable@vger.kernel.org # 6.11 > Link: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/ > Signed-off-by: Johan Hovold <johan+linaro@kernel.org> > --- > > It's now been over two months since I reported this regression, and even > if we seem to be making some progress on at least some of these issues I > think we need disable the pd-mapper temporarily until the fixes are in > place (e.g. to prevent distros from dropping the user-space service). > This is just a random thought, but I wonder if we could insert a delay somewhere as temporary workaround to make the in-kernel pd-mapper more reliable. I just tried replicating the userspace pd-mapper timing on X1E80100 CRD by: 1. Disabling auto-loading of qcom_pd_mapper (modprobe.blacklist=qcom_pd_mapper) 2. Adding a systemd service that does nothing except running "modprobe qcom_pd_mapper" at the same point in time where the userspace pd-mapper would usually be started. This seems to work quite well for me, I haven't seen any of the mentioned errors anymore in a couple of boot tests. Clearly, there is no actual bug in the in-kernel pd-mapper, only worse timing. Thanks, Stephan
diff --git a/drivers/soc/qcom/Kconfig b/drivers/soc/qcom/Kconfig index 74b9121240f8..35ddab9338d4 100644 --- a/drivers/soc/qcom/Kconfig +++ b/drivers/soc/qcom/Kconfig @@ -78,6 +78,7 @@ config QCOM_PD_MAPPER select QCOM_PDR_MSG select AUXILIARY_BUS depends on NET && QRTR && (ARCH_QCOM || COMPILE_TEST) + depends on BROKEN default QCOM_RPROC_COMMON help The Protection Domain Mapper maps registered services to the domains
When using the in-kernel pd-mapper on x1e80100, client drivers often fail to communicate with the firmware during boot, which specifically breaks battery and USB-C altmode notifications. This has been observed to happen on almost every second boot (41%) but likely depends on probe order: pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125) pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125 ucsi_glink.pmic_glink_ucsi pmic_glink.ucsi.0: failed to send UCSI read request: -125 qcom_battmgr.pmic_glink_power_supply pmic_glink.power-supply.0: failed to request power notifications In the same setup audio also fails to probe albeit much more rarely: PDR: avs/audio get domain list txn wait failed: -110 PDR: service lookup for avs/audio failed: -110 Chris Lew has provided an analysis and is working on a fix for the ECANCELED (125) errors, but it is not yet clear whether this will also address the audio regression. Even if this was first observed on x1e80100 there is currently no reason to believe that these issues are specific to that platform. Disable the in-kernel pd-mapper for now, and make sure to backport this to stable to prevent users and distros from migrating away from the user-space service. Fixes: 1ebcde047c54 ("soc: qcom: add pd-mapper implementation") Cc: stable@vger.kernel.org # 6.11 Link: https://lore.kernel.org/lkml/Zqet8iInnDhnxkT9@hovoldconsulting.com/ Signed-off-by: Johan Hovold <johan+linaro@kernel.org> --- It's now been over two months since I reported this regression, and even if we seem to be making some progress on at least some of these issues I think we need disable the pd-mapper temporarily until the fixes are in place (e.g. to prevent distros from dropping the user-space service). Johan #regzbot introduced: 1ebcde047c54 drivers/soc/qcom/Kconfig | 1 + 1 file changed, 1 insertion(+)