diff mbox series

[2/2] soc: qcom: pmic_glink: Handle GLINK intent allocation rejections

Message ID 20241022-pmic-glink-ecancelled-v1-2-9e26fc74e0a3@oss.qualcomm.com (mailing list archive)
State Superseded
Headers show
Series soc: qcom: pmic_glink: Resolve failures to bring up pmic_glink | expand

Commit Message

Bjorn Andersson Oct. 22, 2024, 4:17 a.m. UTC
Some versions of the pmic_glink firmware does not allow dynamic GLINK
intent allocations, attempting to send a message before the firmware has
allocated its receive buffers and announced these intent allocations
will fail. When this happens something like this showns up in the log:

	[    9.799719] pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125)
	[    9.812446] pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125
	[    9.831796] ucsi_glink.pmic_glink_ucsi pmic_glink.ucsi.0: failed to send UCSI read request: -125

GLINK has been updated to distinguish between the cases where the remote
is going down (-ECANCELLED) and the intent allocation being rejected
(-EAGAIN).

Retry the send until intent buffers becomes available, or an actual
error occur.

To avoid infinitely waiting for the firmware in the event that this
misbehaves and no intents arrive, an arbitrary 10 second timeout is
used.

This patch was developed with input from Chris Lew.

Reported-by: Johan Hovold <johan@kernel.org>
Closes: https://lore.kernel.org/all/Zqet8iInnDhnxkT9@hovoldconsulting.com/#t
Cc: stable@vger.kernel.org
Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>
---
 drivers/soc/qcom/pmic_glink.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

Comments

Johan Hovold Oct. 22, 2024, 3:30 p.m. UTC | #1
On Tue, Oct 22, 2024 at 04:17:12AM +0000, Bjorn Andersson wrote:
> Some versions of the pmic_glink firmware does not allow dynamic GLINK
> intent allocations, attempting to send a message before the firmware has
> allocated its receive buffers and announced these intent allocations
> will fail. When this happens something like this showns up in the log:
> 
> 	[    9.799719] pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125)
> 	[    9.812446] pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125
> 	[    9.831796] ucsi_glink.pmic_glink_ucsi pmic_glink.ucsi.0: failed to send UCSI read request: -125

I think you should drop the time stamps here, and also add the battery
notification error to make the patch easier to find when searching for
these errors:

	qcom_battmgr.pmic_glink_power_supply pmic_glink.power-supply.0: failed to request power notifications

> GLINK has been updated to distinguish between the cases where the remote
> is going down (-ECANCELLED) and the intent allocation being rejected
> (-EAGAIN).
> 
> Retry the send until intent buffers becomes available, or an actual
> error occur.
> 
> To avoid infinitely waiting for the firmware in the event that this
> misbehaves and no intents arrive, an arbitrary 10 second timeout is
> used.
> 
> This patch was developed with input from Chris Lew.
> 
> Reported-by: Johan Hovold <johan@kernel.org>
> Closes: https://lore.kernel.org/all/Zqet8iInnDhnxkT9@hovoldconsulting.com/#t

This indeed seems to fix the -ECANCELED related errors I reported above,
but the audio probe failure still remains as expected:

	PDR: avs/audio get domain list txn wait failed: -110
	PDR: service lookup for avs/audio failed: -110

I hit it on the third reboot and then again after another 75 reboots
(and have never seen it with the user space pd-mapper over several
hundred boots).

Do you guys have any theories as to what is causing the above with the
in-kernel pd-mapper (beyond the obvious changes in timing)?

> Cc: stable@vger.kernel.org

Can you add a Fixes tag here?

This patch depends on the former, but that is not necessarily obvious
for someone backporting this (and the previous patch is only going to be
backported to 6.4).

Perhaps you can use the stable tag dependency annotation or even mark
the previous patch so that it is backported far enough.

> Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>

Tested-by: Johan Hovold <johan+linaro@kernel.org>
	
> ---
>  drivers/soc/qcom/pmic_glink.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/soc/qcom/pmic_glink.c b/drivers/soc/qcom/pmic_glink.c
> index 9606222993fd78e80d776ea299cad024a0197e91..221639f3da149da1f967dbc769a97d327ffd6c63 100644
> --- a/drivers/soc/qcom/pmic_glink.c
> +++ b/drivers/soc/qcom/pmic_glink.c
> @@ -13,6 +13,8 @@
>  #include <linux/soc/qcom/pmic_glink.h>
>  #include <linux/spinlock.h>
>  
> +#define PMIC_GLINK_SEND_TIMEOUT (10*HZ)

nit: spaces around *

Ten seconds seems a little excessive; are there any reasons for not
picking something shorter like 5 s (also used by USB but that comes from
spec)?

> +
>  enum {
>  	PMIC_GLINK_CLIENT_BATT = 0,
>  	PMIC_GLINK_CLIENT_ALTMODE,
> @@ -112,13 +114,23 @@ EXPORT_SYMBOL_GPL(pmic_glink_client_register);
>  int pmic_glink_send(struct pmic_glink_client *client, void *data, size_t len)
>  {
>  	struct pmic_glink *pg = client->pg;
> +	unsigned long start;
> +	bool timeout_reached = false;

No need to initialise.

>  	int ret;
>  
>  	mutex_lock(&pg->state_lock);
> -	if (!pg->ept)
> +	if (!pg->ept) {
>  		ret = -ECONNRESET;
> -	else
> -		ret = rpmsg_send(pg->ept, data, len);
> +	} else {
> +		start = jiffies;
> +		do {
> +			timeout_reached = time_after(jiffies, start + PMIC_GLINK_SEND_TIMEOUT);
> +			ret = rpmsg_send(pg->ept, data, len);

Add a delay here to avoid hammering the remote side with requests in a
tight loop for 10 s?

> +		} while (ret == -EAGAIN && !timeout_reached);
> +
> +		if (ret == -EAGAIN && timeout_reached)
> +			ret = -ETIMEDOUT;
> +	}
>  	mutex_unlock(&pg->state_lock);
>  
>  	return ret;

Looks good to me otherwise: 

Reviewed-by: Johan Hovold <johan+linaro@kernel.org>

Johan
Bjorn Andersson Oct. 23, 2024, 9:13 p.m. UTC | #2
On Tue, Oct 22, 2024 at 05:30:55PM GMT, Johan Hovold wrote:
> On Tue, Oct 22, 2024 at 04:17:12AM +0000, Bjorn Andersson wrote:
[..]
> > Reported-by: Johan Hovold <johan@kernel.org>
> > Closes: https://lore.kernel.org/all/Zqet8iInnDhnxkT9@hovoldconsulting.com/#t
> 
> This indeed seems to fix the -ECANCELED related errors I reported above,
> but the audio probe failure still remains as expected:
> 
> 	PDR: avs/audio get domain list txn wait failed: -110
> 	PDR: service lookup for avs/audio failed: -110
> 
> I hit it on the third reboot and then again after another 75 reboots
> (and have never seen it with the user space pd-mapper over several
> hundred boots).
> 
> Do you guys have any theories as to what is causing the above with the
> in-kernel pd-mapper (beyond the obvious changes in timing)?
> 

Not yet. This would be a timeout in a completely different codepath.

I'm trying to figure out a better way to reproduce this, than just
restarting the whole machine...

Thanks for the review.

Regards,
Bjorn
diff mbox series

Patch

diff --git a/drivers/soc/qcom/pmic_glink.c b/drivers/soc/qcom/pmic_glink.c
index 9606222993fd78e80d776ea299cad024a0197e91..221639f3da149da1f967dbc769a97d327ffd6c63 100644
--- a/drivers/soc/qcom/pmic_glink.c
+++ b/drivers/soc/qcom/pmic_glink.c
@@ -13,6 +13,8 @@ 
 #include <linux/soc/qcom/pmic_glink.h>
 #include <linux/spinlock.h>
 
+#define PMIC_GLINK_SEND_TIMEOUT (10*HZ)
+
 enum {
 	PMIC_GLINK_CLIENT_BATT = 0,
 	PMIC_GLINK_CLIENT_ALTMODE,
@@ -112,13 +114,23 @@  EXPORT_SYMBOL_GPL(pmic_glink_client_register);
 int pmic_glink_send(struct pmic_glink_client *client, void *data, size_t len)
 {
 	struct pmic_glink *pg = client->pg;
+	unsigned long start;
+	bool timeout_reached = false;
 	int ret;
 
 	mutex_lock(&pg->state_lock);
-	if (!pg->ept)
+	if (!pg->ept) {
 		ret = -ECONNRESET;
-	else
-		ret = rpmsg_send(pg->ept, data, len);
+	} else {
+		start = jiffies;
+		do {
+			timeout_reached = time_after(jiffies, start + PMIC_GLINK_SEND_TIMEOUT);
+			ret = rpmsg_send(pg->ept, data, len);
+		} while (ret == -EAGAIN && !timeout_reached);
+
+		if (ret == -EAGAIN && timeout_reached)
+			ret = -ETIMEDOUT;
+	}
 	mutex_unlock(&pg->state_lock);
 
 	return ret;