drm/i915/guc: Check ctx while waiting for response

Message ID	20220602172119.96324-1-zhanjun.dong@intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> From: Zhanjun Dong <zhanjun.dong@intel.com> To: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org Date: Thu, 2 Jun 2022 10:21:19 -0700 Message-Id: <20220602172119.96324-1-zhanjun.dong@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Subject: [Intel-gfx] [PATCH] drm/i915/guc: Check ctx while waiting for response Precedence: list Cc: Zhanjun Dong <zhanjun.dong@intel.com> Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
Series	drm/i915/guc: Check ctx while waiting for response \| expand drm/i915/guc: Check ctx while waiting for response

Dong, Zhanjun June 2, 2022, 5:21 p.m. UTC

We are seeing error message of "No response for request". Some cases happened
while waiting for response and reset/suspend action was triggered. In this
case, no response is not an error, active requests will be cancelled.

This patch will handle this condition and change the error message into
debug message.

Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

Alan Previn June 14, 2022, 3:34 p.m. UTC | #1

Can't see anything wrong with this.
I consider this only a NIT, so feel : not sure if -ECANCELLED is reflective of the "ct service being temporarily down"
as opposed to the "requester cancelling". Perhaps a -EPIPE or -EAGAIN (if we got this far, we know we are probably mid-
reset) ?? (if not already used elsewhere along this callstack). Else :

Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com>

...alan


On Thu, 2022-06-02 at 10:21 -0700, Zhanjun Dong wrote:
> We are seeing error message of "No response for request". Some cases happened
> while waiting for response and reset/suspend action was triggered. In this
> case, no response is not an error, active requests will be cancelled.
> 
> This patch will handle this condition and change the error message into
> debug message.
> 
> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> index f01325cd1b62..a30a388877e2 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> @@ -467,7 +467,7 @@ static int ct_write(struct intel_guc_ct *ct,
>   * *	0 response received (status is valid)
>   * *	-ETIMEDOUT no response within hardcoded timeout
>   */
> -static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
> +static int wait_for_ct_request_update(struct ct_request *req, u32 *status, struct intel_guc_ct *ct)
>  {
>  	int err;
>  
> @@ -481,12 +481,14 @@ static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
>  #define GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS 10
>  #define GUC_CTB_RESPONSE_TIMEOUT_LONG_MS 1000
>  #define done \
> -	(FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
> +	(!intel_guc_ct_enabled(ct) || FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
>  	 GUC_HXG_ORIGIN_GUC)
>  	err = wait_for_us(done, GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS);
>  	if (err)
>  		err = wait_for(done, GUC_CTB_RESPONSE_TIMEOUT_LONG_MS);
>  #undef done
> +	if (!intel_guc_ct_enabled(ct))
> +		err = -ECANCELED;
>  
>  	*status = req->status;
>  	return err;
> @@ -703,11 +705,15 @@ static int ct_send(struct intel_guc_ct *ct,
>  
>  	intel_guc_notify(ct_to_guc(ct));
>  
> -	err = wait_for_ct_request_update(&request, status);
> +	err = wait_for_ct_request_update(&request, status, ct);
>  	g2h_release_space(ct, GUC_CTB_HXG_MSG_MAX_LEN);
>  	if (unlikely(err)) {
> -		CT_ERROR(ct, "No response for request %#x (fence %u)\n",
> -			 action[0], request.fence);
> +		if (unlikely(err == ECANCELED))
> +			CT_DEBUG(ct, "Request %#x (fence %u) cancelled as CTB is disabled\n",
> +				action[0], request.fence);
> +		else
> +			CT_ERROR(ct, "No response for request %#x (fence %u)\n",
> +				action[0], request.fence);
>  		goto unlink;
>  	}
>  
> @@ -771,8 +777,9 @@ int intel_guc_ct_send(struct intel_guc_ct *ct, const u32 *action, u32 len,
>  
>  	ret = ct_send(ct, action, len, response_buf, response_buf_size, &status);
>  	if (unlikely(ret < 0)) {
> -		CT_ERROR(ct, "Sending action %#x failed (%pe) status=%#X\n",
> -			 action[0], ERR_PTR(ret), status);
> +		if (likely(ret != ECANCELED))
> +			CT_ERROR(ct, "Sending action %#x failed (%pe) status=%#X\n",
> +				action[0], ERR_PTR(ret), status);
>  	} else if (unlikely(ret)) {
>  		CT_DEBUG(ct, "send action %#x returned %d (%#x)\n",
>  			 action[0], ret, ret);
> -- 
> 2.36.0
>

Michal Wajdeczko June 14, 2022, 4:10 p.m. UTC | #2

On 02.06.2022 19:21, Zhanjun Dong wrote:
> We are seeing error message of "No response for request". Some cases happened
> while waiting for response and reset/suspend action was triggered. In this
> case, no response is not an error, active requests will be cancelled.
> 
> This patch will handle this condition and change the error message into
> debug message.
> 
> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> index f01325cd1b62..a30a388877e2 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> @@ -467,7 +467,7 @@ static int ct_write(struct intel_guc_ct *ct,
>   * *	0 response received (status is valid)
>   * *	-ETIMEDOUT no response within hardcoded timeout
>   */
> -static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
> +static int wait_for_ct_request_update(struct ct_request *req, u32 *status, struct intel_guc_ct *ct)

if you need to add "intel_guc_ct *ct" param then make it the first one

>  {
>  	int err;
>  
> @@ -481,12 +481,14 @@ static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
>  #define GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS 10
>  #define GUC_CTB_RESPONSE_TIMEOUT_LONG_MS 1000
>  #define done \
> -	(FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
> +	(!intel_guc_ct_enabled(ct) || FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
>  	 GUC_HXG_ORIGIN_GUC)
>  	err = wait_for_us(done, GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS);
>  	if (err)
>  		err = wait_for(done, GUC_CTB_RESPONSE_TIMEOUT_LONG_MS);
>  #undef done
> +	if (!intel_guc_ct_enabled(ct))
> +		err = -ECANCELED;
>  
>  	*status = req->status;
>  	return err;
> @@ -703,11 +705,15 @@ static int ct_send(struct intel_guc_ct *ct,
>  
>  	intel_guc_notify(ct_to_guc(ct));
>  
> -	err = wait_for_ct_request_update(&request, status);
> +	err = wait_for_ct_request_update(&request, status, ct);
>  	g2h_release_space(ct, GUC_CTB_HXG_MSG_MAX_LEN);
>  	if (unlikely(err)) {
> -		CT_ERROR(ct, "No response for request %#x (fence %u)\n",
> -			 action[0], request.fence);
> +		if (unlikely(err == ECANCELED))

you are looking for -ECANCELED
and I guess you can safely drop "unlikely" hint here

> +			CT_DEBUG(ct, "Request %#x (fence %u) cancelled as CTB is disabled\n",
> +				action[0], request.fence);
> +		else
> +			CT_ERROR(ct, "No response for request %#x (fence %u)\n",
> +				action[0], request.fence);
>  		goto unlink;
>  	}
>  
> @@ -771,8 +777,9 @@ int intel_guc_ct_send(struct intel_guc_ct *ct, const u32 *action, u32 len,
>  
>  	ret = ct_send(ct, action, len, response_buf, response_buf_size, &status);
>  	if (unlikely(ret < 0)) {
> -		CT_ERROR(ct, "Sending action %#x failed (%pe) status=%#X\n",
> -			 action[0], ERR_PTR(ret), status);
> +		if (likely(ret != ECANCELED))

ditto

,Michal

> +			CT_ERROR(ct, "Sending action %#x failed (%pe) status=%#X\n",
> +				action[0], ERR_PTR(ret), status);
>  	} else if (unlikely(ret)) {
>  		CT_DEBUG(ct, "send action %#x returned %d (%#x)\n",
>  			 action[0], ret, ret);

Jani Nikula June 14, 2022, 4:21 p.m. UTC | #3

On Thu, 02 Jun 2022, Zhanjun Dong <zhanjun.dong@intel.com> wrote:
> We are seeing error message of "No response for request". Some cases happened
> while waiting for response and reset/suspend action was triggered. In this
> case, no response is not an error, active requests will be cancelled.
>
> This patch will handle this condition and change the error message into
> debug message.
>
> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> index f01325cd1b62..a30a388877e2 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> @@ -467,7 +467,7 @@ static int ct_write(struct intel_guc_ct *ct,
>   * *	0 response received (status is valid)
>   * *	-ETIMEDOUT no response within hardcoded timeout
>   */
> -static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
> +static int wait_for_ct_request_update(struct ct_request *req, u32 *status, struct intel_guc_ct *ct)
>  {
>  	int err;
>  
> @@ -481,12 +481,14 @@ static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
>  #define GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS 10
>  #define GUC_CTB_RESPONSE_TIMEOUT_LONG_MS 1000
>  #define done \
> -	(FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
> +	(!intel_guc_ct_enabled(ct) || FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
>  	 GUC_HXG_ORIGIN_GUC)
>  	err = wait_for_us(done, GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS);
>  	if (err)
>  		err = wait_for(done, GUC_CTB_RESPONSE_TIMEOUT_LONG_MS);
>  #undef done
> +	if (!intel_guc_ct_enabled(ct))
> +		err = -ECANCELED;
>  
>  	*status = req->status;
>  	return err;
> @@ -703,11 +705,15 @@ static int ct_send(struct intel_guc_ct *ct,
>  
>  	intel_guc_notify(ct_to_guc(ct));
>  
> -	err = wait_for_ct_request_update(&request, status);
> +	err = wait_for_ct_request_update(&request, status, ct);
>  	g2h_release_space(ct, GUC_CTB_HXG_MSG_MAX_LEN);
>  	if (unlikely(err)) {
> -		CT_ERROR(ct, "No response for request %#x (fence %u)\n",
> -			 action[0], request.fence);
> +		if (unlikely(err == ECANCELED))

I think you mean -ECANCELED, not ECANCELED.

Please drop the unlikely(). I no longer want to see a single unlikely()
or likely() added anywhere without proper performance
justification. They make the code harder to read, for no real benefit,
and people just cargo cult copy paste them everywhere. Moreover, nested
unlikely/likely is just silly.

> +			CT_DEBUG(ct, "Request %#x (fence %u) cancelled as CTB is disabled\n",
> +				action[0], request.fence);
> +		else
> +			CT_ERROR(ct, "No response for request %#x (fence %u)\n",
> +				action[0], request.fence);
>  		goto unlink;
>  	}
>  
> @@ -771,8 +777,9 @@ int intel_guc_ct_send(struct intel_guc_ct *ct, const u32 *action, u32 len,
>  
>  	ret = ct_send(ct, action, len, response_buf, response_buf_size, &status);
>  	if (unlikely(ret < 0)) {
> -		CT_ERROR(ct, "Sending action %#x failed (%pe) status=%#X\n",
> -			 action[0], ERR_PTR(ret), status);
> +		if (likely(ret != ECANCELED))

Ditto for -ECANCELED and likely().

BR,
Jani.

> +			CT_ERROR(ct, "Sending action %#x failed (%pe) status=%#X\n",
> +				action[0], ERR_PTR(ret), status);
>  	} else if (unlikely(ret)) {
>  		CT_DEBUG(ct, "send action %#x returned %d (%#x)\n",
>  			 action[0], ret, ret);

Dixit, Ashutosh June 14, 2022, 4:28 p.m. UTC | #4

On Thu, 02 Jun 2022 10:21:19 -0700, Zhanjun Dong wrote:
>

Hi Zhanjun,

> We are seeing error message of "No response for request". Some cases happened
> while waiting for response and reset/suspend action was triggered. In this
> case, no response is not an error, active requests will be cancelled.
>
> This patch will handle this condition and change the error message into
> debug message.

IMO the patch title should be changed: which ctx are we checking while
waiting for response? Something like "check for ct enabled while waiting
for response"?

> @@ -481,12 +481,14 @@ static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
>  #define GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS 10
>  #define GUC_CTB_RESPONSE_TIMEOUT_LONG_MS 1000
>  #define done \
> -	(FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
> +	(!intel_guc_ct_enabled(ct) || FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
>	 GUC_HXG_ORIGIN_GUC)
>	err = wait_for_us(done, GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS);
>	if (err)
>		err = wait_for(done, GUC_CTB_RESPONSE_TIMEOUT_LONG_MS);
>  #undef done
> +	if (!intel_guc_ct_enabled(ct))
> +		err = -ECANCELED;

Also, I really don't like intel_guc_ct_enabled() being called in two
places. Is there a possibility that intel_guc_ct_enabled() can return false
in the first place (causing the wait to exit) and then return true in the
second place (so we don't return -ECANCELED)?

Is it possible to change the status of the request to something else from
intel_guc_ct_disable() (or wherever ct->enabled is set to false) rather
than introducing intel_guc_ct_enabled() checks here. Changing the status of
the request when CT goes down would cause the wait's to exit here. And then
we can check that special request status signifying CT went down?

Thanks.
--
Ashutosh

Dong, Zhanjun June 14, 2022, 6:42 p.m. UTC | #5

Thanks for all comments, I will update code and prepare for next version.

Regards,
Zhanjun

-----Original Message-----
From: Dixit, Ashutosh <ashutosh.dixit@intel.com> 
Sent: June 14, 2022 12:28 PM
To: Dong, Zhanjun <zhanjun.dong@intel.com>
Cc: intel-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Teres Alexis, Alan Previn <alan.previn.teres.alexis@intel.com>; Wajdeczko, Michal <Michal.Wajdeczko@intel.com>
Subject: Re: [PATCH] drm/i915/guc: Check ctx while waiting for response

On Thu, 02 Jun 2022 10:21:19 -0700, Zhanjun Dong wrote:
>

Hi Zhanjun,

> We are seeing error message of "No response for request". Some cases 
> happened while waiting for response and reset/suspend action was 
> triggered. In this case, no response is not an error, active requests will be cancelled.
>
> This patch will handle this condition and change the error message 
> into debug message.

IMO the patch title should be changed: which ctx are we checking while waiting for response? Something like "check for ct enabled while waiting for response"?

> @@ -481,12 +481,14 @@ static int wait_for_ct_request_update(struct 
> ct_request *req, u32 *status)  #define 
> GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS 10  #define 
> GUC_CTB_RESPONSE_TIMEOUT_LONG_MS 1000  #define done \
> -	(FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
> +	(!intel_guc_ct_enabled(ct) || FIELD_GET(GUC_HXG_MSG_0_ORIGIN, 
> +READ_ONCE(req->status)) == \
>	 GUC_HXG_ORIGIN_GUC)
>	err = wait_for_us(done, GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS);
>	if (err)
>		err = wait_for(done, GUC_CTB_RESPONSE_TIMEOUT_LONG_MS);
>  #undef done
> +	if (!intel_guc_ct_enabled(ct))
> +		err = -ECANCELED;

Also, I really don't like intel_guc_ct_enabled() being called in two places. Is there a possibility that intel_guc_ct_enabled() can return false in the first place (causing the wait to exit) and then return true in the second place (so we don't return -ECANCELED)?

Is it possible to change the status of the request to something else from
intel_guc_ct_disable() (or wherever ct->enabled is set to false) rather than introducing intel_guc_ct_enabled() checks here. Changing the status of the request when CT goes down would cause the wait's to exit here. And then we can check that special request status signifying CT went down?

Thanks.
--
Ashutosh

Dixit, Ashutosh June 14, 2022, 8:01 p.m. UTC | #6

On Tue, 14 Jun 2022 09:28:14 -0700, Dixit, Ashutosh wrote:
> On Thu, 02 Jun 2022 10:21:19 -0700, Zhanjun Dong wrote:
>
> > @@ -481,12 +481,14 @@ static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
> >  #define GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS 10
> >  #define GUC_CTB_RESPONSE_TIMEOUT_LONG_MS 1000
> >  #define done \
> > -	(FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
> > +	(!intel_guc_ct_enabled(ct) || FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
> >	 GUC_HXG_ORIGIN_GUC)
> >	err = wait_for_us(done, GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS);
> >	if (err)
> >		err = wait_for(done, GUC_CTB_RESPONSE_TIMEOUT_LONG_MS);
> >  #undef done
> > +	if (!intel_guc_ct_enabled(ct))
> > +		err = -ECANCELED;
>
> Also, I really don't like intel_guc_ct_enabled() being called in two
> places. Is there a possibility that intel_guc_ct_enabled() can return false
> in the first place (causing the wait to exit) and then return true in the
> second place (so we don't return -ECANCELED)?
>
> Is it possible to change the status of the request to something else from
> intel_guc_ct_disable() (or wherever ct->enabled is set to false) rather
> than introducing intel_guc_ct_enabled() checks here. Changing the status of
> the request when CT goes down would cause the wait's to exit here. And then
> we can check that special request status signifying CT went down?

I think there are free bits in the request status fields which can be
used. But setting the request status say from intel_guc_ct_enabled() might
not be straightforward since you have to locate waiting requests and also
there may be multiple such requests (waiting in different threads).

Maybe an easier way might be to do something like:

bool foo(ct, req)
{
	if (!intel_guc_ct_enabled(ct)) {
		req->status = CT_WENT_AWAY;
		return true;
	}
	return false;
}

Now in your patch we can substitute foo() instead of
!intel_guc_ct_enabled(ct) so that we have:

 #define done \
    (foo() || FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
     GUC_HXG_ORIGIN_GUC)

And then check for req->status == CT_WENT_AWAY (most likely in ct_send()).

drm/i915/guc: Check ctx while waiting for response

Commit Message

Comments

Patch