[5/5] drm/msm/dpu: rate limit snapshot capture for mmu faults

Message ID 20240628214848.4075651-6-quic_abhinavk@quicinc.com (mailing list archive)
State Not Applicable
Series [1/5] drm/msm: register a fault handler for display mmu faults

Commit Message

Abhinav Kumar June 28, 2024, 9:48 p.m. UTC
There is no recovery mechanism in place yet to recover from mmu
faults for DPU. We can only prevent the faults by making sure there
is no misconfiguration.

Rate-limit the snapshot capture for mmu faults to once per
msm_kms_init_aspace() as that should be sufficient to capture
the snapshot for debugging otherwise there will be a lot of
dpu snapshots getting captured for the same fault which is
redundant and also might affect capturing even one snapshot
accurately.

Signed-off-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
---
 drivers/gpu/drm/msm/msm_kms.c | 6 +++++-
 drivers/gpu/drm/msm/msm_kms.h | 3 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

Comments

Dmitry Baryshkov July 1, 2024, 7:43 p.m. UTC | #1
On Fri, Jun 28, 2024 at 02:48:47PM GMT, Abhinav Kumar wrote:
> There is no recovery mechanism in place yet to recover from mmu
> faults for DPU. We can only prevent the faults by making sure there
> is no misconfiguration.
> 
> Rate-limit the snapshot capture for mmu faults to once per
> msm_kms_init_aspace() as that should be sufficient to capture
> the snapshot for debugging otherwise there will be a lot of
> dpu snapshots getting captured for the same fault which is
> redundant and also might affect capturing even one snapshot
> accurately.

Please squash this into the first patch. There is no need to add code
with a known deficiency.

Also, is there a reason why you haven't used <linux/ratelimit.h> ?
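
For illustration, a minimal sketch of what using <linux/ratelimit.h>
in this handler might look like (the 10-second interval and burst of 1
are placeholder assumptions, not values proposed in this series):

static int msm_kms_fault_handler(void *arg, unsigned long iova, int flags, void *data)
{
	struct msm_kms *kms = arg;
	/* at most one snapshot per 10s window; the interval is arbitrary */
	static DEFINE_RATELIMIT_STATE(snapshot_rs, 10 * HZ, 1);

	if (__ratelimit(&snapshot_rs))
		msm_disp_snapshot_state(kms->dev);

	return -ENOSYS;
}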

> 
> Signed-off-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
> ---
>  drivers/gpu/drm/msm/msm_kms.c | 6 +++++-
>  drivers/gpu/drm/msm/msm_kms.h | 3 +++
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/msm/msm_kms.c b/drivers/gpu/drm/msm/msm_kms.c
> index d5d3117259cf..90a333920c01 100644
> --- a/drivers/gpu/drm/msm/msm_kms.c
> +++ b/drivers/gpu/drm/msm/msm_kms.c
> @@ -168,7 +168,10 @@ static int msm_kms_fault_handler(void *arg, unsigned long iova, int flags, void
>  {
>  	struct msm_kms *kms = arg;
>  
> -	msm_disp_snapshot_state(kms->dev);
> +	if (!kms->fault_snapshot_capture) {
> +		msm_disp_snapshot_state(kms->dev);
> +		kms->fault_snapshot_capture++;

When is it decremented?

> +	}
>  
>  	return -ENOSYS;
>  }
> @@ -208,6 +211,7 @@ struct msm_gem_address_space *msm_kms_init_aspace(struct drm_device *dev)
>  		mmu->funcs->destroy(mmu);
>  	}
>  
> +	kms->fault_snapshot_capture = 0;
>  	msm_mmu_set_fault_handler(aspace->mmu, kms, msm_kms_fault_handler);
>  
>  	return aspace;
> diff --git a/drivers/gpu/drm/msm/msm_kms.h b/drivers/gpu/drm/msm/msm_kms.h
> index 1e0c54de3716..240b39e60828 100644
> --- a/drivers/gpu/drm/msm/msm_kms.h
> +++ b/drivers/gpu/drm/msm/msm_kms.h
> @@ -134,6 +134,9 @@ struct msm_kms {
>  	int irq;
>  	bool irq_requested;
>  
> +	/* rate limit the snapshot capture to once per attach */
> +	int fault_snapshot_capture;
> +
>  	/* mapper-id used to request GEM buffer mapped for scanout: */
>  	struct msm_gem_address_space *aspace;
>  
> -- 
> 2.44.0
>
Rob Clark July 15, 2024, 7:51 p.m. UTC | #2
On Mon, Jul 1, 2024 at 12:43 PM Dmitry Baryshkov
<dmitry.baryshkov@linaro.org> wrote:
>
> On Fri, Jun 28, 2024 at 02:48:47PM GMT, Abhinav Kumar wrote:
> > There is no recovery mechanism in place yet to recover from mmu
> > faults for DPU. We can only prevent the faults by making sure there
> > is no misconfiguration.
> >
> > Rate-limit the snapshot capture for mmu faults to once per
> > msm_kms_init_aspace() as that should be sufficient to capture
> > the snapshot for debugging otherwise there will be a lot of
> > dpu snapshots getting captured for the same fault which is
> > redundant and also might affect capturing even one snapshot
> > accurately.
>
> Please squash this into the first patch. There is no need to add code
> with a known deficiency.
>
> Also, is there a reason why you haven't used <linux/ratelimit.h> ?

So, in some ways devcoredump is ratelimited by userspace needing to
clear an existing devcore..

What I'd suggest would be more useful is to limit the devcores to once
per atomic update, ie. if display state hasn't changed, maybe an
additional devcore isn't useful

BR,
-R

>
> >
> > Signed-off-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
> > ---
> >  drivers/gpu/drm/msm/msm_kms.c | 6 +++++-
> >  drivers/gpu/drm/msm/msm_kms.h | 3 +++
> >  2 files changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/msm/msm_kms.c b/drivers/gpu/drm/msm/msm_kms.c
> > index d5d3117259cf..90a333920c01 100644
> > --- a/drivers/gpu/drm/msm/msm_kms.c
> > +++ b/drivers/gpu/drm/msm/msm_kms.c
> > @@ -168,7 +168,10 @@ static int msm_kms_fault_handler(void *arg, unsigned long iova, int flags, void
> >  {
> >       struct msm_kms *kms = arg;
> >
> > -     msm_disp_snapshot_state(kms->dev);
> > +     if (!kms->fault_snapshot_capture) {
> > +             msm_disp_snapshot_state(kms->dev);
> > +             kms->fault_snapshot_capture++;
>
> When is it decremented?
>
> > +     }
> >
> >       return -ENOSYS;
> >  }
> > @@ -208,6 +211,7 @@ struct msm_gem_address_space *msm_kms_init_aspace(struct drm_device *dev)
> >               mmu->funcs->destroy(mmu);
> >       }
> >
> > +     kms->fault_snapshot_capture = 0;
> >       msm_mmu_set_fault_handler(aspace->mmu, kms, msm_kms_fault_handler);
> >
> >       return aspace;
> > diff --git a/drivers/gpu/drm/msm/msm_kms.h b/drivers/gpu/drm/msm/msm_kms.h
> > index 1e0c54de3716..240b39e60828 100644
> > --- a/drivers/gpu/drm/msm/msm_kms.h
> > +++ b/drivers/gpu/drm/msm/msm_kms.h
> > @@ -134,6 +134,9 @@ struct msm_kms {
> >       int irq;
> >       bool irq_requested;
> >
> > +     /* rate limit the snapshot capture to once per attach */
> > +     int fault_snapshot_capture;
> > +
> >       /* mapper-id used to request GEM buffer mapped for scanout: */
> >       struct msm_gem_address_space *aspace;
> >
> > --
> > 2.44.0
> >
>
> --
> With best wishes
> Dmitry
Abhinav Kumar July 16, 2024, 9:25 p.m. UTC | #3
On 7/1/2024 12:43 PM, Dmitry Baryshkov wrote:
> On Fri, Jun 28, 2024 at 02:48:47PM GMT, Abhinav Kumar wrote:
>> There is no recovery mechanism in place yet to recover from mmu
>> faults for DPU. We can only prevent the faults by making sure there
>> is no misconfiguration.
>>
>> Rate-limit the snapshot capture for mmu faults to once per
>> msm_kms_init_aspace() as that should be sufficient to capture
>> the snapshot for debugging otherwise there will be a lot of
>> dpu snapshots getting captured for the same fault which is
>> redundant and also might affect capturing even one snapshot
>> accurately.
> 
> Please squash this into the first patch. There is no need to add code
> with a known deficiency.
> 

Sure, will squash it.

> Also, is there a reason why you haven't used <linux/ratelimit.h> ?
> 

There is really no interval I can settle on that is safe here. In 
fact, rate-limit is probably not the right terminology here.

I should probably just rename this to once per init_aspace(), which is 
essentially once per bootup.

I couldn't come up with a better limiter because, ideally, if we had a 
recovery mechanism we would reset the counter there.

Similar to other DPU errors like underrun and ping-pong timeouts (which 
capture the snapshot once per suspend/resume), I just kept it to once 
per init_aspace().

SMMU faults happen at a pretty rapid rate, and capturing the full DPU 
snapshot each time was redundant. So I thought at least once should be 
enough.

>>
>> Signed-off-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
>> ---
>>   drivers/gpu/drm/msm/msm_kms.c | 6 +++++-
>>   drivers/gpu/drm/msm/msm_kms.h | 3 +++
>>   2 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/msm/msm_kms.c b/drivers/gpu/drm/msm/msm_kms.c
>> index d5d3117259cf..90a333920c01 100644
>> --- a/drivers/gpu/drm/msm/msm_kms.c
>> +++ b/drivers/gpu/drm/msm/msm_kms.c
>> @@ -168,7 +168,10 @@ static int msm_kms_fault_handler(void *arg, unsigned long iova, int flags, void
>>   {
>>   	struct msm_kms *kms = arg;
>>   
>> -	msm_disp_snapshot_state(kms->dev);
>> +	if (!kms->fault_snapshot_capture) {
>> +		msm_disp_snapshot_state(kms->dev);
>> +		kms->fault_snapshot_capture++;
> 
> When is it decremented?
> 

It is not decremented, because it will only increment once per bootup. 
I can switch this to a bool since it will happen only once, unless we 
agree on a better way.
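
For illustration, a minimal sketch of that bool variant (an assumed
shape for such a revision, not a posted patch):

static int msm_kms_fault_handler(void *arg, unsigned long iova, int flags, void *data)
{
	struct msm_kms *kms = arg;

	/* capture only the first fault snapshot per init_aspace() */
	if (!kms->fault_snapshot_capture) {
		msm_disp_snapshot_state(kms->dev);
		kms->fault_snapshot_capture = true;	/* bool in struct msm_kms */
	}

	return -ENOSYS;
}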

>> +	}
>>   
>>   	return -ENOSYS;
>>   }
>> @@ -208,6 +211,7 @@ struct msm_gem_address_space *msm_kms_init_aspace(struct drm_device *dev)
>>   		mmu->funcs->destroy(mmu);
>>   	}
>>   
>> +	kms->fault_snapshot_capture = 0;
>>   	msm_mmu_set_fault_handler(aspace->mmu, kms, msm_kms_fault_handler);
>>   
>>   	return aspace;
>> diff --git a/drivers/gpu/drm/msm/msm_kms.h b/drivers/gpu/drm/msm/msm_kms.h
>> index 1e0c54de3716..240b39e60828 100644
>> --- a/drivers/gpu/drm/msm/msm_kms.h
>> +++ b/drivers/gpu/drm/msm/msm_kms.h
>> @@ -134,6 +134,9 @@ struct msm_kms {
>>   	int irq;
>>   	bool irq_requested;
>>   
>> +	/* rate limit the snapshot capture to once per attach */
>> +	int fault_snapshot_capture;
>> +
>>   	/* mapper-id used to request GEM buffer mapped for scanout: */
>>   	struct msm_gem_address_space *aspace;
>>   
>> -- 
>> 2.44.0
>>
>
Abhinav Kumar July 16, 2024, 9:45 p.m. UTC | #4
On 7/15/2024 12:51 PM, Rob Clark wrote:
> On Mon, Jul 1, 2024 at 12:43 PM Dmitry Baryshkov
> <dmitry.baryshkov@linaro.org> wrote:
>>
>> On Fri, Jun 28, 2024 at 02:48:47PM GMT, Abhinav Kumar wrote:
>>> There is no recovery mechanism in place yet to recover from mmu
>>> faults for DPU. We can only prevent the faults by making sure there
>>> is no misconfiguration.
>>>
>>> Rate-limit the snapshot capture for mmu faults to once per
>>> msm_kms_init_aspace() as that should be sufficient to capture
>>> the snapshot for debugging otherwise there will be a lot of
>>> dpu snapshots getting captured for the same fault which is
>>> redundant and also might affect capturing even one snapshot
>>> accurately.
>>
>> Please squash this into the first patch. There is no need to add code
>> with a known deficiency.
>>
>> Also, is there a reason why you haven't used <linux/ratelimit.h> ?
> 
> So, in some ways devcoredump is ratelimited by userspace needing to
> clear an existing devcore..
> 

Yes, a new devcoredump device will not be created until the previous one 
is consumed or times out but here I am trying to limit even the DPU 
snapshot capture because DPU register space is really huge and the rate 
at which smmu faults occur is so high that it's causing instability 
while snapshots are being captured.

> What I'd suggest would be more useful is to limit the devcores to once
> per atomic update, ie. if display state hasn't changed, maybe an
> additional devcore isn't useful
> 
> BR,
> -R
> 

By display state change, do you mean like the checks we have in 
drm_atomic_crtc_needs_modeset()?

OR do you mean we need to cache the previous (currently picked up by hw) 
state and trigger new devcores if the new state is different by 
comparing more things?

This will help to reduce the snapshots to unique frame updates but I do 
not think it will reduce the rate enough for the case where DPU did not 
recover from the previous fault.

>>
>>>
>>> Signed-off-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
>>> ---
>>>   drivers/gpu/drm/msm/msm_kms.c | 6 +++++-
>>>   drivers/gpu/drm/msm/msm_kms.h | 3 +++
>>>   2 files changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/msm/msm_kms.c b/drivers/gpu/drm/msm/msm_kms.c
>>> index d5d3117259cf..90a333920c01 100644
>>> --- a/drivers/gpu/drm/msm/msm_kms.c
>>> +++ b/drivers/gpu/drm/msm/msm_kms.c
>>> @@ -168,7 +168,10 @@ static int msm_kms_fault_handler(void *arg, unsigned long iova, int flags, void
>>>   {
>>>        struct msm_kms *kms = arg;
>>>
>>> -     msm_disp_snapshot_state(kms->dev);
>>> +     if (!kms->fault_snapshot_capture) {
>>> +             msm_disp_snapshot_state(kms->dev);
>>> +             kms->fault_snapshot_capture++;
>>
>> When is it decremented?
>>
>>> +     }
>>>
>>>        return -ENOSYS;
>>>   }
>>> @@ -208,6 +211,7 @@ struct msm_gem_address_space *msm_kms_init_aspace(struct drm_device *dev)
>>>                mmu->funcs->destroy(mmu);
>>>        }
>>>
>>> +     kms->fault_snapshot_capture = 0;
>>>        msm_mmu_set_fault_handler(aspace->mmu, kms, msm_kms_fault_handler);
>>>
>>>        return aspace;
>>> diff --git a/drivers/gpu/drm/msm/msm_kms.h b/drivers/gpu/drm/msm/msm_kms.h
>>> index 1e0c54de3716..240b39e60828 100644
>>> --- a/drivers/gpu/drm/msm/msm_kms.h
>>> +++ b/drivers/gpu/drm/msm/msm_kms.h
>>> @@ -134,6 +134,9 @@ struct msm_kms {
>>>        int irq;
>>>        bool irq_requested;
>>>
>>> +     /* rate limit the snapshot capture to once per attach */
>>> +     int fault_snapshot_capture;
>>> +
>>>        /* mapper-id used to request GEM buffer mapped for scanout: */
>>>        struct msm_gem_address_space *aspace;
>>>
>>> --
>>> 2.44.0
>>>
>>
>> --
>> With best wishes
>> Dmitry
Rob Clark July 16, 2024, 9:50 p.m. UTC | #5
On Tue, Jul 16, 2024 at 2:45 PM Abhinav Kumar <quic_abhinavk@quicinc.com> wrote:
>
>
>
> On 7/15/2024 12:51 PM, Rob Clark wrote:
> > On Mon, Jul 1, 2024 at 12:43 PM Dmitry Baryshkov
> > <dmitry.baryshkov@linaro.org> wrote:
> >>
> >> On Fri, Jun 28, 2024 at 02:48:47PM GMT, Abhinav Kumar wrote:
> >>> There is no recovery mechanism in place yet to recover from mmu
> >>> faults for DPU. We can only prevent the faults by making sure there
> >>> is no misconfiguration.
> >>>
> >>> Rate-limit the snapshot capture for mmu faults to once per
> >>> msm_kms_init_aspace() as that should be sufficient to capture
> >>> the snapshot for debugging otherwise there will be a lot of
> >>> dpu snapshots getting captured for the same fault which is
> >>> redundant and also might affect capturing even one snapshot
> >>> accurately.
> >>
> >> Please squash this into the first patch. There is no need to add code
> >> with a known deficiency.
> >>
> >> Also, is there a reason why you haven't used <linux/ratelimit.h> ?
> >
> > So, in some ways devcoredump is ratelimited by userspace needing to
> > clear an existing devcore..
> >
>
> Yes, a new devcoredump device will not be created until the previous one
> is consumed or times out but here I am trying to limit even the DPU
> snapshot capture because DPU register space is really huge and the rate
> at which smmu faults occur is so high that it's causing instability
> while snapshots are being captured.
>
> > What I'd suggest would be more useful is to limit the devcores to once
> > per atomic update, ie. if display state hasn't changed, maybe an
> > additional devcore isn't useful
> >
> > BR,
> > -R
> >
>
> By display state change, do you mean like the checks we have in
> drm_atomic_crtc_needs_modeset()?
>
> OR do you mean we need to cache the previous (currently picked up by hw)
> state and trigger new devcores if the new state is different by
> comparing more things?
>
> This will help to reduce the snapshots to unique frame updates but I do
> not think it will reduce the rate enough for the case where DPU did not
> recover from the previous fault.

I was thinking the easy thing, of just resetting the counter in
msm_atomic_commit_tail().. I suppose we could be clever and filter out
updates that only change scanout address.  Or hash the atomic state
and only generate devcoredumps for unique states.  But I'm not sure
how over-complicated we should make this.
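
For illustration, a minimal sketch of that reset, keeping the existing
int counter (the exact placement inside msm_atomic_commit_tail() in
msm_atomic.c is an assumption):

void msm_atomic_commit_tail(struct drm_atomic_state *state)
{
	struct drm_device *dev = state->dev;
	struct msm_drm_private *priv = dev->dev_private;
	struct msm_kms *kms = priv->kms;

	/* ... existing commit sequence ... */

	/* new display state was committed: allow one more fault snapshot */
	kms->fault_snapshot_capture = 0;
}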

BR,
-R

>
> >>
> >>>
> >>> Signed-off-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
> >>> ---
> >>>   drivers/gpu/drm/msm/msm_kms.c | 6 +++++-
> >>>   drivers/gpu/drm/msm/msm_kms.h | 3 +++
> >>>   2 files changed, 8 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/msm/msm_kms.c b/drivers/gpu/drm/msm/msm_kms.c
> >>> index d5d3117259cf..90a333920c01 100644
> >>> --- a/drivers/gpu/drm/msm/msm_kms.c
> >>> +++ b/drivers/gpu/drm/msm/msm_kms.c
> >>> @@ -168,7 +168,10 @@ static int msm_kms_fault_handler(void *arg, unsigned long iova, int flags, void
> >>>   {
> >>>        struct msm_kms *kms = arg;
> >>>
> >>> -     msm_disp_snapshot_state(kms->dev);
> >>> +     if (!kms->fault_snapshot_capture) {
> >>> +             msm_disp_snapshot_state(kms->dev);
> >>> +             kms->fault_snapshot_capture++;
> >>
> >> When is it decremented?
> >>
> >>> +     }
> >>>
> >>>        return -ENOSYS;
> >>>   }
> >>> @@ -208,6 +211,7 @@ struct msm_gem_address_space *msm_kms_init_aspace(struct drm_device *dev)
> >>>                mmu->funcs->destroy(mmu);
> >>>        }
> >>>
> >>> +     kms->fault_snapshot_capture = 0;
> >>>        msm_mmu_set_fault_handler(aspace->mmu, kms, msm_kms_fault_handler);
> >>>
> >>>        return aspace;
> >>> diff --git a/drivers/gpu/drm/msm/msm_kms.h b/drivers/gpu/drm/msm/msm_kms.h
> >>> index 1e0c54de3716..240b39e60828 100644
> >>> --- a/drivers/gpu/drm/msm/msm_kms.h
> >>> +++ b/drivers/gpu/drm/msm/msm_kms.h
> >>> @@ -134,6 +134,9 @@ struct msm_kms {
> >>>        int irq;
> >>>        bool irq_requested;
> >>>
> >>> +     /* rate limit the snapshot capture to once per attach */
> >>> +     int fault_snapshot_capture;
> >>> +
> >>>        /* mapper-id used to request GEM buffer mapped for scanout: */
> >>>        struct msm_gem_address_space *aspace;
> >>>
> >>> --
> >>> 2.44.0
> >>>
> >>
> >> --
> >> With best wishes
> >> Dmitry
Abhinav Kumar July 16, 2024, 10:43 p.m. UTC | #6
On 7/16/2024 2:50 PM, Rob Clark wrote:
> On Tue, Jul 16, 2024 at 2:45 PM Abhinav Kumar <quic_abhinavk@quicinc.com> wrote:
>>
>>
>>
>> On 7/15/2024 12:51 PM, Rob Clark wrote:
>>> On Mon, Jul 1, 2024 at 12:43 PM Dmitry Baryshkov
>>> <dmitry.baryshkov@linaro.org> wrote:
>>>>
>>>> On Fri, Jun 28, 2024 at 02:48:47PM GMT, Abhinav Kumar wrote:
>>>>> There is no recovery mechanism in place yet to recover from mmu
>>>>> faults for DPU. We can only prevent the faults by making sure there
>>>>> is no misconfiguration.
>>>>>
>>>>> Rate-limit the snapshot capture for mmu faults to once per
>>>>> msm_kms_init_aspace() as that should be sufficient to capture
>>>>> the snapshot for debugging otherwise there will be a lot of
>>>>> dpu snapshots getting captured for the same fault which is
>>>>> redundant and also might affect capturing even one snapshot
>>>>> accurately.
>>>>
>>>> Please squash this into the first patch. There is no need to add code
>>>> with a known deficiency.
>>>>
>>>> Also, is there a reason why you haven't used <linux/ratelimit.h> ?
>>>
>>> So, in some ways devcoredump is ratelimited by userspace needing to
>>> clear an existing devcore..
>>>
>>
>> Yes, a new devcoredump device will not be created until the previous one
>> is consumed or times out but here I am trying to limit even the DPU
>> snapshot capture because DPU register space is really huge and the rate
>> at which smmu faults occur is so high that it's causing instability
>> while snapshots are being captured.
>>
>>> What I'd suggest would be more useful is to limit the devcores to once
>>> per atomic update, ie. if display state hasn't changed, maybe an
>>> additional devcore isn't useful
>>>
>>> BR,
>>> -R
>>>
>>
>> By display state change, do you mean like the checks we have in
>> drm_atomic_crtc_needs_modeset()?
>>
>> OR do you mean we need to cache the previous (currently picked up by hw)
>> state and trigger new devcores if the new state is different by
>> comparing more things?
>>
>> This will help to reduce the snapshots to unique frame updates but I do
>> not think it will reduce the rate enough for the case where DPU did not
>> recover from the previous fault.
> 
> I was thinking the easy thing, of just resetting the counter in
> msm_atomic_commit_tail().. I suppose we could be clever and filter out
> updates that only change scanout address.  Or hash the atomic state
> and only generate devcoredumps for unique states.  But I'm not sure
> how over-complicated we should make this.
> 
> BR,
> -R

It's a good idea actually and I would also like to keep it simple :)

One question, is it okay to assume that all compositors will only issue 
unique commits? Because we are assuming that's the case with resetting 
the counter in msm_atomic_commit_tail().

> 
>>
>>>>
>>>>>
>>>>> Signed-off-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
>>>>> ---
>>>>>    drivers/gpu/drm/msm/msm_kms.c | 6 +++++-
>>>>>    drivers/gpu/drm/msm/msm_kms.h | 3 +++
>>>>>    2 files changed, 8 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/msm/msm_kms.c b/drivers/gpu/drm/msm/msm_kms.c
>>>>> index d5d3117259cf..90a333920c01 100644
>>>>> --- a/drivers/gpu/drm/msm/msm_kms.c
>>>>> +++ b/drivers/gpu/drm/msm/msm_kms.c
>>>>> @@ -168,7 +168,10 @@ static int msm_kms_fault_handler(void *arg, unsigned long iova, int flags, void
>>>>>    {
>>>>>         struct msm_kms *kms = arg;
>>>>>
>>>>> -     msm_disp_snapshot_state(kms->dev);
>>>>> +     if (!kms->fault_snapshot_capture) {
>>>>> +             msm_disp_snapshot_state(kms->dev);
>>>>> +             kms->fault_snapshot_capture++;
>>>>
>>>> When is it decremented?
>>>>
>>>>> +     }
>>>>>
>>>>>         return -ENOSYS;
>>>>>    }
>>>>> @@ -208,6 +211,7 @@ struct msm_gem_address_space *msm_kms_init_aspace(struct drm_device *dev)
>>>>>                 mmu->funcs->destroy(mmu);
>>>>>         }
>>>>>
>>>>> +     kms->fault_snapshot_capture = 0;
>>>>>         msm_mmu_set_fault_handler(aspace->mmu, kms, msm_kms_fault_handler);
>>>>>
>>>>>         return aspace;
>>>>> diff --git a/drivers/gpu/drm/msm/msm_kms.h b/drivers/gpu/drm/msm/msm_kms.h
>>>>> index 1e0c54de3716..240b39e60828 100644
>>>>> --- a/drivers/gpu/drm/msm/msm_kms.h
>>>>> +++ b/drivers/gpu/drm/msm/msm_kms.h
>>>>> @@ -134,6 +134,9 @@ struct msm_kms {
>>>>>         int irq;
>>>>>         bool irq_requested;
>>>>>
>>>>> +     /* rate limit the snapshot capture to once per attach */
>>>>> +     int fault_snapshot_capture;
>>>>> +
>>>>>         /* mapper-id used to request GEM buffer mapped for scanout: */
>>>>>         struct msm_gem_address_space *aspace;
>>>>>
>>>>> --
>>>>> 2.44.0
>>>>>
>>>>
>>>> --
>>>> With best wishes
>>>> Dmitry
Dmitry Baryshkov July 16, 2024, 11:10 p.m. UTC | #7
On Wed, 17 Jul 2024 at 01:43, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote:
>
>
>
> On 7/16/2024 2:50 PM, Rob Clark wrote:
> > On Tue, Jul 16, 2024 at 2:45 PM Abhinav Kumar <quic_abhinavk@quicinc.com> wrote:
> >>
> >>
> >>
> >> On 7/15/2024 12:51 PM, Rob Clark wrote:
> >>> On Mon, Jul 1, 2024 at 12:43 PM Dmitry Baryshkov
> >>> <dmitry.baryshkov@linaro.org> wrote:
> >>>>
> >>>> On Fri, Jun 28, 2024 at 02:48:47PM GMT, Abhinav Kumar wrote:
> >>>>> There is no recovery mechanism in place yet to recover from mmu
> >>>>> faults for DPU. We can only prevent the faults by making sure there
> >>>>> is no misconfiguration.
> >>>>>
> >>>>> Rate-limit the snapshot capture for mmu faults to once per
> >>>>> msm_kms_init_aspace() as that should be sufficient to capture
> >>>>> the snapshot for debugging otherwise there will be a lot of
> >>>>> dpu snapshots getting captured for the same fault which is
> >>>>> redundant and also might affect capturing even one snapshot
> >>>>> accurately.
> >>>>
> >>>> Please squash this into the first patch. There is no need to add code
> >>>> with a known deficiency.
> >>>>
> >>>> Also, is there a reason why you haven't used <linux/ratelimit.h> ?
> >>>
> >>> So, in some ways devcoredump is ratelimited by userspace needing to
> >>> clear an existing devcore..
> >>>
> >>
> >> Yes, a new devcoredump device will not be created until the previous one
> >> is consumed or times out but here I am trying to limit even the DPU
> >> snapshot capture because DPU register space is really huge and the rate
> >> at which smmu faults occur is so high that it's causing instability
> >> while snapshots are being captured.
> >>
> >>> What I'd suggest would be more useful is to limit the devcores to once
> >>> per atomic update, ie. if display state hasn't changed, maybe an
> >>> additional devcore isn't useful
> >>>
> >>> BR,
> >>> -R
> >>>
> >>
> >> By display state change, do you mean like the checks we have in
> >> drm_atomic_crtc_needs_modeset()?
> >>
> >> OR do you mean we need to cache the previous (currently picked up by hw)
> >> state and trigger new devcores if the new state is different by
> >> comparing more things?
> >>
> >> This will help to reduce the snapshots to unique frame updates but I do
> >> not think it will reduce the rate enough for the case where DPU did not
> >> recover from the previous fault.
> >
> > I was thinking the easy thing, of just resetting the counter in
> > msm_atomic_commit_tail().. I suppose we could be clever and filter out
> > updates that only change scanout address.  Or hash the atomic state
> > and only generate devcoredumps for unique states.  But I'm not sure
> > how over-complicated we should make this.
> >
> > BR,
> > -R
>
> It's a good idea actually and I would also like to keep it simple :)
>
> One question, is it okay to assume that all compositors will only issue
> unique commits? Because we are assuming that's the case with resetting
> the counter in msm_atomic_commit_tail().

The compositors use drm_mode_atomic_ioctl() which allocates a new
state each time. It is impossible to re-submit the same
drm_atomic_state from userspace.
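
For context, a condensed sketch of the ioctl path behind this claim
(abridged from the shape of drm_mode_atomic_ioctl() in
drm_atomic_uapi.c; property decoding and error handling omitted):

int drm_mode_atomic_ioctl(struct drm_device *dev,
			  void *data, struct drm_file *file_priv)
{
	struct drm_atomic_state *state;

	/* every call allocates a brand-new atomic state object */
	state = drm_atomic_state_alloc(dev);
	if (!state)
		return -ENOMEM;

	/* ... decode userspace properties into state, then commit ... */

	return 0; /* abridged; the real function returns the commit result */
}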
Abhinav Kumar July 16, 2024, 11:14 p.m. UTC | #8
On 7/16/2024 4:10 PM, Dmitry Baryshkov wrote:
> On Wed, 17 Jul 2024 at 01:43, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote:
>>
>>
>>
>> On 7/16/2024 2:50 PM, Rob Clark wrote:
>>> On Tue, Jul 16, 2024 at 2:45 PM Abhinav Kumar <quic_abhinavk@quicinc.com> wrote:
>>>>
>>>>
>>>>
>>>> On 7/15/2024 12:51 PM, Rob Clark wrote:
>>>>> On Mon, Jul 1, 2024 at 12:43 PM Dmitry Baryshkov
>>>>> <dmitry.baryshkov@linaro.org> wrote:
>>>>>>
>>>>>> On Fri, Jun 28, 2024 at 02:48:47PM GMT, Abhinav Kumar wrote:
>>>>>>> There is no recovery mechanism in place yet to recover from mmu
>>>>>>> faults for DPU. We can only prevent the faults by making sure there
>>>>>>> is no misconfiguration.
>>>>>>>
>>>>>>> Rate-limit the snapshot capture for mmu faults to once per
>>>>>>> msm_kms_init_aspace() as that should be sufficient to capture
>>>>>>> the snapshot for debugging otherwise there will be a lot of
>>>>>>> dpu snapshots getting captured for the same fault which is
>>>>>>> redundant and also might affect capturing even one snapshot
>>>>>>> accurately.
>>>>>>
>>>>>> Please squash this into the first patch. There is no need to add code
>>>>>> with a known deficiency.
>>>>>>
>>>>>> Also, is there a reason why you haven't used <linux/ratelimit.h> ?
>>>>>
>>>>> So, in some ways devcoredump is ratelimited by userspace needing to
>>>>> clear an existing devcore..
>>>>>
>>>>
>>>> Yes, a new devcoredump device will not be created until the previous one
>>>> is consumed or times out but here I am trying to limit even the DPU
>>>> snapshot capture because DPU register space is really huge and the rate
>>>> at which smmu faults occur is so high that it's causing instability
>>>> while snapshots are being captured.
>>>>
>>>>> What I'd suggest would be more useful is to limit the devcores to once
>>>>> per atomic update, ie. if display state hasn't changed, maybe an
>>>>> additional devcore isn't useful
>>>>>
>>>>> BR,
>>>>> -R
>>>>>
>>>>
>>>> By display state change, do you mean like the checks we have in
>>>> drm_atomic_crtc_needs_modeset()?
>>>>
>>>> OR do you mean we need to cache the previous (currently picked up by hw)
>>>> state and trigger new devcores if the new state is different by
>>>> comparing more things?
>>>>
>>>> This will help to reduce the snapshots to unique frame updates but I do
>>>> not think it will reduce the rate enough for the case where DPU did not
>>>> recover from the previous fault.
>>>
>>> I was thinking the easy thing, of just resetting the counter in
>>> msm_atomic_commit_tail().. I suppose we could be clever and filter out
>>> updates that only change scanout address.  Or hash the atomic state
>>> and only generate devcoredumps for unique states.  But I'm not sure
>>> how over-complicated we should make this.
>>>
>>> BR,
>>> -R
>>
>> It's a good idea actually and I would also like to keep it simple :)
>>
>> One question, is it okay to assume that all compositors will only issue
>> unique commits? Because we are assuming that's the case with resetting
>> the counter in msm_atomic_commit_tail().
> 
> The compositors use drm_mode_atomic_ioctl() which allocates a new
> state each time. It is impossible to re-submit the same
> drm_atomic_state from userspace.
> 

No, what I meant was, is it okay to assume that a commit is issued only 
when display configuration has changed?

Like if we get multiple commits for the same frame for some reason, 
that's also spam and this approach will not avoid that.

>
Dmitry Baryshkov July 16, 2024, 11:42 p.m. UTC | #9
On Wed, 17 Jul 2024 at 02:15, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote:
>
>
>
> On 7/16/2024 4:10 PM, Dmitry Baryshkov wrote:
> > On Wed, 17 Jul 2024 at 01:43, Abhinav Kumar <quic_abhinavk@quicinc.com> wrote:
> >>
> >>
> >>
> >> On 7/16/2024 2:50 PM, Rob Clark wrote:
> >>> On Tue, Jul 16, 2024 at 2:45 PM Abhinav Kumar <quic_abhinavk@quicinc.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 7/15/2024 12:51 PM, Rob Clark wrote:
> >>>>> On Mon, Jul 1, 2024 at 12:43 PM Dmitry Baryshkov
> >>>>> <dmitry.baryshkov@linaro.org> wrote:
> >>>>>>
> >>>>>> On Fri, Jun 28, 2024 at 02:48:47PM GMT, Abhinav Kumar wrote:
> >>>>>>> There is no recovery mechanism in place yet to recover from mmu
> >>>>>>> faults for DPU. We can only prevent the faults by making sure there
> >>>>>>> is no misconfiguration.
> >>>>>>>
> >>>>>>> Rate-limit the snapshot capture for mmu faults to once per
> >>>>>>> msm_kms_init_aspace() as that should be sufficient to capture
> >>>>>>> the snapshot for debugging otherwise there will be a lot of
> >>>>>>> dpu snapshots getting captured for the same fault which is
> >>>>>>> redundant and also might affect capturing even one snapshot
> >>>>>>> accurately.
> >>>>>>
> >>>>>> Please squash this into the first patch. There is no need to add code
> >>>>>> with a known deficiency.
> >>>>>>
> >>>>>> Also, is there a reason why you haven't used <linux/ratelimit.h> ?
> >>>>>
> >>>>> So, in some ways devcoredump is ratelimited by userspace needing to
> >>>>> clear an existing devcore..
> >>>>>
> >>>>
> >>>> Yes, a new devcoredump device will not be created until the previous one
> >>>> is consumed or times out but here I am trying to limit even the DPU
> >>>> snapshot capture because DPU register space is really huge and the rate
> >>>> at which smmu faults occur is so high that it's causing instability
> >>>> while snapshots are being captured.
> >>>>
> >>>>> What I'd suggest would be more useful is to limit the devcores to once
> >>>>> per atomic update, ie. if display state hasn't changed, maybe an
> >>>>> additional devcore isn't useful
> >>>>>
> >>>>> BR,
> >>>>> -R
> >>>>>
> >>>>
> >>>> By display state change, do you mean like the checks we have in
> >>>> drm_atomic_crtc_needs_modeset()?
> >>>>
> >>>> OR do you mean we need to cache the previous (currently picked up by hw)
> >>>> state and trigger new devcores if the new state is different by
> >>>> comparing more things?
> >>>>
> >>>> This will help to reduce the snapshots to unique frame updates but I do
> >>>> not think it will reduce the rate enough for the case where DPU did not
> >>>> recover from the previous fault.
> >>>
> >>> I was thinking the easy thing, of just resetting the counter in
> >>> msm_atomic_commit_tail().. I suppose we could be clever and filter out
> >>> updates that only change scanout address.  Or hash the atomic state
> >>> and only generate devcoredumps for unique states.  But I'm not sure
> >>> how over-complicated we should make this.
> >>>
> >>> BR,
> >>> -R
> >>
> >> It's a good idea actually and I would also like to keep it simple :)
> >>
> >> One question, is it okay to assume that all compositors will only issue
> >> unique commits? Because we are assuming that's the case with resetting
> >> the counter in msm_atomic_commit_tail().
> >
> > The compositors use drm_mode_atomic_ioctl() which allocates a new
> > state each time. It is impossible to re-submit the same
> > drm_atomic_state from userspace.
> >
>
> No, what I meant was, is it okay to assume that a commit is issued only
> when display configuration has changed?

No.

> Like if we get multiple commits for the same frame for some reason,
> that's also spam and this approach will not avoid that.

I'd say a new commit means that userspace thinks that something
changed. So if the driver got another hang / issue / etc., it should
try capturing a new state.

Patch

diff --git a/drivers/gpu/drm/msm/msm_kms.c b/drivers/gpu/drm/msm/msm_kms.c
index d5d3117259cf..90a333920c01 100644
--- a/drivers/gpu/drm/msm/msm_kms.c
+++ b/drivers/gpu/drm/msm/msm_kms.c
@@ -168,7 +168,10 @@ static int msm_kms_fault_handler(void *arg, unsigned long iova, int flags, void
 {
 	struct msm_kms *kms = arg;
 
-	msm_disp_snapshot_state(kms->dev);
+	if (!kms->fault_snapshot_capture) {
+		msm_disp_snapshot_state(kms->dev);
+		kms->fault_snapshot_capture++;
+	}
 
 	return -ENOSYS;
 }
@@ -208,6 +211,7 @@ struct msm_gem_address_space *msm_kms_init_aspace(struct drm_device *dev)
 		mmu->funcs->destroy(mmu);
 	}
 
+	kms->fault_snapshot_capture = 0;
 	msm_mmu_set_fault_handler(aspace->mmu, kms, msm_kms_fault_handler);
 
 	return aspace;
diff --git a/drivers/gpu/drm/msm/msm_kms.h b/drivers/gpu/drm/msm/msm_kms.h
index 1e0c54de3716..240b39e60828 100644
--- a/drivers/gpu/drm/msm/msm_kms.h
+++ b/drivers/gpu/drm/msm/msm_kms.h
@@ -134,6 +134,9 @@ struct msm_kms {
 	int irq;
 	bool irq_requested;
 
+	/* rate limit the snapshot capture to once per attach */
+	int fault_snapshot_capture;
+
 	/* mapper-id used to request GEM buffer mapped for scanout: */
 	struct msm_gem_address_space *aspace;