diff mbox

[5/6] drm: Prevent vblank counter jumps with timestamp based update method.

Message ID 1454894009-15466-6-git-send-email-mario.kleiner.de@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Mario Kleiner Feb. 8, 2016, 1:13 a.m. UTC
The changes to drm_update_vblank_count() in Linux 4.4 added a
method to emulate a hardware vblank counter by use of high
precision vblank timestamps if a kms driver supports those,
but doesn't suppport hw vblank counters.

That method assumes that the old timestamp from a previous
invocation is valid, but that is not always the case. E.g.,
if drm_reset_vblank_timestamp() gets called during drm_vblank_on()
or drm_update_vblank_count() gets called outside vblank irq and
the high precision timestamping can't deliver a precise timestamp,
ie. drm_get_last_vbltimestamp() delivers a return value of false,
then those functions will initialize the old timestamp to zero
to mark it as invalid.

A following call to drm_update_vblank_count() would then calculate
elapsed time with vblank irqs off as current vblank timestamp minus
the zero old timestamp and compute a software vblank counter increment
that corresponds to system uptime, causing a large forward jump of the
software vblank counter. That jump in turn can cause too long waits
in drmWaitVblank and very long delays in delivery of vblank events,
resulting in hangs of userspace clients.

This problem can be observed on nouveau-kms during machine
suspend->resume cycles, where drm_vblank_off is called during
suspend, drm_vblank_on is called during resume and the first
queries to drm_get_last_vbltimestamp() don't deliver high
precision timestamps, resulting in a large harmful counter jump.

Fix this by checking if the old timestamp used for this calculations
is zero == invalid. If so, perform a counter increment of +1 to
prevent large counter jumps and reinitialize the timestamps to
sane values.

Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
Cc: <stable@vger.kernel.org> # 4.4+
Cc: michel@daenzer.net
Cc: vbabka@suse.cz
Cc: ville.syrjala@linux.intel.com
Cc: daniel.vetter@ffwll.ch
Cc: dri-devel@lists.freedesktop.org
Cc: alexander.deucher@amd.com
Cc: christian.koenig@amd.com
---
 drivers/gpu/drm/drm_irq.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Daniel Vetter Feb. 9, 2016, 10:09 a.m. UTC | #1
On Mon, Feb 08, 2016 at 02:13:28AM +0100, Mario Kleiner wrote:
> The changes to drm_update_vblank_count() in Linux 4.4 added a
> method to emulate a hardware vblank counter by use of high
> precision vblank timestamps if a kms driver supports those,
> but doesn't suppport hw vblank counters.
> 
> That method assumes that the old timestamp from a previous
> invocation is valid, but that is not always the case. E.g.,
> if drm_reset_vblank_timestamp() gets called during drm_vblank_on()
> or drm_update_vblank_count() gets called outside vblank irq and
> the high precision timestamping can't deliver a precise timestamp,
> ie. drm_get_last_vbltimestamp() delivers a return value of false,
> then those functions will initialize the old timestamp to zero
> to mark it as invalid.
> 
> A following call to drm_update_vblank_count() would then calculate
> elapsed time with vblank irqs off as current vblank timestamp minus
> the zero old timestamp and compute a software vblank counter increment
> that corresponds to system uptime, causing a large forward jump of the
> software vblank counter. That jump in turn can cause too long waits
> in drmWaitVblank and very long delays in delivery of vblank events,
> resulting in hangs of userspace clients.
> 
> This problem can be observed on nouveau-kms during machine
> suspend->resume cycles, where drm_vblank_off is called during
> suspend, drm_vblank_on is called during resume and the first
> queries to drm_get_last_vbltimestamp() don't deliver high
> precision timestamps, resulting in a large harmful counter jump.

Why does nouveau enable vblank interrupts before it can get valid
timestamps? That sounds like the core bug here, and papering over that in
the vblank code feels very wrong to me. Maybe we should instead just not
sample the vblank at all if the timestamp fails and splat a big WARN_ON
about this, to give driver writers a chance to fix up their mess?
-Daniel

> 
> Fix this by checking if the old timestamp used for this calculations
> is zero == invalid. If so, perform a counter increment of +1 to
> prevent large counter jumps and reinitialize the timestamps to
> sane values.
> 
> Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
> Cc: <stable@vger.kernel.org> # 4.4+
> Cc: michel@daenzer.net
> Cc: vbabka@suse.cz
> Cc: ville.syrjala@linux.intel.com
> Cc: daniel.vetter@ffwll.ch
> Cc: dri-devel@lists.freedesktop.org
> Cc: alexander.deucher@amd.com
> Cc: christian.koenig@amd.com
> ---
>  drivers/gpu/drm/drm_irq.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
> index fb17c45..88bdf19 100644
> --- a/drivers/gpu/drm/drm_irq.c
> +++ b/drivers/gpu/drm/drm_irq.c
> @@ -216,6 +216,13 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
>  			DRM_DEBUG_VBL("crtc %u: Redundant vblirq ignored."
>  				      " diff_ns = %lld, framedur_ns = %d)\n",
>  				      pipe, (long long) diff_ns, framedur_ns);
> +
> +		/* No valid t_old to calculate diff? Bump +1 to force reinit. */
> +		if (t_old->tv_sec == 0 && t_old->tv_usec == 0) {
> +			DRM_DEBUG_VBL("crtc %u: No baseline ts. Bump +1.\n",
> +				      pipe);
> +			diff = 1;
> +		}
>  	} else {
>  		/* some kind of default for drivers w/o accurate vbl timestamping */
>  		diff = (flags & DRM_CALLED_FROM_VBLIRQ) != 0;
> -- 
> 1.9.1
>
Mario Kleiner Feb. 9, 2016, 1:53 p.m. UTC | #2
On 02/09/2016 11:09 AM, Daniel Vetter wrote:
> On Mon, Feb 08, 2016 at 02:13:28AM +0100, Mario Kleiner wrote:
>> The changes to drm_update_vblank_count() in Linux 4.4 added a
>> method to emulate a hardware vblank counter by use of high
>> precision vblank timestamps if a kms driver supports those,
>> but doesn't suppport hw vblank counters.
>>
>> That method assumes that the old timestamp from a previous
>> invocation is valid, but that is not always the case. E.g.,
>> if drm_reset_vblank_timestamp() gets called during drm_vblank_on()
>> or drm_update_vblank_count() gets called outside vblank irq and
>> the high precision timestamping can't deliver a precise timestamp,
>> ie. drm_get_last_vbltimestamp() delivers a return value of false,
>> then those functions will initialize the old timestamp to zero
>> to mark it as invalid.
>>
>> A following call to drm_update_vblank_count() would then calculate
>> elapsed time with vblank irqs off as current vblank timestamp minus
>> the zero old timestamp and compute a software vblank counter increment
>> that corresponds to system uptime, causing a large forward jump of the
>> software vblank counter. That jump in turn can cause too long waits
>> in drmWaitVblank and very long delays in delivery of vblank events,
>> resulting in hangs of userspace clients.
>>
>> This problem can be observed on nouveau-kms during machine
>> suspend->resume cycles, where drm_vblank_off is called during
>> suspend, drm_vblank_on is called during resume and the first
>> queries to drm_get_last_vbltimestamp() don't deliver high
>> precision timestamps, resulting in a large harmful counter jump.
>
> Why does nouveau enable vblank interrupts before it can get valid
> timestamps? That sounds like the core bug here, and papering over that in
> the vblank code feels very wrong to me. Maybe we should instead just not
> sample the vblank at all if the timestamp fails and splat a big WARN_ON
> about this, to give driver writers a chance to fix up their mess?
> -Daniel
>

The high precision timestamping is allowed to fail for a kms driver 
under some conditions which are not implementation errors of the driver, 
or getting disabled by user override, so we should handle that robustly. 
We handle it robustly everywhere else in the drm, so we should do it 
here as well.

E.g., nouveau generally supports timestamping/scanout position queries, 
but can't support it on old pre NV-50 hardware with some output type 
(either on analog VGA, or DVI-D, can't remember atm. which one is 
unsupported).

There are also new Soc drivers showing up which support those methods, 
but at least i didn't verify or test if their implementations are good 
enough for the needs of the new drm_udpate_vblank_count() which is more 
sensitive to kms drivers being even slightly off.

-mario

>>
>> Fix this by checking if the old timestamp used for this calculations
>> is zero == invalid. If so, perform a counter increment of +1 to
>> prevent large counter jumps and reinitialize the timestamps to
>> sane values.
>>
>> Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
>> Cc: <stable@vger.kernel.org> # 4.4+
>> Cc: michel@daenzer.net
>> Cc: vbabka@suse.cz
>> Cc: ville.syrjala@linux.intel.com
>> Cc: daniel.vetter@ffwll.ch
>> Cc: dri-devel@lists.freedesktop.org
>> Cc: alexander.deucher@amd.com
>> Cc: christian.koenig@amd.com
>> ---
>>   drivers/gpu/drm/drm_irq.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
>> index fb17c45..88bdf19 100644
>> --- a/drivers/gpu/drm/drm_irq.c
>> +++ b/drivers/gpu/drm/drm_irq.c
>> @@ -216,6 +216,13 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
>>   			DRM_DEBUG_VBL("crtc %u: Redundant vblirq ignored."
>>   				      " diff_ns = %lld, framedur_ns = %d)\n",
>>   				      pipe, (long long) diff_ns, framedur_ns);
>> +
>> +		/* No valid t_old to calculate diff? Bump +1 to force reinit. */
>> +		if (t_old->tv_sec == 0 && t_old->tv_usec == 0) {
>> +			DRM_DEBUG_VBL("crtc %u: No baseline ts. Bump +1.\n",
>> +				      pipe);
>> +			diff = 1;
>> +		}
>>   	} else {
>>   		/* some kind of default for drivers w/o accurate vbl timestamping */
>>   		diff = (flags & DRM_CALLED_FROM_VBLIRQ) != 0;
>> --
>> 1.9.1
>>
>
Ville Syrjälä Feb. 9, 2016, 2:11 p.m. UTC | #3
On Tue, Feb 09, 2016 at 02:53:30PM +0100, Mario Kleiner wrote:
> On 02/09/2016 11:09 AM, Daniel Vetter wrote:
> > On Mon, Feb 08, 2016 at 02:13:28AM +0100, Mario Kleiner wrote:
> >> The changes to drm_update_vblank_count() in Linux 4.4 added a
> >> method to emulate a hardware vblank counter by use of high
> >> precision vblank timestamps if a kms driver supports those,
> >> but doesn't suppport hw vblank counters.
> >>
> >> That method assumes that the old timestamp from a previous
> >> invocation is valid, but that is not always the case. E.g.,
> >> if drm_reset_vblank_timestamp() gets called during drm_vblank_on()
> >> or drm_update_vblank_count() gets called outside vblank irq and
> >> the high precision timestamping can't deliver a precise timestamp,
> >> ie. drm_get_last_vbltimestamp() delivers a return value of false,
> >> then those functions will initialize the old timestamp to zero
> >> to mark it as invalid.
> >>
> >> A following call to drm_update_vblank_count() would then calculate
> >> elapsed time with vblank irqs off as current vblank timestamp minus
> >> the zero old timestamp and compute a software vblank counter increment
> >> that corresponds to system uptime, causing a large forward jump of the
> >> software vblank counter. That jump in turn can cause too long waits
> >> in drmWaitVblank and very long delays in delivery of vblank events,
> >> resulting in hangs of userspace clients.
> >>
> >> This problem can be observed on nouveau-kms during machine
> >> suspend->resume cycles, where drm_vblank_off is called during
> >> suspend, drm_vblank_on is called during resume and the first
> >> queries to drm_get_last_vbltimestamp() don't deliver high
> >> precision timestamps, resulting in a large harmful counter jump.
> >
> > Why does nouveau enable vblank interrupts before it can get valid
> > timestamps? That sounds like the core bug here, and papering over that in
> > the vblank code feels very wrong to me. Maybe we should instead just not
> > sample the vblank at all if the timestamp fails and splat a big WARN_ON
> > about this, to give driver writers a chance to fix up their mess?
> > -Daniel
> >
> 
> The high precision timestamping is allowed to fail for a kms driver 
> under some conditions which are not implementation errors of the driver, 
> or getting disabled by user override, so we should handle that robustly. 
> We handle it robustly everywhere else in the drm, so we should do it 
> here as well.
> 
> E.g., nouveau generally supports timestamping/scanout position queries, 
> but can't support it on old pre NV-50 hardware with some output type 
> (either on analog VGA, or DVI-D, can't remember atm. which one is 
> unsupported).

I think the surprising thing here is that it fails first and then
succeeds on the same crtc/output combo presumably.

I guess in theory the driver could fail during random times for whatever
reason, though I tend to think that the driver should either make it
robust or not even pretend to support it.

I suppose one failure mode the driver can't really do anything about is
some random SMI crap or something stalling the timestamp queries enough
that we fail the precisions tests and exhaust the retry limits. So yeah,
making it robust against that kind of nastyness sounds line it might be
a good idea. Though perhaps it should be something a bit more severe
than a DRM_DEBUG since I think it really shouldn't happen when the
driver and system are otherwise sane. Of course if it routinely fails
with some driver then simply making it spew errors into dmesg isn't
so nice, unless the driver also gets fixed.

> 
> There are also new Soc drivers showing up which support those methods, 
> but at least i didn't verify or test if their implementations are good 
> enough for the needs of the new drm_udpate_vblank_count() which is more 
> sensitive to kms drivers being even slightly off.
> 
> -mario
> 
> >>
> >> Fix this by checking if the old timestamp used for this calculations
> >> is zero == invalid. If so, perform a counter increment of +1 to
> >> prevent large counter jumps and reinitialize the timestamps to
> >> sane values.
> >>
> >> Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
> >> Cc: <stable@vger.kernel.org> # 4.4+
> >> Cc: michel@daenzer.net
> >> Cc: vbabka@suse.cz
> >> Cc: ville.syrjala@linux.intel.com
> >> Cc: daniel.vetter@ffwll.ch
> >> Cc: dri-devel@lists.freedesktop.org
> >> Cc: alexander.deucher@amd.com
> >> Cc: christian.koenig@amd.com
> >> ---
> >>   drivers/gpu/drm/drm_irq.c | 7 +++++++
> >>   1 file changed, 7 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
> >> index fb17c45..88bdf19 100644
> >> --- a/drivers/gpu/drm/drm_irq.c
> >> +++ b/drivers/gpu/drm/drm_irq.c
> >> @@ -216,6 +216,13 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
> >>   			DRM_DEBUG_VBL("crtc %u: Redundant vblirq ignored."
> >>   				      " diff_ns = %lld, framedur_ns = %d)\n",
> >>   				      pipe, (long long) diff_ns, framedur_ns);
> >> +
> >> +		/* No valid t_old to calculate diff? Bump +1 to force reinit. */
> >> +		if (t_old->tv_sec == 0 && t_old->tv_usec == 0) {
> >> +			DRM_DEBUG_VBL("crtc %u: No baseline ts. Bump +1.\n",
> >> +				      pipe);
> >> +			diff = 1;
> >> +		}
> >>   	} else {
> >>   		/* some kind of default for drivers w/o accurate vbl timestamping */
> >>   		diff = (flags & DRM_CALLED_FROM_VBLIRQ) != 0;
> >> --
> >> 1.9.1
> >>
> >
Daniel Vetter Feb. 9, 2016, 3:03 p.m. UTC | #4
On Tue, Feb 09, 2016 at 04:11:49PM +0200, Ville Syrjälä wrote:
> On Tue, Feb 09, 2016 at 02:53:30PM +0100, Mario Kleiner wrote:
> > On 02/09/2016 11:09 AM, Daniel Vetter wrote:
> > > On Mon, Feb 08, 2016 at 02:13:28AM +0100, Mario Kleiner wrote:
> > >> The changes to drm_update_vblank_count() in Linux 4.4 added a
> > >> method to emulate a hardware vblank counter by use of high
> > >> precision vblank timestamps if a kms driver supports those,
> > >> but doesn't suppport hw vblank counters.
> > >>
> > >> That method assumes that the old timestamp from a previous
> > >> invocation is valid, but that is not always the case. E.g.,
> > >> if drm_reset_vblank_timestamp() gets called during drm_vblank_on()
> > >> or drm_update_vblank_count() gets called outside vblank irq and
> > >> the high precision timestamping can't deliver a precise timestamp,
> > >> ie. drm_get_last_vbltimestamp() delivers a return value of false,
> > >> then those functions will initialize the old timestamp to zero
> > >> to mark it as invalid.
> > >>
> > >> A following call to drm_update_vblank_count() would then calculate
> > >> elapsed time with vblank irqs off as current vblank timestamp minus
> > >> the zero old timestamp and compute a software vblank counter increment
> > >> that corresponds to system uptime, causing a large forward jump of the
> > >> software vblank counter. That jump in turn can cause too long waits
> > >> in drmWaitVblank and very long delays in delivery of vblank events,
> > >> resulting in hangs of userspace clients.
> > >>
> > >> This problem can be observed on nouveau-kms during machine
> > >> suspend->resume cycles, where drm_vblank_off is called during
> > >> suspend, drm_vblank_on is called during resume and the first
> > >> queries to drm_get_last_vbltimestamp() don't deliver high
> > >> precision timestamps, resulting in a large harmful counter jump.
> > >
> > > Why does nouveau enable vblank interrupts before it can get valid
> > > timestamps? That sounds like the core bug here, and papering over that in
> > > the vblank code feels very wrong to me. Maybe we should instead just not
> > > sample the vblank at all if the timestamp fails and splat a big WARN_ON
> > > about this, to give driver writers a chance to fix up their mess?
> > > -Daniel
> > >
> > 
> > The high precision timestamping is allowed to fail for a kms driver 
> > under some conditions which are not implementation errors of the driver, 
> > or getting disabled by user override, so we should handle that robustly. 
> > We handle it robustly everywhere else in the drm, so we should do it 
> > here as well.
> > 
> > E.g., nouveau generally supports timestamping/scanout position queries, 
> > but can't support it on old pre NV-50 hardware with some output type 
> > (either on analog VGA, or DVI-D, can't remember atm. which one is 
> > unsupported).
> 
> I think the surprising thing here is that it fails first and then
> succeeds on the same crtc/output combo presumably.

Yeah exactly this. Failing consistently is ok imo (and probably should be
handled). Failing first and then later on working (without changing the
mode config in between) seems suspicous. That shouldn't ever happen really
and seems like a driver bug (like enabling vblanks too early, before the
pipe is fully up&running).
-Daniel

> 
> I guess in theory the driver could fail during random times for whatever
> reason, though I tend to think that the driver should either make it
> robust or not even pretend to support it.
> 
> I suppose one failure mode the driver can't really do anything about is
> some random SMI crap or something stalling the timestamp queries enough
> that we fail the precisions tests and exhaust the retry limits. So yeah,
> making it robust against that kind of nastyness sounds line it might be
> a good idea. Though perhaps it should be something a bit more severe
> than a DRM_DEBUG since I think it really shouldn't happen when the
> driver and system are otherwise sane. Of course if it routinely fails
> with some driver then simply making it spew errors into dmesg isn't
> so nice, unless the driver also gets fixed.
> 
> > 
> > There are also new Soc drivers showing up which support those methods, 
> > but at least i didn't verify or test if their implementations are good 
> > enough for the needs of the new drm_udpate_vblank_count() which is more 
> > sensitive to kms drivers being even slightly off.
> > 
> > -mario
> > 
> > >>
> > >> Fix this by checking if the old timestamp used for this calculations
> > >> is zero == invalid. If so, perform a counter increment of +1 to
> > >> prevent large counter jumps and reinitialize the timestamps to
> > >> sane values.
> > >>
> > >> Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
> > >> Cc: <stable@vger.kernel.org> # 4.4+
> > >> Cc: michel@daenzer.net
> > >> Cc: vbabka@suse.cz
> > >> Cc: ville.syrjala@linux.intel.com
> > >> Cc: daniel.vetter@ffwll.ch
> > >> Cc: dri-devel@lists.freedesktop.org
> > >> Cc: alexander.deucher@amd.com
> > >> Cc: christian.koenig@amd.com
> > >> ---
> > >>   drivers/gpu/drm/drm_irq.c | 7 +++++++
> > >>   1 file changed, 7 insertions(+)
> > >>
> > >> diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
> > >> index fb17c45..88bdf19 100644
> > >> --- a/drivers/gpu/drm/drm_irq.c
> > >> +++ b/drivers/gpu/drm/drm_irq.c
> > >> @@ -216,6 +216,13 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
> > >>   			DRM_DEBUG_VBL("crtc %u: Redundant vblirq ignored."
> > >>   				      " diff_ns = %lld, framedur_ns = %d)\n",
> > >>   				      pipe, (long long) diff_ns, framedur_ns);
> > >> +
> > >> +		/* No valid t_old to calculate diff? Bump +1 to force reinit. */
> > >> +		if (t_old->tv_sec == 0 && t_old->tv_usec == 0) {
> > >> +			DRM_DEBUG_VBL("crtc %u: No baseline ts. Bump +1.\n",
> > >> +				      pipe);
> > >> +			diff = 1;
> > >> +		}
> > >>   	} else {
> > >>   		/* some kind of default for drivers w/o accurate vbl timestamping */
> > >>   		diff = (flags & DRM_CALLED_FROM_VBLIRQ) != 0;
> > >> --
> > >> 1.9.1
> > >>
> > >
> 
> -- 
> Ville Syrjälä
> Intel OTC
Mario Kleiner Feb. 10, 2016, 4:28 p.m. UTC | #5
On 02/09/2016 04:03 PM, Daniel Vetter wrote:
> On Tue, Feb 09, 2016 at 04:11:49PM +0200, Ville Syrjälä wrote:
>> On Tue, Feb 09, 2016 at 02:53:30PM +0100, Mario Kleiner wrote:
>>> On 02/09/2016 11:09 AM, Daniel Vetter wrote:
>>>> On Mon, Feb 08, 2016 at 02:13:28AM +0100, Mario Kleiner wrote:
>>>>> The changes to drm_update_vblank_count() in Linux 4.4 added a
>>>>> method to emulate a hardware vblank counter by use of high
>>>>> precision vblank timestamps if a kms driver supports those,
>>>>> but doesn't suppport hw vblank counters.
>>>>>
>>>>> That method assumes that the old timestamp from a previous
>>>>> invocation is valid, but that is not always the case. E.g.,
>>>>> if drm_reset_vblank_timestamp() gets called during drm_vblank_on()
>>>>> or drm_update_vblank_count() gets called outside vblank irq and
>>>>> the high precision timestamping can't deliver a precise timestamp,
>>>>> ie. drm_get_last_vbltimestamp() delivers a return value of false,
>>>>> then those functions will initialize the old timestamp to zero
>>>>> to mark it as invalid.
>>>>>
>>>>> A following call to drm_update_vblank_count() would then calculate
>>>>> elapsed time with vblank irqs off as current vblank timestamp minus
>>>>> the zero old timestamp and compute a software vblank counter increment
>>>>> that corresponds to system uptime, causing a large forward jump of the
>>>>> software vblank counter. That jump in turn can cause too long waits
>>>>> in drmWaitVblank and very long delays in delivery of vblank events,
>>>>> resulting in hangs of userspace clients.
>>>>>
>>>>> This problem can be observed on nouveau-kms during machine
>>>>> suspend->resume cycles, where drm_vblank_off is called during
>>>>> suspend, drm_vblank_on is called during resume and the first
>>>>> queries to drm_get_last_vbltimestamp() don't deliver high
>>>>> precision timestamps, resulting in a large harmful counter jump.
>>>>
>>>> Why does nouveau enable vblank interrupts before it can get valid
>>>> timestamps? That sounds like the core bug here, and papering over that in
>>>> the vblank code feels very wrong to me. Maybe we should instead just not
>>>> sample the vblank at all if the timestamp fails and splat a big WARN_ON
>>>> about this, to give driver writers a chance to fix up their mess?
>>>> -Daniel
>>>>
>>>
>>> The high precision timestamping is allowed to fail for a kms driver
>>> under some conditions which are not implementation errors of the driver,
>>> or getting disabled by user override, so we should handle that robustly.
>>> We handle it robustly everywhere else in the drm, so we should do it
>>> here as well.
>>>
>>> E.g., nouveau generally supports timestamping/scanout position queries,
>>> but can't support it on old pre NV-50 hardware with some output type
>>> (either on analog VGA, or DVI-D, can't remember atm. which one is
>>> unsupported).
>>
>> I think the surprising thing here is that it fails first and then
>> succeeds on the same crtc/output combo presumably.
>
> Yeah exactly this. Failing consistently is ok imo (and probably should be
> handled). Failing first and then later on working (without changing the
> mode config in between) seems suspicous. That shouldn't ever happen really
> and seems like a driver bug (like enabling vblanks too early, before the
> pipe is fully up&running).
> -Daniel
>
>>
>> I guess in theory the driver could fail during random times for whatever
>> reason, though I tend to think that the driver should either make it
>> robust or not even pretend to support it.
>>
>> I suppose one failure mode the driver can't really do anything about is
>> some random SMI crap or something stalling the timestamp queries enough
>> that we fail the precisions tests and exhaust the retry limits. So yeah,
>> making it robust against that kind of nastyness sounds line it might be
>> a good idea. Though perhaps it should be something a bit more severe
>> than a DRM_DEBUG since I think it really shouldn't happen when the
>> driver and system are otherwise sane. Of course if it routinely fails
>> with some driver then simply making it spew errors into dmesg isn't
>> so nice, unless the driver also gets fixed.
>>

I think i have an idea what might go wrong with nouveau, so i'll see if 
i can add a fixup patch.

There's another scenario where this zero-ts case can be hit. If the 
driver drm_vblank_init()'s - setting all timestamps to zero - and then 
code starts using vblanks (drm_vblank_get()) without drm_vblank_on 
beforehand, which is afaics the case with nouveau. Unless that's 
considered an error as well, we'd need to init the timestamps to 
something non-zero but harmless like 1 usecs at drm_vblank_init() time?
What makes sense as output here? DRM_WARN_ONCE?

-mario

>>>
>>> There are also new Soc drivers showing up which support those methods,
>>> but at least i didn't verify or test if their implementations are good
>>> enough for the needs of the new drm_udpate_vblank_count() which is more
>>> sensitive to kms drivers being even slightly off.
>>>
>>> -mario
>>>
>>>>>
>>>>> Fix this by checking if the old timestamp used for this calculations
>>>>> is zero == invalid. If so, perform a counter increment of +1 to
>>>>> prevent large counter jumps and reinitialize the timestamps to
>>>>> sane values.
>>>>>
>>>>> Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
>>>>> Cc: <stable@vger.kernel.org> # 4.4+
>>>>> Cc: michel@daenzer.net
>>>>> Cc: vbabka@suse.cz
>>>>> Cc: ville.syrjala@linux.intel.com
>>>>> Cc: daniel.vetter@ffwll.ch
>>>>> Cc: dri-devel@lists.freedesktop.org
>>>>> Cc: alexander.deucher@amd.com
>>>>> Cc: christian.koenig@amd.com
>>>>> ---
>>>>>    drivers/gpu/drm/drm_irq.c | 7 +++++++
>>>>>    1 file changed, 7 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
>>>>> index fb17c45..88bdf19 100644
>>>>> --- a/drivers/gpu/drm/drm_irq.c
>>>>> +++ b/drivers/gpu/drm/drm_irq.c
>>>>> @@ -216,6 +216,13 @@ static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
>>>>>    			DRM_DEBUG_VBL("crtc %u: Redundant vblirq ignored."
>>>>>    				      " diff_ns = %lld, framedur_ns = %d)\n",
>>>>>    				      pipe, (long long) diff_ns, framedur_ns);
>>>>> +
>>>>> +		/* No valid t_old to calculate diff? Bump +1 to force reinit. */
>>>>> +		if (t_old->tv_sec == 0 && t_old->tv_usec == 0) {
>>>>> +			DRM_DEBUG_VBL("crtc %u: No baseline ts. Bump +1.\n",
>>>>> +				      pipe);
>>>>> +			diff = 1;
>>>>> +		}
>>>>>    	} else {
>>>>>    		/* some kind of default for drivers w/o accurate vbl timestamping */
>>>>>    		diff = (flags & DRM_CALLED_FROM_VBLIRQ) != 0;
>>>>> --
>>>>> 1.9.1
>>>>>
>>>>
>>
>> --
>> Ville Syrjälä
>> Intel OTC
>
Daniel Vetter Feb. 10, 2016, 5:17 p.m. UTC | #6
On Wed, Feb 10, 2016 at 5:28 PM, Mario Kleiner
<mario.kleiner.de@gmail.com> wrote:
> There's another scenario where this zero-ts case can be hit. If the driver
> drm_vblank_init()'s - setting all timestamps to zero - and then code starts
> using vblanks (drm_vblank_get()) without drm_vblank_on beforehand, which is
> afaics the case with nouveau. Unless that's considered an error as well,
> we'd need to init the timestamps to something non-zero but harmless like 1
> usecs at drm_vblank_init() time?

Both legacy modeset helpers and atomic ones assume by default that you
start out with everything disabled. Pre-atomic drivers make that
happen by calling disable_unused_functions() to shut down anything the
bios has enabled. I think this can't happen.

For drivers that do take over bootloader display config they must call
vblank_on explicitly themselves, which i915 does.

> What makes sense as output here? DRM_WARN_ONCE?

I'd go with WARN_ON and tune it down if it's offensive. But WARN_ON
patch for 4.6 of course.
-Daniel
Mario Kleiner Feb. 10, 2016, 6:36 p.m. UTC | #7
On 02/10/2016 06:17 PM, Daniel Vetter wrote:
> On Wed, Feb 10, 2016 at 5:28 PM, Mario Kleiner
> <mario.kleiner.de@gmail.com> wrote:
>> There's another scenario where this zero-ts case can be hit. If the driver
>> drm_vblank_init()'s - setting all timestamps to zero - and then code starts
>> using vblanks (drm_vblank_get()) without drm_vblank_on beforehand, which is
>> afaics the case with nouveau. Unless that's considered an error as well,
>> we'd need to init the timestamps to something non-zero but harmless like 1
>> usecs at drm_vblank_init() time?
>
> Both legacy modeset helpers and atomic ones assume by default that you
> start out with everything disabled. Pre-atomic drivers make that
> happen by calling disable_unused_functions() to shut down anything the
> bios has enabled. I think this can't happen.
>
> For drivers that do take over bootloader display config they must call
> vblank_on explicitly themselves, which i915 does.
>
>> What makes sense as output here? DRM_WARN_ONCE?
>
> I'd go with WARN_ON and tune it down if it's offensive. But WARN_ON
> patch for 4.6 of course.
> -Daniel
>

Ok, so does this one have your R-b for stable as is? What about a proper 
nouveau fix if i find one? Probably also for 4.6 then, given that this 
patch fixes it up good enough for stable?

-mario
Daniel Vetter Feb. 10, 2016, 7:34 p.m. UTC | #8
On Wed, Feb 10, 2016 at 7:36 PM, Mario Kleiner
<mario.kleiner.de@gmail.com> wrote:
> On 02/10/2016 06:17 PM, Daniel Vetter wrote:
>>
>> On Wed, Feb 10, 2016 at 5:28 PM, Mario Kleiner
>> <mario.kleiner.de@gmail.com> wrote:
>>>
>>> There's another scenario where this zero-ts case can be hit. If the
>>> driver
>>> drm_vblank_init()'s - setting all timestamps to zero - and then code
>>> starts
>>> using vblanks (drm_vblank_get()) without drm_vblank_on beforehand, which
>>> is
>>> afaics the case with nouveau. Unless that's considered an error as well,
>>> we'd need to init the timestamps to something non-zero but harmless like
>>> 1
>>> usecs at drm_vblank_init() time?
>>
>>
>> Both legacy modeset helpers and atomic ones assume by default that you
>> start out with everything disabled. Pre-atomic drivers make that
>> happen by calling disable_unused_functions() to shut down anything the
>> bios has enabled. I think this can't happen.
>>
>> For drivers that do take over bootloader display config they must call
>> vblank_on explicitly themselves, which i915 does.
>>
>>> What makes sense as output here? DRM_WARN_ONCE?
>>
>>
>> I'd go with WARN_ON and tune it down if it's offensive. But WARN_ON
>> patch for 4.6 of course.
>> -Daniel
>>
>
> Ok, so does this one have your R-b for stable as is? What about a proper
> nouveau fix if i find one? Probably also for 4.6 then, given that this patch
> fixes it up good enough for stable?

If possible I'd prefer we just fix nouveau up for stable, and do a
WARN_ON patch when this ever happens for 4.6. It sounded like you've
figured out already what nouveau needs?
-Daniel
diff mbox

Patch

diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
index fb17c45..88bdf19 100644
--- a/drivers/gpu/drm/drm_irq.c
+++ b/drivers/gpu/drm/drm_irq.c
@@ -216,6 +216,13 @@  static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
 			DRM_DEBUG_VBL("crtc %u: Redundant vblirq ignored."
 				      " diff_ns = %lld, framedur_ns = %d)\n",
 				      pipe, (long long) diff_ns, framedur_ns);
+
+		/* No valid t_old to calculate diff? Bump +1 to force reinit. */
+		if (t_old->tv_sec == 0 && t_old->tv_usec == 0) {
+			DRM_DEBUG_VBL("crtc %u: No baseline ts. Bump +1.\n",
+				      pipe);
+			diff = 1;
+		}
 	} else {
 		/* some kind of default for drivers w/o accurate vbl timestamping */
 		diff = (flags & DRM_CALLED_FROM_VBLIRQ) != 0;