diff mbox

Funky new vblank counter regressions in Linux 4.4-rc1

Message ID 5656063A.8080504@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Mario Kleiner Nov. 25, 2015, 7:04 p.m. UTC
On 11/25/2015 06:46 PM, Ville Syrjälä wrote:
> On Wed, Nov 25, 2015 at 06:24:13PM +0100, Mario Kleiner wrote:
>> On 11/23/2015 09:24 PM, Ville Syrjälä wrote:
>>> On Mon, Nov 23, 2015 at 06:58:34PM +0100, Mario Kleiner wrote:
>>>>
>>>>
>>>> On 11/23/2015 04:51 PM, Ville Syrjälä wrote:
>>>>> On Mon, Nov 23, 2015 at 04:23:21PM +0100, Mario Kleiner wrote:
>>>>>> On 11/20/2015 04:34 PM, Ville Syrjälä wrote:
>>>>>>> On Fri, Nov 20, 2015 at 04:24:50PM +0100, Mario Kleiner wrote:
>>>>>>
>>>>>> ...
>>>>>> Ok, but why would that be a bad thing? I think we want it to think it is
>>>>>> in the previous frame if it is called outside the vblank irq context.
>>>>>> The only reason we fudge it to the next frames vblank if i vblank irq is
>>>>>> because we know the vblank irq handler we are executing atm. was meant
>>>>>> to execute within the upcoming vblank for the next frame, so we fudge
>>>>>> the scanout positions and thereby timestamp to correspond to that new
>>>>>> frame. But if something called outside irq context it should get a
>>>>>> scanout position/timestamp that corresponds to "reality".
>>>>>
>>>>> It would be a bad thing since it would cause the timestamp to jump
>>>>> backwards, and that would also cause the frame count guesstimate to go
>>>>> backwards.
>>>>>
>>>>
>>>> But only if we don't use the dev->driver->get_vblank_counter() method,
>>>> which we try to use on AMD.
>>>
>>> Well, if you do it that way then you have the problem of the hw counter
>>> seeming to jump forward by one after crossing the start of vblank (when
>>> compared to the value you sampled when you processed the early vblank
>>> interrupt).
>>>
>>
>> Ok, finally i see the bad scenario that wouldn't get prevented by our
>> current locking with the new vblank counting in the core. The vblank
>> enable path is safe due to locking and discounting of redundant
>> timestamps etc. But the disable path could go wrong:
>>
>> 1. Vblank irq fires, drm_handle_vblank() -> drm_update_vblank_count(),
>> updates timestamps and counts "as if" in vblank -> incremented vblank
>> count and timestamp now set in the future.
>>
>> 2. After vblank irq finishes, but just before leading edge of vblank,
>> vblank_disable_and_save() executes, doesn't get bumped timestamp or
>> count because before vblank and not in vblank irq. Now
>> drm_update_vblank_count() would process a
>> "new" timestamp and count from the past and we'd have time and counts
>> going backwards, and bad things would happen.
>>
>> I haven't observed such a thing happening during testing so far,
>> probably because the time window in which it could happen is tiny, but
>> given how awfully bad it would be, it needs to be prevented.
>>
>> I had a look at the description of the Vblank irq in the "M76 Register
>> Reference Guide" for older asics and the description suggests that the
>> vblank irq fires when the crtc's line buffer is finished reading pixel
>> data from the scanout buffer in memory for a frame, ie., when the line
>> buffer read "enters" vblank. That would explain why the irq happens a
>> few scanlines before actual vblank, because line buffer refills must
>> obviously happen before the crtc can send pixel data from the line
>> buffer to the encoders, so it would lead a bit in time. That also means
>> we can't delay the vblank irq to actually happen at start of vblank and
>> have to deal with the early vblank irq.
>>
>>> I guess one silly idea would be to defer the vblank interrupt processing
>>> to a timer, and just schedule it a bit into the future from the actual
>>> interrupt handler.
>>>
>>
>> Timer would be bad because then we get problems with the pageflip
>> completion irq sometimes being processed before the vblank irq,
>
> You you'd need to move page flip completion to happen from the vblank
> timer too I suppose.
>
>> and
>> because we want to be fast in vblank irq handling, delivering vblank
>> events etc. I wouldn't trust a timer to be reliable enough for such
>> short waits.
>
> hrtimers should be accurate. But maybe more expensive than the timer
> wheel.
>

Sounds all a bit complex and fraught with new possible complications. I 
can't spend much more time on this than i did so far, and we need to get 
this into 4.4-rc, so i am aiming for the most simple solution that would 
work.

>> Busy waiting wouldn't be great either in irq.
>>
>> So what about this relatively simple one?
>>
>> 1. In radeon_get_crtc_scanoutpos() we artifically define the
>> vblank_start line to be, e.g, 5 scanlines before the true start of
>> vblank, so for the purpose of vblank counter queries and timestamping
>> our "vblank" would start a bit earlier and the vblank irq would always
>> execute in "vblank". Non-Irq invocations like vblank_disable_and_save()
>> would also be treated to this early vblank start and so what the DRM
>> core observes would always be consistent.
>>
>> 2. At least in radeon-kms we also use radeon_get_crtc_scanoutpos()
>> internally for "dynpm" dynamic power management/reclocking, and to
>> implement pageflip completion detection on asics older than DCE3 which
>> don't have pageflip interrupts. For those cases we need to use the true
>> start of vblank, so for this internal use we pass in some special flag
>> into radeon_get_crtc_scanoutpos() to tell it to not shift the vblank
>> start around.
>>
>> 3. I've added another check to my patch for drm_update_vblank_count() to
>> catch timestamps going backwards and discounting such cases, for a bit
>> of extra robustness against driver trouble.
>>
>> Any ideas why this would be a stupid idea? I'll try this now and just
>> hope meanwhile nobody finds a reason why this would be bad.
>
> What I don't like is leaking any of this into the core code. All the
> hacks should live in the hw specific code. We've managed to keep all
> the i915 hacks hidden that way.
>
> So if you would:
> - fudge your get_scanout_position as you suggested
> - _not_ expose the hw counter to the vblank core since it
>    disagrees with the scanout position
> - keep the internal get_scanout_position variant/flag
>    purely internal
>
> then I think I might like it. That way working hardware doesn't have to
> be exposed to these hacks, and there's possible a bit less danger that it
> all gets broken again next time the core needs some work.
>

Ok. Exposing the fudged hw counter shouldn't be a problem though, given 
that it would be fudged in a consistent way with the fudged scanout 
positions and to have incremented at time of drm_handle_vblank(). We'd 
bump the hw counter to the count of the new vblank at the same time when 
the scanout positions would start counting backwards from minus 
something to 0, showing how many vblank lines to go until start of 
scanout, etc. The only difference to reality would be that this 
simulated vblank would start 5 scanlines earlier than the true hw 
vblank, but i can't see how the core would care about that.


> One problem with that approach could be that the vblank event and page
> flip event would be delievered at different times if you keep using the
> page flip interrupt, but I'm not sure that would be a problem. At least
> they should both have the same timestamp and counter value.
>

That's the same now, and the timestamps and counts be the same. I'll 
check with my measurement equipment that the timestamps will be still 
accurate wrt. to reality.

Attached is my current patch i wanted to submit for the drm core's 
drm_update_vblank_count(). I think it's good to make the core somewhat 
robust against potential kms driver bugs or glitches. But if you 
wouldn't like that patch, there wouldn't be much of a point sending it 
out at all.

thanks,
-mario
diff mbox

Patch

>From 2d5d58a1c575ad002ce2cb643f395d0e4757d959 Mon Sep 17 00:00:00 2001
From: Mario Kleiner <mario.kleiner.de@gmail.com>
Date: Wed, 25 Nov 2015 18:48:31 +0100
Subject: [PATCH] drm/irq: Make drm_update_vblank_count() more robust.

The changes to drm_update_vblank_count() for Linux 4.4-rc
made the function more fragile wrt. some hw quirks. E.g.,
at dev->driver->enable_vblank(), AMD gpu's fire a spurious
redundant vblank irq shortly after enabling vblank irqs, not
locked to vblank. This causes a redundant call which needs
to be suppressed to avoid miscounting.

To increase robustness, shuffle things around a bit:

On drivers with high precision vblank timestamping always
evaluate the timestamp difference between current timestamp
and previously recorded timestamp to detect such redundant
invocations and no-op in that case.

Also detect and warn about timestamps going backwards to
catch potential kms driver bugs.

This patch is meant for Linux 4.4-rc and later.

Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
---
 drivers/gpu/drm/drm_irq.c | 53 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 41 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
index 819b8c1..8728c3c 100644
--- a/drivers/gpu/drm/drm_irq.c
+++ b/drivers/gpu/drm/drm_irq.c
@@ -172,9 +172,11 @@  static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
 				    unsigned long flags)
 {
 	struct drm_vblank_crtc *vblank = &dev->vblank[pipe];
-	u32 cur_vblank, diff;
+	u32 cur_vblank, diff = 0;
 	bool rc;
 	struct timeval t_vblank;
+	const struct timeval *t_old;
+	u64 diff_ns;
 	int count = DRM_TIMESTAMP_MAXRETRIES;
 	int framedur_ns = vblank->framedur_ns;
 
@@ -195,13 +197,15 @@  static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
 		rc = drm_get_last_vbltimestamp(dev, pipe, &t_vblank, flags);
 	} while (cur_vblank != dev->driver->get_vblank_counter(dev, pipe) && --count > 0);
 
-	if (dev->max_vblank_count != 0) {
-		/* trust the hw counter when it's around */
-		diff = (cur_vblank - vblank->last) & dev->max_vblank_count;
-	} else if (rc && framedur_ns) {
-		const struct timeval *t_old;
-		u64 diff_ns;
-
+	/*
+	 * Always use vblank timestamping based method if supported to reject
+	 * redundant vblank irqs. E.g., AMD hardware needs this to not screw up
+	 * due to some irq handling quirk.
+	 *
+	 * This also sets the diff value for use as fallback below in case the
+	 * hw does not support a suitable hw vblank counter.
+	 */
+	if (rc && framedur_ns) {
 		t_old = &vblanktimestamp(dev, pipe, vblank->count);
 		diff_ns = timeval_to_ns(&t_vblank) - timeval_to_ns(t_old);
 
@@ -212,11 +216,36 @@  static void drm_update_vblank_count(struct drm_device *dev, unsigned int pipe,
 		 */
 		diff = DIV_ROUND_CLOSEST_ULL(diff_ns, framedur_ns);
 
-		if (diff == 0 && flags & DRM_CALLED_FROM_VBLIRQ)
+		/* Catch driver timestamping bugs and prevent worse. */
+		if ((s32) diff < 0) {
+			DRM_DEBUG_VBL("crtc %u: Timestamp going backward!"
+			" diff_ns = %lld, framedur_ns = %d)\n",
+			pipe, (long long) diff_ns, framedur_ns);
+			return;
+		}
+
+		if (diff == 0 && flags & DRM_CALLED_FROM_VBLIRQ) {
 			DRM_DEBUG_VBL("crtc %u: Redundant vblirq ignored."
-				      " diff_ns = %lld, framedur_ns = %d)\n",
-				      pipe, (long long) diff_ns, framedur_ns);
-	} else {
+			" diff_ns = %lld, framedur_ns = %d)\n",
+			pipe, (long long) diff_ns, framedur_ns);
+
+			/* Treat this redundant invocation as no-op. */
+			WARN_ON_ONCE(cur_vblank != vblank->last);
+			return;
+		}
+	}
+
+	/*
+	 * hw counters, if marked as supported via max_vblank_count != 0,
+	 * *must* increment at leading edge of vblank or at least before
+	 * the firing of the hw vblank irq, otherwise we have a race here
+	 * between gpu and us, where small variations in our execution
+	 * timing can cause off-by-one miscounting of vblanks.
+	 */
+	if (dev->max_vblank_count != 0) {
+		/* trust the hw counter when it's around */
+		diff = (cur_vblank - vblank->last) & dev->max_vblank_count;
+	} else if (!(rc && framedur_ns)) {
 		/* some kind of default for drivers w/o accurate vbl timestamping */
 		diff = (flags & DRM_CALLED_FROM_VBLIRQ) != 0;
 	}
-- 
1.9.1