Funky new vblank counter regressions in Linux 4.4-rc1

On 11/25/2015 06:46 PM, Ville Syrjälä wrote:
> On Wed, Nov 25, 2015 at 06:24:13PM +0100, Mario Kleiner wrote:
>> On 11/23/2015 09:24 PM, Ville Syrjälä wrote:
>>> On Mon, Nov 23, 2015 at 06:58:34PM +0100, Mario Kleiner wrote:
>>>>
>>>>
>>>> On 11/23/2015 04:51 PM, Ville Syrjälä wrote:
>>>>> On Mon, Nov 23, 2015 at 04:23:21PM +0100, Mario Kleiner wrote:
>>>>>> On 11/20/2015 04:34 PM, Ville Syrjälä wrote:
>>>>>>> On Fri, Nov 20, 2015 at 04:24:50PM +0100, Mario Kleiner wrote:
>>>>>>
>>>>>> ...
>>>>>> Ok, but why would that be a bad thing? I think we want it to think it is
>>>>>> in the previous frame if it is called outside the vblank irq context.
>>>>>> The only reason we fudge it to the next frames vblank if i vblank irq is
>>>>>> because we know the vblank irq handler we are executing atm. was meant
>>>>>> to execute within the upcoming vblank for the next frame, so we fudge
>>>>>> the scanout positions and thereby timestamp to correspond to that new
>>>>>> frame. But if something called outside irq context it should get a
>>>>>> scanout position/timestamp that corresponds to "reality".
>>>>>
>>>>> It would be a bad thing since it would cause the timestamp to jump
>>>>> backwards, and that would also cause the frame count guesstimate to go
>>>>> backwards.
>>>>>
>>>>
>>>> But only if we don't use the dev->driver->get_vblank_counter() method,
>>>> which we try to use on AMD.
>>>
>>> Well, if you do it that way then you have the problem of the hw counter
>>> seeming to jump forward by one after crossing the start of vblank (when
>>> compared to the value you sampled when you processed the early vblank
>>> interrupt).
>>>
>>
>> Ok, finally i see the bad scenario that wouldn't get prevented by our
>> current locking with the new vblank counting in the core. The vblank
>> enable path is safe due to locking and discounting of redundant
>> timestamps etc. But the disable path could go wrong:
>>
>> 1. Vblank irq fires, drm_handle_vblank() -> drm_update_vblank_count(),
>> updates timestamps and counts "as if" in vblank -> incremented vblank
>> count and timestamp now set in the future.
>>
>> 2. After vblank irq finishes, but just before leading edge of vblank,
>> vblank_disable_and_save() executes, doesn't get bumped timestamp or
>> count because before vblank and not in vblank irq. Now
>> drm_update_vblank_count() would process a
>> "new" timestamp and count from the past and we'd have time and counts
>> going backwards, and bad things would happen.
>>
>> I haven't observed such a thing happening during testing so far,
>> probably because the time window in which it could happen is tiny, but
>> given how awfully bad it would be, it needs to be prevented.
>>
>> I had a look at the description of the Vblank irq in the "M76 Register
>> Reference Guide" for older asics and the description suggests that the
>> vblank irq fires when the crtc's line buffer is finished reading pixel
>> data from the scanout buffer in memory for a frame, ie., when the line
>> buffer read "enters" vblank. That would explain why the irq happens a
>> few scanlines before actual vblank, because line buffer refills must
>> obviously happen before the crtc can send pixel data from the line
>> buffer to the encoders, so it would lead a bit in time. That also means
>> we can't delay the vblank irq to actually happen at start of vblank and
>> have to deal with the early vblank irq.
>>
>>> I guess one silly idea would be to defer the vblank interrupt processing
>>> to a timer, and just schedule it a bit into the future from the actual
>>> interrupt handler.
>>>
>>
>> Timer would be bad because then we get problems with the pageflip
>> completion irq sometimes being processed before the vblank irq,
>
> You you'd need to move page flip completion to happen from the vblank
> timer too I suppose.
>
>> and
>> because we want to be fast in vblank irq handling, delivering vblank
>> events etc. I wouldn't trust a timer to be reliable enough for such
>> short waits.
>
> hrtimers should be accurate. But maybe more expensive than the timer
> wheel.
>

Sounds all a bit complex and fraught with new possible complications. I 
can't spend much more time on this than i did so far, and we need to get 
this into 4.4-rc, so i am aiming for the most simple solution that would 
work.

>> Busy waiting wouldn't be great either in irq.
>>
>> So what about this relatively simple one?
>>
>> 1. In radeon_get_crtc_scanoutpos() we artifically define the
>> vblank_start line to be, e.g, 5 scanlines before the true start of
>> vblank, so for the purpose of vblank counter queries and timestamping
>> our "vblank" would start a bit earlier and the vblank irq would always
>> execute in "vblank". Non-Irq invocations like vblank_disable_and_save()
>> would also be treated to this early vblank start and so what the DRM
>> core observes would always be consistent.
>>
>> 2. At least in radeon-kms we also use radeon_get_crtc_scanoutpos()
>> internally for "dynpm" dynamic power management/reclocking, and to
>> implement pageflip completion detection on asics older than DCE3 which
>> don't have pageflip interrupts. For those cases we need to use the true
>> start of vblank, so for this internal use we pass in some special flag
>> into radeon_get_crtc_scanoutpos() to tell it to not shift the vblank
>> start around.
>>
>> 3. I've added another check to my patch for drm_update_vblank_count() to
>> catch timestamps going backwards and discounting such cases, for a bit
>> of extra robustness against driver trouble.
>>
>> Any ideas why this would be a stupid idea? I'll try this now and just
>> hope meanwhile nobody finds a reason why this would be bad.
>
> What I don't like is leaking any of this into the core code. All the
> hacks should live in the hw specific code. We've managed to keep all
> the i915 hacks hidden that way.
>
> So if you would:
> - fudge your get_scanout_position as you suggested
> - _not_ expose the hw counter to the vblank core since it
>    disagrees with the scanout position
> - keep the internal get_scanout_position variant/flag
>    purely internal
>
> then I think I might like it. That way working hardware doesn't have to
> be exposed to these hacks, and there's possible a bit less danger that it
> all gets broken again next time the core needs some work.
>

Ok. Exposing the fudged hw counter shouldn't be a problem though, given 
that it would be fudged in a consistent way with the fudged scanout 
positions and to have incremented at time of drm_handle_vblank(). We'd 
bump the hw counter to the count of the new vblank at the same time when 
the scanout positions would start counting backwards from minus 
something to 0, showing how many vblank lines to go until start of 
scanout, etc. The only difference to reality would be that this 
simulated vblank would start 5 scanlines earlier than the true hw 
vblank, but i can't see how the core would care about that.

> One problem with that approach could be that the vblank event and page
> flip event would be delievered at different times if you keep using the
> page flip interrupt, but I'm not sure that would be a problem. At least
> they should both have the same timestamp and counter value.
>

That's the same now, and the timestamps and counts be the same. I'll 
check with my measurement equipment that the timestamps will be still 
accurate wrt. to reality.

Attached is my current patch i wanted to submit for the drm core's 
drm_update_vblank_count(). I think it's good to make the core somewhat 
robust against potential kms driver bugs or glitches. But if you 
wouldn't like that patch, there wouldn't be much of a point sending it 
out at all.

thanks,
-mario

Funky new vblank counter regressions in Linux 4.4-rc1

Commit Message

Patch