drm/vblank: Avoid storing a timestamp for the same frame twice

Message ID 20210204020400.29628-1-ville.syrjala@linux.intel.com (mailing list archive)
State New, archived
Series drm/vblank: Avoid storing a timestamp for the same frame twice

Commit Message

Ville Syrjälä Feb. 4, 2021, 2:04 a.m. UTC
From: Ville Syrjälä <ville.syrjala@linux.intel.com>

drm_vblank_restore() exists because certain power saving states
can clobber the hardware frame counter. The way it does this is
by guesstimating how many frames were missed purely based on
the difference between the last stored timestamp vs. a newly
sampled timestamp.

If we should call this function before a full frame has
elapsed since we sampled the last timestamp we would end up
with a possibly slightly different timestamp value for the
same frame. Currently we will happily overwrite the already
stored timestamp for the frame with the new value. This
could cause userspace to observe two different timestamps
for the same frame (and the timestamp could even go
backwards depending on how much error we introduce when
correcting the timestamp based on the scanout position).

To avoid that let's not update the stored timestamp unless we're
also incrementing the sequence counter. We do still want to update
vblank->last with the freshly sampled hw frame counter value so
that subsequent vblank irqs/queries can actually use the hw frame
counter to determine how many frames have elapsed.

Cc: Dhinakaran Pandiyan <dhinakaran.pandiyan@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
 drivers/gpu/drm/drm_vblank.c | 11 +++++++++++
 1 file changed, 11 insertions(+)
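
The behaviour described in the commit message can be modelled in userspace. This is only a sketch: struct vblank_model and store_vblank_model() are hypothetical stand-ins, not the kernel's struct drm_vblank_crtc or store_vblank(), and the seqlock around the timestamp/count update is left out.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for the per-crtc vblank state. */
struct vblank_model {
	uint32_t last;    /* last sampled hw frame counter value */
	uint64_t count;   /* software sequence counter seen by userspace */
	int64_t time_ns;  /* timestamp stored for 'count' */
};

static void store_vblank_model(struct vblank_model *vblank,
			       uint32_t vblank_count_inc,
			       int64_t t_vblank_ns, uint32_t last)
{
	/* Always resync with the freshly sampled hw counter. */
	vblank->last = last;

	/* Same frame as before: keep the timestamp already stored. */
	if (vblank_count_inc == 0)
		return;

	vblank->time_ns = t_vblank_ns;
	vblank->count += vblank_count_inc;
}
```

With an increment of zero only .last changes, so a reader of count/time_ns never sees a second timestamp for the same sequence number.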

Comments

Daniel Vetter Feb. 4, 2021, 3:32 p.m. UTC | #1
On Thu, Feb 04, 2021 at 04:04:00AM +0200, Ville Syrjala wrote:
> From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> 
> drm_vblank_restore() exists because certain power saving states
> can clobber the hardware frame counter. The way it does this is
> by guesstimating how many frames were missed purely based on
> the difference between the last stored timestamp vs. a newly
> sampled timestamp.
> 
> If we should call this function before a full frame has
> elapsed since we sampled the last timestamp we would end up
> with a possibly slightly different timestamp value for the
> same frame. Currently we will happily overwrite the already
> stored timestamp for the frame with the new value. This
> could cause userspace to observe two different timestamps
> for the same frame (and the timestamp could even go
> backwards depending on how much error we introduce when
> correcting the timestamp based on the scanout position).
> 
> To avoid that let's not update the stored timestamp unless we're
> also incrementing the sequence counter. We do still want to update
> vblank->last with the freshly sampled hw frame counter value so
> that subsequent vblank irqs/queries can actually use the hw frame
> counter to determine how many frames have elapsed.

Hm I'm not getting the reason for why we store the updated hw vblank
counter?

There's definitely a race when we grab the hw timestamp at a bad time
(which can't happen for the irq handler, realistically), so maybe we
should first adjust that to make sure we never store anything inconsistent
in the vblank state?

And when we have that we should be able to pull the inc == 0 check out
into _restore(), including comment. Which I think should be cleaner.

Or I'm totally off with why you want to store the hw vblank counter?

> 
> Cc: Dhinakaran Pandiyan <dhinakaran.pandiyan@intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> ---
>  drivers/gpu/drm/drm_vblank.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
> index 893165eeddf3..e127a7db2088 100644
> --- a/drivers/gpu/drm/drm_vblank.c
> +++ b/drivers/gpu/drm/drm_vblank.c
> @@ -176,6 +176,17 @@ static void store_vblank(struct drm_device *dev, unsigned int pipe,
>  
>  	vblank->last = last;
>  
> +	/*
> +	 * drm_vblank_restore() wants to always update
> +	 * vblank->last since we can't trust the frame counter
> +	 * across power saving states. But we don't want to alter
> +	 * the stored timestamp for the same frame number since
> +	 * that would cause userspace to potentially observe two
> +	 * different timestamps for the same frame.
> +	 */
> +	if (vblank_count_inc == 0)
> +		return;
> +
>  	write_seqlock(&vblank->seqlock);
>  	vblank->time = t_vblank;
>  	atomic64_add(vblank_count_inc, &vblank->count);
> -- 
> 2.26.2
>
Ville Syrjälä Feb. 4, 2021, 3:55 p.m. UTC | #2
On Thu, Feb 04, 2021 at 04:32:16PM +0100, Daniel Vetter wrote:
> On Thu, Feb 04, 2021 at 04:04:00AM +0200, Ville Syrjala wrote:
> > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > 
> > drm_vblank_restore() exists because certain power saving states
> > can clobber the hardware frame counter. The way it does this is
> > by guesstimating how many frames were missed purely based on
> > the difference between the last stored timestamp vs. a newly
> > sampled timestamp.
> > 
> > If we should call this function before a full frame has
> > elapsed since we sampled the last timestamp we would end up
> > with a possibly slightly different timestamp value for the
> > same frame. Currently we will happily overwrite the already
> > stored timestamp for the frame with the new value. This
> > could cause userspace to observe two different timestamps
> > for the same frame (and the timestamp could even go
> > backwards depending on how much error we introduce when
> > correcting the timestamp based on the scanout position).
> > 
> > To avoid that let's not update the stored timestamp unless we're
> > also incrementing the sequence counter. We do still want to update
> > vblank->last with the freshly sampled hw frame counter value so
> > that subsequent vblank irqs/queries can actually use the hw frame
> > counter to determine how many frames have elapsed.
> 
> Hm I'm not getting the reason for why we store the updated hw vblank
> counter?

Because next time a vblank irq happens the code will do:
diff = current_hw_counter - vblank->last

which won't work very well if vblank->last is garbage.

Updating vblank->last is pretty much why drm_vblank_restore()
exists at all.
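
To illustrate why a stale vblank->last is a problem, here is a minimal sketch, assuming a hw counter whose maximum value has the form 2^n - 1. hw_counter_diff() is a hypothetical helper written for this example, not a function from drm_vblank.c.

```c
#include <stdint.h>

/*
 * Frames elapsed according to the hw counter, modulo the counter width.
 * max_vblank_count is the largest value the counter can hold and is
 * assumed to be of the form 2^n - 1.
 */
static uint32_t hw_counter_diff(uint32_t cur, uint32_t last,
				uint32_t max_vblank_count)
{
	return (cur - last) & max_vblank_count;
}
```

With a 24-bit counter, last = 1000 and cur = 1001 gives a diff of 1. If the counter was reset while powered down and now reads 2, the same subtraction yields 16776218, which is the kind of jump in the sequence number that updating .last is meant to avoid.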

> There's definitely a race when we grab the hw timestamp at a bad time
> (which can't happen for the irq handler, realistically), so maybe we
> should first adjust that to make sure we never store anything inconsistent
> in the vblank state?

Not sure what race you mean, or what inconsistent thing we store?

> 
> And when we have that we should be able to pull the inc == 0 check out
> into _restore(), including comment. Which I think should be cleaner.
> 
> Or I'm totally off with why you want to store the hw vblank counter?
> 
> > 
> > Cc: Dhinakaran Pandiyan <dhinakaran.pandiyan@intel.com>
> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > ---
> >  drivers/gpu/drm/drm_vblank.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
> > index 893165eeddf3..e127a7db2088 100644
> > --- a/drivers/gpu/drm/drm_vblank.c
> > +++ b/drivers/gpu/drm/drm_vblank.c
> > @@ -176,6 +176,17 @@ static void store_vblank(struct drm_device *dev, unsigned int pipe,
> >  
> >  	vblank->last = last;
> >  
> > +	/*
> > +	 * drm_vblank_restore() wants to always update
> > +	 * vblank->last since we can't trust the frame counter
> > +	 * across power saving states. But we don't want to alter
> > +	 * the stored timestamp for the same frame number since
> > +	 * that would cause userspace to potentially observe two
> > +	 * different timestamps for the same frame.
> > +	 */
> > +	if (vblank_count_inc == 0)
> > +		return;
> > +
> >  	write_seqlock(&vblank->seqlock);
> >  	vblank->time = t_vblank;
> >  	atomic64_add(vblank_count_inc, &vblank->count);
> > -- 
> > 2.26.2
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Daniel Vetter Feb. 5, 2021, 3:46 p.m. UTC | #3
On Thu, Feb 04, 2021 at 05:55:28PM +0200, Ville Syrjälä wrote:
> On Thu, Feb 04, 2021 at 04:32:16PM +0100, Daniel Vetter wrote:
> > On Thu, Feb 04, 2021 at 04:04:00AM +0200, Ville Syrjala wrote:
> > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > 
> > > drm_vblank_restore() exists because certain power saving states
> > > can clobber the hardware frame counter. The way it does this is
> > > by guesstimating how many frames were missed purely based on
> > > the difference between the last stored timestamp vs. a newly
> > > sampled timestamp.
> > > 
> > > If we should call this function before a full frame has
> > > elapsed since we sampled the last timestamp we would end up
> > > with a possibly slightly different timestamp value for the
> > > same frame. Currently we will happily overwrite the already
> > > stored timestamp for the frame with the new value. This
> > > could cause userspace to observe two different timestamps
> > > for the same frame (and the timestamp could even go
> > > backwards depending on how much error we introduce when
> > > correcting the timestamp based on the scanout position).
> > > 
> > > To avoid that let's not update the stored timestamp unless we're
> > > also incrementing the sequence counter. We do still want to update
> > > vblank->last with the freshly sampled hw frame counter value so
> > > that subsequent vblank irqs/queries can actually use the hw frame
> > > counter to determine how many frames have elapsed.
> > 
> > Hm I'm not getting the reason for why we store the updated hw vblank
> > counter?
> 
> Because next time a vblank irq happens the code will do:
> diff = current_hw_counter - vblank->last
> 
> which won't work very well if vblank->last is garbage.
> 
> Updating vblank->last is pretty much why drm_vblank_restore()
> exists at all.

Oh sure, _restore has to update this, together with the timestamp.

But your code adds such an update where we update the hw vblank counter,
but not the timestamp, and that feels buggy. Either we're still in the
same frame, and then we should store nothing. Or we advanced, and then we
probably want a new timestamp for that frame too.

Advancing the vblank counter and not advancing the timestamp sounds like a
bug in our code.

> > There's definitely a race when we grab the hw timestamp at a bad time
> > (which can't happen for the irq handler, realistically), so maybe we
> > should first adjust that to make sure we never store anything inconsistent
> > in the vblank state?
> 
> Not sure what race you mean, or what inconsistent thing we store?

For the drm_handle_vblank code we have some fudge so we don't compute
something silly when the irq fires (like it often does) before
top-of-frame. Ofc that fudge is inherently racy: if the irq is extremely
delayed (almost an entire frame) we'll get it wrong.

In practice it doesn't matter.

Now _restore can be called anytime, so we might end up in situations where
the exact point where we jump to the next frame count, and the exact time
where the hw counter jumps, don't line up. And I think in that case funny
things can happen, and I'm not sure your approach of "update hw counter
but don't update timestamp" is the right way.

I think if we instead ignore any update if our fudge-corrected timestamp
is roughly the same, then we handle that race correctly and there's no
jumping around.

Cheers, Daniel

> > And when we have that we should be able to pull the inc == 0 check out
> > into _restore(), including comment. Which I think should be cleaner.
> > 
> > Or I'm totally off with why you want to store the hw vblank counter?
> > 
> > > 
> > > Cc: Dhinakaran Pandiyan <dhinakaran.pandiyan@intel.com>
> > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > > Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > ---
> > >  drivers/gpu/drm/drm_vblank.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
> > > index 893165eeddf3..e127a7db2088 100644
> > > --- a/drivers/gpu/drm/drm_vblank.c
> > > +++ b/drivers/gpu/drm/drm_vblank.c
> > > @@ -176,6 +176,17 @@ static void store_vblank(struct drm_device *dev, unsigned int pipe,
> > >  
> > >  	vblank->last = last;
> > >  
> > > +	/*
> > > +	 * drm_vblank_restore() wants to always update
> > > +	 * vblank->last since we can't trust the frame counter
> > > +	 * across power saving states. But we don't want to alter
> > > +	 * the stored timestamp for the same frame number since
> > > +	 * that would cause userspace to potentially observe two
> > > +	 * different timestamps for the same frame.
> > > +	 */
> > > +	if (vblank_count_inc == 0)
> > > +		return;
> > > +
> > >  	write_seqlock(&vblank->seqlock);
> > >  	vblank->time = t_vblank;
> > >  	atomic64_add(vblank_count_inc, &vblank->count);
> > > -- 
> > > 2.26.2
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> 
> -- 
> Ville Syrjälä
> Intel
Ville Syrjälä Feb. 5, 2021, 4:24 p.m. UTC | #4
On Fri, Feb 05, 2021 at 04:46:27PM +0100, Daniel Vetter wrote:
> On Thu, Feb 04, 2021 at 05:55:28PM +0200, Ville Syrjälä wrote:
> > On Thu, Feb 04, 2021 at 04:32:16PM +0100, Daniel Vetter wrote:
> > > On Thu, Feb 04, 2021 at 04:04:00AM +0200, Ville Syrjala wrote:
> > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > 
> > > > drm_vblank_restore() exists because certain power saving states
> > > > can clobber the hardware frame counter. The way it does this is
> > > > by guesstimating how many frames were missed purely based on
> > > > the difference between the last stored timestamp vs. a newly
> > > > sampled timestamp.
> > > > 
> > > > If we should call this function before a full frame has
> > > > elapsed since we sampled the last timestamp we would end up
> > > > with a possibly slightly different timestamp value for the
> > > > same frame. Currently we will happily overwrite the already
> > > > stored timestamp for the frame with the new value. This
> > > > could cause userspace to observe two different timestamps
> > > > for the same frame (and the timestamp could even go
> > > > backwards depending on how much error we introduce when
> > > > correcting the timestamp based on the scanout position).
> > > > 
> > > > To avoid that let's not update the stored timestamp unless we're
> > > > also incrementing the sequence counter. We do still want to update
> > > > vblank->last with the freshly sampled hw frame counter value so
> > > > that subsequent vblank irqs/queries can actually use the hw frame
> > > > counter to determine how many frames have elapsed.
> > > 
> > > Hm I'm not getting the reason for why we store the updated hw vblank
> > > counter?
> > 
> > Because next time a vblank irq happens the code will do:
> > diff = current_hw_counter - vblank->last
> > 
> > which won't work very well if vblank->last is garbage.
> > 
> > Updating vblank->last is pretty much why drm_vblank_restore()
> > exists at all.
> 
> Oh sure, _restore has to update this, together with the timestamp.
> 
> But your code adds such an update where we update the hw vblank counter,
> but not the timestamp, and that feels buggy. Either we're still in the
> same frame, and then we should store nothing. Or we advanced, and then we
> probably want a new timestamp for that frame too.

Even if we're still in the same frame the hw frame counter may already
have been reset due to the power well having been turned off. That is
what I'm trying to fix here.

Now I suppose that's fairly unlikely, at least with PSR which probably
does impose some extra delays before the power gets yanked. But at least
theoretically possible.

> 
> Advancing the vblank counter and not advancing the timestamp sounds like a
> bug in our code.

We're not advancing the vblank counter. We're storing a new
timestamp for a vblank counter value which already had a timestamp.

> 
> > > There's definitely a race when we grab the hw timestamp at a bad time
> > > (which can't happen for the irq handler, realistically), so maybe we
> > > should first adjust that to make sure we never store anything inconsistent
> > > in the vblank state?
> > 
> > Not sure what race you mean, or what inconsistent thing we store?
> 
> For the drm_handle_vblank code we have some fudge so we don't compute
> something silly when the irq fires (like it often does) before
> top-of-frame. Ofc that fudge is inherently racy: if the irq is extremely
> delayed (almost an entire frame) we'll get it wrong.

Sorry, still no idea what fudge you mean.

> 
> In practice it doesn't matter.
> 
> Now _restore can be called anytime, so we might end up in situations where
> the exact point where we jump to the next frame count, and the exact time
> where the hw counter jumps, don't line up. And I think in that case funny
> things can happen, and I'm not sure your approach of "update hw counter
> but don't update timestamp" is the right way.
> 
> I think if we instead ignore any update if our fudge-corrected timestamp
> is roughly the same, then we handle that race correctly and there's no
> jumping around.

We can't just not update vblank->last, assuming the theory holds
that the power well may turn off even if the last vblank timestamp
was sampled less than a full frame ago.

That will cause the next diff=current_hw_counter-vblank->last to
generate total garbage and then the vblank seq number will jump
to some random value. Which is exactly the main problem
drm_vblank_restore() is trying to prevent.
Ville Syrjälä Feb. 5, 2021, 9:19 p.m. UTC | #5
On Fri, Feb 05, 2021 at 06:24:08PM +0200, Ville Syrjälä wrote:
> On Fri, Feb 05, 2021 at 04:46:27PM +0100, Daniel Vetter wrote:
> > On Thu, Feb 04, 2021 at 05:55:28PM +0200, Ville Syrjälä wrote:
> > > On Thu, Feb 04, 2021 at 04:32:16PM +0100, Daniel Vetter wrote:
> > > > On Thu, Feb 04, 2021 at 04:04:00AM +0200, Ville Syrjala wrote:
> > > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > > 
> > > > > drm_vblank_restore() exists because certain power saving states
> > > > > can clobber the hardware frame counter. The way it does this is
> > > > > by guesstimating how many frames were missed purely based on
> > > > > the difference between the last stored timestamp vs. a newly
> > > > > sampled timestamp.
> > > > > 
> > > > > If we should call this function before a full frame has
> > > > > elapsed since we sampled the last timestamp we would end up
> > > > > with a possibly slightly different timestamp value for the
> > > > > same frame. Currently we will happily overwrite the already
> > > > > stored timestamp for the frame with the new value. This
> > > > > could cause userspace to observe two different timestamps
> > > > > for the same frame (and the timestamp could even go
> > > > > backwards depending on how much error we introduce when
> > > > > correcting the timestamp based on the scanout position).
> > > > > 
> > > > > To avoid that let's not update the stored timestamp unless we're
> > > > > also incrementing the sequence counter. We do still want to update
> > > > > vblank->last with the freshly sampled hw frame counter value so
> > > > > that subsequent vblank irqs/queries can actually use the hw frame
> > > > > counter to determine how many frames have elapsed.
> > > > 
> > > > Hm I'm not getting the reason for why we store the updated hw vblank
> > > > counter?
> > > 
> > > Because next time a vblank irq happens the code will do:
> > > diff = current_hw_counter - vblank->last
> > > 
> > > which won't work very well if vblank->last is garbage.
> > > 
> > > Updating vblank->last is pretty much why drm_vblank_restore()
> > > exists at all.
> > 
> > Oh sure, _restore has to update this, together with the timestamp.
> > 
> > But your code adds such an update where we update the hw vblank counter,
> > but not the timestamp, and that feels buggy. Either we're still in the
> > > same frame, and then we should store nothing. Or we advanced, and then we
> > > probably want a new timestamp for that frame too.
> 
> Even if we're still in the same frame the hw frame counter may already
> have been reset due to the power well having been turned off. That is
> what I'm trying to fix here.
> 
> Now I suppose that's fairly unlikely, at least with PSR which probably
> does impose some extra delays before the power gets yanked. But at least
> theoretically possible.

Pondering about this a bit further. I think the fact that the current
code takes the round-to-closest approach I used for the vblank handler
is perhaps a bit bad. It could push the seq counter forward if we're
past the halfway point of a frame. I think that rounding behaviour
makes sense for the irq since those tick steadily and so allowing a bit
of error either way seems correct to me. Perhaps round-down might be
the better option for _restore(). Not quite sure, need more thinking
probably.
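
The two rounding choices can be compared with a small sketch. Both helpers are hypothetical and only model the arithmetic, assuming timestamps in nanoseconds and framedur_ns > 0.

```c
#include <stdint.h>

/* Round-to-closest: counts a frame once more than half of it has elapsed. */
static uint64_t frames_round_closest(uint64_t diff_ns, uint64_t framedur_ns)
{
	return (diff_ns + framedur_ns / 2) / framedur_ns;
}

/* Round-down: counts only fully elapsed frames. */
static uint64_t frames_round_down(uint64_t diff_ns, uint64_t framedur_ns)
{
	return diff_ns / framedur_ns;
}
```

At 60 Hz (framedur_ns = 16666667) an elapsed time of 10 ms is 0.6 of a frame: round-to-closest reports one frame, round-down reports zero. That is the "push the seq counter forward past the halfway point" case.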

Another idea that came to me now is that maybe we should actually just
check if the current hw frame counter value looks sane, as in something
like:

diff_hw_counter = current_hw_counter-stored_hw_counter
diff_ts = (current_ts-stored_ts)/framedur

if (diff_hw_counter ~= diff_ts)
	diff = diff_hw_counter;
else
	diff = diff_ts;

and if they seem to match then just keep trusting the hw counter.
So only if there's a significant difference would we disregard
the diff of the hw counter and instead use the diff based on the
timestamps. Not sure what "significant" is though; One frame, two
frames?
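
A runnable sketch of the sanity check outlined above. pick_frame_diff() and its tolerance parameter are hypothetical; the right value for "significant" is left open here, so the tolerance is just an input. The timestamp-based diff uses round-to-closest, which is an assumption of this sketch, as is a counter maximum of the form 2^n - 1 and cur_ts_ns >= stored_ts_ns.

```c
#include <stdint.h>

/*
 * Trust the hw counter diff if it agrees with the timestamp-based
 * estimate to within 'tolerance' frames; otherwise fall back to the
 * estimate derived from the timestamps.
 */
static uint32_t pick_frame_diff(uint32_t cur_hw, uint32_t stored_hw,
				uint32_t max_vblank_count,
				int64_t cur_ts_ns, int64_t stored_ts_ns,
				int64_t framedur_ns, uint32_t tolerance)
{
	uint32_t diff_hw = (cur_hw - stored_hw) & max_vblank_count;
	uint32_t diff_ts = (uint32_t)((cur_ts_ns - stored_ts_ns +
				       framedur_ns / 2) / framedur_ns);
	uint32_t delta = diff_hw > diff_ts ? diff_hw - diff_ts
					   : diff_ts - diff_hw;

	return delta <= tolerance ? diff_hw : diff_ts;
}
```

If the counter survived, both diffs agree and the hw value is used. If it was reset, the masked hw diff is far from the timestamp estimate and the estimate wins.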
Daniel Vetter Feb. 8, 2021, 9:56 a.m. UTC | #6
On Fri, Feb 05, 2021 at 11:19:19PM +0200, Ville Syrjälä wrote:
> On Fri, Feb 05, 2021 at 06:24:08PM +0200, Ville Syrjälä wrote:
> > On Fri, Feb 05, 2021 at 04:46:27PM +0100, Daniel Vetter wrote:
> > > On Thu, Feb 04, 2021 at 05:55:28PM +0200, Ville Syrjälä wrote:
> > > > On Thu, Feb 04, 2021 at 04:32:16PM +0100, Daniel Vetter wrote:
> > > > > On Thu, Feb 04, 2021 at 04:04:00AM +0200, Ville Syrjala wrote:
> > > > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > > > 
> > > > > > drm_vblank_restore() exists because certain power saving states
> > > > > > can clobber the hardware frame counter. The way it does this is
> > > > > > by guesstimating how many frames were missed purely based on
> > > > > > the difference between the last stored timestamp vs. a newly
> > > > > > sampled timestamp.
> > > > > > 
> > > > > > If we should call this function before a full frame has
> > > > > > elapsed since we sampled the last timestamp we would end up
> > > > > > with a possibly slightly different timestamp value for the
> > > > > > same frame. Currently we will happily overwrite the already
> > > > > > stored timestamp for the frame with the new value. This
> > > > > > could cause userspace to observe two different timestamps
> > > > > > for the same frame (and the timestamp could even go
> > > > > > backwards depending on how much error we introduce when
> > > > > > correcting the timestamp based on the scanout position).
> > > > > > 
> > > > > > To avoid that let's not update the stored timestamp unless we're
> > > > > > also incrementing the sequence counter. We do still want to update
> > > > > > vblank->last with the freshly sampled hw frame counter value so
> > > > > > that subsequent vblank irqs/queries can actually use the hw frame
> > > > > > counter to determine how many frames have elapsed.
> > > > > 
> > > > > Hm I'm not getting the reason for why we store the updated hw vblank
> > > > > counter?
> > > > 
> > > > Because next time a vblank irq happens the code will do:
> > > > diff = current_hw_counter - vblank->last
> > > > 
> > > > which won't work very well if vblank->last is garbage.
> > > > 
> > > > Updating vblank->last is pretty much why drm_vblank_restore()
> > > > exists at all.
> > > 
> > > Oh sure, _restore has to update this, together with the timestamp.
> > > 
> > > But your code adds such an update where we update the hw vblank counter,
> > > but not the timestamp, and that feels buggy. Either we're still in the
> > > > same frame, and then we should store nothing. Or we advanced, and then we
> > > > probably want a new timestamp for that frame too.
> > 
> > Even if we're still in the same frame the hw frame counter may already
> > have been reset due to the power well having been turned off. That is
> > what I'm trying to fix here.
> > 
> > Now I suppose that's fairly unlikely, at least with PSR which probably
> > does impose some extra delays before the power gets yanked. But at least
> > theoretically possible.
> 
> Pondering about this a bit further. I think the fact that the current
> code takes the round-to-closest approach I used for the vblank handler
> is perhaps a bit bad. It could push the seq counter forward if we're
> past the halfway point of a frame. I think that rounding behaviour
> makes sense for the irq since those tick steadily and so allowing a bit
> of error either way seems correct to me. Perhaps round-down might be
> > the better option for _restore(). Not quite sure, need more thinking
> probably.

Yes this is the rounding I'm worried about.

But your point above that the hw might reset the counter again is also
valid. I'm assuming what you're worried about is that we first do a
_restore (and the hw vblank counter hasn't been trashed yet), and then in
the same frame we do another restore, but now the hw frame counter has
been trashed, and we need to update it?

> Another idea that came to me now is that maybe we should actually just
> check if the current hw frame counter value looks sane, as in something
> like:
> 
> diff_hw_counter = current_hw_counter-stored_hw_counter
> diff_ts = (current_ts-stored_ts)/framedur
> 
> if (diff_hw_counter ~= diff_ts)
> 	diff = diff_hw_counter;
> else
> 	diff = diff_ts;
> 
> and if they seem to match then just keep trusting the hw counter.
> So only if there's a significant difference would we disregard
> the diff of the hw counter and instead use the diff based on the
> timestamps. Not sure what "significant" is though; One frame, two
> frames?

Hm, another idea: The only point where we can trust the entire hw counter
+ timestamp sampling is when the irq happens. Because then we know the
driver will have properly corrected for any hw oddities (like hw counter
flipping not at top-of-frame, like the core expects).

So what if _restore always goes back to the last such trusted hw counter
for computing the frame counter diff and all that stuff? That way if we
have a bunch of _restore with inconsistent hw vblank counter, we will a)
only take the last one (fixes the bug you're trying to fix) b) still use
the same last trusted baseline for computations (addresses the race I'm
seeing).

Or does this not work?

It does complicate the code a bit, because we'd need to store the
count/timestamp information from _restore outside of the usual vblank ts
array. But I think that addresses everything.
-Daniel
Ville Syrjälä Feb. 8, 2021, 4:58 p.m. UTC | #7
On Mon, Feb 08, 2021 at 10:56:36AM +0100, Daniel Vetter wrote:
> On Fri, Feb 05, 2021 at 11:19:19PM +0200, Ville Syrjälä wrote:
> > On Fri, Feb 05, 2021 at 06:24:08PM +0200, Ville Syrjälä wrote:
> > > On Fri, Feb 05, 2021 at 04:46:27PM +0100, Daniel Vetter wrote:
> > > > On Thu, Feb 04, 2021 at 05:55:28PM +0200, Ville Syrjälä wrote:
> > > > > On Thu, Feb 04, 2021 at 04:32:16PM +0100, Daniel Vetter wrote:
> > > > > > On Thu, Feb 04, 2021 at 04:04:00AM +0200, Ville Syrjala wrote:
> > > > > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > > > > 
> > > > > > > drm_vblank_restore() exists because certain power saving states
> > > > > > > can clobber the hardware frame counter. The way it does this is
> > > > > > > by guesstimating how many frames were missed purely based on
> > > > > > > the difference between the last stored timestamp vs. a newly
> > > > > > > sampled timestamp.
> > > > > > > 
> > > > > > > If we should call this function before a full frame has
> > > > > > > elapsed since we sampled the last timestamp we would end up
> > > > > > > with a possibly slightly different timestamp value for the
> > > > > > > same frame. Currently we will happily overwrite the already
> > > > > > > stored timestamp for the frame with the new value. This
> > > > > > > could cause userspace to observe two different timestamps
> > > > > > > for the same frame (and the timestamp could even go
> > > > > > > backwards depending on how much error we introduce when
> > > > > > > correcting the timestamp based on the scanout position).
> > > > > > > 
> > > > > > > To avoid that let's not update the stored timestamp unless we're
> > > > > > > also incrementing the sequence counter. We do still want to update
> > > > > > > vblank->last with the freshly sampled hw frame counter value so
> > > > > > > that subsequent vblank irqs/queries can actually use the hw frame
> > > > > > > counter to determine how many frames have elapsed.
> > > > > > 
> > > > > > Hm I'm not getting the reason for why we store the updated hw vblank
> > > > > > counter?
> > > > > 
> > > > > Because next time a vblank irq happens the code will do:
> > > > > diff = current_hw_counter - vblank->last
> > > > > 
> > > > > which won't work very well if vblank->last is garbage.
> > > > > 
> > > > > Updating vblank->last is pretty much why drm_vblank_restore()
> > > > > exists at all.
> > > > 
> > > > Oh sure, _restore has to update this, together with the timestamp.
> > > > 
> > > > But your code adds such an update where we update the hw vblank counter,
> > > > but not the timestamp, and that feels buggy. Either we're still in the
> > > > > same frame, and then we should store nothing. Or we advanced, and then we
> > > > > probably want a new timestamp for that frame too.
> > > 
> > > Even if we're still in the same frame the hw frame counter may already
> > > have been reset due to the power well having been turned off. That is
> > > what I'm trying to fix here.
> > > 
> > > Now I suppose that's fairly unlikely, at least with PSR which probably
> > > does impose some extra delays before the power gets yanked. But at least
> > > theoretically possible.
> > 
> > Pondering about this a bit further. I think the fact that the current
> > code takes the round-to-closest approach I used for the vblank handler
> > is perhaps a bit bad. It could push the seq counter forward if we're
> > past the halfway point of a frame. I think that rounding behaviour
> > makes sense for the irq since those tick steadily and so allowing a bit
> > of error either way seems correct to me. Perhaps round-down might be
> > > the better option for _restore(). Not quite sure, need more thinking
> > probably.
> 
> Yes this is the rounding I'm worried about.

Actually I don't think this is really an issue since we are working 
with the corrected timestamps here. Those always line up with
frames, so unless the correction is really buggy or the hw somehow
skips a partial frame it should work rather well. At least when
operating with small timescales. For large gaps the error might
creep up, but I don't think a small error in the predicted seq
number over a long timespan is really a problem.

> 
> But your point above that the hw might reset the counter again is also
> valid. I'm assuming what you're worried about is that we first do a
> _restore (and the hw vblank counter hasn't been trashed yet), and then in
> the same frame we do another restore, but now the hw frame counter has
> been trashed, and we need to update it?

Yeah, although the pre-trashing _restore could also just be
a vblank irq I think.

> 
> > Another idea that came to me now is that maybe we should actually just
> > check if the current hw frame counter value looks sane, as in something
> > like:
> > 
> > diff_hw_counter = current_hw_counter-stored_hw_counter
> > diff_ts = (current_ts-stored_ts)/framedur
> > 
> > if (diff_hw_counter ~= diff_ts)
> > 	diff = diff_hw_counter;
> > else
> > 	diff = diff_ts;
> > 
> > and if they seem to match then just keep trusting the hw counter.
> > So only if there's a significant difference would we disregard
> > the diff of the hw counter and instead use the diff based on the
> > timestamps. Not sure what "significant" is though; One frame, two
> > frames?
> 
> Hm, another idea: The only point where we can trust the entire hw counter
> + timestamp sampling is when the irq happens. Because then we know the
> driver will have properly corrected for any hw oddities (like hw counter
> flipping not at top-of-frame, like the core expects).

i915 at least gives out correct data regardless of when you sample
it. Well, except for the cases where the hw counter gets trashed,
in which case the hw counter is garbage (when compared with .last)
but the timestamp is still correct.

> 
> So what if _restore always goes back to the last such trusted hw counter
> for computing the frame counter diff and all that stuff? That way if we
> have a bunch of _restore with incosisten hw vblank counter, we will a)
> only take the last one (fixes the bug you're trying to fix) b) still use
> the same last trusted baseline for computations (addresses the race I'm
> seeing).
> 
> Or does this not work?

I don't think I really understand what you're suggesting here.
_restore is already using the last trusted data (the stored
timestamp + .last).

So the one thing _restore will have to update is .last.
I think it can either do what it does now and set .last
to the current hw counter value + update the timestamp
to match, or it could perhaps adjust the stored .last
such that the already stored timestamp and the updated
.last match up. But I think both of those options have
the same level of inaccuracy since both would still do
the same ts_diff->hw_counter_diff prediction.

> 
> It does complicate the code a bit, because we'd need to store the
> count/timestamp information from _restore outside of the usual vblank ts
> array. But I think that addresses everything.

Hmm. So restore would store this extra information
somewhere else, and not update the normal stuff at all?
What exactly would we do with that extra data?
Daniel Vetter Feb. 8, 2021, 5:43 p.m. UTC | #8
On Mon, Feb 8, 2021 at 5:58 PM Ville Syrjälä
<ville.syrjala@linux.intel.com> wrote:
>
> On Mon, Feb 08, 2021 at 10:56:36AM +0100, Daniel Vetter wrote:
> > On Fri, Feb 05, 2021 at 11:19:19PM +0200, Ville Syrjälä wrote:
> > > On Fri, Feb 05, 2021 at 06:24:08PM +0200, Ville Syrjälä wrote:
> > > > On Fri, Feb 05, 2021 at 04:46:27PM +0100, Daniel Vetter wrote:
> > > > > On Thu, Feb 04, 2021 at 05:55:28PM +0200, Ville Syrjälä wrote:
> > > > > > On Thu, Feb 04, 2021 at 04:32:16PM +0100, Daniel Vetter wrote:
> > > > > > > On Thu, Feb 04, 2021 at 04:04:00AM +0200, Ville Syrjala wrote:
> > > > > > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > > > > >
> > > > > > > > drm_vblank_restore() exists because certain power saving states
> > > > > > > > can clobber the hardware frame counter. The way it does this is
> > > > > > > > by guesstimating how many frames were missed purely based on
> > > > > > > > the difference between the last stored timestamp vs. a newly
> > > > > > > > sampled timestamp.
> > > > > > > >
> > > > > > > > If we should call this function before a full frame has
> > > > > > > > elapsed since we sampled the last timestamp we would end up
> > > > > > > > with a possibly slightly different timestamp value for the
> > > > > > > > same frame. Currently we will happily overwrite the already
> > > > > > > > stored timestamp for the frame with the new value. This
> > > > > > > > could cause userspace to observe two different timestamps
> > > > > > > > for the same frame (and the timestamp could even go
> > > > > > > > backwards depending on how much error we introduce when
> > > > > > > > correcting the timestamp based on the scanout position).
> > > > > > > >
> > > > > > > > To avoid that let's not update the stored timestamp unless we're
> > > > > > > > also incrementing the sequence counter. We do still want to update
> > > > > > > > vblank->last with the freshly sampled hw frame counter value so
> > > > > > > > that subsequent vblank irqs/queries can actually use the hw frame
> > > > > > > > counter to determine how many frames have elapsed.
> > > > > > >
> > > > > > > Hm I'm not getting the reason for why we store the updated hw vblank
> > > > > > > counter?
> > > > > >
> > > > > > Because next time a vblank irq happens the code will do:
> > > > > > diff = current_hw_counter - vblank->last
> > > > > >
> > > > > > which won't work very well if vblank->last is garbage.
> > > > > >
> > > > > > Updating vblank->last is pretty much why drm_vblank_restore()
> > > > > > exists at all.
> > > > >
> > > > > Oh sure, _restore has to update this, together with the timestamp.
> > > > >
> > > > > But your code adds such an update where we update the hw vblank counter,
> > > > > but not the timestamp, and that feels buggy. Either we're still in the
> > > > > same frame, and then we should story nothing. Or we advanced, and then we
> > > > > probably want a new timestampt for that frame too.
> > > >
> > > > Even if we're still in the same frame the hw frame counter may already
> > > > have been reset due to the power well having been turned off. That is
> > > > what I'm trying to fix here.
> > > >
> > > > Now I suppose that's fairly unlikely, at least with PSR which probably
> > > > does impose some extra delays before the power gets yanked. But at least
> > > > theoretically possible.
> > >
> > > Pondering about this a bit further. I think the fact that the current
> > > code takes the round-to-closest approach I used for the vblank handler
> > > is perhaps a bit bad. It could push the seq counter forward if we're
> > > past the halfway point of a frame. I think that rounding behaviour
> > > makes sense for the irq since those tick steadily and so allowing a bit
> > > of error either way seems correct to me. Perhaps round-down might be
> > > the better option for _restore(). Not quites sure, need more thinking
> > > probably.
> >
> > Yes this is the rounding I'm worried about.
>
> Actually I don't think this is really an issue since we are working
> with the corrected timestamps here. Those always line up with
> frames, so unless the correction is really buggy or the hw somehow
> skips a partial frame it should work rather well. At least when
> operating with small timescales. For large gaps the error might
> creep up, but I don't think a small error in the predicted seq
> number over a long timespan is really a problem.

That corrected timestamp is what can go wrong I think: There's no
guarantee that drm_crtc_vblank_helper_get_vblank_timestamp_internal()
flips to top-of-frame at exactly the same time as the hw vblank
counter flips. Or at least I'm not seeing where we correct them both
together.

> > But your point above that the hw might reset the counter again is also
> > valid. I'm assuming what you're worried about is that we first do a
> > _restore (and the hw vblank counter hasn't been trashed yet), and then in
> > the same frame we do another restore, but now the hw frame counter has
> > been trashe, and we need to update it?
>
> Yeah, although the pre-trashing _restore could also just be
> a vblank irq I think.
>
> >
> > > Another idea that came to me now is that maybe we should actually just
> > > check if the current hw frame counter value looks sane, as in something
> > > like:
> > >
> > > diff_hw_counter = current_hw_counter-stored_hw_counter
> > > diff_ts = (current_ts-stored_ts)/framedur
> > >
> > > if (diff_hw_counter ~= diff_ts)
> > >     diff = diff_hw_counter;
> > > else
> > >     diff = diff_ts;
> > >
> > > and if they seem to match then just keep trusting the hw counter.
> > > So only if there's a significant difference would we disregard
> > > the diff of the hw counter and instead use the diff based on the
> > > timestamps. Not sure what "significant" is though; One frame, two
> > > frames?
> >
> > Hm, another idea: The only point where we can trust the entire hw counter
> > + timestamp sampling is when the irq happens. Because then we know the
> > driver will have properly corrected for any hw oddities (like hw counter
> > flipping not at top-of-frame, like the core expects).
>
> i915 at least gives out correct data regardless of when you sample
> it. Well, except for the cases where the hw counter gets trashed,
> in which case the hw counter is garbage (when compared with .last)
> but the timestamp is still correct.

Hm where/how do we handle this? Maybe I'm just out of date with how it
all works nowadays.

> > So what if _restore always goes back to the last such trusted hw counter
> > for computing the frame counter diff and all that stuff? That way if we
> > have a bunch of _restore with incosisten hw vblank counter, we will a)
> > only take the last one (fixes the bug you're trying to fix) b) still use
> > the same last trusted baseline for computations (addresses the race I'm
> > seeing).
> >
> > Or does this not work?
>
> I don't think I really understand what you're suggesting here.
> _restore is already using the last trusted data (the stored
> timestamp + .last).
>
> So the one thing _restore will have to update is .last.
> I think it can either do what it does now and set .last
> to the current hw counter value + update the timestamp
> to match, or it could perhaps adjust the stored .last
> such that the already stored timestamp and the updated
> .last match up. But I think both of those options have
> the same level or inaccuracy since both would still do
> the same ts_diff->hw_counter_diff prediction.
>
> >
> > It does complicate the code a bit, because we'd need to store the
> > count/timestamp information from _restore outside of the usual vblank ts
> > array. But I think that addresses everything.
>
> Hmm. So restore would store this extra information
> somewhere else, and not update the normal stuff at all?
> What exactly would we do with that extra data?

Hm I guess I didn't think this through. But the idea I had was:
- _restore always recomputes back from the last
  drm_crtc_handle_vblank-stored timestamp.
- the first drm_crtc_handle_vblank bakes in any corrections that
  _restore has prepared meanwhile
- same applies to all the sampling functions where we might look at the
  latest timestamps/counter values.
-Daniel
Ville Syrjälä Feb. 8, 2021, 6:05 p.m. UTC | #9
On Mon, Feb 08, 2021 at 06:43:53PM +0100, Daniel Vetter wrote:
> On Mon, Feb 8, 2021 at 5:58 PM Ville Syrjälä
> <ville.syrjala@linux.intel.com> wrote:
> >
> > On Mon, Feb 08, 2021 at 10:56:36AM +0100, Daniel Vetter wrote:
> > > On Fri, Feb 05, 2021 at 11:19:19PM +0200, Ville Syrjälä wrote:
> > > > On Fri, Feb 05, 2021 at 06:24:08PM +0200, Ville Syrjälä wrote:
> > > > > On Fri, Feb 05, 2021 at 04:46:27PM +0100, Daniel Vetter wrote:
> > > > > > On Thu, Feb 04, 2021 at 05:55:28PM +0200, Ville Syrjälä wrote:
> > > > > > > On Thu, Feb 04, 2021 at 04:32:16PM +0100, Daniel Vetter wrote:
> > > > > > > > On Thu, Feb 04, 2021 at 04:04:00AM +0200, Ville Syrjala wrote:
> > > > > > > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > > > > > >
> > > > > > > > > drm_vblank_restore() exists because certain power saving states
> > > > > > > > > can clobber the hardware frame counter. The way it does this is
> > > > > > > > > by guesstimating how many frames were missed purely based on
> > > > > > > > > the difference between the last stored timestamp vs. a newly
> > > > > > > > > sampled timestamp.
> > > > > > > > >
> > > > > > > > > If we should call this function before a full frame has
> > > > > > > > > elapsed since we sampled the last timestamp we would end up
> > > > > > > > > with a possibly slightly different timestamp value for the
> > > > > > > > > same frame. Currently we will happily overwrite the already
> > > > > > > > > stored timestamp for the frame with the new value. This
> > > > > > > > > could cause userspace to observe two different timestamps
> > > > > > > > > for the same frame (and the timestamp could even go
> > > > > > > > > backwards depending on how much error we introduce when
> > > > > > > > > correcting the timestamp based on the scanout position).
> > > > > > > > >
> > > > > > > > > To avoid that let's not update the stored timestamp unless we're
> > > > > > > > > also incrementing the sequence counter. We do still want to update
> > > > > > > > > vblank->last with the freshly sampled hw frame counter value so
> > > > > > > > > that subsequent vblank irqs/queries can actually use the hw frame
> > > > > > > > > counter to determine how many frames have elapsed.
> > > > > > > >
> > > > > > > > Hm I'm not getting the reason for why we store the updated hw vblank
> > > > > > > > counter?
> > > > > > >
> > > > > > > Because next time a vblank irq happens the code will do:
> > > > > > > diff = current_hw_counter - vblank->last
> > > > > > >
> > > > > > > which won't work very well if vblank->last is garbage.
> > > > > > >
> > > > > > > Updating vblank->last is pretty much why drm_vblank_restore()
> > > > > > > exists at all.
> > > > > >
> > > > > > Oh sure, _restore has to update this, together with the timestamp.
> > > > > >
> > > > > > But your code adds such an update where we update the hw vblank counter,
> > > > > > but not the timestamp, and that feels buggy. Either we're still in the
> > > > > > same frame, and then we should story nothing. Or we advanced, and then we
> > > > > > probably want a new timestampt for that frame too.
> > > > >
> > > > > Even if we're still in the same frame the hw frame counter may already
> > > > > have been reset due to the power well having been turned off. That is
> > > > > what I'm trying to fix here.
> > > > >
> > > > > Now I suppose that's fairly unlikely, at least with PSR which probably
> > > > > does impose some extra delays before the power gets yanked. But at least
> > > > > theoretically possible.
> > > >
> > > > Pondering about this a bit further. I think the fact that the current
> > > > code takes the round-to-closest approach I used for the vblank handler
> > > > is perhaps a bit bad. It could push the seq counter forward if we're
> > > > past the halfway point of a frame. I think that rounding behaviour
> > > > makes sense for the irq since those tick steadily and so allowing a bit
> > > > of error either way seems correct to me. Perhaps round-down might be
> > > > the better option for _restore(). Not quites sure, need more thinking
> > > > probably.
> > >
> > > Yes this is the rounding I'm worried about.
> >
> > Actually I don't think this is really an issue since we are working
> > with the corrected timestamps here. Those always line up with
> > frames, so unless the correction is really buggy or the hw somehow
> > skips a partial frame it should work rather well. At least when
> > operating with small timescales. For large gaps the error might
> > creep up, but I don't think a small error in the predicted seq
> > number over a long timespan is really a problem.
> 
> That corrected timestamp is what can go wrong I think: There's no
> guarantee that drm_crtc_vblank_helper_get_vblank_timestamp_internal()
> flips to top-of-frame at the exact same time than when the hw vblank
> counter flips. Or at least I'm not seeing where we correct them both
> together.

We do this seqlock type of thing:

	do {
		cur_vblank = __get_vblank_counter(dev, pipe);
		rc = drm_get_last_vbltimestamp(dev, pipe, &t_vblank, in_vblank_irq);
	} while (cur_vblank != __get_vblank_counter(dev, pipe) && --count > 0);

which guarantees the timestamp really is for the frame we think it is for.
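
As a rough userspace model of the loop above, the sketch below
replaces the hardware with a fake that can flip the counter in the
middle of a sample. struct fake_hw, read_counter(), read_timestamp()
and sample_pair() are all made up for illustration, and the retry
budget of 3 is an arbitrary choice.

```c
#include <stdint.h>

/* Hypothetical stand-in for the hardware, not a kernel structure. */
struct fake_hw {
	uint32_t counter;	/* hw frame counter */
	int64_t vblank_ns;	/* timestamp of the most recent vblank */
	int flips_left;		/* frame boundaries that will race with a sample */
};

static uint32_t read_counter(struct fake_hw *hw)
{
	return hw->counter;
}

static int64_t read_timestamp(struct fake_hw *hw, int64_t framedur_ns)
{
	int64_t ts = hw->vblank_ns;

	/* Simulate a frame boundary hitting right after the timestamp read. */
	if (hw->flips_left > 0) {
		hw->flips_left--;
		hw->counter++;
		hw->vblank_ns += framedur_ns;
	}
	return ts;
}

/*
 * Sample counter and timestamp until the counter is unchanged across
 * the timestamp read. Returns 1 if the pair is consistent, 0 if the
 * retry budget ran out first.
 */
static int sample_pair(struct fake_hw *hw, int64_t framedur_ns,
		       uint32_t *cur, int64_t *ts)
{
	int count = 3;

	do {
		*cur = read_counter(hw);
		*ts = read_timestamp(hw, framedur_ns);
	} while (*cur != read_counter(hw) && --count > 0);

	return *cur == read_counter(hw);
}
```

With a single racing flip, the first pass pairs the old counter with
the old timestamp, notices that the counter moved, and the second pass
returns the new counter together with the new timestamp.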

> 
> > > But your point above that the hw might reset the counter again is also
> > > valid. I'm assuming what you're worried about is that we first do a
> > > _restore (and the hw vblank counter hasn't been trashed yet), and then in
> > > the same frame we do another restore, but now the hw frame counter has
> > > been trashe, and we need to update it?
> >
> > Yeah, although the pre-trashing _restore could also just be
> > a vblank irq I think.
> >
> > >
> > > > Another idea that came to me now is that maybe we should actually just
> > > > check if the current hw frame counter value looks sane, as in something
> > > > like:
> > > >
> > > > diff_hw_counter = current_hw_counter-stored_hw_counter
> > > > diff_ts = (current_ts-stored_ts)/framedur
> > > >
> > > > if (diff_hw_counter ~= diff_ts)
> > > >     diff = diff_hw_counter;
> > > > else
> > > >     diff = diff_ts;
> > > >
> > > > and if they seem to match then just keep trusting the hw counter.
> > > > So only if there's a significant difference would we disregard
> > > > the diff of the hw counter and instead use the diff based on the
> > > > timestamps. Not sure what "significant" is though; One frame, two
> > > > frames?
> > >
> > > Hm, another idea: The only point where we can trust the entire hw counter
> > > + timestamp sampling is when the irq happens. Because then we know the
> > > driver will have properly corrected for any hw oddities (like hw counter
> > > flipping not at top-of-frame, like the core expects).
> >
> > i915 at least gives out correct data regardless of when you sample
> > it. Well, except for the cases where the hw counter gets trashed,
> > in which case the hw counter is garbage (when compared with .last)
> > but the timestamp is still correct.
> 
> Hm where/how do we handle this? Maybe I'm just out of date with how it
> all works nowadays.

There's not much to handle. We know when exactly the counters increment and
thus can give out the correct answer to the question "which frame is this?".

> 
> > > So what if _restore always goes back to the last such trusted hw counter
> > > for computing the frame counter diff and all that stuff? That way if we
> > > have a bunch of _restore with incosisten hw vblank counter, we will a)
> > > only take the last one (fixes the bug you're trying to fix) b) still use
> > > the same last trusted baseline for computations (addresses the race I'm
> > > seeing).
> > >
> > > Or does this not work?
> >
> > I don't think I really understand what you're suggesting here.
> > _restore is already using the last trusted data (the stored
> > timestamp + .last).
> >
> > So the one thing _restore will have to update is .last.
> > I think it can either do what it does now and set .last
> > to the current hw counter value + update the timestamp
> > to match, or it could perhaps adjust the stored .last
> > such that the already stored timestamp and the updated
> > .last match up. But I think both of those options have
> > the same level or inaccuracy since both would still do
> > the same ts_diff->hw_counter_diff prediction.
> >
> > >
> > > It does complicate the code a bit, because we'd need to store the
> > > count/timestamp information from _restore outside of the usual vblank ts
> > > array. But I think that addresses everything.
> >
> > Hmm. So restore would store this extra information
> > somewhere else, and not update the normal stuff at all?
> > What exactly would we do with that extra data?
> 
> Hm I guess I didn't think this through. But the idea I had was:
> - _restore always recomputes back from the las
> drm_crtc_handl_vblank-stored timestamp.
> - the first drm_crtc_handle_vblank bakes in any corrections that
> _restore has prepared meanwhile
> - same applies to all the sampling functions we might look at lastes
> timestamps/counter values.

So I guess instead of _restore adjusting .last we would instead
maintain separate correction information and apply it when
doing the diff between the current hw counter vs. .last. Not sure
why that would be particularly better than just adjusting .last
directly.
Daniel Vetter Feb. 9, 2021, 10:07 a.m. UTC | #10
On Thu, Feb 04, 2021 at 04:04:00AM +0200, Ville Syrjala wrote:
> From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> 
> drm_vblank_restore() exists because certain power saving states
> can clobber the hardware frame counter. The way it does this is
> by guesstimating how many frames were missed purely based on
> the difference between the last stored timestamp vs. a newly
> sampled timestamp.
> 
> If we should call this function before a full frame has
> elapsed since we sampled the last timestamp we would end up
> with a possibly slightly different timestamp value for the
> same frame. Currently we will happily overwrite the already
> stored timestamp for the frame with the new value. This
> could cause userspace to observe two different timestamps
> for the same frame (and the timestamp could even go
> backwards depending on how much error we introduce when
> correcting the timestamp based on the scanout position).
> 
> To avoid that let's not update the stored timestamp unless we're
> also incrementing the sequence counter. We do still want to update
> vblank->last with the freshly sampled hw frame counter value so
> that subsequent vblank irqs/queries can actually use the hw frame
> counter to determine how many frames have elapsed.
> 
> Cc: Dhinakaran Pandiyan <dhinakaran.pandiyan@intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>

Ok, top-posting because lol I got confused. I mixed up the guesstimation
work we do for when we don't have a vblank counter with the precise vblank
timestamp stuff.

I think it'd still be good to maybe lock down/document a bit better the
requirements for drm_crtc_vblank_restore, but I convinced myself now that
your patch looks correct.

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

> ---
>  drivers/gpu/drm/drm_vblank.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
> index 893165eeddf3..e127a7db2088 100644
> --- a/drivers/gpu/drm/drm_vblank.c
> +++ b/drivers/gpu/drm/drm_vblank.c
> @@ -176,6 +176,17 @@ static void store_vblank(struct drm_device *dev, unsigned int pipe,
>  
>  	vblank->last = last;
>  
> +	/*
> +	 * drm_vblank_restore() wants to always update
> +	 * vblank->last since we can't trust the frame counter
> +	 * across power saving states. But we don't want to alter
> +	 * the stored timestamp for the same frame number since
> +	 * that would cause userspace to potentially observe two
> +	 * different timestamps for the same frame.
> +	 */
> +	if (vblank_count_inc == 0)
> +		return;
> +
>  	write_seqlock(&vblank->seqlock);
>  	vblank->time = t_vblank;
>  	atomic64_add(vblank_count_inc, &vblank->count);
> -- 
> 2.26.2
>
Ville Syrjälä Feb. 9, 2021, 3:40 p.m. UTC | #11
On Tue, Feb 09, 2021 at 11:07:53AM +0100, Daniel Vetter wrote:
> On Thu, Feb 04, 2021 at 04:04:00AM +0200, Ville Syrjala wrote:
> > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > 
> > drm_vblank_restore() exists because certain power saving states
> > can clobber the hardware frame counter. The way it does this is
> > by guesstimating how many frames were missed purely based on
> > the difference between the last stored timestamp vs. a newly
> > sampled timestamp.
> > 
> > If we should call this function before a full frame has
> > elapsed since we sampled the last timestamp we would end up
> > with a possibly slightly different timestamp value for the
> > same frame. Currently we will happily overwrite the already
> > stored timestamp for the frame with the new value. This
> > could cause userspace to observe two different timestamps
> > for the same frame (and the timestamp could even go
> > backwards depending on how much error we introduce when
> > correcting the timestamp based on the scanout position).
> > 
> > To avoid that let's not update the stored timestamp unless we're
> > also incrementing the sequence counter. We do still want to update
> > vblank->last with the freshly sampled hw frame counter value so
> > that subsequent vblank irqs/queries can actually use the hw frame
> > counter to determine how many frames have elapsed.
> > 
> > Cc: Dhinakaran Pandiyan <dhinakaran.pandiyan@intel.com>
> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> 
> Ok, top-posting because lol I got confused. I mixed up the guesstimation
> work we do for when we don't have a vblank counter with the precise vblank
> timestamp stuff.
> 
> I think it'd still be good to maybe lock down/document a bit better the
> requirements for drm_crtc_vblank_restore, but I convinced myself now that
> your patch looks correct.
> 
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Ta.

Though I wonder if we should just do something like this instead:
-       store_vblank(dev, pipe, diff, t_vblank, cur_vblank);
+       vblank->last = (cur_vblank - diff) & max_vblank_count;

to make it entirely obvious that this exists only to fix up
the stored hw counter value?

Would also avoid the problem the original patch tries to fix
because we'd simply never store a new timestamp here.
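
A small standalone sketch of the arithmetic, to show why the masked
subtraction survives the counter having been reset. fixup_last() and
hw_diff() are made-up names, and the sketch assumes max_vblank_count
is of the form 2^n - 1 so that the mask acts as a modulo.

```c
#include <stdint.h>

/*
 * Hypothetical helper: place the stored hw counter 'diff' frames behind
 * the freshly sampled one, modulo the counter width.
 */
static uint32_t fixup_last(uint32_t cur_vblank, uint32_t diff,
			   uint32_t max_vblank_count)
{
	return (cur_vblank - diff) & max_vblank_count;
}

/* What a later vblank irq/query would compute from the stored value. */
static uint32_t hw_diff(uint32_t cur_vblank, uint32_t last,
			uint32_t max_vblank_count)
{
	return (cur_vblank - last) & max_vblank_count;
}
```

For a 24-bit counter that was reset and now reads 2, with 5 frames
predicted from the timestamps, the stored value becomes 0xfffffd. A
query at hw counter 2 then sees a diff of 5, and the next irq at 3
sees 6, so the frames missed while the counter was trashed get added
to the sequence count at that point.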

> 
> > ---
> >  drivers/gpu/drm/drm_vblank.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
> > index 893165eeddf3..e127a7db2088 100644
> > --- a/drivers/gpu/drm/drm_vblank.c
> > +++ b/drivers/gpu/drm/drm_vblank.c
> > @@ -176,6 +176,17 @@ static void store_vblank(struct drm_device *dev, unsigned int pipe,
> >  
> >  	vblank->last = last;
> >  
> > +	/*
> > +	 * drm_vblank_restore() wants to always update
> > +	 * vblank->last since we can't trust the frame counter
> > +	 * across power saving states. But we don't want to alter
> > +	 * the stored timestamp for the same frame number since
> > +	 * that would cause userspace to potentially observe two
> > +	 * different timestamps for the same frame.
> > +	 */
> > +	if (vblank_count_inc == 0)
> > +		return;
> > +
> >  	write_seqlock(&vblank->seqlock);
> >  	vblank->time = t_vblank;
> >  	atomic64_add(vblank_count_inc, &vblank->count);
> > -- 
> > 2.26.2
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Daniel Vetter Feb. 9, 2021, 4:44 p.m. UTC | #12
On Tue, Feb 9, 2021 at 4:41 PM Ville Syrjälä
<ville.syrjala@linux.intel.com> wrote:
> On Tue, Feb 09, 2021 at 11:07:53AM +0100, Daniel Vetter wrote:
> > On Thu, Feb 04, 2021 at 04:04:00AM +0200, Ville Syrjala wrote:
> > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > >
> > > drm_vblank_restore() exists because certain power saving states
> > > can clobber the hardware frame counter. The way it does this is
> > > by guesstimating how many frames were missed purely based on
> > > the difference between the last stored timestamp vs. a newly
> > > sampled timestamp.
> > >
> > > If we should call this function before a full frame has
> > > elapsed since we sampled the last timestamp we would end up
> > > with a possibly slightly different timestamp value for the
> > > same frame. Currently we will happily overwrite the already
> > > stored timestamp for the frame with the new value. This
> > > could cause userspace to observe two different timestamps
> > > for the same frame (and the timestamp could even go
> > > backwards depending on how much error we introduce when
> > > correcting the timestamp based on the scanout position).
> > >
> > > To avoid that let's not update the stored timestamp unless we're
> > > also incrementing the sequence counter. We do still want to update
> > > vblank->last with the freshly sampled hw frame counter value so
> > > that subsequent vblank irqs/queries can actually use the hw frame
> > > counter to determine how many frames have elapsed.
> > >
> > > Cc: Dhinakaran Pandiyan <dhinakaran.pandiyan@intel.com>
> > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > > Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> >
> > Ok, top-posting because lol I got confused. I mixed up the guesstimation
> > work we do for when we don't have a vblank counter with the precise vblank
> > timestamp stuff.
> >
> > I think it'd still be good to maybe lock down/document a bit better the
> > requirements for drm_crtc_vblank_restore, but I convinced myself now that
> > your patch looks correct.
> >
> > Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>
> Ta.
>
> Though I wonder if we should just do something like this instead:
> -       store_vblank(dev, pipe, diff, t_vblank, cur_vblank);
> +       vblank->last = (cur_vblank - diff) & max_vblank_count;
>
> to make it entirely obvious that this exists only to fix up
> the stored hw counter value?
>
> Would also avoid the problem the original patch tries to fix
> because we'd simply never store a new timestamp here.

Hm yeah, I think that would nicely limit the impact. But need to check
overflow/underflow math is all correct. And I think that would neatly
implement the trick I proposed to address the bug that wasn't there
:-)

The only thing that I've thought of as an issue is that we might have
more wrap-around of the hw vblank counter, but that shouldn't be worse
than without this - anytime we have the vblank on for long enough we
fix the entire thing, and I think our wrap handling is now consistent
enough (there was some "let's just add a large bump" stuff for dri1
userspace iirc) that this shouldn't be any problem.

Plus the comment about _restore being very special would be in the
restore function, so this would also be rather tidy. If you go with
this maybe extend the kerneldoc for ->last to mention that
drm_vblank_restore() adjusts it?

The more I ponder this, the more I like it ... which probably means
I'm missing something, because this is drm_vblank.c?

Cheers, Daniel

>
> >
> > > ---
> > >  drivers/gpu/drm/drm_vblank.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
> > > index 893165eeddf3..e127a7db2088 100644
> > > --- a/drivers/gpu/drm/drm_vblank.c
> > > +++ b/drivers/gpu/drm/drm_vblank.c
> > > @@ -176,6 +176,17 @@ static void store_vblank(struct drm_device *dev, unsigned int pipe,
> > >
> > >     vblank->last = last;
> > >
> > > +   /*
> > > +    * drm_vblank_restore() wants to always update
> > > +    * vblank->last since we can't trust the frame counter
> > > +    * across power saving states. But we don't want to alter
> > > +    * the stored timestamp for the same frame number since
> > > +    * that would cause userspace to potentially observe two
> > > +    * different timestamps for the same frame.
> > > +    */
> > > +   if (vblank_count_inc == 0)
> > > +           return;
> > > +
> > >     write_seqlock(&vblank->seqlock);
> > >     vblank->time = t_vblank;
> > >     atomic64_add(vblank_count_inc, &vblank->count);
> > > --
> > > 2.26.2
> > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
>
> --
> Ville Syrjälä
> Intel

Patch

diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
index 893165eeddf3..e127a7db2088 100644
--- a/drivers/gpu/drm/drm_vblank.c
+++ b/drivers/gpu/drm/drm_vblank.c
@@ -176,6 +176,17 @@  static void store_vblank(struct drm_device *dev, unsigned int pipe,
 
 	vblank->last = last;
 
+	/*
+	 * drm_vblank_restore() wants to always update
+	 * vblank->last since we can't trust the frame counter
+	 * across power saving states. But we don't want to alter
+	 * the stored timestamp for the same frame number since
+	 * that would cause userspace to potentially observe two
+	 * different timestamps for the same frame.
+	 */
+	if (vblank_count_inc == 0)
+		return;
+
 	write_seqlock(&vblank->seqlock);
 	vblank->time = t_vblank;
 	atomic64_add(vblank_count_inc, &vblank->count);
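
The effect of the early return can be modelled in userspace with a
simplified stand-in for the vblank state. struct vblank_model and
store_vblank_model() are made up for illustration; locking and the
atomic counter are left out on purpose.

```c
#include <stdint.h>

/* Hypothetical, simplified model of the per-crtc vblank state. */
struct vblank_model {
	uint32_t last;		/* last sampled hw frame counter */
	int64_t time_ns;	/* timestamp stored for the current frame */
	uint64_t count;		/* software sequence counter */
};

static void store_vblank_model(struct vblank_model *v, uint32_t inc,
			       int64_t t_ns, uint32_t last)
{
	/* The hw counter baseline is always refreshed. */
	v->last = last;

	/* Same frame: keep the timestamp userspace may already have seen. */
	if (inc == 0)
		return;

	v->time_ns = t_ns;
	v->count += inc;
}
```

Calling it with an increment of 0 only changes last, so a slightly
different timestamp sampled for the same frame never replaces the
stored one.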