[RFC,v1,0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

Message ID	20210729081659.2255499-1-vivek.kasireddy@intel.com (mailing list archive)
Headers	show Return-Path: <SRS0=2kLe=MV=lists.freedesktop.org=dri-devel-bounces@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 24E9C60FD7 From: Vivek Kasireddy <vivek.kasireddy@intel.com> To: dri-devel@lists.freedesktop.org Subject: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability Date: Thu, 29 Jul 2021 01:16:55 -0700 Message-Id: <20210729081659.2255499-1-vivek.kasireddy@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: list Cc: Dongwon Kim <dongwon.kim@intel.com>, =?utf-8?q?Michel_D=C3=A4nzer?= <michel@daenzer.net>, Vivek Kasireddy <vivek.kasireddy@intel.com>, Tina Zhang <tina.zhang@intel.com>, Gerd Hoffmann <kraxel@redhat.com> Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability \| expand [RFC,v1,0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability [RFC,v1,1/4] drm: Add a capability flag to support deferred out_fence signalling [RFC,v1,2/4] virtio-gpu uapi: Add VIRTIO_GPU_F_OUT_FENCE feature [RFC,v1,3/4] drm/virtio: Add VIRTIO_GPU_CMD_RESOURCE_OUT_FENCE cmd [RFC,v1,4/4] drm/virtio: Probe and implement VIRTIO_GPU_F_OUT_FENCE feature

Vivek Kasireddy July 29, 2021, 8:16 a.m. UTC

By separating the OUT_FENCE signalling from pageflip completion allows
a Guest compositor to start a new repaint cycle with a new buffer
instead of waiting for the old buffer to be free. 

This work is based on the idea/suggestion from Simon and Pekka.

This capability can be a solution for this issue:
https://gitlab.freedesktop.org/wayland/weston/-/issues/514

Corresponding Weston MR:
https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668

Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Pekka Paalanen <ppaalanen@gmail.com>
Cc: Simon Ser <contact@emersion.fr>
Cc: Michel Dänzer <michel@daenzer.net>
Cc: Tina Zhang <tina.zhang@intel.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>

Vivek Kasireddy (4):
  drm: Add a capability flag to support deferred out_fence signalling
  virtio-gpu uapi: Add VIRTIO_GPU_F_OUT_FENCE feature
  drm/virtio: Add VIRTIO_GPU_CMD_RESOURCE_OUT_FENCE cmd
  drm/virtio: Probe and implement VIRTIO_GPU_F_OUT_FENCE feature

 drivers/gpu/drm/drm_file.c               | 11 +++---
 drivers/gpu/drm/drm_ioctl.c              |  3 ++
 drivers/gpu/drm/virtio/virtgpu_debugfs.c |  1 +
 drivers/gpu/drm/virtio/virtgpu_drv.c     |  1 +
 drivers/gpu/drm/virtio/virtgpu_drv.h     |  6 ++++
 drivers/gpu/drm/virtio/virtgpu_fence.c   |  9 +++++
 drivers/gpu/drm/virtio/virtgpu_kms.c     | 10 ++++--
 drivers/gpu/drm/virtio/virtgpu_plane.c   | 44 +++++++++++++++++++++++-
 drivers/gpu/drm/virtio/virtgpu_vq.c      | 17 +++++++++
 include/drm/drm_mode_config.h            |  9 +++++
 include/uapi/drm/drm.h                   |  1 +
 include/uapi/linux/virtio_gpu.h          | 12 +++++++
 12 files changed, 117 insertions(+), 7 deletions(-)

Daniel Vetter July 30, 2021, 10:25 a.m. UTC | #1

On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
> By separating the OUT_FENCE signalling from pageflip completion allows
> a Guest compositor to start a new repaint cycle with a new buffer
> instead of waiting for the old buffer to be free. 
> 
> This work is based on the idea/suggestion from Simon and Pekka.
> 
> This capability can be a solution for this issue:
> https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> 
> Corresponding Weston MR:
> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668

Uh I kinda wanted to discuss this a bit more before we jump into typing
code, but well I guess not that much work yet.

So maybe I'm not understanding the problem, but I think the fundamental
underlying issue is that with KMS you can have at most 2 buffers
in-flight, due to our queue depth limit of 1 pending flip.

Unfortunately that means for virtual hw where it takes a few more
steps/vblanks until the framebuffer actually shows up on screen and is
scanned out, we suffer deeply. The usual fix for that is to drop the
latency and increase throughput, and have more buffers in-flight. Which
this patch tries to do.

Now I think where we go wrong here is that we're trying to hack this up by
defining different semantics for the out-fence and for the drm-event. Imo
that's wrong, they're both meant to show eactly the same thing:
- when is the new frame actually visible to the user (as in, eyeballs in a
  human head, preferrably, not the time when we've handed the buffer off
  to the virtual hw)
- when is the previous buffer no longer being used by the scanout hw

We do cheat a bit right now in so far that we assume they're both the
same, as in, panel-side latency is currently the compositor's problem to
figure out.

So for virtual hw I think the timestamp and even completion really need to
happen only when the buffer has been pushed through the entire
virtualization chain, i.e. ideally we get the timestamp from the kms
driver from the host side. Currently that's not done, so this is most
likely quite broken already (virtio relies on the no-vblank auto event
sending, which definitely doesn't wait for anything, or I'm completely
missing something).

I think instead of hacking up some ill-defined 1.5 queue depth support,
what we should do is support queue depth > 1 properly. So:

- Change atomic to support queue depth > 1, this needs to be a per-driver
  thing due to a bunch of issues in driver code. Essentially drivers must
  never look at obj->state pointers, and only ever look up state through
  the passed-in drm_atomic_state * update container.

- Aside: virtio should loose all it's empty hooks, there's no point in
  that.

- We fix virtio to send out the completion event at the end of this entire
  pipeline, i.e. virtio code needs to take care of sending out the
  crtc_state->event correctly.

- We probably also want some kind of (maybe per-crtc) recommended queue
  depth property so compositors know how many buffers to keep in flight.
  Not sure about that.

It's a bit more work, but also a lot less hacking around infrastructure in
dubious ways.

Thoughts?

Cheers, Daniel

> 
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: Pekka Paalanen <ppaalanen@gmail.com>
> Cc: Simon Ser <contact@emersion.fr>
> Cc: Michel Dänzer <michel@daenzer.net>
> Cc: Tina Zhang <tina.zhang@intel.com>
> Cc: Dongwon Kim <dongwon.kim@intel.com>
> 
> Vivek Kasireddy (4):
>   drm: Add a capability flag to support deferred out_fence signalling
>   virtio-gpu uapi: Add VIRTIO_GPU_F_OUT_FENCE feature
>   drm/virtio: Add VIRTIO_GPU_CMD_RESOURCE_OUT_FENCE cmd
>   drm/virtio: Probe and implement VIRTIO_GPU_F_OUT_FENCE feature
> 
>  drivers/gpu/drm/drm_file.c               | 11 +++---
>  drivers/gpu/drm/drm_ioctl.c              |  3 ++
>  drivers/gpu/drm/virtio/virtgpu_debugfs.c |  1 +
>  drivers/gpu/drm/virtio/virtgpu_drv.c     |  1 +
>  drivers/gpu/drm/virtio/virtgpu_drv.h     |  6 ++++
>  drivers/gpu/drm/virtio/virtgpu_fence.c   |  9 +++++
>  drivers/gpu/drm/virtio/virtgpu_kms.c     | 10 ++++--
>  drivers/gpu/drm/virtio/virtgpu_plane.c   | 44 +++++++++++++++++++++++-
>  drivers/gpu/drm/virtio/virtgpu_vq.c      | 17 +++++++++
>  include/drm/drm_mode_config.h            |  9 +++++
>  include/uapi/drm/drm.h                   |  1 +
>  include/uapi/linux/virtio_gpu.h          | 12 +++++++
>  12 files changed, 117 insertions(+), 7 deletions(-)
> 
> -- 
> 2.30.2
>

Michel Dänzer July 30, 2021, 12:50 p.m. UTC | #2

On 2021-07-30 12:25 p.m., Daniel Vetter wrote:
> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
>> By separating the OUT_FENCE signalling from pageflip completion allows
>> a Guest compositor to start a new repaint cycle with a new buffer
>> instead of waiting for the old buffer to be free. 
>>
>> This work is based on the idea/suggestion from Simon and Pekka.
>>
>> This capability can be a solution for this issue:
>> https://gitlab.freedesktop.org/wayland/weston/-/issues/514
>>
>> Corresponding Weston MR:
>> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> 
> Uh I kinda wanted to discuss this a bit more before we jump into typing
> code, but well I guess not that much work yet.
> 
> So maybe I'm not understanding the problem, but I think the fundamental
> underlying issue is that with KMS you can have at most 2 buffers
> in-flight, due to our queue depth limit of 1 pending flip.
> 
> Unfortunately that means for virtual hw where it takes a few more
> steps/vblanks until the framebuffer actually shows up on screen and is
> scanned out, we suffer deeply. The usual fix for that is to drop the
> latency and increase throughput, and have more buffers in-flight. Which
> this patch tries to do.

Per https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 , IMO the underlying issue is actually that the guest compositor repaint cycle is not aligned with the host compositor one. If they were aligned, the problem would not occur even without allowing multiple page flips in flight, and latency would be lower.


> Now I think where we go wrong here is that we're trying to hack this up by
> defining different semantics for the out-fence and for the drm-event. Imo
> that's wrong, they're both meant to show eactly the same thing:
> - when is the new frame actually visible to the user (as in, eyeballs in a
>   human head, preferrably, not the time when we've handed the buffer off
>   to the virtual hw)
> - when is the previous buffer no longer being used by the scanout hw
> 
> We do cheat a bit right now in so far that we assume they're both the
> same, as in, panel-side latency is currently the compositor's problem to
> figure out.
> 
> So for virtual hw I think the timestamp and even completion really need to
> happen only when the buffer has been pushed through the entire
> virtualization chain, i.e. ideally we get the timestamp from the kms
> driver from the host side. Currently that's not done, so this is most
> likely quite broken already (virtio relies on the no-vblank auto event
> sending, which definitely doesn't wait for anything, or I'm completely
> missing something).
> 
> I think instead of hacking up some ill-defined 1.5 queue depth support,
> what we should do is support queue depth > 1 properly. So:
> 
> - Change atomic to support queue depth > 1, this needs to be a per-driver
>   thing due to a bunch of issues in driver code. Essentially drivers must
>   never look at obj->state pointers, and only ever look up state through
>   the passed-in drm_atomic_state * update container.
> 
> - Aside: virtio should loose all it's empty hooks, there's no point in
>   that.
> 
> - We fix virtio to send out the completion event at the end of this entire
>   pipeline, i.e. virtio code needs to take care of sending out the
>   crtc_state->event correctly.
> 
> - We probably also want some kind of (maybe per-crtc) recommended queue
>   depth property so compositors know how many buffers to keep in flight.
>   Not sure about that.

I'd say there would definitely need to be some kind of signal for the display server that it should queue multiple flips, since this is normally not desirable for latency. In other words, this wouldn't really be useful on bare metal (in contrast to the ability to replace a pending flip with a newer one).

Gerd Hoffmann July 30, 2021, 1:38 p.m. UTC | #3

Hi,

> - We fix virtio to send out the completion event at the end of this entire
>   pipeline, i.e. virtio code needs to take care of sending out the
>   crtc_state->event correctly.

That sounds sensible to me.  Fence the virtio commands, make sure (on
the host side) the command completes only when the work is actually done
not only submitted.  Has recently been added to qemu for RESOURCE_FLUSH
(aka frontbuffer rendering) and doing the same for SET_SCANOUT (aka
pageflipping), then send vblank events to userspace on command
completion certainly makes sense.

take care,
  Gerd

Zhang, Tina Aug. 2, 2021, 4:48 a.m. UTC | #4

> -----Original Message-----
> From: Daniel Vetter <daniel@ffwll.ch>
> Sent: Friday, July 30, 2021 6:26 PM
> To: Kasireddy, Vivek <vivek.kasireddy@intel.com>
> Cc: dri-devel@lists.freedesktop.org; Daniel Vetter <daniel@ffwll.ch>; Gerd
> Hoffmann <kraxel@redhat.com>; Pekka Paalanen <ppaalanen@gmail.com>;
> Simon Ser <contact@emersion.fr>; Michel Dänzer <michel@daenzer.net>;
> Zhang, Tina <tina.zhang@intel.com>; Kim, Dongwon
> <dongwon.kim@intel.com>
> Subject: Re: [RFC v1 0/4] drm: Add support for
> DRM_CAP_DEFERRED_OUT_FENCE capability
> 
> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
> > By separating the OUT_FENCE signalling from pageflip completion allows
> > a Guest compositor to start a new repaint cycle with a new buffer
> > instead of waiting for the old buffer to be free.
> >
> > This work is based on the idea/suggestion from Simon and Pekka.
> >
> > This capability can be a solution for this issue:
> > https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> >
> > Corresponding Weston MR:
> > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> 
> Uh I kinda wanted to discuss this a bit more before we jump into typing code,
> but well I guess not that much work yet.
> 
> So maybe I'm not understanding the problem, but I think the fundamental
> underlying issue is that with KMS you can have at most 2 buffers in-flight,
> due to our queue depth limit of 1 pending flip.
> 
> Unfortunately that means for virtual hw where it takes a few more
> steps/vblanks until the framebuffer actually shows up on screen and is
> scanned out, we suffer deeply. The usual fix for that is to drop the latency
> and increase throughput, and have more buffers in-flight. Which this patch
> tries to do.
> 
> Now I think where we go wrong here is that we're trying to hack this up by
> defining different semantics for the out-fence and for the drm-event. Imo
> that's wrong, they're both meant to show eactly the same thing:
> - when is the new frame actually visible to the user (as in, eyeballs in a
>   human head, preferrably, not the time when we've handed the buffer off
>   to the virtual hw)
> - when is the previous buffer no longer being used by the scanout hw
> 
> We do cheat a bit right now in so far that we assume they're both the same,
> as in, panel-side latency is currently the compositor's problem to figure out.
> 
> So for virtual hw I think the timestamp and even completion really need to
> happen only when the buffer has been pushed through the entire
> virtualization chain, i.e. ideally we get the timestamp from the kms driver
> from the host side. Currently that's not done, so this is most likely quite
> broken already (virtio relies on the no-vblank auto event sending, which
> definitely doesn't wait for anything, or I'm completely missing something).

Agree. One lesson we got from previous direct-display related work is that using host hardware event is kind of "must". Otherwise, problems like flickering or tearing or frame drop will come out. Besides, as the wayland-ui is working as a weston client, it needs to have more than 2 buffers to support the full-frame redraw. I tried the Weston-simple-dmabuf-egl with 2 buffers and it was bad.

BR,
Tina

> 
> I think instead of hacking up some ill-defined 1.5 queue depth support, what
> we should do is support queue depth > 1 properly. So:
> 
> - Change atomic to support queue depth > 1, this needs to be a per-driver
>   thing due to a bunch of issues in driver code. Essentially drivers must
>   never look at obj->state pointers, and only ever look up state through
>   the passed-in drm_atomic_state * update container.
> 
> - Aside: virtio should loose all it's empty hooks, there's no point in
>   that.
> 
> - We fix virtio to send out the completion event at the end of this entire
>   pipeline, i.e. virtio code needs to take care of sending out the
>   crtc_state->event correctly.
> 
> - We probably also want some kind of (maybe per-crtc) recommended queue
>   depth property so compositors know how many buffers to keep in flight.
>   Not sure about that.
> 
> It's a bit more work, but also a lot less hacking around infrastructure in
> dubious ways.
> 
> Thoughts?
> 
> Cheers, Daniel
> 
> >
> > Cc: Daniel Vetter <daniel@ffwll.ch>
> > Cc: Gerd Hoffmann <kraxel@redhat.com>
> > Cc: Pekka Paalanen <ppaalanen@gmail.com>
> > Cc: Simon Ser <contact@emersion.fr>
> > Cc: Michel Dänzer <michel@daenzer.net>
> > Cc: Tina Zhang <tina.zhang@intel.com>
> > Cc: Dongwon Kim <dongwon.kim@intel.com>
> >
> > Vivek Kasireddy (4):
> >   drm: Add a capability flag to support deferred out_fence signalling
> >   virtio-gpu uapi: Add VIRTIO_GPU_F_OUT_FENCE feature
> >   drm/virtio: Add VIRTIO_GPU_CMD_RESOURCE_OUT_FENCE cmd
> >   drm/virtio: Probe and implement VIRTIO_GPU_F_OUT_FENCE feature
> >
> >  drivers/gpu/drm/drm_file.c               | 11 +++---
> >  drivers/gpu/drm/drm_ioctl.c              |  3 ++
> >  drivers/gpu/drm/virtio/virtgpu_debugfs.c |  1 +
> >  drivers/gpu/drm/virtio/virtgpu_drv.c     |  1 +
> >  drivers/gpu/drm/virtio/virtgpu_drv.h     |  6 ++++
> >  drivers/gpu/drm/virtio/virtgpu_fence.c   |  9 +++++
> >  drivers/gpu/drm/virtio/virtgpu_kms.c     | 10 ++++--
> >  drivers/gpu/drm/virtio/virtgpu_plane.c   | 44 +++++++++++++++++++++++-
> >  drivers/gpu/drm/virtio/virtgpu_vq.c      | 17 +++++++++
> >  include/drm/drm_mode_config.h            |  9 +++++
> >  include/uapi/drm/drm.h                   |  1 +
> >  include/uapi/linux/virtio_gpu.h          | 12 +++++++
> >  12 files changed, 117 insertions(+), 7 deletions(-)
> >
> > --
> > 2.30.2
> >
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Vivek Kasireddy Aug. 2, 2021, 6:51 a.m. UTC | #5

Hi Daniel,

> 
> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
> > By separating the OUT_FENCE signalling from pageflip completion allows
> > a Guest compositor to start a new repaint cycle with a new buffer
> > instead of waiting for the old buffer to be free.
> >
> > This work is based on the idea/suggestion from Simon and Pekka.
> >
> > This capability can be a solution for this issue:
> > https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> >
> > Corresponding Weston MR:
> > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> 
> Uh I kinda wanted to discuss this a bit more before we jump into typing
> code, but well I guess not that much work yet.
[Kasireddy, Vivek] Right, it wasn't a lot of work :)

> 
> So maybe I'm not understanding the problem, but I think the fundamental
> underlying issue is that with KMS you can have at most 2 buffers
> in-flight, due to our queue depth limit of 1 pending flip.
[Kasireddy, Vivek] Let me summarize the problem again from the perspective of
both the Host (Weston) and Guest (Weston) compositors assuming a refresh-rate
of 60 -- which implies the Vblank/Vsync is generated every ~16.66 ms.
Host compositor:
- After a pageflip completion event, it starts its next repaint cycle by waiting for 9 ms
and then submits the atomic commit and at the tail end of its cycle sends a frame callback
event to all its clients (who registered and submitted frames) indicating to them to 
start their next redraw  -- giving them at-least ~16 ms to submit a new frame to be
included in its next repaint. Why a configurable 9 ms delay is needed is explained
in Pekka's blog post here:
https://ppaalanen.blogspot.com/2015/02/weston-repaint-scheduling.html

- It'll send a wl_buffer.release event for a client submitted previous buffer only
when the client has submitted a new buffer and:
a) When it hasn't started its repaint cycle yet OR
b) When it clears its old state after it gets a pageflip completion event -- if it had
flipped the client's buffer onto a hardware plane.

Guest compositor:
- After a pageflip completion is sent by Guest KMS, it takes about 10-12 ms for the 
Guest compositor to submit a new atomic commit. This time of 10-12 ms includes the
9 ms wait -- just like the Host compositor -- for its clients to submit new buffers.
- When it gets a pageflip completion, it assumes that the previously submitted buffer
is free for re-use and uses it again -- resulting in the usage of only 2 out of a maximum
of 4 backbuffers included as part of the Mesa GBM surface implementation.

Guest KMS/Virtio-gpu/Qemu Wayland UI:
- Because no_vblank=true for Guest KMS and since the vblank event (which also serves
as the pageflip completion event for user-space) is sent right away after atomic commit,
as Gerd said, we use an internal dma-fence to block/wait the Guest KMS until we know for
sure that the Host is completely done using the buffer. To ensure this, we signal the dma-fence
only after the Host compositor sends a wl_buffer.release event or an equivalent signal.

The goal:
- Maintain full framerate even when the Guest scanout FB is flipped onto a hardware plane
on the Host -- regardless of either compositor's scheduling policy -- without making any
copies and ensuring that both Host and Guest are not accessing the buffer at the same time.

The problem:
- If the Host compositor flips the client's buffer (in this case Guest compositor's buffer) 
onto a hardware plane, then it can send a wl_buffer.release event for the previous buffer
only after it gets a pageflip completion. And, if the Guest compositor takes 10-12 ms to
submit a new buffer and given the fact that the Host compositor waits only for 9 ms, the
Guest compositor will miss the Host's repaint cycle resulting in halved frame-rate.

The solution:
- To ensure full framerate, the Guest compositor has to start it's repaint cycle (including
the 9 ms wait) when the Host compositor sends the frame callback event to its clients.
In order for this to happen, the dma-fence that the Guest KMS waits on -- before sending
pageflip completion -- cannot be tied to a wl_buffer.release event. This means that, the
Guest compositor has to be forced to use a new buffer for its next repaint cycle when it
gets a pageflip completion.
- The Weston MR I linked above does this by getting an out_fence fd and taking a reference
on all the FBs included in the atomic commit forcing the compositor to use new FBs for its
next repaint cycle. It releases the references when the out_fence is signalled later when
the Host compositor sends a wl_buffer.release event.

> 
> Unfortunately that means for virtual hw where it takes a few more
> steps/vblanks until the framebuffer actually shows up on screen and is
> scanned out, we suffer deeply. The usual fix for that is to drop the
> latency and increase throughput, and have more buffers in-flight. Which
> this patch tries to do.
> 
> Now I think where we go wrong here is that we're trying to hack this up by
> defining different semantics for the out-fence and for the drm-event. Imo
> that's wrong, they're both meant to show eactly the same thing:
> - when is the new frame actually visible to the user (as in, eyeballs in a
>   human head, preferrably, not the time when we've handed the buffer off
>   to the virtual hw)
> - when is the previous buffer no longer being used by the scanout hw
[Kasireddy, Vivek] Right, they both mean the same thing but I think using both
at the same time would be redundant in the case of Weston. That's why I am trying
to repurpose the usage of out_fence in this case by introducing a new capability
that may not be relevant for bare-metal KMS drivers but would be useful for
virtual KMS drivers.

> 
> We do cheat a bit right now in so far that we assume they're both the
> same, as in, panel-side latency is currently the compositor's problem to
> figure out.
> 
> So for virtual hw I think the timestamp and even completion really need to
> happen only when the buffer has been pushed through the entire
> virtualization chain, i.e. ideally we get the timestamp from the kms
> driver from the host side. Currently that's not done, so this is most
> likely quite broken already (virtio relies on the no-vblank auto event
> sending, which definitely doesn't wait for anything, or I'm completely
> missing something).
[Kasireddy, Vivek] You are right; virtio_gpu does use the no_vblank auto event but
as I mentioned above we do use an internal dma-fence to wait until the submitted
buffer is no longer used by the Host. In other words, we wait (in update_planes hook)
until we get an appropriate signal from the Host to proceed to make sure that we are
not rendering faster than what the Host can display.

However, as you suggest below, we could set no_vblank=false and send the vblank/
pageflip completion event from the virtio-gpu driver instead of having the DRM
core send it. This can prevent the DRM core from signalling the out_fence as well
which is my intended objective and what my first patch tries to do. I'd still need the
new capability though to include the patch in Weston that deals with out_fence --
unless Weston upstream can accept the patch after reviewing it without this newly
added capability which would be redundant but it does solve my problem. Would
this be acceptable?

> 
> I think instead of hacking up some ill-defined 1.5 queue depth support,
> what we should do is support queue depth > 1 properly. So:
> 
> - Change atomic to support queue depth > 1, this needs to be a per-driver
>   thing due to a bunch of issues in driver code. Essentially drivers must
>   never look at obj->state pointers, and only ever look up state through
>   the passed-in drm_atomic_state * update container.
> 
> - Aside: virtio should loose all it's empty hooks, there's no point in
>   that.
> 
> - We fix virtio to send out the completion event at the end of this entire
>   pipeline, i.e. virtio code needs to take care of sending out the
>   crtc_state->event correctly.
> 
> - We probably also want some kind of (maybe per-crtc) recommended queue
>   depth property so compositors know how many buffers to keep in flight.
>   Not sure about that.
> 
> It's a bit more work, but also a lot less hacking around infrastructure in
> dubious ways.
> 
> Thoughts?
[Kasireddy, Vivek] IIUC, you are suggesting that we should make it possible to
submit a new atomic commit even though the completion event for the previous
one has not come in yet. This may potentially solve my problem but it sounds very
disruptive and not very useful for bare-metal cases. It also means that the compositors,
DRM core and the drivers need to keep track of multiple states -- as opposed to new and
old -- for all objects such as crtcs, planes, etc and account for multiple completion events.
I guess it is doable but as you suggest it seems like a lot of work with many pitfalls ahead.

Thanks,
Vivek
> 
> Cheers, Daniel
> 
> >
> > Cc: Daniel Vetter <daniel@ffwll.ch>
> > Cc: Gerd Hoffmann <kraxel@redhat.com>
> > Cc: Pekka Paalanen <ppaalanen@gmail.com>
> > Cc: Simon Ser <contact@emersion.fr>
> > Cc: Michel Dänzer <michel@daenzer.net>
> > Cc: Tina Zhang <tina.zhang@intel.com>
> > Cc: Dongwon Kim <dongwon.kim@intel.com>
> >
> > Vivek Kasireddy (4):
> >   drm: Add a capability flag to support deferred out_fence signalling
> >   virtio-gpu uapi: Add VIRTIO_GPU_F_OUT_FENCE feature
> >   drm/virtio: Add VIRTIO_GPU_CMD_RESOURCE_OUT_FENCE cmd
> >   drm/virtio: Probe and implement VIRTIO_GPU_F_OUT_FENCE feature
> >
> >  drivers/gpu/drm/drm_file.c               | 11 +++---
> >  drivers/gpu/drm/drm_ioctl.c              |  3 ++
> >  drivers/gpu/drm/virtio/virtgpu_debugfs.c |  1 +
> >  drivers/gpu/drm/virtio/virtgpu_drv.c     |  1 +
> >  drivers/gpu/drm/virtio/virtgpu_drv.h     |  6 ++++
> >  drivers/gpu/drm/virtio/virtgpu_fence.c   |  9 +++++
> >  drivers/gpu/drm/virtio/virtgpu_kms.c     | 10 ++++--
> >  drivers/gpu/drm/virtio/virtgpu_plane.c   | 44 +++++++++++++++++++++++-
> >  drivers/gpu/drm/virtio/virtgpu_vq.c      | 17 +++++++++
> >  include/drm/drm_mode_config.h            |  9 +++++
> >  include/uapi/drm/drm.h                   |  1 +
> >  include/uapi/linux/virtio_gpu.h          | 12 +++++++
> >  12 files changed, 117 insertions(+), 7 deletions(-)
> >
> > --
> > 2.30.2
> >
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Daniel Vetter Aug. 2, 2021, 7:59 a.m. UTC | #6

On Fri, Jul 30, 2021 at 02:50:10PM +0200, Michel Dänzer wrote:
> On 2021-07-30 12:25 p.m., Daniel Vetter wrote:
> > On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
> >> By separating the OUT_FENCE signalling from pageflip completion allows
> >> a Guest compositor to start a new repaint cycle with a new buffer
> >> instead of waiting for the old buffer to be free. 
> >>
> >> This work is based on the idea/suggestion from Simon and Pekka.
> >>
> >> This capability can be a solution for this issue:
> >> https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> >>
> >> Corresponding Weston MR:
> >> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> > 
> > Uh I kinda wanted to discuss this a bit more before we jump into typing
> > code, but well I guess not that much work yet.
> > 
> > So maybe I'm not understanding the problem, but I think the fundamental
> > underlying issue is that with KMS you can have at most 2 buffers
> > in-flight, due to our queue depth limit of 1 pending flip.
> > 
> > Unfortunately that means for virtual hw where it takes a few more
> > steps/vblanks until the framebuffer actually shows up on screen and is
> > scanned out, we suffer deeply. The usual fix for that is to drop the
> > latency and increase throughput, and have more buffers in-flight. Which
> > this patch tries to do.
> 
> Per
> https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 ,
> IMO the underlying issue is actually that the guest compositor repaint
> cycle is not aligned with the host compositor one. If they were aligned,
> the problem would not occur even without allowing multiple page flips in
> flight, and latency would be lower.

Yeah my proposal here is under the premise that we do actually need to fix
this with a deeper queue depth.

> > Now I think where we go wrong here is that we're trying to hack this up by
> > defining different semantics for the out-fence and for the drm-event. Imo
> > that's wrong, they're both meant to show eactly the same thing:
> > - when is the new frame actually visible to the user (as in, eyeballs in a
> >   human head, preferrably, not the time when we've handed the buffer off
> >   to the virtual hw)
> > - when is the previous buffer no longer being used by the scanout hw
> > 
> > We do cheat a bit right now in so far that we assume they're both the
> > same, as in, panel-side latency is currently the compositor's problem to
> > figure out.
> > 
> > So for virtual hw I think the timestamp and even completion really need to
> > happen only when the buffer has been pushed through the entire
> > virtualization chain, i.e. ideally we get the timestamp from the kms
> > driver from the host side. Currently that's not done, so this is most
> > likely quite broken already (virtio relies on the no-vblank auto event
> > sending, which definitely doesn't wait for anything, or I'm completely
> > missing something).
> > 
> > I think instead of hacking up some ill-defined 1.5 queue depth support,
> > what we should do is support queue depth > 1 properly. So:
> > 
> > - Change atomic to support queue depth > 1, this needs to be a per-driver
> >   thing due to a bunch of issues in driver code. Essentially drivers must
> >   never look at obj->state pointers, and only ever look up state through
> >   the passed-in drm_atomic_state * update container.
> > 
> > - Aside: virtio should loose all it's empty hooks, there's no point in
> >   that.
> > 
> > - We fix virtio to send out the completion event at the end of this entire
> >   pipeline, i.e. virtio code needs to take care of sending out the
> >   crtc_state->event correctly.
> > 
> > - We probably also want some kind of (maybe per-crtc) recommended queue
> >   depth property so compositors know how many buffers to keep in flight.
> >   Not sure about that.
> 
> I'd say there would definitely need to be some kind of signal for the
> display server that it should queue multiple flips, since this is
> normally not desirable for latency. In other words, this wouldn't really
> be useful on bare metal (in contrast to the ability to replace a pending
> flip with a newer one).

Hm I was thinking that the compositor can tune this. If the round-trip
latency (as measured by events) is too long to get full refresh rate, it
can add more buffers to the queue. That's kinda why I think the returned
event really must be accurate wrt actual display time (and old buffer
release time), so that this computation in the compositor because a pretty
simple

num_buffers = (flip_time - submit_time) / frame_time

With maybe some rounding up and averaging. You can also hit this when your
3d engine has an extremely deep pipeline (like some of the tiling
renders have), where rendering just takes forever, but as long as you keep
2 frames in the renderer in-flight you can achieve full refresh rate (at a
latency cost).

So kernel can't really tell you in all cases how many buffers you should
have.
-Daniel
> -- 
> Earthling Michel Dänzer               |               https://redhat.com
> Libre software enthusiast             |             Mesa and X developer

Daniel Vetter Aug. 2, 2021, 8:14 a.m. UTC | #7

On Mon, Aug 02, 2021 at 06:51:33AM +0000, Kasireddy, Vivek wrote:
> Hi Daniel,
> 
> > 
> > On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
> > > By separating the OUT_FENCE signalling from pageflip completion allows
> > > a Guest compositor to start a new repaint cycle with a new buffer
> > > instead of waiting for the old buffer to be free.
> > >
> > > This work is based on the idea/suggestion from Simon and Pekka.
> > >
> > > This capability can be a solution for this issue:
> > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> > >
> > > Corresponding Weston MR:
> > > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> > 
> > Uh I kinda wanted to discuss this a bit more before we jump into typing
> > code, but well I guess not that much work yet.
> [Kasireddy, Vivek] Right, it wasn't a lot of work :)
> 
> > 
> > So maybe I'm not understanding the problem, but I think the fundamental
> > underlying issue is that with KMS you can have at most 2 buffers
> > in-flight, due to our queue depth limit of 1 pending flip.
> [Kasireddy, Vivek] Let me summarize the problem again from the perspective of
> both the Host (Weston) and Guest (Weston) compositors assuming a refresh-rate
> of 60 -- which implies the Vblank/Vsync is generated every ~16.66 ms.
> Host compositor:
> - After a pageflip completion event, it starts its next repaint cycle by waiting for 9 ms
> and then submits the atomic commit and at the tail end of its cycle sends a frame callback
> event to all its clients (who registered and submitted frames) indicating to them to 
> start their next redraw  -- giving them at-least ~16 ms to submit a new frame to be
> included in its next repaint. Why a configurable 9 ms delay is needed is explained
> in Pekka's blog post here:
> https://ppaalanen.blogspot.com/2015/02/weston-repaint-scheduling.html
> 
> - It'll send a wl_buffer.release event for a client submitted previous buffer only
> when the client has submitted a new buffer and:
> a) When it hasn't started its repaint cycle yet OR
> b) When it clears its old state after it gets a pageflip completion event -- if it had
> flipped the client's buffer onto a hardware plane.
> 
> Guest compositor:
> - After a pageflip completion is sent by Guest KMS, it takes about 10-12 ms for the 
> Guest compositor to submit a new atomic commit. This time of 10-12 ms includes the
> 9 ms wait -- just like the Host compositor -- for its clients to submit new buffers.
> - When it gets a pageflip completion, it assumes that the previously submitted buffer
> is free for re-use and uses it again -- resulting in the usage of only 2 out of a maximum
> of 4 backbuffers included as part of the Mesa GBM surface implementation.
> 
> Guest KMS/Virtio-gpu/Qemu Wayland UI:
> - Because no_vblank=true for Guest KMS and since the vblank event (which also serves
> as the pageflip completion event for user-space) is sent right away after atomic commit,
> as Gerd said, we use an internal dma-fence to block/wait the Guest KMS until we know for
> sure that the Host is completely done using the buffer. To ensure this, we signal the dma-fence
> only after the Host compositor sends a wl_buffer.release event or an equivalent signal.
> 
> The goal:
> - Maintain full framerate even when the Guest scanout FB is flipped onto a hardware plane
> on the Host -- regardless of either compositor's scheduling policy -- without making any
> copies and ensuring that both Host and Guest are not accessing the buffer at the same time.
> 
> The problem:
> - If the Host compositor flips the client's buffer (in this case Guest compositor's buffer) 
> onto a hardware plane, then it can send a wl_buffer.release event for the previous buffer
> only after it gets a pageflip completion. And, if the Guest compositor takes 10-12 ms to
> submit a new buffer and given the fact that the Host compositor waits only for 9 ms, the
> Guest compositor will miss the Host's repaint cycle resulting in halved frame-rate.
> 
> The solution:
> - To ensure full framerate, the Guest compositor has to start it's repaint cycle (including
> the 9 ms wait) when the Host compositor sends the frame callback event to its clients.
> In order for this to happen, the dma-fence that the Guest KMS waits on -- before sending
> pageflip completion -- cannot be tied to a wl_buffer.release event. This means that, the
> Guest compositor has to be forced to use a new buffer for its next repaint cycle when it
> gets a pageflip completion.

Is that really the only solution?

If we fix the event timestamps so that both guest and host use the same
timestamp, but then the guest starts 5ms (or something like that) earlier,
then things should work too? I.e.
- host compositor starts at (previous_frametime + 9ms)
- guest compositor starts at (previous_frametime + 4ms)

Ofc this only works if the frametimes we hand out to both match _exactly_
and are as high-precision as the ones on the host side. Which for many gpu
drivers at least is the case, and all the ones you care about for sure :-)

But if the frametimes the guest receives are the no_vblank fake ones, then
they'll be all over the place and this carefully tuned low-latency redraw
loop falls apart. Aside fromm the fact that without tuning the guests to
be earlier than the hosts, you're guaranteed to miss every frame (except
when the timing wobbliness in the guest is big enough by chance to make
the deadline on the oddball frame).

> - The Weston MR I linked above does this by getting an out_fence fd and taking a reference
> on all the FBs included in the atomic commit forcing the compositor to use new FBs for its
> next repaint cycle. It releases the references when the out_fence is signalled later when
> the Host compositor sends a wl_buffer.release event.
> 
> > 
> > Unfortunately that means for virtual hw where it takes a few more
> > steps/vblanks until the framebuffer actually shows up on screen and is
> > scanned out, we suffer deeply. The usual fix for that is to drop the
> > latency and increase throughput, and have more buffers in-flight. Which
> > this patch tries to do.
> > 
> > Now I think where we go wrong here is that we're trying to hack this up by
> > defining different semantics for the out-fence and for the drm-event. Imo
> > that's wrong, they're both meant to show eactly the same thing:
> > - when is the new frame actually visible to the user (as in, eyeballs in a
> >   human head, preferrably, not the time when we've handed the buffer off
> >   to the virtual hw)
> > - when is the previous buffer no longer being used by the scanout hw
> [Kasireddy, Vivek] Right, they both mean the same thing but I think using both
> at the same time would be redundant in the case of Weston. That's why I am trying
> to repurpose the usage of out_fence in this case by introducing a new capability
> that may not be relevant for bare-metal KMS drivers but would be useful for
> virtual KMS drivers.
> 
> > 
> > We do cheat a bit right now in so far that we assume they're both the
> > same, as in, panel-side latency is currently the compositor's problem to
> > figure out.
> > 
> > So for virtual hw I think the timestamp and even completion really need to
> > happen only when the buffer has been pushed through the entire
> > virtualization chain, i.e. ideally we get the timestamp from the kms
> > driver from the host side. Currently that's not done, so this is most
> > likely quite broken already (virtio relies on the no-vblank auto event
> > sending, which definitely doesn't wait for anything, or I'm completely
> > missing something).
> [Kasireddy, Vivek] You are right; virtio_gpu does use the no_vblank auto event but
> as I mentioned above we do use an internal dma-fence to wait until the submitted
> buffer is no longer used by the Host. In other words, we wait (in update_planes hook)
> until we get an appropriate signal from the Host to proceed to make sure that we are
> not rendering faster than what the Host can display.

Yeah that internal dma_fence really should be the flip completion event
too. That's how this uapi is supposed to work.

Once you have that then maybe weston magically works because it realizes
that it misses the frames it's aiming for. Or at least there will be debug
output about that I hope (I'm not sure the auto-tuning works/exists).

> However, as you suggest below, we could set no_vblank=false and send the vblank/
> pageflip completion event from the virtio-gpu driver instead of having the DRM
> core send it. This can prevent the DRM core from signalling the out_fence as well
> which is my intended objective and what my first patch tries to do. I'd still need the
> new capability though to include the patch in Weston that deals with out_fence --
> unless Weston upstream can accept the patch after reviewing it without this newly
> added capability which would be redundant but it does solve my problem. Would
> this be acceptable?

out fence and flip completion event are exactly the same thing
semantically. Well, before your patch here at least. So if you fix up the
internal crtc->event handling then you fix up both. That's very much by
design, because otherwise we'd have a bunch of kms drivers that only work
on Android (which uses out-fence), and the others only work on dekstop
linux (which uses flip completion drm_event). And probably very few that
support both.

> > I think instead of hacking up some ill-defined 1.5 queue depth support,
> > what we should do is support queue depth > 1 properly. So:
> > 
> > - Change atomic to support queue depth > 1, this needs to be a per-driver
> >   thing due to a bunch of issues in driver code. Essentially drivers must
> >   never look at obj->state pointers, and only ever look up state through
> >   the passed-in drm_atomic_state * update container.
> > 
> > - Aside: virtio should loose all it's empty hooks, there's no point in
> >   that.
> > 
> > - We fix virtio to send out the completion event at the end of this entire
> >   pipeline, i.e. virtio code needs to take care of sending out the
> >   crtc_state->event correctly.
> > 
> > - We probably also want some kind of (maybe per-crtc) recommended queue
> >   depth property so compositors know how many buffers to keep in flight.
> >   Not sure about that.
> > 
> > It's a bit more work, but also a lot less hacking around infrastructure in
> > dubious ways.
> > 
> > Thoughts?
> [Kasireddy, Vivek] IIUC, you are suggesting that we should make it possible to
> submit a new atomic commit even though the completion event for the previous
> one has not come in yet. This may potentially solve my problem but it sounds very
> disruptive and not very useful for bare-metal cases. It also means that the compositors,
> DRM core and the drivers need to keep track of multiple states -- as opposed to new and
> old -- for all objects such as crtcs, planes, etc and account for multiple completion events.
> I guess it is doable but as you suggest it seems like a lot of work with many pitfalls ahead.

Queue deeper than 1 has been an eventual goal for atomic since the start,
we simply didn't get around to it.

All the state handling and helpers are built to support that (but there
could be more bugs). The only rule drivers must follow is that in their
atomic_commit code they never look at the various obj->state pointers
(like drm_crtc->state), since that might be the state of a subsequent
commit. Instead they must only get the state through the drm_atomic_state
structure. We've recently also updated all the helpers to pass that around
everywhere (for other reasons), so the challenge here is only to fix up
individual drivers. And maybe come up with some debug checks to make the
obj->state pointers aren't used in atomic_commit.

From a design pov I think your approach of hacking up the event machinery
to slip in 2 commits while not actually using the 2 deep queue stuff like
it's meant to be is much worse.

On the userspace side I'm not sure why you need to keep track of more
state. All you need to keep track is of more buffers in your retire/reuse
list, but you have to do that with your proposal here too. So no
difference at all there.

Anyway it sounds like if the guest compositor would adjust it's deadline
so that the guest and host compositor interleave correctly, then we should
still be able to hit full refresh rate without a deeper queue. Has that
been looked into?
-Daniel

> 
> Thanks,
> Vivek
> > 
> > Cheers, Daniel
> > 
> > >
> > > Cc: Daniel Vetter <daniel@ffwll.ch>
> > > Cc: Gerd Hoffmann <kraxel@redhat.com>
> > > Cc: Pekka Paalanen <ppaalanen@gmail.com>
> > > Cc: Simon Ser <contact@emersion.fr>
> > > Cc: Michel Dänzer <michel@daenzer.net>
> > > Cc: Tina Zhang <tina.zhang@intel.com>
> > > Cc: Dongwon Kim <dongwon.kim@intel.com>
> > >
> > > Vivek Kasireddy (4):
> > >   drm: Add a capability flag to support deferred out_fence signalling
> > >   virtio-gpu uapi: Add VIRTIO_GPU_F_OUT_FENCE feature
> > >   drm/virtio: Add VIRTIO_GPU_CMD_RESOURCE_OUT_FENCE cmd
> > >   drm/virtio: Probe and implement VIRTIO_GPU_F_OUT_FENCE feature
> > >
> > >  drivers/gpu/drm/drm_file.c               | 11 +++---
> > >  drivers/gpu/drm/drm_ioctl.c              |  3 ++
> > >  drivers/gpu/drm/virtio/virtgpu_debugfs.c |  1 +
> > >  drivers/gpu/drm/virtio/virtgpu_drv.c     |  1 +
> > >  drivers/gpu/drm/virtio/virtgpu_drv.h     |  6 ++++
> > >  drivers/gpu/drm/virtio/virtgpu_fence.c   |  9 +++++
> > >  drivers/gpu/drm/virtio/virtgpu_kms.c     | 10 ++++--
> > >  drivers/gpu/drm/virtio/virtgpu_plane.c   | 44 +++++++++++++++++++++++-
> > >  drivers/gpu/drm/virtio/virtgpu_vq.c      | 17 +++++++++
> > >  include/drm/drm_mode_config.h            |  9 +++++
> > >  include/uapi/drm/drm.h                   |  1 +
> > >  include/uapi/linux/virtio_gpu.h          | 12 +++++++
> > >  12 files changed, 117 insertions(+), 7 deletions(-)
> > >
> > > --
> > > 2.30.2
> > >
> > 
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

Michel Dänzer Aug. 2, 2021, 8:49 a.m. UTC | #8

On 2021-08-02 9:59 a.m., Daniel Vetter wrote:
> On Fri, Jul 30, 2021 at 02:50:10PM +0200, Michel Dänzer wrote:
>> On 2021-07-30 12:25 p.m., Daniel Vetter wrote:
>>> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
>>>> By separating the OUT_FENCE signalling from pageflip completion allows
>>>> a Guest compositor to start a new repaint cycle with a new buffer
>>>> instead of waiting for the old buffer to be free. 
>>>>
>>>> This work is based on the idea/suggestion from Simon and Pekka.
>>>>
>>>> This capability can be a solution for this issue:
>>>> https://gitlab.freedesktop.org/wayland/weston/-/issues/514
>>>>
>>>> Corresponding Weston MR:
>>>> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
>>>
>>> Uh I kinda wanted to discuss this a bit more before we jump into typing
>>> code, but well I guess not that much work yet.
>>>
>>> So maybe I'm not understanding the problem, but I think the fundamental
>>> underlying issue is that with KMS you can have at most 2 buffers
>>> in-flight, due to our queue depth limit of 1 pending flip.
>>>
>>> Unfortunately that means for virtual hw where it takes a few more
>>> steps/vblanks until the framebuffer actually shows up on screen and is
>>> scanned out, we suffer deeply. The usual fix for that is to drop the
>>> latency and increase throughput, and have more buffers in-flight. Which
>>> this patch tries to do.
>>
>> Per
>> https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 ,
>> IMO the underlying issue is actually that the guest compositor repaint
>> cycle is not aligned with the host compositor one. If they were aligned,
>> the problem would not occur even without allowing multiple page flips in
>> flight, and latency would be lower.
> 
> Yeah my proposal here is under the premise that we do actually need to fix
> this with a deeper queue depth.
> 
>>> Now I think where we go wrong here is that we're trying to hack this up by
>>> defining different semantics for the out-fence and for the drm-event. Imo
>>> that's wrong, they're both meant to show eactly the same thing:
>>> - when is the new frame actually visible to the user (as in, eyeballs in a
>>>   human head, preferrably, not the time when we've handed the buffer off
>>>   to the virtual hw)
>>> - when is the previous buffer no longer being used by the scanout hw
>>>
>>> We do cheat a bit right now in so far that we assume they're both the
>>> same, as in, panel-side latency is currently the compositor's problem to
>>> figure out.
>>>
>>> So for virtual hw I think the timestamp and even completion really need to
>>> happen only when the buffer has been pushed through the entire
>>> virtualization chain, i.e. ideally we get the timestamp from the kms
>>> driver from the host side. Currently that's not done, so this is most
>>> likely quite broken already (virtio relies on the no-vblank auto event
>>> sending, which definitely doesn't wait for anything, or I'm completely
>>> missing something).
>>>
>>> I think instead of hacking up some ill-defined 1.5 queue depth support,
>>> what we should do is support queue depth > 1 properly. So:
>>>
>>> - Change atomic to support queue depth > 1, this needs to be a per-driver
>>>   thing due to a bunch of issues in driver code. Essentially drivers must
>>>   never look at obj->state pointers, and only ever look up state through
>>>   the passed-in drm_atomic_state * update container.
>>>
>>> - Aside: virtio should loose all it's empty hooks, there's no point in
>>>   that.
>>>
>>> - We fix virtio to send out the completion event at the end of this entire
>>>   pipeline, i.e. virtio code needs to take care of sending out the
>>>   crtc_state->event correctly.
>>>
>>> - We probably also want some kind of (maybe per-crtc) recommended queue
>>>   depth property so compositors know how many buffers to keep in flight.
>>>   Not sure about that.
>>
>> I'd say there would definitely need to be some kind of signal for the
>> display server that it should queue multiple flips, since this is
>> normally not desirable for latency. In other words, this wouldn't really
>> be useful on bare metal (in contrast to the ability to replace a pending
>> flip with a newer one).
> 
> Hm I was thinking that the compositor can tune this. If the round-trip
> latency (as measured by events) is too long to get full refresh rate, it
> can add more buffers to the queue. That's kinda why I think the returned
> event really must be accurate wrt actual display time (and old buffer
> release time), so that this computation in the compositor because a pretty
> simple
> 
> num_buffers = (flip_time - submit_time) / frame_time
> 
> With maybe some rounding up and averaging. You can also hit this when your
> 3d engine has an extremely deep pipeline (like some of the tiling
> renders have), where rendering just takes forever, but as long as you keep
> 2 frames in the renderer in-flight you can achieve full refresh rate (at a
> latency cost).

As long as a page flip submitted after vblank N can complete during vblank N+1, full frame rate can be sustained[0]. User space can use as many buffers as needed to keep the rendering pipeline busy.

[0] This is broken by the mis-aligned compositor repaint cycles: The flip from the guest compositor misses the host compositor's cycle, so it takes more than one display refresh cycle to complete.


> So kernel can't really tell you in all cases how many buffers you should
> have.

That's not exactly what I mean. Right now, KMS user space has to wait for a flip to complete before submitting another one, or it gets EBUSY. So if the kernel wants to allow multiple flips to be submitted, it has to somehow tell user space that this is possible, or it'll never happen. And the kernel should never advertise this for bare metal, since it's not needed there (and undesirable).

Daniel Vetter Aug. 2, 2021, 9:06 a.m. UTC | #9

On Mon, Aug 02, 2021 at 10:49:37AM +0200, Michel Dänzer wrote:
> On 2021-08-02 9:59 a.m., Daniel Vetter wrote:
> > On Fri, Jul 30, 2021 at 02:50:10PM +0200, Michel Dänzer wrote:
> >> On 2021-07-30 12:25 p.m., Daniel Vetter wrote:
> >>> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
> >>>> By separating the OUT_FENCE signalling from pageflip completion allows
> >>>> a Guest compositor to start a new repaint cycle with a new buffer
> >>>> instead of waiting for the old buffer to be free. 
> >>>>
> >>>> This work is based on the idea/suggestion from Simon and Pekka.
> >>>>
> >>>> This capability can be a solution for this issue:
> >>>> https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> >>>>
> >>>> Corresponding Weston MR:
> >>>> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> >>>
> >>> Uh I kinda wanted to discuss this a bit more before we jump into typing
> >>> code, but well I guess not that much work yet.
> >>>
> >>> So maybe I'm not understanding the problem, but I think the fundamental
> >>> underlying issue is that with KMS you can have at most 2 buffers
> >>> in-flight, due to our queue depth limit of 1 pending flip.
> >>>
> >>> Unfortunately that means for virtual hw where it takes a few more
> >>> steps/vblanks until the framebuffer actually shows up on screen and is
> >>> scanned out, we suffer deeply. The usual fix for that is to drop the
> >>> latency and increase throughput, and have more buffers in-flight. Which
> >>> this patch tries to do.
> >>
> >> Per
> >> https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 ,
> >> IMO the underlying issue is actually that the guest compositor repaint
> >> cycle is not aligned with the host compositor one. If they were aligned,
> >> the problem would not occur even without allowing multiple page flips in
> >> flight, and latency would be lower.
> > 
> > Yeah my proposal here is under the premise that we do actually need to fix
> > this with a deeper queue depth.
> > 
> >>> Now I think where we go wrong here is that we're trying to hack this up by
> >>> defining different semantics for the out-fence and for the drm-event. Imo
> >>> that's wrong, they're both meant to show eactly the same thing:
> >>> - when is the new frame actually visible to the user (as in, eyeballs in a
> >>>   human head, preferrably, not the time when we've handed the buffer off
> >>>   to the virtual hw)
> >>> - when is the previous buffer no longer being used by the scanout hw
> >>>
> >>> We do cheat a bit right now in so far that we assume they're both the
> >>> same, as in, panel-side latency is currently the compositor's problem to
> >>> figure out.
> >>>
> >>> So for virtual hw I think the timestamp and even completion really need to
> >>> happen only when the buffer has been pushed through the entire
> >>> virtualization chain, i.e. ideally we get the timestamp from the kms
> >>> driver from the host side. Currently that's not done, so this is most
> >>> likely quite broken already (virtio relies on the no-vblank auto event
> >>> sending, which definitely doesn't wait for anything, or I'm completely
> >>> missing something).
> >>>
> >>> I think instead of hacking up some ill-defined 1.5 queue depth support,
> >>> what we should do is support queue depth > 1 properly. So:
> >>>
> >>> - Change atomic to support queue depth > 1, this needs to be a per-driver
> >>>   thing due to a bunch of issues in driver code. Essentially drivers must
> >>>   never look at obj->state pointers, and only ever look up state through
> >>>   the passed-in drm_atomic_state * update container.
> >>>
> >>> - Aside: virtio should loose all it's empty hooks, there's no point in
> >>>   that.
> >>>
> >>> - We fix virtio to send out the completion event at the end of this entire
> >>>   pipeline, i.e. virtio code needs to take care of sending out the
> >>>   crtc_state->event correctly.
> >>>
> >>> - We probably also want some kind of (maybe per-crtc) recommended queue
> >>>   depth property so compositors know how many buffers to keep in flight.
> >>>   Not sure about that.
> >>
> >> I'd say there would definitely need to be some kind of signal for the
> >> display server that it should queue multiple flips, since this is
> >> normally not desirable for latency. In other words, this wouldn't really
> >> be useful on bare metal (in contrast to the ability to replace a pending
> >> flip with a newer one).
> > 
> > Hm I was thinking that the compositor can tune this. If the round-trip
> > latency (as measured by events) is too long to get full refresh rate, it
> > can add more buffers to the queue. That's kinda why I think the returned
> > event really must be accurate wrt actual display time (and old buffer
> > release time), so that this computation in the compositor because a pretty
> > simple
> > 
> > num_buffers = (flip_time - submit_time) / frame_time
> > 
> > With maybe some rounding up and averaging. You can also hit this when your
> > 3d engine has an extremely deep pipeline (like some of the tiling
> > renders have), where rendering just takes forever, but as long as you keep
> > 2 frames in the renderer in-flight you can achieve full refresh rate (at a
> > latency cost).
> 
> As long as a page flip submitted after vblank N can complete during
> vblank N+1, full frame rate can be sustained[0]. User space can use as
> many buffers as needed to keep the rendering pipeline busy.
> 
> [0] This is broken by the mis-aligned compositor repaint cycles: The
> flip from the guest compositor misses the host compositor's cycle, so it
> takes more than one display refresh cycle to complete.
> 
> 
> > So kernel can't really tell you in all cases how many buffers you should
> > have.
> 
> That's not exactly what I mean. Right now, KMS user space has to wait
> for a flip to complete before submitting another one, or it gets EBUSY.
> So if the kernel wants to allow multiple flips to be submitted, it has
> to somehow tell user space that this is possible, or it'll never happen.
> And the kernel should never advertise this for bare metal, since it's
> not needed there (and undesirable).

Oh the existence of the deep queue needs a getcap ofc.

Also, deep queues do exist in hw (including scheduling to the right
frame), the benefit is reduced battery usage for e.g. video playback if
you do the rendering for an entire set of frames and then just let the
display show them.

It costs latency ofc (or we need a cancellable queue, which once it's in
hw is tricky).
-Daniel

Daniel Vetter Aug. 2, 2021, 9:09 a.m. UTC | #10

On Fri, Jul 30, 2021 at 03:38:50PM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> > - We fix virtio to send out the completion event at the end of this entire
> >   pipeline, i.e. virtio code needs to take care of sending out the
> >   crtc_state->event correctly.
> 
> That sounds sensible to me.  Fence the virtio commands, make sure (on
> the host side) the command completes only when the work is actually done
> not only submitted.  Has recently been added to qemu for RESOURCE_FLUSH
> (aka frontbuffer rendering) and doing the same for SET_SCANOUT (aka
> pageflipping), then send vblank events to userspace on command
> completion certainly makes sense.

Hm how does this all work? At least drm/virtio uses
drm_atomic_helper_dirtyfb, so both DIRTYFB ioctl and atomic flips all end
up in the same driver path for everything. Or do you just combine the
resource_flush with the flip as needed and let the host side figure it all
out? From a quick read of virtgpu_plane.c that seems to be the case ...

Also to make this work we don't just need the fence, we need the timestamp
(in a clock domain the guest can correct for ofc) of the host side kms
driver flip completion. If you just have the fence then the jitter from
going through all the layers will most likely make it unusable.
-Daniel

Michel Dänzer Aug. 2, 2021, 9:19 a.m. UTC | #11

On 2021-08-02 11:06 a.m., Daniel Vetter wrote:
> On Mon, Aug 02, 2021 at 10:49:37AM +0200, Michel Dänzer wrote:
>> On 2021-08-02 9:59 a.m., Daniel Vetter wrote:
>>> On Fri, Jul 30, 2021 at 02:50:10PM +0200, Michel Dänzer wrote:
>>>> On 2021-07-30 12:25 p.m., Daniel Vetter wrote:
>>>>> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
>>>>>> By separating the OUT_FENCE signalling from pageflip completion allows
>>>>>> a Guest compositor to start a new repaint cycle with a new buffer
>>>>>> instead of waiting for the old buffer to be free. 
>>>>>>
>>>>>> This work is based on the idea/suggestion from Simon and Pekka.
>>>>>>
>>>>>> This capability can be a solution for this issue:
>>>>>> https://gitlab.freedesktop.org/wayland/weston/-/issues/514
>>>>>>
>>>>>> Corresponding Weston MR:
>>>>>> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
>>>>>
>>>>> Uh I kinda wanted to discuss this a bit more before we jump into typing
>>>>> code, but well I guess not that much work yet.
>>>>>
>>>>> So maybe I'm not understanding the problem, but I think the fundamental
>>>>> underlying issue is that with KMS you can have at most 2 buffers
>>>>> in-flight, due to our queue depth limit of 1 pending flip.
>>>>>
>>>>> Unfortunately that means for virtual hw where it takes a few more
>>>>> steps/vblanks until the framebuffer actually shows up on screen and is
>>>>> scanned out, we suffer deeply. The usual fix for that is to drop the
>>>>> latency and increase throughput, and have more buffers in-flight. Which
>>>>> this patch tries to do.
>>>>
>>>> Per
>>>> https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 ,
>>>> IMO the underlying issue is actually that the guest compositor repaint
>>>> cycle is not aligned with the host compositor one. If they were aligned,
>>>> the problem would not occur even without allowing multiple page flips in
>>>> flight, and latency would be lower.
>>>
>>> Yeah my proposal here is under the premise that we do actually need to fix
>>> this with a deeper queue depth.
>>>
>>>>> Now I think where we go wrong here is that we're trying to hack this up by
>>>>> defining different semantics for the out-fence and for the drm-event. Imo
>>>>> that's wrong, they're both meant to show eactly the same thing:
>>>>> - when is the new frame actually visible to the user (as in, eyeballs in a
>>>>>   human head, preferrably, not the time when we've handed the buffer off
>>>>>   to the virtual hw)
>>>>> - when is the previous buffer no longer being used by the scanout hw
>>>>>
>>>>> We do cheat a bit right now in so far that we assume they're both the
>>>>> same, as in, panel-side latency is currently the compositor's problem to
>>>>> figure out.
>>>>>
>>>>> So for virtual hw I think the timestamp and even completion really need to
>>>>> happen only when the buffer has been pushed through the entire
>>>>> virtualization chain, i.e. ideally we get the timestamp from the kms
>>>>> driver from the host side. Currently that's not done, so this is most
>>>>> likely quite broken already (virtio relies on the no-vblank auto event
>>>>> sending, which definitely doesn't wait for anything, or I'm completely
>>>>> missing something).
>>>>>
>>>>> I think instead of hacking up some ill-defined 1.5 queue depth support,
>>>>> what we should do is support queue depth > 1 properly. So:
>>>>>
>>>>> - Change atomic to support queue depth > 1, this needs to be a per-driver
>>>>>   thing due to a bunch of issues in driver code. Essentially drivers must
>>>>>   never look at obj->state pointers, and only ever look up state through
>>>>>   the passed-in drm_atomic_state * update container.
>>>>>
>>>>> - Aside: virtio should loose all it's empty hooks, there's no point in
>>>>>   that.
>>>>>
>>>>> - We fix virtio to send out the completion event at the end of this entire
>>>>>   pipeline, i.e. virtio code needs to take care of sending out the
>>>>>   crtc_state->event correctly.
>>>>>
>>>>> - We probably also want some kind of (maybe per-crtc) recommended queue
>>>>>   depth property so compositors know how many buffers to keep in flight.
>>>>>   Not sure about that.
>>>>
>>>> I'd say there would definitely need to be some kind of signal for the
>>>> display server that it should queue multiple flips, since this is
>>>> normally not desirable for latency. In other words, this wouldn't really
>>>> be useful on bare metal (in contrast to the ability to replace a pending
>>>> flip with a newer one).
>>>
>>> Hm I was thinking that the compositor can tune this. If the round-trip
>>> latency (as measured by events) is too long to get full refresh rate, it
>>> can add more buffers to the queue. That's kinda why I think the returned
>>> event really must be accurate wrt actual display time (and old buffer
>>> release time), so that this computation in the compositor because a pretty
>>> simple
>>>
>>> num_buffers = (flip_time - submit_time) / frame_time
>>>
>>> With maybe some rounding up and averaging. You can also hit this when your
>>> 3d engine has an extremely deep pipeline (like some of the tiling
>>> renders have), where rendering just takes forever, but as long as you keep
>>> 2 frames in the renderer in-flight you can achieve full refresh rate (at a
>>> latency cost).
>>
>> As long as a page flip submitted after vblank N can complete during
>> vblank N+1, full frame rate can be sustained[0]. User space can use as
>> many buffers as needed to keep the rendering pipeline busy.
>>
>> [0] This is broken by the mis-aligned compositor repaint cycles: The
>> flip from the guest compositor misses the host compositor's cycle, so it
>> takes more than one display refresh cycle to complete.
>>
>>
>>> So kernel can't really tell you in all cases how many buffers you should
>>> have.
>>
>> That's not exactly what I mean. Right now, KMS user space has to wait
>> for a flip to complete before submitting another one, or it gets EBUSY.
>> So if the kernel wants to allow multiple flips to be submitted, it has
>> to somehow tell user space that this is possible, or it'll never happen.
>> And the kernel should never advertise this for bare metal, since it's
>> not needed there (and undesirable).
> 
> Oh the existence of the deep queue needs a getcap ofc.
> 
> Also, deep queues do exist in hw (including scheduling to the right
> frame), the benefit is reduced battery usage for e.g. video playback if
> you do the rendering for an entire set of frames and then just let the
> display show them.
> 
> It costs latency ofc (or we need a cancellable queue, which once it's in
> hw is tricky).

And if it's not in HW, it can be handled by the user-space display server instead of the kernel (except for cancelling pending flips).

(Note that this is currently not possible either way with Wayland, since it uses mailbox semantics; there are proposals for a Wayland extension which allows queuing multiple frames though. Meanwhile, this would need to be handled in the Wayland clients.)

Gerd Hoffmann Aug. 2, 2021, 12:50 p.m. UTC | #12

Hi,

> > That sounds sensible to me.  Fence the virtio commands, make sure (on
> > the host side) the command completes only when the work is actually done
> > not only submitted.  Has recently been added to qemu for RESOURCE_FLUSH
> > (aka frontbuffer rendering) and doing the same for SET_SCANOUT (aka
> > pageflipping), then send vblank events to userspace on command
> > completion certainly makes sense.
> 
> Hm how does this all work? At least drm/virtio uses
> drm_atomic_helper_dirtyfb, so both DIRTYFB ioctl and atomic flips all end
> up in the same driver path for everything. Or do you just combine the
> resource_flush with the flip as needed and let the host side figure it all
> out? From a quick read of virtgpu_plane.c that seems to be the case ...

virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for
DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I
think for the page-flip case the host (aka qemu) doesn't get the
"wait until old framebuffer is not in use any more" right yet.

So we'll need a host-side fix for that and a guest-side fix to switch
from a blocking wait on the fence to vblank events.

> Also to make this work we don't just need the fence, we need the timestamp
> (in a clock domain the guest can correct for ofc) of the host side kms
> driver flip completion. If you just have the fence then the jitter from
> going through all the layers will most likely make it unusable.

Well, there are no timestamps in the virtio-gpu protocol ...

Also I'm not sure they would be that helpful, any timing is *much* less
predictable in a virtual machine, especially in case the host machine is
loaded.

take care,
  Gerd

Daniel Vetter Aug. 2, 2021, 2:35 p.m. UTC | #13

On Mon, Aug 2, 2021 at 2:51 PM Gerd Hoffmann <kraxel@redhat.com> wrote:
>
>   Hi,
>
> > > That sounds sensible to me.  Fence the virtio commands, make sure (on
> > > the host side) the command completes only when the work is actually done
> > > not only submitted.  Has recently been added to qemu for RESOURCE_FLUSH
> > > (aka frontbuffer rendering) and doing the same for SET_SCANOUT (aka
> > > pageflipping), then send vblank events to userspace on command
> > > completion certainly makes sense.
> >
> > Hm how does this all work? At least drm/virtio uses
> > drm_atomic_helper_dirtyfb, so both DIRTYFB ioctl and atomic flips all end
> > up in the same driver path for everything. Or do you just combine the
> > resource_flush with the flip as needed and let the host side figure it all
> > out? From a quick read of virtgpu_plane.c that seems to be the case ...
>
> virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for
> DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I
> think for the page-flip case the host (aka qemu) doesn't get the
> "wait until old framebuffer is not in use any more" right yet.

Hm reading the code I think you simply elide the set_scanout if it's
still the same buffer. There's no difference betweeen dirtyfb and an
atomic commit that just hands the damage rects to the driver. At least
if you use the helpers.

> So we'll need a host-side fix for that and a guest-side fix to switch
> from a blocking wait on the fence to vblank events.
>
> > Also to make this work we don't just need the fence, we need the timestamp
> > (in a clock domain the guest can correct for ofc) of the host side kms
> > driver flip completion. If you just have the fence then the jitter from
> > going through all the layers will most likely make it unusable.
>
> Well, there are no timestamps in the virtio-gpu protocol ...
>
> Also I'm not sure they would be that helpful, any timing is *much* less
> predictable in a virtual machine, especially in case the host machine is
> loaded.

Hm yeah if the output is currently not displaying, then the timestamp
is very fake. But if you display you should be able to pass it all
around in both direction. So target vblank (or whatever it's called)
would go from guest to host to host-compositor (over wayland protocol)
to host-side kms, and the timestamp could travel all the way back.

But yeah making this all work correctly is going to be a pile of work.
Also I have no idea how well compositors take it when a kms driver
switches between high-precision timestamps and frame scheduling to the
entirely virtual/vblank-less approach on the fly.
-Daniel

> take care,
>   Gerd
>

Vivek Kasireddy Aug. 3, 2021, 6:11 a.m. UTC | #14

Hi Daniel,

> > > > By separating the OUT_FENCE signalling from pageflip completion allows
> > > > a Guest compositor to start a new repaint cycle with a new buffer
> > > > instead of waiting for the old buffer to be free.
> > > >
> > > > This work is based on the idea/suggestion from Simon and Pekka.
> > > >
> > > > This capability can be a solution for this issue:
> > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> > > >
> > > > Corresponding Weston MR:
> > > > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> > >
> > > Uh I kinda wanted to discuss this a bit more before we jump into typing
> > > code, but well I guess not that much work yet.
> > [Kasireddy, Vivek] Right, it wasn't a lot of work :)
> >
> > >
> > > So maybe I'm not understanding the problem, but I think the fundamental
> > > underlying issue is that with KMS you can have at most 2 buffers
> > > in-flight, due to our queue depth limit of 1 pending flip.
> > [Kasireddy, Vivek] Let me summarize the problem again from the perspective of
> > both the Host (Weston) and Guest (Weston) compositors assuming a refresh-rate
> > of 60 -- which implies the Vblank/Vsync is generated every ~16.66 ms.
> > Host compositor:
> > - After a pageflip completion event, it starts its next repaint cycle by waiting for 9 ms
> > and then submits the atomic commit and at the tail end of its cycle sends a frame
> callback
> > event to all its clients (who registered and submitted frames) indicating to them to
> > start their next redraw  -- giving them at-least ~16 ms to submit a new frame to be
> > included in its next repaint. Why a configurable 9 ms delay is needed is explained
> > in Pekka's blog post here:
> > https://ppaalanen.blogspot.com/2015/02/weston-repaint-scheduling.html
> >
> > - It'll send a wl_buffer.release event for a client submitted previous buffer only
> > when the client has submitted a new buffer and:
> > a) When it hasn't started its repaint cycle yet OR
> > b) When it clears its old state after it gets a pageflip completion event -- if it had
> > flipped the client's buffer onto a hardware plane.
> >
> > Guest compositor:
> > - After a pageflip completion is sent by Guest KMS, it takes about 10-12 ms for the
> > Guest compositor to submit a new atomic commit. This time of 10-12 ms includes the
> > 9 ms wait -- just like the Host compositor -- for its clients to submit new buffers.
> > - When it gets a pageflip completion, it assumes that the previously submitted buffer
> > is free for re-use and uses it again -- resulting in the usage of only 2 out of a maximum
> > of 4 backbuffers included as part of the Mesa GBM surface implementation.
> >
> > Guest KMS/Virtio-gpu/Qemu Wayland UI:
> > - Because no_vblank=true for Guest KMS and since the vblank event (which also serves
> > as the pageflip completion event for user-space) is sent right away after atomic commit,
> > as Gerd said, we use an internal dma-fence to block/wait the Guest KMS until we know
> for
> > sure that the Host is completely done using the buffer. To ensure this, we signal the
> dma-fence
> > only after the Host compositor sends a wl_buffer.release event or an equivalent signal.
> >
> > The goal:
> > - Maintain full framerate even when the Guest scanout FB is flipped onto a hardware
> plane
> > on the Host -- regardless of either compositor's scheduling policy -- without making any
> > copies and ensuring that both Host and Guest are not accessing the buffer at the same
> time.
> >
> > The problem:
> > - If the Host compositor flips the client's buffer (in this case Guest compositor's buffer)
> > onto a hardware plane, then it can send a wl_buffer.release event for the previous buffer
> > only after it gets a pageflip completion. And, if the Guest compositor takes 10-12 ms to
> > submit a new buffer and given the fact that the Host compositor waits only for 9 ms, the
> > Guest compositor will miss the Host's repaint cycle resulting in halved frame-rate.
> >
> > The solution:
> > - To ensure full framerate, the Guest compositor has to start it's repaint cycle (including
> > the 9 ms wait) when the Host compositor sends the frame callback event to its clients.
> > In order for this to happen, the dma-fence that the Guest KMS waits on -- before sending
> > pageflip completion -- cannot be tied to a wl_buffer.release event. This means that, the
> > Guest compositor has to be forced to use a new buffer for its next repaint cycle when it
> > gets a pageflip completion.
> 
> Is that really the only solution?
[Kasireddy, Vivek] There are a few others I mentioned here:
https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
But I think none of them are as compelling as this one.

> 
> If we fix the event timestamps so that both guest and host use the same
> timestamp, but then the guest starts 5ms (or something like that) earlier,
> then things should work too? I.e.
> - host compositor starts at (previous_frametime + 9ms)
> - guest compositor starts at (previous_frametime + 4ms)
> 
> Ofc this only works if the frametimes we hand out to both match _exactly_
> and are as high-precision as the ones on the host side. Which for many gpu
> drivers at least is the case, and all the ones you care about for sure :-)
> 
> But if the frametimes the guest receives are the no_vblank fake ones, then
> they'll be all over the place and this carefully tuned low-latency redraw
> loop falls apart. Aside fromm the fact that without tuning the guests to
> be earlier than the hosts, you're guaranteed to miss every frame (except
> when the timing wobbliness in the guest is big enough by chance to make
> the deadline on the oddball frame).
[Kasireddy, Vivek] The Guest and Host use different event timestamps as we don't
share these between the Guest and the Host. It does not seem to be causing any other
problems so far but we did try the experiment you mentioned (i.e., adjusting the delays)
and it works. However, this patch series is meant to fix the issue without having to tweak
anything (delays) because we can't do this for every compositor out there.

> 
> > - The Weston MR I linked above does this by getting an out_fence fd and taking a
> reference
> > on all the FBs included in the atomic commit forcing the compositor to use new FBs for
> its
> > next repaint cycle. It releases the references when the out_fence is signalled later when
> > the Host compositor sends a wl_buffer.release event.
> >
> > >
> > > Unfortunately that means for virtual hw where it takes a few more
> > > steps/vblanks until the framebuffer actually shows up on screen and is
> > > scanned out, we suffer deeply. The usual fix for that is to drop the
> > > latency and increase throughput, and have more buffers in-flight. Which
> > > this patch tries to do.
> > >
> > > Now I think where we go wrong here is that we're trying to hack this up by
> > > defining different semantics for the out-fence and for the drm-event. Imo
> > > that's wrong, they're both meant to show eactly the same thing:
> > > - when is the new frame actually visible to the user (as in, eyeballs in a
> > >   human head, preferrably, not the time when we've handed the buffer off
> > >   to the virtual hw)
> > > - when is the previous buffer no longer being used by the scanout hw
> > [Kasireddy, Vivek] Right, they both mean the same thing but I think using both
> > at the same time would be redundant in the case of Weston. That's why I am trying
> > to repurpose the usage of out_fence in this case by introducing a new capability
> > that may not be relevant for bare-metal KMS drivers but would be useful for
> > virtual KMS drivers.
> >
> > >
> > > We do cheat a bit right now in so far that we assume they're both the
> > > same, as in, panel-side latency is currently the compositor's problem to
> > > figure out.
> > >
> > > So for virtual hw I think the timestamp and even completion really need to
> > > happen only when the buffer has been pushed through the entire
> > > virtualization chain, i.e. ideally we get the timestamp from the kms
> > > driver from the host side. Currently that's not done, so this is most
> > > likely quite broken already (virtio relies on the no-vblank auto event
> > > sending, which definitely doesn't wait for anything, or I'm completely
> > > missing something).
> > [Kasireddy, Vivek] You are right; virtio_gpu does use the no_vblank auto event but
> > as I mentioned above we do use an internal dma-fence to wait until the submitted
> > buffer is no longer used by the Host. In other words, we wait (in update_planes hook)
> > until we get an appropriate signal from the Host to proceed to make sure that we are
> > not rendering faster than what the Host can display.
> 
> Yeah that internal dma_fence really should be the flip completion event
> too. That's how this uapi is supposed to work.
> 
> Once you have that then maybe weston magically works because it realizes
> that it misses the frames it's aiming for. Or at least there will be debug
> output about that I hope (I'm not sure the auto-tuning works/exists).
[Kasireddy, Vivek] Even if we send the flip completion event from the driver instead of
the DRM core, I don't think it'll make any difference. The only advantage I can see is
that the driver would be in control of both the event and the out_fence and can leverage
it for this specific use-case.

> 
> > However, as you suggest below, we could set no_vblank=false and send the vblank/
> > pageflip completion event from the virtio-gpu driver instead of having the DRM
> > core send it. This can prevent the DRM core from signalling the out_fence as well
> > which is my intended objective and what my first patch tries to do. I'd still need the
> > new capability though to include the patch in Weston that deals with out_fence --
> > unless Weston upstream can accept the patch after reviewing it without this newly
> > added capability which would be redundant but it does solve my problem. Would
> > this be acceptable?
> 
> out fence and flip completion event are exactly the same thing
> semantically. Well, before your patch here at least. So if you fix up the
> internal crtc->event handling then you fix up both. That's very much by
> design, because otherwise we'd have a bunch of kms drivers that only work
> on Android (which uses out-fence), and the others only work on dekstop
> linux (which uses flip completion drm_event). And probably very few that
> support both.
[Kasireddy, Vivek] Hmm, I think I see your point. If a Guest exclusively uses 
either out_fence or drm event, then this idea wont work because I am trying to
create a distinction between the two to mean: repaint when you get pageflip 
completion and just drop references when an out_fence is signalled. However,
looking at the code in drm_send_event_helper(), I see that the out_fence is
signaled only if the userspace subscribed for an event. Can the out_fence be
signaled without a corresponding drm event?

> 
> > > I think instead of hacking up some ill-defined 1.5 queue depth support,
> > > what we should do is support queue depth > 1 properly. So:
> > >
> > > - Change atomic to support queue depth > 1, this needs to be a per-driver
> > >   thing due to a bunch of issues in driver code. Essentially drivers must
> > >   never look at obj->state pointers, and only ever look up state through
> > >   the passed-in drm_atomic_state * update container.
> > >
> > > - Aside: virtio should loose all it's empty hooks, there's no point in
> > >   that.
> > >
> > > - We fix virtio to send out the completion event at the end of this entire
> > >   pipeline, i.e. virtio code needs to take care of sending out the
> > >   crtc_state->event correctly.
> > >
> > > - We probably also want some kind of (maybe per-crtc) recommended queue
> > >   depth property so compositors know how many buffers to keep in flight.
> > >   Not sure about that.
> > >
> > > It's a bit more work, but also a lot less hacking around infrastructure in
> > > dubious ways.
> > >
> > > Thoughts?
> > [Kasireddy, Vivek] IIUC, you are suggesting that we should make it possible to
> > submit a new atomic commit even though the completion event for the previous
> > one has not come in yet. This may potentially solve my problem but it sounds very
> > disruptive and not very useful for bare-metal cases. It also means that the compositors,
> > DRM core and the drivers need to keep track of multiple states -- as opposed to new and
> > old -- for all objects such as crtcs, planes, etc and account for multiple completion
> events.
> > I guess it is doable but as you suggest it seems like a lot of work with many pitfalls
> ahead.
> 
> Queue deeper than 1 has been an eventual goal for atomic since the start,
> we simply didn't get around to it.
> 
> All the state handling and helpers are built to support that (but there
> could be more bugs). The only rule drivers must follow is that in their
> atomic_commit code they never look at the various obj->state pointers
> (like drm_crtc->state), since that might be the state of a subsequent
> commit. Instead they must only get the state through the drm_atomic_state
> structure. We've recently also updated all the helpers to pass that around
> everywhere (for other reasons), so the challenge here is only to fix up
> individual drivers. And maybe come up with some debug checks to make the
> obj->state pointers aren't used in atomic_commit.
[Kasireddy, Vivek] Ok, if a significant amount of preparatory work has already
been done, then your suggestion to increase the queue depth does not sound
that onerous though.

> 
> From a design pov I think your approach of hacking up the event machinery
> to slip in 2 commits while not actually using the 2 deep queue stuff like
> it's meant to be is much worse.
> 
> On the userspace side I'm not sure why you need to keep track of more
> state. All you need to keep track is of more buffers in your retire/reuse
> list, but you have to do that with your proposal here too. So no
> difference at all there.
> 
> Anyway it sounds like if the guest compositor would adjust it's deadline
> so that the guest and host compositor interleave correctly, then we should
> still be able to hit full refresh rate without a deeper queue. Has that
> been looked into?
[Kasireddy, Vivek] Yeah, as I mentioned above, that was the first thing we 
tried and it worked (i.e., we get full frame-rate). But it obviously is not a 
solution that'll work for all Guest compositors as their scheduling policies
may not be tweakable.

It sounds like you are recommending the queue depth increase as the only
viable solution. We'd look into that but I am unable to see a clear picture in
terms of how it would play out with the Guest compositor. A Guest compositor
starts its repaint cycle after it gets a pageflip completion or an out_fence signal;
if it determines that the latency is high, then it can try to increase the queue depth
by submitting atomic commits without waiting for the pageflip completion/
out_fence. Once it starts doing this, I am wondering when it can repaint again 
given that there will be multiple completion events coming in. Should there be 
separate events for vblank (to mean start repaint with new buffer) and pageflip 
completion (to mean drop references to old FBs)? And, as I mentioned earlier, 
the Guest compositor has to start its repaint cycle when the Host compositor
sends a frame callback event otherwise it won't work.

Thanks,
Vivek

> -Daniel
> 
> >
> > Thanks,
> > Vivek
> > >
> > > Cheers, Daniel
> > >
> > > >
> > > > Cc: Daniel Vetter <daniel@ffwll.ch>
> > > > Cc: Gerd Hoffmann <kraxel@redhat.com>
> > > > Cc: Pekka Paalanen <ppaalanen@gmail.com>
> > > > Cc: Simon Ser <contact@emersion.fr>
> > > > Cc: Michel Dänzer <michel@daenzer.net>
> > > > Cc: Tina Zhang <tina.zhang@intel.com>
> > > > Cc: Dongwon Kim <dongwon.kim@intel.com>
> > > >
> > > > Vivek Kasireddy (4):
> > > >   drm: Add a capability flag to support deferred out_fence signalling
> > > >   virtio-gpu uapi: Add VIRTIO_GPU_F_OUT_FENCE feature
> > > >   drm/virtio: Add VIRTIO_GPU_CMD_RESOURCE_OUT_FENCE cmd
> > > >   drm/virtio: Probe and implement VIRTIO_GPU_F_OUT_FENCE feature
> > > >
> > > >  drivers/gpu/drm/drm_file.c               | 11 +++---
> > > >  drivers/gpu/drm/drm_ioctl.c              |  3 ++
> > > >  drivers/gpu/drm/virtio/virtgpu_debugfs.c |  1 +
> > > >  drivers/gpu/drm/virtio/virtgpu_drv.c     |  1 +
> > > >  drivers/gpu/drm/virtio/virtgpu_drv.h     |  6 ++++
> > > >  drivers/gpu/drm/virtio/virtgpu_fence.c   |  9 +++++
> > > >  drivers/gpu/drm/virtio/virtgpu_kms.c     | 10 ++++--
> > > >  drivers/gpu/drm/virtio/virtgpu_plane.c   | 44 +++++++++++++++++++++++-
> > > >  drivers/gpu/drm/virtio/virtgpu_vq.c      | 17 +++++++++
> > > >  include/drm/drm_mode_config.h            |  9 +++++
> > > >  include/uapi/drm/drm.h                   |  1 +
> > > >  include/uapi/linux/virtio_gpu.h          | 12 +++++++
> > > >  12 files changed, 117 insertions(+), 7 deletions(-)
> > > >
> > > > --
> > > > 2.30.2
> > > >
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Vivek Kasireddy Aug. 3, 2021, 6:18 a.m. UTC | #15

Hi Gerd,

> 
>   Hi,
> 
> > > That sounds sensible to me.  Fence the virtio commands, make sure (on
> > > the host side) the command completes only when the work is actually done
> > > not only submitted.  Has recently been added to qemu for RESOURCE_FLUSH
> > > (aka frontbuffer rendering) and doing the same for SET_SCANOUT (aka
> > > pageflipping), then send vblank events to userspace on command
> > > completion certainly makes sense.
> >
> > Hm how does this all work? At least drm/virtio uses
> > drm_atomic_helper_dirtyfb, so both DIRTYFB ioctl and atomic flips all end
> > up in the same driver path for everything. Or do you just combine the
> > resource_flush with the flip as needed and let the host side figure it all
> > out? From a quick read of virtgpu_plane.c that seems to be the case ...
> 
> virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for
> DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I
> think for the page-flip case the host (aka qemu) doesn't get the
> "wait until old framebuffer is not in use any more" right yet.
[Kasireddy, Vivek] As you know, with the GTK UI backend and this patch series: 
https://lists.nongnu.org/archive/html/qemu-devel/2021-06/msg06745.html
we do create a sync file fd -- after the Blit -- and wait (adding it to Qemu's main
event loop) for it to ensure that the Guest scanout FB is longer in use on the Host.
This mechanism works in a similarly way for both frontbuffer DIRTYFB case and
also the double-buffer case. 

The out_fence work is only relevant for the future Wayland UI backend though.

> 
> So we'll need a host-side fix for that and a guest-side fix to switch
> from a blocking wait on the fence to vblank events.
[Kasireddy, Vivek] Do you see any concerns with the blocking wait? And, are you
suggesting that we use a vblank timer? Not sure if that would be needed because it
would not align with the render/draw signals used with GTK. And, the DRM core
does send out an event -- immediately after the blocking wait -- to Guest compositor
as no_vblank=true.

> 
> > Also to make this work we don't just need the fence, we need the timestamp
> > (in a clock domain the guest can correct for ofc) of the host side kms
> > driver flip completion. If you just have the fence then the jitter from
> > going through all the layers will most likely make it unusable.
> 
> Well, there are no timestamps in the virtio-gpu protocol ...
> 
> Also I'm not sure they would be that helpful, any timing is *much* less
> predictable in a virtual machine, especially in case the host machine is
> loaded.
[Kasireddy, Vivek] I agree; I think sharing the Host timestamps with the Guest or 
vice-versa may not be useful. We have not run into any problems without these so far.

Thanks,
Vivek

> 
> take care,
>   Gerd

Michel Dänzer Aug. 3, 2021, 7:33 a.m. UTC | #16

On 2021-08-03 8:11 a.m., Kasireddy, Vivek wrote:
> 
>>> The goal:
>>> - Maintain full framerate even when the Guest scanout FB is flipped onto a hardware
>> plane
>>> on the Host -- regardless of either compositor's scheduling policy -- without making any
>>> copies and ensuring that both Host and Guest are not accessing the buffer at the same
>> time.
>>>
>>> The problem:
>>> - If the Host compositor flips the client's buffer (in this case Guest compositor's buffer)
>>> onto a hardware plane, then it can send a wl_buffer.release event for the previous buffer
>>> only after it gets a pageflip completion. And, if the Guest compositor takes 10-12 ms to
>>> submit a new buffer and given the fact that the Host compositor waits only for 9 ms, the
>>> Guest compositor will miss the Host's repaint cycle resulting in halved frame-rate.
>>>
>>> The solution:
>>> - To ensure full framerate, the Guest compositor has to start it's repaint cycle (including
>>> the 9 ms wait) when the Host compositor sends the frame callback event to its clients.
>>> In order for this to happen, the dma-fence that the Guest KMS waits on -- before sending
>>> pageflip completion -- cannot be tied to a wl_buffer.release event. This means that, the
>>> Guest compositor has to be forced to use a new buffer for its next repaint cycle when it
>>> gets a pageflip completion.
>>
>> Is that really the only solution?
> [Kasireddy, Vivek] There are a few others I mentioned here:
> https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> But I think none of them are as compelling as this one.
> 
>>
>> If we fix the event timestamps so that both guest and host use the same
>> timestamp, but then the guest starts 5ms (or something like that) earlier,
>> then things should work too? I.e.
>> - host compositor starts at (previous_frametime + 9ms)
>> - guest compositor starts at (previous_frametime + 4ms)
>>
>> Ofc this only works if the frametimes we hand out to both match _exactly_
>> and are as high-precision as the ones on the host side. Which for many gpu
>> drivers at least is the case, and all the ones you care about for sure :-)
>>
>> But if the frametimes the guest receives are the no_vblank fake ones, then
>> they'll be all over the place and this carefully tuned low-latency redraw
>> loop falls apart. Aside fromm the fact that without tuning the guests to
>> be earlier than the hosts, you're guaranteed to miss every frame (except
>> when the timing wobbliness in the guest is big enough by chance to make
>> the deadline on the oddball frame).
> [Kasireddy, Vivek] The Guest and Host use different event timestamps as we don't
> share these between the Guest and the Host. It does not seem to be causing any other
> problems so far but we did try the experiment you mentioned (i.e., adjusting the delays)
> and it works. However, this patch series is meant to fix the issue without having to tweak
> anything (delays) because we can't do this for every compositor out there.

Maybe there could be a mechanism which allows the compositor in the guest to automatically adjust its repaint cycle as needed.

This might even be possible without requiring changes in each compositor, by adjusting the vertical blank periods in the guest to be aligned with the host compositor repaint cycles. Not sure about that though.

Even if not, both this series or making it possible to queue multiple flips require corresponding changes in each compositor as well to have any effect.

Gerd Hoffmann Aug. 3, 2021, 7:51 a.m. UTC | #17

Hi,

> > virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for
> > DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I
> > think for the page-flip case the host (aka qemu) doesn't get the
> > "wait until old framebuffer is not in use any more" right yet.
> [Kasireddy, Vivek] As you know, with the GTK UI backend and this patch series: 
> https://lists.nongnu.org/archive/html/qemu-devel/2021-06/msg06745.html
> we do create a sync file fd -- after the Blit -- and wait (adding it to Qemu's main
> event loop) for it to ensure that the Guest scanout FB is longer in use on the Host.
> This mechanism works in a similarly way for both frontbuffer DIRTYFB case and
> also the double-buffer case. 

Well, we don't explicitly wait on the old framebuffer.  Not fully sure
this is actually needed, maybe the command ordering (SET_SCANOUT goes
first) is enough.

> > So we'll need a host-side fix for that and a guest-side fix to switch
> > from a blocking wait on the fence to vblank events.
> [Kasireddy, Vivek] Do you see any concerns with the blocking wait?

Well, it's sync vs. async for userspace.

With the blocking wait the userspace ioctl (PAGE_FLIP or the atomic
version of it) will return when the host is done.

Without the blocking wait the userspace ioctl will return right away and
userspace can do something else until the host is done (and the vbland
event is sent to notify userspace).

> And, are you
> suggesting that we use a vblank timer?

I think we should send the vblank event when the RESOURCE_FLUSH fence
signals the host is done.

take care,
  Gerd

Vivek Kasireddy Aug. 4, 2021, 7:25 a.m. UTC | #18

Hi Michel,

> >
> >>> The goal:
> >>> - Maintain full framerate even when the Guest scanout FB is flipped onto a hardware
> >> plane
> >>> on the Host -- regardless of either compositor's scheduling policy -- without making
> any
> >>> copies and ensuring that both Host and Guest are not accessing the buffer at the same
> >> time.
> >>>
> >>> The problem:
> >>> - If the Host compositor flips the client's buffer (in this case Guest compositor's
> buffer)
> >>> onto a hardware plane, then it can send a wl_buffer.release event for the previous
> buffer
> >>> only after it gets a pageflip completion. And, if the Guest compositor takes 10-12 ms
> to
> >>> submit a new buffer and given the fact that the Host compositor waits only for 9 ms,
> the
> >>> Guest compositor will miss the Host's repaint cycle resulting in halved frame-rate.
> >>>
> >>> The solution:
> >>> - To ensure full framerate, the Guest compositor has to start it's repaint cycle
> (including
> >>> the 9 ms wait) when the Host compositor sends the frame callback event to its clients.
> >>> In order for this to happen, the dma-fence that the Guest KMS waits on -- before
> sending
> >>> pageflip completion -- cannot be tied to a wl_buffer.release event. This means that,
> the
> >>> Guest compositor has to be forced to use a new buffer for its next repaint cycle when
> it
> >>> gets a pageflip completion.
> >>
> >> Is that really the only solution?
> > [Kasireddy, Vivek] There are a few others I mentioned here:
> > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > But I think none of them are as compelling as this one.
> >
> >>
> >> If we fix the event timestamps so that both guest and host use the same
> >> timestamp, but then the guest starts 5ms (or something like that) earlier,
> >> then things should work too? I.e.
> >> - host compositor starts at (previous_frametime + 9ms)
> >> - guest compositor starts at (previous_frametime + 4ms)
> >>
> >> Ofc this only works if the frametimes we hand out to both match _exactly_
> >> and are as high-precision as the ones on the host side. Which for many gpu
> >> drivers at least is the case, and all the ones you care about for sure :-)
> >>
> >> But if the frametimes the guest receives are the no_vblank fake ones, then
> >> they'll be all over the place and this carefully tuned low-latency redraw
> >> loop falls apart. Aside fromm the fact that without tuning the guests to
> >> be earlier than the hosts, you're guaranteed to miss every frame (except
> >> when the timing wobbliness in the guest is big enough by chance to make
> >> the deadline on the oddball frame).
> > [Kasireddy, Vivek] The Guest and Host use different event timestamps as we don't
> > share these between the Guest and the Host. It does not seem to be causing any other
> > problems so far but we did try the experiment you mentioned (i.e., adjusting the delays)
> > and it works. However, this patch series is meant to fix the issue without having to tweak
> > anything (delays) because we can't do this for every compositor out there.
> 
> Maybe there could be a mechanism which allows the compositor in the guest to
> automatically adjust its repaint cycle as needed.
> 
> This might even be possible without requiring changes in each compositor, by adjusting
> the vertical blank periods in the guest to be aligned with the host compositor repaint
> cycles. Not sure about that though.
[Kasireddy, Vivek] The problem really is that the Guest compositor -- or any other compositor
for that matter -- assumes that after a pageflip completion, the old buffer submitted in the
previous flip is free and can be reused again. I think this is a guarantee given by KMS. If we have
to enforce this, we (Guest KMS) have to wait until the Host compositor sends a wl_buffer.release --
which can only happen after Host gets a pageflip completion assuming it uses hardware planes .
From this point onwards, the Guest compositor only has 9 ms (in the case of Weston) -- or less
based on the Host compositor's scheduling policy -- to submit a new frame.

Although, we can adjust the repaint-window of the Guest compositor to ensure a submission 
within 9 ms or increase the delay on the Host, these tweaks are just heuristics. I think in order
to have a generic solution that'll work in all cases means that the Guest compositor has to start
its repaint cycle with a new buffer when the Host sends out the frame callback event.

> 
> Even if not, both this series or making it possible to queue multiple flips require
> corresponding changes in each compositor as well to have any effect.
[Kasireddy, Vivek] Yes, unfortunately; but the hope is that the Guest KMS can do most of
the heavy lifting and keep the changes for the compositors generic enough and minimal.

Thanks,
Vivek
> 
> 
> --
> Earthling Michel Dänzer               |               https://redhat.com
> Libre software enthusiast             |             Mesa and X developer

Vivek Kasireddy Aug. 4, 2021, 7:27 a.m. UTC | #19

Hi Gerd,

> 
> > > virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for
> > > DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I
> > > think for the page-flip case the host (aka qemu) doesn't get the
> > > "wait until old framebuffer is not in use any more" right yet.
> > [Kasireddy, Vivek] As you know, with the GTK UI backend and this patch series:
> > https://lists.nongnu.org/archive/html/qemu-devel/2021-06/msg06745.html
> > we do create a sync file fd -- after the Blit -- and wait (adding it to Qemu's main
> > event loop) for it to ensure that the Guest scanout FB is longer in use on the Host.
> > This mechanism works in a similarly way for both frontbuffer DIRTYFB case and
> > also the double-buffer case.
> 
> Well, we don't explicitly wait on the old framebuffer.  Not fully sure
> this is actually needed, maybe the command ordering (SET_SCANOUT goes
> first) is enough.
[Kasireddy, Vivek] When the sync file fd is signaled, the new FB can be considered done/free
on the Host; and, when this new FB becomes the old FB -- after another FB is submitted
by the Guest -- we don't need to explicitly wait as we already did that in the previous
cycle. 

Strictly speaking, in the double-buffered Guest case, we should be waiting for the
sync file fd of the old FB and not the new one. However, if we do this, we saw that
the Guest will render faster (~90 FPS) than what the Host can consume (~60 FPS)
resulting in unnecessary GPU cycles. And, in addition, we can't be certain about
whether a Guest is using double-buffering or single as we noticed that Windows
Guests tend to switch between single and double-buffering at runtime based on
the damage, etc.

> 
> > > So we'll need a host-side fix for that and a guest-side fix to switch
> > > from a blocking wait on the fence to vblank events.
> > [Kasireddy, Vivek] Do you see any concerns with the blocking wait?
> 
> Well, it's sync vs. async for userspace.
> 
> With the blocking wait the userspace ioctl (PAGE_FLIP or the atomic
> version of it) will return when the host is done.
> 
> Without the blocking wait the userspace ioctl will return right away and
> userspace can do something else until the host is done (and the vbland
> event is sent to notify userspace).
[Kasireddy, Vivek] Right, but upstream Weston -- and I am guessing Mutter as well -- 
almost always choose DRM_MODE_ATOMIC_NONBLOCK. In this case, the
atomic ioctl call would not block and the blocking wait will instead happen in the
commit_work/commit_tail workqueue thread.

> 
> > And, are you
> > suggesting that we use a vblank timer?
> 
> I think we should send the vblank event when the RESOURCE_FLUSH fence
> signals the host is done.
[Kasireddy, Vivek] That is how it works now:
        drm_atomic_helper_commit_planes(dev, old_state, 0);

        drm_atomic_helper_commit_modeset_enables(dev, old_state);

        drm_atomic_helper_fake_vblank(old_state);

The blocking wait is in the plane_update hook called by drm_atomic_helper_commit_planes()
and immediately after that the fake vblank is sent.

Thanks,
Vivek
> 
> take care,
>   Gerd

Daniel Vetter Aug. 4, 2021, 12:11 p.m. UTC | #20

On Tue, Aug 3, 2021 at 9:34 AM Michel Dänzer <michel@daenzer.net> wrote:
> On 2021-08-03 8:11 a.m., Kasireddy, Vivek wrote:
> >>> The goal:
> >>> - Maintain full framerate even when the Guest scanout FB is flipped onto a hardware
> >> plane
> >>> on the Host -- regardless of either compositor's scheduling policy -- without making any
> >>> copies and ensuring that both Host and Guest are not accessing the buffer at the same
> >> time.
> >>>
> >>> The problem:
> >>> - If the Host compositor flips the client's buffer (in this case Guest compositor's buffer)
> >>> onto a hardware plane, then it can send a wl_buffer.release event for the previous buffer
> >>> only after it gets a pageflip completion. And, if the Guest compositor takes 10-12 ms to
> >>> submit a new buffer and given the fact that the Host compositor waits only for 9 ms, the
> >>> Guest compositor will miss the Host's repaint cycle resulting in halved frame-rate.
> >>>
> >>> The solution:
> >>> - To ensure full framerate, the Guest compositor has to start it's repaint cycle (including
> >>> the 9 ms wait) when the Host compositor sends the frame callback event to its clients.
> >>> In order for this to happen, the dma-fence that the Guest KMS waits on -- before sending
> >>> pageflip completion -- cannot be tied to a wl_buffer.release event. This means that, the
> >>> Guest compositor has to be forced to use a new buffer for its next repaint cycle when it
> >>> gets a pageflip completion.
> >>
> >> Is that really the only solution?
> > [Kasireddy, Vivek] There are a few others I mentioned here:
> > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > But I think none of them are as compelling as this one.
> >
> >>
> >> If we fix the event timestamps so that both guest and host use the same
> >> timestamp, but then the guest starts 5ms (or something like that) earlier,
> >> then things should work too? I.e.
> >> - host compositor starts at (previous_frametime + 9ms)
> >> - guest compositor starts at (previous_frametime + 4ms)
> >>
> >> Ofc this only works if the frametimes we hand out to both match _exactly_
> >> and are as high-precision as the ones on the host side. Which for many gpu
> >> drivers at least is the case, and all the ones you care about for sure :-)
> >>
> >> But if the frametimes the guest receives are the no_vblank fake ones, then
> >> they'll be all over the place and this carefully tuned low-latency redraw
> >> loop falls apart. Aside fromm the fact that without tuning the guests to
> >> be earlier than the hosts, you're guaranteed to miss every frame (except
> >> when the timing wobbliness in the guest is big enough by chance to make
> >> the deadline on the oddball frame).
> > [Kasireddy, Vivek] The Guest and Host use different event timestamps as we don't
> > share these between the Guest and the Host. It does not seem to be causing any other
> > problems so far but we did try the experiment you mentioned (i.e., adjusting the delays)
> > and it works. However, this patch series is meant to fix the issue without having to tweak
> > anything (delays) because we can't do this for every compositor out there.
>
> Maybe there could be a mechanism which allows the compositor in the guest to automatically adjust its repaint cycle as needed.
>
> This might even be possible without requiring changes in each compositor, by adjusting the vertical blank periods in the guest to be aligned with the host compositor repaint cycles. Not sure about that though.
>
> Even if not, both this series or making it possible to queue multiple flips require corresponding changes in each compositor as well to have any effect.

Yeah from all the discussions and tests done it sounds even with a
deeper queue we have big coordination issues between the guest and
host compositor (like the example that the guest is now rendering at
90fps instead of 60fps like the host).

Hence my gut feeling reaction that first we need to get these two
compositors aligned in their timings, which propobably needs
consistent vblank periods/timestamps across them (plus/minux
guest/host clocksource fun ofc). Without this any of the next steps
will simply not work because there's too much jitter by the time the
guest compositor gets the flip completion events.

Once we have solid events I think we should look into statically
tuning guest/host compositor deadlines (like you've suggested in a
bunch of places) to consisently make that deadline and hit 60 fps.
With that we can then look into tuning this automatically and what to
do when e.g. switching between copying and zero-copy on the host side
(which might be needed in some cases) and how to handle all that.

Only when that all shows that we just can't hit 60fps consistently and
really need 3 buffers in flight should we look at deeper kms queues.
And then we really need to implement them properly and not with a
mismatch between drm_event an out-fence signalling. These quick hacks
are good for experiments, but there's a pile of other things we need
to do first. At least that's how I understand the problem here right
now.

Cheers, Daniel

>
>
> --
> Earthling Michel Dänzer               |               https://redhat.com
> Libre software enthusiast             |             Mesa and X developer

Vivek Kasireddy Aug. 5, 2021, 4:15 a.m. UTC | #21

Hi Daniel,

> > >>> The solution:
> > >>> - To ensure full framerate, the Guest compositor has to start it's repaint cycle
> (including
> > >>> the 9 ms wait) when the Host compositor sends the frame callback event to its
> clients.
> > >>> In order for this to happen, the dma-fence that the Guest KMS waits on -- before
> sending
> > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. This means that,
> the
> > >>> Guest compositor has to be forced to use a new buffer for its next repaint cycle
> when it
> > >>> gets a pageflip completion.
> > >>
> > >> Is that really the only solution?
> > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > But I think none of them are as compelling as this one.
> > >
> > >>
> > >> If we fix the event timestamps so that both guest and host use the same
> > >> timestamp, but then the guest starts 5ms (or something like that) earlier,
> > >> then things should work too? I.e.
> > >> - host compositor starts at (previous_frametime + 9ms)
> > >> - guest compositor starts at (previous_frametime + 4ms)
> > >>
> > >> Ofc this only works if the frametimes we hand out to both match _exactly_
> > >> and are as high-precision as the ones on the host side. Which for many gpu
> > >> drivers at least is the case, and all the ones you care about for sure :-)
> > >>
> > >> But if the frametimes the guest receives are the no_vblank fake ones, then
> > >> they'll be all over the place and this carefully tuned low-latency redraw
> > >> loop falls apart. Aside fromm the fact that without tuning the guests to
> > >> be earlier than the hosts, you're guaranteed to miss every frame (except
> > >> when the timing wobbliness in the guest is big enough by chance to make
> > >> the deadline on the oddball frame).
> > > [Kasireddy, Vivek] The Guest and Host use different event timestamps as we don't
> > > share these between the Guest and the Host. It does not seem to be causing any other
> > > problems so far but we did try the experiment you mentioned (i.e., adjusting the
> delays)
> > > and it works. However, this patch series is meant to fix the issue without having to
> tweak
> > > anything (delays) because we can't do this for every compositor out there.
> >
> > Maybe there could be a mechanism which allows the compositor in the guest to
> automatically adjust its repaint cycle as needed.
> >
> > This might even be possible without requiring changes in each compositor, by adjusting
> the vertical blank periods in the guest to be aligned with the host compositor repaint
> cycles. Not sure about that though.
> >
> > Even if not, both this series or making it possible to queue multiple flips require
> corresponding changes in each compositor as well to have any effect.
> 
> Yeah from all the discussions and tests done it sounds even with a
> deeper queue we have big coordination issues between the guest and
> host compositor (like the example that the guest is now rendering at
> 90fps instead of 60fps like the host).
[Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. That 90 FPS vs 
60 FPS problem is a completely different issue that is associated with Qemu GTK UI
backend. With the GTK backend -- and also with SDL backend -- we Blit the Guest
scanout FB onto one of the backbuffers managed by EGL. 

I am trying to add a new Qemu Wayland UI backend so that we can eliminate that Blit
and thereby have a truly zero-copy solution. And, this is there I am running into the 
halved frame-rate issue -- the current problem.

> 
> Hence my gut feeling reaction that first we need to get these two
> compositors aligned in their timings, which propobably needs
> consistent vblank periods/timestamps across them (plus/minux
> guest/host clocksource fun ofc). Without this any of the next steps
> will simply not work because there's too much jitter by the time the
> guest compositor gets the flip completion events.
[Kasireddy, Vivek] Timings are not a problem and do not significantly
affect the repaint cycles from what I have seen so far.

> 
> Once we have solid events I think we should look into statically
> tuning guest/host compositor deadlines (like you've suggested in a
> bunch of places) to consisently make that deadline and hit 60 fps.
> With that we can then look into tuning this automatically and what to
> do when e.g. switching between copying and zero-copy on the host side
> (which might be needed in some cases) and how to handle all that.
[Kasireddy, Vivek] As I confirm here: https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984065
tweaking the deadlines works (i.e., we get 60 FPS) as we expect. However,
I feel that this zero-copy solution I am trying to create should be independent
of compositors' deadlines, delays or other scheduling parameters.

> Only when that all shows that we just can't hit 60fps consistently and
> really need 3 buffers in flight should we look at deeper kms queues.
> And then we really need to implement them properly and not with a
> mismatch between drm_event an out-fence signalling. These quick hacks
> are good for experiments, but there's a pile of other things we need
> to do first. At least that's how I understand the problem here right
> now.
[Kasireddy, Vivek] Experiments done so far indicate that we can hit 59 FPS consistently
-- in a zero-copy way independent of compositors' delays/deadlines -- with this
patch series + the Weston MR I linked in the cover letter. The main reason why this
works is because we relax the assumption that when the Guest compositor gets a
pageflip completion event that it could reuse the old FB it submitted in the previous
atomic flip and instead force it to use a new one. And, we send the pageflip completion
event to the Guest when the Host compositor sends a frame callback event. Lastly,
we use the (deferred) out_fence as just a mechanism to tell the Guest compositor when
it can release references on old FBs so that they can be reused again.

With that being said, the only question is how can we accomplish the above in an upstream
acceptable way without regressing anything particularly on bare-metal. Its not clear if just
increasing the queue depth would work or not but I think the Guest compositor has to be told
when it can start its repaint cycle and when it can assume the old FB is no longer in use.
On bare-metal -- and also with VKMS as of today -- a pageflip completion indicates both.
In other words, Vblank event is the same as Flip done, which makes sense on bare-metal.
But if we were to have two events at-least for VKMS: vblank to indicate to Guest to start
repaint and flip_done to indicate to drop references on old FBs, I think this problem can
be solved even without increasing the queue depth. Can this be acceptable?

Thanks,
Vivek
> 
> Cheers, Daniel
> 
> >
> >
> > --
> > Earthling Michel Dänzer               |               https://redhat.com
> > Libre software enthusiast             |             Mesa and X developer
> 
> 
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Daniel Vetter Aug. 5, 2021, 12:08 p.m. UTC | #22

On Thu, Aug 05, 2021 at 04:15:27AM +0000, Kasireddy, Vivek wrote:
> Hi Daniel,
> 
> > > >>> The solution:
> > > >>> - To ensure full framerate, the Guest compositor has to start it's repaint cycle
> > (including
> > > >>> the 9 ms wait) when the Host compositor sends the frame callback event to its
> > clients.
> > > >>> In order for this to happen, the dma-fence that the Guest KMS waits on -- before
> > sending
> > > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. This means that,
> > the
> > > >>> Guest compositor has to be forced to use a new buffer for its next repaint cycle
> > when it
> > > >>> gets a pageflip completion.
> > > >>
> > > >> Is that really the only solution?
> > > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > > But I think none of them are as compelling as this one.
> > > >
> > > >>
> > > >> If we fix the event timestamps so that both guest and host use the same
> > > >> timestamp, but then the guest starts 5ms (or something like that) earlier,
> > > >> then things should work too? I.e.
> > > >> - host compositor starts at (previous_frametime + 9ms)
> > > >> - guest compositor starts at (previous_frametime + 4ms)
> > > >>
> > > >> Ofc this only works if the frametimes we hand out to both match _exactly_
> > > >> and are as high-precision as the ones on the host side. Which for many gpu
> > > >> drivers at least is the case, and all the ones you care about for sure :-)
> > > >>
> > > >> But if the frametimes the guest receives are the no_vblank fake ones, then
> > > >> they'll be all over the place and this carefully tuned low-latency redraw
> > > >> loop falls apart. Aside fromm the fact that without tuning the guests to
> > > >> be earlier than the hosts, you're guaranteed to miss every frame (except
> > > >> when the timing wobbliness in the guest is big enough by chance to make
> > > >> the deadline on the oddball frame).
> > > > [Kasireddy, Vivek] The Guest and Host use different event timestamps as we don't
> > > > share these between the Guest and the Host. It does not seem to be causing any other
> > > > problems so far but we did try the experiment you mentioned (i.e., adjusting the
> > delays)
> > > > and it works. However, this patch series is meant to fix the issue without having to
> > tweak
> > > > anything (delays) because we can't do this for every compositor out there.
> > >
> > > Maybe there could be a mechanism which allows the compositor in the guest to
> > automatically adjust its repaint cycle as needed.
> > >
> > > This might even be possible without requiring changes in each compositor, by adjusting
> > the vertical blank periods in the guest to be aligned with the host compositor repaint
> > cycles. Not sure about that though.
> > >
> > > Even if not, both this series or making it possible to queue multiple flips require
> > corresponding changes in each compositor as well to have any effect.
> > 
> > Yeah from all the discussions and tests done it sounds even with a
> > deeper queue we have big coordination issues between the guest and
> > host compositor (like the example that the guest is now rendering at
> > 90fps instead of 60fps like the host).
> [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. That 90 FPS vs 
> 60 FPS problem is a completely different issue that is associated with Qemu GTK UI
> backend. With the GTK backend -- and also with SDL backend -- we Blit the Guest
> scanout FB onto one of the backbuffers managed by EGL. 
> 
> I am trying to add a new Qemu Wayland UI backend so that we can eliminate that Blit
> and thereby have a truly zero-copy solution. And, this is there I am running into the 
> halved frame-rate issue -- the current problem.

Yes, that's what I referenced. But I disagree that it's a different
problem. The underlying problem in both cases is that the guest and host
compositor free-wheel instead of rendering in sync. It's just that
depending upon how exactly the flip completion event on the gues side
plays out you either get guest rendering that's faster than the host-side
60fps, or guest rendering that's much slower than the host-side 60fps.

The fundamental problem in both cases is that they don't run in lockstep.
If you fix that, through fixing the timestamp and even reporting most
likely, you should be able to fix both bugs.

> > Hence my gut feeling reaction that first we need to get these two
> > compositors aligned in their timings, which propobably needs
> > consistent vblank periods/timestamps across them (plus/minux
> > guest/host clocksource fun ofc). Without this any of the next steps
> > will simply not work because there's too much jitter by the time the
> > guest compositor gets the flip completion events.
> [Kasireddy, Vivek] Timings are not a problem and do not significantly
> affect the repaint cycles from what I have seen so far.
> 
> > 
> > Once we have solid events I think we should look into statically
> > tuning guest/host compositor deadlines (like you've suggested in a
> > bunch of places) to consisently make that deadline and hit 60 fps.
> > With that we can then look into tuning this automatically and what to
> > do when e.g. switching between copying and zero-copy on the host side
> > (which might be needed in some cases) and how to handle all that.
> [Kasireddy, Vivek] As I confirm here: https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984065
> tweaking the deadlines works (i.e., we get 60 FPS) as we expect. However,
> I feel that this zero-copy solution I am trying to create should be independent
> of compositors' deadlines, delays or other scheduling parameters.

That's not how compositors work nowadays. Your problem is that you don't
have the guest/host compositor in sync. zero-copy only changes the timing,
so it changes things from "rendering way too many frames" to "rendering
way too few frames".

We need to fix the timing/sync issue here first, not paper over it with
hacks.

Only, and I really mean only, when that shows that it's simply impossible
to hit 60fps with zero-copy and the guest/host fully aligned should we
look into making the overall pipeline deeper.

> > Only when that all shows that we just can't hit 60fps consistently and
> > really need 3 buffers in flight should we look at deeper kms queues.
> > And then we really need to implement them properly and not with a
> > mismatch between drm_event an out-fence signalling. These quick hacks
> > are good for experiments, but there's a pile of other things we need
> > to do first. At least that's how I understand the problem here right
> > now.
> [Kasireddy, Vivek] Experiments done so far indicate that we can hit 59 FPS consistently
> -- in a zero-copy way independent of compositors' delays/deadlines -- with this
> patch series + the Weston MR I linked in the cover letter. The main reason why this
> works is because we relax the assumption that when the Guest compositor gets a
> pageflip completion event that it could reuse the old FB it submitted in the previous
> atomic flip and instead force it to use a new one. And, we send the pageflip completion
> event to the Guest when the Host compositor sends a frame callback event. Lastly,
> we use the (deferred) out_fence as just a mechanism to tell the Guest compositor when
> it can release references on old FBs so that they can be reused again.
> 
> With that being said, the only question is how can we accomplish the above in an upstream
> acceptable way without regressing anything particularly on bare-metal. Its not clear if just
> increasing the queue depth would work or not but I think the Guest compositor has to be told
> when it can start its repaint cycle and when it can assume the old FB is no longer in use.
> On bare-metal -- and also with VKMS as of today -- a pageflip completion indicates both.
> In other words, Vblank event is the same as Flip done, which makes sense on bare-metal.
> But if we were to have two events at-least for VKMS: vblank to indicate to Guest to start
> repaint and flip_done to indicate to drop references on old FBs, I think this problem can
> be solved even without increasing the queue depth. Can this be acceptable?

That's just another flavour of your "increase queue depth without
increasing the atomic queue depth" approach. I still think the underlying
fundamental issue is a timing confusion, and the fact that adjusting the
timings fixes things too kinda proves that. So we need to fix that in a
clean way, not by shuffling things around semi-randomly until the specific
config we tests works.

Iow I think we need a solution here which both slows down the 90fps to
60fps for the blit case, and the 30fps speed up to 60fps for the zerocopy
case. Because the host might need to switch transparently between blt and
zerocopy for various reasons.
-Daniel

> Thanks,
> Vivek
> > 
> > Cheers, Daniel
> > 
> > >
> > >
> > > --
> > > Earthling Michel Dänzer               |               https://redhat.com
> > > Libre software enthusiast             |             Mesa and X developer
> > 
> > 
> > 
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

Vivek Kasireddy Aug. 6, 2021, 7:27 a.m. UTC | #23

Hi Daniel,

> > > > >>> The solution:
> > > > >>> - To ensure full framerate, the Guest compositor has to start it's repaint cycle
> > > (including
> > > > >>> the 9 ms wait) when the Host compositor sends the frame callback event to its
> > > clients.
> > > > >>> In order for this to happen, the dma-fence that the Guest KMS waits on -- before
> > > sending
> > > > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. This means
> that,
> > > the
> > > > >>> Guest compositor has to be forced to use a new buffer for its next repaint cycle
> > > when it
> > > > >>> gets a pageflip completion.
> > > > >>
> > > > >> Is that really the only solution?
> > > > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > > > But I think none of them are as compelling as this one.
> > > > >
> > > > >>
> > > > >> If we fix the event timestamps so that both guest and host use the same
> > > > >> timestamp, but then the guest starts 5ms (or something like that) earlier,
> > > > >> then things should work too? I.e.
> > > > >> - host compositor starts at (previous_frametime + 9ms)
> > > > >> - guest compositor starts at (previous_frametime + 4ms)
> > > > >>
> > > > >> Ofc this only works if the frametimes we hand out to both match _exactly_
> > > > >> and are as high-precision as the ones on the host side. Which for many gpu
> > > > >> drivers at least is the case, and all the ones you care about for sure :-)
> > > > >>
> > > > >> But if the frametimes the guest receives are the no_vblank fake ones, then
> > > > >> they'll be all over the place and this carefully tuned low-latency redraw
> > > > >> loop falls apart. Aside fromm the fact that without tuning the guests to
> > > > >> be earlier than the hosts, you're guaranteed to miss every frame (except
> > > > >> when the timing wobbliness in the guest is big enough by chance to make
> > > > >> the deadline on the oddball frame).
> > > > > [Kasireddy, Vivek] The Guest and Host use different event timestamps as we don't
> > > > > share these between the Guest and the Host. It does not seem to be causing any
> other
> > > > > problems so far but we did try the experiment you mentioned (i.e., adjusting the
> > > delays)
> > > > > and it works. However, this patch series is meant to fix the issue without having to
> > > tweak
> > > > > anything (delays) because we can't do this for every compositor out there.
> > > >
> > > > Maybe there could be a mechanism which allows the compositor in the guest to
> > > automatically adjust its repaint cycle as needed.
> > > >
> > > > This might even be possible without requiring changes in each compositor, by
> adjusting
> > > the vertical blank periods in the guest to be aligned with the host compositor repaint
> > > cycles. Not sure about that though.
> > > >
> > > > Even if not, both this series or making it possible to queue multiple flips require
> > > corresponding changes in each compositor as well to have any effect.
> > >
> > > Yeah from all the discussions and tests done it sounds even with a
> > > deeper queue we have big coordination issues between the guest and
> > > host compositor (like the example that the guest is now rendering at
> > > 90fps instead of 60fps like the host).
> > [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. That 90 FPS vs
> > 60 FPS problem is a completely different issue that is associated with Qemu GTK UI
> > backend. With the GTK backend -- and also with SDL backend -- we Blit the Guest
> > scanout FB onto one of the backbuffers managed by EGL.
> >
> > I am trying to add a new Qemu Wayland UI backend so that we can eliminate that Blit
> > and thereby have a truly zero-copy solution. And, this is there I am running into the
> > halved frame-rate issue -- the current problem.
> 
> Yes, that's what I referenced. But I disagree that it's a different
> problem. The underlying problem in both cases is that the guest and host
> compositor free-wheel instead of rendering in sync. It's just that
> depending upon how exactly the flip completion event on the gues side
> plays out you either get guest rendering that's faster than the host-side
> 60fps, or guest rendering that's much slower than the host-side 60fps.
[Kasireddy, Vivek] That used to be the case before we added a synchronization
mechanism to the GTK UI backend that uses a sync file. After adding this
and making the Guest wait until this sync file fd on the Host is signaled, we
consistently get 60 FPS because the flip completion event for the Guest is
directly tied to the signaling of the sync file in this particular case (GTK UI).

> 
> The fundamental problem in both cases is that they don't run in lockstep.
> If you fix that, through fixing the timestamp and even reporting most
> likely, you should be able to fix both bugs.
[Kasireddy, Vivek] GTK UI is an EGL based solution that Blits the Guest scanout
FB onto one of the backbuffers managed by EGL. Wayland UI is a zero-copy
solution that just wraps the dmabuf associated with Guest scanout FB in a 
wl_buffer and submits it directly to the Host compositor. These backends are
completely independent of each other and cannot be active at the same time.
In other words, we cannot have zero-copy and Blit based solutions running
parallelly. And, this issue is only relevant for Wayland UI backend and has 
nothing to do with GTK UI. 

> 
> > > Hence my gut feeling reaction that first we need to get these two
> > > compositors aligned in their timings, which propobably needs
> > > consistent vblank periods/timestamps across them (plus/minux
> > > guest/host clocksource fun ofc). Without this any of the next steps
> > > will simply not work because there's too much jitter by the time the
> > > guest compositor gets the flip completion events.
> > [Kasireddy, Vivek] Timings are not a problem and do not significantly
> > affect the repaint cycles from what I have seen so far.
> >
> > >
> > > Once we have solid events I think we should look into statically
> > > tuning guest/host compositor deadlines (like you've suggested in a
> > > bunch of places) to consisently make that deadline and hit 60 fps.
> > > With that we can then look into tuning this automatically and what to
> > > do when e.g. switching between copying and zero-copy on the host side
> > > (which might be needed in some cases) and how to handle all that.
> > [Kasireddy, Vivek] As I confirm here: https://gitlab.freedesktop.org/wayland/weston/-
> /issues/514#note_984065
> > tweaking the deadlines works (i.e., we get 60 FPS) as we expect. However,
> > I feel that this zero-copy solution I am trying to create should be independent
> > of compositors' deadlines, delays or other scheduling parameters.
> 
> That's not how compositors work nowadays. Your problem is that you don't
> have the guest/host compositor in sync. zero-copy only changes the timing,
> so it changes things from "rendering way too many frames" to "rendering
> way too few frames".
> 
> We need to fix the timing/sync issue here first, not paper over it with
> hacks.
[Kasireddy, Vivek] What I really meant is that the zero-copy solution should be
independent of the scheduling policies to ensure that it works with all compositors.
 IIUC, Weston for example uses the vblank/pageflip completion timestamp, the
configurable repaint-window value, refresh-rate, etc to determine when to start
its next repaint -- if there is any damage:
timespec_add_nsec(&output->next_repaint, stamp, refresh_nsec);
timespec_add_msec(&output->next_repaint, &output->next_repaint, -compositor->repaint_msec);

And, in the case of VKMS, since there is no real hardware, the timestamp is always:
now = ktime_get();
send_vblank_event(dev, e, seq, now);

When you say that the Guest/Host compositor need to stay in sync, are you 
suggesting that we need to ensure that the vblank timestamp on the Host 
needs to be shared and be the same on the Guest and a vblank/pageflip
completion for the Guest needs to be sent at exactly the same time it is sent
on the Host? If yes, I'd say that we do send the pageflip completion to Guest
around the same time a vblank is generated on the Host but it does not help
because the Guest compositor would only have 9 ms to submit a new frame
and if the Host is running Mutter, the Guest would only have 2 ms.
(https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984341)

> 
> Only, and I really mean only, when that shows that it's simply impossible
> to hit 60fps with zero-copy and the guest/host fully aligned should we
> look into making the overall pipeline deeper.
[Kasireddy, Vivek] From all the experiments conducted so far and given the
discussion associated with https://gitlab.freedesktop.org/wayland/weston/-/issues/514
I think we have already established that in order for a zero-copy solution to work 
reliably, the Guest compositor needs to start its repaint cycle when the Host
compositor sends a frame callback event to its clients.

> 
> > > Only when that all shows that we just can't hit 60fps consistently and
> > > really need 3 buffers in flight should we look at deeper kms queues.
> > > And then we really need to implement them properly and not with a
> > > mismatch between drm_event an out-fence signalling. These quick hacks
> > > are good for experiments, but there's a pile of other things we need
> > > to do first. At least that's how I understand the problem here right
> > > now.
> > [Kasireddy, Vivek] Experiments done so far indicate that we can hit 59 FPS consistently
> > -- in a zero-copy way independent of compositors' delays/deadlines -- with this
> > patch series + the Weston MR I linked in the cover letter. The main reason why this
> > works is because we relax the assumption that when the Guest compositor gets a
> > pageflip completion event that it could reuse the old FB it submitted in the previous
> > atomic flip and instead force it to use a new one. And, we send the pageflip completion
> > event to the Guest when the Host compositor sends a frame callback event. Lastly,
> > we use the (deferred) out_fence as just a mechanism to tell the Guest compositor when
> > it can release references on old FBs so that they can be reused again.
> >
> > With that being said, the only question is how can we accomplish the above in an
> upstream
> > acceptable way without regressing anything particularly on bare-metal. Its not clear if
> just
> > increasing the queue depth would work or not but I think the Guest compositor has to be
> told
> > when it can start its repaint cycle and when it can assume the old FB is no longer in use.
> > On bare-metal -- and also with VKMS as of today -- a pageflip completion indicates
> both.
> > In other words, Vblank event is the same as Flip done, which makes sense on bare-metal.
> > But if we were to have two events at-least for VKMS: vblank to indicate to Guest to start
> > repaint and flip_done to indicate to drop references on old FBs, I think this problem can
> > be solved even without increasing the queue depth. Can this be acceptable?
> 
> That's just another flavour of your "increase queue depth without
> increasing the atomic queue depth" approach. I still think the underlying
> fundamental issue is a timing confusion, and the fact that adjusting the
> timings fixes things too kinda proves that. So we need to fix that in a
> clean way, not by shuffling things around semi-randomly until the specific
> config we tests works.
[Kasireddy, Vivek] This issue is not due to a timing or timestamp mismatch. We
have carefully instrumented both the Host and Guest compositors and measured
the latencies at each step. The relevant debug data only points to the scheduling
policy -- of both Host and Guest compositors -- playing a role in Guest rendering 
at 30 FPS.

> 
> Iow I think we need a solution here which both slows down the 90fps to
> 60fps for the blit case, and the 30fps speed up to 60fps for the zerocopy
> case. Because the host might need to switch transparently between blt and
> zerocopy for various reasons.
[Kasireddy, Vivek] As I mentioned above, the Host (Qemu) cannot switch UI
backends at runtime. In other words, with GTK UI backend, it is always Blit
whereas Wayland UI backend is always zero-copy.

Thanks,
Vivek

> -Daniel
> 
> > Thanks,
> > Vivek
> > >
> > > Cheers, Daniel
> > >
> > > >
> > > >
> > > > --
> > > > Earthling Michel Dänzer               |               https://redhat.com
> > > > Libre software enthusiast             |             Mesa and X developer
> > >
> > >
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Daniel Vetter Aug. 9, 2021, 2:15 p.m. UTC | #24

On Fri, Aug 06, 2021 at 07:27:13AM +0000, Kasireddy, Vivek wrote:
> Hi Daniel,
> 
> > > > > >>> The solution:
> > > > > >>> - To ensure full framerate, the Guest compositor has to start it's repaint cycle
> > > > (including
> > > > > >>> the 9 ms wait) when the Host compositor sends the frame callback event to its
> > > > clients.
> > > > > >>> In order for this to happen, the dma-fence that the Guest KMS waits on -- before
> > > > sending
> > > > > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. This means
> > that,
> > > > the
> > > > > >>> Guest compositor has to be forced to use a new buffer for its next repaint cycle
> > > > when it
> > > > > >>> gets a pageflip completion.
> > > > > >>
> > > > > >> Is that really the only solution?
> > > > > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > > > > But I think none of them are as compelling as this one.
> > > > > >
> > > > > >>
> > > > > >> If we fix the event timestamps so that both guest and host use the same
> > > > > >> timestamp, but then the guest starts 5ms (or something like that) earlier,
> > > > > >> then things should work too? I.e.
> > > > > >> - host compositor starts at (previous_frametime + 9ms)
> > > > > >> - guest compositor starts at (previous_frametime + 4ms)
> > > > > >>
> > > > > >> Ofc this only works if the frametimes we hand out to both match _exactly_
> > > > > >> and are as high-precision as the ones on the host side. Which for many gpu
> > > > > >> drivers at least is the case, and all the ones you care about for sure :-)
> > > > > >>
> > > > > >> But if the frametimes the guest receives are the no_vblank fake ones, then
> > > > > >> they'll be all over the place and this carefully tuned low-latency redraw
> > > > > >> loop falls apart. Aside fromm the fact that without tuning the guests to
> > > > > >> be earlier than the hosts, you're guaranteed to miss every frame (except
> > > > > >> when the timing wobbliness in the guest is big enough by chance to make
> > > > > >> the deadline on the oddball frame).
> > > > > > [Kasireddy, Vivek] The Guest and Host use different event timestamps as we don't
> > > > > > share these between the Guest and the Host. It does not seem to be causing any
> > other
> > > > > > problems so far but we did try the experiment you mentioned (i.e., adjusting the
> > > > delays)
> > > > > > and it works. However, this patch series is meant to fix the issue without having to
> > > > tweak
> > > > > > anything (delays) because we can't do this for every compositor out there.
> > > > >
> > > > > Maybe there could be a mechanism which allows the compositor in the guest to
> > > > automatically adjust its repaint cycle as needed.
> > > > >
> > > > > This might even be possible without requiring changes in each compositor, by
> > adjusting
> > > > the vertical blank periods in the guest to be aligned with the host compositor repaint
> > > > cycles. Not sure about that though.
> > > > >
> > > > > Even if not, both this series or making it possible to queue multiple flips require
> > > > corresponding changes in each compositor as well to have any effect.
> > > >
> > > > Yeah from all the discussions and tests done it sounds even with a
> > > > deeper queue we have big coordination issues between the guest and
> > > > host compositor (like the example that the guest is now rendering at
> > > > 90fps instead of 60fps like the host).
> > > [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. That 90 FPS vs
> > > 60 FPS problem is a completely different issue that is associated with Qemu GTK UI
> > > backend. With the GTK backend -- and also with SDL backend -- we Blit the Guest
> > > scanout FB onto one of the backbuffers managed by EGL.
> > >
> > > I am trying to add a new Qemu Wayland UI backend so that we can eliminate that Blit
> > > and thereby have a truly zero-copy solution. And, this is there I am running into the
> > > halved frame-rate issue -- the current problem.
> > 
> > Yes, that's what I referenced. But I disagree that it's a different
> > problem. The underlying problem in both cases is that the guest and host
> > compositor free-wheel instead of rendering in sync. It's just that
> > depending upon how exactly the flip completion event on the gues side
> > plays out you either get guest rendering that's faster than the host-side
> > 60fps, or guest rendering that's much slower than the host-side 60fps.
> [Kasireddy, Vivek] That used to be the case before we added a synchronization
> mechanism to the GTK UI backend that uses a sync file. After adding this
> and making the Guest wait until this sync file fd on the Host is signaled, we
> consistently get 60 FPS because the flip completion event for the Guest is
> directly tied to the signaling of the sync file in this particular case (GTK UI).
> 
> > 
> > The fundamental problem in both cases is that they don't run in lockstep.
> > If you fix that, through fixing the timestamp and even reporting most
> > likely, you should be able to fix both bugs.
> [Kasireddy, Vivek] GTK UI is an EGL based solution that Blits the Guest scanout
> FB onto one of the backbuffers managed by EGL. Wayland UI is a zero-copy
> solution that just wraps the dmabuf associated with Guest scanout FB in a 
> wl_buffer and submits it directly to the Host compositor. These backends are
> completely independent of each other and cannot be active at the same time.
> In other words, we cannot have zero-copy and Blit based solutions running
> parallelly. And, this issue is only relevant for Wayland UI backend and has 
> nothing to do with GTK UI. 
> 
> > 
> > > > Hence my gut feeling reaction that first we need to get these two
> > > > compositors aligned in their timings, which propobably needs
> > > > consistent vblank periods/timestamps across them (plus/minux
> > > > guest/host clocksource fun ofc). Without this any of the next steps
> > > > will simply not work because there's too much jitter by the time the
> > > > guest compositor gets the flip completion events.
> > > [Kasireddy, Vivek] Timings are not a problem and do not significantly
> > > affect the repaint cycles from what I have seen so far.
> > >
> > > >
> > > > Once we have solid events I think we should look into statically
> > > > tuning guest/host compositor deadlines (like you've suggested in a
> > > > bunch of places) to consisently make that deadline and hit 60 fps.
> > > > With that we can then look into tuning this automatically and what to
> > > > do when e.g. switching between copying and zero-copy on the host side
> > > > (which might be needed in some cases) and how to handle all that.
> > > [Kasireddy, Vivek] As I confirm here: https://gitlab.freedesktop.org/wayland/weston/-
> > /issues/514#note_984065
> > > tweaking the deadlines works (i.e., we get 60 FPS) as we expect. However,
> > > I feel that this zero-copy solution I am trying to create should be independent
> > > of compositors' deadlines, delays or other scheduling parameters.
> > 
> > That's not how compositors work nowadays. Your problem is that you don't
> > have the guest/host compositor in sync. zero-copy only changes the timing,
> > so it changes things from "rendering way too many frames" to "rendering
> > way too few frames".
> > 
> > We need to fix the timing/sync issue here first, not paper over it with
> > hacks.
> [Kasireddy, Vivek] What I really meant is that the zero-copy solution should be
> independent of the scheduling policies to ensure that it works with all compositors.
>  IIUC, Weston for example uses the vblank/pageflip completion timestamp, the
> configurable repaint-window value, refresh-rate, etc to determine when to start
> its next repaint -- if there is any damage:
> timespec_add_nsec(&output->next_repaint, stamp, refresh_nsec);
> timespec_add_msec(&output->next_repaint, &output->next_repaint, -compositor->repaint_msec);
> 
> And, in the case of VKMS, since there is no real hardware, the timestamp is always:
> now = ktime_get();
> send_vblank_event(dev, e, seq, now);

vkms has been fixed since a while to fake high-precision timestamps like
from a real display.

> When you say that the Guest/Host compositor need to stay in sync, are you 
> suggesting that we need to ensure that the vblank timestamp on the Host 
> needs to be shared and be the same on the Guest and a vblank/pageflip
> completion for the Guest needs to be sent at exactly the same time it is sent
> on the Host? If yes, I'd say that we do send the pageflip completion to Guest
> around the same time a vblank is generated on the Host but it does not help
> because the Guest compositor would only have 9 ms to submit a new frame
> and if the Host is running Mutter, the Guest would only have 2 ms.
> (https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984341)

Not at the same time, but the same timestamp. And yes there is some fun
there, which is I think the fundamental issue. Or at least some of the
compositor experts seem to think so, and it makes sense to me.

> > 
> > Only, and I really mean only, when that shows that it's simply impossible
> > to hit 60fps with zero-copy and the guest/host fully aligned should we
> > look into making the overall pipeline deeper.
> [Kasireddy, Vivek] From all the experiments conducted so far and given the
> discussion associated with https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> I think we have already established that in order for a zero-copy solution to work 
> reliably, the Guest compositor needs to start its repaint cycle when the Host
> compositor sends a frame callback event to its clients.
> 
> > 
> > > > Only when that all shows that we just can't hit 60fps consistently and
> > > > really need 3 buffers in flight should we look at deeper kms queues.
> > > > And then we really need to implement them properly and not with a
> > > > mismatch between drm_event an out-fence signalling. These quick hacks
> > > > are good for experiments, but there's a pile of other things we need
> > > > to do first. At least that's how I understand the problem here right
> > > > now.
> > > [Kasireddy, Vivek] Experiments done so far indicate that we can hit 59 FPS consistently
> > > -- in a zero-copy way independent of compositors' delays/deadlines -- with this
> > > patch series + the Weston MR I linked in the cover letter. The main reason why this
> > > works is because we relax the assumption that when the Guest compositor gets a
> > > pageflip completion event that it could reuse the old FB it submitted in the previous
> > > atomic flip and instead force it to use a new one. And, we send the pageflip completion
> > > event to the Guest when the Host compositor sends a frame callback event. Lastly,
> > > we use the (deferred) out_fence as just a mechanism to tell the Guest compositor when
> > > it can release references on old FBs so that they can be reused again.
> > >
> > > With that being said, the only question is how can we accomplish the above in an
> > upstream
> > > acceptable way without regressing anything particularly on bare-metal. Its not clear if
> > just
> > > increasing the queue depth would work or not but I think the Guest compositor has to be
> > told
> > > when it can start its repaint cycle and when it can assume the old FB is no longer in use.
> > > On bare-metal -- and also with VKMS as of today -- a pageflip completion indicates
> > both.
> > > In other words, Vblank event is the same as Flip done, which makes sense on bare-metal.
> > > But if we were to have two events at-least for VKMS: vblank to indicate to Guest to start
> > > repaint and flip_done to indicate to drop references on old FBs, I think this problem can
> > > be solved even without increasing the queue depth. Can this be acceptable?
> > 
> > That's just another flavour of your "increase queue depth without
> > increasing the atomic queue depth" approach. I still think the underlying
> > fundamental issue is a timing confusion, and the fact that adjusting the
> > timings fixes things too kinda proves that. So we need to fix that in a
> > clean way, not by shuffling things around semi-randomly until the specific
> > config we tests works.
> [Kasireddy, Vivek] This issue is not due to a timing or timestamp mismatch. We
> have carefully instrumented both the Host and Guest compositors and measured
> the latencies at each step. The relevant debug data only points to the scheduling
> policy -- of both Host and Guest compositors -- playing a role in Guest rendering 
> at 30 FPS.

Hm but that essentially means that the events your passing around have an
even more ad-hoc implementation specific meaning: Essentially it's the
kick-off for the guest's repaint loop? That sounds even worse for a kms
uapi extension.

> > Iow I think we need a solution here which both slows down the 90fps to
> > 60fps for the blit case, and the 30fps speed up to 60fps for the zerocopy
> > case. Because the host might need to switch transparently between blt and
> > zerocopy for various reasons.
> [Kasireddy, Vivek] As I mentioned above, the Host (Qemu) cannot switch UI
> backends at runtime. In other words, with GTK UI backend, it is always Blit
> whereas Wayland UI backend is always zero-copy.

Hm ok, that at least makes things somewhat simpler. Another thing that I
just realized: What happens when the host changes screen resolution and
especially refresh rate?
-Daniel

> 
> Thanks,
> Vivek
> 
> > -Daniel
> > 
> > > Thanks,
> > > Vivek
> > > >
> > > > Cheers, Daniel
> > > >
> > > > >
> > > > >
> > > > > --
> > > > > Earthling Michel Dänzer               |               https://redhat.com
> > > > > Libre software enthusiast             |             Mesa and X developer
> > > >
> > > >
> > > >
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > 
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

Vivek Kasireddy Aug. 10, 2021, 8:21 a.m. UTC | #25

Hi Daniel,

> On Fri, Aug 06, 2021 at 07:27:13AM +0000, Kasireddy, Vivek wrote:
> > Hi Daniel,
> >
> > > > > > >>> The solution:
> > > > > > >>> - To ensure full framerate, the Guest compositor has to start it's repaint
> cycle
> > > > > (including
> > > > > > >>> the 9 ms wait) when the Host compositor sends the frame callback event to
> its
> > > > > clients.
> > > > > > >>> In order for this to happen, the dma-fence that the Guest KMS waits on --
> before
> > > > > sending
> > > > > > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. This
> means
> > > that,
> > > > > the
> > > > > > >>> Guest compositor has to be forced to use a new buffer for its next repaint
> cycle
> > > > > when it
> > > > > > >>> gets a pageflip completion.
> > > > > > >>
> > > > > > >> Is that really the only solution?
> > > > > > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > > > > > But I think none of them are as compelling as this one.
> > > > > > >
> > > > > > >>
> > > > > > >> If we fix the event timestamps so that both guest and host use the same
> > > > > > >> timestamp, but then the guest starts 5ms (or something like that) earlier,
> > > > > > >> then things should work too? I.e.
> > > > > > >> - host compositor starts at (previous_frametime + 9ms)
> > > > > > >> - guest compositor starts at (previous_frametime + 4ms)
> > > > > > >>
> > > > > > >> Ofc this only works if the frametimes we hand out to both match _exactly_
> > > > > > >> and are as high-precision as the ones on the host side. Which for many gpu
> > > > > > >> drivers at least is the case, and all the ones you care about for sure :-)
> > > > > > >>
> > > > > > >> But if the frametimes the guest receives are the no_vblank fake ones, then
> > > > > > >> they'll be all over the place and this carefully tuned low-latency redraw
> > > > > > >> loop falls apart. Aside fromm the fact that without tuning the guests to
> > > > > > >> be earlier than the hosts, you're guaranteed to miss every frame (except
> > > > > > >> when the timing wobbliness in the guest is big enough by chance to make
> > > > > > >> the deadline on the oddball frame).
> > > > > > > [Kasireddy, Vivek] The Guest and Host use different event timestamps as we
> don't
> > > > > > > share these between the Guest and the Host. It does not seem to be causing any
> > > other
> > > > > > > problems so far but we did try the experiment you mentioned (i.e., adjusting
> the
> > > > > delays)
> > > > > > > and it works. However, this patch series is meant to fix the issue without
> having to
> > > > > tweak
> > > > > > > anything (delays) because we can't do this for every compositor out there.
> > > > > >
> > > > > > Maybe there could be a mechanism which allows the compositor in the guest to
> > > > > automatically adjust its repaint cycle as needed.
> > > > > >
> > > > > > This might even be possible without requiring changes in each compositor, by
> > > adjusting
> > > > > the vertical blank periods in the guest to be aligned with the host compositor
> repaint
> > > > > cycles. Not sure about that though.
> > > > > >
> > > > > > Even if not, both this series or making it possible to queue multiple flips require
> > > > > corresponding changes in each compositor as well to have any effect.
> > > > >
> > > > > Yeah from all the discussions and tests done it sounds even with a
> > > > > deeper queue we have big coordination issues between the guest and
> > > > > host compositor (like the example that the guest is now rendering at
> > > > > 90fps instead of 60fps like the host).
> > > > [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. That 90 FPS vs
> > > > 60 FPS problem is a completely different issue that is associated with Qemu GTK UI
> > > > backend. With the GTK backend -- and also with SDL backend -- we Blit the Guest
> > > > scanout FB onto one of the backbuffers managed by EGL.
> > > >
> > > > I am trying to add a new Qemu Wayland UI backend so that we can eliminate that
> Blit
> > > > and thereby have a truly zero-copy solution. And, this is there I am running into the
> > > > halved frame-rate issue -- the current problem.
> > >
> > > Yes, that's what I referenced. But I disagree that it's a different
> > > problem. The underlying problem in both cases is that the guest and host
> > > compositor free-wheel instead of rendering in sync. It's just that
> > > depending upon how exactly the flip completion event on the gues side
> > > plays out you either get guest rendering that's faster than the host-side
> > > 60fps, or guest rendering that's much slower than the host-side 60fps.
> > [Kasireddy, Vivek] That used to be the case before we added a synchronization
> > mechanism to the GTK UI backend that uses a sync file. After adding this
> > and making the Guest wait until this sync file fd on the Host is signaled, we
> > consistently get 60 FPS because the flip completion event for the Guest is
> > directly tied to the signaling of the sync file in this particular case (GTK UI).
> >
> > >
> > > The fundamental problem in both cases is that they don't run in lockstep.
> > > If you fix that, through fixing the timestamp and even reporting most
> > > likely, you should be able to fix both bugs.
> > [Kasireddy, Vivek] GTK UI is an EGL based solution that Blits the Guest scanout
> > FB onto one of the backbuffers managed by EGL. Wayland UI is a zero-copy
> > solution that just wraps the dmabuf associated with Guest scanout FB in a
> > wl_buffer and submits it directly to the Host compositor. These backends are
> > completely independent of each other and cannot be active at the same time.
> > In other words, we cannot have zero-copy and Blit based solutions running
> > parallelly. And, this issue is only relevant for Wayland UI backend and has
> > nothing to do with GTK UI.
> >
> > >
> > > > > Hence my gut feeling reaction that first we need to get these two
> > > > > compositors aligned in their timings, which propobably needs
> > > > > consistent vblank periods/timestamps across them (plus/minux
> > > > > guest/host clocksource fun ofc). Without this any of the next steps
> > > > > will simply not work because there's too much jitter by the time the
> > > > > guest compositor gets the flip completion events.
> > > > [Kasireddy, Vivek] Timings are not a problem and do not significantly
> > > > affect the repaint cycles from what I have seen so far.
> > > >
> > > > >
> > > > > Once we have solid events I think we should look into statically
> > > > > tuning guest/host compositor deadlines (like you've suggested in a
> > > > > bunch of places) to consisently make that deadline and hit 60 fps.
> > > > > With that we can then look into tuning this automatically and what to
> > > > > do when e.g. switching between copying and zero-copy on the host side
> > > > > (which might be needed in some cases) and how to handle all that.
> > > > [Kasireddy, Vivek] As I confirm here:
> https://gitlab.freedesktop.org/wayland/weston/-
> > > /issues/514#note_984065
> > > > tweaking the deadlines works (i.e., we get 60 FPS) as we expect. However,
> > > > I feel that this zero-copy solution I am trying to create should be independent
> > > > of compositors' deadlines, delays or other scheduling parameters.
> > >
> > > That's not how compositors work nowadays. Your problem is that you don't
> > > have the guest/host compositor in sync. zero-copy only changes the timing,
> > > so it changes things from "rendering way too many frames" to "rendering
> > > way too few frames".
> > >
> > > We need to fix the timing/sync issue here first, not paper over it with
> > > hacks.
> > [Kasireddy, Vivek] What I really meant is that the zero-copy solution should be
> > independent of the scheduling policies to ensure that it works with all compositors.
> >  IIUC, Weston for example uses the vblank/pageflip completion timestamp, the
> > configurable repaint-window value, refresh-rate, etc to determine when to start
> > its next repaint -- if there is any damage:
> > timespec_add_nsec(&output->next_repaint, stamp, refresh_nsec);
> > timespec_add_msec(&output->next_repaint, &output->next_repaint, -compositor-
> >repaint_msec);
> >
> > And, in the case of VKMS, since there is no real hardware, the timestamp is always:
> > now = ktime_get();
> > send_vblank_event(dev, e, seq, now);
> 
> vkms has been fixed since a while to fake high-precision timestamps like
> from a real display.
[Kasireddy, Vivek] IIUC, that might be one of the reasons why the Guest does not need 
to have the same timestamp as that of the Host -- to work as expected.

> 
> > When you say that the Guest/Host compositor need to stay in sync, are you
> > suggesting that we need to ensure that the vblank timestamp on the Host
> > needs to be shared and be the same on the Guest and a vblank/pageflip
> > completion for the Guest needs to be sent at exactly the same time it is sent
> > on the Host? If yes, I'd say that we do send the pageflip completion to Guest
> > around the same time a vblank is generated on the Host but it does not help
> > because the Guest compositor would only have 9 ms to submit a new frame
> > and if the Host is running Mutter, the Guest would only have 2 ms.
> > (https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984341)
> 
> Not at the same time, but the same timestamp. And yes there is some fun
> there, which is I think the fundamental issue. Or at least some of the
> compositor experts seem to think so, and it makes sense to me.
[Kasireddy, Vivek] It is definitely possible that if the timestamp is messed up, then
the Guest repaint cycle would be affected. However, I do not believe that is the case
here given the debug and instrumentation data we collected and scrutinized. Hopefully,
compositor experts could chime in to shed some light on this matter.

> 
> > >
> > > Only, and I really mean only, when that shows that it's simply impossible
> > > to hit 60fps with zero-copy and the guest/host fully aligned should we
> > > look into making the overall pipeline deeper.
> > [Kasireddy, Vivek] From all the experiments conducted so far and given the
> > discussion associated with https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> > I think we have already established that in order for a zero-copy solution to work
> > reliably, the Guest compositor needs to start its repaint cycle when the Host
> > compositor sends a frame callback event to its clients.
> >
> > >
> > > > > Only when that all shows that we just can't hit 60fps consistently and
> > > > > really need 3 buffers in flight should we look at deeper kms queues.
> > > > > And then we really need to implement them properly and not with a
> > > > > mismatch between drm_event an out-fence signalling. These quick hacks
> > > > > are good for experiments, but there's a pile of other things we need
> > > > > to do first. At least that's how I understand the problem here right
> > > > > now.
> > > > [Kasireddy, Vivek] Experiments done so far indicate that we can hit 59 FPS
> consistently
> > > > -- in a zero-copy way independent of compositors' delays/deadlines -- with this
> > > > patch series + the Weston MR I linked in the cover letter. The main reason why this
> > > > works is because we relax the assumption that when the Guest compositor gets a
> > > > pageflip completion event that it could reuse the old FB it submitted in the previous
> > > > atomic flip and instead force it to use a new one. And, we send the pageflip
> completion
> > > > event to the Guest when the Host compositor sends a frame callback event. Lastly,
> > > > we use the (deferred) out_fence as just a mechanism to tell the Guest compositor
> when
> > > > it can release references on old FBs so that they can be reused again.
> > > >
> > > > With that being said, the only question is how can we accomplish the above in an
> > > upstream
> > > > acceptable way without regressing anything particularly on bare-metal. Its not clear
> if
> > > just
> > > > increasing the queue depth would work or not but I think the Guest compositor has to
> be
> > > told
> > > > when it can start its repaint cycle and when it can assume the old FB is no longer in
> use.
> > > > On bare-metal -- and also with VKMS as of today -- a pageflip completion indicates
> > > both.
> > > > In other words, Vblank event is the same as Flip done, which makes sense on bare-
> metal.
> > > > But if we were to have two events at-least for VKMS: vblank to indicate to Guest to
> start
> > > > repaint and flip_done to indicate to drop references on old FBs, I think this problem
> can
> > > > be solved even without increasing the queue depth. Can this be acceptable?
> > >
> > > That's just another flavour of your "increase queue depth without
> > > increasing the atomic queue depth" approach. I still think the underlying
> > > fundamental issue is a timing confusion, and the fact that adjusting the
> > > timings fixes things too kinda proves that. So we need to fix that in a
> > > clean way, not by shuffling things around semi-randomly until the specific
> > > config we tests works.
> > [Kasireddy, Vivek] This issue is not due to a timing or timestamp mismatch. We
> > have carefully instrumented both the Host and Guest compositors and measured
> > the latencies at each step. The relevant debug data only points to the scheduling
> > policy -- of both Host and Guest compositors -- playing a role in Guest rendering
> > at 30 FPS.
> 
> Hm but that essentially means that the events your passing around have an
> even more ad-hoc implementation specific meaning: Essentially it's the
> kick-off for the guest's repaint loop? That sounds even worse for a kms
> uapi extension.
[Kasireddy, Vivek] The pageflip completion event/vblank event indeed serves as the
kick-off for a compositor's (both Guest and Host) repaint loop. AFAICT, Weston 
works that way and even if we increase the queue depth to solve this problem, I don't
think it'll help because the arrival of this event always indicates to a compositor to
start its repaint cycle again and assume that the previous buffers are all free.

> 
> > > Iow I think we need a solution here which both slows down the 90fps to
> > > 60fps for the blit case, and the 30fps speed up to 60fps for the zerocopy
> > > case. Because the host might need to switch transparently between blt and
> > > zerocopy for various reasons.
> > [Kasireddy, Vivek] As I mentioned above, the Host (Qemu) cannot switch UI
> > backends at runtime. In other words, with GTK UI backend, it is always Blit
> > whereas Wayland UI backend is always zero-copy.
> 
> Hm ok, that at least makes things somewhat simpler. Another thing that I
> just realized: What happens when the host changes screen resolution and
> especially refresh rate?
[Kasireddy, Vivek] AFAICT, if the Host changes resolution or if the Qemu UI window
is resized, then that'll trigger a Guest KMS modeset -- via drm_helper_hpd_irq_event().
As far as the refresh rate is concerned, if Qemu is launched with GTK UI backend,
then the "render signal" GTK sends out to apps would reflect the new refresh rate.
And, since the internal dma-fence is tied to this "render signal", Guest updates are
automatically synchronized to the new refresh rate.

If Qemu is launched with the Wayland UI backend, then the internal dma-fence would
be tied to the wl_buffer.release event. And, if Qemu UI's buffer is flipped onto a
hardware plane, then the compositor sends this event out after it gets a pageflip
completion. Therefore, the Guest would start its repaint cycle at Host vblank but 
whether it would submit its frame in time would depend on the scheduling policy --
of both Host and Guest compositors.

Thanks,
Vivek

> -Daniel
> 
> >
> > Thanks,
> > Vivek
> >
> > > -Daniel
> > >
> > > > Thanks,
> > > > Vivek
> > > > >
> > > > > Cheers, Daniel
> > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Earthling Michel Dänzer               |               https://redhat.com
> > > > > > Libre software enthusiast             |             Mesa and X developer
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > http://blog.ffwll.ch
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Daniel Vetter Aug. 10, 2021, 8:30 a.m. UTC | #26

On Tue, Aug 10, 2021 at 08:21:09AM +0000, Kasireddy, Vivek wrote:
> Hi Daniel,
> 
> > On Fri, Aug 06, 2021 at 07:27:13AM +0000, Kasireddy, Vivek wrote:
> > > Hi Daniel,
> > >
> > > > > > > >>> The solution:
> > > > > > > >>> - To ensure full framerate, the Guest compositor has to start it's repaint
> > cycle
> > > > > > (including
> > > > > > > >>> the 9 ms wait) when the Host compositor sends the frame callback event to
> > its
> > > > > > clients.
> > > > > > > >>> In order for this to happen, the dma-fence that the Guest KMS waits on --
> > before
> > > > > > sending
> > > > > > > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. This
> > means
> > > > that,
> > > > > > the
> > > > > > > >>> Guest compositor has to be forced to use a new buffer for its next repaint
> > cycle
> > > > > > when it
> > > > > > > >>> gets a pageflip completion.
> > > > > > > >>
> > > > > > > >> Is that really the only solution?
> > > > > > > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > > > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > > > > > > But I think none of them are as compelling as this one.
> > > > > > > >
> > > > > > > >>
> > > > > > > >> If we fix the event timestamps so that both guest and host use the same
> > > > > > > >> timestamp, but then the guest starts 5ms (or something like that) earlier,
> > > > > > > >> then things should work too? I.e.
> > > > > > > >> - host compositor starts at (previous_frametime + 9ms)
> > > > > > > >> - guest compositor starts at (previous_frametime + 4ms)
> > > > > > > >>
> > > > > > > >> Ofc this only works if the frametimes we hand out to both match _exactly_
> > > > > > > >> and are as high-precision as the ones on the host side. Which for many gpu
> > > > > > > >> drivers at least is the case, and all the ones you care about for sure :-)
> > > > > > > >>
> > > > > > > >> But if the frametimes the guest receives are the no_vblank fake ones, then
> > > > > > > >> they'll be all over the place and this carefully tuned low-latency redraw
> > > > > > > >> loop falls apart. Aside fromm the fact that without tuning the guests to
> > > > > > > >> be earlier than the hosts, you're guaranteed to miss every frame (except
> > > > > > > >> when the timing wobbliness in the guest is big enough by chance to make
> > > > > > > >> the deadline on the oddball frame).
> > > > > > > > [Kasireddy, Vivek] The Guest and Host use different event timestamps as we
> > don't
> > > > > > > > share these between the Guest and the Host. It does not seem to be causing any
> > > > other
> > > > > > > > problems so far but we did try the experiment you mentioned (i.e., adjusting
> > the
> > > > > > delays)
> > > > > > > > and it works. However, this patch series is meant to fix the issue without
> > having to
> > > > > > tweak
> > > > > > > > anything (delays) because we can't do this for every compositor out there.
> > > > > > >
> > > > > > > Maybe there could be a mechanism which allows the compositor in the guest to
> > > > > > automatically adjust its repaint cycle as needed.
> > > > > > >
> > > > > > > This might even be possible without requiring changes in each compositor, by
> > > > adjusting
> > > > > > the vertical blank periods in the guest to be aligned with the host compositor
> > repaint
> > > > > > cycles. Not sure about that though.
> > > > > > >
> > > > > > > Even if not, both this series or making it possible to queue multiple flips require
> > > > > > corresponding changes in each compositor as well to have any effect.
> > > > > >
> > > > > > Yeah from all the discussions and tests done it sounds even with a
> > > > > > deeper queue we have big coordination issues between the guest and
> > > > > > host compositor (like the example that the guest is now rendering at
> > > > > > 90fps instead of 60fps like the host).
> > > > > [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. That 90 FPS vs
> > > > > 60 FPS problem is a completely different issue that is associated with Qemu GTK UI
> > > > > backend. With the GTK backend -- and also with SDL backend -- we Blit the Guest
> > > > > scanout FB onto one of the backbuffers managed by EGL.
> > > > >
> > > > > I am trying to add a new Qemu Wayland UI backend so that we can eliminate that
> > Blit
> > > > > and thereby have a truly zero-copy solution. And, this is there I am running into the
> > > > > halved frame-rate issue -- the current problem.
> > > >
> > > > Yes, that's what I referenced. But I disagree that it's a different
> > > > problem. The underlying problem in both cases is that the guest and host
> > > > compositor free-wheel instead of rendering in sync. It's just that
> > > > depending upon how exactly the flip completion event on the gues side
> > > > plays out you either get guest rendering that's faster than the host-side
> > > > 60fps, or guest rendering that's much slower than the host-side 60fps.
> > > [Kasireddy, Vivek] That used to be the case before we added a synchronization
> > > mechanism to the GTK UI backend that uses a sync file. After adding this
> > > and making the Guest wait until this sync file fd on the Host is signaled, we
> > > consistently get 60 FPS because the flip completion event for the Guest is
> > > directly tied to the signaling of the sync file in this particular case (GTK UI).
> > >
> > > >
> > > > The fundamental problem in both cases is that they don't run in lockstep.
> > > > If you fix that, through fixing the timestamp and even reporting most
> > > > likely, you should be able to fix both bugs.
> > > [Kasireddy, Vivek] GTK UI is an EGL based solution that Blits the Guest scanout
> > > FB onto one of the backbuffers managed by EGL. Wayland UI is a zero-copy
> > > solution that just wraps the dmabuf associated with Guest scanout FB in a
> > > wl_buffer and submits it directly to the Host compositor. These backends are
> > > completely independent of each other and cannot be active at the same time.
> > > In other words, we cannot have zero-copy and Blit based solutions running
> > > parallelly. And, this issue is only relevant for Wayland UI backend and has
> > > nothing to do with GTK UI.
> > >
> > > >
> > > > > > Hence my gut feeling reaction that first we need to get these two
> > > > > > compositors aligned in their timings, which propobably needs
> > > > > > consistent vblank periods/timestamps across them (plus/minux
> > > > > > guest/host clocksource fun ofc). Without this any of the next steps
> > > > > > will simply not work because there's too much jitter by the time the
> > > > > > guest compositor gets the flip completion events.
> > > > > [Kasireddy, Vivek] Timings are not a problem and do not significantly
> > > > > affect the repaint cycles from what I have seen so far.
> > > > >
> > > > > >
> > > > > > Once we have solid events I think we should look into statically
> > > > > > tuning guest/host compositor deadlines (like you've suggested in a
> > > > > > bunch of places) to consisently make that deadline and hit 60 fps.
> > > > > > With that we can then look into tuning this automatically and what to
> > > > > > do when e.g. switching between copying and zero-copy on the host side
> > > > > > (which might be needed in some cases) and how to handle all that.
> > > > > [Kasireddy, Vivek] As I confirm here:
> > https://gitlab.freedesktop.org/wayland/weston/-
> > > > /issues/514#note_984065
> > > > > tweaking the deadlines works (i.e., we get 60 FPS) as we expect. However,
> > > > > I feel that this zero-copy solution I am trying to create should be independent
> > > > > of compositors' deadlines, delays or other scheduling parameters.
> > > >
> > > > That's not how compositors work nowadays. Your problem is that you don't
> > > > have the guest/host compositor in sync. zero-copy only changes the timing,
> > > > so it changes things from "rendering way too many frames" to "rendering
> > > > way too few frames".
> > > >
> > > > We need to fix the timing/sync issue here first, not paper over it with
> > > > hacks.
> > > [Kasireddy, Vivek] What I really meant is that the zero-copy solution should be
> > > independent of the scheduling policies to ensure that it works with all compositors.
> > >  IIUC, Weston for example uses the vblank/pageflip completion timestamp, the
> > > configurable repaint-window value, refresh-rate, etc to determine when to start
> > > its next repaint -- if there is any damage:
> > > timespec_add_nsec(&output->next_repaint, stamp, refresh_nsec);
> > > timespec_add_msec(&output->next_repaint, &output->next_repaint, -compositor-
> > >repaint_msec);
> > >
> > > And, in the case of VKMS, since there is no real hardware, the timestamp is always:
> > > now = ktime_get();
> > > send_vblank_event(dev, e, seq, now);
> > 
> > vkms has been fixed since a while to fake high-precision timestamps like
> > from a real display.
> [Kasireddy, Vivek] IIUC, that might be one of the reasons why the Guest does not need 
> to have the same timestamp as that of the Host -- to work as expected.
> 
> > 
> > > When you say that the Guest/Host compositor need to stay in sync, are you
> > > suggesting that we need to ensure that the vblank timestamp on the Host
> > > needs to be shared and be the same on the Guest and a vblank/pageflip
> > > completion for the Guest needs to be sent at exactly the same time it is sent
> > > on the Host? If yes, I'd say that we do send the pageflip completion to Guest
> > > around the same time a vblank is generated on the Host but it does not help
> > > because the Guest compositor would only have 9 ms to submit a new frame
> > > and if the Host is running Mutter, the Guest would only have 2 ms.
> > > (https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984341)
> > 
> > Not at the same time, but the same timestamp. And yes there is some fun
> > there, which is I think the fundamental issue. Or at least some of the
> > compositor experts seem to think so, and it makes sense to me.
> [Kasireddy, Vivek] It is definitely possible that if the timestamp is messed up, then
> the Guest repaint cycle would be affected. However, I do not believe that is the case
> here given the debug and instrumentation data we collected and scrutinized. Hopefully,
> compositor experts could chime in to shed some light on this matter.
> 
> > 
> > > >
> > > > Only, and I really mean only, when that shows that it's simply impossible
> > > > to hit 60fps with zero-copy and the guest/host fully aligned should we
> > > > look into making the overall pipeline deeper.
> > > [Kasireddy, Vivek] From all the experiments conducted so far and given the
> > > discussion associated with https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> > > I think we have already established that in order for a zero-copy solution to work
> > > reliably, the Guest compositor needs to start its repaint cycle when the Host
> > > compositor sends a frame callback event to its clients.
> > >
> > > >
> > > > > > Only when that all shows that we just can't hit 60fps consistently and
> > > > > > really need 3 buffers in flight should we look at deeper kms queues.
> > > > > > And then we really need to implement them properly and not with a
> > > > > > mismatch between drm_event an out-fence signalling. These quick hacks
> > > > > > are good for experiments, but there's a pile of other things we need
> > > > > > to do first. At least that's how I understand the problem here right
> > > > > > now.
> > > > > [Kasireddy, Vivek] Experiments done so far indicate that we can hit 59 FPS
> > consistently
> > > > > -- in a zero-copy way independent of compositors' delays/deadlines -- with this
> > > > > patch series + the Weston MR I linked in the cover letter. The main reason why this
> > > > > works is because we relax the assumption that when the Guest compositor gets a
> > > > > pageflip completion event that it could reuse the old FB it submitted in the previous
> > > > > atomic flip and instead force it to use a new one. And, we send the pageflip
> > completion
> > > > > event to the Guest when the Host compositor sends a frame callback event. Lastly,
> > > > > we use the (deferred) out_fence as just a mechanism to tell the Guest compositor
> > when
> > > > > it can release references on old FBs so that they can be reused again.
> > > > >
> > > > > With that being said, the only question is how can we accomplish the above in an
> > > > upstream
> > > > > acceptable way without regressing anything particularly on bare-metal. Its not clear
> > if
> > > > just
> > > > > increasing the queue depth would work or not but I think the Guest compositor has to
> > be
> > > > told
> > > > > when it can start its repaint cycle and when it can assume the old FB is no longer in
> > use.
> > > > > On bare-metal -- and also with VKMS as of today -- a pageflip completion indicates
> > > > both.
> > > > > In other words, Vblank event is the same as Flip done, which makes sense on bare-
> > metal.
> > > > > But if we were to have two events at-least for VKMS: vblank to indicate to Guest to
> > start
> > > > > repaint and flip_done to indicate to drop references on old FBs, I think this problem
> > can
> > > > > be solved even without increasing the queue depth. Can this be acceptable?
> > > >
> > > > That's just another flavour of your "increase queue depth without
> > > > increasing the atomic queue depth" approach. I still think the underlying
> > > > fundamental issue is a timing confusion, and the fact that adjusting the
> > > > timings fixes things too kinda proves that. So we need to fix that in a
> > > > clean way, not by shuffling things around semi-randomly until the specific
> > > > config we tests works.
> > > [Kasireddy, Vivek] This issue is not due to a timing or timestamp mismatch. We
> > > have carefully instrumented both the Host and Guest compositors and measured
> > > the latencies at each step. The relevant debug data only points to the scheduling
> > > policy -- of both Host and Guest compositors -- playing a role in Guest rendering
> > > at 30 FPS.
> > 
> > Hm but that essentially means that the events your passing around have an
> > even more ad-hoc implementation specific meaning: Essentially it's the
> > kick-off for the guest's repaint loop? That sounds even worse for a kms
> > uapi extension.
> [Kasireddy, Vivek] The pageflip completion event/vblank event indeed serves as the
> kick-off for a compositor's (both Guest and Host) repaint loop. AFAICT, Weston 
> works that way and even if we increase the queue depth to solve this problem, I don't
> think it'll help because the arrival of this event always indicates to a compositor to
> start its repaint cycle again and assume that the previous buffers are all free.

I thought this is how simple compositors work, and weston has since a
while it's own timer, which is based on the timestamp it gets (at on
drivers with vblank support), so that it starts the repaint loop a few ms
before the next vblank. And not immediately when it receives the old page
flip completion event.

Ofc if the flip completion event is late, it needs to delay the repaint
cycle.

> > > > Iow I think we need a solution here which both slows down the 90fps to
> > > > 60fps for the blit case, and the 30fps speed up to 60fps for the zerocopy
> > > > case. Because the host might need to switch transparently between blt and
> > > > zerocopy for various reasons.
> > > [Kasireddy, Vivek] As I mentioned above, the Host (Qemu) cannot switch UI
> > > backends at runtime. In other words, with GTK UI backend, it is always Blit
> > > whereas Wayland UI backend is always zero-copy.
> > 
> > Hm ok, that at least makes things somewhat simpler. Another thing that I
> > just realized: What happens when the host changes screen resolution and
> > especially refresh rate?
> [Kasireddy, Vivek] AFAICT, if the Host changes resolution or if the Qemu UI window
> is resized, then that'll trigger a Guest KMS modeset -- via drm_helper_hpd_irq_event().
> As far as the refresh rate is concerned, if Qemu is launched with GTK UI backend,
> then the "render signal" GTK sends out to apps would reflect the new refresh rate.
> And, since the internal dma-fence is tied to this "render signal", Guest updates are
> automatically synchronized to the new refresh rate.

Yeah, the problem is that right now kms uapi assumes that the refresh rate
doesn't just randomly change underneath the compositor. Which with kvm it
does, which is a bit annoying. And without the refresh rate the guest
compositor can't really time it's repaint loop properly.

> If Qemu is launched with the Wayland UI backend, then the internal dma-fence would
> be tied to the wl_buffer.release event. And, if Qemu UI's buffer is flipped onto a
> hardware plane, then the compositor sends this event out after it gets a pageflip
> completion. Therefore, the Guest would start its repaint cycle at Host vblank but 
> whether it would submit its frame in time would depend on the scheduling policy --
> of both Host and Guest compositors.

Yeah this is all very tightly tied together, which is why I think we need
something that looks at the entire picture. And not so much a quick change
somewhere with badly defined semantics that happens to work in specific
cases. Which I think is what we have here.
-Daniel

> 
> Thanks,
> Vivek
> 
> > -Daniel
> > 
> > >
> > > Thanks,
> > > Vivek
> > >
> > > > -Daniel
> > > >
> > > > > Thanks,
> > > > > Vivek
> > > > > >
> > > > > > Cheers, Daniel
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Earthling Michel Dänzer               |               https://redhat.com
> > > > > > > Libre software enthusiast             |             Mesa and X developer
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Daniel Vetter
> > > > > > Software Engineer, Intel Corporation
> > > > > > http://blog.ffwll.ch
> > > >
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> > 
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

Michel Dänzer Aug. 10, 2021, 10:57 a.m. UTC | #27

On 2021-08-10 10:30 a.m., Daniel Vetter wrote:
> On Tue, Aug 10, 2021 at 08:21:09AM +0000, Kasireddy, Vivek wrote:
>>> On Fri, Aug 06, 2021 at 07:27:13AM +0000, Kasireddy, Vivek wrote:
>>>>>>>
>>>>>>> Hence my gut feeling reaction that first we need to get these two
>>>>>>> compositors aligned in their timings, which propobably needs
>>>>>>> consistent vblank periods/timestamps across them (plus/minux
>>>>>>> guest/host clocksource fun ofc). Without this any of the next steps
>>>>>>> will simply not work because there's too much jitter by the time the
>>>>>>> guest compositor gets the flip completion events.
>>>>>> [Kasireddy, Vivek] Timings are not a problem and do not significantly
>>>>>> affect the repaint cycles from what I have seen so far.
>>>>>>
>>>>>>>
>>>>>>> Once we have solid events I think we should look into statically
>>>>>>> tuning guest/host compositor deadlines (like you've suggested in a
>>>>>>> bunch of places) to consisently make that deadline and hit 60 fps.
>>>>>>> With that we can then look into tuning this automatically and what to
>>>>>>> do when e.g. switching between copying and zero-copy on the host side
>>>>>>> (which might be needed in some cases) and how to handle all that.
>>>>>> [Kasireddy, Vivek] As I confirm here:
>>> https://gitlab.freedesktop.org/wayland/weston/-
>>>>> /issues/514#note_984065
>>>>>> tweaking the deadlines works (i.e., we get 60 FPS) as we expect. However,
>>>>>> I feel that this zero-copy solution I am trying to create should be independent
>>>>>> of compositors' deadlines, delays or other scheduling parameters.
>>>>>
>>>>> That's not how compositors work nowadays. Your problem is that you don't
>>>>> have the guest/host compositor in sync. zero-copy only changes the timing,
>>>>> so it changes things from "rendering way too many frames" to "rendering
>>>>> way too few frames".
>>>>>
>>>>> We need to fix the timing/sync issue here first, not paper over it with
>>>>> hacks.
>>>> [Kasireddy, Vivek] What I really meant is that the zero-copy solution should be
>>>> independent of the scheduling policies to ensure that it works with all compositors.
>>>>  IIUC, Weston for example uses the vblank/pageflip completion timestamp, the
>>>> configurable repaint-window value, refresh-rate, etc to determine when to start
>>>> its next repaint -- if there is any damage:
>>>> timespec_add_nsec(&output->next_repaint, stamp, refresh_nsec);
>>>> timespec_add_msec(&output->next_repaint, &output->next_repaint, -compositor-
>>>> repaint_msec);
>>>>
>>>> And, in the case of VKMS, since there is no real hardware, the timestamp is always:
>>>> now = ktime_get();
>>>> send_vblank_event(dev, e, seq, now);
>>>
>>> vkms has been fixed since a while to fake high-precision timestamps like
>>> from a real display.
>> [Kasireddy, Vivek] IIUC, that might be one of the reasons why the Guest does not need 
>> to have the same timestamp as that of the Host -- to work as expected.
>>
>>>
>>>> When you say that the Guest/Host compositor need to stay in sync, are you
>>>> suggesting that we need to ensure that the vblank timestamp on the Host
>>>> needs to be shared and be the same on the Guest and a vblank/pageflip
>>>> completion for the Guest needs to be sent at exactly the same time it is sent
>>>> on the Host? If yes, I'd say that we do send the pageflip completion to Guest
>>>> around the same time a vblank is generated on the Host but it does not help
>>>> because the Guest compositor would only have 9 ms to submit a new frame
>>>> and if the Host is running Mutter, the Guest would only have 2 ms.
>>>> (https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984341)
>>>
>>> Not at the same time, but the same timestamp. And yes there is some fun
>>> there, which is I think the fundamental issue. Or at least some of the
>>> compositor experts seem to think so, and it makes sense to me.
>> [Kasireddy, Vivek] It is definitely possible that if the timestamp is messed up, then
>> the Guest repaint cycle would be affected. However, I do not believe that is the case
>> here given the debug and instrumentation data we collected and scrutinized. Hopefully,
>> compositor experts could chime in to shed some light on this matter.
>>
>>>
>>>>>
>>>>> Only, and I really mean only, when that shows that it's simply impossible
>>>>> to hit 60fps with zero-copy and the guest/host fully aligned should we
>>>>> look into making the overall pipeline deeper.
>>>> [Kasireddy, Vivek] From all the experiments conducted so far and given the
>>>> discussion associated with https://gitlab.freedesktop.org/wayland/weston/-/issues/514
>>>> I think we have already established that in order for a zero-copy solution to work
>>>> reliably, the Guest compositor needs to start its repaint cycle when the Host
>>>> compositor sends a frame callback event to its clients.
>>>>
>>>>>
>>>>>>> Only when that all shows that we just can't hit 60fps consistently and
>>>>>>> really need 3 buffers in flight should we look at deeper kms queues.
>>>>>>> And then we really need to implement them properly and not with a
>>>>>>> mismatch between drm_event an out-fence signalling. These quick hacks
>>>>>>> are good for experiments, but there's a pile of other things we need
>>>>>>> to do first. At least that's how I understand the problem here right
>>>>>>> now.
>>>>>> [Kasireddy, Vivek] Experiments done so far indicate that we can hit 59 FPS
>>> consistently
>>>>>> -- in a zero-copy way independent of compositors' delays/deadlines -- with this
>>>>>> patch series + the Weston MR I linked in the cover letter. The main reason why this
>>>>>> works is because we relax the assumption that when the Guest compositor gets a
>>>>>> pageflip completion event that it could reuse the old FB it submitted in the previous
>>>>>> atomic flip and instead force it to use a new one. And, we send the pageflip
>>> completion
>>>>>> event to the Guest when the Host compositor sends a frame callback event. Lastly,
>>>>>> we use the (deferred) out_fence as just a mechanism to tell the Guest compositor
>>> when
>>>>>> it can release references on old FBs so that they can be reused again.
>>>>>>
>>>>>> With that being said, the only question is how can we accomplish the above in an
>>>>> upstream
>>>>>> acceptable way without regressing anything particularly on bare-metal. Its not clear
>>> if
>>>>> just
>>>>>> increasing the queue depth would work or not but I think the Guest compositor has to
>>> be
>>>>> told
>>>>>> when it can start its repaint cycle and when it can assume the old FB is no longer in
>>> use.
>>>>>> On bare-metal -- and also with VKMS as of today -- a pageflip completion indicates
>>>>> both.
>>>>>> In other words, Vblank event is the same as Flip done, which makes sense on bare-
>>> metal.
>>>>>> But if we were to have two events at-least for VKMS: vblank to indicate to Guest to
>>> start
>>>>>> repaint and flip_done to indicate to drop references on old FBs, I think this problem
>>> can
>>>>>> be solved even without increasing the queue depth. Can this be acceptable?
>>>>>
>>>>> That's just another flavour of your "increase queue depth without
>>>>> increasing the atomic queue depth" approach. I still think the underlying
>>>>> fundamental issue is a timing confusion, and the fact that adjusting the
>>>>> timings fixes things too kinda proves that. So we need to fix that in a
>>>>> clean way, not by shuffling things around semi-randomly until the specific
>>>>> config we tests works.
>>>> [Kasireddy, Vivek] This issue is not due to a timing or timestamp mismatch. We
>>>> have carefully instrumented both the Host and Guest compositors and measured
>>>> the latencies at each step. The relevant debug data only points to the scheduling
>>>> policy -- of both Host and Guest compositors -- playing a role in Guest rendering
>>>> at 30 FPS.
>>>
>>> Hm but that essentially means that the events your passing around have an
>>> even more ad-hoc implementation specific meaning: Essentially it's the
>>> kick-off for the guest's repaint loop? That sounds even worse for a kms
>>> uapi extension.
>> [Kasireddy, Vivek] The pageflip completion event/vblank event indeed serves as the
>> kick-off for a compositor's (both Guest and Host) repaint loop. AFAICT, Weston 
>> works that way and even if we increase the queue depth to solve this problem, I don't
>> think it'll help because the arrival of this event always indicates to a compositor to
>> start its repaint cycle again and assume that the previous buffers are all free.
> 
> I thought this is how simple compositors work, and weston has since a
> while it's own timer, which is based on the timestamp it gets (at on
> drivers with vblank support), so that it starts the repaint loop a few ms
> before the next vblank. And not immediately when it receives the old page
> flip completion event.

As long as it's a fixed timer, there's still a risk that the guest compositor repaint cycle runs too late for the host one (unless the guest cycle happens to be scheduled significantly earlier than the host one).

Note that current mutter Git main (to become the 41 release this autumn) uses dynamic scheduling of its repaint cycle based on how long the last 16 frames took to draw and present. In theory, this could automatically schedule the guest cycle early enough for the host one.

Vivek Kasireddy Aug. 11, 2021, 7:20 a.m. UTC | #28

Hi Daniel,

> On Tue, Aug 10, 2021 at 08:21:09AM +0000, Kasireddy, Vivek wrote:
> > Hi Daniel,
> >
> > > On Fri, Aug 06, 2021 at 07:27:13AM +0000, Kasireddy, Vivek wrote:
> > > > Hi Daniel,
> > > >
> > > > > > > > >>> The solution:
> > > > > > > > >>> - To ensure full framerate, the Guest compositor has to start it's repaint
> > > cycle
> > > > > > > (including
> > > > > > > > >>> the 9 ms wait) when the Host compositor sends the frame callback event
> to
> > > its
> > > > > > > clients.
> > > > > > > > >>> In order for this to happen, the dma-fence that the Guest KMS waits on -
> -
> > > before
> > > > > > > sending
> > > > > > > > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. This
> > > means
> > > > > that,
> > > > > > > the
> > > > > > > > >>> Guest compositor has to be forced to use a new buffer for its next
> repaint
> > > cycle
> > > > > > > when it
> > > > > > > > >>> gets a pageflip completion.
> > > > > > > > >>
> > > > > > > > >> Is that really the only solution?
> > > > > > > > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > > > > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > > > > > > > But I think none of them are as compelling as this one.
> > > > > > > > >
> > > > > > > > >>
> > > > > > > > >> If we fix the event timestamps so that both guest and host use the same
> > > > > > > > >> timestamp, but then the guest starts 5ms (or something like that) earlier,
> > > > > > > > >> then things should work too? I.e.
> > > > > > > > >> - host compositor starts at (previous_frametime + 9ms)
> > > > > > > > >> - guest compositor starts at (previous_frametime + 4ms)
> > > > > > > > >>
> > > > > > > > >> Ofc this only works if the frametimes we hand out to both match
> _exactly_
> > > > > > > > >> and are as high-precision as the ones on the host side. Which for many
> gpu
> > > > > > > > >> drivers at least is the case, and all the ones you care about for sure :-)
> > > > > > > > >>
> > > > > > > > >> But if the frametimes the guest receives are the no_vblank fake ones,
> then
> > > > > > > > >> they'll be all over the place and this carefully tuned low-latency redraw
> > > > > > > > >> loop falls apart. Aside fromm the fact that without tuning the guests to
> > > > > > > > >> be earlier than the hosts, you're guaranteed to miss every frame (except
> > > > > > > > >> when the timing wobbliness in the guest is big enough by chance to make
> > > > > > > > >> the deadline on the oddball frame).
> > > > > > > > > [Kasireddy, Vivek] The Guest and Host use different event timestamps as
> we
> > > don't
> > > > > > > > > share these between the Guest and the Host. It does not seem to be causing
> any
> > > > > other
> > > > > > > > > problems so far but we did try the experiment you mentioned (i.e.,
> adjusting
> > > the
> > > > > > > delays)
> > > > > > > > > and it works. However, this patch series is meant to fix the issue without
> > > having to
> > > > > > > tweak
> > > > > > > > > anything (delays) because we can't do this for every compositor out there.
> > > > > > > >
> > > > > > > > Maybe there could be a mechanism which allows the compositor in the guest
> to
> > > > > > > automatically adjust its repaint cycle as needed.
> > > > > > > >
> > > > > > > > This might even be possible without requiring changes in each compositor,
> by
> > > > > adjusting
> > > > > > > the vertical blank periods in the guest to be aligned with the host compositor
> > > repaint
> > > > > > > cycles. Not sure about that though.
> > > > > > > >
> > > > > > > > Even if not, both this series or making it possible to queue multiple flips
> require
> > > > > > > corresponding changes in each compositor as well to have any effect.
> > > > > > >
> > > > > > > Yeah from all the discussions and tests done it sounds even with a
> > > > > > > deeper queue we have big coordination issues between the guest and
> > > > > > > host compositor (like the example that the guest is now rendering at
> > > > > > > 90fps instead of 60fps like the host).
> > > > > > [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. That 90
> FPS vs
> > > > > > 60 FPS problem is a completely different issue that is associated with Qemu
> GTK UI
> > > > > > backend. With the GTK backend -- and also with SDL backend -- we Blit the
> Guest
> > > > > > scanout FB onto one of the backbuffers managed by EGL.
> > > > > >
> > > > > > I am trying to add a new Qemu Wayland UI backend so that we can eliminate
> that
> > > Blit
> > > > > > and thereby have a truly zero-copy solution. And, this is there I am running into
> the
> > > > > > halved frame-rate issue -- the current problem.
> > > > >
> > > > > Yes, that's what I referenced. But I disagree that it's a different
> > > > > problem. The underlying problem in both cases is that the guest and host
> > > > > compositor free-wheel instead of rendering in sync. It's just that
> > > > > depending upon how exactly the flip completion event on the gues side
> > > > > plays out you either get guest rendering that's faster than the host-side
> > > > > 60fps, or guest rendering that's much slower than the host-side 60fps.
> > > > [Kasireddy, Vivek] That used to be the case before we added a synchronization
> > > > mechanism to the GTK UI backend that uses a sync file. After adding this
> > > > and making the Guest wait until this sync file fd on the Host is signaled, we
> > > > consistently get 60 FPS because the flip completion event for the Guest is
> > > > directly tied to the signaling of the sync file in this particular case (GTK UI).
> > > >
> > > > >
> > > > > The fundamental problem in both cases is that they don't run in lockstep.
> > > > > If you fix that, through fixing the timestamp and even reporting most
> > > > > likely, you should be able to fix both bugs.
> > > > [Kasireddy, Vivek] GTK UI is an EGL based solution that Blits the Guest scanout
> > > > FB onto one of the backbuffers managed by EGL. Wayland UI is a zero-copy
> > > > solution that just wraps the dmabuf associated with Guest scanout FB in a
> > > > wl_buffer and submits it directly to the Host compositor. These backends are
> > > > completely independent of each other and cannot be active at the same time.
> > > > In other words, we cannot have zero-copy and Blit based solutions running
> > > > parallelly. And, this issue is only relevant for Wayland UI backend and has
> > > > nothing to do with GTK UI.
> > > >
> > > > >
> > > > > > > Hence my gut feeling reaction that first we need to get these two
> > > > > > > compositors aligned in their timings, which propobably needs
> > > > > > > consistent vblank periods/timestamps across them (plus/minux
> > > > > > > guest/host clocksource fun ofc). Without this any of the next steps
> > > > > > > will simply not work because there's too much jitter by the time the
> > > > > > > guest compositor gets the flip completion events.
> > > > > > [Kasireddy, Vivek] Timings are not a problem and do not significantly
> > > > > > affect the repaint cycles from what I have seen so far.
> > > > > >
> > > > > > >
> > > > > > > Once we have solid events I think we should look into statically
> > > > > > > tuning guest/host compositor deadlines (like you've suggested in a
> > > > > > > bunch of places) to consisently make that deadline and hit 60 fps.
> > > > > > > With that we can then look into tuning this automatically and what to
> > > > > > > do when e.g. switching between copying and zero-copy on the host side
> > > > > > > (which might be needed in some cases) and how to handle all that.
> > > > > > [Kasireddy, Vivek] As I confirm here:
> > > https://gitlab.freedesktop.org/wayland/weston/-
> > > > > /issues/514#note_984065
> > > > > > tweaking the deadlines works (i.e., we get 60 FPS) as we expect. However,
> > > > > > I feel that this zero-copy solution I am trying to create should be independent
> > > > > > of compositors' deadlines, delays or other scheduling parameters.
> > > > >
> > > > > That's not how compositors work nowadays. Your problem is that you don't
> > > > > have the guest/host compositor in sync. zero-copy only changes the timing,
> > > > > so it changes things from "rendering way too many frames" to "rendering
> > > > > way too few frames".
> > > > >
> > > > > We need to fix the timing/sync issue here first, not paper over it with
> > > > > hacks.
> > > > [Kasireddy, Vivek] What I really meant is that the zero-copy solution should be
> > > > independent of the scheduling policies to ensure that it works with all compositors.
> > > >  IIUC, Weston for example uses the vblank/pageflip completion timestamp, the
> > > > configurable repaint-window value, refresh-rate, etc to determine when to start
> > > > its next repaint -- if there is any damage:
> > > > timespec_add_nsec(&output->next_repaint, stamp, refresh_nsec);
> > > > timespec_add_msec(&output->next_repaint, &output->next_repaint, -compositor-
> > > >repaint_msec);
> > > >
> > > > And, in the case of VKMS, since there is no real hardware, the timestamp is always:
> > > > now = ktime_get();
> > > > send_vblank_event(dev, e, seq, now);
> > >
> > > vkms has been fixed since a while to fake high-precision timestamps like
> > > from a real display.
> > [Kasireddy, Vivek] IIUC, that might be one of the reasons why the Guest does not need
> > to have the same timestamp as that of the Host -- to work as expected.
> >
> > >
> > > > When you say that the Guest/Host compositor need to stay in sync, are you
> > > > suggesting that we need to ensure that the vblank timestamp on the Host
> > > > needs to be shared and be the same on the Guest and a vblank/pageflip
> > > > completion for the Guest needs to be sent at exactly the same time it is sent
> > > > on the Host? If yes, I'd say that we do send the pageflip completion to Guest
> > > > around the same time a vblank is generated on the Host but it does not help
> > > > because the Guest compositor would only have 9 ms to submit a new frame
> > > > and if the Host is running Mutter, the Guest would only have 2 ms.
> > > > (https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984341)
> > >
> > > Not at the same time, but the same timestamp. And yes there is some fun
> > > there, which is I think the fundamental issue. Or at least some of the
> > > compositor experts seem to think so, and it makes sense to me.
> > [Kasireddy, Vivek] It is definitely possible that if the timestamp is messed up, then
> > the Guest repaint cycle would be affected. However, I do not believe that is the case
> > here given the debug and instrumentation data we collected and scrutinized. Hopefully,
> > compositor experts could chime in to shed some light on this matter.
> >
> > >
> > > > >
> > > > > Only, and I really mean only, when that shows that it's simply impossible
> > > > > to hit 60fps with zero-copy and the guest/host fully aligned should we
> > > > > look into making the overall pipeline deeper.
> > > > [Kasireddy, Vivek] From all the experiments conducted so far and given the
> > > > discussion associated with https://gitlab.freedesktop.org/wayland/weston/-
> /issues/514
> > > > I think we have already established that in order for a zero-copy solution to work
> > > > reliably, the Guest compositor needs to start its repaint cycle when the Host
> > > > compositor sends a frame callback event to its clients.
> > > >
> > > > >
> > > > > > > Only when that all shows that we just can't hit 60fps consistently and
> > > > > > > really need 3 buffers in flight should we look at deeper kms queues.
> > > > > > > And then we really need to implement them properly and not with a
> > > > > > > mismatch between drm_event an out-fence signalling. These quick hacks
> > > > > > > are good for experiments, but there's a pile of other things we need
> > > > > > > to do first. At least that's how I understand the problem here right
> > > > > > > now.
> > > > > > [Kasireddy, Vivek] Experiments done so far indicate that we can hit 59 FPS
> > > consistently
> > > > > > -- in a zero-copy way independent of compositors' delays/deadlines -- with this
> > > > > > patch series + the Weston MR I linked in the cover letter. The main reason why
> this
> > > > > > works is because we relax the assumption that when the Guest compositor gets a
> > > > > > pageflip completion event that it could reuse the old FB it submitted in the
> previous
> > > > > > atomic flip and instead force it to use a new one. And, we send the pageflip
> > > completion
> > > > > > event to the Guest when the Host compositor sends a frame callback event.
> Lastly,
> > > > > > we use the (deferred) out_fence as just a mechanism to tell the Guest compositor
> > > when
> > > > > > it can release references on old FBs so that they can be reused again.
> > > > > >
> > > > > > With that being said, the only question is how can we accomplish the above in an
> > > > > upstream
> > > > > > acceptable way without regressing anything particularly on bare-metal. Its not
> clear
> > > if
> > > > > just
> > > > > > increasing the queue depth would work or not but I think the Guest compositor
> has to
> > > be
> > > > > told
> > > > > > when it can start its repaint cycle and when it can assume the old FB is no longer
> in
> > > use.
> > > > > > On bare-metal -- and also with VKMS as of today -- a pageflip completion
> indicates
> > > > > both.
> > > > > > In other words, Vblank event is the same as Flip done, which makes sense on
> bare-
> > > metal.
> > > > > > But if we were to have two events at-least for VKMS: vblank to indicate to
> Guest to
> > > start
> > > > > > repaint and flip_done to indicate to drop references on old FBs, I think this
> problem
> > > can
> > > > > > be solved even without increasing the queue depth. Can this be acceptable?
> > > > >
> > > > > That's just another flavour of your "increase queue depth without
> > > > > increasing the atomic queue depth" approach. I still think the underlying
> > > > > fundamental issue is a timing confusion, and the fact that adjusting the
> > > > > timings fixes things too kinda proves that. So we need to fix that in a
> > > > > clean way, not by shuffling things around semi-randomly until the specific
> > > > > config we tests works.
> > > > [Kasireddy, Vivek] This issue is not due to a timing or timestamp mismatch. We
> > > > have carefully instrumented both the Host and Guest compositors and measured
> > > > the latencies at each step. The relevant debug data only points to the scheduling
> > > > policy -- of both Host and Guest compositors -- playing a role in Guest rendering
> > > > at 30 FPS.
> > >
> > > Hm but that essentially means that the events your passing around have an
> > > even more ad-hoc implementation specific meaning: Essentially it's the
> > > kick-off for the guest's repaint loop? That sounds even worse for a kms
> > > uapi extension.
> > [Kasireddy, Vivek] The pageflip completion event/vblank event indeed serves as the
> > kick-off for a compositor's (both Guest and Host) repaint loop. AFAICT, Weston
> > works that way and even if we increase the queue depth to solve this problem, I don't
> > think it'll help because the arrival of this event always indicates to a compositor to
> > start its repaint cycle again and assume that the previous buffers are all free.
> 
> I thought this is how simple compositors work, and weston has since a
> while it's own timer, which is based on the timestamp it gets (at on
> drivers with vblank support), so that it starts the repaint loop a few ms
> before the next vblank. And not immediately when it receives the old page
> flip completion event.
[Kasireddy, Vivek] Right, Weston does use a timer (named repaint_timer) to determine
when to start its next repaint cycle. And, IIUC, the way it works is it uses the Vblank
timestamp and refresh rate to calculate the cycle length and then deduct the configurable
"repaint-window" to calculate the delay. So, for a refresh rate of 60 Hz, which implies
a cycle length of ~16.66 ms, and a default repaint-window value of 7 ms, the delay would
be ~9 ms. Therefore, from the current vblank timestamp, it waits for ~9 ms before starting
repaint again.

The above behavior is identical for both bare-metal and also with virtual KMS Guest
drivers that use fake vblank events. However, it does all the above things only after
getting a pageflip completion event.

> 
> Ofc if the flip completion event is late, it needs to delay the repaint
> cycle.
> 
> > > > > Iow I think we need a solution here which both slows down the 90fps to
> > > > > 60fps for the blit case, and the 30fps speed up to 60fps for the zerocopy
> > > > > case. Because the host might need to switch transparently between blt and
> > > > > zerocopy for various reasons.
> > > > [Kasireddy, Vivek] As I mentioned above, the Host (Qemu) cannot switch UI
> > > > backends at runtime. In other words, with GTK UI backend, it is always Blit
> > > > whereas Wayland UI backend is always zero-copy.
> > >
> > > Hm ok, that at least makes things somewhat simpler. Another thing that I
> > > just realized: What happens when the host changes screen resolution and
> > > especially refresh rate?
> > [Kasireddy, Vivek] AFAICT, if the Host changes resolution or if the Qemu UI window
> > is resized, then that'll trigger a Guest KMS modeset -- via drm_helper_hpd_irq_event().
> > As far as the refresh rate is concerned, if Qemu is launched with GTK UI backend,
> > then the "render signal" GTK sends out to apps would reflect the new refresh rate.
> > And, since the internal dma-fence is tied to this "render signal", Guest updates are
> > automatically synchronized to the new refresh rate.
> 
> Yeah, the problem is that right now kms uapi assumes that the refresh rate
> doesn't just randomly change underneath the compositor. Which with kvm it
> does, which is a bit annoying. And without the refresh rate the guest
> compositor can't really time it's repaint loop properly.
[Kasireddy, Vivek] The Guest compositor would get notified via UDEV if the Host does
a modeset because Guest KMS would trigger a hotplug. However, I think having a
refresh rate that is different between Guest and Host compositors is not desirable.

> 
> > If Qemu is launched with the Wayland UI backend, then the internal dma-fence would
> > be tied to the wl_buffer.release event. And, if Qemu UI's buffer is flipped onto a
> > hardware plane, then the compositor sends this event out after it gets a pageflip
> > completion. Therefore, the Guest would start its repaint cycle at Host vblank but
> > whether it would submit its frame in time would depend on the scheduling policy --
> > of both Host and Guest compositors.
> 
> Yeah this is all very tightly tied together, which is why I think we need
> something that looks at the entire picture. And not so much a quick change
> somewhere with badly defined semantics that happens to work in specific
> cases. Which I think is what we have here.
[Kasireddy, Vivek] I think it is time to discuss and come up with correct semantics in order
to ensure that this solution works without being affected by the scheduling policy of either
compositors. AFAICT, for this to work, the Guest compositor needs two signals/fences --
or events: one to tell it to start repaint cycle (vblank event) and the other to tell it to release
references on old FBs (flip done event) instead of just pageflip completion event. And, we
might want to limit this to only virtual KMS drivers.

Thanks,
Vivek

> -Daniel
> 
> >
> > Thanks,
> > Vivek
> >
> > > -Daniel
> > >
> > > >
> > > > Thanks,
> > > > Vivek
> > > >
> > > > > -Daniel
> > > > >
> > > > > > Thanks,
> > > > > > Vivek
> > > > > > >
> > > > > > > Cheers, Daniel
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Earthling Michel Dänzer               |               https://redhat.com
> > > > > > > > Libre software enthusiast             |             Mesa and X developer
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Daniel Vetter
> > > > > > > Software Engineer, Intel Corporation
> > > > > > > http://blog.ffwll.ch
> > > > >
> > > > > --
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > http://blog.ffwll.ch
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Vivek Kasireddy Aug. 11, 2021, 7:25 a.m. UTC | #29

Hi Michel,
 
> On 2021-08-10 10:30 a.m., Daniel Vetter wrote:
> > On Tue, Aug 10, 2021 at 08:21:09AM +0000, Kasireddy, Vivek wrote:
> >>> On Fri, Aug 06, 2021 at 07:27:13AM +0000, Kasireddy, Vivek wrote:
> >>>>>>>
> >>>>>>> Hence my gut feeling reaction that first we need to get these two
> >>>>>>> compositors aligned in their timings, which propobably needs
> >>>>>>> consistent vblank periods/timestamps across them (plus/minux
> >>>>>>> guest/host clocksource fun ofc). Without this any of the next steps
> >>>>>>> will simply not work because there's too much jitter by the time the
> >>>>>>> guest compositor gets the flip completion events.
> >>>>>> [Kasireddy, Vivek] Timings are not a problem and do not significantly
> >>>>>> affect the repaint cycles from what I have seen so far.
> >>>>>>
> >>>>>>>
> >>>>>>> Once we have solid events I think we should look into statically
> >>>>>>> tuning guest/host compositor deadlines (like you've suggested in a
> >>>>>>> bunch of places) to consisently make that deadline and hit 60 fps.
> >>>>>>> With that we can then look into tuning this automatically and what to
> >>>>>>> do when e.g. switching between copying and zero-copy on the host side
> >>>>>>> (which might be needed in some cases) and how to handle all that.
> >>>>>> [Kasireddy, Vivek] As I confirm here:
> >>> https://gitlab.freedesktop.org/wayland/weston/-
> >>>>> /issues/514#note_984065
> >>>>>> tweaking the deadlines works (i.e., we get 60 FPS) as we expect. However,
> >>>>>> I feel that this zero-copy solution I am trying to create should be independent
> >>>>>> of compositors' deadlines, delays or other scheduling parameters.
> >>>>>
> >>>>> That's not how compositors work nowadays. Your problem is that you don't
> >>>>> have the guest/host compositor in sync. zero-copy only changes the timing,
> >>>>> so it changes things from "rendering way too many frames" to "rendering
> >>>>> way too few frames".
> >>>>>
> >>>>> We need to fix the timing/sync issue here first, not paper over it with
> >>>>> hacks.
> >>>> [Kasireddy, Vivek] What I really meant is that the zero-copy solution should be
> >>>> independent of the scheduling policies to ensure that it works with all compositors.
> >>>>  IIUC, Weston for example uses the vblank/pageflip completion timestamp, the
> >>>> configurable repaint-window value, refresh-rate, etc to determine when to start
> >>>> its next repaint -- if there is any damage:
> >>>> timespec_add_nsec(&output->next_repaint, stamp, refresh_nsec);
> >>>> timespec_add_msec(&output->next_repaint, &output->next_repaint, -compositor-
> >>>> repaint_msec);
> >>>>
> >>>> And, in the case of VKMS, since there is no real hardware, the timestamp is always:
> >>>> now = ktime_get();
> >>>> send_vblank_event(dev, e, seq, now);
> >>>
> >>> vkms has been fixed since a while to fake high-precision timestamps like
> >>> from a real display.
> >> [Kasireddy, Vivek] IIUC, that might be one of the reasons why the Guest does not need
> >> to have the same timestamp as that of the Host -- to work as expected.
> >>
> >>>
> >>>> When you say that the Guest/Host compositor need to stay in sync, are you
> >>>> suggesting that we need to ensure that the vblank timestamp on the Host
> >>>> needs to be shared and be the same on the Guest and a vblank/pageflip
> >>>> completion for the Guest needs to be sent at exactly the same time it is sent
> >>>> on the Host? If yes, I'd say that we do send the pageflip completion to Guest
> >>>> around the same time a vblank is generated on the Host but it does not help
> >>>> because the Guest compositor would only have 9 ms to submit a new frame
> >>>> and if the Host is running Mutter, the Guest would only have 2 ms.
> >>>> (https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984341)
> >>>
> >>> Not at the same time, but the same timestamp. And yes there is some fun
> >>> there, which is I think the fundamental issue. Or at least some of the
> >>> compositor experts seem to think so, and it makes sense to me.
> >> [Kasireddy, Vivek] It is definitely possible that if the timestamp is messed up, then
> >> the Guest repaint cycle would be affected. However, I do not believe that is the case
> >> here given the debug and instrumentation data we collected and scrutinized. Hopefully,
> >> compositor experts could chime in to shed some light on this matter.
> >>
> >>>
> >>>>>
> >>>>> Only, and I really mean only, when that shows that it's simply impossible
> >>>>> to hit 60fps with zero-copy and the guest/host fully aligned should we
> >>>>> look into making the overall pipeline deeper.
> >>>> [Kasireddy, Vivek] From all the experiments conducted so far and given the
> >>>> discussion associated with https://gitlab.freedesktop.org/wayland/weston/-
> /issues/514
> >>>> I think we have already established that in order for a zero-copy solution to work
> >>>> reliably, the Guest compositor needs to start its repaint cycle when the Host
> >>>> compositor sends a frame callback event to its clients.
> >>>>
> >>>>>
> >>>>>>> Only when that all shows that we just can't hit 60fps consistently and
> >>>>>>> really need 3 buffers in flight should we look at deeper kms queues.
> >>>>>>> And then we really need to implement them properly and not with a
> >>>>>>> mismatch between drm_event an out-fence signalling. These quick hacks
> >>>>>>> are good for experiments, but there's a pile of other things we need
> >>>>>>> to do first. At least that's how I understand the problem here right
> >>>>>>> now.
> >>>>>> [Kasireddy, Vivek] Experiments done so far indicate that we can hit 59 FPS
> >>> consistently
> >>>>>> -- in a zero-copy way independent of compositors' delays/deadlines -- with this
> >>>>>> patch series + the Weston MR I linked in the cover letter. The main reason why
> this
> >>>>>> works is because we relax the assumption that when the Guest compositor gets a
> >>>>>> pageflip completion event that it could reuse the old FB it submitted in the
> previous
> >>>>>> atomic flip and instead force it to use a new one. And, we send the pageflip
> >>> completion
> >>>>>> event to the Guest when the Host compositor sends a frame callback event.
> Lastly,
> >>>>>> we use the (deferred) out_fence as just a mechanism to tell the Guest compositor
> >>> when
> >>>>>> it can release references on old FBs so that they can be reused again.
> >>>>>>
> >>>>>> With that being said, the only question is how can we accomplish the above in an
> >>>>> upstream
> >>>>>> acceptable way without regressing anything particularly on bare-metal. Its not
> clear
> >>> if
> >>>>> just
> >>>>>> increasing the queue depth would work or not but I think the Guest compositor
> has to
> >>> be
> >>>>> told
> >>>>>> when it can start its repaint cycle and when it can assume the old FB is no longer
> in
> >>> use.
> >>>>>> On bare-metal -- and also with VKMS as of today -- a pageflip completion
> indicates
> >>>>> both.
> >>>>>> In other words, Vblank event is the same as Flip done, which makes sense on
> bare-
> >>> metal.
> >>>>>> But if we were to have two events at-least for VKMS: vblank to indicate to Guest
> to
> >>> start
> >>>>>> repaint and flip_done to indicate to drop references on old FBs, I think this
> problem
> >>> can
> >>>>>> be solved even without increasing the queue depth. Can this be acceptable?
> >>>>>
> >>>>> That's just another flavour of your "increase queue depth without
> >>>>> increasing the atomic queue depth" approach. I still think the underlying
> >>>>> fundamental issue is a timing confusion, and the fact that adjusting the
> >>>>> timings fixes things too kinda proves that. So we need to fix that in a
> >>>>> clean way, not by shuffling things around semi-randomly until the specific
> >>>>> config we tests works.
> >>>> [Kasireddy, Vivek] This issue is not due to a timing or timestamp mismatch. We
> >>>> have carefully instrumented both the Host and Guest compositors and measured
> >>>> the latencies at each step. The relevant debug data only points to the scheduling
> >>>> policy -- of both Host and Guest compositors -- playing a role in Guest rendering
> >>>> at 30 FPS.
> >>>
> >>> Hm but that essentially means that the events your passing around have an
> >>> even more ad-hoc implementation specific meaning: Essentially it's the
> >>> kick-off for the guest's repaint loop? That sounds even worse for a kms
> >>> uapi extension.
> >> [Kasireddy, Vivek] The pageflip completion event/vblank event indeed serves as the
> >> kick-off for a compositor's (both Guest and Host) repaint loop. AFAICT, Weston
> >> works that way and even if we increase the queue depth to solve this problem, I don't
> >> think it'll help because the arrival of this event always indicates to a compositor to
> >> start its repaint cycle again and assume that the previous buffers are all free.
> >
> > I thought this is how simple compositors work, and weston has since a
> > while it's own timer, which is based on the timestamp it gets (at on
> > drivers with vblank support), so that it starts the repaint loop a few ms
> > before the next vblank. And not immediately when it receives the old page
> > flip completion event.
> 
> As long as it's a fixed timer, there's still a risk that the guest compositor repaint cycle runs
> too late for the host one (unless the guest cycle happens to be scheduled significantly
> earlier than the host one).
> 
> Note that current mutter Git main (to become the 41 release this autumn) uses dynamic
> scheduling of its repaint cycle based on how long the last 16 frames took to draw and
> present. In theory, this could automatically schedule the guest cycle early enough for the
> host one.
[Kasireddy, Vivek] I'd like to try it out soon; it'd be very interesting to see how Mutter
works in both Guest and Host with this new scheduling policy. Having said that, I think
there is still a need to come up with a comprehensive solution that is independent of
compositors' scheduling policies. To that end, I am thinking of splitting the pageflip
completion event into two events: vblank event (to indicate to compositor to start repaint)
and flip_done event (to indicate to release references on old FBs). Or, introduce two new
signals/fences along similar lines. Thoughts?

Thanks,
Vivek

> 
> 
> --
> Earthling Michel Dänzer               |               https://redhat.com
> Libre software enthusiast             |             Mesa and X developer

[RFC,v1,0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

Message

Comments