mbox series

[v1,0/5] Add memory shrinker to VirtIO-GPU DRM driver

Message ID 20220308131725.60607-1-dmitry.osipenko@collabora.com (mailing list archive)
Headers show
Series Add memory shrinker to VirtIO-GPU DRM driver | expand

Message

Dmitry Osipenko March 8, 2022, 1:17 p.m. UTC
Hello,

This patchset introduces memory shrinker for the VirtIO-GPU DRM driver.
During OOM, the shrinker will release BOs that are marked as "not needed"
by userspace using the new madvise IOCTL. The userspace in this case is
the Mesa VirGL driver, it will mark the cached BOs as "not needed",
allowing kernel driver to release memory of the cached shmem BOs on lowmem
situations, preventing OOM kills.

This patchset includes couple fixes for problems I found while was working
on the shrinker, it also includes prerequisite DMA API usage improvement
needed by the shrinker.

The Mesa and IGT patches will be kept on hold until this kernel series
will be approved and applied.

This patchset was tested using Qemu and crosvm, including both cases of
IOMMU off/on.

Mesa: https://gitlab.freedesktop.org/digetx/mesa/-/commits/virgl-madvise
IGT:  https://gitlab.freedesktop.org/digetx/igt-gpu-tools/-/tree/virtio-madvise

Dmitry Osipenko (5):
  drm/virtio: Correct drm_gem_shmem_get_sg_table() error handling
  drm/virtio: Check whether transferred 2D BO is shmem
  drm/virtio: Unlock GEM reservations in error code path
  drm/virtio: Improve DMA API usage for shmem BOs
  drm/virtio: Add memory shrinker

 drivers/gpu/drm/virtio/Makefile               |   3 +-
 drivers/gpu/drm/virtio/virtgpu_drv.c          |  22 +++-
 drivers/gpu/drm/virtio/virtgpu_drv.h          |  31 ++++-
 drivers/gpu/drm/virtio/virtgpu_gem.c          |  84 ++++++++++++
 drivers/gpu/drm/virtio/virtgpu_gem_shrinker.c | 124 ++++++++++++++++++
 drivers/gpu/drm/virtio/virtgpu_ioctl.c        |  37 ++++++
 drivers/gpu/drm/virtio/virtgpu_kms.c          |  17 ++-
 drivers/gpu/drm/virtio/virtgpu_object.c       |  63 +++------
 drivers/gpu/drm/virtio/virtgpu_plane.c        |  17 ++-
 drivers/gpu/drm/virtio/virtgpu_vq.c           |  30 +++--
 include/uapi/drm/virtgpu_drm.h                |  14 ++
 11 files changed, 373 insertions(+), 69 deletions(-)
 create mode 100644 drivers/gpu/drm/virtio/virtgpu_gem_shrinker.c

Comments

Rob Clark March 8, 2022, 4:29 p.m. UTC | #1
On Tue, Mar 8, 2022 at 5:17 AM Dmitry Osipenko
<dmitry.osipenko@collabora.com> wrote:
>
> Hello,
>
> This patchset introduces memory shrinker for the VirtIO-GPU DRM driver.
> During OOM, the shrinker will release BOs that are marked as "not needed"
> by userspace using the new madvise IOCTL. The userspace in this case is
> the Mesa VirGL driver, it will mark the cached BOs as "not needed",
> allowing kernel driver to release memory of the cached shmem BOs on lowmem
> situations, preventing OOM kills.

Will host memory pressure already trigger shrinker in guest?  This is
something I'm quite interested in for "virtgpu native contexts" (ie.
native guest driver with new context type sitting on top of virtgpu),
since that isn't using host storage

BR,
-R

> This patchset includes couple fixes for problems I found while was working
> on the shrinker, it also includes prerequisite DMA API usage improvement
> needed by the shrinker.
>
> The Mesa and IGT patches will be kept on hold until this kernel series
> will be approved and applied.
>
> This patchset was tested using Qemu and crosvm, including both cases of
> IOMMU off/on.
>
> Mesa: https://gitlab.freedesktop.org/digetx/mesa/-/commits/virgl-madvise
> IGT:  https://gitlab.freedesktop.org/digetx/igt-gpu-tools/-/tree/virtio-madvise
>
> Dmitry Osipenko (5):
>   drm/virtio: Correct drm_gem_shmem_get_sg_table() error handling
>   drm/virtio: Check whether transferred 2D BO is shmem
>   drm/virtio: Unlock GEM reservations in error code path
>   drm/virtio: Improve DMA API usage for shmem BOs
>   drm/virtio: Add memory shrinker
>
>  drivers/gpu/drm/virtio/Makefile               |   3 +-
>  drivers/gpu/drm/virtio/virtgpu_drv.c          |  22 +++-
>  drivers/gpu/drm/virtio/virtgpu_drv.h          |  31 ++++-
>  drivers/gpu/drm/virtio/virtgpu_gem.c          |  84 ++++++++++++
>  drivers/gpu/drm/virtio/virtgpu_gem_shrinker.c | 124 ++++++++++++++++++
>  drivers/gpu/drm/virtio/virtgpu_ioctl.c        |  37 ++++++
>  drivers/gpu/drm/virtio/virtgpu_kms.c          |  17 ++-
>  drivers/gpu/drm/virtio/virtgpu_object.c       |  63 +++------
>  drivers/gpu/drm/virtio/virtgpu_plane.c        |  17 ++-
>  drivers/gpu/drm/virtio/virtgpu_vq.c           |  30 +++--
>  include/uapi/drm/virtgpu_drm.h                |  14 ++
>  11 files changed, 373 insertions(+), 69 deletions(-)
>  create mode 100644 drivers/gpu/drm/virtio/virtgpu_gem_shrinker.c
>
> --
> 2.35.1
>
Dmitry Osipenko March 8, 2022, 7:28 p.m. UTC | #2
On 3/8/22 19:29, Rob Clark wrote:
> On Tue, Mar 8, 2022 at 5:17 AM Dmitry Osipenko
> <dmitry.osipenko@collabora.com> wrote:
>>
>> Hello,
>>
>> This patchset introduces memory shrinker for the VirtIO-GPU DRM driver.
>> During OOM, the shrinker will release BOs that are marked as "not needed"
>> by userspace using the new madvise IOCTL. The userspace in this case is
>> the Mesa VirGL driver, it will mark the cached BOs as "not needed",
>> allowing kernel driver to release memory of the cached shmem BOs on lowmem
>> situations, preventing OOM kills.
> 
> Will host memory pressure already trigger shrinker in guest? 

The host memory pressure won't trigger shrinker in guest here. This
series will help only with the memory pressure within the guest using a
usual "virgl context".

Having a host shrinker in a case of "virgl contexts" should be a
difficult problem to solve.

> This is
> something I'm quite interested in for "virtgpu native contexts" (ie.
> native guest driver with new context type sitting on top of virtgpu),

In a case of "native contexts" it should be doable, at least I can't see
any obvious problems. The madvise invocations could be passed to the
host using a new virtio-gpu command by the guest's madvise IOCTL
handler, instead-of/in-addition-to handling madvise in the guest's
kernel, and that's it.

> since that isn't using host storage

s/host/guest ?
Rob Clark March 8, 2022, 10:24 p.m. UTC | #3
On Tue, Mar 8, 2022 at 11:28 AM Dmitry Osipenko
<dmitry.osipenko@collabora.com> wrote:
>
> On 3/8/22 19:29, Rob Clark wrote:
> > On Tue, Mar 8, 2022 at 5:17 AM Dmitry Osipenko
> > <dmitry.osipenko@collabora.com> wrote:
> >>
> >> Hello,
> >>
> >> This patchset introduces memory shrinker for the VirtIO-GPU DRM driver.
> >> During OOM, the shrinker will release BOs that are marked as "not needed"
> >> by userspace using the new madvise IOCTL. The userspace in this case is
> >> the Mesa VirGL driver, it will mark the cached BOs as "not needed",
> >> allowing kernel driver to release memory of the cached shmem BOs on lowmem
> >> situations, preventing OOM kills.
> >
> > Will host memory pressure already trigger shrinker in guest?
>
> The host memory pressure won't trigger shrinker in guest here. This
> series will help only with the memory pressure within the guest using a
> usual "virgl context".
>
> Having a host shrinker in a case of "virgl contexts" should be a
> difficult problem to solve.

Hmm, I think we just need the balloon driver to trigger the shrinker
in the guest kernel?  I suppose a driver like drm/virtio might want to
differentiate between host and guest pressure (ie. consider only
objects that have host vs guest storage), but even without that,
freeing up memory in the guest when host is under memory pressure
seems worthwhile.  Maybe I'm over-simplifying?

> > This is
> > something I'm quite interested in for "virtgpu native contexts" (ie.
> > native guest driver with new context type sitting on top of virtgpu),
>
> In a case of "native contexts" it should be doable, at least I can't see
> any obvious problems. The madvise invocations could be passed to the
> host using a new virtio-gpu command by the guest's madvise IOCTL
> handler, instead-of/in-addition-to handling madvise in the guest's
> kernel, and that's it.

I think we don't want to do that, because MADV:WILLNEED would be by
far the most frequent guest<->host synchronous round trip.  So from
that perspective tracking madvise state in guest kernel seems quite
attractive.

If we really can't track madvise state in the guest for dealing with
host memory pressure, I think the better option is to introduce
MADV:WILLNEED_REPLACE, ie. something to tell the host kernel that the
buffer is needed but the previous contents are not (as long as the GPU
VA remains the same).  With this the host could allocate new pages if
needed, and the guest would not need to wait for a reply from host.

> > since that isn't using host storage
>
> s/host/guest ?

Yes, sorry, I meant that it is not using guest storage.

BR,
-R
Dmitry Osipenko March 8, 2022, 11:36 p.m. UTC | #4
On 3/9/22 01:24, Rob Clark wrote:
> On Tue, Mar 8, 2022 at 11:28 AM Dmitry Osipenko
> <dmitry.osipenko@collabora.com> wrote:
>>
>> On 3/8/22 19:29, Rob Clark wrote:
>>> On Tue, Mar 8, 2022 at 5:17 AM Dmitry Osipenko
>>> <dmitry.osipenko@collabora.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> This patchset introduces memory shrinker for the VirtIO-GPU DRM driver.
>>>> During OOM, the shrinker will release BOs that are marked as "not needed"
>>>> by userspace using the new madvise IOCTL. The userspace in this case is
>>>> the Mesa VirGL driver, it will mark the cached BOs as "not needed",
>>>> allowing kernel driver to release memory of the cached shmem BOs on lowmem
>>>> situations, preventing OOM kills.
>>>
>>> Will host memory pressure already trigger shrinker in guest?
>>
>> The host memory pressure won't trigger shrinker in guest here. This
>> series will help only with the memory pressure within the guest using a
>> usual "virgl context".
>>
>> Having a host shrinker in a case of "virgl contexts" should be a
>> difficult problem to solve.
> 
> Hmm, I think we just need the balloon driver to trigger the shrinker
> in the guest kernel?  I suppose a driver like drm/virtio might want to
> differentiate between host and guest pressure (ie. consider only
> objects that have host vs guest storage), but even without that,
> freeing up memory in the guest when host is under memory pressure
> seems worthwhile.  Maybe I'm over-simplifying?

Might be the opposite, i.e. me over-complicating :) The variant with
memory ballooning actually could be good and will work for all kinds of
virtio contexts universally. There will be some back-n-forth between
host and guest, but perhaps it will work okay. Thank you for the suggestion.

>>> This is
>>> something I'm quite interested in for "virtgpu native contexts" (ie.
>>> native guest driver with new context type sitting on top of virtgpu),
>>
>> In a case of "native contexts" it should be doable, at least I can't see
>> any obvious problems. The madvise invocations could be passed to the
>> host using a new virtio-gpu command by the guest's madvise IOCTL
>> handler, instead-of/in-addition-to handling madvise in the guest's
>> kernel, and that's it.
> 
> I think we don't want to do that, because MADV:WILLNEED would be by
> far the most frequent guest<->host synchronous round trip.  So from
> that perspective tracking madvise state in guest kernel seems quite
> attractive.

This is a valid concern. I'd assume that the overhead should be
tolerable, but I don't have any actual perf numbers.

> If we really can't track madvise state in the guest for dealing with
> host memory pressure, I think the better option is to introduce
> MADV:WILLNEED_REPLACE, ie. something to tell the host kernel that the
> buffer is needed but the previous contents are not (as long as the GPU
> VA remains the same).  With this the host could allocate new pages if
> needed, and the guest would not need to wait for a reply from host.

If variant with the memory ballooning will work, then it will be
possible to track the state within guest-only. Let's consider the
simplest variant for now.

I'll try to implement the balloon driver support in the v2 and will get
back to you.

>>> since that isn't using host storage
>>
>> s/host/guest ?
> 
> Yes, sorry, I meant that it is not using guest storage.

Thank you for the clarification.
Rob Clark March 9, 2022, 12:56 a.m. UTC | #5
On Tue, Mar 8, 2022 at 3:36 PM Dmitry Osipenko
<dmitry.osipenko@collabora.com> wrote:
>
> On 3/9/22 01:24, Rob Clark wrote:
> > On Tue, Mar 8, 2022 at 11:28 AM Dmitry Osipenko
> > <dmitry.osipenko@collabora.com> wrote:
> >>
> >> On 3/8/22 19:29, Rob Clark wrote:
> >>> On Tue, Mar 8, 2022 at 5:17 AM Dmitry Osipenko
> >>> <dmitry.osipenko@collabora.com> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> This patchset introduces memory shrinker for the VirtIO-GPU DRM driver.
> >>>> During OOM, the shrinker will release BOs that are marked as "not needed"
> >>>> by userspace using the new madvise IOCTL. The userspace in this case is
> >>>> the Mesa VirGL driver, it will mark the cached BOs as "not needed",
> >>>> allowing kernel driver to release memory of the cached shmem BOs on lowmem
> >>>> situations, preventing OOM kills.
> >>>
> >>> Will host memory pressure already trigger shrinker in guest?
> >>
> >> The host memory pressure won't trigger shrinker in guest here. This
> >> series will help only with the memory pressure within the guest using a
> >> usual "virgl context".
> >>
> >> Having a host shrinker in a case of "virgl contexts" should be a
> >> difficult problem to solve.
> >
> > Hmm, I think we just need the balloon driver to trigger the shrinker
> > in the guest kernel?  I suppose a driver like drm/virtio might want to
> > differentiate between host and guest pressure (ie. consider only
> > objects that have host vs guest storage), but even without that,
> > freeing up memory in the guest when host is under memory pressure
> > seems worthwhile.  Maybe I'm over-simplifying?
>
> Might be the opposite, i.e. me over-complicating :) The variant with
> memory ballooning actually could be good and will work for all kinds of
> virtio contexts universally. There will be some back-n-forth between
> host and guest, but perhaps it will work okay. Thank you for the suggestion.
>
> >>> This is
> >>> something I'm quite interested in for "virtgpu native contexts" (ie.
> >>> native guest driver with new context type sitting on top of virtgpu),
> >>
> >> In a case of "native contexts" it should be doable, at least I can't see
> >> any obvious problems. The madvise invocations could be passed to the
> >> host using a new virtio-gpu command by the guest's madvise IOCTL
> >> handler, instead-of/in-addition-to handling madvise in the guest's
> >> kernel, and that's it.
> >
> > I think we don't want to do that, because MADV:WILLNEED would be by
> > far the most frequent guest<->host synchronous round trip.  So from
> > that perspective tracking madvise state in guest kernel seems quite
> > attractive.
>
> This is a valid concern. I'd assume that the overhead should be
> tolerable, but I don't have any actual perf numbers.

jfwiw, MADV:WILLNEED is a *very* hot path for gl drivers, based on
some measurements I did a while back with various apps/benchmarks..
easily more than 10x the next most frequent ioctl (for MADV:WONTNEED
and MADV:WILLNEED each, so more than 20x combined.. but MADV:WONTNEED
can be async).

But if the balloon triggering shrinker approach works out, that would
be pretty great.. it seems like the easy option and doesn't require
adding new host kernel uabi :-)

BR,
-R

> > If we really can't track madvise state in the guest for dealing with
> > host memory pressure, I think the better option is to introduce
> > MADV:WILLNEED_REPLACE, ie. something to tell the host kernel that the
> > buffer is needed but the previous contents are not (as long as the GPU
> > VA remains the same).  With this the host could allocate new pages if
> > needed, and the guest would not need to wait for a reply from host.
>
> If variant with the memory ballooning will work, then it will be
> possible to track the state within guest-only. Let's consider the
> simplest variant for now.
>
> I'll try to implement the balloon driver support in the v2 and will get
> back to you.
>
> >>> since that isn't using host storage
> >>
> >> s/host/guest ?
> >
> > Yes, sorry, I meant that it is not using guest storage.
>
> Thank you for the clarification.
Thomas Zimmermann March 9, 2022, 8:59 a.m. UTC | #6
Hi

Am 08.03.22 um 14:17 schrieb Dmitry Osipenko:
> Hello,
> 
> This patchset introduces memory shrinker for the VirtIO-GPU DRM driver.
> During OOM, the shrinker will release BOs that are marked as "not needed"
> by userspace using the new madvise IOCTL. The userspace in this case is
> the Mesa VirGL driver, it will mark the cached BOs as "not needed",
> allowing kernel driver to release memory of the cached shmem BOs on lowmem
> situations, preventing OOM kills.

Virtio-gpu is build on top of GEM shmem helpers. I have a prototype 
patchset that adds a shrinker to these helpers. If you want to go 
further, you could implement something like that instead. Panfrost and 
lima also have their own shrinker and could certainly be converted to 
the gem-shmem shrinker.

Best regards
Thomas

> 
> This patchset includes couple fixes for problems I found while was working
> on the shrinker, it also includes prerequisite DMA API usage improvement
> needed by the shrinker.
> 
> The Mesa and IGT patches will be kept on hold until this kernel series
> will be approved and applied.
> 
> This patchset was tested using Qemu and crosvm, including both cases of
> IOMMU off/on.
> 
> Mesa: https://gitlab.freedesktop.org/digetx/mesa/-/commits/virgl-madvise
> IGT:  https://gitlab.freedesktop.org/digetx/igt-gpu-tools/-/tree/virtio-madvise
> 
> Dmitry Osipenko (5):
>    drm/virtio: Correct drm_gem_shmem_get_sg_table() error handling
>    drm/virtio: Check whether transferred 2D BO is shmem
>    drm/virtio: Unlock GEM reservations in error code path
>    drm/virtio: Improve DMA API usage for shmem BOs
>    drm/virtio: Add memory shrinker
> 
>   drivers/gpu/drm/virtio/Makefile               |   3 +-
>   drivers/gpu/drm/virtio/virtgpu_drv.c          |  22 +++-
>   drivers/gpu/drm/virtio/virtgpu_drv.h          |  31 ++++-
>   drivers/gpu/drm/virtio/virtgpu_gem.c          |  84 ++++++++++++
>   drivers/gpu/drm/virtio/virtgpu_gem_shrinker.c | 124 ++++++++++++++++++
>   drivers/gpu/drm/virtio/virtgpu_ioctl.c        |  37 ++++++
>   drivers/gpu/drm/virtio/virtgpu_kms.c          |  17 ++-
>   drivers/gpu/drm/virtio/virtgpu_object.c       |  63 +++------
>   drivers/gpu/drm/virtio/virtgpu_plane.c        |  17 ++-
>   drivers/gpu/drm/virtio/virtgpu_vq.c           |  30 +++--
>   include/uapi/drm/virtgpu_drm.h                |  14 ++
>   11 files changed, 373 insertions(+), 69 deletions(-)
>   create mode 100644 drivers/gpu/drm/virtio/virtgpu_gem_shrinker.c
>
Dmitry Osipenko March 9, 2022, 11:55 a.m. UTC | #7
Hello,

On 3/9/22 11:59, Thomas Zimmermann wrote:
> Hi
> 
> Am 08.03.22 um 14:17 schrieb Dmitry Osipenko:
>> Hello,
>>
>> This patchset introduces memory shrinker for the VirtIO-GPU DRM driver.
>> During OOM, the shrinker will release BOs that are marked as "not needed"
>> by userspace using the new madvise IOCTL. The userspace in this case is
>> the Mesa VirGL driver, it will mark the cached BOs as "not needed",
>> allowing kernel driver to release memory of the cached shmem BOs on
>> lowmem
>> situations, preventing OOM kills.
> 
> Virtio-gpu is build on top of GEM shmem helpers. I have a prototype
> patchset that adds a shrinker to these helpers. If you want to go
> further, you could implement something like that instead. Panfrost and
> lima also have their own shrinker and could certainly be converted to
> the gem-shmem shrinker.

I had a thought that it could be possible to unify shrinkers into a
common DRM framework. Could you please give me a link to yours prototype
patchset?
Thomas Zimmermann March 9, 2022, 7:28 p.m. UTC | #8
Hi

Am 09.03.22 um 12:55 schrieb Dmitry Osipenko:
> Hello,
> 
> On 3/9/22 11:59, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 08.03.22 um 14:17 schrieb Dmitry Osipenko:
>>> Hello,
>>>
>>> This patchset introduces memory shrinker for the VirtIO-GPU DRM driver.
>>> During OOM, the shrinker will release BOs that are marked as "not needed"
>>> by userspace using the new madvise IOCTL. The userspace in this case is
>>> the Mesa VirGL driver, it will mark the cached BOs as "not needed",
>>> allowing kernel driver to release memory of the cached shmem BOs on
>>> lowmem
>>> situations, preventing OOM kills.
>>
>> Virtio-gpu is build on top of GEM shmem helpers. I have a prototype
>> patchset that adds a shrinker to these helpers. If you want to go
>> further, you could implement something like that instead. Panfrost and
>> lima also have their own shrinker and could certainly be converted to
>> the gem-shmem shrinker.
> 
> I had a thought that it could be possible to unify shrinkers into a
> common DRM framework. Could you please give me a link to yours prototype
> patchset?

I uploaded the patches to

 
https://gitlab.freedesktop.org/tzimmermann/linux/-/commits/gem-shmem-cached-mappings

it's incomplete and un-debugged, but it shows what needs to be done. It 
has the infrastructure, but lacks the changes to the GEM shmem code.

The reason for this work is to keep GEM shmem pages mapped and allocated 
even while the BO is neither mapped nor pinned.  As it is now, GEM SHMEM 
creates and releases pages on each pin and unpin, and maps and unmaps 
memory ranges on each vmap and vunmap.  It's all wasteful. Only the 
first pin and vmap calls should establish pages and mappings and only 
the purge and free functions should release them.

The patchset adds new helpers for BO purging to struct 
drm_gem_object_funcs. With this, I think it might be possible to have 
one global DRM shrinker and let it handle all BOs; independent of each 
BO's memory manager.

Best regards
Thomas
Dmitry Osipenko March 9, 2022, 8:06 p.m. UTC | #9
On 3/9/22 03:56, Rob Clark wrote:
>> If we really can't track madvise state in the guest for dealing with
>> host memory pressure, I think the better option is to introduce
>> MADV:WILLNEED_REPLACE, ie. something to tell the host kernel that the
>> buffer is needed but the previous contents are not (as long as the GPU
>> VA remains the same).  With this the host could allocate new pages if
>> needed, and the guest would not need to wait for a reply from host.
> If variant with the memory ballooning will work, then it will be
> possible to track the state within guest-only. Let's consider the
> simplest variant for now.
> 
> I'll try to implement the balloon driver support in the v2 and will get
> back to you.
> 

I looked at the generic balloon driver and looks like this not what we
want because:

1. Memory ballooning is primarily about handling memory overcommit
situations. I.e. when there are multiple VMs consuming more memory than
available in the system. Ballooning allows host to ask guest to give
unused pages back to host and host could give pages to other VMs.

2. Memory ballooning operates with guest memory pages only. I.e. each
ballooned page is reported to/from host in a form of page's DMA address.

3. There is no direct connection between host's OOM events and the
balloon manager. I guess host could watch system's memory pressure and
inflate VMs' balloons on low memory, releasing the guest's memory to the
system, but apparently this use-case not supported by anyone today, at
least I don't see Qemu supporting it.


So the virtio-balloon driver isn't very useful for us as-is.

One possible solution could be to create something like a new
virtio-shrinker device or add shrinker functionality to the virtio-gpu
device, allowing host to ask guests to drop shared caches. Host then
should become a PSI handler. I think this should be doable in a case of
crosvm. In a case of GNU world, it could take a lot of effort to get
everything to upstreamable state, at first there is a need to
demonstrate real problem being solved by this solution.

The other minor issue is that only integrated GPUs may use system's
memory and even then they could use a dedicated memory carveout, i.e.
releasing VRAM BOs may not help with host's OOM. In case of virgl
context we have no clue about where buffers are physically located. On
the other hand, in the worst case dropping host caches just won't help
with OOM.

It's now unclear how we should proceed with the host-side shrinker
support. Thoughts?

We may start easy and instead of thinking about host-side shrinker, we
could make VirtIO-GPU driver to expire cached BOs after a certain
timeout. Mesa already uses timeout-based BO caching, but it doesn't have
an alarm timer and simply checks expiration when BO is allocated. Should
be too much trouble to handle timers within Mesa since it's executed in
application context, easier to do it in kernel, like VC4 driver does it
for example. This is not good as a proper memory shrinker, but could be
good enough in practice.
Rob Clark March 9, 2022, 9:51 p.m. UTC | #10
On Wed, Mar 9, 2022 at 12:06 PM Dmitry Osipenko
<dmitry.osipenko@collabora.com> wrote:
>
> On 3/9/22 03:56, Rob Clark wrote:
> >> If we really can't track madvise state in the guest for dealing with
> >> host memory pressure, I think the better option is to introduce
> >> MADV:WILLNEED_REPLACE, ie. something to tell the host kernel that the
> >> buffer is needed but the previous contents are not (as long as the GPU
> >> VA remains the same).  With this the host could allocate new pages if
> >> needed, and the guest would not need to wait for a reply from host.
> > If variant with the memory ballooning will work, then it will be
> > possible to track the state within guest-only. Let's consider the
> > simplest variant for now.
> >
> > I'll try to implement the balloon driver support in the v2 and will get
> > back to you.
> >
>
> I looked at the generic balloon driver and looks like this not what we
> want because:
>
> 1. Memory ballooning is primarily about handling memory overcommit
> situations. I.e. when there are multiple VMs consuming more memory than
> available in the system. Ballooning allows host to ask guest to give
> unused pages back to host and host could give pages to other VMs.
>
> 2. Memory ballooning operates with guest memory pages only. I.e. each
> ballooned page is reported to/from host in a form of page's DMA address.
>
> 3. There is no direct connection between host's OOM events and the
> balloon manager. I guess host could watch system's memory pressure and
> inflate VMs' balloons on low memory, releasing the guest's memory to the
> system, but apparently this use-case not supported by anyone today, at
> least I don't see Qemu supporting it.
>

hmm, on CrOS I do see balloon getting used to balance host vs guest
memory.. but admittedly I've not yet looked closely at how that works,
and it does seem like we have some things that are not yet upstream
all over the place (not to mention crosvm vs qemu)

>
> So the virtio-balloon driver isn't very useful for us as-is.
>
> One possible solution could be to create something like a new
> virtio-shrinker device or add shrinker functionality to the virtio-gpu
> device, allowing host to ask guests to drop shared caches. Host then
> should become a PSI handler. I think this should be doable in a case of
> crosvm. In a case of GNU world, it could take a lot of effort to get
> everything to upstreamable state, at first there is a need to
> demonstrate real problem being solved by this solution.

I guess with 4GB chromebooks running one or more VMs in addition to
lots of browser tabs in the host, it shouldn't be too hard to
demonstrate a problem ;-)

(but also, however we end up solving that, certainly shouldn't block
this series)

> The other minor issue is that only integrated GPUs may use system's
> memory and even then they could use a dedicated memory carveout, i.e.
> releasing VRAM BOs may not help with host's OOM. In case of virgl
> context we have no clue about where buffers are physically located. On
> the other hand, in the worst case dropping host caches just won't help
> with OOM.

Userspace should know whether the BO has CPU storage, so I don't think
this should be a problem virtio_gpu needs to worry about

> It's now unclear how we should proceed with the host-side shrinker
> support. Thoughts?
>
> We may start easy and instead of thinking about host-side shrinker, we
> could make VirtIO-GPU driver to expire cached BOs after a certain
> timeout. Mesa already uses timeout-based BO caching, but it doesn't have
> an alarm timer and simply checks expiration when BO is allocated. Should
> be too much trouble to handle timers within Mesa since it's executed in
> application context, easier to do it in kernel, like VC4 driver does it
> for example. This is not good as a proper memory shrinker, but could be
> good enough in practice.

I think that, given virgl uses host storage, guest shrinker should be
still useful.. so I think continue with this series.

For host shrinker, I'll have to look more at when crosvm triggers
balloon inflation.  I could still go the MADV:WILLNEED_REPLACE
approach instead, which does have the advantage of host kernel not
relying on host userspace or vm having a chance to run in order to
release pages.  The downside (perhaps?) is it would be more specific
to virtgpu-native-context and less so to virgl or venus (but I guess
there doesn't currently exist a way for madvise to be useful for vk
drivers).

BR,
-R
Dmitry Osipenko March 9, 2022, 10:25 p.m. UTC | #11
On 3/9/22 22:28, Thomas Zimmermann wrote:
> Hi
> 
> Am 09.03.22 um 12:55 schrieb Dmitry Osipenko:
>> Hello,
>>
>> On 3/9/22 11:59, Thomas Zimmermann wrote:
>>> Hi
>>>
>>> Am 08.03.22 um 14:17 schrieb Dmitry Osipenko:
>>>> Hello,
>>>>
>>>> This patchset introduces memory shrinker for the VirtIO-GPU DRM driver.
>>>> During OOM, the shrinker will release BOs that are marked as "not
>>>> needed"
>>>> by userspace using the new madvise IOCTL. The userspace in this case is
>>>> the Mesa VirGL driver, it will mark the cached BOs as "not needed",
>>>> allowing kernel driver to release memory of the cached shmem BOs on
>>>> lowmem
>>>> situations, preventing OOM kills.
>>>
>>> Virtio-gpu is build on top of GEM shmem helpers. I have a prototype
>>> patchset that adds a shrinker to these helpers. If you want to go
>>> further, you could implement something like that instead. Panfrost and
>>> lima also have their own shrinker and could certainly be converted to
>>> the gem-shmem shrinker.
>>
>> I had a thought that it could be possible to unify shrinkers into a
>> common DRM framework. Could you please give me a link to yours prototype
>> patchset?
> 
> I uploaded the patches to
> 
> 
> https://gitlab.freedesktop.org/tzimmermann/linux/-/commits/gem-shmem-cached-mappings
> 
> 
> it's incomplete and un-debugged, but it shows what needs to be done. It
> has the infrastructure, but lacks the changes to the GEM shmem code.
> 
> The reason for this work is to keep GEM shmem pages mapped and allocated
> even while the BO is neither mapped nor pinned.  As it is now, GEM SHMEM
> creates and releases pages on each pin and unpin, and maps and unmaps
> memory ranges on each vmap and vunmap.  It's all wasteful. Only the
> first pin and vmap calls should establish pages and mappings and only
> the purge and free functions should release them.

Hm, aren't maps and pins already refcounted?

> The patchset adds new helpers for BO purging to struct
> drm_gem_object_funcs. With this, I think it might be possible to have
> one global DRM shrinker and let it handle all BOs; independent of each
> BO's memory manager.

Thank you, I'll give it a try.
Dmitry Osipenko March 9, 2022, 10:43 p.m. UTC | #12
On 3/10/22 00:51, Rob Clark wrote:
> On Wed, Mar 9, 2022 at 12:06 PM Dmitry Osipenko
> <dmitry.osipenko@collabora.com> wrote:
>>
>> On 3/9/22 03:56, Rob Clark wrote:
>>>> If we really can't track madvise state in the guest for dealing with
>>>> host memory pressure, I think the better option is to introduce
>>>> MADV:WILLNEED_REPLACE, ie. something to tell the host kernel that the
>>>> buffer is needed but the previous contents are not (as long as the GPU
>>>> VA remains the same).  With this the host could allocate new pages if
>>>> needed, and the guest would not need to wait for a reply from host.
>>> If variant with the memory ballooning will work, then it will be
>>> possible to track the state within guest-only. Let's consider the
>>> simplest variant for now.
>>>
>>> I'll try to implement the balloon driver support in the v2 and will get
>>> back to you.
>>>
>>
>> I looked at the generic balloon driver and looks like this not what we
>> want because:
>>
>> 1. Memory ballooning is primarily about handling memory overcommit
>> situations. I.e. when there are multiple VMs consuming more memory than
>> available in the system. Ballooning allows host to ask guest to give
>> unused pages back to host and host could give pages to other VMs.
>>
>> 2. Memory ballooning operates with guest memory pages only. I.e. each
>> ballooned page is reported to/from host in a form of page's DMA address.
>>
>> 3. There is no direct connection between host's OOM events and the
>> balloon manager. I guess host could watch system's memory pressure and
>> inflate VMs' balloons on low memory, releasing the guest's memory to the
>> system, but apparently this use-case not supported by anyone today, at
>> least I don't see Qemu supporting it.
>>
> 
> hmm, on CrOS I do see balloon getting used to balance host vs guest
> memory.. but admittedly I've not yet looked closely at how that works,
> and it does seem like we have some things that are not yet upstream
> all over the place (not to mention crosvm vs qemu)

That's interesting, I missed that CrOS uses a customized ballooning.

>> So the virtio-balloon driver isn't very useful for us as-is.
>>
>> One possible solution could be to create something like a new
>> virtio-shrinker device or add shrinker functionality to the virtio-gpu
>> device, allowing host to ask guests to drop shared caches. Host then
>> should become a PSI handler. I think this should be doable in a case of
>> crosvm. In a case of GNU world, it could take a lot of effort to get
>> everything to upstreamable state, at first there is a need to
>> demonstrate real problem being solved by this solution.
> 
> I guess with 4GB chromebooks running one or more VMs in addition to
> lots of browser tabs in the host, it shouldn't be too hard to
> demonstrate a problem ;-)

But then guest also should have a significant amount of BOs in cache to
purge, which potentially could be solved using a timer approach.

> (but also, however we end up solving that, certainly shouldn't block
> this series)

Sure, there is no problem with extending shrinker functionality with
more features later on, so the host's shrinker isn't a blocker.

>> The other minor issue is that only integrated GPUs may use system's
>> memory and even then they could use a dedicated memory carveout, i.e.
>> releasing VRAM BOs may not help with host's OOM. In case of virgl
>> context we have no clue about where buffers are physically located. On
>> the other hand, in the worst case dropping host caches just won't help
>> with OOM.
> 
> Userspace should know whether the BO has CPU storage, so I don't think
> this should be a problem virtio_gpu needs to worry about
> 
>> It's now unclear how we should proceed with the host-side shrinker
>> support. Thoughts?
>>
>> We may start easy and instead of thinking about host-side shrinker, we
>> could make VirtIO-GPU driver to expire cached BOs after a certain
>> timeout. Mesa already uses timeout-based BO caching, but it doesn't have
>> an alarm timer and simply checks expiration when BO is allocated. Should
>> be too much trouble to handle timers within Mesa since it's executed in
>> application context, easier to do it in kernel, like VC4 driver does it
>> for example. This is not good as a proper memory shrinker, but could be
>> good enough in practice.
> 
> I think that, given virgl uses host storage, guest shrinker should be
> still useful.. so I think continue with this series.

Guest shrinker indeed will be useful for virgl today. I was already
questioning why virgl needs both host and guest storages, maybe it will
move to a host-only storage approach in the future.

I think the variant with the timer expiration actually could be
interesting to try because it should allow host to purge its VM BOs by
itself, preventing host OOMs.

> For host shrinker, I'll have to look more at when crosvm triggers
> balloon inflation.  I could still go the MADV:WILLNEED_REPLACE
> approach instead, which does have the advantage of host kernel not
> relying on host userspace or vm having a chance to run in order to
> release pages.  The downside (perhaps?) is it would be more specific
> to virtgpu-native-context and less so to virgl or venus (but I guess
> there doesn't currently exist a way for madvise to be useful for vk
> drivers).

I'll also take a look at what CrOS does, maybe it has some interesting
ideas.
Thomas Zimmermann March 10, 2022, 7:02 p.m. UTC | #13
Hi

Am 09.03.22 um 23:25 schrieb Dmitry Osipenko:
>>
>> The reason for this work is to keep GEM shmem pages mapped and allocated
>> even while the BO is neither mapped nor pinned.  As it is now, GEM SHMEM
>> creates and releases pages on each pin and unpin, and maps and unmaps
>> memory ranges on each vmap and vunmap.  It's all wasteful. Only the
>> first pin and vmap calls should establish pages and mappings and only
>> the purge and free functions should release them.
> 
> Hm, aren't maps and pins already refcounted?

They are. But even when the refcounter reaches 0 on deref, there's no 
need to remove the mapping or free the memory pages. We can keep them 
around for the next ref operation.  Only the shrinker's purge or freeing 
the object has to do such clean-up operations.  Such behavior is 
supported by TTM and we already use it in VRAM helpers as well.

Best regards
Thomas

> 
>> The patchset adds new helpers for BO purging to struct
>> drm_gem_object_funcs. With this, I think it might be possible to have
>> one global DRM shrinker and let it handle all BOs; independent of each
>> BO's memory manager.
> 
> Thank you, I'll give it a try.
Dmitry Osipenko March 10, 2022, 9:32 p.m. UTC | #14
On 3/10/22 22:02, Thomas Zimmermann wrote:
> Hi
> 
> Am 09.03.22 um 23:25 schrieb Dmitry Osipenko:
>>>
>>> The reason for this work is to keep GEM shmem pages mapped and allocated
>>> even while the BO is neither mapped nor pinned.  As it is now, GEM SHMEM
>>> creates and releases pages on each pin and unpin, and maps and unmaps
>>> memory ranges on each vmap and vunmap.  It's all wasteful. Only the
>>> first pin and vmap calls should establish pages and mappings and only
>>> the purge and free functions should release them.
>>
>> Hm, aren't maps and pins already refcounted?
> 
> They are. But even when the refcounter reaches 0 on deref, there's no
> need to remove the mapping or free the memory pages. We can keep them
> around for the next ref operation.  Only the shrinker's purge or freeing
> the object has to do such clean-up operations.  Such behavior is
> supported by TTM and we already use it in VRAM helpers as well.

When driver won't want to pin BO at the creation time? Looks like all
shmem users do this today, and thus, pages shouldn't be freed until BO
is destroyed or purged.
Dmitry Osipenko March 14, 2022, 11:04 p.m. UTC | #15
On 3/10/22 01:43, Dmitry Osipenko wrote:
>> I think that, given virgl uses host storage, guest shrinker should be
>> still useful.. so I think continue with this series.
> Guest shrinker indeed will be useful for virgl today. I was already
> questioning why virgl needs both host and guest storages, maybe it will
> move to a host-only storage approach in the future.
> 
> I think the variant with the timer expiration actually could be
> interesting to try because it should allow host to purge its VM BOs by
> itself, preventing host OOMs.

While I was working on shrinker v2, I noticed that host-side allocation
may take a significant time. So I decided to defer implementation of my
idea of using timer-based expiration for host-only BOs. I'll need to
examine it more.

>> For host shrinker, I'll have to look more at when crosvm triggers
>> balloon inflation.  I could still go the MADV:WILLNEED_REPLACE
>> approach instead, which does have the advantage of host kernel not
>> relying on host userspace or vm having a chance to run in order to
>> release pages.  The downside (perhaps?) is it would be more specific
>> to virtgpu-native-context and less so to virgl or venus (but I guess
>> there doesn't currently exist a way for madvise to be useful for vk
>> drivers).
> I'll also take a look at what CrOS does, maybe it has some interesting
> ideas.

I looked at CrOS kernel and crosvm, and haven't noticed anything special
there in regards to purging GPU's memory of VM on host-side memory
pressure. If you'll find something, then please let me know.

I sent out v2 of the shrinker series, but missed to CC you on it by
accident, please find it here [1].

[1]
https://lore.kernel.org/dri-devel/20220314224253.236359-1-dmitry.osipenko@collabora.com/T/#t