
[v6,32/32] drm/doc: gpusvm: Add GPU SVM documentation

Message ID 20250225044311.3178695-33-matthew.brost@intel.com (mailing list archive)
State New
Series Introduce GPU SVM and Xe SVM implementation

Commit Message

Matthew Brost Feb. 25, 2025, 4:43 a.m. UTC
Add documentation for agreed-upon GPU SVM design principles, current
status, and future plans.

v4:
 - Address Thomas's feedback
v5:
 - s/Current/Baseline (Thomas)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 Documentation/gpu/rfc/gpusvm.rst | 84 ++++++++++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst  |  4 ++
 2 files changed, 88 insertions(+)
 create mode 100644 Documentation/gpu/rfc/gpusvm.rst

Comments

Alistair Popple Feb. 28, 2025, 2:34 a.m. UTC | #1
On Mon, Feb 24, 2025 at 08:43:11PM -0800, Matthew Brost wrote:
> Add documentation for agree upon GPU SVM design principles, current
> status, and future plans.

Thanks for writing this up. In general I didn't see anything too controversial
but added a couple of comments below.

> 
> v4:
>  - Address Thomas's feedback
> v5:
>  - s/Current/Basline (Thomas)
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  Documentation/gpu/rfc/gpusvm.rst | 84 ++++++++++++++++++++++++++++++++
>  Documentation/gpu/rfc/index.rst  |  4 ++
>  2 files changed, 88 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/gpusvm.rst
> 
> diff --git a/Documentation/gpu/rfc/gpusvm.rst b/Documentation/gpu/rfc/gpusvm.rst
> new file mode 100644
> index 000000000000..063412160685
> --- /dev/null
> +++ b/Documentation/gpu/rfc/gpusvm.rst
> @@ -0,0 +1,84 @@
> +===============
> +GPU SVM Section
> +===============
> +
> +Agreed upon design principles
> +=============================

As a general comment I think it would be nice if we could add some rationale/
reasons for these design principles. Things inevitably change, and if/when
we need to violate or update these principles it would be good to have some
documented rationale for why we decided on them in the first place, because
the reasoning may have become invalid by then.

> +* migrate_to_ram path
> +	* Rely only on core MM concepts (migration PTEs, page references, and
> +	  page locking).
> +	* No driver specific locks other than locks for hardware interaction in
> +	  this path. These are not required and generally a bad idea to
> +	  invent driver defined locks to seal core MM races.

In principle I agree. The problem I think you will run into is the analogue of
what adding a trylock_page() to do_swap_page() fixes, which is that a concurrent
GPU fault (which is highly likely after handling a CPU fault, due to the GPU PTEs
becoming invalid) may, depending on your design, kick off a migration of the
page to the GPU via migrate_vma_setup().

The problem with that is migrate_vma_setup() will temporarily raise the folio
refcount, which can cause the migrate_to_ram() callback to fail, but the elevated
refcount from migrate_to_ram() can also cause the GPU migration to fail, thus
leading to a live-lock where both CPU and GPU fault handlers just keep retrying.

This was particularly problematic for us on multi-GPU setups, and our solution
was to introduce a migration critical section in the form of a mutex to ensure
only one thread was calling migrate_vma_setup() at a time.

Now that I've looked at the UVM development history and remembered more
context, this is why I had a vague recollection that adding a migration entry
in do_swap_page() would be better than taking a page lock. Doing so fixes the
issue of concurrent GPU faults blocking migrate_to_ram() because it makes
migrate_vma_setup() ignore the page.
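
For illustration, a minimal sketch of the migration critical section described
above. Only the mutex and the migrate_vma_* calls are real core MM API; the
gpu_chunk structure, its fields, and gpu_copy_to_vram() are made-up names:

#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/mutex.h>

/* Hypothetical per-chunk driver state; not an existing Xe/UVM structure. */
struct gpu_chunk {
	struct mutex	migrate_mutex;	/* the migration critical section */
	unsigned long	*src_pfns;
	unsigned long	*dst_pfns;
};

/* Hypothetical: allocate VRAM, fill dst_pfns, and do the hardware copy. */
void gpu_copy_to_vram(struct gpu_chunk *chunk, struct migrate_vma *args);

static int gpu_migrate_chunk_to_vram(struct gpu_chunk *chunk,
				     struct vm_area_struct *vma,
				     unsigned long start, unsigned long end)
{
	struct migrate_vma args = {
		.vma	= vma,
		.start	= start,
		.end	= end,
		.src	= chunk->src_pfns,
		.dst	= chunk->dst_pfns,
		.flags	= MIGRATE_VMA_SELECT_SYSTEM,
	};
	int ret;

	/*
	 * Serialize migrate_vma_setup() so one thread's failed attempt
	 * cannot keep invalidating another thread's migration.
	 */
	mutex_lock(&chunk->migrate_mutex);

	ret = migrate_vma_setup(&args);
	if (ret)
		goto out_unlock;

	if (args.cpages) {
		gpu_copy_to_vram(chunk, &args);	/* hardware copy */
		migrate_vma_pages(&args);
		migrate_vma_finalize(&args);
	}

out_unlock:
	mutex_unlock(&chunk->migrate_mutex);
	return ret;
}

The point is simply that setup, copy, pages and finalize all run under one
mutex, so a failed attempt cannot repeatedly knock out another thread's
in-progress migration.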

> +	* Partial migration is supported (i.e., a subset of pages attempting to
> +	  migrate can actually migrate, with only the faulting page guaranteed
> +	  to migrate).
> +	* Driver handles mixed migrations via retry loops rather than locking.
>
> +* Eviction

This is a term that seems to be somewhat overloaded depending on context, so a
definition would be nice. Is your view of eviction migrating data from the GPU back
to the CPU, without a virtual address, to free up GPU memory? (That's what I think
of, but it would be good to make sure we're in sync.)

> +	* Only looking at physical memory data structures and locks as opposed to
> +	  looking at virtual memory data structures and locks.
> +	* No looking at mm/vma structs or relying on those being locked.

Agree with the above points.

> +* GPU fault side
> +	* mmap_read only used around core MM functions which require this lock
> +	  and should strive to take mmap_read lock only in GPU SVM layer.
> +	* Big retry loop to handle all races with the mmu notifier under the gpu
> +	  pagetable locks/mmu notifier range lock/whatever we end up calling
> +          those.

Again, one of the issues here (particularly with multi-GPU setups) is that it's
very easy to live-lock with retry loops, because even attempting a migration that
fails can cause migration/fault handling in other threads to fail, either by
calling mmu_notifiers or taking a page reference.

Those are probably things that we should fix on the MM side, but for now UVM at
least uses a lock to ensure forward progress.

> +	* Races (especially against concurrent eviction or migrate_to_ram)
> +	  should not be handled on the fault side by trying to hold locks;
> +	  rather, they should be handled using retry loops. One possible
> +	  exception is holding a BO's dma-resv lock during the initial migration
> +	  to VRAM, as this is a well-defined lock that can be taken underneath
> +	  the mmap_read lock.

See my earlier comments. Although note I agree with this in principle, and we do
just retry if taking the lock fails.

> +* Physical memory to virtual backpointer
> +	* Does not work, no pointers from physical memory to virtual should
> +	  exist.

Agree. And my rationale is that core MM can update the virtual address for a
page without notifying a driver of the new address, for example with mremap().
So it's impossible to keep any backpointer to a virtual address up to date.

> +	* Physical memory backpointer (page->zone_device_data) should be stable
> +	  from allocation to page free.

Agree. And presumably the rationale is that it is very difficult to safely
update page->zone_device_data and ensure there aren't concurrent users of it
unless the page is free (i.e., has a 0 refcount)?
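
As a concrete sketch of what "stable from allocation to free" means in
practice: zone_device_data is set once when the device page is handed out and
only dropped from the dev_pagemap_ops.page_free() callback, when the refcount
has hit zero and no concurrent users can exist. gpu_alloc and gpu_alloc_put()
below are made-up driver names:

#include <linux/memremap.h>
#include <linux/mm.h>

struct gpu_alloc;				/* hypothetical backing object */
void gpu_alloc_put(struct gpu_alloc *galloc);	/* hypothetical */

/* Set once when the device page is handed out... */
static void gpu_page_init(struct page *dpage, struct gpu_alloc *galloc)
{
	dpage->zone_device_data = galloc;
	zone_device_page_init(dpage);
}

/* ...and only torn down in the pagemap free callback. */
static void gpu_page_free(struct page *dpage)
{
	struct gpu_alloc *galloc = dpage->zone_device_data;

	dpage->zone_device_data = NULL;
	gpu_alloc_put(galloc);
}

static const struct dev_pagemap_ops gpu_pagemap_ops = {
	.page_free	= gpu_page_free,
	/* .migrate_to_ram = ..., */
};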

> +* GPU pagetable locking
> +	* Notifier lock only protects range tree, pages valid state for a range
> +	  (rather than seqno due to wider notifiers), pagetable entries, and
> +	  mmu notifier seqno tracking, it is not a global lock to protect
> +          against races.
> +	* All races handled with big retry as mentioned above.

Sounds reasonable.
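
For reference, the "big retry loop" in the bullets above is essentially the
standard mmu_interval_notifier/hmm_range_fault pattern from
Documentation/mm/hmm.rst. A sketch, with gpu_pagetable_lock()/unlock() and
gpu_program_ptes() standing in for whatever the driver actually does:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/* Placeholder range state; real drivers hang this off their SVM range. */
struct gpu_range {
	struct mmu_interval_notifier	notifier;
	unsigned long			start, end;
	unsigned long			*hmm_pfns;
};

/* Hypothetical driver hooks. */
void gpu_pagetable_lock(void);		/* a.k.a. the notifier lock */
void gpu_pagetable_unlock(void);
void gpu_program_ptes(struct gpu_range *range);

static int gpu_fault_map_range(struct gpu_range *range, struct mm_struct *mm)
{
	struct hmm_range hrange = {
		.notifier	= &range->notifier,
		.start		= range->start,
		.end		= range->end,
		.hmm_pfns	= range->hmm_pfns,
		.default_flags	= HMM_PFN_REQ_FAULT,
	};
	int ret;

again:
	hrange.notifier_seq = mmu_interval_read_begin(&range->notifier);

	mmap_read_lock(mm);			/* only around core MM calls */
	ret = hmm_range_fault(&hrange);
	mmap_read_unlock(mm);
	if (ret == -EBUSY)
		goto again;
	if (ret)
		return ret;

	gpu_pagetable_lock();
	if (mmu_interval_read_retry(&range->notifier, hrange.notifier_seq)) {
		gpu_pagetable_unlock();
		goto again;			/* an invalidation raced us */
	}
	gpu_program_ptes(range);		/* update GPU PTEs from hmm_pfns */
	gpu_pagetable_unlock();

	return 0;
}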

> +Overview of current design
> +==========================
> +
> +Baseline design is simple as possible to get a working basline in which can be
> +built upon.
> +
> +.. kernel-doc:: drivers/gpu/drm/xe/drm_gpusvm.c
> +   :doc: Overview
> +   :doc: Locking
> +   :doc: Migrataion
> +   :doc: Partial Unmapping of Ranges
> +   :doc: Examples
> +
> +Possible future design features
> +===============================
> +
> +* Concurrent GPU faults
> +	* CPU faults are concurrent so makes sense to have concurrent GPU
> +	  faults.
> +	* Should be possible with fined grained locking in the driver GPU
> +	  fault handler.
> +	* No expected GPU SVM changes required.
> +* Ranges with mixed system and device pages
> +	* Can be added if required to drm_gpusvm_get_pages fairly easily.

I don't see much of a need, but also don't see any barriers on the MM side for
doing that.

> +* Multi-GPU support
> +	* Work in progress and patches expected after initially landing on GPU
> +	  SVM.
> +	* Ideally can be done with little to no changes to GPU SVM.

See above, but I mostly agree.

> +* Drop ranges in favor of radix tree
> +	* May be desirable for faster notifiers.
> +* Compound device pages
> +	* Nvidia, AMD, and Intel all have agreed expensive core MM functions in
> +	  migrate device layer are a performance bottleneck, having compound
> +	  device pages should help increase performance by reducing the number
> +	  of these expensive calls.

I'm hoping my patch series[1] to allow for compound device pages will land in v6.15
as well. The next steps are extending that to DEVICE_PRIVATE pages with
migrate_vma_setup() and migrate_to_ram() and we have been making some good
progress on this internally. I hope to have something posted before LSFMM.

The other thing we have been looking at here is being able to migrate
file-backed pages to GPU memory.

[1] - https://lore.kernel.org/linux-mm/cover.a782e309b1328f961da88abddbbc48e5b4579021.1739850794.git-series.apopple@nvidia.com/

> +* Higher order dma mapping for migration
> +	* 4k dma mapping adversely affects migration performance on Intel
> +	  hardware, higher order (2M) dma mapping should help here.
> +* Build common userptr implementation on top of GPU SVM
> diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> index 476719771eef..396e535377fb 100644
> --- a/Documentation/gpu/rfc/index.rst
> +++ b/Documentation/gpu/rfc/index.rst
> @@ -16,6 +16,10 @@ host such documentation:
>  * Once the code has landed move all the documentation to the right places in
>    the main core, helper or driver sections.
>  
> +.. toctree::
> +
> +    gpusvm.rst
> +
>  .. toctree::
>  
>      i915_gem_lmem.rst
> -- 
> 2.34.1
>
Matthew Brost Feb. 28, 2025, 4:36 a.m. UTC | #2
On Fri, Feb 28, 2025 at 01:34:42PM +1100, Alistair Popple wrote:
> On Mon, Feb 24, 2025 at 08:43:11PM -0800, Matthew Brost wrote:
> > Add documentation for agree upon GPU SVM design principles, current
> > status, and future plans.
> 
> Thanks for writing this up. In general I didn't see anything too controversial
> but added a couple of comments below.
> 
> > 
> > v4:
> >  - Address Thomas's feedback
> > v5:
> >  - s/Current/Basline (Thomas)
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> >  Documentation/gpu/rfc/gpusvm.rst | 84 ++++++++++++++++++++++++++++++++
> >  Documentation/gpu/rfc/index.rst  |  4 ++
> >  2 files changed, 88 insertions(+)
> >  create mode 100644 Documentation/gpu/rfc/gpusvm.rst
> > 
> > diff --git a/Documentation/gpu/rfc/gpusvm.rst b/Documentation/gpu/rfc/gpusvm.rst
> > new file mode 100644
> > index 000000000000..063412160685
> > --- /dev/null
> > +++ b/Documentation/gpu/rfc/gpusvm.rst
> > @@ -0,0 +1,84 @@
> > +===============
> > +GPU SVM Section
> > +===============
> > +
> > +Agreed upon design principles
> > +=============================
> 
> As a general comment I think it would be nice if we could add some rational/
> reasons for these design principals. Things inevitably change and if/when
> we need to violate or update these principals it would be good to have some
> documented rational for why we decided on them in the first place because the
> reasoning may have become invalid by then.
> 

Let me try to add some things to the various cases.

> > +* migrate_to_ram path
> > +	* Rely only on core MM concepts (migration PTEs, page references, and
> > +	  page locking).
> > +	* No driver specific locks other than locks for hardware interaction in
> > +	  this path. These are not required and generally a bad idea to
> > +	  invent driver defined locks to seal core MM races.
> 
> In principal I agree. The problem I think you will run into is the analogue of
> what adding a trylock_page() to do_swap_page() fixes. Which is that a concurrent
> GPU fault (which is higly likely after handling a CPU fault due to the GPU PTEs
> becoming invalid) may, depending on your design, kick off a migration of the
> page to the GPU via migrate_vma_setup().
> 
> The problem with that is migrate_vma_setup() will temprarily raise the folio
> refcount, which can cause the migrate_to_ram() callback to fail but the elevated
> refcount from migrate_to_ram() can also cause the GPU migration to fail thus
> leading to a live-lock when both CPU and GPU fault handlers just keep retrying.
> 
> This was particularly problematic for us on multi-GPU setups, and our solution
> was to introduce a migration critical section in the form of a mutex to ensure
> only one thread was calling migrate_vma_setup() at a time.
> 
> And now that I've looked at UVM development history, and remembered more
> context, this is why I had a vague recollection that adding a migration entry
> in do_swap_page() would be better than taking a page lock. Doing so fixes the
> issue with concurrent GPU faults blocking migrate_to_ram() because it makes
> migrate_vma_setup() ignore the page.
> 

Ok, this is something to keep an eye on. In the current Xe code, we try
to migrate a chunk of memory from the CPU to the GPU in our GPU fault
handler once per fault. If it fails due to racing CPU access, we simply
leave it in CPU memory and move on. We don't have any real migration
policies in Xe yet—that is being worked on as a follow-up to my series.
However, if we had a policy requiring a memory region to 'must be in
GPU,' this could conceivably lead to a livelock with concurrent CPU and
GPU access. I'm still not fully convinced that a driver-side lock is the
solution here, but without encountering the issue on our side, I can't
be completely certain what the solution is.
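
Roughly, the flow described above looks like the following. Every name here is
a hypothetical stand-in, not an actual Xe symbol:

struct gpu_vm;
struct gpu_range;

struct gpu_range *gpu_range_find_or_create(struct gpu_vm *vm, unsigned long addr);
bool gpu_range_prefers_vram(struct gpu_range *range);
int gpu_migrate_range_to_vram(struct gpu_vm *vm, struct gpu_range *range);
int gpu_bind_range(struct gpu_vm *vm, struct gpu_range *range);

int gpu_handle_fault(struct gpu_vm *vm, unsigned long addr)
{
	struct gpu_range *range = gpu_range_find_or_create(vm, addr);

	/*
	 * One migration attempt per fault. A failure (e.g. a racing CPU
	 * fault holding page references) is not fatal: leave the pages
	 * in system memory and map them from there.
	 */
	if (gpu_range_prefers_vram(range))
		(void)gpu_migrate_range_to_vram(vm, range);

	/* Map whatever currently backs the range (VRAM or system RAM),
	 * under the usual notifier-seqno retry loop. */
	return gpu_bind_range(vm, range);
}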

> > +	* Partial migration is supported (i.e., a subset of pages attempting to
> > +	  migrate can actually migrate, with only the faulting page guaranteed
> > +	  to migrate).
> > +	* Driver handles mixed migrations via retry loops rather than locking.
> >
> > +* Eviction
> 
> This is a term that seems be somewhat overloaded depending on context so a
> definition would be nice. Is your view of eviction migrating data from GPU back
> to CPU without a virtual address to free up GPU memory? (that's what I think of,
> but would be good to make sure we're in sync).
> 

Yes. When GPU memory is oversubscribed, we find the physical backing in
an LRU list to evict. In Xe, this is a TTM BO.
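
A sketch of eviction in those terms, operating purely on a physical device pfn
range (e.g. the backing store of a TTM BO picked off the LRU) with the core MM
migrate_device_* helpers; gpu_alloc_dst_pages() and gpu_copy_to_sram() are
hypothetical driver steps:

#include <linux/migrate.h>
#include <linux/slab.h>

/* Hypothetical: allocate system pages into dst[] and do the hardware copy. */
void gpu_alloc_dst_pages(unsigned long *dst, unsigned long *src,
			 unsigned long npages);
void gpu_copy_to_sram(unsigned long *dst, unsigned long *src,
		      unsigned long npages);

/*
 * Evict a physically contiguous device page range back to system memory.
 * No mm/vma access and no virtual addresses, only the device pfn range.
 */
static int gpu_evict_pfn_range(unsigned long start_pfn, unsigned long npages)
{
	unsigned long *src, *dst;
	int ret;

	src = kvcalloc(npages, sizeof(*src), GFP_KERNEL);
	dst = kvcalloc(npages, sizeof(*dst), GFP_KERNEL);
	if (!src || !dst) {
		ret = -ENOMEM;
		goto out;
	}

	/* Lock the device pages and mark them for migration. */
	ret = migrate_device_range(src, start_pfn, npages);
	if (ret)
		goto out;

	gpu_alloc_dst_pages(dst, src, npages);
	gpu_copy_to_sram(dst, src, npages);

	migrate_device_pages(src, dst, npages);
	migrate_device_finalize(src, dst, npages);
out:
	kvfree(src);
	kvfree(dst);
	return ret;
}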

> > +	* Only looking at physical memory data structures and locks as opposed to
> > +	  looking at virtual memory data structures and locks.
> > +	* No looking at mm/vma structs or relying on those being locked.
> 
> Agree with the above points.
> 
> > +* GPU fault side
> > +	* mmap_read only used around core MM functions which require this lock
> > +	  and should strive to take mmap_read lock only in GPU SVM layer.
> > +	* Big retry loop to handle all races with the mmu notifier under the gpu
> > +	  pagetable locks/mmu notifier range lock/whatever we end up calling
> > +          those.
> 
> Again, one of the issues here (particularly with multi-GPU setups) is that it's
> very easy to live-lock with rety loops because even attempting a migration that
> fails can cause migration/fault handling in other threads to fail, either by
> calling mmu_notifiers or taking a page reference.
> 
> Those are probably things that we should fix on the MM side, but for now UVM at
> least uses a lock to ensure forward progress.
>

Again, see above. Right now, migration in Xe is more of a best-effort
operation than a mandatory process, and perhaps this is masking an
issue.

Maybe I should add a comment here stating your possible concerns and that
Xe will be implementing real migration policies and multi-GPU support
soon. If this issue arises, we can revisit the locking guidelines or
perhaps help contribute to the necessary core changes to make this work
properly.
 
> > +	* Races (especially against concurrent eviction or migrate_to_ram)
> > +	  should not be handled on the fault side by trying to hold locks;
> > +	  rather, they should be handled using retry loops. One possible
> > +	  exception is holding a BO's dma-resv lock during the initial migration
> > +	  to VRAM, as this is a well-defined lock that can be taken underneath
> > +	  the mmap_read lock.
> 
> See my earlier comments. Although note I agree with this in principal, and we do
> just retry if taking the lock fails.
> 
> > +* Physical memory to virtual backpointer
> > +	* Does not work, no pointers from physical memory to virtual should
> > +	  exist.
> 
> Agree. And my rational is because core-MM can update the virtual address for a
> page without notifying a driver of the new address. For example with mremap().
> So it's impossible to keep any backpointer to a virtual address up to date.
> 

Yep, this is a good example and I will include it in my next rev.

> > +	* Physical memory backpointer (page->zone_device_data) should be stable
> > +	  from allocation to page free.
> 
> Agree. And presumably the rational is because it is very difficult to safely
> update page->zone_device_data and ensure there aren't concurrent users of it
> unless the page is free (ie. has a 0 refcount)?
> 

Yes, exactly.

> > +* GPU pagetable locking
> > +	* Notifier lock only protects range tree, pages valid state for a range
> > +	  (rather than seqno due to wider notifiers), pagetable entries, and
> > +	  mmu notifier seqno tracking, it is not a global lock to protect
> > +          against races.
> > +	* All races handled with big retry as mentioned above.
> 
> Sounds reasonable.
> 
> > +Overview of current design
> > +==========================
> > +
> > +Baseline design is simple as possible to get a working basline in which can be
> > +built upon.
> > +
> > +.. kernel-doc:: drivers/gpu/drm/xe/drm_gpusvm.c
> > +   :doc: Overview
> > +   :doc: Locking
> > +   :doc: Migrataion
> > +   :doc: Partial Unmapping of Ranges
> > +   :doc: Examples
> > +
> > +Possible future design features
> > +===============================
> > +
> > +* Concurrent GPU faults
> > +	* CPU faults are concurrent so makes sense to have concurrent GPU
> > +	  faults.
> > +	* Should be possible with fined grained locking in the driver GPU
> > +	  fault handler.
> > +	* No expected GPU SVM changes required.
> > +* Ranges with mixed system and device pages
> > +	* Can be added if required to drm_gpusvm_get_pages fairly easily.
> 
> I don't see much of a need, but also don't see any barriers on the MM side for
> doing that.
>

I don't see any barriers either; I think it would work in Xe with a slight
tweak to our VM bind code.

I'm unsure of the use case, though.

> > +* Multi-GPU support
> > +	* Work in progress and patches expected after initially landing on GPU
> > +	  SVM.
> > +	* Ideally can be done with little to no changes to GPU SVM.
> 
> See above, but I mostly agree.
> 
> > +* Drop ranges in favor of radix tree
> > +	* May be desirable for faster notifiers.
> > +* Compound device pages
> > +	* Nvidia, AMD, and Intel all have agreed expensive core MM functions in
> > +	  migrate device layer are a performance bottleneck, having compound
> > +	  device pages should help increase performance by reducing the number
> > +	  of these expensive calls.
> 
> I'm hoping my patch series[1] to allow for compound device pages will land in v6.15

Cool! I was not aware of this ongoing series. Let me look.

> as well. The next steps are extending that to DEVICE_PRIVATE pages with
> migrate_vma_setup() and migrate_to_ram() and we have been making some good
> progress on this internally. I hope to have something posted before LSFMM.
> 

Also cool. If you think you have something working, let me know and I will
pull in the changes to test them out.

> The other thing we have been looking at here is being able to migrate
> file-backed pages to GPU memory.

I can also test this one out.

Matt

> 
> [1] - https://lore.kernel.org/linux-mm/cover.a782e309b1328f961da88abddbbc48e5b4579021.1739850794.git-series.apopple@nvidia.com/
> 
> > +* Higher order dma mapping for migration
> > +	* 4k dma mapping adversely affects migration performance on Intel
> > +	  hardware, higher order (2M) dma mapping should help here.
> > +* Build common userptr implementation on top of GPU SVM
> > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> > index 476719771eef..396e535377fb 100644
> > --- a/Documentation/gpu/rfc/index.rst
> > +++ b/Documentation/gpu/rfc/index.rst
> > @@ -16,6 +16,10 @@ host such documentation:
> >  * Once the code has landed move all the documentation to the right places in
> >    the main core, helper or driver sections.
> >  
> > +.. toctree::
> > +
> > +    gpusvm.rst
> > +
> >  .. toctree::
> >  
> >      i915_gem_lmem.rst
> > -- 
> > 2.34.1
> >
Alistair Popple Feb. 28, 2025, 5:53 a.m. UTC | #3
On Thu, Feb 27, 2025 at 08:36:35PM -0800, Matthew Brost wrote:
> On Fri, Feb 28, 2025 at 01:34:42PM +1100, Alistair Popple wrote:
> > On Mon, Feb 24, 2025 at 08:43:11PM -0800, Matthew Brost wrote:
> > > Add documentation for agree upon GPU SVM design principles, current
> > > status, and future plans.
> > 
> > Thanks for writing this up. In general I didn't see anything too controversial
> > but added a couple of comments below.
> > 
> > > 
> > > v4:
> > >  - Address Thomas's feedback
> > > v5:
> > >  - s/Current/Basline (Thomas)
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > ---
> > >  Documentation/gpu/rfc/gpusvm.rst | 84 ++++++++++++++++++++++++++++++++
> > >  Documentation/gpu/rfc/index.rst  |  4 ++
> > >  2 files changed, 88 insertions(+)
> > >  create mode 100644 Documentation/gpu/rfc/gpusvm.rst
> > > 
> > > diff --git a/Documentation/gpu/rfc/gpusvm.rst b/Documentation/gpu/rfc/gpusvm.rst
> > > new file mode 100644
> > > index 000000000000..063412160685
> > > --- /dev/null
> > > +++ b/Documentation/gpu/rfc/gpusvm.rst
> > > @@ -0,0 +1,84 @@
> > > +===============
> > > +GPU SVM Section
> > > +===============
> > > +
> > > +Agreed upon design principles
> > > +=============================
> > 
> > As a general comment I think it would be nice if we could add some rational/
> > reasons for these design principals. Things inevitably change and if/when
> > we need to violate or update these principals it would be good to have some
> > documented rational for why we decided on them in the first place because the
> > reasoning may have become invalid by then.
> > 
> 
> Let me try to add somethings to the various cases.

Thanks!

> > > +* migrate_to_ram path
> > > +	* Rely only on core MM concepts (migration PTEs, page references, and
> > > +	  page locking).
> > > +	* No driver specific locks other than locks for hardware interaction in
> > > +	  this path. These are not required and generally a bad idea to
> > > +	  invent driver defined locks to seal core MM races.
> > 
> > In principal I agree. The problem I think you will run into is the analogue of
> > what adding a trylock_page() to do_swap_page() fixes. Which is that a concurrent
> > GPU fault (which is higly likely after handling a CPU fault due to the GPU PTEs
> > becoming invalid) may, depending on your design, kick off a migration of the
> > page to the GPU via migrate_vma_setup().
> > 
> > The problem with that is migrate_vma_setup() will temprarily raise the folio
> > refcount, which can cause the migrate_to_ram() callback to fail but the elevated
> > refcount from migrate_to_ram() can also cause the GPU migration to fail thus
> > leading to a live-lock when both CPU and GPU fault handlers just keep retrying.
> > 
> > This was particularly problematic for us on multi-GPU setups, and our solution
> > was to introduce a migration critical section in the form of a mutex to ensure
> > only one thread was calling migrate_vma_setup() at a time.
> > 
> > And now that I've looked at UVM development history, and remembered more
> > context, this is why I had a vague recollection that adding a migration entry
> > in do_swap_page() would be better than taking a page lock. Doing so fixes the
> > issue with concurrent GPU faults blocking migrate_to_ram() because it makes
> > migrate_vma_setup() ignore the page.
> > 
> 
> Ok, this is something to keep an eye on. In the current Xe code, we try
> to migrate a chunk of memory from the CPU to the GPU in our GPU fault
> handler once per fault. If it fails due to racing CPU access, we simply
> leave it in CPU memory and move on. We don't have any real migration
> policies in Xe yet—that is being worked on as a follow-up to my series.
> However, if we had a policy requiring a memory region to 'must be in
> GPU,' this could conceivably lead to a livelock with concurrent CPU and
> GPU access. I'm still not fully convinced that a driver-side lock is the
> solution here, but without encountering the issue on our side, I can't
> be completely certain what the solution is.

Right - we have migration policies that can cause us to try harder to migrate.
Also, I agree with you that a driver-side lock might not be the best solution
here. It's what we did due to various limitations we have, but they are
unimportant for this discussion.

I agree the ideal solution wouldn't involve locks and would instead be to fix
the migration interfaces up such that one thread attempting to migrate doesn't
cause another thread which has started a migration to fail. The solution to that
isn't obvious, but I don't think it would be impossible either.

> > > +	* Partial migration is supported (i.e., a subset of pages attempting to
> > > +	  migrate can actually migrate, with only the faulting page guaranteed
> > > +	  to migrate).
> > > +	* Driver handles mixed migrations via retry loops rather than locking.
> > >
> > > +* Eviction
> > 
> > This is a term that seems be somewhat overloaded depending on context so a
> > definition would be nice. Is your view of eviction migrating data from GPU back
> > to CPU without a virtual address to free up GPU memory? (that's what I think of,
> > but would be good to make sure we're in sync).
> > 
> 
> Yes. When GPU memory is oversubscribed, we find the physical backing in
> an LRU list to evict. In Xe, this is a TTM BO.

Sounds good. So eviction is just migration of physical memory. 

> > > +	* Only looking at physical memory data structures and locks as opposed to
> > > +	  looking at virtual memory data structures and locks.

Except of course whatever virtual memory data structures the core MM needs to
touch in order to do the migration, right? Agree that the driver shouldn't be
touching any driver data structures that concern themselves with virtual memory
addresses, though. But what about any data structures that are required as
part of GPU PTE/TLB invalidation?

> > > +	* No looking at mm/vma structs or relying on those being locked.
> > 
> > Agree with the above points.
> > 
> > > +* GPU fault side
> > > +	* mmap_read only used around core MM functions which require this lock
> > > +	  and should strive to take mmap_read lock only in GPU SVM layer.
> > > +	* Big retry loop to handle all races with the mmu notifier under the gpu
> > > +	  pagetable locks/mmu notifier range lock/whatever we end up calling
> > > +          those.
> > 
> > Again, one of the issues here (particularly with multi-GPU setups) is that it's
> > very easy to live-lock with rety loops because even attempting a migration that
> > fails can cause migration/fault handling in other threads to fail, either by
> > calling mmu_notifiers or taking a page reference.
> > 
> > Those are probably things that we should fix on the MM side, but for now UVM at
> > least uses a lock to ensure forward progress.
> >
> 
> Again, see above. Right now, migration in Xe is more of a best-case
> scenario rather than a mandatory process, and perhaps this is masking an
> issue.
> 
> Maybe I should add a comment here stating your possible concerns and that
> Xe will be implementing real migration policies and multi-GPU support
> soon. If this issue arises, we can revisit the locking guidelines or
> perhaps help contribute to the necessary core changes to make this work
> properly.

Yeah, that could be good. Something along the lines of: core MM code may need
fixing in the way I described above (one thread attempting a migration shouldn't
cause another thread that's already started one to fail).

> > > +	* Races (especially against concurrent eviction or migrate_to_ram)
> > > +	  should not be handled on the fault side by trying to hold locks;
> > > +	  rather, they should be handled using retry loops. One possible
> > > +	  exception is holding a BO's dma-resv lock during the initial migration
> > > +	  to VRAM, as this is a well-defined lock that can be taken underneath
> > > +	  the mmap_read lock.
> > 
> > See my earlier comments. Although note I agree with this in principal, and we do
> > just retry if taking the lock fails.
> > 
> > > +* Physical memory to virtual backpointer
> > > +	* Does not work, no pointers from physical memory to virtual should
> > > +	  exist.
> > 
> > Agree. And my rational is because core-MM can update the virtual address for a
> > page without notifying a driver of the new address. For example with mremap().
> > So it's impossible to keep any backpointer to a virtual address up to date.
> > 
> 
> Yep, this is good example and will include this in my next rev.
> 
> > > +	* Physical memory backpointer (page->zone_device_data) should be stable
> > > +	  from allocation to page free.
> > 
> > Agree. And presumably the rational is because it is very difficult to safely
> > update page->zone_device_data and ensure there aren't concurrent users of it
> > unless the page is free (ie. has a 0 refcount)?
> > 
> 
> Yes, exactly.
> 
> > > +* GPU pagetable locking
> > > +	* Notifier lock only protects range tree, pages valid state for a range
> > > +	  (rather than seqno due to wider notifiers), pagetable entries, and
> > > +	  mmu notifier seqno tracking, it is not a global lock to protect
> > > +          against races.
> > > +	* All races handled with big retry as mentioned above.
> > 
> > Sounds reasonable.
> > 
> > > +Overview of current design
> > > +==========================
> > > +
> > > +Baseline design is simple as possible to get a working basline in which can be
> > > +built upon.
> > > +
> > > +.. kernel-doc:: drivers/gpu/drm/xe/drm_gpusvm.c
> > > +   :doc: Overview
> > > +   :doc: Locking
> > > +   :doc: Migrataion
> > > +   :doc: Partial Unmapping of Ranges
> > > +   :doc: Examples
> > > +
> > > +Possible future design features
> > > +===============================
> > > +
> > > +* Concurrent GPU faults
> > > +	* CPU faults are concurrent so makes sense to have concurrent GPU
> > > +	  faults.
> > > +	* Should be possible with fined grained locking in the driver GPU
> > > +	  fault handler.
> > > +	* No expected GPU SVM changes required.
> > > +* Ranges with mixed system and device pages
> > > +	* Can be added if required to drm_gpusvm_get_pages fairly easily.
> > 
> > I don't see much of a need, but also don't see any barriers on the MM side for
> > doing that.
> >
> 
> I don't see any barriers either, I think it would work in Xe with slight
> tweak to our VM bind code.
> 
> I'm unsure the use case though too.
> 
> > > +* Multi-GPU support
> > > +	* Work in progress and patches expected after initially landing on GPU
> > > +	  SVM.
> > > +	* Ideally can be done with little to no changes to GPU SVM.
> > 
> > See above, but I mostly agree.
> > 
> > > +* Drop ranges in favor of radix tree
> > > +	* May be desirable for faster notifiers.
> > > +* Compound device pages
> > > +	* Nvidia, AMD, and Intel all have agreed expensive core MM functions in
> > > +	  migrate device layer are a performance bottleneck, having compound
> > > +	  device pages should help increase performance by reducing the number
> > > +	  of these expensive calls.
> > 
> > I'm hoping my patch series[1] to allow for compound device pages will land in v6.15
> 
> Cool! I was not aware of this ongoing series. Let me look.

There's probably not much of direct interest to you there at the moment. It's a
prerequisite in that it allows current (and therefore future) users of compound
ZONE_DEVICE pages to have them ref-counted normally, instead of the funky scheme
DAX uses at the moment. Our changes will build on top of that.

> > as well. The next steps are extending that to DEVICE_PRIVATE pages with
> > migrate_vma_setup() and migrate_to_ram() and we have been making some good
> > progress on this internally. I hope to have something posted before LSFMM.
> > 
> 
> Also cool. If you think you have something working, let me know and will
> pull in changes to test out.

Will do!

> > The other thing we have been looking at here is being able to migrate
> > file-backed pages to GPU memory.
> 
> Also can test this one out too.

Thanks.

 - Alistair

> Matt
> 
> > 
> > [1] - https://lore.kernel.org/linux-mm/cover.a782e309b1328f961da88abddbbc48e5b4579021.1739850794.git-series.apopple@nvidia.com/
> > 
> > > +* Higher order dma mapping for migration
> > > +	* 4k dma mapping adversely affects migration performance on Intel
> > > +	  hardware, higher order (2M) dma mapping should help here.
> > > +* Build common userptr implementation on top of GPU SVM
> > > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> > > index 476719771eef..396e535377fb 100644
> > > --- a/Documentation/gpu/rfc/index.rst
> > > +++ b/Documentation/gpu/rfc/index.rst
> > > @@ -16,6 +16,10 @@ host such documentation:
> > >  * Once the code has landed move all the documentation to the right places in
> > >    the main core, helper or driver sections.
> > >  
> > > +.. toctree::
> > > +
> > > +    gpusvm.rst
> > > +
> > >  .. toctree::
> > >  
> > >      i915_gem_lmem.rst
> > > -- 
> > > 2.34.1
> > >

Patch

diff --git a/Documentation/gpu/rfc/gpusvm.rst b/Documentation/gpu/rfc/gpusvm.rst
new file mode 100644
index 000000000000..063412160685
--- /dev/null
+++ b/Documentation/gpu/rfc/gpusvm.rst
@@ -0,0 +1,84 @@ 
+===============
+GPU SVM Section
+===============
+
+Agreed upon design principles
+=============================
+
+* migrate_to_ram path
+	* Rely only on core MM concepts (migration PTEs, page references, and
+	  page locking).
+	* No driver-specific locks other than locks for hardware interaction in
+	  this path. These are not required, and it is generally a bad idea to
+	  invent driver-defined locks to seal core MM races.
+	* Partial migration is supported (i.e., a subset of pages attempting to
+	  migrate can actually migrate, with only the faulting page guaranteed
+	  to migrate).
+	* Driver handles mixed migrations via retry loops rather than locking.
+* Eviction
+	* Only looking at physical memory data structures and locks as opposed to
+	  looking at virtual memory data structures and locks.
+	* No looking at mm/vma structs or relying on those being locked.
+* GPU fault side
+	* mmap_read lock only used around core MM functions which require it;
+	  drivers should strive to take the mmap_read lock only in the GPU SVM layer.
+	* Big retry loop to handle all races with the mmu notifier under the GPU
+	  pagetable locks/mmu notifier range lock/whatever we end up calling
+	  those.
+	* Races (especially against concurrent eviction or migrate_to_ram)
+	  should not be handled on the fault side by trying to hold locks;
+	  rather, they should be handled using retry loops. One possible
+	  exception is holding a BO's dma-resv lock during the initial migration
+	  to VRAM, as this is a well-defined lock that can be taken underneath
+	  the mmap_read lock.
+* Physical memory to virtual backpointer
+	* Does not work, no pointers from physical memory to virtual should
+	  exist.
+	* Physical memory backpointer (page->zone_device_data) should be stable
+	  from allocation to page free.
+* GPU pagetable locking
+	* Notifier lock only protects the range tree, pages valid state for a
+	  range (rather than seqno due to wider notifiers), pagetable entries, and
+	  mmu notifier seqno tracking; it is not a global lock to protect
+	  against races.
+	* All races handled with big retry as mentioned above.
+
+Overview of current design
+==========================
+
+The baseline design is as simple as possible to get a working baseline which
+can be built upon.
+
+.. kernel-doc:: drivers/gpu/drm/xe/drm_gpusvm.c
+   :doc: Overview
+   :doc: Locking
+   :doc: Migration
+   :doc: Partial Unmapping of Ranges
+   :doc: Examples
+
+Possible future design features
+===============================
+
+* Concurrent GPU faults
+	* CPU faults are concurrent, so it makes sense to have concurrent GPU
+	  faults.
+	* Should be possible with fine-grained locking in the driver GPU
+	  fault handler.
+	* No expected GPU SVM changes required.
+* Ranges with mixed system and device pages
+	* Can be added if required to drm_gpusvm_get_pages fairly easily.
+* Multi-GPU support
+	* Work in progress, with patches expected after GPU SVM initially
+	  lands.
+	* Ideally can be done with little to no changes to GPU SVM.
+* Drop ranges in favor of radix tree
+	* May be desirable for faster notifiers.
+* Compound device pages
+	* Nvidia, AMD, and Intel have all agreed that expensive core MM functions
+	  in the migrate device layer are a performance bottleneck; having compound
+	  device pages should help increase performance by reducing the number
+	  of these expensive calls.
+* Higher order dma mapping for migration
+	* 4k DMA mapping adversely affects migration performance on Intel
+	  hardware; higher order (2M) DMA mapping should help here.
+* Build common userptr implementation on top of GPU SVM
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 476719771eef..396e535377fb 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -16,6 +16,10 @@  host such documentation:
 * Once the code has landed move all the documentation to the right places in
   the main core, helper or driver sections.
 
+.. toctree::
+
+    gpusvm.rst
+
 .. toctree::
 
     i915_gem_lmem.rst