Message ID | 20231022180236.5170-1-thomas.hellstrom@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3] Documentation/gpu: VM_BIND locking document | expand |
On Sun, Oct 22, 2023 at 08:02:36PM +0200, Thomas Hellström wrote: > Add the first version of the VM_BIND locking document which is > intended to be part of the xe driver upstreaming agreement. > > The document describes and discuss the locking used during exec- > functions, evicton and for userptr gpu-vmas. Intention is to be using the > same nomenclature as the drm-vm-bind-async.rst. > > v2: > - s/gvm/gpu_vm/g (Rodrigo Vivi) > - Clarify the userptr seqlock with a pointer to mm/mmu_notifier.c > (Rodrigo Vivi) > - Adjust commit message accordingly. > - Add SPDX license header. > > v3: > - Large update to align with the drm_gpuvm manager locking > - Add "Efficient userptr gpu_vma exec function iteration" section > - Add "Locking at bind- and unbind time" section. > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> > --- > Documentation/gpu/drm-vm-bind-locking.rst | 494 ++++++++++++++++++++++ > 1 file changed, 494 insertions(+) > create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst > > diff --git a/Documentation/gpu/drm-vm-bind-locking.rst b/Documentation/gpu/drm-vm-bind-locking.rst > new file mode 100644 > index 000000000000..c290ff4287fb > --- /dev/null > +++ b/Documentation/gpu/drm-vm-bind-locking.rst > @@ -0,0 +1,494 @@ > +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT) > + > +=============== > +VM_BIND locking > +=============== > + > +This document attempts to describe what's needed to get VM_BIND locking right, > +including the userptr mmu_notifier locking and it will also discuss some > +optimizations to get rid of the looping through of all userptr mappings and > +external / shared object mappings that is needed in the simplest > +implementation. It will also discuss some implications for faulting gpu_vms. > + > +Nomenclature > +============ > + > +* ``Context``: GPU execution context. > +* ``gpu_vm``: Abstraction of a virtual GPU address space with > + meta-data. Typically one per client (DRM file-private), or one per > + context. > +* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm with > + associated meta-data. The backing storage of a gpu_vma can either be > + a GEM object or anonymous pages mapped also into the CPU > + address space for the process. > +* gpu_vm_bo: Abstracts the association of a GEM object and > + a VM. Note that if only one gpu_vma per vm and buffer object were > + allowed, the state stored with a gpu_vm_bo could just as well have > + been stored with the gpu_vma. For the purpose of this document, each > + GEM object maintains a list of gpu_vm_bos, and each gpu_vm_bo > + maintains a list of gpu_vmas. > +* ``userptr gpu_vma or just userptr``: A gpu_vma, the backing store of > + which is anonymous pages as described above. something strange after the comma, but my bad english can't allow to pin point what. Or maybe it is right and the problem *is* my bad english :) > +* ``revalidating``: Revalidating a gpu_vma means making the latest version > + of the backing store resident and making sure the gpu_vma's > + page-table entries point to that backing store. > +* ``dma_fence``: A struct dma_fence that is similar to a struct completion > + and which tracks GPU activity. When the GPU activity is finished, > + the dma_fence signals. > +* ``dma_resv``: A struct dma_resv (AKA reservation object) that is used maybe s/AKA/a.k.a ?! > + to track GPU activity in the form of multiple dma_fences on a > + gpu_vm or a GEM object. The dma_resv contains an array / list > + of dma_fences and a lock that needs to be held when adding > + additional dma_fences to the dma_resv. The lock is of a type that > + allows deadlock-safe locking of multiple dma_resvs in arbitrary order. > +* ``exec function``: An exec function is a function that revalidates all > + affected gpu_vmas, submits a GPU command batch and registers the > + dma_fence representing the GPU command's activity with all affected > + dma_resvs. For completeness, although not covered by this document, > + it's worth mentioning that an exec function may also be the > + revalidation worker that is used by some drivers in compute / > + long-running mode. > +* ``local object``: A GEM object which is local to a gpu_vm. Shared gem > + objects also share the gpu_vm's dma_resv. > +* ``shared object``: AKA external object: A GEM object which may be shared maybe s/AKA/a.k.a ?! > + by multiple gpu_vms and whose backing storage may be shared with > + other drivers. > + > + > +Locks used and locking orders > +============================= > + > +One of the benefits of VM_BIND is that local GEM objects share the gpu_vm's > +dma_resv object and hence the dma_resv lock. So even with a huge > +number of local GEM objects, only one lock is needed to make the exec > +sequence atomic. > + > +The following locks and locking orders are used: > + > +* The ``gpu_vm->lock`` (optionally an rwsem). Protects how the gpu_vm is > + partitioned into gpu_vmas. It can also protect the gpu_vm's list of > + userptr gpu_vmas. With a CPU mm analogy this would correspond to the > + mmap_lock. > +* The ``userptr_seqlock``. This lock is taken in read mode for each > + userptr gpu_vma on the gpu_vm's userptr list, and in write mode during mmu > + notifier invalidation. This is not a real seqlock but described in > + ``mm/mmu_notifier.c`` as a "Collision-retry read-side/write-side > + 'lock' a lot like a seqcount, however this allows multiple > + write-sides to hold it at once...". The read side critical section > + is enclosed by ``mmu_interval_read_begin() / > + mmu_interval_read_retry()`` with ``mmu_interval_read_begin()`` > + sleeping if the write side is held. > + The write side is held by the core mm while calling mmu interval > + invalidation notifiers. > +* The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of gpu_vmas needing > + rebinding, and also the residency of all the gpu_vm's local GEM object. > + Furthermore it typically protects the gpu_vm's list of evicted GEM > + objects and external objects. > +* The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is > + taken in read mode during exec and write mode during a mmu notifier > + invalidation. The userptr notifier lock is per gpu_vm. > +* The gpu_vm list spinlocks. With some implementations they are needed > + to be able to update the gpu_vm evicted- and external object > + list. For those implementations, the spinlocks are grabbed when the > + lists are manipulated. However to avoid locking order violations > + with the dma_resv locks, a special scheme is needed when iterating > + over the lists. > + ^^ spurious spaces > +.. _gpu_vma lifetime: > + > +Protection and lifetime of gpu_vm_bos and gpu_vmas > +================================================== > + > +The GEM object's list of gpu_vm_bos is typically protected by the > +GEM object's dma_resv. Each gpu_vm_bo holds a reference counted pointer > +to the underlying GEM object, and each gpu_vma holds a reference counted > +pointer to the gpu_vm_bo. When iterating over the GEM object's > +list of gpu_vm_bos the gem object's dma_resv must thus be held, > +but if it needs to be dropped during the iteration, care needs to be > +taken so that any gpu_vm_bo, and the gpu_vm, if dereferenced > +while the lock is dropped, do not disappear. The easiest way to avoid > +this is to take a reference on affected objects while the dma_resv is > +still held. If iterating over the gpu_vm_bo's gpu_vmas, even > +greater care needs to be taken since the gpu_vmas are not > +reference counted. If a driver accesses a gpu_vma obtained from > +the gpu_vm_bo's list of gpu_vmas, and the GEM object's > +dma_resv is dropped, at the very least, it should be thoroughly > +documented how the gpu_vma is kept alive. Otherwise holding the > +GEM object's dma_resv lock also around unlinking a gpu_vma from a > +gpu_vm_bo will ensure that doesn't happen. > + > + > +Revalidation and eviction of local objects > +========================================== > + > +Revalidation > +____________ > +With VM_BIND, all local objects need to be resident when the gpu is > +executing using the gpu_vm, and the objects need to have valid > +gpu_vmas set up pointing to them. Typically each gpu command buffer > +submission is therefore preceded with a re-validation section: > + > +.. code-block:: C > + > + dma_resv_lock(gpu_vm->resv); > + > + // Validation section starts here. > + for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) { > + validate_gem_bo(&gpu_vm_bo->gem_bo); > + > + // The following list iteration needs the Gem object's > + // dma_resv to be held (it protects the gpu_vm_bo's list of > + // gpu_vmas, but since local gem objects share the gpu_vm's > + // dma_resv, it is already held at this point. > + for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma) > + move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list); > + } > + > + for_each_gpu_vma_on_rebind_list(&gpu vm->rebind_list, &gpu_vma) { > + rebind_gpu_vma(&gpu_vma); > + remove_gpu_vma_from_rebind_list(&gpu_vma); > + } > + // Validation section ends here, and job submission starts. > + ^^^^^^^^^^^^^ more spurious spaces > + add_dependencies(&gpu_job, &gpu_vm->resv); > + job_dma_fence = gpu_submit(&gpu_job)); > + > + add_dma_fence(job_dma_fence, &gpu_vm->resv); > + dma_resv_unlock(gpu_vm->resv); > + > +The reason for having a separate gpu_vm rebind list is that there > +might be userptr gpu_vmas that are not mapping a buffer object that > +also need rebinding. > + (pausing here... I will continue the review below tomorrow) > +Eviction > +________ > + > +Eviction of one of these local objects will then look similar to the > +following: > + > +.. code-block:: C > + > + obj = get_object_from_lru(); > + > + dma_resv_lock(obj->resv); > + for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo); > + add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list); > + > + add_dependencies(&eviction_job, &obj->resv); > + job_dma_fence = gpu_submit(&eviction_job); > + add_dma_fence(&obj->resv, job_dma_fence); > + > + dma_resv_unlock(&obj->resv); > + put_object(obj); > + > +Note that since the object is local to the gpu_vm, it will share the gpu_vm's > +dma_resv lock so that ``obj->resv == gpu_vm->resv``. > +The gpu_vm_bos marked for eviction are put on the gpu_vm's evict list, > +which is protected by ``gpu_vm->resv``, that is always locked while > +evicting, due to the above equality. > + > +For VM_BIND gpu_vms, gpu_vmas don't need to be unbound before eviction, > +Since the eviction blit or copy will wait for GPU idle, any attempt by > +the GPU to access freed memory through the gpu_vma will be preceded by > +a new exec function, with a revalidation section which will make sure > +the gpu_vma is rebound. The eviction code holding the object's dma_resv while > +revalidating will ensure a new exec function may not race with the eviction. > + > +Locking with external (or shared) buffer objects > +================================================ > + > +Since shared buffer objects may be shared by multiple gpu_vm's they > +can't share their reservation object with a single gpu_vm, but will rather > +have a reservation object of their own. The shared objects bound to a > +gpu_vm using one or many gpu_vmas are therefore typically put on a > +per-gpu_vm list which is protected by the gpu_vm's dma_resv lock. Once > +the gpu_vm's reservation object is locked, it is safe to traverse the > +external object list and lock the dma_resvs of all external objects. > + > +At eviction time we now need to put the gpu_vm_bos of *all* gpu_vms a > +shared object is bound to on the gpu_vm's evict list, but we can no longer > +be certain that we hold the gpu_vm's dma_resv of all the gpu_vms the > +object is bound to, since at eviction time we only hold the object's > +private dma_resv. If we have a ww_acquire context at hand at eviction > +time we could grab the those dma_resvs but that could cause > +expensive ww_mutex rollbacks. A simple option is to just mark the > +gpu_vm_bos of the evicted gem object with an ``evicted`` bool that > +is inspected the next time the corresponding gpu_vm evicted list needs > +to be traversed. At that time the gpu_vm's dma_resv and the object's > +dma_resv is held, and the gpu_vm_bo marked evicted, can then be added > +to the gpu_vm's list of evicted gpu_vm_bos. The ``evicted`` bool would > +then be protected by the object's dma_resv. > + > +The exec function would then become > + > +.. code-block:: C > + > + dma_resv_lock(gpu_vm->resv); > + > + // External object list is protected by the gpu_vm->resv lock. > + for_each_gpu_vm_bo_on_extobj_list(gpu_vm, &gpu_vm_bo) { > + dma_resv_lock(gpu_vm_bo.gem_obj->resv); > + if (gpu_vm_bo_marked_evicted(&gpu_vm_bo)) > + add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list); > + } > + > + for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) { > + validate_gem_bo(&gpu_vm_bo->gem_bo); > + > + for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma) > + move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list); > + } > + > + for_each_gpu_vma_on_rebind_list(&gpu vm->rebind_list, &gpu_vma) { > + rebind_gpu_vma(&gpu_vma); > + remove_gpu_vma_from_rebind_list(&gpu_vma); > + } > + > + add_dependencies(&gpu_job, &gpu_vm->resv); > + job_dma_fence = gpu_submit(&gpu_job)); > + > + add_dma_fence(job_dma_fence, &gpu_vm->resv); > + for_each_shared_obj(gpu_vm, &obj) > + add_dma_fence(job_dma_fence, &obj->resv); > + dma_resv_unlock_all_resv_locks(); > + > +And the corresponding shared-object aware eviction would look like: > + > +.. code-block:: C > + > + obj = get_object_from_lru(); > + > + dma_resv_lock(obj->resv); > + for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo) > + if (object_is_vm_local(obj)) > + add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list); > + else > + mark_gpu_vm_bo_evicted(&gpu_vm_bo); > + > + add_dependencies(&eviction_job, &obj->resv); > + job_dma_fence = gpu_submit(&eviction_job); > + add_dma_fence(&obj->resv, job_dma_fence); > + > + dma_resv_unlock(&obj->resv); > + put_object(obj); > + > +.. _Spinlock iteration: > + > +Accessing the gpu_vm's lists without the dma_resv lock held > +=========================================================== > + > +Many drivers will not need to access the gpu_vm's evict- and > +external objects lists without holding the gpu_vm's dma_resv lock, > +but some drivers do, for example due to asynchronous state updates > +from within the dma_fence signalling critical path. In such case a > +spinlock can be used to protect manipulation of the lists. However, > +since higher level sleeping locks needs to be taken for each list item > +while iterating over the lists, the items already iterated over needs > +to be temporarily moved to a private list and the spinlock released > +while processing each item: > + > +.. code block:: C > + > + struct list_head still_in_list; > + > + INIT_LIST_HEAD(&still_in_list); > + > + spin_lock(&gpu_vm->list_lock); > + do { > + struct list_head *entry = list_first_entry_or_null(&gpu_vm->list, head); > + > + if (!entry) > + break; > + > + list_move_tail(&entry->head, &still_in_list); > + list_entry_get_unless_zero(entry); > + spin_unlock(&gpu_vm->list_lock); > + > + process(entry); > + > + spin_lock(&gpu_vm->list_lock); > + list_entry_put(entry); > + } while (true); > + > + list_splice_tail(&still_in_list, &gpu_vm->list); > + spin_unlock(&gpu_vm->list_lock); > + > +However, due to the additional locking and atomic operations, drivers that *can* > +avoid accessing the gpu_vm's list outside of the dma_resv lock > +might want to avoid this iteration scheme, if the driver anticipates a > +large number of list items. For lists where the anticipated number of > +list items is small, list iteration doesn't happen very often, or > +there is a significant additional cost associated with each iteration, > +the atomic operation overhead associated with this type of iteration > +is, however, probably negligible. Note that if this scheme is > +used, it is necessary to make sure this list iteration is protected by > +an outer level lock or semaphore, since list items are temporarily > +pulled off the list while iterating. > + > +TODO: Pointer to the gpuvm code implementation if this iteration and > +how to choose either iteration scheme. > + > +userptr gpu_vmas > +================ > + > +A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer object to a > +GPU virtual address range, directly maps a CPU mm range of anonymous- > +or file page-cache pages. > +A very simple approach would be to just pin the pages using > +pin_user_pages() at bind time and unpin them at unbind time, but this > +creates a Denial-Of-Service vector since a single user-space process > +would be able to pin down all of system memory, which is not > +desirable. (For special use-cases and with proper accounting pinning might > +still be a desirable feature, though). What we need to do in the > +general case is to obtain a reference to the desired pages, make sure > +we are notified > +using a MMU notifier just before the CPU mm unmaps the pages, dirty > +them if they are not mapped read-only to the GPU, and then drop the > +reference. > +When we are notified by the MMU notifier that CPU mm is about to drop the > +pages, we need to stop GPU access to the pages, > +and make sure that before the next time the GPU tries to access > +whatever is now present in the CPU mm range, we unmap the old pages > +from the GPU page tables and repeat the process of obtaining new page > +references. Note that when the core mm decides to laundry pages, we get such > +an unmap MMU notification and can mark the pages dirty again before the > +next GPU access. We also get similar MMU notifications for NUMA accounting > +which the GPU driver doesn't really need to care about, but so far > +it has proven difficult to exclude certain notifications. > + > +Using a MMU notifier for device DMA (and other methods) is described in > +`this document > +<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_. > + > +Now the method of obtaining struct page references using > +get_user_pages() unfortunately can't be used under a dma_resv lock > +since that would violate the locking order of the dma_resv lock vs the > +mmap_lock that is grabbed when resolving a CPU pagefault. This means > +the gpu_vm's list of userptr gpu_vmas needs to be protected by an > +outer lock. > + > +The MMU interval seqlock for a userptr gpu_vma is used in the following > +way: > + > +.. code-block:: C > + > + // Exclusive locking mode here is strictly needed only if there are > + // invalidated userptr vmas present, to avoid multiple userptr > + // revalidations. > + down_write(&gpu_vm->lock); > + retry: > + > + // Note: mmu_interval_read_begin() blocks until there is no > + // invalidation notifier running anymore. > + seq = mmu_interval_read_begin(&gpu_vma->userptr_interval); > + if (seq != gpu_vma->saved_seq) { > + obtain_new_page_pointers(&gpu_vma); > + dma_resv_lock(&gpu_vm->resv); > + add_gpu_vma_top_revalidate_list(&gpu_vma, &gpu_vm); > + dma_resv_unlock(&gpu_vm->resv); > + gpu_vma->saved_seq = seq; > + } > + > + // The usual revalidation goes here. > + > + // Final userptr sequence validation may not happen before the > + // submission dma_fence is added to the gpu_vm's resv, from the POW > + // of the MMU invalidation notifier. Hence the > + // userptr_notifier_lock that will make them appear atomic. > + > + add_dependencies(&gpu_job, &gpu_vm->resv); > + down_read(&gpu_vm->userptr_notifier_lock); > + if (mmu_interval_read_retry(&gpu_vma->userptr_interval, gpu_vma->saved_seq)) { > + up_read(&gpu_vm->userptr_notifier_lock); > + goto retry; > + } > + > + job_dma_fence = gpu_submit(&gpu_job)); > + > + add_dma_fence(job_dma_fence, &gpu_vm->resv); > + > + for_each_shared_obj(gpu_vm, &obj) > + add_dma_fence(job_dma_fence, &obj->resv); > + > + dma_resv_unlock_all_resv_locks(); > + up_read(&gpu_vm->userptr_notifier_lock); > + up_write(&gpu_vm->lock); > + > +The code between ``mmu_interval_read_begin()`` and the > +``mmu_interval_read_retry()`` marks the read side critical section of > +what we call the ``userptr_seqlock``. In reality the gpu_vm's userptr > +gpu_vma list is looped through, and the check is done for *all* of its > +userptr gpu_vmas, although we only show a single one here. > + > +The userptr gpu_vma MMU invalidation notifier might be called from > +reclaim context and, again to avoid locking order violations, we can't > +take any dma_resv lock nor the gpu_vm->lock from within it. > + > +.. code-block:: C > + > + bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq) > + { > + // Make sure the exec function either sees the new sequence > + // and backs off or we wait for the dma-fence: > + > + down_write(&gpu_vm->userptr_notifier_lock); > + mmu_interval_set_seq(userptr_interval, cur_seq); > + up_write(&gpu_vm->userptr_notifier_lock); > + > + // At this point, the exec function can't succeed in > + // submitting a new job, because cur_seq is an invalid > + // sequence number and will always cause a retry. When all > + // invalidation callbacks, the mmu notifier core will flip > + // the sequence number to a valid one. However we need to > + // stop gpu access to the old pages here. > + > + dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP, > + false, MAX_SCHEDULE_TIMEOUT); > + return true; > + } > + > +When this invalidation notifier returns, the GPU can no longer be > +accessing the old pages of the userptr gpu_vma and needs to redo the > +page-binding before a new GPU submission can succeed. > + > +Efficient userptr gpu_vma exec_function iteration > +_________________________________________________ > + > +If the gpu_vm's list of userptr gpu_vmas becomes large, it's > +inefficient to iterate through the complete lists of userptrs on each > +exec function to check whether each userptr gpu_vma's saved > +sequence number is invalid or stale. A solution to this is to put all > +*invalidated* userptr gpu_vmas on a separate gpu_vm list and > +only those gpu_vmas on the list are actually checked on each exec > +function. This list will then lend itself very-well to the spinlock > +locking scheme that is > +:ref:`described in the spinlock iteration section <Spinlock iteration>`, since > +in the mmu notifier, where we add the invalidated gpu_vmas to the > +list, it's not possible to take any outer locks like the > +``gpu_vm->lock`` or the ``gpu_vm->resv`` lock. Note that the > +``gpu_vm->lock`` still needs to be taken while iterating to ensure the list is > +complete, as also mentioned in that section. > + > +If using an invalidated userptr list like this, the retry check in the > +exec function trivially becomes a check for invalidated list empty. > + > +Locking at bind- and unbind time > +================================ > + > +At bind time, assuming a GEM object backed gpu_vma, each > +gpu_vma needs to be associated with a gpu_vm_bo and that > +gpu_vm_bo in turn needs to be added to the GEM object's > +gpu_vm_bo list, and possibly to the gpu_vm's external object > +list. This is referred to as *linking* the gpu_vma, and typically > +requires that the ``gpu_vm->resv`` and the GEM object's dma_resv are > +held. When unlinking a gpu_vma the same locks are typically held, > +and that ensures, as briefly discussed > +:ref:`previously <gpu_vma lifetime>`, that when iterating over > +``gpu_vmas`, either under the ``gpu_vm->resv`` or the GEM > +object's dma_resv, that the gpu_vmas stay alive as long > +as the lock under which we iterate are not is not released. For > +userptr gpu_vmas it's similarly required that during unlink, the > +outer ``gpu_vm->lock`` is held, since otherwise when iterating over > +the invalidated userptr list as described in the previous section, > +there is nothing keeping those userptr gpu_vmas alive. > + > -- > 2.41.0 >
On Sun, Oct 22, 2023 at 08:02:36PM +0200, Thomas Hellström wrote: > Add the first version of the VM_BIND locking document which is > intended to be part of the xe driver upstreaming agreement. > > The document describes and discuss the locking used during exec- > functions, evicton and for userptr gpu-vmas. Intention is to be using the > same nomenclature as the drm-vm-bind-async.rst. > > v2: > - s/gvm/gpu_vm/g (Rodrigo Vivi) > - Clarify the userptr seqlock with a pointer to mm/mmu_notifier.c > (Rodrigo Vivi) > - Adjust commit message accordingly. > - Add SPDX license header. > > v3: > - Large update to align with the drm_gpuvm manager locking > - Add "Efficient userptr gpu_vma exec function iteration" section > - Add "Locking at bind- and unbind time" section. > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> > --- > Documentation/gpu/drm-vm-bind-locking.rst | 494 ++++++++++++++++++++++ > 1 file changed, 494 insertions(+) > create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst > > diff --git a/Documentation/gpu/drm-vm-bind-locking.rst b/Documentation/gpu/drm-vm-bind-locking.rst > new file mode 100644 > index 000000000000..c290ff4287fb > --- /dev/null > +++ b/Documentation/gpu/drm-vm-bind-locking.rst > @@ -0,0 +1,494 @@ > +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT) > + > +=============== > +VM_BIND locking > +=============== > + > +This document attempts to describe what's needed to get VM_BIND locking right, > +including the userptr mmu_notifier locking and it will also discuss some > +optimizations to get rid of the looping through of all userptr mappings and > +external / shared object mappings that is needed in the simplest > +implementation. It will also discuss some implications for faulting gpu_vms. > + > +Nomenclature > +============ > + > +* ``Context``: GPU execution context. > +* ``gpu_vm``: Abstraction of a virtual GPU address space with > + meta-data. Typically one per client (DRM file-private), or one per > + context. > +* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm with > + associated meta-data. The backing storage of a gpu_vma can either be > + a GEM object or anonymous pages mapped also into the CPU > + address space for the process. > +* gpu_vm_bo: Abstracts the association of a GEM object and > + a VM. Note that if only one gpu_vma per vm and buffer object were > + allowed, the state stored with a gpu_vm_bo could just as well have > + been stored with the gpu_vma. For the purpose of this document, each > + GEM object maintains a list of gpu_vm_bos, and each gpu_vm_bo > + maintains a list of gpu_vmas. > +* ``userptr gpu_vma or just userptr``: A gpu_vma, the backing store of > + which is anonymous pages as described above. > +* ``revalidating``: Revalidating a gpu_vma means making the latest version > + of the backing store resident and making sure the gpu_vma's > + page-table entries point to that backing store. > +* ``dma_fence``: A struct dma_fence that is similar to a struct completion > + and which tracks GPU activity. When the GPU activity is finished, > + the dma_fence signals. > +* ``dma_resv``: A struct dma_resv (AKA reservation object) that is used > + to track GPU activity in the form of multiple dma_fences on a > + gpu_vm or a GEM object. The dma_resv contains an array / list > + of dma_fences and a lock that needs to be held when adding > + additional dma_fences to the dma_resv. The lock is of a type that > + allows deadlock-safe locking of multiple dma_resvs in arbitrary order. > +* ``exec function``: An exec function is a function that revalidates all > + affected gpu_vmas, submits a GPU command batch and registers the > + dma_fence representing the GPU command's activity with all affected > + dma_resvs. For completeness, although not covered by this document, > + it's worth mentioning that an exec function may also be the > + revalidation worker that is used by some drivers in compute / > + long-running mode. > +* ``local object``: A GEM object which is local to a gpu_vm. Shared gem > + objects also share the gpu_vm's dma_resv. > +* ``shared object``: AKA external object: A GEM object which may be shared > + by multiple gpu_vms and whose backing storage may be shared with > + other drivers. > + > + > +Locks used and locking orders > +============================= > + > +One of the benefits of VM_BIND is that local GEM objects share the gpu_vm's > +dma_resv object and hence the dma_resv lock. So even with a huge > +number of local GEM objects, only one lock is needed to make the exec > +sequence atomic. > + > +The following locks and locking orders are used: > + > +* The ``gpu_vm->lock`` (optionally an rwsem). Protects how the gpu_vm is > + partitioned into gpu_vmas. It can also protect the gpu_vm's list of > + userptr gpu_vmas. With a CPU mm analogy this would correspond to the > + mmap_lock. > +* The ``userptr_seqlock``. This lock is taken in read mode for each > + userptr gpu_vma on the gpu_vm's userptr list, and in write mode during mmu > + notifier invalidation. This is not a real seqlock but described in > + ``mm/mmu_notifier.c`` as a "Collision-retry read-side/write-side > + 'lock' a lot like a seqcount, however this allows multiple > + write-sides to hold it at once...". The read side critical section > + is enclosed by ``mmu_interval_read_begin() / > + mmu_interval_read_retry()`` with ``mmu_interval_read_begin()`` > + sleeping if the write side is held. > + The write side is held by the core mm while calling mmu interval > + invalidation notifiers. > +* The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of gpu_vmas needing > + rebinding, and also the residency of all the gpu_vm's local GEM object. > + Furthermore it typically protects the gpu_vm's list of evicted GEM > + objects and external objects. > +* The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is > + taken in read mode during exec and write mode during a mmu notifier > + invalidation. The userptr notifier lock is per gpu_vm. > +* The gpu_vm list spinlocks. With some implementations they are needed > + to be able to update the gpu_vm evicted- and external object > + list. For those implementations, the spinlocks are grabbed when the > + lists are manipulated. However to avoid locking order violations > + with the dma_resv locks, a special scheme is needed when iterating > + over the lists. > + > +.. _gpu_vma lifetime: > + > +Protection and lifetime of gpu_vm_bos and gpu_vmas > +================================================== > + > +The GEM object's list of gpu_vm_bos is typically protected by the > +GEM object's dma_resv. Each gpu_vm_bo holds a reference counted pointer > +to the underlying GEM object, and each gpu_vma holds a reference counted > +pointer to the gpu_vm_bo. When iterating over the GEM object's > +list of gpu_vm_bos the gem object's dma_resv must thus be held, > +but if it needs to be dropped during the iteration, care needs to be > +taken so that any gpu_vm_bo, and the gpu_vm, if dereferenced > +while the lock is dropped, do not disappear. The easiest way to avoid > +this is to take a reference on affected objects while the dma_resv is > +still held. If iterating over the gpu_vm_bo's gpu_vmas, even > +greater care needs to be taken since the gpu_vmas are not > +reference counted. If a driver accesses a gpu_vma obtained from > +the gpu_vm_bo's list of gpu_vmas, and the GEM object's > +dma_resv is dropped, at the very least, it should be thoroughly > +documented how the gpu_vma is kept alive. Otherwise holding the > +GEM object's dma_resv lock also around unlinking a gpu_vma from a > +gpu_vm_bo will ensure that doesn't happen. > + > + > +Revalidation and eviction of local objects > +========================================== > + > +Revalidation > +____________ > +With VM_BIND, all local objects need to be resident when the gpu is > +executing using the gpu_vm, and the objects need to have valid > +gpu_vmas set up pointing to them. Typically each gpu command buffer > +submission is therefore preceded with a re-validation section: > + > +.. code-block:: C > + > + dma_resv_lock(gpu_vm->resv); > + > + // Validation section starts here. > + for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) { > + validate_gem_bo(&gpu_vm_bo->gem_bo); > + > + // The following list iteration needs the Gem object's > + // dma_resv to be held (it protects the gpu_vm_bo's list of > + // gpu_vmas, but since local gem objects share the gpu_vm's > + // dma_resv, it is already held at this point. > + for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma) > + move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list); > + } > + > + for_each_gpu_vma_on_rebind_list(&gpu vm->rebind_list, &gpu_vma) { > + rebind_gpu_vma(&gpu_vma); > + remove_gpu_vma_from_rebind_list(&gpu_vma); > + } > + // Validation section ends here, and job submission starts. > + > + add_dependencies(&gpu_job, &gpu_vm->resv); > + job_dma_fence = gpu_submit(&gpu_job)); > + > + add_dma_fence(job_dma_fence, &gpu_vm->resv); > + dma_resv_unlock(gpu_vm->resv); > + > +The reason for having a separate gpu_vm rebind list is that there > +might be userptr gpu_vmas that are not mapping a buffer object that > +also need rebinding. > + > +Eviction > +________ > + > +Eviction of one of these local objects will then look similar to the > +following: > + > +.. code-block:: C > + > + obj = get_object_from_lru(); > + > + dma_resv_lock(obj->resv); > + for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo); > + add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list); > + > + add_dependencies(&eviction_job, &obj->resv); > + job_dma_fence = gpu_submit(&eviction_job); > + add_dma_fence(&obj->resv, job_dma_fence); > + > + dma_resv_unlock(&obj->resv); > + put_object(obj); > + > +Note that since the object is local to the gpu_vm, it will share the gpu_vm's > +dma_resv lock so that ``obj->resv == gpu_vm->resv``. > +The gpu_vm_bos marked for eviction are put on the gpu_vm's evict list, > +which is protected by ``gpu_vm->resv``, that is always locked while > +evicting, due to the above equality. > + > +For VM_BIND gpu_vms, gpu_vmas don't need to be unbound before eviction, > +Since the eviction blit or copy will wait for GPU idle, any attempt by is it true for everyone, or is it only an Xe implementation that is waiting for the idle? what guarantees that? > +the GPU to access freed memory through the gpu_vma will be preceded by > +a new exec function, with a revalidation section which will make sure > +the gpu_vma is rebound. The eviction code holding the object's dma_resv while > +revalidating will ensure a new exec function may not race with the eviction. > + > +Locking with external (or shared) buffer objects > +================================================ > + > +Since shared buffer objects may be shared by multiple gpu_vm's they > +can't share their reservation object with a single gpu_vm, but will rather > +have a reservation object of their own. The shared objects bound to a > +gpu_vm using one or many gpu_vmas are therefore typically put on a > +per-gpu_vm list which is protected by the gpu_vm's dma_resv lock. Once > +the gpu_vm's reservation object is locked, it is safe to traverse the > +external object list and lock the dma_resvs of all external objects. > + > +At eviction time we now need to put the gpu_vm_bos of *all* gpu_vms a > +shared object is bound to on the gpu_vm's evict list, but we can no longer > +be certain that we hold the gpu_vm's dma_resv of all the gpu_vms the > +object is bound to, since at eviction time we only hold the object's > +private dma_resv. If we have a ww_acquire context at hand at eviction > +time we could grab the those dma_resvs but that could cause > +expensive ww_mutex rollbacks. A simple option is to just mark the > +gpu_vm_bos of the evicted gem object with an ``evicted`` bool that > +is inspected the next time the corresponding gpu_vm evicted list needs > +to be traversed. At that time the gpu_vm's dma_resv and the object's > +dma_resv is held, and the gpu_vm_bo marked evicted, can then be added > +to the gpu_vm's list of evicted gpu_vm_bos. The ``evicted`` bool would > +then be protected by the object's dma_resv. > + > +The exec function would then become > + > +.. code-block:: C > + > + dma_resv_lock(gpu_vm->resv); > + > + // External object list is protected by the gpu_vm->resv lock. > + for_each_gpu_vm_bo_on_extobj_list(gpu_vm, &gpu_vm_bo) { > + dma_resv_lock(gpu_vm_bo.gem_obj->resv); > + if (gpu_vm_bo_marked_evicted(&gpu_vm_bo)) > + add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list); > + } > + > + for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) { > + validate_gem_bo(&gpu_vm_bo->gem_bo); > + > + for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma) > + move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list); > + } > + > + for_each_gpu_vma_on_rebind_list(&gpu vm->rebind_list, &gpu_vma) { > + rebind_gpu_vma(&gpu_vma); > + remove_gpu_vma_from_rebind_list(&gpu_vma); > + } > + > + add_dependencies(&gpu_job, &gpu_vm->resv); > + job_dma_fence = gpu_submit(&gpu_job)); > + > + add_dma_fence(job_dma_fence, &gpu_vm->resv); > + for_each_shared_obj(gpu_vm, &obj) > + add_dma_fence(job_dma_fence, &obj->resv); > + dma_resv_unlock_all_resv_locks(); > + > +And the corresponding shared-object aware eviction would look like: > + > +.. code-block:: C > + > + obj = get_object_from_lru(); > + > + dma_resv_lock(obj->resv); > + for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo) > + if (object_is_vm_local(obj)) > + add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list); > + else > + mark_gpu_vm_bo_evicted(&gpu_vm_bo); > + > + add_dependencies(&eviction_job, &obj->resv); > + job_dma_fence = gpu_submit(&eviction_job); > + add_dma_fence(&obj->resv, job_dma_fence); > + > + dma_resv_unlock(&obj->resv); > + put_object(obj); > + > +.. _Spinlock iteration: > + > +Accessing the gpu_vm's lists without the dma_resv lock held > +=========================================================== > + > +Many drivers will not need to access the gpu_vm's evict- and ^ is this a typo? > +external objects lists without holding the gpu_vm's dma_resv lock, "not need" "without". this kinda-double-negative seems confusing... do they need access with holding the dma_resv lock? > +but some drivers do, do == do not need to access with the lock? or do == do need to access without the lock? (I believe the last one is the correct one, but only to show the confusion of the double negative.) What about start with something like: Although many drivers will hold the drm_resv lock when accessing gpu_vm's evict and external object lists, there are some drivers that might need to access them without the lock. (or something like that) ?! > for example due to asynchronous state updates > +from within the dma_fence signalling critical path. In such case a > +spinlock can be used to protect manipulation of the lists. However, > +since higher level sleeping locks needs to be taken for each list item > +while iterating over the lists, the items already iterated over needs > +to be temporarily moved to a private list and the spinlock released > +while processing each item: > + > +.. code block:: C > + > + struct list_head still_in_list; > + > + INIT_LIST_HEAD(&still_in_list); > + > + spin_lock(&gpu_vm->list_lock); > + do { > + struct list_head *entry = list_first_entry_or_null(&gpu_vm->list, head); > + > + if (!entry) > + break; > + > + list_move_tail(&entry->head, &still_in_list); > + list_entry_get_unless_zero(entry); > + spin_unlock(&gpu_vm->list_lock); > + > + process(entry); > + > + spin_lock(&gpu_vm->list_lock); > + list_entry_put(entry); > + } while (true); > + > + list_splice_tail(&still_in_list, &gpu_vm->list); > + spin_unlock(&gpu_vm->list_lock); > + > +However, due to the additional locking and atomic operations, drivers that *can* > +avoid accessing the gpu_vm's list outside of the dma_resv lock > +might want to avoid this iteration scheme, if the driver anticipates a > +large number of list items. For lists where the anticipated number of > +list items is small, list iteration doesn't happen very often, or > +there is a significant additional cost associated with each iteration, > +the atomic operation overhead associated with this type of iteration > +is, however, probably negligible. Note that if this scheme is > +used, it is necessary to make sure this list iteration is protected by > +an outer level lock or semaphore, since list items are temporarily > +pulled off the list while iterating. > + > +TODO: Pointer to the gpuvm code implementation if this iteration and > +how to choose either iteration scheme. > + > +userptr gpu_vmas > +================ > + > +A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer object to a > +GPU virtual address range, directly maps a CPU mm range of anonymous- > +or file page-cache pages. > +A very simple approach would be to just pin the pages using > +pin_user_pages() at bind time and unpin them at unbind time, but this > +creates a Denial-Of-Service vector since a single user-space process > +would be able to pin down all of system memory, which is not > +desirable. (For special use-cases and with proper accounting pinning might > +still be a desirable feature, though). What we need to do in the > +general case is to obtain a reference to the desired pages, make sure > +we are notified ^ spurious space and strange line breakage... > +using a MMU notifier just before the CPU mm unmaps the pages, dirty > +them if they are not mapped read-only to the GPU, and then drop the > +reference. > +When we are notified by the MMU notifier that CPU mm is about to drop the > +pages, we need to stop GPU access to the pages, do we do this already? what are the steps that we need to do to ensure this? is it possible that this memory is being in use by the submission? maybe as any offset inside a batch buffer? is this possible? and if so, can we detect this and should we abort and block the execution? > +and make sure that before the next time the GPU tries to access > +whatever is now present in the CPU mm range, we unmap the old pages > +from the GPU page tables and repeat the process of obtaining new page > +references. Note that when the core mm decides to laundry pages, we get such > +an unmap MMU notification and can mark the pages dirty again before the > +next GPU access. We also get similar MMU notifications for NUMA accounting > +which the GPU driver doesn't really need to care about, but so far > +it has proven difficult to exclude certain notifications. > + > +Using a MMU notifier for device DMA (and other methods) is described in > +`this document > +<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_. > + > +Now the method of obtaining struct page references using > +get_user_pages() unfortunately can't be used under a dma_resv lock > +since that would violate the locking order of the dma_resv lock vs the > +mmap_lock that is grabbed when resolving a CPU pagefault. This means > +the gpu_vm's list of userptr gpu_vmas needs to be protected by an > +outer lock. > + > +The MMU interval seqlock for a userptr gpu_vma is used in the following > +way: > + > +.. code-block:: C > + > + // Exclusive locking mode here is strictly needed only if there are > + // invalidated userptr vmas present, to avoid multiple userptr > + // revalidations. > + down_write(&gpu_vm->lock); > + retry: > + > + // Note: mmu_interval_read_begin() blocks until there is no > + // invalidation notifier running anymore. > + seq = mmu_interval_read_begin(&gpu_vma->userptr_interval); > + if (seq != gpu_vma->saved_seq) { > + obtain_new_page_pointers(&gpu_vma); ^ ^ more strange tabs vs spaces > + dma_resv_lock(&gpu_vm->resv); > + add_gpu_vma_top_revalidate_list(&gpu_vma, &gpu_vm); > + dma_resv_unlock(&gpu_vm->resv); > + gpu_vma->saved_seq = seq; > + } > + > + // The usual revalidation goes here. > + > + // Final userptr sequence validation may not happen before the > + // submission dma_fence is added to the gpu_vm's resv, from the POW > + // of the MMU invalidation notifier. Hence the > + // userptr_notifier_lock that will make them appear atomic. > + > + add_dependencies(&gpu_job, &gpu_vm->resv); > + down_read(&gpu_vm->userptr_notifier_lock); > + if (mmu_interval_read_retry(&gpu_vma->userptr_interval, gpu_vma->saved_seq)) { > + up_read(&gpu_vm->userptr_notifier_lock); > + goto retry; ^ ^ more strange tabs vs spaces > + } > + > + job_dma_fence = gpu_submit(&gpu_job)); > + > + add_dma_fence(job_dma_fence, &gpu_vm->resv); > + > + for_each_shared_obj(gpu_vm, &obj) > + add_dma_fence(job_dma_fence, &obj->resv); > + > + dma_resv_unlock_all_resv_locks(); > + up_read(&gpu_vm->userptr_notifier_lock); > + up_write(&gpu_vm->lock); > + > +The code between ``mmu_interval_read_begin()`` and the > +``mmu_interval_read_retry()`` marks the read side critical section of > +what we call the ``userptr_seqlock``. In reality the gpu_vm's userptr > +gpu_vma list is looped through, and the check is done for *all* of its > +userptr gpu_vmas, although we only show a single one here. > + > +The userptr gpu_vma MMU invalidation notifier might be called from > +reclaim context and, again to avoid locking order violations, we can't > +take any dma_resv lock nor the gpu_vm->lock from within it. > + > +.. code-block:: C > + > + bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq) > + { > + // Make sure the exec function either sees the new sequence > + // and backs off or we wait for the dma-fence: ^ ^ strange spaces vs tabs mix... > + > + down_write(&gpu_vm->userptr_notifier_lock); ^ same > + mmu_interval_set_seq(userptr_interval, cur_seq); > + up_write(&gpu_vm->userptr_notifier_lock); > + > + // At this point, the exec function can't succeed in > + // submitting a new job, because cur_seq is an invalid > + // sequence number and will always cause a retry. When all > + // invalidation callbacks, the mmu notifier core will flip > + // the sequence number to a valid one. However we need to > + // stop gpu access to the old pages here. > + > + dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP, > + false, MAX_SCHEDULE_TIMEOUT); > + return true; > + } > + > +When this invalidation notifier returns, the GPU can no longer be > +accessing the old pages of the userptr gpu_vma and needs to redo the > +page-binding before a new GPU submission can succeed. this partially responds the questions I had above. probably move this up? > + > +Efficient userptr gpu_vma exec_function iteration > +_________________________________________________ > + > +If the gpu_vm's list of userptr gpu_vmas becomes large, it's > +inefficient to iterate through the complete lists of userptrs on each > +exec function to check whether each userptr gpu_vma's saved > +sequence number is invalid or stale. A solution to this is to put all > +*invalidated* userptr gpu_vmas on a separate gpu_vm list and > +only those gpu_vmas on the list are actually checked on each exec > +function. This list will then lend itself very-well to the spinlock > +locking scheme that is > +:ref:`described in the spinlock iteration section <Spinlock iteration>`, since > +in the mmu notifier, where we add the invalidated gpu_vmas to the > +list, it's not possible to take any outer locks like the ^spurious space > +``gpu_vm->lock`` or the ``gpu_vm->resv`` lock. Note that the > +``gpu_vm->lock`` still needs to be taken while iterating to ensure the list is > +complete, as also mentioned in that section. > + > +If using an invalidated userptr list like this, the retry check in the > +exec function trivially becomes a check for invalidated list empty. > + > +Locking at bind- and unbind time ^ here again the extra '-', any particular reason that I might be missing? > +================================ > + > +At bind time, assuming a GEM object backed gpu_vma, each > +gpu_vma needs to be associated with a gpu_vm_bo and that > +gpu_vm_bo in turn needs to be added to the GEM object's > +gpu_vm_bo list, and possibly to the gpu_vm's external object > +list. This is referred to as *linking* the gpu_vma, and typically > +requires that the ``gpu_vm->resv`` and the GEM object's dma_resv are > +held. When unlinking a gpu_vma the same locks are typically held, > +and that ensures, as briefly discussed > +:ref:`previously <gpu_vma lifetime>`, that when iterating over > +``gpu_vmas`, either under the ``gpu_vm->resv`` or the GEM > +object's dma_resv, that the gpu_vmas stay alive as long > +as the lock under which we iterate are not is not released. For > +userptr gpu_vmas it's similarly required that during unlink, the > +outer ``gpu_vm->lock`` is held, since otherwise when iterating over > +the invalidated userptr list as described in the previous section, > +there is nothing keeping those userptr gpu_vmas alive. Overall the doc looks great! > + > -- > 2.41.0 >
Hi, Rodrigo On 11/1/23 21:11, Rodrigo Vivi wrote: >> Add the first version of the VM_BIND locking document which is >> intended to be part of the xe driver upstreaming agreement. >> >> The document describes and discuss the locking used during exec- >> functions, evicton and for userptr gpu-vmas. Intention is to be using the >> same nomenclature as the drm-vm-bind-async.rst. >> >> v2: >> - s/gvm/gpu_vm/g (Rodrigo Vivi) >> - Clarify the userptr seqlock with a pointer to mm/mmu_notifier.c >> (Rodrigo Vivi) >> - Adjust commit message accordingly. >> - Add SPDX license header. >> >> v3: >> - Large update to align with the drm_gpuvm manager locking >> - Add "Efficient userptr gpu_vma exec function iteration" section >> - Add "Locking at bind- and unbind time" section. >> >> Cc: Rodrigo Vivi<rodrigo.vivi@intel.com> >> Signed-off-by: Thomas Hellström<thomas.hellstrom@linux.intel.com> Thanks for the review comments. I'll address all these in a v4 shortly. /Thomas
diff --git a/Documentation/gpu/drm-vm-bind-locking.rst b/Documentation/gpu/drm-vm-bind-locking.rst new file mode 100644 index 000000000000..c290ff4287fb --- /dev/null +++ b/Documentation/gpu/drm-vm-bind-locking.rst @@ -0,0 +1,494 @@ +.. SPDX-License-Identifier: (GPL-2.0+ OR MIT) + +=============== +VM_BIND locking +=============== + +This document attempts to describe what's needed to get VM_BIND locking right, +including the userptr mmu_notifier locking and it will also discuss some +optimizations to get rid of the looping through of all userptr mappings and +external / shared object mappings that is needed in the simplest +implementation. It will also discuss some implications for faulting gpu_vms. + +Nomenclature +============ + +* ``Context``: GPU execution context. +* ``gpu_vm``: Abstraction of a virtual GPU address space with + meta-data. Typically one per client (DRM file-private), or one per + context. +* ``gpu_vma``: Abstraction of a GPU address range within a gpu_vm with + associated meta-data. The backing storage of a gpu_vma can either be + a GEM object or anonymous pages mapped also into the CPU + address space for the process. +* gpu_vm_bo: Abstracts the association of a GEM object and + a VM. Note that if only one gpu_vma per vm and buffer object were + allowed, the state stored with a gpu_vm_bo could just as well have + been stored with the gpu_vma. For the purpose of this document, each + GEM object maintains a list of gpu_vm_bos, and each gpu_vm_bo + maintains a list of gpu_vmas. +* ``userptr gpu_vma or just userptr``: A gpu_vma, the backing store of + which is anonymous pages as described above. +* ``revalidating``: Revalidating a gpu_vma means making the latest version + of the backing store resident and making sure the gpu_vma's + page-table entries point to that backing store. +* ``dma_fence``: A struct dma_fence that is similar to a struct completion + and which tracks GPU activity. When the GPU activity is finished, + the dma_fence signals. +* ``dma_resv``: A struct dma_resv (AKA reservation object) that is used + to track GPU activity in the form of multiple dma_fences on a + gpu_vm or a GEM object. The dma_resv contains an array / list + of dma_fences and a lock that needs to be held when adding + additional dma_fences to the dma_resv. The lock is of a type that + allows deadlock-safe locking of multiple dma_resvs in arbitrary order. +* ``exec function``: An exec function is a function that revalidates all + affected gpu_vmas, submits a GPU command batch and registers the + dma_fence representing the GPU command's activity with all affected + dma_resvs. For completeness, although not covered by this document, + it's worth mentioning that an exec function may also be the + revalidation worker that is used by some drivers in compute / + long-running mode. +* ``local object``: A GEM object which is local to a gpu_vm. Shared gem + objects also share the gpu_vm's dma_resv. +* ``shared object``: AKA external object: A GEM object which may be shared + by multiple gpu_vms and whose backing storage may be shared with + other drivers. + + +Locks used and locking orders +============================= + +One of the benefits of VM_BIND is that local GEM objects share the gpu_vm's +dma_resv object and hence the dma_resv lock. So even with a huge +number of local GEM objects, only one lock is needed to make the exec +sequence atomic. + +The following locks and locking orders are used: + +* The ``gpu_vm->lock`` (optionally an rwsem). Protects how the gpu_vm is + partitioned into gpu_vmas. It can also protect the gpu_vm's list of + userptr gpu_vmas. With a CPU mm analogy this would correspond to the + mmap_lock. +* The ``userptr_seqlock``. This lock is taken in read mode for each + userptr gpu_vma on the gpu_vm's userptr list, and in write mode during mmu + notifier invalidation. This is not a real seqlock but described in + ``mm/mmu_notifier.c`` as a "Collision-retry read-side/write-side + 'lock' a lot like a seqcount, however this allows multiple + write-sides to hold it at once...". The read side critical section + is enclosed by ``mmu_interval_read_begin() / + mmu_interval_read_retry()`` with ``mmu_interval_read_begin()`` + sleeping if the write side is held. + The write side is held by the core mm while calling mmu interval + invalidation notifiers. +* The ``gpu_vm->resv`` lock. Protects the gpu_vm's list of gpu_vmas needing + rebinding, and also the residency of all the gpu_vm's local GEM object. + Furthermore it typically protects the gpu_vm's list of evicted GEM + objects and external objects. +* The ``gpu_vm->userptr_notifier_lock``. This is an rwsem that is + taken in read mode during exec and write mode during a mmu notifier + invalidation. The userptr notifier lock is per gpu_vm. +* The gpu_vm list spinlocks. With some implementations they are needed + to be able to update the gpu_vm evicted- and external object + list. For those implementations, the spinlocks are grabbed when the + lists are manipulated. However to avoid locking order violations + with the dma_resv locks, a special scheme is needed when iterating + over the lists. + +.. _gpu_vma lifetime: + +Protection and lifetime of gpu_vm_bos and gpu_vmas +================================================== + +The GEM object's list of gpu_vm_bos is typically protected by the +GEM object's dma_resv. Each gpu_vm_bo holds a reference counted pointer +to the underlying GEM object, and each gpu_vma holds a reference counted +pointer to the gpu_vm_bo. When iterating over the GEM object's +list of gpu_vm_bos the gem object's dma_resv must thus be held, +but if it needs to be dropped during the iteration, care needs to be +taken so that any gpu_vm_bo, and the gpu_vm, if dereferenced +while the lock is dropped, do not disappear. The easiest way to avoid +this is to take a reference on affected objects while the dma_resv is +still held. If iterating over the gpu_vm_bo's gpu_vmas, even +greater care needs to be taken since the gpu_vmas are not +reference counted. If a driver accesses a gpu_vma obtained from +the gpu_vm_bo's list of gpu_vmas, and the GEM object's +dma_resv is dropped, at the very least, it should be thoroughly +documented how the gpu_vma is kept alive. Otherwise holding the +GEM object's dma_resv lock also around unlinking a gpu_vma from a +gpu_vm_bo will ensure that doesn't happen. + + +Revalidation and eviction of local objects +========================================== + +Revalidation +____________ +With VM_BIND, all local objects need to be resident when the gpu is +executing using the gpu_vm, and the objects need to have valid +gpu_vmas set up pointing to them. Typically each gpu command buffer +submission is therefore preceded with a re-validation section: + +.. code-block:: C + + dma_resv_lock(gpu_vm->resv); + + // Validation section starts here. + for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) { + validate_gem_bo(&gpu_vm_bo->gem_bo); + + // The following list iteration needs the Gem object's + // dma_resv to be held (it protects the gpu_vm_bo's list of + // gpu_vmas, but since local gem objects share the gpu_vm's + // dma_resv, it is already held at this point. + for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma) + move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list); + } + + for_each_gpu_vma_on_rebind_list(&gpu vm->rebind_list, &gpu_vma) { + rebind_gpu_vma(&gpu_vma); + remove_gpu_vma_from_rebind_list(&gpu_vma); + } + // Validation section ends here, and job submission starts. + + add_dependencies(&gpu_job, &gpu_vm->resv); + job_dma_fence = gpu_submit(&gpu_job)); + + add_dma_fence(job_dma_fence, &gpu_vm->resv); + dma_resv_unlock(gpu_vm->resv); + +The reason for having a separate gpu_vm rebind list is that there +might be userptr gpu_vmas that are not mapping a buffer object that +also need rebinding. + +Eviction +________ + +Eviction of one of these local objects will then look similar to the +following: + +.. code-block:: C + + obj = get_object_from_lru(); + + dma_resv_lock(obj->resv); + for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo); + add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list); + + add_dependencies(&eviction_job, &obj->resv); + job_dma_fence = gpu_submit(&eviction_job); + add_dma_fence(&obj->resv, job_dma_fence); + + dma_resv_unlock(&obj->resv); + put_object(obj); + +Note that since the object is local to the gpu_vm, it will share the gpu_vm's +dma_resv lock so that ``obj->resv == gpu_vm->resv``. +The gpu_vm_bos marked for eviction are put on the gpu_vm's evict list, +which is protected by ``gpu_vm->resv``, that is always locked while +evicting, due to the above equality. + +For VM_BIND gpu_vms, gpu_vmas don't need to be unbound before eviction, +Since the eviction blit or copy will wait for GPU idle, any attempt by +the GPU to access freed memory through the gpu_vma will be preceded by +a new exec function, with a revalidation section which will make sure +the gpu_vma is rebound. The eviction code holding the object's dma_resv while +revalidating will ensure a new exec function may not race with the eviction. + +Locking with external (or shared) buffer objects +================================================ + +Since shared buffer objects may be shared by multiple gpu_vm's they +can't share their reservation object with a single gpu_vm, but will rather +have a reservation object of their own. The shared objects bound to a +gpu_vm using one or many gpu_vmas are therefore typically put on a +per-gpu_vm list which is protected by the gpu_vm's dma_resv lock. Once +the gpu_vm's reservation object is locked, it is safe to traverse the +external object list and lock the dma_resvs of all external objects. + +At eviction time we now need to put the gpu_vm_bos of *all* gpu_vms a +shared object is bound to on the gpu_vm's evict list, but we can no longer +be certain that we hold the gpu_vm's dma_resv of all the gpu_vms the +object is bound to, since at eviction time we only hold the object's +private dma_resv. If we have a ww_acquire context at hand at eviction +time we could grab the those dma_resvs but that could cause +expensive ww_mutex rollbacks. A simple option is to just mark the +gpu_vm_bos of the evicted gem object with an ``evicted`` bool that +is inspected the next time the corresponding gpu_vm evicted list needs +to be traversed. At that time the gpu_vm's dma_resv and the object's +dma_resv is held, and the gpu_vm_bo marked evicted, can then be added +to the gpu_vm's list of evicted gpu_vm_bos. The ``evicted`` bool would +then be protected by the object's dma_resv. + +The exec function would then become + +.. code-block:: C + + dma_resv_lock(gpu_vm->resv); + + // External object list is protected by the gpu_vm->resv lock. + for_each_gpu_vm_bo_on_extobj_list(gpu_vm, &gpu_vm_bo) { + dma_resv_lock(gpu_vm_bo.gem_obj->resv); + if (gpu_vm_bo_marked_evicted(&gpu_vm_bo)) + add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list); + } + + for_each_gpu_vm_bo_on_evict_list(&gpu_vm->evict_list, &gpu_vm_bo) { + validate_gem_bo(&gpu_vm_bo->gem_bo); + + for_each_gpu_vma_of_gpu_vm_bo(&gpu_vm_bo, &gpu_vma) + move_gpu_vma_to_rebind_list(&gpu_vma, &gpu_vm->rebind_list); + } + + for_each_gpu_vma_on_rebind_list(&gpu vm->rebind_list, &gpu_vma) { + rebind_gpu_vma(&gpu_vma); + remove_gpu_vma_from_rebind_list(&gpu_vma); + } + + add_dependencies(&gpu_job, &gpu_vm->resv); + job_dma_fence = gpu_submit(&gpu_job)); + + add_dma_fence(job_dma_fence, &gpu_vm->resv); + for_each_shared_obj(gpu_vm, &obj) + add_dma_fence(job_dma_fence, &obj->resv); + dma_resv_unlock_all_resv_locks(); + +And the corresponding shared-object aware eviction would look like: + +.. code-block:: C + + obj = get_object_from_lru(); + + dma_resv_lock(obj->resv); + for_each_gpu_vm_bo_of_obj(obj, &gpu_vm_bo) + if (object_is_vm_local(obj)) + add_gpu_vm_bo_to_evict_list(&gpu_vm_bo, &gpu_vm->evict_list); + else + mark_gpu_vm_bo_evicted(&gpu_vm_bo); + + add_dependencies(&eviction_job, &obj->resv); + job_dma_fence = gpu_submit(&eviction_job); + add_dma_fence(&obj->resv, job_dma_fence); + + dma_resv_unlock(&obj->resv); + put_object(obj); + +.. _Spinlock iteration: + +Accessing the gpu_vm's lists without the dma_resv lock held +=========================================================== + +Many drivers will not need to access the gpu_vm's evict- and +external objects lists without holding the gpu_vm's dma_resv lock, +but some drivers do, for example due to asynchronous state updates +from within the dma_fence signalling critical path. In such case a +spinlock can be used to protect manipulation of the lists. However, +since higher level sleeping locks needs to be taken for each list item +while iterating over the lists, the items already iterated over needs +to be temporarily moved to a private list and the spinlock released +while processing each item: + +.. code block:: C + + struct list_head still_in_list; + + INIT_LIST_HEAD(&still_in_list); + + spin_lock(&gpu_vm->list_lock); + do { + struct list_head *entry = list_first_entry_or_null(&gpu_vm->list, head); + + if (!entry) + break; + + list_move_tail(&entry->head, &still_in_list); + list_entry_get_unless_zero(entry); + spin_unlock(&gpu_vm->list_lock); + + process(entry); + + spin_lock(&gpu_vm->list_lock); + list_entry_put(entry); + } while (true); + + list_splice_tail(&still_in_list, &gpu_vm->list); + spin_unlock(&gpu_vm->list_lock); + +However, due to the additional locking and atomic operations, drivers that *can* +avoid accessing the gpu_vm's list outside of the dma_resv lock +might want to avoid this iteration scheme, if the driver anticipates a +large number of list items. For lists where the anticipated number of +list items is small, list iteration doesn't happen very often, or +there is a significant additional cost associated with each iteration, +the atomic operation overhead associated with this type of iteration +is, however, probably negligible. Note that if this scheme is +used, it is necessary to make sure this list iteration is protected by +an outer level lock or semaphore, since list items are temporarily +pulled off the list while iterating. + +TODO: Pointer to the gpuvm code implementation if this iteration and +how to choose either iteration scheme. + +userptr gpu_vmas +================ + +A userptr gpu_vma is a gpu_vma that, instead of mapping a buffer object to a +GPU virtual address range, directly maps a CPU mm range of anonymous- +or file page-cache pages. +A very simple approach would be to just pin the pages using +pin_user_pages() at bind time and unpin them at unbind time, but this +creates a Denial-Of-Service vector since a single user-space process +would be able to pin down all of system memory, which is not +desirable. (For special use-cases and with proper accounting pinning might +still be a desirable feature, though). What we need to do in the +general case is to obtain a reference to the desired pages, make sure +we are notified +using a MMU notifier just before the CPU mm unmaps the pages, dirty +them if they are not mapped read-only to the GPU, and then drop the +reference. +When we are notified by the MMU notifier that CPU mm is about to drop the +pages, we need to stop GPU access to the pages, +and make sure that before the next time the GPU tries to access +whatever is now present in the CPU mm range, we unmap the old pages +from the GPU page tables and repeat the process of obtaining new page +references. Note that when the core mm decides to laundry pages, we get such +an unmap MMU notification and can mark the pages dirty again before the +next GPU access. We also get similar MMU notifications for NUMA accounting +which the GPU driver doesn't really need to care about, but so far +it has proven difficult to exclude certain notifications. + +Using a MMU notifier for device DMA (and other methods) is described in +`this document +<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_. + +Now the method of obtaining struct page references using +get_user_pages() unfortunately can't be used under a dma_resv lock +since that would violate the locking order of the dma_resv lock vs the +mmap_lock that is grabbed when resolving a CPU pagefault. This means +the gpu_vm's list of userptr gpu_vmas needs to be protected by an +outer lock. + +The MMU interval seqlock for a userptr gpu_vma is used in the following +way: + +.. code-block:: C + + // Exclusive locking mode here is strictly needed only if there are + // invalidated userptr vmas present, to avoid multiple userptr + // revalidations. + down_write(&gpu_vm->lock); + retry: + + // Note: mmu_interval_read_begin() blocks until there is no + // invalidation notifier running anymore. + seq = mmu_interval_read_begin(&gpu_vma->userptr_interval); + if (seq != gpu_vma->saved_seq) { + obtain_new_page_pointers(&gpu_vma); + dma_resv_lock(&gpu_vm->resv); + add_gpu_vma_top_revalidate_list(&gpu_vma, &gpu_vm); + dma_resv_unlock(&gpu_vm->resv); + gpu_vma->saved_seq = seq; + } + + // The usual revalidation goes here. + + // Final userptr sequence validation may not happen before the + // submission dma_fence is added to the gpu_vm's resv, from the POW + // of the MMU invalidation notifier. Hence the + // userptr_notifier_lock that will make them appear atomic. + + add_dependencies(&gpu_job, &gpu_vm->resv); + down_read(&gpu_vm->userptr_notifier_lock); + if (mmu_interval_read_retry(&gpu_vma->userptr_interval, gpu_vma->saved_seq)) { + up_read(&gpu_vm->userptr_notifier_lock); + goto retry; + } + + job_dma_fence = gpu_submit(&gpu_job)); + + add_dma_fence(job_dma_fence, &gpu_vm->resv); + + for_each_shared_obj(gpu_vm, &obj) + add_dma_fence(job_dma_fence, &obj->resv); + + dma_resv_unlock_all_resv_locks(); + up_read(&gpu_vm->userptr_notifier_lock); + up_write(&gpu_vm->lock); + +The code between ``mmu_interval_read_begin()`` and the +``mmu_interval_read_retry()`` marks the read side critical section of +what we call the ``userptr_seqlock``. In reality the gpu_vm's userptr +gpu_vma list is looped through, and the check is done for *all* of its +userptr gpu_vmas, although we only show a single one here. + +The userptr gpu_vma MMU invalidation notifier might be called from +reclaim context and, again to avoid locking order violations, we can't +take any dma_resv lock nor the gpu_vm->lock from within it. + +.. code-block:: C + + bool gpu_vma_userptr_invalidate(userptr_interval, cur_seq) + { + // Make sure the exec function either sees the new sequence + // and backs off or we wait for the dma-fence: + + down_write(&gpu_vm->userptr_notifier_lock); + mmu_interval_set_seq(userptr_interval, cur_seq); + up_write(&gpu_vm->userptr_notifier_lock); + + // At this point, the exec function can't succeed in + // submitting a new job, because cur_seq is an invalid + // sequence number and will always cause a retry. When all + // invalidation callbacks, the mmu notifier core will flip + // the sequence number to a valid one. However we need to + // stop gpu access to the old pages here. + + dma_resv_wait_timeout(&gpu_vm->resv, DMA_RESV_USAGE_BOOKKEEP, + false, MAX_SCHEDULE_TIMEOUT); + return true; + } + +When this invalidation notifier returns, the GPU can no longer be +accessing the old pages of the userptr gpu_vma and needs to redo the +page-binding before a new GPU submission can succeed. + +Efficient userptr gpu_vma exec_function iteration +_________________________________________________ + +If the gpu_vm's list of userptr gpu_vmas becomes large, it's +inefficient to iterate through the complete lists of userptrs on each +exec function to check whether each userptr gpu_vma's saved +sequence number is invalid or stale. A solution to this is to put all +*invalidated* userptr gpu_vmas on a separate gpu_vm list and +only those gpu_vmas on the list are actually checked on each exec +function. This list will then lend itself very-well to the spinlock +locking scheme that is +:ref:`described in the spinlock iteration section <Spinlock iteration>`, since +in the mmu notifier, where we add the invalidated gpu_vmas to the +list, it's not possible to take any outer locks like the +``gpu_vm->lock`` or the ``gpu_vm->resv`` lock. Note that the +``gpu_vm->lock`` still needs to be taken while iterating to ensure the list is +complete, as also mentioned in that section. + +If using an invalidated userptr list like this, the retry check in the +exec function trivially becomes a check for invalidated list empty. + +Locking at bind- and unbind time +================================ + +At bind time, assuming a GEM object backed gpu_vma, each +gpu_vma needs to be associated with a gpu_vm_bo and that +gpu_vm_bo in turn needs to be added to the GEM object's +gpu_vm_bo list, and possibly to the gpu_vm's external object +list. This is referred to as *linking* the gpu_vma, and typically +requires that the ``gpu_vm->resv`` and the GEM object's dma_resv are +held. When unlinking a gpu_vma the same locks are typically held, +and that ensures, as briefly discussed +:ref:`previously <gpu_vma lifetime>`, that when iterating over +``gpu_vmas`, either under the ``gpu_vm->resv`` or the GEM +object's dma_resv, that the gpu_vmas stay alive as long +as the lock under which we iterate are not is not released. For +userptr gpu_vmas it's similarly required that during unlink, the +outer ``gpu_vm->lock`` is held, since otherwise when iterating over +the invalidated userptr list as described in the previous section, +there is nothing keeping those userptr gpu_vmas alive. +
Add the first version of the VM_BIND locking document which is intended to be part of the xe driver upstreaming agreement. The document describes and discuss the locking used during exec- functions, evicton and for userptr gpu-vmas. Intention is to be using the same nomenclature as the drm-vm-bind-async.rst. v2: - s/gvm/gpu_vm/g (Rodrigo Vivi) - Clarify the userptr seqlock with a pointer to mm/mmu_notifier.c (Rodrigo Vivi) - Adjust commit message accordingly. - Add SPDX license header. v3: - Large update to align with the drm_gpuvm manager locking - Add "Efficient userptr gpu_vma exec function iteration" section - Add "Locking at bind- and unbind time" section. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> --- Documentation/gpu/drm-vm-bind-locking.rst | 494 ++++++++++++++++++++++ 1 file changed, 494 insertions(+) create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst