[00/34] Add HMM-based SVM memory manager to KFD v4

Message ID 20210406014629.25141-1-Felix.Kuehling@amd.com (mailing list archive)

Message

Felix Kuehling April 6, 2021, 1:45 a.m. UTC
Rebased on upstream. Dropped the already-upstream patch
"drm/amdgpu: reserve fence slot to update page table".

Added more fixes:
- Fixed kernel test robot warnings about static functions
- Fixed a kernel test robot warning about an unused variable
- Fixed a kernel test robot warning about select DEVICE_PRIVATE.
  Using "depends on" now. (see patch 34)
- Proportionally longer timeout for hmm_range_fault on large address ranges
  (see patch 6)
- Fixed PTE flags for XGMI mappings on Arcturus and Aldebaran (see patch 17)
- Fixed range-list cleanup on process termination to avoid BUGs from dangling
  interval notifiers (see patch 16)
- Fixed SVM range locking and interval notifier sequence update
  (see patch 8 and related tweaks in patches 10, 11, 21)
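To illustrate the "proportionally longer timeout" item above, here is a minimal
sketch of scaling a fault timeout with the size of the address range. It is not
the code from patch 6: the function name scaled_timeout_ms and the constants
BASE_TIMEOUT_MS and PAGES_PER_CHUNK are hypothetical and chosen only for
illustration.

```c
#include <stdint.h>

/* Hypothetical values, for illustration only (not taken from the series). */
#define BASE_TIMEOUT_MS  1000u        /* timeout granted per chunk of pages */
#define PAGES_PER_CHUNK  (1u << 17)   /* 512 MiB worth of 4 KiB pages */

/*
 * Return a timeout in milliseconds that grows with the number of pages
 * in the range, but never drops below the base timeout.
 */
static uint64_t scaled_timeout_ms(uint64_t npages)
{
	uint64_t chunks = npages / PAGES_PER_CHUNK;

	if (chunks < 1)
		chunks = 1;	/* small ranges still get the full base timeout */

	return chunks * BASE_TIMEOUT_MS;
}
```

With these assumed constants, a single-page range gets the base timeout, while
a range four chunks long gets four times as much.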

Added my Reviewed-by to all patches primarily authored by Philip and Alex.
I believe this patch series is nearly ready to go.

This series and the corresponding ROCm Thunk and KFDTest changes are also
available on GitHub and patchwork.

Link: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/tree/fxkamd/hmm-wip
Link: https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/tree/fxkamd/hmm-wip
Link: https://patchwork.freedesktop.org/series/85563/
CC: Jérôme Glisse <jglisse@redhat.com>
CC: Jason Gunthorpe <jgg@ziepe.ca>

Alex Sierra (9):
  drm/amdkfd: helper to convert gpu id and idx
  drm/amdkfd: add xnack enabled flag to kfd_process
  drm/amdkfd: add ioctl to configure and query xnack retries
  drm/amdgpu: enable 48-bit IH timestamp counter
  drm/amdkfd: SVM API call to restore page tables
  drm/amdkfd: add svm_bo reference for eviction fence
  drm/amdgpu: add param bit flag to create SVM BOs
  drm/amdgpu: svm bo enable_signal call condition
  drm/amdgpu: add svm_bo eviction to enable_signal cb

Felix Kuehling (13):
  drm/amdkfd: map svm range to GPUs
  drm/amdkfd: svm range eviction and restore
  drm/amdgpu: Enable retry faults unconditionally on Aldebaran
  drm/amdkfd: validate vram svm range from TTM
  drm/amdkfd: HMM migrate ram to vram
  drm/amdkfd: HMM migrate vram to ram
  drm/amdkfd: invalidate tables on page retry fault
  drm/amdkfd: page table restore through svm API
  drm/amdkfd: add svm_bo eviction mechanism support
  drm/amdkfd: refine migration policy with xnack on
  drm/amdkfd: add svm range validate timestamp
  drm/amdkfd: multiple gpu migrate vram to vram
  drm/amdkfd: Add CONFIG_HSA_AMD_SVM

Philip Yang (12):
  drm/amdkfd: add svm ioctl API
  drm/amdkfd: register svm range
  drm/amdkfd: add svm ioctl GET_ATTR op
  drm/amdgpu: add common HMM get pages function
  drm/amdkfd: support larger svm range allocation
  drm/amdkfd: validate svm range system memory
  drm/amdkfd: deregister svm range
  drm/amdgpu: export vm update mapping interface
  drm/amdkfd: register HMM device private zone
  drm/amdkfd: support xgmi same hive mapping
  drm/amdkfd: copy memory through gart table
  drm/amdkfd: Add SVM API support capability bits

 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    4 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c  |   16 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   13 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |    3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        |   86 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |    7 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |    4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   90 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   38 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |   11 +
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c      |    8 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c       |    6 +-
 drivers/gpu/drm/amd/amdgpu/vega10_ih.c        |    1 +
 drivers/gpu/drm/amd/amdkfd/Kconfig            |   13 +
 drivers/gpu/drm/amd/amdkfd/Makefile           |    5 +
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |   64 +
 drivers/gpu/drm/amd/amdkfd/kfd_device.c       |    4 +
 .../amd/amdkfd/kfd_device_queue_manager_v9.c  |   13 +-
 drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c  |    4 +
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c      |  922 ++++++
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h      |   64 +
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   36 +
 drivers/gpu/drm/amd/amdkfd/kfd_process.c      |   82 +
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c          | 2906 +++++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |  205 ++
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c     |    6 +
 drivers/gpu/drm/amd/amdkfd/kfd_topology.h     |   10 +-
 include/uapi/linux/kfd_ioctl.h                |  171 +-
 28 files changed, 4686 insertions(+), 106 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h

Comments

Jason Gunthorpe April 8, 2021, 3:02 p.m. UTC | #1
This series is huge, but it looks like it still doesn't fix the
FIXMEs around the AMD driver's use of HMM.

Can you fix them before piling on more stuff?

Jason
Felix Kuehling April 8, 2021, 3:10 p.m. UTC | #2
On 2021-04-08 at 11:02 a.m., Jason Gunthorpe wrote:
> This series is huge, but it looks like it still doesn't fix the
> FIXMEs around the AMD driver's use of HMM.
>
> Can you fix them before piling on more stuff?

It does avoid making the same mistakes in the new code. I'll take
another look at the pre-existing FIXMEs with the experience gained
working on this series.

Regards,
  Felix


>
> Jason