[v5,00/24] CHECKPOINT RESTORE WITH ROCm

Message ID 20220203090918.11520-1-rajneesh.bhardwaj@amd.com (mailing list archive)

Rajneesh Bhardwaj Feb. 3, 2022, 9:08 a.m. UTC
V5: Proposed IOCTL APIs for CRIU with consolidated feedback

CRIU is a user space tool that is widely used for container live
migration in datacentres. It can checkpoint a running application, saving
its complete state, memory contents and all system resources to images
on disk, which can be migrated to another machine and restored later.
More information on CRIU can be found at https://criu.org/Main_Page

CRIU currently does not support Checkpoint / Restore with applications
that have device files open, so it cannot perform checkpoint and restore
on GPU devices, which are very complex and have their own VRAM managed
privately. CRIU can, however, support external devices by using a plugin
architecture. We feel that we are getting close to finalizing our IOCTL
APIs, which were changed again since V3 for an improved modular design.
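
To make the plugin idea above concrete, here is a minimal sketch of how a
checkpoint tool might hand an open device file to a plugin. It is a
self-contained model, not the real CRIU plugin API: the struct
ext_dev_plugin, find_plugin(), dump_device_file() and the stub handlers are
all hypothetical names, and /dev/kfd is used only as the KFD device node
that such a plugin would claim.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hook table; the real CRIU plugin interface differs. */
struct ext_dev_plugin {
    const char *dev_path;        /* device file this plugin claims */
    int (*dump)(int fd, int id); /* called at checkpoint time */
    int (*restore)(int id);      /* called at restore time */
};

/* Return the plugin claiming `path`, or NULL when none handles it. */
static const struct ext_dev_plugin *
find_plugin(const struct ext_dev_plugin *plugins, size_t n, const char *path)
{
    assert(path != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(plugins[i].dev_path, path) == 0)
            return &plugins[i];
    return NULL;
}

/* Checkpoint an open device file: delegate to a plugin, or fail with -1. */
static int dump_device_file(const struct ext_dev_plugin *plugins, size_t n,
                            const char *path, int fd, int id)
{
    const struct ext_dev_plugin *p = find_plugin(plugins, n, path);

    if (p == NULL || p->dump == NULL)
        return -1; /* no plugin: this file cannot be checkpointed */
    return p->dump(fd, id);
}

/* Stub handlers standing in for the work a GPU plugin would do. */
static int kfd_dump_stub(int fd, int id) { (void)fd; (void)id; return 0; }
static int kfd_restore_stub(int id) { (void)id; return 0; }

static const struct ext_dev_plugin demo_plugins[] = {
    { "/dev/kfd", kfd_dump_stub, kfd_restore_stub },
};
```

In this model a device file without a registered plugin makes the dump
fail, which mirrors the limitation described above; registering a handler
for the device node is what lifts it.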

Our changes to CRIU user space can be obtained from here:
https://github.com/RadeonOpenCompute/criu/tree/amdgpu_rfc-211222

We have tested the following scenarios:
 - Checkpoint / Restore of a PyTorch (BERT) workload
 - kfdtests with queues and events
 - Gfx9 and Gfx10 based multi GPU test systems
 - On bare metal and inside a Docker container
 - Restoring on a different system
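
Restoring on a different system implies that the GPU ids recorded at
checkpoint time may not match the ids on the restore host, which is what
the gpu_id remapping patch in this series addresses. A minimal sketch of
such a lookup follows. It is an illustration only: user_gpu_id is borrowed
from a patch title, while struct gpu_id_map, actual_gpu_id and
remap_gpu_id() are hypothetical and do not claim to match the driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One remapping entry. Field names are illustrative, not the driver's. */
struct gpu_id_map {
    uint32_t user_gpu_id;   /* id the application saw before checkpoint */
    uint32_t actual_gpu_id; /* id of the matching device after restore */
};

/*
 * Translate a checkpointed id into the restore host's id.
 * Returns 0 on success, or -1 (leaving *actual_gpu_id untouched)
 * when the id is not in the table.
 */
static int remap_gpu_id(const struct gpu_id_map *map, size_t n,
                        uint32_t user_gpu_id, uint32_t *actual_gpu_id)
{
    assert(actual_gpu_id != NULL);
    for (size_t i = 0; i < n; i++) {
        if (map[i].user_gpu_id == user_gpu_id) {
            *actual_gpu_id = map[i].actual_gpu_id;
            return 0;
        }
    }
    return -1;
}
```

Keeping the application-visible id stable and translating it internally
means the restored process can keep using the ids it already holds.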

V1: Initial
V2: Addressed review comments
V3: Rebased on latest amd-staging-drm-next (5.15 based)
V4: New API design and basic support for SVM. There is an outstanding
issue with SVM restore which is currently under debug; hopefully it
won't impact the ioctl APIs, as SVM ranges are treated as private data
hidden from user space, like queues and events, with the new approach.
V5: Fix the SVM related issues and finalize the APIs. 

David Yat Sin (9):
  drm/amdkfd: CRIU Implement KFD unpause operation
  drm/amdkfd: CRIU add queues support
  drm/amdkfd: CRIU restore queue ids
  drm/amdkfd: CRIU restore sdma id for queues
  drm/amdkfd: CRIU restore queue doorbell id
  drm/amdkfd: CRIU checkpoint and restore queue mqds
  drm/amdkfd: CRIU checkpoint and restore queue control stack
  drm/amdkfd: CRIU checkpoint and restore events
  drm/amdkfd: CRIU implement gpu_id remapping

Rajneesh Bhardwaj (15):
  x86/configs: CRIU update debug rock defconfig
  drm/amdkfd: CRIU Introduce Checkpoint-Restore APIs
  drm/amdkfd: CRIU Implement KFD process_info ioctl
  drm/amdkfd: CRIU Implement KFD checkpoint ioctl
  drm/amdkfd: CRIU Implement KFD restore ioctl
  drm/amdkfd: CRIU Implement KFD resume ioctl
  drm/amdkfd: CRIU export BOs as prime dmabuf objects
  drm/amdkfd: CRIU checkpoint and restore xnack mode
  drm/amdkfd: CRIU allow external mm for svm ranges
  drm/amdkfd: use user_gpu_id for svm ranges
  drm/amdkfd: CRIU Discover svm ranges
  drm/amdkfd: CRIU Save Shared Virtual Memory ranges
  drm/amdkfd: CRIU prepare for svm resume
  drm/amdkfd: CRIU resume shared virtual memory ranges
  drm/amdkfd: Bump up KFD API version for CRIU

 arch/x86/configs/rock-dbg_defconfig           |   53 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    7 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   64 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   20 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |    2 +
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 1471 ++++++++++++++---
 drivers/gpu/drm/amd/amdkfd/kfd_dbgdev.c       |    2 +-
 .../drm/amd/amdkfd/kfd_device_queue_manager.c |  185 ++-
 .../drm/amd/amdkfd/kfd_device_queue_manager.h |   16 +-
 drivers/gpu/drm/amd/amdkfd/kfd_events.c       |  313 +++-
 drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h  |   14 +
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_cik.c  |   75 +
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c  |   77 +
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   |   92 ++
 .../gpu/drm/amd/amdkfd/kfd_mqd_manager_vi.c   |   84 +
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  160 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c      |   72 +-
 .../amd/amdkfd/kfd_process_queue_manager.c    |  372 ++++-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c          |  331 +++-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |   39 +
 include/uapi/linux/kfd_ioctl.h                |   84 +-
 21 files changed, 3193 insertions(+), 340 deletions(-)

Comments

Felix Kuehling Feb. 4, 2022, 3:22 a.m. UTC | #1
The series is

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>


Rajneesh Bhardwaj Feb. 4, 2022, 3:23 a.m. UTC | #2
[AMD Official Use Only]

Thank you Felix for the review and your guidance.

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@amd.com> 
Sent: Thursday, February 3, 2022 10:22 PM
To: Bhardwaj, Rajneesh <Rajneesh.Bhardwaj@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Yat Sin, David <David.YatSin@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; dri-devel@lists.freedesktop.org
Subject: Re: [Patch v5 00/24] CHECKPOINT RESTORE WITH ROCm

The series is

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>

