Message ID | 20220203090918.11520-1-rajneesh.bhardwaj@amd.com (mailing list archive) |
---|---|
Series | CHECKPOINT RESTORE WITH ROCm |
The series is

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>

On 2022-02-03 at 04:08, Rajneesh Bhardwaj wrote:
> V5: Proposed IOCTL APIs for CRIU with consolidated feedback
>
> CRIU is a user space tool which is very popular for container live
> migration in datacentres. It can checkpoint a running application, save
> its complete state, memory contents and all system resources to images
> on disk which can be migrated to another machine and restored later.
> More information on CRIU can be found at https://criu.org/Main_Page
>
> CRIU currently does not support Checkpoint / Restore with applications
> that have device files open, so it cannot perform checkpoint and restore
> on GPU devices, which are very complex and have their own VRAM managed
> privately. CRIU, however, can support external devices by using a plugin
> architecture. We feel that we are getting close to finalizing our IOCTL
> APIs, which were again changed since V3 for an improved modular design.
>
> Our changes to CRIU user space can be obtained from here:
> https://github.com/RadeonOpenCompute/criu/tree/amdgpu_rfc-211222
>
> We have tested the following scenarios:
>  - Checkpoint / Restore of a Pytorch (BERT) workload
>  - kfdtests with queues and events
>  - Gfx9 and Gfx10 based multi GPU test systems
>  - On baremetal and inside a docker container
>  - Restoring on a different system
>
> V1: Initial
> V2: Addressed review comments
> V3: Rebased on latest amd-staging-drm-next (5.15 based)
> V4: New API design and basic support for SVM, however there is an
> outstanding issue with SVM restore which is currently under debug and
> hopefully that won't impact the ioctl APIs as SVMs are treated as
> private data hidden from user space like queues and events with the new
> approach.
> V5: Fix the SVM related issues and finalize the APIs.
>
> David Yat Sin (9):
>   drm/amdkfd: CRIU Implement KFD unpause operation
>   drm/amdkfd: CRIU add queues support
>   drm/amdkfd: CRIU restore queue ids
>   drm/amdkfd: CRIU restore sdma id for queues
>   drm/amdkfd: CRIU restore queue doorbell id
>   drm/amdkfd: CRIU checkpoint and restore queue mqds
>   drm/amdkfd: CRIU checkpoint and restore queue control stack
>   drm/amdkfd: CRIU checkpoint and restore events
>   drm/amdkfd: CRIU implement gpu_id remapping
>
> Rajneesh Bhardwaj (15):
>   x86/configs: CRIU update debug rock defconfig
>   drm/amdkfd: CRIU Introduce Checkpoint-Restore APIs
>   drm/amdkfd: CRIU Implement KFD process_info ioctl
>   drm/amdkfd: CRIU Implement KFD checkpoint ioctl
>   drm/amdkfd: CRIU Implement KFD restore ioctl
>   drm/amdkfd: CRIU Implement KFD resume ioctl
>   drm/amdkfd: CRIU export BOs as prime dmabuf objects
>   drm/amdkfd: CRIU checkpoint and restore xnack mode
>   drm/amdkfd: CRIU allow external mm for svm ranges
>   drm/amdkfd: use user_gpu_id for svm ranges
>   drm/amdkfd: CRIU Discover svm ranges
>   drm/amdkfd: CRIU Save Shared Virtual Memory ranges
>   drm/amdkfd: CRIU prepare for svm resume
>   drm/amdkfd: CRIU resume shared virtual memory ranges
>   drm/amdkfd: Bump up KFD API version for CRIU
>
>  arch/x86/configs/rock-dbg_defconfig           |   53 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    7 +-
>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   64 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   20 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h       |    2 +
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 1471 ++++++++++++++---
>  drivers/gpu/drm/amd/amdkfd/kfd_dbgdev.c       |    2 +-
>  .../drm/amd/amdkfd/kfd_device_queue_manager.c |  185 ++-
>  .../drm/amd/amdkfd/kfd_device_queue_manager.h |   16 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_events.c       |  313 +++-
>  drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h  |   14 +
>  .../gpu/drm/amd/amdkfd/kfd_mqd_manager_cik.c  |   75 +
>  .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c  |   77 +
>  .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   |   92 ++
>  .../gpu/drm/amd/amdkfd/kfd_mqd_manager_vi.c   |   84 +
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  160 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c      |   72 +-
>  .../amd/amdkfd/kfd_process_queue_manager.c    |  372 ++++-
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c          |  331 +++-
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |   39 +
>  include/uapi/linux/kfd_ioctl.h                |   84 +-
>  21 files changed, 3193 insertions(+), 340 deletions(-)
>
[AMD Official Use Only]

Thank you Felix for the review and your guidance.

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@amd.com>
Sent: Thursday, February 3, 2022 10:22 PM
To: Bhardwaj, Rajneesh <Rajneesh.Bhardwaj@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Yat Sin, David <David.YatSin@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; dri-devel@lists.freedesktop.org
Subject: Re: [Patch v5 00/24] CHECKPOINT RESTORE WITH ROCm