Message ID | 20210520031523.12834-1-xinhui.pan@amd.com (mailing list archive) |
---|---|
State | New, archived |
Series | drm/amdgpu: Let userptr BO ttm have TTM_PAGE_FLAG_SG set |
I think this works for KFD userptr BOs. But this problem is probably not
specific to KFD. It's only most obvious with KFD because we rely so
heavily on userptrs.

I don't really understand why we're messing with TTM_PAGE_FLAG_SG in
amdgpu_ttm_tt_populate and amdgpu_ttm_tt_unpopulate. And why are userptr
BOs created as ttm_bo_type_device, not ttm_bo_type_sg? Christian, do you
know about the history of this code?

Either way, the patch is

Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>

Thanks for looking into this!

Regards,
  Felix

On 2021-05-19 at 11:15 p.m., xinhui pan wrote:
> We have hit memory corruption due to unexpected swapout/swapin.
>
> The swapout function creates a swap storage which is filled with zeros,
> and sets TTM_PAGE_FLAG_SWAPPED in ttm->page_flags. But because the
> userptr BO ttm has no backing pages at that time, no real data is
> swapped out to the swap storage.
>
> The swapin function is called during userptr BO populate because
> TTM_PAGE_FLAG_SWAPPED is set. Here is the problem: we swap in data from
> the swap storage to the ttm backing memory. That causes the memory to
> be overwritten.
>
>     CPU 1                                       CPU 2
>     kfd alloc BO A(userptr)                     alloc BO B(GTT)
>         ->init -> validate(create ttm)          -> init -> validate -> populate
>         init_user_pages                           -> swapout BO A
>             -> get_user_pages (fill up ttm->pages)
>              -> validate -> populate
>               -> swapin BO A // memory overwritten
>
> To fix this issue, we can set TTM_PAGE_FLAG_SG when we create the
> userptr BO ttm. Then the swapout function will not swap it.
>
> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 +---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c          | 4 ++++
>  2 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 928e8d57cd08..9a6ea966ddb2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1410,7 +1410,7 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
>  	} else if (flags & KFD_IOC_ALLOC_MEM_FLAGS_USERPTR) {
>  		domain = AMDGPU_GEM_DOMAIN_GTT;
>  		alloc_domain = AMDGPU_GEM_DOMAIN_CPU;
> -		alloc_flags = 0;
> +		alloc_flags = AMDGPU_AMDKFD_CREATE_USERPTR_BO;
>  		if (!offset || !*offset)
>  			return -EINVAL;
>  		user_addr = untagged_addr(*offset);
> @@ -1477,8 +1477,6 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
>  	}
>  	bo->kfd_bo = *mem;
>  	(*mem)->bo = bo;
> -	if (user_addr)
> -		bo->flags |= AMDGPU_AMDKFD_CREATE_USERPTR_BO;
>
>  	(*mem)->va = va;
>  	(*mem)->domain = domain;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index c7f5cc503601..5b3f45637fb5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -1119,6 +1119,10 @@ static struct ttm_tt *amdgpu_ttm_tt_create(struct ttm_buffer_object *bo,
>  		kfree(gtt);
>  		return NULL;
>  	}
> +
> +	if (abo->flags & AMDGPU_AMDKFD_CREATE_USERPTR_BO)
> +		gtt->ttm.page_flags |= TTM_PAGE_FLAG_SG;
> +
>  	return &gtt->ttm;
>  }
On 20.05.21 at 05:15, xinhui pan wrote:
> We have hit memory corruption due to unexpected swapout/swapin.
>
> The swapout function creates a swap storage which is filled with zeros,
> and sets TTM_PAGE_FLAG_SWAPPED in ttm->page_flags. But because the
> userptr BO ttm has no backing pages at that time, no real data is
> swapped out to the swap storage.
>
> The swapin function is called during userptr BO populate because
> TTM_PAGE_FLAG_SWAPPED is set. Here is the problem: we swap in data from
> the swap storage to the ttm backing memory. That causes the memory to
> be overwritten.
>
>     CPU 1                                       CPU 2
>     kfd alloc BO A(userptr)                     alloc BO B(GTT)
>         ->init -> validate(create ttm)          -> init -> validate -> populate
>         init_user_pages                           -> swapout BO A
>             -> get_user_pages (fill up ttm->pages)
>              -> validate -> populate
>               -> swapin BO A // memory overwritten
>
> To fix this issue, we can set TTM_PAGE_FLAG_SG when we create the
> userptr BO ttm. Then the swapout function will not swap it.

That's a possible solution, but I would rather have the underlying
problem in TTM fixed.

Christian.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 928e8d57cd08..9a6ea966ddb2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1410,7 +1410,7 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
 	} else if (flags & KFD_IOC_ALLOC_MEM_FLAGS_USERPTR) {
 		domain = AMDGPU_GEM_DOMAIN_GTT;
 		alloc_domain = AMDGPU_GEM_DOMAIN_CPU;
-		alloc_flags = 0;
+		alloc_flags = AMDGPU_AMDKFD_CREATE_USERPTR_BO;
 		if (!offset || !*offset)
 			return -EINVAL;
 		user_addr = untagged_addr(*offset);
@@ -1477,8 +1477,6 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
 	}
 	bo->kfd_bo = *mem;
 	(*mem)->bo = bo;
-	if (user_addr)
-		bo->flags |= AMDGPU_AMDKFD_CREATE_USERPTR_BO;

 	(*mem)->va = va;
 	(*mem)->domain = domain;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index c7f5cc503601..5b3f45637fb5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1119,6 +1119,10 @@ static struct ttm_tt *amdgpu_ttm_tt_create(struct ttm_buffer_object *bo,
 		kfree(gtt);
 		return NULL;
 	}
+
+	if (abo->flags & AMDGPU_AMDKFD_CREATE_USERPTR_BO)
+		gtt->ttm.page_flags |= TTM_PAGE_FLAG_SG;
+
 	return &gtt->ttm;
 }
We have hit memory corruption due to unexpected swapout/swapin.

The swapout function creates a swap storage which is filled with zeros,
and sets TTM_PAGE_FLAG_SWAPPED in ttm->page_flags. But because the
userptr BO ttm has no backing pages at that time, no real data is
swapped out to the swap storage.

The swapin function is called during userptr BO populate because
TTM_PAGE_FLAG_SWAPPED is set. Here is the problem: we swap in data from
the swap storage to the ttm backing memory. That causes the memory to
be overwritten.

    CPU 1                                       CPU 2
    kfd alloc BO A(userptr)                     alloc BO B(GTT)
        ->init -> validate(create ttm)          -> init -> validate -> populate
        init_user_pages                           -> swapout BO A
            -> get_user_pages (fill up ttm->pages)
             -> validate -> populate
              -> swapin BO A // memory overwritten

To fix this issue, we can set TTM_PAGE_FLAG_SG when we create the
userptr BO ttm. Then the swapout function will not swap it.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c          | 4 ++++
 2 files changed, 5 insertions(+), 3 deletions(-)