
drm/ttm: Don't inherit GEM object VMAs in child process

Message ID 20211208205344.3034-1-rajneesh.bhardwaj@amd.com (mailing list archive)
State New, archived
Series drm/ttm: Don't inherit GEM object VMAs in child process

Commit Message

Rajneesh Bhardwaj Dec. 8, 2021, 8:53 p.m. UTC
When an application that has open file access to a node forks, its shared
mappings are also reflected in the address space of the child process,
even though the child cannot access them with the object permissions
applied. Given the existing permission checks on the GEM objects, it
might be reasonable to also create the VMAs with the VM_DONTCOPY flag, so
a user space application doesn't need to explicitly call the
madvise(addr, len, MADV_DONTFORK) system call to prevent the pages in the
mapped range from appearing in the address space of the child process. It
also prevents the memory leaks caused by the additional reference counts
on the mapped BOs in the child process, which prevented freeing the
memory in the parent; we had previously worked around this in user space,
inside the thunk library.

Additionally, we hit this issue when using CRIU to checkpoint/restore an
application that had such inherited mappings in the child, which confused
CRIU when it called mmap() on restore. Having this flag set for the
render node VMAs helps. VMAs mapped via KFD already take care of this, so
this is needed only for the render nodes.

Cc: Felix Kuehling <Felix.Kuehling@amd.com>

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/drm_gem.c       | 3 ++-
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

Comments

Christian König Dec. 9, 2021, 7:54 a.m. UTC | #1
Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:
> When an application having open file access to a node forks, its shared
> mappings also get reflected in the address space of child process even
> though it cannot access them with the object permissions applied. With the
> existing permission checks on the gem objects, it might be reasonable to
> also create the VMAs with VM_DONTCOPY flag so a user space application
> doesn't need to explicitly call the madvise(addr, len, MADV_DONTFORK)
> system call to prevent the pages in the mapped range to appear in the
> address space of the child process. It also prevents the memory leaks
> due to additional reference counts on the mapped BOs in the child
> process that prevented freeing the memory in the parent for which we had
> worked around earlier in the user space inside the thunk library.
>
> Additionally, we faced this issue when using CRIU to checkpoint restore
> an application that had such inherited mappings in the child which
> confuse CRIU when it mmaps on restore. Having this flag set for the
> render node VMAs helps. VMAs mapped via KFD already take care of this so
> this is needed only for the render nodes.

Unfortunately that is most likely a NAK. We already tried something similar.

While it is illegal by the OpenGL specification and doesn't work for 
most userspace stacks, we do have some implementations which call fork() 
with a GL context open and expect it to work.

Regards,
Christian.

Rajneesh Bhardwaj Dec. 9, 2021, 3:23 p.m. UTC | #2
Thanks Christian. Would it make it less intrusive if I just use the flag
for ttm bo mmap and remove the drm_gem_mmap_obj change from this patch?
For our use case, just the ttm_bo_mmap_obj change should suffice, and we
don't want to put any more workarounds in the user space (thunk, in our
case).

The child cannot access the BOs mapped by the parent anyway with the
access restrictions applied, so I wonder why even inherit the VMA?

On 12/9/2021 2:54 AM, Christian König wrote:
>
> Unfortunately that is most likely a NAK. We already tried something 
> similar.
>
> While it is illegal by the OpenGL specification and doesn't work for 
> most userspace stacks, we do have some implementations which call 
> fork() with a GL context open and expect it to work.
>
> Regards,
> Christian.
Christian König Dec. 9, 2021, 3:27 p.m. UTC | #3
Hi Rajneesh,

yes, separating this from the drm_gem_mmap_obj() change is certainly a 
good idea.

> The child cannot access the BOs mapped by the parent anyway with 
> access restrictions applied

Exactly that is not correct. That behavior is actively used by some
userspace stacks as far as I know.

Regards,
Christian.

Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:
> Thanks Christian. Would it make it less intrusive if I just use the 
> flag for ttm bo mmap and remove the drm_gem_mmap_obj change from this 
> patch? For our use case, just the ttm_bo_mmap_obj change should 
> suffice and we don't want to put any more work arounds in the user 
> space (thunk, in our case).
>
> The child cannot access the BOs mapped by the parent anyway with 
> access restrictions applied so I wonder why even inherit the vma?
Rajneesh Bhardwaj Dec. 9, 2021, 3:29 p.m. UTC | #4
Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank you!

On 12/9/2021 10:27 AM, Christian König wrote:
> Hi Rajneesh,
>
> yes, separating this from the drm_gem_mmap_obj() change is certainly a 
> good idea.
>
>> The child cannot access the BOs mapped by the parent anyway with 
>> access restrictions applied
>
> exactly that is not correct. That behavior is actively used by some 
> userspace stacks as far as I know.
>
> Regards,
> Christian.
Christian König Dec. 9, 2021, 3:30 p.m. UTC | #5
That still won't work.

But I think we could do this change for the amdgpu mmap callback only.

Regards,
Christian.

Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh:
> Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank 
> you!
Felix Kuehling Dec. 9, 2021, 6:28 p.m. UTC | #6
Am 2021-12-09 um 10:30 a.m. schrieb Christian König:
> That still won't work.
>
> But I think we could do this change for the amdgpu mmap callback only.

If graphics user mode has problems with it, we could even make this
specific to KFD BOs in the amdgpu_gem_object_mmap callback.

Regards,
  Felix


Christian König Dec. 10, 2021, 6:58 a.m. UTC | #7
Am 09.12.21 um 19:28 schrieb Felix Kuehling:
> Am 2021-12-09 um 10:30 a.m. schrieb Christian König:
>> That still won't work.
>>
>> But I think we could do this change for the amdgpu mmap callback only.
> If graphics user mode has problems with it, we could even make this
> specific to KFD BOs in the amdgpu_gem_object_mmap callback.

I think it's fine for the whole amdgpu stack; my concern is more about
radeon, nouveau and the ARM stacks, which are using this as well.

That blew up so nicely the last time we tried to change it, and I know of
at least one case where radeon was/is used with BOs in a child process.

Regards,
Christian.

Daniel Vetter Dec. 20, 2021, 9:29 a.m. UTC | #8
On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote:
> Am 09.12.21 um 19:28 schrieb Felix Kuehling:
> > Am 2021-12-09 um 10:30 a.m. schrieb Christian König:
> > > That still won't work.
> > > 
> > > But I think we could do this change for the amdgpu mmap callback only.
> > If graphics user mode has problems with it, we could even make this
> > specific to KFD BOs in the amdgpu_gem_object_mmap callback.
> 
> I think it's fine for the whole amdgpu stack, my concern is more about
> radeon, nouveau and the ARM stacks which are using this as well.
> 
> That blew up so nicely the last time we tried to change it and I know of at
> least one case where radeon was/is used with BOs in a child process.

I'm way late and buried again, but I think it'd be good to be consistent
here across drivers. Or at least across drm drivers. And we've had the vma
open/close refcounting to make fork work since forever.

I think if we do this we should really only do this for mmap() where this
applies, but reading through the thread here I'm honestly confused why
this is a problem. If CRIU can't handle forked mmaps it needs to be
taught that, not hacked around. Or at least I'm not understanding why
this shouldn't work ...
-Daniel

> 
> Regards,
> Christian.
> 
> > 
> > Regards,
> >    Felix
> > 
> > 
> > > Regards,
> > > Christian.
> > > 
> > > Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh:
> > > > Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank
> > > > you!
> > > > 
> > > > On 12/9/2021 10:27 AM, Christian König wrote:
> > > > > Hi Rajneesh,
> > > > > 
> > > > > yes, separating this from the drm_gem_mmap_obj() change is certainly
> > > > > a good idea.
> > > > > 
> > > > > > The child cannot access the BOs mapped by the parent anyway with
> > > > > > access restrictions applied
> > > > > exactly that is not correct. That behavior is actively used by some
> > > > > userspace stacks as far as I know.
> > > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:
> > > > > > Thanks Christian. Would it make it less intrusive if I just use the
> > > > > > flag for ttm bo mmap and remove the drm_gem_mmap_obj change from
> > > > > > this patch? For our use case, just the ttm_bo_mmap_obj change
> > > > > > should suffice and we don't want to put any more work arounds in
> > > > > > the user space (thunk, in our case).
> > > > > > 
> > > > > > The child cannot access the BOs mapped by the parent anyway with
> > > > > > access restrictions applied so I wonder why even inherit the vma?
> > > > > > 
> > > > > > On 12/9/2021 2:54 AM, Christian König wrote:
> > > > > > > Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:
> > > > > > > > When an application having open file access to a node forks, its
> > > > > > > > shared
> > > > > > > > mappings also get reflected in the address space of child process
> > > > > > > > even
> > > > > > > > though it cannot access them with the object permissions applied.
> > > > > > > > With the
> > > > > > > > existing permission checks on the gem objects, it might be
> > > > > > > > reasonable to
> > > > > > > > also create the VMAs with VM_DONTCOPY flag so a user space
> > > > > > > > application
> > > > > > > > doesn't need to explicitly call the madvise(addr, len,
> > > > > > > > MADV_DONTFORK)
> > > > > > > > system call to prevent the pages in the mapped range to appear in
> > > > > > > > the
> > > > > > > > address space of the child process. It also prevents the memory
> > > > > > > > leaks
> > > > > > > > due to additional reference counts on the mapped BOs in the child
> > > > > > > > process that prevented freeing the memory in the parent for which
> > > > > > > > we had
> > > > > > > > worked around earlier in the user space inside the thunk library.
> > > > > > > > 
> > > > > > > > Additionally, we faced this issue when using CRIU to checkpoint
> > > > > > > > restore
> > > > > > > > an application that had such inherited mappings in the child which
> > > > > > > > confuse CRIU when it mmaps on restore. Having this flag set for the
> > > > > > > > render node VMAs helps. VMAs mapped via KFD already take care of
> > > > > > > > this so
> > > > > > > > this is needed only for the render nodes.
> > > > > > > Unfortunately that is most likely a NAK. We already tried
> > > > > > > something similar.
> > > > > > > 
> > > > > > > While it is illegal by the OpenGL specification and doesn't work
> > > > > > > for most userspace stacks, we do have some implementations which
> > > > > > > call fork() with a GL context open and expect it to work.
> > > > > > > 
> > > > > > > Regards,
> > > > > > > Christian.
> > > > > > > 
Rajneesh Bhardwaj Dec. 20, 2021, 6:12 p.m. UTC | #9
On 12/20/2021 4:29 AM, Daniel Vetter wrote:
> On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote:
>> Am 09.12.21 um 19:28 schrieb Felix Kuehling:
>>> Am 2021-12-09 um 10:30 a.m. schrieb Christian König:
>>>> That still won't work.
>>>>
>>>> But I think we could do this change for the amdgpu mmap callback only.
>>> If graphics user mode has problems with it, we could even make this
>>> specific to KFD BOs in the amdgpu_gem_object_mmap callback.
>> I think it's fine for the whole amdgpu stack, my concern is more about
>> radeon, nouveau and the ARM stacks which are using this as well.
>>
>> That blew up so nicely the last time we tried to change it and I know of at
>> least one case where radeon was/is used with BOs in a child process.
> I'm way late and buried again, but I think it'd be good to be consistent
> here across drivers. Or at least across drm drivers. And we've had the vma
> open/close refcounting to make fork work since forever.
>
> I think if we do this we should really only do this for mmap() where this
> applies, but reading through the thread here I'm honestly confused why
> this is a problem. If CRIU can't handle forked mmaps it needs to be
> taught that, not hacked around. Or at least I'm not understanding why
> this shouldn't work ...
> -Daniel
>

Hi Daniel

In the v2 
https://lore.kernel.org/all/a1a865f5-ad2c-29c8-cbe4-2635d53eceb6@amd.com/T/ 
I pretty much limited the scope of the change to KFD BOs on mmap. 
Regarding CRIU, I think its not a CRIU problem as CRIU on restore, only 
tries to recreate all the child processes and then mmaps all the VMAs it 
sees (as per checkpoint snapshot) in the new process address space after 
the VMA placements are finalized in the position independent code phase. 
Since the inherited VMAs don't have access rights the criu mmap fails.

Regards,

Rajneesh

>> Regards,
>> Christian.
>>
>>> Regards,
>>>     Felix
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh:
>>>>> Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank
>>>>> you!
>>>>>
>>>>> On 12/9/2021 10:27 AM, Christian König wrote:
>>>>>> Hi Rajneesh,
>>>>>>
>>>>>> yes, separating this from the drm_gem_mmap_obj() change is certainly
>>>>>> a good idea.
>>>>>>
>>>>>>> The child cannot access the BOs mapped by the parent anyway with
>>>>>>> access restrictions applied
>>>>>> exactly that is not correct. That behavior is actively used by some
>>>>>> userspace stacks as far as I know.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:
>>>>>>> Thanks Christian. Would it make it less intrusive if I just use the
>>>>>>> flag for ttm bo mmap and remove the drm_gem_mmap_obj change from
>>>>>>> this patch? For our use case, just the ttm_bo_mmap_obj change
>>>>>>> should suffice and we don't want to put any more work arounds in
>>>>>>> the user space (thunk, in our case).
>>>>>>>
>>>>>>> The child cannot access the BOs mapped by the parent anyway with
>>>>>>> access restrictions applied so I wonder why even inherit the vma?
>>>>>>>
>>>>>>> On 12/9/2021 2:54 AM, Christian König wrote:
>>>>>>>> Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:
>>>>>>>>> When an application having open file access to a node forks, its
>>>>>>>>> shared
>>>>>>>>> mappings also get reflected in the address space of child process
>>>>>>>>> even
>>>>>>>>> though it cannot access them with the object permissions applied.
>>>>>>>>> With the
>>>>>>>>> existing permission checks on the gem objects, it might be
>>>>>>>>> reasonable to
>>>>>>>>> also create the VMAs with VM_DONTCOPY flag so a user space
>>>>>>>>> application
>>>>>>>>> doesn't need to explicitly call the madvise(addr, len,
>>>>>>>>> MADV_DONTFORK)
>>>>>>>>> system call to prevent the pages in the mapped range to appear in
>>>>>>>>> the
>>>>>>>>> address space of the child process. It also prevents the memory
>>>>>>>>> leaks
>>>>>>>>> due to additional reference counts on the mapped BOs in the child
>>>>>>>>> process that prevented freeing the memory in the parent for which
>>>>>>>>> we had
>>>>>>>>> worked around earlier in the user space inside the thunk library.
>>>>>>>>>
>>>>>>>>> Additionally, we faced this issue when using CRIU to checkpoint
>>>>>>>>> restore
>>>>>>>>> an application that had such inherited mappings in the child which
>>>>>>>>> confuse CRIU when it mmaps on restore. Having this flag set for the
>>>>>>>>> render node VMAs helps. VMAs mapped via KFD already take care of
>>>>>>>>> this so
>>>>>>>>> this is needed only for the render nodes.
>>>>>>>> Unfortunately that is most likely a NAK. We already tried
>>>>>>>> something similar.
>>>>>>>>
>>>>>>>> While it is illegal by the OpenGL specification and doesn't work
>>>>>>>> for most userspace stacks, we do have some implementations which
>>>>>>>> call fork() with a GL context open and expect it to work.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
Daniel Vetter Dec. 22, 2021, 8:53 p.m. UTC | #10
On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:
> 
> On 12/20/2021 4:29 AM, Daniel Vetter wrote:
> > On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote:
> > > Am 09.12.21 um 19:28 schrieb Felix Kuehling:
> > > > Am 2021-12-09 um 10:30 a.m. schrieb Christian König:
> > > > > That still won't work.
> > > > > 
> > > > > But I think we could do this change for the amdgpu mmap callback only.
> > > > If graphics user mode has problems with it, we could even make this
> > > > specific to KFD BOs in the amdgpu_gem_object_mmap callback.
> > > I think it's fine for the whole amdgpu stack, my concern is more about
> > > radeon, nouveau and the ARM stacks which are using this as well.
> > > 
> > > That blew up so nicely the last time we tried to change it and I know of at
> > > least one case where radeon was/is used with BOs in a child process.
> > I'm way late and buried again, but I think it'd be good to be consistent
> > here across drivers. Or at least across drm drivers. And we've had the vma
> > open/close refcounting to make fork work since forever.
> > 
> > I think if we do this we should really only do this for mmap() where this
> > applies, but reading through the thread here I'm honestly confused why
> > this is a problem. If CRIU can't handle forked mmaps it needs to be
> > taught that, not hacked around. Or at least I'm not understanding why
> > this shouldn't work ...
> > -Daniel
> > 
> 
> Hi Daniel
> 
> In the v2
> https://lore.kernel.org/all/a1a865f5-ad2c-29c8-cbe4-2635d53eceb6@amd.com/T/
> I pretty much limited the scope of the change to KFD BOs on mmap. Regarding
> CRIU, I think it's not a CRIU problem as CRIU on restore, only tries to
> recreate all the child processes and then mmaps all the VMAs it sees (as per
> checkpoint snapshot) in the new process address space after the VMA
> placements are finalized in the position independent code phase. Since the
> inherited VMAs don't have access rights the criu mmap fails.

Still sounds funky. I think minimally we should have an ack from CRIU
developers that this is officially the right way to solve this problem. I
really don't want to have random one-off hacks that don't work across the
board, for a problem where we (drm subsystem) really shouldn't be the only
one with this problem. Where "this problem" means that the mmap space is
per file description, and not per underlying inode or real device or
whatever. That part sounds like a CRIU problem, and I expect CRIU folks
want a consistent solution across the board for this. Hence please grab an
ack from them.

Cheers, Daniel

> 
> Regards,
> 
> Rajneesh
> 
> > > Regards,
> > > Christian.
> > > 
> > > > Regards,
> > > >     Felix
> > > > 
> > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh:
> > > > > > Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank
> > > > > > you!
> > > > > > 
> > > > > > On 12/9/2021 10:27 AM, Christian König wrote:
> > > > > > > Hi Rajneesh,
> > > > > > > 
> > > > > > > yes, separating this from the drm_gem_mmap_obj() change is certainly
> > > > > > > a good idea.
> > > > > > > 
> > > > > > > > The child cannot access the BOs mapped by the parent anyway with
> > > > > > > > access restrictions applied
> > > > > > > exactly that is not correct. That behavior is actively used by some
> > > > > > > userspace stacks as far as I know.
> > > > > > > 
> > > > > > > Regards,
> > > > > > > Christian.
> > > > > > > 
> > > > > > > Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:
> > > > > > > > Thanks Christian. Would it make it less intrusive if I just use the
> > > > > > > > flag for ttm bo mmap and remove the drm_gem_mmap_obj change from
> > > > > > > > this patch? For our use case, just the ttm_bo_mmap_obj change
> > > > > > > > should suffice and we don't want to put any more work arounds in
> > > > > > > > the user space (thunk, in our case).
> > > > > > > > 
> > > > > > > > The child cannot access the BOs mapped by the parent anyway with
> > > > > > > > access restrictions applied so I wonder why even inherit the vma?
> > > > > > > > 
> > > > > > > > On 12/9/2021 2:54 AM, Christian König wrote:
> > > > > > > > > Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:
> > > > > > > > > > When an application having open file access to a node forks, its
> > > > > > > > > > shared
> > > > > > > > > > mappings also get reflected in the address space of child process
> > > > > > > > > > even
> > > > > > > > > > though it cannot access them with the object permissions applied.
> > > > > > > > > > With the
> > > > > > > > > > existing permission checks on the gem objects, it might be
> > > > > > > > > > reasonable to
> > > > > > > > > > also create the VMAs with VM_DONTCOPY flag so a user space
> > > > > > > > > > application
> > > > > > > > > > doesn't need to explicitly call the madvise(addr, len,
> > > > > > > > > > MADV_DONTFORK)
> > > > > > > > > > system call to prevent the pages in the mapped range to appear in
> > > > > > > > > > the
> > > > > > > > > > address space of the child process. It also prevents the memory
> > > > > > > > > > leaks
> > > > > > > > > > due to additional reference counts on the mapped BOs in the child
> > > > > > > > > > process that prevented freeing the memory in the parent for which
> > > > > > > > > > we had
> > > > > > > > > > worked around earlier in the user space inside the thunk library.
> > > > > > > > > > 
> > > > > > > > > > Additionally, we faced this issue when using CRIU to checkpoint
> > > > > > > > > > restore
> > > > > > > > > > an application that had such inherited mappings in the child which
> > > > > > > > > > confuse CRIU when it mmaps on restore. Having this flag set for the
> > > > > > > > > > render node VMAs helps. VMAs mapped via KFD already take care of
> > > > > > > > > > this so
> > > > > > > > > > this is needed only for the render nodes.
> > > > > > > > > Unfortunately that is most likely a NAK. We already tried
> > > > > > > > > something similar.
> > > > > > > > > 
> > > > > > > > > While it is illegal by the OpenGL specification and doesn't work
> > > > > > > > > for most userspace stacks, we do have some implementations which
> > > > > > > > > call fork() with a GL context open and expect it to work.
> > > > > > > > > 
> > > > > > > > > Regards,
> > > > > > > > > Christian.
> > > > > > > > > 
Rajneesh Bhardwaj Dec. 23, 2021, 1:49 a.m. UTC | #11
Adding Adrian Rebel who is the CRIU maintainer and CRIU list

On 12/22/2021 3:53 PM, Daniel Vetter wrote:
> On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:
>> On 12/20/2021 4:29 AM, Daniel Vetter wrote:
>>> On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote:
>>>> Am 09.12.21 um 19:28 schrieb Felix Kuehling:
>>>>> Am 2021-12-09 um 10:30 a.m. schrieb Christian König:
>>>>>> That still won't work.
>>>>>>
>>>>>> But I think we could do this change for the amdgpu mmap callback only.
>>>>> If graphics user mode has problems with it, we could even make this
>>>>> specific to KFD BOs in the amdgpu_gem_object_mmap callback.
>>>> I think it's fine for the whole amdgpu stack, my concern is more about
>>>> radeon, nouveau and the ARM stacks which are using this as well.
>>>>
>>>> That blew up so nicely the last time we tried to change it and I know of at
>>>> least one case where radeon was/is used with BOs in a child process.
>>> I'm way late and buried again, but I think it'd be good to be consistent


I had committed this change into our amd-staging-drm-next branch last 
week after I got the ACK and RB from Felix and Christian.
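The v2 referenced here restricts the flag to KFD BOs in amdgpu's own mmap callback rather than the shared GEM/TTM paths. A rough sketch of that shape follows (illustrative only; the `bo->kfd_bo` check and the surrounding helper calls are assumptions drawn from this thread, not the committed diff):

```c
/* Sketch: set VM_DONTCOPY only for KFD-owned BOs, in the amdgpu mmap
 * callback, so graphics userspace that forks is left untouched. */
static int amdgpu_gem_object_mmap(struct drm_gem_object *obj,
				  struct vm_area_struct *vma)
{
	struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);

	if (amdgpu_ttm_tt_get_usermm(bo->tbo.ttm))
		return -EPERM;

	/* KFD BOs only: render-node mappings made on behalf of KFD are
	 * not copied into forked children. */
	if (bo->kfd_bo)
		vma->vm_flags |= VM_DONTCOPY;

	return drm_gem_ttm_mmap(obj, vma);
}
```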


>>> here across drivers. Or at least across drm drivers. And we've had the vma
>>> open/close refcounting to make fork work since forever.
>>>
>>> I think if we do this we should really only do this for mmap() where this
>>> applies, but reading through the thread here I'm honestly confused why
>>> this is a problem. If CRIU can't handle forked mmaps it needs to be
>>> thought that, not hacked around. Or at least I'm not understanding why
>>> this shouldn't work ...
>>> -Daniel
>>>
>> Hi Daniel
>>
>> In the v2
>> https://lore.kernel.org/all/a1a865f5-ad2c-29c8-cbe4-2635d53eceb6@amd.com/T/
>> I pretty much limited the scope of the change to KFD BOs on mmap. Regarding
>> CRIU, I think its not a CRIU problem as CRIU on restore, only tries to
>> recreate all the child processes and then mmaps all the VMAs it sees (as per
>> checkpoint snapshot) in the new process address space after the VMA
>> placements are finalized in the position independent code phase. Since the
>> inherited VMAs don't have access rights the criu mmap fails.
> Still sounds funky. I think minimally we should have an ack from CRIU
> developers that this is officially the right way to solve this problem. I
> really don't want to have random one-off hacks that don't work across the
> board, for a problem where we (drm subsystem) really shouldn't be the only
> one with this problem. Where "this problem" means that the mmap space is
> per file description, and not per underlying inode or real device or
> whatever. That part sounds like a CRIU problem, and I expect CRIU folks
> want a consistent solution across the board for this. Hence please grab an
> ack from them.
>
> Cheers, Daniel


Maybe Adrian can share his views on this.

Hi Adrian - For context: on CRIU restore we see mmap failures (in the 
PIE restore phase) due to permission issues on the (render node) VMAs 
that were inherited because the application that was checkpointed had 
forked. The VMAs ideally should not be in the child process, but the 
smaps file shows these VMAs in the child address space. We didn't want 
to use madvise to avoid this copy, and rather made the change in kernel 
mode to limit the impact on our user space thunk library. Based on my 
understanding, during the PIE restore phase, after the VMA placements are 
finalized, CRIU does a sys_mmap on all the VMAs it sees in the VmaEntry 
list. I think it's not an issue as per CRIU's design, but do you think we 
could handle this corner case better inside CRIU?


>
>> Regards,
>>
>> Rajneesh
>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Regards,
>>>>>      Felix
>>>>>
>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh:
>>>>>>> Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank
>>>>>>> you!
>>>>>>>
>>>>>>> On 12/9/2021 10:27 AM, Christian König wrote:
>>>>>>>> Hi Rajneesh,
>>>>>>>>
>>>>>>>> yes, separating this from the drm_gem_mmap_obj() change is certainly
>>>>>>>> a good idea.
>>>>>>>>
>>>>>>>>> The child cannot access the BOs mapped by the parent anyway with
>>>>>>>>> access restrictions applied
>>>>>>>> exactly that is not correct. That behavior is actively used by some
>>>>>>>> userspace stacks as far as I know.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:
>>>>>>>>> Thanks Christian. Would it make it less intrusive if I just use the
>>>>>>>>> flag for ttm bo mmap and remove the drm_gem_mmap_obj change from
>>>>>>>>> this patch? For our use case, just the ttm_bo_mmap_obj change
>>>>>>>>> should suffice and we don't want to put any more work arounds in
>>>>>>>>> the user space (thunk, in our case).
>>>>>>>>>
>>>>>>>>> The child cannot access the BOs mapped by the parent anyway with
>>>>>>>>> access restrictions applied so I wonder why even inherit the vma?
>>>>>>>>>
>>>>>>>>> On 12/9/2021 2:54 AM, Christian König wrote:
>>>>>>>>>> Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:
>>>>>>>>>>> When an application having open file access to a node forks, its
>>>>>>>>>>> shared
>>>>>>>>>>> mappings also get reflected in the address space of child process
>>>>>>>>>>> even
>>>>>>>>>>> though it cannot access them with the object permissions applied.
>>>>>>>>>>> With the
>>>>>>>>>>> existing permission checks on the gem objects, it might be
>>>>>>>>>>> reasonable to
>>>>>>>>>>> also create the VMAs with VM_DONTCOPY flag so a user space
>>>>>>>>>>> application
>>>>>>>>>>> doesn't need to explicitly call the madvise(addr, len,
>>>>>>>>>>> MADV_DONTFORK)
>>>>>>>>>>> system call to prevent the pages in the mapped range to appear in
>>>>>>>>>>> the
>>>>>>>>>>> address space of the child process. It also prevents the memory
>>>>>>>>>>> leaks
>>>>>>>>>>> due to additional reference counts on the mapped BOs in the child
>>>>>>>>>>> process that prevented freeing the memory in the parent for which
>>>>>>>>>>> we had
>>>>>>>>>>> worked around earlier in the user space inside the thunk library.
>>>>>>>>>>>
>>>>>>>>>>> Additionally, we faced this issue when using CRIU to checkpoint
>>>>>>>>>>> restore
>>>>>>>>>>> an application that had such inherited mappings in the child which
>>>>>>>>>>> confuse CRIU when it mmaps on restore. Having this flag set for the
>>>>>>>>>>> render node VMAs helps. VMAs mapped via KFD already take care of
>>>>>>>>>>> this so
>>>>>>>>>>> this is needed only for the render nodes.
>>>>>>>>>> Unfortunately that is most likely a NAK. We already tried
>>>>>>>>>> something similar.
>>>>>>>>>>
>>>>>>>>>> While it is illegal by the OpenGL specification and doesn't work
>>>>>>>>>> for most userspace stacks, we do have some implementations which
>>>>>>>>>> call fork() with a GL context open and expect it to work.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
Rajneesh Bhardwaj Dec. 23, 2021, 1:51 a.m. UTC | #12
Sorry for the typo in my previous email. Please read Adrian Reber*

On 12/22/2021 8:49 PM, Bhardwaj, Rajneesh wrote:
>
> Adding Adrian Rebel who is the CRIU maintainer and CRIU list
>
> On 12/22/2021 3:53 PM, Daniel Vetter wrote:
>> On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:
>>> On 12/20/2021 4:29 AM, Daniel Vetter wrote:
>>>> On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote:
>>>>> Am 09.12.21 um 19:28 schrieb Felix Kuehling:
>>>>>> Am 2021-12-09 um 10:30 a.m. schrieb Christian König:
>>>>>>> That still won't work.
>>>>>>>
>>>>>>> But I think we could do this change for the amdgpu mmap callback only.
>>>>>> If graphics user mode has problems with it, we could even make this
>>>>>> specific to KFD BOs in the amdgpu_gem_object_mmap callback.
>>>>> I think it's fine for the whole amdgpu stack, my concern is more about
>>>>> radeon, nouveau and the ARM stacks which are using this as well.
>>>>>
>>>>> That blew up so nicely the last time we tried to change it and I know of at
>>>>> least one case where radeon was/is used with BOs in a child process.
>>>> I'm way late and buried again, but I think it'd be good to be consistent
>
>
> I had committed this change into our amd-staging-drm-next branch last 
> week after I got the ACK and RB from Felix and Christian.
>
>
>>>> here across drivers. Or at least across drm drivers. And we've had the vma
>>>> open/close refcounting to make fork work since forever.
>>>>
>>>> I think if we do this we should really only do this for mmap() where this
>>>> applies, but reading through the thread here I'm honestly confused why
>>>> this is a problem. If CRIU can't handle forked mmaps it needs to be
>>>> taught that, not hacked around. Or at least I'm not understanding why
>>>> this shouldn't work ...
>>>> -Daniel
>>>>
>>> Hi Daniel
>>>
>>> In the v2
>>> https://lore.kernel.org/all/a1a865f5-ad2c-29c8-cbe4-2635d53eceb6@amd.com/T/
>>> I pretty much limited the scope of the change to KFD BOs on mmap. Regarding
>>> CRIU, I think it's not a CRIU problem: on restore, CRIU only tries to
>>> recreate all the child processes and then mmaps all the VMAs it sees (as per
>>> the checkpoint snapshot) in the new process address space after the VMA
>>> placements are finalized in the position independent code phase. Since the
>>> inherited VMAs don't have access rights, the CRIU mmap fails.
>> Still sounds funky. I think minimally we should have an ack from CRIU
>> developers that this is officially the right way to solve this problem. I
>> really don't want to have random one-off hacks that don't work across the
>> board, for a problem where we (drm subsystem) really shouldn't be the only
>> one with this problem. Where "this problem" means that the mmap space is
>> per file description, and not per underlying inode or real device or
>> whatever. That part sounds like a CRIU problem, and I expect CRIU folks
>> want a consistent solution across the board for this. Hence please grab an
>> ack from them.
>>
>> Cheers, Daniel
>
>
> Maybe Adrian can share his views on this.
>
> Hi Adrian - For context, on CRIU restore we see mmap failures (in the 
> PIE restore phase) due to permission issues on the (render node) VMAs 
> that were inherited because the application that was checkpointed had 
> forked.  The VMAs ideally should not be in the child process, but the 
> smaps file shows these VMAs in the child address space. We didn't want 
> to use madvise to avoid this copy, and would rather change the kernel 
> side to limit the impact to our user space library (thunk). Based on my 
> understanding, during the PIE restore phase, after the VMA placements are 
> finalized, CRIU does a sys_mmap on every VMA it sees in the VmaEntry 
> list. I think this is not an issue as per the CRIU design, but do you think 
> we could handle this corner case better inside CRIU?
>
>
>>> Regards,
>>>
>>> Rajneesh
>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Regards,
>>>>>>      Felix
>>>>>>
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh:
>>>>>>>> Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank
>>>>>>>> you!
>>>>>>>>
>>>>>>>> On 12/9/2021 10:27 AM, Christian König wrote:
>>>>>>>>> Hi Rajneesh,
>>>>>>>>>
>>>>>>>>> yes, separating this from the drm_gem_mmap_obj() change is certainly
>>>>>>>>> a good idea.
>>>>>>>>>
>>>>>>>>>> The child cannot access the BOs mapped by the parent anyway with
>>>>>>>>>> access restrictions applied
>>>>>>>>> exactly that is not correct. That behavior is actively used by some
>>>>>>>>> userspace stacks as far as I know.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:
>>>>>>>>>> Thanks Christian. Would it make it less intrusive if I just use the
>>>>>>>>>> flag for ttm bo mmap and remove the drm_gem_mmap_obj change from
>>>>>>>>>> this patch? For our use case, just the ttm_bo_mmap_obj change
>>>>>>>>>> should suffice and we don't want to put any more work arounds in
>>>>>>>>>> the user space (thunk, in our case).
>>>>>>>>>>
>>>>>>>>>> The child cannot access the BOs mapped by the parent anyway with
>>>>>>>>>> access restrictions applied so I wonder why even inherit the vma?
>>>>>>>>>>
>>>>>>>>>> On 12/9/2021 2:54 AM, Christian König wrote:
>>>>>>>>>>> Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:
>>>>>>>>>>>> When an application having open file access to a node forks, its
>>>>>>>>>>>> shared
>>>>>>>>>>>> mappings also get reflected in the address space of child process
>>>>>>>>>>>> even
>>>>>>>>>>>> though it cannot access them with the object permissions applied.
>>>>>>>>>>>> With the
>>>>>>>>>>>> existing permission checks on the gem objects, it might be
>>>>>>>>>>>> reasonable to
>>>>>>>>>>>> also create the VMAs with VM_DONTCOPY flag so a user space
>>>>>>>>>>>> application
>>>>>>>>>>>> doesn't need to explicitly call the madvise(addr, len,
>>>>>>>>>>>> MADV_DONTFORK)
>>>>>>>>>>>> system call to prevent the pages in the mapped range to appear in
>>>>>>>>>>>> the
>>>>>>>>>>>> address space of the child process. It also prevents the memory
>>>>>>>>>>>> leaks
>>>>>>>>>>>> due to additional reference counts on the mapped BOs in the child
>>>>>>>>>>>> process that prevented freeing the memory in the parent for which
>>>>>>>>>>>> we had
>>>>>>>>>>>> worked around earlier in the user space inside the thunk library.
>>>>>>>>>>>>
>>>>>>>>>>>> Additionally, we faced this issue when using CRIU to checkpoint
>>>>>>>>>>>> restore
>>>>>>>>>>>> an application that had such inherited mappings in the child which
>>>>>>>>>>>> confuse CRIU when it mmaps on restore. Having this flag set for the
>>>>>>>>>>>> render node VMAs helps. VMAs mapped via KFD already take care of
>>>>>>>>>>>> this so
>>>>>>>>>>>> this is needed only for the render nodes.
>>>>>>>>>>> Unfortunately that is most likely a NAK. We already tried
>>>>>>>>>>> something similar.
>>>>>>>>>>>
>>>>>>>>>>> While it is illegal by the OpenGL specification and doesn't work
>>>>>>>>>>> for most userspace stacks, we do have some implementations which
>>>>>>>>>>> call fork() with a GL context open and expect it to work.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>> Cc: Felix Kuehling<Felix.Kuehling@amd.com>
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: David Yat Sin<david.yatsin@amd.com>
>>>>>>>>>>>> Signed-off-by: Rajneesh Bhardwaj<rajneesh.bhardwaj@amd.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>      drivers/gpu/drm/drm_gem.c       | 3 ++-
>>>>>>>>>>>>      drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
>>>>>>>>>>>>      2 files changed, 3 insertions(+), 2 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
>>>>>>>>>>>> index 09c820045859..d9c4149f36dd 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/drm_gem.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/drm_gem.c
>>>>>>>>>>>> @@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object
>>>>>>>>>>>> *obj, unsigned long obj_size,
>>>>>>>>>>>>                  goto err_drm_gem_object_put;
>>>>>>>>>>>>              }
>>>>>>>>>>>>      -        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND |
>>>>>>>>>>>> VM_DONTDUMP;
>>>>>>>>>>>> +        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
>>>>>>>>>>>> +                | VM_DONTDUMP | VM_DONTCOPY;
>>>>>>>>>>>>              vma->vm_page_prot =
>>>>>>>>>>>> pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
>>>>>>>>>>>>              vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
>>>>>>>>>>>>          }
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>>>>>>>>>>> b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>>>>>>>>>>> index 33680c94127c..420a4898fdd2 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>>>>>>>>>>> @@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct
>>>>>>>>>>>> *vma, struct ttm_buffer_object *bo)
>>>>>>>>>>>>            vma->vm_private_data = bo;
>>>>>>>>>>>>      -    vma->vm_flags |= VM_PFNMAP;
>>>>>>>>>>>> +    vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
>>>>>>>>>>>>          vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
>>>>>>>>>>>>          return 0;
>>>>>>>>>>>>      }
Christian König Dec. 23, 2021, 7:05 a.m. UTC | #13
Am 22.12.21 um 21:53 schrieb Daniel Vetter:
> On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:
>
> [SNIP]
> Still sounds funky. I think minimally we should have an ack from CRIU
> developers that this is officially the right way to solve this problem. I
> really don't want to have random one-off hacks that don't work across the
> board, for a problem where we (drm subsystem) really shouldn't be the only
> one with this problem. Where "this problem" means that the mmap space is
> per file description, and not per underlying inode or real device or
> whatever. That part sounds like a CRIU problem, and I expect CRIU folks
> want a consistent solution across the board for this. Hence please grab an
> ack from them.

Unfortunately it's a KFD design problem. AMD used a single device node, 
then mmaped different objects from the same offset to different 
processes and expected it to work with the rest of the fs subsystem 
without churn.

So yes, this is indeed because the mmap space is per file descriptor for 
the use case here.

And thanks for pointing this out, this indeed makes the whole change 
extremely questionable.

Regards,
Christian.

>
> Cheers, Daniel
>
Felix Kuehling Jan. 4, 2022, 6:08 p.m. UTC | #14
[+Adrian]

Am 2021-12-23 um 2:05 a.m. schrieb Christian König:

> Am 22.12.21 um 21:53 schrieb Daniel Vetter:
>> On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:
>>
>> [SNIP]
>> Still sounds funky. I think minimally we should have an ack from CRIU
>> developers that this is officially the right way to solve this
>> problem. I
>> really don't want to have random one-off hacks that don't work across
>> the
>> board, for a problem where we (drm subsystem) really shouldn't be the
>> only
>> one with this problem. Where "this problem" means that the mmap space is
>> per file description, and not per underlying inode or real device or
>> whatever. That part sounds like a CRIU problem, and I expect CRIU folks
>> want a consistent solution across the board for this. Hence please
>> grab an
>> ack from them.
>
> Unfortunately it's a KFD design problem. AMD used a single device
> node, then mmaped different objects from the same offset to different
> processes and expected it to work the rest of the fs subsystem without
> churn.

This may be true for mmaps in the KFD device, but not for mmaps in the
DRM render nodes.


>
> So yes, this is indeed because the mmap space is per file descriptor
> for the use case here.

No. This is a different problem.

The problem has to do with the way that DRM manages mmap permissions. In
order to be able to mmap an offset in the render node, there needs to be
a BO that was created in the same render node. If you fork a process, it
inherits the VMA. But KFD doesn't know anything about the inherited BOs
from the parent process. Therefore those BOs don't get checkpointed and
restored in the child process. When the CRIU checkpoint is restored, our
CRIU plugin never creates a BO corresponding to the VMA in the child
process' render node FD. We've also lost the relationship between the
parent and child-process' render node FDs. After "fork" the render node
FD points to the same struct file in parent and child. After restoring
the CRIU checkpoint, they are separate struct files, created by separate
"open" system calls. Therefore the mmap call that restores the VMA fails
in the child process.

At least for KFD, there is no point in a child process inheriting BOs,
because the GPU has no way of accessing the BOs in the child process.
The child process has no GPU address space, no user mode queues, no way
to do anything with the GPU before it completely reinitializes its KFD
context.

We can workaround this issue in user mode with madvise(...,
MADV_DONTFORK). In fact we've already done this for some BOs to avoid a
memory leak in the parent process while a child process exists. But it's
slightly racy because there is a short time window where VMA exists
without the VM_DONTCOPY flag. A fork during that time window could still
create a child process with an inherited VMA.

Therefore a safer solution is to set the vm_flags in the VMA in the
driver when the VMA is first created.

Regards,
  Felix


>
> And thanks for pointing this out, this indeed makes the whole change
> extremely questionable.
>
> Regards,
> Christian.
>
>>
>> Cheers, Daniel
>>
>
Christian König Jan. 5, 2022, 8:08 a.m. UTC | #15
Am 04.01.22 um 19:08 schrieb Felix Kuehling:
> [+Adrian]
>
> Am 2021-12-23 um 2:05 a.m. schrieb Christian König:
>
>> Am 22.12.21 um 21:53 schrieb Daniel Vetter:
>>> On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:
>>>
>>> [SNIP]
>>> Still sounds funky. I think minimally we should have an ack from CRIU
>>> developers that this is officially the right way to solve this
>>> problem. I
>>> really don't want to have random one-off hacks that don't work across
>>> the
>>> board, for a problem where we (drm subsystem) really shouldn't be the
>>> only
>>> one with this problem. Where "this problem" means that the mmap space is
>>> per file description, and not per underlying inode or real device or
>>> whatever. That part sounds like a CRIU problem, and I expect CRIU folks
>>> want a consistent solution across the board for this. Hence please
>>> grab an
>>> ack from them.
>> Unfortunately it's a KFD design problem. AMD used a single device
>> node, then mmaped different objects from the same offset to different
>> processes and expected it to work the rest of the fs subsystem without
>> churn.
> This may be true for mmaps in the KFD device, but not for mmaps in the
> DRM render nodes.

Correct, yes.

>> So yes, this is indeed because the mmap space is per file descriptor
>> for the use case here.
> No. This is a different problem.

I was already wondering which mmaps through the KFD node we have left 
which cause problems here.

> The problem has to do with the way that DRM manages mmap permissions. In
> order to be able to mmap an offset in the render node, there needs to be
> a BO that was created in the same render node. If you fork a process, it
> inherits the VMA.

Yeah, so far it works like designed.

> But KFD doesn't know anything about the inherited BOs
> from the parent process.

Ok, why that? When the KFD is reinitializing its context, why shouldn't 
it clean up those VMAs?

> Therefore those BOs don't get checkpointed and
> restored in the child process. When the CRIU checkpoint is restored, our
> CRIU plugin never creates a BO corresponding to the VMA in the child
> process' render node FD. We've also lost the relationship between the
> parent and child-process' render node FDs. After "fork" the render node
> FD points to the same struct file in parent and child. After restoring
> the CRIU checkpoint, they are separate struct files, created by separate
> "open" system calls. Therefore the mmap call that restores the VMA fails
> in the child process.
>
> At least for KFD, there is no point inheriting BOs from a child process,
> because the GPU has no way of accessing the BOs in the child process.
> The child process has no GPU address space, no user mode queues, no way
> to do anything with the GPU before it completely reinitializes its KFD
> context.
>
> We can workaround this issue in user mode with madvise(...,
> MADV_DONTFORK). In fact we've already done this for some BOs to avoid a
> memory leak in the parent process while a child process exists. But it's
> slightly racy because there is a short time window where VMA exists
> without the VM_DONTCOPY flag. A fork during that time window could still
> create a child process with an inherited VMA.
>
> Therefore a safer solution is to set the vm_flags in the VMA in the
> driver when the VMA is first created.

Thanks for the full explanation, it makes much more sense now.

Regards,
Christian.

>
> Regards,
>    Felix
>
>
>> And thanks for pointing this out, this indeed makes the whole change
>> extremely questionable.
>>
>> Regards,
>> Christian.
>>
>>> Cheers, Daniel
>>>
Felix Kuehling Jan. 5, 2022, 4:16 p.m. UTC | #16
Am 2022-01-05 um 3:08 a.m. schrieb Christian König:
> Am 04.01.22 um 19:08 schrieb Felix Kuehling:
>> [+Adrian]
>>
>> Am 2021-12-23 um 2:05 a.m. schrieb Christian König:
>>
>>> Am 22.12.21 um 21:53 schrieb Daniel Vetter:
>>>> On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:
>>>>
>>>> [SNIP]
>>>> Still sounds funky. I think minimally we should have an ack from CRIU
>>>> developers that this is officially the right way to solve this
>>>> problem. I
>>>> really don't want to have random one-off hacks that don't work across
>>>> the
>>>> board, for a problem where we (drm subsystem) really shouldn't be the
>>>> only
>>>> one with this problem. Where "this problem" means that the mmap
>>>> space is
>>>> per file description, and not per underlying inode or real device or
>>>> whatever. That part sounds like a CRIU problem, and I expect CRIU
>>>> folks
>>>> want a consistent solution across the board for this. Hence please
>>>> grab an
>>>> ack from them.
>>> Unfortunately it's a KFD design problem. AMD used a single device
>>> node, then mmaped different objects from the same offset to different
>>> processes and expected it to work the rest of the fs subsystem without
>>> churn.
>> This may be true for mmaps in the KFD device, but not for mmaps in the
>> DRM render nodes.
>
> Correct, yes.
>
>>> So yes, this is indeed because the mmap space is per file descriptor
>>> for the use case here.
>> No. This is a different problem.
>
> I was already wondering which mmaps through the KFD node we have left
> which cause problems here.

We still use the KFD FD for mapping doorbells and HDP flushing. These
are both SG BOs, so they cannot be CPU-mapped through render nodes. The
KFD FD is also used for mapping signal pages and CWSR trap handlers on
old APUs.

Those VMAs aren't causing the problem. They still map successfully on
restore.


>
>> The problem has to do with the way that DRM manages mmap permissions. In
>> order to be able to mmap an offset in the render node, there needs to be
>> a BO that was created in the same render node. If you fork a process, it
>> inherits the VMA.
>
> Yeah, so far it works like designed.
>
>> But KFD doesn't know anything about the inherited BOs
>> from the parent process.
>
> Ok, why that? When the KFD is reinitializing it's context why
> shouldn't it cleanup those VMAs?

That cleanup has to be initiated by user mode. Basically closing the old
KFD and DRM file descriptors, cleaning up all the user mode VM state,
unmapping all the VMAs, etc. Then it reopens KFD and the render nodes
and starts from scratch.

User mode will do this automatically when it tries to reinitialize ROCm.
However, in this case the child process doesn't do that (e.g. a python
application using the multi-processing package). The child process does
not use ROCm. But you're left with all the dangling VMAs in the child
process indefinitely.

Regards,
  Felix


>
>> Therefore those BOs don't get checkpointed and
>> restored in the child process. When the CRIU checkpoint is restored, our
>> CRIU plugin never creates a BO corresponding to the VMA in the child
>> process' render node FD. We've also lost the relationship between the
>> parent and child-process' render node FDs. After "fork" the render node
>> FD points to the same struct file in parent and child. After restoring
>> the CRIU checkpoint, they are separate struct files, created by separate
>> "open" system calls. Therefore the mmap call that restores the VMA fails
>> in the child process.
>>
>> At least for KFD, there is no point inheriting BOs from a child process,
>> because the GPU has no way of accessing the BOs in the child process.
>> The child process has no GPU address space, no user mode queues, no way
>> to do anything with the GPU before it completely reinitializes its KFD
>> context.
>>
>> We can workaround this issue in user mode with madvise(...,
>> MADV_DONTFORK). In fact we've already done this for some BOs to avoid a
>> memory leak in the parent process while a child process exists. But it's
>> slightly racy because there is a short time window where VMA exists
>> without the VM_DONTCOPY flag. A fork during that time window could still
>> create a child process with an inherited VMA.
>>
>> Therefore a safer solution is to set the vm_flags in the VMA in the
>> driver when the VMA is first created.
>
> Thanks for the full explanation, it makes much more sense now.
>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>
>>> And thanks for pointing this out, this indeed makes the whole change
>>> extremely questionable.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Cheers, Daniel
>>>>
>
Felix Kuehling Jan. 5, 2022, 4:27 p.m. UTC | #17
Am 2022-01-05 um 11:16 a.m. schrieb Felix Kuehling:
>> I was already wondering which mmaps through the KFD node we have left
>> which cause problems here.
> We still use the KFD FD for mapping doorbells and HDP flushing. These
> are both SG BOs, so they cannot be CPU-mapped through render nodes. The
> KFD FD is also used for mapping signal pages and CWSR trap handlers on
> old APUs.
>
> Those VMAs aren't causing the problem. They still map successfully on
> restore.
>
>
Small correction: KFD already sets the VM_DONTCOPY flag for all KFD FD
mappings.

The patch under discussion here does the same for DRM FD mappings (at
least for KFD BOs).

Regards,
  Felix
Christian König Jan. 6, 2022, 9:05 a.m. UTC | #18
Am 05.01.22 um 17:16 schrieb Felix Kuehling:
> [SNIP]
>>> But KFD doesn't know anything about the inherited BOs
>>> from the parent process.
>> Ok, why that? When the KFD is reinitializing it's context why
>> shouldn't it cleanup those VMAs?
> That cleanup has to be initiated by user mode. Basically closing the old
> KFD and DRM file descriptors, cleaning up all the user mode VM state,
> unmapping all the VMAs, etc. Then it reopens KFD and the render nodes
> and starts from scratch.
>
> User mode will do this automatically when it tries to reinitialize ROCm.
> However, in this case the child process doesn't do that (e.g. a python
> application using the multi-processing package). The child process does
> not use ROCm. But you're left with all the dangling VMAs in the child
> process indefinitely.

Oh, not that one again. I'm unfortunately pretty sure that this is a 
clear NAK then.

This python multi-processing package is violating various specifications 
by doing this fork() and we already had multiple discussions about that.

Let's talk about this on Mondays call. Thanks for giving the whole context.

Regards,
Christian.

>
> Regards,
>    Felix
>
Felix Kuehling Jan. 6, 2022, 4:45 p.m. UTC | #19
Am 2022-01-06 um 4:05 a.m. schrieb Christian König:
> Am 05.01.22 um 17:16 schrieb Felix Kuehling:
>> [SNIP]
>>>> But KFD doesn't know anything about the inherited BOs
>>>> from the parent process.
>>> Ok, why that? When the KFD is reinitializing it's context why
>>> shouldn't it cleanup those VMAs?
>> That cleanup has to be initiated by user mode. Basically closing the old
>> KFD and DRM file descriptors, cleaning up all the user mode VM state,
>> unmapping all the VMAs, etc. Then it reopens KFD and the render nodes
>> and starts from scratch.
>>
>> User mode will do this automatically when it tries to reinitialize ROCm.
>> However, in this case the child process doesn't do that (e.g. a python
>> application using the multi-processing package). The child process does
>> not use ROCm. But you're left with all the dangling VMAs in the child
>> process indefinitely.
>
> Oh, not that one again. I'm unfortunately pretty sure that this is an
> clear NAK then.
>
> This python multi-processing package is violating various
> specifications by doing this fork() and we already had multiple
> discussions about that.

Well, it's in widespread use. We can't just throw up our hands and say
they're buggy and not supported.

Also, why does your ACK or NAK depend on this at all? If it's the right
thing to do, it's the right thing to do regardless of who benefits from
it. In addition, how can a child process that doesn't even use the GPU
be in violation of any GPU-driver related specifications?

Regards,
  Felix


>
> Let's talk about this on Mondays call. Thanks for giving the whole
> context.
>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>
Christian König Jan. 6, 2022, 4:48 p.m. UTC | #20
Am 06.01.22 um 17:45 schrieb Felix Kuehling:
> Am 2022-01-06 um 4:05 a.m. schrieb Christian König:
>> Am 05.01.22 um 17:16 schrieb Felix Kuehling:
>>> [SNIP]
>>>>> But KFD doesn't know anything about the inherited BOs
>>>>> from the parent process.
>>>> Ok, why that? When the KFD is reinitializing it's context why
>>>> shouldn't it cleanup those VMAs?
>>> That cleanup has to be initiated by user mode. Basically closing the old
>>> KFD and DRM file descriptors, cleaning up all the user mode VM state,
>>> unmapping all the VMAs, etc. Then it reopens KFD and the render nodes
>>> and starts from scratch.
>>>
>>> User mode will do this automatically when it tries to reinitialize ROCm.
>>> However, in this case the child process doesn't do that (e.g. a python
>>> application using the multi-processing package). The child process does
>>> not use ROCm. But you're left with all the dangling VMAs in the child
>>> process indefinitely.
>> Oh, not that one again. I'm unfortunately pretty sure that this is an
>> clear NAK then.
>>
>> This python multi-processing package is violating various
>> specifications by doing this fork() and we already had multiple
>> discussions about that.
> Well, it's in wide-spread use. We can't just throw up our hands and say
> they're buggy and not supported.

Because that's not my NAK, but rather from upstream.

> Also, why does your ACK or NAK depend on this at all. If it's the right
> thing to do, it's the right thing to do regardless of who benefits from
> it. In addition, how can a child process that doesn't even use the GPU
> be in violation of any GPU-driver related specifications.

The argument is that the application is broken and needs to be fixed 
instead of worked around inside the kernel.

Regards,
Christian.

>
> Regards,
>    Felix
>
>
>> Let's talk about this on Mondays call. Thanks for giving the whole
>> context.
>>
>> Regards,
>> Christian.
>>
>>> Regards,
>>>     Felix
>>>
Felix Kuehling Jan. 6, 2022, 4:51 p.m. UTC | #21
Am 2022-01-06 um 11:48 a.m. schrieb Christian König:
> Am 06.01.22 um 17:45 schrieb Felix Kuehling:
>> Am 2022-01-06 um 4:05 a.m. schrieb Christian König:
>>> Am 05.01.22 um 17:16 schrieb Felix Kuehling:
>>>> [SNIP]
>>>>>> But KFD doesn't know anything about the inherited BOs
>>>>>> from the parent process.
>>>>> Ok, why that? When the KFD is reinitializing it's context why
>>>>> shouldn't it cleanup those VMAs?
>>>> That cleanup has to be initiated by user mode. Basically closing
>>>> the old
>>>> KFD and DRM file descriptors, cleaning up all the user mode VM state,
>>>> unmapping all the VMAs, etc. Then it reopens KFD and the render nodes
>>>> and starts from scratch.
>>>>
>>>> User mode will do this automatically when it tries to reinitialize
>>>> ROCm.
>>>> However, in this case the child process doesn't do that (e.g. a python
>>>> application using the multi-processing package). The child process
>>>> does
>>>> not use ROCm. But you're left with all the dangling VMAs in the child
>>>> process indefinitely.
>>> Oh, not that one again. I'm unfortunately pretty sure that this is an
>>> clear NAK then.
>>>
>>> This python multi-processing package is violating various
>>> specifications by doing this fork() and we already had multiple
>>> discussions about that.
>> Well, it's in wide-spread use. We can't just throw up our hands and say
>> they're buggy and not supported.
>
> Because that's not my NAK, but rather from upstream.
>
>> Also, why does your ACK or NAK depend on this at all. If it's the right
>> thing to do, it's the right thing to do regardless of who benefits from
>> it. In addition, how can a child process that doesn't even use the GPU
>> be in violation of any GPU-driver related specifications.
>
> The argument is that the application is broken and needs to be fixed
> instead of worked around inside the kernel.

I still don't get how the application is broken. Like I said, the
child process is not using the GPU. How can the application be fixed in
this case?

Are you saying, any application that forks and doesn't immediately call
exec is broken?

Or does an application that forks need to be aware that some other part
of the application used the GPU and explicitly free any GPU resources?

Thanks,
  Felix


>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>
>>> Let's talk about this on Mondays call. Thanks for giving the whole
>>> context.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Regards,
>>>>     Felix
>>>>
>
Christian König Jan. 7, 2022, 8:56 a.m. UTC | #22
Am 06.01.22 um 17:51 schrieb Felix Kuehling:
> Am 2022-01-06 um 11:48 a.m. schrieb Christian König:
>> Am 06.01.22 um 17:45 schrieb Felix Kuehling:
>>> Am 2022-01-06 um 4:05 a.m. schrieb Christian König:
>>> [SNIP]
>>> Also, why does your ACK or NAK depend on this at all. If it's the right
>>> thing to do, it's the right thing to do regardless of who benefits from
>>> it. In addition, how can a child process that doesn't even use the GPU
>>> be in violation of any GPU-driver related specifications.
>> The argument is that the application is broken and needs to be fixed
>> instead of worked around inside the kernel.
> I still don't get how they the application is broken. Like I said, the
> child process is not using the GPU. How can the application be fixed in
> this case?

Sounds like I'm still missing some important puzzle pieces for the full 
picture to figure out why this doesn't work.

> Are you saying, any application that forks and doesn't immediately call
> exec is broken?

More or less yes. We essentially have three possible cases here:

1. Application is already using (for example) OpenGL or OpenCL to do 
some rendering on the GPU and then calls fork() and expects to use 
OpenGL both in the parent and the child at the same time.
     As far as I know that's illegal from the Khronos specification 
point of view, and working around something not allowed in the first 
place inside the kernel is seen as a futile effort.

2. Application opened the file descriptor, forks and then initializes 
OpenGL/Vulkan/OpenCL.
     That's what some compositors are doing to drop into the background 
and is explicitly legal.

3. Application calls fork() and then doesn't use the GPU in the child. 
Either by not using it or calling exec.
     That should be legal and not cause any problems in the first place.

But from your description I still don't get why we are running into 
problems here.

I was assuming that you have case #1 because we previously had some 
examples of this with this python library, but from your description it 
seems to be case #3.

> Or does an application that forks need to be aware that some other part
> of the application used the GPU and explicitly free any GPU resources?

Others might fill that information in, but I think that was the plan for 
newer APIs like Vulkan.

Regards,
Christian.

>
> Thanks,
>    Felix
>
>
>> Regards,
>> Christian.
>>
>>> Regards,
>>>     Felix
>>>
>>>
>>>> Let's talk about this on Mondays call. Thanks for giving the whole
>>>> context.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Regards,
>>>>>      Felix
>>>>>
Felix Kuehling Jan. 7, 2022, 5:47 p.m. UTC | #23
On 2022-01-07 at 3:56 a.m., Christian König wrote:
> Am 06.01.22 um 17:51 schrieb Felix Kuehling:
>> Am 2022-01-06 um 11:48 a.m. schrieb Christian König:
>>> Am 06.01.22 um 17:45 schrieb Felix Kuehling:
>>>> Am 2022-01-06 um 4:05 a.m. schrieb Christian König:
>>>> [SNIP]
>>>> Also, why does your ACK or NAK depend on this at all. If it's the
>>>> right
>>>> thing to do, it's the right thing to do regardless of who benefits
>>>> from
>>>> it. In addition, how can a child process that doesn't even use the GPU
>>>> be in violation of any GPU-driver related specifications.
>>> The argument is that the application is broken and needs to be fixed
>>> instead of worked around inside the kernel.
>> I still don't get how the application is broken. Like I said, the
>> child process is not using the GPU. How can the application be fixed in
>> this case?
>
> Sounds like I'm still missing some important puzzle pieces for the
> full picture to figure out why this doesn't work.
>
>> Are you saying, any application that forks and doesn't immediately call
>> exec is broken?
>
> More or less yes. We essentially have three possible cases here:
>
> 1. Application is already using (for example) OpenGL or OpenCL to do
> some rendering on the GPU and then calls fork() and expects to use
> OpenGL both in the parent and the child at the same time.
>     As far as I know that's illegal from the Khronos specification
> point of view and working around inside the kernel for something not
> allowed in the first place is seen as futile effort.
>
> 2. Application opened the file descriptor, forks and then initializes
> OpenGL/Vulkan/OpenCL.
>     That's what some compositors are doing to drop into the background
> and is explicitly legal.
>
> 3. Application calls fork() and then doesn't use the GPU in the child.
> Either by not using it or calling exec.
>     That should be legal and not cause any problems in the first place. 
>
> But from your description I still don't get why we are running into
> problems here.
>
> I was assuming that you have case #1 because we previously had some
> examples of this with this python library, but from your description
> it seems to be case #3.

Correct. #3 has at least one issue we previously worked around in the
Thunk: The inherited VMAs prevent BOs from being freed in the parent
process. This manifests as an apparent memory leak. Therefore the Thunk
calls madvise(..., MADV_DONTFORK) on all its VRAM allocations. The BOs
that are causing problems with CRIU are GTT BOs that weren't covered by
this previous workaround.

The new issue with CRIU is that CRIU saves and restores all the VMAs.
When trying to restore the inherited VMAs in the child process, the mmap
call fails because the child process' render node FD is no longer
inherited from the parent, but is instead created by its own "open"
system call. The mmap call in the child fails for at least two reasons:

  * The child process' render node FD doesn't have permission to map the
    parent process' BO any more
  * The parent BO doesn't get restored in the child process, so its mmap
    offset doesn't get updated to the new mmap offset of the restored
    parent BO by the amdgpu CRIU plugin

We could maybe add a whole bunch of complexity in CRIU and our CRIU
plugin to fix this. But it's pointless because, like you said, actually
doing anything with the BO in the child process is illegal anyway
(scenario #1 above). The easiest solution seems to be to just not
inherit the VMA in the first place.

Regards,
  Felix


>
>> Or does an application that forks need to be aware that some other part
>> of the application used the GPU and explicitly free any GPU resources?
>
> Others might fill that information in, but I think that was the plan
> for newer APIs like Vulkan.
>
> Regards,
> Christian.
>
>>
>> Thanks,
>>    Felix
>>
>>
>>> Regards,
>>> Christian.
>>>
>>>> Regards,
>>>>     Felix
>>>>
>>>>
>>>>> Let's talk about this on Mondays call. Thanks for giving the whole
>>>>> context.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Regards,
>>>>>>      Felix
>>>>>>
>
Rajneesh Bhardwaj Jan. 10, 2022, 5:30 p.m. UTC | #24
Hi Christian


I have reverted the change from amd-staging-drm-next as per the
discussion. Thank you.


Regards

Rajneesh

On 1/4/2022 1:08 PM, Felix Kuehling wrote:
> [+Adrian]
>
> Am 2021-12-23 um 2:05 a.m. schrieb Christian König:
>
>> Am 22.12.21 um 21:53 schrieb Daniel Vetter:
>>> On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:
>>>
>>> [SNIP]
>>> Still sounds funky. I think minimally we should have an ack from CRIU
>>> developers that this is officially the right way to solve this
>>> problem. I
>>> really don't want to have random one-off hacks that don't work across
>>> the
>>> board, for a problem where we (drm subsystem) really shouldn't be the
>>> only
>>> one with this problem. Where "this problem" means that the mmap space is
>>> per file description, and not per underlying inode or real device or
>>> whatever. That part sounds like a CRIU problem, and I expect CRIU folks
>>> want a consistent solution across the board for this. Hence please
>>> grab an
>>> ack from them.
>> Unfortunately it's a KFD design problem. AMD used a single device
>> node, then mmaped different objects from the same offset to different
>> processes and expected it to work with the rest of the fs subsystem
>> without churn.
> This may be true for mmaps in the KFD device, but not for mmaps in the
> DRM render nodes.
>
>
>> So yes, this is indeed because the mmap space is per file descriptor
>> for the use case here.
> No. This is a different problem.
>
> The problem has to do with the way that DRM manages mmap permissions. In
> order to be able to mmap an offset in the render node, there needs to be
> a BO that was created in the same render node. If you fork a process, it
> inherits the VMA. But KFD doesn't know anything about the inherited BOs
> from the parent process. Therefore those BOs don't get checkpointed and
> restored in the child process. When the CRIU checkpoint is restored, our
> CRIU plugin never creates a BO corresponding to the VMA in the child
> process' render node FD. We've also lost the relationship between the
> parent and child-process' render node FDs. After "fork" the render node
> FD points to the same struct file in parent and child. After restoring
> the CRIU checkpoint, they are separate struct files, created by separate
> "open" system calls. Therefore the mmap call that restores the VMA fails
> in the child process.
>
> At least for KFD, there is no point inheriting BOs from a child process,
> because the GPU has no way of accessing the BOs in the child process.
> The child process has no GPU address space, no user mode queues, no way
> to do anything with the GPU before it completely reinitializes its KFD
> context.
>
> We can work around this issue in user mode with madvise(...,
> MADV_DONTFORK). In fact we've already done this for some BOs to avoid a
> memory leak in the parent process while a child process exists. But it's
> slightly racy because there is a short time window where VMA exists
> without the VM_DONTCOPY flag. A fork during that time window could still
> create a child process with an inherited VMA.
>
> Therefore a safer solution is to set the vm_flags in the VMA in the
> driver when the VMA is first created.
>
> Regards,
>    Felix
>
>
>> And thanks for pointing this out, this indeed makes the whole change
>> extremely questionable.
>>
>> Regards,
>> Christian.
>>
>>> Cheers, Daniel
>>>
Daniel Vetter Jan. 14, 2022, 4:44 p.m. UTC | #25
Top post because I tried to catch up on the entire discussion here.

So fundamentally I'm not opposed to just close this fork() hole once and
for all. The thing that worries me from an upstream/platform pov is really
only if we don't do it consistently across all drivers.

So maybe as an idea:
- Do the original patch, but not just for ttm but all gem rendernode
  drivers at least (or maybe even all gem drivers, no idea), with the
  below discussion cleaned up as justification.

- Get acks from userspace driver folks who know real-world gl/vk usage and
  khr specs in-depth enough to be able to tell us how much we'll regret
  this.

- Merge it. Or come up with something new. Or maybe stick to the Nack, but
  then maybe it would be good to document that somewhere in-tree?

This entire can of worms just feels way too tricky to have it be handled
inconsistently across drivers. And trying to fix these kinds of low-level
things across drivers once divergences exist is just really painful (e.g.
trying to make all dma-buf mmap VM_SPECIAL or the herculean task
Christian is doing to get our dma_resv rules consistent across drivers).

Cheers, Daniel

On Fri, Jan 07, 2022 at 12:47:45PM -0500, Felix Kuehling wrote:
> Am 2022-01-07 um 3:56 a.m. schrieb Christian König:
> > Am 06.01.22 um 17:51 schrieb Felix Kuehling:
> >> Am 2022-01-06 um 11:48 a.m. schrieb Christian König:
> >>> Am 06.01.22 um 17:45 schrieb Felix Kuehling:
> >>>> Am 2022-01-06 um 4:05 a.m. schrieb Christian König:
> >>>> [SNIP]
> >>>> Also, why does your ACK or NAK depend on this at all. If it's the
> >>>> right
> >>>> thing to do, it's the right thing to do regardless of who benefits
> >>>> from
> >>>> it. In addition, how can a child process that doesn't even use the GPU
> >>>> be in violation of any GPU-driver related specifications.
> >>> The argument is that the application is broken and needs to be fixed
> >>> instead of worked around inside the kernel.
> >> I still don't get how the application is broken. Like I said, the
> >> child process is not using the GPU. How can the application be fixed in
> >> this case?
> >
> > Sounds like I'm still missing some important puzzle pieces for the
> > full picture to figure out why this doesn't work.
> >
> >> Are you saying, any application that forks and doesn't immediately call
> >> exec is broken?
> >
> > More or less yes. We essentially have three possible cases here:
> >
> > 1. Application is already using (for example) OpenGL or OpenCL to do
> > some rendering on the GPU and then calls fork() and expects to use
> > OpenGL both in the parent and the child at the same time.
> >     As far as I know that's illegal from the Khronos specification
> > point of view and working around inside the kernel for something not
> > allowed in the first place is seen as futile effort.
> >
> > 2. Application opened the file descriptor, forks and then initializes
> > OpenGL/Vulkan/OpenCL.
> >     That's what some compositors are doing to drop into the background
> > and is explicitly legal.
> >
> > 3. Application calls fork() and then doesn't use the GPU in the child.
> > Either by not using it or calling exec.
> >     That should be legal and not cause any problems in the first place. 
> >
> > But from your description I still don't get why we are running into
> > problems here.
> >
> > I was assuming that you have case #1 because we previously had some
> > examples of this with this python library, but from your description
> > it seems to be case #3.
> 
> Correct. #3 has at least one issue we previously worked around in the
> Thunk: The inherited VMAs prevent BOs from being freed in the parent
> process. This manifests as an apparent memory leak. Therefore the Thunk
> calls madvise(..., MADV_DONTFORK) on all its VRAM allocations. The BOs
> that are causing problems with CRIU are GTT BOs that weren't covered by
> this previous workaround.
> 
> The new issue with CRIU is that CRIU saves and restores all the VMAs.
> When trying to restore the inherited VMAs in the child process, the mmap
> call fails because the child process' render node FD is no longer
> inherited from the parent, but is instead created by its own "open"
> system call. The mmap call in the child fails for at least two reasons:
> 
>   * The child process' render node FD doesn't have permission to map the
>     parent process' BO any more
>   * The parent BO doesn't get restored in the child process, so its mmap
>     offset doesn't get updated to the new mmap offset of the restored
>     parent BO by the amdgpu CRIU plugin
> 
> We could maybe add a whole bunch of complexity in CRIU and our CRIU
> plugin to fix this. But it's pointless because like you said, actually
> doing anything with the BO in the child process is illegal anyway
> (scenario #1 above). The easiest solution seems to be, to just not
> inherit the VMA in the first place.
> 
> Regards,
>   Felix
> 
> 
> >
> >> Or does an application that forks need to be aware that some other part
> >> of the application used the GPU and explicitly free any GPU resources?
> >
> > Others might fill that information in, but I think that was the plan
> > for newer APIs like Vulkan.
> >
> > Regards,
> > Christian.
> >
> >>
> >> Thanks,
> >>    Felix
> >>
> >>
> >>> Regards,
> >>> Christian.
> >>>
> >>>> Regards,
> >>>>     Felix
> >>>>
> >>>>
> >>>>> Let's talk about this on Mondays call. Thanks for giving the whole
> >>>>> context.
> >>>>>
> >>>>> Regards,
> >>>>> Christian.
> >>>>>
> >>>>>> Regards,
> >>>>>>      Felix
> >>>>>>
> >
Christian König Jan. 14, 2022, 5:26 p.m. UTC | #26
On 14.01.22 at 17:44, Daniel Vetter wrote:
> Top post because I tried to catch up on the entire discussion here.
>
> So fundamentally I'm not opposed to just close this fork() hole once and
> for all. The thing that worries me from an upstream/platform pov is really
> only if we don't do it consistently across all drivers.
>
> So maybe as an idea:
> - Do the original patch, but not just for ttm but all gem rendernode
>    drivers at least (or maybe even all gem drivers, no idea), with the
>    below discussion cleaned up as justification.

I know of at least one use case which this will break.

A couple of years back we had a discussion on the Mesa mailing list 
because (IIRC) Marek introduced a background thread to push command 
submissions to the kernel.

That broke because some compositor used to initialize OpenGL and then do
a fork(). This indeed worked previously (no GPUVM at that time), but
with the addition of the background thread it obviously broke.

The conclusion back then was that the compositor is broken and needs 
fixing, but it still essentially means that there could be people out 
there with really old userspace where this setting would just break the 
desktop.

I'm not really against that change either, but at least in theory we 
could make fork() work perfectly fine even with VMs and background threads.

Regards,
Christian.

> - Get acks from userspace driver folks who know real-world gl/vk usage and
>    khr specs in-depth enough to be able to tell us how much we'll regret
>    this.
>
> - Merge it. Or come up with something new. Or maybe stick to the Nack, but
>    then maybe it would be good to document that somewhere in-tree?
>
> This entire can of worms just feels way too tricky to have it be handled
> inconsistently across drivers. And trying to fix these kind of low-level
> things across drivers once divergences exists is just really painful (e.g.
> trying to make all dma-buf mmap VM_SPECIAL or the herculean task
> Christian is doing to get our dma_resv rules consistent across drivers).
>
> Cheers, Daniel
>
> On Fri, Jan 07, 2022 at 12:47:45PM -0500, Felix Kuehling wrote:
>> Am 2022-01-07 um 3:56 a.m. schrieb Christian König:
>>> Am 06.01.22 um 17:51 schrieb Felix Kuehling:
>>>> Am 2022-01-06 um 11:48 a.m. schrieb Christian König:
>>>>> Am 06.01.22 um 17:45 schrieb Felix Kuehling:
>>>>>> Am 2022-01-06 um 4:05 a.m. schrieb Christian König:
>>>>>> [SNIP]
>>>>>> Also, why does your ACK or NAK depend on this at all. If it's the
>>>>>> right
>>>>>> thing to do, it's the right thing to do regardless of who benefits
>>>>>> from
>>>>>> it. In addition, how can a child process that doesn't even use the GPU
>>>>>> be in violation of any GPU-driver related specifications.
>>>>> The argument is that the application is broken and needs to be fixed
>>>>> instead of worked around inside the kernel.
>>>> I still don't get how the application is broken. Like I said, the
>>>> child process is not using the GPU. How can the application be fixed in
>>>> this case?
>>> Sounds like I'm still missing some important puzzle pieces for the
>>> full picture to figure out why this doesn't work.
>>>
>>>> Are you saying, any application that forks and doesn't immediately call
>>>> exec is broken?
>>> More or less yes. We essentially have three possible cases here:
>>>
>>> 1. Application is already using (for example) OpenGL or OpenCL to do
>>> some rendering on the GPU and then calls fork() and expects to use
>>> OpenGL both in the parent and the child at the same time.
>>>      As far as I know that's illegal from the Khronos specification
>>> point of view and working around inside the kernel for something not
>>> allowed in the first place is seen as futile effort.
>>>
>>> 2. Application opened the file descriptor, forks and then initializes
>>> OpenGL/Vulkan/OpenCL.
>>>      That's what some compositors are doing to drop into the background
>>> and is explicitly legal.
>>>
>>> 3. Application calls fork() and then doesn't use the GPU in the child.
>>> Either by not using it or calling exec.
>>>      That should be legal and not cause any problems in the first place.
>>>
>>> But from your description I still don't get why we are running into
>>> problems here.
>>>
>>> I was assuming that you have case #1 because we previously had some
>>> examples of this with this python library, but from your description
>>> it seems to be case #3.
>> Correct. #3 has at least one issue we previously worked around in the
>> Thunk: The inherited VMAs prevent BOs from being freed in the parent
>> process. This manifests as an apparent memory leak. Therefore the Thunk
>> calls madvise(..., MADV_DONTFORK) on all its VRAM allocations. The BOs
>> that are causing problems with CRIU are GTT BOs that weren't covered by
>> this previous workaround.
>>
>> The new issue with CRIU is that CRIU saves and restores all the VMAs.
>> When trying to restore the inherited VMAs in the child process, the mmap
>> call fails because the child process' render node FD is no longer
>> inherited from the parent, but is instead created by its own "open"
>> system call. The mmap call in the child fails for at least two reasons:
>>
>>    * The child process' render node FD doesn't have permission to map the
>>      parent process' BO any more
>>    * The parent BO doesn't get restored in the child process, so its mmap
>>      offset doesn't get updated to the new mmap offset of the restored
>>      parent BO by the amdgpu CRIU plugin
>>
>> We could maybe add a whole bunch of complexity in CRIU and our CRIU
>> plugin to fix this. But it's pointless because like you said, actually
>> doing anything with the BO in the child process is illegal anyway
>> (scenario #1 above). The easiest solution seems to be, to just not
>> inherit the VMA in the first place.
>>
>> Regards,
>>    Felix
>>
>>
>>>> Or does an application that forks need to be aware that some other part
>>>> of the application used the GPU and explicitly free any GPU resources?
>>> Others might fill that information in, but I think that was the plan
>>> for newer APIs like Vulkan.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Thanks,
>>>>     Felix
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Regards,
>>>>>>      Felix
>>>>>>
>>>>>>
>>>>>>> Let's talk about this on Mondays call. Thanks for giving the whole
>>>>>>> context.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Regards,
>>>>>>>>       Felix
>>>>>>>>
Felix Kuehling Jan. 14, 2022, 5:40 p.m. UTC | #27
On 2022-01-14 at 12:26 p.m., Christian König wrote:
> Am 14.01.22 um 17:44 schrieb Daniel Vetter:
>> Top post because I tried to catch up on the entire discussion here.
>>
>> So fundamentally I'm not opposed to just close this fork() hole once and
>> for all. The thing that worries me from an upstream/platform pov is
>> really
>> only if we don't do it consistently across all drivers.
>>
>> So maybe as an idea:
>> - Do the original patch, but not just for ttm but all gem rendernode
>>    drivers at least (or maybe even all gem drivers, no idea), with the
>>    below discussion cleaned up as justification.
>
> I know of at least one use case which this will break.
>
> A couple of years back we had a discussion on the Mesa mailing list
> because (IIRC) Marek introduced a background thread to push command
> submissions to the kernel.
>
> That broke because some compositor used to initialize OpenGL and then
> do a fork(). This indeed worked previously (no GPUVM at that time),
> but with the addition of the background thread obviously broke.
>
> The conclusion back then was that the compositor is broken and needs
> fixing, but it still essentially means that there could be people out
> there with really old userspace where this setting would just break
> the desktop.
>
> I'm not really against that change either, but at least in theory we
> could make fork() work perfectly fine even with VMs and background
> threads.

You may regret this if you ever try to build a shared virtual address
space between GPU and CPU. Then you have two processes (parent and
child) sharing the same render context and GPU VM address space. But the
CPU address spaces are different. You can't maintain consistent shared
virtual address spaces for both processes when the GPU address space is
shared between them.

Regards,
  Felix


>
> Regards,
> Christian.
>
>> - Get acks from userspace driver folks who know real-world gl/vk
>> usage and
>>    khr specs in-depth enough to be able to tell us how much we'll regret
>>    this.
>>
>> - Merge it. Or come up with something new. Or maybe stick to the
>> Nack, but
>>    then maybe it would be good to document that somewhere in-tree?
>>
>> This entire can of worms just feels way too tricky to have it be handled
>> inconsistently across drivers. And trying to fix these kind of low-level
>> things across drivers once divergences exists is just really painful
>> (e.g.
>> trying to make all dma-buf mmap VM_SPECIAL or the herculean task
>> Christian is doing to get our dma_resv rules consistent across drivers).
>>
>> Cheers, Daniel
>>
>> On Fri, Jan 07, 2022 at 12:47:45PM -0500, Felix Kuehling wrote:
>>> Am 2022-01-07 um 3:56 a.m. schrieb Christian König:
>>>> Am 06.01.22 um 17:51 schrieb Felix Kuehling:
>>>>> Am 2022-01-06 um 11:48 a.m. schrieb Christian König:
>>>>>> Am 06.01.22 um 17:45 schrieb Felix Kuehling:
>>>>>>> Am 2022-01-06 um 4:05 a.m. schrieb Christian König:
>>>>>>> [SNIP]
>>>>>>> Also, why does your ACK or NAK depend on this at all. If it's the
>>>>>>> right
>>>>>>> thing to do, it's the right thing to do regardless of who benefits
>>>>>>> from
>>>>>>> it. In addition, how can a child process that doesn't even use
>>>>>>> the GPU
>>>>>>> be in violation of any GPU-driver related specifications.
>>>>>> The argument is that the application is broken and needs to be fixed
>>>>>> instead of worked around inside the kernel.
>>>>> I still don't get how the application is broken. Like I said,
>>>>> the
>>>>> child process is not using the GPU. How can the application be
>>>>> fixed in
>>>>> this case?
>>>> Sounds like I'm still missing some important puzzle pieces for the
>>>> full picture to figure out why this doesn't work.
>>>>
>>>>> Are you saying, any application that forks and doesn't immediately
>>>>> call
>>>>> exec is broken?
>>>> More or less yes. We essentially have three possible cases here:
>>>>
>>>> 1. Application is already using (for example) OpenGL or OpenCL to do
>>>> some rendering on the GPU and then calls fork() and expects to use
>>>> OpenGL both in the parent and the child at the same time.
>>>>      As far as I know that's illegal from the Khronos specification
>>>> point of view and working around inside the kernel for something not
>>>> allowed in the first place is seen as futile effort.
>>>>
>>>> 2. Application opened the file descriptor, forks and then initializes
>>>> OpenGL/Vulkan/OpenCL.
>>>>      That's what some compositors are doing to drop into the background
>>>> and is explicitly legal.
>>>>
>>>> 3. Application calls fork() and then doesn't use the GPU in the child.
>>>> Either by not using it or calling exec.
>>>>      That should be legal and not cause any problems in the first
>>>> place.
>>>>
>>>> But from your description I still don't get why we are running into
>>>> problems here.
>>>>
>>>> I was assuming that you have case #1 because we previously had some
>>>> examples of this with this python library, but from your description
>>>> it seems to be case #3.
>>> Correct. #3 has at least one issue we previously worked around in the
>>> Thunk: The inherited VMAs prevent BOs from being freed in the parent
>>> process. This manifests as an apparent memory leak. Therefore the Thunk
>>> calls madvise(..., MADV_DONTFORK) on all its VRAM allocations. The BOs
>>> that are causing problems with CRIU are GTT BOs that weren't covered by
>>> this previous workaround.
>>>
>>> The new issue with CRIU is that CRIU saves and restores all the VMAs.
>>> When trying to restore the inherited VMAs in the child process, the
>>> mmap
>>> call fails because the child process' render node FD is no longer
>>> inherited from the parent, but is instead created by its own "open"
>>> system call. The mmap call in the child fails for at least two reasons:
>>>
>>>    * The child process' render node FD doesn't have permission to
>>> map the
>>>      parent process' BO any more
>>>    * The parent BO doesn't get restored in the child process, so its
>>> mmap
>>>      offset doesn't get updated to the new mmap offset of the restored
>>>      parent BO by the amdgpu CRIU plugin
>>>
>>> We could maybe add a whole bunch of complexity in CRIU and our CRIU
>>> plugin to fix this. But it's pointless because like you said, actually
>>> doing anything with the BO in the child process is illegal anyway
>>> (scenario #1 above). The easiest solution seems to be, to just not
>>> inherit the VMA in the first place.
>>>
>>> Regards,
>>>    Felix
>>>
>>>
>>>>> Or does an application that forks need to be aware that some other
>>>>> part
>>>>> of the application used the GPU and explicitly free any GPU
>>>>> resources?
>>>> Others might fill that information in, but I think that was the plan
>>>> for newer APIs like Vulkan.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Thanks,
>>>>>     Felix
>>>>>
>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Regards,
>>>>>>>      Felix
>>>>>>>
>>>>>>>
>>>>>>>> Let's talk about this on Mondays call. Thanks for giving the whole
>>>>>>>> context.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>       Felix
>>>>>>>>>
>
Christian König Jan. 17, 2022, 11:44 a.m. UTC | #28
On 14.01.22 at 18:40, Felix Kuehling wrote:
> Am 2022-01-14 um 12:26 p.m. schrieb Christian König:
>> Am 14.01.22 um 17:44 schrieb Daniel Vetter:
>>> Top post because I tried to catch up on the entire discussion here.
>>>
>>> So fundamentally I'm not opposed to just close this fork() hole once and
>>> for all. The thing that worries me from an upstream/platform pov is
>>> really
>>> only if we don't do it consistently across all drivers.
>>>
>>> So maybe as an idea:
>>> - Do the original patch, but not just for ttm but all gem rendernode
>>>     drivers at least (or maybe even all gem drivers, no idea), with the
>>>     below discussion cleaned up as justification.
>> I know of at least one use case which this will break.
>>
>> A couple of years back we had a discussion on the Mesa mailing list
>> because (IIRC) Marek introduced a background thread to push command
>> submissions to the kernel.
>>
>> That broke because some compositor used to initialize OpenGL and then
>> do a fork(). This indeed worked previously (no GPUVM at that time),
>> but with the addition of the background thread obviously broke.
>>
>> The conclusion back then was that the compositor is broken and needs
>> fixing, but it still essentially means that there could be people out
>> there with really old userspace where this setting would just break
>> the desktop.
>>
>> I'm not really against that change either, but at least in theory we
>> could make fork() work perfectly fine even with VMs and background
>> threads.
> You may regret this if you ever try to build a shared virtual address
> space between GPU and CPU. Then you have two processes (parent and
> child) sharing the same render context and GPU VM address space. But the
> CPU address spaces are different. You can't maintain consistent shared
> virtual address spaces for both processes when the GPU address space is
> shared between them.

That's actually not much of a problem.

All you need to do is to use pthread_atfork() and do the appropriate 
action in parent/child to clean up your context: 
https://man7.org/linux/man-pages/man3/pthread_atfork.3.html

The rest is just making sure that all shared and all private data are 
kept separate at all times. Sharing virtual memory has been done this 
way for decades; it's just that nobody ever did it with a stateful 
device like GPUs.

Regards,
Christian.

>
> Regards,
>    Felix
>
Felix Kuehling Jan. 17, 2022, 2:17 p.m. UTC | #29
On 2022-01-17 at 6:44 a.m., Christian König wrote:
> Am 14.01.22 um 18:40 schrieb Felix Kuehling:
>> Am 2022-01-14 um 12:26 p.m. schrieb Christian König:
>>> Am 14.01.22 um 17:44 schrieb Daniel Vetter:
>>>> Top post because I tried to catch up on the entire discussion here.
>>>>
>>>> So fundamentally I'm not opposed to just close this fork() hole
>>>> once and
>>>> for all. The thing that worries me from an upstream/platform pov is
>>>> really
>>>> only if we don't do it consistently across all drivers.
>>>>
>>>> So maybe as an idea:
>>>> - Do the original patch, but not just for ttm but all gem rendernode
>>>>     drivers at least (or maybe even all gem drivers, no idea), with
>>>> the
>>>>     below discussion cleaned up as justification.
>>> I know of at least one use case which this will break.
>>>
>>> A couple of years back we had a discussion on the Mesa mailing list
>>> because (IIRC) Marek introduced a background thread to push command
>>> submissions to the kernel.
>>>
>>> That broke because some compositor used to initialize OpenGL and then
>>> do a fork(). This indeed worked previously (no GPUVM at that time),
>>> but with the addition of the backround thread obviously broke.
>>>
>>> The conclusion back then was that the compositor is broken and needs
>>> fixing, but it still essentially means that there could be people out
>>> there with really old userspace where this setting would just break
>>> the desktop.
>>>
>>> I'm not really against that change either, but at least in theory we
>>> could make fork() work perfectly fine even with VMs and background
>>> threads.
>> You may regret this if you ever try to build a shared virtual address
>> space between GPU and CPU. Then you have two processes (parent and
>> child) sharing the same render context and GPU VM address space. But the
>> CPU address spaces are different. You can't maintain consistent shared
>> virtual address spaces for both processes when the GPU address space is
>> shared between them.
>
> That's actually not much of a problem.
>
> All you need to do is to use pthread_atfork() and do the appropriate
> action in parent/child to clean up your context:
> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html

Thunk already does that. However, it's not foolproof: pthread_atfork
handlers aren't called when the process is forked with a clone call.
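This caveat can be reproduced from userspace. The sketch below is my own illustration (not from the thread); it assumes Linux with glibc and may need `-pthread` when linking. It registers a pthread_atfork child handler and shows that the handler runs for fork() but is bypassed for a fork-like child created via a raw clone(2) syscall:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

static int atfork_ran;                  /* set in the child by the handler */
static void child_handler(void) { atfork_ran = 1; }

/* Returns atfork_ran as seen by a freshly created child process:
 * 1 if the pthread_atfork child handler ran, 0 if it was bypassed. */
static int child_saw_atfork_handler(int use_raw_clone)
{
    static int registered;
    if (!registered) {
        pthread_atfork(NULL, NULL, child_handler);
        registered = 1;
    }
    atfork_ran = 0;

    pid_t pid = use_raw_clone
        /* clone(2) with just SIGCHLD behaves like fork(), but bypasses
         * the glibc fork() wrapper that invokes the atfork handlers. */
        ? (pid_t)syscall(SYS_clone, (unsigned long)SIGCHLD,
                         NULL, NULL, NULL, NULL)
        : fork();
    if (pid == 0)
        _exit(atfork_ran);              /* report what the child saw */

    int status = 0;
    waitpid(pid, &status, 0);
    return WEXITSTATUS(status);
}
```

So any cleanup the Thunk hangs on pthread_atfork simply never runs for the clone() path.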


>
> The rest is just to make sure that all shared and all private data are
> kept separate all the time. Sharing virtual memory is already done for
> decades this way, it's just that nobody ever did it with a statefull
> device like GPUs.
My concern is not with sharing or not sharing data. It's with sharing
the address space itself. If you share the render node, you share GPU
virtual address space. However CPU address space is not shared between
parent and child. That's a fundamental mismatch between the CPU world
and current GPU driver implementation.

Regards,
  Felix


>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>
Christian König Jan. 17, 2022, 2:21 p.m. UTC | #30
Am 17.01.22 um 15:17 schrieb Felix Kuehling:
> Am 2022-01-17 um 6:44 a.m. schrieb Christian König:
>> Am 14.01.22 um 18:40 schrieb Felix Kuehling:
>>> Am 2022-01-14 um 12:26 p.m. schrieb Christian König:
>>>> Am 14.01.22 um 17:44 schrieb Daniel Vetter:
>>>>> Top post because I tried to catch up on the entire discussion here.
>>>>>
>>>>> So fundamentally I'm not opposed to just close this fork() hole
>>>>> once and
>>>>> for all. The thing that worries me from a upstream/platform pov is
>>>>> really
>>>>> only if we don't do it consistently across all drivers.
>>>>>
>>>>> So maybe as an idea:
>>>>> - Do the original patch, but not just for ttm but all gem rendernode
>>>>>      drivers at least (or maybe even all gem drivers, no idea), with
>>>>> the
>>>>>      below discussion cleaned up as justification.
>>>> I know of at least one use case which this will break.
>>>>
>>>> A couple of years back we had a discussion on the Mesa mailing list
>>>> because (IIRC) Marek introduced a background thread to push command
>>>> submissions to the kernel.
>>>>
>>>> That broke because some compositor used to initialize OpenGL and then
>>>> do a fork(). This indeed worked previously (no GPUVM at that time),
>>>> but with the addition of the backround thread obviously broke.
>>>>
>>>> The conclusion back then was that the compositor is broken and needs
>>>> fixing, but it still essentially means that there could be people out
>>>> there with really old userspace where this setting would just break
>>>> the desktop.
>>>>
>>>> I'm not really against that change either, but at least in theory we
>>>> could make fork() work perfectly fine even with VMs and background
>>>> threads.
>>> You may regret this if you ever try to build a shared virtual address
>>> space between GPU and CPU. Then you have two processes (parent and
>>> child) sharing the same render context and GPU VM address space. But the
>>> CPU address spaces are different. You can't maintain consistent shared
>>> virtual address spaces for both processes when the GPU address space is
>>> shared between them.
>> That's actually not much of a problem.
>>
>> All you need to do is to use pthread_atfork() and do the appropriate
>> action in parent/child to clean up your context:
>> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html
> Thunk already does that. However, it's not foolproof. pthread_atfork
> hanlders aren't called when the process is forked with a clone call.

Yeah, but that's perfectly intentional. clone() is usually used to 
create threads.

>> The rest is just to make sure that all shared and all private data are
>> kept separate all the time. Sharing virtual memory is already done for
>> decades this way, it's just that nobody ever did it with a statefull
>> device like GPUs.
> My concern is not with sharing or not sharing data. It's with sharing
> the address space itself. If you share the render node, you share GPU
> virtual address space. However CPU address space is not shared between
> parent and child. That's a fundamental mismatch between the CPU world
> and current GPU driver implementation.

Correct, but even that is easily solvable. As I said before, you can hang 
this state on a VMA and let it be cloned together with the CPU address 
space.

Since VMAs are informed about their cloning (in contrast to file 
descriptors), it's trivial to even just clone kernel data on first access.
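For context, the VMA notification Christian refers to is the `vm_operations_struct::open` callback, which the kernel invokes whenever a VMA is duplicated (e.g. in `dup_mmap()` during fork). A rough, non-buildable sketch of how a driver could hook it — the `my_*` names are hypothetical, only the `open`/`close`/`fault` callbacks are real kernel API:

```c
/* Hypothetical driver code, for illustration only. */
static void my_vm_open(struct vm_area_struct *vma)
{
	struct my_bo *bo = vma->vm_private_data;

	/* Called when this VMA is copied into the child's mm on fork();
	 * the driver could mark per-mapping state here and lazily clone
	 * its kernel-side data on first access from the child. */
	my_bo_mark_needs_clone(bo);
}

static const struct vm_operations_struct my_vm_ops = {
	.open  = my_vm_open,
	.close = my_vm_close,	/* hypothetical teardown counterpart */
	.fault = my_vm_fault,	/* hypothetical fault handler */
};
```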

Regards,
Christian.

>
> Regards,
>    Felix
>
>
>> Regards,
>> Christian.
>>
>>> Regards,
>>>     Felix
>>>
Felix Kuehling Jan. 17, 2022, 2:34 p.m. UTC | #31
Am 2022-01-17 um 9:21 a.m. schrieb Christian König:
> Am 17.01.22 um 15:17 schrieb Felix Kuehling:
>> Am 2022-01-17 um 6:44 a.m. schrieb Christian König:
>>> Am 14.01.22 um 18:40 schrieb Felix Kuehling:
>>>> Am 2022-01-14 um 12:26 p.m. schrieb Christian König:
>>>>> Am 14.01.22 um 17:44 schrieb Daniel Vetter:
>>>>>> Top post because I tried to catch up on the entire discussion here.
>>>>>>
>>>>>> So fundamentally I'm not opposed to just close this fork() hole
>>>>>> once and
>>>>>> for all. The thing that worries me from a upstream/platform pov is
>>>>>> really
>>>>>> only if we don't do it consistently across all drivers.
>>>>>>
>>>>>> So maybe as an idea:
>>>>>> - Do the original patch, but not just for ttm but all gem rendernode
>>>>>>      drivers at least (or maybe even all gem drivers, no idea), with
>>>>>> the
>>>>>>      below discussion cleaned up as justification.
>>>>> I know of at least one use case which this will break.
>>>>>
>>>>> A couple of years back we had a discussion on the Mesa mailing list
>>>>> because (IIRC) Marek introduced a background thread to push command
>>>>> submissions to the kernel.
>>>>>
>>>>> That broke because some compositor used to initialize OpenGL and then
>>>>> do a fork(). This indeed worked previously (no GPUVM at that time),
>>>>> but with the addition of the backround thread obviously broke.
>>>>>
>>>>> The conclusion back then was that the compositor is broken and needs
>>>>> fixing, but it still essentially means that there could be people out
>>>>> there with really old userspace where this setting would just break
>>>>> the desktop.
>>>>>
>>>>> I'm not really against that change either, but at least in theory we
>>>>> could make fork() work perfectly fine even with VMs and background
>>>>> threads.
>>>> You may regret this if you ever try to build a shared virtual address
>>>> space between GPU and CPU. Then you have two processes (parent and
>>>> child) sharing the same render context and GPU VM address space.
>>>> But the
>>>> CPU address spaces are different. You can't maintain consistent shared
>>>> virtual address spaces for both processes when the GPU address
>>>> space is
>>>> shared between them.
>>> That's actually not much of a problem.
>>>
>>> All you need to do is to use pthread_atfork() and do the appropriate
>>> action in parent/child to clean up your context:
>>> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html
>> Thunk already does that. However, it's not foolproof. pthread_atfork
>> hanlders aren't called when the process is forked with a clone call.
>
> Yeah, but that's perfectly intentional. clone() is usually used to
> create threads.

clone() can be used to create new processes too, even if that's maybe not the common use today.


>
>>> The rest is just to make sure that all shared and all private data are
>>> kept separate all the time. Sharing virtual memory is already done for
>>> decades this way, it's just that nobody ever did it with a statefull
>>> device like GPUs.
>> My concern is not with sharing or not sharing data. It's with sharing
>> the address space itself. If you share the render node, you share GPU
>> virtual address space. However CPU address space is not shared between
>> parent and child. That's a fundamental mismatch between the CPU world
>> and current GPU driver implementation.
>
> Correct, but even that is easily solvable. As I said before you can
> hang this state on a VMA and let it be cloned together with the CPU
> address space.

I'm not following. The address space I'm talking about is struct
amdgpu_vm. It's associated with the render node file descriptor.
Inheriting and using that file descriptor in the child inherits the
amdgpu_vm. I don't see how you can hang that state on any one VMA.

To be consistent with the CPU, you'd need to clone the GPU address space
(struct amdgpu_vm) in the child process. That means you need a new
render node file descriptor that imports all the BOs from the parent
address space. It's a bunch of extra work to fork a process, which you're
proposing to immediately undo with an atfork handler. So I really don't
see the point.

Regards,
  Felix


>
> Since VMAs are informed about their cloning (in opposite to file
> descriptors) it's trivial to even just clone kernel data on first access.
>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>
>>> Regards,
>>> Christian.
>>>
>>>> Regards,
>>>>     Felix
>>>>
>
Marek Olšák Jan. 17, 2022, 2:50 p.m. UTC | #32
I don't think fork() would work with userspace where all buffers are
shared. It certainly doesn't work now. The driver needs to be notified that
a buffer or texture is shared to ensure data coherency between processes,
and the driver must execute decompression and other render passes when a
buffer or texture is being shared for the first time. Those aren't called
when fork() is called.

Marek

On Mon, Jan 17, 2022 at 9:34 AM Felix Kuehling <felix.kuehling@amd.com>
wrote:

> Am 2022-01-17 um 9:21 a.m. schrieb Christian König:
> > Am 17.01.22 um 15:17 schrieb Felix Kuehling:
> >> Am 2022-01-17 um 6:44 a.m. schrieb Christian König:
> >>> Am 14.01.22 um 18:40 schrieb Felix Kuehling:
> >>>> Am 2022-01-14 um 12:26 p.m. schrieb Christian König:
> >>>>> Am 14.01.22 um 17:44 schrieb Daniel Vetter:
> >>>>>> Top post because I tried to catch up on the entire discussion here.
> >>>>>>
> >>>>>> So fundamentally I'm not opposed to just close this fork() hole
> >>>>>> once and
> >>>>>> for all. The thing that worries me from a upstream/platform pov is
> >>>>>> really
> >>>>>> only if we don't do it consistently across all drivers.
> >>>>>>
> >>>>>> So maybe as an idea:
> >>>>>> - Do the original patch, but not just for ttm but all gem rendernode
> >>>>>>      drivers at least (or maybe even all gem drivers, no idea), with
> >>>>>> the
> >>>>>>      below discussion cleaned up as justification.
> >>>>> I know of at least one use case which this will break.
> >>>>>
> >>>>> A couple of years back we had a discussion on the Mesa mailing list
> >>>>> because (IIRC) Marek introduced a background thread to push command
> >>>>> submissions to the kernel.
> >>>>>
> >>>>> That broke because some compositor used to initialize OpenGL and then
> >>>>> do a fork(). This indeed worked previously (no GPUVM at that time),
> >>>>> but with the addition of the backround thread obviously broke.
> >>>>>
> >>>>> The conclusion back then was that the compositor is broken and needs
> >>>>> fixing, but it still essentially means that there could be people out
> >>>>> there with really old userspace where this setting would just break
> >>>>> the desktop.
> >>>>>
> >>>>> I'm not really against that change either, but at least in theory we
> >>>>> could make fork() work perfectly fine even with VMs and background
> >>>>> threads.
> >>>> You may regret this if you ever try to build a shared virtual address
> >>>> space between GPU and CPU. Then you have two processes (parent and
> >>>> child) sharing the same render context and GPU VM address space.
> >>>> But the
> >>>> CPU address spaces are different. You can't maintain consistent shared
> >>>> virtual address spaces for both processes when the GPU address
> >>>> space is
> >>>> shared between them.
> >>> That's actually not much of a problem.
> >>>
> >>> All you need to do is to use pthread_atfork() and do the appropriate
> >>> action in parent/child to clean up your context:
> >>> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html
> >> Thunk already does that. However, it's not foolproof. pthread_atfork
> >> hanlders aren't called when the process is forked with a clone call.
> >
> > Yeah, but that's perfectly intentional. clone() is usually used to
> > create threads.
>
> Clone can be used to create new processes. Maybe not the common use today.
>
>
> >
> >>> The rest is just to make sure that all shared and all private data are
> >>> kept separate all the time. Sharing virtual memory is already done for
> >>> decades this way, it's just that nobody ever did it with a statefull
> >>> device like GPUs.
> >> My concern is not with sharing or not sharing data. It's with sharing
> >> the address space itself. If you share the render node, you share GPU
> >> virtual address space. However CPU address space is not shared between
> >> parent and child. That's a fundamental mismatch between the CPU world
> >> and current GPU driver implementation.
> >
> > Correct, but even that is easily solvable. As I said before you can
> > hang this state on a VMA and let it be cloned together with the CPU
> > address space.
>
> I'm not following. The address space I'm talking about is struct
> amdgpu_vm. It's associated with the render node file descriptor.
> Inheriting and using that file descriptor in the child inherits the
> amdgpu_vm. I don't see how you can hang that state on any one VMA.
>
> To be consistent with the CPU, you'd need to clone the GPU address space
> (struct amdgpu_vm) in the child process. That means you need a new
> render node file descriptor that imports all the BOs from the parent
> address space. It's a bunch of extra work to fork a process, that you're
> proposing to immediately undo with an atfork handler. So I really don't
> see the point.
>
> Regards,
>   Felix
>
>
> >
> > Since VMAs are informed about their cloning (in opposite to file
> > descriptors) it's trivial to even just clone kernel data on first access.
> >
> > Regards,
> > Christian.
> >
> >>
> >> Regards,
> >>    Felix
> >>
> >>
> >>> Regards,
> >>> Christian.
> >>>
> >>>> Regards,
> >>>>     Felix
> >>>>
> >
>
Christian König Jan. 17, 2022, 4:23 p.m. UTC | #33
Am 17.01.22 um 15:50 schrieb Marek Olšák:
> I don't think fork() would work with userspace where all buffers are 
> shared. It certainly doesn't work now. The driver needs to be notified 
> that a buffer or texture is shared to ensure data coherency between 
> processes, and the driver must execute decompression and other render 
> passes when a buffer or texture is being shared for the first time. 
> Those aren't called when fork() is called.

Yeah, that's why you can install handlers which run before/after fork() 
is executed. But to summarize, it is illegal for OpenGL, so we don't 
really need to worry about it.

For compute there are a couple of use cases, but even those are not 
real-world ones as far as I know.

But see below.

>
> Marek
>
> On Mon, Jan 17, 2022 at 9:34 AM Felix Kuehling <felix.kuehling@amd.com 
> <mailto:felix.kuehling@amd.com>> wrote:
>
>     Am 2022-01-17 um 9:21 a.m. schrieb Christian König:
>     > Am 17.01.22 um 15:17 schrieb Felix Kuehling:
>     >> Am 2022-01-17 um 6:44 a.m. schrieb Christian König:
>     >>> Am 14.01.22 um 18:40 schrieb Felix Kuehling:
>     >>>> Am 2022-01-14 um 12:26 p.m. schrieb Christian König:
>     >>>>> Am 14.01.22 um 17:44 schrieb Daniel Vetter:
>     >>>>>> Top post because I tried to catch up on the entire
>     discussion here.
>     >>>>>>
>     >>>>>> So fundamentally I'm not opposed to just close this fork() hole
>     >>>>>> once and
>     >>>>>> for all. The thing that worries me from a upstream/platform
>     pov is
>     >>>>>> really
>     >>>>>> only if we don't do it consistently across all drivers.
>     >>>>>>
>     >>>>>> So maybe as an idea:
>     >>>>>> - Do the original patch, but not just for ttm but all gem
>     rendernode
>     >>>>>>      drivers at least (or maybe even all gem drivers, no
>     idea), with
>     >>>>>> the
>     >>>>>>      below discussion cleaned up as justification.
>     >>>>> I know of at least one use case which this will break.
>     >>>>>
>     >>>>> A couple of years back we had a discussion on the Mesa
>     mailing list
>     >>>>> because (IIRC) Marek introduced a background thread to push
>     command
>     >>>>> submissions to the kernel.
>     >>>>>
>     >>>>> That broke because some compositor used to initialize OpenGL
>     and then
>     >>>>> do a fork(). This indeed worked previously (no GPUVM at that
>     time),
>     >>>>> but with the addition of the backround thread obviously broke.
>     >>>>>
>     >>>>> The conclusion back then was that the compositor is broken
>     and needs
>     >>>>> fixing, but it still essentially means that there could be
>     people out
>     >>>>> there with really old userspace where this setting would
>     just break
>     >>>>> the desktop.
>     >>>>>
>     >>>>> I'm not really against that change either, but at least in
>     theory we
>     >>>>> could make fork() work perfectly fine even with VMs and
>     background
>     >>>>> threads.
>     >>>> You may regret this if you ever try to build a shared virtual
>     address
>     >>>> space between GPU and CPU. Then you have two processes
>     (parent and
>     >>>> child) sharing the same render context and GPU VM address space.
>     >>>> But the
>     >>>> CPU address spaces are different. You can't maintain
>     consistent shared
>     >>>> virtual address spaces for both processes when the GPU address
>     >>>> space is
>     >>>> shared between them.
>     >>> That's actually not much of a problem.
>     >>>
>     >>> All you need to do is to use pthread_atfork() and do the
>     appropriate
>     >>> action in parent/child to clean up your context:
>     >>> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html
>     >> Thunk already does that. However, it's not foolproof.
>     pthread_atfork
>     >> hanlders aren't called when the process is forked with a clone
>     call.
>     >
>     > Yeah, but that's perfectly intentional. clone() is usually used to
>     > create threads.
>
>     Clone can be used to create new processes. Maybe not the common
>     use today.
>
>
>     >
>     >>> The rest is just to make sure that all shared and all private
>     data are
>     >>> kept separate all the time. Sharing virtual memory is already
>     done for
>     >>> decades this way, it's just that nobody ever did it with a
>     statefull
>     >>> device like GPUs.
>     >> My concern is not with sharing or not sharing data. It's with
>     sharing
>     >> the address space itself. If you share the render node, you
>     share GPU
>     >> virtual address space. However CPU address space is not shared
>     between
>     >> parent and child. That's a fundamental mismatch between the CPU
>     world
>     >> and current GPU driver implementation.
>     >
>     > Correct, but even that is easily solvable. As I said before you can
>     > hang this state on a VMA and let it be cloned together with the CPU
>     > address space.
>
>     I'm not following. The address space I'm talking about is struct
>     amdgpu_vm. It's associated with the render node file descriptor.
>     Inheriting and using that file descriptor in the child inherits the
>     amdgpu_vm. I don't see how you can hang that state on any one VMA.
>

But you don't really need that. You can bind the VM to your VMA mapping 
and clone that as necessary.

I'm not sure how else I should describe it, but as far as I know the 
kernel side of this would be rather trivial to do.

Cloning all the userspace state like Marek described above is the much 
harder part.

Regards,
Christian.

>
>     To be consistent with the CPU, you'd need to clone the GPU address
>     space
>     (struct amdgpu_vm) in the child process. That means you need a new
>     render node file descriptor that imports all the BOs from the parent
>     address space. It's a bunch of extra work to fork a process, that
>     you're
>     proposing to immediately undo with an atfork handler. So I really
>     don't
>     see the point.
>
>     Regards,
>       Felix
>
>
>     >
>     > Since VMAs are informed about their cloning (in opposite to file
>     > descriptors) it's trivial to even just clone kernel data on
>     first access.
>     >
>     > Regards,
>     > Christian.
>     >
>     >>
>     >> Regards,
>     >>    Felix
>     >>
>     >>
>     >>> Regards,
>     >>> Christian.
>     >>>
>     >>>> Regards,
>     >>>>     Felix
>     >>>>
>     >
>

Patch

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 09c820045859..d9c4149f36dd 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1058,7 +1058,8 @@  int drm_gem_mmap_obj(struct drm_gem_object *obj, unsigned long obj_size,
 			goto err_drm_gem_object_put;
 		}
 
-		vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+		vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
+				| VM_DONTDUMP | VM_DONTCOPY;
 		vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
 		vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
 	}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 33680c94127c..420a4898fdd2 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -566,7 +566,7 @@  int ttm_bo_mmap_obj(struct vm_area_struct *vma, struct ttm_buffer_object *bo)
 
 	vma->vm_private_data = bo;
 
-	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
 	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
 	return 0;
 }
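For reference, the behavior the patch builds in with VM_DONTCOPY matches what the madvise(MADV_DONTFORK) workaround from the commit message does in userspace: the VMA is simply not copied into the child, so a child access faults. A minimal sketch (my own, not part of the patch; Linux assumed):

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a forked child can still read the mapping, 0 if the
 * child crashed because the VMA was not copied into its mm. */
static int child_can_read(int dontfork)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	p[0] = 42;
	if (dontfork)
		/* Userspace equivalent of the VM_DONTCOPY flag the patch
		 * sets: skip this VMA when building the child's mm. */
		madvise(p, len, MADV_DONTFORK);

	pid_t pid = fork();
	if (pid == 0)
		_exit(p[0] == 42);	/* SIGSEGVs if the VMA is gone */

	int status = 0;
	waitpid(pid, &status, 0);
	munmap(p, len);
	return WIFEXITED(status) && WEXITSTATUS(status) == 1;
}
```

With the patch applied, mappings of GEM/TTM BOs get this behavior by default, without userspace (e.g. the Thunk) having to issue the madvise call itself.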