Message ID | 20211208205344.3034-1-rajneesh.bhardwaj@amd.com (mailing list archive)
---|---
State | New, archived
Series | drm/ttm: Don't inherit GEM object VMAs in child process
On 08.12.21 at 21:53, Rajneesh Bhardwaj wrote:
> When an application having open file access to a node forks, its shared
> mappings also get reflected in the address space of child process even
> though it cannot access them with the object permissions applied. With the
> existing permission checks on the gem objects, it might be reasonable to
> also create the VMAs with VM_DONTCOPY flag so a user space application
> doesn't need to explicitly call the madvise(addr, len, MADV_DONTFORK)
> system call to prevent the pages in the mapped range to appear in the
> address space of the child process. It also prevents the memory leaks
> due to additional reference counts on the mapped BOs in the child
> process that prevented freeing the memory in the parent for which we had
> worked around earlier in the user space inside the thunk library.
>
> Additionally, we faced this issue when using CRIU to checkpoint restore
> an application that had such inherited mappings in the child which
> confuse CRIU when it mmaps on restore. Having this flag set for the
> render node VMAs helps. VMAs mapped via KFD already take care of this so
> this is needed only for the render nodes.

Unfortunately that is most likely a NAK. We already tried something
similar.

While it is illegal by the OpenGL specification and doesn't work for
most userspace stacks, we do have some implementations which call
fork() with a GL context open and expect it to work.

Regards,
Christian.

>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>
> Signed-off-by: David Yat Sin <david.yatsin@amd.com>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
> ---
>  drivers/gpu/drm/drm_gem.c       | 3 ++-
>  drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> index 09c820045859..d9c4149f36dd 100644
> --- a/drivers/gpu/drm/drm_gem.c
> +++ b/drivers/gpu/drm/drm_gem.c
> @@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object *obj, unsigned long obj_size,
>  		goto err_drm_gem_object_put;
>  	}
>
> -	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
> +	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
> +			| VM_DONTDUMP | VM_DONTCOPY;
>  	vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
>  	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
>  }
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> index 33680c94127c..420a4898fdd2 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> @@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct *vma, struct ttm_buffer_object *bo)
>
>  	vma->vm_private_data = bo;
>
> -	vma->vm_flags |= VM_PFNMAP;
> +	vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
>  	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
>  	return 0;
>  }
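For reference, the userspace workaround that the commit message wants to make
unnecessary is an explicit madvise(MADV_DONTFORK) on every BO mapping. A
minimal sketch of that workaround follows; the helper name and the way the
render-node fd and mmap offset are obtained are illustrative assumptions, not
code from the patch or from the thunk library.

/* Minimal sketch of the existing userspace workaround (assumed shape,
 * not actual thunk code): after mapping a BO through a render node fd,
 * opt the range out of fork() inheritance by hand.
 */
#include <sys/mman.h>
#include <sys/types.h>

void *map_bo_without_fork_inheritance(int render_fd, off_t mmap_offset,
				      size_t size)
{
	void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			  MAP_SHARED, render_fd, mmap_offset);

	/* With VM_DONTCOPY set by the kernel at mmap time, this
	 * per-mapping madvise() call would no longer be needed. */
	if (addr != MAP_FAILED)
		madvise(addr, size, MADV_DONTFORK);

	return addr;
}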
Thanks Christian. Would it make it less intrusive if I just use the
flag for ttm bo mmap and remove the drm_gem_mmap_obj change from this
patch? For our use case, just the ttm_bo_mmap_obj change should suffice
and we don't want to put any more work arounds in the user space
(thunk, in our case).

The child cannot access the BOs mapped by the parent anyway with
access restrictions applied so I wonder why even inherit the vma?

On 12/9/2021 2:54 AM, Christian König wrote:
> On 08.12.21 at 21:53, Rajneesh Bhardwaj wrote:
>> [...]
>> render node VMAs helps. VMAs mapped via KFD already take care of this so
>> this is needed only for the render nodes.
>
> Unfortunately that is most likely a NAK. We already tried something
> similar.
>
> While it is illegal by the OpenGL specification and doesn't work for
> most userspace stacks, we do have some implementations which call
> fork() with a GL context open and expect it to work.
>
> Regards,
> Christian.
>
> [...]
Hi Rajneesh,

yes, separating this from the drm_gem_mmap_obj() change is certainly a
good idea.

> The child cannot access the BOs mapped by the parent anyway with
> access restrictions applied

exactly that is not correct. That behavior is actively used by some
userspace stacks as far as I know.

Regards,
Christian.

On 09.12.21 at 16:23, Bhardwaj, Rajneesh wrote:
> Thanks Christian. Would it make it less intrusive if I just use the
> flag for ttm bo mmap and remove the drm_gem_mmap_obj change from this
> patch? For our use case, just the ttm_bo_mmap_obj change should
> suffice and we don't want to put any more work arounds in the user
> space (thunk, in our case).
>
> The child cannot access the BOs mapped by the parent anyway with
> access restrictions applied so I wonder why even inherit the vma?
>
> [...]
Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank
you!

On 12/9/2021 10:27 AM, Christian König wrote:
> Hi Rajneesh,
>
> yes, separating this from the drm_gem_mmap_obj() change is certainly a
> good idea.
>
>> The child cannot access the BOs mapped by the parent anyway with
>> access restrictions applied
>
> exactly that is not correct. That behavior is actively used by some
> userspace stacks as far as I know.
>
> Regards,
> Christian.
>
> [...]
That still won't work.

But I think we could do this change for the amdgpu mmap callback only.

Regards,
Christian.

On 09.12.21 at 16:29, Bhardwaj, Rajneesh wrote:
> Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank
> you!
>
> On 12/9/2021 10:27 AM, Christian König wrote:
>> Hi Rajneesh,
>>
>> yes, separating this from the drm_gem_mmap_obj() change is certainly
>> a good idea.
>>
>> [...]
On 2021-12-09 at 10:30 a.m., Christian König wrote:
> That still won't work.
>
> But I think we could do this change for the amdgpu mmap callback only.

If graphics user mode has problems with it, we could even make this
specific to KFD BOs in the amdgpu_gem_object_mmap callback.

Regards,
Felix

> Regards,
> Christian.
>
> [...]
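A rough sketch of what the two suggestions above could look like in amdgpu's
GEM mmap callback, restricted to KFD-owned BOs. This is only an illustration
of the idea, not the actual v2 patch: the bo->kfd_bo test and the
drm_gem_ttm_mmap() tail call are assumptions about how such a callback would
typically be wired up.

/* Hypothetical sketch of the suggestion above, not the real v2 patch:
 * only BOs that were allocated through KFD get VM_DONTCOPY, so graphics
 * userspace that still relies on fork() keeps today's behaviour.
 * The bo->kfd_bo check is an assumption used for illustration.
 */
static int amdgpu_gem_object_mmap(struct drm_gem_object *obj,
				  struct vm_area_struct *vma)
{
	struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);

	if (bo->kfd_bo)
		vma->vm_flags |= VM_DONTCOPY;

	return drm_gem_ttm_mmap(obj, vma);
}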
On 09.12.21 at 19:28, Felix Kuehling wrote:
> On 2021-12-09 at 10:30 a.m., Christian König wrote:
>> That still won't work.
>>
>> But I think we could do this change for the amdgpu mmap callback only.
> If graphics user mode has problems with it, we could even make this
> specific to KFD BOs in the amdgpu_gem_object_mmap callback.

I think it's fine for the whole amdgpu stack, my concern is more about
radeon, nouveau and the ARM stacks which are using this as well.

That blew up so nicely the last time we tried to change it and I know of
at least one case where radeon was/is used with BOs in a child process.

Regards,
Christian.

> Regards,
> Felix
>
> [...]
On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote:
> On 09.12.21 at 19:28, Felix Kuehling wrote:
> > On 2021-12-09 at 10:30 a.m., Christian König wrote:
> > > That still won't work.
> > >
> > > But I think we could do this change for the amdgpu mmap callback only.
> > If graphics user mode has problems with it, we could even make this
> > specific to KFD BOs in the amdgpu_gem_object_mmap callback.
>
> I think it's fine for the whole amdgpu stack, my concern is more about
> radeon, nouveau and the ARM stacks which are using this as well.
>
> That blew up so nicely the last time we tried to change it and I know of at
> least one case where radeon was/is used with BOs in a child process.

I'm way late and burried again, but I think it'd be good to be consistent
here across drivers. Or at least across drm drivers. And we've had the vma
open/close refcounting to make fork work since forever.

I think if we do this we should really only do this for mmap() where this
applies, but reading through the thread here I'm honestly confused why
this is a problem. If CRIU can't handle forked mmaps it needs to be
thought that, not hacked around. Or at least I'm not understanding why
this shouldn't work ...
-Daniel

> Regards,
> Christian.
>
> [...]
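The vma open/close refcounting mentioned above is the long-standing DRM
pattern where a driver's vm_operations_struct points at drm_gem_vm_open()
and drm_gem_vm_close(), so that duplicating a VMA on fork() also takes a
reference on the backing GEM object. A generic sketch, with the fault
handler left as a driver-specific placeholder:

/* Generic sketch of the pattern, not taken from any particular driver:
 * drm_gem_vm_open()/drm_gem_vm_close() take and drop a GEM object
 * reference whenever a VMA is duplicated (e.g. on fork()) or destroyed.
 * example_gem_fault is a driver-specific placeholder.
 */
static const struct vm_operations_struct example_gem_vm_ops = {
	.fault = example_gem_fault,
	.open  = drm_gem_vm_open,
	.close = drm_gem_vm_close,
};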
On 12/20/2021 4:29 AM, Daniel Vetter wrote:
> On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote:
>> [...]
>> That blew up so nicely the last time we tried to change it and I know of at
>> least one case where radeon was/is used with BOs in a child process.
> I'm way late and burried again, but I think it'd be good to be consistent
> here across drivers. Or at least across drm drivers. And we've had the vma
> open/close refcounting to make fork work since forever.
>
> I think if we do this we should really only do this for mmap() where this
> applies, but reading through the thread here I'm honestly confused why
> this is a problem. If CRIU can't handle forked mmaps it needs to be
> thought that, not hacked around. Or at least I'm not understanding why
> this shouldn't work ...
> -Daniel

Hi Daniel

In the v2
https://lore.kernel.org/all/a1a865f5-ad2c-29c8-cbe4-2635d53eceb6@amd.com/T/
I pretty much limited the scope of the change to KFD BOs on mmap. Regarding
CRIU, I think its not a CRIU problem as CRIU on restore, only tries to
recreate all the child processes and then mmaps all the VMAs it sees (as per
checkpoint snapshot) in the new process address space after the VMA
placements are finalized in the position independent code phase. Since the
inherited VMAs don't have access rights the criu mmap fails.

Regards,

Rajneesh

> [...]
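To make the failure mode described above concrete, here is a heavily
simplified view of what a restore-time VMA recreation loop does for each
recorded mapping; the vma_entry structure and its fields are assumptions
for illustration, not CRIU's actual restorer code.

/* Heavily simplified sketch, not CRIU code: recreate one recorded VMA at
 * its original address.  For an inherited render-node VMA recorded from
 * the child's smaps, this MAP_FIXED mmap() is what fails on restore.
 */
#include <sys/mman.h>
#include <sys/types.h>

struct vma_entry {
	unsigned long start, end;	/* recorded VMA bounds */
	int prot;			/* recorded protection flags */
	int flags;			/* MAP_SHARED/MAP_PRIVATE etc. */
	int fd;				/* reopened backing file, or -1 */
	off_t offset;			/* recorded file offset */
};

static int restore_one_vma(const struct vma_entry *e)
{
	void *addr = mmap((void *)e->start, e->end - e->start, e->prot,
			  e->flags | MAP_FIXED, e->fd, e->offset);

	return addr == MAP_FAILED ? -1 : 0;
}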
On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:
> On 12/20/2021 4:29 AM, Daniel Vetter wrote:
> > [...]
> > I think if we do this we should really only do this for mmap() where this
> > applies, but reading through the thread here I'm honestly confused why
> > this is a problem. If CRIU can't handle forked mmaps it needs to be
> > thought that, not hacked around. Or at least I'm not understanding why
> > this shouldn't work ...
> > -Daniel
>
> Hi Daniel
>
> In the v2
> https://lore.kernel.org/all/a1a865f5-ad2c-29c8-cbe4-2635d53eceb6@amd.com/T/
> I pretty much limited the scope of the change to KFD BOs on mmap. Regarding
> CRIU, I think its not a CRIU problem as CRIU on restore, only tries to
> recreate all the child processes and then mmaps all the VMAs it sees (as per
> checkpoint snapshot) in the new process address space after the VMA
> placements are finalized in the position independent code phase. Since the
> inherited VMAs don't have access rights the criu mmap fails.

Still sounds funky. I think minimally we should have an ack from CRIU
developers that this is officially the right way to solve this problem. I
really don't want to have random one-off hacks that don't work across the
board, for a problem where we (drm subsystem) really shouldn't be the only
one with this problem. Where "this problem" means that the mmap space is
per file description, and not per underlying inode or real device or
whatever. That part sounds like a CRIU problem, and I expect CRIU folks
want a consistent solution across the board for this. Hence please grab an
ack from them.

Cheers, Daniel

> Regards,
>
> Rajneesh
>
> [...]
Adding Adrian Rebel who is the CRIU maintainer and CRIU list On 12/22/2021 3:53 PM, Daniel Vetter wrote: > On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote: >> On 12/20/2021 4:29 AM, Daniel Vetter wrote: >>> On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote: >>>> Am 09.12.21 um 19:28 schrieb Felix Kuehling: >>>>> Am 2021-12-09 um 10:30 a.m. schrieb Christian König: >>>>>> That still won't work. >>>>>> >>>>>> But I think we could do this change for the amdgpu mmap callback only. >>>>> If graphics user mode has problems with it, we could even make this >>>>> specific to KFD BOs in the amdgpu_gem_object_mmap callback. >>>> I think it's fine for the whole amdgpu stack, my concern is more about >>>> radeon, nouveau and the ARM stacks which are using this as well. >>>> >>>> That blew up so nicely the last time we tried to change it and I know of at >>>> least one case where radeon was/is used with BOs in a child process. >>> I'm way late and burried again, but I think it'd be good to be consistent I had committed this change into our amd-staging-drm-next branch last week after I got the ACK and RB from Felix and Christian. >>> here across drivers. Or at least across drm drivers. And we've had the vma >>> open/close refcounting to make fork work since forever. >>> >>> I think if we do this we should really only do this for mmap() where this >>> applies, but reading through the thread here I'm honestly confused why >>> this is a problem. If CRIU can't handle forked mmaps it needs to be >>> thought that, not hacked around. Or at least I'm not understanding why >>> this shouldn't work ... >>> -Daniel >>> >> Hi Daniel >> >> In the v2 >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fall%2Fa1a865f5-ad2c-29c8-cbe4-2635d53eceb6%40amd.com%2FT%2F&data=04%7C01%7Crajneesh.bhardwaj%40amd.com%7Ce4634a16c37149da173408d9c58d1338%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637758031981907821%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=h0z4sO19bsJecMqeHGdz%2BHZElKuyzK%2BW%2FMbLWA79I10%3D&reserved=0 >> I pretty much limited the scope of the change to KFD BOs on mmap. Regarding >> CRIU, I think its not a CRIU problem as CRIU on restore, only tries to >> recreate all the child processes and then mmaps all the VMAs it sees (as per >> checkpoint snapshot) in the new process address space after the VMA >> placements are finalized in the position independent code phase. Since the >> inherited VMAs don't have access rights the criu mmap fails. > Still sounds funky. I think minimally we should have an ack from CRIU > developers that this is officially the right way to solve this problem. I > really don't want to have random one-off hacks that don't work across the > board, for a problem where we (drm subsystem) really shouldn't be the only > one with this problem. Where "this problem" means that the mmap space is > per file description, and not per underlying inode or real device or > whatever. That part sounds like a CRIU problem, and I expect CRIU folks > want a consistent solution across the board for this. Hence please grab an > ack from them. > > Cheers, Daniel Maybe Adrian can share his views on this. Hi Adrian - For the context, on CRIU restore we see mmap failures ( in PIE restore phase) due to permission issues on the (render node) VMAs that were inherited since the application that check pointed had forked. 
The VMAs ideally should not be in the child process but the smaps file shows these VMAs in the child address space. We didn't want to use madvise to avoid this copy and rather change in the kernel mode to limit the impact to our user space library thunk. Based on my understanding, during PIE restore phase, after the VMA placements are finalized, CRIU does a sys_mmap on all the VMA it sees in the VmaEntry list and I think its not an issue as per CRIU design but do you think we could handle this corner case better inside CRIU? > >> Regards, >> >> Rajneesh >> >>>> Regards, >>>> Christian. >>>> >>>>> Regards, >>>>> Felix >>>>> >>>>> >>>>>> Regards, >>>>>> Christian. >>>>>> >>>>>> Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh: >>>>>>> Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank >>>>>>> you! >>>>>>> >>>>>>> On 12/9/2021 10:27 AM, Christian König wrote: >>>>>>>> Hi Rajneesh, >>>>>>>> >>>>>>>> yes, separating this from the drm_gem_mmap_obj() change is certainly >>>>>>>> a good idea. >>>>>>>> >>>>>>>>> The child cannot access the BOs mapped by the parent anyway with >>>>>>>>> access restrictions applied >>>>>>>> exactly that is not correct. That behavior is actively used by some >>>>>>>> userspace stacks as far as I know. >>>>>>>> >>>>>>>> Regards, >>>>>>>> Christian. >>>>>>>> >>>>>>>> Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh: >>>>>>>>> Thanks Christian. Would it make it less intrusive if I just use the >>>>>>>>> flag for ttm bo mmap and remove the drm_gem_mmap_obj change from >>>>>>>>> this patch? For our use case, just the ttm_bo_mmap_obj change >>>>>>>>> should suffice and we don't want to put any more work arounds in >>>>>>>>> the user space (thunk, in our case). >>>>>>>>> >>>>>>>>> The child cannot access the BOs mapped by the parent anyway with >>>>>>>>> access restrictions applied so I wonder why even inherit the vma? >>>>>>>>> >>>>>>>>> On 12/9/2021 2:54 AM, Christian König wrote: >>>>>>>>>> Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj: >>>>>>>>>>> When an application having open file access to a node forks, its >>>>>>>>>>> shared >>>>>>>>>>> mappings also get reflected in the address space of child process >>>>>>>>>>> even >>>>>>>>>>> though it cannot access them with the object permissions applied. >>>>>>>>>>> With the >>>>>>>>>>> existing permission checks on the gem objects, it might be >>>>>>>>>>> reasonable to >>>>>>>>>>> also create the VMAs with VM_DONTCOPY flag so a user space >>>>>>>>>>> application >>>>>>>>>>> doesn't need to explicitly call the madvise(addr, len, >>>>>>>>>>> MADV_DONTFORK) >>>>>>>>>>> system call to prevent the pages in the mapped range to appear in >>>>>>>>>>> the >>>>>>>>>>> address space of the child process. It also prevents the memory >>>>>>>>>>> leaks >>>>>>>>>>> due to additional reference counts on the mapped BOs in the child >>>>>>>>>>> process that prevented freeing the memory in the parent for which >>>>>>>>>>> we had >>>>>>>>>>> worked around earlier in the user space inside the thunk library. >>>>>>>>>>> >>>>>>>>>>> Additionally, we faced this issue when using CRIU to checkpoint >>>>>>>>>>> restore >>>>>>>>>>> an application that had such inherited mappings in the child which >>>>>>>>>>> confuse CRIU when it mmaps on restore. Having this flag set for the >>>>>>>>>>> render node VMAs helps. VMAs mapped via KFD already take care of >>>>>>>>>>> this so >>>>>>>>>>> this is needed only for the render nodes. >>>>>>>>>> Unfortunately that is most likely a NAK. We already tried >>>>>>>>>> something similar. 
>>>>>>>>>> >>>>>>>>>> While it is illegal by the OpenGL specification and doesn't work >>>>>>>>>> for most userspace stacks, we do have some implementations which >>>>>>>>>> call fork() with a GL context open and expect it to work. >>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> Christian. >>>>>>>>>> >>>>>>>>>>> Cc: Felix Kuehling<Felix.Kuehling@amd.com> >>>>>>>>>>> >>>>>>>>>>> Signed-off-by: David Yat Sin<david.yatsin@amd.com> >>>>>>>>>>> Signed-off-by: Rajneesh Bhardwaj<rajneesh.bhardwaj@amd.com> >>>>>>>>>>> --- >>>>>>>>>>> drivers/gpu/drm/drm_gem.c | 3 ++- >>>>>>>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +- >>>>>>>>>>> 2 files changed, 3 insertions(+), 2 deletions(-) >>>>>>>>>>> >>>>>>>>>>> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c >>>>>>>>>>> index 09c820045859..d9c4149f36dd 100644 >>>>>>>>>>> --- a/drivers/gpu/drm/drm_gem.c >>>>>>>>>>> +++ b/drivers/gpu/drm/drm_gem.c >>>>>>>>>>> @@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object >>>>>>>>>>> *obj, unsigned long obj_size, >>>>>>>>>>> goto err_drm_gem_object_put; >>>>>>>>>>> } >>>>>>>>>>> - vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | >>>>>>>>>>> VM_DONTDUMP; >>>>>>>>>>> + vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND >>>>>>>>>>> + | VM_DONTDUMP | VM_DONTCOPY; >>>>>>>>>>> vma->vm_page_prot = >>>>>>>>>>> pgprot_writecombine(vm_get_page_prot(vma->vm_flags)); >>>>>>>>>>> vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot); >>>>>>>>>>> } >>>>>>>>>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c >>>>>>>>>>> b/drivers/gpu/drm/ttm/ttm_bo_vm.c >>>>>>>>>>> index 33680c94127c..420a4898fdd2 100644 >>>>>>>>>>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c >>>>>>>>>>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c >>>>>>>>>>> @@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct >>>>>>>>>>> *vma, struct ttm_buffer_object *bo) >>>>>>>>>>> vma->vm_private_data = bo; >>>>>>>>>>> - vma->vm_flags |= VM_PFNMAP; >>>>>>>>>>> + vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY; >>>>>>>>>>> vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; >>>>>>>>>>> return 0; >>>>>>>>>>> }
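To make the inherited-VMA observation above concrete, here is a small stand-alone sketch; an ordinary file mapping stands in for the render-node BO mapping, since creating a real GEM BO would need driver-specific ioctls. After fork() the mapping shows up in the child's /proc/<pid>/maps exactly as described, even though the child never mapped anything itself:

    /* Minimal illustration (not AMD code): after fork() the child inherits
     * the parent's VMAs, which is what the smaps/maps files show.  A plain
     * file mapping stands in for the render-node BO mapping. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void show_mapping(const char *who, const char *needle)
    {
        char line[512];
        FILE *f = fopen("/proc/self/maps", "r");

        while (f && fgets(line, sizeof(line), f))
            if (strstr(line, needle))
                printf("%s: %s", who, line);
        if (f)
            fclose(f);
    }

    int main(void)
    {
        int fd = open("/tmp/bo-standin", O_RDWR | O_CREAT, 0600);
        void *p;

        if (fd < 0)
            return 1;
        ftruncate(fd, 4096);
        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        show_mapping("parent", "bo-standin");
        if (fork() == 0) {
            /* The mapping is listed here too, even though the child never
             * mapped it -- this is the VMA CRIU later tries to restore. */
            show_mapping("child ", "bo-standin");
            _exit(0);
        }
        wait(NULL);
        return 0;
    }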
Sorry for the typo in my previous email. Please read Adrian Reber* On 12/22/2021 8:49 PM, Bhardwaj, Rajneesh wrote: > > Adding Adrian Rebel who is the CRIU maintainer and CRIU list > > On 12/22/2021 3:53 PM, Daniel Vetter wrote: >> On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote: >>> On 12/20/2021 4:29 AM, Daniel Vetter wrote: >>>> On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote: >>>>> Am 09.12.21 um 19:28 schrieb Felix Kuehling: >>>>>> Am 2021-12-09 um 10:30 a.m. schrieb Christian König: >>>>>>> That still won't work. >>>>>>> >>>>>>> But I think we could do this change for the amdgpu mmap callback only. >>>>>> If graphics user mode has problems with it, we could even make this >>>>>> specific to KFD BOs in the amdgpu_gem_object_mmap callback. >>>>> I think it's fine for the whole amdgpu stack, my concern is more about >>>>> radeon, nouveau and the ARM stacks which are using this as well. >>>>> >>>>> That blew up so nicely the last time we tried to change it and I know of at >>>>> least one case where radeon was/is used with BOs in a child process. >>>> I'm way late and burried again, but I think it'd be good to be consistent > > > I had committed this change into our amd-staging-drm-next branch last > week after I got the ACK and RB from Felix and Christian. > > >>>> here across drivers. Or at least across drm drivers. And we've had the vma >>>> open/close refcounting to make fork work since forever. >>>> >>>> I think if we do this we should really only do this for mmap() where this >>>> applies, but reading through the thread here I'm honestly confused why >>>> this is a problem. If CRIU can't handle forked mmaps it needs to be >>>> thought that, not hacked around. Or at least I'm not understanding why >>>> this shouldn't work ... >>>> -Daniel >>>> >>> Hi Daniel >>> >>> In the v2 >>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fall%2Fa1a865f5-ad2c-29c8-cbe4-2635d53eceb6%40amd.com%2FT%2F&data=04%7C01%7Crajneesh.bhardwaj%40amd.com%7Ce4634a16c37149da173408d9c58d1338%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637758031981907821%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=h0z4sO19bsJecMqeHGdz%2BHZElKuyzK%2BW%2FMbLWA79I10%3D&reserved=0 >>> I pretty much limited the scope of the change to KFD BOs on mmap. Regarding >>> CRIU, I think its not a CRIU problem as CRIU on restore, only tries to >>> recreate all the child processes and then mmaps all the VMAs it sees (as per >>> checkpoint snapshot) in the new process address space after the VMA >>> placements are finalized in the position independent code phase. Since the >>> inherited VMAs don't have access rights the criu mmap fails. >> Still sounds funky. I think minimally we should have an ack from CRIU >> developers that this is officially the right way to solve this problem. I >> really don't want to have random one-off hacks that don't work across the >> board, for a problem where we (drm subsystem) really shouldn't be the only >> one with this problem. Where "this problem" means that the mmap space is >> per file description, and not per underlying inode or real device or >> whatever. That part sounds like a CRIU problem, and I expect CRIU folks >> want a consistent solution across the board for this. Hence please grab an >> ack from them. >> >> Cheers, Daniel > > > Maybe Adrian can share his views on this. 
> > Hi Adrian - For the context, on CRIU restore we see mmap failures ( in > PIE restore phase) due to permission issues on the (render node) VMAs > that were inherited since the application that check pointed had > forked. The VMAs ideally should not be in the child process but the > smaps file shows these VMAs in the child address space. We didn't want > to use madvise to avoid this copy and rather change in the kernel mode > to limit the impact to our user space library thunk. Based on my > understanding, during PIE restore phase, after the VMA placements are > finalized, CRIU does a sys_mmap on all the VMA it sees in the VmaEntry > list and I think its not an issue as per CRIU design but do you think > we could handle this corner case better inside CRIU? > > >>> Regards, >>> >>> Rajneesh >>> >>>>> Regards, >>>>> Christian. >>>>> >>>>>> Regards, >>>>>> Felix >>>>>> >>>>>> >>>>>>> Regards, >>>>>>> Christian. >>>>>>> >>>>>>> Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh: >>>>>>>> Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank >>>>>>>> you! >>>>>>>> >>>>>>>> On 12/9/2021 10:27 AM, Christian König wrote: >>>>>>>>> Hi Rajneesh, >>>>>>>>> >>>>>>>>> yes, separating this from the drm_gem_mmap_obj() change is certainly >>>>>>>>> a good idea. >>>>>>>>> >>>>>>>>>> The child cannot access the BOs mapped by the parent anyway with >>>>>>>>>> access restrictions applied >>>>>>>>> exactly that is not correct. That behavior is actively used by some >>>>>>>>> userspace stacks as far as I know. >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Christian. >>>>>>>>> >>>>>>>>> Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh: >>>>>>>>>> Thanks Christian. Would it make it less intrusive if I just use the >>>>>>>>>> flag for ttm bo mmap and remove the drm_gem_mmap_obj change from >>>>>>>>>> this patch? For our use case, just the ttm_bo_mmap_obj change >>>>>>>>>> should suffice and we don't want to put any more work arounds in >>>>>>>>>> the user space (thunk, in our case). >>>>>>>>>> >>>>>>>>>> The child cannot access the BOs mapped by the parent anyway with >>>>>>>>>> access restrictions applied so I wonder why even inherit the vma? >>>>>>>>>> >>>>>>>>>> On 12/9/2021 2:54 AM, Christian König wrote: >>>>>>>>>>> Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj: >>>>>>>>>>>> When an application having open file access to a node forks, its >>>>>>>>>>>> shared >>>>>>>>>>>> mappings also get reflected in the address space of child process >>>>>>>>>>>> even >>>>>>>>>>>> though it cannot access them with the object permissions applied. >>>>>>>>>>>> With the >>>>>>>>>>>> existing permission checks on the gem objects, it might be >>>>>>>>>>>> reasonable to >>>>>>>>>>>> also create the VMAs with VM_DONTCOPY flag so a user space >>>>>>>>>>>> application >>>>>>>>>>>> doesn't need to explicitly call the madvise(addr, len, >>>>>>>>>>>> MADV_DONTFORK) >>>>>>>>>>>> system call to prevent the pages in the mapped range to appear in >>>>>>>>>>>> the >>>>>>>>>>>> address space of the child process. It also prevents the memory >>>>>>>>>>>> leaks >>>>>>>>>>>> due to additional reference counts on the mapped BOs in the child >>>>>>>>>>>> process that prevented freeing the memory in the parent for which >>>>>>>>>>>> we had >>>>>>>>>>>> worked around earlier in the user space inside the thunk library. 
>>>>>>>>>>>> >>>>>>>>>>>> Additionally, we faced this issue when using CRIU to checkpoint >>>>>>>>>>>> restore >>>>>>>>>>>> an application that had such inherited mappings in the child which >>>>>>>>>>>> confuse CRIU when it mmaps on restore. Having this flag set for the >>>>>>>>>>>> render node VMAs helps. VMAs mapped via KFD already take care of >>>>>>>>>>>> this so >>>>>>>>>>>> this is needed only for the render nodes. >>>>>>>>>>> Unfortunately that is most likely a NAK. We already tried >>>>>>>>>>> something similar. >>>>>>>>>>> >>>>>>>>>>> While it is illegal by the OpenGL specification and doesn't work >>>>>>>>>>> for most userspace stacks, we do have some implementations which >>>>>>>>>>> call fork() with a GL context open and expect it to work. >>>>>>>>>>> >>>>>>>>>>> Regards, >>>>>>>>>>> Christian. >>>>>>>>>>> >>>>>>>>>>>> Cc: Felix Kuehling<Felix.Kuehling@amd.com> >>>>>>>>>>>> >>>>>>>>>>>> Signed-off-by: David Yat Sin<david.yatsin@amd.com> >>>>>>>>>>>> Signed-off-by: Rajneesh Bhardwaj<rajneesh.bhardwaj@amd.com> >>>>>>>>>>>> --- >>>>>>>>>>>> drivers/gpu/drm/drm_gem.c | 3 ++- >>>>>>>>>>>> drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +- >>>>>>>>>>>> 2 files changed, 3 insertions(+), 2 deletions(-) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c >>>>>>>>>>>> index 09c820045859..d9c4149f36dd 100644 >>>>>>>>>>>> --- a/drivers/gpu/drm/drm_gem.c >>>>>>>>>>>> +++ b/drivers/gpu/drm/drm_gem.c >>>>>>>>>>>> @@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object >>>>>>>>>>>> *obj, unsigned long obj_size, >>>>>>>>>>>> goto err_drm_gem_object_put; >>>>>>>>>>>> } >>>>>>>>>>>> - vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | >>>>>>>>>>>> VM_DONTDUMP; >>>>>>>>>>>> + vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND >>>>>>>>>>>> + | VM_DONTDUMP | VM_DONTCOPY; >>>>>>>>>>>> vma->vm_page_prot = >>>>>>>>>>>> pgprot_writecombine(vm_get_page_prot(vma->vm_flags)); >>>>>>>>>>>> vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot); >>>>>>>>>>>> } >>>>>>>>>>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c >>>>>>>>>>>> b/drivers/gpu/drm/ttm/ttm_bo_vm.c >>>>>>>>>>>> index 33680c94127c..420a4898fdd2 100644 >>>>>>>>>>>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c >>>>>>>>>>>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c >>>>>>>>>>>> @@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct >>>>>>>>>>>> *vma, struct ttm_buffer_object *bo) >>>>>>>>>>>> vma->vm_private_data = bo; >>>>>>>>>>>> - vma->vm_flags |= VM_PFNMAP; >>>>>>>>>>>> + vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY; >>>>>>>>>>>> vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; >>>>>>>>>>>> return 0; >>>>>>>>>>>> }
Am 22.12.21 um 21:53 schrieb Daniel Vetter: > On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote: > > [SNIP] > Still sounds funky. I think minimally we should have an ack from CRIU > developers that this is officially the right way to solve this problem. I > really don't want to have random one-off hacks that don't work across the > board, for a problem where we (drm subsystem) really shouldn't be the only > one with this problem. Where "this problem" means that the mmap space is > per file description, and not per underlying inode or real device or > whatever. That part sounds like a CRIU problem, and I expect CRIU folks > want a consistent solution across the board for this. Hence please grab an > ack from them. Unfortunately it's a KFD design problem. AMD used a single device node, then mmaped different objects from the same offset to different processes and expected it to work with the rest of the fs subsystem without churn. So yes, this is indeed because the mmap space is per file descriptor for the use case here. And thanks for pointing this out, this indeed makes the whole change extremely questionable. Regards, Christian. > > Cheers, Daniel >
[+Adrian] Am 2021-12-23 um 2:05 a.m. schrieb Christian König: > Am 22.12.21 um 21:53 schrieb Daniel Vetter: >> On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote: >> >> [SNIP] >> Still sounds funky. I think minimally we should have an ack from CRIU >> developers that this is officially the right way to solve this >> problem. I >> really don't want to have random one-off hacks that don't work across >> the >> board, for a problem where we (drm subsystem) really shouldn't be the >> only >> one with this problem. Where "this problem" means that the mmap space is >> per file description, and not per underlying inode or real device or >> whatever. That part sounds like a CRIU problem, and I expect CRIU folks >> want a consistent solution across the board for this. Hence please >> grab an >> ack from them. > > Unfortunately it's a KFD design problem. AMD used a single device > node, then mmaped different objects from the same offset to different > processes and expected it to work the rest of the fs subsystem without > churn. This may be true for mmaps in the KFD device, but not for mmaps in the DRM render nodes. > > So yes, this is indeed because the mmap space is per file descriptor > for the use case here. No. This is a different problem. The problem has to do with the way that DRM manages mmap permissions. In order to be able to mmap an offset in the render node, there needs to be a BO that was created in the same render node. If you fork a process, it inherits the VMA. But KFD doesn't know anything about the inherited BOs from the parent process. Therefore those BOs don't get checkpointed and restored in the child process. When the CRIU checkpoint is restored, our CRIU plugin never creates a BO corresponding to the VMA in the child process' render node FD. We've also lost the relationship between the parent and child-process' render node FDs. After "fork" the render node FD points to the same struct file in parent and child. After restoring the CRIU checkpoint, they are separate struct files, created by separate "open" system calls. Therefore the mmap call that restores the VMA fails in the child process. At least for KFD, there is no point inheriting BOs from a child process, because the GPU has no way of accessing the BOs in the child process. The child process has no GPU address space, no user mode queues, no way to do anything with the GPU before it completely reinitializes its KFD context. We can workaround this issue in user mode with madvise(..., MADV_DONTFORK). In fact we've already done this for some BOs to avoid a memory leak in the parent process while a child process exists. But it's slightly racy because there is a short time window where VMA exists without the VM_DONTCOPY flag. A fork during that time window could still create a child process with an inherited VMA. Therefore a safer solution is to set the vm_flags in the VMA in the driver when the VMA is first created. Regards, Felix > > And thanks for pointing this out, this indeed makes the whole change > extremely questionable. > > Regards, > Christian. > >> >> Cheers, Daniel >> >
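The user-mode workaround and its race window that Felix describes can be sketched roughly like this (illustrative only, not the actual Thunk code; render_fd and mmap_offset are assumed to come from the usual BO-creation ioctls):

    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/types.h>

    void *map_bo_nofork(int render_fd, off_t mmap_offset, size_t size)
    {
        void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, render_fd, mmap_offset);
        if (addr == MAP_FAILED)
            return NULL;

        /* Race window: a fork() on another thread between the mmap()
         * above and the madvise() below still copies this VMA into the
         * child.  Setting VM_DONTCOPY when the driver creates the VMA
         * closes that window entirely. */
        if (madvise(addr, size, MADV_DONTFORK)) {
            munmap(addr, size);
            return NULL;
        }
        return addr;
    }

This is why setting the flag in the driver at VMA creation time is described as the safer option: there is no moment at which the VMA exists without VM_DONTCOPY.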
Am 04.01.22 um 19:08 schrieb Felix Kuehling: > [+Adrian] > > Am 2021-12-23 um 2:05 a.m. schrieb Christian König: > >> Am 22.12.21 um 21:53 schrieb Daniel Vetter: >>> On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote: >>> >>> [SNIP] >>> Still sounds funky. I think minimally we should have an ack from CRIU >>> developers that this is officially the right way to solve this >>> problem. I >>> really don't want to have random one-off hacks that don't work across >>> the >>> board, for a problem where we (drm subsystem) really shouldn't be the >>> only >>> one with this problem. Where "this problem" means that the mmap space is >>> per file description, and not per underlying inode or real device or >>> whatever. That part sounds like a CRIU problem, and I expect CRIU folks >>> want a consistent solution across the board for this. Hence please >>> grab an >>> ack from them. >> Unfortunately it's a KFD design problem. AMD used a single device >> node, then mmaped different objects from the same offset to different >> processes and expected it to work the rest of the fs subsystem without >> churn. > This may be true for mmaps in the KFD device, but not for mmaps in the > DRM render nodes. Correct, yes. >> So yes, this is indeed because the mmap space is per file descriptor >> for the use case here. > No. This is a different problem. I was already wondering which mmaps through the KFD node we have left which cause problems here. > The problem has to do with the way that DRM manages mmap permissions. In > order to be able to mmap an offset in the render node, there needs to be > a BO that was created in the same render node. If you fork a process, it > inherits the VMA. Yeah, so far it works like designed. > But KFD doesn't know anything about the inherited BOs > from the parent process. Ok, why that? When the KFD is reinitializing it's context why shouldn't it cleanup those VMAs? > Therefore those BOs don't get checkpointed and > restored in the child process. When the CRIU checkpoint is restored, our > CRIU plugin never creates a BO corresponding to the VMA in the child > process' render node FD. We've also lost the relationship between the > parent and child-process' render node FDs. After "fork" the render node > FD points to the same struct file in parent and child. After restoring > the CRIU checkpoint, they are separate struct files, created by separate > "open" system calls. Therefore the mmap call that restores the VMA fails > in the child process. > > At least for KFD, there is no point inheriting BOs from a child process, > because the GPU has no way of accessing the BOs in the child process. > The child process has no GPU address space, no user mode queues, no way > to do anything with the GPU before it completely reinitializes its KFD > context. > > We can workaround this issue in user mode with madvise(..., > MADV_DONTFORK). In fact we've already done this for some BOs to avoid a > memory leak in the parent process while a child process exists. But it's > slightly racy because there is a short time window where VMA exists > without the VM_DONTCOPY flag. A fork during that time window could still > create a child process with an inherited VMA. > > Therefore a safer solution is to set the vm_flags in the VMA in the > driver when the VMA is first created. Thanks for the full explanation, it makes much more sense now. Regards, Christian. > > Regards, > Felix > > >> And thanks for pointing this out, this indeed makes the whole change >> extremely questionable. 
>> >> Regards, >> Christian. >> >>> Cheers, Daniel >>>
Am 2022-01-05 um 3:08 a.m. schrieb Christian König: > Am 04.01.22 um 19:08 schrieb Felix Kuehling: >> [+Adrian] >> >> Am 2021-12-23 um 2:05 a.m. schrieb Christian König: >> >>> Am 22.12.21 um 21:53 schrieb Daniel Vetter: >>>> On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote: >>>> >>>> [SNIP] >>>> Still sounds funky. I think minimally we should have an ack from CRIU >>>> developers that this is officially the right way to solve this >>>> problem. I >>>> really don't want to have random one-off hacks that don't work across >>>> the >>>> board, for a problem where we (drm subsystem) really shouldn't be the >>>> only >>>> one with this problem. Where "this problem" means that the mmap >>>> space is >>>> per file description, and not per underlying inode or real device or >>>> whatever. That part sounds like a CRIU problem, and I expect CRIU >>>> folks >>>> want a consistent solution across the board for this. Hence please >>>> grab an >>>> ack from them. >>> Unfortunately it's a KFD design problem. AMD used a single device >>> node, then mmaped different objects from the same offset to different >>> processes and expected it to work the rest of the fs subsystem without >>> churn. >> This may be true for mmaps in the KFD device, but not for mmaps in the >> DRM render nodes. > > Correct, yes. > >>> So yes, this is indeed because the mmap space is per file descriptor >>> for the use case here. >> No. This is a different problem. > > I was already wondering which mmaps through the KFD node we have left > which cause problems here. We still use the KFD FD for mapping doorbells and HDP flushing. These are both SG BOs, so they cannot be CPU-mapped through render nodes. The KFD FD is also used for mapping signal pages and CWSR trap handlers on old APUs. Those VMAs aren't causing the problem. They still map successfully on restore. > >> The problem has to do with the way that DRM manages mmap permissions. In >> order to be able to mmap an offset in the render node, there needs to be >> a BO that was created in the same render node. If you fork a process, it >> inherits the VMA. > > Yeah, so far it works like designed. > >> But KFD doesn't know anything about the inherited BOs >> from the parent process. > > Ok, why that? When the KFD is reinitializing it's context why > shouldn't it cleanup those VMAs? That cleanup has to be initiated by user mode. Basically closing the old KFD and DRM file descriptors, cleaning up all the user mode VM state, unmapping all the VMAs, etc. Then it reopens KFD and the render nodes and starts from scratch. User mode will do this automatically when it tries to reinitialize ROCm. However, in this case the child process doesn't do that (e.g. a python application using the multi-processing package). The child process does not use ROCm. But you're left with all the dangling VMAs in the child process indefinitely. Regards, Felix > >> Therefore those BOs don't get checkpointed and >> restored in the child process. When the CRIU checkpoint is restored, our >> CRIU plugin never creates a BO corresponding to the VMA in the child >> process' render node FD. We've also lost the relationship between the >> parent and child-process' render node FDs. After "fork" the render node >> FD points to the same struct file in parent and child. After restoring >> the CRIU checkpoint, they are separate struct files, created by separate >> "open" system calls. Therefore the mmap call that restores the VMA fails >> in the child process. 
>> >> At least for KFD, there is no point inheriting BOs from a child process, >> because the GPU has no way of accessing the BOs in the child process. >> The child process has no GPU address space, no user mode queues, no way >> to do anything with the GPU before it completely reinitializes its KFD >> context. >> >> We can workaround this issue in user mode with madvise(..., >> MADV_DONTFORK). In fact we've already done this for some BOs to avoid a >> memory leak in the parent process while a child process exists. But it's >> slightly racy because there is a short time window where VMA exists >> without the VM_DONTCOPY flag. A fork during that time window could still >> create a child process with an inherited VMA. >> >> Therefore a safer solution is to set the vm_flags in the VMA in the >> driver when the VMA is first created. > > Thanks for the full explanation, it makes much more sense now. > > Regards, > Christian. > >> >> Regards, >> Felix >> >> >>> And thanks for pointing this out, this indeed makes the whole change >>> extremely questionable. >>> >>> Regards, >>> Christian. >>> >>>> Cheers, Daniel >>>> >
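A rough sketch of the user-mode re-initialization described above, under the assumption that the child tracks its inherited mappings and file descriptors somewhere (the device paths are examples; the real teardown lives inside the ROCm user-mode stack, not here):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct inherited_mapping { void *addr; size_t size; };

    int reinit_gpu_after_fork(int old_kfd_fd, int old_render_fd,
                              struct inherited_mapping *maps, int n)
    {
        int i, kfd, render;

        for (i = 0; i < n; i++)            /* drop the inherited VMAs */
            munmap(maps[i].addr, maps[i].size);

        close(old_kfd_fd);                 /* drop the inherited FDs */
        close(old_render_fd);

        kfd = open("/dev/kfd", O_RDWR | O_CLOEXEC);
        render = open("/dev/dri/renderD128", O_RDWR | O_CLOEXEC);
        if (kfd < 0 || render < 0)
            return -1;
        /* ...recreate the GPU VM, queues, etc. through the new FDs... */
        return 0;
    }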
Am 2022-01-05 um 11:16 a.m. schrieb Felix Kuehling: >> I was already wondering which mmaps through the KFD node we have left >> which cause problems here. > We still use the KFD FD for mapping doorbells and HDP flushing. These > are both SG BOs, so they cannot be CPU-mapped through render nodes. The > KFD FD is also used for mapping signal pages and CWSR trap handlers on > old APUs. > > Those VMAs aren't causing the problem. They still map successfully on > restore. > > Small correction: KFD already sets the VM_DONTCOPY flag for all KFD FD mappings. The patch under discussion here does the same for DRM FD mappings (at least for KFD BOs). Regards, Felix
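For reference, the narrower KFD-only variant mentioned earlier in the thread could look roughly like the following inside the amdgpu mmap callback. This is a hedged sketch, not the merged patch: is_kfd_bo() is a hypothetical placeholder for however the driver distinguishes KFD BOs from graphics BOs, and the fall-through to drm_gem_ttm_mmap() assumes the usual TTM-based GEM mmap path:

    static int amdgpu_gem_object_mmap(struct drm_gem_object *obj,
                                      struct vm_area_struct *vma)
    {
        struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);

        if (is_kfd_bo(bo))                  /* hypothetical helper */
            vma->vm_flags |= VM_DONTCOPY;   /* not inherited across fork() */

        return drm_gem_ttm_mmap(obj, vma);
    }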
Am 05.01.22 um 17:16 schrieb Felix Kuehling: > [SNIP] >>> But KFD doesn't know anything about the inherited BOs >>> from the parent process. >> Ok, why that? When the KFD is reinitializing it's context why >> shouldn't it cleanup those VMAs? > That cleanup has to be initiated by user mode. Basically closing the old > KFD and DRM file descriptors, cleaning up all the user mode VM state, > unmapping all the VMAs, etc. Then it reopens KFD and the render nodes > and starts from scratch. > > User mode will do this automatically when it tries to reinitialize ROCm. > However, in this case the child process doesn't do that (e.g. a python > application using the multi-processing package). The child process does > not use ROCm. But you're left with all the dangling VMAs in the child > process indefinitely. Oh, not that one again. I'm unfortunately pretty sure that this is a clear NAK then. This python multi-processing package is violating various specifications by doing this fork() and we already had multiple discussions about that. Let's talk about this on Monday's call. Thanks for giving the whole context. Regards, Christian. > > Regards, > Felix >
Am 2022-01-06 um 4:05 a.m. schrieb Christian König: > Am 05.01.22 um 17:16 schrieb Felix Kuehling: >> [SNIP] >>>> But KFD doesn't know anything about the inherited BOs >>>> from the parent process. >>> Ok, why that? When the KFD is reinitializing it's context why >>> shouldn't it cleanup those VMAs? >> That cleanup has to be initiated by user mode. Basically closing the old >> KFD and DRM file descriptors, cleaning up all the user mode VM state, >> unmapping all the VMAs, etc. Then it reopens KFD and the render nodes >> and starts from scratch. >> >> User mode will do this automatically when it tries to reinitialize ROCm. >> However, in this case the child process doesn't do that (e.g. a python >> application using the multi-processing package). The child process does >> not use ROCm. But you're left with all the dangling VMAs in the child >> process indefinitely. > > Oh, not that one again. I'm unfortunately pretty sure that this is an > clear NAK then. > > This python multi-processing package is violating various > specifications by doing this fork() and we already had multiple > discussions about that. Well, it's in wide-spread use. We can't just throw up our hands and say they're buggy and not supported. Also, why does your ACK or NAK depend on this at all. If it's the right thing to do, it's the right thing to do regardless of who benefits from it. In addition, how can a child process that doesn't even use the GPU be in violation of any GPU-driver related specifications. Regards, Felix > > Let's talk about this on Mondays call. Thanks for giving the whole > context. > > Regards, > Christian. > >> >> Regards, >> Felix >> >
Am 06.01.22 um 17:45 schrieb Felix Kuehling: > Am 2022-01-06 um 4:05 a.m. schrieb Christian König: >> Am 05.01.22 um 17:16 schrieb Felix Kuehling: >>> [SNIP] >>>>> But KFD doesn't know anything about the inherited BOs >>>>> from the parent process. >>>> Ok, why that? When the KFD is reinitializing it's context why >>>> shouldn't it cleanup those VMAs? >>> That cleanup has to be initiated by user mode. Basically closing the old >>> KFD and DRM file descriptors, cleaning up all the user mode VM state, >>> unmapping all the VMAs, etc. Then it reopens KFD and the render nodes >>> and starts from scratch. >>> >>> User mode will do this automatically when it tries to reinitialize ROCm. >>> However, in this case the child process doesn't do that (e.g. a python >>> application using the multi-processing package). The child process does >>> not use ROCm. But you're left with all the dangling VMAs in the child >>> process indefinitely. >> Oh, not that one again. I'm unfortunately pretty sure that this is an >> clear NAK then. >> >> This python multi-processing package is violating various >> specifications by doing this fork() and we already had multiple >> discussions about that. > Well, it's in wide-spread use. We can't just throw up our hands and say > they're buggy and not supported. Because that's not my NAK, but rather from upstream. > Also, why does your ACK or NAK depend on this at all. If it's the right > thing to do, it's the right thing to do regardless of who benefits from > it. In addition, how can a child process that doesn't even use the GPU > be in violation of any GPU-driver related specifications. The argument is that the application is broken and needs to be fixed instead of worked around inside the kernel. Regards, Christian. > > Regards, > Felix > > >> Let's talk about this on Mondays call. Thanks for giving the whole >> context. >> >> Regards, >> Christian. >> >>> Regards, >>> Felix >>>
Am 2022-01-06 um 11:48 a.m. schrieb Christian König: > Am 06.01.22 um 17:45 schrieb Felix Kuehling: >> Am 2022-01-06 um 4:05 a.m. schrieb Christian König: >>> Am 05.01.22 um 17:16 schrieb Felix Kuehling: >>>> [SNIP] >>>>>> But KFD doesn't know anything about the inherited BOs >>>>>> from the parent process. >>>>> Ok, why that? When the KFD is reinitializing it's context why >>>>> shouldn't it cleanup those VMAs? >>>> That cleanup has to be initiated by user mode. Basically closing >>>> the old >>>> KFD and DRM file descriptors, cleaning up all the user mode VM state, >>>> unmapping all the VMAs, etc. Then it reopens KFD and the render nodes >>>> and starts from scratch. >>>> >>>> User mode will do this automatically when it tries to reinitialize >>>> ROCm. >>>> However, in this case the child process doesn't do that (e.g. a python >>>> application using the multi-processing package). The child process >>>> does >>>> not use ROCm. But you're left with all the dangling VMAs in the child >>>> process indefinitely. >>> Oh, not that one again. I'm unfortunately pretty sure that this is an >>> clear NAK then. >>> >>> This python multi-processing package is violating various >>> specifications by doing this fork() and we already had multiple >>> discussions about that. >> Well, it's in wide-spread use. We can't just throw up our hands and say >> they're buggy and not supported. > > Because that's not my NAK, but rather from upstream. > >> Also, why does your ACK or NAK depend on this at all. If it's the right >> thing to do, it's the right thing to do regardless of who benefits from >> it. In addition, how can a child process that doesn't even use the GPU >> be in violation of any GPU-driver related specifications. > > The argument is that the application is broken and needs to be fixed > instead of worked around inside the kernel. I still don't get how they the application is broken. Like I said, the child process is not using the GPU. How can the application be fixed in this case? Are you saying, any application that forks and doesn't immediately call exec is broken? Or does an application that forks need to be aware that some other part of the application used the GPU and explicitly free any GPU resources? Thanks, Felix > > Regards, > Christian. > >> >> Regards, >> Felix >> >> >>> Let's talk about this on Mondays call. Thanks for giving the whole >>> context. >>> >>> Regards, >>> Christian. >>> >>>> Regards, >>>> Felix >>>> >
Am 06.01.22 um 17:51 schrieb Felix Kuehling: > Am 2022-01-06 um 11:48 a.m. schrieb Christian König: >> Am 06.01.22 um 17:45 schrieb Felix Kuehling: >>> Am 2022-01-06 um 4:05 a.m. schrieb Christian König: >>> [SNIP] >>> Also, why does your ACK or NAK depend on this at all. If it's the right >>> thing to do, it's the right thing to do regardless of who benefits from >>> it. In addition, how can a child process that doesn't even use the GPU >>> be in violation of any GPU-driver related specifications. >> The argument is that the application is broken and needs to be fixed >> instead of worked around inside the kernel. > I still don't get how they the application is broken. Like I said, the > child process is not using the GPU. How can the application be fixed in > this case? Sounds like I'm still missing some important puzzle pieces for the full picture to figure out why this doesn't work. > Are you saying, any application that forks and doesn't immediately call > exec is broken? More or less yes. We essentially have three possible cases here: 1. Application is already using (for example) OpenGL or OpenCL to do some rendering on the GPU and then calls fork() and expects to use OpenGL both in the parent and the child at the same time. As far as I know that's illegal from the Khronos specification point of view and working around inside the kernel for something not allowed in the first place is seen as futile effort. 2. Application opened the file descriptor, forks and then initializes OpenGL/Vulkan/OpenCL. That's what some compositors are doing to drop into the backround and is explicitly legal. 3. Application calls fork() and then doesn't use the GPU in the child. Either by not using it or calling exec. That should be legal and not cause any problems in the first place. But from your description I still don't get why we are running into problems here. I was assuming that you have case #1 because we previously had some examples of this with this python library, but from your description it seems to be case #3. > Or does an application that forks need to be aware that some other part > of the application used the GPU and explicitly free any GPU resources? Others might fill that information in, but I think that was the plan for newer APIs like Vulkan. Regards, Christian. > > Thanks, > Felix > > >> Regards, >> Christian. >> >>> Regards, >>> Felix >>> >>> >>>> Let's talk about this on Mondays call. Thanks for giving the whole >>>> context. >>>> >>>> Regards, >>>> Christian. >>>> >>>>> Regards, >>>>> Felix >>>>>
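Case #2 above is essentially the classic daemonize pattern; a minimal sketch follows (the render-node path is only an example, and the GL/Vulkan initialization is a placeholder). Because nothing has been mapped yet at fork() time, no BO VMA can be inherited:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/dri/renderD128", O_RDWR | O_CLOEXEC);
        pid_t pid;

        if (fd < 0) {
            perror("open render node");
            return 1;
        }

        pid = fork();            /* nothing is mapped yet, so nothing
                                  * problematic can be inherited */
        if (pid > 0)
            _exit(0);            /* parent exits ... */
        setsid();                /* ... child keeps running in the background */

        /* Only now would the child bring up EGL/Vulkan/OpenCL on fd
         * (placeholder for the real initialization). */
        printf("child %d owns render fd %d\n", (int)getpid(), fd);
        return 0;
    }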
Am 2022-01-07 um 3:56 a.m. schrieb Christian König: > Am 06.01.22 um 17:51 schrieb Felix Kuehling: >> Am 2022-01-06 um 11:48 a.m. schrieb Christian König: >>> Am 06.01.22 um 17:45 schrieb Felix Kuehling: >>>> Am 2022-01-06 um 4:05 a.m. schrieb Christian König: >>>> [SNIP] >>>> Also, why does your ACK or NAK depend on this at all. If it's the >>>> right >>>> thing to do, it's the right thing to do regardless of who benefits >>>> from >>>> it. In addition, how can a child process that doesn't even use the GPU >>>> be in violation of any GPU-driver related specifications. >>> The argument is that the application is broken and needs to be fixed >>> instead of worked around inside the kernel. >> I still don't get how they the application is broken. Like I said, the >> child process is not using the GPU. How can the application be fixed in >> this case? > > Sounds like I'm still missing some important puzzle pieces for the > full picture to figure out why this doesn't work. > >> Are you saying, any application that forks and doesn't immediately call >> exec is broken? > > More or less yes. We essentially have three possible cases here: > > 1. Application is already using (for example) OpenGL or OpenCL to do > some rendering on the GPU and then calls fork() and expects to use > OpenGL both in the parent and the child at the same time. > As far as I know that's illegal from the Khronos specification > point of view and working around inside the kernel for something not > allowed in the first place is seen as futile effort. > > 2. Application opened the file descriptor, forks and then initializes > OpenGL/Vulkan/OpenCL. > That's what some compositors are doing to drop into the backround > and is explicitly legal. > > 3. Application calls fork() and then doesn't use the GPU in the child. > Either by not using it or calling exec. > That should be legal and not cause any problems in the first place. > > But from your description I still don't get why we are running into > problems here. > > I was assuming that you have case #1 because we previously had some > examples of this with this python library, but from your description > it seems to be case #3. Correct. #3 has at least one issue we previously worked around in the Thunk: The inherited VMAs prevent BOs from being freed in the parent process. This manifests as an apparent memory leak. Therefore the Thunk calls madvise(..., MADV_DONTFORK) on all its VRAM allocation. The BOs that are causing problems with CRIU are GTT BOs that weren't covered by this previous workaround. The new issue with CRIU is, that CRIU saves and restores all the VMAs. When trying to restore the inherited VMAs in the child process, the mmap call fails because the child process' render node FD is no longer inherited from the parent, but is instead created by its own "open" system call. The mmap call in the child fails for at least two reasons: * The child process' render node FD doesn't have permission to map the parent process' BO any more * The parent BO doesn't get restored in the child process, so its mmap offset doesn't get updated to the new mmap offset of the restored parent BO by the amdgpu CRIU plugin We could maybe add a whole bunch of complexity in CRIU and our CRIU plugin to fix this. But it's pointless because like you said, actually doing anything with the BO in the child process is illegal anyway (scenario #1 above). The easiest solution seems to be, to just not inherit the VMA in the first place. 
Regards, Felix > >> Or does an application that forks need to be aware that some other part >> of the application used the GPU and explicitly free any GPU resources? > > Others might fill that information in, but I think that was the plan > for newer APIs like Vulkan. > > Regards, > Christian. > >> >> Thanks, >> Felix >> >> >>> Regards, >>> Christian. >>> >>>> Regards, >>>> Felix >>>> >>>> >>>>> Let's talk about this on Mondays call. Thanks for giving the whole >>>>> context. >>>>> >>>>> Regards, >>>>> Christian. >>>>> >>>>>> Regards, >>>>>> Felix >>>>>> >
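For readers less familiar with the DRM mmap permission model referenced here, the check that fails in the restored child is roughly the following, paraphrased and simplified from drm_gem_mmap() in drivers/gpu/drm/drm_gem.c (locking and reference counting omitted):

    static int gem_mmap_permission_sketch(struct file *filp,
                                          struct vm_area_struct *vma)
    {
        struct drm_file *priv = filp->private_data;
        struct drm_device *dev = priv->minor->dev;
        struct drm_vma_offset_node *node;
        struct drm_gem_object *obj;

        /* The fake offset must resolve to a BO... */
        node = drm_vma_offset_exact_lookup_locked(dev->vma_offset_manager,
                                                  vma->vm_pgoff,
                                                  vma_pages(vma));
        if (!node)
            return -EINVAL;     /* no BO behind this offset on this device */

        /* ...and that BO must have been made mappable for this
         * particular struct drm_file. */
        obj = container_of(node, struct drm_gem_object, vma_node);
        if (!drm_vma_node_is_allowed(node, priv))
            return -EACCES;     /* BO exists, but not for this FD */

        return drm_gem_mmap_obj(obj, drm_vma_node_size(node) << PAGE_SHIFT,
                                vma);
    }

A restored child with a freshly opened render node and no restored parent BO fails both conditions, which is the mmap failure described above.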
Hi Christian I have reverted the change from the amd-staging-drm-next as per the discussion. Thank you. Regards Rajneesh On 1/4/2022 1:08 PM, Felix Kuehling wrote: > [+Adrian] > > Am 2021-12-23 um 2:05 a.m. schrieb Christian König: > >> Am 22.12.21 um 21:53 schrieb Daniel Vetter: >>> On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote: >>> >>> [SNIP] >>> Still sounds funky. I think minimally we should have an ack from CRIU >>> developers that this is officially the right way to solve this >>> problem. I >>> really don't want to have random one-off hacks that don't work across >>> the >>> board, for a problem where we (drm subsystem) really shouldn't be the >>> only >>> one with this problem. Where "this problem" means that the mmap space is >>> per file description, and not per underlying inode or real device or >>> whatever. That part sounds like a CRIU problem, and I expect CRIU folks >>> want a consistent solution across the board for this. Hence please >>> grab an >>> ack from them. >> Unfortunately it's a KFD design problem. AMD used a single device >> node, then mmaped different objects from the same offset to different >> processes and expected it to work the rest of the fs subsystem without >> churn. > This may be true for mmaps in the KFD device, but not for mmaps in the > DRM render nodes. > > >> So yes, this is indeed because the mmap space is per file descriptor >> for the use case here. > No. This is a different problem. > > The problem has to do with the way that DRM manages mmap permissions. In > order to be able to mmap an offset in the render node, there needs to be > a BO that was created in the same render node. If you fork a process, it > inherits the VMA. But KFD doesn't know anything about the inherited BOs > from the parent process. Therefore those BOs don't get checkpointed and > restored in the child process. When the CRIU checkpoint is restored, our > CRIU plugin never creates a BO corresponding to the VMA in the child > process' render node FD. We've also lost the relationship between the > parent and child-process' render node FDs. After "fork" the render node > FD points to the same struct file in parent and child. After restoring > the CRIU checkpoint, they are separate struct files, created by separate > "open" system calls. Therefore the mmap call that restores the VMA fails > in the child process. > > At least for KFD, there is no point inheriting BOs from a child process, > because the GPU has no way of accessing the BOs in the child process. > The child process has no GPU address space, no user mode queues, no way > to do anything with the GPU before it completely reinitializes its KFD > context. > > We can workaround this issue in user mode with madvise(..., > MADV_DONTFORK). In fact we've already done this for some BOs to avoid a > memory leak in the parent process while a child process exists. But it's > slightly racy because there is a short time window where VMA exists > without the VM_DONTCOPY flag. A fork during that time window could still > create a child process with an inherited VMA. > > Therefore a safer solution is to set the vm_flags in the VMA in the > driver when the VMA is first created. > > Regards, > Felix > > >> And thanks for pointing this out, this indeed makes the whole change >> extremely questionable. >> >> Regards, >> Christian. >> >>> Cheers, Daniel >>>
Top post because I tried to catch up on the entire discussion here. So fundamentally I'm not opposed to just close this fork() hole once and for all. The thing that worries me from a upstream/platform pov is really only if we don't do it consistently across all drivers. So maybe as an idea: - Do the original patch, but not just for ttm but all gem rendernode drivers at least (or maybe even all gem drivers, no idea), with the below discussion cleaned up as justification. - Get acks from userspace driver folks who know real-world gl/vk usage and khr specs in-depth enough to be able to tell us how much we'll regret this. - Merge it. Or come up with something new. Or maybe stick to the Nack, but then maybe it would be good to document that somewhere in-tree? This entire can of worms just feels way too tricky to have it be handled inconsistently across drivers. And trying to fix these kind of low-level things across drivers once divergences exists is just really painful (e.g. trying to make all dma-buf mmap VM_SPECIAL or the herculeian task Christian is doing to get our dma_resv rules consistent across drivers). Cheers, Daniel On Fri, Jan 07, 2022 at 12:47:45PM -0500, Felix Kuehling wrote: > Am 2022-01-07 um 3:56 a.m. schrieb Christian König: > > Am 06.01.22 um 17:51 schrieb Felix Kuehling: > >> Am 2022-01-06 um 11:48 a.m. schrieb Christian König: > >>> Am 06.01.22 um 17:45 schrieb Felix Kuehling: > >>>> Am 2022-01-06 um 4:05 a.m. schrieb Christian König: > >>>> [SNIP] > >>>> Also, why does your ACK or NAK depend on this at all. If it's the > >>>> right > >>>> thing to do, it's the right thing to do regardless of who benefits > >>>> from > >>>> it. In addition, how can a child process that doesn't even use the GPU > >>>> be in violation of any GPU-driver related specifications. > >>> The argument is that the application is broken and needs to be fixed > >>> instead of worked around inside the kernel. > >> I still don't get how they the application is broken. Like I said, the > >> child process is not using the GPU. How can the application be fixed in > >> this case? > > > > Sounds like I'm still missing some important puzzle pieces for the > > full picture to figure out why this doesn't work. > > > >> Are you saying, any application that forks and doesn't immediately call > >> exec is broken? > > > > More or less yes. We essentially have three possible cases here: > > > > 1. Application is already using (for example) OpenGL or OpenCL to do > > some rendering on the GPU and then calls fork() and expects to use > > OpenGL both in the parent and the child at the same time. > > As far as I know that's illegal from the Khronos specification > > point of view and working around inside the kernel for something not > > allowed in the first place is seen as futile effort. > > > > 2. Application opened the file descriptor, forks and then initializes > > OpenGL/Vulkan/OpenCL. > > That's what some compositors are doing to drop into the backround > > and is explicitly legal. > > > > 3. Application calls fork() and then doesn't use the GPU in the child. > > Either by not using it or calling exec. > > That should be legal and not cause any problems in the first place. > > > > But from your description I still don't get why we are running into > > problems here. > > > > I was assuming that you have case #1 because we previously had some > > examples of this with this python library, but from your description > > it seems to be case #3. > > Correct. 
#3 has at least one issue we previously worked around in the > Thunk: The inherited VMAs prevent BOs from being freed in the parent > process. This manifests as an apparent memory leak. Therefore the Thunk > calls madvise(..., MADV_DONTFORK) on all its VRAM allocation. The BOs > that are causing problems with CRIU are GTT BOs that weren't covered by > this previous workaround. > > The new issue with CRIU is, that CRIU saves and restores all the VMAs. > When trying to restore the inherited VMAs in the child process, the mmap > call fails because the child process' render node FD is no longer > inherited from the parent, but is instead created by its own "open" > system call. The mmap call in the child fails for at least two reasons: > > * The child process' render node FD doesn't have permission to map the > parent process' BO any more > * The parent BO doesn't get restored in the child process, so its mmap > offset doesn't get updated to the new mmap offset of the restored > parent BO by the amdgpu CRIU plugin > > We could maybe add a whole bunch of complexity in CRIU and our CRIU > plugin to fix this. But it's pointless because like you said, actually > doing anything with the BO in the child process is illegal anyway > (scenario #1 above). The easiest solution seems to be, to just not > inherit the VMA in the first place. > > Regards, > Felix > > > > > >> Or does an application that forks need to be aware that some other part > >> of the application used the GPU and explicitly free any GPU resources? > > > > Others might fill that information in, but I think that was the plan > > for newer APIs like Vulkan. > > > > Regards, > > Christian. > > > >> > >> Thanks, > >> Felix > >> > >> > >>> Regards, > >>> Christian. > >>> > >>>> Regards, > >>>> Felix > >>>> > >>>> > >>>>> Let's talk about this on Mondays call. Thanks for giving the whole > >>>>> context. > >>>>> > >>>>> Regards, > >>>>> Christian. > >>>>> > >>>>>> Regards, > >>>>>> Felix > >>>>>> > >
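Daniel's first bullet, taken literally, would amount to something like the following in common code. gem_vma_set_nofork() is purely illustrative, not an existing DRM helper, and whether to key the behaviour off render nodes at all is exactly the open question in this thread:

    static void gem_vma_set_nofork(struct drm_file *file_priv,
                                   struct vm_area_struct *vma)
    {
        /* The idea under discussion: mappings made through a render node
         * would not be inherited across fork(); primary (display) node
         * mappings keep today's behaviour. */
        if (drm_is_render_client(file_priv))
            vma->vm_flags |= VM_DONTCOPY;
    }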
Am 14.01.22 um 17:44 schrieb Daniel Vetter: > Top post because I tried to catch up on the entire discussion here. > > So fundamentally I'm not opposed to just close this fork() hole once and > for all. The thing that worries me from a upstream/platform pov is really > only if we don't do it consistently across all drivers. > > So maybe as an idea: > - Do the original patch, but not just for ttm but all gem rendernode > drivers at least (or maybe even all gem drivers, no idea), with the > below discussion cleaned up as justification. I know of at least one use case which this will break. A couple of years back we had a discussion on the Mesa mailing list because (IIRC) Marek introduced a background thread to push command submissions to the kernel. That broke because some compositor used to initialize OpenGL and then do a fork(). This indeed worked previously (no GPUVM at that time), but with the addition of the backround thread obviously broke. The conclusion back then was that the compositor is broken and needs fixing, but it still essentially means that there could be people out there with really old userspace where this setting would just break the desktop. I'm not really against that change either, but at least in theory we could make fork() work perfectly fine even with VMs and background threads. Regards, Christian. > - Get acks from userspace driver folks who know real-world gl/vk usage and > khr specs in-depth enough to be able to tell us how much we'll regret > this. > > - Merge it. Or come up with something new. Or maybe stick to the Nack, but > then maybe it would be good to document that somewhere in-tree? > > This entire can of worms just feels way too tricky to have it be handled > inconsistently across drivers. And trying to fix these kind of low-level > things across drivers once divergences exists is just really painful (e.g. > trying to make all dma-buf mmap VM_SPECIAL or the herculeian task > Christian is doing to get our dma_resv rules consistent across drivers). > > Cheers, Daniel > > On Fri, Jan 07, 2022 at 12:47:45PM -0500, Felix Kuehling wrote: >> Am 2022-01-07 um 3:56 a.m. schrieb Christian König: >>> Am 06.01.22 um 17:51 schrieb Felix Kuehling: >>>> Am 2022-01-06 um 11:48 a.m. schrieb Christian König: >>>>> Am 06.01.22 um 17:45 schrieb Felix Kuehling: >>>>>> Am 2022-01-06 um 4:05 a.m. schrieb Christian König: >>>>>> [SNIP] >>>>>> Also, why does your ACK or NAK depend on this at all. If it's the >>>>>> right >>>>>> thing to do, it's the right thing to do regardless of who benefits >>>>>> from >>>>>> it. In addition, how can a child process that doesn't even use the GPU >>>>>> be in violation of any GPU-driver related specifications. >>>>> The argument is that the application is broken and needs to be fixed >>>>> instead of worked around inside the kernel. >>>> I still don't get how they the application is broken. Like I said, the >>>> child process is not using the GPU. How can the application be fixed in >>>> this case? >>> Sounds like I'm still missing some important puzzle pieces for the >>> full picture to figure out why this doesn't work. >>> >>>> Are you saying, any application that forks and doesn't immediately call >>>> exec is broken? >>> More or less yes. We essentially have three possible cases here: >>> >>> 1. Application is already using (for example) OpenGL or OpenCL to do >>> some rendering on the GPU and then calls fork() and expects to use >>> OpenGL both in the parent and the child at the same time. 
>>> As far as I know that's illegal from the Khronos specification >>> point of view and working around inside the kernel for something not >>> allowed in the first place is seen as futile effort. >>> >>> 2. Application opened the file descriptor, forks and then initializes >>> OpenGL/Vulkan/OpenCL. >>> That's what some compositors are doing to drop into the backround >>> and is explicitly legal. >>> >>> 3. Application calls fork() and then doesn't use the GPU in the child. >>> Either by not using it or calling exec. >>> That should be legal and not cause any problems in the first place. >>> >>> But from your description I still don't get why we are running into >>> problems here. >>> >>> I was assuming that you have case #1 because we previously had some >>> examples of this with this python library, but from your description >>> it seems to be case #3. >> Correct. #3 has at least one issue we previously worked around in the >> Thunk: The inherited VMAs prevent BOs from being freed in the parent >> process. This manifests as an apparent memory leak. Therefore the Thunk >> calls madvise(..., MADV_DONTFORK) on all its VRAM allocation. The BOs >> that are causing problems with CRIU are GTT BOs that weren't covered by >> this previous workaround. >> >> The new issue with CRIU is, that CRIU saves and restores all the VMAs. >> When trying to restore the inherited VMAs in the child process, the mmap >> call fails because the child process' render node FD is no longer >> inherited from the parent, but is instead created by its own "open" >> system call. The mmap call in the child fails for at least two reasons: >> >> * The child process' render node FD doesn't have permission to map the >> parent process' BO any more >> * The parent BO doesn't get restored in the child process, so its mmap >> offset doesn't get updated to the new mmap offset of the restored >> parent BO by the amdgpu CRIU plugin >> >> We could maybe add a whole bunch of complexity in CRIU and our CRIU >> plugin to fix this. But it's pointless because like you said, actually >> doing anything with the BO in the child process is illegal anyway >> (scenario #1 above). The easiest solution seems to be, to just not >> inherit the VMA in the first place. >> >> Regards, >> Felix >> >> >>>> Or does an application that forks need to be aware that some other part >>>> of the application used the GPU and explicitly free any GPU resources? >>> Others might fill that information in, but I think that was the plan >>> for newer APIs like Vulkan. >>> >>> Regards, >>> Christian. >>> >>>> Thanks, >>>> Felix >>>> >>>> >>>>> Regards, >>>>> Christian. >>>>> >>>>>> Regards, >>>>>> Felix >>>>>> >>>>>> >>>>>>> Let's talk about this on Mondays call. Thanks for giving the whole >>>>>>> context. >>>>>>> >>>>>>> Regards, >>>>>>> Christian. >>>>>>> >>>>>>>> Regards, >>>>>>>> Felix >>>>>>>>
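The breakage Christian recalls follows from POSIX fork() semantics: only the calling thread is duplicated, so a userspace driver's background submission thread simply does not exist in the child. A toy demonstration, nothing Mesa-specific (build with -pthread):

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int count_threads(void)
    {
        char line[64];
        int n = -1;
        FILE *f = fopen("/proc/self/status", "r");

        while (f && fgets(line, sizeof(line), f))
            if (sscanf(line, "Threads: %d", &n) == 1)
                break;
        if (f)
            fclose(f);
        return n;
    }

    static void *submission_worker(void *arg)
    {
        (void)arg;
        for (;;)
            pause();        /* stand-in for a queue-and-submit loop */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;

        pthread_create(&tid, NULL, submission_worker, NULL);
        printf("parent threads: %d\n", count_threads());    /* prints 2 */

        if (fork() == 0) {
            /* Only the forking thread exists in the child; work queued
             * for the worker would silently never be submitted. */
            printf("child threads:  %d\n", count_threads()); /* prints 1 */
            _exit(0);
        }
        wait(NULL);
        return 0;
    }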
Am 2022-01-14 um 12:26 p.m. schrieb Christian König: > Am 14.01.22 um 17:44 schrieb Daniel Vetter: >> Top post because I tried to catch up on the entire discussion here. >> >> So fundamentally I'm not opposed to just close this fork() hole once and >> for all. The thing that worries me from a upstream/platform pov is >> really >> only if we don't do it consistently across all drivers. >> >> So maybe as an idea: >> - Do the original patch, but not just for ttm but all gem rendernode >> drivers at least (or maybe even all gem drivers, no idea), with the >> below discussion cleaned up as justification. > > I know of at least one use case which this will break. > > A couple of years back we had a discussion on the Mesa mailing list > because (IIRC) Marek introduced a background thread to push command > submissions to the kernel. > > That broke because some compositor used to initialize OpenGL and then > do a fork(). This indeed worked previously (no GPUVM at that time), > but with the addition of the backround thread obviously broke. > > The conclusion back then was that the compositor is broken and needs > fixing, but it still essentially means that there could be people out > there with really old userspace where this setting would just break > the desktop. > > I'm not really against that change either, but at least in theory we > could make fork() work perfectly fine even with VMs and background > threads. You may regret this if you ever try to build a shared virtual address space between GPU and CPU. Then you have two processes (parent and child) sharing the same render context and GPU VM address space. But the CPU address spaces are different. You can't maintain consistent shared virtual address spaces for both processes when the GPU address space is shared between them. Regards, Felix > > Regards, > Christian. > >> - Get acks from userspace driver folks who know real-world gl/vk >> usage and >> khr specs in-depth enough to be able to tell us how much we'll regret >> this. >> >> - Merge it. Or come up with something new. Or maybe stick to the >> Nack, but >> then maybe it would be good to document that somewhere in-tree? >> >> This entire can of worms just feels way too tricky to have it be handled >> inconsistently across drivers. And trying to fix these kind of low-level >> things across drivers once divergences exists is just really painful >> (e.g. >> trying to make all dma-buf mmap VM_SPECIAL or the herculeian task >> Christian is doing to get our dma_resv rules consistent across drivers). >> >> Cheers, Daniel >> >> On Fri, Jan 07, 2022 at 12:47:45PM -0500, Felix Kuehling wrote: >>> Am 2022-01-07 um 3:56 a.m. schrieb Christian König: >>>> Am 06.01.22 um 17:51 schrieb Felix Kuehling: >>>>> Am 2022-01-06 um 11:48 a.m. schrieb Christian König: >>>>>> Am 06.01.22 um 17:45 schrieb Felix Kuehling: >>>>>>> Am 2022-01-06 um 4:05 a.m. schrieb Christian König: >>>>>>> [SNIP] >>>>>>> Also, why does your ACK or NAK depend on this at all. If it's the >>>>>>> right >>>>>>> thing to do, it's the right thing to do regardless of who benefits >>>>>>> from >>>>>>> it. In addition, how can a child process that doesn't even use >>>>>>> the GPU >>>>>>> be in violation of any GPU-driver related specifications. >>>>>> The argument is that the application is broken and needs to be fixed >>>>>> instead of worked around inside the kernel. >>>>> I still don't get how they the application is broken. Like I said, >>>>> the >>>>> child process is not using the GPU. 
How can the application be >>>>> fixed in >>>>> this case? >>>> Sounds like I'm still missing some important puzzle pieces for the >>>> full picture to figure out why this doesn't work. >>>> >>>>> Are you saying, any application that forks and doesn't immediately >>>>> call >>>>> exec is broken? >>>> More or less yes. We essentially have three possible cases here: >>>> >>>> 1. Application is already using (for example) OpenGL or OpenCL to do >>>> some rendering on the GPU and then calls fork() and expects to use >>>> OpenGL both in the parent and the child at the same time. >>>> As far as I know that's illegal from the Khronos specification >>>> point of view and working around inside the kernel for something not >>>> allowed in the first place is seen as futile effort. >>>> >>>> 2. Application opened the file descriptor, forks and then initializes >>>> OpenGL/Vulkan/OpenCL. >>>> That's what some compositors are doing to drop into the backround >>>> and is explicitly legal. >>>> >>>> 3. Application calls fork() and then doesn't use the GPU in the child. >>>> Either by not using it or calling exec. >>>> That should be legal and not cause any problems in the first >>>> place. >>>> >>>> But from your description I still don't get why we are running into >>>> problems here. >>>> >>>> I was assuming that you have case #1 because we previously had some >>>> examples of this with this python library, but from your description >>>> it seems to be case #3. >>> Correct. #3 has at least one issue we previously worked around in the >>> Thunk: The inherited VMAs prevent BOs from being freed in the parent >>> process. This manifests as an apparent memory leak. Therefore the Thunk >>> calls madvise(..., MADV_DONTFORK) on all its VRAM allocation. The BOs >>> that are causing problems with CRIU are GTT BOs that weren't covered by >>> this previous workaround. >>> >>> The new issue with CRIU is, that CRIU saves and restores all the VMAs. >>> When trying to restore the inherited VMAs in the child process, the >>> mmap >>> call fails because the child process' render node FD is no longer >>> inherited from the parent, but is instead created by its own "open" >>> system call. The mmap call in the child fails for at least two reasons: >>> >>> * The child process' render node FD doesn't have permission to >>> map the >>> parent process' BO any more >>> * The parent BO doesn't get restored in the child process, so its >>> mmap >>> offset doesn't get updated to the new mmap offset of the restored >>> parent BO by the amdgpu CRIU plugin >>> >>> We could maybe add a whole bunch of complexity in CRIU and our CRIU >>> plugin to fix this. But it's pointless because like you said, actually >>> doing anything with the BO in the child process is illegal anyway >>> (scenario #1 above). The easiest solution seems to be, to just not >>> inherit the VMA in the first place. >>> >>> Regards, >>> Felix >>> >>> >>>>> Or does an application that forks need to be aware that some other >>>>> part >>>>> of the application used the GPU and explicitly free any GPU >>>>> resources? >>>> Others might fill that information in, but I think that was the plan >>>> for newer APIs like Vulkan. >>>> >>>> Regards, >>>> Christian. >>>> >>>>> Thanks, >>>>> Felix >>>>> >>>>> >>>>>> Regards, >>>>>> Christian. >>>>>> >>>>>>> Regards, >>>>>>> Felix >>>>>>> >>>>>>> >>>>>>>> Let's talk about this on Mondays call. Thanks for giving the whole >>>>>>>> context. >>>>>>>> >>>>>>>> Regards, >>>>>>>> Christian. 
>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Felix >>>>>>>>> >
Am 14.01.22 um 18:40 schrieb Felix Kuehling: > Am 2022-01-14 um 12:26 p.m. schrieb Christian König: >> Am 14.01.22 um 17:44 schrieb Daniel Vetter: >>> Top post because I tried to catch up on the entire discussion here. >>> >>> So fundamentally I'm not opposed to just close this fork() hole once and >>> for all. The thing that worries me from a upstream/platform pov is >>> really >>> only if we don't do it consistently across all drivers. >>> >>> So maybe as an idea: >>> - Do the original patch, but not just for ttm but all gem rendernode >>> drivers at least (or maybe even all gem drivers, no idea), with the >>> below discussion cleaned up as justification. >> I know of at least one use case which this will break. >> >> A couple of years back we had a discussion on the Mesa mailing list >> because (IIRC) Marek introduced a background thread to push command >> submissions to the kernel. >> >> That broke because some compositor used to initialize OpenGL and then >> do a fork(). This indeed worked previously (no GPUVM at that time), >> but with the addition of the backround thread obviously broke. >> >> The conclusion back then was that the compositor is broken and needs >> fixing, but it still essentially means that there could be people out >> there with really old userspace where this setting would just break >> the desktop. >> >> I'm not really against that change either, but at least in theory we >> could make fork() work perfectly fine even with VMs and background >> threads. > You may regret this if you ever try to build a shared virtual address > space between GPU and CPU. Then you have two processes (parent and > child) sharing the same render context and GPU VM address space. But the > CPU address spaces are different. You can't maintain consistent shared > virtual address spaces for both processes when the GPU address space is > shared between them. That's actually not much of a problem. All you need to do is to use pthread_atfork() and do the appropriate action in parent/child to clean up your context: https://man7.org/linux/man-pages/man3/pthread_atfork.3.html The rest is just to make sure that all shared and all private data are kept separate all the time. Sharing virtual memory has already been done this way for decades; it's just that nobody ever did it with a stateful device like GPUs. Regards, Christian. > > Regards, > Felix >
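A minimal sketch of the pthread_atfork() approach suggested above, assuming a hypothetical per-process GPU context; gpu_ctx_create()/gpu_ctx_release() are made-up stand-ins, not a real driver API.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static void *gpu_ctx;   /* hypothetical per-process GPU context */

static void *gpu_ctx_create(void)      { return malloc(1); /* placeholder */ }
static void  gpu_ctx_release(void *ctx) { free(ctx);        /* placeholder */ }

/* Runs in the child right after fork(): drop the parent's GPU state so
 * the child starts clean instead of touching inherited device state. */
static void atfork_child(void)
{
    gpu_ctx_release(gpu_ctx);
    gpu_ctx = NULL;
}

int main(void)
{
    gpu_ctx = gpu_ctx_create();
    pthread_atfork(NULL, NULL, atfork_child);   /* no prepare/parent hooks needed here */

    pid_t pid = fork();
    if (pid == 0) {
        printf("child: gpu_ctx is %p after the atfork handler\n", gpu_ctx);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}

As the reply below points out, this only covers children created through fork(); it does nothing for a child created with a raw clone() call.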
Am 2022-01-17 um 6:44 a.m. schrieb Christian König: > Am 14.01.22 um 18:40 schrieb Felix Kuehling: >> Am 2022-01-14 um 12:26 p.m. schrieb Christian König: >>> Am 14.01.22 um 17:44 schrieb Daniel Vetter: >>>> Top post because I tried to catch up on the entire discussion here. >>>> >>>> So fundamentally I'm not opposed to just close this fork() hole >>>> once and >>>> for all. The thing that worries me from a upstream/platform pov is >>>> really >>>> only if we don't do it consistently across all drivers. >>>> >>>> So maybe as an idea: >>>> - Do the original patch, but not just for ttm but all gem rendernode >>>> drivers at least (or maybe even all gem drivers, no idea), with >>>> the >>>> below discussion cleaned up as justification. >>> I know of at least one use case which this will break. >>> >>> A couple of years back we had a discussion on the Mesa mailing list >>> because (IIRC) Marek introduced a background thread to push command >>> submissions to the kernel. >>> >>> That broke because some compositor used to initialize OpenGL and then >>> do a fork(). This indeed worked previously (no GPUVM at that time), >>> but with the addition of the backround thread obviously broke. >>> >>> The conclusion back then was that the compositor is broken and needs >>> fixing, but it still essentially means that there could be people out >>> there with really old userspace where this setting would just break >>> the desktop. >>> >>> I'm not really against that change either, but at least in theory we >>> could make fork() work perfectly fine even with VMs and background >>> threads. >> You may regret this if you ever try to build a shared virtual address >> space between GPU and CPU. Then you have two processes (parent and >> child) sharing the same render context and GPU VM address space. But the >> CPU address spaces are different. You can't maintain consistent shared >> virtual address spaces for both processes when the GPU address space is >> shared between them. > That's actually not much of a problem. > > All you need to do is to use pthread_atfork() and do the appropriate > action in parent/child to clean up your context: > https://man7.org/linux/man-pages/man3/pthread_atfork.3.html Thunk already does that. However, it's not foolproof. pthread_atfork handlers aren't called when the process is forked with a clone call. > > The rest is just to make sure that all shared and all private data are > kept separate all the time. Sharing virtual memory is already done for > decades this way, it's just that nobody ever did it with a statefull > device like GPUs. My concern is not with sharing or not sharing data. It's with sharing the address space itself. If you share the render node, you share GPU virtual address space. However, the CPU address space is not shared between parent and child. That's a fundamental mismatch between the CPU world and current GPU driver implementation. Regards, Felix > > Regards, > Christian. > >> >> Regards, >> Felix >> >
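A small sketch of that limitation, under the assumption that the child is created with a direct clone() call rather than fork(); the stack size and flags are illustrative only.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static void atfork_child(void) { printf("atfork child handler ran\n"); }

static int child_fn(void *arg)
{
    /* With a raw clone() this is all the child prints: the atfork
     * handler registered below is never invoked for it. */
    printf("child %d running\n", getpid());
    return 0;
}

int main(void)
{
    pthread_atfork(NULL, NULL, atfork_child);

    char *stack = malloc(1 << 16);
    if (!stack) return 1;

    /* SIGCHLD and no CLONE_VM: a separate process much like fork(),
     * but created without going through glibc's fork() wrapper. */
    pid_t pid = clone(child_fn, stack + (1 << 16), SIGCHLD, NULL);
    if (pid < 0) { perror("clone"); return 1; }

    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}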
Am 17.01.22 um 15:17 schrieb Felix Kuehling: > Am 2022-01-17 um 6:44 a.m. schrieb Christian König: >> Am 14.01.22 um 18:40 schrieb Felix Kuehling: >>> Am 2022-01-14 um 12:26 p.m. schrieb Christian König: >>>> Am 14.01.22 um 17:44 schrieb Daniel Vetter: >>>>> Top post because I tried to catch up on the entire discussion here. >>>>> >>>>> So fundamentally I'm not opposed to just close this fork() hole >>>>> once and >>>>> for all. The thing that worries me from a upstream/platform pov is >>>>> really >>>>> only if we don't do it consistently across all drivers. >>>>> >>>>> So maybe as an idea: >>>>> - Do the original patch, but not just for ttm but all gem rendernode >>>>> drivers at least (or maybe even all gem drivers, no idea), with >>>>> the >>>>> below discussion cleaned up as justification. >>>> I know of at least one use case which this will break. >>>> >>>> A couple of years back we had a discussion on the Mesa mailing list >>>> because (IIRC) Marek introduced a background thread to push command >>>> submissions to the kernel. >>>> >>>> That broke because some compositor used to initialize OpenGL and then >>>> do a fork(). This indeed worked previously (no GPUVM at that time), >>>> but with the addition of the backround thread obviously broke. >>>> >>>> The conclusion back then was that the compositor is broken and needs >>>> fixing, but it still essentially means that there could be people out >>>> there with really old userspace where this setting would just break >>>> the desktop. >>>> >>>> I'm not really against that change either, but at least in theory we >>>> could make fork() work perfectly fine even with VMs and background >>>> threads. >>> You may regret this if you ever try to build a shared virtual address >>> space between GPU and CPU. Then you have two processes (parent and >>> child) sharing the same render context and GPU VM address space. But the >>> CPU address spaces are different. You can't maintain consistent shared >>> virtual address spaces for both processes when the GPU address space is >>> shared between them. >> That's actually not much of a problem. >> >> All you need to do is to use pthread_atfork() and do the appropriate >> action in parent/child to clean up your context: >> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html > Thunk already does that. However, it's not foolproof. pthread_atfork > hanlders aren't called when the process is forked with a clone call. Yeah, but that's perfectly intentional. clone() is usually used to create threads. >> The rest is just to make sure that all shared and all private data are >> kept separate all the time. Sharing virtual memory is already done for >> decades this way, it's just that nobody ever did it with a statefull >> device like GPUs. > My concern is not with sharing or not sharing data. It's with sharing > the address space itself. If you share the render node, you share GPU > virtual address space. However CPU address space is not shared between > parent and child. That's a fundamental mismatch between the CPU world > and current GPU driver implementation. Correct, but even that is easily solvable. As I said before, you can hang this state on a VMA and let it be cloned together with the CPU address space. Since VMAs are informed about their cloning (as opposed to file descriptors), it's trivial to even just clone kernel data on first access. Regards, Christian. > > Regards, > Felix > > >> Regards, >> Christian. >> >>> Regards, >>> Felix >>>
Am 2022-01-17 um 9:21 a.m. schrieb Christian König: > Am 17.01.22 um 15:17 schrieb Felix Kuehling: >> Am 2022-01-17 um 6:44 a.m. schrieb Christian König: >>> Am 14.01.22 um 18:40 schrieb Felix Kuehling: >>>> Am 2022-01-14 um 12:26 p.m. schrieb Christian König: >>>>> Am 14.01.22 um 17:44 schrieb Daniel Vetter: >>>>>> Top post because I tried to catch up on the entire discussion here. >>>>>> >>>>>> So fundamentally I'm not opposed to just close this fork() hole >>>>>> once and >>>>>> for all. The thing that worries me from a upstream/platform pov is >>>>>> really >>>>>> only if we don't do it consistently across all drivers. >>>>>> >>>>>> So maybe as an idea: >>>>>> - Do the original patch, but not just for ttm but all gem rendernode >>>>>> drivers at least (or maybe even all gem drivers, no idea), with >>>>>> the >>>>>> below discussion cleaned up as justification. >>>>> I know of at least one use case which this will break. >>>>> >>>>> A couple of years back we had a discussion on the Mesa mailing list >>>>> because (IIRC) Marek introduced a background thread to push command >>>>> submissions to the kernel. >>>>> >>>>> That broke because some compositor used to initialize OpenGL and then >>>>> do a fork(). This indeed worked previously (no GPUVM at that time), >>>>> but with the addition of the backround thread obviously broke. >>>>> >>>>> The conclusion back then was that the compositor is broken and needs >>>>> fixing, but it still essentially means that there could be people out >>>>> there with really old userspace where this setting would just break >>>>> the desktop. >>>>> >>>>> I'm not really against that change either, but at least in theory we >>>>> could make fork() work perfectly fine even with VMs and background >>>>> threads. >>>> You may regret this if you ever try to build a shared virtual address >>>> space between GPU and CPU. Then you have two processes (parent and >>>> child) sharing the same render context and GPU VM address space. >>>> But the >>>> CPU address spaces are different. You can't maintain consistent shared >>>> virtual address spaces for both processes when the GPU address >>>> space is >>>> shared between them. >>> That's actually not much of a problem. >>> >>> All you need to do is to use pthread_atfork() and do the appropriate >>> action in parent/child to clean up your context: >>> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html >> Thunk already does that. However, it's not foolproof. pthread_atfork >> hanlders aren't called when the process is forked with a clone call. > > Yeah, but that's perfectly intentional. clone() is usually used to > create threads. Clone can be used to create new processes. Maybe not the common use today. > >>> The rest is just to make sure that all shared and all private data are >>> kept separate all the time. Sharing virtual memory is already done for >>> decades this way, it's just that nobody ever did it with a statefull >>> device like GPUs. >> My concern is not with sharing or not sharing data. It's with sharing >> the address space itself. If you share the render node, you share GPU >> virtual address space. However CPU address space is not shared between >> parent and child. That's a fundamental mismatch between the CPU world >> and current GPU driver implementation. > > Correct, but even that is easily solvable. As I said before you can > hang this state on a VMA and let it be cloned together with the CPU > address space. I'm not following. The address space I'm talking about is struct amdgpu_vm. 
It's associated with the render node file descriptor. Inheriting and using that file descriptor in the child inherits the amdgpu_vm. I don't see how you can hang that state on any one VMA. To be consistent with the CPU, you'd need to clone the GPU address space (struct amdgpu_vm) in the child process. That means you need a new render node file descriptor that imports all the BOs from the parent address space. It's a bunch of extra work at fork time that you're proposing to immediately undo with an atfork handler. So I really don't see the point. Regards, Felix > > Since VMAs are informed about their cloning (in opposite to file > descriptors) it's trivial to even just clone kernel data on first access. > > Regards, > Christian. > >> >> Regards, >> Felix >> >> >>> Regards, >>> Christian. >>> >>>> Regards, >>>> Felix >>>> >
I don't think fork() would work with userspace where all buffers are shared. It certainly doesn't work now. The driver needs to be notified that a buffer or texture is shared to ensure data coherency between processes, and the driver must execute decompression and other render passes when a buffer or texture is being shared for the first time. Those aren't called when fork() is called. Marek On Mon, Jan 17, 2022 at 9:34 AM Felix Kuehling <felix.kuehling@amd.com> wrote: > Am 2022-01-17 um 9:21 a.m. schrieb Christian König: > > Am 17.01.22 um 15:17 schrieb Felix Kuehling: > >> Am 2022-01-17 um 6:44 a.m. schrieb Christian König: > >>> Am 14.01.22 um 18:40 schrieb Felix Kuehling: > >>>> Am 2022-01-14 um 12:26 p.m. schrieb Christian König: > >>>>> Am 14.01.22 um 17:44 schrieb Daniel Vetter: > >>>>>> Top post because I tried to catch up on the entire discussion here. > >>>>>> > >>>>>> So fundamentally I'm not opposed to just close this fork() hole > >>>>>> once and > >>>>>> for all. The thing that worries me from a upstream/platform pov is > >>>>>> really > >>>>>> only if we don't do it consistently across all drivers. > >>>>>> > >>>>>> So maybe as an idea: > >>>>>> - Do the original patch, but not just for ttm but all gem rendernode > >>>>>> drivers at least (or maybe even all gem drivers, no idea), with > >>>>>> the > >>>>>> below discussion cleaned up as justification. > >>>>> I know of at least one use case which this will break. > >>>>> > >>>>> A couple of years back we had a discussion on the Mesa mailing list > >>>>> because (IIRC) Marek introduced a background thread to push command > >>>>> submissions to the kernel. > >>>>> > >>>>> That broke because some compositor used to initialize OpenGL and then > >>>>> do a fork(). This indeed worked previously (no GPUVM at that time), > >>>>> but with the addition of the backround thread obviously broke. > >>>>> > >>>>> The conclusion back then was that the compositor is broken and needs > >>>>> fixing, but it still essentially means that there could be people out > >>>>> there with really old userspace where this setting would just break > >>>>> the desktop. > >>>>> > >>>>> I'm not really against that change either, but at least in theory we > >>>>> could make fork() work perfectly fine even with VMs and background > >>>>> threads. > >>>> You may regret this if you ever try to build a shared virtual address > >>>> space between GPU and CPU. Then you have two processes (parent and > >>>> child) sharing the same render context and GPU VM address space. > >>>> But the > >>>> CPU address spaces are different. You can't maintain consistent shared > >>>> virtual address spaces for both processes when the GPU address > >>>> space is > >>>> shared between them. > >>> That's actually not much of a problem. > >>> > >>> All you need to do is to use pthread_atfork() and do the appropriate > >>> action in parent/child to clean up your context: > >>> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html > >> Thunk already does that. However, it's not foolproof. pthread_atfork > >> hanlders aren't called when the process is forked with a clone call. > > > > Yeah, but that's perfectly intentional. clone() is usually used to > > create threads. > > Clone can be used to create new processes. Maybe not the common use today. > > > > > >>> The rest is just to make sure that all shared and all private data are > >>> kept separate all the time. 
Sharing virtual memory is already done for > >>> decades this way, it's just that nobody ever did it with a statefull > >>> device like GPUs. > >> My concern is not with sharing or not sharing data. It's with sharing > >> the address space itself. If you share the render node, you share GPU > >> virtual address space. However CPU address space is not shared between > >> parent and child. That's a fundamental mismatch between the CPU world > >> and current GPU driver implementation. > > > > Correct, but even that is easily solvable. As I said before you can > > hang this state on a VMA and let it be cloned together with the CPU > > address space. > > I'm not following. The address space I'm talking about is struct > amdgpu_vm. It's associated with the render node file descriptor. > Inheriting and using that file descriptor in the child inherits the > amdgpu_vm. I don't see how you can hang that state on any one VMA. > > To be consistent with the CPU, you'd need to clone the GPU address space > (struct amdgpu_vm) in the child process. That means you need a new > render node file descriptor that imports all the BOs from the parent > address space. It's a bunch of extra work to fork a process, that you're > proposing to immediately undo with an atfork handler. So I really don't > see the point. > > Regards, > Felix > > > > > > Since VMAs are informed about their cloning (in opposite to file > > descriptors) it's trivial to even just clone kernel data on first access. > > > > Regards, > > Christian. > > > >> > >> Regards, > >> Felix > >> > >> > >>> Regards, > >>> Christian. > >>> > >>>> Regards, > >>>> Felix > >>>> > > >
Am 17.01.22 um 15:50 schrieb Marek Olšák: > I don't think fork() would work with userspace where all buffers are > shared. It certainly doesn't work now. The driver needs to be notified > that a buffer or texture is shared to ensure data coherency between > processes, and the driver must execute decompression and other render > passes when a buffer or texture is being shared for the first time. > Those aren't called when fork() is called. Yeah, that's why you can install handlers which run before/after fork() is executed. But to summarize it is illegal for OpenGL, so we don't really need to worry about it. For compute there are a couple of use cases though, but even those are not real world ones as far as I know. But see below. > > Marek > > On Mon, Jan 17, 2022 at 9:34 AM Felix Kuehling <felix.kuehling@amd.com > <mailto:felix.kuehling@amd.com>> wrote: > > Am 2022-01-17 um 9:21 a.m. schrieb Christian König: > > Am 17.01.22 um 15:17 schrieb Felix Kuehling: > >> Am 2022-01-17 um 6:44 a.m. schrieb Christian König: > >>> Am 14.01.22 um 18:40 schrieb Felix Kuehling: > >>>> Am 2022-01-14 um 12:26 p.m. schrieb Christian König: > >>>>> Am 14.01.22 um 17:44 schrieb Daniel Vetter: > >>>>>> Top post because I tried to catch up on the entire > discussion here. > >>>>>> > >>>>>> So fundamentally I'm not opposed to just close this fork() hole > >>>>>> once and > >>>>>> for all. The thing that worries me from a upstream/platform > pov is > >>>>>> really > >>>>>> only if we don't do it consistently across all drivers. > >>>>>> > >>>>>> So maybe as an idea: > >>>>>> - Do the original patch, but not just for ttm but all gem > rendernode > >>>>>> drivers at least (or maybe even all gem drivers, no > idea), with > >>>>>> the > >>>>>> below discussion cleaned up as justification. > >>>>> I know of at least one use case which this will break. > >>>>> > >>>>> A couple of years back we had a discussion on the Mesa > mailing list > >>>>> because (IIRC) Marek introduced a background thread to push > command > >>>>> submissions to the kernel. > >>>>> > >>>>> That broke because some compositor used to initialize OpenGL > and then > >>>>> do a fork(). This indeed worked previously (no GPUVM at that > time), > >>>>> but with the addition of the backround thread obviously broke. > >>>>> > >>>>> The conclusion back then was that the compositor is broken > and needs > >>>>> fixing, but it still essentially means that there could be > people out > >>>>> there with really old userspace where this setting would > just break > >>>>> the desktop. > >>>>> > >>>>> I'm not really against that change either, but at least in > theory we > >>>>> could make fork() work perfectly fine even with VMs and > background > >>>>> threads. > >>>> You may regret this if you ever try to build a shared virtual > address > >>>> space between GPU and CPU. Then you have two processes > (parent and > >>>> child) sharing the same render context and GPU VM address space. > >>>> But the > >>>> CPU address spaces are different. You can't maintain > consistent shared > >>>> virtual address spaces for both processes when the GPU address > >>>> space is > >>>> shared between them. > >>> That's actually not much of a problem. 
> >>> > >>> All you need to do is to use pthread_atfork() and do the > appropriate > >>> action in parent/child to clean up your context: > >>> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html > >> Thunk already does that. However, it's not foolproof. > pthread_atfork > >> hanlders aren't called when the process is forked with a clone > call. > > > > Yeah, but that's perfectly intentional. clone() is usually used to > > create threads. > > Clone can be used to create new processes. Maybe not the common > use today. > > > > > >>> The rest is just to make sure that all shared and all private > data are > >>> kept separate all the time. Sharing virtual memory is already > done for > >>> decades this way, it's just that nobody ever did it with a > statefull > >>> device like GPUs. > >> My concern is not with sharing or not sharing data. It's with > sharing > >> the address space itself. If you share the render node, you > share GPU > >> virtual address space. However CPU address space is not shared > between > >> parent and child. That's a fundamental mismatch between the CPU > world > >> and current GPU driver implementation. > > > > Correct, but even that is easily solvable. As I said before you can > > hang this state on a VMA and let it be cloned together with the CPU > > address space. > > I'm not following. The address space I'm talking about is struct > amdgpu_vm. It's associated with the render node file descriptor. > Inheriting and using that file descriptor in the child inherits the > amdgpu_vm. I don't see how you can hang that state on any one VMA. > But you don't really need that. You can bind the VM to your VMA mapping and clone that as necessary. I'm not sure how else I should describe that; as far as I know the kernel, that would be rather trivial to do. Cloning all the userspace state like Marek described above is the much harder part. Regards, Christian. > > To be consistent with the CPU, you'd need to clone the GPU address > space > (struct amdgpu_vm) in the child process. That means you need a new > render node file descriptor that imports all the BOs from the parent > address space. It's a bunch of extra work to fork a process, that > you're > proposing to immediately undo with an atfork handler. So I really > don't > see the point. > > Regards, > Felix > > > > > > Since VMAs are informed about their cloning (in opposite to file > > descriptors) it's trivial to even just clone kernel data on > first access. > > > > Regards, > > Christian. > > > >> > >> Regards, > >> Felix > >> > >> > >>> Regards, > >>> Christian. > >>> > >>>> Regards, > >>>> Felix > >>>> > > >
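A very rough kernel-side sketch of what binding per-VM state to the VMA could look like; struct my_vm_state and its helpers are hypothetical, and the only real mechanism relied on here is that vm_operations_struct::open runs for the new copy of a VMA when the address space is duplicated on fork().

#include <linux/kref.h>
#include <linux/mm.h>
#include <linux/slab.h>

struct my_vm_state {
	struct kref refcount;
	/* per-address-space GPU state would live here */
};

static void my_state_release(struct kref *kref)
{
	kfree(container_of(kref, struct my_vm_state, refcount));
}

/* Called for the child's copy of the VMA when the parent forks. */
static void my_vma_open(struct vm_area_struct *vma)
{
	struct my_vm_state *state = vma->vm_private_data;

	/* Either share the state by reference, or mark it so that a
	 * private copy is cloned lazily on first access. */
	kref_get(&state->refcount);
}

static void my_vma_close(struct vm_area_struct *vma)
{
	struct my_vm_state *state = vma->vm_private_data;

	kref_put(&state->refcount, my_state_release);
}

static const struct vm_operations_struct my_vm_ops = {
	.open  = my_vma_open,
	.close = my_vma_close,
	/* .fault would map pages and could perform the per-process clone */
};

Deferring the actual clone to first access, as suggested above, would then happen in the fault handler rather than in open().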
diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c index 09c820045859..d9c4149f36dd 100644 --- a/drivers/gpu/drm/drm_gem.c +++ b/drivers/gpu/drm/drm_gem.c @@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object *obj, unsigned long obj_size, goto err_drm_gem_object_put; } - vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP; + vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND + | VM_DONTDUMP | VM_DONTCOPY; vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags)); vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot); } diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c index 33680c94127c..420a4898fdd2 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c @@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct *vma, struct ttm_buffer_object *bo) vma->vm_private_data = bo; - vma->vm_flags |= VM_PFNMAP; + vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY; vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; return 0; }