Message ID | 20220601004014.158247-4-bas@basnieuwenhuizen.nl (mailing list archive)
State      | New, archived
Series     | Add option to disable implicit sync for userspace submits.
Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
> This should be okay because moves themselves use KERNEL usage and
> hence still sync with BOOKKEEP usage. Then any later submits still
> wait on any pending VM operations.
>
> (i.e. we only made VM ops not wait on BOOKKEEP submits, not the other
> way around)

Well NAK again. This allows access to freed up memory and is a complete
no-go.

Regards,
Christian.

> Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> index f10332e1c6c0..31bc73fd1fae 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p,
>  	if (!resv)
>  		return 0;
>
> -	return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
> +	return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
>  }
>
>  /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 63b484dc76c5..c8d5898bea11 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
>  	if (!resv)
>  		return 0;
>
> -	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
> +	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
>  }
>
>  /**
On Wed, Jun 1, 2022 at 10:03 AM Christian König <christian.koenig@amd.com> wrote:
>
> Am 01.06.22 um 02:40 schrieb Bas Nieuwenhuizen:
> > This should be okay because moves themselves use KERNEL usage and
> > hence still sync with BOOKKEEP usage. Then any later submits still
> > wait on any pending VM operations.
> >
> > (i.e. we only made VM ops not wait on BOOKKEEP submits, not the other
> > way around)
>
> Well NAK again. This allows access to freed up memory and is a complete
> no-go.

How does this allow access to freed memory? The worst I can see is that
the unmap happens earlier if the app/driver gets the waits wrong, which
wouldn't give access after the underlying BO is freed?

> Regards,
> Christian.
>
> [SNIP]
Am 01.06.22 um 10:16 schrieb Bas Nieuwenhuizen:
> How does this allow access to freed memory? The worst I can see is that
> the unmap happens earlier if the app/driver gets the waits wrong, which
> wouldn't give access after the underlying BO is freed?

To free up memory we need to update the PTEs and then flush those out by
invalidating the TLB.

On gfx6, gfx7 and gfx8 and some broken gfx10 hw, invalidating the TLB can
only be done while the VMID is idle.

Only gfx9 can reliably invalidate the TLB while it is in use, and even
there it comes with quite a performance penalty (a TLB invalidation can
take multiple seconds).

Because of this, what we do in the kernel driver is sync to everything
when we unmap entries:

    if (!(flags & AMDGPU_PTE_VALID))
            sync_mode = AMDGPU_SYNC_EQ_OWNER;
    else
            sync_mode = AMDGPU_SYNC_EXPLICIT;

This acts as a barrier for freeing the memory. In other words, we
intentionally add a bubble which syncs everything.

I've been working for months on a concept for how to do all this without
causing the stalls you observe, but so far haven't come to much of a
conclusion.

Regards,
Christian.

> [SNIP]
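A minimal, hypothetical model of the sync modes being discussed may help here. The enum names match amdgpu's, but must_wait() is an illustrative helper, not the driver's actual filter (which lives in amdgpu_sync.c and also considers fence usage and special owners):

    /* Kernel-context sketch, not the actual driver code: how a sync mode
     * decides whether a VM page-table update must wait for a fence found
     * in a buffer's reservation object. */
    enum amdgpu_sync_mode {
            AMDGPU_SYNC_ALWAYS,     /* wait for every fence */
            AMDGPU_SYNC_NE_OWNER,   /* wait for fences from other owners */
            AMDGPU_SYNC_EQ_OWNER,   /* wait for fences from the same owner */
            AMDGPU_SYNC_EXPLICIT,   /* wait only for explicitly added deps */
    };

    static bool must_wait(enum amdgpu_sync_mode mode,
                          const void *fence_owner, const void *vm)
    {
            switch (mode) {
            case AMDGPU_SYNC_ALWAYS:
                    return true;
            case AMDGPU_SYNC_NE_OWNER:
                    return fence_owner != vm;
            case AMDGPU_SYNC_EQ_OWNER:
                    /* the "barrier" mode used on unmap above */
                    return fence_owner == vm;
            case AMDGPU_SYNC_EXPLICIT:
                    return false;
            }
            return true;
    }

Under this model, AMDGPU_SYNC_EQ_OWNER makes every previous submission on the same VM a dependency of the unmap, which is exactly the intentional bubble described above.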
On Wed, Jun 1, 2022 at 10:40 AM Christian König <christian.koenig@amd.com> wrote:
>
> Am 01.06.22 um 10:16 schrieb Bas Nieuwenhuizen:
> > How does this allow access to freed memory? The worst I can see is that
> > the unmap happens earlier if the app/driver gets the waits wrong, which
> > wouldn't give access after the underlying BO is freed?
>
> To free up memory we need to update the PTEs and then flush those out by
> invalidating the TLB.
>
> On gfx6, gfx7 and gfx8 and some broken gfx10 hw, invalidating the TLB can
> only be done while the VMID is idle.
>
> Only gfx9 can reliably invalidate the TLB while it is in use, and even
> there it comes with quite a performance penalty (a TLB invalidation can
> take multiple seconds).
>
> Because of this, what we do in the kernel driver is sync to everything
> when we unmap entries:
>
>     if (!(flags & AMDGPU_PTE_VALID))
>             sync_mode = AMDGPU_SYNC_EQ_OWNER;
>     else
>             sync_mode = AMDGPU_SYNC_EXPLICIT;
>
> This acts as a barrier for freeing the memory. In other words, we
> intentionally add a bubble which syncs everything.
>
> I've been working for months on a concept for how to do all this without
> causing the stalls you observe, but so far haven't come to much of a
> conclusion.

That might cause an unmap operation too early, but for freeing up the
actual backing memory we still wait for all fences on the BO to finish
first, no? In that case, since BOOKKEEP fences are still added for
explicit sync, that should not be a problem, no?

(If not, that sounds like the obvious fix for making this work?)

> Regards,
> Christian.
>
> [SNIP]
On Wed, Jun 1, 2022, 10:48 Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl> wrote:

> [SNIP]
>
> That might cause an unmap operation too early, but for freeing up the
> actual backing memory we still wait for all fences on the BO to finish
> first, no? In that case, since BOOKKEEP fences are still added for
> explicit sync, that should not be a problem, no?
>
> (If not, that sounds like the obvious fix for making this work?)

As an aside, this is the same hole/issue as when an app forgets a BO in
the BO list on submission.

> [SNIP]
Am 01.06.22 um 10:48 schrieb Bas Nieuwenhuizen:
> [SNIP]
>
> That might cause an unmap operation too early, but for freeing up the
> actual backing memory we still wait for all fences on the BO to finish
> first, no? In that case, since BOOKKEEP fences are still added for
> explicit sync, that should not be a problem, no?
>
> (If not, that sounds like the obvious fix for making this work?)

The problem is we need to wait on fences *not* added to the buffer object.

E.g. what we currently do here while freeing memory is:
1. Update the PTEs and make that update wait for everything!
2. Add the fence of that update to the freed up BO so that this BO isn't
   freed before the next CS.

We might be able to fix this by adding the fences to the BO before
freeing it manually, but I'm not 100% sure we can actually allocate
memory for the fences in that moment.

Regards,
Christian.

> [SNIP]
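A compact sketch of the two steps described above; the function and type names here are illustrative placeholders, not the actual amdgpu entry points:

    /* Illustrative only: the current "free" barrier in two steps,
     * assuming hypothetical helpers vm_clear_range()/bo_attach_fence(). */
    static int unmap_and_free(struct vm *vm, struct bo *bo,
                              u64 addr, u64 size)
    {
            struct dma_fence *fence;
            int r;

            /* Step 1: clear the PTEs; with AMDGPU_SYNC_EQ_OWNER this
             * update waits for everything previously running on the VM. */
            r = vm_clear_range(vm, addr, size, &fence);
            if (r)
                    return r;

            /* Step 2: attach the update fence to the BO so the backing
             * store cannot be released before the clear has finished. */
            bo_attach_fence(bo, fence);
            dma_fence_put(fence);
            return 0;
    }

The fences Christian refers to are the ones the step-1 update waits on: they live on the VM, not on the BO's own reservation object.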
On Wed, Jun 1, 2022 at 11:01 AM Christian König <christian.koenig@amd.com> wrote:
>
> Am 01.06.22 um 10:48 schrieb Bas Nieuwenhuizen:
> [SNIP]
>
> The problem is we need to wait on fences *not* added to the buffer object.

What fences wouldn't be added to the buffer object that we need here?

> E.g. what we currently do here while freeing memory is:
> 1. Update the PTEs and make that update wait for everything!
> 2. Add the fence of that update to the freed up BO so that this BO isn't
>    freed before the next CS.
>
> We might be able to fix this by adding the fences to the BO before
> freeing it manually, but I'm not 100% sure we can actually allocate
> memory for the fences in that moment.

I think we don't need to be able to. We're already adding the unmap
fence to the BO in the gem close ioctl, and that has the fallback that
if we can't allocate space for the fence in the BO, we wait on the
fence manually on the CPU. I think that is a reasonable fallback for
this as well?

For the TTM move path, amdgpu_copy_buffer will wait on the BO resv, and
following submissions will trigger VM updates that wait on the
amdgpu_copy_buffer jobs, and hence transitively on the move work. AFAICT
amdgpu_bo_move does not trigger any VM updates by itself, and
amdgpu_bo_move_notify runs way after the move (and after
ttm_bo_move_accel_cleanup, which would free the old resource), so any VM
changes triggered by that would see the TTM copy and sync to it.

I do have to fix some stuff indeed, especially for the GEM close, but
with that we should be able to keep the same basic approach?

> Regards,
> Christian.
>
> [SNIP]
Am 03.06.22 um 03:21 schrieb Bas Nieuwenhuizen:
> [SNIP]
>> The problem is we need to wait on fences *not* added to the buffer object.
> What fences wouldn't be added to the buffer object that we need here?

Basically all still running submissions from the VM which could
potentially access the BO.

That's why we have the AMDGPU_SYNC_EQ_OWNER in amdgpu_vm_update_range().

>> E.g. what we currently do here while freeing memory is:
>> 1. Update the PTEs and make that update wait for everything!
>> 2. Add the fence of that update to the freed up BO so that this BO isn't
>>    freed before the next CS.
>>
>> We might be able to fix this by adding the fences to the BO before
>> freeing it manually, but I'm not 100% sure we can actually allocate
>> memory for the fences in that moment.
> I think we don't need to be able to. We're already adding the unmap
> fence to the BO in the gem close ioctl, and that has the fallback that
> if we can't allocate space for the fence in the BO, we wait on the
> fence manually on the CPU. I think that is a reasonable fallback for
> this as well?

Yes, just blocking might work in an OOM situation as well.

> For the TTM move path, amdgpu_copy_buffer will wait on the BO resv, and
> following submissions will trigger VM updates that wait on the
> amdgpu_copy_buffer jobs, and hence transitively on the move work. AFAICT
> amdgpu_bo_move does not trigger any VM updates by itself, and
> amdgpu_bo_move_notify runs way after the move (and after
> ttm_bo_move_accel_cleanup, which would free the old resource), so any VM
> changes triggered by that would see the TTM copy and sync to it.
>
> I do have to fix some stuff indeed, especially for the GEM close, but
> with that we should be able to keep the same basic approach?

Nope, not even remotely.

What we need is the following:
1. Rolling out my drm_exec patch set, so that we can lock buffers as
   needed.
2. When we get a VM operation we not only lock the VM page tables, but
   also all buffers we potentially need to unmap.
3. Nuking the freed list in the amdgpu_vm structure by updating freed
   areas directly when they are unmapped.
4. Tracking those updates inside the bo_va structure for the BO+VM
   combination.
5. When the bo_va structure is destroyed because of closing the handle,
   move the last clear operation over to the VM as implicit sync.

Only when all this is done can we resolve the dependency that the CS
currently must wait for any clear operation on the VM.

Regards,
Christian.
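For context on step 1: drm_exec is a helper for ww_mutex-safe locking of an arbitrary set of GEM objects. A usage sketch assuming the API roughly as proposed in that series (still under review at the time, so details may differ; vm_root_obj and bo_obj are hypothetical drm_gem_object pointers):

    struct drm_exec exec;
    int ret;

    drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT);
    drm_exec_until_all_locked(&exec) {
            /* Lock the VM page tables and all BOs we may need to unmap;
             * on ww_mutex contention everything is dropped and retried. */
            ret = drm_exec_lock_obj(&exec, vm_root_obj);
            drm_exec_retry_on_contention(&exec);
            if (ret)
                    break;

            ret = drm_exec_lock_obj(&exec, bo_obj);
            drm_exec_retry_on_contention(&exec);
            if (ret)
                    break;
    }
    /* ... update page tables / unmap while everything is locked ... */
    drm_exec_fini(&exec);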
On Fri, Jun 3, 2022 at 10:11 AM Christian König <christian.koenig@amd.com> wrote:
>
> [SNIP]
>
> Nope, not even remotely.
>
> What we need is the following:
> 1. Rolling out my drm_exec patch set, so that we can lock buffers as
>    needed.
> 2. When we get a VM operation we not only lock the VM page tables, but
>    also all buffers we potentially need to unmap.
> 3. Nuking the freed list in the amdgpu_vm structure by updating freed
>    areas directly when they are unmapped.
> 4. Tracking those updates inside the bo_va structure for the BO+VM
>    combination.
> 5. When the bo_va structure is destroyed because of closing the handle,
>    move the last clear operation over to the VM as implicit sync.

Hi Christian, isn't that a different problem though (one that we're also
trying to solve, but in your series)?

What this patch tries to achieve is for

(t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
(t+1) a VM operation on a BO/VM accessed by the CS.

to run concurrently. What it *doesn't* try is for

(t+0) a VM operation on a BO/VM accessed by the CS.
(t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)

to run concurrently. When you write

> Only when all this is done can we resolve the dependency that the CS
> currently must wait for any clear operation on the VM.

isn't that all about the second problem?

> Regards,
> Christian.
Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
> [SNIP]
> Hi Christian, isn't that a different problem though (one that we're also
> trying to solve, but in your series)?
>
> What this patch tries to achieve is for
>
> (t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
> (t+1) a VM operation on a BO/VM accessed by the CS.
>
> to run concurrently. What it *doesn't* try is for
>
> (t+0) a VM operation on a BO/VM accessed by the CS.
> (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
>
> to run concurrently. When you write
>
>> Only when all this is done can we resolve the dependency that the CS
>> currently must wait for any clear operation on the VM.
> isn't that all about the second problem?

No, it's the same.

See, what we do in the VM code is artificially insert a bubble so that
all VM clear operations wait for all CS operations, and then use the
clear fence to indicate when the backing store of the BO can be freed.

When you want to remove this bubble (which is certainly a good idea) you
need to first come up with a different approach to handle the clear
operations.

Regards,
Christian.
On Fri, Jun 3, 2022 at 12:16 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
> [SNIP]
>
> No, it's the same.
>
> See, what we do in the VM code is artificially insert a bubble so that
> all VM clear operations wait for all CS operations, and then use the
> clear fence to indicate when the backing store of the BO can be freed.

Isn't that remediated with something like the code below? At least the
gem_close case should be handled with this, and the move case was
already handled by the copy operation.

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -187,6 +187,40 @@ static int amdgpu_gem_object_open(struct drm_gem_object *obj,
 	return 0;
 }
 
+static void dma_resv_copy(struct dma_resv *src, struct dma_resv *dst)
+{
+	struct dma_resv_iter cursor;
+	struct dma_fence *f;
+	int r;
+	unsigned num_fences = 0;
+
+	if (src == dst)
+		return;
+
+	/* We assume the later loops get the same fences as the caller should
+	 * lock the resv. */
+	dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
+		++num_fences;
+		dma_fence_put(f);
+	}
+
+	r = dma_resv_reserve_fences(dst, num_fences);
+	if (r) {
+		/* As a last resort on OOM we block for the fences; nothing
+		 * gets added to dst in that case. */
+		dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
+			dma_fence_wait(f, false);
+			dma_fence_put(f);
+		}
+		return;
+	}
+
+	dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
+		dma_resv_add_fence(dst, f, dma_resv_iter_usage(&cursor));
+		dma_fence_put(f);
+	}
+}
+
+
 static void amdgpu_gem_object_close(struct drm_gem_object *obj,
 				    struct drm_file *file_priv)
 {
@@ -233,6 +266,8 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj,
 	amdgpu_bo_fence(bo, fence, true);
 	dma_fence_put(fence);
 
+	dma_resv_copy(vm->root.bo->tbo.base.resv, bo->tbo.base.resv);
+
 out_unlock:
 	if (unlikely(r < 0))
 		dev_err(adev->dev, "failed to clear page "

> When you want to remove this bubble (which is certainly a good idea) you
> need to first come up with a different approach to handle the clear
> operations.
>
> Regards,
> Christian.
Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
> On Fri, Jun 3, 2022 at 12:16 PM Christian König
> <christian.koenig@amd.com> wrote:
> [SNIP]
>
> Isn't that remediated with something like the code below? At least the
> gem_close case should be handled with this, and the move case was
> already handled by the copy operation.

That is one necessary puzzle piece, yes. But you need more than that.

Especially the explicit unmap operation needs to be converted into an
implicit unmap to get the TLB flush right.

I think I know all the necessary steps now, it's just tons of work to do.

Regards,
Christian.

> [SNIP]
On Fri, Jun 3, 2022 at 2:08 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
> [SNIP]
>
> That is one necessary puzzle piece, yes. But you need more than that.
>
> Especially the explicit unmap operation needs to be converted into an
> implicit unmap to get the TLB flush right.

This doesn't change anything about the TLB flush though, since all
unmap -> later jobs dependencies are still implicit.

So the worst that could happen (if, e.g., userspace gets the
waits/dependencies wrong) is

1) non-implicit CS gets submitted that touches a BO
2) VM unmap on that BO happens
2.5) the CS from 1 is still active due to missing dependencies
2.6) but any CS submission after 2 will trigger a TLB flush
3) A TLB flush happens for a new CS
4) All CS submissions here see the TLB flush and hence the unmap

So the main problem would be the CS from step 1, but (a) if that
VM-faults that is the app's own fault, and (b) because we don't free the
memory until (1) finishes it is not a security issue kernel-wise.

> I think I know all the necessary steps now, it's just tons of work to do.
>
> Regards,
> Christian.

> [SNIP]
Am 03.06.22 um 14:39 schrieb Bas Nieuwenhuizen:
> [SNIP]
> This doesn't change anything about the TLB flush though, since all
> unmap -> later jobs dependencies are still implicit.
>
> So the worst that could happen (if, e.g., userspace gets the
> waits/dependencies wrong) is
>
> 1) non-implicit CS gets submitted that touches a BO
> 2) VM unmap on that BO happens
> 2.5) the CS from 1 is still active due to missing dependencies
> 2.6) but any CS submission after 2 will trigger a TLB flush

Yeah, but that's exactly the bubble we try to avoid. Isn't it?

When we want to do a TLB flush the unmap operation must already be
completed. Otherwise the flush is rather pointless since any access
could reload the not yet updated PTEs.

And this means that we need to artificially add a dependency on every
command submission after step 2 to wait until the unmap operation is
completed.

Christian.

> 3) A TLB flush happens for a new CS
> 4) All CS submissions here see the TLB flush and hence the unmap
>
> So the main problem would be the CS from step 1, but (a) if that
> VM-faults that is the app's own fault, and (b) because we don't free the
> memory until (1) finishes it is not a security issue kernel-wise.
>
> [SNIP]
On Fri, Jun 3, 2022 at 2:49 PM Christian König <christian.koenig@amd.com> wrote: > > Am 03.06.22 um 14:39 schrieb Bas Nieuwenhuizen: > > On Fri, Jun 3, 2022 at 2:08 PM Christian König <christian.koenig@amd.com> wrote: > >> Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen: > >>> On Fri, Jun 3, 2022 at 12:16 PM Christian König > >>> <christian.koenig@amd.com> wrote: > >>>> Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen: > >>>>> [SNIP] > >>>>>>> I do have to fix some stuff indeed, especially for the GEM close but > >>>>>>> with that we should be able to keep the same basic approach? > >>>>>> Nope, not even remotely. > >>>>>> > >>>>>> What we need is the following: > >>>>>> 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed. > >>>>>> 2. When we get a VM operation we not only lock the VM page tables, but > >>>>>> also all buffers we potentially need to unmap. > >>>>>> 3. Nuking the freed list in the amdgpu_vm structure by updating freed > >>>>>> areas directly when they are unmapped. > >>>>>> 4. Tracking those updates inside the bo_va structure for the BO+VM > >>>>>> combination. > >>>>>> 5. When the bo_va structure is destroy because of closing the handle > >>>>>> move the last clear operation over to the VM as implicit sync. > >>>>>> > >>>>> Hi Christian, isn't that a different problem though (that we're also > >>>>> trying to solve, but in your series)? > >>>>> > >>>>> What this patch tries to achieve: > >>>>> > >>>>> (t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync) > >>>>> (t+1) a VM operation on a BO/VM accessed by the CS. > >>>>> > >>>>> to run concurrently. What it *doesn't* try is > >>>>> > >>>>> (t+0) a VM operation on a BO/VM accessed by the CS. > >>>>> (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync) > >>>>> > >>>>> to run concurrently. When you write > >>>>> > >>>>>> Only when all this is done we then can resolve the dependency that the > >>>>>> CS currently must wait for any clear operation on the VM. > >>>>> isn't that all about the second problem? > >>>> No, it's the same. > >>>> > >>>> See what we do in the VM code is to artificially insert a bubble so that > >>>> all VM clear operations wait for all CS operations and then use the > >>>> clear fence to indicate when the backing store of the BO can be freed. > >>> Isn't that remediated with something like the code below? At least the > >>> gem_close case should be handled with this, and the move case was > >>> already handled by the copy operation. > >> That is one necessary puzzle piece, yes. But you need more than that. > >> > >> Especially the explicit unmap operation needs to be converted into an > >> implicit unmap to get the TLB flush right. > > This doesn't change anything about the TLB flush though? Since all > > unmap -> later jobs dependencies are still implicit. > > > > So the worst what could happen (i.f. e.g. userspace gets the > > waits/dependencies wrong) is > > > > 1) non-implicit CS gets submitted that touches a BO > > 2) VM unmap on that BO happens > > 2.5) the CS from 1 is still active due to missing dependencies > > 2.6) but any CS submission after 2 will trigger a TLB flush > > Yeah, but that's exactly the bubble we try to avoid. Isn't it? For this series, not really. 
To clarify, there are two sides to getting GPU bubbles and no overlap:

(1) VM operations implicitly wait for earlier CS submissions
(2) CS submissions implicitly wait for earlier VM operations

Together, these combine to ensure that you get a (potentially small)
bubble any time VM work happens.

Your series (and further ideas) tackles (2), and is a worthwhile thing
to do. However, while writing the userspace for this I noticed this
isn't enough to get rid of all our GPU bubbles. In particular, when
doing a non-sparse map of a new BO, that tends to need to be waited on
for the next CS anyway for API semantics. Due to VM operations
happening on a single timeline, that means this high priority map can
end up being blocked by earlier sparse maps, and hence the bubble in
that case still exists.

So in this series I try to tackle (1) instead. Since GPU work
typically lags behind CPU submissions and VM operations aren't that
slow, we can typically execute VM operations early enough that any
implicit syncs from (2) are less of an issue, or none at all. In
particular, by doing all dependency waits in userspace, we can make
almost all VM operations start pretty much immediately (with a bunch
of exceptions, like other VM work that takes time, radeonsi still
submitting implicitly synced stuff, etc.).

So I think (2) is valuable, just not what this series tries to focus
on or touch at all.

(And then the cherry on top would be having two timelines for VM
operations, a high priority one for non-sparse bindings and a low
priority one for sparse bindings, but that is very complex and not
super high value on top of eliminating (1) + (2), so I'd punt that
for "maybe later". See e.g. the discussion wrt Intel at
https://patchwork.freedesktop.org/patch/486604/#comment_879193)

> When we want to do a TLB flush the unmap operation must already be
> completed. Otherwise the flush is rather pointless since any access
> could reload the not yet updated PTEs.
>
> And this means that we need to artificially add a dependency on every
> command submission after 2 to wait until the unmap operation is
> completed.
>
> Christian.
>
> > 3) A TLB flush happens for a new CS
> > 4) All CS submissions here see the TLB flush and hence the unmap
> >
> > So the main problem would be the CS from step 1, but (a) if that
> > VM-faults that is the app's own fault and (b) because we don't free
> > the memory until (1) finishes it is not a security issue kernel-wise.
> >
> >> I think I know all the necessary steps now, it's just tons of work
> >> to do.
> >>
> >> Regards,
> >> Christian.
> >>
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> >>> @@ -187,6 +187,39 @@ static int amdgpu_gem_object_open(struct drm_gem_object *obj,
> >>>           return 0;
> >>>    }
> >>>
> >>> +static void dma_resv_copy(struct dma_resv *src, struct dma_resv *dst)
> >>> +{
> >>> +        struct dma_resv_iter cursor;
> >>> +        struct dma_fence *f;
> >>> +        int r;
> >>> +        unsigned num_fences = 0;
> >>> +
> >>> +        if (src == dst)
> >>> +                return;
> >>> +
> >>> +        /* We assume the later loops get the same fences, as the
> >>> +         * caller should lock the resv. */
> >>> +        dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
> >>> +                ++num_fences;
> >>> +                dma_fence_put(f);
> >>> +        }
> >>> +
> >>> +        r = dma_resv_reserve_fences(dst, num_fences);
> >>> +        if (r) {
> >>> +                /* As a last resort on OOM we block for the fences */
> >>> +                dma_resv_for_each_fence(&cursor, src,
> >>> +                                        DMA_RESV_USAGE_BOOKKEEP, f) {
> >>> +                        dma_fence_wait(f, false);
> >>> +                        dma_fence_put(f);
> >>> +                }
> >>> +                /* Everything signaled, nothing left to copy */
> >>> +                return;
> >>> +        }
> >>> +
> >>> +        dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
> >>> +                dma_resv_add_fence(dst, f, dma_resv_iter_usage(&cursor));
> >>> +                dma_fence_put(f);
> >>> +        }
> >>> +}
> >>> +
> >>> +
> >>>    static void amdgpu_gem_object_close(struct drm_gem_object *obj,
> >>>                                        struct drm_file *file_priv)
> >>>    {
> >>> @@ -233,6 +266,8 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj,
> >>>           amdgpu_bo_fence(bo, fence, true);
> >>>           dma_fence_put(fence);
> >>>
> >>> +        dma_resv_copy(vm->root.bo->tbo.base.resv, bo->tbo.base.resv);
> >>> +
> >>>     out_unlock:
> >>>           if (unlikely(r < 0))
> >>>                   dev_err(adev->dev, "failed to clear page "
> >>>
> >>>> When you want to remove this bubble (which is certainly a good
> >>>> idea) you need to first come up with a different approach to
> >>>> handle the clear operations.
> >>>>
> >>>> Regards,
> >>>> Christian.
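For reference, the count/reserve/add dance in the dma_resv_copy() sketch above can also be built on an existing primitive: dma_resv_get_singleton() collapses all fences at a given usage level into a single fence (allocating a fence array internally). A hedged sketch, not part of the posted patch; the vm_fences_to_bo() name is made up and the usual dma_resv locking rules are assumed:

#include <linux/dma-resv.h>
#include <linux/dma-fence.h>

/* Sketch: attach one aggregate fence covering all pending VM work on
 * the page-table resv to the BO, so freeing the BO's backing store
 * waits for those VM operations. */
static int vm_fences_to_bo(struct dma_resv *src, struct dma_resv *dst)
{
        struct dma_fence *fence;
        int r;

        r = dma_resv_get_singleton(src, DMA_RESV_USAGE_BOOKKEEP, &fence);
        if (r)
                return r;
        if (!fence)
                return 0;

        r = dma_resv_reserve_fences(dst, 1);
        if (r) {
                /* Last resort on OOM: block, mirroring the patch above. */
                dma_fence_wait(fence, false);
                dma_fence_put(fence);
                return 0;
        }

        dma_resv_add_fence(dst, fence, DMA_RESV_USAGE_BOOKKEEP);
        dma_fence_put(fence);
        return 0;
}

The trade-off is the fence-array allocation inside dma_resv_get_singleton(), which is why a blocking fallback is still needed on OOM.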
Am 03.06.22 um 15:23 schrieb Bas Nieuwenhuizen:
> On Fri, Jun 3, 2022 at 2:49 PM Christian König <christian.koenig@amd.com> wrote:
>> [SNIP]
>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> For this series, not really. To clarify, there are two sides to
> getting GPU bubbles and no overlap:
>
> (1) VM operations implicitly wait for earlier CS submissions
> (2) CS submissions implicitly wait for earlier VM operations
>
> [SNIP]
>
> So in this series I try to tackle (1) instead. Since GPU work
> typically lags behind CPU submissions and VM operations aren't that
> slow, we can typically execute VM operations early enough that any
> implicit syncs from (2) are less of an issue, or none at all.
>
> So I think (2) is valuable, just not what this series tries to focus
> on or touch at all.

Ok, once more since you don't seem to understand what I want to say: it
isn't possible to fix #1 before you have fixed #2.

The VM unmap operation here is a barrier which divides the CS
operations into a before and an after. This is intentional design.

To get rid of this barrier you must first fix the part where CS
submissions wait for the VM operation to complete, i.e. the necessity
of the barrier.

I'm working on this for a couple of years now and I'm really running
out of ideas for how to explain this restriction.

Regards,
Christian.
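To make the disputed barrier concrete: written out in dma_resv terms it consists of two waits, sketched below with made-up names (amdgpu expresses these through its sync objects rather than blocking waits like here):

#include <linux/dma-fence.h>
#include <linux/dma-resv.h>
#include <linux/sched.h>

/* Illustrative only: the CS->VM->CS barrier as two waits. */
static long vm_unmap_barrier(struct dma_resv *bo_resv,
                             struct dma_fence *unmap_fence)
{
        long r;

        /* The unmap waits for every prior CS touching the BO; BOOKKEEP
         * is the highest usage level and thus covers all fences. */
        r = dma_resv_wait_timeout(bo_resv, DMA_RESV_USAGE_BOOKKEEP,
                                  false, MAX_SCHEDULE_TIMEOUT);
        if (r < 0)
                return r;

        /* And every later CS depends on the unmap (and with it the
         * TLB flush) before it is scheduled. */
        return dma_fence_wait(unmap_fence, false);
}

The disagreement in the thread is about which of these two waits can safely be dropped, and in which order.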
On Fri, Jun 3, 2022 at 7:42 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 03.06.22 um 15:23 schrieb Bas Nieuwenhuizen:
> > [SNIP]
> > So in this series I try to tackle (1) instead. Since GPU work
> > typically lags behind CPU submissions and VM operations aren't that
> > slow, we can typically execute VM operations early enough that any
> > implicit syncs from (2) are less of an issue, or none at all.
>
> Ok, once more since you don't seem to understand what I want to say: it
> isn't possible to fix #1 before you have fixed #2.
>
> The VM unmap operation here is a barrier which divides the CS
> operations into a before and an after. This is intentional design.

Why is that barrier needed? These are the two barriers I understood and
think we can deal with:

1) The VM unmap is a barrier between prior CS and later memory free.
2) The TLB flush needs to happen between a VM unmap and later CS.

But why do we need the VM unmap to be a strict barrier between prior
CS and later CS?

> To get rid of this barrier you must first fix the part where CS
> submissions wait for the VM operation to complete, i.e. the necessity
> of the barrier.
>
> I'm working on this for a couple of years now and I'm really running
> out of ideas for how to explain this restriction.
>
> Regards,
> Christian.
Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
> [SNIP]
> Why is that barrier needed? These are the two barriers I understood
> and think we can deal with:
>
> 1) The VM unmap is a barrier between prior CS and later memory free.
> 2) The TLB flush needs to happen between a VM unmap and later CS.
>
> But why do we need the VM unmap to be a strict barrier between prior
> CS and later CS?

Exactly because of the two reasons you mentioned.

#1 is rather easy to fix: you just need to copy all dma_fences from the
page table dma_resv object over to the BO's dma_resv object in the gem
close handler, i.e. exactly what you suggested with the dma_resv_copy
function.

#2 is a nightmare.

We can't move the TLB flush to the end of the unmap operation because
async TLB flushes are either a bit complicated (double flushes etc.) or
don't even work at all because of hw bugs. So to have a reliable TLB
flush we must make sure that nothing else is ongoing, and that means a
CS->VM->CS barrier.

We try very hard to circumvent that already on maps by (for example)
using a completely new VMID for CS after the VM map operation.

But for the unmap operation we would need some kind of special
dma_fence implementation which would not only wait for all existing
dma_fences but also for the ones added until the unmap operation is
completed. Otherwise the operation we do at #1 would simply not catch
all dma_fences which have access to the memory.

That's certainly doable, but I think just using the drm_exec stuff I
already came up with is easier.

When we can grab locks for all the BOs involved, amdgpu_vm_clear_freed()
goes away and we can keep track of the unmap operations in the bo_va
structure.

With that done you can make the explicit sync you noted in the bo_va
structure and implicit sync when the bo_va structure goes away.

Then the only reason I can see why we would need a CS->VM dependency is
implicit synchronization, and that's what we are trying to avoid here in
the first place.

Regards,
Christian.

>> To get rid of this barrier you must first fix the part where CS
>> submissions wait for the VM operation to complete, i.e. the necessity
>> of the barrier.
>>
>> I'm working on this for a couple of years now and I'm really running
>> out of ideas for how to explain this restriction.
>>
>> Regards,
>> Christian.
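The drm_exec work referenced above later landed upstream. A rough sketch of steps 1 and 2 of the plan, locking the VM root page directory together with a BO that is about to be unmapped; the function name is made up, and the three-argument drm_exec_init() follows the API of current kernels:

#include <drm/drm_exec.h>

static int lock_vm_and_bo(struct drm_exec *exec,
                          struct drm_gem_object *vm_root,
                          struct drm_gem_object *bo)
{
        int r;

        drm_exec_init(exec, DRM_EXEC_INTERRUPTIBLE_WAIT, 0);
        drm_exec_until_all_locked(exec) {
                /* Lock the page tables first, then the BO; on ww-mutex
                 * contention everything is unlocked and retried. */
                r = drm_exec_lock_obj(exec, vm_root);
                drm_exec_retry_on_contention(exec);
                if (r)
                        goto error;

                r = drm_exec_lock_obj(exec, bo);
                drm_exec_retry_on_contention(exec);
                if (r)
                        goto error;
        }
        return 0;

error:
        drm_exec_fini(exec);
        return r;
}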
On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
> > [SNIP]
> > But why do we need the VM unmap to be a strict barrier between prior
> > CS and later CS?
>
> Exactly because of the two reasons you mentioned.

This is the part I'm not seeing. I get that removing #2 is a nightmare,
which is why I did something that doesn't violate that constraint.

Like, if an explicit CS that was running before the VM operation runs
till after the VM operation (and hence possibly till after the TLB
flush, or otherwise has the TLB flush not apply due to lack of async
TLB flush support), that is not an issue. It might see the state from
before the unmap, or after the unmap, or some intermediate state, and
all of those would be okay.

We still get the constraint that the TLB flush happens between the VM
unmap and later CS, and hence the unmap is certainly visible to them.

> #1 is rather easy to fix: you just need to copy all dma_fences from the
> page table dma_resv object over to the BO's dma_resv object in the gem
> close handler, i.e. exactly what you suggested with the dma_resv_copy
> function.
>
> #2 is a nightmare.
>
> We can't move the TLB flush to the end of the unmap operation because
> async TLB flushes are either a bit complicated (double flushes etc.) or
> don't even work at all because of hw bugs. So to have a reliable TLB
> flush we must make sure that nothing else is ongoing, and that means a
> CS->VM->CS barrier.
>
> We try very hard to circumvent that already on maps by (for example)
> using a completely new VMID for CS after the VM map operation.
>
> [SNIP]
>
> Regards,
> Christian.
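The map-side VMID trick quoted above comes down to comparing what a VMID has already flushed against what the VM has changed since. A sketch with hypothetical bookkeeping; amdgpu's real VMID manager tracks this through fences and is considerably more involved:

/* Both structures and names are made up for illustration. */
struct vmid_slot {
        u64 flushed_seq;  /* last page-table update this VMID flushed */
};

struct vm_state {
        u64 update_seq;   /* bumped on every map/unmap */
};

/* A CS may skip the TLB flush only if its VMID has already observed
 * the latest page-table update; otherwise flush, or grab a fresh VMID
 * which has never cached any translation for this VM. */
static bool cs_needs_tlb_flush(const struct vm_state *vm,
                               const struct vmid_slot *id)
{
        return id->flushed_seq < vm->update_seq;
}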
Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
> On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
>> [SNIP]
>> Exactly because of the two reasons you mentioned.
> This is the part I'm not seeing. I get that removing #2 is a
> nightmare, which is why I did something that doesn't violate that
> constraint.
>
> [SNIP]
>
> We still get the constraint that the TLB flush happens between the VM
> unmap and later CS, and hence the unmap is certainly visible to them.

So you basically just want to set the sync mode in
amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an
unmap?

That should be doable, but then you don't need any of the other changes.

Regards,
Christian.
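In dma_resv terms, "AMDGPU_SYNC_EXPLICIT even for unmaps" means the unmap keeps waiting for all implicit-sync fences but skips BOOKKEEP ones. A hedged sketch using a blocking wait (the driver plumbs this through its sync objects instead):

#include <linux/dma-resv.h>
#include <linux/sched.h>

/* Wait for KERNEL/WRITE/READ fences only; fences added with
 * DMA_RESV_USAGE_BOOKKEEP (explicit-sync submissions) are above the
 * READ usage level and therefore not waited on. */
static long unmap_prepare_explicit(struct dma_resv *resv)
{
        return dma_resv_wait_timeout(resv, DMA_RESV_USAGE_READ, true,
                                     MAX_SCHEDULE_TIMEOUT);
}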
On Mon, Jun 6, 2022 at 12:15 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
> > [SNIP]
> > We still get the constraint that the TLB flush happens between the VM
> > unmap and later CS, and hence the unmap is certainly visible to them.
>
> So you basically just want to set the sync mode in
> amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an
> unmap?

Yes, with the caveat that I want to do that only for
DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with implicit
sync we get the old implicit behavior, and if we submit a CS with
explicit sync we get the new explicit behavior). The rest of the series
is basically just for enabling explicit sync submissions.

> That should be doable, but then you don't need any of the other changes.
>
> Regards,
> Christian.
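Bas's per-fence caveat can be sketched with the resv iterator: sync to KERNEL/WRITE/READ fences unconditionally, and to BOOKKEEP fences only when the VM operation is not allowed to overlap them. add_dep() stands in for a scheduler dependency hook and is hypothetical; the resv lock is assumed held:

#include <linux/dma-resv.h>
#include <linux/dma-fence.h>

static int unmap_sync(struct dma_resv *resv, bool overlap_explicit,
                      int (*add_dep)(struct dma_fence *f))
{
        struct dma_resv_iter cursor;
        struct dma_fence *f;
        int r;

        /* Iterating at BOOKKEEP level returns every fence. */
        dma_resv_for_each_fence(&cursor, resv, DMA_RESV_USAGE_BOOKKEEP, f) {
                /* Skip explicit-sync fences when overlap is allowed. */
                if (overlap_explicit &&
                    dma_resv_iter_usage(&cursor) == DMA_RESV_USAGE_BOOKKEEP)
                        continue;

                r = add_dep(f);
                if (r)
                        return r;
        }
        return 0;
}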
Am 06.06.22 um 12:30 schrieb Bas Nieuwenhuizen:
> On Mon, Jun 6, 2022 at 12:15 PM Christian König
> <christian.koenig@amd.com> wrote:
>> [SNIP]
>> So you basically just want to set the sync mode in
>> amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an
>> unmap?
> Yes, with the caveat that I want to do that only for
> DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with implicit
> sync we get the old implicit behavior, and if we submit a CS with
> explicit sync we get the new explicit behavior). The rest of the series
> is basically just for enabling explicit sync submissions.

That part won't work at all and would cause additional synchronization
problems.

First of all, for implicit synced CS we should use READ, not BOOKKEEP,
because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've
fixed the memory corruption this used to cause, but it is still nice to
avoid.

BOOKKEEP can only be used by VM updates themselves, so that they don't
interfere with CS.

Then the second problem is that the VM IOCTL has absolutely no idea
what the CS IOCTL would be doing. That's why we have added the EXPLICIT
sync flag on the BO.

Regards,
Christian.
On Mon, Jun 6, 2022 at 12:35 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 06.06.22 um 12:30 schrieb Bas Nieuwenhuizen:
> > [SNIP]
> > Yes, with the caveat that I want to do that only for
> > DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with implicit
> > sync we get the old implicit behavior, and if we submit a CS with
> > explicit sync we get the new explicit behavior). The rest of the series
> > is basically just for enabling explicit sync submissions.
>
> That part won't work at all and would cause additional synchronization
> problems.
>
> First of all, for implicit synced CS we should use READ, not BOOKKEEP,
> because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've
> fixed the memory corruption this used to cause, but it is still nice to
> avoid.

Yes, what I'm saying is that on implicit sync, CS submission should add
READ fences to the dma_resv, and on explicit sync, CS submission should
add BOOKKEEP fences.

> BOOKKEEP can only be used by VM updates themselves, so that they don't
> interfere with CS.

That is exactly why we would go with BOOKKEEP for explicit sync CS
submissions, no? An explicit submission shouldn't interfere with any
other CS submissions. That includes being totally ignored by GL
importers (if we want synchronization there between an explicit
submission and GL, userspace is expected to use Jason's dma-buf fence
import/export IOCTLs).

> Then the second problem is that the VM IOCTL has absolutely no idea
> what the CS IOCTL would be doing. That's why we have added the EXPLICIT
> sync flag on the BO.

It doesn't need to? We just use a different sync_mode for BOOKKEEP
fences vs. others:
https://patchwork.freedesktop.org/patch/487887/?series=104578&rev=2

(The nice thing about doing it this way is that it is independent of
the IOCTL, i.e. it also works for the delayed mapping changes we
trigger on CS submit.)

> Regards,
> Christian.
>
> [SNIP]
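Why BOOKKEEP fences are invisible to implicit-sync importers: an importer picks its wait level with dma_resv_usage_rw(), which maps an access to WRITE or READ usage and never to BOOKKEEP. A minimal sketch:

#include <linux/dma-resv.h>
#include <linux/sched.h>

/* A writer waits at READ level (all readers and writers); a reader
 * waits at WRITE level (writers only). BOOKKEEP fences sit above both
 * levels and are never included. */
static long importer_wait(struct dma_resv *resv, bool write)
{
        return dma_resv_wait_timeout(resv, dma_resv_usage_rw(write),
                                     true, MAX_SCHEDULE_TIMEOUT);
}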
Hi Christian,

Friendly ping on the comments here. Are you okay with me re-spinning
the series with a fixed patch 1 or do we need further discussion on
the approach here?

Thanks,
Bas

On Mon, Jun 6, 2022 at 1:00 PM Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl> wrote:
>
> [SNIP]
Am 06.06.22 um 13:00 schrieb Bas Nieuwenhuizen:
> On Mon, Jun 6, 2022 at 12:35 PM Christian König
> <christian.koenig@amd.com> wrote:
>> [SNIP]
>> First of all, for implicit synced CS we should use READ, not BOOKKEEP,
>> because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've
>> fixed the memory corruption this used to cause, but it is still nice to
>> avoid.
> Yes, what I'm saying is that on implicit sync, CS submission should add
> READ fences to the dma_resv, and on explicit sync, CS submission should
> add BOOKKEEP fences.

No, exactly that is wrong.

Implicit CS submissions should add WRITE fences.

Explicit CS submissions should add READ fences.

Only VM updates should add BOOKKEEP fences.

>> BOOKKEEP can only be used by VM updates themselves, so that they don't
>> interfere with CS.
> That is exactly why we would go with BOOKKEEP for explicit sync CS
> submissions, no? An explicit submission shouldn't interfere with any
> other CS submissions. That includes being totally ignored by GL
> importers (if we want synchronization there between an explicit
> submission and GL, userspace is expected to use Jason's dma-buf fence
> import/export IOCTLs).

No, that would break existing DMA-buf rules. Explicit CS submissions
are still a dependency for implicit submissions.

>> Then the second problem is that the VM IOCTL has absolutely no idea
>> what the CS IOCTL would be doing. That's why we have added the EXPLICIT
>> sync flag on the BO.
> It doesn't need to? We just use a different sync_mode for BOOKKEEP
> fences vs. others:
> https://patchwork.freedesktop.org/patch/487887/?series=104578&rev=2

No, exactly that's completely broken.

Regards,
Christian.

> (The nice thing about doing it this way is that it is independent of
> the IOCTL, i.e. it also works for the delayed mapping changes we
> trigger on CS submit.)
>
> [SNIP]
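The fence-usage rule Christian states here, written as a table; a sketch mirroring the message, not an upstream helper:

#include <linux/dma-resv.h>

static enum dma_resv_usage fence_usage_for(bool is_vm_update,
                                           bool explicit_sync)
{
        if (is_vm_update)
                return DMA_RESV_USAGE_BOOKKEEP; /* never blocks CS */
        if (explicit_sync)
                return DMA_RESV_USAGE_READ;     /* visible to importers */
        return DMA_RESV_USAGE_WRITE;            /* full implicit sync */
}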
Hi Bas, sorry I totally missed your reply. Just tried to answer your original questions. Regards, Christian. Am 15.06.22 um 02:40 schrieb Bas Nieuwenhuizen: > Hi Christian, > > Friendly ping on the comments here. Are you okay with me re-spinning > the series with a fixed patch 1 or do we need further discussion on > the approach here? > > Thanks, > Bas > > On Mon, Jun 6, 2022 at 1:00 PM Bas Nieuwenhuizen > <bas@basnieuwenhuizen.nl> wrote: >> On Mon, Jun 6, 2022 at 12:35 PM Christian König >> <christian.koenig@amd.com> wrote: >>> Am 06.06.22 um 12:30 schrieb Bas Nieuwenhuizen: >>>> On Mon, Jun 6, 2022 at 12:15 PM Christian König >>>> <christian.koenig@amd.com> wrote: >>>>> >>>>> Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen: >>>>>> On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote: >>>>>>> Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen: >>>>>>>> [SNIP] >>>>>>>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it? >>>>>>>>>> For this series, not really. To clarify there are two sides for >>>>>>>>>> getting GPU bubbles and no overlap: >>>>>>>>>> >>>>>>>>>> (1) VM operations implicitly wait for earlier CS submissions >>>>>>>>>> (2) CS submissions implicitly wait for earlier VM operations >>>>>>>>>> >>>>>>>>>> Together, these combine to ensure that you get a (potentially small) >>>>>>>>>> bubble any time VM work happens. >>>>>>>>>> >>>>>>>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing >>>>>>>>>> to do. However, while writing the userspace for this I noticed this >>>>>>>>>> isn't enough to get rid of all our GPU bubbles. In particular when >>>>>>>>>> doing a non-sparse map of a new BO, that tends to need to be waited on >>>>>>>>>> for the next CS anyway for API semantics. Due to VM operations >>>>>>>>>> happening on a single timeline that means this high priority map can >>>>>>>>>> end up being blocked by earlier sparse maps and hence the bubble in >>>>>>>>>> that case still exists. >>>>>>>>>> >>>>>>>>>> So in this series I try to tackle (1) instead. Since GPU work >>>>>>>>>> typically lags behind CPU submissions and VM operations aren't that >>>>>>>>>> slow, we can typically execute VM operations early enough that any >>>>>>>>>> implicit syncs from (2) are less/no issue. >>>>>>>>> Ok, once more since you don't seem to understand what I want to say: It >>>>>>>>> isn't possible to fix #1 before you have fixed #2. >>>>>>>>> >>>>>>>>> The VM unmap operation here is a barrier which divides the CS operations >>>>>>>>> in a before and after. This is intentional design. >>>>>>>> Why is that barrier needed? The two barriers I got and understood and >>>>>>>> I think we can deal with: >>>>>>>> >>>>>>>> 1) the VM unmap is a barrier between prior CS and later memory free. >>>>>>>> 2) The TLB flush need to happen between a VM unmap and later CS. >>>>>>>> >>>>>>>> But why do we need the VM unmap to be a strict barrier between prior >>>>>>>> CS and later CS? >>>>>>> Exactly because of the two reasons you mentioned. >>>>>> This is the part I'm not seeing. I get that removing #2 is a >>>>>> nightmare, which is why I did something that doesn't violate that >>>>>> constraint. >>>>>> >>>>>> Like if an explicit CS that was running before the VM operation runs >>>>>> till after the VM operation (and hence possibly till after the TLB >>>>>> flush, or otherwise have the TLB flush not apply due to lack of async >>>>>> TLB flush support), that is not an issue. 
It might see the state from >>>>>> before the unmap, or after the unmap, or some intermediate state and >>>>>> all of those would be okay. >>>>>> >>>>>> We still get the constraint that the TLB flush happens between the VM >>>>>> unmap and later CS and hence the unmap is certainly visible to them. >>>>> So you basically just want to set the sync mode in >>>>> amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap? >>>> Yes, with the caveat that I want to do that only for >>>> DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with >>>> implicit sync we get the old implicit behavior, if we submit a CS with >>>> explicit sync we get the new explicit behavior). The rest of the >>>> series is basically just for enabling explicit sync submissions. >>> That part won't work at all and would cause additional synchronization >>> problems. >>> >>> First of all for implicit synced CS we should use READ, not BOOKKEEP. >>> Because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've >>> fixed that this causes memory corruption, but it is still nice to avoid. >> Yes, what I'm saying is that on implicit sync CS submission should add >> READ fences to the dma resv and on explicit sync CS submission should >> add BOOKKEEP fences. >> >>> BOOKKEEP can only be used by VM updates themselves. So that they don't >>> interfere with CS. >> That is the point why we would go BOOKKEEP for explicit sync CS >> submissions, no? Explicit submission shouldn't interfere with any >> other CS submissions. That includes being totally ignored by GL >> importers (if we want to have synchronization there between an >> explicit submission and GL, userspace is expected to use Jason's >> dmabuf fence import/export IOCTLs) >> >>> Then the second problem is that the VM IOCTL has absolutely no idea what >>> the CS IOCTL would be doing. That's why we have added the EXPLICIT sync >>> flag on the BO. >> It doesn't need to? We just use a different sync_mode for BOOKKEEP >> fences vs others: >> https://patchwork.freedesktop.org/patch/487887/?series=104578&rev=2 >> >> (the nice thing about doing it this way is that it is independent of >> the IOCTL, i.e. also works for the delayed mapping changes we trigger >> on CS submit) >> >>> Regards, >>> Christian. >>> >>>>> That should be doable, but then you don't need any of the other changes. >>>>> >>>>> Regards, >>>>> Christian. >>>>> >>>>>>> #1 Is rather easy to fix, you just need to copy all dma_fences from the >>>>>>> page table dma_resv object over to the BOs dma_resv object in the gem >>>>>>> close handler. E.g. exactly what you suggested with the dma_resv_copy >>>>>>> function. >>>>>>> >>>>>>> #2 is a nightmare. >>>>>>> >>>>>>> We can't move the TLB flush at the end of the unmap operation because on >>>>>>> async TLB flushes are either a bit complicated (double flushes etc..) or >>>>>>> don't even work at all because of hw bugs. So to have a reliable TLB >>>>>>> flush we must make sure that nothing else is ongoing and that means >>>>>>> CS->VM->CS barrier.
>>>>>>> >>>>>>> We try very hard to circumvent that already on maps by (for example) >>>>>>> using a completely new VMID for CS after the VM map operation. >>>>>>> >>>>>>> But for the unmap operation we would need some kind special dma_fence >>>>>>> implementation which would not only wait for all existing dma_fence but >>>>>>> also for the one added until the unmap operation is completed. Cause >>>>>>> otherwise our operation we do at #1 would simply not catch all >>>>>>> dma_fences which have access to the memory. >>>>>>> >>>>>>> That's certainly doable, but I think just using the drm_exec stuff I >>>>>>> already came up with is easier. >>>>>>> >>>>>>> When we can grab locks for all the BOs involved amdgpu_vm_clear_freed() >>>>>>> goes away and we can keep track of the unmap operations in the bo_va >>>>>>> structure. >>>>>>> >>>>>>> With that done you can make the explicit sync you noted in the bo_va >>>>>>> structure and implicit sync when the bo_va structure goes away. >>>>>>> >>>>>>> Then the only reason I can see why we would need a CS->VM dependency is >>>>>>> implicit synchronization, and that's what we are trying to avoid here in >>>>>>> the first place. >>>>>>> >>>>>>> Regards, >>>>>>> Christian. >>>>>>> >>>>>>>>> To get rid of this barrier you must first fix the part where CS >>>>>>>>> submissions wait for the VM operation to complete, e.g. the necessity of >>>>>>>>> the barrier. >>>>>>>>> >>>>>>>>> I'm working on this for a couple of years now and I'm really running out >>>>>>>>> of idea how to explain this restriction. >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Christian. >>>>>>>>>
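In dma_resv terms, the sync-mode question quoted above boils down to which usage level the VM update iterates when collecting its dependencies. A rough sketch, not the actual amdgpu code (which goes through amdgpu_sync_resv() and enum amdgpu_sync_mode):

        struct dma_resv_iter cursor;
        struct dma_fence *fence;

        /* Iterating at KERNEL level only returns buffer moves/clears;
         * iterating at BOOKKEEP level would return everything, user CS
         * fences included. The reservation lock must be held for this
         * iterator variant. */
        dma_resv_for_each_fence(&cursor, resv, DMA_RESV_USAGE_KERNEL, fence)
                dma_fence_wait(fence, false);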
On Wed, Jun 15, 2022 at 9:00 AM Christian König <christian.koenig@amd.com> wrote: > > Am 06.06.22 um 13:00 schrieb Bas Nieuwenhuizen: > > On Mon, Jun 6, 2022 at 12:35 PM Christian König > > <christian.koenig@amd.com> wrote: > >> [SNIP] > >> That part won't work at all and would cause additional synchronization > >> problems. > >> > >> First of all for implicit synced CS we should use READ, not BOOKKEEP. > >> Because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've > >> fixed that this causes memory corruption, but it is still nice to avoid. > > Yes, what I'm saying is that on implicit sync CS submission should add > > READ fences to the dma resv and on explicit sync CS submission should > > add BOOKKEEP fences. > > No, exactly that is wrong. > > Implicit CS submissions should add WRITE fences. > > Explicit CS submissions should add READ fences. > > Only VM updates should add BOOKKEEP fences. > > >> BOOKKEEP can only be used by VM updates themselves. So that they don't > >> interfere with CS. > > That is the point why we would go BOOKKEEP for explicit sync CS > > submissions, no? Explicit submission shouldn't interfere with any > > other CS submissions. That includes being totally ignored by GL > > importers (if we want to have synchronization there between an > > explicit submission and GL, userspace is expected to use Jason's > > dmabuf fence import/export IOCTLs) > > No, that would break existing DMA-buf rules. > > Explicit CS submissions are still a dependency for implicit submissions. This is explicitly what we don't want for explicit submissions and why I waited with this series until the DMA_RESV_USAGE series landed. We wish to opt out from implicit sync completely, and just use the IOCTLs Jason wrote for back-compat with windowing systems that need it. If BOOKKEEP isn't for that, should we add a new USAGE? > > > > > Then the second problem is that the VM IOCTL has absolutely no idea what > > the CS IOCTL would be doing. That's why we have added the EXPLICIT sync > > flag on the BO. > > It doesn't need to? We just use a different sync_mode for BOOKKEEP > > fences vs others: > > https://patchwork.freedesktop.org/patch/487887/?series=104578&rev=2 > > No, exactly that's completely broken. > > Regards, > Christian. > > > > > (the nice thing about doing it this way is that it is independent of > > the IOCTL, i.e. also works for the delayed mapping changes we trigger > > on CS submit) > > > >> Regards, > >> Christian. > >> > >>>> That should be doable, but then you don't need any of the other changes. > >>>> > >>>> Regards, > >>>> Christian. > >>>> > >>>>>> #1 Is rather easy to fix, you just need to copy all dma_fences from the > >>>>>> page table dma_resv object over to the BOs dma_resv object in the gem > >>>>>> close handler. E.g. exactly what you suggested with the dma_resv_copy > >>>>>> function. > >>>>>> > >>>>>> #2 is a nightmare. > >>>>>> > >>>>>> We can't move the TLB flush at the end of the unmap operation because on > >>>>>> async TLB flushes are either a bit complicated (double flushes etc..) or > >>>>>> don't even work at all because of hw bugs.
So to have a reliable TLB > >>>>>> flush we must make sure that nothing else is ongoing and that means > >>>>>> CS->VM->CS barrier. > >>>>>> > >>>>>> We try very hard to circumvent that already on maps by (for example) > >>>>>> using a completely new VMID for CS after the VM map operation. > >>>>>> > >>>>>> But for the unmap operation we would need some kind special dma_fence > >>>>>> implementation which would not only wait for all existing dma_fence but > >>>>>> also for the one added until the unmap operation is completed. Cause > >>>>>> otherwise our operation we do at #1 would simply not catch all > >>>>>> dma_fences which have access to the memory. > >>>>>> > >>>>>> That's certainly doable, but I think just using the drm_exec stuff I > >>>>>> already came up with is easier. > >>>>>> > >>>>>> When we can grab locks for all the BOs involved amdgpu_vm_clear_freed() > >>>>>> goes away and we can keep track of the unmap operations in the bo_va > >>>>>> structure. > >>>>>> > >>>>>> With that done you can make the explicit sync you noted in the bo_va > >>>>>> structure and implicit sync when the bo_va structure goes away. > >>>>>> > >>>>>> Then the only reason I can see why we would need a CS->VM dependency is > >>>>>> implicit synchronization, and that's what we are trying to avoid here in > >>>>>> the first place. > >>>>>> > >>>>>> Regards, > >>>>>> Christian. > >>>>>> > >>>>>>>> To get rid of this barrier you must first fix the part where CS > >>>>>>>> submissions wait for the VM operation to complete, e.g. the necessity of > >>>>>>>> the barrier. > >>>>>>>> > >>>>>>>> I'm working on this for a couple of years now and I'm really running out > >>>>>>>> of idea how to explain this restriction. > >>>>>>>> > >>>>>>>> Regards, > >>>>>>>> Christian. > >>>>>>>> >
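The IOCTLs Bas refers to are Jason Ekstrand's proposed sync_file import/export uapi for dma-buf, which had not been merged when this thread took place. A userspace sketch following that proposal (struct and ioctl names as in the series; they may differ in what finally lands):

#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* Export the dma-buf's current implicit fences as a sync_file fd,
 * e.g. to hand an explicitly synced Vulkan buffer to a consumer. */
static int export_implicit_fences(int dmabuf_fd)
{
        struct dma_buf_export_sync_file args = {
                .flags = DMA_BUF_SYNC_READ,
                .fd = -1,
        };

        if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
                return -1;
        return args.fd;
}

/* Import a sync_file fd as an implicit write fence on the dma-buf,
 * e.g. after an explicitly synced Vulkan submission wrote to it. */
static int import_write_fence(int dmabuf_fd, int sync_file_fd)
{
        struct dma_buf_import_sync_file args = {
                .flags = DMA_BUF_SYNC_WRITE,
                .fd = sync_file_fd,
        };

        return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
}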
Am 17.06.22 um 15:03 schrieb Bas Nieuwenhuizen: > [SNIP] >>>> BOOKKEEP can only be used by VM updates themselves. So that they don't >>>> interfere with CS. >>> That is the point why we would go BOOKKEEP for explicit sync CS >>> submissions, no? Explicit submission shouldn't interfere with any >>> other CS submissions. That includes being totally ignored by GL >>> importers (if we want to have synchronization there between an >>> explicit submission and GL, userspace is expected to use Jason's >>> dmabuf fence import/export IOCTLs) >> No, that would break existing DMA-buf rules. >> >> Explicit CS submissions are still a dependency for implicit submissions. > This is explicitly what we don't want for explicit submissions and why > I waited with this series until the DMA_RESV_USAGE series landed. We > wish to opt out from implicit sync completely, and just use the IOCTLs > Jason wrote for back-compat with windowing systems that need it. > > If BOOKKEEP isn't for that, should we add a new USAGE? BOOKKEEP is exactly for that, but as discussed with Daniel that's not what we want in the kernel. When you mix implicit with explicit synchronization (OpenGL with RADV for example) it should be mandatory for the OpenGL to wait for any RADV submission before issuing an operation. What you want to do is intentionally not supported. Regards, Christian.
Digging out of a hole, apologies to everyone. On Fri, Jun 17, 2022 at 03:08:00PM +0200, Christian König wrote: > Am 17.06.22 um 15:03 schrieb Bas Nieuwenhuizen: > > [SNIP] > > > > > BOOKKEEP can only be used by VM updates themselves. So that they don't > > > > > interfere with CS. > > > > That is the point why we would go BOOKKEEP for explicit sync CS > > > > submissions, no? Explicit submission shouldn't interfere with any > > > > other CS submissions. That includes being totally ignored by GL > > > > importers (if we want to have synchronization there between an > > > > explicit submission and GL, userspace is expected to use Jason's > > > > dmabuf fence import/export IOCTLs) > > > No, that would break existing DMA-buf rules. > > > > > > Explicit CS submissions are still a dependency for implicit submissions. > > This is explicitly what we don't want for explicit submissions and why > > I waited with this series until the DMA_RESV_USAGE series landed. We > > wish to opt out from implicit sync completely, and just use the IOCTLs > > Jason wrote for back-compat with windowing systems that need it. > > > > If BOOKKEEP isn't for that, should we add a new USAGE? > > BOOKKEEP is exactly for that, but as discussed with Daniel that's not what > we want in the kernel. Not sure which Daniel you talked to, but this wasn't me. > When you mix implicit with explicit synchronization (OpenGL with RADV for > example) it should be mandatory for the OpenGL to wait for any RADV > submission before issuing an operation. > > What you want to do is intentionally not supported. vk is very intentional in its rejecting of any implicit sync. Which means when you share a buffer with gl, even in _that_ case there must be no sync automatically, or your implementation is kinda shit. Instead anyone sharing a buffer with vk and using it in gl must take care of sync by importing the timeline syncobj to gl, that's why all these extensions got added. This leaves libva in the cold, but hey libva didn't even get around to adding the full set of modifier extensions so I can't really get myself to care. So in summary this means:

- a CS/execbuf for vk should _only_ set BOOKKEEPING fences (except ofc if there are memory management moves in the preparation, which use KERNEL fences and then become additional dependencies for the job)
- because the vk memory model is that everything currently bound can always be used, this means you set BOOKKEEPING on absolutely everything. The current clever trick amdgpu has with shared buffers is also not really the right thing.
- implicit sync is only controlled through the new import/export ioctl on the dma-buf
- if you set any READ/WRITE fences anywhere else, you have potential oversync compared to what the vk spec would want
- userspace gets to keep absolutely all the pieces here. Which is not an issue, because userspace is totally allowed to fill a buffer with garbage and hand that to the compositor already, so there's nothing new going wrong here.
- ideally (definitely required for vk sparse) when you unbind or rebind, the BOOKKEEPING fences for the vm/ctx for the old buffers simply get replaced by the pte clearing and tlb flushing fences (like amdkfd does for compute, vk really just wants to look like compute in everything). In practice, especially with partial and multiple mappings of the same underlying bo involved, this might be too expensive to accurately track since you can only do the replacement trick when the last mapping is gone. It might be worth it for private bo though, dunno.
For amdgpu the current special owner checks mostly allow you to get the semantics vulkan wants. But it breaks down when you have cross-device or cross-process sharing. We should probably also document this in the kerneldoc for the BOOKKEEPING usage that this is the fence type that vulkan cs should use in all drivers, otherwise this will become an endless mess of driver specific hacks (i.e. the world we currently live in). -Daniel
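The "replacement trick" in Daniel's last point maps onto an existing primitive: dma_resv_replace_fences(), which the DMA_RESV_USAGE series added for amdkfd's eviction fences. A sketch, assuming the driver keeps one fence context per VM and only does this once the last mapping of the BO in that VM is gone:

#include <linux/dma-resv.h>

/* On the final unmap of a BO from a VM, replace that VM's BOOKKEEP
 * fences with the PTE-clear/TLB-flush fence, so memory management
 * afterwards only waits for the unmap, not for all of the VM's
 * pending work. Reservation lock must be held. */
static void replace_vm_fences(struct dma_resv *resv, u64 vm_fence_context,
                              struct dma_fence *unmap_done)
{
        dma_resv_replace_fences(resv, vm_fence_context, unmap_done,
                                DMA_RESV_USAGE_BOOKKEEP);
}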
Am 24.06.22 um 22:34 schrieb Daniel Vetter: > Digging out of a hole, apologies to everyone. No problem, I'm totally overworked as well. > On Fri, Jun 17, 2022 at 03:08:00PM +0200, Christian König wrote: >> Am 17.06.22 um 15:03 schrieb Bas Nieuwenhuizen: >>> [SNIP] >> BOOKKEEP is exactly for that, but as discussed with Daniel that's not what >> we want in the kernel. > Not sure which Daniel you talked to, but this wasn't me. Hui what? Of course I'm talking about you. >> When you mix implicit with explicit synchronization (OpenGL with RADV for >> example) it should be mandatory for the OpenGL to wait for any RADV >> submission before issuing an operation. >> >> What you want to do is intentionally not supported. > vk is very intentional in its rejecting of any implicit sync. [SNIP] > We should probably also document this in the kerneldoc for the BOOKKEEPING > usage that this is the fence type that vulkan cs should use in all > drivers, otherwise this will become an endless mess of driver specific > hacks (i.e. the world we currently live in). Well, Daniel, somehow we are not talking about the same thing here :) I've documented exactly what you describe above in the initial patch which added BOOKKEEPING (I still called it OTHER in that iteration):

> > + /**
> > +  * @DMA_RESV_USAGE_OTHER: No implicit sync.
> > +  *
> > +  * This should be used for operations which don't want to add an
> > +  * implicit dependency at all, but still have a dependency on memory
> > +  * management.
> > +  *
> > +  * This might include things like preemption fences as well as device
> > +  * page table updates or even userspace command submissions.
> > +  *
> > +  * The kernel memory management *always* need to wait for those fences
> > +  * before moving or freeing the resource protected by the dma_resv
> > +  * object.
> > +  */
> > + DMA_RESV_USAGE_OTHER

Later on I've even explicitly mentioned that this is for Vulkan submissions. But it was *you* who made me remove that, with the explanation that we have to use READ for that or we break existing userspace. I mean that still makes a lot of sense to me, because if I'm not completely mistaken we do have use cases which break, especially Vulkan+encoding. Regards, Christian. > -Daniel
On Sat, Jun 25, 2022 at 03:58:17PM +0200, Christian König wrote: > Am 24.06.22 um 22:34 schrieb Daniel Vetter: > > Digging out of a hole, apologies to everyone. > > No problem, I'm totally overworked as well. > > > On Fri, Jun 17, 2022 at 03:08:00PM +0200, Christian König wrote: > > > Am 17.06.22 um 15:03 schrieb Bas Nieuwenhuizen: > > > > [SNIP] > > > BOOKKEEP is exactly for that, but as discussed with Daniel that's not what > > > we want in the kernel. > > Not sure which Daniel you talked to, but this wasn't me. > > Hui what? Of course I'm talking about you. > > > > When you mix implicit with explicit synchronization (OpenGL with RADV for > > > example) it should be mandatory for the OpenGL to wait for any RADV > > > submission before issuing an operation. > > > > > > What you want to do is intentionally not supported. > > vk is very intentional in its rejecting of any implicit sync. > > [SNIP] > > > We should probably also document this in the kerneldoc for the BOOKKEEPING > > usage that this is the fence type that vulkan cs should use in all > > drivers, otherwise this will become an endless mess of driver specific > > hacks (i.e. the world we currently live in). > > Well, Daniel, somehow we are not talking about the same thing here :) > > I've documented exactly what you describe above in the initial patch which > added BOOKKEEPING (I still called it OTHER in that iteration):

> > > + /**
> > > +  * @DMA_RESV_USAGE_OTHER: No implicit sync.
> > > +  *
> > > +  * This should be used for operations which don't want to add an
> > > +  * implicit dependency at all, but still have a dependency on memory
> > > +  * management.
> > > +  *
> > > +  * This might include things like preemption fences as well as device
> > > +  * page table updates or even userspace command submissions.
> > > +  *
> > > +  * The kernel memory management *always* need to wait for those fences
> > > +  * before moving or freeing the resource protected by the dma_resv
> > > +  * object.
> > > +  */
> > > + DMA_RESV_USAGE_OTHER

> Later on I've even explicitly mentioned that this is for Vulkan submissions. > > But it was *you* who made me remove that, with the explanation that we have > to use READ for that or we break existing userspace. Hm, in the only discussion I've found, I actually mentioned we should highlight that vk should use OTHER even more than what you had. Quoting myself:

> + * This might include things like preemption fences as well as device
> + * page table updates or even userspace command submissions.

I think we should highlight a bit more that for explicitly synchronized userspace like vk OTHER is the normal case. So really not an exception. Ofc aside from amdkfd there's currently no driver doing this, but really we should have lots of them ... See https://lore.kernel.org/dri-devel/YZ+y+Uwo809qtvs5@phenom.ffwll.local/ I didn't find anything else. So not sure how we managed to create confusion here :-( > I mean that still makes a lot of sense to me, because if I'm not completely > mistaken we do have use cases which break, especially Vulkan+encoding. Yeah I think we only have some communication fumble here, nothing else :-) And yes libva doesn't have any support for vk's explicit sync model, so that will just fall flat on its face. Might motivate a few folks to fix libva :-) Note that on i915 side it's exactly the same, we've also been setting the READ fence thus far.
Since the breakage will be introduced by upgrading mesa we'll at least avoid the kernel regression complaints, or at least I hope we can get away with that. Since really I don't have any idea how it could be fixed otherwise, except through some really, really terrible hacks. Maybe kernel module option or so. Anyway I think all we need is just a patch to the dma_resv docs to explain that USAGE_BOOKKEEPING is what vulkan userspace wants, and why. Bas, you're up to typing that? Cheers, Daniel
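For reference, one possible wording for the doc patch requested here; a draft along the thread's conclusions, not the text that eventually landed:

/**
 * DMA_RESV_USAGE_BOOKKEEP: No implicit sync.
 *
 * This should be used by submissions which don't want to participate
 * in implicit synchronization: explicitly synced Vulkan command
 * submissions as well as preemption fences and page table updates.
 *
 * Explicitly synced userspace is expected to handle cross-driver
 * synchronization itself, e.g. via the dma-buf sync_file
 * import/export ioctls, instead of relying on READ/WRITE fences.
 */
DMA_RESV_USAGE_BOOKKEEP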
Hey Daniel, Am 26.06.22 um 00:45 schrieb Daniel Vetter: > [SNIP] > I think we should highlight a bit more that for explicitly synchronized > userspace like vk OTHER is the normal case. So really not an exception. > Ofc aside from amdkfd there's currently no driver doing this, but really > we should have lots of them ... > > See https://lore.kernel.org/dri-devel/YZ+y+Uwo809qtvs5@phenom.ffwll.local/ > > I didn't find anything else. So not sure how we managed to create > confusion here :-( Well you said something like "Yeah, READ is supposed to be used for that...." and that created the impression that AMDGPU should start using that for Vulkan submissions and you are rejecting my idea of using BOOKKEEP for that. >> I mean that still makes a lot of sense to me, because if I'm not completely >> mistaken we do have use cases which break, especially Vulkan+encoding. > Yeah I think we only have some communication fumble here, nothing else :-) Ok, well then @Bas: Sorry for all the noise, we are actually all on the same page :) > And yes libva doesn't have any support for vk's explicit sync model, so > that will just fall flat on its face. Might motivate a few folks to fix > libva :-) Well that's not the problem. The problem is that we have a couple of use cases where libva is supposed to encode a Vulkan surface without letting Vulkan know about that. In other words: Application shares DMA-buf between Vulkan and VA-API, renders with Vulkan and encodes with VA-API without any explicit synchronization between the two. I know that this is absolutely against the Vulkan specification, but it just happened to work fine. And when you break something which used to work, people start to complain... > Note that on i915 side it's exactly the same, we've also been setting the > READ fence thus far. Since the breakage will be introduced by upgrading > mesa we'll at least avoid the kernel regression complaints, or at least I > hope we can get away with that. Yeah, the path to salvation starts with the words: It's not my f... problem :) > Since really I don't have any idea how it could be fixed otherwise, except > through some really, really terrible hacks. Maybe kernel module option or > so. > > Anyway I think all we need is just a patch to the dma_resv docs to explain > that USAGE_BOOKKEEPING is what vulkan userspace wants, and why. Bas, > you're up to typing that? I can do that. I'm just back from a week of vacation and still digging through my mails. Cheers, Christian. > > Cheers, Daniel
[Back from vacations and work change and out sick and absolutely everything else going wrong] On Mon, Jul 04, 2022 at 03:37:43PM +0200, Christian König wrote: > Hey Daniel, > > Am 26.06.22 um 00:45 schrieb Daniel Vetter: > > [SNIP] > > I think we should highlight a bit more that for explicitly synchronized > > userspace like vk OTHER is the normal case. So really not an exception. > > Ofc aside from amdkfd there's currently no driver doing this, but really > > we should have lots of them ... > > > > See https://lore.kernel.org/dri-devel/YZ+y+Uwo809qtvs5@phenom.ffwll.local/ > > > > I didn't find anything else. So not sure how we managed to create > > confusion here :-( > > Well you said something like "Yeah, READ is supposed to be used for > that...." and that created the impression that AMDGPU should start using > that for Vulkan submissions and you are rejecting my idea of using BOOKKEEP > for that. > > > > I mean that still makes a lot of sense to me, because if I'm not completely > > > mistaken we do have use cases which break, especially Vulkan+encoding. > > Yeah I think we only have some communication fumble here, nothing else :-) > > Ok, well then @Bas: Sorry for all the noise, we are actually all on the same > page :) > > > And yes libva doesn't have any support for vk's explicit sync model, so > > that will just fall flat on its face. Might motivate a few folks to fix > > libva :-) > > Well that's not the problem. The problem is that we have a couple of use > cases where libva is supposed to encode a Vulkan surface without letting > Vulkan know about that. > > In other words: Application shares DMA-buf between Vulkan and VA-API, > renders with Vulkan and encodes with VA-API without any explicit > synchronization between the two. > > I know that this is absolutely against the Vulkan specification, but it just > happened to work fine. And when you break something which used to work, > people start to complain... Yeah I feared that, and worse libva doesn't have the nice gl interop extensions to make it actually work. > > Note that on i915 side it's exactly the same, we've also been setting the > > READ fence thus far. Since the breakage will be introduced by upgrading > > mesa we'll at least avoid the kernel regression complaints, or at least I > > hope we can get away with that. > > Yeah, the path to salvation starts with the words: It's not my f... problem > :) > > > Since really I don't have any idea how it could be fixed otherwise, except > > through some really, really terrible hacks. Maybe kernel module option or > > so. > > > > Anyway I think all we need is just a patch to the dma_resv docs to explain > > that USAGE_BOOKKEEPING is what vulkan userspace wants, and why. Bas, > > you're up to typing that? > > I can do that. I'm just back from a week of vacation and still digging > through my mails. Yeah I think the best path is we're pushing hard for adding the libva syncobj extensions like gl has, so that this can be done properly. And then just pave it over with kernel modoptions until userspace is fixed.
If we end up making implicit sync part of the de facto vk api on linux, then a _lot_ of people will be very sad :-( -Daniel
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c index f10332e1c6c0..31bc73fd1fae 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0; - return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true); + return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true); } /** diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c index 63b484dc76c5..c8d5898bea11 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p, if (!resv) return 0; - return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm); + return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm); } /**
This should be okay because moves themselves use KERNEL usage and hence still sync with BOOKKEEP usage. Then any later submits still wait on any pending VM operations. (i.e. we only made VM ops not wait on BOOKKEEP submits, not the other way around) Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl> --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
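The commit message's argument rests on the ordering of the usage levels introduced by the DMA_RESV_USAGE series; reproduced here for reference. A wait at usage level U includes all fences with usage <= U, so every waiter sees KERNEL fences from buffer moves, while memory management itself waits at BOOKKEEP level (i.e. for everything) before moving or freeing a buffer:

enum dma_resv_usage {
        DMA_RESV_USAGE_KERNEL,          /* moves/clears; every waiter syncs with these */
        DMA_RESV_USAGE_WRITE,           /* implicit-sync writers */
        DMA_RESV_USAGE_READ,            /* implicit-sync readers */
        DMA_RESV_USAGE_BOOKKEEP,        /* no implicit sync: VM updates etc. */
};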