[RFC,3/5] drm/amdgpu: Allow explicit sync for VM ops.

Message ID 20220601004014.158247-4-bas@basnieuwenhuizen.nl (mailing list archive)
State New, archived
Series Add option to disable implicit sync for userspace submits.

Commit Message

Bas Nieuwenhuizen June 1, 2022, 12:40 a.m. UTC
This should be okay because moves themselves use KERNEL usage and
hence still sync with BOOKKEEP usage. Then any later submits still
wait on any pending VM operations.

(i.e. we only made VM ops not wait on BOOKKEEP submits, not the other
 way around)

Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
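
For context (an editorial sketch, not part of the patch): with the dma_resv
usage levels this series builds on, the second sync-mode argument that the
patch switches to AMDGPU_SYNC_EXPLICIT is assumed, from the description
above, to control whether a page-table update also waits for
DMA_RESV_USAGE_BOOKKEEP fences, i.e. the fences that explicitly-synced user
submissions leave behind. A simplified sketch of that filtering; the
function name and the WRITE/READ handling are illustrative:

    /* Sketch only: which reservation fences a VM op still waits for. */
    static bool vm_op_must_wait_sketch(enum dma_resv_usage usage,
                                       bool bookkeep_is_explicit)
    {
            if (usage == DMA_RESV_USAGE_KERNEL)
                    return true;    /* TTM moves/clears: always wait */
            if (usage == DMA_RESV_USAGE_BOOKKEEP && bookkeep_is_explicit)
                    return false;   /* after this patch: VM ops skip these */
            return true;            /* WRITE/READ implicit-sync fences */
    }

Moves themselves still use KERNEL usage and therefore keep syncing with
BOOKKEEP fences, which is why the description argues the change is safe.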

Comments

Christian König June 1, 2022, 8:03 a.m. UTC | #1
On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
> This should be okay because moves themselves use KERNEL usage and
> hence still sync with BOOKKEEP usage. Then any later submits still
> wait on any pending VM operations.
>
> (i.e. we only made VM ops not wait on BOOKKEEP submits, not the other
>   way around)

Well NAK again. This allows access to freed up memory and is a complete 
no-go.

Regards,
Christian.

>
> Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c  | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 +-
>   2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> index f10332e1c6c0..31bc73fd1fae 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
> @@ -51,7 +51,7 @@ static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p,
>   	if (!resv)
>   		return 0;
>   
> -	return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
> +	return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 63b484dc76c5..c8d5898bea11 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -75,7 +75,7 @@ static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
>   	if (!resv)
>   		return 0;
>   
> -	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
> +	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
>   }
>   
>   /**
Bas Nieuwenhuizen June 1, 2022, 8:16 a.m. UTC | #2
On Wed, Jun 1, 2022 at 10:03 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
> > [SNIP]
>
> Well NAK again. This allows access to freed up memory and is a complete
> no-go.

How does this allow access to freed memory? The worst I can see is that
the unmap happens earlier if the app/driver gets the waits wrong,
which wouldn't give access after the underlying BO is freed?

Christian König June 1, 2022, 8:40 a.m. UTC | #3
On 01.06.22 at 10:16, Bas Nieuwenhuizen wrote:
> On Wed, Jun 1, 2022 at 10:03 AM Christian König
> <christian.koenig@amd.com> wrote:
>> On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
>>> [SNIP]
>> Well NAK again. This allows access to freed up memory and is a complete
>> no-go.
> How does this allow access to freed memory? The worst I can see is that
> the unmap happens earlier if the app/driver gets the waits wrong,
> which wouldn't give access after the underlying BO is freed?

To free up memory we need to update the PTEs and then flush those out by
invalidating the TLB.

On gfx6, gfx7 and gfx8, and on some broken gfx10 hw, invalidating the TLB
can only be done while the VMID is idle.

Only gfx9 can reliably invalidate the TLB while it is in use, and even
there it comes with quite a performance penalty (a TLB invalidation can
take multiple seconds).

Because of this, what we do in the kernel driver is sync to everything
when we unmap entries:

         if (!(flags & AMDGPU_PTE_VALID))
                 sync_mode = AMDGPU_SYNC_EQ_OWNER;
         else
                 sync_mode = AMDGPU_SYNC_EXPLICIT;

This acts as a barrier for freeing the memory. In other words, we
intentionally add a bubble which syncs everything.

I've been working for months on a concept for how to do all this without
causing the stalls you observe, but so far haven't come to much of a
conclusion.

Regards,
Christian.
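
For reference, the four amdgpu sync modes mentioned here filter
reservation-object fences roughly as follows (a simplified sketch of the
owner test in amdgpu_sync.c; the real code has additional special cases,
e.g. for KFD eviction fences and always-synced kernel fences):

    static bool amdgpu_sync_test_fence_sketch(enum amdgpu_sync_mode mode,
                                              void *fence_owner, void *owner)
    {
            switch (mode) {
            case AMDGPU_SYNC_ALWAYS:
                    return true;                    /* wait for everything */
            case AMDGPU_SYNC_NE_OWNER:
                    return fence_owner != owner;    /* wait for other owners */
            case AMDGPU_SYNC_EQ_OWNER:
                    return fence_owner == owner;    /* wait for this VM only */
            case AMDGPU_SYNC_EXPLICIT:
                    return false;                   /* no implicit waits */
            }
            return true;
    }

So AMDGPU_SYNC_EQ_OWNER on an unmap makes the PTE clear wait for every
still-running submission of the same VM, which is the barrier described
above.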

Bas Nieuwenhuizen June 1, 2022, 8:48 a.m. UTC | #4
On Wed, Jun 1, 2022 at 10:40 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 01.06.22 at 10:16, Bas Nieuwenhuizen wrote:
> > On Wed, Jun 1, 2022 at 10:03 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> On 01.06.22 at 02:40, Bas Nieuwenhuizen wrote:
> >>> [SNIP]
> >> Well NAK again. This allows access to freed up memory and is a complete
> >> no-go.
> > How does this allow access to freed memory? The worst I can see is that
> > the unmap happens earlier if the app/driver gets the waits wrong,
> > which wouldn't give access after the underlying BO is freed?
>
> To free up memory we need to update the PTEs and then flush those out by
> invalidating the TLB.
>
> On gfx6, gfx7 and gfx8, and on some broken gfx10 hw, invalidating the TLB
> can only be done while the VMID is idle.
>
> Only gfx9 can reliably invalidate the TLB while it is in use, and even
> there it comes with quite a performance penalty (a TLB invalidation can
> take multiple seconds).
>
> Because of this, what we do in the kernel driver is sync to everything
> when we unmap entries:
>
>          if (!(flags & AMDGPU_PTE_VALID))
>                  sync_mode = AMDGPU_SYNC_EQ_OWNER;
>          else
>                  sync_mode = AMDGPU_SYNC_EXPLICIT;
>
> This acts as a barrier for freeing the memory. In other words, we
> intentionally add a bubble which syncs everything.
>
> I've been working for months on a concept for how to do all this without
> causing the stalls you observe, but so far haven't come to much of a
> conclusion.

That might cause an unmap operation too early, but for freeing up the
actual backing memory we still wait for all fences on the BO to finish
first, no? In that case, since BOOKKEEP fences are still added for
explicit sync, that should not be a problem, no?

(If not, that sounds like the obvious fix for making this work?)
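
The guarantee referred to here is TTM's: before a BO's backing store is
released, TTM waits for all fences on its reservation object, BOOKKEEP
ones included. A minimal sketch of that behaviour (simplified; the real
path goes through ttm_bo_release and the delayed-delete worker):

    static void ttm_release_backing_sketch(struct ttm_buffer_object *bo)
    {
            /* Wait for *all* fences, BOOKKEEP included, before the
             * resource backing the BO may be freed. */
            dma_resv_wait_timeout(bo->base.resv, DMA_RESV_USAGE_BOOKKEEP,
                                  false, MAX_SCHEDULE_TIMEOUT);
            ttm_resource_free(bo, &bo->resource);
    }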
Bas Nieuwenhuizen June 1, 2022, 8:59 a.m. UTC | #5
On Wed, Jun 1, 2022, 10:48 Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
wrote:

> On Wed, Jun 1, 2022 at 10:40 AM Christian König
> <christian.koenig@amd.com> wrote:
> > [SNIP]
>
> That might cause an unmap operation too early, but for freeing up the
> actual backing memory we still wait for all fences on the BO to finish
> first, no? In that case, since BOOKKEEP fences are still added for
> explicit sync, that should not be a problem, no?
>
> (If not, that sounds like the obvious fix for making this work?)
>

As an aside this is the same hole/issue as when an app forgets a bo in the
bo list on submission.

Christian König June 1, 2022, 9:01 a.m. UTC | #6
On 01.06.22 at 10:48, Bas Nieuwenhuizen wrote:
> On Wed, Jun 1, 2022 at 10:40 AM Christian König
> <christian.koenig@amd.com> wrote:
>> [SNIP]
> That might cause an unmap operation too early, but for freeing up the
> actual backing memory we still wait for all fences on the BO to finish
> first, no? In that case, since BOOKKEEP fences are still added for
> explicit sync, that should not be a problem, no?
>
> (If not, that sounds like the obvious fix for making this work?)

The problem is we need to wait on fences *not* added to the buffer object.

E.g. what we currently do here while freeing memory is:
1. Update the PTEs and make that update wait for everything!
2. Add the fence of that update to the freed up BO so that this BO isn't 
freed before the next CS.

We might be able to fix this by adding the fences to the BO before 
freeing it manually, but I'm not 100% sure we can actually allocate 
memory for the fences in that moment.

Regards,
Christian.


Bas Nieuwenhuizen June 3, 2022, 1:21 a.m. UTC | #7
On Wed, Jun 1, 2022 at 11:01 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 01.06.22 at 10:48, Bas Nieuwenhuizen wrote:
> > [SNIP]
> > That might cause an unmap operation too early, but for freeing up the
> > actual backing memory we still wait for all fences on the BO to finish
> > first, no? In that case, since BOOKKEEP fences are still added for
> > explicit sync, that should not be a problem, no?
> >
> > (If not, that sounds like the obvious fix for making this work?)
>
> The problem is we need to wait on fences *not* added to the buffer object.

What fences wouldn't be added to the buffer object that we need here?
>
> E.g. what we currently do here while freeing memory is:
> 1. Update the PTEs and make that update wait for everything!
> 2. Add the fence of that update to the freed up BO so that this BO isn't
> freed before the next CS.
>
> We might be able to fix this by adding the fences to the BO before
> freeing it manually, but I'm not 100% sure we can actually allocate
> memory for the fences in that moment.

I think we don't need to be able to. We're already adding the unmap
fence to the BO in the gem close ioctl, and that has the fallback that
if we can't allocate space for the fence in the BO, we wait on the
fence manually on the CPU. I think that is a reasonable fallback for
this as well?

For the TTM move path amdgpu_copy_buffer will wait on the BO resv, and
then following submissions will trigger VM updates that will wait on the
amdgpu_copy_buffer jobs (and hence, transitively, on the work). AFAICT
the amdgpu_bo_move does not trigger any VM updates by itself, and the
amdgpu_bo_move_notify is way after the move (and after the
ttm_bo_move_accel_cleanup which would free the old resource), so any VM
changes triggered by that would see the TTM copy and sync to it.

I do have to fix some stuff indeed, especially for the GEM close, but
with that we should be able to keep the same basic approach?
Christian König June 3, 2022, 8:11 a.m. UTC | #8
On 03.06.22 at 03:21, Bas Nieuwenhuizen wrote:
> [SNIP]
>> The problem is we need to wait on fences *not* added to the buffer object.
> What fences wouldn't be added to the buffer object that we need here?

Basically all still running submissions from the VM which could 
potentially access the BO.

That's why we have the AMDGPU_SYNC_EQ_OWNER in amdgpu_vm_update_range().
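
In code terms (a sketch using the two-mode amdgpu_sync_resv signature from
this series, not the actual amdgpu_vm_update_range code): the unmap path
syncs to every fence whose owner is this VM, which picks up all of its
still-running submissions even when the freed BO was never in any bo list:

    static int vm_unmap_barrier_sketch(struct amdgpu_vm_update_params *p)
    {
            struct dma_resv *resv = p->vm->root.bo->tbo.base.resv;

            /* AMDGPU_SYNC_EQ_OWNER with owner == vm: wait for all fences
             * that previous submissions of this VM left on the reservation. */
            return amdgpu_sync_resv(p->adev, &p->job->sync, resv,
                                    AMDGPU_SYNC_EQ_OWNER, AMDGPU_SYNC_EXPLICIT,
                                    p->vm);
    }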

>> E.g. what we currently do here while freeing memory is:
>> 1. Update the PTEs and make that update wait for everything!
>> 2. Add the fence of that update to the freed up BO so that this BO isn't
>> freed before the next CS.
>>
>> We might be able to fix this by adding the fences to the BO before
>> freeing it manually, but I'm not 100% sure we can actually allocate
>> memory for the fences in that moment.
> I think we don't need to be able to. We're already adding the unmap
> fence to the BO in the gem close ioctl, and that has the fallback that
> if we can't allocate space for the fence in the BO, we wait on the
> fence manually on the CPU. I think that is a reasonable fallback for
> this as well?

Yes, just blocking might work in an OOM situation as well.

> [SNIP]
> I do have to fix some stuff indeed, especially for the GEM close but
> with that we should be able to keep the same basic approach?

Nope, not even remotely.

What we need is the following:
1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
2. When we get a VM operation we not only lock the VM page tables, but 
also all buffers we potentially need to unmap.
3. Nuking the freed list in the amdgpu_vm structure by updating freed 
areas directly when they are unmapped.
4. Tracking those updates inside the bo_va structure for the BO+VM 
combination.
5. When the bo_va structure is destroyed because of closing the handle,
move the last clear operation over to the VM as implicit sync.

Only when all this is done can we resolve the dependency that the
CS currently must wait for any clear operation on the VM.
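
(A rough sketch of what step 5 could look like, reusing the reservation
helpers that appear later in this thread; the fence argument and the usage
level chosen here are illustrative, not an actual implementation:)

    static void bo_va_move_clear_to_vm_sketch(struct amdgpu_vm *vm,
                                              struct dma_fence *last_clear)
    {
            struct dma_resv *resv = vm->root.bo->tbo.base.resv;

            /* Hand the final PTE-clear fence over to the VM so that
             * later submissions implicitly wait for it. */
            if (dma_resv_reserve_fences(resv, 1))
                    dma_fence_wait(last_clear, false);      /* OOM fallback */
            else
                    dma_resv_add_fence(resv, last_clear,
                                       DMA_RESV_USAGE_BOOKKEEP);
    }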

Regards,
Christian.
Bas Nieuwenhuizen June 3, 2022, 10:08 a.m. UTC | #9
On Fri, Jun 3, 2022 at 10:11 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 03.06.22 at 03:21, Bas Nieuwenhuizen wrote:
> [SNIP]
> Nope, not even remotely.
>
> What we need is the following:
> 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
> 2. When we get a VM operation we not only lock the VM page tables, but
> also all buffers we potentially need to unmap.
> 3. Nuking the freed list in the amdgpu_vm structure by updating freed
> areas directly when they are unmapped.
> 4. Tracking those updates inside the bo_va structure for the BO+VM
> combination.
> 5. When the bo_va structure is destroyed because of closing the handle,
> move the last clear operation over to the VM as implicit sync.
>

Hi Christian, isn't that a different problem though (that we're also
trying to solve, but in your series)?

What this patch tries to achieve:

(t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
(t+1) a VM operation on a BO/VM accessed by the CS.

to run concurrently. What it *doesn't* try is

(t+0) a VM operation on a BO/VM accessed by the CS.
(t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)

to run concurrently. When you write

> Only when all this is done can we resolve the dependency that the
> CS currently must wait for any clear operation on the VM.

isn't that all about the second problem?


Christian König June 3, 2022, 10:16 a.m. UTC | #10
On 03.06.22 at 12:08, Bas Nieuwenhuizen wrote:
> [SNIP]
>> [SNIP]
> Hi Christian, isn't that a different problem though (that we're also
> trying to solve, but in your series)?
>
> What this patch tries to achieve:
>
> (t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
> (t+1) a VM operation on a BO/VM accessed by the CS.
>
> to run concurrently. What it *doesn't* try is
>
> (t+0) a VM operation on a BO/VM accessed by the CS.
> (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
>
> to run concurrently. When you write
>
>> Only when all this is done can we resolve the dependency that the
>> CS currently must wait for any clear operation on the VM.
> isn't that all about the second problem?

No, it's the same.

See, what we do in the VM code is artificially insert a bubble so that
all VM clear operations wait for all CS operations, and then use the
clear fence to indicate when the backing store of the BO can be freed.

When you want to remove this bubble (which is certainly a good idea),
you first need to come up with a different approach to handle the clear
operations.

Regards,
Christian.

Bas Nieuwenhuizen June 3, 2022, 11:07 a.m. UTC | #11
On Fri, Jun 3, 2022 at 12:16 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 03.06.22 at 12:08, Bas Nieuwenhuizen wrote:
> > [SNIP]
> >> Only when all this is done can we resolve the dependency that the
> >> CS currently must wait for any clear operation on the VM.
> > isn't that all about the second problem?
>
> No, it's the same.
>
> See, what we do in the VM code is artificially insert a bubble so that
> all VM clear operations wait for all CS operations, and then use the
> clear fence to indicate when the backing store of the BO can be freed.

Isn't that remediated with something like the code below? At least the
gem_close case should be handled with this, and the move case was
already handled by the copy operation.


--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -187,6 +187,39 @@ static int amdgpu_gem_object_open(struct drm_gem_object *obj,
        return 0;
 }

+static void dma_resv_copy(struct dma_resv *src, struct dma_resv *dst)
+{
+       struct dma_resv_iter cursor;
+       struct dma_fence *f;
+       unsigned int num_fences = 0;
+       int r;
+
+       if (src == dst)
+               return;
+
+       /* Both resvs are held locked here, so the locked iterator takes no
+        * fence references and the loops below see the same fences.
+        * Iterating with DMA_RESV_USAGE_BOOKKEEP returns fences of every
+        * usage class, since BOOKKEEP is the weakest level. */
+       dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f)
+               ++num_fences;
+
+       r = dma_resv_reserve_fences(dst, num_fences);
+       if (r) {
+               /* As a last resort on OOM, block for the fences instead of
+                * copying them over. */
+               dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f)
+                       dma_fence_wait(f, false);
+               return;
+       }
+
+       dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f)
+               dma_resv_add_fence(dst, f, dma_resv_iter_usage(&cursor));
+}
+
 static void amdgpu_gem_object_close(struct drm_gem_object *obj,
                                    struct drm_file *file_priv)
 {
@@ -233,6 +266,8 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj,
        amdgpu_bo_fence(bo, fence, true);
        dma_fence_put(fence);

+       dma_resv_copy(vm->root.bo->tbo.base.resv, bo->tbo.base.resv);
+
 out_unlock:
        if (unlikely(r < 0))
                dev_err(adev->dev, "failed to clear page "

Christian König June 3, 2022, 12:08 p.m. UTC | #12
On 03.06.22 at 13:07, Bas Nieuwenhuizen wrote:
> On Fri, Jun 3, 2022 at 12:16 PM Christian König
> <christian.koenig@amd.com> wrote:
>> [SNIP]
> Isn't that remediated with something like the code below? At least the
> gem_close case should be handled with this, and the move case was
> already handled by the copy operation.

That is one necessary puzzle piece, yes. But you need more than that.

Especially the explicit unmap operation needs to be converted into an 
implicit unmap to get the TLB flush right.

I think I know all the necessary steps now; it's just tons of work to do.

Regards,
Christian.

Bas Nieuwenhuizen June 3, 2022, 12:39 p.m. UTC | #13
On Fri, Jun 3, 2022 at 2:08 PM Christian König <christian.koenig@amd.com> wrote:
>
> On 03.06.22 at 13:07, Bas Nieuwenhuizen wrote:
> > On Fri, Jun 3, 2022 at 12:16 PM Christian König
> > <christian.koenig@amd.com> wrote:
> > [SNIP]
> > Isn't that remediated with something like the code below? At least the
> > gem_close case should be handled with this, and the move case was
> > already handled by the copy operation.
>
> That is one necessary puzzle piece, yes. But you need more than that.
>
> Especially the explicit unmap operation needs to be converted into an
> implicit unmap to get the TLB flush right.

This doesn't change anything about the TLB flush though, since all
unmap -> later-job dependencies are still implicit.

So the worst that could happen (if e.g. userspace gets the
waits/dependencies wrong) is:

1) non-implicit CS gets submitted that touches a BO
2) VM unmap on that BO happens
2.5) the CS from 1 is still active due to missing dependencies
2.6) but any CS submission after 2 will trigger a TLB flush
3) A TLB flush happens for a new CS
4) All CS submissions here see the TLB flush and hence the unmap

So the main problem would be the CS from step 1, but (a) if that
VM faults, that is the app's own fault, and (b) because we don't free the
memory until (1) finishes, it is not a security issue kernel-wise.

Christian König June 3, 2022, 12:49 p.m. UTC | #14
On 03.06.22 at 14:39, Bas Nieuwenhuizen wrote:
> On Fri, Jun 3, 2022 at 2:08 PM Christian König <christian.koenig@amd.com> wrote:
>> [SNIP]
>> That is one necessary puzzle piece, yes. But you need more than that.
>>
>> Especially the explicit unmap operation needs to be converted into an
>> implicit unmap to get the TLB flush right.
> This doesn't change anything about the TLB flush though, since all
> unmap -> later-job dependencies are still implicit.
>
> So the worst that could happen (if e.g. userspace gets the
> waits/dependencies wrong) is:
>
> 1) non-implicit CS gets submitted that touches a BO
> 2) VM unmap on that BO happens
> 2.5) the CS from 1 is still active due to missing dependencies
> 2.6) but any CS submission after 2 will trigger a TLB flush

Yeah, but that's exactly the bubble we try to avoid. Isn't it?

When we want to do a TLB flush, the unmap operation must already be
completed. Otherwise the flush is rather pointless, since any access
could reload the not-yet-updated PTEs.

And this means that we need to artificially add a dependency on every 
command submission after 2 to wait until the unmap operation is completed.

Christian.

Bas Nieuwenhuizen June 3, 2022, 1:23 p.m. UTC | #15
On Fri, Jun 3, 2022 at 2:49 PM Christian König <christian.koenig@amd.com> wrote:
>
> On 03.06.22 at 14:39, Bas Nieuwenhuizen wrote:
> > [SNIP]
> > This doesn't change anything about the TLB flush though? Since all
> > unmap -> later jobs dependencies are still implicit.
> >
> > So the worst that could happen (if e.g. userspace gets the
> > waits/dependencies wrong) is
> >
> > 1) non-implicit CS gets submitted that touches a BO
> > 2)  VM unmap on that BO happens
> > 2.5) the CS from 1 is still active due to missing dependencies
> > 2.6) but any CS submission after 2 will trigger a TLB flush
>
> Yeah, but that's exactly the bubble we try to avoid. Isn't it?

For this series, not really. To clarify, there are two sides to
getting GPU bubbles and losing overlap:

(1) VM operations implicitly wait for earlier CS submissions
(2) CS submissions implicitly wait for earlier VM operations

Together, these combine to ensure that you get a (potentially small)
bubble any time VM work happens.

Your series (and further ideas) tackles (2), and is a worthwhile thing
to do. However, while writing the userspace for this I noticed this
isn't enough to get rid of all our GPU bubbles. In particular, when
doing a non-sparse map of a new BO, that map tends to need to be
waited on for the next CS anyway for API semantics. Due to VM
operations happening on a single timeline, this high priority map can
end up being blocked by earlier sparse maps, and hence the bubble
still exists in that case.

So in this series I try to tackle (1) instead. Since GPU work
typically lags behind CPU submissions and VM operations aren't that
slow, we can typically execute VM operations early enough that any
implicit syncs from (2) are less of an issue, or no issue at all. In
particular, by doing all dependency waits in userspace, we can make
almost all VM operations start pretty much immediately (with a bunch
of exceptions, like other VM work that takes time, radeonsi still
submitting implicitly synced stuff, etc.).

So I think (2) is valuable, just not what this series tries to focus
on or touch at all.

(And then the cherry on top would be having two timelines for VM
operations, a high priority one for non-sparse bindings and a low
priority one for sparse bindings, but that is very complex and not
super high value on top of eliminating (1) + (2), so I'd punt that for
"maybe later". See e.g. the discussion wrt Intel at
https://patchwork.freedesktop.org/patch/486604/#comment_879193)
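
(As an aside, for readers less familiar with the fence bookkeeping:
both directions boil down to which dma_resv usage level each side
waits on. A minimal sketch, assuming the caller holds the resv lock;
wait_usage() is a made-up helper, not amdgpu code:)

static int wait_usage(struct dma_resv *resv, enum dma_resv_usage usage)
{
        struct dma_resv_iter cursor;
        struct dma_fence *fence;
        long r;

        /* Waits for every fence up to and including the given usage
         * level; weaker-usage fences such as DMA_RESV_USAGE_BOOKKEEP
         * are skipped when a lower level is passed in. */
        dma_resv_for_each_fence(&cursor, resv, usage, fence) {
                r = dma_fence_wait(fence, true);
                if (r)
                        return r;
        }
        return 0;
}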

>
> When we want to do a TLB flush the unmap operation must already be
> completed. Otherwise the flush is rather pointless since any access
> could reload the not yet updated PTEs.
>
> And this means that we need to artificially add a dependency on every
> command submission after 2 to wait until the unmap operation is completed.
>
> Christian.
>
> > 3) A TLB flush happens for a new CS
> > 4) All CS submissions here see the TLB flush and hence the unmap
> >
> > So the main problem would be the CS from step 1, but (a) if that
> > VMFaults that is the apps own fault and (b) because we don't free the
> > memory until (1) finishes it is not a security issue kernel-wise.
> >
> >> I think I know all the necessary steps now, it's just tons of work to do.
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> >>> @@ -187,6 +187,39 @@ static int amdgpu_gem_object_open(struct drm_gem_object *obj,
> >>>          return 0;
> >>> }
> >>>
> >>> +static void dma_resv_copy(struct dma_resv *src, struct dma_resv *dst)
> >>> +{
> >>> +       struct dma_resv_iter cursor;
> >>> +       struct dma_fence *f;
> >>> +       int r;
> >>> +       unsigned num_fences = 0;
> >>> +
> >>> +       if (src == dst)
> >>> +               return;
> >>> +
> >>> +       /* We assume the later loops get the same fences as the caller should
> >>> +        * lock the resv. */
> >>> +       dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
> >>> +               ++num_fences;
> >>> +               dma_fence_put(f);
> >>> +       }
> >>> +
> >>> +       r = dma_resv_reserve_fences(dst, num_fences);
> >>> +       if (r) {
> >>> +               /* As last resort on OOM we block for the fences */
> >>> +               dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
> >>> +                       dma_fence_wait(f, false);
> >>> +                       dma_fence_put(f);
> >>> +               }
> >>> +               /* No slots were reserved, so don't fall through to add */
> >>> +               return;
> >>> +       }
> >>> +
> >>> +       dma_resv_for_each_fence(&cursor, src, DMA_RESV_USAGE_BOOKKEEP, f) {
> >>> +               dma_resv_add_fence(dst, f, dma_resv_iter_usage(&cursor));
> >>> +               dma_fence_put(f);
> >>> +       }
> >>> +}
> >>> +
> >>> +
> >>> static void amdgpu_gem_object_close(struct drm_gem_object *obj,
> >>>                                      struct drm_file *file_priv)
> >>> {
> >>> @@ -233,6 +266,8 @@ static void amdgpu_gem_object_close(struct drm_gem_object *obj,
> >>>          amdgpu_bo_fence(bo, fence, true);
> >>>          dma_fence_put(fence);
> >>>
> >>> +       dma_resv_copy(vm->root.bo->tbo.base.resv, bo->tbo.base.resv);
> >>> +
> >>> out_unlock:
> >>>          if (unlikely(r < 0))
> >>>                  dev_err(adev->dev, "failed to clear page "
> >>>
> >>>> [SNIP]
>
Christian König June 3, 2022, 5:41 p.m. UTC | #16
Am 03.06.22 um 15:23 schrieb Bas Nieuwenhuizen:
> On Fri, Jun 3, 2022 at 2:49 PM Christian König <christian.koenig@amd.com> wrote:
>> Am 03.06.22 um 14:39 schrieb Bas Nieuwenhuizen:
>>> On Fri, Jun 3, 2022 at 2:08 PM Christian König <christian.koenig@amd.com> wrote:
>>>> Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
>>>>> On Fri, Jun 3, 2022 at 12:16 PM Christian König
>>>>> <christian.koenig@amd.com> wrote:
>>>>>> Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
>>>>>>> [SNIP]
>>>>>>>>> I do have to fix some stuff indeed, especially for the GEM close but
>>>>>>>>> with that we should be able to keep the same basic approach?
>>>>>>>> Nope, not even remotely.
>>>>>>>>
>>>>>>>> What we need is the following:
>>>>>>>> 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
>>>>>>>> 2. When we get a VM operation we not only lock the VM page tables, but
>>>>>>>> also all buffers we potentially need to unmap.
>>>>>>>> 3. Nuking the freed list in the amdgpu_vm structure by updating freed
>>>>>>>> areas directly when they are unmapped.
>>>>>>>> 4. Tracking those updates inside the bo_va structure for the BO+VM
>>>>>>>> combination.
>>>>>>>> 5. When the bo_va structure is destroy because of closing the handle
>>>>>>>> move the last clear operation over to the VM as implicit sync.
>>>>>>>>
>>>>>>> Hi Christian, isn't that a different problem though (that we're also
>>>>>>> trying to solve, but in your series)?
>>>>>>>
>>>>>>> What this patch tries to achieve:
>>>>>>>
>>>>>>> (t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
>>>>>>> (t+1) a VM operation on a BO/VM accessed by the CS.
>>>>>>>
>>>>>>> to run concurrently. What it *doesn't* try is
>>>>>>>
>>>>>>> (t+0) a VM operation on a BO/VM accessed by the CS.
>>>>>>> (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
>>>>>>>
>>>>>>> to run concurrently. When you write
>>>>>>>
>>>>>>>> Only when all this is done we then can resolve the dependency that the
>>>>>>>> CS currently must wait for any clear operation on the VM.
>>>>>>> isn't that all about the second problem?
>>>>>> No, it's the same.
>>>>>>
>>>>>> See what we do in the VM code is to artificially insert a bubble so that
>>>>>> all VM clear operations wait for all CS operations and then use the
>>>>>> clear fence to indicate when the backing store of the BO can be freed.
>>>>> Isn't that remediated with something like the code below? At least the
>>>>> gem_close case should be handled with this, and the move case was
>>>>> already handled by the copy operation.
>>>> That is one necessary puzzle piece, yes. But you need more than that.
>>>>
>>>> Especially the explicit unmap operation needs to be converted into an
>>>> implicit unmap to get the TLB flush right.
>>> This doesn't change anything about the TLB flush though? Since all
>>> unmap -> later jobs dependencies are still implicit.
>>>
>>> So the worst what could happen (i.f. e.g. userspace gets the
>>> waits/dependencies wrong) is
>>>
>>> 1) non-implicit CS gets submitted that touches a BO
>>> 2)  VM unmap on that BO happens
>>> 2.5) the CS from 1 is still active due to missing dependencies
>>> 2.6) but any CS submission after 2 will trigger a TLB flush
>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> For this series, not really.  To clarify there are two sides for
> getting GPU bubbles and no overlap:
>
> (1) VM operations implicitly wait for earlier CS submissions
> (2) CS submissions implicitly wait for earlier VM operations
>
> Together, these combine to ensure that you get a (potentially small)
> bubble any time VM work happens.
>
> Your series (and further ideas) tackles (2), and is a worthwhile thing
> to do. However, while writing the userspace for this I noticed this
> isn't enough to get rid of all our GPU bubbles. In particular when
> doing a non-sparse map of a new BO, that tends to need to be waited on
> for the next CS anyway for API semantics. Due to VM operations
> happening on a single timeline that means this high priority map can
> end up being blocked by earlier sparse maps and hence the bubble in
> that case still exists.
>
> So in this series I try to tackle (1) instead. Since GPU work
> typically lags behind CPU submissions and VM operations aren't that
> slow, we can typically execute VM operations early enough that any
> implicit syncs from (2) are less/no issue.

Ok, once more since you don't seem to understand what I want to say: It 
isn't possible to fix #1 before you have fixed #2.

The VM unmap operation here is a barrier which divides the CS
operations into a before and an after. This is intentional design.

To get rid of this barrier you must first fix the part where CS
submissions wait for the VM operation to complete, i.e. the necessity
of the barrier.

I've been working on this for a couple of years now and I'm really
running out of ideas for how to explain this restriction.

Regards,
Christian.
Bas Nieuwenhuizen June 3, 2022, 5:50 p.m. UTC | #17
On Fri, Jun 3, 2022 at 7:42 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 03.06.22 um 15:23 schrieb Bas Nieuwenhuizen:
> > On Fri, Jun 3, 2022 at 2:49 PM Christian König <christian.koenig@amd.com> wrote:
> >> Am 03.06.22 um 14:39 schrieb Bas Nieuwenhuizen:
> >>> On Fri, Jun 3, 2022 at 2:08 PM Christian König <christian.koenig@amd.com> wrote:
> >>>> Am 03.06.22 um 13:07 schrieb Bas Nieuwenhuizen:
> >>>>> On Fri, Jun 3, 2022 at 12:16 PM Christian König
> >>>>> <christian.koenig@amd.com> wrote:
> >>>>>> Am 03.06.22 um 12:08 schrieb Bas Nieuwenhuizen:
> >>>>>>> [SNIP]
> >>>>>>>>> I do have to fix some stuff indeed, especially for the GEM close but
> >>>>>>>>> with that we should be able to keep the same basic approach?
> >>>>>>>> Nope, not even remotely.
> >>>>>>>>
> >>>>>>>> What we need is the following:
> >>>>>>>> 1. Rolling out my drm_exec patch set, so that we can lock buffers as needed.
> >>>>>>>> 2. When we get a VM operation we not only lock the VM page tables, but
> >>>>>>>> also all buffers we potentially need to unmap.
> >>>>>>>> 3. Nuking the freed list in the amdgpu_vm structure by updating freed
> >>>>>>>> areas directly when they are unmapped.
> >>>>>>>> 4. Tracking those updates inside the bo_va structure for the BO+VM
> >>>>>>>> combination.
> >>>>>>>> 5. When the bo_va structure is destroy because of closing the handle
> >>>>>>>> move the last clear operation over to the VM as implicit sync.
> >>>>>>>>
> >>>>>>> Hi Christian, isn't that a different problem though (that we're also
> >>>>>>> trying to solve, but in your series)?
> >>>>>>>
> >>>>>>> What this patch tries to achieve:
> >>>>>>>
> >>>>>>> (t+0) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
> >>>>>>> (t+1) a VM operation on a BO/VM accessed by the CS.
> >>>>>>>
> >>>>>>> to run concurrently. What it *doesn't* try is
> >>>>>>>
> >>>>>>> (t+0) a VM operation on a BO/VM accessed by the CS.
> >>>>>>> (t+1) CS submission setting BOOKKEEP fences (i.e. no implicit sync)
> >>>>>>>
> >>>>>>> to run concurrently. When you write
> >>>>>>>
> >>>>>>>> Only when all this is done we then can resolve the dependency that the
> >>>>>>>> CS currently must wait for any clear operation on the VM.
> >>>>>>> isn't that all about the second problem?
> >>>>>> No, it's the same.
> >>>>>>
> >>>>>> See what we do in the VM code is to artificially insert a bubble so that
> >>>>>> all VM clear operations wait for all CS operations and then use the
> >>>>>> clear fence to indicate when the backing store of the BO can be freed.
> >>>>> Isn't that remediated with something like the code below? At least the
> >>>>> gem_close case should be handled with this, and the move case was
> >>>>> already handled by the copy operation.
> >>>> That is one necessary puzzle piece, yes. But you need more than that.
> >>>>
> >>>> Especially the explicit unmap operation needs to be converted into an
> >>>> implicit unmap to get the TLB flush right.
> >>> This doesn't change anything about the TLB flush though? Since all
> >>> unmap -> later jobs dependencies are still implicit.
> >>>
> >>> So the worst what could happen (i.f. e.g. userspace gets the
> >>> waits/dependencies wrong) is
> >>>
> >>> 1) non-implicit CS gets submitted that touches a BO
> >>> 2)  VM unmap on that BO happens
> >>> 2.5) the CS from 1 is still active due to missing dependencies
> >>> 2.6) but any CS submission after 2 will trigger a TLB flush
> >> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> > For this series, not really.  To clarify there are two sides for
> > getting GPU bubbles and no overlap:
> >
> > (1) VM operations implicitly wait for earlier CS submissions
> > (2) CS submissions implicitly wait for earlier VM operations
> >
> > Together, these combine to ensure that you get a (potentially small)
> > bubble any time VM work happens.
> >
> > Your series (and further ideas) tackles (2), and is a worthwhile thing
> > to do. However, while writing the userspace for this I noticed this
> > isn't enough to get rid of all our GPU bubbles. In particular when
> > doing a non-sparse map of a new BO, that tends to need to be waited on
> > for the next CS anyway for API semantics. Due to VM operations
> > happening on a single timeline that means this high priority map can
> > end up being blocked by earlier sparse maps and hence the bubble in
> > that case still exists.
> >
> > So in this series I try to tackle (1) instead. Since GPU work
> > typically lags behind CPU submissions and VM operations aren't that
> > slow, we can typically execute VM operations early enough that any
> > implicit syncs from (2) are less/no issue.
>
> Ok, once more since you don't seem to understand what I want to say: It
> isn't possible to fix #1 before you have fixed #2.
>
> The VM unmap operation here is a barrier which divides the CS operations
> in a before and after. This is intentional design.

Why is that barrier needed? The two barriers I do get and understand,
and that I think we can deal with, are:

1) the VM unmap is a barrier between prior CS and later memory free.
2) the TLB flush needs to happen between a VM unmap and later CS.

But why do we need the VM unmap to be a strict barrier between prior
CS and later CS?

> [SNIP]
Christian König June 3, 2022, 6:41 p.m. UTC | #18
Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
> [SNIP]
>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
>>> For this series, not really.  To clarify there are two sides for
>>> getting GPU bubbles and no overlap:
>>>
>>> (1) VM operations implicitly wait for earlier CS submissions
>>> (2) CS submissions implicitly wait for earlier VM operations
>>>
>>> Together, these combine to ensure that you get a (potentially small)
>>> bubble any time VM work happens.
>>>
>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
>>> to do. However, while writing the userspace for this I noticed this
>>> isn't enough to get rid of all our GPU bubbles. In particular when
>>> doing a non-sparse map of a new BO, that tends to need to be waited on
>>> for the next CS anyway for API semantics. Due to VM operations
>>> happening on a single timeline that means this high priority map can
>>> end up being blocked by earlier sparse maps and hence the bubble in
>>> that case still exists.
>>>
>>> So in this series I try to tackle (1) instead. Since GPU work
>>> typically lags behind CPU submissions and VM operations aren't that
>>> slow, we can typically execute VM operations early enough that any
>>> implicit syncs from (2) are less/no issue.
>> Ok, once more since you don't seem to understand what I want to say: It
>> isn't possible to fix #1 before you have fixed #2.
>>
>> The VM unmap operation here is a barrier which divides the CS operations
>> in a before and after. This is intentional design.
> Why is that barrier needed? The two barriers I got and understood and
> I think we can deal with:
>
> 1) the VM unmap is a barrier between prior CS and later memory free.
> 2) The TLB flush need to happen between a VM unmap and later CS.
>
> But why do we need the VM unmap to be a strict barrier between prior
> CS and later CS?

Exactly because of the two reasons you mentioned.

#1 is rather easy to fix: you just need to copy all dma_fences from the
page table dma_resv object over to the BO's dma_resv object in the gem
close handler, e.g. exactly what you suggested with the dma_resv_copy
function.

#2 is a nightmare.

We can't move the TLB flush to the end of the unmap operation because
async TLB flushes are either a bit complicated (double flushes etc.) or
don't even work at all because of hw bugs. So to have a reliable TLB
flush we must make sure that nothing else is ongoing, and that means a
CS->VM->CS barrier.

We try very hard to circumvent that already on maps by (for example) 
using a completely new VMID for CS after the VM map operation.

But for the unmap operation we would need some kind of special
dma_fence implementation which would not only wait for all existing
dma_fences but also for the ones added until the unmap operation is
completed. Otherwise the operation we do for #1 would simply not catch
all dma_fences which have access to the memory.

That's certainly doable, but I think just using the drm_exec stuff I 
already came up with is easier.

When we can grab locks for all the BOs involved, amdgpu_vm_clear_freed()
goes away and we can keep track of the unmap operations in the bo_va
structure.

With that done, you can make the explicit sync you noted in the bo_va
structure, and implicit sync when the bo_va structure goes away.

Then the only reason I can see why we would need a CS->VM dependency is 
implicit synchronization, and that's what we are trying to avoid here in 
the first place.
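
(A rough sketch of the bo_va bookkeeping described above, assuming we
can lock everything via drm_exec: the structure and helper are
illustrative stand-ins, not the real amdgpu code, and the usage level
for the handed-over fence is a guess:)

struct bo_va_sketch {
        struct dma_fence *last_clear;   /* latest unmap/clear for this
                                         * BO+VM combination */
};

static void bo_va_sketch_destroy(struct bo_va_sketch *bo_va,
                                 struct dma_resv *vm_resv)
{
        if (!bo_va->last_clear)
                return;

        /* Hand the last clear over to the VM as the implicit sync
         * point; caller holds vm_resv and has reserved a fence slot. */
        dma_resv_add_fence(vm_resv, bo_va->last_clear,
                           DMA_RESV_USAGE_READ);
        dma_fence_put(bo_va->last_clear);
        bo_va->last_clear = NULL;
}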

Regards,
Christian.

>> [SNIP]
Bas Nieuwenhuizen June 3, 2022, 7:11 p.m. UTC | #19
On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
>
> Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
> > [SNIP]
> >>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> >>> For this series, not really.  To clarify there are two sides for
> >>> getting GPU bubbles and no overlap:
> >>>
> >>> (1) VM operations implicitly wait for earlier CS submissions
> >>> (2) CS submissions implicitly wait for earlier VM operations
> >>>
> >>> Together, these combine to ensure that you get a (potentially small)
> >>> bubble any time VM work happens.
> >>>
> >>> Your series (and further ideas) tackles (2), and is a worthwhile thing
> >>> to do. However, while writing the userspace for this I noticed this
> >>> isn't enough to get rid of all our GPU bubbles. In particular when
> >>> doing a non-sparse map of a new BO, that tends to need to be waited on
> >>> for the next CS anyway for API semantics. Due to VM operations
> >>> happening on a single timeline that means this high priority map can
> >>> end up being blocked by earlier sparse maps and hence the bubble in
> >>> that case still exists.
> >>>
> >>> So in this series I try to tackle (1) instead. Since GPU work
> >>> typically lags behind CPU submissions and VM operations aren't that
> >>> slow, we can typically execute VM operations early enough that any
> >>> implicit syncs from (2) are less/no issue.
> >> Ok, once more since you don't seem to understand what I want to say: It
> >> isn't possible to fix #1 before you have fixed #2.
> >>
> >> The VM unmap operation here is a barrier which divides the CS operations
> >> in a before and after. This is intentional design.
> > Why is that barrier needed? The two barriers I got and understood and
> > I think we can deal with:
> >
> > 1) the VM unmap is a barrier between prior CS and later memory free.
> > 2) The TLB flush need to happen between a VM unmap and later CS.
> >
> > But why do we need the VM unmap to be a strict barrier between prior
> > CS and later CS?
>
> Exactly because of the two reasons you mentioned.

This is the part I'm not seeing. I get that removing #2 is a
nightmare, which is why I did something that doesn't violate that
constraint.

Like if an explicit CS that was running before the VM operation runs
till after the VM operation (and hence possibly till after the TLB
flush, or otherwise has the TLB flush not apply due to lack of async
TLB flush support), that is not an issue. It might see the state from
before the unmap, or after the unmap, or some intermediate state, and
all of those would be okay.

We still get the constraint that the TLB flush happens between the VM
unmap and later CS and hence the unmap is certainly visible to them.

> [SNIP]
Christian König June 6, 2022, 10:15 a.m. UTC | #20
Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
> On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
>> Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
>>> [SNIP]
>>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
>>>>> For this series, not really.  To clarify there are two sides for
>>>>> getting GPU bubbles and no overlap:
>>>>>
>>>>> (1) VM operations implicitly wait for earlier CS submissions
>>>>> (2) CS submissions implicitly wait for earlier VM operations
>>>>>
>>>>> Together, these combine to ensure that you get a (potentially small)
>>>>> bubble any time VM work happens.
>>>>>
>>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
>>>>> to do. However, while writing the userspace for this I noticed this
>>>>> isn't enough to get rid of all our GPU bubbles. In particular when
>>>>> doing a non-sparse map of a new BO, that tends to need to be waited on
>>>>> for the next CS anyway for API semantics. Due to VM operations
>>>>> happening on a single timeline that means this high priority map can
>>>>> end up being blocked by earlier sparse maps and hence the bubble in
>>>>> that case still exists.
>>>>>
>>>>> So in this series I try to tackle (1) instead. Since GPU work
>>>>> typically lags behind CPU submissions and VM operations aren't that
>>>>> slow, we can typically execute VM operations early enough that any
>>>>> implicit syncs from (2) are less/no issue.
>>>> Ok, once more since you don't seem to understand what I want to say: It
>>>> isn't possible to fix #1 before you have fixed #2.
>>>>
>>>> The VM unmap operation here is a barrier which divides the CS operations
>>>> in a before and after. This is intentional design.
>>> Why is that barrier needed? The two barriers I got and understood and
>>> I think we can deal with:
>>>
>>> 1) the VM unmap is a barrier between prior CS and later memory free.
>>> 2) The TLB flush need to happen between a VM unmap and later CS.
>>>
>>> But why do we need the VM unmap to be a strict barrier between prior
>>> CS and later CS?
>> Exactly because of the two reasons you mentioned.
> This is the part I'm not seeing. I get that removing #2 is a
> nightmare, which is why I did something that doesn't violate that
> constraint.
>
> Like if an explicit CS that was running before the VM operation  runs
> till after the VM operation (and hence possibly till after the TLB
> flush, or otherwise have the TLB flush not apply due to lack of async
> TLB flush support), that is not an issue. It might see the state from
> before the unmap, or after the unmap, or some intermediate state and
> all of those would be okay.
>
> We still get the constraint that the TLB flush happens between the VM
> unmap and later CS and hence the unmap is certainly visible to them.

So you basically just want to set the sync mode in 
amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?

That should be doable, but then you don't need any of the other changes.
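
(Sketched, that suggestion is a one-line decision at the
amdgpu_vm_update_range() call site; pick_unmap_sync_mode() is made up,
and cur_mode stands for whatever mode the current code passes for
unmaps:)

static enum amdgpu_sync_mode
pick_unmap_sync_mode(bool explicit_sync, enum amdgpu_sync_mode cur_mode)
{
        /* Explicitly synced unmaps stop acting as a CS barrier and
         * only honour explicitly requested dependencies. */
        return explicit_sync ? AMDGPU_SYNC_EXPLICIT : cur_mode;
}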

Regards,
Christian.

>> [SNIP]
Bas Nieuwenhuizen June 6, 2022, 10:30 a.m. UTC | #21
On Mon, Jun 6, 2022 at 12:15 PM Christian König
<christian.koenig@amd.com> wrote:
>
>
>
> Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
> > On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
> >> Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
> >>> [SNIP]
> >>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> >>>>> For this series, not really.  To clarify there are two sides for
> >>>>> getting GPU bubbles and no overlap:
> >>>>>
> >>>>> (1) VM operations implicitly wait for earlier CS submissions
> >>>>> (2) CS submissions implicitly wait for earlier VM operations
> >>>>>
> >>>>> Together, these combine to ensure that you get a (potentially small)
> >>>>> bubble any time VM work happens.
> >>>>>
> >>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
> >>>>> to do. However, while writing the userspace for this I noticed this
> >>>>> isn't enough to get rid of all our GPU bubbles. In particular when
> >>>>> doing a non-sparse map of a new BO, that tends to need to be waited on
> >>>>> for the next CS anyway for API semantics. Due to VM operations
> >>>>> happening on a single timeline that means this high priority map can
> >>>>> end up being blocked by earlier sparse maps and hence the bubble in
> >>>>> that case still exists.
> >>>>>
> >>>>> So in this series I try to tackle (1) instead. Since GPU work
> >>>>> typically lags behind CPU submissions and VM operations aren't that
> >>>>> slow, we can typically execute VM operations early enough that any
> >>>>> implicit syncs from (2) are less/no issue.
> >>>> Ok, once more since you don't seem to understand what I want to say: It
> >>>> isn't possible to fix #1 before you have fixed #2.
> >>>>
> >>>> The VM unmap operation here is a barrier which divides the CS operations
> >>>> in a before and after. This is intentional design.
> >>> Why is that barrier needed? The two barriers I got and understood and
> >>> I think we can deal with:
> >>>
> >>> 1) the VM unmap is a barrier between prior CS and later memory free.
> >>> 2) The TLB flush need to happen between a VM unmap and later CS.
> >>>
> >>> But why do we need the VM unmap to be a strict barrier between prior
> >>> CS and later CS?
> >> Exactly because of the two reasons you mentioned.
> > This is the part I'm not seeing. I get that removing #2 is a
> > nightmare, which is why I did something that doesn't violate that
> > constraint.
> >
> > Like if an explicit CS that was running before the VM operation  runs
> > till after the VM operation (and hence possibly till after the TLB
> > flush, or otherwise have the TLB flush not apply due to lack of async
> > TLB flush support), that is not an issue. It might see the state from
> > before the unmap, or after the unmap, or some intermediate state and
> > all of those would be okay.
> >
> > We still get the constraint that the TLB flush happens between the VM
> > unmap and later CS and hence the unmap is certainly visible to them.
>
> So you basically just want to set the sync mode in
> amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?

Yes, with the caveat that I want to do that only for
DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with
implicit sync we get the old implicit behavior, if we submit a CS with
explicit sync we get the new explicit behavior). The rest of the
series is basically just for enabling explicit sync submissions.
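
(A sketch of that caveat, relying on DMA_RESV_USAGE_BOOKKEEP being the
weakest usage level in the enum; vm_update_must_wait() is a made-up
helper, not the linked patch itself:)

static bool vm_update_must_wait(struct dma_resv_iter *cursor)
{
        /* KERNEL/WRITE/READ fences keep the old implicit behavior;
         * BOOKKEEP fences come from explicitly synced CS submissions
         * and are skipped by VM updates. */
        return dma_resv_iter_usage(cursor) < DMA_RESV_USAGE_BOOKKEEP;
}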

> [SNIP]
Christian König June 6, 2022, 10:35 a.m. UTC | #22
Am 06.06.22 um 12:30 schrieb Bas Nieuwenhuizen:
> On Mon, Jun 6, 2022 at 12:15 PM Christian König
> <christian.koenig@amd.com> wrote:
>>
>>
>> Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
>>> On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
>>>> Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
>>>>> [SNIP]
>>>>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
>>>>>>> For this series, not really.  To clarify there are two sides for
>>>>>>> getting GPU bubbles and no overlap:
>>>>>>>
>>>>>>> (1) VM operations implicitly wait for earlier CS submissions
>>>>>>> (2) CS submissions implicitly wait for earlier VM operations
>>>>>>>
>>>>>>> Together, these combine to ensure that you get a (potentially small)
>>>>>>> bubble any time VM work happens.
>>>>>>>
>>>>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
>>>>>>> to do. However, while writing the userspace for this I noticed this
>>>>>>> isn't enough to get rid of all our GPU bubbles. In particular when
>>>>>>> doing a non-sparse map of a new BO, that tends to need to be waited on
>>>>>>> for the next CS anyway for API semantics. Due to VM operations
>>>>>>> happening on a single timeline that means this high priority map can
>>>>>>> end up being blocked by earlier sparse maps and hence the bubble in
>>>>>>> that case still exists.
>>>>>>>
>>>>>>> So in this series I try to tackle (1) instead. Since GPU work
>>>>>>> typically lags behind CPU submissions and VM operations aren't that
>>>>>>> slow, we can typically execute VM operations early enough that any
>>>>>>> implicit syncs from (2) are less/no issue.
>>>>>> Ok, once more since you don't seem to understand what I want to say: It
>>>>>> isn't possible to fix #1 before you have fixed #2.
>>>>>>
>>>>>> The VM unmap operation here is a barrier which divides the CS operations
>>>>>> in a before and after. This is intentional design.
>>>>> Why is that barrier needed? The two barriers I got and understood and
>>>>> I think we can deal with:
>>>>>
>>>>> 1) the VM unmap is a barrier between prior CS and later memory free.
>>>>> 2) The TLB flush need to happen between a VM unmap and later CS.
>>>>>
>>>>> But why do we need the VM unmap to be a strict barrier between prior
>>>>> CS and later CS?
>>>> Exactly because of the two reasons you mentioned.
>>> This is the part I'm not seeing. I get that removing #2 is a
>>> nightmare, which is why I did something that doesn't violate that
>>> constraint.
>>>
>>> Like if an explicit CS that was running before the VM operation  runs
>>> till after the VM operation (and hence possibly till after the TLB
>>> flush, or otherwise have the TLB flush not apply due to lack of async
>>> TLB flush support), that is not an issue. It might see the state from
>>> before the unmap, or after the unmap, or some intermediate state and
>>> all of those would be okay.
>>>
>>> We still get the constraint that the TLB flush happens between the VM
>>> unmap and later CS and hence the unmap is certainly visible to them.
>> So you basically just want to set the sync mode in
>> amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?
> Yes, with the caveat that I want to do that only for
> DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with
> implicit sync we get the old implicit behavior, if we submit a CS with
> explicit sync we get the new explicit behavior). The rest of the
> series is basically just for enabling explicit sync submissions.

That part won't work at all and would cause additional synchronization 
problems.

First of all, for implicit synced CS we should use READ, not BOOKKEEP,
because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've
fixed it so that this no longer causes memory corruption, but it is
still nice to avoid.

BOOKKEEP can only be used by VM updates themselves, so that they don't
interfere with CS.

Then the second problem is that the VM IOCTL has absolutely no idea what 
the CS IOCTL would be doing. That's why we have added the EXPLICIT sync 
flag on the BO.
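
(The flag referenced here is the existing AMDGPU_GEM_CREATE_EXPLICIT_SYNC
creation flag; a minimal sketch of the check it enables, with the
struct as a stand-in for amdgpu_bo:)

struct bo_sketch {
        u64 flags;      /* AMDGPU_GEM_CREATE_* creation flags */
};

static bool bo_is_explicitly_synced(const struct bo_sketch *bo)
{
        return bo->flags & AMDGPU_GEM_CREATE_EXPLICIT_SYNC;
}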

Regards,
Christian.

>> [SNIP]
Bas Nieuwenhuizen June 6, 2022, 11 a.m. UTC | #23
On Mon, Jun 6, 2022 at 12:35 PM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 06.06.22 um 12:30 schrieb Bas Nieuwenhuizen:
> > On Mon, Jun 6, 2022 at 12:15 PM Christian König
> > <christian.koenig@amd.com> wrote:
> >>
> >>
> >> Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
> >>> On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
> >>>> Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
> >>>>> [SNIP]
> >>>>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> >>>>>>> For this series, not really.  To clarify there are two sides for
> >>>>>>> getting GPU bubbles and no overlap:
> >>>>>>>
> >>>>>>> (1) VM operations implicitly wait for earlier CS submissions
> >>>>>>> (2) CS submissions implicitly wait for earlier VM operations
> >>>>>>>
> >>>>>>> Together, these combine to ensure that you get a (potentially small)
> >>>>>>> bubble any time VM work happens.
> >>>>>>>
> >>>>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
> >>>>>>> to do. However, while writing the userspace for this I noticed this
> >>>>>>> isn't enough to get rid of all our GPU bubbles. In particular when
> >>>>>>> doing a non-sparse map of a new BO, that tends to need to be waited on
> >>>>>>> for the next CS anyway for API semantics. Due to VM operations
> >>>>>>> happening on a single timeline that means this high priority map can
> >>>>>>> end up being blocked by earlier sparse maps and hence the bubble in
> >>>>>>> that case still exists.
> >>>>>>>
> >>>>>>> So in this series I try to tackle (1) instead. Since GPU work
> >>>>>>> typically lags behind CPU submissions and VM operations aren't that
> >>>>>>> slow, we can typically execute VM operations early enough that any
> >>>>>>> implicit syncs from (2) are less/no issue.
> >>>>>> Ok, once more since you don't seem to understand what I want to say: It
> >>>>>> isn't possible to fix #1 before you have fixed #2.
> >>>>>>
> >>>>>> The VM unmap operation here is a barrier which divides the CS operations
> >>>>>> in a before and after. This is intentional design.
> >>>>> Why is that barrier needed? The two barriers I got and understood and
> >>>>> I think we can deal with:
> >>>>>
> >>>>> 1) the VM unmap is a barrier between prior CS and later memory free.
> >>>>> 2) The TLB flush need to happen between a VM unmap and later CS.
> >>>>>
> >>>>> But why do we need the VM unmap to be a strict barrier between prior
> >>>>> CS and later CS?
> >>>> Exactly because of the two reasons you mentioned.
> >>> This is the part I'm not seeing. I get that removing #2 is a
> >>> nightmare, which is why I did something that doesn't violate that
> >>> constraint.
> >>>
> >>> Like if an explicit CS that was running before the VM operation  runs
> >>> till after the VM operation (and hence possibly till after the TLB
> >>> flush, or otherwise have the TLB flush not apply due to lack of async
> >>> TLB flush support), that is not an issue. It might see the state from
> >>> before the unmap, or after the unmap, or some intermediate state and
> >>> all of those would be okay.
> >>>
> >>> We still get the constraint that the TLB flush happens between the VM
> >>> unmap and later CS and hence the unmap is certainly visible to them.
> >> So you basically just want to set the sync mode in
> >> amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?
> > Yes, with the caveat that I want to do that only for
> > DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with
> > implicit sync we get the old implicit behavior, if we submit a CS with
> > explicit sync we get the new explicit behavior). The rest of the
> > series is basically just for enabling explicit sync submissions.
>
> That part won't work at all and would cause additional synchronization
> problems.
>
> First of all for implicit synced CS we should use READ, not BOOKKEEP.
> Because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've
> fixed that this causes memory corruption, but it is still nice to avoid.

Yes, what I'm saying is that an implicit sync CS submission should add
READ fences to the dma_resv and an explicit sync CS submission should
add BOOKKEEP fences.
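
(In code, the CS side of that proposal would look roughly like this;
cs_attach_fence() is illustrative and explicit_sync is assumed to come
from the submission flags:)

static void cs_attach_fence(struct dma_resv *resv,
                            struct dma_fence *fence, bool explicit_sync)
{
        /* Caller holds the resv lock and has reserved a fence slot. */
        dma_resv_add_fence(resv, fence,
                           explicit_sync ? DMA_RESV_USAGE_BOOKKEEP
                                         : DMA_RESV_USAGE_READ);
}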

>
> BOOKKEEP can only be used by VM updates themselves. So that they don't
> interfere with CS.

That is exactly why we would go BOOKKEEP for explicit sync CS
submissions, no? An explicit submission shouldn't interfere with any
other CS submissions. That includes being totally ignored by GL
importers (if we want synchronization there between an explicit
submission and GL, userspace is expected to use Jason's dmabuf fence
import/export IOCTLs).
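
(For reference, a userspace-side sketch of that fallback, assuming the
sync_file export uapi from Jason's patches; error handling trimmed:)

#include <linux/dma-buf.h>
#include <sys/ioctl.h>

static int export_read_fence(int dmabuf_fd)
{
        struct dma_buf_export_sync_file args = {
                .flags = DMA_BUF_SYNC_READ,
        };

        /* Returns a sync_file fd covering the dma-buf's current read
         * fences, which the GL side can then import or wait on. */
        if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
                return -1;
        return args.fd;
}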

>
> Then the second problem is that the VM IOCTL has absolutely no idea what
> the CS IOCTL would be doing. That's why we have added the EXPLICIT sync
> flag on the BO.

It doesn't need to? We just use a different sync_mode for BOOKKEEP
fences vs others:
https://patchwork.freedesktop.org/patch/487887/?series=104578&rev=2

(the nice thing about doing it this way is that it is independent of
the IOCTL, i.e. it also works for the delayed mapping changes we
trigger on CS submit.)

> [SNIP]
Bas Nieuwenhuizen June 15, 2022, 12:40 a.m. UTC | #24
Hi Christian,

Friendly ping on the comments here. Are you okay with me re-spinning
the series with a fixed patch 1 or do we need further discussion on
the approach here?

Thanks,
Bas

On Mon, Jun 6, 2022 at 1:00 PM Bas Nieuwenhuizen
<bas@basnieuwenhuizen.nl> wrote:
>
> On Mon, Jun 6, 2022 at 12:35 PM Christian König
> <christian.koenig@amd.com> wrote:
> >
> > Am 06.06.22 um 12:30 schrieb Bas Nieuwenhuizen:
> > > On Mon, Jun 6, 2022 at 12:15 PM Christian König
> > > <christian.koenig@amd.com> wrote:
> > >>
> > >>
> > >> Am 03.06.22 um 21:11 schrieb Bas Nieuwenhuizen:
> > >>> On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
> > >>>> Am 03.06.22 um 19:50 schrieb Bas Nieuwenhuizen:
> > >>>>> [SNIP]
> > >>>>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
> > >>>>>>> For this series, not really.  To clarify there are two sides for
> > >>>>>>> getting GPU bubbles and no overlap:
> > >>>>>>>
> > >>>>>>> (1) VM operations implicitly wait for earlier CS submissions
> > >>>>>>> (2) CS submissions implicitly wait for earlier VM operations
> > >>>>>>>
> > >>>>>>> Together, these combine to ensure that you get a (potentially small)
> > >>>>>>> bubble any time VM work happens.
> > >>>>>>>
> > >>>>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
> > >>>>>>> to do. However, while writing the userspace for this I noticed this
> > >>>>>>> isn't enough to get rid of all our GPU bubbles. In particular when
> > >>>>>>> doing a non-sparse map of a new BO, that tends to need to be waited on
> > >>>>>>> for the next CS anyway for API semantics. Due to VM operations
> > >>>>>>> happening on a single timeline that means this high priority map can
> > >>>>>>> end up being blocked by earlier sparse maps and hence the bubble in
> > >>>>>>> that case still exists.
> > >>>>>>>
> > >>>>>>> So in this series I try to tackle (1) instead. Since GPU work
> > >>>>>>> typically lags behind CPU submissions and VM operations aren't that
> > >>>>>>> slow, we can typically execute VM operations early enough that any
> > >>>>>>> implicit syncs from (2) are less/no issue.
> > >>>>>> Ok, once more since you don't seem to understand what I want to say: It
> > >>>>>> isn't possible to fix #1 before you have fixed #2.
> > >>>>>>
> > >>>>>> The VM unmap operation here is a barrier which divides the CS operations
> > >>>>>> in a before and after. This is intentional design.
> > >>>>> Why is that barrier needed? The two barriers I got and understood and
> > >>>>> I think we can deal with:
> > >>>>>
> > >>>>> 1) the VM unmap is a barrier between prior CS and later memory free.
> > >>>>> 2) The TLB flush need to happen between a VM unmap and later CS.
> > >>>>>
> > >>>>> But why do we need the VM unmap to be a strict barrier between prior
> > >>>>> CS and later CS?
> > >>>> Exactly because of the two reasons you mentioned.
> > >>> This is the part I'm not seeing. I get that removing #2 is a
> > >>> nightmare, which is why I did something that doesn't violate that
> > >>> constraint.
> > >>>
> > >>> Like if an explicit CS that was running before the VM operation  runs
> > >>> till after the VM operation (and hence possibly till after the TLB
> > >>> flush, or otherwise have the TLB flush not apply due to lack of async
> > >>> TLB flush support), that is not an issue. It might see the state from
> > >>> before the unmap, or after the unmap, or some intermediate state and
> > >>> all of those would be okay.
> > >>>
> > >>> We still get the constraint that the TLB flush happens between the VM
> > >>> unmap and later CS and hence the unmap is certainly visible to them.
> > >> So you basically just want to set the sync mode in
> > >> amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?
> > > Yes, with the caveat that I want to do that only for
> > > DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with
> > > implicit sync we get the old implicit behavior, if we submit a CS with
> > > explicit sync we get the new explicit behavior). The rest of the
> > > series is basically just for enabling explicit sync submissions.
> >
> > That part won't work at all and would cause additional synchronization
> > problems.
> >
> > First of all for implicit synced CS we should use READ, not BOOKKEEP.
> > Because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've
> > fixed that this causes memory corruption, but it is still nice to avoid.
>
> Yes, what I'm saying is that on implicit sync CS submission should add
> READ fences to the dma resv and on explicit sync CS submission should
> add BOOKKEEP fences.
>
> >
> > BOOKKEEP can only be used by VM updates themselves. So that they don't
> > interfere with CS.
>
> That is the point why we would go BOOKKEEP for explicit sync CS
> submissions, no? Explicit submission shouldn't interfere with any
> other CS submissions. That includes being totally ignored by GL
> importers (if we want to have synchronization there between an
> explicit submission and GL, userspace is expected to use Jason's
> dmabuf fence import/export IOCTLs)
>
> >
> > Then the second problem is that the VM IOCTL has absolutely no idea what
> > the CS IOCTL would be doing. That's why we have added the EXPLICIT sync
> > flag on the BO.
>
> It doesn't need to? We just use a different sync_mode for BOOKKEEP
> fences vs others:
> https://patchwork.freedesktop.org/patch/487887/?series=104578&rev=2
>
> (the nice thing about doing it this way is that it is independent of
> the IOCTL, i.e. also works for the delayed mapping changes we trigger
> on CS submit)
>
> >
> > Regards,
> > Christian.
> >
> > >
> > >> That should be doable, but then you don't need any of the other changes.
> > >>
> > >> Regards,
> > >> Christian.
> > >>
> > >>>> #1 Is rather easy to fix, you just need to copy all dma_fences from the
> > >>>> page table dma_resv object over to the BOs dma_resv object in the gem
> > >>>> close handler. E.g. exactly what you suggested with the dma_resv_copy
> > >>>> function.
> > >>>>
> > >>>> #2 is a nightmare.
> > >>>>
> > >>>> We can't move the TLB flush at the end of the unmap operation because on
> > >>>> async TLB flushes are either a bit complicated (double flushes etc..) or
> > >>>> don't even work at all because of hw bugs. So to have a reliable TLB
> > >>>> flush we must make sure that nothing else is ongoing and that means
> > >>>> CS->VM->CS barrier.
> > >>>>
> > >>>> We try very hard to circumvent that already on maps by (for example)
> > >>>> using a completely new VMID for CS after the VM map operation.
> > >>>>
> > >>>> But for the unmap operation we would need some kind special dma_fence
> > >>>> implementation which would not only wait for all existing dma_fence but
> > >>>> also for the one added until the unmap operation is completed. Cause
> > >>>> otherwise our operation we do at #1 would simply not catch all
> > >>>> dma_fences which have access to the memory.
> > >>>>
> > >>>> That's certainly doable, but I think just using the drm_exec stuff I
> > >>>> already came up with is easier.
> > >>>>
> > >>>> When we can grab locks for all the BOs involved amdgpu_vm_clear_freed()
> > >>>> goes away and we can keep track of the unmap operations in the bo_va
> > >>>> structure.
> > >>>>
> > >>>> With that done you can make the explicit sync you noted in the bo_va
> > >>>> structure and implicit sync when the bo_va structure goes away.
> > >>>>
> > >>>> Then the only reason I can see why we would need a CS->VM dependency is
> > >>>> implicit synchronization, and that's what we are trying to avoid here in
> > >>>> the first place.
> > >>>>
> > >>>> Regards,
> > >>>> Christian.
> > >>>>
> > >>>>>> To get rid of this barrier you must first fix the part where CS
> > >>>>>> submissions wait for the VM operation to complete, i.e. the necessity of
> > >>>>>> the barrier.
> > >>>>>>
> > >>>>>> I'm working on this for a couple of years now and I'm really running out
> > >>>>>> of ideas for how to explain this restriction.
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> Christian.
> > >>>>>>
> >
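
(For reference, a minimal sketch of the fence copy suggested for #1 above. The amdgpu field names are illustrative and error handling is kept minimal; this is not the actual implementation:)

#include <linux/dma-resv.h>

/* Sketch only: on GEM handle close, mirror the page-table resv's fences
 * into the BO's own resv, so that a later free still waits for pending
 * VM updates.  Note that dma_resv_copy_fences() replaces the destination
 * fences, so a real implementation would have to merge instead. */
static int sketch_gem_close_sync(struct amdgpu_bo *bo, struct amdgpu_vm *vm)
{
	int r;

	r = dma_resv_lock(bo->tbo.base.resv, NULL);
	if (r)
		return r;

	r = dma_resv_copy_fences(bo->tbo.base.resv,
				 vm->root.bo->tbo.base.resv);
	dma_resv_unlock(bo->tbo.base.resv);
	return r;
}
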
Christian König June 15, 2022, 7 a.m. UTC | #25
On 06.06.22 at 13:00, Bas Nieuwenhuizen wrote:
> On Mon, Jun 6, 2022 at 12:35 PM Christian König
> <christian.koenig@amd.com> wrote:
>> [SNIP]
>> That part won't work at all and would cause additional synchronization
>> problems.
>>
>> First of all, for implicitly synced CS we should use READ, not BOOKKEEP,
>> because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've
>> fixed the memory corruption this causes, but it is still nice to avoid.
> Yes, what I'm saying is that on implicit sync, CS submission should add
> READ fences to the dma resv and on explicit sync, CS submission should
> add BOOKKEEP fences.

No, exactly that is wrong.

Implicit CS submissions should add WRITE fences.

Explicit CS submissions should add READ fences.

Only VM updates should add BOOKKEEP fences.
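
(A minimal sketch of that mapping against the dma_resv API; the helper and the explicit_sync flag are made up for illustration:)

#include <linux/dma-resv.h>

/* Sketch only: attach a CS fence according to the three rules above.
 * VM updates are not handled here; they always attach their fences
 * with DMA_RESV_USAGE_BOOKKEEP.  The caller must hold the resv lock
 * and have reserved a fence slot with dma_resv_reserve_fences(). */
static void sketch_attach_cs_fence(struct dma_resv *resv,
				   struct dma_fence *fence,
				   bool explicit_sync)
{
	enum dma_resv_usage usage = explicit_sync ? DMA_RESV_USAGE_READ
						  : DMA_RESV_USAGE_WRITE;

	dma_resv_add_fence(resv, fence, usage);
}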

>> BOOKKEEP can only be used by VM updates themselves. So that they don't
>> interfere with CS.
> That is the point why we would go BOOKKEEP for explicit sync CS
> submissions, no? Explicit submission shouldn't interfere with any
> other CS submissions. That includes being totally ignored by GL
> importers (if we want to have synchronization there between an
> explicit submission and GL, userspace is expected to use Jason's
> dmabuf fence import/export IOCTLs)

No, that would break existing DMA-buf rules.

Explicit CS submissions are still a dependency for implicit submissions.

>>
>> Then the second problem is that the VM IOCTL has absolutely no idea what
>> the CS IOCTL would be doing. That's why we have added the EXPLICIT sync
>> flag on the BO.
> It doesn't need to? We just use a different sync_mode for BOOKKEEP
> fences vs others:
> https://patchwork.freedesktop.org/patch/487887/?series=104578&rev=2

No, exactly that's completely broken.

Regards,
Christian.

Christian König June 15, 2022, 7 a.m. UTC | #26
Hi Bas,

sorry I totally missed your reply. Just tried to answer your original 
questions.

Regards,
Christian.

On 15.06.22 at 02:40, Bas Nieuwenhuizen wrote:
> Hi Christian,
>
> Friendly ping on the comments here. Are you okay with me re-spinning
> the series with a fixed patch 1 or do we need further discussion on
> the approach here?
>
> Thanks,
> Bas
>
> On Mon, Jun 6, 2022 at 1:00 PM Bas Nieuwenhuizen
> <bas@basnieuwenhuizen.nl> wrote:
>> On Mon, Jun 6, 2022 at 12:35 PM Christian König
>> <christian.koenig@amd.com> wrote:
>>> On 06.06.22 at 12:30, Bas Nieuwenhuizen wrote:
>>>> On Mon, Jun 6, 2022 at 12:15 PM Christian König
>>>> <christian.koenig@amd.com> wrote:
>>>>>
>>>>> On 03.06.22 at 21:11, Bas Nieuwenhuizen wrote:
>>>>>> On Fri, Jun 3, 2022 at 8:41 PM Christian König <christian.koenig@amd.com> wrote:
>>>>>>> On 03.06.22 at 19:50, Bas Nieuwenhuizen wrote:
>>>>>>>> [SNIP]
>>>>>>>>>>> Yeah, but that's exactly the bubble we try to avoid. Isn't it?
>>>>>>>>>> For this series, not really.  To clarify there are two sides for
>>>>>>>>>> getting GPU bubbles and no overlap:
>>>>>>>>>>
>>>>>>>>>> (1) VM operations implicitly wait for earlier CS submissions
>>>>>>>>>> (2) CS submissions implicitly wait for earlier VM operations
>>>>>>>>>>
>>>>>>>>>> Together, these combine to ensure that you get a (potentially small)
>>>>>>>>>> bubble any time VM work happens.
>>>>>>>>>>
>>>>>>>>>> Your series (and further ideas) tackles (2), and is a worthwhile thing
>>>>>>>>>> to do. However, while writing the userspace for this I noticed this
>>>>>>>>>> isn't enough to get rid of all our GPU bubbles. In particular when
>>>>>>>>>> doing a non-sparse map of a new BO, that tends to need to be waited on
>>>>>>>>>> for the next CS anyway for API semantics. Due to VM operations
>>>>>>>>>> happening on a single timeline that means this high priority map can
>>>>>>>>>> end up being blocked by earlier sparse maps and hence the bubble in
>>>>>>>>>> that case still exists.
>>>>>>>>>>
>>>>>>>>>> So in this series I try to tackle (1) instead. Since GPU work
>>>>>>>>>> typically lags behind CPU submissions and VM operations aren't that
>>>>>>>>>> slow, we can typically execute VM operations early enough that any
>>>>>>>>>> implicit syncs from (2) are less/no issue.
>>>>>>>>> Ok, once more since you don't seem to understand what I want to say: It
>>>>>>>>> isn't possible to fix #1 before you have fixed #2.
>>>>>>>>>
>>>>>>>>> The VM unmap operation here is a barrier which divides the CS operations
>>>>>>>>> in a before and after. This is intentional design.
>>>>>>>> Why is that barrier needed? The two barriers I got and understood and
>>>>>>>> I think we can deal with:
>>>>>>>>
>>>>>>>> 1) the VM unmap is a barrier between prior CS and later memory free.
>>>>>>>> 2) The TLB flush need to happen between a VM unmap and later CS.
>>>>>>>>
>>>>>>>> But why do we need the VM unmap to be a strict barrier between prior
>>>>>>>> CS and later CS?
>>>>>>> Exactly because of the two reasons you mentioned.
>>>>>> This is the part I'm not seeing. I get that removing #2 is a
>>>>>> nightmare, which is why I did something that doesn't violate that
>>>>>> constraint.
>>>>>>
>>>>>> Like if an explicit CS that was running before the VM operation runs
>>>>>> till after the VM operation (and hence possibly till after the TLB
>>>>>> flush, or otherwise has the TLB flush not apply due to lack of async
>>>>>> TLB flush support), that is not an issue. It might see the state from
>>>>>> before the unmap, or after the unmap, or some intermediate state and
>>>>>> all of those would be okay.
>>>>>>
>>>>>> We still get the constraint that the TLB flush happens between the VM
>>>>>> unmap and later CS and hence the unmap is certainly visible to them.
>>>>> So you basically just want to set the sync mode in
>>>>> amdgpu_vm_update_range() to AMDGPU_SYNC_EXPLICIT even when it is an unmap?
>>>> Yes, with the caveat that I want to do that only for
>>>> DMA_RESV_USAGE_BOOKKEEP or higher (i.e. if we submit a CS with
>>>> implicit sync we get the old implicit behavior, if we submit a CS with
>>>> explicit sync we get the new explicit behavior). The rest of the
>>>> series is basically just for enabling explicit sync submissions.
>>> [SNIP]
Bas Nieuwenhuizen June 17, 2022, 1:03 p.m. UTC | #27
On Wed, Jun 15, 2022 at 9:00 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 06.06.22 at 13:00, Bas Nieuwenhuizen wrote:
> > On Mon, Jun 6, 2022 at 12:35 PM Christian König
> > <christian.koenig@amd.com> wrote:
> >> [SNIP]
> >> That part won't work at all and would cause additional synchronization
> >> problems.
> >>
> >> First of all, for implicitly synced CS we should use READ, not BOOKKEEP,
> >> because BOOKKEEP would incorrectly be ignored by OpenGL importers. I've
> >> fixed the memory corruption this causes, but it is still nice to avoid.
> > Yes, what I'm saying is that on implicit sync, CS submission should add
> > READ fences to the dma resv and on explicit sync, CS submission should
> > add BOOKKEEP fences.
>
> No, exactly that is wrong.
>
> Implicit CS submissions should add WRITE fences.
>
> Explicit CS submissions should add READ fences.
>
> Only VM updates should add BOOKKEEP fences.
>
> >> BOOKKEEP can only be used by VM updates themselves. So that they don't
> >> interfere with CS.
> > That is the point why we would go BOOKKEEP for explicit sync CS
> > submissions, no? Explicit submission shouldn't interfere with any
> > other CS submissions. That includes being totally ignored by GL
> > importers (if we want to have synchronization there between an
> > explicit submission and GL, userspace is expected to use Jason's
> > dmabuf fence import/export IOCTLs)
>
> No, that would break existing DMA-buf rules.
>
> Explicit CS submissions are still a dependency for implicit submissions.

This is explicitly what we don't want for explicit submissions and why
I waited with this series until the DMA_RESV_USAGE series landed. We
wish to opt out from implicit sync completely, and just use the IOCTLs
Jason wrote for back-compat with windowing systems that need it.

If BOOKKEEP isn't for that, should we add a new USAGE?
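
(For context, a rough userspace sketch of that back-compat path, using the
ioctl names from Jason's series, which had not been merged when this was
written:)

#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* Sketch: hand the fence of an explicit-sync (BOOKKEEP-only) submission
 * to implicit-sync consumers by importing it into the dma-buf as if it
 * were a write. */
static int sketch_publish_fence(int dmabuf_fd, int sync_file_fd)
{
	struct dma_buf_import_sync_file args = {
		.flags = DMA_BUF_SYNC_WRITE,
		.fd = sync_file_fd,
	};

	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &args);
}

/* Sketch: fetch the current implicit fences as a sync_file, e.g. to wait
 * on them before sampling a window-system buffer. */
static int sketch_fetch_fences(int dmabuf_fd)
{
	struct dma_buf_export_sync_file args = {
		.flags = DMA_BUF_SYNC_RW,
		.fd = -1,
	};

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args))
		return -1;

	return args.fd;
}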

> [SNIP]
Christian König June 17, 2022, 1:08 p.m. UTC | #28
On 17.06.22 at 15:03, Bas Nieuwenhuizen wrote:
> [SNIP]
>>>> BOOKKEEP can only be used by VM updates themselves. So that they don't
>>>> interfere with CS.
>>> That is the point why we would go BOOKKEEP for explicit sync CS
>>> submissions, no? Explicit submission shouldn't interfere with any
>>> other CS submissions. That includes being totally ignored by GL
>>> importers (if we want to have synchronization there between an
>>> explicit submission and GL, userspace is expected to use Jason's
>>> dmabuf fence import/export IOCTLs)
>> No, that would break existing DMA-buf rules.
>>
>> Explicit CS submissions are still a dependency for implicit submissions.
> This is explicitly what we don't want for explicit submissions and why
> I waited with this series until the DMA_RESV_USAGE series landed. We
> wish to opt out from implicit sync completely, and just use the IOCTLs
> Jason wrote for back-compat with windowing systems that need it.
>
> If BOOKKEEP isn't for that, should we add a new USAGE?

BOOKKEEP is exactly for that, but as discussed with Daniel that's not 
what we want in the kernel.

When you mix implicit with explicit synchronization (OpenGL with RADV
for example) it should be mandatory for the OpenGL side to wait for any
RADV submission before issuing an operation.

What you want to do is intentionally not supported.
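
(Concretely, a sketch of the consumer side that gets skipped: an
implicit-sync importer waits at most up to READ usage, and BOOKKEEP
fences are never returned there:)

#include <linux/dma-resv.h>
#include <linux/sched.h>

/* Sketch: what a GL importer or compositor effectively does before
 * accessing a shared BO.  dma_resv_usage_rw() maps a read access to
 * waiting on WRITE fences and a write access to waiting on READ fences;
 * BOOKKEEP sorts above both, so a BOOKKEEP-only submission is invisible
 * here. */
static long sketch_implicit_wait(struct dma_resv *resv, bool for_write)
{
	return dma_resv_wait_timeout(resv, dma_resv_usage_rw(for_write),
				     true, MAX_SCHEDULE_TIMEOUT);
}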

Regards,
Christian.
Daniel Vetter June 24, 2022, 8:34 p.m. UTC | #29
Digging out of a hole, apologies to everyone.

On Fri, Jun 17, 2022 at 03:08:00PM +0200, Christian König wrote:
> On 17.06.22 at 15:03, Bas Nieuwenhuizen wrote:
> > [SNIP]
> > > > > BOOKKEEP can only be used by VM updates themselves. So that they don't
> > > > > interfere with CS.
> > > > That is the point why we would go BOOKKEEP for explicit sync CS
> > > > submissions, no? Explicit submission shouldn't interfere with any
> > > > other CS submissions. That includes being totally ignored by GL
> > > > importers (if we want to have synchronization there between an
> > > > explicit submission and GL, userspace is expected to use Jason's
> > > > dmabuf fence import/export IOCTLs)
> > > No, that would break existing DMA-buf rules.
> > > 
> > > Explicit CS submissions are still a dependency for implicit submissions.
> > This is explicitly what we don't want for explicit submissions and why
> > I waited with this series until the DMA_RESV_USAGE series landed. We
> > wish to opt out from implicit sync completely, and just use the IOCTLs
> > Jason wrote for back-compat with windowing systems that need it.
> > 
> > If BOOKKEEP isn't for that, should we add a new USAGE?
> 
> BOOKKEEP is exactly for that, but as discussed with Daniel that's not what
> we want in the kernel.

Not sure which Daniel you talked to, but this wasn't me.

> When you mix implicit with explicit synchronization (OpenGL with RADV for
> example) it should be mandatory for the OpenGL to wait for any RADV
> submission before issuing an operation.
> 
> What you want to do is intentionally not supported.

vk is very intentional in its rejection of any implicit sync. Which means
when you share a buffer with gl, even in _that_ case there must be no sync
automatically, or your implementation is kinda shit. Instead anyone
sharing a buffer with vk and using it in gl must take care of sync by
importing the timeline syncobj to gl, that's why all these extensions got
added.

This leaves libva in the cold, but hey libva didn't even get around to
adding the full set of modifier extensions so I can't really get myself to
care.

So summary this means:

- a CS/execbuf for vk should _only_ set BOOKKEEPING fences (except ofc if
  there's memory management moves in the preparation, which use KERNEL
  fences and then become additional dependencies for the job)

- because the vk memory model is that everything currently bound can always
  be used, this means you set BOOKKEEPING on absolutely everything (see the
  sketch after this list). The current clever trick amdgpu has with shared
  buffers is also not really the right thing.

- implicit sync is only controlled through the new import/export ioctl on
  the dma-buf

- if you set any READ/WRITE fences anywhere else, you have potential
  oversync compared to what the vk spec would want

- userspace gets to keep absolutely all the pieces here. Which is not an
  issue, because userspace is totally allowed to fill a buffer with
  garbage and hand that to the compositor already, so there's nothing new
  going wrong here.

- ideally (definitely required for vk sparse) when you unbind or rebind
  then the BOOKKEEPING fences for the vm/ctx on the old buffers get
  simply replaced by the pte clearing and tlb flushing fences (like amdkfd
  does for compute, vk really just wants to look like compute in
  everything). In practice, especially with partial and multiple mappings
  of the same underlying bo involved, this might be too expensive to
  accurately track since you can only do the replacement trick when the
  last mapping is gone. It might be worth it for private bo though, dunno.
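
To make the first two points concrete, a rough per-BO sketch against the
dma_resv API (the scheduler hookup is illustrative; locking and fence
slot reservation are elided):

#include <drm/gpu_scheduler.h>
#include <linux/dma-resv.h>

/* Sketch: a vk-style CS treats only memory management (KERNEL) fences
 * as dependencies and publishes its own fence as BOOKKEEP, so no
 * implicit sync is created.  The caller holds the resv lock. */
static int sketch_vk_cs_per_bo(struct drm_sched_job *job,
			       struct dma_resv *resv,
			       struct dma_fence *job_fence)
{
	struct dma_resv_iter cursor;
	struct dma_fence *fence;
	int r;

	dma_resv_for_each_fence(&cursor, resv, DMA_RESV_USAGE_KERNEL, fence) {
		r = drm_sched_job_add_dependency(job, dma_fence_get(fence));
		if (r)
			return r;
	}

	dma_resv_add_fence(resv, job_fence, DMA_RESV_USAGE_BOOKKEEP);
	return 0;
}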

For amdgpu the current special owner checks mostly allow you to get the
semantics vulkan wants. But it breaks down when you have cross-device or
cross-process sharing.

We should probably also document this in the kerneldoc for the BOOKKEEPING
usage that this is the fence type that vulkan cs should use in all
drivers, otherwise this will become an endless mess of driver specific
hacks (i.e. the world we currently live in).
-Daniel
Christian König June 25, 2022, 1:58 p.m. UTC | #30
On 24.06.22 at 22:34, Daniel Vetter wrote:
> Digging out of a hole, apologies to everyone.

No problem, I'm totally overworked as well.

> On Fri, Jun 17, 2022 at 03:08:00PM +0200, Christian König wrote:
>> On 17.06.22 at 15:03, Bas Nieuwenhuizen wrote:
>>> [SNIP]
>> BOOKKEEP is exactly for that, but as discussed with Daniel that's not what
>> we want in the kernel.
> Not sure which Daniel you talked to, but this wasn't me.

Hui what? Of course I'm talking about you.

>> When you mix implicit with explicit synchronization (OpenGL with RADV for
>> example) it should be mandatory for the OpenGL to wait for any RADV
>> submission before issuing an operation.
>>
>> What you want to do is intentionally not supported.
> vk is very intentional in it's rejecting of any implicit sync.

[SNIP]

> We should probably also document this in the kerneldoc for the BOOKKEEPING
> usage that this is the fence type that vulkan cs should use in all
> drivers, otherwise this will become an endless mess of driver specific
> hacks (i.e. the world we currently live in).

Well Daniel, somehow we are not talking about the same thing here :)

I've documented exactly what you describe above in the initial patch 
which added BOOKKEEPING (I still called it OTHER in that iteration):

> > + /**
> > +  * @DMA_RESV_USAGE_OTHER: No implicit sync.
> > +  *
> > +  * This should be used for operations which don't want to add an
> > +  * implicit dependency at all, but still have a dependency on memory
> > +  * management.
> > +  *
> > +  * This might include things like preemption fences as well as device
> > +  * page table updates or even userspace command submissions.
> > +  *
> > +  * The kernel memory management *always* need to wait for those fences
> > +  * before moving or freeing the resource protected by the dma_resv
> > +  * object.
> > +  */
> > + DMA_RESV_USAGE_OTHER

Later on I've even explicitly mentioned that this is for Vulkan submissions.

But it was *you* who made me remove that with the explanation that we 
have to use READ for that or we break existing userspace.

I mean that still makes a lot of sense to me because if I'm not 
completely mistaken we do have use cases which break, especially 
Vulkan+encoding.

Regards,
Christian.

> -Daniel
Daniel Vetter June 25, 2022, 10:45 p.m. UTC | #31
On Sat, Jun 25, 2022 at 03:58:17PM +0200, Christian König wrote:
> On 24.06.22 at 22:34, Daniel Vetter wrote:
> > Digging out of a hole, apologies to everyone.
> 
> No problem, I'm totally overworked as well.
> 
> > On Fri, Jun 17, 2022 at 03:08:00PM +0200, Christian König wrote:
> > > On 17.06.22 at 15:03, Bas Nieuwenhuizen wrote:
> > > > [SNIP]
> > > BOOKKEEP is exactly for that, but as discussed with Daniel that's not what
> > > we want in the kernel.
> > Not sure which Daniel you talked to, but this wasn't me.
> 
> Hui what? Of course I'm talking about you.
> 
> > > When you mix implicit with explicit synchronization (OpenGL with RADV for
> > > example) it should be mandatory for the OpenGL to wait for any RADV
> > > submission before issuing an operation.
> > > 
> > > What you want to do is intentionally not supported.
> > vk is very intentional in it's rejecting of any implicit sync.
> 
> [SNIP]
> 
> > We should probably also document this in the kerneldoc for the BOOKKEEPING
> > usage that this is the fence type that vulkan cs should use in all
> > drivers, otherwise this will become an endless mess of driver specific
> > hacks (i.e. the world we currently live in).
> 
> Well, Daniel somehow we are somehow not talking about the same thing here :)
> 
> I've documented exactly what you describe above in the initial patch which
> added BOOKKEEPING (I've still called it OTHER in that iteration):
> 
> > > + /**
> > > +  * @DMA_RESV_USAGE_OTHER: No implicit sync.
> > > +  *
> > > +  * This should be used for operations which don't want to add an
> > > +  * implicit dependency at all, but still have a dependency on memory
> > > +  * management.
> > > +  *
> > > +  * This might include things like preemption fences as well as device
> > > +  * page table updates or even userspace command submissions.
> > > +  *
> > > +  * The kernel memory management *always* need to wait for those fences
> > > +  * before moving or freeing the resource protected by the dma_resv
> > > +  * object.
> > > +  */
> > > + DMA_RESV_USAGE_OTHER
> 
> Later on I've even explicitly mentioned that this is for Vulkan submissions.
> 
> But it was *you* who made me remove that with the explanation that we have
> to use READ for that or we break existing userspace.

Hm, in the only discussion I've found I actually mentioned that we should
highlight that vk should use OTHER even more than what you had. Quoting myself:

> +      * This might include things like preemption fences as well as device
> +      * page table updates or even userspace command submissions.

I think we should highlight a bit more that for explicitly synchronized
userspace like vk OTHER is the normal case. So really not an exception.
Ofc aside from amdkfd there's currently no driver doing this, but really
we should have lots of them ...

See https://lore.kernel.org/dri-devel/YZ+y+Uwo809qtvs5@phenom.ffwll.local/

I didn't find anything else. So not sure how we managed to create
confusion here :-(

> I mean that still makes a lot of sense to me because if I'm not completely
> mistaken we do have use cases which break, especially Vulkan+encoding.

Yeah I think we only have some communication fumble here, nothing else :-)

And yes libva doesn't have any support for vk's explicit sync model, so
that will just fall flat on its face. Might motivate a few folks to fix
libva :-)

Note that on the i915 side it's exactly the same, we've also been setting the
READ fence thus far. Since the breakage will be introduced by upgrading
mesa we'll at least avoid the kernel regression complaints, or at least I
hope we can get away with that.

Since really I don't have any idea how it could be fixed otherwise, except
through some really, really terrible hacks. Maybe kernel module option or
so.

Anyway I think all we need is just a patch to the dma_resv docs to explain
that USAGE_BOOKKEEPING is what vulkan userspace wants, and why. Bas, are
you up to typing that?
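
Something like this against the @DMA_RESV_USAGE_BOOKKEEP kerneldoc in
include/linux/dma-resv.h (the wording is only a sketch):

	/**
	 * @DMA_RESV_USAGE_BOOKKEEP: No implicit sync.
	 *
	 * For explicitly synchronized userspace, Vulkan command
	 * submissions in particular, this is the usage drivers should
	 * attach their CS fences with: the fences stay visible to the
	 * kernel memory management, but are ignored by all implicit-sync
	 * consumers.
	 */
	DMA_RESV_USAGE_BOOKKEEP,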

Cheers, Daniel
Christian König July 4, 2022, 1:37 p.m. UTC | #32
Hey Daniel,

On 26.06.22 at 00:45, Daniel Vetter wrote:
> [SNIP]
> I think we should highlight a bit more that for explicitly synchronized
> userspace like vk OTHER is the normal case. So really not an exception.
> Ofc aside from amdkfd there's currently no driver doing this, but really
> we should have lots of them ...
>
> See https://lore.kernel.org/dri-devel/YZ+y+Uwo809qtvs5@phenom.ffwll.local/
>
> I didn't find anything else. So not sure how we managed to create
> confusion here :-(

Well you said something like "Yeah, READ is supposed to be used for 
that...." and that created the impression that AMDGPU should start using 
that for Vulkan submissions and that you were rejecting my idea of using 
BOOKKEEP for that.

>> I mean that still makes a lot of sense to me because if I'm not completely
>> mistaken we do have use cases which break, especially Vulkan+encoding.
> Yeah I think we only have some communication fumble here, nothing else :-)

Ok, well then @Bas: Sorry for all the noise, we are actually all on the 
same page :)

> And yes libva doesn't have any support for vk's explicit sync model, so
> that will just fall flat on its face. Might motivate a few folks to fix
> libva :-)

Well that's not the problem. The problem is that we have a couple of use 
cases where libva is supposed to encode a Vulkan surface without letting 
Vulkan know about that.

In other words: Application shares DMA-buf between Vulkan and VA-API, 
renders with Vulkan and encodes with VA-API without any explicit 
synchronization between the two.

I know that this is absolutely against the Vulkan specification, but it 
just happened to work fine. And when you break something which used to 
work people start to complain...

> Note that on i915 side it's exactly the same, we've also been setting the
> READ fence thus far. Since the breakage will be introduced by upgrading
> mesa we'll at least avoid the kernel regression complaints, or at least I
> hope we can get away with that.

Yeah, the path to salvation starts with the words: It's not my f... 
problem :)

> Since really I don't have any idea how it could be fixed otherwise, except
> through some really, really terrible hacks. Maybe kernel module option or
> so.
>
> Anyway I think all we need is just a patch to the dma_resv docs to explain
> that USAGE_BOOKKEEPING is what vulkan userspace wants, and why. Bas,
> you're up to typing that?

I can do that. I'm just back from a week of vacation and still digging 
through my mails.

Cheers,
Christian.

>
> Cheers, Daniel
Daniel Vetter Aug. 9, 2022, 2:37 p.m. UTC | #33
[Back from vacations and work change and out sick and absolutely
everything else going wrong]

On Mon, Jul 04, 2022 at 03:37:43PM +0200, Christian König wrote:
> Hey Daniel,
> 
> > On 26.06.22 at 00:45, Daniel Vetter wrote:
> > [SNIP]
> > I think we should highlight a bit more that for explicitly synchronized
> > userspace like vk OTHER is the normal case. So really not an exception.
> > Ofc aside from amdkfd there's currently no driver doing this, but really
> > we should have lots of them ...
> > 
> > See https://lore.kernel.org/dri-devel/YZ+y+Uwo809qtvs5@phenom.ffwll.local/
> > 
> > I didn't find anything else. So not sure how we managed to create
> > confusion here :-(
> 
> Well you said something like "Yeah, READ is supposed to be used for
> that...." and that created the impression that AMDGPU should start using
> that for Vulkan submissions and you are rejecting my idea of using BOOKKEEP
> for that.
> 
> > > I mean that still makes a lot of sense to me because if I'm not completely
> > > mistaken we do have use cases which break, especially Vulkan+encoding.
> > Yeah I think we only have some communication fumble here, nothing else :-)
> 
> Ok, well then @Bas: Sorry for all the noise, we are actually all on the same
> page :)
> 
> > And yes libva doesn't have any support for vk's explicit sync model, so
> > that will just fall flat on its face. Might motivate a few folks to fix
> > libva :-)
> 
> Well that's not the problem. The problem is that we have a couple of use
> cases where libva is supposed to encode a Vulkan surface without letting
> Vulkan know about that.
> 
> In other words: Application shares DMA-buf between Vulkan and VA-API,
> renders with Vulkan and encodes with VA-API without any explicit
> synchronization between the two.
> 
> I know that this is absolutely against the Vulkan specification, but it just
> happened to work fine. And when you break something which used to work
> people start to complain...

Yeah I feared that, and worse libva doesn't have the nice gl interop
extensions to make it actually work.

> > Note that on i915 side it's exactly the same, we've also been setting the
> > READ fence thus far. Since the breakage will be introduced by upgrading
> > mesa we'll at least avoid the kernel regression complaints, or at least I
> > hope we can get away with that.
> 
> Yeah, the path to salvation start's with the words: It's not my f... problem
> :)
> 
> > Since really I don't have any idea how it could be fixed otherwise, except
> > through some really, really terrible hacks. Maybe kernel module option or
> > so.
> > 
> > Anyway I think all we need is just a patch to the dma_resv docs to explain
> > that USAGE_BOOKKEEPING is what vulkan userspace wants, and why. Bas,
> > you're up to typing that?
> 
> I can do that. I'm just back from a week of vacation and still digging
> through my mails.

Yeah I think the best path is we're pushing hard for adding the libva
syncobj extensions like gl has, so that this can be done properly. And
then just pave it over with kernel module options until userspace is fixed.

If we end up making implicit sync part of the de-facto vk api on linux, then a
_lot_ of people will be very sad :-(
-Daniel
diff mbox series

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
index f10332e1c6c0..31bc73fd1fae 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_cpu.c
@@ -51,7 +51,7 @@  static int amdgpu_vm_cpu_prepare(struct amdgpu_vm_update_params *p,
 	if (!resv)
 		return 0;
 
-	return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, sync_mode, p->vm, true);
+	return amdgpu_bo_sync_wait_resv(p->adev, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm, true);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 63b484dc76c5..c8d5898bea11 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -75,7 +75,7 @@  static int amdgpu_vm_sdma_prepare(struct amdgpu_vm_update_params *p,
 	if (!resv)
 		return 0;
 
-	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, sync_mode, p->vm);
+	return amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode, AMDGPU_SYNC_EXPLICIT, p->vm);
 }
 
 /**