
[5/7] drm/i915/gem/ttm: Respect the object's region in placement_from_obj

Message ID 20210715223900.1840576-6-jason@jlekstrand.net (mailing list archive)
State New, archived
Series drm/i915: Migrate memory to SMEM when imported cross-device

Commit Message

Jason Ekstrand July 15, 2021, 10:38 p.m. UTC
Whenever we had a user object (n_placements > 0), we were ignoring
obj->mm.region and always putting obj->placements[0] as the requested
region.  For LMEM+SMEM objects, this was causing them to get shoved into
LMEM on every i915_ttm_get_pages() even when SMEM was requested by, say,
i915_gem_object_migrate().
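
For illustration, the relevant hunk of i915_ttm_placement_from_obj()
(the full diff is below) amounts to:

    /* Before: user objects (num_allowed > 0) always requested
     * placements[0], so an LMEM+SMEM object migrated to SMEM came
     * straight back to LMEM on the next i915_ttm_get_pages(). */
    i915_ttm_place_from_region(num_allowed ? obj->mm.placements[0] :
                               obj->mm.region, requested, flags);

    /* After: honour the object's current region instead. */
    i915_ttm_place_from_region(obj->mm.region, requested, flags);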

Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Comments

Matthew Auld July 16, 2021, 1:54 p.m. UTC | #1
On Thu, 15 Jul 2021 at 23:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> Whenever we had a user object (n_placements > 0), we were ignoring
> obj->mm.region and always putting obj->placements[0] as the requested
> region.  For LMEM+SMEM objects, this was causing them to get shoved into
> LMEM on every i915_ttm_get_pages() even when SMEM was requested by, say,
> i915_gem_object_migrate().

i915_ttm_migrate calls i915_ttm_place_from_region() directly with the
requested region, so there shouldn't be an issue with migration right?
Do you have some more details?

>
> Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/gem/i915_gem_ttm.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> index d30f274c329c7..5985e994d56cf 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> @@ -150,8 +150,7 @@ i915_ttm_placement_from_obj(const struct drm_i915_gem_object *obj,
>         unsigned int i;
>
>         placement->num_placement = 1;
> -       i915_ttm_place_from_region(num_allowed ? obj->mm.placements[0] :
> -                                  obj->mm.region, requested, flags);
> +       i915_ttm_place_from_region(obj->mm.region, requested, flags);
>
>         /* Cache this on object? */
>         placement->num_busy_placement = num_allowed;
> --
> 2.31.1
>
Jason Ekstrand July 16, 2021, 2:10 p.m. UTC | #2
On Fri, Jul 16, 2021 at 8:54 AM Matthew Auld
<matthew.william.auld@gmail.com> wrote:
>
> On Thu, 15 Jul 2021 at 23:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
> >
> > Whenever we had a user object (n_placements > 0), we were ignoring
> > obj->mm.region and always putting obj->placements[0] as the requested
> > region.  For LMEM+SMEM objects, this was causing them to get shoved into
> > LMEM on every i915_ttm_get_pages() even when SMEM was requested by, say,
> > i915_gem_object_migrate().
>
> i915_ttm_migrate calls i915_ttm_place_from_region() directly with the
> requested region, so there shouldn't be an issue with migration right?
> Do you have some more details?

With i915_ttm_migrate directly, no.  But, in the last patch in the
series, we're trying to migrate LMEM+SMEM buffers into SMEM on
attach() and pin it there.  This blows up in a very unexpected (IMO)
way.  The flow goes something like this:

 - Client attempts a dma-buf import from another device
 - In attach() we call i915_gem_object_migrate() which calls
i915_ttm_migrate() which migrates as requested.
 - Once the migration is complete, we call i915_gem_object_pin_pages()
which calls i915_ttm_get_pages() which depends on
i915_ttm_placement_from_obj() and so migrates it right back to LMEM.
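
In code, the flow is roughly this (a sketch only: helper names as used
in the last patch of the series, with locking, error handling and the
exact signatures elided or assumed):

    static int i915_gem_dmabuf_attach(struct dma_buf *dmabuf,
                                      struct dma_buf_attachment *attach)
    {
        struct drm_i915_gem_object *obj = dma_buf_to_obj(dmabuf);
        int err;

        /* Ask for the object to be placed in SMEM so the importer
         * can reach it. */
        err = i915_gem_object_migrate(obj, NULL, INTEL_REGION_SMEM);
        if (err)
            return err;

        /* Pin the pages there.  Without this patch,
         * i915_ttm_get_pages() re-derives the requested placement
         * from obj->mm.placements[0] and migrates the object right
         * back to LMEM. */
        return i915_gem_object_pin_pages(obj);
    }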

Maybe the problem here is actually that our TTM code isn't respecting
obj->mm.pages_pin_count?

In case you can't tell, I really have no clue what I'm doing here.
I'm really stumbling around in the dark finding things that make my
bug go away.  I'm happy for the feedback.

--Jason

>
> >
> > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Cc: Matthew Auld <matthew.auld@intel.com>
> > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/gem/i915_gem_ttm.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > index d30f274c329c7..5985e994d56cf 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > @@ -150,8 +150,7 @@ i915_ttm_placement_from_obj(const struct drm_i915_gem_object *obj,
> >         unsigned int i;
> >
> >         placement->num_placement = 1;
> > -       i915_ttm_place_from_region(num_allowed ? obj->mm.placements[0] :
> > -                                  obj->mm.region, requested, flags);
> > +       i915_ttm_place_from_region(obj->mm.region, requested, flags);
> >
> >         /* Cache this on object? */
> >         placement->num_busy_placement = num_allowed;
> > --
> > 2.31.1
> >
Matthew Auld July 16, 2021, 3:52 p.m. UTC | #3
On Fri, 16 Jul 2021 at 15:10, Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> On Fri, Jul 16, 2021 at 8:54 AM Matthew Auld
> <matthew.william.auld@gmail.com> wrote:
> >
> > On Thu, 15 Jul 2021 at 23:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > >
> > > Whenever we had a user object (n_placements > 0), we were ignoring
> > > obj->mm.region and always putting obj->placements[0] as the requested
> > > region.  For LMEM+SMEM objects, this was causing them to get shoved into
> > > LMEM on every i915_ttm_get_pages() even when SMEM was requested by, say,
> > > i915_gem_object_migrate().
> >
> > i915_ttm_migrate calls i915_ttm_place_from_region() directly with the
> > requested region, so there shouldn't be an issue with migration right?
> > Do you have some more details?
>
> With i915_ttm_migrate directly, no.  But, in the last patch in the
> series, we're trying to migrate LMEM+SMEM buffers into SMEM on
> attach() and pin it there.  This blows up in a very unexpected (IMO)
> way.  The flow goes something like this:
>
>  - Client attempts a dma-buf import from another device
>  - In attach() we call i915_gem_object_migrate() which calls
> i915_ttm_migrate() which migrates as requested.
>  - Once the migration is complete, we call i915_gem_object_pin_pages()
> which calls i915_ttm_get_pages() which depends on
> i915_ttm_placement_from_obj() and so migrates it right back to LMEM.

The mm.pages must be NULL here, otherwise it would just increment the
pages_pin_count?

>
> Maybe the problem here is actually that our TTM code isn't respecting
> obj->mm.pages_pin_count?

I think if the resource is moved, we always nuke the mm.pages after
being notified of the move. Also TTM is also not allowed to move
pinned buffers.

I guess if we are evicted/swapped, so assuming we are not holding the
object lock, and it's not pinned, the future call to get_pages() will
see mm.pages = NULL, even though the ttm_resource is still there, and
because we prioritise the placements[0], instead of mm.region we end
up moving it for no good reason. But in your case you are holding the
lock, or it's pinned? Also is this just with the selftest, or
something real?

>
> In case you can't tell, I really have no clue what I'm doing here.
> I'm really stumbling around in the dark finding things that make my
> bug go away.  I'm happy for the feedback.
>
> --Jason
>
> >
> > >
> > > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/gem/i915_gem_ttm.c | 3 +--
> > >  1 file changed, 1 insertion(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > > index d30f274c329c7..5985e994d56cf 100644
> > > --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > > @@ -150,8 +150,7 @@ i915_ttm_placement_from_obj(const struct drm_i915_gem_object *obj,
> > >         unsigned int i;
> > >
> > >         placement->num_placement = 1;
> > > -       i915_ttm_place_from_region(num_allowed ? obj->mm.placements[0] :
> > > -                                  obj->mm.region, requested, flags);
> > > +       i915_ttm_place_from_region(obj->mm.region, requested, flags);
> > >
> > >         /* Cache this on object? */
> > >         placement->num_busy_placement = num_allowed;
> > > --
> > > 2.31.1
> > >
Matthew Auld July 16, 2021, 4 p.m. UTC | #4
On Fri, 16 Jul 2021 at 16:52, Matthew Auld
<matthew.william.auld@gmail.com> wrote:
>
> On Fri, 16 Jul 2021 at 15:10, Jason Ekstrand <jason@jlekstrand.net> wrote:
> >
> > On Fri, Jul 16, 2021 at 8:54 AM Matthew Auld
> > <matthew.william.auld@gmail.com> wrote:
> > >
> > > On Thu, 15 Jul 2021 at 23:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > >
> > > > Whenever we had a user object (n_placements > 0), we were ignoring
> > > > obj->mm.region and always putting obj->placements[0] as the requested
> > > > region.  For LMEM+SMEM objects, this was causing them to get shoved into
> > > > LMEM on every i915_ttm_get_pages() even when SMEM was requested by, say,
> > > > i915_gem_object_migrate().
> > >
> > > i915_ttm_migrate calls i915_ttm_place_from_region() directly with the
> > > requested region, so there shouldn't be an issue with migration right?
> > > Do you have some more details?
> >
> > With i915_ttm_migrate directly, no.  But, in the last patch in the
> > series, we're trying to migrate LMEM+SMEM buffers into SMEM on
> > attach() and pin it there.  This blows up in a very unexpected (IMO)
> > way.  The flow goes something like this:
> >
> >  - Client attempts a dma-buf import from another device
> >  - In attach() we call i915_gem_object_migrate() which calls
> > i915_ttm_migrate() which migrates as requested.
> >  - Once the migration is complete, we call i915_gem_object_pin_pages()
> > which calls i915_ttm_get_pages() which depends on
> > i915_ttm_placement_from_obj() and so migrates it right back to LMEM.
>
> The mm.pages must be NULL here, otherwise it would just increment the
> pages_pin_count?
>
> >
> > Maybe the problem here is actually that our TTM code isn't respecting
> > obj->mm.pages_pin_count?
>
> I think if the resource is moved, we always nuke the mm.pages after
> being notified of the move. Also TTM is also not allowed to move
> pinned buffers.
>
> I guess if we are evicted/swapped, so assuming we are not holding the
> object lock, and it's not pinned, the future call to get_pages() will
> see mm.pages = NULL, even though the ttm_resource is still there, and
> because we prioritise the placements[0], instead of mm.region we end
> up moving it for no good reason. But in your case you are holding the
> lock, or it's pinned? Also is this just with the selftest, or
> something real?

Or at least in the selftest I see ____i915_gem_object_get_pages()
which doesn't even consider the mm.pages AFAIK.

>
> >
> > In case you can't tell, I really have no clue what I'm doing here.
> > I'm really stumbling around in the dark finding things that make my
> > bug go away.  I'm happy for the feedback.
> >
> > --Jason
> >
> > >
> > > >
> > > > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/gem/i915_gem_ttm.c | 3 +--
> > > >  1 file changed, 1 insertion(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > > > index d30f274c329c7..5985e994d56cf 100644
> > > > --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > > > @@ -150,8 +150,7 @@ i915_ttm_placement_from_obj(const struct drm_i915_gem_object *obj,
> > > >         unsigned int i;
> > > >
> > > >         placement->num_placement = 1;
> > > > -       i915_ttm_place_from_region(num_allowed ? obj->mm.placements[0] :
> > > > -                                  obj->mm.region, requested, flags);
> > > > +       i915_ttm_place_from_region(obj->mm.region, requested, flags);
> > > >
> > > >         /* Cache this on object? */
> > > >         placement->num_busy_placement = num_allowed;
> > > > --
> > > > 2.31.1
> > > >
Jason Ekstrand July 16, 2021, 5:38 p.m. UTC | #5
On Fri, Jul 16, 2021 at 11:00 AM Matthew Auld
<matthew.william.auld@gmail.com> wrote:
>
> On Fri, 16 Jul 2021 at 16:52, Matthew Auld
> <matthew.william.auld@gmail.com> wrote:
> >
> > On Fri, 16 Jul 2021 at 15:10, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > >
> > > On Fri, Jul 16, 2021 at 8:54 AM Matthew Auld
> > > <matthew.william.auld@gmail.com> wrote:
> > > >
> > > > On Thu, 15 Jul 2021 at 23:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > >
> > > > > Whenever we had a user object (n_placements > 0), we were ignoring
> > > > > obj->mm.region and always putting obj->placements[0] as the requested
> > > > > region.  For LMEM+SMEM objects, this was causing them to get shoved into
> > > > > LMEM on every i915_ttm_get_pages() even when SMEM was requested by, say,
> > > > > i915_gem_object_migrate().
> > > >
> > > > i915_ttm_migrate calls i915_ttm_place_from_region() directly with the
> > > > requested region, so there shouldn't be an issue with migration right?
> > > > Do you have some more details?
> > >
> > > With i915_ttm_migrate directly, no.  But, in the last patch in the
> > > series, we're trying to migrate LMEM+SMEM buffers into SMEM on
> > > attach() and pin it there.  This blows up in a very unexpected (IMO)
> > > way.  The flow goes something like this:
> > >
> > >  - Client attempts a dma-buf import from another device
> > >  - In attach() we call i915_gem_object_migrate() which calls
> > > i915_ttm_migrate() which migrates as requested.
> > >  - Once the migration is complete, we call i915_gem_object_pin_pages()
> > > which calls i915_ttm_get_pages() which depends on
> > > i915_ttm_placement_from_obj() and so migrates it right back to LMEM.
> >
> > The mm.pages must be NULL here, otherwise it would just increment the
> > pages_pin_count?

Given that the test is using the ____four_underscores version, it
doesn't have that check.  However, this executes after we've done the
dma-buf import which pinned pages.  So we should definitely have
pages.

> > >
> > > Maybe the problem here is actually that our TTM code isn't respecting
> > > obj->mm.pages_pin_count?
> >
> > I think if the resource is moved, we always nuke the mm.pages after
> > being notified of the move. Also TTM is also not allowed to move
> > pinned buffers.
> >
> > I guess if we are evicted/swapped, so assuming we are not holding the
> > object lock, and it's not pinned, the future call to get_pages() will
> > see mm.pages = NULL, even though the ttm_resource is still there, and
> > because we prioritise the placements[0], instead of mm.region we end
> > up moving it for no good reason. But in your case you are holding the
> > lock, or it's pinned? Also is this just with the selftest, or
> > something real?
>
> Or at least in the selftest I see ____i915_gem_object_get_pages()
> which doesn't even consider the mm.pages AFAIK.

The bogus migration is happening as part of the
__i915_gem_object_get_pages() (2 __underscores) call in
i915_gem_dmabuf_attach (see last patch).  That code is attempting to
migrate the BO to SMEM and then pin it there using the obvious calls
to do so.  However, in the pin_pages call, it gets implicitly migrated
back to LMEM thanks to i915_ttm_get_pages().  Why is _get_pages()
migrating things at all?

--Jason

> >
> > >
> > > In case you can't tell, I really have no clue what I'm doing here.
> > > I'm really stumbling around in the dark finding things that make my
> > > bug go away.  I'm happy for the feedback.
> > >
> > > --Jason
> > >
> > > >
> > > > >
> > > > > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > > > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/i915/gem/i915_gem_ttm.c | 3 +--
> > > > >  1 file changed, 1 insertion(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > > > > index d30f274c329c7..5985e994d56cf 100644
> > > > > --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > > > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > > > > @@ -150,8 +150,7 @@ i915_ttm_placement_from_obj(const struct drm_i915_gem_object *obj,
> > > > >         unsigned int i;
> > > > >
> > > > >         placement->num_placement = 1;
> > > > > -       i915_ttm_place_from_region(num_allowed ? obj->mm.placements[0] :
> > > > > -                                  obj->mm.region, requested, flags);
> > > > > +       i915_ttm_place_from_region(obj->mm.region, requested, flags);
> > > > >
> > > > >         /* Cache this on object? */
> > > > >         placement->num_busy_placement = num_allowed;
> > > > > --
> > > > > 2.31.1
> > > > >
Matthew Auld July 16, 2021, 6:44 p.m. UTC | #6
On Fri, 16 Jul 2021 at 18:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> On Fri, Jul 16, 2021 at 11:00 AM Matthew Auld
> <matthew.william.auld@gmail.com> wrote:
> >
> > On Fri, 16 Jul 2021 at 16:52, Matthew Auld
> > <matthew.william.auld@gmail.com> wrote:
> > >
> > > On Fri, 16 Jul 2021 at 15:10, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > >
> > > > On Fri, Jul 16, 2021 at 8:54 AM Matthew Auld
> > > > <matthew.william.auld@gmail.com> wrote:
> > > > >
> > > > > On Thu, 15 Jul 2021 at 23:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > >
> > > > > > Whenever we had a user object (n_placements > 0), we were ignoring
> > > > > > obj->mm.region and always putting obj->placements[0] as the requested
> > > > > > region.  For LMEM+SMEM objects, this was causing them to get shoved into
> > > > > > LMEM on every i915_ttm_get_pages() even when SMEM was requested by, say,
> > > > > > i915_gem_object_migrate().
> > > > >
> > > > > i915_ttm_migrate calls i915_ttm_place_from_region() directly with the
> > > > > requested region, so there shouldn't be an issue with migration right?
> > > > > Do you have some more details?
> > > >
> > > > With i915_ttm_migrate directly, no.  But, in the last patch in the
> > > > series, we're trying to migrate LMEM+SMEM buffers into SMEM on
> > > > attach() and pin it there.  This blows up in a very unexpected (IMO)
> > > > way.  The flow goes something like this:
> > > >
> > > >  - Client attempts a dma-buf import from another device
> > > >  - In attach() we call i915_gem_object_migrate() which calls
> > > > i915_ttm_migrate() which migrates as requested.
> > > >  - Once the migration is complete, we call i915_gem_object_pin_pages()
> > > > which calls i915_ttm_get_pages() which depends on
> > > > i915_ttm_placement_from_obj() and so migrates it right back to LMEM.
> > >
> > > The mm.pages must be NULL here, otherwise it would just increment the
> > > pages_pin_count?
>
> Given that the test is using the ____four_underscores version, it
> doesn't have that check.  However, this executes after we've done the
> dma-buf import which pinned pages.  So we should definitely have
> pages.

We shouldn't call ____four_underscores() if we might already have
pages though. Under non-TTM that would leak the pages, and in TTM we
might hit the WARN_ON(mm->pages) in __i915_ttm_get_pages(), if for
example nothing was moved. I take it we can't just call pin_pages()?
Four scary underscores usually means "don't call this in normal code".

>
> > > >
> > > > Maybe the problem here is actually that our TTM code isn't respecting
> > > > obj->mm.pages_pin_count?
> > >
> > > I think if the resource is moved, we always nuke the mm.pages after
> > > being notified of the move. Also TTM is also not allowed to move
> > > pinned buffers.
> > >
> > > I guess if we are evicted/swapped, so assuming we are not holding the
> > > object lock, and it's not pinned, the future call to get_pages() will
> > > see mm.pages = NULL, even though the ttm_resource is still there, and
> > > because we prioritise the placements[0], instead of mm.region we end
> > > up moving it for no good reason. But in your case you are holding the
> > > lock, or it's pinned? Also is this just with the selftest, or
> > > something real?
> >
> > Or at least in the selftest I see ____i915_gem_object_get_pages()
> > which doesn't even consider the mm.pages AFAIK.
>
> The bogus migration is happening as part of the
> __i915_gem_object_get_pages() (2 __underscores) call in
> i915_gem_dmabuf_attach (see last patch).  That code is attempting to
> migrate the BO to SMEM and then pin it there using the obvious calls
> to do so.  However, in the pin_pages call, it gets implicitly migrated
> back to LMEM thanks to i915_ttm_get_pages().  Why is _get_pages()
> migrating things at all?

Not sure yet, but __two_underscores() checks if
i915_gem_object_has_pages() before actually calling into
i915_ttm_get_pages(), so the mm.pages would have to be NULL here for
some reason, so best guess is something to do with move_notify().
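
Roughly, as a simplified sketch (the real helper also asserts the
object lock and has a few extra checks):

    int __i915_gem_object_get_pages(struct drm_i915_gem_object *obj)
    {
        /* Only call into the backend (i915_ttm_get_pages() for TTM
         * objects) when no pages exist yet... */
        if (!i915_gem_object_has_pages(obj)) {
            int err = ____i915_gem_object_get_pages(obj);

            if (err)
                return err;
        }

        /* ...then pin whatever is there. */
        atomic_inc(&obj->mm.pages_pin_count);
        return 0;
    }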

>
> --Jason
>
> > >
> > > >
> > > > In case you can't tell, I really have no clue what I'm doing here.
> > > > I'm really stumbling around in the dark finding things that make my
> > > > bug go away.  I'm happy for the feedback.
> > > >
> > > > --Jason
> > > >
> > > > >
> > > > > >
> > > > > > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
> > > > > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/i915/gem/i915_gem_ttm.c | 3 +--
> > > > > >  1 file changed, 1 insertion(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > > > > > index d30f274c329c7..5985e994d56cf 100644
> > > > > > --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > > > > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > > > > > @@ -150,8 +150,7 @@ i915_ttm_placement_from_obj(const struct drm_i915_gem_object *obj,
> > > > > >         unsigned int i;
> > > > > >
> > > > > >         placement->num_placement = 1;
> > > > > > -       i915_ttm_place_from_region(num_allowed ? obj->mm.placements[0] :
> > > > > > -                                  obj->mm.region, requested, flags);
> > > > > > +       i915_ttm_place_from_region(obj->mm.region, requested, flags);
> > > > > >
> > > > > >         /* Cache this on object? */
> > > > > >         placement->num_busy_placement = num_allowed;
> > > > > > --
> > > > > > 2.31.1
> > > > > >
Jason Ekstrand July 16, 2021, 7:49 p.m. UTC | #7
On Fri, Jul 16, 2021 at 1:45 PM Matthew Auld
<matthew.william.auld@gmail.com> wrote:
>
> On Fri, 16 Jul 2021 at 18:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
> >
> > On Fri, Jul 16, 2021 at 11:00 AM Matthew Auld
> > <matthew.william.auld@gmail.com> wrote:
> > >
> > > On Fri, 16 Jul 2021 at 16:52, Matthew Auld
> > > <matthew.william.auld@gmail.com> wrote:
> > > >
> > > > On Fri, 16 Jul 2021 at 15:10, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > >
> > > > > On Fri, Jul 16, 2021 at 8:54 AM Matthew Auld
> > > > > <matthew.william.auld@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, 15 Jul 2021 at 23:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > >
> > > > > > > Whenever we had a user object (n_placements > 0), we were ignoring
> > > > > > > obj->mm.region and always putting obj->placements[0] as the requested
> > > > > > > region.  For LMEM+SMEM objects, this was causing them to get shoved into
> > > > > > > LMEM on every i915_ttm_get_pages() even when SMEM was requested by, say,
> > > > > > > i915_gem_object_migrate().
> > > > > >
> > > > > > i915_ttm_migrate calls i915_ttm_place_from_region() directly with the
> > > > > > requested region, so there shouldn't be an issue with migration right?
> > > > > > Do you have some more details?
> > > > >
> > > > > With i915_ttm_migrate directly, no.  But, in the last patch in the
> > > > > series, we're trying to migrate LMEM+SMEM buffers into SMEM on
> > > > > attach() and pin it there.  This blows up in a very unexpected (IMO)
> > > > > way.  The flow goes something like this:
> > > > >
> > > > >  - Client attempts a dma-buf import from another device
> > > > >  - In attach() we call i915_gem_object_migrate() which calls
> > > > > i915_ttm_migrate() which migrates as requested.
> > > > >  - Once the migration is complete, we call i915_gem_object_pin_pages()
> > > > > which calls i915_ttm_get_pages() which depends on
> > > > > i915_ttm_placement_from_obj() and so migrates it right back to LMEM.
> > > >
> > > > The mm.pages must be NULL here, otherwise it would just increment the
> > > > pages_pin_count?
> >
> > Given that the test is using the ____four_underscores version, it
> > doesn't have that check.  However, this executes after we've done the
> > dma-buf import which pinned pages.  So we should definitely have
> > pages.
>
> We shouldn't call ____four_underscores() if we might already have
> pages though. Under non-TTM that would leak the pages, and in TTM we
> might hit the WARN_ON(mm->pages) in __i915_ttm_get_pages(), if for
> example nothing was moved. I take it we can't just call pin_pages()?
> Four scary underscores usually means "don't call this in normal code".

I've switched the ____four_underscores call to a __two_underscores in
the selftests and it had no effect, good or bad.  But, still, probably
better to call that one.

> >
> > > > >
> > > > > Maybe the problem here is actually that our TTM code isn't respecting
> > > > > obj->mm.pages_pin_count?
> > > >
> > > > I think if the resource is moved, we always nuke the mm.pages after
> > > > being notified of the move. Also TTM is also not allowed to move
> > > > pinned buffers.
> > > >
> > > > I guess if we are evicted/swapped, so assuming we are not holding the
> > > > object lock, and it's not pinned, the future call to get_pages() will
> > > > see mm.pages = NULL, even though the ttm_resource is still there, and
> > > > because we prioritise the placements[0], instead of mm.region we end
> > > > up moving it for no good reason. But in your case you are holding the
> > > > lock, or it's pinned? Also is this just with the selftest, or
> > > > something real?
> > >
> > > Or at least in the selftest I see ____i915_gem_object_get_pages()
> > > which doesn't even consider the mm.pages AFAIK.
> >
> > The bogus migration is happening as part of the
> > __i915_gem_object_get_pages() (2 __underscores) call in
> > i915_gem_dmabuf_attach (see last patch).  That code is attempting to
> > migrate the BO to SMEM and then pin it there using the obvious calls
> > to do so.  However, in the pin_pages call, it gets implicitly migrated
> > back to LMEM thanks to i915_ttm_get_pages().  Why is _get_pages()
> > migrating things at all?
>
> Not sure yet, but __two_underscores() checks if
> i915_gem_object_has_pages() before actually calling into
> i915_ttm_get_pages(), so the mm.pages would have to be NULL here for
> some reason, so best guess is something to do with move_notify().

Did a bit of experimenting along those lines and added the following
to the self-test BEFORE the export/import:

    i915_gem_object_lock(obj, NULL);
    err = __i915_gem_object_get_pages(obj);
    __i915_gem_object_unpin_pages(obj);
    i915_gem_object_unlock(obj);
    if (err) {
        pr_err("__i915_gem_object_get_pages failed with err=%d\n", err);
        goto out_ret;
    }

This seems to make the migration happen as expected without this
patch.  So it seems the problem only exists on buffers that haven't
gotten any backing storage yet (if I'm understanding get_pages
correctly).

One potential work-around (not sure if this is a good idea or not!)
would be to do this inside dmabuf_attach().  Is this reliable?  Once
it has pages will it always have pages?  Or are there crazy races I
need to be worried about here?

--Jason
Matthew Auld July 19, 2021, 1:34 p.m. UTC | #8
On Fri, 16 Jul 2021 at 20:49, Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> On Fri, Jul 16, 2021 at 1:45 PM Matthew Auld
> <matthew.william.auld@gmail.com> wrote:
> >
> > On Fri, 16 Jul 2021 at 18:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > >
> > > On Fri, Jul 16, 2021 at 11:00 AM Matthew Auld
> > > <matthew.william.auld@gmail.com> wrote:
> > > >
> > > > On Fri, 16 Jul 2021 at 16:52, Matthew Auld
> > > > <matthew.william.auld@gmail.com> wrote:
> > > > >
> > > > > On Fri, 16 Jul 2021 at 15:10, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > >
> > > > > > On Fri, Jul 16, 2021 at 8:54 AM Matthew Auld
> > > > > > <matthew.william.auld@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, 15 Jul 2021 at 23:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > > >
> > > > > > > > Whenever we had a user object (n_placements > 0), we were ignoring
> > > > > > > > obj->mm.region and always putting obj->placements[0] as the requested
> > > > > > > > region.  For LMEM+SMEM objects, this was causing them to get shoved into
> > > > > > > > LMEM on every i915_ttm_get_pages() even when SMEM was requested by, say,
> > > > > > > > i915_gem_object_migrate().
> > > > > > >
> > > > > > > i915_ttm_migrate calls i915_ttm_place_from_region() directly with the
> > > > > > > requested region, so there shouldn't be an issue with migration right?
> > > > > > > Do you have some more details?
> > > > > >
> > > > > > With i915_ttm_migrate directly, no.  But, in the last patch in the
> > > > > > series, we're trying to migrate LMEM+SMEM buffers into SMEM on
> > > > > > attach() and pin it there.  This blows up in a very unexpected (IMO)
> > > > > > way.  The flow goes something like this:
> > > > > >
> > > > > >  - Client attempts a dma-buf import from another device
> > > > > >  - In attach() we call i915_gem_object_migrate() which calls
> > > > > > i915_ttm_migrate() which migrates as requested.
> > > > > >  - Once the migration is complete, we call i915_gem_object_pin_pages()
> > > > > > which calls i915_ttm_get_pages() which depends on
> > > > > > i915_ttm_placement_from_obj() and so migrates it right back to LMEM.
> > > > >
> > > > > The mm.pages must be NULL here, otherwise it would just increment the
> > > > > pages_pin_count?
> > >
> > > Given that the test is using the ____four_underscores version, it
> > > doesn't have that check.  However, this executes after we've done the
> > > dma-buf import which pinned pages.  So we should definitely have
> > > pages.
> >
> > We shouldn't call ____four_underscores() if we might already have
> > pages though. Under non-TTM that would leak the pages, and in TTM we
> > might hit the WARN_ON(mm->pages) in __i915_ttm_get_pages(), if for
> > example nothing was moved. I take it we can't just call pin_pages()?
> > Four scary underscores usually means "don't call this in normal code".
>
> I've switched the ____four_underscores call to a __two_underscores in
> the selftests and it had no effect, good or bad.  But, still, probably
> better to call that one.
>
> > >
> > > > > >
> > > > > > Maybe the problem here is actually that our TTM code isn't respecting
> > > > > > obj->mm.pages_pin_count?
> > > > >
> > > > > I think if the resource is moved, we always nuke the mm.pages after
> > > > > being notified of the move. Also TTM is also not allowed to move
> > > > > pinned buffers.
> > > > >
> > > > > I guess if we are evicted/swapped, so assuming we are not holding the
> > > > > object lock, and it's not pinned, the future call to get_pages() will
> > > > > see mm.pages = NULL, even though the ttm_resource is still there, and
> > > > > because we prioritise the placements[0], instead of mm.region we end
> > > > > up moving it for no good reason. But in your case you are holding the
> > > > > lock, or it's pinned? Also is this just with the selftest, or
> > > > > something real?
> > > >
> > > > Or at least in the selftest I see ____i915_gem_object_get_pages()
> > > > which doesn't even consider the mm.pages AFAIK.
> > >
> > > The bogus migration is happening as part of the
> > > __i915_gem_object_get_pages() (2 __underscores) call in
> > > i915_gem_dmabuf_attach (see last patch).  That code is attempting to
> > > migrate the BO to SMEM and then pin it there using the obvious calls
> > > to do so.  However, in the pin_pages call, it gets implicitly migrated
> > > back to LMEM thanks to i915_ttm_get_pages().  Why is _get_pages()
> > > migrating things at all?
> >
> > Not sure yet, but __two_underscores() checks if
> > i915_gem_object_has_pages() before actually calling into
> > i915_ttm_get_pages(), so the mm.pages would have to be NULL here for
> > some reason, so best guess is something to do with move_notify().
>
> Did a bit of experimenting along those lines and added the following
> to the self-test BEFORE the export/import:
>
>     i915_gem_object_lock(obj, NULL);
>     err = __i915_gem_object_get_pages(obj);
>     __i915_gem_object_unpin_pages(obj);
>     i915_gem_object_unlock(obj);
>     if (err) {
>         pr_err("__i915_gem_object_get_pages failed with err=%d\n", err);
>         goto out_ret;
>     }
>
> This seems to make the migration happen as expected without this
> patch.  So it seems the problem only exists on buffers that haven't
> gotten any backing storage yet (if I'm understanding get_pages
> correctly).
>
> One potential work-around (not sure if this is a good idea or not!)
> would be to do this inside dmabuf_attach().  Is this reliable?  Once
> it has pages will it always have pages?  Or are there crazy races I
> need to be worried about here?

It turns out that the i915_ttm_adjust_gem_after_move() call in
ttm_object_init will always update the mm.region to system memory(so
that it matches the ttm resource), which seems reasonable given the
default system placeholder thing, but does seem slightly iffy since we
haven't actually moved/allocated anything.

So effectively i915_ttm_migrate(SYSTEM) becomes a noop here since
mm.region == mr. Which ofc means when we actually call get_pages() all
that happens is that we allocate the pages in system memory(or without
this patch placements[0]). Also with this patch lmem+smem, will always
be placed in smem first, regardless of the placements ordering.
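
So for a fresh lmem+smem object the sequence is roughly (pseudo-trace,
details elided):

    ttm object init
      -> i915_ttm_adjust_gem_after_move()  /* mm.region := SMEM placeholder */
    i915_ttm_migrate(obj, SMEM)            /* no-op, mm.region == mr */
    i915_ttm_get_pages(obj)                /* allocates per placements[0],
                                              i.e. LMEM, without this patch */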

For now we could maybe just split i915_ttm_adjust_gem_after_move() so
we skip the part which updates the mm.region here in the init portion,
since that should only happen when we try to place the object for
real?

>
> --Jason
Jason Ekstrand July 21, 2021, 8:11 p.m. UTC | #9
On Mon, Jul 19, 2021 at 8:35 AM Matthew Auld
<matthew.william.auld@gmail.com> wrote:
>
> On Fri, 16 Jul 2021 at 20:49, Jason Ekstrand <jason@jlekstrand.net> wrote:
> >
> > On Fri, Jul 16, 2021 at 1:45 PM Matthew Auld
> > <matthew.william.auld@gmail.com> wrote:
> > >
> > > On Fri, 16 Jul 2021 at 18:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > >
> > > > On Fri, Jul 16, 2021 at 11:00 AM Matthew Auld
> > > > <matthew.william.auld@gmail.com> wrote:
> > > > >
> > > > > On Fri, 16 Jul 2021 at 16:52, Matthew Auld
> > > > > <matthew.william.auld@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, 16 Jul 2021 at 15:10, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > >
> > > > > > > On Fri, Jul 16, 2021 at 8:54 AM Matthew Auld
> > > > > > > <matthew.william.auld@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, 15 Jul 2021 at 23:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > > > >
> > > > > > > > > Whenever we had a user object (n_placements > 0), we were ignoring
> > > > > > > > > obj->mm.region and always putting obj->placements[0] as the requested
> > > > > > > > > region.  For LMEM+SMEM objects, this was causing them to get shoved into
> > > > > > > > > LMEM on every i915_ttm_get_pages() even when SMEM was requested by, say,
> > > > > > > > > i915_gem_object_migrate().
> > > > > > > >
> > > > > > > > i915_ttm_migrate calls i915_ttm_place_from_region() directly with the
> > > > > > > > requested region, so there shouldn't be an issue with migration right?
> > > > > > > > Do you have some more details?
> > > > > > >
> > > > > > > With i915_ttm_migrate directly, no.  But, in the last patch in the
> > > > > > > series, we're trying to migrate LMEM+SMEM buffers into SMEM on
> > > > > > > attach() and pin it there.  This blows up in a very unexpected (IMO)
> > > > > > > way.  The flow goes something like this:
> > > > > > >
> > > > > > >  - Client attempts a dma-buf import from another device
> > > > > > >  - In attach() we call i915_gem_object_migrate() which calls
> > > > > > > i915_ttm_migrate() which migrates as requested.
> > > > > > >  - Once the migration is complete, we call i915_gem_object_pin_pages()
> > > > > > > which calls i915_ttm_get_pages() which depends on
> > > > > > > i915_ttm_placement_from_obj() and so migrates it right back to LMEM.
> > > > > >
> > > > > > The mm.pages must be NULL here, otherwise it would just increment the
> > > > > > pages_pin_count?
> > > >
> > > > Given that the test is using the ____four_underscores version, it
> > > > doesn't have that check.  However, this executes after we've done the
> > > > dma-buf import which pinned pages.  So we should definitely have
> > > > pages.
> > >
> > > We shouldn't call ____four_underscores() if we might already have
> > > pages though. Under non-TTM that would leak the pages, and in TTM we
> > > might hit the WARN_ON(mm->pages) in __i915_ttm_get_pages(), if for
> > > example nothing was moved. I take it we can't just call pin_pages()?
> > > Four scary underscores usually means "don't call this in normal code".
> >
> > I've switched the ____four_underscores call to a __two_underscores in
> > the selftests and it had no effect, good or bad.  But, still, probably
> > better to call that one.
> >
> > > >
> > > > > > >
> > > > > > > Maybe the problem here is actually that our TTM code isn't respecting
> > > > > > > obj->mm.pages_pin_count?
> > > > > >
> > > > > > I think if the resource is moved, we always nuke the mm.pages after
> > > > > > being notified of the move. Also TTM is also not allowed to move
> > > > > > pinned buffers.
> > > > > >
> > > > > > I guess if we are evicted/swapped, so assuming we are not holding the
> > > > > > object lock, and it's not pinned, the future call to get_pages() will
> > > > > > see mm.pages = NULL, even though the ttm_resource is still there, and
> > > > > > because we prioritise the placements[0], instead of mm.region we end
> > > > > > up moving it for no good reason. But in your case you are holding the
> > > > > > lock, or it's pinned? Also is this just with the selftest, or
> > > > > > something real?
> > > > >
> > > > > Or at least in the selftest I see ____i915_gem_object_get_pages()
> > > > > which doesn't even consider the mm.pages AFAIK.
> > > >
> > > > The bogus migration is happening as part of the
> > > > __i915_gem_object_get_pages() (2 __underscores) call in
> > > > i915_gem_dmabuf_attach (see last patch).  That code is attempting to
> > > > migrate the BO to SMEM and then pin it there using the obvious calls
> > > > to do so.  However, in the pin_pages call, it gets implicitly migrated
> > > > back to LMEM thanks to i915_ttm_get_pages().  Why is _get_pages()
> > > > migrating things at all?
> > >
> > > Not sure yet, but __two_underscores() checks if
> > > i915_gem_object_has_pages() before actually calling into
> > > i915_ttm_get_pages(), so the mm.pages would have to be NULL here for
> > > some reason, so best guess is something to do with move_notify().
> >
> > Did a bit of experimenting along those lines and added the following
> > to the self-test BEFORE the export/import:
> >
> >     i915_gem_object_lock(obj, NULL);
> >     err = __i915_gem_object_get_pages(obj);
> >     __i915_gem_object_unpin_pages(obj);
> >     i915_gem_object_unlock(obj);
> >     if (err) {
> >         pr_err("__i915_gem_object_get_pages failed with err=%d\n", err);
> >         goto out_ret;
> >     }
> >
> > This seems to make the migration happen as expected without this
> > patch.  So it seems the problem only exists on buffers that haven't
> > gotten any backing storage yet (if I'm understanding get_pages
> > correctly).
> >
> > One potential work-around (not sure if this is a good idea or not!)
> > would be to do this inside dmabuf_attach().  Is this reliable?  Once
> > it has pages will it always have pages?  Or are there crazy races I
> > need to be worried about here?
>
> It turns out that the i915_ttm_adjust_gem_after_move() call in
> ttm_object_init will always update the mm.region to system memory(so
> that it matches the ttm resource), which seems reasonable given the
> default system placeholder thing, but does seem slightly iffy since we
> haven't actually moved/allocated anything.
>
> So effectively i915_ttm_migrate(SYSTEM) becomes a noop here since
> mm.region == mr. Which ofc means when we actually call get_pages() all
> that happens is that we allocate the pages in system memory(or without
> this patch placements[0]). Also with this patch lmem+smem, will always
> be placed in smem first, regardless of the placements ordering.
>
> For now we could maybe just split i915_ttm_adjust_gem_after_move() so
> we skip the part which updates the mm.region here in the init portion,
> since that should only happen when we try to place the object for
> real?

Doesn't that mean we would end up with obj->mm.region and
obj->mm.res->mem_type are out-of-sync?  That seems bad.  I would think
we'd want the two in sync at all times.

It seems like the fundamental problem here is that, when it's created,
the object isn't really in any memory region at all.  While I don't
think obj->mm.region == NULL is allowed or a good idea, it does seem
closer to the ground truth.

Perhaps what we really want is for i915_gem_object_migrate to
get_pages before it does the migration to ensure that pages exist.
The only call to i915_gem_object_migrate in the code-base today is in
the display code and it's immediately followed by pin_pages().  For
that matter, maybe the call we actually want is
i915_object_migrate_and_pin that does the whole lot.
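
Purely as a sketch of that idea (this helper doesn't exist today; the
name and shape are just the proposal above, and the caller is assumed
to hold the object lock via the ww context):

    static int
    i915_gem_object_migrate_and_pin(struct drm_i915_gem_object *obj,
                                    struct i915_gem_ww_ctx *ww,
                                    enum intel_region_id id)
    {
        int err;

        err = i915_gem_object_migrate(obj, ww, id);
        if (err)
            return err;

        /* Pin immediately so the object can't be moved again before
         * the caller is done with it. */
        return i915_gem_object_pin_pages(obj);
    }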

Thoughts?

--Jason

P.S.  I'm going to go ahead and send another version with your other
comments addressed.  We can keep this discussion going here for now.
Daniel Vetter July 21, 2021, 8:32 p.m. UTC | #10
On Wed, Jul 21, 2021 at 10:11 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> On Mon, Jul 19, 2021 at 8:35 AM Matthew Auld
> <matthew.william.auld@gmail.com> wrote:
> >
> > On Fri, 16 Jul 2021 at 20:49, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > >
> > > On Fri, Jul 16, 2021 at 1:45 PM Matthew Auld
> > > <matthew.william.auld@gmail.com> wrote:
> > > >
> > > > On Fri, 16 Jul 2021 at 18:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > >
> > > > > On Fri, Jul 16, 2021 at 11:00 AM Matthew Auld
> > > > > <matthew.william.auld@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, 16 Jul 2021 at 16:52, Matthew Auld
> > > > > > <matthew.william.auld@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, 16 Jul 2021 at 15:10, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > > >
> > > > > > > > On Fri, Jul 16, 2021 at 8:54 AM Matthew Auld
> > > > > > > > <matthew.william.auld@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, 15 Jul 2021 at 23:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > > > > >
> > > > > > > > > > Whenever we had a user object (n_placements > 0), we were ignoring
> > > > > > > > > > obj->mm.region and always putting obj->placements[0] as the requested
> > > > > > > > > > region.  For LMEM+SMEM objects, this was causing them to get shoved into
> > > > > > > > > > LMEM on every i915_ttm_get_pages() even when SMEM was requested by, say,
> > > > > > > > > > i915_gem_object_migrate().
> > > > > > > > >
> > > > > > > > > i915_ttm_migrate calls i915_ttm_place_from_region() directly with the
> > > > > > > > > requested region, so there shouldn't be an issue with migration right?
> > > > > > > > > Do you have some more details?
> > > > > > > >
> > > > > > > > With i915_ttm_migrate directly, no.  But, in the last patch in the
> > > > > > > > series, we're trying to migrate LMEM+SMEM buffers into SMEM on
> > > > > > > > attach() and pin it there.  This blows up in a very unexpected (IMO)
> > > > > > > > way.  The flow goes something like this:
> > > > > > > >
> > > > > > > >  - Client attempts a dma-buf import from another device
> > > > > > > >  - In attach() we call i915_gem_object_migrate() which calls
> > > > > > > > i915_ttm_migrate() which migrates as requested.
> > > > > > > >  - Once the migration is complete, we call i915_gem_object_pin_pages()
> > > > > > > > which calls i915_ttm_get_pages() which depends on
> > > > > > > > i915_ttm_placement_from_obj() and so migrates it right back to LMEM.
> > > > > > >
> > > > > > > The mm.pages must be NULL here, otherwise it would just increment the
> > > > > > > pages_pin_count?
> > > > >
> > > > > Given that the test is using the ____four_underscores version, it
> > > > > doesn't have that check.  However, this executes after we've done the
> > > > > dma-buf import which pinned pages.  So we should definitely have
> > > > > pages.
> > > >
> > > > We shouldn't call ____four_underscores() if we might already have
> > > > pages though. Under non-TTM that would leak the pages, and in TTM we
> > > > might hit the WARN_ON(mm->pages) in __i915_ttm_get_pages(), if for
> > > > example nothing was moved. I take it we can't just call pin_pages()?
> > > > Four scary underscores usually means "don't call this in normal code".
> > >
> > > I've switched the ____four_underscores call to a __two_underscores in
> > > the selftests and it had no effect, good or bad.  But, still, probably
> > > better to call that one.
> > >
> > > > >
> > > > > > > >
> > > > > > > > Maybe the problem here is actually that our TTM code isn't respecting
> > > > > > > > obj->mm.pages_pin_count?
> > > > > > >
> > > > > > > I think if the resource is moved, we always nuke the mm.pages after
> > > > > > > being notified of the move. Also TTM is also not allowed to move
> > > > > > > pinned buffers.
> > > > > > >
> > > > > > > I guess if we are evicted/swapped, so assuming we are not holding the
> > > > > > > object lock, and it's not pinned, the future call to get_pages() will
> > > > > > > see mm.pages = NULL, even though the ttm_resource is still there, and
> > > > > > > because we prioritise the placements[0], instead of mm.region we end
> > > > > > > up moving it for no good reason. But in your case you are holding the
> > > > > > > lock, or it's pinned? Also is this just with the selftest, or
> > > > > > > something real?
> > > > > >
> > > > > > Or at least in the selftest I see ____i915_gem_object_get_pages()
> > > > > > which doesn't even consider the mm.pages AFAIK.
> > > > >
> > > > > The bogus migration is happening as part of the
> > > > > __i915_gem_object_get_pages() (2 __underscores) call in
> > > > > i915_gem_dmabuf_attach (see last patch).  That code is attempting to
> > > > > migrate the BO to SMEM and then pin it there using the obvious calls
> > > > > to do so.  However, in the pin_pages call, it gets implicitly migrated
> > > > > back to LMEM thanks to i915_ttm_get_pages().  Why is _get_pages()
> > > > > migrating things at all?
> > > >
> > > > Not sure yet, but __two_underscores() checks if
> > > > i915_gem_object_has_pages() before actually calling into
> > > > i915_ttm_get_pages(), so the mm.pages would have to be NULL here for
> > > > some reason, so best guess is something to do with move_notify().
> > >
> > > Did a bit of experimenting along those lines and added the following
> > > to the self-test BEFORE the export/import:
> > >
> > >     i915_gem_object_lock(obj, NULL);
> > >     err = __i915_gem_object_get_pages(obj);
> > >     __i915_gem_object_unpin_pages(obj);
> > >     i915_gem_object_unlock(obj);
> > >     if (err) {
> > >         pr_err("__i915_gem_object_get_pages failed with err=%d\n", err);
> > >         goto out_ret;
> > >     }
> > >
> > > This seems to make the migration happen as expected without this
> > > patch.  So it seems the problem only exists on buffers that haven't
> > > gotten any backing storage yet (if I'm understanding get_pages
> > > correctly).
> > >
> > > One potential work-around (not sure if this is a good idea or not!)
> > > would be to do this inside dmabuf_attach().  Is this reliable?  Once
> > > it has pages will it always have pages?  Or are there crazy races I
> > > need to be worried about here?
> >
> > It turns out that the i915_ttm_adjust_gem_after_move() call in
> > ttm_object_init will always update the mm.region to system memory(so
> > that it matches the ttm resource), which seems reasonable given the
> > default system placeholder thing, but does seem slightly iffy since we
> > haven't actually moved/allocated anything.
> >
> > So effectively i915_ttm_migrate(SYSTEM) becomes a noop here since
> > mm.region == mr. Which ofc means when we actually call get_pages() all
> > that happens is that we allocate the pages in system memory(or without
> > this patch placements[0]). Also with this patch lmem+smem, will always
> > be placed in smem first, regardless of the placements ordering.
> >
> > For now we could maybe just split i915_ttm_adjust_gem_after_move() so
> > we skip the part which updates the mm.region here in the init portion,
> > since that should only happen when we try to place the object for
> > real?
>
> Doesn't that mean we would end up with obj->mm.region and
> obj->mm.res->mem_type are out-of-sync?  That seems bad.  I would think
> we'd want the two in sync at all times.
>
> It seems like the fundamental problem here is that, when it's created,
> the object isn't really in any memory region at all.  While I don't
> think obj->mm.region == NULL is allowed or a good idea, it does seem
> closer to the ground truth.
>
> Perhaps what we really want is for i915_gem_object_migrate to
> get_pages before it does the migration to ensure that pages exist.
> The only call to i915_gem_object_migrate in the code-base today is in
> the display code and it's immediately followed by pin_pages().  For
> that matter, maybe the call we actually want is
> i915_object_migrate_and_pin that does the whole lot.

I think long term at least we should track the current region in the
ttm_resource struct (which can now be subclassed, yay, so we can stuff
anything we want into that one). And maybe make our regions proper
subclasses of ttm_resource_manager.

Even on platforms where ttm isn't used, we will simply use the same
fields until the code is more unified.

With that there should be only the invariant placement list from
create_ext and the current region allocation left, and nothing else.
Also this would give us a very clear design for the objects which
change their type on igfx (like the shmem->phys_object stuff, which
currently just punches out the ->ops table and hopes for the best).

Also, get_pages is just a transition crutch, we really want to manage
placement more explicitly:
- for migration in dma-buf or display and other places where we must
limit the list of placements beyond what the user specified (and fail
if the intersection is empty)
- for execbuf, where we want to limit migration to avoid thrashing,
i.e. get_pages shouldn't just blindly try to move the buffer (but also
should not just never try to move the buffer, either is suboptimal).

All of this should be orthogonal to pin, which just nails something in
place wherever it is. That is very much not what we currently have,
because right now pin is what allows you to say where you need the
object to be: in the old design only holding a pin prevented the
object from slipping away, whereas now we'll move over to dma_resv_lock
and explicit migration, instead of smashing it all into the one pin
function call.

> Thoughts?

I think, aside from starting to embed the ttm objects in the right
places and starting to use only those fields where we duplicate, we
should leave things as-is for now. It's not pretty, but we need a
pile more ttm things in place first.
-Daniel

>
> --Jason
>
> P.S.  I'm going to go ahead and send another version with your other
> comments addressed.  We can keep this discussion going here for now.
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Matthew Auld July 22, 2021, 9:49 a.m. UTC | #11
On Wed, 21 Jul 2021 at 21:11, Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> On Mon, Jul 19, 2021 at 8:35 AM Matthew Auld
> <matthew.william.auld@gmail.com> wrote:
> >
> > On Fri, 16 Jul 2021 at 20:49, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > >
> > > On Fri, Jul 16, 2021 at 1:45 PM Matthew Auld
> > > <matthew.william.auld@gmail.com> wrote:
> > > >
> > > > On Fri, 16 Jul 2021 at 18:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > >
> > > > > On Fri, Jul 16, 2021 at 11:00 AM Matthew Auld
> > > > > <matthew.william.auld@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, 16 Jul 2021 at 16:52, Matthew Auld
> > > > > > <matthew.william.auld@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, 16 Jul 2021 at 15:10, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > > >
> > > > > > > > On Fri, Jul 16, 2021 at 8:54 AM Matthew Auld
> > > > > > > > <matthew.william.auld@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, 15 Jul 2021 at 23:39, Jason Ekstrand <jason@jlekstrand.net> wrote:
> > > > > > > > > >
> > > > > > > > > > Whenever we had a user object (n_placements > 0), we were ignoring
> > > > > > > > > > obj->mm.region and always putting obj->placements[0] as the requested
> > > > > > > > > > region.  For LMEM+SMEM objects, this was causing them to get shoved into
> > > > > > > > > > LMEM on every i915_ttm_get_pages() even when SMEM was requested by, say,
> > > > > > > > > > i915_gem_object_migrate().
> > > > > > > > >
> > > > > > > > > i915_ttm_migrate calls i915_ttm_place_from_region() directly with the
> > > > > > > > > requested region, so there shouldn't be an issue with migration right?
> > > > > > > > > Do you have some more details?
> > > > > > > >
> > > > > > > > With i915_ttm_migrate directly, no.  But, in the last patch in the
> > > > > > > > series, we're trying to migrate LMEM+SMEM buffers into SMEM on
> > > > > > > > attach() and pin it there.  This blows up in a very unexpected (IMO)
> > > > > > > > way.  The flow goes something like this:
> > > > > > > >
> > > > > > > >  - Client attempts a dma-buf import from another device
> > > > > > > >  - In attach() we call i915_gem_object_migrate() which calls
> > > > > > > > i915_ttm_migrate() which migrates as requested.
> > > > > > > >  - Once the migration is complete, we call i915_gem_object_pin_pages()
> > > > > > > > which calls i915_ttm_get_pages() which depends on
> > > > > > > > i915_ttm_placement_from_obj() and so migrates it right back to LMEM.
> > > > > > >
> > > > > > > The mm.pages must be NULL here, otherwise it would just increment the
> > > > > > > pages_pin_count?
> > > > >
> > > > > Given that the test is using the ____four_underscores version, it
> > > > > doesn't have that check.  However, this executes after we've done the
> > > > > dma-buf import which pinned pages.  So we should definitely have
> > > > > pages.
> > > >
> > > > We shouldn't call ____four_underscores() if we might already have
> > > > pages though. Under non-TTM that would leak the pages, and in TTM we
> > > > might hit the WARN_ON(mm->pages) in __i915_ttm_get_pages(), if for
> > > > example nothing was moved. I take it we can't just call pin_pages()?
> > > > Four scary underscores usually means "don't call this in normal code".
> > >
> > > I've switched the ____four_underscores call to a __two_underscores in
> > > the selftests and it had no effect, good or bad.  But, still, probably
> > > better to call that one.
> > >
> > > > >
> > > > > > > >
> > > > > > > > Maybe the problem here is actually that our TTM code isn't respecting
> > > > > > > > obj->mm.pages_pin_count?
> > > > > > >
> > > > > > > I think if the resource is moved, we always nuke the mm.pages after
> > > > > > > being notified of the move. Also TTM is also not allowed to move
> > > > > > > pinned buffers.
> > > > > > >
> > > > > > > I guess if we are evicted/swapped, so assuming we are not holding the
> > > > > > > object lock, and it's not pinned, the future call to get_pages() will
> > > > > > > see mm.pages = NULL, even though the ttm_resource is still there, and
> > > > > > > because we prioritise the placements[0], instead of mm.region we end
> > > > > > > up moving it for no good reason. But in your case you are holding the
> > > > > > > lock, or it's pinned? Also is this just with the selftest, or
> > > > > > > something real?
> > > > > >
> > > > > > Or at least in the selftest I see ____i915_gem_object_get_pages()
> > > > > > which doesn't even consider the mm.pages AFAIK.
> > > > >
> > > > > The bogus migration is happening as part of the
> > > > > __i915_gem_object_get_pages() (2 __underscores) call in
> > > > > i915_gem_dmabuf_attach (see last patch).  That code is attempting to
> > > > > migrate the BO to SMEM and then pin it there using the obvious calls
> > > > > to do so.  However, in the pin_pages call, it gets implicitly migrated
> > > > > back to LMEM thanks to i915_ttm_get_pages().  Why is _get_pages()
> > > > > migrating things at all?
> > > >
> > > > Not sure yet, but __two_underscores() checks if
> > > > i915_gem_object_has_pages() before actually calling into
> > > > i915_ttm_get_pages(), so the mm.pages would have to be NULL here for
> > > > some reason, so best guess is something to do with move_notify().
> > >
> > > Did a bit of experimenting along those lines and added the following
> > > to the self-test BEFORE the export/import:
> > >
> > >     i915_gem_object_lock(obj, NULL);
> > >     err = __i915_gem_object_get_pages(obj);
> > >     __i915_gem_object_unpin_pages(obj);
> > >     i915_gem_object_unlock(obj);
> > >     if (err) {
> > >         pr_err("__i915_gem_object_get_pages failed with err=%d\n", err);
> > >         goto out_ret;
> > >     }
> > >
> > > This seems to make the migration happen as expected without this
> > > patch.  So it seems the problem only exists on buffers that haven't
> > > gotten any backing storage yet (if I'm understanding get_pages
> > > correctly).
> > >
> > > One potential work-around (not sure if this is a good idea or not!)
> > > would be to do this inside dmabuf_attach().  Is this reliable?  Once
> > > it has pages will it always have pages?  Or are there crazy races I
> > > need to be worried about here?
> >
> > It turns out that the i915_ttm_adjust_gem_after_move() call in
> > ttm_object_init will always update the mm.region to system memory (so
> > that it matches the ttm resource), which seems reasonable given the
> > default system placeholder thing, but does seem slightly iffy since we
> > haven't actually moved/allocated anything.
> >
> > So effectively i915_ttm_migrate(SYSTEM) becomes a noop here since
> > mm.region == mr. Which ofc means when we actually call get_pages() all
> > that happens is that we allocate the pages in system memory (or without
> > this patch placements[0]). Also with this patch lmem+smem, will always
> > be placed in smem first, regardless of the placements ordering.
> >
> > For now we could maybe just split i915_ttm_adjust_gem_after_move() so
> > we skip the part which updates the mm.region here in the init portion,
> > since that should only happen when we try to place the object for
> > real?
>
> Doesn't that mean we would end up with obj->mm.region and
> obj->mm.res->mem_type being out of sync?  That seems bad.  I would think
> we'd want the two in sync at all times.

It likely doesn't matter since all roads lead to i915_ttm_get_pages()
when we need to actually use the object?

Also updating the mm.region in ttm_object_init() to reflect the dummy
ttm resource seems a little scary, since any existing is_lmem() check
now needs to happen after we place the object. Or at least the
existing callers (for kernel-internal objects) might not have expected
that behaviour. Not sure if we checked all the callers.
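
As a rough illustration of that ordering concern (editorial sketch only;
i915_gem_object_create_lmem()/i915_gem_object_is_lmem() are the helpers I'm
assuming the is_lmem() checks refer to, nothing here is taken from the
series):

    obj = i915_gem_object_create_lmem(i915, size, 0);
    if (IS_ERR(obj))
        return PTR_ERR(obj);

    /* With mm.region switched to the dummy system resource at init,
     * this is false even for an lmem-only object that simply hasn't
     * been placed yet, which may surprise existing callers. */
    lmem_before = i915_gem_object_is_lmem(obj);

    /* Only after the object has been placed/populated does the check
     * mean what those callers probably expect. */
    i915_gem_object_lock(obj, NULL);
    err = __i915_gem_object_get_pages(obj);
    if (!err) {
        lmem_after = i915_gem_object_is_lmem(obj);
        __i915_gem_object_unpin_pages(obj);
    }
    i915_gem_object_unlock(obj);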

>
> It seems like the fundamental problem here is that, when it's created,
> the object isn't really in any memory region at all.  While I don't
> think obj->mm.region == NULL is allowed or a good idea, it does seem
> closer to the ground truth.

Yeah, seems reasonable, especially for create_user where we don't know
the placement until we actually call get_pages(). I think for internal
users like with create_lmem() setting the mm.region early still makes
some sense?

>
> Perhaps what we really want is for i915_gem_object_migrate to
> get_pages before it does the migration to ensure that pages exist.
> The only call to i915_gem_object_migrate in the code-base today is in
> the display code and it's immediately followed by pin_pages().  For
> that matter, maybe the call we actually want is
> i915_object_migrate_and_pin that does the whole lot.

I guess the only downside is that we might end up doing a real
migration, with memcpy or the blitter vs just changing the preferred
placement for later? I think just go with whatever you feel is the
simplest for now.

>
> Thoughts?
>
> --Jason
>
> P.S.  I'm going to go ahead and send another version with your other
> comments addressed.  We can keep this discussion going here for now.
Matthew Auld July 22, 2021, 9:59 a.m. UTC | #12
On Thu, 22 Jul 2021 at 10:49, Matthew Auld
<matthew.william.auld@gmail.com> wrote:
>
> [snip: full quote of the earlier thread trimmed; unchanged from the messages above]
> >
> > Perhaps what we really want is for i915_gem_object_migrate to
> > get_pages before it does the migration to ensure that pages exist.
> > The only call to i915_gem_object_migrate in the code-base today is in
> > the display code and it's immediately followed by pin_pages().  For
> > that matter, maybe the call we actually want is
> > i915_object_migrate_and_pin that does the whole lot.
>
> I guess the only downside is that we might end up doing a real
> migration, with memcpy or the blitter vs just changing the preferred
> placement for later? I think just go with whatever you feel is the
> simplest for now.

Another cheapo could be to drop the mr == mm.region noop, and just try
to place the object at mr anyway?
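
Something like this untested sketch, purely for illustration; the body of
i915_ttm_migrate() here is paraphrased and the TTM plumbing
(ttm_bo_validate() and friends) is an assumption, not a quote of the real
function:

    static int i915_ttm_migrate(struct drm_i915_gem_object *obj,
                                struct intel_memory_region *mr)
    {
        struct ttm_buffer_object *bo = i915_gem_to_ttm(obj);
        struct ttm_operation_ctx ctx = { .interruptible = true };
        struct ttm_placement placement = {};
        struct ttm_place requested;
        int ret;

        /* Previously: if (obj->mm.region == mr) return 0;  <- dropped */

        i915_ttm_place_from_region(mr, &requested, obj->flags);
        placement.num_placement = 1;
        placement.placement = &requested;
        placement.num_busy_placement = 1;
        placement.busy_placement = &requested;

        /* Let TTM work out whether anything actually needs to move. */
        ret = ttm_bo_validate(bo, &placement, &ctx);
        if (ret)
            return ret;

        /* Keep the GEM side (mm.region) in sync with the ttm_resource. */
        i915_ttm_adjust_gem_after_move(obj);
        return 0;
    }
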

>
> >
> > Thoughts?
> >
> > --Jason
> >
> > P.S.  I'm going to go ahead and send another version with your other
> > comments addressed.  We can keep this discussion going here for now.
Thomas Hellstrom Aug. 4, 2021, 8 a.m. UTC | #13
Hi,

On 7/22/21 11:59 AM, Matthew Auld wrote:
> [snip: quoted thread trimmed; unchanged from the messages above]
> Another cheapo could be to drop the mr == mm.region noop, and just try
> to place the object at mr anyway?
>
There are a number of things to consider here:

First, as Jason found out, what's keeping things from working as intended
is that we actually call into TTM get_pages() after migration, since the
object isn't populated with pages yet. That's indeed a bug.

We should probably have migrate be migrate_and_populate(): whatever
kernel code decides to migrate needs to hold the object lock over the
operation for which the data needs to be migrated, or in the worst case
call pin() under the lock, which currently needs to be the case for
dma-buf and display.
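
A minimal sketch of what such a combined helper could look like, assuming
the i915_gem_object_migrate() signature from this series takes (obj, ww,
region id); the helper name and the details are illustrative, not an
actual implementation:

    /* Editorial sketch: migrate and populate/pin under one object lock. */
    static int
    i915_gem_object_migrate_and_pin(struct drm_i915_gem_object *obj,
                                    struct i915_gem_ww_ctx *ww,
                                    enum intel_region_id id)
    {
        int err;

        assert_object_held(obj);

        err = i915_gem_object_migrate(obj, ww, id);
        if (err)
            return err;

        /*
         * Populate and pin while the lock is still held, so a later
         * get_pages() cannot pick a different placement behind our back.
         */
        return i915_gem_object_pin_pages(obj);
    }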

If we blindly just look at obj->mm.region in get_pages(), then an
object with allowable placements in lmem and smem that initially gets
placed in lmem and is then evicted to smem will never migrate back to
lmem unless there is an explicit i915_gem_object_migrate(). But again,
that's perhaps what we want? I guess we need to more clearly define the
migration policies; for example, should we attempt to migrate evicted
buffers back to lmem on each execbuf where they are referenced, even if
they haven't lost their pages?
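
For instance, one conservative execbuf-time option could look roughly like
the fragment below; i915_gem_object_can_migrate() and INTEL_REGION_LMEM are
assumptions used purely to illustrate the "only if it already lost its
pages" variant:

    /* Hypothetical policy sketch, not real i915 code: only consider
     * moving an lmem+smem object back to lmem if eviction has already
     * cost it its pages anyway. */
    if (!i915_gem_object_has_pages(obj) &&
        i915_gem_object_can_migrate(obj, INTEL_REGION_LMEM)) {
        err = i915_gem_object_migrate(obj, ww, INTEL_REGION_LMEM);
        if (err)
            return err;
    }

    err = i915_gem_object_pin_pages(obj);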

On the region discrepancy between GEM and TTM there is a short DOC: section
in i915_gem_ttm.c.

/Thomas


>>> Thoughts?
>>>
>>> --Jason
>>>
>>> P.S.  I'm going to go ahead and send another version with your other
>>> comments addressed.  We can keep this discussion going here for now.
Daniel Vetter Aug. 4, 2021, 2:35 p.m. UTC | #14
On Wed, Aug 4, 2021 at 10:00 AM Thomas Hellström
<thomas.hellstrom@linux.intel.com> wrote:
>
> [snip: quoted thread trimmed; unchanged from the messages above]
> If we blindly just look at obj->mm.region in get_pages(), then an
> object with allowable placements in lmem and smem that initially gets
> placed in lmem and is then evicted to smem will never migrate back to
> lmem unless there is an explicit i915_gem_object_migrate(). But again,
> that's perhaps what we want? I guess we need to more clearly define the
> migration policies; for example, should we attempt to migrate evicted
> buffers back to lmem on each execbuf where they are referenced, even if
> they haven't lost their pages?

Looking at amdgpu, things are indeed complicated:
- mmap adds some hints that cpu access is preferred (iirc at least) so
that the unmappable vram problems aren't too awful
- execbuf adds vram to the non-evict placement list whenever that
makes sense (i.e. preferred place and no inferred hint like mmap
access countering that)
- for eviction there's a ratelimit, to make sure we're not thrashing
terribly and spending all the gpu time moving buffers around with the
copy engine

Maybe another interim strategy would be to only evict non-busy
buffers; not sure whether TTM supports that already. We definitely don't
want to unconditionally force all buffers into lmem on every execbuf.
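
For reference, a hedged sketch of the "skip busy buffers" idea using TTM's
eviction_valuable() hook; the dma_resv helper call and its exact signature
are assumptions for illustration, not a tested change:

    /* Editorial sketch: refuse to evict objects the GPU is still using. */
    static bool i915_ttm_eviction_valuable(struct ttm_buffer_object *bo,
                                           const struct ttm_place *place)
    {
        /* Still-busy object (unsignaled fences): keep it resident. */
        if (!dma_resv_test_signaled(bo->base.resv, true))
            return false;

        return ttm_bo_eviction_valuable(bo, place);
    }
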
-Daniel


> On the region discrepancy between GEM and TTM there is a short DOC: section
> in i915_gem_ttm.c.
>
> /Thomas
>
>
> >>> Thoughts?
> >>>
> >>> --Jason
> >>>
> >>> P.S.  I'm going to go ahead and send another version with your other
> >>> comments addressed.  We can keep this discussion going here for now.
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
index d30f274c329c7..5985e994d56cf 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
@@ -150,8 +150,7 @@  i915_ttm_placement_from_obj(const struct drm_i915_gem_object *obj,
 	unsigned int i;
 
 	placement->num_placement = 1;
-	i915_ttm_place_from_region(num_allowed ? obj->mm.placements[0] :
-				   obj->mm.region, requested, flags);
+	i915_ttm_place_from_region(obj->mm.region, requested, flags);
 
 	/* Cache this on object? */
 	placement->num_busy_placement = num_allowed;