| Message ID | 20210917175328.694429-1-zackr@vmware.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | drm/ttm: Don't delete the system manager before the delayed delete |
On 2021-09-17 1:53 p.m., Zack Rusin wrote:
> On some hardware, in particular in virtualized environments, the
> system memory can be shared with the "hardware". In those cases
> the BOs allocated through the ttm system manager might be
> busy during ttm_bo_put, which results in them being scheduled
> for a delayed deletion.
>
> The problem is that the ttm system manager is disabled
> before the final delayed deletion is run in ttm_device_fini.
> This results in crashes during freeing of the BO resources
> because they try to remove themselves from a no longer
> existent ttm_resource_manager (e.g. in IGT's core_hotunplug
> on vmwgfx).
>
> In general, reloading any driver that could share system mem
> resources with the "hardware" could hit this, because nothing
> prevents the system mem resources from being scheduled
> for delayed deletion (apart from them probably not being busy
> anywhere outside of virtualized environments).
>
> Signed-off-by: Zack Rusin <zackr@vmware.com>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Huang Rui <ray.huang@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: dri-devel@lists.freedesktop.org
> ---
>  drivers/gpu/drm/ttm/ttm_device.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 9eb8f54b66fc..4ef19cafc755 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -225,10 +225,6 @@ void ttm_device_fini(struct ttm_device *bdev)
>  	struct ttm_resource_manager *man;
>  	unsigned i;
>
> -	man = ttm_manager_type(bdev, TTM_PL_SYSTEM);
> -	ttm_resource_manager_set_used(man, false);
> -	ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL);
> -
>  	mutex_lock(&ttm_global_mutex);
>  	list_del(&bdev->device_list);
>  	mutex_unlock(&ttm_global_mutex);
> @@ -238,6 +234,10 @@ void ttm_device_fini(struct ttm_device *bdev)
>  	if (ttm_bo_delayed_delete(bdev, true))
>  		pr_debug("Delayed destroy list was clean\n");
>
> +	man = ttm_manager_type(bdev, TTM_PL_SYSTEM);
> +	ttm_resource_manager_set_used(man, false);
> +	ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL);
> +

Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Andrey

>  	spin_lock(&bdev->lru_lock);
>  	for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i)
>  		if (list_empty(&man->lru[0]))
On 17.09.21 at 19:53, Zack Rusin wrote:
> On some hardware, in particular in virtualized environments, the
> system memory can be shared with the "hardware". In those cases
> the BOs allocated through the ttm system manager might be
> busy during ttm_bo_put, which results in them being scheduled
> for a delayed deletion.

While the patch itself is probably fine, the reasoning here is a clear NAK.

Buffers in the system domain are not GPU accessible by definition, even in a shared environment, and so *must* be idle. Otherwise you break quite a number of assumptions in the code.

Regards,
Christian.

>
> The problem is that the ttm system manager is disabled
> before the final delayed deletion is run in ttm_device_fini.
> This results in crashes during freeing of the BO resources
> because they try to remove themselves from a no longer
> existent ttm_resource_manager (e.g. in IGT's core_hotunplug
> on vmwgfx).
>
> In general, reloading any driver that could share system mem
> resources with the "hardware" could hit this, because nothing
> prevents the system mem resources from being scheduled
> for delayed deletion (apart from them probably not being busy
> anywhere outside of virtualized environments).
>
> Signed-off-by: Zack Rusin <zackr@vmware.com>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Huang Rui <ray.huang@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: dri-devel@lists.freedesktop.org
> ---
>  drivers/gpu/drm/ttm/ttm_device.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 9eb8f54b66fc..4ef19cafc755 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -225,10 +225,6 @@ void ttm_device_fini(struct ttm_device *bdev)
>  	struct ttm_resource_manager *man;
>  	unsigned i;
>
> -	man = ttm_manager_type(bdev, TTM_PL_SYSTEM);
> -	ttm_resource_manager_set_used(man, false);
> -	ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL);
> -
>  	mutex_lock(&ttm_global_mutex);
>  	list_del(&bdev->device_list);
>  	mutex_unlock(&ttm_global_mutex);
> @@ -238,6 +234,10 @@ void ttm_device_fini(struct ttm_device *bdev)
>  	if (ttm_bo_delayed_delete(bdev, true))
>  		pr_debug("Delayed destroy list was clean\n");
>
> +	man = ttm_manager_type(bdev, TTM_PL_SYSTEM);
> +	ttm_resource_manager_set_used(man, false);
> +	ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL);
> +
>  	spin_lock(&bdev->lru_lock);
>  	for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i)
>  		if (list_empty(&man->lru[0]))
> On Sep 20, 2021, at 02:30, Christian König <christian.koenig@amd.com> wrote:
>
> On 17.09.21 at 19:53, Zack Rusin wrote:
>> On some hardware, in particular in virtualized environments, the
>> system memory can be shared with the "hardware". In those cases
>> the BOs allocated through the ttm system manager might be
>> busy during ttm_bo_put, which results in them being scheduled
>> for a delayed deletion.
>
> While the patch itself is probably fine, the reasoning here is a clear NAK.
>
> Buffers in the system domain are not GPU accessible by definition, even in a shared environment, and so *must* be idle.

I'm assuming that means they are never allowed to be fenced, then, yes?

> Otherwise you break quite a number of assumptions in the code.

Are there more assumptions like that, or do you mean there are more places that depend on the assumption that system domain BOs are always idle? If there are more assumptions like that in TTM, that would be incredibly valuable to know. I haven't been paying much attention to the kernel code in years, and coming back now and looking at a few years old vmwgfx code, it's almost impossible to tell the difference between "this assumption breaks the driver" and "this driver breaks this assumption".

z
On 9/20/21 10:59 AM, Zack Rusin wrote:
>> On Sep 20, 2021, at 02:30, Christian König <christian.koenig@amd.com> wrote:
>>
>> On 17.09.21 at 19:53, Zack Rusin wrote:
>>> On some hardware, in particular in virtualized environments, the
>>> system memory can be shared with the "hardware". In those cases
>>> the BOs allocated through the ttm system manager might be
>>> busy during ttm_bo_put, which results in them being scheduled
>>> for a delayed deletion.
>>
>> While the patch itself is probably fine, the reasoning here is a clear NAK.
>>
>> Buffers in the system domain are not GPU accessible by definition, even in a shared environment, and so *must* be idle.
>
> I'm assuming that means they are never allowed to be fenced, then, yes?

Any thoughts on this? I'd love a confirmation, because it would mean I need to go and rewrite the vmwgfx_mob.c bits where we use TTM_PL_SYSTEM memory (through vmw_bo_create_and_populate) for a page table which is read by the host, and those BOs need to be fenced to prevent destruction of the page tables while the memory they point to is still in use. So if those were never allowed to be fenced in the first place, we probably need to add a new memory type to hold those page tables.

z
On 23.09.21 at 15:53, Zack Rusin wrote:
> On 9/20/21 10:59 AM, Zack Rusin wrote:
>>> On Sep 20, 2021, at 02:30, Christian König <christian.koenig@amd.com> wrote:
>>>
>>> On 17.09.21 at 19:53, Zack Rusin wrote:
>>>> On some hardware, in particular in virtualized environments, the
>>>> system memory can be shared with the "hardware". In those cases
>>>> the BOs allocated through the ttm system manager might be
>>>> busy during ttm_bo_put, which results in them being scheduled
>>>> for a delayed deletion.
>>>
>>> While the patch itself is probably fine, the reasoning here is a clear NAK.
>>>
>>> Buffers in the system domain are not GPU accessible by definition, even in a shared environment, and so *must* be idle.
>>
>> I'm assuming that means they are never allowed to be fenced, then, yes?
>
> Any thoughts on this? I'd love a confirmation, because it would mean I
> need to go and rewrite the vmwgfx_mob.c bits where we use
> TTM_PL_SYSTEM memory (through vmw_bo_create_and_populate) for a page
> table which is read by the host, and those BOs need to be fenced to
> prevent destruction of the page tables while the memory they point to
> is still in use. So if those were never allowed to be fenced in the
> first place, we probably need to add a new memory type to hold those
> page tables.

Yeah, as far as I can see that is pretty much illegal from a design point of view.

We could probably change that rule on the TTM side, but I think that keeping the design as it is and adding a placement in vmwgfx sounds like the cleaner approach.

Christian.

>
> z
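For illustration only, a driver-private placement along the lines suggested above could look roughly like the sketch below. The placement index VMW_PL_SYSTEM, the 256 MiB size, and the choice of TTM's generic range manager are assumptions made for this example; they are not existing vmwgfx code and not part of the patch under discussion.

```c
/*
 * Sketch (assumptions noted above): register a driver-private TTM
 * placement for host-readable page tables, so such BOs never live in
 * TTM_PL_SYSTEM and can carry fences without breaking the "system
 * domain is always idle" rule.
 */
#include <linux/sizes.h>
#include <drm/ttm/ttm_device.h>
#include <drm/ttm/ttm_placement.h>
#include <drm/ttm/ttm_range_manager.h>

/* Assumed-free driver-private placement slot. */
#define VMW_PL_SYSTEM	(TTM_PL_PRIV + 2)

static int vmw_sys_man_init(struct ttm_device *bdev)
{
	/*
	 * Back the placement with the generic range manager;
	 * use_tt = true because the backing store is ordinary system
	 * pages read by the host, not device-local memory.
	 */
	return ttm_range_man_init(bdev, VMW_PL_SYSTEM, true,
				  SZ_256M >> PAGE_SHIFT);
}

static int vmw_sys_man_fini(struct ttm_device *bdev)
{
	return ttm_range_man_fini(bdev, VMW_PL_SYSTEM);
}
```

The page-table BOs would then name VMW_PL_SYSTEM instead of TTM_PL_SYSTEM in their placement list, which keeps TTM's assumption about the system domain intact while still allowing the driver to fence those buffers.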
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 9eb8f54b66fc..4ef19cafc755 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -225,10 +225,6 @@ void ttm_device_fini(struct ttm_device *bdev)
 	struct ttm_resource_manager *man;
 	unsigned i;
 
-	man = ttm_manager_type(bdev, TTM_PL_SYSTEM);
-	ttm_resource_manager_set_used(man, false);
-	ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL);
-
 	mutex_lock(&ttm_global_mutex);
 	list_del(&bdev->device_list);
 	mutex_unlock(&ttm_global_mutex);
@@ -238,6 +234,10 @@ void ttm_device_fini(struct ttm_device *bdev)
 	if (ttm_bo_delayed_delete(bdev, true))
 		pr_debug("Delayed destroy list was clean\n");
 
+	man = ttm_manager_type(bdev, TTM_PL_SYSTEM);
+	ttm_resource_manager_set_used(man, false);
+	ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL);
+
 	spin_lock(&bdev->lru_lock);
 	for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i)
 		if (list_empty(&man->lru[0]))
On some hardware, in particular in virtualized environments, the
system memory can be shared with the "hardware". In those cases
the BOs allocated through the ttm system manager might be
busy during ttm_bo_put, which results in them being scheduled
for a delayed deletion.

The problem is that the ttm system manager is disabled
before the final delayed deletion is run in ttm_device_fini.
This results in crashes during freeing of the BO resources
because they try to remove themselves from a no longer
existent ttm_resource_manager (e.g. in IGT's core_hotunplug
on vmwgfx).

In general, reloading any driver that could share system mem
resources with the "hardware" could hit this, because nothing
prevents the system mem resources from being scheduled
for delayed deletion (apart from them probably not being busy
anywhere outside of virtualized environments).

Signed-off-by: Zack Rusin <zackr@vmware.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: dri-devel@lists.freedesktop.org
---
 drivers/gpu/drm/ttm/ttm_device.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
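For context, the crash described above follows from the ordering sketched below. The helper function and the fence attachment are illustrative assumptions, not actual vmwgfx code; the TTM and dma-resv calls are shown only to make the sequence concrete.

```c
/*
 * Sketch of the failure path (assumptions noted above): a BO placed in
 * TTM_PL_SYSTEM is made busy by an unsignaled fence, so its final put
 * lands on the delayed-destroy list, which must still find the system
 * manager when ttm_device_fini() flushes it.
 */
#include <linux/dma-resv.h>
#include <drm/ttm/ttm_bo_api.h>
#include <drm/ttm/ttm_device.h>

static void example_shutdown(struct ttm_device *bdev,
			     struct ttm_buffer_object *bo,
			     struct dma_fence *fence)
{
	/* The system-placed BO becomes busy because of this fence. */
	dma_resv_lock(bo->base.resv, NULL);
	dma_resv_add_excl_fence(bo->base.resv, fence);
	dma_resv_unlock(bo->base.resv);

	/*
	 * The fence is not signaled yet, so the final put cannot free
	 * the BO immediately and queues it for delayed deletion.
	 */
	ttm_bo_put(bo);

	/*
	 * Before the patch, ttm_device_fini() cleared the TTM_PL_SYSTEM
	 * manager before running the delayed deletion, so freeing the
	 * BO's resource hit a manager that had already been set to NULL.
	 * With the patch, the manager is disabled only afterwards.
	 */
	ttm_device_fini(bdev);
}
```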