Message ID | 20230621162652.10875-2-Yunxiang.Li@amd.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | drm: fix soft lockup when saturating vram | expand |
Am 21.06.23 um 18:23 schrieb Yunxiang Li: > When amdgpu_bo_fence is converted to dma_resv_add_fence, the reserve was > removed in that process, so putting it back. The slots for this are reserved in amdgpu_vm_get_pd_bo(): /* Two for VM updates, one for TTM and one for the CS job */ entry->tv.num_shared = 4; The problem here is rather that we didn't took the gang submit into account. See my patch earlier this week I've CCed you on. . > > Fixes: 4247084057cf ("drm/amdgpu: use DMA_RESV_USAGE_BOOKKEEP v2") > Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c > index 349416e176a1..f590b97853d9 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c > @@ -120,6 +120,7 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p, > struct amdgpu_ib *ib = p->job->ibs; > struct amdgpu_ring *ring; > struct dma_fence *f; > + int r; > > ring = container_of(p->vm->delayed.rq->sched, struct amdgpu_ring, > sched); > @@ -135,6 +136,9 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p, > swap(p->vm->last_unlocked, tmp); > dma_fence_put(tmp); > } else { > + r = dma_resv_reserve_fences(p->vm->root.bo->tbo.base.resv, 1); > + if (r) > + return r; That is simply illegal and would potentially lead to memory corruption. Regards, Christian. > dma_resv_add_fence(p->vm->root.bo->tbo.base.resv, f, > DMA_RESV_USAGE_BOOKKEEP); > }
[Public] If I take out this reserve and the change that makes dma_resv_reserve_fences nestable, I get a BUG in the add_fence, so we might be missing a reserve somewhere else then. Teddy -----Original Message----- From: Koenig, Christian <Christian.Koenig@amd.com> Sent: Thursday, June 22, 2023 3:39 To: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>; dri-devel@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com> Subject: Re: [PATCH 1/2] drm/amdgpu: fix missing fence reserve in amdgpu_vm_sdma_commit Am 21.06.23 um 18:23 schrieb Yunxiang Li: > When amdgpu_bo_fence is converted to dma_resv_add_fence, the reserve > was removed in that process, so putting it back. The slots for this are reserved in amdgpu_vm_get_pd_bo(): /* Two for VM updates, one for TTM and one for the CS job */ entry->tv.num_shared = 4; The problem here is rather that we didn't took the gang submit into account. See my patch earlier this week I've CCed you on. . > > Fixes: 4247084057cf ("drm/amdgpu: use DMA_RESV_USAGE_BOOKKEEP v2") > Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c > index 349416e176a1..f590b97853d9 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c > @@ -120,6 +120,7 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p, > struct amdgpu_ib *ib = p->job->ibs; > struct amdgpu_ring *ring; > struct dma_fence *f; > + int r; > > ring = container_of(p->vm->delayed.rq->sched, struct amdgpu_ring, > sched); > @@ -135,6 +136,9 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p, > swap(p->vm->last_unlocked, tmp); > dma_fence_put(tmp); > } else { > + r = dma_resv_reserve_fences(p->vm->root.bo->tbo.base.resv, 1); > + if (r) > + return r; That is simply illegal and would potentially lead to memory corruption. Regards, Christian. > dma_resv_add_fence(p->vm->root.bo->tbo.base.resv, f, > DMA_RESV_USAGE_BOOKKEEP); > }
[Public] Hmm, seems all the stack trace I get now are from amdgpu_gem_va_ioctl, so I suppose the issue is only on that path now, the gang submit patch fixed it for amdgpu_cs_ioctl. Teddy -----Original Message----- From: Li, Yunxiang (Teddy) Sent: Thursday, June 22, 2023 10:12 To: Koenig, Christian <Christian.Koenig@amd.com>; dri-devel@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com> Subject: RE: [PATCH 1/2] drm/amdgpu: fix missing fence reserve in amdgpu_vm_sdma_commit If I take out this reserve and the change that makes dma_resv_reserve_fences nestable, I get a BUG in the add_fence, so we might be missing a reserve somewhere else then. Teddy -----Original Message----- From: Koenig, Christian <Christian.Koenig@amd.com> Sent: Thursday, June 22, 2023 3:39 To: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>; dri-devel@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com> Subject: Re: [PATCH 1/2] drm/amdgpu: fix missing fence reserve in amdgpu_vm_sdma_commit Am 21.06.23 um 18:23 schrieb Yunxiang Li: > When amdgpu_bo_fence is converted to dma_resv_add_fence, the reserve > was removed in that process, so putting it back. The slots for this are reserved in amdgpu_vm_get_pd_bo(): /* Two for VM updates, one for TTM and one for the CS job */ entry->tv.num_shared = 4; The problem here is rather that we didn't took the gang submit into account. See my patch earlier this week I've CCed you on. . > > Fixes: 4247084057cf ("drm/amdgpu: use DMA_RESV_USAGE_BOOKKEEP v2") > Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c > index 349416e176a1..f590b97853d9 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c > @@ -120,6 +120,7 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p, > struct amdgpu_ib *ib = p->job->ibs; > struct amdgpu_ring *ring; > struct dma_fence *f; > + int r; > > ring = container_of(p->vm->delayed.rq->sched, struct amdgpu_ring, > sched); > @@ -135,6 +136,9 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p, > swap(p->vm->last_unlocked, tmp); > dma_fence_put(tmp); > } else { > + r = dma_resv_reserve_fences(p->vm->root.bo->tbo.base.resv, 1); > + if (r) > + return r; That is simply illegal and would potentially lead to memory corruption. Regards, Christian. > dma_resv_add_fence(p->vm->root.bo->tbo.base.resv, f, > DMA_RESV_USAGE_BOOKKEEP); > }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c index 349416e176a1..f590b97853d9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c @@ -120,6 +120,7 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p, struct amdgpu_ib *ib = p->job->ibs; struct amdgpu_ring *ring; struct dma_fence *f; + int r; ring = container_of(p->vm->delayed.rq->sched, struct amdgpu_ring, sched); @@ -135,6 +136,9 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p, swap(p->vm->last_unlocked, tmp); dma_fence_put(tmp); } else { + r = dma_resv_reserve_fences(p->vm->root.bo->tbo.base.resv, 1); + if (r) + return r; dma_resv_add_fence(p->vm->root.bo->tbo.base.resv, f, DMA_RESV_USAGE_BOOKKEEP); }
When amdgpu_bo_fence is converted to dma_resv_add_fence, the reserve was removed in that process, so putting it back. Fixes: 4247084057cf ("drm/amdgpu: use DMA_RESV_USAGE_BOOKKEEP v2") Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 4 ++++ 1 file changed, 4 insertions(+)