Message ID: 20230108210445.3948344-2-dmitry.osipenko@collabora.com (mailing list archive)
State: New, archived
Series: Add generic memory shrinker to VirtIO-GPU and Panfrost DRM drivers
Hi

On 08.01.23 at 22:04, Dmitry Osipenko wrote:
> Consider this scenario:
>
> 1. APP1 continuously creates lots of small GEMs
> 2. APP2 triggers `drop_caches`
> 3. Shrinker starts to evict APP1 GEMs, while APP1 produces new purgeable
>    GEMs
> 4. msm_gem_shrinker_scan() returns non-zero number of freed pages
>    and causes shrinker to try shrink more
> 5. msm_gem_shrinker_scan() returns non-zero number of freed pages again,
>    goto 4
> 6. The APP2 is blocked in `drop_caches` until APP1 stops producing
>    purgeable GEMs
>
> To prevent this blocking scenario, check number of remaining pages
> that GPU shrinker couldn't release due to a GEM locking contention
> or shrinking rejection. If there are no remaining pages left to shrink,
> then there is no need to free up more pages and shrinker may break out
> from the loop.
>
> This problem was found during shrinker/madvise IOCTL testing of
> virtio-gpu driver. The MSM driver is affected in the same way.
>
> Reviewed-by: Rob Clark <robdclark@gmail.com>
> Fixes: b352ba54a820 ("drm/msm/gem: Convert to using drm_gem_lru")
> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> ---
>  drivers/gpu/drm/drm_gem.c              | 9 +++++++--
>  drivers/gpu/drm/msm/msm_gem_shrinker.c | 8 ++++++--
>  include/drm/drm_gem.h                  | 4 +++-
>  3 files changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> index 59a0bb5ebd85..c6bca5ac6e0f 100644
> --- a/drivers/gpu/drm/drm_gem.c
> +++ b/drivers/gpu/drm/drm_gem.c
> @@ -1388,10 +1388,13 @@ EXPORT_SYMBOL(drm_gem_lru_move_tail);
>   *
>   * @lru: The LRU to scan
>   * @nr_to_scan: The number of pages to try to reclaim
> + * @remaining: The number of pages left to reclaim
>   * @shrink: Callback to try to shrink/reclaim the object.
>   */
>  unsigned long
> -drm_gem_lru_scan(struct drm_gem_lru *lru, unsigned nr_to_scan,
> +drm_gem_lru_scan(struct drm_gem_lru *lru,
> +		 unsigned int nr_to_scan,
> +		 unsigned long *remaining,
>  		 bool (*shrink)(struct drm_gem_object *obj))
>  {
>  	struct drm_gem_lru still_in_lru;
> @@ -1430,8 +1433,10 @@ drm_gem_lru_scan(struct drm_gem_lru *lru, unsigned nr_to_scan,
>  		 * hit shrinker in response to trying to get backing pages
>  		 * for this obj (ie. while it's lock is already held)
>  		 */
> -		if (!dma_resv_trylock(obj->resv))
> +		if (!dma_resv_trylock(obj->resv)) {
> +			*remaining += obj->size >> PAGE_SHIFT;
>  			goto tail;
> +		}
>
>  		if (shrink(obj)) {
>  			freed += obj->size >> PAGE_SHIFT;
> diff --git a/drivers/gpu/drm/msm/msm_gem_shrinker.c b/drivers/gpu/drm/msm/msm_gem_shrinker.c
> index 051bdbc093cf..b7c1242014ec 100644
> --- a/drivers/gpu/drm/msm/msm_gem_shrinker.c
> +++ b/drivers/gpu/drm/msm/msm_gem_shrinker.c
> @@ -116,12 +116,14 @@ msm_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
>  	};
>  	long nr = sc->nr_to_scan;
>  	unsigned long freed = 0;
> +	unsigned long remaining = 0;
>
>  	for (unsigned i = 0; (nr > 0) && (i < ARRAY_SIZE(stages)); i++) {
>  		if (!stages[i].cond)
>  			continue;
>  		stages[i].freed =
> -			drm_gem_lru_scan(stages[i].lru, nr, stages[i].shrink);
> +			drm_gem_lru_scan(stages[i].lru, nr, &remaining,

This function relies on remaining being pre-initialized. That's not
obvious and error-prone. At least, pass in something like
&stages[i].remaining that is then initialized internally by
drm_gem_lru_scan() to zero. And similar to freed, sum up the individual
stages' remaining here.

TBH I somehow don't like the overall design of how all these functions
interact with each other. But I also can't really point to the actual
problem. So it's best to take what you have here; maybe with the change
I proposed.

Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>

Best regards
Thomas

> +					 stages[i].shrink);
>  		nr -= stages[i].freed;
>  		freed += stages[i].freed;
>  	}
> @@ -132,7 +134,7 @@ msm_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
>  				     stages[3].freed);
>  	}
>
> -	return (freed > 0) ? freed : SHRINK_STOP;
> +	return (freed > 0 && remaining > 0) ? freed : SHRINK_STOP;
>  }
>
>  #ifdef CONFIG_DEBUG_FS
> @@ -182,10 +184,12 @@ msm_gem_shrinker_vmap(struct notifier_block *nb, unsigned long event, void *ptr)
>  		NULL,
>  	};
>  	unsigned idx, unmapped = 0;
> +	unsigned long remaining = 0;
>
>  	for (idx = 0; lrus[idx] && unmapped < vmap_shrink_limit; idx++) {
>  		unmapped += drm_gem_lru_scan(lrus[idx],
>  					     vmap_shrink_limit - unmapped,
> +					     &remaining,
>  					     vmap_shrink);
>  	}
>
> diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
> index 772a4adf5287..f1f00fc2dba6 100644
> --- a/include/drm/drm_gem.h
> +++ b/include/drm/drm_gem.h
> @@ -476,7 +476,9 @@ int drm_gem_dumb_map_offset(struct drm_file *file, struct drm_device *dev,
>  void drm_gem_lru_init(struct drm_gem_lru *lru, struct mutex *lock);
>  void drm_gem_lru_remove(struct drm_gem_object *obj);
>  void drm_gem_lru_move_tail(struct drm_gem_lru *lru, struct drm_gem_object *obj);
> -unsigned long drm_gem_lru_scan(struct drm_gem_lru *lru, unsigned nr_to_scan,
> +unsigned long drm_gem_lru_scan(struct drm_gem_lru *lru,
> +			       unsigned int nr_to_scan,
> +			       unsigned long *remaining,
>  			       bool (*shrink)(struct drm_gem_object *obj));
>
>  #endif /* __DRM_GEM_H__ */
On 2/17/23 15:02, Thomas Zimmermann wrote:
> Hi
>
> On 08.01.23 at 22:04, Dmitry Osipenko wrote:
>> Consider this scenario:
>>
>> 1. APP1 continuously creates lots of small GEMs
>> 2. APP2 triggers `drop_caches`
>> 3. Shrinker starts to evict APP1 GEMs, while APP1 produces new purgeable
>>    GEMs
>> 4. msm_gem_shrinker_scan() returns non-zero number of freed pages
>>    and causes shrinker to try shrink more
>> 5. msm_gem_shrinker_scan() returns non-zero number of freed pages again,
>>    goto 4
>> 6. The APP2 is blocked in `drop_caches` until APP1 stops producing
>>    purgeable GEMs
>>
>> To prevent this blocking scenario, check number of remaining pages
>> that GPU shrinker couldn't release due to a GEM locking contention
>> or shrinking rejection. If there are no remaining pages left to shrink,
>> then there is no need to free up more pages and shrinker may break out
>> from the loop.
>>
>> This problem was found during shrinker/madvise IOCTL testing of
>> virtio-gpu driver. The MSM driver is affected in the same way.
>>
>> Reviewed-by: Rob Clark <robdclark@gmail.com>
>> Fixes: b352ba54a820 ("drm/msm/gem: Convert to using drm_gem_lru")
>> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
>> ---
>>  drivers/gpu/drm/drm_gem.c              | 9 +++++++--
>>  drivers/gpu/drm/msm/msm_gem_shrinker.c | 8 ++++++--
>>  include/drm/drm_gem.h                  | 4 +++-
>>  3 files changed, 16 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
>> index 59a0bb5ebd85..c6bca5ac6e0f 100644
>> --- a/drivers/gpu/drm/drm_gem.c
>> +++ b/drivers/gpu/drm/drm_gem.c
>> @@ -1388,10 +1388,13 @@ EXPORT_SYMBOL(drm_gem_lru_move_tail);
>>   *
>>   * @lru: The LRU to scan
>>   * @nr_to_scan: The number of pages to try to reclaim
>> + * @remaining: The number of pages left to reclaim
>>   * @shrink: Callback to try to shrink/reclaim the object.
>>   */
>>  unsigned long
>> -drm_gem_lru_scan(struct drm_gem_lru *lru, unsigned nr_to_scan,
>> +drm_gem_lru_scan(struct drm_gem_lru *lru,
>> +		 unsigned int nr_to_scan,
>> +		 unsigned long *remaining,
>>  		 bool (*shrink)(struct drm_gem_object *obj))
>>  {
>>  	struct drm_gem_lru still_in_lru;
>> @@ -1430,8 +1433,10 @@ drm_gem_lru_scan(struct drm_gem_lru *lru, unsigned nr_to_scan,
>>  		 * hit shrinker in response to trying to get backing pages
>>  		 * for this obj (ie. while it's lock is already held)
>>  		 */
>> -		if (!dma_resv_trylock(obj->resv))
>> +		if (!dma_resv_trylock(obj->resv)) {
>> +			*remaining += obj->size >> PAGE_SHIFT;
>>  			goto tail;
>> +		}
>>  		if (shrink(obj)) {
>>  			freed += obj->size >> PAGE_SHIFT;
>> diff --git a/drivers/gpu/drm/msm/msm_gem_shrinker.c b/drivers/gpu/drm/msm/msm_gem_shrinker.c
>> index 051bdbc093cf..b7c1242014ec 100644
>> --- a/drivers/gpu/drm/msm/msm_gem_shrinker.c
>> +++ b/drivers/gpu/drm/msm/msm_gem_shrinker.c
>> @@ -116,12 +116,14 @@ msm_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
>>  	};
>>  	long nr = sc->nr_to_scan;
>>  	unsigned long freed = 0;
>> +	unsigned long remaining = 0;
>>  	for (unsigned i = 0; (nr > 0) && (i < ARRAY_SIZE(stages)); i++) {
>>  		if (!stages[i].cond)
>>  			continue;
>>  		stages[i].freed =
>> -			drm_gem_lru_scan(stages[i].lru, nr, stages[i].shrink);
>> +			drm_gem_lru_scan(stages[i].lru, nr, &remaining,
>
> This function relies on remaining being pre-initialized. That's not
> obvious and error-prone. At least, pass in something like
> &stages[i].remaining that is then initialized internally by
> drm_gem_lru_scan() to zero. And similar to freed, sum up the individual
> stages' remaining here.
>
> TBH I somehow don't like the overall design of how all these functions
> interact with each other. But I also can't really point to the actual
> problem. So it's best to take what you have here; maybe with the change
> I proposed.
>
> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>

I had to keep `remaining` pre-initialized by the caller because moving
the initialization into drm_gem_lru_scan() was hurting the rest of the
code. I updated the MSM patch to use &stages[i].remaining, though.