Message ID | 20230417135420.1836741-1-senozhatsky@chromium.org (mailing list archive)
State      | New
Series     | [PATCHv2] zsmalloc: allow only one active pool compaction context
On Mon, Apr 17, 2023 at 6:54 AM Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> zsmalloc pool can be compacted concurrently by many contexts,
> e.g.
>
>  cc1 handle_mm_fault()
>       do_anonymous_page()
>        __alloc_pages_slowpath()
>         try_to_free_pages()
>          do_try_to_free_pages()
>           lru_gen_shrink_node()
>            shrink_slab()
>             do_shrink_slab()
>              zs_shrinker_scan()
>               zs_compact()
>
> This creates unnecessary contention as all those processes
> compete for access to the same classes. A single compaction
> process is enough. Moreover, the contention created by
> multiple compaction processes impacts other zsmalloc functions,
> e.g. zs_malloc(), since zsmalloc uses the "global" pool->lock to
> synchronize access to the pool.
>
> Introduce pool compaction mutex and permit only one compaction
> context at a time. This reduces overall pool->lock contention.
>
> /proc/lock-stat after make -j$((`nproc`+1)) linux kernel for
> &pool->lock#3:
>
>                  Base          Patched
> ------------------------------------------
> con-bounces      2035730       1540066
> contentions      2343871       1774348
> waittime-min     0.10          0.10
> waittime-max     4004216.24    2745.22
> waittime-total   101334168.29  67865414.91
> waittime-avg     43.23         38.25
> acq-bounces      2895765       2186745
> acquisitions     6247686       5136943
> holdtime-min     0.07          0.07
> holdtime-max     2605507.97    482439.16
> holdtime-total   9998599.59    5107151.01
> holdtime-avg     1.60          0.99

The numbers seem to be better when using an atomic vs. a mutex.
Is this just noise or a significant difference? (I am not familiar
with lock-stat).

> Test run time:
> Base
> 2775.15user 1709.13system 2:13.82elapsed 3350%CPU
>
> Patched
> 2608.25user 1439.03system 2:03.63elapsed 3273%CPU
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>

FWIW,
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>

> ---
>  mm/zsmalloc.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index cc81dfba05a0..dfec2fc6a30f 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -264,6 +264,7 @@ struct zs_pool {
>  	struct work_struct free_work;
>  #endif
>  	spinlock_t lock;
> +	atomic_t compaction_in_progress;
>  };
>
>  struct zspage {
> @@ -2274,6 +2275,9 @@ unsigned long zs_compact(struct zs_pool *pool)
>  	struct size_class *class;
>  	unsigned long pages_freed = 0;
>
> +	if (atomic_xchg(&pool->compaction_in_progress, 1))
> +		return 0;
> +
>  	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
>  		class = pool->size_class[i];
>  		if (class->index != i)
> @@ -2281,6 +2285,7 @@ unsigned long zs_compact(struct zs_pool *pool)
>  		pages_freed += __zs_compact(pool, class);
>  	}
>  	atomic_long_add(pages_freed, &pool->stats.pages_compacted);
> +	atomic_set(&pool->compaction_in_progress, 0);
>
>  	return pages_freed;
>  }
> @@ -2388,6 +2393,7 @@ struct zs_pool *zs_create_pool(const char *name)
>
>  	init_deferred_free(pool);
>  	spin_lock_init(&pool->lock);
> +	atomic_set(&pool->compaction_in_progress, 0);
>
>  	pool->name = kstrdup(name, GFP_KERNEL);
>  	if (!pool->name)
> --
> 2.40.0.634.g4ca3ef3211-goog
>
On (23/04/17 11:32), Yosry Ahmed wrote:
> > /proc/lock-stat after make -j$((`nproc`+1)) linux kernel for
> > &pool->lock#3:
> >
> >                  Base          Patched
> > ------------------------------------------
> > con-bounces      2035730       1540066
> > contentions      2343871       1774348
> > waittime-min     0.10          0.10
> > waittime-max     4004216.24    2745.22
> > waittime-total   101334168.29  67865414.91
> > waittime-avg     43.23         38.25
> > acq-bounces      2895765       2186745
> > acquisitions     6247686       5136943
> > holdtime-min     0.07          0.07
> > holdtime-max     2605507.97    482439.16
> > holdtime-total   9998599.59    5107151.01
> > holdtime-avg     1.60          0.99
>
> The numbers seem to be better when using an atomic vs. a mutex.
> Is this just noise or a significant difference? (I am not familiar
> with lock-stat).

Pretty sure that's just noise. The test is make -j72 on a system that
swaps out, so it's terribly noisy.
On Mon, 17 Apr 2023 22:54:20 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote:

> zsmalloc pool can be compacted concurrently by many contexts,
> e.g.
>
>  cc1 handle_mm_fault()
>       do_anonymous_page()
>        __alloc_pages_slowpath()
>         try_to_free_pages()
>          do_try_to_free_pages()
>           lru_gen_shrink_node()
>            shrink_slab()
>             do_shrink_slab()
>              zs_shrinker_scan()
>               zs_compact()
>
> This creates unnecessary contention as all those processes
> compete for access to the same classes. A single compaction
> process is enough. Moreover, the contention created by
> multiple compaction processes impacts other zsmalloc functions,
> e.g. zs_malloc(), since zsmalloc uses the "global" pool->lock to
> synchronize access to the pool.
>
> Introduce pool compaction mutex and permit only one compaction
> context at a time. This reduces overall pool->lock contention.

That isn't what the patch does! Perhaps an earlier version used a mutex?

> /proc/lock-stat after make -j$((`nproc`+1)) linux kernel for
> &pool->lock#3:
>
>                  Base          Patched
> ------------------------------------------
> con-bounces      2035730       1540066
> contentions      2343871       1774348
> waittime-min     0.10          0.10
> waittime-max     4004216.24    2745.22
> waittime-total   101334168.29  67865414.91
> waittime-avg     43.23         38.25
> acq-bounces      2895765       2186745
> acquisitions     6247686       5136943
> holdtime-min     0.07          0.07
> holdtime-max     2605507.97    482439.16
> holdtime-total   9998599.59    5107151.01
> holdtime-avg     1.60          0.99
>
> Test run time:
> Base
> 2775.15user 1709.13system 2:13.82elapsed 3350%CPU
>
> Patched
> 2608.25user 1439.03system 2:03.63elapsed 3273%CPU
>
> ...
>
> @@ -2274,6 +2275,9 @@ unsigned long zs_compact(struct zs_pool *pool)
>  	struct size_class *class;
>  	unsigned long pages_freed = 0;
>
> +	if (atomic_xchg(&pool->compaction_in_progress, 1))
> +		return 0;
> +

A code comment might be appropriate here.

Is the spin_is_contended() test in __zs_compact() still relevant?

And... single-threading the operation seems a pretty sad way of
addressing a contention issue. zs_compact() is fairly computationally
expensive - surely a large machine would like to be able to
concurrently run many instances of zs_compact()?
On Mon, Apr 17, 2023 at 5:41 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 17 Apr 2023 22:54:20 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> > zsmalloc pool can be compacted concurrently by many contexts,
> > e.g.
> >
> >  cc1 handle_mm_fault()
> >       do_anonymous_page()
> >        __alloc_pages_slowpath()
> >         try_to_free_pages()
> >          do_try_to_free_pages()
> >           lru_gen_shrink_node()
> >            shrink_slab()
> >             do_shrink_slab()
> >              zs_shrinker_scan()
> >               zs_compact()
> >
> > This creates unnecessary contention as all those processes
> > compete for access to the same classes. A single compaction
> > process is enough. Moreover, the contention created by
> > multiple compaction processes impacts other zsmalloc functions,
> > e.g. zs_malloc(), since zsmalloc uses the "global" pool->lock to
> > synchronize access to the pool.
> >
> > Introduce pool compaction mutex and permit only one compaction
> > context at a time. This reduces overall pool->lock contention.
>
> That isn't what the patch does! Perhaps an earlier version used a mutex?

Yup, that's from v1. Thanks for catching that.

> > /proc/lock-stat after make -j$((`nproc`+1)) linux kernel for
> > &pool->lock#3:
> >
> >                  Base          Patched
> > ------------------------------------------
> > con-bounces      2035730       1540066
> > contentions      2343871       1774348
> > waittime-min     0.10          0.10
> > waittime-max     4004216.24    2745.22
> > waittime-total   101334168.29  67865414.91
> > waittime-avg     43.23         38.25
> > acq-bounces      2895765       2186745
> > acquisitions     6247686       5136943
> > holdtime-min     0.07          0.07
> > holdtime-max     2605507.97    482439.16
> > holdtime-total   9998599.59    5107151.01
> > holdtime-avg     1.60          0.99
> >
> > Test run time:
> > Base
> > 2775.15user 1709.13system 2:13.82elapsed 3350%CPU
> >
> > Patched
> > 2608.25user 1439.03system 2:03.63elapsed 3273%CPU
> >
> > ...
> >
> > @@ -2274,6 +2275,9 @@ unsigned long zs_compact(struct zs_pool *pool)
> >  	struct size_class *class;
> >  	unsigned long pages_freed = 0;
> >
> > +	if (atomic_xchg(&pool->compaction_in_progress, 1))
> > +		return 0;
> > +
>
> A code comment might be appropriate here.
>
> Is the spin_is_contended() test in __zs_compact() still relevant?

It is, as pool->lock is used to synchronize a lot of operations, not
just compaction.

> And... single-threading the operation seems a pretty sad way of
> addressing a contention issue. zs_compact() is fairly computationally
> expensive - surely a large machine would like to be able to
> concurrently run many instances of zs_compact()?

Yeah, that was my initial thought as well: to attack this problem at a
larger scale and break down the pool->lock into class->locks. AFAICT,
zsmalloc pages/objects cannot change classes during their lifetime, so
this sounds reasonable initially.

The problem is that the pool->lock is used to synchronize ~all zsmalloc
operations: zs_map_object(), zs_malloc(), zs_free(),
async_free_zspage(), zs_compact(), zs_reclaim_page(),
zs_page_migrate().

For some of those, changing the pool->lock into a class->lock is fairly
straightforward, specifically the ones where we operate on a page/zspage
(e.g. zs_compact()). The problem is the operations where we operate on
an object (e.g. zs_map_object()). At that point, we only have a handle
that we need to use to find the underlying object. We have no idea what
class the object belongs to, and finding an object given a handle needs
synchronization, as the handle<->object mapping can be changed by
compaction or page migration.

We can store the class somewhere and pay an extra byte per compressed
object (which doesn't sound too bad) -- but even then we'd have trouble
finding the right place to stash it. Perhaps we can enforce some
alignment on the objects and use the lowermost 8 bits of the handle
pointer, which also means we don't need to pay any extra overhead, but
I am not sure if this is practical. We can also store the class at the
first byte of the object, and guarantee that through
migration/compaction the first byte always remains a valid class. We
can then read the class from the handle locklessly, acquire the class
lock, and then make sure the handle is still the same. These are all
ideas I am throwing in the air; I didn't think about them thoroughly or
try implementing them.

TLDR: I believe it is possible (in theory at least) to break down the
global pool->lock into class locks, which should benefit virtually ~all
concurrent zsmalloc operations, not just compaction. It is not easy
though.

As for this patch, I personally do not observe a lot of compaction in
our production environment, and allowing one thread to perform
compaction while others move on with their lives can be better than
having all of them continuously contending for the pool->lock, which
means more contention with ~all zsmalloc operations, not just
concurrent compactors. I can't say for sure that this is an
improvement, but I *believe* it is.

I will shut up now and leave it up to Sergey to decide how to move
forward, sorry for the very long response :)
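[Editor's note: a minimal userspace sketch of the "lowermost bits of the
handle" idea from the email above. This is purely illustrative, not
zsmalloc code; CLASS_BITS, handle_encode() and friends are hypothetical
names, and it assumes the object allocator guarantees at least 256-byte
alignment so the low 8 bits of an address are free to carry a size-class
index.]

/*
 * Hypothetical "class index in the low handle bits" encoding.
 * Invariant: objects are 256-byte aligned, so the low CLASS_BITS
 * bits of the address are always zero.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CLASS_BITS	8
#define CLASS_MASK	((1UL << CLASS_BITS) - 1)

static uintptr_t handle_encode(void *obj, unsigned int class_idx)
{
	uintptr_t h = (uintptr_t)obj;

	assert((h & CLASS_MASK) == 0);		/* alignment invariant */
	assert(class_idx <= CLASS_MASK);	/* index must fit */
	return h | class_idx;
}

static void *handle_object(uintptr_t handle)
{
	return (void *)(handle & ~CLASS_MASK);	/* strip class bits */
}

static unsigned int handle_class(uintptr_t handle)
{
	return (unsigned int)(handle & CLASS_MASK);
}

int main(void)
{
	/* aligned_alloc() size must be a multiple of the alignment */
	void *obj = aligned_alloc(256, 256);

	if (!obj)
		return 1;

	uintptr_t handle = handle_encode(obj, 42);

	printf("object %p, class %u\n", handle_object(handle),
	       handle_class(handle));
	free(obj);
	return 0;
}

[The masking trick only helps in the kernel if zsmalloc kept that
alignment invariant stable across compaction and migration, which is
exactly the open question raised in the email.]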
On (23/04/17 17:41), Andrew Morton wrote:
> On Mon, 17 Apr 2023 22:54:20 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
> > Introduce pool compaction mutex and permit only one compaction
> > context at a time. This reduces overall pool->lock contention.
>
> That isn't what the patch does! Perhaps an earlier version used a mutex?

Oh, yes.

[..]

> > @@ -2274,6 +2275,9 @@ unsigned long zs_compact(struct zs_pool *pool)
> >  	struct size_class *class;
> >  	unsigned long pages_freed = 0;
> >
> > +	if (atomic_xchg(&pool->compaction_in_progress, 1))
> > +		return 0;
> > +
>
> A code comment might be appropriate here.

OK.

> Is the spin_is_contended() test in __zs_compact() still relevant?

I'd say yes. pool->lock is a "big pool lock"; we use it for everything:
zs_map_object() when we read objects, zs_malloc() when we allocate
objects, zs_free() when we free objects, etc.

> And... single-threading the operation seems a pretty sad way of
> addressing a contention issue. zs_compact() is fairly computationally
> expensive - surely a large machine would like to be able to
> concurrently run many instances of zs_compact()?

Compaction is effective only when the zspool suffers from internal
fragmentation. Concurrent compaction threads iterate size classes in
the same order and are likely to compete for pool->lock just to find
out that the previous pool->lock owner has already compacted the class
and there isn't much left to do.

As for concurrent compaction on big machines, the thing is - concurrent
compaction doesn't really happen. We always have just one thread
compacting size classes under pool->lock; the remaining compaction
threads just spin on pool->lock. I believe it used to be slightly
different in the past, when we had a per size-class lock instead of the
"global" pool->lock. Commit c0547d0b6a4b ("zsmalloc: consolidate
zs_pool's migrate_lock and size_class's locks") basically made
compaction single-threaded.
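[Editor's note: a userspace toy making the serialization argument above
concrete -- a sketch, not kernel code. pool_lock and class_fragmented[]
are made-up stand-ins for pool->lock and per-class fragmentation state.
Several "compactors" walk the classes in the same order under one global
lock; whoever gets the lock first does the work, and the rest mostly
find every class already compacted.]

#include <pthread.h>
#include <stdio.h>

#define NR_CLASSES	8
#define NR_THREADS	4

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static int class_fragmented[NR_CLASSES];

static void *compactor(void *arg)
{
	long id = (long)arg;
	long compacted = 0;

	/* Same iteration order as zs_compact(): highest class first. */
	for (int i = NR_CLASSES - 1; i >= 0; i--) {
		pthread_mutex_lock(&pool_lock);
		if (class_fragmented[i]) {
			class_fragmented[i] = 0;	/* "compact" it */
			compacted++;
		}
		pthread_mutex_unlock(&pool_lock);
	}
	printf("compactor %ld: compacted %ld classes\n", id, compacted);
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_THREADS];

	for (int i = 0; i < NR_CLASSES; i++)
		class_fragmented[i] = 1;	/* everything fragmented */
	for (long i = 0; i < NR_THREADS; i++)
		pthread_create(&threads[i], NULL, compactor, (void *)i);
	for (int i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}

[The total across all threads always equals NR_CLASSES: the extra
threads add lock traffic but no extra useful work, mirroring the
observation that concurrent compactors mostly spin on pool->lock.]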
On (23/04/17 19:53), Yosry Ahmed wrote:
>
> As for this patch, I personally do not observe a lot of compaction in
> our production environment, and allowing one thread to perform
> compaction while others move on with their lives can be better than
> having all of them continuously contending for the pool->lock, which
> means more contention with ~all zsmalloc operations, not just
> concurrent compactors. I can't say for sure that this is an
> improvement, but I *believe* it is.

Looking at one of the ChromeOS memory-pressure tests, I see that
sometimes (albeit rarely) we can have up to 9 parallel zspool compaction
contexts, perhaps a little bit too many for a 12-CPU laptop:

[ 2159.378827] zsmalloc: ctx #1 chrome -> zs_compact()
[ 2159.379002] zsmalloc: ctx #2 Chrome_ChildIOT -> zs_compact()
[ 2159.379120] zsmalloc: ctx #3 chrome -> zs_compact()
[ 2159.379135] zsmalloc: ctx #4 chrome -> zs_compact()
[ 2159.379213] zsmalloc: ctx #5 chrome -> zs_compact()
[ 2159.379271] zsmalloc: ctx #6 chrome -> zs_compact()
[ 2159.379276] zsmalloc: ctx #7 chrome -> zs_compact()
[ 2159.382786] zsmalloc: ctx #8 chrome -> zs_compact()
[ 2159.432153] zsmalloc: ctx #9 kswapd0 -> zs_compact()
On Tue, Apr 18, 2023 at 4:24 AM Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> On (23/04/17 19:53), Yosry Ahmed wrote:
> >
> > As for this patch, I personally do not observe a lot of compaction in
> > our production environment, and allowing one thread to perform
> > compaction while others move on with their lives can be better than
> > having all of them continuously contending for the pool->lock, which
> > means more contention with ~all zsmalloc operations, not just
> > concurrent compactors. I can't say for sure that this is an
> > improvement, but I *believe* it is.
>
> Looking at one of the ChromeOS memory-pressure tests, I see that
> sometimes (albeit rarely) we can have up to 9 parallel zspool
> compaction contexts, perhaps a little bit too many for a 12-CPU laptop:
>
> [ 2159.378827] zsmalloc: ctx #1 chrome -> zs_compact()
> [ 2159.379002] zsmalloc: ctx #2 Chrome_ChildIOT -> zs_compact()
> [ 2159.379120] zsmalloc: ctx #3 chrome -> zs_compact()
> [ 2159.379135] zsmalloc: ctx #4 chrome -> zs_compact()
> [ 2159.379213] zsmalloc: ctx #5 chrome -> zs_compact()
> [ 2159.379271] zsmalloc: ctx #6 chrome -> zs_compact()
> [ 2159.379276] zsmalloc: ctx #7 chrome -> zs_compact()
> [ 2159.382786] zsmalloc: ctx #8 chrome -> zs_compact()
> [ 2159.432153] zsmalloc: ctx #9 kswapd0 -> zs_compact()

I haven't looked into such pressure tests on our end; the information I
have is based on the amount of CPU cycles we are spending on zsmalloc
compaction in general.
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index cc81dfba05a0..dfec2fc6a30f 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -264,6 +264,7 @@ struct zs_pool {
 	struct work_struct free_work;
 #endif
 	spinlock_t lock;
+	atomic_t compaction_in_progress;
 };

 struct zspage {
@@ -2274,6 +2275,9 @@ unsigned long zs_compact(struct zs_pool *pool)
 	struct size_class *class;
 	unsigned long pages_freed = 0;

+	if (atomic_xchg(&pool->compaction_in_progress, 1))
+		return 0;
+
 	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
 		class = pool->size_class[i];
 		if (class->index != i)
@@ -2281,6 +2285,7 @@ unsigned long zs_compact(struct zs_pool *pool)
 		pages_freed += __zs_compact(pool, class);
 	}
 	atomic_long_add(pages_freed, &pool->stats.pages_compacted);
+	atomic_set(&pool->compaction_in_progress, 0);

 	return pages_freed;
 }
@@ -2388,6 +2393,7 @@ struct zs_pool *zs_create_pool(const char *name)

 	init_deferred_free(pool);
 	spin_lock_init(&pool->lock);
+	atomic_set(&pool->compaction_in_progress, 0);

 	pool->name = kstrdup(name, GFP_KERNEL);
 	if (!pool->name)
zsmalloc pool can be compacted concurrently by many contexts, e.g.

 cc1 handle_mm_fault()
      do_anonymous_page()
       __alloc_pages_slowpath()
        try_to_free_pages()
         do_try_to_free_pages()
          lru_gen_shrink_node()
           shrink_slab()
            do_shrink_slab()
             zs_shrinker_scan()
              zs_compact()

This creates unnecessary contention as all those processes compete for
access to the same classes. A single compaction process is enough.
Moreover, the contention created by multiple compaction processes
impacts other zsmalloc functions, e.g. zs_malloc(), since zsmalloc uses
the "global" pool->lock to synchronize access to the pool.

Introduce pool compaction mutex and permit only one compaction context
at a time. This reduces overall pool->lock contention.

/proc/lock-stat after make -j$((`nproc`+1)) linux kernel for
&pool->lock#3:

                Base          Patched
------------------------------------------
con-bounces     2035730       1540066
contentions     2343871       1774348
waittime-min    0.10          0.10
waittime-max    4004216.24    2745.22
waittime-total  101334168.29  67865414.91
waittime-avg    43.23         38.25
acq-bounces     2895765       2186745
acquisitions    6247686       5136943
holdtime-min    0.07          0.07
holdtime-max    2605507.97    482439.16
holdtime-total  9998599.59    5107151.01
holdtime-avg    1.60          0.99

Test run time:
Base
2775.15user 1709.13system 2:13.82elapsed 3350%CPU

Patched
2608.25user 1439.03system 2:03.63elapsed 3273%CPU

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 mm/zsmalloc.c | 6 ++++++
 1 file changed, 6 insertions(+)
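[Editor's note: for readers who want to poke at the gating pattern
itself, a standalone userspace analogue of the patch's zs_compact()
gate -- a sketch, not the kernel code. C11 atomic_exchange() stands in
for the kernel's atomic_xchg(); both return the previous value, so only
the caller that flips the flag 0 -> 1 proceeds.]

#include <stdatomic.h>
#include <stdio.h>

static atomic_int compaction_in_progress;	/* zero-initialized */

static unsigned long zs_compact_sketch(void)
{
	unsigned long pages_freed = 0;

	/* Already running elsewhere? Do nothing, report zero freed. */
	if (atomic_exchange(&compaction_in_progress, 1))
		return 0;

	/* ... walk the size classes and compact them here ... */

	atomic_store(&compaction_in_progress, 0);
	return pages_freed;
}

int main(void)
{
	printf("freed %lu pages\n", zs_compact_sketch());
	return 0;
}

[The notable design point is that losers return 0 immediately rather
than blocking, so a second would-be compactor never piles onto
pool->lock.]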