Message ID | 20240102184633.748113-11-urezki@gmail.com
---|---
State | New
Series | Mitigate a vmap lock contention v3
On Tue, Jan 02, 2024 at 07:46:32PM +0100, Uladzislau Rezki (Sony) wrote:
> A number of nodes which are used in the alloc/free paths is
> set based on num_possible_cpus() in a system. Please note a
> high limit threshold though is fixed and corresponds to 128
> nodes.

Large CPU count machines are NUMA machines. All of the allocation
and reclaim is NUMA node based, i.e. a pgdat per NUMA node.

Shrinkers are also able to be run in a NUMA aware mode so that
per-node structures can be reclaimed similar to how per-node LRU
lists are scanned for reclaim.

Hence I'm left to wonder if it would be better to have a vmalloc
area per pgdat (or sub-node cluster) rather than just base the
number on CPU count and then have an arbitrary maximum number when
we get to 128 CPU cores. We can have 128 CPU cores in a single
socket these days, so not being able to scale the vmalloc areas
beyond a single socket seems like a bit of a limitation.

Scaling out the vmalloc areas in a NUMA aware fashion allows the
shrinker to be run in NUMA aware mode, which gets rid of the need
for the global shrinker to loop over every single vmap area in
every shrinker invocation. Only the vm areas on the node that has
a memory shortage need to be scanned and reclaimed; it doesn't need
to reclaim everything globally when a single node runs out of memory.

Yes, this may not give quite as good microbenchmark scalability
results, but being able to locate each vm area in node-local memory
and have operations on them largely isolated to node-local tasks and
vmalloc area reclaim will work much better on large multi-socket
NUMA machines.

Cheers,

Dave.
> On Tue, Jan 02, 2024 at 07:46:32PM +0100, Uladzislau Rezki (Sony) wrote:
> > A number of nodes which are used in the alloc/free paths is
> > set based on num_possible_cpus() in a system. Please note a
> > high limit threshold though is fixed and corresponds to 128
> > nodes.
>
> Large CPU count machines are NUMA machines. All of the allocation
> and reclaim is NUMA node based, i.e. a pgdat per NUMA node.
>
> Shrinkers are also able to be run in a NUMA aware mode so that
> per-node structures can be reclaimed similar to how per-node LRU
> lists are scanned for reclaim.
>
> Hence I'm left to wonder if it would be better to have a vmalloc
> area per pgdat (or sub-node cluster) rather than just base the
> number on CPU count and then have an arbitrary maximum number when
> we get to 128 CPU cores. We can have 128 CPU cores in a single
> socket these days, so not being able to scale the vmalloc areas
> beyond a single socket seems like a bit of a limitation.
>
> Scaling out the vmalloc areas in a NUMA aware fashion allows the
> shrinker to be run in NUMA aware mode, which gets rid of the need
> for the global shrinker to loop over every single vmap area in
> every shrinker invocation. Only the vm areas on the node that has
> a memory shortage need to be scanned and reclaimed; it doesn't need
> to reclaim everything globally when a single node runs out of memory.
>
> Yes, this may not give quite as good microbenchmark scalability
> results, but being able to locate each vm area in node-local memory
> and have operations on them largely isolated to node-local tasks and
> vmalloc area reclaim will work much better on large multi-socket
> NUMA machines.
>
Currently i fix the max nodes number to 128. This is because i do not
have access to such big NUMA systems whereas i do have access to
systems with around ~128 CPUs. That is why i have decided to stop on
that number as of now.

We can easily set nr_nodes to num_possible_cpus() and let it scale for
anyone. But before doing this, i would like to give it a try as a first
step because i have not tested it well on really big NUMA systems.

Thanks for your NUMA-aware input.

--
Uladzislau Rezki
On Mon, Jan 15, 2024 at 08:09:29PM +0100, Uladzislau Rezki wrote:
> > Large CPU count machines are NUMA machines. All of the allocation
> > and reclaim is NUMA node based, i.e. a pgdat per NUMA node.
[...]
> Currently i fix the max nodes number to 128. This is because i do not
> have access to such big NUMA systems whereas i do have access to
> systems with around ~128 CPUs. That is why i have decided to stop on
> that number as of now.

I suspect you are confusing number of CPUs with number of NUMA nodes.

A NUMA system with 128 nodes is a large NUMA system that will have
thousands of CPU cores, whilst above you talk about basing the
count on CPU cores and that a single socket can have 128 cores?

> We can easily set nr_nodes to num_possible_cpus() and let it scale for
> anyone. But before doing this, i would like to give it a try as a first
> step because i have not tested it well on really big NUMA systems.

I don't think you need to have large NUMA systems to test it. We
have the "fakenuma" feature for a reason. Essentially, once you
have enough CPU cores that catastrophic lock contention can be
generated in a fast path (can take as few as 4-5 CPU cores), then
you can effectively test NUMA scalability with fakenuma by creating
nodes with >=8 CPUs each.

This is how I've done testing of NUMA aware algorithms (like
shrinkers!) for the past decade - I haven't had direct access to a
big NUMA machine since 2008, yet it's relatively trivial to test
NUMA based scalability algorithms without them these days.

-Dave.
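The fakenuma approach Dave describes can be tried without special hardware. A minimal sketch, assuming an x86 machine with `CONFIG_NUMA_EMU=y` and GRUB; the node count of 8 is illustrative:

```shell
# Emulate an 8-node NUMA topology on a non-NUMA test machine using the
# kernel's NUMA emulation ("fakenuma") support.

# 1) Append the NUMA emulation parameter to the kernel command line,
#    e.g. in /etc/default/grub:
#        GRUB_CMDLINE_LINUX="numa=fake=8"
#    then regenerate the GRUB config and reboot:
sudo update-grub && sudo reboot

# 2) After reboot, verify the emulated nodes and their CPU sets:
numactl --hardware
ls /sys/devices/system/node/ | grep '^node'
```

With enough real cores per fake node, lock contention behaves much as it would across real NUMA nodes, which is what makes this usable for scalability testing.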
On Wed, Jan 17, 2024 at 09:06:02AM +1100, Dave Chinner wrote:
> On Mon, Jan 15, 2024 at 08:09:29PM +0100, Uladzislau Rezki wrote:
[...]
> > Currently i fix the max nodes number to 128. This is because i do not
> > have access to such big NUMA systems whereas i do have access to
> > systems with around ~128 CPUs. That is why i have decided to stop on
> > that number as of now.
>
> I suspect you are confusing number of CPUs with number of NUMA nodes.
>
I do not think so :)

> A NUMA system with 128 nodes is a large NUMA system that will have
> thousands of CPU cores, whilst above you talk about basing the
> count on CPU cores and that a single socket can have 128 cores?
>
> > We can easily set nr_nodes to num_possible_cpus() and let it scale for
> > anyone. But before doing this, i would like to give it a try as a first
> > step because i have not tested it well on really big NUMA systems.
>
> I don't think you need to have large NUMA systems to test it. We
> have the "fakenuma" feature for a reason. Essentially, once you
> have enough CPU cores that catastrophic lock contention can be
> generated in a fast path (can take as few as 4-5 CPU cores), then
> you can effectively test NUMA scalability with fakenuma by creating
> nodes with >=8 CPUs each.
>
> This is how I've done testing of NUMA aware algorithms (like
> shrinkers!) for the past decade - I haven't had direct access to a
> big NUMA machine since 2008, yet it's relatively trivial to test
> NUMA based scalability algorithms without them these days.
>
I see your point. Making this NUMA-aware would require rework, adding
an extra layer that allows such scaling.

If the socket has 256 CPUs, how do we scale VAs inside that node among
those CPUs?

--
Uladzislau Rezki
On Thu, Jan 18, 2024 at 07:23:47PM +0100, Uladzislau Rezki wrote:
> On Wed, Jan 17, 2024 at 09:06:02AM +1100, Dave Chinner wrote:
[...]
> I see your point. Making this NUMA-aware would require rework, adding
> an extra layer that allows such scaling.
>
> If the socket has 256 CPUs, how do we scale VAs inside that node among
> those CPUs?

It's called "sub-NUMA clustering" and is a BIOS option that presents
large core count CPU packages as multiple NUMA nodes. See:

https://www.intel.com/content/www/us/en/developer/articles/technical/fourth-generation-xeon-scalable-family-overview.html

Essentially, large core count CPUs are a cluster of smaller core
groups with their own resources and memory controllers. This is how
they are laid out either on a single die (Intel) or as a collection
of smaller dies (AMD compute complexes) that are tied together by
the interconnect between the LLCs and memory controllers. They only
appear as a "unified" CPU because they are configured that way by
the BIOS, but can also be configured to actually expose their inner
non-uniform memory access topology for operating systems and
application stacks that are NUMA aware (like Linux).

This means a "256 core" CPU would probably present as 16 smaller
16-core CPUs, each with their own L1/2/3 caches and memory
controllers. IOWs, a single socket appears to the kernel as a
16-node NUMA system with 16 cores per node. Most NUMA aware
scalability algorithms will work just fine with this sort of setup -
it's just another set of numbers in the NUMA distance table...

Cheers,

Dave.
On Fri, Jan 19, 2024 at 08:28:05AM +1100, Dave Chinner wrote:
> On Thu, Jan 18, 2024 at 07:23:47PM +0100, Uladzislau Rezki wrote:
[...]
> It's called "sub-NUMA clustering" and is a BIOS option that presents
> large core count CPU packages as multiple NUMA nodes. See:
>
> https://www.intel.com/content/www/us/en/developer/articles/technical/fourth-generation-xeon-scalable-family-overview.html
>
[...]
> This means a "256 core" CPU would probably present as 16 smaller
> 16-core CPUs, each with their own L1/2/3 caches and memory
> controllers. IOWs, a single socket appears to the kernel as a
> 16-node NUMA system with 16 cores per node. Most NUMA aware
> scalability algorithms will work just fine with this sort of setup -
> it's just another set of numbers in the NUMA distance table...
>
Thank you for your input. I will go through it to see what we can do
in terms of NUMA awareness with thousands of CPUs in total.

Thanks!

--
Uladzislau Rezki
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0c671cb96151..ef534c76daef 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4879,10 +4879,27 @@ static void vmap_init_free_space(void)
 static void vmap_init_nodes(void)
 {
 	struct vmap_node *vn;
-	int i, j;
+	int i, n;
+
+#if BITS_PER_LONG == 64
+	/* A high threshold of max nodes is fixed and bound to 128. */
+	n = clamp_t(unsigned int, num_possible_cpus(), 1, 128);
+
+	if (n > 1) {
+		vn = kmalloc_array(n, sizeof(*vn), GFP_NOWAIT | __GFP_NOWARN);
+		if (vn) {
+			/* Node partition is 16 pages. */
+			vmap_zone_size = (1 << 4) * PAGE_SIZE;
+			nr_vmap_nodes = n;
+			vmap_nodes = vn;
+		} else {
+			pr_err("Failed to allocate an array. Disable a node layer\n");
+		}
+	}
+#endif
 
-	for (i = 0; i < nr_vmap_nodes; i++) {
-		vn = &vmap_nodes[i];
+	for (n = 0; n < nr_vmap_nodes; n++) {
+		vn = &vmap_nodes[n];
 		vn->busy.root = RB_ROOT;
 		INIT_LIST_HEAD(&vn->busy.head);
 		spin_lock_init(&vn->busy.lock);
@@ -4891,9 +4908,9 @@ static void vmap_init_nodes(void)
 		INIT_LIST_HEAD(&vn->lazy.head);
 		spin_lock_init(&vn->lazy.lock);
 
-		for (j = 0; j < MAX_VA_SIZE_PAGES; j++) {
-			INIT_LIST_HEAD(&vn->pool[j].head);
-			WRITE_ONCE(vn->pool[j].len, 0);
+		for (i = 0; i < MAX_VA_SIZE_PAGES; i++) {
+			INIT_LIST_HEAD(&vn->pool[i].head);
+			WRITE_ONCE(vn->pool[i].len, 0);
 		}
 
 		spin_lock_init(&vn->pool_lock);
A number of nodes which are used in the alloc/free paths is
set based on num_possible_cpus() in a system. Please note a
high limit threshold though is fixed and corresponds to 128
nodes.

For 32-bit or single core systems an access to a global vmap
heap is not balanced. Such small systems do not suffer from
lock contentions due to low number of CPUs. In such case the
nr_nodes is equal to 1.

Test on AMD Ryzen Threadripper 3970X 32-Core Processor:
sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64

<default perf>
  94.41%  0.89%  [kernel]        [k] _raw_spin_lock
  93.35% 93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
  76.13%  0.28%  [kernel]        [k] __vmalloc_node_range
  72.96%  0.81%  [kernel]        [k] alloc_vmap_area
  56.94%  0.00%  [kernel]        [k] __get_vm_area_node
  41.95%  0.00%  [kernel]        [k] vmalloc
  37.15%  0.01%  [test_vmalloc]  [k] full_fit_alloc_test
  35.17%  0.00%  [kernel]        [k] ret_from_fork_asm
  35.17%  0.00%  [kernel]        [k] ret_from_fork
  35.17%  0.00%  [kernel]        [k] kthread
  35.08%  0.00%  [test_vmalloc]  [k] test_func
  34.45%  0.00%  [test_vmalloc]  [k] fix_size_alloc_test
  28.09%  0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
  23.53%  0.25%  [kernel]        [k] vfree.part.0
  21.72%  0.00%  [kernel]        [k] remove_vm_area
  20.08%  0.21%  [kernel]        [k] find_unlink_vmap_area
   2.34%  0.61%  [kernel]        [k] free_vmap_area_noflush
<default perf>

vs

<patch-series perf>
  82.32%  0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
  63.36%  0.02%  [kernel]        [k] vmalloc
  63.34%  2.64%  [kernel]        [k] __vmalloc_node_range
  30.42%  4.46%  [kernel]        [k] vfree.part.0
  28.98%  2.51%  [kernel]        [k] __alloc_pages_bulk
  27.28%  0.19%  [kernel]        [k] __get_vm_area_node
  26.13%  1.50%  [kernel]        [k] alloc_vmap_area
  21.72% 21.67%  [kernel]        [k] clear_page_rep
  19.51%  2.43%  [kernel]        [k] _raw_spin_lock
  16.61% 16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
  13.40%  2.07%  [kernel]        [k] free_unref_page
  10.62%  0.01%  [kernel]        [k] remove_vm_area
   9.02%  8.73%  [kernel]        [k] insert_vmap_area
   8.94%  0.00%  [kernel]        [k] ret_from_fork_asm
   8.94%  0.00%  [kernel]        [k] ret_from_fork
   8.94%  0.00%  [kernel]        [k] kthread
   8.29%  0.00%  [test_vmalloc]  [k] test_func
   7.81%  0.05%  [test_vmalloc]  [k] full_fit_alloc_test
   5.30%  4.73%  [kernel]        [k] purge_vmap_node
   4.47%  2.65%  [kernel]        [k] free_vmap_area_noflush
<patch-series perf>

confirms that native_queued_spin_lock_slowpath goes down to 16.51%
from 93.07%. The throughput is ~12x higher:

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    10m51.271s
user    0m0.013s
sys     0m0.187s
urezki@pc638:~$

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    0m51.301s
user    0m0.015s
sys     0m0.040s
urezki@pc638:~$

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)