Message ID: 20240829130633.2184-1-ahuang12@lenovo.com
State:      New
Series:     [1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area
On Thu, Aug 29, 2024 at 09:06:33PM +0800, Adrian Huang wrote:
> From: Adrian Huang <ahuang12@lenovo.com>
>
> When running the vmalloc stress test on a 448-core system, the observed
> average latency of purge_vmap_node() is about 2 seconds, measured with
> the eBPF/bcc 'funclatency.py' tool [1].
>
>   # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
>
>      usecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 29       |                                        |
>          4 -> 7          : 19       |                                        |
>          8 -> 15         : 56       |                                        |
>         16 -> 31         : 483      |****                                    |
>         32 -> 63         : 1548     |************                            |
>         64 -> 127        : 2634     |*********************                   |
>        128 -> 255        : 2535     |*********************                   |
>        256 -> 511        : 1776     |**************                          |
>        512 -> 1023       : 1015     |********                                |
>       1024 -> 2047       : 573      |****                                    |
>       2048 -> 4095       : 488      |****                                    |
>       4096 -> 8191       : 1091     |*********                               |
>       8192 -> 16383      : 3078     |*************************               |
>      16384 -> 32767      : 4821     |****************************************|
>      32768 -> 65535      : 3318     |***************************             |
>      65536 -> 131071     : 1718     |**************                          |
>     131072 -> 262143     : 2220     |******************                      |
>     262144 -> 524287     : 1147     |*********                               |
>     524288 -> 1048575    : 1179     |*********                               |
>    1048576 -> 2097151    : 822      |******                                  |
>    2097152 -> 4194303    : 906      |*******                                 |
>    4194304 -> 8388607    : 2148     |*****************                       |
>    8388608 -> 16777215   : 4497     |*************************************   |
>   16777216 -> 33554431   : 289      |**                                      |
>
> avg = 2041714 usecs, total: 78381401772 usecs, count: 38390
>
> The worst case is over 16-33 seconds, so a soft lockup is triggered [2].
>
> [Root Cause]
> 1) Each purge_list is long. The following shows how many vmap_area
>    structs are purged per node:
>
>      crash> p vmap_nodes
>      vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
>
>      crash> vmap_node 0xff2de5a900100000 128 | grep nr_purged
>        nr_purged = 663070
>        ...
>        nr_purged = 821670
>        nr_purged = 692214
>        nr_purged = 726808
>        ...
>
> 2) atomic_long_sub() employs the 'lock' prefix to make the operation
>    atomic when purging each vmap_area. However, the iteration covers
>    over 600,000 vmap_area structs (see 'nr_purged' above).
>
>    Here is the objdump output:
>
>      $ objdump -D vmlinux
>
>      ffffffff813e8c80 <purge_vmap_node>:
>      ...
>      ffffffff813e8d70: f0 48 29 2d 68 0c bb    lock sub %rbp,0x2bb0c68(%rip)
>      ...
>
>    Quote from the "Instruction tables" PDF file [3]:
>
>      Instructions with a LOCK prefix have a long latency that depends on
>      cache organization and possibly RAM speed. If there are multiple
>      processors or cores or direct memory access (DMA) devices, then all
>      locked instructions will lock a cache line for exclusive access,
>      which may involve RAM access. A LOCK prefix typically costs more
>      than a hundred clock cycles, even on single-processor systems.
>
>    That is why the latency of purge_vmap_node() dramatically increases
>    on a many-core system: one core is busy purging each vmap_area of
>    the *long* purge_list and executing atomic_long_sub() for each
>    vmap_area, while other cores free vmalloc allocations and execute
>    atomic_long_add_return() in free_vmap_area_noflush().
>
> [Solution]
> Employ a local variable to record the total number of purged pages, and
> execute atomic_long_sub() once, after the traversal of the purge_list
> is done. The experiment shows a 99% latency improvement.
>
> [Experiment Result]
> 1) System Configuration: three servers (HT enabled) are tested.
>    * 72-core server:  3rd Gen Intel Xeon Scalable Processor x 1
>    * 192-core server: 5th Gen Intel Xeon Scalable Processor x 2
>    * 448-core server: AMD Zen 4 Processor x 2
>
> 2) Kernel Config
>    * CONFIG_KASAN is disabled
>
> 3) The data in the "w/o patch" and "w/ patch" columns
>    * Unit: microseconds (us)
>    * Each value is the average of three measurements
>
>    System            w/o patch (us)   w/ patch (us)   Improvement (%)
>    ---------------   --------------   -------------   ---------------
>    72-core server    2194             14               99.36%
>    192-core server   143799           1139             99.21%
>    448-core server   1992122          6883             99.65%
>
> [1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
> [2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12
> [3] https://www.agner.org/optimize/instruction_tables.pdf
>
> Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
> ---
>  mm/vmalloc.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 3f9b6bd707d2..607697c81e60 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2210,6 +2210,7 @@ static void purge_vmap_node(struct work_struct *work)
>  {
>  	struct vmap_node *vn = container_of(work,
>  		struct vmap_node, purge_work);
> +	unsigned long nr_purged_pages = 0;
>  	struct vmap_area *va, *n_va;
>  	LIST_HEAD(local_list);
>
> @@ -2224,7 +2225,7 @@ static void purge_vmap_node(struct work_struct *work)
>
>  		list_del_init(&va->list);
>
> -		atomic_long_sub(nr, &vmap_lazy_nr);
> +		nr_purged_pages += nr;
>  		vn->nr_purged++;
>
>  		if (is_vn_id_valid(vn_id) && !vn->skip_populate)
> @@ -2235,6 +2236,8 @@ static void purge_vmap_node(struct work_struct *work)
>  		list_add(&va->list, &local_list);
>  	}
>
> +	atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
> +
>  	reclaim_list_global(&local_list);
>  }
>
> --
> 2.34.1
>
I see the point and it looks good to me.

Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Thank you for improving this. There is one more spot which I detected
earlier, it is:

<snip>
static void free_vmap_area_noflush(struct vmap_area *va)
{
	unsigned long nr_lazy_max = lazy_max_pages();
	unsigned long va_start = va->va_start;
	unsigned int vn_id = decode_vn_id(va->flags);
	struct vmap_node *vn;
	unsigned long nr_lazy;

	if (WARN_ON_ONCE(!list_empty(&va->list)))
		return;

	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
				PAGE_SHIFT, &vmap_lazy_nr);
	...
<snip>

atomic_long_add_return() might also introduce a high contention. We can
optimize by splitting into more light atomics. Can you check it on your
448-core system?

Thanks!

--
Uladzislau Rezki
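[Editorial note: the thread does not spell out what "splitting into more
light atomics" would look like. Below is a minimal, hypothetical userspace
C11 sketch of one plausible reading: replace the full add-and-return
read-modify-write with a plain atomic add followed by a separate (possibly
stale) load. In kernel terms that would be atomic_long_add() plus
atomic_long_read() instead of atomic_long_add_return(); all names here are
illustrative, not the kernel's.]

#include <stdatomic.h>
#include <stdio.h>

static atomic_long lazy_nr;	/* stand-in for vmap_lazy_nr */

/* Full RMW that also returns the new value: "lock xadd" on x86-64. */
static long add_return(long nr)
{
	return atomic_fetch_add_explicit(&lazy_nr, nr,
					 memory_order_relaxed) + nr;
}

/*
 * Lighter split: the add (typically emitted as "lock add" when the
 * result is unused) followed by a plain load. The load is not atomic
 * with the add, so the returned value may be stale; that is acceptable
 * when it only feeds a "did we cross the threshold?" heuristic.
 */
static long add_then_read(long nr)
{
	atomic_fetch_add_explicit(&lazy_nr, nr, memory_order_relaxed);
	return atomic_load_explicit(&lazy_nr, memory_order_relaxed);
}

int main(void)
{
	printf("%ld\n", add_return(4));		/* 4 */
	printf("%ld\n", add_then_read(4));	/* 8, absent other threads */
	return 0;
}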
On Thu, Aug 29, 2024 at 09:00:16PM +0200, Uladzislau Rezki wrote:
> On Thu, Aug 29, 2024 at 09:06:33PM +0800, Adrian Huang wrote:
> > From: Adrian Huang <ahuang12@lenovo.com>
> >
> > When running the vmalloc stress test on a 448-core system, the observed
> > average latency of purge_vmap_node() is about 2 seconds, measured with
> > the eBPF/bcc 'funclatency.py' tool [1].
> >
[ ...full patch and measurements quoted above snipped... ]
>
> I see the point and it looks good to me.
>
> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>
> Thank you for improving this. There is one more spot which I detected
> earlier, it is:
>
> <snip>
> static void free_vmap_area_noflush(struct vmap_area *va)
> {
> 	unsigned long nr_lazy_max = lazy_max_pages();
> 	unsigned long va_start = va->va_start;
> 	unsigned int vn_id = decode_vn_id(va->flags);
> 	struct vmap_node *vn;
> 	unsigned long nr_lazy;
>
> 	if (WARN_ON_ONCE(!list_empty(&va->list)))
> 		return;
>
> 	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
> 				PAGE_SHIFT, &vmap_lazy_nr);
> 	...
> <snip>
>
> atomic_long_add_return() might also introduce a high contention. We can
> optimize by splitting into more light atomics. Can you check it on your
> 448-core system?
>
I have checked free_vmap_area_noflush() on my hardware. It is a 64-core
system:

<perf cycles>
...
+    7.84%  5.18%  [kernel]  [k] free_vmap_area_noflush
+    6.16%  1.61%  [kernel]  [k] free_unref_page
+    5.57%  1.51%  [kernel]  [k] find_unlink_vmap_area
...
<perf cycles>

<perf cycles annotate>
...
             │ arch_atomic64_add_return():
    23352402 │   mov    %r12,%rdx
             │   lock   xadd %rdx,vmap_lazy_nr
             │ is_vn_id_valid():
 52364447314 │   mov    nr_vmap_nodes,%ecx   <----- the hottest spot, consuming most (99%) of the CPU cycles
             │ arch_atomic64_add_return():
    45547180 │   add    %rdx,%r12
             │ is_vn_id_valid():
...
<perf cycles annotate>

At least in my case, on my hardware, I do not see that
atomic_long_add_return() is at the top when it comes to CPU cycles.
Below one is the hottest instead:

static bool
is_vn_id_valid(unsigned int node_id)
{
	if (node_id < nr_vmap_nodes)
		return true;

	return false;
}

It is the access to "nr_vmap_nodes", which is read-only and globally
defined:

static __read_mostly unsigned int nr_vmap_nodes = 1;

Any thoughts?

--
Uladzislau Rezki
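[Editorial note: purely cosmetic and not part of the discussion, but the
helper quoted above reduces to a single comparison. A kernel-style sketch;
the generated code, including the hot load of nr_vmap_nodes, is unchanged.]

static bool
is_vn_id_valid(unsigned int node_id)
{
	return node_id < nr_vmap_nodes;
}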
On Thu, 29 Aug 2024 21:06:33 +0800 Adrian Huang <adrianhuang0701@gmail.com> wrote:

> From: Adrian Huang <ahuang12@lenovo.com>
>
> When running the vmalloc stress test on a 448-core system, the observed
> average latency of purge_vmap_node() is about 2 seconds, measured with
> the eBPF/bcc 'funclatency.py' tool [1].
>
> ...
>
> 3) The data in the "w/o patch" and "w/ patch" columns
>    * Unit: microseconds (us)
>    * Each value is the average of three measurements
>
>    System            w/o patch (us)   w/ patch (us)   Improvement (%)
>    ---------------   --------------   -------------   ---------------
>    72-core server    2194             14               99.36%
>    192-core server   143799           1139             99.21%
>    448-core server   1992122          6883             99.65%

Holy cow. I'll add cc:stable to this. Partly because it fixes a
softlockup, partly because holy cow.
On 30/08/2024 at 18:26, Uladzislau Rezki wrote:
> At least in my case, on my hardware, I do not see that
> atomic_long_add_return() is at the top when it comes to CPU cycles.
> Below one is the hottest instead:
>
> static bool
> is_vn_id_valid(unsigned int node_id)
> {
> 	if (node_id < nr_vmap_nodes)
> 		return true;
>
> 	return false;
> }
>
> It is the access to "nr_vmap_nodes", which is read-only and globally
> defined:
>
> static __read_mostly unsigned int nr_vmap_nodes = 1;
>
> Any thoughts?
>
> --
> Uladzislau Rezki

Hi,

unrelated to your use case, but something that could easily save a few
cycles on some systems, IMHO.

Maybe:

#if NR_CPUS > 1
static __read_mostly unsigned int nr_vmap_nodes = 1;
static __read_mostly unsigned int vmap_zone_size = 1;
#else
#define nr_vmap_nodes 1
#define vmap_zone_size 1
#endif

so that the compiler can do a better job: some loops can be optimized
away, and there is no need to access memory to get these values.

Not sure if such a use case can exist or is of any interest.

This is valid because of [1] and the #ifdef around the num_possible_cpus()
declaration [2, 3].

Just my 2c.

CJ

[1]: https://elixir.bootlin.com/linux/v6.11-rc5/source/mm/vmalloc.c#L5026
[2]: https://elixir.bootlin.com/linux/v6.11-rc5/source/include/linux/cpumask.h#L1083
[3]: https://elixir.bootlin.com/linux/v6.11-rc5/source/include/linux/cpumask.h#L1136
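[Editorial note: a hypothetical, standalone C illustration of the point
above, that a compile-time constant lets the compiler fold loops and drop
memory loads, whereas a __read_mostly variable must still be read. Compile
with -DSINGLE_CPU to get the constant variant; all names are made up for
this sketch.]

#include <stdio.h>

#ifdef SINGLE_CPU
#define nr_nodes 1			/* constant: loops below fold away */
#else
static unsigned int nr_nodes = 1;	/* variable: must be loaded each time */
#endif

static unsigned long sum_nodes(const unsigned long *stats)
{
	unsigned long sum = 0;
	unsigned int i;

	/* With nr_nodes as a macro equal to 1, an optimizing compiler
	 * reduces this whole loop to a single load of stats[0]. */
	for (i = 0; i < nr_nodes; i++)
		sum += stats[i];

	return sum;
}

int main(void)
{
	unsigned long stats[1] = { 42 };

	printf("%lu\n", sum_nodes(stats));
	return 0;
}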
On Fri, Aug 30, 2024 at 3:00 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> atomic_long_add_return() might also introduce a high contention. We can
> optimize by splitting into more light atomics. Can you check it on your
> 448-core system?

Interestingly, the following result shows the latency of
free_vmap_area_noflush() is just 26 usecs (the worst case is 16-32 ms):

  /home/git-repo/bcc/tools/funclatency.py -u free_vmap_area_noflush & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1

     usecs               : count      distribution
         0 -> 1          : 18166      |                                        |
         2 -> 3          : 41929818   |**                                      |
         4 -> 7          : 181203439  |***********                             |
         8 -> 15         : 464242836  |*****************************           |
        16 -> 31         : 620077545  |****************************************|
        32 -> 63         : 442133041  |****************************            |
        64 -> 127        : 111432597  |*******                                 |
       128 -> 255        : 3441649    |                                        |
       256 -> 511        : 302655     |                                        |
       512 -> 1023       : 738        |                                        |
      1024 -> 2047       : 73         |                                        |
      2048 -> 4095       : 0          |                                        |
      4096 -> 8191       : 0          |                                        |
      8192 -> 16383      : 0          |                                        |
     16384 -> 32767      : 196        |                                        |

avg = 26 usecs, total: 49415657269 usecs, count: 1864782753

free_vmap_area_noflush() executes the lock prefix only once, so the
worst case might be just about a hundred clock cycles.

The problem with purge_vmap_node() is that some cores are busy purging
each vmap_area of the *long* purge_list and executing atomic_long_sub()
for each vmap_area, while other cores free vmalloc allocations and
execute atomic_long_add_return() in free_vmap_area_noflush(). The
following crash log shows that 22 cores are busy purging vmap_area
structs [1]:

  crash> bt -a | grep "purge_vmap_node+291" | wc -l
  22

So, the latency of purge_vmap_node() dramatically increases because it
executes the lock prefix over 600,000 times. The issue is easier to
reproduce if more cores execute purge_vmap_node() simultaneously.

[1] https://gist.github.com/AdrianHuang/c0030dd7755e673ed00cb197b76ce0a7

I tested the following patch with the lighter atomics. However, nothing
improved (although the worst case is improved):

     usecs               : count      distribution
         0 -> 1          : 7146       |                                        |
         2 -> 3          : 31734187   |**                                      |
         4 -> 7          : 161408609  |***********                             |
         8 -> 15         : 461411377  |*********************************       |
        16 -> 31         : 557005293  |****************************************|
        32 -> 63         : 435518485  |*******************************         |
        64 -> 127        : 175033097  |************                            |
       128 -> 255        : 42265379   |***                                     |
       256 -> 511        : 399112     |                                        |
       512 -> 1023       : 734        |                                        |
      1024 -> 2047       : 72         |                                        |

avg = 32 usecs, total: 59952713176 usecs, count: 1864783491

[free_vmap_area_noflush() assembly instructions w/o the test patch]

ffffffff813e6e90 <free_vmap_area_noflush>:
...
ffffffff813e6ed4: 4c 8b 65 08             mov    0x8(%rbp),%r12
ffffffff813e6ed8: 4c 2b 65 00             sub    0x0(%rbp),%r12
ffffffff813e6edc: 49 c1 ec 0c             shr    $0xc,%r12
ffffffff813e6ee0: 4c 89 e2                mov    %r12,%rdx
ffffffff813e6ee3: f0 48 0f c1 15 f4 2a    lock xadd %rdx,0x2bb2af4(%rip)  # ffffffff83f999e0 <vmap_lazy_nr>
ffffffff813e6eea: bb 02
ffffffff813e6eec: 8b 0d 0e b1 a5 01       mov    0x1a5b10e(%rip),%ecx     # ffffffff82e42000 <nr_vmap_nodes>
ffffffff813e6ef2: 49 01 d4                add    %rdx,%r12
ffffffff813e6ef5: 39 c8                   cmp    %ecx,%eax
...

[free_vmap_area_noflush() assembly instructions w/ the test patch: no lock prefix]

ffffffff813e6e90 <free_vmap_area_noflush>:
...
ffffffff813e6edb: 48 8b 5d 08             mov    0x8(%rbp),%rbx
ffffffff813e6edf: 48 2b 5d 00             sub    0x0(%rbp),%rbx
ffffffff813e6ee3: 8b 0d d7 b0 a5 01       mov    0x1a5b0d7(%rip),%ecx     # ffffffff82e41fc0 <nr_vmap_nodes>
ffffffff813e6ee9: 48 c1 eb 0c             shr    $0xc,%rbx
ffffffff813e6eed: 4c 8b 25 b4 f1 92 01    mov    0x192f1b4(%rip),%r12     # ffffffff82d160a8 <vmap_nodes>
ffffffff813e6ef4: 48 01 d3                add    %rdx,%rbx
ffffffff813e6ef7: 48 89 1d e2 2a bb 02    mov    %rbx,0x2bb2ae2(%rip)     # ffffffff83f999e0 <vmap_lazy_nr>
ffffffff813e6efe: 39 c8                   cmp    %ecx,%eax
...

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3f9b6bd707d2..3927c541440b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2357,8 +2357,9 @@ static void free_vmap_area_noflush(struct vmap_area *va)
 	if (WARN_ON_ONCE(!list_empty(&va->list)))
 		return;
 
-	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
-					PAGE_SHIFT, &vmap_lazy_nr);
+	nr_lazy = atomic_long_read(&vmap_lazy_nr);
+	nr_lazy += (va->va_end - va->va_start) >> PAGE_SHIFT;
+	atomic_long_set(&vmap_lazy_nr, nr_lazy);
 
 	/*
 	 * If it was request by a certain node we would like to
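[Editorial note: what the test patch above trades away, shown as a
deterministic single-threaded simulation of an unlucky two-CPU
interleaving. atomic_long_read() plus atomic_long_set() is no longer one
read-modify-write, so a concurrent update can be lost; that is tolerable
only if vmap_lazy_nr may be treated as an approximate counter. This is an
editor's illustration, not code from the thread.]

#include <stdio.h>

static long counter = 100;	/* stand-in for vmap_lazy_nr */

int main(void)
{
	/* Both "CPUs" read the same value before either writes back... */
	long cpu0 = counter;
	long cpu1 = counter;

	/* ...then each stores its own sum: the last writer wins. */
	counter = cpu0 + 4;	/* CPU0 adds 4 */
	counter = cpu1 + 8;	/* CPU1 adds 8, overwriting CPU0's update */

	/* Prints 108; a locked xadd sequence would have produced 112. */
	printf("%ld\n", counter);
	return 0;
}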
On Mon, Sep 02, 2024 at 08:00:46PM +0800, Adrian Huang wrote:
> On Fri, Aug 30, 2024 at 3:00 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > atomic_long_add_return() might also introduce a high contention. We can
> > optimize by splitting into more light atomics. Can you check it on your
> > 448-core system?
>
> Interestingly, the following result shows the latency of
> free_vmap_area_noflush() is just 26 usecs (the worst case is 16-32 ms):
>
> [ ...funclatency histogram snipped, quoted above... ]
>
> avg = 26 usecs, total: 49415657269 usecs, count: 1864782753
>
> free_vmap_area_noflush() executes the lock prefix only once, so the
> worst case might be just about a hundred clock cycles.
>
> The problem with purge_vmap_node() is that some cores are busy purging
> each vmap_area of the *long* purge_list and executing atomic_long_sub()
> for each vmap_area, while other cores free vmalloc allocations and
> execute atomic_long_add_return() in free_vmap_area_noflush(). The
> following crash log shows that 22 cores are busy purging vmap_area
> structs [1]:
>
>   crash> bt -a | grep "purge_vmap_node+291" | wc -l
>   22
>
> So, the latency of purge_vmap_node() dramatically increases because it
> executes the lock prefix over 600,000 times. The issue is easier to
> reproduce if more cores execute purge_vmap_node() simultaneously.
>
Right. This is clear to me. Under heavy stressing in a tight loop we
invoke atomic_long_sub() per freed VA. Having 448 cores and one stress
job per CPU, we end up with a high-contention spot where access to the
atomic requires a cache-line lock.

> I tested the following patch with the lighter atomics. However, nothing
> improved (although the worst case is improved):
>
> [ ...second funclatency histogram snipped, quoted above... ]
>
> avg = 32 usecs, total: 59952713176 usecs, count: 1864783491
>
Thank you for checking this! So there is no difference. As for the worst
case, it might be a measurement error: the measured time includes a
context switch, because a context which triggers the
free_vmap_area_noflush() function can easily be preempted.

--
Uladzislau Rezki
Hello!

> Hi,
>
> unrelated to your use case, but something that could easily save a few
> cycles on some systems, IMHO.
>
> Maybe:
>
> #if NR_CPUS > 1
> static __read_mostly unsigned int nr_vmap_nodes = 1;
> static __read_mostly unsigned int vmap_zone_size = 1;
> #else
> #define nr_vmap_nodes 1
> #define vmap_zone_size 1
> #endif
>
> so that the compiler can do a better job: some loops can be optimized
> away, and there is no need to access memory to get these values.
>
> Not sure if such a use case can exist or is of any interest.
>
> This is valid because of [1] and the #ifdef around the num_possible_cpus()
> declaration [2, 3].
>
> Just my 2c.
>
Thank you, I see your point.

--
Uladzislau Rezki
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3f9b6bd707d2..607697c81e60 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2210,6 +2210,7 @@ static void purge_vmap_node(struct work_struct *work)
 {
 	struct vmap_node *vn = container_of(work,
 		struct vmap_node, purge_work);
+	unsigned long nr_purged_pages = 0;
 	struct vmap_area *va, *n_va;
 	LIST_HEAD(local_list);
 
@@ -2224,7 +2225,7 @@ static void purge_vmap_node(struct work_struct *work)
 
 		list_del_init(&va->list);
 
-		atomic_long_sub(nr, &vmap_lazy_nr);
+		nr_purged_pages += nr;
 		vn->nr_purged++;
 
 		if (is_vn_id_valid(vn_id) && !vn->skip_populate)
@@ -2235,6 +2236,8 @@ static void purge_vmap_node(struct work_struct *work)
 		list_add(&va->list, &local_list);
 	}
 
+	atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
+
 	reclaim_list_global(&local_list);
 }
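[Editorial note: the pattern the patch applies, restated as a standalone
C11 sketch for readers outside the kernel: accumulate into a local inside
the loop and issue one atomic update at the end, instead of one
lock-prefixed read-modify-write per list element. Names are illustrative,
not the kernel's.]

#include <stdatomic.h>
#include <stdio.h>

static atomic_long lazy_pages;	/* stand-in for vmap_lazy_nr */

/* Before: one locked RMW per element, n lock-prefixed instructions. */
static void drain_per_item(const long *pages, int n)
{
	for (int i = 0; i < n; i++)
		atomic_fetch_sub_explicit(&lazy_pages, pages[i],
					  memory_order_relaxed);
}

/* After: n plain additions into a local, then a single locked RMW. */
static void drain_batched(const long *pages, int n)
{
	long batch = 0;

	for (int i = 0; i < n; i++)
		batch += pages[i];

	atomic_fetch_sub_explicit(&lazy_pages, batch,
				  memory_order_relaxed);
}

int main(void)
{
	long pages[] = { 4, 8, 16 };

	atomic_store(&lazy_pages, 28);
	drain_per_item(pages, 3);
	printf("%ld\n", atomic_load(&lazy_pages));	/* 0 */

	atomic_store(&lazy_pages, 28);
	drain_batched(pages, 3);
	printf("%ld\n", atomic_load(&lazy_pages));	/* 0, one locked op */

	return 0;
}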