
[1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area

Message ID 20240829130633.2184-1-ahuang12@lenovo.com (mailing list archive)
State New

Commit Message

Adrian Huang Aug. 29, 2024, 1:06 p.m. UTC
From: Adrian Huang <ahuang12@lenovo.com>

When running vmalloc stress tests on a 448-core system, the average
latency of purge_vmap_node() is observed to be about 2 seconds, as
measured with the eBPF/bcc 'funclatency.py' tool [1].

  # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1

     usecs             : count    distribution
        0 -> 1         : 0       |                                        |
        2 -> 3         : 29      |                                        |
        4 -> 7         : 19      |                                        |
        8 -> 15        : 56      |                                        |
       16 -> 31        : 483     |****                                    |
       32 -> 63        : 1548    |************                            |
       64 -> 127       : 2634    |*********************                   |
      128 -> 255       : 2535    |*********************                   |
      256 -> 511       : 1776    |**************                          |
      512 -> 1023      : 1015    |********                                |
     1024 -> 2047      : 573     |****                                    |
     2048 -> 4095      : 488     |****                                    |
     4096 -> 8191      : 1091    |*********                               |
     8192 -> 16383     : 3078    |*************************               |
    16384 -> 32767     : 4821    |****************************************|
    32768 -> 65535     : 3318    |***************************             |
    65536 -> 131071    : 1718    |**************                          |
   131072 -> 262143    : 2220    |******************                      |
   262144 -> 524287    : 1147    |*********                               |
   524288 -> 1048575   : 1179    |*********                               |
  1048576 -> 2097151   : 822     |******                                  |
  2097152 -> 4194303   : 906     |*******                                 |
  4194304 -> 8388607   : 2148    |*****************                       |
  8388608 -> 16777215  : 4497    |*************************************   |
 16777216 -> 33554431  : 289     |**                                      |

  avg = 2041714 usecs, total: 78381401772 usecs, count: 38390

  The worst case falls into the 16-33 second bucket, so a soft lockup is
  triggered [2].

[Root Cause]
1) Each vmap_node has a long purge_list. The following shows the number
   of vmap_area structures purged per node.

   crash> p vmap_nodes
   vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
   crash> vmap_node 0xff2de5a900100000 128 | grep nr_purged
     nr_purged = 663070
     ...
     nr_purged = 821670
     nr_purged = 692214
     nr_purged = 726808
     ...

2) atomic_long_sub() uses the 'lock' prefix to make the subtraction
   atomic when purging each vmap_area. However, the loop iterates over
   more than 600,000 vmap_area structures (see 'nr_purged' above).

   Here is objdump output:

     $ objdump -D vmlinux
     ffffffff813e8c80 <purge_vmap_node>:
     ...
     ffffffff813e8d70:  f0 48 29 2d 68 0c bb  lock sub %rbp,0x2bb0c68(%rip)
     ...

   Quote from the "Instruction tables" PDF [3]:
     Instructions with a LOCK prefix have a long latency that depends on
     cache organization and possibly RAM speed. If there are multiple
     processors or cores or direct memory access (DMA) devices, then all
     locked instructions will lock a cache line for exclusive access,
     which may involve RAM access. A LOCK prefix typically costs more
     than a hundred clock cycles, even on single-processor systems.

   That's why the latency of purge_vmap_node() increases dramatically
   on a many-core system: one core is busy purging each vmap_area of
   the *long* purge_list and executing atomic_long_sub() for each
   vmap_area, while other cores free vmalloc allocations and execute
   atomic_long_add_return() in free_vmap_area_noflush().

[Solution]
Use a local variable to accumulate the total number of purged pages, and
call atomic_long_sub() once after the traversal of the purge_list is done.
Experiment results show a latency improvement of about 99%.

[Experiment Result]
1) System Configuration: Three servers (with HT enabled) are tested.
     * 72-core server: 3rd Gen Intel Xeon Scalable Processor*1
     * 192-core server: 5th Gen Intel Xeon Scalable Processor*2
     * 448-core server: AMD Zen 4 Processor*2

2) Kernel Config
     * CONFIG_KASAN is disabled

3) The data in the "w/o patch" and "w/ patch" columns
     * Unit: microseconds (us)
     * Each value is the average of 3 measurements

         System        w/o patch (us)   w/ patch (us)    Improvement (%)
     ---------------   --------------   -------------    -------------
     72-core server          2194              14            99.36%
     192-core server       143799            1139            99.21%
     448-core server      1992122            6883            99.65%

[1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
[2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12 
[3] https://www.agner.org/optimize/instruction_tables.pdf

Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
---
 mm/vmalloc.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Uladzislau Rezki Aug. 29, 2024, 7 p.m. UTC | #1
On Thu, Aug 29, 2024 at 09:06:33PM +0800, Adrian Huang wrote:
> From: Adrian Huang <ahuang12@lenovo.com>
> 
> [ ... commit message trimmed; identical to the one above ... ]
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 3f9b6bd707d2..607697c81e60 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2210,6 +2210,7 @@ static void purge_vmap_node(struct work_struct *work)
>  {
>  	struct vmap_node *vn = container_of(work,
>  		struct vmap_node, purge_work);
> +	unsigned long nr_purged_pages = 0;
>  	struct vmap_area *va, *n_va;
>  	LIST_HEAD(local_list);
>  
> @@ -2224,7 +2225,7 @@ static void purge_vmap_node(struct work_struct *work)
>  
>  		list_del_init(&va->list);
>  
> -		atomic_long_sub(nr, &vmap_lazy_nr);
> +		nr_purged_pages += nr;
>  		vn->nr_purged++;
>  
>  		if (is_vn_id_valid(vn_id) && !vn->skip_populate)
> @@ -2235,6 +2236,8 @@ static void purge_vmap_node(struct work_struct *work)
>  		list_add(&va->list, &local_list);
>  	}
>  
> +	atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
> +
>  	reclaim_list_global(&local_list);
>  }
>  
> -- 
> 2.34.1
> 
I see the point and it looks good to me.

Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Thank you for improving this. There is one more spot, which I detected
earlier:

<snip>
static void free_vmap_area_noflush(struct vmap_area *va)
{
	unsigned long nr_lazy_max = lazy_max_pages();
	unsigned long va_start = va->va_start;
	unsigned int vn_id = decode_vn_id(va->flags);
	struct vmap_node *vn;
	unsigned long nr_lazy;

	if (WARN_ON_ONCE(!list_empty(&va->list)))
		return;

	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
				PAGE_SHIFT, &vmap_lazy_nr);

...
<snip>

atomic_long_add_return() might also introduce high contention. We could
optimize it by splitting it into lighter atomic operations. Can you check
it on your 448-core system?

Thanks!

--
Uladzislau Rezki
Uladzislau Rezki Aug. 30, 2024, 4:26 p.m. UTC | #2
On Thu, Aug 29, 2024 at 09:00:16PM +0200, Uladzislau Rezki wrote:
> On Thu, Aug 29, 2024 at 09:06:33PM +0800, Adrian Huang wrote:
> > [ ... quoted patch and commit message trimmed ... ]
> I see the point and it looks good to me.
> 
> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> 
> Thank you for improving this. There is one more spot which i detected
> earlier, it is:
> 
> <snip>
> static void free_vmap_area_noflush(struct vmap_area *va)
> {
> 	unsigned long nr_lazy_max = lazy_max_pages();
> 	unsigned long va_start = va->va_start;
> 	unsigned int vn_id = decode_vn_id(va->flags);
> 	struct vmap_node *vn;
> 	unsigned long nr_lazy;
> 
> 	if (WARN_ON_ONCE(!list_empty(&va->list)))
> 		return;
> 
> 	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
> 				PAGE_SHIFT, &vmap_lazy_nr);
> 
> ...
> <snip>
> 
> atomic_long_add_return() might also introduce a high contention. We can
> optimize by splitting into more light atomics. Can you check it on your
> 448-cores system?
> 
I have checked free_vmap_area_noflush() on my hardware. It is a 64-core
system:

<perf cycles>
...
+    7.84%     5.18%  [kernel]          [k] free_vmap_area_noflush
+    6.16%     1.61%  [kernel]          [k] free_unref_page
+    5.57%     1.51%  [kernel]          [k] find_unlink_vmap_area
...
<perf cycles>

<perf cycles annotate>
..
            │     arch_atomic64_add_return():
   23352402 │       mov  %r12,%rdx
            │       lock xadd %rdx,vmap_lazy_nr
            │     is_vn_id_valid():
52364447314 │       mov  nr_vmap_nodes,%ecx <----- the hottest spot, consuming the most CPU cycles (99%)
            │     arch_atomic64_add_return():
   45547180 │       add  %rdx,%r12        
            │     is_vn_id_valid():
...
<perf cycles annotate>

At least in my case, on my HW, I do not see atomic_long_add_return() at
the top when it comes to CPU cycles. The one below is the hottest instead:

static bool
is_vn_id_valid(unsigned int node_id)
{
	if (node_id < nr_vmap_nodes)
		return true;

	return false;
}

access to "nr_vmap_nodes" which is read-only and globally defined:

static __read_mostly unsigned int nr_vmap_nodes = 1;

Any thoughts?

--
Uladzislau Rezki
Andrew Morton Aug. 31, 2024, 12:33 a.m. UTC | #3
On Thu, 29 Aug 2024 21:06:33 +0800 Adrian Huang <adrianhuang0701@gmail.com> wrote:

> From: Adrian Huang <ahuang12@lenovo.com>
> 
> When running the vmalloc stress on a 448-core system, observe the average
> latency of purge_vmap_node() is about 2 seconds by using the eBPF/bcc
> 'funclatency.py' tool [1].
> 
> ...
>
> 3) The data in column "w/o patch" and "w/ patch"
>      * Unit: micro seconds (us)
>      * Each data is the average of 3-time measurements
> 
>          System        w/o patch (us)   w/ patch (us)    Improvement (%)
>      ---------------   --------------   -------------    -------------
>      72-core server          2194              14            99.36%
>      192-core server       143799            1139            99.21%
>      448-core server      1992122            6883            99.65%
> 

Holy cow.  I'll add cc:stable to this.  Partly because it fixes a
softlockup, partly because holy cow.
Christophe JAILLET Aug. 31, 2024, 7:03 a.m. UTC | #4
On 30/08/2024 18:26, Uladzislau Rezki wrote:

> At least in my case, HW, i do not see that atomic_long_add_return() is a
> top when it comes to CPU cycles. Below one is the hottest instead:
> 
> static bool
> is_vn_id_valid(unsigned int node_id)
> {
> 	if (node_id < nr_vmap_nodes)
> 		return true;
> 
> 	return false;
> }
> 
> access to "nr_vmap_nodes" which is read-only and globally defined:
> 
> static __read_mostly unsigned int nr_vmap_nodes = 1;
> 
> Any thoughts?
> 
> --
> Uladzislau Rezki
> 

Hi,

unrelated to your use case, but something that could easily save a few
cycles on some systems, IMHO.

Maybe:

#if NR_CPUS > 1
static __read_mostly unsigned int nr_vmap_nodes = 1;
static __read_mostly unsigned int vmap_zone_size = 1;
#else
#define nr_vmap_nodes	1
#define vmap_zone_size	1
#endif

So that the compiler can do a better job: some loops can be optimized
away, and there is no need to access memory to get these values.

Not sure if such a use case can exist or is of any interest.

This is valid because of [1] and the #ifdef around the
num_possible_cpus() declaration [2, 3].


Just my 2c.

CJ

[1]: https://elixir.bootlin.com/linux/v6.11-rc5/source/mm/vmalloc.c#L5026

[2]: https://elixir.bootlin.com/linux/v6.11-rc5/source/include/linux/cpumask.h#L1083
[3]: https://elixir.bootlin.com/linux/v6.11-rc5/source/include/linux/cpumask.h#L1136
Adrian Huang Sept. 2, 2024, noon UTC | #5
On Fri, Aug 30, 2024 at 3:00 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> atomic_long_add_return() might also introduce a high contention. We can
> optimize by splitting into more light atomics. Can you check it on your
> 448-cores system?

Interestingly, the following result shows that the average latency of
free_vmap_area_noflush() is just 26 usecs (the worst case is 16ms-32ms).

/home/git-repo/bcc/tools/funclatency.py -u free_vmap_area_noflush & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1

         usecs       : count     distribution
         0 -> 1      : 18166     |                                        |
         2 -> 3      : 41929818  |**                                      |
         4 -> 7      : 181203439 |***********                             |
         8 -> 15     : 464242836 |*****************************           |
        16 -> 31     : 620077545 |****************************************|
        32 -> 63     : 442133041 |****************************            |
        64 -> 127    : 111432597 |*******                                 |
       128 -> 255    : 3441649   |                                        |
       256 -> 511    : 302655    |                                        |
       512 -> 1023   : 738       |                                        |
      1024 -> 2047   : 73        |                                        |
      2048 -> 4095   : 0         |                                        |
      4096 -> 8191   : 0         |                                        |
      8192 -> 16383  : 0         |                                        |
     16384 -> 32767  : 196       |                                        |

   avg = 26 usecs, total: 49415657269 usecs, count: 1864782753


free_vmap_area_noflush() executes the lock prefix only once, so the
worst case might be just about a hundred clock cycles.

The problem with purge_vmap_node() is that some cores are busy purging
each vmap_area of the *long* purge_list and executing atomic_long_sub()
for each vmap_area, while other cores free vmalloc allocations and execute
atomic_long_add_return() in free_vmap_area_noflush(). The following crash
log shows that 22 cores are busy purging vmap_area structs [1]:

  crash> bt -a | grep "purge_vmap_node+291" | wc -l
  22

So, the latency of purge_vmap_node() increases dramatically because it
executes the lock prefix over 600,000 times. The issue is easier to
reproduce if more cores execute purge_vmap_node() simultaneously.

[1] https://gist.github.com/AdrianHuang/c0030dd7755e673ed00cb197b76ce0a7


I tested the following patch with the lighter atomics. However, the
average latency did not improve (but the worst case did):

         usecs        : count     distribution
         0 -> 1       : 7146      |                                        |
         2 -> 3       : 31734187  |**                                      |
         4 -> 7       : 161408609 |***********                             |
         8 -> 15      : 461411377 |*********************************       |
        16 -> 31      : 557005293 |****************************************|
        32 -> 63      : 435518485 |*******************************         |
        64 -> 127     : 175033097 |************                            |
       128 -> 255     : 42265379  |***                                     |
       256 -> 511     : 399112    |                                        |
       512 -> 1023    : 734       |                                        |
      1024 -> 2047    : 72        |                                        |

avg = 32 usecs, total: 59952713176 usecs, count: 1864783491

[free_vmap_area_noflush() assembly instructions wo/ the test patch]
ffffffff813e6e90 <free_vmap_area_noflush>:
...
ffffffff813e6ed4:       4c 8b 65 08             mov    0x8(%rbp),%r12
ffffffff813e6ed8:       4c 2b 65 00             sub    0x0(%rbp),%r12
ffffffff813e6edc:       49 c1 ec 0c             shr    $0xc,%r12
ffffffff813e6ee0:       4c 89 e2                mov    %r12,%rdx
ffffffff813e6ee3:       f0 48 0f c1 15 f4 2a    lock xadd %rdx,0x2bb2af4(%rip)        # ffffffff83f999e0 <vmap_lazy_nr>
ffffffff813e6eea:       bb 02
ffffffff813e6eec:       8b 0d 0e b1 a5 01       mov    0x1a5b10e(%rip),%ecx        # ffffffff82e42000 <nr_vmap_nodes>
ffffffff813e6ef2:       49 01 d4                add    %rdx,%r12
ffffffff813e6ef5:       39 c8                   cmp    %ecx,%eax
...

[free_vmap_area_noflush() assembly instructions w/ the test patch: no lock prefix]
ffffffff813e6e90 <free_vmap_area_noflush>:
...
ffffffff813e6edb:       48 8b 5d 08             mov    0x8(%rbp),%rbx
ffffffff813e6edf:       48 2b 5d 00             sub    0x0(%rbp),%rbx
ffffffff813e6ee3:       8b 0d d7 b0 a5 01       mov    0x1a5b0d7(%rip),%ecx        # ffffffff82e41fc0 <nr_vmap_nodes>
ffffffff813e6ee9:       48 c1 eb 0c             shr    $0xc,%rbx
ffffffff813e6eed:       4c 8b 25 b4 f1 92 01    mov    0x192f1b4(%rip),%r12        # ffffffff82d160a8 <vmap_nodes>
ffffffff813e6ef4:       48 01 d3                add    %rdx,%rbx
ffffffff813e6ef7:       48 89 1d e2 2a bb 02    mov    %rbx,0x2bb2ae2(%rip)        # ffffffff83f999e0 <vmap_lazy_nr>
ffffffff813e6efe:       39 c8                   cmp    %ecx,%eax
...


diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3f9b6bd707d2..3927c541440b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2357,8 +2357,9 @@ static void free_vmap_area_noflush(struct vmap_area *va)
 	if (WARN_ON_ONCE(!list_empty(&va->list)))
 		return;
 
-	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
-				PAGE_SHIFT, &vmap_lazy_nr);
+	nr_lazy = atomic_long_read(&vmap_lazy_nr);
+	nr_lazy += (va->va_end - va->va_start) >> PAGE_SHIFT;
+	atomic_long_set(&vmap_lazy_nr, nr_lazy);
 
 	/*
 	 * If it was request by a certain node we would like to
Uladzislau Rezki Sept. 2, 2024, 5 p.m. UTC | #6
On Mon, Sep 02, 2024 at 08:00:46PM +0800, Adrian Huang wrote:
> On Fri, Aug 30, 2024 at 3:00 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > atomic_long_add_return() might also introduce a high contention. We can
> > optimize by splitting into more light atomics. Can you check it on your
> > 448-cores system?
> 
> Interestingly, the following result shows the latency of
> free_vmap_area_noflush() is just 26 usecs (The worst case is 16ms-32ms).
> 
> [ ... funclatency histogram trimmed; see the previous message ... ]
> 
> 
> free_vmap_area_noflush() executes the lock prefix only once, so the
> worst case might be just about a hundred clock cycles.
> 
> The problem in purge_vmap_node() is that some cores are busy purging
> each vmap_area of the *long* purge_list and executing atomic_long_sub()
> for each vmap_area, while other cores free vmalloc allocations and execute
> atomic_long_add_return() in free_vmap_area_noflush(). The following crash
> log shows that 22 cores are busy purging vmap_area structs [1]:
> 
>   crash> bt -a | grep "purge_vmap_node+291" | wc -l
>   22
> 
> So, the latency of purge_vmap_node() increases dramatically because it
> executes the lock prefix over 6,000,000 times. The issue is easier
> to reproduce when more cores execute purge_vmap_node() simultaneously.
> 
Right. This is clear to me. Under heavy stressing in a tight loop we
invoke atomic_long_sub() once per freed VA. With 448 cores and one
stress job per CPU we end up with a high-contention spot: access to
the atomic requires a cache-line lock.

> 
> 
> Tested the following patch with the lighter atomics. However, the average
> latency did not improve (though the worst case did):
> 
>          usecs        : count     distribution
>          0 -> 1       : 7146      |                                        |
>          2 -> 3       : 31734187  |**                                      |
>          4 -> 7       : 161408609 |***********                             |
>          8 -> 15      : 461411377 |*********************************       |
>         16 -> 31      : 557005293 |****************************************|
>         32 -> 63      : 435518485 |*******************************         |
>         64 -> 127     : 175033097 |************                            |
>        128 -> 255     : 42265379  |***                                     |
>        256 -> 511     : 399112    |                                        |
>        512 -> 1023    : 734       |                                        |
>       1024 -> 2047    : 72        |                                        |
> 
> avg = 32 usecs, total: 59952713176 usecs, count: 1864783491
> 
Thank you for checking this! So there is no difference. As for the worst
case, it might be a measurement error. The problem is that the measured
time includes a context switch, because a context which triggers
free_vmap_area_noflush() can easily be preempted.

--
Uladzislau Rezki
Uladzislau Rezki Sept. 2, 2024, 5:03 p.m. UTC | #7
Hello!

> 
> Hi,
> 
> unrelated to your use case, but something that could easily save a few
> cycles on some systems, IMHO.
> 
> Maybe:
> 
> #if NR_CPUS > 1
> static __read_mostly unsigned int nr_vmap_nodes = 1;
> static __read_mostly unsigned int vmap_zone_size = 1;
> #else
> #define nr_vmap_nodes	1
> #define vmap_zone_size	1
> #endif
> 
> So that the compiler can do a better job: some loops can be optimized
> away and there is no need to access memory to get these values.
> 
> Not sure whether such a use case exists or is of any interest.
> 
> This is valid because of [1] and the #ifdef around the num_possible_cpus()
> declaration [2, 3].
> 
> 
> Just my 2c.
> 
Thank you, I see your point.

--
Uladzislau Rezki
diff mbox series

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3f9b6bd707d2..607697c81e60 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2210,6 +2210,7 @@  static void purge_vmap_node(struct work_struct *work)
 {
 	struct vmap_node *vn = container_of(work,
 		struct vmap_node, purge_work);
+	unsigned long nr_purged_pages = 0;
 	struct vmap_area *va, *n_va;
 	LIST_HEAD(local_list);
 
@@ -2224,7 +2225,7 @@  static void purge_vmap_node(struct work_struct *work)
 
 		list_del_init(&va->list);
 
-		atomic_long_sub(nr, &vmap_lazy_nr);
+		nr_purged_pages += nr;
 		vn->nr_purged++;
 
 		if (is_vn_id_valid(vn_id) && !vn->skip_populate)
@@ -2235,6 +2236,8 @@  static void purge_vmap_node(struct work_struct *work)
 		list_add(&va->list, &local_list);
 	}
 
+	atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
+
 	reclaim_list_global(&local_list);
 }