[0/9] Mitigate a vmap lock contention

Message ID 20230522110849.2921-1-urezki@gmail.com (mailing list archive)

Message

Uladzislau Rezki May 22, 2023, 11:08 a.m. UTC
Hello, folks.

1. This is a followup to the vmap topic that was highlighted at the LSFMMBPF-2023
conference. This small series attempts to mitigate the lock contention across the
vmap/vmalloc code. The problem is described here:

wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf

The material is tagged as a v2 version. It contains extra slides about throughput
testing, the test steps, and a comparison with the current approach.

2. Motivation.

- The vmap code does not scale with the number of CPUs and this should be fixed;
- XFS folks have complained several times that vmalloc can be contended on
  their workloads:

<snip>
commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Jan 4 17:22:18 2022 -0800

    xfs: reduce kvmalloc overhead for CIL shadow buffers
    
    Oh, let me count the ways that the kvmalloc API sucks dog eggs.
    
    The problem is when we are logging lots of large objects, we hit
    kvmalloc really damn hard with costly order allocations, and
    behaviour utterly sucks:
...
<snip>

- If we match the per-cpu allocator in terms of performance, we can
  remove it and simplify the vmap code. Currently we have three allocators,
  see:

<snip>
/*** Per cpu kva allocator ***/

/*
 * vmap space is limited especially on 32 bit architectures. Ensure there is
 * room for at least 16 percpu vmap blocks per CPU.
 */
/*
 * If we had a constant VMALLOC_START and VMALLOC_END, we'd like to be able
 * to #define VMALLOC_SPACE		(VMALLOC_END-VMALLOC_START). Guess
 * instead (we just need a rough idea)
 */
<snip>

3. Test

On my AMD Ryzen Threadripper 3970X 32-Core Processor, I get the figures below:

    1-page     1-page-this-patch
1  0.576131   vs   0.555889
2   2.68376   vs    1.07895
3   4.26502   vs    1.01739
4   6.04306   vs    1.28924
5   8.04786   vs    1.57616
6   9.38844   vs    1.78142
7   9.53481   vs    2.00172
8   10.4609   vs    2.15964
9   10.6089   vs      2.484
10  11.7372   vs    2.40443
11  11.5969   vs    2.71635
12   13.053   vs     2.6162
13  12.2973   vs      2.843
14  13.1429   vs    2.85714
15  13.7348   vs    2.90691
16  14.3687   vs     3.0285
17  14.8823   vs    3.05717
18  14.9571   vs    2.98018
19  14.9127   vs     3.0951
20  16.0786   vs    3.19521
21  15.8055   vs    3.24915
22  16.8087   vs     3.2521
23  16.7298   vs    3.25698
24   17.244   vs    3.36586
25  17.8416   vs    3.39384
26  18.8008   vs    3.40907
27  18.5591   vs     3.5254
28   19.761   vs    3.55468
29    20.06   vs    3.59869
30  20.4353   vs     3.6991
31  20.9082   vs    3.73028
32  21.0865   vs    3.82904

1..32 is the number of jobs. The results are in microseconds and represent the
cost of a single vmalloc()/vfree() pair (lower is better).
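
For reference, the sketch below shows the kind of measurement behind these
figures: N kernel threads, each timing a tight vmalloc()/vfree() loop of
one-page allocations and reporting the average cost of a pair. It is an
illustrative example only - the module, parameter and loop count here are my
assumptions, not the exact harness used for the table above (the in-tree
lib/test_vmalloc.c module is a more complete tool for this kind of testing):

<snip>
/* Illustrative micro-benchmark: average cost of a vmalloc()/vfree() pair. */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/vmalloc.h>
#include <linux/ktime.h>

static int nr_threads = 1;
module_param(nr_threads, int, 0444);

static int bench_fn(void *unused)
{
	ktime_t start = ktime_get();
	int i;

	for (i = 0; i < 1000000; i++) {
		void *p = vmalloc(PAGE_SIZE);	/* 1-page request */

		if (!p)
			break;
		vfree(p);
	}

	if (i)
		pr_info("vmalloc/vfree pair: %lld usec on average\n",
			ktime_us_delta(ktime_get(), start) / i);
	return 0;
}

static int __init bench_init(void)
{
	int i;

	/* Spawn "jobs"; each thread exits on its own when its loop is done. */
	for (i = 0; i < nr_threads; i++)
		kthread_run(bench_fn, NULL, "vmalloc-bench/%d", i);
	return 0;
}
module_init(bench_init);
MODULE_LICENSE("GPL");
<snip>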

The series is based on the v6.3 tag and is considered a beta version. Please note
that it does not support the vread() functionality yet, so it is not fully
complete.

Any input/thoughts are welcome.

Uladzislau Rezki (Sony) (9):
  mm: vmalloc: Add va_alloc() helper
  mm: vmalloc: Rename adjust_va_to_fit_type() function
  mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
  mm: vmalloc: Add a per-CPU-zone infrastructure
  mm: vmalloc: Insert busy-VA per-cpu zone
  mm: vmalloc: Support multiple zones in vmallocinfo
  mm: vmalloc: Insert lazy-VA per-cpu zone
  mm: vmalloc: Offload free_vmap_area_lock global lock
  mm: vmalloc: Scale and activate cvz_size

 mm/vmalloc.c | 641 +++++++++++++++++++++++++++++++++++----------------
 1 file changed, 448 insertions(+), 193 deletions(-)

Comments

Harry (Hyeonggon) Yoo May 23, 2023, 11:59 a.m. UTC | #1
On Mon, May 22, 2023 at 01:08:40PM +0200, Uladzislau Rezki (Sony) wrote:
> Hello, folk.
> 
> 1. This is a followup of the vmap topic that was highlighted at the LSFMMBPF-2023
> conference. This small serial attempts to mitigate the contention across the
> vmap/vmalloc code. The problem is described here:
> 

Hello Uladzislau, thank you for the work!

> wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf

I ran exactly the same command but couldn't download the file; did I
miss something?

$ wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf
[...]
==> PASV ... done.    ==> RETR Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf ... 
No such file `Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf'.

> The material is tagged as a v2 version. It contains extra slides about testing
> the throughput, steps and comparison with a current approach.
> 
> 2. Motivation.
> 
> - The vmap code is not scalled to number of CPUs and this should be fixed;
> - XFS folk has complained several times that vmalloc might be contented on
>   their workloads:
> 
> <snip>
> commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> Author: Dave Chinner <dchinner@redhat.com>
> Date:   Tue Jan 4 17:22:18 2022 -0800
> 
>     xfs: reduce kvmalloc overhead for CIL shadow buffers
>     
>     Oh, let me count the ways that the kvmalloc API sucks dog eggs.
>     
>     The problem is when we are logging lots of large objects, we hit
>     kvmalloc really damn hard with costly order allocations, and
>     behaviour utterly sucks:

Based on the commit, I guess the reason XFS uses vmalloc/kvmalloc is that
it allocates large buffers. How large can they be?

> 3. Test
> 
> On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
> 
>     1-page     1-page-this-patch
> 1  0.576131   vs   0.555889
> 2   2.68376   vs    1.07895
> 3   4.26502   vs    1.01739
> 4   6.04306   vs    1.28924
> 5   8.04786   vs    1.57616
> 6   9.38844   vs    1.78142

<snip>

> 29    20.06   vs    3.59869
> 30  20.4353   vs     3.6991
> 31  20.9082   vs    3.73028
> 32  21.0865   vs    3.82904
> 
> 1..32 - is a number of jobs. The results are in usec and is a vmallco()/vfree()
> pair throughput.

I would be more interested in real numbers than synthetic benchmarks.
Maybe the XFS folks could help by performing profiling similar to commit 8dc9384b7d750,
with and without this patchset?

By the way, looking at the commit, teaching __p?d_alloc() about gfp
context (something I'm _slowly_ working on...) could be nice for allowing
non-GFP_KERNEL kvmalloc allocations, as Matthew mentioned. [1]

Thanks!

[1] https://lore.kernel.org/linux-mm/Y%2FOHC33YLedMXTlD@casper.infradead.org
Uladzislau Rezki May 23, 2023, 3:12 p.m. UTC | #2
On Tue, May 23, 2023 at 08:59:05PM +0900, Hyeonggon Yoo wrote:
> On Mon, May 22, 2023 at 01:08:40PM +0200, Uladzislau Rezki (Sony) wrote:
> > Hello, folk.
> > 
> > 1. This is a followup of the vmap topic that was highlighted at the LSFMMBPF-2023
> > conference. This small serial attempts to mitigate the contention across the
> > vmap/vmalloc code. The problem is described here:
> > 
> 
> Hello Uladzislau, thank you for the work!
> 
> > wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf
> 
> I ran the exactly same command but couldn't download the file, did I
> miss something?
> 
wget ftp://vps418301.ovh.net/incoming/Mitigate_a_vmalloc_lock_contention_in_SMP_env_v2.pdf

Sorry, I renamed the file :)

> $ wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf
> [...]
> ==> PASV ... done.    ==> RETR Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf ... 
> No such file `Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf'.
> 
> > The material is tagged as a v2 version. It contains extra slides about testing
> > the throughput, steps and comparison with a current approach.
> > 
> > 2. Motivation.
> > 
> > - The vmap code is not scalled to number of CPUs and this should be fixed;
> > - XFS folk has complained several times that vmalloc might be contented on
> >   their workloads:
> > 
> > <snip>
> > commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> > Author: Dave Chinner <dchinner@redhat.com>
> > Date:   Tue Jan 4 17:22:18 2022 -0800
> > 
> >     xfs: reduce kvmalloc overhead for CIL shadow buffers
> >     
> >     Oh, let me count the ways that the kvmalloc API sucks dog eggs.
> >     
> >     The problem is when we are logging lots of large objects, we hit
> >     kvmalloc really damn hard with costly order allocations, and
> >     behaviour utterly sucks:
> 
> based on the commit I guess xfs should use vmalloc/kvmalloc is because
> it allocates large buffers, how large could it be?
> 
They use kvmalloc(). When the page allocator is not able to serve a
request, it falls back to vmalloc. From what I see, the sizes are:

from 73728 up to 1048576 bytes, i.e. 18 pages up to 256 pages.
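
(For context, the generic fallback works roughly like the sketch below. It is
a simplified illustration of the kvmalloc() idea from mm/util.c, not the exact
XFS or mm code; the large, costly-order requests are the ones that end up on
the vmalloc side and hence hit the vmap locks.)

<snip>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/gfp.h>

/* Simplified sketch of the kvmalloc() fallback idea (see mm/util.c). */
static void *kvmalloc_like(size_t size, gfp_t flags)
{
	/*
	 * Try the page allocator first, but do not retry hard: a costly
	 * high-order request is allowed to fail quickly.
	 */
	void *p = kmalloc(size, flags | __GFP_NOWARN | __GFP_NORETRY);

	if (p)
		return p;

	/* Fall back to vmalloc space - this is where vmap contention shows up. */
	return __vmalloc(size, flags);
}
<snip>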

> > 3. Test
> > 
> > On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
> > 
> >     1-page     1-page-this-patch
> > 1  0.576131   vs   0.555889
> > 2   2.68376   vs    1.07895
> > 3   4.26502   vs    1.01739
> > 4   6.04306   vs    1.28924
> > 5   8.04786   vs    1.57616
> > 6   9.38844   vs    1.78142
> 
> <snip>
> 
> > 29    20.06   vs    3.59869
> > 30  20.4353   vs     3.6991
> > 31  20.9082   vs    3.73028
> > 32  21.0865   vs    3.82904
> > 
> > 1..32 - is a number of jobs. The results are in usec and is a vmallco()/vfree()
> > pair throughput.
> 
> I would be more interested in real numbers than synthetic benchmarks,
> Maybe XFS folks could help performing profiling similar to commit 8dc9384b7d750
> with and without this patchset?
> 
I added Dave Chinner <david@fromorbit.com> to this thread. But the
contention exists. Apart from that, the per-cpu KVA allocator can go away
if we make it generic instead.

> By the way looking at the commit, teaching __p?d_alloc() about gfp
> context (that I'm _slowly_ working on...) could be nice for allowing
> non-GFP_KERNEL kvmalloc allocations, as Matthew mentioned. [1]
> 
> Thanks!
> 
> [1] https://lore.kernel.org/linux-mm/Y%2FOHC33YLedMXTlD@casper.infradead.org
> 

Thanks!

--
Uladzislau Rezki
Harry (Hyeonggon) Yoo May 23, 2023, 6:04 p.m. UTC | #3
On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > > 2. Motivation.
> > > 
> > > - The vmap code is not scalled to number of CPUs and this should be fixed;
> > > - XFS folk has complained several times that vmalloc might be contented on
> > >   their workloads:
> > > 
> > > <snip>
> > > commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> > > Author: Dave Chinner <dchinner@redhat.com>
> > > Date:   Tue Jan 4 17:22:18 2022 -0800
> > > 
> > >     xfs: reduce kvmalloc overhead for CIL shadow buffers
> > >     
> > >     Oh, let me count the ways that the kvmalloc API sucks dog eggs.
> > >     
> > >     The problem is when we are logging lots of large objects, we hit
> > >     kvmalloc really damn hard with costly order allocations, and
> > >     behaviour utterly sucks:
> > 
> > based on the commit I guess xfs should use vmalloc/kvmalloc is because
> > it allocates large buffers, how large could it be?
> > 
> They use kvmalloc(). When the page allocator is not able to serve a
> request they fallback to vmalloc. At least what i see, the sizes are:
> 
> from 73728 up to 1048576, i.e. 18 pages up to 256 pages.
> 
> > > 3. Test
> > > 
> > > On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
> > > 
> > >     1-page     1-page-this-patch
> > > 1  0.576131   vs   0.555889
> > > 2   2.68376   vs    1.07895
> > > 3   4.26502   vs    1.01739
> > > 4   6.04306   vs    1.28924
> > > 5   8.04786   vs    1.57616
> > > 6   9.38844   vs    1.78142
> > 
> > <snip>
> > 
> > > 29    20.06   vs    3.59869
> > > 30  20.4353   vs     3.6991
> > > 31  20.9082   vs    3.73028
> > > 32  21.0865   vs    3.82904
> > > 
> > > 1..32 - is a number of jobs. The results are in usec and is a vmallco()/vfree()
> > > pair throughput.
> > 
> > I would be more interested in real numbers than synthetic benchmarks,
> > Maybe XFS folks could help performing profiling similar to commit 8dc9384b7d750
> > with and without this patchset?
> > 
> I added Dave Chinner <david@fromorbit.com> to this thread.

Oh, I missed that, and it would be better to [+Cc linux-xfs]

> But. The contention exists.

I think "theoretically can be contended" doesn't necessarily mean it's actually
contended in the real world.

Also I find it difficult to imagine vmalloc being highly contended because it was
historically considered slow and thus discouraged when performance is important.

IOW, vmalloc would not be contended when the allocation size is small because we have
the kmalloc/buddy API, so I wonder which workloads allocate very large buffers and,
at the same time, allocate very frequently and are thus performance-sensitive.

I am not against this series, but wondering which workloads would benefit ;)

> Apart of that per-cpu-KVA allocator can go away if we make it generic instead.

Not sure I understand your point, can you elaborate please?

And I would like to ask some side questions:

1. Is the vm_[un]map_ram() API still worthwhile with this patchset?

2. How does this patchset deal with 32-bit machines where
   vmalloc address space is limited?

Thanks!

> > By the way looking at the commit, teaching __p?d_alloc() about gfp
> > context (that I'm _slowly_ working on...) could be nice for allowing
> > non-GFP_KERNEL kvmalloc allocations, as Matthew mentioned. [1]
> > 
> > Thanks!
> > 
> > [1] https://lore.kernel.org/linux-mm/Y%2FOHC33YLedMXTlD@casper.infradead.org
> >
Dave Chinner May 23, 2023, 9:43 p.m. UTC | #4
On Wed, May 24, 2023 at 03:04:28AM +0900, Hyeonggon Yoo wrote:
> On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > > > 2. Motivation.
> > > > 
> > > > - The vmap code is not scalled to number of CPUs and this should be fixed;
> > > > - XFS folk has complained several times that vmalloc might be contented on
> > > >   their workloads:
> > > > 
> > > > <snip>
> > > > commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> > > > Author: Dave Chinner <dchinner@redhat.com>
> > > > Date:   Tue Jan 4 17:22:18 2022 -0800
> > > > 
> > > >     xfs: reduce kvmalloc overhead for CIL shadow buffers
> > > >     
> > > >     Oh, let me count the ways that the kvmalloc API sucks dog eggs.
> > > >     
> > > >     The problem is when we are logging lots of large objects, we hit
> > > >     kvmalloc really damn hard with costly order allocations, and
> > > >     behaviour utterly sucks:
> > > 
> > > based on the commit I guess xfs should use vmalloc/kvmalloc is because
> > > it allocates large buffers, how large could it be?
> > > 
> > They use kvmalloc(). When the page allocator is not able to serve a
> > request they fallback to vmalloc. At least what i see, the sizes are:
> > 
> > from 73728 up to 1048576, i.e. 18 pages up to 256 pages.
> > 
> > > > 3. Test
> > > > 
> > > > On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
> > > > 
> > > >     1-page     1-page-this-patch
> > > > 1  0.576131   vs   0.555889
> > > > 2   2.68376   vs    1.07895
> > > > 3   4.26502   vs    1.01739
> > > > 4   6.04306   vs    1.28924
> > > > 5   8.04786   vs    1.57616
> > > > 6   9.38844   vs    1.78142
> > > 
> > > <snip>
> > > 
> > > > 29    20.06   vs    3.59869
> > > > 30  20.4353   vs     3.6991
> > > > 31  20.9082   vs    3.73028
> > > > 32  21.0865   vs    3.82904
> > > > 
> > > > 1..32 - is a number of jobs. The results are in usec and is a vmallco()/vfree()
> > > > pair throughput.
> > > 
> > > I would be more interested in real numbers than synthetic benchmarks,
> > > Maybe XFS folks could help performing profiling similar to commit 8dc9384b7d750
> > > with and without this patchset?
> > > 
> > I added Dave Chinner <david@fromorbit.com> to this thread.
> 
> Oh, I missed that, and it would be better to [+Cc linux-xfs]
> 
> > But. The contention exists.
> 
> I think "theoretically can be contended" doesn't necessarily mean it's actually
> contended in the real world.

Did you not read the commit message for the XFS commit documented
above? vmalloc lock contention most certainly does exist in the
real world, and the profiles in commit 8dc9384b7d75 ("xfs: reduce
kvmalloc overhead for CIL shadow buffers") document it clearly.

> Also I find it difficult to imagine vmalloc being highly contended because it was
> historically considered slow and thus discouraged when performance is important.

Read the above XFS commit.

We use vmalloc in critical high-performance fast paths that cannot
tolerate high-order memory allocation failure. XFS runs this
fast path millions of times a second, and will call into
vmalloc() several hundred thousand times a second with machine-wide
concurrency under certain types of workloads.

> IOW vmalloc would not be contended when allocation size is small because we have
> kmalloc/buddy API, and therefore I wonder which workloads are allocating very large
> buffers and at the same time allocating very frequently, thus performance-sensitive.
>
> I am not against this series, but wondering which workloads would benefit ;)

Yup, you need to read the XFS commit message. If you understood what
is in that commit message, you wouldn't be doubting that
vmalloc contention is real and that vmalloc is used in high-performance
fast paths that are traversed millions of times a second....

> > Apart of that per-cpu-KVA allocator can go away if we make it generic instead.
> 
> Not sure I understand your point, can you elaborate please?
> 
> And I would like to ask some side questions:
> 
> 1. Is vm_[un]map_ram() API still worth with this patchset?

XFS also uses this interface for mapping multi-page buffers in the
XFS buffer cache. These are the items that also require the high
order costly kvmalloc allocations in the transaction commit path
when they are modified.

So, yes, we need these mapping interfaces to scale just as well as
vmalloc itself....

> 2. How does this patchset deals with 32-bit machines where
>    vmalloc address space is limited?

From the XFS side, we just don't care about 32 bit machines at all.
XFS is aimed at server and HPC environments which have been entirely
64 bit for a long, long time now...

-Dave.
Harry (Hyeonggon) Yoo May 24, 2023, 1:30 a.m. UTC | #5
On Wed, May 24, 2023 at 07:43:38AM +1000, Dave Chinner wrote:
> On Wed, May 24, 2023 at 03:04:28AM +0900, Hyeonggon Yoo wrote:
> > On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > > > > 2. Motivation.
> > > > > 
> > > > > - The vmap code is not scalled to number of CPUs and this should be fixed;
> > > > > - XFS folk has complained several times that vmalloc might be contented on
> > > > >   their workloads:
> > > > > 
> > > > > <snip>
> > > > > commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> > > > > Author: Dave Chinner <dchinner@redhat.com>
> > > > > Date:   Tue Jan 4 17:22:18 2022 -0800
> > > > > 
> > > > >     xfs: reduce kvmalloc overhead for CIL shadow buffers
> > > > >     
> > > > >     Oh, let me count the ways that the kvmalloc API sucks dog eggs.
> > > > >     
> > > > >     The problem is when we are logging lots of large objects, we hit
> > > > >     kvmalloc really damn hard with costly order allocations, and
> > > > >     behaviour utterly sucks:
> > > > 
> > > > based on the commit I guess xfs should use vmalloc/kvmalloc is because
> > > > it allocates large buffers, how large could it be?
> > > > 
> > > They use kvmalloc(). When the page allocator is not able to serve a
> > > request they fallback to vmalloc. At least what i see, the sizes are:
> > > 
> > > from 73728 up to 1048576, i.e. 18 pages up to 256 pages.
> > > 
> > > > > 3. Test
> > > > > 
> > > > > On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
> > > > > 
> > > > >     1-page     1-page-this-patch
> > > > > 1  0.576131   vs   0.555889
> > > > > 2   2.68376   vs    1.07895
> > > > > 3   4.26502   vs    1.01739
> > > > > 4   6.04306   vs    1.28924
> > > > > 5   8.04786   vs    1.57616
> > > > > 6   9.38844   vs    1.78142
> > > > 
> > > > <snip>
> > > > 
> > > > > 29    20.06   vs    3.59869
> > > > > 30  20.4353   vs     3.6991
> > > > > 31  20.9082   vs    3.73028
> > > > > 32  21.0865   vs    3.82904
> > > > > 
> > > > > 1..32 - is a number of jobs. The results are in usec and is a vmallco()/vfree()
> > > > > pair throughput.
> > > > 
> > > > I would be more interested in real numbers than synthetic benchmarks,
> > > > Maybe XFS folks could help performing profiling similar to commit 8dc9384b7d750
> > > > with and without this patchset?
> > > > 
> > > I added Dave Chinner <david@fromorbit.com> to this thread.
> > 
> > Oh, I missed that, and it would be better to [+Cc linux-xfs]
> > 
> > > But. The contention exists.
> > 
> > I think "theoretically can be contended" doesn't necessarily mean it's actually
> > contended in the real world.
> 
> Did you not read the commit message for the XFS commit documented
> above? vmalloc lock contention most c0ertainly does exist in the
> real world and the profiles in commit 8dc9384b7d75  ("xfs: reduce
> kvmalloc overhead for CIL shadow buffers") document it clearly.
>
> > Also I find it difficult to imagine vmalloc being highly contended because it was
> > historically considered slow and thus discouraged when performance is important.
> 
> Read the above XFS commit.
> 
> We use vmalloc in critical high performance fast paths that cannot
> tolerate high order memory allocation failure. XFS runs this
> fast path millions of times a second, and will call into
> vmalloc() several hundred thousands times a second with machine wide
> concurrency under certain types of workloads.
>
> > IOW vmalloc would not be contended when allocation size is small because we have
> > kmalloc/buddy API, and therefore I wonder which workloads are allocating very large
> > buffers and at the same time allocating very frequently, thus performance-sensitive.
> >
> > I am not against this series, but wondering which workloads would benefit ;)
> 
> Yup, you need to read the XFS commit message. If you understand what
> is in that commit message, then you wouldn't be doubting that
> vmalloc contention is real and that it is used in high performance
> fast paths that are traversed millions of times a second....

Oh, I read the commit but it seems to have slipped my mind - sorry for such a dumb
question. Now I get it, thank you so much. In any case, I didn't mean to offend;
I should've read and thought more before asking.

>
> > > Apart of that per-cpu-KVA allocator can go away if we make it generic instead.
> > 
> > Not sure I understand your point, can you elaborate please?
> > 
> > And I would like to ask some side questions:
> > 
> > 1. Is vm_[un]map_ram() API still worth with this patchset?
> 
> XFS also uses this interface for mapping multi-page buffers in the
> XFS buffer cache. These are the items that also require the high
> order costly kvmalloc allocations in the transaction commit path
> when they are modified.
> 
> So, yes, we need these mapping interfaces to scale just as well as
> vmalloc itself....

I mean, even before this series, vm_[un]map_ram() caches vmap_blocks
per CPU, but there is a limit on the size that can be served from the per-CPU cache.

But now that vmap() itself becomes scalable with this series, I wonder whether
they are still worthwhile - why not replace them with v[un]map()?
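
(For reference, the limit I mean is roughly the following; a sketch based on
v6.3 mm/vmalloc.c, the exact constants live in that file:)

<snip>
/* Sketch (based on v6.3 mm/vmalloc.c): vm_map_ram() uses the per-CPU
 * vmap block cache only for requests up to VMAP_MAX_ALLOC pages, i.e.
 * BITS_PER_LONG pages (256KB with 4KB pages on 64-bit); anything larger
 * already goes through alloc_vmap_area() and the global lock.
 */
#include <linux/bits.h>
#include <linux/types.h>

#define VMAP_MAX_ALLOC	BITS_PER_LONG	/* 256K with 4K pages */

static bool served_from_per_cpu_cache(unsigned int nr_pages)
{
	return nr_pages <= VMAP_MAX_ALLOC;
}
<snip>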
> 
> > 2. How does this patchset deals with 32-bit machines where
> >    vmalloc address space is limited?
> 
> From the XFS side, we just don't care about 32 bit machines at all.
> XFS is aimed at server and HPC environments which have been entirely
> 64 bit for a long, long time now...

But Linux still supports 32-bit machines and is not going to drop
support for them anytime soon, so I think there should at least be a way to
disable this feature.

Thanks!
Uladzislau Rezki May 24, 2023, 9:50 a.m. UTC | #6
On Wed, May 24, 2023 at 03:04:28AM +0900, Hyeonggon Yoo wrote:
> On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > > > 2. Motivation.
> > > > 
> > > > - The vmap code is not scalled to number of CPUs and this should be fixed;
> > > > - XFS folk has complained several times that vmalloc might be contented on
> > > >   their workloads:
> > > > 
> > > > <snip>
> > > > commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
> > > > Author: Dave Chinner <dchinner@redhat.com>
> > > > Date:   Tue Jan 4 17:22:18 2022 -0800
> > > > 
> > > >     xfs: reduce kvmalloc overhead for CIL shadow buffers
> > > >     
> > > >     Oh, let me count the ways that the kvmalloc API sucks dog eggs.
> > > >     
> > > >     The problem is when we are logging lots of large objects, we hit
> > > >     kvmalloc really damn hard with costly order allocations, and
> > > >     behaviour utterly sucks:
> > > 
> > > based on the commit I guess xfs should use vmalloc/kvmalloc is because
> > > it allocates large buffers, how large could it be?
> > > 
> > They use kvmalloc(). When the page allocator is not able to serve a
> > request they fallback to vmalloc. At least what i see, the sizes are:
> > 
> > from 73728 up to 1048576, i.e. 18 pages up to 256 pages.
> > 
> > > > 3. Test
> > > > 
> > > > On my: AMD Ryzen Threadripper 3970X 32-Core Processor, i have below figures:
> > > > 
> > > >     1-page     1-page-this-patch
> > > > 1  0.576131   vs   0.555889
> > > > 2   2.68376   vs    1.07895
> > > > 3   4.26502   vs    1.01739
> > > > 4   6.04306   vs    1.28924
> > > > 5   8.04786   vs    1.57616
> > > > 6   9.38844   vs    1.78142
> > > 
> > > <snip>
> > > 
> > > > 29    20.06   vs    3.59869
> > > > 30  20.4353   vs     3.6991
> > > > 31  20.9082   vs    3.73028
> > > > 32  21.0865   vs    3.82904
> > > > 
> > > > 1..32 - is a number of jobs. The results are in usec and is a vmallco()/vfree()
> > > > pair throughput.
> > > 
> > > I would be more interested in real numbers than synthetic benchmarks,
> > > Maybe XFS folks could help performing profiling similar to commit 8dc9384b7d750
> > > with and without this patchset?
> > > 
> > I added Dave Chinner <david@fromorbit.com> to this thread.
> 
> Oh, I missed that, and it would be better to [+Cc linux-xfs]
> 
> > But. The contention exists.
> 
> I think "theoretically can be contended" doesn't necessarily mean it's actually
> contended in the real world.
> 
> Also I find it difficult to imagine vmalloc being highly contended because it was
> historically considered slow and thus discouraged when performance is important.
> 
> IOW vmalloc would not be contended when allocation size is small because we have
> kmalloc/buddy API, and therefore I wonder which workloads are allocating very large
> buffers and at the same time allocating very frequently, thus performance-sensitive.
> 
> I am not against this series, but wondering which workloads would benefit ;)
> 
> > Apart of that per-cpu-KVA allocator can go away if we make it generic instead.
> 
> Not sure I understand your point, can you elaborate please?
> 
> And I would like to ask some side questions:
> 
> 1. Is vm_[un]map_ram() API still worth with this patchset?
> 
It is up to the community to decide. As I see it, XFS needs it also. Maybe in
the future it can be removed (who knows), if the vmalloc code itself can
deliver the same performance as the vm_map* APIs.

>
> 2. How does this patchset deals with 32-bit machines where
>    vmalloc address space is limited?
> 
It can deal with them without any problems, though I am not sure it is needed
for 32-bit systems. The reason is that the vmalloc code used to be a bit slow
when it comes to lookup time - it used to be O(n) and was later improved to
O(log n).

The vm_map_ram() and friends interface was added because of those vmalloc
drawbacks. I am not sure there are 32-bit systems with 10/20/30... CPUs on
board, and only in that case would it be worth caring about contention.

--
Uladzislau Rezki
Dave Chinner May 24, 2023, 9:56 p.m. UTC | #7
On Wed, May 24, 2023 at 11:50:12AM +0200, Uladzislau Rezki wrote:
> On Wed, May 24, 2023 at 03:04:28AM +0900, Hyeonggon Yoo wrote:
> > On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > And I would like to ask some side questions:
> > 
> > 1. Is vm_[un]map_ram() API still worth with this patchset?
> > 
> It is up to community to decide. As i see XFS needs it also. Maybe in
> the future it can be removed(who knows). If the vmalloc code itself can
> deliver such performance as vm_map* APIs.

The vm_map* APIs cannot be replaced with vmalloc; they cover a very
different use case, i.e. vmalloc allocates mapped memory, while
vm_map_ram() maps already-allocated memory....
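
A minimal illustration of that split (a sketch only; the helper names are
mine, and the XFS buffer-cache details of how these calls are driven are
omitted):

<snip>
#include <linux/mm_types.h>
#include <linux/numa.h>
#include <linux/vmalloc.h>

/* vmalloc(): allocates the backing pages *and* maps them contiguously. */
static void *alloc_mapped_buffer(unsigned long size)
{
	return vmalloc(size);		/* released with vfree() */
}

/* vm_map_ram(): the caller already owns the pages; only a mapping is set up. */
static void *map_owned_pages(struct page **pages, unsigned int nr_pages)
{
	/* released with vm_unmap_ram(addr, nr_pages) */
	return vm_map_ram(pages, nr_pages, NUMA_NO_NODE);
}
<snip>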

> vm_map_ram() and friends interface was added because of vmalloc drawbacks.

No. vm_map*() were scalability improvements added in 2009 to replace
vmap() and vunmap(), to avoid the global lock contention in the vmap
allocator that XFS had been working around for years with its own
internal vmap cache....

commit 95f8e302c04c0b0c6de35ab399a5551605eeb006
Author: Nicholas Piggin <npiggin@gmail.com>
Date:   Tue Jan 6 14:43:09 2009 +1100

    [XFS] use scalable vmap API
    
    Implement XFS's large buffer support with the new vmap APIs. See the vmap
    rewrite (db64fe02) for some numbers. The biggest improvement that comes from
    using the new APIs is avoiding the global KVA allocation lock on every call.
    
    Signed-off-by: Nick Piggin <npiggin@suse.de>
    Reviewed-by: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>

vmap/vunmap() themselves were introduced in 2.5.32 (2002), and before
that XFS was using remap_page_array() and vfree() in exactly the
same way it uses vm_map_ram() and vm_unmap_ram() today....

XFS has a long, long history of causing virtual memory allocator
scalability and contention problems. As you can see, this isn't our
first rodeo...

Cheers,

Dave.
Christoph Hellwig May 25, 2023, 7:59 a.m. UTC | #8
On Thu, May 25, 2023 at 07:56:56AM +1000, Dave Chinner wrote:
> > It is up to community to decide. As i see XFS needs it also. Maybe in
> > the future it can be removed(who knows). If the vmalloc code itself can
> > deliver such performance as vm_map* APIs.
> 
> vm_map* APIs cannot be replaced with vmalloc, they cover a very
> different use case.  i.e.  vmalloc allocates mapped memory,
> vm_map_ram() maps allocated memory....
> 
> > vm_map_ram() and friends interface was added because of vmalloc drawbacks.
> 
> No. vm_map*() were scalability improvements added in 2009 to replace
> on vmap() and vunmap() to avoid global lock contention in the vmap
> allocator that XFS had been working around for years with it's own
> internal vmap cache....

All of that is true.  At the same time XFS could very much switch to
vmalloc for !XBF_UNMAPPED && size > PAGE_SIZE buffers IFF that provided
an advantage.  The need for vmap and then vm_map_* initially came from
the fact that we were using the page cache to back the xfs_buf (or
page_buf back then).  With your work on getting rid of the pagecache
usage, we could just use vmalloc now if we wanted to and it improved
performance.  Or at some point we could even look into just using large
folios for that, with the infrastructure for that improving a lot (but
I suspect we're not quite there yet).

But there are other uses of vm_map_* that can't do that, and they will
benefit from the additional scalability as well, even if just using
vmalloc were fine for XFS now (I don't know, I haven't actually looked
into it).
Uladzislau Rezki May 25, 2023, 10:20 a.m. UTC | #9
On Thu, May 25, 2023 at 07:56:56AM +1000, Dave Chinner wrote:
> On Wed, May 24, 2023 at 11:50:12AM +0200, Uladzislau Rezki wrote:
> > On Wed, May 24, 2023 at 03:04:28AM +0900, Hyeonggon Yoo wrote:
> > > On Tue, May 23, 2023 at 05:12:30PM +0200, Uladzislau Rezki wrote:
> > > And I would like to ask some side questions:
> > > 
> > > 1. Is vm_[un]map_ram() API still worth with this patchset?
> > > 
> > It is up to community to decide. As i see XFS needs it also. Maybe in
> > the future it can be removed(who knows). If the vmalloc code itself can
> > deliver such performance as vm_map* APIs.
> 
> vm_map* APIs cannot be replaced with vmalloc, they cover a very
> different use case.  i.e.  vmalloc allocates mapped memory,
> vm_map_ram() maps allocated memory....
> 
> > vm_map_ram() and friends interface was added because of vmalloc drawbacks.
> 
> No. vm_map*() were scalability improvements added in 2009 to replace
> on vmap() and vunmap() to avoid global lock contention in the vmap
> allocator that XFS had been working around for years with it's own
> internal vmap cache....
> 
> commit 95f8e302c04c0b0c6de35ab399a5551605eeb006
> Author: Nicholas Piggin <npiggin@gmail.com>
> Date:   Tue Jan 6 14:43:09 2009 +1100
> 
>     [XFS] use scalable vmap API
>     
>     Implement XFS's large buffer support with the new vmap APIs. See the vmap
>     rewrite (db64fe02) for some numbers. The biggest improvement that comes from
>     using the new APIs is avoiding the global KVA allocation lock on every call.
>     
>     Signed-off-by: Nick Piggin <npiggin@suse.de>
>     Reviewed-by: Christoph Hellwig <hch@infradead.org>
>     Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
> 
> vmap/vunmap() themselves were introduce in 2.5.32 (2002) and before
> that XFS was using remap_page_array() and vfree() in exactly the
> same way it uses vm_map_ram() and vm_unmap_ram() today....
> 
> XFS has a long, long history of causing virtual memory allocator
> scalability and contention problems. As you can see, this isn't our
> first rodeo...
> 
Let me be more specific; sorry, it looks like there is a misunderstanding.
I am talking about removing the vb_alloc()/vb_free() per-CPU stuff, if
alloc_vmap_area() gives the same performance:

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d50c551592fc..a1687bbdad30 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2503,12 +2503,6 @@ void vm_unmap_ram(const void *mem, unsigned int count)

        kasan_poison_vmalloc(mem, size);

-       if (likely(count <= VMAP_MAX_ALLOC)) {
-               debug_check_no_locks_freed(mem, size);
-               vb_free(addr, size);
-               return;
-       }
-
        va = find_unlink_vmap_area(addr);
        if (WARN_ON_ONCE(!va))
                return;
@@ -2539,12 +2533,6 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
        unsigned long addr;
        void *mem;

-       if (likely(count <= VMAP_MAX_ALLOC)) {
-               mem = vb_alloc(size, GFP_KERNEL);
-               if (IS_ERR(mem))
-                       return NULL;
-               addr = (unsigned long)mem;
-       } else {
                struct vmap_area *va;
                va = alloc_vmap_area(size, PAGE_SIZE,
                                VMALLOC_START, VMALLOC_END,
@@ -2554,7 +2542,6 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)

                addr = va->va_start;
                mem = (void *)addr;
-       }

        if (vmap_pages_range(addr, addr + size, PAGE_KERNEL,
                                pages, PAGE_SHIFT) < 0) {


+ other related parts.

--
Uladzislau Rezki