diff mbox series

[-next] mm: usercopy: add a debugfs interface to bypass the vmalloc check.

Message ID 20241203023159.219355-1-zuoze1@huawei.com (mailing list archive)
State New
Headers show
Series [-next] mm: usercopy: add a debugfs interface to bypass the vmalloc check. | expand

Commit Message

zuoze Dec. 3, 2024, 2:31 a.m. UTC
The commit 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns") introduced
a vmalloc check for usercopy. However, in subsystems like networking, when
memory allocated using vmalloc or vmap is subsequently copied using
functions like copy_to_iter/copy_from_iter, the check is triggered on every
copy. This adds overhead in the copy path, such as the cost of searching
the red-black tree, which increases the performance burden.

We found that after merging this patch, network bandwidth in the XDP
scenario dropped significantly, from 25 Gbits/sec to 8 Gbits/sec; note
that hardened_usercopy is enabled by default.

To address this, we introduced a debugfs interface that allows selectively
enabling or disabling the vmalloc check based on the use case, optimizing
performance.

By default, the vmalloc check for usercopy is enabled.

To disable the vmalloc check:
        echo Y > /sys/kernel/debug/bypass_usercopy_vmalloc_check

After executing the above command, the XDP performance returns to 25
Gbits/sec.

Signed-off-by: Ze Zuo <zuoze1@huawei.com>
---
 mm/usercopy.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)
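
The patch body itself is not shown on this page (only the diffstat), so the
following is a minimal sketch of how such a debugfs toggle could be wired
into mm/usercopy.c; the symbol names and exact hook point are assumptions
based on the commit message, not the actual patch:

<snip>
#include <linux/debugfs.h>

/* Sketch only: the knob name matches the commit message, placement is assumed. */
static bool bypass_usercopy_vmalloc_check;

static int __init usercopy_vmalloc_debugfs_init(void)
{
	/* Creates /sys/kernel/debug/bypass_usercopy_vmalloc_check. */
	debugfs_create_bool("bypass_usercopy_vmalloc_check", 0644, NULL,
			    &bypass_usercopy_vmalloc_check);
	return 0;
}
late_initcall(usercopy_vmalloc_debugfs_init);

/* In check_heap_object(), ahead of the vmalloc bounds check: */
	if (is_vmalloc_addr(ptr)) {
		if (READ_ONCE(bypass_usercopy_vmalloc_check))
			return;	/* skip the find_vmap_area() lookup */
		/* ... existing find_vmap_area()-based overrun check ... */
	}
<snip>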

Comments

Matthew Wilcox (Oracle) Dec. 3, 2024, 4:11 a.m. UTC | #1
On Tue, Dec 03, 2024 at 10:31:59AM +0800, Ze Zuo wrote:
> The commit 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns") introduced
> vmalloc check for usercopy. However, in subsystems like networking, when
> memory allocated using vmalloc or vmap is subsequently copied using
> functions like copy_to_iter/copy_from_iter, the check is triggered. This
> adds overhead in the copy path, such as the cost of searching the
> red-black tree, which increases the performance burden.
> 
> We found that after merging this patch, network bandwidth performance in
> the XDP scenario significantly dropped from 25 Gbits/sec to 8 Gbits/sec,
> the hardened_usercopy is enabled by default.

What is "the XDP scenario", exactly?  Are these large or small packets?
What's taking the time in find_vmap_area()?  Is it lock contention?
Uladzislau Rezki Dec. 3, 2024, 6:12 a.m. UTC | #2
On Tue, Dec 03, 2024 at 10:31:59AM +0800, Ze Zuo wrote:
> The commit 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns") introduced
> vmalloc check for usercopy. However, in subsystems like networking, when
> memory allocated using vmalloc or vmap is subsequently copied using
> functions like copy_to_iter/copy_from_iter, the check is triggered. This
> adds overhead in the copy path, such as the cost of searching the
> red-black tree, which increases the performance burden.
> 
> We found that after merging this patch, network bandwidth performance in
> the XDP scenario significantly dropped from 25 Gbits/sec to 8 Gbits/sec,
> the hardened_usercopy is enabled by default.
> 
> To address this, we introduced a debugfs interface that allows selectively
> enabling or disabling the vmalloc check based on the use case, optimizing
> performance.
> 
> By default, vmalloc check for usercopy is enabled.
> 
> To disable the vmalloc check:
>         echo Y > /sys/kernel/debug/bypass_usercopy_vmalloc_check
> 
> After executing the above command, the XDP performance returns to 25
> Gbits/sec.
> 
To what Matthew has asked, could you please also specify the kernel version
you run your experiment on? Apart of that please describe your system and HW.

Thank you!

--
Uladzislau Rezki
zuoze Dec. 3, 2024, 11:23 a.m. UTC | #3
We have implemented host-guest communication based on the TUN device
using XSK[1]. The hardware is a Kunpeng 920 machine (ARM architecture),
and the operating system is based on the 6.6 LTS kernel. The call stack
collected for the hotspot is as follows:

-  100.00%     0.00%  vhost-12384  [unknown]      [k] 0000000000000000
    - ret_from_fork
       - 99.99% vhost_task_fn
          - 99.98% 0xffffdc59f619876c
             - 98.99% handle_rx_kick
                - 98.94% handle_rx
                   - 94.92% tun_recvmsg
                      - 94.76% tun_do_read
                         - 94.62% tun_put_user_xdp_zc
                            - 63.53% __check_object_size
                               - 63.49% __check_object_size.part.0
                                    find_vmap_area
                            - 30.02% _copy_to_iter
                                 __arch_copy_to_user
                   - 2.27% get_rx_bufs
                      - 2.12% vhost_get_vq_desc
                           1.49% __arch_copy_from_user
                   - 0.89% peek_head_len
                        0.54% xsk_tx_peek_desc
                   - 0.68% vhost_add_used_and_signal_n
                      - 0.53% eventfd_signal
                           eventfd_signal_mask
             - 0.94% handle_tx_kick
                - 0.94% handle_tx
                   - handle_tx_copy
                      - 0.59% vhost_tx_batch.constprop.0
                           0.52% tun_sendmsg

It can be observed that most of the overhead is concentrated in the 
find_vmap_area function.

[1]: https://www.kernel.org/doc/html/latest/networking/af_xdp.html

在 2024/12/3 12:11, Matthew Wilcox 写道:
> On Tue, Dec 03, 2024 at 10:31:59AM +0800, Ze Zuo wrote:
>> The commit 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns") introduced
>> vmalloc check for usercopy. However, in subsystems like networking, when
>> memory allocated using vmalloc or vmap is subsequently copied using
>> functions like copy_to_iter/copy_from_iter, the check is triggered. This
>> adds overhead in the copy path, such as the cost of searching the
>> red-black tree, which increases the performance burden.
>>
>> We found that after merging this patch, network bandwidth performance in
>> the XDP scenario significantly dropped from 25 Gbits/sec to 8 Gbits/sec,
>> the hardened_usercopy is enabled by default.
> 
> What is "the XDP scenario", exactly?  Are these large or small packets?
> What's taking the time in find_vmap_area()?  Is it lock contention?
>
Uladzislau Rezki Dec. 3, 2024, 12:39 p.m. UTC | #4
On Tue, Dec 03, 2024 at 07:23:44PM +0800, zuoze wrote:
> We have implemented host-guest communication based on the TUN device
> using XSK[1]. The hardware is a Kunpeng 920 machine (ARM architecture),
> and the operating system is based on the 6.6 LTS version with kernel
> version 6.6. The specific stack for hotspot collection is as follows:
> 
> -  100.00%     0.00%  vhost-12384  [unknown]      [k] 0000000000000000
>    - ret_from_fork
>       - 99.99% vhost_task_fn
>          - 99.98% 0xffffdc59f619876c
>             - 98.99% handle_rx_kick
>                - 98.94% handle_rx
>                   - 94.92% tun_recvmsg
>                      - 94.76% tun_do_read
>                         - 94.62% tun_put_user_xdp_zc
>                            - 63.53% __check_object_size
>                               - 63.49% __check_object_size.part.0
>                                    find_vmap_area
>                            - 30.02% _copy_to_iter
>                                 __arch_copy_to_user
>                   - 2.27% get_rx_bufs
>                      - 2.12% vhost_get_vq_desc
>                           1.49% __arch_copy_from_user
>                   - 0.89% peek_head_len
>                        0.54% xsk_tx_peek_desc
>                   - 0.68% vhost_add_used_and_signal_n
>                      - 0.53% eventfd_signal
>                           eventfd_signal_mask
>             - 0.94% handle_tx_kick
>                - 0.94% handle_tx
>                   - handle_tx_copy
>                      - 0.59% vhost_tx_batch.constprop.0
>                           0.52% tun_sendmsg
> 
> It can be observed that most of the overhead is concentrated in the
> find_vmap_area function.
> 
I see. Yes, it is pretty contended, since you run the v6.6 kernel. There
was work done to mitigate the vmap lock contention.
See it here: https://lwn.net/Articles/956590/

The work was taken in the v6.9 kernel:

<snip>
commit 38f6b9af04c4b79f81b3c2a0f76d1de94b78d7bc
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:23 2024 +0100

    mm: vmalloc: add va_alloc() helper

    Patch series "Mitigate a vmap lock contention", v3.

    1. Motivation
...
<snip>

Could you please try the v6.9 kernel on your setup?

As for how to solve it: the series can probably be back-ported to the v6.6 kernel.

Thanks!

--
Uladzislau Rezki
zuoze Dec. 3, 2024, 1:10 p.m. UTC | #5
在 2024/12/3 20:39, Uladzislau Rezki 写道:
> On Tue, Dec 03, 2024 at 07:23:44PM +0800, zuoze wrote:
>> We have implemented host-guest communication based on the TUN device
>> using XSK[1]. The hardware is a Kunpeng 920 machine (ARM architecture),
>> and the operating system is based on the 6.6 LTS version with kernel
>> version 6.6. The specific stack for hotspot collection is as follows:
>>
>> -  100.00%     0.00%  vhost-12384  [unknown]      [k] 0000000000000000
>>     - ret_from_fork
>>        - 99.99% vhost_task_fn
>>           - 99.98% 0xffffdc59f619876c
>>              - 98.99% handle_rx_kick
>>                 - 98.94% handle_rx
>>                    - 94.92% tun_recvmsg
>>                       - 94.76% tun_do_read
>>                          - 94.62% tun_put_user_xdp_zc
>>                             - 63.53% __check_object_size
>>                                - 63.49% __check_object_size.part.0
>>                                     find_vmap_area
>>                             - 30.02% _copy_to_iter
>>                                  __arch_copy_to_user
>>                    - 2.27% get_rx_bufs
>>                       - 2.12% vhost_get_vq_desc
>>                            1.49% __arch_copy_from_user
>>                    - 0.89% peek_head_len
>>                         0.54% xsk_tx_peek_desc
>>                    - 0.68% vhost_add_used_and_signal_n
>>                       - 0.53% eventfd_signal
>>                            eventfd_signal_mask
>>              - 0.94% handle_tx_kick
>>                 - 0.94% handle_tx
>>                    - handle_tx_copy
>>                       - 0.59% vhost_tx_batch.constprop.0
>>                            0.52% tun_sendmsg
>>
>> It can be observed that most of the overhead is concentrated in the
>> find_vmap_area function.
>>
> I see. Yes, it is pretty contented, since you run the v6.6 kernel. There
> was a work that tends to improve it to mitigate a vmap lock contention.
> See it here: https://lwn.net/Articles/956590/
> 
> The work was taken in the v6.9 kernel:
> 
> <snip>
> commit 38f6b9af04c4b79f81b3c2a0f76d1de94b78d7bc
> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Date:   Tue Jan 2 19:46:23 2024 +0100
> 
>      mm: vmalloc: add va_alloc() helper
> 
>      Patch series "Mitigate a vmap lock contention", v3.
> 
>      1. Motivation
> ...
> <snip>
> 
> Could you please try the v6.9 kernel on your setup?
> 
> How to solve it, probably, it can be back-ported to the v6.6 kernel.

All the vmalloc-related optimizations have already been merged into 6.6,
including the set of optimization patches you suggested. Thank you very
much for your input.

> 
> Thanks!
> 
> --
> Uladzislau Rezki
> 

Thanks,
ze zuo
Uladzislau Rezki Dec. 3, 2024, 1:25 p.m. UTC | #6
> > On Tue, Dec 03, 2024 at 07:23:44PM +0800, zuoze wrote:
> > > We have implemented host-guest communication based on the TUN device
> > > using XSK[1]. The hardware is a Kunpeng 920 machine (ARM architecture),
> > > and the operating system is based on the 6.6 LTS version with kernel
> > > version 6.6. The specific stack for hotspot collection is as follows:
> > > 
> > > -  100.00%     0.00%  vhost-12384  [unknown]      [k] 0000000000000000
> > >     - ret_from_fork
> > >        - 99.99% vhost_task_fn
> > >           - 99.98% 0xffffdc59f619876c
> > >              - 98.99% handle_rx_kick
> > >                 - 98.94% handle_rx
> > >                    - 94.92% tun_recvmsg
> > >                       - 94.76% tun_do_read
> > >                          - 94.62% tun_put_user_xdp_zc
> > >                             - 63.53% __check_object_size
> > >                                - 63.49% __check_object_size.part.0
> > >                                     find_vmap_area
> > >                             - 30.02% _copy_to_iter
> > >                                  __arch_copy_to_user
> > >                    - 2.27% get_rx_bufs
> > >                       - 2.12% vhost_get_vq_desc
> > >                            1.49% __arch_copy_from_user
> > >                    - 0.89% peek_head_len
> > >                         0.54% xsk_tx_peek_desc
> > >                    - 0.68% vhost_add_used_and_signal_n
> > >                       - 0.53% eventfd_signal
> > >                            eventfd_signal_mask
> > >              - 0.94% handle_tx_kick
> > >                 - 0.94% handle_tx
> > >                    - handle_tx_copy
> > >                       - 0.59% vhost_tx_batch.constprop.0
> > >                            0.52% tun_sendmsg
> > > 
> > > It can be observed that most of the overhead is concentrated in the
> > > find_vmap_area function.
> > > 
> > I see. Yes, it is pretty contented, since you run the v6.6 kernel. There
> > was a work that tends to improve it to mitigate a vmap lock contention.
> > See it here: https://lwn.net/Articles/956590/
> > 
> > The work was taken in the v6.9 kernel:
> > 
> > <snip>
> > commit 38f6b9af04c4b79f81b3c2a0f76d1de94b78d7bc
> > Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Date:   Tue Jan 2 19:46:23 2024 +0100
> > 
> >      mm: vmalloc: add va_alloc() helper
> > 
> >      Patch series "Mitigate a vmap lock contention", v3.
> > 
> >      1. Motivation
> > ...
> > <snip>
> > 
> > Could you please try the v6.9 kernel on your setup?
> > 
> > How to solve it, probably, it can be back-ported to the v6.6 kernel.
> 
> All the vmalloc-related optimizations have already been merged into 6.6,
> including the set of optimization patches you suggested. Thank you very
> much for your input.
> 
Do you mean that the perf-data that you posted to this email thread,
contains the: Patch series "Mitigate a vmap lock contention", v3. from the v6.9?

and you still see a high contention?

Thank you!

--
Uladzislau Rezki
Kefeng Wang Dec. 3, 2024, 1:30 p.m. UTC | #7
On 2024/12/3 21:10, zuoze wrote:
> 
> 
> 在 2024/12/3 20:39, Uladzislau Rezki 写道:
>> On Tue, Dec 03, 2024 at 07:23:44PM +0800, zuoze wrote:
>>> We have implemented host-guest communication based on the TUN device
>>> using XSK[1]. The hardware is a Kunpeng 920 machine (ARM architecture),
>>> and the operating system is based on the 6.6 LTS version with kernel
>>> version 6.6. The specific stack for hotspot collection is as follows:
>>>
>>> -  100.00%     0.00%  vhost-12384  [unknown]      [k] 0000000000000000
>>>     - ret_from_fork
>>>        - 99.99% vhost_task_fn
>>>           - 99.98% 0xffffdc59f619876c
>>>              - 98.99% handle_rx_kick
>>>                 - 98.94% handle_rx
>>>                    - 94.92% tun_recvmsg
>>>                       - 94.76% tun_do_read
>>>                          - 94.62% tun_put_user_xdp_zc
>>>                             - 63.53% __check_object_size
>>>                                - 63.49% __check_object_size.part.0
>>>                                     find_vmap_area
>>>                             - 30.02% _copy_to_iter
>>>                                  __arch_copy_to_user
>>>                    - 2.27% get_rx_bufs
>>>                       - 2.12% vhost_get_vq_desc
>>>                            1.49% __arch_copy_from_user
>>>                    - 0.89% peek_head_len
>>>                         0.54% xsk_tx_peek_desc
>>>                    - 0.68% vhost_add_used_and_signal_n
>>>                       - 0.53% eventfd_signal
>>>                            eventfd_signal_mask
>>>              - 0.94% handle_tx_kick
>>>                 - 0.94% handle_tx
>>>                    - handle_tx_copy
>>>                       - 0.59% vhost_tx_batch.constprop.0
>>>                            0.52% tun_sendmsg
>>>
>>> It can be observed that most of the overhead is concentrated in the
>>> find_vmap_area function.
>>>
>> I see. Yes, it is pretty contented, since you run the v6.6 kernel. There
>> was a work that tends to improve it to mitigate a vmap lock contention.
>> See it here: https://lwn.net/Articles/956590/
>>
>> The work was taken in the v6.9 kernel:
>>
>> <snip>
>> commit 38f6b9af04c4b79f81b3c2a0f76d1de94b78d7bc
>> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
>> Date:   Tue Jan 2 19:46:23 2024 +0100
>>
>>      mm: vmalloc: add va_alloc() helper
>>
>>      Patch series "Mitigate a vmap lock contention", v3.
>>
>>      1. Motivation
>> ...
>> <snip>
>>
>> Could you please try the v6.9 kernel on your setup?
>>
>> How to solve it, probably, it can be back-ported to the v6.6 kernel.
> 
> All the vmalloc-related optimizations have already been merged into 6.6,
> including the set of optimization patches you suggested. Thank you very
> much for your input.
> 

To clarify: we have already backported the vmalloc optimizations into our
6.6 kernel, so the stack above was collected with those patches applied,
and even with those optimizations, find_vmap_area() is still the hotspot.
Uladzislau Rezki Dec. 3, 2024, 1:39 p.m. UTC | #8
On Tue, Dec 03, 2024 at 09:30:09PM +0800, Kefeng Wang wrote:
> 
> 
> On 2024/12/3 21:10, zuoze wrote:
> > 
> > 
> > 在 2024/12/3 20:39, Uladzislau Rezki 写道:
> > > On Tue, Dec 03, 2024 at 07:23:44PM +0800, zuoze wrote:
> > > > We have implemented host-guest communication based on the TUN device
> > > > using XSK[1]. The hardware is a Kunpeng 920 machine (ARM architecture),
> > > > and the operating system is based on the 6.6 LTS version with kernel
> > > > version 6.6. The specific stack for hotspot collection is as follows:
> > > > 
> > > > -  100.00%     0.00%  vhost-12384  [unknown]      [k] 0000000000000000
> > > >     - ret_from_fork
> > > >        - 99.99% vhost_task_fn
> > > >           - 99.98% 0xffffdc59f619876c
> > > >              - 98.99% handle_rx_kick
> > > >                 - 98.94% handle_rx
> > > >                    - 94.92% tun_recvmsg
> > > >                       - 94.76% tun_do_read
> > > >                          - 94.62% tun_put_user_xdp_zc
> > > >                             - 63.53% __check_object_size
> > > >                                - 63.49% __check_object_size.part.0
> > > >                                     find_vmap_area
> > > >                             - 30.02% _copy_to_iter
> > > >                                  __arch_copy_to_user
> > > >                    - 2.27% get_rx_bufs
> > > >                       - 2.12% vhost_get_vq_desc
> > > >                            1.49% __arch_copy_from_user
> > > >                    - 0.89% peek_head_len
> > > >                         0.54% xsk_tx_peek_desc
> > > >                    - 0.68% vhost_add_used_and_signal_n
> > > >                       - 0.53% eventfd_signal
> > > >                            eventfd_signal_mask
> > > >              - 0.94% handle_tx_kick
> > > >                 - 0.94% handle_tx
> > > >                    - handle_tx_copy
> > > >                       - 0.59% vhost_tx_batch.constprop.0
> > > >                            0.52% tun_sendmsg
> > > > 
> > > > It can be observed that most of the overhead is concentrated in the
> > > > find_vmap_area function.
> > > > 
> > > I see. Yes, it is pretty contented, since you run the v6.6 kernel. There
> > > was a work that tends to improve it to mitigate a vmap lock contention.
> > > See it here: https://lwn.net/Articles/956590/
> > > 
> > > The work was taken in the v6.9 kernel:
> > > 
> > > <snip>
> > > commit 38f6b9af04c4b79f81b3c2a0f76d1de94b78d7bc
> > > Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > Date:   Tue Jan 2 19:46:23 2024 +0100
> > > 
> > >      mm: vmalloc: add va_alloc() helper
> > > 
> > >      Patch series "Mitigate a vmap lock contention", v3.
> > > 
> > >      1. Motivation
> > > ...
> > > <snip>
> > > 
> > > Could you please try the v6.9 kernel on your setup?
> > > 
> > > How to solve it, probably, it can be back-ported to the v6.6 kernel.
> > 
> > All the vmalloc-related optimizations have already been merged into 6.6,
> > including the set of optimization patches you suggested. Thank you very
> > much for your input.
> > 
> 
> It is unclear, we have backported the vmalloc optimization into our 6.6
> kernel before, so the above stack already with those patches and even
> with those optimization, the find_vmap_area() is still the hotpots.
> 
> 
Could you please check that all below patches are in your v6.6 kernel?


<snip>
commit 8be4d46e12af32342569840d958272dbb3be3f4c
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Wed Jan 24 19:09:20 2024 +0100

    mm: vmalloc: refactor vmalloc_dump_obj() function

commit 15e02a39fb6b43f37100563c6a246252d5d1e6da
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Wed Jan 24 19:09:19 2024 +0100

    mm: vmalloc: improve description of vmap node layer

commit 7679ba6b36dbb300b757b672d6a32a606499e14b
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:33 2024 +0100

    mm: vmalloc: add a shrinker to drain vmap pools

commit 8f33a2ff307248c3e55a7696f60b3658b28edb57
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:32 2024 +0100

    mm: vmalloc: set nr_nodes based on CPUs in a system

commit 8e1d743f2c2671aa54f6f91a2b33823f92512870
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:31 2024 +0100

    mm: vmalloc: support multiple nodes in vmallocinfo

commit 53becf32aec1c8049b854f0c31a11df5ed75df6f
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:30 2024 +0100

    mm: vmalloc: support multiple nodes in vread_iter

commit 96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Fri Feb 2 20:06:28 2024 +0100

    mm: vmalloc: add a scan area of VA only once

commit 72210662c5a2b6005f6daea7fe293a0dc573e1a5
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:29 2024 +0100

    mm: vmalloc: offload free_vmap_area_lock lock

commit 282631cb2447318e2a55b41a665dbe8571c46d70
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:28 2024 +0100

    mm: vmalloc: remove global purge_vmap_area_root rb-tree

commit 55c49fee57af99f3c663e69dedc5b85e691bbe50
Author: Baoquan He <bhe@redhat.com>
Date:   Tue Jan 2 19:46:27 2024 +0100

    mm/vmalloc: remove vmap_area_list

commit d093602919ad5908532142a048539800fa94a0d1
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:26 2024 +0100

    mm: vmalloc: remove global vmap_area_root rb-tree

commit 7fa8cee003166ef6db0bba70d610dbf173543811
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:25 2024 +0100

    mm: vmalloc: move vmap_init_free_space() down in vmalloc.c


commit 5b75b8e1b9040b34f43809b1948eaa4e83e39112
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:24 2024 +0100

    mm: vmalloc: rename adjust_va_to_fit_type() function


commit 38f6b9af04c4b79f81b3c2a0f76d1de94b78d7bc
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:23 2024 +0100

    mm: vmalloc: add va_alloc() helper
<snip>

Thank you!

--
Uladzislau Rezki
Kefeng Wang Dec. 3, 2024, 1:45 p.m. UTC | #9
On 2024/12/3 21:39, Uladzislau Rezki wrote:
> On Tue, Dec 03, 2024 at 09:30:09PM +0800, Kefeng Wang wrote:
>>
>>
>> On 2024/12/3 21:10, zuoze wrote:
>>>
>>>
>>> 在 2024/12/3 20:39, Uladzislau Rezki 写道:
>>>> On Tue, Dec 03, 2024 at 07:23:44PM +0800, zuoze wrote:
>>>>> We have implemented host-guest communication based on the TUN device
>>>>> using XSK[1]. The hardware is a Kunpeng 920 machine (ARM architecture),
>>>>> and the operating system is based on the 6.6 LTS version with kernel
>>>>> version 6.6. The specific stack for hotspot collection is as follows:
>>>>>
>>>>> -  100.00%     0.00%  vhost-12384  [unknown]      [k] 0000000000000000
>>>>>      - ret_from_fork
>>>>>         - 99.99% vhost_task_fn
>>>>>            - 99.98% 0xffffdc59f619876c
>>>>>               - 98.99% handle_rx_kick
>>>>>                  - 98.94% handle_rx
>>>>>                     - 94.92% tun_recvmsg
>>>>>                        - 94.76% tun_do_read
>>>>>                           - 94.62% tun_put_user_xdp_zc
>>>>>                              - 63.53% __check_object_size
>>>>>                                 - 63.49% __check_object_size.part.0
>>>>>                                      find_vmap_area
>>>>>                              - 30.02% _copy_to_iter
>>>>>                                   __arch_copy_to_user
>>>>>                     - 2.27% get_rx_bufs
>>>>>                        - 2.12% vhost_get_vq_desc
>>>>>                             1.49% __arch_copy_from_user
>>>>>                     - 0.89% peek_head_len
>>>>>                          0.54% xsk_tx_peek_desc
>>>>>                     - 0.68% vhost_add_used_and_signal_n
>>>>>                        - 0.53% eventfd_signal
>>>>>                             eventfd_signal_mask
>>>>>               - 0.94% handle_tx_kick
>>>>>                  - 0.94% handle_tx
>>>>>                     - handle_tx_copy
>>>>>                        - 0.59% vhost_tx_batch.constprop.0
>>>>>                             0.52% tun_sendmsg
>>>>>
>>>>> It can be observed that most of the overhead is concentrated in the
>>>>> find_vmap_area function.
>>>>>
>>>> I see. Yes, it is pretty contented, since you run the v6.6 kernel. There
>>>> was a work that tends to improve it to mitigate a vmap lock contention.
>>>> See it here: https://lwn.net/Articles/956590/
>>>>
>>>> The work was taken in the v6.9 kernel:
>>>>
>>>> <snip>
>>>> commit 38f6b9af04c4b79f81b3c2a0f76d1de94b78d7bc
>>>> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
>>>> Date:   Tue Jan 2 19:46:23 2024 +0100
>>>>
>>>>       mm: vmalloc: add va_alloc() helper
>>>>
>>>>       Patch series "Mitigate a vmap lock contention", v3.
>>>>
>>>>       1. Motivation
>>>> ...
>>>> <snip>
>>>>
>>>> Could you please try the v6.9 kernel on your setup?
>>>>
>>>> How to solve it, probably, it can be back-ported to the v6.6 kernel.
>>>
>>> All the vmalloc-related optimizations have already been merged into 6.6,
>>> including the set of optimization patches you suggested. Thank you very
>>> much for your input.
>>>
>>
>> It is unclear, we have backported the vmalloc optimization into our 6.6
>> kernel before, so the above stack already with those patches and even
>> with those optimization, the find_vmap_area() is still the hotpots.
>>
>>
> Could you please check that all below patches are in your v6.6 kernel?

Yes,

$ git lg v6.6..HEAD  --oneline mm/vmalloc.c
* 86fee542f145 mm: vmalloc: ensure vmap_block is initialised before adding to queue
* f459a0b59f7c mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order fallback to order 0
* 0be7a82c2555 mm: vmalloc: fix lockdep warning
* 58b99a00d0a0 mm/vmalloc: eliminated the lock contention from twice to once
* 2c549aa32fa0 mm: vmalloc: check if a hash-index is in cpu_possible_mask
* 0bc6d608b445 mm: fix incorrect vbq reference in purge_fragmented_block
* 450f8c5270df mm/vmalloc: fix vmalloc which may return null if called with __GFP_NOFAIL
* 2ea2bf4a18c3 mm: vmalloc: bail out early in find_vmap_area() if vmap is not init
* bde74a3e8a71 mm/vmalloc: fix return value of vb_alloc if size is 0
* 8c620d05b7c3 mm: vmalloc: refactor vmalloc_dump_obj() function
* b0c8281703b8 mm: vmalloc: improve description of vmap node layer
* ecc3f0bf5c5a mm: vmalloc: add a shrinker to drain vmap pools
* dd89a137f483 mm: vmalloc: set nr_nodes based on CPUs in a system
* 8e63c98d86f6 mm: vmalloc: support multiple nodes in vmallocinfo
* cc32683cef48 mm: vmalloc: support multiple nodes in vread_iter
* 54d5ce65633d mm: vmalloc: add a scan area of VA only once
* ee9c199fb859 mm: vmalloc: offload free_vmap_area_lock lock
* c2c272d78b5a mm: vmalloc: remove global purge_vmap_area_root rb-tree
* c9b39e3ffa86 mm/vmalloc: remove vmap_area_list
* 091d2493d15f mm: vmalloc: remove global vmap_area_root rb-tree
* 53f06cc34bac mm: vmalloc: move vmap_init_free_space() down in vmalloc.c
* bf24196d9ab9 mm: vmalloc: rename adjust_va_to_fit_type() function
* 6e9c94401e34 mm: vmalloc: add va_alloc() helper
* ae528eb14e9a mm: Introduce vmap_page_range() to map pages in PCI address space
* e1dbcfaa1854 mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
* d3a24e7a01c4 mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.
* fc9813220585 mm/vmalloc: fix the unchecked dereference warning in vread_iter()
* a52e0157837e ascend: export interfaces required by ascend drivers
* 9b1283f2bec2 mm/vmalloc: Extend vmalloc usage about hugepage
> 
> 
> <snip>
> commit 8be4d46e12af32342569840d958272dbb3be3f4c
> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Date:   Wed Jan 24 19:09:20 2024 +0100
> 
>      mm: vmalloc: refactor vmalloc_dump_obj() function
> 
> commit 15e02a39fb6b43f37100563c6a246252d5d1e6da
> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Date:   Wed Jan 24 19:09:19 2024 +0100
> 
>      mm: vmalloc: improve description of vmap node layer
> 
> commit 7679ba6b36dbb300b757b672d6a32a606499e14b
> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Date:   Tue Jan 2 19:46:33 2024 +0100
> 
>      mm: vmalloc: add a shrinker to drain vmap pools
> 
> commit 8f33a2ff307248c3e55a7696f60b3658b28edb57
> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Date:   Tue Jan 2 19:46:32 2024 +0100
> 
>      mm: vmalloc: set nr_nodes based on CPUs in a system
> 
> commit 8e1d743f2c2671aa54f6f91a2b33823f92512870
> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Date:   Tue Jan 2 19:46:31 2024 +0100
> 
>      mm: vmalloc: support multiple nodes in vmallocinfo
> 
> commit 53becf32aec1c8049b854f0c31a11df5ed75df6f
> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Date:   Tue Jan 2 19:46:30 2024 +0100
> 
>      mm: vmalloc: support multiple nodes in vread_iter
> 
> commit 96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e
> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Date:   Fri Feb 2 20:06:28 2024 +0100
> 
>      mm: vmalloc: add a scan area of VA only once
> 
> commit 72210662c5a2b6005f6daea7fe293a0dc573e1a5
> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Date:   Tue Jan 2 19:46:29 2024 +0100
> 
>      mm: vmalloc: offload free_vmap_area_lock lock
> 
> commit 282631cb2447318e2a55b41a665dbe8571c46d70
> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Date:   Tue Jan 2 19:46:28 2024 +0100
> 
>      mm: vmalloc: remove global purge_vmap_area_root rb-tree
> 
> commit 55c49fee57af99f3c663e69dedc5b85e691bbe50
> Author: Baoquan He <bhe@redhat.com>
> Date:   Tue Jan 2 19:46:27 2024 +0100
> 
>      mm/vmalloc: remove vmap_area_list
> 
> commit d093602919ad5908532142a048539800fa94a0d1
> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Date:   Tue Jan 2 19:46:26 2024 +0100
> 
>      mm: vmalloc: remove global vmap_area_root rb-tree
> 
> commit 7fa8cee003166ef6db0bba70d610dbf173543811
> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Date:   Tue Jan 2 19:46:25 2024 +0100
> 
>      mm: vmalloc: move vmap_init_free_space() down in vmalloc.c
> 
> 
> commit 5b75b8e1b9040b34f43809b1948eaa4e83e39112
> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Date:   Tue Jan 2 19:46:24 2024 +0100
> 
>      mm: vmalloc: rename adjust_va_to_fit_type() function
> 
> 
> commit 38f6b9af04c4b79f81b3c2a0f76d1de94b78d7bc
> Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Date:   Tue Jan 2 19:46:23 2024 +0100
> 
>      mm: vmalloc: add va_alloc() helper
> <snip>
> 
> Thank you!
> 
> --
> Uladzislau Rezki
Uladzislau Rezki Dec. 3, 2024, 1:51 p.m. UTC | #10
On Tue, Dec 03, 2024 at 09:45:09PM +0800, Kefeng Wang wrote:
> 
> 
> On 2024/12/3 21:39, Uladzislau Rezki wrote:
> > On Tue, Dec 03, 2024 at 09:30:09PM +0800, Kefeng Wang wrote:
> > > 
> > > 
> > > On 2024/12/3 21:10, zuoze wrote:
> > > > 
> > > > 
> > > > 在 2024/12/3 20:39, Uladzislau Rezki 写道:
> > > > > On Tue, Dec 03, 2024 at 07:23:44PM +0800, zuoze wrote:
> > > > > > We have implemented host-guest communication based on the TUN device
> > > > > > using XSK[1]. The hardware is a Kunpeng 920 machine (ARM architecture),
> > > > > > and the operating system is based on the 6.6 LTS version with kernel
> > > > > > version 6.6. The specific stack for hotspot collection is as follows:
> > > > > > 
> > > > > > -  100.00%     0.00%  vhost-12384  [unknown]      [k] 0000000000000000
> > > > > >      - ret_from_fork
> > > > > >         - 99.99% vhost_task_fn
> > > > > >            - 99.98% 0xffffdc59f619876c
> > > > > >               - 98.99% handle_rx_kick
> > > > > >                  - 98.94% handle_rx
> > > > > >                     - 94.92% tun_recvmsg
> > > > > >                        - 94.76% tun_do_read
> > > > > >                           - 94.62% tun_put_user_xdp_zc
> > > > > >                              - 63.53% __check_object_size
> > > > > >                                 - 63.49% __check_object_size.part.0
> > > > > >                                      find_vmap_area
> > > > > >                              - 30.02% _copy_to_iter
> > > > > >                                   __arch_copy_to_user
> > > > > >                     - 2.27% get_rx_bufs
> > > > > >                        - 2.12% vhost_get_vq_desc
> > > > > >                             1.49% __arch_copy_from_user
> > > > > >                     - 0.89% peek_head_len
> > > > > >                          0.54% xsk_tx_peek_desc
> > > > > >                     - 0.68% vhost_add_used_and_signal_n
> > > > > >                        - 0.53% eventfd_signal
> > > > > >                             eventfd_signal_mask
> > > > > >               - 0.94% handle_tx_kick
> > > > > >                  - 0.94% handle_tx
> > > > > >                     - handle_tx_copy
> > > > > >                        - 0.59% vhost_tx_batch.constprop.0
> > > > > >                             0.52% tun_sendmsg
> > > > > > 
> > > > > > It can be observed that most of the overhead is concentrated in the
> > > > > > find_vmap_area function.
> > > > > > 
> > > > > I see. Yes, it is pretty contented, since you run the v6.6 kernel. There
> > > > > was a work that tends to improve it to mitigate a vmap lock contention.
> > > > > See it here: https://lwn.net/Articles/956590/
> > > > > 
> > > > > The work was taken in the v6.9 kernel:
> > > > > 
> > > > > <snip>
> > > > > commit 38f6b9af04c4b79f81b3c2a0f76d1de94b78d7bc
> > > > > Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > Date:   Tue Jan 2 19:46:23 2024 +0100
> > > > > 
> > > > >       mm: vmalloc: add va_alloc() helper
> > > > > 
> > > > >       Patch series "Mitigate a vmap lock contention", v3.
> > > > > 
> > > > >       1. Motivation
> > > > > ...
> > > > > <snip>
> > > > > 
> > > > > Could you please try the v6.9 kernel on your setup?
> > > > > 
> > > > > How to solve it, probably, it can be back-ported to the v6.6 kernel.
> > > > 
> > > > All the vmalloc-related optimizations have already been merged into 6.6,
> > > > including the set of optimization patches you suggested. Thank you very
> > > > much for your input.
> > > > 
> > > 
> > > It is unclear, we have backported the vmalloc optimization into our 6.6
> > > kernel before, so the above stack already with those patches and even
> > > with those optimization, the find_vmap_area() is still the hotpots.
> > > 
> > > 
> > Could you please check that all below patches are in your v6.6 kernel?
> 
> Yes,
> 
> $ git lg v6.6..HEAD  --oneline mm/vmalloc.c
> * 86fee542f145 mm: vmalloc: ensure vmap_block is initialised before adding
> to queue
> * f459a0b59f7c mm/vmalloc: fix page mapping if vm_area_alloc_pages() with
> high order fallback to order 0
> * 0be7a82c2555 mm: vmalloc: fix lockdep warning
> * 58b99a00d0a0 mm/vmalloc: eliminated the lock contention from twice to once
> * 2c549aa32fa0 mm: vmalloc: check if a hash-index is in cpu_possible_mask
> * 0bc6d608b445 mm: fix incorrect vbq reference in purge_fragmented_block
> * 450f8c5270df mm/vmalloc: fix vmalloc which may return null if called with
> __GFP_NOFAIL
> * 2ea2bf4a18c3 mm: vmalloc: bail out early in find_vmap_area() if vmap is
> not init
> * bde74a3e8a71 mm/vmalloc: fix return value of vb_alloc if size is 0
> * 8c620d05b7c3 mm: vmalloc: refactor vmalloc_dump_obj() function
> * b0c8281703b8 mm: vmalloc: improve description of vmap node layer
> * ecc3f0bf5c5a mm: vmalloc: add a shrinker to drain vmap pools
> * dd89a137f483 mm: vmalloc: set nr_nodes based on CPUs in a system
> * 8e63c98d86f6 mm: vmalloc: support multiple nodes in vmallocinfo
> * cc32683cef48 mm: vmalloc: support multiple nodes in vread_iter
> * 54d5ce65633d mm: vmalloc: add a scan area of VA only once
> * ee9c199fb859 mm: vmalloc: offload free_vmap_area_lock lock
> * c2c272d78b5a mm: vmalloc: remove global purge_vmap_area_root rb-tree
> * c9b39e3ffa86 mm/vmalloc: remove vmap_area_list
> * 091d2493d15f mm: vmalloc: remove global vmap_area_root rb-tree
> * 53f06cc34bac mm: vmalloc: move vmap_init_free_space() down in vmalloc.c
> * bf24196d9ab9 mm: vmalloc: rename adjust_va_to_fit_type() function
> * 6e9c94401e34 mm: vmalloc: add va_alloc() helper
> * ae528eb14e9a mm: Introduce vmap_page_range() to map pages in PCI address
> space
> * e1dbcfaa1854 mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
> * d3a24e7a01c4 mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.
> * fc9813220585 mm/vmalloc: fix the unchecked dereference warning in
> vread_iter()
> * a52e0157837e ascend: export interfaces required by ascend drivers
> * 9b1283f2bec2 mm/vmalloc: Extend vmalloc usage about hugepage
>
Thank you. Then you have tons of copy_to_iter()/copy_from_iter() calls
during your test case, and for each one you need to find the vmap area,
which can be really heavy.

How many CPUs does your system have?
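
For context, the number of vmap nodes (and hence of independent busy
trees/locks that find_vmap_area() can spread over) is derived from the CPU
count; in mm/vmalloc.c it is set roughly as follows, which is why the CPU
count matters here:

<snip>
	/* vmap_init_nodes(): one node per possible CPU, capped at 128. */
	n = clamp_t(unsigned int, num_possible_cpus(), 1, 128);
<snip>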

--
Uladzislau Rezki
Kefeng Wang Dec. 3, 2024, 2:10 p.m. UTC | #11
On 2024/12/3 21:51, Uladzislau Rezki wrote:
> On Tue, Dec 03, 2024 at 09:45:09PM +0800, Kefeng Wang wrote:
>>
>>
>> On 2024/12/3 21:39, Uladzislau Rezki wrote:
>>> On Tue, Dec 03, 2024 at 09:30:09PM +0800, Kefeng Wang wrote:
>>>>
>>>>
>>>> On 2024/12/3 21:10, zuoze wrote:
>>>>>
>>>>>
>>>>> 在 2024/12/3 20:39, Uladzislau Rezki 写道:
>>>>>> On Tue, Dec 03, 2024 at 07:23:44PM +0800, zuoze wrote:
>>>>>>> We have implemented host-guest communication based on the TUN device
>>>>>>> using XSK[1]. The hardware is a Kunpeng 920 machine (ARM architecture),
>>>>>>> and the operating system is based on the 6.6 LTS version with kernel
>>>>>>> version 6.6. The specific stack for hotspot collection is as follows:
>>>>>>>
>>>>>>> -  100.00%     0.00%  vhost-12384  [unknown]      [k] 0000000000000000
>>>>>>>       - ret_from_fork
>>>>>>>          - 99.99% vhost_task_fn
>>>>>>>             - 99.98% 0xffffdc59f619876c
>>>>>>>                - 98.99% handle_rx_kick
>>>>>>>                   - 98.94% handle_rx
>>>>>>>                      - 94.92% tun_recvmsg
>>>>>>>                         - 94.76% tun_do_read
>>>>>>>                            - 94.62% tun_put_user_xdp_zc
>>>>>>>                               - 63.53% __check_object_size
>>>>>>>                                  - 63.49% __check_object_size.part.0
>>>>>>>                                       find_vmap_area
>>>>>>>                               - 30.02% _copy_to_iter
>>>>>>>                                    __arch_copy_to_user
>>>>>>>                      - 2.27% get_rx_bufs
>>>>>>>                         - 2.12% vhost_get_vq_desc
>>>>>>>                              1.49% __arch_copy_from_user
>>>>>>>                      - 0.89% peek_head_len
>>>>>>>                           0.54% xsk_tx_peek_desc
>>>>>>>                      - 0.68% vhost_add_used_and_signal_n
>>>>>>>                         - 0.53% eventfd_signal
>>>>>>>                              eventfd_signal_mask
>>>>>>>                - 0.94% handle_tx_kick
>>>>>>>                   - 0.94% handle_tx
>>>>>>>                      - handle_tx_copy
>>>>>>>                         - 0.59% vhost_tx_batch.constprop.0
>>>>>>>                              0.52% tun_sendmsg
>>>>>>>
>>>>>>> It can be observed that most of the overhead is concentrated in the
>>>>>>> find_vmap_area function.
>>>>>>>
...
>>
> Thank you. Then you have tons of copy_to_iter/copy_from_iter calls
> during your test case. Per each you need to find an area which might
> be really heavy.

Exactly. There was no vmalloc check before 0aef499f3172 ("mm/usercopy:
Detect vmalloc overruns"), so there was no find_vmap_area() burden in
older kernels.

> 
> How many CPUs in a system you have?
> 

128 cores.
Uladzislau Rezki Dec. 3, 2024, 2:20 p.m. UTC | #12
On Tue, Dec 03, 2024 at 10:10:26PM +0800, Kefeng Wang wrote:
> 
> 
> On 2024/12/3 21:51, Uladzislau Rezki wrote:
> > On Tue, Dec 03, 2024 at 09:45:09PM +0800, Kefeng Wang wrote:
> > > 
> > > 
> > > On 2024/12/3 21:39, Uladzislau Rezki wrote:
> > > > On Tue, Dec 03, 2024 at 09:30:09PM +0800, Kefeng Wang wrote:
> > > > > 
> > > > > 
> > > > > On 2024/12/3 21:10, zuoze wrote:
> > > > > > 
> > > > > > 
> > > > > > 在 2024/12/3 20:39, Uladzislau Rezki 写道:
> > > > > > > On Tue, Dec 03, 2024 at 07:23:44PM +0800, zuoze wrote:
> > > > > > > > We have implemented host-guest communication based on the TUN device
> > > > > > > > using XSK[1]. The hardware is a Kunpeng 920 machine (ARM architecture),
> > > > > > > > and the operating system is based on the 6.6 LTS version with kernel
> > > > > > > > version 6.6. The specific stack for hotspot collection is as follows:
> > > > > > > > 
> > > > > > > > -  100.00%     0.00%  vhost-12384  [unknown]      [k] 0000000000000000
> > > > > > > >       - ret_from_fork
> > > > > > > >          - 99.99% vhost_task_fn
> > > > > > > >             - 99.98% 0xffffdc59f619876c
> > > > > > > >                - 98.99% handle_rx_kick
> > > > > > > >                   - 98.94% handle_rx
> > > > > > > >                      - 94.92% tun_recvmsg
> > > > > > > >                         - 94.76% tun_do_read
> > > > > > > >                            - 94.62% tun_put_user_xdp_zc
> > > > > > > >                               - 63.53% __check_object_size
> > > > > > > >                                  - 63.49% __check_object_size.part.0
> > > > > > > >                                       find_vmap_area
> > > > > > > >                               - 30.02% _copy_to_iter
> > > > > > > >                                    __arch_copy_to_user
> > > > > > > >                      - 2.27% get_rx_bufs
> > > > > > > >                         - 2.12% vhost_get_vq_desc
> > > > > > > >                              1.49% __arch_copy_from_user
> > > > > > > >                      - 0.89% peek_head_len
> > > > > > > >                           0.54% xsk_tx_peek_desc
> > > > > > > >                      - 0.68% vhost_add_used_and_signal_n
> > > > > > > >                         - 0.53% eventfd_signal
> > > > > > > >                              eventfd_signal_mask
> > > > > > > >                - 0.94% handle_tx_kick
> > > > > > > >                   - 0.94% handle_tx
> > > > > > > >                      - handle_tx_copy
> > > > > > > >                         - 0.59% vhost_tx_batch.constprop.0
> > > > > > > >                              0.52% tun_sendmsg
> > > > > > > > 
> > > > > > > > It can be observed that most of the overhead is concentrated in the
> > > > > > > > find_vmap_area function.
> > > > > > > > 
> ...
> > > 
> > Thank you. Then you have tons of copy_to_iter/copy_from_iter calls
> > during your test case. Per each you need to find an area which might
> > be really heavy.
> 
> Exactly, no vmalloc check before 0aef499f3172 ("mm/usercopy: Detect vmalloc
> overruns"), so no burden in find_vmap_area in old kernel.
> 
Yep. It will slow down for sure.

> > 
> > How many CPUs in a system you have?
> > 
> 
> 128 core
OK. Just in case, do you see in a boot log something like:

"Failed to allocate an array. Disable a node layer"

Thanks!

--
Uladzislau Rezki
Uladzislau Rezki Dec. 3, 2024, 7:02 p.m. UTC | #13
On Tue, Dec 03, 2024 at 03:20:04PM +0100, Uladzislau Rezki wrote:
> On Tue, Dec 03, 2024 at 10:10:26PM +0800, Kefeng Wang wrote:
> > 
> > 
> > On 2024/12/3 21:51, Uladzislau Rezki wrote:
> > > On Tue, Dec 03, 2024 at 09:45:09PM +0800, Kefeng Wang wrote:
> > > > 
> > > > 
> > > > On 2024/12/3 21:39, Uladzislau Rezki wrote:
> > > > > On Tue, Dec 03, 2024 at 09:30:09PM +0800, Kefeng Wang wrote:
> > > > > > 
> > > > > > 
> > > > > > On 2024/12/3 21:10, zuoze wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > 在 2024/12/3 20:39, Uladzislau Rezki 写道:
> > > > > > > > On Tue, Dec 03, 2024 at 07:23:44PM +0800, zuoze wrote:
> > > > > > > > > We have implemented host-guest communication based on the TUN device
> > > > > > > > > using XSK[1]. The hardware is a Kunpeng 920 machine (ARM architecture),
> > > > > > > > > and the operating system is based on the 6.6 LTS version with kernel
> > > > > > > > > version 6.6. The specific stack for hotspot collection is as follows:
> > > > > > > > > 
> > > > > > > > > -  100.00%     0.00%  vhost-12384  [unknown]      [k] 0000000000000000
> > > > > > > > >       - ret_from_fork
> > > > > > > > >          - 99.99% vhost_task_fn
> > > > > > > > >             - 99.98% 0xffffdc59f619876c
> > > > > > > > >                - 98.99% handle_rx_kick
> > > > > > > > >                   - 98.94% handle_rx
> > > > > > > > >                      - 94.92% tun_recvmsg
> > > > > > > > >                         - 94.76% tun_do_read
> > > > > > > > >                            - 94.62% tun_put_user_xdp_zc
> > > > > > > > >                               - 63.53% __check_object_size
> > > > > > > > >                                  - 63.49% __check_object_size.part.0
> > > > > > > > >                                       find_vmap_area
> > > > > > > > >                               - 30.02% _copy_to_iter
> > > > > > > > >                                    __arch_copy_to_user
> > > > > > > > >                      - 2.27% get_rx_bufs
> > > > > > > > >                         - 2.12% vhost_get_vq_desc
> > > > > > > > >                              1.49% __arch_copy_from_user
> > > > > > > > >                      - 0.89% peek_head_len
> > > > > > > > >                           0.54% xsk_tx_peek_desc
> > > > > > > > >                      - 0.68% vhost_add_used_and_signal_n
> > > > > > > > >                         - 0.53% eventfd_signal
> > > > > > > > >                              eventfd_signal_mask
> > > > > > > > >                - 0.94% handle_tx_kick
> > > > > > > > >                   - 0.94% handle_tx
> > > > > > > > >                      - handle_tx_copy
> > > > > > > > >                         - 0.59% vhost_tx_batch.constprop.0
> > > > > > > > >                              0.52% tun_sendmsg
> > > > > > > > > 
> > > > > > > > > It can be observed that most of the overhead is concentrated in the
> > > > > > > > > find_vmap_area function.
> > > > > > > > > 
> > ...
> > > > 
> > > Thank you. Then you have tons of copy_to_iter/copy_from_iter calls
> > > during your test case. Per each you need to find an area which might
> > > be really heavy.
> > 
> > Exactly, no vmalloc check before 0aef499f3172 ("mm/usercopy: Detect vmalloc
> > overruns"), so no burden in find_vmap_area in old kernel.
> > 
> Yep. It will slow down for sure.
> 
> > > 
> > > How many CPUs in a system you have?
> > > 
> > 
> > 128 core
> OK. Just in case, do you see in a boot log something like:
> 
> "Failed to allocate an array. Disable a node layer"
> 
And if you do not see such a failing message, it means that the node layer
is fully up and running. Could you also test the patch below on your workload?

<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 634162271c00..35b28be27cf4 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -896,7 +896,7 @@ static struct vmap_node {
  * is fully disabled. Later on, after vmap is initialized these
  * parameters are updated based on a system capacity.
  */
-static struct vmap_node *vmap_nodes = &single;
+static struct vmap_node **vmap_nodes;
 static __read_mostly unsigned int nr_vmap_nodes = 1;
 static __read_mostly unsigned int vmap_zone_size = 1;
 
@@ -909,13 +909,13 @@ addr_to_node_id(unsigned long addr)
 static inline struct vmap_node *
 addr_to_node(unsigned long addr)
 {
-	return &vmap_nodes[addr_to_node_id(addr)];
+	return vmap_nodes[addr_to_node_id(addr)];
 }
 
 static inline struct vmap_node *
 id_to_node(unsigned int id)
 {
-	return &vmap_nodes[id % nr_vmap_nodes];
+	return vmap_nodes[id % nr_vmap_nodes];
 }
 
 /*
@@ -1060,7 +1060,7 @@ find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va)
 
 repeat:
 	for (i = 0, va_start_lowest = 0; i < nr_vmap_nodes; i++) {
-		vn = &vmap_nodes[i];
+		vn = vmap_nodes[i];
 
 		spin_lock(&vn->busy.lock);
 		*va = __find_vmap_area_exceed_addr(addr, &vn->busy.root);
@@ -2240,7 +2240,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 	purge_nodes = CPU_MASK_NONE;
 
 	for (i = 0; i < nr_vmap_nodes; i++) {
-		vn = &vmap_nodes[i];
+		vn = vmap_nodes[i];
 
 		INIT_LIST_HEAD(&vn->purge_list);
 		vn->skip_populate = full_pool_decay;
@@ -2272,7 +2272,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 		nr_purge_helpers = clamp(nr_purge_helpers, 1U, nr_purge_nodes) - 1;
 
 		for_each_cpu(i, &purge_nodes) {
-			vn = &vmap_nodes[i];
+			vn = vmap_nodes[i];
 
 			if (nr_purge_helpers > 0) {
 				INIT_WORK(&vn->purge_work, purge_vmap_node);
@@ -2291,7 +2291,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 		}
 
 		for_each_cpu(i, &purge_nodes) {
-			vn = &vmap_nodes[i];
+			vn = vmap_nodes[i];
 
 			if (vn->purge_work.func) {
 				flush_work(&vn->purge_work);
@@ -2397,7 +2397,7 @@ struct vmap_area *find_vmap_area(unsigned long addr)
 	 */
 	i = j = addr_to_node_id(addr);
 	do {
-		vn = &vmap_nodes[i];
+		vn = vmap_nodes[i];
 
 		spin_lock(&vn->busy.lock);
 		va = __find_vmap_area(addr, &vn->busy.root);
@@ -2421,7 +2421,7 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
 	 */
 	i = j = addr_to_node_id(addr);
 	do {
-		vn = &vmap_nodes[i];
+		vn = vmap_nodes[i];
 
 		spin_lock(&vn->busy.lock);
 		va = __find_vmap_area(addr, &vn->busy.root);
@@ -4928,7 +4928,7 @@ static void show_purge_info(struct seq_file *m)
 	int i;
 
 	for (i = 0; i < nr_vmap_nodes; i++) {
-		vn = &vmap_nodes[i];
+		vn = vmap_nodes[i];
 
 		spin_lock(&vn->lazy.lock);
 		list_for_each_entry(va, &vn->lazy.head, list) {
@@ -4948,7 +4948,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p)
 	int i;
 
 	for (i = 0; i < nr_vmap_nodes; i++) {
-		vn = &vmap_nodes[i];
+		vn = vmap_nodes[i];
 
 		spin_lock(&vn->busy.lock);
 		list_for_each_entry(va, &vn->busy.head, list) {
@@ -5069,6 +5069,7 @@ static void __init vmap_init_free_space(void)
 
 static void vmap_init_nodes(void)
 {
+	struct vmap_node **nodes;
 	struct vmap_node *vn;
 	int i, n;
 
@@ -5087,23 +5088,34 @@ static void vmap_init_nodes(void)
 	 * set of cores. Therefore a per-domain purging is supposed to
 	 * be added as well as a per-domain balancing.
 	 */
-	n = clamp_t(unsigned int, num_possible_cpus(), 1, 128);
+	n = 1024;
 
 	if (n > 1) {
-		vn = kmalloc_array(n, sizeof(*vn), GFP_NOWAIT | __GFP_NOWARN);
-		if (vn) {
+		nodes = kmalloc_array(n, sizeof(struct vmap_node **),
+			GFP_NOWAIT | __GFP_NOWARN | __GFP_ZERO);
+
+		if (nodes) {
+			for (i = 0; i < n; i++) {
+				nodes[i] = kmalloc(sizeof(struct vmap_node), GFP_NOWAIT | __GFP_ZERO);
+
+				if (!nodes[i])
+					break;
+			}
+
 			/* Node partition is 16 pages. */
 			vmap_zone_size = (1 << 4) * PAGE_SIZE;
-			nr_vmap_nodes = n;
-			vmap_nodes = vn;
+			nr_vmap_nodes = i;
+			vmap_nodes = nodes;
 		} else {
 			pr_err("Failed to allocate an array. Disable a node layer\n");
+			vmap_nodes[0] = &single;
+			nr_vmap_nodes = 1;
 		}
 	}
 #endif
 
 	for (n = 0; n < nr_vmap_nodes; n++) {
-		vn = &vmap_nodes[n];
+		vn = vmap_nodes[n];
 		vn->busy.root = RB_ROOT;
 		INIT_LIST_HEAD(&vn->busy.head);
 		spin_lock_init(&vn->busy.lock);
@@ -5129,7 +5141,7 @@ vmap_node_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 	int i, j;
 
 	for (count = 0, i = 0; i < nr_vmap_nodes; i++) {
-		vn = &vmap_nodes[i];
+		vn = vmap_nodes[i];
 
 		for (j = 0; j < MAX_VA_SIZE_PAGES; j++)
 			count += READ_ONCE(vn->pool[j].len);
@@ -5144,7 +5156,7 @@ vmap_node_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 	int i;
 
 	for (i = 0; i < nr_vmap_nodes; i++)
-		decay_va_pool_node(&vmap_nodes[i], true);
+		decay_va_pool_node(vmap_nodes[i], true);
 
 	return SHRINK_STOP;
 }
<snip>

It sets the number of nodes to 1024. It would be really appreciated to see
the perf delta with this patch, i.e. whether it improves things or not.

Thank you in advance.

--
Uladzislau Rezki
Matthew Wilcox (Oracle) Dec. 3, 2024, 7:56 p.m. UTC | #14
On Tue, Dec 03, 2024 at 08:02:26PM +0100, Uladzislau Rezki wrote:

I think there are a few other things we can try here.

First, if the copy is small (and I still don't have an answer to that
...), we can skip the vmalloc lookup if the copy doesn't cross a page
boundary.
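
A minimal sketch of that first idea, assuming the ptr/n/addr variables as
used in check_heap_object() (this is an illustration of the suggestion, not
an existing patch):

<snip>
	if (is_vmalloc_addr(ptr)) {
		/*
		 * A copy that stays within a single page cannot overrun a
		 * page-aligned vmalloc allocation, so the expensive
		 * find_vmap_area() lookup could be skipped in that case.
		 */
		if ((addr >> PAGE_SHIFT) == ((addr + n - 1) >> PAGE_SHIFT))
			return;

		/* ... otherwise fall through to find_vmap_area() as today ... */
	}
<snip>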

Second, we could try storing this in a maple tree rather than an rbtree.
That gives us RCU-protected lookups rather than lookups under a spinlock.
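
A sketch of what such an RCU-protected lookup could look like if busy vmap
areas were tracked in a maple tree; "vmap_area_mt" is a hypothetical tree
name and the integration itself is only a suggestion at this point:

<snip>
	/* Hypothetical: areas stored via mtree_store_range(va_start, va_end - 1). */
	struct vmap_area *va;

	rcu_read_lock();
	va = mtree_load(&vmap_area_mt, addr);	/* no spinlock taken */
	rcu_read_unlock();
<snip>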

It might even be worth going to an rwlock first, in case the problem is
that there's severe lock contention.

But I've asked for data on spinlock contention and not received an
answer on that either, so I don't know what to suggest.

Anyway, NACK to the original patch; that's just a horrible idea.
zuoze Dec. 4, 2024, 1:21 a.m. UTC | #15
在 2024/12/4 3:02, Uladzislau Rezki 写道:
> On Tue, Dec 03, 2024 at 03:20:04PM +0100, Uladzislau Rezki wrote:
>> On Tue, Dec 03, 2024 at 10:10:26PM +0800, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/12/3 21:51, Uladzislau Rezki wrote:
>>>> On Tue, Dec 03, 2024 at 09:45:09PM +0800, Kefeng Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/12/3 21:39, Uladzislau Rezki wrote:
>>>>>> On Tue, Dec 03, 2024 at 09:30:09PM +0800, Kefeng Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/12/3 21:10, zuoze wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> 在 2024/12/3 20:39, Uladzislau Rezki 写道:
>>>>>>>>> On Tue, Dec 03, 2024 at 07:23:44PM +0800, zuoze wrote:
>>>>>>>>>> We have implemented host-guest communication based on the TUN device
>>>>>>>>>> using XSK[1]. The hardware is a Kunpeng 920 machine (ARM architecture),
>>>>>>>>>> and the operating system is based on the 6.6 LTS version with kernel
>>>>>>>>>> version 6.6. The specific stack for hotspot collection is as follows:
>>>>>>>>>>
>>>>>>>>>> -  100.00%     0.00%  vhost-12384  [unknown]      [k] 0000000000000000
>>>>>>>>>>        - ret_from_fork
>>>>>>>>>>           - 99.99% vhost_task_fn
>>>>>>>>>>              - 99.98% 0xffffdc59f619876c
>>>>>>>>>>                 - 98.99% handle_rx_kick
>>>>>>>>>>                    - 98.94% handle_rx
>>>>>>>>>>                       - 94.92% tun_recvmsg
>>>>>>>>>>                          - 94.76% tun_do_read
>>>>>>>>>>                             - 94.62% tun_put_user_xdp_zc
>>>>>>>>>>                                - 63.53% __check_object_size
>>>>>>>>>>                                   - 63.49% __check_object_size.part.0
>>>>>>>>>>                                        find_vmap_area
>>>>>>>>>>                                - 30.02% _copy_to_iter
>>>>>>>>>>                                     __arch_copy_to_user
>>>>>>>>>>                       - 2.27% get_rx_bufs
>>>>>>>>>>                          - 2.12% vhost_get_vq_desc
>>>>>>>>>>                               1.49% __arch_copy_from_user
>>>>>>>>>>                       - 0.89% peek_head_len
>>>>>>>>>>                            0.54% xsk_tx_peek_desc
>>>>>>>>>>                       - 0.68% vhost_add_used_and_signal_n
>>>>>>>>>>                          - 0.53% eventfd_signal
>>>>>>>>>>                               eventfd_signal_mask
>>>>>>>>>>                 - 0.94% handle_tx_kick
>>>>>>>>>>                    - 0.94% handle_tx
>>>>>>>>>>                       - handle_tx_copy
>>>>>>>>>>                          - 0.59% vhost_tx_batch.constprop.0
>>>>>>>>>>                               0.52% tun_sendmsg
>>>>>>>>>>
>>>>>>>>>> It can be observed that most of the overhead is concentrated in the
>>>>>>>>>> find_vmap_area function.
>>>>>>>>>>
>>> ...
>>>>>
>>>> Thank you. Then you have tons of copy_to_iter/copy_from_iter calls
>>>> during your test case. Per each you need to find an area which might
>>>> be really heavy.
>>>
>>> Exactly, no vmalloc check before 0aef499f3172 ("mm/usercopy: Detect vmalloc
>>> overruns"), so no burden in find_vmap_area in old kernel.
>>>
>> Yep. It will slow down for sure.
>>
>>>>
>>>> How many CPUs in a system you have?
>>>>
>>>
>>> 128 core
>> OK. Just in case, do you see in a boot log something like:
>>
>> "Failed to allocate an array. Disable a node layer"
>>
> And if you do not see such a failure message, it means that the node layer
> is fully up and running. Can you also test the patch below on your workload?

We did not see such a failure message. As for the test patch you
suggested, we will test it and check the results. Thank you very much.

> 
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 634162271c00..35b28be27cf4 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -896,7 +896,7 @@ static struct vmap_node {
>    * is fully disabled. Later on, after vmap is initialized these
>    * parameters are updated based on a system capacity.
>    */
> -static struct vmap_node *vmap_nodes = &single;
> +static struct vmap_node **vmap_nodes;
>   static __read_mostly unsigned int nr_vmap_nodes = 1;
>   static __read_mostly unsigned int vmap_zone_size = 1;
>   
> @@ -909,13 +909,13 @@ addr_to_node_id(unsigned long addr)
>   static inline struct vmap_node *
>   addr_to_node(unsigned long addr)
>   {
> -	return &vmap_nodes[addr_to_node_id(addr)];
> +	return vmap_nodes[addr_to_node_id(addr)];
>   }
>   
>   static inline struct vmap_node *
>   id_to_node(unsigned int id)
>   {
> -	return &vmap_nodes[id % nr_vmap_nodes];
> +	return vmap_nodes[id % nr_vmap_nodes];
>   }
>   
>   /*
> @@ -1060,7 +1060,7 @@ find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va)
>   
>   repeat:
>   	for (i = 0, va_start_lowest = 0; i < nr_vmap_nodes; i++) {
> -		vn = &vmap_nodes[i];
> +		vn = vmap_nodes[i];
>   
>   		spin_lock(&vn->busy.lock);
>   		*va = __find_vmap_area_exceed_addr(addr, &vn->busy.root);
> @@ -2240,7 +2240,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
>   	purge_nodes = CPU_MASK_NONE;
>   
>   	for (i = 0; i < nr_vmap_nodes; i++) {
> -		vn = &vmap_nodes[i];
> +		vn = vmap_nodes[i];
>   
>   		INIT_LIST_HEAD(&vn->purge_list);
>   		vn->skip_populate = full_pool_decay;
> @@ -2272,7 +2272,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
>   		nr_purge_helpers = clamp(nr_purge_helpers, 1U, nr_purge_nodes) - 1;
>   
>   		for_each_cpu(i, &purge_nodes) {
> -			vn = &vmap_nodes[i];
> +			vn = vmap_nodes[i];
>   
>   			if (nr_purge_helpers > 0) {
>   				INIT_WORK(&vn->purge_work, purge_vmap_node);
> @@ -2291,7 +2291,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
>   		}
>   
>   		for_each_cpu(i, &purge_nodes) {
> -			vn = &vmap_nodes[i];
> +			vn = vmap_nodes[i];
>   
>   			if (vn->purge_work.func) {
>   				flush_work(&vn->purge_work);
> @@ -2397,7 +2397,7 @@ struct vmap_area *find_vmap_area(unsigned long addr)
>   	 */
>   	i = j = addr_to_node_id(addr);
>   	do {
> -		vn = &vmap_nodes[i];
> +		vn = vmap_nodes[i];
>   
>   		spin_lock(&vn->busy.lock);
>   		va = __find_vmap_area(addr, &vn->busy.root);
> @@ -2421,7 +2421,7 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
>   	 */
>   	i = j = addr_to_node_id(addr);
>   	do {
> -		vn = &vmap_nodes[i];
> +		vn = vmap_nodes[i];
>   
>   		spin_lock(&vn->busy.lock);
>   		va = __find_vmap_area(addr, &vn->busy.root);
> @@ -4928,7 +4928,7 @@ static void show_purge_info(struct seq_file *m)
>   	int i;
>   
>   	for (i = 0; i < nr_vmap_nodes; i++) {
> -		vn = &vmap_nodes[i];
> +		vn = vmap_nodes[i];
>   
>   		spin_lock(&vn->lazy.lock);
>   		list_for_each_entry(va, &vn->lazy.head, list) {
> @@ -4948,7 +4948,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p)
>   	int i;
>   
>   	for (i = 0; i < nr_vmap_nodes; i++) {
> -		vn = &vmap_nodes[i];
> +		vn = vmap_nodes[i];
>   
>   		spin_lock(&vn->busy.lock);
>   		list_for_each_entry(va, &vn->busy.head, list) {
> @@ -5069,6 +5069,7 @@ static void __init vmap_init_free_space(void)
>   
>   static void vmap_init_nodes(void)
>   {
> +	struct vmap_node **nodes;
>   	struct vmap_node *vn;
>   	int i, n;
>   
> @@ -5087,23 +5088,34 @@ static void vmap_init_nodes(void)
>   	 * set of cores. Therefore a per-domain purging is supposed to
>   	 * be added as well as a per-domain balancing.
>   	 */
> -	n = clamp_t(unsigned int, num_possible_cpus(), 1, 128);
> +	n = 1024;
>   
>   	if (n > 1) {
> -		vn = kmalloc_array(n, sizeof(*vn), GFP_NOWAIT | __GFP_NOWARN);
> -		if (vn) {
> +		nodes = kmalloc_array(n, sizeof(struct vmap_node **),
> +			GFP_NOWAIT | __GFP_NOWARN | __GFP_ZERO);
> +
> +		if (nodes) {
> +			for (i = 0; i < n; i++) {
> +				nodes[i] = kmalloc(sizeof(struct vmap_node), GFP_NOWAIT | __GFP_ZERO);
> +
> +				if (!nodes[i])
> +					break;
> +			}
> +
>   			/* Node partition is 16 pages. */
>   			vmap_zone_size = (1 << 4) * PAGE_SIZE;
> -			nr_vmap_nodes = n;
> -			vmap_nodes = vn;
> +			nr_vmap_nodes = i;
> +			vmap_nodes = nodes;
>   		} else {
>   			pr_err("Failed to allocate an array. Disable a node layer\n");
> +			vmap_nodes[0] = &single;
> +			nr_vmap_nodes = 1;
>   		}
>   	}
>   #endif
>   
>   	for (n = 0; n < nr_vmap_nodes; n++) {
> -		vn = &vmap_nodes[n];
> +		vn = vmap_nodes[n];
>   		vn->busy.root = RB_ROOT;
>   		INIT_LIST_HEAD(&vn->busy.head);
>   		spin_lock_init(&vn->busy.lock);
> @@ -5129,7 +5141,7 @@ vmap_node_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>   	int i, j;
>   
>   	for (count = 0, i = 0; i < nr_vmap_nodes; i++) {
> -		vn = &vmap_nodes[i];
> +		vn = vmap_nodes[i];
>   
>   		for (j = 0; j < MAX_VA_SIZE_PAGES; j++)
>   			count += READ_ONCE(vn->pool[j].len);
> @@ -5144,7 +5156,7 @@ vmap_node_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>   	int i;
>   
>   	for (i = 0; i < nr_vmap_nodes; i++)
> -		decay_va_pool_node(&vmap_nodes[i], true);
> +		decay_va_pool_node(vmap_nodes[i], true);
>   
>   	return SHRINK_STOP;
>   }
> <snip>
> 
> It sets the number of nodes to 1024. It would be really appreciated to see
> the perf delta with this patch, i.e. whether it improves things or not.
> 
> Thank you in advance.
> 
> --
> Uladzislau Rezki
>
zuoze Dec. 4, 2024, 1:38 a.m. UTC | #16
On 2024/12/4 3:56, Matthew Wilcox wrote:
> On Tue, Dec 03, 2024 at 08:02:26PM +0100, Uladzislau Rezki wrote:
> 
> I think there are a few other things we can try here.
> 
> First, if the copy is small (and I still don't have an answer to that
> ...), we can skip the vmalloc lookup if the copy doesn't cross a page
> boundary.
> 

Large data packets.

> Second, we could try storing this in a maple tree rather than an rbtree.
> That gives us RCU protected lookups rather than under a spinlock.
> 
> It might even be worth going to a rwlock first, in case the problem is
> that there's severe lock contention.
> 
> But I've asked for data on spinlock contention and not received an
> answer on that either, so I don't know what to suggest.

Thank you very much for your suggestions. We will check the spinlock
contention in the future.

> 
> Anyway, NACK to the original patch; that's just a horrible idea.
>
Kees Cook Dec. 4, 2024, 4:43 a.m. UTC | #17
On December 4, 2024 5:56:34 AM GMT+10:00, Matthew Wilcox <willy@infradead.org> wrote:
>On Tue, Dec 03, 2024 at 08:02:26PM +0100, Uladzislau Rezki wrote:
>
>I think there are a few other things we can try here.
>
>First, if the copy is small (and I still don't have an answer to that
>...), we can skip the vmalloc lookup if the copy doesn't cross a page
>boundary.

Yeah, this seems a reasonable optimization.

>Anyway, NACK to the original patch; that's just a horrible idea.

Right, please no debugfs knob. The point is that the check must not be possible to disable at runtime; disabling it at boot already makes the check a no-op. Please use the boot param to disable this check.
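
For reference, assuming the existing hardened_usercopy= early parameter is the boot param meant here, disabling the check at boot rather than through a runtime knob looks roughly like this (paths are placeholders):

<snip>
# Illustrative kernel command line: turns off all usercopy hardening,
# including the vmalloc area lookup, via the boot-time static branch.
linux /boot/vmlinuz-6.6.0 root=/dev/sda1 ro hardened_usercopy=off
<snip>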

-Kees
Uladzislau Rezki Dec. 4, 2024, 7:55 a.m. UTC | #18
On Tue, Dec 03, 2024 at 07:56:34PM +0000, Matthew Wilcox wrote:
> On Tue, Dec 03, 2024 at 08:02:26PM +0100, Uladzislau Rezki wrote:
> 
> I think there are a few other things we can try here.
> 
> First, if the copy is small (and I still don't have an answer to that
> ...), we can skip the vmalloc lookup if the copy doesn't cross a page
> boundary.
> 
> Second, we could try storing this in a maple tree rather than an rbtree.
> That gives us RCU protected lookups rather than under a spinlock.
> 
> It might even be worth going to a rwlock first, in case the problem is
> that there's severe lock contention.
> 
> But I've asked for data on spinlock contention and not received an
> answer on that either, so I don't know what to suggest.
> 
I think it is not about contention. It is about the extra "attached
load" when data is heavily copied back and forth. On each copy path
you need to do a scan. A maple tree is not something that can help here :)

Indeed, no contention data. Zuoze, please share this if you can.

--
Uladzislau Rezki
Uladzislau Rezki Dec. 4, 2024, 8:51 a.m. UTC | #19
> On Tue, Dec 03, 2024 at 08:02:26PM +0100, Uladzislau Rezki wrote:
> 
> I think there are a few other things we can try here.
> 
> First, if the copy is small (and I still don't have an answer to that
> ...), we can skip the vmalloc lookup if the copy doesn't cross a page
> boundary.
>
I noticed that the path in question does not supply the starting address
of the mapping area; instead it passes an address somewhere within it.

> 
> Second, we could try storing this in a maple tree rather than an rbtree.
> That gives us RCU protected lookups rather than under a spinlock.
> 
I think, when I have more free cycles, I will check it from a performance
point of view, because I do not know how efficient a maple tree is when
it comes to lookups, inserts and removals.

As an RCU-safe data structure, yes, searching is improved in the sense that
there is no need to take a spinlock. As noted earlier, I do not know whether
a maple tree allows finding the data when, instead of the key it is
associated with, we pass something that lies within a searchable area:
[va_start:va_end].

--
Uladzislau Rezki
zuoze Dec. 4, 2024, 9:21 a.m. UTC | #20
On 2024/12/4 15:55, Uladzislau Rezki wrote:
> On Tue, Dec 03, 2024 at 07:56:34PM +0000, Matthew Wilcox wrote:
>> On Tue, Dec 03, 2024 at 08:02:26PM +0100, Uladzislau Rezki wrote:
>>
>> I think there are a few other things we can try here.
>>
>> First, if the copy is small (and I still don't have an answer to that
>> ...), we can skip the vmalloc lookup if the copy doesn't cross a page
>> boundary.
>>
>> Second, we could try storing this in a maple tree rather than an rbtree.
>> That gives us RCU protected lookups rather than under a spinlock.
>>
>> It might even be worth going to a rwlock first, in case the problem is
>> that there's severe lock contention.
>>
>> But I've asked for data on spinlock contention and not received an
>> answer on that either, so I don't know what to suggest.
>>
> I think, it is not about contention. It is about the extra "attached
> load" when a data is heavily copied force and back. On each copy path
> you need to do a scan. Maple tree is not that something can help here :)
> 
> Indeed, no contention data. Zuoze, please share this if you can.

We have enabled perf lock contention and are currently debugging the
environment. We will share the results as soon as we have them.

> 
> --
> Uladzislau Rezki
Uladzislau Rezki Dec. 4, 2024, 9:27 a.m. UTC | #21
On Wed, Dec 04, 2024 at 05:21:12PM +0800, zuoze wrote:
> 
> 
> On 2024/12/4 15:55, Uladzislau Rezki wrote:
> > On Tue, Dec 03, 2024 at 07:56:34PM +0000, Matthew Wilcox wrote:
> > > On Tue, Dec 03, 2024 at 08:02:26PM +0100, Uladzislau Rezki wrote:
> > > 
> > > I think there are a few other things we can try here.
> > > 
> > > First, if the copy is small (and I still don't have an answer to that
> > > ...), we can skip the vmalloc lookup if the copy doesn't cross a page
> > > boundary.
> > > 
> > > Second, we could try storing this in a maple tree rather than an rbtree.
> > > That gives us RCU protected lookups rather than under a spinlock.
> > > 
> > > It might even be worth going to a rwlock first, in case the problem is
> > > that there's severe lock contention.
> > > 
> > > But I've asked for data on spinlock contention and not received an
> > > answer on that either, so I don't know what to suggest.
> > > 
> > I think, it is not about contention. It is about the extra "attached
> > load" when a data is heavily copied force and back. On each copy path
> > you need to do a scan. Maple tree is not that something can help here :)
> > 
> > Indeed, no contention data. Zuoze, please share this if you can.
> 
> We have enabled perf lock contention and are currently debugging the
> environment. We will share the results as soon as we have them.
> 
Sounds good and thank you for helping to improve this :)

--
Uladzislau Rezki
Matthew Wilcox (Oracle) Dec. 16, 2024, 4:24 a.m. UTC | #22
On Wed, Dec 04, 2024 at 09:51:07AM +0100, Uladzislau Rezki wrote:
> I think, when i have more free cycles, i will check it from performance
> point of view. Because i do not know how much a maple tree is efficient
> when it comes to lookups, insert and removing.

Maple tree has a fanout of around 8-12 at each level, while an rbtree has
a fanout of two (arguably 3, since we might find the node).  Let's say you
have 1000 vmalloc areas.  A perfectly balanced rbtree would have 9 levels
(and might well be 11+ levels if imperfectly balanced -- and part of the
advantage of rbtrees over AVL trees is that they can be less balanced
so need fewer rotations).  A perfectly balanced maple tree would have
only 3 levels.

Addition/removal is more expensive.  We biased the implementation heavily
towards lookup, so we chose to keep it very compact.  Most users (and
particularly the VMA tree which was our first client) do more lookups
than modifications; a real application takes many more pagefaults than
it does calls to mmap/munmap/mprotect/etc.

> As an RCU safe data structure, yes, a searching is improved in a way there
> is no need in taking spinlock. As a noted earlier i do not know if a maple 
> tree allows to find a data when instead of key, it is associated with, we
> pass something that is withing a searchable area: [va_start:va_end].

That's what maple trees do; they store non-overlapping ranges.  So you
can look up any address in a range and it will return you the pointer
associated with that range.  Just like you'd want for a page fault ;-)
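
An untested sketch of how that could look for vmap areas, using the public
mtree_insert_range()/mtree_load() API (the tree and helper names are made up
for illustration, and GFP flags and error handling are simplified):

<snip>
#include <linux/maple_tree.h>

static DEFINE_MTREE(vmap_area_mtree);

/* Store the whole [va_start, va_end - 1] range, not just the start key. */
static int vmap_area_mtree_insert(struct vmap_area *va)
{
        return mtree_insert_range(&vmap_area_mtree, va->va_start,
                                  va->va_end - 1, va, GFP_KERNEL);
}

/*
 * Any address inside the stored range resolves to the owning vmap_area,
 * and the lookup itself is RCU-protected rather than taken under the
 * per-node busy spinlock.
 */
static struct vmap_area *vmap_area_mtree_lookup(unsigned long addr)
{
        return mtree_load(&vmap_area_mtree, addr);
}
<snip>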
Uladzislau Rezki Dec. 16, 2024, 7:18 p.m. UTC | #23
Hello, Matthew!

> On Wed, Dec 04, 2024 at 09:51:07AM +0100, Uladzislau Rezki wrote:
> > I think, when i have more free cycles, i will check it from performance
> > point of view. Because i do not know how much a maple tree is efficient
> > when it comes to lookups, insert and removing.
> 
> Maple tree has a fanout of around 8-12 at each level, while an rbtree has
> a fanout of two (arguably 3, since we might find the node).  Let's say you
> have 1000 vmalloc areas.  A perfectly balanced rbtree would have 9 levels
> (and might well be 11+ levels if imperfectly balanced -- and part of the
> advantage of rbtrees over AVL trees is that they can be less balanced
> so need fewer rotations).  A perfectly balanced maple tree would have
> only 3 levels.
> 
Thank you for your explanation and input on this topic. Density, the
height of the tree and the branching factor should make things work better :)

>
> Addition/removal is more expensive.  We biased the implementation heavily
> towards lookup, so we chose to keep it very compact.  Most users (and
> particularly the VMA tree which was our first client) do more lookups
> than modifications; a real application takes many more pagefaults than
> it does calls to mmap/munmap/mprotect/etc.
> 
This is what I see. Some use cases are degraded. For example, the stress-ng
forking bench is worse, and test_vmalloc.sh also reports a degradation:

See below figures:

<snip>
# Default
urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
+   59.52%     7.15%  [kernel]        [k] __vmalloc_node_range_noprof
+   37.98%     0.22%  [test_vmalloc]  [k] fix_size_alloc_test
+   37.32%     8.56%  [kernel]        [k] vfree.part.0
+   35.31%     0.00%  [kernel]        [k] ret_from_fork_asm
+   35.31%     0.00%  [kernel]        [k] ret_from_fork
+   35.31%     0.00%  [kernel]        [k] kthread
+   35.05%     0.00%  [test_vmalloc]  [k] test_func
+   34.16%     0.06%  [test_vmalloc]  [k] long_busy_list_alloc_test
+   32.10%     0.12%  [kernel]        [k] __get_vm_area_node
+   31.69%     1.82%  [kernel]        [k] alloc_vmap_area
+   27.24%     5.01%  [kernel]        [k] _raw_spin_lock
+   25.45%     0.15%  [test_vmalloc]  [k] full_fit_alloc_test
+   23.57%     0.03%  [kernel]        [k] remove_vm_area
+   22.23%    22.23%  [kernel]        [k] native_queued_spin_lock_slowpath
+   14.34%     0.94%  [kernel]        [k] alloc_pages_bulk_noprof
+   10.80%     7.51%  [kernel]        [k] free_vmap_area_noflush
+   10.59%    10.59%  [kernel]        [k] clear_page_rep
+    9.52%     8.96%  [kernel]        [k] insert_vmap_area
+    7.39%     2.82%  [kernel]        [k] find_unlink_vmap_area

# Maple-tree
time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
+   74.33%     1.50%  [kernel]        [k] __vmalloc_node_range_noprof
+   55.73%     0.06%  [kernel]        [k] __get_vm_area_node
+   55.53%     1.07%  [kernel]        [k] alloc_vmap_area
+   53.78%     0.09%  [test_vmalloc]  [k] long_busy_list_alloc_test
+   53.75%     1.76%  [kernel]        [k] _raw_spin_lock
+   52.81%    51.80%  [kernel]        [k] native_queued_spin_lock_slowpath
+   28.93%     0.09%  [test_vmalloc]  [k] full_fit_alloc_test
+   23.29%     2.43%  [kernel]        [k] vfree.part.0
+   20.29%     0.01%  [kernel]        [k] mt_insert_vmap_area
+   20.27%     0.34%  [kernel]        [k] mtree_insert_range
+   15.30%     0.05%  [test_vmalloc]  [k] fix_size_alloc_test
+   14.06%     0.05%  [kernel]        [k] remove_vm_area
+   13.73%     0.00%  [kernel]        [k] ret_from_fork_asm
+   13.73%     0.00%  [kernel]        [k] ret_from_fork
+   13.73%     0.00%  [kernel]        [k] kthread
+   13.51%     0.00%  [test_vmalloc]  [k] test_func
+   13.15%     0.87%  [kernel]        [k] alloc_pages_bulk_noprof
+    9.92%     9.54%  [kernel]        [k] clear_page_rep
+    9.62%     0.07%  [kernel]        [k] find_unlink_vmap_area
+    9.55%     0.04%  [kernel]        [k] mtree_erase
+    5.92%     1.44%  [kernel]        [k] free_unref_page
+    4.92%     0.24%  [kernel]        [k] mas_insert.isra.0
+    4.69%     0.93%  [kernel]        [k] mas_erase
+    4.47%     0.02%  [kernel]        [k] rcu_do_batch
+    3.35%     2.10%  [kernel]        [k] __vmap_pages_range_noflush
+    3.00%     2.81%  [kernel]        [k] mas_wr_store_type
<snip>

i.e. insert/remove are more expensive, at least my tests show this.
It looks like mtree_insert() uses a range variant, which implies
a tree update after an insert operation is completed, and that is
probably where the overhead comes from. If I use a b+tree (my own
implementation), as expected, it is better than the rb-tree because
of b+tree properties.

I have composed some data; you can find more bench data here:

wget ftp://vps418301.ovh.net/incoming/Maple_tree_comparison_with_rb_tree_in_vmalloc.pdf

>> That's what maple trees do; they store non-overlapping ranges.  So you
>> can look up any address in a range and it will return you the pointer
>> associated with that range.  Just like you'd want for a page fault ;-)
Thank you. I see. I thought that it could also work as a regular b+ tree or
b-tree, so we do not spend cycles on updates to track ranges. Like the code
below:

int ret = mtree_insert(t, va->va_start, va, GFP_KERNEL);

I do not store a range here, I store a key -> value pair, but the maple tree
considers it as the range [va_start:va_start]. Maybe we can improve this
case when a range is not passed?

These are just my thoughts :)

--
Uladzislau Rezki
diff mbox series

Patch

diff --git a/mm/usercopy.c b/mm/usercopy.c
index 83c164aba6e0..ef1eb23b2273 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -21,6 +21,7 @@ 
 #include <linux/vmalloc.h>
 #include <linux/atomic.h>
 #include <linux/jump_label.h>
+#include <linux/debugfs.h>
 #include <asm/sections.h>
 #include "slab.h"
 
@@ -159,6 +160,8 @@  static inline void check_bogus_address(const unsigned long ptr, unsigned long n,
 		usercopy_abort("null address", NULL, to_user, ptr, n);
 }
 
+static bool bypass_vmalloc_check __read_mostly;
+
 static inline void check_heap_object(const void *ptr, unsigned long n,
 				     bool to_user)
 {
@@ -174,8 +177,13 @@  static inline void check_heap_object(const void *ptr, unsigned long n,
 	}
 
 	if (is_vmalloc_addr(ptr) && !pagefault_disabled()) {
-		struct vmap_area *area = find_vmap_area(addr);
+		struct vmap_area *area;
+
+		 /* Bypass it since searching the kernel VM area is slow */
+		if (bypass_vmalloc_check)
+			return;
 
+		area = find_vmap_area(addr);
 		if (!area)
 			usercopy_abort("vmalloc", "no area", to_user, 0, n);
 
@@ -271,6 +279,9 @@  static int __init set_hardened_usercopy(void)
 {
 	if (enable_checks == false)
 		static_branch_enable(&bypass_usercopy_checks);
+	else
+		debugfs_create_bool("bypass_usercopy_vmalloc_check", 0600,
+				    NULL, &bypass_vmalloc_check);
 	return 1;
 }