| Message ID | 20241203023159.219355-1-zuoze1@huawei.com (mailing list archive) |
|---|---|
| State | New |
| Series | [-next] mm: usercopy: add a debugfs interface to bypass the vmalloc check |
On Tue, Dec 03, 2024 at 10:31:59AM +0800, Ze Zuo wrote:
> The commit 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns") introduced
> a vmalloc check for usercopy. However, in subsystems like networking, when
> memory allocated using vmalloc or vmap is subsequently copied using
> functions like copy_to_iter/copy_from_iter, the check is triggered. This
> adds overhead in the copy path, such as the cost of searching the
> red-black tree, which increases the performance burden.
>
> We found that after merging this patch, network bandwidth performance in
> the XDP scenario significantly dropped from 25 Gbits/sec to 8 Gbits/sec;
> hardened_usercopy is enabled by default.

What is "the XDP scenario", exactly? Are these large or small packets?
What's taking the time in find_vmap_area()? Is it lock contention?
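For context, this is roughly the check that commit 0aef499f3172 added to
check_heap_object() in mm/usercopy.c (a simplified sketch, not a verbatim
copy): every hardened usercopy of a vmalloc/vmap address now does one
find_vmap_area() walk, i.e. a spinlock plus an rbtree search, before the
data is copied.

<snip>
	/* Simplified sketch of the vmalloc branch in check_heap_object(). */
	if (is_vmalloc_addr(ptr)) {
		struct vmap_area *area = find_vmap_area(addr);	/* rbtree walk under a lock */

		if (!area)
			usercopy_abort("vmalloc", "no area", to_user, 0, n);

		/* Refuse copies that would run past the end of the mapping. */
		if (n > area->va_end - addr)
			usercopy_abort("vmalloc", NULL, to_user,
				       addr - area->va_start, n);
		return;
	}
<snip>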
On Tue, Dec 03, 2024 at 10:31:59AM +0800, Ze Zuo wrote:
> The commit 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns") introduced
> a vmalloc check for usercopy. However, in subsystems like networking, when
> memory allocated using vmalloc or vmap is subsequently copied using
> functions like copy_to_iter/copy_from_iter, the check is triggered. This
> adds overhead in the copy path, such as the cost of searching the
> red-black tree, which increases the performance burden.
>
> We found that after merging this patch, network bandwidth performance in
> the XDP scenario significantly dropped from 25 Gbits/sec to 8 Gbits/sec;
> hardened_usercopy is enabled by default.
>
> To address this, we introduced a debugfs interface that allows selectively
> enabling or disabling the vmalloc check based on the use case, optimizing
> performance.
>
> By default, the vmalloc check for usercopy is enabled.
>
> To disable the vmalloc check:
>   echo Y > /sys/kernel/debug/bypass_usercopy_vmalloc_check
>
> After executing the above command, the XDP performance returns to 25
> Gbits/sec.
>
In addition to what Matthew has asked, could you please also specify the
kernel version you ran your experiment on? Apart from that, please describe
your system and hardware.

Thank you!

--
Uladzislau Rezki
We have implemented host-guest communication based on the TUN device using
XSK [1]. The hardware is a Kunpeng 920 machine (ARM architecture), and the
operating system is based on the 6.6 LTS kernel. The call stack from hotspot
profiling is as follows:

-  100.00%   0.00%  vhost-12384  [unknown]  [k] 0000000000000000
   - ret_from_fork
      - 99.99% vhost_task_fn
         - 99.98% 0xffffdc59f619876c
            - 98.99% handle_rx_kick
               - 98.94% handle_rx
                  - 94.92% tun_recvmsg
                     - 94.76% tun_do_read
                        - 94.62% tun_put_user_xdp_zc
                           - 63.53% __check_object_size
                              - 63.49% __check_object_size.part.0
                                   find_vmap_area
                           - 30.02% _copy_to_iter
                                __arch_copy_to_user
                  - 2.27% get_rx_bufs
                     - 2.12% vhost_get_vq_desc
                          1.49% __arch_copy_from_user
                  - 0.89% peek_head_len
                       0.54% xsk_tx_peek_desc
                  - 0.68% vhost_add_used_and_signal_n
                     - 0.53% eventfd_signal
                          eventfd_signal_mask
            - 0.94% handle_tx_kick
               - 0.94% handle_tx
                  - handle_tx_copy
                     - 0.59% vhost_tx_batch.constprop.0
                          0.52% tun_sendmsg

It can be observed that most of the overhead is concentrated in the
find_vmap_area function.

[1]: https://www.kernel.org/doc/html/latest/networking/af_xdp.html

On 2024/12/3 12:11, Matthew Wilcox wrote:
> On Tue, Dec 03, 2024 at 10:31:59AM +0800, Ze Zuo wrote:
>> The commit 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns") introduced
>> a vmalloc check for usercopy. [...]
>
> What is "the XDP scenario", exactly? Are these large or small packets?
> What's taking the time in find_vmap_area()? Is it lock contention?
On Tue, Dec 03, 2024 at 07:23:44PM +0800, zuoze wrote:
> We have implemented host-guest communication based on the TUN device using
> XSK [1]. The hardware is a Kunpeng 920 machine (ARM architecture), and the
> operating system is based on the 6.6 LTS kernel. The call stack from
> hotspot profiling is as follows:
>
> [...]
>
> It can be observed that most of the overhead is concentrated in the
> find_vmap_area function.
>
I see. Yes, it is pretty contended, since you run the v6.6 kernel. There was
a series that mitigates the vmap lock contention. See it here:
https://lwn.net/Articles/956590/

The work was taken into the v6.9 kernel:

<snip>
commit 38f6b9af04c4b79f81b3c2a0f76d1de94b78d7bc
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:23 2024 +0100

    mm: vmalloc: add va_alloc() helper

    Patch series "Mitigate a vmap lock contention", v3.

    1. Motivation
    ...
<snip>

Could you please try the v6.9 kernel on your setup?

As for how to solve it on v6.6: the series can probably be back-ported.

Thanks!

--
Uladzislau Rezki
On 2024/12/3 20:39, Uladzislau Rezki wrote:
> On Tue, Dec 03, 2024 at 07:23:44PM +0800, zuoze wrote:
>> [...]
>> It can be observed that most of the overhead is concentrated in the
>> find_vmap_area function.
>>
> I see. Yes, it is pretty contended, since you run the v6.6 kernel. There was
> a series that mitigates the vmap lock contention. See it here:
> https://lwn.net/Articles/956590/
>
> The work was taken into the v6.9 kernel.
>
> Could you please try the v6.9 kernel on your setup?
>
> As for how to solve it on v6.6: the series can probably be back-ported.

All the vmalloc-related optimizations have already been merged into our 6.6
kernel, including the set of optimization patches you suggested. Thank you
very much for your input.

Thanks,
ze zuo
> > On Tue, Dec 03, 2024 at 07:23:44PM +0800, zuoze wrote:
> > > [...]
> > > It can be observed that most of the overhead is concentrated in the
> > > find_vmap_area function.
> > >
> > [...]
> > Could you please try the v6.9 kernel on your setup?
> >
> > As for how to solve it on v6.6: the series can probably be back-ported.
>
> All the vmalloc-related optimizations have already been merged into our 6.6
> kernel, including the set of optimization patches you suggested. Thank you
> very much for your input.
>
Do you mean that the perf data you posted to this email thread was collected
with the patch series "Mitigate a vmap lock contention", v3, from v6.9
applied, and you still see high contention?

Thank you!

--
Uladzislau Rezki
On 2024/12/3 21:10, zuoze wrote:
>
> On 2024/12/3 20:39, Uladzislau Rezki wrote:
>> On Tue, Dec 03, 2024 at 07:23:44PM +0800, zuoze wrote:
>>> [...]
>>> It can be observed that most of the overhead is concentrated in the
>>> find_vmap_area function.
>>>
>> I see. Yes, it is pretty contended, since you run the v6.6 kernel. There was
>> a series that mitigates the vmap lock contention. See it here:
>> https://lwn.net/Articles/956590/
>>
>> The work was taken into the v6.9 kernel.
>>
>> Could you please try the v6.9 kernel on your setup?
>>
>> As for how to solve it on v6.6: the series can probably be back-ported.
>
> All the vmalloc-related optimizations have already been merged into our 6.6
> kernel, including the set of optimization patches you suggested. Thank you
> very much for your input.
>

To be clear: we had already backported the vmalloc optimizations into our
6.6 kernel, so the stack above was collected with those patches applied, and
even with those optimizations find_vmap_area() is still the hotspot.
On Tue, Dec 03, 2024 at 09:30:09PM +0800, Kefeng Wang wrote:
> [...]
> To be clear: we had already backported the vmalloc optimizations into our
> 6.6 kernel, so the stack above was collected with those patches applied, and
> even with those optimizations find_vmap_area() is still the hotspot.
>
Could you please check that all the patches below are in your v6.6 kernel?
<snip>
commit 8be4d46e12af32342569840d958272dbb3be3f4c
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Wed Jan 24 19:09:20 2024 +0100

    mm: vmalloc: refactor vmalloc_dump_obj() function

commit 15e02a39fb6b43f37100563c6a246252d5d1e6da
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Wed Jan 24 19:09:19 2024 +0100

    mm: vmalloc: improve description of vmap node layer

commit 7679ba6b36dbb300b757b672d6a32a606499e14b
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:33 2024 +0100

    mm: vmalloc: add a shrinker to drain vmap pools

commit 8f33a2ff307248c3e55a7696f60b3658b28edb57
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:32 2024 +0100

    mm: vmalloc: set nr_nodes based on CPUs in a system

commit 8e1d743f2c2671aa54f6f91a2b33823f92512870
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:31 2024 +0100

    mm: vmalloc: support multiple nodes in vmallocinfo

commit 53becf32aec1c8049b854f0c31a11df5ed75df6f
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:30 2024 +0100

    mm: vmalloc: support multiple nodes in vread_iter

commit 96aa8437d169b8e030a98e2b74fd9a8ee9d3be7e
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Fri Feb 2 20:06:28 2024 +0100

    mm: vmalloc: add a scan area of VA only once

commit 72210662c5a2b6005f6daea7fe293a0dc573e1a5
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:29 2024 +0100

    mm: vmalloc: offload free_vmap_area_lock lock

commit 282631cb2447318e2a55b41a665dbe8571c46d70
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:28 2024 +0100

    mm: vmalloc: remove global purge_vmap_area_root rb-tree

commit 55c49fee57af99f3c663e69dedc5b85e691bbe50
Author: Baoquan He <bhe@redhat.com>
Date:   Tue Jan 2 19:46:27 2024 +0100

    mm/vmalloc: remove vmap_area_list

commit d093602919ad5908532142a048539800fa94a0d1
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:26 2024 +0100

    mm: vmalloc: remove global vmap_area_root rb-tree

commit 7fa8cee003166ef6db0bba70d610dbf173543811
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:25 2024 +0100

    mm: vmalloc: move vmap_init_free_space() down in vmalloc.c

commit 5b75b8e1b9040b34f43809b1948eaa4e83e39112
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:24 2024 +0100

    mm: vmalloc: rename adjust_va_to_fit_type() function

commit 38f6b9af04c4b79f81b3c2a0f76d1de94b78d7bc
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Tue Jan 2 19:46:23 2024 +0100

    mm: vmalloc: add va_alloc() helper
<snip>

Thank you!

--
Uladzislau Rezki
On 2024/12/3 21:39, Uladzislau Rezki wrote:
> On Tue, Dec 03, 2024 at 09:30:09PM +0800, Kefeng Wang wrote:
>> [...]
>> To be clear: we had already backported the vmalloc optimizations into our
>> 6.6 kernel, so the stack above was collected with those patches applied,
>> and even with those optimizations find_vmap_area() is still the hotspot.
>>
> Could you please check that all the patches below are in your v6.6 kernel?
Yes,

$ git lg v6.6..HEAD --oneline mm/vmalloc.c
* 86fee542f145 mm: vmalloc: ensure vmap_block is initialised before adding to queue
* f459a0b59f7c mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order fallback to order 0
* 0be7a82c2555 mm: vmalloc: fix lockdep warning
* 58b99a00d0a0 mm/vmalloc: eliminated the lock contention from twice to once
* 2c549aa32fa0 mm: vmalloc: check if a hash-index is in cpu_possible_mask
* 0bc6d608b445 mm: fix incorrect vbq reference in purge_fragmented_block
* 450f8c5270df mm/vmalloc: fix vmalloc which may return null if called with __GFP_NOFAIL
* 2ea2bf4a18c3 mm: vmalloc: bail out early in find_vmap_area() if vmap is not init
* bde74a3e8a71 mm/vmalloc: fix return value of vb_alloc if size is 0
* 8c620d05b7c3 mm: vmalloc: refactor vmalloc_dump_obj() function
* b0c8281703b8 mm: vmalloc: improve description of vmap node layer
* ecc3f0bf5c5a mm: vmalloc: add a shrinker to drain vmap pools
* dd89a137f483 mm: vmalloc: set nr_nodes based on CPUs in a system
* 8e63c98d86f6 mm: vmalloc: support multiple nodes in vmallocinfo
* cc32683cef48 mm: vmalloc: support multiple nodes in vread_iter
* 54d5ce65633d mm: vmalloc: add a scan area of VA only once
* ee9c199fb859 mm: vmalloc: offload free_vmap_area_lock lock
* c2c272d78b5a mm: vmalloc: remove global purge_vmap_area_root rb-tree
* c9b39e3ffa86 mm/vmalloc: remove vmap_area_list
* 091d2493d15f mm: vmalloc: remove global vmap_area_root rb-tree
* 53f06cc34bac mm: vmalloc: move vmap_init_free_space() down in vmalloc.c
* bf24196d9ab9 mm: vmalloc: rename adjust_va_to_fit_type() function
* 6e9c94401e34 mm: vmalloc: add va_alloc() helper
* ae528eb14e9a mm: Introduce vmap_page_range() to map pages in PCI address space
* e1dbcfaa1854 mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
* d3a24e7a01c4 mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.
* fc9813220585 mm/vmalloc: fix the unchecked dereference warning in vread_iter()
* a52e0157837e ascend: export interfaces required by ascend drivers
* 9b1283f2bec2 mm/vmalloc: Extend vmalloc usage about hugepage

>
> <snip>
> commit 8be4d46e12af32342569840d958272dbb3be3f4c
>     mm: vmalloc: refactor vmalloc_dump_obj() function
> [...]
> commit 38f6b9af04c4b79f81b3c2a0f76d1de94b78d7bc
>     mm: vmalloc: add va_alloc() helper
> <snip>
>
> Thank you!
>
> --
> Uladzislau Rezki
On Tue, Dec 03, 2024 at 09:45:09PM +0800, Kefeng Wang wrote:
> On 2024/12/3 21:39, Uladzislau Rezki wrote:
>> [...]
>> Could you please check that all the patches below are in your v6.6 kernel?
>
> Yes,
>
> $ git lg v6.6..HEAD --oneline mm/vmalloc.c
> * 86fee542f145 mm: vmalloc: ensure vmap_block is initialised before adding to queue
> [...]
> * 9b1283f2bec2 mm/vmalloc: Extend vmalloc usage about hugepage
>
Thank you. Then you have tons of copy_to_iter/copy_from_iter calls during
your test case, and for each one you need to look up the area, which can be
really heavy.

How many CPUs does your system have?

--
Uladzislau Rezki
On 2024/12/3 21:51, Uladzislau Rezki wrote:
> On Tue, Dec 03, 2024 at 09:45:09PM +0800, Kefeng Wang wrote:
>> [...]
>>
> Thank you. Then you have tons of copy_to_iter/copy_from_iter calls during
> your test case, and for each one you need to look up the area, which can
> be really heavy.

Exactly. There was no vmalloc check before 0aef499f3172 ("mm/usercopy:
Detect vmalloc overruns"), so there was no find_vmap_area() burden in this
path on older kernels.

>
> How many CPUs does your system have?
>

128 cores.
On Tue, Dec 03, 2024 at 10:10:26PM +0800, Kefeng Wang wrote:
> [...]
> Exactly. There was no vmalloc check before 0aef499f3172 ("mm/usercopy:
> Detect vmalloc overruns"), so there was no find_vmap_area() burden in this
> path on older kernels.
>
Yep. It will slow things down for sure.

>>
>> How many CPUs does your system have?
>>
> 128 cores.

OK. Just in case, do you see something like this in the boot log:

  "Failed to allocate an array. Disable a node layer"

Thanks!

--
Uladzislau Rezki
On Tue, Dec 03, 2024 at 03:20:04PM +0100, Uladzislau Rezki wrote:
> On Tue, Dec 03, 2024 at 10:10:26PM +0800, Kefeng Wang wrote:
>> [...]
>> 128 cores.
> OK. Just in case, do you see something like this in the boot log:
>
>   "Failed to allocate an array. Disable a node layer"
>
And if you do not see such a failure message, it means that the node layer
is up and running fully. Can you also test the patch below on your workload?

<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 634162271c00..35b28be27cf4 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -896,7 +896,7 @@ static struct vmap_node {
  * is fully disabled. Later on, after vmap is initialized these
  * parameters are updated based on a system capacity.
  */
-static struct vmap_node *vmap_nodes = &single;
+static struct vmap_node **vmap_nodes;
 static __read_mostly unsigned int nr_vmap_nodes = 1;
 static __read_mostly unsigned int vmap_zone_size = 1;
 
@@ -909,13 +909,13 @@ addr_to_node_id(unsigned long addr)
 static inline struct vmap_node *
 addr_to_node(unsigned long addr)
 {
-	return &vmap_nodes[addr_to_node_id(addr)];
+	return vmap_nodes[addr_to_node_id(addr)];
 }
 
 static inline struct vmap_node *
 id_to_node(unsigned int id)
 {
-	return &vmap_nodes[id % nr_vmap_nodes];
+	return vmap_nodes[id % nr_vmap_nodes];
 }
 
 /*
@@ -1060,7 +1060,7 @@ find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va)
 
 repeat:
 	for (i = 0, va_start_lowest = 0; i < nr_vmap_nodes; i++) {
-		vn = &vmap_nodes[i];
+		vn = vmap_nodes[i];
 
 		spin_lock(&vn->busy.lock);
 		*va = __find_vmap_area_exceed_addr(addr, &vn->busy.root);
@@ -2240,7 +2240,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 	purge_nodes = CPU_MASK_NONE;
 
 	for (i = 0; i < nr_vmap_nodes; i++) {
-		vn = &vmap_nodes[i];
+		vn = vmap_nodes[i];
 
 		INIT_LIST_HEAD(&vn->purge_list);
 		vn->skip_populate = full_pool_decay;
@@ -2272,7 +2272,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 	nr_purge_helpers = clamp(nr_purge_helpers, 1U, nr_purge_nodes) - 1;
 
 	for_each_cpu(i, &purge_nodes) {
-		vn = &vmap_nodes[i];
+		vn = vmap_nodes[i];
 
 		if (nr_purge_helpers > 0) {
 			INIT_WORK(&vn->purge_work, purge_vmap_node);
@@ -2291,7 +2291,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 	}
 
 	for_each_cpu(i, &purge_nodes) {
-		vn = &vmap_nodes[i];
+		vn = vmap_nodes[i];
 
 		if (vn->purge_work.func) {
 			flush_work(&vn->purge_work);
@@ -2397,7 +2397,7 @@ struct vmap_area *find_vmap_area(unsigned long addr)
 	 */
 	i = j = addr_to_node_id(addr);
 	do {
-		vn = &vmap_nodes[i];
+		vn = vmap_nodes[i];
 
 		spin_lock(&vn->busy.lock);
 		va = __find_vmap_area(addr, &vn->busy.root);
@@ -2421,7 +2421,7 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
 	 */
 	i = j = addr_to_node_id(addr);
 	do {
-		vn = &vmap_nodes[i];
+		vn = vmap_nodes[i];
 
 		spin_lock(&vn->busy.lock);
 		va = __find_vmap_area(addr, &vn->busy.root);
@@ -4928,7 +4928,7 @@ static void show_purge_info(struct seq_file *m)
 	int i;
 
 	for (i = 0; i < nr_vmap_nodes; i++) {
-		vn = &vmap_nodes[i];
+		vn = vmap_nodes[i];
 
 		spin_lock(&vn->lazy.lock);
 		list_for_each_entry(va, &vn->lazy.head, list) {
@@ -4948,7 +4948,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p)
 	int i;
 
 	for (i = 0; i < nr_vmap_nodes; i++) {
-		vn = &vmap_nodes[i];
+		vn = vmap_nodes[i];
 
 		spin_lock(&vn->busy.lock);
 		list_for_each_entry(va, &vn->busy.head, list) {
@@ -5069,6 +5069,7 @@ static void __init vmap_init_free_space(void)
 
 static void vmap_init_nodes(void)
 {
+	struct vmap_node **nodes;
 	struct vmap_node *vn;
 	int i, n;
 
@@ -5087,23 +5088,34 @@ static void vmap_init_nodes(void)
 	 * set of cores. Therefore a per-domain purging is supposed to
 	 * be added as well as a per-domain balancing.
 	 */
-	n = clamp_t(unsigned int, num_possible_cpus(), 1, 128);
+	n = 1024;
 
 	if (n > 1) {
-		vn = kmalloc_array(n, sizeof(*vn), GFP_NOWAIT | __GFP_NOWARN);
-		if (vn) {
+		nodes = kmalloc_array(n, sizeof(struct vmap_node **),
+			GFP_NOWAIT | __GFP_NOWARN | __GFP_ZERO);
+
+		if (nodes) {
+			for (i = 0; i < n; i++) {
+				nodes[i] = kmalloc(sizeof(struct vmap_node), GFP_NOWAIT | __GFP_ZERO);
+
+				if (!nodes[i])
+					break;
+			}
+
 			/* Node partition is 16 pages. */
 			vmap_zone_size = (1 << 4) * PAGE_SIZE;
-			nr_vmap_nodes = n;
-			vmap_nodes = vn;
+			nr_vmap_nodes = i;
+			vmap_nodes = nodes;
 		} else {
 			pr_err("Failed to allocate an array. Disable a node layer\n");
+			vmap_nodes[0] = &single;
+			nr_vmap_nodes = 1;
 		}
 	}
 #endif
 
 	for (n = 0; n < nr_vmap_nodes; n++) {
-		vn = &vmap_nodes[n];
+		vn = vmap_nodes[n];
 		vn->busy.root = RB_ROOT;
 		INIT_LIST_HEAD(&vn->busy.head);
 		spin_lock_init(&vn->busy.lock);
@@ -5129,7 +5141,7 @@ vmap_node_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 	int i, j;
 
 	for (count = 0, i = 0; i < nr_vmap_nodes; i++) {
-		vn = &vmap_nodes[i];
+		vn = vmap_nodes[i];
 
 		for (j = 0; j < MAX_VA_SIZE_PAGES; j++)
 			count += READ_ONCE(vn->pool[j].len);
@@ -5144,7 +5156,7 @@ vmap_node_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 	int i;
 
 	for (i = 0; i < nr_vmap_nodes; i++)
-		decay_va_pool_node(&vmap_nodes[i], true);
+		decay_va_pool_node(vmap_nodes[i], true);
 
 	return SHRINK_STOP;
 }
<snip>

It sets the number of nodes to 1024. It would be really appreciated to see
the perf delta with this patch, i.e. whether it improves things or not.

Thank you in advance.

--
Uladzislau Rezki
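For readers following along, the reason more nodes can help: find_vmap_area()
hashes the address to a node and normally only takes that node's busy.lock.
The sketch below is a condensed, slightly simplified rendering of the
existing scheme in mm/vmalloc.c, not exact upstream code:

<snip>
/* Condensed sketch of the vmap node layer (simplified, not verbatim). */
static unsigned int addr_to_node_id(unsigned long addr)
{
	/* Hash the address into one of nr_vmap_nodes buckets. */
	return (addr / vmap_zone_size) % nr_vmap_nodes;
}

struct vmap_area *find_vmap_area(unsigned long addr)
{
	struct vmap_node *vn;
	struct vmap_area *va;
	unsigned int i, j;

	/* Start at the node the address hashes to ... */
	i = j = addr_to_node_id(addr);
	do {
		vn = id_to_node(i);

		spin_lock(&vn->busy.lock);
		va = __find_vmap_area(addr, &vn->busy.root);
		spin_unlock(&vn->busy.lock);

		if (va)
			return va;

		/* ... and fall back to scanning the remaining nodes. */
	} while ((i = (i + 1) % nr_vmap_nodes) != j);

	return NULL;
}
<snip>

With 1024 nodes instead of min(num_possible_cpus(), 128), concurrent copies
contend on more locks and each per-node rbtree is smaller, which is what the
test patch is probing.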
On Tue, Dec 03, 2024 at 08:02:26PM +0100, Uladzislau Rezki wrote:

I think there are a few other things we can try here.

First, if the copy is small (and I still don't have an answer to that ...),
we can skip the vmalloc lookup if the copy doesn't cross a page boundary.

Second, we could try storing this in a maple tree rather than an rbtree.
That gives us RCU-protected lookups rather than lookups under a spinlock.

It might even be worth going to a rwlock first, in case the problem is that
there's severe lock contention.

But I've asked for data on spinlock contention and not received an answer
on that either, so I don't know what to suggest.

Anyway, NACK to the original patch; that's just a horrible idea.
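As a rough illustration of the first idea, a sketch under the assumption
that a vmap area always covers a whole number of pages, so a copy confined
to a single page cannot run past va_end of the area mapping it; this is not
a tested patch:

<snip>
	/* Sketch of "skip the lookup if the copy cannot cross a page boundary". */
	if (is_vmalloc_addr(ptr)) {
		unsigned long addr = (unsigned long)ptr;
		struct vmap_area *area;

		/* Fast path: the copy stays within one page, nothing to look up. */
		if (offset_in_page(addr) + n <= PAGE_SIZE)
			return;

		area = find_vmap_area(addr);
		if (!area)
			usercopy_abort("vmalloc", "no area", to_user, 0, n);

		if (n > area->va_end - addr)
			usercopy_abort("vmalloc", NULL, to_user,
				       addr - area->va_start, n);
		return;
	}
<snip>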
On 2024/12/4 3:02, Uladzislau Rezki wrote:
> On Tue, Dec 03, 2024 at 03:20:04PM +0100, Uladzislau Rezki wrote:
>> [...]
>> OK. Just in case, do you see something like this in the boot log:
>>
>>   "Failed to allocate an array. Disable a node layer"
>>
> And if you do not see such a failure message, it means that the node layer
> is up and running fully. Can you also test the patch below on your workload?

We did not see such a failure message. As for the test patch you suggested,
we will test it and check the results. Thank you very much.

>
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 634162271c00..35b28be27cf4 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> [...]
> <snip>
>
> It sets the number of nodes to 1024. It would be really appreciated to see
> the perf delta with this patch, i.e. whether it improves things or not.
>
> Thank you in advance.
>
> --
> Uladzislau Rezki
On 2024/12/4 3:56, Matthew Wilcox wrote:
> On Tue, Dec 03, 2024 at 08:02:26PM +0100, Uladzislau Rezki wrote:
>
> I think there are a few other things we can try here.
>
> First, if the copy is small (and I still don't have an answer to that
> ...), we can skip the vmalloc lookup if the copy doesn't cross a page
> boundary.

Large data packets.

> Second, we could try storing this in a maple tree rather than an rbtree.
> That gives us RCU-protected lookups rather than lookups under a spinlock.
>
> It might even be worth going to a rwlock first, in case the problem is
> that there's severe lock contention.
>
> But I've asked for data on spinlock contention and not received an
> answer on that either, so I don't know what to suggest.

Thank you very much for your suggestions. We will check the spinlock
contention in the future.

> Anyway, NACK to the original patch; that's just a horrible idea.
On December 4, 2024 5:56:34 AM GMT+10:00, Matthew Wilcox <willy@infradead.org> wrote:
> On Tue, Dec 03, 2024 at 08:02:26PM +0100, Uladzislau Rezki wrote:
>
> I think there are a few other things we can try here.
>
> First, if the copy is small (and I still don't have an answer to that
> ...), we can skip the vmalloc lookup if the copy doesn't cross a page
> boundary.

Yeah, this seems a reasonable optimization.

> Anyway, NACK to the original patch; that's just a horrible idea.

Right, please no debugfs knob. The point is to make sure the check cannot be
disabled at runtime; the existing boot param instead makes the check a no-op.
Please use the boot param to disable this check.

-Kees
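For readers who are not familiar with it, hardened_usercopy=off is the
existing boot-time opt-out Kees refers to. Roughly, and from memory (so
treat the exact identifiers below as approximate rather than a verbatim copy
of mm/usercopy.c), it is wired through a static branch, which makes a
disabled check essentially free at runtime:

<snip>
/* Approximate sketch of the boot-time opt-out; identifiers are from memory. */
static bool enable_checks __initdata = true;

static DEFINE_STATIC_KEY_FALSE(bypass_usercopy_checks);

static int __init parse_hardened_usercopy(char *str)
{
	/* Handles hardened_usercopy=on / hardened_usercopy=off */
	if (kstrtobool(str, &enable_checks))
		pr_warn("Invalid option string for hardened_usercopy: '%s'\n", str);
	return 1;
}
__setup("hardened_usercopy=", parse_hardened_usercopy);

static int __init set_hardened_usercopy(void)
{
	if (!enable_checks)
		static_branch_enable(&bypass_usercopy_checks);
	return 1;
}
late_initcall(set_hardened_usercopy);

void __check_object_size(const void *ptr, unsigned long n, bool to_user)
{
	/* A single patched-out jump when the checks are disabled at boot. */
	if (static_branch_unlikely(&bypass_usercopy_checks))
		return;

	/* ... the usual object-size / slab / vmalloc checks follow ... */
}
<snip>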
On Tue, Dec 03, 2024 at 07:56:34PM +0000, Matthew Wilcox wrote:
> On Tue, Dec 03, 2024 at 08:02:26PM +0100, Uladzislau Rezki wrote:
>
> I think there are a few other things we can try here.
>
> First, if the copy is small (and I still don't have an answer to that
> ...), we can skip the vmalloc lookup if the copy doesn't cross a page
> boundary.
>
> Second, we could try storing this in a maple tree rather than an rbtree.
> That gives us RCU-protected lookups rather than lookups under a spinlock.
>
> It might even be worth going to a rwlock first, in case the problem is
> that there's severe lock contention.
>
> But I've asked for data on spinlock contention and not received an
> answer on that either, so I don't know what to suggest.
>
I think it is not about contention. It is about the extra "attached load"
when data is heavily copied back and forth: on each copy path you need to do
a lookup. A maple tree is not something that can help here :)

Indeed, there is no contention data. Zuoze, please share it if you can.

--
Uladzislau Rezki
> On Tue, Dec 03, 2024 at 08:02:26PM +0100, Uladzislau Rezki wrote:
>
> I think there are a few other things we can try here.
>
> First, if the copy is small (and I still don't have an answer to that
> ...), we can skip the vmalloc lookup if the copy doesn't cross a page
> boundary.
>
I noticed that the path in question does not supply the starting address of
the mapped area; instead it passes an address somewhere within it.

> Second, we could try storing this in a maple tree rather than an rbtree.
> That gives us RCU-protected lookups rather than lookups under a spinlock.
>
I think, when I have more free cycles, I will check it from a performance
point of view, because I do not know how efficient a maple tree is when it
comes to lookups, inserts and removals. As an RCU-safe data structure, yes,
searching is improved in the sense that there is no need to take a spinlock.
As noted earlier, I do not know whether a maple tree can find data when,
instead of the key it is associated with, we pass something that is within a
searchable area: [va_start:va_end].

--
Uladzislau Rezki
On 2024/12/4 15:55, Uladzislau Rezki wrote:
> On Tue, Dec 03, 2024 at 07:56:34PM +0000, Matthew Wilcox wrote:
>> [...]
>> But I've asked for data on spinlock contention and not received an
>> answer on that either, so I don't know what to suggest.
>>
> I think it is not about contention. It is about the extra "attached load"
> when data is heavily copied back and forth: on each copy path you need to
> do a lookup. A maple tree is not something that can help here :)
>
> Indeed, there is no contention data. Zuoze, please share it if you can.

We have enabled perf lock contention and are currently debugging the
environment. We will share the results as soon as we have them.

> --
> Uladzislau Rezki
On Wed, Dec 04, 2024 at 05:21:12PM +0800, zuoze wrote:
>
>
> On 2024/12/4 15:55, Uladzislau Rezki wrote:
> > On Tue, Dec 03, 2024 at 07:56:34PM +0000, Matthew Wilcox wrote:
> > > On Tue, Dec 03, 2024 at 08:02:26PM +0100, Uladzislau Rezki wrote:
> > >
> > > I think there are a few other things we can try here.
> > >
> > > First, if the copy is small (and I still don't have an answer to that
> > > ...), we can skip the vmalloc lookup if the copy doesn't cross a page
> > > boundary.
> > >
> > > Second, we could try storing this in a maple tree rather than an rbtree.
> > > That gives us RCU protected lookups rather than under a spinlock.
> > >
> > > It might even be worth going to a rwlock first, in case the problem is
> > > that there's severe lock contention.
> > >
> > > But I've asked for data on spinlock contention and not received an
> > > answer on that either, so I don't know what to suggest.
> > >
> > I think it is not about contention. It is about the extra "attached load"
> > when data is heavily copied forth and back. On each copy path you need to
> > do a scan. A maple tree is not something that can help here :)
> >
> > Indeed, no contention data. Zuoze, please share this if you can.
>
> We have enabled perf lock contention and are currently debugging the
> environment. We will share the results as soon as we have them.
>
Sounds good and thank you for helping to improve this :)

--
Uladzislau Rezki
On Wed, Dec 04, 2024 at 09:51:07AM +0100, Uladzislau Rezki wrote:
> I think, when I have more free cycles, I will check it from a performance
> point of view, because I do not know how efficient a maple tree is when
> it comes to lookups, inserts and removals.

Maple tree has a fanout of around 8-12 at each level, while an rbtree has
a fanout of two (arguably 3, since we might find the node). Let's say you
have 1000 vmalloc areas. A perfectly balanced rbtree would have 9 levels
(and might well be 11+ levels if imperfectly balanced -- and part of the
advantage of rbtrees over AVL trees is that they can be less balanced
so need fewer rotations). A perfectly balanced maple tree would have
only 3 levels.

Addition/removal is more expensive. We biased the implementation heavily
towards lookup, so we chose to keep it very compact. Most users (and
particularly the VMA tree which was our first client) do more lookups
than modifications; a real application takes many more pagefaults than
it does calls to mmap/munmap/mprotect/etc.

> As an RCU-safe data structure, yes, searching is improved in that there
> is no need to take a spinlock. As noted earlier, I do not know if a maple
> tree allows finding data when, instead of the key it is associated with,
> we pass something that is within a searchable area: [va_start:va_end].

That's what maple trees do; they store non-overlapping ranges. So you
can look up any address in a range and it will return you the pointer
associated with that range. Just like you'd want for a page fault ;-)
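A small sketch of the range semantics described above, using the public maple tree API (the tree and helper names here are illustrative; mtree_store_range() and mtree_load() are the real calls): one entry is stored for the whole [va_start, va_end) span, and a lookup with any inner address returns that entry.

<snip>
#include <linux/gfp.h>
#include <linux/maple_tree.h>
#include <linux/vmalloc.h>

/* Illustrative tree, not part of the current vmalloc implementation. */
static DEFINE_MTREE(vmap_mt);

/* Store one entry covering the whole [va_start, va_end) range. */
static int vmap_mt_store(struct vmap_area *va)
{
	return mtree_store_range(&vmap_mt, va->va_start, va->va_end - 1,
				 va, GFP_KERNEL);
}

/* Any address inside the stored range returns the same vmap_area,
 * with RCU-protected traversal instead of a spinlock on the read side. */
static struct vmap_area *vmap_mt_lookup(unsigned long addr)
{
	return mtree_load(&vmap_mt, addr);
}
<snip>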
Hello, Matthew!

> On Wed, Dec 04, 2024 at 09:51:07AM +0100, Uladzislau Rezki wrote:
> > I think, when I have more free cycles, I will check it from a performance
> > point of view, because I do not know how efficient a maple tree is when
> > it comes to lookups, inserts and removals.
>
> Maple tree has a fanout of around 8-12 at each level, while an rbtree has
> a fanout of two (arguably 3, since we might find the node). Let's say you
> have 1000 vmalloc areas. A perfectly balanced rbtree would have 9 levels
> (and might well be 11+ levels if imperfectly balanced -- and part of the
> advantage of rbtrees over AVL trees is that they can be less balanced
> so need fewer rotations). A perfectly balanced maple tree would have
> only 3 levels.
>
Thank you for your explanation and some input on this topic. Density,
tree height and branching factor should make this work better :)

> Addition/removal is more expensive. We biased the implementation heavily
> towards lookup, so we chose to keep it very compact. Most users (and
> particularly the VMA tree which was our first client) do more lookups
> than modifications; a real application takes many more pagefaults than
> it does calls to mmap/munmap/mprotect/etc.
>
This is what I see. Some use cases are degraded. For example the stress-ng
forking bench is worse, and test_vmalloc.sh also reports a degradation.
See the figures below:

<snip>
# Default
urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
+   59.52%     7.15%  [kernel]        [k] __vmalloc_node_range_noprof
+   37.98%     0.22%  [test_vmalloc]  [k] fix_size_alloc_test
+   37.32%     8.56%  [kernel]        [k] vfree.part.0
+   35.31%     0.00%  [kernel]        [k] ret_from_fork_asm
+   35.31%     0.00%  [kernel]        [k] ret_from_fork
+   35.31%     0.00%  [kernel]        [k] kthread
+   35.05%     0.00%  [test_vmalloc]  [k] test_func
+   34.16%     0.06%  [test_vmalloc]  [k] long_busy_list_alloc_test
+   32.10%     0.12%  [kernel]        [k] __get_vm_area_node
+   31.69%     1.82%  [kernel]        [k] alloc_vmap_area
+   27.24%     5.01%  [kernel]        [k] _raw_spin_lock
+   25.45%     0.15%  [test_vmalloc]  [k] full_fit_alloc_test
+   23.57%     0.03%  [kernel]        [k] remove_vm_area
+   22.23%    22.23%  [kernel]        [k] native_queued_spin_lock_slowpath
+   14.34%     0.94%  [kernel]        [k] alloc_pages_bulk_noprof
+   10.80%     7.51%  [kernel]        [k] free_vmap_area_noflush
+   10.59%    10.59%  [kernel]        [k] clear_page_rep
+    9.52%     8.96%  [kernel]        [k] insert_vmap_area
+    7.39%     2.82%  [kernel]        [k] find_unlink_vmap_area

# Maple-tree
time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
+   74.33%     1.50%  [kernel]        [k] __vmalloc_node_range_noprof
+   55.73%     0.06%  [kernel]        [k] __get_vm_area_node
+   55.53%     1.07%  [kernel]        [k] alloc_vmap_area
+   53.78%     0.09%  [test_vmalloc]  [k] long_busy_list_alloc_test
+   53.75%     1.76%  [kernel]        [k] _raw_spin_lock
+   52.81%    51.80%  [kernel]        [k] native_queued_spin_lock_slowpath
+   28.93%     0.09%  [test_vmalloc]  [k] full_fit_alloc_test
+   23.29%     2.43%  [kernel]        [k] vfree.part.0
+   20.29%     0.01%  [kernel]        [k] mt_insert_vmap_area
+   20.27%     0.34%  [kernel]        [k] mtree_insert_range
+   15.30%     0.05%  [test_vmalloc]  [k] fix_size_alloc_test
+   14.06%     0.05%  [kernel]        [k] remove_vm_area
+   13.73%     0.00%  [kernel]        [k] ret_from_fork_asm
+   13.73%     0.00%  [kernel]        [k] ret_from_fork
+   13.73%     0.00%  [kernel]        [k] kthread
+   13.51%     0.00%  [test_vmalloc]  [k] test_func
+   13.15%     0.87%  [kernel]        [k] alloc_pages_bulk_noprof
+    9.92%     9.54%  [kernel]        [k] clear_page_rep
+    9.62%     0.07%  [kernel]        [k] find_unlink_vmap_area
+    9.55%     0.04%  [kernel]        [k] mtree_erase
+    5.92%     1.44%  [kernel]        [k] free_unref_page
+    4.92%     0.24%  [kernel]        [k] mas_insert.isra.0
+    4.69%     0.93%  [kernel]        [k] mas_erase
+    4.47%     0.02%  [kernel]        [k] rcu_do_batch
+    3.35%     2.10%  [kernel]        [k] __vmap_pages_range_noflush
+    3.00%     2.81%  [kernel]        [k] mas_wr_store_type
<snip>

i.e. insert/remove are more expensive, at least my tests show this. It
looks like mtree_insert() uses a range variant, which implies a tree
update after an insert operation is completed, and that is probably where
the overhead comes from. If I use a B+ tree (my own implementation), as
expected, it is better than the rb-tree because of B+ tree properties.

I have composed some data; you can find more bench data here:

wget ftp://vps418301.ovh.net/incoming/Maple_tree_comparison_with_rb_tree_in_vmalloc.pdf

> That's what maple trees do; they store non-overlapping ranges. So you
> can look up any address in a range and it will return you the pointer
> associated with that range. Just like you'd want for a page fault ;-)

Thank you, I see. I thought it could also work as a regular B or B+ tree,
so we do not spend cycles on updates to track ranges. Like the code below:

int ret = mtree_insert(t, va->va_start, va, GFP_KERNEL);

I do not store a range here, I store a key -> value pair, but the maple
tree considers it as the range [va_start:va_start]. Maybe we can improve
this case when not a range is passed? This is just my thought :)

--
Uladzislau Rezki
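To illustrate the point-versus-range distinction discussed above, a brief sketch (the helper names are invented; mtree_insert() and mtree_insert_range() are the real maple tree calls): keying the entry at va_start alone means only that exact index finds it, while inserting the whole span allows lookup by any inner address at the cost of the extra range bookkeeping on insert.

<snip>
#include <linux/gfp.h>
#include <linux/maple_tree.h>
#include <linux/vmalloc.h>

/* Point insert: the entry is reachable only via the exact key va_start. */
static int mt_insert_point(struct maple_tree *mt, struct vmap_area *va)
{
	return mtree_insert(mt, va->va_start, va, GFP_KERNEL);
}

/* Range insert: any address in [va_start, va_end) resolves to the entry,
 * which is what a lookup by "an address somewhere within the area" needs. */
static int mt_insert_span(struct maple_tree *mt, struct vmap_area *va)
{
	return mtree_insert_range(mt, va->va_start, va->va_end - 1,
				  va, GFP_KERNEL);
}
<snip>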
diff --git a/mm/usercopy.c b/mm/usercopy.c
index 83c164aba6e0..ef1eb23b2273 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -21,6 +21,7 @@
 #include <linux/vmalloc.h>
 #include <linux/atomic.h>
 #include <linux/jump_label.h>
+#include <linux/debugfs.h>
 #include <asm/sections.h>
 #include "slab.h"
 
@@ -159,6 +160,8 @@ static inline void check_bogus_address(const unsigned long ptr, unsigned long n,
 		usercopy_abort("null address", NULL, to_user, ptr, n);
 }
 
+static bool bypass_vmalloc_check __read_mostly;
+
 static inline void check_heap_object(const void *ptr, unsigned long n,
 				     bool to_user)
 {
@@ -174,8 +177,13 @@ static inline void check_heap_object(const void *ptr, unsigned long n,
 	}
 
 	if (is_vmalloc_addr(ptr) && !pagefault_disabled()) {
-		struct vmap_area *area = find_vmap_area(addr);
+		struct vmap_area *area;
+
+		/* Bypass it since searching the kernel VM area is slow */
+		if (bypass_vmalloc_check)
+			return;
 
+		area = find_vmap_area(addr);
 		if (!area)
 			usercopy_abort("vmalloc", "no area", to_user, 0, n);
 
@@ -271,6 +279,9 @@ static int __init set_hardened_usercopy(void)
 {
 	if (enable_checks == false)
 		static_branch_enable(&bypass_usercopy_checks);
+	else
+		debugfs_create_bool("bypass_usercopy_vmalloc_check", 0600,
+				    NULL, &bypass_vmalloc_check);
 
 	return 1;
 }
The commit 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns") introduced
a vmalloc check for usercopy. However, in subsystems like networking, when
memory allocated using vmalloc or vmap is subsequently copied using
functions like copy_to_iter/copy_from_iter, the check is triggered. This
adds overhead in the copy path, such as the cost of searching the
red-black tree, which increases the performance burden.

We found that after merging this patch, network bandwidth performance in
the XDP scenario significantly dropped from 25 Gbits/sec to 8 Gbits/sec;
hardened_usercopy is enabled by default.

To address this, we introduced a debugfs interface that allows selectively
enabling or disabling the vmalloc check based on the use case, optimizing
performance.

By default, the vmalloc check for usercopy is enabled.

To disable the vmalloc check:

  echo Y > /sys/kernel/debug/bypass_usercopy_vmalloc_check

After executing the above command, the XDP performance returns to 25
Gbits/sec.

Signed-off-by: Ze Zuo <zuoze1@huawei.com>
---
 mm/usercopy.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)