Message ID | 20241206122533.3589947-1-linyunsheng@huawei.com |
---|---|
Series | Replace page_frag with page_frag_cache (Part-2) |
On Fri, Dec 6, 2024 at 4:32 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> This is part 2 of "Replace page_frag with page_frag_cache",
> which introduces the new API and replaces page_frag with
> page_frag_cache for sk_page_frag().
>
> Part 1 of "Replace page_frag with page_frag_cache" is in [1].
>
> After [2], there are still two implementations for page frag:
>
> 1. mm/page_alloc.c: the net stack seems to use it in the rx path
>    with 'struct page_frag_cache', the main API being
>    page_frag_alloc_align().
> 2. net/core/sock.c: the net stack seems to use it in the tx path
>    with 'struct page_frag', the main API being
>    skb_page_frag_refill().
>
> This patchset tries to unify the page frag implementation by
> replacing page_frag with page_frag_cache for sk_page_frag()
> first. net_high_order_alloc_disable_key for the implementation
> in net/core/sock.c doesn't seem to matter that much now, as pcp
> also supports high-order pages:
> commit 44042b449872 ("mm/page_alloc: allow high-order pages to
> be stored on the per-cpu lists")
>
> As the change is mostly related to networking, it targets
> net-next. A follow-up patchset will try to replace the rest of
> the page_frag usage.
>
> After this patchset:
> 1. The page frag implementation is unified by taking the best of
>    the two existing implementations: we are able to save some
>    space for the 'page_frag_cache' API user, and avoid
>    'get_page()' for the old 'page_frag' API user.
> 2. Future bugfixes and performance work can be done in one place,
>    improving the maintainability of the page_frag implementation.
>
> Performance validation for part 2:
> 1. Using the micro-benchmark ko added in patch 1 to test the
>    aligned and non-aligned API performance impact for the
>    existing users, there seems to be about 20% performance
>    degradation from refactoring page_frag to support the new
>    API, which seems to nullify most of the performance gain in
>    [3] of part 1.
So if I am understanding correctly then this is showing a 20%
performance degradation with this patchset. I would argue that it is
significant enough that it would be a blocking factor for this patch
set. I would suggest bisecting the patch set to identify where the
performance degradation has been added and see what we can do to
resolve it, and if nothing else document it in that patch so we can
identify the root cause for the slowdown.

> 2. Use the below netcat test case, there seems to be some minor
>    performance gain for replacing 'page_frag' with 'page_frag_cache'
>    using the new page_frag API after this patchset.
>    server: taskset -c 32 nc -l -k 1234 > /dev/null
>    client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234

This test would barely touch the page pool. The fact is most of the
overhead for this would likely be things like TCP latency and data
copy, much more than the page allocation. As such, fluctuations here
are likely not related to your changes.
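The "avoid 'get_page()' for the old 'page_frag' API user" benefit claimed in the cover letter comes from the reference-count bias technique the rx-side implementation already uses: take many page references up front with one atomic operation, then hand out fragments with only a plain local decrement. Below is a userspace sketch of that pattern; all names, the malloc-backed "page", and the bias value are illustrative, not the kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define PAGE_BIAS 128u /* illustrative; chosen larger than the max
			* fragments per page so the offset check below
			* always triggers refill before the bias runs out */

struct fake_page {
	atomic_uint refcount;
	char data[4096];
};

struct frag_cache {
	struct fake_page *page;
	unsigned int pagecnt_bias; /* local, non-atomic reference budget */
	size_t offset;
};

/* Retire the cached page: reconcile the unused part of the bias with a
 * single atomic subtraction; free the page if all consumers are done. */
static void cache_drain(struct frag_cache *fc)
{
	if (fc->page &&
	    atomic_fetch_sub(&fc->page->refcount, fc->pagecnt_bias) ==
		    fc->pagecnt_bias)
		free(fc->page);
	fc->page = NULL;
}

/* One atomic store covers up to PAGE_BIAS future fragments, instead of
 * one atomic get_page() per fragment. */
static void cache_refill(struct frag_cache *fc)
{
	cache_drain(fc);
	fc->page = malloc(sizeof(*fc->page));
	atomic_init(&fc->page->refcount, PAGE_BIAS);
	fc->pagecnt_bias = PAGE_BIAS;
	fc->offset = 0;
}

static void *frag_alloc(struct frag_cache *fc, size_t fragsz)
{
	assert(fragsz >= 64); /* sketch-only bound keeping bias > 0 */

	if (!fc->page || fc->offset + fragsz > sizeof(fc->page->data))
		cache_refill(fc);

	void *p = fc->page->data + fc->offset;
	fc->offset += fragsz;
	fc->pagecnt_bias--; /* plain decrement, no atomic */
	return p;
}

/* Consumers drop a fragment with one atomic, like page_frag_free(). */
static void frag_free(struct fake_page *page)
{
	if (atomic_fetch_sub(&page->refcount, 1) == 1)
		free(page);
}
```

The page is freed by whichever side reaches zero last, so the producer pays one atomic per page instead of one per fragment.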
On 2024/12/9 5:34, Alexander Duyck wrote:

...

>> Performance validation for part2:
>> 1. Using micro-benchmark ko added in patch 1 to test aligned and
>>    non-aligned API performance impact for the existing users, there
>>    seems to be about 20% performance degradation for refactoring
>>    page_frag to support the new API, which seems to nullify most of
>>    the performance gain in [3] of part1.
>
> So if I am understanding correctly then this is showing a 20%
> performance degradation with this patchset. I would argue that it is
> significant enough that it would be a blocking factor for this patch
> set. I would suggest bisecting the patch set to identify where the
> performance degradation has been added and see what we can do to
> resolve it, and if nothing else document it in that patch so we can
> identify the root cause for the slowdown.

The only patch in this patchset affecting the performance of the
existing API seems to be patch 1; applying only patch 1 shows the same
~20% performance degradation as applying the whole patchset:
  mm: page_frag: some minor refactoring before adding new API

And the cause seems to be the binary size increase shown below, as the
performance degradation didn't seem to change much when I tried
inlining __page_frag_cache_commit_noref() by moving it to the header
file:

./scripts/bloat-o-meter vmlinux_orig vmlinux
add/remove: 3/2 grow/shrink: 5/0 up/down: 920/-500 (420)
Function                            old     new   delta
__page_frag_cache_prepare             -     500    +500
__napi_alloc_frag_align              68     180    +112
__netdev_alloc_skb                  488     596    +108
napi_alloc_skb                      556     624     +68
__netdev_alloc_frag_align           196     252     +56
svc_tcp_sendmsg                     340     376     +36
__page_frag_cache_commit_noref        -      32     +32
e843419@09a6_0000bd47_30              -       8      +8
e843419@0369_000044ee_684             8       -      -8
__page_frag_alloc_align             492       -    -492
Total: Before=34719207, After=34719627, chg +0.00%

./scripts/bloat-o-meter page_frag_test_orig.ko page_frag_test.ko
add/remove: 0/0 grow/shrink: 2/0 up/down: 78/0 (78)
Function                            old     new   delta
page_frag_push_thread               508     580     +72
__UNIQUE_ID_vermagic367              67      73      +6
Total: Before=4582, After=4660, chg +1.70%

Patch 1 is about refactoring common code from
__page_frag_alloc_va_align() into __page_frag_cache_prepare() and
__page_frag_cache_commit(), so that the new API can make use of it as
much as possible.

Any better idea for reusing as much common code as possible while
avoiding the performance degradation as much as possible?

>
>> 2. Use the below netcat test case, there seems to be some minor
>>    performance gain for replacing 'page_frag' with 'page_frag_cache'
>>    using the new page_frag API after this patchset.
>>    server: taskset -c 32 nc -l -k 1234 > /dev/null
>>    client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
>
> This test would barely touch the page pool. The fact is most of the

I am guessing you meant page_frag here?

> overhead for this would likely be things like TCP latency and data
> copy much more than the page allocation. As such fluctuations here are
> likely not related to your changes.

But it does tell us that the replacement does not seem to cause an
obvious regression, right?

I tried using a smaller MTU to amplify the impact of page allocation,
and it seemed to have a similar result.
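The prepare/commit split described above (refactoring __page_frag_alloc_va_align() into a prepare step and a commit step) can be sketched in userspace C. The function names mirror the patch, but the bodies are simplified illustrations backed by malloc, not the kernel implementation:

```c
#include <assert.h>
#include <stdlib.h>

struct frag_cache {
	char *va;
	size_t size;
	size_t offset;
	unsigned int pagecnt_bias;
};

/* Prepare: make sure a page with room for 'fragsz' is cached and
 * return a pointer to the candidate fragment; no state is consumed
 * yet, so the caller may still decide to take less (or nothing). */
static void *frag_prepare(struct frag_cache *fc, size_t fragsz)
{
	if (!fc->va || fc->offset + fragsz > fc->size) {
		free(fc->va);
		fc->va = malloc(fc->size);
		if (!fc->va)
			return NULL;
		fc->offset = 0;
		fc->pagecnt_bias = 128; /* arbitrary illustrative budget */
	}
	return fc->va + fc->offset;
}

/* Commit: the caller decided to keep 'fragsz' bytes, so advance the
 * offset and consume one reference from the bias. */
static void frag_commit(struct frag_cache *fc, size_t fragsz)
{
	fc->offset += fragsz;
	fc->pagecnt_bias--;
}

/* The old single-call API becomes a thin prepare+commit wrapper,
 * which is roughly the shape patch 1 gives __page_frag_alloc_align(). */
static void *frag_alloc(struct frag_cache *fc, size_t fragsz)
{
	void *va = frag_prepare(fc, fragsz);

	if (va)
		frag_commit(fc, fragsz);
	return va;
}
```

The design question in the thread is whether this split can stay as cheap as the original single function once the wrapper is no longer fully inlined.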
On Mon, Dec 9, 2024 at 3:42 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/12/9 5:34, Alexander Duyck wrote:

...

> The only patch in this patchset affecting the performance of existing API
> seems to be patch 1, only including patch 1 does show ~20% performance
> degradation as including the whole patchset does:
>   mm: page_frag: some minor refactoring before adding new API
>
> And the cause seems to be about the binary size increasing as below, as
> the performance degradation didn't seem to change much when I tried
> inlining the __page_frag_cache_commit_noref() by moving it to the header
> file:

...

Other than code size, have you tried using perf to profile the
benchmark before and after? I suspect that would be telling about
which code changes are the most likely to be causing the issues.
Overall I don't think the size has increased all that much. I suspect
most of this is the fact that you are inlining more of the
functionality.

> Patch 1 is about refactoring common code from __page_frag_alloc_va_align()
> to __page_frag_cache_prepare() and __page_frag_cache_commit(), so that the
> new API can make use of it as much as possible.
>
> Any better idea to reuse common code as much as possible while avoiding
> the performance degradation as much as possible?

...

> >> 2. Use the below netcat test case, there seems to be some minor
> >> performance gain for replacing 'page_frag' with 'page_frag_cache'
> >> using the new page_frag API after this patchset.
> >> server: taskset -c 32 nc -l -k 1234 > /dev/null
> >> client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
> >
> > This test would barely touch the page pool. The fact is most of the
>
> I am guessing you meant page_frag here?
>
> > overhead for this would likely be things like TCP latency and data
> > copy much more than the page allocation. As such fluctuations here are
> > likely not related to your changes.
>
> But it does tell us something that the replacing does not seem to
> cause an obvious regression, right?

Not really. The fragment allocator is such a small portion of this
test that we could probably double the cost for it and it would still
be negligible.

> I tried using a smaller MTU to amplify the impact of page allocation,
> it seemed to have a similar result.

Not surprising. However the network is likely only a small part of
this. I suspect if you ran a profile it would likely show the same.
On 2024/12/10 0:03, Alexander Duyck wrote:

...

> Other than code size have you tried using perf to profile the
> benchmark before and after. I suspect that would be telling about
> which code changes are the most likely to be causing the issues.
> Overall I don't think the size has increased all that much. I suspect
> most of this is the fact that you are inlining more of the
> functionality.

It seems the testing result is very sensitive to code changes and
reorganization; using the patch at the end of this mail to avoid the
problem of 'perf stat' not including data from the kernel thread seems
to provide more reasonable performance data.

The most obvious difference seems to be 'insn per cycle', and I am not
sure yet how to interpret that difference in the data below with
respect to the performance degradation.

With patch 1:
 Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000':

        5473.815250      task-clock (msec)         #    0.984 CPUs utilized
                 18      context-switches          #    0.003 K/sec
                  1      cpu-migrations            #    0.000 K/sec
                122      page-faults               #    0.022 K/sec
        14210894727      cycles                    #    2.596 GHz                      (92.78%)
        18903171767      instructions              #    1.33  insn per cycle           (92.82%)
         2997494420      branches                  #  547.606 M/sec                    (92.84%)
            7539978      branch-misses             #    0.25% of all branches          (92.84%)
         6291190031      L1-dcache-loads           # 1149.325 M/sec                    (92.78%)
           29874701      L1-dcache-load-misses     #    0.47% of all L1-dcache hits    (92.82%)
           57979668      LLC-loads                 #   10.592 M/sec                    (92.79%)
             347822      LLC-load-misses           #    0.01% of all LL-cache hits     (92.90%)
         5946042629      L1-icache-loads           # 1086.270 M/sec                    (92.91%)
             193877      L1-icache-load-misses                                         (92.91%)
         6820220221      dTLB-loads                # 1245.972 M/sec                    (92.91%)
             137999      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.91%)
         5947607438      iTLB-loads                # 1086.556 M/sec                    (92.91%)
                210      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.66%)
    <not supported>      L1-dcache-prefetches
    <not supported>      L1-dcache-prefetch-misses

        5.563068950 seconds time elapsed

Without patch 1:
root@(none):/home# perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000
insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable

 Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000':

        5306.644600      task-clock (msec)         #    0.984 CPUs utilized
                 15      context-switches          #    0.003 K/sec
                  1      cpu-migrations            #    0.000 K/sec
                122      page-faults               #    0.023 K/sec
        13776872322      cycles                    #    2.596 GHz                      (92.84%)
        13257649773      instructions              #    0.96  insn per cycle           (92.82%)
         2446901087      branches                  #  461.101 M/sec                    (92.91%)
            7172751      branch-misses             #    0.29% of all branches          (92.84%)
         5041456343      L1-dcache-loads           #  950.027 M/sec                    (92.84%)
           38418414      L1-dcache-load-misses     #    0.76% of all L1-dcache hits    (92.76%)
           65486400      LLC-loads                 #   12.340 M/sec                    (92.82%)
             191497      LLC-load-misses           #    0.01% of all LL-cache hits     (92.79%)
         4906456833      L1-icache-loads           #  924.587 M/sec                    (92.90%)
             175208      L1-icache-load-misses                                         (92.91%)
         5539879607      dTLB-loads                # 1043.952 M/sec                    (92.91%)
             140166      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.91%)
         4906685698      iTLB-loads                #  924.631 M/sec                    (92.91%)
                170      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.66%)
    <not supported>      L1-dcache-prefetches
    <not supported>      L1-dcache-prefetch-misses

        5.395104330 seconds time elapsed

Below is the perf data for the aligned API without patch 1. As the
non-aligned API runs above also use test_alloc_len=12, theoretically
this data should not be better than the non-aligned API, since the
aligned API aligns fragsz on SMP_CACHE_BYTES, but the testing seems to
show otherwise and I am not sure how to interpret that either:

perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1
insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable

 Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1':

        2447.553100      task-clock (msec)         #    0.965 CPUs utilized
                  9      context-switches          #    0.004 K/sec
                  1      cpu-migrations            #    0.000 K/sec
                122      page-faults               #    0.050 K/sec
         6354149177      cycles                    #    2.596 GHz                      (92.81%)
         6467793726      instructions              #    1.02  insn per cycle           (92.76%)
         1120749183      branches                  #  457.906 M/sec                    (92.81%)
            7370402      branch-misses             #    0.66% of all branches          (92.81%)
         2847963759      L1-dcache-loads           # 1163.596 M/sec                    (92.76%)
           39439592      L1-dcache-load-misses     #    1.38% of all L1-dcache hits    (92.77%)
           42553468      LLC-loads                 #   17.386 M/sec                    (92.71%)
              95960      LLC-load-misses           #    0.01% of all LL-cache hits     (92.94%)
         2554887203      L1-icache-loads           # 1043.854 M/sec                    (92.97%)
             118902      L1-icache-load-misses                                         (92.97%)
         3365755289      dTLB-loads                # 1375.151 M/sec                    (92.97%)
              81401      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.97%)
         2554882937      iTLB-loads                # 1043.852 M/sec                    (92.97%)
                159      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.58%)
    <not supported>      L1-dcache-prefetches
    <not supported>      L1-dcache-prefetch-misses

        2.535085780 seconds time elapsed

>
>> Patch 1 is about refactoring common codes from __page_frag_alloc_va_align()
>> to __page_frag_cache_prepare() and __page_frag_cache_commit(), so that the
>> new API can make use of them as much as possible.
>>
>> Any better idea to reuse common codes as much as possible while avoiding
>> the performance degradation as much as possible?
>>
>>>
>>>> 2. Use the below netcat test case, there seems to be some minor
>>>> performance gain for replacing 'page_frag' with 'page_frag_cache'
>>>> using the new page_frag API after this patchset.
>>>> server: taskset -c 32 nc -l -k 1234 > /dev/null
>>>> client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
>>>
>>> This test would barely touch the page pool. The fact is most of the
>>
>> I am guessing you meant page_frag here?
>>
>>> overhead for this would likely be things like TCP latency and data
>>> copy much more than the page allocation. As such fluctuations here are
>>> likely not related to your changes.
>>
>> But it does tell us something that the replacing does not seem to
>> cause an obvious regression, right?
>
> Not really. The fragment allocator is such a small portion of this
> test that we could probably double the cost for it and it would still
> be negligible.

The most beneficial thing in replacing the old API seems to be the
batching of page->_refcount updates and avoiding some page_address()
calls, but there may be some overhead from unifying the page_frag API.

>
>> I tried using a smaller MTU to amplify the impact of page allocation,
>> it seemed to have a similar result.
>
> Not surprising. However the network is likely only a small part of
> this. I suspect if you ran a profile it would likely show the same.

Patch for doing the push operation in the insmod process instead of in
the kernel thread, as 'perf stat' does not seem to include the data of
the kernel thread:

diff --git a/tools/testing/selftests/mm/page_frag/page_frag_test.c b/tools/testing/selftests/mm/page_frag/page_frag_test.c
index e806c1866e36..a818431c38b8 100644
--- a/tools/testing/selftests/mm/page_frag/page_frag_test.c
+++ b/tools/testing/selftests/mm/page_frag/page_frag_test.c
@@ -131,30 +131,39 @@ static int __init page_frag_test_init(void)
 	init_completion(&wait);
 
 	if (test_alloc_len > PAGE_SIZE || test_alloc_len <= 0 ||
-	    !cpu_active(test_push_cpu) || !cpu_active(test_pop_cpu))
+	    !cpu_active(test_pop_cpu))
 		return -EINVAL;
 
 	ret = ptr_ring_init(&ptr_ring, nr_objs, GFP_KERNEL);
 	if (ret)
 		return ret;
 
-	tsk_push = kthread_create_on_cpu(page_frag_push_thread, &ptr_ring,
-					 test_push_cpu, "page_frag_push");
-	if (IS_ERR(tsk_push))
-		return PTR_ERR(tsk_push);
-
 	tsk_pop = kthread_create_on_cpu(page_frag_pop_thread, &ptr_ring,
 					test_pop_cpu, "page_frag_pop");
-	if (IS_ERR(tsk_pop)) {
-		kthread_stop(tsk_push);
+	if (IS_ERR(tsk_pop))
 		return PTR_ERR(tsk_pop);
+
+	pr_info("test_push_cpu = %d\n", test_push_cpu);
+
+	if (test_push_cpu < 0)
+		goto skip_push_thread;
+
+	tsk_push = kthread_create_on_cpu(page_frag_push_thread, &ptr_ring,
+					 test_push_cpu, "page_frag_push");
+	if (IS_ERR(tsk_push)) {
+		kthread_stop(tsk_pop);
+		return PTR_ERR(tsk_push);
 	}
 
+skip_push_thread:
 	start = ktime_get();
-	wake_up_process(tsk_push);
+	pr_info("waiting for test to complete\n");
 	wake_up_process(tsk_pop);
-	pr_info("waiting for test to complete\n");
+	if (test_push_cpu < 0)
+		page_frag_push_thread(&ptr_ring);
+	else
+		wake_up_process(tsk_push);
 
 	while (!wait_for_completion_timeout(&wait, msecs_to_jiffies(10000))) {
 		/* exit if there is no progress for push or pop size */
On Tue, Dec 10, 2024 at 4:27 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/12/10 0:03, Alexander Duyck wrote:

...

> It seems the testing result is very sensitive to code changing and
> reorganizing, as using the patch at the end to avoid the problem of
> 'perf stat' not including data from the kernel thread seems to provide
> more reasonable performance data.
>
> It seems the most obvious difference is 'insn per cycle' and I am not
> sure how to interpret the difference of below data for the performance
> degradation yet.

...

I'm not sure perf stat will tell us much as it is really too high
level to give us much in the way of details. I would be more
interested in the output from perf record -g followed by a perf
report, or maybe even just a snapshot from perf top while the test is
running. That should show us where the CPU is spending most of its
time and what areas are hot in the before and after graphs.
On 2024/12/10 23:58, Alexander Duyck wrote:
>
> I'm not sure perf stat will tell us much as it is really too high
> level to give us much in the way of details. I would be more
> interested in the output from perf record -g followed by a perf
> report, or maybe even just a snapshot from perf top while the test is
> running. That should show us where the CPU is spending most of its
> time and what areas are hot in the before and after graphs.

It seems the bottleneck is on the freeing side: the page_frag_free()
function took up to about 50% CPU for the non-aligned API and 16% CPU
for the aligned API on the push CPU according to 'perf top'.

Using the below patch causes page_frag_free() to disappear from
'perf top' on the push CPU; the new performance data is below:

Without patch 1:
 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000' (20 runs):

          21.084113      task-clock (msec)         #    0.008 CPUs utilized            ( +-  1.59% )
                  7      context-switches          #    0.334 K/sec                    ( +-  1.25% )
                  1      cpu-migrations            #    0.031 K/sec                    ( +- 20.20% )
                 78      page-faults               #    0.004 M/sec                    ( +-  0.26% )
           54748233      cycles                    #    2.597 GHz                      ( +-  1.59% )
           61637051      instructions              #    1.13  insn per cycle           ( +-  0.13% )
           14727268      branches                  #  698.501 M/sec                    ( +-  0.11% )
              20178      branch-misses             #    0.14% of all branches          ( +-  0.94% )

        2.637345524 seconds time elapsed                                          ( +-  0.19% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1' (20 runs):

          19.669259      task-clock (msec)         #    0.009 CPUs utilized            ( +-  2.91% )
                  7      context-switches          #    0.356 K/sec                    ( +-  1.04% )
                  0      cpu-migrations            #    0.005 K/sec                    ( +- 68.82% )
                 77      page-faults               #    0.004 M/sec                    ( +-  0.27% )
           51077447      cycles                    #    2.597 GHz                      ( +-  2.91% )
           58875368      instructions              #    1.15  insn per cycle           ( +-  4.47% )
           14040015      branches                  #  713.805 M/sec                    ( +-  4.68% )
              20150      branch-misses             #    0.14% of all branches          ( +-  0.64% )

        2.226539190 seconds time elapsed                                          ( +-  0.12% )

With patch 1:
 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000' (20 runs):

          20.782788      task-clock (msec)         #    0.008 CPUs utilized            ( +-  0.09% )
                  7      context-switches          #    0.342 K/sec                    ( +-  0.97% )
                  1      cpu-migrations            #    0.031 K/sec                    ( +- 16.83% )
                 78      page-faults               #    0.004 M/sec                    ( +-  0.31% )
           53967333      cycles                    #    2.597 GHz                      ( +-  0.08% )
           61577257      instructions              #    1.14  insn per cycle           ( +-  0.02% )
           14712140      branches                  #  707.900 M/sec                    ( +-  0.02% )
              20234      branch-misses             #    0.14% of all branches          ( +-  0.55% )

        2.677974457 seconds time elapsed                                          ( +-  0.15% )

root@(none):/home# perf stat -r 20 insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1
insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1' (20 runs):

          20.420537      task-clock (msec)         #    0.009 CPUs utilized            ( +-  0.05% )
                  7      context-switches          #    0.345 K/sec                    ( +-  0.71% )
                  0      cpu-migrations            #    0.005 K/sec                    ( +-100.00% )
                 77      page-faults               #    0.004 M/sec                    ( +-  0.23% )
           53038942      cycles                    #    2.597 GHz                      ( +-  0.05% )
           59965712      instructions              #    1.13  insn per cycle           ( +-  0.03% )
           14372507      branches                  #  703.826 M/sec                    ( +-  0.03% )
              20580      branch-misses             #    0.14% of all branches          ( +-  0.56% )

        2.287783171 seconds time elapsed                                          ( +-  0.12% )

It seems the bottleneck is still the freeing side, so the above result
might not be as meaningful as it should be.

As we can't use more than one CPU for the freeing side without some
lock when using a single ptr_ring, it seems something more complicated
might need to be done in order to support more than one CPU for the
freeing side?

Before patch 1, __page_frag_alloc_align() took up to 3.62% of CPU
according to 'perf top'.
After patch 1, __page_frag_cache_prepare() and
__page_frag_cache_commit_noref() took up to 4.67% + 1.01% = 5.68%.
With the elapsed times being similar, I am not sure the CPU usage can
tell us much about the performance degradation here, as the difference
seems quite large?

@@ -100,13 +100,20 @@ static int page_frag_push_thread(void *arg)
 		if (!va)
 			continue;
 
-		ret = __ptr_ring_produce(ring, va);
-		if (ret) {
+		do {
+			ret = __ptr_ring_produce(ring, va);
+			if (!ret) {
+				va = NULL;
+				break;
+			} else {
+				cond_resched();
+			}
+		} while (!force_exit);
+
+		if (va)
 			page_frag_free(va);
-			cond_resched();
-		} else {
+		else
 			test_pushed++;
-		}
 	}
 
 	pr_info("page_frag push test thread exits on cpu %d\n",
On 2024/12/11 20:52, Yunsheng Lin wrote:
> It seems the bottleneck is still the freeing side, so the above
> result might not be as meaningful as it should be.

Through 'perf top' annotation, there seems to be about 70%+ CPU usage
on the atomic operation of put_page_testzero() in page_frag_free(); it
was unexpected that the atomic operation had that much overhead:(

>
> As we can't use more than one cpu for the free side without some
> lock using a single ptr_ring, it seems something more complicated
> might need to be done in order to support more than one CPU for the
> freeing side?
>
> Before patch 1, __page_frag_alloc_align() took up to 3.62% of CPU
> according to 'perf top'.
> After patch 1, __page_frag_cache_prepare() and __page_frag_cache_commit_noref()
> took up to 4.67% + 1.01% = 5.68%.
> With the elapsed times being similar, I am not sure the CPU usage can
> tell us much about the performance degradation here, as the difference
> seems quite large?

And using 'struct page_frag' to pass the parameters seems to cause
some observable overhead, as the testing is very low level; the
performance difference seems to be negligible when using the below
patch to avoid passing 'struct page_frag': 3.62% and 3.27% CPU usage
for __page_frag_alloc_align() before patch 1 and
__page_frag_cache_prepare() after patch 1, respectively.

The new refactoring avoids some overhead for the old API, but might
cause some overhead for the new API as it is not able to skip the
virt_to_page() for the refilling and reusing cases, though they seem
to be unlikely cases.

Or any better idea how to do the refactoring for unifying the
page_frag API?
diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
index 41a91df82631..b83e7655654e 100644
--- a/include/linux/page_frag_cache.h
+++ b/include/linux/page_frag_cache.h
@@ -39,8 +39,24 @@ static inline bool page_frag_cache_is_pfmemalloc(struct page_frag_cache *nc)
 void page_frag_cache_drain(struct page_frag_cache *nc);
 void __page_frag_cache_drain(struct page *page, unsigned int count);
-void *__page_frag_alloc_align(struct page_frag_cache *nc, unsigned int fragsz,
-			      gfp_t gfp_mask, unsigned int align_mask);
+void *__page_frag_cache_prepare(struct page_frag_cache *nc, unsigned int fragsz,
+				gfp_t gfp_mask, unsigned int align_mask);
+
+static inline void *__page_frag_alloc_align(struct page_frag_cache *nc,
+					    unsigned int fragsz, gfp_t gfp_mask,
+					    unsigned int align_mask)
+{
+	void *va;
+
+	va = __page_frag_cache_prepare(nc, fragsz, gfp_mask, align_mask);
+	if (likely(va)) {
+		va += nc->offset;
+		nc->offset += fragsz;
+		nc->pagecnt_bias--;
+	}
+
+	return va;
+}
 
 static inline void *page_frag_alloc_align(struct page_frag_cache *nc,
 					  unsigned int fragsz, gfp_t gfp_mask,
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index 3f7a203d35c6..729309aee27a 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -90,9 +90,9 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
 }
 EXPORT_SYMBOL(__page_frag_cache_drain);
 
-void *__page_frag_alloc_align(struct page_frag_cache *nc,
-			      unsigned int fragsz, gfp_t gfp_mask,
-			      unsigned int align_mask)
+void *__page_frag_cache_prepare(struct page_frag_cache *nc,
+				unsigned int fragsz, gfp_t gfp_mask,
+				unsigned int align_mask)
 {
 	unsigned long encoded_page = nc->encoded_page;
 	unsigned int size, offset;
@@ -151,12 +151,10 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
 		offset = 0;
 	}
 
-	nc->pagecnt_bias--;
-	nc->offset = offset + fragsz;
-
-	return encoded_page_decode_virt(encoded_page) + offset;
+	nc->offset = offset;
+	return encoded_page_decode_virt(encoded_page);
 }
-EXPORT_SYMBOL(__page_frag_alloc_align);
+EXPORT_SYMBOL(__page_frag_cache_prepare);
 
 /*
  * Frees a page fragment allocated out of either a compound or order 0 page.
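The put_page_testzero() cost reported earlier in the thread is one atomic decrement per fragment on the free side. Below is a hedged userspace sketch contrasting that per-fragment atomic with a batched drain in the style of __page_frag_cache_drain(page, count), where the caller counts fragments locally and pays a single atomic; the types are illustrative, not kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct fake_page {
	atomic_uint refcount;
};

/* One atomic per fragment: the pattern perf showed as the bottleneck,
 * analogous to put_page_testzero() in page_frag_free(). */
static void free_one(struct fake_page *page)
{
	if (atomic_fetch_sub(&page->refcount, 1) == 1)
		free(page);
}

/* Batched variant, in the style of __page_frag_cache_drain(page, count):
 * drop 'count' references with a single atomic subtraction, and free
 * the page if that was the last of them. */
static void free_batch(struct fake_page *page, unsigned int count)
{
	if (atomic_fetch_sub(&page->refcount, count) == count)
		free(page);
}
```

Whether the benchmark's free side could batch like this depends on the consumer being able to group fragments by page before dropping references, which a single ptr_ring consumer as discussed above might plausibly do.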