From patchwork Mon Oct 28 11:53:35 2024
X-Patchwork-Submitter: Yunsheng Lin
X-Patchwork-Id: 13853358
From: Yunsheng Lin
CC: Yunsheng Lin, Alexander Duyck, Shuah Khan, Andrew Morton, Linux-MM
Subject: [PATCH net-next v23 0/7] Replace page_frag with page_frag_cache (Part-1)
Date: Mon, 28 Oct 2024 19:53:35 +0800
Message-ID: <20241028115343.3405838-1-linyunsheng@huawei.com>

This is part 1 of "Replace page_frag with page_frag_cache", which mainly
contains refactoring and optimization of the page_frag API implementation
ahead of the replacement.

As discussed in [1], it is better to target the net-next tree to get more
testing, since all callers of the page_frag API are in networking, and the
chance of conflicting with the MM tree seems low because the implementation
of the page_frag API is quite self-contained.

After [2], there are still two implementations of page frag:

1. mm/page_alloc.c: the net stack seems to use it in the rx part with
   'struct page_frag_cache', the main API being page_frag_alloc_align().
2. net/core/sock.c: the net stack seems to use it in the tx part with
   'struct page_frag', the main API being skb_page_frag_refill().

This patchset tries to unify the page frag implementation by replacing
page_frag with page_frag_cache for sk_page_frag() first.
net_high_order_alloc_disable_key for the implementation in net/core/sock.c
doesn't seem to matter that much now, as pcp also supports high-order
pages:

commit 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")

As the change is mostly related to networking, it targets net-next. The
rest of page_frag will be replaced in a follow-up patchset.

After this patchset:
1. The page frag implementation is unified by taking the best of the two
   existing implementations: we are able to save some space for the
   'page_frag_cache' API user, and avoid 'get_page()' for the old
   'page_frag' API user.
2. Future bugfixes and performance work can be done in one place, improving
   the maintainability of page_frag's implementation.

Kernel image size change:
    Linux Kernel   total |      text      data       bss
    ------------------------------------------------------
    after       45250307 |  27274279  17209996    766032
    before      45254134 |  27278118  17209984    766032
    delta          -3827 |     -3839       +12        +0

Performance validation:
1. Using the micro-benchmark ko added in patch 1 to test the performance
   impact of the aligned and non-aligned API on existing users, there is no
   noticeable performance degradation.
   Instead we seem to have a major performance boost for both the aligned
   and non-aligned API after switching to ptr_ring for testing, about 200%
   and 10% improvement respectively on an arm64 server, as shown below.

2. Using the netcat test case below, we also have a minor performance
   boost from replacing 'page_frag' with 'page_frag_cache' after this
   patchset.
   server: taskset -c 32 nc -l -k 1234 > /dev/null
   client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234

In order to avoid performance noise as much as possible, the testing is
done on a system without any other load, with enough iterations to show
that the data is stable. The complete log for testing is below:

perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000
perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1
taskset -c 32 nc -l -k 1234 > /dev/null
perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234

*After* this patchset:

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):

         17.758393      task-clock (msec)         #    0.004 CPUs utilized            ( +-  0.51% )
                 5      context-switches          #    0.293 K/sec                    ( +-  0.65% )
                 0      cpu-migrations            #    0.008 K/sec                    ( +- 17.21% )
                74      page-faults               #    0.004 M/sec                    ( +-  0.12% )
          46128650      cycles                    #    2.598 GHz                      ( +-  0.51% )
          60810511      instructions              #    1.32  insn per cycle           ( +-  0.04% )
          14764914      branches                  #  831.433 M/sec                    ( +-  0.04% )
             19281      branch-misses             #    0.13% of all branches          ( +-  0.13% )

       4.240273854 seconds time elapsed                                          ( +-  0.13% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):

         17.348690      task-clock (msec)         #    0.019 CPUs utilized            ( +-  0.66% )
                 5      context-switches          #    0.310 K/sec                    ( +-  0.84% )
                 0      cpu-migrations            #    0.009 K/sec                    ( +- 16.55% )
                74      page-faults               #    0.004 M/sec                    ( +-  0.11% )
          45065287      cycles                    #    2.598 GHz                      ( +-  0.66% )
          60755389      instructions              #    1.35  insn per cycle           ( +-  0.05% )
          14747865      branches                  #  850.085 M/sec                    ( +-  0.05% )
             19272      branch-misses             #    0.13% of all branches          ( +-  0.13% )

       0.935251375 seconds time elapsed                                          ( +-  0.07% )

 Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):

      16626.042731      task-clock (msec)         #    0.607 CPUs utilized            ( +-  0.03% )
           3291020      context-switches          #    0.198 M/sec                    ( +-  0.05% )
                 1      cpu-migrations            #    0.000 K/sec                    ( +-  0.50% )
                85      page-faults               #    0.005 K/sec                    ( +-  0.16% )
       30581044838      cycles                    #    1.839 GHz                      ( +-  0.05% )
       34962744631      instructions              #    1.14  insn per cycle           ( +-  0.01% )
        6483883671      branches                  #  389.984 M/sec                    ( +-  0.02% )
          99624551      branch-misses             #    1.54% of all branches          ( +-  0.17% )

      27.370305077 seconds time elapsed                                          ( +-  0.01% )

*Before* this patchset:

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):

         21.587934      task-clock (msec)         #    0.005 CPUs utilized            ( +-  0.72% )
                 6      context-switches          #    0.281 K/sec                    ( +-  0.28% )
                 1      cpu-migrations            #    0.047 K/sec                    ( +-  0.50% )
                73      page-faults               #    0.003 M/sec                    ( +-  0.12% )
          56080697      cycles                    #    2.598 GHz                      ( +-  0.72% )
          61605150      instructions              #    1.10  insn per cycle           ( +-  0.05% )
          14950196      branches                  #  692.526 M/sec                    ( +-  0.05% )
             19410      branch-misses             #    0.13% of all branches          ( +-  0.18% )

       4.603530546 seconds time elapsed                                          ( +-  0.11% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):

         20.988297      task-clock (msec)         #    0.006 CPUs utilized            ( +-  0.81% )
                 7      context-switches          #    0.316 K/sec                    ( +-  0.54% )
                 1      cpu-migrations            #    0.048 K/sec                    ( +-  0.70% )
                73      page-faults               #    0.003 M/sec                    ( +-  0.11% )
          54512166      cycles                    #    2.597 GHz                      ( +-  0.81% )
          61440941      instructions              #    1.13  insn per cycle           ( +-  0.08% )
          14906043      branches                  #  710.207 M/sec                    ( +-  0.08% )
             19927      branch-misses             #    0.13% of all branches          ( +-  0.17% )

       3.438041238 seconds time elapsed                                          ( +-  1.11% )

 Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):

      17364.040855      task-clock (msec)         #    0.624 CPUs utilized            ( +-  0.02% )
           3340375      context-switches          #    0.192 M/sec                    ( +-  0.06% )
                 1      cpu-migrations            #    0.000 K/sec
                85      page-faults               #    0.005 K/sec                    ( +-  0.15% )
       32077623335      cycles                    #    1.847 GHz                      ( +-  0.03% )
       35121047596      instructions              #    1.09  insn per cycle           ( +-  0.01% )
        6519872824      branches                  #  375.481 M/sec                    ( +-  0.02% )
         101877022      branch-misses             #    1.56% of all branches          ( +-  0.14% )

      27.842745343 seconds time elapsed                                          ( +-  0.02% )

Note: ipv4-udp, ipv6-tcp and ipv6-udp are also tested with the script below:

nc -u -l -k 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc -N -u 127.0.0.1 1234

nc -l6 -k 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc -N ::1 1234

nc -l6 -k -u 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc -u -N ::1 1234

CC: Alexander Duyck
CC: Shuah Khan
CC: Andrew Morton
CC: Linux-MM

1. https://lore.kernel.org/all/add10dd4-7f5d-4aa1-aa04-767590f944e0@redhat.com/
2. https://lore.kernel.org/all/20240228093013.8263-1-linyunsheng@huawei.com/

Change log:
V23:
   1. CC Andrew and MM ML explicitly.
   2. Split into two parts according to the discussion in v22; this is
      part 1.

V22:
   1. Fix some typos as noted by Bagas.
   2. Remove page_frag_cache_page_offset() as it is not really related to
      this patchset.

V21:
   1. Do renaming as suggested by Alexander.
   2. Filter out the test results of dmesg in the script as suggested by
      Shuah.

V20:
   1. Rename skb_copy_to_page_nocache() to skb_add_frag_nocache().
   2. Define PFMEMALLOC_BIT as ORDER_MASK + 1 as suggested by Alexander.

V19:
   1. Rebased on latest net-next.
   2. Use wait_for_completion_timeout() instead of wait_for_completion()
      in page_frag_test.c.

V18:
   1. Fix a typo in test_page_frag.sh pointed out by Alexander.
   2. Move some inline helpers into the c file, use the ternary operator
      and move the getting of the size as suggested by Alexander.

V17:
   1. Add TEST_FILES in Makefile for test_page_frag.sh.

V16:
   1. Add test_page_frag.sh to handle page_frag_test.ko and add testing
      for the prepare API.
   2. Move inline helpers not needed outside of page_frag_cache.c into
      page_frag_cache.c.
   3. Reset nc->offset when reusing an old page.

V15:
   1. Fix the compile error pointed out by Simon.
   2. Fix other mistakes when using the new API naming and refactoring.

V14:
   1. Drop the '_va' renaming patch and use the new API naming.
   2. Use the new refactoring to make more code reusable.
   3. And other minor suggestions from Alexander.

V13:
   1. Move page_frag_test from mm/ to tools/testing/selftests/mm.
   2. Use ptr_ring to replace ptr_pool for page_frag_test.c.
   3. Retest based on the new testing ko, which shows a very different
      result from using ptr_pool.

V12:
   1. Do not treat the page_frag_test ko as a DEBUG feature.
   2. Make some improvements to the refactoring in patch 8.
   3. Some other minor improvements as per Alexander's comment.

RFC v11:
   1. Fold the 'page_frag_cache' moving change into patch 2.
   2. Optimize patch 3 according to the discussion in v9.

V10:
   1. Change Subject to "Replace page_frag with page_frag_cache for
      sk_page_frag()".
   2. Move 'struct page_frag_cache' to sched.h as suggested by Alexander.
   3. Rename skb_copy_to_page_nocache().
   4. Adjust changes between patches to make them more reviewable as per
      Alexander's comment.
   5. Use the 'aligned_remaining' variable to generate the virtual address
      as per Alexander's comment.
   6. Some included header and typo fixes as per Alexander's comment.
   7. Add back the get_order() opt patch for xtensa arch.

V9:
   1. Add a check for test_alloc_len and change the perm of module_param()
      to 0 as per Wang Wei's comment.
   2. Rebased on latest net-next.

V8: Remove patch 2 & 3 in V7, as free_unref_page() is changed to call
    pcp_allowed_order() and used in page_frag API recently in:
    commit 5b8d75913a0e ("mm: combine free_the_page() and free_unref_page()")

V7: Fix doc build warning and error.

V6:
   1. Fix some typos and a compiler error for x86 pointed out by Jakub and
      Simon.
   2. Add two refactoring and optimization patches.

V5:
   1. Add page_frag_alloc_pg() API for the tls_device.c case and refactor
      some implementation, update the kernel bin size change as the bin
      size is increased after that.
   2. Add ack from Mat.

RFC v4:
   1. Update doc according to Randy and Mat's suggestion.
   2. Change probe API to "probe" for a specific amount of available space,
      rather than "nonzero" space, according to Mat's suggestion.
   3. Retest and update the test result.

v3:
   1. Use a new layout for 'struct page_frag_cache' as per the discussion
      with Alexander and other suggestions from Alexander.
   2. Add probe API to address Mat's comment about the mptcp use case.
   3. Some doc updating according to Bagas' suggestion.

v2:
   1. reorder test module to patch 1.
   2. split doc and maintainer updating to two patches.
   3. refactor the page_frag before moving.
   4. fix a type and 'static' warning in test module.
   5. add a patch for xtensa arch to enable using get_order() in
      BUILD_BUG_ON().
   6. Add test case and performance data for the socket code.

Yunsheng Lin (7):
  mm: page_frag: add a test module for page_frag
  mm: move the page fragment allocator from page_alloc into its own file
  mm: page_frag: use initial zero offset for page_frag_alloc_align()
  mm: page_frag: avoid caller accessing 'page_frag_cache' directly
  xtensa: remove the get_order() implementation
  mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'
  mm: page_frag: use __alloc_pages() to replace alloc_pages_node()

 arch/xtensa/include/asm/page.h                |  18 --
 drivers/vhost/net.c                           |   2 +-
 include/linux/gfp.h                           |  22 --
 include/linux/mm_types.h                      |  18 --
 include/linux/mm_types_task.h                 |  21 ++
 include/linux/page_frag_cache.h               |  61 ++++++
 include/linux/skbuff.h                        |   1 +
 mm/Makefile                                   |   1 +
 mm/page_alloc.c                               | 136 ------------
 mm/page_frag_cache.c                          | 171 +++++++++++++++
 net/core/skbuff.c                             |   6 +-
 net/rxrpc/conn_object.c                       |   4 +-
 net/rxrpc/local_object.c                      |   4 +-
 net/sunrpc/svcsock.c                          |   6 +-
 tools/testing/selftests/mm/Makefile           |   3 +
 tools/testing/selftests/mm/page_frag/Makefile |  18 ++
 .../selftests/mm/page_frag/page_frag_test.c   | 198 ++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh     |   8 +
 tools/testing/selftests/mm/test_page_frag.sh  | 175 ++++++++++++++++
 19 files changed, 665 insertions(+), 208 deletions(-)
 create mode 100644 include/linux/page_frag_cache.h
 create mode 100644 mm/page_frag_cache.c
 create mode 100644 tools/testing/selftests/mm/page_frag/Makefile
 create mode 100644 tools/testing/selftests/mm/page_frag/page_frag_test.c
 create mode 100755 tools/testing/selftests/mm/test_page_frag.sh
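
For readers who want to cross-check the headline numbers, the "seconds time elapsed" figures and the kernel image table in this cover letter can be compared with a few lines of arithmetic. The following is only a sketch: the figures are copied from the logs in this mail and nothing is measured here, and the dictionary keys and helper names (throughput_gain_pct, image_delta) are hypothetical, chosen for illustration. It also assumes one particular definition of "improvement" (work per second for a fixed workload), which the cover letter does not spell out.

```python
# Elapsed wall-clock seconds copied from the perf stat logs in this cover
# letter. Keys and helper names are illustrative, not part of the patchset.
ELAPSED = {
    "non-aligned": {"before": 4.603530546, "after": 4.240273854},
    "aligned": {"before": 3.438041238, "after": 0.935251375},
    "netcat": {"before": 27.842745343, "after": 27.370305077},
}

# Section sizes in bytes copied from the kernel image table.
IMAGE = {
    "after": {"text": 27274279, "data": 17209996, "bss": 766032},
    "before": {"text": 27278118, "data": 17209984, "bss": 766032},
}


def throughput_gain_pct(before: float, after: float) -> float:
    """Percent more work per second for a fixed workload.

    Assumed definition: (before / after - 1) * 100.
    """
    return (before / after - 1.0) * 100.0


def image_delta(image: dict) -> dict:
    """Per-section and total size change (after minus before), in bytes."""
    delta = {k: image["after"][k] - image["before"][k] for k in image["after"]}
    delta["total"] = sum(delta.values())
    return delta


if __name__ == "__main__":
    for name, t in ELAPSED.items():
        gain = throughput_gain_pct(t["before"], t["after"])
        print(f"{name}: {gain:.1f}% more throughput")
    print(image_delta(IMAGE))
```

Under this definition the aligned case lands above the "about 200%" quoted earlier and the non-aligned case slightly below "about 10%", so the quoted figures read as rounded summaries. A different definition, such as percent reduction in elapsed time, would give different numbers from the same logs.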