From patchwork Fri Dec 6 12:25:23 2024
X-Patchwork-Submitter: Yunsheng Lin
X-Patchwork-Id: 13897118
From: Yunsheng Lin
CC: Yunsheng Lin, Alexander Duyck, Shuah Khan, Andrew Morton, Linux-MM
Subject: [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2)
Date: Fri, 6 Dec 2024 20:25:23 +0800
Message-ID: <20241206122533.3589947-1-linyunsheng@huawei.com>

This is part 2 of "Replace page_frag with page_frag_cache", which
introduces the new API and replaces page_frag with page_frag_cache for
sk_page_frag().

The part 1 of "Replace page_frag with page_frag_cache" is in [1].

After [2], there are still two implementations for page frag:

1. mm/page_alloc.c: net stack seems to be using it in the rx part with
   'struct page_frag_cache' and the main API being
   page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the tx part with
   'struct page_frag' and the main API being skb_page_frag_refill().

This patchset tries to unify the page frag implementation by replacing
page_frag with page_frag_cache for sk_page_frag() first.
net_high_order_alloc_disable_key for the implementation in
net/core/sock.c doesn't seem to matter that much now, as pcp is also
supported for high-order pages:
commit 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")

As the change is mostly related to networking, this series targets
net-next. We will try to replace the rest of page_frag in a follow-up
patchset.

After this patchset:
1. The page frag implementation is unified by taking the best out of
   the two existing implementations: we are able to save some space for
   the 'page_frag_cache' API user, and avoid 'get_page()' for the old
   'page_frag' API user.
2. Future bugfixes and performance work can be done in one place, hence
   improving maintainability of page_frag's implementation.

Performance validation for part2:
1. Using the micro-benchmark ko added in patch 1 to test the aligned and
   non-aligned API performance impact for the existing users, there
   seems to be about 20% performance degradation for refactoring
   page_frag to support the new API, which seems to nullify most of the
   performance gain in [3] of part1.
2. Using the below netcat test case, there seems to be some minor
   performance gain for replacing 'page_frag' with 'page_frag_cache'
   using the new page_frag API after this patchset.
   server: taskset -c 32 nc -l -k 1234 > /dev/null
   client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234

In order to avoid performance noise as much as possible, the testing
is done on a system without any other load, with enough iterations to
show that the data is stable. The complete log for the testing is
below:

perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000
perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1
taskset -c 32 nc -l -k 1234 > /dev/null
perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234

*After* this patchset:

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):

         18.753187      task-clock (msec)        #    0.003 CPUs utilized     ( +-  0.44% )
                 8      context-switches         #    0.422 K/sec             ( +-  0.30% )
                 0      cpu-migrations           #    0.003 K/sec             ( +- 32.09% )
                84      page-faults              #    0.004 M/sec             ( +-  0.08% )
          48700826      cycles                   #    2.597 GHz               ( +-  0.44% )
          62086543      instructions             #     1.27 insn per cycle    ( +-  0.03% )
          14869358      branches                 #  792.898 M/sec             ( +-  0.03% )
             19639      branch-misses            #    0.13% of all branches   ( +-  0.60% )

       7.035285915 seconds time elapsed                                       ( +-  0.06% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):

         18.442151      task-clock (msec)        #    0.006 CPUs utilized     ( +-  0.01% )
                 8      context-switches         #    0.422 K/sec             ( +-  0.40% )
                 0      cpu-migrations           #    0.001 K/sec             ( +- 57.44% )
                84      page-faults              #    0.005 M/sec             ( +-  0.08% )
          47890149      cycles                   #    2.597 GHz               ( +-  0.01% )
          60718325      instructions             #     1.27 insn per cycle    ( +-  0.00% )
          14570862      branches                 #  790.085 M/sec             ( +-  0.00% )
             19613      branch-misses            #    0.13% of all branches   ( +-  0.12% )

       3.210892358 seconds time elapsed                                       ( +-  0.12% )

 Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):

      16824.017944      task-clock (msec)        #    0.621 CPUs utilized     ( +-  0.02% )
           2987954      context-switches         #    0.178 M/sec             ( +-  0.04% )
                 1      cpu-migrations           #    0.000 K/sec
                93      page-faults              #    0.006 K/sec             ( +-  0.09% )
       31982647267      cycles                   #    1.901 GHz               ( +-  0.03% )
       38907812424      instructions             #     1.22 insn per cycle    ( +-  0.02% )
        7112328962      branches                 #  422.749 M/sec             ( +-  0.03% )
          94789062      branch-misses            #    1.33% of all branches   ( +-  0.21% )

      27.104994660 seconds time elapsed                                       ( +-  0.03% )

*Before* this patchset:

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):

         18.700051      task-clock (msec)        #    0.003 CPUs utilized     ( +-  1.04% )
                 8      context-switches         #    0.420 K/sec             ( +-  0.31% )
                 0      cpu-migrations           #    0.019 K/sec             ( +- 10.16% )
                81      page-faults              #    0.004 M/sec             ( +-  0.09% )
          48548980      cycles                   #    2.596 GHz               ( +-  1.04% )
          61857980      instructions             #     1.27 insn per cycle    ( +-  0.09% )
          14814201      branches                 #  792.201 M/sec             ( +-  0.08% )
             42007      branch-misses            #    0.28% of all branches   ( +-  0.11% )

       5.565806266 seconds time elapsed                                       ( +-  0.08% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):

         18.468618      task-clock (msec)        #    0.007 CPUs utilized     ( +-  1.14% )
                 8      context-switches         #    0.422 K/sec             ( +-  0.43% )
                 0      cpu-migrations           #    0.026 K/sec             ( +-  7.89% )
                81      page-faults              #    0.004 M/sec             ( +-  0.08% )
          47950150      cycles                   #    2.596 GHz               ( +-  1.14% )
          61745530      instructions             #     1.29 insn per cycle    ( +-  0.09% )
          14787783      branches                 #  800.698 M/sec             ( +-  0.08% )
             41734      branch-misses            #    0.28% of all branches   ( +-  0.09% )

       2.584180919 seconds time elapsed                                       ( +-  0.04% )

 Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):

      17105.617450      task-clock (msec)        #    0.599 CPUs utilized     ( +-  0.02% )
           2822654      context-switches         #    0.165 M/sec             ( +-  0.03% )
                 1      cpu-migrations           #    0.000 K/sec             ( +-  0.50% )
                93      page-faults              #    0.005 K/sec             ( +-  0.09% )
       31819244033      cycles                   #    1.860 GHz               ( +-  0.03% )
       37297412811      instructions             #     1.17 insn per cycle    ( +-  0.01% )
        6676699757      branches                 #  390.322 M/sec             ( +-  0.01% )
         325102016      branch-misses            #    4.87% of all branches   ( +-  0.06% )

      28.568053622 seconds time elapsed                                       ( +-  0.02% )

Note, ipv4-udp, ipv6-tcp and ipv6-udp are also tested with the below
script:

nc -u -l -k 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc -u 127.0.0.1 1234

nc -l6 -k 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc ::1 1234

nc -l6 -k -u 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc -u ::1 1234

CC: Alexander Duyck
CC: Shuah Khan
CC: Andrew Morton
CC: Linux-MM

1. https://lore.kernel.org/all/20241028115343.3405838-1-linyunsheng@huawei.com/
2. https://lore.kernel.org/all/20240228093013.8263-1-linyunsheng@huawei.com/
3. https://lore.kernel.org/all/472a7a09-387f-480d-b66c-761e0b6192ef@huawei.com/

V2: Repost based on the latest net-next.

V1: Rebase on latest net-next tree and redo the performance test.

RFC:
1. CC Andrew and MM ML explicitly.
2. Split into two parts according to the discussion in v22, and this is
   the part-2.
3. Split the 'introduce new API' patch into more patches to make it
   more reviewable and easier to discuss.
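
For readers who want to compare the before/after runs directly, the
"seconds time elapsed" values in the perf logs above can be turned into
relative changes. The following is only a minimal sketch in Python: it
assumes wall-clock elapsed time is the metric of interest (the cover
letter does not say which metric the "about 20%" figure is based on),
and the dictionary keys and helper names are made up for illustration.

```python
# Elapsed seconds copied from the perf stat logs in this cover letter,
# as (before, after) pairs. The labels are illustrative only.
ELAPSED = {
    "page_frag_test (non-aligned)": (5.565806266, 7.035285915),
    "page_frag_test (test_align=1)": (2.584180919, 3.210892358),
    "netcat ipv4-tcp, 20G": (28.568053622, 27.104994660),
}


def pct_change(before: float, after: float) -> float:
    """Relative change in elapsed time in percent (positive means slower)."""
    return (after - before) / before * 100.0


def summarize(elapsed):
    """Map each test label to its relative change in elapsed time."""
    return {name: pct_change(b, a) for name, (b, a) in elapsed.items()}


if __name__ == "__main__":
    for name, delta in summarize(ELAPSED).items():
        print(f"{name:32s} {delta:+.1f}%")
```

By this measure the two micro-benchmark runs take roughly a quarter
longer after the patchset, while the netcat run takes about 5% less
time. Other counters in the logs (cycles, instructions, branch-misses)
could be compared the same way.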

Yunsheng Lin (10):
  mm: page_frag: some minor refactoring before adding new API
  net: rename skb_copy_to_page_nocache() helper
  mm: page_frag: update documentation for page_frag
  mm: page_frag: introduce page_frag_alloc_abort() related API
  mm: page_frag: introduce refill prepare & commit API
  mm: page_frag: introduce alloc_refill prepare & commit API
  mm: page_frag: introduce probe related API
  mm: page_frag: add testing for the newly added API
  net: replace page_frag with page_frag_cache
  mm: page_frag: add an entry in MAINTAINERS for page_frag

 Documentation/mm/page_frags.rst               | 207 ++++++++++-
 MAINTAINERS                                   |  12 +
 .../chelsio/inline_crypto/chtls/chtls.h       |   3 -
 .../chelsio/inline_crypto/chtls/chtls_io.c    | 101 ++----
 .../chelsio/inline_crypto/chtls/chtls_main.c  |   3 -
 drivers/net/tun.c                             |  47 ++-
 include/linux/page_frag_cache.h               | 330 +++++++++++++++++-
 include/linux/sched.h                         |   2 +-
 include/net/sock.h                            |  30 +-
 kernel/exit.c                                 |   3 +-
 kernel/fork.c                                 |   3 +-
 mm/page_frag_cache.c                          | 108 +++++-
 net/core/skbuff.c                             |  58 +--
 net/core/skmsg.c                              |  12 +-
 net/core/sock.c                               |  32 +-
 net/ipv4/ip_output.c                          |  28 +-
 net/ipv4/tcp.c                                |  26 +-
 net/ipv4/tcp_output.c                         |  25 +-
 net/ipv6/ip6_output.c                         |  28 +-
 net/kcm/kcmsock.c                             |  21 +-
 net/mptcp/protocol.c                          |  47 ++-
 net/tls/tls_device.c                          | 100 ++++--
 .../selftests/mm/page_frag/page_frag_test.c   |  76 +++-
 tools/testing/selftests/mm/run_vmtests.sh     |   4 +
 tools/testing/selftests/mm/test_page_frag.sh  |  27 ++
 25 files changed, 1045 insertions(+), 288 deletions(-)
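
The patch titles above name "prepare & commit", "probe" and "abort"
operations without describing them in this cover letter. As a rough
mental model only, the toy Python class below shows one way such a
split could behave: prepare exposes the free space without consuming
it, commit consumes what the caller actually used, probe checks for
space without refilling, and abort gives bytes back. Every name,
signature and semantic detail here is an assumption made for
illustration; it is not the kernel API, and it ignores page reference
counting entirely.

```python
from dataclasses import dataclass
from typing import Tuple

PAGE_SIZE = 4096  # assumed backing-buffer size for this toy model


@dataclass
class FragCache:
    """Hypothetical toy model of a fragment cache; NOT the kernel structure."""
    size: int = PAGE_SIZE
    offset: int = 0    # first unused byte in the current buffer
    refills: int = 0   # number of times a fresh buffer was taken

    def probe(self, need: int) -> int:
        """Return the space left if it covers `need`, else 0. Never refills."""
        left = self.size - self.offset
        return left if left >= need else 0

    def prepare(self, need: int) -> Tuple[int, int]:
        """Return (offset, available) with available >= need.

        Takes a fresh buffer if the current one is too small. Nothing is
        consumed until commit() is called.
        """
        if need > self.size:
            raise ValueError("request larger than backing buffer")
        if self.probe(need) == 0:
            # A real allocator would keep the old buffer alive for earlier
            # fragments; this model simply starts over.
            self.offset = 0
            self.refills += 1
        return self.offset, self.size - self.offset

    def commit(self, used: int) -> None:
        """Consume `used` bytes of the region handed out by prepare()."""
        if used > self.size - self.offset:
            raise ValueError("commit exceeds prepared region")
        self.offset += used

    def abort(self, size: int) -> None:
        """Give back the most recently committed `size` bytes."""
        if size > self.offset:
            raise ValueError("abort exceeds committed bytes")
        self.offset -= size
```

In this model a sender would call prepare() with a minimum size, copy
as much data as fits into the returned region, and then commit() only
the bytes it wrote, so the unused tail remains available to the next
caller.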