From: Peng Zhang
To: Liam.Howlett@oracle.com, corbet@lwn.net, akpm@linux-foundation.org, willy@infradead.org, brauner@kernel.org, surenb@google.com, michael.christie@oracle.com, peterz@infradead.org, mathieu.desnoyers@efficios.com, npiggin@gmail.com, avagin@gmail.com
Cc: linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Peng Zhang
Subject: [PATCH 11/11] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
Date: Wed, 26 Jul 2023 16:09:16 +0800
Message-Id: <20230726080916.17454-12-zhangpeng.00@bytedance.com>
In-Reply-To: <20230726080916.17454-1-zhangpeng.00@bytedance.com>
References: <20230726080916.17454-1-zhangpeng.00@bytedance.com>

Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
directly replace the VMA entries in the new maple tree. This is faster than
the previous approach of storing each VMA into the new tree one at a time.
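To illustrate the idea, here is a minimal userspace sketch of the pattern. The whole index is copied once, and then the copy is walked and each entry is overwritten in place. The sketch uses a flat array of entry pointers as a hypothetical stand-in for the maple tree. toy_vma, toy_index, toy_index_dup() and toy_dup_mmap() are illustrative names, not kernel APIs. The sketch only shows the control flow and says nothing about the maple tree's node layout or real costs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a VMA (not struct vm_area_struct). */
struct toy_vma {
	unsigned long start, end;
	bool dontcopy;			/* models VM_DONTCOPY */
};

/*
 * Hypothetical flat index standing in for the maple tree:
 * slot[i] points at the i-th VMA in address order.
 */
struct toy_index {
	size_t n;
	struct toy_vma **slot;
};

/* Step 1: copy every slot in one pass (the role __mt_dup() plays). */
static int toy_index_dup(const struct toy_index *src, struct toy_index *dst)
{
	dst->n = src->n;
	dst->slot = NULL;
	if (src->n == 0)
		return 0;
	dst->slot = malloc(src->n * sizeof(*dst->slot));
	if (!dst->slot) {
		dst->n = 0;
		return -1;
	}
	memcpy(dst->slot, src->slot, src->n * sizeof(*dst->slot));
	return 0;
}

/*
 * Step 2: walk the copy and overwrite each slot in place, either with the
 * child's own copy (the role mas_replace_entry() plays) or with NULL for a
 * "dontcopy" VMA (the role mas_store(&vmi.mas, NULL) plays).
 * `copies` must have room for src->n entries.
 */
static int toy_dup_mmap(const struct toy_index *src, struct toy_index *dst,
			struct toy_vma *copies)
{
	assert(src && dst && (copies || src->n == 0));
	if (toy_index_dup(src, dst))
		return -1;
	for (size_t i = 0; i < dst->n; i++) {
		const struct toy_vma *old = dst->slot[i];

		if (old->dontcopy) {
			dst->slot[i] = NULL;
			continue;
		}
		copies[i] = *old;		/* like vm_area_dup() */
		dst->slot[i] = &copies[i];
	}
	return 0;
}

static void toy_index_free(struct toy_index *idx)
{
	free(idx->slot);
	idx->slot = NULL;
	idx->n = 0;
}
```

The parent's index is never modified. Only the duplicated slots are rewritten, which mirrors how dup_mmap() now edits the child's tree after __mt_dup().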
dup_mmap() is used by fork(), so this patch optimizes fork(). The benefit
grows with the number of VMAs.

With this change, dup_mmap() no longer benefits from the optimization in
"maple_tree: add a fast path case in mas_wr_slot_store()"[1]. That patch
remains a useful optimization of the maple tree in general.

There is a unixbench test suite[2] where 'spawn' is used to test fork().
'spawn' only has 23 VMAs by default, so I tweaked the benchmark code a bit
to use mmap() to control the number of VMAs. This allows the performance to
be measured under different numbers of VMAs. Insert code like below into
'spawn':

	/*
	 * Alternate PROT_READ and PROT_WRITE so that adjacent mappings
	 * cannot be merged into a single VMA.
	 */
	for (int i = 0; i < 200; ++i) {
		size_t size = 10 * getpagesize();
		void *addr;

		if (i & 1) {
			addr = mmap(NULL, size, PROT_READ,
				    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		} else {
			addr = mmap(NULL, size, PROT_WRITE,
				    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		}
		if (addr == MAP_FAILED)
			...
	}

Based on next-20230721, I ran 'spawn' with 23, 223, and 4023 VMAs, 4 times
for 30 seconds each, and got the following numbers. Each number is the count
of successful fork() calls in 30 seconds (the average of the best 3 out of 4
runs). For comparison, I also tested next-20230725 with [1] reverted. To
ensure the reliability of the results, all tests were run on a physical
machine.

                 23 VMAs     223 VMAs    4023 VMAs
revert [1]:      159104.00   73316.33    6787.00
                 +0.77%      +0.42%      +0.28%
next-20230721:   160321.67   73624.67    6806.33
                 +2.77%      +15.42%     +29.86%
apply this:      164751.67   84980.33    8838.67

The improvement grows with the number of VMAs. It is about 3% with 23 VMAs,
about 15% with 223 VMAs, and about 30% with 4023 VMAs.
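The percentages in the table are relative changes between adjacent rows, (new - old) / old * 100. Each count is the mean of the best 3 of 4 runs. Below is a small sketch of that arithmetic. It assumes that "best" means the highest fork() counts. improvement_pct() and best3_of_4() are hypothetical helpers, not part of unixbench.

```c
#include <assert.h>
#include <stddef.h>

/* Relative change from `base` to `val`, in percent: (val - base) / base * 100. */
static double improvement_pct(double base, double val)
{
	assert(base != 0.0);
	return (val - base) / base * 100.0;
}

/*
 * Mean of the best 3 of 4 runs, assuming "best" means the highest counts:
 * drop the single lowest run and average the remaining three.
 */
static double best3_of_4(const double runs[4])
{
	double sum = 0.0, min = runs[0];

	for (size_t i = 0; i < 4; i++) {
		sum += runs[i];
		if (runs[i] < min)
			min = runs[i];
	}
	return (sum - min) / 3.0;
}
```

For example, improvement_pct(73624.67, 84980.33) is about 15.42. This matches the 223 VMAs column.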
[1] https://lore.kernel.org/lkml/20230628073657.75314-4-zhangpeng.00@bytedance.com/
[2] https://github.com/kdlucas/byte-unixbench/tree/master

Signed-off-by: Peng Zhang
---
 kernel/fork.c | 35 +++++++++++++++++++++++++++--------
 mm/mmap.c     | 14 ++++++++++++--
 2 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index f81149739eb9..ef80025b62d6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	int retval;
 	unsigned long charge = 0;
 	LIST_HEAD(uf);
-	VMA_ITERATOR(old_vmi, oldmm, 0);
 	VMA_ITERATOR(vmi, mm, 0);
 
 	uprobe_start_dup_mmap();
@@ -678,17 +677,40 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		goto out;
 	khugepaged_fork(mm, oldmm);
 
-	retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
-	if (retval)
+	/* Use __mt_dup() to efficiently build an identical maple tree. */
+	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
+	if (unlikely(retval))
 		goto out;
 
 	mt_clear_in_rcu(vmi.mas.tree);
-	for_each_vma(old_vmi, mpnt) {
+	for_each_vma(vmi, mpnt) {
 		struct file *file;
 
 		vma_start_write(mpnt);
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
+
+			/*
+			 * Since the new tree is exactly the same as the old one,
+			 * we need to remove the unneeded VMAs.
+			 */
+			mas_store(&vmi.mas, NULL);
+
+			/*
+			 * Even removing an entry may require memory allocation,
+			 * and if removal fails, we use XA_ZERO_ENTRY to mark
+			 * from which VMA it failed. The case of encountering
+			 * XA_ZERO_ENTRY will be handled in exit_mmap().
+			 */
+			if (unlikely(mas_is_err(&vmi.mas))) {
+				retval = xa_err(vmi.mas.node);
+				mas_reset(&vmi.mas);
+				if (mas_find(&vmi.mas, ULONG_MAX))
+					mas_replace_entry(&vmi.mas,
+							  XA_ZERO_ENTRY);
+				goto loop_out;
+			}
+
 			continue;
 		}
 		charge = 0;
@@ -750,8 +772,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		hugetlb_dup_vma_private(tmp);
 
 		/* Link the vma into the MT */
-		if (vma_iter_bulk_store(&vmi, tmp))
-			goto fail_nomem_vmi_store;
+		mas_replace_entry(&vmi.mas, tmp);
 
 		mm->map_count++;
 		if (!(tmp->vm_flags & VM_WIPEONFORK))
@@ -778,8 +799,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	uprobe_end_dup_mmap();
 	return retval;
 
-fail_nomem_vmi_store:
-	unlink_anon_vmas(tmp);
 fail_nomem_anon_vma_fork:
 	mpol_put(vma_policy(tmp));
 fail_nomem_policy:
diff --git a/mm/mmap.c b/mm/mmap.c
index bc91d91261ab..5bfba2fb0e39 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3184,7 +3184,11 @@ void exit_mmap(struct mm_struct *mm)
 	arch_exit_mmap(mm);
 
 	vma = mas_find(&mas, ULONG_MAX);
-	if (!vma) {
+	/*
+	 * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
+	 * xa_is_zero(vma) may be true.
+	 */
+	if (!vma || xa_is_zero(vma)) {
 		/* Can happen if dup_mmap() received an OOM */
 		mmap_read_unlock(mm);
 		return;
@@ -3222,7 +3226,13 @@ void exit_mmap(struct mm_struct *mm)
 		remove_vma(vma, true);
 		count++;
 		cond_resched();
-	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
+		vma = mas_find(&mas, ULONG_MAX);
+		/*
+		 * If xa_is_zero(vma) is true, it means that subsequent VMAs
+		 * do not need to be removed. Can happen if dup_mmap() fails to
+		 * remove a VMA marked VM_DONTCOPY.
+		 */
+	} while (vma != NULL && !xa_is_zero(vma));
 
 	BUG_ON(count != mm->map_count);
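The error handling in dup_mmap() and exit_mmap() relies on a marker protocol between the two functions. When removing a VM_DONTCOPY entry fails, the failing slot is overwritten with XA_ZERO_ENTRY. Every non-empty slot before the marker already holds a VMA owned by the child. Every slot after it still points at one of the parent's VMAs, so teardown must stop at the marker. Below is a minimal userspace model of that protocol. It assumes a flat slot array instead of the maple tree. TOY_ZERO_ENTRY, TOY_NO_FAILURE, toy_dup_loop() and toy_teardown() are hypothetical names, and `fail_at` only simulates an allocation failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical sentinel standing in for XA_ZERO_ENTRY: an address that
 * can never alias a real entry.
 */
static char toy_zero_storage;
#define TOY_ZERO_ENTRY	((void *)&toy_zero_storage)
#define TOY_NO_FAILURE	((size_t)-1)

/*
 * Models the dup_mmap() loop over a duplicated slot array. The slots start
 * out holding the parent's entries. Each slot is either replaced by the
 * child's copy or, if dontcopy[i], cleared. If clearing slot `fail_at`
 * "fails", that slot is overwritten with the sentinel and the loop stops,
 * so every later slot still points at a parent entry. Returns 0 or -1.
 */
static int toy_dup_loop(void **slot, size_t n, const bool *dontcopy,
			void *const *child, size_t fail_at)
{
	assert(slot && dontcopy && child);
	for (size_t i = 0; i < n; i++) {
		if (dontcopy[i]) {
			if (i == fail_at) {
				slot[i] = TOY_ZERO_ENTRY;
				return -1;
			}
			slot[i] = NULL;
			continue;
		}
		slot[i] = child[i];
	}
	return 0;
}

/*
 * Models the exit_mmap() loop: visit the non-empty slots in order and stop
 * at the sentinel, because everything after it belongs to the parent.
 * Returns how many entries would be released.
 */
static size_t toy_teardown(void *const *slot, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (slot[i] == TOY_ZERO_ENTRY)
			break;
		if (slot[i] != NULL)
			count++;	/* the real code calls remove_vma() here */
	}
	return count;
}
```

Consider slots {A, B (dontcopy), C, D (dontcopy)} with a simulated failure on B. Teardown releases only the child's copy of A. C and D still belong to the parent and are left alone. The returned count plays the same role as the `count` that exit_mmap() checks against mm->map_count.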