From patchwork Thu Jun 22 14:42:08 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 13289261
From: Ryan Roberts
To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
 Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
 Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov,
 Dmitry Vyukov, Vincenzo Frascino, Andrew Morton, Anshuman Khandual,
 Matthew Wilcox, Yu Zhao, Mark Rutland
Cc: Ryan Roberts, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH v1 13/14] mm: Batch-copy PTE ranges during fork()
Date: Thu, 22 Jun 2023 15:42:08 +0100
Message-Id: <20230622144210.2623299-14-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20230622144210.2623299-1-ryan.roberts@arm.com>
References: <20230622144210.2623299-1-ryan.roberts@arm.com>
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Convert copy_pte_range() to copy a set of ptes that map a physically
contiguous block of memory in a batch. This will likely improve
performance by a tiny amount due to batching the folio reference count
management and calling set_ptes() rather than making individual calls to
set_pte_at().

However, the primary motivation for this change is to reduce the number
of tlb maintenance operations that the arm64 backend has to perform
during fork, now that it transparently supports the "contiguous bit" in
its ptes.
By write-protecting the parent using the new ptep_set_wrprotects() (note
the 's' at the end) function, the backend can avoid having to unfold
contig ranges of PTEs, which is expensive, when all ptes in the range
are being write-protected. Similarly, by using set_ptes() rather than
set_pte_at() to set up ptes in the child, the backend does not need to
fold a contiguous range once they are all populated - they can be
initially populated as a contiguous range in the first place.

This change addresses the core-mm refactoring only, and introduces
ptep_set_wrprotects() with a default implementation that calls
ptep_set_wrprotect() for each pte in the range. A separate change will
implement ptep_set_wrprotects() in the arm64 backend to realize the
performance improvement.

Signed-off-by: Ryan Roberts
---
 include/linux/pgtable.h |  13 ++++
 mm/memory.c             | 149 +++++++++++++++++++++++++++++++---------
 2 files changed, 128 insertions(+), 34 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a661a17173fa..6a7b28d520de 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -547,6 +547,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
 }
 #endif
 
+#ifndef ptep_set_wrprotects
+struct mm_struct;
+static inline void ptep_set_wrprotects(struct mm_struct *mm,
+				unsigned long address, pte_t *ptep,
+				unsigned int nr)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
+		ptep_set_wrprotect(mm, address, ptep);
+}
+#endif
+
 /*
  * On some architectures hardware does not set page access bit when accessing
  * memory page, it is responsibility of software setting this bit. It brings
diff --git a/mm/memory.c b/mm/memory.c
index fb30f7523550..9a041cc31c74 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -911,57 +911,126 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 		/* Uffd-wp needs to be delivered to dest pte as well */
 		pte = pte_mkuffd_wp(pte);
 	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
-	return 0;
+	return 1;
+}
+
+static inline unsigned long page_addr(struct page *page,
+				struct page *anchor, unsigned long anchor_addr)
+{
+	unsigned long offset;
+	unsigned long addr;
+
+	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
+	addr = anchor_addr + offset;
+
+	if (anchor > page) {
+		if (addr > anchor_addr)
+			return 0;
+	} else {
+		if (addr < anchor_addr)
+			return ULONG_MAX;
+	}
+
+	return addr;
+}
+
+static int calc_anon_folio_map_pgcount(struct folio *folio,
+				       struct page *page, pte_t *pte,
+				       unsigned long addr, unsigned long end)
+{
+	pte_t ptent;
+	int floops;
+	int i;
+	unsigned long pfn;
+
+	end = min(page_addr(&folio->page + folio_nr_pages(folio), page, addr),
+		  end);
+	floops = (end - addr) >> PAGE_SHIFT;
+	pfn = page_to_pfn(page);
+	pfn++;
+	pte++;
+
+	for (i = 1; i < floops; i++) {
+		ptent = ptep_get(pte);
+
+		if (!pte_present(ptent) ||
+		    pte_pfn(ptent) != pfn) {
+			return i;
+		}
+
+		pfn++;
+		pte++;
+	}
+
+	return floops;
 }
 
 /*
- * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
- * is required to copy this pte.
+ * Copy set of contiguous ptes. Returns number of ptes copied if succeeded
+ * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
+ * first pte.
  */
 static inline int
-copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
-		 pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
-		 struct folio **prealloc)
+copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
+		  pte_t *dst_pte, pte_t *src_pte,
+		  unsigned long addr, unsigned long end,
+		  int *rss, struct folio **prealloc)
 {
 	struct mm_struct *src_mm = src_vma->vm_mm;
 	unsigned long vm_flags = src_vma->vm_flags;
 	pte_t pte = ptep_get(src_pte);
 	struct page *page;
 	struct folio *folio;
+	bool anon;
+	int nr;
+	int i;
 
 	page = vm_normal_page(src_vma, addr, pte);
-	if (page)
+	if (page) {
 		folio = page_folio(page);
-	if (page && folio_test_anon(folio)) {
-		/*
-		 * If this page may have been pinned by the parent process,
-		 * copy the page immediately for the child so that we'll always
-		 * guarantee the pinned page won't be randomly replaced in the
-		 * future.
-		 */
-		folio_get(folio);
-		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
-			/* Page may be pinned, we have to copy. */
-			folio_put(folio);
-			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
-						 addr, rss, prealloc, page);
+		anon = folio_test_anon(folio);
+		nr = calc_anon_folio_map_pgcount(folio, page, src_pte, addr, end);
+
+		for (i = 0; i < nr; i++, page++) {
+			if (anon) {
+				/*
+				 * If this page may have been pinned by the
+				 * parent process, copy the page immediately for
+				 * the child so that we'll always guarantee the
+				 * pinned page won't be randomly replaced in the
+				 * future.
+				 */
+				if (unlikely(page_try_dup_anon_rmap(
+						page, false, src_vma))) {
+					if (i != 0)
+						break;
+					/* Page may be pinned, we have to copy. */
+					return copy_present_page(dst_vma, src_vma,
+								 dst_pte, src_pte,
+								 addr, rss,
+								 prealloc, page);
+				}
+				rss[MM_ANONPAGES]++;
+				VM_BUG_ON(PageAnonExclusive(page));
+			} else {
+				page_dup_file_rmap(page, false);
+				rss[mm_counter_file(page)]++;
+			}
 		}
-		rss[MM_ANONPAGES]++;
-	} else if (page) {
-		folio_get(folio);
-		page_dup_file_rmap(page, false);
-		rss[mm_counter_file(page)]++;
-	}
+
+		nr = i;
+		folio_ref_add(folio, nr);
+	} else
+		nr = 1;
 
 	/*
 	 * If it's a COW mapping, write protect it both
 	 * in the parent and the child
 	 */
 	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
+		ptep_set_wrprotects(src_mm, addr, src_pte, nr);
 		pte = pte_wrprotect(pte);
 	}
-	VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
 
 	/*
 	 * If it's a shared mapping, mark it clean in
@@ -974,8 +1043,8 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	if (!userfaultfd_wp(dst_vma))
 		pte = pte_clear_uffd_wp(pte);
 
-	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
-	return 0;
+	set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
+	return nr;
 }
 
 static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm,
@@ -1065,15 +1134,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 			 */
 			WARN_ON_ONCE(ret != -ENOENT);
 		}
-		/* copy_present_pte() will clear `*prealloc' if consumed */
-		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
-				       addr, rss, &prealloc);
+		/* copy_present_ptes() will clear `*prealloc' if consumed */
+		ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
+				addr, end, rss, &prealloc);
+
 		/*
 		 * If we need a pre-allocated page for this pte, drop the
 		 * locks, allocate, and try again.
 		 */
 		if (unlikely(ret == -EAGAIN))
 			break;
+
+		/*
+		 * Positive return value is the number of ptes copied.
+		 */
+		VM_WARN_ON_ONCE(ret < 1);
+		progress += 8 * ret;
+		ret--;
+		dst_pte += ret;
+		src_pte += ret;
+		addr += ret << PAGE_SHIFT;
+		ret = 0;
+
 		if (unlikely(prealloc)) {
 			/*
 			 * pre-alloc page cannot be reused by next time so as
@@ -1084,7 +1166,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 			folio_put(prealloc);
 			prealloc = NULL;
 		}
-		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
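
Illustration only, not part of the patch: the run-length scan that
calc_anon_folio_map_pgcount() performs can be modelled in userspace as
below. This is a rough sketch under stated assumptions. struct model_pte
and count_contig_run() are hypothetical names that do not exist in the
kernel, and max_nr stands in for the cap that the patch derives from the
end of the folio and the end of the range being copied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a pte; this is not the kernel's pte_t. */
struct model_pte {
	bool present;		/* models pte_present() */
	unsigned long pfn;	/* models pte_pfn() */
};

/*
 * Count how many entries, starting at ptes[0], are present and map
 * consecutive page frames. The first entry is assumed to be present, so
 * the result is always at least 1 and never more than max_nr.
 */
static size_t count_contig_run(const struct model_pte *ptes, size_t max_nr)
{
	size_t i;

	assert(max_nr >= 1 && ptes[0].present);

	for (i = 1; i < max_nr; i++) {
		/* Stop at the first hole or the first non-consecutive frame. */
		if (!ptes[i].present || ptes[i].pfn != ptes[0].pfn + i)
			break;
	}

	return i;
}
```

In this model, a caller that gets a count of n back would treat the
first n entries as one batch, in the way copy_pte_range() advances
dst_pte, src_pte and addr by the value copy_present_ptes() returns.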