From patchwork Fri Apr 14 13:02:56 2023
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 13211462
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton, "Matthew Wilcox (Oracle)", Yu Zhao, "Yin, Fengwei"
Cc: Ryan Roberts, linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Subject: [RFC v2 PATCH 10/17] mm: Reuse large folios for anonymous memory
Date: Fri, 14 Apr 2023 14:02:56 +0100
Message-Id: <20230414130303.2345383-11-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20230414130303.2345383-1-ryan.roberts@arm.com>
References: <20230414130303.2345383-1-ryan.roberts@arm.com>

When taking a write fault on an anonymous page, attempt to reuse as much
of the folio as possible if it is exclusive to the process.

This avoids a problem where an exclusive, PTE-mapped THP would previously
have all of its pages except the last CoWed; the last page would then be
reused, leaving the whole original folio hanging around as well as all the
CoWed pages. This problem is exacerbated now that we are allocating
variable-order folios for anonymous memory.

The reason for the old behaviour is that a PTE-mapped THP has a reference
for each PTE, and the old code took that to mean the folio was not
exclusively mapped and therefore could not be reused. We now find the
region that intersects the underlying folio, the VMA and the PMD entry,
and treat the presence of exactly that number of references as indicating
exclusivity. Note that this region is not guaranteed to cover the whole
folio due to munmap and mremap.

The aim is to reuse as much as possible in one go in order to:
- reduce memory consumption
- reduce number of CoWs
- reduce time spent in fault handler

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 160 insertions(+), 9 deletions(-)

--
2.25.1

diff --git a/mm/memory.c b/mm/memory.c
index 83835ff5a818..7e2af54fe2e0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3038,6 +3038,26 @@ struct anon_folio_range {
 	bool exclusive;
 };
 
+static inline unsigned long page_addr(struct page *page,
+				struct page *anchor, unsigned long anchor_addr)
+{
+	unsigned long offset;
+	unsigned long addr;
+
+	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
+	addr = anchor_addr + offset;
+
+	if (anchor > page) {
+		if (addr > anchor_addr)
+			return 0;
+	} else {
+		if (addr < anchor_addr)
+			return ULONG_MAX;
+	}
+
+	return addr;
+}
+
 /*
  * Returns index of first pte that is not none, or nr if all are none.
  */
@@ -3122,6 +3142,122 @@ static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
 	return order;
 }
 
+static void calc_anon_folio_range_reuse(struct vm_fault *vmf,
+					struct folio *folio,
+					struct anon_folio_range *range_out)
+{
+	/*
+	 * The aim here is to determine the biggest range of pages that can be
+	 * reused for this CoW fault if the identified range is responsible for
+	 * all the references on the folio (i.e. it is exclusive) such that:
+	 * - All pages are contained within folio
+	 * - All pages are within VMA
+	 * - All pages are within the same pmd entry as vmf->address
+	 * - vmf->page is contained within the range
+	 * - All covered ptes must be present, physically contiguous and RO
+	 *
+	 * Note that the folio itself may not be naturally aligned in VA space
+	 * due to mremap. We take the largest range we can in order to increase
+	 * our chances of being the exclusive user of the folio, therefore
+	 * meaning we can reuse. It's possible that the folio crosses a pmd
+	 * boundary, in which case we don't follow it into the next pte because
+	 * this complicates the locking.
+	 *
+	 * Note that the caller may or may not choose to lock the pte. If
+	 * unlocked, the calculation should be considered an estimate that will
+	 * need to be validated under the lock.
+	 */
+
+	struct vm_area_struct *vma = vmf->vma;
+	struct page *page;
+	pte_t *ptep;
+	pte_t pte;
+	bool excl = true;
+	unsigned long start, end;
+	int bloops, floops;
+	int i;
+	unsigned long pfn;
+
+	/*
+	 * Iterate backwards, starting with the page immediately before the
+	 * anchor page. On exit from the loop, start is the inclusive start
+	 * virtual address of the range.
+	 */
+
+	start = page_addr(&folio->page, vmf->page, vmf->address);
+	start = max(start, vma->vm_start);
+	start = max(start, ALIGN_DOWN(vmf->address, PMD_SIZE));
+	bloops = (vmf->address - start) >> PAGE_SHIFT;
+
+	page = vmf->page - 1;
+	ptep = vmf->pte - 1;
+	pfn = page_to_pfn(vmf->page) - 1;
+
+	for (i = 0; i < bloops; i++) {
+		pte = *ptep;
+
+		if (!pte_present(pte) ||
+		    pte_write(pte) ||
+		    pte_protnone(pte) ||
+		    pte_pfn(pte) != pfn) {
+			start = vmf->address - (i << PAGE_SHIFT);
+			break;
+		}
+
+		if (excl && !PageAnonExclusive(page))
+			excl = false;
+
+		pfn--;
+		ptep--;
+		page--;
+	}
+
+	/*
+	 * Iterate forward, starting with the anchor page. On exit from the
+	 * loop, end is the exclusive end virtual address of the range.
+	 */
+
+	end = page_addr(&folio->page + folio_nr_pages(folio),
+			vmf->page, vmf->address);
+	end = min(end, vma->vm_end);
+	end = min(end, ALIGN_DOWN(vmf->address, PMD_SIZE) + PMD_SIZE);
+	floops = (end - vmf->address) >> PAGE_SHIFT;
+
+	page = vmf->page;
+	ptep = vmf->pte;
+	pfn = page_to_pfn(vmf->page);
+
+	for (i = 0; i < floops; i++) {
+		pte = *ptep;
+
+		if (!pte_present(pte) ||
+		    pte_write(pte) ||
+		    pte_protnone(pte) ||
+		    pte_pfn(pte) != pfn) {
+			end = vmf->address + (i << PAGE_SHIFT);
+			break;
+		}
+
+		if (excl && !PageAnonExclusive(page))
+			excl = false;
+
+		pfn++;
+		ptep++;
+		page++;
+	}
+
+	/*
+	 * Fill in the output range: start address, start page and pte, number
+	 * of pages, and whether the whole range is exclusive to this process.
+	 */
+
+	range_out->va_start = start;
+	range_out->pg_start = vmf->page - ((vmf->address - start) >> PAGE_SHIFT);
+	range_out->pte_start = vmf->pte - ((vmf->address - start) >> PAGE_SHIFT);
+	range_out->nr = (end - start) >> PAGE_SHIFT;
+	range_out->exclusive = excl;
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *
@@ -3528,13 +3664,23 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	/*
 	 * Private mapping: create an exclusive anonymous page copy if reuse
 	 * is impossible. We might miss VM_WRITE for FOLL_FORCE handling.
+	 * For anonymous memory, we attempt to copy/reuse in folios rather than
+	 * page-by-page. We always prefer reuse above copy, even if we can only
+	 * reuse a subset of the folio. Note that when reusing pages in a folio,
+	 * due to munmap, mremap and friends, the folio isn't guaranteed to be
+	 * naturally aligned in virtual memory space.
 	 */
 	if (folio && folio_test_anon(folio)) {
+		struct anon_folio_range range;
+		int swaprefs;
+
+		calc_anon_folio_range_reuse(vmf, folio, &range);
+
 		/*
-		 * If the page is exclusive to this process we must reuse the
-		 * page without further checks.
+		 * If the pages have already been proven to be exclusive to this
+		 * process we must reuse the pages without further checks.
 		 */
-		if (PageAnonExclusive(vmf->page))
+		if (range.exclusive)
 			goto reuse;
 
 		/*
@@ -3544,7 +3690,10 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 		 *
 		 * KSM doesn't necessarily raise the folio refcount.
 		 */
-		if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
+		swaprefs = folio_test_swapcache(folio) ?
+				folio_nr_pages(folio) : 0;
+		if (folio_test_ksm(folio) ||
+		    folio_ref_count(folio) > range.nr + swaprefs + 1)
 			goto copy;
 		if (!folio_test_lru(folio))
 			/*
@@ -3552,29 +3701,31 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 			 * remote LRU pagevecs or references to LRU folios.
 			 */
 			lru_add_drain();
-		if (folio_ref_count(folio) > 1 + folio_test_swapcache(folio))
+		if (folio_ref_count(folio) > range.nr + swaprefs)
 			goto copy;
 		if (!folio_trylock(folio))
 			goto copy;
 		if (folio_test_swapcache(folio))
 			folio_free_swap(folio);
-		if (folio_test_ksm(folio) || folio_ref_count(folio) != 1) {
+		if (folio_test_ksm(folio) ||
+		    folio_ref_count(folio) != range.nr) {
 			folio_unlock(folio);
 			goto copy;
 		}
 		/*
-		 * Ok, we've got the only folio reference from our mapping
+		 * Ok, we've got the only folio references from our mapping
 		 * and the folio is locked, it's dark out, and we're wearing
 		 * sunglasses. Hit it.
 		 */
-		page_move_anon_rmap(vmf->page, vma);
+		folio_move_anon_rmap_range(folio, range.pg_start,
+							range.nr, vma);
 		folio_unlock(folio);
 reuse:
 		if (unlikely(unshare)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return 0;
 		}
-		wp_page_reuse(vmf, NULL);
+		wp_page_reuse(vmf, &range);
 		return 0;
 	}
 copy:
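
The window arithmetic above is easier to follow outside the kernel context. Below is a small, standalone C sketch (not kernel code; all addresses, sizes and helper names are invented for the example) of how calc_anon_folio_range_reuse() clamps the candidate reuse window to the intersection of the folio's VA extent, the VMA and the PMD-sized block containing the faulting address; the per-PTE checks that can shrink the window further are omitted.

/*
 * Standalone illustration (NOT kernel code) of the window clamping done by
 * calc_anon_folio_range_reuse(). Addresses and sizes are made-up examples.
 */
#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PMD_SIZE	(1UL << 21)		/* assuming 4K pages, 2M PMDs */

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

int main(void)
{
	/* A 16-page folio that mremap has left off its natural alignment. */
	unsigned long folio_start = 0x7f0000011000UL;	/* VA of folio page 0 */
	unsigned long folio_end   = folio_start + 16 * PAGE_SIZE;
	unsigned long vma_start   = 0x7f0000000000UL;
	unsigned long vma_end     = 0x7f0000100000UL;
	unsigned long fault_addr  = 0x7f0000014000UL;

	/* PMD-sized block containing the faulting address. */
	unsigned long pmd_start = fault_addr & ~(PMD_SIZE - 1);
	unsigned long start, end;

	/* Intersection of folio extent, VMA and PMD block. */
	start = max_ul(max_ul(folio_start, vma_start), pmd_start);
	end   = min_ul(min_ul(folio_end, vma_end), pmd_start + PMD_SIZE);

	/* The per-PTE checks (present, contiguous, read-only, AnonExclusive)
	 * may shrink this window further; they are omitted here. */
	printf("candidate reuse window: [%#lx, %#lx) = %lu pages\n",
	       start, end, (end - start) >> PAGE_SHIFT);
	return 0;
}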
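
Similarly, the reference-count reasoning in the do_wp_page() hunks boils down to: the range is treated as exclusive when the folio's reference count equals the number of PTEs mapping it within the range, plus (while the folio is in the swap cache) one reference per folio page; one extra transient reference is tolerated before the folio lock is taken, and the count is re-checked under the lock. The following is a minimal standalone sketch of that arithmetic only; the helper names and numbers are illustrative, not kernel APIs.

/*
 * Standalone illustration (NOT kernel code) of the exclusivity checks used
 * in the do_wp_page() hunks above. Helper names and numbers are invented.
 */
#include <stdbool.h>
#include <stdio.h>

/* Expected references if the range is exclusive: one per mapped PTE in the
 * range, plus (per the patch) one per folio page while it sits in the
 * swap cache. */
static int expected_refs(int nr_mapped, bool in_swapcache, int folio_pages)
{
	return nr_mapped + (in_swapcache ? folio_pages : 0);
}

/* Cheap pre-lock check: tolerate one extra transient reference (e.g. from a
 * remote LRU pagevec); it is re-verified under the folio lock. */
static bool maybe_exclusive(int refcount, int nr_mapped, bool in_swapcache,
			    int folio_pages)
{
	return refcount <= expected_refs(nr_mapped, in_swapcache, folio_pages) + 1;
}

/* Final check once the folio is locked and the swap cache reference has been
 * dropped: only the mapping references from our page tables may remain. */
static bool exclusive_under_lock(int refcount, int nr_mapped)
{
	return refcount == nr_mapped;
}

int main(void)
{
	/* Example: 16-page folio, 12 PTEs of it mapped in the reuse range. */
	printf("pre-lock, not shared: %d\n",
	       maybe_exclusive(13, 12, false, 16));	/* 1: may proceed */
	printf("locked, exclusive:    %d\n",
	       exclusive_under_lock(12, 12));		/* 1: reuse */
	printf("locked, shared:       %d\n",
	       exclusive_under_lock(24, 12));		/* 0: must CoW */
	return 0;
}

The point of the two-stage check mirrors the existing single-page code: the cheap pre-lock test tolerates short-lived references (such as those held by remote LRU pagevecs), while the decision to reuse is only made once the folio lock has ruled out races.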