From patchwork Mon Dec 16 16:51:02 2024
X-Patchwork-Submitter: Dev Jain <dev.jain@arm.com>
X-Patchwork-Id: 13910083
From: Dev Jain <dev.jain@arm.com>
To: akpm@linux-foundation.org, david@redhat.com, willy@infradead.org,
	kirill.shutemov@linux.intel.com
Cc: ryan.roberts@arm.com, anshuman.khandual@arm.com, catalin.marinas@arm.com,
	cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
	dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
	jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com,
	hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
	peterx@redhat.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com,
	ziy@nvidia.com, jglisse@google.com, surenb@google.com,
	vishal.moola@gmail.com, zokeefe@google.com, zhengqi.arch@bytedance.com,
	jhubbard@nvidia.com, 21cnbao@gmail.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Dev Jain <dev.jain@arm.com>
Subject: [RFC PATCH 09/12] khugepaged: Introduce vma_collapse_anon_folio()
Date: Mon, 16 Dec 2024 22:21:02 +0530
Message-Id: <20241216165105.56185-10-dev.jain@arm.com>
X-Mailer: git-send-email 2.39.3 (Apple Git-146)
In-Reply-To: <20241216165105.56185-1-dev.jain@arm.com>
References: <20241216165105.56185-1-dev.jain@arm.com>

In contrast to PMD collapse, we do not need to operate on two levels of
the pagetable simultaneously, so downgrade the mmap lock from write to
read mode. We still take the anon_vma lock in exclusive mode so as not
to waste time in the rmap path, which is going to fail anyway since the
PTEs are about to be changed. Under the PTL, copy the page contents,
clear the PTEs, drop the folio pins, and (try to) unmap the old folios.
Finally, map the new folio with the set_ptes() API.
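In outline, the new path does the following (a simplified sketch of
vma_collapse_anon_folio() from the diff below; error handling is
elided and the setup of entry/range is abbreviated):

	anon_vma_lock_write(vma->anon_vma);
	mmu_notifier_invalidate_range_start(&range);
	pte = pte_offset_map_lock(mm, pmd, address, &pte_ptl);
	/* lock and isolate the small folios backing the range */
	__collapse_huge_page_isolate(vma, address, pte, cc,
				     &compound_pagelist, order);
	anon_vma_unlock_write(vma->anon_vma);
	/* under the PTL: copy contents, clear PTEs, unpin and unmap */
	__collapse_huge_page_copy(pte, folio, pmd, *pmd, vma, address,
				  pte_ptl, &compound_pagelist, order);
	/* map the new folio, one PTE per subpage, in a single call */
	set_ptes(mm, address, pte, entry, nr_pages);
	pte_unmap_unlock(pte, pte_ptl);
	mmu_notifier_invalidate_range_end(&range);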
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
Note: I have been trying hard to get rid of the locks in here: we still
take the PTL around the page copying; dropping the PTL and retaking it
after the copying can lead to a deadlock, for example:

	khugepaged			madvise(MADV_COLD)
	folio_lock()			lock(ptl)
	lock(ptl)			folio_lock()

We could build a list of locked folios, drop both locks altogether,
take the PTL, do everything __collapse_huge_page_isolate() does
*except* the isolation, and then try locking the folios again, but that
would reduce the efficiency of khugepaged and almost looks like a
forced solution :)
Please see the following discussion if anyone is interested:
https://lore.kernel.org/all/66bb7496-a445-4ad7-8e56-4f2863465c54@arm.com/
(Apologies for not CCing the mailing list from the start)
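Spelled out as lock acquisition orders (an illustrative sketch, not
code from this patch; ptl and folio stand for the page table lock and
one of the small folios in the range):

	/* khugepaged, had it dropped the PTL for the copy */
	folio_lock(folio);	/* 1: owns the folio lock */
	spin_lock(ptl);		/* 2: waits for madvise to release ptl */

	/* madvise(MADV_COLD) */
	spin_lock(ptl);		/* 1: owns the PTL */
	folio_lock(folio);	/* 2: waits for khugepaged's folio lock */

Each side holds the lock the other one needs (a classic ABBA
inversion), which is why this version keeps the copy under the PTL.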
 mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 87 insertions(+), 21 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 88beebef773e..8040b130e677 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -714,24 +714,28 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 						struct vm_area_struct *vma,
 						unsigned long address,
 						spinlock_t *ptl,
-						struct list_head *compound_pagelist)
+						struct list_head *compound_pagelist, int order)
 {
 	struct folio *src, *tmp;
 	pte_t *_pte;
 	pte_t pteval;
 
-	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+	for (_pte = pte; _pte < pte + (1UL << order);
 	     _pte++, address += PAGE_SIZE) {
 		pteval = ptep_get(_pte);
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
 			if (is_zero_pfn(pte_pfn(pteval))) {
-				/*
-				 * ptl mostly unnecessary.
-				 */
-				spin_lock(ptl);
-				ptep_clear(vma->vm_mm, address, _pte);
-				spin_unlock(ptl);
+				if (order == HPAGE_PMD_ORDER) {
+					/*
+					 * ptl mostly unnecessary.
+					 */
+					spin_lock(ptl);
+					ptep_clear(vma->vm_mm, address, _pte);
+					spin_unlock(ptl);
+				} else {
+					ptep_clear(vma->vm_mm, address, _pte);
+				}
 				ksm_might_unmap_zero_page(vma->vm_mm, pteval);
 			}
 		} else {
@@ -740,15 +744,20 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
 			src = page_folio(src_page);
 			if (!folio_test_large(src))
 				release_pte_folio(src);
-			/*
-			 * ptl mostly unnecessary, but preempt has to
-			 * be disabled to update the per-cpu stats
-			 * inside folio_remove_rmap_pte().
-			 */
-			spin_lock(ptl);
-			ptep_clear(vma->vm_mm, address, _pte);
-			folio_remove_rmap_pte(src, src_page, vma);
-			spin_unlock(ptl);
+			if (order == HPAGE_PMD_ORDER) {
+				/*
+				 * ptl mostly unnecessary, but preempt has to
+				 * be disabled to update the per-cpu stats
+				 * inside folio_remove_rmap_pte().
+				 */
+				spin_lock(ptl);
+				ptep_clear(vma->vm_mm, address, _pte);
+				folio_remove_rmap_pte(src, src_page, vma);
+				spin_unlock(ptl);
+			} else {
+				ptep_clear(vma->vm_mm, address, _pte);
+				folio_remove_rmap_pte(src, src_page, vma);
+			}
 			free_page_and_swap_cache(src_page);
 		}
 	}
@@ -807,7 +816,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
 static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
 		unsigned long address, spinlock_t *ptl,
-		struct list_head *compound_pagelist)
+		struct list_head *compound_pagelist, int order)
 {
 	unsigned int i;
 	int result = SCAN_SUCCEED;
@@ -815,7 +824,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 	/*
 	 * Copying pages' contents is subject to memory poison at any iteration.
 	 */
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
+	for (i = 0; i < (1 << order); i++) {
 		pte_t pteval = ptep_get(pte + i);
 		struct page *page = folio_page(folio, i);
 		unsigned long src_addr = address + i * PAGE_SIZE;
@@ -834,7 +843,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
 
 	if (likely(result == SCAN_SUCCEED))
 		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
-						    compound_pagelist);
+						    compound_pagelist, order);
 	else
 		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
 						 compound_pagelist, order);
@@ -1196,7 +1205,7 @@ static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long address,
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
 					   vma, address, pte_ptl,
-					   &compound_pagelist);
+					   &compound_pagelist, HPAGE_PMD_ORDER);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))
 		goto out_up_write;
@@ -1228,6 +1237,61 @@ static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long address,
 	return result;
 }
 
+/* Enter with mmap read lock */
+static int vma_collapse_anon_folio(struct mm_struct *mm, unsigned long address,
+		struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
+		struct folio *folio, int order)
+{
+	int result;
+	struct mmu_notifier_range range;
+	spinlock_t *pte_ptl;
+	LIST_HEAD(compound_pagelist);
+	pte_t *pte;
+	pte_t entry;
+	int nr_pages = folio_nr_pages(folio);
+
+	anon_vma_lock_write(vma->anon_vma);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
+				address + (PAGE_SIZE << order));
+	mmu_notifier_invalidate_range_start(&range);
+
+	pte = pte_offset_map_lock(mm, pmd, address, &pte_ptl);
+	if (pte)
+		result = __collapse_huge_page_isolate(vma, address, pte, cc,
+						      &compound_pagelist, order);
+	else
+		result = SCAN_PMD_NULL;
+
+	if (unlikely(result != SCAN_SUCCEED))
+		goto out_up_read;
+
+	anon_vma_unlock_write(vma->anon_vma);
+
+	__folio_mark_uptodate(folio);
+	entry = mk_pte(&folio->page, vma->vm_page_prot);
+	entry = maybe_mkwrite(entry, vma);
+
+	result = __collapse_huge_page_copy(pte, folio, pmd, *pmd,
+					   vma, address, pte_ptl,
+					   &compound_pagelist, order);
+	if (unlikely(result != SCAN_SUCCEED))
+		goto out_up_read;
+
+	folio_ref_add(folio, nr_pages - 1);
+	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
+	folio_add_lru_vma(folio, vma);
+	deferred_split_folio(folio, false);
+	set_ptes(mm, address, pte, entry, nr_pages);
+	update_mmu_cache_range(NULL, vma, address, pte, nr_pages);
+	pte_unmap_unlock(pte, pte_ptl);
+	mmu_notifier_invalidate_range_end(&range);
+	result = SCAN_SUCCEED;
+
+out_up_read:
+	mmap_read_unlock(mm);
+	return result;
+}
+
 static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 			      int referenced, int unmapped, int order,
 			      struct collapse_control *cc)
@@ -1276,6 +1340,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 
 	if (order == HPAGE_PMD_ORDER)
 		result = vma_collapse_anon_folio_pmd(mm, address, vma, cc, pmd, folio);
+	else
+		result = vma_collapse_anon_folio(mm, address, vma, cc, pmd, folio, order);
 
 	if (result == SCAN_SUCCEED)
 		folio = NULL;