From patchwork Mon Mar 6 09:22:55 2023
X-Patchwork-Submitter: Yin Fengwei
X-Patchwork-Id: 13160757
From: Yin Fengwei <fengwei.yin@intel.com>
To: linux-mm@kvack.org, akpm@linux-foundation.org, willy@infradead.org, mike.kravetz@oracle.com, sidhartha.kumar@oracle.com, naoya.horiguchi@nec.com, jane.chu@oracle.com, david@redhat.com
Cc: fengwei.yin@intel.com
Subject: [PATCH v3 1/5] rmap: move hugetlb try_to_unmap to dedicated function
Date: Mon, 6 Mar 2023 17:22:55 +0800
Message-Id: <20230306092259.3507807-2-fengwei.yin@intel.com>
In-Reply-To: <20230306092259.3507807-1-fengwei.yin@intel.com>
References: <20230306092259.3507807-1-fengwei.yin@intel.com>

This prepares for batched rmap updates of large folios. There is no need
to handle hugetlb inside the mapped-walk loop; handle hugetlb in a
dedicated function and bail out of the loop early.
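The core of the change is a control-flow simplification in try_to_unmap_one():
a hugetlb folio is now detected at the top of the mapped-walk loop, handed to a
dedicated helper, and the loop is left immediately. The snippet below is a
minimal, self-contained sketch of that flow only, under simplified assumptions;
the type and helper names (folio_model, unmap_hugetlb_once, unmap_one_pte,
try_to_unmap_one_model) are illustrative stand-ins, not kernel APIs and not the
code added by this patch.

/*
 * Stand-alone model of the new control flow only.  The types and helpers
 * are made up for illustration; they are not the kernel's real APIs.
 */
#include <stdbool.h>
#include <stdio.h>

struct folio_model {
        bool is_hugetlb;
        int nr_ptes;
};

/* Stand-in for the dedicated helper (try_to_unmap_one_hugetlb). */
static bool unmap_hugetlb_once(struct folio_model *f)
{
        printf("hugetlb: unmap the whole %d-pte mapping in one step\n",
               f->nr_ptes);
        return true;
}

/* Stand-in for the per-pte work done for normal folios. */
static bool unmap_one_pte(struct folio_model *f, int i)
{
        (void)f;
        printf("pte %d: unmap\n", i);
        return true;
}

/* After the patch: hugetlb is handled once, then the walk loop is left. */
static bool try_to_unmap_one_model(struct folio_model *f)
{
        bool ret = true;

        for (int i = 0; i < f->nr_ptes; i++) {
                if (f->is_hugetlb) {
                        ret = unmap_hugetlb_once(f);
                        break;          /* no need to loop for hugetlb */
                }
                ret = unmap_one_pte(f, i);
        }
        return ret;
}

int main(void)
{
        struct folio_model huge = { .is_hugetlb = true, .nr_ptes = 512 };
        struct folio_model normal = { .is_hugetlb = false, .nr_ptes = 4 };

        try_to_unmap_one_model(&huge);
        try_to_unmap_one_model(&normal);
        return 0;
}

Running the sketch prints one line for the hugetlb case and one line per PTE
for the normal case, mirroring the early bail-out the patch introduces.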
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/rmap.c | 200 +++++++++++++++++++++++++++++++++---------------------
 1 file changed, 121 insertions(+), 79 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index ba901c416785..508d141dacc5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1441,6 +1441,103 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
         munlock_vma_folio(folio, vma, compound);
 }
 
+static bool try_to_unmap_one_hugetlb(struct folio *folio,
+                struct vm_area_struct *vma, struct mmu_notifier_range range,
+                struct page_vma_mapped_walk pvmw, unsigned long address,
+                enum ttu_flags flags)
+{
+        struct mm_struct *mm = vma->vm_mm;
+        pte_t pteval;
+        bool ret = true, anon = folio_test_anon(folio);
+
+        /*
+         * The try_to_unmap() is only passed a hugetlb page
+         * in the case where the hugetlb page is poisoned.
+         */
+        VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
+        /*
+         * huge_pmd_unshare may unmap an entire PMD page.
+         * There is no way of knowing exactly which PMDs may
+         * be cached for this mm, so we must flush them all.
+         * start/end were already adjusted above to cover this
+         * range.
+         */
+        flush_cache_range(vma, range.start, range.end);
+
+        /*
+         * To call huge_pmd_unshare, i_mmap_rwsem must be
+         * held in write mode. Caller needs to explicitly
+         * do this outside rmap routines.
+         *
+         * We also must hold hugetlb vma_lock in write mode.
+         * Lock order dictates acquiring vma_lock BEFORE
+         * i_mmap_rwsem. We can only try lock here and fail
+         * if unsuccessful.
+         */
+        if (!anon) {
+                VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+                if (!hugetlb_vma_trylock_write(vma)) {
+                        ret = false;
+                        goto out;
+                }
+                if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
+                        hugetlb_vma_unlock_write(vma);
+                        flush_tlb_range(vma,
+                                range.start, range.end);
+                        mmu_notifier_invalidate_range(mm,
+                                range.start, range.end);
+                        /*
+                         * The ref count of the PMD page was
+                         * dropped which is part of the way map
+                         * counting is done for shared PMDs.
+                         * Return 'true' here. When there is
+                         * no other sharing, huge_pmd_unshare
+                         * returns false and we will unmap the
+                         * actual page and drop map count
+                         * to zero.
+                         */
+                        goto out;
+                }
+                hugetlb_vma_unlock_write(vma);
+        }
+        pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
+
+        /*
+         * Now the pte is cleared. If this pte was uffd-wp armed,
+         * we may want to replace a none pte with a marker pte if
+         * it's file-backed, so we don't lose the tracking info.
+         */
+        pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+
+        /* Set the dirty flag on the folio now the pte is gone. */
+        if (pte_dirty(pteval))
+                folio_mark_dirty(folio);
+
+        /* Update high watermark before we lower rss */
+        update_hiwater_rss(mm);
+
+        /* Poisoned hugetlb folio with TTU_HWPOISON always cleared in flags */
+        pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
+        set_huge_pte_at(mm, address, pvmw.pte, pteval);
+        hugetlb_count_sub(folio_nr_pages(folio), mm);
+
+        /*
+         * No need to call mmu_notifier_invalidate_range() it has be
+         * done above for all cases requiring it to happen under page
+         * table lock before mmu_notifier_invalidate_range_end()
+         *
+         * See Documentation/mm/mmu_notifier.rst
+         */
+        page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));
+        /* No VM_LOCKED set in vma->vm_flags for hugetlb. So not
+         * necessary to call mlock_drain_local().
+         */
+        folio_put(folio);
+
+out:
+        return ret;
+}
+
 /*
  * @arg: enum ttu_flags will be passed to this argument
  */
@@ -1504,86 +1601,37 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
                         break;
                 }
 
+                address = pvmw.address;
+                if (folio_test_hugetlb(folio)) {
+                        ret = try_to_unmap_one_hugetlb(folio, vma, range,
+                                                pvmw, address, flags);
+
+                        /* no need to loop for hugetlb */
+                        page_vma_mapped_walk_done(&pvmw);
+                        break;
+                }
+
                 subpage = folio_page(folio,
                                 pte_pfn(*pvmw.pte) - folio_pfn(folio));
-                address = pvmw.address;
                 anon_exclusive = folio_test_anon(folio) &&
                                  PageAnonExclusive(subpage);
 
-                if (folio_test_hugetlb(folio)) {
-                        bool anon = folio_test_anon(folio);
-
-                        /*
-                         * The try_to_unmap() is only passed a hugetlb page
-                         * in the case where the hugetlb page is poisoned.
-                         */
-                        VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
+                flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+                /* Nuke the page table entry. */
+                if (should_defer_flush(mm, flags)) {
                         /*
-                         * huge_pmd_unshare may unmap an entire PMD page.
-                         * There is no way of knowing exactly which PMDs may
-                         * be cached for this mm, so we must flush them all.
-                         * start/end were already adjusted above to cover this
-                         * range.
+                         * We clear the PTE but do not flush so potentially
+                         * a remote CPU could still be writing to the folio.
+                         * If the entry was previously clean then the
+                         * architecture must guarantee that a clear->dirty
+                         * transition on a cached TLB entry is written through
+                         * and traps if the PTE is unmapped.
                          */
-                        flush_cache_range(vma, range.start, range.end);
+                        pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-                        /*
-                         * To call huge_pmd_unshare, i_mmap_rwsem must be
-                         * held in write mode. Caller needs to explicitly
-                         * do this outside rmap routines.
-                         *
-                         * We also must hold hugetlb vma_lock in write mode.
-                         * Lock order dictates acquiring vma_lock BEFORE
-                         * i_mmap_rwsem. We can only try lock here and fail
-                         * if unsuccessful.
-                         */
-                        if (!anon) {
-                                VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
-                                if (!hugetlb_vma_trylock_write(vma)) {
-                                        page_vma_mapped_walk_done(&pvmw);
-                                        ret = false;
-                                        break;
-                                }
-                                if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
-                                        hugetlb_vma_unlock_write(vma);
-                                        flush_tlb_range(vma,
-                                                range.start, range.end);
-                                        mmu_notifier_invalidate_range(mm,
-                                                range.start, range.end);
-                                        /*
-                                         * The ref count of the PMD page was
-                                         * dropped which is part of the way map
-                                         * counting is done for shared PMDs.
-                                         * Return 'true' here. When there is
-                                         * no other sharing, huge_pmd_unshare
-                                         * returns false and we will unmap the
-                                         * actual page and drop map count
-                                         * to zero.
-                                         */
-                                        page_vma_mapped_walk_done(&pvmw);
-                                        break;
-                                }
-                                hugetlb_vma_unlock_write(vma);
-                        }
-                        pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
+                        set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
                 } else {
-                        flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
-                        /* Nuke the page table entry. */
-                        if (should_defer_flush(mm, flags)) {
-                                /*
-                                 * We clear the PTE but do not flush so potentially
-                                 * a remote CPU could still be writing to the folio.
-                                 * If the entry was previously clean then the
-                                 * architecture must guarantee that a clear->dirty
-                                 * transition on a cached TLB entry is written through
-                                 * and traps if the PTE is unmapped.
-                                 */
-                                pteval = ptep_get_and_clear(mm, address, pvmw.pte);
-
-                                set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
-                        } else {
-                                pteval = ptep_clear_flush(vma, address, pvmw.pte);
-                        }
+                        pteval = ptep_clear_flush(vma, address, pvmw.pte);
                 }
 
                 /*
@@ -1602,14 +1650,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 
                 if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
                         pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
-                        if (folio_test_hugetlb(folio)) {
-                                hugetlb_count_sub(folio_nr_pages(folio), mm);
-                                set_huge_pte_at(mm, address, pvmw.pte, pteval);
-                        } else {
-                                dec_mm_counter(mm, mm_counter(&folio->page));
-                                set_pte_at(mm, address, pvmw.pte, pteval);
-                        }
-
+                        dec_mm_counter(mm, mm_counter(&folio->page));
+                        set_pte_at(mm, address, pvmw.pte, pteval);
                 } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
                         /*
                          * The guest indicated that the page content is of no