From patchwork Sun Feb 6 21:30:38 2022
X-Patchwork-Submitter: Hugh Dickins
X-Patchwork-Id: 12736734
Date: Sun, 6 Feb 2022 13:30:38 -0800 (PST)
From: Hugh Dickins <hughd@google.com>
To: Andrew Morton
Cc: Michal Hocko, Vlastimil Babka, "Kirill A. Shutemov", Matthew Wilcox,
 David Hildenbrand, Alistair Popple, Johannes Weiner, Rik van Riel,
 Suren Baghdasaryan, Yu Zhao, Greg Thelen, Shakeel Butt,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH 01/13] mm/munlock: delete page_mlock() and all its works In-Reply-To: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com> Message-ID: <5ed1f01-3e7e-7e26-cc1-2b7a574e2147@google.com> References: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com> MIME-Version: 1.0 X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 3984240006 X-Stat-Signature: 9p8kt86hy3cmocq53rweg1sxsu1ghnpt Authentication-Results: imf07.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=WNPCX3Jv; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf07.hostedemail.com: domain of hughd@google.com designates 209.85.160.177 as permitted sender) smtp.mailfrom=hughd@google.com X-Rspam-User: nil X-HE-Tag: 1644183042-932164 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: We have recommended some applications to mlock their userspace, but that turns out to be counter-productive: when many processes mlock the same file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it is needed for write, to remove any vma mapping that file from rmap's tree; but hogged for read by those with mlocks calling page_mlock() (formerly known as try_to_munlock()) on *each* page mapped from the file (the purpose being to find out whether another process has the page mlocked, so therefore it should not be unmlocked yet). Several optimizations have been made in the past: one is to skip page_mlock() when mapcount tells that nothing else has this page mapped; but that doesn't help at all when others do have it mapped. This time around, I initially intended to add a preliminary search of the rmap tree for overlapping VM_LOCKED ranges; but that gets messy with locking order, when in doubt whether a page is actually present; and risks adding even more contention on the i_mmap_rwsem. A solution would be much easier, if only there were space in struct page for an mlock_count... but actually, most of the time, there is space for it - an mlocked page spends most of its life on an unevictable LRU, but since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has been redundant. Let's try to reuse its page->lru. But leave that until a later patch: in this patch, clear the ground by removing page_mlock(), and all the infrastructure that has gathered around it - which mostly hinders understanding, and will make reviewing new additions harder. Don't mind those old comments about THPs, they date from before 4.5's refcounting rework: splitting is not a risk here. Just keep a minimal version of munlock_vma_page(), as reminder of what it should attend to (in particular, the odd way PGSTRANDED is counted out of PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range(). Move unchanged __mlock_posix_error_return() out of the way, down to above its caller: this series then makes no further change after mlock_fixup(). 
Signed-off-by: Hugh Dickins --- include/linux/rmap.h | 6 - mm/internal.h | 2 +- mm/mlock.c | 377 +++---------------------------------------- mm/rmap.c | 80 --------- 4 files changed, 25 insertions(+), 440 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index e704b1a4c06c..dc48aa8c2c94 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -237,12 +237,6 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *); */ int folio_mkclean(struct folio *); -/* - * called in munlock()/munmap() path to check for other vmas holding - * the page mlocked. - */ -void page_mlock(struct page *page); - void remove_migration_ptes(struct page *old, struct page *new, bool locked); /* diff --git a/mm/internal.h b/mm/internal.h index d80300392a19..e48c486d5ddf 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -409,7 +409,7 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma) * must be called with vma's mmap_lock held for read or write, and page locked. */ extern void mlock_vma_page(struct page *page); -extern unsigned int munlock_vma_page(struct page *page); +extern void munlock_vma_page(struct page *page); extern int mlock_future_check(struct mm_struct *mm, unsigned long flags, unsigned long len); diff --git a/mm/mlock.c b/mm/mlock.c index 8f584eddd305..544c18ce2c58 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -46,12 +46,6 @@ EXPORT_SYMBOL(can_do_mlock); * be placed on the LRU "unevictable" list, rather than the [in]active lists. * The unevictable list is an LRU sibling list to the [in]active lists. * PageUnevictable is set to indicate the unevictable state. - * - * When lazy mlocking via vmscan, it is important to ensure that the - * vma's VM_LOCKED status is not concurrently being modified, otherwise we - * may have mlocked a page that is being munlocked. So lazy mlock must take - * the mmap_lock for read, and verify that the vma really is locked - * (see mm/rmap.c). */ /* @@ -106,299 +100,28 @@ void mlock_vma_page(struct page *page) } } -/* - * Finish munlock after successful page isolation - * - * Page must be locked. This is a wrapper for page_mlock() - * and putback_lru_page() with munlock accounting. - */ -static void __munlock_isolated_page(struct page *page) -{ - /* - * Optimization: if the page was mapped just once, that's our mapping - * and we don't need to check all the other vmas. - */ - if (page_mapcount(page) > 1) - page_mlock(page); - - /* Did try_to_unlock() succeed or punt? */ - if (!PageMlocked(page)) - count_vm_events(UNEVICTABLE_PGMUNLOCKED, thp_nr_pages(page)); - - putback_lru_page(page); -} - -/* - * Accounting for page isolation fail during munlock - * - * Performs accounting when page isolation fails in munlock. There is nothing - * else to do because it means some other task has already removed the page - * from the LRU. putback_lru_page() will take care of removing the page from - * the unevictable list, if necessary. vmscan [page_referenced()] will move - * the page back to the unevictable list if some other vma has it mlocked. 
- */ -static void __munlock_isolation_failed(struct page *page) -{ - int nr_pages = thp_nr_pages(page); - - if (PageUnevictable(page)) - __count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages); - else - __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages); -} - /** * munlock_vma_page - munlock a vma page * @page: page to be unlocked, either a normal page or THP page head - * - * returns the size of the page as a page mask (0 for normal page, - * HPAGE_PMD_NR - 1 for THP head page) - * - * called from munlock()/munmap() path with page supposedly on the LRU. - * When we munlock a page, because the vma where we found the page is being - * munlock()ed or munmap()ed, we want to check whether other vmas hold the - * page locked so that we can leave it on the unevictable lru list and not - * bother vmscan with it. However, to walk the page's rmap list in - * page_mlock() we must isolate the page from the LRU. If some other - * task has removed the page from the LRU, we won't be able to do that. - * So we clear the PageMlocked as we might not get another chance. If we - * can't isolate the page, we leave it for putback_lru_page() and vmscan - * [page_referenced()/try_to_unmap()] to deal with. */ -unsigned int munlock_vma_page(struct page *page) +void munlock_vma_page(struct page *page) { - int nr_pages; - - /* For page_mlock() and to serialize with page migration */ + /* Serialize with page migration */ BUG_ON(!PageLocked(page)); - VM_BUG_ON_PAGE(PageTail(page), page); - - if (!TestClearPageMlocked(page)) { - /* Potentially, PTE-mapped THP: do not skip the rest PTEs */ - return 0; - } - - nr_pages = thp_nr_pages(page); - mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); - - if (!isolate_lru_page(page)) - __munlock_isolated_page(page); - else - __munlock_isolation_failed(page); - - return nr_pages - 1; -} - -/* - * convert get_user_pages() return value to posix mlock() error - */ -static int __mlock_posix_error_return(long retval) -{ - if (retval == -EFAULT) - retval = -ENOMEM; - else if (retval == -ENOMEM) - retval = -EAGAIN; - return retval; -} - -/* - * Prepare page for fast batched LRU putback via putback_lru_evictable_pagevec() - * - * The fast path is available only for evictable pages with single mapping. - * Then we can bypass the per-cpu pvec and get better performance. - * when mapcount > 1 we need page_mlock() which can fail. - * when !page_evictable(), we need the full redo logic of putback_lru_page to - * avoid leaving evictable page in unevictable list. - * - * In case of success, @page is added to @pvec and @pgrescued is incremented - * in case that the page was previously unevictable. @page is also unlocked. - */ -static bool __putback_lru_fast_prepare(struct page *page, struct pagevec *pvec, - int *pgrescued) -{ - VM_BUG_ON_PAGE(PageLRU(page), page); - VM_BUG_ON_PAGE(!PageLocked(page), page); - - if (page_mapcount(page) <= 1 && page_evictable(page)) { - pagevec_add(pvec, page); - if (TestClearPageUnevictable(page)) - (*pgrescued)++; - unlock_page(page); - return true; - } - - return false; -} - -/* - * Putback multiple evictable pages to the LRU - * - * Batched putback of evictable pages that bypasses the per-cpu pvec. Some of - * the pages might have meanwhile become unevictable but that is OK. 
- */ -static void __putback_lru_fast(struct pagevec *pvec, int pgrescued) -{ - count_vm_events(UNEVICTABLE_PGMUNLOCKED, pagevec_count(pvec)); - /* - *__pagevec_lru_add() calls release_pages() so we don't call - * put_page() explicitly - */ - __pagevec_lru_add(pvec); - count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued); -} -/* - * Munlock a batch of pages from the same zone - * - * The work is split to two main phases. First phase clears the Mlocked flag - * and attempts to isolate the pages, all under a single zone lru lock. - * The second phase finishes the munlock only for pages where isolation - * succeeded. - * - * Note that the pagevec may be modified during the process. - */ -static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) -{ - int i; - int nr = pagevec_count(pvec); - int delta_munlocked = -nr; - struct pagevec pvec_putback; - struct lruvec *lruvec = NULL; - int pgrescued = 0; - - pagevec_init(&pvec_putback); - - /* Phase 1: page isolation */ - for (i = 0; i < nr; i++) { - struct page *page = pvec->pages[i]; - struct folio *folio = page_folio(page); - - if (TestClearPageMlocked(page)) { - /* - * We already have pin from follow_page_mask() - * so we can spare the get_page() here. - */ - if (TestClearPageLRU(page)) { - lruvec = folio_lruvec_relock_irq(folio, lruvec); - del_page_from_lru_list(page, lruvec); - continue; - } else - __munlock_isolation_failed(page); - } else { - delta_munlocked++; - } + VM_BUG_ON_PAGE(PageTail(page), page); - /* - * We won't be munlocking this page in the next phase - * but we still need to release the follow_page_mask() - * pin. We cannot do it under lru_lock however. If it's - * the last pin, __page_cache_release() would deadlock. - */ - pagevec_add(&pvec_putback, pvec->pages[i]); - pvec->pages[i] = NULL; - } - if (lruvec) { - __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked); - unlock_page_lruvec_irq(lruvec); - } else if (delta_munlocked) { - mod_zone_page_state(zone, NR_MLOCK, delta_munlocked); - } + if (TestClearPageMlocked(page)) { + int nr_pages = thp_nr_pages(page); - /* Now we can release pins of pages that we are not munlocking */ - pagevec_release(&pvec_putback); - - /* Phase 2: page munlock */ - for (i = 0; i < nr; i++) { - struct page *page = pvec->pages[i]; - - if (page) { - lock_page(page); - if (!__putback_lru_fast_prepare(page, &pvec_putback, - &pgrescued)) { - /* - * Slow path. We don't want to lose the last - * pin before unlock_page() - */ - get_page(page); /* for putback_lru_page() */ - __munlock_isolated_page(page); - unlock_page(page); - put_page(page); /* from follow_page_mask() */ - } + mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); + if (!isolate_lru_page(page)) { + putback_lru_page(page); + count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages); + } else if (PageUnevictable(page)) { + count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages); } } - - /* - * Phase 3: page putback for pages that qualified for the fast path - * This will also call put_page() to return pin from follow_page_mask() - */ - if (pagevec_count(&pvec_putback)) - __putback_lru_fast(&pvec_putback, pgrescued); -} - -/* - * Fill up pagevec for __munlock_pagevec using pte walk - * - * The function expects that the struct page corresponding to @start address is - * a non-TPH page already pinned and in the @pvec, and that it belongs to @zone. - * - * The rest of @pvec is filled by subsequent pages within the same pmd and same - * zone, as long as the pte's are present and vm_normal_page() succeeds. These - * pages also get pinned. 
- * - * Returns the address of the next page that should be scanned. This equals - * @start + PAGE_SIZE when no page could be added by the pte walk. - */ -static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, - struct vm_area_struct *vma, struct zone *zone, - unsigned long start, unsigned long end) -{ - pte_t *pte; - spinlock_t *ptl; - - /* - * Initialize pte walk starting at the already pinned page where we - * are sure that there is a pte, as it was pinned under the same - * mmap_lock write op. - */ - pte = get_locked_pte(vma->vm_mm, start, &ptl); - /* Make sure we do not cross the page table boundary */ - end = pgd_addr_end(start, end); - end = p4d_addr_end(start, end); - end = pud_addr_end(start, end); - end = pmd_addr_end(start, end); - - /* The page next to the pinned page is the first we will try to get */ - start += PAGE_SIZE; - while (start < end) { - struct page *page = NULL; - pte++; - if (pte_present(*pte)) - page = vm_normal_page(vma, start, *pte); - /* - * Break if page could not be obtained or the page's node+zone does not - * match - */ - if (!page || page_zone(page) != zone) - break; - - /* - * Do not use pagevec for PTE-mapped THP, - * munlock_vma_pages_range() will handle them. - */ - if (PageTransCompound(page)) - break; - - get_page(page); - /* - * Increase the address that will be returned *before* the - * eventual break due to pvec becoming full by adding the page - */ - start += PAGE_SIZE; - if (pagevec_add(pvec, page) == 0) - break; - } - pte_unmap_unlock(pte, ptl); - return start; } /* @@ -413,75 +136,11 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, * * Returns with VM_LOCKED cleared. Callers must be prepared to * deal with this. - * - * We don't save and restore VM_LOCKED here because pages are - * still on lru. In unmap path, pages might be scanned by reclaim - * and re-mlocked by page_mlock/try_to_unmap before we unmap and - * free them. This will result in freeing mlocked pages. */ void munlock_vma_pages_range(struct vm_area_struct *vma, unsigned long start, unsigned long end) { - vma->vm_flags &= VM_LOCKED_CLEAR_MASK; - - while (start < end) { - struct page *page; - unsigned int page_mask = 0; - unsigned long page_increm; - struct pagevec pvec; - struct zone *zone; - - pagevec_init(&pvec); - /* - * Although FOLL_DUMP is intended for get_dump_page(), - * it just so happens that its special treatment of the - * ZERO_PAGE (returning an error instead of doing get_page) - * suits munlock very well (and if somehow an abnormal page - * has sneaked into the range, we won't oops here: great). - */ - page = follow_page(vma, start, FOLL_GET | FOLL_DUMP); - - if (page && !IS_ERR(page)) { - if (PageTransTail(page)) { - VM_BUG_ON_PAGE(PageMlocked(page), page); - put_page(page); /* follow_page_mask() */ - } else if (PageTransHuge(page)) { - lock_page(page); - /* - * Any THP page found by follow_page_mask() may - * have gotten split before reaching - * munlock_vma_page(), so we need to compute - * the page_mask here instead. - */ - page_mask = munlock_vma_page(page); - unlock_page(page); - put_page(page); /* follow_page_mask() */ - } else { - /* - * Non-huge pages are handled in batches via - * pagevec. The pin from follow_page_mask() - * prevents them from collapsing by THP. - */ - pagevec_add(&pvec, page); - zone = page_zone(page); - - /* - * Try to fill the rest of pagevec using fast - * pte walk. This will also update start to - * the next page to process. Then munlock the - * pagevec. 
- */ - start = __munlock_pagevec_fill(&pvec, vma, - zone, start, end); - __munlock_pagevec(&pvec, zone); - goto next; - } - } - page_increm = 1 + page_mask; - start += page_increm * PAGE_SIZE; -next: - cond_resched(); - } + /* Reimplementation to follow in later commit */ } /* @@ -645,6 +304,18 @@ static unsigned long count_mm_mlocked_page_nr(struct mm_struct *mm, return count >> PAGE_SHIFT; } +/* + * convert get_user_pages() return value to posix mlock() error + */ +static int __mlock_posix_error_return(long retval) +{ + if (retval == -EFAULT) + retval = -ENOMEM; + else if (retval == -ENOMEM) + retval = -EAGAIN; + return retval; +} + static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t flags) { unsigned long locked; diff --git a/mm/rmap.c b/mm/rmap.c index 6a1e8c7f6213..7ce7f1946cff 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1996,76 +1996,6 @@ void try_to_migrate(struct page *page, enum ttu_flags flags) rmap_walk(page, &rwc); } -/* - * Walks the vma's mapping a page and mlocks the page if any locked vma's are - * found. Once one is found the page is locked and the scan can be terminated. - */ -static bool page_mlock_one(struct page *page, struct vm_area_struct *vma, - unsigned long address, void *unused) -{ - struct page_vma_mapped_walk pvmw = { - .page = page, - .vma = vma, - .address = address, - }; - - /* An un-locked vma doesn't have any pages to lock, continue the scan */ - if (!(vma->vm_flags & VM_LOCKED)) - return true; - - while (page_vma_mapped_walk(&pvmw)) { - /* - * Need to recheck under the ptl to serialise with - * __munlock_pagevec_fill() after VM_LOCKED is cleared in - * munlock_vma_pages_range(). - */ - if (vma->vm_flags & VM_LOCKED) { - /* - * PTE-mapped THP are never marked as mlocked; but - * this function is never called on a DoubleMap THP, - * nor on an Anon THP (which may still be PTE-mapped - * after DoubleMap was cleared). - */ - mlock_vma_page(page); - /* - * No need to scan further once the page is marked - * as mlocked. - */ - page_vma_mapped_walk_done(&pvmw); - return false; - } - } - - return true; -} - -/** - * page_mlock - try to mlock a page - * @page: the page to be mlocked - * - * Called from munlock code. Checks all of the VMAs mapping the page and mlocks - * the page if any are found. The page will be returned with PG_mlocked cleared - * if it is not mapped by any locked vmas. - */ -void page_mlock(struct page *page) -{ - struct rmap_walk_control rwc = { - .rmap_one = page_mlock_one, - .done = page_not_mapped, - .anon_lock = page_lock_anon_vma_read, - - }; - - VM_BUG_ON_PAGE(!PageLocked(page) || PageLRU(page), page); - VM_BUG_ON_PAGE(PageCompound(page) && PageDoubleMap(page), page); - - /* Anon THP are only marked as mlocked when singly mapped */ - if (PageTransCompound(page) && PageAnon(page)) - return; - - rmap_walk(page, &rwc); -} - #ifdef CONFIG_DEVICE_PRIVATE struct make_exclusive_args { struct mm_struct *mm; @@ -2291,11 +2221,6 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page, * * Find all the mappings of a page using the mapping pointer and the vma chains * contained in the anon_vma struct it points to. - * - * When called from page_mlock(), the mmap_lock of the mm containing the vma - * where the page was found will be held for write. So, we won't recheck - * vm_flags for that VMA. That should be OK, because that vma shouldn't be - * LOCKED. 
*/ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc, bool locked) @@ -2344,11 +2269,6 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc, * * Find all the mappings of a page using the mapping pointer and the vma chains * contained in the address_space struct it points to. - * - * When called from page_mlock(), the mmap_lock of the mm containing the vma - * where the page was found will be held for write. So, we won't recheck - * vm_flags for that VMA. That should be OK, because that vma shouldn't be - * LOCKED. */ static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc, bool locked)

From patchwork Sun Feb 6 21:32:15 2022
X-Patchwork-Submitter: Hugh Dickins
X-Patchwork-Id: 12736735
Date: Sun, 6 Feb 2022 13:32:15 -0800 (PST)
From: Hugh Dickins <hughd@google.com>
To: Andrew Morton
Cc: Michal Hocko, Vlastimil Babka, "Kirill A. Shutemov", Matthew Wilcox,
 David Hildenbrand, Alistair Popple, Johannes Weiner, Rik van Riel,
 Suren Baghdasaryan, Yu Zhao, Greg Thelen, Shakeel Butt,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 02/13] mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE
In-Reply-To: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com>
Message-ID: <2b5eee76-183f-bd97-2e9d-f5ff8df63db@google.com>
References: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com>

If counting page mlocks, we must not double-count: follow_page_pte() can
tell if a page has already been Mlocked or not, but cannot tell if a pte
has already been counted or not: that will have to be done when the pte
is mapped in (which lru_cache_add_inactive_or_unevictable() already
tracks for new anon pages, but there's no such tracking yet for others).

Delete all the FOLL_MLOCK code - faulting in the missing pages will do
all that is necessary, without special mlock_vma_page() calls from here.

But then FOLL_POPULATE turns out to serve no purpose - it was there so
that its absence would tell faultin_page() not to fault in the page when
setting up VM_LOCKONFAULT areas; but if there's no special work needed
here for mlock, then there's no work at all here for VM_LOCKONFAULT.

Have I got that right?  I've not looked into the history, but see that
FOLL_POPULATE goes back before VM_LOCKONFAULT: did it serve a different
purpose before?  Ah, yes, it was used to skip the old stack guard page.

And is it intentional that COW is not broken on existing pages when
setting up a VM_LOCKONFAULT area?  I can see that being argued either
way, and have no reason to disagree with current behaviour.
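For context, not part of the patch: VM_LOCKONFAULT is the kernel side of
mlock2() with MLOCK_ONFAULT, which asks for pages to be locked as they
are faulted in rather than populated up front - which is why
populate_vma_page_range() can now simply return for such vmas.  A minimal
userspace illustration (assuming Linux with glibc 2.27+ for the mlock2()
wrapper):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 16 * 4096;
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		/* Nothing is faulted in here: the vma only gains the
		 * lock-on-fault attribute (VM_LOCKONFAULT in the kernel). */
		if (mlock2(p, len, MLOCK_ONFAULT)) {
			perror("mlock2");
			return 1;
		}
		memset(p, 0, len);	/* pages are locked as these faults occur */
		return 0;
	}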
Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka --- include/linux/mm.h | 2 -- mm/gup.c | 43 ++++++++----------------------------------- mm/huge_memory.c | 33 --------------------------------- 3 files changed, 8 insertions(+), 70 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 213cc569b192..74ee50c2033b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2925,13 +2925,11 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, #define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */ #define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO * and return without waiting upon it */ -#define FOLL_POPULATE 0x40 /* fault in pages (with FOLL_MLOCK) */ #define FOLL_NOFAULT 0x80 /* do not fault in pages */ #define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */ #define FOLL_NUMA 0x200 /* force NUMA hinting page fault */ #define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */ #define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */ -#define FOLL_MLOCK 0x1000 /* lock present pages */ #define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */ #define FOLL_COW 0x4000 /* internal GUP flag */ #define FOLL_ANON 0x8000 /* don't do file mappings */ diff --git a/mm/gup.c b/mm/gup.c index f0af462ac1e2..2076902344d8 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -572,32 +572,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, */ mark_page_accessed(page); } - if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) { - /* Do not mlock pte-mapped THP */ - if (PageTransCompound(page)) - goto out; - - /* - * The preliminary mapping check is mainly to avoid the - * pointless overhead of lock_page on the ZERO_PAGE - * which might bounce very badly if there is contention. - * - * If the page is already locked, we don't need to - * handle it now - vmscan will handle it later if and - * when it attempts to reclaim the page. - */ - if (page->mapping && trylock_page(page)) { - lru_add_drain(); /* push cached pages to LRU */ - /* - * Because we lock page here, and migration is - * blocked by the pte's page reference, and we - * know the page is still mapped, we don't even - * need to check for file-cache page truncation. - */ - mlock_vma_page(page); - unlock_page(page); - } - } out: pte_unmap_unlock(ptep, ptl); return page; @@ -920,9 +894,6 @@ static int faultin_page(struct vm_area_struct *vma, unsigned int fault_flags = 0; vm_fault_t ret; - /* mlock all present pages, but do not fault in new pages */ - if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK) - return -ENOENT; if (*flags & FOLL_NOFAULT) return -EFAULT; if (*flags & FOLL_WRITE) @@ -1173,8 +1144,6 @@ static long __get_user_pages(struct mm_struct *mm, case -ENOMEM: case -EHWPOISON: goto out; - case -ENOENT: - goto next_page; } BUG(); } else if (PTR_ERR(page) == -EEXIST) { @@ -1472,9 +1441,14 @@ long populate_vma_page_range(struct vm_area_struct *vma, VM_BUG_ON_VMA(end > vma->vm_end, vma); mmap_assert_locked(mm); - gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK; + /* + * Rightly or wrongly, the VM_LOCKONFAULT case has never used + * faultin_page() to break COW, so it has no work to do here. 
+ */ if (vma->vm_flags & VM_LOCKONFAULT) - gup_flags &= ~FOLL_POPULATE; + return nr_pages; + + gup_flags = FOLL_TOUCH; /* * We want to touch writable mappings with a write fault in order * to break COW, except for shared mappings because these don't COW @@ -1541,10 +1515,9 @@ long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start, * in the page table. * FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit * a poisoned page. - * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT. * !FOLL_FORCE: Require proper access permissions. */ - gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON; + gup_flags = FOLL_TOUCH | FOLL_HWPOISON; if (write) gup_flags |= FOLL_WRITE; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 406a3c28c026..9a34b85ebcf8 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1380,39 +1380,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, if (flags & FOLL_TOUCH) touch_pmd(vma, addr, pmd, flags); - if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) { - /* - * We don't mlock() pte-mapped THPs. This way we can avoid - * leaking mlocked pages into non-VM_LOCKED VMAs. - * - * For anon THP: - * - * In most cases the pmd is the only mapping of the page as we - * break COW for the mlock() -- see gup_flags |= FOLL_WRITE for - * writable private mappings in populate_vma_page_range(). - * - * The only scenario when we have the page shared here is if we - * mlocking read-only mapping shared over fork(). We skip - * mlocking such pages. - * - * For file THP: - * - * We can expect PageDoubleMap() to be stable under page lock: - * for file pages we set it in page_add_file_rmap(), which - * requires page to be locked. - */ - - if (PageAnon(page) && compound_mapcount(page) != 1) - goto skip_mlock; - if (PageDoubleMap(page) || !page->mapping) - goto skip_mlock; - if (!trylock_page(page)) - goto skip_mlock; - if (page->mapping && !PageDoubleMap(page)) - mlock_vma_page(page); - unlock_page(page); - } -skip_mlock: page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT; VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);

From patchwork Sun Feb 6 21:34:51 2022
X-Patchwork-Submitter: Hugh Dickins
X-Patchwork-Id: 12736736
Date: Sun, 6 Feb 2022 13:34:51 -0800 (PST)
From: Hugh Dickins <hughd@google.com>
To: Andrew Morton
Cc: Michal Hocko, Vlastimil Babka, "Kirill A. Shutemov", Matthew Wilcox,
 David Hildenbrand, Alistair Popple, Johannes Weiner, Rik van Riel,
 Suren Baghdasaryan, Yu Zhao, Greg Thelen, Shakeel Butt,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 03/13] mm/munlock: delete munlock_vma_pages_all(), allow oomreap
In-Reply-To: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com>
Message-ID: <8dddb3d4-361-da5-538-3f3ae1b326b@google.com>
References: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com>

munlock_vma_pages_range() will still be required, when munlocking but
not munmapping a set of pages; but when unmapping a pte, the mlock count
will be maintained in much the same way as it will be maintained when
mapping in the pte - as sketched below.
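The symmetry can be pictured like so - illustrative C with invented
names and simplified types (on_pte_mapped()/on_pte_unmapped() and these
structs are not kernel code; the real hooks arrive later in the series):

	struct vma_like  { unsigned long vm_flags; };
	struct page_like { unsigned int mlock_count; };

	#define VM_LOCKED_FLAG 0x1UL	/* stand-in for the kernel's VM_LOCKED */

	/* Counted as the pte is mapped in... */
	static void on_pte_mapped(struct vma_like *vma, struct page_like *page)
	{
		if (vma->vm_flags & VM_LOCKED_FLAG)
			page->mlock_count++;
	}

	/* ...and uncounted as the pte is unmapped: so munmap and exit need
	 * no per-page rmap walk, and no i_mmap_rwsem traffic, to get the
	 * count right. */
	static void on_pte_unmapped(struct vma_like *vma, struct page_like *page)
	{
		if (vma->vm_flags & VM_LOCKED_FLAG)
			page->mlock_count--;
	}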
That removes the need for munlock_vma_pages_all() on mlocked vmas when
munmapping or exiting: eliminating the catastrophic contention on
i_mmap_rwsem, and the need for page lock on the pages.

There is still a need to update locked_vm accounting according to the
munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped().
exit_mmap() does not need locked_vm updates, so delete unlock_range().

And wasn't I the one who forbade the OOM reaper to attack mlocked vmas,
because of the uncertainty in blocking on all those page locks?  No fear
of that now, so permit the OOM reaper on mlocked vmas.

Signed-off-by: Hugh Dickins
Acked-by: Vlastimil Babka
---
 mm/internal.h | 16 ++--------------
 mm/madvise.c  |  5 +++++
 mm/mlock.c    |  4 ++--
 mm/mmap.c     | 32 ++------------------------------
 mm/oom_kill.c |  2 +-
 5 files changed, 12 insertions(+), 47 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h index e48c486d5ddf..f235aa92e564 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -71,11 +71,6 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, unsigned long floor, unsigned long ceiling); void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte); -static inline bool can_madv_lru_vma(struct vm_area_struct *vma) -{ - return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP)); -} - struct zap_details; void unmap_page_range(struct mmu_gather *tlb, struct vm_area_struct *vma, @@ -398,12 +393,8 @@ extern long populate_vma_page_range(struct vm_area_struct *vma, extern long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, bool write, int *locked); -extern void munlock_vma_pages_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end); -static inline void munlock_vma_pages_all(struct vm_area_struct *vma) -{ - munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end); -} +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags, + unsigned long len); /* * must be called with vma's mmap_lock held for read or write, and page locked. @@ -411,9 +402,6 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma) extern void mlock_vma_page(struct page *page); extern void munlock_vma_page(struct page *page); -extern int mlock_future_check(struct mm_struct *mm, unsigned long flags, - unsigned long len); - /* * Clear the page's PageMlocked(). This can be useful in a situation where * we want to unconditionally remove a page from the pagecache -- e.g., diff --git a/mm/madvise.c b/mm/madvise.c index 5604064df464..ae35d72627ef 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -530,6 +530,11 @@ static void madvise_cold_page_range(struct mmu_gather *tlb, tlb_end_vma(tlb, vma); } +static inline bool can_madv_lru_vma(struct vm_area_struct *vma) +{ + return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP)); +} + static long madvise_cold(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start_addr, unsigned long end_addr) diff --git a/mm/mlock.c b/mm/mlock.c index 544c18ce2c58..d148da934fe9 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -137,8 +137,8 @@ void munlock_vma_page(struct page *page) * Returns with VM_LOCKED cleared. Callers must be prepared to * deal with this.
*/ -void munlock_vma_pages_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end) +static void munlock_vma_pages_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end) { /* Reimplementation to follow in later commit */ } diff --git a/mm/mmap.c b/mm/mmap.c index 1e8fdb0b51ed..64b5985b5295 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2674,6 +2674,8 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma, vma->vm_prev = NULL; do { vma_rb_erase(vma, &mm->mm_rb); + if (vma->vm_flags & VM_LOCKED) + mm->locked_vm -= vma_pages(vma); mm->map_count--; tail_vma = vma; vma = vma->vm_next; @@ -2778,22 +2780,6 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma, return __split_vma(mm, vma, addr, new_below); } -static inline void -unlock_range(struct vm_area_struct *start, unsigned long limit) -{ - struct mm_struct *mm = start->vm_mm; - struct vm_area_struct *tmp = start; - - while (tmp && tmp->vm_start < limit) { - if (tmp->vm_flags & VM_LOCKED) { - mm->locked_vm -= vma_pages(tmp); - munlock_vma_pages_all(tmp); - } - - tmp = tmp->vm_next; - } -} - /* Munmap is split into 2 main parts -- this part which finds * what needs doing, and the areas themselves, which do the * work. This now handles partial unmappings. @@ -2874,12 +2860,6 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len, return error; } - /* - * unlock any mlock()ed ranges before detaching vmas - */ - if (mm->locked_vm) - unlock_range(vma, end); - /* Detach vmas from rbtree */ if (!detach_vmas_to_be_unmapped(mm, vma, prev, end)) downgrade = false; @@ -3147,20 +3127,12 @@ void exit_mmap(struct mm_struct *mm) * Nothing can be holding mm->mmap_lock here and the above call * to mmu_notifier_release(mm) ensures mmu notifier callbacks in * __oom_reap_task_mm() will not block. - * - * This needs to be done before calling unlock_range(), - * which clears VM_LOCKED, otherwise the oom reaper cannot - * reliably test it. 
*/ (void)__oom_reap_task_mm(mm); - set_bit(MMF_OOM_SKIP, &mm->flags); } mmap_write_lock(mm); - if (mm->locked_vm) - unlock_range(mm->mmap, ULONG_MAX); - arch_exit_mmap(mm); vma = mm->mmap; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 832fb330376e..6b875acabd1e 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -526,7 +526,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm) set_bit(MMF_UNSTABLE, &mm->flags); for (vma = mm->mmap ; vma; vma = vma->vm_next) { - if (!can_madv_lru_vma(vma)) + if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP)) continue; /*

From patchwork Sun Feb 6 21:36:41 2022
X-Patchwork-Submitter: Hugh Dickins
X-Patchwork-Id: 12736737
Date: Sun, 6 Feb 2022 13:36:41 -0800 (PST)
From: Hugh Dickins <hughd@google.com>
To: Andrew Morton
Cc: Michal Hocko, Vlastimil Babka, "Kirill A. Shutemov", Matthew Wilcox,
 David Hildenbrand, Alistair Popple, Johannes Weiner, Rik van Riel,
 Suren Baghdasaryan, Yu Zhao, Greg Thelen, Shakeel Butt,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 04/13] mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
In-Reply-To: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com>
Message-ID: <9f9ca113-ffb9-498e-4bd6-6bfeaaa10b7@google.com>
References: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com>

Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
inline functions which check (vma->vm_flags & VM_LOCKED) before calling
mlock_page() and munlock_page() in mm/mlock.c.

Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
because we have understandable difficulty in accounting pte maps of THPs,
and if passed a PageHead page, mlock_page() and munlock_page() cannot
tell whether it's a pmd map to be counted or a pte map to be ignored.

Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
others, and use that to call mlock_vma_page() at the end of the page
adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
beginning? unimportant, but end was easier for assertions in testing).

No page lock is required (although almost all adds happen to hold it):
delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
Certainly page lock did serialize with page migration, but I'm having
difficulty explaining why that was ever important.

Mlock accounting on THPs has been hard to define, differed between anon
and file, involved PageDoubleMap in some places and not others, required
clear_page_mlock() at some points.  Keep it simple now: just count the
pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

page_add_new_anon_rmap() callers unchanged: they have long been calling
lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
handling (it also checks for not VM_SPECIAL: I think that's overcautious,
and inconsistent with other checks, that mmap_region() already prevents
VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).
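The resulting wrappers, condensed here from the mm/internal.h hunk in the
diff below, show both rules in code: VM_LOCKED is checked inline, and a
pte map of a THP (PageTransCompound() but !compound) is deliberately not
counted:

	static inline void mlock_vma_page(struct page *page,
			struct vm_area_struct *vma, bool compound)
	{
		if (unlikely(vma->vm_flags & VM_LOCKED) &&
		    (compound || !PageTransCompound(page)))
			mlock_page(page);
	}

	static inline void munlock_vma_page(struct page *page,
			struct vm_area_struct *vma, bool compound)
	{
		if (unlikely(vma->vm_flags & VM_LOCKED) &&
		    (compound || !PageTransCompound(page)))
			munlock_page(page);
	}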
Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka --- include/linux/rmap.h | 17 +++++++------ kernel/events/uprobes.c | 7 ++---- mm/huge_memory.c | 17 ++++++------- mm/hugetlb.c | 4 +-- mm/internal.h | 36 ++++++++++++++++++++++---- mm/khugepaged.c | 4 +-- mm/ksm.c | 12 +-------- mm/memory.c | 45 +++++++++++---------------------- mm/migrate.c | 9 ++----- mm/mlock.c | 21 ++++++---------- mm/rmap.c | 56 +++++++++++++++++++---------------------- mm/userfaultfd.c | 14 ++++++----- 12 files changed, 113 insertions(+), 129 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index dc48aa8c2c94..ac29b076082b 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -167,18 +167,19 @@ struct anon_vma *page_get_anon_vma(struct page *page); */ void page_move_anon_rmap(struct page *, struct vm_area_struct *); void page_add_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long, bool); + unsigned long address, bool compound); void do_page_add_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long, int); + unsigned long address, int flags); void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long, bool); -void page_add_file_rmap(struct page *, bool); -void page_remove_rmap(struct page *, bool); - + unsigned long address, bool compound); +void page_add_file_rmap(struct page *, struct vm_area_struct *, + bool compound); +void page_remove_rmap(struct page *, struct vm_area_struct *, + bool compound); void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long); + unsigned long address); void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long); + unsigned long address); static inline void page_dup_rmap(struct page *page, bool compound) { diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 6357c3580d07..eed2f7437d96 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -173,7 +173,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, return err; } - /* For try_to_free_swap() and munlock_vma_page() below */ + /* For try_to_free_swap() below */ lock_page(old_page); mmu_notifier_invalidate_range_start(&range); @@ -201,13 +201,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, set_pte_at_notify(mm, addr, pvmw.pte, mk_pte(new_page, vma->vm_page_prot)); - page_remove_rmap(old_page, false); + page_remove_rmap(old_page, vma, false); if (!page_mapped(old_page)) try_to_free_swap(old_page); page_vma_mapped_walk_done(&pvmw); - - if ((vma->vm_flags & VM_LOCKED) && !PageCompound(old_page)) - munlock_vma_page(old_page); put_page(old_page); err = 0; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 9a34b85ebcf8..d6477f48a27e 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1577,7 +1577,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, if (pmd_present(orig_pmd)) { page = pmd_page(orig_pmd); - page_remove_rmap(page, true); + page_remove_rmap(page, vma, true); VM_BUG_ON_PAGE(page_mapcount(page) < 0, page); VM_BUG_ON_PAGE(!PageHead(page), page); } else if (thp_migration_supported()) { @@ -1962,7 +1962,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, set_page_dirty(page); if (!PageReferenced(page) && pmd_young(old_pmd)) SetPageReferenced(page); - page_remove_rmap(page, true); + page_remove_rmap(page, vma, true); put_page(page); } add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR); @@ -2096,6 +2096,9 @@ static void 
__split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, } } unlock_page_memcg(page); + + /* Above is effectively page_remove_rmap(page, vma, true) */ + munlock_vma_page(page, vma, true); } smp_wmb(); /* make pte visible before pmd */ @@ -2103,7 +2106,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, if (freeze) { for (i = 0; i < HPAGE_PMD_NR; i++) { - page_remove_rmap(page + i, false); + page_remove_rmap(page + i, vma, false); put_page(page + i); } } @@ -2163,8 +2166,6 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, do_unlock_page = true; } } - if (PageMlocked(page)) - clear_page_mlock(page); } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd))) goto out; __split_huge_pmd_locked(vma, pmd, range.start, freeze); @@ -3138,7 +3139,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, if (pmd_soft_dirty(pmdval)) pmdswp = pmd_swp_mksoft_dirty(pmdswp); set_pmd_at(mm, address, pvmw->pmd, pmdswp); - page_remove_rmap(page, true); + page_remove_rmap(page, vma, true); put_page(page); } @@ -3168,10 +3169,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) if (PageAnon(new)) page_add_anon_rmap(new, vma, mmun_start, true); else - page_add_file_rmap(new, true); + page_add_file_rmap(new, vma, true); set_pmd_at(mm, mmun_start, pvmw->pmd, pmde); - if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new)) - mlock_vma_page(new); update_mmu_cache_pmd(vma, address, pvmw->pmd); } #endif diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 61895cc01d09..43fb3155298e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5014,7 +5014,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct set_page_dirty(page); hugetlb_count_sub(pages_per_huge_page(h), mm); - page_remove_rmap(page, true); + page_remove_rmap(page, vma, true); spin_unlock(ptl); tlb_remove_page_size(tlb, page, huge_page_size(h)); @@ -5259,7 +5259,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, /* Break COW */ huge_ptep_clear_flush(vma, haddr, ptep); mmu_notifier_invalidate_range(mm, range.start, range.end); - page_remove_rmap(old_page, true); + page_remove_rmap(old_page, vma, true); hugepage_add_new_anon_rmap(new_page, vma, haddr); set_huge_pte_at(mm, haddr, ptep, make_huge_pte(vma, new_page, 1)); diff --git a/mm/internal.h b/mm/internal.h index f235aa92e564..3d7dfc8bc471 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -395,12 +395,35 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma, bool write, int *locked); extern int mlock_future_check(struct mm_struct *mm, unsigned long flags, unsigned long len); - /* - * must be called with vma's mmap_lock held for read or write, and page locked. + * mlock_vma_page() and munlock_vma_page(): + * should be called with vma's mmap_lock held for read or write, + * under page table lock for the pte/pmd being added or removed. + * + * mlock is usually called at the end of page_add_*_rmap(), + * munlock at the end of page_remove_rmap(); but new anon + * pages are managed in lru_cache_add_inactive_or_unevictable(). + * + * @compound is used to include pmd mappings of THPs, but filter out + * pte mappings of THPs, which cannot be consistently counted: a pte + * mapping of the THP head cannot be distinguished by the page alone. 
*/ -extern void mlock_vma_page(struct page *page); -extern void munlock_vma_page(struct page *page); +void mlock_page(struct page *page); +static inline void mlock_vma_page(struct page *page, + struct vm_area_struct *vma, bool compound) +{ + if (unlikely(vma->vm_flags & VM_LOCKED) && + (compound || !PageTransCompound(page))) + mlock_page(page); +} +void munlock_page(struct page *page); +static inline void munlock_vma_page(struct page *page, + struct vm_area_struct *vma, bool compound) +{ + if (unlikely(vma->vm_flags & VM_LOCKED) && + (compound || !PageTransCompound(page))) + munlock_page(page); +} /* * Clear the page's PageMlocked(). This can be useful in a situation where @@ -487,7 +510,10 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf, #else /* !CONFIG_MMU */ static inline void unmap_mapping_folio(struct folio *folio) { } static inline void clear_page_mlock(struct page *page) { } -static inline void mlock_vma_page(struct page *page) { } +static inline void mlock_vma_page(struct page *page, + struct vm_area_struct *vma, bool compound) { } +static inline void munlock_vma_page(struct page *page, + struct vm_area_struct *vma, bool compound) { } static inline void vunmap_range_noflush(unsigned long start, unsigned long end) { } diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 35f14d0a00a6..d5e387c58bde 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -773,7 +773,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page, */ spin_lock(ptl); ptep_clear(vma->vm_mm, address, _pte); - page_remove_rmap(src_page, false); + page_remove_rmap(src_page, vma, false); spin_unlock(ptl); free_page_and_swap_cache(src_page); } @@ -1497,7 +1497,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) if (pte_none(*pte)) continue; page = vm_normal_page(vma, addr, *pte); - page_remove_rmap(page, false); + page_remove_rmap(page, vma, false); } pte_unmap_unlock(start_pte, ptl); diff --git a/mm/ksm.c b/mm/ksm.c index c20bd4d9a0d9..c5a4403b5dc9 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1177,7 +1177,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, ptep_clear_flush(vma, addr, ptep); set_pte_at_notify(mm, addr, ptep, newpte); - page_remove_rmap(page, false); + page_remove_rmap(page, vma, false); if (!page_mapped(page)) try_to_free_swap(page); put_page(page); @@ -1252,16 +1252,6 @@ static int try_to_merge_one_page(struct vm_area_struct *vma, err = replace_page(vma, page, kpage, orig_pte); } - if ((vma->vm_flags & VM_LOCKED) && kpage && !err) { - munlock_vma_page(page); - if (!PageMlocked(kpage)) { - unlock_page(page); - lock_page(kpage); - mlock_vma_page(kpage); - page = kpage; /* for final unlock */ - } - } - out_unlock: unlock_page(page); out: diff --git a/mm/memory.c b/mm/memory.c index c125c4969913..53bd9e5f2e33 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -735,9 +735,6 @@ static void restore_exclusive_pte(struct vm_area_struct *vma, set_pte_at(vma->vm_mm, address, ptep, pte); - if (vma->vm_flags & VM_LOCKED) - mlock_vma_page(page); - /* * No need to invalidate - it was non-present before. However * secondary CPUs may have mappings that need invalidating. 
@@ -1377,7 +1374,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, mark_page_accessed(page); } rss[mm_counter(page)]--; - page_remove_rmap(page, false); + page_remove_rmap(page, vma, false); if (unlikely(page_mapcount(page) < 0)) print_bad_pte(vma, addr, ptent, page); if (unlikely(__tlb_remove_page(tlb, page))) { @@ -1397,10 +1394,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, continue; pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); rss[mm_counter(page)]--; - if (is_device_private_entry(entry)) - page_remove_rmap(page, false); - + page_remove_rmap(page, vma, false); put_page(page); continue; } @@ -1753,16 +1748,16 @@ static int validate_page_before_insert(struct page *page) return 0; } -static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte, +static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, unsigned long addr, struct page *page, pgprot_t prot) { if (!pte_none(*pte)) return -EBUSY; /* Ok, finally just insert the thing.. */ get_page(page); - inc_mm_counter_fast(mm, mm_counter_file(page)); - page_add_file_rmap(page, false); - set_pte_at(mm, addr, pte, mk_pte(page, prot)); + inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page)); + page_add_file_rmap(page, vma, false); + set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot)); return 0; } @@ -1776,7 +1771,6 @@ static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte, static int insert_page(struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot) { - struct mm_struct *mm = vma->vm_mm; int retval; pte_t *pte; spinlock_t *ptl; @@ -1785,17 +1779,17 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr, if (retval) goto out; retval = -ENOMEM; - pte = get_locked_pte(mm, addr, &ptl); + pte = get_locked_pte(vma->vm_mm, addr, &ptl); if (!pte) goto out; - retval = insert_page_into_pte_locked(mm, pte, addr, page, prot); + retval = insert_page_into_pte_locked(vma, pte, addr, page, prot); pte_unmap_unlock(pte, ptl); out: return retval; } #ifdef pte_index -static int insert_page_in_batch_locked(struct mm_struct *mm, pte_t *pte, +static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte, unsigned long addr, struct page *page, pgprot_t prot) { int err; @@ -1805,7 +1799,7 @@ static int insert_page_in_batch_locked(struct mm_struct *mm, pte_t *pte, err = validate_page_before_insert(page); if (err) return err; - return insert_page_into_pte_locked(mm, pte, addr, page, prot); + return insert_page_into_pte_locked(vma, pte, addr, page, prot); } /* insert_pages() amortizes the cost of spinlock operations @@ -1842,7 +1836,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr, start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock); for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) { - int err = insert_page_in_batch_locked(mm, pte, + int err = insert_page_in_batch_locked(vma, pte, addr, pages[curr_page_idx], prot); if (unlikely(err)) { pte_unmap_unlock(start_pte, pte_lock); @@ -3098,7 +3092,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) * mapcount is visible. So transitively, TLBs to * old page will be flushed before it can be reused. */ - page_remove_rmap(old_page, false); + page_remove_rmap(old_page, vma, false); } /* Free the old page.. 
*/ @@ -3118,16 +3112,6 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) */ mmu_notifier_invalidate_range_only_end(&range); if (old_page) { - /* - * Don't let another task, with possibly unlocked vma, - * keep the mlocked page. - */ - if (page_copied && (vma->vm_flags & VM_LOCKED)) { - lock_page(old_page); /* LRU manipulation */ - if (PageMlocked(old_page)) - munlock_vma_page(old_page); - unlock_page(old_page); - } if (page_copied) free_swap_cache(old_page); put_page(old_page); @@ -3947,7 +3931,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR); - page_add_file_rmap(page, true); + page_add_file_rmap(page, vma, true); + /* * deposit and withdraw with pmd lock held */ @@ -3996,7 +3981,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr) lru_cache_add_inactive_or_unevictable(page, vma); } else { inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page)); - page_add_file_rmap(page, false); + page_add_file_rmap(page, vma, false); } set_pte_at(vma->vm_mm, addr, vmf->pte, entry); } diff --git a/mm/migrate.c b/mm/migrate.c index c7da064b4781..7c4223ce2500 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -248,14 +248,9 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, if (PageAnon(new)) page_add_anon_rmap(new, vma, pvmw.address, false); else - page_add_file_rmap(new, false); + page_add_file_rmap(new, vma, false); set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); } - if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new)) - mlock_vma_page(new); - - if (PageTransHuge(page) && PageMlocked(page)) - clear_page_mlock(page); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, pvmw.address, pvmw.pte); @@ -2331,7 +2326,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, * drop page refcount. Page won't be freed, as we took * a reference just above. */ - page_remove_rmap(page, false); + page_remove_rmap(page, vma, false); put_page(page); if (pte_present(pte)) diff --git a/mm/mlock.c b/mm/mlock.c index d148da934fe9..aaded15b2f8f 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -78,17 +78,13 @@ void clear_page_mlock(struct page *page) } } -/* - * Mark page as mlocked if not already. - * If page on LRU, isolate and putback to move to unevictable list. +/** + * mlock_page - mlock a page + * @page - page to be mlocked, either a normal page or a THP head. */ -void mlock_vma_page(struct page *page) +void mlock_page(struct page *page) { - /* Serialize with page migration */ - BUG_ON(!PageLocked(page)); - VM_BUG_ON_PAGE(PageTail(page), page); - VM_BUG_ON_PAGE(PageCompound(page) && PageDoubleMap(page), page); if (!TestSetPageMlocked(page)) { int nr_pages = thp_nr_pages(page); @@ -101,14 +97,11 @@ void mlock_vma_page(struct page *page) } /** - * munlock_vma_page - munlock a vma page - * @page: page to be unlocked, either a normal page or THP page head + * munlock_page - munlock a page + * @page: page to be munlocked, either a normal page or a THP head. 
*/ -void munlock_vma_page(struct page *page) +void munlock_page(struct page *page) { - /* Serialize with page migration */ - BUG_ON(!PageLocked(page)); - VM_BUG_ON_PAGE(PageTail(page), page); if (TestClearPageMlocked(page)) { diff --git a/mm/rmap.c b/mm/rmap.c index 7ce7f1946cff..6cc8bf129f18 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1181,17 +1181,17 @@ void do_page_add_anon_rmap(struct page *page, __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr); } - if (unlikely(PageKsm(page))) { + if (unlikely(PageKsm(page))) unlock_page_memcg(page); - return; - } /* address might be in next vma when migration races vma_adjust */ - if (first) + else if (first) __page_set_anon_rmap(page, vma, address, flags & RMAP_EXCLUSIVE); else __page_check_anon_rmap(page, vma, address); + + mlock_vma_page(page, vma, compound); } /** @@ -1232,12 +1232,14 @@ void page_add_new_anon_rmap(struct page *page, /** * page_add_file_rmap - add pte mapping to a file page - * @page: the page to add the mapping to - * @compound: charge the page as compound or small page + * @page: the page to add the mapping to + * @vma: the vm area in which the mapping is added + * @compound: charge the page as compound or small page * * The caller needs to hold the pte lock. */ -void page_add_file_rmap(struct page *page, bool compound) +void page_add_file_rmap(struct page *page, + struct vm_area_struct *vma, bool compound) { int i, nr = 1; @@ -1260,13 +1262,8 @@ void page_add_file_rmap(struct page *page, bool compound) nr_pages); } else { if (PageTransCompound(page) && page_mapping(page)) { - struct page *head = compound_head(page); - VM_WARN_ON_ONCE(!PageLocked(page)); - - SetPageDoubleMap(head); - if (PageMlocked(page)) - clear_page_mlock(head); + SetPageDoubleMap(compound_head(page)); } if (!atomic_inc_and_test(&page->_mapcount)) goto out; @@ -1274,6 +1271,8 @@ void page_add_file_rmap(struct page *page, bool compound) __mod_lruvec_page_state(page, NR_FILE_MAPPED, nr); out: unlock_page_memcg(page); + + mlock_vma_page(page, vma, compound); } static void page_remove_file_rmap(struct page *page, bool compound) @@ -1368,11 +1367,13 @@ static void page_remove_anon_compound_rmap(struct page *page) /** * page_remove_rmap - take down pte mapping from a page * @page: page to remove mapping from + * @vma: the vm area from which the mapping is removed * @compound: uncharge the page as compound or small page * * The caller needs to hold the pte lock. */ -void page_remove_rmap(struct page *page, bool compound) +void page_remove_rmap(struct page *page, + struct vm_area_struct *vma, bool compound) { lock_page_memcg(page); @@ -1414,6 +1415,8 @@ void page_remove_rmap(struct page *page, bool compound) */ out: unlock_page_memcg(page); + + munlock_vma_page(page, vma, compound); } /* @@ -1469,28 +1472,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, mmu_notifier_invalidate_range_start(&range); while (page_vma_mapped_walk(&pvmw)) { + /* Unexpected PMD-mapped THP? */ + VM_BUG_ON_PAGE(!pvmw.pte, page); + /* - * If the page is mlock()d, we cannot swap it out. + * If the page is in an mlock()d vma, we must not swap it out. */ if (!(flags & TTU_IGNORE_MLOCK) && (vma->vm_flags & VM_LOCKED)) { - /* - * PTE-mapped THP are never marked as mlocked: so do - * not set it on a DoubleMap THP, nor on an Anon THP - * (which may still be PTE-mapped after DoubleMap was - * cleared). But stop unmapping even in those cases. 
- */ - if (!PageTransCompound(page) || (PageHead(page) && - !PageDoubleMap(page) && !PageAnon(page))) - mlock_vma_page(page); + /* Restore the mlock which got missed */ + mlock_vma_page(page, vma, false); page_vma_mapped_walk_done(&pvmw); ret = false; break; } - /* Unexpected PMD-mapped THP? */ - VM_BUG_ON_PAGE(!pvmw.pte, page); - subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); address = pvmw.address; @@ -1668,7 +1664,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * * See Documentation/vm/mmu_notifier.rst */ - page_remove_rmap(subpage, PageHuge(page)); + page_remove_rmap(subpage, vma, PageHuge(page)); put_page(page); } @@ -1942,7 +1938,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, * * See Documentation/vm/mmu_notifier.rst */ - page_remove_rmap(subpage, PageHuge(page)); + page_remove_rmap(subpage, vma, PageHuge(page)); put_page(page); } @@ -2078,7 +2074,7 @@ static bool page_make_device_exclusive_one(struct page *page, * There is a reference on the page for the swap entry which has * been removed, so shouldn't take another. */ - page_remove_rmap(subpage, false); + page_remove_rmap(subpage, vma, false); } mmu_notifier_invalidate_range_end(&range); diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 0780c2a57ff1..15d3e97a6e04 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -95,10 +95,15 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, if (!pte_none(*dst_pte)) goto out_unlock; - if (page_in_cache) - page_add_file_rmap(page, false); - else + if (page_in_cache) { + /* Usually, cache pages are already added to LRU */ + if (newly_allocated) + lru_cache_add(page); + page_add_file_rmap(page, dst_vma, false); + } else { page_add_new_anon_rmap(page, dst_vma, dst_addr, false); + lru_cache_add_inactive_or_unevictable(page, dst_vma); + } /* * Must happen after rmap, as mm_counter() checks mapping (via @@ -106,9 +111,6 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, */ inc_mm_counter(dst_mm, mm_counter(page)); - if (newly_allocated) - lru_cache_add_inactive_or_unevictable(page, dst_vma); - set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); /* No need to invalidate - it was non-present before */

From patchwork Sun Feb 6 21:38:12 2022
Date: Sun, 6 Feb 2022 13:38:12 -0800 (PST) From: Hugh Dickins To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH 05/13] mm/munlock: replace clear_page_mlock() by final clearance In-Reply-To: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com> Message-ID: <652a918-8a11-c1e9-a760-854873841bc@google.com> References: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com> MIME-Version: 1.0

Placing munlock_vma_page() at the end of page_remove_rmap() shifts most of the munlocking to clear_page_mlock(), since PageMlocked is typically still set when mapcount has fallen to 0.
That is not what we want: we want /proc/vmstat's unevictable_pgs_cleared to remain as a useful check on the integrity of the mlock/munlock protocol - small numbers are not surprising, but big numbers mean the protocol is not working. That could be easily fixed by placing munlock_vma_page() at the start of page_remove_rmap(); but later in the series we shall want to batch the munlocking, and that too would tend to leave PageMlocked still set at the point when it is checked. So delete clear_page_mlock() now: leave it instead to release_pages() (and __page_cache_release()) to do this backstop clearing of Mlocked, when page refcount has fallen to 0. If a pinned page occasionally gets counted as Mlocked and Unevictable until it is unpinned, that's okay. A slightly regrettable side-effect of this change is that, since release_pages() and __page_cache_release() may be called at interrupt time, those places which update NR_MLOCK with interrupts enabled had better use mod_zone_page_state() than __mod_zone_page_state() (but holding the lruvec lock always has interrupts disabled). This change, forcing Mlocked off when refcount 0 instead of earlier when mapcount 0, is not fundamental: it can be reversed if performance or something else is found to suffer; but this is the easiest way to separate the stats - let's not complicate that without good reason. Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka --- mm/internal.h | 12 ------------ mm/mlock.c | 30 ------------------------------ mm/rmap.c | 9 --------- mm/swap.c | 32 ++++++++++++++++++++++-------- 4 files changed, 24 insertions(+), 59 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 3d7dfc8bc471..a43d79335c16 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -425,17 +425,6 @@ static inline void munlock_vma_page(struct page *page, munlock_page(page); } -/* - * Clear the page's PageMlocked(). This can be useful in a situation where - * we want to unconditionally remove a page from the pagecache -- e.g., - * on truncation or freeing. - * - * It is legal to call this function for any page, mlocked or not. - * If called for a page that is still mapped by mlocked vmas, all we do - * is revert to lazy LRU behaviour -- semantics are not broken. - */ -extern void clear_page_mlock(struct page *page); - extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma); /* @@ -509,7 +498,6 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf, } #else /* !CONFIG_MMU */ static inline void unmap_mapping_folio(struct folio *folio) { } -static inline void clear_page_mlock(struct page *page) { } static inline void mlock_vma_page(struct page *page, struct vm_area_struct *vma, bool compound) { } static inline void munlock_vma_page(struct page *page, diff --git a/mm/mlock.c b/mm/mlock.c index aaded15b2f8f..db936288b8a0 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -48,36 +48,6 @@ EXPORT_SYMBOL(can_do_mlock); * PageUnevictable is set to indicate the unevictable state. */ -/* - * LRU accounting for clear_page_mlock() - */ -void clear_page_mlock(struct page *page) -{ - int nr_pages; - - if (!TestClearPageMlocked(page)) - return; - - nr_pages = thp_nr_pages(page); - mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); - count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages); - /* - * The previous TestClearPageMlocked() corresponds to the smp_mb() - * in __pagevec_lru_add_fn(). - * - * See __pagevec_lru_add_fn for more explanation.
- */ - if (!isolate_lru_page(page)) { - putback_lru_page(page); - } else { - /* - * We lost the race. the page already moved to evictable list. - */ - if (PageUnevictable(page)) - count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages); - } -} - /** * mlock_page - mlock a page * @page - page to be mlocked, either a normal page or a THP head. diff --git a/mm/rmap.c b/mm/rmap.c index 6cc8bf129f18..5442a5c97a85 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1315,9 +1315,6 @@ static void page_remove_file_rmap(struct page *page, bool compound) * pte lock(a spinlock) is held, which implies preemption disabled. */ __mod_lruvec_page_state(page, NR_FILE_MAPPED, -nr); - - if (unlikely(PageMlocked(page))) - clear_page_mlock(page); } static void page_remove_anon_compound_rmap(struct page *page) @@ -1357,9 +1354,6 @@ static void page_remove_anon_compound_rmap(struct page *page) nr = thp_nr_pages(page); } - if (unlikely(PageMlocked(page))) - clear_page_mlock(page); - if (nr) __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr); } @@ -1398,9 +1392,6 @@ void page_remove_rmap(struct page *page, */ __dec_lruvec_page_state(page, NR_ANON_MAPPED); - if (unlikely(PageMlocked(page))) - clear_page_mlock(page); - if (PageTransCompound(page)) deferred_split_huge_page(compound_head(page)); diff --git a/mm/swap.c b/mm/swap.c index bcf3ac288b56..ff4810e4a4bc 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -74,8 +74,8 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = { }; /* - * This path almost never happens for VM activity - pages are normally - * freed via pagevecs. But it gets used by networking. + * This path almost never happens for VM activity - pages are normally freed + * via pagevecs. But it gets used by networking - and for compound pages. */ static void __page_cache_release(struct page *page) { @@ -89,6 +89,14 @@ static void __page_cache_release(struct page *page) __clear_page_lru_flags(page); unlock_page_lruvec_irqrestore(lruvec, flags); } + /* See comment on PageMlocked in release_pages() */ + if (unlikely(PageMlocked(page))) { + int nr_pages = thp_nr_pages(page); + + __ClearPageMlocked(page); + mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); + count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages); + } __ClearPageWaiters(page); } @@ -489,12 +497,8 @@ void lru_cache_add_inactive_or_unevictable(struct page *page, unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED; if (unlikely(unevictable) && !TestSetPageMlocked(page)) { int nr_pages = thp_nr_pages(page); - /* - * We use the irq-unsafe __mod_zone_page_state because this - * counter is not modified from interrupt context, and the pte - * lock is held(spinlock), which implies preemption disabled. - */ - __mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages); + + mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages); count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages); } lru_cache_add(page); @@ -969,6 +973,18 @@ void release_pages(struct page **pages, int nr) __clear_page_lru_flags(page); } + /* + * In rare cases, when truncation or holepunching raced with + * munlock after VM_LOCKED was cleared, Mlocked may still be + * found set here. This does not indicate a problem, unless + * "unevictable_pgs_cleared" appears worryingly large. 
+ */ + if (unlikely(PageMlocked(page))) { + __ClearPageMlocked(page); + dec_zone_page_state(page, NR_MLOCK); + count_vm_event(UNEVICTABLE_PGCLEARED); + } + __ClearPageWaiters(page); list_add(&page->lru, &pages_to_free);

From patchwork Sun Feb 6 21:40:26 2022
Date: Sun, 6 Feb 2022 13:40:26 -0800 (PST) From: Hugh Dickins To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH 06/13] mm/munlock: maintain page->mlock_count while unevictable In-Reply-To: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com> Message-ID: <3d204af4-664f-e4b0-4781-16718a2efb9c@google.com> References: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com> MIME-Version: 1.0

Previous patches have been preparatory: now implement page->mlock_count. The ordering of the "Unevictable LRU" is of no significance, and there is no point holding unevictable pages on a list: place page->mlock_count to overlay page->lru.prev (since page->lru.next is overlaid by compound_head, which needs to be even so as not to satisfy PageTail - though 2 could be added instead of 1 for each mlock, if that's ever an improvement). But it's only safe to rely on or modify page->mlock_count while lruvec lock is held and page is on unevictable "LRU" - we can save lots of edits by continuing to pretend that there's an imaginary LRU here (there is an unevictable count which still needs to be maintained, but not a list). The mlock_count technique suffers from an unreliability much like with page_mlock(): while someone else has the page off LRU, not much can be done. As before, err on the safe side (behave as if mlock_count 0), and let try_to_unmap_one() move the page to unevictable if reclaim finds out later on - a few misplaced pages don't matter, what we want to avoid is imbalancing reclaim by flooding evictable lists with unevictable pages. I am not a fan of "if (!isolate_lru_page(page)) putback_lru_page(page);": if we have taken lruvec lock to get the page off its present list, then we save everyone trouble (and however many extra atomic ops) by putting it on its destination list immediately.
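To make the overlay concrete: a minimal userspace sketch of the same trick (illustrative only - "toy_page" and its fields are invented here to mirror the mm_types.h change below; this is not kernel code):

#include <assert.h>
#include <stdio.h>

struct list_head { void *next, *prev; };

struct toy_page {
	union {
		struct list_head lru;		/* while on a real LRU list */
		struct {
			void *__filler;		/* overlays lru.next: kept even */
			unsigned int mlock_count; /* overlays lru.prev */
		};
	};
};

int main(void)
{
	struct toy_page page = { .lru = { NULL, NULL } };

	/* compound_head overlays the first word, and PageTail keys on its
	 * bottom bit: keeping __filler even (here NULL) guarantees that an
	 * mlocked page can never be mistaken for a THP tail page. */
	assert(((unsigned long)page.lru.next & 1) == 0);

	page.mlock_count = 1;		/* first mlock of this page */
	page.mlock_count++;		/* an overlapping second mlock */
	printf("mlock_count = %u\n", page.mlock_count);
	return 0;
}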
Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka --- include/linux/mm_inline.h | 11 +++++-- include/linux/mm_types.h | 19 +++++++++-- mm/huge_memory.c | 5 ++- mm/memcontrol.c | 3 +- mm/mlock.c | 68 +++++++++++++++++++++++++++++++-------- mm/mmzone.c | 7 ++++ mm/swap.c | 1 + 7 files changed, 92 insertions(+), 22 deletions(-) diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index b725839dfe71..884d6f6af05b 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -99,7 +99,8 @@ void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio) update_lru_size(lruvec, lru, folio_zonenum(folio), folio_nr_pages(folio)); - list_add(&folio->lru, &lruvec->lists[lru]); + if (lru != LRU_UNEVICTABLE) + list_add(&folio->lru, &lruvec->lists[lru]); } static __always_inline void add_page_to_lru_list(struct page *page, @@ -115,6 +116,7 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio) update_lru_size(lruvec, lru, folio_zonenum(folio), folio_nr_pages(folio)); + /* This is not expected to be used on LRU_UNEVICTABLE */ list_add_tail(&folio->lru, &lruvec->lists[lru]); } @@ -127,8 +129,11 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page, static __always_inline void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio) { - list_del(&folio->lru); - update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio), + enum lru_list lru = folio_lru_list(folio); + + if (lru != LRU_UNEVICTABLE) + list_del(&folio->lru); + update_lru_size(lruvec, lru, folio_zonenum(folio), -folio_nr_pages(folio)); } diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 5140e5feb486..475bdb282769 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -85,7 +85,16 @@ struct page { * lruvec->lru_lock. Sometimes used as a generic list * by the page owner. */ - struct list_head lru; + union { + struct list_head lru; + /* Or, for the Unevictable "LRU list" slot */ + struct { + /* Always even, to negate PageTail */ + void *__filler; + /* Count page's or folio's mlocks */ + unsigned int mlock_count; + }; + }; /* See page-flags.h for PAGE_MAPPING_FLAGS */ struct address_space *mapping; pgoff_t index; /* Our offset within mapping. */ @@ -241,7 +250,13 @@ struct folio { struct { /* public: */ unsigned long flags; - struct list_head lru; + union { + struct list_head lru; + struct { + void *__filler; + unsigned int mlock_count; + }; + }; struct address_space *mapping; pgoff_t index; void *private; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d6477f48a27e..9afca0122723 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2300,8 +2300,11 @@ static void lru_add_page_tail(struct page *head, struct page *tail, } else { /* head is still on lru (and we have it frozen) */ VM_WARN_ON(!PageLRU(head)); + if (PageUnevictable(tail)) + tail->mlock_count = 0; + else + list_add_tail(&tail->lru, &head->lru); SetPageLRU(tail); - list_add_tail(&tail->lru, &head->lru); } } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 09d342c7cbd0..b10590926177 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1257,8 +1257,7 @@ struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio, * @nr_pages: positive when adding or negative when removing * * This function must be called under lru_lock, just before a page is added - * to or just after a page is removed from an lru list (that ordering being - * so as to allow it to check that lru_size 0 is consistent with list_empty). 
+ * to or just after a page is removed from an lru list. */ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, int zid, int nr_pages) diff --git a/mm/mlock.c b/mm/mlock.c index db936288b8a0..0d3ae04b1f4e 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -54,16 +54,35 @@ EXPORT_SYMBOL(can_do_mlock); */ void mlock_page(struct page *page) { + struct lruvec *lruvec; + int nr_pages = thp_nr_pages(page); + VM_BUG_ON_PAGE(PageTail(page), page); if (!TestSetPageMlocked(page)) { - int nr_pages = thp_nr_pages(page); - mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages); - count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages); - if (!isolate_lru_page(page)) - putback_lru_page(page); + __count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages); + } + + /* There is nothing more we can do while it's off LRU */ + if (!TestClearPageLRU(page)) + return; + + lruvec = folio_lruvec_lock_irq(page_folio(page)); + if (PageUnevictable(page)) { + page->mlock_count++; + goto out; } + + del_page_from_lru_list(page, lruvec); + ClearPageActive(page); + SetPageUnevictable(page); + page->mlock_count = 1; + add_page_to_lru_list(page, lruvec); + __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages); +out: + SetPageLRU(page); + unlock_page_lruvec_irq(lruvec); } /** @@ -72,19 +91,40 @@ void mlock_page(struct page *page) */ void munlock_page(struct page *page) { + struct lruvec *lruvec; + int nr_pages = thp_nr_pages(page); + VM_BUG_ON_PAGE(PageTail(page), page); + lock_page_memcg(page); + lruvec = folio_lruvec_lock_irq(page_folio(page)); + if (PageLRU(page) && PageUnevictable(page)) { + /* Then mlock_count is maintained, but might undercount */ + if (page->mlock_count) + page->mlock_count--; + if (page->mlock_count) + goto out; + } + /* else assume that was the last mlock: reclaim will fix it if not */ + if (TestClearPageMlocked(page)) { - int nr_pages = thp_nr_pages(page); - - mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); - if (!isolate_lru_page(page)) { - putback_lru_page(page); - count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages); - } else if (PageUnevictable(page)) { - count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages); - } + __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); + if (PageLRU(page) || !PageUnevictable(page)) + __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages); + else + __count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages); + } + + /* page_evictable() has to be checked *after* clearing Mlocked */ + if (PageLRU(page) && PageUnevictable(page) && page_evictable(page)) { + del_page_from_lru_list(page, lruvec); + ClearPageUnevictable(page); + add_page_to_lru_list(page, lruvec); + __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages); } +out: + unlock_page_lruvec_irq(lruvec); + unlock_page_memcg(page); } /* diff --git a/mm/mmzone.c b/mm/mmzone.c index eb89d6e018e2..40e1d9428300 100644 --- a/mm/mmzone.c +++ b/mm/mmzone.c @@ -81,6 +81,13 @@ void lruvec_init(struct lruvec *lruvec) for_each_lru(lru) INIT_LIST_HEAD(&lruvec->lists[lru]); + /* + * The "Unevictable LRU" is imaginary: though its size is maintained, + * it is never scanned, and unevictable pages are not threaded on it + * (so that their lru fields can be reused to hold mlock_count). + * Poison its list head, so that any operations on it would crash. 
+ */ + list_del(&lruvec->lists[LRU_UNEVICTABLE]); } #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) diff --git a/mm/swap.c b/mm/swap.c index ff4810e4a4bc..682a03301a2c 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -1062,6 +1062,7 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec) } else { folio_clear_active(folio); folio_set_unevictable(folio); + folio->mlock_count = !!folio_test_mlocked(folio); if (!was_unevictable) __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages); }

From patchwork Sun Feb 6 21:42:09 2022
Date: Sun, 6 Feb 2022 13:42:09 -0800 (PST) From: Hugh Dickins To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking In-Reply-To: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com> Message-ID: <8bc3ee8c-7f1-d812-7f22-4f9f6d436bc@google.com> References: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com> MIME-Version: 1.0

Fill in missing pieces: reimplementation of munlock_vma_pages_range(), required to lower the mlock_counts when munlocking without munmapping; and its complement, implementation of mlock_vma_pages_range(), required to raise the mlock_counts on pages already there when a range is mlocked. Combine them into just the one function mlock_vma_pages_range(), using walk_page_range() to run mlock_pte_range(). This approach fixes the "Very slow unlockall()" of unpopulated PROT_NONE areas, reported in https://lore.kernel.org/linux-mm/70885d37-62b7-748b-29df-9e94f3291736@gmail.com/ Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the pte first, it will not try to munlock the page: leaving release_pages() to correct it when the last reference to the page is gone - that's okay, a page is not evictable anyway while it is held by an extra reference. Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if a racing remove_migration_pte() or try_to_unmap_one() (depending on i_mmap_rwsem) gets to the pte first, it will try to mlock the page, then mlock_pte_range() will mlock it a second time. This is harder to reproduce, but a more serious race because it could leave the page unevictable indefinitely even though the area is munlocked afterwards. Guard against it by setting the (inappropriate) VM_IO flag, and modifying mlock_vma_page() to decline such vmas.
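The guard can be condensed into a few lines (runnable userspace C; may_mlock_in_vma() is an invented name, and the flag values are copied from include/linux/mm.h only for illustration):

#include <stdio.h>

#define VM_LOCKED	0x00002000UL	/* illustrative values */
#define VM_IO		0x00004000UL

/* The test mlock_vma_page() now makes: act only when the vma is
 * VM_LOCKED and the transient VM_IO marker is not set. */
static int may_mlock_in_vma(unsigned long vm_flags)
{
	return (vm_flags & (VM_LOCKED | VM_IO)) == VM_LOCKED;
}

int main(void)
{
	unsigned long vm_flags = VM_LOCKED | VM_IO;	/* during the page walk */

	printf("racing mlock during walk? %d\n", may_mlock_in_vma(vm_flags));
	vm_flags &= ~VM_IO;				/* walk complete */
	printf("after VM_IO is cleared?  %d\n", may_mlock_in_vma(vm_flags));
	return 0;
}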
Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka --- mm/internal.h | 3 +- mm/mlock.c | 108 ++++++++++++++++++++++++++++++++++++++++---------- 2 files changed, 90 insertions(+), 21 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index a43d79335c16..b3f0dd3ffba2 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -412,7 +412,8 @@ void mlock_page(struct page *page); static inline void mlock_vma_page(struct page *page, struct vm_area_struct *vma, bool compound) { - if (unlikely(vma->vm_flags & VM_LOCKED) && + /* VM_IO check prevents migration from double-counting during mlock */ + if (unlikely((vma->vm_flags & (VM_LOCKED|VM_IO)) == VM_LOCKED) && (compound || !PageTransCompound(page))) mlock_page(page); } diff --git a/mm/mlock.c b/mm/mlock.c index 0d3ae04b1f4e..f8e5dcff21ae 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -127,23 +128,90 @@ void munlock_page(struct page *page) unlock_page_memcg(page); } +static int mlock_pte_range(pmd_t *pmd, unsigned long addr, + unsigned long end, struct mm_walk *walk) + +{ + struct vm_area_struct *vma = walk->vma; + spinlock_t *ptl; + pte_t *start_pte, *pte; + struct page *page; + + ptl = pmd_trans_huge_lock(pmd, vma); + if (ptl) { + if (!pmd_present(*pmd)) + goto out; + if (is_huge_zero_pmd(*pmd)) + goto out; + page = pmd_page(*pmd); + if (vma->vm_flags & VM_LOCKED) + mlock_page(page); + else + munlock_page(page); + goto out; + } + + start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); + for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) { + if (!pte_present(*pte)) + continue; + page = vm_normal_page(vma, addr, *pte); + if (!page) + continue; + if (PageTransCompound(page)) + continue; + if (vma->vm_flags & VM_LOCKED) + mlock_page(page); + else + munlock_page(page); + } + pte_unmap(start_pte); +out: + spin_unlock(ptl); + cond_resched(); + return 0; +} + /* - * munlock_vma_pages_range() - munlock all pages in the vma range.' - * @vma - vma containing range to be munlock()ed. + * mlock_vma_pages_range() - mlock any pages already in the range, + * or munlock all pages in the range. + * @vma - vma containing range to be mlock()ed or munlock()ed * @start - start address in @vma of the range - * @end - end of range in @vma. - * - * For mremap(), munmap() and exit(). + * @end - end of range in @vma + * @newflags - the new set of flags for @vma. * - * Called with @vma VM_LOCKED. - * - * Returns with VM_LOCKED cleared. Callers must be prepared to - * deal with this. + * Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED; + * called for munlock() and munlockall(), to clear VM_LOCKED from @vma. */ -static void munlock_vma_pages_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end) +static void mlock_vma_pages_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end, vm_flags_t newflags) { - /* Reimplementation to follow in later commit */ + static const struct mm_walk_ops mlock_walk_ops = { + .pmd_entry = mlock_pte_range, + }; + + /* + * There is a slight chance that concurrent page migration, + * or page reclaim finding a page of this now-VM_LOCKED vma, + * will call mlock_vma_page() and raise page's mlock_count: + * double counting, leaving the page unevictable indefinitely. + * Communicate this danger to mlock_vma_page() with VM_IO, + * which is a VM_SPECIAL flag not allowed on VM_LOCKED vmas. + * mmap_lock is held in write mode here, so this weird + * combination should not be visible to others. 
+ */ + if (newflags & VM_LOCKED) + newflags |= VM_IO; + WRITE_ONCE(vma->vm_flags, newflags); + + lru_add_drain(); + walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL); + lru_add_drain(); + + if (newflags & VM_IO) { + newflags &= ~VM_IO; + WRITE_ONCE(vma->vm_flags, newflags); + } } /* @@ -162,8 +230,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev, pgoff_t pgoff; int nr_pages; int ret = 0; - int lock = !!(newflags & VM_LOCKED); - vm_flags_t old_flags = vma->vm_flags; + vm_flags_t oldflags = vma->vm_flags; if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) || @@ -197,9 +264,9 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev, * Keep track of amount of locked VM. */ nr_pages = (end - start) >> PAGE_SHIFT; - if (!lock) + if (!(newflags & VM_LOCKED)) nr_pages = -nr_pages; - else if (old_flags & VM_LOCKED) + else if (oldflags & VM_LOCKED) nr_pages = 0; mm->locked_vm += nr_pages; @@ -209,11 +276,12 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev, * set VM_LOCKED, populate_vma_page_range will bring it back. */ - if (lock) + if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) { + /* No work to do, and mlocking twice would be wrong */ vma->vm_flags = newflags; - else - munlock_vma_pages_range(vma, start, end); - + } else { + mlock_vma_pages_range(vma, start, end, newflags); + } out: *prev = vma; return ret;

From patchwork Sun Feb 6 21:43:53 2022
Date: Sun, 6 Feb 2022 13:43:53 -0800 (PST) From: Hugh Dickins To: Andrew Morton cc: Michal Hocko , Vlastimil Babka , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Alistair Popple , Johannes Weiner , Rik van Riel , Suren Baghdasaryan , Yu Zhao , Greg Thelen , Shakeel Butt , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH 08/13] mm/migrate: __unmap_and_move() push good newpage to LRU In-Reply-To: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com> Message-ID: <33fb71cf-ea55-123a-bf9d-fdad297cae1@google.com> References: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com> MIME-Version: 1.0

Compaction, NUMA page movement, THP collapse/split, and memory failure do isolate unevictable pages from their "LRU", losing the record of mlock_count in doing so (isolators are likely to use page->lru for their own private lists, so mlock_count has to be presumed lost). That's unfortunate, and we should put in some work to correct that: one can imagine a function to build up the mlock_count again - but it would require i_mmap_rwsem for read, so be careful where it's called. Or page_referenced_one() and try_to_unmap_one() might do that extra work. But one place that can very easily be improved is page migration's __unmap_and_move(): a small adjustment to where the successful new page is put back on LRU, and its mlock_count (if any) is built back up by remove_migration_ptes().
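The ordering argument can be modelled in a few lines of userspace C (all names below are stand-ins, not kernel APIs): the mlock done while migration entries are removed can only maintain a count on a page it finds on the LRU.

#include <stdbool.h>
#include <stdio.h>

static bool newpage_on_lru;
static unsigned int mlock_count;

/* Stand-in for lru_cache_add(newpage) + lru_add_drain() */
static void push_newpage_to_lru(void)
{
	newpage_on_lru = true;
}

/* Stand-in for remove_migration_ptes(): its mlock_vma_page() call can
 * only rebuild mlock_count for a page already visible on the LRU. */
static void restore_migration_ptes(bool vma_is_locked)
{
	if (vma_is_locked && newpage_on_lru)
		mlock_count++;
}

int main(void)
{
	/* Old order: ptes restored before newpage reaches the LRU */
	newpage_on_lru = false; mlock_count = 0;
	restore_migration_ptes(true);
	printf("old order: mlock_count = %u (record lost)\n", mlock_count);

	/* New order: newpage pushed to LRU first */
	newpage_on_lru = false; mlock_count = 0;
	push_newpage_to_lru();
	restore_migration_ptes(true);
	printf("new order: mlock_count = %u (rebuilt)\n", mlock_count);
	return 0;
}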
Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka --- mm/migrate.c | 31 +++++++++++++++++++------------ 1 file changed, 19 insertions(+), 12 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 7c4223ce2500..f4bcf1541b62 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1032,6 +1032,21 @@ static int __unmap_and_move(struct page *page, struct page *newpage, if (!page_mapped(page)) rc = move_to_new_page(newpage, page, mode); + /* + * When successful, push newpage to LRU immediately: so that if it + * turns out to be an mlocked page, remove_migration_ptes() will + * automatically build up the correct newpage->mlock_count for it. + * + * We would like to do something similar for the old page, when + * unsuccessful, and other cases when a page has been temporarily + * isolated from the unevictable LRU: but this case is the easiest. + */ + if (rc == MIGRATEPAGE_SUCCESS) { + lru_cache_add(newpage); + if (page_was_mapped) + lru_add_drain(); + } + if (page_was_mapped) remove_migration_ptes(page, rc == MIGRATEPAGE_SUCCESS ? newpage : page, false); @@ -1045,20 +1060,12 @@ static int __unmap_and_move(struct page *page, struct page *newpage, unlock_page(page); out: /* - * If migration is successful, decrease refcount of the newpage + * If migration is successful, decrease refcount of the newpage, * which will not free the page because new page owner increased - * refcounter. As well, if it is LRU page, add the page to LRU - * list in here. Use the old state of the isolated source page to - * determine if we migrated a LRU page. newpage was already unlocked - * and possibly modified by its owner - don't rely on the page - * state. + * refcounter. */ - if (rc == MIGRATEPAGE_SUCCESS) { - if (unlikely(!is_lru)) - put_page(newpage); - else - putback_lru_page(newpage); - } + if (rc == MIGRATEPAGE_SUCCESS) + put_page(newpage); return rc; } From patchwork Sun Feb 6 21:45:50 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Hugh Dickins X-Patchwork-Id: 12736743 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3FFC4C433F5 for ; Sun, 6 Feb 2022 21:45:56 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 9BAFE6B0072; Sun, 6 Feb 2022 16:45:55 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 969156B0073; Sun, 6 Feb 2022 16:45:55 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 80A5D6B0074; Sun, 6 Feb 2022 16:45:55 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0171.hostedemail.com [216.40.44.171]) by kanga.kvack.org (Postfix) with ESMTP id 6E3056B0072 for ; Sun, 6 Feb 2022 16:45:55 -0500 (EST) Received: from smtpin15.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 29FBF181DA55E for ; Sun, 6 Feb 2022 21:45:55 +0000 (UTC) X-FDA: 79113687870.15.91ED692 Received: from mail-oo1-f42.google.com (mail-oo1-f42.google.com [209.85.161.42]) by imf26.hostedemail.com (Postfix) with ESMTP id B4594140002 for ; Sun, 6 Feb 2022 21:45:54 +0000 (UTC) Received: by mail-oo1-f42.google.com with SMTP id w5-20020a4a9785000000b0030956914befso11654284ooi.9 for ; Sun, 06 Feb 2022 13:45:54 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; 
Date: Sun, 6 Feb 2022 13:45:50 -0800 (PST)
From: Hugh Dickins
To: Andrew Morton
cc: Michal Hocko, Vlastimil Babka, "Kirill A. Shutemov", Matthew Wilcox, David Hildenbrand, Alistair Popple, Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen, Shakeel Butt, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 09/13] mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()
In-Reply-To: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com>
Message-ID: <9121d34d-4889-51f1-56c7-255138f43b8d@google.com>
References: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com>

My reading of the comment on smp_mb__after_atomic() in __pagevec_lru_add_fn() is that it can now be deleted; and that remains so when the next patch is added.
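A userspace analogy of the reasoning, as a minimal sketch (pthread mutex standing in for the lruvec lock; not kernel code): when the only writers that clear the "mlocked" flag do so under the same lock that the reader holds while testing it, the lock's acquire/release ordering makes any explicit barrier between setting "lru" and testing "mlocked" redundant.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;
static bool lru, mlocked = true, unevictable;

/* Analogue of __pagevec_lru_add_fn(), which runs under lruvec lock */
static void *lru_add(void *arg)
{
        pthread_mutex_lock(&lru_lock);
        lru = true;                     /* folio_set_lru() */
        /*
         * No barrier needed before testing "mlocked": it is only
         * ever cleared while lru_lock is held, and we hold it.
         */
        unevictable = mlocked;          /* the folio_evictable() test */
        pthread_mutex_unlock(&lru_lock);
        return NULL;
}

/* Analogue of munlock: clears "mlocked" only while holding the lock */
static void *munlock(void *arg)
{
        pthread_mutex_lock(&lru_lock);
        mlocked = false;
        if (lru && unevictable)
                unevictable = false;    /* rescue to the evictable LRU */
        pthread_mutex_unlock(&lru_lock);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, lru_add, NULL);
        pthread_create(&b, NULL, munlock, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Either interleaving leaves the item evictable, never stranded */
        printf("unevictable=%d mlocked=%d\n", unevictable, mlocked);
        return 0;
}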
Signed-off-by: Hugh Dickins
Acked-by: Vlastimil Babka
---
 mm/swap.c | 37 +++++++++----------------------------
 1 file changed, 9 insertions(+), 28 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 682a03301a2c..3f770b1ea2c1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1025,37 +1025,18 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
 
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 
+	folio_set_lru(folio);
 	/*
-	 * A folio becomes evictable in two ways:
-	 * 1) Within LRU lock [munlock_vma_page() and __munlock_pagevec()].
-	 * 2) Before acquiring LRU lock to put the folio on the correct LRU
-	 *    and then
-	 *    a) do PageLRU check with lock [check_move_unevictable_pages]
-	 *    b) do PageLRU check before lock [clear_page_mlock]
-	 *
-	 * (1) & (2a) are ok as LRU lock will serialize them. For (2b), we need
-	 * following strict ordering:
-	 *
-	 * #0: __pagevec_lru_add_fn		#1: clear_page_mlock
-	 *
-	 * folio_set_lru()			folio_test_clear_mlocked()
-	 * smp_mb() // explicit ordering	// above provides strict
-	 *					// ordering
-	 * folio_test_mlocked()			folio_test_lru()
+	 * Is an smp_mb__after_atomic() still required here, before
+	 * folio_evictable() tests PageMlocked, to rule out the possibility
+	 * of stranding an evictable folio on an unevictable LRU?  I think
+	 * not, because munlock_page() only clears PageMlocked while the LRU
+	 * lock is held.
 	 *
-	 *
-	 * if '#1' does not observe setting of PG_lru by '#0' and
-	 * fails isolation, the explicit barrier will make sure that
-	 * folio_evictable check will put the folio on the correct
-	 * LRU. Without smp_mb(), folio_set_lru() can be reordered
-	 * after folio_test_mlocked() check and can make '#1' fail the
-	 * isolation of the folio whose mlocked bit is cleared (#0 is
-	 * also looking at the same folio) and the evictable folio will
-	 * be stranded on an unevictable LRU.
+	 * (That is not true of __page_cache_release(), and not necessarily
+	 * true of release_pages(): but those only clear PageMlocked after
+	 * put_page_testzero() has excluded any other users of the page.)
 	 */
-	folio_set_lru(folio);
-	smp_mb__after_atomic();
-
 	if (folio_evictable(folio)) {
 		if (was_unevictable)
 			__count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);

From patchwork Sun Feb 6 21:47:53 2022
X-Patchwork-Submitter: Hugh Dickins
X-Patchwork-Id: 12736744
Date: Sun, 6 Feb 2022 13:47:53 -0800 (PST)
From: Hugh Dickins
To: Andrew Morton
cc: Michal Hocko, Vlastimil Babka, "Kirill A. Shutemov", Matthew Wilcox, David Hildenbrand, Alistair Popple, Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen, Shakeel Butt, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 10/13] mm/munlock: mlock_page() munlock_page() batch by pagevec
In-Reply-To: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com>
References: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com>

A weakness of the page->mlock_count approach is the need for the lruvec lock while holding the page table lock. That is not an overhead we would allow on normal pages, but I think acceptable just for pages in an mlocked area. But let's try to amortize the extra cost by gathering pages on a per-cpu pagevec before acquiring the lruvec lock.

I have an unverified conjecture that the mlock pagevec might work out well for delaying the mlock processing of new file pages until they have got off lru_cache_add()'s pagevec and on to LRU.

The initialization of page->mlock_count is racy and awkward: 0 or !!PageMlocked or 1? Was it wrong even in the implementation before this commit, which just widens the window? I haven't gone back to think it through; maybe someone can point out a better way to initialize it.

Bringing lru_cache_add_inactive_or_unevictable()'s mlock initialization into mm/mlock.c has helped: mlock_new_page(), using the mlock pagevec, rather than lru_cache_add()'s pagevec.

I experimented with various orderings: the right thing seems to be for mlock_page() and mlock_new_page() to TestSetPageMlocked before adding to the pagevec, but munlock_page() to leave TestClearPageMlocked to the later pagevec processing.

Dropped the VM_BUG_ON_PAGE(PageTail)s this time around: they have made their point, and the thp_nr_pages()s already contain a VM_BUG_ON_PGFLAGS() for that.

This still leaves acquiring lruvec locks under page table lock each time the pagevec fills (or a THP is added): which I suppose is rather silly, since the pages sit on the pagevec waiting to be processed long after page table lock has been dropped; but I'm disinclined to uglify the calling sequence until some load shows an actual problem with it (nothing wrong with taking lruvec lock under page table lock, just "nicer" to do it less).
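The patch below tags each pagevec entry by storing flags in the low bits of the struct page pointer, which works because struct page is always aligned. A standalone userspace sketch of that trick (LRU_PAGE and NEW_PAGE are copied from the patch; the toy struct page and tag() helper are invented for illustration):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Flags held in the low bits of an aligned pointer, as in the patch */
#define LRU_PAGE 0x1
#define NEW_PAGE 0x2

struct page { int id; } __attribute__((aligned(4)));

static struct page *tag(struct page *page, uintptr_t flag)
{
        return (struct page *)((uintptr_t)page + flag);
}

int main(void)
{
        struct page p = { 42 };
        struct page *entry = tag(&p, NEW_PAGE);

        /* Recover flags and untagged pointer, as mlock_pagevec() does */
        uintptr_t flags = (uintptr_t)entry & (LRU_PAGE | NEW_PAGE);
        struct page *page = (struct page *)((uintptr_t)entry - flags);

        assert(page == &p);
        printf("id=%d lru=%d new=%d\n", page->id,
               !!(flags & LRU_PAGE), !!(flags & NEW_PAGE));
        return 0;
}

This is why one pagevec can serve all three operations: the two low bits distinguish mlock, mlock_new and munlock entries without widening the array.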
Signed-off-by: Hugh Dickins
Acked-by: Vlastimil Babka
---
 mm/internal.h |   7 +-
 mm/mlock.c    | 212 ++++++++++++++++++++++++++++++++++++++++++--------
 mm/swap.c     |  27 ++++---
 3 files changed, 199 insertions(+), 47 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b3f0dd3ffba2..5817cabb6343 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -402,7 +402,8 @@ extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
  *
  * mlock is usually called at the end of page_add_*_rmap(),
  * munlock at the end of page_remove_rmap(); but new anon
- * pages are managed in lru_cache_add_inactive_or_unevictable().
+ * pages are managed by lru_cache_add_inactive_or_unevictable()
+ * calling mlock_new_page().
  *
  * @compound is used to include pmd mappings of THPs, but filter out
  * pte mappings of THPs, which cannot be consistently counted: a pte
@@ -425,6 +426,9 @@ static inline void munlock_vma_page(struct page *page,
 	    (compound || !PageTransCompound(page)))
 		munlock_page(page);
 }
+void mlock_new_page(struct page *page);
+bool need_mlock_page_drain(int cpu);
+void mlock_page_drain(int cpu);
 
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
@@ -503,6 +507,7 @@ static inline void mlock_vma_page(struct page *page,
 			struct vm_area_struct *vma, bool compound) { }
 static inline void munlock_vma_page(struct page *page,
 			struct vm_area_struct *vma, bool compound) { }
+static inline void mlock_new_page(struct page *page) { }
 static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
 {
 }
diff --git a/mm/mlock.c b/mm/mlock.c
index f8e5dcff21ae..d50d48961b22 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -28,6 +28,8 @@
 
 #include "internal.h"
 
+static DEFINE_PER_CPU(struct pagevec, mlock_pvec);
+
 bool can_do_mlock(void)
 {
 	if (rlimit(RLIMIT_MEMLOCK) != 0)
@@ -49,57 +51,79 @@ EXPORT_SYMBOL(can_do_mlock);
  * PageUnevictable is set to indicate the unevictable state.
  */
 
-/**
- * mlock_page - mlock a page
- * @page - page to be mlocked, either a normal page or a THP head.
- */
-void mlock_page(struct page *page)
+static struct lruvec *__mlock_page(struct page *page, struct lruvec *lruvec)
 {
-	struct lruvec *lruvec;
-	int nr_pages = thp_nr_pages(page);
+	/* There is nothing more we can do while it's off LRU */
+	if (!TestClearPageLRU(page))
+		return lruvec;
 
-	VM_BUG_ON_PAGE(PageTail(page), page);
+	lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec);
 
-	if (!TestSetPageMlocked(page)) {
-		mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
-		__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
+	if (unlikely(page_evictable(page))) {
+		/*
+		 * This is a little surprising, but quite possible:
+		 * PageMlocked must have got cleared already by another CPU.
+		 * Could this page be on the Unevictable LRU?  I'm not sure,
+		 * but move it now if so.
+		 */
+		if (PageUnevictable(page)) {
+			del_page_from_lru_list(page, lruvec);
+			ClearPageUnevictable(page);
+			add_page_to_lru_list(page, lruvec);
+			__count_vm_events(UNEVICTABLE_PGRESCUED,
+					  thp_nr_pages(page));
+		}
+		goto out;
 	}
 
-	/* There is nothing more we can do while it's off LRU */
-	if (!TestClearPageLRU(page))
-		return;
-
-	lruvec = folio_lruvec_lock_irq(page_folio(page));
 	if (PageUnevictable(page)) {
-		page->mlock_count++;
+		if (PageMlocked(page))
+			page->mlock_count++;
 		goto out;
 	}
 
 	del_page_from_lru_list(page, lruvec);
 	ClearPageActive(page);
 	SetPageUnevictable(page);
-	page->mlock_count = 1;
+	page->mlock_count = !!PageMlocked(page);
 	add_page_to_lru_list(page, lruvec);
-	__count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
+	__count_vm_events(UNEVICTABLE_PGCULLED, thp_nr_pages(page));
 out:
 	SetPageLRU(page);
-	unlock_page_lruvec_irq(lruvec);
+	return lruvec;
 }
 
-/**
- * munlock_page - munlock a page
- * @page: page to be munlocked, either a normal page or a THP head.
- */
-void munlock_page(struct page *page)
+static struct lruvec *__mlock_new_page(struct page *page, struct lruvec *lruvec)
+{
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+
+	lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec);
+
+	/* As above, this is a little surprising, but possible */
+	if (unlikely(page_evictable(page)))
+		goto out;
+
+	SetPageUnevictable(page);
+	page->mlock_count = !!PageMlocked(page);
+	__count_vm_events(UNEVICTABLE_PGCULLED, thp_nr_pages(page));
+out:
+	add_page_to_lru_list(page, lruvec);
+	SetPageLRU(page);
+	return lruvec;
+}
+
+static struct lruvec *__munlock_page(struct page *page, struct lruvec *lruvec)
 {
-	struct lruvec *lruvec;
 	int nr_pages = thp_nr_pages(page);
+	bool isolated = false;
+
+	if (!TestClearPageLRU(page))
+		goto munlock;
 
-	VM_BUG_ON_PAGE(PageTail(page), page);
+	isolated = true;
+	lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec);
 
-	lock_page_memcg(page);
-	lruvec = folio_lruvec_lock_irq(page_folio(page));
-	if (PageLRU(page) && PageUnevictable(page)) {
+	if (PageUnevictable(page)) {
 		/* Then mlock_count is maintained, but might undercount */
 		if (page->mlock_count)
 			page->mlock_count--;
@@ -108,24 +132,144 @@ void munlock_page(struct page *page)
 	}
 	/* else assume that was the last mlock: reclaim will fix it if not */
 
+munlock:
 	if (TestClearPageMlocked(page)) {
 		__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
-		if (PageLRU(page) || !PageUnevictable(page))
+		if (isolated || !PageUnevictable(page))
 			__count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
 		else
 			__count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
 	}
 
 	/* page_evictable() has to be checked *after* clearing Mlocked */
-	if (PageLRU(page) && PageUnevictable(page) && page_evictable(page)) {
+	if (isolated && PageUnevictable(page) && page_evictable(page)) {
 		del_page_from_lru_list(page, lruvec);
 		ClearPageUnevictable(page);
 		add_page_to_lru_list(page, lruvec);
 		__count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
 	}
 out:
-	unlock_page_lruvec_irq(lruvec);
-	unlock_page_memcg(page);
+	if (isolated)
+		SetPageLRU(page);
+	return lruvec;
+}
+
+/*
+ * Flags held in the low bits of a struct page pointer on the mlock_pvec.
+ */
+#define LRU_PAGE 0x1
+#define NEW_PAGE 0x2
+#define mlock_lru(page) ((struct page *)((unsigned long)page + LRU_PAGE))
+#define mlock_new(page) ((struct page *)((unsigned long)page + NEW_PAGE))
+
+/*
+ * mlock_pagevec() is derived from pagevec_lru_move_fn():
+ * perhaps that can make use of such page pointer flags in future,
+ * but for now just keep it for mlock.
+ * We could use three separate pagevecs instead, but one feels
+ * better (munlocking a full pagevec does not need to drain
+ * mlocking pagevecs first).
+ */
+static void mlock_pagevec(struct pagevec *pvec)
+{
+	struct lruvec *lruvec = NULL;
+	unsigned long mlock;
+	struct page *page;
+	int i;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		page = pvec->pages[i];
+		mlock = (unsigned long)page & (LRU_PAGE | NEW_PAGE);
+		page = (struct page *)((unsigned long)page - mlock);
+		pvec->pages[i] = page;
+
+		if (mlock & LRU_PAGE)
+			lruvec = __mlock_page(page, lruvec);
+		else if (mlock & NEW_PAGE)
+			lruvec = __mlock_new_page(page, lruvec);
+		else
+			lruvec = __munlock_page(page, lruvec);
+	}
+
+	if (lruvec)
+		unlock_page_lruvec_irq(lruvec);
+	release_pages(pvec->pages, pvec->nr);
+	pagevec_reinit(pvec);
+}
+
+void mlock_page_drain(int cpu)
+{
+	struct pagevec *pvec;
+
+	pvec = &per_cpu(mlock_pvec, cpu);
+	if (pagevec_count(pvec))
+		mlock_pagevec(pvec);
+}
+
+bool need_mlock_page_drain(int cpu)
+{
+	return pagevec_count(&per_cpu(mlock_pvec, cpu));
+}
+
+/**
+ * mlock_page - mlock a page already on (or temporarily off) LRU
+ * @page - page to be mlocked, either a normal page or a THP head.
+ */
+void mlock_page(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(mlock_pvec);
+
+	if (!TestSetPageMlocked(page)) {
+		int nr_pages = thp_nr_pages(page);
+
+		mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
+		__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
+	}
+
+	get_page(page);
+	if (!pagevec_add(pvec, mlock_lru(page)) ||
+	    PageHead(page) || lru_cache_disabled())
+		mlock_pagevec(pvec);
+	put_cpu_var(mlock_pvec);
+}
+
+/**
+ * mlock_new_page - mlock a newly allocated page not yet on LRU
+ * @page - page to be mlocked, either a normal page or a THP head.
+ */
+void mlock_new_page(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(mlock_pvec);
+	int nr_pages = thp_nr_pages(page);
+
+	SetPageMlocked(page);
+	mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
+	__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
+
+	get_page(page);
+	if (!pagevec_add(pvec, mlock_new(page)) ||
+	    PageHead(page) || lru_cache_disabled())
+		mlock_pagevec(pvec);
+	put_cpu_var(mlock_pvec);
+}
+
+/**
+ * munlock_page - munlock a page
+ * @page - page to be munlocked, either a normal page or a THP head.
+ */
+void munlock_page(struct page *page)
+{
+	struct pagevec *pvec = &get_cpu_var(mlock_pvec);
+
+	/*
+	 * TestClearPageMlocked(page) must be left to __munlock_page(),
+	 * which will check whether the page is multiply mlocked.
+	 */
+
+	get_page(page);
+	if (!pagevec_add(pvec, page) ||
+	    PageHead(page) || lru_cache_disabled())
+		mlock_pagevec(pvec);
+	put_cpu_var(mlock_pvec);
 }
 
 static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
diff --git a/mm/swap.c b/mm/swap.c
index 3f770b1ea2c1..842d5cd92cf6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -490,18 +490,12 @@ EXPORT_SYMBOL(folio_add_lru);
 void lru_cache_add_inactive_or_unevictable(struct page *page,
 					   struct vm_area_struct *vma)
 {
-	bool unevictable;
-
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
-	unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
-	if (unlikely(unevictable) && !TestSetPageMlocked(page)) {
-		int nr_pages = thp_nr_pages(page);
-
-		mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
-		count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
-	}
-	lru_cache_add(page);
+	if (unlikely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED))
+		mlock_new_page(page);
+	else
+		lru_cache_add(page);
 }
 
 /*
@@ -640,6 +634,7 @@ void lru_add_drain_cpu(int cpu)
 		pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 
 	activate_page_drain(cpu);
+	mlock_page_drain(cpu);
 }
 
 /**
@@ -842,6 +837,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) ||
 		    need_activate_page_drain(cpu) ||
+		    need_mlock_page_drain(cpu) ||
 		    has_bh_in_lru(cpu, NULL)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			queue_work_on(cpu, mm_percpu_wq, work);
@@ -1030,7 +1026,7 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
 	 * Is an smp_mb__after_atomic() still required here, before
 	 * folio_evictable() tests PageMlocked, to rule out the possibility
 	 * of stranding an evictable folio on an unevictable LRU?  I think
-	 * not, because munlock_page() only clears PageMlocked while the LRU
+	 * not, because __munlock_page() only clears PageMlocked while the LRU
 	 * lock is held.
 	 *
 	 * (That is not true of __page_cache_release(), and not necessarily
@@ -1043,7 +1039,14 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
 	} else {
 		folio_clear_active(folio);
 		folio_set_unevictable(folio);
-		folio->mlock_count = !!folio_test_mlocked(folio);
+		/*
+		 * folio->mlock_count = !!folio_test_mlocked(folio)?
+		 * But that leaves __mlock_page() in doubt whether another
+		 * actor has already counted the mlock or not.  Err on the
+		 * safe side, underestimate, let page reclaim fix it, rather
+		 * than leaving a page on the unevictable LRU indefinitely.
+		 */
+		folio->mlock_count = 0;
 		if (!was_unevictable)
 			__count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
 	}

From patchwork Sun Feb 6 21:49:34 2022
X-Patchwork-Submitter: Hugh Dickins
X-Patchwork-Id: 12736745
Date: Sun, 6 Feb 2022 13:49:34 -0800 (PST)
From: Hugh Dickins
To: Andrew Morton
cc: Michal Hocko, Vlastimil Babka, "Kirill A. Shutemov", Matthew Wilcox, David Hildenbrand, Alistair Popple, Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen, Shakeel Butt, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 11/13] mm/munlock: page migration needs mlock pagevec drained
In-Reply-To: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com>
Message-ID: <90c8962-d188-8687-dc70-628293316343@google.com>
References: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com>

Page migration of a VM_LOCKED page tends to fail, because when the old page is unmapped, it is put on the mlock pagevec with raised refcount, which then fails the freeze.

At first I thought this would be fixed by a local mlock_page_drain() at the upper rmap_walk() level - which would have nicely batched all the munlocks of that page; but tests show that the task can too easily move to another cpu, leaving pagevec residue behind which fails the migration.

So have try_to_migrate_one() drain the local pagevec after page_remove_rmap() from a VM_LOCKED vma; do the same in try_to_unmap_one(), whose TTU_IGNORE_MLOCK users would want the same treatment; and do the same in remove_migration_pte() - not important when successfully inserting a new page, but necessary when hoping to retry after failure.

Any new pagevec runs the risk of adding a new way of stranding, and we might discover other corners where mlock_page_drain() or lru_add_drain() would now help. If the mlock pagevec raises doubts, we can easily add a sysctl to tune its length to 1, which reverts to synchronous operation.
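Why the pagevec's reference fails the freeze can be shown with a toy userspace sketch (batch_add(), batch_drain() and freeze_refs() are invented names, standing in for get_page() onto the pagevec, release_pages() on drain, and migration's page_ref_freeze()):

#include <stdio.h>

struct toy_page { int refcount; };

static struct toy_page *batch[8];
static int batched;

static void batch_add(struct toy_page *page)
{
        page->refcount++;               /* like get_page() into the pagevec */
        batch[batched++] = page;
}

static void batch_drain(void)
{
        while (batched)
                batch[--batched]->refcount--;   /* like release_pages() */
}

/* Migration's freeze: succeeds only if no one else holds a reference */
static int freeze_refs(struct toy_page *page, int expected)
{
        return page->refcount == expected;
}

int main(void)
{
        struct toy_page page = { .refcount = 1 };       /* our reference */

        batch_add(&page);
        printf("freeze without drain: %s\n",
               freeze_refs(&page, 1) ? "ok" : "fails"); /* fails */

        batch_drain();
        printf("freeze after drain:   %s\n",
               freeze_refs(&page, 1) ? "ok" : "fails"); /* ok */
        return 0;
}

The complication the patch addresses is that the batch is per-cpu: draining on the cpu where the unmap happened is what the three call sites above arrange.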
Signed-off-by: Hugh Dickins
Acked-by: Vlastimil Babka
---
 mm/migrate.c | 2 ++
 mm/rmap.c    | 4 ++++
 2 files changed, 6 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index f4bcf1541b62..e7d0b68d5dcb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -251,6 +251,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 			page_add_file_rmap(new, vma, false);
 		set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 	}
+	if (vma->vm_flags & VM_LOCKED)
+		mlock_page_drain(smp_processor_id());
 
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, pvmw.address, pvmw.pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 5442a5c97a85..714bfdc72c7b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1656,6 +1656,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
 		page_remove_rmap(subpage, vma, PageHuge(page));
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_page_drain(smp_processor_id());
 		put_page(page);
 	}
 
@@ -1930,6 +1932,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
 		page_remove_rmap(subpage, vma, PageHuge(page));
+		if (vma->vm_flags & VM_LOCKED)
+			mlock_page_drain(smp_processor_id());
 		put_page(page);
 	}
 

From patchwork Sun Feb 6 21:51:45 2022
X-Patchwork-Submitter: Hugh Dickins
X-Patchwork-Id: 12736746
Date: Sun, 6 Feb 2022 13:51:45 -0800 (PST)
From: Hugh Dickins
To: Andrew Morton
cc: Michal Hocko, Vlastimil Babka, "Kirill A. Shutemov", Matthew Wilcox, David Hildenbrand, Alistair Popple, Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen, Shakeel Butt, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 12/13] mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH)
In-Reply-To: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com>
References: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com>

collapse_file() is using unmap_mapping_pages(1) on each small page found mapped, unlike others (reclaim, migration, splitting, memory-failure) who use try_to_unmap(). There are four advantages to try_to_unmap(): first, its TTU_IGNORE_MLOCK option now avoids leaving an mlocked page in the pagevec; second, its vma lookup uses i_mmap_lock_read() not i_mmap_lock_write(); third, it breaks out early if the page is not mapped everywhere it might be; fourth, its TTU_BATCH_FLUSH option can be used, as in page reclaim, to save up all the TLB flushing until all of the pages have been unmapped.

Wild guess: perhaps it was originally written to use try_to_unmap(), but hit the VM_BUG_ON_PAGE(page_mapped) after unmapping, because without TTU_SYNC it may skip page table locks; unmap_mapping_pages() never skips them, so switching to it would have fixed the issue. I did once hit that VM_BUG_ON_PAGE() since making this change: we could pass TTU_SYNC here, but I think just delete the check - the race is very rare, this is an ordinary small page so we don't need to be so paranoid about mapcount surprises, and the page_ref_freeze() just below already handles the case adequately.
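The fourth advantage, batched TLB flushing, amounts to the pattern in this deliberately simplified sketch (unmap_page() and flush_batch() are invented stand-ins, not the kernel's TLB-flush machinery):

#include <stdbool.h>
#include <stdio.h>

static bool flush_pending;

/* Unmap one page; with "batch" set, only note that a flush is owed */
static void unmap_page(int page, bool batch)
{
        printf("unmap page %d\n", page);
        if (batch)
                flush_pending = true;   /* like TTU_BATCH_FLUSH */
        else
                printf("TLB flush for page %d\n", page);
}

/* Like try_to_unmap_flush(): pay the flush cost once for the batch */
static void flush_batch(void)
{
        if (flush_pending) {
                printf("one TLB flush for the whole batch\n");
                flush_pending = false;
        }
}

int main(void)
{
        for (int page = 0; page < 4; page++)
                unmap_page(page, true);
        flush_batch();  /* must happen before copying, as the patch notes */
        return 0;
}

The corresponding constraint in the patch is that try_to_unmap_flush() must be called before the collapsed pages are copied, since until then stale TLB entries could still allow writes to the old pages.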
Signed-off-by: Hugh Dickins
---
 mm/khugepaged.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d5e387c58bde..e0883a33efd6 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1829,13 +1829,12 @@ static void collapse_file(struct mm_struct *mm,
 		}
 
 		if (page_mapped(page))
-			unmap_mapping_pages(mapping, index, 1, false);
+			try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_BATCH_FLUSH);
 
 		xas_lock_irq(&xas);
 		xas_set(&xas, index);
 
 		VM_BUG_ON_PAGE(page != xas_load(&xas), page);
-		VM_BUG_ON_PAGE(page_mapped(page), page);
 
 		/*
 		 * The page is expected to have page_count() == 3:
@@ -1899,6 +1898,13 @@ static void collapse_file(struct mm_struct *mm,
 	}
 	xas_unlock_irq(&xas);
 xa_unlocked:
+	/*
+	 * If collapse is successful, flush must be done now before copying.
+	 * If collapse is unsuccessful, does flush actually need to be done?
+	 * Do it anyway, to clear the state.
+	 */
+	try_to_unmap_flush();
+
 	if (result == SCAN_SUCCEED) {
 		struct page *page, *tmp;

From patchwork Sun Feb 6 21:53:27 2022
X-Patchwork-Submitter: Hugh Dickins
X-Patchwork-Id: 12736747
Date: Sun, 6 Feb 2022 13:53:27 -0800 (PST)
From: Hugh Dickins
To: Andrew Morton
cc: Michal Hocko, Vlastimil Babka, "Kirill A. Shutemov", Matthew Wilcox, David Hildenbrand, Alistair Popple, Johannes Weiner, Rik van Riel, Suren Baghdasaryan, Yu Zhao, Greg Thelen, Shakeel Butt, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH 13/13] mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP
In-Reply-To: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com>
Message-ID: <6df82db0-0e6-ef83-6925-4fa3f834133d@google.com>
References: <8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com>

4.8 commit 7751b2da6be0 ("vmscan: split file huge pages before paging them out") inserted a split_huge_page_to_list() into shrink_page_list() without considering the mlock case: no problem if the page has already been marked as Mlocked (the !page_evictable check much higher up will have skipped all this), but it has always been the case that races or omissions in setting Mlocked can rely on page reclaim to detect this and correct it before actually reclaiming - and that remains so, but what a shame if a hugepage is needlessly split before discovering it.

It is surprising that page_check_references() returns PAGEREF_RECLAIM when VM_LOCKED, but there was a good reason for that: try_to_unmap_one() is where the condition is detected and corrected; and until now it could not be done in page_referenced_one(), because that does not always have the page locked. Now that mlock's requirement for page lock has gone, copy try_to_unmap_one()'s mlock restoration into page_referenced_one(), and let page_check_references() return PAGEREF_ACTIVATE in this case.

But page_referenced_one() may find a pte mapping one part of a hugepage: what hold should a pte mapped in a VM_LOCKED area exert over the entire huge page? That's debatable.
The approach taken here is to treat that pte mapping in page_referenced_one() as if not VM_LOCKED; and if no VM_LOCKED pmd mapping is found later in the walk, and lack of reference permits, then PAGEREF_RECLAIM takes it to attempted splitting as before.

Signed-off-by: Hugh Dickins
---
 mm/rmap.c   | 7 +++++--
 mm/vmscan.c | 6 +++---
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 714bfdc72c7b..c7921c102bc0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -812,7 +812,10 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
 	while (page_vma_mapped_walk(&pvmw)) {
 		address = pvmw.address;
 
-		if (vma->vm_flags & VM_LOCKED) {
+		if ((vma->vm_flags & VM_LOCKED) &&
+		    (!PageTransCompound(page) || !pvmw.pte)) {
+			/* Restore the mlock which got missed */
+			mlock_vma_page(page, vma, !pvmw.pte);
 			page_vma_mapped_walk_done(&pvmw);
 			pra->vm_flags |= VM_LOCKED;
 			return false; /* To break the loop */
@@ -851,7 +854,7 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
 
 	if (referenced) {
 		pra->referenced++;
-		pra->vm_flags |= vma->vm_flags;
+		pra->vm_flags |= vma->vm_flags & ~VM_LOCKED;
 	}
 
 	if (!pra->mapcount)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 090bfb605ecf..a160efba3c73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1386,11 +1386,11 @@ static enum page_references page_check_references(struct page *page,
 	referenced_page = TestClearPageReferenced(page);
 
 	/*
-	 * Mlock lost the isolation race with us. Let try_to_unmap()
-	 * move the page to the unevictable list.
+	 * The supposedly reclaimable page was found to be in a VM_LOCKED vma.
+	 * Let the page, now marked Mlocked, be moved to the unevictable list.
 	 */
 	if (vm_flags & VM_LOCKED)
-		return PAGEREF_RECLAIM;
+		return PAGEREF_ACTIVATE;
 
 	if (referenced_ptes) {
 		/*