From patchwork Tue Feb 16 17:03:47 2021
From: Minchan Kim
To: Andrew Morton
Cc: linux-mm, LKML, cgoldswo@codeaurora.org, linux-fsdevel@vger.kernel.org, willy@infradead.org, mhocko@suse.com, david@redhat.com, vbabka@suse.cz, viro@zeniv.linux.org.uk, joaodias@google.com
Subject: [RFC 1/2] mm: disable LRU pagevec during the migration temporarily
Date: Tue, 16 Feb 2021 09:03:47 -0800
Message-Id: <20210216170348.1513483-1-minchan@kernel.org>

An LRU pagevec holds a refcount on each of its pages until the pagevec is drained.
That can prevent migration, since the page's refcount is then higher than what the migration logic expects. To mitigate this, callers of migrate_pages() drain the LRU pagevecs via migrate_prep() or lru_add_drain_all() before calling migrate_pages(). However, that is not enough: pages that enter a pagevec after the drain can remain there and keep blocking page migration. Some callers of migrate_pages() retry with another LRU drain, so the page usually migrates on the next attempt, but this is still fragile in that it does not close the fundamental race between pages entering pagevecs and migration, so the migration failure can ultimately cause a contiguous memory allocation failure.

The other concern is that migration keeps retrying until the pagevecs are drained. During that time, migration repeatedly allocates a target page, unmaps the source page from the page tables of processes, discovers the failure, restores the original page into the page tables, and frees the target page, which is also wasteful.

To solve this, this patch tries to close the race rather than relying on retries and luck. The idea is to introduce a migration-in-progress tracking count, with a barrier (a per-CPU work flush) after updating the count, to minimize read-side overhead. migrate_prep() increments migrate_pending_count under a lock and then runs a no-op work item on every CPU to guarantee that every CPU sees the up-to-date value of migrate_pending_count. It then drains the pagevecs via lru_add_drain_all(). From that point on, no LRU page can reach a pagevec, because the LRU handling functions skip batching while migrate_pending() reports a migration in progress (IOW, the pagevecs stay empty until migration is done). Every caller of migrate_prep() must call migrate_finish() in pair to decrement the migration tracking count.

With migrate_pending(), places that are prone to causing migration failures can detect an in-progress migration and adjust their behavior to help it (e.g., bh_lru_install() [1]) in the future.
[1] https://lore.kernel.org/linux-mm/c083b0ab6e410e33ca880d639f90ef4f6f3b33ff.1613020616.git.cgoldswo@codeaurora.org/

Signed-off-by: Minchan Kim
---
 include/linux/migrate.h |  3 +++
 mm/mempolicy.c          |  6 +++++
 mm/migrate.c            | 55 ++++++++++++++++++++++++++++++++++++++---
 mm/page_alloc.c         |  3 +++
 mm/swap.c               | 24 +++++++++++++-----
 5 files changed, 82 insertions(+), 9 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3a389633b68f..047d5358fe0d 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -46,6 +46,8 @@ extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
 extern void putback_movable_page(struct page *page);
 
 extern void migrate_prep(void);
+extern void migrate_finish(void);
+extern bool migrate_pending(void);
 extern void migrate_prep_local(void);
 extern void migrate_page_states(struct page *newpage, struct page *page);
 extern void migrate_page_copy(struct page *newpage, struct page *page);
@@ -67,6 +69,7 @@ static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
 { return -EBUSY; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
+static inline void migrate_finish(void) {}
 static inline int migrate_prep_local(void) { return -ENOSYS; }
 
 static inline void migrate_page_states(struct page *newpage, struct page *page)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 6961238c7ef5..46d9986c7bf0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1208,6 +1208,8 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 			break;
 	}
 	mmap_read_unlock(mm);
+	migrate_finish();
+
 	if (err < 0)
 		return err;
 	return busy;
@@ -1371,6 +1373,10 @@ static long do_mbind(unsigned long start, unsigned long len,
 	mmap_write_unlock(mm);
 mpol_out:
 	mpol_put(new);
+
+	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+		migrate_finish();
+
 	return err;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index a69da8aaeccd..d70e113eee04 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -57,6 +57,22 @@
 #include "internal.h"
 
+static DEFINE_SPINLOCK(migrate_pending_lock);
+static unsigned long migrate_pending_count;
+static DEFINE_PER_CPU(struct work_struct, migrate_pending_work);
+
+static void read_migrate_pending(struct work_struct *work)
+{
+	/* TODO: not sure it's needed */
+	unsigned long dummy = __READ_ONCE(migrate_pending_count);
+	(void)dummy;
+}
+
+bool migrate_pending(void)
+{
+	return migrate_pending_count;
+}
+
 /*
  * migrate_prep() needs to be called before we start compiling a list of pages
  * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
@@ -64,11 +80,27 @@
  */
 void migrate_prep(void)
 {
+	unsigned int cpu;
+
+	spin_lock(&migrate_pending_lock);
+	migrate_pending_count++;
+	spin_unlock(&migrate_pending_lock);
+
+	for_each_online_cpu(cpu) {
+		struct work_struct *work = &per_cpu(migrate_pending_work, cpu);
+
+		INIT_WORK(work, read_migrate_pending);
+		queue_work_on(cpu, mm_percpu_wq, work);
+	}
+
+	for_each_online_cpu(cpu)
+		flush_work(&per_cpu(migrate_pending_work, cpu));
+	/*
+	 * From now on, every online cpu will see the up-to-date value of
+	 * migrate_pending_count.
+	 */
 	/*
 	 * Clear the LRU lists so pages can be isolated.
-	 * Note that pages may be moved off the LRU after we have
-	 * drained them. Those pages will fail to migrate like other
-	 * pages that may be busy.
 	 */
 	lru_add_drain_all();
 }
@@ -79,6 +111,22 @@ void migrate_prep_local(void)
 	lru_add_drain();
 }
 
+void migrate_finish(void)
+{
+	int cpu;
+
+	spin_lock(&migrate_pending_lock);
+	migrate_pending_count--;
+	spin_unlock(&migrate_pending_lock);
+
+	for_each_online_cpu(cpu) {
+		struct work_struct *work = &per_cpu(migrate_pending_work, cpu);
+
+		INIT_WORK(work, read_migrate_pending);
+		queue_work_on(cpu, mm_percpu_wq, work);
+	}
+}
+
 int isolate_movable_page(struct page *page, isolate_mode_t mode)
 {
 	struct address_space *mapping;
@@ -1837,6 +1885,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 	if (err >= 0)
 		err = err1;
 out:
+	migrate_finish();
 	return err;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6446778cbc6b..e4cb959f64dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8493,6 +8493,9 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 		ret = migrate_pages(&cc->migratepages, alloc_migration_target,
 				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
 	}
+
+	migrate_finish();
+
 	if (ret < 0) {
 		putback_movable_pages(&cc->migratepages);
 		return ret;
diff --git a/mm/swap.c b/mm/swap.c
index 31b844d4ed94..e42c4b4bf2b3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -36,6 +36,7 @@
 #include
 #include
 #include
+#include
 
 #include "internal.h"
 
@@ -235,6 +236,17 @@ static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
 	}
 }
 
+/* return true if pagevec needs flush */
+static bool pagevec_add_and_need_flush(struct pagevec *pvec, struct page *page)
+{
+	bool ret = false;
+
+	if (!pagevec_add(pvec, page) || PageCompound(page) || migrate_pending())
+		ret = true;
+
+	return ret;
+}
+
 /*
  * Writeback is about to end against a page which has been marked for immediate
  * reclaim.
If it still appears to be reclaimable, move it to the tail of the
@@ -252,7 +264,7 @@ void rotate_reclaimable_page(struct page *page)
 		get_page(page);
 		local_lock_irqsave(&lru_rotate.lock, flags);
 		pvec = this_cpu_ptr(&lru_rotate.pvec);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, pagevec_move_tail_fn);
 		local_unlock_irqrestore(&lru_rotate.lock, flags);
 	}
@@ -343,7 +355,7 @@ static void activate_page(struct page *page)
 		local_lock(&lru_pvecs.lock);
 		pvec = this_cpu_ptr(&lru_pvecs.activate_page);
 		get_page(page);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, __activate_page);
 		local_unlock(&lru_pvecs.lock);
 	}
@@ -458,7 +470,7 @@ void lru_cache_add(struct page *page)
 	get_page(page);
 	local_lock(&lru_pvecs.lock);
 	pvec = this_cpu_ptr(&lru_pvecs.lru_add);
-	if (!pagevec_add(pvec, page) || PageCompound(page))
+	if (pagevec_add_and_need_flush(pvec, page))
 		__pagevec_lru_add(pvec);
 	local_unlock(&lru_pvecs.lock);
 }
@@ -654,7 +666,7 @@ void deactivate_file_page(struct page *page)
 		local_lock(&lru_pvecs.lock);
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
@@ -676,7 +688,7 @@ void deactivate_page(struct page *page)
 		local_lock(&lru_pvecs.lock);
 		pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
 		get_page(page);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, lru_deactivate_fn);
 		local_unlock(&lru_pvecs.lock);
 	}
@@ -698,7 +710,7 @@ void mark_page_lazyfree(struct page *page)
 		local_lock(&lru_pvecs.lock);
 		pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
 		get_page(page);
-		if (!pagevec_add(pvec, page) || PageCompound(page))
+		if (pagevec_add_and_need_flush(pvec, page))
 			pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 		local_unlock(&lru_pvecs.lock);
 }

From patchwork Tue Feb 16 17:03:48 2021
From: Minchan Kim
To: Andrew Morton
Cc: linux-mm, LKML, cgoldswo@codeaurora.org, linux-fsdevel@vger.kernel.org, willy@infradead.org, mhocko@suse.com, david@redhat.com, vbabka@suse.cz, viro@zeniv.linux.org.uk, joaodias@google.com
Subject: [RFC 2/2] mm: fs: Invalidate BH LRU during page migration
Date: Tue, 16 Feb 2021 09:03:48 -0800
Message-Id: <20210216170348.1513483-2-minchan@kernel.org>
In-Reply-To:
<20210216170348.1513483-1-minchan@kernel.org>
References: <20210216170348.1513483-1-minchan@kernel.org>

Pages containing buffer_heads that are in one of the per-CPU buffer_head LRU caches will be pinned and thus cannot be migrated. This can prevent CMA allocations from succeeding; such allocations are often used on platforms with co-processors (such as a DSP) that can only use physically contiguous memory. It can also prevent memory hot-unplugging from succeeding, which involves migrating at least MIN_MEMORY_BLOCK_SIZE bytes of memory, ranging from 8 MiB to 1 GiB depending on the architecture in use.

Correspondingly, invalidate the BH LRU caches before a migration starts, and stop any buffer_head from being cached in the LRU caches until migration has finished.

Signed-off-by: Chris Goldsworthy
Signed-off-by: Minchan Kim
---
 fs/buffer.c                 | 13 +++++++++++--
 include/linux/buffer_head.h |  2 ++
 mm/swap.c                   |  5 ++++-
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 96c7604f69b3..de62e75d0ed0 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -45,6 +45,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1301,6 +1302,14 @@ static void bh_lru_install(struct buffer_head *bh)
 	int i;
 
 	check_irqs_on();
+	/*
+	 * A buffer_head in the bh_lru holds a refcount on its page until it
+	 * is invalidated, which causes page migration to fail. Skip putting
+	 * upcoming bh into bh_lru until migration is done.
+	 */
+	if (migrate_pending())
+		return;
+
 	bh_lru_lock();
 	b = this_cpu_ptr(&bh_lrus);
@@ -1446,7 +1455,7 @@ EXPORT_SYMBOL(__bread_gfp);
  * This doesn't race because it runs in each cpu either in irq
  * or with preempt disabled.
  */
-static void invalidate_bh_lru(void *arg)
+void invalidate_bh_lru(void *arg)
 {
 	struct bh_lru *b = &get_cpu_var(bh_lrus);
 	int i;
@@ -1458,7 +1467,7 @@ static void invalidate_bh_lru(void *arg)
 	put_cpu_var(bh_lrus);
 }
 
-static bool has_bh_in_lru(int cpu, void *dummy)
+bool has_bh_in_lru(int cpu, void *dummy)
 {
 	struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
 	int i;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 6b47f94378c5..3d98bdabaac9 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -194,6 +194,8 @@ void __breadahead_gfp(struct block_device *, sector_t block, unsigned int size,
 struct buffer_head *__bread_gfp(struct block_device *,
 				sector_t block, unsigned size, gfp_t gfp);
 void invalidate_bh_lrus(void);
+void invalidate_bh_lru(void *);
+bool has_bh_in_lru(int cpu, void *dummy);
 struct buffer_head *alloc_buffer_head(gfp_t gfp_flags);
 void free_buffer_head(struct buffer_head * bh);
 void unlock_buffer(struct buffer_head *bh);
diff --git a/mm/swap.c b/mm/swap.c
index e42c4b4bf2b3..14faf558347b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -37,6 +37,7 @@
 #include
 #include
 #include
+#include
 
 #include "internal.h"
 
@@ -641,6 +642,7 @@ void lru_add_drain_cpu(int cpu)
 		pagevec_lru_move_fn(pvec, lru_lazyfree_fn);
 
 	activate_page_drain(cpu);
+	invalidate_bh_lru(NULL);
 }
 
 /**
@@ -827,7 +829,8 @@ void lru_add_drain_all(void)
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
 		    pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) ||
-		    need_activate_page_drain(cpu)) {
+		    need_activate_page_drain(cpu) ||
+		    has_bh_in_lru(cpu, NULL)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			queue_work_on(cpu, mm_percpu_wq, work);
 			__cpumask_set_cpu(cpu, &has_work);