From patchwork Thu Aug 29 16:56:04 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783447
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 01/17] mm: factor out large folio handling from
 folio_order() into folio_large_order()
Date: Thu, 29 Aug 2024 18:56:04 +0200
Message-ID: <20240829165627.2256514-2-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

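Let's factor the large folio handling out of folio_order() into a simple
helper function, folio_large_order(), and use it in compound_order(),
folio_nr_pages() and compound_nr() as well. This helper will also come
in handy when working with code where we know that our folio is large.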

Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lance Yang <ioworker0@gmail.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)
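
For illustration only (not part of the patch): a minimal userspace sketch
of how the checked and the unchecked helper relate. The struct mock_folio
type, its "large" member and all mock_ prefixed names are assumptions made
for this example; only the "_flags_1 & 0xff" encoding is taken from the
diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct folio; this is NOT the kernel layout. */
struct mock_folio {
	bool large;		/* stands in for folio_test_large() */
	unsigned long _flags_1;	/* low 8 bits hold the order */
};

/* Unchecked variant: the caller must already know the folio is large. */
static inline unsigned int mock_folio_large_order(const struct mock_folio *folio)
{
	return folio->_flags_1 & 0xff;
}

/* Checked variant: small folios always report order 0. */
static inline unsigned int mock_folio_order(const struct mock_folio *folio)
{
	if (!folio->large)
		return 0;
	return mock_folio_large_order(folio);
}
```

Callers that already tested for a large folio can use the unchecked
variant and skip the branch; everyone else keeps using the checked one.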

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b31d4bdd65ad5..3c6270f87bdc3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1071,6 +1071,11 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
 struct mmu_gather;
 struct inode;
 
+static inline unsigned int folio_large_order(const struct folio *folio)
+{
+	return folio->_flags_1 & 0xff;
+}
+
 /*
  * compound_order() can be called without holding a reference, which means
  * that niceties like page_folio() don't work.  These callers should be
@@ -1084,7 +1089,7 @@ static inline unsigned int compound_order(struct page *page)
 
 	if (!test_bit(PG_head, &folio->flags))
 		return 0;
-	return folio->_flags_1 & 0xff;
+	return folio_large_order(folio);
 }
 
 /**
@@ -1100,7 +1105,7 @@ static inline unsigned int folio_order(const struct folio *folio)
 {
 	if (!folio_test_large(folio))
 		return 0;
-	return folio->_flags_1 & 0xff;
+	return folio_large_order(folio);
 }
 
 #include <linux/huge_mm.h>
@@ -2035,7 +2040,7 @@ static inline long folio_nr_pages(const struct folio *folio)
 #ifdef CONFIG_64BIT
 	return folio->_folio_nr_pages;
 #else
-	return 1L << (folio->_flags_1 & 0xff);
+	return 1L << folio_large_order(folio);
 #endif
 }
 
@@ -2060,7 +2065,7 @@ static inline unsigned long compound_nr(struct page *page)
 #ifdef CONFIG_64BIT
 	return folio->_folio_nr_pages;
 #else
-	return 1L << (folio->_flags_1 & 0xff);
+	return 1L << folio_large_order(folio);
 #endif
 }
 

From patchwork Thu Aug 29 16:56:05 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783448
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 02/17] mm: factor out large folio handling from
 folio_nr_pages() into folio_large_nr_pages()
Date: Thu, 29 Aug 2024 18:56:05 +0200
Message-ID: <20240829165627.2256514-3-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

Let's factor the large folio handling out of folio_nr_pages() into a
simple helper function, folio_large_nr_pages(). This helper will also
come in handy when working with code where we know that our folio is
large.

Make use of it in internal.h and mm.h, where applicable.

While at it, let's consistently return a "long" value from all these
similar functions. Note that we cannot use "unsigned int" (even though
_folio_nr_pages is of that type), because it would break callers that
negate the result, as in "-folio_nr_pages()". Either "int" or
"unsigned long" would work as well.

Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h | 27 ++++++++++++++-------------
 mm/internal.h      |  2 +-
 2 files changed, 15 insertions(+), 14 deletions(-)
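
For illustration only (not part of the patch): a small userspace sketch
of why an "unsigned int" return type would break callers that negate the
result. The mock_ prefixed functions and the page count of 4 are
assumptions made for this example. The comments describe the behavior on
targets where "long" is wider than "unsigned int" (such as LP64).

```c
#include <assert.h>

/* Hypothetical helpers returning the same page count with different types. */
static inline unsigned int mock_nr_pages_uint(void)
{
	return 4;
}

static inline long mock_nr_pages_long(void)
{
	return 4;
}

/*
 * The negation happens in unsigned int arithmetic and wraps around before
 * the result is widened to long, so this is a large positive value when
 * long is wider than unsigned int.
 */
static inline long mock_delta_uint(void)
{
	return -mock_nr_pages_uint();
}

/* The negation happens in signed long arithmetic: the result is -4. */
static inline long mock_delta_long(void)
{
	return -mock_nr_pages_long();
}
```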

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3c6270f87bdc3..fa8b6ce54235c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1076,6 +1076,15 @@ static inline unsigned int folio_large_order(const struct folio *folio)
 	return folio->_flags_1 & 0xff;
 }
 
+static inline long folio_large_nr_pages(const struct folio *folio)
+{
+#ifdef CONFIG_64BIT
+	return folio->_folio_nr_pages;
+#else
+	return 1L << folio_large_order(folio);
+#endif
+}
+
 /*
  * compound_order() can be called without holding a reference, which means
  * that niceties like page_folio() don't work.  These callers should be
@@ -2037,11 +2046,7 @@ static inline long folio_nr_pages(const struct folio *folio)
 {
 	if (!folio_test_large(folio))
 		return 1;
-#ifdef CONFIG_64BIT
-	return folio->_folio_nr_pages;
-#else
-	return 1L << folio_large_order(folio);
-#endif
+	return folio_large_nr_pages(folio);
 }
 
 /* Only hugetlbfs can allocate folios larger than MAX_ORDER */
@@ -2056,24 +2061,20 @@ static inline long folio_nr_pages(const struct folio *folio)
  * page.  compound_nr() can be called on a tail page, and is defined to
  * return 1 in that case.
  */
-static inline unsigned long compound_nr(struct page *page)
+static inline long compound_nr(struct page *page)
 {
 	struct folio *folio = (struct folio *)page;
 
 	if (!test_bit(PG_head, &folio->flags))
 		return 1;
-#ifdef CONFIG_64BIT
-	return folio->_folio_nr_pages;
-#else
-	return 1L << folio_large_order(folio);
-#endif
+	return folio_large_nr_pages(folio);
 }
 
 /**
  * thp_nr_pages - The number of regular pages in this huge page.
  * @page: The head page of a huge page.
  */
-static inline int thp_nr_pages(struct page *page)
+static inline long thp_nr_pages(struct page *page)
 {
 	return folio_nr_pages((struct folio *)page);
 }
@@ -2183,7 +2184,7 @@ static inline bool folio_likely_mapped_shared(struct folio *folio)
 		return false;
 
 	/* If any page is mapped more than once we treat it "mapped shared". */
-	if (folio_entire_mapcount(folio) || mapcount > folio_nr_pages(folio))
+	if (folio_entire_mapcount(folio) || mapcount > folio_large_nr_pages(folio))
 		return true;
 
 	/* Let's guess based on the first subpage. */
diff --git a/mm/internal.h b/mm/internal.h
index 44c8dec1f0d75..97d6b94429ebd 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -159,7 +159,7 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
 		bool *any_writable, bool *any_young, bool *any_dirty)
 {
-	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
+	unsigned long folio_end_pfn = folio_pfn(folio) + folio_large_nr_pages(folio);
 	const pte_t *end_ptep = start_ptep + max_nr;
 	pte_t expected_pte, *ptep;
 	bool writable, young, dirty;

From patchwork Thu Aug 29 16:56:06 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783449
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 03/17] mm/rmap: use folio_large_nr_pages() in add/remove
 functions
Date: Thu, 29 Aug 2024 18:56:06 +0200
Message-ID: <20240829165627.2256514-4-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

Let's just use the "large" variant in code where we are sure that we
have a large folio in our hands: this way we are sure that we don't
perform any unnecessary "large" checks.

While at it, convert the VM_BUG_ON_VMA to a VM_WARN_ON_ONCE.

Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/rmap.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)
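
For illustration only (not part of the patch): a userspace sketch of the
reordered range check in folio_add_new_anon_rmap(). "nr" starts at 1 and
is only replaced by the large page count on the large folio paths, so the
check has to run after that. struct mock_vma, MOCK_PAGE_SHIFT and the
mock_ prefixed functions are assumptions made for this example, and the
shift is done in "unsigned long" here, which differs from the "int" shift
in the diff.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for struct vm_area_struct. */
struct mock_vma {
	unsigned long vm_start;
	unsigned long vm_end;	/* exclusive end address */
};

/* True if [address, address + nr pages) is not fully inside the VMA. */
static inline bool mock_range_outside_vma(const struct mock_vma *vma,
					  unsigned long address, long nr)
{
	return address < vma->vm_start ||
	       address + ((unsigned long)nr << MOCK_PAGE_SHIFT) > vma->vm_end;
}

/* Mirrors the control flow: nr defaults to 1, large paths override it. */
static inline bool mock_add_new_rmap_would_warn(const struct mock_vma *vma,
						unsigned long address,
						bool large, long large_nr_pages)
{
	long nr = 1;

	if (large)
		nr = large_nr_pages;

	/* The check runs last, once nr has its final value. */
	return mock_range_outside_vma(vma, address, nr);
}
```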

diff --git a/mm/rmap.c b/mm/rmap.c
index 78529cf0fd668..6594c122a5895 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1184,7 +1184,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
 		if (first) {
 			nr = atomic_add_return_relaxed(ENTIRELY_MAPPED, mapped);
 			if (likely(nr < ENTIRELY_MAPPED + ENTIRELY_MAPPED)) {
-				*nr_pmdmapped = folio_nr_pages(folio);
+				*nr_pmdmapped = folio_large_nr_pages(folio);
 				nr = *nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED);
 				/* Raced ahead of a remove and another add? */
 				if (unlikely(nr < 0))
@@ -1418,14 +1418,11 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
 void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 		unsigned long address, rmap_t flags)
 {
-	const int nr = folio_nr_pages(folio);
 	const bool exclusive = flags & RMAP_EXCLUSIVE;
-	int nr_pmdmapped = 0;
+	int nr = 1, nr_pmdmapped = 0;
 
 	VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
 	VM_WARN_ON_FOLIO(!exclusive && !folio_test_locked(folio), folio);
-	VM_BUG_ON_VMA(address < vma->vm_start ||
-			address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
 
 	/*
 	 * VM_DROPPABLE mappings don't swap; instead they're just dropped when
@@ -1443,6 +1440,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 	} else if (!folio_test_pmd_mappable(folio)) {
 		int i;
 
+		nr = folio_large_nr_pages(folio);
 		for (i = 0; i < nr; i++) {
 			struct page *page = folio_page(folio, i);
 
@@ -1456,6 +1454,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 		atomic_set(&folio->_large_mapcount, nr - 1);
 		atomic_set(&folio->_nr_pages_mapped, nr);
 	} else {
+		nr = folio_large_nr_pages(folio);
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_entire_mapcount, 0);
 		/* increment count (starts at -1) */
@@ -1466,6 +1465,9 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 		nr_pmdmapped = nr;
 	}
 
+	VM_WARN_ON_ONCE(address < vma->vm_start ||
+			address + (nr << PAGE_SHIFT) > vma->vm_end);
+
 	__folio_mod_stat(folio, nr, nr_pmdmapped);
 	mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON, 1);
 }
@@ -1557,7 +1559,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 		if (last) {
 			nr = atomic_sub_return_relaxed(ENTIRELY_MAPPED, mapped);
 			if (likely(nr < ENTIRELY_MAPPED)) {
-				nr_pmdmapped = folio_nr_pages(folio);
+				nr_pmdmapped = folio_large_nr_pages(folio);
 				nr = nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED);
 				/* Raced ahead of another remove and an add? */
 				if (unlikely(nr < 0))

From patchwork Thu Aug 29 16:56:07 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783450
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 04/17] mm: let _folio_nr_pages overlay memcg_data in first
 tail page
Date: Thu, 29 Aug 2024 18:56:07 +0200
Message-ID: <20240829165627.2256514-5-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

Let's free up some more of the "unconditionally available on 64BIT"
space in order-1 folios by letting _folio_nr_pages overlay memcg_data in
the first tail page (second folio page). Consequently, we have the
optimization now whenever we have CONFIG_MEMCG, independent of 64BIT.

We have to make sure that page->memcg on tail pages does not return
"surprises". page_memcg_check() already properly refuses PageTail().
Let's do that earlier in print_page_owner_memcg() to avoid printing
wrong "Slab cache page" information. No other code should touch that
field on tail pages of compound pages.

Reset the "_nr_pages" to 0 when splitting folios, or when freeing them
back to the buddy (to avoid false page->memcg_data "bad page" reports).

Note that in __split_huge_page(), folio_nr_pages() already stops
working as soon as we start messing with the subpages.

Most kernel configs should have at least CONFIG_MEMCG enabled, even if
memcg is disabled at runtime. A 64-byte "struct memmap" is what we
usually have on 64BIT.

While at it, rename "_folio_nr_pages" to "_nr_pages".

Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h       |  4 ++--
 include/linux/mm_types.h | 30 ++++++++++++++++++++++--------
 mm/huge_memory.c         |  8 ++++++++
 mm/internal.h            |  4 ++--
 mm/page_alloc.c          |  6 +++++-
 mm/page_owner.c          |  2 +-
 6 files changed, 40 insertions(+), 14 deletions(-)
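
For illustration only (not part of the patch): a simplified userspace
sketch of the overlay idea and of why "_nr_pages" has to be reset before
a tail page is looked at as a plain page again. struct mock_page, struct
mock_tail, union mock_second_page, MOCK_MATCH and
mock_memcg_data_seen() are assumptions made for this example; the real
struct page and struct folio have more members and config-dependent
layouts. Plain "int" stands in for atomic_t.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified view of a struct page. */
struct mock_page {
	unsigned long flags;
	unsigned long compound_head;
	unsigned long words[4];
	int _mapcount;
	int _refcount;
	unsigned long memcg_data;
};

/* Hypothetical view of the same memory as the first tail page of a folio. */
struct mock_tail {
	unsigned long _flags_1;
	unsigned long _head_1;
	union {
		struct {
			int _large_mapcount;
			int _entire_mapcount;
			int _nr_pages_mapped;
			int _pincount;
		};
		unsigned long _usable_1[4];
	};
	int _mapcount_1;
	int _refcount_1;
	unsigned int _nr_pages;	/* overlays memcg_data */
};

/* Same idea as FOLIO_MATCH(): fail the build if the offsets diverge. */
#define MOCK_MATCH(pg, fl)						\
	static_assert(offsetof(struct mock_page, pg) ==			\
		      offsetof(struct mock_tail, fl), "layout mismatch: " #pg)
MOCK_MATCH(flags, _flags_1);
MOCK_MATCH(compound_head, _head_1);
MOCK_MATCH(_mapcount, _mapcount_1);
MOCK_MATCH(_refcount, _refcount_1);
MOCK_MATCH(memcg_data, _nr_pages);
#undef MOCK_MATCH

union mock_second_page {
	struct mock_page page;
	struct mock_tail tail;
};

/* What a reader of page->memcg_data would see for a given _nr_pages. */
static inline unsigned long mock_memcg_data_seen(unsigned int nr_pages)
{
	union mock_second_page p;

	memset(&p, 0, sizeof(p));
	p.tail._nr_pages = nr_pages;
	return p.page.memcg_data;
}
```

A non-zero "_nr_pages" shows up as a non-zero "memcg_data" when the same
memory is read through the page view, which is what the resets on split
and on free avoid.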

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa8b6ce54235c..98411e53da916 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1078,8 +1078,8 @@ static inline unsigned int folio_large_order(const struct folio *folio)
 
 static inline long folio_large_nr_pages(const struct folio *folio)
 {
-#ifdef CONFIG_64BIT
-	return folio->_folio_nr_pages;
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+	return folio->_nr_pages;
 #else
 	return 1L << folio_large_order(folio);
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e3bdf8e38bca..480548552ea54 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -283,6 +283,11 @@ typedef struct {
 	unsigned long val;
 } swp_entry_t;
 
+#if defined(CONFIG_MEMCG) || defined(CONFIG_SLAB_OBJ_EXT)
+/* We have some extra room after the refcount in tail pages. */
+#define NR_PAGES_IN_LARGE_FOLIO
+#endif
+
 /**
  * struct folio - Represents a contiguous set of bytes.
  * @flags: Identical to the page flags.
@@ -305,7 +310,7 @@ typedef struct {
  * @_large_mapcount: Do not use directly, call folio_mapcount().
  * @_nr_pages_mapped: Do not use outside of rmap and debug code.
  * @_pincount: Do not use directly, call folio_maybe_dma_pinned().
- * @_folio_nr_pages: Do not use directly, call folio_nr_pages().
+ * @_nr_pages: Do not use directly, call folio_nr_pages().
  * @_hugetlb_subpool: Do not use directly, use accessor in hugetlb.h.
  * @_hugetlb_cgroup: Do not use directly, use accessor in hugetlb_cgroup.h.
  * @_hugetlb_cgroup_rsvd: Do not use directly, use accessor in hugetlb_cgroup.h.
@@ -366,13 +371,20 @@ struct folio {
 			unsigned long _flags_1;
 			unsigned long _head_1;
 	/* public: */
-			atomic_t _large_mapcount;
-			atomic_t _entire_mapcount;
-			atomic_t _nr_pages_mapped;
-			atomic_t _pincount;
-#ifdef CONFIG_64BIT
-			unsigned int _folio_nr_pages;
-#endif
+			union {
+				struct {
+					atomic_t _large_mapcount;
+					atomic_t _entire_mapcount;
+					atomic_t _nr_pages_mapped;
+					atomic_t _pincount;
+				};
+				unsigned long _usable_1[4];
+			};
+			atomic_t _mapcount_1;
+			atomic_t _refcount_1;
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+			unsigned int _nr_pages;
+#endif /* NR_PAGES_IN_LARGE_FOLIO */
 	/* private: the union with struct page is transitional */
 		};
 		struct page __page_1;
@@ -424,6 +436,8 @@ FOLIO_MATCH(_last_cpupid, _last_cpupid);
 			offsetof(struct page, pg) + sizeof(struct page))
 FOLIO_MATCH(flags, _flags_1);
 FOLIO_MATCH(compound_head, _head_1);
+FOLIO_MATCH(_mapcount, _mapcount_1);
+FOLIO_MATCH(_refcount, _refcount_1);
 #undef FOLIO_MATCH
 #define FOLIO_MATCH(pg, fl)						\
 	static_assert(offsetof(struct folio, fl) ==			\
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 15418ffdd3774..28d12573fcf8c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3171,6 +3171,14 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	int order = folio_order(folio);
 	unsigned int nr = 1 << order;
 
+	/*
+	 * Reset any memcg data overlay in the tail pages. folio_nr_pages()
+	 * is unreliable after this point.
+	 */
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+	folio->_nr_pages = 0;
+#endif
+
 	/* complete memcg works before add pages to LRU */
 	split_page_memcg(head, order, new_order);
 
diff --git a/mm/internal.h b/mm/internal.h
index 97d6b94429ebd..f627fd2200464 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -625,8 +625,8 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
 		return;
 
 	folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order;
-#ifdef CONFIG_64BIT
-	folio->_folio_nr_pages = 1U << order;
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+	folio->_nr_pages = 1U << order;
 #endif
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c2ffccf9d2131..e276cbaf97054 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1077,8 +1077,12 @@ __always_inline bool free_pages_prepare(struct page *page,
 	if (unlikely(order)) {
 		int i;
 
-		if (compound)
+		if (compound) {
 			page[1].flags &= ~PAGE_FLAGS_SECOND;
+#ifdef NR_PAGES_IN_LARGE_FOLIO
+			((struct folio *)page)->_nr_pages = 0;
+#endif
+		}
 		for (i = 1; i < (1 << order); i++) {
 			if (compound)
 				bad += free_tail_page_prepare(page, page + i);
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 2d6360eaccbb6..a409e2561a8fd 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -507,7 +507,7 @@ static inline int print_page_owner_memcg(char *kbuf, size_t count, int ret,
 
 	rcu_read_lock();
 	memcg_data = READ_ONCE(page->memcg_data);
-	if (!memcg_data)
+	if (!memcg_data || PageTail(page))
 		goto out_unlock;
 
 	if (memcg_data & MEMCG_DATA_OBJEXTS)

From patchwork Thu Aug 29 16:56:08 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783451
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id D1D991B86D3
	for <linux-fsdevel@vger.kernel.org>; Thu, 29 Aug 2024 16:57:43 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1724950665; cv=none;
 b=ukM71vGk+I//UJl1EwXj4tT1fox7XNhxeTPESM0KpoaEE7Y3SJd6iHeb59nlq/r9mF/sjxDHps1sc1pnuzED+W0H4CKlz/n0Cch+kZoRT1tnf6B28SPJyBEPE7tTHSDtYtGDuo7gn3FGfYi31LXJg1JCogFi3ijZUtw61108Fac=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1724950665; c=relaxed/simple;
	bh=VUChrcX3pFSYIoI9kHHzXoUY5G447RUvOroM0QJMtW8=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=ezzgXY1/ZFQNdpXLGmTMUanRUqbwin2yPGx7OT/ua+CTYFTWdrHebxSRKzwcNhTUE4sJjm69fu1RvqbEjwNZs6mGVkpfpEMUVTL8Bm6xej5+/yYuMOLU10qiR2LVOv64hWH4k9J+wjdTnZiuzZ0G4dBoWYNYo/zBFddJ0c62ovs=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=UqUMz4Gz; arc=none smtp.client-ip=170.10.133.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="UqUMz4Gz"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1724950662;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=P9e2t85ahUim8/CUtgfAvgDNVpQ7xBKAii0+WP0hxAw=;
	b=UqUMz4GzH+k/x7GyBvoGGwMEWlE7aIEkiefIMVkNMvA+tpENA7oFgkZwry98nfPicgtlSk
	0zKX5y8z8Dg2MvXTp9ELPNj+NCd8Dg7WAlBMIG2FK/pm/BIXITqgWnbiNvmYicqbCx2cjC
	FPtDiijQSORQ4/Dv9SgUh2phNLJjR1s=
Received: from mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-634-4iL4jv5LNyWo6OsN4hoihg-1; Thu,
 29 Aug 2024 12:57:36 -0400
X-MC-Unique: 4iL4jv5LNyWo6OsN4hoihg-1
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 1D1C71956048;
	Thu, 29 Aug 2024 16:57:32 +0000 (UTC)
Received: from t14s.redhat.com (unknown [10.39.193.245])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 414471955F66;
	Thu, 29 Aug 2024 16:57:25 +0000 (UTC)
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 05/17] mm/rmap: pass dst_vma to page_try_dup_anon_rmap()
 and page_dup_file_rmap()
Date: Thu, 29 Aug 2024 18:56:08 +0200
Message-ID: <20240829165627.2256514-6-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

We'll need access to the destination MM when modifying the total mapcount
of non-hugetlb large folios next, so pass in the destination VMA.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
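
Illustration only, not part of the patch: a minimal userspace sketch of
why the duplication helpers take dst_vma in addition to src_vma. During
fork() the two VMAs belong to different MMs, and only dst_vma->vm_mm
names the MM that gains the new mappings. The struct definitions are
reduced stand-ins, and dup_target_mm() and demo_dst_mm_is_child() are
hypothetical names that exist only in this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel types; only vm_mm matters here. */
struct mm_struct { int id; };
struct vm_area_struct { struct mm_struct *vm_mm; };

/*
 * Hypothetical helper: pick the MM that receives the duplicated
 * mappings. src_vma->vm_mm would be the parent, which is the wrong
 * MM to account the new mappings to.
 */
static struct mm_struct *dup_target_mm(struct vm_area_struct *dst_vma,
		struct vm_area_struct *src_vma)
{
	(void)src_vma;
	return dst_vma->vm_mm;
}

/* Returns 1 if the helper resolves to the child MM, as in fork(). */
static int demo_dst_mm_is_child(void)
{
	struct mm_struct parent = { .id = 1 }, child = { .id = 2 };
	struct vm_area_struct src_vma = { .vm_mm = &parent };
	struct vm_area_struct dst_vma = { .vm_mm = &child };

	return dup_target_mm(&dst_vma, &src_vma) == &child;
}
```

How later patches in the series actually consume dst_vma is not shown
in this patch; the sketch only assumes the MM is reached via vm_mm.
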
 include/linux/rmap.h | 42 +++++++++++++++++++++++++-----------------
 mm/huge_memory.c     |  2 +-
 mm/memory.c          | 10 +++++-----
 3 files changed, 31 insertions(+), 23 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 91b5935e8485e..9e275986f0ef6 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -322,7 +322,8 @@ static inline void hugetlb_remove_rmap(struct folio *folio)
 }
 
 static __always_inline void __folio_dup_file_rmap(struct folio *folio,
-		struct page *page, int nr_pages, enum rmap_level level)
+		struct page *page, int nr_pages, struct vm_area_struct *dst_vma,
+		enum rmap_level level)
 {
 	const int orig_nr_pages = nr_pages;
 
@@ -352,45 +353,47 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
  * @folio:	The folio to duplicate the mappings of
  * @page:	The first page to duplicate the mappings of
  * @nr_pages:	The number of pages of which the mapping will be duplicated
+ * @dst_vma:	The destination vm area
  *
  * The page range of the folio is defined by [page, page + nr_pages)
  *
  * The caller needs to hold the page table lock.
  */
 static inline void folio_dup_file_rmap_ptes(struct folio *folio,
-		struct page *page, int nr_pages)
+		struct page *page, int nr_pages, struct vm_area_struct *dst_vma)
 {
-	__folio_dup_file_rmap(folio, page, nr_pages, RMAP_LEVEL_PTE);
+	__folio_dup_file_rmap(folio, page, nr_pages, dst_vma, RMAP_LEVEL_PTE);
 }
 
 static __always_inline void folio_dup_file_rmap_pte(struct folio *folio,
-		struct page *page)
+		struct page *page, struct vm_area_struct *dst_vma)
 {
-	__folio_dup_file_rmap(folio, page, 1, RMAP_LEVEL_PTE);
+	__folio_dup_file_rmap(folio, page, 1, dst_vma, RMAP_LEVEL_PTE);
 }
 
 /**
  * folio_dup_file_rmap_pmd - duplicate a PMD mapping of a page range of a folio
  * @folio:	The folio to duplicate the mapping of
  * @page:	The first page to duplicate the mapping of
+ * @dst_vma:	The destination vm area
  *
  * The page range of the folio is defined by [page, page + HPAGE_PMD_NR)
  *
  * The caller needs to hold the page table lock.
  */
 static inline void folio_dup_file_rmap_pmd(struct folio *folio,
-		struct page *page)
+		struct page *page, struct vm_area_struct *dst_vma)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	__folio_dup_file_rmap(folio, page, HPAGE_PMD_NR, RMAP_LEVEL_PTE);
+	__folio_dup_file_rmap(folio, page, HPAGE_PMD_NR, dst_vma, RMAP_LEVEL_PTE);
 #else
 	WARN_ON_ONCE(true);
 #endif
 }
 
 static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
-		struct page *page, int nr_pages, struct vm_area_struct *src_vma,
-		enum rmap_level level)
+		struct page *page, int nr_pages, struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma, enum rmap_level level)
 {
 	const int orig_nr_pages = nr_pages;
 	bool maybe_pinned;
@@ -455,6 +458,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
  * @folio:	The folio to duplicate the mappings of
  * @page:	The first page to duplicate the mappings of
  * @nr_pages:	The number of pages of which the mapping will be duplicated
+ * @dst_vma:	The destination vm area
  * @src_vma:	The vm area from which the mappings are duplicated
  *
  * The page range of the folio is defined by [page, page + nr_pages)
@@ -473,16 +477,18 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
  * Returns 0 if duplicating the mappings succeeded. Returns -EBUSY otherwise.
  */
 static inline int folio_try_dup_anon_rmap_ptes(struct folio *folio,
-		struct page *page, int nr_pages, struct vm_area_struct *src_vma)
+		struct page *page, int nr_pages, struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma)
 {
-	return __folio_try_dup_anon_rmap(folio, page, nr_pages, src_vma,
-					 RMAP_LEVEL_PTE);
+	return __folio_try_dup_anon_rmap(folio, page, nr_pages, dst_vma,
+					 src_vma, RMAP_LEVEL_PTE);
 }
 
 static __always_inline int folio_try_dup_anon_rmap_pte(struct folio *folio,
-		struct page *page, struct vm_area_struct *src_vma)
+		struct page *page, struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma)
 {
-	return __folio_try_dup_anon_rmap(folio, page, 1, src_vma,
+	return __folio_try_dup_anon_rmap(folio, page, 1, dst_vma, src_vma,
 					 RMAP_LEVEL_PTE);
 }
 
@@ -491,6 +497,7 @@ static __always_inline int folio_try_dup_anon_rmap_pte(struct folio *folio,
  *				 of a folio
  * @folio:	The folio to duplicate the mapping of
  * @page:	The first page to duplicate the mapping of
+ * @dst_vma:	The destination vm area
  * @src_vma:	The vm area from which the mapping is duplicated
  *
  * The page range of the folio is defined by [page, page + HPAGE_PMD_NR)
@@ -509,11 +516,12 @@ static __always_inline int folio_try_dup_anon_rmap_pte(struct folio *folio,
  * Returns 0 if duplicating the mapping succeeded. Returns -EBUSY otherwise.
  */
 static inline int folio_try_dup_anon_rmap_pmd(struct folio *folio,
-		struct page *page, struct vm_area_struct *src_vma)
+		struct page *page, struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	return __folio_try_dup_anon_rmap(folio, page, HPAGE_PMD_NR, src_vma,
-					 RMAP_LEVEL_PMD);
+	return __folio_try_dup_anon_rmap(folio, page, HPAGE_PMD_NR, dst_vma,
+					 src_vma, RMAP_LEVEL_PMD);
 #else
 	WARN_ON_ONCE(true);
 	return -EBUSY;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 28d12573fcf8c..6de84377e8e77 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1642,7 +1642,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	src_folio = page_folio(src_page);
 
 	folio_get(src_folio);
-	if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, src_vma))) {
+	if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
 		/* Page maybe pinned: split and retry the fault on PTEs. */
 		folio_put(src_folio);
 		pte_free(dst_mm, pgtable);
diff --git a/mm/memory.c b/mm/memory.c
index 06b42db8a2db7..c2143c40a134b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -856,7 +856,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		folio_get(folio);
 		rss[mm_counter(folio)]++;
 		/* Cannot fail as these pages cannot get pinned. */
-		folio_try_dup_anon_rmap_pte(folio, page, src_vma);
+		folio_try_dup_anon_rmap_pte(folio, page, dst_vma, src_vma);
 
 		/*
 		 * We do not preserve soft-dirty information, because so
@@ -1007,14 +1007,14 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 		folio_ref_add(folio, nr);
 		if (folio_test_anon(folio)) {
 			if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
-								  nr, src_vma))) {
+								  nr, dst_vma, src_vma))) {
 				folio_ref_sub(folio, nr);
 				return -EAGAIN;
 			}
 			rss[MM_ANONPAGES] += nr;
 			VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
 		} else {
-			folio_dup_file_rmap_ptes(folio, page, nr);
+			folio_dup_file_rmap_ptes(folio, page, nr, dst_vma);
 			rss[mm_counter_file(folio)] += nr;
 		}
 		if (any_writable)
@@ -1032,7 +1032,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 		 * guarantee the pinned page won't be randomly replaced in the
 		 * future.
 		 */
-		if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, src_vma))) {
+		if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, dst_vma, src_vma))) {
 			/* Page may be pinned, we have to copy. */
 			folio_put(folio);
 			err = copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
@@ -1042,7 +1042,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 		rss[MM_ANONPAGES]++;
 		VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
 	} else {
-		folio_dup_file_rmap_pte(folio, page);
+		folio_dup_file_rmap_pte(folio, page, dst_vma);
 		rss[mm_counter_file(folio)]++;
 	}
 

From patchwork Thu Aug 29 16:56:09 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783452
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 337FD1B655B
	for <linux-fsdevel@vger.kernel.org>; Thu, 29 Aug 2024 16:57:48 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1724950670; cv=none;
 b=l1kU7eSSge2/GBKpi98edEwAxyOWafW1vWjbKIw5kPtGeRzhf/dRpWSyQ/tTDL1eiyh+veAHcrP7QTQdPzJsgKPjqtn4ioZNEdVK7g0s2pgVpEOcOPvEaqsZ0iJWyMEu+KMQxRn3v0CMnr5eD6EZgTzR1T1fwnM4E/RguKGK+U4=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1724950670; c=relaxed/simple;
	bh=YgSN2WU8YPPe3BIrmNuupjxgq4wOpkUDRF97fLPgncA=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=h3eKFldWMYXBo2BU8RYejxTnSjOP4lTwxuZQNoKrfMTjvPtKrLDoiurozdBJbC5YKFBwMeRzAH7YEwAcFU/TKv6zHkpSL58NQNEgvJR5DsrLhVqnd8Zub0pAKm4LDH1wd7+TgMbOTCpbhNs8ZUpMmkZMk7ONbQAB1gV0gDUCK1Q=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=SrQJc9IT; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="SrQJc9IT"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1724950668;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=UQpm0aZ+NCHrJn7YVxv+CdOsqon69CXtB9EHB2F3FEM=;
	b=SrQJc9ITWFTHn0cqIqfLV84XllBaw8H+1+lf5Gdp7avYI08ZO4hwUM9hDxuSWYk+m6gPMR
	L+MTs66EkkOpTaknLrNICH1FwuzPGEqRPaQzaNjUG+nLcV05WzeCxS8gFqJP2V+e2/7Qq9
	hQ+PVKYOUNn7IouWsc0CVb1ZBfdLyqE=
Received: from mx-prod-mc-04.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-59-AymGZyrrMjyz7mijX24iug-1; Thu,
 29 Aug 2024 12:57:44 -0400
X-MC-Unique: AymGZyrrMjyz7mijX24iug-1
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-04.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 402891913791;
	Thu, 29 Aug 2024 16:57:41 +0000 (UTC)
Received: from t14s.redhat.com (unknown [10.39.193.245])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 3B8C51955DC0;
	Thu, 29 Aug 2024 16:57:32 +0000 (UTC)
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 06/17] mm/rmap: pass vma to __folio_add_rmap()
Date: Thu, 29 Aug 2024 18:56:09 +0200
Message-ID: <20240829165627.2256514-7-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

We'll need access to the destination MM when modifying the total mapcount
of non-hugetlb large folios next, so pass in the VMA.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 6594c122a5895..ee1bff1638f90 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1153,8 +1153,8 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
 }
 
 static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
-		struct page *page, int nr_pages, enum rmap_level level,
-		int *nr_pmdmapped)
+		struct page *page, int nr_pages, struct vm_area_struct *vma,
+		enum rmap_level level, int *nr_pmdmapped)
 {
 	atomic_t *mapped = &folio->_nr_pages_mapped;
 	const int orig_nr_pages = nr_pages;
@@ -1314,7 +1314,7 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
 
 	VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
 
-	nr = __folio_add_rmap(folio, page, nr_pages, level, &nr_pmdmapped);
+	nr = __folio_add_rmap(folio, page, nr_pages, vma, level, &nr_pmdmapped);
 
 	if (likely(!folio_test_ksm(folio)))
 		__page_check_anon_rmap(folio, page, vma, address);
@@ -1480,7 +1480,7 @@ static __always_inline void __folio_add_file_rmap(struct folio *folio,
 
 	VM_WARN_ON_FOLIO(folio_test_anon(folio), folio);
 
-	nr = __folio_add_rmap(folio, page, nr_pages, level, &nr_pmdmapped);
+	nr = __folio_add_rmap(folio, page, nr_pages, vma, level, &nr_pmdmapped);
 	__folio_mod_stat(folio, nr, nr_pmdmapped);
 
 	/* See comments in folio_add_anon_rmap_*() */

From patchwork Thu Aug 29 16:56:10 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783462
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0ACB91B8E88
	for <linux-fsdevel@vger.kernel.org>; Thu, 29 Aug 2024 16:58:02 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1724950684; cv=none;
 b=RNb1qHjghTg2VXqltGXxK6hunWZmR2oLaqh3NR0P2miCpV7NRaQxabKBJ9tbVuDzQKUUb/JOpGzY8nXrAAtaNYtu9R42JM1Emhj0zICyvRCnjLT5PRR602knH48R0lx2e3HPRFPUoH7GXhXr6CYlopMlkgKLjCiL34HNXJCViMI=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1724950684; c=relaxed/simple;
	bh=nXLOnnt+POQFh8Gm7kutfDHDnpyviJWy9et/OinJqZs=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=aucCsJIkTUHgVUVmiJK83cY7W2Jl65Vf0ecshnA0ZrWOVloYap+6OEtCMOBrUn0q6a+gomRQu8RVfOpOW1iIc+hSCxfYyB1WjUo7LNradAj7pWSotwluu23x5RfXW+dAGwy0uYqi6lfYw4kcx6p/2suAKnh7/b9xdv6CLEXTWpo=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=UY4PDy3X; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="UY4PDy3X"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1724950682;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=kg6KaZoa+GHuxR5fT8B3Y3uXOyNmhSnzACXsGLCOA4A=;
	b=UY4PDy3XWEvFz28LIL4yKnUBL8BXQPX3caOaiAFrRlgm/a9x1ok3l6cAi+cuAshoQ5kqxv
	8PyNncyFH8w3HeGkV0pLD0JoWmmyncXpvMw4JK4t26+i+SobN4mTSNimhjXGe3Do0jku+V
	SvUactx1dB9hW3tX57s6xYoWs/lRocs=
Received: from mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-125-b73DxecZOFi28Pm89Y2ZEg-1; Thu,
 29 Aug 2024 12:57:55 -0400
X-MC-Unique: b73DxecZOFi28Pm89Y2ZEg-1
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id E5DFB1954223;
	Thu, 29 Aug 2024 16:57:50 +0000 (UTC)
Received: from t14s.redhat.com (unknown [10.39.193.245])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 920AC1955F66;
	Thu, 29 Aug 2024 16:57:41 +0000 (UTC)
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 07/17] mm/rmap: abstract large mapcount operations for
 large folios (!hugetlb)
Date: Thu, 29 Aug 2024 18:56:10 +0200
Message-ID: <20240829165627.2256514-8-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

Let's abstract the large mapcount operations into helpers so we can
extend them easily.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
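
Illustration only, not part of the patch: a minimal userspace model of
the new helpers, using C11 atomics in place of the kernel's atomic_t.
It shows the "mapcounts start at -1" convention: the stored value is
the number of mappings minus one, so a reader has to add one back. The
reduced struct definitions, the reader folio_large_mapcount() as
written here, and demo_large_mapcount() are assumptions of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>

/* Reduced stand-ins for the kernel types; only what the sketch needs. */
struct vm_area_struct { int unused; };
struct folio { atomic_int _large_mapcount; };

/* Stored value is "mappings - 1", so an unmapped folio holds -1. */
static inline void folio_set_large_mapcount(struct folio *folio, int mapcount,
		struct vm_area_struct *vma)
{
	(void)vma;	/* unused for now, as in the patch */
	atomic_store(&folio->_large_mapcount, mapcount - 1);
}

static inline void folio_add_large_mapcount(struct folio *folio,
		int diff, struct vm_area_struct *vma)
{
	(void)vma;
	atomic_fetch_add(&folio->_large_mapcount, diff);
}

static inline void folio_sub_large_mapcount(struct folio *folio,
		int diff, struct vm_area_struct *vma)
{
	(void)vma;
	atomic_fetch_sub(&folio->_large_mapcount, diff);
}

/* Reader side of the sketch: undo the -1 bias. */
static inline int folio_large_mapcount(struct folio *folio)
{
	return atomic_load(&folio->_large_mapcount) + 1;
}

/* Map with 4 PTEs, duplicate those 4 PTEs, then unmap 3 of them. */
static int demo_large_mapcount(void)
{
	struct vm_area_struct vma = { 0 };
	struct folio folio;

	atomic_init(&folio._large_mapcount, -1);	/* unmapped */
	folio_set_large_mapcount(&folio, 4, &vma);
	folio_add_large_mapcount(&folio, 4, &vma);
	folio_sub_large_mapcount(&folio, 3, &vma);
	return folio_large_mapcount(&folio);
}
```

Because every caller goes through the helpers, extra per-MM bookkeeping
can later be added in one place without touching the call sites again.
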
 include/linux/rmap.h | 39 +++++++++++++++++++++++++++++++++++----
 mm/rmap.c            | 14 ++++++--------
 2 files changed, 41 insertions(+), 12 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 9e275986f0ef6..e3b82a04b4acb 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -173,6 +173,37 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,
 
 struct anon_vma *folio_get_anon_vma(struct folio *folio);
 
+static inline void folio_set_large_mapcount(struct folio *folio, int mapcount,
+		struct vm_area_struct *vma)
+{
+	/* Note: mapcounts start at -1. */
+	atomic_set(&folio->_large_mapcount, mapcount - 1);
+}
+
+static inline void folio_add_large_mapcount(struct folio *folio,
+		int diff, struct vm_area_struct *vma)
+{
+	atomic_add(diff, &folio->_large_mapcount);
+}
+
+static inline void folio_inc_large_mapcount(struct folio *folio,
+		struct vm_area_struct *vma)
+{
+	atomic_inc(&folio->_large_mapcount);
+}
+
+static inline void folio_sub_large_mapcount(struct folio *folio,
+		int diff, struct vm_area_struct *vma)
+{
+	atomic_sub(diff, &folio->_large_mapcount);
+}
+
+static inline void folio_dec_large_mapcount(struct folio *folio,
+		struct vm_area_struct *vma)
+{
+	atomic_dec(&folio->_large_mapcount);
+}
+
 /* RMAP flags, currently only relevant for some anon rmap operations. */
 typedef int __bitwise rmap_t;
 
@@ -339,11 +370,11 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
 		do {
 			atomic_inc(&page->_mapcount);
 		} while (page++, --nr_pages > 0);
-		atomic_add(orig_nr_pages, &folio->_large_mapcount);
+		folio_add_large_mapcount(folio, orig_nr_pages, dst_vma);
 		break;
 	case RMAP_LEVEL_PMD:
 		atomic_inc(&folio->_entire_mapcount);
-		atomic_inc(&folio->_large_mapcount);
+		folio_inc_large_mapcount(folio, dst_vma);
 		break;
 	}
 }
@@ -437,7 +468,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
 				ClearPageAnonExclusive(page);
 			atomic_inc(&page->_mapcount);
 		} while (page++, --nr_pages > 0);
-		atomic_add(orig_nr_pages, &folio->_large_mapcount);
+		folio_add_large_mapcount(folio, orig_nr_pages, dst_vma);
 		break;
 	case RMAP_LEVEL_PMD:
 		if (PageAnonExclusive(page)) {
@@ -446,7 +477,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
 			ClearPageAnonExclusive(page);
 		}
 		atomic_inc(&folio->_entire_mapcount);
-		atomic_inc(&folio->_large_mapcount);
+		folio_inc_large_mapcount(folio, dst_vma);
 		break;
 	}
 	return 0;
diff --git a/mm/rmap.c b/mm/rmap.c
index ee1bff1638f90..226b188499f91 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1177,7 +1177,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
 		    atomic_add_return_relaxed(first, mapped) < ENTIRELY_MAPPED)
 			nr = first;
 
-		atomic_add(orig_nr_pages, &folio->_large_mapcount);
+		folio_add_large_mapcount(folio, orig_nr_pages, vma);
 		break;
 	case RMAP_LEVEL_PMD:
 		first = atomic_inc_and_test(&folio->_entire_mapcount);
@@ -1194,7 +1194,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
 				nr = 0;
 			}
 		}
-		atomic_inc(&folio->_large_mapcount);
+		folio_inc_large_mapcount(folio, vma);
 		break;
 	}
 	return nr;
@@ -1450,15 +1450,13 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 				SetPageAnonExclusive(page);
 		}
 
-		/* increment count (starts at -1) */
-		atomic_set(&folio->_large_mapcount, nr - 1);
+		folio_set_large_mapcount(folio, nr, vma);
 		atomic_set(&folio->_nr_pages_mapped, nr);
 	} else {
 		nr = folio_large_nr_pages(folio);
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_entire_mapcount, 0);
-		/* increment count (starts at -1) */
-		atomic_set(&folio->_large_mapcount, 0);
+		folio_set_large_mapcount(folio, 1, vma);
 		atomic_set(&folio->_nr_pages_mapped, ENTIRELY_MAPPED);
 		if (exclusive)
 			SetPageAnonExclusive(&folio->page);
@@ -1542,7 +1540,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 			break;
 		}
 
-		atomic_sub(nr_pages, &folio->_large_mapcount);
+		folio_sub_large_mapcount(folio, nr_pages, vma);
 		do {
 			last += atomic_add_negative(-1, &page->_mapcount);
 		} while (page++, --nr_pages > 0);
@@ -1554,7 +1552,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 		partially_mapped = nr && atomic_read(mapped);
 		break;
 	case RMAP_LEVEL_PMD:
-		atomic_dec(&folio->_large_mapcount);
+		folio_dec_large_mapcount(folio, vma);
 		last = atomic_add_negative(-1, &folio->_entire_mapcount);
 		if (last) {
 			nr = atomic_sub_return_relaxed(ENTIRELY_MAPPED, mapped);

From patchwork Thu Aug 29 16:56:11 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783463
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3B9F71B5EBE
	for <linux-fsdevel@vger.kernel.org>; Thu, 29 Aug 2024 16:58:10 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1724950692; cv=none;
 b=j360Prq9Tuty8VH3oD0pSuPkvgVtpTVeYPpCTH4o+fL8JVqnrzs11T6yDRIaxtNIrMaCPDBcXLbo/eIixrA9CXLuFrNFYm5YKbTYrpbVU86CZkGrHoBBIsT7yihprEUTwbhRqwc3YsezGfGuKwn8R2NIZlilEWDS1GQQ1aEcpIY=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1724950692; c=relaxed/simple;
	bh=h3qLuDXwVIZz3/rCeEo8yr9v7u0QsQGiXf+MrwgchWo=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=PC2EQXiD+kizFNxLSVfn0yHPPcz9SK/LQlKABenSA1dL23yRw8fIXt7wOrDbnKICLmIwLxreOhHOsZeG2k5WwpmWTJcgrjmiTdfKTzBZ2rtM3sO31vqeVVB9eAZME4DJGgMTknjhBIR8nDcJsBEHXYxsBGQhcVGWaCnsu9zgLd0=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=MA94pUop; arc=none smtp.client-ip=170.10.133.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="MA94pUop"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1724950689;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=iJ36TGls6X+NLaUI9I8k1fO0354o+pno0obx3qgSgO4=;
	b=MA94pUopSeod8x7/mua90Ziaw/8PFOBBXXCkpsyMQvBN2Ay8QbFzkzP0rpD2haghVuY+oa
	wL8+PaWLB24Ihya4x3nAant6y6Ed812rWsplUPt9QDgqWd2K9UT7ZWSp0nG9OWfw2wl8Jr
	kdPExGj1olDSSfbbxa+ltpMTOY4hL4M=
Received: from mx-prod-mc-04.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-592-MTIkZoMGPJK2qRqMLQImJQ-1; Thu,
 29 Aug 2024 12:58:06 -0400
X-MC-Unique: MTIkZoMGPJK2qRqMLQImJQ-1
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-04.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id B18C7190ECCE;
	Thu, 29 Aug 2024 16:58:03 +0000 (UTC)
Received: from t14s.redhat.com (unknown [10.39.193.245])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 45D921955F66;
	Thu, 29 Aug 2024 16:57:51 +0000 (UTC)
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 08/17] mm/rmap: initial MM owner tracking for large folios
 (!hugetlb)
Date: Thu, 29 Aug 2024 18:56:11 +0200
Message-ID: <20240829165627.2256514-9-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

Let's track for each large folio (excluding hugetlb for now) whether it is
certainly mapped exclusively (mapped by a single MM), or whether it may be
mapped shared (mapped by multiple MMs).

In an ideal world, we'd have a more precise "mapped exclusively" vs.
"mapped shared" tracking -- avoiding the "maybe" part -- but the
approaches to achieve that are a bit more involved, and we are going to
start with something simple so we can also make progress on per-page
mapcount removal. We can later easily exchange the tracking mechanism.

We'll use this information next to optimize COW reuse for PTE-mapped
anonymous THP, and implement folio_likely_mapped_shared() in kernel
configurations where the per-page mapcounts in large folios are no longer
maintained. We could start doing the MM owner tracking for anonymous
folios initially (COW reuse only applies to anon folios), but we'll keep
it simple and just do it also for pagecache folios: the new tracking
must be manually enabled via a kconfig option for now.

64bit only, because we cannot easily squeeze more stuff into the "struct
folio" of order-1 folios. 32bit might be possible in the future, for
example when limiting order-1 folios to 64bit only.

For each large folio, we'll remember for up to two MMs that currently map
the folio how often they are mapping folio pages (mapcount). As long as
a folio is unmapped or exclusively mapped, another MM can take a free
slot. We won't allow taking a free slot if the folio is not mapped
exclusively: primarily to avoid some corner cases where some mappings of
an MM are tracked via the slot, and others are not (identified while
working on this).

In addition, we'll remember the current state (exclusive/shared) and use a
bit spinlock to sync on updates, and to require only a single atomic
operation for our updates. Using a bit spinlock is not ideal, but there
are not that many easy alternatives. We might be able to squeeze an
arch_spin_lock into the "struct folio" later, for now keep it simple. RT is
out of the picture with THP, and we can always optimize this later.
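
As a rough illustration of the locking idea, here is a minimal userspace
sketch using C11 atomics. It is not the kernel's bit_spin_lock(): the names
word_lock(), word_unlock() and word_set_flag_locked() are made up for this
example, and preemption handling is left out entirely. It only shows that
taking the lock is the single atomic read-modify-write, and that the lock
holder can then update the rest of the word with plain loads and stores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical names; the top bit of the word serves as the lock. */
#define LOCK_BIT (UINT64_C(1) << 63)

/* Spin until we are the one who flips the lock bit from 0 to 1. */
static inline void word_lock(_Atomic uint64_t *word)
{
	while (atomic_fetch_or_explicit(word, LOCK_BIT,
					memory_order_acquire) & LOCK_BIT)
		;	/* busy-wait; waiters never change the payload bits */
}

/* Update payload bits while holding the lock: no atomic RMW required. */
static inline void word_set_flag_locked(_Atomic uint64_t *word, uint64_t flag)
{
	uint64_t v = atomic_load_explicit(word, memory_order_relaxed);

	assert(v & LOCK_BIT);
	atomic_store_explicit(word, v | flag, memory_order_relaxed);
}

/* Clear the lock bit with release semantics to publish our updates. */
static inline void word_unlock(_Atomic uint64_t *word)
{
	uint64_t v = atomic_load_explicit(word, memory_order_relaxed);

	assert(v & LOCK_BIT);
	atomic_store_explicit(word, v & ~LOCK_BIT, memory_order_release);
}
```

A caller would bracket every update of the shared word with word_lock()
and word_unlock(), much like a regular spinlock.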

As we have to squeeze this information into the "struct folio" of even
folios of order-1 (2 pages), and we generally want to reduce the required
metadata, we'll assign each MM a unique ID that consumes less than 32 bit.
We'll limit the IDs to 20bit / 1M for now: we could allow for up to 30bit,
but getting even 1M IDs is unlikely in practice. If required, we could
raise the limit later, and the 1M limit might come in handy in the
future with other tracking approaches.
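
To make the packing more concrete, here is a small userspace sketch of how
two IDs and two flag bits can share one 64-bit word. The mask and shift
values follow the layout used by this patch; the IDS_* names and ids_*()
helpers are invented for the example and operate on a plain uint64_t
instead of "struct folio".

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout: ID0 in the low 32 bits, ID1 above, flags on top. */
#define IDS_ID0_MASK	UINT64_C(0x00000000ffffffff)
#define IDS_ID1_SHIFT	32
#define IDS_ID1_MASK	UINT64_C(0x00ffffff00000000)
#define IDS_EXCL_BIT	(UINT64_C(1) << 62)
#define IDS_LOCK_BIT	(UINT64_C(1) << 63)
#define ID_MAX		((1u << 20) - 1)	/* 20-bit IDs, 0 is the dummy */

/* Replace ID0, leaving ID1 and the flag bits untouched. */
static inline uint64_t ids_set_id0(uint64_t ids, uint32_t id)
{
	assert(id <= ID_MAX);
	return (ids & ~IDS_ID0_MASK) | id;
}

/* Replace ID1, leaving ID0 and the flag bits untouched. */
static inline uint64_t ids_set_id1(uint64_t ids, uint32_t id)
{
	assert(id <= ID_MAX);
	return (ids & ~IDS_ID1_MASK) | ((uint64_t)id << IDS_ID1_SHIFT);
}

static inline uint32_t ids_get_id0(uint64_t ids)
{
	return (uint32_t)(ids & IDS_ID0_MASK);
}

static inline uint32_t ids_get_id1(uint64_t ids)
{
	return (uint32_t)((ids & IDS_ID1_MASK) >> IDS_ID1_SHIFT);
}
```

With 20-bit IDs, both fields fit with room to spare, which is why the
limit could be raised later without changing the layout.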

There won't be any false "mapped shared" detection as long as only two MMs
map pages of a folio at one point in time -- for example with fork() and
short-lived child processes, or with apps that hand over state from one
instance to another, like live-migrating VMs on the same host, effectively
migrating guest RAM via mmap'ed files.

As soon as three MMs are involved at the same time, we might detect
"mapped shared" although the folio is now "mapped exclusively". Example:
(1) App1 faults in a (shmem/file-backed) folio -> Tracked as MM0
(2) App2 faults in the same folio -> Tracked as MM1
(3) App3 faults in the same folio -> Cannot be tracked separately
(4) App1 and App2 unmap the folio.
(5) We'll still detect "shared" even though only App3 maps the folio.
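
The sequence above can be replayed with a simplified userspace model of the
two-slot tracking. The model follows the rules as described in this commit
message; it is not a copy of the kernel code and may differ from the patch
in corner cases. All model_* names are hypothetical, and locking is
omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these names do not exist in the kernel. */
struct model_folio {
	int mapcount;		/* total mapcount, -1 means unmapped */
	int slot_mapcount[2];	/* per-slot mapcount, -1 means slot is free */
	unsigned int slot_id[2];	/* MM ID per slot, 0 is the dummy */
	bool exclusive;		/* "certainly mapped exclusively" */
};

static void model_init(struct model_folio *f)
{
	f->mapcount = -1;
	f->slot_mapcount[0] = f->slot_mapcount[1] = -1;
	f->slot_id[0] = f->slot_id[1] = 0;
	f->exclusive = true;
}

/* Return the slot tracking @id, or -1 if this MM is not tracked. */
static int model_find_slot(const struct model_folio *f, unsigned int id)
{
	for (int i = 0; i < 2; i++)
		if (f->slot_id[i] == id)
			return i;
	return -1;
}

/* True if one tracked MM owns all mappings, or if there are none. */
static bool model_one_owner(const struct model_folio *f)
{
	return f->slot_mapcount[0] == f->mapcount ||
	       f->slot_mapcount[1] == f->mapcount;
}

static void model_map(struct model_folio *f, unsigned int id, int diff)
{
	int slot = model_find_slot(f, id);

	assert(id != 0 && diff > 0);
	f->mapcount += diff;
	if (slot >= 0) {
		f->slot_mapcount[slot] += diff;
	} else if (f->exclusive) {
		/* A free slot may only be taken while the folio is exclusive. */
		slot = f->slot_mapcount[0] < 0 ? 0 : 1;
		assert(f->slot_mapcount[slot] < 0);
		f->slot_id[slot] = id;
		f->slot_mapcount[slot] = diff - 1;
	}
	/* Otherwise this MM stays untracked. */
	if (!model_one_owner(f))
		f->exclusive = false;
}

static void model_unmap(struct model_folio *f, unsigned int id, int diff)
{
	int slot = model_find_slot(f, id);

	assert(id != 0 && diff > 0);
	f->mapcount -= diff;
	if (slot >= 0)
		f->slot_mapcount[slot] -= diff;
	if (model_one_owner(f))
		f->exclusive = true;
}
```

Mapping with IDs 1, 2 and 3 and then unmapping 1 and 2 leaves the model in
the "shared" state although only ID 3 still maps the folio; the state only
returns to "exclusive" once ID 3 unmaps as well.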

With multiple processes, this might have the potential to result in
unexpected owner changes, when migrating pages or when faulting them in:
assume a parent process fork()'s two short-lived child processes. We would
expect that the parent always remains tracked under MM0, but it could be
that at some point both child processes are tracked instead. For
file-backed memory, reclaim+refault can trigger something similar.

Keep compilation for the vdso32 hack working by un-defining CONFIG_MM_ID
like we do for CONFIG_64BIT.

Make use of __always_inline to keep possible performance degradation
when (un)mapping large folios to a minimum.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 Documentation/mm/transhuge.rst                |   8 ++
 arch/x86/entry/vdso/vdso32/fake_32bit_build.h |   1 +
 include/linux/mm_types.h                      |  23 ++++
 include/linux/page-flags.h                    |  41 ++++++
 include/linux/rmap.h                          | 126 ++++++++++++++++++
 kernel/fork.c                                 |  36 +++++
 mm/Kconfig                                    |  11 ++
 mm/huge_memory.c                              |   6 +
 mm/internal.h                                 |   6 +
 mm/page_alloc.c                               |  10 ++
 10 files changed, 268 insertions(+)

diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst
index a2cd8800d5279..0ee58108a4d14 100644
--- a/Documentation/mm/transhuge.rst
+++ b/Documentation/mm/transhuge.rst
@@ -120,11 +120,19 @@ pages:
     and also increment/decrement folio->_nr_pages_mapped by ENTIRELY_MAPPED
     when _entire_mapcount goes from -1 to 0 or 0 to -1.
 
+    With CONFIG_MM_ID, we also maintain the two slots for tracking MM
+    owners (MM ID and corresponding mapcount), and the current status
+    ("mapped shared" vs. "mapped exclusively").
+
   - map/unmap of individual pages with PTE entry increment/decrement
     page->_mapcount, increment/decrement folio->_large_mapcount and also
     increment/decrement folio->_nr_pages_mapped when page->_mapcount goes
     from -1 to 0 or 0 to -1 as this counts the number of pages mapped by PTE.
 
+    With CONFIG_MM_ID, we also maintain the two slots for tracking MM
+    owners (MM ID and corresponding mapcount), and the current status
+    ("mapped shared" vs. "mapped exclusively").
+
 split_huge_page internally has to distribute the refcounts in the head
 page to the tail pages before clearing all PG_head/tail bits from the page
 structures. It can be done easily for refcounts taken by page table
diff --git a/arch/x86/entry/vdso/vdso32/fake_32bit_build.h b/arch/x86/entry/vdso/vdso32/fake_32bit_build.h
index db1b15f686e32..93d2bf13a6280 100644
--- a/arch/x86/entry/vdso/vdso32/fake_32bit_build.h
+++ b/arch/x86/entry/vdso/vdso32/fake_32bit_build.h
@@ -13,6 +13,7 @@
 #undef CONFIG_SPARSEMEM_VMEMMAP
 #undef CONFIG_NR_CPUS
 #undef CONFIG_PARAVIRT_XXL
+#undef CONFIG_MM_ID
 
 #define CONFIG_X86_32 1
 #define CONFIG_PGTABLE_LEVELS 2
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 480548552ea54..6d27856686439 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -311,6 +311,9 @@ typedef struct {
  * @_nr_pages_mapped: Do not use outside of rmap and debug code.
  * @_pincount: Do not use directly, call folio_maybe_dma_pinned().
  * @_nr_pages: Do not use directly, call folio_nr_pages().
+ * @_mm0_mapcount: Do not use outside of rmap code.
+ * @_mm1_mapcount: Do not use outside of rmap code.
+ * @_mm_ids: Do not use outside of rmap code.
  * @_hugetlb_subpool: Do not use directly, use accessor in hugetlb.h.
  * @_hugetlb_cgroup: Do not use directly, use accessor in hugetlb_cgroup.h.
  * @_hugetlb_cgroup_rsvd: Do not use directly, use accessor in hugetlb_cgroup.h.
@@ -377,6 +380,11 @@ struct folio {
 					atomic_t _entire_mapcount;
 					atomic_t _nr_pages_mapped;
 					atomic_t _pincount;
+#ifdef CONFIG_MM_ID
+					int _mm0_mapcount;
+					int _mm1_mapcount;
+					unsigned long _mm_ids;
+#endif /* CONFIG_MM_ID */
 				};
 				unsigned long _usable_1[4];
 			};
@@ -1044,6 +1052,9 @@ struct mm_struct {
 #endif
 		} lru_gen;
 #endif /* CONFIG_LRU_GEN_WALKS_MMU */
+#ifdef CONFIG_MM_ID
+		unsigned int mm_id;
+#endif
 	} __randomize_layout;
 
 	/*
@@ -1053,6 +1064,18 @@ struct mm_struct {
 	unsigned long cpu_bitmap[];
 };
 
+#ifdef CONFIG_MM_ID
+/*
+ * For init_mm and friends, we don't allocate an ID and use the dummy value
+ * instead. Limit ourselves to 1M MMs for now: even though we might support
+ * up to 4M PIDs, having more than 1M MM instances is highly unlikely.
+ */
+#define MM_ID_DUMMY		0
+#define MM_ID_NR_BITS		20
+#define MM_ID_MIN		(MM_ID_DUMMY + 1)
+#define MM_ID_MAX		((1U << MM_ID_NR_BITS) - 1)
+#endif /* CONFIG_MM_ID */
+
 #define MM_MT_FLAGS	(MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN | \
 			 MT_FLAGS_USE_RCU)
 extern struct mm_struct init_mm;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 2175ebceb41cb..140de182811f2 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -11,6 +11,7 @@
 #include <linux/mmdebug.h>
 #ifndef __GENERATING_BOUNDS_H
 #include <linux/mm_types.h>
+#include <linux/bit_spinlock.h>
 #include <generated/bounds.h>
 #endif /* !__GENERATING_BOUNDS_H */
 
@@ -1187,6 +1188,46 @@ static inline int folio_has_private(const struct folio *folio)
 	return !!(folio->flags & PAGE_FLAGS_PRIVATE);
 }
 
+#ifdef CONFIG_MM_ID
+/*
+ * We store two flags (including the bit spinlock) in the upper bits of
+ * folio->_mm_ids, whereby that whole value is protected by the bit spinlock.
+ * This allows for only using an atomic op for acquiring the lock.
+ */
+#define FOLIO_MM_IDS_EXCLUSIVE_BITNUM		62
+#define FOLIO_MM_IDS_LOCK_BITNUM		63
+
+static __always_inline void folio_lock_large_mapcount_data(struct folio *folio)
+{
+	VM_WARN_ON_ONCE(!folio_test_large(folio) || folio_test_hugetlb(folio));
+	bit_spin_lock(FOLIO_MM_IDS_LOCK_BITNUM, &folio->_mm_ids);
+}
+
+static __always_inline void folio_unlock_large_mapcount_data(struct folio *folio)
+{
+	VM_WARN_ON_ONCE(!folio_test_large(folio) || folio_test_hugetlb(folio));
+	__bit_spin_unlock(FOLIO_MM_IDS_LOCK_BITNUM, &folio->_mm_ids);
+}
+
+static inline void folio_set_large_mapped_exclusively(struct folio *folio)
+{
+	VM_WARN_ON_ONCE(!folio_test_large(folio) || folio_test_hugetlb(folio));
+	__set_bit(FOLIO_MM_IDS_EXCLUSIVE_BITNUM, &folio->_mm_ids);
+}
+
+static inline void folio_clear_large_mapped_exclusively(struct folio *folio)
+{
+	VM_WARN_ON_ONCE(!folio_test_large(folio) || folio_test_hugetlb(folio));
+	__clear_bit(FOLIO_MM_IDS_EXCLUSIVE_BITNUM, &folio->_mm_ids);
+}
+
+static inline bool folio_test_large_mapped_exclusively(struct folio *folio)
+{
+	VM_WARN_ON_ONCE(!folio_test_large(folio) || folio_test_hugetlb(folio));
+	return test_bit(FOLIO_MM_IDS_EXCLUSIVE_BITNUM, &folio->_mm_ids);
+}
+#endif /* CONFIG_MM_ID */
+
 #undef PF_ANY
 #undef PF_HEAD
 #undef PF_NO_TAIL
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index e3b82a04b4acb..ff2a16864deed 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -173,6 +173,131 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,
 
 struct anon_vma *folio_get_anon_vma(struct folio *folio);
 
+#ifdef CONFIG_MM_ID
+
+/*
+ * We don't restrict ID0 to less bit, so we can get a slightly more efficient
+ * implementation when reading/writing ID0. The high bits are used for flags,
+ * see FOLIO_MM_IDS_*_BITNUM.
+ */
+#define FOLIO_MM_IDS_ID0_MASK			0x00000000fffffffful
+#define FOLIO_MM_IDS_ID1_SHIFT			32
+#define FOLIO_MM_IDS_ID1_MASK			0x00ffffff00000000ul
+
+static inline unsigned int folio_mm0_id(struct folio *folio)
+{
+	return folio->_mm_ids & FOLIO_MM_IDS_ID0_MASK;
+}
+
+static inline void folio_set_mm0_id(struct folio *folio, unsigned int id)
+{
+	folio->_mm_ids &= ~FOLIO_MM_IDS_ID0_MASK;
+	folio->_mm_ids |= id;
+}
+
+static inline unsigned int folio_mm1_id(struct folio *folio)
+{
+	return (folio->_mm_ids & FOLIO_MM_IDS_ID1_MASK) >> FOLIO_MM_IDS_ID1_SHIFT;
+}
+
+static inline void folio_set_mm1_id(struct folio *folio, unsigned int id)
+{
+	folio->_mm_ids &= ~FOLIO_MM_IDS_ID1_MASK;
+	folio->_mm_ids |= (unsigned long)id << FOLIO_MM_IDS_ID1_SHIFT;
+}
+
+static __always_inline void folio_set_large_mapcount(struct folio *folio,
+		int mapcount, struct vm_area_struct *vma)
+{
+	VM_WARN_ON_ONCE(!folio_test_large(folio) || folio_test_hugetlb(folio));
+
+	/* Note: mapcounts start at -1. */
+	atomic_set(&folio->_large_mapcount, mapcount - 1);
+	folio->_mm0_mapcount = mapcount - 1;
+	folio_set_mm0_id(folio, vma->vm_mm->mm_id);
+	VM_WARN_ON_ONCE(!folio_test_large_mapped_exclusively(folio));
+	VM_WARN_ON_ONCE(folio->_mm1_mapcount >= 0);
+}
+
+static __always_inline void folio_add_large_mapcount(struct folio *folio,
+		int diff, struct vm_area_struct *vma)
+{
+	const unsigned int mm_id = vma->vm_mm->mm_id;
+	int mapcount_val;
+
+	VM_WARN_ON_ONCE(!folio_test_large(folio) || folio_test_hugetlb(folio));
+	VM_WARN_ON_ONCE(diff <= 0 || mm_id < MM_ID_MIN || mm_id > MM_ID_MAX);
+
+	folio_lock_large_mapcount_data(folio);
+	/*
+	 * We expect that unmapped folios always have the "mapped exclusively"
+	 * flag set for simplicity.
+	 */
+	VM_WARN_ON_ONCE(atomic_read(&folio->_large_mapcount) < 0 &&
+			!folio_test_large_mapped_exclusively(folio));
+
+	mapcount_val = atomic_read(&folio->_large_mapcount) + diff;
+	atomic_set(&folio->_large_mapcount, mapcount_val);
+
+	if (folio_mm0_id(folio) == mm_id) {
+		folio->_mm0_mapcount += diff;
+		if (folio->_mm0_mapcount != mapcount_val)
+			folio_clear_large_mapped_exclusively(folio);
+	} else if (folio_mm1_id(folio) == mm_id) {
+		folio->_mm1_mapcount += diff;
+		if (folio->_mm1_mapcount != mapcount_val)
+			folio_clear_large_mapped_exclusively(folio);
+	} else if (folio_test_large_mapped_exclusively(folio)) {
+		/*
+		 * We only allow taking over a tracking slot if the folio is
+		 * exclusive, meaning that any mappings belong to exactly one
+		 * tracked MM (which cannot be this MM).
+		 */
+		if (folio->_mm0_mapcount < 0) {
+			folio_set_mm0_id(folio, mm_id);
+			folio->_mm0_mapcount = diff - 1;
+		} else {
+			VM_WARN_ON_ONCE(folio->_mm1_mapcount >= 0);
+			folio_set_mm1_id(folio, mm_id);
+			folio->_mm1_mapcount = diff - 1;
+		}
+		folio_clear_large_mapped_exclusively(folio);
+	}
+	folio_unlock_large_mapcount_data(folio);
+}
+#define folio_inc_large_mapcount(folio, vma) \
+	folio_add_large_mapcount(folio, 1, vma)
+
+static __always_inline void folio_sub_large_mapcount(struct folio *folio,
+		int diff, struct vm_area_struct *vma)
+{
+	const unsigned int mm_id = vma->vm_mm->mm_id;
+	int mapcount_val;
+
+	VM_WARN_ON_ONCE(!folio_test_large(folio) || folio_test_hugetlb(folio));
+	VM_WARN_ON_ONCE(diff <= 0 || mm_id < MM_ID_MIN || mm_id > MM_ID_MAX);
+
+	folio_lock_large_mapcount_data(folio);
+	mapcount_val = atomic_read(&folio->_large_mapcount) - diff;
+	atomic_set(&folio->_large_mapcount, mapcount_val);
+
+	if (folio_mm0_id(folio) == mm_id)
+		folio->_mm0_mapcount -= diff;
+	else if (folio_mm1_id(folio) == mm_id)
+		folio->_mm1_mapcount -= diff;
+
+	/*
+	 * We only consider folios exclusive if there are no mappings or if
+	 * one tracked MM owns all mappings.
+	 */
+	if (folio->_mm0_mapcount == mapcount_val ||
+	    folio->_mm1_mapcount == mapcount_val)
+		folio_set_large_mapped_exclusively(folio);
+	folio_unlock_large_mapcount_data(folio);
+}
+#define folio_dec_large_mapcount(folio, vma) \
+	folio_sub_large_mapcount(folio, 1, vma)
+#else /* !CONFIG_MM_ID */
 static inline void folio_set_large_mapcount(struct folio *folio, int mapcount,
 		struct vm_area_struct *vma)
 {
@@ -203,6 +328,7 @@ static inline void folio_dec_large_mapcount(struct folio *folio,
 {
 	atomic_dec(&folio->_large_mapcount);
 }
+#endif /* !CONFIG_MM_ID */
 
 /* RMAP flags, currently only relevant for some anon rmap operations. */
 typedef int __bitwise rmap_t;
diff --git a/kernel/fork.c b/kernel/fork.c
index ebc9132840872..7b9df4c881387 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -813,6 +813,36 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 #define mm_free_pgd(mm)
 #endif /* CONFIG_MMU */
 
+#ifdef CONFIG_MM_ID
+static DEFINE_IDA(mm_ida);
+
+static inline int mm_alloc_id(struct mm_struct *mm)
+{
+	int ret;
+
+	ret = ida_alloc_range(&mm_ida, MM_ID_MIN, MM_ID_MAX, GFP_KERNEL);
+	if (ret < 0)
+		return ret;
+	mm->mm_id = ret;
+	return 0;
+}
+
+static inline void mm_free_id(struct mm_struct *mm)
+{
+	const int id = mm->mm_id;
+
+	mm->mm_id = MM_ID_DUMMY;
+	if (id == MM_ID_DUMMY)
+		return;
+	if (WARN_ON_ONCE(id < MM_ID_MIN || id > MM_ID_MAX))
+		return;
+	ida_free(&mm_ida, id);
+}
+#else
+static inline int mm_alloc_id(struct mm_struct *mm) { return 0; }
+static inline void mm_free_id(struct mm_struct *mm) {}
+#endif
+
 static void check_mm(struct mm_struct *mm)
 {
 	int i;
@@ -916,6 +946,7 @@ void __mmdrop(struct mm_struct *mm)
 
 	WARN_ON_ONCE(mm == current->active_mm);
 	mm_free_pgd(mm);
+	mm_free_id(mm);
 	destroy_context(mm);
 	mmu_notifier_subscriptions_destroy(mm);
 	check_mm(mm);
@@ -1293,6 +1324,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_pgd(mm))
 		goto fail_nopgd;
 
+	if (mm_alloc_id(mm))
+		goto fail_noid;
+
 	if (init_new_context(p, mm))
 		goto fail_nocontext;
 
@@ -1312,6 +1346,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 fail_cid:
 	destroy_context(mm);
 fail_nocontext:
+	mm_free_id(mm);
+fail_noid:
 	mm_free_pgd(mm);
 fail_nopgd:
 	free_mm(mm);
diff --git a/mm/Kconfig b/mm/Kconfig
index b23913d4e47e2..0877be8c50b6c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -846,6 +846,17 @@ choice
 	  enabled at runtime via sysfs.
 endchoice
 
+config MM_ID
+	bool "MM ID tracking"
+	depends on TRANSPARENT_HUGEPAGE && 64BIT
+	help
+	  Use unique per-MM IDs to track whether large allocations, such
+	  as transparent huge pages, that span multiple physical pages
+	  are "mapped shared" or "mapped exclusively" into user page tables.
+	  This information is useful to determine the current owner of such a
+	  large allocation, for example, helpful for the Copy-On-Write reuse
+	  optimization.
+
 config THP_SWAP
 	def_bool y
 	depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP && SWAP && 64BIT
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6de84377e8e77..7fa84ba506563 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3193,6 +3193,12 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 	ClearPageHasHWPoisoned(head);
 
+#ifdef CONFIG_MM_ID
+	if (!new_order)
+		/* Make sure page->private on the second page is 0. */
+		folio->_mm_ids = 0;
+#endif
+
 	for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
 		__split_huge_page_tail(folio, i, lruvec, list, new_order);
 		/* Some pages can be beyond EOF: drop them from page cache */
diff --git a/mm/internal.h b/mm/internal.h
index f627fd2200464..da38c747c73d4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -665,6 +665,12 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
 	atomic_set(&folio->_entire_mapcount, -1);
 	atomic_set(&folio->_nr_pages_mapped, 0);
 	atomic_set(&folio->_pincount, 0);
+#ifdef CONFIG_MM_ID
+	folio->_mm0_mapcount = -1;
+	folio->_mm1_mapcount = -1;
+	folio->_mm_ids = 0;
+	folio_set_large_mapped_exclusively(folio);
+#endif
 	if (order > 1)
 		INIT_LIST_HEAD(&folio->_deferred_list);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e276cbaf97054..c81f29e29b82d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -959,6 +959,16 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
 			bad_page(page, "nonzero pincount");
 			goto out;
 		}
+#ifdef CONFIG_MM_ID
+		if (unlikely(folio->_mm0_mapcount + 1)) {
+			bad_page(page, "nonzero _mm0_mapcount");
+			goto out;
+		}
+		if (unlikely(folio->_mm1_mapcount + 1)) {
+			bad_page(page, "nonzero _mm1_mapcount");
+			goto out;
+		}
+#endif
 		break;
 	case 2:
 		/* the second tail page: deferred_list overlaps ->mapping */

From patchwork Thu Aug 29 16:56:12 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783464
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id A5F5E1B5EC0
	for <linux-fsdevel@vger.kernel.org>; Thu, 29 Aug 2024 16:58:21 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1724950703; cv=none;
 b=oFufLtS3+pCBPTTYDL8cZ4rYIQfHy3juYdHmZmTbBWWWHOHif/xEt+mjbpZhJCmum0TmpzDc7ri4qlauqXAK8GgNILf6SS3Z9CP3VFzx7K/847ki8Rr4ZZwtL8k5XGyjXzH5GBJ0P7aaImcQwp8VCWiSiHMi4Oo2jVgN9FlCgjM=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1724950703; c=relaxed/simple;
	bh=UuijIfSkIYxacK7S1A3ExjdxI/m7orsRsHG6HPl7Su4=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=Uhm3UnNfsWz/8NrAM8ymXt2Ji3reHIPWN+Slwxrc3jqz8HTjoe0iKYInJidDeiR8ErHPC3ACrXI7/1O4ZavpLw3bcgeG0+OOYwtFw/mURfGy014EAM1afGy2Qgm2IsszuGRCU2WHVQlYl0krpelVko79g0FKuTTNyEcXAwL3ja0=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=b1APO7w8; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="b1APO7w8"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1724950700;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=UScoC5uAD98XoEORkSS5jabd7Wa591naPepP4SsdqqM=;
	b=b1APO7w8Lqgw4d/9gY+tm4R8XJGo4sDhdUdu5sh4DLObD2nv3nS8rtNRprpwd8NtDN+5We
	7BcKEojS9uXHplv/IA35+ZLFjxi9B0ZLnai/oILQmZqGMNKYKPwId4EUM+oo07MJOd/NB6
	gPDCgyACIXcZZAJPzfIgYGiSH2Po9y8=
Received: from mx-prod-mc-04.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-627-7nh5CtHIOOCFSuu-GrSw3g-1; Thu,
 29 Aug 2024 12:58:17 -0400
X-MC-Unique: 7nh5CtHIOOCFSuu-GrSw3g-1
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-04.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id C41D8191379F;
	Thu, 29 Aug 2024 16:58:14 +0000 (UTC)
Received: from t14s.redhat.com (unknown [10.39.193.245])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id AB5951955F66;
	Thu, 29 Aug 2024 16:58:03 +0000 (UTC)
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 09/17] bit_spinlock: __always_inline (un)lock functions
Date: Thu, 29 Aug 2024 18:56:12 +0200
Message-ID: <20240829165627.2256514-10-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

The compiler might decide that it is a smart idea to not inline
bit_spin_lock(), primarily when a couple of functions in the same file end
up calling it. Especially when used in RMAP context, this can negatively
affect fork() performance, where each additional function call is
noticeable.

Let's simply flag all lock/unlock functions as __always_inline;
arch_test_and_set_bit_lock() and friends are already tagged like that
(but not test_and_set_bit_lock() for some reason).

If this ever becomes a problem, we could split it into a fast and a slow path, and
only force the fast path to be inlined. But there is nothing
particularly "big" here.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/bit_spinlock.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/bit_spinlock.h b/include/linux/bit_spinlock.h
index bbc4730a6505c..c0989b5b0407f 100644
--- a/include/linux/bit_spinlock.h
+++ b/include/linux/bit_spinlock.h
@@ -13,7 +13,7 @@
  * Don't use this unless you really need to: spin_lock() and spin_unlock()
  * are significantly faster.
  */
-static inline void bit_spin_lock(int bitnum, unsigned long *addr)
+static __always_inline void bit_spin_lock(int bitnum, unsigned long *addr)
 {
 	/*
 	 * Assuming the lock is uncontended, this never enters
@@ -38,7 +38,7 @@ static inline void bit_spin_lock(int bitnum, unsigned long *addr)
 /*
  * Return true if it was acquired
  */
-static inline int bit_spin_trylock(int bitnum, unsigned long *addr)
+static __always_inline int bit_spin_trylock(int bitnum, unsigned long *addr)
 {
 	preempt_disable();
 #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
@@ -54,7 +54,7 @@ static inline int bit_spin_trylock(int bitnum, unsigned long *addr)
 /*
  *  bit-based spin_unlock()
  */
-static inline void bit_spin_unlock(int bitnum, unsigned long *addr)
+static __always_inline void bit_spin_unlock(int bitnum, unsigned long *addr)
 {
 #ifdef CONFIG_DEBUG_SPINLOCK
 	BUG_ON(!test_bit(bitnum, addr));
@@ -71,7 +71,7 @@ static inline void bit_spin_unlock(int bitnum, unsigned long *addr)
  *  non-atomic version, which can be used eg. if the bit lock itself is
  *  protecting the rest of the flags in the word.
  */
-static inline void __bit_spin_unlock(int bitnum, unsigned long *addr)
+static __always_inline void __bit_spin_unlock(int bitnum, unsigned long *addr)
 {
 #ifdef CONFIG_DEBUG_SPINLOCK
 	BUG_ON(!test_bit(bitnum, addr));

From patchwork Thu Aug 29 16:56:13 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783465
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 889CF1B5ECE
	for <linux-fsdevel@vger.kernel.org>; Thu, 29 Aug 2024 16:58:34 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1724950716; cv=none;
 b=udOoADN+ADY3qciAyULF2hgnxaMuvg+ChMtfwC2ARfBUtlQtP6g86vH7BLd0QwCI9qjctB23Q1XP/nWM41ZVmWRdpJHRFJrBvrSVPFuLEwFHPesDDKz0WQ1xycxdk+zGqbgjULiWrByik7LHsGl72uYtsmlj2B9Dk5cujQziJ6o=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1724950716; c=relaxed/simple;
	bh=3ItMPhJsWAcR246rqGNvmSIbS4Cml/6W9Jk2oUh5LQY=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=aUOcIc4cX3RYhzWeOREIfthIzhp8SWxJNmSzCNQkqDUAwuvgaThOeqFF7DG+uXSnyzwEB66yKYawB0z3QSrGq2KTQBwojaWUPrXolCLzJ+YWc1MLX84ciKHysgYdifvC2Dao6LAqMuDgganvaaEgdu1KwYMu9Ho3h6BY5QIqHIg=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=FoLIfD8r; arc=none smtp.client-ip=170.10.133.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="FoLIfD8r"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1724950713;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=SjSkhPyRz/5fIFSM96q39YLgP24AQs1QkGAT42V2m6c=;
	b=FoLIfD8r8RvGR1/WK0TOLAMzon1MRWv5AlmkhU4cU9vAeLjOA48my1rDGvN6UbdDNDUduq
	8bWgFBVvb7hOouEtWNE/o1YrlyIGPlL1lz0e5oyMm6WTGJV7zOttgi2wYFjRJqQXMnHbuV
	VxknXGg6Am7t5lO6fNuczLmdHR545xI=
Received: from mx-prod-mc-02.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-548-ZwwKUWzxPCORnd2cpciv7A-1; Thu,
 29 Aug 2024 12:58:27 -0400
X-MC-Unique: ZwwKUWzxPCORnd2cpciv7A-1
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-02.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 3E98818BC2F6;
	Thu, 29 Aug 2024 16:58:23 +0000 (UTC)
Received: from t14s.redhat.com (unknown [10.39.193.245])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 120621955F66;
	Thu, 29 Aug 2024 16:58:14 +0000 (UTC)
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 10/17] mm: COW reuse support for PTE-mapped THP with
 CONFIG_MM_ID
Date: Thu, 29 Aug 2024 18:56:13 +0200
Message-ID: <20240829165627.2256514-11-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

Let's add support for CONFIG_MM_ID. The implementation is fairly
straightforward: if the folio is mapped exclusively, make sure that all
folio references are from mappings.
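
As a rough illustration of that rule, here is a minimal userspace sketch.
It is not the kernel code: "struct model_folio" and model_can_reuse() are
hypothetical stand-ins, and the locking, swapcache handling and recheck
done by the real function are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the folio state involved. */
struct model_folio {
	int refcount;			/* all references, mappings included */
	int mapcount;			/* number of page table mappings */
	bool mapped_exclusively;	/* all mappings belong to a single MM */
};

/*
 * Simplified reuse decision: the folio is exclusive to the faulting MM if
 * all mappings are from one MM and there are no references beyond the ones
 * taken by those mappings.
 */
static bool model_can_reuse(const struct model_folio *folio)
{
	/* Each mapping holds a reference, so mapcount cannot exceed refcount. */
	assert(folio->mapcount <= folio->refcount);

	/* With only a single page mapped, rather free up the large folio. */
	if (folio->mapcount <= 1)
		return false;
	if (!folio->mapped_exclusively)
		return false;
	/* Any additional reference (e.g., a pin) prevents reuse. */
	return folio->mapcount == folio->refcount;
}
```

In this model, a folio with four mappings and four references that is
mapped exclusively can be reused, while a fifth reference (for example,
from a pin) prevents reuse.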

There are plenty of things we can optimize in the future: For example, we
could remember that the folio is fully exclusive so we could speed up
the next fault further. Also, we could try "faulting around", turning
surrounding PTEs that map the same folio writable. But especially the
latter might increase COW latency, so it would need further
investigation.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 79 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c2143c40a134b..3803d4aa952ed 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3564,19 +3564,90 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf, struct folio *folio)
 	return ret;
 }
 
-static bool wp_can_reuse_anon_folio(struct folio *folio,
-				    struct vm_area_struct *vma)
+#ifdef CONFIG_MM_ID
+static bool __wp_can_reuse_large_anon_folio(struct folio *folio,
+		struct vm_area_struct *vma)
 {
+	bool exclusive = false;
+
+	/* Let's just free up a large folio if only a single page is mapped. */
+	if (folio_large_mapcount(folio) <= 1)
+		return false;
+
 	/*
-	 * We could currently only reuse a subpage of a large folio if no
-	 * other subpages of the large folios are still mapped. However,
-	 * let's just consistently not reuse subpages even if we could
-	 * reuse in that scenario, and give back a large folio a bit
-	 * sooner.
+	 * The assumption for anonymous folios is that each page can only get
+	 * mapped once into each MM. The only exception are KSM folios, which
+	 * are always small.
+	 *
+	 * Each taken mapcount must be paired with exactly one taken reference,
+	 * whereby the refcount must be incremented before the mapcount when
+	 * mapping a page, and the refcount must be decremented after the
+	 * mapcount when unmapping a page.
+	 *
+	 * If all folio references are from mappings, and all mappings are in
+	 * the page tables of this MM, then this folio is exclusive to this MM.
 	 */
-	if (folio_test_large(folio))
+	if (!folio_test_large_mapped_exclusively(folio))
+		return false;
+
+	VM_WARN_ON_ONCE(folio_test_ksm(folio));
+	VM_WARN_ON_ONCE(folio_mapcount(folio) > folio_nr_pages(folio));
+	VM_WARN_ON_ONCE(folio_entire_mapcount(folio));
+
+	if (unlikely(folio_test_swapcache(folio))) {
+		/*
+		 * Note: freeing up the swapcache will fail if some PTEs are
+		 * still swap entries.
+		 */
+		if (!folio_trylock(folio))
+			return false;
+		folio_free_swap(folio);
+		folio_unlock(folio);
+	}
+
+	if (folio_large_mapcount(folio) != folio_ref_count(folio))
 		return false;
 
+	/* Stabilize the mapcount vs. refcount and recheck. */
+	folio_lock_large_mapcount_data(folio);
+	VM_WARN_ON_ONCE(folio_large_mapcount(folio) > folio_ref_count(folio));
+
+	if (!folio_test_large_mapped_exclusively(folio))
+		goto unlock;
+	if (folio_large_mapcount(folio) != folio_ref_count(folio))
+		goto unlock;
+
+	VM_WARN_ON_ONCE(folio_mm0_id(folio) != vma->vm_mm->mm_id &&
+			folio_mm1_id(folio) != vma->vm_mm->mm_id);
+
+	/*
+	 * Do we need the folio lock? Likely not. If there would have been
+	 * references from page migration/swapout, we would have detected
+	 * an additional folio reference and never ended up here.
+	 */
+	exclusive = true;
+unlock:
+	folio_unlock_large_mapcount_data(folio);
+	return exclusive;
+}
+#else /* !CONFIG_MM_ID */
+static bool __wp_can_reuse_large_anon_folio(struct folio *folio,
+		struct vm_area_struct *vma)
+{
+	/*
+	 * We could reuse the last mapped page of a large folio, but let's
+	 * just free up this large folio.
+	 */
+	return false;
+}
+#endif /* !CONFIG_MM_ID */
+
+static bool wp_can_reuse_anon_folio(struct folio *folio,
+				    struct vm_area_struct *vma)
+{
+	if (folio_test_large(folio))
+		return __wp_can_reuse_large_anon_folio(folio, vma);
+
 	/*
 	 * We have to verify under folio lock: these early checks are
 	 * just an optimization to avoid locking the folio and freeing

From patchwork Thu Aug 29 16:56:14 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783466
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 22CA91B8EA5
	for <linux-fsdevel@vger.kernel.org>; Thu, 29 Aug 2024 16:58:39 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1724950720; cv=none;
 b=tf1YZLyTHikdT+sm7is6p+0eUSIg1ntxYAMNHTly9A8ndvhYyGBdiaaSwEjjmMQfx9zpBRggv4tsVNyRkviBba5sSxTek4EXIqzmGfnoHlknpjLa0fBkUuzWPgslrtsBm0q1DPsCODTlyxuEnwugp8GKHppgilnWnWfuvOjvOTg=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1724950720; c=relaxed/simple;
	bh=RBX78qpxhtQxekqwvy5pTTwYJ9L9zkw4ydpyf0vXszo=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=bKGnK9VZboNKi/GTY1olUBe9BezI2OdS8aCGbatxET6CZ2XpmV6o0eEI7wGPj8WoZVoIHN6rK7hXJQxpedtz3MPJpo5zMv0nstkJz0unP+43iIzSFs6UIF3rDA+Wd1nOx57eMTDIY3dKOaW08HooxOD/tWnYYi3NU8lM9NRt0uE=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=aYsXXYHp; arc=none smtp.client-ip=170.10.133.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="aYsXXYHp"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1724950718;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=z6SjsdX5+d4w13f2Yod61hus7319jekbv8gEFkX5n/A=;
	b=aYsXXYHp0kqdI3wPB5Gw9L7jrNQOVtvzp54AWUE4qGKBnYlMpBFFlAT1JPiRqrwkNbVtAG
	/fGl1iJER7HVBMDWH56e1yeludrsN/4cspytzYrj6PpXyfVeVY+KT/ORk8/4PBgkQDqbJ8
	F+4SKDiOE7768j1S5iv9ackhEIDNrFk=
Received: from mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-608-_x1Mv_w9P1eDEEfjyaDX-g-1; Thu,
 29 Aug 2024 12:58:34 -0400
X-MC-Unique: _x1Mv_w9P1eDEEfjyaDX-g-1
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id E0E2419792FA;
	Thu, 29 Aug 2024 16:58:31 +0000 (UTC)
Received: from t14s.redhat.com (unknown [10.39.193.245])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 3B8EB1955F21;
	Thu, 29 Aug 2024 16:58:23 +0000 (UTC)
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 11/17] mm: CONFIG_NO_PAGE_MAPCOUNT to prepare for not
 maintaining per-page mapcounts in large folios
Date: Thu, 29 Aug 2024 18:56:14 +0200
Message-ID: <20240829165627.2256514-12-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

We're close to the finishing line: let's introduce a new
CONFIG_NO_PAGE_MAPCOUNT config option where we will incrementally remove
any dependencies on per-page mapcounts in large folios. Once that's
done, we'll stop maintaining the per-page mapcounts with this
config option enabled.

CONFIG_NO_PAGE_MAPCOUNT will be EXPERIMENTAL for now, as we'll first have
to learn about the real-world impact of some of the implications.

As writing "!CONFIG_NO_PAGE_MAPCOUNT" is really nasty, let's introduce
a helper config option "CONFIG_PAGE_MAPCOUNT" that expresses the
negation.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/Kconfig | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 0877be8c50b6c..73cfacbd1cc6a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -878,8 +878,28 @@ config READ_ONLY_THP_FOR_FS
 	  support of file THPs will be developed in the next few release
 	  cycles.
 
+config NO_PAGE_MAPCOUNT
+	bool "No per-page mapcount (EXPERIMENTAL)"
+	depends on TRANSPARENT_HUGEPAGE && MM_ID
+	help
+	  Do not maintain per-page mapcounts for pages that are part of
+	  larger allocations, such as transparent huge pages.
+
+	  When this config option is enabled, some interfaces that relied on
+	  this information will rely on less-precise per-folio information
+	  instead: for example, using the average per-page mapcount in such
+	  a large allocation instead of the per-page mapcount.
+
+	  EXPERIMENTAL because the severity of some of the implications first
+	  has to be understood properly.
+
 endif # TRANSPARENT_HUGEPAGE
 
+# simple helper to make the code a bit easier to read
+config PAGE_MAPCOUNT
+	def_bool y
+	depends on !NO_PAGE_MAPCOUNT
+
 #
 # The architecture supports pgtable leaves that is larger than PAGE_SIZE
 #

From patchwork Thu Aug 29 16:56:15 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783467
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id F296D1B6546
	for <linux-fsdevel@vger.kernel.org>; Thu, 29 Aug 2024 16:58:47 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1724950729; cv=none;
 b=LVQaiyJw2yiEnLdi583lHaP0p0V+KmmyWgvfxWCcx1+CronOnf2ypoNBkFa1UmUGZIfHaC5fnMeBp7bKdLxz/c6eAeAs7FNAKZmyJE+dDmGVcPpAbPwAk18Y571otYon12L6T68yNqbAuuDDscfg9tXtCHIlg/gxmp6ZXDsU5eM=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1724950729; c=relaxed/simple;
	bh=wEDT3WYv5QAv16j+geOxf9Xm27ml+se5kydOVg5W9Ik=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=S+KONgF7s7YUyfdoW/dUnRmOwwPHoKRLDek1A1+cLYHUIRLKrYoFhOoczgxbvpcAcQZ5yTxsUFdF2wXoM/DzYOe7T+7qH75BDlj/A7Epy8J4PubRAmw7GPW8oakXC13/emdxX5mi9PtBCpKop2uBTfFOIAefYdLOwLRAG0H/cRE=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=TwrkiuT3; arc=none smtp.client-ip=170.10.133.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="TwrkiuT3"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1724950727;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=Xlz/Hr/MLXsaGBmeoyntSanirDE8WH40+jKnaAIxWt4=;
	b=TwrkiuT39S+U7orQ0EU3/Cv+JGECGl+zrdtpOPql9e4BFOrqO68YOPrxGNEeFrAucHVDc+
	rgTEtqYGYgpQuqHNs9IQHIxV1z8vrNojcEiBgJVl/AZAg6X++B+yfuALBZbnM/Xhr/RH4/
	1tSZ2rA3m8d4PxDVRyk7XnOW7k0CpZY=
Received: from mx-prod-mc-04.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-302-FFTH9f1IPp2QjlcNQc8UoQ-1; Thu,
 29 Aug 2024 12:58:45 -0400
X-MC-Unique: FFTH9f1IPp2QjlcNQc8UoQ-1
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-04.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id A6ADE190308D;
	Thu, 29 Aug 2024 16:58:43 +0000 (UTC)
Received: from t14s.redhat.com (unknown [10.39.193.245])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 3F3A51955F21;
	Thu, 29 Aug 2024 16:58:31 +0000 (UTC)
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 12/17] mm: remove per-page mapcount dependency in
 folio_likely_mapped_shared() (CONFIG_NO_PAGE_MAPCOUNT)
Date: Thu, 29 Aug 2024 18:56:15 +0200
Message-ID: <20240829165627.2256514-13-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

Let's remove the dependency on the mapcount of the first folio page in
large folios and consequently any "false negatives" from
folio_likely_mapped_shared().

In theory, we could implement this change only with CONFIG_MM_ID,
without gluing it to another config option. But we'll be a bit
careful for the time being, because folio_likely_mapped_shared() can now
return "false positives" more frequently. Glue it to
CONFIG_NO_PAGE_MAPCOUNT, which expresses the "EXPERIMENTAL" character for
now.

Let's reuse our new MM ownership tracking infrastructure for large folios.
Thoroughly document the changed semantics. We might now detect a
folio as "mapped shared" although it no longer is -- this can only happen
if more than two MMs mapped a folio at the same time, and neither of the
first two is the last one mapping the folio.
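
To make the false-positive scenario concrete, here is a small userspace
model. It is only a sketch under stated assumptions: the two-slot layout,
"struct model_folio" and all model_*() helpers are hypothetical and much
simpler than the real MM ownership tracking (no locking, no atomics).

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NO_ID 0	/* hypothetical marker for an unused slot */

/* Hypothetical model: at most two MMs are tracked per folio. */
struct model_folio {
	int id[2];	/* MM IDs occupying the two tracking slots */
	int count[2];	/* mappings accounted to each tracked MM */
	int total;	/* total mappings across all MMs */
};

static int model_find_slot(const struct model_folio *f, int id)
{
	for (int i = 0; i < 2; i++)
		if (f->id[i] == id)
			return i;
	return -1;
}

static void model_map(struct model_folio *f, int mm_id)
{
	int slot;

	assert(mm_id != MODEL_NO_ID);
	slot = model_find_slot(f, mm_id);
	if (slot < 0)
		slot = model_find_slot(f, MODEL_NO_ID); /* claim a free slot */
	f->total++;
	if (slot < 0)
		return;	/* both slots taken: this mapping stays untracked */
	f->id[slot] = mm_id;
	f->count[slot]++;
}

static void model_unmap(struct model_folio *f, int mm_id)
{
	int slot = model_find_slot(f, mm_id);

	assert(f->total > 0);
	f->total--;
	if (slot < 0)
		return;	/* this MM was never tracked in a slot */
	if (--f->count[slot] == 0)
		f->id[slot] = MODEL_NO_ID; /* last tracked mapping gone */
}

/* Exclusive only if one tracked MM accounts for all mappings. */
static bool model_mapped_exclusively(const struct model_folio *f)
{
	for (int i = 0; i < 2; i++)
		if (f->id[i] != MODEL_NO_ID && f->count[i] == f->total)
			return true;
	return false;
}

static bool model_likely_mapped_shared(const struct model_folio *f)
{
	if (f->total <= 1)
		return false;
	return !model_mapped_exclusively(f);
}
```

If MMs 1 and 2 occupy both slots, MM 3 maps two pages untracked, and MMs 1
and 2 then unmap everything, the model still reports "mapped shared"
although only MM 3 is left: a false positive. With only two MMs involved,
the remaining MM is always tracked and the result is exact.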

"false positives" in this context are certainly better than "false
negatives" when it comes to enforcing policies (e.g., is process 1
allowed to migrate a folio that might also be used by another process?),
but in an ideal world we wouldn't have these "false positives" either.

It's worth noting that there will not be a change for small folios and
hugetlb folios. In general, for PMD-mapped THP we don't expect a change,
only for PTE-mapped THP.

This will affect various users of folio_likely_mapped_shared():

(1) khugepaged counts PTEs that target shared folios towards the
    max_ptes_shared. With false positives we might collapse too little,
    with false negatives too much.

(2) NUMA hinting: PROT_NONE NUMA protection will be skipped for shared
    folios in COW mappings. With false positives we skip too many, with
    false negatives we don't skip some we should be skipping.

    During NUMA hinting faults, we will set TNF_SHARED with shared folios
    in shared mappings. With false positives we set it too often, with
    false negatives not often enough.

    During NUMA hinting faults, we will refuse to migrate shared folios in
    mappings with execute permissions (expectation: shared libraries).
    With false positives we refuse to migrate some folios we could have
    migrated, with false negatives we migrate too many.

(3) MADV_COLD / MADV_PAGEOUT / MADV_FREE will not try splitting PTE-mapped
    THPs that are considered shared but not fully covered by the
    requested range, consequently not processing them. With false
    positives we will not split+process some we could have processed, with
    false negatives we split some folios we probably shouldn't have split.

(4) mbind() / migrate_pages() / move_pages() will refuse to migrate shared
    folios unless MPOL_MF_MOVE_ALL is effective (requires CAP_SYS_NICE).
    With false positives we refuse to migrate some folios that could be
    migrated, with false negatives we migrate some folios that shouldn't
    have been migrated.

(5) folio_referenced_one() will skip exclusive swapbacked folios in
    dying processes. Shared folios will not be skipped. With false
    positives we might skip this optimization, with false negatives we
    might apply this optimization wrongly.

Likely (3) and (4) are not really used a lot on folios that are heavily
shared among processes -- rather on anonymous memory (mostly from a
single parent process) or almost-exclusively mmap'ed files.

Similarly (1) is not expected to matter much in practice, and if so,
only for long-running child processes after fork(). But even here, it's
unlikely that it matters in practice.

(5) is not expected to matter much at all, it's a new optimization
either way.

(2) is interesting: the expectation here is that for anon folios it
might not make a big difference. For file-backed pages it might,
we'll have to learn about that.

Long story short: this paves the way for a complete
CONFIG_NO_PAGE_MAPCOUNT implementation, but maybe we'll have to
switch to another MM ownership tracking later.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 98411e53da916..b37f20b26776d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2142,9 +2142,9 @@ static inline size_t folio_size(const struct folio *folio)
  * are independent.
  *
  * As precise information is not easily available for all folios, this function
- * estimates the number of MMs ("sharers") that are currently mapping a folio
- * using the number of times the first page of the folio is currently mapped
- * into page tables.
+ * must sometimes estimate the number of MMs ("sharers") that are currently
+ * mapping a folio using the number of times the first page of the folio is
+ * currently mapped into page tables.
  *
  * For small anonymous folios and anonymous hugetlb folios, the return
  * value will be exactly correct: non-KSM folios can only be mapped at most once
@@ -2152,13 +2152,21 @@ static inline size_t folio_size(const struct folio *folio)
  * considered shared even if mapped multiple times into the same MM.
  *
  * For other folios, the result can be fuzzy:
- *    #. For partially-mappable large folios (THP), the return value can wrongly
- *       indicate "mapped exclusively" (false negative) when the folio is
- *       only partially mapped into at least one MM.
+ *    #. With CONFIG_PAGE_MAPCOUNT: For partially-mappable large folios (THP),
+ *       the return value can wrongly indicate "mapped exclusively" (false
+ *       negative) when the folio is only partially mapped into at least one MM.
+ *    #. With CONFIG_NO_PAGE_MAPCOUNT: For partially-mappable large folios
+ *       (THP), the return value can wrongly indicate "mapped shared" (false
+ *       positive) in some scenarios. This can only happen if two MMs are
+ *       already mapping a folio and another MM starts mapping the folio. We
+ *       would still detect the folio as "mapped shared" after the first
+ *       two MMs no longer map the folio.
  *    #. For pagecache folios (including hugetlb), the return value can wrongly
  *       indicate "mapped shared" (false positive) when two VMAs in the same MM
  *       cover the same file range.
  *
+ * With CONFIG_NO_PAGE_MAPCOUNT, this function will never return "false negatives".
+ *
  * Further, this function only considers current page table mappings that
  * are tracked using the folio mapcount(s).
  *
@@ -2183,12 +2191,16 @@ static inline bool folio_likely_mapped_shared(struct folio *folio)
 	if (mapcount <= 1)
 		return false;
 
+#ifdef CONFIG_PAGE_MAPCOUNT
 	/* If any page is mapped more than once we treat it "mapped shared". */
 	if (folio_entire_mapcount(folio) || mapcount > folio_large_nr_pages(folio))
 		return true;
 
 	/* Let's guess based on the first subpage. */
 	return atomic_read(&folio->_mapcount) > 0;
+#else /* !CONFIG_PAGE_MAPCOUNT */
+	return !folio_test_large_mapped_exclusively(folio);
+#endif /* !CONFIG_PAGE_MAPCOUNT */
 }
 
 #ifndef HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE

From patchwork Thu Aug 29 16:56:16 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783468
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id C8EDD1B81DB
	for <linux-fsdevel@vger.kernel.org>; Thu, 29 Aug 2024 16:59:00 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1724950742; cv=none;
 b=WVzlTr/o6q57N4bAZ/QbjT4bSNApFK6F9aXT7O3w1YhMQx0k1bYxIr4hS55iaTxyTbW/UpA/6dbA3ciE5tOF4OjBnvSVn3ufZdx8p5Jffjo69PiuYYWWW942GmeeflPWk1q30gXH4lNxXKcPlkbtkI/qte4VucBjB/6rZDaZNcI=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1724950742; c=relaxed/simple;
	bh=HNUw6L79YorAY7Er9IcXo5/0qoizTqeiSeFuYwU1AX8=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=naa++6cmRE5NijOizhPOtZdJTb737wYgQeS7SPoe0feUTv9vxl1B3mN2jSgCZRr5bfD9o14FSQblp4JYa3i3mzwSBy/ptx6tKF9PBlXfrOyEK6k6ZroJkyuPOrNPg6Yq3kb/surprwB9B+mebiODjcmuIJQ/SgoK8LpdGrf05Jo=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=MIDHBggJ; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="MIDHBggJ"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1724950740;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=/4w50puZlu0HXrTGynq7fLz3hKXbxnRdIDDiKfwRwdI=;
	b=MIDHBggJjuRbN5jmkFDAYktU12AH0nDt/FqD1zLCI/Hl0v1nRWzFA7pUwrFtmjZeyTkNBZ
	S2L1hkq2E8/cVbAn25UGDffFn1+Wpu6mLnfPoVG14dlEnsruDEpQWhyYiQZTGvzYXMep1N
	xmRJrANuwDlVwQqw+ge6iFjpCobft3I=
Received: from mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-251-I4RGjMuXNxC500-cFdsqyQ-1; Thu,
 29 Aug 2024 12:58:55 -0400
X-MC-Unique: I4RGjMuXNxC500-cFdsqyQ-1
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 4912C18F498B;
	Thu, 29 Aug 2024 16:58:51 +0000 (UTC)
Received: from t14s.redhat.com (unknown [10.39.193.245])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 1C0F81955F21;
	Thu, 29 Aug 2024 16:58:44 +0000 (UTC)
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 13/17] fs/proc/page: remove per-page mapcount dependency
 for /proc/kpagecount (CONFIG_NO_PAGE_MAPCOUNT)
Date: Thu, 29 Aug 2024 18:56:16 +0200
Message-ID: <20240829165627.2256514-14-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

Let's implement an alternative when per-page mapcounts in large folios
are no longer maintained -- soon with CONFIG_NO_PAGE_MAPCOUNT.

For large folios, we'll return the per-page average mapcount within the
folio, except when the average is 0 but the folio is mapped: then we
return 1.

For hugetlb folios and for large folios that are fully mapped
into all address spaces, there is no change.
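
The rounding can be illustrated with a small userspace sketch that mirrors
the arithmetic of folio_average_page_mapcount() from this patch for large
folios. The model_*() names and the plain integer parameters are
assumptions made for illustration; the kernel helper reads these values
from the folio instead.

```c
#include <assert.h>

/*
 * Average mappings per page of a large folio, rounded to nearest.
 * @order: folio order (the folio has 1 << order pages)
 * @mapcount: total mapcount of the folio (PTE and entire mappings)
 * @entire_mapcount: how often the folio is mapped as a whole (e.g., PMD)
 */
static int model_average_page_mapcount(unsigned int order, int mapcount,
				       int entire_mapcount)
{
	unsigned int adjust;

	assert(order > 0);
	if (mapcount <= entire_mapcount)
		return entire_mapcount;
	/* Only the per-page (PTE) mappings get averaged. */
	mapcount -= entire_mapcount;

	/* Adding half the page count rounds to nearest instead of down. */
	adjust = (1u << order) / 2;
	return (int)(((unsigned int)mapcount + adjust) >> order) +
	       entire_mapcount;
}

/* Value reported per page: the average, but at least 1 if mapped at all. */
static int model_kpagecount_value(unsigned int order, int mapcount,
				  int entire_mapcount)
{
	int avg = model_average_page_mapcount(order, mapcount,
					      entire_mapcount);

	if (!avg && mapcount > 0)
		avg = 1;
	return avg;
}
```

For an order-4 folio (16 pages), 16 PTE mappings average to 1 and 24 to 2.
Seven PTE mappings average to 0, so the reported value is bumped to 1. An
order-9 folio with one entire mapping plus 512 PTE mappings averages to 2.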

As an alternative, we could simply return 0 for non-hugetlb large folios,
or disable this legacy interface with CONFIG_NO_PAGE_MAPCOUNT.

But the information exposed by this interface can still be valuable, and
frequently we deal with fully-mapped large folios where the average
corresponds to the actual page mapcount. So we'll leave it like this for
now and document the new behavior.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 Documentation/admin-guide/mm/pagemap.rst |  7 +++++-
 fs/proc/internal.h                       | 31 ++++++++++++++++++++++++
 fs/proc/page.c                           | 18 +++++++++++---
 3 files changed, 52 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index caba0f52dd36c..49590306c61a0 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -42,7 +42,12 @@ There are four components to pagemap:
    skip over unmapped regions.
 
  * ``/proc/kpagecount``.  This file contains a 64-bit count of the number of
-   times each page is mapped, indexed by PFN.
+   times each page is mapped, indexed by PFN. Some kernel configurations do
+   not track the precise number of times a page that is part of a larger
+   allocation (e.g., THP) is mapped. In these configurations, the average
+   number of mappings per page in this larger allocation is returned instead.
+   However, if any page of the large allocation is mapped, the returned value
+   will be at least 1.
 
 The page-types tool in the tools/mm directory can be used to query the
 number of times a page is mapped.
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cc520168f8b69..3c687f97e18c4 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -174,6 +174,37 @@ static inline int folio_precise_page_mapcount(struct folio *folio,
 	return mapcount;
 }
 
+/**
+ * folio_average_page_mapcount() - Average number of mappings per page in this
+ *				   folio
+ * @folio: The folio.
+ *
+ * The average number of present user page table entries that reference each
+ * page in this folio as tracked via the RMAP: either referenced directly
+ * (PTE) or as part of a larger area that covers this page (e.g., PMD).
+ *
+ * Returns: The average number of mappings per page in this folio. 0 for
+ * folios that are not mapped to user space or are not tracked via the RMAP
+ * (e.g., shared zeropage).
+ */
+static inline int folio_average_page_mapcount(struct folio *folio)
+{
+	int mapcount, entire_mapcount;
+	unsigned int adjust;
+
+	if (!folio_test_large(folio))
+		return atomic_read(&folio->_mapcount) + 1;
+
+	mapcount = folio_large_mapcount(folio);
+	entire_mapcount = folio_entire_mapcount(folio);
+	if (mapcount <= entire_mapcount)
+		return entire_mapcount;
+	mapcount -= entire_mapcount;
+
+	adjust = folio_large_nr_pages(folio) / 2;
+	return ((mapcount + adjust) >> folio_large_order(folio)) +
+		entire_mapcount;
+}
 /*
  * array.c
  */
diff --git a/fs/proc/page.c b/fs/proc/page.c
index a55f5acefa974..c7838de949287 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -67,9 +67,21 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
 		 * memmaps that were actually initialized.
 		 */
 		page = pfn_to_online_page(pfn);
-		if (page)
-			mapcount = folio_precise_page_mapcount(page_folio(page),
-							       page);
+		if (page) {
+			struct folio *folio = page_folio(page);
+
+#ifdef CONFIG_PAGE_MAPCOUNT
+			mapcount = folio_precise_page_mapcount(folio, page);
+#else /* !CONFIG_PAGE_MAPCOUNT */
+			/*
+			 * Indicate the per-page average, but at least "1" for
+			 * mapped folios.
+			 */
+			mapcount = folio_average_page_mapcount(folio);
+			if (!mapcount && folio_test_large(folio) && folio_mapped(folio))
+				mapcount = 1;
+#endif /* !CONFIG_PAGE_MAPCOUNT */
+		}
 
 		if (put_user(mapcount, out)) {
 			ret = -EFAULT;

From patchwork Thu Aug 29 16:56:17 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783469
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 83FF51B86D1
	for <linux-fsdevel@vger.kernel.org>; Thu, 29 Aug 2024 16:59:07 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1724950749; cv=none;
 b=qpf7P2aFBpekw1n/ZuJX9N5aL6ArTlK2coWBhVdhsS92CDvXbgAOJJjwiVneY+LXxVl9qwcPbZMYSf5erSiyVJzNtBP1M2DAJdQn0vR7kTZ/k3eTbh9sOzO1Boj8fC1QmZi1WyePTTzBTzaF1HrR9cnbX9FrDpv60JTyc020hIo=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1724950749; c=relaxed/simple;
	bh=K+woEaTnjUmqkdB8pKxZ/xg0ooE3m8DTsy83tmmcGy4=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=PJAdGciBvKLY2tqaxGPLLzOoZhdZi1vhLW6twJC2uOeM1tJKf80jwGPA4dePkS8cgEvcZWjHZ2cL0W/lHtvpKQZARLjTlAYWSIzn5/72ptnBoqITpqGuLfqVHJqRzkBb+GXS25oMK8f3afd0SGv+v9L40polrk+03gh7AUWpGB0=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=UGlITaHl; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="UGlITaHl"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1724950746;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=cKf6dQguD8Od23K0wDaepWW4S0v2e9lLNnYaNj5b1cU=;
	b=UGlITaHltMNzqQGpq8hJ+HSNs/wIhFYff+GPM6DsjJSSUoDNSthFXNT19D2aibOOKc5MHL
	Wt9MMlY8+mURMGvFByu3iSBZTDhHLa7MlGG+PvMUxt0alA+X6sPUzCZy6UHdGgizhGIu7k
	OLGJqYJLXe8gLt3yoJDwP/ZKsiZnGs4=
Received: from mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-541-ui0rW_eRM6iDl2qZZuTgKA-1; Thu,
 29 Aug 2024 12:59:02 -0400
X-MC-Unique: ui0rW_eRM6iDl2qZZuTgKA-1
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 3AA1D18F498B;
	Thu, 29 Aug 2024 16:59:00 +0000 (UTC)
Received: from t14s.redhat.com (unknown [10.39.193.245])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 6ACD61955F66;
	Thu, 29 Aug 2024 16:58:51 +0000 (UTC)
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 14/17] fs/proc/task_mmu: remove per-page mapcount
 dependency for PM_MMAP_EXCLUSIVE (CONFIG_NO_PAGE_MAPCOUNT)
Date: Thu, 29 Aug 2024 18:56:17 +0200
Message-ID: <20240829165627.2256514-15-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

Let's implement an alternative when per-page mapcounts in large folios are
no longer maintained -- soon with CONFIG_NO_PAGE_MAPCOUNT.

PM_MMAP_EXCLUSIVE will now be set only if folio_likely_mapped_shared() is
false. That function returns true when the folio is considered "mapped
shared", including when it once was "mapped shared" but no longer is, as
documented.

This might result in an under-indication of "exclusively mapped", which
is considered better than over-indicating it: under-estimating the USS
(Unique Set Size) is better than over-estimating it.

As an alternative, we could simply remove that flag with
CONFIG_NO_PAGE_MAPCOUNT completely, but there might be value to it. So,
let's keep it like that and document the behavior.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
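Reviewer note (below the "---" marker, so not part of the commit message):
for tools that consume /proc/pid/pagemap, the practical consequence is that
a set bit 56 still means "exclusively mapped", while a clear bit 56 on a
present page only means "possibly shared". A minimal userspace sketch of
that interpretation, assuming the bit layout documented in pagemap.rst; the
enum and both helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout of a pagemap entry as documented in pagemap.rst. */
#define PM_PFRAME_MASK		((UINT64_C(1) << 55) - 1)	/* bits 0-54 */
#define PM_MMAP_EXCLUSIVE	(UINT64_C(1) << 56)
#define PM_PRESENT		(UINT64_C(1) << 63)

/* Hypothetical classification used by this sketch only. */
enum excl_hint {
	HINT_NOT_PRESENT,
	HINT_EXCLUSIVE,		/* bit 56 set: mapped exclusively */
	HINT_MAYBE_SHARED,	/* bit 56 clear: shared, or under-indicated */
};

static enum excl_hint pagemap_excl_hint(uint64_t entry)
{
	if (!(entry & PM_PRESENT))
		return HINT_NOT_PRESENT;
	return (entry & PM_MMAP_EXCLUSIVE) ? HINT_EXCLUSIVE : HINT_MAYBE_SHARED;
}

/* PFN of a present page; reads as 0 without the required privileges. */
static uint64_t pagemap_pfn(uint64_t entry)
{
	assert(entry & PM_PRESENT);
	return entry & PM_PFRAME_MASK;
}
```

Each entry is one 64-bit value per virtual page, so the entry for a virtual
address sits at file offset (vaddr / page size) * 8.
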
 Documentation/admin-guide/mm/pagemap.rst |  9 +++++++++
 fs/proc/task_mmu.c                       | 16 ++++++++++++++--
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index 49590306c61a0..131c86574c39a 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -37,6 +37,15 @@ There are four components to pagemap:
    precisely which pages are mapped (or in swap) and comparing mapped
    pages between processes.
 
+   Note that in some kernel configurations, all pages part of a larger
+   allocation (e.g., THP) might be considered "mapped shared" if the large
+   allocation is considered "mapped shared": if not all pages are exclusive to
+   the same process. Further, some kernel configurations might consider larger
+   allocations "mapped shared", if they were at one point considered
+   "mapped shared", even if they would now be considered "exclusively mapped".
+   Consequently, in these kernel configurations, bit 56 might not be set
+   although the page is actually "exclusively mapped".
+
    Efficient users of this interface will use ``/proc/pid/maps`` to
    determine which areas of memory are actually mapped and llseek to
    skip over unmapped regions.
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5f171ad7b436b..f35a63c4b7c7a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -29,6 +29,18 @@
 #include <asm/tlbflush.h>
 #include "internal.h"
 
+#ifdef CONFIG_PAGE_MAPCOUNT
+static bool __folio_page_mapped_exclusively(struct folio *folio, struct page *page)
+{
+	return folio_precise_page_mapcount(folio, page) == 1;
+}
+#else /* !CONFIG_PAGE_MAPCOUNT */
+static bool __folio_page_mapped_exclusively(struct folio *folio, struct page *page)
+{
+	return !folio_likely_mapped_shared(folio);
+}
+#endif /* CONFIG_PAGE_MAPCOUNT */
+
 #define SEQ_PUT_DEC(str, val) \
 		seq_put_decimal_ull_width(m, str, (val) << (PAGE_SHIFT-10), 8)
 void task_mem(struct seq_file *m, struct mm_struct *mm)
@@ -1746,7 +1758,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		if (!folio_test_anon(folio))
 			flags |= PM_FILE;
 		if ((flags & PM_PRESENT) &&
-		    folio_precise_page_mapcount(folio, page) == 1)
+		    __folio_page_mapped_exclusively(folio, page))
 			flags |= PM_MMAP_EXCLUSIVE;
 	}
 	if (vma->vm_flags & VM_SOFTDIRTY)
@@ -1821,7 +1833,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 			pagemap_entry_t pme;
 
 			if (folio && (flags & PM_PRESENT) &&
-			    folio_precise_page_mapcount(folio, page + idx) == 1)
+			    __folio_page_mapped_exclusively(folio, page + idx))
 				cur_flags |= PM_MMAP_EXCLUSIVE;
 
 			pme = make_pme(frame, cur_flags);

From patchwork Thu Aug 29 16:56:18 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783470
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id E4B8B1B86D7
	for <linux-fsdevel@vger.kernel.org>; Thu, 29 Aug 2024 16:59:15 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1724950757; cv=none;
 b=DqzOh60eQ7nBkid7xTpfSSW11Pk+xRKR+sCy6ktB5lT3dDZduobV+cOzB3rxcgT8blB5PZ1oTiB9j6R91NXHe8P9FA8ssXmrwNcIee5vH0pvgs8OksbtC4psBcF3If7q/jB/mgANUC5mB8uaygLn4pThy2KRW567SxMiGk4PtmY=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1724950757; c=relaxed/simple;
	bh=/4f3liGgTHCSBxlNCkznXeziGc69WaWwlAw0W8Tv14I=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=iboz/eg2I5jGmiBxWF76C8/IubZcxHKRfRwDKAkOxe+Q8sfFHvCMdifz4JOmNWbR2+9VjnUFLr37/xHe/Iz9CIdhwOD1AS45y7PCtPKaZUMYV/a3ak/gwueQaY+s4+nQljmbkj+VHQjUB0o9VjtSYVqBzBm8wlWRbbFw03bhveM=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=JlaYG4gg; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="JlaYG4gg"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1724950755;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=MiGCooKskvY1haSoDErYW63CNhCWGlypowkYLoi/78c=;
	b=JlaYG4ggMiutZJ6H3Un4SXHxPC9ict6ee9mC22M+8KWxaa/pT67o/trsfeaWk9dm5XRAip
	q+1f9iqMmcR2SU5/840eeFcbWnQGBCoVo+/ZQVH5mtZUNA9r/iqBJ3Q8/QsZmMMVI8UnTz
	p2l0peQtI3hsgdMb7iAc0N7MLor8Ed8=
Received: from mx-prod-mc-02.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-169-8aXbFsVPOVG-1DFlGxFIdw-1; Thu,
 29 Aug 2024 12:59:11 -0400
X-MC-Unique: 8aXbFsVPOVG-1DFlGxFIdw-1
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-02.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 6AE9218BC2D7;
	Thu, 29 Aug 2024 16:59:08 +0000 (UTC)
Received: from t14s.redhat.com (unknown [10.39.193.245])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id A5CBF1955F21;
	Thu, 29 Aug 2024 16:59:00 +0000 (UTC)
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 15/17] fs/proc/task_mmu: remove per-page mapcount
 dependency for "mapmax" (CONFIG_NO_PAGE_MAPCOUNT)
Date: Thu, 29 Aug 2024 18:56:18 +0200
Message-ID: <20240829165627.2256514-16-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

Let's implement an alternative when per-page mapcounts in large folios are
no longer maintained -- soon with CONFIG_NO_PAGE_MAPCOUNT.

For calculating "mapmax", we now use the average per-page mapcount in
a large folio instead of the per-page mapcount.

For hugetlb folios and folios that are not partially mapped into MMs,
there is no change.

Likely, this change will not matter much in practice, and an alternative
might be to simply remove this stat with CONFIG_NO_PAGE_MAPCOUNT.
However, there might be value to it, so let's keep it like that and
document the behavior.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 Documentation/filesystems/proc.rst | 5 +++++
 fs/proc/task_mmu.c                 | 8 +++++++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index e834779d96115..bed03e77c0f91 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -684,6 +684,11 @@ Where:
 node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page
 size, in KB, that is backing the mapping up.
 
+Note that some kernel configurations do not track the precise number of times
+a page part of a larger allocation (e.g., THP) is mapped. In these
+configurations, "mapmax" might correspond to the average number of mappings
+per page in such a larger allocation instead.
+
 1.2 Kernel data
 ---------------
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f35a63c4b7c7a..3d9fe99346478 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2872,7 +2872,13 @@ static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty,
 			unsigned long nr_pages)
 {
 	struct folio *folio = page_folio(page);
-	int count = folio_precise_page_mapcount(folio, page);
+	int count;
+
+#ifdef CONFIG_PAGE_MAPCOUNT
+	count = folio_precise_page_mapcount(folio, page);
+#else
+	count = max_t(int, folio_average_page_mapcount(folio), 1);
+#endif
 
 	md->pages += nr_pages;
 	if (pte_dirty || folio_test_dirty(folio))

From patchwork Thu Aug 29 16:56:19 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783471
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id CA9D81B6526
	for <linux-fsdevel@vger.kernel.org>; Thu, 29 Aug 2024 16:59:24 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1724950766; cv=none;
 b=piZEjvdq+6CT19ELDXjI9huvWtVrMSK/CgQ+zEBKDSXApq9u8OLPdLHvyQ7abj+RTy4b78xkKRnWyzyE5p8PuDmyyN/zjD66EYj3zcoGfgXN2qditanH4t04gEvAF+RNBv8UmWwlzYPenh0VXelxk9bjcwG+1+wt9WrFFPKX78Q=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1724950766; c=relaxed/simple;
	bh=2kZ7QGLHXwtMIchnMAYhCynmld3eYgcPA3qlbJ9RCfE=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=JVbOwQ8njn2y7EMK2ix5WPnWjXaL92OlgweNw74h1q2miSf2c/9zDAC/KSrjpJ4oXwPfYxgBwINIRfo1FN9SwLFzHU7cl6hNECk6+YAhAEfeddjBOZSgovXkqLxSfgOTblsD8ZCl7KYmJzUXbQFfySnRF6b5+4vN9gFAn3N7Dho=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=dA88gEGw; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="dA88gEGw"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1724950764;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=mZddLqrKBd43/r/px6zwJ3YX6SWmvbs5NWYNR2u4nCE=;
	b=dA88gEGwSCc4JsLAhqrqyQM1M0ddHisK+9SpCybEzkrramdSbwAHMxk040k7en/eOOecaH
	XUGkrp7c85k7MbgntUuwgIrV25CcL23/RR/Zkg77GaN1Y2R51qavE48J8rckvVawpaQHFE
	IATKzhP/dbVtKaiquaf3geOX8UZc3Ok=
Received: from mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-481-IesVcwbKN1S5VOVmE03pWw-1; Thu,
 29 Aug 2024 12:59:18 -0400
X-MC-Unique: IesVcwbKN1S5VOVmE03pWw-1
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 9B545196E0A8;
	Thu, 29 Aug 2024 16:59:16 +0000 (UTC)
Received: from t14s.redhat.com (unknown [10.39.193.245])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 343891955F21;
	Thu, 29 Aug 2024 16:59:08 +0000 (UTC)
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 16/17] fs/proc/task_mmu: remove per-page mapcount
 dependency for smaps/smaps_rollup (CONFIG_NO_PAGE_MAPCOUNT)
Date: Thu, 29 Aug 2024 18:56:19 +0200
Message-ID: <20240829165627.2256514-17-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

Let's implement an alternative when per-page mapcounts in large folios are
no longer maintained -- soon with CONFIG_NO_PAGE_MAPCOUNT.

When computing the output for smaps / smaps_rollup, in particular when
calculating the USS (Unique Set Size) and the PSS (Proportional Set Size),
we still rely on per-page mapcounts.

To determine private vs. shared, we'll use folio_likely_mapped_shared(),
similar to how we handle PM_MMAP_EXCLUSIVE. Similarly, we might now
under-estimate the USS and count pages towards "shared" that are
actually "private" ("exclusively mapped").

When calculating the PSS, we'll now also use the average per-page
mapcount for large folios: this can result in either an over-estimation
or an under-estimation of the PSS. The difference is not expected to
matter much in practice, but we'll have to learn as we go.

We can now provide folio_precise_page_mapcount() only with
CONFIG_PAGE_MAPCOUNT, and remove one of the last users of per-page
mapcounts when CONFIG_NO_PAGE_MAPCOUNT is enabled.

Document the new behavior.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
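Reviewer note (below the "---" marker, so not part of the commit message):
to illustrate how the averaged mapcount can move the PSS in either
direction, here is a small userspace model. It is a sketch only: it assumes
a 4 KiB page size, a PSS shift of 12, a process that maps every page of the
folio via PTEs, and it uses hypothetical helper names throughout.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE	4096u	/* assumed page size */
#define MODEL_PSS_SHIFT	12	/* assumed fixed-point shift */

/* Per-page PSS contribution in fixed point, as in smaps_account(). */
static uint64_t page_pss(int mapcount)
{
	uint64_t pss = (uint64_t)MODEL_PAGE_SIZE << MODEL_PSS_SHIFT;

	if (mapcount >= 2)
		pss /= (uint64_t)mapcount;
	return pss;
}

/* PSS in bytes using the precise mapcount of every page. */
static uint64_t pss_precise(const int *mapcounts, int nr)
{
	uint64_t pss = 0;
	int i;

	for (i = 0; i < nr; i++)
		pss += page_pss(mapcounts[i]);
	return pss >> MODEL_PSS_SHIFT;
}

/* PSS in bytes using one rounded average mapcount for the whole folio. */
static uint64_t pss_average(const int *mapcounts, int nr)
{
	int total = 0, avg, i;

	assert(nr > 0);
	for (i = 0; i < nr; i++)
		total += mapcounts[i];
	avg = (total + nr / 2) / nr;	/* round to nearest */
	return ((uint64_t)nr * page_pss(avg)) >> MODEL_PSS_SHIFT;
}
```

With per-page mapcounts {2, 1, 1, 1} the precise PSS is 14336 bytes while
the averaged variant reports 16384 bytes (over-estimation); with
{2, 2, 2, 1} the precise PSS is 10240 bytes while the averaged variant
reports 8192 bytes (under-estimation).
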
 Documentation/filesystems/proc.rst | 13 +++++++++++++
 fs/proc/internal.h                 |  2 ++
 fs/proc/task_mmu.c                 | 17 +++++++++++++++--
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index bed03e77c0f91..7cbab4135f244 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -504,6 +504,19 @@ Note that even a page which is part of a MAP_SHARED mapping, but has only
 a single pte mapped, i.e.  is currently used by only one process, is accounted
 as private and not as shared.
 
+Note that in some kernel configurations, all pages part of a larger allocation
+(e.g., THP) might be considered "shared" if the large allocation is
+considered "shared": if not all pages are exclusive to the same process.
+Further, some kernel configurations might consider larger allocations "shared",
+if they were at one point considered "shared", even if they would now be
+considered "exclusive".
+
+Some kernel configurations do not track the precise number of times a page part
+of a larger allocation is mapped. In this case, when calculating the PSS, the
+average number of mappings per page in this larger allocation might be used
+as an approximation for the number of mappings of a page. The PSS calculation
+will be imprecise in this case.
+
 "Referenced" indicates the amount of memory currently marked as referenced or
 accessed.
 
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 3c687f97e18c4..8c9ef19526d2b 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -143,6 +143,7 @@ unsigned name_to_int(const struct qstr *qstr);
 /* Worst case buffer size needed for holding an integer. */
 #define PROC_NUMBUF 13
 
+#ifdef CONFIG_PAGE_MAPCOUNT
 /**
  * folio_precise_page_mapcount() - Number of mappings of this folio page.
  * @folio: The folio.
@@ -173,6 +174,7 @@ static inline int folio_precise_page_mapcount(struct folio *folio,
 
 	return mapcount;
 }
+#endif /* CONFIG_PAGE_MAPCOUNT */
 
 /**
  * folio_average_page_mapcount() - Average number of mappings per page in this
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3d9fe99346478..30306e231ff04 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -734,6 +734,8 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 	struct folio *folio = page_folio(page);
 	int i, nr = compound ? compound_nr(page) : 1;
 	unsigned long size = nr * PAGE_SIZE;
+	bool exclusive;
+	int mapcount;
 
 	/*
 	 * First accumulate quantities that depend only on |size| and the type
@@ -774,18 +776,29 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 				      dirty, locked, present);
 		return;
 	}
+
+#ifndef CONFIG_PAGE_MAPCOUNT
+	mapcount = folio_average_page_mapcount(folio);
+	exclusive = !folio_likely_mapped_shared(folio);
+#endif
+
 	/*
 	 * We obtain a snapshot of the mapcount. Without holding the folio lock
 	 * this snapshot can be slightly wrong as we cannot always read the
 	 * mapcount atomically.
 	 */
 	for (i = 0; i < nr; i++, page++) {
-		int mapcount = folio_precise_page_mapcount(folio, page);
 		unsigned long pss = PAGE_SIZE << PSS_SHIFT;
+
+#ifdef CONFIG_PAGE_MAPCOUNT
+		mapcount = folio_precise_page_mapcount(folio, page);
+		exclusive = mapcount < 2;
+#endif
+
 		if (mapcount >= 2)
 			pss /= mapcount;
 		smaps_page_accumulate(mss, folio, PAGE_SIZE, pss,
-				dirty, locked, mapcount < 2);
+				dirty, locked, exclusive);
 	}
 }
 

From patchwork Thu Aug 29 16:56:20 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13783472
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3110E1B78E4
	for <linux-fsdevel@vger.kernel.org>; Thu, 29 Aug 2024 16:59:37 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1724950780; cv=none;
 b=fVIwXtQbiFHwxxv/9Ie/86acUMyYyrSH3l2MeZJayxhX408I6RBJj6xaCROfRULQ+CeqdbhblLRWf7DOqsND2cezd+cAVflF1jHlEK3wj8tX3HkLoI6E2KDDpVzlTUVIrYRE9qqtd4m4p4wowHh5rKSeHV50UApnRFeRh+pFOQ8=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1724950780; c=relaxed/simple;
	bh=K/LffPhHgsVkQ8Hmirto/gEaeHVv422Lba8I+gXg09M=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=sbgJXzZmBtb+5aUyGp74r9QZhfDIfIEnroGqdnGn/p5kkOYdteWX6IRUvtPhptPXqeZ4eLkehGlPCf49stHzGBN1xzdkWWyJ+xAOyMAJyNh+wDEW277tJMUCNasWbRQh8EWI550KmjneXczCs9qN4jJlAJV9IuqWPX+/nms4NQE=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=U6st3mUV; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="U6st3mUV"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1724950777;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=up1+iyu0HvtZL5E+EXssaXi/TIZa1G2RQW1zOiYjenc=;
	b=U6st3mUVFaGagzTpIbDBUJ1wLcggrYpsvyWW8bSAw9eACFltxARznfBEL7zzm8dWnDamF3
	ZJmsncJutyb5vr/dyCFx/XROTRFcBG7L/IFeLDAkiiw26puNQFe5+geU6xydSuMASpWzC+
	5UIxo3Rmd1LVspjpfZokHl6qo4ioAJI=
Received: from mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-669-1KRbKMh9NyG_mK90LyPUkA-1; Thu,
 29 Aug 2024 12:59:30 -0400
X-MC-Unique: 1KRbKMh9NyG_mK90LyPUkA-1
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id D2DEC195420D;
	Thu, 29 Aug 2024 16:59:27 +0000 (UTC)
Received: from t14s.redhat.com (unknown [10.39.193.245])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id DD5A51955F66;
	Thu, 29 Aug 2024 16:59:17 +0000 (UTC)
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
 linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>,
 Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>,
	=?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>,
 Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>,
 Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
 Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>
Subject: [PATCH v1 17/17] mm: stop maintaining the per-page mapcount of large
 folios (CONFIG_NO_PAGE_MAPCOUNT)
Date: Thu, 29 Aug 2024 18:56:20 +0200
Message-ID: <20240829165627.2256514-18-david@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>
References: <20240829165627.2256514-1-david@redhat.com>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id: <linux-fsdevel.vger.kernel.org>
List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12

Everything is in place to stop using the per-page mapcounts in large
folios with CONFIG_NO_PAGE_MAPCOUNT: the mapcount of tail pages will always
be logically 0 (-1 value), just like it currently is for hugetlb folios
already, and the page mapcount of the head page is either 0 (-1 value)
or contains a page type (e.g., hugetlb).

Maintaining _nr_pages_mapped without per-page mapcounts is impossible,
so that one also has to go with CONFIG_NO_PAGE_MAPCOUNT.

There are two remaining implications:

(1) Per-node, per-cgroup and per-lruvec stats of "NR_ANON_MAPPED"
    ("mapped anonymous memory") and "NR_FILE_MAPPED"
    ("mapped file memory"):

    As soon as any page of the folio is mapped -- folio_mapped() -- we
    now account the complete folio as mapped. Once the last page is
    unmapped -- !folio_mapped() -- we account the complete folio as
    unmapped.

    This implies that ...

    * "AnonPages" and "Mapped" in /proc/meminfo and
      /sys/devices/system/node/*/meminfo
    * cgroup v2: "anon" and "file_mapped" in "memory.stat" and
      "memory.numa_stat"
    * cgroup v1: "rss" and "mapped_file" in "memory.stat" and
      "memory.numa_stat"

    ... can now appear higher than before. Note that these folios do
    consume that memory; it is just that not all of their pages are
    currently mapped.

    It's worth noting that other accounting in the kernel (esp. cgroup
    charging on allocation) is not affected by this change.

    [why oh why is "anon" called "rss" in cgroup v1]
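
The all-or-nothing accounting described in (1) can be modelled in a few
lines. This is only an illustrative userspace sketch, not kernel code: the
helper names stat_delta_on_map() and stat_delta_on_unmap() are hypothetical,
and "old_mapcount" stands for the folio's total mapcount before the
operation.

```c
#include <assert.h>

/*
 * Hypothetical model of the CONFIG_NO_PAGE_MAPCOUNT stat accounting.
 * Returns the change (in pages) applied to NR_ANON_MAPPED/NR_FILE_MAPPED
 * when nr_pages PTE mappings are added to a large folio.
 */
static inline int stat_delta_on_map(int old_mapcount, int nr_pages,
				    int folio_nr_pages)
{
	assert(nr_pages > 0 && nr_pages <= folio_nr_pages);
	/* First mapping of any page: account the complete folio. */
	return old_mapcount == 0 ? folio_nr_pages : 0;
}

/* Same model for removing nr_pages PTE mappings from the folio. */
static inline int stat_delta_on_unmap(int old_mapcount, int nr_pages,
				      int folio_nr_pages)
{
	assert(nr_pages > 0 && nr_pages <= old_mapcount);
	/* Last mapping is gone: unaccount the complete folio. */
	return old_mapcount == nr_pages ? -folio_nr_pages : 0;
}
```

In this model, mapping a single page of an unmapped 512-page folio yields a
delta of 512 pages, and the stats only drop again once the total mapcount
reaches zero.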

 (2) Detecting partial mappings

     Detecting whether anon THP are partially mapped becomes somewhat less
     reliable. As long as a single MM maps such a large folio
     ("exclusively mapped"), we can reliably detect it. In particular,
     before fork() or after a short-lived child process has quit, we will
     detect partial mappings reliably, which is the common case.

     In essence, if the average per-page mapcount in an anon THP is < 1,
     we know for sure that we have a partial mapping.

     However, as soon as multiple MMs are involved, we might miss detecting
     partial mappings: this might be relevant with long-lived child
     processes. If we have a fully-mapped anon folio before fork(), once
     our child processes and our parent all unmap (zap/COW) the same pages
     (but not the complete folio), we might not detect the partial mapping.
     However, once the child processes quit we would detect the partial
     mapping.

     How relevant this case is in practice remains to be seen.
     Swapout/migration will likely mitigate this.

     In the future, RMAP walkers should check for that for "mapped shared"
     anon folios, and flag them for deferred-splitting.
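
The heuristic from (2), and the case it misses, can be sketched as follows.
This is an illustrative userspace model only: looks_partially_mapped() is a
hypothetical helper that mirrors the PTE-unmap check, "mapcount" is the
folio's total mapcount after the unmap, and "entire_mapcount" is the number
of PMD mappings.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the CONFIG_NO_PAGE_MAPCOUNT heuristic: a folio that
 * is still mapped, has no PMD mappings and has fewer PTE mappings than
 * pages (average per-page mapcount < 1) is certainly partially mapped.
 */
static inline bool looks_partially_mapped(int mapcount, int entire_mapcount,
					  int folio_nr_pages)
{
	return mapcount > 0 && mapcount < folio_nr_pages &&
	       entire_mapcount == 0;
}
```

With a 512-page folio mapped by a single MM, unmapping 100 pages leaves a
mapcount of 412, which is detected. If a parent and a child both map all
512 pages and both unmap the same 100 pages, the mapcount is 824, so the
partial mapping goes unnoticed until the child quits and the mapcount drops
to 412.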

There are a couple of remaining per-page mapcount users we won't
touch for now:

 (1) __dump_folio(): we'll tackle that separately later. For now, it
     will always read an effective mapcount of "0" for pages in large
     folios.

 (2) include/trace/events/page_ref.h: we should rework the whole
     handling to be folio-aware and simply trace folio_mapcount(). Let's
     leave it around for now, might still be helpful to trace the raw
     page mapcount value (e.g., including the page type).

 (3) mm/mm_init.c: to initialize the mapcount/type field to -1. Will be
     required until we have decoupled type+mapcount (e.g., by moving it
     into "struct folio"), and until we initialize the type+mapcount when
     allocating a folio.

 (4) mm/page_alloc.c: to sanity-check that the mapcount/type field is -1
     when a page gets freed. We could probably remove at least the tail
     page mapcount check in non-debug environments.

Some added ifdefery seems unavoidable for now: at least it's mostly
limited to the rmap add/remove core primitives.

Extend documentation.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 .../admin-guide/cgroup-v1/memory.rst          |  4 ++
 Documentation/admin-guide/cgroup-v2.rst       | 10 ++-
 Documentation/filesystems/proc.rst            | 10 ++-
 Documentation/mm/transhuge.rst                | 31 +++++++---
 include/linux/mm_types.h                      |  4 ++
 include/linux/rmap.h                          | 10 ++-
 mm/internal.h                                 | 21 +++++--
 mm/page_alloc.c                               |  2 +
 mm/rmap.c                                     | 61 +++++++++++++++++++
 9 files changed, 133 insertions(+), 20 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 270501db9f4e8..2e2bbf944eea9 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -615,6 +615,10 @@ memory.stat file includes following statistics:
 
 	'rss + mapped_file" will give you resident set size of cgroup.
 
+	Note that some kernel configurations might account complete larger
+	allocations (e.g., THP) towards 'rss' and 'mapped_file', even if
+	only some, but not all, of that memory is mapped.
+
 	(Note: file and shmem may be shared among other cgroups. In that case,
 	mapped_file is accounted only when the memory cgroup is owner of page
 	cache.)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index e25e8b2698b95..039bdf49854f3 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1429,7 +1429,10 @@ The following nested keys are defined.
 
 	  anon
 		Amount of memory used in anonymous mappings such as
-		brk(), sbrk(), and mmap(MAP_ANONYMOUS)
+		brk(), sbrk(), and mmap(MAP_ANONYMOUS). Note that
+		some kernel configurations might account complete larger
+		allocations (e.g., THP) even if only some, but not all, of
+		the memory of such an allocation is still mapped.
 
 	  file
 		Amount of memory used to cache filesystem data,
@@ -1472,7 +1475,10 @@ The following nested keys are defined.
 		Amount of application memory swapped out to zswap.
 
 	  file_mapped
-		Amount of cached filesystem data mapped with mmap()
+		Amount of cached filesystem data mapped with mmap(). Note
+		that some kernel configurations might account complete
+		larger allocations (e.g., THP) even if only some, but
+		not all, of the memory of such an allocation is mapped.
 
 	  file_dirty
 		Amount of cached filesystem data that was modified but
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 7cbab4135f244..c6d6474738577 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1148,9 +1148,15 @@ Dirty
 Writeback
               Memory which is actively being written back to the disk
 AnonPages
-              Non-file backed pages mapped into userspace page tables
+              Non-file backed pages mapped into userspace page tables. Note that
+              some kernel configurations might consider all pages part of a
+              larger allocation (e.g., THP) as "mapped", as soon as a single
+              page is mapped.
 Mapped
-              files which have been mmapped, such as libraries
+              files which have been mmapped, such as libraries. Note that some
+              kernel configurations might consider all pages part of a larger
+              allocation (e.g., THP) as "mapped", as soon as a single page is
+              mapped.
 Shmem
               Total memory used by shared memory (shmem) and tmpfs
 KReclaimable
diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst
index 0ee58108a4d14..0d34f3ac13d8c 100644
--- a/Documentation/mm/transhuge.rst
+++ b/Documentation/mm/transhuge.rst
@@ -116,23 +116,28 @@ pages:
     succeeds on tail pages.
 
   - map/unmap of a PMD entry for the whole THP increment/decrement
-    folio->_entire_mapcount, increment/decrement folio->_large_mapcount
-    and also increment/decrement folio->_nr_pages_mapped by ENTIRELY_MAPPED
-    when _entire_mapcount goes from -1 to 0 or 0 to -1.
+    folio->_entire_mapcount and folio->_large_mapcount.
 
     With CONFIG_MM_ID, we also maintain the two slots for tracking MM
     owners (MM ID and corresponding mapcount), and the current status
     ("mapped shared" vs. "mapped exclusively").
 
+    With CONFIG_PAGE_MAPCOUNT, we also increment/decrement
+    folio->_nr_pages_mapped by ENTIRELY_MAPPED when _entire_mapcount goes
+    from -1 to 0 or 0 to -1.
+
   - map/unmap of individual pages with PTE entry increment/decrement
-    page->_mapcount, increment/decrement folio->_large_mapcount and also
-    increment/decrement folio->_nr_pages_mapped when page->_mapcount goes
-    from -1 to 0 or 0 to -1 as this counts the number of pages mapped by PTE.
+    folio->_large_mapcount.
 
     With CONFIG_MM_ID, we also maintain the two slots for tracking MM
     owners (MM ID and corresponding mapcount), and the current status
     ("mapped shared" vs. "mapped exclusively").
 
+    With CONFIG_PAGE_MAPCOUNT, we also increment/decrement
+    page->_mapcount and increment/decrement folio->_nr_pages_mapped when
+    page->_mapcount goes from -1 to 0 or 0 to -1 as this counts the number
+    of pages mapped by PTE.
+
 split_huge_page internally has to distribute the refcounts in the head
 page to the tail pages before clearing all PG_head/tail bits from the page
 structures. It can be done easily for refcounts taken by page table
@@ -159,8 +164,8 @@ clear where references should go after split: it will stay on the head page.
 Note that split_huge_pmd() doesn't have any limitations on refcounting:
 pmd can be split at any point and never fails.
 
-Partial unmap and deferred_split_folio()
-========================================
+Partial unmap and deferred_split_folio() (anon THP only)
+========================================================
 
 Unmapping part of THP (with munmap() or other way) is not going to free
 memory immediately. Instead, we detect that a subpage of THP is not in use
@@ -175,3 +180,13 @@ a THP crosses a VMA boundary.
 The function deferred_split_folio() is used to queue a folio for splitting.
 The splitting itself will happen when we get memory pressure via shrinker
 interface.
+
+With CONFIG_PAGE_MAPCOUNT, we reliably detect partial mappings based on
+folio->_nr_pages_mapped.
+
+With CONFIG_NO_PAGE_MAPCOUNT, we detect partial mappings based on the
+average per-page mapcount in a THP: if the average is < 1, an anon THP is
+certainly partially mapped. As long as only a single process maps a THP,
+this detection is reliable. With long-running child processes, there can
+be scenarios where partial mappings can currently not be detected, and
+might need asynchronous detection during memory reclaim in the future.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6d27856686439..2adf1839bcb0d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -378,7 +378,11 @@ struct folio {
 				struct {
 					atomic_t _large_mapcount;
 					atomic_t _entire_mapcount;
+#ifdef CONFIG_PAGE_MAPCOUNT
 					atomic_t _nr_pages_mapped;
+#else /* !CONFIG_PAGE_MAPCOUNT */
+					int _unused_1;
+#endif /* !CONFIG_PAGE_MAPCOUNT */
 					atomic_t _pincount;
 #ifdef CONFIG_MM_ID
 					int _mm0_mapcount;
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index ff2a16864deed..345d93636b2b1 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -219,7 +219,7 @@ static __always_inline void folio_set_large_mapcount(struct folio *folio,
 	VM_WARN_ON_ONCE(folio->_mm1_mapcount >= 0);
 }
 
-static __always_inline void folio_add_large_mapcount(struct folio *folio,
+static __always_inline int folio_add_large_mapcount(struct folio *folio,
 		int diff, struct vm_area_struct *vma)
 {
 	const unsigned int mm_id = vma->vm_mm->mm_id;
@@ -264,11 +264,12 @@ static __always_inline void folio_add_large_mapcount(struct folio *folio,
 		folio_clear_large_mapped_exclusively(folio);
 	}
 	folio_unlock_large_mapcount_data(folio);
+	return mapcount_val + 1;
 }
 #define folio_inc_large_mapcount(folio, vma) \
 	folio_add_large_mapcount(folio, 1, vma)
 
-static __always_inline void folio_sub_large_mapcount(struct folio *folio,
+static __always_inline int folio_sub_large_mapcount(struct folio *folio,
 		int diff, struct vm_area_struct *vma)
 {
 	const unsigned int mm_id = vma->vm_mm->mm_id;
@@ -294,6 +295,7 @@ static __always_inline void folio_sub_large_mapcount(struct folio *folio,
 	    folio->_mm1_mapcount == mapcount_val)
 		folio_set_large_mapped_exclusively(folio);
 	folio_unlock_large_mapcount_data(folio);
+	return mapcount_val + 1;
 }
 #define folio_dec_large_mapcount(folio, vma) \
 	folio_sub_large_mapcount(folio, 1, vma)
@@ -493,9 +495,11 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
 			break;
 		}
 
+#ifdef CONFIG_PAGE_MAPCOUNT
 		do {
 			atomic_inc(&page->_mapcount);
 		} while (page++, --nr_pages > 0);
+#endif
 		folio_add_large_mapcount(folio, orig_nr_pages, dst_vma);
 		break;
 	case RMAP_LEVEL_PMD:
@@ -592,7 +596,9 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
 		do {
 			if (PageAnonExclusive(page))
 				ClearPageAnonExclusive(page);
+#ifdef CONFIG_PAGE_MAPCOUNT
 			atomic_inc(&page->_mapcount);
+#endif
 		} while (page++, --nr_pages > 0);
 		folio_add_large_mapcount(folio, orig_nr_pages, dst_vma);
 		break;
diff --git a/mm/internal.h b/mm/internal.h
index da38c747c73d4..9fb78ce3c2eb3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -60,6 +60,13 @@ struct folio_batch;
 
 void page_writeback_init(void);
 
+/*
+ * Flags passed to __show_mem() and show_free_areas() to suppress output in
+ * various contexts.
+ */
+#define SHOW_MEM_FILTER_NODES		(0x0001u)	/* disallowed nodes */
+
+#ifdef CONFIG_PAGE_MAPCOUNT
 /*
  * If a 16GB hugetlb folio were mapped by PTEs of all of its 4kB pages,
  * its nr_pages_mapped would be 0x400000: choose the ENTIRELY_MAPPED bit
@@ -69,12 +76,6 @@ void page_writeback_init(void);
 #define ENTIRELY_MAPPED		0x800000
 #define FOLIO_PAGES_MAPPED	(ENTIRELY_MAPPED - 1)
 
-/*
- * Flags passed to __show_mem() and show_free_areas() to suppress output in
- * various contexts.
- */
-#define SHOW_MEM_FILTER_NODES		(0x0001u)	/* disallowed nodes */
-
 /*
  * How many individual pages have an elevated _mapcount.  Excludes
  * the folio's entire_mapcount.
@@ -85,6 +86,12 @@ static inline int folio_nr_pages_mapped(const struct folio *folio)
 {
 	return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED;
 }
+#else /* !CONFIG_PAGE_MAPCOUNT */
+static inline int folio_nr_pages_mapped(const struct folio *folio)
+{
+	return -1;
+}
+#endif /* !CONFIG_PAGE_MAPCOUNT */
 
 /*
  * Retrieve the first entry of a folio based on a provided entry within the
@@ -663,7 +670,9 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
 	folio_set_order(folio, order);
 	atomic_set(&folio->_large_mapcount, -1);
 	atomic_set(&folio->_entire_mapcount, -1);
+#ifdef CONFIG_PAGE_MAPCOUNT
 	atomic_set(&folio->_nr_pages_mapped, 0);
+#endif /* CONFIG_PAGE_MAPCOUNT */
 	atomic_set(&folio->_pincount, 0);
 #ifdef CONFIG_MM_ID
 	folio->_mm0_mapcount = -1;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c81f29e29b82d..bdb57540cdffa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -951,10 +951,12 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
 			bad_page(page, "nonzero large_mapcount");
 			goto out;
 		}
+#ifdef CONFIG_PAGE_MAPCOUNT
 		if (unlikely(atomic_read(&folio->_nr_pages_mapped))) {
 			bad_page(page, "nonzero nr_pages_mapped");
 			goto out;
 		}
+#endif
 		if (unlikely(atomic_read(&folio->_pincount))) {
 			bad_page(page, "nonzero pincount");
 			goto out;
diff --git a/mm/rmap.c b/mm/rmap.c
index 226b188499f91..888394ff9dd5b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1156,7 +1156,9 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
 		struct page *page, int nr_pages, struct vm_area_struct *vma,
 		enum rmap_level level, int *nr_pmdmapped)
 {
+#ifdef CONFIG_PAGE_MAPCOUNT
 	atomic_t *mapped = &folio->_nr_pages_mapped;
+#endif /* CONFIG_PAGE_MAPCOUNT */
 	const int orig_nr_pages = nr_pages;
 	int first = 0, nr = 0;
 
@@ -1169,6 +1171,7 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
 			break;
 		}
 
+#ifdef CONFIG_PAGE_MAPCOUNT
 		do {
 			first += atomic_inc_and_test(&page->_mapcount);
 		} while (page++, --nr_pages > 0);
@@ -1178,9 +1181,18 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
 			nr = first;
 
 		folio_add_large_mapcount(folio, orig_nr_pages, vma);
+#else /* !CONFIG_PAGE_MAPCOUNT */
+		nr = folio_add_large_mapcount(folio, orig_nr_pages, vma);
+		if (nr == orig_nr_pages)
+			/* Was completely unmapped. */
+			nr = folio_large_nr_pages(folio);
+		else
+			nr = 0;
+#endif /* !CONFIG_PAGE_MAPCOUNT */
 		break;
 	case RMAP_LEVEL_PMD:
 		first = atomic_inc_and_test(&folio->_entire_mapcount);
+#ifdef CONFIG_PAGE_MAPCOUNT
 		if (first) {
 			nr = atomic_add_return_relaxed(ENTIRELY_MAPPED, mapped);
 			if (likely(nr < ENTIRELY_MAPPED + ENTIRELY_MAPPED)) {
@@ -1195,6 +1207,16 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
 			}
 		}
 		folio_inc_large_mapcount(folio, vma);
+#else /* !CONFIG_PAGE_MAPCOUNT */
+		if (first)
+			*nr_pmdmapped = folio_large_nr_pages(folio);
+		nr = folio_inc_large_mapcount(folio, vma);
+		if (nr == 1)
+			/* Was completely unmapped. */
+			nr = folio_large_nr_pages(folio);
+		else
+			nr = 0;
+#endif /* !CONFIG_PAGE_MAPCOUNT */
 		break;
 	}
 	return nr;
@@ -1332,6 +1354,7 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
 			break;
 		}
 	}
+#ifdef CONFIG_PAGE_MAPCOUNT
 	for (i = 0; i < nr_pages; i++) {
 		struct page *cur_page = page + i;
 
@@ -1341,6 +1364,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
 				   folio_entire_mapcount(folio) > 1)) &&
 				 PageAnonExclusive(cur_page), folio);
 	}
+#else /* !CONFIG_PAGE_MAPCOUNT */
+	VM_WARN_ON_FOLIO(!folio_test_large(folio) && PageAnonExclusive(page) &&
+			 atomic_read(&folio->_mapcount) > 0, folio);
+#endif /* !CONFIG_PAGE_MAPCOUNT */
 
 	/*
 	 * For large folio, only mlock it if it's fully mapped to VMA. It's
@@ -1445,19 +1472,25 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 			struct page *page = folio_page(folio, i);
 
 			/* increment count (starts at -1) */
+#ifdef CONFIG_PAGE_MAPCOUNT
 			atomic_set(&page->_mapcount, 0);
+#endif /* CONFIG_PAGE_MAPCOUNT */
 			if (exclusive)
 				SetPageAnonExclusive(page);
 		}
 
 		folio_set_large_mapcount(folio, nr, vma);
+#ifdef CONFIG_PAGE_MAPCOUNT
 		atomic_set(&folio->_nr_pages_mapped, nr);
+#endif /* CONFIG_PAGE_MAPCOUNT */
 	} else {
 		nr = folio_large_nr_pages(folio);
 		/* increment count (starts at -1) */
 		atomic_set(&folio->_entire_mapcount, 0);
 		folio_set_large_mapcount(folio, 1, vma);
+#ifdef CONFIG_PAGE_MAPCOUNT
 		atomic_set(&folio->_nr_pages_mapped, ENTIRELY_MAPPED);
+#endif /* CONFIG_PAGE_MAPCOUNT */
 		if (exclusive)
 			SetPageAnonExclusive(&folio->page);
 		nr_pmdmapped = nr;
@@ -1527,7 +1560,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 		struct page *page, int nr_pages, struct vm_area_struct *vma,
 		enum rmap_level level)
 {
+#ifdef CONFIG_PAGE_MAPCOUNT
 	atomic_t *mapped = &folio->_nr_pages_mapped;
+#endif /* CONFIG_PAGE_MAPCOUNT */
 	int last = 0, nr = 0, nr_pmdmapped = 0;
 	bool partially_mapped = false;
 
@@ -1540,6 +1575,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 			break;
 		}
 
+#ifdef CONFIG_PAGE_MAPCOUNT
 		folio_sub_large_mapcount(folio, nr_pages, vma);
 		do {
 			last += atomic_add_negative(-1, &page->_mapcount);
@@ -1550,8 +1586,20 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 			nr = last;
 
 		partially_mapped = nr && atomic_read(mapped);
+#else /* !CONFIG_PAGE_MAPCOUNT */
+		nr = folio_sub_large_mapcount(folio, nr_pages, vma);
+		if (!nr) {
+			/* Now completely unmapped. */
+			nr = folio_nr_pages(folio);
+		} else {
+			partially_mapped = nr < folio_large_nr_pages(folio) &&
+					   !folio_entire_mapcount(folio);
+			nr = 0;
+		}
+#endif /* !CONFIG_PAGE_MAPCOUNT */
 		break;
 	case RMAP_LEVEL_PMD:
+#ifdef CONFIG_PAGE_MAPCOUNT
 		folio_dec_large_mapcount(folio, vma);
 		last = atomic_add_negative(-1, &folio->_entire_mapcount);
 		if (last) {
@@ -1569,6 +1617,19 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 		}
 
 		partially_mapped = nr && nr < nr_pmdmapped;
+#else /* !CONFIG_PAGE_MAPCOUNT */
+		last = atomic_add_negative(-1, &folio->_entire_mapcount);
+		if (last)
+			nr_pmdmapped = folio_large_nr_pages(folio);
+		nr = folio_dec_large_mapcount(folio, vma);
+		if (!nr) {
+			/* Now completely unmapped. */
+			nr = folio_large_nr_pages(folio);
+		} else {
+			partially_mapped = last && nr < folio_large_nr_pages(folio);
+			nr = 0;
+		}
+#endif /* !CONFIG_PAGE_MAPCOUNT */
 		break;
 	}