From patchwork Mon Jul 11 03:46:15 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Barry Song <21cnbao@gmail.com>
X-Patchwork-Id: 12912785
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 77D5EC433EF
	for <linux-mm@archiver.kernel.org>; Mon, 11 Jul 2022 03:47:15 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id E8DEB8E0007; Sun, 10 Jul 2022 23:47:14 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id E3CC38E0001; Sun, 10 Jul 2022 23:47:14 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id CDF678E0007; Sun, 10 Jul 2022 23:47:14 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com
 [216.40.44.10])
	by kanga.kvack.org (Postfix) with ESMTP id BEE728E0001
	for <linux-mm@kvack.org>; Sun, 10 Jul 2022 23:47:14 -0400 (EDT)
Received: from smtpin21.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay09.hostedemail.com (Postfix) with ESMTP id 8F3F034358
	for <linux-mm@kvack.org>; Mon, 11 Jul 2022 03:47:14 +0000 (UTC)
X-FDA: 79673433588.21.733C2A6
Received: from mail-pl1-f182.google.com (mail-pl1-f182.google.com
 [209.85.214.182])
	by imf18.hostedemail.com (Postfix) with ESMTP id 248A51C0069
	for <linux-mm@kvack.org>; Mon, 11 Jul 2022 03:47:13 +0000 (UTC)
Received: by mail-pl1-f182.google.com with SMTP id 5so3392706plk.9
        for <linux-mm@kvack.org>; Sun, 10 Jul 2022 20:47:13 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20210112;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=kX77QQPavnlmqTPJr+cPH2hNKZO7UlqRRz965T9dDrs=;
        b=htIXMTHV1qfaPfX3l4XZ4hj+VaUJoyeJxf3BRtOKXsqpYicKtvT6e4VttVAI3DicXN
         9gM6V/tgXsGAqEJx6fyT+PeWPFlSYsi+4M5SHPDGMW/jL34jf6NoLM0kYfBinO8DT1MR
         DPdLnhEyAt7VslHXHkcl8xE/RUV8tJufwFlQVmtY3LpgfPw7A4qLCzw/q/dCdkea/U/H
         el05aKwmbkbJfRsPZasOu7jfAx2d4VJiPcyDUNvV41zJXsh6JXPpax8tmfD7C5+a9ygQ
         wOaqM8gP36qTLyaFt1jXVEk29x4zttAJoOdz3yRBuZA7IPt8psRkaoiAygE7YyPtwQ1Z
         E2zQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=kX77QQPavnlmqTPJr+cPH2hNKZO7UlqRRz965T9dDrs=;
        b=5bngT6qTVXed3QF8niCPTwFr2QAMHiD35HUrcSR8HW0Z9PtOsEGr+RHMDvKrcwkyKC
         naYg5WmRBBolAuUVAYKObMtcpJ18YPaqWiWJYHjr6YW9uX3VlpxJ8nx94B7I3npyGs2s
         VsXobyxeAuzsjESD79RE/p5NfY3s15U4U044Ne+MzSKR+45xa6LbNhDx00KF/dbP+Lqn
         +oRBrKHar/E/hKElyVX3aHMXYa8Qr+WXRT9gYVrpwDPMfZ16Ld/HGp+MxzHyXQaHV82/
         NzUeZSJqjHwFUULA/kygmP++VzjnJx0Z5IkMJGyXpeWQq0Hf6nZ0Kke2UYODHRgFvfOD
         mULw==
X-Gm-Message-State: AJIora8I/6t21ngKPzWFIB4L3e1xdQRv3JNun4p/Is6jWUYsnxYczo6k
	F+RNm8vjwpi5U6P9Oo9KzZE=
X-Google-Smtp-Source: 
 AGRyM1vLe/YZvbOkUYTH5iAqm8VRsVnu2Afj8VTLOnXJSTfR2PgdfzJzSBQsk0Iq2lETGsM8bxMYpQ==
X-Received: by 2002:a17:90b:1c90:b0:1ee:d804:d2c1 with SMTP id
 oo16-20020a17090b1c9000b001eed804d2c1mr15473364pjb.92.1657511233142;
        Sun, 10 Jul 2022 20:47:13 -0700 (PDT)
Received: from localhost.localdomain (47-72-206-164.dsl.dyn.ihug.co.nz.
 [47.72.206.164])
        by smtp.gmail.com with ESMTPSA id
 a13-20020a170902eccd00b001664d88aab3sm3447949plh.240.2022.07.10.20.47.04
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Sun, 10 Jul 2022 20:47:12 -0700 (PDT)
From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org,
	linux-mm@kvack.org,
	linux-arm-kernel@lists.infradead.org,
	x86@kernel.org,
	catalin.marinas@arm.com,
	will@kernel.org,
	linux-doc@vger.kernel.org
Cc: corbet@lwn.net,
	arnd@arndb.de,
	linux-kernel@vger.kernel.org,
	darren@os.amperecomputing.com,
	yangyicong@hisilicon.com,
	huzhanyuan@oppo.com,
	lipeifeng@oppo.com,
	zhangshiming@oppo.com,
	guojian@oppo.com,
	realmz6@gmail.com,
	linux-mips@vger.kernel.org,
	openrisc@lists.librecores.org,
	linuxppc-dev@lists.ozlabs.org,
	linux-riscv@lists.infradead.org,
	linux-s390@vger.kernel.org,
	Barry Song <v-songbaohua@oppo.com>,
	Nadav Amit <namit@vmware.com>,
	Mel Gorman <mgorman@suse.de>
Subject: [PATCH v2 4/4] arm64: support batched/deferred tlb shootdown during
 page reclamation
Date: Mon, 11 Jul 2022 15:46:15 +1200
Message-Id: <20220711034615.482895-5-21cnbao@gmail.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20220711034615.482895-1-21cnbao@gmail.com>
References: <20220711034615.482895-1-21cnbao@gmail.com>
MIME-Version: 1.0
ARC-Authentication-Results: i=1;
	imf18.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20210112 header.b=htIXMTHV;
	dmarc=pass (policy=none) header.from=gmail.com;
	spf=pass (imf18.hostedemail.com: domain of 21cnbao@gmail.com designates
 209.85.214.182 as permitted sender) smtp.mailfrom=21cnbao@gmail.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed;
 d=hostedemail.com;
	s=arc-20220608; t=1657511234;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=kX77QQPavnlmqTPJr+cPH2hNKZO7UlqRRz965T9dDrs=;
	b=Os4bp1K1Xb0hCdwBd+oV2xy1hp0K+3xJLK6cFgR4ndR42fxzRtGipjTIDO6qU+KszWhb0F
	eyqhy/KJNDc9xYBoceX8FxuW3GOqFYRTX4DRnQmvTpARjccaVIEEpVlhWK5P9jDGW0+8jn
	/2PhJq+0GX+7FXl/zTusFadC1Za/zNY=
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1657511234; a=rsa-sha256;
	cv=none;
	b=2IcT1v21XK/JveXygn1juW+/U8qZwF2ANQjMcbZHKmoOlFl7+Pkzpdso1dbB7F1mxM/OaH
	pkBHy6BTeGFrmgKyJn/JyFGEGmo8sRYA4J3BjHXspNeThJBacC39ZdkC/0N+hfedwTP9uk
	0ZMkvnF3FYnP89NA+nTW7YRQq0JOWi4=
Authentication-Results: imf18.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20210112 header.b=htIXMTHV;
	dmarc=pass (policy=none) header.from=gmail.com;
	spf=pass (imf18.hostedemail.com: domain of 21cnbao@gmail.com designates
 209.85.214.182 as permitted sender) smtp.mailfrom=21cnbao@gmail.com
X-Stat-Signature: 1t81chq76fxzf6784iucb9ms7me1gb7q
X-Rspamd-Queue-Id: 248A51C0069
X-Rspam-User: 
X-Rspamd-Server: rspam10
X-HE-Tag: 1657511233-458606
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

From: Barry Song <v-songbaohua@oppo.com>

on x86, batched and deferred tlb shootdown has lead to 90%
performance increase on tlb shootdown. on arm64, HW can do
tlb shootdown without software IPI. But sync tlbi is still
quite expensive.

Even running a simplest program which requires swapout can
prove this is true,
 #include <sys/types.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <string.h>

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

         memset(p, 0x88, SIZE);

         for (int k = 0; k < 10000; k++) {
                 /* swap in */
                 for (int i = 0; i < SIZE; i += 4096) {
                         (void)p[i];
                 }

                 /* swap out */
                 madvise(p, SIZE, MADV_PAGEOUT);
         }
 }

Perf result on snapdragon 888 with 8 cores by using zRAM
as the swap block device.

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  .............................................................................
 #
    21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
     6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
     5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
     3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
     3.49%  a.out    [kernel.kallsyms]  [k] memset64
     1.63%  a.out    [kernel.kallsyms]  [k] clear_page
     1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
     1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
     1.23%  a.out    [kernel.kallsyms]  [k] xas_load
     1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
swapping in/out a page mapped by only one process. If the
page is mapped by multiple processes, typically, like more
than 100 on a phone, the overhead would be much higher as
we have to run tlb flush 100 times for one single page.
Plus, tlb flush overhead will increase with the number
of CPU cores due to the bad scalability of tlb shootdown
in HW, so those ARM64 servers should expect much higher
overhead.

Further perf annonate shows 95% cpu time of ptep_clear_flush
is actually used by the final dsb() to wait for the completion
of tlb flush. This provides us a very good chance to leverage
the existing batched tlb in kernel. The minimum modification
is that we only send async tlbi in the first stage and we send
dsb while we have to sync in the second stage.

With the above simplest micro benchmark, collapsed time to
finish the program decreases around 5%.

Typical collapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also, Yicong Yang added the following observation.
	Tested with benchmark in the commit on Kunpeng920 arm64 server,
	observed an improvement around 12.5% with command
	`time ./swap_bench`.
		w/o		w/
	real	0m13.460s	0m11.771s
	user	0m0.248s	0m0.279s
	sys	0m12.039s	0m11.458s

	Originally it's noticed a 16.99% overhead of ptep_clear_flush()
	which has been eliminated by this patch:

	[root@localhost yang]# perf record -- ./swap_bench && perf report
	[...]
	16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 .../features/vm/TLB/arch-support.txt          |  2 +-
 arch/arm64/Kconfig                            |  1 +
 arch/arm64/include/asm/tlbbatch.h             | 12 ++++++++++
 arch/arm64/include/asm/tlbflush.h             | 23 +++++++++++++++++--
 4 files changed, 35 insertions(+), 3 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
index 1c009312b9c1..2caf815d7c6c 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
     |       alpha: | TODO |
     |         arc: | TODO |
     |         arm: | TODO |
-    |       arm64: | TODO |
+    |       arm64: |  ok  |
     |        csky: | TODO |
     |     hexagon: | TODO |
     |        ia64: | TODO |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1652a9800ebe..e94913a0b040 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -93,6 +93,7 @@ config ARM64
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
 	select ARCH_WANT_DEFAULT_BPF_JIT
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
new file mode 100644
index 000000000000..fedb0b87b8db
--- /dev/null
+++ b/arch/arm64/include/asm/tlbbatch.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ARCH_ARM64_TLBBATCH_H
+#define _ARCH_ARM64_TLBBATCH_H
+
+struct arch_tlbflush_unmap_batch {
+	/*
+	 * For arm64, HW can do tlb shootdown, so we don't
+	 * need to record cpumask for sending IPI
+	 */
+};
+
+#endif /* _ARCH_ARM64_TLBBATCH_H */
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 412a3b9a3c25..10364cf8451d 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	dsb(ish);
 }
 
-static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+
+static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
 					 unsigned long uaddr)
 {
 	unsigned long addr;
 
 	dsb(ishst);
-	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
+	addr = __TLBI_VADDR(uaddr, ASID(mm));
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
 }
 
+static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+					 unsigned long uaddr)
+{
+	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
+}
+
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
@@ -272,6 +279,18 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 	dsb(ish);
 }
 
+static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
+					struct mm_struct *mm,
+					unsigned long uaddr)
+{
+	__flush_tlb_page_nosync(mm, uaddr);
+}
+
+static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+{
+	dsb(ish);
+}
+
 /*
  * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
  * necessarily a performance improvement.