[v6,2/2] arm64: support batched/deferred tlb shootdown during page reclamation

From: Barry Song <v-songbaohua@oppo.com>

On x86, batched and deferred tlb shootdown has led to a 90%
performance increase on tlb shootdown. On arm64, HW can do
tlb shootdown without a software IPI, but a synchronous tlbi
is still quite expensive.

Even the simplest program that requires swapout is enough to
demonstrate this:
 #include <sys/types.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <string.h>

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

         memset((void *)p, 0x88, SIZE); /* cast drops volatile for memset() */

         for (int k = 0; k < 10000; k++) {
                 /* swap in */
                 for (int i = 0; i < SIZE; i += 4096) {
                         (void)p[i];
                 }

                 /* swap out */
                 madvise(p, SIZE, MADV_PAGEOUT);
         }
 }

Perf result on a Snapdragon 888 with 8 cores, using zRAM
as the swap block device:

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  .............................................................................
 #
    21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
     6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
     5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
     3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
     3.49%  a.out    [kernel.kallsyms]  [k] memset64
     1.63%  a.out    [kernel.kallsyms]  [k] clear_page
     1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
     1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
     1.23%  a.out    [kernel.kallsyms]  [k] xas_load
     1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% of CPU time in this micro-benchmark,
which swaps in/out a page mapped by only one process. If the
page is mapped by multiple processes (typically more than 100
on a phone), the overhead would be much higher, as we have to
run a tlb flush 100 times for one single page. In addition, tlb
flush overhead increases with the number of CPU cores due to
the poor scalability of tlb shootdown in HW, so ARM64 servers
should expect much higher overhead.

Further perf annotate shows that 95% of the CPU time of
ptep_clear_flush() is spent in the final dsb(), waiting for the
tlb flush to complete. This gives us a very good chance to
leverage the existing batched tlb flush in the kernel. The minimum
modification is to send only the async tlbi in the first stage
and to issue the dsb in the second stage, when we have to sync.

With the above micro-benchmark, the elapsed time to finish the
program decreases by around 6%.

Typical elapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also, Yicong Yang added the following observation:
	Tested with the benchmark in the commit on a Kunpeng920 arm64
	server and observed an improvement of around 12.5% with the
	command `time ./swap_bench`.
		w/o		w/
	real	0m13.460s	0m11.771s
	user	0m0.248s	0m0.279s
	sys	0m12.039s	0m11.458s

	Originally a 16.99% overhead of ptep_clear_flush() was observed,
	which has been eliminated by this patch:

	[root@localhost yang]# perf record -- ./swap_bench && perf report
	[...]
	16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

This has been tested on 4, 8 and 128 CPU platforms and is shown to be
beneficial on large systems, but may bring no improvement on small
systems such as a 4 CPU platform. So make
ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend on CONFIG_EXPERT at this
stage and only enable it on systems with more than 8 CPUs. Users can
adjust this threshold for their own platforms via
CONFIG_NR_CPUS_FOR_BATCHED_TLB.

Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 .../features/vm/TLB/arch-support.txt          |  2 +-
 arch/arm64/Kconfig                            |  6 +++
 arch/arm64/include/asm/tlbbatch.h             | 12 +++++
 arch/arm64/include/asm/tlbflush.h             | 52 ++++++++++++++++++-
 arch/x86/include/asm/tlbflush.h               |  3 +-
 mm/rmap.c                                     | 10 ++--
 6 files changed, 77 insertions(+), 8 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

Message ID	20221115031425.44640-3-yangyicong@huawei.com (mailing list archive)
State	Superseded
Headers	From: Yicong Yang <yangyicong@huawei.com>
	Subject: [PATCH v6 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
	Date: Tue, 15 Nov 2022 11:14:25 +0800
	In-Reply-To: <20221115031425.44640-1-yangyicong@huawei.com>
Series	arm64: support batched/deferred tlb shootdown during page reclamation
	[v6,0/2] arm64: support batched/deferred tlb shootdown during page reclamation
	[v6,1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()
	[v6,2/2] arm64: support batched/deferred tlb shootdown during page reclamation

Context	Check	Description
conchuod/patch_count	success	Link
conchuod/cover_letter	success	Series has a cover letter
conchuod/tree_selection	success	Guessed tree name to be for-next
conchuod/fixes_present	success	Fixes tag not required for -next series
conchuod/verify_signedoff	success	Signed-off-by tag matches author and committer
conchuod/kdoc	success	Errors and warnings before: 0 this patch: 0
conchuod/module_param	success	Was 0 now: 0
conchuod/build_rv32_defconfig	success	Build OK
conchuod/build_warn_rv64	success	Errors and warnings before: 0 this patch: 0
conchuod/dtb_warn_rv64	success	Errors and warnings before: 0 this patch: 0
conchuod/header_inline	success	No static functions without inline keyword in header files
conchuod/checkpatch	warning	CHECK: Alignment should match open parenthesis WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
conchuod/source_inline	success	Was 0 now: 0
conchuod/build_rv64_nommu_k210_defconfig	success	Build OK
conchuod/verify_fixes	success	No Fixes tag
conchuod/build_rv64_nommu_virt_defconfig	success	Build OK
