From patchwork Wed Jul 6 11:20:41 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Barry Song <21cnbao@gmail.com>
X-Patchwork-Id: 12907999
From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org,
 linux-arm-kernel@lists.infradead.org
Cc: catalin.marinas@arm.com, huzhanyuan@oppo.com, lipeifeng@oppo.com,
 zhangshiming@oppo.com, guojian@oppo.com, Barry Song, Yu Zhao,
 Will Deacon, Alex Van Brunt, Shaohua Li
Subject: [PATCH] mm: rmap: Don't flush TLB after checking PTE young for page reference
Date: Wed, 6 Jul 2022 23:20:41 +1200
Message-Id: <20220706112041.3831-1-21cnbao@gmail.com>
X-Mailer: git-send-email 2.25.1
Sender: owner-linux-mm@kvack.org
Precedence: bulk

From: Barry Song

Whether it is done in hardware or software, TLB flushing is usually
extremely expensive. A page can be mapped by many processes at the same
time, and in folio_referenced_one() every mapping whose PTE is young
triggers a TLB broadcast, so the flush overhead is multiplied by the
number of such mappings.

Some platforms have tried to remove this overhead by implementing their
own ptep_clear_flush_young() in which the flush is dropped (x86, s390,
powerpc, riscv) or deferred (arm64). This approach has obviously broken
the semantics of the API, since it is named "flush". Dropping the flush
in a function named "flush" isn't cool.

On ARM64, ptep_clear_flush_young() uses flush_tlb_page_nosync() as a
cheaper replacement for the more expensive synchronous TLB broadcast
with dsb. But the cost of this nosync alternative has probably been
underestimated.

Profiling was done by running a program under high memory pressure on a
ROCK 3A board (RK3568, 64-bit quad-core Cortex-A55) with 4GB of memory,
using zRAM as the swap device.

In the program, 8 processes access one shared memory region, as below:

	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define MB (1024 * 1024)

	int main(void)
	{
		volatile unsigned char *p = mmap(NULL, 4096UL * MB,
						 PROT_READ | PROT_WRITE,
						 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		/* touch every page so the whole region is populated */
		memset((void *)p, 0x11, 4096UL * MB);

		/*
		 * simulate memory mapped by multiple processes, like libs,
		 * .text section, shmem
		 */
		fork(); fork(); fork();

		while (1) {
			int i;
			/* randomly get an offset then access 1024 pages */
			unsigned long offset = (rand() % MB);

			if (offset + 1024 > MB)
				offset = MB - 1024;
			for (i = 0; i < 1024; i++)
				(void)p[(offset + i) * 4096];
			usleep(1000);
		}
	}

After removing "inline" before flush_tlb_page_nosync() as below,

	static noinline void flush_tlb_page_nosync(struct vm_area_struct *vma,

the perf result for kswapd is quite surprising:

  19.63%  kswapd0  [kernel.kallsyms]  [k] page_vma_mapped_walk
  10.69%  kswapd0  [kernel.kallsyms]  [k] flush_tlb_page_nosync
   6.73%  kswapd0  [kernel.kallsyms]  [k] folio_referenced_one
   5.92%  kswapd0  [kernel.kallsyms]  [k] zram_bvec_rw.constprop.0.isra.0
   4.55%  kswapd0  [kernel.kallsyms]  [k] ptep_clear_flush
   3.66%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
   2.87%  kswapd0  [kernel.kallsyms]  [k] rmap_walk_file
   2.72%  kswapd0  [kernel.kallsyms]  [k] try_to_unmap_one
   2.03%  kswapd0  [kernel.kallsyms]  [k] vma_interval_tree_iter_next
   1.86%  kswapd0  [kernel.kallsyms]  [k] shrink_page_list
   1.86%  kswapd0  [kernel.kallsyms]  [k] isolate_lru_pages
   1.78%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_unlock
   1.23%  kswapd0  [kernel.kallsyms]  [k] vma_interval_tree_subtree_search
   1.15%  kswapd0  [kernel.kallsyms]  [k] PageHuge
   1.02%  kswapd0  [kernel.kallsyms]  [k] check_pte

If flush_tlb_page_nosync() is inlined, its overhead is attributed to its
callers instead, which is why "inline" was removed for profiling. The
10.69% overhead shows that, even with the nosync tlbi, ARM64 still needs
to move to ptep_clear_young_notify().

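The cost argument can be illustrated with a small user-space toy model.
This is only a sketch under stated assumptions: every toy_* name below
is hypothetical and is not a kernel API, and the model counts nothing
but how many TLB invalidations a reference scan would issue for one page
when the "clear young" primitive flushes versus when it does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy PTE: only the accessed ("young") bit is modelled. */
struct toy_pte {
	bool young;
};

/* Hypothetical counter for the TLB invalidations a scan would issue. */
struct toy_stats {
	unsigned long tlb_flushes;
};

/* Flushing variant: clear the young bit, count one flush if it was set. */
bool toy_clear_flush_young(struct toy_pte *pte, struct toy_stats *st)
{
	bool was_young = pte->young;

	pte->young = false;
	if (was_young)
		st->tlb_flushes++;
	return was_young;
}

/* Flush-free variant: clear the young bit and issue no invalidation. */
bool toy_clear_young(struct toy_pte *pte)
{
	bool was_young = pte->young;

	pte->young = false;
	return was_young;
}

/*
 * Visit every mapping of one page, as a reverse-map walk would, and
 * return how many mappings had the young bit set.
 */
unsigned int toy_page_referenced(struct toy_pte *ptes, size_t nr_mappings,
				 bool flush, struct toy_stats *st)
{
	unsigned int referenced = 0;

	assert(ptes != NULL && st != NULL);
	for (size_t i = 0; i < nr_mappings; i++) {
		bool young = flush ? toy_clear_flush_young(&ptes[i], st)
				   : toy_clear_young(&ptes[i]);

		if (young)
			referenced++;
	}
	return referenced;
}
```

In this model, scanning a page whose 8 mappings are all young reports 8
references in both modes, but only the flushing mode accounts for 8
invalidations. The reference information the scan needs is the same
either way.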
In addition to the commits that removed the flush on platforms such as
riscv, x86 and powerpc, Yu Zhao listed further evidence for moving to
ptep_clear_young_notify() in vmscan during the MGLRU discussion:

* The fundamental hardware limitation in terms of TLB scalability[1]
* Alexander's benchmark[2]
* The TLB doesn't cache stale PTE young bits most of the time, so
  flushing the TLB just for the sake of the A-bit isn't necessary[3]

This patch solves the problem at the source, vmscan, so platforms which
haven't dropped the flush can probably benefit directly. ARM64, even
with its lightweight tlbi, can also finally get rid of the overhead of
the nosync TLB flush. Last but not least, MGLRU does not flush in
look_around after clearing PTE young, so this patch also makes vmscan
generally consistent with the MGLRU approach.

[1] https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf
[2] https://lore.kernel.org/r/BYAPR12MB271295B398729E07F31082A7CFAA0@BYAPR12MB2712.namprd12.prod.outlook.com/
[3] https://lore.kernel.org/lkml/CAOUHufbOwPSbBwd7TG0QFt4YJvBp93Q9nUJEDvMpUA6PqjYMUQ@mail.gmail.com/

Cc: Yu Zhao
Cc: Will Deacon
Cc: Alex Van Brunt
Cc: Shaohua Li
Signed-off-by: Barry Song
---
 -v1
 differences with rfc
 * refine commit log
 * investigate arm64's flush_tlb_page_nosync under memory pressure

 mm/rmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..7ce6f0b6c330 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -830,7 +830,7 @@ static bool folio_referenced_one(struct folio *folio,
 		}
 
 		if (pvmw.pte) {
-			if (ptep_clear_flush_young_notify(vma, address,
+			if (ptep_clear_young_notify(vma, address,
 						pvmw.pte)) {
 				/*
 				 * Don't treat a reference through