[v11,09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

A new mechanism, LUF(Lazy Unmap Flush), defers the tlb flush for folios
that have been unmapped and freed until they eventually get allocated
again.  This is safe for folios that had been mapped read-only before
being unmapped, as long as their contents don't change while they stay
in pcp or buddy, so the data can still be read correctly through the
stale tlb entries.

The tlb flush can be deferred when folios get unmapped, as long as the
required flush is guaranteed to be performed before the folios actually
get used again and, of course, only if none of the corresponding ptes
has write permission.  Otherwise, writes through stale tlb entries could
corrupt folios that have already been reused.

To achieve that, for folios that are mapped only through non-writable
tlb entries, skip the tlb flush during unmapping and perform it just
before the folios actually get used again, that is, when they leave
buddy or pcp.
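
As a rough illustration of the deferral described above, the following
userspace toy tracks one global generation counter.  All names (toy_luf,
toy_folio, toy_unmap, toy_alloc) are hypothetical and not taken from the
patch, and a single "flush everything" operation stands in for the real
per-mm, per-CPU tlb flush, so this is only a sketch of the idea:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy state; the real kernel bookkeeping is more involved. */
struct toy_luf {
	unsigned long unmap_gen;   /* bumped on every deferred unmap */
	unsigned long flushed_gen; /* newest generation covered by a flush */
	unsigned long nr_flushes;  /* how many flushes were performed */
};

struct toy_folio {
	unsigned long ugen;        /* 0 means no flush is pending */
};

/* Stand-in for a tlb flush: it covers every unmap seen so far. */
static void toy_flush(struct toy_luf *l)
{
	l->flushed_gen = l->unmap_gen;
	l->nr_flushes++;
}

/* Unmap a folio: defer the flush only if no pte was writable. */
static void toy_unmap(struct toy_luf *l, struct toy_folio *f,
		      bool had_writable_pte)
{
	if (had_writable_pte) {
		toy_flush(l);      /* cannot be deferred */
		f->ugen = 0;
		return;
	}
	f->ugen = ++l->unmap_gen;  /* remember which flush is owed */
}

/* Take a folio out of the free lists: pay the owed flush first. */
static void toy_alloc(struct toy_luf *l, struct toy_folio *f)
{
	assert(f->ugen <= l->unmap_gen);
	if (f->ugen > l->flushed_gen)
		toy_flush(l);
	f->ugen = 0;
}
```

In this toy, unmapping two read-only folios performs no flush at all, and
allocating either of them afterwards performs a single flush that covers
both.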

However, the deferral pending under LUF should be cancelled and the
deferred tlb flush performed right away when:

   1. a writable pte is newly set through the fault handler
   2. a file is updated
   3. kasan needs poisoning on free
   4. the kernel wants to init pages on free
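
The four conditions above can be read as a simple predicate.  The sketch
below is a userspace illustration only; the enum, its bit values and the
function name are hypothetical and do not appear in the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event bits, one per condition in the list above. */
enum toy_luf_event {
	TOY_WRITABLE_PTE_SET     = 1u << 0, /* 1. writable pte newly set */
	TOY_FILE_UPDATED         = 1u << 1, /* 2. a file is updated */
	TOY_KASAN_POISON_ON_FREE = 1u << 2, /* 3. kasan poisons on free */
	TOY_INIT_ON_FREE         = 1u << 3, /* 4. pages are inited on free */
};

/*
 * Return true when a deferred flush is pending and at least one of the
 * cancelling events has happened, i.e. the flush must be done right now.
 */
static bool toy_luf_must_flush_now(bool flush_pending, unsigned int events)
{
	const unsigned int cancel_mask = TOY_WRITABLE_PTE_SET |
					 TOY_FILE_UPDATED |
					 TOY_KASAN_POISON_ON_FREE |
					 TOY_INIT_ON_FREE;

	assert((events & ~cancel_mask) == 0); /* reject unknown bits */
	return flush_pending && (events & cancel_mask) != 0;
}
```

All four events are ones in which the folio contents or the mapping
permissions are about to change, which is exactly what the read-only
assumption behind the deferral cannot tolerate.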

No matter what type of workload is used for performance evaluation, the
result would be positive thanks to the unconditional reduction of tlb
flushes, tlb misses and interrupts.  For the test, I picked one of the
most popular and heavy workloads, llama.cpp, which is an
LLM(Large Language Model) inference engine.

The result depends on memory latency and on how often reclaim runs,
which determine the tlb miss overhead and how many times unmapping
happens, respectively.  On my system, the result shows:

   1. tlb shootdown interrupts are reduced by about 97%.
   2. The test program runtime is reduced by about 4.5%.

The test environment and the results are as follows:

   Machine: bare metal, x86_64, Intel(R) Xeon(R) Gold 6430
   CPU: 1 socket, 64 cores, hyper-threading on
   Numa: 2 nodes (64 CPUs with 42GB DRAM, no CPUs with 98GB CXL expander)
   Config: swap off, numa balancing tiering on, demotion enabled

   The test set:

      llama.cpp/main -m $(70G_model1) -p "who are you?" -s 1 -t 15 -n 20 &
      llama.cpp/main -m $(70G_model2) -p "who are you?" -s 1 -t 15 -n 20 &
      llama.cpp/main -m $(70G_model3) -p "who are you?" -s 1 -t 15 -n 20 &
      wait

      where -t: nr of threads, -s: seed used to make the runtime stable,
      -n: nr of tokens that determines the runtime, -p: prompt to ask,
      -m: LLM model to use.

   Run the test set 5 times in succession, dropping caches on every run
   via 'echo 3 > /proc/sys/vm/drop_caches'.  Each inference prints its
   runtime when it finishes.

   1. Runtime from the output of llama.cpp:

   BEFORE
   ------
   llama_print_timings:       total time =  883450.54 ms /    24 tokens
   llama_print_timings:       total time =  861665.91 ms /    24 tokens
   llama_print_timings:       total time =  898079.02 ms /    24 tokens
   llama_print_timings:       total time =  879897.69 ms /    24 tokens
   llama_print_timings:       total time =  892360.75 ms /    24 tokens
   llama_print_timings:       total time =  884587.85 ms /    24 tokens
   llama_print_timings:       total time =  861023.19 ms /    24 tokens
   llama_print_timings:       total time =  900022.18 ms /    24 tokens
   llama_print_timings:       total time =  878771.88 ms /    24 tokens
   llama_print_timings:       total time =  889027.98 ms /    24 tokens
   llama_print_timings:       total time =  880783.90 ms /    24 tokens
   llama_print_timings:       total time =  856475.29 ms /    24 tokens
   llama_print_timings:       total time =  896842.21 ms /    24 tokens
   llama_print_timings:       total time =  878883.53 ms /    24 tokens
   llama_print_timings:       total time =  890122.10 ms /    24 tokens

   AFTER
   -----
   llama_print_timings:       total time =  871060.86 ms /    24 tokens
   llama_print_timings:       total time =  825609.53 ms /    24 tokens
   llama_print_timings:       total time =  836854.81 ms /    24 tokens
   llama_print_timings:       total time =  843147.99 ms /    24 tokens
   llama_print_timings:       total time =  831426.65 ms /    24 tokens
   llama_print_timings:       total time =  873939.23 ms /    24 tokens
   llama_print_timings:       total time =  826127.69 ms /    24 tokens
   llama_print_timings:       total time =  835489.26 ms /    24 tokens
   llama_print_timings:       total time =  842589.62 ms /    24 tokens
   llama_print_timings:       total time =  833700.66 ms /    24 tokens
   llama_print_timings:       total time =  875996.19 ms /    24 tokens
   llama_print_timings:       total time =  826401.73 ms /    24 tokens
   llama_print_timings:       total time =  839341.28 ms /    24 tokens
   llama_print_timings:       total time =  841075.10 ms /    24 tokens
   llama_print_timings:       total time =  835136.41 ms /    24 tokens

   2. tlb shootdowns from 'cat /proc/interrupts':

   BEFORE
   ------
   TLB:
    80911532   93691786  100296251  111062810  109769109  109862429
   108968588  119175230  115779676  118377498  119325266  120300143
   124514185  116697222  121068466  118031913  122660681  117494403
   121819907  116960596  120936335  117217061  118630217  122322724
   119595577  111693298  119232201  120030377  115334687  113179982
   118808254  116353592  140987367  137095516  131724276  139742240
   136501150  130428761  127585535  132483981  133430250  133756207
   131786710  126365824  129812539  133850040  131742690  125142213
   128572830  132234350  131945922  128417707  133355434  129972846
   126331823  134050849  133991626  121129038  124637283  132830916
   126875507  122322440  125776487  124340278   TLB shootdowns

   AFTER
   -----
   TLB:
     2121206    2615108    2983494    2911950    3055086    3092672
     3204894    3346082    3286744    3307310    3357296    3315940
     3428034    3112596    3143325    3185551    3186493    3322314
     3330523    3339663    3156064    3272070    3296309    3198962
     3332662    3315870    3234467    3353240    3281234    3300666
     3345452    3173097    4009196    3932215    3898735    3726531
     3717982    3671726    3728788    3724613    3799147    3691764
     3620630    3684655    3666688    3393974    3448651    3487593
     3446357    3618418    3671920    3712949    3575264    3715385
     3641513    3630897    3691047    3630690    3504933    3662647
     3629926    3443044    3832970    3548813   TLB shootdowns
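
The two headline percentages can be cross-checked against the raw
figures listed above.  The following is a small standalone sketch that
compares the means of the BEFORE and AFTER samples; the helper names are
made up for this illustration and the numbers are copied from the
listings:

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* llama.cpp total time in ms, from the BEFORE/AFTER runtime listings. */
static const double runtime_before[] = {
	883450.54, 861665.91, 898079.02, 879897.69, 892360.75,
	884587.85, 861023.19, 900022.18, 878771.88, 889027.98,
	880783.90, 856475.29, 896842.21, 878883.53, 890122.10,
};
static const double runtime_after[] = {
	871060.86, 825609.53, 836854.81, 843147.99, 831426.65,
	873939.23, 826127.69, 835489.26, 842589.62, 833700.66,
	875996.19, 826401.73, 839341.28, 841075.10, 835136.41,
};

/* Per-CPU "TLB shootdowns" counts from /proc/interrupts. */
static const double tlb_before[] = {
	80911532, 93691786, 100296251, 111062810, 109769109, 109862429,
	108968588, 119175230, 115779676, 118377498, 119325266, 120300143,
	124514185, 116697222, 121068466, 118031913, 122660681, 117494403,
	121819907, 116960596, 120936335, 117217061, 118630217, 122322724,
	119595577, 111693298, 119232201, 120030377, 115334687, 113179982,
	118808254, 116353592, 140987367, 137095516, 131724276, 139742240,
	136501150, 130428761, 127585535, 132483981, 133430250, 133756207,
	131786710, 126365824, 129812539, 133850040, 131742690, 125142213,
	128572830, 132234350, 131945922, 128417707, 133355434, 129972846,
	126331823, 134050849, 133991626, 121129038, 124637283, 132830916,
	126875507, 122322440, 125776487, 124340278,
};
static const double tlb_after[] = {
	2121206, 2615108, 2983494, 2911950, 3055086, 3092672,
	3204894, 3346082, 3286744, 3307310, 3357296, 3315940,
	3428034, 3112596, 3143325, 3185551, 3186493, 3322314,
	3330523, 3339663, 3156064, 3272070, 3296309, 3198962,
	3332662, 3315870, 3234467, 3353240, 3281234, 3300666,
	3345452, 3173097, 4009196, 3932215, 3898735, 3726531,
	3717982, 3671726, 3728788, 3724613, 3799147, 3691764,
	3620630, 3684655, 3666688, 3393974, 3448651, 3487593,
	3446357, 3618418, 3671920, 3712949, 3575264, 3715385,
	3641513, 3630897, 3691047, 3630690, 3504933, 3662647,
	3629926, 3443044, 3832970, 3548813,
};

static double mean(const double *v, size_t n)
{
	double s = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		s += v[i];
	return s / (double)n;
}

/* Relative reduction of the mean, in percent. */
static double reduction_pct(const double *before, size_t nb,
			    const double *after, size_t na)
{
	double mb = mean(before, nb);

	return 100.0 * (mb - mean(after, na)) / mb;
}

static double runtime_reduction_pct(void)
{
	return reduction_pct(runtime_before, ARRAY_LEN(runtime_before),
			     runtime_after, ARRAY_LEN(runtime_after));
}

static double tlb_shootdown_reduction_pct(void)
{
	return reduction_pct(tlb_before, ARRAY_LEN(tlb_before),
			     tlb_after, ARRAY_LEN(tlb_after));
}
```

Calling runtime_reduction_pct() and tlb_shootdown_reduction_pct() should
give values close to the 4.5% and 97% quoted in the summary.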

Signed-off-by: Byungchul Park <byungchul@sk.com>
---
 include/linux/fs.h       |   6 +
 include/linux/mm_types.h |   8 +
 include/linux/sched.h    |   9 ++
 mm/compaction.c          |   2 +-
 mm/internal.h            |  42 +++++-
 mm/memory.c              |  39 ++++-
 mm/page_alloc.c          |  17 ++-
 mm/rmap.c                | 315 ++++++++++++++++++++++++++++++++++++++-
 8 files changed, 420 insertions(+), 18 deletions(-)

Message ID: 20240531092001.30428-10-byungchul@sk.com
State: New
From: Byungchul Park <byungchul@sk.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: kernel_team@skhynix.com, akpm@linux-foundation.org,
    ying.huang@intel.com, vernhao@tencent.com,
    mgorman@techsingularity.net, hughd@google.com, willy@infradead.org,
    david@redhat.com, peterz@infradead.org, luto@kernel.org,
    tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
    dave.hansen@linux.intel.com, rjgolo@gmail.com
Subject: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped
Date: Fri, 31 May 2024 18:19:58 +0900
In-Reply-To: <20240531092001.30428-1-byungchul@sk.com>
Series: LUF(Lazy Unmap Flush) reducing tlb numbers over 90%

   [v11,00/12] LUF(Lazy Unmap Flush) reducing tlb numbers over 90%
   [v11,01/12] x86/tlb: add APIs manipulating tlb batch's arch data
   [v11,02/12] arm64: tlbflush: add APIs manipulating tlb batch's arch data
   [v11,03/12] riscv, tlb: add APIs manipulating tlb batch's arch data
   [v11,04/12] x86/tlb, riscv/tlb, mm/rmap: separate arch_tlbbatch_clear() out of arch_tlbbatch_flush()
   [v11,05/12] mm: buddy: make room for a new variable, ugen, in struct page
   [v11,06/12] mm: add folio_put_ugen() to deliver unmap generation number to pcp or buddy
   [v11,07/12] mm: add a parameter, unmap generation number, to free_unref_folios()
   [v11,08/12] mm/rmap: recognize read-only tlb entries during batched tlb flush
   [v11,09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped
   [v11,10/12] mm: separate move/undo parts from migrate_pages_batch()
   [v11,11/12] mm, migrate: apply luf mechanism to unmapping during migration
   [v11,12/12] mm, vmscan: apply luf mechanism to unmapping during folio reclaim
