
[RFC,v12,16/26] mm: implement LUF(Lazy Unmap Flush) deferring tlb flush when folios get unmapped

Message ID 20250220052027.58847-17-byungchul@sk.com (mailing list archive)
State New
Series LUF(Lazy Unmap Flush) reducing tlb numbers over 90%

Commit Message

Byungchul Park Feb. 20, 2025, 5:20 a.m. UTC
A new mechanism, LUF(Lazy Unmap Flush), defers the tlb flush for folios
that have been unmapped and freed until they eventually get allocated
again.  This is safe for folios that had been mapped read-only and then
unmapped, because the contents of such folios don't change while they
stay in pcp or buddy, so the data can still be read correctly through
the stale tlb entries.

A tlb flush can be deferred when folios get unmapped, as long as the
needed flush is guaranteed to be performed before the folios actually
become used again, and only if none of the corresponding ptes have
write permission.  Otherwise, the system would get corrupted.

To achieve that, for folios mapped only by non-writable tlb entries,
skip the tlb flush during unmapping and perform it just before the
folios actually become used again, on their way out of buddy or pcp.
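
The overall flow added by this patch (together with the rest of the
series) can be roughly sketched as follows.  It reuses the function
names introduced by the patch but omits locking and the luf_batch
bookkeeping, and the association of the returned luf_key with the
freed folios is done elsewhere in the series:

   /* unmap path (try_to_unmap()/try_to_migrate() in mm/rmap.c) */
   can_luf_init(folio);            /* can this folio be handled by luf? */
   rmap_walk(folio, &rwc);         /* set_tlb_ubc_flush_pending() per pte */
   if (can_luf_test())
           fold_batch(tlb_ubc_luf, tlb_ubc_ro, true); /* defer shootdown */
   else
           fold_batch(tlb_ubc, tlb_ubc_ro, true);     /* flush as usual */

   /* free path: stash the deferred batch and get a key for it */
   luf_key = fold_unmap_luf();

   /* allocation path, or any event that must not see stale tlb entries */
   luf_flush(luf_key);             /* luf_flush(0) flushes everything
                                      pended so far */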

However, we should cancel the deferral by LUF and perform the pending
TLB flush right away in the following cases (case 1 is sketched after
the list):

   1. a writable pte is newly set through fault handler
   2. a file is updated
   3. kasan needs poisoning on free
   4. the kernel wants to init pages on free
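
For example, case 1 above is covered by the handle_mm_fault() hunk in
mm/memory.c below, which in simplified form does:

   bool flush = false;

   /* any path that may install a writable pte, even forcibly */
   if (vma->vm_flags & (VM_WRITE | VM_MAYWRITE) ||
                   flags & FAULT_FLAG_WRITE)
           flush = true;

   ret = __handle_mm_fault(vma, address, flags);  /* or hugetlb_fault() */

   if (flush)
           luf_flush(0);   /* perform the deferred TLB flush right away */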

No matter what type of workload is used for performance evaluation, the
result should be positive thanks to the unconditional reduction of tlb
flushes, tlb misses and interrupts.  For the test, I picked one of the
most popular and heavy workloads, llama.cpp, an LLM(Large Language
Model) inference engine.

The result depends on memory latency and how often reclaim runs, which
determine the tlb miss overhead and how many times unmapping happens.
On my system, the result shows:

   1. tlb shootdown interrupts are reduced by about 97%.
   2. The test program runtime is reduced by about 4.5%.

The test environment and the test set are as follows:

   Machine: bare metal, x86_64, Intel(R) Xeon(R) Gold 6430
   CPU: 1 socket 64 core with hyper thread on
   Numa: 2 nodes (64 CPUs DRAM 42GB, no CPUs CXL expander 98GB)
   Config: swap off, numa balancing tiering on, demotion enabled

   llama.cpp/main -m $(70G_model1) -p "who are you?" -s 1 -t 15 -n 20 &
   llama.cpp/main -m $(70G_model2) -p "who are you?" -s 1 -t 15 -n 20 &
   llama.cpp/main -m $(70G_model3) -p "who are you?" -s 1 -t 15 -n 20 &
   wait

   where,
   -t: nr of threads, -s: seed used to make the runtime stable,
   -n: nr of tokens that determines the runtime, -p: prompt to ask,
   -m: LLM model to use.

The test set was run 5 times successively, with caches dropped before
every run via 'echo 3 > /proc/sys/vm/drop_caches'.  Each inference
prints its runtime when it finishes.  The results are:

   1. Runtime from the output of llama.cpp

   BEFORE
   ------
   llama_print_timings:       total time =  883450.54 ms /    24 tokens
   llama_print_timings:       total time =  861665.91 ms /    24 tokens
   llama_print_timings:       total time =  898079.02 ms /    24 tokens
   llama_print_timings:       total time =  879897.69 ms /    24 tokens
   llama_print_timings:       total time =  892360.75 ms /    24 tokens
   llama_print_timings:       total time =  884587.85 ms /    24 tokens
   llama_print_timings:       total time =  861023.19 ms /    24 tokens
   llama_print_timings:       total time =  900022.18 ms /    24 tokens
   llama_print_timings:       total time =  878771.88 ms /    24 tokens
   llama_print_timings:       total time =  889027.98 ms /    24 tokens
   llama_print_timings:       total time =  880783.90 ms /    24 tokens
   llama_print_timings:       total time =  856475.29 ms /    24 tokens
   llama_print_timings:       total time =  896842.21 ms /    24 tokens
   llama_print_timings:       total time =  878883.53 ms /    24 tokens
   llama_print_timings:       total time =  890122.10 ms /    24 tokens

   AFTER
   -----
   llama_print_timings:       total time =  871060.86 ms /    24 tokens
   llama_print_timings:       total time =  825609.53 ms /    24 tokens
   llama_print_timings:       total time =  836854.81 ms /    24 tokens
   llama_print_timings:       total time =  843147.99 ms /    24 tokens
   llama_print_timings:       total time =  831426.65 ms /    24 tokens
   llama_print_timings:       total time =  873939.23 ms /    24 tokens
   llama_print_timings:       total time =  826127.69 ms /    24 tokens
   llama_print_timings:       total time =  835489.26 ms /    24 tokens
   llama_print_timings:       total time =  842589.62 ms /    24 tokens
   llama_print_timings:       total time =  833700.66 ms /    24 tokens
   llama_print_timings:       total time =  875996.19 ms /    24 tokens
   llama_print_timings:       total time =  826401.73 ms /    24 tokens
   llama_print_timings:       total time =  839341.28 ms /    24 tokens
   llama_print_timings:       total time =  841075.10 ms /    24 tokens
   llama_print_timings:       total time =  835136.41 ms /    24 tokens

   2. tlb shootdowns from 'cat /proc/interrupts'

   BEFORE
   ------
   TLB:
    80911532   93691786  100296251  111062810  109769109  109862429
   108968588  119175230  115779676  118377498  119325266  120300143
   124514185  116697222  121068466  118031913  122660681  117494403
   121819907  116960596  120936335  117217061  118630217  122322724
   119595577  111693298  119232201  120030377  115334687  113179982
   118808254  116353592  140987367  137095516  131724276  139742240
   136501150  130428761  127585535  132483981  133430250  133756207
   131786710  126365824  129812539  133850040  131742690  125142213
   128572830  132234350  131945922  128417707  133355434  129972846
   126331823  134050849  133991626  121129038  124637283  132830916
   126875507  122322440  125776487  124340278   TLB shootdowns

   AFTER
   -----
   TLB:
     2121206    2615108    2983494    2911950    3055086    3092672
     3204894    3346082    3286744    3307310    3357296    3315940
     3428034    3112596    3143325    3185551    3186493    3322314
     3330523    3339663    3156064    3272070    3296309    3198962
     3332662    3315870    3234467    3353240    3281234    3300666
     3345452    3173097    4009196    3932215    3898735    3726531
     3717982    3671726    3728788    3724613    3799147    3691764
     3620630    3684655    3666688    3393974    3448651    3487593
     3446357    3618418    3671920    3712949    3575264    3715385
     3641513    3630897    3691047    3630690    3504933    3662647
     3629926    3443044    3832970    3548813   TLB shootdowns

Signed-off-by: Byungchul Park <byungchul@sk.com>
---
 include/asm-generic/tlb.h |   5 ++
 include/linux/fs.h        |  12 +++-
 include/linux/mm_types.h  |   6 ++
 include/linux/sched.h     |   9 +++
 kernel/sched/core.c       |   1 +
 mm/internal.h             |  94 ++++++++++++++++++++++++-
 mm/memory.c               |  15 ++++
 mm/pgtable-generic.c      |   2 +
 mm/rmap.c                 | 141 +++++++++++++++++++++++++++++++++++---
 mm/truncate.c             |  55 +++++++++++++--
 mm/vmscan.c               |  12 +++-
 11 files changed, 333 insertions(+), 19 deletions(-)

Patch

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 709830274b756..4a99351be111e 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -549,6 +549,11 @@  static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *
 
 static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
+	/*
+	 * Don't leave stale tlb entries for this vma.
+	 */
+	luf_flush(0);
+
 	if (tlb->fullmm)
 		return;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bfd8aaeb78bb8..ec88270221bfe 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -499,8 +499,18 @@  static inline int mapping_write_begin(struct file *file,
 				loff_t pos, unsigned len,
 				struct folio **foliop, void **fsdata)
 {
-	return mapping->a_ops->write_begin(file, mapping, pos, len, foliop,
+	int ret;
+
+	ret = mapping->a_ops->write_begin(file, mapping, pos, len, foliop,
 			fsdata);
+
+	/*
+	 * Ensure to clean stale tlb entries for this mapping.
+	 */
+	if (!ret)
+		luf_flush(0);
+
+	return ret;
 }
 
 static inline int mapping_write_end(struct file *file,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 39a6b5124b01f..b3eb5a4e45efb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1270,6 +1270,12 @@  extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_finish_mmu(struct mmu_gather *tlb);
 
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+void luf_flush(unsigned short luf_key);
+#else
+static inline void luf_flush(unsigned short luf_key) {}
+#endif
+
 struct vm_fault;
 
 /**
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a217d6011fdfe..94321d51b91e8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1382,6 +1382,15 @@  struct task_struct {
 	struct tlbflush_unmap_batch	tlb_ubc;
 	struct tlbflush_unmap_batch	tlb_ubc_takeoff;
 	struct tlbflush_unmap_batch	tlb_ubc_ro;
+	struct tlbflush_unmap_batch	tlb_ubc_luf;
+
+#if defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
+	/*
+	 * whether all the mappings of a folio during unmap are read-only
+	 * so that luf can work on the folio
+	 */
+	bool				can_luf;
+#endif
 
 	/* Cache last used pipe for splice(): */
 	struct pipe_inode_info		*splice_pipe;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 719e0ed1e9761..aea08d8a9e258 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5225,6 +5225,7 @@  static struct rq *finish_task_switch(struct task_struct *prev)
 	if (mm) {
 		membarrier_mm_sync_core_before_usermode(mm);
 		mmdrop_lazy_tlb_sched(mm);
+		luf_flush(0);
 	}
 
 	if (unlikely(prev_state == TASK_DEAD)) {
diff --git a/mm/internal.h b/mm/internal.h
index 0dc374553f9b5..fe4a1c174895f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1604,13 +1604,105 @@  static inline bool non_luf_pages_ok(struct zone *zone)
 
 	return nr_free - nr_luf_pages > min_wm;
 }
-#else
+
+unsigned short fold_unmap_luf(void);
+
+/*
+ * Initialize the indicator of whether luf can be applied, at the
+ * beginning of every rmap traverse for unmap.  luf can work only when
+ * all the mappings are read-only.
+ */
+static inline void can_luf_init(struct folio *f)
+{
+	if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
+		current->can_luf = false;
+	/*
+	 * Pages might get updated inside buddy.
+	 */
+	else if (want_init_on_free())
+		current->can_luf = false;
+	/*
+	 * Pages might get updated inside buddy.
+	 */
+	else if (!should_skip_kasan_poison(folio_page(f, 0)))
+		current->can_luf = false;
+	/*
+	 * XXX: Remove the constraint once luf handles zone device folio.
+	 */
+	else if (unlikely(folio_is_zone_device(f)))
+		current->can_luf = false;
+	/*
+	 * XXX: Remove the constraint once luf handles hugetlb folio.
+	 */
+	else if (unlikely(folio_test_hugetlb(f)))
+		current->can_luf = false;
+	/*
+	 * XXX: Remove the constraint once luf handles large folio.
+	 */
+	else if (unlikely(folio_test_large(f)))
+		current->can_luf = false;
+	/*
+	 * Can track write of anon folios through fault handler.
+	 */
+	else if (folio_test_anon(f))
+		current->can_luf = true;
+	/*
+	 * Can track write of file folios through page cache or truncation.
+	 */
+	else if (folio_mapping(f))
+		current->can_luf = true;
+	/*
+	 * For folios that are neither anon nor file, do not apply luf.
+	 */
+	else
+		current->can_luf = false;
+}
+
+/*
+ * Mark the folio as not applicable to luf once a writable or dirty
+ * pte has been found during rmap traverse for unmap.
+ */
+static inline void can_luf_fail(void)
+{
+	current->can_luf = false;
+}
+
+/*
+ * Check if all the mappings are read-only.
+ */
+static inline bool can_luf_test(void)
+{
+	return current->can_luf;
+}
+
+static inline bool can_luf_vma(struct vm_area_struct *vma)
+{
+	/*
+	 * A shared region requires a medium like a file to keep all the
+	 * associated mm_structs.  luf makes use of struct address_space
+	 * for that purpose.
+	 */
+	if (vma->vm_flags & VM_SHARED)
+		return !!vma->vm_file;
+
+	/*
+	 * Private region can be handled through its mm_struct.
+	 */
+	return true;
+}
+#else /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
 static inline bool luf_takeoff_start(void) { return false; }
 static inline void luf_takeoff_end(void) {}
 static inline bool luf_takeoff_no_shootdown(void) { return true; }
 static inline bool luf_takeoff_check(struct page *page) { return true; }
 static inline bool luf_takeoff_check_and_fold(struct page *page) { return true; }
 static inline bool non_luf_pages_ok(struct zone *zone) { return true; }
+static inline unsigned short fold_unmap_luf(void) { return 0; }
+
+static inline void can_luf_init(struct folio *f) {}
+static inline void can_luf_fail(void) {}
+static inline bool can_luf_test(void) { return false; }
+static inline bool can_luf_vma(struct vm_area_struct *vma) { return false; }
 #endif
 
 /* pagewalk.c */
diff --git a/mm/memory.c b/mm/memory.c
index 209885a4134f7..0e85c49bc5028 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6081,6 +6081,7 @@  vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	struct mm_struct *mm = vma->vm_mm;
 	vm_fault_t ret;
 	bool is_droppable;
+	bool flush = false;
 
 	__set_current_state(TASK_RUNNING);
 
@@ -6106,6 +6107,14 @@  vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 
 	lru_gen_enter_fault(vma);
 
+	/*
+	 * Any potential case that makes a pte writable, even forcibly,
+	 * should be considered.
+	 */
+	if (vma->vm_flags & (VM_WRITE | VM_MAYWRITE) ||
+			flags & FAULT_FLAG_WRITE)
+		flush = true;
+
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
 	else
@@ -6137,6 +6146,12 @@  vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 out:
 	mm_account_fault(mm, regs, address, flags, ret);
 
+	/*
+	 * Ensure to clean stale tlb entries for this vma.
+	 */
+	if (flush)
+		luf_flush(0);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 5297dcc38c37a..215d8d93560fd 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -99,6 +99,8 @@  pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
 	pte = ptep_get_and_clear(mm, address, ptep);
 	if (pte_accessible(mm, pte))
 		flush_tlb_page(vma, address);
+	else
+		luf_flush(0);
 	return pte;
 }
 #endif
diff --git a/mm/rmap.c b/mm/rmap.c
index 3ed6234dd777e..0aaf02b1b34c3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -646,7 +646,7 @@  static atomic_long_t luf_ugen = ATOMIC_LONG_INIT(LUF_UGEN_INIT);
 /*
  * Don't return invalid luf_ugen, zero.
  */
-static unsigned long __maybe_unused new_luf_ugen(void)
+static unsigned long new_luf_ugen(void)
 {
 	unsigned long ugen = atomic_long_inc_return(&luf_ugen);
 
@@ -723,7 +723,7 @@  static atomic_t luf_kgen = ATOMIC_INIT(1);
 /*
  * Don't return invalid luf_key, zero.
  */
-static unsigned short __maybe_unused new_luf_key(void)
+static unsigned short new_luf_key(void)
 {
 	unsigned short luf_key = atomic_inc_return(&luf_kgen);
 
@@ -776,6 +776,7 @@  void try_to_unmap_flush_takeoff(void)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 	struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
+	struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
 	struct tlbflush_unmap_batch *tlb_ubc_takeoff = &current->tlb_ubc_takeoff;
 
 	if (!tlb_ubc_takeoff->flush_required)
@@ -793,9 +794,72 @@  void try_to_unmap_flush_takeoff(void)
 	if (arch_tlbbatch_done(&tlb_ubc_ro->arch, &tlb_ubc_takeoff->arch))
 		reset_batch(tlb_ubc_ro);
 
+	if (arch_tlbbatch_done(&tlb_ubc_luf->arch, &tlb_ubc_takeoff->arch))
+		reset_batch(tlb_ubc_luf);
+
 	reset_batch(tlb_ubc_takeoff);
 }
 
+/*
+ * Should be called just before try_to_unmap_flush() to optimize the tlb
+ * shootdown using arch_tlbbatch_done().
+ */
+unsigned short fold_unmap_luf(void)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+	struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
+	struct luf_batch *lb;
+	unsigned long new_ugen;
+	unsigned short new_key;
+	unsigned long flags;
+
+	if (!tlb_ubc_luf->flush_required)
+		return 0;
+
+	/*
+	 * fold_unmap_luf() is always followed by try_to_unmap_flush().
+	 */
+	if (arch_tlbbatch_done(&tlb_ubc_luf->arch, &tlb_ubc->arch)) {
+		tlb_ubc_luf->flush_required = false;
+		tlb_ubc_luf->writable = false;
+	}
+
+	/*
+	 * Check again after shrinking.
+	 */
+	if (!tlb_ubc_luf->flush_required)
+		return 0;
+
+	new_ugen = new_luf_ugen();
+	new_key = new_luf_key();
+
+	/*
+	 * Update the next entry of the luf_batch table, that is, the
+	 * oldest entry among the candidates, for which the tlb flushes
+	 * have hopefully been done on all the CPUs.
+	 */
+	lb = &luf_batch[new_key];
+	write_lock_irqsave(&lb->lock, flags);
+	__fold_luf_batch(lb, tlb_ubc_luf, new_ugen);
+	write_unlock_irqrestore(&lb->lock, flags);
+
+	reset_batch(tlb_ubc_luf);
+	return new_key;
+}
+
+void luf_flush(unsigned short luf_key)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+	struct luf_batch *lb = &luf_batch[luf_key];
+	unsigned long flags;
+
+	read_lock_irqsave(&lb->lock, flags);
+	fold_batch(tlb_ubc, &lb->batch, false);
+	read_unlock_irqrestore(&lb->lock, flags);
+	try_to_unmap_flush();
+}
+EXPORT_SYMBOL(luf_flush);
+
 /*
  * Flush TLB entries for recently unmapped pages from remote CPUs. It is
  * important if a PTE was dirty when it was unmapped that it's flushed
@@ -806,8 +870,10 @@  void try_to_unmap_flush(void)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 	struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
+	struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
 
 	fold_batch(tlb_ubc, tlb_ubc_ro, true);
+	fold_batch(tlb_ubc, tlb_ubc_luf, true);
 	if (!tlb_ubc->flush_required)
 		return;
 
@@ -820,8 +886,9 @@  void try_to_unmap_flush_dirty(void)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 	struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
+	struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
 
-	if (tlb_ubc->writable || tlb_ubc_ro->writable)
+	if (tlb_ubc->writable || tlb_ubc_ro->writable || tlb_ubc_luf->writable)
 		try_to_unmap_flush();
 }
 
@@ -836,7 +903,8 @@  void try_to_unmap_flush_dirty(void)
 	(TLB_FLUSH_BATCH_PENDING_MASK / 2)
 
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
-				      unsigned long uaddr)
+				      unsigned long uaddr,
+				      struct vm_area_struct *vma)
 {
 	struct tlbflush_unmap_batch *tlb_ubc;
 	int batch;
@@ -845,7 +913,16 @@  static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
 	if (!pte_accessible(mm, pteval))
 		return;
 
-	if (pte_write(pteval))
+	if (can_luf_test()) {
+		/*
+		 * luf cannot work with the folio once a writable or
+		 * dirty mapping to it has been found.
+		 */
+		if (pte_write(pteval) || !can_luf_vma(vma))
+			can_luf_fail();
+	}
+
+	if (!can_luf_test())
 		tlb_ubc = &current->tlb_ubc;
 	else
 		tlb_ubc = &current->tlb_ubc_ro;
@@ -853,6 +930,21 @@  static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
 	arch_tlbbatch_add_pending(&tlb_ubc->arch, mm, uaddr);
 	tlb_ubc->flush_required = true;
 
+	if (can_luf_test()) {
+		struct luf_batch *lb;
+		unsigned long flags;
+
+		/*
+		 * Accumulate to the 0th entry right away so that
+		 * luf_flush(0) can be used to properly perform the pending
+		 * TLB flush once this unmapping is observed.
+		 */
+		lb = &luf_batch[0];
+		write_lock_irqsave(&lb->lock, flags);
+		__fold_luf_batch(lb, tlb_ubc, new_luf_ugen());
+		write_unlock_irqrestore(&lb->lock, flags);
+	}
+
 	/*
 	 * Ensure compiler does not re-order the setting of tlb_flush_batched
 	 * before the PTE is cleared.
@@ -907,6 +999,8 @@  static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
  * This must be called under the PTL so that an access to tlb_flush_batched
  * that is potentially a "reclaim vs mprotect/munmap/etc" race will synchronise
  * via the PTL.
+ *
+ * LUF(Lazy Unmap Flush) also relies on this for mprotect/munmap/etc.
  */
 void flush_tlb_batched_pending(struct mm_struct *mm)
 {
@@ -916,6 +1010,7 @@  void flush_tlb_batched_pending(struct mm_struct *mm)
 
 	if (pending != flushed) {
 		arch_flush_tlb_batched_pending(mm);
+
 		/*
 		 * If the new TLB flushing is pending during flushing, leave
 		 * mm->tlb_flush_batched as is, to avoid losing flushing.
@@ -926,7 +1021,8 @@  void flush_tlb_batched_pending(struct mm_struct *mm)
 }
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
-				      unsigned long uaddr)
+				      unsigned long uaddr,
+				      struct vm_area_struct *vma)
 {
 }
 
@@ -1292,6 +1388,11 @@  int folio_mkclean(struct folio *folio)
 
 	rmap_walk(folio, &rwc);
 
+	/*
+	 * Ensure to clean stale tlb entries for this mapping.
+	 */
+	luf_flush(0);
+
 	return cleaned;
 }
 EXPORT_SYMBOL_GPL(folio_mkclean);
@@ -1961,7 +2062,7 @@  static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-				set_tlb_ubc_flush_pending(mm, pteval, address);
+				set_tlb_ubc_flush_pending(mm, pteval, address, vma);
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
@@ -2132,6 +2233,8 @@  static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 
 	mmu_notifier_invalidate_range_end(&range);
 
+	if (!ret)
+		can_luf_fail();
 	return ret;
 }
 
@@ -2164,11 +2267,21 @@  void try_to_unmap(struct folio *folio, enum ttu_flags flags)
 		.done = folio_not_mapped,
 		.anon_lock = folio_lock_anon_vma_read,
 	};
+	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+	struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
+	struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
+
+	can_luf_init(folio);
 
 	if (flags & TTU_RMAP_LOCKED)
 		rmap_walk_locked(folio, &rwc);
 	else
 		rmap_walk(folio, &rwc);
+
+	if (can_luf_test())
+		fold_batch(tlb_ubc_luf, tlb_ubc_ro, true);
+	else
+		fold_batch(tlb_ubc, tlb_ubc_ro, true);
 }
 
 /*
@@ -2338,7 +2451,7 @@  static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				 */
 				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-				set_tlb_ubc_flush_pending(mm, pteval, address);
+				set_tlb_ubc_flush_pending(mm, pteval, address, vma);
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
@@ -2494,6 +2607,8 @@  static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 
 	mmu_notifier_invalidate_range_end(&range);
 
+	if (!ret)
+		can_luf_fail();
 	return ret;
 }
 
@@ -2513,6 +2628,9 @@  void try_to_migrate(struct folio *folio, enum ttu_flags flags)
 		.done = folio_not_mapped,
 		.anon_lock = folio_lock_anon_vma_read,
 	};
+	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
+	struct tlbflush_unmap_batch *tlb_ubc_ro = &current->tlb_ubc_ro;
+	struct tlbflush_unmap_batch *tlb_ubc_luf = &current->tlb_ubc_luf;
 
 	/*
 	 * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
@@ -2537,10 +2655,17 @@  void try_to_migrate(struct folio *folio, enum ttu_flags flags)
 	if (!folio_test_ksm(folio) && folio_test_anon(folio))
 		rwc.invalid_vma = invalid_migration_vma;
 
+	can_luf_init(folio);
+
 	if (flags & TTU_RMAP_LOCKED)
 		rmap_walk_locked(folio, &rwc);
 	else
 		rmap_walk(folio, &rwc);
+
+	if (can_luf_test())
+		fold_batch(tlb_ubc_luf, tlb_ubc_ro, true);
+	else
+		fold_batch(tlb_ubc, tlb_ubc_ro, true);
 }
 
 #ifdef CONFIG_DEVICE_PRIVATE
diff --git a/mm/truncate.c b/mm/truncate.c
index e5151703ba04a..14618c53f1910 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -124,6 +124,11 @@  void folio_invalidate(struct folio *folio, size_t offset, size_t length)
 
 	if (aops->invalidate_folio)
 		aops->invalidate_folio(folio, offset, length);
+
+	/*
+	 * Ensure to clean stale tlb entries for this mapping.
+	 */
+	luf_flush(0);
 }
 EXPORT_SYMBOL_GPL(folio_invalidate);
 
@@ -161,6 +166,11 @@  int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
 
 	truncate_cleanup_folio(folio);
 	filemap_remove_folio(folio);
+
+	/*
+	 * Ensure to clean stale tlb entries for this mapping.
+	 */
+	luf_flush(0);
 	return 0;
 }
 
@@ -206,6 +216,12 @@  bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 
 	if (folio_needs_release(folio))
 		folio_invalidate(folio, offset, length);
+
+	/*
+	 * Ensure to clean stale tlb entries for this mapping.
+	 */
+	luf_flush(0);
+
 	if (!folio_test_large(folio))
 		return true;
 	if (split_folio(folio) == 0)
@@ -247,19 +263,28 @@  EXPORT_SYMBOL(generic_error_remove_folio);
  */
 long mapping_evict_folio(struct address_space *mapping, struct folio *folio)
 {
+	long ret = 0;
+
 	/* The page may have been truncated before it was locked */
 	if (!mapping)
-		return 0;
+		goto out;
 	if (folio_test_dirty(folio) || folio_test_writeback(folio))
-		return 0;
+		goto out;
 	/* The refcount will be elevated if any page in the folio is mapped */
 	if (folio_ref_count(folio) >
 			folio_nr_pages(folio) + folio_has_private(folio) + 1)
-		return 0;
+		goto out;
 	if (!filemap_release_folio(folio, 0))
-		return 0;
+		goto out;
 
-	return remove_mapping(mapping, folio);
+	ret = remove_mapping(mapping, folio);
+out:
+	/*
+	 * Ensure to clean stale tlb entries for this mapping.
+	 */
+	luf_flush(0);
+
+	return ret;
 }
 
 /**
@@ -299,7 +324,7 @@  void truncate_inode_pages_range(struct address_space *mapping,
 	bool		same_folio;
 
 	if (mapping_empty(mapping))
-		return;
+		goto out;
 
 	/*
 	 * 'start' and 'end' always covers the range of pages to be fully
@@ -387,6 +412,12 @@  void truncate_inode_pages_range(struct address_space *mapping,
 		truncate_folio_batch_exceptionals(mapping, &fbatch, indices);
 		folio_batch_release(&fbatch);
 	}
+
+out:
+	/*
+	 * Ensure to clean stale tlb entries for this mapping.
+	 */
+	luf_flush(0);
 }
 EXPORT_SYMBOL(truncate_inode_pages_range);
 
@@ -502,6 +533,11 @@  unsigned long mapping_try_invalidate(struct address_space *mapping,
 		folio_batch_release(&fbatch);
 		cond_resched();
 	}
+
+	/*
+	 * Ensure to clean stale tlb entries for this mapping.
+	 */
+	luf_flush(0);
 	return count;
 }
 
@@ -594,7 +630,7 @@  int invalidate_inode_pages2_range(struct address_space *mapping,
 	int did_range_unmap = 0;
 
 	if (mapping_empty(mapping))
-		return 0;
+		goto out;
 
 	folio_batch_init(&fbatch);
 	index = start;
@@ -664,6 +700,11 @@  int invalidate_inode_pages2_range(struct address_space *mapping,
 	if (dax_mapping(mapping)) {
 		unmap_mapping_pages(mapping, start, end - start + 1, false);
 	}
+out:
+	/*
+	 * Ensure to clean stale tlb entries for this mapping.
+	 */
+	luf_flush(0);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2970a8f35d3d3..ffc4a48710f1d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -821,6 +821,8 @@  static int __remove_mapping(struct address_space *mapping, struct folio *folio,
  */
 long remove_mapping(struct address_space *mapping, struct folio *folio)
 {
+	long ret = 0;
+
 	if (__remove_mapping(mapping, folio, false, NULL)) {
 		/*
 		 * Unfreezing the refcount with 1 effectively
@@ -828,9 +830,15 @@  long remove_mapping(struct address_space *mapping, struct folio *folio)
 		 * atomic operation.
 		 */
 		folio_ref_unfreeze(folio, 1);
-		return folio_nr_pages(folio);
+		ret = folio_nr_pages(folio);
 	}
-	return 0;
+
+	/*
+	 * Ensure to clean stale tlb entries for this mapping.
+	 */
+	luf_flush(0);
+
+	return ret;
 }
 
 /**