From patchwork Tue Mar 8 21:34:03 2022
X-Patchwork-Submitter: Zach O'Keefe
X-Patchwork-Id: 12774389
Date: Tue, 8 Mar 2022 13:34:03 -0800
Message-Id: <20220308213417.1407042-1-zokeefe@google.com>
Subject: [RFC PATCH 00/14] mm: userspace hugepage collapse
From: "Zach O'Keefe"
To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
 Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
 linux-mm@kvack.org
Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
 Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
 Ivan Kokshaysky, "James E.J. Bottomley", Jens Axboe, "Kirill A.
 Shutemov", Matthew Wilcox, Matt Turner, Max Filippov, Miaohe Lin,
 Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu, Richard Henderson,
 Thomas Bogendoerfer, Yang Shi, "Zach O'Keefe"

Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

This idea was previously introduced by David Rientjes, and thanks to
everyone for your patience while I prepared these patches resulting from
that discussion[1].

[1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.

(*) process_madvise(2)

        Performs a synchronous collapse of the native pages mapped by
        the list of iovecs into transparent hugepages.

        The default gfp flags used will be the same as those used
        at-fault for the VMA region(s) covered.
        When multiple VMA regions are spanned, if faulting in memory
        from any VMA would permit synchronous compaction and reclaim,
        then all hugepage allocations required to satisfy the request
        may enter compaction and reclaim.

        Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
        by default, as the user is explicitly requesting this action.

        Two flags control collapse semantics; both are passed through
        process_madvise(2)’s optional flags parameter:

        MADV_F_COLLAPSE_LIMITS

                If supplied, collapse respects pte collapse limits set
                via sysfs:
                /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
                Required when calling on behalf of another process
                without CAP_SYS_ADMIN.

        MADV_F_COLLAPSE_DEFRAG

                If supplied, permit synchronous compaction and reclaim,
                regardless of VMA flags.

(*) madvise(2)

        Equivalent to process_madvise(2) on self, with no flags passed;
        pte collapse limits are ignored, and the gfp flags will be the
        same as those used at-fault for the VMA region(s) covered. Note
        that users wanting different collapse semantics can always use
        process_madvise(2) on themselves.

Discussion
--------------------------------

The mechanism is fully compatible with khugepaged, allowing userspace to
separately define synchronous and asynchronous hugepage policies, as
priority dictates. It also naturally permits a DAMON scheme,
DAMOS_COLLAPSE, to make efficient use of the available hugepages on the
system by backing the most frequently accessed memory with hugepages[2].
Though not required to justify this series, hugepage management could be
offloaded entirely to a sufficiently informed userspace agent,
supplanting the need for khugepaged in the kernel.

Along with the interface, this series proposes a batched implementation
to collapse a range of memory. The motivation is to limit contention on
mmap_lock by performing multiple page table modifications each time the
lock is held exclusively.

Only private anonymous memory is supported by this series.
File-backed memory support will be added later. Support for multiple
hugepage sizes (such as 1 GiB gigantic hugepages) was not considered at
this time, but could be supported via the flags parameter in the future.

kselftests were omitted from this series for brevity, but would be
included in an eventual patch submission.

[2] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@google.com/T/

Sequence of Patches
--------------------------------

Patches 1-10 refactor the collapse logic within khugepaged.c,
introducing the notion of a collapse context and isolating logic that
can be reused later in the series for the madvise collapse context.

Patches 11-14 introduce logic for the proposed madvise collapse
mechanism. Patch 11 adds madvise and header file plumbing. Patches 12
and 13 separately add the core collapse logic: the former introduces the
overall batched approach and locking strategy, and the latter fills in
the batch action details. This separation was purely to keep patch size
down. Patch 14 adds process_madvise support.

Applies against next-20220308.

Zach O'Keefe (14):
  mm/rmap: add mm_find_pmd_raw helper
  mm/khugepaged: add struct collapse_control
  mm/khugepaged: add __do_collapse_huge_page() helper
  mm/khugepaged: separate khugepaged_scan_pmd() scan and collapse
  mm/khugepaged: add mmap_assert_locked() checks to scan_pmd()
  mm/khugepaged: add hugepage_vma_revalidate_pmd_count()
  mm/khugepaged: add vm_flags_ignore to
    hugepage_vma_revalidate_pmd_count()
  mm/thp: add madv_thp_vm_flags to __transparent_hugepage_enabled()
  mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP
  mm/khugepaged: rename khugepaged-specific/not functions
  mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  mm/madvise: introduce batched madvise(MADV_COLLPASE) collapse
  mm/madvise: add __madvise_collapse_*_batch() actions.
  mm/madvise: add process_madvise(MADV_COLLAPSE)

 fs/io_uring.c                          |   3 +-
 include/linux/huge_mm.h                |  27 +-
 include/linux/mm.h                     |   3 +-
 include/uapi/asm-generic/mman-common.h |  10 +
 mm/huge_memory.c                       |   2 +-
 mm/internal.h                          |   1 +
 mm/khugepaged.c                        | 937 ++++++++++++++++++++-----
 mm/madvise.c                           |  45 +-
 mm/memory.c                            |   6 +-
 mm/rmap.c                              |  15 +-
 10 files changed, 842 insertions(+), 207 deletions(-)
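
Example (illustrative only)
--------------------------------

As a rough sketch of how the madvise(2) form described under Interface
might be called from userspace, the following maps a hugepage-aligned
private anonymous region and requests a collapse. Several things here
are assumptions rather than part of this series' text: the numeric
value of MADV_COLLAPSE (25 is what later mainline headers use; a kernel
built from this RFC may differ), the 2 MiB PMD hugepage size, and the
helper names map_hpage_aligned() and collapse_range(), which are
hypothetical. The process_madvise(2) flags are not exercised because
their values are not given in this letter.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/*
 * ASSUMPTION: the value is not stated in this cover letter. 25 is the
 * value used by later mainline headers; prefer the installed header's
 * definition when it exists.
 */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* ASSUMPTION: 2 MiB PMD-sized hugepages (e.g. x86_64 with 4 KiB pages). */
#define HPAGE_SIZE ((size_t)2 << 20)

/*
 * Hypothetical helper: map len bytes of private anonymous memory whose
 * start is aligned to HPAGE_SIZE. Returns NULL on failure.
 */
static void *map_hpage_aligned(size_t len)
{
	size_t span = len + HPAGE_SIZE;
	uint8_t *raw, *aligned;

	assert(len != 0 && len % HPAGE_SIZE == 0);

	raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return NULL;

	/* Round the start up to the next hugepage boundary. */
	aligned = (uint8_t *)(((uintptr_t)raw + HPAGE_SIZE - 1) &
			      ~(uintptr_t)(HPAGE_SIZE - 1));

	/* Trim the unaligned head (if any) and the unused tail. */
	if (aligned > raw)
		munmap(raw, (size_t)(aligned - raw));
	munmap(aligned + len, (size_t)((raw + span) - (aligned + len)));

	return aligned;
}

/*
 * Hypothetical helper: request a synchronous collapse of [addr, addr+len).
 * Returns 0 on success or a negative errno value; a kernel without
 * MADV_COLLAPSE support is expected to fail with -EINVAL.
 */
static int collapse_range(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_COLLAPSE) == 0)
		return 0;
	return -errno;
}
```

A caller would typically touch the range first (for instance with
memset()) so that native pages are present, then call
collapse_range(). Failure should be treated as advisory: the memory
stays usable and simply remains backed by native pages.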