From patchwork Tue Mar  8 21:34:03 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Zach O'Keefe <zokeefe@google.com>
X-Patchwork-Id: 12774389
Date: Tue,  8 Mar 2022 13:34:03 -0800
Message-Id: <20220308213417.1407042-1-zokeefe@google.com>
Mime-Version: 1.0
X-Mailer: git-send-email 2.35.1.616.g0bdcbb4464-goog
Subject: [RFC PATCH 00/14] mm: userspace hugepage collapse
From: "Zach O'Keefe" <zokeefe@google.com>
To: Alex Shi <alex.shi@linux.alibaba.com>,
 David Hildenbrand <david@redhat.com>,
	David Rientjes <rientjes@google.com>, Michal Hocko <mhocko@suse.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>, SeongJae Park <sj@kernel.org>,
	Song Liu <songliubraving@fb.com>, Vlastimil Babka <vbabka@suse.cz>,
 Zi Yan <ziy@nvidia.com>,
	linux-mm@kvack.org
Cc: Andrea Arcangeli <aarcange@redhat.com>,
 Andrew Morton <akpm@linux-foundation.org>,
	Arnd Bergmann <arnd@arndb.de>, Axel Rasmussen <axelrasmussen@google.com>,
	Chris Kennelly <ckennelly@google.com>, Chris Zankel <chris@zankel.net>,
 Helge Deller <deller@gmx.de>,
	Hugh Dickins <hughd@google.com>, Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
	"James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>,
 Jens Axboe <axboe@kernel.dk>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
 Matthew Wilcox <willy@infradead.org>,
	Matt Turner <mattst88@gmail.com>, Max Filippov <jcmvbkbc@gmail.com>,
	Miaohe Lin <linmiaohe@huawei.com>, Minchan Kim <minchan@kernel.org>,
	Patrick Xia <patrickx@google.com>, Pavel Begunkov <asml.silence@gmail.com>,
	Peter Xu <peterx@redhat.com>, Richard Henderson <rth@twiddle.net>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
 Yang Shi <shy828301@gmail.com>,
	"Zach O'Keefe" <zokeefe@google.com>
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

David Rientjes previously introduced this idea. Thanks to everyone for
your patience while I prepared the patches that resulted from that
discussion[1].

[1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.
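
As a rough illustration of the madvise(2) side, a caller might do
something like the sketch below. It is not taken from the series. The
fallback value for MADV_COLLAPSE and the 2 MiB hugepage size are
assumptions made for the example (this RFC's uapi header may define a
different value), and collapse_one_hugepage() is a hypothetical helper
name. On a kernel without the feature, the call is expected to fail,
most likely with EINVAL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* ASSUMPTION: placeholder value; this RFC's uapi header may differ. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* ASSUMPTION: 2 MiB PMD-sized hugepages, as is typical on x86-64. */
#define HPAGE_SIZE ((size_t)2 << 20)

/*
 * Hypothetical helper: map, populate, and try to collapse one
 * hugepage-aligned private anonymous region.
 * Returns 0 on success or a negative errno value.
 */
int collapse_one_hugepage(void)
{
	/* Over-allocate so an aligned HPAGE_SIZE window always fits. */
	size_t map_len = 2 * HPAGE_SIZE;
	char *map = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return -errno;

	/* Round up to the next hugepage boundary inside the mapping. */
	uintptr_t start = ((uintptr_t)map + HPAGE_SIZE - 1) &
			  ~((uintptr_t)HPAGE_SIZE - 1);
	char *aligned = (char *)start;
	assert(aligned + HPAGE_SIZE <= map + map_len);

	/* Fault in native pages so there is something to collapse. */
	memset(aligned, 1, HPAGE_SIZE);

	/* Synchronous collapse request, made in the caller's context. */
	int ret = madvise(aligned, HPAGE_SIZE, MADV_COLLAPSE) ? -errno : 0;

	munmap(map, map_len);
	return ret;
}
```

A caller would treat a negative return as "no collapse happened" and
carry on; the memory stays valid and mapped with native pages.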

(*) process_madvise(2)

        Performs a synchronous collapse of the native pages mapped by
        the list of iovecs into transparent hugepages. The default gfp
        flags used will be the same as those used at-fault for the VMA
        region(s) covered. When multiple VMA regions are spanned, if
        faulting-in memory from any VMA would permit synchronous
        compaction and reclaim, then all hugepage allocations required
        to satisfy the request may enter compaction and reclaim.
        Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
        by default, as the user is explicitly requesting this action.
        Two flags control collapse semantics; both are passed through
        process_madvise(2)'s optional flags parameter:

        MADV_F_COLLAPSE_LIMITS

        If supplied, collapse respects pte collapse limits set via
        sysfs:
        /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
        Required when calling on behalf of another process without
        CAP_SYS_ADMIN.

        MADV_F_COLLAPSE_DEFRAG

        If supplied, permit synchronous compaction and reclaim,
        regardless of VMA flags.

(*) madvise(2)

        Equivalent to process_madvise(2) on self, with no flags
        passed; pte collapse limits are ignored, and the gfp flags will
        be the same as those used at-fault for the VMA region(s)
        covered. Users wanting different collapse semantics can always
        use process_madvise(2) on themselves.
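
The flag rules above can be modeled in a few lines of plain C. This is
only a sketch of the described semantics, not code from the series: the
bit values given to the two flags are hypothetical, and the struct and
function names (collapse_request, collapse_policy,
resolve_collapse_policy) are invented for the example. The madvise(2)
case corresponds to target_is_self = true with no flags set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag names come from the cover letter; bit values are HYPOTHETICAL. */
#define MADV_F_COLLAPSE_LIMITS (1u << 0)
#define MADV_F_COLLAPSE_DEFRAG (1u << 1)

struct collapse_request {
	unsigned int flags;       /* MADV_F_COLLAPSE_* bits */
	bool target_is_self;      /* false: acting on another process */
	bool caller_is_sys_admin; /* caller holds CAP_SYS_ADMIN */
};

struct collapse_policy {
	bool respect_pte_limits; /* honor khugepaged max_ptes_* limits */
	bool allow_sync_defrag;  /* may enter compaction and reclaim */
};

/*
 * vma_defrag_at_fault[i] says whether at-fault semantics for the i-th
 * spanned VMA would permit synchronous compaction and reclaim.
 * Returns false if the request is not permitted as described.
 */
bool resolve_collapse_policy(const struct collapse_request *req,
			     const bool *vma_defrag_at_fault, size_t nr_vmas,
			     struct collapse_policy *out)
{
	bool limits = req->flags & MADV_F_COLLAPSE_LIMITS;
	bool defrag = req->flags & MADV_F_COLLAPSE_DEFRAG;

	assert(out != NULL);

	/* Unprivileged callers acting on others must honor the limits. */
	if (!req->target_is_self && !req->caller_is_sys_admin && !limits)
		return false;

	/* One permissive VMA makes every allocation in the request so. */
	for (size_t i = 0; i < nr_vmas && !defrag; i++)
		defrag = vma_defrag_at_fault[i];

	out->respect_pte_limits = limits;
	out->allow_sync_defrag = defrag;
	return true;
}
```

How a rejected request is reported to the caller (for example, which
errno is returned) is not stated above, so the sketch only returns
false.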

Discussion
--------------------------------

The mechanism is fully compatible with khugepaged, allowing userspace to
separately define synchronous and asynchronous hugepage policies, as
priority dictates. It also naturally permits a DAMON scheme,
DAMOS_COLLAPSE, to make efficient use of the available hugepages on the
system by backing the most frequently accessed memory with hugepages[2].
Though not required to justify this series, hugepage management could be
offloaded entirely to a sufficiently informed userspace agent,
supplanting the need for khugepaged in the kernel.

Along with the interface, this series proposes a batched implementation
to collapse a range of memory. The motivation is to limit contention on
mmap_lock by performing multiple page table modifications each time the
lock is held exclusively.
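
To make the batching argument concrete, here is a toy userspace model.
It is not the kernel implementation: toy_lock merely counts exclusive
acquisitions in place of mmap_lock, a byte array stands in for the
page tables, and every name in it is hypothetical. It shows only that
collapsing N regions in batches of B takes ceil(N / B) exclusive lock
acquisitions instead of N.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for mmap_lock that counts exclusive acquisitions. */
struct toy_lock {
	int held;
	unsigned long write_acquisitions;
};

void toy_write_lock(struct toy_lock *lock)
{
	assert(!lock->held);
	lock->held = 1;
	lock->write_acquisitions++;
}

void toy_write_unlock(struct toy_lock *lock)
{
	assert(lock->held);
	lock->held = 0;
}

/*
 * Mark nr_regions hugepage-sized regions as collapsed, taking the
 * exclusive lock once per batch rather than once per region.
 * Returns the number of regions processed.
 */
size_t toy_collapse_range(struct toy_lock *lock, unsigned char *collapsed,
			  size_t nr_regions, size_t batch_size)
{
	size_t done = 0;

	assert(batch_size > 0);

	while (done < nr_regions) {
		size_t left = nr_regions - done;
		size_t n = left < batch_size ? left : batch_size;

		toy_write_lock(lock);
		/* Several "page table modifications" per exclusive hold. */
		for (size_t i = 0; i < n; i++)
			collapsed[done + i] = 1;
		toy_write_unlock(lock);

		done += n;
	}
	return done;
}
```

With 10 regions, a batch size of 4 takes the lock 3 times, while a
batch size of 1 takes it 10 times.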

Only private anonymous memory is supported by this series. File-backed
memory support will be added later.

Support for multiple hugepage sizes (such as 1 GiB gigantic hugepages)
was not considered at this time, but could be added through the flags
parameter in the future.

kselftests were omitted from this series for brevity, but would be
included in an eventual patch submission.

[2] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@google.com/T/

Sequence of Patches
--------------------------------

Patches 1-10 perform refactoring of collapse logic within khugepaged.c:
introducing the notion of a collapse context and isolating logic that
can be reused later in the series for the madvise collapse context.

Patches 11-14 introduce logic for the proposed madvise collapse
mechanism. Patch 11 adds madvise and header file plumbing. Patches 12
and 13 together add the core collapse logic: the former introduces the
overall batched approach and locking strategy, and the latter fills in
the batch action details. This separation is purely to keep patch size
down. Patch 14 adds process_madvise support.

Applies against next-20220308.

Zach O'Keefe (14):
  mm/rmap: add mm_find_pmd_raw helper
  mm/khugepaged: add struct collapse_control
  mm/khugepaged: add __do_collapse_huge_page() helper
  mm/khugepaged: separate khugepaged_scan_pmd() scan and collapse
  mm/khugepaged: add mmap_assert_locked() checks to scan_pmd()
  mm/khugepaged: add hugepage_vma_revalidate_pmd_count()
  mm/khugepaged: add vm_flags_ignore to
    hugepage_vma_revalidate_pmd_count()
  mm/thp: add madv_thp_vm_flags to __transparent_hugepage_enabled()
  mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP
  mm/khugepaged: rename khugepaged-specific/not functions
  mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  mm/madvise: introduce batched madvise(MADV_COLLAPSE) collapse
  mm/madvise: add __madvise_collapse_*_batch() actions.
  mm/madvise: add process_madvise(MADV_COLLAPSE)

 fs/io_uring.c                          |   3 +-
 include/linux/huge_mm.h                |  27 +-
 include/linux/mm.h                     |   3 +-
 include/uapi/asm-generic/mman-common.h |  10 +
 mm/huge_memory.c                       |   2 +-
 mm/internal.h                          |   1 +
 mm/khugepaged.c                        | 937 ++++++++++++++++++++-----
 mm/madvise.c                           |  45 +-
 mm/memory.c                            |   6 +-
 mm/rmap.c                              |  15 +-
 10 files changed, 842 insertions(+), 207 deletions(-)