From patchwork Tue Mar 8 21:34:14 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Zach O'Keefe
X-Patchwork-Id: 12774400
Date: Tue, 8 Mar 2022 13:34:14 -0800
In-Reply-To: <20220308213417.1407042-1-zokeefe@google.com>
Message-Id: <20220308213417.1407042-12-zokeefe@google.com>
References: <20220308213417.1407042-1-zokeefe@google.com>
X-Mailer: git-send-email 2.35.1.616.g0bdcbb4464-goog
Subject: [RFC PATCH 11/14] mm/madvise: introduce MADV_COLLAPSE sync hugepage
 collapse
From: "Zach O'Keefe"
To: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
 Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
 linux-mm@kvack.org
Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
 Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
 Ivan Kokshaysky, "James E.J. Bottomley", Jens Axboe,
 "Kirill A. Shutemov", Matthew Wilcox, Matt Turner, Max Filippov,
 Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
 Richard Henderson, Thomas Bogendoerfer, Yang Shi, "Zach O'Keefe"

The idea of hugepage collapse in process context was previously
introduced by David Rientjes to linux-mm[1].

The idea is to introduce a new madvise mode, MADV_COLLAPSE, that allows
users to request a synchronous collapse of memory.

The benefits of this approach are:

* CPU time is charged to the process that wants to spend the cycles for
  the THP
* avoids the unpredictable timing of khugepaged collapse
* flexible separation of sync userspace and async khugepaged THP
  collapse policies

Immediate users of this new functionality include:

* malloc implementations that manage memory in hugepage-sized chunks,
  but sometimes subrelease memory back to the system in native-sized
  chunks via MADV_DONTNEED, zapping the pmd.  Later, when the memory
  is hot, the implementation could madvise(MADV_COLLAPSE) to re-back the
  memory by THP to regain TLB performance.
* immediately backing executable text by hugepages.  Current support
  provided by CONFIG_READ_ONLY_THP_FOR_FS may take too long on a large
  system.
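
As a rough illustration of the malloc use case above, the userspace
sketch below populates one hugepage-sized chunk, subreleases a single
native page with MADV_DONTNEED, touches it again, and then requests a
synchronous collapse.  Everything here is an assumption made for the
example rather than part of this patch: the helper names
map_aligned_chunk() and subrelease_then_collapse() are hypothetical,
THP_SIZE assumes a 2 MiB PMD-sized hugepage, and the fallback value 25
for MADV_COLLAPSE is taken from the asm-generic header change in this
patch.  With only this patch applied the handler is still stubbed out,
so the final madvise() call is expected to fail; kernels that do not
know the mode should reject it as well.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumption: value from this patch's uapi change */
#endif

#define THP_SIZE (2UL << 20)	/* assumption: 2 MiB PMD-sized THP */

/* Hypothetical helper: map one THP-aligned, THP-sized private anon chunk. */
static char *map_aligned_chunk(void)
{
	size_t len = 2 * THP_SIZE;	/* over-map so an aligned chunk must fit */
	char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return NULL;

	uintptr_t aligned = ((uintptr_t)raw + THP_SIZE - 1) & ~(THP_SIZE - 1);
	char *chunk = (char *)aligned;
	char *raw_end = raw + len;

	/* Trim the unaligned head and the unused tail. */
	if (chunk > raw)
		munmap(raw, (size_t)(chunk - raw));
	if (chunk + THP_SIZE < raw_end)
		munmap(chunk + THP_SIZE, (size_t)(raw_end - (chunk + THP_SIZE)));
	return chunk;
}

/*
 * Hypothetical helper: subrelease one native page, fault it back in, then
 * ask for a synchronous collapse.  Returns 0 on success or -errno.
 */
static int subrelease_then_collapse(char *chunk)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	memset(chunk, 0xAB, THP_SIZE);			/* populate the chunk */
	if (madvise(chunk + page, page, MADV_DONTNEED))	/* subrelease one page */
		return -errno;
	chunk[page] = 0x5A;				/* memory is hot again */
	if (madvise(chunk, THP_SIZE, MADV_COLLAPSE))	/* request sync collapse */
		return -errno;
	return 0;
}
```

A caller would treat a negative return as "keep using native pages" and
carry on, since the data in the chunk is untouched either way.
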
To keep patches digestible, introduce MADV_COLLAPSE in a few stages.

Add plumbing to existing madvise infrastructure, as well as populate
uapi header files, leaving the actual madvise(MADV_COLLAPSE) handler
stubbed out.  Only privately-mapped anon memory is supported for now.

[1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/

Signed-off-by: Zach O'Keefe
---
 include/linux/huge_mm.h                | 12 +++++++
 include/uapi/asm-generic/mman-common.h |  2 ++
 mm/khugepaged.c                        | 46 ++++++++++++++++++++++++++
 mm/madvise.c                           |  5 +++
 4 files changed, 65 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fd905b0b2c71..407b63ab4185 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -226,6 +226,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 
 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
+int madvise_collapse(struct vm_area_struct *vma,
+		     struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -383,6 +386,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
 	BUG();
 	return 0;
 }
+
+static inline int madvise_collapse(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev,
+				   unsigned long start, unsigned long end)
+{
+	BUG();
+	return 0;
+}
+
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6c1aa92a92e4..6ce1f1ceb432 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -77,6 +77,8 @@
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 12ae765c5c32..ca1e523086ed 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2519,3 +2519,49 @@ void khugepaged_min_free_kbytes_update(void)
 		set_recommended_min_free_kbytes();
 	mutex_unlock(&khugepaged_mutex);
 }
+
+/*
+ * Returns 0 if successfully able to collapse range into THPs (or range already
+ * backed by THPs). Due to implementation detail, THPs collapsed here may be
+ * split again before this function returns.
+ */
+static int _madvise_collapse(struct mm_struct *mm,
+			     struct vm_area_struct *vma,
+			     struct vm_area_struct **prev,
+			     unsigned long start,
+			     unsigned long end, gfp_t gfp,
+			     struct collapse_control *cc)
+{
+	/* Implemented in later patch */
+	return -ENOSYS;
+}
+
+int madvise_collapse(struct vm_area_struct *vma,
+		     struct vm_area_struct **prev, unsigned long start,
+		     unsigned long end)
+{
+	struct collapse_control cc;
+	gfp_t gfp;
+	int error;
+	struct mm_struct *mm = vma->vm_mm;
+
+	/* Requested to hold mmap_lock in read */
+	mmap_assert_locked(mm);
+
+	mmgrab(mm);
+	collapse_control_init(&cc, /* enforce_pte_scan_limits= */ false);
+	gfp = vma_thp_gfp_mask(vma);
+	lru_add_drain(); /* lru_add_drain_all() too heavy here */
+	error = _madvise_collapse(mm, vma, prev, start, end, gfp, &cc);
+	mmap_assert_locked(mm);
+	mmdrop(mm);
+
+	/*
+	 * madvise() returns EAGAIN if kernel resources are temporarily
+	 * unavailable.
+	 */
+	if (error == -ENOMEM)
+		error = -EAGAIN;
+
+	return error;
+}
diff --git a/mm/madvise.c b/mm/madvise.c
index 5b6d796e55de..292aa017c150 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -58,6 +58,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_FREE:
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
+	case MADV_COLLAPSE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -1046,6 +1047,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_COLLAPSE:
+		return madvise_collapse(vma, prev, start, end);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1139,6 +1142,7 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
+	case MADV_COLLAPSE:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1328,6 +1332,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
+ *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
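
Since madvise_collapse() reports -ENOMEM to userspace as EAGAIN, and
the stubbed handler in this patch returns -ENOSYS, a caller may want to
tell transient failures apart from an unsupported mode.  The sketch
below is one possible userspace policy, not something this series
defines: the enum, classify_collapse_errno() and collapse_with_retry()
are hypothetical names, the fallback value 25 for MADV_COLLAPSE comes
from this patch, and treating EINVAL as "unsupported" is an assumption
(EINVAL can also indicate bad arguments).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumption: value from this patch's uapi change */
#endif

/* Hypothetical outcome categories; not part of any kernel or libc API. */
enum collapse_outcome {
	COLLAPSE_DONE,		/* madvise() reported success */
	COLLAPSE_TRANSIENT,	/* EAGAIN: resources temporarily unavailable */
	COLLAPSE_UNSUPPORTED,	/* EINVAL or ENOSYS: mode unknown or stubbed out */
	COLLAPSE_ERROR,		/* any other errno */
};

/* Map an errno value from madvise(MADV_COLLAPSE) to a retry decision. */
static enum collapse_outcome classify_collapse_errno(int err)
{
	switch (err) {
	case 0:
		return COLLAPSE_DONE;
	case EAGAIN:
		return COLLAPSE_TRANSIENT;
	case EINVAL:	/* assumption: treated as "kernel lacks the mode" */
	case ENOSYS:
		return COLLAPSE_UNSUPPORTED;
	default:
		return COLLAPSE_ERROR;
	}
}

/*
 * Retry the collapse while the failure looks transient, up to max_attempts
 * calls.  With max_attempts <= 0 no call is made and COLLAPSE_TRANSIENT is
 * returned.
 */
static enum collapse_outcome collapse_with_retry(void *addr, size_t len,
						 int max_attempts)
{
	enum collapse_outcome out = COLLAPSE_TRANSIENT;

	for (int i = 0; i < max_attempts && out == COLLAPSE_TRANSIENT; i++) {
		int err = madvise(addr, len, MADV_COLLAPSE) ? errno : 0;

		out = classify_collapse_errno(err);
	}
	return out;
}
```

An allocator could call collapse_with_retry() with a small attempt
budget and stop issuing the advice for the rest of the process
lifetime once it sees COLLAPSE_UNSUPPORTED.
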