From: Lance Yang
To: akpm@linux-foundation.org
Cc: zokeefe@google.com, david@redhat.com, songmuchun@bytedance.com, shy828301@gmail.com, peterx@redhat.com, mknyszek@google.com, minchan@kernel.org, mhocko@suse.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Lance Yang
Subject: [PATCH v2 1/1] mm/madvise: add MADV_F_COLLAPSE_LIGHT to process_madvise()
Date: Thu, 18 Jan 2024 20:03:46 +0800
Message-Id: <20240118120347.61817-1-ioworker0@gmail.com>
X-Patchwork-Submitter: Lance Yang
X-Patchwork-Id: 13522765

This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
has CAP_SYS_ADMIN or is requesting the collapse of its own memory. The
semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but it
avoids direct reclaim and/or compaction, failing quickly on allocation
errors.

This change enables more flexible and efficient use of memory collapse
operations, giving userspace applications additional control over
system-wide THP optimization.

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE. If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA are independent
from the others. This implies a hugepage cannot cross a VMA boundary. If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of the memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned. If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range. The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage. Unmapped pages will have their data directly
initialized to 0 in the new hugepage. However, for every eligible
hugepage-aligned/sized region to be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, quickly failing if allocation fails. When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing
the most native pages.

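The range clamping described in the Semantics text can be illustrated with a
small userspace sketch. It is not code from this patch: clamp_to_hugepage()
is a hypothetical helper, and the 2 MiB hugepage size is an assumption that
holds for PMD-sized THPs on x86-64 with 4 KiB base pages but differs on other
configurations. It follows the same round-up/round-down idea as the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed PMD-sized hugepage: 2 MiB (x86-64 with 4 KiB base pages). */
#define HPAGE_SIZE_ASSUMED ((uintptr_t)2 << 20)

/*
 * Hypothetical helper: clamp [start, end) inward to hugepage boundaries.
 * Returns true and stores the clamped range if it covers at least one
 * whole hugepage-sized region, false otherwise.
 */
static bool clamp_to_hugepage(uintptr_t start, uintptr_t end,
			      uintptr_t *hstart, uintptr_t *hend)
{
	const uintptr_t mask = ~(HPAGE_SIZE_ASSUMED - 1);

	/* The rounding below relies on a power-of-two hugepage size. */
	assert((HPAGE_SIZE_ASSUMED & (HPAGE_SIZE_ASSUMED - 1)) == 0);

	*hstart = (start + ~mask) & mask;	/* round start up */
	*hend = end & mask;			/* round end down */

	return *hstart < *hend;
}
```

Under this assumption, a page-aligned request for [0x201000, 0x801000) would
be narrowed to [0x400000, 0x800000), while [0x201000, 0x3ff000) covers no
whole hugepage-sized region and the helper reports false.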

This operation operates on the current state of the specified process and
makes no persistent changes or guarantees on how pages will be mapped,
constructed, or faulted in the future.

Use Cases

An immediate user of this new functionality is the Go runtime heap
allocator that manages memory in hugepage-sized chunks. In the past,
whether it was a newly allocated chunk through mmap() or a reused chunk
released by madvise(MADV_DONTNEED), the allocator attempted to eagerly
back memory with huge pages using madvise(MADV_HUGEPAGE)[2] and
madvise(MADV_COLLAPSE)[3] respectively. However, both approaches resulted
in performance issues; for both scenarios, there could be entries into
direct reclaim and/or compaction, leading to unpredictable stalls[4]. Now,
the allocator can confidently use process_madvise(MADV_F_COLLAPSE_LIGHT)
to attempt the allocation of huge pages.

[1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
[2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
[3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
[4] https://github.com/golang/go/issues/63334

[v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/

Signed-off-by: Lance Yang
Suggested-by: Zach O'Keefe
Suggested-by: David Hildenbrand
---
V1 -> V2: Treat process_madvise(MADV_F_COLLAPSE_LIGHT) as the lighter-weight
alternative to madvise(MADV_COLLAPSE)

 arch/alpha/include/uapi/asm/mman.h           |  1 +
 arch/mips/include/uapi/asm/mman.h            |  1 +
 arch/parisc/include/uapi/asm/mman.h          |  1 +
 arch/xtensa/include/uapi/asm/mman.h          |  1 +
 include/linux/huge_mm.h                      |  5 +--
 include/uapi/asm-generic/mman-common.h       |  1 +
 mm/khugepaged.c                              | 15 ++++++--
 mm/madvise.c                                 | 36 +++++++++++++++++---
 tools/include/uapi/asm-generic/mman-common.h |  1 +
 9 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 763929e814e9..22f23ca04f1a 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -77,6 +77,7 @@
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index c6e1fc77c996..acec0b643e9c 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -104,6 +104,7 @@
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 68c44f99bc93..812029c98cd7 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -71,6 +71,7 @@
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 #define MADV_HWPOISON     100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1ff0c858544f..52ef463dd5b6 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -112,6 +112,7 @@
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5adb86af35fc..075fdb5d481a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -303,7 +303,7 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
 int madvise_collapse(struct vm_area_struct *vma,
 		     struct vm_area_struct **prev,
-		     unsigned long start, unsigned long end);
+		     unsigned long start, unsigned long end, int behavior);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
 
 static inline int madvise_collapse(struct vm_area_struct *vma,
 				   struct vm_area_struct **prev,
-				   unsigned long start, unsigned long end)
+				   unsigned long start, unsigned long end,
+				   int behavior)
 {
 	return -EINVAL;
 }
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..92c67bc755da 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -78,6 +78,7 @@
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2b219acb528e..2840051c0ae2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -97,6 +97,8 @@ static struct kmem_cache *mm_slot_cache __ro_after_init;
 struct collapse_control {
 	bool is_khugepaged;
 
+	int behavior;
+
 	/* Num pages scanned per node */
 	u32 node_load[MAX_NUMNODES];
 
@@ -1058,10 +1060,16 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
 			      struct collapse_control *cc)
 {
-	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
-		     GFP_TRANSHUGE);
 	int node = hpage_collapse_find_target_node(cc);
 	struct folio *folio;
+	gfp_t gfp;
+
+	if (cc->is_khugepaged)
+		gfp = alloc_hugepage_khugepaged_gfpmask();
+	else
+		gfp = (cc->behavior == MADV_F_COLLAPSE_LIGHT ?
+			GFP_TRANSHUGE_LIGHT :
+			GFP_TRANSHUGE);
 
 	if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) {
 		*hpage = NULL;
@@ -2697,7 +2705,7 @@ static int madvise_collapse_errno(enum scan_result r)
 }
 
 int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
-		     unsigned long start, unsigned long end)
+		     unsigned long start, unsigned long end, int behavior)
 {
 	struct collapse_control *cc;
 	struct mm_struct *mm = vma->vm_mm;
@@ -2718,6 +2726,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	if (!cc)
 		return -ENOMEM;
 	cc->is_khugepaged = false;
+	cc->behavior = behavior;
 
 	mmgrab(mm);
 	lru_add_drain_all();
diff --git a/mm/madvise.c b/mm/madvise.c
index 912155a94ed5..9c40226505aa 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
 	case MADV_COLLAPSE:
+	case MADV_F_COLLAPSE_LIGHT:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -1082,8 +1083,9 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_F_COLLAPSE_LIGHT:
 	case MADV_COLLAPSE:
-		return madvise_collapse(vma, prev, start, end);
+		return madvise_collapse(vma, prev, start, end, behavior);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1178,6 +1180,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
 	case MADV_COLLAPSE:
+	case MADV_F_COLLAPSE_LIGHT:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1194,6 +1197,17 @@ madvise_behavior_valid(int behavior)
 	}
 }
 
+
+static bool process_madvise_behavior_only(int behavior)
+{
+	switch (behavior) {
+	case MADV_F_COLLAPSE_LIGHT:
+		return true;
+	default:
+		return false;
+	}
+}
+
 static bool process_madvise_behavior_valid(int behavior)
 {
 	switch (behavior) {
@@ -1201,6 +1215,7 @@ static bool process_madvise_behavior_valid(int behavior)
 	case MADV_PAGEOUT:
 	case MADV_WILLNEED:
 	case MADV_COLLAPSE:
+	case MADV_F_COLLAPSE_LIGHT:
 		return true;
 	default:
 		return false;
@@ -1368,6 +1383,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
  *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
+ *  MADV_F_COLLAPSE_LIGHT - only for process_madvise, avoids direct reclaim and/or
+ *		compaction.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
@@ -1394,7 +1411,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  -EBADF  - map exists, but area maps something that isn't a file.
  *  -EAGAIN - a kernel resource was temporarily unavailable.
  */
-int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
+int _do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
+		int behavior, bool is_process_madvise)
 {
 	unsigned long end;
 	int error;
@@ -1405,6 +1423,9 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 	if (!madvise_behavior_valid(behavior))
 		return -EINVAL;
 
+	if (!is_process_madvise && process_madvise_behavior_only(behavior))
+		return -EINVAL;
+
 	if (!PAGE_ALIGNED(start))
 		return -EINVAL;
 	len = PAGE_ALIGN(len_in);
@@ -1448,9 +1469,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 	return error;
 }
 
+int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
+{
+	return _do_madvise(mm, start, len_in, behavior, false);
+}
+
 SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
-	return do_madvise(current->mm, start, len_in, behavior);
+	return _do_madvise(current->mm, start, len_in, behavior, false);
 }
 
 SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
@@ -1504,8 +1530,8 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 	total_len = iov_iter_count(&iter);
 
 	while (iov_iter_count(&iter)) {
-		ret = do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
-					iter_iov_len(&iter), behavior);
+		ret = _do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
+					iter_iov_len(&iter), behavior, true);
 		if (ret < 0)
 			break;
 		iov_iter_advance(&iter, iter_iov_len(&iter));
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..92c67bc755da 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -78,6 +78,7 @@
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+#define MADV_F_COLLAPSE_LIGHT	26	/* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
 
 /* compatibility flags */
 #define MAP_FILE	0
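
For illustration, a userspace caller could request the lightweight collapse
on its own address space roughly as follows. This is a sketch under stated
assumptions, not part of the patch: collapse_light_self() is a hypothetical
helper, MADV_F_COLLAPSE_LIGHT is defined locally with the value 26 proposed
above (it is not in released UAPI headers), and the fallback syscall numbers
are the asm-generic ones, which do not apply to every architecture. On a
kernel without this patch the call should simply fail with EINVAL.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Value proposed by this patch; an assumption outside a patched tree. */
#ifndef MADV_F_COLLAPSE_LIGHT
#define MADV_F_COLLAPSE_LIGHT 26
#endif

/* Fallbacks for older headers (asm-generic numbers; assumption). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif

/*
 * Hypothetical helper: ask the kernel to collapse [addr, addr + len) of the
 * calling process without entering direct reclaim or compaction.
 * Returns the number of bytes advised, or a negative errno value.
 */
static long collapse_light_self(void *addr, size_t len)
{
	struct iovec iov = { .iov_base = addr, .iov_len = len };
	long ret;
	int pidfd;

	/* process_madvise() identifies the target by pidfd, even for self. */
	pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);
	if (pidfd < 0)
		return -errno;

	ret = syscall(SYS_process_madvise, pidfd, &iov, 1UL,
		      MADV_F_COLLAPSE_LIGHT, 0U);
	if (ret < 0)
		ret = -errno;	/* capture errno before close() can change it */

	close(pidfd);
	return ret;
}
```

An allocator using such a helper would likely treat any negative return as
"no huge page right now" and carry on with base pages, since the point of
the light variant is to fail fast rather than stall.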