From patchwork Tue Nov 24 05:39:42 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Suren Baghdasaryan
X-Patchwork-Id: 11927181
Date: Mon, 23 Nov 2020 21:39:42 -0800
In-Reply-To: <20201124053943.1684874-1-surenb@google.com>
Message-Id: <20201124053943.1684874-2-surenb@google.com>
References: <20201124053943.1684874-1-surenb@google.com>
X-Mailer: git-send-email 2.29.2.454.gaff20da3a2-goog
Subject: [PATCH 1/2] mm/madvise: allow process_madvise operations on entire
 memory range
From: Suren Baghdasaryan
To: surenb@google.com
Cc: akpm@linux-foundation.org, mhocko@kernel.org, mhocko@suse.com,
 rientjes@google.com, willy@infradead.org, hannes@cmpxchg.org, guro@fb.com,
 riel@surriel.com, minchan@kernel.org, christian@brauner.io, oleg@redhat.com,
 timmurray@google.com, linux-api@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, kernel-team@android.com
Sender: owner-linux-mm@kvack.org
Precedence: bulk

process_madvise requires the caller to provide a vector of address ranges
to operate on. When advice should be applied to the entire process, the
caller has to obtain the list of VMAs of the target process by reading
/proc/pid/maps or in some other way. The cost of this operation grows
linearly with the number of VMAs in the target process. Even constructing
the input vector can be non-trivial when the target process has several
thousand VMAs and the syscall is issued during a period of high memory
pressure, when new allocations for such a vector would only worsen the
situation. When advice is applied to the entire address space of the
target process, this is unnecessary overhead.
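
For illustration only (not part of this patch), the per-VMA vector
construction described above might look roughly like the userspace sketch
below. The helper names maps_line_to_iovec() and build_madvise_vec() are
hypothetical, and the sketch assumes the usual "start-end perms ..." layout
of /proc/pid/maps lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: turn one /proc/<pid>/maps line, for example
 * "7f12a000-7f12c000 rw-p ...", into an iovec entry of the kind
 * process_madvise() takes. Returns 0 on success, -1 if the line does
 * not begin with a valid "start-end" hexadecimal address pair.
 */
int maps_line_to_iovec(const char *line, struct iovec *out)
{
	unsigned long start, end;

	assert(line != NULL && out != NULL);

	if (sscanf(line, "%lx-%lx", &start, &end) != 2 || end <= start)
		return -1;

	out->iov_base = (void *)start;
	out->iov_len = end - start;
	return 0;
}

/*
 * Hypothetical helper: fill at most max entries of vec from an open maps
 * stream, one entry per VMA. Returns the number of entries written. The
 * work (and the size of vec) grows linearly with the number of VMAs.
 */
size_t build_madvise_vec(FILE *maps, struct iovec *vec, size_t max)
{
	char line[512];
	size_t n = 0;

	while (n < max && fgets(line, sizeof(line), maps)) {
		if (maps_line_to_iovec(line, &vec[n]) == 0)
			n++;

		/* Discard the tail of an over-long line (long path name). */
		if (!strchr(line, '\n')) {
			int c;

			while ((c = fgetc(maps)) != EOF && c != '\n')
				;
		}
	}
	return n;
}
```

A caller would still have to size and allocate vec before issuing the
syscall, which is the allocation this patch tries to avoid under memory
pressure.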

Add a PMADV_FLAG_RANGE flag for process_madvise that lets the caller
advise a memory range of the target process. For now, to keep it simple,
only the entire process memory range is supported; the vec and vlen inputs
are ignored in this mode and can be NULL and 0.

Instead of returning the number of bytes that advice was successfully
applied to, the syscall returns 0 on success in this mode. The number of
bytes would not be useful to a caller that does not know how much memory
the call is supposed to affect. Besides, the ssize_t return type can be
too small to hold the number of bytes affected when the operation is
applied to a large memory range.

Signed-off-by: Suren Baghdasaryan
---
 arch/alpha/include/uapi/asm/mman.h           |  4 ++
 arch/mips/include/uapi/asm/mman.h            |  4 ++
 arch/parisc/include/uapi/asm/mman.h          |  4 ++
 arch/xtensa/include/uapi/asm/mman.h          |  4 ++
 fs/io_uring.c                                |  2 +-
 include/linux/mm.h                           |  3 +-
 include/uapi/asm-generic/mman-common.h       |  4 ++
 mm/madvise.c                                 | 47 +++++++++++++++++---
 tools/include/uapi/asm-generic/mman-common.h |  4 ++
 9 files changed, 67 insertions(+), 9 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index a18ec7f63888..54588d2f5406 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -79,4 +79,8 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+/* process_madvise flags */
+#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
+#define PMADV_FLAG_MASK		(PMADV_FLAG_RANGE)
+
 #endif /* __ALPHA_MMAN_H__ */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 57dc2ac4f8bd..af94f38a3a9d 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -106,4 +106,8 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+/* process_madvise flags */
+#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
+#define PMADV_FLAG_MASK		(PMADV_FLAG_RANGE)
+
 #endif /* _ASM_MMAN_H */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index ab78cba446ed..ae644c493991 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -77,4 +77,8 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+/* process_madvise flags */
+#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
+#define PMADV_FLAG_MASK		(PMADV_FLAG_RANGE)
+
 #endif /* __PARISC_MMAN_H__ */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index e5e643752947..934cdd11abff 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -114,4 +114,8 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+/* process_madvise flags */
+#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
+#define PMADV_FLAG_MASK		(PMADV_FLAG_RANGE)
+
 #endif /* _XTENSA_MMAN_H */
diff --git a/fs/io_uring.c b/fs/io_uring.c
index a8c136a1cf4e..508c48b998ee 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4118,7 +4118,7 @@ static int io_madvise(struct io_kiocb *req, bool force_nonblock)
 	if (force_nonblock)
 		return -EAGAIN;
 
-	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice);
+	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice, 0);
 	if (ret < 0)
 		req_set_fail_links(req);
 	io_req_complete(req, ret);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..414c7639e394 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2579,7 +2579,8 @@ extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
 		       struct list_head *uf, bool downgrade);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
-extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
+extern int do_madvise(struct mm_struct *mm, unsigned long start, unsigned long len_in,
+		      int behavior, unsigned int flags);
 
 #ifdef CONFIG_MMU
 extern int __mm_populate(unsigned long addr, unsigned long len,
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d429be..4898034593ec 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -80,4 +80,8 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+/* process_madvise flags */
+#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
+#define PMADV_FLAG_MASK		(PMADV_FLAG_RANGE)
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index a8d8d48a57fe..1aa074a46524 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1001,6 +1001,14 @@ process_madvise_behavior_valid(int behavior)
 	}
 }
 
+static bool can_range_madv_lru_vma(struct vm_area_struct *vma, int behavior)
+{
+	if (!can_madv_lru_vma(vma))
+		return false;
+
+	return true;
+}
+
 /*
  * The madvise(2) system call.
  *
@@ -1067,15 +1075,21 @@ process_madvise_behavior_valid(int behavior)
  *  -EBADF  - map exists, but area maps something that isn't a file.
  *  -EAGAIN - a kernel resource was temporarily unavailable.
  */
-int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
+int do_madvise(struct mm_struct *mm, unsigned long start, unsigned long len_in,
+	       int behavior, unsigned int flags)
 {
 	unsigned long end, tmp;
 	struct vm_area_struct *vma, *prev;
 	int unmapped_error = 0;
 	int error = -EINVAL;
+	int error_on_gap;
 	int write;
-	size_t len;
+	unsigned long len;
 	struct blk_plug plug;
+	bool range_madvise = !!(flags & PMADV_FLAG_RANGE);
+
+	/* For range operations gap between VMAs is normal */
+	error_on_gap = range_madvise ? 0 : -ENOMEM;
 
 	start = untagged_addr(start);
 
@@ -1123,13 +1137,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 	blk_start_plug(&plug);
 	for (;;) {
 		/* Still start < end. */
-		error = -ENOMEM;
+		error = error_on_gap;
+
 		if (!vma)
 			goto out;
 
 		/* Here start < (end|vma->vm_end). */
 		if (start < vma->vm_start) {
-			unmapped_error = -ENOMEM;
+			unmapped_error = error_on_gap;
 			start = vma->vm_start;
 			if (start >= end)
 				goto out;
@@ -1140,10 +1155,18 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 		if (end < tmp)
 			tmp = end;
 
+		/* For range operations skip VMAs ineligible for the behavior */
+		if (range_madvise && !can_range_madv_lru_vma(vma, behavior)) {
+			prev = vma;
+			goto skip_vma;
+		}
+
 		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
 		error = madvise_vma(vma, &prev, start, tmp, behavior);
 		if (error)
 			goto out;
+
+skip_vma:
 		start = tmp;
 		if (prev && start < prev->vm_end)
 			start = prev->vm_end;
@@ -1167,7 +1190,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 
 SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
-	return do_madvise(current->mm, start, len_in, behavior);
+	return do_madvise(current->mm, start, len_in, behavior, 0);
 }
 
 SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
@@ -1183,7 +1206,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 	size_t total_len;
 	unsigned int f_flags;
 
-	if (flags != 0) {
+	if (flags & ~PMADV_FLAG_MASK) {
 		ret = -EINVAL;
 		goto out;
 	}
@@ -1216,12 +1239,21 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		goto release_task;
 	}
 
+	/*
+	 * For range madvise only the entire address space is supported for now
+	 * and input iovec is ignored.
+	 */
+	if (flags & PMADV_FLAG_RANGE) {
+		ret = do_madvise(mm, 0, ULONG_MAX & PAGE_MASK, behavior, flags);
+		goto release_mm;
+	}
+
 	total_len = iov_iter_count(&iter);
 
 	while (iov_iter_count(&iter)) {
 		iovec = iov_iter_iovec(&iter);
 		ret = do_madvise(mm, (unsigned long)iovec.iov_base,
-					iovec.iov_len, behavior);
+					iovec.iov_len, behavior, flags);
 		if (ret < 0)
 			break;
 		iov_iter_advance(&iter, iovec.iov_len);
@@ -1230,6 +1262,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 	if (ret == 0)
 		ret = total_len - iov_iter_count(&iter);
 
+release_mm:
 	mmput(mm);
 release_task:
 	put_task_struct(task);
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index f94f65d429be..4898034593ec 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -80,4 +80,8 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+/* process_madvise flags */
+#define PMADV_FLAG_RANGE	0x1	/* advice for all VMAs in the range */
+#define PMADV_FLAG_MASK		(PMADV_FLAG_RANGE)
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */

From patchwork Tue Nov 24 05:39:43 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Suren Baghdasaryan
X-Patchwork-Id: 11927183
Date: Mon, 23 Nov 2020 21:39:43 -0800
In-Reply-To: <20201124053943.1684874-1-surenb@google.com>
Message-Id: <20201124053943.1684874-3-surenb@google.com>
References: <20201124053943.1684874-1-surenb@google.com>
X-Mailer: git-send-email 2.29.2.454.gaff20da3a2-goog
Subject: [PATCH 2/2] mm/madvise: add process_madvise MADV_DONTNEED support
From: Suren Baghdasaryan
To: surenb@google.com
Cc: akpm@linux-foundation.org, mhocko@kernel.org, mhocko@suse.com,
 rientjes@google.com, willy@infradead.org, hannes@cmpxchg.org, guro@fb.com,
 riel@surriel.com, minchan@kernel.org, christian@brauner.io, oleg@redhat.com,
 timmurray@google.com, linux-api@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, kernel-team@android.com
Sender: owner-linux-mm@kvack.org
Precedence: bulk

In modern systems it is not unusual to have a system component that
monitors memory conditions and is tasked with keeping system memory
pressure under control. One way to accomplish that is to kill
non-essential processes to free up memory for more important ones.
Examples are Facebook's OOM killer daemon, oomd, and Android's low memory
killer daemon, lmkd.

For such a component it is important to be able to free memory quickly and
efficiently. Unfortunately, the time a process takes to free its memory
after receiving SIGKILL can vary with the state of the process (for
example, uninterruptible sleep), its size, and the OPP level of the core
it is running on. In such situations it is desirable to be able to free
the memory of the process being killed in a more controlled way.

Enable MADV_DONTNEED to be used with process_madvise when it is applied to
a dying process, so that its memory can be reclaimed. This allows
userspace system components like oomd and lmkd to free the memory of the
target process in a more predictable way.
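
As a rough illustration (not part of this patch), a userspace component
could combine the two patches of this series as sketched below. The
helper name kill_and_reap() is hypothetical, and PMADV_FLAG_RANGE is
defined locally with the value proposed in patch 1/2 because it is an
assumption of this series, not something released kernel headers provide.
A kernel without the series is expected to reject the call with EINVAL.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: value proposed by patch 1/2 of this series. */
#ifndef PMADV_FLAG_RANGE
#define PMADV_FLAG_RANGE 0x1
#endif

/*
 * Hypothetical helper: kill the process referred to by pidfd, then ask
 * the kernel to drop its memory right away instead of waiting for the
 * process to exit. Returns 0 on success, -1 with errno set on failure.
 */
int kill_and_reap(int pidfd)
{
#if defined(SYS_pidfd_send_signal) && defined(SYS_process_madvise)
	/* Destructive advice is only accepted for a dying process. */
	if (syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) < 0)
		return -1;

	/* In range mode vec and vlen are ignored, so pass NULL and 0. */
	if (syscall(SYS_process_madvise, pidfd, NULL, 0UL, MADV_DONTNEED,
		    PMADV_FLAG_RANGE) < 0)
		return -1;

	return 0;
#else
	/* Build headers predate these syscalls. */
	(void)pidfd;
	errno = ENOSYS;
	return -1;
#endif
}
```

The pidfd would typically come from pidfd_open(2). A return value of 0
from the second call only means the request was accepted; as described in
patch 1/2, range mode does not report the number of bytes affected.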

Signed-off-by: Suren Baghdasaryan
---
 mm/madvise.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 1aa074a46524..11306534369e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include
 
 #include
 
@@ -995,6 +996,18 @@ process_madvise_behavior_valid(int behavior)
 	switch (behavior) {
 	case MADV_COLD:
 	case MADV_PAGEOUT:
+	case MADV_DONTNEED:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static bool madvise_destructive(int behavior)
+{
+	switch (behavior) {
+	case MADV_DONTNEED:
+	case MADV_FREE:
 		return true;
 	default:
 		return false;
@@ -1006,6 +1019,10 @@ static bool can_range_madv_lru_vma(struct vm_area_struct *vma, int behavior)
 	if (!can_madv_lru_vma(vma))
 		return false;
 
+	/* For destructive madvise skip shared file-backed VMAs */
+	if (madvise_destructive(behavior))
+		return vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED);
+
 	return true;
 }
 
@@ -1239,6 +1256,23 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		goto release_task;
 	}
 
+	if (madvise_destructive(behavior)) {
+		/* Allow destructive madvise only on a dying process */
+		if (!signal_group_exit(task->signal)) {
+			ret = -EINVAL;
+			goto release_mm;
+		}
+		/* Ensure no competition with OOM-killer to avoid contention */
+		if (unlikely(mm_is_oom_victim(mm)) ||
+		    unlikely(test_bit(MMF_OOM_SKIP, &mm->flags))) {
+			/* Already being reclaimed */
+			ret = 0;
+			goto release_mm;
+		}
+		/* Mark mm as unstable */
+		set_bit(MMF_UNSTABLE, &mm->flags);
+	}
+
 	/*
 	 * For range madvise only the entire address space is supported for now
 	 * and input iovec is ignored.