From patchwork Mon Aug 17 14:08:24 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 11718359 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8E92914F6 for ; Mon, 17 Aug 2020 14:11:18 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 69EFF20729 for ; Mon, 17 Aug 2020 14:11:18 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="cYeHhJ8Y" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729008AbgHQOLQ (ORCPT ); Mon, 17 Aug 2020 10:11:16 -0400 Received: from us-smtp-delivery-124.mimecast.com ([63.128.21.124]:36454 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728902AbgHQOJ5 (ORCPT ); Mon, 17 Aug 2020 10:09:57 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1597673394; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:in-reply-to:in-reply-to:references:references; bh=q6IDd4y0Wmc1pB+WUrbCZSOFrVKno+kl6q8vuBHV27s=; b=cYeHhJ8YhRsPrjHH4i39oSlN7/EIANNNP21jQaKfNiRfhyhltbGdqX49KFUI10hOn8UC0L AWEpiDr+uJjbaLDw5Z1M3YCQ+zWcSG2be+szy7WOn3ItQi/AMSanqGlctXoriLM9fA7vGu dNLAiEG9YaBcAamq1a0OmQaT+zu++y0= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-224-_fp1mGjUP7-SFPCxlAKsFQ-1; Mon, 17 Aug 2020 10:09:52 -0400 X-MC-Unique: _fp1mGjUP7-SFPCxlAKsFQ-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 
BB74F81F004; Mon, 17 Aug 2020 14:09:49 +0000 (UTC) Received: from llong.com (ovpn-118-35.rdu2.redhat.com [10.10.118.35]) by smtp.corp.redhat.com (Postfix) with ESMTP id CAF9F21E9E; Mon, 17 Aug 2020 14:09:47 +0000 (UTC) From: Waiman Long To: Andrew Morton , Johannes Weiner , Michal Hocko , Vladimir Davydov , Jonathan Corbet , Alexey Dobriyan , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, Waiman Long Subject: [RFC PATCH 1/8] memcg: Enable fine-grained control of over memory.high action Date: Mon, 17 Aug 2020 10:08:24 -0400 Message-Id: <20200817140831.30260-2-longman@redhat.com> In-Reply-To: <20200817140831.30260-1-longman@redhat.com> References: <20200817140831.30260-1-longman@redhat.com> X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

The memory controller can be used to control and limit the amount of physical memory used by a task. When a limit is set in "memory.high" in a non-root memory cgroup, the memory controller will try to reclaim memory once the limit has been exceeded. Normally, that is enough to keep the physical memory consumption of tasks in the memory cgroup around or below the "memory.high" limit.

Sometimes, memory reclaim may not be able to recover memory at a rate that keeps up with the physical memory allocation rate, especially when rotating disks are used for swapping or writing dirty pages. In this case, physical memory consumption will keep increasing. When it reaches "memory.max" or the system is really running out of memory, the OOM killer will be invoked to kill some tasks to free up additional memory. However, one has little control over which tasks are going to be killed by the OOM killer.
Users who do not want the OOM killer to be invoked to kill random tasks in an out-of-memory situation need a better way to manage memory and deal with applications that are out of control in terms of physical memory consumption rate.

A new set of prctl(2) commands is added to allow users to manage the physical memory consumption of each of their applications and to control the mitigation action taken when an application consumes more physical memory than it is supposed to use. The new prctl(2) commands are PR_SET_MEMCONTROL and PR_GET_MEMCONTROL, which set and retrieve the memory control parameters respectively.

The four possible mitigation actions for a task that exceeds its designated memory limit are:

 1) Return ENOMEM for some syscalls that allocate or handle memory
 2) Slow down the process for memory reclaim to catch up
 3) Send a specific signal to the task
 4) Kill the task

The parameters that can be specified in the new PR_SET_MEMCONTROL command are:

 arg2 - the mitigation action (bits 0-7), signal number (bits 8-15) and flags (bits 16-31).
 arg3 - the additional memory limit (in bytes) that will be added to memory.high as the real limit that will trigger the mitigation action.

The PR_MEMFLAG_SIGCONT flag is used to specify continuous signal delivery instead of a one-shot signal.
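The arg2 layout described above can be sketched in userspace C as follows. This is only an illustration, not part of the patch: the constants are copied from the uapi hunk of this RFC and redefined locally because they are not available in mainline headers, and memctl_action_word() is a hypothetical helper name.

```c
/*
 * Constants copied from the include/uapi/linux/prctl.h hunk of this RFC.
 * They are redefined here for illustration only; a kernel without this
 * series does not provide them.
 */
#define PR_MEMACT_NONE      0
#define PR_MEMACT_ENOMEM    1   /* Deny memory request */
#define PR_MEMACT_SLOWDOWN  2   /* Slow down the process */
#define PR_MEMACT_SIGNAL    3   /* Send signal */
#define PR_MEMACT_KILL      4   /* Kill the process */

#define PR_MEMFLAG_SIGCONT  (1UL << 0)  /* Continuous signal delivery */

#define PR_MEMACT_MASK      0xff
#define PR_MEMACT_SIG_SHIFT 8
#define PR_MEMACT_FLG_SHIFT 16

/*
 * Hypothetical helper (not in the patch): pack the action code (bits 0-7),
 * the signal number (bits 8-15) and the flags (bits 16-31) into the action
 * word passed as arg2 of PR_SET_MEMCONTROL.
 */
static unsigned long memctl_action_word(unsigned int action,
                                        unsigned int sig,
                                        unsigned long flags)
{
    return (unsigned long)(action & PR_MEMACT_MASK) |
           ((unsigned long)(sig & PR_MEMACT_MASK) << PR_MEMACT_SIG_SHIFT) |
           ((flags & 0xffffUL) << PR_MEMACT_FLG_SHIFT);
}
```

On a kernel carrying this series, the resulting word would be passed as prctl(PR_SET_MEMCONTROL, word, limit_in_bytes, 0, 0); the handler in this patch returns -EINVAL when arg4 or arg5 is non-zero.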
Signed-off-by: Waiman Long --- include/linux/memcontrol.h | 4 ++ include/linux/sched.h | 7 +++ include/uapi/linux/prctl.h | 37 ++++++++++++ kernel/sys.c | 16 ++++++ mm/memcontrol.c | 114 +++++++++++++++++++++++++++++++++++++ 5 files changed, 178 insertions(+) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d0b036123c6a..40e6ceb8209b 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -445,6 +445,10 @@ void mem_cgroup_uncharge_list(struct list_head *page_list); void mem_cgroup_migrate(struct page *oldpage, struct page *newpage); +long mem_cgroup_over_high_get(struct task_struct *task, unsigned long item); +long mem_cgroup_over_high_set(struct task_struct *task, unsigned long action, + unsigned long limit); + static struct mem_cgroup_per_node * mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid) { diff --git a/include/linux/sched.h b/include/linux/sched.h index 93ecd930efd3..c79d606d27ab 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1265,6 +1265,13 @@ struct task_struct { /* Number of pages to reclaim on returning to userland: */ unsigned int memcg_nr_pages_over_high; + /* Memory over-high action, flags, signal and limit */ + unsigned char memcg_over_high_action; + unsigned char memcg_over_high_signal; + unsigned short memcg_over_high_flags; + unsigned int memcg_over_high_climit; + unsigned int memcg_over_limit; + /* Used by memcontrol for targeted memcg charge: */ struct mem_cgroup *active_memcg; #endif diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 07b4f8131e36..87970ae7b32c 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -238,4 +238,41 @@ struct prctl_mm_map { #define PR_SET_IO_FLUSHER 57 #define PR_GET_IO_FLUSHER 58 +/* Per task fine-grained memory cgroup control */ +#define PR_GET_MEMCONTROL 59 +#define PR_SET_MEMCONTROL 60 + +/* + * PR_SET_MEMCONTROL: + * 2 parameters are passed: + * - Action word + * - Memory cgroup additional 
memory limit + * + * The action word consists of 3 bit fields: + * - Bits 0-7 : over-memory-limit action code + * - Bits 8-15: signal number + * - Bits 16-31: action flags + */ + +/* Control values for PR_SET_MEMCONTROL over limit action */ +# define PR_MEMACT_NONE 0 +# define PR_MEMACT_ENOMEM 1 /* Deny memory request */ +# define PR_MEMACT_SLOWDOWN 2 /* Slow down the process */ +# define PR_MEMACT_SIGNAL 3 /* Send signal */ +# define PR_MEMACT_KILL 4 /* Kill the process */ +# define PR_MEMACT_MAX PR_MEMACT_KILL + +/* Flags for PR_SET_MEMCONTROL */ +# define PR_MEMFLAG_SIGCONT (1UL << 0) /* Continuous signal delivery */ +# define PR_MEMFLAG_MASK PR_MEMFLAG_SIGCONT + +/* Action word masks */ +# define PR_MEMACT_MASK 0xff +# define PR_MEMACT_SIG_SHIFT 8 +# define PR_MEMACT_FLG_SHIFT 16 + +/* Return specified value for PR_GET_MEMCONTROL */ +# define PR_MEMGET_ACTION 0 +# define PR_MEMGET_CLIMIT 1 + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/sys.c b/kernel/sys.c index ca11af9d815d..644b86235d7f 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -64,6 +64,10 @@ #include +#ifdef CONFIG_MEMCG +#include +#endif + #include /* Move somewhere else to avoid recompiling? 
*/ #include @@ -2530,6 +2534,18 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER; break; +#ifdef CONFIG_MEMCG + case PR_GET_MEMCONTROL: + if (arg3 || arg4 || arg5) + return -EINVAL; + error = mem_cgroup_over_high_get(me, arg2); + break; + case PR_SET_MEMCONTROL: + if (arg4 || arg5) + return -EINVAL; + error = mem_cgroup_over_high_set(me, arg2, arg3); + break; +#endif default: error = -EINVAL; break; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b807952b4d43..1106dac024ac 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -59,6 +59,7 @@ #include #include #include +#include #include "internal.h" #include #include @@ -2628,6 +2629,71 @@ void mem_cgroup_handle_over_high(void) css_put(&memcg->css); } +/* + * Task specific action when over the high limit. + * Return true if an action has been taken or further check is not needed, + * false otherwise. + */ +static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action) +{ + unsigned long mem; + bool ret = false; + struct mm_struct *mm = get_task_mm(current); + u8 signal = READ_ONCE(current->memcg_over_high_signal); + u16 flags = READ_ONCE(current->memcg_over_high_flags); + u32 limit = READ_ONCE(current->memcg_over_high_climit); + + if (!mm) + return true; /* No more check is needed */ + + current->memcg_over_limit = false; + if ((action == PR_MEMACT_SIGNAL) && !signal) + goto out; + + mem = page_counter_read(&memcg->memory); + if (mem <= memcg->memory.high + limit) + goto out; + + ret = true; + switch (action) { + case PR_MEMACT_ENOMEM: + WRITE_ONCE(current->memcg_over_limit, true); + break; + case PR_MEMACT_SLOWDOWN: + /* Slow down by yielding the cpu */ + set_tsk_need_resched(current); + set_preempt_need_resched(); + break; + case PR_MEMACT_KILL: + signal = SIGKILL; + fallthrough; + case PR_MEMACT_SIGNAL: + force_sig(signal); + + /* Deliver signal only once if !PR_MEMFLAG_SIGCONT */ + if (!(flags & 
PR_MEMFLAG_SIGCONT)) + WRITE_ONCE(current->memcg_over_high_signal, 0); + break; + } + +out: + mmput(mm); + return ret; +} + +/* + * Return true if an action has been taken or further check is not needed, + * false otherwise. + */ +static inline bool mem_cgroup_over_high_action(struct mem_cgroup *memcg) +{ + u8 action = READ_ONCE(current->memcg_over_high_action); + + if (!action) + return true; /* No more check is needed */ + return __mem_cgroup_over_high_action(memcg, action); +} + static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned int nr_pages) { @@ -2639,6 +2705,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned long nr_reclaimed; bool may_swap = true; bool drained = false; + bool taken = false; unsigned long pflags; if (mem_cgroup_is_root(memcg)) @@ -2797,6 +2864,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, swap_high = page_counter_read(&memcg->swap) > READ_ONCE(memcg->swap.high); + if (mem_high && !taken) + taken = mem_cgroup_over_high_action(memcg); + /* Don't bother a random interrupted task */ if (in_interrupt()) { if (mem_high) { @@ -6959,6 +7029,50 @@ void mem_cgroup_sk_free(struct sock *sk) css_put(&sk->sk_memcg->css); } +/* + * Get and set cgroup memory-over-high attributes. 
+ */ +long mem_cgroup_over_high_get(struct task_struct *task, unsigned long item) +{ + switch (item) { + case PR_MEMGET_ACTION: + return task->memcg_over_high_action | + (task->memcg_over_high_signal << PR_MEMACT_SIG_SHIFT) | + (task->memcg_over_high_flags << PR_MEMACT_FLG_SHIFT); + + case PR_MEMGET_CLIMIT: + return (long)task->memcg_over_high_climit * PAGE_SIZE; + } + return -EINVAL; +} + +long mem_cgroup_over_high_set(struct task_struct *task, unsigned long action, + unsigned long limit) +{ + unsigned char cmd = action & PR_MEMACT_MASK; + unsigned char sig = (action >> PR_MEMACT_SIG_SHIFT) & PR_MEMACT_MASK; + unsigned short flags = action >> PR_MEMACT_FLG_SHIFT; + + if ((cmd > PR_MEMACT_MAX) || (flags & ~PR_MEMFLAG_MASK) || + (sig >= _NSIG)) + return -EINVAL; + + WRITE_ONCE(task->memcg_over_high_action, cmd); + WRITE_ONCE(task->memcg_over_high_signal, sig); + WRITE_ONCE(task->memcg_over_high_flags, flags); + + if (cmd == PR_MEMACT_NONE) { + WRITE_ONCE(task->memcg_over_high_climit, 0); + } else { + /* + * Convert limits to # of pages + */ + limit = DIV_ROUND_UP(limit, PAGE_SIZE); + WRITE_ONCE(task->memcg_over_high_climit, limit); + } + return 0; +} + /** * mem_cgroup_charge_skmem - charge socket memory * @memcg: memcg to charge From patchwork Mon Aug 17 14:08:25 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 11718331 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 642A415E6 for ; Mon, 17 Aug 2020 14:10:01 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4666B20825 for ; Mon, 17 Aug 2020 14:10:01 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="ff5CXecg" Received: (majordomo@vger.kernel.org) by 
vger.kernel.org via listexpand id S1728928AbgHQOKA (ORCPT ); Mon, 17 Aug 2020 10:10:00 -0400 Received: from us-smtp-delivery-1.mimecast.com ([205.139.110.120]:50597 "EHLO us-smtp-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728910AbgHQOJ5 (ORCPT ); Mon, 17 Aug 2020 10:09:57 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1597673395; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:in-reply-to:in-reply-to:references:references; bh=hIFyaPZvMHjVeqORMTUHU+96lxgfjim2o5HDKXDYjqs=; b=ff5CXecgNivyae+CAR5/pzC8I/ZQ9R3YG5Dsk+5DaVvg64S0SD3B8oBJ/Z3m01M02Qewgn Owg7gwouElHi2sRsVgxjKGF0FH/DZ/v+jV+zHvEr5ITGbhaOdmNPnM7ZBgjZOvWxFLufzw vIm2culDjJ6ZGB1FZenQSUqXXapPUJ4= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-246-WvjdVa5iN6mxQ3yFLskmxw-1; Mon, 17 Aug 2020 10:09:53 -0400 X-MC-Unique: WvjdVa5iN6mxQ3yFLskmxw-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id B503A801AC3; Mon, 17 Aug 2020 14:09:51 +0000 (UTC) Received: from llong.com (ovpn-118-35.rdu2.redhat.com [10.10.118.35]) by smtp.corp.redhat.com (Postfix) with ESMTP id E6E9A21E90; Mon, 17 Aug 2020 14:09:49 +0000 (UTC) From: Waiman Long To: Andrew Morton , Johannes Weiner , Michal Hocko , Vladimir Davydov , Jonathan Corbet , Alexey Dobriyan , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, Waiman Long Subject: [RFC PATCH 2/8] memcg, mm: Return ENOMEM or delay if memcg_over_limit Date: Mon, 17 Aug 2020 10:08:25 -0400 Message-Id: <20200817140831.30260-3-longman@redhat.com> In-Reply-To: 
<20200817140831.30260-1-longman@redhat.com> References: <20200817140831.30260-1-longman@redhat.com> X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org The brk(), mmap(), mlock(), mlockall() and mprotect() syscalls are modified to check the memcg_over_limit flag and return ENOMEM when it is set and memory control action is PR_MEMACT_ENOMEM. In case the action is PR_MEMACT_SLOWDOWN, an artificial delay of 20ms will be added to slow down the memory allocation syscalls. Signed-off-by: Waiman Long --- include/linux/sched.h | 16 ++++++++++++++++ kernel/fork.c | 1 + mm/memcontrol.c | 25 +++++++++++++++++++++++-- mm/mlock.c | 6 ++++++ mm/mmap.c | 12 ++++++++++++ mm/mprotect.c | 3 +++ 6 files changed, 61 insertions(+), 2 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index c79d606d27ab..9ec1bd072334 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1477,6 +1477,22 @@ static inline char task_state_to_char(struct task_struct *tsk) return task_index_to_char(task_state_index(tsk)); } +#ifdef CONFIG_MEMCG +extern bool mem_cgroup_check_over_limit(void); + +static inline bool mem_over_memcg_limit(void) +{ + if (READ_ONCE(current->memcg_over_limit)) + return mem_cgroup_check_over_limit(); + return false; +} +#else +static inline bool mem_over_memcg_limit(void) +{ + return false; +} +#endif + /** * is_global_init - check if a task structure is init. Since init * is free to have sub-threads we need to check tgid. 
diff --git a/kernel/fork.c b/kernel/fork.c index 4d32190861bd..61f9a9e5f857 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -940,6 +940,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) #ifdef CONFIG_MEMCG tsk->active_memcg = NULL; + tsk->memcg_over_limit = false; #endif return tsk; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1106dac024ac..5cad7bb26d13 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2646,7 +2646,9 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action) if (!mm) return true; /* No more check is needed */ - current->memcg_over_limit = false; + if (READ_ONCE(current->memcg_over_limit)) + WRITE_ONCE(current->memcg_over_limit, false); + if ((action == PR_MEMACT_SIGNAL) && !signal) goto out; @@ -2660,7 +2662,11 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action) WRITE_ONCE(current->memcg_over_limit, true); break; case PR_MEMACT_SLOWDOWN: - /* Slow down by yielding the cpu */ + /* + * Slow down by yielding the cpu & adding delay to + * memory allocation syscalls. + */ + WRITE_ONCE(current->memcg_over_limit, true); set_tsk_need_resched(current); set_preempt_need_resched(); break; @@ -2694,6 +2700,21 @@ static inline bool mem_cgroup_over_high_action(struct mem_cgroup *memcg) return __mem_cgroup_over_high_action(memcg, action); } +/* + * Called from memory allocation syscalls. + * Return true if ENOMEM should be returned, false otherwise. 
+ */ +bool mem_cgroup_check_over_limit(void) +{ + u8 action = READ_ONCE(current->memcg_over_high_action); + + if (action == PR_MEMACT_ENOMEM) + return true; + if (action == PR_MEMACT_SLOWDOWN) + msleep(20); /* Artificial delay of 20ms */ + return false; +} + static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned int nr_pages) { diff --git a/mm/mlock.c b/mm/mlock.c index 93ca2bf30b4f..130d4b3fa0f5 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -678,6 +678,9 @@ static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t fla if (!can_do_mlock()) return -EPERM; + if (mem_over_memcg_limit()) + return -ENOMEM; + len = PAGE_ALIGN(len + (offset_in_page(start))); start &= PAGE_MASK; @@ -807,6 +810,9 @@ SYSCALL_DEFINE1(mlockall, int, flags) if (!can_do_mlock()) return -EPERM; + if (mem_over_memcg_limit()) + return -ENOMEM; + lock_limit = rlimit(RLIMIT_MEMLOCK); lock_limit >>= PAGE_SHIFT; diff --git a/mm/mmap.c b/mm/mmap.c index 40248d84ad5f..873ccf2560a6 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -198,6 +198,10 @@ SYSCALL_DEFINE1(brk, unsigned long, brk) bool downgraded = false; LIST_HEAD(uf); + /* Too much memory used? */ + if (mem_over_memcg_limit()) + return -ENOMEM; + if (mmap_write_lock_killable(mm)) return -EINTR; @@ -1407,6 +1411,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr, if (mm->map_count > sysctl_max_map_count) return -ENOMEM; + /* Too much memory used? */ + if (mem_over_memcg_limit()) + return -ENOMEM; + /* Obtain the address to map to. we verify (or select) it and ensure * that it represents a valid section of the address space. */ @@ -1557,6 +1565,10 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len, struct file *file = NULL; unsigned long retval; + /* Too much memory used? 
*/ + if (mem_over_memcg_limit()) + return -ENOMEM; + if (!(flags & MAP_ANONYMOUS)) { audit_mmap_fd(fd, flags); file = fget(fd); diff --git a/mm/mprotect.c b/mm/mprotect.c index ce8b8a5eacbb..b2c0f50bb0a0 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -519,6 +519,9 @@ static int do_mprotect_pkey(unsigned long start, size_t len, const bool rier = (current->personality & READ_IMPLIES_EXEC) && (prot & PROT_READ); + if (mem_over_memcg_limit()) + return -ENOMEM; + start = untagged_addr(start); prot &= ~(PROT_GROWSDOWN|PROT_GROWSUP); From patchwork Mon Aug 17 14:08:26 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 11718333 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 043F814F6 for ; Mon, 17 Aug 2020 14:10:06 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D222820748 for ; Mon, 17 Aug 2020 14:10:05 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="LKwFOckC" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728899AbgHQOKE (ORCPT ); Mon, 17 Aug 2020 10:10:04 -0400 Received: from us-smtp-delivery-1.mimecast.com ([207.211.31.120]:42654 "EHLO us-smtp-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728930AbgHQOKC (ORCPT ); Mon, 17 Aug 2020 10:10:02 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1597673399; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:in-reply-to:in-reply-to:references:references; bh=BkHtKKuY3b6GbgQQuDgz6i1v5ZDG8cyA2U/RRbTyPVc=; b=LKwFOckCMrQkGgjJ4ba/cVuje05HGNboCl2WMZUmy2V7q+5eUxX/Iw5tS8fT11KYQhbPFr 
0sUu6yYbBnlLAgchNobPaSBpNHB2lMdNIa1oWZJnBuaFo8zTbvr9BTaWityeLbQzQjsMNI A0HUv5PSoFVFu0fNuE1ZAjKHy+uvQXE= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-513-aHd89MD1M8ipxY_XcD-u6A-1; Mon, 17 Aug 2020 10:09:55 -0400 X-MC-Unique: aHd89MD1M8ipxY_XcD-u6A-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 9DB6D801AC9; Mon, 17 Aug 2020 14:09:53 +0000 (UTC) Received: from llong.com (ovpn-118-35.rdu2.redhat.com [10.10.118.35]) by smtp.corp.redhat.com (Postfix) with ESMTP id DF0AC21E90; Mon, 17 Aug 2020 14:09:51 +0000 (UTC) From: Waiman Long To: Andrew Morton , Johannes Weiner , Michal Hocko , Vladimir Davydov , Jonathan Corbet , Alexey Dobriyan , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, Waiman Long Subject: [RFC PATCH 3/8] memcg: Allow the use of task RSS memory as over-high action trigger Date: Mon, 17 Aug 2020 10:08:26 -0400 Message-Id: <20200817140831.30260-4-longman@redhat.com> In-Reply-To: <20200817140831.30260-1-longman@redhat.com> References: <20200817140831.30260-1-longman@redhat.com> X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

The total memory consumption of a task as tracked by the memory cgroup includes different types of memory such as page cache, anonymous memory, shared memory and kernel memory. In a memory cgroup with multiple running tasks, using the total memory consumption of all the tasks within the cgroup as the action trigger may not be fair to tasks that don't contribute to the excessive memory usage.
Page cache memory can typically be shared between multiple tasks. It is also not easy to pin kernel memory usage to a specific task. That leaves a task's anonymous (RSS) memory usage as best proxy for a task's contribution to total memory consumption within the memory cgroup. So a new set of PR_MEMFLAG_RSS_* flags are added to enable the checking of a task's real RSS memory footprint as a trigger to over-high action provided that the total memory consumption of the cgroup has exceeded memory.high + the additional memcg memory limit. Signed-off-by: Waiman Long --- include/linux/memcontrol.h | 2 +- include/linux/sched.h | 3 ++- include/uapi/linux/prctl.h | 14 +++++++++++--- kernel/sys.c | 4 ++-- mm/memcontrol.c | 32 ++++++++++++++++++++++++++++++-- 5 files changed, 46 insertions(+), 9 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 40e6ceb8209b..562958cf79d8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -447,7 +447,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage); long mem_cgroup_over_high_get(struct task_struct *task, unsigned long item); long mem_cgroup_over_high_set(struct task_struct *task, unsigned long action, - unsigned long limit); + unsigned long limit, unsigned long limit2); static struct mem_cgroup_per_node * mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid) diff --git a/include/linux/sched.h b/include/linux/sched.h index 9ec1bd072334..a1e9ac8b9b16 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1265,11 +1265,12 @@ struct task_struct { /* Number of pages to reclaim on returning to userland: */ unsigned int memcg_nr_pages_over_high; - /* Memory over-high action, flags, signal and limit */ + /* Memory over-high action, flags, signal and limits */ unsigned char memcg_over_high_action; unsigned char memcg_over_high_signal; unsigned short memcg_over_high_flags; unsigned int memcg_over_high_climit; + unsigned int memcg_over_high_plimit; unsigned 
int memcg_over_limit; /* Used by memcontrol for targeted memcg charge: */ diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 87970ae7b32c..ef8d84c94b4a 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -244,9 +244,10 @@ struct prctl_mm_map { /* * PR_SET_MEMCONTROL: - * 2 parameters are passed: + * 3 parameters are passed: * - Action word * - Memory cgroup additional memory limit + * - Flag specific memory limit * * The action word consists of 3 bit fields: * - Bits 0-7 : over-memory-limit action code @@ -263,8 +264,14 @@ struct prctl_mm_map { # define PR_MEMACT_MAX PR_MEMACT_KILL /* Flags for PR_SET_MEMCONTROL */ -# define PR_MEMFLAG_SIGCONT (1UL << 0) /* Continuous signal delivery */ -# define PR_MEMFLAG_MASK PR_MEMFLAG_SIGCONT +# define PR_MEMFLAG_SIGCONT (1UL << 0) /* Continuous signal delivery */ +# define PR_MEMFLAG_RSS_ANON (1UL << 8) /* Check anonymous pages */ +# define PR_MEMFLAG_RSS_FILE (1UL << 9) /* Check file pages */ +# define PR_MEMFLAG_RSS_SHMEM (1UL << 10) /* Check shmem pages */ +# define PR_MEMFLAG_RSS (PR_MEMFLAG_RSS_ANON |\ + PR_MEMFLAG_RSS_FILE |\ + PR_MEMFLAG_RSS_SHMEM) +# define PR_MEMFLAG_MASK (PR_MEMFLAG_SIGCONT | PR_MEMFLAG_RSS) /* Action word masks */ # define PR_MEMACT_MASK 0xff @@ -274,5 +281,6 @@ struct prctl_mm_map { /* Return specified value for PR_GET_MEMCONTROL */ # define PR_MEMGET_ACTION 0 # define PR_MEMGET_CLIMIT 1 +# define PR_MEMGET_PLIMIT 2 #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/sys.c b/kernel/sys.c index 644b86235d7f..272f82227c2d 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2541,9 +2541,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, error = mem_cgroup_over_high_get(me, arg2); break; case PR_SET_MEMCONTROL: - if (arg4 || arg5) + if (arg5) return -EINVAL; - error = mem_cgroup_over_high_set(me, arg2, arg3); + error = mem_cgroup_over_high_set(me, arg2, arg3, arg4); break; #endif default: diff --git a/mm/memcontrol.c 
b/mm/memcontrol.c index 5cad7bb26d13..aa76bae7f408 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2629,6 +2629,12 @@ void mem_cgroup_handle_over_high(void) css_put(&memcg->css); } +static inline unsigned long +get_rss_counter(struct mm_struct *mm, int mm_bit, u16 flags, int rss_bit) +{ + return (flags & rss_bit) ? get_mm_counter(mm, mm_bit) : 0; +} + /* * Task specific action when over the high limit. * Return true if an action has been taken or further check is not needed, @@ -2656,6 +2662,22 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action) if (mem <= memcg->memory.high + limit) goto out; + /* + * Check RSS memory if any of the PR_MEMFLAG_RSS flags is set. + */ + if (flags & PR_MEMFLAG_RSS) { + mem = get_rss_counter(mm, MM_ANONPAGES, flags, + PR_MEMFLAG_RSS_ANON) + + get_rss_counter(mm, MM_FILEPAGES, flags, + PR_MEMFLAG_RSS_FILE) + + get_rss_counter(mm, MM_SHMEMPAGES, flags, + PR_MEMFLAG_RSS_SHMEM); + + limit = READ_ONCE(current->memcg_over_high_plimit); + if (mem <= limit) + goto out; + } + ret = true; switch (action) { case PR_MEMACT_ENOMEM: @@ -7063,12 +7085,15 @@ long mem_cgroup_over_high_get(struct task_struct *task, unsigned long item) case PR_MEMGET_CLIMIT: return (long)task->memcg_over_high_climit * PAGE_SIZE; + + case PR_MEMGET_PLIMIT: + return (long)task->memcg_over_high_plimit * PAGE_SIZE; } return -EINVAL; } long mem_cgroup_over_high_set(struct task_struct *task, unsigned long action, - unsigned long limit) + unsigned long limit, unsigned long limit2) { unsigned char cmd = action & PR_MEMACT_MASK; unsigned char sig = (action >> PR_MEMACT_SIG_SHIFT) & PR_MEMACT_MASK; @@ -7084,12 +7109,15 @@ long mem_cgroup_over_high_set(struct task_struct *task, unsigned long action, if (cmd == PR_MEMACT_NONE) { WRITE_ONCE(task->memcg_over_high_climit, 0); + WRITE_ONCE(task->memcg_over_high_plimit, 0); } else { /* * Convert limits to # of pages */ - limit = DIV_ROUND_UP(limit, PAGE_SIZE); + limit = DIV_ROUND_UP(limit, PAGE_SIZE); 
+ limit2 = DIV_ROUND_UP(limit2, PAGE_SIZE); WRITE_ONCE(task->memcg_over_high_climit, limit); + WRITE_ONCE(task->memcg_over_high_plimit, limit2); } return 0; } From patchwork Mon Aug 17 14:08:27 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 11718357 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1341815E6 for ; Mon, 17 Aug 2020 14:11:04 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id EF1F720789 for ; Mon, 17 Aug 2020 14:11:03 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="YLRM2emL" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729019AbgHQOLD (ORCPT ); Mon, 17 Aug 2020 10:11:03 -0400 Received: from us-smtp-delivery-1.mimecast.com ([207.211.31.120]:58490 "EHLO us-smtp-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728937AbgHQOKD (ORCPT ); Mon, 17 Aug 2020 10:10:03 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1597673401; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:in-reply-to:in-reply-to:references:references; bh=7rpL6/EB1+EwHbaZQMk8wUHt9s1YSLWhla0Ecu2Wfro=; b=YLRM2emLhkeuPyZaZpzgWutiU7YAvydmU+SLvuMY7yqEDsBe8DCZpGbWRMpwD53nzV77Ym yj4wGvj603Ce6chiguYvtNT9A4+1cX4mxNwpCZ3w+xXY3JmsMznt5H7Hu30R8SGNpeMml7 sR4zc1Av9tRp8UqDFJ+n1X7dPj1n3mU= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-179-7Pue4gy4ORSIsEQu0cavkw-1; Mon, 17 Aug 2020 10:09:57 -0400 X-MC-Unique: 7Pue4gy4ORSIsEQu0cavkw-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com 
[10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 8C45A425D4; Mon, 17 Aug 2020 14:09:55 +0000 (UTC) Received: from llong.com (ovpn-118-35.rdu2.redhat.com [10.10.118.35]) by smtp.corp.redhat.com (Postfix) with ESMTP id C792421E90; Mon, 17 Aug 2020 14:09:53 +0000 (UTC) From: Waiman Long To: Andrew Morton , Johannes Weiner , Michal Hocko , Vladimir Davydov , Jonathan Corbet , Alexey Dobriyan , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, Waiman Long Subject: [RFC PATCH 4/8] fs/proc: Support a new procfs memctl file Date: Mon, 17 Aug 2020 10:08:27 -0400 Message-Id: <20200817140831.30260-5-longman@redhat.com> In-Reply-To: <20200817140831.30260-1-longman@redhat.com> References: <20200817140831.30260-1-longman@redhat.com> X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org To allow system administrators to view and modify the over-high action settings of a running application, a new /proc/<pid>/memctl file is added. Reading the file shows the over-high action parameters; writing to it modifies them.
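
As a rough illustration of how a monitoring tool might consume this file, the sketch below parses one line in the "action climit plimit" format printed by proc_memctl_read() and splits the action word into its parts. The names memctl_settings, memctl_parse_line and the MEMCTL_* macros are hypothetical. The shift values 8 and 16 are an assumption taken from the bit layout described in the documentation patch (8/8), because PR_MEMACT_SIG_SHIFT and PR_MEMACT_FLG_SHIFT are not defined in this part of the series.

```c
#include <stdio.h>

/*
 * Assumed action-word layout (from the documentation in patch 8/8):
 * bits 0-7 command, bits 8-15 signal number, bits 16-31 flags.
 * The MEMCTL_* names are illustrative and not part of the series.
 */
#define MEMCTL_BYTE_MASK	0xffUL
#define MEMCTL_SIG_SHIFT	8
#define MEMCTL_FLG_SHIFT	16

struct memctl_settings {
	unsigned int cmd;	/* PR_MEMACT_* command */
	unsigned int sig;	/* signal number, 0 if none */
	unsigned int flags;	/* PR_MEMFLAG_* bits */
	unsigned long climit;	/* additional cgroup limit, in bytes */
	unsigned long plimit;	/* per-process RSS limit, in bytes */
};

/*
 * Parse one "action climit plimit" line as read from the memctl file.
 * Returns 0 on success, -1 if the line does not hold three numbers.
 */
int memctl_parse_line(const char *line, struct memctl_settings *out)
{
	unsigned long action, climit, plimit;

	if (sscanf(line, "%lu %lu %lu", &action, &climit, &plimit) != 3)
		return -1;

	out->cmd = (unsigned int)(action & MEMCTL_BYTE_MASK);
	out->sig = (unsigned int)((action >> MEMCTL_SIG_SHIFT) & MEMCTL_BYTE_MASK);
	out->flags = (unsigned int)((action >> MEMCTL_FLG_SHIFT) & 0xffffUL);
	out->climit = climit;
	out->plimit = plimit;
	return 0;
}
```

A caller would read the file into a buffer with fgets() and pass that buffer to memctl_parse_line(); writing the three numbers back in the same order is what proc_memctl_write() expects.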
Signed-off-by: Waiman Long --- fs/proc/base.c | 109 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 109 insertions(+) diff --git a/fs/proc/base.c b/fs/proc/base.c index 617db4e0faa0..3c9349ad1e37 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -88,6 +88,8 @@ #include #include #include +#include +#include #include #include #include @@ -3145,6 +3147,107 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns, } #endif /* CONFIG_STACKLEAK_METRICS */ +#ifdef CONFIG_MEMCG +/* + * Memory cgroup control parameters + * + */ +static ssize_t proc_memctl_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + struct task_struct *task = get_proc_task(file_inode(file)); + unsigned long action, limit1, limit2; + char buffer[80]; + ssize_t len; + + if (!task) + return -ESRCH; + + action = task->memcg_over_high_action | + (task->memcg_over_high_signal << PR_MEMACT_SIG_SHIFT) | + (task->memcg_over_high_flags << PR_MEMACT_FLG_SHIFT); + limit1 = (unsigned long)task->memcg_over_high_climit * PAGE_SIZE; + limit2 = (unsigned long)task->memcg_over_high_plimit * PAGE_SIZE; + + put_task_struct(task); + len = snprintf(buffer, sizeof(buffer), "%ld %ld %ld\n", + action, limit1, limit2); + return simple_read_from_buffer(buf, count, ppos, buffer, len); +} + +static ssize_t proc_memctl_write(struct file *file, const char __user *buf, + size_t count, loff_t *offs) +{ + struct task_struct *task = get_proc_task(file_inode(file)); + unsigned long vals[3]; + char buffer[80]; + char *ptr, *next; + int i, err; + unsigned int action, signal, flags; + + if (!task) + return -ESRCH; + if (count > sizeof(buffer) - 1) + count = sizeof(buffer) - 1; + if (copy_from_user(buffer, buf, count)) { + err = -EFAULT; + goto out; + } + buffer[count] = '\0'; + next = buffer; + + /* + * Expect to find 3 numbers + */ + for (i = 0, ptr = buffer; i < 3; i++) { + ptr = skip_spaces(next); + if (!*ptr) { + err = -EINVAL; + goto out; + } + + /* Skip non-space 
characters for next */ + for (next = ptr; *next && !isspace(*next); next++) + ; + if (isspace(*next)) + *next++ = '\0'; + + err = kstrtoul(ptr, 0, &vals[i]); + if (err) + goto out; + } + action = vals[0] & PR_MEMACT_MASK; + signal = (vals[0] >> PR_MEMACT_SIG_SHIFT) & PR_MEMACT_MASK; + flags = vals[0] >> PR_MEMACT_FLG_SHIFT; + + /* Round up limits to number of pages */ + vals[1] = DIV_ROUND_UP(vals[1], PAGE_SIZE); + vals[2] = DIV_ROUND_UP(vals[2], PAGE_SIZE); + + /* Check input values */ + if ((action > PR_MEMACT_MAX) || (signal >= _NSIG) || + (flags & ~PR_MEMFLAG_MASK)) { + err = -EINVAL; + goto out; + } + + WRITE_ONCE(task->memcg_over_high_action, action); + WRITE_ONCE(task->memcg_over_high_signal, signal); + WRITE_ONCE(task->memcg_over_high_flags, flags); + WRITE_ONCE(task->memcg_over_high_climit, vals[1]); + WRITE_ONCE(task->memcg_over_high_plimit, vals[2]); +out: + put_task_struct(task); + return err < 0 ? err : count; +} + +const struct file_operations proc_memctl_operations = { + .read = proc_memctl_read, + .write = proc_memctl_write, + .llseek = generic_file_llseek, +}; +#endif /* CONFIG_MEMCG */ + /* * Thread groups */ @@ -3258,6 +3361,9 @@ static const struct pid_entry tgid_base_stuff[] = { #ifdef CONFIG_PROC_PID_ARCH_STATUS ONE("arch_status", S_IRUGO, proc_pid_arch_status), #endif +#ifdef CONFIG_MEMCG + REG("memctl", 0644, proc_memctl_operations), +#endif }; static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) @@ -3587,6 +3693,9 @@ static const struct pid_entry tid_base_stuff[] = { #ifdef CONFIG_PROC_PID_ARCH_STATUS ONE("arch_status", S_IRUGO, proc_pid_arch_status), #endif +#ifdef CONFIG_MEMCG + REG("memctl", 0644, proc_memctl_operations), +#endif }; static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx) From patchwork Mon Aug 17 14:08:28 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 11718351 Return-Path: 
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8B45414F6 for ; Mon, 17 Aug 2020 14:10:55 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7397B20789 for ; Mon, 17 Aug 2020 14:10:55 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="JUaRpiYy" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729015AbgHQOKm (ORCPT ); Mon, 17 Aug 2020 10:10:42 -0400 Received: from us-smtp-1.mimecast.com ([205.139.110.61]:27065 "EHLO us-smtp-delivery-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728951AbgHQOKH (ORCPT ); Mon, 17 Aug 2020 10:10:07 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1597673405; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:in-reply-to:in-reply-to:references:references; bh=gLC3iaZ7OOAbkDpDBx+l3EQ6N48p5E+2TCrsju7XHF8=; b=JUaRpiYyo88URG+gr38AHwmk+Eyft7WgCQ6safkVBh+kwqxiTA/IFVIh/1CbbRm0YVf3rh /bWU1XqqNypDlgHN/ayFHAGbBYuluvcCCM38hh1ECF/7uMN7ZUfH64sdWpz1fK4OKcdPFr 0qz/reny9xHWapcDlvf24QF32hVDiTg= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-466-fyIpzfgoMP2ZcJRcTktqDA-1; Mon, 17 Aug 2020 10:09:59 -0400 X-MC-Unique: fyIpzfgoMP2ZcJRcTktqDA-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 846CA801AD9; Mon, 17 Aug 2020 14:09:57 +0000 (UTC) Received: from llong.com (ovpn-118-35.rdu2.redhat.com [10.10.118.35]) by smtp.corp.redhat.com (Postfix) with ESMTP id B8EA821E8F; Mon, 17 Aug 2020 14:09:55 +0000 
(UTC) From: Waiman Long To: Andrew Morton , Johannes Weiner , Michal Hocko , Vladimir Davydov , Jonathan Corbet , Alexey Dobriyan , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, Waiman Long Subject: [RFC PATCH 5/8] memcg: Allow direct per-task memory limit checking Date: Mon, 17 Aug 2020 10:08:28 -0400 Message-Id: <20200817140831.30260-6-longman@redhat.com> In-Reply-To: <20200817140831.30260-1-longman@redhat.com> References: <20200817140831.30260-1-longman@redhat.com> X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Up to now, the PR_SET_MEMCONTROL prctl(2) call enables the user-specified action only if the total memory consumption in the memory cgroup exceeds memory.high by the additional memory threshold specified. There are cases where a user may want direct memory consumption control for certain applications even if the total cgroup memory consumption has not exceeded the limit yet. One way of doing that is to create one memory cgroup per application. However, if an application calls other helper applications, these helper applications will fall into the same cgroup, breaking the one-application-per-cgroup rule. Another alternative is to let the user enable direct per-task memory limit checking, which is what this patch does. This is intended for special use cases and is not recommended for general use, as memory reclaim may not be triggered even if the per-task memory limit has been exceeded.
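
The constraint this patch adds to mem_cgroup_over_high_set() (PR_MEMFLAG_DIRECT needs at least one PR_MEMFLAG_RSS* flag and a non-zero per-process limit) could be mirrored in userspace before making the call. A minimal sketch, assuming the flag values from the prctl.h hunk of this patch; the helper name memctl_direct_request_ok() is made up for illustration and does not exist in the series.

```c
#include <stdbool.h>

/* Flag values copied from the include/uapi/linux/prctl.h hunk. */
#define PR_MEMFLAG_DIRECT	(1UL << 1)
#define PR_MEMFLAG_RSS_ANON	(1UL << 8)
#define PR_MEMFLAG_RSS_FILE	(1UL << 9)
#define PR_MEMFLAG_RSS_SHMEM	(1UL << 10)
#define PR_MEMFLAG_RSS		(PR_MEMFLAG_RSS_ANON | \
				 PR_MEMFLAG_RSS_FILE | \
				 PR_MEMFLAG_RSS_SHMEM)

/*
 * Hypothetical pre-check: returns false for the combinations that the
 * kernel side of this patch rejects with -EINVAL.
 */
bool memctl_direct_request_ok(unsigned long flags, unsigned long plimit_bytes)
{
	if (!(flags & PR_MEMFLAG_DIRECT))
		return true;	/* rule only applies to direct checking */

	return (flags & PR_MEMFLAG_RSS) && plimit_bytes != 0;
}
```

Checking this up front only saves a round trip; the kernel check remains the authority.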
Signed-off-by: Waiman Long --- include/uapi/linux/prctl.h | 4 ++- mm/memcontrol.c | 52 +++++++++++++++++++++++++++----------- 2 files changed, 40 insertions(+), 16 deletions(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index ef8d84c94b4a..7ba40e10737d 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -265,13 +265,15 @@ struct prctl_mm_map { /* Flags for PR_SET_MEMCONTROL */ # define PR_MEMFLAG_SIGCONT (1UL << 0) /* Continuous signal delivery */ +# define PR_MEMFLAG_DIRECT (1UL << 1) /* Direct memory limit */ # define PR_MEMFLAG_RSS_ANON (1UL << 8) /* Check anonymous pages */ # define PR_MEMFLAG_RSS_FILE (1UL << 9) /* Check file pages */ # define PR_MEMFLAG_RSS_SHMEM (1UL << 10) /* Check shmem pages */ # define PR_MEMFLAG_RSS (PR_MEMFLAG_RSS_ANON |\ PR_MEMFLAG_RSS_FILE |\ PR_MEMFLAG_RSS_SHMEM) -# define PR_MEMFLAG_MASK (PR_MEMFLAG_SIGCONT | PR_MEMFLAG_RSS) +# define PR_MEMFLAG_MASK (PR_MEMFLAG_SIGCONT | PR_MEMFLAG_RSS |\ + PR_MEMFLAG_DIRECT) /* Action word masks */ # define PR_MEMACT_MASK 0xff diff --git a/mm/memcontrol.c b/mm/memcontrol.c index aa76bae7f408..6488f8a10d66 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2640,27 +2640,27 @@ get_rss_counter(struct mm_struct *mm, int mm_bit, u16 flags, int rss_bit) * Return true if an action has been taken or further check is not needed, * false otherwise. 
*/ -static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action) +static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action, + u16 flags) { - unsigned long mem; + unsigned long mem = 0; bool ret = false; struct mm_struct *mm = get_task_mm(current); u8 signal = READ_ONCE(current->memcg_over_high_signal); - u16 flags = READ_ONCE(current->memcg_over_high_flags); - u32 limit = READ_ONCE(current->memcg_over_high_climit); + u32 limit; if (!mm) return true; /* No more check is needed */ - if (READ_ONCE(current->memcg_over_limit)) - WRITE_ONCE(current->memcg_over_limit, false); - if ((action == PR_MEMACT_SIGNAL) && !signal) goto out; - mem = page_counter_read(&memcg->memory); - if (mem <= memcg->memory.high + limit) - goto out; + if (memcg) { + mem = page_counter_read(&memcg->memory); + limit = READ_ONCE(current->memcg_over_high_climit); + if (mem <= memcg->memory.high + limit) + goto out; + } /* * Check RSS memory if any of the PR_MEMFLAG_RSS flags is set. @@ -2706,20 +2706,34 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action) out: mmput(mm); - return ret; + /* + * We only need to do direct per-task memory limit checking once. + */ + return memcg ? ret : true; } /* * Return true if an action has been taken or further check is not needed, * false otherwise. 
*/ -static inline bool mem_cgroup_over_high_action(struct mem_cgroup *memcg) +static inline bool mem_cgroup_over_high_action(struct mem_cgroup *memcg, + bool mem_high) { u8 action = READ_ONCE(current->memcg_over_high_action); + u16 flags = READ_ONCE(current->memcg_over_high_flags); if (!action) return true; /* No more check is needed */ - return __mem_cgroup_over_high_action(memcg, action); + + if (READ_ONCE(current->memcg_over_limit)) + WRITE_ONCE(current->memcg_over_limit, false); + + if (flags & PR_MEMFLAG_DIRECT) + memcg = NULL; /* Direct per-task memory limit checking */ + else if (!mem_high) + return false; + + return __mem_cgroup_over_high_action(memcg, action, flags); } /* @@ -2907,8 +2921,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, swap_high = page_counter_read(&memcg->swap) > READ_ONCE(memcg->swap.high); - if (mem_high && !taken) - taken = mem_cgroup_over_high_action(memcg); + if (!taken) + taken = mem_cgroup_over_high_action(memcg, mem_high); /* Don't bother a random interrupted task */ if (in_interrupt()) { @@ -7103,6 +7117,14 @@ long mem_cgroup_over_high_set(struct task_struct *task, unsigned long action, (sig >= _NSIG)) return -EINVAL; + /* + * PR_MEMFLAG_DIRECT can only be set if any of the PR_MEMFLAG_RSS flag + * is set and limit2 is non-zero. 
+ */ + if ((flags & PR_MEMFLAG_DIRECT) && + (!(flags & PR_MEMFLAG_RSS) || !limit2)) + return -EINVAL; + WRITE_ONCE(task->memcg_over_high_action, cmd); WRITE_ONCE(task->memcg_over_high_signal, sig); WRITE_ONCE(task->memcg_over_high_flags, flags); From patchwork Mon Aug 17 14:08:29 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 11718347 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 3154E13B1 for ; Mon, 17 Aug 2020 14:10:46 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1844020789 for ; Mon, 17 Aug 2020 14:10:46 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Lh2beVP8" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729020AbgHQOKn (ORCPT ); Mon, 17 Aug 2020 10:10:43 -0400 Received: from us-smtp-delivery-1.mimecast.com ([205.139.110.120]:60191 "EHLO us-smtp-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728852AbgHQOKG (ORCPT ); Mon, 17 Aug 2020 10:10:06 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1597673405; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:in-reply-to:in-reply-to:references:references; bh=lMSQ3zYe5JdMkN7xyyD+27PUqcIN4NvxLvoYzin94RU=; b=Lh2beVP8U2MYUcABT48pLfJbGguURkli87Xtj8GlBzXc5G98m/ssC1G53DvP98r4k0Pgb2 oI3hw6sLMHCKyo5FR5jA8XO9Pokzubw37oV2rBVF5B/cGhlEvn+r3DaVY1pZJd5V6KtON5 zd9WuDL5aL1MdNspkrx0ywoN+AHgncM= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-170-zyuWZt46PtifejKTk5r3ZA-1; Mon, 17 Aug 2020 10:10:01 -0400 X-MC-Unique: zyuWZt46PtifejKTk5r3ZA-1 
Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id BF50918686C4; Mon, 17 Aug 2020 14:09:59 +0000 (UTC) Received: from llong.com (ovpn-118-35.rdu2.redhat.com [10.10.118.35]) by smtp.corp.redhat.com (Postfix) with ESMTP id AED9821E8F; Mon, 17 Aug 2020 14:09:57 +0000 (UTC) From: Waiman Long To: Andrew Morton , Johannes Weiner , Michal Hocko , Vladimir Davydov , Jonathan Corbet , Alexey Dobriyan , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, Waiman Long Subject: [RFC PATCH 6/8] memcg: Introduce additional memory control slowdown if needed Date: Mon, 17 Aug 2020 10:08:29 -0400 Message-Id: <20200817140831.30260-7-longman@redhat.com> In-Reply-To: <20200817140831.30260-1-longman@redhat.com> References: <20200817140831.30260-1-longman@redhat.com> X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org For fast cpus on slow disks, yielding the cpus repeatedly with PR_MEMACT_SLOWDOWN may not be able to slow down memory allocation enough for memory reclaim to catch up. In case a large memory block is mmap'ed and the pages are faulted in one-by-one, the syscall delays won't be activated during this process. To be safe, an additional variable delay of 20-5000 us will be added to __mem_cgroup_over_high_action() if the excess memory used is more than 1/256 of the memory limit. 
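
To make the 20-5000 us range concrete, the delay arithmetic from the hunk below can be restated as a standalone function. This is only a sketch of the calculation, with udelay() left out; the function name memctl_extra_delay_us() is hypothetical. Both arguments are in pages, and limit is whichever limit was checked last (memory.high plus the additional cgroup limit, or the per-process limit when RSS flags are set).

```c
/*
 * Extra delay in microseconds for PR_MEMACT_SLOWDOWN, following the
 * arithmetic in __mem_cgroup_over_high_action():
 *   step  = limit / 256
 *   delay = min(5000, excess / step * 20)   if step != 0 and excess > step
 *   delay = 0                               otherwise
 */
unsigned long memctl_extra_delay_us(unsigned long excess, unsigned long limit)
{
	unsigned long step = limit >> 8;
	unsigned long delay;

	if (!step || excess <= step)
		return 0;

	delay = excess / step * 20UL;	/* integer division first, as in the patch */
	return delay < 5000UL ? delay : 5000UL;
}
```

With a limit of 256000 pages the step is 1000 pages, so an excess of 2500 pages yields 40 us and the 5000 us cap is reached once the excess hits 250000 pages.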
Signed-off-by: Waiman Long --- mm/memcontrol.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6488f8a10d66..bddf3e659469 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2643,11 +2643,10 @@ get_rss_counter(struct mm_struct *mm, int mm_bit, u16 flags, int rss_bit) static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action, u16 flags) { - unsigned long mem = 0; + unsigned long mem = 0, limit = 0, excess = 0; bool ret = false; struct mm_struct *mm = get_task_mm(current); u8 signal = READ_ONCE(current->memcg_over_high_signal); - u32 limit; if (!mm) return true; /* No more check is needed */ @@ -2657,9 +2656,10 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action, if (memcg) { mem = page_counter_read(&memcg->memory); - limit = READ_ONCE(current->memcg_over_high_climit); - if (mem <= memcg->memory.high + limit) + limit = READ_ONCE(current->memcg_over_high_climit) + memcg->memory.high; + if (mem <= limit) goto out; + excess = mem - limit; } /* @@ -2676,6 +2676,7 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action, limit = READ_ONCE(current->memcg_over_high_plimit); if (mem <= limit) goto out; + excess = mem - limit; } ret = true; @@ -2685,10 +2686,19 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action, break; case PR_MEMACT_SLOWDOWN: /* - * Slow down by yielding the cpu & adding delay to - * memory allocation syscalls. + * Slow down by yielding the cpu & adding delay to memory + * allocation syscalls. + * + * An additional 20-5000 us of delay is added in case the + * excess memory is more than 1/256 of the limit. 
*/ WRITE_ONCE(current->memcg_over_limit, true); + limit >>= 8; + if (limit && (excess > limit)) { + int delay = min(5000UL, excess/limit * 20UL); + + udelay(delay); + } set_tsk_need_resched(current); set_preempt_need_resched(); break; From patchwork Mon Aug 17 14:08:30 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 11718353 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 2976013B1 for ; Mon, 17 Aug 2020 14:11:00 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0E5E4205CB for ; Mon, 17 Aug 2020 14:11:00 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="O7uU1Z6O" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729012AbgHQOKm (ORCPT ); Mon, 17 Aug 2020 10:10:42 -0400 Received: from us-smtp-2.mimecast.com ([207.211.31.81]:37100 "EHLO us-smtp-delivery-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728971AbgHQOKQ (ORCPT ); Mon, 17 Aug 2020 10:10:16 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1597673415; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:in-reply-to:in-reply-to:references:references; bh=YtkHP8yUHAR+QqSRHcyYDisXPjLvOXyhm/pXyropj8M=; b=O7uU1Z6OaQgh61MW7eDV5DQO2mDdcjfMYMEtnjjkXoNgk6b60dwqx78SBDKixZEZfk2f63 A33YGfpVy47URP3sbJ7+BUHw/k7tdN1tXtSbO8w38CRI1LvtGG69b0+t2Pt9nnc3JBcL5C ZxE8V+LwMGBHHI6mWhCDQxqhopu7r5o= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-265-bMwIf74GM1KLBLKidSrXBg-1; Mon, 17 Aug 2020 10:10:11 -0400 X-MC-Unique: bMwIf74GM1KLBLKidSrXBg-1 Received: from 
smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id B49C0425D3; Mon, 17 Aug 2020 14:10:09 +0000 (UTC) Received: from llong.com (ovpn-118-35.rdu2.redhat.com [10.10.118.35]) by smtp.corp.redhat.com (Postfix) with ESMTP id E9B1921E8F; Mon, 17 Aug 2020 14:09:59 +0000 (UTC) From: Waiman Long To: Andrew Morton , Johannes Weiner , Michal Hocko , Vladimir Davydov , Jonathan Corbet , Alexey Dobriyan , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, Waiman Long Subject: [RFC PATCH 7/8] memcg: Enable logging of memory control mitigation action Date: Mon, 17 Aug 2020 10:08:30 -0400 Message-Id: <20200817140831.30260-8-longman@redhat.com> In-Reply-To: <20200817140831.30260-1-longman@redhat.com> References: <20200817140831.30260-1-longman@redhat.com> X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Some of the mitigation actions of PR_SET_MEMCONTROL give no visible indication that an action is being taken inside the kernel. To make them more visible, a new PR_MEMFLAG_LOG flag is added to log the mitigation action to the kernel ring buffer. The logging is done once when the mitigation action starts, through the setting of an internal PR_MEMFLAG_LOGGED flag. This flag will be cleared when it is detected that memory consumption no longer exceeds memory.high.
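
The log-once behaviour described above amounts to a two-bit state machine. A small userspace model of it, assuming the flag values from this patch (PR_MEMFLAG_LOG at bit 2, the internal PR_MEMFLAG_LOGGED at bit 7); the two function names are hypothetical and only stand in for the corresponding kernel code paths.

```c
#include <stdbool.h>

#define PR_MEMFLAG_LOG		(1UL << 2)	/* user-visible: log action done */
#define PR_MEMFLAG_LOGGED	(1UL << 7)	/* internal: message already printed */

/*
 * Model of the check in __mem_cgroup_over_high_action(): a message is
 * due only when LOG is set and LOGGED is not; LOGGED is then latched.
 */
bool memctl_should_log(unsigned long *flags)
{
	if ((*flags & (PR_MEMFLAG_LOG | PR_MEMFLAG_LOGGED)) != PR_MEMFLAG_LOG)
		return false;

	*flags |= PR_MEMFLAG_LOGGED;
	return true;
}

/* Model of the !mem_high path: usage dropped back, so re-arm logging. */
void memctl_back_under_high(unsigned long *flags)
{
	*flags &= ~PR_MEMFLAG_LOGGED;
}
```

The effect is one message per excursion above the limit rather than one per allocation.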
Signed-off-by: Waiman Long --- include/uapi/linux/prctl.h | 1 + mm/memcontrol.c | 34 +++++++++++++++++++++++++++++++++- 2 files changed, 34 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 7ba40e10737d..faa7a51fc52a 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -266,6 +266,7 @@ struct prctl_mm_map { /* Flags for PR_SET_MEMCONTROL */ # define PR_MEMFLAG_SIGCONT (1UL << 0) /* Continuous signal delivery */ # define PR_MEMFLAG_DIRECT (1UL << 1) /* Direct memory limit */ +# define PR_MEMFLAG_LOG (1UL << 2) /* Log action done */ # define PR_MEMFLAG_RSS_ANON (1UL << 8) /* Check anonymous pages */ # define PR_MEMFLAG_RSS_FILE (1UL << 9) /* Check file pages */ # define PR_MEMFLAG_RSS_SHMEM (1UL << 10) /* Check shmem pages */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index bddf3e659469..5bda2dd755fc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2640,6 +2640,7 @@ get_rss_counter(struct mm_struct *mm, int mm_bit, u16 flags, int rss_bit) * Return true if an action has been taken or further check is not needed, * false otherwise. */ +#define PR_MEMFLAG_LOGGED (1UL << 7) /* A log message printed */ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action, u16 flags) { @@ -2714,6 +2715,32 @@ static bool __mem_cgroup_over_high_action(struct mem_cgroup *memcg, u8 action, break; } + if ((flags & (PR_MEMFLAG_LOG|PR_MEMFLAG_LOGGED)) == PR_MEMFLAG_LOG) { + char name[80]; + static const char * const acts[] = { + [PR_MEMACT_ENOMEM] = "Action: return ENOMEM on some syscalls", + [PR_MEMACT_SLOWDOWN] = "Action: slow down process", + [PR_MEMACT_SIGNAL] = "Action: send signal", + [PR_MEMACT_KILL] = "Action: kill the process", + }; + + name[0] = '\0'; + if (memcg) + cgroup_name(memcg->css.cgroup, name, sizeof(name)); + else + strcpy(name, "N/A"); + + /* + * Use printk_deferred() to minimize delay in the memory + * allocation path. 
+ */ + printk_deferred(KERN_INFO + "Cgroup: %s, Comm: %s, Pid: %d, Mem: %ld pages, %s\n", + name, current->comm, current->pid, mem, acts[action]); + WRITE_ONCE(current->memcg_over_high_flags, + flags | PR_MEMFLAG_LOGGED); + } + out: mmput(mm); /* @@ -2740,8 +2767,13 @@ static inline bool mem_cgroup_over_high_action(struct mem_cgroup *memcg, if (flags & PR_MEMFLAG_DIRECT) memcg = NULL; /* Direct per-task memory limit checking */ - else if (!mem_high) + else if (!mem_high) { + /* Clear the PR_MEMFLAG_LOGGED flag, if set */ + if (flags & PR_MEMFLAG_LOGGED) + WRITE_ONCE(current->memcg_over_high_flags, + flags & ~PR_MEMFLAG_LOGGED); return false; + } return __mem_cgroup_over_high_action(memcg, action, flags); } From patchwork Mon Aug 17 14:08:31 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Waiman Long X-Patchwork-Id: 11718343 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 08E4C13B1 for ; Mon, 17 Aug 2020 14:10:32 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id DA70120748 for ; Mon, 17 Aug 2020 14:10:31 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="STcCTApv" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728994AbgHQOKa (ORCPT ); Mon, 17 Aug 2020 10:10:30 -0400 Received: from us-smtp-1.mimecast.com ([205.139.110.61]:45466 "EHLO us-smtp-delivery-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728981AbgHQOKU (ORCPT ); Mon, 17 Aug 2020 10:10:20 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1597673418; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: 
to:to:cc:cc:in-reply-to:in-reply-to:references:references; bh=7a3mQcEQYLMtjY6U/koWkESRRECCeXLzk+bt2NSvZuo=; b=STcCTApvARX7/qTWzXc7LUk8F/H/7/7npIbCape2gNEChRDH0Kr1XpuqLF71VWm1QtySji AY7dDt/+rDguzW9tYUlWlNURZYQeUoCV3e8hgX5E761IxQspgVhsb7C2miznV0pOWSi0zG 0oAg84X9mtDqS3qcqjuYIKJzY592XZg= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-458-qadAukWcOu-r2yAAMxOy1w-1; Mon, 17 Aug 2020 10:10:13 -0400 X-MC-Unique: qadAukWcOu-r2yAAMxOy1w-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id AA55818686C2; Mon, 17 Aug 2020 14:10:11 +0000 (UTC) Received: from llong.com (ovpn-118-35.rdu2.redhat.com [10.10.118.35]) by smtp.corp.redhat.com (Postfix) with ESMTP id DFCDD19C4F; Mon, 17 Aug 2020 14:10:09 +0000 (UTC) From: Waiman Long To: Andrew Morton , Johannes Weiner , Michal Hocko , Vladimir Davydov , Jonathan Corbet , Alexey Dobriyan , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, Waiman Long Subject: [RFC PATCH 8/8] memcg: Add over-high action prctl() documentation Date: Mon, 17 Aug 2020 10:08:31 -0400 Message-Id: <20200817140831.30260-9-longman@redhat.com> In-Reply-To: <20200817140831.30260-1-longman@redhat.com> References: <20200817140831.30260-1-longman@redhat.com> X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org A new memcontrol.rst documentation file is added to document the new prctl(2) interface for setting the over-high mitigation action parameters and retrieving them. 
Signed-off-by: Waiman Long
---
 Documentation/userspace-api/index.rst      |   1 +
 Documentation/userspace-api/memcontrol.rst | 174 +++++++++++++++++++++
 2 files changed, 175 insertions(+)
 create mode 100644 Documentation/userspace-api/memcontrol.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 69fc5167e648..1c0fc7a7f4ec 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -23,6 +23,7 @@ place where this information is gathered.
    accelerators/ocxl
    ioctl/index
    media/index
+   memcontrol
 
 .. only:: subproject and html
 
diff --git a/Documentation/userspace-api/memcontrol.rst b/Documentation/userspace-api/memcontrol.rst
new file mode 100644
index 000000000000..0cfcc72ad5f0
--- /dev/null
+++ b/Documentation/userspace-api/memcontrol.rst
@@ -0,0 +1,174 @@
+==============
+Memory Control
+==============
+
+The memory controller can be used to control and limit the amount of
+physical memory used by a task. When a limit is set in "memory.high" in
+a v2 non-root memory cgroup, the memory controller will try to reclaim
+memory if the limit has been exceeded. Normally, that will be enough
+to keep the physical memory consumption of tasks in the memory cgroup
+around or below the "memory.high" limit.
+
+Sometimes, memory reclaim may not be able to recover memory at a rate
+that can keep up with the physical memory allocation rate. In this case,
+the physical memory consumption will keep on increasing. For memory
+cgroup v2, when it reaches "memory.max" or the system is running
+out of free memory, the OOM killer will be invoked to kill some tasks
+to free up additional memory. However, one has little control over which
+tasks are going to be killed by the OOM killer. Killing tasks that hold
+some important resources without freeing them first can create other
+system problems.
+
+Users who do not want the OOM killer to be invoked to kill random
+tasks in an out-of-memory situation can use the memory control facility
+provided by :manpage:`prctl(2)` to better manage the mitigation action
+that needs to be performed on an individual task when the specified
+memory limit is exceeded with memory cgroup v2 being used.
+
+The task to be controlled must be running in a non-root memory cgroup
+as no limit will be imposed on tasks running in the root memory cgroup.
+
+There are two prctl commands related to this:
+
+ * PR_SET_MEMCONTROL
+
+ * PR_GET_MEMCONTROL
+
+
+PR_SET_MEMCONTROL
+-----------------
+
+PR_SET_MEMCONTROL controls what action should be taken when the memory
+limit is exceeded.
+
+The arg2 of :manpage:`prctl(2)` sets the desired mitigation action. The
+action code consists of three different parts:
+
+ * Bits 0-7: action command
+
+ * Bits 8-15: signal number
+
+ * Bits 16-31: flags
+
+The currently supported action commands are:
+
+====== ================== ================================================
+Value  Define             Description
+====== ================== ================================================
+0      PR_MEMACT_NONE     Use the default memory cgroup behavior
+1      PR_MEMACT_ENOMEM   Return ENOMEM for selected syscalls that try to
+                          allocate more memory when the preset memory limit
+                          is exceeded
+2      PR_MEMACT_SLOWDOWN Slow down the process for memory reclaim to
+                          catch up when memory limit is exceeded
+3      PR_MEMACT_SIGNAL   Send a signal to the task that has exceeded
+                          preset memory limit
+4      PR_MEMACT_KILL     Kill the task that has exceeded preset memory
+                          limit
+====== ================== ================================================
+
+The currently supported flags are:
+
+====== ==================== ================================================
+Value  Define               Description
+====== ==================== ================================================
+0x01   PR_MEMFLAG_SIGCONT   Send a signal on every allocation request instead
+                            of a one-shot signal
+0x02   PR_MEMFLAG_DIRECT    Check per-task memory limit irrespective of cgroup
+                            setting
+0x04   PR_MEMFLAG_LOG       Log any actions taken to the kernel ring buffer
+0x100  PR_MEMFLAG_RSS_ANON  Check process anonymous memory
+0x200  PR_MEMFLAG_RSS_FILE  Check process file-backed (page cache) memory
+0x400  PR_MEMFLAG_RSS_SHMEM Check process shared memory
+0x700  PR_MEMFLAG_RSS       Equivalent to (PR_MEMFLAG_RSS_ANON |
+                            PR_MEMFLAG_RSS_FILE | PR_MEMFLAG_RSS_SHMEM)
+====== ==================== ================================================
+
+If the action command is PR_MEMACT_SIGNAL, bits 8-15 of the action
+code contain the signal number to be used when the memory limit is
+exceeded. By default, the signal number is reset after delivery so
+that the signal will be delivered only once. Another PR_SET_MEMCONTROL
+command will have to be issued to set the signal again. If the user
+wants a non-fatal signal to be delivered every time the memory
+limit is breached without doing another PR_SET_MEMCONTROL call, the
+PR_MEMFLAG_SIGCONT flag can be set.
+
+The arg3 of :manpage:`prctl(2)` sets the additional memory cgroup
+limit that will be added to the value specified in the "memory.high"
+control file to get the real limit above which the action specified in
+the action command will be triggered. This is to make sure that mitigation
+action will only be taken when the kernel memory reclaim facility fails
+to limit the growth of physical memory usage.
+
+If any of the PR_MEMFLAG_RSS* flags is specified, arg4 contains the
+per-process memory limit that will be compared against the sum
+of the specified RSS memory consumption of the process to determine
+if action will be taken, provided that overall memory consumption has
+exceeded the "memory.high" + arg3 limit when the PR_MEMFLAG_DIRECT flag
+isn't set.
+
+If the PR_MEMFLAG_DIRECT flag is set, however, the cgroup memory limit
+is ignored and a memory-over-limit check will be performed on each
+memory allocation request, if applicable. This is reserved for special
+use cases and is not recommended for general use.
+
+
+PR_GET_MEMCONTROL
+-----------------
+
+PR_GET_MEMCONTROL returns the parameters set by a previous
+PR_SET_MEMCONTROL command.
+
+The arg2 of :manpage:`prctl(2)` sets the type of parameter that is to be
+returned. The possible values are:
+
+====== =================== ================================================
+Value  Define              Description
+====== =================== ================================================
+0      PR_MEMGET_ACTION    Return the action code - command, flags & signal
+1      PR_MEMGET_CLIMIT    Return the additional cgroup memory limit (in bytes)
+2      PR_MEMGET_PLIMIT    Return the process memory limit for PR_MEMFLAG_RSS*
+====== =================== ================================================
+
+
+/proc/<pid>/memctl
+------------------
+
+PR_GET_MEMCONTROL only returns the memory control settings of the
+task itself. To find that information about other tasks, the
+/proc/<pid>/memctl file can be read. This file reports three integer
+parameters:
+
+ * action code
+
+ * cgroup additional memory limit
+
+ * process memory limit for PR_MEMFLAG_RSS* flags
+
+These are the same values that are returned when the task calls
+:manpage:`prctl(2)` with the PR_GET_MEMCONTROL command and the
+PR_MEMGET_ACTION, PR_MEMGET_CLIMIT and PR_MEMGET_PLIMIT arguments
+respectively.
+
+Privileged users can also write to the memctl file directly to modify
+those parameters for a given task.
+
+This procfs file is present for each of the running threads of a process.
+So a separate write to each of them is needed to update the parameters
+for all the threads within a running process.
+
+Affected Syscalls
+-----------------
+
+The following system calls have an additional check for the over-high
+memory usage flag that is set by the above memory control facility.
+
+ * :manpage:`brk(2)`
+
+ * :manpage:`mlock(2)`
+
+ * :manpage:`mlock2(2)`
+
+ * :manpage:`mlockall(2)`
+
+ * :manpage:`mmap(2)`
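
For readers of the documentation above, the three-part action code passed as arg2 can be assembled as in the sketch below. It assumes the bit layout listed under PR_SET_MEMCONTROL (command in bits 0-7, signal in bits 8-15, flags in bits 16-31); the helper name memctl_action_word() is hypothetical and not part of the series.

```c
/*
 * Pack an action command, signal number and flag set into one action
 * code, following the bit layout given in memcontrol.rst:
 *   bits 0-7 command, bits 8-15 signal number, bits 16-31 flags.
 * Out-of-range inputs are masked rather than rejected in this sketch.
 */
unsigned long memctl_action_word(unsigned int cmd, unsigned int sig,
				 unsigned int flags)
{
	return ((unsigned long)(cmd & 0xffU)) |
	       ((unsigned long)(sig & 0xffU) << 8) |
	       ((unsigned long)(flags & 0xffffU) << 16);
}
```

For example, PR_MEMACT_SIGNAL (3) with signal 10 and PR_MEMFLAG_SIGCONT (0x01) packs to 0x10a03. The result would be passed as arg2 of PR_SET_MEMCONTROL, with the additional cgroup limit and the per-process limit, both in bytes, as arg3 and arg4. The numeric value of PR_SET_MEMCONTROL is defined earlier in the series and is not reproduced here.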