From patchwork Thu Jul 2 15:22:22 2020
X-Patchwork-Submitter: Shakeel Butt
X-Patchwork-Id: 11639261
Date: Thu, 2 Jul 2020 08:22:22 -0700
Message-Id: <20200702152222.2630760-1-shakeelb@google.com>
Subject: [RFC PROPOSAL] memcg: per-memcg user space reclaim interface
From: Shakeel Butt
To: Johannes Weiner, Roman Gushchin, Michal Hocko, Yang Shi,
 David Rientjes, Greg Thelen
Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 cgroups@vger.kernel.org, Shakeel Butt

This is a proposal to expose an interface to user space for triggering
memory reclaim on a memory cgroup. The proposal covers potential use
cases, the benefits of a user space interface, and potential
implementation choices.

Use cases:
----------

1) Per-memcg uswapd:

Applications usually consist of a combination of latency-sensitive and
latency-tolerant tasks, for example, tasks serving user requests vs
tasks doing data backup in a database application. At the moment the
kernel does not differentiate between such tasks when the application
hits its memcg limits, so a latency-sensitive user-facing task can
potentially get stuck in memory reclaim and be throttled by the
kernel. This problem has been discussed before [1, 2].

One way to resolve this issue is to preemptively trigger memory
reclaim from a latency-tolerant task (uswapd) when the application is
near its limits. (Please note that detecting the 'near the limits'
situation is an orthogonal problem; we are exploring whether per-memcg
MemAvailable notifications can be useful there [3].)

2) Proactive reclaim:

This is similar to the previous use case; the difference is that
instead of waiting for the application to get near its limit before
triggering memory reclaim, the memcg is continuously pressured to
reclaim a small amount of memory. This gives a more accurate and
up-to-date working set estimation, as the LRUs are continuously
sorted, and can potentially provide more deterministic memory
overcommit behavior.
The memory overcommit controller can then respond proactively to the
changing workload of the running applications instead of being
reactive.

Benefits of a user space solution:
----------------------------------

1) More flexibility in who is charged for the CPU cost of the memory
reclaim. For proactive reclaim, it makes more sense to centralize the
overhead, while for uswapd it makes more sense for the application
itself to pay the CPU cost of its reclaim.

2) More flexibility in dedicating resources (like CPU). The memory
overcommit controller can balance the cost between the CPU used and
the memory reclaimed.

3) Provides a way for applications to keep their LRUs sorted, so that
better reclaim candidates are selected under memory pressure.

Interface options:
------------------

1) memcg interface, e.g. 'echo 10M > memory.reclaim'

+ simple
+ can be extended to target specific types of memory (anon, file,
  kmem).
- most probably restricted to cgroup v2.

2) fadvise(PAGEOUT) on cgroup_dir_fd

+ more general and applicable to other FSes (actually we are using
  something similar for tmpfs).
+ can be extended in the future to just age the LRUs instead of
  reclaiming, or to new use cases. [Or maybe a new fadvise2() syscall
  which can take FS-specific options.]

[1] https://lwn.net/Articles/753162/
[2] http://lkml.kernel.org/r/20200219181219.54356-1-hannes@cmpxchg.org
[3] http://lkml.kernel.org/r/alpine.DEB.2.22.394.2006281445210.855265@chino.kir.corp.google.com

The following patch is my attempt to implement option 2. Please ignore
the fine details, as I am more interested in getting feedback on the
proposal and the interface options.

Signed-off-by: Shakeel Butt
---
 fs/kernfs/dir.c                 | 20 +++++++++++++++
 include/linux/cgroup-defs.h     |  2 ++
 include/linux/kernfs.h          |  2 ++
 include/uapi/linux/fadvise.h    |  1 +
 kernel/cgroup/cgroup-internal.h |  2 ++
 kernel/cgroup/cgroup-v1.c       |  1 +
 kernel/cgroup/cgroup.c          | 43 +++++++++++++++++++++++++++++++++
 mm/memcontrol.c                 | 20 +++++++++++++++
 8 files changed, 91 insertions(+)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 9aec80b9d7c6..96b3b67f3a85 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1698,9 +1698,29 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
 	return 0;
 }
 
+static int kernfs_dir_fadvise(struct file *file, loff_t offset, loff_t len,
+			      int advise)
+{
+	struct kernfs_node *kn = kernfs_dentry_node(file->f_path.dentry);
+	struct kernfs_syscall_ops *scops = kernfs_root(kn)->syscall_ops;
+	int ret;
+
+	if (!scops || !scops->fadvise)
+		return -EPERM;
+
+	if (!kernfs_get_active(kn))
+		return -ENODEV;
+
+	ret = scops->fadvise(kn, offset, len, advise);
+
+	kernfs_put_active(kn);
+	return ret;
+}
+
 const struct file_operations kernfs_dir_fops = {
 	.read		= generic_read_dir,
 	.iterate_shared	= kernfs_fop_readdir,
 	.release	= kernfs_dir_fop_release,
 	.llseek		= generic_file_llseek,
+	.fadvise	= kernfs_dir_fadvise,
 };
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 52661155f85f..cbe46634875e 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -628,6 +628,8 @@ struct cgroup_subsys {
 	void (*css_rstat_flush)(struct cgroup_subsys_state *css, int cpu);
 	int (*css_extra_stat_show)(struct seq_file *seq,
 				   struct cgroup_subsys_state *css);
+	int (*css_fadvise)(struct cgroup_subsys_state *css, loff_t offset,
+			   loff_t len, int advise);
 
 	int (*can_attach)(struct cgroup_taskset *tset);
 	void (*cancel_attach)(struct cgroup_taskset *tset);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 89f6a4214a70..3e188b6c3402 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -175,6 +175,8 @@ struct kernfs_syscall_ops {
 			const char *new_name);
 	int (*show_path)(struct seq_file *sf, struct kernfs_node *kn,
 			 struct kernfs_root *root);
+	int (*fadvise)(struct kernfs_node *kn, loff_t offset, loff_t len,
+		       int advise);
 };
 
 struct kernfs_root {
diff --git a/include/uapi/linux/fadvise.h b/include/uapi/linux/fadvise.h
index 0862b87434c2..302eacc4df44 100644
--- a/include/uapi/linux/fadvise.h
+++ b/include/uapi/linux/fadvise.h
@@ -19,4 +19,5 @@
 #define POSIX_FADV_NOREUSE	5 /* Data will be accessed once. */
 #endif
 
+#define FADV_PAGEOUT		100 /* Pageout/reclaim pages. */
 #endif /* FADVISE_H_INCLUDED */
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index bfbeabc17a9d..f6077d170112 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -243,6 +243,8 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
 		 umode_t mode);
 int cgroup_rmdir(struct kernfs_node *kn);
 int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
 		     struct kernfs_root *kf_root);
+int cgroup_fadvise(struct kernfs_node *kn, loff_t offset, loff_t len,
+		   int advise);
 
 int __cgroup_task_count(const struct cgroup *cgrp);
 int cgroup_task_count(const struct cgroup *cgrp);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 191c329e482a..d5becb618a50 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -1094,6 +1094,7 @@ struct kernfs_syscall_ops cgroup1_kf_syscall_ops = {
 	.mkdir			= cgroup_mkdir,
 	.rmdir			= cgroup_rmdir,
 	.show_path		= cgroup_show_path,
+	.fadvise		= cgroup_fadvise,
 };
 
 /*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1ea181a58465..c5c022bde398 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5564,11 +5564,54 @@ int cgroup_rmdir(struct kernfs_node *kn)
 	return ret;
 }
 
+static int cgroup_ss_fadvise(struct cgroup *cgrp, struct cgroup_subsys *ss,
+			     loff_t offset, loff_t len, int advise)
+{
+	struct cgroup_subsys_state *css;
+	int ret;
+
+	if (!ss->css_fadvise)
+		return 0;
+
+	css = cgroup_tryget_css(cgrp, ss);
+	if (!css)
+		return 0;
+
+	ret = ss->css_fadvise(css, offset, len, advise);
+	css_put(css);
+	return ret;
+}
+
+int cgroup_fadvise(struct kernfs_node *kn, loff_t offset, loff_t len,
+		   int advise)
+{
+	struct cgroup *cgrp;
+	struct cgroup_subsys *ss;
+	int ret = 0, ssid;
+
+	if (kernfs_type(kn) != KERNFS_DIR)
+		return 0;
+
+	cgrp = kn->priv;
+	if (!cgroup_tryget(cgrp))
+		return 0;
+
+	for_each_subsys(ss, ssid) {
+		ret = cgroup_ss_fadvise(cgrp, ss, offset, len, advise);
+		if (ret)
+			break;
+	}
+
+	cgroup_put(cgrp);
+	return ret;
+}
+
 static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
 	.show_options		= cgroup_show_options,
 	.mkdir			= cgroup_mkdir,
 	.rmdir			= cgroup_rmdir,
 	.show_path		= cgroup_show_path,
+	.fadvise		= cgroup_fadvise,
 };
 
 static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b1a644224383..a38812aa6cde 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -59,6 +59,7 @@
 #include <linux/tracehook.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
+#include <linux/fadvise.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -5369,6 +5370,24 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 	memcg_wb_domain_size_changed(memcg);
 }
 
+static int mem_cgroup_css_fadvise(struct cgroup_subsys_state *css,
+				  loff_t offset, loff_t len, int advise)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	unsigned long nr_pages = page_counter_read(&memcg->memory);
+	unsigned long nr_to_reclaim;
+
+	if (advise != FADV_PAGEOUT || offset <= 0 || len <= 0)
+		return 0;
+
+	nr_to_reclaim = len >> PAGE_SHIFT;
+
+	if (nr_pages >= nr_to_reclaim)
+		try_to_free_mem_cgroup_pages(memcg, nr_to_reclaim, GFP_KERNEL,
+					     true);
+	return 0;
+}
+
 #ifdef CONFIG_MMU
 /* Handlers for move charge at task migration. */
 static int mem_cgroup_do_precharge(unsigned long count)
@@ -6418,6 +6437,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.css_released = mem_cgroup_css_released,
 	.css_free = mem_cgroup_css_free,
 	.css_reset = mem_cgroup_css_reset,
+	.css_fadvise = mem_cgroup_css_fadvise,
 	.can_attach = mem_cgroup_can_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.post_attach = mem_cgroup_move_task,
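
For illustration, here is a rough sketch of how a user space agent
(uswapd or a memory overcommit controller) could drive option 2. This
is not part of the patch: memcg_reclaim() is a made-up helper, the
cgroup path is a placeholder, and FADV_PAGEOUT (100) is assumed to
match the patched uapi header. Note that with the patch as written,
'len' is the number of bytes to reclaim and a non-positive offset
turns the call into a no-op, hence the positive dummy offset.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define FADV_PAGEOUT	100	/* from the patched uapi fadvise.h */

/* Ask the kernel to reclaim up to 'bytes' from the memcg at 'memcg_dir'. */
static int memcg_reclaim(const char *memcg_dir, off_t bytes)
{
	int fd = open(memcg_dir, O_RDONLY | O_DIRECTORY);
	int err;

	if (fd < 0)
		return -1;

	/* posix_fadvise() returns the error number directly, not -1/errno. */
	err = posix_fadvise(fd, 1, bytes, FADV_PAGEOUT);
	close(fd);
	if (err) {
		fprintf(stderr, "fadvise: %s\n", strerror(err));
		return -1;
	}
	return 0;
}

int main(void)
{
	/* e.g. pressure a (hypothetical) memcg to reclaim ~10M. */
	return memcg_reclaim("/sys/fs/cgroup/workload", 10L << 20);
}

A proactive-reclaim agent would call this in a loop with a small
'bytes' value; a uswapd-style agent would call it once the memcg gets
near its limit.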