From patchwork Wed May 1 14:04:34 2019
X-Patchwork-Submitter: "Welty, Brian"
X-Patchwork-Id: 10925103
From: Brian Welty
To: cgroups@vger.kernel.org, Tejun Heo, Li Zefan, Johannes Weiner, linux-mm@kvack.org, Michal Hocko, Vladimir Davydov, dri-devel@lists.freedesktop.org, David Airlie, Daniel Vetter, intel-gfx@lists.freedesktop.org, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Christian König, Alex Deucher, ChunMing Zhou, Jérôme Glisse
Subject: [RFC PATCH 1/5] cgroup: Add cgroup_subsys per-device registration framework
Date: Wed, 1 May 2019 10:04:34 -0400
Message-Id: <20190501140438.9506-2-brian.welty@intel.com>
In-Reply-To: <20190501140438.9506-1-brian.welty@intel.com>
References: <20190501140438.9506-1-brian.welty@intel.com>

In containerized or virtualized environments, there is a desire to have controls in place for the resources that users of a GPU device can consume. To that end, we extend control groups with a mechanism for device drivers to register with cgroup subsystems. Device drivers (GPU or other) can then reuse the existing cgroup controls instead of inventing similar ones. The proposed framework allows devices to register with existing cgroup controllers, creating a per-device cgroup_subsys_state within each cgroup. This gives device drivers their own private cgroup controls (such as memory limits or other parameters), applied to device resources rather than host system resources.
This is exposed in the cgroup filesystem as: mount/<cgroup>/<subsys>.devices/<dev_name>/<subsys files> such as (for example): mount/<cgroup>/memory.devices/<dev_name>/memory.max mount/<cgroup>/memory.devices/<dev_name>/memory.current mount/<cgroup>/cpu.devices/<dev_name>/cpu.stat The creation of the above files is implemented in css_populate_dir() for cgroup subsystems that have enabled per-device support. The files are created either at cgroup creation time (for already-registered devices) or when a device driver registers its device, during cgroup_device_register(). cgroup_device_unregister() removes the files from all current cgroups. Cc: cgroups@vger.kernel.org Cc: linux-mm@kvack.org Cc: dri-devel@lists.freedesktop.org Cc: Matt Roper Signed-off-by: Brian Welty --- include/linux/cgroup-defs.h | 28 ++++ include/linux/cgroup.h | 3 + kernel/cgroup/cgroup.c | 270 ++++++++++++++++++++++++++++++++++-- 3 files changed, 289 insertions(+), 12 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 1c70803e9f77..aeaab420e349 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -162,6 +162,17 @@ struct cgroup_subsys_state { struct work_struct destroy_work; struct rcu_work destroy_rwork; + /* + * Per-device state for devices registered with our subsys. + * @device_css_idr stores pointer to per-device cgroup_subsys_state, + * created when devices are associated with this css. + * @device_kn is for creating .devices sub-directory within this cgroup + * or for the per-device sub-directory (subsys.devices/). + */ + struct device *device; + struct idr device_css_idr; + struct kernfs_node *device_kn; + /* * PI: the parent css. Placed here for cache proximity to following
@@ -589,6 +600,9 @@ struct cftype { */ struct cgroup_subsys { struct cgroup_subsys_state *(*css_alloc)(struct cgroup_subsys_state *parent_css); + struct cgroup_subsys_state *(*device_css_alloc)(struct device *device, + struct cgroup_subsys_state *cgroup_css, + struct cgroup_subsys_state *parent_device_css); int (*css_online)(struct cgroup_subsys_state *css); void (*css_offline)(struct cgroup_subsys_state *css); void (*css_released)(struct cgroup_subsys_state *css); @@ -636,6 +650,13 @@ struct cgroup_subsys { */ bool threaded:1; + /* + * If %true, the controller supports device drivers to register + * with this controller for cloning the cgroup functionality + * into per-device cgroup state under <subsys>.device/<dev_name>/. + */ + bool allow_devices:1; + /* + * If %false, this subsystem is properly hierarchical - + * configuration, resource accounting and restriction on a parent @@ -664,6 +685,13 @@ struct cgroup_subsys { /* idr for css->id */ struct idr css_idr; + /* + * IDR of registered devices, allows subsys_state to have state + * for each device. Exposed as per-device entries in filesystem, + * under <subsys>.device/<dev_name>/. + */ + struct idr device_idr; + /* + * List of cftypes. Each entry is the first entry of an array + * terminated by zero length name.
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 81f58b4a5418..3531bf948703 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -116,6 +116,9 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry); int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *tsk); +int cgroup_device_register(struct cgroup_subsys *ss, struct device *dev, + unsigned long *dev_id); +void cgroup_device_unregister(struct cgroup_subsys *ss, unsigned long dev_id); void cgroup_fork(struct task_struct *p); extern int cgroup_can_fork(struct task_struct *p); extern void cgroup_cancel_fork(struct task_struct *p); diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 3f2b4bde0f9c..9b035e728941 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -598,6 +598,8 @@ struct cgroup_subsys_state *of_css(struct kernfs_open_file *of) struct cgroup *cgrp = of->kn->parent->priv; struct cftype *cft = of_cft(of); + /* FIXME this needs updating to lookup device-specific CSS */ + /* * This is open and unprotected implementation of cgroup_css(). * seq_css() is only called from a kernfs file operation which has @@ -1583,14 +1585,15 @@ struct cgroup *cgroup_kn_lock_live(struct kernfs_node *kn, bool drain_offline) return NULL; } -static void cgroup_rm_file(struct cgroup *cgrp, const struct cftype *cft) +static void cgroup_rm_file(struct cgroup_subsys_state *css, struct cgroup *cgrp, + const struct cftype *cft) { char name[CGROUP_FILE_NAME_MAX]; + struct kernfs_node *dest_kn; lockdep_assert_held(&cgroup_mutex); if (cft->file_offset) { - struct cgroup_subsys_state *css = cgroup_css(cgrp, cft->ss); struct cgroup_file *cfile = (void *)css + cft->file_offset; spin_lock_irq(&cgroup_file_kn_lock); @@ -1600,6 +1603,7 @@ static void cgroup_rm_file(struct cgroup *cgrp, const struct cftype *cft) del_timer_sync(&cfile->notify_timer); } + dest_kn = (css->device) ? 
css->device_kn : cgrp->kn; kernfs_remove_by_name(cgrp->kn, cgroup_file_name(cgrp, cft, name)); } @@ -1630,10 +1634,49 @@ static void css_clear_dir(struct cgroup_subsys_state *css) } } +static int cgroup_device_mkdir(struct cgroup_subsys_state *css) +{ + struct cgroup_subsys_state *device_css; + struct cgroup *cgrp = css->cgroup; + char name[CGROUP_FILE_NAME_MAX]; + struct kernfs_node *kn; + int ret, dev_id; + + /* create subsys.device only if enabled in subsys and non-root cgroup */ + if (!css->ss->allow_devices || !cgroup_parent(cgrp)) + return 0; + + ret = strlcpy(name, css->ss->name, CGROUP_FILE_NAME_MAX); + ret += strlcat(name, ".device", CGROUP_FILE_NAME_MAX); + /* treat as non-error if truncation due to subsys name */ + if (WARN_ON_ONCE(ret >= CGROUP_FILE_NAME_MAX)) + return 0; + + kn = kernfs_create_dir(cgrp->kn, name, cgrp->kn->mode, cgrp); + if (IS_ERR(kn)) + return PTR_ERR(kn); + css->device_kn = kn; + + /* create subdirectory per each registered device */ + idr_for_each_entry(&css->device_css_idr, device_css, dev_id) { + /* FIXME: prefix dev_name with bus_name for uniqueness? */ + kn = kernfs_create_dir(css->device_kn, + dev_name(device_css->device), + cgrp->kn->mode, cgrp); + if (IS_ERR(kn)) + return PTR_ERR(kn); + /* FIXME: kernfs_get needed here? */ + device_css->device_kn = kn; + } + + return 0; +} + /** * css_populate_dir - create subsys files in a cgroup directory * @css: target css * + * Creates per-device directories if enabled in subsys. * On failure, no file is added. 
*/ static int css_populate_dir(struct cgroup_subsys_state *css) @@ -1655,6 +1698,10 @@ static int css_populate_dir(struct cgroup_subsys_state *css) if (ret < 0) return ret; } else { + ret = cgroup_device_mkdir(css); + if (ret < 0) + return ret; + list_for_each_entry(cfts, &css->ss->cfts, node) { ret = cgroup_addrm_files(css, cgrp, cfts, true); if (ret < 0) { @@ -1673,6 +1720,7 @@ static int css_populate_dir(struct cgroup_subsys_state *css) break; cgroup_addrm_files(css, cgrp, cfts, false); } + /* FIXME: per-device files will be removed by kernfs_destroy_root? */ return ret; } @@ -3665,14 +3713,15 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp, struct cftype *cft) { char name[CGROUP_FILE_NAME_MAX]; - struct kernfs_node *kn; + struct kernfs_node *kn, *dest_kn; struct lock_class_key *key = NULL; int ret; #ifdef CONFIG_DEBUG_LOCK_ALLOC key = &cft->lockdep_key; #endif - kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name), + dest_kn = (css->device) ? css->device_kn : cgrp->kn; + kn = __kernfs_create_file(dest_kn, cgroup_file_name(cgrp, cft, name), cgroup_file_mode(cft), GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, cft->kf_ops, cft, @@ -3709,15 +3758,13 @@ static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp, * Depending on @is_add, add or remove files defined by @cfts on @cgrp. * For removals, this function never fails. */ -static int cgroup_addrm_files(struct cgroup_subsys_state *css, - struct cgroup *cgrp, struct cftype cfts[], - bool is_add) +static int __cgroup_addrm_files(struct cgroup_subsys_state *css, + struct cgroup *cgrp, struct cftype cfts[], + bool is_add) { struct cftype *cft, *cft_end = NULL; int ret = 0; - lockdep_assert_held(&cgroup_mutex); - restart: for (cft = cfts; cft != cft_end && cft->name[0] != '\0'; cft++) { /* does cft->flags tell us to skip this file on @cgrp? 
*/ @@ -3741,12 +3788,43 @@ static int cgroup_addrm_files(struct cgroup_subsys_state *css, goto restart; } } else { - cgroup_rm_file(cgrp, cft); + cgroup_rm_file(css, cgrp, cft); } } return ret; } +static int cgroup_addrm_files(struct cgroup_subsys_state *css, + struct cgroup *cgrp, struct cftype cfts[], + bool is_add) +{ + struct cgroup_subsys_state *device_css, *device_css_end = NULL; + int dev_id, ret, err = 0; + + lockdep_assert_held(&cgroup_mutex); +restart: + ret = __cgroup_addrm_files(css, cgrp, cfts, is_add); + if (ret) + return ret; + + /* repeat addrm for each device */ + idr_for_each_entry(&css->device_css_idr, device_css, dev_id) { + if (device_css == device_css_end) + break; + ret = __cgroup_addrm_files(device_css, cgrp, cfts, is_add); + if (ret && !is_add) { + return ret; + } else if (ret) { + is_add = false; + device_css_end = device_css; + err = ret; + goto restart; + } + } + + return err; +} + static int cgroup_apply_cftypes(struct cftype *cfts, bool is_add) { struct cgroup_subsys *ss = cfts[0].ss; @@ -4711,9 +4789,14 @@ static void css_free_rwork_fn(struct work_struct *work) if (ss) { /* css free path */ - struct cgroup_subsys_state *parent = css->parent; - int id = css->id; + struct cgroup_subsys_state *device_css, *parent = css->parent; + int dev_id, id = css->id; + idr_for_each_entry(&css->device_css_idr, device_css, dev_id) { + css_put(device_css->parent); + ss->css_free(device_css); + } + idr_destroy(&css->device_css_idr); ss->css_free(css); cgroup_idr_remove(&ss->css_idr, id); cgroup_put(cgrp); @@ -4833,6 +4916,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css, INIT_LIST_HEAD(&css->rstat_css_node); css->serial_nr = css_serial_nr_next++; atomic_set(&css->online_cnt, 0); + idr_init(&css->device_css_idr); if (cgroup_parent(cgrp)) { css->parent = cgroup_css(cgroup_parent(cgrp), ss); @@ -4885,6 +4969,79 @@ static void offline_css(struct cgroup_subsys_state *css) wake_up_all(&css->cgroup->offline_waitq); } +/* + * Associates a 
device with a css. + * Create a new device-specific css and insert into @css->device_css_idr. + * Acquires a reference on @css, which is released when the device is + * dissociated from this css. + */ +static int cgroup_add_device(struct cgroup_subsys_state *css, + struct device *dev, int dev_id) +{ + struct cgroup_subsys *ss = css->ss; + struct cgroup_subsys_state *dev_css, *dev_parent_css; + int err; + + lockdep_assert_held(&cgroup_mutex); + + /* don't add devices at root cgroup level */ + if (!css->parent) + return -EINVAL; + + dev_parent_css = idr_find(&css->parent->device_css_idr, dev_id); + dev_css = ss->device_css_alloc(dev, css, dev_parent_css); + if (IS_ERR_OR_NULL(dev_css)) { + if (!dev_css) + return -ENOMEM; + if (IS_ERR(dev_css)) + return PTR_ERR(dev_css); + } + + /* store per-device css pointer in the cgroup's css */ + err = idr_alloc(&css->device_css_idr, dev_css, dev_id, + dev_id + 1, GFP_KERNEL); + if (err < 0) { + ss->css_free(dev_css); + return err; + } + + dev_css->device = dev; + dev_css->parent = dev_parent_css; + /* + * subsys per-device support is allowed to access cgroup subsys_state + * using cgroup.self. Increment reference on css so it remains valid + * as long as device is associated with it. + */ + dev_css->cgroup = css->cgroup; + dev_css->ss = css->ss; + css_get(css); + + return 0; +} + +/* + * For a new cgroup css, create a device-specific css for each device + * which has registered itself with the subsys. 
+ */ +static int cgroup_add_devices(struct cgroup_subsys_state *css) +{ + struct device *dev; + int dev_id, err = 0; + + /* ignore adding devices for root cgroups */ + if (!css->parent) + return 0; + + /* create per-device css for each associated device */ + idr_for_each_entry(&css->ss->device_idr, dev, dev_id) { + err = cgroup_add_device(css, dev, dev_id); + if (err) + break; + } + + return err; +} + /** * css_create - create a cgroup_subsys_state * @cgrp: the cgroup new css will be associated with @@ -4921,6 +5078,10 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp, goto err_free_css; css->id = err; + err = cgroup_add_devices(css); + if (err) + goto err_free_css; + /* @css is ready to be brought online now, make it visible */ list_add_tail_rcu(&css->sibling, &parent_css->children); cgroup_idr_replace(&ss->css_idr, css, css->id); @@ -5337,6 +5498,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early) mutex_lock(&cgroup_mutex); idr_init(&ss->css_idr); + idr_init(&ss->device_idr); INIT_LIST_HEAD(&ss->cfts); /* Create the root cgroup state for this subsystem */ @@ -5637,6 +5799,90 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns, return retval; } +void cgroup_device_unregister(struct cgroup_subsys *ss, unsigned long dev_id) +{ + struct cgroup_subsys_state *css, *device_css; + int css_id; + + if (!ss->allow_devices) + return; + + mutex_lock(&cgroup_mutex); + idr_for_each_entry(&ss->css_idr, css, css_id) { + WARN_ON(css->device); + if (!css->parent || css->device) + continue; + device_css = idr_remove(&css->device_css_idr, dev_id); + if (device_css) { + /* FIXME kernfs_get/put needed to make safe? 
*/ + if (device_css->device_kn) + kernfs_remove(device_css->device_kn); + css_put(device_css->parent); + ss->css_free(device_css); + } + } + idr_remove(&ss->device_idr, dev_id); + mutex_unlock(&cgroup_mutex); +} + +/** + * cgroup_device_register - associate a struct device with @ss + * @ss: the subsystem of interest + * @dev: the device of interest + * @dev_id: index into @ss idr returned + * + * Insert @dev into set of devices to be associated with this subsystem. + * As cgroups are created, subdirectories <subsys>.device/<dev_name>/ will be + * created within them. + */ +int cgroup_device_register(struct cgroup_subsys *ss, struct device *dev, + unsigned long *dev_id) +{ + struct cgroup_subsys_state *css; + int css_id, id, err = 0; + + if (!ss->allow_devices) + return -EACCES; + + mutex_lock(&cgroup_mutex); + + id = idr_alloc_cyclic(&ss->device_idr, dev, 0, 0, GFP_KERNEL); + if (id < 0) { + mutex_unlock(&cgroup_mutex); + return id; + } + + idr_for_each_entry(&ss->css_idr, css, css_id) { + WARN_ON(css->device); + if (!css->parent || css->device) + continue; + err = cgroup_add_device(css, dev, id); + if (err) + break; + + if (css_visible(css)) { + /* FIXME - something more lightweight can be done? */ + css_clear_dir(css); + /* FIXME kernfs_get/put needed to make safe? */ + kernfs_remove(css->device_kn); + err = css_populate_dir(css); + if (err) + /* FIXME handle error case */ + err = 0; + else + kernfs_activate(css->cgroup->kn); + } + } + + if (!err) + *dev_id = id; + mutex_unlock(&cgroup_mutex); + + return err; +} + /** * cgroup_fork - initialize cgroup related fields during copy_process() * @child: pointer to task_struct of forking parent process.
From patchwork Wed May 1 14:04:35 2019
X-Patchwork-Submitter: "Welty, Brian"
X-Patchwork-Id: 10925109
From: Brian Welty
To: cgroups@vger.kernel.org, Tejun Heo, Li Zefan, Johannes Weiner, linux-mm@kvack.org, Michal Hocko, Vladimir Davydov, dri-devel@lists.freedesktop.org, David Airlie, Daniel Vetter, intel-gfx@lists.freedesktop.org, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Christian König, Alex Deucher, ChunMing Zhou, Jérôme Glisse
Subject: [RFC PATCH 2/5] cgroup: Change kernfs_node for directories to store cgroup_subsys_state
Date: Wed, 1 May 2019 10:04:35 -0400
Message-Id: <20190501140438.9506-3-brian.welty@intel.com>
In-Reply-To: <20190501140438.9506-1-brian.welty@intel.com>
References: <20190501140438.9506-1-brian.welty@intel.com>

Change the kernfs_node.priv to store the cgroup_subsys_state (CSS) pointer for directories, instead of storing cgroup pointer. This is done in order to support files within the cgroup associated with devices. We require of_css() to return the device-specific CSS pointer for these files.
Cc: cgroups@vger.kernel.org Signed-off-by: Brian Welty --- kernel/cgroup/cgroup-v1.c | 10 ++++---- kernel/cgroup/cgroup.c | 48 +++++++++++++++++---------------------- 2 files changed, 27 insertions(+), 31 deletions(-) diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index c126b34fd4ff..4fa56cc2b99c 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -723,6 +723,7 @@ int proc_cgroupstats_show(struct seq_file *m, void *v) int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry) { struct kernfs_node *kn = kernfs_node_from_dentry(dentry); + struct cgroup_subsys_state *css; struct cgroup *cgrp; struct css_task_iter it; struct task_struct *tsk; @@ -740,12 +741,13 @@ int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry) * @kn->priv is RCU safe. Let's do the RCU dancing. */ rcu_read_lock(); - cgrp = rcu_dereference(*(void __rcu __force **)&kn->priv); - if (!cgrp || cgroup_is_dead(cgrp)) { + css = rcu_dereference(*(void __rcu __force **)&kn->priv); + if (!css || cgroup_is_dead(css->cgroup)) { rcu_read_unlock(); mutex_unlock(&cgroup_mutex); return -ENOENT; } + cgrp = css->cgroup; rcu_read_unlock(); css_task_iter_start(&cgrp->self, 0, &it); @@ -851,7 +853,7 @@ void cgroup1_release_agent(struct work_struct *work) static int cgroup1_rename(struct kernfs_node *kn, struct kernfs_node *new_parent, const char *new_name_str) { - struct cgroup *cgrp = kn->priv; + struct cgroup_subsys_state *css = kn->priv; int ret; if (kernfs_type(kn) != KERNFS_DIR) @@ -871,7 +873,7 @@ static int cgroup1_rename(struct kernfs_node *kn, struct kernfs_node *new_parent ret = kernfs_rename(kn, new_parent, new_name_str); if (!ret) - TRACE_CGROUP_PATH(rename, cgrp); + TRACE_CGROUP_PATH(rename, css->cgroup); mutex_unlock(&cgroup_mutex); diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 9b035e728941..1fe4fee502ea 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -595,12 +595,13 @@ static void 
cgroup_get_live(struct cgroup *cgrp) struct cgroup_subsys_state *of_css(struct kernfs_open_file *of) { - struct cgroup *cgrp = of->kn->parent->priv; + struct cgroup_subsys_state *css = of->kn->parent->priv; struct cftype *cft = of_cft(of); - /* FIXME this needs updating to lookup device-specific CSS */ - /* + * If the cft specifies a subsys and this is not a device file, + * then lookup the css, otherwise it is already correct. + * * This is open and unprotected implementation of cgroup_css(). * seq_css() is only called from a kernfs file operation which has * an active reference on the file. Because all the subsystem @@ -608,10 +609,9 @@ struct cgroup_subsys_state *of_css(struct kernfs_open_file *of) * the matching css from the cgroup's subsys table is guaranteed to * be and stay valid until the enclosing operation is complete. */ - if (cft->ss) - return rcu_dereference_raw(cgrp->subsys[cft->ss->id]); - else - return &cgrp->self; + if (cft->ss && !css->device) + css = rcu_dereference_raw(css->cgroup->subsys[cft->ss->id]); + return css; } EXPORT_SYMBOL_GPL(of_css); @@ -1524,12 +1524,14 @@ static u16 cgroup_calc_subtree_ss_mask(u16 subtree_control, u16 this_ss_mask) */ void cgroup_kn_unlock(struct kernfs_node *kn) { + struct cgroup_subsys_state *css; struct cgroup *cgrp; if (kernfs_type(kn) == KERNFS_DIR) - cgrp = kn->priv; + css = kn->priv; else - cgrp = kn->parent->priv; + css = kn->parent->priv; + cgrp = css->cgroup; mutex_unlock(&cgroup_mutex); @@ -1556,12 +1558,14 @@ void cgroup_kn_unlock(struct kernfs_node *kn) */ struct cgroup *cgroup_kn_lock_live(struct kernfs_node *kn, bool drain_offline) { + struct cgroup_subsys_state *css; struct cgroup *cgrp; if (kernfs_type(kn) == KERNFS_DIR) - cgrp = kn->priv; + css = kn->priv; else - cgrp = kn->parent->priv; + css = kn->parent->priv; + cgrp = css->cgroup; /* * We're gonna grab cgroup_mutex which nests outside kernfs @@ -1652,7 +1656,7 @@ static int cgroup_device_mkdir(struct cgroup_subsys_state *css) if 
(WARN_ON_ONCE(ret >= CGROUP_FILE_NAME_MAX)) return 0; - kn = kernfs_create_dir(cgrp->kn, name, cgrp->kn->mode, cgrp); + kn = kernfs_create_dir(cgrp->kn, name, cgrp->kn->mode, css); if (IS_ERR(kn)) return PTR_ERR(kn); css->device_kn = kn; @@ -1662,7 +1666,7 @@ static int cgroup_device_mkdir(struct cgroup_subsys_state *css) /* FIXME: prefix dev_name with bus_name for uniqueness? */ kn = kernfs_create_dir(css->device_kn, dev_name(device_css->device), - cgrp->kn->mode, cgrp); + cgrp->kn->mode, device_css); if (IS_ERR(kn)) return PTR_ERR(kn); /* FIXME: kernfs_get needed here? */ @@ -2025,7 +2029,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask) root->kf_root = kernfs_create_root(kf_sops, KERNFS_ROOT_CREATE_DEACTIVATED | KERNFS_ROOT_SUPPORT_EXPORTOP, - root_cgrp); + &root_cgrp->self); if (IS_ERR(root->kf_root)) { ret = PTR_ERR(root->kf_root); goto exit_root_id; @@ -3579,9 +3583,9 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; - struct cgroup *cgrp = of->kn->parent->priv; + struct cgroup_subsys_state *css = of_css(of); struct cftype *cft = of->kn->priv; - struct cgroup_subsys_state *css; + struct cgroup *cgrp = css->cgroup; int ret; /* @@ -3598,16 +3602,6 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf, if (cft->write) return cft->write(of, buf, nbytes, off); - /* - * kernfs guarantees that a file isn't deleted with operations in - * flight, which means that the matching css is and stays alive and - * doesn't need to be pinned. The RCU locking is not necessary - * either. It's just for the convenience of using cgroup_css(). 
- */ - rcu_read_lock(); - css = cgroup_css(cgrp, cft->ss); - rcu_read_unlock(); - if (cft->write_u64) { unsigned long long v; ret = kstrtoull(buf, 0, &v); @@ -5262,7 +5256,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode) } /* create the directory */ - kn = kernfs_create_dir(parent->kn, name, mode, cgrp); + kn = kernfs_create_dir(parent->kn, name, mode, &cgrp->self); if (IS_ERR(kn)) { ret = PTR_ERR(kn); goto out_destroy;

From patchwork Wed May 1 14:04:36 2019
X-Patchwork-Submitter: "Welty, Brian"
X-Patchwork-Id: 10925113
From: Brian Welty
To: cgroups@vger.kernel.org, Tejun Heo, Li Zefan, Johannes Weiner, linux-mm@kvack.org, Michal Hocko, Vladimir Davydov, dri-devel@lists.freedesktop.org, David Airlie, Daniel Vetter, intel-gfx@lists.freedesktop.org, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Christian König, Alex Deucher, ChunMing Zhou, Jérôme Glisse
Subject: [RFC PATCH 3/5] memcg: Add per-device support to memory cgroup subsystem
Date: Wed, 1 May 2019 10:04:36 -0400
Message-Id: <20190501140438.9506-4-brian.welty@intel.com>
In-Reply-To: <20190501140438.9506-1-brian.welty@intel.com>
References: <20190501140438.9506-1-brian.welty@intel.com>

Here we update memory cgroup to enable the newly introduced per-device framework. As mentioned in the prior patch, the intent here is to allow drivers to have their own private cgroup controls (such as memory limit) to be applied to device resources instead of host system resources.
In summary, to enable device registration for the memory cgroup subsystem:
 * set .allow_devices to true
 * add new exported device register and device unregister functions to register a device with the cgroup subsystem
 * implement the .device_css_alloc callback to create a device-specific cgroup_subsys_state within a cgroup

As cgroups are created, and for currently registered devices, one will see these additional files in the cgroup filesystem: mount/<cgroup>/memory.devices/<dev_name>/<memory files> Registration of a new device is performed by device drivers using the new mem_cgroup_device_register(). This creates the above files in existing cgroups. And for runtime charging to the cgroup, we add the following:
 * a new routine to look up the device-specific cgroup_subsys_state within the task's cgroup (mem_cgroup_device_from_task)
 * new functions for device-specific 'direct' charging

The last point involves adding the new mem_cgroup_try_charge_direct and mem_cgroup_uncharge_direct functions. The 'direct' name indicates that we charge the specified cgroup state directly, without using any associated page or mm_struct. They are called from device-specific memory management routines, where the device driver tracks which cgroup to charge within its own private data structures. With this initial submission, support for memory accounting and charging is functional. Nested cgroups correctly maintain the parent for device-specific state as well, so hierarchical charging to device files is supported.
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: dri-devel@lists.freedesktop.org
Cc: Matt Roper
Signed-off-by: Brian Welty
---
 include/linux/memcontrol.h |  10 ++
 mm/memcontrol.c            | 183 ++++++++++++++++++++++++++++++++++---
 2 files changed, 178 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dbb6118370c1..711669b613dc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -348,6 +348,11 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
 		bool compound);
 void mem_cgroup_uncharge(struct page *page);
 void mem_cgroup_uncharge_list(struct list_head *page_list);
+/* direct charging to mem_cgroup is primarily for device driver usage */
+int mem_cgroup_try_charge_direct(struct mem_cgroup *memcg,
+			unsigned long nr_pages);
+void mem_cgroup_uncharge_direct(struct mem_cgroup *memcg,
+			unsigned long nr_pages);
 
 void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
 
@@ -395,6 +400,11 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+struct mem_cgroup *mem_cgroup_device_from_task(unsigned long id,
+			struct task_struct *p);
+int mem_cgroup_device_register(struct device *dev, unsigned long *dev_id);
+void mem_cgroup_device_unregister(unsigned long dev_id);
+
 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 81a0d3914ec9..2c8407aed0f5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -823,6 +823,47 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 }
 EXPORT_SYMBOL(mem_cgroup_from_task);
 
+int mem_cgroup_device_register(struct device *dev, unsigned long *dev_id)
+{
+	return cgroup_device_register(&memory_cgrp_subsys, dev, dev_id);
+}
+EXPORT_SYMBOL(mem_cgroup_device_register);
+
+void mem_cgroup_device_unregister(unsigned long dev_id)
+{
+	cgroup_device_unregister(&memory_cgrp_subsys, dev_id);
+}
+EXPORT_SYMBOL(mem_cgroup_device_unregister);
+
+/**
+ * mem_cgroup_device_from_task: Lookup device-specific memcg
+ * @id: device-specific id returned from mem_cgroup_device_register
+ * @p: task to lookup the memcg
+ *
+ * First use mem_cgroup_from_task to lookup and obtain a reference on
+ * the memcg associated with this task @p. Within this memcg, find the
+ * device-specific one associated with @id.
+ * However if mem_cgroup is disabled, NULL is returned.
+ */
+struct mem_cgroup *mem_cgroup_device_from_task(unsigned long id,
+					       struct task_struct *p)
+{
+	struct mem_cgroup *memcg;
+	struct mem_cgroup *dev_memcg = NULL;
+
+	if (mem_cgroup_disabled())
+		return NULL;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(p);
+	if (memcg)
+		dev_memcg = idr_find(&memcg->css.device_css_idr, id);
+	rcu_read_unlock();
+
+	return dev_memcg;
+}
+EXPORT_SYMBOL(mem_cgroup_device_from_task);
+
 /**
  * get_mem_cgroup_from_mm: Obtain a reference on given mm_struct's memcg.
  * @mm: mm from which memcg should be extracted. It can be NULL.
@@ -2179,13 +2220,31 @@ void mem_cgroup_handle_over_high(void)
 	current->memcg_nr_pages_over_high = 0;
 }
 
+static bool __try_charge(struct mem_cgroup *memcg, unsigned int nr_pages,
+			 struct mem_cgroup **mem_over_limit)
+{
+	struct page_counter *counter;
+
+	if (!do_memsw_account() ||
+	    page_counter_try_charge(&memcg->memsw, nr_pages, &counter)) {
+		if (page_counter_try_charge(&memcg->memory, nr_pages, &counter))
+			return true;
+		if (do_memsw_account())
+			page_counter_uncharge(&memcg->memsw, nr_pages);
+		*mem_over_limit = mem_cgroup_from_counter(counter, memory);
+	} else {
+		*mem_over_limit = mem_cgroup_from_counter(counter, memsw);
+	}
+
+	return false;
+}
+
 static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		      unsigned int nr_pages)
 {
 	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *mem_over_limit;
-	struct page_counter *counter;
 	unsigned long nr_reclaimed;
 	bool may_swap = true;
 	bool drained = false;
@@ -2198,17 +2257,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (consume_stock(memcg, nr_pages))
 		return 0;
 
-	if (!do_memsw_account() ||
-	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
-		if (page_counter_try_charge(&memcg->memory, batch, &counter))
-			goto done_restock;
-		if (do_memsw_account())
-			page_counter_uncharge(&memcg->memsw, batch);
-		mem_over_limit = mem_cgroup_from_counter(counter, memory);
-	} else {
-		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
-		may_swap = false;
-	}
+	if (__try_charge(memcg, batch, &mem_over_limit))
+		goto done_restock;
+	else
+		may_swap = !do_memsw_account();
 
 	if (batch > nr_pages) {
 		batch = nr_pages;
@@ -2892,6 +2944,9 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 {
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 
+	if (memcg->css.device)
+		return 0;
+
 	/* we call try-to-free pages for make this cgroup empty */
 	lru_add_drain_all();
 
@@ -4496,7 +4551,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 }
 
 static struct cgroup_subsys_state * __ref
-mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
+__mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css, bool is_device)
 {
 	struct mem_cgroup *parent = mem_cgroup_from_css(parent_css);
 	struct mem_cgroup *memcg;
@@ -4530,11 +4585,13 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		 * much sense so let cgroup subsystem know about this
 		 * unfortunate state in our controller.
 		 */
-		if (parent != root_mem_cgroup)
+		if (!is_device && parent != root_mem_cgroup)
 			memory_cgrp_subsys.broken_hierarchy = true;
 	}
 
-	/* The following stuff does not apply to the root */
+	/* The following stuff does not apply to devices or the root */
+	if (is_device)
+		return &memcg->css;
 	if (!parent) {
 		root_mem_cgroup = memcg;
 		return &memcg->css;
@@ -4554,6 +4611,34 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	return ERR_PTR(-ENOMEM);
 }
 
+static struct cgroup_subsys_state * __ref
+mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+	return __mem_cgroup_css_alloc(parent_css, false);
+}
+
+/*
+ * For given @cgroup_css, we create and return new device-specific css.
+ *
+ * @device and @cgroup_css are unused here, but they are provided as other
+ * cgroup subsystems might require them.
+ */
+static struct cgroup_subsys_state * __ref
+mem_cgroup_device_css_alloc(struct device *device,
+			    struct cgroup_subsys_state *cgroup_css,
+			    struct cgroup_subsys_state *parent_device_css)
+{
+	/*
+	 * For hierarchical page counters to work correctly, we specify
+	 * parent here as the device-specific css from our parent css
+	 * (@parent_device_css). In other words, for nested cgroups,
+	 * the device-specific charging structures are also nested.
+	 * Note, caller will itself set .device and .parent in returned
+	 * structure.
+	 */
+	return __mem_cgroup_css_alloc(parent_device_css, true);
+}
+
 static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
@@ -4613,6 +4698,9 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
+	if (css->device)
+		goto free_cgrp;
+
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
 		static_branch_dec(&memcg_sockets_enabled_key);
 
@@ -4624,6 +4712,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	mem_cgroup_remove_from_trees(memcg);
 	memcg_free_shrinker_maps(memcg);
 	memcg_free_kmem(memcg);
+free_cgrp:
 	mem_cgroup_free(memcg);
 }
 
@@ -5720,6 +5809,7 @@ static struct cftype memory_files[] = {
 
 struct cgroup_subsys memory_cgrp_subsys = {
 	.css_alloc = mem_cgroup_css_alloc,
+	.device_css_alloc = mem_cgroup_device_css_alloc,
 	.css_online = mem_cgroup_css_online,
 	.css_offline = mem_cgroup_css_offline,
 	.css_released = mem_cgroup_css_released,
@@ -5732,6 +5822,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.dfl_cftypes = memory_files,
 	.legacy_cftypes = mem_cgroup_legacy_files,
 	.early_init = 0,
+	.allow_devices = true,
 };
 
 /**
@@ -6031,6 +6122,68 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
 	cancel_charge(memcg, nr_pages);
 }
 
+/**
+ * mem_cgroup_try_charge_direct - try charging nr_pages to memcg
+ * @memcg: memcg to charge
+ * @nr_pages: number of pages to charge
+ *
+ * Try to charge @nr_pages to specified @memcg. This variant is intended
+ * where the memcg is known and can be directly charged, with the primary
+ * use case being in device drivers that have registered with the subsys.
+ * Device drivers that implement their own device-specific memory manager
+ * will use these direct charging functions to make charges against their
+ * device-private state (@memcg) within the cgroup.
+ *
+ * There is no separate mem_cgroup_commit_charge() in this use case, as the
+ * device driver is not using page structs. Reclaim is not needed internally
+ * here, as the caller can decide to attempt memory reclaim on error.
+ *
+ * Returns 0 on success. Otherwise, an error code is returned.
+ *
+ * To uncharge (or cancel charge), call mem_cgroup_uncharge_direct().
+ */
+int mem_cgroup_try_charge_direct(struct mem_cgroup *memcg,
+				 unsigned long nr_pages)
+{
+	struct mem_cgroup *mem_over_limit;
+	int ret = 0;
+
+	if (!memcg || mem_cgroup_disabled() || mem_cgroup_is_root(memcg))
+		return 0;
+
+	if (__try_charge(memcg, nr_pages, &mem_over_limit)) {
+		css_get_many(&memcg->css, nr_pages);
+	} else {
+		memcg_memory_event(mem_over_limit, MEMCG_MAX);
+		ret = -ENOMEM;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(mem_cgroup_try_charge_direct);
+
+/**
+ * mem_cgroup_uncharge_direct - uncharge nr_pages from memcg
+ * @memcg: memcg to uncharge
+ * @nr_pages: number of pages to uncharge
+ *
+ * Uncharge @nr_pages from the specified @memcg. This variant is intended
+ * where the memcg is known and can be directly uncharged, with the primary
+ * use case being in device drivers that have registered with the subsys.
+ * Device drivers use these direct charging functions to make charges
+ * against their device-private state (@memcg) within the cgroup.
+ */
+void mem_cgroup_uncharge_direct(struct mem_cgroup *memcg,
+				unsigned long nr_pages)
+{
+	if (!memcg || mem_cgroup_disabled())
+		return;
+
+	cancel_charge(memcg, nr_pages);
+}
+EXPORT_SYMBOL(mem_cgroup_uncharge_direct);
+
 struct uncharge_gather {
 	struct mem_cgroup *memcg;
 	unsigned long pgpgout;

From patchwork Wed May 1 14:04:37 2019
From: Brian Welty
To: cgroups@vger.kernel.org, Tejun Heo, Li Zefan, Johannes Weiner, linux-mm@kvack.org, Michal Hocko, Vladimir Davydov, dri-devel@lists.freedesktop.org, David Airlie, Daniel Vetter, intel-gfx@lists.freedesktop.org, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Christian König, Alex Deucher, ChunMing Zhou, Jérôme Glisse
Subject: [RFC PATCH 4/5] drm: Add memory cgroup registration and DRIVER_CGROUPS feature bit
Date: Wed, 1 May 2019 10:04:37 -0400
Message-Id: <20190501140438.9506-5-brian.welty@intel.com>
In-Reply-To: <20190501140438.9506-1-brian.welty@intel.com>

With the new cgroups per-device framework, registration with the memory cgroup subsystem can allow us to enforce limits for allocation of device memory against process cgroups.

This patch adds a new driver feature bit, DRIVER_CGROUPS, such that DRM will register the device with cgroups. Doing so allows device drivers to charge memory allocations to device-specific state within the cgroup. Note, this is only for GEM objects allocated from device memory.
Memory charging for GEM objects using system memory is already handled by the mm subsystem charging the normal (non-device) memory cgroup.

To charge device memory allocations, we need to (1) identify the appropriate cgroup to charge (currently decided at object creation time), and (2) make the charging call at the time that memory pages are being allocated. This is one policy, and it is open for debate whether it is the right choice.

For (1), we associate the current task's cgroup with GEM objects as they are created. That cgroup will be charged/uncharged for all paging activity against the GEM object. Note, if the process is not part of a memory cgroup, then this returns NULL and no charging will occur. For shared objects, this may make the charge against a cgroup that is not the same cgroup as the process using the memory. Based on the memory cgroup's discussion of "memory ownership", this seems acceptable [1].

For (2), this is for device drivers to implement within the appropriate page allocation logic.
[1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt, "Memory Ownership"

Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: dri-devel@lists.freedesktop.org
Cc: Matt Roper
Signed-off-by: Brian Welty
---
 drivers/gpu/drm/drm_drv.c | 12 ++++++++++++
 drivers/gpu/drm/drm_gem.c |  7 +++++++
 include/drm/drm_device.h  |  3 +++
 include/drm/drm_drv.h     |  8 ++++++++
 include/drm/drm_gem.h     | 11 +++++++++++
 5 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 862621494a93..890bd3c0e63e 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -28,6 +28,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -987,6 +988,12 @@ int drm_dev_register(struct drm_device *dev, unsigned long flags)
 	if (ret)
 		goto err_minors;
 
+	if (dev->dev && drm_core_check_feature(dev, DRIVER_CGROUPS)) {
+		ret = mem_cgroup_device_register(dev->dev, &dev->memcg_id);
+		if (ret)
+			goto err_minors;
+	}
+
 	dev->registered = true;
 
 	if (dev->driver->load) {
@@ -1009,6 +1016,8 @@ int drm_dev_register(struct drm_device *dev, unsigned long flags)
 	goto out_unlock;
 
 err_minors:
+	if (dev->memcg_id)
+		mem_cgroup_device_unregister(dev->memcg_id);
 	remove_compat_control_link(dev);
 	drm_minor_unregister(dev, DRM_MINOR_PRIMARY);
 	drm_minor_unregister(dev, DRM_MINOR_RENDER);
@@ -1052,6 +1061,9 @@ void drm_dev_unregister(struct drm_device *dev)
 		drm_legacy_rmmaps(dev);
 
+	if (dev->memcg_id)
+		mem_cgroup_device_unregister(dev->memcg_id);
+
 	remove_compat_control_link(dev);
 	drm_minor_unregister(dev, DRM_MINOR_PRIMARY);
 	drm_minor_unregister(dev, DRM_MINOR_RENDER);

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 50de138c89e0..966fbd701deb 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -38,6 +38,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -281,6 +282,9 @@ drm_gem_handle_delete(struct drm_file *filp, u32 handle)
 	if (IS_ERR_OR_NULL(obj))
 		return -EINVAL;
 
+	/* Release reference on cgroup used with GEM object charging */
+	mem_cgroup_put(obj->memcg);
+
 	/* Release driver's reference and decrement refcount. */
 	drm_gem_object_release_handle(handle, obj, filp);
 
@@ -410,6 +414,9 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
 		goto err_revoke;
 	}
 
+	/* Acquire reference on cgroup for charging GEM memory allocations */
+	obj->memcg = mem_cgroup_device_from_task(dev->memcg_id, current);
+
 	*handlep = handle;
 	return 0;

diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index 7f9ef709b2b6..9859f2289066 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -190,6 +190,9 @@ struct drm_device {
 	 */
 	int irq;
 
+	/* @memcg_id: cgroup subsys (memcg) index for our device state */
+	unsigned long memcg_id;
+
 	/**
 	 * @vblank_disable_immediate:
 	 *

diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 5cc7f728ec73..13b0e0b9527f 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -92,6 +92,14 @@ enum drm_driver_feature {
 	 */
 	DRIVER_SYNCOBJ = BIT(5),
 
+	/**
+	 * @DRIVER_CGROUPS:
+	 *
+	 * Driver supports and requests DRM to register with cgroups during
+	 * drm_dev_register().
+	 */
+	DRIVER_CGROUPS = BIT(6),
+
 	/* IMPORTANT: Below are all the legacy flags, add new ones above. */
 
 	/**

diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
index 5047c7ee25f5..ca90ea512e45 100644
--- a/include/drm/drm_gem.h
+++ b/include/drm/drm_gem.h
@@ -34,6 +34,7 @@
  * OTHER DEALINGS IN THE SOFTWARE.
  */
 
+#include
 #include
 #include
 
@@ -202,6 +203,16 @@ struct drm_gem_object {
 	 */
 	struct file *filp;
 
+	/**
+	 * @memcg:
+	 *
+	 * cgroup used for charging GEM object page allocations against. This
+	 * is set to the current cgroup during GEM object creation.
+	 * Charging policy is up to each DRM driver to decide, but intent is to
+	 * charge during page allocation and use for device memory only.
+	 */
+	struct mem_cgroup *memcg;
+
 	/**
 	 * @vma_node:
 	 *

From patchwork Wed May 1 14:04:38 2019
From: Brian Welty
To: cgroups@vger.kernel.org, Tejun Heo, Li Zefan, Johannes Weiner, linux-mm@kvack.org, Michal Hocko, Vladimir Davydov, dri-devel@lists.freedesktop.org, David Airlie, Daniel Vetter, intel-gfx@lists.freedesktop.org, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Christian König, Alex Deucher, ChunMing Zhou, Jérôme Glisse
Subject: [RFC PATCH 5/5] drm/i915: Use memory cgroup for enforcing device memory limit
Date: Wed, 1 May 2019 10:04:38 -0400
Message-Id: <20190501140438.9506-6-brian.welty@intel.com>
In-Reply-To: <20190501140438.9506-1-brian.welty@intel.com>

The i915 driver now includes DRIVER_CGROUPS in its feature bits.

To charge device memory allocations, we need to (1) identify the appropriate cgroup to charge (currently decided at object creation time), and (2) make the charging call at the time that memory pages are being allocated.

For (1), see the prior DRM patch, which associates the current task's cgroup with GEM objects as they are created. That cgroup will be charged/uncharged for all paging activity against the GEM object.

For (2), we call mem_cgroup_try_charge_direct() in the .get_pages callback for the GEM object type. Uncharging is done in .put_pages when the memory is marked such that it can be evicted.
The try_charge() call will fail with -ENOMEM if the current memory allocation would exceed the cgroup device memory maximum, allowing the driver to perform memory reclaim.

Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: dri-devel@lists.freedesktop.org
Cc: Matt Roper
Signed-off-by: Brian Welty
---
 drivers/gpu/drm/i915/i915_drv.c            |  2 +-
 drivers/gpu/drm/i915/intel_memory_region.c | 24 ++++++++++++++++++----
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 5a0a59922cb4..4d496c3c3681 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -3469,7 +3469,7 @@ static struct drm_driver driver = {
 	 * deal with them for Intel hardware.
 	 */
 	.driver_features =
-	    DRIVER_GEM | DRIVER_PRIME |
+	    DRIVER_GEM | DRIVER_PRIME | DRIVER_CGROUPS |
 	    DRIVER_RENDER | DRIVER_MODESET | DRIVER_ATOMIC | DRIVER_SYNCOBJ,
 	.release = i915_driver_release,
 	.open = i915_driver_open,

diff --git a/drivers/gpu/drm/i915/intel_memory_region.c b/drivers/gpu/drm/i915/intel_memory_region.c
index 813ff83c132b..e4ac5e4d4857 100644
--- a/drivers/gpu/drm/i915/intel_memory_region.c
+++ b/drivers/gpu/drm/i915/intel_memory_region.c
@@ -53,6 +53,8 @@ i915_memory_region_put_pages_buddy(struct drm_i915_gem_object *obj,
 	mutex_unlock(&obj->memory_region->mm_lock);
 
 	obj->mm.dirty = false;
+	mem_cgroup_uncharge_direct(obj->base.memcg,
+				   obj->base.size >> PAGE_SHIFT);
 }
 
 int
@@ -65,19 +67,29 @@ i915_memory_region_get_pages_buddy(struct drm_i915_gem_object *obj)
 	struct scatterlist *sg;
 	unsigned int sg_page_sizes;
 	unsigned long n_pages;
+	int err;
 
 	GEM_BUG_ON(!IS_ALIGNED(size, mem->mm.min_size));
 	GEM_BUG_ON(!list_empty(&obj->blocks));
 
+	err = mem_cgroup_try_charge_direct(obj->base.memcg, size >> PAGE_SHIFT);
+	if (err) {
+		DRM_DEBUG("MEMCG: try_charge failed for %lld\n", size);
+		return err;
+	}
+
 	st = kmalloc(sizeof(*st), GFP_KERNEL);
-	if (!st)
-		return -ENOMEM;
+	if (!st) {
+		err = -ENOMEM;
+		goto err_uncharge;
+	}
 
 	n_pages = div64_u64(size, mem->mm.min_size);
 
 	if (sg_alloc_table(st, n_pages, GFP_KERNEL)) {
 		kfree(st);
-		return -ENOMEM;
+		err = -ENOMEM;
+		goto err_uncharge;
 	}
 
 	sg = st->sgl;
@@ -161,7 +173,11 @@ i915_memory_region_get_pages_buddy(struct drm_i915_gem_object *obj)
 err_free_blocks:
 	memory_region_free_pages(obj, st);
 	mutex_unlock(&mem->mm_lock);
-	return -ENXIO;
+	err = -ENXIO;
+err_uncharge:
+	mem_cgroup_uncharge_direct(obj->base.memcg,
+				   obj->base.size >> PAGE_SHIFT);
+	return err;
 }
 
 int
 i915_memory_region_init_buddy(struct intel_memory_region *mem)