From patchwork Sat Mar 17 00:08:58 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Matt Roper
X-Patchwork-Id: 10290525
From: Matt Roper
To: dri-devel@lists.freedesktop.org, intel-gfx@lists.freedesktop.org,
 cgroups@vger.kernel.org
Subject: [PATCH v4 1/8] cgroup: Allow registration and lookup of cgroup
 private data (v2)
Date: Fri, 16 Mar 2018 17:08:58 -0700
Message-Id: <20180317000905.7091-2-matthew.d.roper@intel.com>
X-Mailer: git-send-email 2.14.3
In-Reply-To: <20180317000905.7091-1-matthew.d.roper@intel.com>
References: <20180317000905.7091-1-matthew.d.roper@intel.com>
List-Id: Direct Rendering Infrastructure - Development
Cc: Tejun Heo, Roman Gushchin, Alexei Starovoitov

There are cases where other parts of the kernel may wish to store data
associated with individual cgroups without building a full cgroup
controller.  Let's add interfaces to allow them to register and look up
this private data for individual cgroups.

A kernel system (e.g., a driver) that wishes to register private data
for a cgroup should start by obtaining a unique private data key by
calling cgroup_priv_getkey().  It may then associate private data with a
cgroup by calling cgroup_priv_install(cgrp, key, ref) where 'ref' is a
pointer to a kref field inside the private data structure.  The private
data may later be looked up by calling cgroup_priv_get(cgrp, key) to
obtain a new reference to the private data.  Private data may be
unregistered via cgroup_priv_release(cgrp, key).

If a cgroup is removed, the reference count for all private data objects
will be decremented.

v2: Significant overhaul suggested by Tejun, Alexei, and Roman
 - Rework interface to make consumers obtain an IDR-based key rather
   than supplying their own arbitrary void*
 - Internal implementation now uses per-cgroup radix trees which should
   allow much faster lookup than the previous hashtable approach
 - Private data is registered via kref, allowing a single private data
   structure to potentially be assigned to multiple cgroups.

Cc: Tejun Heo
Cc: Alexei Starovoitov
Cc: Roman Gushchin
Cc: cgroups@vger.kernel.org
Signed-off-by: Matt Roper
---
 include/linux/cgroup-defs.h |   8 ++
 include/linux/cgroup.h      |   7 ++
 kernel/cgroup/cgroup.c      | 185 +++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 197 insertions(+), 3 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 9f242b876fde..465006274a84 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -427,6 +427,14 @@ struct cgroup {
 	/* used to store eBPF programs */
 	struct cgroup_bpf bpf;
 
+	/*
+	 * cgroup private data registered by other non-controller parts of the
+	 * kernel.  Insertions are protected by privdata_lock, lookups by
+	 * rcu_read_lock().
+	 */
+	struct radix_tree_root privdata;
+	spinlock_t privdata_lock;
+
 	/* ids of the ancestors at each level including self */
 	int ancestor_ids[];
 };
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 473e0c0abb86..63d22dfa00bd 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -833,4 +833,11 @@ static inline void put_cgroup_ns(struct cgroup_namespace *ns)
 		free_cgroup_ns(ns);
 }
 
+/* cgroup private data handling */
+int cgroup_priv_getkey(void (*free)(struct kref *));
+void cgroup_priv_destroykey(int key);
+int cgroup_priv_install(struct cgroup *cgrp, int key, struct kref *ref);
+struct kref *cgroup_priv_get(struct cgroup *cgrp, int key);
+void cgroup_priv_release(struct cgroup *cgrp, int key);
+
 #endif /* _LINUX_CGROUP_H */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8cda3bc3ae22..a5e2017c9a94 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -81,10 +81,17 @@ EXPORT_SYMBOL_GPL(css_set_lock);
 #endif
 
 /*
- * Protects cgroup_idr and css_idr so that IDs can be released without
- * grabbing cgroup_mutex.
+ * ID allocator for cgroup private data keys; the IDs allocated here will
+ * be used to index all per-cgroup radix trees.  The radix tree built into
+ * the IDR itself will store a key-specific function to be passed to kref_put.
  */
-static DEFINE_SPINLOCK(cgroup_idr_lock);
+static DEFINE_IDR(cgroup_priv_idr);
+
+/*
+ * Protects cgroup_idr, css_idr, and cgroup_priv_idr so that IDs can be
+ * released without grabbing cgroup_mutex.
+ */
+DEFINE_SPINLOCK(cgroup_idr_lock);
 
 /*
  * Protects cgroup_file->kn for !self csses.  It synchronizes notifications
@@ -1839,6 +1846,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	INIT_LIST_HEAD(&cgrp->cset_links);
 	INIT_LIST_HEAD(&cgrp->pidlists);
 	mutex_init(&cgrp->pidlist_mutex);
+	INIT_RADIX_TREE(&cgrp->privdata, GFP_NOWAIT);
+	spin_lock_init(&cgrp->privdata_lock);
 	cgrp->self.cgroup = cgrp;
 	cgrp->self.flags |= CSS_ONLINE;
 	cgrp->dom_cgrp = cgrp;
@@ -4578,6 +4587,8 @@ static void css_release_work_fn(struct work_struct *work)
 		container_of(work, struct cgroup_subsys_state, destroy_work);
 	struct cgroup_subsys *ss = css->ss;
 	struct cgroup *cgrp = css->cgroup;
+	struct radix_tree_iter iter;
+	void **slot;
 
 	mutex_lock(&cgroup_mutex);
 
@@ -4617,6 +4628,12 @@ static void css_release_work_fn(struct work_struct *work)
 					 NULL);
 
 		cgroup_bpf_put(cgrp);
+
+		/* Drop reference on any private data */
+		rcu_read_lock();
+		radix_tree_for_each_slot(slot, &cgrp->privdata, &iter, 0)
+			cgroup_priv_release(cgrp, iter.index);
+		rcu_read_unlock();
 	}
 
 	mutex_unlock(&cgroup_mutex);
@@ -5932,3 +5949,165 @@ static int __init cgroup_sysfs_init(void)
 }
 subsys_initcall(cgroup_sysfs_init);
 #endif /* CONFIG_SYSFS */
+
+/**
+ * cgroup_priv_getkey - obtain a new cgroup_priv lookup key
+ * @free: Function to release private data associated with this key
+ *
+ * Allows non-controller kernel subsystems to register a new key that will
+ * be used to insert/lookup private data associated with individual cgroups.
+ * Private data lookup tables are implemented as per-cgroup radix trees.
+ *
+ * Returns:
+ * A positive integer lookup key if successful, or a negative error code
+ * on failure (e.g., if ID allocation fails).
+ */
+int
+cgroup_priv_getkey(void (*free)(struct kref *))
+{
+	int ret;
+
+	WARN_ON(!free);
+
+	idr_preload(GFP_KERNEL);
+	spin_lock_bh(&cgroup_idr_lock);
+	ret = idr_alloc(&cgroup_priv_idr, free, 1, 0, GFP_NOWAIT);
+	spin_unlock_bh(&cgroup_idr_lock);
+	idr_preload_end();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(cgroup_priv_getkey);
+
+/**
+ * cgroup_priv_destroykey - release a cgroup_priv key
+ * @key: Private data key to be released
+ *
+ * Removes a cgroup private data key and all private data associated with this
+ * key in any cgroup.  This is a heavy operation that will take cgroup_mutex.
+ */
+void
+cgroup_priv_destroykey(int key)
+{
+	struct cgroup *cgrp;
+
+	WARN_ON(key == 0);
+
+	mutex_lock(&cgroup_mutex);
+	cgroup_for_each_live_child(cgrp, &cgrp_dfl_root.cgrp)
+		cgroup_priv_release(cgrp, key);
+	idr_remove(&cgroup_priv_idr, key);
+	mutex_unlock(&cgroup_mutex);
+}
+EXPORT_SYMBOL_GPL(cgroup_priv_destroykey);
+
+/**
+ * cgroup_priv_install - install new cgroup private data
+ * @cgrp: cgroup to store private data for
+ * @key: key uniquely identifying kernel owner of private data
+ * @ref: pointer to kref field of private data structure
+ *
+ * Allows non-controller kernel subsystems to register their own private data
+ * associated with a cgroup.  This will often be used by drivers which wish to
+ * track their own per-cgroup data without building a full cgroup controller.
+ *
+ * The caller is responsible for ensuring that no private data already exists
+ * for the given key.
+ *
+ * Registering cgroup private data with this function will increment the
+ * reference count for the private data structure.  If the cgroup is removed,
+ * the reference count will be decremented, allowing the private data to
+ * be freed if there are no other outstanding references.
+ *
+ * Returns:
+ * 0 on success, otherwise a negative error code on failure.
+ */
+int
+cgroup_priv_install(struct cgroup *cgrp, int key, struct kref *ref)
+{
+	int ret;
+
+	WARN_ON(cgrp->root != &cgrp_dfl_root);
+	WARN_ON(key == 0);
+
+	kref_get(ref);
+
+	ret = radix_tree_preload(GFP_KERNEL);
+	if (ret)
+		return ret;
+
+	spin_lock_bh(&cgrp->privdata_lock);
+	ret = radix_tree_insert(&cgrp->privdata, key, ref);
+	spin_unlock_bh(&cgrp->privdata_lock);
+	radix_tree_preload_end();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(cgroup_priv_install);
+
+/**
+ * cgroup_priv_get - looks up cgroup private data
+ * @cgrp: cgroup to look up private data for
+ * @key: key uniquely identifying owner of private data to look up
+ *
+ * Looks up the cgroup private data associated with a key.  The private
+ * data's reference count is incremented and a pointer to its kref field
+ * is returned to the caller (which can use container_of() to obtain
+ * the private data itself).
+ *
+ * Returns:
+ * A pointer to the private data's kref field, or NULL if no private data has
+ * been registered.
+ */
+struct kref *
+cgroup_priv_get(struct cgroup *cgrp, int key)
+{
+	struct kref *ref;
+
+	WARN_ON(cgrp->root != &cgrp_dfl_root);
+	WARN_ON(key == 0);
+
+	rcu_read_lock();
+
+	ref = radix_tree_lookup(&cgrp->privdata, key);
+	if (ref)
+		kref_get(ref);
+
+	rcu_read_unlock();
+
+	return ref;
+}
+EXPORT_SYMBOL_GPL(cgroup_priv_get);
+
+/**
+ * cgroup_priv_release - release cgroup private data
+ * @cgrp: cgroup to release private data for
+ * @key: key uniquely identifying owner of private data to release
+ *
+ * Removes private data associated with the given key from a cgroup's internal
+ * table and decrements the reference count for the private data removed,
+ * allowing it to be freed if no other references exist.
+ */
+void
+cgroup_priv_release(struct cgroup *cgrp, int key)
+{
+	struct kref *ref;
+	void (*free)(struct kref *);
+
+	WARN_ON(cgrp->root != &cgrp_dfl_root);
+	WARN_ON(key == 0);
+
+	rcu_read_lock();
+
+	free = idr_find(&cgroup_priv_idr, key);
+	WARN_ON(!free);
+
+	spin_lock_bh(&cgrp->privdata_lock);
+	ref = radix_tree_delete(&cgrp->privdata, key);
+	spin_unlock_bh(&cgrp->privdata_lock);
+	if (ref)
+		kref_put(ref, free);
+
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(cgroup_priv_release);
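
For readers who want to follow the reference counting without building a
kernel, here is a rough userspace model of the lifecycle the commit message
describes (getkey, install, get, release).  It is only a sketch under stated
assumptions: struct kref is a plain counter rather than the kernel type,
fixed arrays stand in for the IDR and the per-cgroup radix tree, there is no
locking or RCU, and every model_* and drv_* name is hypothetical and not part
of this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's struct kref; not the real type. */
struct kref { int refcount; };

static void kref_init(struct kref *k) { k->refcount = 1; }
static void kref_get(struct kref *k) { k->refcount++; }

/* Drop one reference; call release() when the count reaches zero. */
static void kref_put(struct kref *k, void (*release)(struct kref *))
{
	if (--k->refcount == 0)
		release(k);
}

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define MODEL_MAX_KEYS 8

/* Hypothetical key table: models the IDR mapping key -> free callback. */
static void (*model_key_free[MODEL_MAX_KEYS])(struct kref *);

/* Hypothetical cgroup: a fixed array stands in for the radix tree. */
struct model_cgroup { struct kref *privdata[MODEL_MAX_KEYS]; };

/* Allocate the lowest unused key >= 1, or return -1 if none is left. */
static int model_getkey(void (*free_fn)(struct kref *))
{
	for (int key = 1; key < MODEL_MAX_KEYS; key++) {
		if (!model_key_free[key]) {
			model_key_free[key] = free_fn;
			return key;
		}
	}
	return -1;
}

/* Store ref under key; the cgroup takes its own reference on success. */
static int model_install(struct model_cgroup *cgrp, int key, struct kref *ref)
{
	if (cgrp->privdata[key])
		return -1;
	kref_get(ref);
	cgrp->privdata[key] = ref;
	return 0;
}

/* Look up the data for key and hand the caller a new reference. */
static struct kref *model_get(struct model_cgroup *cgrp, int key)
{
	struct kref *ref = cgrp->privdata[key];

	if (ref)
		kref_get(ref);
	return ref;
}

/* Remove the entry and drop the cgroup's reference. */
static void model_release(struct model_cgroup *cgrp, int key)
{
	struct kref *ref = cgrp->privdata[key];

	cgrp->privdata[key] = NULL;
	if (ref)
		kref_put(ref, model_key_free[key]);
}

/* Example consumer data with an embedded kref, as the patch expects. */
struct drv_priv { int priority; struct kref ref; };

static int drv_freed;	/* counts how often drv_free() ran */

static void drv_free(struct kref *ref)
{
	free(container_of(ref, struct drv_priv, ref));
	drv_freed++;
}

/* Walk one object through the full lifecycle; returns the stored value. */
int run_lifecycle(void)
{
	struct model_cgroup cg = { { NULL } };
	struct drv_priv *p = malloc(sizeof(*p));
	struct kref *found;
	int key, prio;

	assert(p);
	key = model_getkey(drv_free);
	assert(key > 0);

	p->priority = 5;
	kref_init(&p->ref);		/* creator's reference, count 1 */
	if (model_install(&cg, key, &p->ref))	/* on success, count 2 */
		return -1;
	kref_put(&p->ref, drv_free);	/* creator drops its ref, count 1 */

	found = model_get(&cg, key);	/* lookup reference, count 2 */
	prio = container_of(found, struct drv_priv, ref)->priority;
	kref_put(found, drv_free);	/* count 1 */

	model_release(&cg, key);	/* count 0, drv_free() runs */
	return prio;
}
```

Calling run_lifecycle() from a small main() exercises the whole sequence.
The model takes the cgroup's reference only once the insert is known to
succeed; that ordering is a choice made for the sketch, not a description of
the code in the diff above.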