From patchwork Tue Jun 14 18:09:47 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Brian Foster
X-Patchwork-Id: 12881354
From: Brian Foster
To: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: ikent@redhat.com, onestero@redhat.com
Subject: [PATCH 1/3] radix-tree: propagate all tags in idr tree
Date: Tue, 14 Jun 2022 14:09:47 -0400
Message-Id: <20220614180949.102914-2-bfoster@redhat.com>
In-Reply-To: <20220614180949.102914-1-bfoster@redhat.com>
References: <20220614180949.102914-1-bfoster@redhat.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

The IDR tree has hardcoded tag propagation logic to handle the
internal IDR_FREE tag and ignore all others. Fix up the hardcoded logic
to support additional tags.

This is specifically to support a new internal IDR_TGID radix tree tag
used to improve search efficiency of pids with associated PIDTYPE_TGID
tasks within a pid namespace.
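
As a rough illustration of the intended tag handling, the following is
a minimal userspace sketch, not kernel code. It assumes each tag is a
single bit in an unsigned mask, and all MODEL_* constants and model_*
helpers are hypothetical names introduced only for this example. It
models two rules: when the tree grows, every root tag except the
IDR-managed IDR_FREE is copied to the new child, and when an entry is
deleted, an IDR slot regains IDR_FREE while every other tag is cleared.
The separate IDR_FREE bookkeeping on tree growth is deliberately left
out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel constants; values are illustrative. */
#define MODEL_MAX_TAGS 3
#define MODEL_IDR_FREE 0
#define MODEL_IDR_TGID 1

/*
 * Tag bits that the first slot of a newly added top-level node inherits
 * from the root when the tree grows. Each tag is one bit in the mask.
 */
static unsigned model_extend_child_tags(unsigned root_tags, bool is_idr)
{
	unsigned child_tags = 0;

	for (int tag = 0; tag < MODEL_MAX_TAGS; tag++) {
		/* An IDR manages IDR_FREE separately, so skip it here. */
		if (is_idr && tag == MODEL_IDR_FREE)
			continue;
		if (root_tags & (1u << tag))
			child_tags |= 1u << tag;
	}
	return child_tags;
}

/* Tag bits left on a slot after its entry is deleted. */
static unsigned model_delete_slot_tags(unsigned slot_tags, bool is_idr)
{
	for (int tag = 0; tag < MODEL_MAX_TAGS; tag++) {
		if (is_idr && tag == MODEL_IDR_FREE) {
			/* A deleted IDR slot becomes free again. */
			slot_tags |= 1u << tag;
			continue;
		}
		slot_tags &= ~(1u << tag);
	}
	return slot_tags;
}
```

In this model, an IDR root carrying both tags passes only the
MODEL_IDR_TGID bit to the new child, and deleting a slot tagged with
MODEL_IDR_TGID leaves only MODEL_IDR_FREE set.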
Signed-off-by: Brian Foster
---
 lib/radix-tree.c | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index b3afafe46fff..08eef33e7820 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -431,12 +431,14 @@ static int radix_tree_extend(struct radix_tree_root *root, gfp_t gfp,
 				tag_clear(node, IDR_FREE, 0);
 				root_tag_set(root, IDR_FREE);
 			}
-		} else {
-			/* Propagate the aggregated tag info to the new child */
-			for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
-				if (root_tag_get(root, tag))
-					tag_set(node, tag, 0);
-			}
+		}
+
+		/* Propagate the aggregated tag info to the new child */
+		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
+			if (is_idr(root) && tag == IDR_FREE)
+				continue;
+			if (root_tag_get(root, tag))
+				tag_set(node, tag, 0);
 		}
 
 		BUG_ON(shift > BITS_PER_LONG);
@@ -1368,11 +1370,13 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
 	unsigned offset = get_slot_offset(node, slot);
 	int tag;
 
-	if (is_idr(root))
-		node_tag_set(root, node, IDR_FREE, offset);
-	else
-		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
-			node_tag_clear(root, node, tag, offset);
+	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
+		if (is_idr(root) && tag == IDR_FREE) {
+			node_tag_set(root, node, tag, offset);
+			continue;
+		}
+		node_tag_clear(root, node, tag, offset);
+	}
 
 	replace_slot(slot, NULL, node, -1, values);
 	return node && delete_node(root, node);

From patchwork Tue Jun 14 18:09:48 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Brian Foster
X-Patchwork-Id: 12881353
From: Brian Foster
To: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: ikent@redhat.com, onestero@redhat.com
Subject: [PATCH 2/3] pid: use idr tag to hint pids associated with group
 leader tasks
Date: Tue, 14 Jun 2022 14:09:48 -0400
Message-Id: <20220614180949.102914-3-bfoster@redhat.com>
In-Reply-To: <20220614180949.102914-1-bfoster@redhat.com>
References: <20220614180949.102914-1-bfoster@redhat.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

Searching the pid_namespace for group leader tasks is a fairly
inefficient operation. Listing the root directory of a procfs mount
performs a linear walk of allocated pids, checking each for an
associated PIDTYPE_TGID task to determine whether to populate a
directory entry. This can cause a significant increase in readdir()
syscall latency in runtime environments that might have one or more
processes with significant thread counts.

To facilitate improved TGID pid searches, define a new IDR radix-tree
tag for struct pid entries that are likely to have an associated
PIDTYPE_TGID task. To keep the code simple and avoid having to maintain
synchronization between tag state and post-fork pid-task association
changes, the tag is applied to all pids initially allocated for tasks
that are cloned without CLONE_THREAD.

The semantics of the tag are thus that false positives are possible
(i.e. tagged pids without PIDTYPE_TGID tasks), but false negatives
(i.e. untagged pids with PIDTYPE_TGID tasks) are not allowed. For
example, a userspace task that does a setsid() followed by a fork() and
exit() in the initial task will leave the initial pid tagged (as it
remains allocated for the PIDTYPE_SID association) while the group
leader task associates with the pid allocated for the child fork. Once
set, the tag persists for the lifetime of the pid and is cleared when
the pid is freed and the associated entry is removed from the idr tree.

This is an effective optimization because false positives are
relatively uncommon, essentially don't add any overhead that doesn't
already exist (i.e.
having to check pid_task(..., PIDTYPE_TGID)), but still allows
filtering out large numbers of thread pids that are guaranteed to not
have TGID task association.

Define the new IDR_TGID radix tree tag and provide a couple of helpers
to set and check tag state. Set the tag in the allocation path when the
caller specifies that the pid is expected to track a group leader.
Since false negatives are not allowed, warn in the event that a
PIDTYPE_TGID task is ever attached to an untagged pid.

Signed-off-by: Brian Foster
---
 include/linux/idr.h | 11 +++++++++++
 include/linux/pid.h |  2 +-
 kernel/fork.c       |  2 +-
 kernel/pid.c        |  9 ++++++++-
 4 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index a0dce14090a9..11e0ccedfc92 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -27,6 +27,7 @@ struct idr {
  * to users. Use tag 0 to track whether a node has free space below it.
  */
 #define IDR_FREE	0
+#define IDR_TGID	1
 
 /* Set the IDR flag and the IDR_FREE tag */
 #define IDR_RT_MARKER	(ROOT_IS_IDR | (__force gfp_t)			\
@@ -174,6 +175,16 @@ static inline void idr_preload_end(void)
 	local_unlock(&radix_tree_preloads.lock);
 }
 
+static inline void idr_set_group_lead(struct idr *idr, unsigned long id)
+{
+	radix_tree_tag_set(&idr->idr_rt, id, IDR_TGID);
+}
+
+static inline bool idr_is_group_lead(struct idr *idr, unsigned long id)
+{
+	return radix_tree_tag_get(&idr->idr_rt, id, IDR_TGID);
+}
+
 /**
  * idr_for_each_entry() - Iterate over an IDR's elements of a given type.
  * @idr: IDR handle.
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 343abf22092e..31f3cf765cee 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -134,7 +134,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 
 extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
-			     size_t set_tid_size);
+			     size_t set_tid_size, bool group_leader);
 extern void free_pid(struct pid *pid);
 extern void disable_pid_allocation(struct pid_namespace *ns);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 9d44f2d46c69..3c52f45ec93e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2254,7 +2254,7 @@ static __latent_entropy struct task_struct *copy_process(
 
 	if (pid != &init_struct_pid) {
 		pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
-				args->set_tid_size);
+				args->set_tid_size, !(clone_flags & CLONE_THREAD));
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_thread;
diff --git a/kernel/pid.c b/kernel/pid.c
index 2fc0a16ec77b..5a745c35475e 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -157,7 +157,7 @@ void free_pid(struct pid *pid)
 }
 
 struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
-		      size_t set_tid_size)
+		      size_t set_tid_size, bool group_leader)
 {
 	struct pid *pid;
 	enum pid_type type;
@@ -272,6 +272,8 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 	for ( ; upid >= pid->numbers; --upid) {
 		/* Make the PID visible to find_pid_ns. */
 		idr_replace(&upid->ns->idr, pid, upid->nr);
+		if (group_leader)
+			idr_set_group_lead(&upid->ns->idr, upid->nr);
 		upid->ns->pid_allocated++;
 	}
 	spin_unlock_irq(&pidmap_lock);
@@ -331,6 +333,11 @@ static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
 void attach_pid(struct task_struct *task, enum pid_type type)
 {
 	struct pid *pid = *task_pid_ptr(task, type);
+	struct pid_namespace *pid_ns = ns_of_pid(pid);
+	pid_t pid_nr = pid_nr_ns(pid, pid_ns);
+
+	WARN_ON(type == PIDTYPE_TGID &&
+		!idr_is_group_lead(&pid_ns->idr, pid_nr));
 	hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
 }
 

From patchwork Tue Jun 14 18:09:49 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Brian Foster
X-Patchwork-Id: 12881355
From: Brian Foster
To: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: ikent@redhat.com, onestero@redhat.com
Subject: [PATCH 3/3] proc: use idr tgid tag hint to iterate pids in readdir
Date: Tue, 14 Jun 2022 14:09:49 -0400
Message-Id: <20220614180949.102914-4-bfoster@redhat.com>
In-Reply-To: <20220614180949.102914-1-bfoster@redhat.com>
References: <20220614180949.102914-1-bfoster@redhat.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

The tgid pid/task scan in proc_pid_readdir() is rather inefficient. It
linearly walks the pid_namespace and checks each allocated pid for an
associated PIDTYPE_TGID task. This has been shown to impact getdents()
latency in environments that might have processes with very large
thread counts. For example, on a mostly idle 2.4GHz Intel Xeon running
Fedora on 5.19.0-rc2, 'strace -T xfs_io -c readdir /proc' shows the
following:

  getdents64(... /* 814 entries */, 32768) = 20624 <0.000568>

With the addition of a dummy (i.e. idle) process that creates an
additional 100k threads, that latency increases to:

  getdents64(... /* 815 entries */, 32768) = 20656 <0.011315>

While this may not be noticeable in one-off /proc scans or simple usage
of ps or top, we have users that report problems caused by this latency
increase in these sorts of scaled environments with custom tooling that
makes heavier use of task monitoring.

Optimize the tgid task scanning in proc_pid_readdir() by using IDR_TGID
tag lookups in the pid namespace tree. Tagged pids are not guaranteed
to have an associated PIDTYPE_TGID task, but pids that do are always
tagged. This significantly improves readdir() latency when the pid
namespace is populated with group leader tasks with unusually large
thread counts. For example, the above 100k idle thread test against a
patched kernel now results in the following:

  Idle:
  getdents64(... /* 861 entries */, 32768) = 21048 <0.000670>

  Idle + 100k threads:
  getdents64(... /* 862 entries */, 32768) = 21096 <0.000959>

... which is a much smaller latency hit after the high thread count
task is started.
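
To make the tag semantics concrete, the following is a minimal
userspace sketch, not the kernel implementation. It assumes a sorted
array of hypothetical struct model_pid entries in place of the pid
namespace idr, and every model_* name is invented for this example. It
models how a scan that trusts the tag only as a hint still has to
check for an attached group leader and retry past false positives,
much as the retry loop in next_tgid() appears to do around the lookup.
The model scans linearly, so it shows the semantics only, not the
performance benefit of a tagged radix tree lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for one allocated pid. */
struct model_pid {
	int nr;             /* pid number */
	bool tagged;        /* models the IDR_TGID tag */
	bool has_tgid_task; /* models pid_task(pid, PIDTYPE_TGID) != NULL */
};

/*
 * Models the tagged lookup: first tagged pid with nr >= start, or NULL.
 * The array must be sorted by nr.
 */
static const struct model_pid *model_find_tagged(const struct model_pid *pids,
						 size_t n, int start)
{
	for (size_t i = 0; i < n; i++)
		if (pids[i].nr >= start && pids[i].tagged)
			return &pids[i];
	return NULL;
}

/*
 * Models the scan with retry: return the number of the next group leader
 * at or after start, or -1 if there is none.
 */
static int model_next_tgid(const struct model_pid *pids, size_t n, int start)
{
	const struct model_pid *pid;

	while ((pid = model_find_tagged(pids, n, start)) != NULL) {
		if (pid->has_tgid_task)
			return pid->nr;
		/* False positive: tagged, but no group leader attached. */
		start = pid->nr + 1;
	}
	return -1;
}
```

Untagged entries (the thread pids) are never examined for a task, while
a tagged entry without a group leader costs one extra check before the
scan moves on.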
Signed-off-by: Brian Foster
---
 fs/proc/base.c      |  2 +-
 include/linux/idr.h | 14 ++++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 8dfa36a99c74..fd3c8a5f8c2d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3436,7 +3436,7 @@ static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter ite
 	rcu_read_lock();
 retry:
 	iter.task = NULL;
-	pid = find_ge_pid(iter.tgid, ns);
+	pid = find_tgid_pid(&ns->idr, iter.tgid);
 	if (pid) {
 		iter.tgid = pid_nr_ns(pid, ns);
 		iter.task = pid_task(pid, PIDTYPE_TGID);
diff --git a/include/linux/idr.h b/include/linux/idr.h
index 11e0ccedfc92..5ef32311b232 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -185,6 +185,20 @@ static inline bool idr_is_group_lead(struct idr *idr, unsigned long id)
 	return radix_tree_tag_get(&idr->idr_rt, id, IDR_TGID);
 }
 
+/*
+ * Find the next id with a potentially associated TGID task using the internal
+ * tag. Task association is not guaranteed and must be checked explicitly.
+ */
+static inline struct pid *find_tgid_pid(struct idr *idr, unsigned long id)
+{
+	struct pid *pid;
+
+	if (radix_tree_gang_lookup_tag(&idr->idr_rt, (void **) &pid, id, 1,
+				       IDR_TGID) != 1)
+		return NULL;
+	return pid;
+}
+
 /**
  * idr_for_each_entry() - Iterate over an IDR's elements of a given type.
  * @idr: IDR handle.