From patchwork Fri Dec 2 17:16:19 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Brian Foster
X-Patchwork-Id: 13062999
From: Brian Foster
To: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: ikent@redhat.com, onestero@redhat.com, willy@infradead.org, ebiederm@redhat.com
Subject: [PATCH v3 4/5] pid: mark pids associated with group leader tasks
Date: Fri, 2 Dec 2022 12:16:19 -0500
Message-Id: <20221202171620.509140-5-bfoster@redhat.com>
In-Reply-To: <20221202171620.509140-1-bfoster@redhat.com>
References: <20221202171620.509140-1-bfoster@redhat.com>

Searching the pid_namespace for group leader tasks is a fairly
inefficient operation.
Listing the root directory of a procfs mount performs a linear scan of
allocated pids, checking each entry for an associated PIDTYPE_TGID task
to determine whether to populate a directory entry. This can cause a
significant increase in readdir() syscall latency when run in
namespaces that have one or more processes with large thread counts.

To facilitate improved TGID pid searches, mark the ids of pid entries
that are likely to have an associated PIDTYPE_TGID task. To keep the
code simple and avoid having to maintain synchronization between mark
state and post-fork pid-task association changes, the mark is applied
to all pids allocated for tasks cloned without CLONE_THREAD.

This means that it is possible for a pid to remain marked in the xarray
after being disassociated from the group leader task. For example, a
process that does a setsid() followed by fork() and exit() (to
daemonize) leaves its original pid allocated for the session, while the
child becomes the group leader under its own pid. On the other hand,
the only place other than fork() where a tgid association occurs is in
the exec() path, which kills all other tasks in the group and
associates the current task with the preexisting leader pid. Therefore,
the semantics of the mark are that false positives (marked pids without
PIDTYPE_TGID tasks) are possible, but false negatives (unmarked pids
with PIDTYPE_TGID tasks) should never occur. This is an effective
optimization because false positives are fairly uncommon and don't add
overhead (i.e. we already have to check pid_task() for marked entries),
while the mark still filters out thread pids that are guaranteed not to
have a TGID task association.

Mark entries in the pid allocation path when the caller specifies that
the pid associates with a new thread group. Since false negatives are
not allowed, warn in the event that a PIDTYPE_TGID task is ever
attached to an unmarked pid.
Finally, create a helper to implement the task search based on the mark
semantics defined above (based on search logic currently implemented by
next_tgid() in procfs).

Signed-off-by: Brian Foster
Reviewed-by: Ian Kent
---
 include/linux/pid.h |  3 ++-
 kernel/fork.c       |  2 +-
 kernel/pid.c        | 44 +++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 343abf22092e..64caf21be256 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -132,9 +132,10 @@ extern struct pid *find_vpid(int nr);
  */
 extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
+struct task_struct *find_get_tgid_task(int *id, struct pid_namespace *);
 
 extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
-			     size_t set_tid_size);
+			     size_t set_tid_size, bool group_leader);
 extern void free_pid(struct pid *pid);
 extern void disable_pid_allocation(struct pid_namespace *ns);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..1cf2644c642e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2267,7 +2267,7 @@ static __latent_entropy struct task_struct *copy_process(
 
 	if (pid != &init_struct_pid) {
 		pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
-				args->set_tid_size);
+				args->set_tid_size, !(clone_flags & CLONE_THREAD));
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_thread;
diff --git a/kernel/pid.c b/kernel/pid.c
index 53db06f9882d..d65f74c6186c 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -66,6 +66,9 @@ int pid_max = PID_MAX_DEFAULT;
 int pid_max_min = RESERVED_PIDS + 1;
 int pid_max_max = PID_MAX_LIMIT;
 
+/* MARK_0 used by XA_FREE_MARK */
+#define TGID_MARK	XA_MARK_1
+
 struct pid_namespace init_pid_ns = {
 	.ns.count = REFCOUNT_INIT(2),
 	.xa = XARRAY_INIT(init_pid_ns.xa, PID_XA_FLAGS),
@@ -137,7 +140,7 @@ void free_pid(struct pid *pid)
 }
 
 struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
-		      size_t set_tid_size)
+		      size_t set_tid_size, bool group_leader)
 {
 	struct pid *pid;
 	enum pid_type type;
@@ -257,6 +260,8 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 
 		/* Make the PID visible to find_pid_ns. */
 		__xa_store(&tmp->xa, upid->nr, pid, 0);
+		if (group_leader)
+			__xa_set_mark(&tmp->xa, upid->nr, TGID_MARK);
 		tmp->pid_allocated++;
 		xa_unlock_irq(&tmp->xa);
 	}
@@ -314,6 +319,11 @@ static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
 void attach_pid(struct task_struct *task, enum pid_type type)
 {
 	struct pid *pid = *task_pid_ptr(task, type);
+	struct pid_namespace *pid_ns = ns_of_pid(pid);
+	pid_t pid_nr = pid_nr_ns(pid, pid_ns);
+
+	WARN_ON(type == PIDTYPE_TGID &&
+		!xa_get_mark(&pid_ns->xa, pid_nr, TGID_MARK));
 	hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
 }
 
@@ -506,6 +516,38 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
 }
 EXPORT_SYMBOL_GPL(find_ge_pid);
 
+/*
+ * Used by proc to find the first thread group leader task with an id greater
+ * than or equal to *id.
+ *
+ * Use the xarray mark as a hint to find the next best pid. The mark does not
+ * guarantee a linked group leader task exists, so retry until a suitable entry
+ * is found.
+ */
+struct task_struct *find_get_tgid_task(int *id, struct pid_namespace *ns)
+{
+	struct pid *pid;
+	struct task_struct *t;
+	unsigned long nr = *id;
+
+	rcu_read_lock();
+	do {
+		pid = xa_find(&ns->xa, &nr, ULONG_MAX, TGID_MARK);
+		if (!pid) {
+			rcu_read_unlock();
+			return NULL;
+		}
+		t = pid_task(pid, PIDTYPE_TGID);
+		nr++;
+	} while (!t);
+
+	*id = pid_nr_ns(pid, ns);
+	get_task_struct(t);
+	rcu_read_unlock();
+
+	return t;
+}
+
 struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
 {
 	struct fd f;
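
The mark semantics described in the commit message can be illustrated
with a small userspace model. This is only a sketch under stated
assumptions, not kernel code: every model_* name and MODEL_MAX_PIDS is
hypothetical, a flat array stands in for the xarray, and a boolean
stands in for the result of pid_task(pid, PIDTYPE_TGID). It shows that
a search over marked entries skips thread pids entirely, tolerates a
stale mark (false positive) with one extra check, and that attaching a
leader to an unmarked pid is the case the patch warns about.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the mark semantics; not kernel code. */
#define MODEL_MAX_PIDS 64

struct model_pid {
	bool allocated;  /* slot is in use, like a populated xarray index */
	bool tgid_mark;  /* stands in for TGID_MARK, set only at allocation */
	bool has_leader; /* stands in for pid_task(pid, PIDTYPE_TGID) != NULL */
};

struct model_ns {
	struct model_pid pids[MODEL_MAX_PIDS];
};

/* Models alloc_pid(..., group_leader): mark only pids for new thread groups. */
void model_alloc(struct model_ns *ns, int nr, bool group_leader)
{
	assert(nr >= 0 && nr < MODEL_MAX_PIDS);
	ns->pids[nr].allocated = true;
	ns->pids[nr].tgid_mark = group_leader;
	ns->pids[nr].has_leader = group_leader;
}

/* Models the leader exiting while the pid stays allocated: a false positive. */
void model_detach_leader(struct model_ns *ns, int nr)
{
	assert(nr >= 0 && nr < MODEL_MAX_PIDS);
	ns->pids[nr].has_leader = false;
}

/*
 * Models attaching a PIDTYPE_TGID task. Returns false without attaching when
 * the pid is unmarked (a would-be false negative); the patch itself only
 * issues a WARN_ON in that situation.
 */
bool model_attach_leader(struct model_ns *ns, int nr)
{
	assert(nr >= 0 && nr < MODEL_MAX_PIDS);
	if (!ns->pids[nr].tgid_mark)
		return false;
	ns->pids[nr].has_leader = true;
	return true;
}

/*
 * Models the search in find_get_tgid_task(): find the first marked id >= *id
 * that still has a leader and store it in *id. *checked counts how many
 * entries needed the leader check; unmarked entries never reach it.
 */
bool model_find_tgid(const struct model_ns *ns, int *id, int *checked)
{
	for (int nr = *id < 0 ? 0 : *id; nr < MODEL_MAX_PIDS; nr++) {
		const struct model_pid *p = &ns->pids[nr];

		if (!p->allocated || !p->tgid_mark)
			continue; /* a marked xarray search skips these */
		(*checked)++;
		if (p->has_leader) {
			*id = nr;
			return true;
		}
	}
	return false;
}
```

In this model the array walk is still linear; the point of the sketch
is the count of leader checks, which grows with the number of marked
entries rather than with the number of threads in the namespace.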