From patchwork Fri Feb 21 17:02:48 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Michal_Koutn=C3=BD?= X-Patchwork-Id: 13986052 Received: from mail-ej1-f65.google.com (mail-ej1-f65.google.com [209.85.218.65]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AA2482451D9 for ; Fri, 21 Feb 2025 17:02:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.218.65 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1740157379; cv=none; b=tcHAgxkbyZyk1DK/5oIYPoMVw4jqoViHczpeO1Bx4WEG/IJZMLsplb5gXxsI/2bk+Z7bkC1I/I9gXMQ6USLnLiID3icdjIPw7epmeZ5w6S+WWtnrPrUEVgueAZDccE/ge6ZONyGXrWp57d/WLVCWthij2OxtAUsfdadBSCiSzfA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1740157379; c=relaxed/simple; bh=BU6AW4/vup2PJvui7d1Yi4h+wWiPT28WpwmUEqadV9c=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=ZRYwTG63mXmKGVXQDOKWq6CSVqT+YNUw7INNUHCmaGc9HjQRLOZG2AGsoyLSDvKsHkbt+BVtL3uJXdng/1IfFhLyHwjznyIcNjDOd14qa3MsA1Do4uMJJ2+0das4QOw/3cFoetviKqXLmpQzeHKeqsU8/HCjM7v4wS5RbkAA9p8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=suse.com; spf=pass smtp.mailfrom=suse.com; dkim=pass (2048-bit key) header.d=suse.com header.i=@suse.com header.b=GsIGacZU; arc=none smtp.client-ip=209.85.218.65 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=suse.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=suse.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=suse.com header.i=@suse.com header.b="GsIGacZU" Received: by mail-ej1-f65.google.com with SMTP id a640c23a62f3a-ab7483b9bf7so372104866b.3 for ; Fri, 21 Feb 2025 09:02:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=google; t=1740157375; x=1740762175; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=Gn1EUFdCkok2nb6WPUbBFw+Pz8YpsWovy+Wa/6/aDDA=; b=GsIGacZUhWyroON8yKlVcWzP9BQ6wKkOcSNHjIIRrv6Xi0hc9xf4H8/BxFfHaKC3Vd Aaf1NI+9ChJ+2fuOckNfRHKI2hNoFFDUEDcME36HRdBdTkHvLUzqRrWFUovnfzRoxS7b 1izwrONAxg1T+80Lt/OHm8Fu3hUfK8enNJUKZd7SyvrzF70uANT+a//1v2pS3+FIl0qF 9SYz/+YelyDq6fzY2pO2oKZ1BNESW6uxw2+FEJv9NLmz87ZmGz0m0Jk4qM0DX5SGlBgu SF9m/H+733hmKZb3pDm9BXnNL4Y6TjQjKgqw+y1dvc40nA1ErF5TDu6BZB20TGmevqQk AH5A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1740157375; x=1740762175; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Gn1EUFdCkok2nb6WPUbBFw+Pz8YpsWovy+Wa/6/aDDA=; b=O4puOXgsKq+mDUd45sV35zfZlE0bSw+qVyxi7w0OeUDs7Zo7x55217JrZamjfiomOC F8Xqx2IGw23wh741ERyGwsyos7rqp2M94Nhyqu5wWwzhIdoxqixBp/n7ZwRoURi92dN4 qXO7jpeZsGbQKQk9sZAmRhtTl0x7hejrtvNcCYDvUn6JVL7+hP9oBb1sWIfJXJaC52kc G8KtwDqdN/nTpPSKC0wGew1SiylZ2G8yqBOWhkQkxTVLfkzw0IAxz4xOj9vbUfz2SxYC MHaGTkw9wbspXcFhr2Um2PROuhVoDM6VhmwwweIbJc53uCCd5EY2tQ8Y3gKYCreJr1nt Bpcw== X-Forwarded-Encrypted: i=1; AJvYcCXn9/RP56bTxhUdGybtESUnbNcBW/ZjeX9eq2IS0KXKVBgpo/1MWjIsZ9c5A6VzO4+OrLTITB18etEZvcC7@vger.kernel.org X-Gm-Message-State: AOJu0Ywn3BfFS0Jn+ENpRYsQ3RpqgKg3rm9G2njLH1a/phFbt3hgm4k0 sqAQs3tRyOGUGS2r+se1Ja0PeqtEZe0QEuDkHVGO4fpeGJKGBl6ZexnD4llUdU0= X-Gm-Gg: ASbGnct+o5bv4V60IVQ+79TnRTKM1VC1GfHgO4yImumrRt1YW3IiHORDpga4LYnXtc1 +urGsMllE+w0ctib5/I13fqGYCnNvoMtBfxboW8yOembnyN7piN6X19mogciXW5zcwdSRBdOo4b Je0K7f0fa6E/xaGcchOpScGDYzaK3WgqlEVdWYDzHzmrAPS3ajl9yK/O5aol4K8YowL1o4VIbcA 5eZOb9IrOGEfzUwc3HLgDLAApBSxY+AT5IY6jSZk0tZNTC3y8fMgSBSMCc3K20EeqvDl3SE5cSx gjVNW8dsdLQWtzYmlxhmiTth78qS X-Google-Smtp-Source: AGHT+IE/yZlQr7iCF2u/T2HPOCo0tOM65vbURSlGqazWvG/FhMHwesKZAxDLWtGvDrLMvON4IhYF0w== X-Received: by 2002:a17:906:6a0a:b0:ab7:e1d5:d0c3 with SMTP id a640c23a62f3a-abc09c28bf6mr433317766b.51.1740157374605; Fri, 21 Feb 2025 09:02:54 -0800 (PST) Received: from blackdock.suse.cz ([193.86.92.181]) by smtp.gmail.com with ESMTPSA id a640c23a62f3a-abb8eea4d65sm1105668766b.161.2025.02.21.09.02.53 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 21 Feb 2025 09:02:53 -0800 (PST) From: =?utf-8?q?Michal_Koutn=C3=BD?= To: Christian Brauner , Alexander Mikhalitsyn , linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-trace-kernel@vger.kernel.org Cc: Jonathan Corbet , Kees Cook , Andrew Morton , "Eric W . Biederman" , =?utf-8?q?Michal_Koutn=C3=BD?= , Oleg Nesterov Subject: [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace" Date: Fri, 21 Feb 2025 18:02:48 +0100 Message-ID: <20250221170249.890014-2-mkoutny@suse.com> X-Mailer: git-send-email 2.48.1 In-Reply-To: <20250221170249.890014-1-mkoutny@suse.com> References: <20250221170249.890014-1-mkoutny@suse.com> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 This reverts commit 7863dcc72d0f4b13a641065670426435448b3d80. It is already difficult for users to troubleshoot which of multiple pid limits restricts their workload. I'm afraid making pid_max per-(hierarchical-)NS will contribute to confusion. Also, the implementation copies the limit upon creation from parent, this pattern showed cumbersome with some attributes in legacy cgroup controllers -- it's subject to race condition between parent's limit modification and children creation and once copied it must be changed in the descendant. This is very similar to what pids.max of a cgroup (already) does that can be used as an alternative. Link: https://lore.kernel.org/r/bnxhqrq7tip6jl2hu6jsvxxogdfii7ugmafbhgsogovrchxfyp@kagotkztqurt/ Signed-off-by: Michal Koutný --- include/linux/pid.h | 3 + include/linux/pid_namespace.h | 10 +-- kernel/pid.c | 125 ++---------------------------- kernel/pid_namespace.c | 43 +++------- kernel/sysctl.c | 9 +++ kernel/trace/pid_list.c | 2 +- kernel/trace/trace.h | 2 + kernel/trace/trace_sched_switch.c | 2 +- 8 files changed, 35 insertions(+), 161 deletions(-) diff --git a/include/linux/pid.h b/include/linux/pid.h index 98837a1ff0f33..fe575fcdb4afa 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -108,6 +108,9 @@ extern void exchange_tids(struct task_struct *task, struct task_struct *old); extern void transfer_pid(struct task_struct *old, struct task_struct *new, enum pid_type); +extern int pid_max; +extern int pid_max_min, pid_max_max; + /* * look up a PID in the hash table. Must be called with the tasklist_lock * or rcu_read_lock() held. diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h index 7c67a58111998..f9f9931e02d6a 100644 --- a/include/linux/pid_namespace.h +++ b/include/linux/pid_namespace.h @@ -30,7 +30,6 @@ struct pid_namespace { struct task_struct *child_reaper; struct kmem_cache *pid_cachep; unsigned int level; - int pid_max; struct pid_namespace *parent; #ifdef CONFIG_BSD_PROCESS_ACCT struct fs_pin *bacct; @@ -39,14 +38,9 @@ struct pid_namespace { struct ucounts *ucounts; int reboot; /* group exit code if this pidns was rebooted */ struct ns_common ns; - struct work_struct work; -#ifdef CONFIG_SYSCTL - struct ctl_table_set set; - struct ctl_table_header *sysctls; -#if defined(CONFIG_MEMFD_CREATE) +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) int memfd_noexec_scope; #endif -#endif } __randomize_layout; extern struct pid_namespace init_pid_ns; @@ -123,8 +117,6 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd) extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk); void pidhash_init(void); void pid_idr_init(void); -int register_pidns_sysctls(struct pid_namespace *pidns); -void unregister_pidns_sysctls(struct pid_namespace *pidns); static inline bool task_is_in_init_pid_ns(struct task_struct *tsk) { diff --git a/kernel/pid.c b/kernel/pid.c index 924084713be8b..aa2a7d4da4555 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -61,8 +61,10 @@ struct pid init_struct_pid = { }, } }; -static int pid_max_min = RESERVED_PIDS + 1; -static int pid_max_max = PID_MAX_LIMIT; +int pid_max = PID_MAX_DEFAULT; + +int pid_max_min = RESERVED_PIDS + 1; +int pid_max_max = PID_MAX_LIMIT; /* * PID-map pages start out as NULL, they get allocated upon @@ -81,7 +83,6 @@ struct pid_namespace init_pid_ns = { #ifdef CONFIG_PID_NS .ns.ops = &pidns_operations, #endif - .pid_max = PID_MAX_DEFAULT, #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) .memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC, #endif @@ -190,7 +191,6 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, for (i = ns->level; i >= 0; i--) { int tid = 0; - int pid_max = READ_ONCE(tmp->pid_max); if (set_tid_size) { tid = set_tid[ns->level - i]; @@ -644,118 +644,17 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags) return fd; } -#ifdef CONFIG_SYSCTL -static struct ctl_table_set *pid_table_root_lookup(struct ctl_table_root *root) -{ - return &task_active_pid_ns(current)->set; -} - -static int set_is_seen(struct ctl_table_set *set) -{ - return &task_active_pid_ns(current)->set == set; -} - -static int pid_table_root_permissions(struct ctl_table_header *head, - const struct ctl_table *table) -{ - struct pid_namespace *pidns = - container_of(head->set, struct pid_namespace, set); - int mode = table->mode; - - if (ns_capable(pidns->user_ns, CAP_SYS_ADMIN) || - uid_eq(current_euid(), make_kuid(pidns->user_ns, 0))) - mode = (mode & S_IRWXU) >> 6; - else if (in_egroup_p(make_kgid(pidns->user_ns, 0))) - mode = (mode & S_IRWXG) >> 3; - else - mode = mode & S_IROTH; - return (mode << 6) | (mode << 3) | mode; -} - -static void pid_table_root_set_ownership(struct ctl_table_header *head, - kuid_t *uid, kgid_t *gid) -{ - struct pid_namespace *pidns = - container_of(head->set, struct pid_namespace, set); - kuid_t ns_root_uid; - kgid_t ns_root_gid; - - ns_root_uid = make_kuid(pidns->user_ns, 0); - if (uid_valid(ns_root_uid)) - *uid = ns_root_uid; - - ns_root_gid = make_kgid(pidns->user_ns, 0); - if (gid_valid(ns_root_gid)) - *gid = ns_root_gid; -} - -static struct ctl_table_root pid_table_root = { - .lookup = pid_table_root_lookup, - .permissions = pid_table_root_permissions, - .set_ownership = pid_table_root_set_ownership, -}; - -static const struct ctl_table pid_table[] = { - { - .procname = "pid_max", - .data = &init_pid_ns.pid_max, - .maxlen = sizeof(int), - .mode = 0644, - .proc_handler = proc_dointvec_minmax, - .extra1 = &pid_max_min, - .extra2 = &pid_max_max, - }, -}; -#endif - -int register_pidns_sysctls(struct pid_namespace *pidns) -{ -#ifdef CONFIG_SYSCTL - struct ctl_table *tbl; - - setup_sysctl_set(&pidns->set, &pid_table_root, set_is_seen); - - tbl = kmemdup(pid_table, sizeof(pid_table), GFP_KERNEL); - if (!tbl) - return -ENOMEM; - tbl->data = &pidns->pid_max; - pidns->pid_max = min(pid_max_max, max_t(int, pidns->pid_max, - PIDS_PER_CPU_DEFAULT * num_possible_cpus())); - - pidns->sysctls = __register_sysctl_table(&pidns->set, "kernel", tbl, - ARRAY_SIZE(pid_table)); - if (!pidns->sysctls) { - kfree(tbl); - retire_sysctl_set(&pidns->set); - return -ENOMEM; - } -#endif - return 0; -} - -void unregister_pidns_sysctls(struct pid_namespace *pidns) -{ -#ifdef CONFIG_SYSCTL - const struct ctl_table *tbl; - - tbl = pidns->sysctls->ctl_table_arg; - unregister_sysctl_table(pidns->sysctls); - retire_sysctl_set(&pidns->set); - kfree(tbl); -#endif -} - void __init pid_idr_init(void) { /* Verify no one has done anything silly: */ BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_ADDING); /* bump default and minimum pid_max based on number of cpus */ - init_pid_ns.pid_max = min(pid_max_max, max_t(int, init_pid_ns.pid_max, - PIDS_PER_CPU_DEFAULT * num_possible_cpus())); + pid_max = min(pid_max_max, max_t(int, pid_max, + PIDS_PER_CPU_DEFAULT * num_possible_cpus())); pid_max_min = max_t(int, pid_max_min, PIDS_PER_CPU_MIN * num_possible_cpus()); - pr_info("pid_max: default: %u minimum: %u\n", init_pid_ns.pid_max, pid_max_min); + pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min); idr_init(&init_pid_ns.idr); @@ -766,16 +665,6 @@ void __init pid_idr_init(void) NULL); } -static __init int pid_namespace_sysctl_init(void) -{ -#ifdef CONFIG_SYSCTL - /* "kernel" directory will have already been initialized. */ - BUG_ON(register_pidns_sysctls(&init_pid_ns)); -#endif - return 0; -} -subsys_initcall(pid_namespace_sysctl_init); - static struct file *__pidfd_fget(struct task_struct *task, int fd) { struct file *file; diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index 8f6cfec87555a..0f23285be4f92 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -70,8 +70,6 @@ static void dec_pid_namespaces(struct ucounts *ucounts) dec_ucount(ucounts, UCOUNT_PID_NAMESPACES); } -static void destroy_pid_namespace_work(struct work_struct *work); - static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns, struct pid_namespace *parent_pid_ns) { @@ -107,27 +105,17 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns goto out_free_idr; ns->ns.ops = &pidns_operations; - ns->pid_max = parent_pid_ns->pid_max; - err = register_pidns_sysctls(ns); - if (err) - goto out_free_inum; - refcount_set(&ns->ns.count, 1); ns->level = level; ns->parent = get_pid_ns(parent_pid_ns); ns->user_ns = get_user_ns(user_ns); ns->ucounts = ucounts; ns->pid_allocated = PIDNS_ADDING; - INIT_WORK(&ns->work, destroy_pid_namespace_work); - #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns); #endif - return ns; -out_free_inum: - ns_free_inum(&ns->ns); out_free_idr: idr_destroy(&ns->idr); kmem_cache_free(pid_ns_cachep, ns); @@ -149,28 +137,12 @@ static void delayed_free_pidns(struct rcu_head *p) static void destroy_pid_namespace(struct pid_namespace *ns) { - unregister_pidns_sysctls(ns); - ns_free_inum(&ns->ns); idr_destroy(&ns->idr); call_rcu(&ns->rcu, delayed_free_pidns); } -static void destroy_pid_namespace_work(struct work_struct *work) -{ - struct pid_namespace *ns = - container_of(work, struct pid_namespace, work); - - do { - struct pid_namespace *parent; - - parent = ns->parent; - destroy_pid_namespace(ns); - ns = parent; - } while (ns != &init_pid_ns && refcount_dec_and_test(&ns->ns.count)); -} - struct pid_namespace *copy_pid_ns(unsigned long flags, struct user_namespace *user_ns, struct pid_namespace *old_ns) { @@ -183,8 +155,15 @@ struct pid_namespace *copy_pid_ns(unsigned long flags, void put_pid_ns(struct pid_namespace *ns) { - if (ns && ns != &init_pid_ns && refcount_dec_and_test(&ns->ns.count)) - schedule_work(&ns->work); + struct pid_namespace *parent; + + while (ns != &init_pid_ns) { + parent = ns->parent; + if (!refcount_dec_and_test(&ns->ns.count)) + break; + destroy_pid_namespace(ns); + ns = parent; + } } EXPORT_SYMBOL_GPL(put_pid_ns); @@ -295,7 +274,6 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write, next = idr_get_cursor(&pid_ns->idr) - 1; tmp.data = &next; - tmp.extra2 = &pid_ns->pid_max; ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos); if (!ret && write) idr_set_cursor(&pid_ns->idr, next + 1); @@ -303,6 +281,7 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write, return ret; } +extern int pid_max; static const struct ctl_table pid_ns_ctl_table[] = { { .procname = "ns_last_pid", @@ -310,7 +289,7 @@ static const struct ctl_table pid_ns_ctl_table[] = { .mode = 0666, /* permissions are checked in the handler */ .proc_handler = pid_ns_ctl_handler, .extra1 = SYSCTL_ZERO, - .extra2 = &init_pid_ns.pid_max, + .extra2 = &pid_max, }, }; #endif /* CONFIG_CHECKPOINT_RESTORE */ diff --git a/kernel/sysctl.c b/kernel/sysctl.c index cb57da499ebb1..bb739608680f2 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1803,6 +1803,15 @@ static const struct ctl_table kern_table[] = { .proc_handler = proc_dointvec, }, #endif + { + .procname = "pid_max", + .data = &pid_max, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &pid_max_min, + .extra2 = &pid_max_max, + }, { .procname = "panic_on_oops", .data = &panic_on_oops, diff --git a/kernel/trace/pid_list.c b/kernel/trace/pid_list.c index c62b9b3cfb3d8..4966e6bbdf6f3 100644 --- a/kernel/trace/pid_list.c +++ b/kernel/trace/pid_list.c @@ -414,7 +414,7 @@ struct trace_pid_list *trace_pid_list_alloc(void) int i; /* According to linux/thread.h, pids can be no bigger that 30 bits */ - WARN_ON_ONCE(init_pid_ns.pid_max > (1 << 30)); + WARN_ON_ONCE(pid_max > (1 << 30)); pid_list = kzalloc(sizeof(*pid_list), GFP_KERNEL); if (!pid_list) diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 9c21ba45b7af6..46c65402ad7e5 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -732,6 +732,8 @@ extern unsigned long tracing_thresh; /* PID filtering */ +extern int pid_max; + bool trace_find_filtered_pid(struct trace_pid_list *filtered_pids, pid_t search_pid); bool trace_ignore_this_task(struct trace_pid_list *filtered_pids, diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c index cb49f7279dc80..573b5d8e8a28e 100644 --- a/kernel/trace/trace_sched_switch.c +++ b/kernel/trace/trace_sched_switch.c @@ -442,7 +442,7 @@ int trace_alloc_tgid_map(void) if (tgid_map) return 0; - tgid_map_max = init_pid_ns.pid_max; + tgid_map_max = pid_max; map = kvcalloc(tgid_map_max + 1, sizeof(*tgid_map), GFP_KERNEL); if (!map) From patchwork Fri Feb 21 17:02:49 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Michal_Koutn=C3=BD?= X-Patchwork-Id: 13986053 Received: from mail-ed1-f66.google.com (mail-ed1-f66.google.com [209.85.208.66]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 96D6B244E96 for ; Fri, 21 Feb 2025 17:02:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.208.66 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1740157379; cv=none; b=mjSvNtW8joYwth6XV31Ii5Onu9kf5c7dxCMsUtp6Yz6YndFJgE8q1WTocAyuJK16fWsZ6H8w4CuRk6y0d68Dnq1YHOYLQz0yG1f3m05xsdtLETb92YdAvYIf0eUDtV5sBCUuV/ZAMcwBP9B3leTQSbOjUo9FprbwvsHFQA++lfM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1740157379; c=relaxed/simple; bh=p26yxnUhbdMRoiqj6XHxuH9YLkbrUj4rKzuTeUEX3o0=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=CuFhd4FOsUmioO8cUPkPXCQwKICrArYObBHEBcE4SSxOt8vXLlulLo3//Ex5IDJD5T7h7ihnMDEHw2cEYmqtfHftNfrf+7nnuvxrjp5wgrh0HfEQ+nkAY+kyfVF84oRyZgG1kV+bFQZduMeazwPTo72ifH03cPM7THOVLYnw7DQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=suse.com; spf=pass smtp.mailfrom=suse.com; dkim=pass (2048-bit key) header.d=suse.com header.i=@suse.com header.b=f7m7PSE7; arc=none smtp.client-ip=209.85.208.66 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=suse.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=suse.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=suse.com header.i=@suse.com header.b="f7m7PSE7" Received: by mail-ed1-f66.google.com with SMTP id 4fb4d7f45d1cf-5dedae49c63so4332800a12.0 for ; Fri, 21 Feb 2025 09:02:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=google; t=1740157376; x=1740762176; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=wWszliNKbmL8VUqL0WBW3MKacH8rbha5eqW1pX7oPq8=; b=f7m7PSE7CbLc5EfFMwi5bGPy2nHJYOCJBaGL6ffvkTB9gNxh8RPdvhw6SoImkvRQRN ppTnnJB3oEsFvS/iontAf7ZowZOBsuduoBpYirzmGlq9EuoSFS54zH0M9862jk0kZQ1M gV5wLfKv4HKCQFP6bCkB5DKdcl0AHCjHnhrC81t1CSHw3q1lwzQsmlhmsvC8FWL0XBSh 6pfT6EMt6cuFMoWo/0GNS+oQIQ9GWL+fkN1eKOTinNeq/6olFn/Ilr8m31DoFOMhd54k 2z7Re/sfPj4BMsnNebyo3I5IhbPdlLoXVqYKHK9lik+dj/Ryv+huwsAKr7OvwfTobhQr DlkQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1740157376; x=1740762176; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=wWszliNKbmL8VUqL0WBW3MKacH8rbha5eqW1pX7oPq8=; b=GqHLClM5gj8NBiSO54qki+w0ruVSbzCE46irdfEkeqgLLBmWd7gZbSod7u4A8rbLl1 GwWo1dU5EBdVu42T75vv2/ULufIx9uTY1riU7ql8IKIfik/7GjTBSwgMT0L2rKL6R/+d kLtaPABrzj3GS0UIu7Z2rSf+BWnEJQ5zEdNa51SKLk/97f8PGO5WR8ORvTd1NmjKKlDp TTfqDX8BqvtTk+yiX4RxAbGjd3++3hPgXTzlpsU02mjGCPWPOmBo9mfXnyFuAsRJT5+l KIm0M6BsNvHwwNISucYg4W6n+vW7xigdVxZNuuIiYgb98UDMsihX6fdKc9UdLEpVDvyz q06g== X-Forwarded-Encrypted: i=1; AJvYcCUQfWIsdooazg6jKkKonr5w5bOvDMNhppyN8cRmI+FqRGFRBnnGCdvUMlxEblW0LTzLddcx7JwGsuLDdd+w@vger.kernel.org X-Gm-Message-State: AOJu0YwooC12OugNKyRxsOH4YVNJf/HQK3sp1aN3y8JEBAmBw057jPz7 I9XWhg3UhYvviJxlREt4kaqw8yJ1xDd24D6x/0d1PlsGrYPUHyBHjQtdSGmu/oQ= X-Gm-Gg: ASbGncv4jYl+KfPht4hOzPgcPqT5GUrbbH+qUykeCJ79TQOwN60EALWXgsjAcsWtdPt u1Nq/C26I6mUiZ6SHr1r3asIXhbDXPe2Z99dgBG85OjUGbcetbIsB0++ShS+KPVuswRnNUQUPQ7 aVIOv7OswXSZRH6MLkU03pyhRhP+ZD+VFxg2/MDP3ElmxqDVL6YmX6oJVYcpU0aujjqkPl4B+oD y0vyrRBhNuKrvsGVrZU7jAbnA4Sp8BIA7ObYkByUUgd2eatJohTvLw5U23ieYCCmNIfU3TyUfR7 b/GrYAtPiK/YPnbV2W15cneKkNgG X-Google-Smtp-Source: AGHT+IEwCTRlvyl71solhDgWaKxLp5oopUTcMtwHu3YJJA/SVMpLmQ33tXa83ctCiJhFUIO1Qd60bg== X-Received: by 2002:a17:906:314d:b0:ab7:9df1:e562 with SMTP id a640c23a62f3a-abc0de55dbcmr343145666b.48.1740157375725; Fri, 21 Feb 2025 09:02:55 -0800 (PST) Received: from blackdock.suse.cz ([193.86.92.181]) by smtp.gmail.com with ESMTPSA id a640c23a62f3a-abb8eea4d65sm1105668766b.161.2025.02.21.09.02.54 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 21 Feb 2025 09:02:55 -0800 (PST) From: =?utf-8?q?Michal_Koutn=C3=BD?= To: Christian Brauner , Alexander Mikhalitsyn , linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-trace-kernel@vger.kernel.org Cc: Jonathan Corbet , Kees Cook , Andrew Morton , "Eric W . Biederman" , =?utf-8?q?Michal_Koutn=C3=BD?= , Oleg Nesterov Subject: [PATCH 2/2] pid: Optional first-fit pid allocation Date: Fri, 21 Feb 2025 18:02:49 +0100 Message-ID: <20250221170249.890014-3-mkoutny@suse.com> X-Mailer: git-send-email 2.48.1 In-Reply-To: <20250221170249.890014-1-mkoutny@suse.com> References: <20250221170249.890014-1-mkoutny@suse.com> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Noone would need to use this allocation strategy (it's slower, pid numbers collide sooner). Its primary purpose are pid namespaces in conjunction with pids.max cgroup limit which keeps (virtual) pid numbers below the given limit. This is for 32-bit userspace programs that may not work well with pid numbers above 65536. Link: https://lore.kernel.org/r/20241122132459.135120-1-aleksandr.mikhalitsyn@canonical.com/ Signed-off-by: Michal Koutný --- Documentation/admin-guide/sysctl/kernel.rst | 2 ++ include/linux/pid_namespace.h | 3 +++ kernel/pid.c | 12 +++++++-- kernel/pid_namespace.c | 28 +++++++++++++++------ 4 files changed, 36 insertions(+), 9 deletions(-) diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index a43b78b4b6464..f5e68d1c8849f 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -1043,6 +1043,8 @@ The last pid allocated in the current (the one task using this sysctl lives in) pid namespace. When selecting a pid for a next task on fork kernel tries to allocate a number starting from this one. +When set to -1, first-fit pid numbering is used instead of the next-fit. + powersave-nap (PPC only) ======================== diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h index f9f9931e02d6a..10bf66ca78590 100644 --- a/include/linux/pid_namespace.h +++ b/include/linux/pid_namespace.h @@ -41,6 +41,9 @@ struct pid_namespace { #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) int memfd_noexec_scope; #endif +#ifdef CONFIG_IA32_EMULATION + bool pid_noncyclic; +#endif } __randomize_layout; extern struct pid_namespace init_pid_ns; diff --git a/kernel/pid.c b/kernel/pid.c index aa2a7d4da4555..e9da1662b8821 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -191,6 +191,10 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, for (i = ns->level; i >= 0; i--) { int tid = 0; + bool pid_noncyclic = 0; +#ifdef CONFIG_IA32_EMULATION + pid_noncyclic = READ_ONCE(tmp->pid_noncyclic); +#endif if (set_tid_size) { tid = set_tid[ns->level - i]; @@ -235,8 +239,12 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, * Store a null pointer so find_pid_ns does not find * a partially initialized PID (see below). */ - nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, - pid_max, GFP_ATOMIC); + if (likely(!pid_noncyclic)) + nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, + pid_max, GFP_ATOMIC); + else + nr = idr_alloc(&tmp->idr, NULL, pid_min, + pid_max, GFP_ATOMIC); } spin_unlock_irq(&pidmap_lock); idr_preload_end(); diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index 0f23285be4f92..ceda94a064294 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -113,6 +113,9 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns ns->pid_allocated = PIDNS_ADDING; #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns); +#endif +#ifdef CONFIG_IA32_EMULATION + ns->pid_noncyclic = READ_ONCE(parent_pid_ns->pid_noncyclic); #endif return ns; @@ -260,7 +263,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) return; } -#ifdef CONFIG_CHECKPOINT_RESTORE +#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_IA32_EMULATION) static int pid_ns_ctl_handler(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { @@ -271,12 +274,23 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write, if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns)) return -EPERM; - next = idr_get_cursor(&pid_ns->idr) - 1; + next = -1; +#ifdef CONFIG_IA32_EMULATION + if (!pid_ns->pid_noncyclic) +#endif + next += idr_get_cursor(&pid_ns->idr); tmp.data = &next; ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos); - if (!ret && write) - idr_set_cursor(&pid_ns->idr, next + 1); + if (!ret && write) { + if (next > -1) + idr_set_cursor(&pid_ns->idr, next + 1); + else if (!IS_ENABLED(CONFIG_IA32_EMULATION)) + ret = -EINVAL; +#ifdef CONFIG_IA32_EMULATION + WRITE_ONCE(pid_ns->pid_noncyclic, next == -1); +#endif + } return ret; } @@ -288,11 +302,11 @@ static const struct ctl_table pid_ns_ctl_table[] = { .maxlen = sizeof(int), .mode = 0666, /* permissions are checked in the handler */ .proc_handler = pid_ns_ctl_handler, - .extra1 = SYSCTL_ZERO, + .extra1 = SYSCTL_NEG_ONE, .extra2 = &pid_max, }, }; -#endif /* CONFIG_CHECKPOINT_RESTORE */ +#endif /* CONFIG_CHECKPOINT_RESTORE || CONFIG_IA32_EMULATION */ int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd) { @@ -449,7 +463,7 @@ static __init int pid_namespaces_init(void) { pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT); -#ifdef CONFIG_CHECKPOINT_RESTORE +#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_IA32_EMULATION) register_sysctl_init("kernel", pid_ns_ctl_table); #endif