Message ID: 20220214190549.GA2815154@paulmck-ThinkPad-P17-Gen-1 (mailing list archive)
State: New, archived
Series: [RFC,fs/namespace] Make kern_unmount() use synchronize_rcu_expedited()
On Mon, Feb 14, 2022 at 03:55:36PM -0500, Rik van Riel wrote:
> On Mon, 14 Feb 2022 11:44:40 -0800
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> > On Mon, Feb 14, 2022 at 07:26:49PM +0000, Chris Mason wrote:
> 
> > Moving from synchronize_rcu() to synchronize_rcu_expedited() does buy
> > you at least an order of magnitude.  But yes, it should be possible to
> > get rid of all but one call per batch, which would be better.  Maybe
> > a bit more complicated, but probably not that much.
> 
> It doesn't look too bad, except for the include of ../fs/mount.h.
> 
> I'm hoping somebody has a better idea on how to deal with that.
> Do we need a kern_unmount() variant that doesn't do the RCU wait,
> or should it get a parameter, or something else?
> 
> Is there an ordering requirement between the synchronize_rcu call
> and zeroing out n->mq_mnt->mnt_ns?
> 
> What other changes do we need to make everything right?
> 
> The change below also fixes the issue that to-be-freed items that
> are queued up while the free_ipc work function runs do not result
> in the work item being enqueued again.
> 
> This patch is still totally untested because the 4 year old is
> at home today :)

I agree that this should decrease grace-period latency quite a bit more
than does my patch, so here is hoping that it does work.  ;-)

							Thanx, Paul

> diff --git a/ipc/namespace.c b/ipc/namespace.c
> index 7bd0766ddc3b..321cbda17cfb 100644
> --- a/ipc/namespace.c
> +++ b/ipc/namespace.c
> @@ -17,6 +17,7 @@
>  #include <linux/proc_ns.h>
>  #include <linux/sched/task.h>
>  
> +#include "../fs/mount.h"
>  #include "util.h"
>  
>  static struct ucounts *inc_ipc_namespaces(struct user_namespace *ns)
> @@ -117,10 +118,7 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
>  
>  static void free_ipc_ns(struct ipc_namespace *ns)
>  {
> -	/* mq_put_mnt() waits for a grace period as kern_unmount()
> -	 * uses synchronize_rcu().
> -	 */
> -	mq_put_mnt(ns);
> +	mntput(ns->mq_mnt);
>  	sem_exit_ns(ns);
>  	msg_exit_ns(ns);
>  	shm_exit_ns(ns);
> @@ -134,11 +132,19 @@ static void free_ipc_ns(struct ipc_namespace *ns)
>  static LLIST_HEAD(free_ipc_list);
>  static void free_ipc(struct work_struct *unused)
>  {
> -	struct llist_node *node = llist_del_all(&free_ipc_list);
> +	struct llist_node *node;
>  	struct ipc_namespace *n, *t;
>  
> -	llist_for_each_entry_safe(n, t, node, mnt_llist)
> -		free_ipc_ns(n);
> +	while ((node = llist_del_all(&free_ipc_list))) {
> +		llist_for_each_entry(n, node, mnt_llist)
> +			real_mount(n->mq_mnt)->mnt_ns = NULL;
> +
> +		/* Wait for the last users to have gone away. */
> +		synchronize_rcu();
> +
> +		llist_for_each_entry_safe(n, t, node, mnt_llist)
> +			free_ipc_ns(n);
> +	}
>  }
>  
>  /*
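The key structural change in Rik's patch is that the grace-period wait moves out of the per-namespace path and is paid once per drained batch. Below is a minimal userspace sketch of that batching pattern, not kernel code and not taken from the thread: the atomic exchange plays the role of llist_del_all(), the CAS loop in push() plays the role of llist_add(), and wait_for_grace_period() is a stand-in for synchronize_rcu() with an assumed ~25 ms cost. It should build with something like cc -std=c11 -o batch batch.c.

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct node {
	struct node *next;
	int id;
};

/* analogue of free_ipc_list */
static _Atomic(struct node *) free_list;

/* analogue of llist_add(): lock-free push onto the list */
static void push(struct node *n)
{
	struct node *old = atomic_load(&free_list);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&free_list, &old, n));
}

/* stand-in for synchronize_rcu(); the ~25 ms cost is an assumption */
static void wait_for_grace_period(void)
{
	usleep(25000);
}

/* analogue of the reworked free_ipc(): one wait per drained batch */
static void free_batch(void)
{
	struct node *batch;

	/* drain everything queued so far, like llist_del_all() */
	while ((batch = atomic_exchange(&free_list, NULL)) != NULL) {
		/* a single grace-period wait covers the whole batch */
		wait_for_grace_period();

		while (batch) {
			struct node *next = batch->next;

			printf("freeing node %d\n", batch->id);
			free(batch);
			batch = next;
		}
	}
}

int main(void)
{
	for (int i = 0; i < 5; i++) {
		struct node *n = malloc(sizeof(*n));

		n->id = i;
		push(n);
	}

	/* five nodes queued, but only one simulated grace-period wait */
	free_batch();
	return 0;
}

With N nodes queued, the wait is paid once per batch rather than once per node, which is the same effect the while loop around llist_del_all() has in the patch above.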
> On Feb 14, 2022, at 2:26 PM, Chris Mason <clm@fb.com> wrote:
> 
> 
> 
>> On Feb 14, 2022, at 2:05 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
>> 
>> Experimental.  Not for inclusion.  Yet, anyway.
>> 
>> Freeing large numbers of namespaces in quick succession can result in
>> a bottleneck on the synchronize_rcu() invoked from kern_unmount().
>> This patch applies the synchronize_rcu_expedited() hammer to allow
>> further testing and fault isolation.
>> 
>> Hey, at least there was no need to change the comment!  ;-)
>> 
> 
> I don’t think this will be fast enough. I think the problem is that commit
> e1eb26fa62d04ec0955432be1aa8722a97cb52e7 is putting all of the ipc namespace
> frees onto a list, and every free includes one call to synchronize_rcu().
> 
> The end result is that we can create new namespaces much much faster than
> we can free them, and eventually we run out. I found this while debugging
> clone() returning ENOSPC because create_ipc_ns() was returning ENOSPC.

I’m going to try Rik’s patch, but I changed Giuseppe’s benchmark from this commit, just to make it run for a million iterations instead of 1000.

#define _GNU_SOURCE
#include <sched.h>
#include <error.h>
#include <errno.h>
#include <stdlib.h>
#include <stdio.h>

int main()
{
	int i;

	for (i = 0; i < 1000000; i++) {
		if (unshare(CLONE_NEWIPC) < 0)
			error(EXIT_FAILURE, errno, "unshare");
	}
}

Then I put together a drgn script to print the size of the free_ipc_list:

#!/usr/bin/env drgn
# usage: ./check_list <pid of worker thread doing free_ipc calls>
from drgn import *
from drgn.helpers.linux.pid import find_task
import sys,os,time

def llist_count(cur):
    count = 0
    while cur:
        count += 1
        cur = cur.next
    return count

pid = int(sys.argv[1])

# sometimes the worker is in different functions, so this
# will throw exceptions if we can't find the free_ipc call
for x in range(1, 5):
    try:
        task = find_task(prog, int(pid))
        trace = prog.stack_trace(task)
        head = prog['free_ipc_list']

        for i in range(0, len(trace)):
            if "free_ipc at" in str(trace[i]):
                free_ipc_index = i
                n = trace[free_ipc_index]['n']
                print("ipc free list is %d worker %d remaining %d" %
                      (llist_count(head.first), pid, llist_count(n.mnt_llist.next)))
                break
    except:
        time.sleep(0.5)
        pass

I was expecting the run to pretty quickly hit ENOSPC, then try Rik’s patch, then celebrate and move on. What seems to be happening instead is that unshare is spending all of its time creating super blocks:

    48.07%  boom  [kernel.vmlinux]  [k] test_keyed_super
            |
            ---0x5541f689495641d7
               __libc_start_main
               unshare
               entry_SYSCALL_64
               do_syscall_64
               __x64_sys_unshare
               ksys_unshare
               unshare_nsproxy_namespaces
               create_new_namespaces
               copy_ipcs
               mq_init_ns
               mq_create_mount
               fc_mount
               vfs_get_tree
               vfs_get_super
               sget_fc
               test_keyed_super

But, this does nicely show the backlog on the free_ipc_list. It gets up to around 150K entries, with our worker thread stuck:

196 kworker/0:2+events  D
[<0>] __wait_rcu_gp+0x105/0x120
[<0>] synchronize_rcu+0x64/0x70
[<0>] kern_unmount+0x27/0x50
[<0>] free_ipc+0x6b/0xe0
[<0>] process_one_work+0x1ee/0x3c0
[<0>] worker_thread+0x23a/0x3b0
[<0>] kthread+0xe6/0x110
[<0>] ret_from_fork+0x1f/0x30

# ./check_list.drgn 196
ipc free list is 58099 worker 196 remaining 98012

Eventually, hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) becomes slower than synchronize_rcu(), and the worker thread is able to make progress? Production in this case is a few nsjail procs, so it’s not a crazy workload.
My guess is that prod tends to have longer grace periods than this test box, so the worker thread loses, but I haven’t been able to figure out why the worker suddenly catches up from time to time on the test box.

-chris
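A rough back-of-the-envelope view of the imbalance Chris describes (all numbers here are assumed for illustration, not measured on his box): if a normal synchronize_rcu() grace period costs on the order of 25 ms, a worker that waits once per namespace retires at most about 40 namespaces per second, while a tight unshare(CLONE_NEWIPC) loop can create thousands per second. At that ratio a backlog around 150K entries, like the free_ipc_list counts shown above, accumulates within a minute or two, and any lengthening of grace periods in production only widens the gap.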
diff --git a/fs/namespace.c b/fs/namespace.c
index 40b994a29e90d..79c50ad0ade5b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4389,7 +4389,7 @@ void kern_unmount(struct vfsmount *mnt)
 	/* release long term mount so mount point can be released */
 	if (!IS_ERR_OR_NULL(mnt)) {
 		real_mount(mnt)->mnt_ns = NULL;
-		synchronize_rcu();	/* yecchhh... */
+		synchronize_rcu_expedited();	/* yecchhh... */
 		mntput(mnt);
 	}
 }
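For what it’s worth, the kern_unmount() variant without the RCU wait that Rik asks about above might look something like the sketch below. This is purely hypothetical (the function name is invented; no such helper is claimed to exist): it simply reuses the body of kern_unmount() shown in this diff, minus the grace-period wait and the mntput(), so that a batching caller such as free_ipc() could issue a single synchronize_rcu() for many mounts before dropping each reference.

/* Hypothetical sketch only; not an existing kernel interface. */
void kern_unmount_begin(struct vfsmount *mnt)
{
	/* release long term mount so mount point can be released */
	if (!IS_ERR_OR_NULL(mnt))
		real_mount(mnt)->mnt_ns = NULL;

	/*
	 * The caller is responsible for waiting out an RCU grace period
	 * (one wait can cover a whole batch) before calling mntput(mnt).
	 */
}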