Message ID | 20190103035426.23526-1-avagin@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [vfs/for-next,v6] cgroup: fix top cgroup refcnt leak | expand |
On Wed, Jan 02, 2019 at 07:54:26PM -0800, Andrei Vagin wrote: [I'm thoroughly sick of refcounting in that thing, TBH ;-/] > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index a19f0fec9d82..fe67b5e81f9a 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc) > > ret = kernfs_get_tree(fc); > if (ret < 0) > - goto out_cgrp; > + return ret; Why does that case avoid needing cgroup_put()? Note, BTW, that we also have this: /* * Destroy a cgroup filesystem context. */ static void cgroup_fs_context_free(struct fs_context *fc) { struct cgroup_fs_context *ctx = cgroup_fc2context(fc); kfree(ctx->name); kfree(ctx->release_agent); if (ctx->root) cgroup_put(&ctx->root->cgrp); put_cgroup_ns(ctx->ns); kernfs_free_fs_context(fc); kfree(ctx); } which also needs to be taken into account.
On Thu, Jan 03, 2019 at 08:32:29AM +0000, Al Viro wrote: > On Wed, Jan 02, 2019 at 07:54:26PM -0800, Andrei Vagin wrote: > > [I'm thoroughly sick of refcounting in that thing, TBH ;-/] feel your pain... > > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > > index a19f0fec9d82..fe67b5e81f9a 100644 > > --- a/kernel/cgroup/cgroup.c > > +++ b/kernel/cgroup/cgroup.c > > @@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc) > > > > ret = kernfs_get_tree(fc); > > if (ret < 0) > > - goto out_cgrp; > > + return ret; > > Why does that case avoid needing cgroup_put()? Note, BTW, that we > also have this: The origin patch which added this cgroup_put says that it is needed because mount() and kill_sb() is not a one-to-one match. sb is created in kernfs_get_tree(), so I decided that this cgroup_put() is needed only after kernfs_get_tree(). commit c6b3d5bcd67c75961a1e8b9564d1475c0f194a84 Author: Li Zefan <lizefan@huawei.com> Date: Fri Apr 4 17:14:41 2014 +0800 cgroup: fix top cgroup refcnt leak As mount() and kill_sb() is not a one-to-one match, If we mount the same cgroupfs in serveral mount points, and then umount all of them, kill_sb() will be called only once. > /* > * Destroy a cgroup filesystem context. > */ > static void cgroup_fs_context_free(struct fs_context *fc) > { > struct cgroup_fs_context *ctx = cgroup_fc2context(fc); > > kfree(ctx->name); > kfree(ctx->release_agent); > if (ctx->root) > cgroup_put(&ctx->root->cgrp); > put_cgroup_ns(ctx->ns); > kernfs_free_fs_context(fc); > kfree(ctx); > } > > which also needs to be taken into account. we have unconditional cgroup_get in cgroup1_get_tree() to match this put, but we need to check error paths.
Hi Andrei, It turns out that the cgroup-v1 refcounting is slightly broken upstream. If kernfs_get_inode() fails in kernfs_fill_super(), then there's refcount breakage in the error handling. This can be provoked by making the following change: --- a/fs/kernfs/mount.c +++ b/fs/kernfs/mount.c @@ -240,6 +240,8 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic) sb->s_shrink.seeks = 0; /* get root inode, initialize and unlock it */ + if (strcmp(current->comm, "foobar") == 0) + return -ENOANO; mutex_lock(&kernfs_mutex); inode = kernfs_get_inode(sb, info->root->kn); mutex_unlock(&kernfs_mutex); and then copying /bin/mount to /tmp/foobar and doing: [root@andromeda ~]# /tmp/foobar -t cgroup -o none,name=xxxxy xxx /tmp/x/a1 foobar: /tmp/x/a1: mount(2) system call failed: No anode. In dmesg I see the attached traces (see below). The problem appears to be because cgroup_do_mount() calls kernfs_mount(), but the refcount on the new root cgroup object hasn't been properly initialised yet. However, because we make it past sget(), the superblock thinks it has taken the caller's ref - and this gets eaten by cgroup_kill_sb(). Further, another ref appears to be released by cgroup_do_mount() in the event that kernfs_mount() fails. David ------------[ cut here ]------------ percpu_ref_kill_and_confirm called more than once on css_release! WARNING: CPU: 3 PID: 3218 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x4b/0x14c Modules linked in: CPU: 3 PID: 3218 Comm: bugger Not tainted 4.20.0-fscache+ #1287 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 RIP: 0010:percpu_ref_kill_and_confirm+0x4b/0x14c Code: c6 74 29 80 3d 42 0a 00 01 00 75 20 48 8b 53 10 48 c7 c6 f0 cd e9 81 48 c7 c7 82 99 16 82 c6 05 27 0a 00 01 01 e8 04 54 ac ff <0f> 0b 48 83 4b 08 02 4c 89 e6 48 89 df e8 fb fc ff ff 65 ff 05 54 RSP: 0018:ffff8880c5103c60 EFLAGS: 00010086 RAX: 0000000000000000 RBX: ffff8880d35e8020 RCX: ffff8880c5103b4c RDX: 0000000000000046 RSI: ffffffff8245fef8 RDI: ffffffff810ac1ec RBP: ffff8880c5103c78 R08: 0000000000000041 R09: 0000000000021900 R10: ffff8880c5103900 R11: 0000006423114dc0 R12: 0000000000000000 R13: 000000000027e0eb R14: 0000000000000246 R15: 0000000000000000 FS: 00007f8ec2dbb080(0000) GS:ffff8880c6d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005586974b0198 CR3: 00000000c4bfe002 CR4: 00000000001606e0 Call Trace: cgroup_kill_sb+0x131/0x141 deactivate_locked_super+0x29/0x5b kernfs_mount_ns+0x1fa/0x223 cgroup_do_mount+0x36/0x1c8 cgroup1_mount+0x5b2/0x610 cgroup_mount+0x33b/0x37f mount_fs+0x6a/0x10b vfs_kern_mount+0x67/0x13c do_mount+0x90e/0xb7e ? kmem_cache_alloc_trace+0x241/0x27d ksys_mount+0x72/0x97 __x64_sys_mount+0x21/0x24 do_syscall_64+0x7d/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f8ec1e14ada Code: 48 8b 0d c9 a3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 96 a3 2b 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe2fa32b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00005586974ae2a0 RCX: 00007f8ec1e14ada RDX: 00005586974ae480 RSI: 00005586974ae500 RDI: 00005586974ae4e0 RBP: 0000000000000000 R08: 00005586974ae4a0 R09: 00005586974ae480 R10: 00000000c0ed0000 R11: 0000000000000246 R12: 00005586974ae4e0 R13: 00005586974ae480 R14: 0000000000000000 R15: 00007f8ec2ba8184 irq event stamp: 3532 hardirqs last enabled at (3531): [<ffffffff811c0584>] kfree+0x152/0x159 hardirqs last disabled at (3532): [<ffffffff819e27e8>] _raw_spin_lock_irqsave+0x12/0x44 softirqs last enabled at (3326): [<ffffffff81c00353>] __do_softirq+0x353/0x38f softirqs last disabled at (3317): [<ffffffff81058c16>] irq_exit+0x63/0xd1 ---[ end trace 9bca09dc135d9213 ]--- WARNING: CPU: 3 PID: 3218 at lib/percpu-refcount.c:359 percpu_ref_reinit+0x10/0x17 Modules linked in: CPU: 3 PID: 3218 Comm: bugger Tainted: G W 4.20.0-fscache+ #1287 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 RIP: 0010:percpu_ref_reinit+0x10/0x17 Code: fa ff ff 48 8d 65 f0 4c 89 f6 48 c7 c7 e0 8d 51 82 5b 41 5e 5d e9 3c 50 45 00 48 8b 47 08 a8 03 74 08 48 8b 07 48 85 c0 74 02 <0f> 0b e9 c4 fe ff ff 55 31 f6 53 48 89 fb 48 83 ec 30 65 48 8b 04 RSP: 0018:ffff8880c5103d38 EFLAGS: 00010282 RAX: fffffffffffffffe RBX: ffff8880d35e8000 RCX: ffffffff810f1a8f RDX: 00000000ffffbf7e RSI: 00000000425b018d RDI: ffff8880d35e8020 RBP: ffff8880c5103da8 R08: 0000000000000001 R09: 0000000000000000 R10: ffff8880c5103d40 R11: 0000000000000001 R12: 0000000000000001 R13: 0000000000000000 R14: ffffffffffffffc9 R15: ffff88803f63f001 FS: 00007f8ec2dbb080(0000) GS:ffff8880c6d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005586974b0198 CR3: 00000000c4bfe002 CR4: 00000000001606e0 Call Trace: cgroup1_mount+0x5d1/0x610 cgroup_mount+0x33b/0x37f mount_fs+0x6a/0x10b vfs_kern_mount+0x67/0x13c do_mount+0x90e/0xb7e ? kmem_cache_alloc_trace+0x241/0x27d ksys_mount+0x72/0x97 __x64_sys_mount+0x21/0x24 do_syscall_64+0x7d/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f8ec1e14ada Code: 48 8b 0d c9 a3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 96 a3 2b 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe2fa32b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00005586974ae2a0 RCX: 00007f8ec1e14ada RDX: 00005586974ae480 RSI: 00005586974ae500 RDI: 00005586974ae4e0 RBP: 0000000000000000 R08: 00005586974ae4a0 R09: 00005586974ae480 R10: 00000000c0ed0000 R11: 0000000000000246 R12: 00005586974ae4e0 R13: 00005586974ae480 R14: 0000000000000000 R15: 00007f8ec2ba8184 irq event stamp: 3666 hardirqs last enabled at (3665): [<ffffffff810bb0b1>] __call_rcu+0x1dc/0x1fa hardirqs last disabled at (3666): [<ffffffff81001639>] trace_hardirqs_off_thunk+0x1a/0x1c softirqs last enabled at (3608): [<ffffffff81c00353>] __do_softirq+0x353/0x38f softirqs last disabled at (3535): [<ffffffff81058c16>] irq_exit+0x63/0xd1 ---[ end trace 9bca09dc135d9214 ]---
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index 4b189e821cad..de7d625ec077 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -1285,8 +1285,8 @@ int cgroup1_get_tree(struct fs_context *fc) mutex_lock(&cgroup_mutex); percpu_ref_reinit(&root->cgrp.self.refcnt); mutex_unlock(&cgroup_mutex); - cgroup_get(&root->cgrp); } + cgroup_get(&root->cgrp); /* * If @pinned_sb, we're reusing an existing root and holding an diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index a19f0fec9d82..fe67b5e81f9a 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2019,7 +2019,7 @@ int cgroup_do_get_tree(struct fs_context *fc) ret = kernfs_get_tree(fc); if (ret < 0) - goto out_cgrp; + return ret; /* * In non-init cgroup namespace, instead of root cgroup's dentry, @@ -2038,19 +2038,30 @@ int cgroup_do_get_tree(struct fs_context *fc) mutex_unlock(&cgroup_mutex); nsdentry = kernfs_node_dentry(cgrp->kn, fc->root->d_sb); - if (IS_ERR(nsdentry)) - return PTR_ERR(nsdentry); + if (IS_ERR(nsdentry)) { + ret = PTR_ERR(nsdentry); + goto out_cgrp; + } dput(fc->root); fc->root = nsdentry; } ret = 0; - if (ctx->kfc.new_sb_created) - goto out_cgrp; - apply_cgroup_root_flags(ctx->flags); - return 0; + if (!ctx->kfc.new_sb_created) + apply_cgroup_root_flags(ctx->flags); out_cgrp: + if (!ctx->kfc.new_sb_created) + cgroup_put(&ctx->root->cgrp); + + if (unlikely(ret)) { + struct super_block *sb = fc->root->d_sb; + + dput(fc->root); + deactivate_locked_super(sb); + fc->root = NULL; + } + return ret; }
It looks like the c6b3d5bcd67c ("cgroup: fix top cgroup refcnt leak") commit was reverted by mistake. $ mkdir /tmp/cgroup $ mkdir /tmp/cgroup2 $ mount -t cgroup -o none,name=test test /tmp/cgroup $ mount -t cgroup -o none,name=test test /tmp/cgroup2 $ umount /tmp/cgroup $ umount /tmp/cgroup2 $ cat /proc/self/cgroup | grep test 12:name=test:/ You can see the test cgroup was not freed. Cc: Li Zefan <lizefan@huawei.com> Fixes: aea3f2676c83 ("kernfs, sysfs, cgroup, intel_rdt: Support fs_context") Signed-off-by: Andrei Vagin <avagin@gmail.com> --- v2: clean up code and add the vfs/for-next tag v3: fix a reference leak when kernfs_node_dentry fails v4: call deactivate_locked_super() in a error case v5: don't dereference fc->root after dput() v6: rebase on today's vfs/for-next kernel/cgroup/cgroup-v1.c | 2 +- kernel/cgroup/cgroup.c | 25 ++++++++++++++++++------- 2 files changed, 19 insertions(+), 8 deletions(-)