| Message ID | 20220603181321.443716-1-tadeusz.struk@linaro.org (mailing list archive) |
|---|---|
| State | Not Applicable |
| Series | [v2] cgroup: serialize css kill and release paths |
| Context | Check | Description |
|---|---|---|
| netdev/tree_selection | success | Not a local patch |
On 6/3/22 11:13, Tadeusz Struk wrote:
> Syzbot found a corrupted list bug scenario that can be triggered from
> cgroup_subtree_control_write(cgrp). The reproducer writes to the
> cgroup.subtree_control file, which invokes:
> cgroup_apply_control_enable()->css_create()->css_populate_dir(), which
> then fails with a fault-injected -ENOMEM.
> In such a scenario css_killed_work_fn will be enqueued via
> cgroup_apply_control_disable(cgrp)->kill_css(css), and the code bails out
> to cgroup_kn_unlock(). Then cgroup_kn_unlock() will call
> cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue
> css_release_work_fn for the same css instance, causing a list_add
> corruption bug, as can be seen in the syzkaller report [1].
>
> Fix this by serializing the css ref_kill and css_release jobs.
> The css_release() function will check if css_killed_work_fn() has been
> scheduled for the css and only enqueue css_release_work_fn() if
> css_killed_work_fn() wasn't already enqueued. Otherwise css_release() will
> set the CSS_REL_LATER flag for that css. This will cause the
> css_release_work_fn() work to be executed after css_killed_work_fn() is
> finished.
>
> Two css flags have been introduced to implement this serialization
> mechanism:
>
> * CSS_KILL_ENQED, which will be set when css_killed_work_fn() is enqueued, and
> * CSS_REL_LATER, which, if set, will cause css_release_work_fn() to be
>   scheduled after css_killed_work_fn() is finished.
>
> There is also a new lock, which protects the integrity of the css flags.
>
> [1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465dbd

This also fixes a similar, cgroup-related list corruption issue:
https://syzkaller.appspot.com/bug?id=3c7ff113ccb695e839b859da3fc481c36eb1cfd5
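For readers unfamiliar with the failure mode, here is a minimal, standalone sketch (an illustration added here, not kernel code) of why queueing one work item twice amounts to linking the same list node into the pending list twice, which is what the "list_add corruption" splat in the syzkaller report points at. The check mimics, in spirit, the kernel's CONFIG_DEBUG_LIST-style `__list_add_valid()` check.

```c
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

static void list_add(struct list_head *new, struct list_head *head)
{
	struct list_head *prev = head, *next = head->next;

	/* roughly what CONFIG_DEBUG_LIST reports as "list_add double add" */
	if (new == prev || new == next)
		fprintf(stderr, "list_add double add: new=%p prev=%p next=%p\n",
			(void *)new, (void *)prev, (void *)next);

	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
}

int main(void)
{
	struct list_head worklist = { &worklist, &worklist }; /* empty "pending" list */
	struct list_head work = { &work, &work };             /* one "work_struct" entry */

	list_add(&work, &worklist);   /* first enqueue: fine */
	list_add(&work, &worklist);   /* second enqueue: work now points at itself */

	/* the list is corrupted: walking it from worklist never terminates */
	return 0;
}
```

The second call trips the "double add" check and leaves the entry self-linked, which is the same class of corruption the real workqueue debug checks flag.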
Hello.

On Fri, Jun 03, 2022 at 11:13:21AM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
> In such a scenario css_killed_work_fn will be enqueued via
> cgroup_apply_control_disable(cgrp)->kill_css(css), and the code bails out
> to cgroup_kn_unlock(). Then cgroup_kn_unlock() will call
> cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue
> css_release_work_fn for the same css instance, causing a list_add
> corruption bug, as can be seen in the syzkaller report [1].

This hypothesis doesn't add up to me (I am sorry).

The kill_css(css) would be a css associated with a subsys (css.ss !=
NULL), whereas css_put(&cgrp->self) is a different css just for the
cgroup (css.ss == NULL).

Michal
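To make that distinction concrete, here is a simplified sketch of the two kinds of css involved (a paraphrase of the real definitions in include/linux/cgroup-defs.h, with most fields elided and the array size simplified):

```c
struct cgroup_subsys;

struct cgroup_subsys_state {
	struct cgroup *cgroup;
	struct cgroup_subsys *ss;	/* NULL for a cgroup's own ->self css */
	/* percpu_ref refcnt, flags, parent, online_cnt, ... elided */
};

struct cgroup {
	struct cgroup_subsys_state self;	/* what css_put(&cgrp->self) drops */
	/* ... */
	struct cgroup_subsys_state *subsys[16];	/* what kill_css() acts on (css->ss != NULL) */
};
```

So the css that kill_css() queues css_killed_work_fn() for and the css whose refcount cgroup_kn_unlock() drops are two different objects, which is why the original hypothesis cannot by itself explain a double enqueue of one work_struct.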
On 6/6/22 05:39, Michal Koutný wrote:
> On Fri, Jun 03, 2022 at 11:13:21AM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
>> In such a scenario css_killed_work_fn will be enqueued via
>> cgroup_apply_control_disable(cgrp)->kill_css(css), and the code bails out
>> to cgroup_kn_unlock(). Then cgroup_kn_unlock() will call
>> cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue
>> css_release_work_fn for the same css instance, causing a list_add
>> corruption bug, as can be seen in the syzkaller report [1].
> This hypothesis doesn't add up to me (I am sorry).
>
> The kill_css(css) would be a css associated with a subsys (css.ss !=
> NULL) whereas css_put(&cgrp->self) is a different css just for the
> cgroup (css.ss == NULL).

Yes, you are right. I couldn't figure out where the extra css_put() is
called from, and the only place that fitted my theory was the
cgroup_kn_unlock() in cgroup_apply_control_disable().

After some more debugging I can see that, as you said, cgrp->self is a
different css. The offending _put() is actually called by
percpu_ref_kill_and_confirm(), as it not only calls the passed
confirm_kill percpu_ref_func_t, but also puts the refcnt itself.
Because cgroup_apply_control_disable() loops for_each_live_descendant
and calls kill_css() on all css'es, and css_killed_work_fn() also loops
and calls css_put() on all parents, css_release() will be called on the
first parent prematurely, causing the BUG().

What I think should be done to balance the put/get is to call css_get()
for all the parents in kill_css():

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c1e1a5c34e77..3ca61325bc4e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5527,6 +5527,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
  */
 static void kill_css(struct cgroup_subsys_state *css)
 {
+	struct cgroup_subsys_state *_css = css;
+
 	lockdep_assert_held(&cgroup_mutex);
 
 	if (css->flags & CSS_DYING)
@@ -5541,10 +5543,13 @@ static void kill_css(struct cgroup_subsys_state *css)
 	css_clear_dir(css);
 
 	/*
-	 * Killing would put the base ref, but we need to keep it alive
-	 * until after ->css_offline().
+	 * Killing would put the base ref, but we need to keep it alive,
+	 * and all its parents, until after ->css_offline().
 	 */
-	css_get(css);
+	do {
+		css_get(_css);
+		_css = _css->parent;
+	} while (_css && atomic_read(&_css->online_cnt));
 
 	/*
 	 * cgroup core guarantees that, by the time ->css_offline() is

This will then be "reverted" in css_killed_work_fn().
Please let me know if it makes sense to you. I'm still testing it, but
syzbot is very slow today.
Greeting, FYI, we noticed the following commit (built with gcc-11): commit: 3c87862ca13147416d900cf82ca56bb2f23910bf ("[PATCH v2] cgroup: serialize css kill and release paths") url: https://github.com/intel-lab-lkp/linux/commits/Tadeusz-Struk/cgroup-serialize-css-kill-and-release-paths/20220606-014132 base: https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git for-next patch link: https://lore.kernel.org/netdev/20220603181321.443716-1-tadeusz.struk@linaro.org in testcase: boot on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace): If you fix the issue, kindly add following tag Reported-by: kernel test robot <oliver.sang@intel.com> [ 55.821003][ C1] WARNING: CPU: 1 PID: 1 at kernel/softirq.c:363 __local_bh_enable_ip (kernel/softirq.c:363) [ 55.822745][ C1] Modules linked in: fuse ip_tables [ 55.823837][ C1] CPU: 1 PID: 1 Comm: systemd Not tainted 5.18.0-rc5-00048-g3c87862ca131 #1 [ 55.825505][ C1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014 [ 55.827516][ C1] RIP: 0010:__local_bh_enable_ip (kernel/softirq.c:363) [ 55.828671][ C1] Code: 47 65 8b 05 b1 d9 a8 47 a9 00 ff ff 00 74 30 65 ff 0d a3 d9 a8 47 e8 9e c6 39 00 fb 5b 5d c3 65 8b 05 8f e2 a8 47 85 c0 75 b0 <0f> 0b eb ac e8 c6 c7 39 00 eb ad 48 89 ef e8 3c 89 16 00 eb b6 65 All code ======== 0: 47 rex.RXB 1: 65 8b 05 b1 d9 a8 47 mov %gs:0x47a8d9b1(%rip),%eax # 0x47a8d9b9 8: a9 00 ff ff 00 test $0xffff00,%eax d: 74 30 je 0x3f f: 65 ff 0d a3 d9 a8 47 decl %gs:0x47a8d9a3(%rip) # 0x47a8d9b9 16: e8 9e c6 39 00 callq 0x39c6b9 1b: fb sti 1c: 5b pop %rbx 1d: 5d pop %rbp 1e: c3 retq 1f: 65 8b 05 8f e2 a8 47 mov %gs:0x47a8e28f(%rip),%eax # 0x47a8e2b5 26: 85 c0 test %eax,%eax 28: 75 b0 jne 0xffffffffffffffda 2a:* 0f 0b ud2 <-- trapping instruction 2c: eb ac jmp 0xffffffffffffffda 2e: e8 c6 c7 39 00 callq 0x39c7f9 33: eb ad jmp 0xffffffffffffffe2 35: 48 89 ef mov %rbp,%rdi 38: e8 3c 89 16 00 callq 0x168979 3d: eb b6 jmp 0xfffffffffffffff5 3f: 65 gs Code starting with the faulting instruction =========================================== 0: 0f 0b ud2 2: eb ac jmp 0xffffffffffffffb0 4: e8 c6 c7 39 00 callq 0x39c7cf 9: eb ad jmp 0xffffffffffffffb8 b: 48 89 ef mov %rbp,%rdi e: e8 3c 89 16 00 callq 0x16894f 13: eb b6 jmp 0xffffffffffffffcb 15: 65 gs [ 55.832215][ C1] RSP: 0018:ffffc90000178d60 EFLAGS: 00010046 [ 55.833390][ C1] RAX: 0000000000000000 RBX: 0000000000000201 RCX: 1ffffffff7a3ca41 [ 55.834990][ C1] RDX: 0000000000000000 RSI: 0000000000000201 RDI: ffffffffb88332da [ 55.836576][ C1] RBP: ffffffffb88332da R08: 0000000000000000 R09: ffff8881c0af245b [ 55.838123][ C1] R10: ffffed103815e48b R11: 0000000000000001 R12: 0000000000000008 [ 55.839742][ C1] R13: ffff8881a593a000 R14: dffffc0000000000 R15: ffff8881c0af2418 [ 55.841328][ C1] FS: 00007fd6571cc900(0000) GS:ffff88839d500000(0000) knlGS:0000000000000000 [ 55.843084][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 55.844302][ C1] CR2: 00007fd51c623028 CR3: 00000001bfc72000 CR4: 00000000000406e0 [ 55.845905][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 55.847510][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 55.849058][ C1] Call Trace: [ 55.849769][ C1] <IRQ> [ 55.850460][ C1] put_css_set_locked (include/linux/percpu-refcount.h:335 include/linux/percpu-refcount.h:351 include/linux/cgroup.h:404 kernel/cgroup/cgroup.c:971 kernel/cgroup/cgroup.c:955) [ 55.851549][ C1] cgroup_free 
(include/linux/spinlock.h:404 kernel/cgroup/cgroup-internal.h:212 kernel/cgroup/cgroup-internal.h:198 kernel/cgroup/cgroup.c:6480) [ 55.852470][ C1] __put_task_struct (kernel/fork.c:842) [ 55.853436][ C1] rcu_do_batch (include/linux/rcupdate.h:273 kernel/rcu/tree.c:2537) [ 55.854361][ C1] ? lock_is_held_type (kernel/locking/lockdep.c:5382 kernel/locking/lockdep.c:5684) [ 55.855400][ C1] ? rcu_implicit_dynticks_qs (kernel/rcu/tree.c:2474) [ 55.856581][ C1] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4493) [ 55.857836][ C1] ? note_gp_changes (arch/x86/include/asm/irqflags.h:45 (discriminator 1) arch/x86/include/asm/irqflags.h:80 (discriminator 1) arch/x86/include/asm/irqflags.h:138 (discriminator 1) kernel/rcu/tree.c:1698 (discriminator 1)) [ 55.858828][ C1] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 22)) [ 55.859889][ C1] rcu_core (kernel/rcu/tree.c:2788) [ 55.860786][ C1] __do_softirq (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/irq.h:142 kernel/softirq.c:559) [ 55.861747][ C1] __irq_exit_rcu (kernel/softirq.c:432 kernel/softirq.c:637) [ 55.862673][ C1] irq_exit_rcu (kernel/softirq.c:651) [ 55.863601][ C1] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1097 (discriminator 14)) [ 55.864709][ C1] </IRQ> [ 55.865399][ C1] <TASK> [ 55.866087][ C1] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:645) [ 55.867251][ C1] RIP: 0010:walk_component (fs/namei.c:2027) [ 55.868365][ C1] Code: c7 43 08 00 00 00 00 48 8b 84 24 80 00 00 00 65 48 2b 04 25 28 00 00 00 0f 85 99 03 00 00 48 81 c4 88 00 00 00 4c 89 e0 5b 5d <41> 5c 41 5d 41 5e 41 5f c3 48 b8 00 00 00 00 00 fc ff df 48 8d 7d All code ======== 0: c7 43 08 00 00 00 00 movl $0x0,0x8(%rbx) 7: 48 8b 84 24 80 00 00 mov 0x80(%rsp),%rax e: 00 f: 65 48 2b 04 25 28 00 sub %gs:0x28,%rax 16: 00 00 18: 0f 85 99 03 00 00 jne 0x3b7 1e: 48 81 c4 88 00 00 00 add $0x88,%rsp 25: 4c 89 e0 mov %r12,%rax 28: 5b pop %rbx 29: 5d pop %rbp 2a:* 41 5c pop %r12 <-- trapping instruction 2c: 41 5d pop %r13 2e: 41 5e pop %r14 30: 41 5f pop %r15 32: c3 retq 33: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax 3a: fc ff df 3d: 48 rex.W 3e: 8d .byte 0x8d 3f: 7d .byte 0x7d Code starting with the faulting instruction =========================================== 0: 41 5c pop %r12 2: 41 5d pop %r13 4: 41 5e pop %r14 6: 41 5f pop %r15 8: c3 retq 9: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax 10: fc ff df 13: 48 rex.W 14: 8d .byte 0x8d 15: 7d .byte 0x7d [ 55.871899][ C1] RSP: 0018:ffffc9000001f988 EFLAGS: 00000286 [ 55.873081][ C1] RAX: 0000000000000000 RBX: 000000063dbe831e RCX: ffffc9000001f690 [ 55.874697][ C1] RDX: 0000000000000000 RSI: ffffffffbc798f20 RDI: ffffc9000001fbe0 [ 55.876281][ C1] RBP: 0000000000000006 R08: ffff888100310dd8 R09: ffffffffbd1de4e7 [ 55.877803][ C1] R10: ffffed102003531b R11: 0000000000000001 R12: 0000000000000000 [ 55.879386][ C1] R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000 [ 55.881044][ C1] link_path_walk+0x563/0xc80 [ 55.882258][ C1] ? path_init (fs/namei.c:2411) [ 55.883192][ C1] ? __raw_spin_lock_init (kernel/locking/spinlock_debug.c:26) [ 55.884260][ C1] ? open_last_lookups (fs/namei.c:2265) [ 55.885290][ C1] ? debug_mutex_init (kernel/locking/mutex-debug.c:89) [ 55.886305][ C1] ? __alloc_file (fs/file_table.c:153) [ 55.887298][ C1] path_openat (fs/namei.c:3605) [ 55.888247][ C1] ? __lock_acquire (kernel/locking/lockdep.c:5029) [ 55.889229][ C1] ? 
path_lookupat (fs/namei.c:3591) [ 55.890215][ C1] do_filp_open (fs/namei.c:3636) [ 55.893219][ C1] ? alloc_fd (fs/file.c:555 (discriminator 10)) [ 55.894154][ C1] ? may_open_dev (fs/namei.c:3630) [ 55.895168][ C1] ? __lock_release (kernel/locking/lockdep.c:5317) [ 55.896219][ C1] ? lock_is_held_type (kernel/locking/lockdep.c:5382 kernel/locking/lockdep.c:5684) [ 55.897250][ C1] ? alloc_fd (fs/file.c:555 (discriminator 10)) [ 55.898122][ C1] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:85 include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:186) [ 55.899114][ C1] ? alloc_fd (fs/file.c:555 (discriminator 10)) [ 55.900031][ C1] ? getname_flags (fs/namei.c:204) [ 55.901118][ C1] do_sys_openat2 (fs/open.c:1213) [ 55.902081][ C1] ? rcu_nocb_cb_kthread (kernel/rcu/tree_nocb.h:382) [ 55.903138][ C1] ? mntput_no_expire (fs/namespace.c:1208) [ 55.904198][ C1] ? lock_is_held_type (kernel/locking/lockdep.c:5382 kernel/locking/lockdep.c:5684) [ 55.905231][ C1] ? build_open_flags (fs/open.c:1199) [ 55.906242][ C1] ? call_rcu (arch/x86/include/asm/irqflags.h:45 arch/x86/include/asm/irqflags.h:80 arch/x86/include/asm/irqflags.h:138 kernel/rcu/tree.c:3109) [ 55.907155][ C1] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 22)) [ 55.908186][ C1] ? call_rcu (arch/x86/include/asm/irqflags.h:45 arch/x86/include/asm/irqflags.h:80 arch/x86/include/asm/irqflags.h:138 kernel/rcu/tree.c:3109) [ 55.909104][ C1] __x64_sys_openat (fs/open.c:1240) [ 55.910087][ C1] ? __ia32_compat_sys_open (fs/open.c:1240) [ 55.911242][ C1] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4501) [ 55.912481][ C1] ? syscall_enter_from_user_mode (arch/x86/include/asm/irqflags.h:45 arch/x86/include/asm/irqflags.h:80 kernel/entry/common.c:109) [ 55.913602][ C1] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 22)) [ 55.914638][ C1] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) [ 55.915572][ C1] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4501) [ 55.916880][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87) [ 55.917809][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87) [ 55.918770][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87) [ 55.919737][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87) [ 55.920694][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87) To reproduce: # build kernel cd linux cp config-5.18.0-rc5-00048-g3c87862ca131 .config make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install cd <mod-install-dir> find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz git clone https://github.com/intel/lkp-tests.git cd lkp-tests bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email # if come across any failure that blocks the test, # please remove ~/.lkp and /lkp dir to run from a clean state.
On 6/9/22 01:56, kernel test robot wrote:
>
> Greeting,
>
> FYI, we noticed the following commit (built with gcc-11):
>
> commit: 3c87862ca13147416d900cf82ca56bb2f23910bf ("[PATCH v2] cgroup: serialize css kill and release paths")
> url: https://github.com/intel-lab-lkp/linux/commits/Tadeusz-Struk/cgroup-serialize-css-kill-and-release-paths/20220606-014132
> base: https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git for-next
> patch link: https://lore.kernel.org/netdev/20220603181321.443716-1-tadeusz.struk@linaro.org
>
> in testcase: boot
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <oliver.sang@intel.com>
>
> [ 55.821003][ C1] WARNING: CPU: 1 PID: 1 at kernel/softirq.c:363 __local_bh_enable_ip (kernel/softirq.c:363)

Looks like that will need to be spin_lock_irq(&css->lock) instead of
spin_lock_bh(&css->lock). I can respin the patch, but I would like to
request some feedback on it first.

Tejun, Michal, are you interested in fixing this syzbot issue at all?
Do you have any more feedback on this?
Hello Tadeusz.

On Thu, Jun 09, 2022 at 07:30:41AM -0700, Tadeusz Struk <tadeusz.struk@linaro.org> wrote:
> Are you interested in fixing this syzbot issue at all?

The (original) syzbot report is conditioned by an allocation failure that's
unlikely under normal conditions (AFAIU). Hence I don't treat it as extra
high urgent. OTOH, it's interesting and it points to some disparity worth
fixing -- so I try helping (as time permits; so far I can only run the
reproducers via the syzbot).

> Do you have any more feedback on this?

Ad the patch v2 with a spinlock per css -- that looks like an overkill to
me; I didn't look deeper into it.

Ad the in-thread patch with the ancestry css_get(), the ->parent ref:
- is inc'd in init_and_link_css(),
- is dec'd in css_free_rwork_fn(),
and thanks to ->online_cnt, offlining is ordered (child before parent).

Where does your patch dec these ancestry references? (Or why would they be
missing in the first place?)

Thanks for digging into this!

Michal
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1bfcfb1af352..8dc8b4edb242 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -53,6 +53,8 @@ enum {
 	CSS_RELEASED	= (1 << 2), /* refcnt reached zero, released */
 	CSS_VISIBLE	= (1 << 3), /* css is visible to userland */
 	CSS_DYING	= (1 << 4), /* css is dying */
+	CSS_KILL_ENQED	= (1 << 5), /* kill work enqueued for the css */
+	CSS_REL_LATER	= (1 << 6), /* release needs to be done after kill */
 };
 
 /* bits in struct cgroup flags field */
@@ -162,6 +164,8 @@ struct cgroup_subsys_state {
 	 */
 	int id;
 
+	/* lock to protect flags */
+	spinlock_t lock;
 	unsigned int flags;
 
 	/*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1779ccddb734..b1bbd438d426 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5210,8 +5210,23 @@ static void css_release(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
-	INIT_WORK(&css->destroy_work, css_release_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	spin_lock_bh(&css->lock);
+
+	/*
+	 * Check if the css_killed_work_fn work has been scheduled for this
+	 * css and enqueue css_release_work_fn only if it wasn't.
+	 * Otherwise set the CSS_REL_LATER flag, which will cause
+	 * release to be enqueued after css_killed_work_fn is finished.
+	 * This is to prevent list corruption by en-queuing two instance
+	 * of the same work struct on the same WQ, namely cgroup_destroy_wq.
+	 */
+	if (!(css->flags & CSS_KILL_ENQED)) {
+		INIT_WORK(&css->destroy_work, css_release_work_fn);
+		queue_work(cgroup_destroy_wq, &css->destroy_work);
+	} else {
+		css->flags |= CSS_REL_LATER;
+	}
+	spin_unlock_bh(&css->lock);
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5230,6 +5245,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 	INIT_LIST_HEAD(&css->rstat_css_node);
 	css->serial_nr = css_serial_nr_next++;
 	atomic_set(&css->online_cnt, 0);
+	spin_lock_init(&css->lock);
 
 	if (cgroup_parent(cgrp)) {
 		css->parent = cgroup_css(cgroup_parent(cgrp), ss);
@@ -5545,10 +5561,12 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
  */
 static void css_killed_work_fn(struct work_struct *work)
 {
-	struct cgroup_subsys_state *css =
+	struct cgroup_subsys_state *css_killed, *css =
 		container_of(work, struct cgroup_subsys_state, destroy_work);
 
 	mutex_lock(&cgroup_mutex);
+	css_killed = css;
+	css_killed->flags &= ~CSS_KILL_ENQED;
 
 	do {
 		offline_css(css);
@@ -5557,6 +5575,14 @@ static void css_killed_work_fn(struct work_struct *work)
 		css = css->parent;
 	} while (css && atomic_dec_and_test(&css->online_cnt));
 
+	spin_lock_bh(&css_killed->lock);
+	if (css_killed->flags & CSS_REL_LATER) {
+		/* If css_release work was delayed for the css enqueue it now. */
+		INIT_WORK(&css_killed->destroy_work, css_release_work_fn);
+		queue_work(cgroup_destroy_wq, &css_killed->destroy_work);
+		css_killed->flags &= ~CSS_REL_LATER;
+	}
+	spin_unlock_bh(&css_killed->lock);
 	mutex_unlock(&cgroup_mutex);
 }
 
@@ -5566,10 +5592,13 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
+	spin_lock_bh(&css->lock);
 	if (atomic_dec_and_test(&css->online_cnt)) {
+		css->flags |= CSS_KILL_ENQED;
 		INIT_WORK(&css->destroy_work, css_killed_work_fn);
 		queue_work(cgroup_destroy_wq, &css->destroy_work);
 	}
+	spin_unlock_bh(&css->lock);
 }
 
 /**
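To show the intended ordering in isolation, here is a small single-threaded model (a sketch of the idea only, not the patch itself; it deliberately ignores the spin_lock_bh locking problem the test robot reported below) of how CSS_KILL_ENQED and CSS_REL_LATER are meant to keep the shared destroy_work from being queued twice:

```c
#include <stdio.h>

#define CSS_KILL_ENQED (1u << 5)
#define CSS_REL_LATER  (1u << 6)

struct fake_css { unsigned int flags; };

static void queue_release(struct fake_css *css) { (void)css; printf("release work queued\n"); }
static void queue_kill(struct fake_css *css)    { (void)css; printf("kill work queued\n"); }

static void css_killed_ref(struct fake_css *css)
{
	css->flags |= CSS_KILL_ENQED;	/* remember the kill work is pending */
	queue_kill(css);
}

static void css_release(struct fake_css *css)
{
	if (!(css->flags & CSS_KILL_ENQED))
		queue_release(css);
	else
		css->flags |= CSS_REL_LATER;	/* defer instead of double-queueing */
}

static void css_killed_work(struct fake_css *css)
{
	css->flags &= ~CSS_KILL_ENQED;
	printf("kill work ran\n");
	if (css->flags & CSS_REL_LATER) {
		css->flags &= ~CSS_REL_LATER;
		queue_release(css);		/* release only runs after kill */
	}
}

int main(void)
{
	struct fake_css css = { 0 };

	css_killed_ref(&css);	/* kill path wins the race */
	css_release(&css);	/* last ref dropped while kill is pending */
	css_killed_work(&css);	/* kill finishes, deferred release is queued */
	return 0;
}
```

Under this model the release work can only be queued when the kill work is not pending, so the single shared work item is never on cgroup_destroy_wq twice.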
Syzbot found a corrupted list bug scenario that can be triggered from
cgroup_subtree_control_write(cgrp). The reproducer writes to the
cgroup.subtree_control file, which invokes:
cgroup_apply_control_enable()->css_create()->css_populate_dir(), which
then fails with a fault-injected -ENOMEM.
In such a scenario css_killed_work_fn will be enqueued via
cgroup_apply_control_disable(cgrp)->kill_css(css), and the code bails out
to cgroup_kn_unlock(). Then cgroup_kn_unlock() will call
cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue
css_release_work_fn for the same css instance, causing a list_add
corruption bug, as can be seen in the syzkaller report [1].

Fix this by serializing the css ref_kill and css_release jobs.
The css_release() function will check if css_killed_work_fn() has been
scheduled for the css and only enqueue css_release_work_fn() if
css_killed_work_fn() wasn't already enqueued. Otherwise css_release() will
set the CSS_REL_LATER flag for that css. This will cause the
css_release_work_fn() work to be executed after css_killed_work_fn() is
finished.

Two css flags have been introduced to implement this serialization
mechanism:

* CSS_KILL_ENQED, which will be set when css_killed_work_fn() is enqueued, and
* CSS_REL_LATER, which, if set, will cause css_release_work_fn() to be
  scheduled after css_killed_work_fn() is finished.

There is also a new lock, which protects the integrity of the css flags.

[1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465dbd

Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: <cgroups@vger.kernel.org>
Cc: <netdev@vger.kernel.org>
Cc: <bpf@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Cc: <linux-kernel@vger.kernel.org>
Reported-and-tested-by: syzbot+e42ae441c3b10acf9e9d@syzkaller.appspotmail.com
Fixes: 8f36aaec9c92 ("cgroup: Use rcu_work instead of explicit rcu and work item")
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
---
v2: Use correct lock in css_killed_work_fn()
---
 include/linux/cgroup-defs.h |  4 ++++
 kernel/cgroup/cgroup.c      | 35 ++++++++++++++++++++++++++++++++---
 2 files changed, 36 insertions(+), 3 deletions(-)