Message ID | CAM_iQpUuW=34pxJtkfw_sHTzmm3mUfrg-Fs3++kU+PktBfDJNg@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, Nov 29, 2017 at 11:37 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote: > (Cc'ing fs people...) > > On Wed, Nov 29, 2017 at 12:33 AM, syzbot wrote: >> BUG: KASAN: use-after-free in sock_release+0x1c6/0x1e0 net/socket.c:601 Lovely. Yeah, that is: 601 if (rcu_dereference_protected(sock->wq, 1)->fasync_list) and as you say, that "rcu_dereference_protected()" is confusing, but that should be ok because we have a ref to the inode, and we're really just testing that the pointer is zero. The call trace here is: >> sock_release+0x1c6/0x1e0 net/socket.c:601 >> sock_close+0x16/0x20 net/socket.c:1125 >> __fput+0x333/0x7f0 fs/file_table.c:210 >> ____fput+0x15/0x20 fs/file_table.c:244 >> task_work_run+0x199/0x270 kernel/task_work.c:113 and there is no RCU protection anywhere, but it's really just a sanity check, and the access _should_ be ok. The stale access does seem to be because 'sock' (embedded in the inode) itself that has been free'd: >> Allocated by task 31066: >> kmalloc include/linux/slab.h:499 [inline] >> sock_alloc_inode+0xb4/0x300 net/socket.c:253 >> alloc_inode+0x65/0x180 fs/inode.c:208 >> new_inode_pseudo+0x69/0x190 fs/inode.c:890 >> sock_alloc+0x41/0x270 net/socket.c:565 >> __sock_create+0x148/0x850 net/socket.c:1225 >> sock_create net/socket.c:1301 [inline] >> SYSC_socket net/socket.c:1331 [inline] >> SyS_socket+0xeb/0x200 net/socket.c:1311 > > This looks more like a fs issue than network, my fs knowledge > is not good enough to justify why the hell the inode could be > destroyed before we release the fd. Ugh. The inode freeing really is confusing and fairly involved, but the last free *should* happen as part of the final dput() that is done at the end of __fput(). So in __fput() calls into the if (file->f_op->release) file->f_op->release(inode, file); then the inode should still be around, because the final ref won't be done until later. And RCU simply shouldn't be an issue, because of that reference count on the inode. So it smells like some reference counting went wrong. The socket inode creation is a bit confusing, and then in "sock_release()" we do have that if (!sock->file) { iput(SOCK_INODE(sock)); return; } sock->file = NULL; which *also* tries to free the inode. I'm not sure what the logic (and what the locking) behind that code all is. What *is* the locking for "sock->file" anyway? Al, can you take a look on the vfs side? But I'm inclined to blame the socket code, because if we really had a "inode free'd early" issue at a vfs level, I'd have expected us to see infinite chaos. Linus
On Wed, 2017-11-29 at 11:37 -0800, Cong Wang wrote: > (Cc'ing fs people...) > > On Wed, Nov 29, 2017 at 12:33 AM, syzbot > <bot+9abea25706ae35022385a41f61e579ed66e88a3f@syzkaller.appspotmail.c > om> > wrote: > > Hello, > > > > syzkaller hit the following crash on > > 1d3b78bbc6e983fabb3fbf91b76339bf66e4a12c > > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net- > > next.git/master > > compiler: gcc (GCC) 7.1.1 20170620 > > .config is attached > > Raw console output is attached. > > > > Unfortunately, I don't have any reproducer for this bug yet. > > > > > > device syz3 left promiscuous mode > > device syz3 entered promiscuous mode > > ================================================================== > > BUG: KASAN: use-after-free in sock_release+0x1c6/0x1e0 > > net/socket.c:601 > > Read of size 8 at addr ffff8801c8dd1d10 by task syz-executor4/31085 > > > > CPU: 0 PID: 31085 Comm: syz-executor4 Not tainted 4.14.0+ #129 > > Hardware name: Google Google Compute Engine/Google Compute Engine, > > BIOS > > Google 01/01/2011 > > Call Trace: > > __dump_stack lib/dump_stack.c:17 [inline] > > dump_stack+0x194/0x257 lib/dump_stack.c:53 > > print_address_description+0x73/0x250 mm/kasan/report.c:252 > > kasan_report_error mm/kasan/report.c:351 [inline] > > kasan_report+0x25b/0x340 mm/kasan/report.c:409 > > __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430 > > sock_release+0x1c6/0x1e0 net/socket.c:601 > > sock_close+0x16/0x20 net/socket.c:1125 > > __fput+0x333/0x7f0 fs/file_table.c:210 > > ____fput+0x15/0x20 fs/file_table.c:244 > > task_work_run+0x199/0x270 kernel/task_work.c:113 > > exit_task_work include/linux/task_work.h:22 [inline] > > do_exit+0x9bb/0x1ae0 kernel/exit.c:865 > > do_group_exit+0x149/0x400 kernel/exit.c:968 > > get_signal+0x73f/0x16c0 kernel/signal.c:2335 > > do_signal+0x94/0x1ee0 arch/x86/kernel/signal.c:809 > > exit_to_usermode_loop+0x214/0x310 arch/x86/entry/common.c:158 > > prepare_exit_to_usermode arch/x86/entry/common.c:195 [inline] > > syscall_return_slowpath+0x490/0x550 arch/x86/entry/common.c:264 > > entry_SYSCALL_64_fastpath+0x94/0x96 > > RIP: 0033:0x452879 > > RSP: 002b:00007fb1c2435ce8 EFLAGS: 00000246 ORIG_RAX: > > 00000000000000ca > > RAX: fffffffffffffe00 RBX: 0000000000758100 RCX: 0000000000452879 > > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000758100 > > RBP: 0000000000758100 R08: 0000000000000304 R09: 00000000007580d8 > > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 > > R13: 0000000000a6f7ff R14: 00007fb1c24369c0 R15: 000000000000000e > > > > Allocated by task 31066: > > save_stack+0x43/0xd0 mm/kasan/kasan.c:447 > > set_track mm/kasan/kasan.c:459 [inline] > > kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551 > > kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3613 > > kmalloc include/linux/slab.h:499 [inline] > > sock_alloc_inode+0xb4/0x300 net/socket.c:253 > > alloc_inode+0x65/0x180 fs/inode.c:208 > > new_inode_pseudo+0x69/0x190 fs/inode.c:890 > > sock_alloc+0x41/0x270 net/socket.c:565 > > __sock_create+0x148/0x850 net/socket.c:1225 > > sock_create net/socket.c:1301 [inline] > > SYSC_socket net/socket.c:1331 [inline] > > SyS_socket+0xeb/0x200 net/socket.c:1311 > > entry_SYSCALL_64_fastpath+0x1f/0x96 > > > > Freed by task 3039: > > save_stack+0x43/0xd0 mm/kasan/kasan.c:447 > > set_track mm/kasan/kasan.c:459 [inline] > > kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524 > > __cache_free mm/slab.c:3491 [inline] > > kfree+0xca/0x250 mm/slab.c:3806 > > __rcu_reclaim kernel/rcu/rcu.h:190 [inline] > > rcu_do_batch kernel/rcu/tree.c:2758 [inline] > > invoke_rcu_callbacks kernel/rcu/tree.c:3012 [inline] > > __rcu_process_callbacks kernel/rcu/tree.c:2979 [inline] > > rcu_process_callbacks+0xe79/0x17d0 kernel/rcu/tree.c:2996 > > __do_softirq+0x29d/0xbb2 kernel/softirq.c:285 > > This looks more like a fs issue than network, my fs knowledge > is not good enough to justify why the hell the inode could be > destroyed before we release the fd. > > My _guess_ is that it is because we defer the ____fput() to a > task work. If this is the case, then fs layer is not guilty for this. > > On the other hand, if we have to blame net layer, it does look > suspicious on the RCU usage in sock_release() where we > claim RCU protection but I don't see we hold any RCU lock > there. There is rcu protection for sock->wq, and the 1 in rcu_dereference_protected(sock->wq, 1) is because we do not have a lockdep convenient way to express that we are the last user of sock, and about to free it. > Also, the code that deferences sock->wq is pretty much > useless now, at least I don't see it catches any bug though. > > > diff --git a/net/socket.c b/net/socket.c > index 42d8e9c9ccd5..b2390b5591a9 100644 > --- a/net/socket.c > +++ b/net/socket.c > @@ -598,9 +598,6 @@ void sock_release(struct socket *sock) > module_put(owner); > } > > - if (rcu_dereference_protected(sock->wq, 1)->fasync_list) > - pr_err("%s: fasync list not empty!\n", __func__); > - > At this point, sock->wq must be valid, and freed later (by us) This really looks like some other bug, and a late effect.
On Wed, Nov 29, 2017 at 12:24:55PM -0800, Linus Torvalds wrote: > Ugh. The inode freeing really is confusing and fairly involved, but > the last free *should* happen as part of the final dput() that is done > at the end of __fput(). Note that struct socket is coallocated with its inode. _Normally_ from sock_alloc() (and that's the case here, apparently), but in several cases it's embedded into another object. TUN and TAP - definitely, might have been other added. Those should never be passed to sock_release() at all. > So in __fput() calls into the > > if (file->f_op->release) > file->f_op->release(inode, file); > > then the inode should still be around, because the final ref won't be > done until later. And RCU simply shouldn't be an issue, because of > that reference count on the inode. > > So it smells like some reference counting went wrong. The socket inode > creation is a bit confusing, and then in "sock_release()" we do have > that > > if (!sock->file) { > iput(SOCK_INODE(sock)); > return; > } > sock->file = NULL; > > which *also* tries to free the inode. I'm not sure what the logic (and > what the locking) behind that code all is. If socket has never gone through sock_alloc_file(), sock_release() on it is called explicitly and frees the sucker. If it has been through sock_alloc_file(), we must not call sock_release() directly and freeing is done by iput() from final fput(). > What *is* the locking for "sock->file" anyway? Pretty much assign-once - zeroing it in the end of sock_release() is pure cosmetics (we'd damn better have no other references to that sucker left anywhere; there's still a reference to embedded inode, but that's it). FWIW, looking through the callers of sock_alloc_file()... we might be better off if it did sock_release() on failure. Then the calling conventions become "sock_alloc_file() means not calling sock_release() directly - either it'll be done by the final fput() on resulting file, or by sock_alloc_file() itself". Look: 1) in lustre: sock_filp = sock_alloc_file(sock, 0, NULL); if (IS_ERR(sock_filp)) { sock_release(sock); rc = PTR_ERR(sock_filp); goto out; } 2) in net/9p: file = sock_alloc_file(csocket, 0, NULL); if (IS_ERR(file)) { pr_err("%s (%d): failed to map fd\n", __func__, task_pid_nr(current)); sock_release(csocket); kfree(p); return PTR_ERR(file); } 3) in sctp: *newfile = sock_alloc_file(newsock, 0, NULL); if (IS_ERR(*newfile)) { put_unused_fd(retval); sock_release(newsock); retval = PTR_ERR(*newfile); *newfile = NULL; return retval; } 4) in accept4(): newfile = sock_alloc_file(newsock, flags, sock->sk->sk_prot_creator->name); if (IS_ERR(newfile)) { err = PTR_ERR(newfile); put_unused_fd(newfd); sock_release(newsock); goto out_put; } 5) called in sock_map_fd(), and the sole caller is retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK)); if (retval < 0) goto out_release; ... out_release: sock_release(sock); return retval; (with no fallthrough or other goto into out_release) 6) the second caller in socketpair(): newfile2 = sock_alloc_file(sock2, flags, NULL); if (IS_ERR(newfile2)) { err = PTR_ERR(newfile2); goto out_fput_1; } ... out_fput_1: fput(newfile1); put_unused_fd(fd2); put_unused_fd(fd1); sock_release(sock2); goto out; (again, no fallthrough or other goto into out_fput_1) 7) the first caller in socketpair(): newfile1 = sock_alloc_file(sock1, flags, NULL); if (IS_ERR(newfile1)) { err = PTR_ERR(newfile1); goto out_put_unused_both; } ... out_put_unused_both: put_unused_fd(fd2); out_put_unused_1: put_unused_fd(fd1); out_release_both: sock_release(sock2); out_release_1: sock_release(sock1); out: return err; No fallthrough or goto either. Sure, we get a failure exit unshared, but AFAICS some reordering can simplify things quite a bit there. 8) kcm_clone(). Fucked in head - we allocate socket, then file, *THEN* sock, then attach sock to socket (already attached to file), then finally deign to initialize sock (already attached to socket, which is attached to file). And, surprise, surprise, failure exits are all wrong. Moreover, calling conventions are broken by design - after we'd put the damn file into descriptor table we return the pointer to sock to the caller. By that time it might have bloody well been destroyed by close(2) from another thread; good thing the caller doesn't use the damn thing. Unfortunately, it is doing this: if (!err) { if (copy_to_user((void __user *)arg, &info, sizeof(info))) { err = -EFAULT; sys_close(info.fd); } } which is also bogus - again, the descriptor might have been already closed by another thread, with another one put in its place, etc. The right way to do that is to do fd_install() _last_. After the last failure exit, i.e. after copy_to_user() in that case. KCM would've looked as a likely cause of that shit, if not for the fact that object had been created by socket(2). It definitely needs fixing, but that's not the cause of PITA here. Incidentally, grepping for sys_close() shows another piece of fun in net/netfilter/xt_bpf.c. Folks, ONCE DESCRIPTOR IS INSTALLED, THAT'S IT; THERE'S NO REMOVING IT ON FAILURE EXITS. sys_close() should never, ever be used that way. Sigh...
On Wed, Nov 29, 2017 at 11:37:04AM -0800, Cong Wang wrote: > > Allocated by task 31066: > > save_stack+0x43/0xd0 mm/kasan/kasan.c:447 > > set_track mm/kasan/kasan.c:459 [inline] > > kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551 > > kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3613 > > kmalloc include/linux/slab.h:499 [inline] > > sock_alloc_inode+0xb4/0x300 net/socket.c:253 > > alloc_inode+0x65/0x180 fs/inode.c:208 > > new_inode_pseudo+0x69/0x190 fs/inode.c:890 > > sock_alloc+0x41/0x270 net/socket.c:565 > > __sock_create+0x148/0x850 net/socket.c:1225 > > sock_create net/socket.c:1301 [inline] > > SYSC_socket net/socket.c:1331 [inline] > > SyS_socket+0xeb/0x200 net/socket.c:1311 > > entry_SYSCALL_64_fastpath+0x1f/0x96 > > > > Freed by task 3039: > > save_stack+0x43/0xd0 mm/kasan/kasan.c:447 > > set_track mm/kasan/kasan.c:459 [inline] > > kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524 > > __cache_free mm/slab.c:3491 [inline] > > kfree+0xca/0x250 mm/slab.c:3806 > > __rcu_reclaim kernel/rcu/rcu.h:190 [inline] > > rcu_do_batch kernel/rcu/tree.c:2758 [inline] > > invoke_rcu_callbacks kernel/rcu/tree.c:3012 [inline] > > __rcu_process_callbacks kernel/rcu/tree.c:2979 [inline] > > rcu_process_callbacks+0xe79/0x17d0 kernel/rcu/tree.c:2996 > > __do_softirq+0x29d/0xbb2 kernel/softirq.c:285 IDGI. We are running into the object pointed to by sock->wq already freed, right? So how the hell had we managed to _fetch_ the pointer in the first place? Freeing had been scheduled by wq = rcu_dereference_protected(ei->socket.wq, 1); kfree_rcu(wq, rcu); kmem_cache_free(sock_inode_cachep, ei); so we should have * sock_destroy_inode() run on another CPU while we are in the middle of sock_release(), sock->wq fetched by sock_release(), sock->wq fed to kfree_rcu() by sock_destroy_inode() *AND* freed before sock_release() got around to dereferencing it. Not impossible to hit, but... why hadn't we run into much wider window? If that sock_destroy_inode() on another CPU had gotten to the call right after that kfree_rcu(), we would've seen use-after-free on attempt to fetch ->wq... And it goes without saying that sock_destroy_inode() should not have happened in parallel to sock_release(), or, for that matter, to anything else done to struct socket instance...
On Thu, Nov 30, 2017 at 02:07:19AM +0000, Al Viro wrote: > Incidentally, grepping for sys_close() shows another piece of fun in > net/netfilter/xt_bpf.c. Folks, ONCE DESCRIPTOR IS INSTALLED, THAT'S > IT; THERE'S NO REMOVING IT ON FAILURE EXITS. sys_close() should > never, ever be used that way. Sigh... Would be great do unexport the thing. Except that we also have binfmt_misc (which looks legit) and autofs4, which on crack decided that close() isn't a fun syscall, they'd much rather have an ioctl that does exactly the same..
On Thu, Nov 30, 2017 at 05:18:33AM -0800, Christoph Hellwig wrote: > On Thu, Nov 30, 2017 at 02:07:19AM +0000, Al Viro wrote: > > Incidentally, grepping for sys_close() shows another piece of fun in > > net/netfilter/xt_bpf.c. Folks, ONCE DESCRIPTOR IS INSTALLED, THAT'S > > IT; THERE'S NO REMOVING IT ON FAILURE EXITS. sys_close() should > > never, ever be used that way. Sigh... > > Would be great do unexport the thing. Except that we also have > binfmt_misc (which looks legit) and autofs4, which on crack decided > that close() isn't a fun syscall, they'd much rather have an ioctl > that does exactly the same.. Yes, since binfmt_misc one is guaranteed that its descriptor table is not shared - all callchains go through do_execveat_common(), where we'd use unshare_files(). autofs one is... not in good taste, but still safe; there the descriptor is preexisting and it's essentially a weird way of spelling close(2). References from syscall tables are, of course, OK. init/*.c uses are done pretty much from userland - they could have been straight syscalls, if not for the lack of klibc in kernel tree. Everything else, though... IMO we need a whack-a-mole list somewhere; "new callers of sys_close() anywhere outside of init/* and syscall tables" definitely should be on it...
On Thu, Nov 30, 2017 at 4:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Thu, Nov 30, 2017 at 05:18:33AM -0800, Christoph Hellwig wrote: >> On Thu, Nov 30, 2017 at 02:07:19AM +0000, Al Viro wrote: >> > Incidentally, grepping for sys_close() shows another piece of fun in >> > net/netfilter/xt_bpf.c. Folks, ONCE DESCRIPTOR IS INSTALLED, THAT'S >> > IT; THERE'S NO REMOVING IT ON FAILURE EXITS. sys_close() should >> > never, ever be used that way. Sigh... >> >> Would be great do unexport the thing. Except that we also have >> binfmt_misc (which looks legit) and autofs4, which on crack decided >> that close() isn't a fun syscall, they'd much rather have an ioctl >> that does exactly the same.. > > Yes, since binfmt_misc one is guaranteed that its descriptor table is > not shared - all callchains go through do_execveat_common(), where we'd > use unshare_files(). autofs one is... not in good taste, but still > safe; there the descriptor is preexisting and it's essentially a weird > way of spelling close(2). References from syscall tables are, of course, > OK. init/*.c uses are done pretty much from userland - they could have > been straight syscalls, if not for the lack of klibc in kernel tree. > Everything else, though... > > IMO we need a whack-a-mole list somewhere; "new callers of sys_close() > anywhere outside of init/* and syscall tables" definitely should be > on it... #syz fix: fix kcm_clone()
diff --git a/net/socket.c b/net/socket.c index 42d8e9c9ccd5..b2390b5591a9 100644 --- a/net/socket.c +++ b/net/socket.c @@ -598,9 +598,6 @@ void sock_release(struct socket *sock) module_put(owner); } - if (rcu_dereference_protected(sock->wq, 1)->fasync_list) - pr_err("%s: fasync list not empty!\n", __func__); - this_cpu_sub(sockets_in_use, 1); if (!sock->file) { iput(SOCK_INODE(sock));