Message ID | 20240505175556.1213266-2-torvalds@linux-foundation.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] epoll: be better about file lifetimes | expand |
On 5/5/24 11:55 AM, Linus Torvalds wrote: > epoll can call out to vfs_poll() with a file pointer that may race with > the last 'fput()'. That would make f_count go down to zero, and while > the ep->mtx locking means that the resulting file pointer tear-down will > be blocked until the poll returns, it means that f_count is already > dead, and any use of it won't actually get a reference to the file any > more: it's dead regardless. > > Make sure we have a valid ref on the file pointer before we call down to > vfs_poll() from the epoll routines. > > Link: https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/ > Reported-by: syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com > Reviewed-by: Jens Axboe <axboe@kernel.dk> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > --- > > Changes since v1: > > - add Link, Reported-by, and Jens' reviewed-by. And sign off on it > because it looks fine to me and we have some testing now. > > - move epi_fget() closer to the user > > - more comments about the background > > - remove the rcu_read_lock(), with the comment explaining why it's not > needed > > - note about returning zero rather than something like EPOLLERR|POLLHUP > for a file that is going away I did look at that 0 return as well and agreed this is the right choice, but adding the comment is a good idea. Anyway, patch still looks fine to me. I'd word wrap the comment section above epi_fget() wider, but that's just a stylistic choice...
From: Linus Torvalds > Sent: 05 May 2024 18:56 > > epoll can call out to vfs_poll() with a file pointer that may race with > the last 'fput()'. That would make f_count go down to zero, and while > the ep->mtx locking means that the resulting file pointer tear-down will > be blocked until the poll returns, it means that f_count is already > dead, and any use of it won't actually get a reference to the file any > more: it's dead regardless. > > Make sure we have a valid ref on the file pointer before we call down to > vfs_poll() from the epoll routines. How much is the extra pair of atomics going to hurt performance? IIRC a lot of work was done to (try to) make epoll lockless. Perhaps the 'hook' into epoll (usually) from sys_close should be done before any of the references are removed? (Which is different from Q6/A6 in man epoll - but that seems to be documenting a bug!) Then the ->poll() callback can't happen (assuming it is properly locked) after the ->release() one. It seems better to add extra atomics to the close/final-fput path rather than ever ->poll() call epoll makes. I can get extra references to a driver by dup() open("/dev/fd/n") and mmap() - but epoll is definitely fd based. (Which may be why it has the fd number in its data.) Is there another race between EPOLL_CTL_ADD and close() (from a different thread)? David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Sun, 5 May 2024 at 13:02, David Laight <David.Laight@aculab.com> wrote: > > How much is the extra pair of atomics going to hurt performance? > IIRC a lot of work was done to (try to) make epoll lockless. If this makes people walk away from epoll, that would be absolutely *lovely*. Maybe they'd start using io_uring instead, which has had its problems, but is a lot more capable in the end. Yes, doing things right is likely more expensive than doing things wrong. Bugs are cheap. That doesn't make buggy code better. Epoll really isn't important enough to screw over the VFS subsystem over. I did point out elsewhere how this could be fixed by epoll() removing the ep items at a different point: https://lore.kernel.org/all/CAHk-=wj6XL9MGCd_nUzRj6SaKeN0TsyTTZDFpGdW34R+zMZaSg@mail.gmail.com/ so if somebody actually wants to fix up epoll properly, that would probably be great. In fact, that model would allow epoll() to just keep a proper refcount as an fd is added to the poll events, and would probably fix a lot of nastiness. Right now those ep items stay around for basically random amounts of time. But maybe there are other ways to fix it. I don't think we have an actual eventpoll maintainer any more, but what I'm *not* willing to happen is eventpoll messing up other parts of the kernel. It was always a ugly performance hack, and was only acceptable as such. "ugly hack" is ok. "buggy ugly hack" is not. Linus
diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 882b89edc52a..a3f0f868adc4 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -979,6 +979,37 @@ static __poll_t __ep_eventpoll_poll(struct file *file, poll_table *wait, int dep return res; } +/* + * The ffd.file pointer may be in the process of + * being torn down due to being closed, but we + * may not have finished eventpoll_release() yet. + * + * Normally, even with the atomic_long_inc_not_zero, + * the file may have been free'd and then gotten + * re-allocated to something else (since files are + * not RCU-delayed, they are SLAB_TYPESAFE_BY_RCU). + * + * But for epoll, users hold the ep->mtx mutex, and + * as such any file in the process of being free'd + * will block in eventpoll_release_file() and thus + * the underlying file allocation will not be free'd, + * and the file re-use cannot happen. + * + * For the same reason we can avoid a rcu_read_lock() + * around the operation - 'ffd.file' cannot go away + * even if the refcount has reached zero (but we must + * still not call out to ->poll() functions etc). + */ +static struct file *epi_fget(const struct epitem *epi) +{ + struct file *file; + + file = epi->ffd.file; + if (!atomic_long_inc_not_zero(&file->f_count)) + file = NULL; + return file; +} + /* * Differs from ep_eventpoll_poll() in that internal callers already have * the ep->mtx so we need to start from depth=1, such that mutex_lock_nested() @@ -987,14 +1018,23 @@ static __poll_t __ep_eventpoll_poll(struct file *file, poll_table *wait, int dep static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt, int depth) { - struct file *file = epi->ffd.file; + struct file *file = epi_fget(epi); __poll_t res; + /* + * We could return EPOLLERR | EPOLLHUP or something, + * but let's treat this more as "file doesn't exist, + * poll didn't happen". + */ + if (!file) + return 0; + pt->_key = epi->event.events; if (!is_file_epoll(file)) res = vfs_poll(file, pt); else res = __ep_eventpoll_poll(file, pt, depth); + fput(file); return res & epi->event.events; }