From patchwork Mon Jan 21 20:14:44 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 10774513 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id CC7FA91E for ; Mon, 21 Jan 2019 20:16:42 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id BC3EB2A874 for ; Mon, 21 Jan 2019 20:16:42 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id B08CF2A879; Mon, 21 Jan 2019 20:16:42 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4638D2A874 for ; Mon, 21 Jan 2019 20:16:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727997AbfAUUPO (ORCPT ); Mon, 21 Jan 2019 15:15:14 -0500 Received: from mx2.suse.de ([195.135.220.15]:55518 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1727699AbfAUUPN (ORCPT ); Mon, 21 Jan 2019 15:15:13 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 7B9C0AFF4; Mon, 21 Jan 2019 20:15:12 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Davidlohr Bueso , Jason Baron , Al Viro , "Paul E. 
McKenney" , Linus Torvalds , Andrea Parri , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 01/13] epoll: move private helpers from a header to the source Date: Mon, 21 Jan 2019 21:14:44 +0100 Message-Id: <20190121201456.28338-2-rpenyaev@suse.de> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20190121201456.28338-1-rpenyaev@suse.de> References: <20190121201456.28338-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP

These helpers will access the private eventpoll structure in later patches, so move them next to their callers. No functional change intended.

Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Davidlohr Bueso Cc: Jason Baron Cc: Al Viro Cc: "Paul E. McKenney" Cc: Linus Torvalds Cc: Andrea Parri Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org
--- fs/eventpoll.c | 13 +++++++++++++ include/uapi/linux/eventpoll.h | 12 ------------ 2 files changed, 13 insertions(+), 12 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 4a0e98d87fcc..2cc183e86a29 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -466,6 +466,19 @@ static inline void ep_set_busy_poll_napi_id(struct epitem *epi) #endif /* CONFIG_NET_RX_BUSY_POLL */ +#ifdef CONFIG_PM_SLEEP +static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev) +{ + if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND)) + epev->events &= ~EPOLLWAKEUP; +} +#else +static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev) +{ + epev->events &= ~EPOLLWAKEUP; +} +#endif /* CONFIG_PM_SLEEP */ + /** * ep_call_nested - Perform a bound (possibly) nested call, by checking * that the recursion limit is not exceeded, and that diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h index 8a3432d0f0dc..39dfc29f0f52 100644 --- 
a/include/uapi/linux/eventpoll.h +++ b/include/uapi/linux/eventpoll.h @@ -79,16 +79,4 @@ struct epoll_event { __u64 data; } EPOLL_PACKED; -#ifdef CONFIG_PM_SLEEP -static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev) -{ - if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND)) - epev->events &= ~EPOLLWAKEUP; -} -#else -static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev) -{ - epev->events &= ~EPOLLWAKEUP; -} -#endif #endif /* _UAPI_LINUX_EVENTPOLL_H */ From patchwork Mon Jan 21 20:14:45 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 10774509 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 2734B1390 for ; Mon, 21 Jan 2019 20:16:33 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 15FCE2A411 for ; Mon, 21 Jan 2019 20:16:33 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 0A10B2A874; Mon, 21 Jan 2019 20:16:33 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 988C62A411 for ; Mon, 21 Jan 2019 20:16:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728127AbfAUUPO (ORCPT ); Mon, 21 Jan 2019 15:15:14 -0500 Received: from mx2.suse.de ([195.135.220.15]:55534 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1727867AbfAUUPO (ORCPT ); Mon, 21 Jan 2019 15:15:14 -0500 X-Virus-Scanned: by amavisd-new at 
test-mx.suse.de Received: from relay1.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id B45DAAFF5; Mon, 21 Jan 2019 20:15:12 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Davidlohr Bueso , Jason Baron , Al Viro , "Paul E. McKenney" , Linus Torvalds , Andrea Parri , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 02/13] epoll: introduce user structures for polling from userspace Date: Mon, 21 Jan 2019 21:14:45 +0100 Message-Id: <20190121201456.28338-3-rpenyaev@suse.de> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20190121201456.28338-1-rpenyaev@suse.de> References: <20190121201456.28338-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP

Introduce the structures that make up the user items array:

  struct epoll_uheader - describes inserted epoll items.
  struct epoll_uitem - single epoll item, visible to userspace.

Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Davidlohr Bueso Cc: Jason Baron Cc: Al Viro Cc: "Paul E. McKenney" Cc: Linus Torvalds Cc: Andrea Parri Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org
--- fs/eventpoll.c | 9 +++++++++ include/uapi/linux/eventpoll.h | 19 +++++++++++++++++++ 2 files changed, 28 insertions(+) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 2cc183e86a29..f598442512f3 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -9,6 +9,8 @@ * * Davide Libenzi * + * Polling from userspace support by Roman Penyaev + * (C) Copyright 2019 SUSE, All Rights Reserved */ #include @@ -109,6 +111,13 @@ #define EP_ITEM_COST (sizeof(struct epitem) + sizeof(struct eppoll_entry)) +/* + * That is around 1.3mb of allocated memory for one epfd. 
What is more + * important is ->index_length, which should be ^2, so do not increase + * max items number to avoid size doubling of user index. + */ +#define EP_USERPOLL_MAX_ITEMS_NR 65536 + struct epoll_filefd { struct file *file; int fd; diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h index 39dfc29f0f52..690a625ddeb2 100644 --- a/include/uapi/linux/eventpoll.h +++ b/include/uapi/linux/eventpoll.h @@ -79,4 +79,23 @@ struct epoll_event { __u64 data; } EPOLL_PACKED; +struct epoll_uitem { + __poll_t ready_events; + struct epoll_event event; +}; + +#define EPOLL_USERPOLL_HEADER_SIZE 128 +#define EPOLL_USERPOLL_HEADER_MAGIC 0xeb01eb01 + +struct epoll_uheader { + u32 magic; /* epoll user header magic */ + u32 header_length; /* length of the header + items */ + u32 index_length; /* length of the index ring, always pow2 */ + u32 max_items_nr; /* max number of items */ + u32 head; /* updated by userland */ + u32 tail; /* updated by kernel */ + + struct epoll_uitem items[] __aligned(EPOLL_USERPOLL_HEADER_SIZE); +}; + #endif /* _UAPI_LINUX_EVENTPOLL_H */ From patchwork Mon Jan 21 20:14:46 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 10774507 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B7A911390 for ; Mon, 21 Jan 2019 20:16:27 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A707F2A411 for ; Mon, 21 Jan 2019 20:16:27 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 99A742A878; Mon, 21 Jan 2019 20:16:27 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, 
RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E993F2A411 for ; Mon, 21 Jan 2019 20:16:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728254AbfAUUPP (ORCPT ); Mon, 21 Jan 2019 15:15:15 -0500 Received: from mx2.suse.de ([195.135.220.15]:55558 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1727882AbfAUUPP (ORCPT ); Mon, 21 Jan 2019 15:15:15 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 404F1AE60; Mon, 21 Jan 2019 20:15:13 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Davidlohr Bueso , Jason Baron , Al Viro , "Paul E. McKenney" , Linus Torvalds , Andrea Parri , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 03/13] epoll: allocate user header and user events ring for polling from userspace Date: Mon, 21 Jan 2019 21:14:46 +0100 Message-Id: <20190121201456.28338-4-rpenyaev@suse.de> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20190121201456.28338-1-rpenyaev@suse.de> References: <20190121201456.28338-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP

Allocate the user header and the user events ring according to the maximum number of items, which is passed as a parameter. The length of the user events (index) ring is always a power of two. Pages that will be shared between kernel and userspace are accounted through the user->locked_vm counter.

Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Davidlohr Bueso Cc: Jason Baron Cc: Al Viro Cc: "Paul E. 
McKenney" Cc: Linus Torvalds Cc: Andrea Parri Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 130 +++++++++++++++++++++++++++++++-- include/uapi/linux/eventpoll.h | 3 +- 2 files changed, 125 insertions(+), 8 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index f598442512f3..a73c077a552c 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -231,6 +231,27 @@ struct eventpoll { struct file *file; + /* User header with array of items */ + struct epoll_uheader *user_header; + + /* User index, which acts as a ring of coming events */ + unsigned int *user_index; + + /* Actual length of user header, always aligned on page */ + unsigned int header_length; + + /* Actual length of user index, always pow2 */ + unsigned int index_length; + + /* Maximum possible event items */ + unsigned int max_items_nr; + + /* Items bitmap, is used to get a free bit for new registered epi */ + unsigned long *items_bm; + + /* Length of both items bitmaps, always aligned on page */ + unsigned int items_bm_length; + /* used to optimize loop detection check */ int visited; struct list_head visited_list_link; @@ -381,6 +402,27 @@ static void ep_nested_calls_init(struct nested_calls *ncalls) spin_lock_init(&ncalls->lock); } +static inline unsigned int ep_to_items_length(unsigned int nr) +{ + struct epoll_uheader *user_header; + + return PAGE_ALIGN(struct_size(user_header, items, nr)); +} + +static inline unsigned int ep_to_index_length(unsigned int nr) +{ + struct eventpoll *ep; + unsigned int size; + + size = roundup_pow_of_two(nr << ilog2(sizeof(*ep->user_index))); + return max_t(typeof(size), size, PAGE_SIZE); +} + +static inline unsigned int ep_to_items_bm_length(unsigned int nr) +{ + return PAGE_ALIGN(ALIGN(nr, 8) >> 3); +} + /** * ep_events_available - Checks if ready events might be available. 
* @@ -836,6 +878,38 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi) return 0; } +static int ep_account_mem(struct eventpoll *ep, struct user_struct *user) +{ + unsigned long nr_pages, page_limit, cur_pages, new_pages; + + nr_pages = ep->header_length >> PAGE_SHIFT; + nr_pages += ep->index_length >> PAGE_SHIFT; + + /* Don't allow more pages than we can safely lock */ + page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + + do { + cur_pages = atomic_long_read(&user->locked_vm); + new_pages = cur_pages + nr_pages; + if (new_pages > page_limit && !capable(CAP_IPC_LOCK)) + return -ENOMEM; + } while (atomic_long_cmpxchg(&user->locked_vm, cur_pages, + new_pages) != cur_pages); + + return 0; +} + +static void ep_unaccount_mem(struct eventpoll *ep, struct user_struct *user) +{ + unsigned long nr_pages; + + nr_pages = ep->header_length >> PAGE_SHIFT; + nr_pages += ep->index_length >> PAGE_SHIFT; + if (nr_pages) + /* When polled by user */ + atomic_long_sub(nr_pages, &user->locked_vm); +} + static void ep_free(struct eventpoll *ep) { struct rb_node *rbp; @@ -883,8 +957,12 @@ static void ep_free(struct eventpoll *ep) mutex_unlock(&epmutex); mutex_destroy(&ep->mtx); - free_uid(ep->user); wakeup_source_unregister(ep->ws); + vfree(ep->user_header); + vfree(ep->user_index); + vfree(ep->items_bm); + ep_unaccount_mem(ep, ep->user); + free_uid(ep->user); kfree(ep); } @@ -1037,7 +1115,7 @@ void eventpoll_release_file(struct file *file) mutex_unlock(&epmutex); } -static int ep_alloc(struct eventpoll **pep) +static int ep_alloc(struct eventpoll **pep, int flags, size_t max_items) { int error; struct user_struct *user; @@ -1049,6 +1127,37 @@ static int ep_alloc(struct eventpoll **pep) if (unlikely(!ep)) goto free_uid; + if (flags & EPOLL_USERPOLL) { + BUILD_BUG_ON(sizeof(*ep->user_header) != + EPOLL_USERPOLL_HEADER_SIZE); + + if (!max_items || max_items > EP_USERPOLL_MAX_ITEMS_NR) { + error = -EINVAL; + goto free_ep; + } + ep->max_items_nr = max_items; + 
ep->header_length = ep_to_items_length(max_items); + ep->index_length = ep_to_index_length(max_items); + ep->items_bm_length = ep_to_items_bm_length(max_items); + + error = ep_account_mem(ep, user); + if (error) + goto free_ep; + + ep->user_header = vmalloc_user(ep->header_length); + ep->user_index = vmalloc_user(ep->index_length); + ep->items_bm = vzalloc(ep->items_bm_length); + if (!ep->user_header || !ep->user_index || !ep->items_bm) + goto unaccount_mem; + + *ep->user_header = (typeof(*ep->user_header)) { + .magic = EPOLL_USERPOLL_HEADER_MAGIC, + .header_length = ep->header_length, + .index_length = ep->index_length, + .max_items_nr = ep->max_items_nr, + }; + } + mutex_init(&ep->mtx); rwlock_init(&ep->lock); init_waitqueue_head(&ep->wq); @@ -1062,6 +1171,13 @@ static int ep_alloc(struct eventpoll **pep) return 0; +unaccount_mem: + ep_unaccount_mem(ep, user); +free_ep: + vfree(ep->user_header); + vfree(ep->user_index); + vfree(ep->items_bm); + kfree(ep); free_uid: free_uid(user); return error; @@ -2066,7 +2182,7 @@ static void clear_tfile_check_list(void) /* * Open an eventpoll file descriptor. */ -static int do_epoll_create(int flags) +static int do_epoll_create(int flags, size_t size) { int error, fd; struct eventpoll *ep = NULL; @@ -2075,12 +2191,12 @@ static int do_epoll_create(int flags) /* Check the EPOLL_* constant for consistency. */ BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC); - if (flags & ~EPOLL_CLOEXEC) + if (flags & ~(EPOLL_CLOEXEC | EPOLL_USERPOLL)) return -EINVAL; /* * Create the internal data structure ("struct eventpoll"). 
*/ - error = ep_alloc(&ep); + error = ep_alloc(&ep, flags, size); if (error < 0) return error; /* @@ -2111,7 +2227,7 @@ static int do_epoll_create(int flags) SYSCALL_DEFINE1(epoll_create1, int, flags) { - return do_epoll_create(flags); + return do_epoll_create(flags, 0); } SYSCALL_DEFINE1(epoll_create, int, size) @@ -2119,7 +2235,7 @@ SYSCALL_DEFINE1(epoll_create, int, size) if (size <= 0) return -EINVAL; - return do_epoll_create(0); + return do_epoll_create(0, 0); } /* diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h index 690a625ddeb2..fd06efe5d07e 100644 --- a/include/uapi/linux/eventpoll.h +++ b/include/uapi/linux/eventpoll.h @@ -20,7 +20,8 @@ #include /* Flags for epoll_create1. */ -#define EPOLL_CLOEXEC O_CLOEXEC +#define EPOLL_CLOEXEC O_CLOEXEC +#define EPOLL_USERPOLL 1 /* Valid opcodes to issue to sys_epoll_ctl() */ #define EPOLL_CTL_ADD 1 From patchwork Mon Jan 21 20:14:47 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 10774499 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1957691E for ; Mon, 21 Jan 2019 20:16:06 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0B4032A874 for ; Mon, 21 Jan 2019 20:16:06 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id F3E202A9C2; Mon, 21 Jan 2019 20:16:05 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 559FB2A881 for ; Mon, 21 Jan 2019 20:16:05 
+0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728293AbfAUUPQ (ORCPT ); Mon, 21 Jan 2019 15:15:16 -0500 Received: from mx2.suse.de ([195.135.220.15]:55582 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728139AbfAUUPQ (ORCPT ); Mon, 21 Jan 2019 15:15:16 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay1.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id B40AAAFF6; Mon, 21 Jan 2019 20:15:13 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Davidlohr Bueso , Jason Baron , Al Viro , "Paul E. McKenney" , Linus Torvalds , Andrea Parri , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 04/13] epoll: some sanity flags checks for epoll syscalls for polling from userspace Date: Mon, 21 Jan 2019 21:14:47 +0100 Message-Id: <20190121201456.28338-5-rpenyaev@suse.de> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20190121201456.28338-1-rpenyaev@suse.de> References: <20190121201456.28338-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP

There are several limitations if an epfd is polled by user:

 1. The EPOLLET flag (edge-triggered behavior) is always required.

 2. No support for EPOLLWAKEUP:
    events are consumed from userspace, so there is no point at which
    the kernel could call __pm_relax().

 3. No support for EPOLLEXCLUSIVE:
    if a device does not pass pollflags to wake_up(), poll() cannot be
    called from the wakeup callback, which runs under a spinlock, so a
    special work is scheduled to offload polling. In this case exclusive
    wakeups cannot be supported, because the result of the scheduled
    work is not known at wakeup time.

 4. epoll_wait() on an epfd created with the EPOLL_USERPOLL flag accepts
    only events == NULL and maxevents == 0. Any other values are
    rejected.
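The restrictions listed above can be modelled in plain C outside the kernel. What follows is only an illustrative sketch, not the code of this patch: the names `model_ctl_check` and `model_wait_check` and the `MODEL_*` constants are hypothetical (the constants mirror the bit values of EPOLLEXCLUSIVE, EPOLLWAKEUP and EPOLLET). The pre-existing checks for regular epoll descriptors, such as the CAP_BLOCK_SUSPEND test and the upper bound on maxevents, are deliberately left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants mirroring EPOLLEXCLUSIVE, EPOLLWAKEUP and EPOLLET. */
#define MODEL_EPOLLEXCLUSIVE (1u << 28)
#define MODEL_EPOLLWAKEUP    (1u << 29)
#define MODEL_EPOLLET        (1u << 31)

/*
 * Model of the event-mask checks for epoll_ctl() on a user-polled epfd:
 * EPOLLEXCLUSIVE is rejected, EPOLLET is mandatory and EPOLLWAKEUP is
 * silently cleared. Regular epfds are not modelled and always pass.
 */
static int model_ctl_check(bool polled_by_user, uint32_t *events)
{
	if (!polled_by_user)
		return 0;
	if (*events & MODEL_EPOLLEXCLUSIVE)
		return -EINVAL;
	if (!(*events & MODEL_EPOLLET))
		return -EINVAL;
	*events &= ~MODEL_EPOLLWAKEUP;
	return 0;
}

/*
 * Model of the argument checks for epoll_wait(): a user-polled epfd takes
 * only events == NULL and maxevents == 0; a regular epfd needs a positive
 * maxevents (the upper bound is not modelled).
 */
static int model_wait_check(bool polled_by_user, const void *events,
			    int maxevents)
{
	if (polled_by_user)
		return (events == NULL && maxevents == 0) ? 0 : -EINVAL;
	return maxevents > 0 ? 0 : -EINVAL;
}
```

Under these assumptions, a caller that adds a descriptor with EPOLLIN only to a user-polled epfd is refused, while EPOLLIN | EPOLLET | EPOLLWAKEUP is accepted with the EPOLLWAKEUP bit dropped.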
Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Davidlohr Bueso Cc: Jason Baron Cc: Al Viro Cc: "Paul E. McKenney" Cc: Linus Torvalds Cc: Andrea Parri Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 68 ++++++++++++++++++++++++++++++++++---------------- 1 file changed, 46 insertions(+), 22 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index a73c077a552c..9c9283e4a073 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -423,6 +423,11 @@ static inline unsigned int ep_to_items_bm_length(unsigned int nr) return PAGE_ALIGN(ALIGN(nr, 8) >> 3); } +static inline bool ep_polled_by_user(struct eventpoll *ep) +{ + return !!ep->user_header; +} + /** * ep_events_available - Checks if ready events might be available. * @@ -518,13 +523,17 @@ static inline void ep_set_busy_poll_napi_id(struct epitem *epi) #endif /* CONFIG_NET_RX_BUSY_POLL */ #ifdef CONFIG_PM_SLEEP -static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev) +static inline void ep_take_care_of_epollwakeup(struct eventpoll *ep, + struct epoll_event *epev) { - if ((epev->events & EPOLLWAKEUP) && !capable(CAP_BLOCK_SUSPEND)) - epev->events &= ~EPOLLWAKEUP; + if (epev->events & EPOLLWAKEUP) { + if (!capable(CAP_BLOCK_SUSPEND) || ep_polled_by_user(ep)) + epev->events &= ~EPOLLWAKEUP; + } } #else -static inline void ep_take_care_of_epollwakeup(struct epoll_event *epev) +static inline void ep_take_care_of_epollwakeup(struct eventpoll *ep, + struct epoll_event *epev) { epev->events &= ~EPOLLWAKEUP; } @@ -2274,10 +2283,6 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, if (!file_can_poll(tf.file)) goto error_tgt_fput; - /* Check if EPOLLWAKEUP is allowed */ - if (ep_op_has_event(op)) - ep_take_care_of_epollwakeup(&epds); - /* * We have to check that the file structure underneath the file descriptor * the user passed to us _is_ an eventpoll file. 
And also we do not permit @@ -2287,10 +2292,18 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, if (f.file == tf.file || !is_file_epoll(f.file)) goto error_tgt_fput; + /* + * At this point it is safe to assume that the "private_data" contains + * our own data structure. + */ + ep = f.file->private_data; + /* * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only, * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation. - * Also, we do not currently supported nested exclusive wakeups. + * Also, we do not currently supported nested exclusive wakeups + * and EPOLLEXCLUSIVE is not supported for epoll which is polled + * from userspace. */ if (ep_op_has_event(op) && (epds.events & EPOLLEXCLUSIVE)) { if (op == EPOLL_CTL_MOD) @@ -2298,13 +2311,18 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) || (epds.events & ~EPOLLEXCLUSIVE_OK_BITS))) goto error_tgt_fput; + if (ep_polled_by_user(ep)) + goto error_tgt_fput; } - /* - * At this point it is safe to assume that the "private_data" contains - * our own data structure. 
- */ - ep = f.file->private_data; + if (ep_op_has_event(op)) { + if (ep_polled_by_user(ep) && !(epds.events & EPOLLET)) + /* Polled by user has only edge triggered behaviour */ + goto error_tgt_fput; + + /* Check if EPOLLWAKEUP is allowed */ + ep_take_care_of_epollwakeup(ep, &epds); + } /* * When we insert an epoll file descriptor, inside another epoll file @@ -2406,14 +2424,6 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events, struct fd f; struct eventpoll *ep; - /* The maximum number of event must be greater than zero */ - if (maxevents <= 0 || maxevents > EP_MAX_EVENTS) - return -EINVAL; - - /* Verify that the area passed by the user is writeable */ - if (!access_ok(events, maxevents * sizeof(struct epoll_event))) - return -EFAULT; - /* Get the "struct file *" for the eventpoll file */ f = fdget(epfd); if (!f.file) @@ -2432,6 +2442,20 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events, * our own data structure. */ ep = f.file->private_data; + if (!ep_polled_by_user(ep)) { + /* The maximum number of event must be greater than zero */ + if (maxevents <= 0 || maxevents > EP_MAX_EVENTS) + goto error_fput; + + /* Verify that the area passed by the user is writeable */ + error = -EFAULT; + if (!access_ok(events, maxevents * sizeof(struct epoll_event))) + goto error_fput; + } else { + /* Use ring instead */ + if (maxevents != 0 || events != NULL) + goto error_fput; + } /* Time to fish for events ... 
*/ error = ep_poll(ep, events, maxevents, timeout); From patchwork Mon Jan 21 20:14:48 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 10774493 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 96D0D1390 for ; Mon, 21 Jan 2019 20:15:47 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 87CAD2A9AC for ; Mon, 21 Jan 2019 20:15:47 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 7BB212A87A; Mon, 21 Jan 2019 20:15:47 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id BA2202A874 for ; Mon, 21 Jan 2019 20:15:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728311AbfAUUPR (ORCPT ); Mon, 21 Jan 2019 15:15:17 -0500 Received: from mx2.suse.de ([195.135.220.15]:55600 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728210AbfAUUPQ (ORCPT ); Mon, 21 Jan 2019 15:15:16 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 3C79CAFF7; Mon, 21 Jan 2019 20:15:14 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Davidlohr Bueso , Jason Baron , Al Viro , "Paul E. 
McKenney" , Linus Torvalds , Andrea Parri , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 05/13] epoll: offload polling to a work in case of epfd polled from userspace Date: Mon, 21 Jan 2019 21:14:48 +0100 Message-Id: <20190121201456.28338-6-rpenyaev@suse.de> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20190121201456.28338-1-rpenyaev@suse.de> References: <20190121201456.28338-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP

Not every device reports pollflags on wake_up(); some expect to be polled later. vfs_poll() can't be called from ep_poll_callback(), because ep_poll_callback() is called under a spinlock. Userspace can't call vfs_poll() either, so epoll has to offload vfs_poll() to a work and then call ep_poll_callback() with the pollflags in hand.

Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Davidlohr Bueso Cc: Jason Baron Cc: Al Viro Cc: "Paul E. 
McKenney" Cc: Linus Torvalds Cc: Andrea Parri Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 108 +++++++++++++++++++++++++++++++++++-------------- 1 file changed, 78 insertions(+), 30 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 9c9283e4a073..891cc7db8f8d 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -44,6 +44,7 @@ #include #include #include +#include #include /* @@ -185,6 +186,9 @@ struct epitem { /* The structure that describe the interested events and the source fd */ struct epoll_event event; + + /* Work for offloading event callback */ + struct work_struct work; }; /* @@ -696,6 +700,14 @@ static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi) ep_remove_wait_queue(pwq); kmem_cache_free(pwq_cache, pwq); } + if (ep_polled_by_user(ep)) { + /* + * Events polled by user require offloading to a work, + * thus we have to be sure everything which was queued + * has run to a completion. + */ + flush_work(&epi->work); + } } /* call only when ep->mtx is held */ @@ -1339,9 +1351,8 @@ static inline bool chain_epi_lockless(struct epitem *epi) } /* - * This is the callback that is passed to the wait queue wakeup - * mechanism. It is called by the stored file descriptors when they - * have events to report. + * This is the callback that is called directly from wake queue wakeup or + * from a work. * * This callback takes a read lock in order not to content with concurrent * events from another file descriptors, thus all modifications to ->rdllist @@ -1356,14 +1367,11 @@ static inline bool chain_epi_lockless(struct epitem *epi) * queues are used should be detected accordingly. This is detected using * cmpxchg() operation. 
*/ -static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key) +static int ep_poll_callback(struct epitem *epi, __poll_t pollflags) { - int pwake = 0; - struct epitem *epi = ep_item_from_wait(wait); struct eventpoll *ep = epi->ep; - __poll_t pollflags = key_to_poll(key); + int pwake = 0, ewake = 0; unsigned long flags; - int ewake = 0; read_lock_irqsave(&ep->lock, flags); @@ -1381,8 +1389,9 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v /* * Check the events coming with the callback. At this stage, not * every device reports the events in the "key" parameter of the - * callback. We need to be able to handle both cases here, hence the - * test for "key" != NULL before the event match test. + * callback (for ep_poll_callback() case special worker is used). + * We need to be able to handle both cases here, hence the test + * for "key" != NULL before the event match test. */ if (pollflags && !(pollflags & epi->event.events)) goto out_unlock; @@ -1442,23 +1451,67 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v if (!(epi->event.events & EPOLLEXCLUSIVE)) ewake = 1; - if (pollflags & POLLFREE) { - /* - * If we race with ep_remove_wait_queue() it can miss - * ->whead = NULL and do another remove_wait_queue() after - * us, so we can't use __remove_wait_queue(). - */ - list_del_init(&wait->entry); + return ewake; +} + +static void ep_poll_callback_work(struct work_struct *work) +{ + struct epitem *epi = container_of(work, typeof(*epi), work); + __poll_t pollflags; + poll_table pt; + + WARN_ON(!ep_polled_by_user(epi->ep)); + + init_poll_funcptr(&pt, NULL); + pollflags = ep_item_poll(epi, &pt, 1); + + (void)ep_poll_callback(epi, pollflags); +} + +/* + * This is the callback that is passed to the wait queue wakeup + * mechanism. It is called by the stored file descriptors when they + * have events to report. 
+ */ +static int ep_poll_wakeup(wait_queue_entry_t *wait, unsigned int mode, + int sync, void *key) +{ + + struct epitem *epi = ep_item_from_wait(wait); + struct eventpoll *ep = epi->ep; + __poll_t pollflags = key_to_poll(key); + int rc; + + if (!ep_polled_by_user(ep) || pollflags) { + rc = ep_poll_callback(epi, pollflags); + + if (pollflags & POLLFREE) { + /* + * If we race with ep_remove_wait_queue() it can miss + * ->whead = NULL and do another remove_wait_queue() + * after us, so we can't use __remove_wait_queue(). + */ + list_del_init(&wait->entry); + /* + * ->whead != NULL protects us from the race with + * ep_free() or ep_remove(), ep_remove_wait_queue() + * takes whead->lock held by the caller. Once we nullify + * it, nothing protects ep/epi or even wait. + */ + smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL); + } + } else { + schedule_work(&epi->work); + /* - * ->whead != NULL protects us from the race with ep_free() - * or ep_remove(), ep_remove_wait_queue() takes whead->lock - * held by the caller. Once we nullify it, nothing protects - * ep/epi or even wait. + * Here on this path we are absolutely sure that for file + * descriptors which are pollable from userspace we do not + * support EPOLLEXCLUSIVE, so it is safe to return 1. 
*/ - smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL); + rc = 1; } - return ewake; + return rc; } /* @@ -1472,7 +1525,7 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, struct eppoll_entry *pwq; if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) { - init_waitqueue_func_entry(&pwq->wait, ep_poll_callback); + init_waitqueue_func_entry(&pwq->wait, ep_poll_wakeup); pwq->whead = whead; pwq->base = epi; if (epi->event.events & EPOLLEXCLUSIVE) @@ -1666,6 +1719,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, INIT_LIST_HEAD(&epi->rdllink); INIT_LIST_HEAD(&epi->fllink); INIT_LIST_HEAD(&epi->pwqlist); + INIT_WORK(&epi->work, ep_poll_callback_work); epi->ep = ep; ep_set_ffd(&epi->ffd, tfile, fd); epi->event = *event; @@ -2546,12 +2600,6 @@ static int __init eventpoll_init(void) ep_nested_calls_init(&poll_safewake_ncalls); #endif - /* - * We can have many thousands of epitems, so prevent this from - * using an extra cache line on 64-bit (and smaller) CPUs - */ - BUILD_BUG_ON(sizeof(void *) <= 8 && sizeof(struct epitem) > 128); - /* Allocates slab cache used to allocate "struct epitem" items */ epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL); From patchwork Mon Jan 21 20:14:49 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 10774505 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8E29A91E for ; Mon, 21 Jan 2019 20:16:24 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7DBA62A411 for ; Mon, 21 Jan 2019 20:16:24 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 723C92A878; Mon, 
21 Jan 2019 20:16:24 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D75EE2A411 for ; Mon, 21 Jan 2019 20:16:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728568AbfAUUQL (ORCPT ); Mon, 21 Jan 2019 15:16:11 -0500 Received: from mx2.suse.de ([195.135.220.15]:55614 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1727867AbfAUUPQ (ORCPT ); Mon, 21 Jan 2019 15:15:16 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay1.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id BA05CAFF8; Mon, 21 Jan 2019 20:15:14 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Davidlohr Bueso , Jason Baron , Al Viro , "Paul E. McKenney" , Linus Torvalds , Andrea Parri , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 06/13] epoll: introduce helpers for adding/removing events to uring Date: Mon, 21 Jan 2019 21:14:49 +0100 Message-Id: <20190121201456.28338-7-rpenyaev@suse.de> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20190121201456.28338-1-rpenyaev@suse.de> References: <20190121201456.28338-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Both add and remove events are lockless and can be called in parallel. ep_add_event_to_uring(): o user item is marked atomically as ready o if on the previous step the user item was observed as not ready, then a new entry is created in the index uring. 
ep_remove_user_item(): o user item is marked as EPOLLREMOVED only if it was ready, thus userspace will observe the previously added entry in the index uring and the correct "removed" state of the item. Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Davidlohr Bueso Cc: Jason Baron Cc: Al Viro Cc: "Paul E. McKenney" Cc: Linus Torvalds Cc: Andrea Parri Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 129 +++++++++++++++++++++++++++++++++ include/uapi/linux/eventpoll.h | 3 + 2 files changed, 132 insertions(+) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 891cc7db8f8d..26d837252ba4 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -189,6 +189,9 @@ struct epitem { /* Work for offloading event callback */ struct work_struct work; + + /* Bit in user bitmap for user polling */ + unsigned int bit; }; /* @@ -427,6 +430,11 @@ static inline unsigned int ep_to_items_bm_length(unsigned int nr) return PAGE_ALIGN(ALIGN(nr, 8) >> 3); } +static inline unsigned int ep_max_index_nr(struct eventpoll *ep) +{ + return ep->index_length >> ilog2(sizeof(*ep->user_index)); +} + static inline bool ep_polled_by_user(struct eventpoll *ep) { return !!ep->user_header; @@ -857,6 +865,127 @@ static void epi_rcu_free(struct rcu_head *head) kmem_cache_free(epi_cache, epi); } +#define atomic_set_unless_zero(ptr, flags) \ +({ \ + typeof(ptr) _ptr = (ptr); \ + typeof(flags) _flags = (flags); \ + typeof(*_ptr) _old, _val = READ_ONCE(*_ptr); \ + \ + for (;;) { \ + if (!_val) \ + break; \ + _old = cmpxchg(_ptr, _val, _flags); \ + if (_old == _val) \ + break; \ + _val = _old; \ + } \ + _val; \ +}) + +static inline void ep_remove_user_item(struct epitem *epi) +{ + struct eventpoll *ep = epi->ep; + struct epoll_uitem *uitem; + + lockdep_assert_held(&ep->mtx); + + /* Event should not have any attached queues */ + WARN_ON(!list_empty(&epi->pwqlist)); + + uitem = &ep->user_header->items[epi->bit]; + + /* + * User item can be in two states: signaled (ready_events is set + * and 
userspace has not yet consumed this event) and not signaled + * (no events yet fired or already consumed by userspace). + * We reset ready_events to EPOLLREMOVED only if ready_events is + * in signaled state (we expect that userspace will come soon and + * fetch this event). In case of not signaled leave ready_events + * as 0. + * + * Why it is important to mark ready_events as EPOLLREMOVED in case + * of already signaled state? ep_insert() op can be immediately + * called after ep_remove(), thus the same bit can be reused and + * then new event comes, which corresponds to the same entry inside + * user items array. For this particular case ep_add_event_to_uring() + * does not allocate a new index entry, but simply masks EPOLLREMOVED, + * and userspace uses old index entry, but meanwhile old user item + * has been removed, new item has been added and event updated. + */ + atomic_set_unless_zero(&uitem->ready_events, EPOLLREMOVED); + clear_bit(epi->bit, ep->items_bm); +} + +#define atomic_or_with_mask(ptr, flags, mask) \ +({ \ + typeof(ptr) _ptr = (ptr); \ + typeof(flags) _flags = (flags); \ + typeof(flags) _mask = (mask); \ + typeof(*_ptr) _old, _new, _val = READ_ONCE(*_ptr); \ + \ + for (;;) { \ + _new = (_val & ~_mask) | _flags; \ + _old = cmpxchg(_ptr, _val, _new); \ + if (_old == _val) \ + break; \ + _val = _old; \ + } \ + _val; \ +}) + +static inline bool ep_add_event_to_uring(struct epitem *epi, __poll_t pollflags) +{ + struct eventpoll *ep = epi->ep; + struct epoll_uitem *uitem; + bool added = false; + + if (WARN_ON(!pollflags)) + return false; + + uitem = &ep->user_header->items[epi->bit]; + /* + * Can be represented as: + * + * was_ready = uitem->ready_events; + * uitem->ready_events &= ~EPOLLREMOVED; + * uitem->ready_events |= pollflags; + * if (!was_ready) { + * // create index entry + * } + * + * See the big comment inside ep_remove_user_item(), why it is + * important to mask EPOLLREMOVED. 
+ */ + if (!atomic_or_with_mask(&uitem->ready_events, + pollflags, EPOLLREMOVED)) { + unsigned int i, *item_idx, index_mask; + + /* + * Item was not ready before, thus we have to insert + * new index to the ring. + */ + + index_mask = ep_max_index_nr(ep) - 1; + i = __atomic_fetch_add(&ep->user_header->tail, 1, + __ATOMIC_ACQUIRE); + item_idx = &ep->user_index[i & index_mask]; + + /* Signal with a bit, which is > 0 */ + *item_idx = epi->bit + 1; + + /* + * Want index update be flushed from CPU write buffer and + * immediately visible on userspace side to avoid long busy + * loops. + */ + smp_wmb(); + + added = true; + } + + return added; +} + /* * Removes a "struct epitem" from the eventpoll RB tree and deallocates * all the associated resources. Must be called with "mtx" held. diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h index fd06efe5d07e..2008d445b62e 100644 --- a/include/uapi/linux/eventpoll.h +++ b/include/uapi/linux/eventpoll.h @@ -42,6 +42,9 @@ #define EPOLLMSG (__force __poll_t)0x00000400 #define EPOLLRDHUP (__force __poll_t)0x00002000 +/* User item marked as removed for EPOLL_USERPOLL */ +#define EPOLLREMOVED ((__force __poll_t)(1U << 27)) + /* Set exclusive wakeup mode for the target file descriptor */ #define EPOLLEXCLUSIVE ((__force __poll_t)(1U << 28)) From patchwork Mon Jan 21 20:14:50 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 10774503 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7E7001390 for ; Mon, 21 Jan 2019 20:16:23 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6E5C52A411 for ; Mon, 21 Jan 2019 20:16:23 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 62E742A878; 
Mon, 21 Jan 2019 20:16:23 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E16762A411 for ; Mon, 21 Jan 2019 20:16:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728574AbfAUUQL (ORCPT ); Mon, 21 Jan 2019 15:16:11 -0500 Received: from mx2.suse.de ([195.135.220.15]:55626 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728272AbfAUUPQ (ORCPT ); Mon, 21 Jan 2019 15:15:16 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 511A7AFF9; Mon, 21 Jan 2019 20:15:15 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Davidlohr Bueso , Jason Baron , Al Viro , "Paul E. McKenney" , Linus Torvalds , Andrea Parri , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 07/13] epoll: call ep_add_event_to_uring() from ep_poll_callback() Date: Mon, 21 Jan 2019 21:14:50 +0100 Message-Id: <20190121201456.28338-8-rpenyaev@suse.de> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20190121201456.28338-1-rpenyaev@suse.de> References: <20190121201456.28338-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Each ep_poll_callback() is called when fd calls wakeup() on epfd. So account new event in user ring. The tricky part here is EPOLLONESHOT. 
Since we are lockless we have to deal with ep_poll_callback() being called in parallel, so use cmpxchg to clear the public event bits and filter out a concurrent call from another CPU. Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Davidlohr Bueso Cc: Jason Baron Cc: Al Viro Cc: "Paul E. McKenney" Cc: Linus Torvalds Cc: Andrea Parri Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 38 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 26d837252ba4..1d0039b334b8 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1406,6 +1406,29 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd, } #endif /* CONFIG_CHECKPOINT_RESTORE */ +/** + * Atomically clear public event bits and return %true if the old value has + * public event bits set. + */ +static inline bool ep_clear_public_event_bits(struct epitem *epi) +{ + __poll_t old, flags; + + /* + * Here we race with ourselves and with ep_modify(), which can + * change the event bits. In order not to override events updated + * by ep_modify() we have to do cmpxchg. + */ + + old = epi->event.events; + do { + flags = old; + } while ((old = cmpxchg(&epi->event.events, flags, + flags & EP_PRIVATE_BITS)) != flags); + + return flags & ~EP_PRIVATE_BITS; +} + /** * Adds a new entry to the tail of the list in a lockless way, i.e. * multiple CPUs are allowed to call this function concurrently. @@ -1525,6 +1548,20 @@ static int ep_poll_callback(struct epitem *epi, __poll_t pollflags) if (pollflags && !(pollflags & epi->event.events)) goto out_unlock; + if (ep_polled_by_user(ep)) { + /* + * For polled descriptor from user we have to disable events on + * callback path in case of one-shot. 
+ */ + if ((epi->event.events & EPOLLONESHOT) && + !ep_clear_public_event_bits(epi)) + /* Race is lost, another callback has cleared events */ + goto out_unlock; + + ep_add_event_to_uring(epi, pollflags); + goto wakeup; + } + /* * If we are transferring events to userspace, we can hold no locks * (because we're accessing user memory, and because of linux f_op->poll() @@ -1544,6 +1581,7 @@ static int ep_poll_callback(struct epitem *epi, __poll_t pollflags) ep_pm_stay_awake_rcu(epi); } +wakeup: /* * Wake up ( if active ) both the eventpoll wait list and the ->poll() * wait list. From patchwork Mon Jan 21 20:14:51 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 10774495 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id CBD611390 for ; Mon, 21 Jan 2019 20:15:51 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id BC6192A9CE for ; Mon, 21 Jan 2019 20:15:51 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id B0BB52A8A5; Mon, 21 Jan 2019 20:15:51 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3864D2A9B9 for ; Mon, 21 Jan 2019 20:15:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728454AbfAUUPp (ORCPT ); Mon, 21 Jan 2019 15:15:45 -0500 Received: from mx2.suse.de ([195.135.220.15]:55638 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728216AbfAUUPR (ORCPT ); 
Mon, 21 Jan 2019 15:15:17 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay1.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 9F56BAFFA; Mon, 21 Jan 2019 20:15:15 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Davidlohr Bueso , Jason Baron , Al Viro , "Paul E. McKenney" , Linus Torvalds , Andrea Parri , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 08/13] epoll: support polling from userspace for ep_insert() Date: Mon, 21 Jan 2019 21:14:51 +0100 Message-Id: <20190121201456.28338-9-rpenyaev@suse.de> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20190121201456.28338-1-rpenyaev@suse.de> References: <20190121201456.28338-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP When epfd is polled by userspace and a new item is inserted, a new bit should be taken from the bitmap and the user item set accordingly. Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Davidlohr Bueso Cc: Jason Baron Cc: Al Viro Cc: "Paul E. 
McKenney" Cc: Linus Torvalds Cc: Andrea Parri Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 78 +++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 65 insertions(+), 13 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 1d0039b334b8..628a2cadfad6 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -865,6 +865,23 @@ static void epi_rcu_free(struct rcu_head *head) kmem_cache_free(epi_cache, epi); } +static inline int ep_get_bit(struct eventpoll *ep) +{ + bool was_set; + int bit; + + lockdep_assert_held(&ep->mtx); + + bit = find_first_zero_bit(ep->items_bm, ep->max_items_nr); + if (bit >= ep->max_items_nr) + return -ENOSPC; + + was_set = test_and_set_bit(bit, ep->items_bm); + WARN_ON(was_set); + + return bit; +} + #define atomic_set_unless_zero(ptr, flags) \ ({ \ typeof(ptr) _ptr = (ptr); \ @@ -1874,6 +1891,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, struct epitem *epi; struct ep_pqueue epq; + lockdep_assert_held(&ep->mtx); lockdep_assert_irqs_enabled(); user_watches = atomic_long_read(&ep->user->epoll_watches); @@ -1900,6 +1918,28 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, RCU_INIT_POINTER(epi->ws, NULL); } + if (ep_polled_by_user(ep)) { + struct epoll_uitem *uitem; + int bit; + + bit = ep_get_bit(ep); + if (unlikely(bit < 0)) { + error = bit; + goto error_get_bit; + } + epi->bit = bit; + + /* + * Now fill-in user item. Do not touch ready_events, since + * it can be EPOLLREMOVED (has been set by previous user + * item), thus user index entry can be not yet consumed + * by userspace. See ep_remove_user_item() and + * ep_add_event_to_uring() for details. 
+ */ + uitem = &ep->user_header->items[epi->bit]; + uitem->event = *event; + } + /* Initialize the poll table using the queue callback */ epq.epi = epi; init_poll_funcptr(&epq.pt, ep_ptable_queue_proc); @@ -1944,16 +1984,23 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, /* record NAPI ID of new item if present */ ep_set_busy_poll_napi_id(epi); - /* If the file is already "ready" we drop it inside the ready list */ - if (revents && !ep_is_linked(epi)) { - list_add_tail(&epi->rdllink, &ep->rdllist); - ep_pm_stay_awake(epi); + if (revents) { + bool added = false; - /* Notify waiting tasks that events are available */ - if (waitqueue_active(&ep->wq)) - wake_up(&ep->wq); - if (waitqueue_active(&ep->poll_wait)) - pwake++; + if (ep_polled_by_user(ep)) { + added = ep_add_event_to_uring(epi, revents); + } else if (!ep_is_linked(epi)) { + list_add_tail(&epi->rdllink, &ep->rdllist); + ep_pm_stay_awake(epi); + added = true; + } + if (added) { + /* Notify waiting tasks that events are available */ + if (waitqueue_active(&ep->wq)) + wake_up(&ep->wq); + if (waitqueue_active(&ep->poll_wait)) + pwake++; + } } write_unlock_irq(&ep->lock); @@ -1982,11 +2029,16 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, * list, since that is used/cleaned only inside a section bound by "mtx". * And ep_insert() is called with "mtx" held. 
*/ - write_lock_irq(&ep->lock); - if (ep_is_linked(epi)) - list_del_init(&epi->rdllink); - write_unlock_irq(&ep->lock); + if (ep_polled_by_user(ep)) { + ep_remove_user_item(epi); + } else { + write_lock_irq(&ep->lock); + if (ep_is_linked(epi)) + list_del_init(&epi->rdllink); + write_unlock_irq(&ep->lock); + } +error_get_bit: wakeup_source_unregister(ep_wakeup_source(epi)); error_create_wakeup_source: From patchwork Mon Jan 21 20:14:52 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 10774501 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0943713BF for ; Mon, 21 Jan 2019 20:16:11 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id EE5E02A411 for ; Mon, 21 Jan 2019 20:16:10 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id E29502A878; Mon, 21 Jan 2019 20:16:10 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 88D252A411 for ; Mon, 21 Jan 2019 20:16:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728537AbfAUUQE (ORCPT ); Mon, 21 Jan 2019 15:16:04 -0500 Received: from mx2.suse.de ([195.135.220.15]:55654 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728278AbfAUUPR (ORCPT ); Mon, 21 Jan 2019 15:15:17 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 
148FDAFFB; Mon, 21 Jan 2019 20:15:16 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Davidlohr Bueso , Jason Baron , Al Viro , "Paul E. McKenney" , Linus Torvalds , Andrea Parri , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 09/13] epoll: support polling from userspace for ep_remove() Date: Mon, 21 Jan 2019 21:14:52 +0100 Message-Id: <20190121201456.28338-10-rpenyaev@suse.de> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20190121201456.28338-1-rpenyaev@suse.de> References: <20190121201456.28338-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On ep_remove() simply mark a user item with EPOLLREMOVED if the item was ready (i.e. has some bits set). That will prevent further user index entry creation on item ->bit reuse. Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Davidlohr Bueso Cc: Jason Baron Cc: Al Viro Cc: "Paul E. 
McKenney" Cc: Linus Torvalds Cc: Andrea Parri Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 628a2cadfad6..b9f51f4b94e7 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1025,10 +1025,14 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi) rb_erase_cached(&epi->rbn, &ep->rbr); - write_lock_irq(&ep->lock); - if (ep_is_linked(epi)) - list_del_init(&epi->rdllink); - write_unlock_irq(&ep->lock); + if (ep_polled_by_user(ep)) { + ep_remove_user_item(epi); + } else { + write_lock_irq(&ep->lock); + if (ep_is_linked(epi)) + list_del_init(&epi->rdllink); + write_unlock_irq(&ep->lock); + } wakeup_source_unregister(ep_wakeup_source(epi)); /* From patchwork Mon Jan 21 20:14:53 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 10774497 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0928491E for ; Mon, 21 Jan 2019 20:16:05 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id EF1632A874 for ; Mon, 21 Jan 2019 20:16:04 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id E3BDB2A9D0; Mon, 21 Jan 2019 20:16:04 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7409A2A9CE for ; Mon, 21 Jan 2019 20:16:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via 
listexpand id S1728508AbfAUUP6 (ORCPT ); Mon, 21 Jan 2019 15:15:58 -0500 Received: from mx2.suse.de ([195.135.220.15]:55600 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728291AbfAUUPR (ORCPT ); Mon, 21 Jan 2019 15:15:17 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay1.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 7D6D3AFFC; Mon, 21 Jan 2019 20:15:16 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Davidlohr Bueso , Jason Baron , Al Viro , "Paul E. McKenney" , Linus Torvalds , Andrea Parri , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 10/13] epoll: support polling from userspace for ep_modify() Date: Mon, 21 Jan 2019 21:14:53 +0100 Message-Id: <20190121201456.28338-11-rpenyaev@suse.de> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20190121201456.28338-1-rpenyaev@suse.de> References: <20190121201456.28338-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP When epfd is polled from userspace and item is being modified: 1. Update user item with new pointer or poll flags. 2. Add event to user ring if needed. Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Davidlohr Bueso Cc: Jason Baron Cc: Al Viro Cc: "Paul E. 
McKenney" Cc: Linus Torvalds Cc: Andrea Parri Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 22 +++++++++++++++++++--- 1 file changed, 19 insertions(+), 3 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index b9f51f4b94e7..3c3721c315a7 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -2058,6 +2058,8 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, static int ep_modify(struct eventpoll *ep, struct epitem *epi, const struct epoll_event *event) { + struct epoll_uitem *uitem; + __poll_t revents; int pwake = 0; poll_table pt; @@ -2072,6 +2074,13 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, */ epi->event.events = event->events; /* need barrier below */ epi->event.data = event->data; /* protected by mtx */ + + /* Update user item, barrier is below */ + if (ep_polled_by_user(ep)) { + uitem = &ep->user_header->items[epi->bit]; + uitem->event = *event; + } + if (epi->event.events & EPOLLWAKEUP) { if (!ep_has_wakeup_source(epi)) ep_create_wakeup_source(epi); @@ -2105,12 +2114,19 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, * If the item is "hot" and it is not registered inside the ready * list, push it inside. 
*/ - if (ep_item_poll(epi, &pt, 1)) { + revents = ep_item_poll(epi, &pt, 1); + if (revents) { + bool added = false; + write_lock_irq(&ep->lock); - if (!ep_is_linked(epi)) { + if (ep_polled_by_user(ep)) + added = ep_add_event_to_uring(epi, revents); + else if (!ep_is_linked(epi)) { list_add_tail(&epi->rdllink, &ep->rdllist); ep_pm_stay_awake(epi); - + added = true; + } + if (added) { /* Notify waiting tasks that events are available */ if (waitqueue_active(&ep->wq)) wake_up(&ep->wq); From patchwork Mon Jan 21 20:14:54 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 10774487 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C245A91E for ; Mon, 21 Jan 2019 20:15:31 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A7FA42A878 for ; Mon, 21 Jan 2019 20:15:31 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 9C3DF2A9C6; Mon, 21 Jan 2019 20:15:31 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0C6432A9C2 for ; Mon, 21 Jan 2019 20:15:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728346AbfAUUPT (ORCPT ); Mon, 21 Jan 2019 15:15:19 -0500 Received: from mx2.suse.de ([195.135.220.15]:55672 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1727882AbfAUUPT (ORCPT ); Mon, 21 Jan 2019 15:15:19 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from 
relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id EA8EFAFFD; Mon, 21 Jan 2019 20:15:16 +0000 (UTC) From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Davidlohr Bueso , Jason Baron , Al Viro , "Paul E. McKenney" , Linus Torvalds , Andrea Parri , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 11/13] epoll: support polling from userspace for ep_poll() Date: Mon, 21 Jan 2019 21:14:54 +0100 Message-Id: <20190121201456.28338-12-rpenyaev@suse.de> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20190121201456.28338-1-rpenyaev@suse.de> References: <20190121201456.28338-1-rpenyaev@suse.de> MIME-Version: 1.0 To: unlisted-recipients:; (no To-header on input) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Rule of thumb for epfd polled from userspace is simple: epfd has events if ->head != ->tail, no traversing of each item is performed. Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Davidlohr Bueso Cc: Jason Baron Cc: Al Viro Cc: "Paul E. McKenney" Cc: Linus Torvalds Cc: Andrea Parri Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 68 ++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 58 insertions(+), 10 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 3c3721c315a7..e70d1ab64ec1 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -440,6 +440,12 @@ static inline bool ep_polled_by_user(struct eventpoll *ep) return !!ep->user_header; } +static inline bool ep_uring_events_available(struct eventpoll *ep) +{ + return ep_polled_by_user(ep) && + ep->user_header->head != ep->user_header->tail; +} + /** * ep_events_available - Checks if ready events might be available. 
* @@ -451,7 +457,8 @@ static inline bool ep_polled_by_user(struct eventpoll *ep) static inline int ep_events_available(struct eventpoll *ep) { return !list_empty_careful(&ep->rdllist) || - READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR; + READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR || + ep_uring_events_available(ep); } #ifdef CONFIG_NET_RX_BUSY_POLL @@ -1160,7 +1167,7 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt, int depth) { - struct eventpoll *ep; + struct eventpoll *ep, *tep; bool locked; pt->_key = epi->event.events; @@ -1169,6 +1176,26 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt, ep = epi->ffd.file->private_data; poll_wait(epi->ffd.file, &ep->poll_wait, pt); + + tep = epi->ffd.file->private_data; + if (ep_polled_by_user(tep)) { + /* + * The behaviour differs comparing to full scan of ready + * list for original epoll. If descriptor is pollable + * from userspace we don't do scan of all ready user items: + * firstly because we can't do reverse search of epi by + * uitem bit, secondly this is simply waste of time for + * edge triggered descriptors (user code should be prepared + * to deal with EAGAIN returned from read() or write() on + * inserted file descriptor) and thirdly once event is put + * into user index ring do not touch it from kernel, what + * we do is mark it as EPOLLREMOVED on ep_remove() and + * that's it. + */ + return ep_uring_events_available(tep) ? + EPOLLIN | EPOLLRDNORM : 0; + } + locked = pt && (pt->_qproc == ep_ptable_queue_proc); return ep_scan_ready_list(epi->ffd.file->private_data, @@ -1211,6 +1238,12 @@ static __poll_t ep_eventpoll_poll(struct file *file, poll_table *wait) /* Insert inside our poll wait queue */ poll_wait(file, &ep->poll_wait, wait); + if (ep_polled_by_user(ep)) { + /* Please read detailed comments inside ep_item_poll() */ + return ep_uring_events_available(ep) ? 
+ EPOLLIN | EPOLLRDNORM : 0; + } + /* * Proceed to find out if wanted events are really available inside * the ready list. @@ -2232,6 +2265,8 @@ static int ep_send_events(struct eventpoll *ep, { struct ep_send_events_data esed; + WARN_ON(ep_polled_by_user(ep)); + esed.maxevents = maxevents; esed.events = events; @@ -2278,6 +2313,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, lockdep_assert_irqs_enabled(); + if (ep_polled_by_user(ep)) { + if (ep_uring_events_available(ep)) + /* Firstly all events from ring have to be consumed */ + return -ESTALE; + } + if (timeout > 0) { struct timespec64 end_time = ep_set_mstimeout(timeout); @@ -2366,14 +2407,21 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, __set_current_state(TASK_RUNNING); send_events: - /* - * Try to transfer events to user space. In case we get 0 events and - * there's still timeout left over, we go trying again in search of - * more luck. - */ - if (!res && eavail && - !(res = ep_send_events(ep, events, maxevents)) && !timed_out) - goto fetch_events; + if (!res && eavail) { + if (!ep_polled_by_user(ep)) { + /* + * Try to transfer events to user space. In case we get + * 0 events and there's still timeout left over, we go + * trying again in search of more luck. 
+ */ + res = ep_send_events(ep, events, maxevents); + if (!res && !timed_out) + goto fetch_events; + } else { + /* User has to deal with the ring himself */ + res = -ESTALE; + } + } if (waiter) { spin_lock_irq(&ep->wq.lock); From patchwork Mon Jan 21 20:14:55 2019 X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 10774491 From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Davidlohr Bueso , Jason Baron , Al Viro , "Paul E.
McKenney" , Linus Torvalds , Andrea Parri , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 12/13] epoll: support mapping for epfd when polled from userspace Date: Mon, 21 Jan 2019 21:14:55 +0100 Message-Id: <20190121201456.28338-13-rpenyaev@suse.de> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20190121201456.28338-1-rpenyaev@suse.de> References: <20190121201456.28338-1-rpenyaev@suse.de> The user has to mmap the vmalloc'ed user_header and user_index areas in order to consume events from userspace. The vma is also marked VM_DONTCOPY so that it is not copied on fork(). Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Davidlohr Bueso Cc: Jason Baron Cc: Al Viro Cc: "Paul E. McKenney" Cc: Linus Torvalds Cc: Andrea Parri Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- fs/eventpoll.c | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index e70d1ab64ec1..cceeff77bdaf 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1276,11 +1276,47 @@ static void ep_show_fdinfo(struct seq_file *m, struct file *f) } #endif +static int ep_eventpoll_mmap(struct file *filep, struct vm_area_struct *vma) +{ + struct eventpoll *ep = vma->vm_file->private_data; + size_t size; + int rc; + + if (!ep_polled_by_user(ep)) + return -ENOTSUPP; + + size = vma->vm_end - vma->vm_start; + if (!vma->vm_pgoff && size > ep->header_length) + return -ENXIO; + if (vma->vm_pgoff && ep->header_length != (vma->vm_pgoff << PAGE_SHIFT)) + /* Index ring starts exactly after the header */ + return -ENXIO; + if (vma->vm_pgoff && size > ep->index_length) + return -ENXIO; + + /* + * vm_pgoff is used *only* for indication, what is mapped: user header + * or user index ring. Sizes are checked above.
+ */ + if (!vma->vm_pgoff) + rc = remap_vmalloc_range_partial(vma, vma->vm_start, + ep->user_header, size); + else + rc = remap_vmalloc_range_partial(vma, vma->vm_start, + ep->user_index, size); + if (likely(!rc)) + /* No copies for forks(), please */ + vma->vm_flags |= VM_DONTCOPY; + + return rc; +} + /* File callbacks that implement the eventpoll file behaviour */ static const struct file_operations eventpoll_fops = { #ifdef CONFIG_PROC_FS .show_fdinfo = ep_show_fdinfo, #endif + .mmap = ep_eventpoll_mmap, .release = ep_eventpoll_release, .poll = ep_eventpoll_poll, .llseek = noop_llseek, From patchwork Mon Jan 21 20:14:56 2019 X-Patchwork-Submitter: Roman Penyaev X-Patchwork-Id: 10774489
From: Roman Penyaev Cc: Roman Penyaev , Andrew Morton , Davidlohr Bueso , Jason Baron , Al Viro , "Paul E. McKenney" , Linus Torvalds , Andrea Parri , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 13/13] epoll: implement epoll_create2() syscall Date: Mon, 21 Jan 2019 21:14:56 +0100 Message-Id: <20190121201456.28338-14-rpenyaev@suse.de> X-Mailer: git-send-email 2.19.1 In-Reply-To: <20190121201456.28338-1-rpenyaev@suse.de> References: <20190121201456.28338-1-rpenyaev@suse.de> epoll_create2() is needed to accept the EPOLL_USERPOLL flag together with a size argument, i.e. this patch wires up polling from userspace. Signed-off-by: Roman Penyaev Cc: Andrew Morton Cc: Davidlohr Bueso Cc: Jason Baron Cc: Al Viro Cc: "Paul E.
McKenney" Cc: Linus Torvalds Cc: Andrea Parri Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + fs/eventpoll.c | 8 ++++++++ include/linux/syscalls.h | 1 + include/uapi/asm-generic/unistd.h | 4 +++- kernel/sys_ni.c | 1 + 6 files changed, 15 insertions(+), 1 deletion(-) diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 6804c1e84b36..3247f49b1325 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -399,3 +399,4 @@ 385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents 386 i386 rseq sys_rseq __ia32_sys_rseq 387 i386 pidfd_send_signal sys_pidfd_send_signal __ia32_sys_pidfd_send_signal +388 i386 epoll_create2 sys_epoll_create2 __ia32_sys_epoll_create2 diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index aa4b858fa0f1..df7c4f1b4a79 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -344,6 +344,7 @@ 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq 335 common pidfd_send_signal __x64_sys_pidfd_send_signal +336 common epoll_create2 __x64_sys_epoll_create2 # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/eventpoll.c b/fs/eventpoll.c index cceeff77bdaf..a759ae591202 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -2611,6 +2611,14 @@ static int do_epoll_create(int flags, size_t size) return error; } +SYSCALL_DEFINE2(epoll_create2, int, flags, size_t, size) +{ + if (size == 0) + return -EINVAL; + + return do_epoll_create(flags, size); +} + SYSCALL_DEFINE1(epoll_create1, int, flags) { return do_epoll_create(flags, 0); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 5eb2e351675e..249ea00696a8 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -345,6 
+345,7 @@ asmlinkage long sys_eventfd2(unsigned int count, int flags); /* fs/eventpoll.c */ asmlinkage long sys_epoll_create1(int flags); +asmlinkage long sys_epoll_create2(int flags, size_t size); asmlinkage long sys_epoll_ctl(int epfd, int op, int fd, struct epoll_event __user *event); asmlinkage long sys_epoll_pwait(int epfd, struct epoll_event __user *events, diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index b77538af7aca..a4d686280cb7 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -742,9 +742,11 @@ __SYSCALL(__NR_rseq, sys_rseq) __SYSCALL(__NR_kexec_file_load, sys_kexec_file_load) #define __NR_pidfd_send_signal 295 __SYSCALL(__NR_pidfd_send_signal, sys_pidfd_send_signal) +#define __NR_epoll_create2 296 +__SYSCALL(__NR_epoll_create2, sys_epoll_create2) #undef __NR_syscalls -#define __NR_syscalls 296 +#define __NR_syscalls 297 /* * 32 bit systems traditionally used different diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index f905f4f9f677..5083bb55fcb2 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -60,6 +60,7 @@ COND_SYSCALL(eventfd2); /* fs/eventfd.c */ COND_SYSCALL(epoll_create1); +COND_SYSCALL(epoll_create2); COND_SYSCALL(epoll_ctl); COND_SYSCALL(epoll_pwait); COND_SYSCALL_COMPAT(epoll_pwait);
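
Editorial note, not part of the series: the offset and size rules enforced by ep_eventpoll_mmap() in patch 12/13 can be modelled in plain userspace C. This is only a sketch under stated assumptions. struct model_ep, model_mmap_check() and MODEL_PAGE_SHIFT are hypothetical names invented for illustration; only the three comparisons are taken from the diff. The -ENOTSUPP branch for an epfd that is not polled by the user is left out, and so is the actual remapping.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical constant for this model; the kernel uses the real PAGE_SHIFT. */
#define MODEL_PAGE_SHIFT 12

/* Hypothetical stand-in for the two length fields read from struct eventpoll. */
struct model_ep {
	size_t header_length;	/* bytes backing user_header */
	size_t index_length;	/* bytes backing user_index */
};

/*
 * Mirrors the argument checks of ep_eventpoll_mmap() from the patch:
 *   pgoff == 0 selects the user header, so size must fit header_length;
 *   pgoff != 0 selects the index ring, which must start exactly after
 *   the header, and size must fit index_length.
 * Returns 0 where the patch would go on to remap, -ENXIO otherwise.
 */
static inline int model_mmap_check(const struct model_ep *ep,
				   unsigned long pgoff, size_t size)
{
	if (!pgoff && size > ep->header_length)
		return -ENXIO;
	if (pgoff && ep->header_length != ((size_t)pgoff << MODEL_PAGE_SHIFT))
		return -ENXIO;
	if (pgoff && size > ep->index_length)
		return -ENXIO;
	return 0;
}
```

With a one-page header and a two-page index ring (4096 and 8192 bytes under the assumed page shift), model_mmap_check(&ep, 0, 4096) and model_mmap_check(&ep, 1, 8192) are accepted, while any other nonzero page offset, or a size larger than the selected area, is rejected with -ENXIO.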