Message ID | 1454600344-23655-1-git-send-email-jbaron@akamai.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hi Jason, Just run off the original tests with this patch (eventpoll.c from 4.5-rc2 + patch bellow). Got the same good results, no regression. $ time ./bankcl Account balance is: 1359856158.04 USD Account balance is: 1359856158.04 USD Account balance is: 1359856158.04 USD Account balance is: 1359856158.04 USD Account balance is: 1359856158.04 USD Account balance is: 1359856158.04 USD Account balance is: 1359856158.04 USD Account balance is: 1359856158.04 USD Account balance is: 1359856158.04 USD Account balance is: 101528948.40 USD real 0m41.826s user 0m29.513s sys 0m6.490s Test case: https://mvitolin.wordpress.com/2015/12/05/endurox-testing-epollexclusive-flag/ PS, Original cases 0m24.953s vs 0m41.826s now probably is related with my pc setup. As I just now re-run test with original patch, got the same ~41 sec. So I am fine with this patch! Thanks, Madars Jason Baron @ 2016-02-04 17:39 rakst?ja: > In the current implementation of the EPOLLEXCLUSIVE flag (added for > 4.5-rc1), > if epoll waiters create different POLL* sets and register them as > exclusive > against the same target fd, the current implementation will stop waking > any > further waiters once it finds the first idle waiter. This means that > waiters > could miss wakeups in certain cases. > > For example, when we wake up a pipe for reading we do: > wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM); > So if one epoll set or epfd is added to pipe p with POLLIN and a second > set > epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the > wakeup > since the current implementation will stop after it finds any > intersection > of events with a waiter that is blocked in epoll_wait(). > > We could potentially address this by requiring all epoll waiters that > are > added to p be required to pass the same set of POLL* events. IE the > first > EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL* > flags to > be used by any other epfds that are added as EPOLLEXCLUSIVE. However, I > think > it might be somewhat confusing interface as we would have to reference > count > the number of users for that set, and so userspace would have to keep > track > of that count, or we would need a more involved interface. It also adds > some > shared state that we'd have store somewhere. I don't think anybody will > want > to bloat __wait_queue_head for this. > > I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE > such > that it can only be specified with EPOLLIN and/or EPOLLOUT. So that way > if > the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop once we hit > the > first idle waiter that specifies the EPOLLIN bit, since any remaining > waiters > that only have 'POLLOUT' set wouldn't need to be woken. Likewise, we > can do > the same thing if 'POLLOUT' is in the wakeup bit set and not 'POLLIN'. > If both > 'POLLOUT' and 'POLLIN' are set in the wake bit set (there is at least > one > example of this I saw in fs/pipe.c), then we just wake the entire > exclusive > list. Having both 'POLLOUT' and 'POLLIN' both set should not be on any > performance critical path, so I think that's ok (in fs/pipe.c its in > pipe_release()). We also continue to include EPOLLERR and EPOLLHUP by > default > in any exclusive set. Thus, the user can specify EPOLLERR and/or > EPOLLHUP but > is not required to do so. > > Since epoll waiters may be interested in other events as well besides > EPOLLIN, > EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by doing a > 'dup' call > on the target fd and adding that as one normally would with > EPOLL_CTL_ADD. Since > I think that the POLLIN and POLLOUT events are what we are interest in > balancing, > I think that the 'dup' thing could perhaps be added to only one of the > waiter > threads. However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP > should > be sufficient for the majority of use-cases. > > Since EPOLLEXCLUSIVE is intended to be used with a target fd shared > among > multiple epfds, where between 1 and n of the epfds may receive an > event, it > does not satisfy the semantics of EPOLLONESHOT where only 1 epfd would > get an > event. Thus, it is not allowed to be specified in conjunction with > EPOLLEXCLUSIVE. > > EPOLL_CTL_MOD is also not allowed if the fd was previously added as > EPOLLEXCLUSIVE. It seems with the limited number of flags to not be as > interesting, but this could be relaxed at some further point. > > Signed-off-by: Jason Baron <jbaron@akamai.com> > --- > fs/eventpoll.c | 38 ++++++++++++++++++++++++++++++++------ > 1 file changed, 32 insertions(+), 6 deletions(-) > > diff --git a/fs/eventpoll.c b/fs/eventpoll.c > index ae1dbcf..cde6074 100644 > --- a/fs/eventpoll.c > +++ b/fs/eventpoll.c > @@ -94,6 +94,11 @@ > /* Epoll private bits inside the event mask */ > #define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | > EPOLLEXCLUSIVE) > > +#define EPOLLINOUT_BITS (POLLIN | POLLOUT) > + > +#define EPOLLEXCLUSIVE_OK_BITS (EPOLLINOUT_BITS | POLLERR | POLLHUP | > \ > + EPOLLWAKEUP | EPOLLET | EPOLLEXCLUSIVE) > + > /* Maximum number of nesting allowed inside epoll sets */ > #define EP_MAX_NESTS 4 > > @@ -1068,7 +1073,22 @@ static int ep_poll_callback(wait_queue_t *wait, > unsigned mode, int sync, void *k > * wait list. > */ > if (waitqueue_active(&ep->wq)) { > - ewake = 1; > + if ((epi->event.events & EPOLLEXCLUSIVE) && > + !((unsigned long)key & POLLFREE)) { > + switch ((unsigned long)key & EPOLLINOUT_BITS) { > + case POLLIN: > + if (epi->event.events & POLLIN) > + ewake = 1; > + break; > + case POLLOUT: > + if (epi->event.events & POLLOUT) > + ewake = 1; > + break; > + case 0: > + ewake = 1; > + break; > + } > + } > wake_up_locked(&ep->wq); > } > if (waitqueue_active(&ep->poll_wait)) > @@ -1875,9 +1895,13 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, > int, fd, > * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation. > * Also, we do not currently supported nested exclusive wakeups. > */ > - if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD || > - (op == EPOLL_CTL_ADD && is_file_epoll(tf.file)))) > - goto error_tgt_fput; > + if (epds.events & EPOLLEXCLUSIVE) { > + if (op == EPOLL_CTL_MOD) > + goto error_tgt_fput; > + if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) || > + (epds.events & ~EPOLLEXCLUSIVE_OK_BITS))) > + goto error_tgt_fput; > + } > > /* > * At this point it is safe to assume that the "private_data" > contains > @@ -1950,8 +1974,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, > int, fd, > break; > case EPOLL_CTL_MOD: > if (epi) { > - epds.events |= POLLERR | POLLHUP; > - error = ep_modify(ep, epi, &epds); > + if (!(epi->event.events & EPOLLEXCLUSIVE)) { > + epds.events |= POLLERR | POLLHUP; > + error = ep_modify(ep, epi, &epds); > + } > } else > error = -ENOENT; > break; -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 04 Feb 2016 23:44:05 +0200 Madars Vitolins <m@silodev.com> wrote: > Hi Jason, > > > Just run off the original tests with this patch (eventpoll.c from > 4.5-rc2 + patch bellow). Got the same good results, no regression. > > $ time ./bankcl > Account balance is: 1359856158.04 USD > Account balance is: 1359856158.04 USD > Account balance is: 1359856158.04 USD > Account balance is: 1359856158.04 USD > Account balance is: 1359856158.04 USD > Account balance is: 1359856158.04 USD > Account balance is: 1359856158.04 USD > Account balance is: 1359856158.04 USD > Account balance is: 1359856158.04 USD > Account balance is: 101528948.40 USD > > real 0m41.826s > user 0m29.513s > sys 0m6.490s > > > Test case: > https://mvitolin.wordpress.com/2015/12/05/endurox-testing-epollexclusive-flag/ > > PS, > > Original cases 0m24.953s vs 0m41.826s now probably is related with my pc > setup. As I just now re-run test with original patch, got the same ~41 > sec. > > So I am fine with this patch! > Thanks, I shall add your Tested-by: One thing we're sorely missing is an epoll test suite, in tools/testing/selftests. If anyone has anything which we can use to kick things off, please hand it over ;) -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Andrew Morton @ 2016-02-05 00:59 rakst?ja: > On Thu, 04 Feb 2016 23:44:05 +0200 Madars Vitolins <m@silodev.com> > wrote: > >> Hi Jason, >> >> >> Just run off the original tests with this patch (eventpoll.c from >> 4.5-rc2 + patch bellow). Got the same good results, no regression. >> >> $ time ./bankcl >> Account balance is: 1359856158.04 USD >> Account balance is: 1359856158.04 USD >> Account balance is: 1359856158.04 USD >> Account balance is: 1359856158.04 USD >> Account balance is: 1359856158.04 USD >> Account balance is: 1359856158.04 USD >> Account balance is: 1359856158.04 USD >> Account balance is: 1359856158.04 USD >> Account balance is: 1359856158.04 USD >> Account balance is: 101528948.40 USD >> >> real 0m41.826s >> user 0m29.513s >> sys 0m6.490s >> >> >> Test case: >> https://mvitolin.wordpress.com/2015/12/05/endurox-testing-epollexclusive-flag/ >> >> PS, >> >> Original cases 0m24.953s vs 0m41.826s now probably is related with my >> pc >> setup. As I just now re-run test with original patch, got the same ~41 >> sec. >> >> So I am fine with this patch! >> > > Thanks, I shall add your Tested-by: > > One thing we're sorely missing is an epoll test suite, in > tools/testing/selftests. If anyone has anything which we can use to > kick things off, please hand it over ;) Not bad idea, probably we need a "tools/testing/selftests/eventpoll" folder under which we should have test cases for various epoll scenarios with common "run.sh". In spare time I can try to build a case for EXCLUSIVE flag (with queues & multiple processes :) ). Thanks, Madars -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/eventpoll.c b/fs/eventpoll.c index ae1dbcf..cde6074 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -94,6 +94,11 @@ /* Epoll private bits inside the event mask */ #define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE) +#define EPOLLINOUT_BITS (POLLIN | POLLOUT) + +#define EPOLLEXCLUSIVE_OK_BITS (EPOLLINOUT_BITS | POLLERR | POLLHUP | \ + EPOLLWAKEUP | EPOLLET | EPOLLEXCLUSIVE) + /* Maximum number of nesting allowed inside epoll sets */ #define EP_MAX_NESTS 4 @@ -1068,7 +1073,22 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k * wait list. */ if (waitqueue_active(&ep->wq)) { - ewake = 1; + if ((epi->event.events & EPOLLEXCLUSIVE) && + !((unsigned long)key & POLLFREE)) { + switch ((unsigned long)key & EPOLLINOUT_BITS) { + case POLLIN: + if (epi->event.events & POLLIN) + ewake = 1; + break; + case POLLOUT: + if (epi->event.events & POLLOUT) + ewake = 1; + break; + case 0: + ewake = 1; + break; + } + } wake_up_locked(&ep->wq); } if (waitqueue_active(&ep->poll_wait)) @@ -1875,9 +1895,13 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation. * Also, we do not currently supported nested exclusive wakeups. */ - if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD || - (op == EPOLL_CTL_ADD && is_file_epoll(tf.file)))) - goto error_tgt_fput; + if (epds.events & EPOLLEXCLUSIVE) { + if (op == EPOLL_CTL_MOD) + goto error_tgt_fput; + if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) || + (epds.events & ~EPOLLEXCLUSIVE_OK_BITS))) + goto error_tgt_fput; + } /* * At this point it is safe to assume that the "private_data" contains @@ -1950,8 +1974,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, break; case EPOLL_CTL_MOD: if (epi) { - epds.events |= POLLERR | POLLHUP; - error = ep_modify(ep, epi, &epds); + if (!(epi->event.events & EPOLLEXCLUSIVE)) { + epds.events |= POLLERR | POLLHUP; + error = ep_modify(ep, epi, &epds); + } } else error = -ENOENT; break;
In the current implementation of the EPOLLEXCLUSIVE flag (added for 4.5-rc1), if epoll waiters create different POLL* sets and register them as exclusive against the same target fd, the current implementation will stop waking any further waiters once it finds the first idle waiter. This means that waiters could miss wakeups in certain cases. For example, when we wake up a pipe for reading we do: wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM); So if one epoll set or epfd is added to pipe p with POLLIN and a second set epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the wakeup since the current implementation will stop after it finds any intersection of events with a waiter that is blocked in epoll_wait(). We could potentially address this by requiring all epoll waiters that are added to p be required to pass the same set of POLL* events. IE the first EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL* flags to be used by any other epfds that are added as EPOLLEXCLUSIVE. However, I think it might be somewhat confusing interface as we would have to reference count the number of users for that set, and so userspace would have to keep track of that count, or we would need a more involved interface. It also adds some shared state that we'd have store somewhere. I don't think anybody will want to bloat __wait_queue_head for this. I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE such that it can only be specified with EPOLLIN and/or EPOLLOUT. So that way if the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop once we hit the first idle waiter that specifies the EPOLLIN bit, since any remaining waiters that only have 'POLLOUT' set wouldn't need to be woken. Likewise, we can do the same thing if 'POLLOUT' is in the wakeup bit set and not 'POLLIN'. If both 'POLLOUT' and 'POLLIN' are set in the wake bit set (there is at least one example of this I saw in fs/pipe.c), then we just wake the entire exclusive list. Having both 'POLLOUT' and 'POLLIN' both set should not be on any performance critical path, so I think that's ok (in fs/pipe.c its in pipe_release()). We also continue to include EPOLLERR and EPOLLHUP by default in any exclusive set. Thus, the user can specify EPOLLERR and/or EPOLLHUP but is not required to do so. Since epoll waiters may be interested in other events as well besides EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by doing a 'dup' call on the target fd and adding that as one normally would with EPOLL_CTL_ADD. Since I think that the POLLIN and POLLOUT events are what we are interest in balancing, I think that the 'dup' thing could perhaps be added to only one of the waiter threads. However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should be sufficient for the majority of use-cases. Since EPOLLEXCLUSIVE is intended to be used with a target fd shared among multiple epfds, where between 1 and n of the epfds may receive an event, it does not satisfy the semantics of EPOLLONESHOT where only 1 epfd would get an event. Thus, it is not allowed to be specified in conjunction with EPOLLEXCLUSIVE. EPOLL_CTL_MOD is also not allowed if the fd was previously added as EPOLLEXCLUSIVE. It seems with the limited number of flags to not be as interesting, but this could be relaxed at some further point. Signed-off-by: Jason Baron <jbaron@akamai.com> --- fs/eventpoll.c | 38 ++++++++++++++++++++++++++++++++------ 1 file changed, 32 insertions(+), 6 deletions(-)