Message ID | 1507920533-8812-1-git-send-email-jbaron@akamai.com (mailing list archive) |
---|---|
State | New, archived |
On Fri, 13 Oct 2017, Jason Baron wrote:

>The ep_poll_safewake() function is used to wakeup potentially nested epoll
>file descriptors. The function uses ep_call_nested() to prevent entering
>the same wake up queue more than once, and to prevent excessively deep
>wakeup paths (deeper than EP_MAX_NESTS). However, this is not necessary
>since we are already preventing these conditions during EPOLL_CTL_ADD. This
>saves extra function calls, and avoids taking a global lock during the
>ep_call_nested() calls.

This makes sense.

>I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
>case, since ep_call_nested() keeps track of the nesting level, and this is
>required by the call to spin_lock_irqsave_nested(). It would be nice to
>remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as
>well, however it's not clear how to simply pass the nesting level through
>multiple wake_up() levels without more surgery. In any case, I don't think
>CONFIG_DEBUG_LOCK_ALLOC is generally used for production. This patch also
>apparently fixes a workload at Google that Salman Qazi reported by
>completely removing the poll_safewake_ncalls->lock from wakeup paths.

I'm a bit curious about the workload (which uses lots of EPOLL_CTL_ADDs) as
I was tackling the nested epoll scaling issue with loop and readywalk lists
in mind.

>Signed-off-by: Jason Baron <jbaron@akamai.com>
>Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>Cc: Andrew Morton <akpm@linux-foundation.org>
>Cc: Salman Qazi <sqazi@google.com>

Acked-by: Davidlohr Bueso <dbueso@suse.de>

>---
> fs/eventpoll.c | 47 ++++++++++++++++++-----------------------------
> 1 file changed, 18 insertions(+), 29 deletions(-)

Yay for getting rid of some of the callback hell.

Thanks,
Davidlohr
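
For context, the EPOLL_CTL_ADD-time checks that the commit message relies on
are visible from userspace: an add that would create a cycle of epoll file
descriptors fails with ELOOP. A minimal sketch (illustrative only, not part
of the patch):

#include <stdio.h>
#include <sys/epoll.h>

int main(void)
{
	int ep1 = epoll_create1(0);
	int ep2 = epoll_create1(0);
	struct epoll_event ev = { .events = EPOLLIN };

	ev.data.fd = ep2;
	if (epoll_ctl(ep1, EPOLL_CTL_ADD, ep2, &ev))
		perror("add ep2 to ep1");	/* succeeds: one level of nesting */

	ev.data.fd = ep1;
	if (epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, &ev))
		perror("add ep1 to ep2");	/* fails with ELOOP: would create a cycle */

	return 0;
}

Because loops (and overly deep nesting) are rejected here, the wakeup path no
longer needs to re-verify them on every wakeup.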
On 10/17/2017 11:37 AM, Davidlohr Bueso wrote:
> On Fri, 13 Oct 2017, Jason Baron wrote:
>
>> The ep_poll_safewake() function is used to wakeup potentially nested
>> epoll
>> file descriptors. The function uses ep_call_nested() to prevent entering
>> the same wake up queue more than once, and to prevent excessively deep
>> wakeup paths (deeper than EP_MAX_NESTS). However, this is not necessary
>> since we are already preventing these conditions during EPOLL_CTL_ADD.
>> This
>> saves extra function calls, and avoids taking a global lock during the
>> ep_call_nested() calls.
>
> This makes sense.
>
>>
>> I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
>> case, since ep_call_nested() keeps track of the nesting level, and
>> this is
>> required by the call to spin_lock_irqsave_nested(). It would be nice to
>> remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as
>> well, however it's not clear how to simply pass the nesting level through
>> multiple wake_up() levels without more surgery. In any case, I don't
>> think
>> CONFIG_DEBUG_LOCK_ALLOC is generally used for production. This patch
>> also
>> apparently fixes a workload at Google that Salman Qazi reported by
>> completely removing the poll_safewake_ncalls->lock from wakeup paths.
>
> I'm a bit curious about the workload (which uses lots of EPOLL_CTL_ADDs) as
> I was tackling the nested epoll scaling issue with loop and readywalk lists
> in mind.
>>

I'm not sure of the details of the workload - perhaps Salman can
elaborate further about it.

It would seem that the safewake would potentially be the most contended
in general in the nested case, because generally you have a few epoll
fds attached to lots of sources doing wakeups. So those sources are all
going to conflict on the safewake lock. The readywalk is used when
performing a 'nested' poll and in general this is likely going to be
called on a few epoll fds. That said, we should remove it too. I will
post a patch to remove it.

The loop lock is used during EPOLL_CTL_ADD to check for loops and deep
wakeup paths, so I would expect this to be less common, but I wouldn't
doubt there are workloads impacted by it. We can, I think, potentially
remove this one too - and the global 'epmutex'. I posted some ideas a
while ago on it:

http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html

We can work through these ideas or others...

Thanks,

-Jason

>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Salman Qazi <sqazi@google.com>
>
> Acked-by: Davidlohr Bueso <dbueso@suse.de>
>
>> ---
>>  fs/eventpoll.c | 47 ++++++++++++++++++-----------------------------
>>  1 file changed, 18 insertions(+), 29 deletions(-)
>
> Yay for getting rid of some of the callback hell.
>
> Thanks,
> Davidlohr
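
For readers unfamiliar with the nested setup under discussion, here is a
minimal userspace sketch (illustrative only, not from the thread) of a
two-level epoll hierarchy. A wakeup on the event source propagates through
the inner epoll fd's wait queue to the outer one, which is the traversal
that ep_poll_safewake() guards:

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

int main(void)
{
	int efd = eventfd(0, 0);		/* event source doing the wakeups */
	int ep_inner = epoll_create1(0);	/* watches the source */
	int ep_outer = epoll_create1(0);	/* watches the inner epoll fd */
	struct epoll_event ev = { .events = EPOLLIN };
	uint64_t one = 1;

	ev.data.fd = efd;
	epoll_ctl(ep_inner, EPOLL_CTL_ADD, efd, &ev);
	ev.data.fd = ep_inner;
	epoll_ctl(ep_outer, EPOLL_CTL_ADD, ep_inner, &ev);

	/* This write wakes ep_inner, whose poll callback then wakes ep_outer. */
	write(efd, &one, sizeof(one));

	int n = epoll_wait(ep_outer, &ev, 1, 0);
	printf("outer epoll saw %d ready fd(s)\n", n);	/* expect 1 */
	return 0;
}

With many sources (in place of the single eventfd) all waking a few nested
epoll fds, every wakeup previously funneled through the global
poll_safewake_ncalls->lock - the contention the patch removes.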
On Wed, 18 Oct 2017, Jason Baron wrote:

>http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html
>
>We can work through these ideas or others...

So, unsurprisingly, I got some _really_ good results on the epoll_wait()
benchmark by removing the readywalk list altogether -- something like
4000% in some cases. It's also survived more testing, although I'm still
waiting for customer confirmation on his real workload, which helps
further test the changes (along with this patch to move the safewake
list to debug).

Could you please resend patch 1 in the series above?

Thanks,
Davidlohr
On 10/28/2017 10:05 AM, Davidlohr Bueso wrote:
> On Wed, 18 Oct 2017, Jason Baron wrote:
>
>> http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html
>>
>> We can work through these ideas or others...
>
> So, unsurprisingly, I got some _really_ good results on the epoll_wait()
> benchmark by removing the readywalk list altogether -- something like
> 4000% in some cases. It's also survived more testing, although I'm still
> waiting for customer confirmation on his real workload, which helps
> further test the changes (along with this patch to move the safewake
> list to debug).
>
> Could you please resend patch 1 in the series above?
>

Hi,

I've resent patch 1 with some cleanups now. Thanks for the testing
feedback.

Thanks,

-Jason
Hi Jason,

On 2017/10/18 22:03, Jason Baron wrote:
>
>
> On 10/17/2017 11:37 AM, Davidlohr Bueso wrote:
>> On Fri, 13 Oct 2017, Jason Baron wrote:
>>
>>> The ep_poll_safewake() function is used to wakeup potentially nested
>>> epoll
>>> file descriptors. The function uses ep_call_nested() to prevent entering
>>> the same wake up queue more than once, and to prevent excessively deep
>>> wakeup paths (deeper than EP_MAX_NESTS). However, this is not necessary
>>> since we are already preventing these conditions during EPOLL_CTL_ADD.
>>> This
>>> saves extra function calls, and avoids taking a global lock during the
>>> ep_call_nested() calls.
>>
>> This makes sense.
>>
>>>
>>> I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
>>> case, since ep_call_nested() keeps track of the nesting level, and
>>> this is
>>> required by the call to spin_lock_irqsave_nested(). It would be nice to
>>> remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as
>>> well, however it's not clear how to simply pass the nesting level through
>>> multiple wake_up() levels without more surgery. In any case, I don't
>>> think
>>> CONFIG_DEBUG_LOCK_ALLOC is generally used for production. This patch
>>> also
>>> apparently fixes a workload at Google that Salman Qazi reported by
>>> completely removing the poll_safewake_ncalls->lock from wakeup paths.
>>
>> I'm a bit curious about the workload (which uses lots of EPOLL_CTL_ADDs) as
>> I was tackling the nested epoll scaling issue with loop and readywalk lists
>> in mind.
>>>
>
> I'm not sure of the details of the workload - perhaps Salman can
> elaborate further about it.
>
> It would seem that the safewake would potentially be the most contended
> in general in the nested case, because generally you have a few epoll
> fds attached to lots of sources doing wakeups. So those sources are all
> going to conflict on the safewake lock. The readywalk is used when
> performing a 'nested' poll and in general this is likely going to be
> called on a few epoll fds. That said, we should remove it too. I will
> post a patch to remove it.
>
> The loop lock is used during EPOLL_CTL_ADD to check for loops and deep
> wakeup paths, so I would expect this to be less common, but I wouldn't
> doubt there are workloads impacted by it. We can, I think, potentially
> remove this one too - and the global 'epmutex'. I posted some ideas a
> while ago on it:
>
> http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html
>
> We can work through these ideas or others...

I have tested these patches on a simple non-nested epoll workload and
found that there is a performance regression (-0.5%, which is better than
the -1.0% from my patch set) under the normal nginx conf, and a
performance improvement (+3%; the number for my patch set is +4%) under
the one-fd-per-request conf. The reason for the performance improvement
is clear; however, I haven't figured out the reason for the performance
regression (maybe cache related?).

It seems that your patch set will work fine both under normal and nested
epoll workloads, so I don't plan to continue my patch set, which only
works for the normal epoll workload. I have one question about your
patch set. It is about the newly added fields in struct file. I had
tried to move these fields into struct eventpoll, so only epoll fds
would be added into a disjoint set and the target fd would be linked to
a disjoint set dynamically, but that method doesn't work out because
there are multiple target fds and the order in which these target fds
are added to a disjoint set can lead to deadlock. So do you have other
ideas to reduce the size of these fields?

Thanks,
Tao

The following lines are the performance results:

baseline: result for v4.14
ep_cc:    result for v4.14 + ep_cc patch set
my:       result for v4.14 + my patch set

The first 15 columns are the results from wrk and their unit is
Requests/sec; the last two columns are their average value and standard
deviation.

* normal scenario: nginx + wrk

baseline
376704.74 377223.88 378096.67 379484.45 379199.04 379526.21 378008.82 \
379959.98 377634.47 379127.27 377622.92 379442.12 378994.44 376000.08 377046.58 \
378271.4447 1210.673592

ep_cc
373376.08 376897.65 373493.65 376154.44 377374.36 379080.88 375124.44 \
376404.16 376539.28 375141.84 379075.32 377214.39 374524.52 375605.87 372167.8 \
375878.312 1983.190539

my
373067.96 372369 375317.61 375564.66 373841.88 373699.88 376802.91 369988.97 \
376080.45 371580.64 374265.34 376370.34 377033.75 372786.91 375521.61 \
374286.1273 2059.92093

* one-fd-per-request scenario: nginx + wrk

baseline
124509.77 106542.46 118074.46 111434.02 117628.9 108260.56 119037.91 \
114413.75 108706.82 116277.99 111588.4 118216.56 121267.63 110950.4 116455.43 \
114891.004 5168.30288

ep_cc
124209.45 118297.8 120258.1 122785.95 118744.64 118471.76 117621.07 \
117525.14 114797.14 115348.47 117154.02 117208.37 118618.92 118578.8 118945.52 \
118571.01 2437.050182

my
125090.99 121408.92 116444.09 120055.4 119399.24 114938.04 122889.56 \
114363.49 120934.15 119052.03 118060.47 129976.45 121108.08 114849.93 114810.64 \
119558.7653 4338.18401

> Thanks,
>
> -Jason
>
>
>>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Salman Qazi <sqazi@google.com>
>>
>> Acked-by: Davidlohr Bueso <dbueso@suse.de>
>>
>>> ---
>>>  fs/eventpoll.c | 47 ++++++++++++++++++-----------------------------
>>>  1 file changed, 18 insertions(+), 29 deletions(-)
>>
>> Yay for getting rid of some of the callback hell.
>>
>> Thanks,
>> Davidlohr
>
> .
>
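
For reference, the disjoint-set structure discussed above is classic
union-find; a generic sketch follows (hypothetical names, not code from
either patch set). The idea is to group epoll fds into connected components
so loop/nesting checks can serialize per component instead of on the global
epmutex; Hou Tao's deadlock concern is about the order in which component
locks are taken when sets must be merged:

/* A node in a disjoint set; parent points to itself at the root. */
struct ep_set {
	struct ep_set *parent;
};

static void ep_set_init(struct ep_set *s)
{
	s->parent = s;		/* each node starts as its own component */
}

static struct ep_set *ep_set_find(struct ep_set *s)
{
	while (s->parent != s) {
		s->parent = s->parent->parent;	/* path halving */
		s = s->parent;
	}
	return s;
}

static void ep_set_union(struct ep_set *a, struct ep_set *b)
{
	a = ep_set_find(a);
	b = ep_set_find(b);
	if (a != b)
		b->parent = a;	/* merge the two components */
}

Two epoll fds end up in the same component once one is added to the other,
so a loop check only needs to lock that one component.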
On 01/18/2018 06:00 AM, Hou Tao wrote:
> Hi Jason,
>
> On 2017/10/18 22:03, Jason Baron wrote:
>>
>>
>> On 10/17/2017 11:37 AM, Davidlohr Bueso wrote:
>>> On Fri, 13 Oct 2017, Jason Baron wrote:
>>>
>>>> The ep_poll_safewake() function is used to wakeup potentially nested
>>>> epoll
>>>> file descriptors. The function uses ep_call_nested() to prevent entering
>>>> the same wake up queue more than once, and to prevent excessively deep
>>>> wakeup paths (deeper than EP_MAX_NESTS). However, this is not necessary
>>>> since we are already preventing these conditions during EPOLL_CTL_ADD.
>>>> This
>>>> saves extra function calls, and avoids taking a global lock during the
>>>> ep_call_nested() calls.
>>>
>>> This makes sense.
>>>
>>>>
>>>> I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
>>>> case, since ep_call_nested() keeps track of the nesting level, and
>>>> this is
>>>> required by the call to spin_lock_irqsave_nested(). It would be nice to
>>>> remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as
>>>> well, however it's not clear how to simply pass the nesting level through
>>>> multiple wake_up() levels without more surgery. In any case, I don't
>>>> think
>>>> CONFIG_DEBUG_LOCK_ALLOC is generally used for production. This patch
>>>> also
>>>> apparently fixes a workload at Google that Salman Qazi reported by
>>>> completely removing the poll_safewake_ncalls->lock from wakeup paths.
>>>
>>> I'm a bit curious about the workload (which uses lots of EPOLL_CTL_ADDs) as
>>> I was tackling the nested epoll scaling issue with loop and readywalk lists
>>> in mind.
>>>>
>>
>> I'm not sure of the details of the workload - perhaps Salman can
>> elaborate further about it.
>>
>> It would seem that the safewake would potentially be the most contended
>> in general in the nested case, because generally you have a few epoll
>> fds attached to lots of sources doing wakeups. So those sources are all
>> going to conflict on the safewake lock. The readywalk is used when
>> performing a 'nested' poll and in general this is likely going to be
>> called on a few epoll fds. That said, we should remove it too. I will
>> post a patch to remove it.
>>
>> The loop lock is used during EPOLL_CTL_ADD to check for loops and deep
>> wakeup paths, so I would expect this to be less common, but I wouldn't
>> doubt there are workloads impacted by it. We can, I think, potentially
>> remove this one too - and the global 'epmutex'. I posted some ideas a
>> while ago on it:
>>
>> http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html
>>
>> We can work through these ideas or others...
>
> I have tested these patches on a simple non-nested epoll workload and
> found that there is a performance regression (-0.5%, which is better than
> the -1.0% from my patch set) under the normal nginx conf, and a
> performance improvement (+3%; the number for my patch set is +4%) under
> the one-fd-per-request conf. The reason for the performance improvement
> is clear; however, I haven't figured out the reason for the performance
> regression (maybe cache related?).
>
> It seems that your patch set will work fine both under normal and nested
> epoll workloads, so I don't plan to continue my patch set, which only
> works for the normal epoll workload. I have one question about your
> patch set. It is about the newly added fields in struct file. I had
> tried to move these fields into struct eventpoll, so only epoll fds
> would be added into a disjoint set and the target fd would be linked to
> a disjoint set dynamically, but that method doesn't work out because
> there are multiple target fds and the order in which these target fds
> are added to a disjoint set can lead to deadlock. So do you have other
> ideas to reduce the size of these fields?
>
> Thanks,
> Tao

Hi,

I agree that increasing the size of struct file is a no-go. One idea is
to simply have a single pointer in 'struct file' for epoll usage that
could be dynamically allocated when a file is added to an epoll set. That
would reduce the size of things from where we currently are.

I'm also not sure that the target fds (or non-epoll fds) need to be added
to the disjoint sets. The sets are there to serialize loop checks, which
the target fds cannot be a part of.

If there is interest in resurrecting this patchset, I can pick it back
up. I hadn't been pushing it forward because I wasn't convinced it
mattered on real workloads...

Thanks,

-Jason

> The following lines are the performance results:
>
> baseline: result for v4.14
> ep_cc:    result for v4.14 + ep_cc patch set
> my:       result for v4.14 + my patch set
>
> The first 15 columns are the results from wrk and their unit is
> Requests/sec; the last two columns are their average value and standard
> deviation.
>
> * normal scenario: nginx + wrk
>
> baseline
> 376704.74 377223.88 378096.67 379484.45 379199.04 379526.21 378008.82 \
> 379959.98 377634.47 379127.27 377622.92 379442.12 378994.44 376000.08 377046.58 \
> 378271.4447 1210.673592
>
> ep_cc
> 373376.08 376897.65 373493.65 376154.44 377374.36 379080.88 375124.44 \
> 376404.16 376539.28 375141.84 379075.32 377214.39 374524.52 375605.87 372167.8 \
> 375878.312 1983.190539
>
> my
> 373067.96 372369 375317.61 375564.66 373841.88 373699.88 376802.91 369988.97 \
> 376080.45 371580.64 374265.34 376370.34 377033.75 372786.91 375521.61 \
> 374286.1273 2059.92093
>
> * one-fd-per-request scenario: nginx + wrk
>
> baseline
> 124509.77 106542.46 118074.46 111434.02 117628.9 108260.56 119037.91 \
> 114413.75 108706.82 116277.99 111588.4 118216.56 121267.63 110950.4 116455.43 \
> 114891.004 5168.30288
>
> ep_cc
> 124209.45 118297.8 120258.1 122785.95 118744.64 118471.76 117621.07 \
> 117525.14 114797.14 115348.47 117154.02 117208.37 118618.92 118578.8 118945.52 \
> 118571.01 2437.050182
>
> my
> 125090.99 121408.92 116444.09 120055.4 119399.24 114938.04 122889.56 \
> 114363.49 120934.15 119052.03 118060.47 129976.45 121108.08 114849.93 114810.64 \
> 119558.7653 4338.18401
>
>> Thanks,
>>
>> -Jason
>>
>>
>>>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>>>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: Salman Qazi <sqazi@google.com>
>>>
>>> Acked-by: Davidlohr Bueso <dbueso@suse.de>
>>>
>>>> ---
>>>>  fs/eventpoll.c | 47 ++++++++++++++++++-----------------------------
>>>>  1 file changed, 18 insertions(+), 29 deletions(-)
>>>
>>> Yay for getting rid of some of the callback hell.
>>>
>>> Thanks,
>>> Davidlohr
>>
>> .
>>
>
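
A rough sketch of the single-pointer idea from the reply above (all names
are hypothetical and this is not code from any posted patch set): the new
epoll fields move behind one lazily allocated pointer, so 'struct file'
grows by a single pointer and files never added to an epoll set pay nothing
beyond that:

struct ep_set;			/* disjoint-set node, as sketched earlier */

/* Hypothetical container for the new per-file epoll state. */
struct file_ep_state {
	struct ep_set *component;	/* loop-check component (epoll fds only) */
	/* ...any other fields the loop-check rework needs... */
};

/* Illustrative stand-in for the kernel's struct file, not its real layout. */
struct file {
	void *private_data;
	struct file_ep_state *f_ep_state;	/* NULL until first EPOLL_CTL_ADD */
};

The pointer would be allocated on the first EPOLL_CTL_ADD targeting the
file and freed when the last epoll reference goes away, keeping the common
(non-epoll) case at the cost of one extra NULL pointer.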
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2fabd19..69acfab 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -276,9 +276,6 @@ static DEFINE_MUTEX(epmutex);
 /* Used to check for epoll file descriptor inclusion loops */
 static struct nested_calls poll_loop_ncalls;
 
-/* Used for safe wake up implementation */
-static struct nested_calls poll_safewake_ncalls;
-
 /* Used to call file's f_op->poll() under the nested calls boundaries */
 static struct nested_calls poll_readywalk_ncalls;
 
@@ -551,40 +548,21 @@ static int ep_call_nested(struct nested_calls *ncalls, int max_nests,
  * this special case of epoll.
  */
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
-static inline void ep_wake_up_nested(wait_queue_head_t *wqueue,
-				     unsigned long events, int subclass)
+
+static struct nested_calls poll_safewake_ncalls;
+
+static int ep_poll_wakeup_proc(void *priv, void *cookie, int call_nests)
 {
 	unsigned long flags;
+	wait_queue_head_t *wqueue = (wait_queue_head_t *)cookie;
 
-	spin_lock_irqsave_nested(&wqueue->lock, flags, subclass);
-	wake_up_locked_poll(wqueue, events);
+	spin_lock_irqsave_nested(&wqueue->lock, flags, call_nests + 1);
+	wake_up_locked_poll(wqueue, POLLIN);
 	spin_unlock_irqrestore(&wqueue->lock, flags);
-}
-#else
-static inline void ep_wake_up_nested(wait_queue_head_t *wqueue,
-				     unsigned long events, int subclass)
-{
-	wake_up_poll(wqueue, events);
-}
-#endif
 
-static int ep_poll_wakeup_proc(void *priv, void *cookie, int call_nests)
-{
-	ep_wake_up_nested((wait_queue_head_t *) cookie, POLLIN,
-			  1 + call_nests);
 	return 0;
 }
 
-/*
- * Perform a safe wake up of the poll wait list. The problem is that
- * with the new callback'd wake up system, it is possible that the
- * poll callback is reentered from inside the call to wake_up() done
- * on the poll wait queue head. The rule is that we cannot reenter the
- * wake up code from the same task more than EP_MAX_NESTS times,
- * and we cannot reenter the same wait queue head at all. This will
- * enable to have a hierarchy of epoll file descriptor of no more than
- * EP_MAX_NESTS deep.
- */
 static void ep_poll_safewake(wait_queue_head_t *wq)
 {
 	int this_cpu = get_cpu();
@@ -595,6 +573,15 @@ static void ep_poll_safewake(wait_queue_head_t *wq)
 	put_cpu();
 }
 
+#else
+
+static void ep_poll_safewake(wait_queue_head_t *wq)
+{
+	wake_up_poll(wq, POLLIN);
+}
+
+#endif
+
 static void ep_remove_wait_queue(struct eppoll_entry *pwq)
 {
 	wait_queue_head_t *whead;
@@ -2315,8 +2302,10 @@ static int __init eventpoll_init(void)
 	 */
 	ep_nested_calls_init(&poll_loop_ncalls);
 
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
 	/* Initialize the structure used to perform safe poll wait head wake ups */
 	ep_nested_calls_init(&poll_safewake_ncalls);
+#endif
 
 	/* Initialize the structure used to perform file's f_op->poll() calls */
 	ep_nested_calls_init(&poll_readywalk_ncalls);
The ep_poll_safewake() function is used to wakeup potentially nested epoll
file descriptors. The function uses ep_call_nested() to prevent entering
the same wake up queue more than once, and to prevent excessively deep
wakeup paths (deeper than EP_MAX_NESTS). However, this is not necessary
since we are already preventing these conditions during EPOLL_CTL_ADD. This
saves extra function calls, and avoids taking a global lock during the
ep_call_nested() calls.

I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
case, since ep_call_nested() keeps track of the nesting level, and this is
required by the call to spin_lock_irqsave_nested(). It would be nice to
remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as
well, however it's not clear how to simply pass the nesting level through
multiple wake_up() levels without more surgery. In any case, I don't think
CONFIG_DEBUG_LOCK_ALLOC is generally used for production. This patch also
apparently fixes a workload at Google that Salman Qazi reported by
completely removing the poll_safewake_ncalls->lock from wakeup paths.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Salman Qazi <sqazi@google.com>
---
 fs/eventpoll.c | 47 ++++++++++++++++++-----------------------------
 1 file changed, 18 insertions(+), 29 deletions(-)