mbox series

[0/3] use rwlock in order to reduce ep_poll_callback() contention

Message ID 20181212110357.25656-1-rpenyaev@suse.de (mailing list archive)
Headers show
Series use rwlock in order to reduce ep_poll_callback() contention | expand

Message

Roman Penyaev Dec. 12, 2018, 11:03 a.m. UTC
The last patch targets the contention problem in ep_poll_callback(), which
can be very well reproduced by generating events (write to pipe or eventfd)
from many threads, while consumer thread does polling.

The following are some microbenchmark results based on the test [1] which
starts threads which generate N events each.  The test ends when all events
are successfully fetched by the poller thread:

 spinlock
 ========

 threads  events/ms  run-time ms
       8       6402        12495
      16       7045        22709
      32       7395        43268

 rwlock + xchg
 =============

 threads  events/ms  run-time ms
       8      10038         7969
      16      12178        13138
      32      13223        24199


According to the results bandwidth of delivered events is significantly
increased, thus execution time is reduced.

This series is based on linux-next/akpm and differs from RFC in that
additional cleanup patches and explicit comments have been added.

[1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c

Roman Penyaev (3):
  epoll: make sure all elements in ready list are in FIFO order
  epoll: loosen irq safety in ep_poll_callback()
  epoll: use rwlock in order to reduce ep_poll_callback() contention

 fs/eventpoll.c | 127 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 93 insertions(+), 34 deletions(-)

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

Comments

Davidlohr Bueso Dec. 13, 2018, 6:13 p.m. UTC | #1
On 2018-12-12 03:03, Roman Penyaev wrote:
> The last patch targets the contention problem in ep_poll_callback(), 
> which
> can be very well reproduced by generating events (write to pipe or 
> eventfd)
> from many threads, while consumer thread does polling.
> 
> The following are some microbenchmark results based on the test [1] 
> which
> starts threads which generate N events each.  The test ends when all 
> events
> are successfully fetched by the poller thread:
> 
>  spinlock
>  ========
> 
>  threads  events/ms  run-time ms
>        8       6402        12495
>       16       7045        22709
>       32       7395        43268
> 
>  rwlock + xchg
>  =============
> 
>  threads  events/ms  run-time ms
>        8      10038         7969
>       16      12178        13138
>       32      13223        24199
> 
> 
> According to the results bandwidth of delivered events is significantly
> increased, thus execution time is reduced.
> 
> This series is based on linux-next/akpm and differs from RFC in that
> additional cleanup patches and explicit comments have been added.
> 
> [1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c

Care to "port" this to 'perf bench epoll', in linux-next? I've been 
trying to unify into perf bench the whole epoll performance testcases 
kernel developers can use when making changes and it would be useful.

I ran these patches on the 'wait' workload which is a epoll_wait(2) 
stresser. On a 40-core IvyBridge it shows good performance improvements 
for increasing number of file descriptors each of the 40 threads deals 
with:

64   fds: +20%
512  fds: +30%
1024 fds: +50%

(Yes these are pretty raw measurements ops/sec). Unlike your benchmark, 
though, there is only single writer thread, and therefore is less ideal 
to measure optimizations when IO becomes available. Hence it would be 
nice to also have this.

Thanks,
Davidlohr
Roman Penyaev Dec. 17, 2018, 11:49 a.m. UTC | #2
On 2018-12-13 19:13, Davidlohr Bueso wrote:
> On 2018-12-12 03:03, Roman Penyaev wrote:
>> The last patch targets the contention problem in ep_poll_callback(), 
>> which
>> can be very well reproduced by generating events (write to pipe or 
>> eventfd)
>> from many threads, while consumer thread does polling.
>> 
>> The following are some microbenchmark results based on the test [1] 
>> which
>> starts threads which generate N events each.  The test ends when all 
>> events
>> are successfully fetched by the poller thread:
>> 
>>  spinlock
>>  ========
>> 
>>  threads  events/ms  run-time ms
>>        8       6402        12495
>>       16       7045        22709
>>       32       7395        43268
>> 
>>  rwlock + xchg
>>  =============
>> 
>>  threads  events/ms  run-time ms
>>        8      10038         7969
>>       16      12178        13138
>>       32      13223        24199
>> 
>> 
>> According to the results bandwidth of delivered events is 
>> significantly
>> increased, thus execution time is reduced.
>> 
>> This series is based on linux-next/akpm and differs from RFC in that
>> additional cleanup patches and explicit comments have been added.
>> 
>> [1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c
> 
> Care to "port" this to 'perf bench epoll', in linux-next? I've been
> trying to unify into perf bench the whole epoll performance testcases
> kernel developers can use when making changes and it would be useful.

Yes, good idea.  But frankly I do not want to bloat epoll-wait.c with
my multi-writers-single-reader test case, because soon epoll-wait.c
will become unmaintainable with all possible loads and set of
different options.

Can we have a single, small and separate source for each epoll load?
Easy to fix, easy to maintain, debug/hack.

> I ran these patches on the 'wait' workload which is a epoll_wait(2)
> stresser. On a 40-core IvyBridge it shows good performance
> improvements for increasing number of file descriptors each of the 40
> threads deals with:
> 
> 64   fds: +20%
> 512  fds: +30%
> 1024 fds: +50%
> 
> (Yes these are pretty raw measurements ops/sec). Unlike your
> benchmark, though, there is only single writer thread, and therefore
> is less ideal to measure optimizations when IO becomes available.
> Hence it would be nice to also have this.

That's weird. One writer thread does not content with anybody, only with
consumers, so should not be any big difference.

--
Roman
Davidlohr Bueso Dec. 17, 2018, 6:01 p.m. UTC | #3
On 2018-12-17 03:49, Roman Penyaev wrote:
> On 2018-12-13 19:13, Davidlohr Bueso wrote:
> Yes, good idea.  But frankly I do not want to bloat epoll-wait.c with
> my multi-writers-single-reader test case, because soon epoll-wait.c
> will become unmaintainable with all possible loads and set of
> different options.
> 
> Can we have a single, small and separate source for each epoll load?
> Easy to fix, easy to maintain, debug/hack.

Yes completely agree; I was actually thinking along those lines.

> 
>> I ran these patches on the 'wait' workload which is a epoll_wait(2)
>> stresser. On a 40-core IvyBridge it shows good performance
>> improvements for increasing number of file descriptors each of the 40
>> threads deals with:
>> 
>> 64   fds: +20%
>> 512  fds: +30%
>> 1024 fds: +50%
>> 
>> (Yes these are pretty raw measurements ops/sec). Unlike your
>> benchmark, though, there is only single writer thread, and therefore
>> is less ideal to measure optimizations when IO becomes available.
>> Hence it would be nice to also have this.
> 
> That's weird. One writer thread does not content with anybody, only 
> with
> consumers, so should not be any big difference.

Yeah so the irq optimization patch, which is known to boost numbers on 
this microbench, plays an important factor. I just put them all together 
when testing.

Thanks,
Davidlohr