[v15,0/7] io_uring: add napi busy polling support

Message ID 20230608163839.2891748-1-shr@devkernel.io (mailing list archive)

Message

Stefan Roesch June 8, 2023, 4:38 p.m. UTC
This series adds napi busy polling support to io_uring. It adds a new
napi_list to the io_ring_ctx structure. The list holds the napi ids
that are currently enabled for busy polling. For faster lookup it also
adds a hash table.

When a new napi id is added, the hash table is used to check whether
the napi id has already been added. When processing the busy poll
loop, the list is used to iterate over the individual entries.
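
The list-plus-hash-table arrangement described above can be sketched in
plain userspace C. This is only an illustration, not the code from the
series: the names napi_tracker, napi_entry and the helper functions are
made up here, and the modulo hash is an assumption. The bucket count of
16 follows the V8 changelog entry.

```c
#include <stdbool.h>
#include <stdlib.h>

#define NAPI_HASH_BUCKETS 16 /* the V8 changelog mentions a table of size 16 */

/* One tracked napi id; the same node is linked into both structures. */
struct napi_entry {
	unsigned int napi_id;
	struct napi_entry *hash_next; /* next entry in the same hash bucket */
	struct napi_entry *list_next; /* next entry in the iteration list */
};

/* Hypothetical container: hash table for lookup, list for iteration. */
struct napi_tracker {
	struct napi_entry *buckets[NAPI_HASH_BUCKETS];
	struct napi_entry *list_head;
	size_t count;
};

/* Hash lookup: returns true if napi_id is already tracked. */
static bool napi_tracker_contains(const struct napi_tracker *t,
				  unsigned int napi_id)
{
	const struct napi_entry *e = t->buckets[napi_id % NAPI_HASH_BUCKETS];

	for (; e; e = e->hash_next)
		if (e->napi_id == napi_id)
			return true;
	return false;
}

/* Add napi_id unless present: 1 if added, 0 if duplicate, -1 if out of memory. */
static int napi_tracker_add(struct napi_tracker *t, unsigned int napi_id)
{
	unsigned int b = napi_id % NAPI_HASH_BUCKETS;
	struct napi_entry *e;

	if (napi_tracker_contains(t, napi_id))
		return 0;
	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->napi_id = napi_id;
	e->hash_next = t->buckets[b]; /* link into the hash bucket */
	t->buckets[b] = e;
	e->list_next = t->list_head;  /* link into the iteration list */
	t->list_head = e;
	t->count++;
	return 1;
}

/* Busy-poll style walk: visit every tracked id through the list. */
static size_t napi_tracker_for_each(const struct napi_tracker *t,
				    void (*fn)(unsigned int napi_id, void *arg),
				    void *arg)
{
	size_t visited = 0;

	for (const struct napi_entry *e = t->list_head; e; e = e->list_next) {
		if (fn)
			fn(e->napi_id, arg);
		visited++;
	}
	return visited;
}
```

Adding the same id twice leaves a single entry, so the busy poll walk
visits each napi id once no matter how many sockets share it.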

io_uring allows two parameters to be specified per ring:
- the busy poll timeout, and
- whether prefer busy poll is passed to the call of io_napi_busy_loop().
The settings are passed in a new structure, io_uring_napi.
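
As a rough illustration of how such a settings structure might be
checked before it is applied to a ring, here is a userspace sketch. The
structure layout, the field names and the ring_napi_state type are
assumptions made for this example, not the uapi definition from the
series. The -EINVAL check mirrors the V5 changelog entry about reserved
and padded fields.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the settings structure; field names are assumed. */
struct napi_settings {
	uint32_t busy_poll_to;     /* busy poll timeout, in microseconds */
	uint8_t  prefer_busy_poll; /* non-zero: prefer busy polling */
	uint8_t  pad[3];           /* padding, must be zero */
	uint64_t resv;             /* reserved, must be zero */
};

/* Hypothetical per-ring state that the settings are copied into. */
struct ring_napi_state {
	uint32_t busy_poll_to;
	int prefer_busy_poll;
};

/* Validate the settings, then apply them; returns 0 or -EINVAL. */
static int napi_settings_apply(struct ring_napi_state *ring,
			       const struct napi_settings *s)
{
	/* Reject any non-zero padded or reserved field. */
	for (size_t i = 0; i < sizeof(s->pad); i++)
		if (s->pad[i])
			return -EINVAL;
	if (s->resv)
		return -EINVAL;

	ring->busy_poll_to = s->busy_poll_to;
	ring->prefer_busy_poll = s->prefer_busy_poll != 0;
	return 0;
}
```

Rejecting non-zero reserved fields keeps them available for later
extensions without breaking callers that already zero the structure.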

There is also a corresponding liburing patch series, which enables this
feature. The name of the series is "liburing: add add api for napi busy
poll timeout". It also contains two programs to test this feature.

Testing has shown that the round-trip times are reduced to 38us from
55us by enabling napi busy polling with a busy poll timeout of 100us.
More detailed results are part of the commit message of the first
patch.

Changes:
- V15:
  - Combined _napi_busy_loop() and __napi_busy_loop() function
  - Rephrased comment
- V14:
  - Rephrased comment for napi_busy_loop_rcu() function
  - Added new function _napi_busy_loop() to remove code
    duplication in napi_busy_loop() and napi_busy_loop_rcu()
- V13:
  - split off __napi_busy_loop() from napi_busy_loop()
  - introduce napi_busy_loop_no_lock()
  - use napi_busy_loop_no_lock in io_napi_blocking_busy_loop
- V12:
  - introduce io_napi_hash_find()
  - use rcu for changes to the hash table
  - use rcu for searching if a napi id is in the napi hash table
  - use rcu hlist functions for adding and removing items from the hash
    table
  - add stale entry detection in __io_napi_do_busy_loop and remove stale
    entries in io_napi_blocking_busy_loop() and io_napi_sqpoll_busy_loop()
  - create io_napi_remove_stale() and __io_napi_remove_stale()
  - __io_napi_do_busy_loop() takes additional loop_end_arg and does stale
    entry detection
  - io_napi_multi_busy_loop is removed. Logic is moved to
    io_napi_blocking_busy_loop()
  - io_napi_free uses rcu function to free
  - io_napi_busy_loop no longer splices
  - io_napi_sqpoll_busy_poll uses rcu
- V11:
  - Fixed long comment lines and whitespace issues
  - Refactor new code io_cqring_wait()
  - Refactor io_napi_adjust_timeout() and remove adjust_timeout
  - Rename io_napi_adjust_timeout to __io_napi_adjust_timeout
  - Add new function io_napi_adjust_timeout
  - Cleanup calls to list_is_singular() in io_napi_multi_busy_loop()
    and io_napi_blocking_busy_loop()
  - Cleanup io_napi_busy_loop_should_end()
  - Rename __io_napi_busy_loop to __io_napi_do_busy_loop() 
- V10:
  - Refreshed to io-uring/for-6.4
  - Repeated performance measurements for 6.4 (same/similar results)
- V9:
  - refreshed to io-uring/for-6.3
  - folded patch 2 and 3 into patch 4
  - fixed commit description for last 2 patches
  - fixed some whitespace issues
  - removed io_napi_busy_loop_on helper
  - removed io_napi_setup_busy helper
  - renamed io_napi_end_busy_loop to io_napi_busy_loop
  - removed NAPI_LIST_HEAD macro
  - split io_napi_blocking_busy_loop into two functions
  - added io_napi function
  - comment for sqpoll check
- V8:
  - added new file napi.c and add napi functions to this file
  - added NAPI_LIST_HEAD function so no ifdef is necessary
  - added io_napi_init and io_napi_free function
  - added io_napi_setup_busy loop helper function
  - added io_napi_adjust_busy_loop helper function
  - added io_napi_end_busy_loop helper function
  - added io_napi_sqpoll_busy_poll helper function
  - some of the definitions in napi.h are macros to avoid ifdef
    definitions in io_uring.c, poll.c and sqpoll.c
  - changed signature of io_napi_add function
  - changed size of hashtable to 16. The number of entries is limited
    by the number of nic queues.
  - Removed ternary in io_napi_blocking_busy_loop
  - Rewrote io_napi_blocking_busy_loop to make it more readable
  - Split off 3 more patches
- V7:
  - allow unregister with NULL value for arg parameter
  - return -EOPNOTSUPP if CONFIG_NET_RX_BUSY_POLL is not enabled
- V6:
  - Add a hash table on top of the list for faster access during the
    add operation. The linked list and the hash table use the same
    data structure
- V5:
  - Refreshed to 6.1-rc6
  - Use copy_from_user instead of memdup/kfree
  - Removed the moving of napi_busy_poll_to
  - Return -EINVAL if any of the reserved or padded fields are not 0.
- V4:
  - Pass structure for napi config, instead of individual parameters
- V3:
  - Refreshed to 6.1-rc5
  - Added a new io-uring api for the prefer napi busy poll api and wire
    it to io_napi_busy_loop().
  - Removed the unregister (implemented as register)
  - Added more performance results to the first commit message.
- V2:
  - Add missing defines if CONFIG_NET_RX_BUSY_POLL is not defined
  - Changes signature of function io_napi_add_list to static inline
    if CONFIG_NET_RX_BUSY_POLL is not defined
  - define some functions as static
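
The V12 entries mention stale entry detection and removal without
spelling out the rule. A minimal userspace sketch follows, under the
assumption that an entry is stale when it has not been refreshed within
a fixed number of ticks. The names tracked_id, tracked_touch and
tracked_remove_stale are hypothetical, and the series itself uses RCU
hlist helpers rather than this plain singly linked list.

```c
#include <stdint.h>
#include <stdlib.h>

/* One tracked napi id with the tick at which it was last seen. */
struct tracked_id {
	unsigned int napi_id;
	uint64_t last_seen;
	struct tracked_id *next;
};

/* Refresh an existing entry or add a new one: 1 if added, 0 if refreshed, -1 if out of memory. */
static int tracked_touch(struct tracked_id **head, unsigned int napi_id,
			 uint64_t now)
{
	struct tracked_id *e;

	for (e = *head; e; e = e->next) {
		if (e->napi_id == napi_id) {
			e->last_seen = now;
			return 0;
		}
	}
	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->napi_id = napi_id;
	e->last_seen = now;
	e->next = *head;
	*head = e;
	return 1;
}

/* Unlink and free entries older than max_age ticks; returns the number removed. */
static size_t tracked_remove_stale(struct tracked_id **head, uint64_t now,
				   uint64_t max_age)
{
	struct tracked_id **link = head;
	size_t removed = 0;

	while (*link) {
		struct tracked_id *e = *link;

		if (now - e->last_seen > max_age) {
			*link = e->next; /* unlink, keep link in place */
			free(e);
			removed++;
		} else {
			link = &e->next;
		}
	}
	return removed;
}
```

The sketch assumes now never runs behind last_seen; a kernel
implementation would use wrap-safe jiffies comparisons instead.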


Stefan Roesch (7):
  net: split off __napi_busy_poll from napi_busy_poll
  net: add napi_busy_loop_rcu()
  io-uring: move io_wait_queue definition to header file
  io-uring: add napi busy poll support
  io-uring: add sqpoll support for napi busy poll
  io_uring: add register/unregister napi function
  io_uring: add prefer busy poll to register and unregister napi api

 include/linux/io_uring_types.h |  11 ++
 include/net/busy_poll.h        |   4 +
 include/uapi/linux/io_uring.h  |  12 ++
 io_uring/Makefile              |   1 +
 io_uring/io_uring.c            |  41 ++--
 io_uring/io_uring.h            |  26 +++
 io_uring/napi.c                | 331 +++++++++++++++++++++++++++++++++
 io_uring/napi.h                | 104 +++++++++++
 io_uring/poll.c                |   2 +
 io_uring/sqpoll.c              |   4 +
 net/core/dev.c                 |  34 +++-
 11 files changed, 544 insertions(+), 26 deletions(-)
 create mode 100644 io_uring/napi.c
 create mode 100644 io_uring/napi.h


base-commit: f026be0e1e881e3395c3d5418ffc8c2a2203c3f3

Comments

Olivier Langlois Jan. 30, 2024, 9:20 p.m. UTC | #1
Hi,

I was wondering what happened to this patch submission.

It seems that Stefan put a lot of effort into addressing every
reported issue over several weeks/months, and then nothing, as if the
patch had never been reviewed by anyone.

Has it been decided privately not to integrate NAPI busy looping in
io_uring after all?

Jens Axboe Jan. 30, 2024, 10:59 p.m. UTC | #2
It's really just waiting for testing, I want to ensure it's working as
we want it to before committing. But the production bits I wanted to
test on have been dragging out, hence I have not made any moves towards
merging this for upstream just yet.

FWIW, I have been maintaining the patchset, you can find the current
series here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-napi
Olivier Langlois Jan. 31, 2024, 5:30 a.m. UTC | #3
Hi Jens,

OK, thanks for the update. Since I am a heavy user of io_uring napi
busy polling, testing the official patchset is definitely something
that I can do to help.

I should be able to report back the results of my testing in a few days!
Olivier Langlois Jan. 31, 2024, 5:22 p.m. UTC | #4
test setup:
-----------
- kernel 6.7.2 with Jens' patchset applied (it applied almost as-is,
  except for the modifications to io_uring/register.c, since that code
  still lives in io_uring/io_uring.c in 6.7.2)
- liburing 2.5 patched with Stefan's patch, after carefully making sure
  that the IORING_REGISTER_NAPI and IORING_UNREGISTER_NAPI values match
  the ones found in the kernel (they were originally 26 and 27 and are
  now 27 and 28)
- 3 threads, each having its own private io_uring ring

thread 1:
- use SQ_POLL kernel thread
- reads data stream from 15-20 TCP connections
- enable NAPI busy polling by calling io_uring_register_napi()

[2024-01-31 08:59:55] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 3(fd 43), napi_id:31
[2024-01-31 08:59:55] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 8(fd 38), napi_id:30
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 10(fd 36), napi_id:25
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 14(fd 32), napi_id:25
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 12(fd 34), napi_id:28
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 2(fd 44), napi_id:31
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 16(fd 30), napi_id:31
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 9(fd 37), napi_id:31
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 20(fd 26), napi_id:31
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 1(fd 45), napi_id:30
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 6(fd 40), napi_id:28
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 13(fd 33), napi_id:25
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 22(fd 22), napi_id:25
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 7(fd 39), napi_id:30
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 18(fd 28), napi_id:28
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 19(fd 27), napi_id:25
[2024-01-31 08:59:56] INFO WSBASE/client_established 1028 LWS_CALLBACK_CLIENT_ESTABLISHED client 23(fd 21), napi_id:31
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 4(fd 42), napi_id:31
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 5(fd 41), napi_id:25
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 21(fd 24), napi_id:31
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 17(fd 29), napi_id:30
[2024-01-31 08:59:56] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 15(fd 31), napi_id:28
[2024-01-31 08:59:57] INFO WSBASE/client_established 1010 LWS_CALLBACK_CLIENT_ESTABLISHED client 11(fd 35), napi_id:30
[2024-01-31 09:00:14] INFO WSBASE/client_established 1031 LWS_CALLBACK_CLIENT_ESTABLISHED client 24(fd 25), napi_id:30

thread 2:
- No SQ_POLL
- reads data stream from 1 TCP socket
- enable NAPI busy polling by calling io_uring_register_napi()

[2024-01-31 09:01:45] INFO WSBASE/client_established 1031 LWS_CALLBACK_CLIENT_ESTABLISHED client 25(fd 23), napi_id:31

thread 3:
- No SQ_POLL
- No NAPI busy polling
- reads data stream from 1 TCP socket

Outcome:
--------

I did not measure latency to confirm that NAPI polling was effective,
but I did check the stability of the patchset by letting the program
run for 5+ hours nonstop without experiencing any glitches.

Tested-by: Olivier Langlois <olivier@trillion01.com>
Jens Axboe Jan. 31, 2024, 5:32 p.m. UTC | #5
Thanks for testing!

Any chance that you could run some tests with and without NAPI that help
validate that it actually works? That part is what I'm most interested
in. I'm not too worried about the stability of it, as I have scrutinized
it pretty closely already.
Olivier Langlois Jan. 31, 2024, 5:59 p.m. UTC | #6
On Wed, 2024-01-31 at 10:32 -0700, Jens Axboe wrote:
> 
> Thanks for testing!
> 
> Any chance that you could run some tests with and without NAPI that
> help validate that it actually works? That part is what I'm most
> interested in, not too worried about the stability of it as I have
> scrutinized it pretty close already.
> 

There is maybe a test that I can perform. The data that I receive is
timestamped. I have a small test program that checks the age of the
updates on their reception...

I would expect that it should be possible to perceive the busy polling
effect by comparing the average update age with and without the feature
enabled...

A word of warning: the service that my client connects to was relocated
recently. The RTT used to be about 8 ms and is about 400-500 ms today.

Because of the huge RTT, I am unsure that the test is going to be
conclusive at all.

However, I am also in the process of relocating my client closer to the
service. If you can wait a week or so, I should be able to do that test
with an RTT < 1 ms.

Besides that, I could redo the same test that Stefan did with the ping
client/server setup, but would that test add any value to the current
collective knowledge?

I'll do the update age test when I restart my client and I'll report
back the result but my expectations aren't very high that it is going
to be conclusive due to the huge RTT.
Olivier Langlois Jan. 31, 2024, 7:56 p.m. UTC | #7
As I expected, the busy polling difference in the update age test is so
small compared to the RTT that the result is inconclusive, IMHO.

The stats are built from 500 collected updates.

System clocks are assumed to be synchronized, and the reported RTT is
the difference between the local time and the update timestamp.
Strictly speaking, the displayed RTT values are one-way trip times (TT)
rather than round-trip times.

latency NO napi busy poll:
[2024-01-31 11:28:34] INFO Main/processCollectedData rtt min/avg/max/mdev = 74.509/76.752/115.969/3.110 ms

latency napi busy poll:
[2024-01-31 11:33:05] INFO Main/processCollectedData rtt min/avg/max/mdev = 75.347/76.740/134.588/1.648 ms

I'll redo the test once my RTT is closer to 1mSec. The relative gain
should be more impressive...
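
For reference, here is a sketch of how min/avg/max/mdev figures of this
kind could be derived from a batch of update ages. It is not the test
program used above, which is not public. The names age_stats,
age_stats_compute and sqrt_newton are hypothetical, and mdev is assumed
to be the population standard deviation, as reported by ping.

```c
#include <stddef.h>

/* Summary of a batch of update ages, all in milliseconds. */
struct age_stats {
	double min, avg, max, mdev;
};

/* Square root by Newton iteration, so the sketch does not need libm. */
static double sqrt_newton(double v)
{
	double x = v;

	if (v <= 0.0)
		return 0.0;
	for (int i = 0; i < 64; i++)
		x = 0.5 * (x + v / x);
	return x;
}

/*
 * Each sample is (local receive time - update timestamp) in ms.
 * Returns 0 on success, -1 if there are no samples.
 */
static int age_stats_compute(const double *age_ms, size_t n,
			     struct age_stats *out)
{
	double sum = 0.0, sq = 0.0;

	if (n == 0)
		return -1;
	out->min = out->max = age_ms[0];
	for (size_t i = 0; i < n; i++) {
		if (age_ms[i] < out->min)
			out->min = age_ms[i];
		if (age_ms[i] > out->max)
			out->max = age_ms[i];
		sum += age_ms[i];
	}
	out->avg = sum / (double)n;
	/* Second pass: squared deviations from the mean. */
	for (size_t i = 0; i < n; i++) {
		double d = age_ms[i] - out->avg;

		sq += d * d;
	}
	out->mdev = sqrt_newton(sq / (double)n);
	return 0;
}
```

Running this once over the samples collected with busy polling and once
over those collected without it gives two comparable summaries.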
Jens Axboe Jan. 31, 2024, 8:52 p.m. UTC | #8
Also happy to try and run it here, if you can share it? If not I have
some other stuff I can try as well, with netbench.
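
The update-age test quoted above reports min/avg/max/mdev in the style of
ping's summary line. The test program itself was not shared, so the following
is only a minimal sketch of how such a summary could be computed. It assumes
each age is (local receive time - sender timestamp) in microseconds, and that
mdev means sqrt(mean of squares - square of mean). The names age_stats,
isqrt64 and summarize_ages are hypothetical and do not come from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Summary of a set of update ages, all in microseconds (hypothetical type). */
struct age_stats {
	int64_t min_us;
	int64_t avg_us;
	int64_t max_us;
	int64_t mdev_us;
};

/* Integer square root (rounded down) via Newton's iteration; avoids libm. */
static int64_t isqrt64(int64_t v)
{
	int64_t x, y;

	if (v < 2)
		return v < 0 ? 0 : v;
	x = v;
	y = (x + 1) / 2;
	while (y < x) {
		x = y;
		y = (x + v / x) / 2;
	}
	return x;
}

/*
 * age_us[i] is assumed to be (local receive time - sender timestamp) for
 * update i. This is only meaningful if both clocks are synchronized.
 * Integer division truncates, so avg and mdev are rounded down.
 */
static struct age_stats summarize_ages(const int64_t *age_us, size_t n)
{
	struct age_stats s;
	int64_t sum = 0, sum_sq = 0;
	size_t i;

	assert(n > 0);
	s.min_us = s.max_us = age_us[0];
	for (i = 0; i < n; i++) {
		if (age_us[i] < s.min_us)
			s.min_us = age_us[i];
		if (age_us[i] > s.max_us)
			s.max_us = age_us[i];
		sum += age_us[i];
		sum_sq += age_us[i] * age_us[i];
	}
	s.avg_us = sum / (int64_t)n;
	/* mdev = sqrt(E[x^2] - E[x]^2), assumed definition */
	s.mdev_us = isqrt64(sum_sq / (int64_t)n - s.avg_us * s.avg_us);
	return s;
}
```

A caller would fill an array with the 500 collected ages and pass it to
summarize_ages() once with busy polling disabled and once with it enabled,
then compare the two results.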
Olivier Langlois Jan. 31, 2024, 9:03 p.m. UTC | #9
On Wed, 2024-01-31 at 13:52 -0700, Jens Axboe wrote:
> Also happy to try and run it here, if you can share it? If not I have
> some other stuff I can try as well, with netbench.
> 

No, it is not an option... It is a small test driver app running on top
of an unpublished, closed-source application framework of 500,000+ lines
of code...

But do not worry... I am pretty sure I will have access to a much better
testing setup before the end of the week...

I'll report back more significant results with my test very soon...

Greetings,
Olivier Langlois Feb. 2, 2024, 8:20 p.m. UTC | #10
On Wed, 2024-01-31 at 13:52 -0700, Jens Axboe wrote:
> Also happy to try and run it here, if you can share it? If not I have
> some other stuff I can try as well, with netbench.
> 
I have redone my test with a fixed liburing lib that actually enables
io_uring NAPI busy polling correctly and I have slightly more
convincing results:

latency NO napi busy poll (kernel v7.2.3):
[2024-02-02 11:42:41] INFO Main/processCollectedData rtt
min/avg/max/mdev = 73.089/75.142/107.169/2.954 ms

latency napi busy poll (kernel v7.2.3):
[2024-02-02 11:48:18] INFO Main/processCollectedData rtt
min/avg/max/mdev = 72.862/73.878/124.536/1.288 ms

FYI, I said that I could redo the test once I relocate my client to
have an RTT < 1ms...

I might not be able to do that. I might settle for an AWS VPS instead
of a bare metal setup, and when you are running the kernel on a VPS,
AFAIK, the virtual Ethernet driver does not have NAPI...
Jens Axboe Feb. 2, 2024, 10:58 p.m. UTC | #11
On 2/2/24 1:20 PM, Olivier Langlois wrote:
> I have redone my test with a fixed liburing lib that actually enables
> io_uring NAPI busy polling correctly and I have slightly more
> convincing results:
> 
> latency NO napi busy poll (kernel v7.2.3):
> [2024-02-02 11:42:41] INFO Main/processCollectedData rtt
> min/avg/max/mdev = 73.089/75.142/107.169/2.954 ms
> 
> latency napi busy poll (kernel v7.2.3):
> [2024-02-02 11:48:18] INFO Main/processCollectedData rtt
> min/avg/max/mdev = 72.862/73.878/124.536/1.288 ms
> 
> FYI, I said that I could redo the test once I relocate my client to
> have an RTT < 1ms...
> 
> I might not be able to do that. I might settle for an AWS VPS instead
> of a bare metal setup, and when you are running the kernel on a VPS,
> AFAIK, the virtual Ethernet driver does not have NAPI...

I'm going to try some local 10g testing here, I don't think the above
says a whole lot as we're dealing with tens of msecs here. But if the
difference is significant and stable, it does look like an improvement.
If NAPI works how it should, then sub-msec ping/pong replies should show
a significant improvement. I'll report back once I get to it...
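
To put the figures in this thread side by side, here is a small sketch that
expresses a latency reduction as a percentage of the baseline. The helper name
relative_gain_pct is hypothetical and only illustrates the arithmetic.

```c
#include <assert.h>

/* Relative reduction of `after` versus `before`, in percent. */
static double relative_gain_pct(double before, double after)
{
	assert(before > 0.0);
	return 100.0 * (before - after) / before;
}
```

With the averages reported above (75.142 ms without busy polling, 73.878 ms
with it), relative_gain_pct(75.142, 73.878) comes out at roughly 1.7%. With
the cover letter's round-trip times (55us without, 38us with), the same
helper gives roughly 31%. The same kind of saving is much easier to see when
the baseline latency is small, which is why a sub-msec setup should be more
conclusive.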