mbox series

[RFC,net-next,x86,v2,0/8] Nontemporal copies in sendmsg path

Message ID 1655024280-23827-1-git-send-email-jdamato@fastly.com (mailing list archive)
Headers show
Series Nontemporal copies in sendmsg path | expand

Message

Joe Damato June 12, 2022, 8:57 a.m. UTC
Greetings:

Welcome to RFC v2.

This is my first series that touches more than 1 subsystem; hope I got the
various subject lines and to/cc-lists correct.

Based on the feedback on RFC v1 [1], I've made a few changes:

- Removed the indirect calls.
- Simplified the code a bit by pushing logic down to a wrapper around
  copyin.
- Added support for the 'MSG_NTCOPY' flag to udp, udp-lite, tcp, and unix.

I think this series is much closer to a v1 that can be submit for
consideration, but wanted to test the waters with an RFC first :)

This new set of code allows applications to request non-temporal copies on
individual calls to sendmsg for several socket types, not just unix.

The result is that:

1. Users don't need to specify no cache copy for the entire interface as
   they had been doing previously with ethtool. There is more fine grained
   control of which sendmsgs are non-temporal. I think it makes sense for
   this to be application specific (vs interface-wide) since applications
   will have a better idea of which copy is appropriate.

2. Previously, the ethool bit for enabling no-cache-copy only seems to have
   affected TCP sockets, IIUC. This series supports UDP, UDP-Lite, TCP, and
   Unix. This means the behavior and accessibility of non-temporal copies
   is normalized bit more than it had been previously.

The performance results on my AMD Zen2 test system are identical to the
previous RFC, so I've included those results below.

As you'll see below, NT copies in the unix write path have a large
measureable impact on certain application architectures and CPUs.

Initial benchmarks are extremely encouraging. I wrote a simple C program to
benchmark this patchset, the program:
  - Creates a unix socket pair
  - Forks a child process
  - The parent process writes to the unix socket using MSG_NTCOPY, or not,
    depending on the command line flags
  - The child process uses splice to move the data from the unix socket to
    a pipe buffer, followed by a second splice call to move the data from
    the pipe buffer to a file descriptor opened on /dev/null.
  - taskset is used when launching the benchmark to ensure the parent and
    child run on appropriate CPUs for various scenarios

The source of the test program is available for examination [2] and results
for three benchmarks I ran are provided below.

Test system: AMD EPYC 7662 64-Core Processor,
	     64 cores / 128 threads,
	     512kb L2 per core shared by sibling CPUs,
	     16mb L3 per NUMA zone,
	     AMD specific settings: NPS=1 and L3 as NUMA enabled 

Test: 1048576 byte object,
      100,000 iterations,
      512kb pipe buffer size,
      512kb unix socket send buffer size

Sample command lines for running the tests provided below. Note that the
command line shows how to run a "normal" copy benchmark. To run the
benchmark in MSG_NTCOPY mode, change command line argument 3 from 0 to 1.

Test pinned to CPUs 1 and 2 which do *not* share an L2 cache, but do share
an L3.

Command line for "normal" copy:
% time taskset -ac 1,2 ./unix-nt-bench 1048576 100000 0 524288 524288

Mode			real time (sec.)		throughput (Mb/s)
"Normal" copy		10.630				78,928
MSG_NTCOPY		7.429				112,935 

Same test as above, but pinned to CPUs 1 and 65 which share an L2 (512kb)
and L3 cache (16mb).

Command line for "normal" copy:
% time taskset -ac 1,65 ./unix-nt-bench 1048576 100000 0 524288 524288

Mode			real time (sec.)		throughput (Mb/s)
"Normal" copy		12.532				66,941
MSG_NTCOPY		9.445				88,826	

Same test as above, pinned to CPUs 1 and 65, but with 128kb unix send
buffer and pipe buffer sizes (to avoid spilling L2).

Command line for "normal" copy:
% time taskset -ac 1,65 ./unix-nt-bench 1048576 100000 0 131072 131072

Mode			real time (sec.)		throughput (Mb/s)
"Normal" copy		12.451				67,377
MSG_NTCOPY		9.451				88,768

Thanks,
Joe

[1]: https://patchwork.kernel.org/project/netdevbpf/cover/1652241268-46732-1-git-send-email-jdamato@fastly.com/
[2]: https://gist.githubusercontent.com/jdamato-fsly/03a2f0cd4e71ebe0fef97f7f2980d9e5/raw/19cfd3aca59109ebf5b03871d952ea1360f3e982/unix-nt-copy-bench.c

Joe Damato (8):
  arch, x86, uaccess: Add nontemporal copy functions
  iov_iter: Introduce iter_copy_type
  iov_iter: add copyin_iovec helper
  net: Add MSG_NTCOPY sendmsg flag
  net: unix: Support MSG_NTCOPY
  net: ip: Support MSG_NTCOPY
  net: udplite: Support MSG_NTCOPY
  net: tcp: Support MSG_NTCOPY

 arch/x86/include/asm/uaccess_64.h |  6 ++++++
 include/linux/socket.h            |  9 +++++++++
 include/linux/uaccess.h           |  6 ++++++
 include/linux/uio.h               | 17 +++++++++++++++++
 include/net/sock.h                |  2 +-
 include/net/udplite.h             |  1 +
 lib/iov_iter.c                    | 25 ++++++++++++++++++++-----
 net/ipv4/ip_output.c              |  1 +
 net/ipv4/tcp.c                    |  2 ++
 net/unix/af_unix.c                |  4 ++++
 10 files changed, 67 insertions(+), 6 deletions(-)