Message ID | 1652241268-46732-1-git-send-email-jdamato@fastly.com (mailing list archive) |
---|---|
Series | Nontemporal copies in unix socket write path |
On Tue, 10 May 2022 20:54:21 -0700 Joe Damato wrote:
> Initial benchmarks are extremely encouraging. I wrote a simple C program
> to benchmark this patchset, the program:
> - Creates a unix socket pair
> - Forks a child process
> - The parent process writes to the unix socket using MSG_NTCOPY - or not -
>   depending on the command line flags
> - The child process uses splice to move the data from the unix socket to
>   a pipe buffer, followed by a second splice call to move the data from
>   the pipe buffer to a file descriptor opened on /dev/null.
> - taskset is used when launching the benchmark to ensure the parent and
>   child run on appropriate CPUs for various scenarios

Is there a practical use case? The patches look like a lot of extra
indirect calls.
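[Editor's note: for readers who want to try something similar, below is a minimal sketch of the benchmark described in the quote above. It is not Joe's actual program; MSG_NTCOPY exists only with this series applied, and the numeric value used here is a placeholder for illustration.]

```c
/* Hedged sketch of the described benchmark; not the author's program. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MSG_NTCOPY
#define MSG_NTCOPY 0x10000000	/* placeholder value, not from the series */
#endif

#define CHUNK (1 << 20)		/* 1 MiB per write */
#define TOTAL (1UL << 30)	/* move 1 GiB end to end */

int main(int argc, char **argv)
{
	int use_nt = argc > 1 && !strcmp(argv[1], "--ntcopy");
	int sv[2], pfd[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) || pipe(pfd))
		return 1;

	if (fork() == 0) {
		/* Child: unix socket -> pipe -> /dev/null, data never read. */
		int nullfd = open("/dev/null", O_WRONLY);
		ssize_t n;

		close(sv[0]);
		while ((n = splice(sv[1], NULL, pfd[1], NULL, CHUNK,
				   SPLICE_F_MOVE)) > 0)
			splice(pfd[0], NULL, nullfd, NULL, n, SPLICE_F_MOVE);
		return 0;
	}

	/* Parent: write into the socket, optionally with nontemporal copies. */
	char *buf = malloc(CHUNK);

	if (!buf)
		return 1;
	memset(buf, 'x', CHUNK);
	close(sv[1]);
	for (unsigned long sent = 0; sent < TOTAL; sent += CHUNK)
		if (send(sv[0], buf, CHUNK, use_nt ? MSG_NTCOPY : 0) < 0)
			break;	/* partial sends ignored for brevity */

	close(sv[0]);
	wait(NULL);
	return 0;
}
```

[As in the cover letter, the whole benchmark can be launched under taskset to control which CPUs the parent and child land on for the different placement scenarios.]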
On Wed, May 11, 2022 at 04:25:20PM -0700, Jakub Kicinski wrote:
> On Tue, 10 May 2022 20:54:21 -0700 Joe Damato wrote:
> > Initial benchmarks are extremely encouraging. I wrote a simple C program
> > to benchmark this patchset, the program:
> > - Creates a unix socket pair
> > - Forks a child process
> > - The parent process writes to the unix socket using MSG_NTCOPY - or not -
> >   depending on the command line flags
> > - The child process uses splice to move the data from the unix socket to
> >   a pipe buffer, followed by a second splice call to move the data from
> >   the pipe buffer to a file descriptor opened on /dev/null.
> > - taskset is used when launching the benchmark to ensure the parent and
> >   child run on appropriate CPUs for various scenarios
>
> Is there a practical use case?

Yes; for us there seems to be - especially with AMD Zen2. I'll try to
describe such a setup and my synthetic HTTP benchmark results.

Imagine a program, call it storageD, which is responsible for storing and
retrieving data from a data store. Other programs can request data from
storageD by communicating with it over a Unix socket.

One such program that could request data via the Unix socket is an HTTP
daemon. For some client connections that the HTTP daemon receives, the
daemon may determine that responses can be sent in plain text.

In this case, the HTTP daemon can use splice to move data from the unix
socket connection with storageD directly to the client TCP socket via a
pipe. splice saves CPU cycles and avoids incurring any memory access
latency since the data itself is not accessed.

Because we'll use splice (instead of accessing the data and potentially
affecting the CPU cache) it is advantageous for storageD to use NT copies
when it writes to the Unix socket to avoid evicting hot data from the CPU
cache. After all, once the data is copied into the kernel on the unix
socket write path, it won't be touched again; only spliced.

In my synthetic HTTP benchmarks for this setup, we've been able to
increase network throughput of the HTTP daemon by roughly 30% while
reducing the system time of storageD. We're still collecting data on
production workloads.

The motivation, IMHO, is very similar to the motivation for
NETIF_F_NOCACHE_COPY, as far as I understand.

In some cases, when an application writes to a network socket the data
written to the socket won't be accessed again once it is copied into the
kernel. In these cases, NETIF_F_NOCACHE_COPY can improve performance and
helps to preserve the CPU cache and avoid evicting hot data.

We get a sizable benefit from this option, too, in situations where we
can't use splice and have to call write to transmit data to client
connections. We want to get the same benefit as NETIF_F_NOCACHE_COPY, but
when writing to Unix sockets as well.

Let me know if that makes it more clear.

> The patches look like a lot of extra indirect calls.

Yup. As I mentioned in the cover letter this was mostly a PoC that seems
to work and increases network throughput in a real world scenario.

If this general line of thinking (NT copies on write to a Unix socket) is
acceptable, I'm happy to refactor the code however you (and others) would
like to get it to an acceptable state.

Thanks for taking a look,
Joe
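[Editor's note: for concreteness, here is a rough sketch of the zero-touch path the HTTP daemon takes in the setup described above: relaying a plaintext response from the storageD unix socket to the client TCP socket through a pipe, so the daemon itself never reads the payload. The function name and the trimmed error handling are illustrative, not taken from any real daemon.]

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Move up to @len bytes from @unix_fd to @tcp_fd without touching them. */
static ssize_t relay_plaintext(int unix_fd, int tcp_fd, size_t len)
{
	int pfd[2];
	ssize_t moved = 0, n;

	if (pipe(pfd))
		return -1;

	while ((size_t)moved < len &&
	       (n = splice(unix_fd, NULL, pfd[1], NULL, len - moved,
			   SPLICE_F_MOVE | SPLICE_F_MORE)) > 0) {
		/* Drain what was just queued in the pipe out to the client. */
		ssize_t out = 0;

		while (out < n) {
			ssize_t m = splice(pfd[0], NULL, tcp_fd, NULL, n - out,
					   SPLICE_F_MOVE | SPLICE_F_MORE);
			if (m <= 0)
				goto done;
			out += m;
		}
		moved += n;
	}
done:
	close(pfd[0]);
	close(pfd[1]);
	return moved;
}
```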
On Wed, 11 May 2022 18:01:54 -0700 Joe Damato wrote:
> > Is there a practical use case?
>
> Yes; for us there seems to be - especially with AMD Zen2. I'll try to
> describe such a setup and my synthetic HTTP benchmark results.
>
> Imagine a program, call it storageD, which is responsible for storing and
> retrieving data from a data store. Other programs can request data from
> storageD by communicating with it over a Unix socket.
>
> One such program that could request data via the Unix socket is an HTTP
> daemon. For some client connections that the HTTP daemon receives, the
> daemon may determine that responses can be sent in plain text.
>
> In this case, the HTTP daemon can use splice to move data from the unix
> socket connection with storageD directly to the client TCP socket via a
> pipe. splice saves CPU cycles and avoids incurring any memory access
> latency since the data itself is not accessed.
>
> Because we'll use splice (instead of accessing the data and potentially
> affecting the CPU cache) it is advantageous for storageD to use NT copies
> when it writes to the Unix socket to avoid evicting hot data from the CPU
> cache. After all, once the data is copied into the kernel on the unix
> socket write path, it won't be touched again; only spliced.
>
> In my synthetic HTTP benchmarks for this setup, we've been able to
> increase network throughput of the HTTP daemon by roughly 30% while
> reducing the system time of storageD. We're still collecting data on
> production workloads.
>
> The motivation, IMHO, is very similar to the motivation for
> NETIF_F_NOCACHE_COPY, as far as I understand.
>
> In some cases, when an application writes to a network socket the data
> written to the socket won't be accessed again once it is copied into the
> kernel. In these cases, NETIF_F_NOCACHE_COPY can improve performance and
> helps to preserve the CPU cache and avoid evicting hot data.
>
> We get a sizable benefit from this option, too, in situations where we
> can't use splice and have to call write to transmit data to client
> connections. We want to get the same benefit as NETIF_F_NOCACHE_COPY, but
> when writing to Unix sockets as well.
>
> Let me know if that makes it more clear.

Makes sense, thanks for the explainer.

> > The patches look like a lot of extra indirect calls.
>
> Yup. As I mentioned in the cover letter this was mostly a PoC that seems
> to work and increases network throughput in a real world scenario.
>
> If this general line of thinking (NT copies on write to a Unix socket) is
> acceptable, I'm happy to refactor the code however you (and others) would
> like to get it to an acceptable state.

My only concern is that in a post-spectre world the indirect calls are
going to be more expensive than a branch would be. But I'm not really a
micro-optimization expert :)
On Thu, May 12, 2022 at 12:46:08PM -0700, Jakub Kicinski wrote:
> On Wed, 11 May 2022 18:01:54 -0700 Joe Damato wrote:
> > > Is there a practical use case?
> >
> > Yes; for us there seems to be - especially with AMD Zen2. I'll try to
> > describe such a setup and my synthetic HTTP benchmark results.
> >
> > Imagine a program, call it storageD, which is responsible for storing
> > and retrieving data from a data store. Other programs can request data
> > from storageD by communicating with it over a Unix socket.
> >
> > One such program that could request data via the Unix socket is an HTTP
> > daemon. For some client connections that the HTTP daemon receives, the
> > daemon may determine that responses can be sent in plain text.
> >
> > In this case, the HTTP daemon can use splice to move data from the unix
> > socket connection with storageD directly to the client TCP socket via a
> > pipe. splice saves CPU cycles and avoids incurring any memory access
> > latency since the data itself is not accessed.
> >
> > Because we'll use splice (instead of accessing the data and potentially
> > affecting the CPU cache) it is advantageous for storageD to use NT
> > copies when it writes to the Unix socket to avoid evicting hot data
> > from the CPU cache. After all, once the data is copied into the kernel
> > on the unix socket write path, it won't be touched again; only spliced.
> >
> > In my synthetic HTTP benchmarks for this setup, we've been able to
> > increase network throughput of the HTTP daemon by roughly 30% while
> > reducing the system time of storageD. We're still collecting data on
> > production workloads.
> >
> > The motivation, IMHO, is very similar to the motivation for
> > NETIF_F_NOCACHE_COPY, as far as I understand.
> >
> > In some cases, when an application writes to a network socket the data
> > written to the socket won't be accessed again once it is copied into
> > the kernel. In these cases, NETIF_F_NOCACHE_COPY can improve
> > performance and helps to preserve the CPU cache and avoid evicting hot
> > data.
> >
> > We get a sizable benefit from this option, too, in situations where we
> > can't use splice and have to call write to transmit data to client
> > connections. We want to get the same benefit as NETIF_F_NOCACHE_COPY,
> > but when writing to Unix sockets as well.
> >
> > Let me know if that makes it more clear.
>
> Makes sense, thanks for the explainer.
>
> > > The patches look like a lot of extra indirect calls.
> >
> > Yup. As I mentioned in the cover letter this was mostly a PoC that
> > seems to work and increases network throughput in a real world
> > scenario.
> >
> > If this general line of thinking (NT copies on write to a Unix socket)
> > is acceptable, I'm happy to refactor the code however you (and others)
> > would like to get it to an acceptable state.
>
> My only concern is that in a post-spectre world the indirect calls are
> going to be more expensive than a branch would be. But I'm not really a
> micro-optimization expert :)

Makes sense; neither am I, FWIW :)

For whatever reason, on AMD Zen2 it seems that using non-temporal
instructions when copying data sizes above the L2 size is a huge
performance win (compared to the kernel's normal temporal copy code), even
if that size fits in L3.

This is why both NETIF_F_NOCACHE_COPY and MSG_NTCOPY from this series seem
to have such a large, measurable impact in the contrived benchmark I
included in the cover letter and also in synthetic HTTP workloads.

I'll plan on including numbers from the benchmark program on a few other
CPUs I have access to in the cover letter for any follow-up RFCs or
revisions.

As a data point, there has been similar-ish work done in glibc [1] to
determine when non-temporal copies should be used on Zen2 based on the
size of the copy. I'm certainly not a micro-arch expert by any stretch,
but the glibc work plus the benchmark results I've measured seem to
suggest that NT copies can be very helpful on Zen2.

Two questions for you:

1. Do you have any strong opinions on the sendmsg flag vs a socket option?

2. If I can think of a way to avoid the indirect calls, do you think this
   series is ready for a v1? I'm not sure if there's anything major that
   needs to be addressed aside from the indirect calls. I'll include some
   documentation and cosmetic cleanup in the v1, as well.

Thanks,
Joe

[1]: https://sourceware.org/pipermail/libc-alpha/2020-October/118895.html
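[Editor's note: as a userspace illustration of the idea behind the glibc change referenced in [1], the sketch below switches from ordinary temporal stores to non-temporal streaming stores once a copy exceeds a size threshold. The threshold value and the helper name are placeholders, not the tuning glibc actually uses.]

```c
#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

#define NT_THRESHOLD (512 * 1024)	/* placeholder: "bigger than L2" */

static void copy_maybe_nt(void *dst, const void *src, size_t len)
{
	/* Small or misaligned destination: stick with the temporal copy. */
	if (len < NT_THRESHOLD || ((uintptr_t)dst & 15)) {
		memcpy(dst, src, len);
		return;
	}

	const __m128i *s = src;
	__m128i *d = dst;
	size_t chunks = len / 16;

	/* Streaming stores bypass the cache hierarchy for the destination. */
	for (size_t i = 0; i < chunks; i++)
		_mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
	_mm_sfence();			/* order the non-temporal stores */

	/* Copy any sub-16-byte tail with a regular memcpy. */
	memcpy((char *)dst + chunks * 16, (const char *)src + chunks * 16,
	       len % 16);
}
```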
On Thu, 12 May 2022 15:53:05 -0700 Joe Damato wrote:
> 1. Do you have any strong opinions on the sendmsg flag vs a socket option?

It sounded like you want to mix nt and non-nt writes on a single socket,
hence the flag was a requirement. A socket option is better because we can
have many more of those than there are bits for flags, obviously.

> 2. If I can think of a way to avoid the indirect calls, do you think this
>    series is ready for a v1? I'm not sure if there's anything major that
>    needs to be addressed aside from the indirect calls.

Nothing comes to mind, seems pretty straightforward to me.
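[Editor's note: to make the two opt-in styles concrete, here is a hypothetical sketch of what each would look like from userspace. MSG_NTCOPY is the per-call flag this series proposes; SO_NTCOPY is an invented name for the per-socket option being discussed, and both numeric values are placeholders.]

```c
#include <sys/socket.h>
#include <sys/types.h>

#define SO_NTCOPY  77		/* hypothetical socket option */
#define MSG_NTCOPY 0x10000000	/* hypothetical/placeholder flag value */

/* Per-socket: every subsequent write on @fd would use nontemporal copies. */
static int enable_ntcopy_sockopt(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_SOCKET, SO_NTCOPY, &one, sizeof(one));
}

/* Per-call: only this send bypasses the cache; other writes stay temporal. */
static ssize_t send_ntcopy(int fd, const void *buf, size_t len)
{
	return send(fd, buf, len, MSG_NTCOPY);
}
```

[The trade-off above mirrors the reply: only the per-call flag lets one socket mix NT and non-NT writes, but a socket option does not consume one of the few remaining MSG_* flag bits.]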
From the iov_iter point of view: please follow the way the inatomic
nocache helpers are implemented instead of adding costly function
pointers.
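[Editor's note: a minimal sketch of the shape being suggested, assuming the existing copy_from_iter()/copy_from_iter_nocache() pair declared in <linux/uio.h>: select the nocache variant with a plain branch at the call site rather than storing a function pointer in the iterator. The caller name and the ntcopy flag are illustrative only, not part of the series.]

```c
#include <linux/types.h>
#include <linux/uio.h>

/*
 * Hypothetical call-site helper: pick the nocache copy with a direct,
 * branch-predicted call instead of an indirect call through the iter.
 */
static inline size_t unix_copy_from_iter(void *to, size_t bytes,
					 struct iov_iter *i, bool ntcopy)
{
	if (ntcopy)
		return copy_from_iter_nocache(to, bytes, i);
	return copy_from_iter(to, bytes, i);
}
```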