Message ID | cover.1657194434.git.asml.silence@gmail.com
---|---
Series | io_uring zerocopy send
On 7/7/22 5:49 AM, Pavel Begunkov wrote:
> NOTE: Not to be picked directly. After getting the necessary acks, I'll work
> out merging with Jakub and Jens.
>
> The patchset implements io_uring zerocopy send. It works with both registered
> and normal buffers; mixing is allowed but not recommended. Apart from the
> usual request completions, just as with MSG_ZEROCOPY, io_uring separately
> notifies the userspace when buffers are freed and can be reused (see API
> design below), which is delivered into io_uring's Completion Queue. Those
> "buffer-free" notifications are not necessarily per request, but the
> userspace has control over it and should explicitly attach a number of
> requests to a single notification. The series also adds some internal
> optimisations when used with registered buffers, like removing page
> referencing.
>
> From the kernel networking perspective there are two main changes. The first
> one is passing ubuf_info into the network layer from io_uring (inside of an
> in-kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
> caching on the io_uring side, but also helps to avoid cross-referencing
> and synchronisation problems. The second part is an optional optimisation
> removing page referencing for requests with registered buffers.
>
> Benchmarking was done with an optimised version of the selftest (see [1]),
> which sends a bunch of requests, waits for completions and repeats. The
> "+ flush" column posts one additional "buffer-free" notification per
> request, and plain "zc" doesn't post buffer notifications at all.
>
> NIC (requests / second):
> IO size | non-zc  | zc             | zc + flush
> 4000    | 495134  | 606420 (+22%)  | 558971 (+12%)
> 1500    | 551808  | 577116 (+4.5%) | 565803 (+2.5%)
> 1000    | 584677  | 592088 (+1.2%) | 560885 (-4%)
> 600     | 596292  | 598550 (+0.4%) | 555366 (-6.7%)
>
> dummy (requests / second):
> IO size | non-zc  | zc             | zc + flush
> 8000    | 1299916 | 2396600 (+84%) | 2224219 (+71%)
> 4000    | 1869230 | 2344146 (+25%) | 2170069 (+16%)
> 1200    | 2071617 | 2361960 (+14%) | 2203052 (+6%)
> 600     | 2106794 | 2381527 (+13%) | 2195295 (+4%)
>
> Previously it also brought a massive performance speedup compared to the
> msg_zerocopy tool (see [3]), which is probably not super interesting.

Can you add a comment that the above results are for UDP?

You dropped the comments about TCP testing; any progress there? If not, can
you relay any issues you are hitting?
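For contrast with the MSG_ZEROCOPY comparison drawn in the cover letter
above, here is a minimal sketch of the existing MSG_ZEROCOPY completion
flow, following Documentation/networking/msg_zerocopy.rst; error handling
is trimmed and the helper name is made up for illustration:

#include <linux/errqueue.h>
#include <sys/socket.h>

static void zc_send_and_wait(int fd, const void *buf, size_t len)
{
	int one = 1;
	char control[128];
	struct msghdr msg = {
		.msg_control = control,
		.msg_controllen = sizeof(control),
	};
	struct cmsghdr *cm;

	setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
	send(fd, buf, len, MSG_ZEROCOPY);

	/* Completions do not come back on the send path; they arrive on
	 * the socket error queue as ranges of send calls whose buffers
	 * may now be reused. */
	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return;
	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		struct sock_extended_err *serr = (void *)CMSG_DATA(cm);

		if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
			/* Buffers for sends ee_info..ee_data are reusable. */
		}
	}
}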
On 7/8/22 05:10, David Ahern wrote:
> On 7/7/22 5:49 AM, Pavel Begunkov wrote:
>> NOTE: Not to be picked directly. After getting the necessary acks, I'll
>> work out merging with Jakub and Jens.
>>
>> The patchset implements io_uring zerocopy send. It works with both
>> registered and normal buffers; mixing is allowed but not recommended.
>> [...]
>
> Can you add a comment that the above results are for UDP?

Oh, right, forgot to add it.

> You dropped the comments about TCP testing; any progress there? If not,
> can you relay any issues you are hitting?

Not really a problem, but for me it's bottlenecked at NIC bandwidth
(~3GB/s) for both zc and non-zc and doesn't even nearly saturate a CPU.
It was actually benchmarked by a colleague quite a while ago, but I can't
find the numbers. I probably need to at least add localhost numbers or
grab a better server.
On 7/8/22 15:26, Pavel Begunkov wrote:
> On 7/8/22 05:10, David Ahern wrote:
>> On 7/7/22 5:49 AM, Pavel Begunkov wrote:
>>> [...]
>>
>> Can you add a comment that the above results are for UDP?
>
> Oh, right, forgot to add it.
>
>> You dropped the comments about TCP testing; any progress there? If not,
>> can you relay any issues you are hitting?
>
> Not really a problem, but for me it's bottlenecked at NIC bandwidth
> (~3GB/s) for both zc and non-zc and doesn't even nearly saturate a CPU.
> It was actually benchmarked by a colleague quite a while ago, but I can't
> find the numbers. I probably need to at least add localhost numbers or
> grab a better server.

Testing localhost TCP with a hack (see below); it doesn't include the
refcounting optimisations I was testing UDP with, which will be sent
afterwards. Numbers are in MB/s:

IO size | non-zc | zc
1200    | 4174   | 4148
4096    | 7597   | 11228

Because it's localhost, we also spend cycles here on the recv side.

With a real NIC and 1200-byte payloads, zc is worse than non-zc by ~5-10%;
maybe the omitted optimisations will help somewhat. I don't consider it a
blocker, but it would be interesting to poke into later. One thing helping
non-zc is that it squeezes a number of requests into a single page, whereas
zerocopy adds a new frag for every request.

Can't say anything new for larger payloads: I'm still NIC-bound, but
looking at CPU utilisation, zc doesn't drain as many cycles as non-zc.
Also, I don't remember if I mentioned it before, but another catch is that
with TCP it expects users not to flush notifications too often, because
flushing forces it to allocate a new skb and lose a good chunk of the
benefit of using TCP.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1111adefd906..c4b781b2c3b1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3218,9 +3218,7 @@ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
 /* Frags must be orphaned, even if refcounted, if skb might loop to rx path */
 static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
 {
-	if (likely(!skb_zcopy(skb)))
-		return 0;
-	return skb_copy_ubufs(skb, gfp_mask);
+	return skb_orphan_frags(skb, gfp_mask);
 }
On 7/11/22 5:56 AM, Pavel Begunkov wrote:
> On 7/8/22 15:26, Pavel Begunkov wrote:
>> On 7/8/22 05:10, David Ahern wrote:
>>> You dropped the comments about TCP testing; any progress there? If not,
>>> can you relay any issues you are hitting?
>>
>> Not really a problem, but for me it's bottlenecked at NIC bandwidth
>> (~3GB/s) for both zc and non-zc and doesn't even nearly saturate a CPU.
>> [...]
>
> Testing localhost TCP with a hack (see below); it doesn't include the
> refcounting optimisations I was testing UDP with, which will be sent
> afterwards. Numbers are in MB/s:
>
> IO size | non-zc | zc
> 1200    | 4174   | 4148
> 4096    | 7597   | 11228

I am surprised by the low numbers; you should be able to saturate a 100G
link with TCP and the ZC TX API.

> Because it's localhost, we also spend cycles here on the recv side.
> With a real NIC and 1200-byte payloads, zc is worse than non-zc by
> ~5-10%; maybe the omitted optimisations will help somewhat.
> [...]
> Also, I don't remember if I mentioned it before, but another catch is
> that with TCP it expects users not to flush notifications too often,
> because flushing forces it to allocate a new skb and lose a good chunk
> of the benefit of using TCP.

I had issues with TCP sockets and io_uring at the end of 2020:
https://www.spinics.net/lists/io-uring/msg05125.html

I have not tried anything recent (from 2022).
On 7/14/22 00:45, David Ahern wrote:
> On 7/11/22 5:56 AM, Pavel Begunkov wrote:
>> Testing localhost TCP with a hack (see below); it doesn't include the
>> refcounting optimisations I was testing UDP with, which will be sent
>> afterwards. Numbers are in MB/s:
>>
>> IO size | non-zc | zc
>> 1200    | 4174   | 4148
>> 4096    | 7597   | 11228
>
> I am surprised by the low numbers; you should be able to saturate a 100G
> link with TCP and the ZC TX API.

It was a quick test with my laptop, not a super fast CPU, a preemptible
kernel, etc., and considering the fact that it processes receives in the
same send syscall, which roughly doubles the overhead, 87Gb/s looks ok.
It's not like MSG_ZEROCOPY would look much different; even more so given
that all sends here are executed sequentially in io_uring, so there is no
extra parallelism. As for 1200, I think 4GB/s is reasonable: it's just
that the kernel overhead per byte is too high, and it should be the same
with plain send(2).

>> Because it's localhost, we also spend cycles here on the recv side.
>> With a real NIC and 1200-byte payloads, zc is worse than non-zc by
>> ~5-10%; maybe the omitted optimisations will help somewhat.
>> [...]
>
> I had issues with TCP sockets and io_uring at the end of 2020:
> https://www.spinics.net/lists/io-uring/msg05125.html
>
> I have not tried anything recent (from 2022).

I hadn't seen it back then. In general io_uring doesn't stop submitting
requests if one request fails, at least because we're trying to execute
requests asynchronously. And in general, requests can get executed out of
order, so submitting a bunch of requests to a single TCP sock without any
ordering on the io_uring side is most probably a bug.

You can link io_uring requests, i.e. IOSQE_IO_LINK, guaranteeing
execution ordering. And if you meant links in the message, I agree that it
was not the best decision to consider len < sqe->len not an error and not
break links, but MSG_WAITALL was later added, which changes the success
condition to len == sqe->len. But all that is only relevant if you were
using linking.
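A minimal sketch of the linking approach described above, assuming
liburing: two sends to the same TCP socket ordered with IOSQE_IO_LINK,
with MSG_WAITALL so that a short send counts as a failure and breaks the
link (the function name is made up):

#include <liburing.h>
#include <sys/socket.h>

static void queue_ordered_sends(struct io_uring *ring, int fd,
				const void *a, size_t alen,
				const void *b, size_t blen)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_send(sqe, fd, a, alen, MSG_WAITALL);
	/* The next request only runs if this one fully succeeds. */
	sqe->flags |= IOSQE_IO_LINK;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_send(sqe, fd, b, blen, MSG_WAITALL);

	io_uring_submit(ring);
}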
On 7/14/22 12:55 PM, Pavel Begunkov wrote:
> It was a quick test with my laptop, not a super fast CPU, a preemptible
> kernel, etc., and considering the fact that it processes receives in the
> same send syscall, which roughly doubles the overhead, 87Gb/s looks ok.
> [...] As for 1200, I think 4GB/s is reasonable: it's just that the
> kernel overhead per byte is too high, and it should be the same with
> plain send(2).

?
It's a stream socket, so those sends are coalesced into MTU-sized packets.

> I hadn't seen it back then. In general io_uring doesn't stop submitting
> requests if one request fails, at least because we're trying to execute
> requests asynchronously. And in general, requests can get executed out
> of order, so submitting a bunch of requests to a single TCP sock without
> any ordering on the io_uring side is most probably a bug.

The TCP socket buffer fills, resulting in a partial send (i.e., for a
given sqe submission only part of the write/send succeeded). io_uring was
not handling that case.

I'll try to find some time to resurrect the iperf3 patch and try a top of
tree kernel.

> You can link io_uring requests, i.e. IOSQE_IO_LINK, guaranteeing
> execution ordering. [...]
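A minimal sketch of the userspace-side handling for the partial-send case
David describes, assuming liburing; 'struct conn' and the helper name are
hypothetical, not taken from the iperf3 patch:

#include <liburing.h>
#include <sys/socket.h>

struct conn {
	int fd;
	const char *buf;
	size_t len;	/* total bytes to send */
	size_t done;	/* bytes completed so far */
};

static void handle_send_cqe(struct io_uring *ring, struct conn *c,
			    struct io_uring_cqe *cqe)
{
	if (cqe->res < 0)
		return;	/* a real error; handle -EAGAIN etc. here */

	c->done += cqe->res;
	if (c->done < c->len) {
		/* Short send: the socket buffer filled before the full
		 * length went out. Resubmit the remainder instead of
		 * treating this CQE as a complete send. */
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		io_uring_prep_send(sqe, c->fd, c->buf + c->done,
				   c->len - c->done, MSG_NOSIGNAL);
		io_uring_sqe_set_data(sqe, c);
		io_uring_submit(ring);
	}
}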
On 7/18/22 03:19, David Ahern wrote:
> On 7/14/22 12:55 PM, Pavel Begunkov wrote:
>> [...] As for 1200, I think 4GB/s is reasonable: it's just that the
>> kernel overhead per byte is too high, and it should be the same with
>> plain send(2).
>
> ?
> It's a stream socket, so those sends are coalesced into MTU-sized packets.

That leaves syscall and io_uring overhead, locking the socket, etc., which
still costs more cycles than just copying 1200 bytes. And the CPU used is
not blazingly fast; it could be that a better CPU/setup will saturate 100G.

> The TCP socket buffer fills, resulting in a partial send (i.e., for a
> given sqe submission only part of the write/send succeeded). io_uring
> was not handling that case.

It shouldn't be different from send(2) with MSG_DONTWAIT: the send can be
short and the user should handle it. Also, I believe Jens just recently
pushed in-kernel retries for such cases on the io_uring side for TCP.

> I'll try to find some time to resurrect the iperf3 patch and try a top
> of tree kernel.

Awesome

>> You can link io_uring requests, i.e. IOSQE_IO_LINK, guaranteeing
>> execution ordering. [...]
On 7/17/22 8:19 PM, David Ahern wrote:
>> I hadn't seen it back then. In general io_uring doesn't stop submitting
>> requests if one request fails, at least because we're trying to execute
>> requests asynchronously. And in general, requests can get executed out
>> of order, so submitting a bunch of requests to a single TCP sock
>> without any ordering on the io_uring side is most probably a bug.
>
> The TCP socket buffer fills, resulting in a partial send (i.e., for a
> given sqe submission only part of the write/send succeeded). io_uring
> was not handling that case.
>
> I'll try to find some time to resurrect the iperf3 patch and try a top
> of tree kernel.

With your zc_v5 branch (plus the init fix on using msg->sg_from_iter),
iperf3 with io_uring support (the non-ZC case) no longer shows completions
with incomplete sends. That is a good improvement over the last time I
tried it.

However, adding in the ZC support, the problem resurfaces: a lot of
completions are for an incomplete size.

liburing comes from your tree, zc_v4 branch. Upstream does not have
support for notifications yet, so I cannot move to it.

Changes to iperf3 are here:
https://github.com/dsahern/iperf mods-3.10-io_uring
On 7/24/22 19:28, David Ahern wrote:
> With your zc_v5 branch (plus the init fix on using msg->sg_from_iter),
> iperf3 with io_uring support (the non-ZC case) no longer shows
> completions with incomplete sends. That is a good improvement over the
> last time I tried it.
>
> However, adding in the ZC support, the problem resurfaces: a lot of
> completions are for an incomplete size.

Makes sense: it explicitly retries with normal sends, but I didn't
implement that for zc. Might be a good thing to add.

> liburing comes from your tree, zc_v4 branch. Upstream does not have
> support for notifications yet, so I cannot move to it.

Upstreamed it.

> Changes to iperf3 are here:
> https://github.com/dsahern/iperf mods-3.10-io_uring
On 7/27/22 4:51 AM, Pavel Begunkov wrote:
>> With your zc_v5 branch (plus the init fix on using msg->sg_from_iter),
>> iperf3 with io_uring support (the non-ZC case) no longer shows
>> completions with incomplete sends. That is a good improvement over the
>> last time I tried it.
>>
>> However, adding in the ZC support, the problem resurfaces: a lot of
>> completions are for an incomplete size.
>
> Makes sense: it explicitly retries with normal sends, but I didn't
> implement that for zc. Might be a good thing to add.

Yes, before this goes in. It will be confusing to users to get incomplete
completions when using the ZC option.
On 7/24/22 19:28, David Ahern wrote:
> With your zc_v5 branch (plus the init fix on using msg->sg_from_iter),
> iperf3 with io_uring support (the non-ZC case) no longer shows
> completions with incomplete sends. [...]
>
> Changes to iperf3 are here:
> https://github.com/dsahern/iperf mods-3.10-io_uring

Tried it out; the branch below fixes a small problem, adds a couple of
extra optimisations and now actually uses registered buffers.

https://github.com/isilence/iperf iou-sendzc

Still, the submission loop looked a bit weird: it submits I/O to io_uring
only when it exhausts the sqes, instead of sending right away with some
notion of QD and/or sending in batches. That approach is good for batching
(the SQ size is 16 here), but not so good for latency.

I also see some CPU cycles being burnt in select(2). An io_uring wait
would be more natural and perhaps more performant, but I didn't spend
enough time with iperf to say for sure.
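A minimal sketch of the submission style suggested above, assuming
liburing: keep a fixed queue depth (QD) of sends in flight, submit right
away instead of waiting until the SQ is exhausted, and block in io_uring
rather than in select(2). It assumes the ring was set up with at least QD
entries; the loop is unbounded purely for illustration:

#include <liburing.h>

#define QD	16

static void send_loop(struct io_uring *ring, int fd,
		      const void *buf, size_t len)
{
	struct io_uring_cqe *cqe;
	int inflight = 0;

	for (;;) {
		/* Top the queue back up to QD outstanding sends. */
		while (inflight < QD) {
			struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

			io_uring_prep_send(sqe, fd, buf, len, 0);
			inflight++;
		}
		io_uring_submit(ring);

		/* Wait inside io_uring instead of select(2). */
		if (io_uring_wait_cqe(ring, &cqe))
			break;
		if (cqe->res < 0) {
			io_uring_cqe_seen(ring, cqe);
			break;
		}
		io_uring_cqe_seen(ring, cqe);
		inflight--;
	}
}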
On 9/26/22 1:08 PM, Pavel Begunkov wrote:
> Tried it out; the branch below fixes a small problem, adds a couple of
> extra optimisations and now actually uses registered buffers.
>
> https://github.com/isilence/iperf iou-sendzc

Thanks for the patch; I will pull it in.

> Still, the submission loop looked a bit weird: it submits I/O to
> io_uring only when it exhausts the sqes, instead of sending right away
> with some notion of QD and/or sending in batches. That approach is good
> for batching (the SQ size is 16 here), but not so good for latency.
>
> I also see some CPU cycles being burnt in select(2). An io_uring wait
> would be more natural and perhaps more performant, but I didn't spend
> enough time with iperf to say for sure.

Ok. It will be a while before I have time to come back to it. In the
meantime, it seems like some io_uring changes happened between your dev
branch and what was merged into liburing (the compile worked on your
branch but fails with upstream). Is the ZC support in liburing now?
On 9/28/22 20:31, David Ahern wrote:
> On 9/26/22 1:08 PM, Pavel Begunkov wrote:
>> Tried it out; the branch below fixes a small problem, adds a couple of
>> extra optimisations and now actually uses registered buffers.
>>
>> https://github.com/isilence/iperf iou-sendzc
>
> Thanks for the patch; I will pull it in.
>
> Ok. It will be a while before I have time to come back to it. In the
> meantime, it seems like some io_uring changes happened between your dev
> branch and what was merged into liburing (the compile worked on your
> branch but fails with upstream). Is the ZC support in liburing now?

It is. I forgot to put in a note that I also adapted your patches to the
uapi changes: there are no more notification slots, but a zc send request
can now post a second CQE if IORING_CQE_F_MORE is set in the first one.
It's better described in the io_uring_enter(2) man page, e.g.
https://git.kernel.dk/cgit/liburing/tree/man/io_uring_enter.2#n1063
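A minimal sketch of consuming the two CQEs described above, assuming
liburing >= 2.3 (which has io_uring_prep_send_zc); the function name is
made up. The first CQE carries the send result, and if IORING_CQE_F_MORE
is set, a second notification CQE (flagged IORING_CQE_F_NOTIF) follows
once the buffer can be reused:

#include <liburing.h>
#include <stdbool.h>

static int send_zc_once(struct io_uring *ring, int fd,
			const void *buf, size_t len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	bool more;
	int res;

	io_uring_prep_send_zc(sqe, fd, buf, len, 0, 0);
	io_uring_submit(ring);

	/* First CQE: the request completion with the send result. */
	if (io_uring_wait_cqe(ring, &cqe))
		return -1;
	res = cqe->res;
	more = cqe->flags & IORING_CQE_F_MORE;
	io_uring_cqe_seen(ring, cqe);

	/* Second CQE: the "buffer-free" notification; the buffer must
	 * not be reused until it arrives. */
	if (more) {
		if (io_uring_wait_cqe(ring, &cqe))
			return -1;
		io_uring_cqe_seen(ring, cqe);
	}
	return res;
}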