
[net-next,v2] vhost/net: Defer TX queue re-enable until after sendmsg

Message ID 20250420010518.2842335-1-jon@nutanix.com (mailing list archive)
State New
Delegated to: Netdev Maintainers
Series [net-next,v2] vhost/net: Defer TX queue re-enable until after sendmsg

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 35 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2025-04-20--09-00 (tests: 916)

Commit Message

Jon Kohler April 20, 2025, 1:05 a.m. UTC
In handle_tx_copy, TX batching applies to packets below ~PAGE_SIZE,
accumulating up to 64 messages before calling sock->sendmsg.

Currently, when there are no more messages on the ring to dequeue,
handle_tx_copy re-enables kicks on the ring *before* firing off the
batch sendmsg. However, sock->sendmsg incurs a non-zero delay,
especially if it needs to wake up a thread (e.g., another vhost worker).

If the guest submits additional messages immediately after that last ring
check, now that kicks are re-enabled, it triggers an EPT_MISCONFIG vmexit
to kick the vhost worker (the kick is an MMIO write to the virtqueue
notify area, which the host sees as an EPT misconfiguration exit and
forwards to vhost via ioeventfd). This can happen while the worker is
still processing the sendmsg, making those exits wasteful.

This is particularly problematic for a single-threaded guest submission
path, as the submitting vCPU must exit, wait for the exit to be processed
(potentially involving a TTWU), and only then resume.

In scenarios like a constant stream of UDP messages, this results in a
sawtooth pattern where the submitter frequently vmexits, and the
vhost-net worker alternates between sleeping and waking.

A common solution is to configure vhost-net busy polling via userspace
(e.g., qemu poll-us). However, treating the sendmsg itself as the "busy"
period (keeping kicks disabled across the final sendmsg and performing
one additional ring check afterward) provides a significant performance
improvement without any excess busy-poll cycles.
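
(For context, that userspace busy-poll knob maps to the
VHOST_SET_VRING_BUSYLOOP_TIMEOUT ioctl on the vhost device. A minimal
sketch of how a VMM would set it; the fd, queue index, and timeout
values here are illustrative, not taken from this patch:)

    #include <sys/ioctl.h>
    #include <linux/vhost.h>

    /* Sketch only: ask vhost to busy poll the ring for up to
     * timeout_us before going idle. A timeout of 0 disables it.
     */
    static int set_busyloop_timeout(int vhost_net_fd, unsigned int qidx,
                                    unsigned int timeout_us)
    {
            struct vhost_vring_state s = {
                    .index = qidx,
                    .num = timeout_us,
            };

            return ioctl(vhost_net_fd, VHOST_SET_VRING_BUSYLOOP_TIMEOUT, &s);
    }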

If messages are found in the ring after the final sendmsg, requeue the
TX handler. This ensures fairness for the RX handler and allows
vhost_run_work_list to cond_resched() as needed.
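
For reference, that final check-and-requeue is delegated to the existing
vhost_net_busy_poll_try_queue() helper in drivers/vhost/net.c, which looks
roughly like this (simplified sketch, not the verbatim source):

    static void vhost_net_busy_poll_try_queue(struct vhost_net *net,
                                              struct vhost_virtqueue *vq)
    {
            /* More work already queued by the guest? Run the handler again. */
            if (!vhost_vq_avail_empty(&net->dev, vq))
                    vhost_poll_queue(&vq->poll);
            /* Otherwise re-enable kicks, re-checking once to close the race
             * with the guest adding buffers just before the enable.
             */
            else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
                    vhost_disable_notify(&net->dev, vq);
                    vhost_poll_queue(&vq->poll);
            }
    }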

Test Case
    TX VM: taskset -c 2 iperf3  -c rx-ip-here -t 60 -p 5200 -b 0 -u -i 5
    RX VM: taskset -c 2 iperf3 -s -p 5200 -D
    Host kernel 6.12.0, each vhost worker backed by a tun interface set
    up with IFF_NAPI.
    Note: the TCP side is largely unchanged, as it was copy bound.
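
One way to observe the EPT_MISCONFIG exit rates quoted below (an
illustrative tooling choice, not necessarily what was used here; the
qemu-system-x86_64 process name is also an assumption):
    perf kvm stat live -p $(pidof qemu-system-x86_64)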

6.12.0 unpatched
    EPT_MISCONFIG/second: 5411
    Datagrams/second: ~382k
    Interval         Transfer     Bitrate         Lost/Total Datagrams
    0.00-30.00  sec  15.5 GBytes  4.43 Gbits/sec  0/11481630 (0%)  sender

6.12.0 patched
    EPT_MISCONFIG/second: 58 (~93x reduction)
    Datagrams/second: ~650k  (~1.7x increase)
    Interval         Transfer     Bitrate         Lost/Total Datagrams
    0.00-30.00  sec  26.4 GBytes  7.55 Gbits/sec  0/19554720 (0%)  sender

Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
 drivers/vhost/net.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

Comments

Michael S. Tsirkin April 20, 2025, 7:32 a.m. UTC | #1
On Sat, Apr 19, 2025 at 06:05:18PM -0700, Jon Kohler wrote:
> [...]

sounds like the right approach.

Acked-by: Michael S. Tsirkin <mst@redhat.com>



> ---


in the future, pls put the changelog here as you progress v1->v2->v3.
Thanks!

Jon Kohler April 20, 2025, 2:38 p.m. UTC | #2
> On Apr 20, 2025, at 3:32 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Sat, Apr 19, 2025 at 06:05:18PM -0700, Jon Kohler wrote:
>> [...]
> 
> sounds like the right approach.
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> 
> 
> 
>> ---
> 
> 
> in the future, pls put the changelog here as you progress v1->v2->v3.
> Thanks!

Thanks for checking out the patch, and sorry about the missing
changelog. For posterity, the only change in v2 was resending with
the net-next prefix, which I forgot to add on v1.

v1 patch with Jason’s ack is here:
https://lore.kernel.org/kvm/20250401043230.790419-1-jon@nutanix.com/T/


Patch

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b9b9e9d40951..9b04025eea66 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -769,13 +769,17 @@  static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
 			break;
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
+			/* If interrupted while doing busy polling, requeue
+			 * the handler to be fair handle_rx as well as other
+			 * tasks waiting on cpu
+			 */
 			if (unlikely(busyloop_intr)) {
 				vhost_poll_queue(&vq->poll);
-			} else if (unlikely(vhost_enable_notify(&net->dev,
-								vq))) {
-				vhost_disable_notify(&net->dev, vq);
-				continue;
 			}
+			/* Kicks are disabled at this point, break loop and
+			 * process any remaining batched packets. Queue will
+			 * be re-enabled afterwards.
+			 */
 			break;
 		}
 
@@ -825,7 +829,14 @@  static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
 		++nvq->done_idx;
 	} while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
 
+	/* Kicks are still disabled, dispatch any remaining batched msgs. */
 	vhost_tx_batch(net, nvq, sock, &msg);
+
+	/* All of our work has been completed; however, before leaving the
+	 * TX handler, do one last check for work, and requeue handler if
+	 * necessary. If there is no work, queue will be reenabled.
+	 */
+	vhost_net_busy_poll_try_queue(net, vq);
 }
 
 static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)