diff mbox series

[net-next,2/2] vsock/virtio: avoid enqueue packets when work queue is empty

Message ID AS2P194MB21706E349197C1466937052C9AC22@AS2P194MB2170.EURP194.PROD.OUTLOOK.COM (mailing list archive)
State New, archived
Series vsock: avoid queuing on workqueue if possible

Commit Message

Luigi Leonardi June 14, 2024, 1:55 p.m. UTC
From: Marco Pinna <marco.pinn95@gmail.com>

This introduces an optimization in virtio_transport_send_pkt:
when the work queue (send_pkt_queue) is empty the packet is
put directly in the virtqueue reducing latency.

In the following benchmark (pingpong mode) the host sends
a payload to the guest and waits for the same payload back.

Tool: Fio version 3.37-56
Env: Phys host + L1 Guest
Payload: 4k
Runtime-per-test: 50s
Mode: pingpong (h-g-h)
Test runs: 50
Type: SOCK_STREAM

Before (Linux 6.8.11)
------
mean(1st percentile):     722.45 ns
mean(overall):           1686.23 ns
mean(99th percentile):  35379.27 ns

After
------
mean(1st percentile):     602.62 ns
mean(overall):           1248.83 ns
mean(99th percentile):  17557.33 ns

Co-developed-by: Luigi Leonardi <luigi.leonardi@outlook.com>
Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com>
Signed-off-by: Marco Pinna <marco.pinn95@gmail.com>
---
 net/vmw_vsock/virtio_transport.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)
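
For readers who want the relative improvement rather than the raw figures, the small helper below computes the percentage reduction between a "before" and an "after" latency. It is only an illustrative sketch: the function name `reduction_pct` is hypothetical and not part of the patch, and the inputs are the means reported in the commit message above.

```c
#include <assert.h>

/*
 * Percentage by which after_ns is lower than before_ns.
 * Hypothetical helper for reading the benchmark table; not kernel code.
 */
double reduction_pct(double before_ns, double after_ns)
{
	assert(before_ns > 0.0);
	return (before_ns - after_ns) / before_ns * 100.0;
}
```

Applied to the reported means, this gives roughly 16.6% for the 1st percentile (722.45 ns to 602.62 ns), 25.9% overall (1686.23 ns to 1248.83 ns) and 50.4% for the 99th percentile (35379.27 ns to 17557.33 ns), so the largest gain is in tail latency.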

Comments

Stefano Garzarella June 14, 2024, 2:37 p.m. UTC | #1
On Fri, Jun 14, 2024 at 03:55:43PM GMT, Luigi Leonardi wrote:
>From: Marco Pinna <marco.pinn95@gmail.com>
>
>This introduces an optimization in virtio_transport_send_pkt:
>when the work queue (send_pkt_queue) is empty the packet is
>put directly in the virtqueue reducing latency.
>
>In the following benchmark (pingpong mode) the host sends
>a payload to the guest and waits for the same payload back.
>
>Tool: Fio version 3.37-56
>Env: Phys host + L1 Guest
>Payload: 4k
>Runtime-per-test: 50s
>Mode: pingpong (h-g-h)
>Test runs: 50
>Type: SOCK_STREAM
>
>Before (Linux 6.8.11)
>------
>mean(1st percentile):     722.45 ns
>mean(overall):           1686.23 ns
>mean(99th percentile):  35379.27 ns
>
>After
>------
>mean(1st percentile):     602.62 ns
>mean(overall):           1248.83 ns
>mean(99th percentile):  17557.33 ns

Cool, thanks for this improvement!
Can you also report your host CPU detail?

>
>Co-developed-by: Luigi Leonardi <luigi.leonardi@outlook.com>
>Signed-off-by: Luigi Leonardi <luigi.leonardi@outlook.com>
>Signed-off-by: Marco Pinna <marco.pinn95@gmail.com>
>---
> net/vmw_vsock/virtio_transport.c | 32 ++++++++++++++++++++++++++++++--
> 1 file changed, 30 insertions(+), 2 deletions(-)
>
>diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>index c930235ecaec..e89bf87282b2 100644
>--- a/net/vmw_vsock/virtio_transport.c
>+++ b/net/vmw_vsock/virtio_transport.c
>@@ -214,7 +214,9 @@ virtio_transport_send_pkt(struct sk_buff *skb)
> {
> 	struct virtio_vsock_hdr *hdr;
> 	struct virtio_vsock *vsock;
>+	bool use_worker = true;
> 	int len = skb->len;
>+	int ret = -1;

Please define ret in the block we use it. Also, we don't need to initialize it.

>
> 	hdr = virtio_vsock_hdr(skb);
>
>@@ -235,8 +237,34 @@ virtio_transport_send_pkt(struct sk_buff *skb)
> 	if (virtio_vsock_skb_reply(skb))
> 		atomic_inc(&vsock->queued_replies);
>
>-	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
>-	queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
>+	/* If the send_pkt_queue is empty there is no need to enqueue the packet.

We should clarify which queue. I mean, we are always queueing the packet
somewhere, either in the internal queue for the worker or in the virtqueue,
so this comment is not really clear.

>+	 * Just put it on the ringbuff using virtio_transport_send_skb.

ringbuff? Do you mean virtqueue?

>+	 */
>+

We can drop this empty line.

>+	if (skb_queue_empty_lockless(&vsock->send_pkt_queue)) {
>+		bool restart_rx = false;
>+		struct virtqueue *vq;

... `int ret;` here.

>+
>+		mutex_lock(&vsock->tx_lock);
>+
>+		vq = vsock->vqs[VSOCK_VQ_TX];
>+
>+		ret = virtio_transport_send_skb(skb, vq, vsock, &restart_rx);

Ah, at the end we don't need `ret` at all.

What about just `if (!virtio_transport_send_skb())`?

>+		if (ret == 0) {
>+			use_worker = false;
>+			virtqueue_kick(vq);
>+		}
>+
>+		mutex_unlock(&vsock->tx_lock);
>+
>+		if (restart_rx)
>+			queue_work(virtio_vsock_workqueue, &vsock->rx_work);
>+	}
>+
>+	if (use_worker) {
>+		virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
>+		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
>+	}
>
> out_rcu:
> 	rcu_read_unlock();
>-- 
>2.45.2
>
Matias Ezequiel Vara Larsen June 17, 2024, 1:24 p.m. UTC | #2
Hello,

thanks for working on this! I have some minor thoughts.

On Fri, Jun 14, 2024 at 03:55:43PM +0200, Luigi Leonardi wrote:
> From: Marco Pinna <marco.pinn95@gmail.com>
> 
> This introduces an optimization in virtio_transport_send_pkt:
> when the work queue (send_pkt_queue) is empty the packet is
> put directly in the virtqueue reducing latency.
> 
> In the following benchmark (pingpong mode) the host sends
> a payload to the guest and waits for the same payload back.
> 
> Tool: Fio version 3.37-56
> Env: Phys host + L1 Guest
> Payload: 4k
> Runtime-per-test: 50s
> Mode: pingpong (h-g-h)
> Test runs: 50
> Type: SOCK_STREAM
> 
> Before (Linux 6.8.11)
> ------
> mean(1st percentile):     722.45 ns
> mean(overall):           1686.23 ns
> mean(99th percentile):  35379.27 ns
> 
> After
> ------
> mean(1st percentile):     602.62 ns
> mean(overall):           1248.83 ns
> mean(99th percentile):  17557.33 ns
> 

I think it would be interesting to know what exactly the test does and
whether the test is triggering the improvement, i.e., whether the better
results are due to enqueuing packets directly to the virtqueue instead of
letting the worker do it. If I understand correctly, this patch focuses on
the case in which the worker queue is empty. I think the test can always
send packets at a frequency such that the worker queue is always empty, but
maybe this is a corner case and most of the time the worker queue is
not empty in a non-testing environment.

Matias
Luigi Leonardi June 18, 2024, 5:05 p.m. UTC | #3
Hi Stefano and Matias,

@Stefano Thanks for your review(s)! I'll send a V2 by the end of the week.

@Matias

Thanks for your feedback!

> I think it would be interesting to know what exactly the test does

It's relatively easy: I used fio's pingpong mode. This mode is specifically
for measuring latency. It works by sending packets, in my case from the
host to the guest, and waiting for the other side to send them back. The
latency I wrote in the commit is the "completion latency". The total
throughput on my system is around 16 Gb/sec.

> if the test is triggering the improvement

Yes! I did some additional testing and I can confirm that during this
test, the worker queue is never used!

> If I understand correctly, this patch focuses on the
> case in which the worker queue is empty

Correct!

> I think the test can always send packets at a frequency such that the
> worker queue is always empty, but maybe this is a corner case and most of
> the time the worker queue is not empty in a non-testing environment.

I'm not sure about this, but IMHO this optimization is free: there is no
penalty for using it, and in the worst case the system will work as usual.
In any case, I'm more than happy to do some additional testing. Do you have
anything in mind?

Luigi
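
To make the ping-pong measurement described above concrete, here is a minimal user-space sketch of the same idea. It is not how fio implements it and it does not use vsock: as a loud assumption, it uses an AF_UNIX `socketpair()` and a forked echo process so that it runs on any POSIX host, and the names `xfer_all` and `pingpong_mean_ns` are hypothetical. One side sends a payload, waits for the same payload to come back, and the round-trip time is averaged.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Read or write exactly len bytes; 0 on success, -1 on error or EOF. */
static int xfer_all(int fd, char *buf, size_t len, int do_write)
{
	size_t done = 0;

	assert(buf != NULL);
	while (done < len) {
		ssize_t n = do_write ? write(fd, buf + done, len - done)
				     : read(fd, buf + done, len - done);
		if (n <= 0)
			return -1;
		done += (size_t)n;
	}
	return 0;
}

/*
 * Mean round-trip time in nanoseconds over `rounds` ping-pong exchanges
 * of `payload` bytes, or -1.0 on error. AF_UNIX stands in for AF_VSOCK
 * here (assumption made so the sketch runs without a guest).
 */
double pingpong_mean_ns(size_t payload, int rounds)
{
	struct timespec t0, t1;
	double total_ns = 0.0;
	int sv[2], i;
	char *buf;
	pid_t pid;

	if (payload == 0 || rounds <= 0)
		return -1.0;
	buf = calloc(1, payload);
	if (!buf)
		return -1.0;
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
		free(buf);
		return -1.0;
	}

	pid = fork();
	if (pid < 0) {
		close(sv[0]);
		close(sv[1]);
		free(buf);
		return -1.0;
	}
	if (pid == 0) {
		/* Echo side: send every payload straight back until EOF. */
		close(sv[0]);
		while (xfer_all(sv[1], buf, payload, 0) == 0 &&
		       xfer_all(sv[1], buf, payload, 1) == 0)
			;
		_exit(0);
	}

	close(sv[1]);
	for (i = 0; i < rounds; i++) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (xfer_all(sv[0], buf, payload, 1) ||
		    xfer_all(sv[0], buf, payload, 0)) {
			total_ns = -1.0;
			break;
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);
		total_ns += (t1.tv_sec - t0.tv_sec) * 1e9 +
			    (t1.tv_nsec - t0.tv_nsec);
	}

	close(sv[0]);		/* EOF lets the echo side exit */
	waitpid(pid, NULL, 0);
	free(buf);
	return total_ns < 0.0 ? -1.0 : total_ns / rounds;
}
```

Calling `pingpong_mean_ns(4096, 1000)` would mirror the 4k payload used in the benchmark. The absolute numbers are not comparable with the vsock results, since the transport is different and only a mean is reported, not percentiles.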
Matias Ezequiel Vara Larsen June 21, 2024, 8:58 a.m. UTC | #4
On Tue, Jun 18, 2024 at 07:05:54PM +0200, Luigi Leonardi wrote:
> Hi Stefano and Matias,
> 
> @Stefano Thanks for your review(s)! I'll send a V2 by the end of the week.
> 
> @Matias
> 
> Thanks for your feedback!
> 
> > I think it would be interesting to know what exactly the test does
> 
> It's relatively easy: I used fio's pingpong mode. This mode is specifically
> for measuring latency. It works by sending packets, in my case from the
> host to the guest, and waiting for the other side to send them back. The
> latency I wrote in the commit is the "completion latency". The total
> throughput on my system is around 16 Gb/sec.
> 

Thanks for the explanation!

> > if the test is triggering the improvement
> 
> Yes! I did some additional testing and I can confirm that during this
> test, the worker queue is never used!
> 

Cool.

> > If I understand correctly, this patch focuses on the
> > case in which the worker queue is empty
> 
> Correct!
> 
> > I think the test can always send packets at a frequency such that the
> > worker queue is always empty, but maybe this is a corner case and most of
> > the time the worker queue is not empty in a non-testing environment.
> 
> I'm not sure about this, but IMHO this optimization is free: there is no
> penalty for using it, and in the worst case the system will work as usual.
> In any case, I'm more than happy to do some additional testing. Do you have
> anything in mind?
> 
Sure! This is a very interesting improvement and I am in favor of
it! I was only thinking out loud ;) I asked the previous questions
because, in my mind, I was thinking that this improvement would trigger
only for the first bunch of packets, i.e., when the worker queue is
empty, so its effect would be seen "only at the beginning of the
transmission" until the worker-queue begins to fill. If I understand
correctly, the worker-queue starts to fill just after the virtqueue is
full, am I right?


Matias
Luigi Leonardi June 21, 2024, 9:47 a.m. UTC | #5
Hi Matias,

> > > I think the test can always send packets at a frequency such that the
> > > worker queue is always empty, but maybe this is a corner case and most of
> > > the time the worker queue is not empty in a non-testing environment.
> >
> > I'm not sure about this, but IMHO this optimization is free: there is no
> > penalty for using it, and in the worst case the system will work as usual.
> > In any case, I'm more than happy to do some additional testing. Do you have
> > anything in mind?
> >
> Sure! This is a very interesting improvement and I am in favor of
> it! I was only thinking out loud ;)

No worries :)

> I asked the previous questions
> because, in my mind, I was thinking that this improvement would trigger
> only for the first bunch of packets, i.e., when the worker queue is
> empty, so its effect would be seen "only at the beginning of the
> transmission" until the worker-queue begins to fill. If I understand
> correctly, the worker-queue starts to fill just after the virtqueue is
> full, am I right?

Correct! Packets are enqueued in the worker-queue only if the virtqueue
is full.

Luigi
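
The behaviour discussed in this exchange can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel code: `struct model`, `RING_SLOTS` and the `model_*` functions are hypothetical stand-ins for the virtqueue, `send_pkt_queue` and `virtio_transport_send_skb()`, and locking, RCU and the rx restart are left out. It follows the shape suggested in review (`if (!send_skb())`, no `ret` variable): a packet takes the direct path only when the backlog is empty and the ring has room, otherwise it is handed to the worker.

```c
#include <assert.h>

#define RING_SLOTS 4	/* hypothetical virtqueue capacity */

struct model {
	int ring_used;		/* packets sitting in the modelled virtqueue */
	int backlog;		/* packets waiting in the modelled send_pkt_queue */
	int sent_direct;	/* packets that took the fast path */
	int sent_via_worker;	/* packets later sent by the worker */
};

/* Stand-in for virtio_transport_send_skb(): 0 on success, -1 if full. */
static int model_send_skb(struct model *m)
{
	if (m->ring_used == RING_SLOTS)
		return -1;
	m->ring_used++;
	return 0;
}

/* Models the decision made in virtio_transport_send_pkt(). */
void model_send_pkt(struct model *m)
{
	/* Fast path only with an empty backlog, so ordering is preserved. */
	if (m->backlog == 0 && !model_send_skb(m)) {
		m->sent_direct++;
		return;
	}
	/* Ring full or packets already pending: leave it to the worker. */
	m->backlog++;
}

/* The device consumes `consumed` buffers, then the worker drains. */
void model_device_and_worker(struct model *m, int consumed)
{
	assert(consumed >= 0);
	m->ring_used -= consumed < m->ring_used ? consumed : m->ring_used;
	while (m->backlog > 0 && !model_send_skb(m)) {
		m->backlog--;
		m->sent_via_worker++;
	}
}
```

In this model the backlog starts to fill only once the ring is full, as confirmed above. While the backlog is non-empty, new packets also go to the worker even if a slot has been freed, and the fast path resumes once the worker has drained everything.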

Patch

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index c930235ecaec..e89bf87282b2 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -214,7 +214,9 @@  virtio_transport_send_pkt(struct sk_buff *skb)
 {
 	struct virtio_vsock_hdr *hdr;
 	struct virtio_vsock *vsock;
+	bool use_worker = true;
 	int len = skb->len;
+	int ret = -1;
 
 	hdr = virtio_vsock_hdr(skb);
 
@@ -235,8 +237,34 @@  virtio_transport_send_pkt(struct sk_buff *skb)
 	if (virtio_vsock_skb_reply(skb))
 		atomic_inc(&vsock->queued_replies);
 
-	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
-	queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+	/* If the send_pkt_queue is empty there is no need to enqueue the packet.
+	 * Just put it on the ringbuff using virtio_transport_send_skb.
+	 */
+
+	if (skb_queue_empty_lockless(&vsock->send_pkt_queue)) {
+		bool restart_rx = false;
+		struct virtqueue *vq;
+
+		mutex_lock(&vsock->tx_lock);
+
+		vq = vsock->vqs[VSOCK_VQ_TX];
+
+		ret = virtio_transport_send_skb(skb, vq, vsock, &restart_rx);
+		if (ret == 0) {
+			use_worker = false;
+			virtqueue_kick(vq);
+		}
+
+		mutex_unlock(&vsock->tx_lock);
+
+		if (restart_rx)
+			queue_work(virtio_vsock_workqueue, &vsock->rx_work);
+	}
+
+	if (use_worker) {
+		virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
+		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+	}
 
 out_rcu:
 	rcu_read_unlock();