
[v2,4/4] vhost_net: Add self test with tun device

Message ID 20210622161533.1214662-4-dwmw2@infradead.org (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [v2,1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode

Commit Message

David Woodhouse June 22, 2021, 4:15 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

This creates a tun device and brings it up, then finds out the link-local
address the kernel automatically assigns to it.

It sends a ping to that address, from a fake LL address of its own, and
then waits for a response.

If the virtio_net_hdr stuff is all working correctly, it gets a response
and manages to understand it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/vhost/Makefile        |  16 +
 tools/testing/selftests/vhost/config          |   2 +
 .../testing/selftests/vhost/test_vhost_net.c  | 522 ++++++++++++++++++
 4 files changed, 541 insertions(+)
 create mode 100644 tools/testing/selftests/vhost/Makefile
 create mode 100644 tools/testing/selftests/vhost/config
 create mode 100644 tools/testing/selftests/vhost/test_vhost_net.c

Comments

Jason Wang June 23, 2021, 4:02 a.m. UTC | #1
On 2021/6/23 12:15 AM, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> This creates a tun device and brings it up, then finds out the link-local
> address the kernel automatically assigns to it.
>
> It sends a ping to that address, from a fake LL address of its own, and
> then waits for a response.
>
> If the virtio_net_hdr stuff is all working correctly, it gets a response
> and manages to understand it.


I wonder whether it's worth bothering with dependencies like IPv6 or the
kernel networking stack.

How about simply using a packet socket bound to the tun device to receive
and send packets?


>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   tools/testing/selftests/Makefile              |   1 +
>   tools/testing/selftests/vhost/Makefile        |  16 +
>   tools/testing/selftests/vhost/config          |   2 +
>   .../testing/selftests/vhost/test_vhost_net.c  | 522 ++++++++++++++++++
>   4 files changed, 541 insertions(+)
>   create mode 100644 tools/testing/selftests/vhost/Makefile
>   create mode 100644 tools/testing/selftests/vhost/config
>   create mode 100644 tools/testing/selftests/vhost/test_vhost_net.c


[...]


> +	/*
> +	 * I just want to map the *whole* of userspace address space. But
> +	 * from userspace I don't know what that is. On x86_64 it would be:
> +	 *
> +	 * vmem->regions[0].guest_phys_addr = 4096;
> +	 * vmem->regions[0].memory_size = 0x7fffffffe000;
> +	 * vmem->regions[0].userspace_addr = 4096;
> +	 *
> +	 * For now, just ensure we put everything inside a single BSS region.
> +	 */
> +	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
> +	vmem->regions[0].userspace_addr = (uint64_t)&rings;
> +	vmem->regions[0].memory_size = sizeof(rings);


Instead of doing tricks like this, we can do it in another way:

1) enable device IOTLB
2) wait for the IOTLB miss request (iova, len) and update identity 
mapping accordingly

This should work for all the archs (with some performance hit).

Thanks
David Woodhouse June 23, 2021, 4:12 p.m. UTC | #2
On Wed, 2021-06-23 at 12:02 +0800, Jason Wang wrote:
> On 2021/6/23 12:15 AM, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > This creates a tun device and brings it up, then finds out the link-local
> > address the kernel automatically assigns to it.
> > 
> > It sends a ping to that address, from a fake LL address of its own, and
> > then waits for a response.
> > 
> > If the virtio_net_hdr stuff is all working correctly, it gets a response
> > and manages to understand it.
> 
> 
> I wonder whether it's worth bothering with dependencies like IPv6 or the
> kernel networking stack.
> 
> How about simply using a packet socket bound to the tun device to receive
> and send packets?
> 

I pondered that but figured that using the kernel's network stack
wasn't too much of an additional dependency. We *could* use an
AF_PACKET socket on the tun device and then drive both ends, but given
that the kernel *automatically* assigns a link-local address when we
bring the device up anyway, it seemed simple enough just to use ICMP.
I also happened to have the ICMP generation/checking code lying around
anyway in the same emacs instance, so it was reduced to a previously
solved problem.

We *should* eventually expand this test case to attach an AF_PACKET
device to the vhost-net, instead of using a tun device as the back end.
(Although I don't really see *why* vhost is limited to AF_PACKET. Why
*can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)


> > +	/*
> > +	 * I just want to map the *whole* of userspace address space. But
> > +	 * from userspace I don't know what that is. On x86_64 it would be:
> > +	 *
> > +	 * vmem->regions[0].guest_phys_addr = 4096;
> > +	 * vmem->regions[0].memory_size = 0x7fffffffe000;
> > +	 * vmem->regions[0].userspace_addr = 4096;
> > +	 *
> > +	 * For now, just ensure we put everything inside a single BSS region.
> > +	 */
> > +	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
> > +	vmem->regions[0].userspace_addr = (uint64_t)&rings;
> > +	vmem->regions[0].memory_size = sizeof(rings);
> 
> 
> Instead of doing tricks like this, we can do it in another way:
> 
> 1) enable device IOTLB
> 2) wait for the IOTLB miss request (iova, len) and update identity 
> mapping accordingly
> 
> This should work for all the archs (with some performance hit).

Ick. For my actual application (OpenConnect) I'm either going to suck
it up and put in the arch-specific limits like in the comment above, or
I'll fix things to do the VHOST_F_IDENTITY_MAPPING thing we're talking
about elsewhere. (Probably the former, since if I'm requiring kernel
changes then I have grander plans around extending AF_TLS to do DTLS,
then hooking that directly up to the tun socket via BPF and a sockmap
without the data frames ever going to userspace at all.)

For this test case, a hard-coded single address range in BSS is fine.

I've now added !IFF_NO_PI support to the test case, but as noted it
fails just like the other ones I'd already marked with #if 0, which is
because vhost-net pulls some value for 'sock_hlen' out of its posterior
based on some assumption around the vhost features. And then expects
sock_recvmsg() to return precisely that number of bytes more than the
value it peeks in the skb at the head of the sock's queue.
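
For reference, the guess in question can be reproduced from userspace like
this (a paraphrase of what vhost-net appears to assume based purely on the
negotiated features; not the literal kernel code, so check the tree rather
than trusting this snippet):

#include <stddef.h>
#include <stdint.h>
#include <linux/virtio_config.h>
#include <linux/virtio_net.h>
#include <linux/vhost.h>

/* vhost derives both header lengths from the negotiated features alone;
 * the underlying tun/tap/packet socket is never asked what it will
 * actually prepend. */
static void guess_hdr_lens(uint64_t features, size_t *vhost_hlen,
			   size_t *sock_hlen)
{
	size_t hdr_len = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
				      (1ULL << VIRTIO_F_VERSION_1))) ?
		sizeof(struct virtio_net_hdr_mrg_rxbuf) :	/* 12 bytes */
		sizeof(struct virtio_net_hdr);			/* 10 bytes */

	if (features & (1ULL << VHOST_NET_F_VIRTIO_NET_HDR)) {
		/* vhost inserts/strips the virtio_net_hdr itself */
		*vhost_hlen = hdr_len;
		*sock_hlen = 0;
	} else {
		/* ...otherwise the socket is assumed to do it, with
		 * exactly this size */
		*vhost_hlen = 0;
		*sock_hlen = hdr_len;
	}
}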

I think I can fix *all* those test cases by making tun_get_socket()
take an extra 'int *' argument, and use that to return the *actual*
value of sock_hlen. Here's the updated test case in the meantime:


From cf74e3fc80b8fd9df697a42cfc1ff3887de18f78 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@amazon.co.uk>
Date: Wed, 23 Jun 2021 16:38:56 +0100
Subject: [PATCH] test_vhost_net: add test cases with tun_pi header

These fail too, for the same reason the previous tests were
guarded with #if 0: vhost-net pulls 'sock_hlen' out of its
posterior and just assumes it's 10 bytes. And then barfs when
a sock_recvmsg() doesn't return precisely ten bytes more than
it peeked in the head skb:

[1296757.531103] Discarded rx packet:  len 78, expected 74

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 .../testing/selftests/vhost/test_vhost_net.c  | 97 +++++++++++++------
 1 file changed, 65 insertions(+), 32 deletions(-)

diff --git a/tools/testing/selftests/vhost/test_vhost_net.c b/tools/testing/selftests/vhost/test_vhost_net.c
index fd4a2b0e42f0..734b3015a5bd 100644
--- a/tools/testing/selftests/vhost/test_vhost_net.c
+++ b/tools/testing/selftests/vhost/test_vhost_net.c
@@ -48,7 +48,7 @@ static unsigned char hexchar(char *hex)
 	return (hexnybble(hex[0]) << 4) | hexnybble(hex[1]);
 }
 
-int open_tun(int vnet_hdr_sz, struct in6_addr *addr)
+int open_tun(int vnet_hdr_sz, int pi, struct in6_addr *addr)
 {
 	int tun_fd = open("/dev/net/tun", O_RDWR);
 	if (tun_fd == -1)
@@ -56,7 +56,9 @@ int open_tun(int vnet_hdr_sz, struct in6_addr *addr)
 
 	struct ifreq ifr = { 0 };
 
-	ifr.ifr_flags = IFF_TUN | IFF_NO_PI;
+	ifr.ifr_flags = IFF_TUN;
+	if (!pi)
+		ifr.ifr_flags |= IFF_NO_PI;
 	if (vnet_hdr_sz)
 		ifr.ifr_flags |= IFF_VNET_HDR;
 
@@ -249,11 +251,18 @@ static inline uint16_t csum_finish(uint32_t sum)
 	return htons((uint16_t)(~sum));
 }
 
-static int create_icmp_echo(unsigned char *data, struct in6_addr *dst,
+static int create_icmp_echo(unsigned char *data, int pi, struct in6_addr *dst,
 			    struct in6_addr *src, uint16_t id, uint16_t seq)
 {
 	const int icmplen = ICMP_MINLEN + sizeof(ping_payload);
-	const int plen = sizeof(struct ip6_hdr) + icmplen;
+	int plen = sizeof(struct ip6_hdr) + icmplen;
+
+	if (pi) {
+		struct tun_pi *pi = (void *)data;
+		data += sizeof(*pi);
+		plen += sizeof(*pi);
+		pi->proto = htons(ETH_P_IPV6);
+	}
 
 	struct ip6_hdr *iph = (void *)data;
 	struct icmp6_hdr *icmph = (void *)(data + sizeof(*iph));
@@ -312,8 +321,21 @@ static int create_icmp_echo(unsigned char *data, struct in6_addr *dst,
 }
 
 
-static int check_icmp_response(unsigned char *data, uint32_t len, struct in6_addr *dst, struct in6_addr *src)
+static int check_icmp_response(unsigned char *data, uint32_t len, int pi,
+			       struct in6_addr *dst, struct in6_addr *src)
 {
+	if (pi) {
+		struct tun_pi *pi = (void *)data;
+		if (len < sizeof(*pi))
+			return 0;
+
+		if (pi->proto != htons(ETH_P_IPV6))
+			return 0;
+
+		data += sizeof(*pi);
+		len -= sizeof(*pi);
+	}
+
 	struct ip6_hdr *iph = (void *)data;
 	return ( len >= 41 && (ntohl(iph->ip6_flow) >> 28)==6 /* IPv6 header */
 		 && iph->ip6_nxt == IPPROTO_ICMPV6 /* IPv6 next header field = ICMPv6 */
@@ -337,7 +359,7 @@ static int check_icmp_response(unsigned char *data, uint32_t len, struct in6_add
 #endif
 
 
-int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
+int test_vhost(int vnet_hdr_sz, int pi, int xdp, uint64_t features)
 {
 	int call_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
 	int kick_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
@@ -353,7 +375,7 @@ int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
 	/* Pick up the link-local address that the kernel
 	 * assigns to the tun device. */
 	struct in6_addr tun_addr;
-	tun_fd = open_tun(vnet_hdr_sz, &tun_addr);
+	tun_fd = open_tun(vnet_hdr_sz, pi, &tun_addr);
 	if (tun_fd < 0)
 		goto err;
 
@@ -387,18 +409,18 @@ int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
 	local_addr.s6_addr16[0] = htons(0xfe80);
 	local_addr.s6_addr16[7] = htons(1);
 
+
 	/* Set up RX and TX descriptors; the latter with ping packets ready to
 	 * send to the kernel, but don't actually send them yet. */
 	for (int i = 0; i < RING_SIZE; i++) {
 		struct pkt_buf *pkt = &rings[1].pkts[i];
-		int plen = create_icmp_echo(&pkt->data[vnet_hdr_sz], &tun_addr,
-					    &local_addr, 0x4747, i);
+		int plen = create_icmp_echo(&pkt->data[vnet_hdr_sz], pi,
+					    &tun_addr, &local_addr, 0x4747, i);
 
 		rings[1].desc[i].addr = vio64((uint64_t)pkt);
 		rings[1].desc[i].len = vio32(plen + vnet_hdr_sz);
 		rings[1].avail_ring[i] = vio16(i);
 
-
 		pkt = &rings[0].pkts[i];
 		rings[0].desc[i].addr = vio64((uint64_t)pkt);
 		rings[0].desc[i].len = vio32(sizeof(*pkt));
@@ -438,9 +460,10 @@ int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
 				return -1;
 
 			if (check_icmp_response((void *)(addr + vnet_hdr_sz), len - vnet_hdr_sz,
-						&local_addr, &tun_addr)) {
+						pi, &local_addr, &tun_addr)) {
 				ret = 0;
-				printf("Success (%d %d %llx)\n", vnet_hdr_sz, xdp, (unsigned long long)features);
+				printf("Success (hdr %d, xdp %d, pi %d, features %llx)\n",
+				       vnet_hdr_sz, xdp, pi, (unsigned long long)features);
 				goto err;
 			}
 
@@ -466,51 +489,61 @@ int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
 	return ret;
 }
 
-
-int main(void)
+/* Perform the given test with all four combinations of XDP/PI */
+int test_four(int vnet_hdr_sz, uint64_t features)
 {
-	int ret;
-
-	ret = test_vhost(0, 0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
-				(1ULL << VIRTIO_F_VERSION_1)));
+	int ret = test_vhost(vnet_hdr_sz, 0, 0, features);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(0, 1, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
-				(1ULL << VIRTIO_F_VERSION_1)));
+	ret = test_vhost(vnet_hdr_sz, 0, 1, features);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
-
-	ret = test_vhost(0, 0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
+#if 0 /* These don't work *either* for the same reason as the #if 0 later */
+	ret = test_vhost(vnet_hdr_sz, 1, 0, features);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(0, 1, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
+	ret = test_vhost(vnet_hdr_sz, 1, 1, features);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
+#endif
+}
 
-	ret = test_vhost(10, 0, 0);
-	if (ret && ret != KSFT_SKIP)
-		return ret;
+int main(void)
+{
+	int ret;
 
-	ret = test_vhost(10, 1, 0);
+	ret = test_four(10, 0);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-#if 0 /* These ones will fail */
-	ret = test_vhost(0, 0, 0);
+	ret = test_four(0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
+			    (1ULL << VIRTIO_F_VERSION_1)));
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(0, 1, 0);
+	ret = test_four(0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(12, 0, 0);
+
+#if 0
+	/*
+	 * These ones will fail, because right now vhost *assumes* that the
+	 * underlying (tun, etc.) socket will be doing a header of precisely
+	 * sizeof(struct virtio_net_hdr), if vhost isn't doing so itself due
+	 * to VHOST_NET_F_VIRTIO_NET_HDR.
+	 *
+	 * That assumption breaks both tun with no IFF_VNET_HDR, and also
+	 * presumably raw sockets. So leave these test cases disabled for
+	 * now until it's fixed.
+	 */
+	ret = test_four(0, 0);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(12, 1, 0);
+	ret = test_four(12, 0);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 #endif
Jason Wang June 24, 2021, 6:12 a.m. UTC | #3
On 2021/6/24 12:12 AM, David Woodhouse wrote:
> On Wed, 2021-06-23 at 12:02 +0800, Jason Wang wrote:
>> On 2021/6/23 12:15 AM, David Woodhouse wrote:
>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>
>>> This creates a tun device and brings it up, then finds out the link-local
>>> address the kernel automatically assigns to it.
>>>
>>> It sends a ping to that address, from a fake LL address of its own, and
>>> then waits for a response.
>>>
>>> If the virtio_net_hdr stuff is all working correctly, it gets a response
>>> and manages to understand it.
>>
>> I wonder whether it's worth bothering with dependencies like IPv6 or the
>> kernel networking stack.
>>
>> How about simply using a packet socket bound to the tun device to receive
>> and send packets?
>>
> I pondered that but figured that using the kernel's network stack
> wasn't too much of an additional dependency. We *could* use an
> AF_PACKET socket on the tun device and then drive both ends, but given
> that the kernel *automatically* assigns a link-local address when we
> bring the device up anyway, it seemed simple enough just to use ICMP.
> I also happened to have the ICMP generation/checking code lying around
> anyway in the same emacs instance, so it was reduced to a previously
> solved problem.


Ok.


>
> We *should* eventually expand this test case to attach an AF_PACKET
> device to the vhost-net, instead of using a tun device as the back end.
> (Although I don't really see *why* vhost is limited to AF_PACKET. Why
> *can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)


It's just because nobody wrote the code. And we're lacking the real use 
case.

Vhost_net is basically used for injecting packets from userspace into the
kernel networking stack.

Using AF_UNIX makes it look more like a case of inter-process communication
(and without a vnet header it won't be used efficiently by a VM). In this
case, using io_uring is much more suitable.

Or, thinking of it another way: instead of depending on vhost_net, we could
expose the TUN/TAP socket to userspace, and then io_uring could be used for
the OpenConnect case as well?


>
>
>>> +	/*
>>> +	 * I just want to map the *whole* of userspace address space. But
>>> +	 * from userspace I don't know what that is. On x86_64 it would be:
>>> +	 *
>>> +	 * vmem->regions[0].guest_phys_addr = 4096;
>>> +	 * vmem->regions[0].memory_size = 0x7fffffffe000;
>>> +	 * vmem->regions[0].userspace_addr = 4096;
>>> +	 *
>>> +	 * For now, just ensure we put everything inside a single BSS region.
>>> +	 */
>>> +	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
>>> +	vmem->regions[0].userspace_addr = (uint64_t)&rings;
>>> +	vmem->regions[0].memory_size = sizeof(rings);
>>
>> Instead of doing tricks like this, we can do it in another way:
>>
>> 1) enable device IOTLB
>> 2) wait for the IOTLB miss request (iova, len) and update identity
>> mapping accordingly
>>
>> This should work for all the archs (with some performance hit).
> Ick. For my actual application (OpenConnect) I'm either going to suck
> it up and put in the arch-specific limits like in the comment above, or
> I'll fix things to do the VHOST_F_IDENTITY_MAPPING thing we're talking
> about elsewhere.


The feature could be useful for the case of vhost-vDPA as well.


>   (Probably the former, since if I'm requiring kernel
> changes then I have grander plans around extending AF_TLS to do DTLS,
> then hooking that directly up to the tun socket via BPF and a sockmap
> without the data frames ever going to userspace at all.)


Ok, I guess we need to make sockmap work for the tun socket.


>
> For this test case, a hard-coded single address range in BSS is fine.
>
> I've now added !IFF_NO_PI support to the test case, but as noted it
> fails just like the other ones I'd already marked with #if 0, which is
> because vhost-net pulls some value for 'sock_hlen' out of its posterior
> based on some assumption around the vhost features. And then expects
> sock_recvmsg() to return precisely that number of bytes more than the
> value it peeks in the skb at the head of the sock's queue.
>
> I think I can fix *all* those test cases by making tun_get_socket()
> take an extra 'int *' argument, and use that to return the *actual*
> value of sock_hlen. Here's the updated test case in the meantime:


It would be better if you can post a new version of the whole series to 
ease the reviewing.

Thanks
David Woodhouse June 24, 2021, 10:42 a.m. UTC | #4
On Thu, 2021-06-24 at 14:12 +0800, Jason Wang wrote:
> On 2021/6/24 12:12 AM, David Woodhouse wrote:
> > We *should* eventually expand this test case to attach an AF_PACKET
> > device to the vhost-net, instead of using a tun device as the back end.
> > (Although I don't really see *why* vhost is limited to AF_PACKET. Why
> > *can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)
> 
> 
> It's just because nobody wrote the code. And we're lacking the real use 
> case.

Hm, what code? For AF_PACKET I haven't actually spotted that there *is*
any.

As I've been refactoring the interaction between vhost and tun/tap, and
fixing it up for different vhdr lengths, PI, and (just now) frowning in
horror at the concept that tun and vhost can have *different*
endiannesses, I hadn't spotted that there was anything special on the
packet socket. For that case, sock_hlen is just zero and we
send/receive plain packets... or so I thought? Did I miss something?

As far as I was aware, that ought to have worked with any datagram
socket. I was pondering not just AF_UNIX but also UDP (since that's my
main transport for VPN data, at least in the case where I care about
performance).

An interesting use case for a non-packet socket might be to establish a
tunnel. A guest's virtio-net device is just connected to a UDP socket
on the host, and that tunnels all their packets to/from a remote
endpoint which is where that guest is logically connected to the
network. It might be useful for live migration cases, perhaps?

I don't have an overriding desire to *make* it work, mind you; I just
wanted to make sure I understand *why* it doesn't, if indeed it
doesn't. As far as I could tell, it *should* work if we just dropped
the check?

> Vhost_net is basically used for injecting packets from userspace into the
> kernel networking stack.
> 
> Using AF_UNIX makes it look more like a case of inter-process communication
> (and without a vnet header it won't be used efficiently by a VM). In this
> case, using io_uring is much more suitable.
> 
> Or, thinking of it another way: instead of depending on vhost_net, we could
> expose the TUN/TAP socket to userspace, and then io_uring could be used for
> the OpenConnect case as well?

That would work, I suppose. Although as noted, I *can* use vhost_net
with tun today from userspace as long as I disable XDP and PI, and use
a virtio_net_hdr that I don't really want. (And pull a value for
TASK_SIZE out of my posterior; qv.)

So I *can* ship a version of OpenConnect that works on existing
production kernels with those workarounds, and I'm fixing up the other
permutations of vhost/tun stuff in the kernel just because I figured we
*should*.

If I'm going to *require* new kernel support for OpenConnect then I
might as well go straight to the AF_TLS/DTLS + BPF + sockmap plan and
have the data packets never go to userspace in the first place.


> 
> > 
> > 
> > > > +	/*
> > > > +	 * I just want to map the *whole* of userspace address space. But
> > > > +	 * from userspace I don't know what that is. On x86_64 it would be:
> > > > +	 *
> > > > +	 * vmem->regions[0].guest_phys_addr = 4096;
> > > > +	 * vmem->regions[0].memory_size = 0x7fffffffe000;
> > > > +	 * vmem->regions[0].userspace_addr = 4096;
> > > > +	 *
> > > > +	 * For now, just ensure we put everything inside a single BSS region.
> > > > +	 */
> > > > +	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
> > > > +	vmem->regions[0].userspace_addr = (uint64_t)&rings;
> > > > +	vmem->regions[0].memory_size = sizeof(rings);
> > > 
> > > Instead of doing tricks like this, we can do it in another way:
> > > 
> > > 1) enable device IOTLB
> > > 2) wait for the IOTLB miss request (iova, len) and update identity
> > > mapping accordingly
> > > 
> > > This should work for all the archs (with some performance hit).
> > 
> > Ick. For my actual application (OpenConnect) I'm either going to suck
> > it up and put in the arch-specific limits like in the comment above, or
> > I'll fix things to do the VHOST_F_IDENTITY_MAPPING thing we're talking
> > about elsewhere.
> 
> 
> The feature could be useful for the case of vhost-vDPA as well.
> 
> 
> >   (Probably the former, since if I'm requiring kernel
> > changes then I have grander plans around extending AF_TLS to do DTLS,
> > then hooking that directly up to the tun socket via BPF and a sockmap
> > without the data frames ever going to userspace at all.)
> 
> 
> Ok, I guess we need to make sockmap work for the tun socket.

Hm, I need to work out the ideal data flow here. I don't know if
sendmsg() / recvmsg() on the tun socket are even what I want, for full
zero-copy.

In the case where the NIC supports encryption, we want true zero-copy
from the moment the "encrypted" packet arrives over UDP on the public
network, through the DTLS processing and seqno checking, to being
processed as netif_receive_skb() on the tun device.

Likewise skbs from tun_net_xmit() should have the appropriate DTLS and
IP/UDP headers prepended to them and that *same* skb (or at least the
same frags) should be handed to the NIC to encrypt and send.

In the case where we have software crypto in the kernel, we can
tolerate precisely *one* copy because the crypto doesn't have to be
done in-place, so moving from the input to the output crypto buffers
can be that one "copy", and we can use it to move data around (from the
incoming skb to the outgoing skb) if we need to.

Ultimately I think we want udp_sendmsg() and tun_sendmsg() to support
being *given* ownership of the buffers, rather than copying from them.
Or just being given a skb and pushing/pulling their headers.

I'm looking at skb_send_sock() and it *doesn't* seem to support "just
steal the frags from the initial skb and give them to the new one", but
there may be ways to make that work.

> > I think I can fix *all* those test cases by making tun_get_socket()
> > take an extra 'int *' argument, and use that to return the *actual*
> > value of sock_hlen. Here's the updated test case in the meantime:
> 
> 
> It would be better if you can post a new version of the whole series to 
> ease the reviewing.

Yep. I was working on that... until I got even more distracted by
looking at how we can do the true in-kernel zero-copy option ;)
Jason Wang June 25, 2021, 2:55 a.m. UTC | #5
On 2021/6/24 6:42 PM, David Woodhouse wrote:
> On Thu, 2021-06-24 at 14:12 +0800, Jason Wang wrote:
>> On 2021/6/24 12:12 AM, David Woodhouse wrote:
>>> We *should* eventually expand this test case to attach an AF_PACKET
>>> device to the vhost-net, instead of using a tun device as the back end.
>>> (Although I don't really see *why* vhost is limited to AF_PACKET. Why
>>> *can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)
>>
>> It's just because nobody wrote the code. And we're lacking the real use
>> case.
> Hm, what code?


The code to support AF_UNIX.


>   For AF_PACKET I haven't actually spotted that there *is*
> any.


Vhost_net has had this support for more than 10 years; it's hard to say
there's no user for it.


>
> As I've been refactoring the interaction between vhost and tun/tap, and
> fixing it up for different vhdr lengths, PI, and (just now) frowning in
> horror at the concept that tun and vhost can have *different*
> endiannesses, I hadn't spotted that there was anything special on the
> packet socket.


Vnet header support.


>   For that case, sock_hlen is just zero and we
> send/receive plain packets... or so I thought? Did I miss something?


With vnet header, it can have GSO and csum offload.


>
> As far as I was aware, that ought to have worked with any datagram
> socket. I was pondering not just AF_UNIX but also UDP (since that's my
> main transport for VPN data, at least in the case where I care about
> performance).


My understanding is that vhost_net was designed for accelerating the virtio
datapath, which is mainly used for VMs (L2 traffic). So all kinds of TAPs
(tuntap, macvtap or packet socket) are the main users. If you check the git
history, vhost could not even be enabled without KVM until sometime last
year. So I concede it could serve a more general use case, and we have
already had some discussions about that. But it's hard to say it's worth
doing, since it would become a re-invention of io_uring?

Another interesting thing is that the copy done by vhost
(copy_from/to_user()) will be much slower than io_uring (which uses GUP),
because the userspace copy may suffer the performance hit caused by SMAP.


>
> An interesting use case for a non-packet socket might be to establish a
> tunnel. A guest's virtio-net device is just connected to a UDP socket
> on the host, and that tunnels all their packets to/from a remote
> endpoint which is where that guest is logically connected to the
> network. It might be useful for live migration cases, perhaps?


The kernel already supports tunnels like L2 over L4, which are all done at
the netdevice level (e.g. vxlan). If you want to build a customized tunnel
which is not supported by the kernel, you need to redirect the traffic back
to userspace. vhost-user is probably the best choice in that case.


>
> I don't have an overriding desire to *make* it work, mind you; I just
> wanted to make sure I understand *why* it doesn't, if indeed it
> doesn't. As far as I could tell, it *should* work if we just dropped
> the check?


I'm not sure. It requires careful thought.

For the L2/VM case, we care more about performance, which can be achieved
via the vnet header.

For the case of L3 (TUN) or above, we can do that via io_uring.

So it looks to me like it's not worth bothering with.


>
>> Vhost_net is basically used for injecting packets from userspace into the
>> kernel networking stack.
>>
>> Using AF_UNIX makes it look more like a case of inter-process communication
>> (and without a vnet header it won't be used efficiently by a VM). In this
>> case, using io_uring is much more suitable.
>>
>> Or, thinking of it another way: instead of depending on vhost_net, we could
>> expose the TUN/TAP socket to userspace, and then io_uring could be used for
>> the OpenConnect case as well?
> That would work, I suppose. Although as noted, I *can* use vhost_net
> with tun today from userspace as long as I disable XDP and PI, and use
> a virtio_net_hdr that I don't really want. (And pull a value for
> TASK_SIZE out of my posterior; qv.)
>
> So I *can* ship a version of OpenConnect that works on existing
> production kernels with those workarounds, and I'm fixing up the other
> permutations of vhost/tun stuff in the kernel just because I figured we
> *should*.
>
> If I'm going to *require* new kernel support for OpenConnect then I
> might as well go straight to the AF_TLS/DTLS + BPF + sockmap plan and
> have the data packets never go to userspace in the first place.


Note that BPF has some limitations:

1) it requires capabilities like CAP_BPF
2) it may need a userspace fallback


>
>
>>>
>>>>> +	/*
>>>>> +	 * I just want to map the *whole* of userspace address space. But
>>>>> +	 * from userspace I don't know what that is. On x86_64 it would be:
>>>>> +	 *
>>>>> +	 * vmem->regions[0].guest_phys_addr = 4096;
>>>>> +	 * vmem->regions[0].memory_size = 0x7fffffffe000;
>>>>> +	 * vmem->regions[0].userspace_addr = 4096;
>>>>> +	 *
>>>>> +	 * For now, just ensure we put everything inside a single BSS region.
>>>>> +	 */
>>>>> +	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
>>>>> +	vmem->regions[0].userspace_addr = (uint64_t)&rings;
>>>>> +	vmem->regions[0].memory_size = sizeof(rings);
>>>> Instead of doing tricks like this, we can do it in another way:
>>>>
>>>> 1) enable device IOTLB
>>>> 2) wait for the IOTLB miss request (iova, len) and update identity
>>>> mapping accordingly
>>>>
>>>> This should work for all the archs (with some performance hit).
>>> Ick. For my actual application (OpenConnect) I'm either going to suck
>>> it up and put in the arch-specific limits like in the comment above, or
>>> I'll fix things to do the VHOST_F_IDENTITY_MAPPING thing we're talking
>>> about elsewhere.
>>
>> The feature could be useful for the case of vhost-vDPA as well.
>>
>>
>>>    (Probably the former, since if I'm requiring kernel
>>> changes then I have grander plans around extending AF_TLS to do DTLS,
>>> then hooking that directly up to the tun socket via BPF and a sockmap
>>> without the data frames ever going to userspace at all.)
>>
>> Ok, I guess we need to make sockmap work for the tun socket.
> Hm, I need to work out the ideal data flow here. I don't know if
> sendmsg() / recvmsg() on the tun socket are even what I want, for full
> zero-copy.


Zerocopy could be done via vhost_net, but due to the head-of-line blocking
issue we disabled it by default.


>
> In the case where the NIC supports encryption, we want true zero-copy
> from the moment the "encrypted" packet arrives over UDP on the public
> network, through the DTLS processing and seqno checking, to being
> processed as netif_receive_skb() on the tun device.
>
> Likewise skbs from tun_net_xmit() should have the appropriate DTLS and
> IP/UDP headers prepended to them and that *same* skb (or at least the
> same frags) should be handed to the NIC to encrypt and send.
>
> In the case where we have software crypto in the kernel, we can
> tolerate precisely *one* copy because the crypto doesn't have to be
> done in-place, so moving from the input to the output crypto buffers
> can be that one "copy", and we can use it to move data around (from the
> incoming skb to the outgoing skb) if we need to.


I'm not familiar with encryption, but it looks like what you want is TLS
offload support in TUN/TAP.


>
> Ultimately I think we want udp_sendmsg() and tun_sendmsg() to support
> being *given* ownership of the buffers, rather than copying from them.
> Or just being given a skb and pushing/pulling their headers.


It looks more like you want to add sendpage() support for TUN? The first
step, as discussed, would be the code to expose the TUN socket to userspace.


>
> I'm looking at skb_send_sock() and it *doesn't* seem to support "just
> steal the frags from the initial skb and give them to the new one", but
> there may be ways to make that work.


I don't know. Last time I checked, sockmap only supported TCP sockets. But
I saw some work proposed by Wang Cong that would probably make it work for
UDP.

Thanks


>
>>> I think I can fix *all* those test cases by making tun_get_socket()
>>> take an extra 'int *' argument, and use that to return the *actual*
>>> value of sock_hlen. Here's the updated test case in the meantime:
>>
>> It would be better if you can post a new version of the whole series to
>> ease the reviewing.
> Yep. I was working on that... until I got even more distracted by
> looking at how we can do the true in-kernel zero-copy option ;)
>
David Woodhouse June 25, 2021, 7:54 a.m. UTC | #6
On Fri, 2021-06-25 at 10:55 +0800, Jason Wang wrote:
> On 2021/6/24 6:42 PM, David Woodhouse wrote:
> > On Thu, 2021-06-24 at 14:12 +0800, Jason Wang wrote:
> > > On 2021/6/24 12:12 AM, David Woodhouse wrote:
> > > > We *should* eventually expand this test case to attach an AF_PACKET
> > > > device to the vhost-net, instead of using a tun device as the back end.
> > > > (Although I don't really see *why* vhost is limited to AF_PACKET. Why
> > > > *can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)
> > > 
> > > It's just because nobody wrote the code. And we're lacking the real use
> > > case.
> > 
> > Hm, what code?
> 
> 
> The code to support AF_UNIX.
> 
> 
> >   For AF_PACKET I haven't actually spotted that there *is* any.
> 
> 
> Vhost_net has had this support for more than 10 years; it's hard to say
> there's no user for it.
> 

I wasn't saying I hadn't spotted the use case. I hadn't spotted the
*code* which is in af_packet to support vhost. But...

> > As I've been refactoring the interaction between vhost and tun/tap, and
> > fixing it up for different vhdr lengths, PI, and (just now) frowning in
> > horror at the concept that tun and vhost can have *different*
> > endiannesses, I hadn't spotted that there was anything special on the
> > packet socket.
> 
> Vnet header support.

... I have no idea how I failed to spot that. OK, so AF_PACKET sockets
can *optionally* support the case where *they* provide the
virtio_net_hdr — instead of vhost doing it, or there being none.

But any other sockets would work for the "vhost does it" or the "no
vhdr" case.

... and I need to fix my 'get sock_hlen from the underlying tun/tap
device' patch to *not* assume that sock_hlen is zero for a raw socket;
it needs to check the PACKET_VNET_HDR sockopt. And *that* was broken
for the VERSION_1|MRG_RXBUF case before I came along, wasn't it?
Because vhost would have assumed sock_hlen to be 12 bytes, while in
AF_PACKET it's always only 10?
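
For the record, the AF_PACKET side would look something like this (a minimal
sketch assuming the standard PACKET_VNET_HDR sockopt; not part of the patch,
error handling trimmed):

#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/if.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

static int open_packet_vnet(const char *ifname)
{
	int one = 1;
	struct sockaddr_ll sll;

	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	if (fd < 0)
		return -1;

	/* With this set, every send/recv on the socket carries a
	 * struct virtio_net_hdr -- always the 10-byte form, never the
	 * 12-byte mrg_rxbuf variant. */
	if (setsockopt(fd, SOL_PACKET, PACKET_VNET_HDR, &one, sizeof(one)) < 0)
		goto err;

	memset(&sll, 0, sizeof(sll));
	sll.sll_family = AF_PACKET;
	sll.sll_protocol = htons(ETH_P_ALL);
	sll.sll_ifindex = if_nametoindex(ifname);
	if (!sll.sll_ifindex ||
	    bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0)
		goto err;

	return fd;
err:
	close(fd);
	return -1;
}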

> >   For that case, sock_hlen is just zero and we
> > send/receive plain packets... or so I thought? Did I miss something?
> 
> 
> With vnet header, it can have GSO and csum offload.
> 
> 
> > 
> > As far as I was aware, that ought to have worked with any datagram
> > socket. I was pondering not just AF_UNIX but also UDP (since that's my
> > main transport for VPN data, at least in the case where I care about
> > performance).
> 
> 
> My understanding is that vhost_net was designed for accelerating the virtio
> datapath, which is mainly used for VMs (L2 traffic). So all kinds of TAPs
> (tuntap, macvtap or packet socket) are the main users. If you check the git
> history, vhost could not even be enabled without KVM until sometime last
> year. So I concede it could serve a more general use case, and we have
> already had some discussions about that. But it's hard to say it's worth
> doing, since it would become a re-invention of io_uring?

Yeah, ultimately I'm not sure that's worth exploring. As I said, I was
looking for something that works on *current* kernels. Which means no
io_uring on the underlying tun socket, and no vhost on UDP. If I want
to go and implement *both* ring protocols in userspace and make use of
each of them on the socket that they do support, I can do that. Yay! :)

If I'm going to require new kernels, then I should just work on the
"ideal" data path which doesn't really involve userspace at all. But we
should probably take that discussion to a separate thread.

Patch

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6c575cf34a71..300c03cfd0c7 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -71,6 +71,7 @@  TARGETS += user
 TARGETS += vDSO
 TARGETS += vm
 TARGETS += x86
+TARGETS += vhost
 TARGETS += zram
 #Please keep the TARGETS list alphabetically sorted
 # Run "make quicktest=1 run_tests" or
diff --git a/tools/testing/selftests/vhost/Makefile b/tools/testing/selftests/vhost/Makefile
new file mode 100644
index 000000000000..f5e565d80733
--- /dev/null
+++ b/tools/testing/selftests/vhost/Makefile
@@ -0,0 +1,16 @@ 
+# SPDX-License-Identifier: GPL-2.0
+all:
+
+include ../lib.mk
+
+.PHONY: all clean
+
+BINARIES := test_vhost_net
+
+test_vhost_net: test_vhost_net.c ../kselftest.h ../kselftest_harness.h
+	$(CC) $(CFLAGS) -g $< -o $@
+
+TEST_PROGS += $(BINARIES)
+EXTRA_CLEAN := $(BINARIES)
+
+all: $(BINARIES)
diff --git a/tools/testing/selftests/vhost/config b/tools/testing/selftests/vhost/config
new file mode 100644
index 000000000000..6391c1f32c34
--- /dev/null
+++ b/tools/testing/selftests/vhost/config
@@ -0,0 +1,2 @@ 
+CONFIG_VHOST_NET=y
+CONFIG_TUN=y
diff --git a/tools/testing/selftests/vhost/test_vhost_net.c b/tools/testing/selftests/vhost/test_vhost_net.c
new file mode 100644
index 000000000000..14acf2c0e049
--- /dev/null
+++ b/tools/testing/selftests/vhost/test_vhost_net.c
@@ -0,0 +1,522 @@ 
+// SPDX-License-Identifier: LGPL-2.1
+
+#include "../kselftest_harness.h"
+#include "../../../virtio/asm/barrier.h"
+
+#include <sys/eventfd.h>
+
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/wait.h>
+#include <sys/ioctl.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include <net/if.h>
+#include <sys/socket.h>
+
+#include <netinet/tcp.h>
+#include <netinet/ip.h>
+#include <netinet/ip_icmp.h>
+#include <netinet/ip6.h>
+#include <netinet/icmp6.h>
+
+#include <linux/if_tun.h>
+#include <linux/virtio_net.h>
+#include <linux/vhost.h>
+
+static unsigned char hexnybble(char hex)
+{
+	switch (hex) {
+	case '0'...'9':
+		return hex - '0';
+	case 'a'...'f':
+		return 10 + hex - 'a';
+	case 'A'...'F':
+		return 10 + hex - 'A';
+	default:
+		exit (KSFT_SKIP);
+	}
+}
+
+static unsigned char hexchar(char *hex)
+{
+	return (hexnybble(hex[0]) << 4) | hexnybble(hex[1]);
+}
+
+int open_tun(int vnet_hdr_sz, struct in6_addr *addr)
+{
+	int tun_fd = open("/dev/net/tun", O_RDWR);
+	if (tun_fd == -1)
+		return -1;
+
+	struct ifreq ifr = { 0 };
+
+	ifr.ifr_flags = IFF_TUN | IFF_NO_PI;
+	if (vnet_hdr_sz)
+		ifr.ifr_flags |= IFF_VNET_HDR;
+
+	if (ioctl(tun_fd, TUNSETIFF, (void *)&ifr) < 0)
+		goto out_tun;
+
+	if (vnet_hdr_sz &&
+	    ioctl(tun_fd, TUNSETVNETHDRSZ, &vnet_hdr_sz) < 0)
+		goto out_tun;
+
+	int sockfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP);
+	if (sockfd == -1)
+		goto out_tun;
+
+	if (ioctl(sockfd, SIOCGIFFLAGS, (void *)&ifr) < 0)
+		goto out_sock;
+
+	ifr.ifr_flags |= IFF_UP;
+	if (ioctl(sockfd, SIOCSIFFLAGS, (void *)&ifr) < 0)
+		goto out_sock;
+
+	close(sockfd);
+
+	FILE *inet6 = fopen("/proc/net/if_inet6", "r");
+	if (!inet6)
+		goto out_tun;
+
+	char buf[80];
+	while (fgets(buf, sizeof(buf), inet6)) {
+		size_t len = strlen(buf), namelen = strlen(ifr.ifr_name);
+		if (!strncmp(buf, "fe80", 4) &&
+		    buf[len - namelen - 2] == ' ' &&
+		    !strncmp(buf + len - namelen - 1, ifr.ifr_name, namelen)) {
+			for (int i = 0; i < 16; i++) {
+				addr->s6_addr[i] = hexchar(buf + i*2);
+			}
+			fclose(inet6);
+			return tun_fd;
+		}
+	}
+	/* Not found */
+	fclose(inet6);
+ out_sock:
+	close(sockfd);
+ out_tun:
+	close(tun_fd);
+	return -1;
+}
+
+#define RING_SIZE 32
+#define RING_MASK(x) ((x) & (RING_SIZE-1))
+
+struct pkt_buf {
+	unsigned char data[2048];
+};
+
+struct test_vring {
+	struct vring_desc desc[RING_SIZE];
+	struct vring_avail avail;
+	__virtio16 avail_ring[RING_SIZE];
+	struct vring_used used;
+	struct vring_used_elem used_ring[RING_SIZE];
+	struct pkt_buf pkts[RING_SIZE];
+} rings[2];
+
+static int setup_vring(int vhost_fd, int tun_fd, int call_fd, int kick_fd, int idx)
+{
+	struct test_vring *vring = &rings[idx];
+	int ret;
+
+	memset(vring, 0, sizeof(*vring));
+
+	struct vhost_vring_state vs = { };
+	vs.index = idx;
+	vs.num = RING_SIZE;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_NUM, &vs) < 0) {
+		perror("VHOST_SET_VRING_NUM");
+		return -1;
+	}
+
+	vs.num = 0;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_BASE, &vs) < 0) {
+		perror("VHOST_SET_VRING_BASE");
+		return -1;
+	}
+
+	struct vhost_vring_addr va = { };
+	va.index = idx;
+	va.desc_user_addr = (uint64_t)vring->desc;
+	va.avail_user_addr = (uint64_t)&vring->avail;
+	va.used_user_addr  = (uint64_t)&vring->used;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &va) < 0) {
+		perror("VHOST_SET_VRING_ADDR");
+		return -1;
+	}
+
+	struct vhost_vring_file vf = { };
+	vf.index = idx;
+	vf.fd = tun_fd;
+	if (ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &vf) < 0) {
+		perror("VHOST_NET_SET_BACKEND");
+		return -1;
+	}
+
+	vf.fd = call_fd;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_CALL, &vf) < 0) {
+		perror("VHOST_SET_VRING_CALL");
+		return -1;
+	}
+
+	vf.fd = kick_fd;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_KICK, &vf) < 0) {
+		perror("VHOST_SET_VRING_KICK");
+		return -1;
+	}
+
+	return 0;
+}
+
+int setup_vhost(int vhost_fd, int tun_fd, int call_fd, int kick_fd, uint64_t want_features)
+{
+	int ret;
+
+	if (ioctl(vhost_fd, VHOST_SET_OWNER, NULL) < 0) {
+		perror("VHOST_SET_OWNER");
+		return -1;
+	}
+
+	uint64_t features;
+	if (ioctl(vhost_fd, VHOST_GET_FEATURES, &features) < 0) {
+		perror("VHOST_GET_FEATURES");
+		return -1;
+	}
+
+	if ((features & want_features) != want_features)
+		return KSFT_SKIP;
+
+	if (ioctl(vhost_fd, VHOST_SET_FEATURES, &want_features) < 0) {
+		perror("VHOST_SET_FEATURES");
+		return -1;
+	}
+
+	struct vhost_memory *vmem = alloca(sizeof(*vmem) + sizeof(vmem->regions[0]));
+
+	memset(vmem, 0, sizeof(*vmem) + sizeof(vmem->regions[0]));
+	vmem->nregions = 1;
+	/*
+	 * I just want to map the *whole* of userspace address space. But
+	 * from userspace I don't know what that is. On x86_64 it would be:
+	 *
+	 * vmem->regions[0].guest_phys_addr = 4096;
+	 * vmem->regions[0].memory_size = 0x7fffffffe000;
+	 * vmem->regions[0].userspace_addr = 4096;
+	 *
+	 * For now, just ensure we put everything inside a single BSS region.
+	 */
+	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
+	vmem->regions[0].userspace_addr = (uint64_t)&rings;
+	vmem->regions[0].memory_size = sizeof(rings);
+
+	if (ioctl(vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {
+		perror("VHOST_SET_MEM_TABLE");
+		return -1;
+	}
+
+	if (setup_vring(vhost_fd, tun_fd, call_fd, kick_fd, 0))
+		return -1;
+
+	if (setup_vring(vhost_fd, tun_fd, call_fd, kick_fd, 1))
+		return -1;
+
+	return 0;
+}
+
+
+static char ping_payload[16] = "VHOST TEST PACKT";
+
+static inline uint32_t csum_partial(uint16_t *buf, int nwords)
+{
+	uint32_t sum = 0;
+	for(sum=0; nwords>0; nwords--)
+		sum += ntohs(*buf++);
+	return sum;
+}
+
+static inline uint16_t csum_finish(uint32_t sum)
+{
+	sum = (sum >> 16) + (sum &0xffff);
+	sum += (sum >> 16);
+	return htons((uint16_t)(~sum));
+}
+
+static int create_icmp_echo(unsigned char *data, struct in6_addr *dst,
+			    struct in6_addr *src, uint16_t id, uint16_t seq)
+{
+	const int icmplen = ICMP_MINLEN + sizeof(ping_payload);
+	const int plen = sizeof(struct ip6_hdr) + icmplen;
+
+	struct ip6_hdr *iph = (void *)data;
+	struct icmp6_hdr *icmph = (void *)(data + sizeof(*iph));
+
+	/* IPv6 Header */
+	iph->ip6_flow = htonl((6 << 28) + /* version 6 */
+			      (0 << 20) + /* traffic class */
+			      (0 << 0));  /* flow ID  */
+	iph->ip6_nxt = IPPROTO_ICMPV6;
+	iph->ip6_plen = htons(icmplen);
+	iph->ip6_hlim = 128;
+	iph->ip6_src = *src;
+	iph->ip6_dst = *dst;
+
+	/* ICMPv6 echo request */
+	icmph->icmp6_type = ICMP6_ECHO_REQUEST;
+	icmph->icmp6_code = 0;
+	icmph->icmp6_data16[0] = htons(id);	/* ID */
+	icmph->icmp6_data16[1] = htons(seq);	/* sequence */
+
+	/* Some arbitrary payload */
+	memcpy(&icmph[1], ping_payload, sizeof(ping_payload));
+
+	/*
+	 * IPv6 upper-layer checksums include a pseudo-header
+	 * for IPv6 which contains the source address, the
+	 * destination address, the upper-layer packet length
+	 * and next-header field. See RFC8200 §8.1. The
+	 * checksum is as follows:
+	 *
+	 *   checksum 32 bytes of real IPv6 header:
+	 *     src addr (16 bytes)
+	 *     dst addr (16 bytes)
+	 *   8 bytes more:
+	 *     length of ICMPv6 in bytes (be32)
+	 *     3 bytes of 0
+	 *     next header byte (IPPROTO_ICMPV6)
+	 *   Then the actual ICMPv6 bytes
+	 */
+	uint32_t sum = csum_partial((uint16_t *)&iph->ip6_src, 8);      /* 8 uint16_t */
+	sum += csum_partial((uint16_t *)&iph->ip6_dst, 8);              /* 8 uint16_t */
+
+	/* The easiest way to checksum the following 8-byte
+	 * part of the pseudo-header without horridly violating
+	 * C type aliasing rules is *not* to build it in memory
+	 * at all. We know the length fits in 16 bits so the
+	 * partial checksum of 00 00 LL LL 00 00 00 NH ends up
+	 * being just LLLL + NH.
+	 */
+	sum += IPPROTO_ICMPV6;
+	sum += ICMP_MINLEN + sizeof(ping_payload);
+
+	sum += csum_partial((uint16_t *)icmph, icmplen / 2);
+	icmph->icmp6_cksum = csum_finish(sum);
+	return plen;
+}
+
+
+static int check_icmp_response(unsigned char *data, uint32_t len, struct in6_addr *dst, struct in6_addr *src)
+{
+	struct ip6_hdr *iph = (void *)data;
+	return ( len >= 41 && (ntohl(iph->ip6_flow) >> 28)==6 /* IPv6 header */
+		 && iph->ip6_nxt == IPPROTO_ICMPV6 /* IPv6 next header field = ICMPv6 */
+		 && !memcmp(&iph->ip6_src, src, 16) /* source == magic address */
+		 && !memcmp(&iph->ip6_dst, dst, 16) /* dest == magic address */
+		 && len >= 40 + ICMP_MINLEN + sizeof(ping_payload) /* No short-packet segfaults */
+		 && data[40] == ICMP6_ECHO_REPLY /* ICMPv6 reply */
+		 && !memcmp(&data[40 + ICMP_MINLEN], ping_payload, sizeof(ping_payload)) /* Same payload in response */
+		 );
+
+}
+
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+#define vio16(x) (x)
+#define vio32(x) (x)
+#define vio64(x) (x)
+#else
+#define vio16(x) __builtin_bswap16(x)
+#define vio32(x) __builtin_bswap32(x)
+#define vio64(x) __builtin_bswap64(x)
+#endif
+
+
+int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
+{
+	int call_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
+	int kick_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
+	int vhost_fd = open("/dev/vhost-net", O_RDWR);
+	int tun_fd = -1;
+	int ret = KSFT_SKIP;
+
+	if (call_fd < 0 || kick_fd < 0 || vhost_fd < 0)
+		goto err;
+
+	memset(rings, 0, sizeof(rings));
+
+	/* Pick up the link-local address that the kernel
+	 * assigns to the tun device. */
+	struct in6_addr tun_addr;
+	tun_fd = open_tun(vnet_hdr_sz, &tun_addr);
+	if (tun_fd < 0)
+		goto err;
+
+	if (features & (1ULL << VHOST_NET_F_VIRTIO_NET_HDR)) {
+		if (vnet_hdr_sz) {
+			ret = -1;
+			goto err;
+		}
+
+		vnet_hdr_sz = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+					   (1ULL << VIRTIO_F_VERSION_1))) ?
+			sizeof(struct virtio_net_hdr_mrg_rxbuf) :
+			sizeof(struct virtio_net_hdr);
+	}
+
+	if (!xdp) {
+		int sndbuf = RING_SIZE * 2048;
+		if (ioctl(tun_fd, TUNSETSNDBUF, &sndbuf) < 0) {
+			perror("TUNSETSNDBUF");
+			ret = -1;
+			goto err;
+		}
+	}
+
+	ret = setup_vhost(vhost_fd, tun_fd, call_fd, kick_fd, features);
+	if (ret)
+		goto err;
+
+	/* A fake link-local address for the userspace end */
+	struct in6_addr local_addr = { 0 };
+	local_addr.s6_addr16[0] = htons(0xfe80);
+	local_addr.s6_addr16[7] = htons(1);
+
+	/* Set up RX and TX descriptors; the latter with ping packets ready to
+	 * send to the kernel, but don't actually send them yet. */
+	for (int i = 0; i < RING_SIZE; i++) {
+		struct pkt_buf *pkt = &rings[1].pkts[i];
+		int plen = create_icmp_echo(&pkt->data[vnet_hdr_sz], &tun_addr,
+					    &local_addr, 0x4747, i);
+
+		rings[1].desc[i].addr = vio64((uint64_t)pkt);
+		rings[1].desc[i].len = vio32(plen + vnet_hdr_sz);
+		rings[1].avail_ring[i] = vio16(i);
+
+
+		pkt = &rings[0].pkts[i];
+		rings[0].desc[i].addr = vio64((uint64_t)pkt);
+		rings[0].desc[i].len = vio32(sizeof(*pkt));
+		rings[0].desc[i].flags = vio16(VRING_DESC_F_WRITE);
+		rings[0].avail_ring[i] = vio16(i);
+	}
+	barrier();
+
+	rings[0].avail.idx = vio16(RING_SIZE);
+	rings[1].avail.idx = vio16(1);
+
+	barrier();
+	eventfd_write(kick_fd, 1);
+
+	uint16_t rx_seen_used = 0;
+	struct timeval tv = { 1, 0 };
+	while (1) {
+		fd_set rfds = { 0 };
+		FD_SET(call_fd, &rfds);
+
+		if (select(call_fd + 1, &rfds, NULL, NULL, &tv) <= 0) {
+			ret = -1;
+			goto err;
+		}
+
+		uint16_t rx_used_idx = vio16(rings[0].used.idx);
+		barrier();
+
+		while(rx_used_idx != rx_seen_used) {
+			uint32_t desc = vio32(rings[0].used_ring[RING_MASK(rx_seen_used)].id);
+			uint32_t len  = vio32(rings[0].used_ring[RING_MASK(rx_seen_used)].len);
+
+			if (desc >= RING_SIZE || len < vnet_hdr_sz)
+				return -1;
+
+			uint64_t addr = vio64(rings[0].desc[desc].addr);
+			if (!addr)
+				return -1;
+
+			if (check_icmp_response((void *)(addr + vnet_hdr_sz), len - vnet_hdr_sz,
+						&local_addr, &tun_addr)) {
+				ret = 0;
+				printf("Success (%d %d %llx)\n", vnet_hdr_sz, xdp, (unsigned long long)features);
+				goto err;
+			}
+			rx_seen_used++;
+
+			/* Give the same buffer back */
+			rings[0].avail.idx = vio16(rx_seen_used + RING_SIZE);
+			barrier();
+			eventfd_write(kick_fd, 1);
+		}
+
+		uint64_t ev_val;
+		eventfd_read(call_fd, &ev_val);
+	}
+
+ err:
+	if (call_fd != -1)
+		close(call_fd);
+	if (kick_fd != -1)
+		close(kick_fd);
+	if (vhost_fd != -1)
+		close(vhost_fd);
+	if (tun_fd != -1)
+		close(tun_fd);
+
+	return ret;
+}
+
+
+int main(void)
+{
+	int ret;
+
+	ret = test_vhost(0, 0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
+				(1ULL << VIRTIO_F_VERSION_1)));
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(0, 1, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
+				(1ULL << VIRTIO_F_VERSION_1)));
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(0, 0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(0, 1, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(10, 0, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(10, 1, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+#if 0 /* These ones will fail */
+	ret = test_vhost(0, 0, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(0, 1, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(12, 0, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(12, 1, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+#endif
+
+	return ret;
+}