Message ID | 20210622161533.1214662-4-dwmw2@infradead.org (mailing list archive)
---|---
State | Changes Requested
Delegated to: | Netdev Maintainers
Series | [v2,1/4] net: tun: fix tun_xdp_one() for IFF_TUN mode
On 2021/6/23 12:15 AM, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> This creates a tun device and brings it up, then finds out the link-local
> address the kernel automatically assigns to it.
>
> It sends a ping to that address, from a fake LL address of its own, and
> then waits for a response.
>
> If the virtio_net_hdr stuff is all working correctly, it gets a response
> and manages to understand it.

I wonder whether it's worth bothering with dependencies like IPv6 or the
kernel networking stack.

How about simply using a packet socket that is bound to the tun device to
receive and send packets?

> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  tools/testing/selftests/Makefile              |   1 +
>  tools/testing/selftests/vhost/Makefile        |  16 +
>  tools/testing/selftests/vhost/config          |   2 +
>  .../testing/selftests/vhost/test_vhost_net.c  | 522 ++++++++++++++++++
>  4 files changed, 541 insertions(+)
>  create mode 100644 tools/testing/selftests/vhost/Makefile
>  create mode 100644 tools/testing/selftests/vhost/config
>  create mode 100644 tools/testing/selftests/vhost/test_vhost_net.c

[...]

> +	/*
> +	 * I just want to map the *whole* of userspace address space. But
> +	 * from userspace I don't know what that is. On x86_64 it would be:
> +	 *
> +	 * vmem->regions[0].guest_phys_addr = 4096;
> +	 * vmem->regions[0].memory_size = 0x7fffffffe000;
> +	 * vmem->regions[0].userspace_addr = 4096;
> +	 *
> +	 * For now, just ensure we put everything inside a single BSS region.
> +	 */
> +	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
> +	vmem->regions[0].userspace_addr = (uint64_t)&rings;
> +	vmem->regions[0].memory_size = sizeof(rings);

Instead of doing tricks like this, we can do it in another way:

1) enable device IOTLB
2) wait for the IOTLB miss request (iova, len) and update the identity
   mapping accordingly

This should work for all the archs (with some performance hit).

Thanks
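For reference, the device-IOTLB flow suggested here can be driven from userspace roughly as in the sketch below. The vhost_msg read/write interface and the constants come from <linux/vhost.h>, but the enabling step (negotiating an IOTLB-capable feature bit such as VIRTIO_F_IOMMU_PLATFORM with VHOST_SET_FEATURES), the fixed-size identity window and the error handling are illustrative assumptions, not a tested implementation:

#include <stdint.h>
#include <unistd.h>
#include <linux/vhost.h>

/*
 * Minimal sketch of the IOTLB-miss loop described above.  Assumes the
 * vhost fd was set up with an IOTLB-capable feature set beforehand; the
 * mapping policy (identity-map a fixed window around the faulting iova)
 * is purely illustrative.
 */
static int serve_iotlb_identity(int vhost_fd)
{
	struct vhost_msg msg;

	for (;;) {
		/* The vhost fd delivers IOTLB miss requests as vhost_msg reads */
		if (read(vhost_fd, &msg, sizeof(msg)) != sizeof(msg))
			return -1;

		if (msg.type != VHOST_IOTLB_MSG ||
		    msg.iotlb.type != VHOST_IOTLB_MISS)
			continue;

		/* Reply with an identity mapping covering the miss address */
		struct vhost_msg reply = {
			.type = VHOST_IOTLB_MSG,
			.iotlb = {
				.iova  = msg.iotlb.iova & ~0xfffULL,
				.uaddr = msg.iotlb.iova & ~0xfffULL,
				.size  = 1 << 20,	/* arbitrary 1MiB window */
				.perm  = VHOST_ACCESS_RW,
				.type  = VHOST_IOTLB_UPDATE,
			},
		};
		if (write(vhost_fd, &reply, sizeof(reply)) != sizeof(reply))
			return -1;
	}
}

The kernel only reports the faulting iova and access type in the miss message, so the size of the window mapped in the reply is a policy choice; a larger window means fewer misses and less of the performance hit mentioned above.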
On Wed, 2021-06-23 at 12:02 +0800, Jason Wang wrote:
> On 2021/6/23 12:15 AM, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > This creates a tun device and brings it up, then finds out the link-local
> > address the kernel automatically assigns to it.
> >
> > It sends a ping to that address, from a fake LL address of its own, and
> > then waits for a response.
> >
> > If the virtio_net_hdr stuff is all working correctly, it gets a response
> > and manages to understand it.
>
> I wonder whether it's worth bothering with dependencies like IPv6 or the
> kernel networking stack.
>
> How about simply using a packet socket that is bound to the tun device to
> receive and send packets?

I pondered that but figured that using the kernel's network stack wasn't
too much of an additional dependency. We *could* use an AF_PACKET socket
on the tun device and then drive both ends, but given that the kernel
*automatically* assigns a link-local address when we bring the device up
anyway, it seemed simple enough just to use ICMP. I also happened to
have the ICMP generation/checking code lying around anyway in the same
emacs instance, so it was reduced to a previously solved problem.

We *should* eventually expand this test case to attach an AF_PACKET
device to the vhost-net, instead of using a tun device as the back end.
(Although I don't really see *why* vhost is limited to AF_PACKET. Why
*can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)

> > +	/*
> > +	 * I just want to map the *whole* of userspace address space. But
> > +	 * from userspace I don't know what that is. On x86_64 it would be:
> > +	 *
> > +	 * vmem->regions[0].guest_phys_addr = 4096;
> > +	 * vmem->regions[0].memory_size = 0x7fffffffe000;
> > +	 * vmem->regions[0].userspace_addr = 4096;
> > +	 *
> > +	 * For now, just ensure we put everything inside a single BSS region.
> > +	 */
> > +	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
> > +	vmem->regions[0].userspace_addr = (uint64_t)&rings;
> > +	vmem->regions[0].memory_size = sizeof(rings);
>
> Instead of doing tricks like this, we can do it in another way:
>
> 1) enable device IOTLB
> 2) wait for the IOTLB miss request (iova, len) and update the identity
>    mapping accordingly
>
> This should work for all the archs (with some performance hit).

Ick. For my actual application (OpenConnect) I'm either going to suck it
up and put in the arch-specific limits like in the comment above, or
I'll fix things to do the VHOST_F_IDENTITY_MAPPING thing we're talking
about elsewhere. (Probably the former, since if I'm requiring kernel
changes then I have grander plans around extending AF_TLS to do DTLS,
then hooking that directly up to the tun socket via BPF and a sockmap
without the data frames ever going to userspace at all.)

For this test case, a hard-coded single address range in BSS is fine.

I've now added !IFF_NO_PI support to the test case, but as noted it
fails just like the other ones I'd already marked with #if 0, which is
because vhost-net pulls some value for 'sock_hlen' out of its posterior
based on some assumption around the vhost features. And then expects
sock_recvmsg() to return precisely that number of bytes more than the
value it peeks in the skb at the head of the sock's queue.

I think I can fix *all* those test cases by making tun_get_socket() take
an extra 'int *' argument, and use that to return the *actual* value of
sock_hlen.
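For context on the sock_hlen guess described above, the header sizes vhost-net assumes fall out of the negotiated feature bits roughly as follows. This is a paraphrase of the logic in drivers/vhost/net.c, written as a standalone sketch rather than a verbatim excerpt:

#include <stdint.h>
#include <stddef.h>
#include <linux/virtio_net.h>
#include <linux/vhost.h>

/* Sketch of how vhost-net derives its header lengths from the features
 * it negotiated, with no input at all from the backend socket. */
static void vhost_net_pick_hlen(uint64_t features,
				size_t *vhost_hlen, size_t *sock_hlen)
{
	size_t hdr_len = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
				      (1ULL << VIRTIO_F_VERSION_1))) ?
		sizeof(struct virtio_net_hdr_mrg_rxbuf) :	/* 12 bytes */
		sizeof(struct virtio_net_hdr);			/* 10 bytes */

	if (features & (1ULL << VHOST_NET_F_VIRTIO_NET_HDR)) {
		/* vhost itself inserts/strips the virtio_net_hdr... */
		*vhost_hlen = hdr_len;
		*sock_hlen = 0;
	} else {
		/* ...otherwise it simply assumes the backend socket handles
		 * a header of exactly hdr_len, which is the assumption the
		 * disabled test cases trip over. */
		*vhost_hlen = 0;
		*sock_hlen = hdr_len;
	}
}

On that reading, the "len 78, expected 74" message in the log below fits: the 4-byte discrepancy is exactly sizeof(struct tun_pi), the extra header that the hard-coded 10-byte assumption does not account for.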
Here's the updated test case in the meantime:

From cf74e3fc80b8fd9df697a42cfc1ff3887de18f78 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@amazon.co.uk>
Date: Wed, 23 Jun 2021 16:38:56 +0100
Subject: [PATCH] test_vhost_net: add test cases with tun_pi header

These fail too, for the same reason as the previous tests were guarded
with #if 0: vhost-net pulls 'sock_hlen' out of its posterior and just
assumes it's 10 bytes. And then barfs when a sock_recvmsg() doesn't
return precisely ten bytes more than it peeked in the head skb:

[1296757.531103] Discarded rx packet: len 78, expected 74

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 .../testing/selftests/vhost/test_vhost_net.c | 97 +++++++++++++------
 1 file changed, 65 insertions(+), 32 deletions(-)

diff --git a/tools/testing/selftests/vhost/test_vhost_net.c b/tools/testing/selftests/vhost/test_vhost_net.c
index fd4a2b0e42f0..734b3015a5bd 100644
--- a/tools/testing/selftests/vhost/test_vhost_net.c
+++ b/tools/testing/selftests/vhost/test_vhost_net.c
@@ -48,7 +48,7 @@ static unsigned char hexchar(char *hex)
 	return (hexnybble(hex[0]) << 4) | hexnybble(hex[1]);
 }
 
-int open_tun(int vnet_hdr_sz, struct in6_addr *addr)
+int open_tun(int vnet_hdr_sz, int pi, struct in6_addr *addr)
 {
 	int tun_fd = open("/dev/net/tun", O_RDWR);
 	if (tun_fd == -1)
@@ -56,7 +56,9 @@ int open_tun(int vnet_hdr_sz, struct in6_addr *addr)
 
 	struct ifreq ifr = { 0 };
 
-	ifr.ifr_flags = IFF_TUN | IFF_NO_PI;
+	ifr.ifr_flags = IFF_TUN;
+	if (!pi)
+		ifr.ifr_flags |= IFF_NO_PI;
 	if (vnet_hdr_sz)
 		ifr.ifr_flags |= IFF_VNET_HDR;
 
@@ -249,11 +251,18 @@ static inline uint16_t csum_finish(uint32_t sum)
 	return htons((uint16_t)(~sum));
 }
 
-static int create_icmp_echo(unsigned char *data, struct in6_addr *dst,
+static int create_icmp_echo(unsigned char *data, int pi, struct in6_addr *dst,
 			    struct in6_addr *src, uint16_t id, uint16_t seq)
 {
 	const int icmplen = ICMP_MINLEN + sizeof(ping_payload);
-	const int plen = sizeof(struct ip6_hdr) + icmplen;
+	int plen = sizeof(struct ip6_hdr) + icmplen;
+
+	if (pi) {
+		struct tun_pi *pi = (void *)data;
+		data += sizeof(*pi);
+		plen += sizeof(*pi);
+		pi->proto = htons(ETH_P_IPV6);
+	}
 
 	struct ip6_hdr *iph = (void *)data;
 	struct icmp6_hdr *icmph = (void *)(data + sizeof(*iph));
@@ -312,8 +321,21 @@ static int create_icmp_echo(unsigned char *data, struct in6_addr *dst,
 }
 
 
-static int check_icmp_response(unsigned char *data, uint32_t len, struct in6_addr *dst, struct in6_addr *src)
+static int check_icmp_response(unsigned char *data, uint32_t len, int pi,
+			       struct in6_addr *dst, struct in6_addr *src)
 {
+	if (pi) {
+		struct tun_pi *pi = (void *)data;
+		if (len < sizeof(*pi))
+			return 0;
+
+		if (pi->proto != htons(ETH_P_IPV6))
+			return 0;
+
+		data += sizeof(*pi);
+		len -= sizeof(*pi);
+	}
+
 	struct ip6_hdr *iph = (void *)data;
 	return ( len >= 41 && (ntohl(iph->ip6_flow) >> 28)==6 /* IPv6 header */
 		 && iph->ip6_nxt == IPPROTO_ICMPV6 /* IPv6 next header field = ICMPv6 */
@@ -337,7 +359,7 @@ static int check_icmp_response(unsigned char *data, uint32_t len, struct in6_add
 #endif
 
 
-int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
+int test_vhost(int vnet_hdr_sz, int pi, int xdp, uint64_t features)
 {
 	int call_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
 	int kick_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
@@ -353,7 +375,7 @@ int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
 	/* Pick up the link-local address that the kernel
 	 * assigns to the tun device. */
 	struct in6_addr tun_addr;
-	tun_fd = open_tun(vnet_hdr_sz, &tun_addr);
+	tun_fd = open_tun(vnet_hdr_sz, pi, &tun_addr);
 	if (tun_fd < 0)
 		goto err;
 
@@ -387,18 +409,18 @@ int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
 	local_addr.s6_addr16[0] = htons(0xfe80);
 	local_addr.s6_addr16[7] = htons(1);
 
+
 	/* Set up RX and TX descriptors; the latter with ping packets ready to
 	 * send to the kernel, but don't actually send them yet. */
 	for (int i = 0; i < RING_SIZE; i++) {
 		struct pkt_buf *pkt = &rings[1].pkts[i];
 
-		int plen = create_icmp_echo(&pkt->data[vnet_hdr_sz], &tun_addr,
-					    &local_addr, 0x4747, i);
+		int plen = create_icmp_echo(&pkt->data[vnet_hdr_sz], pi,
+					    &tun_addr, &local_addr, 0x4747, i);
 
 		rings[1].desc[i].addr = vio64((uint64_t)pkt);
 		rings[1].desc[i].len = vio32(plen + vnet_hdr_sz);
 		rings[1].avail_ring[i] = vio16(i);
 
-
 		pkt = &rings[0].pkts[i];
 		rings[0].desc[i].addr = vio64((uint64_t)pkt);
 		rings[0].desc[i].len = vio32(sizeof(*pkt));
@@ -438,9 +460,10 @@ int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
 			return -1;
 
 		if (check_icmp_response((void *)(addr + vnet_hdr_sz), len - vnet_hdr_sz,
-					&local_addr, &tun_addr)) {
+					pi, &local_addr, &tun_addr)) {
 			ret = 0;
-			printf("Success (%d %d %llx)\n", vnet_hdr_sz, xdp, (unsigned long long)features);
+			printf("Success (hdr %d, xdp %d, pi %d, features %llx)\n",
+			       vnet_hdr_sz, xdp, pi, (unsigned long long)features);
 			goto err;
 		}
 
@@ -466,51 +489,61 @@ int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
 	return ret;
 }
 
-
-int main(void)
+/* Perform the given test with all four combinations of XDP/PI */
+int test_four(int vnet_hdr_sz, uint64_t features)
 {
-	int ret;
-
-	ret = test_vhost(0, 0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
-				(1ULL << VIRTIO_F_VERSION_1)));
+	int ret = test_vhost(vnet_hdr_sz, 0, 0, features);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(0, 1, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
-				(1ULL << VIRTIO_F_VERSION_1)));
+	ret = test_vhost(vnet_hdr_sz, 0, 1, features);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
-
-	ret = test_vhost(0, 0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
+#if 0 /* These don't work *either* for the same reason as the #if 0 later */
+	ret = test_vhost(vnet_hdr_sz, 1, 0, features);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(0, 1, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
+	ret = test_vhost(vnet_hdr_sz, 1, 1, features);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
+#endif
+}
 
-	ret = test_vhost(10, 0, 0);
-	if (ret && ret != KSFT_SKIP)
-		return ret;
+int main(void)
+{
+	int ret;
 
-	ret = test_vhost(10, 1, 0);
+	ret = test_four(10, 0);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-#if 0 /* These ones will fail */
-	ret = test_vhost(0, 0, 0);
+	ret = test_four(0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
+			    (1ULL << VIRTIO_F_VERSION_1)));
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(0, 1, 0);
+	ret = test_four(0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(12, 0, 0);
+
+#if 0
+	/*
+	 * These ones will fail, because right now vhost *assumes* that the
+	 * underlying (tun, etc.) socket will be doing a header of precisely
+	 * sizeof(struct virtio_net_hdr), if vhost isn't doing so itself due
+	 * to VHOST_NET_F_VIRTIO_NET_HDR.
	 *
	 * That assumption breaks both tun with no IFF_VNET_HDR, and also
	 * presumably raw sockets. So leave these test cases disabled for
	 * now until it's fixed.
+	 */
+	ret = test_four(0, 0);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 
-	ret = test_vhost(12, 1, 0);
+	ret = test_four(12, 0);
 	if (ret && ret != KSFT_SKIP)
 		return ret;
 #endif
On 2021/6/24 12:12 AM, David Woodhouse wrote:
> On Wed, 2021-06-23 at 12:02 +0800, Jason Wang wrote:
>> On 2021/6/23 12:15 AM, David Woodhouse wrote:
>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>
>>> This creates a tun device and brings it up, then finds out the link-local
>>> address the kernel automatically assigns to it.
>>>
>>> It sends a ping to that address, from a fake LL address of its own, and
>>> then waits for a response.
>>>
>>> If the virtio_net_hdr stuff is all working correctly, it gets a response
>>> and manages to understand it.
>>
>> I wonder whether it's worth bothering with dependencies like IPv6 or the
>> kernel networking stack.
>>
>> How about simply using a packet socket that is bound to the tun device to
>> receive and send packets?
>>
> I pondered that but figured that using the kernel's network stack
> wasn't too much of an additional dependency. We *could* use an
> AF_PACKET socket on the tun device and then drive both ends, but given
> that the kernel *automatically* assigns a link-local address when we
> bring the device up anyway, it seemed simple enough just to use ICMP.
> I also happened to have the ICMP generation/checking code lying around
> anyway in the same emacs instance, so it was reduced to a previously
> solved problem.

Ok.

> We *should* eventually expand this test case to attach an AF_PACKET
> device to the vhost-net, instead of using a tun device as the back end.
> (Although I don't really see *why* vhost is limited to AF_PACKET. Why
> *can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)

It's just because nobody wrote the code. And we're lacking the real use
case.

Vhost_net is basically used for accepting packets from userspace into
the kernel networking stack.

Using AF_UNIX makes it look more like a case of inter-process
communication (without a vnet header it won't be used efficiently by a
VM). In this case, using io_uring is much more suitable.

Or thinking in another way, instead of depending on vhost_net, we can
expose the TUN/TAP socket to userspace; then io_uring could be used for
the OpenConnect case as well?

>>> +	/*
>>> +	 * I just want to map the *whole* of userspace address space. But
>>> +	 * from userspace I don't know what that is. On x86_64 it would be:
>>> +	 *
>>> +	 * vmem->regions[0].guest_phys_addr = 4096;
>>> +	 * vmem->regions[0].memory_size = 0x7fffffffe000;
>>> +	 * vmem->regions[0].userspace_addr = 4096;
>>> +	 *
>>> +	 * For now, just ensure we put everything inside a single BSS region.
>>> +	 */
>>> +	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
>>> +	vmem->regions[0].userspace_addr = (uint64_t)&rings;
>>> +	vmem->regions[0].memory_size = sizeof(rings);
>>
>> Instead of doing tricks like this, we can do it in another way:
>>
>> 1) enable device IOTLB
>> 2) wait for the IOTLB miss request (iova, len) and update the identity
>>    mapping accordingly
>>
>> This should work for all the archs (with some performance hit).
>
> Ick. For my actual application (OpenConnect) I'm either going to suck
> it up and put in the arch-specific limits like in the comment above, or
> I'll fix things to do the VHOST_F_IDENTITY_MAPPING thing we're talking
> about elsewhere.

The feature could be useful for the case of vhost-vDPA as well.

> (Probably the former, since if I'm requiring kernel
> changes then I have grander plans around extending AF_TLS to do DTLS,
> then hooking that directly up to the tun socket via BPF and a sockmap
> without the data frames ever going to userspace at all.)

Ok, I guess we need to make sockmap work for the tun socket.

> For this test case, a hard-coded single address range in BSS is fine.
>
> I've now added !IFF_NO_PI support to the test case, but as noted it
> fails just like the other ones I'd already marked with #if 0, which is
> because vhost-net pulls some value for 'sock_hlen' out of its posterior
> based on some assumption around the vhost features. And then expects
> sock_recvmsg() to return precisely that number of bytes more than the
> value it peeks in the skb at the head of the sock's queue.
>
> I think I can fix *all* those test cases by making tun_get_socket()
> take an extra 'int *' argument, and use that to return the *actual*
> value of sock_hlen. Here's the updated test case in the meantime:

It would be better if you can post a new version of the whole series to
ease the reviewing.

Thanks
On Thu, 2021-06-24 at 14:12 +0800, Jason Wang wrote: > 在 2021/6/24 上午12:12, David Woodhouse 写道: > > We *should* eventually expand this test case to attach an AF_PACKET > > device to the vhost-net, instead of using a tun device as the back end. > > (Although I don't really see *why* vhost is limited to AF_PACKET. Why > > *can't* I attach anything else, like an AF_UNIX socket, to vhost-net?) > > > It's just because nobody wrote the code. And we're lacking the real use > case. Hm, what code? For AF_PACKET I haven't actually spotted that there *is* any. As I've been refactoring the interaction between vhost and tun/tap, and fixing it up for different vhdr lengths, PI, and (just now) frowning in horror at the concept that tun and vhost can have *different* endiannesses, I hadn't spotted that there was anything special on the packet socket. For that case, sock_hlen is just zero and we send/receive plain packets... or so I thought? Did I miss something? As far as I was aware, that ought to have worked with any datagram socket. I was pondering not just AF_UNIX but also UDP (since that's my main transport for VPN data, at least in the case where I care about performance). An interesting use case for a non-packet socket might be to establish a tunnel. A guest's virtio-net device is just connected to a UDP socket on the host, and that tunnels all their packets to/from a remote endpoint which is where that guest is logically connected to the network. It might be useful for live migration cases, perhaps? I don't have an overriding desire to *make* it work, mind you; I just wanted to make sure I understand *why* it doesn't, if indeed it doesn't. As far as I could tell, it *should* work if we just dropped the check? > Vhost_net is bascially used for accepting packet from userspace to the > kernel networking stack. > > Using AF_UNIX makes it looks more like a case of inter process > communication (without vnet header it won't be used efficiently by VM). > In this case, using io_uring is much more suitable. > > Or thinking in another way, instead of depending on the vhost_net, we > can expose TUN/TAP socket to userspace then io_uring could be used for > the OpenConnect case as well? That would work, I suppose. Although as noted, I *can* use vhost_net with tun today from userspace as long as I disable XDP and PI, and use a virtio_net_hdr that I don't really want. (And pull a value for TASK_SIZE out of my posterior; qv.) So I *can* ship a version of OpenConnect that works on existing production kernels with those workarounds, and I'm fixing up the other permutations of vhost/tun stuff in the kernel just because I figured we *should*. If I'm going to *require* new kernel support for OpenConnect then I might as well go straight to the AF_TLS/DTLS + BPF + sockmap plan and have the data packets never go to userspace in the first place. > > > > > > > > > + /* > > > > + * I just want to map the *whole* of userspace address space. But > > > > + * from userspace I don't know what that is. On x86_64 it would be: > > > > + * > > > > + * vmem->regions[0].guest_phys_addr = 4096; > > > > + * vmem->regions[0].memory_size = 0x7fffffffe000; > > > > + * vmem->regions[0].userspace_addr = 4096; > > > > + * > > > > + * For now, just ensure we put everything inside a single BSS region. 
> > > > + */ > > > > + vmem->regions[0].guest_phys_addr = (uint64_t)&rings; > > > > + vmem->regions[0].userspace_addr = (uint64_t)&rings; > > > > + vmem->regions[0].memory_size = sizeof(rings); > > > > > > Instead of doing tricks like this, we can do it in another way: > > > > > > 1) enable device IOTLB > > > 2) wait for the IOTLB miss request (iova, len) and update identity > > > mapping accordingly > > > > > > This should work for all the archs (with some performance hit). > > > > Ick. For my actual application (OpenConnect) I'm either going to suck > > it up and put in the arch-specific limits like in the comment above, or > > I'll fix things to do the VHOST_F_IDENTITY_MAPPING thing we're talking > > about elsewhere. > > > The feature could be useful for the case of vhost-vDPA as well. > > > > (Probably the former, since if I'm requiring kernel > > changes then I have grander plans around extending AF_TLS to do DTLS, > > then hooking that directly up to the tun socket via BPF and a sockmap > > without the data frames ever going to userspace at all.) > > > Ok, I guess we need to make sockmap works for tun socket. Hm, I need to work out the ideal data flow here. I don't know if sendmsg() / recvmsg() on the tun socket are even what I want, for full zero-copy. In the case where the NIC supports encryption, we want true zero-copy from the moment the "encrypted" packet arrives over UDP on the public network, through the DTLS processing and seqno checking, to being processed as netif_receive_skb() on the tun device. Likewise skbs from tun_net_xmit() should have the appropriate DTLS and IP/UDP headers prepended to them and that *same* skb (or at least the same frags) should be handed to the NIC to encrypt and send. In the case where we have software crypto in the kernel, we can tolerate precisely *one* copy because the crypto doesn't have to be done in-place, so moving from the input to the output crypto buffers can be that one "copy", and we can use it to move data around (from the incoming skb to the outgoing skb) if we need to. Ultimately I think we want udp_sendmsg() and tun_sendmsg() to support being *given* ownership of the buffers, rather than copying from them. Or just being given a skb and pushing/pulling their headers. I'm looking at skb_send_sock() and it *doesn't* seem to support "just steal the frags from the initial skb and give them to the new one", but there may be ways to make that work. > > I think I can fix *all* those test cases by making tun_get_socket() > > take an extra 'int *' argument, and use that to return the *actual* > > value of sock_hlen. Here's the updated test case in the meantime: > > > It would be better if you can post a new version of the whole series to > ease the reviewing. Yep. I was working on that... until I got even more distracted by looking at how we can do the true in-kernel zero-copy option ;)
On 2021/6/24 6:42 PM, David Woodhouse wrote:
> On Thu, 2021-06-24 at 14:12 +0800, Jason Wang wrote:
>> On 2021/6/24 12:12 AM, David Woodhouse wrote:
>>> We *should* eventually expand this test case to attach an AF_PACKET
>>> device to the vhost-net, instead of using a tun device as the back end.
>>> (Although I don't really see *why* vhost is limited to AF_PACKET. Why
>>> *can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)
>>
>> It's just because nobody wrote the code. And we're lacking the real use
>> case.
> Hm, what code?

The code to support AF_UNIX.

> For AF_PACKET I haven't actually spotted that there *is* any.

Vhost_net has had this support for more than 10 years. It's hard to say
there's no user for that.

> As I've been refactoring the interaction between vhost and tun/tap, and
> fixing it up for different vhdr lengths, PI, and (just now) frowning in
> horror at the concept that tun and vhost can have *different*
> endiannesses, I hadn't spotted that there was anything special on the
> packet socket.

Vnet header support.

> For that case, sock_hlen is just zero and we
> send/receive plain packets... or so I thought? Did I miss something?

With a vnet header, it can have GSO and csum offload.

> As far as I was aware, that ought to have worked with any datagram
> socket. I was pondering not just AF_UNIX but also UDP (since that's my
> main transport for VPN data, at least in the case where I care about
> performance).

My understanding is that vhost_net was designed for accelerating the
virtio datapath, which is mainly used for VMs (L2 traffic). So all kinds
of TAPs (tuntap, macvtap or packet socket) are the main users. If you
check the git history, vhost couldn't even be enabled without KVM until
sometime last year.

So I confess it can serve a more general use case, and we have already
had some discussions. But it's hard to say it's worth doing, since it
becomes a re-invention of io_uring?

Another interesting thing is that the copy done by vhost
(copy_from/to_user()) will be much slower than io_uring (GUP), because
the userspace copy may suffer from the performance hit caused by SMAP.

> An interesting use case for a non-packet socket might be to establish a
> tunnel. A guest's virtio-net device is just connected to a UDP socket
> on the host, and that tunnels all their packets to/from a remote
> endpoint which is where that guest is logically connected to the
> network. It might be useful for live migration cases, perhaps?

The kernel already supports tunnels like L2 over L4, which are all done
at the netdevice level (e.g. vxlan). If you want to build a customized
tunnel which is not supported by the kernel, you need to redirect the
traffic back to userspace. vhost-user is probably the best choice in
that case.

> I don't have an overriding desire to *make* it work, mind you; I just
> wanted to make sure I understand *why* it doesn't, if indeed it
> doesn't. As far as I could tell, it *should* work if we just dropped
> the check?

I'm not sure. It requires careful thought. For the case of L2/VM, we
care more about performance, which can be achieved via the vnet header.
For the case of L3 (TUN) or above, we can do that via io_uring. So it
looks to me like it's not worth the bother.

>> Vhost_net is basically used for accepting packets from userspace into
>> the kernel networking stack.
>>
>> Using AF_UNIX makes it look more like a case of inter-process
>> communication (without a vnet header it won't be used efficiently by a
>> VM). In this case, using io_uring is much more suitable.
>>
>> Or thinking in another way, instead of depending on vhost_net, we can
>> expose the TUN/TAP socket to userspace; then io_uring could be used for
>> the OpenConnect case as well?
> That would work, I suppose. Although as noted, I *can* use vhost_net
> with tun today from userspace as long as I disable XDP and PI, and use
> a virtio_net_hdr that I don't really want. (And pull a value for
> TASK_SIZE out of my posterior; qv.)
>
> So I *can* ship a version of OpenConnect that works on existing
> production kernels with those workarounds, and I'm fixing up the other
> permutations of vhost/tun stuff in the kernel just because I figured we
> *should*.
>
> If I'm going to *require* new kernel support for OpenConnect then I
> might as well go straight to the AF_TLS/DTLS + BPF + sockmap plan and
> have the data packets never go to userspace in the first place.

Note that BPF has some limitations:

1) it requires capabilities like CAP_BPF
2) it may need a userspace fallback

>>>>> +	/*
>>>>> +	 * I just want to map the *whole* of userspace address space. But
>>>>> +	 * from userspace I don't know what that is. On x86_64 it would be:
>>>>> +	 *
>>>>> +	 * vmem->regions[0].guest_phys_addr = 4096;
>>>>> +	 * vmem->regions[0].memory_size = 0x7fffffffe000;
>>>>> +	 * vmem->regions[0].userspace_addr = 4096;
>>>>> +	 *
>>>>> +	 * For now, just ensure we put everything inside a single BSS region.
>>>>> +	 */
>>>>> +	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
>>>>> +	vmem->regions[0].userspace_addr = (uint64_t)&rings;
>>>>> +	vmem->regions[0].memory_size = sizeof(rings);
>>>> Instead of doing tricks like this, we can do it in another way:
>>>>
>>>> 1) enable device IOTLB
>>>> 2) wait for the IOTLB miss request (iova, len) and update the identity
>>>>    mapping accordingly
>>>>
>>>> This should work for all the archs (with some performance hit).
>>> Ick. For my actual application (OpenConnect) I'm either going to suck
>>> it up and put in the arch-specific limits like in the comment above, or
>>> I'll fix things to do the VHOST_F_IDENTITY_MAPPING thing we're talking
>>> about elsewhere.
>>
>> The feature could be useful for the case of vhost-vDPA as well.
>>
>>> (Probably the former, since if I'm requiring kernel
>>> changes then I have grander plans around extending AF_TLS to do DTLS,
>>> then hooking that directly up to the tun socket via BPF and a sockmap
>>> without the data frames ever going to userspace at all.)
>>
>> Ok, I guess we need to make sockmap work for the tun socket.
> Hm, I need to work out the ideal data flow here. I don't know if
> sendmsg() / recvmsg() on the tun socket are even what I want, for full
> zero-copy.

Zerocopy could be done via vhost_net. But due to the HOL issue we
disabled it by default.

> In the case where the NIC supports encryption, we want true zero-copy
> from the moment the "encrypted" packet arrives over UDP on the public
> network, through the DTLS processing and seqno checking, to being
> processed as netif_receive_skb() on the tun device.
>
> Likewise skbs from tun_net_xmit() should have the appropriate DTLS and
> IP/UDP headers prepended to them and that *same* skb (or at least the
> same frags) should be handed to the NIC to encrypt and send.
>
> In the case where we have software crypto in the kernel, we can
> tolerate precisely *one* copy because the crypto doesn't have to be
> done in-place, so moving from the input to the output crypto buffers
> can be that one "copy", and we can use it to move data around (from the
> incoming skb to the outgoing skb) if we need to.

I'm not familiar with encryption, but it looks like what you want is TLS
offload support in TUN/TAP.

> Ultimately I think we want udp_sendmsg() and tun_sendmsg() to support
> being *given* ownership of the buffers, rather than copying from them.
> Or just being given an skb and pushing/pulling their headers.

It looks more like you want to add sendpage() support for TUN? The first
step, as discussed, would be the code to expose the TUN socket to
userspace.

> I'm looking at skb_send_sock() and it *doesn't* seem to support "just
> steal the frags from the initial skb and give them to the new one", but
> there may be ways to make that work.

I don't know. Last time I checked, sockmap only supported TCP sockets.
But I saw some work proposed by Wang Cong to make it work for UDP,
probably.

Thanks

>>> I think I can fix *all* those test cases by making tun_get_socket()
>>> take an extra 'int *' argument, and use that to return the *actual*
>>> value of sock_hlen. Here's the updated test case in the meantime:
>>
>> It would be better if you can post a new version of the whole series to
>> ease the reviewing.
> Yep. I was working on that... until I got even more distracted by
> looking at how we can do the true in-kernel zero-copy option ;)
On Fri, 2021-06-25 at 10:55 +0800, Jason Wang wrote:
> On 2021/6/24 6:42 PM, David Woodhouse wrote:
> > On Thu, 2021-06-24 at 14:12 +0800, Jason Wang wrote:
> > > On 2021/6/24 12:12 AM, David Woodhouse wrote:
> > > > We *should* eventually expand this test case to attach an AF_PACKET
> > > > device to the vhost-net, instead of using a tun device as the back end.
> > > > (Although I don't really see *why* vhost is limited to AF_PACKET. Why
> > > > *can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)
> > >
> > > It's just because nobody wrote the code. And we're lacking the real use
> > > case.
> >
> > Hm, what code?
>
> The code to support AF_UNIX.
>
> > For AF_PACKET I haven't actually spotted that there *is* any.
>
> Vhost_net has had this support for more than 10 years. It's hard to say
> there's no user for that.

I wasn't saying I hadn't spotted the use case. I hadn't spotted the
*code* which is in af_packet to support vhost. But...

> > As I've been refactoring the interaction between vhost and tun/tap, and
> > fixing it up for different vhdr lengths, PI, and (just now) frowning in
> > horror at the concept that tun and vhost can have *different*
> > endiannesses, I hadn't spotted that there was anything special on the
> > packet socket.
>
> Vnet header support.

... I have no idea how I failed to spot that. OK, so AF_PACKET sockets
can *optionally* support the case where *they* provide the
virtio_net_hdr — instead of vhost doing it, or there being none. But any
other sockets would work for the "vhost does it" or the "no vhdr" case.

... and I need to fix my 'get sock_hlen from the underlying tun/tap
device' patch to *not* assume that sock_hlen is zero for a raw socket;
it needs to check the PACKET_VNET_HDR sockopt.

And *that* was broken for the VERSION_1|MRG_RXBUF case before I came
along, wasn't it? Because vhost would have assumed sock_hlen to be 12
bytes, while in AF_PACKET it's always only 10?

> > For that case, sock_hlen is just zero and we
> > send/receive plain packets... or so I thought? Did I miss something?
>
> With a vnet header, it can have GSO and csum offload.
>
> > As far as I was aware, that ought to have worked with any datagram
> > socket. I was pondering not just AF_UNIX but also UDP (since that's my
> > main transport for VPN data, at least in the case where I care about
> > performance).
>
> My understanding is that vhost_net was designed for accelerating the
> virtio datapath, which is mainly used for VMs (L2 traffic). So all kinds
> of TAPs (tuntap, macvtap or packet socket) are the main users. If you
> check the git history, vhost couldn't even be enabled without KVM until
> sometime last year.
>
> So I confess it can serve a more general use case, and we have already
> had some discussions. But it's hard to say it's worth doing, since it
> becomes a re-invention of io_uring?

Yeah, ultimately I'm not sure that's worth exploring.

As I said, I was looking for something that works on *current* kernels.
Which means no io_uring on the underlying tun socket, and no vhost on
UDP. If I want to go and implement *both* ring protocols in userspace
and make use of each of them on the socket that they do support, I can
do that. Yay! :)

If I'm going to require new kernels, then I should just work on the
"ideal" data path which doesn't really involve userspace at all. But we
should probably take that discussion to a separate thread.
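To make the PACKET_VNET_HDR point concrete, the sketch below shows how a raw packet socket could be opened with the 10-byte vnet header enabled and attached as a vhost-net backend. The sockopt and ioctl names are real; the helper name, interface handling and missing error checks are illustrative only:

#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <linux/vhost.h>
#include <net/if.h>
#include <netinet/in.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

/* Sketch: an AF_PACKET backend whose sock_hlen would be
 * sizeof(struct virtio_net_hdr) == 10, because PACKET_VNET_HDR is set. */
static int attach_packet_backend(int vhost_fd, const char *ifname, int queue_idx)
{
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	int one = 1;

	/* The packet socket now produces/consumes a virtio_net_hdr itself */
	setsockopt(fd, SOL_PACKET, PACKET_VNET_HDR, &one, sizeof(one));

	struct sockaddr_ll sll = {
		.sll_family   = AF_PACKET,
		.sll_protocol = htons(ETH_P_ALL),
		.sll_ifindex  = if_nametoindex(ifname),
	};
	bind(fd, (struct sockaddr *)&sll, sizeof(sll));

	struct vhost_vring_file vf = { .index = queue_idx, .fd = fd };
	ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &vf);

	return fd;
}

Without the setsockopt() the packet socket sends and receives plain frames with no header at all, which is the zero-sock_hlen case discussed above; with it, the header is always the 10-byte variant regardless of which virtio features vhost negotiated.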
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6c575cf34a71..300c03cfd0c7 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -71,6 +71,7 @@ TARGETS += user
 TARGETS += vDSO
 TARGETS += vm
 TARGETS += x86
+TARGETS += vhost
 TARGETS += zram
 #Please keep the TARGETS list alphabetically sorted
 # Run "make quicktest=1 run_tests" or
diff --git a/tools/testing/selftests/vhost/Makefile b/tools/testing/selftests/vhost/Makefile
new file mode 100644
index 000000000000..f5e565d80733
--- /dev/null
+++ b/tools/testing/selftests/vhost/Makefile
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0
+all:
+
+include ../lib.mk
+
+.PHONY: all clean
+
+BINARIES := test_vhost_net
+
+test_vhost_net: test_vhost_net.c ../kselftest.h ../kselftest_harness.h
+	$(CC) $(CFLAGS) -g $< -o $@
+
+TEST_PROGS += $(BINARIES)
+EXTRA_CLEAN := $(BINARIES)
+
+all: $(BINARIES)
diff --git a/tools/testing/selftests/vhost/config b/tools/testing/selftests/vhost/config
new file mode 100644
index 000000000000..6391c1f32c34
--- /dev/null
+++ b/tools/testing/selftests/vhost/config
@@ -0,0 +1,2 @@
+CONFIG_VHOST_NET=y
+CONFIG_TUN=y
diff --git a/tools/testing/selftests/vhost/test_vhost_net.c b/tools/testing/selftests/vhost/test_vhost_net.c
new file mode 100644
index 000000000000..14acf2c0e049
--- /dev/null
+++ b/tools/testing/selftests/vhost/test_vhost_net.c
@@ -0,0 +1,522 @@
+// SPDX-License-Identifier: LGPL-2.1
+
+#include "../kselftest_harness.h"
+#include "../../../virtio/asm/barrier.h"
+
+#include <sys/eventfd.h>
+
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/wait.h>
+#include <sys/ioctl.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include <net/if.h>
+#include <sys/socket.h>
+
+#include <netinet/tcp.h>
+#include <netinet/ip.h>
+#include <netinet/ip_icmp.h>
+#include <netinet/ip6.h>
+#include <netinet/icmp6.h>
+
+#include <linux/if_tun.h>
+#include <linux/virtio_net.h>
+#include <linux/vhost.h>
+
+static unsigned char hexnybble(char hex)
+{
+	switch (hex) {
+	case '0'...'9':
+		return hex - '0';
+	case 'a'...'f':
+		return 10 + hex - 'a';
+	case 'A'...'F':
+		return 10 + hex - 'A';
+	default:
+		exit (KSFT_SKIP);
+	}
+}
+
+static unsigned char hexchar(char *hex)
+{
+	return (hexnybble(hex[0]) << 4) | hexnybble(hex[1]);
+}
+
+int open_tun(int vnet_hdr_sz, struct in6_addr *addr)
+{
+	int tun_fd = open("/dev/net/tun", O_RDWR);
+	if (tun_fd == -1)
+		return -1;
+
+	struct ifreq ifr = { 0 };
+
+	ifr.ifr_flags = IFF_TUN | IFF_NO_PI;
+	if (vnet_hdr_sz)
+		ifr.ifr_flags |= IFF_VNET_HDR;
+
+	if (ioctl(tun_fd, TUNSETIFF, (void *)&ifr) < 0)
+		goto out_tun;
+
+	if (vnet_hdr_sz &&
+	    ioctl(tun_fd, TUNSETVNETHDRSZ, &vnet_hdr_sz) < 0)
+		goto out_tun;
+
+	int sockfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP);
+	if (sockfd == -1)
+		goto out_tun;
+
+	if (ioctl(sockfd, SIOCGIFFLAGS, (void *)&ifr) < 0)
+		goto out_sock;
+
+	ifr.ifr_flags |= IFF_UP;
+	if (ioctl(sockfd, SIOCSIFFLAGS, (void *)&ifr) < 0)
+		goto out_sock;
+
+	close(sockfd);
+
+	FILE *inet6 = fopen("/proc/net/if_inet6", "r");
+	if (!inet6)
+		goto out_tun;
+
+	char buf[80];
+	while (fgets(buf, sizeof(buf), inet6)) {
+		size_t len = strlen(buf), namelen = strlen(ifr.ifr_name);
+		if (!strncmp(buf, "fe80", 4) &&
+		    buf[len - namelen - 2] == ' ' &&
+		    !strncmp(buf + len - namelen - 1, ifr.ifr_name, namelen)) {
+			for (int i = 0; i < 16; i++) {
+				addr->s6_addr[i] = hexchar(buf + i*2);
+			}
+			fclose(inet6);
+			return tun_fd;
+		}
+	}
+	/* Not found */
+	fclose(inet6);
+ out_sock:
+	close(sockfd);
+ out_tun:
+	close(tun_fd);
+	return -1;
+}
+
+#define RING_SIZE 32
+#define RING_MASK(x) ((x) & (RING_SIZE-1))
+
+struct pkt_buf {
+	unsigned char data[2048];
+};
+
+struct test_vring {
+	struct vring_desc desc[RING_SIZE];
+	struct vring_avail avail;
+	__virtio16 avail_ring[RING_SIZE];
+	struct vring_used used;
+	struct vring_used_elem used_ring[RING_SIZE];
+	struct pkt_buf pkts[RING_SIZE];
+} rings[2];
+
+static int setup_vring(int vhost_fd, int tun_fd, int call_fd, int kick_fd, int idx)
+{
+	struct test_vring *vring = &rings[idx];
+	int ret;
+
+	memset(vring, 0, sizeof(vring));
+
+	struct vhost_vring_state vs = { };
+	vs.index = idx;
+	vs.num = RING_SIZE;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_NUM, &vs) < 0) {
+		perror("VHOST_SET_VRING_NUM");
+		return -1;
+	}
+
+	vs.num = 0;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_BASE, &vs) < 0) {
+		perror("VHOST_SET_VRING_BASE");
+		return -1;
+	}
+
+	struct vhost_vring_addr va = { };
+	va.index = idx;
+	va.desc_user_addr = (uint64_t)vring->desc;
+	va.avail_user_addr = (uint64_t)&vring->avail;
+	va.used_user_addr = (uint64_t)&vring->used;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &va) < 0) {
+		perror("VHOST_SET_VRING_ADDR");
+		return -1;
+	}
+
+	struct vhost_vring_file vf = { };
+	vf.index = idx;
+	vf.fd = tun_fd;
+	if (ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &vf) < 0) {
+		perror("VHOST_NET_SET_BACKEND");
+		return -1;
+	}
+
+	vf.fd = call_fd;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_CALL, &vf) < 0) {
+		perror("VHOST_SET_VRING_CALL");
+		return -1;
+	}
+
+	vf.fd = kick_fd;
+	if (ioctl(vhost_fd, VHOST_SET_VRING_KICK, &vf) < 0) {
+		perror("VHOST_SET_VRING_KICK");
+		return -1;
+	}
+
+	return 0;
+}
+
+int setup_vhost(int vhost_fd, int tun_fd, int call_fd, int kick_fd, uint64_t want_features)
+{
+	int ret;
+
+	if (ioctl(vhost_fd, VHOST_SET_OWNER, NULL) < 0) {
+		perror("VHOST_SET_OWNER");
+		return -1;
+	}
+
+	uint64_t features;
+	if (ioctl(vhost_fd, VHOST_GET_FEATURES, &features) < 0) {
+		perror("VHOST_GET_FEATURES");
+		return -1;
+	}
+
+	if ((features & want_features) != want_features)
+		return KSFT_SKIP;
+
+	if (ioctl(vhost_fd, VHOST_SET_FEATURES, &want_features) < 0) {
+		perror("VHOST_SET_FEATURES");
+		return -1;
+	}
+
+	struct vhost_memory *vmem = alloca(sizeof(*vmem) + sizeof(vmem->regions[0]));
+
+	memset(vmem, 0, sizeof(*vmem) + sizeof(vmem->regions[0]));
+	vmem->nregions = 1;
+	/*
+	 * I just want to map the *whole* of userspace address space. But
+	 * from userspace I don't know what that is. On x86_64 it would be:
+	 *
+	 * vmem->regions[0].guest_phys_addr = 4096;
+	 * vmem->regions[0].memory_size = 0x7fffffffe000;
+	 * vmem->regions[0].userspace_addr = 4096;
+	 *
+	 * For now, just ensure we put everything inside a single BSS region.
+	 */
+	vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
+	vmem->regions[0].userspace_addr = (uint64_t)&rings;
+	vmem->regions[0].memory_size = sizeof(rings);
+
+	if (ioctl(vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {
+		perror("VHOST_SET_MEM_TABLE");
+		return -1;
+	}
+
+	if (setup_vring(vhost_fd, tun_fd, call_fd, kick_fd, 0))
+		return -1;
+
+	if (setup_vring(vhost_fd, tun_fd, call_fd, kick_fd, 1))
+		return -1;
+
+	return 0;
+}
+
+
+static char ping_payload[16] = "VHOST TEST PACKT";
+
+static inline uint32_t csum_partial(uint16_t *buf, int nwords)
+{
+	uint32_t sum = 0;
+	for(sum=0; nwords>0; nwords--)
+		sum += ntohs(*buf++);
+	return sum;
+}
+
+static inline uint16_t csum_finish(uint32_t sum)
+{
+	sum = (sum >> 16) + (sum &0xffff);
+	sum += (sum >> 16);
+	return htons((uint16_t)(~sum));
+}
+
+static int create_icmp_echo(unsigned char *data, struct in6_addr *dst,
+			    struct in6_addr *src, uint16_t id, uint16_t seq)
+{
+	const int icmplen = ICMP_MINLEN + sizeof(ping_payload);
+	const int plen = sizeof(struct ip6_hdr) + icmplen;
+
+	struct ip6_hdr *iph = (void *)data;
+	struct icmp6_hdr *icmph = (void *)(data + sizeof(*iph));
+
+	/* IPv6 Header */
+	iph->ip6_flow = htonl((6 << 28) + /* version 6 */
+			      (0 << 20) + /* traffic class */
+			      (0 << 0));  /* flow ID */
+	iph->ip6_nxt = IPPROTO_ICMPV6;
+	iph->ip6_plen = htons(icmplen);
+	iph->ip6_hlim = 128;
+	iph->ip6_src = *src;
+	iph->ip6_dst = *dst;
+
+	/* ICMPv6 echo request */
+	icmph->icmp6_type = ICMP6_ECHO_REQUEST;
+	icmph->icmp6_code = 0;
+	icmph->icmp6_data16[0] = htons(id);	/* ID */
+	icmph->icmp6_data16[1] = htons(seq);	/* sequence */
+
+	/* Some arbitrary payload */
+	memcpy(&icmph[1], ping_payload, sizeof(ping_payload));
+
+	/*
+	 * IPv6 upper-layer checksums include a pseudo-header
+	 * for IPv6 which contains the source address, the
+	 * destination address, the upper-layer packet length
+	 * and next-header field. See RFC8200 §8.1. The
+	 * checksum is as follows:
+	 *
+	 * checksum 32 bytes of real IPv6 header:
+	 *   src addr (16 bytes)
+	 *   dst addr (16 bytes)
+	 * 8 bytes more:
+	 *   length of ICMPv6 in bytes (be32)
+	 *   3 bytes of 0
+	 *   next header byte (IPPROTO_ICMPV6)
+	 * Then the actual ICMPv6 bytes
+	 */
+	uint32_t sum = csum_partial((uint16_t *)&iph->ip6_src, 8);	/* 8 uint16_t */
+	sum += csum_partial((uint16_t *)&iph->ip6_dst, 8);		/* 8 uint16_t */
+
+	/* The easiest way to checksum the following 8-byte
+	 * part of the pseudo-header without horridly violating
+	 * C type aliasing rules is *not* to build it in memory
+	 * at all. We know the length fits in 16 bits so the
+	 * partial checksum of 00 00 LL LL 00 00 00 NH ends up
+	 * being just LLLL + NH.
+	 */
+	sum += IPPROTO_ICMPV6;
+	sum += ICMP_MINLEN + sizeof(ping_payload);
+
+	sum += csum_partial((uint16_t *)icmph, icmplen / 2);
+	icmph->icmp6_cksum = csum_finish(sum);
+	return plen;
+}
+
+
+static int check_icmp_response(unsigned char *data, uint32_t len, struct in6_addr *dst, struct in6_addr *src)
+{
+	struct ip6_hdr *iph = (void *)data;
+	return ( len >= 41 && (ntohl(iph->ip6_flow) >> 28)==6 /* IPv6 header */
+		 && iph->ip6_nxt == IPPROTO_ICMPV6 /* IPv6 next header field = ICMPv6 */
+		 && !memcmp(&iph->ip6_src, src, 16) /* source == magic address */
+		 && !memcmp(&iph->ip6_dst, dst, 16) /* source == magic address */
+		 && len >= 40 + ICMP_MINLEN + sizeof(ping_payload) /* No short-packet segfaults */
+		 && data[40] == ICMP6_ECHO_REPLY /* ICMPv6 reply */
+		 && !memcmp(&data[40 + ICMP_MINLEN], ping_payload, sizeof(ping_payload)) /* Same payload in response */
+		);
+
+}
+
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+#define vio16(x) (x)
+#define vio32(x) (x)
+#define vio64(x) (x)
+#else
+#define vio16(x) __builtin_bswap16(x)
+#define vio32(x) __builtin_bswap32(x)
+#define vio64(x) __builtin_bswap64(x)
+#endif
+
+
+int test_vhost(int vnet_hdr_sz, int xdp, uint64_t features)
+{
+	int call_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
+	int kick_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK);
+	int vhost_fd = open("/dev/vhost-net", O_RDWR);
+	int tun_fd = -1;
+	int ret = KSFT_SKIP;
+
+	if (call_fd < 0 || kick_fd < 0 || vhost_fd < 0)
+		goto err;
+
+	memset(rings, 0, sizeof(rings));
+
+	/* Pick up the link-local address that the kernel
+	 * assigns to the tun device. */
+	struct in6_addr tun_addr;
+	tun_fd = open_tun(vnet_hdr_sz, &tun_addr);
+	if (tun_fd < 0)
+		goto err;
+
+	if (features & (1ULL << VHOST_NET_F_VIRTIO_NET_HDR)) {
+		if (vnet_hdr_sz) {
+			ret = -1;
+			goto err;
+		}
+
+		vnet_hdr_sz = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+					   (1ULL << VIRTIO_F_VERSION_1))) ?
+			sizeof(struct virtio_net_hdr_mrg_rxbuf) :
+			sizeof(struct virtio_net_hdr);
+	}
+
+	if (!xdp) {
+		int sndbuf = RING_SIZE * 2048;
+		if (ioctl(tun_fd, TUNSETSNDBUF, &sndbuf) < 0) {
+			perror("TUNSETSNDBUF");
+			ret = -1;
+			goto err;
+		}
+	}
+
+	ret = setup_vhost(vhost_fd, tun_fd, call_fd, kick_fd, features);
+	if (ret)
+		goto err;
+
+	/* A fake link-local address for the userspace end */
+	struct in6_addr local_addr = { 0 };
+	local_addr.s6_addr16[0] = htons(0xfe80);
+	local_addr.s6_addr16[7] = htons(1);
+
+	/* Set up RX and TX descriptors; the latter with ping packets ready to
+	 * send to the kernel, but don't actually send them yet.
+	 */
+	for (int i = 0; i < RING_SIZE; i++) {
+		struct pkt_buf *pkt = &rings[1].pkts[i];
+
+		int plen = create_icmp_echo(&pkt->data[vnet_hdr_sz], &tun_addr,
+					    &local_addr, 0x4747, i);
+
+		rings[1].desc[i].addr = vio64((uint64_t)pkt);
+		rings[1].desc[i].len = vio32(plen + vnet_hdr_sz);
+		rings[1].avail_ring[i] = vio16(i);
+
+
+		pkt = &rings[0].pkts[i];
+		rings[0].desc[i].addr = vio64((uint64_t)pkt);
+		rings[0].desc[i].len = vio32(sizeof(*pkt));
+		rings[0].desc[i].flags = vio16(VRING_DESC_F_WRITE);
+		rings[0].avail_ring[i] = vio16(i);
+	}
+	barrier();
+
+	rings[0].avail.idx = RING_SIZE;
+	rings[1].avail.idx = vio16(1);
+
+	barrier();
+	eventfd_write(kick_fd, 1);
+
+	uint16_t rx_seen_used = 0;
+	struct timeval tv = { 1, 0 };
+	while (1) {
+		fd_set rfds = { 0 };
+		FD_SET(call_fd, &rfds);
+
+		if (select(call_fd + 1, &rfds, NULL, NULL, &tv) <= 0) {
+			ret = -1;
+			goto err;
+		}
+
+		uint16_t rx_used_idx = vio16(rings[0].used.idx);
+		barrier();
+
+		while(rx_used_idx != rx_seen_used) {
+			uint32_t desc = vio32(rings[0].used_ring[RING_MASK(rx_seen_used)].id);
+			uint32_t len = vio32(rings[0].used_ring[RING_MASK(rx_seen_used)].len);
+
+			if (desc >= RING_SIZE || len < vnet_hdr_sz)
+				return -1;
+
+			uint64_t addr = vio64(rings[0].desc[desc].addr);
+			if (!addr)
+				return -1;
+
+			if (check_icmp_response((void *)(addr + vnet_hdr_sz), len - vnet_hdr_sz,
+						&local_addr, &tun_addr)) {
+				ret = 0;
+				printf("Success (%d %d %llx)\n", vnet_hdr_sz, xdp, (unsigned long long)features);
+				goto err;
+			}
+			rx_seen_used++;
+
+			/* Give the same buffer back */
+			rings[0].avail.idx = vio16(rx_seen_used + RING_SIZE);
+			barrier();
+			eventfd_write(kick_fd, 1);
+		}
+
+		uint64_t ev_val;
+		eventfd_read(call_fd, &ev_val);
+	}
+
+ err:
+	if (call_fd != -1)
+		close(call_fd);
+	if (kick_fd != -1)
+		close(kick_fd);
+	if (vhost_fd != -1)
+		close(vhost_fd);
+	if (tun_fd != -1)
+		close(tun_fd);
+
+	return ret;
+}
+
+
+int main(void)
+{
+	int ret;
+
+	ret = test_vhost(0, 0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
+				(1ULL << VIRTIO_F_VERSION_1)));
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(0, 1, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
+				(1ULL << VIRTIO_F_VERSION_1)));
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(0, 0, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(0, 1, ((1ULL << VHOST_NET_F_VIRTIO_NET_HDR)));
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(10, 0, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(10, 1, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+#if 0 /* These ones will fail */
+	ret = test_vhost(0, 0, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(0, 1, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(12, 0, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+
+	ret = test_vhost(12, 1, 0);
+	if (ret && ret != KSFT_SKIP)
+		return ret;
+#endif
+
+	return ret;
+}