From patchwork Mon Oct 31 17:41:45 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Wei Xu X-Patchwork-Id: 9406237 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 25B1F60234 for ; Mon, 31 Oct 2016 18:18:41 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0B43B2933D for ; Mon, 31 Oct 2016 18:18:41 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id F182E29353; Mon, 31 Oct 2016 18:18:40 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from lists.gnu.org (lists.gnu.org [208.118.235.17]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 33F9D2933D for ; Mon, 31 Oct 2016 18:18:39 +0000 (UTC) Received: from localhost ([::1]:37675 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1c1HAM-0006bt-9G for patchwork-qemu-devel@patchwork.kernel.org; Mon, 31 Oct 2016 14:18:38 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:38059) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1c1GbG-0005pM-RT for qemu-devel@nongnu.org; Mon, 31 Oct 2016 13:42:26 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1c1GbD-0004Dl-K7 for qemu-devel@nongnu.org; Mon, 31 Oct 2016 13:42:22 -0400 Received: from mx1.redhat.com ([209.132.183.28]:52908) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1c1GbD-0004DH-8E for qemu-devel@nongnu.org; Mon, 31 Oct 2016 13:42:19 -0400 Received: from int-mx13.intmail.prod.int.phx2.redhat.com (int-mx13.intmail.prod.int.phx2.redhat.com [10.5.11.26]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 83B084E4CF for ; Mon, 31 Oct 2016 17:42:18 +0000 (UTC) Received: from wei-thinkpad.nay.redhat.com (vpn1-4-179.pek2.redhat.com [10.72.4.179]) by int-mx13.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id u9VHgBhg032047; Mon, 31 Oct 2016 13:42:15 -0400 From: wexu@redhat.com To: qemu-devel@nongnu.org Date: Tue, 1 Nov 2016 01:41:45 +0800 Message-Id: <1477935706-4250-2-git-send-email-wexu@redhat.com> In-Reply-To: <1477935706-4250-1-git-send-email-wexu@redhat.com> References: <1477935706-4250-1-git-send-email-wexu@redhat.com> X-Scanned-By: MIMEDefang 2.68 on 10.5.11.26 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.38]); Mon, 31 Oct 2016 17:42:18 +0000 (UTC) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 209.132.183.28 Subject: [Qemu-devel] [PATCH 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: jasowang@redhat.com, yvugenfi@redhat.com, dfleytma@redhat.com, Wei Xu , mst@redhat.com Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" X-Virus-Scanned: ClamAV using ClamSMTP From: Wei Xu All the data packets in a tcp connection are cached to a single buffer in every receive interval, and will be sent out via a timer, the 'virtio_net_rsc_timeout' controls the interval, this value may impact the performance and response time of tcp connection, 50000(50us) is an experience value to gain a performance improvement, since the whql test sends packets every 100us, so '300000(300us)' passes the test case, it is the default value as well, tune it via the command line parameter 'rsc_interval' within 'virtio-net-pci' device, for example, to launch a guest with interval set as '500000': 'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00,rsc_interval=500000' The timer will only be triggered if the packets pool is not empty, and it'll drain off all the cached packets. 'NetRscChain' is used to save the segments of IPv4/6 in a VirtIONet device. A new segment becomes a 'Candidate' as well as it passed sanity check, the main handler of TCP includes TCP window update, duplicated ACK check and the real data coalescing. An 'Candidate' segment means: 1. Segment is within current window and the sequence is the expected one. 2. 'ACK' of the segment is in the valid window. Sanity check includes: 1. Incorrect version in IP header 2. An IP options or IP fragment 3. Not a TCP packet 4. Sanity size check to prevent buffer overflow attack. 5. An ECN packet Even though, there might more cases should be considered such as ip identification other flags, while it breaks the test because windows set it to the same even it's not a fragment. Normally it includes 2 typical ways to handle a TCP control flag, 'bypass' and 'finalize', 'bypass' means should be sent out directly, while 'finalize' means the packets should also be bypassed, but this should be done after search for the same connection packets in the pool and drain all of them out, this is to avoid out of order fragment. All the 'SYN' packets will be bypassed since this always begin a new' connection, other flags such 'URG/FIN/RST/CWR/ECE' will trigger a finalization, because this normally happens upon a connection is going to be closed, an 'URG' packet also finalize current coalescing unit. Statistics can be used to monitor the basic coalescing status, the 'out of order' and 'out of window' means how many retransmitting packets, thus describe the performance intuitively. Signed-off-by: Wei Xu --- hw/net/virtio-net.c | 602 ++++++++++++++++++++++++++-- include/hw/virtio/virtio-net.h | 5 +- include/hw/virtio/virtio.h | 76 ++++ include/net/eth.h | 2 + include/standard-headers/linux/virtio_net.h | 14 + net/tap.c | 3 +- 6 files changed, 670 insertions(+), 32 deletions(-) diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index 06bfe4b..d1824d9 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -15,10 +15,12 @@ #include "qemu/iov.h" #include "hw/virtio/virtio.h" #include "net/net.h" +#include "net/eth.h" #include "net/checksum.h" #include "net/tap.h" #include "qemu/error-report.h" #include "qemu/timer.h" +#include "qemu/sockets.h" #include "hw/virtio/virtio-net.h" #include "net/vhost_net.h" #include "hw/virtio/virtio-bus.h" @@ -43,6 +45,24 @@ #define endof(container, field) \ (offsetof(container, field) + sizeof(((container *)0)->field)) +#define VIRTIO_NET_IP4_ADDR_SIZE 8 /* ipv4 saddr + daddr */ + +#define VIRTIO_NET_TCP_FLAG 0x3F +#define VIRTIO_NET_TCP_HDR_LENGTH 0xF000 + +/* IPv4 max payload, 16 bits in the header */ +#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header)) +#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535 + +/* header length value in ip header without option */ +#define VIRTIO_NET_IP4_HEADER_LENGTH 5 + +/* Purge coalesced packets timer interval, This value affects the performance + a lot, and should be tuned carefully, '300000'(300us) is the recommended + value to pass the WHQL test, '50000' can gain 2x netperf throughput with + tso/gso/gro 'off'. */ +#define VIRTIO_NET_RSC_INTERVAL 300000 + typedef struct VirtIOFeature { uint32_t flags; size_t end; @@ -589,7 +609,12 @@ static uint64_t virtio_net_guest_offloads_by_features(uint32_t features) (1ULL << VIRTIO_NET_F_GUEST_ECN) | (1ULL << VIRTIO_NET_F_GUEST_UFO); - return guest_offloads_mask & features; + if (features & VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) { + return (guest_offloads_mask & features) | + (1ULL << VIRTIO_NET_F_GUEST_RSC4); + } else { + return guest_offloads_mask & features; + } } static inline uint64_t virtio_net_supported_guest_offloads(VirtIONet *n) @@ -600,6 +625,7 @@ static inline uint64_t virtio_net_supported_guest_offloads(VirtIONet *n) static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features) { + NetClientState *nc; VirtIONet *n = VIRTIO_NET(vdev); int i; @@ -612,6 +638,22 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features) virtio_has_feature(features, VIRTIO_F_VERSION_1)); + if (virtio_has_feature(features, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) { + n->guest_hdr_len = sizeof(struct virtio_net_hdr_rsc); + n->host_hdr_len = n->guest_hdr_len; + n->has_rsc_hdr = 1; + + for (i = 0; i < n->max_queues; i++) { + nc = qemu_get_subqueue(n->nic, i); + + if (peer_has_vnet_hdr(n) && + qemu_has_vnet_hdr_len(nc->peer, n->guest_hdr_len)) { + qemu_set_vnet_hdr_len(nc->peer, n->guest_hdr_len); + n->host_hdr_len = n->guest_hdr_len; + } + } + } + if (n->has_vnet_hdr) { n->curr_guest_offloads = virtio_net_guest_offloads_by_features(features); @@ -1097,7 +1139,8 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size) return 0; } -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t size) +static ssize_t virtio_net_do_receive(NetClientState *nc, + const uint8_t *buf, size_t size) { VirtIONet *n = qemu_get_nic_opaque(nc); VirtIONetQueue *q = virtio_net_get_subqueue(nc); @@ -1161,6 +1204,12 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t } receive_header(n, sg, elem->in_num, buf, size); + + if (n->has_rsc_hdr) { + offset = sizeof(struct virtio_net_hdr_mrg_rxbuf); + iov_from_buf(sg, elem->in_num, offset, \ + buf + offset, 4); + } offset = n->host_hdr_len; total += n->guest_hdr_len; guest_offset = n->guest_hdr_len; @@ -1239,7 +1288,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q) ssize_t ret; unsigned int out_num; struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1], *out_sg; - struct virtio_net_hdr_mrg_rxbuf mhdr; + struct virtio_net_hdr_rsc rsc_hdr; elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement)); if (!elem) { @@ -1256,26 +1305,27 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q) } if (n->has_vnet_hdr) { - if (iov_to_buf(out_sg, out_num, 0, &mhdr, n->guest_hdr_len) < + if (iov_to_buf(out_sg, out_num, 0, &rsc_hdr, n->guest_hdr_len) < n->guest_hdr_len) { virtio_error(vdev, "virtio-net header incorrect"); virtqueue_detach_element(q->tx_vq, elem, 0); g_free(elem); return -EINVAL; } + if (n->needs_vnet_hdr_swap) { - virtio_net_hdr_swap(vdev, (void *) &mhdr); - sg2[0].iov_base = &mhdr; + virtio_net_hdr_swap(vdev, (void *) &rsc_hdr); + sg2[0].iov_base = &rsc_hdr; sg2[0].iov_len = n->guest_hdr_len; out_num = iov_copy(&sg2[1], ARRAY_SIZE(sg2) - 1, out_sg, out_num, n->guest_hdr_len, -1); if (out_num == VIRTQUEUE_MAX_SIZE) { goto drop; - } + } out_num += 1; out_sg = sg2; - } + } } /* * If host wants to see the guest header as is, we can @@ -1562,8 +1612,12 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f, virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)); - n->status = qemu_get_be16(f); + if (virtio_vdev_has_feature(vdev, VIRTIO_NET_F_GUEST_RSC4)) { + n->guest_hdr_len = sizeof(struct virtio_net_hdr_rsc); + n->host_hdr_len = n->guest_hdr_len; + } + n->status = qemu_get_be16(f); n->promisc = qemu_get_byte(f); n->allmulti = qemu_get_byte(f); @@ -1660,6 +1714,487 @@ static int virtio_net_load_device(VirtIODevice *vdev, QEMUFile *f, return 0; } +static void virtio_net_rsc_extract_unit4(NetRscChain *chain, + const uint8_t *buf, NetRscUnit* unit) +{ + uint16_t ip_hdrlen; + struct ip_header *ip; + + ip = (struct ip_header *)(buf + chain->n->guest_hdr_len + + sizeof(struct eth_header)); + unit->ip = (void *)ip; + ip_hdrlen = (ip->ip_ver_len & 0xF) << 2; + unit->ip_plen = &ip->ip_len; + unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen); + unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10; + unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen; +} + +static void virtio_net_rsc_ipv4_checksum(struct virtio_net_hdr_rsc *rhdr, + struct ip_header *ip) +{ + uint32_t sum; + struct virtio_net_hdr *vhdr = (struct virtio_net_hdr *)rhdr; + + ip->ip_sum = 0; + sum = net_checksum_add_cont(sizeof(struct ip_header), (uint8_t *)ip, 0); + ip->ip_sum = cpu_to_be16(net_checksum_finish(sum)); + vhdr->flags &= ~VIRTIO_NET_HDR_F_NEEDS_CSUM; + vhdr->flags |= VIRTIO_NET_HDR_F_DATA_VALID; +} + +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg) +{ + int ret; + struct virtio_net_hdr_rsc *h; + + h = (struct virtio_net_hdr_rsc *)seg->buf; + if (seg->is_coalesced) { + h->hdr.flags = VIRTIO_NET_HDR_RSC_TCPV4; + virtio_net_rsc_ipv4_checksum(h, seg->unit.ip); + } + + h = (struct virtio_net_hdr_rsc *)seg->buf; + virtio_net_rsc_ipv4_checksum(h, seg->unit.ip); + h->rsc_pkts = seg->packets; + h->rsc_dup_acks = seg->dup_ack; + ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size); + QTAILQ_REMOVE(&chain->buffers, seg, next); + g_free(seg->buf); + g_free(seg); + + return ret; +} + +static void virtio_net_rsc_purge(void *opq) +{ + NetRscSeg *seg, *rn; + NetRscChain *chain = (NetRscChain *)opq; + + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) { + if (virtio_net_rsc_drain_seg(chain, seg) == 0) { + chain->stat.purge_failed++; + continue; + } + } + + chain->stat.timer++; + if (!QTAILQ_EMPTY(&chain->buffers)) { + timer_mod(chain->drain_timer, + qemu_clock_get_ns(QEMU_CLOCK_HOST) + chain->n->rsc_timeout); + } +} + +static void virtio_net_rsc_cleanup(VirtIONet *n) +{ + NetRscChain *chain, *rn_chain; + NetRscSeg *seg, *rn_seg; + + QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) { + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) { + QTAILQ_REMOVE(&chain->buffers, seg, next); + g_free(seg->buf); + g_free(seg); + } + + timer_del(chain->drain_timer); + timer_free(chain->drain_timer); + QTAILQ_REMOVE(&n->rsc_chains, chain, next); + g_free(chain); + } +} + +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc, + const uint8_t *buf, size_t size) +{ + uint16_t hdr_len; + NetRscSeg *seg; + + hdr_len = chain->n->guest_hdr_len; + seg = g_malloc(sizeof(NetRscSeg)); + seg->buf = g_malloc(hdr_len + sizeof(struct eth_header)\ + + VIRTIO_NET_MAX_TCP_PAYLOAD); + memcpy(seg->buf, buf, size); + seg->size = size; + seg->packets = 1; + seg->dup_ack = 0; + seg->is_coalesced = 0; + seg->nc = nc; + + QTAILQ_INSERT_TAIL(&chain->buffers, seg, next); + chain->stat.cache++; + + virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit); +} + +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, + NetRscSeg *seg, const uint8_t *buf, + struct tcp_header *n_tcp, + struct tcp_header *o_tcp) +{ + uint32_t nack, oack; + uint16_t nwin, owin; + + nack = htonl(n_tcp->th_ack); + nwin = htons(n_tcp->th_win); + oack = htonl(o_tcp->th_ack); + owin = htons(o_tcp->th_win); + + if ((nack - oack) >= VIRTIO_NET_MAX_TCP_PAYLOAD) { + chain->stat.ack_out_of_win++; + return RSC_FINAL; + } else if (nack == oack) { + /* duplicated ack or window probe */ + if (nwin == owin) { + /* duplicated ack, add dup ack count due to whql test up to 1 */ + chain->stat.dup_ack++; + return RSC_FINAL; + } else { + /* Coalesce window update */ + o_tcp->th_win = n_tcp->th_win; + chain->stat.win_update++; + return RSC_COALESCE; + } + } else { + /* pure ack, go to 'C', finalize*/ + chain->stat.pure_ack++; + return RSC_FINAL; + } +} + +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, + NetRscSeg *seg, const uint8_t *buf, + NetRscUnit *n_unit) +{ + void *data; + uint16_t o_ip_len; + uint32_t nseq, oseq; + NetRscUnit *o_unit; + + o_unit = &seg->unit; + o_ip_len = htons(*o_unit->ip_plen); + nseq = htonl(n_unit->tcp->th_seq); + oseq = htonl(o_unit->tcp->th_seq); + + /* out of order or retransmitted. */ + if ((nseq - oseq) > VIRTIO_NET_MAX_TCP_PAYLOAD) { + chain->stat.data_out_of_win++; + return RSC_FINAL; + } + + data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen; + if (nseq == oseq) { + if ((o_unit->payload == 0) && n_unit->payload) { + /* From no payload to payload, normal case, not a dup ack or etc */ + chain->stat.data_after_pure_ack++; + goto coalesce; + } else { + return virtio_net_rsc_handle_ack(chain, seg, buf, + n_unit->tcp, o_unit->tcp); + } + } else if ((nseq - oseq) != o_unit->payload) { + /* Not a consistent packet, out of order */ + chain->stat.data_out_of_order++; + return RSC_FINAL; + } else { +coalesce: + if ((o_ip_len + n_unit->payload) > chain->max_payload) { + chain->stat.over_size++; + return RSC_FINAL; + } + + /* Here comes the right data, the payload length in v4/v6 is different, + so use the field value to update and record the new data len */ + o_unit->payload += n_unit->payload; /* update new data len */ + + /* update field in ip header */ + *o_unit->ip_plen = htons(o_ip_len + n_unit->payload); + + /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced + for windows guest, while this may change the behavior for linux + guest. */ + o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags; + + o_unit->tcp->th_ack = n_unit->tcp->th_ack; + o_unit->tcp->th_win = n_unit->tcp->th_win; + + memmove(seg->buf + seg->size, data, n_unit->payload); + seg->size += n_unit->payload; + seg->packets++; + chain->stat.coalesced++; + return RSC_COALESCE; + } +} + +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg, + const uint8_t *buf, size_t size, + NetRscUnit *unit) +{ + struct ip_header *ip1, *ip2; + + ip1 = (struct ip_header *)(unit->ip); + ip2 = (struct ip_header *)(seg->unit.ip); + if ((ip1->ip_src ^ ip2->ip_src) || (ip1->ip_dst ^ ip2->ip_dst) + || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport) + || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) { + chain->stat.no_match++; + return RSC_NO_MATCH; + } + + return virtio_net_rsc_coalesce_data(chain, seg, buf, unit); +} + +/* Pakcets with 'SYN' should bypass, other flag should be sent after drain + * to prevent out of order */ +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain, + struct tcp_header *tcp) +{ + uint16_t tcp_hdr; + uint16_t tcp_flag; + + tcp_flag = htons(tcp->th_offset_flags); + tcp_hdr = (tcp_flag & VIRTIO_NET_TCP_HDR_LENGTH) >> 10; + tcp_flag &= VIRTIO_NET_TCP_FLAG; + tcp_flag = htons(tcp->th_offset_flags) & 0x3F; + if (tcp_flag & TH_SYN) { + chain->stat.tcp_syn++; + return RSC_BYPASS; + } + + if (tcp_flag & (TH_FIN | TH_URG | TH_RST | TH_ECE | TH_CWR)) { + chain->stat.tcp_ctrl_drain++; + return RSC_FINAL; + } + + if (tcp_hdr > sizeof(struct tcp_header)) { + chain->stat.tcp_all_opt++; + return RSC_FINAL; + } + + return RSC_CANDIDATE; +} + +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClientState *nc, + const uint8_t *buf, size_t size, + NetRscUnit *unit) +{ + int ret; + NetRscSeg *seg, *nseg; + + if (QTAILQ_EMPTY(&chain->buffers)) { + chain->stat.empty_cache++; + virtio_net_rsc_cache_buf(chain, nc, buf, size); + timer_mod(chain->drain_timer, + qemu_clock_get_ns(QEMU_CLOCK_HOST) + chain->n->rsc_timeout); + return size; + } + + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) { + ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit); + + if (ret == RSC_FINAL) { + if (virtio_net_rsc_drain_seg(chain, seg) == 0) { + /* Send failed */ + chain->stat.final_failed++; + return 0; + } + + /* Send current packet */ + return virtio_net_do_receive(nc, buf, size); + } else if (ret == RSC_NO_MATCH) { + continue; + } else { + /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */ + seg->is_coalesced = 1; + return size; + } + } + + chain->stat.no_match_cache++; + virtio_net_rsc_cache_buf(chain, nc, buf, size); + return size; +} + +/* Drain a connection data, this is to avoid out of order segments */ +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc, + const uint8_t *buf, size_t size, + uint16_t ip_start, uint16_t ip_size, + uint16_t tcp_port) +{ + NetRscSeg *seg, *nseg; + uint32_t ppair1, ppair2; + + ppair1 = *(uint32_t *)(buf + tcp_port); + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) { + ppair2 = *(uint32_t *)(seg->buf + tcp_port); + if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size) + || (ppair1 != ppair2)) { + continue; + } + if (virtio_net_rsc_drain_seg(chain, seg) == 0) { + chain->stat.drain_failed++; + } + + break; + } + + return virtio_net_do_receive(nc, buf, size); +} + +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain, + struct ip_header *ip, + const uint8_t *buf, size_t size) +{ + uint16_t ip_len; + + /* Not an ipv4 packet */ + if (((ip->ip_ver_len & 0xF0) >> 4) != IP_HEADER_VERSION_4) { + chain->stat.ip_option++; + return RSC_BYPASS; + } + + /* Don't handle packets with ip option */ + if ((ip->ip_ver_len & 0xF) != VIRTIO_NET_IP4_HEADER_LENGTH) { + chain->stat.ip_option++; + return RSC_BYPASS; + } + + if (ip->ip_p != IPPROTO_TCP) { + chain->stat.bypass_not_tcp++; + return RSC_BYPASS; + } + + /* Don't handle packets with ip fragment */ + if (!(htons(ip->ip_off) & IP_DF)) { + chain->stat.ip_frag++; + return RSC_BYPASS; + } + + /* Don't handle packets with ecn flag */ + if (IPTOS_ECN(ip->ip_tos)) { + chain->stat.ip_ecn++; + return RSC_BYPASS; + } + + ip_len = htons(ip->ip_len); + if (ip_len < (sizeof(struct ip_header) + sizeof(struct tcp_header)) + || ip_len > (size - chain->n->guest_hdr_len - + sizeof(struct eth_header))) { + chain->stat.ip_hacked++; + return RSC_BYPASS; + } + + return RSC_CANDIDATE; +} + +static size_t virtio_net_rsc_receive4(NetRscChain *chain, NetClientState* nc, + const uint8_t *buf, size_t size) +{ + int32_t ret; + uint16_t hdr_len; + NetRscUnit unit; + + hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len; + + if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header) + + sizeof(struct tcp_header))) { + chain->stat.bypass_not_tcp++; + return virtio_net_do_receive(nc, buf, size); + } + + virtio_net_rsc_extract_unit4(chain, buf, &unit); + if (virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size) + != RSC_CANDIDATE) { + return virtio_net_do_receive(nc, buf, size); + } + + ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp); + if (ret == RSC_BYPASS) { + return virtio_net_do_receive(nc, buf, size); + } else if (ret == RSC_FINAL) { + return virtio_net_rsc_drain_flow(chain, nc, buf, size, + ((hdr_len + sizeof(struct eth_header)) + 12), + VIRTIO_NET_IP4_ADDR_SIZE, + hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header)); + } + + return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit); +} + +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n, + NetClientState *nc, + uint16_t proto) +{ + NetRscChain *chain; + + if (proto != (uint16_t)ETH_P_IP) { + return NULL; + } + + QTAILQ_FOREACH(chain, &n->rsc_chains, next) { + if (chain->proto == proto) { + return chain; + } + } + + chain = g_malloc(sizeof(*chain)); + chain->n = n; + chain->proto = proto; + chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD; + chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4; + chain->drain_timer = timer_new_ns(QEMU_CLOCK_HOST, + virtio_net_rsc_purge, chain); + memset(&chain->stat, 0, sizeof(chain->stat)); + + QTAILQ_INIT(&chain->buffers); + QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next); + + return chain; +} + +static ssize_t virtio_net_rsc_receive(NetClientState *nc, + const uint8_t *buf, size_t size) +{ + uint16_t proto; + NetRscChain *chain; + struct eth_header *eth; + VirtIONet *n; + + n = qemu_get_nic_opaque(nc); + if (size < (n->host_hdr_len + sizeof(struct eth_header))) { + return virtio_net_do_receive(nc, buf, size); + } + + eth = (struct eth_header *)(buf + n->guest_hdr_len); + proto = htons(eth->h_proto); + + chain = virtio_net_rsc_lookup_chain(n, nc, proto); + if (!chain) { + return virtio_net_do_receive(nc, buf, size); + } else { + chain->stat.received++; + return virtio_net_rsc_receive4(chain, nc, buf, size); + } +} + +static ssize_t virtio_net_receive(NetClientState *nc, + const uint8_t *buf, size_t size) +{ + VirtIONet *n; + struct virtio_net_hdr_rsc *h; + + n = qemu_get_nic_opaque(nc); + if (n->curr_guest_offloads & (1ULL << VIRTIO_NET_F_GUEST_RSC4)) { + h = (struct virtio_net_hdr_rsc *)buf; + h->hdr.flags = VIRTIO_NET_HDR_RSC_NONE; + h->rsc_pkts = 0; + h->rsc_dup_acks = 0; + return virtio_net_rsc_receive(nc, buf, size); + } else { + return virtio_net_do_receive(nc, buf, size); + } +} + static NetClientInfo net_virtio_info = { .type = NET_CLIENT_DRIVER_NIC, .size = sizeof(NICState), @@ -1805,6 +2340,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp) nc = qemu_get_queue(n->nic); nc->rxfilter_notify_enabled = 1; + QTAILQ_INIT(&n->rsc_chains); n->qdev = dev; } @@ -1835,6 +2371,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp) g_free(n->vqs); qemu_del_nic(n->nic); virtio_cleanup(vdev); + virtio_net_rsc_cleanup(n); } static void virtio_net_instance_init(Object *obj) @@ -1872,45 +2409,46 @@ static const VMStateDescription vmstate_virtio_net = { }; static Property virtio_net_properties[] = { - DEFINE_PROP_BIT("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, true), - DEFINE_PROP_BIT("guest_csum", VirtIONet, host_features, + DEFINE_PROP_BIT64("csum", VirtIONet, host_features, + VIRTIO_NET_F_CSUM, true), + DEFINE_PROP_BIT64("guest_csum", VirtIONet, host_features, VIRTIO_NET_F_GUEST_CSUM, true), - DEFINE_PROP_BIT("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true), - DEFINE_PROP_BIT("guest_tso4", VirtIONet, host_features, + DEFINE_PROP_BIT64("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true), + DEFINE_PROP_BIT64("guest_tso4", VirtIONet, host_features, VIRTIO_NET_F_GUEST_TSO4, true), - DEFINE_PROP_BIT("guest_tso6", VirtIONet, host_features, + DEFINE_PROP_BIT64("guest_tso6", VirtIONet, host_features, VIRTIO_NET_F_GUEST_TSO6, true), - DEFINE_PROP_BIT("guest_ecn", VirtIONet, host_features, + DEFINE_PROP_BIT64("guest_ecn", VirtIONet, host_features, VIRTIO_NET_F_GUEST_ECN, true), - DEFINE_PROP_BIT("guest_ufo", VirtIONet, host_features, + DEFINE_PROP_BIT64("guest_ufo", VirtIONet, host_features, VIRTIO_NET_F_GUEST_UFO, true), - DEFINE_PROP_BIT("guest_announce", VirtIONet, host_features, + DEFINE_PROP_BIT64("guest_announce", VirtIONet, host_features, VIRTIO_NET_F_GUEST_ANNOUNCE, true), - DEFINE_PROP_BIT("host_tso4", VirtIONet, host_features, + DEFINE_PROP_BIT64("host_tso4", VirtIONet, host_features, VIRTIO_NET_F_HOST_TSO4, true), - DEFINE_PROP_BIT("host_tso6", VirtIONet, host_features, + DEFINE_PROP_BIT64("host_tso6", VirtIONet, host_features, VIRTIO_NET_F_HOST_TSO6, true), - DEFINE_PROP_BIT("host_ecn", VirtIONet, host_features, + DEFINE_PROP_BIT64("host_ecn", VirtIONet, host_features, VIRTIO_NET_F_HOST_ECN, true), - DEFINE_PROP_BIT("host_ufo", VirtIONet, host_features, + DEFINE_PROP_BIT64("host_ufo", VirtIONet, host_features, VIRTIO_NET_F_HOST_UFO, true), - DEFINE_PROP_BIT("mrg_rxbuf", VirtIONet, host_features, + DEFINE_PROP_BIT64("mrg_rxbuf", VirtIONet, host_features, VIRTIO_NET_F_MRG_RXBUF, true), - DEFINE_PROP_BIT("status", VirtIONet, host_features, + DEFINE_PROP_BIT64("status", VirtIONet, host_features, VIRTIO_NET_F_STATUS, true), - DEFINE_PROP_BIT("ctrl_vq", VirtIONet, host_features, + DEFINE_PROP_BIT64("ctrl_vq", VirtIONet, host_features, VIRTIO_NET_F_CTRL_VQ, true), - DEFINE_PROP_BIT("ctrl_rx", VirtIONet, host_features, + DEFINE_PROP_BIT64("ctrl_rx", VirtIONet, host_features, VIRTIO_NET_F_CTRL_RX, true), - DEFINE_PROP_BIT("ctrl_vlan", VirtIONet, host_features, + DEFINE_PROP_BIT64("ctrl_vlan", VirtIONet, host_features, VIRTIO_NET_F_CTRL_VLAN, true), - DEFINE_PROP_BIT("ctrl_rx_extra", VirtIONet, host_features, + DEFINE_PROP_BIT64("ctrl_rx_extra", VirtIONet, host_features, VIRTIO_NET_F_CTRL_RX_EXTRA, true), - DEFINE_PROP_BIT("ctrl_mac_addr", VirtIONet, host_features, + DEFINE_PROP_BIT64("ctrl_mac_addr", VirtIONet, host_features, VIRTIO_NET_F_CTRL_MAC_ADDR, true), - DEFINE_PROP_BIT("ctrl_guest_offloads", VirtIONet, host_features, + DEFINE_PROP_BIT64("ctrl_guest_offloads", VirtIONet, host_features, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, true), - DEFINE_PROP_BIT("mq", VirtIONet, host_features, VIRTIO_NET_F_MQ, false), + DEFINE_PROP_BIT64("mq", VirtIONet, host_features, VIRTIO_NET_F_MQ, false), DEFINE_NIC_PROPERTIES(VirtIONet, nic_conf), DEFINE_PROP_UINT32("x-txtimer", VirtIONet, net_conf.txtimer, TX_TIMER_INTERVAL), @@ -1918,6 +2456,10 @@ static Property virtio_net_properties[] = { DEFINE_PROP_STRING("tx", VirtIONet, net_conf.tx), DEFINE_PROP_UINT16("rx_queue_size", VirtIONet, net_conf.rx_queue_size, VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE), + DEFINE_PROP_BIT64("guest_rsc4", VirtIONet, host_features, + VIRTIO_NET_F_GUEST_RSC4, true), + DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout, + VIRTIO_NET_RSC_INTERVAL), DEFINE_PROP_END_OF_LIST(), }; diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h index 0ced975..56a8ce2 100644 --- a/include/hw/virtio/virtio-net.h +++ b/include/hw/virtio/virtio-net.h @@ -60,12 +60,15 @@ typedef struct VirtIONet { VirtIONetQueue *vqs; VirtQueue *ctrl_vq; NICState *nic; + QTAILQ_HEAD(, NetRscChain) rsc_chains; + uint32_t rsc_timeout; uint32_t tx_timeout; int32_t tx_burst; uint32_t has_vnet_hdr; + uint32_t has_rsc_hdr; size_t host_hdr_len; size_t guest_hdr_len; - uint32_t host_features; + uint64_t host_features; uint8_t has_ufo; int mergeable_rx_bufs; uint8_t promisc; diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h index b913aac..0006ce1 100644 --- a/include/hw/virtio/virtio.h +++ b/include/hw/virtio/virtio.h @@ -30,6 +30,8 @@ (0x1ULL << VIRTIO_F_ANY_LAYOUT)) struct VirtQueue; +struct VirtIONet; +typedef struct VirtIONet VirtIONet; static inline hwaddr vring_align(hwaddr addr, unsigned long align) @@ -129,6 +131,80 @@ typedef struct VirtioDeviceClass { int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id); } VirtioDeviceClass; +/* Coalesced packets type & status */ +typedef enum { + RSC_COALESCE, /* Data been coalesced */ + RSC_FINAL, /* Will terminate current connection */ + RSC_NO_MATCH, /* No matched in the buffer pool */ + RSC_BYPASS, /* Packet to be bypass, not tcp, tcp ctrl, etc */ + RSC_CANDIDATE /* Data want to be coalesced */ +} COALESCE_STATUS; + +typedef struct NetRscStat { + uint32_t received; + uint32_t coalesced; + uint32_t over_size; + uint32_t cache; + uint32_t empty_cache; + uint32_t no_match_cache; + uint32_t win_update; + uint32_t no_match; + uint32_t tcp_syn; + uint32_t tcp_ctrl_drain; + uint32_t dup_ack; + uint32_t dup_ack1; + uint32_t dup_ack2; + uint32_t pure_ack; + uint32_t ack_out_of_win; + uint32_t data_out_of_win; + uint32_t data_out_of_order; + uint32_t data_after_pure_ack; + uint32_t bypass_not_tcp; + uint32_t tcp_option; + uint32_t tcp_all_opt; + uint32_t ip_frag; + uint32_t ip_ecn; + uint32_t ip_hacked; + uint32_t ip_option; + uint32_t purge_failed; + uint32_t drain_failed; + uint32_t final_failed; + int64_t timer; +} NetRscStat; + +/* Rsc unit general info used to checking if can coalescing */ +typedef struct NetRscUnit { + void *ip; /* ip header */ + uint16_t *ip_plen; /* data len pointer in ip header field */ + struct tcp_header *tcp; /* tcp header */ + uint16_t tcp_hdrlen; /* tcp header len */ + uint16_t payload; /* pure payload without virtio/eth/ip/tcp */ +} NetRscUnit; + +/* Coalesced segmant */ +typedef struct NetRscSeg { + QTAILQ_ENTRY(NetRscSeg) next; + void *buf; + size_t size; + uint16_t packets; + uint16_t dup_ack; + bool is_coalesced; /* need recal ipv4 header checksum, mark here */ + NetRscUnit unit; + NetClientState *nc; +} NetRscSeg; + +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */ +typedef struct NetRscChain { + QTAILQ_ENTRY(NetRscChain) next; + VirtIONet *n; /* VirtIONet */ + uint16_t proto; + uint8_t gso_type; + uint16_t max_payload; + QEMUTimer *drain_timer; + QTAILQ_HEAD(, NetRscSeg) buffers; + NetRscStat stat; +} NetRscChain; + void virtio_instance_init_common(Object *proxy_obj, void *data, size_t vdev_size, const char *vdev_name); diff --git a/include/net/eth.h b/include/net/eth.h index 2013175..5952ef2 100644 --- a/include/net/eth.h +++ b/include/net/eth.h @@ -177,6 +177,8 @@ struct tcp_hdr { #define TH_PUSH 0x08 #define TH_ACK 0x10 #define TH_URG 0x20 +#define TH_ECE 0x40 +#define TH_CWR 0x80 u_short th_win; /* window */ u_short th_sum; /* checksum */ u_short th_urp; /* urgent pointer */ diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h index 30ff249..e67b36e 100644 --- a/include/standard-headers/linux/virtio_net.h +++ b/include/standard-headers/linux/virtio_net.h @@ -57,6 +57,9 @@ * Steering */ #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */ +/* Guest can handle coalesced ipv4-tcp packets */ +#define VIRTIO_NET_F_GUEST_RSC4 41 + #ifndef VIRTIO_NET_NO_LEGACY #define VIRTIO_NET_F_GSO 6 /* Host handles pkts w/ any GSO type */ #endif /* VIRTIO_NET_NO_LEGACY */ @@ -94,6 +97,9 @@ struct virtio_net_hdr_v1 { #define VIRTIO_NET_HDR_GSO_UDP 3 /* GSO frame, IPv4 UDP (UFO) */ #define VIRTIO_NET_HDR_GSO_TCPV6 4 /* GSO frame, IPv6 TCP */ #define VIRTIO_NET_HDR_GSO_ECN 0x80 /* TCP has ECN set */ +#define VIRTIO_NET_HDR_RSC_NONE 5 /* No packets coalesced */ +#define VIRTIO_NET_HDR_RSC_TCPV4 6 /* IPv4 TCP coalesced */ +#define VIRTIO_NET_HDR_RSC_TCPV6 7 /* IPv6 TCP coalesced */ uint8_t gso_type; __virtio16 hdr_len; /* Ethernet + IP + tcp/udp hdrs */ __virtio16 gso_size; /* Bytes to append to hdr_len per frame */ @@ -124,6 +130,14 @@ struct virtio_net_hdr_mrg_rxbuf { struct virtio_net_hdr hdr; __virtio16 num_buffers; /* Number of merged rx buffers */ }; + +/* This is the header to use when either one or both of GUEST_RSC4/6 + * features have been negotiated. */ +struct virtio_net_hdr_rsc { + struct virtio_net_hdr_v1 hdr; + __virtio16 rsc_pkts; /* Number of coalesced packets */ + __virtio16 rsc_dup_acks; /* Duplicated ack packets */ +}; #endif /* ...VIRTIO_NET_NO_LEGACY */ /* diff --git a/net/tap.c b/net/tap.c index b6896a7..4557aa5 100644 --- a/net/tap.c +++ b/net/tap.c @@ -251,7 +251,8 @@ static void tap_set_vnet_hdr_len(NetClientState *nc, int len) TAPState *s = DO_UPCAST(TAPState, nc, nc); assert(nc->info->type == NET_CLIENT_DRIVER_TAP); - assert(len == sizeof(struct virtio_net_hdr_mrg_rxbuf) || + assert(len == sizeof(struct virtio_net_hdr_rsc) || + len == sizeof(struct virtio_net_hdr_mrg_rxbuf) || len == sizeof(struct virtio_net_hdr)); tap_fd_set_vnet_hdr_len(s->fd, len);