diff mbox series

virtio-net: support RSC v4/v6 tcp traffic for Windows HCK

Message ID 20181109145827.23076-1-yuri.benditovich@daynix.com (mailing list archive)
State New, archived
Headers show
Series virtio-net: support RSC v4/v6 tcp traffic for Windows HCK | expand

Commit Message

Yuri Benditovich Nov. 9, 2018, 2:58 p.m. UTC
This commit adds implementation of RX packets
coalescing, compatible with requirements of Windows
Hardware compatibility kit.

The device enables feature VIRTIO_NET_F_RSC_EXT in
host features if it supports extended RSC functionality
as defined in the specification.
This feature requires at least one of VIRTIO_NET_F_GUEST_TSO4,
VIRTIO_NET_F_GUEST_TSO6. Windows guest driver acks
this feature only if VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
is also present.

In case vhost is enabled the feature bit is cleared in
host_features during device initialization.

If the guest driver acks VIRTIO_NET_F_RSC_EXT feature,
the device coalesces TCPv4 and TCPv6 packets (if
respective VIRTIO_NET_F_GUEST_TSO feature is on,
populates extended RSC information in virtio header
and sets VIRTIO_NET_HDR_F_RSC_INFO bit in header flags.
The device does not recalculate checksums in the coalesced
packet, so they are not valid.

In this case:
All the data packets in a tcp connection are cached
to a single buffer in every receive interval, and will
be sent out via a timer, the 'virtio_net_rsc_timeout'
controls the interval, this value may impact the
performance and response time of tcp connection,
50000(50us) is an experience value to gain a performance
improvement, since the whql test sends packets every 100us,
so '300000(300us)' passes the test case, it is the default
value as well, tune it via the command line parameter
'rsc_interval' within 'virtio-net-pci' device, for example,
to launch a guest with interval set as '500000':

'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00,
guest_rsc_ext=on,rsc_interval=500000'

The timer will only be triggered if the packets pool is not empty,
and it'll drain off all the cached packets.

'NetRscChain' is used to save the segments of IPv4/6 in a
VirtIONet device.

A new segment becomes a 'Candidate' as well as it passed sanity check,
the main handler of TCP includes TCP window update, duplicated
ACK check and the real data coalescing.

An 'Candidate' segment means:
1. Segment is within current window and the sequence is the expected one.
2. 'ACK' of the segment is in the valid window.

Sanity check includes:
1. Incorrect version in IP header
2. An IP options or IP fragment
3. Not a TCP packet
4. Sanity size check to prevent buffer overflow attack.
5. An ECN packet

Even though, there might more cases should be considered such as
ip identification other flags, while it breaks the test because
windows set it to the same even it's not a fragment.

Normally it includes 2 typical ways to handle a TCP control flag,
'bypass' and 'finalize', 'bypass' means should be sent out directly,
while 'finalize' means the packets should also be bypassed, but this
should be done after search for the same connection packets in the
pool and drain all of them out, this is to avoid out of order fragment.

All the 'SYN' packets will be bypassed since this always begin a new'
connection, other flags such 'URG/FIN/RST/CWR/ECE' will trigger a
finalization, because this normally happens upon a connection is going
to be closed, an 'URG' packet also finalize current coalescing unit.

Statistics can be used to monitor the basic coalescing status, the
'out of order' and 'out of window' means how many retransmitting packets,
thus describe the performance intuitively.

Difference between ip v4 and v6 processing:
 Fragment length in ipv4 header includes itself, while it's not
 included for ipv6, thus means ipv6 can carry a real 65535 payload.

Signed-off-by: Wei Xu <address@hidden>
Signed-off-by: Yuri Benditovich <yuri.benditovich@daynix.com>
---
 hw/net/virtio-net.c                         | 648 +++++++++++++++++++-
 include/hw/virtio/virtio-net.h              |  81 +++
 include/net/eth.h                           |   2 +
 include/standard-headers/linux/virtio_net.h |   8 +
 4 files changed, 734 insertions(+), 5 deletions(-)

Comments

Michael S. Tsirkin Nov. 9, 2018, 6:11 p.m. UTC | #1
Looks good to me. Some comments below

On Fri, Nov 09, 2018 at 04:58:27PM +0200, Yuri Benditovich wrote:
> This commit adds implementation of RX packets
> coalescing, compatible with requirements of Windows
> Hardware compatibility kit.
> 
> The device enables feature VIRTIO_NET_F_RSC_EXT in
> host features if it supports extended RSC functionality
> as defined in the specification.
> This feature requires at least one of VIRTIO_NET_F_GUEST_TSO4,
> VIRTIO_NET_F_GUEST_TSO6. Windows guest driver acks
> this feature only if VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
> is also present.
> 
> In case vhost is enabled the feature bit is cleared in
> host_features during device initialization.
> 
> If the guest driver acks VIRTIO_NET_F_RSC_EXT feature,
> the device coalesces TCPv4 and TCPv6 packets (if
> respective VIRTIO_NET_F_GUEST_TSO feature is on,
> populates extended RSC information in virtio header
> and sets VIRTIO_NET_HDR_F_RSC_INFO bit in header flags.
> The device does not recalculate checksums in the coalesced
> packet, so they are not valid.
> 
> In this case:
> All the data packets in a tcp connection are cached
> to a single buffer in every receive interval, and will
> be sent out via a timer, the 'virtio_net_rsc_timeout'
> controls the interval, this value may impact the
> performance and response time of tcp connection,
> 50000(50us) is an experience value to gain a performance
> improvement, since the whql test sends packets every 100us,
> so '300000(300us)' passes the test case, it is the default
> value as well, tune it via the command line parameter
> 'rsc_interval' within 'virtio-net-pci' device, for example,
> to launch a guest with interval set as '500000':
> 
> 'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00,
> guest_rsc_ext=on,rsc_interval=500000'
> 
> The timer will only be triggered if the packets pool is not empty,
> and it'll drain off all the cached packets.
> 
> 'NetRscChain' is used to save the segments of IPv4/6 in a
> VirtIONet device.
> 
> A new segment becomes a 'Candidate' as well as it passed sanity check,
> the main handler of TCP includes TCP window update, duplicated
> ACK check and the real data coalescing.
> 
> An 'Candidate' segment means:
> 1. Segment is within current window and the sequence is the expected one.
> 2. 'ACK' of the segment is in the valid window.
> 
> Sanity check includes:
> 1. Incorrect version in IP header
> 2. An IP options or IP fragment
> 3. Not a TCP packet
> 4. Sanity size check to prevent buffer overflow attack.
> 5. An ECN packet
> 
> Even though, there might more cases should be considered such as
> ip identification other flags, while it breaks the test because
> windows set it to the same even it's not a fragment.
> 
> Normally it includes 2 typical ways to handle a TCP control flag,
> 'bypass' and 'finalize', 'bypass' means should be sent out directly,
> while 'finalize' means the packets should also be bypassed, but this
> should be done after search for the same connection packets in the
> pool and drain all of them out, this is to avoid out of order fragment.
> 
> All the 'SYN' packets will be bypassed since this always begin a new'
> connection, other flags such 'URG/FIN/RST/CWR/ECE' will trigger a
> finalization, because this normally happens upon a connection is going
> to be closed, an 'URG' packet also finalize current coalescing unit.
> 
> Statistics can be used to monitor the basic coalescing status, the
> 'out of order' and 'out of window' means how many retransmitting packets,
> thus describe the performance intuitively.
> 
> Difference between ip v4 and v6 processing:
>  Fragment length in ipv4 header includes itself, while it's not
>  included for ipv6, thus means ipv6 can carry a real 65535 payload.
> 
> Signed-off-by: Wei Xu <address@hidden>

Pls put in the full address, QEMU doesn't accept anonymous
code donations.

> Signed-off-by: Yuri Benditovich <yuri.benditovich@daynix.com>
> ---
>  hw/net/virtio-net.c                         | 648 +++++++++++++++++++-
>  include/hw/virtio/virtio-net.h              |  81 +++
>  include/net/eth.h                           |   2 +
>  include/standard-headers/linux/virtio_net.h |   8 +
>  4 files changed, 734 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 385b1a03e9..43a7021409 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -41,6 +41,28 @@
>  #define VIRTIO_NET_RX_QUEUE_MIN_SIZE VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE
>  #define VIRTIO_NET_TX_QUEUE_MIN_SIZE VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE
>  
> +#define VIRTIO_NET_IP4_ADDR_SIZE   8        /* ipv4 saddr + daddr */
> +
> +#define VIRTIO_NET_TCP_FLAG         0x3F
> +#define VIRTIO_NET_TCP_HDR_LENGTH   0xF000
> +
> +/* IPv4 max payload, 16 bits in the header */
> +#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header))
> +#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535
> +
> +/* header length value in ip header without option */
> +#define VIRTIO_NET_IP4_HEADER_LENGTH 5
> +
> +#define ETH_IP6_HDR_SZ (ETH_HDR_SZ + IP6_HDR_SZ)

is this used anywhere?

> +#define VIRTIO_NET_IP6_ADDR_SIZE   32      /* ipv6 saddr + daddr */
> +#define VIRTIO_NET_MAX_IP6_PAYLOAD VIRTIO_NET_MAX_TCP_PAYLOAD
> +
> +/* Purge coalesced packets timer interval, This value affects the performance
> +   a lot, and should be tuned carefully, '300000'(300us) is the recommended
> +   value to pass the WHQL test, '50000' can gain 2x netperf throughput with
> +   tso/gso/gro 'off'. */
> +#define VIRTIO_NET_RSC_INTERVAL 300000

Maybe _DEFAULT_INTERVAL ?

> +
>  /*
>   * Calculate the number of bytes up to and including the given 'field' of
>   * 'container'.
> @@ -628,6 +650,9 @@ static uint64_t virtio_net_get_features(VirtIODevice *vdev, uint64_t features,
>      if (!get_vhost_net(nc->peer)) {
>          return features;
>      }
> +    /* clear RSC_EXT feature in case of VHOST*/

space before *

> +    virtio_clear_feature(&features, VIRTIO_NET_F_RSC_EXT);
> +

why is vhost special?

>      features = vhost_net_get_features(get_vhost_net(nc->peer), features);
>      vdev->backend_features = features;
>  
> @@ -701,6 +726,11 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
>                                 virtio_has_feature(features,
>                                                    VIRTIO_F_VERSION_1));
>  
> +    n->rsc4_enabled = virtio_has_feature(features, VIRTIO_NET_F_RSC_EXT) &&
> +        virtio_has_feature(features, VIRTIO_NET_F_GUEST_TSO4);
> +    n->rsc6_enabled = virtio_has_feature(features, VIRTIO_NET_F_RSC_EXT) &&
> +        virtio_has_feature(features, VIRTIO_NET_F_GUEST_TSO6);
> +
>      if (n->has_vnet_hdr) {
>          n->curr_guest_offloads =
>              virtio_net_guest_offloads_by_features(features);
> @@ -781,6 +811,12 @@ static int virtio_net_handle_offloads(VirtIONet *n, uint8_t cmd,
>              return VIRTIO_NET_ERR;
>          }
>  
> +        n->rsc4_enabled = virtio_has_feature(offloads, VIRTIO_NET_F_RSC_EXT) &&
> +            virtio_has_feature(offloads, VIRTIO_NET_F_GUEST_TSO4);
> +        n->rsc6_enabled = virtio_has_feature(offloads, VIRTIO_NET_F_RSC_EXT) &&
> +            virtio_has_feature(offloads, VIRTIO_NET_F_GUEST_TSO6);
> +        virtio_clear_feature(&offloads, VIRTIO_NET_F_RSC_EXT);
> +
>          supported_offloads = virtio_net_supported_guest_offloads(n);
>          if (offloads & ~supported_offloads) {
>              return VIRTIO_NET_ERR;
> @@ -1253,6 +1289,7 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
>              }
>  
>              receive_header(n, sg, elem->in_num, buf, size);
> +
>              offset = n->host_hdr_len;
>              total += n->guest_hdr_len;
>              guest_offset = n->guest_hdr_len;

let's not do unrelated whitespace tweaks pls

> @@ -1292,7 +1329,7 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
>      return size;
>  }
>  
> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf,
> +static ssize_t virtio_net_do_receive(NetClientState *nc, const uint8_t *buf,
>                                    size_t size)
>  {
>      ssize_t r;
> @@ -1303,6 +1340,601 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf,
>      return r;
>  }
>  
> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
> +                                         const uint8_t *buf, NetRscUnit* unit)


space before * not after

> +{
> +    uint16_t ip_hdrlen;
> +    struct ip_header *ip;
> +
> +    ip = (struct ip_header *)(buf + chain->n->guest_hdr_len
> +                              + sizeof(struct eth_header));
> +    unit->ip = (void *)ip;
> +    ip_hdrlen = (ip->ip_ver_len & 0xF) << 2;
> +    unit->ip_plen = &ip->ip_len;
> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
> +    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
> +}
> +
> +static void virtio_net_rsc_extract_unit6(NetRscChain *chain,
> +                                         const uint8_t *buf, NetRscUnit* unit)
> +{
> +    struct ip6_header *ip6;
> +
> +    ip6 = (struct ip6_header *)(buf + chain->n->guest_hdr_len
> +                                 + sizeof(struct eth_header));
> +    unit->ip = ip6;
> +    unit->ip_plen = &(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip)\
> +                                        + sizeof(struct ip6_header));
> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
> +
> +    /* There is a difference between payload lenght in ipv4 and v6,
> +       ip header is excluded in ipv6 */
> +    unit->payload = htons(*unit->ip_plen) - unit->tcp_hdrlen;
> +}
> +
> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
> +{
> +    int ret;
> +    struct virtio_net_hdr *h;
> +
> +    h = (struct virtio_net_hdr *)seg->buf;
> +    h->flags = 0;
> +    h->gso_type = VIRTIO_NET_HDR_GSO_NONE;
> +
> +    if (seg->is_coalesced) {
> +        h->rsc_ext_num_packets = seg->packets;
> +        h->rsc_ext_num_dupacks = seg->dup_ack;
> +        h->flags = VIRTIO_NET_HDR_F_RSC_INFO;
> +        if (chain->proto == ETH_P_IP) {
> +            h->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> +        } else {
> +            h->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> +        }
> +    }
> +
> +    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
> +    QTAILQ_REMOVE(&chain->buffers, seg, next);
> +    g_free(seg->buf);
> +    g_free(seg);
> +
> +    return ret;
> +}
> +
> +static void virtio_net_rsc_purge(void *opq)
> +{
> +    NetRscSeg *seg, *rn;
> +    NetRscChain *chain = (NetRscChain *)opq;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.purge_failed++;
> +            continue;
> +        }
> +    }
> +
> +    chain->stat.timer++;
> +    if (!QTAILQ_EMPTY(&chain->buffers)) {
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_HOST) + chain->n->rsc_timeout);
> +    }
> +}
> +
> +static void virtio_net_rsc_cleanup(VirtIONet *n)
> +{
> +    NetRscChain *chain, *rn_chain;
> +    NetRscSeg *seg, *rn_seg;
> +
> +    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
> +        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
> +            QTAILQ_REMOVE(&chain->buffers, seg, next);
> +            g_free(seg->buf);
> +            g_free(seg);
> +        }
> +
> +        timer_del(chain->drain_timer);
> +        timer_free(chain->drain_timer);
> +        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
> +        g_free(chain);
> +    }
> +}
> +
> +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
> +                                     const uint8_t *buf, size_t size)
> +{
> +    uint16_t hdr_len;
> +    NetRscSeg *seg;
> +
> +    hdr_len = chain->n->guest_hdr_len;
> +    seg = g_malloc(sizeof(NetRscSeg));
> +    seg->buf = g_malloc(hdr_len + sizeof(struct eth_header)
> +        + sizeof(struct ip6_header) + VIRTIO_NET_MAX_TCP_PAYLOAD);
> +    memcpy(seg->buf, buf, size);
> +    seg->size = size;
> +    seg->packets = 1;
> +    seg->dup_ack = 0;
> +    seg->is_coalesced = 0;
> +    seg->nc = nc;
> +
> +    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
> +    chain->stat.cache++;
> +
> +    switch (chain->proto) {
> +    case ETH_P_IP:
> +        virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
> +        break;
> +    case ETH_P_IPV6:
> +        virtio_net_rsc_extract_unit6(chain, seg->buf, &seg->unit);
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain,
> +                                         NetRscSeg *seg, const uint8_t *buf,
> +                                         struct tcp_header *n_tcp,
> +                                         struct tcp_header *o_tcp)
> +{
> +    uint32_t nack, oack;
> +    uint16_t nwin, owin;
> +
> +    nack = htonl(n_tcp->th_ack);
> +    nwin = htons(n_tcp->th_win);
> +    oack = htonl(o_tcp->th_ack);
> +    owin = htons(o_tcp->th_win);
> +
> +    if ((nack - oack) >= VIRTIO_NET_MAX_TCP_PAYLOAD) {
> +        chain->stat.ack_out_of_win++;
> +        return RSC_FINAL;
> +    } else if (nack == oack) {
> +        /* duplicated ack or window probe */
> +        if (nwin == owin) {
> +            /* duplicated ack, add dup ack count due to whql test up to 1 */
> +            chain->stat.dup_ack++;
> +            return RSC_FINAL;
> +        } else {
> +            /* Coalesce window update */
> +            o_tcp->th_win = n_tcp->th_win;
> +            chain->stat.win_update++;
> +            return RSC_COALESCE;
> +        }
> +    } else {
> +        /* pure ack, go to 'C', finalize*/
> +        chain->stat.pure_ack++;
> +        return RSC_FINAL;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain,
> +                                            NetRscSeg *seg, const uint8_t *buf,
> +                                            NetRscUnit *n_unit)
> +{
> +    void *data;
> +    uint16_t o_ip_len;
> +    uint32_t nseq, oseq;
> +    NetRscUnit *o_unit;
> +
> +    o_unit = &seg->unit;
> +    o_ip_len = htons(*o_unit->ip_plen);
> +    nseq = htonl(n_unit->tcp->th_seq);
> +    oseq = htonl(o_unit->tcp->th_seq);
> +
> +    /* out of order or retransmitted. */
> +    if ((nseq - oseq) > VIRTIO_NET_MAX_TCP_PAYLOAD) {
> +        chain->stat.data_out_of_win++;
> +        return RSC_FINAL;
> +    }
> +
> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
> +    if (nseq == oseq) {
> +        if ((o_unit->payload == 0) && n_unit->payload) {
> +            /* From no payload to payload, normal case, not a dup ack or etc */
> +            chain->stat.data_after_pure_ack++;
> +            goto coalesce;
> +        } else {
> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
> +                                             n_unit->tcp, o_unit->tcp);
> +        }
> +    } else if ((nseq - oseq) != o_unit->payload) {
> +        /* Not a consistent packet, out of order */
> +        chain->stat.data_out_of_order++;
> +        return RSC_FINAL;
> +    } else {
> +coalesce:
> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
> +            chain->stat.over_size++;
> +            return RSC_FINAL;
> +        }
> +
> +        /* Here comes the right data, the payload length in v4/v6 is different,
> +           so use the field value to update and record the new data len */
> +        o_unit->payload += n_unit->payload; /* update new data len */
> +
> +        /* update field in ip header */
> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
> +
> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
> +           for windows guest, while this may change the behavior for linux
> +           guest (only if it uses RSC feature). */
> +        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
> +
> +        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
> +        o_unit->tcp->th_win = n_unit->tcp->th_win;
> +
> +        memmove(seg->buf + seg->size, data, n_unit->payload);
> +        seg->size += n_unit->payload;
> +        seg->packets++;
> +        chain->stat.coalesced++;
> +        return RSC_COALESCE;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
> +                                        const uint8_t *buf, size_t size,
> +                                        NetRscUnit *unit)
> +{
> +    struct ip_header *ip1, *ip2;
> +
> +    ip1 = (struct ip_header *)(unit->ip);
> +    ip2 = (struct ip_header *)(seg->unit.ip);
> +    if ((ip1->ip_src ^ ip2->ip_src) || (ip1->ip_dst ^ ip2->ip_dst)
> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
> +        chain->stat.no_match++;
> +        return RSC_NO_MATCH;
> +    }
> +
> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
> +}
> +
> +static int32_t virtio_net_rsc_coalesce6(NetRscChain *chain, NetRscSeg *seg,
> +                        const uint8_t *buf, size_t size, NetRscUnit *unit)
> +{
> +    struct ip6_header *ip1, *ip2;
> +
> +    ip1 = (struct ip6_header *)(unit->ip);
> +    ip2 = (struct ip6_header *)(seg->unit.ip);
> +    if (memcmp(&ip1->ip6_src, &ip2->ip6_src, sizeof(struct in6_address))
> +        || memcmp(&ip1->ip6_dst, &ip2->ip6_dst, sizeof(struct in6_address))
> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
> +            chain->stat.no_match++;
> +            return RSC_NO_MATCH;
> +    }
> +
> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
> +}
> +
> +/* Packets with 'SYN' should bypass, other flag should be sent after drain
> + * to prevent out of order */
> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
> +                                         struct tcp_header *tcp)
> +{
> +    uint16_t tcp_hdr;
> +    uint16_t tcp_flag;
> +
> +    tcp_flag = htons(tcp->th_offset_flags);
> +    tcp_hdr = (tcp_flag & VIRTIO_NET_TCP_HDR_LENGTH) >> 10;
> +    tcp_flag &= VIRTIO_NET_TCP_FLAG;
> +    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
> +    if (tcp_flag & TH_SYN) {
> +        chain->stat.tcp_syn++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (tcp_flag & (TH_FIN | TH_URG | TH_RST | TH_ECE | TH_CWR)) {
> +        chain->stat.tcp_ctrl_drain++;
> +        return RSC_FINAL;
> +    }
> +
> +    if (tcp_hdr > sizeof(struct tcp_header)) {
> +        chain->stat.tcp_all_opt++;
> +        return RSC_FINAL;
> +    }
> +
> +    return RSC_CANDIDATE;
> +}
> +
> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain,
> +                                         NetClientState *nc,
> +                                         const uint8_t *buf, size_t size,
> +                                         NetRscUnit *unit)
> +{
> +    int ret;
> +    NetRscSeg *seg, *nseg;
> +
> +    if (QTAILQ_EMPTY(&chain->buffers)) {
> +        chain->stat.empty_cache++;
> +        virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_HOST) + chain->n->rsc_timeout);
> +        return size;
> +    }
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        if (chain->proto == ETH_P_IP) {
> +            ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
> +        } else {
> +            ret = virtio_net_rsc_coalesce6(chain, seg, buf, size, unit);
> +        }
> +
> +        if (ret == RSC_FINAL) {
> +            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +                /* Send failed */
> +                chain->stat.final_failed++;
> +                return 0;
> +            }
> +
> +            /* Send current packet */
> +            return virtio_net_do_receive(nc, buf, size);
> +        } else if (ret == RSC_NO_MATCH) {
> +            continue;
> +        } else {
> +            /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */
> +            seg->is_coalesced = 1;
> +            return size;
> +        }
> +    }
> +
> +    chain->stat.no_match_cache++;
> +    virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +    return size;
> +}
> +
> +/* Drain a connection data, this is to avoid out of order segments */
> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
> +                                        const uint8_t *buf, size_t size,
> +                                        uint16_t ip_start, uint16_t ip_size,
> +                                        uint16_t tcp_port)
> +{
> +    NetRscSeg *seg, *nseg;
> +    uint32_t ppair1, ppair2;
> +
> +    ppair1 = *(uint32_t *)(buf + tcp_port);
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        ppair2 = *(uint32_t *)(seg->buf + tcp_port);
> +        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
> +            || (ppair1 != ppair2)) {
> +            continue;
> +        }
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.drain_failed++;
> +        }
> +
> +        break;
> +    }
> +
> +    return virtio_net_do_receive(nc, buf, size);
> +}
> +
> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
> +                                            struct ip_header *ip,
> +                                            const uint8_t *buf, size_t size)
> +{
> +    uint16_t ip_len;
> +
> +    /* Not an ipv4 packet */
> +    if (((ip->ip_ver_len & 0xF0) >> 4) != IP_HEADER_VERSION_4) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip option */
> +    if ((ip->ip_ver_len & 0xF) != VIRTIO_NET_IP4_HEADER_LENGTH) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (ip->ip_p != IPPROTO_TCP) {
> +        chain->stat.bypass_not_tcp++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip fragment */
> +    if (!(htons(ip->ip_off) & IP_DF)) {
> +        chain->stat.ip_frag++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ecn flag */
> +    if (IPTOS_ECN(ip->ip_tos)) {
> +        chain->stat.ip_ecn++;
> +        return RSC_BYPASS;
> +    }
> +
> +    ip_len = htons(ip->ip_len);
> +    if (ip_len < (sizeof(struct ip_header) + sizeof(struct tcp_header))
> +        || ip_len > (size - chain->n->guest_hdr_len -
> +                     sizeof(struct eth_header))) {
> +        chain->stat.ip_hacked++;
> +        return RSC_BYPASS;
> +    }
> +
> +    return RSC_CANDIDATE;
> +}
> +
> +static size_t virtio_net_rsc_receive4(NetRscChain *chain, NetClientState* nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    int32_t ret;
> +    uint16_t hdr_len;
> +    NetRscUnit unit;
> +
> +    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
> +
> +    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header)
> +        + sizeof(struct tcp_header))) {
> +        chain->stat.bypass_not_tcp++;
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    virtio_net_rsc_extract_unit4(chain, buf, &unit);
> +    if (virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)
> +        != RSC_CANDIDATE) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
> +    if (ret == RSC_BYPASS) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else if (ret == RSC_FINAL) {
> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
> +                ((hdr_len + sizeof(struct eth_header)) + 12),
> +                VIRTIO_NET_IP4_ADDR_SIZE,
> +                hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header));
> +    }
> +
> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
> +}
> +
> +static int32_t virtio_net_rsc_sanity_check6(NetRscChain *chain,
> +                                            struct ip6_header *ip6,
> +                                            const uint8_t *buf, size_t size)
> +{
> +    uint16_t ip_len;
> +
> +    if (((ip6->ip6_ctlun.ip6_un1.ip6_un1_flow & 0xF0) >> 4)
> +        != IP_HEADER_VERSION_6) {
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Both option and protocol is checked in this */
> +    if (ip6->ip6_ctlun.ip6_un1.ip6_un1_nxt != IPPROTO_TCP) {
> +        chain->stat.bypass_not_tcp++;
> +        return RSC_BYPASS;
> +    }
> +
> +    ip_len = htons(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
> +    if (ip_len < sizeof(struct tcp_header) ||
> +        ip_len > (size - chain->n->guest_hdr_len - sizeof(struct eth_header)
> +                  - sizeof(struct ip6_header))) {
> +        chain->stat.ip_hacked++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ecn flag */
> +    if (IP6_ECN(ip6->ip6_ctlun.ip6_un3.ip6_un3_ecn)) {
> +        chain->stat.ip_ecn++;
> +        return RSC_BYPASS;
> +    }
> +
> +    return RSC_CANDIDATE;
> +}
> +
> +static size_t virtio_net_rsc_receive6(void *opq, NetClientState* nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    int32_t ret;
> +    uint16_t hdr_len;
> +    NetRscChain *chain;
> +    NetRscUnit unit;
> +
> +    chain = (NetRscChain *)opq;
> +    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
> +
> +    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct ip6_header)
> +        + sizeof(tcp_header))) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    virtio_net_rsc_extract_unit6(chain, buf, &unit);
> +    if (RSC_CANDIDATE != virtio_net_rsc_sanity_check6(chain,
> +                                                 unit.ip, buf, size)) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
> +    if (ret == RSC_BYPASS) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else if (ret == RSC_FINAL) {
> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
> +                ((hdr_len + sizeof(struct eth_header)) + 8),
> +                VIRTIO_NET_IP6_ADDR_SIZE,
> +                hdr_len + sizeof(struct eth_header)
> +                + sizeof(struct ip6_header));
> +    }
> +
> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
> +}
> +
> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
> +                                                NetClientState *nc,
> +                                                uint16_t proto)
> +{
> +    NetRscChain *chain;
> +
> +    if ((proto != (uint16_t)ETH_P_IP) && (proto != (uint16_t)ETH_P_IPV6)) {
> +        return NULL;
> +    }
> +
> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
> +        if (chain->proto == proto) {
> +            return chain;
> +        }
> +    }
> +
> +    chain = g_malloc(sizeof(*chain));
> +    chain->n = n;
> +    chain->proto = proto;
> +    if (proto == (uint16_t)ETH_P_IP) {
> +        chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD;
> +        chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> +    } else {
> +        chain->max_payload = VIRTIO_NET_MAX_IP6_PAYLOAD;
> +        chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> +    }
> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_HOST,
> +                                      virtio_net_rsc_purge, chain);
> +    memset(&chain->stat, 0, sizeof(chain->stat));
> +
> +    QTAILQ_INIT(&chain->buffers);
> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
> +
> +    return chain;
> +}
> +
> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
> +                                      const uint8_t *buf,
> +                                      size_t size)
> +{
> +    uint16_t proto;
> +    NetRscChain *chain;
> +    struct eth_header *eth;
> +    VirtIONet *n;
> +
> +    n = qemu_get_nic_opaque(nc);
> +    if (size < (n->host_hdr_len + sizeof(struct eth_header))) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
> +    proto = htons(eth->h_proto);
> +
> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
> +    if (chain) {
> +        chain->stat.received++;
> +        if (proto == (uint16_t)ETH_P_IP && n->rsc4_enabled) {
> +            return virtio_net_rsc_receive4(chain, nc, buf, size);
> +        } else if (proto == (uint16_t)ETH_P_IPV6 && n->rsc6_enabled) {
> +            return virtio_net_rsc_receive6(chain, nc, buf, size);
> +        }
> +    }
> +    return virtio_net_do_receive(nc, buf, size);
> +}
> +
> +static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf,
> +                                  size_t size)
> +{
> +    VirtIONet *n = qemu_get_nic_opaque(nc);
> +    if ((n->rsc4_enabled || n->rsc6_enabled)) {
> +        return virtio_net_rsc_receive(nc, buf, size);
> +    } else {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +}
> +
>  static int32_t virtio_net_flush_tx(VirtIONetQueue *q);
>  
>  static void virtio_net_tx_complete(NetClientState *nc, ssize_t len)
> @@ -1342,7 +1974,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
>          ssize_t ret;
>          unsigned int out_num;
>          struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1], *out_sg;
> -        struct virtio_net_hdr_mrg_rxbuf mhdr;
> +        struct virtio_net_hdr_mrg_rxbuf hdr;
>  
>          elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement));
>          if (!elem) {
> @@ -1359,7 +1991,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
>          }
>  
>          if (n->has_vnet_hdr) {
> -            if (iov_to_buf(out_sg, out_num, 0, &mhdr, n->guest_hdr_len) <
> +            if (iov_to_buf(out_sg, out_num, 0, &hdr, n->guest_hdr_len) <
>                  n->guest_hdr_len) {
>                  virtio_error(vdev, "virtio-net header incorrect");
>                  virtqueue_detach_element(q->tx_vq, elem, 0);
> @@ -1367,8 +1999,8 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
>                  return -EINVAL;
>              }
>              if (n->needs_vnet_hdr_swap) {
> -                virtio_net_hdr_swap(vdev, (void *) &mhdr);
> -                sg2[0].iov_base = &mhdr;
> +                virtio_net_hdr_swap(vdev, (void *) &hdr);
> +                sg2[0].iov_base = &hdr;
>                  sg2[0].iov_len = n->guest_hdr_len;
>                  out_num = iov_copy(&sg2[1], ARRAY_SIZE(sg2) - 1,
>                                     out_sg, out_num,
> @@ -2075,6 +2707,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>      nc = qemu_get_queue(n->nic);
>      nc->rxfilter_notify_enabled = 1;
>  
> +    QTAILQ_INIT(&n->rsc_chains);
>      n->qdev = dev;
>  }
>  
> @@ -2104,6 +2737,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
>      timer_free(n->announce_timer);
>      g_free(n->vqs);
>      qemu_del_nic(n->nic);
> +    virtio_net_rsc_cleanup(n);
>      virtio_cleanup(vdev);
>  }
>  
> @@ -2184,6 +2818,10 @@ static Property virtio_net_properties[] = {
>      DEFINE_PROP_BIT64("ctrl_guest_offloads", VirtIONet, host_features,
>                      VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, true),
>      DEFINE_PROP_BIT64("mq", VirtIONet, host_features, VIRTIO_NET_F_MQ, false),
> +    DEFINE_PROP_BIT64("guest_rsc_ext", VirtIONet, host_features,
> +                    VIRTIO_NET_F_RSC_EXT, false),
> +    DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout,
> +                       VIRTIO_NET_RSC_INTERVAL),

what if interval is set without guest_rsc_ext? should init fail?

>      DEFINE_NIC_PROPERTIES(VirtIONet, nic_conf),
>      DEFINE_PROP_UINT32("x-txtimer", VirtIONet, net_conf.txtimer,
>                         TX_TIMER_INTERVAL),
> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> index 4d7f3c82ca..0da99dcfa0 100644
> --- a/include/hw/virtio/virtio-net.h
> +++ b/include/hw/virtio/virtio-net.h
> @@ -44,6 +44,83 @@ typedef struct virtio_net_conf
>      uint8_t duplex;
>  } virtio_net_conf;
>  
> +/* Coalesced packets type & status */
> +typedef enum {
> +    RSC_COALESCE,           /* Data been coalesced */
> +    RSC_FINAL,              /* Will terminate current connection */
> +    RSC_NO_MATCH,           /* No matched in the buffer pool */
> +    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp ctrl, etc */
> +    RSC_CANDIDATE                /* Data want to be coalesced */
> +} COALESCE_STATUS;

Should be CoalesceStatus.

> +
> +typedef struct NetRscStat {
> +    uint32_t received;
> +    uint32_t coalesced;
> +    uint32_t over_size;
> +    uint32_t cache;
> +    uint32_t empty_cache;
> +    uint32_t no_match_cache;
> +    uint32_t win_update;
> +    uint32_t no_match;
> +    uint32_t tcp_syn;
> +    uint32_t tcp_ctrl_drain;
> +    uint32_t dup_ack;
> +    uint32_t dup_ack1;
> +    uint32_t dup_ack2;
> +    uint32_t pure_ack;
> +    uint32_t ack_out_of_win;
> +    uint32_t data_out_of_win;
> +    uint32_t data_out_of_order;
> +    uint32_t data_after_pure_ack;
> +    uint32_t bypass_not_tcp;
> +    uint32_t tcp_option;
> +    uint32_t tcp_all_opt;
> +    uint32_t ip_frag;
> +    uint32_t ip_ecn;
> +    uint32_t ip_hacked;
> +    uint32_t ip_option;
> +    uint32_t purge_failed;
> +    uint32_t drain_failed;
> +    uint32_t final_failed;
> +    int64_t  timer;
> +} NetRscStat;
> +
> +/* Rsc unit general info used to checking if can coalescing */
> +typedef struct NetRscUnit {
> +    void *ip;   /* ip header */
> +    uint16_t *ip_plen;      /* data len pointer in ip header field */
> +    struct tcp_header *tcp; /* tcp header */
> +    uint16_t tcp_hdrlen;    /* tcp header len */
> +    uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
> +} NetRscUnit;
> +
> +/* Coalesced segmant */
> +typedef struct NetRscSeg {
> +    QTAILQ_ENTRY(NetRscSeg) next;
> +    void *buf;
> +    size_t size;
> +    uint16_t packets;
> +    uint16_t dup_ack;
> +    bool is_coalesced;      /* need recal ipv4 header checksum, mark here */
> +    NetRscUnit unit;
> +    NetClientState *nc;
> +} NetRscSeg;


If these structs are generic and not virtio specific,
pls put them and code for handling them in generic files
under net/. OTOH if they are virtio specific, please
prefix names with Virtio.

> +
> +struct VirtIONet;
> +typedef struct VirtIONet VirtIONet;
> +
> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
> +typedef struct NetRscChain {
> +    QTAILQ_ENTRY(NetRscChain) next;
> +    VirtIONet *n;                            /* VirtIONet */
> +    uint16_t proto;
> +    uint8_t  gso_type;
> +    uint16_t max_payload;
> +    QEMUTimer *drain_timer;
> +    QTAILQ_HEAD(, NetRscSeg) buffers;
> +    NetRscStat stat;
> +} NetRscChain;
> +
>  /* Maximum packet size we can receive from tap device: header + 64k */
>  #define VIRTIO_NET_MAX_BUFSIZE (sizeof(struct virtio_net_hdr) + (64 * KiB))
>  
> @@ -66,12 +143,16 @@ typedef struct VirtIONet {
>      VirtIONetQueue *vqs;
>      VirtQueue *ctrl_vq;
>      NICState *nic;
> +    QTAILQ_HEAD(, NetRscChain) rsc_chains;

what exactly happens with these chains on migration?

>      uint32_t tx_timeout;
>      int32_t tx_burst;
>      uint32_t has_vnet_hdr;
>      size_t host_hdr_len;
>      size_t guest_hdr_len;
>      uint64_t host_features;
> +    uint32_t rsc_timeout;
> +    uint8_t rsc4_enabled;
> +    uint8_t rsc6_enabled;
>      uint8_t has_ufo;
>      uint32_t mergeable_rx_bufs;
>      uint8_t promisc;
> diff --git a/include/net/eth.h b/include/net/eth.h
> index e6dc8a7ba0..7f45c678e7 100644
> --- a/include/net/eth.h
> +++ b/include/net/eth.h
> @@ -177,6 +177,8 @@ struct tcp_hdr {
>  #define TH_PUSH 0x08
>  #define TH_ACK  0x10
>  #define TH_URG  0x20
> +#define TH_ECE  0x40
> +#define TH_CWR  0x80
>      u_short th_win;      /* window */
>      u_short th_sum;      /* checksum */
>      u_short th_urp;      /* urgent pointer */
> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> index 260c3681d7..0d8658c06a 100644
> --- a/include/standard-headers/linux/virtio_net.h
> +++ b/include/standard-headers/linux/virtio_net.h
> @@ -57,6 +57,10 @@
>  					 * Steering */
>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>  
> +#define VIRTIO_NET_F_RSC_EXT	38

Should it be VIRTIO_NET_F_GUEST_RSC_EXT ?

> +#define VIRTIO_NET_F_GUEST_RSC4_DONT_USE	41	/* reserved */
> +#define VIRTIO_NET_F_GUEST_RSC6_DONT_USE	42	/* reserved */
> +
>  #define VIRTIO_NET_F_STANDBY	  62	/* Act as standby for another device
>  					 * with the same MAC.
>  					 */
> @@ -104,6 +108,7 @@ struct virtio_net_config {
>  struct virtio_net_hdr_v1 {
>  #define VIRTIO_NET_HDR_F_NEEDS_CSUM	1	/* Use csum_start, csum_offset */
>  #define VIRTIO_NET_HDR_F_DATA_VALID	2	/* Csum is valid */
> +#define VIRTIO_NET_HDR_F_RSC_INFO	4	/* rsc_ext data in csum_ fields */
>  	uint8_t flags;
>  #define VIRTIO_NET_HDR_GSO_NONE		0	/* Not a GSO frame */
>  #define VIRTIO_NET_HDR_GSO_TCPV4	1	/* GSO frame, IPv4 TCP (TSO) */
> @@ -118,6 +123,9 @@ struct virtio_net_hdr_v1 {
>  	__virtio16 num_buffers;	/* Number of merged rx buffers */
>  };
>  
> +#define rsc_ext_num_packets		csum_start
> +#define rsc_ext_num_dupacks		csum_offset

I would prefer an inline function to set the field, or a union.


> +
>  #ifndef VIRTIO_NET_NO_LEGACY
>  /* This header comes first in the scatter-gather list.
>   * For legacy virtio, if VIRTIO_F_ANY_LAYOUT is not negotiated, it must

This part needs to get into the Linux header. Pls post there.
Until it does you can put it in virtio-net.c


> -- 
> 2.17.1
Yuri Benditovich Nov. 11, 2018, 10:18 a.m. UTC | #2
On Fri, Nov 9, 2018 at 8:11 PM Michael S. Tsirkin <mst@redhat.com> wrote:

> Looks good to me. Some comments below
>
> On Fri, Nov 09, 2018 at 04:58:27PM +0200, Yuri Benditovich wrote:
> > This commit adds implementation of RX packets
> > coalescing, compatible with requirements of Windows
> > Hardware compatibility kit.
> >
> > The device enables feature VIRTIO_NET_F_RSC_EXT in
> > host features if it supports extended RSC functionality
> > as defined in the specification.
> > This feature requires at least one of VIRTIO_NET_F_GUEST_TSO4,
> > VIRTIO_NET_F_GUEST_TSO6. Windows guest driver acks
> > this feature only if VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
> > is also present.
> >
> > In case vhost is enabled the feature bit is cleared in
> > host_features during device initialization.
> >
> > If the guest driver acks VIRTIO_NET_F_RSC_EXT feature,
> > the device coalesces TCPv4 and TCPv6 packets (if
> > respective VIRTIO_NET_F_GUEST_TSO feature is on,
> > populates extended RSC information in virtio header
> > and sets VIRTIO_NET_HDR_F_RSC_INFO bit in header flags.
> > The device does not recalculate checksums in the coalesced
> > packet, so they are not valid.
> >
> > In this case:
> > All the data packets in a tcp connection are cached
> > to a single buffer in every receive interval, and will
> > be sent out via a timer, the 'virtio_net_rsc_timeout'
> > controls the interval, this value may impact the
> > performance and response time of tcp connection,
> > 50000(50us) is an experience value to gain a performance
> > improvement, since the whql test sends packets every 100us,
> > so '300000(300us)' passes the test case, it is the default
> > value as well, tune it via the command line parameter
> > 'rsc_interval' within 'virtio-net-pci' device, for example,
> > to launch a guest with interval set as '500000':
> >
> > 'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00,
> > guest_rsc_ext=on,rsc_interval=500000'
> >
> > The timer will only be triggered if the packets pool is not empty,
> > and it'll drain off all the cached packets.
> >
> > 'NetRscChain' is used to save the segments of IPv4/6 in a
> > VirtIONet device.
> >
> > A new segment becomes a 'Candidate' as well as it passed sanity check,
> > the main handler of TCP includes TCP window update, duplicated
> > ACK check and the real data coalescing.
> >
> > An 'Candidate' segment means:
> > 1. Segment is within current window and the sequence is the expected one.
> > 2. 'ACK' of the segment is in the valid window.
> >
> > Sanity check includes:
> > 1. Incorrect version in IP header
> > 2. An IP options or IP fragment
> > 3. Not a TCP packet
> > 4. Sanity size check to prevent buffer overflow attack.
> > 5. An ECN packet
> >
> > Even though, there might more cases should be considered such as
> > ip identification other flags, while it breaks the test because
> > windows set it to the same even it's not a fragment.
> >
> > Normally it includes 2 typical ways to handle a TCP control flag,
> > 'bypass' and 'finalize', 'bypass' means should be sent out directly,
> > while 'finalize' means the packets should also be bypassed, but this
> > should be done after search for the same connection packets in the
> > pool and drain all of them out, this is to avoid out of order fragment.
> >
> > All the 'SYN' packets will be bypassed since this always begin a new'
> > connection, other flags such 'URG/FIN/RST/CWR/ECE' will trigger a
> > finalization, because this normally happens upon a connection is going
> > to be closed, an 'URG' packet also finalize current coalescing unit.
> >
> > Statistics can be used to monitor the basic coalescing status, the
> > 'out of order' and 'out of window' means how many retransmitting packets,
> > thus describe the performance intuitively.
> >
> > Difference between ip v4 and v6 processing:
> >  Fragment length in ipv4 header includes itself, while it's not
> >  included for ipv6, thus means ipv6 can carry a real 65535 payload.
> >
> > Signed-off-by: Wei Xu <address@hidden>
>
> Pls put in the full address, QEMU doesn't accept anonymous
> code donations.
>
> > Signed-off-by: Yuri Benditovich <yuri.benditovich@daynix.com>
> > ---
> >  hw/net/virtio-net.c                         | 648 +++++++++++++++++++-
> >  include/hw/virtio/virtio-net.h              |  81 +++
> >  include/net/eth.h                           |   2 +
> >  include/standard-headers/linux/virtio_net.h |   8 +
> >  4 files changed, 734 insertions(+), 5 deletions(-)
> >
> > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> > index 385b1a03e9..43a7021409 100644
> > --- a/hw/net/virtio-net.c
> > +++ b/hw/net/virtio-net.c
> > @@ -41,6 +41,28 @@
> >  #define VIRTIO_NET_RX_QUEUE_MIN_SIZE VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE
> >  #define VIRTIO_NET_TX_QUEUE_MIN_SIZE VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE
> >
> > +#define VIRTIO_NET_IP4_ADDR_SIZE   8        /* ipv4 saddr + daddr */
> > +
> > +#define VIRTIO_NET_TCP_FLAG         0x3F
> > +#define VIRTIO_NET_TCP_HDR_LENGTH   0xF000
> > +
> > +/* IPv4 max payload, 16 bits in the header */
> > +#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header))
> > +#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535
> > +
> > +/* header length value in ip header without option */
> > +#define VIRTIO_NET_IP4_HEADER_LENGTH 5
> > +
> > +#define ETH_IP6_HDR_SZ (ETH_HDR_SZ + IP6_HDR_SZ)
>
> is this used anywhere?
>
> > +#define VIRTIO_NET_IP6_ADDR_SIZE   32      /* ipv6 saddr + daddr */
> > +#define VIRTIO_NET_MAX_IP6_PAYLOAD VIRTIO_NET_MAX_TCP_PAYLOAD
> > +
> > +/* Purge coalesced packets timer interval, This value affects the
> performance
> > +   a lot, and should be tuned carefully, '300000'(300us) is the
> recommended
> > +   value to pass the WHQL test, '50000' can gain 2x netperf throughput
> with
> > +   tso/gso/gro 'off'. */
> > +#define VIRTIO_NET_RSC_INTERVAL 300000
>
> Maybe _DEFAULT_INTERVAL ?
>
> > +
> >  /*
> >   * Calculate the number of bytes up to and including the given 'field'
> of
> >   * 'container'.
> > @@ -628,6 +650,9 @@ static uint64_t virtio_net_get_features(VirtIODevice
> *vdev, uint64_t features,
> >      if (!get_vhost_net(nc->peer)) {
> >          return features;
> >      }
> > +    /* clear RSC_EXT feature in case of VHOST*/
>
> space before *
>
> > +    virtio_clear_feature(&features, VIRTIO_NET_F_RSC_EXT);
> > +
>
> why is vhost special?
>
> >      features = vhost_net_get_features(get_vhost_net(nc->peer),
> features);
> >      vdev->backend_features = features;
> >
> > @@ -701,6 +726,11 @@ static void virtio_net_set_features(VirtIODevice
> *vdev, uint64_t features)
> >                                 virtio_has_feature(features,
> >                                                    VIRTIO_F_VERSION_1));
> >
> > +    n->rsc4_enabled = virtio_has_feature(features,
> VIRTIO_NET_F_RSC_EXT) &&
> > +        virtio_has_feature(features, VIRTIO_NET_F_GUEST_TSO4);
> > +    n->rsc6_enabled = virtio_has_feature(features,
> VIRTIO_NET_F_RSC_EXT) &&
> > +        virtio_has_feature(features, VIRTIO_NET_F_GUEST_TSO6);
> > +
> >      if (n->has_vnet_hdr) {
> >          n->curr_guest_offloads =
> >              virtio_net_guest_offloads_by_features(features);
> > @@ -781,6 +811,12 @@ static int virtio_net_handle_offloads(VirtIONet *n,
> uint8_t cmd,
> >              return VIRTIO_NET_ERR;
> >          }
> >
> > +        n->rsc4_enabled = virtio_has_feature(offloads,
> VIRTIO_NET_F_RSC_EXT) &&
> > +            virtio_has_feature(offloads, VIRTIO_NET_F_GUEST_TSO4);
> > +        n->rsc6_enabled = virtio_has_feature(offloads,
> VIRTIO_NET_F_RSC_EXT) &&
> > +            virtio_has_feature(offloads, VIRTIO_NET_F_GUEST_TSO6);
> > +        virtio_clear_feature(&offloads, VIRTIO_NET_F_RSC_EXT);
> > +
> >          supported_offloads = virtio_net_supported_guest_offloads(n);
> >          if (offloads & ~supported_offloads) {
> >              return VIRTIO_NET_ERR;
> > @@ -1253,6 +1289,7 @@ static ssize_t
> virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
> >              }
> >
> >              receive_header(n, sg, elem->in_num, buf, size);
> > +
> >              offset = n->host_hdr_len;
> >              total += n->guest_hdr_len;
> >              guest_offset = n->guest_hdr_len;
>
> let's not do unrelated whitespace tweaks pls
>
> > @@ -1292,7 +1329,7 @@ static ssize_t
> virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
> >      return size;
> >  }
> >
> > -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t
> *buf,
> > +static ssize_t virtio_net_do_receive(NetClientState *nc, const uint8_t
> *buf,
> >                                    size_t size)
> >  {
> >      ssize_t r;
> > @@ -1303,6 +1340,601 @@ static ssize_t virtio_net_receive(NetClientState
> *nc, const uint8_t *buf,
> >      return r;
> >  }
> >
> > +static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
> > +                                         const uint8_t *buf,
> NetRscUnit* unit)
>
>
> space before * not after
>
> > +{
> > +    uint16_t ip_hdrlen;
> > +    struct ip_header *ip;
> > +
> > +    ip = (struct ip_header *)(buf + chain->n->guest_hdr_len
> > +                              + sizeof(struct eth_header));
> > +    unit->ip = (void *)ip;
> > +    ip_hdrlen = (ip->ip_ver_len & 0xF) << 2;
> > +    unit->ip_plen = &ip->ip_len;
> > +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) +
> ip_hdrlen);
> > +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >>
> 10;
> > +    unit->payload = htons(*unit->ip_plen) - ip_hdrlen -
> unit->tcp_hdrlen;
> > +}
> > +
> > +static void virtio_net_rsc_extract_unit6(NetRscChain *chain,
> > +                                         const uint8_t *buf,
> NetRscUnit* unit)
> > +{
> > +    struct ip6_header *ip6;
> > +
> > +    ip6 = (struct ip6_header *)(buf + chain->n->guest_hdr_len
> > +                                 + sizeof(struct eth_header));
> > +    unit->ip = ip6;
> > +    unit->ip_plen = &(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
> > +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip)\
> > +                                        + sizeof(struct ip6_header));
> > +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >>
> 10;
> > +
> > +    /* There is a difference between payload lenght in ipv4 and v6,
> > +       ip header is excluded in ipv6 */
> > +    unit->payload = htons(*unit->ip_plen) - unit->tcp_hdrlen;
> > +}
> > +
> > +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg
> *seg)
> > +{
> > +    int ret;
> > +    struct virtio_net_hdr *h;
> > +
> > +    h = (struct virtio_net_hdr *)seg->buf;
> > +    h->flags = 0;
> > +    h->gso_type = VIRTIO_NET_HDR_GSO_NONE;
> > +
> > +    if (seg->is_coalesced) {
> > +        h->rsc_ext_num_packets = seg->packets;
> > +        h->rsc_ext_num_dupacks = seg->dup_ack;
> > +        h->flags = VIRTIO_NET_HDR_F_RSC_INFO;
> > +        if (chain->proto == ETH_P_IP) {
> > +            h->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> > +        } else {
> > +            h->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> > +        }
> > +    }
> > +
> > +    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
> > +    QTAILQ_REMOVE(&chain->buffers, seg, next);
> > +    g_free(seg->buf);
> > +    g_free(seg);
> > +
> > +    return ret;
> > +}
> > +
> > +static void virtio_net_rsc_purge(void *opq)
> > +{
> > +    NetRscSeg *seg, *rn;
> > +    NetRscChain *chain = (NetRscChain *)opq;
> > +
> > +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
> > +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> > +            chain->stat.purge_failed++;
> > +            continue;
> > +        }
> > +    }
> > +
> > +    chain->stat.timer++;
> > +    if (!QTAILQ_EMPTY(&chain->buffers)) {
> > +        timer_mod(chain->drain_timer,
> > +              qemu_clock_get_ns(QEMU_CLOCK_HOST) +
> chain->n->rsc_timeout);
> > +    }
> > +}
> > +
> > +static void virtio_net_rsc_cleanup(VirtIONet *n)
> > +{
> > +    NetRscChain *chain, *rn_chain;
> > +    NetRscSeg *seg, *rn_seg;
> > +
> > +    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
> > +        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
> > +            QTAILQ_REMOVE(&chain->buffers, seg, next);
> > +            g_free(seg->buf);
> > +            g_free(seg);
> > +        }
> > +
> > +        timer_del(chain->drain_timer);
> > +        timer_free(chain->drain_timer);
> > +        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
> > +        g_free(chain);
> > +    }
> > +}
> > +
> > +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState
> *nc,
> > +                                     const uint8_t *buf, size_t size)
> > +{
> > +    uint16_t hdr_len;
> > +    NetRscSeg *seg;
> > +
> > +    hdr_len = chain->n->guest_hdr_len;
> > +    seg = g_malloc(sizeof(NetRscSeg));
> > +    seg->buf = g_malloc(hdr_len + sizeof(struct eth_header)
> > +        + sizeof(struct ip6_header) + VIRTIO_NET_MAX_TCP_PAYLOAD);
> > +    memcpy(seg->buf, buf, size);
> > +    seg->size = size;
> > +    seg->packets = 1;
> > +    seg->dup_ack = 0;
> > +    seg->is_coalesced = 0;
> > +    seg->nc = nc;
> > +
> > +    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
> > +    chain->stat.cache++;
> > +
> > +    switch (chain->proto) {
> > +    case ETH_P_IP:
> > +        virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
> > +        break;
> > +    case ETH_P_IPV6:
> > +        virtio_net_rsc_extract_unit6(chain, seg->buf, &seg->unit);
> > +        break;
> > +    default:
> > +        g_assert_not_reached();
> > +    }
> > +}
> > +
> > +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain,
> > +                                         NetRscSeg *seg, const uint8_t
> *buf,
> > +                                         struct tcp_header *n_tcp,
> > +                                         struct tcp_header *o_tcp)
> > +{
> > +    uint32_t nack, oack;
> > +    uint16_t nwin, owin;
> > +
> > +    nack = htonl(n_tcp->th_ack);
> > +    nwin = htons(n_tcp->th_win);
> > +    oack = htonl(o_tcp->th_ack);
> > +    owin = htons(o_tcp->th_win);
> > +
> > +    if ((nack - oack) >= VIRTIO_NET_MAX_TCP_PAYLOAD) {
> > +        chain->stat.ack_out_of_win++;
> > +        return RSC_FINAL;
> > +    } else if (nack == oack) {
> > +        /* duplicated ack or window probe */
> > +        if (nwin == owin) {
> > +            /* duplicated ack, add dup ack count due to whql test up to
> 1 */
> > +            chain->stat.dup_ack++;
> > +            return RSC_FINAL;
> > +        } else {
> > +            /* Coalesce window update */
> > +            o_tcp->th_win = n_tcp->th_win;
> > +            chain->stat.win_update++;
> > +            return RSC_COALESCE;
> > +        }
> > +    } else {
> > +        /* pure ack, go to 'C', finalize*/
> > +        chain->stat.pure_ack++;
> > +        return RSC_FINAL;
> > +    }
> > +}
> > +
> > +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain,
> > +                                            NetRscSeg *seg, const
> uint8_t *buf,
> > +                                            NetRscUnit *n_unit)
> > +{
> > +    void *data;
> > +    uint16_t o_ip_len;
> > +    uint32_t nseq, oseq;
> > +    NetRscUnit *o_unit;
> > +
> > +    o_unit = &seg->unit;
> > +    o_ip_len = htons(*o_unit->ip_plen);
> > +    nseq = htonl(n_unit->tcp->th_seq);
> > +    oseq = htonl(o_unit->tcp->th_seq);
> > +
> > +    /* out of order or retransmitted. */
> > +    if ((nseq - oseq) > VIRTIO_NET_MAX_TCP_PAYLOAD) {
> > +        chain->stat.data_out_of_win++;
> > +        return RSC_FINAL;
> > +    }
> > +
> > +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
> > +    if (nseq == oseq) {
> > +        if ((o_unit->payload == 0) && n_unit->payload) {
> > +            /* From no payload to payload, normal case, not a dup ack
> or etc */
> > +            chain->stat.data_after_pure_ack++;
> > +            goto coalesce;
> > +        } else {
> > +            return virtio_net_rsc_handle_ack(chain, seg, buf,
> > +                                             n_unit->tcp, o_unit->tcp);
> > +        }
> > +    } else if ((nseq - oseq) != o_unit->payload) {
> > +        /* Not a consistent packet, out of order */
> > +        chain->stat.data_out_of_order++;
> > +        return RSC_FINAL;
> > +    } else {
> > +coalesce:
> > +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
> > +            chain->stat.over_size++;
> > +            return RSC_FINAL;
> > +        }
> > +
> > +        /* Here comes the right data, the payload length in v4/v6 is
> different,
> > +           so use the field value to update and record the new data len
> */
> > +        o_unit->payload += n_unit->payload; /* update new data len */
> > +
> > +        /* update field in ip header */
> > +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
> > +
> > +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be
> coalesced
> > +           for windows guest, while this may change the behavior for
> linux
> > +           guest (only if it uses RSC feature). */
> > +        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
> > +
> > +        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
> > +        o_unit->tcp->th_win = n_unit->tcp->th_win;
> > +
> > +        memmove(seg->buf + seg->size, data, n_unit->payload);
> > +        seg->size += n_unit->payload;
> > +        seg->packets++;
> > +        chain->stat.coalesced++;
> > +        return RSC_COALESCE;
> > +    }
> > +}
> > +
> > +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg
> *seg,
> > +                                        const uint8_t *buf, size_t size,
> > +                                        NetRscUnit *unit)
> > +{
> > +    struct ip_header *ip1, *ip2;
> > +
> > +    ip1 = (struct ip_header *)(unit->ip);
> > +    ip2 = (struct ip_header *)(seg->unit.ip);
> > +    if ((ip1->ip_src ^ ip2->ip_src) || (ip1->ip_dst ^ ip2->ip_dst)
> > +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
> > +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
> > +        chain->stat.no_match++;
> > +        return RSC_NO_MATCH;
> > +    }
> > +
> > +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
> > +}
> > +
> > +static int32_t virtio_net_rsc_coalesce6(NetRscChain *chain, NetRscSeg
> *seg,
> > +                        const uint8_t *buf, size_t size, NetRscUnit
> *unit)
> > +{
> > +    struct ip6_header *ip1, *ip2;
> > +
> > +    ip1 = (struct ip6_header *)(unit->ip);
> > +    ip2 = (struct ip6_header *)(seg->unit.ip);
> > +    if (memcmp(&ip1->ip6_src, &ip2->ip6_src, sizeof(struct in6_address))
> > +        || memcmp(&ip1->ip6_dst, &ip2->ip6_dst, sizeof(struct
> in6_address))
> > +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
> > +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
> > +            chain->stat.no_match++;
> > +            return RSC_NO_MATCH;
> > +    }
> > +
> > +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
> > +}
> > +
> > +/* Packets with 'SYN' should bypass, other flag should be sent after
> drain
> > + * to prevent out of order */
> > +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
> > +                                         struct tcp_header *tcp)
> > +{
> > +    uint16_t tcp_hdr;
> > +    uint16_t tcp_flag;
> > +
> > +    tcp_flag = htons(tcp->th_offset_flags);
> > +    tcp_hdr = (tcp_flag & VIRTIO_NET_TCP_HDR_LENGTH) >> 10;
> > +    tcp_flag &= VIRTIO_NET_TCP_FLAG;
> > +    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
> > +    if (tcp_flag & TH_SYN) {
> > +        chain->stat.tcp_syn++;
> > +        return RSC_BYPASS;
> > +    }
> > +
> > +    if (tcp_flag & (TH_FIN | TH_URG | TH_RST | TH_ECE | TH_CWR)) {
> > +        chain->stat.tcp_ctrl_drain++;
> > +        return RSC_FINAL;
> > +    }
> > +
> > +    if (tcp_hdr > sizeof(struct tcp_header)) {
> > +        chain->stat.tcp_all_opt++;
> > +        return RSC_FINAL;
> > +    }
> > +
> > +    return RSC_CANDIDATE;
> > +}
> > +
> > +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain,
> > +                                         NetClientState *nc,
> > +                                         const uint8_t *buf, size_t
> size,
> > +                                         NetRscUnit *unit)
> > +{
> > +    int ret;
> > +    NetRscSeg *seg, *nseg;
> > +
> > +    if (QTAILQ_EMPTY(&chain->buffers)) {
> > +        chain->stat.empty_cache++;
> > +        virtio_net_rsc_cache_buf(chain, nc, buf, size);
> > +        timer_mod(chain->drain_timer,
> > +              qemu_clock_get_ns(QEMU_CLOCK_HOST) +
> chain->n->rsc_timeout);
> > +        return size;
> > +    }
> > +
> > +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> > +        if (chain->proto == ETH_P_IP) {
> > +            ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
> > +        } else {
> > +            ret = virtio_net_rsc_coalesce6(chain, seg, buf, size, unit);
> > +        }
> > +
> > +        if (ret == RSC_FINAL) {
> > +            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> > +                /* Send failed */
> > +                chain->stat.final_failed++;
> > +                return 0;
> > +            }
> > +
> > +            /* Send current packet */
> > +            return virtio_net_do_receive(nc, buf, size);
> > +        } else if (ret == RSC_NO_MATCH) {
> > +            continue;
> > +        } else {
> > +            /* Coalesced, mark coalesced flag to tell calc cksum for
> ipv4 */
> > +            seg->is_coalesced = 1;
> > +            return size;
> > +        }
> > +    }
> > +
> > +    chain->stat.no_match_cache++;
> > +    virtio_net_rsc_cache_buf(chain, nc, buf, size);
> > +    return size;
> > +}
> > +
> > +/* Drain a connection data, this is to avoid out of order segments */
> > +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain,
> NetClientState *nc,
> > +                                        const uint8_t *buf, size_t size,
> > +                                        uint16_t ip_start, uint16_t
> ip_size,
> > +                                        uint16_t tcp_port)
> > +{
> > +    NetRscSeg *seg, *nseg;
> > +    uint32_t ppair1, ppair2;
> > +
> > +    ppair1 = *(uint32_t *)(buf + tcp_port);
> > +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> > +        ppair2 = *(uint32_t *)(seg->buf + tcp_port);
> > +        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
> > +            || (ppair1 != ppair2)) {
> > +            continue;
> > +        }
> > +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> > +            chain->stat.drain_failed++;
> > +        }
> > +
> > +        break;
> > +    }
> > +
> > +    return virtio_net_do_receive(nc, buf, size);
> > +}
> > +
> > +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
> > +                                            struct ip_header *ip,
> > +                                            const uint8_t *buf, size_t
> size)
> > +{
> > +    uint16_t ip_len;
> > +
> > +    /* Not an ipv4 packet */
> > +    if (((ip->ip_ver_len & 0xF0) >> 4) != IP_HEADER_VERSION_4) {
> > +        chain->stat.ip_option++;
> > +        return RSC_BYPASS;
> > +    }
> > +
> > +    /* Don't handle packets with ip option */
> > +    if ((ip->ip_ver_len & 0xF) != VIRTIO_NET_IP4_HEADER_LENGTH) {
> > +        chain->stat.ip_option++;
> > +        return RSC_BYPASS;
> > +    }
> > +
> > +    if (ip->ip_p != IPPROTO_TCP) {
> > +        chain->stat.bypass_not_tcp++;
> > +        return RSC_BYPASS;
> > +    }
> > +
> > +    /* Don't handle packets with ip fragment */
> > +    if (!(htons(ip->ip_off) & IP_DF)) {
> > +        chain->stat.ip_frag++;
> > +        return RSC_BYPASS;
> > +    }
> > +
> > +    /* Don't handle packets with ecn flag */
> > +    if (IPTOS_ECN(ip->ip_tos)) {
> > +        chain->stat.ip_ecn++;
> > +        return RSC_BYPASS;
> > +    }
> > +
> > +    ip_len = htons(ip->ip_len);
> > +    if (ip_len < (sizeof(struct ip_header) + sizeof(struct tcp_header))
> > +        || ip_len > (size - chain->n->guest_hdr_len -
> > +                     sizeof(struct eth_header))) {
> > +        chain->stat.ip_hacked++;
> > +        return RSC_BYPASS;
> > +    }
> > +
> > +    return RSC_CANDIDATE;
> > +}
> > +
> > +static size_t virtio_net_rsc_receive4(NetRscChain *chain,
> NetClientState* nc,
> > +                                      const uint8_t *buf, size_t size)
> > +{
> > +    int32_t ret;
> > +    uint16_t hdr_len;
> > +    NetRscUnit unit;
> > +
> > +    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
> > +
> > +    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct
> ip_header)
> > +        + sizeof(struct tcp_header))) {
> > +        chain->stat.bypass_not_tcp++;
> > +        return virtio_net_do_receive(nc, buf, size);
> > +    }
> > +
> > +    virtio_net_rsc_extract_unit4(chain, buf, &unit);
> > +    if (virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)
> > +        != RSC_CANDIDATE) {
> > +        return virtio_net_do_receive(nc, buf, size);
> > +    }
> > +
> > +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
> > +    if (ret == RSC_BYPASS) {
> > +        return virtio_net_do_receive(nc, buf, size);
> > +    } else if (ret == RSC_FINAL) {
> > +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
> > +                ((hdr_len + sizeof(struct eth_header)) + 12),
> > +                VIRTIO_NET_IP4_ADDR_SIZE,
> > +                hdr_len + sizeof(struct eth_header) + sizeof(struct
> ip_header));
> > +    }
> > +
> > +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
> > +}
> > +
> > +static int32_t virtio_net_rsc_sanity_check6(NetRscChain *chain,
> > +                                            struct ip6_header *ip6,
> > +                                            const uint8_t *buf, size_t
> size)
> > +{
> > +    uint16_t ip_len;
> > +
> > +    if (((ip6->ip6_ctlun.ip6_un1.ip6_un1_flow & 0xF0) >> 4)
> > +        != IP_HEADER_VERSION_6) {
> > +        return RSC_BYPASS;
> > +    }
> > +
> > +    /* Both option and protocol is checked in this */
> > +    if (ip6->ip6_ctlun.ip6_un1.ip6_un1_nxt != IPPROTO_TCP) {
> > +        chain->stat.bypass_not_tcp++;
> > +        return RSC_BYPASS;
> > +    }
> > +
> > +    ip_len = htons(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
> > +    if (ip_len < sizeof(struct tcp_header) ||
> > +        ip_len > (size - chain->n->guest_hdr_len - sizeof(struct
> eth_header)
> > +                  - sizeof(struct ip6_header))) {
> > +        chain->stat.ip_hacked++;
> > +        return RSC_BYPASS;
> > +    }
> > +
> > +    /* Don't handle packets with ecn flag */
> > +    if (IP6_ECN(ip6->ip6_ctlun.ip6_un3.ip6_un3_ecn)) {
> > +        chain->stat.ip_ecn++;
> > +        return RSC_BYPASS;
> > +    }
> > +
> > +    return RSC_CANDIDATE;
> > +}
> > +
> > +static size_t virtio_net_rsc_receive6(void *opq, NetClientState* nc,
> > +                                      const uint8_t *buf, size_t size)
> > +{
> > +    int32_t ret;
> > +    uint16_t hdr_len;
> > +    NetRscChain *chain;
> > +    NetRscUnit unit;
> > +
> > +    chain = (NetRscChain *)opq;
> > +    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
> > +
> > +    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct
> ip6_header)
> > +        + sizeof(tcp_header))) {
> > +        return virtio_net_do_receive(nc, buf, size);
> > +    }
> > +
> > +    virtio_net_rsc_extract_unit6(chain, buf, &unit);
> > +    if (RSC_CANDIDATE != virtio_net_rsc_sanity_check6(chain,
> > +                                                 unit.ip, buf, size)) {
> > +        return virtio_net_do_receive(nc, buf, size);
> > +    }
> > +
> > +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
> > +    if (ret == RSC_BYPASS) {
> > +        return virtio_net_do_receive(nc, buf, size);
> > +    } else if (ret == RSC_FINAL) {
> > +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
> > +                ((hdr_len + sizeof(struct eth_header)) + 8),
> > +                VIRTIO_NET_IP6_ADDR_SIZE,
> > +                hdr_len + sizeof(struct eth_header)
> > +                + sizeof(struct ip6_header));
> > +    }
> > +
> > +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
> > +}
> > +
> > +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
> > +                                                NetClientState *nc,
> > +                                                uint16_t proto)
> > +{
> > +    NetRscChain *chain;
> > +
> > +    if ((proto != (uint16_t)ETH_P_IP) && (proto !=
> (uint16_t)ETH_P_IPV6)) {
> > +        return NULL;
> > +    }
> > +
> > +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
> > +        if (chain->proto == proto) {
> > +            return chain;
> > +        }
> > +    }
> > +
> > +    chain = g_malloc(sizeof(*chain));
> > +    chain->n = n;
> > +    chain->proto = proto;
> > +    if (proto == (uint16_t)ETH_P_IP) {
> > +        chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD;
> > +        chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> > +    } else {
> > +        chain->max_payload = VIRTIO_NET_MAX_IP6_PAYLOAD;
> > +        chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> > +    }
> > +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_HOST,
> > +                                      virtio_net_rsc_purge, chain);
> > +    memset(&chain->stat, 0, sizeof(chain->stat));
> > +
> > +    QTAILQ_INIT(&chain->buffers);
> > +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
> > +
> > +    return chain;
> > +}
> > +
> > +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
> > +                                      const uint8_t *buf,
> > +                                      size_t size)
> > +{
> > +    uint16_t proto;
> > +    NetRscChain *chain;
> > +    struct eth_header *eth;
> > +    VirtIONet *n;
> > +
> > +    n = qemu_get_nic_opaque(nc);
> > +    if (size < (n->host_hdr_len + sizeof(struct eth_header))) {
> > +        return virtio_net_do_receive(nc, buf, size);
> > +    }
> > +
> > +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
> > +    proto = htons(eth->h_proto);
> > +
> > +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
> > +    if (chain) {
> > +        chain->stat.received++;
> > +        if (proto == (uint16_t)ETH_P_IP && n->rsc4_enabled) {
> > +            return virtio_net_rsc_receive4(chain, nc, buf, size);
> > +        } else if (proto == (uint16_t)ETH_P_IPV6 && n->rsc6_enabled) {
> > +            return virtio_net_rsc_receive6(chain, nc, buf, size);
> > +        }
> > +    }
> > +    return virtio_net_do_receive(nc, buf, size);
> > +}
> > +
> > +static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t
> *buf,
> > +                                  size_t size)
> > +{
> > +    VirtIONet *n = qemu_get_nic_opaque(nc);
> > +    if ((n->rsc4_enabled || n->rsc6_enabled)) {
> > +        return virtio_net_rsc_receive(nc, buf, size);
> > +    } else {
> > +        return virtio_net_do_receive(nc, buf, size);
> > +    }
> > +}
> > +
> >  static int32_t virtio_net_flush_tx(VirtIONetQueue *q);
> >
> >  static void virtio_net_tx_complete(NetClientState *nc, ssize_t len)
> > @@ -1342,7 +1974,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue
> *q)
> >          ssize_t ret;
> >          unsigned int out_num;
> >          struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE +
> 1], *out_sg;
> > -        struct virtio_net_hdr_mrg_rxbuf mhdr;
> > +        struct virtio_net_hdr_mrg_rxbuf hdr;
> >
> >          elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement));
> >          if (!elem) {
> > @@ -1359,7 +1991,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue
> *q)
> >          }
> >
> >          if (n->has_vnet_hdr) {
> > -            if (iov_to_buf(out_sg, out_num, 0, &mhdr, n->guest_hdr_len)
> <
> > +            if (iov_to_buf(out_sg, out_num, 0, &hdr, n->guest_hdr_len) <
> >                  n->guest_hdr_len) {
> >                  virtio_error(vdev, "virtio-net header incorrect");
> >                  virtqueue_detach_element(q->tx_vq, elem, 0);
> > @@ -1367,8 +1999,8 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue
> *q)
> >                  return -EINVAL;
> >              }
> >              if (n->needs_vnet_hdr_swap) {
> > -                virtio_net_hdr_swap(vdev, (void *) &mhdr);
> > -                sg2[0].iov_base = &mhdr;
> > +                virtio_net_hdr_swap(vdev, (void *) &hdr);
> > +                sg2[0].iov_base = &hdr;
> >                  sg2[0].iov_len = n->guest_hdr_len;
> >                  out_num = iov_copy(&sg2[1], ARRAY_SIZE(sg2) - 1,
> >                                     out_sg, out_num,
> > @@ -2075,6 +2707,7 @@ static void virtio_net_device_realize(DeviceState
> *dev, Error **errp)
> >      nc = qemu_get_queue(n->nic);
> >      nc->rxfilter_notify_enabled = 1;
> >
> > +    QTAILQ_INIT(&n->rsc_chains);
> >      n->qdev = dev;
> >  }
> >
> > @@ -2104,6 +2737,7 @@ static void
> virtio_net_device_unrealize(DeviceState *dev, Error **errp)
> >      timer_free(n->announce_timer);
> >      g_free(n->vqs);
> >      qemu_del_nic(n->nic);
> > +    virtio_net_rsc_cleanup(n);
> >      virtio_cleanup(vdev);
> >  }
> >
> > @@ -2184,6 +2818,10 @@ static Property virtio_net_properties[] = {
> >      DEFINE_PROP_BIT64("ctrl_guest_offloads", VirtIONet, host_features,
> >                      VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, true),
> >      DEFINE_PROP_BIT64("mq", VirtIONet, host_features, VIRTIO_NET_F_MQ,
> false),
> > +    DEFINE_PROP_BIT64("guest_rsc_ext", VirtIONet, host_features,
> > +                    VIRTIO_NET_F_RSC_EXT, false),
> > +    DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout,
> > +                       VIRTIO_NET_RSC_INTERVAL),
>
> what if interval is set without guest_rsc_ext? should init fail?
>

I would not fail the init, in case the RSC_EXT feature is not active the
configuration of the timer does not have any effect.


>
> >      DEFINE_NIC_PROPERTIES(VirtIONet, nic_conf),
> >      DEFINE_PROP_UINT32("x-txtimer", VirtIONet, net_conf.txtimer,
> >                         TX_TIMER_INTERVAL),
> > diff --git a/include/hw/virtio/virtio-net.h
> b/include/hw/virtio/virtio-net.h
> > index 4d7f3c82ca..0da99dcfa0 100644
> > --- a/include/hw/virtio/virtio-net.h
> > +++ b/include/hw/virtio/virtio-net.h
> > @@ -44,6 +44,83 @@ typedef struct virtio_net_conf
> >      uint8_t duplex;
> >  } virtio_net_conf;
> >
> > +/* Coalesced packets type & status */
> > +typedef enum {
> > +    RSC_COALESCE,           /* Data been coalesced */
> > +    RSC_FINAL,              /* Will terminate current connection */
> > +    RSC_NO_MATCH,           /* No matched in the buffer pool */
> > +    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp ctrl,
> etc */
> > +    RSC_CANDIDATE                /* Data want to be coalesced */
> > +} COALESCE_STATUS;
>
> Should be CoalesceStatus.
>
> > +
> > +typedef struct NetRscStat {
> > +    uint32_t received;
> > +    uint32_t coalesced;
> > +    uint32_t over_size;
> > +    uint32_t cache;
> > +    uint32_t empty_cache;
> > +    uint32_t no_match_cache;
> > +    uint32_t win_update;
> > +    uint32_t no_match;
> > +    uint32_t tcp_syn;
> > +    uint32_t tcp_ctrl_drain;
> > +    uint32_t dup_ack;
> > +    uint32_t dup_ack1;
> > +    uint32_t dup_ack2;
> > +    uint32_t pure_ack;
> > +    uint32_t ack_out_of_win;
> > +    uint32_t data_out_of_win;
> > +    uint32_t data_out_of_order;
> > +    uint32_t data_after_pure_ack;
> > +    uint32_t bypass_not_tcp;
> > +    uint32_t tcp_option;
> > +    uint32_t tcp_all_opt;
> > +    uint32_t ip_frag;
> > +    uint32_t ip_ecn;
> > +    uint32_t ip_hacked;
> > +    uint32_t ip_option;
> > +    uint32_t purge_failed;
> > +    uint32_t drain_failed;
> > +    uint32_t final_failed;
> > +    int64_t  timer;
> > +} NetRscStat;
> > +
> > +/* Rsc unit general info used to checking if can coalescing */
> > +typedef struct NetRscUnit {
> > +    void *ip;   /* ip header */
> > +    uint16_t *ip_plen;      /* data len pointer in ip header field */
> > +    struct tcp_header *tcp; /* tcp header */
> > +    uint16_t tcp_hdrlen;    /* tcp header len */
> > +    uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
> > +} NetRscUnit;
> > +
> > +/* Coalesced segmant */
> > +typedef struct NetRscSeg {
> > +    QTAILQ_ENTRY(NetRscSeg) next;
> > +    void *buf;
> > +    size_t size;
> > +    uint16_t packets;
> > +    uint16_t dup_ack;
> > +    bool is_coalesced;      /* need recal ipv4 header checksum, mark
> here */
> > +    NetRscUnit unit;
> > +    NetClientState *nc;
> > +} NetRscSeg;
>
>
> If these structs are generic and not virtio specific,
> pls put them and code for handling them in generic files
> under net/. OTOH if they are virtio specific, please
> prefix names with Virtio.
>
> > +
> > +struct VirtIONet;
> > +typedef struct VirtIONet VirtIONet;
> > +
> > +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
> > +typedef struct NetRscChain {
> > +    QTAILQ_ENTRY(NetRscChain) next;
> > +    VirtIONet *n;                            /* VirtIONet */
> > +    uint16_t proto;
> > +    uint8_t  gso_type;
> > +    uint16_t max_payload;
> > +    QEMUTimer *drain_timer;
> > +    QTAILQ_HEAD(, NetRscSeg) buffers;
> > +    NetRscStat stat;
> > +} NetRscChain;
> > +
> >  /* Maximum packet size we can receive from tap device: header + 64k */
> >  #define VIRTIO_NET_MAX_BUFSIZE (sizeof(struct virtio_net_hdr) + (64 *
> KiB))
> >
> > @@ -66,12 +143,16 @@ typedef struct VirtIONet {
> >      VirtIONetQueue *vqs;
> >      VirtQueue *ctrl_vq;
> >      NICState *nic;
> > +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
>
> what exactly happens with these chains on migration?
>

This feature (software implementation of RSC in QEMU) is intended to be
used in
the environment of certification tests which never uses migration. These
chains
and accumulated segments (if any) are lost in case of migration. I'll add
the note about it
in commit's message.



>
> >      uint32_t tx_timeout;
> >      int32_t tx_burst;
> >      uint32_t has_vnet_hdr;
> >      size_t host_hdr_len;
> >      size_t guest_hdr_len;
> >      uint64_t host_features;
> > +    uint32_t rsc_timeout;
> > +    uint8_t rsc4_enabled;
> > +    uint8_t rsc6_enabled;
> >      uint8_t has_ufo;
> >      uint32_t mergeable_rx_bufs;
> >      uint8_t promisc;
> > diff --git a/include/net/eth.h b/include/net/eth.h
> > index e6dc8a7ba0..7f45c678e7 100644
> > --- a/include/net/eth.h
> > +++ b/include/net/eth.h
> > @@ -177,6 +177,8 @@ struct tcp_hdr {
> >  #define TH_PUSH 0x08
> >  #define TH_ACK  0x10
> >  #define TH_URG  0x20
> > +#define TH_ECE  0x40
> > +#define TH_CWR  0x80
> >      u_short th_win;      /* window */
> >      u_short th_sum;      /* checksum */
> >      u_short th_urp;      /* urgent pointer */
> > diff --git a/include/standard-headers/linux/virtio_net.h
> b/include/standard-headers/linux/virtio_net.h
> > index 260c3681d7..0d8658c06a 100644
> > --- a/include/standard-headers/linux/virtio_net.h
> > +++ b/include/standard-headers/linux/virtio_net.h
> > @@ -57,6 +57,10 @@
> >                                        * Steering */
> >  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23        /* Set MAC address */
> >
> > +#define VIRTIO_NET_F_RSC_EXT 38
>
> Should it be VIRTIO_NET_F_GUEST_RSC_EXT ?
>

IMO, not. In the spec the name of the feature is VIRTIO_NET_F_RSC_EXT and
it is actually host feature
and its effect is how the host sets the fields in the header.


>
> > +#define VIRTIO_NET_F_GUEST_RSC4_DONT_USE     41      /* reserved */
> > +#define VIRTIO_NET_F_GUEST_RSC6_DONT_USE     42      /* reserved */
> > +
> >  #define VIRTIO_NET_F_STANDBY   62    /* Act as standby for another
> device
> >                                        * with the same MAC.
> >                                        */
> > @@ -104,6 +108,7 @@ struct virtio_net_config {
> >  struct virtio_net_hdr_v1 {
> >  #define VIRTIO_NET_HDR_F_NEEDS_CSUM  1       /* Use csum_start,
> csum_offset */
> >  #define VIRTIO_NET_HDR_F_DATA_VALID  2       /* Csum is valid */
> > +#define VIRTIO_NET_HDR_F_RSC_INFO    4       /* rsc_ext data in csum_
> fields */
> >       uint8_t flags;
> >  #define VIRTIO_NET_HDR_GSO_NONE              0       /* Not a GSO frame
> */
> >  #define VIRTIO_NET_HDR_GSO_TCPV4     1       /* GSO frame, IPv4 TCP
> (TSO) */
> > @@ -118,6 +123,9 @@ struct virtio_net_hdr_v1 {
> >       __virtio16 num_buffers; /* Number of merged rx buffers */
> >  };
> >
> > +#define rsc_ext_num_packets          csum_start
> > +#define rsc_ext_num_dupacks          csum_offset
>
> I would prefer an inline function to set the field, or a union.
>
>
> > +
> >  #ifndef VIRTIO_NET_NO_LEGACY
> >  /* This header comes first in the scatter-gather list.
> >   * For legacy virtio, if VIRTIO_F_ANY_LAYOUT is not negotiated, it must
>
> This part needs to get into the Linux header. Pls post there.
> Until it does you can put it in virtio-net.c
>
>
> > --
> > 2.17.1
>
Michael S. Tsirkin Nov. 12, 2018, 2:54 a.m. UTC | #3
On Sun, Nov 11, 2018 at 12:18:54PM +0200, Yuri Benditovich wrote:
>     > @@ -66,12 +143,16 @@ typedef struct VirtIONet {
>     >      VirtIONetQueue *vqs;
>     >      VirtQueue *ctrl_vq;
>     >      NICState *nic;
>     > +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
> 
>     what exactly happens with these chains on migration?
> 
> 
> This feature (software implementation of RSC in QEMU) is intended to be used in
> the environment of certification tests which never uses migration.

Should this feature disable migration then?

> These chains
> and accumulated segments (if any) are lost in case of migration. I'll add the
> note about it
> in commit's message.
> 

If it's a functional limitation it belongs in a code comment
not in the commit log.

> 
> 
>     >      uint32_t tx_timeout;
>     >      int32_t tx_burst;
>     >      uint32_t has_vnet_hdr;
>     >      size_t host_hdr_len;
>     >      size_t guest_hdr_len;
>     >      uint64_t host_features;
>     > +    uint32_t rsc_timeout;
>     > +    uint8_t rsc4_enabled;
>     > +    uint8_t rsc6_enabled;
>     >      uint8_t has_ufo;
>     >      uint32_t mergeable_rx_bufs;
>     >      uint8_t promisc;
>     > diff --git a/include/net/eth.h b/include/net/eth.h
>     > index e6dc8a7ba0..7f45c678e7 100644
>     > --- a/include/net/eth.h
>     > +++ b/include/net/eth.h
>     > @@ -177,6 +177,8 @@ struct tcp_hdr {
>     >  #define TH_PUSH 0x08
>     >  #define TH_ACK  0x10
>     >  #define TH_URG  0x20
>     > +#define TH_ECE  0x40
>     > +#define TH_CWR  0x80
>     >      u_short th_win;      /* window */
>     >      u_short th_sum;      /* checksum */
>     >      u_short th_urp;      /* urgent pointer */
>     > diff --git a/include/standard-headers/linux/virtio_net.h b/include/
>     standard-headers/linux/virtio_net.h
>     > index 260c3681d7..0d8658c06a 100644
>     > --- a/include/standard-headers/linux/virtio_net.h
>     > +++ b/include/standard-headers/linux/virtio_net.h
>     > @@ -57,6 +57,10 @@
>     >                                        * Steering */
>     >  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23        /* Set MAC address */
>     > 
>     > +#define VIRTIO_NET_F_RSC_EXT 38
> 
>     Should it be VIRTIO_NET_F_GUEST_RSC_EXT ?
> 
> 
> IMO, not. In the spec the name of the feature is VIRTIO_NET_F_RSC_EXT and it is
> actually host feature
> and its effect is how the host sets the fields in the header.

Isn't the same true for GUEST_GSO?

> 
> 
>     > +#define VIRTIO_NET_F_GUEST_RSC4_DONT_USE     41      /* reserved */
>     > +#define VIRTIO_NET_F_GUEST_RSC6_DONT_USE     42      /* reserved */
>     > +
>     >  #define VIRTIO_NET_F_STANDBY   62    /* Act as standby for another
>     device
>     >                                        * with the same MAC.
>     >                                        */
>     > @@ -104,6 +108,7 @@ struct virtio_net_config {
>     >  struct virtio_net_hdr_v1 {
>     >  #define VIRTIO_NET_HDR_F_NEEDS_CSUM  1       /* Use csum_start,
>     csum_offset */
>     >  #define VIRTIO_NET_HDR_F_DATA_VALID  2       /* Csum is valid */
>     > +#define VIRTIO_NET_HDR_F_RSC_INFO    4       /* rsc_ext data in csum_
>     fields */
>     >       uint8_t flags;
>     >  #define VIRTIO_NET_HDR_GSO_NONE              0       /* Not a GSO frame
>     */
>     >  #define VIRTIO_NET_HDR_GSO_TCPV4     1       /* GSO frame, IPv4 TCP
>     (TSO) */
>     > @@ -118,6 +123,9 @@ struct virtio_net_hdr_v1 {
>     >       __virtio16 num_buffers; /* Number of merged rx buffers */
>     >  };
>     > 
>     > +#define rsc_ext_num_packets          csum_start
>     > +#define rsc_ext_num_dupacks          csum_offset
> 
>     I would prefer an inline function to set the field, or a union.
> 
> 
>     > +
>     >  #ifndef VIRTIO_NET_NO_LEGACY
>     >  /* This header comes first in the scatter-gather list.
>     >   * For legacy virtio, if VIRTIO_F_ANY_LAYOUT is not negotiated, it must
> 
>     This part needs to get into the Linux header. Pls post there.
>     Until it does you can put it in virtio-net.c
> 
> 
>     > --
>     > 2.17.1
>
Yuri Benditovich Nov. 12, 2018, 8:57 a.m. UTC | #4
On Mon, Nov 12, 2018 at 4:54 AM Michael S. Tsirkin <mst@redhat.com> wrote:

> On Sun, Nov 11, 2018 at 12:18:54PM +0200, Yuri Benditovich wrote:
> >     > @@ -66,12 +143,16 @@ typedef struct VirtIONet {
> >     >      VirtIONetQueue *vqs;
> >     >      VirtQueue *ctrl_vq;
> >     >      NICState *nic;
> >     > +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
> >
> >     what exactly happens with these chains on migration?
> >
> >
> > This feature (software implementation of RSC in QEMU) is intended to be
> used in
> > the environment of certification tests which never uses migration.
>
> Should this feature disable migration then?
>

IMO, this should not. But if you find it mandatory, please respond and I
will add the migration blocker.
In this case I'd prefer to disable the feature in case of vhost as in this
case the RX normally will not be
processed by QEMU and even if the feature by mistake enabled in command
line this will not have
any effect.


>
> > These chains
> > and accumulated segments (if any) are lost in case of migration. I'll
> add the
> > note about it
> > in commit's message.
> >
>
> If it's a functional limitation it belongs in a code comment
> not in the commit log.
>
> >
> >
> >     >      uint32_t tx_timeout;
> >     >      int32_t tx_burst;
> >     >      uint32_t has_vnet_hdr;
> >     >      size_t host_hdr_len;
> >     >      size_t guest_hdr_len;
> >     >      uint64_t host_features;
> >     > +    uint32_t rsc_timeout;
> >     > +    uint8_t rsc4_enabled;
> >     > +    uint8_t rsc6_enabled;
> >     >      uint8_t has_ufo;
> >     >      uint32_t mergeable_rx_bufs;
> >     >      uint8_t promisc;
> >     > diff --git a/include/net/eth.h b/include/net/eth.h
> >     > index e6dc8a7ba0..7f45c678e7 100644
> >     > --- a/include/net/eth.h
> >     > +++ b/include/net/eth.h
> >     > @@ -177,6 +177,8 @@ struct tcp_hdr {
> >     >  #define TH_PUSH 0x08
> >     >  #define TH_ACK  0x10
> >     >  #define TH_URG  0x20
> >     > +#define TH_ECE  0x40
> >     > +#define TH_CWR  0x80
> >     >      u_short th_win;      /* window */
> >     >      u_short th_sum;      /* checksum */
> >     >      u_short th_urp;      /* urgent pointer */
> >     > diff --git a/include/standard-headers/linux/virtio_net.h b/include/
> >     standard-headers/linux/virtio_net.h
> >     > index 260c3681d7..0d8658c06a 100644
> >     > --- a/include/standard-headers/linux/virtio_net.h
> >     > +++ b/include/standard-headers/linux/virtio_net.h
> >     > @@ -57,6 +57,10 @@
> >     >                                        * Steering */
> >     >  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23        /* Set MAC address */
> >     >
> >     > +#define VIRTIO_NET_F_RSC_EXT 38
> >
> >     Should it be VIRTIO_NET_F_GUEST_RSC_EXT ?
> >
> >
> > IMO, not. In the spec the name of the feature is VIRTIO_NET_F_RSC_EXT
> and it is
> > actually host feature
> > and its effect is how the host sets the fields in the header.
>
> Isn't the same true for GUEST_GSO?
>
>
I think GUEST_GSO is really guest feature ( it requires larger buffers on
guest side),
but this is question of terminology. If you find it necessary to rename the
feature,
please respond, I will do that in v3 of the patch. Just would like to
collect all the
requests.



> >
> >
> >     > +#define VIRTIO_NET_F_GUEST_RSC4_DONT_USE     41      /* reserved
> */
> >     > +#define VIRTIO_NET_F_GUEST_RSC6_DONT_USE     42      /* reserved
> */
> >     > +
> >     >  #define VIRTIO_NET_F_STANDBY   62    /* Act as standby for another
> >     device
> >     >                                        * with the same MAC.
> >     >                                        */
> >     > @@ -104,6 +108,7 @@ struct virtio_net_config {
> >     >  struct virtio_net_hdr_v1 {
> >     >  #define VIRTIO_NET_HDR_F_NEEDS_CSUM  1       /* Use csum_start,
> >     csum_offset */
> >     >  #define VIRTIO_NET_HDR_F_DATA_VALID  2       /* Csum is valid */
> >     > +#define VIRTIO_NET_HDR_F_RSC_INFO    4       /* rsc_ext data in
> csum_
> >     fields */
> >     >       uint8_t flags;
> >     >  #define VIRTIO_NET_HDR_GSO_NONE              0       /* Not a GSO
> frame
> >     */
> >     >  #define VIRTIO_NET_HDR_GSO_TCPV4     1       /* GSO frame, IPv4
> TCP
> >     (TSO) */
> >     > @@ -118,6 +123,9 @@ struct virtio_net_hdr_v1 {
> >     >       __virtio16 num_buffers; /* Number of merged rx buffers */
> >     >  };
> >     >
> >     > +#define rsc_ext_num_packets          csum_start
> >     > +#define rsc_ext_num_dupacks          csum_offset
> >
> >     I would prefer an inline function to set the field, or a union.
> >
> >
> >     > +
> >     >  #ifndef VIRTIO_NET_NO_LEGACY
> >     >  /* This header comes first in the scatter-gather list.
> >     >   * For legacy virtio, if VIRTIO_F_ANY_LAYOUT is not negotiated,
> it must
> >
> >     This part needs to get into the Linux header. Pls post there.
> >     Until it does you can put it in virtio-net.c
> >
> >
> >     > --
> >     > 2.17.1
> >
>
Jason Wang Nov. 12, 2018, 9:25 a.m. UTC | #5
On 2018/11/12 下午4:57, Yuri Benditovich wrote:
>
> On Mon, Nov 12, 2018 at 4:54 AM Michael S. Tsirkin <mst@redhat.com 
> <mailto:mst@redhat.com>> wrote:
>
>     On Sun, Nov 11, 2018 at 12:18:54PM +0200, Yuri Benditovich wrote:
>     >     > @@ -66,12 +143,16 @@ typedef struct VirtIONet {
>     >     >      VirtIONetQueue *vqs;
>     >     >      VirtQueue *ctrl_vq;
>     >     >      NICState *nic;
>     >     > +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
>     >
>     >     what exactly happens with these chains on migration?
>     >
>     >
>     > This feature (software implementation of RSC in QEMU) is
>     intended to be used in
>     > the environment of certification tests which never uses migration.
>
>     Should this feature disable migration then?
>
>
> IMO, this should not. But if you find it mandatory, please respond and 
> I will add the migration blocker.


So if my understanding is correct, it's safe to do nothing even if we 
allow migration for RSC?

Thanks
Yuri Benditovich Nov. 12, 2018, 11:31 a.m. UTC | #6
On Mon, Nov 12, 2018 at 11:26 AM Jason Wang <jasowang@redhat.com> wrote:

>
> On 2018/11/12 下午4:57, Yuri Benditovich wrote:
> >
> > On Mon, Nov 12, 2018 at 4:54 AM Michael S. Tsirkin <mst@redhat.com
> > <mailto:mst@redhat.com>> wrote:
> >
> >     On Sun, Nov 11, 2018 at 12:18:54PM +0200, Yuri Benditovich wrote:
> >     >     > @@ -66,12 +143,16 @@ typedef struct VirtIONet {
> >     >     >      VirtIONetQueue *vqs;
> >     >     >      VirtQueue *ctrl_vq;
> >     >     >      NICState *nic;
> >     >     > +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
> >     >
> >     >     what exactly happens with these chains on migration?
> >     >
> >     >
> >     > This feature (software implementation of RSC in QEMU) is
> >     intended to be used in
> >     > the environment of certification tests which never uses migration.
> >
> >     Should this feature disable migration then?
> >
> >
> > IMO, this should not. But if you find it mandatory, please respond and
> > I will add the migration blocker.
>
>
> So if my understanding is correct, it's safe to do nothing even if we
> allow migration for RSC?
>

This does not create any unrecoverable failure (assertion, BSOD), although
some data (coalesced parts of packets not delivered yet to guest) will be
lost.


>
> Thanks
>
>
>
Michael S. Tsirkin Nov. 12, 2018, 8:53 p.m. UTC | #7
On Mon, Nov 12, 2018 at 01:31:36PM +0200, Yuri Benditovich wrote:
> 
> 
> On Mon, Nov 12, 2018 at 11:26 AM Jason Wang <jasowang@redhat.com> wrote:
> 
> 
>     On 2018/11/12 下午4:57, Yuri Benditovich wrote:
>     >
>     > On Mon, Nov 12, 2018 at 4:54 AM Michael S. Tsirkin <mst@redhat.com
>     > <mailto:mst@redhat.com>> wrote:
>     >
>     >     On Sun, Nov 11, 2018 at 12:18:54PM +0200, Yuri Benditovich wrote:
>     >     >     > @@ -66,12 +143,16 @@ typedef struct VirtIONet {
>     >     >     >      VirtIONetQueue *vqs;
>     >     >     >      VirtQueue *ctrl_vq;
>     >     >     >      NICState *nic;
>     >     >     > +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
>     >     >
>     >     >     what exactly happens with these chains on migration?
>     >     >
>     >     >
>     >     > This feature (software implementation of RSC in QEMU) is
>     >     intended to be used in
>     >     > the environment of certification tests which never uses migration.
>     >
>     >     Should this feature disable migration then?
>     >
>     >
>     > IMO, this should not. But if you find it mandatory, please respond and
>     > I will add the migration blocker.
> 
> 
>     So if my understanding is correct, it's safe to do nothing even if we
>     allow migration for RSC?
> 
> 
> This does not create any unrecoverable failure (assertion, BSOD), although
> some data (coalesced parts of packets not delivered yet to guest) will be lost.

If guest has no way to detect none of these packets were ever
received by card, then I guess it's fine. Needs a
comment in case we start caring about packet loss around
migration.

> 
> 
>     Thanks
> 
> 
>
Wei Xu Nov. 13, 2018, 3:41 a.m. UTC | #8
Looks good, I can't recall the status of last version well but
I remember Jason gave some comments about sanity check are quiet
essential, have you addressed them?

Reviewed by: Wei Xu <wexu@redhat.com>

On Fri, Nov 09, 2018 at 04:58:27PM +0200, Yuri Benditovich wrote:
> This commit adds implementation of RX packets
> coalescing, compatible with requirements of Windows
> Hardware compatibility kit.
> 
> The device enables feature VIRTIO_NET_F_RSC_EXT in
> host features if it supports extended RSC functionality
> as defined in the specification.
> This feature requires at least one of VIRTIO_NET_F_GUEST_TSO4,
> VIRTIO_NET_F_GUEST_TSO6. Windows guest driver acks
> this feature only if VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
> is also present.
> 
> In case vhost is enabled the feature bit is cleared in
> host_features during device initialization.
> 
> If the guest driver acks VIRTIO_NET_F_RSC_EXT feature,
> the device coalesces TCPv4 and TCPv6 packets (if
> respective VIRTIO_NET_F_GUEST_TSO feature is on,
> populates extended RSC information in virtio header
> and sets VIRTIO_NET_HDR_F_RSC_INFO bit in header flags.
> The device does not recalculate checksums in the coalesced
> packet, so they are not valid.
> 
> In this case:
> All the data packets in a tcp connection are cached
> to a single buffer in every receive interval, and will
> be sent out via a timer, the 'virtio_net_rsc_timeout'
> controls the interval, this value may impact the
> performance and response time of tcp connection,
> 50000(50us) is an experience value to gain a performance
> improvement, since the whql test sends packets every 100us,
> so '300000(300us)' passes the test case, it is the default
> value as well, tune it via the command line parameter
> 'rsc_interval' within 'virtio-net-pci' device, for example,
> to launch a guest with interval set as '500000':
> 
> 'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00,
> guest_rsc_ext=on,rsc_interval=500000'
> 
> The timer will only be triggered if the packets pool is not empty,
> and it'll drain off all the cached packets.
> 
> 'NetRscChain' is used to save the segments of IPv4/6 in a
> VirtIONet device.
> 
> A new segment becomes a 'Candidate' as well as it passed sanity check,
> the main handler of TCP includes TCP window update, duplicated
> ACK check and the real data coalescing.
> 
> An 'Candidate' segment means:
> 1. Segment is within current window and the sequence is the expected one.
> 2. 'ACK' of the segment is in the valid window.
> 
> Sanity check includes:
> 1. Incorrect version in IP header
> 2. An IP options or IP fragment
> 3. Not a TCP packet
> 4. Sanity size check to prevent buffer overflow attack.
> 5. An ECN packet
> 
> Even though, there might more cases should be considered such as
> ip identification other flags, while it breaks the test because
> windows set it to the same even it's not a fragment.
> 
> Normally it includes 2 typical ways to handle a TCP control flag,
> 'bypass' and 'finalize', 'bypass' means should be sent out directly,
> while 'finalize' means the packets should also be bypassed, but this
> should be done after search for the same connection packets in the
> pool and drain all of them out, this is to avoid out of order fragment.
> 
> All the 'SYN' packets will be bypassed since this always begin a new'
> connection, other flags such 'URG/FIN/RST/CWR/ECE' will trigger a
> finalization, because this normally happens upon a connection is going
> to be closed, an 'URG' packet also finalize current coalescing unit.
> 
> Statistics can be used to monitor the basic coalescing status, the
> 'out of order' and 'out of window' means how many retransmitting packets,
> thus describe the performance intuitively.
> 
> Difference between ip v4 and v6 processing:
>  Fragment length in ipv4 header includes itself, while it's not
>  included for ipv6, thus means ipv6 can carry a real 65535 payload.
> 
> Signed-off-by: Wei Xu <address@hidden>
> Signed-off-by: Yuri Benditovich <yuri.benditovich@daynix.com>
> ---
>  hw/net/virtio-net.c                         | 648 +++++++++++++++++++-
>  include/hw/virtio/virtio-net.h              |  81 +++
>  include/net/eth.h                           |   2 +
>  include/standard-headers/linux/virtio_net.h |   8 +
>  4 files changed, 734 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 385b1a03e9..43a7021409 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -41,6 +41,28 @@
>  #define VIRTIO_NET_RX_QUEUE_MIN_SIZE VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE
>  #define VIRTIO_NET_TX_QUEUE_MIN_SIZE VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE
>  
> +#define VIRTIO_NET_IP4_ADDR_SIZE   8        /* ipv4 saddr + daddr */
> +
> +#define VIRTIO_NET_TCP_FLAG         0x3F
> +#define VIRTIO_NET_TCP_HDR_LENGTH   0xF000
> +
> +/* IPv4 max payload, 16 bits in the header */
> +#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header))
> +#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535
> +
> +/* header length value in ip header without option */
> +#define VIRTIO_NET_IP4_HEADER_LENGTH 5
> +
> +#define ETH_IP6_HDR_SZ (ETH_HDR_SZ + IP6_HDR_SZ)
> +#define VIRTIO_NET_IP6_ADDR_SIZE   32      /* ipv6 saddr + daddr */
> +#define VIRTIO_NET_MAX_IP6_PAYLOAD VIRTIO_NET_MAX_TCP_PAYLOAD
> +
> +/* Purge coalesced packets timer interval, This value affects the performance
> +   a lot, and should be tuned carefully, '300000'(300us) is the recommended
> +   value to pass the WHQL test, '50000' can gain 2x netperf throughput with
> +   tso/gso/gro 'off'. */
> +#define VIRTIO_NET_RSC_INTERVAL 300000
> +
>  /*
>   * Calculate the number of bytes up to and including the given 'field' of
>   * 'container'.
> @@ -628,6 +650,9 @@ static uint64_t virtio_net_get_features(VirtIODevice *vdev, uint64_t features,
>      if (!get_vhost_net(nc->peer)) {
>          return features;
>      }
> +    /* clear RSC_EXT feature in case of VHOST*/
> +    virtio_clear_feature(&features, VIRTIO_NET_F_RSC_EXT);
> +
>      features = vhost_net_get_features(get_vhost_net(nc->peer), features);
>      vdev->backend_features = features;
>  
> @@ -701,6 +726,11 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
>                                 virtio_has_feature(features,
>                                                    VIRTIO_F_VERSION_1));
>  
> +    n->rsc4_enabled = virtio_has_feature(features, VIRTIO_NET_F_RSC_EXT) &&
> +        virtio_has_feature(features, VIRTIO_NET_F_GUEST_TSO4);
> +    n->rsc6_enabled = virtio_has_feature(features, VIRTIO_NET_F_RSC_EXT) &&
> +        virtio_has_feature(features, VIRTIO_NET_F_GUEST_TSO6);
> +
>      if (n->has_vnet_hdr) {
>          n->curr_guest_offloads =
>              virtio_net_guest_offloads_by_features(features);
> @@ -781,6 +811,12 @@ static int virtio_net_handle_offloads(VirtIONet *n, uint8_t cmd,
>              return VIRTIO_NET_ERR;
>          }
>  
> +        n->rsc4_enabled = virtio_has_feature(offloads, VIRTIO_NET_F_RSC_EXT) &&
> +            virtio_has_feature(offloads, VIRTIO_NET_F_GUEST_TSO4);
> +        n->rsc6_enabled = virtio_has_feature(offloads, VIRTIO_NET_F_RSC_EXT) &&
> +            virtio_has_feature(offloads, VIRTIO_NET_F_GUEST_TSO6);
> +        virtio_clear_feature(&offloads, VIRTIO_NET_F_RSC_EXT);
> +
>          supported_offloads = virtio_net_supported_guest_offloads(n);
>          if (offloads & ~supported_offloads) {
>              return VIRTIO_NET_ERR;
> @@ -1253,6 +1289,7 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
>              }
>  
>              receive_header(n, sg, elem->in_num, buf, size);
> +
>              offset = n->host_hdr_len;
>              total += n->guest_hdr_len;
>              guest_offset = n->guest_hdr_len;
> @@ -1292,7 +1329,7 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
>      return size;
>  }
>  
> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf,
> +static ssize_t virtio_net_do_receive(NetClientState *nc, const uint8_t *buf,
>                                    size_t size)
>  {
>      ssize_t r;
> @@ -1303,6 +1340,601 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf,
>      return r;
>  }
>  
> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
> +                                         const uint8_t *buf, NetRscUnit* unit)
> +{
> +    uint16_t ip_hdrlen;
> +    struct ip_header *ip;
> +
> +    ip = (struct ip_header *)(buf + chain->n->guest_hdr_len
> +                              + sizeof(struct eth_header));
> +    unit->ip = (void *)ip;
> +    ip_hdrlen = (ip->ip_ver_len & 0xF) << 2;
> +    unit->ip_plen = &ip->ip_len;
> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
> +    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
> +}
> +
> +static void virtio_net_rsc_extract_unit6(NetRscChain *chain,
> +                                         const uint8_t *buf, NetRscUnit* unit)
> +{
> +    struct ip6_header *ip6;
> +
> +    ip6 = (struct ip6_header *)(buf + chain->n->guest_hdr_len
> +                                 + sizeof(struct eth_header));
> +    unit->ip = ip6;
> +    unit->ip_plen = &(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
> +    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip)\
> +                                        + sizeof(struct ip6_header));
> +    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
> +
> +    /* There is a difference between payload lenght in ipv4 and v6,
> +       ip header is excluded in ipv6 */
> +    unit->payload = htons(*unit->ip_plen) - unit->tcp_hdrlen;
> +}
> +
> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
> +{
> +    int ret;
> +    struct virtio_net_hdr *h;
> +
> +    h = (struct virtio_net_hdr *)seg->buf;
> +    h->flags = 0;
> +    h->gso_type = VIRTIO_NET_HDR_GSO_NONE;
> +
> +    if (seg->is_coalesced) {
> +        h->rsc_ext_num_packets = seg->packets;
> +        h->rsc_ext_num_dupacks = seg->dup_ack;
> +        h->flags = VIRTIO_NET_HDR_F_RSC_INFO;
> +        if (chain->proto == ETH_P_IP) {
> +            h->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> +        } else {
> +            h->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> +        }
> +    }
> +
> +    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
> +    QTAILQ_REMOVE(&chain->buffers, seg, next);
> +    g_free(seg->buf);
> +    g_free(seg);
> +
> +    return ret;
> +}
> +
> +static void virtio_net_rsc_purge(void *opq)
> +{
> +    NetRscSeg *seg, *rn;
> +    NetRscChain *chain = (NetRscChain *)opq;
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.purge_failed++;
> +            continue;
> +        }
> +    }
> +
> +    chain->stat.timer++;
> +    if (!QTAILQ_EMPTY(&chain->buffers)) {
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_HOST) + chain->n->rsc_timeout);
> +    }
> +}
> +
> +static void virtio_net_rsc_cleanup(VirtIONet *n)
> +{
> +    NetRscChain *chain, *rn_chain;
> +    NetRscSeg *seg, *rn_seg;
> +
> +    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
> +        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
> +            QTAILQ_REMOVE(&chain->buffers, seg, next);
> +            g_free(seg->buf);
> +            g_free(seg);
> +        }
> +
> +        timer_del(chain->drain_timer);
> +        timer_free(chain->drain_timer);
> +        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
> +        g_free(chain);
> +    }
> +}
> +
> +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
> +                                     const uint8_t *buf, size_t size)
> +{
> +    uint16_t hdr_len;
> +    NetRscSeg *seg;
> +
> +    hdr_len = chain->n->guest_hdr_len;
> +    seg = g_malloc(sizeof(NetRscSeg));
> +    seg->buf = g_malloc(hdr_len + sizeof(struct eth_header)
> +        + sizeof(struct ip6_header) + VIRTIO_NET_MAX_TCP_PAYLOAD);
> +    memcpy(seg->buf, buf, size);
> +    seg->size = size;
> +    seg->packets = 1;
> +    seg->dup_ack = 0;
> +    seg->is_coalesced = 0;
> +    seg->nc = nc;
> +
> +    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
> +    chain->stat.cache++;
> +
> +    switch (chain->proto) {
> +    case ETH_P_IP:
> +        virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
> +        break;
> +    case ETH_P_IPV6:
> +        virtio_net_rsc_extract_unit6(chain, seg->buf, &seg->unit);
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain,
> +                                         NetRscSeg *seg, const uint8_t *buf,
> +                                         struct tcp_header *n_tcp,
> +                                         struct tcp_header *o_tcp)
> +{
> +    uint32_t nack, oack;
> +    uint16_t nwin, owin;
> +
> +    nack = htonl(n_tcp->th_ack);
> +    nwin = htons(n_tcp->th_win);
> +    oack = htonl(o_tcp->th_ack);
> +    owin = htons(o_tcp->th_win);
> +
> +    if ((nack - oack) >= VIRTIO_NET_MAX_TCP_PAYLOAD) {
> +        chain->stat.ack_out_of_win++;
> +        return RSC_FINAL;
> +    } else if (nack == oack) {
> +        /* duplicated ack or window probe */
> +        if (nwin == owin) {
> +            /* duplicated ack, add dup ack count due to whql test up to 1 */
> +            chain->stat.dup_ack++;
> +            return RSC_FINAL;
> +        } else {
> +            /* Coalesce window update */
> +            o_tcp->th_win = n_tcp->th_win;
> +            chain->stat.win_update++;
> +            return RSC_COALESCE;
> +        }
> +    } else {
> +        /* pure ack, go to 'C', finalize*/
> +        chain->stat.pure_ack++;
> +        return RSC_FINAL;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain,
> +                                            NetRscSeg *seg, const uint8_t *buf,
> +                                            NetRscUnit *n_unit)
> +{
> +    void *data;
> +    uint16_t o_ip_len;
> +    uint32_t nseq, oseq;
> +    NetRscUnit *o_unit;
> +
> +    o_unit = &seg->unit;
> +    o_ip_len = htons(*o_unit->ip_plen);
> +    nseq = htonl(n_unit->tcp->th_seq);
> +    oseq = htonl(o_unit->tcp->th_seq);
> +
> +    /* out of order or retransmitted. */
> +    if ((nseq - oseq) > VIRTIO_NET_MAX_TCP_PAYLOAD) {
> +        chain->stat.data_out_of_win++;
> +        return RSC_FINAL;
> +    }
> +
> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
> +    if (nseq == oseq) {
> +        if ((o_unit->payload == 0) && n_unit->payload) {
> +            /* From no payload to payload, normal case, not a dup ack or etc */
> +            chain->stat.data_after_pure_ack++;
> +            goto coalesce;
> +        } else {
> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
> +                                             n_unit->tcp, o_unit->tcp);
> +        }
> +    } else if ((nseq - oseq) != o_unit->payload) {
> +        /* Not a consistent packet, out of order */
> +        chain->stat.data_out_of_order++;
> +        return RSC_FINAL;
> +    } else {
> +coalesce:
> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
> +            chain->stat.over_size++;
> +            return RSC_FINAL;
> +        }
> +
> +        /* Here comes the right data, the payload length in v4/v6 is different,
> +           so use the field value to update and record the new data len */
> +        o_unit->payload += n_unit->payload; /* update new data len */
> +
> +        /* update field in ip header */
> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
> +
> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
> +           for windows guest, while this may change the behavior for linux
> +           guest (only if it uses RSC feature). */
> +        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
> +
> +        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
> +        o_unit->tcp->th_win = n_unit->tcp->th_win;
> +
> +        memmove(seg->buf + seg->size, data, n_unit->payload);
> +        seg->size += n_unit->payload;
> +        seg->packets++;
> +        chain->stat.coalesced++;
> +        return RSC_COALESCE;
> +    }
> +}
> +
> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
> +                                        const uint8_t *buf, size_t size,
> +                                        NetRscUnit *unit)
> +{
> +    struct ip_header *ip1, *ip2;
> +
> +    ip1 = (struct ip_header *)(unit->ip);
> +    ip2 = (struct ip_header *)(seg->unit.ip);
> +    if ((ip1->ip_src ^ ip2->ip_src) || (ip1->ip_dst ^ ip2->ip_dst)
> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
> +        chain->stat.no_match++;
> +        return RSC_NO_MATCH;
> +    }
> +
> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
> +}
> +
> +static int32_t virtio_net_rsc_coalesce6(NetRscChain *chain, NetRscSeg *seg,
> +                        const uint8_t *buf, size_t size, NetRscUnit *unit)
> +{
> +    struct ip6_header *ip1, *ip2;
> +
> +    ip1 = (struct ip6_header *)(unit->ip);
> +    ip2 = (struct ip6_header *)(seg->unit.ip);
> +    if (memcmp(&ip1->ip6_src, &ip2->ip6_src, sizeof(struct in6_address))
> +        || memcmp(&ip1->ip6_dst, &ip2->ip6_dst, sizeof(struct in6_address))
> +        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
> +        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
> +            chain->stat.no_match++;
> +            return RSC_NO_MATCH;
> +    }
> +
> +    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
> +}
> +
> +/* Packets with 'SYN' should bypass, other flag should be sent after drain
> + * to prevent out of order */
> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
> +                                         struct tcp_header *tcp)
> +{
> +    uint16_t tcp_hdr;
> +    uint16_t tcp_flag;
> +
> +    tcp_flag = htons(tcp->th_offset_flags);
> +    tcp_hdr = (tcp_flag & VIRTIO_NET_TCP_HDR_LENGTH) >> 10;
> +    tcp_flag &= VIRTIO_NET_TCP_FLAG;
> +    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
> +    if (tcp_flag & TH_SYN) {
> +        chain->stat.tcp_syn++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (tcp_flag & (TH_FIN | TH_URG | TH_RST | TH_ECE | TH_CWR)) {
> +        chain->stat.tcp_ctrl_drain++;
> +        return RSC_FINAL;
> +    }
> +
> +    if (tcp_hdr > sizeof(struct tcp_header)) {
> +        chain->stat.tcp_all_opt++;
> +        return RSC_FINAL;
> +    }
> +
> +    return RSC_CANDIDATE;
> +}
> +
> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain,
> +                                         NetClientState *nc,
> +                                         const uint8_t *buf, size_t size,
> +                                         NetRscUnit *unit)
> +{
> +    int ret;
> +    NetRscSeg *seg, *nseg;
> +
> +    if (QTAILQ_EMPTY(&chain->buffers)) {
> +        chain->stat.empty_cache++;
> +        virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +        timer_mod(chain->drain_timer,
> +              qemu_clock_get_ns(QEMU_CLOCK_HOST) + chain->n->rsc_timeout);
> +        return size;
> +    }
> +
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        if (chain->proto == ETH_P_IP) {
> +            ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
> +        } else {
> +            ret = virtio_net_rsc_coalesce6(chain, seg, buf, size, unit);
> +        }
> +
> +        if (ret == RSC_FINAL) {
> +            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +                /* Send failed */
> +                chain->stat.final_failed++;
> +                return 0;
> +            }
> +
> +            /* Send current packet */
> +            return virtio_net_do_receive(nc, buf, size);
> +        } else if (ret == RSC_NO_MATCH) {
> +            continue;
> +        } else {
> +            /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */
> +            seg->is_coalesced = 1;
> +            return size;
> +        }
> +    }
> +
> +    chain->stat.no_match_cache++;
> +    virtio_net_rsc_cache_buf(chain, nc, buf, size);
> +    return size;
> +}
> +
> +/* Drain a connection data, this is to avoid out of order segments */
> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
> +                                        const uint8_t *buf, size_t size,
> +                                        uint16_t ip_start, uint16_t ip_size,
> +                                        uint16_t tcp_port)
> +{
> +    NetRscSeg *seg, *nseg;
> +    uint32_t ppair1, ppair2;
> +
> +    ppair1 = *(uint32_t *)(buf + tcp_port);
> +    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
> +        ppair2 = *(uint32_t *)(seg->buf + tcp_port);
> +        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
> +            || (ppair1 != ppair2)) {
> +            continue;
> +        }
> +        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
> +            chain->stat.drain_failed++;
> +        }
> +
> +        break;
> +    }
> +
> +    return virtio_net_do_receive(nc, buf, size);
> +}
> +
> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
> +                                            struct ip_header *ip,
> +                                            const uint8_t *buf, size_t size)
> +{
> +    uint16_t ip_len;
> +
> +    /* Not an ipv4 packet */
> +    if (((ip->ip_ver_len & 0xF0) >> 4) != IP_HEADER_VERSION_4) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip option */
> +    if ((ip->ip_ver_len & 0xF) != VIRTIO_NET_IP4_HEADER_LENGTH) {
> +        chain->stat.ip_option++;
> +        return RSC_BYPASS;
> +    }
> +
> +    if (ip->ip_p != IPPROTO_TCP) {
> +        chain->stat.bypass_not_tcp++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ip fragment */
> +    if (!(htons(ip->ip_off) & IP_DF)) {
> +        chain->stat.ip_frag++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ecn flag */
> +    if (IPTOS_ECN(ip->ip_tos)) {
> +        chain->stat.ip_ecn++;
> +        return RSC_BYPASS;
> +    }
> +
> +    ip_len = htons(ip->ip_len);
> +    if (ip_len < (sizeof(struct ip_header) + sizeof(struct tcp_header))
> +        || ip_len > (size - chain->n->guest_hdr_len -
> +                     sizeof(struct eth_header))) {
> +        chain->stat.ip_hacked++;
> +        return RSC_BYPASS;
> +    }
> +
> +    return RSC_CANDIDATE;
> +}
> +
> +static size_t virtio_net_rsc_receive4(NetRscChain *chain, NetClientState* nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    int32_t ret;
> +    uint16_t hdr_len;
> +    NetRscUnit unit;
> +
> +    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
> +
> +    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header)
> +        + sizeof(struct tcp_header))) {
> +        chain->stat.bypass_not_tcp++;
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    virtio_net_rsc_extract_unit4(chain, buf, &unit);
> +    if (virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)
> +        != RSC_CANDIDATE) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
> +    if (ret == RSC_BYPASS) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else if (ret == RSC_FINAL) {
> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
> +                ((hdr_len + sizeof(struct eth_header)) + 12),
> +                VIRTIO_NET_IP4_ADDR_SIZE,
> +                hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header));
> +    }
> +
> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
> +}
> +
> +static int32_t virtio_net_rsc_sanity_check6(NetRscChain *chain,
> +                                            struct ip6_header *ip6,
> +                                            const uint8_t *buf, size_t size)
> +{
> +    uint16_t ip_len;
> +
> +    if (((ip6->ip6_ctlun.ip6_un1.ip6_un1_flow & 0xF0) >> 4)
> +        != IP_HEADER_VERSION_6) {
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Both option and protocol is checked in this */
> +    if (ip6->ip6_ctlun.ip6_un1.ip6_un1_nxt != IPPROTO_TCP) {
> +        chain->stat.bypass_not_tcp++;
> +        return RSC_BYPASS;
> +    }
> +
> +    ip_len = htons(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
> +    if (ip_len < sizeof(struct tcp_header) ||
> +        ip_len > (size - chain->n->guest_hdr_len - sizeof(struct eth_header)
> +                  - sizeof(struct ip6_header))) {
> +        chain->stat.ip_hacked++;
> +        return RSC_BYPASS;
> +    }
> +
> +    /* Don't handle packets with ecn flag */
> +    if (IP6_ECN(ip6->ip6_ctlun.ip6_un3.ip6_un3_ecn)) {
> +        chain->stat.ip_ecn++;
> +        return RSC_BYPASS;
> +    }
> +
> +    return RSC_CANDIDATE;
> +}
> +
> +static size_t virtio_net_rsc_receive6(void *opq, NetClientState* nc,
> +                                      const uint8_t *buf, size_t size)
> +{
> +    int32_t ret;
> +    uint16_t hdr_len;
> +    NetRscChain *chain;
> +    NetRscUnit unit;
> +
> +    chain = (NetRscChain *)opq;
> +    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
> +
> +    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct ip6_header)
> +        + sizeof(tcp_header))) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    virtio_net_rsc_extract_unit6(chain, buf, &unit);
> +    if (RSC_CANDIDATE != virtio_net_rsc_sanity_check6(chain,
> +                                                 unit.ip, buf, size)) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
> +    if (ret == RSC_BYPASS) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    } else if (ret == RSC_FINAL) {
> +        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
> +                ((hdr_len + sizeof(struct eth_header)) + 8),
> +                VIRTIO_NET_IP6_ADDR_SIZE,
> +                hdr_len + sizeof(struct eth_header)
> +                + sizeof(struct ip6_header));
> +    }
> +
> +    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
> +}
> +
> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
> +                                                NetClientState *nc,
> +                                                uint16_t proto)
> +{
> +    NetRscChain *chain;
> +
> +    if ((proto != (uint16_t)ETH_P_IP) && (proto != (uint16_t)ETH_P_IPV6)) {
> +        return NULL;
> +    }
> +
> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
> +        if (chain->proto == proto) {
> +            return chain;
> +        }
> +    }
> +
> +    chain = g_malloc(sizeof(*chain));
> +    chain->n = n;
> +    chain->proto = proto;
> +    if (proto == (uint16_t)ETH_P_IP) {
> +        chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD;
> +        chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> +    } else {
> +        chain->max_payload = VIRTIO_NET_MAX_IP6_PAYLOAD;
> +        chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> +    }
> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_HOST,
> +                                      virtio_net_rsc_purge, chain);
> +    memset(&chain->stat, 0, sizeof(chain->stat));
> +
> +    QTAILQ_INIT(&chain->buffers);
> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
> +
> +    return chain;
> +}
> +
> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
> +                                      const uint8_t *buf,
> +                                      size_t size)
> +{
> +    uint16_t proto;
> +    NetRscChain *chain;
> +    struct eth_header *eth;
> +    VirtIONet *n;
> +
> +    n = qemu_get_nic_opaque(nc);
> +    if (size < (n->host_hdr_len + sizeof(struct eth_header))) {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +
> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
> +    proto = htons(eth->h_proto);
> +
> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
> +    if (chain) {
> +        chain->stat.received++;
> +        if (proto == (uint16_t)ETH_P_IP && n->rsc4_enabled) {
> +            return virtio_net_rsc_receive4(chain, nc, buf, size);
> +        } else if (proto == (uint16_t)ETH_P_IPV6 && n->rsc6_enabled) {
> +            return virtio_net_rsc_receive6(chain, nc, buf, size);
> +        }
> +    }
> +    return virtio_net_do_receive(nc, buf, size);
> +}
> +
> +static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf,
> +                                  size_t size)
> +{
> +    VirtIONet *n = qemu_get_nic_opaque(nc);
> +    if ((n->rsc4_enabled || n->rsc6_enabled)) {
> +        return virtio_net_rsc_receive(nc, buf, size);
> +    } else {
> +        return virtio_net_do_receive(nc, buf, size);
> +    }
> +}
> +
>  static int32_t virtio_net_flush_tx(VirtIONetQueue *q);
>  
>  static void virtio_net_tx_complete(NetClientState *nc, ssize_t len)
> @@ -1342,7 +1974,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
>          ssize_t ret;
>          unsigned int out_num;
>          struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1], *out_sg;
> -        struct virtio_net_hdr_mrg_rxbuf mhdr;
> +        struct virtio_net_hdr_mrg_rxbuf hdr;
>  
>          elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement));
>          if (!elem) {
> @@ -1359,7 +1991,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
>          }
>  
>          if (n->has_vnet_hdr) {
> -            if (iov_to_buf(out_sg, out_num, 0, &mhdr, n->guest_hdr_len) <
> +            if (iov_to_buf(out_sg, out_num, 0, &hdr, n->guest_hdr_len) <
>                  n->guest_hdr_len) {
>                  virtio_error(vdev, "virtio-net header incorrect");
>                  virtqueue_detach_element(q->tx_vq, elem, 0);
> @@ -1367,8 +1999,8 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
>                  return -EINVAL;
>              }
>              if (n->needs_vnet_hdr_swap) {
> -                virtio_net_hdr_swap(vdev, (void *) &mhdr);
> -                sg2[0].iov_base = &mhdr;
> +                virtio_net_hdr_swap(vdev, (void *) &hdr);
> +                sg2[0].iov_base = &hdr;
>                  sg2[0].iov_len = n->guest_hdr_len;
>                  out_num = iov_copy(&sg2[1], ARRAY_SIZE(sg2) - 1,
>                                     out_sg, out_num,
> @@ -2075,6 +2707,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
>      nc = qemu_get_queue(n->nic);
>      nc->rxfilter_notify_enabled = 1;
>  
> +    QTAILQ_INIT(&n->rsc_chains);
>      n->qdev = dev;
>  }
>  
> @@ -2104,6 +2737,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
>      timer_free(n->announce_timer);
>      g_free(n->vqs);
>      qemu_del_nic(n->nic);
> +    virtio_net_rsc_cleanup(n);
>      virtio_cleanup(vdev);
>  }
>  
> @@ -2184,6 +2818,10 @@ static Property virtio_net_properties[] = {
>      DEFINE_PROP_BIT64("ctrl_guest_offloads", VirtIONet, host_features,
>                      VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, true),
>      DEFINE_PROP_BIT64("mq", VirtIONet, host_features, VIRTIO_NET_F_MQ, false),
> +    DEFINE_PROP_BIT64("guest_rsc_ext", VirtIONet, host_features,
> +                    VIRTIO_NET_F_RSC_EXT, false),
> +    DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout,
> +                       VIRTIO_NET_RSC_INTERVAL),
>      DEFINE_NIC_PROPERTIES(VirtIONet, nic_conf),
>      DEFINE_PROP_UINT32("x-txtimer", VirtIONet, net_conf.txtimer,
>                         TX_TIMER_INTERVAL),
> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
> index 4d7f3c82ca..0da99dcfa0 100644
> --- a/include/hw/virtio/virtio-net.h
> +++ b/include/hw/virtio/virtio-net.h
> @@ -44,6 +44,83 @@ typedef struct virtio_net_conf
>      uint8_t duplex;
>  } virtio_net_conf;
>  
> +/* Coalesced packets type & status */
> +typedef enum {
> +    RSC_COALESCE,           /* Data been coalesced */
> +    RSC_FINAL,              /* Will terminate current connection */
> +    RSC_NO_MATCH,           /* No matched in the buffer pool */
> +    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp ctrl, etc */
> +    RSC_CANDIDATE                /* Data want to be coalesced */
> +} COALESCE_STATUS;
> +
> +typedef struct NetRscStat {
> +    uint32_t received;
> +    uint32_t coalesced;
> +    uint32_t over_size;
> +    uint32_t cache;
> +    uint32_t empty_cache;
> +    uint32_t no_match_cache;
> +    uint32_t win_update;
> +    uint32_t no_match;
> +    uint32_t tcp_syn;
> +    uint32_t tcp_ctrl_drain;
> +    uint32_t dup_ack;
> +    uint32_t dup_ack1;
> +    uint32_t dup_ack2;
> +    uint32_t pure_ack;
> +    uint32_t ack_out_of_win;
> +    uint32_t data_out_of_win;
> +    uint32_t data_out_of_order;
> +    uint32_t data_after_pure_ack;
> +    uint32_t bypass_not_tcp;
> +    uint32_t tcp_option;
> +    uint32_t tcp_all_opt;
> +    uint32_t ip_frag;
> +    uint32_t ip_ecn;
> +    uint32_t ip_hacked;
> +    uint32_t ip_option;
> +    uint32_t purge_failed;
> +    uint32_t drain_failed;
> +    uint32_t final_failed;
> +    int64_t  timer;
> +} NetRscStat;
> +
> +/* Rsc unit general info used to checking if can coalescing */
> +typedef struct NetRscUnit {
> +    void *ip;   /* ip header */
> +    uint16_t *ip_plen;      /* data len pointer in ip header field */
> +    struct tcp_header *tcp; /* tcp header */
> +    uint16_t tcp_hdrlen;    /* tcp header len */
> +    uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
> +} NetRscUnit;
> +
> +/* Coalesced segmant */
> +typedef struct NetRscSeg {
> +    QTAILQ_ENTRY(NetRscSeg) next;
> +    void *buf;
> +    size_t size;
> +    uint16_t packets;
> +    uint16_t dup_ack;
> +    bool is_coalesced;      /* need recal ipv4 header checksum, mark here */
> +    NetRscUnit unit;
> +    NetClientState *nc;
> +} NetRscSeg;
> +
> +struct VirtIONet;
> +typedef struct VirtIONet VirtIONet;
> +
> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
> +typedef struct NetRscChain {
> +    QTAILQ_ENTRY(NetRscChain) next;
> +    VirtIONet *n;                            /* VirtIONet */
> +    uint16_t proto;
> +    uint8_t  gso_type;
> +    uint16_t max_payload;
> +    QEMUTimer *drain_timer;
> +    QTAILQ_HEAD(, NetRscSeg) buffers;
> +    NetRscStat stat;
> +} NetRscChain;
> +
>  /* Maximum packet size we can receive from tap device: header + 64k */
>  #define VIRTIO_NET_MAX_BUFSIZE (sizeof(struct virtio_net_hdr) + (64 * KiB))
>  
> @@ -66,12 +143,16 @@ typedef struct VirtIONet {
>      VirtIONetQueue *vqs;
>      VirtQueue *ctrl_vq;
>      NICState *nic;
> +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
>      uint32_t tx_timeout;
>      int32_t tx_burst;
>      uint32_t has_vnet_hdr;
>      size_t host_hdr_len;
>      size_t guest_hdr_len;
>      uint64_t host_features;
> +    uint32_t rsc_timeout;
> +    uint8_t rsc4_enabled;
> +    uint8_t rsc6_enabled;
>      uint8_t has_ufo;
>      uint32_t mergeable_rx_bufs;
>      uint8_t promisc;
> diff --git a/include/net/eth.h b/include/net/eth.h
> index e6dc8a7ba0..7f45c678e7 100644
> --- a/include/net/eth.h
> +++ b/include/net/eth.h
> @@ -177,6 +177,8 @@ struct tcp_hdr {
>  #define TH_PUSH 0x08
>  #define TH_ACK  0x10
>  #define TH_URG  0x20
> +#define TH_ECE  0x40
> +#define TH_CWR  0x80
>      u_short th_win;      /* window */
>      u_short th_sum;      /* checksum */
>      u_short th_urp;      /* urgent pointer */
> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
> index 260c3681d7..0d8658c06a 100644
> --- a/include/standard-headers/linux/virtio_net.h
> +++ b/include/standard-headers/linux/virtio_net.h
> @@ -57,6 +57,10 @@
>  					 * Steering */
>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
>  
> +#define VIRTIO_NET_F_RSC_EXT	38
> +#define VIRTIO_NET_F_GUEST_RSC4_DONT_USE	41	/* reserved */
> +#define VIRTIO_NET_F_GUEST_RSC6_DONT_USE	42	/* reserved */
> +
>  #define VIRTIO_NET_F_STANDBY	  62	/* Act as standby for another device
>  					 * with the same MAC.
>  					 */
> @@ -104,6 +108,7 @@ struct virtio_net_config {
>  struct virtio_net_hdr_v1 {
>  #define VIRTIO_NET_HDR_F_NEEDS_CSUM	1	/* Use csum_start, csum_offset */
>  #define VIRTIO_NET_HDR_F_DATA_VALID	2	/* Csum is valid */
> +#define VIRTIO_NET_HDR_F_RSC_INFO	4	/* rsc_ext data in csum_ fields */
>  	uint8_t flags;
>  #define VIRTIO_NET_HDR_GSO_NONE		0	/* Not a GSO frame */
>  #define VIRTIO_NET_HDR_GSO_TCPV4	1	/* GSO frame, IPv4 TCP (TSO) */
> @@ -118,6 +123,9 @@ struct virtio_net_hdr_v1 {
>  	__virtio16 num_buffers;	/* Number of merged rx buffers */
>  };
>  
> +#define rsc_ext_num_packets		csum_start
> +#define rsc_ext_num_dupacks		csum_offset
> +
>  #ifndef VIRTIO_NET_NO_LEGACY
>  /* This header comes first in the scatter-gather list.
>   * For legacy virtio, if VIRTIO_F_ANY_LAYOUT is not negotiated, it must
> -- 
> 2.17.1
> 
>
Yuri Benditovich Nov. 13, 2018, 8:21 a.m. UTC | #9
On Mon, Nov 12, 2018 at 10:53 PM Michael S. Tsirkin <mst@redhat.com> wrote:

> On Mon, Nov 12, 2018 at 01:31:36PM +0200, Yuri Benditovich wrote:
> >
> >
> > On Mon, Nov 12, 2018 at 11:26 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> >     On 2018/11/12 下午4:57, Yuri Benditovich wrote:
> >     >
> >     > On Mon, Nov 12, 2018 at 4:54 AM Michael S. Tsirkin <mst@redhat.com
> >     > <mailto:mst@redhat.com>> wrote:
> >     >
> >     >     On Sun, Nov 11, 2018 at 12:18:54PM +0200, Yuri Benditovich
> wrote:
> >     >     >     > @@ -66,12 +143,16 @@ typedef struct VirtIONet {
> >     >     >     >      VirtIONetQueue *vqs;
> >     >     >     >      VirtQueue *ctrl_vq;
> >     >     >     >      NICState *nic;
> >     >     >     > +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
> >     >     >
> >     >     >     what exactly happens with these chains on migration?
> >     >     >
> >     >     >
> >     >     > This feature (software implementation of RSC in QEMU) is
> >     >     intended to be used in
> >     >     > the environment of certification tests which never uses
> migration.
> >     >
> >     >     Should this feature disable migration then?
> >     >
> >     >
> >     > IMO, this should not. But if you find it mandatory, please respond
> and
> >     > I will add the migration blocker.
> >
> >
> >     So if my understanding is correct, it's safe to do nothing even if we
> >     allow migration for RSC?
> >
> >
> > This does not create any unrecoverable failure (assertion, BSOD),
> although
> > some data (coalesced parts of packets not delivered yet to guest) will
> be lost.
>
> If guest has no way to detect none of these packets were ever
> received by card, then I guess it's fine. Needs a
> comment in case we start caring about packet loss around
> migration.
>
>
Where is the best place to place such a comment in the code?
Are there more notes toward v3 of the patch?



> >
> >
> >     Thanks
> >
> >
> >
>
Michael S. Tsirkin Nov. 13, 2018, 3:32 p.m. UTC | #10
On Tue, Nov 13, 2018 at 10:21:07AM +0200, Yuri Benditovich wrote:
> 
> 
> On Mon, Nov 12, 2018 at 10:53 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> 
>     On Mon, Nov 12, 2018 at 01:31:36PM +0200, Yuri Benditovich wrote:
>     >
>     >
>     > On Mon, Nov 12, 2018 at 11:26 AM Jason Wang <jasowang@redhat.com> wrote:
>     >
>     >
>     >     On 2018/11/12 下午4:57, Yuri Benditovich wrote:
>     >     >
>     >     > On Mon, Nov 12, 2018 at 4:54 AM Michael S. Tsirkin <mst@redhat.com
>     >     > <mailto:mst@redhat.com>> wrote:
>     >     >
>     >     >     On Sun, Nov 11, 2018 at 12:18:54PM +0200, Yuri Benditovich
>     wrote:
>     >     >     >     > @@ -66,12 +143,16 @@ typedef struct VirtIONet {
>     >     >     >     >      VirtIONetQueue *vqs;
>     >     >     >     >      VirtQueue *ctrl_vq;
>     >     >     >     >      NICState *nic;
>     >     >     >     > +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
>     >     >     >
>     >     >     >     what exactly happens with these chains on migration?
>     >     >     >
>     >     >     >
>     >     >     > This feature (software implementation of RSC in QEMU) is
>     >     >     intended to be used in
>     >     >     > the environment of certification tests which never uses
>     migration.
>     >     >
>     >     >     Should this feature disable migration then?
>     >     >
>     >     >
>     >     > IMO, this should not. But if you find it mandatory, please respond
>     and
>     >     > I will add the migration blocker.
>     >
>     >
>     >     So if my understanding is correct, it's safe to do nothing even if we
>     >     allow migration for RSC?
>     >
>     >
>     > This does not create any unrecoverable failure (assertion, BSOD),
>     although
>     > some data (coalesced parts of packets not delivered yet to guest) will be
>     lost.
> 
>     If guest has no way to detect none of these packets were ever
>     received by card, then I guess it's fine. Needs a
>     comment in case we start caring about packet loss around
>     migration.
> 
> 
> 
> Where is the best place to place such a comment in the code?

Maybe where we add the fields that we don't bother migrating.


> Are there more notes toward v3 of the patch?
> 

Not yet.

> 
>     >
>     >
>     >     Thanks
>     >
>     >
>     >
>
diff mbox series

Patch

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 385b1a03e9..43a7021409 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -41,6 +41,28 @@ 
 #define VIRTIO_NET_RX_QUEUE_MIN_SIZE VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE
 #define VIRTIO_NET_TX_QUEUE_MIN_SIZE VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE
 
+#define VIRTIO_NET_IP4_ADDR_SIZE   8        /* ipv4 saddr + daddr */
+
+#define VIRTIO_NET_TCP_FLAG         0x3F
+#define VIRTIO_NET_TCP_HDR_LENGTH   0xF000
+
+/* IPv4 max payload, 16 bits in the header */
+#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header))
+#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535
+
+/* header length value in ip header without option */
+#define VIRTIO_NET_IP4_HEADER_LENGTH 5
+
+#define ETH_IP6_HDR_SZ (ETH_HDR_SZ + IP6_HDR_SZ)
+#define VIRTIO_NET_IP6_ADDR_SIZE   32      /* ipv6 saddr + daddr */
+#define VIRTIO_NET_MAX_IP6_PAYLOAD VIRTIO_NET_MAX_TCP_PAYLOAD
+
+/* Purge coalesced packets timer interval, This value affects the performance
+   a lot, and should be tuned carefully, '300000'(300us) is the recommended
+   value to pass the WHQL test, '50000' can gain 2x netperf throughput with
+   tso/gso/gro 'off'. */
+#define VIRTIO_NET_RSC_INTERVAL 300000
+
 /*
  * Calculate the number of bytes up to and including the given 'field' of
  * 'container'.
@@ -628,6 +650,9 @@  static uint64_t virtio_net_get_features(VirtIODevice *vdev, uint64_t features,
     if (!get_vhost_net(nc->peer)) {
         return features;
     }
+    /* clear RSC_EXT feature in case of VHOST*/
+    virtio_clear_feature(&features, VIRTIO_NET_F_RSC_EXT);
+
     features = vhost_net_get_features(get_vhost_net(nc->peer), features);
     vdev->backend_features = features;
 
@@ -701,6 +726,11 @@  static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
                                virtio_has_feature(features,
                                                   VIRTIO_F_VERSION_1));
 
+    n->rsc4_enabled = virtio_has_feature(features, VIRTIO_NET_F_RSC_EXT) &&
+        virtio_has_feature(features, VIRTIO_NET_F_GUEST_TSO4);
+    n->rsc6_enabled = virtio_has_feature(features, VIRTIO_NET_F_RSC_EXT) &&
+        virtio_has_feature(features, VIRTIO_NET_F_GUEST_TSO6);
+
     if (n->has_vnet_hdr) {
         n->curr_guest_offloads =
             virtio_net_guest_offloads_by_features(features);
@@ -781,6 +811,12 @@  static int virtio_net_handle_offloads(VirtIONet *n, uint8_t cmd,
             return VIRTIO_NET_ERR;
         }
 
+        n->rsc4_enabled = virtio_has_feature(offloads, VIRTIO_NET_F_RSC_EXT) &&
+            virtio_has_feature(offloads, VIRTIO_NET_F_GUEST_TSO4);
+        n->rsc6_enabled = virtio_has_feature(offloads, VIRTIO_NET_F_RSC_EXT) &&
+            virtio_has_feature(offloads, VIRTIO_NET_F_GUEST_TSO6);
+        virtio_clear_feature(&offloads, VIRTIO_NET_F_RSC_EXT);
+
         supported_offloads = virtio_net_supported_guest_offloads(n);
         if (offloads & ~supported_offloads) {
             return VIRTIO_NET_ERR;
@@ -1253,6 +1289,7 @@  static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
             }
 
             receive_header(n, sg, elem->in_num, buf, size);
+
             offset = n->host_hdr_len;
             total += n->guest_hdr_len;
             guest_offset = n->guest_hdr_len;
@@ -1292,7 +1329,7 @@  static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
     return size;
 }
 
-static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf,
+static ssize_t virtio_net_do_receive(NetClientState *nc, const uint8_t *buf,
                                   size_t size)
 {
     ssize_t r;
@@ -1303,6 +1340,601 @@  static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf,
     return r;
 }
 
+static void virtio_net_rsc_extract_unit4(NetRscChain *chain,
+                                         const uint8_t *buf, NetRscUnit* unit)
+{
+    uint16_t ip_hdrlen;
+    struct ip_header *ip;
+
+    ip = (struct ip_header *)(buf + chain->n->guest_hdr_len
+                              + sizeof(struct eth_header));
+    unit->ip = (void *)ip;
+    ip_hdrlen = (ip->ip_ver_len & 0xF) << 2;
+    unit->ip_plen = &ip->ip_len;
+    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen);
+    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
+    unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen;
+}
+
+static void virtio_net_rsc_extract_unit6(NetRscChain *chain,
+                                         const uint8_t *buf, NetRscUnit* unit)
+{
+    struct ip6_header *ip6;
+
+    ip6 = (struct ip6_header *)(buf + chain->n->guest_hdr_len
+                                 + sizeof(struct eth_header));
+    unit->ip = ip6;
+    unit->ip_plen = &(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
+    unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip)\
+                                        + sizeof(struct ip6_header));
+    unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10;
+
+    /* There is a difference between payload lenght in ipv4 and v6,
+       ip header is excluded in ipv6 */
+    unit->payload = htons(*unit->ip_plen) - unit->tcp_hdrlen;
+}
+
+static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg)
+{
+    int ret;
+    struct virtio_net_hdr *h;
+
+    h = (struct virtio_net_hdr *)seg->buf;
+    h->flags = 0;
+    h->gso_type = VIRTIO_NET_HDR_GSO_NONE;
+
+    if (seg->is_coalesced) {
+        h->rsc_ext_num_packets = seg->packets;
+        h->rsc_ext_num_dupacks = seg->dup_ack;
+        h->flags = VIRTIO_NET_HDR_F_RSC_INFO;
+        if (chain->proto == ETH_P_IP) {
+            h->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+        } else {
+            h->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+        }
+    }
+
+    ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size);
+    QTAILQ_REMOVE(&chain->buffers, seg, next);
+    g_free(seg->buf);
+    g_free(seg);
+
+    return ret;
+}
+
+static void virtio_net_rsc_purge(void *opq)
+{
+    NetRscSeg *seg, *rn;
+    NetRscChain *chain = (NetRscChain *)opq;
+
+    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) {
+        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
+            chain->stat.purge_failed++;
+            continue;
+        }
+    }
+
+    chain->stat.timer++;
+    if (!QTAILQ_EMPTY(&chain->buffers)) {
+        timer_mod(chain->drain_timer,
+              qemu_clock_get_ns(QEMU_CLOCK_HOST) + chain->n->rsc_timeout);
+    }
+}
+
+static void virtio_net_rsc_cleanup(VirtIONet *n)
+{
+    NetRscChain *chain, *rn_chain;
+    NetRscSeg *seg, *rn_seg;
+
+    QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) {
+        QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) {
+            QTAILQ_REMOVE(&chain->buffers, seg, next);
+            g_free(seg->buf);
+            g_free(seg);
+        }
+
+        timer_del(chain->drain_timer);
+        timer_free(chain->drain_timer);
+        QTAILQ_REMOVE(&n->rsc_chains, chain, next);
+        g_free(chain);
+    }
+}
+
+static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc,
+                                     const uint8_t *buf, size_t size)
+{
+    uint16_t hdr_len;
+    NetRscSeg *seg;
+
+    hdr_len = chain->n->guest_hdr_len;
+    seg = g_malloc(sizeof(NetRscSeg));
+    seg->buf = g_malloc(hdr_len + sizeof(struct eth_header)
+        + sizeof(struct ip6_header) + VIRTIO_NET_MAX_TCP_PAYLOAD);
+    memcpy(seg->buf, buf, size);
+    seg->size = size;
+    seg->packets = 1;
+    seg->dup_ack = 0;
+    seg->is_coalesced = 0;
+    seg->nc = nc;
+
+    QTAILQ_INSERT_TAIL(&chain->buffers, seg, next);
+    chain->stat.cache++;
+
+    switch (chain->proto) {
+    case ETH_P_IP:
+        virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit);
+        break;
+    case ETH_P_IPV6:
+        virtio_net_rsc_extract_unit6(chain, seg->buf, &seg->unit);
+        break;
+    default:
+        g_assert_not_reached();
+    }
+}
+
+static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain,
+                                         NetRscSeg *seg, const uint8_t *buf,
+                                         struct tcp_header *n_tcp,
+                                         struct tcp_header *o_tcp)
+{
+    uint32_t nack, oack;
+    uint16_t nwin, owin;
+
+    nack = htonl(n_tcp->th_ack);
+    nwin = htons(n_tcp->th_win);
+    oack = htonl(o_tcp->th_ack);
+    owin = htons(o_tcp->th_win);
+
+    if ((nack - oack) >= VIRTIO_NET_MAX_TCP_PAYLOAD) {
+        chain->stat.ack_out_of_win++;
+        return RSC_FINAL;
+    } else if (nack == oack) {
+        /* duplicated ack or window probe */
+        if (nwin == owin) {
+            /* duplicated ack, add dup ack count due to whql test up to 1 */
+            chain->stat.dup_ack++;
+            return RSC_FINAL;
+        } else {
+            /* Coalesce window update */
+            o_tcp->th_win = n_tcp->th_win;
+            chain->stat.win_update++;
+            return RSC_COALESCE;
+        }
+    } else {
+        /* pure ack, go to 'C', finalize*/
+        chain->stat.pure_ack++;
+        return RSC_FINAL;
+    }
+}
+
+static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain,
+                                            NetRscSeg *seg, const uint8_t *buf,
+                                            NetRscUnit *n_unit)
+{
+    void *data;
+    uint16_t o_ip_len;
+    uint32_t nseq, oseq;
+    NetRscUnit *o_unit;
+
+    o_unit = &seg->unit;
+    o_ip_len = htons(*o_unit->ip_plen);
+    nseq = htonl(n_unit->tcp->th_seq);
+    oseq = htonl(o_unit->tcp->th_seq);
+
+    /* out of order or retransmitted. */
+    if ((nseq - oseq) > VIRTIO_NET_MAX_TCP_PAYLOAD) {
+        chain->stat.data_out_of_win++;
+        return RSC_FINAL;
+    }
+
+    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
+    if (nseq == oseq) {
+        if ((o_unit->payload == 0) && n_unit->payload) {
+            /* From no payload to payload, normal case, not a dup ack or etc */
+            chain->stat.data_after_pure_ack++;
+            goto coalesce;
+        } else {
+            return virtio_net_rsc_handle_ack(chain, seg, buf,
+                                             n_unit->tcp, o_unit->tcp);
+        }
+    } else if ((nseq - oseq) != o_unit->payload) {
+        /* Not a consistent packet, out of order */
+        chain->stat.data_out_of_order++;
+        return RSC_FINAL;
+    } else {
+coalesce:
+        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
+            chain->stat.over_size++;
+            return RSC_FINAL;
+        }
+
+        /* Here comes the right data, the payload length in v4/v6 is different,
+           so use the field value to update and record the new data len */
+        o_unit->payload += n_unit->payload; /* update new data len */
+
+        /* update field in ip header */
+        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
+
+        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced
+           for windows guest, while this may change the behavior for linux
+           guest (only if it uses RSC feature). */
+        o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags;
+
+        o_unit->tcp->th_ack = n_unit->tcp->th_ack;
+        o_unit->tcp->th_win = n_unit->tcp->th_win;
+
+        memmove(seg->buf + seg->size, data, n_unit->payload);
+        seg->size += n_unit->payload;
+        seg->packets++;
+        chain->stat.coalesced++;
+        return RSC_COALESCE;
+    }
+}
+
+static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg,
+                                        const uint8_t *buf, size_t size,
+                                        NetRscUnit *unit)
+{
+    struct ip_header *ip1, *ip2;
+
+    ip1 = (struct ip_header *)(unit->ip);
+    ip2 = (struct ip_header *)(seg->unit.ip);
+    if ((ip1->ip_src ^ ip2->ip_src) || (ip1->ip_dst ^ ip2->ip_dst)
+        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
+        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
+        chain->stat.no_match++;
+        return RSC_NO_MATCH;
+    }
+
+    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
+}
+
+static int32_t virtio_net_rsc_coalesce6(NetRscChain *chain, NetRscSeg *seg,
+                        const uint8_t *buf, size_t size, NetRscUnit *unit)
+{
+    struct ip6_header *ip1, *ip2;
+
+    ip1 = (struct ip6_header *)(unit->ip);
+    ip2 = (struct ip6_header *)(seg->unit.ip);
+    if (memcmp(&ip1->ip6_src, &ip2->ip6_src, sizeof(struct in6_address))
+        || memcmp(&ip1->ip6_dst, &ip2->ip6_dst, sizeof(struct in6_address))
+        || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport)
+        || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) {
+            chain->stat.no_match++;
+            return RSC_NO_MATCH;
+    }
+
+    return virtio_net_rsc_coalesce_data(chain, seg, buf, unit);
+}
+
+/* Packets with 'SYN' should bypass, other flag should be sent after drain
+ * to prevent out of order */
+static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain,
+                                         struct tcp_header *tcp)
+{
+    uint16_t tcp_hdr;
+    uint16_t tcp_flag;
+
+    tcp_flag = htons(tcp->th_offset_flags);
+    tcp_hdr = (tcp_flag & VIRTIO_NET_TCP_HDR_LENGTH) >> 10;
+    tcp_flag &= VIRTIO_NET_TCP_FLAG;
+    tcp_flag = htons(tcp->th_offset_flags) & 0x3F;
+    if (tcp_flag & TH_SYN) {
+        chain->stat.tcp_syn++;
+        return RSC_BYPASS;
+    }
+
+    if (tcp_flag & (TH_FIN | TH_URG | TH_RST | TH_ECE | TH_CWR)) {
+        chain->stat.tcp_ctrl_drain++;
+        return RSC_FINAL;
+    }
+
+    if (tcp_hdr > sizeof(struct tcp_header)) {
+        chain->stat.tcp_all_opt++;
+        return RSC_FINAL;
+    }
+
+    return RSC_CANDIDATE;
+}
+
+static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain,
+                                         NetClientState *nc,
+                                         const uint8_t *buf, size_t size,
+                                         NetRscUnit *unit)
+{
+    int ret;
+    NetRscSeg *seg, *nseg;
+
+    if (QTAILQ_EMPTY(&chain->buffers)) {
+        chain->stat.empty_cache++;
+        virtio_net_rsc_cache_buf(chain, nc, buf, size);
+        timer_mod(chain->drain_timer,
+              qemu_clock_get_ns(QEMU_CLOCK_HOST) + chain->n->rsc_timeout);
+        return size;
+    }
+
+    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
+        if (chain->proto == ETH_P_IP) {
+            ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit);
+        } else {
+            ret = virtio_net_rsc_coalesce6(chain, seg, buf, size, unit);
+        }
+
+        if (ret == RSC_FINAL) {
+            if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
+                /* Send failed */
+                chain->stat.final_failed++;
+                return 0;
+            }
+
+            /* Send current packet */
+            return virtio_net_do_receive(nc, buf, size);
+        } else if (ret == RSC_NO_MATCH) {
+            continue;
+        } else {
+            /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */
+            seg->is_coalesced = 1;
+            return size;
+        }
+    }
+
+    chain->stat.no_match_cache++;
+    virtio_net_rsc_cache_buf(chain, nc, buf, size);
+    return size;
+}
+
+/* Drain a connection data, this is to avoid out of order segments */
+static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState *nc,
+                                        const uint8_t *buf, size_t size,
+                                        uint16_t ip_start, uint16_t ip_size,
+                                        uint16_t tcp_port)
+{
+    NetRscSeg *seg, *nseg;
+    uint32_t ppair1, ppair2;
+
+    ppair1 = *(uint32_t *)(buf + tcp_port);
+    QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) {
+        ppair2 = *(uint32_t *)(seg->buf + tcp_port);
+        if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size)
+            || (ppair1 != ppair2)) {
+            continue;
+        }
+        if (virtio_net_rsc_drain_seg(chain, seg) == 0) {
+            chain->stat.drain_failed++;
+        }
+
+        break;
+    }
+
+    return virtio_net_do_receive(nc, buf, size);
+}
+
+static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain,
+                                            struct ip_header *ip,
+                                            const uint8_t *buf, size_t size)
+{
+    uint16_t ip_len;
+
+    /* Not an ipv4 packet */
+    if (((ip->ip_ver_len & 0xF0) >> 4) != IP_HEADER_VERSION_4) {
+        chain->stat.ip_option++;
+        return RSC_BYPASS;
+    }
+
+    /* Don't handle packets with ip option */
+    if ((ip->ip_ver_len & 0xF) != VIRTIO_NET_IP4_HEADER_LENGTH) {
+        chain->stat.ip_option++;
+        return RSC_BYPASS;
+    }
+
+    if (ip->ip_p != IPPROTO_TCP) {
+        chain->stat.bypass_not_tcp++;
+        return RSC_BYPASS;
+    }
+
+    /* Don't handle packets with ip fragment */
+    if (!(htons(ip->ip_off) & IP_DF)) {
+        chain->stat.ip_frag++;
+        return RSC_BYPASS;
+    }
+
+    /* Don't handle packets with ecn flag */
+    if (IPTOS_ECN(ip->ip_tos)) {
+        chain->stat.ip_ecn++;
+        return RSC_BYPASS;
+    }
+
+    ip_len = htons(ip->ip_len);
+    if (ip_len < (sizeof(struct ip_header) + sizeof(struct tcp_header))
+        || ip_len > (size - chain->n->guest_hdr_len -
+                     sizeof(struct eth_header))) {
+        chain->stat.ip_hacked++;
+        return RSC_BYPASS;
+    }
+
+    return RSC_CANDIDATE;
+}
+
+static size_t virtio_net_rsc_receive4(NetRscChain *chain, NetClientState* nc,
+                                      const uint8_t *buf, size_t size)
+{
+    int32_t ret;
+    uint16_t hdr_len;
+    NetRscUnit unit;
+
+    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
+
+    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header)
+        + sizeof(struct tcp_header))) {
+        chain->stat.bypass_not_tcp++;
+        return virtio_net_do_receive(nc, buf, size);
+    }
+
+    virtio_net_rsc_extract_unit4(chain, buf, &unit);
+    if (virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size)
+        != RSC_CANDIDATE) {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+
+    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
+    if (ret == RSC_BYPASS) {
+        return virtio_net_do_receive(nc, buf, size);
+    } else if (ret == RSC_FINAL) {
+        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
+                ((hdr_len + sizeof(struct eth_header)) + 12),
+                VIRTIO_NET_IP4_ADDR_SIZE,
+                hdr_len + sizeof(struct eth_header) + sizeof(struct ip_header));
+    }
+
+    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
+}
+
+static int32_t virtio_net_rsc_sanity_check6(NetRscChain *chain,
+                                            struct ip6_header *ip6,
+                                            const uint8_t *buf, size_t size)
+{
+    uint16_t ip_len;
+
+    if (((ip6->ip6_ctlun.ip6_un1.ip6_un1_flow & 0xF0) >> 4)
+        != IP_HEADER_VERSION_6) {
+        return RSC_BYPASS;
+    }
+
+    /* Both option and protocol is checked in this */
+    if (ip6->ip6_ctlun.ip6_un1.ip6_un1_nxt != IPPROTO_TCP) {
+        chain->stat.bypass_not_tcp++;
+        return RSC_BYPASS;
+    }
+
+    ip_len = htons(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen);
+    if (ip_len < sizeof(struct tcp_header) ||
+        ip_len > (size - chain->n->guest_hdr_len - sizeof(struct eth_header)
+                  - sizeof(struct ip6_header))) {
+        chain->stat.ip_hacked++;
+        return RSC_BYPASS;
+    }
+
+    /* Don't handle packets with ecn flag */
+    if (IP6_ECN(ip6->ip6_ctlun.ip6_un3.ip6_un3_ecn)) {
+        chain->stat.ip_ecn++;
+        return RSC_BYPASS;
+    }
+
+    return RSC_CANDIDATE;
+}
+
+static size_t virtio_net_rsc_receive6(void *opq, NetClientState* nc,
+                                      const uint8_t *buf, size_t size)
+{
+    int32_t ret;
+    uint16_t hdr_len;
+    NetRscChain *chain;
+    NetRscUnit unit;
+
+    chain = (NetRscChain *)opq;
+    hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len;
+
+    if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct ip6_header)
+        + sizeof(tcp_header))) {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+
+    virtio_net_rsc_extract_unit6(chain, buf, &unit);
+    if (RSC_CANDIDATE != virtio_net_rsc_sanity_check6(chain,
+                                                 unit.ip, buf, size)) {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+
+    ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp);
+    if (ret == RSC_BYPASS) {
+        return virtio_net_do_receive(nc, buf, size);
+    } else if (ret == RSC_FINAL) {
+        return virtio_net_rsc_drain_flow(chain, nc, buf, size,
+                ((hdr_len + sizeof(struct eth_header)) + 8),
+                VIRTIO_NET_IP6_ADDR_SIZE,
+                hdr_len + sizeof(struct eth_header)
+                + sizeof(struct ip6_header));
+    }
+
+    return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit);
+}
+
+static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n,
+                                                NetClientState *nc,
+                                                uint16_t proto)
+{
+    NetRscChain *chain;
+
+    if ((proto != (uint16_t)ETH_P_IP) && (proto != (uint16_t)ETH_P_IPV6)) {
+        return NULL;
+    }
+
+    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
+        if (chain->proto == proto) {
+            return chain;
+        }
+    }
+
+    chain = g_malloc(sizeof(*chain));
+    chain->n = n;
+    chain->proto = proto;
+    if (proto == (uint16_t)ETH_P_IP) {
+        chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD;
+        chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+    } else {
+        chain->max_payload = VIRTIO_NET_MAX_IP6_PAYLOAD;
+        chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+    }
+    chain->drain_timer = timer_new_ns(QEMU_CLOCK_HOST,
+                                      virtio_net_rsc_purge, chain);
+    memset(&chain->stat, 0, sizeof(chain->stat));
+
+    QTAILQ_INIT(&chain->buffers);
+    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
+
+    return chain;
+}
+
+static ssize_t virtio_net_rsc_receive(NetClientState *nc,
+                                      const uint8_t *buf,
+                                      size_t size)
+{
+    uint16_t proto;
+    NetRscChain *chain;
+    struct eth_header *eth;
+    VirtIONet *n;
+
+    n = qemu_get_nic_opaque(nc);
+    if (size < (n->host_hdr_len + sizeof(struct eth_header))) {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+
+    eth = (struct eth_header *)(buf + n->guest_hdr_len);
+    proto = htons(eth->h_proto);
+
+    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
+    if (chain) {
+        chain->stat.received++;
+        if (proto == (uint16_t)ETH_P_IP && n->rsc4_enabled) {
+            return virtio_net_rsc_receive4(chain, nc, buf, size);
+        } else if (proto == (uint16_t)ETH_P_IPV6 && n->rsc6_enabled) {
+            return virtio_net_rsc_receive6(chain, nc, buf, size);
+        }
+    }
+    return virtio_net_do_receive(nc, buf, size);
+}
+
+static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf,
+                                  size_t size)
+{
+    VirtIONet *n = qemu_get_nic_opaque(nc);
+    if ((n->rsc4_enabled || n->rsc6_enabled)) {
+        return virtio_net_rsc_receive(nc, buf, size);
+    } else {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+}
+
 static int32_t virtio_net_flush_tx(VirtIONetQueue *q);
 
 static void virtio_net_tx_complete(NetClientState *nc, ssize_t len)
@@ -1342,7 +1974,7 @@  static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
         ssize_t ret;
         unsigned int out_num;
         struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1], *out_sg;
-        struct virtio_net_hdr_mrg_rxbuf mhdr;
+        struct virtio_net_hdr_mrg_rxbuf hdr;
 
         elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement));
         if (!elem) {
@@ -1359,7 +1991,7 @@  static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
         }
 
         if (n->has_vnet_hdr) {
-            if (iov_to_buf(out_sg, out_num, 0, &mhdr, n->guest_hdr_len) <
+            if (iov_to_buf(out_sg, out_num, 0, &hdr, n->guest_hdr_len) <
                 n->guest_hdr_len) {
                 virtio_error(vdev, "virtio-net header incorrect");
                 virtqueue_detach_element(q->tx_vq, elem, 0);
@@ -1367,8 +1999,8 @@  static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
                 return -EINVAL;
             }
             if (n->needs_vnet_hdr_swap) {
-                virtio_net_hdr_swap(vdev, (void *) &mhdr);
-                sg2[0].iov_base = &mhdr;
+                virtio_net_hdr_swap(vdev, (void *) &hdr);
+                sg2[0].iov_base = &hdr;
                 sg2[0].iov_len = n->guest_hdr_len;
                 out_num = iov_copy(&sg2[1], ARRAY_SIZE(sg2) - 1,
                                    out_sg, out_num,
@@ -2075,6 +2707,7 @@  static void virtio_net_device_realize(DeviceState *dev, Error **errp)
     nc = qemu_get_queue(n->nic);
     nc->rxfilter_notify_enabled = 1;
 
+    QTAILQ_INIT(&n->rsc_chains);
     n->qdev = dev;
 }
 
@@ -2104,6 +2737,7 @@  static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
     timer_free(n->announce_timer);
     g_free(n->vqs);
     qemu_del_nic(n->nic);
+    virtio_net_rsc_cleanup(n);
     virtio_cleanup(vdev);
 }
 
@@ -2184,6 +2818,10 @@  static Property virtio_net_properties[] = {
     DEFINE_PROP_BIT64("ctrl_guest_offloads", VirtIONet, host_features,
                     VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, true),
     DEFINE_PROP_BIT64("mq", VirtIONet, host_features, VIRTIO_NET_F_MQ, false),
+    DEFINE_PROP_BIT64("guest_rsc_ext", VirtIONet, host_features,
+                    VIRTIO_NET_F_RSC_EXT, false),
+    DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout,
+                       VIRTIO_NET_RSC_INTERVAL),
     DEFINE_NIC_PROPERTIES(VirtIONet, nic_conf),
     DEFINE_PROP_UINT32("x-txtimer", VirtIONet, net_conf.txtimer,
                        TX_TIMER_INTERVAL),
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index 4d7f3c82ca..0da99dcfa0 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -44,6 +44,83 @@  typedef struct virtio_net_conf
     uint8_t duplex;
 } virtio_net_conf;
 
+/* Coalesced packets type & status */
+typedef enum {
+    RSC_COALESCE,           /* Data been coalesced */
+    RSC_FINAL,              /* Will terminate current connection */
+    RSC_NO_MATCH,           /* No matched in the buffer pool */
+    RSC_BYPASS,             /* Packet to be bypass, not tcp, tcp ctrl, etc */
+    RSC_CANDIDATE                /* Data want to be coalesced */
+} COALESCE_STATUS;
+
+typedef struct NetRscStat {
+    uint32_t received;
+    uint32_t coalesced;
+    uint32_t over_size;
+    uint32_t cache;
+    uint32_t empty_cache;
+    uint32_t no_match_cache;
+    uint32_t win_update;
+    uint32_t no_match;
+    uint32_t tcp_syn;
+    uint32_t tcp_ctrl_drain;
+    uint32_t dup_ack;
+    uint32_t dup_ack1;
+    uint32_t dup_ack2;
+    uint32_t pure_ack;
+    uint32_t ack_out_of_win;
+    uint32_t data_out_of_win;
+    uint32_t data_out_of_order;
+    uint32_t data_after_pure_ack;
+    uint32_t bypass_not_tcp;
+    uint32_t tcp_option;
+    uint32_t tcp_all_opt;
+    uint32_t ip_frag;
+    uint32_t ip_ecn;
+    uint32_t ip_hacked;
+    uint32_t ip_option;
+    uint32_t purge_failed;
+    uint32_t drain_failed;
+    uint32_t final_failed;
+    int64_t  timer;
+} NetRscStat;
+
+/* Rsc unit general info used to checking if can coalescing */
+typedef struct NetRscUnit {
+    void *ip;   /* ip header */
+    uint16_t *ip_plen;      /* data len pointer in ip header field */
+    struct tcp_header *tcp; /* tcp header */
+    uint16_t tcp_hdrlen;    /* tcp header len */
+    uint16_t payload;       /* pure payload without virtio/eth/ip/tcp */
+} NetRscUnit;
+
+/* Coalesced segmant */
+typedef struct NetRscSeg {
+    QTAILQ_ENTRY(NetRscSeg) next;
+    void *buf;
+    size_t size;
+    uint16_t packets;
+    uint16_t dup_ack;
+    bool is_coalesced;      /* need recal ipv4 header checksum, mark here */
+    NetRscUnit unit;
+    NetClientState *nc;
+} NetRscSeg;
+
+struct VirtIONet;
+typedef struct VirtIONet VirtIONet;
+
+/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */
+typedef struct NetRscChain {
+    QTAILQ_ENTRY(NetRscChain) next;
+    VirtIONet *n;                            /* VirtIONet */
+    uint16_t proto;
+    uint8_t  gso_type;
+    uint16_t max_payload;
+    QEMUTimer *drain_timer;
+    QTAILQ_HEAD(, NetRscSeg) buffers;
+    NetRscStat stat;
+} NetRscChain;
+
 /* Maximum packet size we can receive from tap device: header + 64k */
 #define VIRTIO_NET_MAX_BUFSIZE (sizeof(struct virtio_net_hdr) + (64 * KiB))
 
@@ -66,12 +143,16 @@  typedef struct VirtIONet {
     VirtIONetQueue *vqs;
     VirtQueue *ctrl_vq;
     NICState *nic;
+    QTAILQ_HEAD(, NetRscChain) rsc_chains;
     uint32_t tx_timeout;
     int32_t tx_burst;
     uint32_t has_vnet_hdr;
     size_t host_hdr_len;
     size_t guest_hdr_len;
     uint64_t host_features;
+    uint32_t rsc_timeout;
+    uint8_t rsc4_enabled;
+    uint8_t rsc6_enabled;
     uint8_t has_ufo;
     uint32_t mergeable_rx_bufs;
     uint8_t promisc;
diff --git a/include/net/eth.h b/include/net/eth.h
index e6dc8a7ba0..7f45c678e7 100644
--- a/include/net/eth.h
+++ b/include/net/eth.h
@@ -177,6 +177,8 @@  struct tcp_hdr {
 #define TH_PUSH 0x08
 #define TH_ACK  0x10
 #define TH_URG  0x20
+#define TH_ECE  0x40
+#define TH_CWR  0x80
     u_short th_win;      /* window */
     u_short th_sum;      /* checksum */
     u_short th_urp;      /* urgent pointer */
diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
index 260c3681d7..0d8658c06a 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -57,6 +57,10 @@ 
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_RSC_EXT	38
+#define VIRTIO_NET_F_GUEST_RSC4_DONT_USE	41	/* reserved */
+#define VIRTIO_NET_F_GUEST_RSC6_DONT_USE	42	/* reserved */
+
 #define VIRTIO_NET_F_STANDBY	  62	/* Act as standby for another device
 					 * with the same MAC.
 					 */
@@ -104,6 +108,7 @@  struct virtio_net_config {
 struct virtio_net_hdr_v1 {
 #define VIRTIO_NET_HDR_F_NEEDS_CSUM	1	/* Use csum_start, csum_offset */
 #define VIRTIO_NET_HDR_F_DATA_VALID	2	/* Csum is valid */
+#define VIRTIO_NET_HDR_F_RSC_INFO	4	/* rsc_ext data in csum_ fields */
 	uint8_t flags;
 #define VIRTIO_NET_HDR_GSO_NONE		0	/* Not a GSO frame */
 #define VIRTIO_NET_HDR_GSO_TCPV4	1	/* GSO frame, IPv4 TCP (TSO) */
@@ -118,6 +123,9 @@  struct virtio_net_hdr_v1 {
 	__virtio16 num_buffers;	/* Number of merged rx buffers */
 };
 
+#define rsc_ext_num_packets		csum_start
+#define rsc_ext_num_dupacks		csum_offset
+
 #ifndef VIRTIO_NET_NO_LEGACY
 /* This header comes first in the scatter-gather list.
  * For legacy virtio, if VIRTIO_F_ANY_LAYOUT is not negotiated, it must