Message ID   | 20220303181607.1094358-15-eric.dumazet@gmail.com (mailing list archive)
State        | Superseded
Delegated to | Netdev Maintainers
Series       | tcp: BIG TCP implementation
On 3/3/22 11:16 AM, Eric Dumazet wrote:
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index b2ed2f6d4a9208aebfd17fd0c503cd1e37c39ee1..1e51ce1d74486392a26568852c5068fe9047296d 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -4910,6 +4910,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
>
>  	netdev->priv_flags |= IFF_UNICAST_FLT;
>
> +	netif_set_tso_ipv6_max_size(netdev, 512 * 1024);

How does the ConnectX hardware handle fairness for such large packet
sizes? For a 1500-byte MTU this means a single large TSO can cause the
hardware to generate 349 MTU-sized packets. Even a 4k MTU means 128
packets. This has an effect on the rate of packets hitting the next-hop
switch, for example.
On Thu, Mar 3, 2022 at 8:43 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 3/3/22 11:16 AM, Eric Dumazet wrote:
> > [ patch context snipped ]
> > +	netif_set_tso_ipv6_max_size(netdev, 512 * 1024);
>
> How does the ConnectX hardware handle fairness for such large packet
> sizes? For 1500 MTU this means a single large TSO can cause the H/W to
> generate 349 MTU sized packets. Even a 4k MTU means 128 packets. This
> has an effect on the rate of packets hitting the next hop switch for
> example.

I think ConnectX cards interleave packets from all TX queues; at least
the old CX3 has a parameter to control that.

Given that we can already send at line rate from a single TX queue, I
do not see why presenting larger TSO packets would change anything on
the wire. Do you think ConnectX adds an extra gap on the wire at the
end of a TSO train?
On 3/4/22 10:14 AM, Eric Dumazet wrote:
> On Thu, Mar 3, 2022 at 8:43 PM David Ahern <dsahern@kernel.org> wrote:
>> [ quoted patch and fairness question snipped ]
>
> I think ConnectX cards interleave packets from all TX queues, at least
> old CX3 have a parameter to control that.
>
> Given that we already can send at line rate, from a single TX queue, I
> do not see why presenting larger TSO packets would change anything on
> the wire?
>
> Do you think ConnectX adds an extra gap on the wire at the end of a
> TSO train?

It's not about 1 queue; my question was along several lines, e.g.:

1. The inter-packet gap for TSO-generated packets. With 512kB packets
   the burst is 8x what it is today.

2. Fairness within the hardware, as 1 queue potentially holds many
   512kB packets, and the impact on other queues (e.g., higher
   latency?) since it will take longer to split the larger packets
   into MTU-sized packets.

It is really about understanding the change this new default size is
going to have on users.
On Sat, Mar 5, 2022 at 8:36 AM David Ahern <dsahern@kernel.org> wrote:
>
> [ earlier context snipped ]
>
> It's not about 1 queue, my question was along several lines. e.g,
> 1. the inter-packet gap for TSO generated packets. With 512kB packets
> the burst is 8x from what it is today.

We did experiments with 185 KB (or 45 4K segments in our case [1]) and
got no increase in drops. We are deploying these limits.

[1] We increased MAX_SKB_FRAGS to 45, so that zero copy for both TX
and RX is possible.

Once your switches are rated for 100Gbit, just send them 100Gbit
traffic. Note that linux TCP already has a lot of burst-control and
pacing features.

> 2. the fairness within hardware as 1 queue has potentially many 512kB
> packets and the impact on other queues (e.g., higher latency?) since it
> will take longer to split the larger packets into MTU sized packets.

It depends on the NIC. Many NICs (including mlx4) have a per-queue
quantum, usually configurable in power-of-two steps (4K, 8K, 16K,
32K, ...). It means that one TSO packet is split into smaller chunks,
depending on the concurrent eligible TX queues.

Our NIC of the day at Google has an MTU quantum per queue. (This is
one of the reasons I added /sys/class/net/ethX/gro_flush_timeout:
sending TSO packets does not mean the receiver will receive this TSO
in a single train of received packets.)

> It is really about understanding the change this new default size is
> going to have on users.

Sure, but to be able to conduct experiments, and to allow TCP
congestion control to probe for bigger bursts, we need the core to
support bigger packets.

Then one can precisely tune the max GSO size that it wants, per
ethernet device, if the existing rate-limiting features do not help.
On 3/3/2022 8:16 PM, Eric Dumazet wrote:
> From: Coco Li <lixiaoyan@google.com>
>
> mlx5 supports LSOv2.
>
> IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header
> with JUMBO TLV for big packets.
>
> We need to ignore/skip this HBH header when populating TX descriptor.
>
> Note that ipv6_has_hopopt_jumbo() only recognizes very specific packet
> layout, thus mlx5e_sq_xmit_wqe() is taking care of this layout only.
>
> v2: clear hopbyhop in mlx5e_tx_get_gso_ihs()
>
> Signed-off-by: Coco Li <lixiaoyan@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Saeed Mahameed <saeedm@nvidia.com>
> Cc: Leon Romanovsky <leon@kernel.org>
> ---
>  .../net/ethernet/mellanox/mlx5/core/en_main.c |  1 +
>  .../net/ethernet/mellanox/mlx5/core/en_tx.c   | 82 +++++++++++++++----
>  2 files changed, 67 insertions(+), 16 deletions(-)
>

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>

Thanks.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index b2ed2f6d4a9208aebfd17fd0c503cd1e37c39ee1..1e51ce1d74486392a26568852c5068fe9047296d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4910,6 +4910,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 
 	netdev->priv_flags |= IFF_UNICAST_FLT;
 
+	netif_set_tso_ipv6_max_size(netdev, 512 * 1024);
 	mlx5e_set_netdev_dev_addr(netdev);
 	mlx5e_ipsec_build_netdev(priv);
 	mlx5e_tls_build_netdev(priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 2dc48406cd08d21ff94f665cd61ab9227f351215..c6f6ca2d216692e1d3fd99e540198b11145788cd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -40,6 +40,7 @@
 #include "en_accel/en_accel.h"
 #include "en_accel/ipsec_rxtx.h"
 #include "en/ptp.h"
+#include <net/ipv6.h>
 
 static void mlx5e_dma_unmap_wqe_err(struct mlx5e_txqsq *sq, u8 num_dma)
 {
@@ -130,23 +131,32 @@ mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 		sq->stats->csum_none++;
 }
 
+/* Returns the number of header bytes that we plan
+ * to inline later in the transmit descriptor
+ */
 static inline u16
-mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb)
+mlx5e_tx_get_gso_ihs(struct mlx5e_txqsq *sq, struct sk_buff *skb, int *hopbyhop)
 {
 	struct mlx5e_sq_stats *stats = sq->stats;
 	u16 ihs;
 
+	*hopbyhop = 0;
 	if (skb->encapsulation) {
 		ihs = skb_inner_transport_offset(skb) + inner_tcp_hdrlen(skb);
 		stats->tso_inner_packets++;
 		stats->tso_inner_bytes += skb->len - ihs;
 	} else {
-		if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)
+		if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
 			ihs = skb_transport_offset(skb) + sizeof(struct udphdr);
-		else
+		} else {
 			ihs = skb_transport_offset(skb) + tcp_hdrlen(skb);
+			if (ipv6_has_hopopt_jumbo(skb)) {
+				*hopbyhop = sizeof(struct hop_jumbo_hdr);
+				ihs -= sizeof(struct hop_jumbo_hdr);
+			}
+		}
 		stats->tso_packets++;
-		stats->tso_bytes += skb->len - ihs;
+		stats->tso_bytes += skb->len - ihs - *hopbyhop;
 	}
 
 	return ihs;
@@ -208,6 +218,7 @@ struct mlx5e_tx_attr {
 	__be16 mss;
 	u16 insz;
 	u8 opcode;
+	u8 hopbyhop;
 };
 
 struct mlx5e_tx_wqe_attr {
@@ -244,14 +255,16 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	struct mlx5e_sq_stats *stats = sq->stats;
 
 	if (skb_is_gso(skb)) {
-		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb);
+		int hopbyhop;
+		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
 
 		*attr = (struct mlx5e_tx_attr) {
 			.opcode    = MLX5_OPCODE_LSO,
 			.mss       = cpu_to_be16(skb_shinfo(skb)->gso_size),
 			.ihs       = ihs,
 			.num_bytes = skb->len + (skb_shinfo(skb)->gso_segs - 1) * ihs,
-			.headlen   = skb_headlen(skb) - ihs,
+			.headlen   = skb_headlen(skb) - ihs - hopbyhop,
+			.hopbyhop  = hopbyhop,
 		};
 
 		stats->packets += skb_shinfo(skb)->gso_segs;
@@ -365,7 +378,8 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	struct mlx5_wqe_eth_seg *eseg;
 	struct mlx5_wqe_data_seg *dseg;
 	struct mlx5e_tx_wqe_info *wi;
-
+	u16 ihs = attr->ihs;
+	struct ipv6hdr *h6;
 	struct mlx5e_sq_stats *stats = sq->stats;
 	int num_dma;
 
@@ -379,15 +393,36 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 
 	eseg->mss = attr->mss;
 
-	if (attr->ihs) {
-		if (skb_vlan_tag_present(skb)) {
-			eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs + VLAN_HLEN);
-			mlx5e_insert_vlan(eseg->inline_hdr.start, skb, attr->ihs);
+	if (ihs) {
+		u8 *start = eseg->inline_hdr.start;
+
+		if (unlikely(attr->hopbyhop)) {
+			/* remove the HBH header.
+			 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
+			 */
+			if (skb_vlan_tag_present(skb)) {
+				mlx5e_insert_vlan(start, skb, ETH_HLEN + sizeof(*h6));
+				ihs += VLAN_HLEN;
+				h6 = (struct ipv6hdr *)(start + sizeof(struct vlan_ethhdr));
+			} else {
+				memcpy(start, skb->data, ETH_HLEN + sizeof(*h6));
+				h6 = (struct ipv6hdr *)(start + ETH_HLEN);
+			}
+			h6->nexthdr = IPPROTO_TCP;
+			/* Copy the TCP header after the IPv6 one */
+			memcpy(h6 + 1,
+			       skb->data + ETH_HLEN + sizeof(*h6) +
+					sizeof(struct hop_jumbo_hdr),
+			       tcp_hdrlen(skb));
+			/* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */
+		} else if (skb_vlan_tag_present(skb)) {
+			mlx5e_insert_vlan(start, skb, ihs);
+			ihs += VLAN_HLEN;
 			stats->added_vlan_packets++;
 		} else {
-			eseg->inline_hdr.sz |= cpu_to_be16(attr->ihs);
-			memcpy(eseg->inline_hdr.start, skb->data, attr->ihs);
+			memcpy(start, skb->data, ihs);
 		}
+		eseg->inline_hdr.sz |= cpu_to_be16(ihs);
 		dseg += wqe_attr->ds_cnt_inl;
 	} else if (skb_vlan_tag_present(skb)) {
 		eseg->insert.type = cpu_to_be16(MLX5_ETH_WQE_INSERT_VLAN);
@@ -398,7 +433,7 @@ mlx5e_sq_xmit_wqe(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	}
 
 	dseg += wqe_attr->ds_cnt_ids;
-	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr->ihs,
+	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr->ihs + attr->hopbyhop,
 					  attr->headlen, dseg);
 	if (unlikely(num_dma < 0))
 		goto err_drop;
@@ -918,12 +953,27 @@ void mlx5i_sq_xmit(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 	eseg->mss = attr.mss;
 
 	if (attr.ihs) {
-		memcpy(eseg->inline_hdr.start, skb->data, attr.ihs);
+		if (unlikely(attr.hopbyhop)) {
+			/* remove the HBH header.
+			 * Layout: [Ethernet header][IPv6 header][HBH][TCP header]
+			 */
+			memcpy(eseg->inline_hdr.start, skb->data, ETH_HLEN + sizeof(*h6));
+			h6 = (struct ipv6hdr *)((char *)eseg->inline_hdr.start + ETH_HLEN);
+			h6->nexthdr = IPPROTO_TCP;
+			/* Copy the TCP header after the IPv6 one */
+			memcpy(h6 + 1,
+			       skb->data + ETH_HLEN + sizeof(*h6) +
+					sizeof(struct hop_jumbo_hdr),
+			       tcp_hdrlen(skb));
+			/* Leave ipv6 payload_len set to 0, as LSO v2 specs request. */
+		} else {
+			memcpy(eseg->inline_hdr.start, skb->data, attr.ihs);
+		}
 		eseg->inline_hdr.sz = cpu_to_be16(attr.ihs);
 		dseg += wqe_attr.ds_cnt_inl;
 	}
 
-	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr.ihs,
+	num_dma = mlx5e_txwqe_build_dsegs(sq, skb, skb->data + attr.ihs + attr.hopbyhop,
 					  attr.headlen, dseg);
 	if (unlikely(num_dma < 0))
 		goto err_drop;