Message ID | 20231027184657.83978-1-yepeilin.cs@gmail.com (mailing list archive)
---|---
State | Changes Requested
Delegated to: | BPF
Series | [net] veth: Fix RX stats for bpf_redirect_peer() traffic
Hi all,

On Fri, Oct 27, 2023 at 06:46:57PM +0000, Peilin Ye wrote:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 21d75108c2e9..7aca28b7d0fd 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2492,6 +2492,7 @@ int skb_do_redirect(struct sk_buff *skb)
>  			     net_eq(net, dev_net(dev))))
>  			goto out_drop;
>  		skb->dev = dev;
> +		dev_sw_netstats_rx_add(dev, skb->len);

This assumes that all devices that support BPF_F_PEER (currently only
veth) use tstats (instead of lstats, or dstats) - is that okay?  If not,
should I add another NDO, e.g. ->ndo_stats_rx_add()?

Thanks,
Peilin Ye
Hi Peilin,

On 10/27/23 9:02 PM, Peilin Ye wrote:
> On Fri, Oct 27, 2023 at 06:46:57PM +0000, Peilin Ye wrote:
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 21d75108c2e9..7aca28b7d0fd 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2492,6 +2492,7 @@ int skb_do_redirect(struct sk_buff *skb)
>>  			     net_eq(net, dev_net(dev))))
>>  			goto out_drop;
>>  		skb->dev = dev;
>> +		dev_sw_netstats_rx_add(dev, skb->len);
>
> This assumes that all devices that support BPF_F_PEER (currently only
> veth) use tstats (instead of lstats, or dstats) - is that okay?

Dumb question, but why all this change and not simply just call ...

  dev_lstats_add(dev, skb->len)

... on the host dev?

> If not, should I add another NDO, e.g. ->ndo_stats_rx_add()?

Definitely no new stats ndo or indirect call in the fast path.

Thanks,
Daniel
Hi Daniel,

Thanks for taking a look!

On Sat, Oct 28, 2023 at 09:06:44AM +0200, Daniel Borkmann wrote:
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 21d75108c2e9..7aca28b7d0fd 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -2492,6 +2492,7 @@ int skb_do_redirect(struct sk_buff *skb)
>>>  			     net_eq(net, dev_net(dev))))
>>>  			goto out_drop;
>>>  		skb->dev = dev;
>>> +		dev_sw_netstats_rx_add(dev, skb->len);
>>
>> This assumes that all devices that support BPF_F_PEER (currently only
>> veth) use tstats (instead of lstats, or dstats) - is that okay?
>
> Dumb question, but why all this change and not simply just call ...
>
>   dev_lstats_add(dev, skb->len)
>
> ... on the host dev?

Because I didn't want to update the host veth's TX counters: if we
bpf_redirect_peer() a packet from NIC TC ingress to Pod-veth TC ingress,
I think that means we've bypassed host-veth TX?

>> If not, should I add another NDO, e.g. ->ndo_stats_rx_add()?
>
> Definitely no new stats ndo or indirect call in the fast path.

Yeah, I think I'll put a comment saying that all devices that support
BPF_F_PEER must use tstats (or must use lstats), then.

Thanks,
Peilin Ye
On 10/29/23 1:11 AM, Peilin Ye wrote:
> On Sat, Oct 28, 2023 at 09:06:44AM +0200, Daniel Borkmann wrote:
>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>> index 21d75108c2e9..7aca28b7d0fd 100644
>>>> --- a/net/core/filter.c
>>>> +++ b/net/core/filter.c
>>>> @@ -2492,6 +2492,7 @@ int skb_do_redirect(struct sk_buff *skb)
>>>>  			     net_eq(net, dev_net(dev))))
>>>>  			goto out_drop;
>>>>  		skb->dev = dev;
>>>> +		dev_sw_netstats_rx_add(dev, skb->len);
>>>
>>> This assumes that all devices that support BPF_F_PEER (currently only
>>> veth) use tstats (instead of lstats, or dstats) - is that okay?
>>
>> Dumb question, but why all this change and not simply just call ...
>>
>>   dev_lstats_add(dev, skb->len)
>>
>> ... on the host dev?
>
> Because I didn't want to update the host veth's TX counters: if we
> bpf_redirect_peer() a packet from NIC TC ingress to Pod-veth TC ingress,
> I think that means we've bypassed host-veth TX?

Yes. So the idea is to transition to tstats: replace the spot where we
used to bump lstats with tstats' tx counter, and have only the peer
redirect bump the rx counter. Then, upon stats traversal, we fold the
latter into the rx stats, which are otherwise populated from the
opposite side's tx counters. Makes sense.

OT: does cadvisor run inside the Pod to collect the device stats? Just
curious how it gathers them.

>>> If not, should I add another NDO, e.g. ->ndo_stats_rx_add()?
>>
>> Definitely no new stats ndo or indirect call in the fast path.
>
> Yeah, I think I'll put a comment saying that all devices that support
> BPF_F_PEER must use tstats (or must use lstats), then.

sgtm.

Thanks,
Daniel
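The accounting described above can be sketched as a plain userspace C model. This is only an illustration: all struct and function names below are invented, and the real `pcpu_sw_netstats` counters are per-CPU with `u64_stats` syncing, which this model ignores.

```c
#include <assert.h>
#include <stdint.h>

/* Single-counter stand-in for struct pcpu_sw_netstats (tstats). */
struct sw_stats { uint64_t rx_packets, rx_bytes, tx_packets, tx_bytes; };

struct veth_model {
	struct sw_stats tstats;
	struct veth_model *peer;
};

/* veth_xmit() path: the transmitting device bumps its own tx counters. */
static void model_xmit(struct veth_model *dev, uint64_t len)
{
	dev->tstats.tx_packets += 1;
	dev->tstats.tx_bytes += len;
}

/* bpf_redirect_peer() path: host-veth TX is bypassed entirely, so only
 * the target device's rx counters are bumped. */
static void model_redirect_peer(struct veth_model *target, uint64_t len)
{
	target->tstats.rx_packets += 1;
	target->tstats.rx_bytes += len;
}

/* Stats traversal: fold the peer's tx counters into our rx stats, on
 * top of the rx counters bumped directly by peer redirects. */
static struct sw_stats model_get_stats(const struct veth_model *dev)
{
	struct sw_stats tot = dev->tstats;

	if (dev->peer) {
		tot.rx_packets += dev->peer->tstats.tx_packets;
		tot.rx_bytes += dev->peer->tstats.tx_bytes;
	}
	return tot;
}
```

With this folding, a device's RX totals stay correct whether traffic arrives via its peer's normal xmit or via a peer redirect that skipped the host side.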
On Mon, Oct 30, 2023 at 03:19:26PM +0100, Daniel Borkmann wrote:
> OT: does cadvisor run inside the Pod to collect the device stats? Just
> curious how it gathers them.

Seems like it reads from /proc/$PID/net/dev:

https://github.com/google/cadvisor/blob/d19df4f3b8e73c7e807d4b9894c252b54852fa8a/container/libcontainer/handler.go#L461

Thanks,
Peilin Ye
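For reference, each /proc/net/dev data line carries 16 counters per interface: 8 RX (bytes, packets, errs, drop, fifo, frame, compressed, multicast) followed by 8 TX (bytes, packets, errs, drop, fifo, colls, carrier, compressed). A minimal userspace parser for one such line might look like this; the helper name is invented:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Parse one data line of /proc/net/dev into the four counters most
 * tools care about; returns 0 on success, -1 on a malformed line. */
static int parse_netdev_line(const char *line, char name[32],
			     uint64_t *rx_bytes, uint64_t *rx_packets,
			     uint64_t *tx_bytes, uint64_t *tx_packets)
{
	unsigned long long f[16];
	int n;

	n = sscanf(line,
		   " %31[^:]: %llu %llu %llu %llu %llu %llu %llu %llu"
		   " %llu %llu %llu %llu %llu %llu %llu %llu",
		   name, &f[0], &f[1], &f[2], &f[3], &f[4], &f[5], &f[6],
		   &f[7], &f[8], &f[9], &f[10], &f[11], &f[12], &f[13],
		   &f[14], &f[15]);
	if (n != 17)
		return -1;
	*rx_bytes = f[0];	/* rx field 1: bytes */
	*rx_packets = f[1];	/* rx field 2: packets */
	*tx_bytes = f[8];	/* tx field 1: bytes */
	*tx_packets = f[9];	/* tx field 2: packets */
	return 0;
}
```

Reading the Pod's view of the stats then only requires opening /proc/$PID/net/dev for a process in the target network namespace, which is essentially what the cadvisor code linked above does.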
On Mon, 30 Oct 2023 15:19:26 +0100 Daniel Borkmann wrote:
>> Because I didn't want to update the host veth's TX counters: if we
>> bpf_redirect_peer() a packet from NIC TC ingress to Pod-veth TC ingress,
>> I think that means we've bypassed host-veth TX?
>
> Yes. So the idea is to transition to tstats: replace the spot where we
> used to bump lstats with tstats' tx counter, and have only the peer
> redirect bump the rx counter. Then, upon stats traversal, we fold the
> latter into the rx stats, which are otherwise populated from the
> opposite side's tx counters. Makes sense.
>
> OT: does cadvisor run inside the Pod to collect the device stats? Just
> curious how it gathers them.

Somewhat related - where does netkit count stats?

>>> Definitely no new stats ndo or indirect call in the fast path.
>>
>> Yeah, I think I'll put a comment saying that all devices that support
>> BPF_F_PEER must use tstats (or must use lstats), then.
>
> sgtm.

Is a comment good enough? Can we try to do something more robust?
Move the allocation of stats into the core at registration, based on
some u8 assigned in the driver? (I haven't looked at the code, TBH.)
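Jakub's registration-time idea could be sketched roughly as follows. This is a userspace model only; the enum, struct, and function names are all invented for illustration, and the per-CPU allocation is replaced by a plain `calloc`:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stats flavor declared by the driver before registration. */
enum model_stat_type {
	MODEL_STATS_NONE,
	MODEL_STATS_LSTATS,
	MODEL_STATS_TSTATS,
	MODEL_STATS_DSTATS,
};

struct sw_netstats { uint64_t rx_packets, rx_bytes, tx_packets, tx_bytes; };

struct netdev_model {
	uint8_t pcpu_stat_type;	/* the "u8 assigned in the driver" */
	void *stats;		/* allocated and freed by the core */
};

/* Core-side registration: allocate whichever flavor was declared, so
 * individual drivers can no longer get the alloc/free pairing wrong. */
static int model_register(struct netdev_model *dev)
{
	if (dev->pcpu_stat_type == MODEL_STATS_TSTATS) {
		dev->stats = calloc(1, sizeof(struct sw_netstats));
		if (!dev->stats)
			return -1; /* stands in for -ENOMEM */
	}
	/* lstats/dstats flavors elided in this sketch */
	return 0;
}

static void model_unregister(struct netdev_model *dev)
{
	free(dev->stats);
	dev->stats = NULL;
}

/* The fast path can check the declared type instead of trusting a
 * comment next to BPF_F_PEER. */
static int model_uses_tstats(const struct netdev_model *dev)
{
	return dev->pcpu_stat_type == MODEL_STATS_TSTATS;
}
```

The appeal over a comment is that the invariant becomes machine-checkable: any device advertising peer redirect could be verified (or rejected at registration) against its declared stats type.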
On 10/31/23 8:53 PM, Jakub Kicinski wrote:
> On Mon, 30 Oct 2023 15:19:26 +0100 Daniel Borkmann wrote:
>>> Because I didn't want to update the host veth's TX counters: if we
>>> bpf_redirect_peer() a packet from NIC TC ingress to Pod-veth TC ingress,
>>> I think that means we've bypassed host-veth TX?
>>
>> Yes. So the idea is to transition to tstats: replace the spot where we
>> used to bump lstats with tstats' tx counter, and have only the peer
>> redirect bump the rx counter. Then, upon stats traversal, we fold the
>> latter into the rx stats, which are otherwise populated from the
>> opposite side's tx counters. Makes sense.
>>
>> OT: does cadvisor run inside the Pod to collect the device stats? Just
>> curious how it gathers them.
>
> Somewhat related - where does netkit count stats?

Yeap, it needs this as well. I have a local branch here where I pushed
all of it - coming out soon; I was planning to add some selftests in
addition by the end of this week:

https://github.com/cilium/linux/commits/pr/ndo_peer

>>>> Definitely no new stats ndo or indirect call in the fast path.
>>>
>>> Yeah, I think I'll put a comment saying that all devices that support
>>> BPF_F_PEER must use tstats (or must use lstats), then.
>>
>> sgtm.
>
> Is a comment good enough? Can we try to do something more robust?
> Move the allocation of stats into the core at registration, based on
> some u8 assigned in the driver? (I haven't looked at the code, TBH.)

Hmm, not sure. One thing that comes to mind is lazy one-time allocation,
as in netdev_core_stats_inc(): whenever one of the helpers like
dev_sw_netstats_{rx,tx}_add() is called while dev->tstats is still NULL,
the core learns about the driver's intent. But tbh that doesn't feel
overly clean, and in the case of netdev_core_stats_inc() it's more of an
exception case rather than fast path.

The other option could be to have two small helpers in the core which
also set a flag:

static inline int netdev_tstats_alloc(struct net_device *dev)
{
	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
	if (!dev->tstats)
		return -ENOMEM;
	dev->priv_flags |= IFF_USES_TSTATS;
	return 0;
}

static inline void netdev_tstats_free(struct net_device *dev)
{
	free_percpu(dev->tstats);
}

They could then be used from .ndo_init/.ndo_uninit. Not sure if this
would be nicer overall, or whether we just leave it at the .ndo callback
comment for the time being, until really more users show up (which I
doubt, tbh).

Thanks,
Daniel
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9980517ed8b0..df7a7c21a46d 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -373,7 +373,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		skb_tx_timestamp(skb);
 		if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
 			if (!use_napi)
-				dev_lstats_add(dev, length);
+				dev_sw_netstats_tx_add(dev, 1, length);
 			else
 				__veth_xdp_flush(rq);
 		} else {
@@ -387,14 +387,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	return ret;
 }
 
-static u64 veth_stats_tx(struct net_device *dev, u64 *packets, u64 *bytes)
-{
-	struct veth_priv *priv = netdev_priv(dev);
-
-	dev_lstats_read(dev, packets, bytes);
-	return atomic64_read(&priv->dropped);
-}
-
 static void veth_stats_rx(struct veth_stats *result, struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -432,24 +424,24 @@ static void veth_get_stats64(struct net_device *dev,
 	struct veth_priv *priv = netdev_priv(dev);
 	struct net_device *peer;
 	struct veth_stats rx;
-	u64 packets, bytes;
 
-	tot->tx_dropped = veth_stats_tx(dev, &packets, &bytes);
-	tot->tx_bytes = bytes;
-	tot->tx_packets = packets;
+	tot->tx_dropped = atomic64_read(&priv->dropped);
+	dev_fetch_sw_netstats(tot, dev->tstats);
 
 	veth_stats_rx(&rx, dev);
 	tot->tx_dropped += rx.xdp_tx_err;
 	tot->rx_dropped = rx.rx_drops + rx.peer_tq_xdp_xmit_err;
-	tot->rx_bytes = rx.xdp_bytes;
-	tot->rx_packets = rx.xdp_packets;
+	tot->rx_bytes += rx.xdp_bytes;
+	tot->rx_packets += rx.xdp_packets;
 
 	rcu_read_lock();
 	peer = rcu_dereference(priv->peer);
 	if (peer) {
-		veth_stats_tx(peer, &packets, &bytes);
-		tot->rx_bytes += bytes;
-		tot->rx_packets += packets;
+		struct rtnl_link_stats64 tot_peer = {};
+
+		dev_fetch_sw_netstats(&tot_peer, peer->tstats);
+		tot->rx_bytes += tot_peer.tx_bytes;
+		tot->rx_packets += tot_peer.tx_packets;
 
 		veth_stats_rx(&rx, peer);
 		tot->tx_dropped += rx.peer_tq_xdp_xmit_err;
@@ -1508,13 +1500,13 @@ static int veth_dev_init(struct net_device *dev)
 {
 	int err;
 
-	dev->lstats = netdev_alloc_pcpu_stats(struct pcpu_lstats);
-	if (!dev->lstats)
+	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
+	if (!dev->tstats)
 		return -ENOMEM;
 
 	err = veth_alloc_queues(dev);
 	if (err) {
-		free_percpu(dev->lstats);
+		free_percpu(dev->tstats);
 		return err;
 	}
 
@@ -1524,7 +1516,7 @@ static int veth_dev_init(struct net_device *dev)
 static void veth_dev_free(struct net_device *dev)
 {
 	veth_free_queues(dev);
-	free_percpu(dev->lstats);
+	free_percpu(dev->tstats);
 }
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
diff --git a/net/core/filter.c b/net/core/filter.c
index 21d75108c2e9..7aca28b7d0fd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2492,6 +2492,7 @@ int skb_do_redirect(struct sk_buff *skb)
 			     net_eq(net, dev_net(dev))))
 			goto out_drop;
 		skb->dev = dev;
+		dev_sw_netstats_rx_add(dev, skb->len);
 		return -EAGAIN;
 	}
 	return flags & BPF_F_NEIGH ?