[RFC,v2,net-next,0/5] net: Qdisc backpressure infrastructure

Message ID	cover.1661158173.git.peilin.ye@bytedance.com (mailing list archive)
Headers	show Return-Path: <netdev-owner@kernel.org> From: Peilin Ye <yepeilin.cs@gmail.com> To: "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Jonathan Corbet <corbet@lwn.net>, Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>, David Ahern <dsahern@kernel.org>, Jamal Hadi Salim <jhs@mojatatu.com>, Cong Wang <xiyou.wangcong@gmail.com>, Jiri Pirko <jiri@resnulli.us> Cc: Peilin Ye <peilin.ye@bytedance.com>, netdev@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, Cong Wang <cong.wang@bytedance.com>, Stephen Hemminger <stephen@networkplumber.org>, Dave Taht <dave.taht@gmail.com>, Peilin Ye <yepeilin.cs@gmail.com> Subject: [PATCH RFC v2 net-next 0/5] net: Qdisc backpressure infrastructure Date: Mon, 22 Aug 2022 02:10:17 -0700 Message-Id: <cover.1661158173.git.peilin.ye@bytedance.com> In-Reply-To: <cover.1651800598.git.peilin.ye@bytedance.com> References: <cover.1651800598.git.peilin.ye@bytedance.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	net: Qdisc backpressure infrastructure \| expand [RFC,v2,net-next,0/5] net: Qdisc backpressure infrastructure [RFC,v2,net-next,1/5] net: Introduce Qdisc backpressure infrastructure [RFC,v2,net-next,2/5] net/udp: Implement Qdisc backpressure algorithm [RFC,v2,net-next,3/5] net/sched: sch_tbf: Use Qdisc backpressure infrastructure [RFC,v2,net-next,4/5] net/sched: sch_htb: Use Qdisc backpressure infrastructure [RFC,v2,net-next,5/5] net/sched: sch_cbq: Use Qdisc backpressure infrastructure

Peilin Ye Aug. 22, 2022, 9:10 a.m. UTC

From: Peilin Ye <peilin.ye@bytedance.com>

Hi all,

Currently sockets (especially UDP ones) can drop a lot of packets at TC
egress when rate limited by shaper Qdiscs like HTB.  This patchset series
tries to solve this by introducing a Qdisc backpressure mechanism.

RFC v1 [1] used a throttle & unthrottle approach, which introduced several
issues, including a thundering herd problem and a socket reference count
issue [2].  This RFC v2 uses a different approach to avoid those issues:

  1. When a shaper Qdisc drops a packet that belongs to a local socket due
     to TC egress congestion, we make part of the socket's sndbuf
     temporarily unavailable, so it sends slower.
  
  2. Later, when TC egress becomes idle again, we gradually recover the
     socket's sndbuf back to normal.  Patch 2 implements this step using a
     timer for UDP sockets.

The thundering herd problem is avoided, since we no longer wake up all
throttled sockets at the same time in qdisc_watchdog().  The socket
reference count issue is also avoided, since we no longer maintain socket
list on Qdisc.

Performance is better than RFC v1.  There is one concern about fairness
between flows for TBF Qdisc, which could be solved by using a SFQ inner
Qdisc.

Please see the individual patches for details and numbers.  Any comments,
suggestions would be much appreciated.  Thanks!

[1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
[2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/

Peilin Ye (5):
  net: Introduce Qdisc backpressure infrastructure
  net/udp: Implement Qdisc backpressure algorithm
  net/sched: sch_tbf: Use Qdisc backpressure infrastructure
  net/sched: sch_htb: Use Qdisc backpressure infrastructure
  net/sched: sch_cbq: Use Qdisc backpressure infrastructure

 Documentation/networking/ip-sysctl.rst | 11 ++++
 include/linux/udp.h                    |  3 ++
 include/net/netns/ipv4.h               |  1 +
 include/net/sch_generic.h              | 11 ++++
 include/net/sock.h                     | 21 ++++++++
 include/net/udp.h                      |  1 +
 net/core/sock.c                        |  5 +-
 net/ipv4/sysctl_net_ipv4.c             |  7 +++
 net/ipv4/udp.c                         | 69 +++++++++++++++++++++++++-
 net/ipv6/udp.c                         |  2 +-
 net/sched/sch_cbq.c                    |  1 +
 net/sched/sch_htb.c                    |  2 +
 net/sched/sch_tbf.c                    |  2 +
 13 files changed, 132 insertions(+), 4 deletions(-)

Jakub Kicinski Aug. 22, 2022, 4:17 p.m. UTC | #1

On Mon, 22 Aug 2022 02:10:17 -0700 Peilin Ye wrote:
> Currently sockets (especially UDP ones) can drop a lot of packets at TC
> egress when rate limited by shaper Qdiscs like HTB.  This patchset series
> tries to solve this by introducing a Qdisc backpressure mechanism.
> 
> RFC v1 [1] used a throttle & unthrottle approach, which introduced several
> issues, including a thundering herd problem and a socket reference count
> issue [2].  This RFC v2 uses a different approach to avoid those issues:
> 
>   1. When a shaper Qdisc drops a packet that belongs to a local socket due
>      to TC egress congestion, we make part of the socket's sndbuf
>      temporarily unavailable, so it sends slower.
>   
>   2. Later, when TC egress becomes idle again, we gradually recover the
>      socket's sndbuf back to normal.  Patch 2 implements this step using a
>      timer for UDP sockets.
> 
> The thundering herd problem is avoided, since we no longer wake up all
> throttled sockets at the same time in qdisc_watchdog().  The socket
> reference count issue is also avoided, since we no longer maintain socket
> list on Qdisc.
> 
> Performance is better than RFC v1.  There is one concern about fairness
> between flows for TBF Qdisc, which could be solved by using a SFQ inner
> Qdisc.
> 
> Please see the individual patches for details and numbers.  Any comments,
> suggestions would be much appreciated.  Thanks!
> 
> [1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
> [2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/

Similarly to Eric's comments on v1 I'm not seeing the clear motivation
here. Modern high speed UDP users will have a CC in user space, back
off and set transmission time on the packets. Could you describe your
_actual_ use case / application in more detail?

Eric Dumazet Aug. 22, 2022, 4:22 p.m. UTC | #2

On Mon, Aug 22, 2022 at 2:10 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>
> From: Peilin Ye <peilin.ye@bytedance.com>
>
> Hi all,
>
> Currently sockets (especially UDP ones) can drop a lot of packets at TC
> egress when rate limited by shaper Qdiscs like HTB.  This patchset series
> tries to solve this by introducing a Qdisc backpressure mechanism.
>
> RFC v1 [1] used a throttle & unthrottle approach, which introduced several
> issues, including a thundering herd problem and a socket reference count
> issue [2].  This RFC v2 uses a different approach to avoid those issues:
>
>   1. When a shaper Qdisc drops a packet that belongs to a local socket due
>      to TC egress congestion, we make part of the socket's sndbuf
>      temporarily unavailable, so it sends slower.
>
>   2. Later, when TC egress becomes idle again, we gradually recover the
>      socket's sndbuf back to normal.  Patch 2 implements this step using a
>      timer for UDP sockets.
>
> The thundering herd problem is avoided, since we no longer wake up all
> throttled sockets at the same time in qdisc_watchdog().  The socket
> reference count issue is also avoided, since we no longer maintain socket
> list on Qdisc.
>
> Performance is better than RFC v1.  There is one concern about fairness
> between flows for TBF Qdisc, which could be solved by using a SFQ inner
> Qdisc.
>
> Please see the individual patches for details and numbers.  Any comments,
> suggestions would be much appreciated.  Thanks!
>
> [1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
> [2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/
>
> Peilin Ye (5):
>   net: Introduce Qdisc backpressure infrastructure
>   net/udp: Implement Qdisc backpressure algorithm
>   net/sched: sch_tbf: Use Qdisc backpressure infrastructure
>   net/sched: sch_htb: Use Qdisc backpressure infrastructure
>   net/sched: sch_cbq: Use Qdisc backpressure infrastructure
>

I think the whole idea is wrong.

Packet schedulers can be remote (offloaded, or on another box)

The idea of going back to socket level from a packet scheduler should
really be a last resort.

Issue of having UDP sockets being able to flood a network is tough, I
am not sure the core networking stack
should pretend it can solve the issue.

Note that FQ based packet schedulers can also help already.

Cong Wang Aug. 29, 2022, 4:47 p.m. UTC | #3

On Mon, Aug 22, 2022 at 09:22:39AM -0700, Eric Dumazet wrote:
> On Mon, Aug 22, 2022 at 2:10 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
> >
> > From: Peilin Ye <peilin.ye@bytedance.com>
> >
> > Hi all,
> >
> > Currently sockets (especially UDP ones) can drop a lot of packets at TC
> > egress when rate limited by shaper Qdiscs like HTB.  This patchset series
> > tries to solve this by introducing a Qdisc backpressure mechanism.
> >
> > RFC v1 [1] used a throttle & unthrottle approach, which introduced several
> > issues, including a thundering herd problem and a socket reference count
> > issue [2].  This RFC v2 uses a different approach to avoid those issues:
> >
> >   1. When a shaper Qdisc drops a packet that belongs to a local socket due
> >      to TC egress congestion, we make part of the socket's sndbuf
> >      temporarily unavailable, so it sends slower.
> >
> >   2. Later, when TC egress becomes idle again, we gradually recover the
> >      socket's sndbuf back to normal.  Patch 2 implements this step using a
> >      timer for UDP sockets.
> >
> > The thundering herd problem is avoided, since we no longer wake up all
> > throttled sockets at the same time in qdisc_watchdog().  The socket
> > reference count issue is also avoided, since we no longer maintain socket
> > list on Qdisc.
> >
> > Performance is better than RFC v1.  There is one concern about fairness
> > between flows for TBF Qdisc, which could be solved by using a SFQ inner
> > Qdisc.
> >
> > Please see the individual patches for details and numbers.  Any comments,
> > suggestions would be much appreciated.  Thanks!
> >
> > [1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
> > [2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/
> >
> > Peilin Ye (5):
> >   net: Introduce Qdisc backpressure infrastructure
> >   net/udp: Implement Qdisc backpressure algorithm
> >   net/sched: sch_tbf: Use Qdisc backpressure infrastructure
> >   net/sched: sch_htb: Use Qdisc backpressure infrastructure
> >   net/sched: sch_cbq: Use Qdisc backpressure infrastructure
> >
> 
> I think the whole idea is wrong.
> 

Be more specific?

> Packet schedulers can be remote (offloaded, or on another box)

This is not the case we are dealing with (yet).

> 
> The idea of going back to socket level from a packet scheduler should
> really be a last resort.

I think it should be the first resort, as we should backpressure to the
source, rather than anything in the middle.

> 
> Issue of having UDP sockets being able to flood a network is tough, I
> am not sure the core networking stack
> should pretend it can solve the issue.

It seems you misunderstand it here, we are not dealing with UDP on the
network, just on an end host. The backpressure we are dealing with is
from Qdisc to socket on _TX side_ and on one single host.

> 
> Note that FQ based packet schedulers can also help already.

It only helps TCP pacing.

Thanks.

Cong Wang Aug. 29, 2022, 4:53 p.m. UTC | #4

On Mon, Aug 22, 2022 at 09:17:37AM -0700, Jakub Kicinski wrote:
> On Mon, 22 Aug 2022 02:10:17 -0700 Peilin Ye wrote:
> > Currently sockets (especially UDP ones) can drop a lot of packets at TC
> > egress when rate limited by shaper Qdiscs like HTB.  This patchset series
> > tries to solve this by introducing a Qdisc backpressure mechanism.
> > 
> > RFC v1 [1] used a throttle & unthrottle approach, which introduced several
> > issues, including a thundering herd problem and a socket reference count
> > issue [2].  This RFC v2 uses a different approach to avoid those issues:
> > 
> >   1. When a shaper Qdisc drops a packet that belongs to a local socket due
> >      to TC egress congestion, we make part of the socket's sndbuf
> >      temporarily unavailable, so it sends slower.
> >   
> >   2. Later, when TC egress becomes idle again, we gradually recover the
> >      socket's sndbuf back to normal.  Patch 2 implements this step using a
> >      timer for UDP sockets.
> > 
> > The thundering herd problem is avoided, since we no longer wake up all
> > throttled sockets at the same time in qdisc_watchdog().  The socket
> > reference count issue is also avoided, since we no longer maintain socket
> > list on Qdisc.
> > 
> > Performance is better than RFC v1.  There is one concern about fairness
> > between flows for TBF Qdisc, which could be solved by using a SFQ inner
> > Qdisc.
> > 
> > Please see the individual patches for details and numbers.  Any comments,
> > suggestions would be much appreciated.  Thanks!
> > 
> > [1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
> > [2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/
> 
> Similarly to Eric's comments on v1 I'm not seeing the clear motivation
> here. Modern high speed UDP users will have a CC in user space, back
> off and set transmission time on the packets. Could you describe your
> _actual_ use case / application in more detail?

Not everyone implements QUIC or CC, it is really hard to implement CC
from scratch. This backpressure mechnism is much simpler than CC (TCP or
QUIC), as clearly it does not deal with any remote congestions.

And, although this patchset only implements UDP backpressure, it can be
applied to any other protocol easily, it is protocol-independent.

Thanks.

Eric Dumazet Aug. 29, 2022, 4:53 p.m. UTC | #5

On Mon, Aug 29, 2022 at 9:47 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Mon, Aug 22, 2022 at 09:22:39AM -0700, Eric Dumazet wrote:
> > On Mon, Aug 22, 2022 at 2:10 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
> > >
> > > From: Peilin Ye <peilin.ye@bytedance.com>
> > >
> > > Hi all,
> > >
> > > Currently sockets (especially UDP ones) can drop a lot of packets at TC
> > > egress when rate limited by shaper Qdiscs like HTB.  This patchset series
> > > tries to solve this by introducing a Qdisc backpressure mechanism.
> > >
> > > RFC v1 [1] used a throttle & unthrottle approach, which introduced several
> > > issues, including a thundering herd problem and a socket reference count
> > > issue [2].  This RFC v2 uses a different approach to avoid those issues:
> > >
> > >   1. When a shaper Qdisc drops a packet that belongs to a local socket due
> > >      to TC egress congestion, we make part of the socket's sndbuf
> > >      temporarily unavailable, so it sends slower.
> > >
> > >   2. Later, when TC egress becomes idle again, we gradually recover the
> > >      socket's sndbuf back to normal.  Patch 2 implements this step using a
> > >      timer for UDP sockets.
> > >
> > > The thundering herd problem is avoided, since we no longer wake up all
> > > throttled sockets at the same time in qdisc_watchdog().  The socket
> > > reference count issue is also avoided, since we no longer maintain socket
> > > list on Qdisc.
> > >
> > > Performance is better than RFC v1.  There is one concern about fairness
> > > between flows for TBF Qdisc, which could be solved by using a SFQ inner
> > > Qdisc.
> > >
> > > Please see the individual patches for details and numbers.  Any comments,
> > > suggestions would be much appreciated.  Thanks!
> > >
> > > [1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
> > > [2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/
> > >
> > > Peilin Ye (5):
> > >   net: Introduce Qdisc backpressure infrastructure
> > >   net/udp: Implement Qdisc backpressure algorithm
> > >   net/sched: sch_tbf: Use Qdisc backpressure infrastructure
> > >   net/sched: sch_htb: Use Qdisc backpressure infrastructure
> > >   net/sched: sch_cbq: Use Qdisc backpressure infrastructure
> > >
> >
> > I think the whole idea is wrong.
> >
>
> Be more specific?
>
> > Packet schedulers can be remote (offloaded, or on another box)
>
> This is not the case we are dealing with (yet).
>
> >
> > The idea of going back to socket level from a packet scheduler should
> > really be a last resort.
>
> I think it should be the first resort, as we should backpressure to the
> source, rather than anything in the middle.
>
> >
> > Issue of having UDP sockets being able to flood a network is tough, I
> > am not sure the core networking stack
> > should pretend it can solve the issue.
>
> It seems you misunderstand it here, we are not dealing with UDP on the
> network, just on an end host. The backpressure we are dealing with is
> from Qdisc to socket on _TX side_ and on one single host.
>
> >
> > Note that FQ based packet schedulers can also help already.
>
> It only helps TCP pacing.

FQ : Fair Queue.

It definitely helps without the pacing part...

>
> Thanks.

Jakub Kicinski Aug. 30, 2022, 12:21 a.m. UTC | #6

On Mon, 29 Aug 2022 09:53:17 -0700 Cong Wang wrote:
> > Similarly to Eric's comments on v1 I'm not seeing the clear motivation
> > here. Modern high speed UDP users will have a CC in user space, back
> > off and set transmission time on the packets. Could you describe your
> > _actual_ use case / application in more detail?  
> 
> Not everyone implements QUIC or CC, it is really hard to implement CC
> from scratch. This backpressure mechnism is much simpler than CC (TCP or
> QUIC), as clearly it does not deal with any remote congestions.
> 
> And, although this patchset only implements UDP backpressure, it can be
> applied to any other protocol easily, it is protocol-independent.

No disagreement on any of your points. But I don't feel like 
you answered my question about the details of the use case.

Yafang Shao Aug. 30, 2022, 2:28 a.m. UTC | #7

On Tue, Aug 23, 2022 at 1:02 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Aug 22, 2022 at 2:10 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
> >
> > From: Peilin Ye <peilin.ye@bytedance.com>
> >
> > Hi all,
> >
> > Currently sockets (especially UDP ones) can drop a lot of packets at TC
> > egress when rate limited by shaper Qdiscs like HTB.  This patchset series
> > tries to solve this by introducing a Qdisc backpressure mechanism.
> >
> > RFC v1 [1] used a throttle & unthrottle approach, which introduced several
> > issues, including a thundering herd problem and a socket reference count
> > issue [2].  This RFC v2 uses a different approach to avoid those issues:
> >
> >   1. When a shaper Qdisc drops a packet that belongs to a local socket due
> >      to TC egress congestion, we make part of the socket's sndbuf
> >      temporarily unavailable, so it sends slower.
> >
> >   2. Later, when TC egress becomes idle again, we gradually recover the
> >      socket's sndbuf back to normal.  Patch 2 implements this step using a
> >      timer for UDP sockets.
> >
> > The thundering herd problem is avoided, since we no longer wake up all
> > throttled sockets at the same time in qdisc_watchdog().  The socket
> > reference count issue is also avoided, since we no longer maintain socket
> > list on Qdisc.
> >
> > Performance is better than RFC v1.  There is one concern about fairness
> > between flows for TBF Qdisc, which could be solved by using a SFQ inner
> > Qdisc.
> >
> > Please see the individual patches for details and numbers.  Any comments,
> > suggestions would be much appreciated.  Thanks!
> >
> > [1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
> > [2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/
> >
> > Peilin Ye (5):
> >   net: Introduce Qdisc backpressure infrastructure
> >   net/udp: Implement Qdisc backpressure algorithm
> >   net/sched: sch_tbf: Use Qdisc backpressure infrastructure
> >   net/sched: sch_htb: Use Qdisc backpressure infrastructure
> >   net/sched: sch_cbq: Use Qdisc backpressure infrastructure
> >
>
> I think the whole idea is wrong.
>
> Packet schedulers can be remote (offloaded, or on another box)
>
> The idea of going back to socket level from a packet scheduler should
> really be a last resort.
>
> Issue of having UDP sockets being able to flood a network is tough, I
> am not sure the core networking stack
> should pretend it can solve the issue.
>
> Note that FQ based packet schedulers can also help already.

We encounter a similar issue when using (fq + edt-bpf) to limit UDP
packet, because of the qdisc buffer limit.
If the qdisc buffer limit is too small, the UDP packet will be dropped
in the qdisc layer. But the sender doesn't know that the packets has
been dropped, so it will continue to send packets, and thus more and
more packets will be dropped there.  IOW, the qdisc will be a
bottleneck before the bandwidth limit is reached.
We workaround this issue by enlarging the buffer limit and flow_limit
(the proper values can be calculated from net.ipv4.udp_mem and
net.core.wmem_default).
But obviously this is not a perfect solution, because
net.ipv4.udp_mem or net.core.wmem_default may be changed dynamically.
We also think about a solution to build a connection between udp
memory and qdisc limit, but not sure if it is a good idea neither.

Cong Wang Sept. 19, 2022, 5 p.m. UTC | #8

On Mon, Aug 29, 2022 at 05:21:11PM -0700, Jakub Kicinski wrote:
> On Mon, 29 Aug 2022 09:53:17 -0700 Cong Wang wrote:
> > > Similarly to Eric's comments on v1 I'm not seeing the clear motivation
> > > here. Modern high speed UDP users will have a CC in user space, back
> > > off and set transmission time on the packets. Could you describe your
> > > _actual_ use case / application in more detail?  
> > 
> > Not everyone implements QUIC or CC, it is really hard to implement CC
> > from scratch. This backpressure mechnism is much simpler than CC (TCP or
> > QUIC), as clearly it does not deal with any remote congestions.
> > 
> > And, although this patchset only implements UDP backpressure, it can be
> > applied to any other protocol easily, it is protocol-independent.
> 
> No disagreement on any of your points. But I don't feel like 
> you answered my question about the details of the use case.

Do you need a use case for UDP w/o QUIC? Seriously??? There must be
tons of it...

Take a look at UDP tunnels, for instance, wireguard which is our use
case. ByteDance has wireguard-based VPN solution for bussiness. (I hate
to brand ourselves, but you are asking for it...)

Please do research on your side, as a netdev maintainer, you are
supposed to know this much better than me.

Thanks.

Cong Wang Sept. 19, 2022, 5:04 p.m. UTC | #9

On Tue, Aug 30, 2022 at 10:28:01AM +0800, Yafang Shao wrote:
> On Tue, Aug 23, 2022 at 1:02 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Aug 22, 2022 at 2:10 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
> > >
> > > From: Peilin Ye <peilin.ye@bytedance.com>
> > >
> > > Hi all,
> > >
> > > Currently sockets (especially UDP ones) can drop a lot of packets at TC
> > > egress when rate limited by shaper Qdiscs like HTB.  This patchset series
> > > tries to solve this by introducing a Qdisc backpressure mechanism.
> > >
> > > RFC v1 [1] used a throttle & unthrottle approach, which introduced several
> > > issues, including a thundering herd problem and a socket reference count
> > > issue [2].  This RFC v2 uses a different approach to avoid those issues:
> > >
> > >   1. When a shaper Qdisc drops a packet that belongs to a local socket due
> > >      to TC egress congestion, we make part of the socket's sndbuf
> > >      temporarily unavailable, so it sends slower.
> > >
> > >   2. Later, when TC egress becomes idle again, we gradually recover the
> > >      socket's sndbuf back to normal.  Patch 2 implements this step using a
> > >      timer for UDP sockets.
> > >
> > > The thundering herd problem is avoided, since we no longer wake up all
> > > throttled sockets at the same time in qdisc_watchdog().  The socket
> > > reference count issue is also avoided, since we no longer maintain socket
> > > list on Qdisc.
> > >
> > > Performance is better than RFC v1.  There is one concern about fairness
> > > between flows for TBF Qdisc, which could be solved by using a SFQ inner
> > > Qdisc.
> > >
> > > Please see the individual patches for details and numbers.  Any comments,
> > > suggestions would be much appreciated.  Thanks!
> > >
> > > [1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
> > > [2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/
> > >
> > > Peilin Ye (5):
> > >   net: Introduce Qdisc backpressure infrastructure
> > >   net/udp: Implement Qdisc backpressure algorithm
> > >   net/sched: sch_tbf: Use Qdisc backpressure infrastructure
> > >   net/sched: sch_htb: Use Qdisc backpressure infrastructure
> > >   net/sched: sch_cbq: Use Qdisc backpressure infrastructure
> > >
> >
> > I think the whole idea is wrong.
> >
> > Packet schedulers can be remote (offloaded, or on another box)
> >
> > The idea of going back to socket level from a packet scheduler should
> > really be a last resort.
> >
> > Issue of having UDP sockets being able to flood a network is tough, I
> > am not sure the core networking stack
> > should pretend it can solve the issue.
> >
> > Note that FQ based packet schedulers can also help already.
> 
> We encounter a similar issue when using (fq + edt-bpf) to limit UDP
> packet, because of the qdisc buffer limit.
> If the qdisc buffer limit is too small, the UDP packet will be dropped
> in the qdisc layer. But the sender doesn't know that the packets has
> been dropped, so it will continue to send packets, and thus more and
> more packets will be dropped there.  IOW, the qdisc will be a
> bottleneck before the bandwidth limit is reached.
> We workaround this issue by enlarging the buffer limit and flow_limit
> (the proper values can be calculated from net.ipv4.udp_mem and
> net.core.wmem_default).
> But obviously this is not a perfect solution, because
> net.ipv4.udp_mem or net.core.wmem_default may be changed dynamically.
> We also think about a solution to build a connection between udp
> memory and qdisc limit, but not sure if it is a good idea neither.

This is literally what this patchset does. Although this patchset does
not touch any TCP (as TCP has TSQ), I think this is a better approach
than TSQ, because TSQ has no idea about Qdisc limit.

Thanks.

Cong Wang Sept. 19, 2022, 5:06 p.m. UTC | #10

On Mon, Aug 29, 2022 at 09:53:43AM -0700, Eric Dumazet wrote:
> On Mon, Aug 29, 2022 at 9:47 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Mon, Aug 22, 2022 at 09:22:39AM -0700, Eric Dumazet wrote:
> > > On Mon, Aug 22, 2022 at 2:10 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
> > > >
> > > > From: Peilin Ye <peilin.ye@bytedance.com>
> > > >
> > > > Hi all,
> > > >
> > > > Currently sockets (especially UDP ones) can drop a lot of packets at TC
> > > > egress when rate limited by shaper Qdiscs like HTB.  This patchset series
> > > > tries to solve this by introducing a Qdisc backpressure mechanism.
> > > >
> > > > RFC v1 [1] used a throttle & unthrottle approach, which introduced several
> > > > issues, including a thundering herd problem and a socket reference count
> > > > issue [2].  This RFC v2 uses a different approach to avoid those issues:
> > > >
> > > >   1. When a shaper Qdisc drops a packet that belongs to a local socket due
> > > >      to TC egress congestion, we make part of the socket's sndbuf
> > > >      temporarily unavailable, so it sends slower.
> > > >
> > > >   2. Later, when TC egress becomes idle again, we gradually recover the
> > > >      socket's sndbuf back to normal.  Patch 2 implements this step using a
> > > >      timer for UDP sockets.
> > > >
> > > > The thundering herd problem is avoided, since we no longer wake up all
> > > > throttled sockets at the same time in qdisc_watchdog().  The socket
> > > > reference count issue is also avoided, since we no longer maintain socket
> > > > list on Qdisc.
> > > >
> > > > Performance is better than RFC v1.  There is one concern about fairness
> > > > between flows for TBF Qdisc, which could be solved by using a SFQ inner
> > > > Qdisc.
> > > >
> > > > Please see the individual patches for details and numbers.  Any comments,
> > > > suggestions would be much appreciated.  Thanks!
> > > >
> > > > [1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
> > > > [2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/
> > > >
> > > > Peilin Ye (5):
> > > >   net: Introduce Qdisc backpressure infrastructure
> > > >   net/udp: Implement Qdisc backpressure algorithm
> > > >   net/sched: sch_tbf: Use Qdisc backpressure infrastructure
> > > >   net/sched: sch_htb: Use Qdisc backpressure infrastructure
> > > >   net/sched: sch_cbq: Use Qdisc backpressure infrastructure
> > > >
> > >
> > > I think the whole idea is wrong.
> > >
> >
> > Be more specific?
> >
> > > Packet schedulers can be remote (offloaded, or on another box)
> >
> > This is not the case we are dealing with (yet).
> >
> > >
> > > The idea of going back to socket level from a packet scheduler should
> > > really be a last resort.
> >
> > I think it should be the first resort, as we should backpressure to the
> > source, rather than anything in the middle.
> >
> > >
> > > Issue of having UDP sockets being able to flood a network is tough, I
> > > am not sure the core networking stack
> > > should pretend it can solve the issue.
> >
> > It seems you misunderstand it here, we are not dealing with UDP on the
> > network, just on an end host. The backpressure we are dealing with is
> > from Qdisc to socket on _TX side_ and on one single host.
> >
> > >
> > > Note that FQ based packet schedulers can also help already.
> >
> > It only helps TCP pacing.
> 
> FQ : Fair Queue.
> 
> It definitely helps without the pacing part...

True. but the fair queuing part has nothing related to this patchset...
Only the pacing part is related to this topic, and it is merely about
TCP.

Thanks.

[RFC,v2,net-next,0/5] net: Qdisc backpressure infrastructure

Message

Comments