diff mbox series

[v4,net-next,1/1] sched: Add dualpi2 qdisc

Message ID 20241021221248.60378-2-chia-yu.chang@nokia-bell-labs.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series DualPI2 patch | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 44 this patch: 44
netdev/build_tools success Errors and warnings before: 172 (+0) this patch: 157 (+0)
netdev/cc_maintainers warning 5 maintainers not CCed: horms@kernel.org andrew+netdev@lunn.ch jiri@resnulli.us donald.hunter@gmail.com xiyou.wangcong@gmail.com
netdev/build_clang success Errors and warnings before: 86 this patch: 86
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5030 this patch: 5030
netdev/checkpatch warning WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 22 this patch: 22
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-10-29--03-00 (tests: 777)

Commit Message

Chia-Yu Chang (Nokia) Oct. 21, 2024, 10:12 p.m. UTC
From: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>

DualPI2 provides L4S-type low latency and low loss to traffic that
uses a scalable congestion controller (e.g. TCP-Prague, DCTCP)
without degrading the performance of 'classic' traffic (e.g. Reno,
Cubic, etc.). It is intended to be the reference implementation of
the IETF's DualQ Coupled AQM.

The qdisc provides two queues called low latency and classic. It
classifies packets based on the ECN field in the IP headers. By
default it directs non-ECN and ECT(0) into the classic queue and
ECT(1) and CE into the low latency queue, as per the IETF spec.
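
For illustration, a minimal sketch of that default classification
(the enum and function names are invented for the example; the actual
classifier in the patch also honours skb->priority and tc filters,
and the ECN mask is configurable):

/* Illustrative only: default DualPI2 queue selection by ECN codepoint.
 * "ecn" holds the two ECN bits from the IP header: Not-ECT=0, ECT(1)=1,
 * ECT(0)=2, CE=3. ECT(1) and CE (low bit set, i.e. 0b*1) go to the low
 * latency queue; Not-ECT and ECT(0) go to the classic queue.
 */
enum example_queue { EXAMPLE_Q_CLASSIC, EXAMPLE_Q_LOW_LATENCY };

static enum example_queue classify_by_ecn(unsigned char ecn)
{
	return (ecn & 0x1) ? EXAMPLE_Q_LOW_LATENCY : EXAMPLE_Q_CLASSIC;
}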

Each queue runs its own AQM:
* The classic AQM is called PI2, which is similar to the PIE AQM but
  more responsive and simpler. Classic traffic requires a decent
  queue delay target (default 15ms for Internet deployment) to fully
  utilize the link and to avoid high drop rates.
* The low latency AQM is, by default, a very shallow ECN marking
  threshold (1ms) similar to that used for DCTCP.

The DualQ isolates the low queuing delay of the Low Latency queue
from the larger delay of the 'Classic' queue. However, from a
bandwidth perspective, flows in either queue share the link
capacity as if there were just a single queue. This bandwidth pooling
effect is achieved by coupling together the drop and ECN-marking
probabilities of the two AQMs.

The PI2 AQM has two main parameters in addition to its target delay.
All the defaults are suitable for any Internet setting, but it can
be reconfigured for a Data Centre setting. The integral gain factor
alpha is used to slowly correct any persistent standing queue error
from the target delay, while the proportional gain factor beta is
used to quickly compensate for queue changes (growth or shrinkage).
Either alpha and beta are given directly as parameters, or they can be
calculated by tc from alternative typical and maximum RTT parameters.
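
As a rough sketch of how they can be derived from those RTTs (loosely
following the example heuristics in RFC 9332; the exact arithmetic
used by the tc binary is not part of this patch and may differ):

/* Illustrative only, loosely following RFC 9332's example pseudocode.
 * RTTs are given in seconds; the resulting gains are in Hz (the
 * netlink attributes carry them scaled up by 256).
 */
static void pi2_gains_from_rtt(double typical_rtt, double max_rtt,
			       double *tupdate, double *alpha, double *beta)
{
	double t = typical_rtt < max_rtt / 3 ? typical_rtt : max_rtt / 3;

	*tupdate = t;				/* PI sampling interval */
	*alpha = 0.1 * t / (max_rtt * max_rtt);	/* integral gain, Hz */
	*beta = 0.3 / max_rtt;			/* proportional gain, Hz */
}

For typical_rtt = 16ms and max_rtt = 100ms this yields tupdate = 16ms,
alpha ~0.16Hz and beta ~3Hz, close to the qdisc defaults used further
below (0.16Hz / 3.2Hz).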

Internally, the output of a linear Proportional Integral (PI)
controller is used for both queues. This output is squared to
calculate the drop or ECN-marking probability of the classic queue.
This counterbalances the square-root rate equation of Reno/Cubic,
which is the trick that balances flow rates across the queues. For
the ECN-marking probability of the low latency queue, the output of
the base AQM is multiplied by a coupling factor. This determines the
balance between the flow rates in each queue. The default setting
makes the flow rates roughly equal, which should be generally
applicable.
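
In other words, from the single PI output p the classic queue uses
p_C = p^2 while the L queue uses p_L = k * p (capped at 100%). A
minimal sketch of that coupling, using fractions instead of the
fixed-point arithmetic of the actual qdisc:

/* Illustrative only: coupling the two AQMs from the common PI output
 * p, with all probabilities as fractions in [0, 1].
 */
struct example_probs {
	double p_classic;	/* drop/mark probability of the C queue */
	double p_l4s;		/* marking probability of the L queue */
};

static struct example_probs couple_probs(double p, double coupling_factor)
{
	struct example_probs out;

	out.p_classic = p * p;			/* squared: counters Reno/Cubic's 1/sqrt(p) law */
	out.p_l4s = p * coupling_factor;	/* linear, default coupling_factor = 2 */
	if (out.p_l4s > 1.0)
		out.p_l4s = 1.0;
	return out;
}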

If the DUALPI2 AQM detects overload (due to excessive non-responsive
traffic in either queue), it switches to signaling congestion
solely by dropping, irrespective of the ECN field. Alternatively, it
can be configured to limit the drop probability and let the queue
grow and eventually overflow (like tail-drop).
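
Condensed, that overload policy for an L-queue packet looks roughly as
follows (the real logic lives in dualpi2_scalable_marking() in the
patch below; when drop_overload is unset, the base probability is
additionally capped at 100%/coupling_factor):

#include <stdbool.h>

/* Illustrative only: treatment of an L-queue packet once the coupled
 * probability has saturated (overload). classic_roll_hit is the result
 * of the squared-probability roll also used for the C queue.
 */
enum example_action { EXAMPLE_MARK, EXAMPLE_DROP };

static enum example_action on_overload(bool drop_overload, bool classic_roll_hit)
{
	if (drop_overload && classic_roll_hit)
		return EXAMPLE_DROP;	/* shed load to preserve low latency */
	/* Otherwise keep CE-marking; the queue may grow until sch->limit
	 * is reached and tail-drop takes over.
	 */
	return EXAMPLE_MARK;
}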

Additional details can be found in RFC 9332:
  https://datatracker.ietf.org/doc/html/rfc9332

Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
Co-developed-by: Olga Albisser <olga@albisser.org>
Signed-off-by: Olga Albisser <olga@albisser.org>
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Co-developed-by: Henrik Steen <henrist@henrist.net>
Signed-off-by: Henrik Steen <henrist@henrist.net>
Signed-off-by: Bob Briscoe <research@bobbriscoe.net>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
---
 Documentation/netlink/specs/tc.yaml |  124 ++++
 include/linux/netdevice.h           |    1 +
 include/uapi/linux/pkt_sched.h      |   34 +
 net/sched/Kconfig                   |   12 +
 net/sched/Makefile                  |    1 +
 net/sched/sch_dualpi2.c             | 1052 +++++++++++++++++++++++++++
 6 files changed, 1224 insertions(+)
 create mode 100644 net/sched/sch_dualpi2.c

Comments

Dave Taht Oct. 26, 2024, 6:57 p.m. UTC | #1
A couple comments:

Has this been tested with mq -> an AQM queue per core, or just as
htb+dualpi2, and on what platforms?

I was also under the impression that 2ms was a more robust target from
tests given typical scheduling delays and virtualization.

It appears that gso-splitting is the default? What happens with that off?

The various options are a mite confusing. A tiny bit more inline.

On Mon, Oct 21, 2024 at 3:13 PM <chia-yu.chang@nokia-bell-labs.com> wrote:
>
> From: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
>
> DualPI2 provides L4S-type low latency & loss to traffic that uses a
> scalable congestion controller (e.g. TCP-Prague, DCTCP) without
> degrading the performance of 'classic' traffic (e.g. Reno,
> Cubic etc.). It is intended to be the reference implementation of the
> IETF's DualQ Coupled AQM.
>
> The qdisc provides two queues called low latency and classic. It
> classifies packets based on the ECN field in the IP headers. By
> default it directs non-ECN and ECT(0) into the classic queue and
> ECT(1) and CE into the low latency queue, as per the IETF spec.
>
> Each queue runs its own AQM:
> * The classic AQM is called PI2, which is similar to the PIE AQM but
>   more responsive and simpler. Classic traffic requires a decent
>   target queue (default 15ms for Internet deployment) to fully
>   utilize the link and to avoid high drop rates.
> * The low latency AQM is, by default, a very shallow ECN marking
>   threshold (1ms) similar to that used for DCTCP.
>
> The DualQ isolates the low queuing delay of the Low Latency queue
> from the larger delay of the 'Classic' queue. However, from a
> bandwidth perspective, flows in either queue will share out the link
> capacity as if there was just a single queue. This bandwidth pooling
> effect is achieved by coupling together the drop and ECN-marking
> probabilities of the two AQMs.
>
> The PI2 AQM has two main parameters in addition to its target delay.
> All the defaults are suitable for any Internet setting, but it can
> be reconfigured for a Data Centre setting. The integral gain factor

What would be a good DC setting?

> alpha is used to slowly correct any persistent standing queue error
> from the target delay, while the proportional gain factor beta is
> used to quickly compensate for queue changes (growth or shrinkage).
> Either alpha and beta are given as a parameter, or they can be
> calculated by tc from alternative typical and maximum RTT parameters.
>
> Internally, the output of a linear Proportional Integral (PI)
> controller is used for both queues. This output is squared to
> calculate the drop or ECN-marking probability of the classic queue.
> This counterbalances the square-root rate equation of Reno/Cubic,
> which is the trick that balances flow rates across the queues. For
> the ECN-marking probability of the low latency queue, the output of
> the base AQM is multiplied by a coupling factor. This determines the
> balance between the flow rates in each queue. The default setting
> makes the flow rates roughly equal, which should be generally
> applicable.
>
> If DUALPI2 AQM has detected overload (due to excessive non-responsive
> traffic in either queue), it will switch to signaling congestion
> solely using drop, irrespective of the ECN field. Alternatively, it
> can be configured to limit the drop probability and let the queue
> grow and eventually overflow (like tail-drop).
>
> Additional details can be found in the draft:
>   https://datatracker.ietf.org/doc/html/rfc9332
>
> Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
> Co-developed-by: Olga Albisser <olga@albisser.org>
> Signed-off-by: Olga Albisser <olga@albisser.org>
> Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
> Co-developed-by: Henrik Steen <henrist@henrist.net>
> Signed-off-by: Henrik Steen <henrist@henrist.net>
> Signed-off-by: Bob Briscoe <research@bobbriscoe.net>
> Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> ---
>  Documentation/netlink/specs/tc.yaml |  124 ++++
>  include/linux/netdevice.h           |    1 +
>  include/uapi/linux/pkt_sched.h      |   34 +
>  net/sched/Kconfig                   |   12 +
>  net/sched/Makefile                  |    1 +
>  net/sched/sch_dualpi2.c             | 1052 +++++++++++++++++++++++++++
>  6 files changed, 1224 insertions(+)
>  create mode 100644 net/sched/sch_dualpi2.c
>
> diff --git a/Documentation/netlink/specs/tc.yaml b/Documentation/netlink/specs/tc.yaml
> index b02d59a0349c..efe5eb2d8b52 100644
> --- a/Documentation/netlink/specs/tc.yaml
> +++ b/Documentation/netlink/specs/tc.yaml
> @@ -816,6 +816,46 @@ definitions:
>        -
>          name: drop-overmemory
>          type: u32
> +  -
> +    name: tc-dualpi2-xstats
> +    type: struct
> +    members:
> +      -
> +        name: prob
> +        type: u32
> +        doc: Current probability
> +      -
> +        name: delay_c
> +        type: u32
> +        doc: Current C-queue delay in microseconds
> +      -
> +        name: delay_l
> +        type: u32
> +        doc: Current L-queue delay in microseconds
> +      -
> +        name: pkts_in_c
> +        type: u32
> +        doc: Number of packets enqueued in the C-queue
> +      -
> +        name: pkts_in_l
> +        type: u32
> +        doc: Number of packets enqueued in the L-queue
> +      -
> +        name: maxq
> +        type: u32
> +        doc: Maximum number of packets seen in the DualPI2

Seen "by". Also this number will tend towards a peak and stay there,
and thus is not a particularly useful stat.


> +      -
> +        name: ecn_mark
> +        type: u32
> +        doc: All packets marked with ecn

Since this has higher rates of marking than drop perhaps this should be 64 bits.

> +      -
> +        name: step_mark
> +        type: u32
> +        doc: Only packets marked with ecn due to L-queue step AQM

Ditto.

> +      -
> +        name: credit
> +        type: s32
> +        doc: Current credit value for WRR
>    -
>      name: tc-fq-pie-xstats

? fq-pie?

>      type: struct
> @@ -2299,6 +2339,84 @@ attribute-sets:
>        -
>          name: quantum
>          type: u32
> +  -
> +    name: tc-dualpi2-attrs
> +    attributes:
> +      -
> +        name: limit
> +        type: u32
> +        doc: Limit of total number of packets in queue

I have noted previously that memlimits make more sense than packet
limits given the dynamic range (64b-64kb) of a modern gso/tso packet.

> +      -
> +        name: target
> +        type: u32
> +        doc: Classic target delay in microseconds
> +      -
> +        name: tupdate
> +        type: u32
> +        doc: Drop probability update interval time in microseconds
> +      -
> +        name: alpha
> +        type: u32
> +        doc: Integral gain factor in Hz for PI controller
> +      -
> +        name: beta
> +        type: u32
> +        doc: Proportional gain factor in Hz for PI controller
> +      -
> +        name: step_thresh
> +        type: u32
> +        doc: L4S step marking threshold in microseconds or in packet (see step_packets)
> +      -
> +        name: step_packets
> +        type: flags
> +        doc: L4S Step marking threshold unit
> +        entries:
> +        - microseconds
> +        - packets
> +      -
> +        name: coupling_factor
> +        type: u8
> +        doc: Probability coupling factor between Classic and L4S (2 is recommended)
> +      -
> +        name: drop_overload
> +        type: flags
> +        doc: Control the overload strategy (drop to preserve latency or let the queue overflow)
> +        entries:
> +        - drop_on_overload
> +        - overflow
> +      -
> +        name: drop_early
> +        type: flags
> +        doc: Decide where the Classic packets are PI-based dropped or marked
> +        entries:
> +        - drop_enqueue
> +        - drop_dequeue
> +      -
> +        name: classic_protection
> +        type: u8
> +        doc:  Classic WRR weight in percentage (from 0 to 100)
> +      -
> +        name: ecn_mask
> +        type: flags
> +        doc: Configure the L-queue ECN classifier
> +        entries:
> +        - l4s_ect
> +        - any_ect
> +      -
> +        name: gso_split
> +        type: flags
> +        doc: Split aggregated skb or not
> +        entries:
> +        - split_gso
> +        - no_split_gso
> +      -
> +        name: max_rtt
> +        type: u32
> +        doc: The maximum expected RTT of the traffic that is controlled by DualPI2

In what units?

> +      -
> +        name: typical_rtt
> +        type: u32
> +        doc: The typical base RTT of the traffic that is controlled by DualPI2
>    -
>      name: tc-ematch-attrs
>      attributes:
> @@ -3679,6 +3797,9 @@ sub-messages:
>        -
>          value: drr
>          attribute-set: tc-drr-attrs
> +      -
> +        value: dualpi2
> +        attribute-set: tc-dualpi2-attrs
>        -
>          value: etf
>          attribute-set: tc-etf-attrs
> @@ -3846,6 +3967,9 @@ sub-messages:
>        -
>          value: codel
>          fixed-header: tc-codel-xstats
> +      -
> +        value: dualpi2
> +        fixed-header: tc-dualpi2-xstats
>        -
>          value: fq
>          fixed-header: tc-fq-qd-stats
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 8feaca12655e..bdd7d6262112 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -30,6 +30,7 @@
>  #include <asm/byteorder.h>
>  #include <asm/local.h>
>
> +#include <linux/netdev_features.h>
>  #include <linux/percpu.h>
>  #include <linux/rculist.h>
>  #include <linux/workqueue.h>
> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> index 25a9a47001cd..f2418eabdcb1 100644
> --- a/include/uapi/linux/pkt_sched.h
> +++ b/include/uapi/linux/pkt_sched.h
> @@ -1210,4 +1210,38 @@ enum {
>
>  #define TCA_ETS_MAX (__TCA_ETS_MAX - 1)
>
> +/* DUALPI2 */
> +enum {
> +       TCA_DUALPI2_UNSPEC,
> +       TCA_DUALPI2_LIMIT,              /* Packets */
> +       TCA_DUALPI2_TARGET,             /* us */
> +       TCA_DUALPI2_TUPDATE,            /* us */
> +       TCA_DUALPI2_ALPHA,              /* Hz scaled up by 256 */
> +       TCA_DUALPI2_BETA,               /* HZ scaled up by 256 */
> +       TCA_DUALPI2_STEP_THRESH,        /* Packets or us */
> +       TCA_DUALPI2_STEP_PACKETS,       /* Whether STEP_THRESH is in packets */
> +       TCA_DUALPI2_COUPLING,           /* Coupling factor between queues */
> +       TCA_DUALPI2_DROP_OVERLOAD,      /* Whether to drop on overload */
> +       TCA_DUALPI2_DROP_EARLY,         /* Whether to drop on enqueue */
> +       TCA_DUALPI2_C_PROTECTION,       /* Percentage */
> +       TCA_DUALPI2_ECN_MASK,           /* L4S queue classification mask */
> +       TCA_DUALPI2_SPLIT_GSO,          /* Split GSO packets at enqueue */
> +       TCA_DUALPI2_PAD,
> +       __TCA_DUALPI2_MAX
> +};
> +
> +#define TCA_DUALPI2_MAX   (__TCA_DUALPI2_MAX - 1)
> +
> +struct tc_dualpi2_xstats {
> +       __u32 prob;             /* current probability */
> +       __u32 delay_c;          /* current delay in C queue */
> +       __u32 delay_l;          /* current delay in L queue */
> +       __s32 credit;           /* current c_protection credit */
> +       __u32 packets_in_c;     /* number of packets enqueued in C queue */
> +       __u32 packets_in_l;     /* number of packets enqueued in L queue */
> +       __u32 maxq;             /* maximum queue size */
> +       __u32 ecn_mark;         /* packets marked with ecn*/
> +       __u32 step_marks;       /* ECN marks due to the step AQM */
> +};
> +
>  #endif
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index 8180d0c12fce..f00b5ad92ce2 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -403,6 +403,18 @@ config NET_SCH_ETS
>
>           If unsure, say N.
>
> +config NET_SCH_DUALPI2
> +       tristate "Dual Queue PI Square (DUALPI2) scheduler"
> +       help
> +         Say Y here if you want to use the Dual Queue Proportional Integral
> +         Controller Improved with a Square scheduling algorithm.
> +         For more information, please see https://tools.ietf.org/html/rfc9332
> +
> +         To compile this driver as a module, choose M here: the module
> +         will be called sch_dualpi2.
> +
> +         If unsure, say N.
> +
>  menuconfig NET_SCH_DEFAULT
>         bool "Allow override default queue discipline"
>         help
> diff --git a/net/sched/Makefile b/net/sched/Makefile
> index 82c3f78ca486..1abb06554057 100644
> --- a/net/sched/Makefile
> +++ b/net/sched/Makefile
> @@ -62,6 +62,7 @@ obj-$(CONFIG_NET_SCH_FQ_PIE)  += sch_fq_pie.o
>  obj-$(CONFIG_NET_SCH_CBS)      += sch_cbs.o
>  obj-$(CONFIG_NET_SCH_ETF)      += sch_etf.o
>  obj-$(CONFIG_NET_SCH_TAPRIO)   += sch_taprio.o
> +obj-$(CONFIG_NET_SCH_DUALPI2)  += sch_dualpi2.o
>
>  obj-$(CONFIG_NET_CLS_U32)      += cls_u32.o
>  obj-$(CONFIG_NET_CLS_ROUTE4)   += cls_route.o
> diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
> new file mode 100644
> index 000000000000..d7b366b9fa42
> --- /dev/null
> +++ b/net/sched/sch_dualpi2.c
> @@ -0,0 +1,1052 @@
> +// SPDX-License-Identifier: GPL-2.0-only

Dual-licensing would make a BSD implementation easier.

> +/* Copyright (C) 2024 Nokia
> + *
> + * Author: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
> + * Author: Olga Albisser <olga@albisser.org>
> + * Author: Henrik Steen <henrist@henrist.net>
> + * Author: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
> + * Author: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> + *
> + * DualPI Improved with a Square (dualpi2):
> + * - Supports congestion controls that comply with the Prague requirements
> + *   in RFC9331 (e.g. TCP-Prague)
> + * - Supports coupled dual-queue with PI2 as defined in RFC9332
> + * - Supports ECN L4S-identifier (IP.ECN==0b*1)
> + *
> + * note: DCTCP is not Prague compliant, so DCTCP & DualPI2 can only be
> + *   used in DC context; BBRv3 (overwrites bbr) stopped Prague support,

This is really confusing and up until this moment I thought bbrv3
used an ecn marking strategy compatible with prague.

> + *   you should use TCP-Prague instead for low latency apps

This is kind of opinionated.


> + *
> + * References:
> + * - RFC9332: https://datatracker.ietf.org/doc/html/rfc9332
> + * - De Schepper, Koen, et al. "PI 2: A linearized AQM for both classic and
> + *   scalable TCP."  in proc. ACM CoNEXT'16, 2016.
> + */
> +
> +#include <linux/errno.h>
> +#include <linux/hrtimer.h>
> +#include <linux/if_vlan.h>
> +#include <linux/kernel.h>
> +#include <linux/limits.h>
> +#include <linux/module.h>
> +#include <linux/skbuff.h>
> +#include <linux/types.h>
> +
> +#include <net/gso.h>
> +#include <net/inet_ecn.h>
> +#include <net/pkt_cls.h>
> +#include <net/pkt_sched.h>
> +
> +/* 32b enable to support flows with windows up to ~8.6 * 1e9 packets
> + * i.e., twice the maximal snd_cwnd.
> + * MAX_PROB must be consistent with the RNG in dualpi2_roll().
> + */
> +#define MAX_PROB U32_MAX
> +
> +/* alpha/beta values exchanged over netlink are in units of 256ns */
> +#define ALPHA_BETA_SHIFT 8
> +
> +/* Scaled values of alpha/beta must fit in 32b to avoid overflow in later
> + * computations. Consequently (see and dualpi2_scale_alpha_beta()), their
> + * netlink-provided values can use at most 31b, i.e. be at most (2^23)-1
> + * (~4MHz) as those are given in 1/256th. This enable to tune alpha/beta to
> + * control flows whose maximal RTTs can be in usec up to few secs.
> + */
> +#define ALPHA_BETA_MAX ((1U << 31) - 1)
> +
> +/* Internal alpha/beta are in units of 64ns.
> + * This enables to use all alpha/beta values in the allowed range without loss
> + * of precision due to rounding when scaling them internally, e.g.,
> + * scale_alpha_beta(1) will not round down to 0.
> + */
> +#define ALPHA_BETA_GRANULARITY 6
> +
> +#define ALPHA_BETA_SCALING (ALPHA_BETA_SHIFT - ALPHA_BETA_GRANULARITY)
> +
> +/* We express the weights (wc, wl) in %, i.e., wc + wl = 100 */
> +#define MAX_WC 100
> +
> +struct dualpi2_sched_data {
> +       struct Qdisc *l_queue;  /* The L4S LL queue */
> +       struct Qdisc *sch;      /* The classic queue (owner of this struct) */
> +
> +       /* Registered tc filters */
> +       struct {
> +               struct tcf_proto __rcu *filters;
> +               struct tcf_block *block;
> +       } tcf;
> +
> +       struct { /* PI2 parameters */
> +               u64     target; /* Target delay in nanoseconds */
> +               u32     tupdate;/* Timer frequency in nanoseconds */
> +               u32     prob;   /* Base PI probability */
> +               u32     alpha;  /* Gain factor for the integral rate response */
> +               u32     beta;   /* Gain factor for the proportional response */
> +               struct hrtimer timer; /* prob update timer */
> +       } pi2;
> +
> +       struct { /* Step AQM (L4S queue only) parameters */
> +               u32 thresh;     /* Step threshold */
> +               bool in_packets;/* Whether the step is in packets or time */
> +       } step;
> +
> +       struct { /* Classic queue starvation protection */
> +               s32     credit; /* Credit (sign indicates which queue) */
> +               s32     init;   /* Reset value of the credit */
> +               u8      wc;     /* C queue weight (between 0 and MAX_WC) */
> +               u8      wl;     /* L queue weight (MAX_WC - wc) */
> +       } c_protection;
> +
> +       /* General dualQ parameters */
> +       u8      coupling_factor;/* Coupling factor (k) between both queues */
> +       u8      ecn_mask;       /* Mask to match L4S packets */
> +       bool    drop_early;     /* Drop at enqueue instead of dequeue if true */
> +       bool    drop_overload;  /* Drop (1) on overload, or overflow (0) */
> +       bool    split_gso;      /* Split aggregated skb (1) or leave as is */
> +
> +       /* Statistics */
> +       u64     c_head_ts;      /* Enqueue timestamp of the classic Q's head */
> +       u64     l_head_ts;      /* Enqueue timestamp of the L Q's head */
> +       u64     last_qdelay;    /* Q delay val at the last probability update */
> +       u32     packets_in_c;   /* Number of packets enqueued in C queue */
> +       u32     packets_in_l;   /* Number of packets enqueued in L queue */
> +       u32     maxq;           /* maximum queue size */
> +       u32     ecn_mark;       /* packets marked with ECN */
> +       u32     step_marks;     /* ECN marks due to the step AQM */
> +
> +       struct { /* Deferred drop statistics */
> +               u32 cnt;        /* Packets dropped */
> +               u32 len;        /* Bytes dropped */
> +       } deferred_drops;
> +};
> +
> +struct dualpi2_skb_cb {
> +       u64 ts;                 /* Timestamp at enqueue */
> +       u8 apply_step:1,        /* Can we apply the step threshold */
> +          classified:2,        /* Packet classification results */
> +          ect:2;               /* Packet ECT codepoint */
> +};
> +
> +enum dualpi2_classification_results {
> +       DUALPI2_C_CLASSIC       = 0,    /* C queue */
> +       DUALPI2_C_L4S           = 1,    /* L queue (scale mark/classic drop) */
> +       DUALPI2_C_LLLL          = 2,    /* L queue (no drops/marks) */
> +       __DUALPI2_C_MAX                 /* Keep last*/
> +};
> +
> +static struct dualpi2_skb_cb *dualpi2_skb_cb(struct sk_buff *skb)
> +{
> +       qdisc_cb_private_validate(skb, sizeof(struct dualpi2_skb_cb));
> +       return (struct dualpi2_skb_cb *)qdisc_skb_cb(skb)->data;
> +}
> +
> +static u64 dualpi2_sojourn_time(struct sk_buff *skb, u64 reference)
> +{
> +       return reference - dualpi2_skb_cb(skb)->ts;
> +}
> +
> +static u64 head_enqueue_time(struct Qdisc *q)
> +{
> +       struct sk_buff *skb = qdisc_peek_head(q);
> +
> +       return skb ? dualpi2_skb_cb(skb)->ts : 0;
> +}
> +
> +static u32 dualpi2_scale_alpha_beta(u32 param)
> +{
> +       u64 tmp = ((u64)param * MAX_PROB >> ALPHA_BETA_SCALING);
> +
> +       do_div(tmp, NSEC_PER_SEC);
> +       return tmp;
> +}
> +
> +static u32 dualpi2_unscale_alpha_beta(u32 param)
> +{
> +       u64 tmp = ((u64)param * NSEC_PER_SEC << ALPHA_BETA_SCALING);
> +
> +       do_div(tmp, MAX_PROB);
> +       return tmp;
> +}
> +
> +static ktime_t next_pi2_timeout(struct dualpi2_sched_data *q)
> +{
> +       return ktime_add_ns(ktime_get_ns(), q->pi2.tupdate);
> +}
> +
> +static bool skb_is_l4s(struct sk_buff *skb)
> +{
> +       return dualpi2_skb_cb(skb)->classified == DUALPI2_C_L4S;
> +}
> +
> +static bool skb_in_l_queue(struct sk_buff *skb)
> +{
> +       return dualpi2_skb_cb(skb)->classified != DUALPI2_C_CLASSIC;
> +}
> +
> +static bool dualpi2_mark(struct dualpi2_sched_data *q, struct sk_buff *skb)
> +{
> +       if (INET_ECN_set_ce(skb)) {
> +               q->ecn_mark++;
> +               return true;
> +       }
> +       return false;
> +}
> +
> +static void dualpi2_reset_c_protection(struct dualpi2_sched_data *q)
> +{
> +       q->c_protection.credit = q->c_protection.init;
> +}
> +
> +/* This computes the initial credit value and WRR weight for the L queue (wl)
> + * from the weight of the C queue (wc).
> + * If wl > wc, the scheduler will start with the L queue when reset.
> + */
> +static void dualpi2_calculate_c_protection(struct Qdisc *sch,
> +                                          struct dualpi2_sched_data *q, u32 wc)
> +{
> +       q->c_protection.wc = wc;
> +       q->c_protection.wl = MAX_WC - wc;
> +       q->c_protection.init = (s32)psched_mtu(qdisc_dev(sch)) *
> +               ((int)q->c_protection.wc - (int)q->c_protection.wl);
> +       dualpi2_reset_c_protection(q);
> +}
> +
> +static bool dualpi2_roll(u32 prob)
> +{
> +       return get_random_u32() <= prob;
> +}
> +
> +/* Packets in the C queue are subject to a marking probability pC, which is the
> + * square of the internal PI2 probability (i.e., have an overall lower mark/drop
> + * probability). If the qdisc is overloaded, ignore ECT values and only drop.
> + *
> + * Note that this marking scheme is also applied to L4S packets during overload.
> + * Return true if packet dropping is required in C queue
> + */
> +static bool dualpi2_classic_marking(struct dualpi2_sched_data *q,
> +                                   struct sk_buff *skb, u32 prob,
> +                                   bool overload)
> +{
> +       if (dualpi2_roll(prob) && dualpi2_roll(prob)) {
> +               if (overload || dualpi2_skb_cb(skb)->ect == INET_ECN_NOT_ECT)
> +                       return true;
> +               dualpi2_mark(q, skb);
> +       }
> +       return false;
> +}
> +
> +/* Packets in the L queue are subject to a marking probability pL given by the
> + * internal PI2 probability scaled by the coupling factor.
> + *
> + * On overload (i.e., @local_l_prob is >= 100%):
> + * - if the qdisc is configured to trade losses to preserve latency (i.e.,
> + *   @q->drop_overload), apply classic drops first before marking.
> + * - otherwise, preserve the "no loss" property of ECN at the cost of queueing
> + *   delay, eventually resulting in taildrop behavior once sch->limit is
> + *   reached.
> + * Return true if packet dropping is required in L queue
> + */
> +static bool dualpi2_scalable_marking(struct dualpi2_sched_data *q,
> +                                    struct sk_buff *skb,
> +                                    u64 local_l_prob, u32 prob,
> +                                    bool overload)
> +{
> +       if (overload) {
> +               /* Apply classic drop */
> +               if (!q->drop_overload ||
> +                   !(dualpi2_roll(prob) && dualpi2_roll(prob)))
> +                       goto mark;
> +               return true;
> +       }
> +
> +       /* We can safely cut the upper 32b as overload==false */
> +       if (dualpi2_roll(local_l_prob)) {
> +               /* Non-ECT packets could have classified as L4S by filters. */
> +               if (dualpi2_skb_cb(skb)->ect == INET_ECN_NOT_ECT)
> +                       return true;
> +mark:
> +               dualpi2_mark(q, skb);
> +       }
> +       return false;
> +}
> +
> +/* Decide whether a given packet must be dropped (or marked if ECT), according
> + * to the PI2 probability.
> + *
> + * Never mark/drop if we have a standing queue of less than 2 MTUs.
> + */
> +static bool must_drop(struct Qdisc *sch, struct dualpi2_sched_data *q,
> +                     struct sk_buff *skb)
> +{
> +       u64 local_l_prob;
> +       u32 prob;
> +       bool overload;
> +
> +       if (sch->qstats.backlog < 2 * psched_mtu(qdisc_dev(sch)))
> +               return false;
> +
> +       prob = READ_ONCE(q->pi2.prob);
> +       local_l_prob = (u64)prob * q->coupling_factor;
> +       overload = local_l_prob > MAX_PROB;
> +
> +       switch (dualpi2_skb_cb(skb)->classified) {
> +       case DUALPI2_C_CLASSIC:
> +               return dualpi2_classic_marking(q, skb, prob, overload);
> +       case DUALPI2_C_L4S:
> +               return dualpi2_scalable_marking(q, skb, local_l_prob, prob,
> +                                               overload);
> +       default: /* DUALPI2_C_LLLL */
> +               return false;
> +       }
> +}
> +
> +static void dualpi2_read_ect(struct sk_buff *skb)
> +{
> +       struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
> +       int wlen = skb_network_offset(skb);
> +
> +       switch (skb_protocol(skb, true)) {
> +       case htons(ETH_P_IP):
> +               wlen += sizeof(struct iphdr);
> +               if (!pskb_may_pull(skb, wlen) ||
> +                   skb_try_make_writable(skb, wlen))
> +                       goto not_ecn;
> +
> +               cb->ect = ipv4_get_dsfield(ip_hdr(skb)) & INET_ECN_MASK;
> +               break;
> +       case htons(ETH_P_IPV6):
> +               wlen += sizeof(struct ipv6hdr);
> +               if (!pskb_may_pull(skb, wlen) ||
> +                   skb_try_make_writable(skb, wlen))
> +                       goto not_ecn;
> +
> +               cb->ect = ipv6_get_dsfield(ipv6_hdr(skb)) & INET_ECN_MASK;
> +               break;
> +       default:
> +               goto not_ecn;
> +       }
> +       return;
> +
> +not_ecn:
> +       /* Non pullable/writable packets can only be dropped hence are
> +        * classified as not ECT.
> +        */
> +       cb->ect = INET_ECN_NOT_ECT;
> +}
> +
> +static int dualpi2_skb_classify(struct dualpi2_sched_data *q,
> +                               struct sk_buff *skb)
> +{
> +       struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
> +       struct tcf_result res;
> +       struct tcf_proto *fl;
> +       int result;
> +
> +       dualpi2_read_ect(skb);
> +       if (cb->ect & q->ecn_mask) {
> +               cb->classified = DUALPI2_C_L4S;
> +               return NET_XMIT_SUCCESS;
> +       }
> +
> +       if (TC_H_MAJ(skb->priority) == q->sch->handle &&
> +           TC_H_MIN(skb->priority) < __DUALPI2_C_MAX) {
> +               cb->classified = TC_H_MIN(skb->priority);
> +               return NET_XMIT_SUCCESS;
> +       }
> +
> +       fl = rcu_dereference_bh(q->tcf.filters);
> +       if (!fl) {
> +               cb->classified = DUALPI2_C_CLASSIC;
> +               return NET_XMIT_SUCCESS;
> +       }
> +
> +       result = tcf_classify(skb, NULL, fl, &res, false);
> +       if (result >= 0) {
> +#ifdef CONFIG_NET_CLS_ACT
> +               switch (result) {
> +               case TC_ACT_STOLEN:
> +               case TC_ACT_QUEUED:
> +               case TC_ACT_TRAP:
> +                       return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
> +               case TC_ACT_SHOT:
> +                       return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> +               }
> +#endif
> +               cb->classified = TC_H_MIN(res.classid) < __DUALPI2_C_MAX ?
> +                       TC_H_MIN(res.classid) : DUALPI2_C_CLASSIC;
> +       }
> +       return NET_XMIT_SUCCESS;
> +}
> +
> +static int dualpi2_enqueue_skb(struct sk_buff *skb, struct Qdisc *sch,
> +                              struct sk_buff **to_free)
> +{
> +       struct dualpi2_sched_data *q = qdisc_priv(sch);
> +       struct dualpi2_skb_cb *cb;
> +
> +       if (unlikely(qdisc_qlen(sch) >= sch->limit)) {
> +               qdisc_qstats_overlimit(sch);
> +               if (skb_in_l_queue(skb))
> +                       qdisc_qstats_overlimit(q->l_queue);

shouldn't this be:

               if (skb_in_l_queue(skb))
                       qdisc_qstats_overlimit(q->l_queue);
                else
                       qdisc_qstats_overlimit(sch);


> +               return qdisc_drop(skb, sch, to_free);
> +       }
> +
> +       if (q->drop_early && must_drop(sch, q, skb)) {
> +               qdisc_drop(skb, sch, to_free);
> +               return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> +       }
> +
> +       cb = dualpi2_skb_cb(skb);
> +       cb->ts = ktime_get_ns();
> +
> +       if (qdisc_qlen(sch) > q->maxq)
> +               q->maxq = qdisc_qlen(sch);
> +
> +       if (skb_in_l_queue(skb)) {
> +               /* Only apply the step if a queue is building up */
> +               dualpi2_skb_cb(skb)->apply_step =
> +                       skb_is_l4s(skb) && qdisc_qlen(q->l_queue) > 1;
> +               /* Keep the overall qdisc stats consistent */
> +               ++sch->q.qlen;
> +               qdisc_qstats_backlog_inc(sch, skb);
> +               ++q->packets_in_l;
> +               if (!q->l_head_ts)
> +                       q->l_head_ts = cb->ts;
> +               return qdisc_enqueue_tail(skb, q->l_queue);
> +       }
> +       ++q->packets_in_c;
> +       if (!q->c_head_ts)
> +               q->c_head_ts = cb->ts;
> +       return qdisc_enqueue_tail(skb, sch);
> +}
> +
> +/* Optionally, dualpi2 will split GSO skbs into independent skbs and enqueue

By default

> + * each of those individually. This yields the following benefits, at the
> + * expense of CPU usage:
> + * - Finer-grained AQM actions as the sub-packets of a burst no longer share the
> + *   same fate (e.g., the random mark/drop probability is applied individually)
> + * - Improved precision of the starvation protection/WRR scheduler at dequeue,
> + *   as the size of the dequeued packets will be smaller.

I had really grave doubts as to whether L4S would work with GSO at all.

> + */
> +static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> +                                struct sk_buff **to_free)
> +{
> +       struct dualpi2_sched_data *q = qdisc_priv(sch);
> +       int err;
> +
> +       err = dualpi2_skb_classify(q, skb);
> +       if (err != NET_XMIT_SUCCESS) {
> +               if (err & __NET_XMIT_BYPASS)
> +                       qdisc_qstats_drop(sch);
> +               __qdisc_drop(skb, to_free);
> +               return err;
> +       }
> +
> +       if (q->split_gso && skb_is_gso(skb)) {
> +               netdev_features_t features;
> +               struct sk_buff *nskb, *next;
> +               int cnt, byte_len, orig_len;
> +               int err;
> +
> +               features = netif_skb_features(skb);
> +               nskb = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
> +               if (IS_ERR_OR_NULL(nskb))
> +                       return qdisc_drop(skb, sch, to_free);
> +
> +               cnt = 1;
> +               byte_len = 0;
> +               orig_len = qdisc_pkt_len(skb);
> +               while (nskb) {
> +                       next = nskb->next;
> +                       skb_mark_not_on_list(nskb);
> +                       qdisc_skb_cb(nskb)->pkt_len = nskb->len;
> +                       dualpi2_skb_cb(nskb)->classified =
> +                               dualpi2_skb_cb(skb)->classified;
> +                       dualpi2_skb_cb(nskb)->ect = dualpi2_skb_cb(skb)->ect;
> +                       err = dualpi2_enqueue_skb(nskb, sch, to_free);
> +                       if (err == NET_XMIT_SUCCESS) {
> +                               /* Compute the backlog adjustement that needs

spelling: "adjustment"

> +                                * to be propagated in the qdisc tree to reflect
> +                                * all new skbs successfully enqueued.
> +                                */
> +                               ++cnt;
> +                               byte_len += nskb->len;
> +                       }
> +                       nskb = next;
> +               }
> +               if (err == NET_XMIT_SUCCESS) {
> +                       /* The caller will add the original skb stats to its
> +                        * backlog, compensate this.
> +                        */
> +                       --cnt;
> +                       byte_len -= orig_len;
> +               }
> +               qdisc_tree_reduce_backlog(sch, -cnt, -byte_len);
> +               consume_skb(skb);
> +               return err;
> +       }
> +       return dualpi2_enqueue_skb(skb, sch, to_free);
> +}
> +
> +/* Select the queue from which the next packet can be dequeued, ensuring that
> + * neither queue can starve the other with a WRR scheduler.
> + *
> + * The sign of the WRR credit determines the next queue, while the size of
> + * the dequeued packet determines the magnitude of the WRR credit change. If
> + * either queue is empty, the WRR credit is kept unchanged.
> + *
> + * As the dequeued packet can be dropped later, the caller has to perform the
> + * qdisc_bstats_update() calls.
> + */
> +static struct sk_buff *dequeue_packet(struct Qdisc *sch,
> +                                     struct dualpi2_sched_data *q,
> +                                     int *credit_change,
> +                                     u64 now)
> +{
> +       struct sk_buff *skb = NULL;
> +       int c_len;
> +
> +       *credit_change = 0;
> +       c_len = qdisc_qlen(sch) - qdisc_qlen(q->l_queue);
> +       if (qdisc_qlen(q->l_queue) && (!c_len || q->c_protection.credit <= 0)) {
> +               skb = __qdisc_dequeue_head(&q->l_queue->q);
> +               WRITE_ONCE(q->l_head_ts, head_enqueue_time(q->l_queue));
> +               if (c_len)
> +                       *credit_change = q->c_protection.wc;
> +               qdisc_qstats_backlog_dec(q->l_queue, skb);
> +               /* Keep the global queue size consistent */
> +               --sch->q.qlen;
> +       } else if (c_len) {
> +               skb = __qdisc_dequeue_head(&sch->q);
> +               WRITE_ONCE(q->c_head_ts, head_enqueue_time(sch));
> +               if (qdisc_qlen(q->l_queue))
> +                       *credit_change = ~((s32)q->c_protection.wl) + 1;
> +       } else {
> +               dualpi2_reset_c_protection(q);
> +               return NULL;
> +       }
> +       *credit_change *= qdisc_pkt_len(skb);
> +       qdisc_qstats_backlog_dec(sch, skb);
> +       return skb;
> +}
> +
> +static int do_step_aqm(struct dualpi2_sched_data *q, struct sk_buff *skb,
> +                      u64 now)
> +{
> +       u64 qdelay = 0;
> +
> +       if (q->step.in_packets)
> +               qdelay = qdisc_qlen(q->l_queue);
> +       else
> +               qdelay = dualpi2_sojourn_time(skb, now);
> +
> +       if (dualpi2_skb_cb(skb)->apply_step && qdelay > q->step.thresh) {
> +               if (!dualpi2_skb_cb(skb)->ect)
> +                       /* Drop this non-ECT packet */
> +                       return 1;
> +               if (dualpi2_mark(q, skb))
> +                       ++q->step_marks;
> +       }
> +       qdisc_bstats_update(q->l_queue, skb);
> +       return 0;
> +}
> +
> +static void drop_and_retry(struct dualpi2_sched_data *q, struct sk_buff *skb,
> +                          struct Qdisc *sch)
> +{
> +       ++q->deferred_drops.cnt;
> +       q->deferred_drops.len += qdisc_pkt_len(skb);
> +       consume_skb(skb);
> +       qdisc_qstats_drop(sch);
> +}
> +
> +static struct sk_buff *dualpi2_qdisc_dequeue(struct Qdisc *sch)
> +{
> +       struct dualpi2_sched_data *q = qdisc_priv(sch);
> +       struct sk_buff *skb;
> +       int credit_change;
> +       u64 now;
> +
> +       now = ktime_get_ns();
> +
> +       while ((skb = dequeue_packet(sch, q, &credit_change, now))) {
> +               if (!q->drop_early && must_drop(sch, q, skb)) {
> +                       drop_and_retry(q, skb, sch);
> +                       continue;
> +               }
> +
> +               if (skb_in_l_queue(skb) && do_step_aqm(q, skb, now)) {
> +                       qdisc_qstats_drop(q->l_queue);
> +                       drop_and_retry(q, skb, sch);
> +                       continue;
> +               }
> +
> +               q->c_protection.credit += credit_change;
> +               qdisc_bstats_update(sch, skb);
> +               break;
> +       }
> +
> +       /* We cannot call qdisc_tree_reduce_backlog() if our qlen is 0,
> +        * or HTB crashes.
> +        */
> +       if (q->deferred_drops.cnt && qdisc_qlen(sch)) {
> +               qdisc_tree_reduce_backlog(sch, q->deferred_drops.cnt,
> +                                         q->deferred_drops.len);
> +               q->deferred_drops.cnt = 0;
> +               q->deferred_drops.len = 0;
> +       }
> +       return skb;
> +}
> +
> +static s64 __scale_delta(u64 diff)
> +{
> +       do_div(diff, 1 << ALPHA_BETA_GRANULARITY);
> +       return diff;
> +}
> +
> +static void get_queue_delays(struct dualpi2_sched_data *q, u64 *qdelay_c,
> +                            u64 *qdelay_l)
> +{
> +       u64 now, qc, ql;
> +
> +       now = ktime_get_ns();
> +       qc = READ_ONCE(q->c_head_ts);
> +       ql = READ_ONCE(q->l_head_ts);
> +
> +       *qdelay_c = qc ? now - qc : 0;
> +       *qdelay_l = ql ? now - ql : 0;
> +}
> +
> +static u32 calculate_probability(struct Qdisc *sch)
> +{
> +       struct dualpi2_sched_data *q = qdisc_priv(sch);
> +       u32 new_prob;
> +       u64 qdelay_c;
> +       u64 qdelay_l;
> +       u64 qdelay;
> +       s64 delta;
> +
> +       get_queue_delays(q, &qdelay_c, &qdelay_l);
> +       qdelay = max(qdelay_l, qdelay_c);
> +       /* Alpha and beta take at most 32b, i.e, the delay difference would
> +        * overflow for queuing delay differences > ~4.2sec.
> +        */
> +       delta = ((s64)qdelay - q->pi2.target) * q->pi2.alpha;
> +       delta += ((s64)qdelay - q->last_qdelay) * q->pi2.beta;
> +       if (delta > 0) {
> +               new_prob = __scale_delta(delta) + q->pi2.prob;
> +               if (new_prob < q->pi2.prob)
> +                       new_prob = MAX_PROB;
> +       } else {
> +               new_prob = q->pi2.prob - __scale_delta(~delta + 1);
> +               if (new_prob > q->pi2.prob)
> +                       new_prob = 0;
> +       }
> +       q->last_qdelay = qdelay;
> +       /* If we do not drop on overload, ensure we cap the L4S probability to
> +        * 100% to keep window fairness when overflowing.
> +        */
> +       if (!q->drop_overload)
> +               return min_t(u32, new_prob, MAX_PROB / q->coupling_factor);
> +       return new_prob;
> +}
> +
> +static enum hrtimer_restart dualpi2_timer(struct hrtimer *timer)
> +{
> +       struct dualpi2_sched_data *q = from_timer(q, timer, pi2.timer);
> +
> +       WRITE_ONCE(q->pi2.prob, calculate_probability(q->sch));
> +
> +       hrtimer_set_expires(&q->pi2.timer, next_pi2_timeout(q));
> +       return HRTIMER_RESTART;
> +}
> +
> +static const struct nla_policy dualpi2_policy[TCA_DUALPI2_MAX + 1] = {
> +       [TCA_DUALPI2_LIMIT] = {.type = NLA_U32},
> +       [TCA_DUALPI2_TARGET] = {.type = NLA_U32},
> +       [TCA_DUALPI2_TUPDATE] = {.type = NLA_U32},
> +       [TCA_DUALPI2_ALPHA] = {.type = NLA_U32},
> +       [TCA_DUALPI2_BETA] = {.type = NLA_U32},
> +       [TCA_DUALPI2_STEP_THRESH] = {.type = NLA_U32},
> +       [TCA_DUALPI2_STEP_PACKETS] = {.type = NLA_U8},
> +       [TCA_DUALPI2_COUPLING] = {.type = NLA_U8},
> +       [TCA_DUALPI2_DROP_OVERLOAD] = {.type = NLA_U8},
> +       [TCA_DUALPI2_DROP_EARLY] = {.type = NLA_U8},
> +       [TCA_DUALPI2_C_PROTECTION] = {.type = NLA_U8},
> +       [TCA_DUALPI2_ECN_MASK] = {.type = NLA_U8},
> +       [TCA_DUALPI2_SPLIT_GSO] = {.type = NLA_U8},
> +};
> +
> +static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt,
> +                         struct netlink_ext_ack *extack)
> +{
> +       struct nlattr *tb[TCA_DUALPI2_MAX + 1];
> +       struct dualpi2_sched_data *q;
> +       int old_backlog;
> +       int old_qlen;
> +       int err;
> +
> +       if (!opt)
> +               return -EINVAL;
> +       err = nla_parse_nested(tb, TCA_DUALPI2_MAX, opt, dualpi2_policy,
> +                              extack);
> +       if (err < 0)
> +               return err;
> +
> +       q = qdisc_priv(sch);
> +       sch_tree_lock(sch);
> +
> +       if (tb[TCA_DUALPI2_LIMIT]) {
> +               u32 limit = nla_get_u32(tb[TCA_DUALPI2_LIMIT]);
> +
> +               if (!limit) {
> +                       NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_LIMIT],
> +                                           "limit must be greater than 0.");
> +                       sch_tree_unlock(sch);
> +                       return -EINVAL;
> +               }
> +               sch->limit = limit;
> +       }
> +
> +       if (tb[TCA_DUALPI2_TARGET])
> +               q->pi2.target = (u64)nla_get_u32(tb[TCA_DUALPI2_TARGET]) *
> +                       NSEC_PER_USEC;
> +
> +       if (tb[TCA_DUALPI2_TUPDATE]) {
> +               u64 tupdate = nla_get_u32(tb[TCA_DUALPI2_TUPDATE]);
> +
> +               if (!tupdate) {
> +                       NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_TUPDATE],
> +                                           "tupdate cannot be 0us.");
> +                       sch_tree_unlock(sch);
> +                       return -EINVAL;
> +               }
> +               q->pi2.tupdate = tupdate * NSEC_PER_USEC;
> +       }
> +
> +       if (tb[TCA_DUALPI2_ALPHA]) {
> +               u32 alpha = nla_get_u32(tb[TCA_DUALPI2_ALPHA]);
> +
> +               if (alpha > ALPHA_BETA_MAX) {
> +                       NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_ALPHA],
> +                                           "alpha is too large.");
> +                       sch_tree_unlock(sch);
> +                       return -EINVAL;
> +               }
> +               q->pi2.alpha = dualpi2_scale_alpha_beta(alpha);
> +       }
> +
> +       if (tb[TCA_DUALPI2_BETA]) {
> +               u32 beta = nla_get_u32(tb[TCA_DUALPI2_BETA]);
> +
> +               if (beta > ALPHA_BETA_MAX) {
> +                       NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_BETA],
> +                                           "beta is too large.");
> +                       sch_tree_unlock(sch);
> +                       return -EINVAL;
> +               }
> +               q->pi2.beta = dualpi2_scale_alpha_beta(beta);
> +       }
> +
> +       if (tb[TCA_DUALPI2_STEP_THRESH])
> +               q->step.thresh = nla_get_u32(tb[TCA_DUALPI2_STEP_THRESH]) *
> +                       NSEC_PER_USEC;
> +
> +       if (tb[TCA_DUALPI2_COUPLING]) {
> +               u8 coupling = nla_get_u8(tb[TCA_DUALPI2_COUPLING]);
> +
> +               if (!coupling) {
> +                       NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_COUPLING],
> +                                           "Must use a non-zero coupling.");
> +                       sch_tree_unlock(sch);
> +                       return -EINVAL;
> +               }
> +               q->coupling_factor = coupling;
> +       }
> +
> +       if (tb[TCA_DUALPI2_STEP_PACKETS])
> +               q->step.in_packets = !!nla_get_u8(tb[TCA_DUALPI2_STEP_PACKETS]);
> +
> +       if (tb[TCA_DUALPI2_DROP_OVERLOAD])
> +               q->drop_overload = !!nla_get_u8(tb[TCA_DUALPI2_DROP_OVERLOAD]);
> +
> +       if (tb[TCA_DUALPI2_DROP_EARLY])
> +               q->drop_early = !!nla_get_u8(tb[TCA_DUALPI2_DROP_EARLY]);
> +
> +       if (tb[TCA_DUALPI2_C_PROTECTION]) {
> +               u8 wc = nla_get_u8(tb[TCA_DUALPI2_C_PROTECTION]);
> +
> +               if (wc > MAX_WC) {
> +                       NL_SET_ERR_MSG_ATTR(extack,
> +                                           tb[TCA_DUALPI2_C_PROTECTION],
> +                                           "c_protection must be <= 100.");
> +                       sch_tree_unlock(sch);
> +                       return -EINVAL;
> +               }
> +               dualpi2_calculate_c_protection(sch, q, wc);
> +       }
> +
> +       if (tb[TCA_DUALPI2_ECN_MASK])
> +               q->ecn_mask = nla_get_u8(tb[TCA_DUALPI2_ECN_MASK]);
> +
> +       if (tb[TCA_DUALPI2_SPLIT_GSO])
> +               q->split_gso = !!nla_get_u8(tb[TCA_DUALPI2_SPLIT_GSO]);
> +
> +       old_qlen = qdisc_qlen(sch);
> +       old_backlog = sch->qstats.backlog;
> +       while (qdisc_qlen(sch) > sch->limit) {
> +               struct sk_buff *skb = __qdisc_dequeue_head(&sch->q);
> +
> +               qdisc_qstats_backlog_dec(sch, skb);
> +               rtnl_qdisc_drop(skb, sch);
> +       }
> +       qdisc_tree_reduce_backlog(sch, old_qlen - qdisc_qlen(sch),
> +                                 old_backlog - sch->qstats.backlog);
> +
> +       sch_tree_unlock(sch);
> +       return 0;
> +}
> +
> +/* Default alpha/beta values give a 10dB stability margin with max_rtt=100ms. */
> +static void dualpi2_reset_default(struct dualpi2_sched_data *q)
> +{
> +       q->sch->limit = 10000;                          /* Max 125ms at 1Gbps */

... assuming gso/gro is not in use

> +
> +       q->pi2.target = 15 * NSEC_PER_MSEC;
> +       q->pi2.tupdate = 16 * NSEC_PER_MSEC;
> +       q->pi2.alpha = dualpi2_scale_alpha_beta(41);    /* ~0.16 Hz * 256 */
> +       q->pi2.beta = dualpi2_scale_alpha_beta(819);    /* ~3.20 Hz * 256 */
> +
> +       q->step.thresh = 1 * NSEC_PER_MSEC;
> +       q->step.in_packets = false;
> +
> +       dualpi2_calculate_c_protection(q->sch, q, 10);  /* wc=10%, wl=90% */
> +
> +       q->ecn_mask = INET_ECN_ECT_1;
> +       q->coupling_factor = 2;         /* window fairness for equal RTTs */
> +       q->drop_overload = true;        /* Preserve latency by dropping */
> +       q->drop_early = false;          /* PI2 drops on dequeue */
> +       q->split_gso = true;
> +}
> +
> +static int dualpi2_init(struct Qdisc *sch, struct nlattr *opt,
> +                       struct netlink_ext_ack *extack)
> +{
> +       struct dualpi2_sched_data *q = qdisc_priv(sch);
> +       int err;
> +
> +       q->l_queue = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
> +                                      TC_H_MAKE(sch->handle, 1), extack);
> +       if (!q->l_queue)
> +               return -ENOMEM;
> +
> +       err = tcf_block_get(&q->tcf.block, &q->tcf.filters, sch, extack);
> +       if (err)
> +               return err;
> +
> +       q->sch = sch;
> +       dualpi2_reset_default(q);
> +       hrtimer_init(&q->pi2.timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> +       q->pi2.timer.function = dualpi2_timer;
> +
> +       if (opt) {
> +               err = dualpi2_change(sch, opt, extack);
> +
> +               if (err)
> +                       return err;
> +       }
> +
> +       hrtimer_start(&q->pi2.timer, next_pi2_timeout(q),
> +                     HRTIMER_MODE_ABS_PINNED);
> +       return 0;
> +}
> +
> +static u32 convert_ns_to_usec(u64 ns)
> +{
> +       do_div(ns, NSEC_PER_USEC);
> +       return ns;
> +}
> +
> +static int dualpi2_dump(struct Qdisc *sch, struct sk_buff *skb)
> +{
> +       struct dualpi2_sched_data *q = qdisc_priv(sch);
> +       struct nlattr *opts;
> +
> +       opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
> +       if (!opts)
> +               goto nla_put_failure;
> +
> +       if (nla_put_u32(skb, TCA_DUALPI2_LIMIT, sch->limit) ||
> +           nla_put_u32(skb, TCA_DUALPI2_TARGET,
> +                       convert_ns_to_usec(q->pi2.target)) ||
> +           nla_put_u32(skb, TCA_DUALPI2_TUPDATE,
> +                       convert_ns_to_usec(q->pi2.tupdate)) ||
> +           nla_put_u32(skb, TCA_DUALPI2_ALPHA,
> +                       dualpi2_unscale_alpha_beta(q->pi2.alpha)) ||
> +           nla_put_u32(skb, TCA_DUALPI2_BETA,
> +                       dualpi2_unscale_alpha_beta(q->pi2.beta)) ||
> +           nla_put_u32(skb, TCA_DUALPI2_STEP_THRESH, q->step.in_packets ?
> +                       q->step.thresh : convert_ns_to_usec(q->step.thresh)) ||
> +           nla_put_u8(skb, TCA_DUALPI2_COUPLING, q->coupling_factor) ||
> +           nla_put_u8(skb, TCA_DUALPI2_DROP_OVERLOAD, q->drop_overload) ||
> +           nla_put_u8(skb, TCA_DUALPI2_STEP_PACKETS, q->step.in_packets) ||
> +           nla_put_u8(skb, TCA_DUALPI2_DROP_EARLY, q->drop_early) ||
> +           nla_put_u8(skb, TCA_DUALPI2_C_PROTECTION, q->c_protection.wc) ||
> +           nla_put_u8(skb, TCA_DUALPI2_ECN_MASK, q->ecn_mask) ||
> +           nla_put_u8(skb, TCA_DUALPI2_SPLIT_GSO, q->split_gso))
> +               goto nla_put_failure;
> +
> +       return nla_nest_end(skb, opts);
> +
> +nla_put_failure:
> +       nla_nest_cancel(skb, opts);
> +       return -1;
> +}
> +
> +static int dualpi2_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
> +{
> +       struct dualpi2_sched_data *q = qdisc_priv(sch);
> +       struct tc_dualpi2_xstats st = {
> +               .prob           = READ_ONCE(q->pi2.prob),
> +               .packets_in_c   = q->packets_in_c,
> +               .packets_in_l   = q->packets_in_l,
> +               .maxq           = q->maxq,
> +               .ecn_mark       = q->ecn_mark,
> +               .credit         = q->c_protection.credit,
> +               .step_marks     = q->step_marks,
> +       };
> +       u64 qc, ql;
> +
> +       get_queue_delays(q, &qc, &ql);
> +       st.delay_l = convert_ns_to_usec(ql);
> +       st.delay_c = convert_ns_to_usec(qc);
> +       return gnet_stats_copy_app(d, &st, sizeof(st));
> +}
> +
> +static void dualpi2_reset(struct Qdisc *sch)
> +{
> +       struct dualpi2_sched_data *q = qdisc_priv(sch);
> +
> +       qdisc_reset_queue(sch);
> +       qdisc_reset_queue(q->l_queue);
> +       q->c_head_ts = 0;
> +       q->l_head_ts = 0;
> +       q->pi2.prob = 0;
> +       q->packets_in_c = 0;
> +       q->packets_in_l = 0;
> +       q->maxq = 0;
> +       q->ecn_mark = 0;
> +       q->step_marks = 0;
> +       dualpi2_reset_c_protection(q);
> +}
> +
> +static void dualpi2_destroy(struct Qdisc *sch)
> +{
> +       struct dualpi2_sched_data *q = qdisc_priv(sch);
> +
> +       q->pi2.tupdate = 0;
> +       hrtimer_cancel(&q->pi2.timer);
> +       if (q->l_queue)
> +               qdisc_put(q->l_queue);
> +       tcf_block_put(q->tcf.block);
> +}
> +
> +static struct Qdisc *dualpi2_leaf(struct Qdisc *sch, unsigned long arg)
> +{
> +       return NULL;
> +}
> +
> +static unsigned long dualpi2_find(struct Qdisc *sch, u32 classid)
> +{
> +       return 0;
> +}
> +
> +static unsigned long dualpi2_bind(struct Qdisc *sch, unsigned long parent,
> +                                 u32 classid)
> +{
> +       return 0;
> +}
> +
> +static void dualpi2_unbind(struct Qdisc *q, unsigned long cl)
> +{
> +}
> +
> +static struct tcf_block *dualpi2_tcf_block(struct Qdisc *sch, unsigned long cl,
> +                                          struct netlink_ext_ack *extack)
> +{
> +       struct dualpi2_sched_data *q = qdisc_priv(sch);
> +
> +       if (cl)
> +               return NULL;
> +       return q->tcf.block;
> +}
> +
> +static void dualpi2_walk(struct Qdisc *sch, struct qdisc_walker *arg)
> +{
> +       unsigned int i;
> +
> +       if (arg->stop)
> +               return;
> +
> +       /* We statically define only 2 queues */
> +       for (i = 0; i < 2; i++) {
> +               if (arg->count < arg->skip) {
> +                       arg->count++;
> +                       continue;
> +               }
> +               if (arg->fn(sch, i + 1, arg) < 0) {
> +                       arg->stop = 1;
> +                       break;
> +               }
> +               arg->count++;
> +       }
> +}
> +
> +/* Minimal class support to handle tc filters */
> +static const struct Qdisc_class_ops dualpi2_class_ops = {
> +       .leaf           = dualpi2_leaf,
> +       .find           = dualpi2_find,
> +       .tcf_block      = dualpi2_tcf_block,
> +       .bind_tcf       = dualpi2_bind,
> +       .unbind_tcf     = dualpi2_unbind,
> +       .walk           = dualpi2_walk,
> +};
> +
> +static struct Qdisc_ops dualpi2_qdisc_ops __read_mostly = {
> +       .id             = "dualpi2",
> +       .cl_ops         = &dualpi2_class_ops,
> +       .priv_size      = sizeof(struct dualpi2_sched_data),
> +       .enqueue        = dualpi2_qdisc_enqueue,
> +       .dequeue        = dualpi2_qdisc_dequeue,
> +       .peek           = qdisc_peek_dequeued,
> +       .init           = dualpi2_init,
> +       .destroy        = dualpi2_destroy,
> +       .reset          = dualpi2_reset,
> +       .change         = dualpi2_change,
> +       .dump           = dualpi2_dump,
> +       .dump_stats     = dualpi2_dump_stats,
> +       .owner          = THIS_MODULE,
> +};
> +
> +static int __init dualpi2_module_init(void)
> +{
> +       return register_qdisc(&dualpi2_qdisc_ops);
> +}
> +
> +static void __exit dualpi2_module_exit(void)
> +{
> +       unregister_qdisc(&dualpi2_qdisc_ops);
> +}
> +
> +module_init(dualpi2_module_init);
> +module_exit(dualpi2_module_exit);
> +
> +MODULE_DESCRIPTION("Dual Queue with Proportional Integral controller Improved with a Square (dualpi2) scheduler");
> +MODULE_AUTHOR("Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>");
> +MODULE_AUTHOR("Olga Albisser <olga@albisser.org>");
> +MODULE_AUTHOR("Henrik Steen <henrist@henrist.net>");
> +MODULE_AUTHOR("Olivier Tilmans <olivier.tilmans@nokia.com>");
> +MODULE_AUTHOR("Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>");
> +
> +MODULE_LICENSE("GPL");
> +MODULE_VERSION("1.0");
> --
> 2.34.1
>
>
Koen De Schepper (Nokia) Oct. 28, 2024, 6:37 p.m. UTC | #2
See below,

Regards,
Koen.

> -----Original Message-----
> From: Dave Taht <dave.taht@gmail.com> 
> Sent: Saturday, October 26, 2024 8:57 PM
> To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>
> Cc: netdev@vger.kernel.org; davem@davemloft.net; stephen@networkplumber.org; jhs@mojatatu.com; edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; dsahern@kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@cablelabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com; Olga Albisser <olga@albisser.org>; Olivier Tilmans (Nokia) <olivier.tilmans@nokia.com>; Henrik Steen <henrist@henrist.net>; Bob Briscoe <research@bobbriscoe.net>
> Subject: Re: [PATCH v4 net-next 1/1] sched: Add dualpi2 qdisc


> Has this been tested mq->an_aqm_queue_per_core or just as a
> htb+dualpi, and on what platforms?

It is a qdisc that should work in any combination. We mainly tested with HTB, directly on the real interface and with multiple instances in namespaces. We didn't test all the combinations. Did you see any indication that made you expect problems?

> I was also under the impression that 2ms was a more robust target from
> tests given typical scheduling delays and virtualization.

It is a parameter with a default of 1ms, which is a very achievable target on Ethernet links. If in certain deployments it is not achievable, it can be relaxed if needed with a simple parameter. On wireless links, dedicated integration with the driver is needed for best performance, but that is outside the scope of this AQM.

> It appears that gso-splitting is the default? What happens with that off?

It might work under certain environmental conditions or with certain combinations of more relaxed parameters, but it will create problems in other cases. Do you suggest we should force gso-splitting to always be on, without the option? I guess it is useful, if the conditions are present in certain deployments, to be able to disable it?

> What would be a good DC setting?

The alpha and beta parameters do not need to be set directly. The easy way to configure DualPI2 is to set the typical RTT and max RTT; the optimal parameters are derived from those. So, for a DC it could be:
     Auto-configuring parameters using [max_rtt: 5ms, typical_rtt: 100us]: target=100us tupdate=100us alpha=0.400000 beta=60.000002
Showing the following full config:
     qdisc dualpi2 1: root refcnt 17 limit 10000p target 100us tupdate 100us alpha 0.394531 beta 59.996094 l4s_ect coupling_factor 2 drop_on_overload step_thresh 1ms drop_dequeue split_gso classic_protection 10%
Or any other typical values can be used.

We will clarify this better in the description and man pages and promote the "simple and safe parameters". Probably we should also list the "experiment at own risk" parameters...
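
For reference, a sketch of the auto-tuning behind those numbers, following the RFC 9332 Appendix A heuristics (the exact constants used by the tc front-end may differ slightly):

     target  = typical_rtt
     tupdate = min(typical_rtt, max_rtt / 3)
     alpha   = 0.1 * tupdate / max_rtt^2     [Hz]
     beta    = 0.3 / max_rtt                 [Hz]

With max_rtt = 5ms and typical_rtt = 100us this gives target = 100us, tupdate = 100us, alpha = 0.1 * 1e-4 / (5e-3)^2 = 0.4 Hz and beta = 0.3 / 5e-3 = 60 Hz, matching the output above.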

>> +        name: maxq
>> +        type: u32
>> +        doc: Maximum number of packets seen in the DualPI2
>
> Seen "by". Also this number will tend towards a peak and stay there,
> and thus is not a particularly useful stat.

Thanks, will be fixed. The stats can be reset, so it can be used to find the peak queue occupancy in an interval. It can be removed if people think it is not useful.

>> +        name: ecn_mark
>> +        type: u32
>> +        doc: All packets marked with ecn
>
>Since this has higher rates of marking than drop perhaps this should be 64 bits.

All packet counters are typically 32 bits. Would need to be changed in a lot of qdiscs...

>>      name: tc-fq-pie-xstats
>
>? fq-pie?

Thanks, typo that will be fixed.

>> +        name: limit
>> +        type: u32
>> +        doc: Limit of total number of packets in queue
>
>I have noted previously that memlimits make more sense than packet
>limits given the dynamic range of
>64b-64kb of a modern gso/tso packet.

All qdiscs use packet limits. Would again deviate from the common practice...

>> +        name: max_rtt
>> +        type: u32
>> +        doc: The maximum expected RTT of the traffic that is controlled by DualPI2
>
>In what units?

In the tc command the unit needs to be specified (although the default unit is currently us); it is not present in the data structure, as it is converted into the other parameters. We can mention the default unit.

>> + * note: DCTCP is not Prague compliant, so DCTCP & DualPI2 can only be
>> + *   used in DC context; BBRv3 (overwrites bbr) stopped Prague support,
>
>This is really confusing and up until this moment I thought bbrv3 used
>an ecn marking strategy compatible with prague.

As far as I know, the BBRv3 ECN implementation does not implement all Prague requirements and is not intended to be used on the Internet, nor for real-time interactive apps. Tests show that BBR's RTT probes pause the throughput unnecessarily, and it still does throughput probes (creating unnecessary latency spikes). We will verify with the BBR maintainers and clarify the text.

>> + *   you should use TCP-Prague instead for low latency apps
>
>This is kind of opinionated.

We will change it to "should use a Prague-compliant CC for use on the Internet".

>> +       if (unlikely(qdisc_qlen(sch) >= sch->limit)) {
>> +               qdisc_qstats_overlimit(sch);
>> +               if (skb_in_l_queue(skb))
>> +                       qdisc_qstats_overlimit(q->l_queue);
>
>shouldn't this be:
>
>               if (skb_in_l_queue(skb))
>                       qdisc_qstats_overlimit(q->l_queue);
>                else
>                       qdisc_qstats_overlimit(sch);

No, it increments 2 different counters. At the first level we keep the overall stats and increment them for all packets (although the queue at this level only contains the C-queue); at the l_queue level we keep the L-stats only. If C-only stats are required, the L-stats need to be subtracted from the overall stats.
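
As a minimal illustration (a sketch only, assuming the counter layout of this patch), the C-only values can be recovered by subtraction, e.g.:

	u32 c_overlimits = sch->qstats.overlimits - q->l_queue->qstats.overlimits;
	u32 c_drops      = sch->qstats.drops - q->l_queue->qstats.drops;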

>> +/* Optionally, dualpi2 will split GSO skbs into independent skbs and enqueue
>
>By default

Thanks, will be fixed.

> I had really grave doubts as to whether L4S would work with GSO at all.

It will definitely behave differently, and most likely won't be usable for the Internet. If used as a DC AQM, it might still be possible to disable it. But as said before, we can remove this option. We haven't explored the possibilities further, but we don't want to prevent others from doing so.

>> +                               /* Compute the backlog adjustement that needs
>
>spelling: "adjustment"

Thanks, will be fixed.

>> +       q->sch->limit = 10000;                          /* Max 125ms at 1Gbps */
>
>... assuming gso/gro is not in use

"At least 125ms at 1Gbps" is intended. Typically, the limit causes taildrop if not big enough. True is GSO is in full use the time (and memory used could be up to 40 times bigger.
Paolo Abeni Oct. 29, 2024, 12:56 p.m. UTC | #3
On 10/22/24 00:12, chia-yu.chang@nokia-bell-labs.com wrote:
> +/* Default alpha/beta values give a 10dB stability margin with max_rtt=100ms. */
> +static void dualpi2_reset_default(struct dualpi2_sched_data *q)
> +{
> +	q->sch->limit = 10000;				/* Max 125ms at 1Gbps */
> +
> +	q->pi2.target = 15 * NSEC_PER_MSEC;
> +	q->pi2.tupdate = 16 * NSEC_PER_MSEC;
> +	q->pi2.alpha = dualpi2_scale_alpha_beta(41);	/* ~0.16 Hz * 256 */
> +	q->pi2.beta = dualpi2_scale_alpha_beta(819);	/* ~3.20 Hz * 256 */
> +
> +	q->step.thresh = 1 * NSEC_PER_MSEC;
> +	q->step.in_packets = false;
> +
> +	dualpi2_calculate_c_protection(q->sch, q, 10);	/* wc=10%, wl=90% */
> +
> +	q->ecn_mask = INET_ECN_ECT_1;
> +	q->coupling_factor = 2;		/* window fairness for equal RTTs */
> +	q->drop_overload = true;	/* Preserve latency by dropping */
> +	q->drop_early = false;		/* PI2 drops on dequeue */
> +	q->split_gso = true;

This is a very unexpected default. Splitting GSO packets earlier WRT the
H/W constraints definitely impacts performance in a bad way.

Under which conditions is this expected to give better results?
It should be at least documented clearly.

Thanks,

Paolo
Chia-Yu Chang (Nokia) Oct. 29, 2024, 3:27 p.m. UTC | #4
Pls see below

Best regards,
Chia-Yu

> -----Original Message-----
> From: Paolo Abeni <pabeni@redhat.com> 
> Sent: Tuesday, October 29, 2024 1:56 PM
> To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; netdev@vger.kernel.org; davem@davemloft.net; stephen@networkplumber.org; jhs@mojatatu.com; edumazet@google.com; kuba@kernel.org; dsahern@kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white@CableLabs.com; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel@apple.com
> Cc: Olga Albisser <olga@albisser.org>; Olivier Tilmans (Nokia) <olivier.tilmans@nokia.com>; Henrik Steen <henrist@henrist.net>; Bob Briscoe <research@bobbriscoe.net>
> Subject: Re: [PATCH v4 net-next 1/1] sched: Add dualpi2 qdisc


On 10/22/24 00:12, chia-yu.chang@nokia-bell-labs.com wrote:
>> +/* Default alpha/beta values give a 10dB stability margin with 
>> +max_rtt=100ms. */ static void dualpi2_reset_default(struct 
>> +dualpi2_sched_data *q) {
>> +     q->sch->limit = 10000;                          /* Max 125ms at 1Gbps */
>> +
>> +     q->pi2.target = 15 * NSEC_PER_MSEC;
>> +     q->pi2.tupdate = 16 * NSEC_PER_MSEC;
>> +     q->pi2.alpha = dualpi2_scale_alpha_beta(41);    /* ~0.16 Hz * 256 */
>> +     q->pi2.beta = dualpi2_scale_alpha_beta(819);    /* ~3.20 Hz * 256 */
>> +
>> +     q->step.thresh = 1 * NSEC_PER_MSEC;
>> +     q->step.in_packets = false;
>> +
>> +     dualpi2_calculate_c_protection(q->sch, q, 10);  /* wc=10%, 
>> + wl=90% */
>> +
>> +     q->ecn_mask = INET_ECN_ECT_1;
>> +     q->coupling_factor = 2;         /* window fairness for equal RTTs */
>> +     q->drop_overload = true;        /* Preserve latency by dropping */
>> +     q->drop_early = false;          /* PI2 drops on dequeue */
>> +     q->split_gso = true;

> This is a very unexpected default. Splitting GSO packets earlier WRT the H/W constaints definitely impact performances in a bad way.

> Under which condition this is expected to give better results?
> It should be at least documented clearly.

> Thanks,

> Paolo

I see a similar operation exists in other qdiscs (e.g., sch_tbf.c and sch_cake.c); they both walk through the segments of the skb list.
I also see other qdiscs use the "skb_list_walk_safe" macro, so I was thinking of following a similar approach in dualpi2, roughly as sketched below (or please let me know if there are other comments).
Or do you suggest we should force gso-splitting as in other qdiscs?
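
For illustration only, a rough sketch of what that could look like (error handling, backlog accounting, stats and classification copying elided; dualpi2_enqueue_skb() is the per-skb enqueue helper from this patch):

static int dualpi2_split_and_enqueue(struct sk_buff *skb, struct Qdisc *sch,
				     struct sk_buff **to_free)
{
	netdev_features_t features = netif_skb_features(skb);
	struct sk_buff *segs, *nskb, *next;
	int err = NET_XMIT_DROP;

	/* Segment the GSO skb in software, then enqueue each segment */
	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
	if (IS_ERR_OR_NULL(segs))
		return qdisc_drop(skb, sch, to_free);

	skb_list_walk_safe(segs, nskb, next) {
		skb_mark_not_on_list(nskb);
		qdisc_skb_cb(nskb)->pkt_len = nskb->len;
		err = dualpi2_enqueue_skb(nskb, sch, to_free);
	}
	consume_skb(skb);
	return err;
}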

Chia-Yu
Paolo Abeni Oct. 29, 2024, 4:08 p.m. UTC | #5
On 10/29/24 16:27, Chia-Yu Chang (Nokia) wrote:
> On Tuesday, October 29, 2024 1:56 PM Paolo Abeni <pabeni@redhat.com>  wrote:
>> On 10/22/24 00:12, chia-yu.chang@nokia-bell-labs.com wrote:
>>> +/* Default alpha/beta values give a 10dB stability margin with 
>>> +max_rtt=100ms. */ static void dualpi2_reset_default(struct 
>>> +dualpi2_sched_data *q) {
>>> +     q->sch->limit = 10000;                          /* Max 125ms at 1Gbps */
>>> +
>>> +     q->pi2.target = 15 * NSEC_PER_MSEC;
>>> +     q->pi2.tupdate = 16 * NSEC_PER_MSEC;
>>> +     q->pi2.alpha = dualpi2_scale_alpha_beta(41);    /* ~0.16 Hz * 256 */
>>> +     q->pi2.beta = dualpi2_scale_alpha_beta(819);    /* ~3.20 Hz * 256 */
>>> +
>>> +     q->step.thresh = 1 * NSEC_PER_MSEC;
>>> +     q->step.in_packets = false;
>>> +
>>> +     dualpi2_calculate_c_protection(q->sch, q, 10);  /* wc=10%, 
>>> + wl=90% */
>>> +
>>> +     q->ecn_mask = INET_ECN_ECT_1;
>>> +     q->coupling_factor = 2;         /* window fairness for equal RTTs */
>>> +     q->drop_overload = true;        /* Preserve latency by dropping */
>>> +     q->drop_early = false;          /* PI2 drops on dequeue */
>>> +     q->split_gso = true;
> 
>> This is a very unexpected default. Splitting GSO packets earlier WRT the H/W constaints definitely impact performances in a bad way.
> 
>> Under which condition this is expected to give better results?
>> It should be at least documented clearly.
> 
> I see a similar operation exists in other qdisc (e.g., sch_tbf.c and sch_cake). They both walk through segs of skb_list.
> Instead, I see other qdisc use "skb_list_walk_safe" macro, so I was thinking to follow a similar approach in dualpi2 (or other comments please let me know).
> Or do you suggest we should force gso-splitting like in other qdisc?

The main point is not traversing an skb list, but the segmentation this
scheduler performs by default. Note that the sch_tbf case is slightly
different, as it segments skbs as a last resort to avoid dropping
packets exceeding the burst limit.
You should provide some more wording or tests showing when and how such
splitting is advantageous - e.g. as was done for the cake scheduler in commit
2db6dc2662bab14e59517ab4b86a164cc4d2db42.

The reason for the above is that performing unneeded S/W segmentation
is, generally speaking, a huge loss.

Thanks,

Paolo
Eric Dumazet Oct. 29, 2024, 4:53 p.m. UTC | #6
On Tue, Oct 29, 2024 at 1:56 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 10/22/24 00:12, chia-yu.chang@nokia-bell-labs.com wrote:
> > +/* Default alpha/beta values give a 10dB stability margin with max_rtt=100ms. */
> > +static void dualpi2_reset_default(struct dualpi2_sched_data *q)
> > +{
> > +     q->sch->limit = 10000;                          /* Max 125ms at 1Gbps */
> > +
> > +     q->pi2.target = 15 * NSEC_PER_MSEC;
> > +     q->pi2.tupdate = 16 * NSEC_PER_MSEC;
> > +     q->pi2.alpha = dualpi2_scale_alpha_beta(41);    /* ~0.16 Hz * 256 */
> > +     q->pi2.beta = dualpi2_scale_alpha_beta(819);    /* ~3.20 Hz * 256 */
> > +
> > +     q->step.thresh = 1 * NSEC_PER_MSEC;
> > +     q->step.in_packets = false;
> > +
> > +     dualpi2_calculate_c_protection(q->sch, q, 10);  /* wc=10%, wl=90% */
> > +
> > +     q->ecn_mask = INET_ECN_ECT_1;
> > +     q->coupling_factor = 2;         /* window fairness for equal RTTs */
> > +     q->drop_overload = true;        /* Preserve latency by dropping */
> > +     q->drop_early = false;          /* PI2 drops on dequeue */
> > +     q->split_gso = true;
>
> This is a very unexpected default. Splitting GSO packets earlier WRT the
> H/W constaints definitely impact performances in a bad way.
>
> Under which condition this is expected to give better results?
> It should be at least documented clearly.

I agree, it is very strange to see this orthogonal feature being
spread in some qdisc.

Also, it seems this qdisc could be a mere sch_prio queue, with two
sch_pie children, or two sch_fq or sch_fq_codel ?

Many of us are using fq_codel or fq, there is no way we can switch to
dualpi2 just to experiment things.
Neal Cardwell Oct. 31, 2024, 1:27 p.m. UTC | #7
On Tue, Oct 29, 2024 at 12:53 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Oct 29, 2024 at 1:56 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On 10/22/24 00:12, chia-yu.chang@nokia-bell-labs.com wrote:
> > > +/* Default alpha/beta values give a 10dB stability margin with max_rtt=100ms. */
> > > +static void dualpi2_reset_default(struct dualpi2_sched_data *q)
> > > +{
> > > +     q->sch->limit = 10000;                          /* Max 125ms at 1Gbps */
> > > +
> > > +     q->pi2.target = 15 * NSEC_PER_MSEC;
> > > +     q->pi2.tupdate = 16 * NSEC_PER_MSEC;
> > > +     q->pi2.alpha = dualpi2_scale_alpha_beta(41);    /* ~0.16 Hz * 256 */
> > > +     q->pi2.beta = dualpi2_scale_alpha_beta(819);    /* ~3.20 Hz * 256 */
> > > +
> > > +     q->step.thresh = 1 * NSEC_PER_MSEC;
> > > +     q->step.in_packets = false;
> > > +
> > > +     dualpi2_calculate_c_protection(q->sch, q, 10);  /* wc=10%, wl=90% */
> > > +
> > > +     q->ecn_mask = INET_ECN_ECT_1;
> > > +     q->coupling_factor = 2;         /* window fairness for equal RTTs */
> > > +     q->drop_overload = true;        /* Preserve latency by dropping */
> > > +     q->drop_early = false;          /* PI2 drops on dequeue */
> > > +     q->split_gso = true;
> >
> > This is a very unexpected default. Splitting GSO packets earlier WRT the
> > H/W constaints definitely impact performances in a bad way.
> >
> > Under which condition this is expected to give better results?
> > It should be at least documented clearly.
>
> I agree, it is very strange to see this orthogonal feature being
> spread in some qdisc.

IMHO it makes sense to offer this split_gso feature in the dualpi2
qdisc because the dualpi2 qdisc is targeted at reducing latency and
targeted mostly at hops in the last mile of the public Internet, where
there can be orders of magnitude disparities in bandwidth between
upstream and downstream links (e.g., packets arriving over 10G
ethernet and leaving destined for a 10M DSL link). In such cases, GRO
may aggregate many packets into a single skb receiving data on a fast
ingress link, and then may want to reduce latency issues on the slow
link by allowing smaller skbs to be enqueued on the slower egress
link.

> Also, it seems this qdisc could be a mere sch_prio queue, with two
> sch_pie children, or two sch_fq or sch_fq_codel ?

Having two independent children would not allow meeting the dualpi2
goal to "preserve fairness between ECN-capable and non-ECN-capable
traffic." (quoting text from https://datatracker.ietf.org/doc/rfc9332/
). The main issue is that there may be differing numbers of flows in
the ECN-capable and non-ECN-capable queues, and yet dualpi2 wants to
maintain approximate per-flow fairness on both sides. To do this, it
uses a single qdisc with coupling of the ECN mark rate in the
ECN-capable queue and drop rate in the non-ECN-capable queue.
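
Concretely (per RFC 9332), dualpi2 maintains a single internal probability p' and applies

    p_C = p'^2      to the Classic queue (drop, or CE mark for ECT(0) packets)
    p_L = k * p'    to the L4S queue (CE marks), with k = coupling_factor (2 by default)

Since a Reno/Cubic-style flow's window scales roughly as 1/sqrt(p_C) = 1/p' while a scalable (Prague/DCTCP-style) flow's window scales roughly as 1/p_L = 1/(k * p'), the constant factors work out such that k = 2 gives approximately equal windows for flows with equal RTTs.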

This could probably be made more clear in the commit message.

> Many of us are using fq_codel or fq, there is no way we can switch to
> dualpi2 just to experiment things.

Yes, sites that are using fq_codel or fq do not need to switch to dualpi2.

AFAIK the idea with dualpi2 is to offer a new qdisc for folks
developing hardware for the last mile of the Internet where you want
low latency via L4S, and want approximate per-flow fairness between
L4S and non-L4S traffic, even in the presence of VPN-encrypted traffic
(where flow identifiers are not available for fq_codel or fq fair
queuing).

Sites that don't have VPN traffic or don't care about the VPN issue
can use fq or fq_codel with the ce_threshold parameter to allow low
latency via L4S while achieving approximate per-flow fairness.

best regards,
neal
Eric Dumazet Oct. 31, 2024, 2:30 p.m. UTC | #8
On Thu, Oct 31, 2024 at 2:28 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Tue, Oct 29, 2024 at 12:53 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Oct 29, 2024 at 1:56 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > On 10/22/24 00:12, chia-yu.chang@nokia-bell-labs.com wrote:
> > > > +/* Default alpha/beta values give a 10dB stability margin with max_rtt=100ms. */
> > > > +static void dualpi2_reset_default(struct dualpi2_sched_data *q)
> > > > +{
> > > > +     q->sch->limit = 10000;                          /* Max 125ms at 1Gbps */
> > > > +
> > > > +     q->pi2.target = 15 * NSEC_PER_MSEC;
> > > > +     q->pi2.tupdate = 16 * NSEC_PER_MSEC;
> > > > +     q->pi2.alpha = dualpi2_scale_alpha_beta(41);    /* ~0.16 Hz * 256 */
> > > > +     q->pi2.beta = dualpi2_scale_alpha_beta(819);    /* ~3.20 Hz * 256 */
> > > > +
> > > > +     q->step.thresh = 1 * NSEC_PER_MSEC;
> > > > +     q->step.in_packets = false;
> > > > +
> > > > +     dualpi2_calculate_c_protection(q->sch, q, 10);  /* wc=10%, wl=90% */
> > > > +
> > > > +     q->ecn_mask = INET_ECN_ECT_1;
> > > > +     q->coupling_factor = 2;         /* window fairness for equal RTTs */
> > > > +     q->drop_overload = true;        /* Preserve latency by dropping */
> > > > +     q->drop_early = false;          /* PI2 drops on dequeue */
> > > > +     q->split_gso = true;
> > >
> > > This is a very unexpected default. Splitting GSO packets earlier WRT the
> > > H/W constaints definitely impact performances in a bad way.
> > >
> > > Under which condition this is expected to give better results?
> > > It should be at least documented clearly.
> >
> > I agree, it is very strange to see this orthogonal feature being
> > spread in some qdisc.
>
> IMHO it makes sense to offer this split_gso feature in the dualpi2
> qdisc because the dualpi2 qdisc is targeted at reducing latency and
> targeted mostly at hops in the last mile of the public Internet, where
> there can be orders of magnitude disparities in bandwidth between
> upstream and downstream links (e.g., packets arriving over 10G
> ethernet and leaving destined for a 10M DSL link). In such cases, GRO
> may aggregate many packets into a single skb receiving data on a fast
> ingress link, and then may want to reduce latency issues on the slow
> link by allowing smaller skbs to be enqueued on the slower egress
> link.
>
> > Also, it seems this qdisc could be a mere sch_prio queue, with two
> > sch_pie children, or two sch_fq or sch_fq_codel ?
>
> Having two independent children would not allow meeting the dualpi2
> goal to "preserve fairness between ECN-capable and non-ECN-capable
> traffic." (quoting text from https://datatracker.ietf.org/doc/rfc9332/
> ). The main issue is that there may be differing numbers of flows in
> the ECN-capable and non-ECN-capable queues, and yet dualpi2 wants to
> maintain approximate per-flow fairness on both sides. To do this, it
> uses a single qdisc with coupling of the ECN mark rate in the
> ECN-capable queue and drop rate in the non-ECN-capable queue.

Not sure I understand this argument.

The dequeue  seems to use WRR, so this means that instead of prio,
this could use net/sched/sch_drr.c,
then two PIE (with different settings) as children, and a proper
classify at enqueue to choose one queue or the other.

Reviewing ~1000 lines of code, knowing that in one year another
net/sched/sch_fq_dualpi2.c
will follow (as net/sched/sch_fq_pie.c followed net/sched/sch_pie.c )
is not exactly appealing to me.

/* Select the queue from which the next packet can be dequeued, ensuring that
+ * neither queue can starve the other with a WRR scheduler.
+ *
+ * The sign of the WRR credit determines the next queue, while the size of
+ * the dequeued packet determines the magnitude of the WRR credit change. If
+ * either queue is empty, the WRR credit is kept unchanged.
+ *
+ * As the dequeued packet can be dropped later, the caller has to perform the
+ * qdisc_bstats_update() calls.
+ */
+static struct sk_buff *dequeue_packet(struct Qdisc *sch,
+                                     struct dualpi2_sched_data *q,
+                                     int *credit_change,
+                                     u64 now)
+{


>
> This could probably be made more clear in the commit message.
>
> > Many of us are using fq_codel or fq, there is no way we can switch to
> > dualpi2 just to experiment things.
>
> Yes, sites that are using fq_codel or fq do not need to switch to dualpi2.
>
> AFAIK the idea with dualpi2 is to offer a new qdisc for folks
> developing hardware for the last mile of the Internet where you want
> low latency via L4S, and want approximate per-flow fairness between
> L4S and non-L4S traffic, even in the presence of VPN-encrypted traffic
> (where flow identifiers are not available for fq_codel or fq fair
> queuing).
>
> Sites that don't have VPN traffic or don't care about the VPN issue
> can use fq or fq_codel with the ce_threshold parameter to allow low
> latency via L4S while achieving approximate per-flow fairness.
>
> best regards,
> neal
Koen De Schepper (Nokia) Oct. 31, 2024, 4:45 p.m. UTC | #9
From: Eric Dumazet <edumazet@google.com> 
Sent: Thursday, October 31, 2024 3:31 PM
> On Thu, Oct 31, 2024 at 2:28 PM Neal Cardwell <ncardwell@google.com> wrote:
> > On Tue, Oct 29, 2024 at 12:53 PM Eric Dumazet <edumazet@google.com> wrote:
> > > Also, it seems this qdisc could be a mere sch_prio queue, with two 
> > > sch_pie children, or two sch_fq or sch_fq_codel ?
> >
> > Having two independent children would not allow meeting the dualpi2 
> > goal to "preserve fairness between ECN-capable and non-ECN-capable 
> > traffic." (quoting text from https://datatracker.ietf.org/doc/rfc9332/
> > ). The main issue is that there may be differing numbers of flows in 
> > the ECN-capable and non-ECN-capable queues, and yet dualpi2 wants to 
> > maintain approximate per-flow fairness on both sides. To do this, it 
> > uses a single qdisc with coupling of the ECN mark rate in the 
> > ECN-capable queue and drop rate in the non-ECN-capable queue.
>
> Not sure I understand this argument.
>
> The dequeue  seems to use WRR, so this means that instead of prio, this could use net/sched/sch_drr.c, then two PIE (with different settings) as children, and a proper classify at enqueue to choose one queue or the other.
>
> Reviewing ~1000 lines of code, knowing that in one year another net/sched/sch_fq_dualpi2.c will follow (as net/sched/sch_fq_pie.c followed net/sched/sch_pie.c ) is not exactly appealing to me.

This composition doesn't work. We need more than just 2 independent AQMs and a scheduler. The coupling between the queues and other extra interworking conditions are very important here, and unfortunately they are not possible with a composition of existing qdiscs.

Also, we don't expect any FQ and DualQ merger. Using only 2 queues (one for each class, L4S and Classic) is one of the differentiating features of DualQ compared to FQ, with a lower L4S tail latency compared to blocking and scheduled FQ qdiscs. Adding FQ_ on top of or under DualQ would break the goal of DualQ. If an FQ_ supporting L4S is needed, then existing FQ_ implementations can be used (like fq_codel) or extended (identifying L4S and using the correct thresholds by default).

Regards,
Koen.
Dave Taht Oct. 31, 2024, 5:27 p.m. UTC | #10
On Thu, Oct 31, 2024 at 9:46 AM Koen De Schepper (Nokia)
<koen.de_schepper@nokia-bell-labs.com> wrote:
>
>
> From: Eric Dumazet <edumazet@google.com>
> Sent: Thursday, October 31, 2024 3:31 PM
> > On Thu, Oct 31, 2024 at 2:28 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > On Tue, Oct 29, 2024 at 12:53 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > Also, it seems this qdisc could be a mere sch_prio queue, with two
> > > > sch_pie children, or two sch_fq or sch_fq_codel ?
> > >
> > > Having two independent children would not allow meeting the dualpi2
> > > goal to "preserve fairness between ECN-capable and non-ECN-capable
> > > traffic." (quoting text from https://datatracker.ietf.org/doc/rfc9332/
> > > ). The main issue is that there may be differing numbers of flows in
> > > the ECN-capable and non-ECN-capable queues, and yet dualpi2 wants to
> > > maintain approximate per-flow fairness on both sides. To do this, it
> > > uses a single qdisc with coupling of the ECN mark rate in the
> > > ECN-capable queue and drop rate in the non-ECN-capable queue.
> >
> > Not sure I understand this argument.
> >
> > The dequeue  seems to use WRR, so this means that instead of prio, this could use net/sched/sch_drr.c, then two PIE (with different settings) as children, and a proper classify at enqueue to choose one queue or the other.
> >
> > Reviewing ~1000 lines of code, knowing that in one year another net/sched/sch_fq_dualpi2.c will follow (as net/sched/sch_fq_pie.c followed net/sched/sch_pie.c ) is not exactly appealing to me.
>
> This composition doesn't work. We need more than 2 independent AQMs and a scheduler. The coupling between the queues and other extra interworking conditions is very important here, which are unfortunately not possible with a composition of existing qdiscs.

I tried to mention that the dualpi concept is not very dual when
hardware mq is in use - one "dualpi" instance per core.

So essential limitations on usage for dualpi are:

Single instance only
gso-splitting only

So it is not suitable as a general purpose data center qdisc because
it simply cannot scale to larger bandwidths.

I think part of the confusion here is that the other stuff that was
originally submitted (AccECN, TCP Prague) needs to be tested somehow,
and a path forward seems to be to put a ce_threshold into sch_fq
matching the L4S ECN bit, with a suitable default (which in dualpi is
1ms; self-congestion is a thing), then incorporate AccECN, then test
Prague driving that, and then, somewhere on the path or in the test
setup, put in a rate-limited dualpi instance?

> Also, we don't expect any FQ and DualQ merger. Using only 2 queues (one for each class L4S and Classic) is one of the differentiating features of DualQ compared to FQ, with a lower L4S tail latency compared to a blocking and scheduled FQ qdiscs.

>Adding FQ_ on top or under DualQ would break the goal of DualQ.

Comparing fq_codel or fq_pie to dualQ would probably be enlightening.
Both of these scale to hardware mq.

In dualpi's defence it seems to be an attempt to mimic a hardware
implementation.

> If an FQ_ supporting L4S is needed, then existing FQ_ implementations can be used (like fq_codel) or extended (identifying L4S and using the correct thresholds by default).

Merely having a preferred value for that threshold would be nice. The
threshold first deployed for fq_codel was far too low for production
environments. If 1ms works, cool!

>
> Regards,
> Koen.
diff mbox series

Patch

diff --git a/Documentation/netlink/specs/tc.yaml b/Documentation/netlink/specs/tc.yaml
index b02d59a0349c..efe5eb2d8b52 100644
--- a/Documentation/netlink/specs/tc.yaml
+++ b/Documentation/netlink/specs/tc.yaml
@@ -816,6 +816,46 @@  definitions:
       -
         name: drop-overmemory
         type: u32
+  -
+    name: tc-dualpi2-xstats
+    type: struct
+    members:
+      -
+        name: prob
+        type: u32
+        doc: Current probability
+      -
+        name: delay_c
+        type: u32
+        doc: Current C-queue delay in microseconds
+      -
+        name: delay_l
+        type: u32
+        doc: Current L-queue delay in microseconds
+      -
+        name: pkts_in_c
+        type: u32
+        doc: Number of packets enqueued in the C-queue
+      -
+        name: pkts_in_l
+        type: u32
+        doc: Number of packets enqueued in the L-queue
+      -
+        name: maxq
+        type: u32
+        doc: Maximum number of packets seen in the DualPI2
+      -
+        name: ecn_mark
+        type: u32
+        doc: All packets marked with ecn
+      -
+        name: step_mark
+        type: u32
+        doc: Only packets marked with ecn due to L-queue step AQM
+      -
+        name: credit
+        type: s32
+        doc: Current credit value for WRR
   -
     name: tc-fq-pie-xstats
     type: struct
@@ -2299,6 +2339,84 @@  attribute-sets:
       -
         name: quantum
         type: u32
+  -
+    name: tc-dualpi2-attrs
+    attributes:
+      -
+        name: limit
+        type: u32
+        doc: Limit of total number of packets in queue
+      -
+        name: target
+        type: u32
+        doc: Classic target delay in microseconds
+      -
+        name: tupdate
+        type: u32
+        doc: Drop probability update interval time in microseconds
+      -
+        name: alpha
+        type: u32
+        doc: Integral gain factor in Hz for PI controller
+      -
+        name: beta
+        type: u32
+        doc: Proportional gain factor in Hz for PI controller
+      -
+        name: step_thresh
+        type: u32
+        doc: L4S step marking threshold in microseconds or in packets (see step_packets)
+      -
+        name: step_packets
+        type: flags
+        doc: L4S Step marking threshold unit
+        entries:
+        - microseconds
+        - packets
+      -
+        name: coupling_factor
+        type: u8
+        doc: Probability coupling factor between Classic and L4S (2 is recommended)
+      -
+        name: drop_overload
+        type: flags
+        doc: Control the overload strategy (drop to preserve latency or let the queue overflow)
+        entries:
+        - drop_on_overload
+        - overflow
+      -
+        name: drop_early
+        type: flags
+        doc: Decide where the Classic packets are PI-based dropped or marked
+        entries:
+        - drop_enqueue
+        - drop_dequeue
+      -
+        name: classic_protection
+        type: u8
+        doc:  Classic WRR weight in percentage (from 0 to 100)
+      -
+        name: ecn_mask
+        type: flags
+        doc: Configure the L-queue ECN classifier
+        entries:
+        - l4s_ect
+        - any_ect
+      -
+        name: gso_split
+        type: flags
+        doc: Split aggregated skb or not
+        entries:
+        - split_gso
+        - no_split_gso
+      -
+        name: max_rtt
+        type: u32
+        doc: The maximum expected RTT of the traffic that is controlled by DualPI2
+      -
+        name: typical_rtt
+        type: u32
+        doc: The typical base RTT of the traffic that is controlled by DualPI2
   -
     name: tc-ematch-attrs
     attributes:
@@ -3679,6 +3797,9 @@  sub-messages:
       -
         value: drr
         attribute-set: tc-drr-attrs
+      -
+        value: dualpi2
+        attribute-set: tc-dualpi2-attrs
       -
         value: etf
         attribute-set: tc-etf-attrs
@@ -3846,6 +3967,9 @@  sub-messages:
       -
         value: codel
         fixed-header: tc-codel-xstats
+      -
+        value: dualpi2
+        fixed-header: tc-dualpi2-xstats
       -
         value: fq
         fixed-header: tc-fq-qd-stats
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8feaca12655e..bdd7d6262112 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -30,6 +30,7 @@ 
 #include <asm/byteorder.h>
 #include <asm/local.h>
 
+#include <linux/netdev_features.h>
 #include <linux/percpu.h>
 #include <linux/rculist.h>
 #include <linux/workqueue.h>
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 25a9a47001cd..f2418eabdcb1 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1210,4 +1210,38 @@  enum {
 
 #define TCA_ETS_MAX (__TCA_ETS_MAX - 1)
 
+/* DUALPI2 */
+enum {
+	TCA_DUALPI2_UNSPEC,
+	TCA_DUALPI2_LIMIT,		/* Packets */
+	TCA_DUALPI2_TARGET,		/* us */
+	TCA_DUALPI2_TUPDATE,		/* us */
+	TCA_DUALPI2_ALPHA,		/* Hz scaled up by 256 */
+	TCA_DUALPI2_BETA,		/* HZ scaled up by 256 */
+	TCA_DUALPI2_STEP_THRESH,	/* Packets or us */
+	TCA_DUALPI2_STEP_PACKETS,	/* Whether STEP_THRESH is in packets */
+	TCA_DUALPI2_COUPLING,		/* Coupling factor between queues */
+	TCA_DUALPI2_DROP_OVERLOAD,	/* Whether to drop on overload */
+	TCA_DUALPI2_DROP_EARLY,		/* Whether to drop on enqueue */
+	TCA_DUALPI2_C_PROTECTION,	/* Percentage */
+	TCA_DUALPI2_ECN_MASK,		/* L4S queue classification mask */
+	TCA_DUALPI2_SPLIT_GSO,		/* Split GSO packets at enqueue */
+	TCA_DUALPI2_PAD,
+	__TCA_DUALPI2_MAX
+};
+
+#define TCA_DUALPI2_MAX   (__TCA_DUALPI2_MAX - 1)
+
+struct tc_dualpi2_xstats {
+	__u32 prob;		/* current probability */
+	__u32 delay_c;		/* current delay in C queue */
+	__u32 delay_l;		/* current delay in L queue */
+	__s32 credit;		/* current c_protection credit */
+	__u32 packets_in_c;	/* number of packets enqueued in C queue */
+	__u32 packets_in_l;	/* number of packets enqueued in L queue */
+	__u32 maxq;		/* maximum queue size */
+	__u32 ecn_mark;		/* packets marked with ecn*/
+	__u32 step_marks;	/* ECN marks due to the step AQM */
+};
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 8180d0c12fce..f00b5ad92ce2 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -403,6 +403,18 @@  config NET_SCH_ETS
 
 	  If unsure, say N.
 
+config NET_SCH_DUALPI2
+	tristate "Dual Queue PI Square (DUALPI2) scheduler"
+	help
+	  Say Y here if you want to use the Dual Queue Proportional Integral
+	  Controller Improved with a Square scheduling algorithm.
+	  For more information, please see https://tools.ietf.org/html/rfc9332
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called sch_dualpi2.
+
+	  If unsure, say N.
+
 menuconfig NET_SCH_DEFAULT
 	bool "Allow override default queue discipline"
 	help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 82c3f78ca486..1abb06554057 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -62,6 +62,7 @@  obj-$(CONFIG_NET_SCH_FQ_PIE)	+= sch_fq_pie.o
 obj-$(CONFIG_NET_SCH_CBS)	+= sch_cbs.o
 obj-$(CONFIG_NET_SCH_ETF)	+= sch_etf.o
 obj-$(CONFIG_NET_SCH_TAPRIO)	+= sch_taprio.o
+obj-$(CONFIG_NET_SCH_DUALPI2)	+= sch_dualpi2.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_dualpi2.c b/net/sched/sch_dualpi2.c
new file mode 100644
index 000000000000..d7b366b9fa42
--- /dev/null
+++ b/net/sched/sch_dualpi2.c
@@ -0,0 +1,1052 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2024 Nokia
+ *
+ * Author: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
+ * Author: Olga Albisser <olga@albisser.org>
+ * Author: Henrik Steen <henrist@henrist.net>
+ * Author: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
+ * Author: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
+ *
+ * DualPI Improved with a Square (dualpi2):
+ * - Supports congestion controls that comply with the Prague requirements
+ *   in RFC9331 (e.g. TCP-Prague)
+ * - Supports coupled dual-queue with PI2 as defined in RFC9332
+ * - Supports ECN L4S-identifier (IP.ECN==0b*1)
+ *
+ * note: DCTCP is not Prague compliant, so DCTCP & DualPI2 can only be
+ *   used in DC context; BBRv3 (overwrites bbr) stopped Prague support,
+ *   you should use TCP-Prague instead for low latency apps
+ *
+ * References:
+ * - RFC9332: https://datatracker.ietf.org/doc/html/rfc9332
+ * - De Schepper, Koen, et al. "PI 2: A linearized AQM for both classic and
+ *   scalable TCP."  in proc. ACM CoNEXT'16, 2016.
+ */
+
+#include <linux/errno.h>
+#include <linux/hrtimer.h>
+#include <linux/if_vlan.h>
+#include <linux/kernel.h>
+#include <linux/limits.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/types.h>
+
+#include <net/gso.h>
+#include <net/inet_ecn.h>
+#include <net/pkt_cls.h>
+#include <net/pkt_sched.h>
+
+/* Using 32b probabilities enables supporting flows with windows up to ~8.6 * 1e9 packets
+ * i.e., twice the maximal snd_cwnd.
+ * MAX_PROB must be consistent with the RNG in dualpi2_roll().
+ */
+#define MAX_PROB U32_MAX
+
+/* alpha/beta values exchanged over netlink are in units of 256ns */
+#define ALPHA_BETA_SHIFT 8
+
+/* Scaled values of alpha/beta must fit in 32b to avoid overflow in later
+ * computations. Consequently (see dualpi2_scale_alpha_beta()), their
+ * netlink-provided values can use at most 31b, i.e. be at most (2^23)-1
+ * (~4MHz) as those are given in 1/256th. This enables tuning alpha/beta to
+ * control flows whose maximal RTTs can be in usec up to few secs.
+ */
+#define ALPHA_BETA_MAX ((1U << 31) - 1)
+
+/* Internal alpha/beta are in units of 64ns.
+ * This enables using all alpha/beta values in the allowed range without loss
+ * of precision due to rounding when scaling them internally, e.g.,
+ * scale_alpha_beta(1) will not round down to 0.
+ */
+#define ALPHA_BETA_GRANULARITY 6
+
+#define ALPHA_BETA_SCALING (ALPHA_BETA_SHIFT - ALPHA_BETA_GRANULARITY)
+
+/* We express the weights (wc, wl) in %, i.e., wc + wl = 100 */
+#define MAX_WC 100
+
+struct dualpi2_sched_data {
+	struct Qdisc *l_queue;	/* The L4S LL queue */
+	struct Qdisc *sch;	/* The classic queue (owner of this struct) */
+
+	/* Registered tc filters */
+	struct {
+		struct tcf_proto __rcu *filters;
+		struct tcf_block *block;
+	} tcf;
+
+	struct { /* PI2 parameters */
+		u64	target;	/* Target delay in nanoseconds */
+		u32	tupdate;/* Timer frequency in nanoseconds */
+		u32	prob;	/* Base PI probability */
+		u32	alpha;	/* Gain factor for the integral rate response */
+		u32	beta;	/* Gain factor for the proportional response */
+		struct hrtimer timer; /* prob update timer */
+	} pi2;
+
+	struct { /* Step AQM (L4S queue only) parameters */
+		u32 thresh;	/* Step threshold */
+		bool in_packets;/* Whether the step is in packets or time */
+	} step;
+
+	struct { /* Classic queue starvation protection */
+		s32	credit; /* Credit (sign indicates which queue) */
+		s32	init;	/* Reset value of the credit */
+		u8	wc;	/* C queue weight (between 0 and MAX_WC) */
+		u8	wl;	/* L queue weight (MAX_WC - wc) */
+	} c_protection;
+
+	/* General dualQ parameters */
+	u8	coupling_factor;/* Coupling factor (k) between both queues */
+	u8	ecn_mask;	/* Mask to match L4S packets */
+	bool	drop_early;	/* Drop at enqueue instead of dequeue if true */
+	bool	drop_overload;	/* Drop (1) on overload, or overflow (0) */
+	bool	split_gso;	/* Split aggregated skb (1) or leave as is */
+
+	/* Statistics */
+	u64	c_head_ts;	/* Enqueue timestamp of the classic Q's head */
+	u64	l_head_ts;	/* Enqueue timestamp of the L Q's head */
+	u64	last_qdelay;	/* Q delay val at the last probability update */
+	u32	packets_in_c;	/* Number of packets enqueued in C queue */
+	u32	packets_in_l;	/* Number of packets enqueued in L queue */
+	u32	maxq;		/* maximum queue size */
+	u32	ecn_mark;	/* packets marked with ECN */
+	u32	step_marks;	/* ECN marks due to the step AQM */
+
+	struct { /* Deferred drop statistics */
+		u32 cnt;	/* Packets dropped */
+		u32 len;	/* Bytes dropped */
+	} deferred_drops;
+};
+
+struct dualpi2_skb_cb {
+	u64 ts;			/* Timestamp at enqueue */
+	u8 apply_step:1,	/* Can we apply the step threshold */
+	   classified:2,	/* Packet classification results */
+	   ect:2;		/* Packet ECT codepoint */
+};
+
+enum dualpi2_classification_results {
+	DUALPI2_C_CLASSIC	= 0,	/* C queue */
+	DUALPI2_C_L4S		= 1,	/* L queue (scale mark/classic drop) */
+	DUALPI2_C_LLLL		= 2,	/* L queue (no drops/marks) */
+	__DUALPI2_C_MAX			/* Keep last*/
+};
+
+static struct dualpi2_skb_cb *dualpi2_skb_cb(struct sk_buff *skb)
+{
+	qdisc_cb_private_validate(skb, sizeof(struct dualpi2_skb_cb));
+	return (struct dualpi2_skb_cb *)qdisc_skb_cb(skb)->data;
+}
+
+static u64 dualpi2_sojourn_time(struct sk_buff *skb, u64 reference)
+{
+	return reference - dualpi2_skb_cb(skb)->ts;
+}
+
+static u64 head_enqueue_time(struct Qdisc *q)
+{
+	struct sk_buff *skb = qdisc_peek_head(q);
+
+	return skb ? dualpi2_skb_cb(skb)->ts : 0;
+}
+
+static u32 dualpi2_scale_alpha_beta(u32 param)
+{
+	u64 tmp = ((u64)param * MAX_PROB >> ALPHA_BETA_SCALING);
+
+	do_div(tmp, NSEC_PER_SEC);
+	return tmp;
+}
+
+static u32 dualpi2_unscale_alpha_beta(u32 param)
+{
+	u64 tmp = ((u64)param * NSEC_PER_SEC << ALPHA_BETA_SCALING);
+
+	do_div(tmp, MAX_PROB);
+	return tmp;
+}
+
+static ktime_t next_pi2_timeout(struct dualpi2_sched_data *q)
+{
+	return ktime_add_ns(ktime_get_ns(), q->pi2.tupdate);
+}
+
+static bool skb_is_l4s(struct sk_buff *skb)
+{
+	return dualpi2_skb_cb(skb)->classified == DUALPI2_C_L4S;
+}
+
+static bool skb_in_l_queue(struct sk_buff *skb)
+{
+	return dualpi2_skb_cb(skb)->classified != DUALPI2_C_CLASSIC;
+}
+
+static bool dualpi2_mark(struct dualpi2_sched_data *q, struct sk_buff *skb)
+{
+	if (INET_ECN_set_ce(skb)) {
+		q->ecn_mark++;
+		return true;
+	}
+	return false;
+}
+
+static void dualpi2_reset_c_protection(struct dualpi2_sched_data *q)
+{
+	q->c_protection.credit = q->c_protection.init;
+}
+
+/* This computes the initial credit value and WRR weight for the L queue (wl)
+ * from the weight of the C queue (wc).
+ * If wl > wc, the scheduler will start with the L queue when reset.
+ */
+static void dualpi2_calculate_c_protection(struct Qdisc *sch,
+					   struct dualpi2_sched_data *q, u32 wc)
+{
+	q->c_protection.wc = wc;
+	q->c_protection.wl = MAX_WC - wc;
+	q->c_protection.init = (s32)psched_mtu(qdisc_dev(sch)) *
+		((int)q->c_protection.wc - (int)q->c_protection.wl);
+	dualpi2_reset_c_protection(q);
+}
+
+static bool dualpi2_roll(u32 prob)
+{
+	return get_random_u32() <= prob;
+}
+
+/* Packets in the C queue are subject to a marking probability pC, which is the
+ * square of the internal PI2 probability (i.e., have an overall lower mark/drop
+ * probability). If the qdisc is overloaded, ignore ECT values and only drop.
+ *
+ * Note that this marking scheme is also applied to L4S packets during overload.
+ * Return true if packet dropping is required in C queue
+ */
+static bool dualpi2_classic_marking(struct dualpi2_sched_data *q,
+				    struct sk_buff *skb, u32 prob,
+				    bool overload)
+{
+	if (dualpi2_roll(prob) && dualpi2_roll(prob)) {
+		if (overload || dualpi2_skb_cb(skb)->ect == INET_ECN_NOT_ECT)
+			return true;
+		dualpi2_mark(q, skb);
+	}
+	return false;
+}
+
+/* Packets in the L queue are subject to a marking probability pL given by the
+ * internal PI2 probability scaled by the coupling factor.
+ *
+ * On overload (i.e., @local_l_prob is >= 100%):
+ * - if the qdisc is configured to trade losses to preserve latency (i.e.,
+ *   @q->drop_overload), apply classic drops first before marking.
+ * - otherwise, preserve the "no loss" property of ECN at the cost of queueing
+ *   delay, eventually resulting in taildrop behavior once sch->limit is
+ *   reached.
+ * Return true if packet dropping is required in L queue
+ */
+static bool dualpi2_scalable_marking(struct dualpi2_sched_data *q,
+				     struct sk_buff *skb,
+				     u64 local_l_prob, u32 prob,
+				     bool overload)
+{
+	if (overload) {
+		/* Apply classic drop */
+		if (!q->drop_overload ||
+		    !(dualpi2_roll(prob) && dualpi2_roll(prob)))
+			goto mark;
+		return true;
+	}
+
+	/* We can safely cut the upper 32b as overload==false */
+	if (dualpi2_roll(local_l_prob)) {
+		/* Non-ECT packets could have been classified as L4S by filters. */
+		if (dualpi2_skb_cb(skb)->ect == INET_ECN_NOT_ECT)
+			return true;
+mark:
+		dualpi2_mark(q, skb);
+	}
+	return false;
+}
+
+/* Decide whether a given packet must be dropped (or marked if ECT), according
+ * to the PI2 probability.
+ *
+ * Never mark/drop if we have a standing queue of less than 2 MTUs.
+ */
+static bool must_drop(struct Qdisc *sch, struct dualpi2_sched_data *q,
+		      struct sk_buff *skb)
+{
+	u64 local_l_prob;
+	u32 prob;
+	bool overload;
+
+	if (sch->qstats.backlog < 2 * psched_mtu(qdisc_dev(sch)))
+		return false;
+
+	prob = READ_ONCE(q->pi2.prob);
+	local_l_prob = (u64)prob * q->coupling_factor;
+	overload = local_l_prob > MAX_PROB;
+
+	switch (dualpi2_skb_cb(skb)->classified) {
+	case DUALPI2_C_CLASSIC:
+		return dualpi2_classic_marking(q, skb, prob, overload);
+	case DUALPI2_C_L4S:
+		return dualpi2_scalable_marking(q, skb, local_l_prob, prob,
+						overload);
+	default: /* DUALPI2_C_LLLL */
+		return false;
+	}
+}
+
+static void dualpi2_read_ect(struct sk_buff *skb)
+{
+	struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
+	int wlen = skb_network_offset(skb);
+
+	switch (skb_protocol(skb, true)) {
+	case htons(ETH_P_IP):
+		wlen += sizeof(struct iphdr);
+		if (!pskb_may_pull(skb, wlen) ||
+		    skb_try_make_writable(skb, wlen))
+			goto not_ecn;
+
+		cb->ect = ipv4_get_dsfield(ip_hdr(skb)) & INET_ECN_MASK;
+		break;
+	case htons(ETH_P_IPV6):
+		wlen += sizeof(struct ipv6hdr);
+		if (!pskb_may_pull(skb, wlen) ||
+		    skb_try_make_writable(skb, wlen))
+			goto not_ecn;
+
+		cb->ect = ipv6_get_dsfield(ipv6_hdr(skb)) & INET_ECN_MASK;
+		break;
+	default:
+		goto not_ecn;
+	}
+	return;
+
+not_ecn:
+	/* Non pullable/writable packets can only be dropped hence are
+	 * classified as not ECT.
+	 */
+	cb->ect = INET_ECN_NOT_ECT;
+}
+
+static int dualpi2_skb_classify(struct dualpi2_sched_data *q,
+				struct sk_buff *skb)
+{
+	struct dualpi2_skb_cb *cb = dualpi2_skb_cb(skb);
+	struct tcf_result res;
+	struct tcf_proto *fl;
+	int result;
+
+	dualpi2_read_ect(skb);
+	if (cb->ect & q->ecn_mask) {
+		cb->classified = DUALPI2_C_L4S;
+		return NET_XMIT_SUCCESS;
+	}
+
+	if (TC_H_MAJ(skb->priority) == q->sch->handle &&
+	    TC_H_MIN(skb->priority) < __DUALPI2_C_MAX) {
+		cb->classified = TC_H_MIN(skb->priority);
+		return NET_XMIT_SUCCESS;
+	}
+
+	fl = rcu_dereference_bh(q->tcf.filters);
+	if (!fl) {
+		cb->classified = DUALPI2_C_CLASSIC;
+		return NET_XMIT_SUCCESS;
+	}
+
+	result = tcf_classify(skb, NULL, fl, &res, false);
+	if (result >= 0) {
+#ifdef CONFIG_NET_CLS_ACT
+		switch (result) {
+		case TC_ACT_STOLEN:
+		case TC_ACT_QUEUED:
+		case TC_ACT_TRAP:
+			return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
+		case TC_ACT_SHOT:
+			return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+		}
+#endif
+		cb->classified = TC_H_MIN(res.classid) < __DUALPI2_C_MAX ?
+			TC_H_MIN(res.classid) : DUALPI2_C_CLASSIC;
+	}
+	return NET_XMIT_SUCCESS;
+}
+
+static int dualpi2_enqueue_skb(struct sk_buff *skb, struct Qdisc *sch,
+			       struct sk_buff **to_free)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	struct dualpi2_skb_cb *cb;
+
+	if (unlikely(qdisc_qlen(sch) >= sch->limit)) {
+		qdisc_qstats_overlimit(sch);
+		if (skb_in_l_queue(skb))
+			qdisc_qstats_overlimit(q->l_queue);
+		return qdisc_drop(skb, sch, to_free);
+	}
+
+	if (q->drop_early && must_drop(sch, q, skb)) {
+		qdisc_drop(skb, sch, to_free);
+		return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+	}
+
+	cb = dualpi2_skb_cb(skb);
+	cb->ts = ktime_get_ns();
+
+	if (qdisc_qlen(sch) > q->maxq)
+		q->maxq = qdisc_qlen(sch);
+
+	if (skb_in_l_queue(skb)) {
+		/* Only apply the step if a queue is building up */
+		dualpi2_skb_cb(skb)->apply_step =
+			skb_is_l4s(skb) && qdisc_qlen(q->l_queue) > 1;
+		/* Keep the overall qdisc stats consistent */
+		++sch->q.qlen;
+		qdisc_qstats_backlog_inc(sch, skb);
+		++q->packets_in_l;
+		if (!q->l_head_ts)
+			q->l_head_ts = cb->ts;
+		return qdisc_enqueue_tail(skb, q->l_queue);
+	}
+	++q->packets_in_c;
+	if (!q->c_head_ts)
+		q->c_head_ts = cb->ts;
+	return qdisc_enqueue_tail(skb, sch);
+}
+
+/* Optionally, dualpi2 will split GSO skbs into independent skbs and enqueue
+ * each of those individually. This yields the following benefits, at the
+ * expense of CPU usage:
+ * - Finer-grained AQM actions as the sub-packets of a burst no longer share the
+ *   same fate (e.g., the random mark/drop probability is applied individually)
+ * - Improved precision of the starvation protection/WRR scheduler at dequeue,
+ *   as the size of the dequeued packets will be smaller.
+ */
+static int dualpi2_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+				 struct sk_buff **to_free)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	int err;
+
+	err = dualpi2_skb_classify(q, skb);
+	if (err != NET_XMIT_SUCCESS) {
+		if (err & __NET_XMIT_BYPASS)
+			qdisc_qstats_drop(sch);
+		__qdisc_drop(skb, to_free);
+		return err;
+	}
+
+	if (q->split_gso && skb_is_gso(skb)) {
+		netdev_features_t features;
+		struct sk_buff *nskb, *next;
+		int cnt, byte_len, orig_len;
+		int err;
+
+		features = netif_skb_features(skb);
+		nskb = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
+		if (IS_ERR_OR_NULL(nskb))
+			return qdisc_drop(skb, sch, to_free);
+
+		cnt = 1;
+		byte_len = 0;
+		orig_len = qdisc_pkt_len(skb);
+		while (nskb) {
+			next = nskb->next;
+			skb_mark_not_on_list(nskb);
+			qdisc_skb_cb(nskb)->pkt_len = nskb->len;
+			dualpi2_skb_cb(nskb)->classified =
+				dualpi2_skb_cb(skb)->classified;
+			dualpi2_skb_cb(nskb)->ect = dualpi2_skb_cb(skb)->ect;
+			err = dualpi2_enqueue_skb(nskb, sch, to_free);
+			if (err == NET_XMIT_SUCCESS) {
+				/* Compute the backlog adjustement that needs
+				 * to be propagated in the qdisc tree to reflect
+				 * all new skbs successfully enqueued.
+				 */
+				++cnt;
+				byte_len += nskb->len;
+			}
+			nskb = next;
+		}
+		if (err == NET_XMIT_SUCCESS) {
+			/* The caller will add the original skb stats to its
+			 * backlog, compensate this.
+			 */
+			--cnt;
+			byte_len -= orig_len;
+		}
+		qdisc_tree_reduce_backlog(sch, -cnt, -byte_len);
+		consume_skb(skb);
+		return err;
+	}
+	return dualpi2_enqueue_skb(skb, sch, to_free);
+}
+
+/* Select the queue from which the next packet can be dequeued, ensuring that
+ * neither queue can starve the other with a WRR scheduler.
+ *
+ * The sign of the WRR credit determines the next queue, while the size of
+ * the dequeued packet determines the magnitude of the WRR credit change. If
+ * either queue is empty, the WRR credit is kept unchanged.
+ *
+ * As the dequeued packet can be dropped later, the caller has to perform the
+ * qdisc_bstats_update() calls.
+ */
+static struct sk_buff *dequeue_packet(struct Qdisc *sch,
+				      struct dualpi2_sched_data *q,
+				      int *credit_change,
+				      u64 now)
+{
+	struct sk_buff *skb = NULL;
+	int c_len;
+
+	*credit_change = 0;
+	c_len = qdisc_qlen(sch) - qdisc_qlen(q->l_queue);
+	if (qdisc_qlen(q->l_queue) && (!c_len || q->c_protection.credit <= 0)) {
+		skb = __qdisc_dequeue_head(&q->l_queue->q);
+		WRITE_ONCE(q->l_head_ts, head_enqueue_time(q->l_queue));
+		if (c_len)
+			*credit_change = q->c_protection.wc;
+		qdisc_qstats_backlog_dec(q->l_queue, skb);
+		/* Keep the global queue size consistent */
+		--sch->q.qlen;
+	} else if (c_len) {
+		skb = __qdisc_dequeue_head(&sch->q);
+		WRITE_ONCE(q->c_head_ts, head_enqueue_time(sch));
+		if (qdisc_qlen(q->l_queue))
+			*credit_change = ~((s32)q->c_protection.wl) + 1;
+	} else {
+		dualpi2_reset_c_protection(q);
+		return NULL;
+	}
+	*credit_change *= qdisc_pkt_len(skb);
+	qdisc_qstats_backlog_dec(sch, skb);
+	return skb;
+}
+
+static int do_step_aqm(struct dualpi2_sched_data *q, struct sk_buff *skb,
+		       u64 now)
+{
+	u64 qdelay = 0;
+
+	if (q->step.in_packets)
+		qdelay = qdisc_qlen(q->l_queue);
+	else
+		qdelay = dualpi2_sojourn_time(skb, now);
+
+	if (dualpi2_skb_cb(skb)->apply_step && qdelay > q->step.thresh) {
+		if (!dualpi2_skb_cb(skb)->ect)
+			/* Drop this non-ECT packet */
+			return 1;
+		if (dualpi2_mark(q, skb))
+			++q->step_marks;
+	}
+	qdisc_bstats_update(q->l_queue, skb);
+	return 0;
+}
+
+static void drop_and_retry(struct dualpi2_sched_data *q, struct sk_buff *skb,
+			   struct Qdisc *sch)
+{
+	++q->deferred_drops.cnt;
+	q->deferred_drops.len += qdisc_pkt_len(skb);
+	consume_skb(skb);
+	qdisc_qstats_drop(sch);
+}
+
+static struct sk_buff *dualpi2_qdisc_dequeue(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+	int credit_change;
+	u64 now;
+
+	now = ktime_get_ns();
+
+	while ((skb = dequeue_packet(sch, q, &credit_change, now))) {
+		if (!q->drop_early && must_drop(sch, q, skb)) {
+			drop_and_retry(q, skb, sch);
+			continue;
+		}
+
+		if (skb_in_l_queue(skb) && do_step_aqm(q, skb, now)) {
+			qdisc_qstats_drop(q->l_queue);
+			drop_and_retry(q, skb, sch);
+			continue;
+		}
+
+		q->c_protection.credit += credit_change;
+		qdisc_bstats_update(sch, skb);
+		break;
+	}
+
+	/* We cannot call qdisc_tree_reduce_backlog() if our qlen is 0,
+	 * or HTB crashes.
+	 */
+	if (q->deferred_drops.cnt && qdisc_qlen(sch)) {
+		qdisc_tree_reduce_backlog(sch, q->deferred_drops.cnt,
+					  q->deferred_drops.len);
+		q->deferred_drops.cnt = 0;
+		q->deferred_drops.len = 0;
+	}
+	return skb;
+}
+
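+/* Scale the PI2 delta back down by 2^ALPHA_BETA_GRANULARITY, the fixed-point
+ * granularity used when scaling alpha and beta.
+ */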
+static s64 __scale_delta(u64 diff)
+{
+	do_div(diff, 1 << ALPHA_BETA_GRANULARITY);
+	return diff;
+}
+
+static void get_queue_delays(struct dualpi2_sched_data *q, u64 *qdelay_c,
+			     u64 *qdelay_l)
+{
+	u64 now, qc, ql;
+
+	now = ktime_get_ns();
+	qc = READ_ONCE(q->c_head_ts);
+	ql = READ_ONCE(q->l_head_ts);
+
+	*qdelay_c = qc ? now - qc : 0;
+	*qdelay_l = ql ? now - ql : 0;
+}
+
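+/* PI2 core: p += alpha * (qdelay - target) + beta * (qdelay - qdelay_old),
+ * computed in fixed point, driven by the larger of the two queue delays and
+ * saturated to [0, MAX_PROB] on under/overflow.
+ */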
+static u32 calculate_probability(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	u32 new_prob;
+	u64 qdelay_c;
+	u64 qdelay_l;
+	u64 qdelay;
+	s64 delta;
+
+	get_queue_delays(q, &qdelay_c, &qdelay_l);
+	qdelay = max(qdelay_l, qdelay_c);
+	/* Alpha and beta take at most 32 bits, i.e., the delay difference
+	 * would overflow for queuing delay differences > ~4.2sec.
+	 */
+	delta = ((s64)qdelay - q->pi2.target) * q->pi2.alpha;
+	delta += ((s64)qdelay - q->last_qdelay) * q->pi2.beta;
+	if (delta > 0) {
+		new_prob = __scale_delta(delta) + q->pi2.prob;
+		if (new_prob < q->pi2.prob)
+			new_prob = MAX_PROB;
+	} else {
+		new_prob = q->pi2.prob - __scale_delta(~delta + 1);
+		if (new_prob > q->pi2.prob)
+			new_prob = 0;
+	}
+	q->last_qdelay = qdelay;
+	/* If we do not drop on overload, cap the coupled L4S probability at
+	 * 100% to keep window fairness when overloaded.
+	 */
+	if (!q->drop_overload)
+		return min_t(u32, new_prob, MAX_PROB / q->coupling_factor);
+	return new_prob;
+}
+
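+/* Periodic timer callback: recompute the base drop/mark probability and
+ * re-arm the timer for the next update interval.
+ */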
+static enum hrtimer_restart dualpi2_timer(struct hrtimer *timer)
+{
+	struct dualpi2_sched_data *q = from_timer(q, timer, pi2.timer);
+
+	WRITE_ONCE(q->pi2.prob, calculate_probability(q->sch));
+
+	hrtimer_set_expires(&q->pi2.timer, next_pi2_timeout(q));
+	return HRTIMER_RESTART;
+}
+
+static const struct nla_policy dualpi2_policy[TCA_DUALPI2_MAX + 1] = {
+	[TCA_DUALPI2_LIMIT] = {.type = NLA_U32},
+	[TCA_DUALPI2_TARGET] = {.type = NLA_U32},
+	[TCA_DUALPI2_TUPDATE] = {.type = NLA_U32},
+	[TCA_DUALPI2_ALPHA] = {.type = NLA_U32},
+	[TCA_DUALPI2_BETA] = {.type = NLA_U32},
+	[TCA_DUALPI2_STEP_THRESH] = {.type = NLA_U32},
+	[TCA_DUALPI2_STEP_PACKETS] = {.type = NLA_U8},
+	[TCA_DUALPI2_COUPLING] = {.type = NLA_U8},
+	[TCA_DUALPI2_DROP_OVERLOAD] = {.type = NLA_U8},
+	[TCA_DUALPI2_DROP_EARLY] = {.type = NLA_U8},
+	[TCA_DUALPI2_C_PROTECTION] = {.type = NLA_U8},
+	[TCA_DUALPI2_ECN_MASK] = {.type = NLA_U8},
+	[TCA_DUALPI2_SPLIT_GSO] = {.type = NLA_U8},
+};
+
+static int dualpi2_change(struct Qdisc *sch, struct nlattr *opt,
+			  struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[TCA_DUALPI2_MAX + 1];
+	struct dualpi2_sched_data *q;
+	int old_backlog;
+	int old_qlen;
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+	err = nla_parse_nested(tb, TCA_DUALPI2_MAX, opt, dualpi2_policy,
+			       extack);
+	if (err < 0)
+		return err;
+
+	q = qdisc_priv(sch);
+	sch_tree_lock(sch);
+
+	if (tb[TCA_DUALPI2_LIMIT]) {
+		u32 limit = nla_get_u32(tb[TCA_DUALPI2_LIMIT]);
+
+		if (!limit) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_LIMIT],
+					    "limit must be greater than 0.");
+			sch_tree_unlock(sch);
+			return -EINVAL;
+		}
+		sch->limit = limit;
+	}
+
+	if (tb[TCA_DUALPI2_TARGET])
+		q->pi2.target = (u64)nla_get_u32(tb[TCA_DUALPI2_TARGET]) *
+			NSEC_PER_USEC;
+
+	if (tb[TCA_DUALPI2_TUPDATE]) {
+		u64 tupdate = nla_get_u32(tb[TCA_DUALPI2_TUPDATE]);
+
+		if (!tupdate) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_TUPDATE],
+					    "tupdate cannot be 0us.");
+			sch_tree_unlock(sch);
+			return -EINVAL;
+		}
+		q->pi2.tupdate = tupdate * NSEC_PER_USEC;
+	}
+
+	if (tb[TCA_DUALPI2_ALPHA]) {
+		u32 alpha = nla_get_u32(tb[TCA_DUALPI2_ALPHA]);
+
+		if (alpha > ALPHA_BETA_MAX) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_ALPHA],
+					    "alpha is too large.");
+			sch_tree_unlock(sch);
+			return -EINVAL;
+		}
+		q->pi2.alpha = dualpi2_scale_alpha_beta(alpha);
+	}
+
+	if (tb[TCA_DUALPI2_BETA]) {
+		u32 beta = nla_get_u32(tb[TCA_DUALPI2_BETA]);
+
+		if (beta > ALPHA_BETA_MAX) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_BETA],
+					    "beta is too large.");
+			sch_tree_unlock(sch);
+			return -EINVAL;
+		}
+		q->pi2.beta = dualpi2_scale_alpha_beta(beta);
+	}
+
+	if (tb[TCA_DUALPI2_STEP_THRESH])
+		q->step.thresh = nla_get_u32(tb[TCA_DUALPI2_STEP_THRESH]) *
+			NSEC_PER_USEC;
+
+	if (tb[TCA_DUALPI2_COUPLING]) {
+		u8 coupling = nla_get_u8(tb[TCA_DUALPI2_COUPLING]);
+
+		if (!coupling) {
+			NL_SET_ERR_MSG_ATTR(extack, tb[TCA_DUALPI2_COUPLING],
+					    "Must use a non-zero coupling.");
+			sch_tree_unlock(sch);
+			return -EINVAL;
+		}
+		q->coupling_factor = coupling;
+	}
+
+	if (tb[TCA_DUALPI2_STEP_PACKETS])
+		q->step.in_packets = !!nla_get_u8(tb[TCA_DUALPI2_STEP_PACKETS]);
+
+	if (tb[TCA_DUALPI2_DROP_OVERLOAD])
+		q->drop_overload = !!nla_get_u8(tb[TCA_DUALPI2_DROP_OVERLOAD]);
+
+	if (tb[TCA_DUALPI2_DROP_EARLY])
+		q->drop_early = !!nla_get_u8(tb[TCA_DUALPI2_DROP_EARLY]);
+
+	if (tb[TCA_DUALPI2_C_PROTECTION]) {
+		u8 wc = nla_get_u8(tb[TCA_DUALPI2_C_PROTECTION]);
+
+		if (wc > MAX_WC) {
+			NL_SET_ERR_MSG_ATTR(extack,
+					    tb[TCA_DUALPI2_C_PROTECTION],
+					    "c_protection must be <= 100.");
+			sch_tree_unlock(sch);
+			return -EINVAL;
+		}
+		dualpi2_calculate_c_protection(sch, q, wc);
+	}
+
+	if (tb[TCA_DUALPI2_ECN_MASK])
+		q->ecn_mask = nla_get_u8(tb[TCA_DUALPI2_ECN_MASK]);
+
+	if (tb[TCA_DUALPI2_SPLIT_GSO])
+		q->split_gso = !!nla_get_u8(tb[TCA_DUALPI2_SPLIT_GSO]);
+
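+	/* If the limit was lowered below the current occupancy, trim the
+	 * queue and report the resulting backlog change to the ancestors.
+	 */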
+	old_qlen = qdisc_qlen(sch);
+	old_backlog = sch->qstats.backlog;
+	while (qdisc_qlen(sch) > sch->limit) {
+		struct sk_buff *skb = __qdisc_dequeue_head(&sch->q);
+
+		qdisc_qstats_backlog_dec(sch, skb);
+		rtnl_qdisc_drop(skb, sch);
+	}
+	qdisc_tree_reduce_backlog(sch, old_qlen - qdisc_qlen(sch),
+				  old_backlog - sch->qstats.backlog);
+
+	sch_tree_unlock(sch);
+	return 0;
+}
+
+/* Default alpha/beta values give a 10dB stability margin with max_rtt=100ms. */
+static void dualpi2_reset_default(struct dualpi2_sched_data *q)
+{
+	q->sch->limit = 10000;				/* Max 125ms at 1Gbps */
+
+	q->pi2.target = 15 * NSEC_PER_MSEC;
+	q->pi2.tupdate = 16 * NSEC_PER_MSEC;
+	q->pi2.alpha = dualpi2_scale_alpha_beta(41);	/* ~0.16 Hz * 256 */
+	q->pi2.beta = dualpi2_scale_alpha_beta(819);	/* ~3.20 Hz * 256 */
+
+	q->step.thresh = 1 * NSEC_PER_MSEC;
+	q->step.in_packets = false;
+
+	dualpi2_calculate_c_protection(q->sch, q, 10);	/* wc=10%, wl=90% */
+
+	q->ecn_mask = INET_ECN_ECT_1;
+	q->coupling_factor = 2;		/* window fairness for equal RTTs */
+	q->drop_overload = true;	/* Preserve latency by dropping */
+	q->drop_early = false;		/* PI2 drops on dequeue */
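+	/* Segment GSO aggregates at enqueue */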
+	q->split_gso = true;
+}
+
+static int dualpi2_init(struct Qdisc *sch, struct nlattr *opt,
+			struct netlink_ext_ack *extack)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	int err;
+
+	q->l_queue = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+				       TC_H_MAKE(sch->handle, 1), extack);
+	if (!q->l_queue)
+		return -ENOMEM;
+
+	err = tcf_block_get(&q->tcf.block, &q->tcf.filters, sch, extack);
+	if (err)
+		return err;
+
+	q->sch = sch;
+	dualpi2_reset_default(q);
+	hrtimer_init(&q->pi2.timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+	q->pi2.timer.function = dualpi2_timer;
+
+	if (opt) {
+		err = dualpi2_change(sch, opt, extack);
+
+		if (err)
+			return err;
+	}
+
+	hrtimer_start(&q->pi2.timer, next_pi2_timeout(q),
+		      HRTIMER_MODE_ABS_PINNED);
+	return 0;
+}
+
+static u32 convert_ns_to_usec(u64 ns)
+{
+	do_div(ns, NSEC_PER_USEC);
+	return ns;
+}
+
+static int dualpi2_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	struct nlattr *opts;
+
+	opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
+	if (!opts)
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_DUALPI2_LIMIT, sch->limit) ||
+	    nla_put_u32(skb, TCA_DUALPI2_TARGET,
+			convert_ns_to_usec(q->pi2.target)) ||
+	    nla_put_u32(skb, TCA_DUALPI2_TUPDATE,
+			convert_ns_to_usec(q->pi2.tupdate)) ||
+	    nla_put_u32(skb, TCA_DUALPI2_ALPHA,
+			dualpi2_unscale_alpha_beta(q->pi2.alpha)) ||
+	    nla_put_u32(skb, TCA_DUALPI2_BETA,
+			dualpi2_unscale_alpha_beta(q->pi2.beta)) ||
+	    nla_put_u32(skb, TCA_DUALPI2_STEP_THRESH, q->step.in_packets ?
+			q->step.thresh : convert_ns_to_usec(q->step.thresh)) ||
+	    nla_put_u8(skb, TCA_DUALPI2_COUPLING, q->coupling_factor) ||
+	    nla_put_u8(skb, TCA_DUALPI2_DROP_OVERLOAD, q->drop_overload) ||
+	    nla_put_u8(skb, TCA_DUALPI2_STEP_PACKETS, q->step.in_packets) ||
+	    nla_put_u8(skb, TCA_DUALPI2_DROP_EARLY, q->drop_early) ||
+	    nla_put_u8(skb, TCA_DUALPI2_C_PROTECTION, q->c_protection.wc) ||
+	    nla_put_u8(skb, TCA_DUALPI2_ECN_MASK, q->ecn_mask) ||
+	    nla_put_u8(skb, TCA_DUALPI2_SPLIT_GSO, q->split_gso))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	nla_nest_cancel(skb, opts);
+	return -1;
+}
+
+static int dualpi2_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+	struct tc_dualpi2_xstats st = {
+		.prob		= READ_ONCE(q->pi2.prob),
+		.packets_in_c	= q->packets_in_c,
+		.packets_in_l	= q->packets_in_l,
+		.maxq		= q->maxq,
+		.ecn_mark	= q->ecn_mark,
+		.credit		= q->c_protection.credit,
+		.step_marks	= q->step_marks,
+	};
+	u64 qc, ql;
+
+	get_queue_delays(q, &qc, &ql);
+	st.delay_l = convert_ns_to_usec(ql);
+	st.delay_c = convert_ns_to_usec(qc);
+	return gnet_stats_copy_app(d, &st, sizeof(st));
+}
+
+static void dualpi2_reset(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+	qdisc_reset_queue(sch);
+	qdisc_reset_queue(q->l_queue);
+	q->c_head_ts = 0;
+	q->l_head_ts = 0;
+	q->pi2.prob = 0;
+	q->packets_in_c = 0;
+	q->packets_in_l = 0;
+	q->maxq = 0;
+	q->ecn_mark = 0;
+	q->step_marks = 0;
+	dualpi2_reset_c_protection(q);
+}
+
+static void dualpi2_destroy(struct Qdisc *sch)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+	q->pi2.tupdate = 0;
+	hrtimer_cancel(&q->pi2.timer);
+	if (q->l_queue)
+		qdisc_put(q->l_queue);
+	tcf_block_put(q->tcf.block);
+}
+
+static struct Qdisc *dualpi2_leaf(struct Qdisc *sch, unsigned long arg)
+{
+	return NULL;
+}
+
+static unsigned long dualpi2_find(struct Qdisc *sch, u32 classid)
+{
+	return 0;
+}
+
+static unsigned long dualpi2_bind(struct Qdisc *sch, unsigned long parent,
+				  u32 classid)
+{
+	return 0;
+}
+
+static void dualpi2_unbind(struct Qdisc *q, unsigned long cl)
+{
+}
+
+static struct tcf_block *dualpi2_tcf_block(struct Qdisc *sch, unsigned long cl,
+					   struct netlink_ext_ack *extack)
+{
+	struct dualpi2_sched_data *q = qdisc_priv(sch);
+
+	if (cl)
+		return NULL;
+	return q->tcf.block;
+}
+
+static void dualpi2_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	unsigned int i;
+
+	if (arg->stop)
+		return;
+
+	/* We statically define only 2 queues */
+	for (i = 0; i < 2; i++) {
+		if (arg->count < arg->skip) {
+			arg->count++;
+			continue;
+		}
+		if (arg->fn(sch, i + 1, arg) < 0) {
+			arg->stop = 1;
+			break;
+		}
+		arg->count++;
+	}
+}
+
+/* Minimal class support to handle tc filters */
+static const struct Qdisc_class_ops dualpi2_class_ops = {
+	.leaf		= dualpi2_leaf,
+	.find		= dualpi2_find,
+	.tcf_block	= dualpi2_tcf_block,
+	.bind_tcf	= dualpi2_bind,
+	.unbind_tcf	= dualpi2_unbind,
+	.walk		= dualpi2_walk,
+};
+
+static struct Qdisc_ops dualpi2_qdisc_ops __read_mostly = {
+	.id		= "dualpi2",
+	.cl_ops		= &dualpi2_class_ops,
+	.priv_size	= sizeof(struct dualpi2_sched_data),
+	.enqueue	= dualpi2_qdisc_enqueue,
+	.dequeue	= dualpi2_qdisc_dequeue,
+	.peek		= qdisc_peek_dequeued,
+	.init		= dualpi2_init,
+	.destroy	= dualpi2_destroy,
+	.reset		= dualpi2_reset,
+	.change		= dualpi2_change,
+	.dump		= dualpi2_dump,
+	.dump_stats	= dualpi2_dump_stats,
+	.owner		= THIS_MODULE,
+};
+
+static int __init dualpi2_module_init(void)
+{
+	return register_qdisc(&dualpi2_qdisc_ops);
+}
+
+static void __exit dualpi2_module_exit(void)
+{
+	unregister_qdisc(&dualpi2_qdisc_ops);
+}
+
+module_init(dualpi2_module_init);
+module_exit(dualpi2_module_exit);
+
+MODULE_DESCRIPTION("Dual Queue with Proportional Integral controller Improved with a Square (dualpi2) scheduler");
+MODULE_AUTHOR("Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>");
+MODULE_AUTHOR("Olga Albisser <olga@albisser.org>");
+MODULE_AUTHOR("Henrik Steen <henrist@henrist.net>");
+MODULE_AUTHOR("Olivier Tilmans <olivier.tilmans@nokia.com>");
+MODULE_AUTHOR("Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>");
+
+MODULE_LICENSE("GPL");
+MODULE_VERSION("1.0");