diff mbox series

[RFC] net: introduce HW Rate Limiting Driver API

Message ID 3d1e2d945904a0fb55258559eb7322d7e11066b6.1715199358.git.pabeni@redhat.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series [RFC] net: introduce HW Rate Limiting Driver API | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 4858 this patch: 4858
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 1 maintainers not CCed: edumazet@google.com
netdev/build_clang success Errors and warnings before: 1038 this patch: 1038
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5134 this patch: 5134
netdev/checkpatch warning WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 84 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc fail Errors and warnings before: 22 this patch: 23
netdev/source_inline success Was 0 now: 0

Commit Message

Paolo Abeni May 8, 2024, 8:20 p.m. UTC
This is the first incarnation in a formal (pre-RFC) patch of the
HW TX Rate Limiting Driver API proposal shared here[1].

The goal is to outline the proposed APIs before pushing the actual
implementation.

The network devices gain a new ops struct to directly manipulate the
H/W shapers implemented by the NIC.

The shapers can be attached to a pre-defined set of 'domains' - port,
vf, etc. - and the overall shapers configuration pushed to the H/W is
maintained by the kernel.

Each shaper is identified by an unique integer id based on the domain
and additional domain-specific information - e.g. for the VF domain, the
virtual function number/identifier.

[1] https://lore.kernel.org/netdev/20240405102313.GA310894@kernel.org/

Co-developed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/netdevice.h |  15 +++
 include/net/net_shaper.h  | 206 ++++++++++++++++++++++++++++++++++++++
 net/Kconfig               |   3 +
 3 files changed, 224 insertions(+)
 create mode 100644 include/net/net_shaper.h

Comments

Andrew Lunn May 8, 2024, 9:47 p.m. UTC | #1
Hi Paolo

> + * struct net_shaper_info - represents a shaping node on the NIC H/W
> + * @metric: Specify if the bw limits refers to PPS or BPS
> + * @bw_min: Minimum guaranteed rate for this shaper
> + * @bw_max: Maximum peak bw allowed for this shaper
> + * @burst: Maximum burst for the peek rate of this shaper
> + * @priority: Scheduling priority for this shaper
> + * @weight: Scheduling weight for this shaper
> + */
> +struct net_shaper_info {
> +	enum net_shaper_metric metric;
> +	u64 bw_min;	/* minimum guaranteed bandwidth, according to metric */
> +	u64 bw_max;	/* maximum allowed bandwidth */
> +	u32 burst;	/* maximum burst in bytes for bw_max */
> +	u32 priority;	/* scheduling strict priority */

Above it say priority. Here is strict priority? Is there a difference?

> +	u32 weight;	/* scheduling WRR weight*/
> +};

Are there any special semantics for weight? Looking at the hardware i
have, which has 8 queues for a switch port, i can either set it to
strict priority, meaning queue 7 needs to be empty before it look at
queue 6, and it will only look at queue 5 when 6 is empty etc. Or i
can set weights per queue. How would i expect strict priority?

Shaping itself is not performed on the queues, but the port. So it
seems like i should create 8 net_shaper_info and set the weight in
each, and everything else to 0? And then create a queue group shaper,
put the 8 queue shapers into it, and then configure bw_max for the
group? Everything else is 0, because that is all i can configure.

Does this sound correct?

> + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and
> + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.

Are they also available on plain boring devices which don't have PFs
or VFs?

Would i be correct in assuming my driver should just create these
shapers. There will be some netlink calls to allow user space to
enumerate them, and display the relationships between them? And
netlink calls to set values in a shaper? Will there be a way to say
which fields are actually settable, since i doubt most hardware will
have everything? In my case, only one field appears to be relevant in
each shaper, and maybe we want to give a hint about that to userspace?

	Andrew
Paolo Abeni May 9, 2024, 2:19 p.m. UTC | #2
On Wed, 2024-05-08 at 23:47 +0200, Andrew Lunn wrote:
> > + * struct net_shaper_info - represents a shaping node on the NIC H/W
> > + * @metric: Specify if the bw limits refers to PPS or BPS
> > + * @bw_min: Minimum guaranteed rate for this shaper
> > + * @bw_max: Maximum peak bw allowed for this shaper
> > + * @burst: Maximum burst for the peek rate of this shaper
> > + * @priority: Scheduling priority for this shaper
> > + * @weight: Scheduling weight for this shaper
> > + */
> > +struct net_shaper_info {
> > +	enum net_shaper_metric metric;
> > +	u64 bw_min;	/* minimum guaranteed bandwidth, according to metric */
> > +	u64 bw_max;	/* maximum allowed bandwidth */
> > +	u32 burst;	/* maximum burst in bytes for bw_max */
> > +	u32 priority;	/* scheduling strict priority */
> 
> Above it say priority. Here is strict priority? Is there a difference?

the 'priority' field is the strict priority for scheduling. We will
clarify the first comment.

> 
> > +	u32 weight;	/* scheduling WRR weight*/
> > +};
> 
> Are there any special semantics for weight? Looking at the hardware i
> have, which has 8 queues for a switch port, i can either set it to
> strict priority, meaning queue 7 needs to be empty before it look at
> queue 6, and it will only look at queue 5 when 6 is empty etc. Or i
> can set weights per queue. How would i expect strict priority?

The expected behaviour is that schedulers at the same level with the
same priority will be served according to their weight.

I'm unsure if I read your question correctly. My understanding is that
the your H/W don't allow strict priority and WRR simultaneously. In
such case, the set/create operations should reject calls setting both
non zero weight and priority values.

> Shaping itself is not performed on the queues, but the port. So it
> seems like i should create 8 net_shaper_info and set the weight in
> each, and everything else to 0? And then create a queue group shaper,
> put the 8 queue shapers into it, and then configure bw_max for the
> group? Everything else is 0, because that is all i can configure.
> 
> Does this sound correct?

It sounds correct. You can avoid creating the queue group and set the
rate at the NETDEV scope, or possibly not even set/create such shaper:
the transmission is expected to be limited by the line rate.

> > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and
> > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.
> 
> Are they also available on plain boring devices which don't have PFs
> or VFs?

Yes, a driver is free to implement/support an arbitrary subset of the
available feature. Operations attempting to set an unsupported state
should fail with a relevant extended ack.

> Would i be correct in assuming my driver should just create these
> shapers. There will be some netlink calls to allow user space to
> enumerate them, and display the relationships between them?

Yes, the idea/plan/roadmap is to implement yml-based netlink API to
fully expose the feature to the user-space.

>  And
> netlink calls to set values in a shaper? Will there be a way to say
> which fields are actually settable, since i doubt most hardware will
> have everything? 

This was not planned yet, even because some (most?) H/W can have non-
trivial constraints to expose. e.g. WRR is available at the queue scope
only, and only if there is at most one SP shaper somewhere else. 

> In my case, only one field appears to be relevant in
> each shaper, and maybe we want to give a hint about that to userspace?

Any suggestion on how to expose that in reasonable way more then
welcome!

Thanks,

Paolo
Andrew Lunn May 9, 2024, 3 p.m. UTC | #3
On Thu, May 09, 2024 at 04:19:52PM +0200, Paolo Abeni wrote:
> On Wed, 2024-05-08 at 23:47 +0200, Andrew Lunn wrote:
> > > + * struct net_shaper_info - represents a shaping node on the NIC H/W
> > > + * @metric: Specify if the bw limits refers to PPS or BPS
> > > + * @bw_min: Minimum guaranteed rate for this shaper
> > > + * @bw_max: Maximum peak bw allowed for this shaper
> > > + * @burst: Maximum burst for the peek rate of this shaper
> > > + * @priority: Scheduling priority for this shaper
> > > + * @weight: Scheduling weight for this shaper
> > > + */
> > > +struct net_shaper_info {
> > > +	enum net_shaper_metric metric;
> > > +	u64 bw_min;	/* minimum guaranteed bandwidth, according to metric */
> > > +	u64 bw_max;	/* maximum allowed bandwidth */
> > > +	u32 burst;	/* maximum burst in bytes for bw_max */
> > > +	u32 priority;	/* scheduling strict priority */
> > 
> > Above it say priority. Here is strict priority? Is there a difference?
> 
> the 'priority' field is the strict priority for scheduling. We will
> clarify the first comment.
> 
> > 
> > > +	u32 weight;	/* scheduling WRR weight*/
> > > +};
> > 
> > Are there any special semantics for weight? Looking at the hardware i
> > have, which has 8 queues for a switch port, i can either set it to
> > strict priority, meaning queue 7 needs to be empty before it look at
> > queue 6, and it will only look at queue 5 when 6 is empty etc. Or i
> > can set weights per queue. How would i expect strict priority?
> 
> The expected behaviour is that schedulers at the same level with the
> same priority will be served according to their weight.
> 
> I'm unsure if I read your question correctly. My understanding is that
> the your H/W don't allow strict priority and WRR simultaneously.

Correct. There is one bit to select between the two. If WRR is
enabled, it then becomes possible for some generations of the hardware
to configure the weights, for others the weights are fixed in silicon.

> In
> such case, the set/create operations should reject calls setting both
> non zero weight and priority values.

So in order to set strict priority, i need to set the priority field
to 7, 6, 5, 4, 3, 2, 1, 0, for my 8 queues, and weight to 0. For WRR,
i need to set the priority fields to 0, and the weight values, either
to what is hard coded in the silicon, or valid weights when it is
configurable.

Now the question is, how do i get between these two states? It is not
possible to mix WRR and strict priority. Any kAPI which only modifies
one queue at once will go straight into an invalid state, and the
driver will need to return -EOPNOTSUPP. So it seems like there needs
to be an atomic set N queue configuration at once, so i can cleanly go
from strict priority across 8 queues to WRR across 8 queues. Is that
foreseen?

> It sounds correct. You can avoid creating the queue group and set the
> rate at the NETDEV scope, or possibly not even set/create such shaper:
> the transmission is expected to be limited by the line rate.

Well, the hardware exists, i probably should just create the shapers
to match the hardware. However, if i set the hardware equivalent of
bw_max to 0, it then uses line rate. Is this something we want to
document? You can disable a shaper from shaping by setting the
bandwidth to 0? Or do we want a separate enable/disable bit in the
structure?

> 
> > > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and
> > > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.
> > 
> > Are they also available on plain boring devices which don't have PFs
> > or VFs?
> 
> Yes, a driver is free to implement/support an arbitrary subset of the
> available feature. Operations attempting to set an unsupported state
> should fail with a relevant extended ack.

Great. Please update the comment.

> > In my case, only one field appears to be relevant in
> > each shaper, and maybe we want to give a hint about that to userspace?
> 
> Any suggestion on how to expose that in reasonable way more then
> welcome!

We might first want to gather knowledge on what these shapers actually
look like in different hardwares. There are going to be some which are
very limited fixed functions as in the hardware i have, probably most
SOHO switches are like this. And then we have TOR and smart NICs which
might be able to create and destroy shapers on the fly, and are a lot
less restrictive.

     Andrew
Andrew Lunn May 9, 2024, 3:09 p.m. UTC | #4
> +/**
> + * enum net_shaper_metric - the metric of the shaper
> + * @NET_SHAPER_METRIC_PPS: Shaper operates on a packets per second basis
> + * @NET_SHAPER_METRIC_BPS: Shaper operates on a bits per second basis
> + */
> +enum net_shaper_metric {
> +	NET_SHAPER_METRIC_PPS,
> +	NET_SHAPER_METRIC_BPS
> +};

I think we need a third value. In the hardware i'm talking about, the
8 queues themselves don't do any shaping. All i'm configuring is WRR
vs strict priority order. So ideally i want something to indicate this
shaper does not actually shape.

       Andrew
Paolo Abeni May 9, 2024, 3:43 p.m. UTC | #5
On Thu, 2024-05-09 at 17:00 +0200, Andrew Lunn wrote:
> On Thu, May 09, 2024 at 04:19:52PM +0200, Paolo Abeni wrote:
> > On Wed, 2024-05-08 at 23:47 +0200, Andrew Lunn wrote:
> > > > + * struct net_shaper_info - represents a shaping node on the NIC H/W
> > > > + * @metric: Specify if the bw limits refers to PPS or BPS
> > > > + * @bw_min: Minimum guaranteed rate for this shaper
> > > > + * @bw_max: Maximum peak bw allowed for this shaper
> > > > + * @burst: Maximum burst for the peek rate of this shaper
> > > > + * @priority: Scheduling priority for this shaper
> > > > + * @weight: Scheduling weight for this shaper
> > > > + */
> > > > +struct net_shaper_info {
> > > > +	enum net_shaper_metric metric;
> > > > +	u64 bw_min;	/* minimum guaranteed bandwidth, according to metric */
> > > > +	u64 bw_max;	/* maximum allowed bandwidth */
> > > > +	u32 burst;	/* maximum burst in bytes for bw_max */
> > > > +	u32 priority;	/* scheduling strict priority */
> > > 
> > > Above it say priority. Here is strict priority? Is there a difference?
> > 
> > the 'priority' field is the strict priority for scheduling. We will
> > clarify the first comment.
> > 
> > > 
> > > > +	u32 weight;	/* scheduling WRR weight*/
> > > > +};
> > > 
> > > Are there any special semantics for weight? Looking at the hardware i
> > > have, which has 8 queues for a switch port, i can either set it to
> > > strict priority, meaning queue 7 needs to be empty before it look at
> > > queue 6, and it will only look at queue 5 when 6 is empty etc. Or i
> > > can set weights per queue. How would i expect strict priority?
> > 
> > The expected behaviour is that schedulers at the same level with the
> > same priority will be served according to their weight.
> > 
> > I'm unsure if I read your question correctly. My understanding is that
> > the your H/W don't allow strict priority and WRR simultaneously.
> 
> Correct. There is one bit to select between the two. If WRR is
> enabled, it then becomes possible for some generations of the hardware
> to configure the weights, for others the weights are fixed in silicon.
> 
> > In
> > such case, the set/create operations should reject calls setting both
> > non zero weight and priority values.
> 
> So in order to set strict priority, i need to set the priority field
> to 7, 6, 5, 4, 3, 2, 1, 0, for my 8 queues, and weight to 0. For WRR,
> i need to set the priority fields to 0, and the weight values, either
> to what is hard coded in the silicon, or valid weights when it is
> configurable.
> 
> Now the question is, how do i get between these two states? It is not
> possible to mix WRR and strict priority. Any kAPI which only modifies
> one queue at once will go straight into an invalid state, and the
> driver will need to return -EOPNOTSUPP. So it seems like there needs
> to be an atomic set N queue configuration at once, so i can cleanly go
> from strict priority across 8 queues to WRR across 8 queues. Is that
> foreseen?

You could delete all the WRR shapers and then create/add SP based ones.

> > It sounds correct. You can avoid creating the queue group and set the
> > rate at the NETDEV scope, or possibly not even set/create such shaper:
> > the transmission is expected to be limited by the line rate.
> 
> Well, the hardware exists, i probably should just create the shapers
> to match the hardware. However, if i set the hardware equivalent of
> bw_max to 0, it then uses line rate. Is this something we want to
> document? You can disable a shaper from shaping by setting the
> bandwidth to 0? Or do we want a separate enable/disable bit in the
> structure?

I documenting that '0' means 'unlimited' for bw_max fields should be
good enough. If the NIC can't support any kind of shaping it will
reject with an appropriate extended ack any configuration change with
bw != 0.

I guess we need another comment update here :)


> > > > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and
> > > > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.
> > > 
> > > Are they also available on plain boring devices which don't have PFs
> > > or VFs?
> > 
> > Yes, a driver is free to implement/support an arbitrary subset of the
> > available feature. Operations attempting to set an unsupported state
> > should fail with a relevant extended ack.
> 
> Great. Please update the comment.

Sure, will do.

> > > In my case, only one field appears to be relevant in
> > > each shaper, and maybe we want to give a hint about that to userspace?
> > 
> > Any suggestion on how to expose that in reasonable way more then
> > welcome!
> 
> We might first want to gather knowledge on what these shapers actually
> look like in different hardwares. There are going to be some which are
> very limited fixed functions as in the hardware i have, probably most
> SOHO switches are like this. And then we have TOR and smart NICs which
> might be able to create and destroy shapers on the fly, and are a lot
> less restrictive.

As a reference the initial configuration status has been discussed
here:

https://lore.kernel.org/netdev/20240410075745.4637c537@kernel.org/

(grep for 'Transforming arbitrary pre-existing driver hierarchy')

The conclusion was (to my understanding) to reduce the core complexity
enforcing an uniform views of the NIC defaults.

The 'create' op is just an abstraction to tell the NIC to switch from
the default configuration to the specified one.

The default configuration means 'no restrictions/shaping beyond the
link rate'. If the NIC has buildin shapers it can't disable, it must
initialize them to match the above. In both case the kernel view of the
default will be 'no shapers added'

The 'delete' op will restore the default for the specified scope
location. If the NIC has a persistent shaper there, it will update the
setting to the default, or actually delete the H/W shaper if possible.
In both case the kernel view of such scope location with return to be
'no shapers added'

My understading/guess/hope is that is possible to implement the above
default with any H/W.

Does the above solve your doubt/answer your question?

Thanks!

Paolo
Andrew Lunn May 9, 2024, 4:17 p.m. UTC | #6
> > Now the question is, how do i get between these two states? It is not
> > possible to mix WRR and strict priority. Any kAPI which only modifies
> > one queue at once will go straight into an invalid state, and the
> > driver will need to return -EOPNOTSUPP. So it seems like there needs
> > to be an atomic set N queue configuration at once, so i can cleanly go
> > from strict priority across 8 queues to WRR across 8 queues. Is that
> > foreseen?
> 
> You could delete all the WRR shapers and then create/add SP based ones.

But that does not match the hardware. I cannot delete the hardware. It
will either do strict priority or WRR. If i delete the software
representation of the shaper, the hardware shaper will keep on doing
what it was doing. So i don't see this as a good model. I think the
driver will create shapers to represent the hardware, and you are not
allowed to delete them or add more of them, because that is what the
hardware is. All you can do is configure the shapers that exist.

> The 'create' op is just an abstraction to tell the NIC to switch from
> the default configuration to the specified one.

Well, the hardware default is i think WRR for the queues, and line
rate. That will be what the software representation of the shapers
will be set to when the driver probes and creates the shapers
representors.

	Andrew
Naveen Mamindlapalli May 10, 2024, 7:10 a.m. UTC | #7
Hi Paolo,

> -----Original Message-----
> From: Paolo Abeni <pabeni@redhat.com>
> Sent: Thursday, May 9, 2024 1:51 AM
> To: netdev@vger.kernel.org
> Cc: Jakub Kicinski <kuba@kernel.org>; Jiri Pirko <jiri@resnulli.us>; Madhu
> Chittim <madhu.chittim@intel.com>; Sridhar Samudrala
> <sridhar.samudrala@intel.com>; Simon Horman <horms@kernel.org>; John
> Fastabend <john.fastabend@gmail.com>; Sunil Kovvuri Goutham
> <sgoutham@marvell.com>; Jamal Hadi Salim <jhs@mojatatu.com>
> Subject: [RFC PATCH] net: introduce HW Rate Limiting Driver API
> 
> This is the first incarnation in a formal (pre-RFC) patch of the HW TX Rate
> Limiting Driver API proposal shared here[1].
> 
> The goal is to outline the proposed APIs before pushing the actual implementation.
> 
> The network devices gain a new ops struct to directly manipulate the H/W
> shapers implemented by the NIC.
> 
> The shapers can be attached to a pre-defined set of 'domains' - port, vf, etc. - and
> the overall shapers configuration pushed to the H/W is maintained by the kernel.
> 
> Each shaper is identified by an unique integer id based on the domain and
> additional domain-specific information - e.g. for the VF domain, the virtual function
> number/identifier.
> 
> [1] https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__lore.kernel.org_netdev_20240405102313.GA310894-
> 40kernel.org_&d=DwIDaQ&c=nKjWec2b6R0mOyPaz7xtfQ&r=TwreqwV6mQ8K9wI
> pqwFO8yjikO_w1jUOe2MzChg4Rmg&m=g3KFBILzzqT9znnKDO0ePHOi5tc0Zzy6r
> G9xL3GdGKS_9IkdC3q-
> okhCTptMrIKN&s=hVQbF8IF41zg6I197jY8r96uQq0oQbgTfe4-4HF6c3M&e=
> 
> Co-developed-by: Simon Horman <horms@kernel.org>
> Signed-off-by: Simon Horman <horms@kernel.org>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
>  include/linux/netdevice.h |  15 +++
>  include/net/net_shaper.h  | 206
> ++++++++++++++++++++++++++++++++++++++
>  net/Kconfig               |   3 +
>  3 files changed, 224 insertions(+)
>  create mode 100644 include/net/net_shaper.h
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index
> cf261fb89d73..39f66af014be 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -79,6 +79,8 @@ struct xdp_buff;
>  struct xdp_frame;
>  struct xdp_metadata_ops;
>  struct xdp_md;
> +struct net_shaper_ops;
> +struct net_shaper_data;
> 
>  typedef u32 xdp_features_t;
> 
> @@ -1596,6 +1598,13 @@ struct net_device_ops {
>  	int			(*ndo_hwtstamp_set)(struct net_device *dev,
>  						    struct kernel_hwtstamp_config
> *kernel_config,
>  						    struct netlink_ext_ack
> *extack);
> +
> +#if IS_ENABLED(CONFIG_NET_SHAPER)
> +	/** @net_shaper_ops: Device shaping offload operations
> +	 * see include/net/net_shapers.h
> +	 */
> +	const struct net_shaper_ops *net_shaper_ops; #endif
>  };
> 
>  /**
> @@ -2403,6 +2412,12 @@ struct net_device {
>  	/** @page_pools: page pools created for this netdevice */
>  	struct hlist_head	page_pools;
>  #endif
> +#if IS_ENABLED(CONFIG_NET_SHAPER)
> +	/** @net_shaper_data: data tracking the current shaper status
> +	 *  see include/net/net_shapers.h
> +	 */
> +	struct net_shaper_data *net_shaper_data; #endif
>  };
>  #define to_net_dev(d) container_of(d, struct net_device, dev)
> 
> diff --git a/include/net/net_shaper.h b/include/net/net_shaper.h new file mode
> 100644 index 000000000000..a4fbadd99870
> --- /dev/null
> +++ b/include/net/net_shaper.h
> @@ -0,0 +1,206 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +
> +#ifndef _NET_SHAPER_H_
> +#define _NET_SHAPER_H_
> +
> +#include <linux/types.h>
> +#include <linux/netdevice.h>
> +#include <linux/netlink.h>
> +
> +/**
> + * enum net_shaper_metric - the metric of the shaper
> + * @NET_SHAPER_METRIC_PPS: Shaper operates on a packets per second
> +basis
> + * @NET_SHAPER_METRIC_BPS: Shaper operates on a bits per second basis
> +*/ enum net_shaper_metric {
> +	NET_SHAPER_METRIC_PPS,
> +	NET_SHAPER_METRIC_BPS
> +};
> +
> +/**
> + * struct net_shaper_info - represents a shaping node on the NIC H/W
> + * @metric: Specify if the bw limits refers to PPS or BPS
> + * @bw_min: Minimum guaranteed rate for this shaper
> + * @bw_max: Maximum peak bw allowed for this shaper
> + * @burst: Maximum burst for the peek rate of this shaper
> + * @priority: Scheduling priority for this shaper
> + * @weight: Scheduling weight for this shaper  */ struct
> +net_shaper_info {
> +	enum net_shaper_metric metric;
> +	u64 bw_min;	/* minimum guaranteed bandwidth, according to metric */
> +	u64 bw_max;	/* maximum allowed bandwidth */
> +	u32 burst;	/* maximum burst in bytes for bw_max */
> +	u32 priority;	/* scheduling strict priority */
> +	u32 weight;	/* scheduling WRR weight*/
> +};
> +
> +/**
> + * enum net_shaper_scope - the different scopes where a shaper could be
> attached
> + * @NET_SHAPER_SCOPE_PORT:   The root shaper for the whole H/W.
> + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network
> device.
> + * @NET_SHAPER_SCOPE_VF:     The shaper is attached to the given virtual
> + * function.
> + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple
> queues
> +under the
> + * same device.
> + * @NET_SHAPER_SCOPE_QUEUE:  The shaper is attached to the given
> device queue.
> + *
> + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only
> available on
> + * PF devices, usually inside the host/hypervisor.

What is the difference between NET_SHAPER_SCOPE_VF and NET_SHAPER_SCOPE_NETDEV from a VF traffic shaping perspective?
Is NET_SHAPER_SCOPE_VF used to configure a shaper for a VF which is in a separate domain that is not a NETDEV?

Thanks,
Naveen

> + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP
> and
> + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.
> + */
> +enum net_shaper_scope {
> +	NET_SHAPER_SCOPE_PORT,
> +	NET_SHAPER_SCOPE_NETDEV,
> +	NET_SHAPER_SCOPE_VF,
> +	NET_SHAPER_SCOPE_QUEUE_GROUP,
> +	NET_SHAPER_SCOPE_QUEUE,
> +};
> +
> +/**
> + * struct net_shaper_ops - Operations on device H/W shapers
> + * @add: Creates a new shaper in the specified scope.
> + * @set: Modify the existing shaper.
> + * @delete: Delete the specified shaper.
> + * @move: Move an existing shaper under a different parent.
> + *
> + * The initial shaping configuration ad device initialization is empty/
> + * a no-op/does not constraint the b/w in any way.
> + * The network core keeps track of the applied user-configuration in
> + * per device storage.
> + *
> + * Each shaper is uniquely identified within the device with an
> +'handle',
> + * dependent on the shaper scope and other data, see
> +@shaper_make_handle()  */ struct net_shaper_ops {
> +	/** add - Add a shaper inside the shaper hierarchy
> +	 * @dev: netdevice to operate on
> +	 * @handle: the shaper indetifier
> +	 * @shaper: configuration of shaper
> +	 * @extack: Netlink extended ACK for reporting errors.
> +	 *
> +	 * Return:
> +	 * * 0 on success
> +	 * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> +	 *                  or core for any reason. @extack should be set to
> +	 *                  text describing the reason.
> +	 * * Other negative error values on failure.
> +	 *
> +	 * Examples or reasons this operation may fail include:
> +	 * * H/W resources limits.
> +	 * * Can’t respect the requested bw limits.
> +	 */
> +	int (*add)(struct net_device *dev, u32 handle,
> +		   const struct net_shaper_info *shaper,
> +		   struct netlink_ext_ack *extack);
> +
> +	/** set - Update the specified shaper, if it exists
> +	 * @dev: Netdevice to operate on.
> +	 * @handle: the shaper identifier
> +	 * @shaper: Configuration of shaper.
> +	 * @extack: Netlink extended ACK for reporting errors.
> +	 *
> +	 * Return:
> +	 * * %0 - Success
> +	 * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> +	 *                  or core for any reason. @extack should be set to
> +	 *                  text describing the reason.
> +	 * * Other negative error values on failure.
> +	 */
> +	int (*set)(struct net_device *dev, u32 handle,
> +		   const struct net_shaper_info *shaper,
> +		   struct netlink_ext_ack *extack);
> +
> +	/** delete - Removes a shaper from the NIC
> +	 * @dev: netdevice to operate on.
> +	 * @handle: the shaper identifier
> +	 * @extack: Netlink extended ACK for reporting errors.
> +	 *
> +	 * Return:
> +	 * * %0 - Success
> +	 * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> +	 *                  or core for any reason. @extack should be set to
> +	 *                  text describing the reason.
> +	 * * Other negative error value on failure.
> +	 */
> +	int (*delete)(struct net_device *dev, u32 handle,
> +		      struct netlink_ext_ack *extack);
> +
> +	/** Move - change the parent id of the specified shaper
> +	 * @dev: netdevice to operate on.
> +	 * @handle: unique identifier for the shaper
> +	 * @new_parent_id: identifier of the new parent for this shaper
> +	 * @extack: Netlink extended ACK for reporting errors.
> +	 *
> +	 * Move the specified shaper in the hierarchy replacing its
> +	 * current parent shaper with @new_parent_id
> +	 *
> +	 * Return:
> +	 * * %0 - Success
> +	 * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> +	 *                  or core for any reason. @extack should be set to
> +	 *                  text describing the reason.
> +	 * * Other negative error values on failure.
> +	 */
> +	int (*move)(struct net_device *dev, u32 handle,
> +		    u32 new_parent_handle, struct netlink_ext_ack *extack); };
> +
> +/**
> + * net_shaper_make_handle - creates an unique shaper identifier
> + * @scope: the shaper scope
> + * @vf: virtual function number
> + * @id: queue group or queue id
> + *
> + * Return: an unique identifier for the shaper
> + *
> + * Combines the specified arguments to create an unique identifier for
> + * the shaper.
> + * The virtual function number is only used within
> +@NET_SHAPER_SCOPE_VF,
> + * @NET_SHAPER_SCOPE_QUEUE_GROUP and
> @NET_SHAPER_SCOPE_QUEUE.
> + * The @id number is only used for @NET_SHAPER_SCOPE_QUEUE_GROUP
> and
> + * @NET_SHAPER_SCOPE_QUEUE, and must be, respectively, the queue
> group
> + * identifier or the queue number.
> + */
> +u32 net_shaper_make_handle(enum net_shaper_scope scope, int vf, int
> +id);
> +
> +/*
> + * Examples:
> + * - set shaping on a given queue
> + *   struct shaper_info info = { }; // fill this
> + *   u32 handle = shaper_make_handle(NET_SHAPER_SCOPE_QUEUE, 0,
> queue_id);
> + *   dev->shaper_ops->add(dev, handle, &info, NULL);
> + *
> + * - create a queue group with a queue group shaping limits.
> + *   Assuming the following topology already exists:
> + *                       < netdev shaper  >
> + *                        /              \
> + *               <queue 0 shaper> . . .  <queue N shaper>
> + *
> + *   struct shaper_info ginfo = { }; // fill this
> + *   u32 ghandle =
> shaper_make_handle(NET_SHAPER_SCOPE_QUEUE_GROUP, 0, 0);
> + *   dev->shaper_ops->add(dev, ghandle, &ginfo);
> + *
> + *   // now topology is:
> + *   //                              < netdev shaper  >
> + *   //                             /         |          \
> + *   //                            /          |       < newly created shaper  >
> + *   //                           /           |
> + *   //	<queue 0 shaper> . . .    <queue N shaper>
> + *
> + *   // move a shapers for queues 3..n out of such queue group
> + *   for (i = 0; i <= 2; ++i) {
> + *       u32 qhandle = net_shaper_make_handle(NET_SHAPER_SCOPE_QUEUE,
> 0, i);
> + *       dev->netshaper_ops->move(dev, qhandle, ghandle, NULL);
> + *   }
> + *
> + *   // now the topology is:
> + *   //                                < netdev shaper  >
> + *   //                                 /            \
> + *   //               < newly created shaper>   <queue 3 shaper> .. <queue n shaper>
> + *   //                /               \
> + *   //	<queue 0 shaper> . . .    <queue 2 shaper>
> + */
> +#endif
> +
> diff --git a/net/Kconfig b/net/Kconfig
> index f0a8692496ff..29c6fec54711 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -66,6 +66,9 @@ config SKB_DECRYPTED
>  config SKB_EXTENSIONS
>  	bool
> 
> +config NET_SHAPER
> +	bool
> +
>  menu "Networking options"
> 
>  source "net/packet/Kconfig"
> --
> 2.43.2
>
Paolo Abeni May 10, 2024, 7:58 a.m. UTC | #8
On Fri, 2024-05-10 at 07:10 +0000, Naveen Mamindlapalli wrote:

> On Thursday, May 9, 2024 1:51 AM Paolo Abeni <pabeni@redhat.com> wrote:
[...]
> > +/**
> > + * enum net_shaper_scope - the different scopes where a shaper could be
> > attached
> > + * @NET_SHAPER_SCOPE_PORT:   The root shaper for the whole H/W.
> > + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network
> > device.
> > + * @NET_SHAPER_SCOPE_VF:     The shaper is attached to the given virtual
> > + * function.
> > + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple
> > queues
> > +under the
> > + * same device.
> > + * @NET_SHAPER_SCOPE_QUEUE:  The shaper is attached to the given
> > device queue.
> > + *
> > + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only
> > available on
> > + * PF devices, usually inside the host/hypervisor.
> 
> What is the difference between NET_SHAPER_SCOPE_VF and NET_SHAPER_SCOPE_NETDEV 
> from a VF traffic shaping perspective?

With NET_SHAPER_SCOPE_VF, the user can control VF-level shapers from
within the hypervisor.

With NET_SHAPER_SCOPE_NETDEV, the user will control whatever is really
under the gear of the given 'dev' struct: the PF-level shaper if the
user itself is within the hypervisor, or the relevant VF if the user is
within the VM.

Cheers,

Paolo
Paolo Abeni May 10, 2024, 11:05 a.m. UTC | #9
On Thu, 2024-05-09 at 18:17 +0200, Andrew Lunn wrote:
> > > Now the question is, how do i get between these two states? It is not
> > > possible to mix WRR and strict priority. Any kAPI which only modifies
> > > one queue at once will go straight into an invalid state, and the
> > > driver will need to return -EOPNOTSUPP. So it seems like there needs
> > > to be an atomic set N queue configuration at once, so i can cleanly go
> > > from strict priority across 8 queues to WRR across 8 queues. Is that
> > > foreseen?
> > 
> > You could delete all the WRR shapers and then create/add SP based ones.
> 
> But that does not match the hardware. I cannot delete the hardware. It
> will either do strict priority or WRR. If i delete the software
> representation of the shaper, the hardware shaper will keep on doing
> what it was doing. So i don't see this as a good model. I think the
> driver will create shapers to represent the hardware, and you are not
> allowed to delete them or add more of them, because that is what the
> hardware is. All you can do is configure the shapers that exist.
> 
> > The 'create' op is just an abstraction to tell the NIC to switch from
> > the default configuration to the specified one.
> 
> Well, the hardware default is i think WRR for the queues, and line
> rate. That will be what the software representation of the shapers
> will be set to when the driver probes and creates the shapers
> representors.

If I read correctly, allowing each NIC to expose it's own different
starting configuration still will not solve the problem for this H/W to
switch from WRR to SP (and vice versa).

AFAICS, what would be needed there is an atomic set of operations:
'set_many' (and e.v. 'delete_many', 'create_many') that will allow
changing all the shapers at once. 

With such operations, that H/W could still fit the expected 'no-op'
default, as WRR on the queue shapers is what we expect. I agree with
Jakub, handling the complexity of arbitrary starting configuration
would pose a lot of trouble to the user/admin.

If all the above stands together, I think we have a few options (in
random order):

- add both set of operations: the ones operating on a single shaper and
the ones operating on multiple shapers
- use only the multiple shapers ops.

And the latter looks IMHO the simple/better. At that point I would
probably drop the 'add' op and would rename 'delete' as
'reset':

int (*set)(struct net_device *dev, int how_many, const u32 *handles,
	   const struct net_shaper_info *shapers,
           struct netlink_ext_ack *extack);
int (*reset)(struct net_device *dev, int how_many, const u32 *handles,
             struct netlink_ext_ack *extack);
int (*move)(struct net_device *dev, int how_many, const u32 *handles,
            const u32 *new_parent_handles,
	    struct netlink_ext_ack *extack);

An NIC with 'static' shapers can implement a dummy move always
returning EOPNOTSUPP and eventually filling a detailed extack.

NIC without any constraints on mixing and matching different kind of
shapers could implement the above as a loop over whatever they will do
for the corresponding 'single shaper op'

NIC with constrains alike the one you pointed out could validate the
final state before atomically applying the specified operation.

After a successful  'reset' operation, the kernel could drop any data
it retains/caches for the relevant shapers - the current idea is to
keep a copy of all successfully configured shaper_info in a xarray,
using the 'handle' as the index.

Side note: the move() operation could look like a complex artifact, but
it's the simplest instrument I could think of to support scenarios
where the user creates/configures/sets a queue group and 'move' some
queue under the newly created group

WDYT?

Thanks,

Paolo
Simon Horman May 15, 2024, 9:51 a.m. UTC | #10
On Fri, May 10, 2024 at 01:05:41PM +0200, Paolo Abeni wrote:
> On Thu, 2024-05-09 at 18:17 +0200, Andrew Lunn wrote:
> > > > Now the question is, how do i get between these two states? It is not
> > > > possible to mix WRR and strict priority. Any kAPI which only modifies
> > > > one queue at once will go straight into an invalid state, and the
> > > > driver will need to return -EOPNOTSUPP. So it seems like there needs
> > > > to be an atomic set N queue configuration at once, so i can cleanly go
> > > > from strict priority across 8 queues to WRR across 8 queues. Is that
> > > > foreseen?
> > > 
> > > You could delete all the WRR shapers and then create/add SP based ones.
> > 
> > But that does not match the hardware. I cannot delete the hardware. It
> > will either do strict priority or WRR. If i delete the software
> > representation of the shaper, the hardware shaper will keep on doing
> > what it was doing. So i don't see this as a good model. I think the
> > driver will create shapers to represent the hardware, and you are not
> > allowed to delete them or add more of them, because that is what the
> > hardware is. All you can do is configure the shapers that exist.
> > 
> > > The 'create' op is just an abstraction to tell the NIC to switch from
> > > the default configuration to the specified one.
> > 
> > Well, the hardware default is i think WRR for the queues, and line
> > rate. That will be what the software representation of the shapers
> > will be set to when the driver probes and creates the shapers
> > representors.
> 
> If I read correctly, allowing each NIC to expose it's own different
> starting configuration still will not solve the problem for this H/W to
> switch from WRR to SP (and vice versa).
> 
> AFAICS, what would be needed there is an atomic set of operations:
> 'set_many' (and e.v. 'delete_many', 'create_many') that will allow
> changing all the shapers at once. 
> 
> With such operations, that H/W could still fit the expected 'no-op'
> default, as WRR on the queue shapers is what we expect. I agree with
> Jakub, handling the complexity of arbitrary starting configuration
> would pose a lot of trouble to the user/admin.
> 
> If all the above stands together, I think we have a few options (in
> random order):
> 
> - add both set of operations: the ones operating on a single shaper and
> the ones operating on multiple shapers
> - use only the multiple shapers ops.
> 
> And the latter looks IMHO the simple/better. At that point I would
> probably drop the 'add' op and would rename 'delete' as
> 'reset':
> 
> int (*set)(struct net_device *dev, int how_many, const u32 *handles,
> 	   const struct net_shaper_info *shapers,
>            struct netlink_ext_ack *extack);
> int (*reset)(struct net_device *dev, int how_many, const u32 *handles,
>              struct netlink_ext_ack *extack);
> int (*move)(struct net_device *dev, int how_many, const u32 *handles,
>             const u32 *new_parent_handles,
> 	    struct netlink_ext_ack *extack);
> 
> An NIC with 'static' shapers can implement a dummy move always
> returning EOPNOTSUPP and eventually filling a detailed extack.
> 
> NIC without any constraints on mixing and matching different kind of
> shapers could implement the above as a loop over whatever they will do
> for the corresponding 'single shaper op'
> 
> NIC with constrains alike the one you pointed out could validate the
> final state before atomically applying the specified operation.
> 
> After a successful  'reset' operation, the kernel could drop any data
> it retains/caches for the relevant shapers - the current idea is to
> keep a copy of all successfully configured shaper_info in a xarray,
> using the 'handle' as the index.
> 
> Side note: the move() operation could look like a complex artifact, but
> it's the simplest instrument I could think of to support scenarios
> where the user creates/configures/sets a queue group and 'move' some
> queue under the newly created group
> 
> WDYT?

Hi Andrew,

Picking up this discussion, I'm wondering if Paolo's proposal
addresses your concerns.
Andrew Lunn May 15, 2024, 2:19 p.m. UTC | #11
> > If I read correctly, allowing each NIC to expose it's own different
> > starting configuration still will not solve the problem for this H/W to
> > switch from WRR to SP (and vice versa).

I also suspect this is not unique to this hardware. I've not looked at
other SOHO switches, but it is reasonably common to have different
queues for different priority classes, and then one shaper for the
overall port rate.

> > AFAICS, what would be needed there is an atomic set of operations:
> > 'set_many' (and e.v. 'delete_many', 'create_many') that will allow
> > changing all the shapers at once. 

Yep.

> > With such operations, that H/W could still fit the expected 'no-op'
> > default, as WRR on the queue shapers is what we expect. I agree with
> > Jakub, handling the complexity of arbitrary starting configuration
> > would pose a lot of trouble to the user/admin.
> > 
> > If all the above stands together, I think we have a few options (in
> > random order):
> > 
> > - add both set of operations: the ones operating on a single shaper and
> > the ones operating on multiple shapers
> > - use only the multiple shapers ops.
> > 
> > And the latter looks IMHO the simple/better.

I would agree, start with only multiple shaper opps. If we find that
many implementation end up just iterating the list and dealing with
them individually, would could pull that iterator into the core, and
expand the ops to either/or, multiple or single.

> > int (*set)(struct net_device *dev, int how_many, const u32 *handles,
> > 	   const struct net_shaper_info *shapers,
> >            struct netlink_ext_ack *extack);
> > int (*reset)(struct net_device *dev, int how_many, const u32 *handles,
> >              struct netlink_ext_ack *extack);
> > int (*move)(struct net_device *dev, int how_many, const u32 *handles,
> >             const u32 *new_parent_handles,
> > 	    struct netlink_ext_ack *extack);
> > 
> > An NIC with 'static' shapers can implement a dummy move always
> > returning EOPNOTSUPP and eventually filling a detailed extack.

The extack is going to be important here, we are going to need
meaningful error messages.

Overall, i think this can be made to work with the hardware i have.

	 Andrew
Andrew Lunn May 15, 2024, 2:41 p.m. UTC | #12
> + * struct net_shaper_info - represents a shaping node on the NIC H/W
> + * @metric: Specify if the bw limits refers to PPS or BPS
> + * @bw_min: Minimum guaranteed rate for this shaper
> + * @bw_max: Maximum peak bw allowed for this shaper
> + * @burst: Maximum burst for the peek rate of this shaper
> + * @priority: Scheduling priority for this shaper
> + * @weight: Scheduling weight for this shaper
> + */
> +struct net_shaper_info {
> +	enum net_shaper_metric metric;
> +	u64 bw_min;	/* minimum guaranteed bandwidth, according to metric */
> +	u64 bw_max;	/* maximum allowed bandwidth */
> +	u32 burst;	/* maximum burst in bytes for bw_max */
> +	u32 priority;	/* scheduling strict priority */
> +	u32 weight;	/* scheduling WRR weight*/
> +};

...

> +	/** set - Update the specified shaper, if it exists
> +	 * @dev: Netdevice to operate on.
> +	 * @handle: the shaper identifier
> +	 * @shaper: Configuration of shaper.
> +	 * @extack: Netlink extended ACK for reporting errors.
> +	 *
> +	 * Return:
> +	 * * %0 - Success
> +	 * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> +	 *                  or core for any reason. @extack should be set to
> +	 *                  text describing the reason.
> +	 * * Other negative error values on failure.
> +	 */
> +	int (*set)(struct net_device *dev, u32 handle,
> +		   const struct net_shaper_info *shaper,
> +		   struct netlink_ext_ack *extack);

> + * net_shaper_make_handle - creates an unique shaper identifier
> + * @scope: the shaper scope
> + * @vf: virtual function number
> + * @id: queue group or queue id
> + *
> + * Return: an unique identifier for the shaper
> + *
> + * Combines the specified arguments to create an unique identifier for
> + * the shaper.
> + * The virtual function number is only used within @NET_SHAPER_SCOPE_VF,
> + * @NET_SHAPER_SCOPE_QUEUE_GROUP and @NET_SHAPER_SCOPE_QUEUE.
> + * The @id number is only used for @NET_SHAPER_SCOPE_QUEUE_GROUP and
> + * @NET_SHAPER_SCOPE_QUEUE, and must be, respectively, the queue group
> + * identifier or the queue number.
> + */
> +u32 net_shaper_make_handle(enum net_shaper_scope scope, int vf, int id);

One thing i'm missing here is a function which does the opposite of
net_shaper_make_handle(). Given a handle, it returns the scope, vf and
the id.

When the set() op is called, i somehow need to find the software
instance representing the hardware block. If i know the id, it is just
an array access. Otherwise i need additional bookkeeping, maybe a
linked list of handles and pointers to structures etc.

Or net_shaper_make_handle() could maybe take an addition void * priv,
and provide a function void * net_shape_priv(u32 handle);

    Andrew
Simon Horman May 15, 2024, 2:50 p.m. UTC | #13
On Wed, May 15, 2024 at 04:41:09PM +0200, Andrew Lunn wrote:
> > + * struct net_shaper_info - represents a shaping node on the NIC H/W
> > + * @metric: Specify if the bw limits refers to PPS or BPS
> > + * @bw_min: Minimum guaranteed rate for this shaper
> > + * @bw_max: Maximum peak bw allowed for this shaper
> > + * @burst: Maximum burst for the peek rate of this shaper
> > + * @priority: Scheduling priority for this shaper
> > + * @weight: Scheduling weight for this shaper
> > + */
> > +struct net_shaper_info {
> > +	enum net_shaper_metric metric;
> > +	u64 bw_min;	/* minimum guaranteed bandwidth, according to metric */
> > +	u64 bw_max;	/* maximum allowed bandwidth */
> > +	u32 burst;	/* maximum burst in bytes for bw_max */
> > +	u32 priority;	/* scheduling strict priority */
> > +	u32 weight;	/* scheduling WRR weight*/
> > +};
> 
> ...
> 
> > +	/** set - Update the specified shaper, if it exists
> > +	 * @dev: Netdevice to operate on.
> > +	 * @handle: the shaper identifier
> > +	 * @shaper: Configuration of shaper.
> > +	 * @extack: Netlink extended ACK for reporting errors.
> > +	 *
> > +	 * Return:
> > +	 * * %0 - Success
> > +	 * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> > +	 *                  or core for any reason. @extack should be set to
> > +	 *                  text describing the reason.
> > +	 * * Other negative error values on failure.
> > +	 */
> > +	int (*set)(struct net_device *dev, u32 handle,
> > +		   const struct net_shaper_info *shaper,
> > +		   struct netlink_ext_ack *extack);
> 
> > + * net_shaper_make_handle - creates an unique shaper identifier
> > + * @scope: the shaper scope
> > + * @vf: virtual function number
> > + * @id: queue group or queue id
> > + *
> > + * Return: an unique identifier for the shaper
> > + *
> > + * Combines the specified arguments to create an unique identifier for
> > + * the shaper.
> > + * The virtual function number is only used within @NET_SHAPER_SCOPE_VF,
> > + * @NET_SHAPER_SCOPE_QUEUE_GROUP and @NET_SHAPER_SCOPE_QUEUE.
> > + * The @id number is only used for @NET_SHAPER_SCOPE_QUEUE_GROUP and
> > + * @NET_SHAPER_SCOPE_QUEUE, and must be, respectively, the queue group
> > + * identifier or the queue number.
> > + */
> > +u32 net_shaper_make_handle(enum net_shaper_scope scope, int vf, int id);
> 
> One thing i'm missing here is a function which does the opposite of
> net_shaper_make_handle(). Given a handle, it returns the scope, vf and
> the id.
> 
> When the set() op is called, i somehow need to find the software
> instance representing the hardware block. If i know the id, it is just
> an array access. Otherwise i need additional bookkeeping, maybe a
> linked list of handles and pointers to structures etc.
> 
> Or net_shaper_make_handle() could maybe take an addition void * priv,
> and provide a function void * net_shape_priv(u32 handle);

Hi Andrew,

I think that, initially at least, the implementation of
net_shaper_make_handle() can be fairly simple, involving packing
fields into an integer in a static manner.

As such I think implementing a helper or helpers to an to extract fields
should be trivial.

In other words, yes, I think that can be added.
Simon Horman May 15, 2024, 2:56 p.m. UTC | #14
On Wed, May 15, 2024 at 04:19:57PM +0200, Andrew Lunn wrote:
> > > If I read correctly, allowing each NIC to expose it's own different
> > > starting configuration still will not solve the problem for this H/W to
> > > switch from WRR to SP (and vice versa).
> 
> I also suspect this is not unique to this hardware. I've not looked at
> other SOHO switches, but it is reasonably common to have different
> queues for different priority classes, and then one shaper for the
> overall port rate.

Yes, understood. It's about creating a sufficiently general solution.
And the HW you have in mind has lead us to see some shortcomings
of the proposed API in that area. Because it drew a bit too much on
understanding of a different category of HW.

> > > AFAICS, what would be needed there is an atomic set of operations:
> > > 'set_many' (and e.v. 'delete_many', 'create_many') that will allow
> > > changing all the shapers at once. 
> 
> Yep.
> 
> > > With such operations, that H/W could still fit the expected 'no-op'
> > > default, as WRR on the queue shapers is what we expect. I agree with
> > > Jakub, handling the complexity of arbitrary starting configuration
> > > would pose a lot of trouble to the user/admin.
> > > 
> > > If all the above stands together, I think we have a few options (in
> > > random order):
> > > 
> > > - add both set of operations: the ones operating on a single shaper and
> > > the ones operating on multiple shapers
> > > - use only the multiple shapers ops.
> > > 
> > > And the latter looks IMHO the simple/better.
> 
> I would agree, start with only multiple shaper opps. If we find that
> many implementation end up just iterating the list and dealing with
> them individually, would could pull that iterator into the core, and
> expand the ops to either/or, multiple or single.

FWIIW, this was my thinking too.

> > > int (*set)(struct net_device *dev, int how_many, const u32 *handles,
> > > 	   const struct net_shaper_info *shapers,
> > >            struct netlink_ext_ack *extack);
> > > int (*reset)(struct net_device *dev, int how_many, const u32 *handles,
> > >              struct netlink_ext_ack *extack);
> > > int (*move)(struct net_device *dev, int how_many, const u32 *handles,
> > >             const u32 *new_parent_handles,
> > > 	    struct netlink_ext_ack *extack);
> > > 
> > > An NIC with 'static' shapers can implement a dummy move always
> > > returning EOPNOTSUPP and eventually filling a detailed extack.
> 
> The extack is going to be important here, we are going to need
> meaningful error messages.

Always :)

> Overall, i think this can be made to work with the hardware i have.

Great, I think the next step is for us to propose a revised API
with multiple shaper ops in place of single shaper ops.
Jakub Kicinski May 28, 2024, 5:18 p.m. UTC | #15
On Wed,  8 May 2024 22:20:51 +0200 Paolo Abeni wrote:
> +/**
> + * struct net_shaper_info - represents a shaping node on the NIC H/W
> + * @metric: Specify if the bw limits refers to PPS or BPS
> + * @bw_min: Minimum guaranteed rate for this shaper
> + * @bw_max: Maximum peak bw allowed for this shaper
> + * @burst: Maximum burst for the peek rate of this shaper
> + * @priority: Scheduling priority for this shaper
> + * @weight: Scheduling weight for this shaper
> + */
> +struct net_shaper_info {
> +	enum net_shaper_metric metric;
> +	u64 bw_min;	/* minimum guaranteed bandwidth, according to metric */
> +	u64 bw_max;	/* maximum allowed bandwidth */
> +	u32 burst;	/* maximum burst in bytes for bw_max */

Burst is burst, either we need two or we assume it's for both bw_min
and bw_max, but it most certainly can't be just for bw_max.

Also presumably not just bytes - if "metric" is pps, burst is pps?

> +	u32 priority;	/* scheduling strict priority */
> +	u32 weight;	/* scheduling WRR weight*/

I wonder if we should somehow more clearly specify what a node can do.
Like Andrew pointed out, if we have a WRR node, presumably the weights
go on the children, since there's only one weigh param. But then the
WRR node itself is either empty (no params) or has rate params... which
is odd.

Maybe shaping nodes and RR nodes should be separate node classes,
somehow?

> +};
> +
> +/**
> + * enum net_shaper_scope - the different scopes where a shaper could be attached
> + * @NET_SHAPER_SCOPE_PORT:   The root shaper for the whole H/W.
> + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network device.
> + * @NET_SHAPER_SCOPE_VF:     The shaper is attached to the given virtual
> + * function.
> + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple queues under the
> + * same device.
> + * @NET_SHAPER_SCOPE_QUEUE:  The shaper is attached to the given device queue.

I wonder if we need traffic class? Some devices may have two schedulers,
one from the host interfaces (PCIe) into the device buffer. And then
from the device buffer independently into the wire.

> + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on
> + * PF devices, usually inside the host/hypervisor.
> + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and
> + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.
> + */
> +enum net_shaper_scope {
> +	NET_SHAPER_SCOPE_PORT,
> +	NET_SHAPER_SCOPE_NETDEV,
> +	NET_SHAPER_SCOPE_VF,

I realized now that we do indeed need this VF node (if we want to
internally express the legacy SRIOV NDOs in this API), as much 
as I hate it. Could you annotate somehow my nack on ever exposing
the ability to hook on the VF to user space?

> +	NET_SHAPER_SCOPE_QUEUE_GROUP,

We may need a definition for a queue group. Did I suggest this?
Isn't queue group just a bunch of queues feeding a trivial RR node?
Why does it need to be a "scope"?

> +	NET_SHAPER_SCOPE_QUEUE,
> +};
> +
> +/**
> + * struct net_shaper_ops - Operations on device H/W shapers
> + * @add: Creates a new shaper in the specified scope.

"in a scope"? Isn't the scope just defining the ingress and egress
points of the scheduling hierarchy?

Also your example moves schedulers from queue scope to queue group
scope.

> + * @set: Modify the existing shaper.
> + * @delete: Delete the specified shaper.
> + * @move: Move an existing shaper under a different parent.
> + *
> + * The initial shaping configuration ad device initialization is empty/

and

> + * a no-op/does not constraint the b/w in any way.
> + * The network core keeps track of the applied user-configuration in
> + * per device storage.

"keeps track .. per device" -- "storage" may make people think NVM.

> + * Each shaper is uniquely identified within the device with an 'handle',
> + * dependent on the shaper scope and other data, see @shaper_make_handle()
> + */
> +struct net_shaper_ops {
> +	/** add - Add a shaper inside the shaper hierarchy
> +	 * @dev: netdevice to operate on
> +	 * @handle: the shaper indetifier
> +	 * @shaper: configuration of shaper
> +	 * @extack: Netlink extended ACK for reporting errors.
> +	 *
> +	 * Return:
> +	 * * 0 on success
> +	 * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> +	 *                  or core for any reason. @extack should be set to
> +	 *                  text describing the reason.
> +	 * * Other negative error values on failure.
> +	 *
> +	 * Examples or reasons this operation may fail include:
> +	 * * H/W resources limits.
> +	 * * Can’t respect the requested bw limits.
> +	 */
> +	int (*add)(struct net_device *dev, u32 handle,
> +		   const struct net_shaper_info *shaper,
> +		   struct netlink_ext_ack *extack);
> +
> +	/** set - Update the specified shaper, if it exists

Why "if it exists" ? Core should make sure it exists, no?

In addition to ops and state, the device will likely need to express
capabilities of some sort. So that the core can do some work for the
drivers and in due course we can expose them to user space for
discoverability.
Jakub Kicinski May 28, 2024, 5:28 p.m. UTC | #16
On Fri, 10 May 2024 13:05:41 +0200 Paolo Abeni wrote:
> And the latter looks IMHO the simple/better. At that point I would
> probably drop the 'add' op and would rename 'delete' as
> 'reset':
> 
> int (*set)(struct net_device *dev, int how_many, const u32 *handles,
> 	   const struct net_shaper_info *shapers,
>            struct netlink_ext_ack *extack);
> int (*reset)(struct net_device *dev, int how_many, const u32 *handles,
>              struct netlink_ext_ack *extack);
> int (*move)(struct net_device *dev, int how_many, const u32 *handles,
>             const u32 *new_parent_handles,
> 	    struct netlink_ext_ack *extack);
> 
> An NIC with 'static' shapers can implement a dummy move always
> returning EOPNOTSUPP and eventually filling a detailed extack.
> 
> NIC without any constraints on mixing and matching different kind of
> shapers could implement the above as a loop over whatever they will do
> for the corresponding 'single shaper op'
> 
> NIC with constrains alike the one you pointed out could validate the
> final state before atomically applying the specified operation.
> 
> After a successful  'reset' operation, the kernel could drop any data
> it retains/caches for the relevant shapers - the current idea is to
> keep a copy of all successfully configured shaper_info in a xarray,
> using the 'handle' as the index.

IMHO this is more confusing that the current API, maybe we just need
better driver-facing documentation? Deleting a node from the hierarchy
doesn't delete the HW, same as deleting a TCAM entry doesn't chisel off
a part of the chip. If you want to switch from WRR to SP you'd need to
delete all the weight nodes and insert SP nodes as children.

If we modeled mux nodes as multi-entry nodes explicitly it may be
easier. Meaning make the scheduling and weights part of the mux node,
not its children.
Paolo Abeni May 31, 2024, 9:22 a.m. UTC | #17
Hi,

Thank you for your time and reply. 

On Tue, 2024-05-28 at 10:18 -0700, Jakub Kicinski wrote:
> > +	u32 priority;	/* scheduling strict priority */
> > +	u32 weight;	/* scheduling WRR weight*/
> 
> I wonder if we should somehow more clearly specify what a node can do.
> Like Andrew pointed out, if we have a WRR node, presumably the weights
> go on the children, since there's only one weigh param. But then the
> WRR node itself is either empty (no params) or has rate params... which
> is odd.
> 
> Maybe shaping nodes and RR nodes should be separate node classes,
> somehow?

Possibly clarifying the meaning of 'weight' field would help?  
It means: this node is scheduled WRR according to the specified weight
among the sibling shapers with the same priority.

I think it's quite simpler than introducing different node classes with
separate handling. My understanding is that the latter would help only
to workaround some H/W limitation and will make the implementation more
difficult for more capable H/W.

What kind of problems do you foresee with the above definition?

> > +};
> > +
> > +/**
> > + * enum net_shaper_scope - the different scopes where a shaper could be attached
> > + * @NET_SHAPER_SCOPE_PORT:   The root shaper for the whole H/W.
> > + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network device.
> > + * @NET_SHAPER_SCOPE_VF:     The shaper is attached to the given virtual
> > + * function.
> > + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple queues under the
> > + * same device.
> > + * @NET_SHAPER_SCOPE_QUEUE:  The shaper is attached to the given device queue.
> 
> I wonder if we need traffic class? Some devices may have two schedulers,
> one from the host interfaces (PCIe) into the device buffer. And then
> from the device buffer independently into the wire.

I feel like I'm really missing your point here. How would you use
traffic class? And how the 2 schedulers come into play here? Each of
them will be tied to one or more of the scopes above, why exposing H/W
details that will not expand user visible features?

> > + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on
> > + * PF devices, usually inside the host/hypervisor.
> > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and
> > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.
> > + */
> > +enum net_shaper_scope {
> > +	NET_SHAPER_SCOPE_PORT,
> > +	NET_SHAPER_SCOPE_NETDEV,
> > +	NET_SHAPER_SCOPE_VF,
> 
> I realized now that we do indeed need this VF node (if we want to
> internally express the legacy SRIOV NDOs in this API), as much 
> as I hate it. Could you annotate somehow my nack on ever exposing
> the ability to hook on the VF to user space?

This work sparked from the need to allow configuring a shaper on
specific queues of some VF from the host. I hope this is not what you
are nacking here? Could you please elaborate a bit what concern you
with 'hook on the VF to user space'? Would the ability of attaching a
shaper to the VF from the host hit your nack?

> 
> > +	NET_SHAPER_SCOPE_QUEUE_GROUP,
> 
> We may need a definition for a queue group. Did I suggest this?

I think this was mentioned separately by you, John Fastabend and Intel.

> Isn't queue group just a bunch of queues feeding a trivial RR node?
> Why does it need to be a "scope"?

The goal is allowing arbitrary manipulation at the queue group level.
e.g. you can have different queue groups with different priority, or
weigh or shaping, and below them arbitrary shaping at the queue level.

Note that a similar concept could be introduced for device (or VFs)
groups.

Why would you like to constraint the features avail at the queue
groups?

> 
> > +	NET_SHAPER_SCOPE_QUEUE,
> > +};
> > +
> > +/**
> > + * struct net_shaper_ops - Operations on device H/W shapers
> > + * @add: Creates a new shaper in the specified scope.
> 
> "in a scope"? Isn't the scope just defining the ingress and egress
> points of the scheduling hierarchy?

This is purely lexical matter, right? The scope, and more specifically
the full 'handle' comprising more scoped-related information, specifies
'where' the shaper is located. Do you have a suggested alternative
wording?

> Also your example moves schedulers from queue scope to queue group
> scope.

In the example, this part creates/enables a shaper at the queue level:

	u32 ghandle = shaper_make_handle(NET_SHAPER_SCOPE_QUEUE_GROUP, 0, 0);
	dev->shaper_ops->add(dev, ghandle, &ginfo);

and this:
	
	u32 handle = shaper_make_handle(NET_SHAPER_SCOPE_QUEUE, 0, queue_id);
	//...
	dev->netshaper_ops->move(dev, qhandle, ghandle, NULL);

changes the _parent_ of qhandle setting it to the previously creates
queue group shaper. qhandle initial/implicit/default parent was the
device scope shaper.

The scope of the queue shaper remains unchanged. An I misunderstanding
your point?


> 
> > + * @set: Modify the existing shaper.
> > + * @delete: Delete the specified shaper.
> > + * @move: Move an existing shaper under a different parent.
> > + *
> > + * The initial shaping configuration ad device initialization is empty/
> 
> and
> 
> > + * a no-op/does not constraint the b/w in any way.
> > + * The network core keeps track of the applied user-configuration in
> > + * per device storage.
> 
> "keeps track .. per device" -- "storage" may make people think NVM.
> 
> > + * Each shaper is uniquely identified within the device with an 'handle',
> > + * dependent on the shaper scope and other data, see @shaper_make_handle()
> > + */
> > +struct net_shaper_ops {
> > +	/** add - Add a shaper inside the shaper hierarchy
> > +	 * @dev: netdevice to operate on
> > +	 * @handle: the shaper indetifier
> > +	 * @shaper: configuration of shaper
> > +	 * @extack: Netlink extended ACK for reporting errors.
> > +	 *
> > +	 * Return:
> > +	 * * 0 on success
> > +	 * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
> > +	 *                  or core for any reason. @extack should be set to
> > +	 *                  text describing the reason.
> > +	 * * Other negative error values on failure.
> > +	 *
> > +	 * Examples or reasons this operation may fail include:
> > +	 * * H/W resources limits.
> > +	 * * Can’t respect the requested bw limits.
> > +	 */
> > +	int (*add)(struct net_device *dev, u32 handle,
> > +		   const struct net_shaper_info *shaper,
> > +		   struct netlink_ext_ack *extack);
> > +
> > +	/** set - Update the specified shaper, if it exists
> 
> Why "if it exists" ? Core should make sure it exists, no?
> 
> In addition to ops and state, the device will likely need to express
> capabilities of some sort. So that the core can do some work for the
> drivers and in due course we can expose them to user space for
> discoverability.

What kind of capabilities are you thinking about? supported scopes?
supported metrics? What else? I feel like there is a lot of
mixed/partial kind of support which is hard to express in a formal way
but would fit nicely an extended ack for a failing op - as the SP/WRR
constrains Andrew reported.

Do we need to introduce this introspection support from the start? I
think that having a few H/W implementations around would help (at least
me) understanding which properties could relevant here.

Thanks,

Paolo
Jakub Kicinski May 31, 2024, 4 p.m. UTC | #18
On Fri, 31 May 2024 11:22:46 +0200 Paolo Abeni wrote:
> On Tue, 2024-05-28 at 10:18 -0700, Jakub Kicinski wrote:
> > > +	u32 priority;	/* scheduling strict priority */
> > > +	u32 weight;	/* scheduling WRR weight*/  
> > 
> > I wonder if we should somehow more clearly specify what a node can do.
> > Like Andrew pointed out, if we have a WRR node, presumably the weights
> > go on the children, since there's only one weigh param. But then the
> > WRR node itself is either empty (no params) or has rate params... which
> > is odd.
> > 
> > Maybe shaping nodes and RR nodes should be separate node classes,
> > somehow?  
> 
> Possibly clarifying the meaning of 'weight' field would help?  
> It means: this node is scheduled WRR according to the specified weight
> among the sibling shapers with the same priority.
> 
> I think it's quite simpler than introducing different node classes with
> separate handling. My understanding is that the latter would help only
> to workaround some H/W limitation and will make the implementation more
> difficult for more capable H/W.
> 
> What kind of problems do you foresee with the above definition?

The problem Andrew mentioned, basically. There may not be a path to
transition one fully offloaded hierarchy to another (e.g. switching
strict prio to WRR). TBH I haven't really grasped your proposal in:
https://lore.kernel.org/all/db51b7ccff835dd5a96293fb84d527be081de062.camel@redhat.com/

> > > + * enum net_shaper_scope - the different scopes where a shaper could be attached
> > > + * @NET_SHAPER_SCOPE_PORT:   The root shaper for the whole H/W.
> > > + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network device.
> > > + * @NET_SHAPER_SCOPE_VF:     The shaper is attached to the given virtual
> > > + * function.
> > > + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple queues under the
> > > + * same device.
> > > + * @NET_SHAPER_SCOPE_QUEUE:  The shaper is attached to the given device queue.  
> > 
> > I wonder if we need traffic class? Some devices may have two schedulers,
> > one from the host interfaces (PCIe) into the device buffer. And then
> > from the device buffer independently into the wire.  
> 
> I feel like I'm really missing your point here. How would you use
> traffic class? And how the 2 schedulers come into play here? Each of
> them will be tied to one or more of the scopes above, why exposing H/W
> details that will not expand user visible features?

I was just thinking aloud, I'm not sure anyone would ever use a single
host queue to service (ingress) traffic from multiple TCs.
Or if any HW actually supports that. 

We can add it later.

> > > + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on
> > > + * PF devices, usually inside the host/hypervisor.
> > > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and
> > > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.
> > > + */
> > > +enum net_shaper_scope {
> > > +	NET_SHAPER_SCOPE_PORT,
> > > +	NET_SHAPER_SCOPE_NETDEV,
> > > +	NET_SHAPER_SCOPE_VF,  
> > 
> > I realized now that we do indeed need this VF node (if we want to
> > internally express the legacy SRIOV NDOs in this API), as much 
> > as I hate it. Could you annotate somehow my nack on ever exposing
> > the ability to hook on the VF to user space?  
> 
> This work sparked from the need to allow configuring a shaper on
> specific queues of some VF from the host. I hope this is not what you
> are nacking here? Could you please elaborate a bit what concern you
> with 'hook on the VF to user space'? Would the ability of attaching a
> shaper to the VF from the host hit your nack?

Queue configuration, for the VF, from the hypervisor?
I thought it was from the VF.
In any case, hypervisor has the representors. 
Use the representor's NETDEV scope?

> > > +	NET_SHAPER_SCOPE_QUEUE_GROUP,  
> > 
> > We may need a definition for a queue group. Did I suggest this?  
> 
> I think this was mentioned separately by you, John Fastabend and Intel.

Oh ugh, I can't type. I think I meant to say "Why do we need..."

> > Isn't queue group just a bunch of queues feeding a trivial RR node?
> > Why does it need to be a "scope"?  
> 
> The goal is allowing arbitrary manipulation at the queue group level.
> e.g. you can have different queue groups with different priority, or
> weigh or shaping, and below them arbitrary shaping at the queue level.
> 
> Note that a similar concept could be introduced for device (or VFs)
> groups.
> 
> Why would you like to constraint the features avail at the queue
> groups?

Wait! You don't have a way to create pure RR nodes other than queue
group now? Perhaps that's what I'm missing...

> > > +	NET_SHAPER_SCOPE_QUEUE,
> > > +};
> > > +
> > > +/**
> > > + * struct net_shaper_ops - Operations on device H/W shapers
> > > + * @add: Creates a new shaper in the specified scope.  
> > 
> > "in a scope"? Isn't the scope just defining the ingress and egress
> > points of the scheduling hierarchy?  
> 
> This is purely lexical matter, right? The scope, and more specifically
> the full 'handle' comprising more scoped-related information, specifies
> 'where' the shaper is located. Do you have a suggested alternative
> wording?

... and also what confused me here.

How are you going to do 2 layers of grouping with arbitrary shaping?
We need arbitrary inner nodes. Unless I'm missing a trick.

> > Also your example moves schedulers from queue scope to queue group
> > scope.  
> 
> In the example, this part creates/enables a shaper at the queue level:
> 
> 	u32 ghandle = shaper_make_handle(NET_SHAPER_SCOPE_QUEUE_GROUP, 0, 0);
> 	dev->shaper_ops->add(dev, ghandle, &ginfo);
> 
> and this:
> 	
> 	u32 handle = shaper_make_handle(NET_SHAPER_SCOPE_QUEUE, 0, queue_id);
> 	//...
> 	dev->netshaper_ops->move(dev, qhandle, ghandle, NULL);
> 
> changes the _parent_ of qhandle setting it to the previously creates
> queue group shaper. qhandle initial/implicit/default parent was the
> device scope shaper.
> 
> The scope of the queue shaper remains unchanged. An I misunderstanding
> your point?
> 

> > Why "if it exists" ? Core should make sure it exists, no?
> > 
> > In addition to ops and state, the device will likely need to express
> > capabilities of some sort. So that the core can do some work for the
> > drivers and in due course we can expose them to user space for
> > discoverability.  
> 
> What kind of capabilities are you thinking about? supported scopes?
> supported metrics? What else? I feel like there is a lot of
> mixed/partial kind of support which is hard to express in a formal way
> but would fit nicely an extended ack for a failing op - as the SP/WRR
> constrains Andrew reported.
> 
> Do we need to introduce this introspection support from the start? I
> think that having a few H/W implementations around would help (at least
> me) understanding which properties could relevant here.

Try to write good tests which can run on HW for more than one
vendor. The introspection and capabilities will become apparent.
Paolo Abeni June 3, 2024, 11:11 a.m. UTC | #19
Hi,

It looks like most of the open points here are related lack of clarity
or misunderstanding. I propose to discuss them in the upcoming
reviewer's meeting.

Some replies below, just in case I magically achieved superior written
natural language skills meanwhile ;) 

On Fri, 2024-05-31 at 09:00 -0700, Jakub Kicinski wrote:
> On Fri, 31 May 2024 11:22:46 +0200 Paolo Abeni wrote:
> > On Tue, 2024-05-28 at 10:18 -0700, Jakub Kicinski wrote:
> > > > +	u32 priority;	/* scheduling strict priority */
> > > > +	u32 weight;	/* scheduling WRR weight*/  
> > > 
> > > I wonder if we should somehow more clearly specify what a node can do.
> > > Like Andrew pointed out, if we have a WRR node, presumably the weights
> > > go on the children, since there's only one weigh param. But then the
> > > WRR node itself is either empty (no params) or has rate params... which
> > > is odd.
> > > 
> > > Maybe shaping nodes and RR nodes should be separate node classes,
> > > somehow?  
> > 
> > Possibly clarifying the meaning of 'weight' field would help?  
> > It means: this node is scheduled WRR according to the specified weight
> > among the sibling shapers with the same priority.
> > 
> > I think it's quite simpler than introducing different node classes with
> > separate handling. My understanding is that the latter would help only
> > to workaround some H/W limitation and will make the implementation more
> > difficult for more capable H/W.
> > 
> > What kind of problems do you foresee with the above definition?
> 
> The problem Andrew mentioned, basically. There may not be a path to
> transition one fully offloaded hierarchy to another (e.g. switching
> strict prio to WRR). TBH I haven't really grasped your proposal in:
> https://lore.kernel.org/all/db51b7ccff835dd5a96293fb84d527be081de062.camel@redhat.com/

The problem Andrew reported is that some H/W, in some configuration,
can't change per-queue shapers individually.

The solution proposed in the above link is to let the API change
multiple shapers with a single op "atomically".

> > > > + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on
> > > > + * PF devices, usually inside the host/hypervisor.
> > > > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and
> > > > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.
> > > > + */
> > > > +enum net_shaper_scope {
> > > > +	NET_SHAPER_SCOPE_PORT,
> > > > +	NET_SHAPER_SCOPE_NETDEV,
> > > > +	NET_SHAPER_SCOPE_VF,  
> > > 
> > > I realized now that we do indeed need this VF node (if we want to
> > > internally express the legacy SRIOV NDOs in this API), as much 
> > > as I hate it. Could you annotate somehow my nack on ever exposing
> > > the ability to hook on the VF to user space?  
> > 
> > This work sparked from the need to allow configuring a shaper on
> > specific queues of some VF from the host. I hope this is not what you
> > are nacking here? Could you please elaborate a bit what concern you
> > with 'hook on the VF to user space'? Would the ability of attaching a
> > shaper to the VF from the host hit your nack?
> 
> Queue configuration, for the VF, from the hypervisor?
> I thought it was from the VF.
> In any case, hypervisor has the representors. 
> Use the representor's NETDEV scope?

It looks like even the NETDEV scope + representor alternative should
fit the intended use-case.

> > > > +	NET_SHAPER_SCOPE_QUEUE_GROUP,  
> > > 
> > > We may need a definition for a queue group. Did I suggest this?  
> > 
> > I think this was mentioned separately by you, John Fastabend and Intel.
> 
> Oh ugh, I can't type. I think I meant to say "Why do we need..."
> 
> > > Isn't queue group just a bunch of queues feeding a trivial RR node?
> > > Why does it need to be a "scope"?  
> > 
> > The goal is allowing arbitrary manipulation at the queue group level.
> > e.g. you can have different queue groups with different priority, or
> > weigh or shaping, and below them arbitrary shaping at the queue level.
> > 
> > Note that a similar concept could be introduced for device (or VFs)
> > groups.
> > 
> > Why would you like to constraint the features avail at the queue
> > groups?
> 
> Wait! You don't have a way to create pure RR nodes other than queue
> group now? Perhaps that's what I'm missing...

No, it's not needed. To have RR on some queues, just set the same
weight and priority on them.

Queue groups could be used to do something more complex, e.g. shaping
on the specified set of RR queues.

> > > > +	NET_SHAPER_SCOPE_QUEUE,
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct net_shaper_ops - Operations on device H/W shapers
> > > > + * @add: Creates a new shaper in the specified scope.  
> > > 
> > > "in a scope"? Isn't the scope just defining the ingress and egress
> > > points of the scheduling hierarchy?  
> > 
> > This is purely lexical matter, right? The scope, and more specifically
> > the full 'handle' comprising more scoped-related information, specifies
> > 'where' the shaper is located. Do you have a suggested alternative
> > wording?
> 
> ... and also what confused me here.
> 
> How are you going to do 2 layers of grouping with arbitrary shaping?
> We need arbitrary inner nodes. Unless I'm missing a trick.

I guess this part really needs some talk. I also don't understand your
doubt above.

> > 
I hope we can discuss this tomorrow (and the other points, if/as
needed).

Thanks!

Paolo
Jakub Kicinski June 3, 2024, 11:29 p.m. UTC | #20
On Mon, 03 Jun 2024 13:11:39 +0200 Paolo Abeni wrote:
> > ... and also what confused me here.
> > 
> > How are you going to do 2 layers of grouping with arbitrary shaping?
> > We need arbitrary inner nodes. Unless I'm missing a trick.  
> 
> I guess this part really needs some talk. I also don't understand your
> doubt above.

Each layer of shaping corresponds to some level of privilege or control.
Saying that we only need one layer is like saying that we only need one
layer of cgroups.

For example - application may want to WRR/shape between two sets of
queues (e.g. data and control) and then we may wrap that application 
up in a container, and WRR/shape multiple containers together.
Cosmin Ratiu June 5, 2024, 3:04 p.m. UTC | #21
On Wed, 2024-05-08 at 22:20 +0200, Paolo Abeni wrote:

> +/**
> + * struct net_shaper_info - represents a shaping node on the NIC H/W
> + * @metric: Specify if the bw limits refers to PPS or BPS
> + * @bw_min: Minimum guaranteed rate for this shaper
> + * @bw_max: Maximum peak bw allowed for this shaper
> + * @burst: Maximum burst for the peek rate of this shaper
> + * @priority: Scheduling priority for this shaper
> + * @weight: Scheduling weight for this shaper
> + */
> +struct net_shaper_info {
> +	enum net_shaper_metric metric;
> +	u64 bw_min;	/* minimum guaranteed bandwidth, according to metric */
> +	u64 bw_max;	/* maximum allowed bandwidth */
> +	u32 burst;	/* maximum burst in bytes for bw_max */

'burst' really should be u64 if it can deal with bytes. In a 400Gbps
link, u32 really is peanuts.

> +/**
> + * enum net_shaper_scope - the different scopes where a shaper could be attached
> + * @NET_SHAPER_SCOPE_PORT:   The root shaper for the whole H/W.
> + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network device.
> + * @NET_SHAPER_SCOPE_VF:     The shaper is attached to the given virtual
> + * function.
> + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple queues under the
> + * same device.
> + * @NET_SHAPER_SCOPE_QUEUE:  The shaper is attached to the given device queue.
> + *
> + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on
> + * PF devices, usually inside the host/hypervisor.
> + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and
> + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.
> + */
> +enum net_shaper_scope {
> +	NET_SHAPER_SCOPE_PORT,
> +	NET_SHAPER_SCOPE_NETDEV,
> +	NET_SHAPER_SCOPE_VF,
> +	NET_SHAPER_SCOPE_QUEUE_GROUP,
> +	NET_SHAPER_SCOPE_QUEUE,
> +};

How would modelling groups of VFs (as implemented in [1]) look like
with this proposal?
I could imagine a NET_SHAPER_SCOPE_VF_GROUP scope, with a shared shaper
across multiple VFs. How would managing membership of VFs in a group
look like? Will the devlink API continue to be used for that? Or will
something else be introduced?

Looking a bit into the future now...
I am nowadays thinking about extending the mlx5 VF group rate limit
feature to support VFs from multiple PFs from the same NIC (the
hardware can be configured to use a shared shaper across multiple
ports), how could that feature be represented in this API, given that
ops relate to a netdevice? Which netdevice should be used for this
scenario?
In that world, there would be multiple 'root'-level nodes in this
hierarchy, each corresponding to a group of VFs from potentially
multiple PFs.

[1]
https://lore.kernel.org/netdev/1622636251-29892-1-git-send-email-dlinkin@nvidia.com/T/#u

Cosmin.
Paolo Abeni June 5, 2024, 3:52 p.m. UTC | #22
On Wed, 2024-06-05 at 15:04 +0000, Cosmin Ratiu wrote:
> On Wed, 2024-05-08 at 22:20 +0200, Paolo Abeni wrote:
> 
> > +/**
> > + * struct net_shaper_info - represents a shaping node on the NIC H/W
> > + * @metric: Specify if the bw limits refers to PPS or BPS
> > + * @bw_min: Minimum guaranteed rate for this shaper
> > + * @bw_max: Maximum peak bw allowed for this shaper
> > + * @burst: Maximum burst for the peek rate of this shaper
> > + * @priority: Scheduling priority for this shaper
> > + * @weight: Scheduling weight for this shaper
> > + */
> > +struct net_shaper_info {
> > +	enum net_shaper_metric metric;
> > +	u64 bw_min;	/* minimum guaranteed bandwidth, according to metric */
> > +	u64 bw_max;	/* maximum allowed bandwidth */
> > +	u32 burst;	/* maximum burst in bytes for bw_max */
> 
> 'burst' really should be u64 if it can deal with bytes. In a 400Gbps
> link, u32 really is peanuts.
> 
> > +/**
> > + * enum net_shaper_scope - the different scopes where a shaper could be attached
> > + * @NET_SHAPER_SCOPE_PORT:   The root shaper for the whole H/W.
> > + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network device.
> > + * @NET_SHAPER_SCOPE_VF:     The shaper is attached to the given virtual
> > + * function.
> > + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple queues under the
> > + * same device.
> > + * @NET_SHAPER_SCOPE_QUEUE:  The shaper is attached to the given device queue.
> > + *
> > + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on
> > + * PF devices, usually inside the host/hypervisor.
> > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and
> > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.
> > + */
> > +enum net_shaper_scope {
> > +	NET_SHAPER_SCOPE_PORT,
> > +	NET_SHAPER_SCOPE_NETDEV,
> > +	NET_SHAPER_SCOPE_VF,
> > +	NET_SHAPER_SCOPE_QUEUE_GROUP,
> > +	NET_SHAPER_SCOPE_QUEUE,
> > +};
> 
> How would modelling groups of VFs (as implemented in [1]) look like
> with this proposal?
> I could imagine a NET_SHAPER_SCOPE_VF_GROUP scope, with a shared shaper
> across multiple VFs. 

Following-up yday reviewer mtg - which was spent mainly on this topic -
- the current direction is to replace NET_SHAPER_SCOPE_QUEUE_GROUP with
a more generic 'scope', grouping of either queues, VF/netdev or even
other groups (allowing nesting).

> How would managing membership of VFs in a group
> look like? Will the devlink API continue to be used for that? Or will
> something else be introduced?

The idea is to introduce a new generic netlink interface, yaml-based,
to expose these features to user-space.

> Looking a bit into the future now...
> I am nowadays thinking about extending the mlx5 VF group rate limit
> feature to support VFs from multiple PFs from the same NIC (the
> hardware can be configured to use a shared shaper across multiple
> ports), how could that feature be represented in this API, given that
> ops relate to a netdevice? Which netdevice should be used for this
> scenario?

I must admit we[1] haven't thought yet about the scenario you describe
above. I guess we could encode the PF number and the VF number in the
handle major/minor and operate on any PF device belonging to the same
silicon, WDYT?

Thanks,

Paolo

[1] or at least myself;)
Jakub Kicinski June 5, 2024, 7:40 p.m. UTC | #23
On Wed, 05 Jun 2024 17:52:32 +0200 Paolo Abeni wrote:
> > Looking a bit into the future now...
> > I am nowadays thinking about extending the mlx5 VF group rate limit
> > feature to support VFs from multiple PFs from the same NIC (the
> > hardware can be configured to use a shared shaper across multiple
> > ports), how could that feature be represented in this API, given that
> > ops relate to a netdevice? Which netdevice should be used for this
> > scenario?  
> 
> I must admit we[1] haven't thought yet about the scenario you describe
> above. I guess we could encode the PF number and the VF number in the
> handle major/minor and operate on any PF device belonging to the same
> silicon, WDYT?

Just a minor clarification. _Internally_ we can support expressing VF /
PF shaping. uAPI for that continues to be devlink, right?
Jiri Pirko July 30, 2024, 12:10 p.m. UTC | #24
Wed, Jun 05, 2024 at 05:52:32PM CEST, pabeni@redhat.com wrote:
>On Wed, 2024-06-05 at 15:04 +0000, Cosmin Ratiu wrote:
>> On Wed, 2024-05-08 at 22:20 +0200, Paolo Abeni wrote:
>> 
>> > +/**
>> > + * struct net_shaper_info - represents a shaping node on the NIC H/W
>> > + * @metric: Specify if the bw limits refers to PPS or BPS
>> > + * @bw_min: Minimum guaranteed rate for this shaper
>> > + * @bw_max: Maximum peak bw allowed for this shaper
>> > + * @burst: Maximum burst for the peek rate of this shaper
>> > + * @priority: Scheduling priority for this shaper
>> > + * @weight: Scheduling weight for this shaper
>> > + */
>> > +struct net_shaper_info {
>> > +	enum net_shaper_metric metric;
>> > +	u64 bw_min;	/* minimum guaranteed bandwidth, according to metric */
>> > +	u64 bw_max;	/* maximum allowed bandwidth */
>> > +	u32 burst;	/* maximum burst in bytes for bw_max */
>> 
>> 'burst' really should be u64 if it can deal with bytes. In a 400Gbps
>> link, u32 really is peanuts.
>> 
>> > +/**
>> > + * enum net_shaper_scope - the different scopes where a shaper could be attached
>> > + * @NET_SHAPER_SCOPE_PORT:   The root shaper for the whole H/W.
>> > + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network device.
>> > + * @NET_SHAPER_SCOPE_VF:     The shaper is attached to the given virtual
>> > + * function.
>> > + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple queues under the
>> > + * same device.
>> > + * @NET_SHAPER_SCOPE_QUEUE:  The shaper is attached to the given device queue.
>> > + *
>> > + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on
>> > + * PF devices, usually inside the host/hypervisor.
>> > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and
>> > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.
>> > + */
>> > +enum net_shaper_scope {
>> > +	NET_SHAPER_SCOPE_PORT,
>> > +	NET_SHAPER_SCOPE_NETDEV,
>> > +	NET_SHAPER_SCOPE_VF,
>> > +	NET_SHAPER_SCOPE_QUEUE_GROUP,
>> > +	NET_SHAPER_SCOPE_QUEUE,
>> > +};
>> 
>> How would modelling groups of VFs (as implemented in [1]) look like
>> with this proposal?
>> I could imagine a NET_SHAPER_SCOPE_VF_GROUP scope, with a shared shaper
>> across multiple VFs. 
>
>Following-up yday reviewer mtg - which was spent mainly on this topic -
>- the current direction is to replace NET_SHAPER_SCOPE_QUEUE_GROUP with
>a more generic 'scope', grouping of either queues, VF/netdev or even
>other groups (allowing nesting).
>
>> How would managing membership of VFs in a group
>> look like? Will the devlink API continue to be used for that? Or will
>> something else be introduced?
>
>The idea is to introduce a new generic netlink interface, yaml-based,
>to expose these features to user-space.
>
>> Looking a bit into the future now...
>> I am nowadays thinking about extending the mlx5 VF group rate limit
>> feature to support VFs from multiple PFs from the same NIC (the
>> hardware can be configured to use a shared shaper across multiple
>> ports), how could that feature be represented in this API, given that
>> ops relate to a netdevice? Which netdevice should be used for this
>> scenario?
>
>I must admit we[1] haven't thought yet about the scenario you describe
>above. I guess we could encode the PF number and the VF number in the
>handle major/minor and operate on any PF device belonging to the same
>silicon, WDYT?

Sometimes, there is no netdevice at all. The infra still should work I
believe.


>
>Thanks,
>
>Paolo
>
>[1] or at least myself;)
>
Jiri Pirko July 30, 2024, 12:18 p.m. UTC | #25
Wed, May 08, 2024 at 10:20:51PM CEST, pabeni@redhat.com wrote:


>+ * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.

This is interesting. Do you mean you can put a shaper on a specific VF
queue from hypervisor? I was thinking about it recently, I have some
concerns.

In general a nic user expects all queues to behave in the same way,
unless he does some sort of configuration (dcb for example).
VF (the VM side) is not different, it's also a nic.

If you allow the hypervisor to configure shapers on specifig VF queues,
you are breaking VM's user expectation. He did not configure any
different queue treating, yet they are treated differently.

Is that okay? What do you think?
Paolo Abeni July 30, 2024, 1:34 p.m. UTC | #26
Hi,

On 7/30/24 14:18, Jiri Pirko wrote:
> Wed, May 08, 2024 at 10:20:51PM CEST, pabeni@redhat.com wrote:
> 
>> + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.
> 
> This is interesting. Do you mean you can put a shaper on a specific VF
> queue from hypervisor? I was thinking about it recently, I have some
> concerns.
> 
> In general a nic user expects all queues to behave in the same way,
> unless he does some sort of configuration (dcb for example).
> VF (the VM side) is not different, it's also a nic.
> 
> If you allow the hypervisor to configure shapers on specifig VF queues,
> you are breaking VM's user expectation. He did not configure any
> different queue treating, yet they are treated differently.
> 
> Is that okay? What do you think?

I'm unsure why you are looking to this old version...

The idea to allow configuring the VF's queues from the hypervisor has 
been removed from the most recent version, for roughly the same reasons 
you mention above.

Cheers,

Paolo
Paolo Abeni July 30, 2024, 1:37 p.m. UTC | #27
On 7/30/24 14:10, Jiri Pirko wrote:
> Wed, Jun 05, 2024 at 05:52:32PM CEST, pabeni@redhat.com wrote:
>> On Wed, 2024-06-05 at 15:04 +0000, Cosmin Ratiu wrote:
>>> On Wed, 2024-05-08 at 22:20 +0200, Paolo Abeni wrote:
>>>
>>>> +/**
>>>> + * struct net_shaper_info - represents a shaping node on the NIC H/W
>>>> + * @metric: Specify if the bw limits refers to PPS or BPS
>>>> + * @bw_min: Minimum guaranteed rate for this shaper
>>>> + * @bw_max: Maximum peak bw allowed for this shaper
>>>> + * @burst: Maximum burst for the peek rate of this shaper
>>>> + * @priority: Scheduling priority for this shaper
>>>> + * @weight: Scheduling weight for this shaper
>>>> + */
>>>> +struct net_shaper_info {
>>>> +	enum net_shaper_metric metric;
>>>> +	u64 bw_min;	/* minimum guaranteed bandwidth, according to metric */
>>>> +	u64 bw_max;	/* maximum allowed bandwidth */
>>>> +	u32 burst;	/* maximum burst in bytes for bw_max */
>>>
>>> 'burst' really should be u64 if it can deal with bytes. In a 400Gbps
>>> link, u32 really is peanuts.
>>>
>>>> +/**
>>>> + * enum net_shaper_scope - the different scopes where a shaper could be attached
>>>> + * @NET_SHAPER_SCOPE_PORT:   The root shaper for the whole H/W.
>>>> + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network device.
>>>> + * @NET_SHAPER_SCOPE_VF:     The shaper is attached to the given virtual
>>>> + * function.
>>>> + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple queues under the
>>>> + * same device.
>>>> + * @NET_SHAPER_SCOPE_QUEUE:  The shaper is attached to the given device queue.
>>>> + *
>>>> + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on
>>>> + * PF devices, usually inside the host/hypervisor.
>>>> + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and
>>>> + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.
>>>> + */
>>>> +enum net_shaper_scope {
>>>> +	NET_SHAPER_SCOPE_PORT,
>>>> +	NET_SHAPER_SCOPE_NETDEV,
>>>> +	NET_SHAPER_SCOPE_VF,
>>>> +	NET_SHAPER_SCOPE_QUEUE_GROUP,
>>>> +	NET_SHAPER_SCOPE_QUEUE,
>>>> +};
>>>
>>> How would modelling groups of VFs (as implemented in [1]) look like
>>> with this proposal?
>>> I could imagine a NET_SHAPER_SCOPE_VF_GROUP scope, with a shared shaper
>>> across multiple VFs.
>>
>> Following-up yday reviewer mtg - which was spent mainly on this topic -
>> - the current direction is to replace NET_SHAPER_SCOPE_QUEUE_GROUP with
>> a more generic 'scope', grouping of either queues, VF/netdev or even
>> other groups (allowing nesting).
>>
>>> How would managing membership of VFs in a group
>>> look like? Will the devlink API continue to be used for that? Or will
>>> something else be introduced?
>>
>> The idea is to introduce a new generic netlink interface, yaml-based,
>> to expose these features to user-space.
>>
>>> Looking a bit into the future now...
>>> I am nowadays thinking about extending the mlx5 VF group rate limit
>>> feature to support VFs from multiple PFs from the same NIC (the
>>> hardware can be configured to use a shared shaper across multiple
>>> ports), how could that feature be represented in this API, given that
>>> ops relate to a netdevice? Which netdevice should be used for this
>>> scenario?
>>
>> I must admit we[1] haven't thought yet about the scenario you describe
>> above. I guess we could encode the PF number and the VF number in the
>> handle major/minor and operate on any PF device belonging to the same
>> silicon, WDYT?
> 
> Sometimes, there is no netdevice at all. The infra still should work I
> believe.

Note that in the most recent incarnation of the shaper APIs has been 
removed any support for shaper 'above' the network device level (e.g. no 
device/VFs groups). The idea is that devlink should be used for such 
scenarios.

Cheers,

Paolo
Jiri Pirko July 31, 2024, 4:03 p.m. UTC | #28
Tue, Jul 30, 2024 at 03:34:24PM CEST, pabeni@redhat.com wrote:
>Hi,
>
>On 7/30/24 14:18, Jiri Pirko wrote:
>> Wed, May 08, 2024 at 10:20:51PM CEST, pabeni@redhat.com wrote:
>> 
>> > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.
>> 
>> This is interesting. Do you mean you can put a shaper on a specific VF
>> queue from hypervisor? I was thinking about it recently, I have some
>> concerns.
>> 
>> In general a nic user expects all queues to behave in the same way,
>> unless he does some sort of configuration (dcb for example).
>> VF (the VM side) is not different, it's also a nic.
>> 
>> If you allow the hypervisor to configure shapers on specifig VF queues,
>> you are breaking VM's user expectation. He did not configure any
>> different queue treating, yet they are treated differently.
>> 
>> Is that okay? What do you think?
>
>I'm unsure why you are looking to this old version...
>
>The idea to allow configuring the VF's queues from the hypervisor has been
>removed from the most recent version, for roughly the same reasons you
>mention above.

Okay. Thanks!

>
>Cheers,
>
>Paolo
>
>
diff mbox series

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf261fb89d73..39f66af014be 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -79,6 +79,8 @@  struct xdp_buff;
 struct xdp_frame;
 struct xdp_metadata_ops;
 struct xdp_md;
+struct net_shaper_ops;
+struct net_shaper_data;
 
 typedef u32 xdp_features_t;
 
@@ -1596,6 +1598,13 @@  struct net_device_ops {
 	int			(*ndo_hwtstamp_set)(struct net_device *dev,
 						    struct kernel_hwtstamp_config *kernel_config,
 						    struct netlink_ext_ack *extack);
+
+#if IS_ENABLED(CONFIG_NET_SHAPER)
+	/** @net_shaper_ops: Device shaping offload operations
+	 * see include/net/net_shapers.h
+	 */
+	const struct net_shaper_ops *net_shaper_ops;
+#endif
 };
 
 /**
@@ -2403,6 +2412,12 @@  struct net_device {
 	/** @page_pools: page pools created for this netdevice */
 	struct hlist_head	page_pools;
 #endif
+#if IS_ENABLED(CONFIG_NET_SHAPER)
+	/** @net_shaper_data: data tracking the current shaper status
+	 *  see include/net/net_shapers.h
+	 */
+	struct net_shaper_data *net_shaper_data;
+#endif
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
diff --git a/include/net/net_shaper.h b/include/net/net_shaper.h
new file mode 100644
index 000000000000..a4fbadd99870
--- /dev/null
+++ b/include/net/net_shaper.h
@@ -0,0 +1,206 @@ 
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#ifndef _NET_SHAPER_H_
+#define _NET_SHAPER_H_
+
+#include <linux/types.h>
+#include <linux/netdevice.h>
+#include <linux/netlink.h>
+
+/**
+ * enum net_shaper_metric - the metric of the shaper
+ * @NET_SHAPER_METRIC_PPS: Shaper operates on a packets per second basis
+ * @NET_SHAPER_METRIC_BPS: Shaper operates on a bits per second basis
+ */
+enum net_shaper_metric {
+	NET_SHAPER_METRIC_PPS,
+	NET_SHAPER_METRIC_BPS
+};
+
+/**
+ * struct net_shaper_info - represents a shaping node on the NIC H/W
+ * @metric: Specify if the bw limits refers to PPS or BPS
+ * @bw_min: Minimum guaranteed rate for this shaper
+ * @bw_max: Maximum peak bw allowed for this shaper
+ * @burst: Maximum burst for the peek rate of this shaper
+ * @priority: Scheduling priority for this shaper
+ * @weight: Scheduling weight for this shaper
+ */
+struct net_shaper_info {
+	enum net_shaper_metric metric;
+	u64 bw_min;	/* minimum guaranteed bandwidth, according to metric */
+	u64 bw_max;	/* maximum allowed bandwidth */
+	u32 burst;	/* maximum burst in bytes for bw_max */
+	u32 priority;	/* scheduling strict priority */
+	u32 weight;	/* scheduling WRR weight*/
+};
+
+/**
+ * enum net_shaper_scope - the different scopes where a shaper could be attached
+ * @NET_SHAPER_SCOPE_PORT:   The root shaper for the whole H/W.
+ * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network device.
+ * @NET_SHAPER_SCOPE_VF:     The shaper is attached to the given virtual
+ * function.
+ * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple queues under the
+ * same device.
+ * @NET_SHAPER_SCOPE_QUEUE:  The shaper is attached to the given device queue.
+ *
+ * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on
+ * PF devices, usually inside the host/hypervisor.
+ * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and
+ * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.
+ */
+enum net_shaper_scope {
+	NET_SHAPER_SCOPE_PORT,
+	NET_SHAPER_SCOPE_NETDEV,
+	NET_SHAPER_SCOPE_VF,
+	NET_SHAPER_SCOPE_QUEUE_GROUP,
+	NET_SHAPER_SCOPE_QUEUE,
+};
+
+/**
+ * struct net_shaper_ops - Operations on device H/W shapers
+ * @add: Creates a new shaper in the specified scope.
+ * @set: Modify the existing shaper.
+ * @delete: Delete the specified shaper.
+ * @move: Move an existing shaper under a different parent.
+ *
+ * The initial shaping configuration ad device initialization is empty/
+ * a no-op/does not constraint the b/w in any way.
+ * The network core keeps track of the applied user-configuration in
+ * per device storage.
+ *
+ * Each shaper is uniquely identified within the device with an 'handle',
+ * dependent on the shaper scope and other data, see @shaper_make_handle()
+ */
+struct net_shaper_ops {
+	/** add - Add a shaper inside the shaper hierarchy
+	 * @dev: netdevice to operate on
+	 * @handle: the shaper indetifier
+	 * @shaper: configuration of shaper
+	 * @extack: Netlink extended ACK for reporting errors.
+	 *
+	 * Return:
+	 * * 0 on success
+	 * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
+	 *                  or core for any reason. @extack should be set to
+	 *                  text describing the reason.
+	 * * Other negative error values on failure.
+	 *
+	 * Examples or reasons this operation may fail include:
+	 * * H/W resources limits.
+	 * * Can’t respect the requested bw limits.
+	 */
+	int (*add)(struct net_device *dev, u32 handle,
+		   const struct net_shaper_info *shaper,
+		   struct netlink_ext_ack *extack);
+
+	/** set - Update the specified shaper, if it exists
+	 * @dev: Netdevice to operate on.
+	 * @handle: the shaper identifier
+	 * @shaper: Configuration of shaper.
+	 * @extack: Netlink extended ACK for reporting errors.
+	 *
+	 * Return:
+	 * * %0 - Success
+	 * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
+	 *                  or core for any reason. @extack should be set to
+	 *                  text describing the reason.
+	 * * Other negative error values on failure.
+	 */
+	int (*set)(struct net_device *dev, u32 handle,
+		   const struct net_shaper_info *shaper,
+		   struct netlink_ext_ack *extack);
+
+	/** delete - Removes a shaper from the NIC
+	 * @dev: netdevice to operate on.
+	 * @handle: the shaper identifier
+	 * @extack: Netlink extended ACK for reporting errors.
+	 *
+	 * Return:
+	 * * %0 - Success
+	 * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
+	 *                  or core for any reason. @extack should be set to
+	 *                  text describing the reason.
+	 * * Other negative error value on failure.
+	 */
+	int (*delete)(struct net_device *dev, u32 handle,
+		      struct netlink_ext_ack *extack);
+
+	/** Move - change the parent id of the specified shaper
+	 * @dev: netdevice to operate on.
+	 * @handle: unique identifier for the shaper
+	 * @new_parent_id: identifier of the new parent for this shaper
+	 * @extack: Netlink extended ACK for reporting errors.
+	 *
+	 * Move the specified shaper in the hierarchy replacing its
+	 * current parent shaper with @new_parent_id
+	 *
+	 * Return:
+	 * * %0 - Success
+	 * * %-EOPNOTSUPP - Operation is not supported by hardware, driver,
+	 *                  or core for any reason. @extack should be set to
+	 *                  text describing the reason.
+	 * * Other negative error values on failure.
+	 */
+	int (*move)(struct net_device *dev, u32 handle,
+		    u32 new_parent_handle, struct netlink_ext_ack *extack);
+};
+
+/**
+ * net_shaper_make_handle - creates an unique shaper identifier
+ * @scope: the shaper scope
+ * @vf: virtual function number
+ * @id: queue group or queue id
+ *
+ * Return: an unique identifier for the shaper
+ *
+ * Combines the specified arguments to create an unique identifier for
+ * the shaper.
+ * The virtual function number is only used within @NET_SHAPER_SCOPE_VF,
+ * @NET_SHAPER_SCOPE_QUEUE_GROUP and @NET_SHAPER_SCOPE_QUEUE.
+ * The @id number is only used for @NET_SHAPER_SCOPE_QUEUE_GROUP and
+ * @NET_SHAPER_SCOPE_QUEUE, and must be, respectively, the queue group
+ * identifier or the queue number.
+ */
+u32 net_shaper_make_handle(enum net_shaper_scope scope, int vf, int id);
+
+/*
+ * Examples:
+ * - set shaping on a given queue
+ *   struct shaper_info info = { }; // fill this
+ *   u32 handle = shaper_make_handle(NET_SHAPER_SCOPE_QUEUE, 0, queue_id);
+ *   dev->shaper_ops->add(dev, handle, &info, NULL);
+ *
+ * - create a queue group with a queue group shaping limits.
+ *   Assuming the following topology already exists:
+ *                       < netdev shaper  >
+ *                        /              \
+ *               <queue 0 shaper> . . .  <queue N shaper>
+ *
+ *   struct shaper_info ginfo = { }; // fill this
+ *   u32 ghandle = shaper_make_handle(NET_SHAPER_SCOPE_QUEUE_GROUP, 0, 0);
+ *   dev->shaper_ops->add(dev, ghandle, &ginfo);
+ *
+ *   // now topology is:
+ *   //                              < netdev shaper  >
+ *   //                             /         |          \
+ *   //                            /          |       < newly created shaper  >
+ *   //                           /           |
+ *   //	<queue 0 shaper> . . .    <queue N shaper>
+ *
+ *   // move a shapers for queues 3..n out of such queue group
+ *   for (i = 0; i <= 2; ++i) {
+ *       u32 qhandle = net_shaper_make_handle(NET_SHAPER_SCOPE_QUEUE, 0, i);
+ *       dev->netshaper_ops->move(dev, qhandle, ghandle, NULL);
+ *   }
+ *
+ *   // now the topology is:
+ *   //                                < netdev shaper  >
+ *   //                                 /            \
+ *   //               < newly created shaper>   <queue 3 shaper> .. <queue n shaper>
+ *   //                /               \
+ *   //	<queue 0 shaper> . . .    <queue 2 shaper>
+ */
+#endif
+
diff --git a/net/Kconfig b/net/Kconfig
index f0a8692496ff..29c6fec54711 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -66,6 +66,9 @@  config SKB_DECRYPTED
 config SKB_EXTENSIONS
 	bool
 
+config NET_SHAPER
+	bool
+
 menu "Networking options"
 
 source "net/packet/Kconfig"