Message ID | 3d1e2d945904a0fb55258559eb7322d7e11066b6.1715199358.git.pabeni@redhat.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [RFC] net: introduce HW Rate Limiting Driver API | expand |
Hi Paolo > + * struct net_shaper_info - represents a shaping node on the NIC H/W > + * @metric: Specify if the bw limits refers to PPS or BPS > + * @bw_min: Minimum guaranteed rate for this shaper > + * @bw_max: Maximum peak bw allowed for this shaper > + * @burst: Maximum burst for the peek rate of this shaper > + * @priority: Scheduling priority for this shaper > + * @weight: Scheduling weight for this shaper > + */ > +struct net_shaper_info { > + enum net_shaper_metric metric; > + u64 bw_min; /* minimum guaranteed bandwidth, according to metric */ > + u64 bw_max; /* maximum allowed bandwidth */ > + u32 burst; /* maximum burst in bytes for bw_max */ > + u32 priority; /* scheduling strict priority */ Above it say priority. Here is strict priority? Is there a difference? > + u32 weight; /* scheduling WRR weight*/ > +}; Are there any special semantics for weight? Looking at the hardware i have, which has 8 queues for a switch port, i can either set it to strict priority, meaning queue 7 needs to be empty before it look at queue 6, and it will only look at queue 5 when 6 is empty etc. Or i can set weights per queue. How would i expect strict priority? Shaping itself is not performed on the queues, but the port. So it seems like i should create 8 net_shaper_info and set the weight in each, and everything else to 0? And then create a queue group shaper, put the 8 queue shapers into it, and then configure bw_max for the group? Everything else is 0, because that is all i can configure. Does this sound correct? > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices. Are they also available on plain boring devices which don't have PFs or VFs? Would i be correct in assuming my driver should just create these shapers. There will be some netlink calls to allow user space to enumerate them, and display the relationships between them? And netlink calls to set values in a shaper? Will there be a way to say which fields are actually settable, since i doubt most hardware will have everything? In my case, only one field appears to be relevant in each shaper, and maybe we want to give a hint about that to userspace? Andrew
On Wed, 2024-05-08 at 23:47 +0200, Andrew Lunn wrote: > > + * struct net_shaper_info - represents a shaping node on the NIC H/W > > + * @metric: Specify if the bw limits refers to PPS or BPS > > + * @bw_min: Minimum guaranteed rate for this shaper > > + * @bw_max: Maximum peak bw allowed for this shaper > > + * @burst: Maximum burst for the peek rate of this shaper > > + * @priority: Scheduling priority for this shaper > > + * @weight: Scheduling weight for this shaper > > + */ > > +struct net_shaper_info { > > + enum net_shaper_metric metric; > > + u64 bw_min; /* minimum guaranteed bandwidth, according to metric */ > > + u64 bw_max; /* maximum allowed bandwidth */ > > + u32 burst; /* maximum burst in bytes for bw_max */ > > + u32 priority; /* scheduling strict priority */ > > Above it say priority. Here is strict priority? Is there a difference? the 'priority' field is the strict priority for scheduling. We will clarify the first comment. > > > + u32 weight; /* scheduling WRR weight*/ > > +}; > > Are there any special semantics for weight? Looking at the hardware i > have, which has 8 queues for a switch port, i can either set it to > strict priority, meaning queue 7 needs to be empty before it look at > queue 6, and it will only look at queue 5 when 6 is empty etc. Or i > can set weights per queue. How would i expect strict priority? The expected behaviour is that schedulers at the same level with the same priority will be served according to their weight. I'm unsure if I read your question correctly. My understanding is that the your H/W don't allow strict priority and WRR simultaneously. In such case, the set/create operations should reject calls setting both non zero weight and priority values. > Shaping itself is not performed on the queues, but the port. So it > seems like i should create 8 net_shaper_info and set the weight in > each, and everything else to 0? And then create a queue group shaper, > put the 8 queue shapers into it, and then configure bw_max for the > group? Everything else is 0, because that is all i can configure. > > Does this sound correct? It sounds correct. You can avoid creating the queue group and set the rate at the NETDEV scope, or possibly not even set/create such shaper: the transmission is expected to be limited by the line rate. > > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and > > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices. > > Are they also available on plain boring devices which don't have PFs > or VFs? Yes, a driver is free to implement/support an arbitrary subset of the available feature. Operations attempting to set an unsupported state should fail with a relevant extended ack. > Would i be correct in assuming my driver should just create these > shapers. There will be some netlink calls to allow user space to > enumerate them, and display the relationships between them? Yes, the idea/plan/roadmap is to implement yml-based netlink API to fully expose the feature to the user-space. > And > netlink calls to set values in a shaper? Will there be a way to say > which fields are actually settable, since i doubt most hardware will > have everything? This was not planned yet, even because some (most?) H/W can have non- trivial constraints to expose. e.g. WRR is available at the queue scope only, and only if there is at most one SP shaper somewhere else. > In my case, only one field appears to be relevant in > each shaper, and maybe we want to give a hint about that to userspace? Any suggestion on how to expose that in reasonable way more then welcome! Thanks, Paolo
On Thu, May 09, 2024 at 04:19:52PM +0200, Paolo Abeni wrote: > On Wed, 2024-05-08 at 23:47 +0200, Andrew Lunn wrote: > > > + * struct net_shaper_info - represents a shaping node on the NIC H/W > > > + * @metric: Specify if the bw limits refers to PPS or BPS > > > + * @bw_min: Minimum guaranteed rate for this shaper > > > + * @bw_max: Maximum peak bw allowed for this shaper > > > + * @burst: Maximum burst for the peek rate of this shaper > > > + * @priority: Scheduling priority for this shaper > > > + * @weight: Scheduling weight for this shaper > > > + */ > > > +struct net_shaper_info { > > > + enum net_shaper_metric metric; > > > + u64 bw_min; /* minimum guaranteed bandwidth, according to metric */ > > > + u64 bw_max; /* maximum allowed bandwidth */ > > > + u32 burst; /* maximum burst in bytes for bw_max */ > > > + u32 priority; /* scheduling strict priority */ > > > > Above it say priority. Here is strict priority? Is there a difference? > > the 'priority' field is the strict priority for scheduling. We will > clarify the first comment. > > > > > > + u32 weight; /* scheduling WRR weight*/ > > > +}; > > > > Are there any special semantics for weight? Looking at the hardware i > > have, which has 8 queues for a switch port, i can either set it to > > strict priority, meaning queue 7 needs to be empty before it look at > > queue 6, and it will only look at queue 5 when 6 is empty etc. Or i > > can set weights per queue. How would i expect strict priority? > > The expected behaviour is that schedulers at the same level with the > same priority will be served according to their weight. > > I'm unsure if I read your question correctly. My understanding is that > the your H/W don't allow strict priority and WRR simultaneously. Correct. There is one bit to select between the two. If WRR is enabled, it then becomes possible for some generations of the hardware to configure the weights, for others the weights are fixed in silicon. > In > such case, the set/create operations should reject calls setting both > non zero weight and priority values. So in order to set strict priority, i need to set the priority field to 7, 6, 5, 4, 3, 2, 1, 0, for my 8 queues, and weight to 0. For WRR, i need to set the priority fields to 0, and the weight values, either to what is hard coded in the silicon, or valid weights when it is configurable. Now the question is, how do i get between these two states? It is not possible to mix WRR and strict priority. Any kAPI which only modifies one queue at once will go straight into an invalid state, and the driver will need to return -EOPNOTSUPP. So it seems like there needs to be an atomic set N queue configuration at once, so i can cleanly go from strict priority across 8 queues to WRR across 8 queues. Is that foreseen? > It sounds correct. You can avoid creating the queue group and set the > rate at the NETDEV scope, or possibly not even set/create such shaper: > the transmission is expected to be limited by the line rate. Well, the hardware exists, i probably should just create the shapers to match the hardware. However, if i set the hardware equivalent of bw_max to 0, it then uses line rate. Is this something we want to document? You can disable a shaper from shaping by setting the bandwidth to 0? Or do we want a separate enable/disable bit in the structure? > > > > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and > > > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices. > > > > Are they also available on plain boring devices which don't have PFs > > or VFs? > > Yes, a driver is free to implement/support an arbitrary subset of the > available feature. Operations attempting to set an unsupported state > should fail with a relevant extended ack. Great. Please update the comment. > > In my case, only one field appears to be relevant in > > each shaper, and maybe we want to give a hint about that to userspace? > > Any suggestion on how to expose that in reasonable way more then > welcome! We might first want to gather knowledge on what these shapers actually look like in different hardwares. There are going to be some which are very limited fixed functions as in the hardware i have, probably most SOHO switches are like this. And then we have TOR and smart NICs which might be able to create and destroy shapers on the fly, and are a lot less restrictive. Andrew
> +/** > + * enum net_shaper_metric - the metric of the shaper > + * @NET_SHAPER_METRIC_PPS: Shaper operates on a packets per second basis > + * @NET_SHAPER_METRIC_BPS: Shaper operates on a bits per second basis > + */ > +enum net_shaper_metric { > + NET_SHAPER_METRIC_PPS, > + NET_SHAPER_METRIC_BPS > +}; I think we need a third value. In the hardware i'm talking about, the 8 queues themselves don't do any shaping. All i'm configuring is WRR vs strict priority order. So ideally i want something to indicate this shaper does not actually shape. Andrew
On Thu, 2024-05-09 at 17:00 +0200, Andrew Lunn wrote: > On Thu, May 09, 2024 at 04:19:52PM +0200, Paolo Abeni wrote: > > On Wed, 2024-05-08 at 23:47 +0200, Andrew Lunn wrote: > > > > + * struct net_shaper_info - represents a shaping node on the NIC H/W > > > > + * @metric: Specify if the bw limits refers to PPS or BPS > > > > + * @bw_min: Minimum guaranteed rate for this shaper > > > > + * @bw_max: Maximum peak bw allowed for this shaper > > > > + * @burst: Maximum burst for the peek rate of this shaper > > > > + * @priority: Scheduling priority for this shaper > > > > + * @weight: Scheduling weight for this shaper > > > > + */ > > > > +struct net_shaper_info { > > > > + enum net_shaper_metric metric; > > > > + u64 bw_min; /* minimum guaranteed bandwidth, according to metric */ > > > > + u64 bw_max; /* maximum allowed bandwidth */ > > > > + u32 burst; /* maximum burst in bytes for bw_max */ > > > > + u32 priority; /* scheduling strict priority */ > > > > > > Above it say priority. Here is strict priority? Is there a difference? > > > > the 'priority' field is the strict priority for scheduling. We will > > clarify the first comment. > > > > > > > > > + u32 weight; /* scheduling WRR weight*/ > > > > +}; > > > > > > Are there any special semantics for weight? Looking at the hardware i > > > have, which has 8 queues for a switch port, i can either set it to > > > strict priority, meaning queue 7 needs to be empty before it look at > > > queue 6, and it will only look at queue 5 when 6 is empty etc. Or i > > > can set weights per queue. How would i expect strict priority? > > > > The expected behaviour is that schedulers at the same level with the > > same priority will be served according to their weight. > > > > I'm unsure if I read your question correctly. My understanding is that > > the your H/W don't allow strict priority and WRR simultaneously. > > Correct. There is one bit to select between the two. If WRR is > enabled, it then becomes possible for some generations of the hardware > to configure the weights, for others the weights are fixed in silicon. > > > In > > such case, the set/create operations should reject calls setting both > > non zero weight and priority values. > > So in order to set strict priority, i need to set the priority field > to 7, 6, 5, 4, 3, 2, 1, 0, for my 8 queues, and weight to 0. For WRR, > i need to set the priority fields to 0, and the weight values, either > to what is hard coded in the silicon, or valid weights when it is > configurable. > > Now the question is, how do i get between these two states? It is not > possible to mix WRR and strict priority. Any kAPI which only modifies > one queue at once will go straight into an invalid state, and the > driver will need to return -EOPNOTSUPP. So it seems like there needs > to be an atomic set N queue configuration at once, so i can cleanly go > from strict priority across 8 queues to WRR across 8 queues. Is that > foreseen? You could delete all the WRR shapers and then create/add SP based ones. > > It sounds correct. You can avoid creating the queue group and set the > > rate at the NETDEV scope, or possibly not even set/create such shaper: > > the transmission is expected to be limited by the line rate. > > Well, the hardware exists, i probably should just create the shapers > to match the hardware. However, if i set the hardware equivalent of > bw_max to 0, it then uses line rate. Is this something we want to > document? You can disable a shaper from shaping by setting the > bandwidth to 0? Or do we want a separate enable/disable bit in the > structure? I documenting that '0' means 'unlimited' for bw_max fields should be good enough. If the NIC can't support any kind of shaping it will reject with an appropriate extended ack any configuration change with bw != 0. I guess we need another comment update here :) > > > > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and > > > > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices. > > > > > > Are they also available on plain boring devices which don't have PFs > > > or VFs? > > > > Yes, a driver is free to implement/support an arbitrary subset of the > > available feature. Operations attempting to set an unsupported state > > should fail with a relevant extended ack. > > Great. Please update the comment. Sure, will do. > > > In my case, only one field appears to be relevant in > > > each shaper, and maybe we want to give a hint about that to userspace? > > > > Any suggestion on how to expose that in reasonable way more then > > welcome! > > We might first want to gather knowledge on what these shapers actually > look like in different hardwares. There are going to be some which are > very limited fixed functions as in the hardware i have, probably most > SOHO switches are like this. And then we have TOR and smart NICs which > might be able to create and destroy shapers on the fly, and are a lot > less restrictive. As a reference the initial configuration status has been discussed here: https://lore.kernel.org/netdev/20240410075745.4637c537@kernel.org/ (grep for 'Transforming arbitrary pre-existing driver hierarchy') The conclusion was (to my understanding) to reduce the core complexity enforcing an uniform views of the NIC defaults. The 'create' op is just an abstraction to tell the NIC to switch from the default configuration to the specified one. The default configuration means 'no restrictions/shaping beyond the link rate'. If the NIC has buildin shapers it can't disable, it must initialize them to match the above. In both case the kernel view of the default will be 'no shapers added' The 'delete' op will restore the default for the specified scope location. If the NIC has a persistent shaper there, it will update the setting to the default, or actually delete the H/W shaper if possible. In both case the kernel view of such scope location with return to be 'no shapers added' My understading/guess/hope is that is possible to implement the above default with any H/W. Does the above solve your doubt/answer your question? Thanks! Paolo
> > Now the question is, how do i get between these two states? It is not > > possible to mix WRR and strict priority. Any kAPI which only modifies > > one queue at once will go straight into an invalid state, and the > > driver will need to return -EOPNOTSUPP. So it seems like there needs > > to be an atomic set N queue configuration at once, so i can cleanly go > > from strict priority across 8 queues to WRR across 8 queues. Is that > > foreseen? > > You could delete all the WRR shapers and then create/add SP based ones. But that does not match the hardware. I cannot delete the hardware. It will either do strict priority or WRR. If i delete the software representation of the shaper, the hardware shaper will keep on doing what it was doing. So i don't see this as a good model. I think the driver will create shapers to represent the hardware, and you are not allowed to delete them or add more of them, because that is what the hardware is. All you can do is configure the shapers that exist. > The 'create' op is just an abstraction to tell the NIC to switch from > the default configuration to the specified one. Well, the hardware default is i think WRR for the queues, and line rate. That will be what the software representation of the shapers will be set to when the driver probes and creates the shapers representors. Andrew
Hi Paolo, > -----Original Message----- > From: Paolo Abeni <pabeni@redhat.com> > Sent: Thursday, May 9, 2024 1:51 AM > To: netdev@vger.kernel.org > Cc: Jakub Kicinski <kuba@kernel.org>; Jiri Pirko <jiri@resnulli.us>; Madhu > Chittim <madhu.chittim@intel.com>; Sridhar Samudrala > <sridhar.samudrala@intel.com>; Simon Horman <horms@kernel.org>; John > Fastabend <john.fastabend@gmail.com>; Sunil Kovvuri Goutham > <sgoutham@marvell.com>; Jamal Hadi Salim <jhs@mojatatu.com> > Subject: [RFC PATCH] net: introduce HW Rate Limiting Driver API > > This is the first incarnation in a formal (pre-RFC) patch of the HW TX Rate > Limiting Driver API proposal shared here[1]. > > The goal is to outline the proposed APIs before pushing the actual implementation. > > The network devices gain a new ops struct to directly manipulate the H/W > shapers implemented by the NIC. > > The shapers can be attached to a pre-defined set of 'domains' - port, vf, etc. - and > the overall shapers configuration pushed to the H/W is maintained by the kernel. > > Each shaper is identified by an unique integer id based on the domain and > additional domain-specific information - e.g. for the VF domain, the virtual function > number/identifier. > > [1] https://urldefense.proofpoint.com/v2/url?u=https- > 3A__lore.kernel.org_netdev_20240405102313.GA310894- > 40kernel.org_&d=DwIDaQ&c=nKjWec2b6R0mOyPaz7xtfQ&r=TwreqwV6mQ8K9wI > pqwFO8yjikO_w1jUOe2MzChg4Rmg&m=g3KFBILzzqT9znnKDO0ePHOi5tc0Zzy6r > G9xL3GdGKS_9IkdC3q- > okhCTptMrIKN&s=hVQbF8IF41zg6I197jY8r96uQq0oQbgTfe4-4HF6c3M&e= > > Co-developed-by: Simon Horman <horms@kernel.org> > Signed-off-by: Simon Horman <horms@kernel.org> > Signed-off-by: Paolo Abeni <pabeni@redhat.com> > --- > include/linux/netdevice.h | 15 +++ > include/net/net_shaper.h | 206 > ++++++++++++++++++++++++++++++++++++++ > net/Kconfig | 3 + > 3 files changed, 224 insertions(+) > create mode 100644 include/net/net_shaper.h > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index > cf261fb89d73..39f66af014be 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -79,6 +79,8 @@ struct xdp_buff; > struct xdp_frame; > struct xdp_metadata_ops; > struct xdp_md; > +struct net_shaper_ops; > +struct net_shaper_data; > > typedef u32 xdp_features_t; > > @@ -1596,6 +1598,13 @@ struct net_device_ops { > int (*ndo_hwtstamp_set)(struct net_device *dev, > struct kernel_hwtstamp_config > *kernel_config, > struct netlink_ext_ack > *extack); > + > +#if IS_ENABLED(CONFIG_NET_SHAPER) > + /** @net_shaper_ops: Device shaping offload operations > + * see include/net/net_shapers.h > + */ > + const struct net_shaper_ops *net_shaper_ops; #endif > }; > > /** > @@ -2403,6 +2412,12 @@ struct net_device { > /** @page_pools: page pools created for this netdevice */ > struct hlist_head page_pools; > #endif > +#if IS_ENABLED(CONFIG_NET_SHAPER) > + /** @net_shaper_data: data tracking the current shaper status > + * see include/net/net_shapers.h > + */ > + struct net_shaper_data *net_shaper_data; #endif > }; > #define to_net_dev(d) container_of(d, struct net_device, dev) > > diff --git a/include/net/net_shaper.h b/include/net/net_shaper.h new file mode > 100644 index 000000000000..a4fbadd99870 > --- /dev/null > +++ b/include/net/net_shaper.h > @@ -0,0 +1,206 @@ > +/* SPDX-License-Identifier: GPL-2.0-or-later */ > + > +#ifndef _NET_SHAPER_H_ > +#define _NET_SHAPER_H_ > + > +#include <linux/types.h> > +#include <linux/netdevice.h> > +#include <linux/netlink.h> > + > +/** > + * enum net_shaper_metric - the metric of the shaper > + * @NET_SHAPER_METRIC_PPS: Shaper operates on a packets per second > +basis > + * @NET_SHAPER_METRIC_BPS: Shaper operates on a bits per second basis > +*/ enum net_shaper_metric { > + NET_SHAPER_METRIC_PPS, > + NET_SHAPER_METRIC_BPS > +}; > + > +/** > + * struct net_shaper_info - represents a shaping node on the NIC H/W > + * @metric: Specify if the bw limits refers to PPS or BPS > + * @bw_min: Minimum guaranteed rate for this shaper > + * @bw_max: Maximum peak bw allowed for this shaper > + * @burst: Maximum burst for the peek rate of this shaper > + * @priority: Scheduling priority for this shaper > + * @weight: Scheduling weight for this shaper */ struct > +net_shaper_info { > + enum net_shaper_metric metric; > + u64 bw_min; /* minimum guaranteed bandwidth, according to metric */ > + u64 bw_max; /* maximum allowed bandwidth */ > + u32 burst; /* maximum burst in bytes for bw_max */ > + u32 priority; /* scheduling strict priority */ > + u32 weight; /* scheduling WRR weight*/ > +}; > + > +/** > + * enum net_shaper_scope - the different scopes where a shaper could be > attached > + * @NET_SHAPER_SCOPE_PORT: The root shaper for the whole H/W. > + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network > device. > + * @NET_SHAPER_SCOPE_VF: The shaper is attached to the given virtual > + * function. > + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple > queues > +under the > + * same device. > + * @NET_SHAPER_SCOPE_QUEUE: The shaper is attached to the given > device queue. > + * > + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only > available on > + * PF devices, usually inside the host/hypervisor. What is the difference between NET_SHAPER_SCOPE_VF and NET_SHAPER_SCOPE_NETDEV from a VF traffic shaping perspective? Is NET_SHAPER_SCOPE_VF used to configure a shaper for a VF which is in a separate domain that is not a NETDEV? Thanks, Naveen > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP > and > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices. > + */ > +enum net_shaper_scope { > + NET_SHAPER_SCOPE_PORT, > + NET_SHAPER_SCOPE_NETDEV, > + NET_SHAPER_SCOPE_VF, > + NET_SHAPER_SCOPE_QUEUE_GROUP, > + NET_SHAPER_SCOPE_QUEUE, > +}; > + > +/** > + * struct net_shaper_ops - Operations on device H/W shapers > + * @add: Creates a new shaper in the specified scope. > + * @set: Modify the existing shaper. > + * @delete: Delete the specified shaper. > + * @move: Move an existing shaper under a different parent. > + * > + * The initial shaping configuration ad device initialization is empty/ > + * a no-op/does not constraint the b/w in any way. > + * The network core keeps track of the applied user-configuration in > + * per device storage. > + * > + * Each shaper is uniquely identified within the device with an > +'handle', > + * dependent on the shaper scope and other data, see > +@shaper_make_handle() */ struct net_shaper_ops { > + /** add - Add a shaper inside the shaper hierarchy > + * @dev: netdevice to operate on > + * @handle: the shaper indetifier > + * @shaper: configuration of shaper > + * @extack: Netlink extended ACK for reporting errors. > + * > + * Return: > + * * 0 on success > + * * %-EOPNOTSUPP - Operation is not supported by hardware, driver, > + * or core for any reason. @extack should be set to > + * text describing the reason. > + * * Other negative error values on failure. > + * > + * Examples or reasons this operation may fail include: > + * * H/W resources limits. > + * * Can’t respect the requested bw limits. > + */ > + int (*add)(struct net_device *dev, u32 handle, > + const struct net_shaper_info *shaper, > + struct netlink_ext_ack *extack); > + > + /** set - Update the specified shaper, if it exists > + * @dev: Netdevice to operate on. > + * @handle: the shaper identifier > + * @shaper: Configuration of shaper. > + * @extack: Netlink extended ACK for reporting errors. > + * > + * Return: > + * * %0 - Success > + * * %-EOPNOTSUPP - Operation is not supported by hardware, driver, > + * or core for any reason. @extack should be set to > + * text describing the reason. > + * * Other negative error values on failure. > + */ > + int (*set)(struct net_device *dev, u32 handle, > + const struct net_shaper_info *shaper, > + struct netlink_ext_ack *extack); > + > + /** delete - Removes a shaper from the NIC > + * @dev: netdevice to operate on. > + * @handle: the shaper identifier > + * @extack: Netlink extended ACK for reporting errors. > + * > + * Return: > + * * %0 - Success > + * * %-EOPNOTSUPP - Operation is not supported by hardware, driver, > + * or core for any reason. @extack should be set to > + * text describing the reason. > + * * Other negative error value on failure. > + */ > + int (*delete)(struct net_device *dev, u32 handle, > + struct netlink_ext_ack *extack); > + > + /** Move - change the parent id of the specified shaper > + * @dev: netdevice to operate on. > + * @handle: unique identifier for the shaper > + * @new_parent_id: identifier of the new parent for this shaper > + * @extack: Netlink extended ACK for reporting errors. > + * > + * Move the specified shaper in the hierarchy replacing its > + * current parent shaper with @new_parent_id > + * > + * Return: > + * * %0 - Success > + * * %-EOPNOTSUPP - Operation is not supported by hardware, driver, > + * or core for any reason. @extack should be set to > + * text describing the reason. > + * * Other negative error values on failure. > + */ > + int (*move)(struct net_device *dev, u32 handle, > + u32 new_parent_handle, struct netlink_ext_ack *extack); }; > + > +/** > + * net_shaper_make_handle - creates an unique shaper identifier > + * @scope: the shaper scope > + * @vf: virtual function number > + * @id: queue group or queue id > + * > + * Return: an unique identifier for the shaper > + * > + * Combines the specified arguments to create an unique identifier for > + * the shaper. > + * The virtual function number is only used within > +@NET_SHAPER_SCOPE_VF, > + * @NET_SHAPER_SCOPE_QUEUE_GROUP and > @NET_SHAPER_SCOPE_QUEUE. > + * The @id number is only used for @NET_SHAPER_SCOPE_QUEUE_GROUP > and > + * @NET_SHAPER_SCOPE_QUEUE, and must be, respectively, the queue > group > + * identifier or the queue number. > + */ > +u32 net_shaper_make_handle(enum net_shaper_scope scope, int vf, int > +id); > + > +/* > + * Examples: > + * - set shaping on a given queue > + * struct shaper_info info = { }; // fill this > + * u32 handle = shaper_make_handle(NET_SHAPER_SCOPE_QUEUE, 0, > queue_id); > + * dev->shaper_ops->add(dev, handle, &info, NULL); > + * > + * - create a queue group with a queue group shaping limits. > + * Assuming the following topology already exists: > + * < netdev shaper > > + * / \ > + * <queue 0 shaper> . . . <queue N shaper> > + * > + * struct shaper_info ginfo = { }; // fill this > + * u32 ghandle = > shaper_make_handle(NET_SHAPER_SCOPE_QUEUE_GROUP, 0, 0); > + * dev->shaper_ops->add(dev, ghandle, &ginfo); > + * > + * // now topology is: > + * // < netdev shaper > > + * // / | \ > + * // / | < newly created shaper > > + * // / | > + * // <queue 0 shaper> . . . <queue N shaper> > + * > + * // move a shapers for queues 3..n out of such queue group > + * for (i = 0; i <= 2; ++i) { > + * u32 qhandle = net_shaper_make_handle(NET_SHAPER_SCOPE_QUEUE, > 0, i); > + * dev->netshaper_ops->move(dev, qhandle, ghandle, NULL); > + * } > + * > + * // now the topology is: > + * // < netdev shaper > > + * // / \ > + * // < newly created shaper> <queue 3 shaper> .. <queue n shaper> > + * // / \ > + * // <queue 0 shaper> . . . <queue 2 shaper> > + */ > +#endif > + > diff --git a/net/Kconfig b/net/Kconfig > index f0a8692496ff..29c6fec54711 100644 > --- a/net/Kconfig > +++ b/net/Kconfig > @@ -66,6 +66,9 @@ config SKB_DECRYPTED > config SKB_EXTENSIONS > bool > > +config NET_SHAPER > + bool > + > menu "Networking options" > > source "net/packet/Kconfig" > -- > 2.43.2 >
On Fri, 2024-05-10 at 07:10 +0000, Naveen Mamindlapalli wrote: > On Thursday, May 9, 2024 1:51 AM Paolo Abeni <pabeni@redhat.com> wrote: [...] > > +/** > > + * enum net_shaper_scope - the different scopes where a shaper could be > > attached > > + * @NET_SHAPER_SCOPE_PORT: The root shaper for the whole H/W. > > + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network > > device. > > + * @NET_SHAPER_SCOPE_VF: The shaper is attached to the given virtual > > + * function. > > + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple > > queues > > +under the > > + * same device. > > + * @NET_SHAPER_SCOPE_QUEUE: The shaper is attached to the given > > device queue. > > + * > > + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only > > available on > > + * PF devices, usually inside the host/hypervisor. > > What is the difference between NET_SHAPER_SCOPE_VF and NET_SHAPER_SCOPE_NETDEV > from a VF traffic shaping perspective? With NET_SHAPER_SCOPE_VF, the user can control VF-level shapers from within the hypervisor. With NET_SHAPER_SCOPE_NETDEV, the user will control whatever is really under the gear of the given 'dev' struct: the PF-level shaper if the user itself is within the hypervisor, or the relevant VF if the user is within the VM. Cheers, Paolo
On Thu, 2024-05-09 at 18:17 +0200, Andrew Lunn wrote: > > > Now the question is, how do i get between these two states? It is not > > > possible to mix WRR and strict priority. Any kAPI which only modifies > > > one queue at once will go straight into an invalid state, and the > > > driver will need to return -EOPNOTSUPP. So it seems like there needs > > > to be an atomic set N queue configuration at once, so i can cleanly go > > > from strict priority across 8 queues to WRR across 8 queues. Is that > > > foreseen? > > > > You could delete all the WRR shapers and then create/add SP based ones. > > But that does not match the hardware. I cannot delete the hardware. It > will either do strict priority or WRR. If i delete the software > representation of the shaper, the hardware shaper will keep on doing > what it was doing. So i don't see this as a good model. I think the > driver will create shapers to represent the hardware, and you are not > allowed to delete them or add more of them, because that is what the > hardware is. All you can do is configure the shapers that exist. > > > The 'create' op is just an abstraction to tell the NIC to switch from > > the default configuration to the specified one. > > Well, the hardware default is i think WRR for the queues, and line > rate. That will be what the software representation of the shapers > will be set to when the driver probes and creates the shapers > representors. If I read correctly, allowing each NIC to expose it's own different starting configuration still will not solve the problem for this H/W to switch from WRR to SP (and vice versa). AFAICS, what would be needed there is an atomic set of operations: 'set_many' (and e.v. 'delete_many', 'create_many') that will allow changing all the shapers at once. With such operations, that H/W could still fit the expected 'no-op' default, as WRR on the queue shapers is what we expect. I agree with Jakub, handling the complexity of arbitrary starting configuration would pose a lot of trouble to the user/admin. If all the above stands together, I think we have a few options (in random order): - add both set of operations: the ones operating on a single shaper and the ones operating on multiple shapers - use only the multiple shapers ops. And the latter looks IMHO the simple/better. At that point I would probably drop the 'add' op and would rename 'delete' as 'reset': int (*set)(struct net_device *dev, int how_many, const u32 *handles, const struct net_shaper_info *shapers, struct netlink_ext_ack *extack); int (*reset)(struct net_device *dev, int how_many, const u32 *handles, struct netlink_ext_ack *extack); int (*move)(struct net_device *dev, int how_many, const u32 *handles, const u32 *new_parent_handles, struct netlink_ext_ack *extack); An NIC with 'static' shapers can implement a dummy move always returning EOPNOTSUPP and eventually filling a detailed extack. NIC without any constraints on mixing and matching different kind of shapers could implement the above as a loop over whatever they will do for the corresponding 'single shaper op' NIC with constrains alike the one you pointed out could validate the final state before atomically applying the specified operation. After a successful 'reset' operation, the kernel could drop any data it retains/caches for the relevant shapers - the current idea is to keep a copy of all successfully configured shaper_info in a xarray, using the 'handle' as the index. Side note: the move() operation could look like a complex artifact, but it's the simplest instrument I could think of to support scenarios where the user creates/configures/sets a queue group and 'move' some queue under the newly created group WDYT? Thanks, Paolo
On Fri, May 10, 2024 at 01:05:41PM +0200, Paolo Abeni wrote: > On Thu, 2024-05-09 at 18:17 +0200, Andrew Lunn wrote: > > > > Now the question is, how do i get between these two states? It is not > > > > possible to mix WRR and strict priority. Any kAPI which only modifies > > > > one queue at once will go straight into an invalid state, and the > > > > driver will need to return -EOPNOTSUPP. So it seems like there needs > > > > to be an atomic set N queue configuration at once, so i can cleanly go > > > > from strict priority across 8 queues to WRR across 8 queues. Is that > > > > foreseen? > > > > > > You could delete all the WRR shapers and then create/add SP based ones. > > > > But that does not match the hardware. I cannot delete the hardware. It > > will either do strict priority or WRR. If i delete the software > > representation of the shaper, the hardware shaper will keep on doing > > what it was doing. So i don't see this as a good model. I think the > > driver will create shapers to represent the hardware, and you are not > > allowed to delete them or add more of them, because that is what the > > hardware is. All you can do is configure the shapers that exist. > > > > > The 'create' op is just an abstraction to tell the NIC to switch from > > > the default configuration to the specified one. > > > > Well, the hardware default is i think WRR for the queues, and line > > rate. That will be what the software representation of the shapers > > will be set to when the driver probes and creates the shapers > > representors. > > If I read correctly, allowing each NIC to expose it's own different > starting configuration still will not solve the problem for this H/W to > switch from WRR to SP (and vice versa). > > AFAICS, what would be needed there is an atomic set of operations: > 'set_many' (and e.v. 'delete_many', 'create_many') that will allow > changing all the shapers at once. > > With such operations, that H/W could still fit the expected 'no-op' > default, as WRR on the queue shapers is what we expect. I agree with > Jakub, handling the complexity of arbitrary starting configuration > would pose a lot of trouble to the user/admin. > > If all the above stands together, I think we have a few options (in > random order): > > - add both set of operations: the ones operating on a single shaper and > the ones operating on multiple shapers > - use only the multiple shapers ops. > > And the latter looks IMHO the simple/better. At that point I would > probably drop the 'add' op and would rename 'delete' as > 'reset': > > int (*set)(struct net_device *dev, int how_many, const u32 *handles, > const struct net_shaper_info *shapers, > struct netlink_ext_ack *extack); > int (*reset)(struct net_device *dev, int how_many, const u32 *handles, > struct netlink_ext_ack *extack); > int (*move)(struct net_device *dev, int how_many, const u32 *handles, > const u32 *new_parent_handles, > struct netlink_ext_ack *extack); > > An NIC with 'static' shapers can implement a dummy move always > returning EOPNOTSUPP and eventually filling a detailed extack. > > NIC without any constraints on mixing and matching different kind of > shapers could implement the above as a loop over whatever they will do > for the corresponding 'single shaper op' > > NIC with constrains alike the one you pointed out could validate the > final state before atomically applying the specified operation. > > After a successful 'reset' operation, the kernel could drop any data > it retains/caches for the relevant shapers - the current idea is to > keep a copy of all successfully configured shaper_info in a xarray, > using the 'handle' as the index. > > Side note: the move() operation could look like a complex artifact, but > it's the simplest instrument I could think of to support scenarios > where the user creates/configures/sets a queue group and 'move' some > queue under the newly created group > > WDYT? Hi Andrew, Picking up this discussion, I'm wondering if Paolo's proposal addresses your concerns.
> > If I read correctly, allowing each NIC to expose it's own different > > starting configuration still will not solve the problem for this H/W to > > switch from WRR to SP (and vice versa). I also suspect this is not unique to this hardware. I've not looked at other SOHO switches, but it is reasonably common to have different queues for different priority classes, and then one shaper for the overall port rate. > > AFAICS, what would be needed there is an atomic set of operations: > > 'set_many' (and e.v. 'delete_many', 'create_many') that will allow > > changing all the shapers at once. Yep. > > With such operations, that H/W could still fit the expected 'no-op' > > default, as WRR on the queue shapers is what we expect. I agree with > > Jakub, handling the complexity of arbitrary starting configuration > > would pose a lot of trouble to the user/admin. > > > > If all the above stands together, I think we have a few options (in > > random order): > > > > - add both set of operations: the ones operating on a single shaper and > > the ones operating on multiple shapers > > - use only the multiple shapers ops. > > > > And the latter looks IMHO the simple/better. I would agree, start with only multiple shaper opps. If we find that many implementation end up just iterating the list and dealing with them individually, would could pull that iterator into the core, and expand the ops to either/or, multiple or single. > > int (*set)(struct net_device *dev, int how_many, const u32 *handles, > > const struct net_shaper_info *shapers, > > struct netlink_ext_ack *extack); > > int (*reset)(struct net_device *dev, int how_many, const u32 *handles, > > struct netlink_ext_ack *extack); > > int (*move)(struct net_device *dev, int how_many, const u32 *handles, > > const u32 *new_parent_handles, > > struct netlink_ext_ack *extack); > > > > An NIC with 'static' shapers can implement a dummy move always > > returning EOPNOTSUPP and eventually filling a detailed extack. The extack is going to be important here, we are going to need meaningful error messages. Overall, i think this can be made to work with the hardware i have. Andrew
> + * struct net_shaper_info - represents a shaping node on the NIC H/W > + * @metric: Specify if the bw limits refers to PPS or BPS > + * @bw_min: Minimum guaranteed rate for this shaper > + * @bw_max: Maximum peak bw allowed for this shaper > + * @burst: Maximum burst for the peek rate of this shaper > + * @priority: Scheduling priority for this shaper > + * @weight: Scheduling weight for this shaper > + */ > +struct net_shaper_info { > + enum net_shaper_metric metric; > + u64 bw_min; /* minimum guaranteed bandwidth, according to metric */ > + u64 bw_max; /* maximum allowed bandwidth */ > + u32 burst; /* maximum burst in bytes for bw_max */ > + u32 priority; /* scheduling strict priority */ > + u32 weight; /* scheduling WRR weight*/ > +}; ... > + /** set - Update the specified shaper, if it exists > + * @dev: Netdevice to operate on. > + * @handle: the shaper identifier > + * @shaper: Configuration of shaper. > + * @extack: Netlink extended ACK for reporting errors. > + * > + * Return: > + * * %0 - Success > + * * %-EOPNOTSUPP - Operation is not supported by hardware, driver, > + * or core for any reason. @extack should be set to > + * text describing the reason. > + * * Other negative error values on failure. > + */ > + int (*set)(struct net_device *dev, u32 handle, > + const struct net_shaper_info *shaper, > + struct netlink_ext_ack *extack); > + * net_shaper_make_handle - creates an unique shaper identifier > + * @scope: the shaper scope > + * @vf: virtual function number > + * @id: queue group or queue id > + * > + * Return: an unique identifier for the shaper > + * > + * Combines the specified arguments to create an unique identifier for > + * the shaper. > + * The virtual function number is only used within @NET_SHAPER_SCOPE_VF, > + * @NET_SHAPER_SCOPE_QUEUE_GROUP and @NET_SHAPER_SCOPE_QUEUE. > + * The @id number is only used for @NET_SHAPER_SCOPE_QUEUE_GROUP and > + * @NET_SHAPER_SCOPE_QUEUE, and must be, respectively, the queue group > + * identifier or the queue number. > + */ > +u32 net_shaper_make_handle(enum net_shaper_scope scope, int vf, int id); One thing i'm missing here is a function which does the opposite of net_shaper_make_handle(). Given a handle, it returns the scope, vf and the id. When the set() op is called, i somehow need to find the software instance representing the hardware block. If i know the id, it is just an array access. Otherwise i need additional bookkeeping, maybe a linked list of handles and pointers to structures etc. Or net_shaper_make_handle() could maybe take an addition void * priv, and provide a function void * net_shape_priv(u32 handle); Andrew
On Wed, May 15, 2024 at 04:41:09PM +0200, Andrew Lunn wrote: > > + * struct net_shaper_info - represents a shaping node on the NIC H/W > > + * @metric: Specify if the bw limits refers to PPS or BPS > > + * @bw_min: Minimum guaranteed rate for this shaper > > + * @bw_max: Maximum peak bw allowed for this shaper > > + * @burst: Maximum burst for the peek rate of this shaper > > + * @priority: Scheduling priority for this shaper > > + * @weight: Scheduling weight for this shaper > > + */ > > +struct net_shaper_info { > > + enum net_shaper_metric metric; > > + u64 bw_min; /* minimum guaranteed bandwidth, according to metric */ > > + u64 bw_max; /* maximum allowed bandwidth */ > > + u32 burst; /* maximum burst in bytes for bw_max */ > > + u32 priority; /* scheduling strict priority */ > > + u32 weight; /* scheduling WRR weight*/ > > +}; > > ... > > > + /** set - Update the specified shaper, if it exists > > + * @dev: Netdevice to operate on. > > + * @handle: the shaper identifier > > + * @shaper: Configuration of shaper. > > + * @extack: Netlink extended ACK for reporting errors. > > + * > > + * Return: > > + * * %0 - Success > > + * * %-EOPNOTSUPP - Operation is not supported by hardware, driver, > > + * or core for any reason. @extack should be set to > > + * text describing the reason. > > + * * Other negative error values on failure. > > + */ > > + int (*set)(struct net_device *dev, u32 handle, > > + const struct net_shaper_info *shaper, > > + struct netlink_ext_ack *extack); > > > + * net_shaper_make_handle - creates an unique shaper identifier > > + * @scope: the shaper scope > > + * @vf: virtual function number > > + * @id: queue group or queue id > > + * > > + * Return: an unique identifier for the shaper > > + * > > + * Combines the specified arguments to create an unique identifier for > > + * the shaper. > > + * The virtual function number is only used within @NET_SHAPER_SCOPE_VF, > > + * @NET_SHAPER_SCOPE_QUEUE_GROUP and @NET_SHAPER_SCOPE_QUEUE. > > + * The @id number is only used for @NET_SHAPER_SCOPE_QUEUE_GROUP and > > + * @NET_SHAPER_SCOPE_QUEUE, and must be, respectively, the queue group > > + * identifier or the queue number. > > + */ > > +u32 net_shaper_make_handle(enum net_shaper_scope scope, int vf, int id); > > One thing i'm missing here is a function which does the opposite of > net_shaper_make_handle(). Given a handle, it returns the scope, vf and > the id. > > When the set() op is called, i somehow need to find the software > instance representing the hardware block. If i know the id, it is just > an array access. Otherwise i need additional bookkeeping, maybe a > linked list of handles and pointers to structures etc. > > Or net_shaper_make_handle() could maybe take an addition void * priv, > and provide a function void * net_shape_priv(u32 handle); Hi Andrew, I think that, initially at least, the implementation of net_shaper_make_handle() can be fairly simple, involving packing fields into an integer in a static manner. As such I think implementing a helper or helpers to an to extract fields should be trivial. In other words, yes, I think that can be added.
On Wed, May 15, 2024 at 04:19:57PM +0200, Andrew Lunn wrote: > > > If I read correctly, allowing each NIC to expose it's own different > > > starting configuration still will not solve the problem for this H/W to > > > switch from WRR to SP (and vice versa). > > I also suspect this is not unique to this hardware. I've not looked at > other SOHO switches, but it is reasonably common to have different > queues for different priority classes, and then one shaper for the > overall port rate. Yes, understood. It's about creating a sufficiently general solution. And the HW you have in mind has lead us to see some shortcomings of the proposed API in that area. Because it drew a bit too much on understanding of a different category of HW. > > > AFAICS, what would be needed there is an atomic set of operations: > > > 'set_many' (and e.v. 'delete_many', 'create_many') that will allow > > > changing all the shapers at once. > > Yep. > > > > With such operations, that H/W could still fit the expected 'no-op' > > > default, as WRR on the queue shapers is what we expect. I agree with > > > Jakub, handling the complexity of arbitrary starting configuration > > > would pose a lot of trouble to the user/admin. > > > > > > If all the above stands together, I think we have a few options (in > > > random order): > > > > > > - add both set of operations: the ones operating on a single shaper and > > > the ones operating on multiple shapers > > > - use only the multiple shapers ops. > > > > > > And the latter looks IMHO the simple/better. > > I would agree, start with only multiple shaper opps. If we find that > many implementation end up just iterating the list and dealing with > them individually, would could pull that iterator into the core, and > expand the ops to either/or, multiple or single. FWIIW, this was my thinking too. > > > int (*set)(struct net_device *dev, int how_many, const u32 *handles, > > > const struct net_shaper_info *shapers, > > > struct netlink_ext_ack *extack); > > > int (*reset)(struct net_device *dev, int how_many, const u32 *handles, > > > struct netlink_ext_ack *extack); > > > int (*move)(struct net_device *dev, int how_many, const u32 *handles, > > > const u32 *new_parent_handles, > > > struct netlink_ext_ack *extack); > > > > > > An NIC with 'static' shapers can implement a dummy move always > > > returning EOPNOTSUPP and eventually filling a detailed extack. > > The extack is going to be important here, we are going to need > meaningful error messages. Always :) > Overall, i think this can be made to work with the hardware i have. Great, I think the next step is for us to propose a revised API with multiple shaper ops in place of single shaper ops.
On Wed, 8 May 2024 22:20:51 +0200 Paolo Abeni wrote: > +/** > + * struct net_shaper_info - represents a shaping node on the NIC H/W > + * @metric: Specify if the bw limits refers to PPS or BPS > + * @bw_min: Minimum guaranteed rate for this shaper > + * @bw_max: Maximum peak bw allowed for this shaper > + * @burst: Maximum burst for the peek rate of this shaper > + * @priority: Scheduling priority for this shaper > + * @weight: Scheduling weight for this shaper > + */ > +struct net_shaper_info { > + enum net_shaper_metric metric; > + u64 bw_min; /* minimum guaranteed bandwidth, according to metric */ > + u64 bw_max; /* maximum allowed bandwidth */ > + u32 burst; /* maximum burst in bytes for bw_max */ Burst is burst, either we need two or we assume it's for both bw_min and bw_max, but it most certainly can't be just for bw_max. Also presumably not just bytes - if "metric" is pps, burst is pps? > + u32 priority; /* scheduling strict priority */ > + u32 weight; /* scheduling WRR weight*/ I wonder if we should somehow more clearly specify what a node can do. Like Andrew pointed out, if we have a WRR node, presumably the weights go on the children, since there's only one weigh param. But then the WRR node itself is either empty (no params) or has rate params... which is odd. Maybe shaping nodes and RR nodes should be separate node classes, somehow? > +}; > + > +/** > + * enum net_shaper_scope - the different scopes where a shaper could be attached > + * @NET_SHAPER_SCOPE_PORT: The root shaper for the whole H/W. > + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network device. > + * @NET_SHAPER_SCOPE_VF: The shaper is attached to the given virtual > + * function. > + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple queues under the > + * same device. > + * @NET_SHAPER_SCOPE_QUEUE: The shaper is attached to the given device queue. I wonder if we need traffic class? Some devices may have two schedulers, one from the host interfaces (PCIe) into the device buffer. And then from the device buffer independently into the wire. > + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on > + * PF devices, usually inside the host/hypervisor. > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices. > + */ > +enum net_shaper_scope { > + NET_SHAPER_SCOPE_PORT, > + NET_SHAPER_SCOPE_NETDEV, > + NET_SHAPER_SCOPE_VF, I realized now that we do indeed need this VF node (if we want to internally express the legacy SRIOV NDOs in this API), as much as I hate it. Could you annotate somehow my nack on ever exposing the ability to hook on the VF to user space? > + NET_SHAPER_SCOPE_QUEUE_GROUP, We may need a definition for a queue group. Did I suggest this? Isn't queue group just a bunch of queues feeding a trivial RR node? Why does it need to be a "scope"? > + NET_SHAPER_SCOPE_QUEUE, > +}; > + > +/** > + * struct net_shaper_ops - Operations on device H/W shapers > + * @add: Creates a new shaper in the specified scope. "in a scope"? Isn't the scope just defining the ingress and egress points of the scheduling hierarchy? Also your example moves schedulers from queue scope to queue group scope. > + * @set: Modify the existing shaper. > + * @delete: Delete the specified shaper. > + * @move: Move an existing shaper under a different parent. > + * > + * The initial shaping configuration ad device initialization is empty/ and > + * a no-op/does not constraint the b/w in any way. > + * The network core keeps track of the applied user-configuration in > + * per device storage. "keeps track .. per device" -- "storage" may make people think NVM. > + * Each shaper is uniquely identified within the device with an 'handle', > + * dependent on the shaper scope and other data, see @shaper_make_handle() > + */ > +struct net_shaper_ops { > + /** add - Add a shaper inside the shaper hierarchy > + * @dev: netdevice to operate on > + * @handle: the shaper indetifier > + * @shaper: configuration of shaper > + * @extack: Netlink extended ACK for reporting errors. > + * > + * Return: > + * * 0 on success > + * * %-EOPNOTSUPP - Operation is not supported by hardware, driver, > + * or core for any reason. @extack should be set to > + * text describing the reason. > + * * Other negative error values on failure. > + * > + * Examples or reasons this operation may fail include: > + * * H/W resources limits. > + * * Can’t respect the requested bw limits. > + */ > + int (*add)(struct net_device *dev, u32 handle, > + const struct net_shaper_info *shaper, > + struct netlink_ext_ack *extack); > + > + /** set - Update the specified shaper, if it exists Why "if it exists" ? Core should make sure it exists, no? In addition to ops and state, the device will likely need to express capabilities of some sort. So that the core can do some work for the drivers and in due course we can expose them to user space for discoverability.
On Fri, 10 May 2024 13:05:41 +0200 Paolo Abeni wrote: > And the latter looks IMHO the simple/better. At that point I would > probably drop the 'add' op and would rename 'delete' as > 'reset': > > int (*set)(struct net_device *dev, int how_many, const u32 *handles, > const struct net_shaper_info *shapers, > struct netlink_ext_ack *extack); > int (*reset)(struct net_device *dev, int how_many, const u32 *handles, > struct netlink_ext_ack *extack); > int (*move)(struct net_device *dev, int how_many, const u32 *handles, > const u32 *new_parent_handles, > struct netlink_ext_ack *extack); > > An NIC with 'static' shapers can implement a dummy move always > returning EOPNOTSUPP and eventually filling a detailed extack. > > NIC without any constraints on mixing and matching different kind of > shapers could implement the above as a loop over whatever they will do > for the corresponding 'single shaper op' > > NIC with constrains alike the one you pointed out could validate the > final state before atomically applying the specified operation. > > After a successful 'reset' operation, the kernel could drop any data > it retains/caches for the relevant shapers - the current idea is to > keep a copy of all successfully configured shaper_info in a xarray, > using the 'handle' as the index. IMHO this is more confusing that the current API, maybe we just need better driver-facing documentation? Deleting a node from the hierarchy doesn't delete the HW, same as deleting a TCAM entry doesn't chisel off a part of the chip. If you want to switch from WRR to SP you'd need to delete all the weight nodes and insert SP nodes as children. If we modeled mux nodes as multi-entry nodes explicitly it may be easier. Meaning make the scheduling and weights part of the mux node, not its children.
Hi, Thank you for your time and reply. On Tue, 2024-05-28 at 10:18 -0700, Jakub Kicinski wrote: > > + u32 priority; /* scheduling strict priority */ > > + u32 weight; /* scheduling WRR weight*/ > > I wonder if we should somehow more clearly specify what a node can do. > Like Andrew pointed out, if we have a WRR node, presumably the weights > go on the children, since there's only one weigh param. But then the > WRR node itself is either empty (no params) or has rate params... which > is odd. > > Maybe shaping nodes and RR nodes should be separate node classes, > somehow? Possibly clarifying the meaning of 'weight' field would help? It means: this node is scheduled WRR according to the specified weight among the sibling shapers with the same priority. I think it's quite simpler than introducing different node classes with separate handling. My understanding is that the latter would help only to workaround some H/W limitation and will make the implementation more difficult for more capable H/W. What kind of problems do you foresee with the above definition? > > +}; > > + > > +/** > > + * enum net_shaper_scope - the different scopes where a shaper could be attached > > + * @NET_SHAPER_SCOPE_PORT: The root shaper for the whole H/W. > > + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network device. > > + * @NET_SHAPER_SCOPE_VF: The shaper is attached to the given virtual > > + * function. > > + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple queues under the > > + * same device. > > + * @NET_SHAPER_SCOPE_QUEUE: The shaper is attached to the given device queue. > > I wonder if we need traffic class? Some devices may have two schedulers, > one from the host interfaces (PCIe) into the device buffer. And then > from the device buffer independently into the wire. I feel like I'm really missing your point here. How would you use traffic class? And how the 2 schedulers come into play here? Each of them will be tied to one or more of the scopes above, why exposing H/W details that will not expand user visible features? > > + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on > > + * PF devices, usually inside the host/hypervisor. > > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and > > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices. > > + */ > > +enum net_shaper_scope { > > + NET_SHAPER_SCOPE_PORT, > > + NET_SHAPER_SCOPE_NETDEV, > > + NET_SHAPER_SCOPE_VF, > > I realized now that we do indeed need this VF node (if we want to > internally express the legacy SRIOV NDOs in this API), as much > as I hate it. Could you annotate somehow my nack on ever exposing > the ability to hook on the VF to user space? This work sparked from the need to allow configuring a shaper on specific queues of some VF from the host. I hope this is not what you are nacking here? Could you please elaborate a bit what concern you with 'hook on the VF to user space'? Would the ability of attaching a shaper to the VF from the host hit your nack? > > > + NET_SHAPER_SCOPE_QUEUE_GROUP, > > We may need a definition for a queue group. Did I suggest this? I think this was mentioned separately by you, John Fastabend and Intel. > Isn't queue group just a bunch of queues feeding a trivial RR node? > Why does it need to be a "scope"? The goal is allowing arbitrary manipulation at the queue group level. e.g. you can have different queue groups with different priority, or weigh or shaping, and below them arbitrary shaping at the queue level. Note that a similar concept could be introduced for device (or VFs) groups. Why would you like to constraint the features avail at the queue groups? > > > + NET_SHAPER_SCOPE_QUEUE, > > +}; > > + > > +/** > > + * struct net_shaper_ops - Operations on device H/W shapers > > + * @add: Creates a new shaper in the specified scope. > > "in a scope"? Isn't the scope just defining the ingress and egress > points of the scheduling hierarchy? This is purely lexical matter, right? The scope, and more specifically the full 'handle' comprising more scoped-related information, specifies 'where' the shaper is located. Do you have a suggested alternative wording? > Also your example moves schedulers from queue scope to queue group > scope. In the example, this part creates/enables a shaper at the queue level: u32 ghandle = shaper_make_handle(NET_SHAPER_SCOPE_QUEUE_GROUP, 0, 0); dev->shaper_ops->add(dev, ghandle, &ginfo); and this: u32 handle = shaper_make_handle(NET_SHAPER_SCOPE_QUEUE, 0, queue_id); //... dev->netshaper_ops->move(dev, qhandle, ghandle, NULL); changes the _parent_ of qhandle setting it to the previously creates queue group shaper. qhandle initial/implicit/default parent was the device scope shaper. The scope of the queue shaper remains unchanged. An I misunderstanding your point? > > > + * @set: Modify the existing shaper. > > + * @delete: Delete the specified shaper. > > + * @move: Move an existing shaper under a different parent. > > + * > > + * The initial shaping configuration ad device initialization is empty/ > > and > > > + * a no-op/does not constraint the b/w in any way. > > + * The network core keeps track of the applied user-configuration in > > + * per device storage. > > "keeps track .. per device" -- "storage" may make people think NVM. > > > + * Each shaper is uniquely identified within the device with an 'handle', > > + * dependent on the shaper scope and other data, see @shaper_make_handle() > > + */ > > +struct net_shaper_ops { > > + /** add - Add a shaper inside the shaper hierarchy > > + * @dev: netdevice to operate on > > + * @handle: the shaper indetifier > > + * @shaper: configuration of shaper > > + * @extack: Netlink extended ACK for reporting errors. > > + * > > + * Return: > > + * * 0 on success > > + * * %-EOPNOTSUPP - Operation is not supported by hardware, driver, > > + * or core for any reason. @extack should be set to > > + * text describing the reason. > > + * * Other negative error values on failure. > > + * > > + * Examples or reasons this operation may fail include: > > + * * H/W resources limits. > > + * * Can’t respect the requested bw limits. > > + */ > > + int (*add)(struct net_device *dev, u32 handle, > > + const struct net_shaper_info *shaper, > > + struct netlink_ext_ack *extack); > > + > > + /** set - Update the specified shaper, if it exists > > Why "if it exists" ? Core should make sure it exists, no? > > In addition to ops and state, the device will likely need to express > capabilities of some sort. So that the core can do some work for the > drivers and in due course we can expose them to user space for > discoverability. What kind of capabilities are you thinking about? supported scopes? supported metrics? What else? I feel like there is a lot of mixed/partial kind of support which is hard to express in a formal way but would fit nicely an extended ack for a failing op - as the SP/WRR constrains Andrew reported. Do we need to introduce this introspection support from the start? I think that having a few H/W implementations around would help (at least me) understanding which properties could relevant here. Thanks, Paolo
On Fri, 31 May 2024 11:22:46 +0200 Paolo Abeni wrote: > On Tue, 2024-05-28 at 10:18 -0700, Jakub Kicinski wrote: > > > + u32 priority; /* scheduling strict priority */ > > > + u32 weight; /* scheduling WRR weight*/ > > > > I wonder if we should somehow more clearly specify what a node can do. > > Like Andrew pointed out, if we have a WRR node, presumably the weights > > go on the children, since there's only one weigh param. But then the > > WRR node itself is either empty (no params) or has rate params... which > > is odd. > > > > Maybe shaping nodes and RR nodes should be separate node classes, > > somehow? > > Possibly clarifying the meaning of 'weight' field would help? > It means: this node is scheduled WRR according to the specified weight > among the sibling shapers with the same priority. > > I think it's quite simpler than introducing different node classes with > separate handling. My understanding is that the latter would help only > to workaround some H/W limitation and will make the implementation more > difficult for more capable H/W. > > What kind of problems do you foresee with the above definition? The problem Andrew mentioned, basically. There may not be a path to transition one fully offloaded hierarchy to another (e.g. switching strict prio to WRR). TBH I haven't really grasped your proposal in: https://lore.kernel.org/all/db51b7ccff835dd5a96293fb84d527be081de062.camel@redhat.com/ > > > + * enum net_shaper_scope - the different scopes where a shaper could be attached > > > + * @NET_SHAPER_SCOPE_PORT: The root shaper for the whole H/W. > > > + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network device. > > > + * @NET_SHAPER_SCOPE_VF: The shaper is attached to the given virtual > > > + * function. > > > + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple queues under the > > > + * same device. > > > + * @NET_SHAPER_SCOPE_QUEUE: The shaper is attached to the given device queue. > > > > I wonder if we need traffic class? Some devices may have two schedulers, > > one from the host interfaces (PCIe) into the device buffer. And then > > from the device buffer independently into the wire. > > I feel like I'm really missing your point here. How would you use > traffic class? And how the 2 schedulers come into play here? Each of > them will be tied to one or more of the scopes above, why exposing H/W > details that will not expand user visible features? I was just thinking aloud, I'm not sure anyone would ever use a single host queue to service (ingress) traffic from multiple TCs. Or if any HW actually supports that. We can add it later. > > > + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on > > > + * PF devices, usually inside the host/hypervisor. > > > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and > > > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices. > > > + */ > > > +enum net_shaper_scope { > > > + NET_SHAPER_SCOPE_PORT, > > > + NET_SHAPER_SCOPE_NETDEV, > > > + NET_SHAPER_SCOPE_VF, > > > > I realized now that we do indeed need this VF node (if we want to > > internally express the legacy SRIOV NDOs in this API), as much > > as I hate it. Could you annotate somehow my nack on ever exposing > > the ability to hook on the VF to user space? > > This work sparked from the need to allow configuring a shaper on > specific queues of some VF from the host. I hope this is not what you > are nacking here? Could you please elaborate a bit what concern you > with 'hook on the VF to user space'? Would the ability of attaching a > shaper to the VF from the host hit your nack? Queue configuration, for the VF, from the hypervisor? I thought it was from the VF. In any case, hypervisor has the representors. Use the representor's NETDEV scope? > > > + NET_SHAPER_SCOPE_QUEUE_GROUP, > > > > We may need a definition for a queue group. Did I suggest this? > > I think this was mentioned separately by you, John Fastabend and Intel. Oh ugh, I can't type. I think I meant to say "Why do we need..." > > Isn't queue group just a bunch of queues feeding a trivial RR node? > > Why does it need to be a "scope"? > > The goal is allowing arbitrary manipulation at the queue group level. > e.g. you can have different queue groups with different priority, or > weigh or shaping, and below them arbitrary shaping at the queue level. > > Note that a similar concept could be introduced for device (or VFs) > groups. > > Why would you like to constraint the features avail at the queue > groups? Wait! You don't have a way to create pure RR nodes other than queue group now? Perhaps that's what I'm missing... > > > + NET_SHAPER_SCOPE_QUEUE, > > > +}; > > > + > > > +/** > > > + * struct net_shaper_ops - Operations on device H/W shapers > > > + * @add: Creates a new shaper in the specified scope. > > > > "in a scope"? Isn't the scope just defining the ingress and egress > > points of the scheduling hierarchy? > > This is purely lexical matter, right? The scope, and more specifically > the full 'handle' comprising more scoped-related information, specifies > 'where' the shaper is located. Do you have a suggested alternative > wording? ... and also what confused me here. How are you going to do 2 layers of grouping with arbitrary shaping? We need arbitrary inner nodes. Unless I'm missing a trick. > > Also your example moves schedulers from queue scope to queue group > > scope. > > In the example, this part creates/enables a shaper at the queue level: > > u32 ghandle = shaper_make_handle(NET_SHAPER_SCOPE_QUEUE_GROUP, 0, 0); > dev->shaper_ops->add(dev, ghandle, &ginfo); > > and this: > > u32 handle = shaper_make_handle(NET_SHAPER_SCOPE_QUEUE, 0, queue_id); > //... > dev->netshaper_ops->move(dev, qhandle, ghandle, NULL); > > changes the _parent_ of qhandle setting it to the previously creates > queue group shaper. qhandle initial/implicit/default parent was the > device scope shaper. > > The scope of the queue shaper remains unchanged. An I misunderstanding > your point? > > > Why "if it exists" ? Core should make sure it exists, no? > > > > In addition to ops and state, the device will likely need to express > > capabilities of some sort. So that the core can do some work for the > > drivers and in due course we can expose them to user space for > > discoverability. > > What kind of capabilities are you thinking about? supported scopes? > supported metrics? What else? I feel like there is a lot of > mixed/partial kind of support which is hard to express in a formal way > but would fit nicely an extended ack for a failing op - as the SP/WRR > constrains Andrew reported. > > Do we need to introduce this introspection support from the start? I > think that having a few H/W implementations around would help (at least > me) understanding which properties could relevant here. Try to write good tests which can run on HW for more than one vendor. The introspection and capabilities will become apparent.
Hi, It looks like most of the open points here are related lack of clarity or misunderstanding. I propose to discuss them in the upcoming reviewer's meeting. Some replies below, just in case I magically achieved superior written natural language skills meanwhile ;) On Fri, 2024-05-31 at 09:00 -0700, Jakub Kicinski wrote: > On Fri, 31 May 2024 11:22:46 +0200 Paolo Abeni wrote: > > On Tue, 2024-05-28 at 10:18 -0700, Jakub Kicinski wrote: > > > > + u32 priority; /* scheduling strict priority */ > > > > + u32 weight; /* scheduling WRR weight*/ > > > > > > I wonder if we should somehow more clearly specify what a node can do. > > > Like Andrew pointed out, if we have a WRR node, presumably the weights > > > go on the children, since there's only one weigh param. But then the > > > WRR node itself is either empty (no params) or has rate params... which > > > is odd. > > > > > > Maybe shaping nodes and RR nodes should be separate node classes, > > > somehow? > > > > Possibly clarifying the meaning of 'weight' field would help? > > It means: this node is scheduled WRR according to the specified weight > > among the sibling shapers with the same priority. > > > > I think it's quite simpler than introducing different node classes with > > separate handling. My understanding is that the latter would help only > > to workaround some H/W limitation and will make the implementation more > > difficult for more capable H/W. > > > > What kind of problems do you foresee with the above definition? > > The problem Andrew mentioned, basically. There may not be a path to > transition one fully offloaded hierarchy to another (e.g. switching > strict prio to WRR). TBH I haven't really grasped your proposal in: > https://lore.kernel.org/all/db51b7ccff835dd5a96293fb84d527be081de062.camel@redhat.com/ The problem Andrew reported is that some H/W, in some configuration, can't change per-queue shapers individually. The solution proposed in the above link is to let the API change multiple shapers with a single op "atomically". > > > > + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on > > > > + * PF devices, usually inside the host/hypervisor. > > > > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and > > > > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices. > > > > + */ > > > > +enum net_shaper_scope { > > > > + NET_SHAPER_SCOPE_PORT, > > > > + NET_SHAPER_SCOPE_NETDEV, > > > > + NET_SHAPER_SCOPE_VF, > > > > > > I realized now that we do indeed need this VF node (if we want to > > > internally express the legacy SRIOV NDOs in this API), as much > > > as I hate it. Could you annotate somehow my nack on ever exposing > > > the ability to hook on the VF to user space? > > > > This work sparked from the need to allow configuring a shaper on > > specific queues of some VF from the host. I hope this is not what you > > are nacking here? Could you please elaborate a bit what concern you > > with 'hook on the VF to user space'? Would the ability of attaching a > > shaper to the VF from the host hit your nack? > > Queue configuration, for the VF, from the hypervisor? > I thought it was from the VF. > In any case, hypervisor has the representors. > Use the representor's NETDEV scope? It looks like even the NETDEV scope + representor alternative should fit the intended use-case. > > > > + NET_SHAPER_SCOPE_QUEUE_GROUP, > > > > > > We may need a definition for a queue group. Did I suggest this? > > > > I think this was mentioned separately by you, John Fastabend and Intel. > > Oh ugh, I can't type. I think I meant to say "Why do we need..." > > > > Isn't queue group just a bunch of queues feeding a trivial RR node? > > > Why does it need to be a "scope"? > > > > The goal is allowing arbitrary manipulation at the queue group level. > > e.g. you can have different queue groups with different priority, or > > weigh or shaping, and below them arbitrary shaping at the queue level. > > > > Note that a similar concept could be introduced for device (or VFs) > > groups. > > > > Why would you like to constraint the features avail at the queue > > groups? > > Wait! You don't have a way to create pure RR nodes other than queue > group now? Perhaps that's what I'm missing... No, it's not needed. To have RR on some queues, just set the same weight and priority on them. Queue groups could be used to do something more complex, e.g. shaping on the specified set of RR queues. > > > > + NET_SHAPER_SCOPE_QUEUE, > > > > +}; > > > > + > > > > +/** > > > > + * struct net_shaper_ops - Operations on device H/W shapers > > > > + * @add: Creates a new shaper in the specified scope. > > > > > > "in a scope"? Isn't the scope just defining the ingress and egress > > > points of the scheduling hierarchy? > > > > This is purely lexical matter, right? The scope, and more specifically > > the full 'handle' comprising more scoped-related information, specifies > > 'where' the shaper is located. Do you have a suggested alternative > > wording? > > ... and also what confused me here. > > How are you going to do 2 layers of grouping with arbitrary shaping? > We need arbitrary inner nodes. Unless I'm missing a trick. I guess this part really needs some talk. I also don't understand your doubt above. > > I hope we can discuss this tomorrow (and the other points, if/as needed). Thanks! Paolo
On Mon, 03 Jun 2024 13:11:39 +0200 Paolo Abeni wrote: > > ... and also what confused me here. > > > > How are you going to do 2 layers of grouping with arbitrary shaping? > > We need arbitrary inner nodes. Unless I'm missing a trick. > > I guess this part really needs some talk. I also don't understand your > doubt above. Each layer of shaping corresponds to some level of privilege or control. Saying that we only need one layer is like saying that we only need one layer of cgroups. For example - application may want to WRR/shape between two sets of queues (e.g. data and control) and then we may wrap that application up in a container, and WRR/shape multiple containers together.
On Wed, 2024-05-08 at 22:20 +0200, Paolo Abeni wrote: > +/** > + * struct net_shaper_info - represents a shaping node on the NIC H/W > + * @metric: Specify if the bw limits refers to PPS or BPS > + * @bw_min: Minimum guaranteed rate for this shaper > + * @bw_max: Maximum peak bw allowed for this shaper > + * @burst: Maximum burst for the peek rate of this shaper > + * @priority: Scheduling priority for this shaper > + * @weight: Scheduling weight for this shaper > + */ > +struct net_shaper_info { > + enum net_shaper_metric metric; > + u64 bw_min; /* minimum guaranteed bandwidth, according to metric */ > + u64 bw_max; /* maximum allowed bandwidth */ > + u32 burst; /* maximum burst in bytes for bw_max */ 'burst' really should be u64 if it can deal with bytes. In a 400Gbps link, u32 really is peanuts. > +/** > + * enum net_shaper_scope - the different scopes where a shaper could be attached > + * @NET_SHAPER_SCOPE_PORT: The root shaper for the whole H/W. > + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network device. > + * @NET_SHAPER_SCOPE_VF: The shaper is attached to the given virtual > + * function. > + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple queues under the > + * same device. > + * @NET_SHAPER_SCOPE_QUEUE: The shaper is attached to the given device queue. > + * > + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on > + * PF devices, usually inside the host/hypervisor. > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices. > + */ > +enum net_shaper_scope { > + NET_SHAPER_SCOPE_PORT, > + NET_SHAPER_SCOPE_NETDEV, > + NET_SHAPER_SCOPE_VF, > + NET_SHAPER_SCOPE_QUEUE_GROUP, > + NET_SHAPER_SCOPE_QUEUE, > +}; How would modelling groups of VFs (as implemented in [1]) look like with this proposal? I could imagine a NET_SHAPER_SCOPE_VF_GROUP scope, with a shared shaper across multiple VFs. How would managing membership of VFs in a group look like? Will the devlink API continue to be used for that? Or will something else be introduced? Looking a bit into the future now... I am nowadays thinking about extending the mlx5 VF group rate limit feature to support VFs from multiple PFs from the same NIC (the hardware can be configured to use a shared shaper across multiple ports), how could that feature be represented in this API, given that ops relate to a netdevice? Which netdevice should be used for this scenario? In that world, there would be multiple 'root'-level nodes in this hierarchy, each corresponding to a group of VFs from potentially multiple PFs. [1] https://lore.kernel.org/netdev/1622636251-29892-1-git-send-email-dlinkin@nvidia.com/T/#u Cosmin.
On Wed, 2024-06-05 at 15:04 +0000, Cosmin Ratiu wrote: > On Wed, 2024-05-08 at 22:20 +0200, Paolo Abeni wrote: > > > +/** > > + * struct net_shaper_info - represents a shaping node on the NIC H/W > > + * @metric: Specify if the bw limits refers to PPS or BPS > > + * @bw_min: Minimum guaranteed rate for this shaper > > + * @bw_max: Maximum peak bw allowed for this shaper > > + * @burst: Maximum burst for the peek rate of this shaper > > + * @priority: Scheduling priority for this shaper > > + * @weight: Scheduling weight for this shaper > > + */ > > +struct net_shaper_info { > > + enum net_shaper_metric metric; > > + u64 bw_min; /* minimum guaranteed bandwidth, according to metric */ > > + u64 bw_max; /* maximum allowed bandwidth */ > > + u32 burst; /* maximum burst in bytes for bw_max */ > > 'burst' really should be u64 if it can deal with bytes. In a 400Gbps > link, u32 really is peanuts. > > > +/** > > + * enum net_shaper_scope - the different scopes where a shaper could be attached > > + * @NET_SHAPER_SCOPE_PORT: The root shaper for the whole H/W. > > + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network device. > > + * @NET_SHAPER_SCOPE_VF: The shaper is attached to the given virtual > > + * function. > > + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple queues under the > > + * same device. > > + * @NET_SHAPER_SCOPE_QUEUE: The shaper is attached to the given device queue. > > + * > > + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on > > + * PF devices, usually inside the host/hypervisor. > > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and > > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices. > > + */ > > +enum net_shaper_scope { > > + NET_SHAPER_SCOPE_PORT, > > + NET_SHAPER_SCOPE_NETDEV, > > + NET_SHAPER_SCOPE_VF, > > + NET_SHAPER_SCOPE_QUEUE_GROUP, > > + NET_SHAPER_SCOPE_QUEUE, > > +}; > > How would modelling groups of VFs (as implemented in [1]) look like > with this proposal? > I could imagine a NET_SHAPER_SCOPE_VF_GROUP scope, with a shared shaper > across multiple VFs. Following-up yday reviewer mtg - which was spent mainly on this topic - - the current direction is to replace NET_SHAPER_SCOPE_QUEUE_GROUP with a more generic 'scope', grouping of either queues, VF/netdev or even other groups (allowing nesting). > How would managing membership of VFs in a group > look like? Will the devlink API continue to be used for that? Or will > something else be introduced? The idea is to introduce a new generic netlink interface, yaml-based, to expose these features to user-space. > Looking a bit into the future now... > I am nowadays thinking about extending the mlx5 VF group rate limit > feature to support VFs from multiple PFs from the same NIC (the > hardware can be configured to use a shared shaper across multiple > ports), how could that feature be represented in this API, given that > ops relate to a netdevice? Which netdevice should be used for this > scenario? I must admit we[1] haven't thought yet about the scenario you describe above. I guess we could encode the PF number and the VF number in the handle major/minor and operate on any PF device belonging to the same silicon, WDYT? Thanks, Paolo [1] or at least myself;)
On Wed, 05 Jun 2024 17:52:32 +0200 Paolo Abeni wrote: > > Looking a bit into the future now... > > I am nowadays thinking about extending the mlx5 VF group rate limit > > feature to support VFs from multiple PFs from the same NIC (the > > hardware can be configured to use a shared shaper across multiple > > ports), how could that feature be represented in this API, given that > > ops relate to a netdevice? Which netdevice should be used for this > > scenario? > > I must admit we[1] haven't thought yet about the scenario you describe > above. I guess we could encode the PF number and the VF number in the > handle major/minor and operate on any PF device belonging to the same > silicon, WDYT? Just a minor clarification. _Internally_ we can support expressing VF / PF shaping. uAPI for that continues to be devlink, right?
Wed, Jun 05, 2024 at 05:52:32PM CEST, pabeni@redhat.com wrote: >On Wed, 2024-06-05 at 15:04 +0000, Cosmin Ratiu wrote: >> On Wed, 2024-05-08 at 22:20 +0200, Paolo Abeni wrote: >> >> > +/** >> > + * struct net_shaper_info - represents a shaping node on the NIC H/W >> > + * @metric: Specify if the bw limits refers to PPS or BPS >> > + * @bw_min: Minimum guaranteed rate for this shaper >> > + * @bw_max: Maximum peak bw allowed for this shaper >> > + * @burst: Maximum burst for the peek rate of this shaper >> > + * @priority: Scheduling priority for this shaper >> > + * @weight: Scheduling weight for this shaper >> > + */ >> > +struct net_shaper_info { >> > + enum net_shaper_metric metric; >> > + u64 bw_min; /* minimum guaranteed bandwidth, according to metric */ >> > + u64 bw_max; /* maximum allowed bandwidth */ >> > + u32 burst; /* maximum burst in bytes for bw_max */ >> >> 'burst' really should be u64 if it can deal with bytes. In a 400Gbps >> link, u32 really is peanuts. >> >> > +/** >> > + * enum net_shaper_scope - the different scopes where a shaper could be attached >> > + * @NET_SHAPER_SCOPE_PORT: The root shaper for the whole H/W. >> > + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network device. >> > + * @NET_SHAPER_SCOPE_VF: The shaper is attached to the given virtual >> > + * function. >> > + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple queues under the >> > + * same device. >> > + * @NET_SHAPER_SCOPE_QUEUE: The shaper is attached to the given device queue. >> > + * >> > + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on >> > + * PF devices, usually inside the host/hypervisor. >> > + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and >> > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices. >> > + */ >> > +enum net_shaper_scope { >> > + NET_SHAPER_SCOPE_PORT, >> > + NET_SHAPER_SCOPE_NETDEV, >> > + NET_SHAPER_SCOPE_VF, >> > + NET_SHAPER_SCOPE_QUEUE_GROUP, >> > + NET_SHAPER_SCOPE_QUEUE, >> > +}; >> >> How would modelling groups of VFs (as implemented in [1]) look like >> with this proposal? >> I could imagine a NET_SHAPER_SCOPE_VF_GROUP scope, with a shared shaper >> across multiple VFs. > >Following-up yday reviewer mtg - which was spent mainly on this topic - >- the current direction is to replace NET_SHAPER_SCOPE_QUEUE_GROUP with >a more generic 'scope', grouping of either queues, VF/netdev or even >other groups (allowing nesting). > >> How would managing membership of VFs in a group >> look like? Will the devlink API continue to be used for that? Or will >> something else be introduced? > >The idea is to introduce a new generic netlink interface, yaml-based, >to expose these features to user-space. > >> Looking a bit into the future now... >> I am nowadays thinking about extending the mlx5 VF group rate limit >> feature to support VFs from multiple PFs from the same NIC (the >> hardware can be configured to use a shared shaper across multiple >> ports), how could that feature be represented in this API, given that >> ops relate to a netdevice? Which netdevice should be used for this >> scenario? > >I must admit we[1] haven't thought yet about the scenario you describe >above. I guess we could encode the PF number and the VF number in the >handle major/minor and operate on any PF device belonging to the same >silicon, WDYT? Sometimes, there is no netdevice at all. The infra still should work I believe. > >Thanks, > >Paolo > >[1] or at least myself;) >
Wed, May 08, 2024 at 10:20:51PM CEST, pabeni@redhat.com wrote:
>+ * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices.
This is interesting. Do you mean you can put a shaper on a specific VF
queue from hypervisor? I was thinking about it recently, I have some
concerns.
In general a nic user expects all queues to behave in the same way,
unless he does some sort of configuration (dcb for example).
VF (the VM side) is not different, it's also a nic.
If you allow the hypervisor to configure shapers on specifig VF queues,
you are breaking VM's user expectation. He did not configure any
different queue treating, yet they are treated differently.
Is that okay? What do you think?
Hi, On 7/30/24 14:18, Jiri Pirko wrote: > Wed, May 08, 2024 at 10:20:51PM CEST, pabeni@redhat.com wrote: > >> + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices. > > This is interesting. Do you mean you can put a shaper on a specific VF > queue from hypervisor? I was thinking about it recently, I have some > concerns. > > In general a nic user expects all queues to behave in the same way, > unless he does some sort of configuration (dcb for example). > VF (the VM side) is not different, it's also a nic. > > If you allow the hypervisor to configure shapers on specifig VF queues, > you are breaking VM's user expectation. He did not configure any > different queue treating, yet they are treated differently. > > Is that okay? What do you think? I'm unsure why you are looking to this old version... The idea to allow configuring the VF's queues from the hypervisor has been removed from the most recent version, for roughly the same reasons you mention above. Cheers, Paolo
On 7/30/24 14:10, Jiri Pirko wrote: > Wed, Jun 05, 2024 at 05:52:32PM CEST, pabeni@redhat.com wrote: >> On Wed, 2024-06-05 at 15:04 +0000, Cosmin Ratiu wrote: >>> On Wed, 2024-05-08 at 22:20 +0200, Paolo Abeni wrote: >>> >>>> +/** >>>> + * struct net_shaper_info - represents a shaping node on the NIC H/W >>>> + * @metric: Specify if the bw limits refers to PPS or BPS >>>> + * @bw_min: Minimum guaranteed rate for this shaper >>>> + * @bw_max: Maximum peak bw allowed for this shaper >>>> + * @burst: Maximum burst for the peek rate of this shaper >>>> + * @priority: Scheduling priority for this shaper >>>> + * @weight: Scheduling weight for this shaper >>>> + */ >>>> +struct net_shaper_info { >>>> + enum net_shaper_metric metric; >>>> + u64 bw_min; /* minimum guaranteed bandwidth, according to metric */ >>>> + u64 bw_max; /* maximum allowed bandwidth */ >>>> + u32 burst; /* maximum burst in bytes for bw_max */ >>> >>> 'burst' really should be u64 if it can deal with bytes. In a 400Gbps >>> link, u32 really is peanuts. >>> >>>> +/** >>>> + * enum net_shaper_scope - the different scopes where a shaper could be attached >>>> + * @NET_SHAPER_SCOPE_PORT: The root shaper for the whole H/W. >>>> + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network device. >>>> + * @NET_SHAPER_SCOPE_VF: The shaper is attached to the given virtual >>>> + * function. >>>> + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple queues under the >>>> + * same device. >>>> + * @NET_SHAPER_SCOPE_QUEUE: The shaper is attached to the given device queue. >>>> + * >>>> + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on >>>> + * PF devices, usually inside the host/hypervisor. >>>> + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and >>>> + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices. >>>> + */ >>>> +enum net_shaper_scope { >>>> + NET_SHAPER_SCOPE_PORT, >>>> + NET_SHAPER_SCOPE_NETDEV, >>>> + NET_SHAPER_SCOPE_VF, >>>> + NET_SHAPER_SCOPE_QUEUE_GROUP, >>>> + NET_SHAPER_SCOPE_QUEUE, >>>> +}; >>> >>> How would modelling groups of VFs (as implemented in [1]) look like >>> with this proposal? >>> I could imagine a NET_SHAPER_SCOPE_VF_GROUP scope, with a shared shaper >>> across multiple VFs. >> >> Following-up yday reviewer mtg - which was spent mainly on this topic - >> - the current direction is to replace NET_SHAPER_SCOPE_QUEUE_GROUP with >> a more generic 'scope', grouping of either queues, VF/netdev or even >> other groups (allowing nesting). >> >>> How would managing membership of VFs in a group >>> look like? Will the devlink API continue to be used for that? Or will >>> something else be introduced? >> >> The idea is to introduce a new generic netlink interface, yaml-based, >> to expose these features to user-space. >> >>> Looking a bit into the future now... >>> I am nowadays thinking about extending the mlx5 VF group rate limit >>> feature to support VFs from multiple PFs from the same NIC (the >>> hardware can be configured to use a shared shaper across multiple >>> ports), how could that feature be represented in this API, given that >>> ops relate to a netdevice? Which netdevice should be used for this >>> scenario? >> >> I must admit we[1] haven't thought yet about the scenario you describe >> above. I guess we could encode the PF number and the VF number in the >> handle major/minor and operate on any PF device belonging to the same >> silicon, WDYT? > > Sometimes, there is no netdevice at all. The infra still should work I > believe. Note that in the most recent incarnation of the shaper APIs has been removed any support for shaper 'above' the network device level (e.g. no device/VFs groups). The idea is that devlink should be used for such scenarios. Cheers, Paolo
Tue, Jul 30, 2024 at 03:34:24PM CEST, pabeni@redhat.com wrote: >Hi, > >On 7/30/24 14:18, Jiri Pirko wrote: >> Wed, May 08, 2024 at 10:20:51PM CEST, pabeni@redhat.com wrote: >> >> > + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices. >> >> This is interesting. Do you mean you can put a shaper on a specific VF >> queue from hypervisor? I was thinking about it recently, I have some >> concerns. >> >> In general a nic user expects all queues to behave in the same way, >> unless he does some sort of configuration (dcb for example). >> VF (the VM side) is not different, it's also a nic. >> >> If you allow the hypervisor to configure shapers on specifig VF queues, >> you are breaking VM's user expectation. He did not configure any >> different queue treating, yet they are treated differently. >> >> Is that okay? What do you think? > >I'm unsure why you are looking to this old version... > >The idea to allow configuring the VF's queues from the hypervisor has been >removed from the most recent version, for roughly the same reasons you >mention above. Okay. Thanks! > >Cheers, > >Paolo > >
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index cf261fb89d73..39f66af014be 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -79,6 +79,8 @@ struct xdp_buff; struct xdp_frame; struct xdp_metadata_ops; struct xdp_md; +struct net_shaper_ops; +struct net_shaper_data; typedef u32 xdp_features_t; @@ -1596,6 +1598,13 @@ struct net_device_ops { int (*ndo_hwtstamp_set)(struct net_device *dev, struct kernel_hwtstamp_config *kernel_config, struct netlink_ext_ack *extack); + +#if IS_ENABLED(CONFIG_NET_SHAPER) + /** @net_shaper_ops: Device shaping offload operations + * see include/net/net_shapers.h + */ + const struct net_shaper_ops *net_shaper_ops; +#endif }; /** @@ -2403,6 +2412,12 @@ struct net_device { /** @page_pools: page pools created for this netdevice */ struct hlist_head page_pools; #endif +#if IS_ENABLED(CONFIG_NET_SHAPER) + /** @net_shaper_data: data tracking the current shaper status + * see include/net/net_shapers.h + */ + struct net_shaper_data *net_shaper_data; +#endif }; #define to_net_dev(d) container_of(d, struct net_device, dev) diff --git a/include/net/net_shaper.h b/include/net/net_shaper.h new file mode 100644 index 000000000000..a4fbadd99870 --- /dev/null +++ b/include/net/net_shaper.h @@ -0,0 +1,206 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ + +#ifndef _NET_SHAPER_H_ +#define _NET_SHAPER_H_ + +#include <linux/types.h> +#include <linux/netdevice.h> +#include <linux/netlink.h> + +/** + * enum net_shaper_metric - the metric of the shaper + * @NET_SHAPER_METRIC_PPS: Shaper operates on a packets per second basis + * @NET_SHAPER_METRIC_BPS: Shaper operates on a bits per second basis + */ +enum net_shaper_metric { + NET_SHAPER_METRIC_PPS, + NET_SHAPER_METRIC_BPS +}; + +/** + * struct net_shaper_info - represents a shaping node on the NIC H/W + * @metric: Specify if the bw limits refers to PPS or BPS + * @bw_min: Minimum guaranteed rate for this shaper + * @bw_max: Maximum peak bw allowed for this shaper + * @burst: Maximum burst for the peek rate of this shaper + * @priority: Scheduling priority for this shaper + * @weight: Scheduling weight for this shaper + */ +struct net_shaper_info { + enum net_shaper_metric metric; + u64 bw_min; /* minimum guaranteed bandwidth, according to metric */ + u64 bw_max; /* maximum allowed bandwidth */ + u32 burst; /* maximum burst in bytes for bw_max */ + u32 priority; /* scheduling strict priority */ + u32 weight; /* scheduling WRR weight*/ +}; + +/** + * enum net_shaper_scope - the different scopes where a shaper could be attached + * @NET_SHAPER_SCOPE_PORT: The root shaper for the whole H/W. + * @NET_SHAPER_SCOPE_NETDEV: The main shaper for the given network device. + * @NET_SHAPER_SCOPE_VF: The shaper is attached to the given virtual + * function. + * @NET_SHAPER_SCOPE_QUEUE_GROUP: The shaper groups multiple queues under the + * same device. + * @NET_SHAPER_SCOPE_QUEUE: The shaper is attached to the given device queue. + * + * NET_SHAPER_SCOPE_PORT and NET_SHAPER_SCOPE_VF are only available on + * PF devices, usually inside the host/hypervisor. + * NET_SHAPER_SCOPE_NETDEV, NET_SHAPER_SCOPE_QUEUE_GROUP and + * NET_SHAPER_SCOPE_QUEUE are available on both PFs and VFs devices. + */ +enum net_shaper_scope { + NET_SHAPER_SCOPE_PORT, + NET_SHAPER_SCOPE_NETDEV, + NET_SHAPER_SCOPE_VF, + NET_SHAPER_SCOPE_QUEUE_GROUP, + NET_SHAPER_SCOPE_QUEUE, +}; + +/** + * struct net_shaper_ops - Operations on device H/W shapers + * @add: Creates a new shaper in the specified scope. + * @set: Modify the existing shaper. + * @delete: Delete the specified shaper. + * @move: Move an existing shaper under a different parent. + * + * The initial shaping configuration ad device initialization is empty/ + * a no-op/does not constraint the b/w in any way. + * The network core keeps track of the applied user-configuration in + * per device storage. + * + * Each shaper is uniquely identified within the device with an 'handle', + * dependent on the shaper scope and other data, see @shaper_make_handle() + */ +struct net_shaper_ops { + /** add - Add a shaper inside the shaper hierarchy + * @dev: netdevice to operate on + * @handle: the shaper indetifier + * @shaper: configuration of shaper + * @extack: Netlink extended ACK for reporting errors. + * + * Return: + * * 0 on success + * * %-EOPNOTSUPP - Operation is not supported by hardware, driver, + * or core for any reason. @extack should be set to + * text describing the reason. + * * Other negative error values on failure. + * + * Examples or reasons this operation may fail include: + * * H/W resources limits. + * * Can’t respect the requested bw limits. + */ + int (*add)(struct net_device *dev, u32 handle, + const struct net_shaper_info *shaper, + struct netlink_ext_ack *extack); + + /** set - Update the specified shaper, if it exists + * @dev: Netdevice to operate on. + * @handle: the shaper identifier + * @shaper: Configuration of shaper. + * @extack: Netlink extended ACK for reporting errors. + * + * Return: + * * %0 - Success + * * %-EOPNOTSUPP - Operation is not supported by hardware, driver, + * or core for any reason. @extack should be set to + * text describing the reason. + * * Other negative error values on failure. + */ + int (*set)(struct net_device *dev, u32 handle, + const struct net_shaper_info *shaper, + struct netlink_ext_ack *extack); + + /** delete - Removes a shaper from the NIC + * @dev: netdevice to operate on. + * @handle: the shaper identifier + * @extack: Netlink extended ACK for reporting errors. + * + * Return: + * * %0 - Success + * * %-EOPNOTSUPP - Operation is not supported by hardware, driver, + * or core for any reason. @extack should be set to + * text describing the reason. + * * Other negative error value on failure. + */ + int (*delete)(struct net_device *dev, u32 handle, + struct netlink_ext_ack *extack); + + /** Move - change the parent id of the specified shaper + * @dev: netdevice to operate on. + * @handle: unique identifier for the shaper + * @new_parent_id: identifier of the new parent for this shaper + * @extack: Netlink extended ACK for reporting errors. + * + * Move the specified shaper in the hierarchy replacing its + * current parent shaper with @new_parent_id + * + * Return: + * * %0 - Success + * * %-EOPNOTSUPP - Operation is not supported by hardware, driver, + * or core for any reason. @extack should be set to + * text describing the reason. + * * Other negative error values on failure. + */ + int (*move)(struct net_device *dev, u32 handle, + u32 new_parent_handle, struct netlink_ext_ack *extack); +}; + +/** + * net_shaper_make_handle - creates an unique shaper identifier + * @scope: the shaper scope + * @vf: virtual function number + * @id: queue group or queue id + * + * Return: an unique identifier for the shaper + * + * Combines the specified arguments to create an unique identifier for + * the shaper. + * The virtual function number is only used within @NET_SHAPER_SCOPE_VF, + * @NET_SHAPER_SCOPE_QUEUE_GROUP and @NET_SHAPER_SCOPE_QUEUE. + * The @id number is only used for @NET_SHAPER_SCOPE_QUEUE_GROUP and + * @NET_SHAPER_SCOPE_QUEUE, and must be, respectively, the queue group + * identifier or the queue number. + */ +u32 net_shaper_make_handle(enum net_shaper_scope scope, int vf, int id); + +/* + * Examples: + * - set shaping on a given queue + * struct shaper_info info = { }; // fill this + * u32 handle = shaper_make_handle(NET_SHAPER_SCOPE_QUEUE, 0, queue_id); + * dev->shaper_ops->add(dev, handle, &info, NULL); + * + * - create a queue group with a queue group shaping limits. + * Assuming the following topology already exists: + * < netdev shaper > + * / \ + * <queue 0 shaper> . . . <queue N shaper> + * + * struct shaper_info ginfo = { }; // fill this + * u32 ghandle = shaper_make_handle(NET_SHAPER_SCOPE_QUEUE_GROUP, 0, 0); + * dev->shaper_ops->add(dev, ghandle, &ginfo); + * + * // now topology is: + * // < netdev shaper > + * // / | \ + * // / | < newly created shaper > + * // / | + * // <queue 0 shaper> . . . <queue N shaper> + * + * // move a shapers for queues 3..n out of such queue group + * for (i = 0; i <= 2; ++i) { + * u32 qhandle = net_shaper_make_handle(NET_SHAPER_SCOPE_QUEUE, 0, i); + * dev->netshaper_ops->move(dev, qhandle, ghandle, NULL); + * } + * + * // now the topology is: + * // < netdev shaper > + * // / \ + * // < newly created shaper> <queue 3 shaper> .. <queue n shaper> + * // / \ + * // <queue 0 shaper> . . . <queue 2 shaper> + */ +#endif + diff --git a/net/Kconfig b/net/Kconfig index f0a8692496ff..29c6fec54711 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -66,6 +66,9 @@ config SKB_DECRYPTED config SKB_EXTENSIONS bool +config NET_SHAPER + bool + menu "Networking options" source "net/packet/Kconfig"