
[net-next,10/11] net/mlx5e: Implement queue mgmt ops and single channel swap

Message ID 20250116215530.158886-11-saeed@kernel.org (mailing list archive)
State Deferred
Delegated to: Netdev Maintainers
Series [net-next,01/11] net: Kconfig NET_DEVMEM selects GENERIC_ALLOCATOR

Checks

Context Check Description
netdev/series_format success Pull request is its own cover letter
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 7 maintainers not CCed: hawk@kernel.org ast@kernel.org john.fastabend@gmail.com daniel@iogearbox.net andrew+netdev@lunn.ch richardcochran@gmail.com bpf@vger.kernel.org
netdev/build_clang success Errors and warnings before: 1 this patch: 1
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Saeed Mahameed Jan. 16, 2025, 9:55 p.m. UTC
From: Saeed Mahameed <saeedm@nvidia.com>

The bulk of the work is done in mlx5e_queue_mem_alloc, where we allocate
and create the new channel resources, similar to
mlx5e_safe_switch_params, but here we do it for a single channel using
the existing params, essentially cloning the channel.
To swap the old channel with the new one, we deactivate and close the
old channel and then replace it with the new one. Since the swap
procedure doesn't fail in mlx5, we do it all in one place
(mlx5e_queue_start).
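
A condensed sketch of that swap flow, simplified from the patch below
(all helper names are from the existing mlx5e code; the full version,
including locking, is in mlx5e_queue_start() in the diff):

	/* Condensed sketch; see mlx5e_queue_start() in the patch below. */
	old = priv->channels.c[queue_index];
	mlx5e_deactivate_priv_channels(priv);	/* quiesce all channels */
	mlx5e_close_channel(old);		/* close old before activating new */
	priv->channels.c[queue_index] = new->c;	/* install the cloned channel */
	mlx5e_activate_priv_channels(priv);	/* reactivate with the new channel */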

Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 96 +++++++++++++++++++
 1 file changed, 96 insertions(+)

Comments

Jakub Kicinski Jan. 16, 2025, 11:21 p.m. UTC | #1
On Thu, 16 Jan 2025 13:55:28 -0800 Saeed Mahameed wrote:
> +static const struct netdev_queue_mgmt_ops mlx5e_queue_mgmt_ops = {
> +	.ndo_queue_mem_size	=	sizeof(struct mlx5_qmgmt_data),
> +	.ndo_queue_mem_alloc	=	mlx5e_queue_mem_alloc,
> +	.ndo_queue_mem_free	=	mlx5e_queue_mem_free,
> +	.ndo_queue_start	=	mlx5e_queue_start,
> +	.ndo_queue_stop		=	mlx5e_queue_stop,
> +};

We need to pay off some technical debt we accrued before we merge more
queue ops implementations. Specifically the locking needs to move from
under rtnl. Sorry, this is not going in for 6.14.
Saeed Mahameed Jan. 16, 2025, 11:46 p.m. UTC | #2
On 16 Jan 15:21, Jakub Kicinski wrote:
>On Thu, 16 Jan 2025 13:55:28 -0800 Saeed Mahameed wrote:
>> +static const struct netdev_queue_mgmt_ops mlx5e_queue_mgmt_ops = {
>> +	.ndo_queue_mem_size	=	sizeof(struct mlx5_qmgmt_data),
>> +	.ndo_queue_mem_alloc	=	mlx5e_queue_mem_alloc,
>> +	.ndo_queue_mem_free	=	mlx5e_queue_mem_free,
>> +	.ndo_queue_start	=	mlx5e_queue_start,
>> +	.ndo_queue_stop		=	mlx5e_queue_stop,
>> +};
>
>We need to pay off some technical debt we accrued before we merge more
>queue ops implementations. Specifically the locking needs to move from
>under rtnl. Sorry, this is not going in for 6.14.

What technical debt has accrued? I haven't seen any changes in the queue API
since bnxt and gve got merged; what changed since then?

mlx5 doesn't require rtnl; if this is because of the assert, I can remove
it. I don't understand what this series is being deferred for. Please
elaborate: what do I need to do to get it accepted?

Thanks,
Saeed.

>-- 
>pw-bot: defer
Jakub Kicinski Jan. 16, 2025, 11:54 p.m. UTC | #3
On Thu, 16 Jan 2025 15:46:43 -0800 Saeed Mahameed wrote:
> >We need to pay off some technical debt we accrued before we merge more
> >queue ops implementations. Specifically the locking needs to move from
> >under rtnl. Sorry, this is not going in for 6.14.  
> 
> What technical debt accrued ? I haven't seen any changes in queue API since
> bnxt and gve got merged, what changed since then ?
> 
> mlx5 doesn't require rtnl if this is because of the assert, I can remove
> it. I don't understand what this series is being deferred for, please
> elaborate, what do I need to do to get it accepted ?

Remove the dependency on rtnl_lock _in the core kernel_.
Stanislav Fomichev Jan. 24, 2025, 12:39 a.m. UTC | #4
On 01/16, Jakub Kicinski wrote:
> On Thu, 16 Jan 2025 15:46:43 -0800 Saeed Mahameed wrote:
> > >We need to pay off some technical debt we accrued before we merge more
> > >queue ops implementations. Specifically the locking needs to move from
> > >under rtnl. Sorry, this is not going in for 6.14.  
> > 
> > What technical debt accrued ? I haven't seen any changes in queue API since
> > bnxt and gve got merged, what changed since then ?
> > 
> > mlx5 doesn't require rtnl if this is because of the assert, I can remove
> > it. I don't understand what this series is being deferred for, please
> > elaborate, what do I need to do to get it accepted ?
> 
> Remove the dependency on rtnl_lock _in the core kernel_.

IIUC, we want the queue API to move away from rtnl and use only the (new)
netdev lock. Otherwise, removing this dependency in the future might be
complicated. I'll talk to Jakub so we can maybe get something out early
in the next merge window, so you can retest the mlx5 changes on top.
Will that work? (Unless, Saeed, you want to look into that core locking
part yourself.)
Jakub Kicinski Jan. 24, 2025, 12:55 a.m. UTC | #5
On Thu, 23 Jan 2025 16:39:05 -0800 Stanislav Fomichev wrote:
> > > What technical debt accrued ? I haven't seen any changes in queue API since
> > > bnxt and gve got merged, what changed since then ?
> > > 
> > > mlx5 doesn't require rtnl if this is because of the assert, I can remove
> > > it. I don't understand what this series is being deferred for, please
> > > elaborate, what do I need to do to get it accepted ?  
> > 
> > Remove the dependency on rtnl_lock _in the core kernel_.  
> 
> IIUC, we want queue API to move away from rtnl and use only (new) netdev
> lock. Otherwise, removing this dependency in the future might be
> complicated.

Correct. We only have one driver now which reportedly works (gve).
Let's pull queues under optional netdev_lock protection.
Then we can use queue mgmt op support as a carrot for drivers
to convert / test the netdev_lock protection... "compliance".

I added netdev_lock protection for NAPI before the merge window.
Queues are configured in much more ad-hoc fashion, so I think 
the best way to make queue changes netdev_lock safe would be to
wrap all driver ops which are currently under rtnl_lock with
netdev_lock.
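
A minimal sketch of that wrapping idea, purely for illustration (the
wrapper name is made up, and netdev_lock()/netdev_unlock() stand for the
per-netdev instance lock helpers referred to above, not a finalized API):

	/* Illustrative only: take the instance lock around a driver op
	 * that today runs under rtnl_lock.
	 */
	static int netdev_op_locked(struct net_device *dev,
				    int (*op)(struct net_device *dev))
	{
		int err;

		netdev_lock(dev);	/* assumed per-netdev mutex helper */
		err = op(dev);
		netdev_unlock(dev);
		return err;
	}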
Saeed Mahameed Jan. 24, 2025, 3:11 a.m. UTC | #6
On 23 Jan 16:55, Jakub Kicinski wrote:
>On Thu, 23 Jan 2025 16:39:05 -0800 Stanislav Fomichev wrote:
>> > > What technical debt accrued ? I haven't seen any changes in queue API since
>> > > bnxt and gve got merged, what changed since then ?
>> > >
>> > > mlx5 doesn't require rtnl if this is because of the assert, I can remove
>> > > it. I don't understand what this series is being deferred for, please
>> > > elaborate, what do I need to do to get it accepted ?
>> >
>> > Remove the dependency on rtnl_lock _in the core kernel_.
>>
>> IIUC, we want queue API to move away from rtnl and use only (new) netdev
>> lock. Otherwise, removing this dependency in the future might be
>> complicated.
>
>Correct. We only have one driver now which reportedly works (gve).
>Let's pull queues under optional netdev_lock protection.
>Then we can use queue mgmt op support as a carrot for drivers
>to convert / test the netdev_lock protection... "compliance".
>
>I added netdev_lock protection for NAPI before the merge window.
>Queues are configured in much more ad-hoc fashion, so I think
>the best way to make queue changes netdev_lock safe would be to
>wrap all driver ops which are currently under rtnl_lock with
>netdev_lock.

Are you expecting drivers to hold netdev_lock internally?
I was thinking of something more scalable: the queue_mgmt API takes
netdev_lock, and any other place in the stack that can access
"netdev queue config" (e.g. ethtool/netlink/netdev_ops) should grab
netdev_lock as well. This is better for the future, when we want to
reduce rtnl usage in the stack for single-netdev ops where
netdev_lock will be sufficient; otherwise you will have to wait for ALL
drivers to properly use netdev_lock internally to even start thinking of
getting rid of rtnl in some parts of the core stack.
Jakub Kicinski Jan. 24, 2025, 3:26 p.m. UTC | #7
On Thu, 23 Jan 2025 19:11:23 -0800 Saeed Mahameed wrote:
> On 23 Jan 16:55, Jakub Kicinski wrote:
> >> IIUC, we want queue API to move away from rtnl and use only (new) netdev
> >> lock. Otherwise, removing this dependency in the future might be
> >> complicated.  
> >
> >Correct. We only have one driver now which reportedly works (gve).
> >Let's pull queues under optional netdev_lock protection.
> >Then we can use queue mgmt op support as a carrot for drivers
> >to convert / test the netdev_lock protection... "compliance".
> >
> >I added netdev_lock protection for NAPI before the merge window.
> >Queues are configured in much more ad-hoc fashion, so I think
> >the best way to make queue changes netdev_lock safe would be to
> >wrap all driver ops which are currently under rtnl_lock with
> >netdev_lock.  
> 
> Are you expecting drivers to hold netdev_lock internally? 
> I was thinking something more scalable, queue_mgmt API to take
> netdev_lock,  and any other place in the stack that can access 
> "netdev queue config" e.g ethtool/netlink/netdev_ops should grab 
> netdev_lock as well, this is better for the future when we want to 
> reduce rtnl usage in the stack to protect single netdev ops where
> netdev_lock will be sufficient, otherwise you will have to wait for ALL
> drivers to properly use netdev_lock internally to even start thinking of
> getting rid of rtnl from some parts of the core stack.

Agreed, expecting drivers to get the locking right internally is easier
short term but messy long term. I'm thinking opt-in for drivers to have
netdev_lock taken by the core. Probably around all ops which today hold
rtnl_lock, to keep the expectations simple.

net_shaper and queue_mgmt ops can require that drivers that support
them opt-in and these ops can hold just the netdev_lock, no rtnl_lock.
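
A possible shape for that opt-in, as a sketch only (the flag name is
hypothetical, not an existing field):

	/* Sketch: core takes the instance lock only for drivers that opted in. */
	static void netdev_maybe_ops_lock(struct net_device *dev)
	{
		if (dev->request_ops_lock)	/* hypothetical opt-in flag */
			netdev_lock(dev);
	}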
Saeed Mahameed Jan. 24, 2025, 7:34 p.m. UTC | #8
On 24 Jan 07:26, Jakub Kicinski wrote:
>On Thu, 23 Jan 2025 19:11:23 -0800 Saeed Mahameed wrote:
>> On 23 Jan 16:55, Jakub Kicinski wrote:
>> >> IIUC, we want queue API to move away from rtnl and use only (new) netdev
>> >> lock. Otherwise, removing this dependency in the future might be
>> >> complicated.
>> >
>> >Correct. We only have one driver now which reportedly works (gve).
>> >Let's pull queues under optional netdev_lock protection.
>> >Then we can use queue mgmt op support as a carrot for drivers
>> >to convert / test the netdev_lock protection... "compliance".
>> >
>> >I added netdev_lock protection for NAPI before the merge window.
>> >Queues are configured in much more ad-hoc fashion, so I think
>> >the best way to make queue changes netdev_lock safe would be to
>> >wrap all driver ops which are currently under rtnl_lock with
>> >netdev_lock.
>>
>> Are you expecting drivers to hold netdev_lock internally?
>> I was thinking something more scalable, queue_mgmt API to take
>> netdev_lock,  and any other place in the stack that can access
>> "netdev queue config" e.g ethtool/netlink/netdev_ops should grab
>> netdev_lock as well, this is better for the future when we want to
>> reduce rtnl usage in the stack to protect single netdev ops where
>> netdev_lock will be sufficient, otherwise you will have to wait for ALL
>> drivers to properly use netdev_lock internally to even start thinking of
>> getting rid of rtnl from some parts of the core stack.
>
>Agreed, expecting drivers to get the locking right internally is easier
>short term but messy long term. I'm thinking opt-in for drivers to have
>netdev_lock taken by the core. Probably around all ops which today hold
>rtnl_lock, to keep the expectations simple.
>

Why opt-in? I don't see any overhead of taking netdev_lock by default in
rtnl_lock flows.

>net_shaper and queue_mgmt ops can require that drivers that support
>them opt-in and these ops can hold just the netdev_lock, no rtnl_lock.
Jakub Kicinski Jan. 27, 2025, 7:27 p.m. UTC | #9
On Fri, 24 Jan 2025 11:34:54 -0800 Saeed Mahameed wrote:
> On 24 Jan 07:26, Jakub Kicinski wrote:
> >> Are you expecting drivers to hold netdev_lock internally?
> >> I was thinking something more scalable, queue_mgmt API to take
> >> netdev_lock,  and any other place in the stack that can access
> >> "netdev queue config" e.g ethtool/netlink/netdev_ops should grab
> >> netdev_lock as well, this is better for the future when we want to
> >> reduce rtnl usage in the stack to protect single netdev ops where
> >> netdev_lock will be sufficient, otherwise you will have to wait for ALL
> >> drivers to properly use netdev_lock internally to even start thinking of
> >> getting rid of rtnl from some parts of the core stack.  
> >
> >Agreed, expecting drivers to get the locking right internally is easier
> >short term but messy long term. I'm thinking opt-in for drivers to have
> >netdev_lock taken by the core. Probably around all ops which today hold
> >rtnl_lock, to keep the expectations simple.
> 
> Why opt-in? I don't see any overhead of taking netdev_lock by default in
> rtnl_lock flows.

We could, depends on how close we take the dev lock to the ndo vs to
rtnl_lock. Some drivers may call back into the stack, so if we're not
careful enough we'll get flooded by static analysis reports saying
that we had deadlocked some old Sun driver :(

Then there are SW upper drivers like bonding, for which we'll need at
the very least lockdep nesting annotations.

Would be great to solve all these issues, but IMHO it's not a hard
requirement; we can at least start with opt-in. Unless always
taking the lock gives us some worthwhile invariant I haven't considered?
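
For the nested-device case, a minimal sketch of the kind of lockdep
annotation that would be needed (assuming the instance lock is the
netdev->lock mutex used for the NAPI protection mentioned earlier; the
subclass constant and helper name are illustrative):

	/* Sketch: an upper driver (e.g. bonding) locking a lower device's
	 * instance lock with a lockdep subclass, so lockdep doesn't flag
	 * the upper->lower nesting as a recursive lock.
	 */
	#define NETDEV_LOCK_SUBCLASS_LOWER	1

	static void netdev_lock_lower(struct net_device *lower)
	{
		mutex_lock_nested(&lower->lock, NETDEV_LOCK_SUBCLASS_LOWER);
	}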

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 340ed7d3feac..1e03f2afe625 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -5489,6 +5489,101 @@  static const struct netdev_stat_ops mlx5e_stat_ops = {
 	.get_base_stats      = mlx5e_get_base_stats,
 };
 
+struct mlx5_qmgmt_data {
+	struct mlx5e_channel *c;
+	struct mlx5e_channel_param cparam;
+};
+
+static int mlx5e_queue_mem_alloc(struct net_device *dev, void *newq, int queue_index)
+{
+	struct mlx5_qmgmt_data *new = (struct mlx5_qmgmt_data *)newq;
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	struct mlx5e_channels *chs = &priv->channels;
+	struct mlx5e_params params = chs->params;
+	struct mlx5_core_dev *mdev;
+	int err;
+
+	ASSERT_RTNL();
+	mutex_lock(&priv->state_lock);
+	if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
+		err = -ENODEV;
+		goto unlock;
+	}
+
+	if (queue_index >= chs->num) {
+		err = -ERANGE;
+		goto unlock;
+	}
+
+	if (MLX5E_GET_PFLAG(&chs->params, MLX5E_PFLAG_TX_PORT_TS) ||
+	    chs->params.ptp_rx   ||
+	    chs->params.xdp_prog ||
+	    priv->htb) {
+		netdev_err(priv->netdev,
+			   "Cloning channels with Port/rx PTP, XDP or HTB is not supported\n");
+		err = -EOPNOTSUPP;
+		goto unlock;
+	}
+
+	mdev = mlx5_sd_ch_ix_get_dev(priv->mdev, queue_index);
+	err = mlx5e_build_channel_param(mdev, &params, &new->cparam);
+	if (err) {
+		/* don't return directly, state_lock must be released */
+		goto unlock;
+	}
+
+	err = mlx5e_open_channel(priv, queue_index, &params, NULL, &new->c);
+unlock:
+	mutex_unlock(&priv->state_lock);
+	return err;
+}
+
+static void mlx5e_queue_mem_free(struct net_device *dev, void *mem)
+{
+	struct mlx5_qmgmt_data *data = (struct mlx5_qmgmt_data *)mem;
+
+	/* Not supposed to happen since mlx5e_queue_start never fails,
+	 * but free the channel anyway just in case.
+	 */
+	if (data->c)
+		mlx5e_close_channel(data->c);
+}
+
+static int mlx5e_queue_stop(struct net_device *dev, void *oldq, int queue_index)
+{
+	/* mlx5e_queue_start does not fail, we stop the old queue there */
+	return 0;
+}
+
+static int mlx5e_queue_start(struct net_device *dev, void *newq, int queue_index)
+{
+	struct mlx5_qmgmt_data *new = (struct mlx5_qmgmt_data *)newq;
+	struct mlx5e_priv *priv = netdev_priv(dev);
+	struct mlx5e_channel *old;
+
+	mutex_lock(&priv->state_lock);
+
+	/* stop and close the old */
+	old = priv->channels.c[queue_index];
+	mlx5e_deactivate_priv_channels(priv);
+	/* close old before activating new, to avoid napi conflict */
+	mlx5e_close_channel(old);
+
+	/* start the new */
+	priv->channels.c[queue_index] = new->c;
+	mlx5e_activate_priv_channels(priv);
+	mutex_unlock(&priv->state_lock);
+	return 0;
+}
+
+static const struct netdev_queue_mgmt_ops mlx5e_queue_mgmt_ops = {
+	.ndo_queue_mem_size	=	sizeof(struct mlx5_qmgmt_data),
+	.ndo_queue_mem_alloc	=	mlx5e_queue_mem_alloc,
+	.ndo_queue_mem_free	=	mlx5e_queue_mem_free,
+	.ndo_queue_start	=	mlx5e_queue_start,
+	.ndo_queue_stop		=	mlx5e_queue_stop,
+};
+
 static void mlx5e_build_nic_netdev(struct net_device *netdev)
 {
 	struct mlx5e_priv *priv = netdev_priv(netdev);
@@ -5499,6 +5594,7 @@  static void mlx5e_build_nic_netdev(struct net_device *netdev)
 	SET_NETDEV_DEV(netdev, mdev->device);
 
 	netdev->netdev_ops = &mlx5e_netdev_ops;
+	netdev->queue_mgmt_ops = &mlx5e_queue_mgmt_ops;
 	netdev->xdp_metadata_ops = &mlx5e_xdp_metadata_ops;
 	netdev->xsk_tx_metadata_ops = &mlx5e_xsk_tx_metadata_ops;