diff mbox series

net/mlx5: Use cpumask_local_spread() instead of custom code

Message ID 20240812082244.22810-1-e.velu@criteo.com (mailing list archive)
State Not Applicable
Series net/mlx5: Use cpumask_local_spread() instead of custom code

Commit Message

Erwan Velu Aug. 12, 2024, 8:22 a.m. UTC
Commit 2acda57736de ("net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints")
removed the usage of cpumask_local_spread().

The issue explained in this commit was fixed by
commit 406d394abfcd ("cpumask: improve on cpumask_local_spread() locality").

Since that commit, mlx5_cpumask_default_spread() has the same
behavior as cpumask_local_spread().

This commit:
- removes the specific logic and uses cpumask_local_spread() instead
- passes mlx5_core_dev as an argument for more flexibility

mlx5_cpumask_default_spread() is kept as it could be useful for some
future specific quirks.

Signed-off-by: Erwan Velu <e.velu@criteo.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 27 +++-----------------
 1 file changed, 4 insertions(+), 23 deletions(-)

Comments

Tariq Toukan Aug. 14, 2024, 7:48 a.m. UTC | #1
On 12/08/2024 11:22, Erwan Velu wrote:
> Commit 2acda57736de ("net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints")
> removed the usage of cpumask_local_spread().
> 
> The issue explained in this commit was fixed by
> commit 406d394abfcd ("cpumask: improve on cpumask_local_spread() locality").
> 
> Since this commit, mlx5_cpumask_default_spread() is having the same
> behavior as cpumask_local_spread().
> 

Adding Yuri.

One patch led to the other; in the end they were all submitted within the 
same patchset.

cpumask_local_spread() indeed improved, and AFAIU is functionally 
equivalent to existing logic.
According to [1] the current code is faster.
However, this alone is not a relevant enough argument, as we're talking 
about slowpath here.

Yuri, is that accurate? Is this the only difference?

If so, I am fine with this change, preferring simplicity.

[1] https://elixir.bootlin.com/linux/v6.11-rc3/source/lib/cpumask.c#L122

> This commit is about :
> - removing the specific logic and use cpumask_local_spread() instead
> - passing mlx5_core_dev as argument to more flexibility
> 
> mlx5_cpumask_default_spread() is kept as it could be useful for some
> future specific quirks.
> 
> Signed-off-by: Erwan Velu <e.velu@criteo.com>
> ---
>   drivers/net/ethernet/mellanox/mlx5/core/eq.c | 27 +++-----------------
>   1 file changed, 4 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> index cb7e7e4104af..f15ecaef1331 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> @@ -835,28 +835,9 @@ static void comp_irq_release_pci(struct mlx5_core_dev *dev, u16 vecidx)
>   	mlx5_irq_release_vector(irq);
>   }
>   
> -static int mlx5_cpumask_default_spread(int numa_node, int index)
> +static int mlx5_cpumask_default_spread(struct mlx5_core_dev *dev, int index)
>   {
> -	const struct cpumask *prev = cpu_none_mask;
> -	const struct cpumask *mask;
> -	int found_cpu = 0;
> -	int i = 0;
> -	int cpu;
> -
> -	rcu_read_lock();
> -	for_each_numa_hop_mask(mask, numa_node) {
> -		for_each_cpu_andnot(cpu, mask, prev) {
> -			if (i++ == index) {
> -				found_cpu = cpu;
> -				goto spread_done;
> -			}
> -		}
> -		prev = mask;
> -	}
> -
> -spread_done:
> -	rcu_read_unlock();
> -	return found_cpu;
> +	return cpumask_local_spread(index, dev->priv.numa_node);
>   }
>   
>   static struct cpu_rmap *mlx5_eq_table_get_pci_rmap(struct mlx5_core_dev *dev)
> @@ -880,7 +861,7 @@ static int comp_irq_request_pci(struct mlx5_core_dev *dev, u16 vecidx)
>   	int cpu;
>   
>   	rmap = mlx5_eq_table_get_pci_rmap(dev);
> -	cpu = mlx5_cpumask_default_spread(dev->priv.numa_node, vecidx);
> +	cpu = mlx5_cpumask_default_spread(dev, vecidx);
>   	irq = mlx5_irq_request_vector(dev, cpu, vecidx, &rmap);
>   	if (IS_ERR(irq))
>   		return PTR_ERR(irq);
> @@ -1145,7 +1126,7 @@ int mlx5_comp_vector_get_cpu(struct mlx5_core_dev *dev, int vector)
>   	if (mask)
>   		cpu = cpumask_first(mask);
>   	else
> -		cpu = mlx5_cpumask_default_spread(dev->priv.numa_node, vector);
> +		cpu = mlx5_cpumask_default_spread(dev, vector);
>   
>   	return cpu;
>   }
Yury Norov Aug. 14, 2024, 2:45 p.m. UTC | #2
On Wed, Aug 14, 2024 at 10:48:40AM +0300, Tariq Toukan wrote:
> 
> 
> On 12/08/2024 11:22, Erwan Velu wrote:
> > Commit 2acda57736de ("net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints")
> > removed the usage of cpumask_local_spread().
> > 
> > The issue explained in this commit was fixed by
> > commit 406d394abfcd ("cpumask: improve on cpumask_local_spread() locality").
> > 
> > Since this commit, mlx5_cpumask_default_spread() is having the same
> > behavior as cpumask_local_spread().
> > 
> 
> Adding Yuri.
> 
> One patch led to the other, finally they were all submitted within the same
> patchset.
> 
> cpumask_local_spread() indeed improved, and AFAIU is functionally equivalent
> to existing logic.
> According to [1] the current code is faster.
> However, this alone is not a relevant enough argument, as we're talking
> about slowpath here.
> 
> Yuri, is that accurate? Is this the only difference?
> 
> If so, I am fine with this change, preferring simplicity.
> 
> [1] https://elixir.bootlin.com/linux/v6.11-rc3/source/lib/cpumask.c#L122

If you end up calling mlx5_cpumask_default_spread() for each CPU, it
would be O(N^2). If you call cpumask_local_spread() for each CPU, your
complexity would be O(N*logN), because under the hood it uses binary
search.

The comment you've mentioned says that you can traverse your CPUs in
O(N) if you can manage to put all the logic inside the
for_each_numa_hop_mask() iterator. It doesn't seem to be your case.

I agree with you. mlx5_cpumask_default_spread() should be switched to
using library code.

Acked-by: Yury Norov <yury.norov@gmail.com>

You may be interested in siblings-aware CPU distribution I've made
for mana ethernet driver in 91bfe210e196. This is also an example
where using for_each_numa_hop_mask() over simple cpumask_local_spread()
is justified.

Thanks,
Yury
Tariq Toukan Aug. 15, 2024, 10:39 a.m. UTC | #3
On 14/08/2024 17:45, Yury Norov wrote:
> On Wed, Aug 14, 2024 at 10:48:40AM +0300, Tariq Toukan wrote:
>>
>>
>> On 12/08/2024 11:22, Erwan Velu wrote:
>>> Commit 2acda57736de ("net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints")
>>> removed the usage of cpumask_local_spread().
>>>
>>> The issue explained in this commit was fixed by
>>> commit 406d394abfcd ("cpumask: improve on cpumask_local_spread() locality").
>>>
>>> Since this commit, mlx5_cpumask_default_spread() is having the same
>>> behavior as cpumask_local_spread().
>>>
>>
>> Adding Yuri.
>>
>> One patch led to the other, finally they were all submitted within the same
>> patchset.
>>
>> cpumask_local_spread() indeed improved, and AFAIU is functionally equivalent
>> to existing logic.
>> According to [1] the current code is faster.
>> However, this alone is not a relevant enough argument, as we're talking
>> about slowpath here.
>>
>> Yuri, is that accurate? Is this the only difference?
>>
>> If so, I am fine with this change, preferring simplicity.
>>
>> [1] https://elixir.bootlin.com/linux/v6.11-rc3/source/lib/cpumask.c#L122
> 
> If you end up calling mlx5_cpumask_default_spread() for each CPU, it
> would be O(N^2). If you call cpumask_local_spread() for each CPU, your
> complexity would be O(N*logN), because under the hood it uses binary
> search.
> 
> The comment you've mentioned says that you can traverse your CPUs in
> O(N) if you can manage to put all the logic inside the
> for_each_numa_hop_mask() iterator. It doesn't seem to be your case.
> 
> I agree with you. mlx5_cpumask_default_spread() should be switched to
> using library code.
> 
> Acked-by: Yury Norov <yury.norov@gmail.com>
> 
> You may be interested in siblings-aware CPU distribution I've made
> for mana ethernet driver in 91bfe210e196. This is also an example
> where using for_each_numa_hop_mask() over simple cpumask_local_spread()
> is justified.
> 
> Thanks,
> Yury

Thanks Yuri.

For the patch:
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>

Targeting net-next.
The patch subject should've stated this clearly.

Jakub,
Please note that this patch is also mistakenly marked 'Not Applicable' 
already...

Regards,
Tariq
patchwork-bot+netdevbpf@kernel.org Aug. 16, 2024, 2:10 a.m. UTC | #4
Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 12 Aug 2024 10:22:42 +0200 you wrote:
> Commit 2acda57736de ("net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints")
> removed the usage of cpumask_local_spread().
> 
> The issue explained in this commit was fixed by
> commit 406d394abfcd ("cpumask: improve on cpumask_local_spread() locality").
> 
> Since this commit, mlx5_cpumask_default_spread() is having the same
> behavior as cpumask_local_spread().
> 
> [...]

Here is the summary with links:
  - net/mlx5: Use cpumask_local_spread() instead of custom code
    https://git.kernel.org/netdev/net-next/c/e5efc2311cc4

You are awesome, thank you!
Erwan Velu Aug. 19, 2024, 10:15 a.m. UTC | #5
[...]
> You may be interested in siblings-aware CPU distribution I've made
> for mana ethernet driver in 91bfe210e196. This is also an example
> where using for_each_numa_hop_mask() over simple cpumask_local_spread()
> is justified.

That's clearly a topic I'd like to discuss because the allocation
strategy may vary depending on the hardware and/or usage.
I've been investigating a case where the default mlx5 allocation isn't
what I need.

1/ I noticed that setting smp_affinity in an RFS context didn't
change the IRQ allocation, and I was wondering if that is expected
behavior.
This prevents any later tuning that an application could require.
It would be super helpful to be able to influence the placement from
the host to avoid hardcoded allocators that may not match a particular
hardware configuration.

2/ I was also wondering if we shouldn't have a kernel module option to
choose the allocation algorithm (I have a POC in that direction).
The benefit could be allowing the platform owner to select the
allocation algorithm that the sys-admin needs.
On single-package AMD EPYC servers, the NUMA topology is pretty handy
for mapping the L3 affinity, but it doesn't provide any particular hint
about the actual "distance" to the network device.
You can have up to 12 NUMA nodes on a single package, but the actual
distance to the NIC is almost identical since each core needs to go
through the IO die to reach the PCI devices.
The NUMA allocation logic embeds assumptions like "1 NUMA node per
package"; the actual distance between nodes should be considered in
the allocation logic instead.

In my case, the NIC is reported to Numa node 6 (of 8) (inherited from
the PXM configuration).
With the current "proximity" logic all cores are consumed within this
numa domain before reaching the next ones and so on.
This leads to a very unbalanced configuration where a few numa domains
are fully allocated while others are free.
When SMT is enabled, consuming all cores from a NUMA domain also means
using hyperthreads which could be less optimal than using real cores
from adjacent nodes.

In a hypervisor-like use case, when multiple containers from various
users run on the same system, having RFS enabled helps each user
handle its own share of the generated traffic.
In such a configuration, it'd be better to let the allocator consume
cores from each numa node of the same package one by one to get a
balanced configuration; that would also have the advantage of avoiding
hyperthreads until at least 1 IRQ per physical core is reached.

It could be interesting to share that allocation logic between
various drivers to allow sysadmins to get a balanced IRQ mapping on
modern, multi-node-per-socket architectures.

WDYT of having selectable logic and adding this type of
"package-balanced" allocator?
Erwan,
Jakub Kicinski Aug. 19, 2024, 3:34 p.m. UTC | #6
On Mon, 19 Aug 2024 12:15:10 +0200 Erwan Velu wrote:
> 2/ I was also wondering if we shouldn't have a kernel module option to
> choose the allocation algorithm (I have a POC in that direction).
> The benefit could be allowing the platform owner to select the
> allocation algorithm that sys-admin needs.
> On single-package AMD EPYC servers, the numa topology is pretty handy
> for mapping the L3 affinity but it doesn't provide any particular hint
> about the actual "distance" to the network device.
> You can have up to 12 NUMA nodes on a single package but the actual
> distance to the nic is almost identical as each core needs to use the
> IOdie to reach the PCI devices.
> We can see in the NUMA allocation logic assumptions like "1 NUMA per
> package" logic that the actual distance between nodes should be
> considered in the allocation logic.

I think user space has more information on what the appropriate
placement is than the kernel. We can have a reasonable default,
and maybe try not to stupidly reset the settings when config
changes (I don't think mlx5 does that but other drivers do);
but having a way to select algorithm would only work if there
was a well understood and finite set of algorithms.

IMHO we should try to sell this task to systemd-networkd or some other 
user space daemon. We now have netlink access to NAPI information,
including IRQ<>NAPI<>queue mapping. It's possible to implement a
completely driver-agnostic IRQ mapping support from user space
(without the need to grep irq names like we used to)
Erwan Velu Aug. 19, 2024, 3:41 p.m. UTC | #7
Le lun. 19 août 2024 à 17:34, Jakub Kicinski <kuba@kernel.org> a écrit :
>
> On Mon, 19 Aug 2024 12:15:10 +0200 Erwan Velu wrote:
> > 2/ I was also wondering if we shouldn't have a kernel module option to
> > choose the allocation algorithm (I have a POC in that direction).
> > The benefit could be allowing the platform owner to select the
> > allocation algorithm that sys-admin needs.
> > On single-package AMD EPYC servers, the numa topology is pretty handy
> > for mapping the L3 affinity but it doesn't provide any particular hint
> > about the actual "distance" to the network device.
> > You can have up to 12 NUMA nodes on a single package but the actual
> > distance to the nic is almost identical as each core needs to use the
> > IOdie to reach the PCI devices.
> > We can see in the NUMA allocation logic assumptions like "1 NUMA per
> > package" logic that the actual distance between nodes should be
> > considered in the allocation logic.
>
> I think user space has more information on what the appropriate
> placement is than the kernel. We can have a reasonable default,
> and maybe try not to stupidly reset the settings when config
> changes (I don't think mlx5 does that but other drivers do);
> but having a way to select algorithm would only work if there
> was a well understood and finite set of algorithms.

I totally agree with this view. I'm wondering if people who used to
work on the mlx driver can provide hints about this task.
I have no idea whether it requires anything particular at the fw level.
Is this a complex task to perform?
That feature would be super helpful to get precise tuning.

> IMHO we should try to sell this task to systemd-networkd or some other
> user space daemon. We now have netlink access to NAPI information,
> including IRQ<>NAPI<>queue mapping. It's possible to implement a
> completely driver-agnostic IRQ mapping support from user space
> (without the need to grep irq names like we used to)
Clearly that would be a nice path to achieve this feature.

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index cb7e7e4104af..f15ecaef1331 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -835,28 +835,9 @@  static void comp_irq_release_pci(struct mlx5_core_dev *dev, u16 vecidx)
 	mlx5_irq_release_vector(irq);
 }
 
-static int mlx5_cpumask_default_spread(int numa_node, int index)
+static int mlx5_cpumask_default_spread(struct mlx5_core_dev *dev, int index)
 {
-	const struct cpumask *prev = cpu_none_mask;
-	const struct cpumask *mask;
-	int found_cpu = 0;
-	int i = 0;
-	int cpu;
-
-	rcu_read_lock();
-	for_each_numa_hop_mask(mask, numa_node) {
-		for_each_cpu_andnot(cpu, mask, prev) {
-			if (i++ == index) {
-				found_cpu = cpu;
-				goto spread_done;
-			}
-		}
-		prev = mask;
-	}
-
-spread_done:
-	rcu_read_unlock();
-	return found_cpu;
+	return cpumask_local_spread(index, dev->priv.numa_node);
 }
 
 static struct cpu_rmap *mlx5_eq_table_get_pci_rmap(struct mlx5_core_dev *dev)
@@ -880,7 +861,7 @@  static int comp_irq_request_pci(struct mlx5_core_dev *dev, u16 vecidx)
 	int cpu;
 
 	rmap = mlx5_eq_table_get_pci_rmap(dev);
-	cpu = mlx5_cpumask_default_spread(dev->priv.numa_node, vecidx);
+	cpu = mlx5_cpumask_default_spread(dev, vecidx);
 	irq = mlx5_irq_request_vector(dev, cpu, vecidx, &rmap);
 	if (IS_ERR(irq))
 		return PTR_ERR(irq);
@@ -1145,7 +1126,7 @@  int mlx5_comp_vector_get_cpu(struct mlx5_core_dev *dev, int vector)
 	if (mask)
 		cpu = cpumask_first(mask);
 	else
-		cpu = mlx5_cpumask_default_spread(dev->priv.numa_node, vector);
+		cpu = mlx5_cpumask_default_spread(dev, vector);
 
 	return cpu;
 }