[net-next,1/2] sfc: default config to 1 channel/core in local NUMA node only

Message ID 20220128151922.1016841-2-ihuguet@redhat.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series sfc: optimize RXQs count and affinities

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: line length of 82 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Íñigo Huguet Jan. 28, 2022, 3:19 p.m. UTC
Handling channels from CPUs in a different NUMA node can penalize
performance, so it is better to configure only one channel per core in
the same NUMA node as the NIC, rather than one per core in the whole
system.

Fall back to all other online cores if there are no online CPUs in the
local NUMA node.

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
---
 drivers/net/ethernet/sfc/efx_channels.c | 50 ++++++++++++++++---------
 1 file changed, 33 insertions(+), 17 deletions(-)

Comments

Jakub Kicinski Jan. 28, 2022, 10:27 p.m. UTC | #1
On Fri, 28 Jan 2022 16:19:21 +0100 Íñigo Huguet wrote:
> Handling channels from CPUs in different NUMA node can penalize
> performance, so better configure only one channel per core in the same
> NUMA node than the NIC, and not per each core in the system.
> 
> Fallback to all other online cores if there are not online CPUs in local
> NUMA node.

I think we should make netif_get_num_default_rss_queues() do a similar
thing. Instead of min(8, num_online_cpus()) we should default to
num_cores / 2 (that's physical cores, not threads). From what I've seen
this appears to strike a good balance between wasting resources on
pointless queues per hyperthread, and scaling up for CPUs which have
many wimpy cores.
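
A minimal sketch of that heuristic, assuming the kernel's cpumask/topology
helpers; the function name is illustrative and this is not the actual
netif_get_num_default_rss_queues() implementation:

#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/minmax.h>
#include <linux/topology.h>

static unsigned int default_rss_queues_sketch(void)
{
	cpumask_var_t cpus;
	unsigned int cores = 0;
	int cpu;

	if (unlikely(!zalloc_cpumask_var(&cpus, GFP_KERNEL)))
		return 1;

	cpumask_copy(cpus, cpu_online_mask);

	/* Count physical cores by collapsing SMT siblings. */
	for_each_cpu(cpu, cpus) {
		++cores;
		cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
	}
	free_cpumask_var(cpus);

	/* One queue per two physical cores, with at least one queue. */
	return max(1U, cores / 2);
}

Counting cores by collapsing SMT siblings is the same pattern the sfc patch
below uses; the /2 is the plain heuristic described above.
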
Íñigo Huguet Feb. 7, 2022, 3:03 p.m. UTC | #2
On Fri, Jan 28, 2022 at 11:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 28 Jan 2022 16:19:21 +0100 Íñigo Huguet wrote:
> > Handling channels from CPUs in different NUMA node can penalize
> > performance, so better configure only one channel per core in the same
> > NUMA node than the NIC, and not per each core in the system.
> >
> > Fallback to all other online cores if there are not online CPUs in local
> > NUMA node.
>
> I think we should make netif_get_num_default_rss_queues() do a similar
> thing. Instead of min(8, num_online_cpus()) we should default to
> num_cores / 2 (that's physical cores, not threads). From what I've seen
> this appears to strike a good balance between wasting resources on
> pointless queues per hyperthread, and scaling up for CPUs which have
> many wimpy cores.
>

I have a few busy weeks coming, but I can do this after that.

With num_cores / 2, do you divide by 2 because you're assuming 2 NUMA
nodes, or is it just the plain number 2?
Jakub Kicinski Feb. 7, 2022, 4:53 p.m. UTC | #3
On Mon, 7 Feb 2022 16:03:01 +0100 Íñigo Huguet wrote:
> On Fri, Jan 28, 2022 at 11:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Fri, 28 Jan 2022 16:19:21 +0100 Íñigo Huguet wrote:  
> > > Handling channels from CPUs in different NUMA node can penalize
> > > performance, so better configure only one channel per core in the same
> > > NUMA node than the NIC, and not per each core in the system.
> > >
> > > Fallback to all other online cores if there are not online CPUs in local
> > > NUMA node.  
> >
> > I think we should make netif_get_num_default_rss_queues() do a similar
> > thing. Instead of min(8, num_online_cpus()) we should default to
> > num_cores / 2 (that's physical cores, not threads). From what I've seen
> > this appears to strike a good balance between wasting resources on
> > pointless queues per hyperthread, and scaling up for CPUs which have
> > many wimpy cores.
> >  
> 
> I have a few busy weeks coming, but I can do this after that.
> 
> With num_cores / 2 you divide by 2 because you're assuming 2 NUMA
> nodes, or just the plain number 2?

Plain number 2, it's just a heuristic which seems to work okay.
One queue per core (IOW without the /2) is still way too many queues
for normal DC workloads.
Íñigo Huguet Feb. 10, 2022, 9:35 a.m. UTC | #4
On Mon, Feb 7, 2022 at 5:53 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 7 Feb 2022 16:03:01 +0100 Íñigo Huguet wrote:
> > On Fri, Jan 28, 2022 at 11:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Fri, 28 Jan 2022 16:19:21 +0100 Íñigo Huguet wrote:
> > > > Handling channels from CPUs in different NUMA node can penalize
> > > > performance, so better configure only one channel per core in the same
> > > > NUMA node than the NIC, and not per each core in the system.
> > > >
> > > > Fallback to all other online cores if there are not online CPUs in local
> > > > NUMA node.
> > >
> > > I think we should make netif_get_num_default_rss_queues() do a similar
> > > thing. Instead of min(8, num_online_cpus()) we should default to
> > > num_cores / 2 (that's physical cores, not threads). From what I've seen
> > > this appears to strike a good balance between wasting resources on
> > > pointless queues per hyperthread, and scaling up for CPUs which have
> > > many wimpy cores.
> > >
> >
> > I have a few busy weeks coming, but I can do this after that.
> >
> > With num_cores / 2 you divide by 2 because you're assuming 2 NUMA
> > nodes, or just the plain number 2?
>
> Plain number 2, it's just a heuristic which seems to work okay.
> One queue per core (IOW without the /2) is still way too many queues
> for normal DC workloads.
>

Maybe it's because these are quite special workloads, but I have
encountered problems related to queues in different NUMA nodes in 2
cases: XDP performance dropping to almost half with more RX queues
because they were in a different node (the example in my patches), and
a customer losing UDP packets, which was solved by reducing the number
of RX queues so that all of them are in the same node.
Jakub Kicinski Feb. 10, 2022, 4:22 p.m. UTC | #5
On Thu, 10 Feb 2022 10:35:53 +0100 Íñigo Huguet wrote:
> On Mon, Feb 7, 2022 at 5:53 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Mon, 7 Feb 2022 16:03:01 +0100 Íñigo Huguet wrote:  
> > > I have a few busy weeks coming, but I can do this after that.
> > >
> > > With num_cores / 2 you divide by 2 because you're assuming 2 NUMA
> > > nodes, or just the plain number 2?  
> >
> > Plain number 2, it's just a heuristic which seems to work okay.
> > One queue per core (IOW without the /2) is still way too many queues
> > for normal DC workloads.
> 
> Maybe it's because of being quite special workloads, but I have
> encountered problems related to queues in different NUMA nodes in 2
> cases: XDP performance being almost half with more RX queues because
> of being in different node (the example in my patches) and a customer
> losing UDP packets which was solved reducing the number of RX queues
> so all them are in the same node.

Right, no argument, NUMA tuning will still be necessary. 
I'm primarily concerned about providing a sensible default
for workloads which are not network heavy and therefore
nobody spends time tuning their queue configuration.
Any network-heavy workload will likely always benefit from tuning.

The status quo is that our current default returned by
netif_get_num_default_rss_queues() is 8, which is inadequate
for modern servers, and people end up implementing their own
logic in the drivers.

I'm fine with sfc doing its own thing (at least for now) and 
therefore your patches as they are, but for new drivers I want
to be able to recommend netif_get_num_default_rss_queues() with
a clear conscience.

Does that make sense?
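
As a hedged illustration of that recommendation, a new driver could simply
clamp the helper's return value to its hardware limit; the mydrv_* name and
MYDRV_MAX_CHANNELS cap below are made up:

#include <linux/minmax.h>
#include <linux/netdevice.h>

/* Hypothetical hardware limit for this illustration. */
#define MYDRV_MAX_CHANNELS 32

static unsigned int mydrv_default_channels(void)
{
	/* Take the stack's default and clamp it to what the HW supports. */
	return min_t(unsigned int, netif_get_num_default_rss_queues(),
		     MYDRV_MAX_CHANNELS);
}
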
Íñigo Huguet Feb. 11, 2022, 11:05 a.m. UTC | #6
On Thu, Feb 10, 2022 at 5:22 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 10 Feb 2022 10:35:53 +0100 Íñigo Huguet wrote:
> > On Mon, Feb 7, 2022 at 5:53 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Mon, 7 Feb 2022 16:03:01 +0100 Íñigo Huguet wrote:
> > > > I have a few busy weeks coming, but I can do this after that.
> > > >
> > > > With num_cores / 2 you divide by 2 because you're assuming 2 NUMA
> > > > nodes, or just the plain number 2?
> > >
> > > Plain number 2, it's just a heuristic which seems to work okay.
> > > One queue per core (IOW without the /2) is still way too many queues
> > > for normal DC workloads.
> >
> > Maybe it's because of being quite special workloads, but I have
> > encountered problems related to queues in different NUMA nodes in 2
> > cases: XDP performance being almost half with more RX queues because
> > of being in different node (the example in my patches) and a customer
> > losing UDP packets which was solved reducing the number of RX queues
> > so all them are in the same node.
>
> Right, no argument, NUMA tuning will still be necessary.
> I'm primarily concerned about providing a sensible default
> for workloads which are not network heavy and therefore
> nobody spends time tuning their queue configuration.
> Any network-heavy workload will likely always benefit from tuning.
>
> The status quo is that our current default returned by
> netif_get_num_default_rss_queues() is 8 which is inadequate
> for modern servers, and people end up implementing their own
> logic in the drivers.
>
> I'm fine with sfc doing its own thing (at least for now) and
> therefore your patches as they are, but for new drivers I want
> to be able to recommend netif_get_num_default_rss_queues() with
> a clear conscience.
>
> Does that make sense?
>

Totally. My comment was intended more as a question about why we
should or shouldn't consider NUMA nodes in
netif_get_num_default_rss_queues. But now I understand your point
better.

However, would something like this make sense for
netif_get_num_default_rss_queues, or would it be a bit overkill?
If the system has more than one NUMA node, allocate one queue per
physical core in the local NUMA node.
Else, allocate physical cores / 2 (a rough sketch of this follows below).
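
A rough sketch of that idea; the function name is made up, and the NIC's
NUMA node would have to be passed in by the caller, which is exactly the
information netif_get_num_default_rss_queues() does not have today:

#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/minmax.h>
#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/topology.h>

static unsigned int rss_queues_numa_sketch(int nic_node)
{
	bool multi_node = num_online_nodes() > 1 && nic_node != NUMA_NO_NODE;
	cpumask_var_t mask;
	unsigned int cores = 0;
	int cpu;

	if (unlikely(!zalloc_cpumask_var(&mask, GFP_KERNEL)))
		return 1;

	/* Restrict to the NIC's node when there is more than one node. */
	if (multi_node)
		cpumask_and(mask, cpu_online_mask, cpumask_of_node(nic_node));
	else
		cpumask_copy(mask, cpu_online_mask);

	/* Count physical cores by collapsing SMT siblings. */
	for_each_cpu(cpu, mask) {
		++cores;
		cpumask_andnot(mask, mask, topology_sibling_cpumask(cpu));
	}
	free_cpumask_var(mask);

	/* Multi-node: one queue per local core; single node: cores / 2. */
	return max(1U, multi_node ? cores : cores / 2);
}

If the local node has no online CPUs this sketch just returns 1, whereas
the sfc patch below instead falls back to counting all online cores.
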

Another thing: this patch series appears in patchwork with state
"Changes Requested", but no changes have actually been requested. Can
the state be changed so the series has more visibility to get reviews?

Thanks!
Jakub Kicinski Feb. 11, 2022, 7:01 p.m. UTC | #7
On Fri, 11 Feb 2022 12:05:19 +0100 Íñigo Huguet wrote:
> Totally. My comment was intended to be more like a question to see why
> we should or shouldn't consider NUMA nodes in
> netif_get_num_default_rss_queues. But now I understand your point
> better.
> 
> However, would it make sense something like this for
> netif_get_num_default_rss_queues, or it would be a bit overkill?
> if the system has more than one NUMA node, allocate one queue per
> physical core in local NUMA node.
> else, allocate physical cores / 2

I don't have a strong opinion on the NUMA question, to be honest.
It gets complicated pretty quickly. If there is one NIC we may or 
may not want to divide - for pure packet forwarding sure, best if
it's done on the node with the NIC, but that assumes the other node
is idle or doing something else? How does it not need networking?

If each node has a separate NIC we should definitely divide. But
it's impossible to know the NIC count at the netdev level..

So my thinking was let's leave NUMA configurations to manual tuning.
If we don't do anything special for NUMA it's less likely someone will
tell us we did the wrong thing there :) But feel free to implement what
you suggested above.

One thing I'm not sure of is whether anyone uses the early AMD chiplet
CPUs in a NUMA-per-chiplet mode; IIRC they had a mode like that. And that'd
potentially be problematic if we wanted to divide by the number of nodes.
Maybe not as much if just dividing by 2.

> Another thing: this patch series appears in patchwork with state
> "Changes Requested", but no changes have been requested, actually. Can
> the state be changed so it has more visibility to get reviews?

I think resend would be best.
Íñigo Huguet Feb. 14, 2022, 7:22 a.m. UTC | #8
On Fri, Feb 11, 2022 at 8:01 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 11 Feb 2022 12:05:19 +0100 Íñigo Huguet wrote:
> > Totally. My comment was intended to be more like a question to see why
> > we should or shouldn't consider NUMA nodes in
> > netif_get_num_default_rss_queues. But now I understand your point
> > better.
> >
> > However, would it make sense something like this for
> > netif_get_num_default_rss_queues, or it would be a bit overkill?
> > if the system has more than one NUMA node, allocate one queue per
> > physical core in local NUMA node.
> > else, allocate physical cores / 2
>
> I don't have a strong opinion on the NUMA question, to be honest.
> It gets complicated pretty quickly. If there is one NIC we may or
> may not want to divide - for pure packet forwarding sure, best if
> its done on the node with the NIC, but that assumes the other node
> is idle or doing something else? How does it not need networking?
>
> If each node has a separate NIC we should definitely divide. But
> it's impossible to know the NIC count at the netdev level..
>
> So my thinking was let's leave NUMA configurations to manual tuning.
> If we don't do anything special for NUMA it's less likely someone will
> tell us we did the wrong thing there :) But feel free to implement what
> you suggested above.

Agreed, the smarter you try to be, the more likely you are to handle
some less common case badly.

If nobody else speaks in favor of my suggestion, I will go the simpler way.

>
> One thing I'm not sure of is if anyone uses the early AMD chiplet CPUs
> in a NUMA-per-chiplet mode? IIRC they had a mode like that. And that'd
> potentially be problematic if we wanted to divide by number of nodes.
> Maybe not as much if just dividing by 2.
>
> > Another thing: this patch series appears in patchwork with state
> > "Changes Requested", but no changes have been requested, actually. Can
> > the state be changed so it has more visibility to get reviews?
>
> I think resend would be best.
>
Martin Habets Feb. 19, 2022, 1:53 p.m. UTC | #9
On Fri, Feb 11, 2022 at 11:01:00AM -0800, Jakub Kicinski wrote:
> On Fri, 11 Feb 2022 12:05:19 +0100 Íñigo Huguet wrote:
> > Totally. My comment was intended to be more like a question to see why
> > we should or shouldn't consider NUMA nodes in
> > netif_get_num_default_rss_queues. But now I understand your point
> > better.
> > 
> > However, would it make sense something like this for
> > netif_get_num_default_rss_queues, or it would be a bit overkill?
> > if the system has more than one NUMA node, allocate one queue per
> > physical core in local NUMA node.
> > else, allocate physical cores / 2
> 
> I don't have a strong opinion on the NUMA question, to be honest.
> It gets complicated pretty quickly. If there is one NIC we may or 
> may not want to divide - for pure packet forwarding sure, best if
> its done on the node with the NIC, but that assumes the other node 
> is idle or doing something else? How does it not need networking?
> 
> If each node has a separate NIC we should definitely divide. But
> it's impossible to know the NIC count at the netdev level..
> 
> So my thinking was let's leave NUMA configurations to manual tuning.
> If we don't do anything special for NUMA it's less likely someone will
> tell us we did the wrong thing there :) But feel free to implement what
> you suggested above.
> 
> One thing I'm not sure of is if anyone uses the early AMD chiplet CPUs 
> in a NUMA-per-chiplet mode? IIRC they had a mode like that. And that'd
> potentially be problematic if we wanted to divide by number of nodes.
> Maybe not as much if just dividing by 2.

As of one week ago, Xilinx is part of AMD. In time I'm sure we'll be able
to investigate AMD specifics.

Martin

> > Another thing: this patch series appears in patchwork with state
> > "Changes Requested", but no changes have been requested, actually. Can
> > the state be changed so it has more visibility to get reviews?
> 
> I think resend would be best.

Patch

diff --git a/drivers/net/ethernet/sfc/efx_channels.c b/drivers/net/ethernet/sfc/efx_channels.c
index ead550ae2709..ec6c2f231e73 100644
--- a/drivers/net/ethernet/sfc/efx_channels.c
+++ b/drivers/net/ethernet/sfc/efx_channels.c
@@ -78,31 +78,48 @@  static const struct efx_channel_type efx_default_channel_type = {
  * INTERRUPTS
  *************/
 
-static unsigned int efx_wanted_parallelism(struct efx_nic *efx)
+static unsigned int count_online_cores(struct efx_nic *efx, bool local_node)
 {
-	cpumask_var_t thread_mask;
+	cpumask_var_t filter_mask;
 	unsigned int count;
 	int cpu;
+
+	if (unlikely(!zalloc_cpumask_var(&filter_mask, GFP_KERNEL))) {
+		netif_warn(efx, probe, efx->net_dev,
+			   "RSS disabled due to allocation failure\n");
+		return 1;
+	}
+
+	cpumask_copy(filter_mask, cpu_online_mask);
+	if (local_node) {
+		int numa_node = pcibus_to_node(efx->pci_dev->bus);
+
+		cpumask_and(filter_mask, filter_mask, cpumask_of_node(numa_node));
+	}
+
+	count = 0;
+	for_each_cpu(cpu, filter_mask) {
+		++count;
+		cpumask_andnot(filter_mask, filter_mask, topology_sibling_cpumask(cpu));
+	}
+
+	free_cpumask_var(filter_mask);
+
+	return count;
+}
+
+static unsigned int efx_wanted_parallelism(struct efx_nic *efx)
+{
+	unsigned int count;
 
 	if (rss_cpus) {
 		count = rss_cpus;
 	} else {
-		if (unlikely(!zalloc_cpumask_var(&thread_mask, GFP_KERNEL))) {
-			netif_warn(efx, probe, efx->net_dev,
-				   "RSS disabled due to allocation failure\n");
-			return 1;
-		}
-
-		count = 0;
-		for_each_online_cpu(cpu) {
-			if (!cpumask_test_cpu(cpu, thread_mask)) {
-				++count;
-				cpumask_or(thread_mask, thread_mask,
-					   topology_sibling_cpumask(cpu));
-			}
-		}
+		count = count_online_cores(efx, true);
 
-		free_cpumask_var(thread_mask);
+		/* If no online CPUs in local node, fallback to any online CPUs */
+		if (count == 0)
+			count = count_online_cores(efx, false);
 	}
 
 	if (count > EFX_MAX_RX_QUEUES) {