Message ID | 20220128151922.1016841-2-ihuguet@redhat.com
---|---
State | Changes Requested
Delegated to | Netdev Maintainers
Series | sfc: optimize RXQs count and affinities
On Fri, 28 Jan 2022 16:19:21 +0100 Íñigo Huguet wrote:
> Handling channels from CPUs in different NUMA node can penalize performance, so better configure only one channel per core in the same NUMA node than the NIC, and not per each core in the system.
>
> Fallback to all other online cores if there are not online CPUs in local NUMA node.

I think we should make netif_get_num_default_rss_queues() do a similar thing. Instead of min(8, num_online_cpus()) we should default to num_cores / 2 (that's physical cores, not threads). From what I've seen this appears to strike a good balance between wasting resources on pointless queues per hyperthread, and scaling up for CPUs which have many wimpy cores.
On Fri, Jan 28, 2022 at 11:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> On Fri, 28 Jan 2022 16:19:21 +0100 Íñigo Huguet wrote:
> > Handling channels from CPUs in different NUMA node can penalize performance, so better configure only one channel per core in the same NUMA node than the NIC, and not per each core in the system.
> >
> > Fallback to all other online cores if there are not online CPUs in local NUMA node.
>
> I think we should make netif_get_num_default_rss_queues() do a similar thing. Instead of min(8, num_online_cpus()) we should default to num_cores / 2 (that's physical cores, not threads). From what I've seen this appears to strike a good balance between wasting resources on pointless queues per hyperthread, and scaling up for CPUs which have many wimpy cores.

I have a few busy weeks coming, but I can do this after that.

With num_cores / 2 you divide by 2 because you're assuming 2 NUMA nodes, or just the plain number 2?
On Mon, 7 Feb 2022 16:03:01 +0100 Íñigo Huguet wrote:
> On Fri, Jan 28, 2022 at 11:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Fri, 28 Jan 2022 16:19:21 +0100 Íñigo Huguet wrote:
> > > Handling channels from CPUs in different NUMA node can penalize performance, so better configure only one channel per core in the same NUMA node than the NIC, and not per each core in the system.
> > >
> > > Fallback to all other online cores if there are not online CPUs in local NUMA node.
> >
> > I think we should make netif_get_num_default_rss_queues() do a similar thing. Instead of min(8, num_online_cpus()) we should default to num_cores / 2 (that's physical cores, not threads). From what I've seen this appears to strike a good balance between wasting resources on pointless queues per hyperthread, and scaling up for CPUs which have many wimpy cores.
>
> I have a few busy weeks coming, but I can do this after that.
>
> With num_cores / 2 you divide by 2 because you're assuming 2 NUMA nodes, or just the plain number 2?

Plain number 2, it's just a heuristic which seems to work okay. One queue per core (IOW without the /2) is still way too many queues for normal DC workloads.
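For reference, a minimal sketch of the default being suggested here — not actual kernel code: it counts online physical cores by masking out SMT siblings (the same cpumask/topology_sibling_cpumask() technique the sfc patch below uses) and then halves the result. The helper name, the rounding, and the lower bound of one queue are assumptions for illustration; it relies on <linux/cpumask.h> and <linux/topology.h>.

/* Hypothetical sketch of a "physical cores / 2" default; illustration only. */
static int default_rss_queues_sketch(void)
{
        cpumask_var_t cpus;
        int cpu, count = 0;

        if (unlikely(!zalloc_cpumask_var(&cpus, GFP_KERNEL)))
                return 1;

        /* Count each physical core once: after visiting a CPU, drop its
         * SMT siblings from the mask so they are not counted again.
         */
        cpumask_copy(cpus, cpu_online_mask);
        for_each_cpu(cpu, cpus) {
                ++count;
                cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
        }
        free_cpumask_var(cpus);

        /* num_cores / 2, but never fewer than one queue */
        return max(count / 2, 1);
}

Counting with topology_sibling_cpumask() rather than num_online_cpus() is what makes this a per-core count instead of a per-thread count; the /2 is the plain heuristic described above.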
On Mon, Feb 7, 2022 at 5:53 PM Jakub Kicinski <kuba@kernel.org> wrote:
> On Mon, 7 Feb 2022 16:03:01 +0100 Íñigo Huguet wrote:
> > On Fri, Jan 28, 2022 at 11:27 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Fri, 28 Jan 2022 16:19:21 +0100 Íñigo Huguet wrote:
> > > > Handling channels from CPUs in different NUMA node can penalize performance, so better configure only one channel per core in the same NUMA node than the NIC, and not per each core in the system.
> > > >
> > > > Fallback to all other online cores if there are not online CPUs in local NUMA node.
> > >
> > > I think we should make netif_get_num_default_rss_queues() do a similar thing. Instead of min(8, num_online_cpus()) we should default to num_cores / 2 (that's physical cores, not threads). From what I've seen this appears to strike a good balance between wasting resources on pointless queues per hyperthread, and scaling up for CPUs which have many wimpy cores.
> >
> > I have a few busy weeks coming, but I can do this after that.
> >
> > With num_cores / 2 you divide by 2 because you're assuming 2 NUMA nodes, or just the plain number 2?
>
> Plain number 2, it's just a heuristic which seems to work okay. One queue per core (IOW without the /2) is still way too many queues for normal DC workloads.

Maybe it's because of being quite special workloads, but I have encountered problems related to queues in different NUMA nodes in 2 cases: XDP performance being almost half with more RX queues because of being in different node (the example in my patches) and a customer losing UDP packets which was solved reducing the number of RX queues so all them are in the same node.
On Thu, 10 Feb 2022 10:35:53 +0100 Íñigo Huguet wrote:
> On Mon, Feb 7, 2022 at 5:53 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Mon, 7 Feb 2022 16:03:01 +0100 Íñigo Huguet wrote:
> > > I have a few busy weeks coming, but I can do this after that.
> > >
> > > With num_cores / 2 you divide by 2 because you're assuming 2 NUMA nodes, or just the plain number 2?
> >
> > Plain number 2, it's just a heuristic which seems to work okay. One queue per core (IOW without the /2) is still way too many queues for normal DC workloads.
>
> Maybe it's because of being quite special workloads, but I have encountered problems related to queues in different NUMA nodes in 2 cases: XDP performance being almost half with more RX queues because of being in different node (the example in my patches) and a customer losing UDP packets which was solved reducing the number of RX queues so all them are in the same node.

Right, no argument, NUMA tuning will still be necessary. I'm primarily concerned about providing a sensible default for workloads which are not network heavy and therefore nobody spends time tuning their queue configuration. Any network-heavy workload will likely always benefit from tuning.

The status quo is that our current default returned by netif_get_num_default_rss_queues() is 8 which is inadequate for modern servers, and people end up implementing their own logic in the drivers.

I'm fine with sfc doing its own thing (at least for now) and therefore your patches as they are, but for new drivers I want to be able to recommend netif_get_num_default_rss_queues() with a clear conscience.

Does that make sense?
On Thu, Feb 10, 2022 at 5:22 PM Jakub Kicinski <kuba@kernel.org> wrote:
> On Thu, 10 Feb 2022 10:35:53 +0100 Íñigo Huguet wrote:
> > On Mon, Feb 7, 2022 at 5:53 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Mon, 7 Feb 2022 16:03:01 +0100 Íñigo Huguet wrote:
> > > > I have a few busy weeks coming, but I can do this after that.
> > > >
> > > > With num_cores / 2 you divide by 2 because you're assuming 2 NUMA nodes, or just the plain number 2?
> > >
> > > Plain number 2, it's just a heuristic which seems to work okay. One queue per core (IOW without the /2) is still way too many queues for normal DC workloads.
> >
> > Maybe it's because of being quite special workloads, but I have encountered problems related to queues in different NUMA nodes in 2 cases: XDP performance being almost half with more RX queues because of being in different node (the example in my patches) and a customer losing UDP packets which was solved reducing the number of RX queues so all them are in the same node.
>
> Right, no argument, NUMA tuning will still be necessary. I'm primarily concerned about providing a sensible default for workloads which are not network heavy and therefore nobody spends time tuning their queue configuration. Any network-heavy workload will likely always benefit from tuning.
>
> The status quo is that our current default returned by netif_get_num_default_rss_queues() is 8 which is inadequate for modern servers, and people end up implementing their own logic in the drivers.
>
> I'm fine with sfc doing its own thing (at least for now) and therefore your patches as they are, but for new drivers I want to be able to recommend netif_get_num_default_rss_queues() with a clear conscience.
>
> Does that make sense?

Totally. My comment was intended to be more like a question to see why we should or shouldn't consider NUMA nodes in netif_get_num_default_rss_queues. But now I understand your point better.

However, would it make sense something like this for netif_get_num_default_rss_queues, or it would be a bit overkill?
if the system has more than one NUMA node, allocate one queue per physical core in local NUMA node.
else, allocate physical cores / 2

Another thing: this patch series appears in patchwork with state "Changes Requested", but no changes have been requested, actually. Can the state be changed so it has more visibility to get reviews?

Thanks!
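A rough sketch of that alternative, again purely illustrative: the function name and the device argument are hypothetical, since the real netif_get_num_default_rss_queues() takes no arguments and therefore has no way to know which node is "local". It relies on <linux/cpumask.h>, <linux/topology.h>, <linux/nodemask.h> and <linux/numa.h>.

/* Hypothetical sketch of the NUMA-aware proposal above; illustration only.
 * "dev" is an assumed argument used to find the local NUMA node.
 */
static int default_rss_queues_numa_sketch(struct device *dev)
{
        const struct cpumask *node_mask = cpu_online_mask;
        cpumask_var_t cpus;
        int node, cpu, count = 0;

        if (unlikely(!zalloc_cpumask_var(&cpus, GFP_KERNEL)))
                return 1;

        node = dev_to_node(dev);
        if (num_online_nodes() > 1 && node != NUMA_NO_NODE)
                node_mask = cpumask_of_node(node);  /* local node only */

        /* Count physical cores in the chosen mask, as in the earlier sketch */
        cpumask_and(cpus, cpu_online_mask, node_mask);
        for_each_cpu(cpu, cpus) {
                ++count;
                cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
        }
        free_cpumask_var(cpus);

        /* Single NUMA node: fall back to the "cores / 2" heuristic */
        if (num_online_nodes() <= 1)
                count /= 2;

        return max(count, 1);
}

The need for some notion of "the local node" is part of what makes this harder at the netdev level, which connects to the point below about the core not knowing how many NICs there are or where they sit.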
On Fri, 11 Feb 2022 12:05:19 +0100 Íñigo Huguet wrote:
> Totally. My comment was intended to be more like a question to see why we should or shouldn't consider NUMA nodes in netif_get_num_default_rss_queues. But now I understand your point better.
>
> However, would it make sense something like this for netif_get_num_default_rss_queues, or it would be a bit overkill?
> if the system has more than one NUMA node, allocate one queue per physical core in local NUMA node.
> else, allocate physical cores / 2

I don't have a strong opinion on the NUMA question, to be honest. It gets complicated pretty quickly. If there is one NIC we may or may not want to divide - for pure packet forwarding sure, best if its done on the node with the NIC, but that assumes the other node is idle or doing something else? How does it not need networking?

If each node has a separate NIC we should definitely divide. But it's impossible to know the NIC count at the netdev level..

So my thinking was let's leave NUMA configurations to manual tuning. If we don't do anything special for NUMA it's less likely someone will tell us we did the wrong thing there :) But feel free to implement what you suggested above.

One thing I'm not sure of is if anyone uses the early AMD chiplet CPUs in a NUMA-per-chiplet mode? IIRC they had a mode like that. And that'd potentially be problematic if we wanted to divide by number of nodes. Maybe not as much if just dividing by 2.

> Another thing: this patch series appears in patchwork with state "Changes Requested", but no changes have been requested, actually. Can the state be changed so it has more visibility to get reviews?

I think resend would be best.
On Fri, Feb 11, 2022 at 8:01 PM Jakub Kicinski <kuba@kernel.org> wrote:
> On Fri, 11 Feb 2022 12:05:19 +0100 Íñigo Huguet wrote:
> > Totally. My comment was intended to be more like a question to see why we should or shouldn't consider NUMA nodes in netif_get_num_default_rss_queues. But now I understand your point better.
> >
> > However, would it make sense something like this for netif_get_num_default_rss_queues, or it would be a bit overkill?
> > if the system has more than one NUMA node, allocate one queue per physical core in local NUMA node.
> > else, allocate physical cores / 2
>
> I don't have a strong opinion on the NUMA question, to be honest. It gets complicated pretty quickly. If there is one NIC we may or may not want to divide - for pure packet forwarding sure, best if its done on the node with the NIC, but that assumes the other node is idle or doing something else? How does it not need networking?
>
> If each node has a separate NIC we should definitely divide. But it's impossible to know the NIC count at the netdev level..
>
> So my thinking was let's leave NUMA configurations to manual tuning. If we don't do anything special for NUMA it's less likely someone will tell us we did the wrong thing there :) But feel free to implement what you suggested above.

Agreed, the more you try to be smart, the more less common case you might fail to do it well. If nobody else speaks in favor of my suggestion I will go the simpler way.

> One thing I'm not sure of is if anyone uses the early AMD chiplet CPUs in a NUMA-per-chiplet mode? IIRC they had a mode like that. And that'd potentially be problematic if we wanted to divide by number of nodes. Maybe not as much if just dividing by 2.
>
> > Another thing: this patch series appears in patchwork with state "Changes Requested", but no changes have been requested, actually. Can the state be changed so it has more visibility to get reviews?
>
> I think resend would be best.
On Fri, Feb 11, 2022 at 11:01:00AM -0800, Jakub Kicinski wrote:
> On Fri, 11 Feb 2022 12:05:19 +0100 Íñigo Huguet wrote:
> > Totally. My comment was intended to be more like a question to see why we should or shouldn't consider NUMA nodes in netif_get_num_default_rss_queues. But now I understand your point better.
> >
> > However, would it make sense something like this for netif_get_num_default_rss_queues, or it would be a bit overkill?
> > if the system has more than one NUMA node, allocate one queue per physical core in local NUMA node.
> > else, allocate physical cores / 2
>
> I don't have a strong opinion on the NUMA question, to be honest. It gets complicated pretty quickly. If there is one NIC we may or may not want to divide - for pure packet forwarding sure, best if its done on the node with the NIC, but that assumes the other node is idle or doing something else? How does it not need networking?
>
> If each node has a separate NIC we should definitely divide. But it's impossible to know the NIC count at the netdev level..
>
> So my thinking was let's leave NUMA configurations to manual tuning. If we don't do anything special for NUMA it's less likely someone will tell us we did the wrong thing there :) But feel free to implement what you suggested above.
>
> One thing I'm not sure of is if anyone uses the early AMD chiplet CPUs in a NUMA-per-chiplet mode? IIRC they had a mode like that. And that'd potentially be problematic if we wanted to divide by number of nodes. Maybe not as much if just dividing by 2.

Since one week Xilinx is part of AMD. In time I'm sure we'll be able to investigate AMD specifics.

Martin

> > Another thing: this patch series appears in patchwork with state "Changes Requested", but no changes have been requested, actually. Can the state be changed so it has more visibility to get reviews?
>
> I think resend would be best.
diff --git a/drivers/net/ethernet/sfc/efx_channels.c b/drivers/net/ethernet/sfc/efx_channels.c
index ead550ae2709..ec6c2f231e73 100644
--- a/drivers/net/ethernet/sfc/efx_channels.c
+++ b/drivers/net/ethernet/sfc/efx_channels.c
@@ -78,31 +78,48 @@ static const struct efx_channel_type efx_default_channel_type = {
  * INTERRUPTS
  *************/
 
-static unsigned int efx_wanted_parallelism(struct efx_nic *efx)
+static unsigned int count_online_cores(struct efx_nic *efx, bool local_node)
 {
-        cpumask_var_t thread_mask;
+        cpumask_var_t filter_mask;
         unsigned int count;
         int cpu;
+
+        if (unlikely(!zalloc_cpumask_var(&filter_mask, GFP_KERNEL))) {
+                netif_warn(efx, probe, efx->net_dev,
+                           "RSS disabled due to allocation failure\n");
+                return 1;
+        }
+
+        cpumask_copy(filter_mask, cpu_online_mask);
+        if (local_node) {
+                int numa_node = pcibus_to_node(efx->pci_dev->bus);
+
+                cpumask_and(filter_mask, filter_mask, cpumask_of_node(numa_node));
+        }
+
+        count = 0;
+        for_each_cpu(cpu, filter_mask) {
+                ++count;
+                cpumask_andnot(filter_mask, filter_mask, topology_sibling_cpumask(cpu));
+        }
+
+        free_cpumask_var(filter_mask);
+
+        return count;
+}
+
+static unsigned int efx_wanted_parallelism(struct efx_nic *efx)
+{
+        unsigned int count;
 
         if (rss_cpus) {
                 count = rss_cpus;
         } else {
-                if (unlikely(!zalloc_cpumask_var(&thread_mask, GFP_KERNEL))) {
-                        netif_warn(efx, probe, efx->net_dev,
-                                   "RSS disabled due to allocation failure\n");
-                        return 1;
-                }
-
-                count = 0;
-                for_each_online_cpu(cpu) {
-                        if (!cpumask_test_cpu(cpu, thread_mask)) {
-                                ++count;
-                                cpumask_or(thread_mask, thread_mask,
-                                           topology_sibling_cpumask(cpu));
-                        }
-                }
+                count = count_online_cores(efx, true);
 
-                free_cpumask_var(thread_mask);
+                /* If no online CPUs in local node, fallback to any online CPUs */
+                if (count == 0)
+                        count = count_online_cores(efx, false);
         }
 
         if (count > EFX_MAX_RX_QUEUES) {
Handling channels from CPUs in a different NUMA node can penalize
performance, so it is better to configure only one channel per core in
the same NUMA node as the NIC, rather than one per core in the whole
system.

Fall back to all other online cores if there are no online CPUs in the
local NUMA node.

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
---
 drivers/net/ethernet/sfc/efx_channels.c | 50 ++++++++++++++++---------
 1 file changed, 33 insertions(+), 17 deletions(-)