Message ID | 20231217213214.1905481-4-yury.norov@gmail.com (mailing list archive) |
---|---|
State | Handled Elsewhere |
Series | net: mana: add irq_spread() |
On 12/17/2023 1:32 PM, Yury Norov wrote:
> +static __maybe_unused int irq_setup(unsigned int *irqs, unsigned int len, int node)
> +{
> +	const struct cpumask *next, *prev = cpu_none_mask;
> +	cpumask_var_t cpus __free(free_cpumask_var);
> +	int cpu, weight;
> +
> +	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	rcu_read_lock();
> +	for_each_numa_hop_mask(next, node) {
> +		weight = cpumask_weight_andnot(next, prev);
> +		while (weight-- > 0) {
> +			cpumask_andnot(cpus, next, prev);
> +			for_each_cpu(cpu, cpus) {
> +				if (len-- == 0)
> +					goto done;
> +				irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> +				cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> +			}
> +		}
> +		prev = next;
> +	}
> +done:
> +	rcu_read_unlock();
> +	return 0;
> +}
> +

You're adding a function here, but it's not called and is even marked as
__maybe_unused?

>  static int mana_gd_setup_irqs(struct pci_dev *pdev)
>  {
>  	unsigned int max_queues_per_port = num_online_cpus();
On Mon, Dec 18, 2023 at 01:17:53PM -0800, Jacob Keller wrote:
>
>
> On 12/17/2023 1:32 PM, Yury Norov wrote:
> > +static __maybe_unused int irq_setup(unsigned int *irqs, unsigned int len, int node)
> > +{
> > +	const struct cpumask *next, *prev = cpu_none_mask;
> > +	cpumask_var_t cpus __free(free_cpumask_var);
> > +	int cpu, weight;
> > +
> > +	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
> > +		return -ENOMEM;
> > +
> > +	rcu_read_lock();
> > +	for_each_numa_hop_mask(next, node) {
> > +		weight = cpumask_weight_andnot(next, prev);
> > +		while (weight-- > 0) {
> > +			cpumask_andnot(cpus, next, prev);
> > +			for_each_cpu(cpu, cpus) {
> > +				if (len-- == 0)
> > +					goto done;
> > +				irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> > +				cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> > +			}
> > +		}
> > +		prev = next;
> > +	}
> > +done:
> > +	rcu_read_unlock();
> > +	return 0;
> > +}
> > +
>
> You're adding a function here, but it's not called and is even marked as
> __maybe_unused?

I expect that Souradeep would build his driver improvement on top of this
function. The cpumask API is somewhat tricky to use properly here, so this
is an attempt to help him, instead of going back and forth on review.

Sorry, I should have been more explicit.

Thanks,
Yury
>-----Original Message-----
>From: Yury Norov <yury.norov@gmail.com>
>Sent: Monday, December 18, 2023 3:02 AM
>To: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>; KY Srinivasan
><kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>;
>wei.liu@kernel.org; Dexuan Cui <decui@microsoft.com>; davem@davemloft.net;
>edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; Long Li
><longli@microsoft.com>; yury.norov@gmail.com; leon@kernel.org;
>cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets@redhat.com;
>tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org;
>linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org
>Cc: Souradeep Chakrabarti <schakrabarti@microsoft.com>; Paul Rosswurm
><paulros@microsoft.com>
>Subject: [EXTERNAL] [PATCH 3/3] net: mana: add a function to spread IRQs per
>CPUs
>
>[...]
>
>Signed-off-by: Yury Norov <yury.norov@gmail.com>
>Co-developed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>

Please also add
Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>

>---
> .../net/ethernet/microsoft/mana/gdma_main.c | 28 +++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
>[...]
>-----Original Message-----
>From: Yury Norov <yury.norov@gmail.com>
>Sent: Monday, December 18, 2023 3:02 AM
>Subject: [EXTERNAL] [PATCH 3/3] net: mana: add a function to spread IRQs per
>CPUs
>
>[...]
>
>+static __maybe_unused int irq_setup(unsigned int *irqs, unsigned int len, int node)
>+{
>+	const struct cpumask *next, *prev = cpu_none_mask;
>+	cpumask_var_t cpus __free(free_cpumask_var);
>+	int cpu, weight;
>+
>+	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
>+		return -ENOMEM;
>+
>+	rcu_read_lock();
>+	for_each_numa_hop_mask(next, node) {
>+		weight = cpumask_weight_andnot(next, prev);
>+		while (weight-- > 0) {

Make it
	while (weight > 0) {

>+			cpumask_andnot(cpus, next, prev);
>+			for_each_cpu(cpu, cpus) {
>+				if (len-- == 0)
>+					goto done;
>+				irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
>+				cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));

Here do --weight, else this code will traverse the same node N^2 times, where
each node has N CPUs.

>+			}
>+		}
>+		prev = next;
>+	}
>+done:
>+	rcu_read_unlock();
>+	return 0;
>+}
>+
> static int mana_gd_setup_irqs(struct pci_dev *pdev)
> {
> 	unsigned int max_queues_per_port = num_online_cpus();
>--
>2.40.1
On Tue, Dec 19, 2023 at 10:18:49AM +0000, Souradeep Chakrabarti wrote:
> >-----Original Message-----
> >From: Yury Norov <yury.norov@gmail.com>
> >Sent: Monday, December 18, 2023 3:02 AM
> >Subject: [EXTERNAL] [PATCH 3/3] net: mana: add a function to spread IRQs per
> >CPUs
> >
> >[...]
> >
> >+static __maybe_unused int irq_setup(unsigned int *irqs, unsigned int len, int node)
> >+{
> >+	const struct cpumask *next, *prev = cpu_none_mask;
> >+	cpumask_var_t cpus __free(free_cpumask_var);
> >+	int cpu, weight;
> >+
> >+	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
> >+		return -ENOMEM;
> >+
> >+	rcu_read_lock();
> >+	for_each_numa_hop_mask(next, node) {
> >+		weight = cpumask_weight_andnot(next, prev);
> >+		while (weight-- > 0) {
> Make it
> 	while (weight > 0) {
> >+			cpumask_andnot(cpus, next, prev);
> >+			for_each_cpu(cpu, cpus) {
> >+				if (len-- == 0)
> >+					goto done;
> >+				irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> >+				cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> Here do --weight, else this code will traverse the same node N^2 times, where
> each node has N CPUs.

Sure. When building your series on top of this, can you please fix it in place?

Thanks,
Yury

> >+			}
> >+		}
> >+		prev = next;
> >+	}
> >+done:
> >+	rcu_read_unlock();
> >+	return 0;
> >+}
> >+
> > static int mana_gd_setup_irqs(struct pci_dev *pdev)
> > {
> > 	unsigned int max_queues_per_port = num_online_cpus();
> >--
> >2.40.1
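The effect of the requested change can be checked outside the kernel. The following is a minimal userspace sketch, not kernel code: plain `unsigned int` bitmasks stand in for `struct cpumask`, and the names `hop_mask`, `sibling_mask()`, `weight_of()` and `spread()` are hypothetical helpers invented for illustration. It assumes the 2-node, 4-core, 8-CPU example topology from the commit message and that the hop masks are cumulative (hop 1 contains hop 0).

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 8
#define NR_HOPS 2

/* Assumed cumulative hop masks seen from node 0: hop 0 covers CPUs 0-3,
 * hop 1 adds CPUs 4-7. Stand-in for what for_each_numa_hop_mask() yields. */
static const unsigned int hop_mask[NR_HOPS] = { 0x0f, 0xff };

/* Assumed SMT layout: CPUs 2k and 2k+1 share a core. */
static unsigned int sibling_mask(int cpu)
{
	return 0x3u << (cpu & ~1);
}

/* Count set bits; models cpumask_weight(). */
static int weight_of(unsigned int mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Model of the loop in irq_setup(). out[i] receives the CPU mask picked for
 * IRQ i. With per_irq_decrement == 0 the counter drops once per pass of the
 * while loop (as posted); with 1 it drops once per assigned IRQ (as the
 * review asks for).
 */
static void spread(unsigned int *out, unsigned int len, int per_irq_decrement)
{
	unsigned int prev = 0, next, cpus;
	int hop, cpu, weight;

	assert(out != NULL);

	for (hop = 0; hop < NR_HOPS; hop++) {
		next = hop_mask[hop];
		weight = weight_of(next & ~prev);
		while (weight > 0) {
			cpus = next & ~prev;
			for (cpu = 0; cpu < NR_CPUS; cpu++) {
				if (!(cpus & (1u << cpu)))
					continue;
				if (len-- == 0)
					return;
				*out++ = sibling_mask(cpu);
				cpus &= ~sibling_mask(cpu);
				if (per_irq_decrement)
					--weight;
			}
			if (!per_irq_decrement)
				--weight;
		}
		prev = next;
	}
}
```

In this model, `spread(out, 8, 0)` places all eight IRQs on the two cores of node 0, because each of the four passes over the hop hands out two IRQs. `spread(out, 8, 1)` stops after four IRQs per hop and gives the distribution shown in the commit message (IRQs 0-3 on CPUs 0-3, IRQs 4-7 on CPUs 4-7).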
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 6367de0c2c2e..11e64e42e3b2 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1243,6 +1243,34 @@ void mana_gd_free_res_map(struct gdma_resource *r)
 	r->size = 0;
 }
 
+static __maybe_unused int irq_setup(unsigned int *irqs, unsigned int len, int node)
+{
+	const struct cpumask *next, *prev = cpu_none_mask;
+	cpumask_var_t cpus __free(free_cpumask_var);
+	int cpu, weight;
+
+	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
+		return -ENOMEM;
+
+	rcu_read_lock();
+	for_each_numa_hop_mask(next, node) {
+		weight = cpumask_weight_andnot(next, prev);
+		while (weight-- > 0) {
+			cpumask_andnot(cpus, next, prev);
+			for_each_cpu(cpu, cpus) {
+				if (len-- == 0)
+					goto done;
+				irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
+				cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
+			}
+		}
+		prev = next;
+	}
+done:
+	rcu_read_unlock();
+	return 0;
+}
+
 static int mana_gd_setup_irqs(struct pci_dev *pdev)
 {
 	unsigned int max_queues_per_port = num_online_cpus();
Souradeep found that the driver performs faster if IRQs are spread across
CPUs using the following heuristics:

1. No more than one IRQ per CPU, if possible;
2. NUMA locality is the second priority;
3. Sibling dislocality is the last priority.

Let's consider this topology:

Node            0               1
Core        0       1       2       3
CPU       0   1   2   3   4   5   6   7

The most performant IRQ distribution based on the above topology
and heuristics may look like this:

IRQ     Nodes   Cores   CPUs
0       1       0       0-1
1       1       1       2-3
2       1       0       0-1
3       1       1       2-3
4       2       2       4-5
5       2       3       6-7
6       2       2       4-5
7       2       3       6-7

The irq_setup() routine introduced in this patch leverages the
for_each_numa_hop_mask() iterator and assigns IRQs to sibling groups
as described above.

According to [1], for NUMA-aware but sibling-ignorant IRQ distribution
based on cpumask_local_spread(), performance test results look like this:

./ntttcp -r -m 16
NTTTCP for Linux 1.4.0
---------------------------------------------------------
08:05:20 INFO: 17 threads created
08:05:28 INFO: Network activity progressing...
08:06:28 INFO: Test run completed.
08:06:28 INFO: Test cycle finished.
08:06:28 INFO: #####  Totals:  #####
08:06:28 INFO: test duration    :60.00 seconds
08:06:28 INFO: total bytes      :630292053310
08:06:28 INFO:   throughput     :84.04Gbps
08:06:28 INFO:   retrans segs   :4
08:06:28 INFO: cpu cores        :192
08:06:28 INFO:   cpu speed      :3799.725MHz
08:06:28 INFO:   user           :0.05%
08:06:28 INFO:   system         :1.60%
08:06:28 INFO:   idle           :96.41%
08:06:28 INFO:   iowait         :0.00%
08:06:28 INFO:   softirq        :1.94%
08:06:28 INFO:   cycles/byte    :2.50
08:06:28 INFO: cpu busy (all)   :534.41%

For NUMA- and sibling-aware IRQ distribution, the same test works
15% faster:

./ntttcp -r -m 16
NTTTCP for Linux 1.4.0
---------------------------------------------------------
08:08:51 INFO: 17 threads created
08:08:56 INFO: Network activity progressing...
08:09:56 INFO: Test run completed.
08:09:56 INFO: Test cycle finished.
08:09:56 INFO: #####  Totals:  #####
08:09:56 INFO: test duration    :60.00 seconds
08:09:56 INFO: total bytes      :741966608384
08:09:56 INFO:   throughput     :98.93Gbps
08:09:56 INFO:   retrans segs   :6
08:09:56 INFO: cpu cores        :192
08:09:56 INFO:   cpu speed      :3799.791MHz
08:09:56 INFO:   user           :0.06%
08:09:56 INFO:   system         :1.81%
08:09:56 INFO:   idle           :96.18%
08:09:56 INFO:   iowait         :0.00%
08:09:56 INFO:   softirq        :1.95%
08:09:56 INFO:   cycles/byte    :2.25
08:09:56 INFO: cpu busy (all)   :569.22%

[1] https://lore.kernel.org/all/20231211063726.GA4977@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Co-developed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
---
 .../net/ethernet/microsoft/mana/gdma_main.c | 28 +++++++++++++++++++
 1 file changed, 28 insertions(+)
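The per-hop bookkeeping that irq_setup() relies on can be illustrated with a small userspace sketch. It assumes that the masks yielded by for_each_numa_hop_mask() are cumulative, so the CPUs that a hop newly contributes are `next & ~prev`, and `cpumask_weight_andnot(next, prev)` counts them. The names `hop_mask`, `mask_weight()`, `hop_new_cpus()` and `hop_weight()` are hypothetical, and the masks encode the 8-CPU example topology above rather than any real machine.

```c
#include <assert.h>

#define NR_HOPS 2

/* Assumed cumulative hop masks as seen from node 0 of the example
 * topology: hop 0 = CPUs 0-3, hop 1 = CPUs 0-7. */
static const unsigned int hop_mask[NR_HOPS] = { 0x0f, 0xff };

/* Count set bits; models cpumask_weight(). */
static int mask_weight(unsigned int mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* CPUs that hop i adds on top of all nearer hops: next & ~prev. */
static unsigned int hop_new_cpus(int i)
{
	unsigned int prev;

	assert(i >= 0 && i < NR_HOPS);
	prev = i > 0 ? hop_mask[i - 1] : 0;
	return hop_mask[i] & ~prev;
}

/* Models cpumask_weight_andnot(next, prev) for hop i. */
static int hop_weight(int i)
{
	return mask_weight(hop_new_cpus(i));
}
```

Under these assumptions hop 0 contributes CPUs 0-3 and hop 1 contributes only CPUs 4-7, four CPUs each, which is the number of IRQs each hop should receive under the "no more than one IRQ per CPU" heuristic.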