
[02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag

Message ID 1465934346-20648-3-git-send-email-hch@lst.de (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas

Commit Message

Christoph Hellwig June 14, 2016, 7:58 p.m. UTC
From: Thomas Gleixner <tglx@linutronix.de>

Interrupts marked with this flag are excluded from user space interrupt
affinity changes. Contrary to the IRQ_NO_BALANCING flag, the kernel internal
affinity mechanism is not blocked.

This flag will be used for multi-queue device interrupts.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/irq.h    |  7 +++++++
 kernel/irq/internals.h |  2 ++
 kernel/irq/manage.c    | 21 ++++++++++++++++++---
 kernel/irq/proc.c      |  2 +-
 4 files changed, 28 insertions(+), 4 deletions(-)
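
[Editorial note: this patch only introduces the flag and the
irqd_affinity_is_managed() helper; the call sites that actually set the flag
come with later patches in the series.  A minimal sketch of what setting it
looks like from kernel-internal code (illustrative only; the function name is
made up, and irqd_set() is the existing helper from kernel/irq/internals.h):

/* Illustrative only -- the real call sites are added by later patches. */
static void example_mark_managed(unsigned int irq)
{
	struct irq_desc *desc = irq_to_desc(irq);

	if (desc)
		irqd_set(&desc->irq_data, IRQD_AFFINITY_MANAGED);
}

User space writes to /proc/irq/<n>/smp_affinity then fail for such
descriptors, because write_irq_affinity() now checks
irq_can_set_affinity_usr(), which returns false when the flag is set.]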

Comments

Bart Van Assche June 15, 2016, 8:44 a.m. UTC | #1
On 06/14/2016 09:58 PM, Christoph Hellwig wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> Interrupts marked with this flag are excluded from user space interrupt
> affinity changes. Contrary to the IRQ_NO_BALANCING flag, the kernel internal
> affinity mechanism is not blocked.
>
> This flag will be used for multi-queue device interrupts.

It's great to see that the goal of this patch series is to configure 
interrupt affinity automatically for adapters that support multiple 
MSI-X vectors. However, is excluding these interrupts from irqbalanced 
really the way to go? Suppose e.g. that a system is equipped with two 
RDMA adapters, that these adapters are used by a blk-mq enabled block 
initiator driver and that each adapter supports eight MSI-X vectors. 
Should the interrupts of the two RDMA adapters be assigned to different 
CPU cores? If so, which software layer should realize this? The kernel 
or user space?

Sorry that I missed the first version of this patch series.

Thanks,

Bart.
Christoph Hellwig June 15, 2016, 10:23 a.m. UTC | #2
Hi Bart,

On Wed, Jun 15, 2016 at 10:44:37AM +0200, Bart Van Assche wrote:
> However, is excluding these interrupts from irqbalanced really the 
> way to go?

What positive effect will irqbalanced have on explicitly spread
interrupts?

> Suppose e.g. that a system is equipped with two RDMA adapters, 
> that these adapters are used by a blk-mq enabled block initiator driver and 
> that each adapter supports eight MSI-X vectors. Should the interrupts of 
> the two RDMA adapters be assigned to different CPU cores? If so, which 
> software layer should realize this? The kernel or user space?

RDMA should eventually use the interrupt spreading implemented in this
series, as should networking (RDMA actually is on my near term todo list).

RDMA block protocols will then pick up the queue information from the
HCA driver.  I've not actually implemented this yet, but my current idea
is:

 - the HCA drivers are switched to use pci_alloc_irq_vectors to spread
   their interrupt vectors around the system
 - the HCA drivers will expose the irq_affinity array
   in struct ib_device (we'll need to consider what to do about the
   odd completion vector (rather than irq) terminology in the RDMA stack,
   but that's not a show stopper)
 - multiqueue aware block drivers will then feed the irq_affinity
   cpumask from the hca driver to blk-mq.  We'll also need to ensure
   the number of protocol queues aligns nicely to the number of hardware
   queues.  My current thinking is that they should be the same or
   a fraction of the hardware completion queues, but this might need
   some careful benchmarking.
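
To make the first step above a bit more concrete, here's a rough sketch of
what an HCA driver conversion could look like.  Treat the names as
provisional: PCI_IRQ_AFFINITY, pci_irq_get_affinity() and pci_irq_vector()
are what I expect the final API to look like and may differ from this
posting, and hca_setup_irqs() is of course made up:

#include <linux/pci.h>

/* Illustrative sketch only -- not a real driver. */
static int hca_setup_irqs(struct pci_dev *pdev, unsigned int nr_queues)
{
	int i, nvec;

	/* allocate MSI-X vectors and let the core spread them over the CPUs */
	nvec = pci_alloc_irq_vectors(pdev, 1, nr_queues,
				     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
	if (nvec < 0)
		return nvec;

	for (i = 0; i < nvec; i++) {
		/*
		 * Per-vector affinity mask as assigned by the core; an HCA
		 * driver would export these through struct ib_device so
		 * that blk-mq can build its queue mapping from them.
		 */
		const struct cpumask *mask = pci_irq_get_affinity(pdev, i);

		if (mask)
			pr_debug("vector %d (irq %d) -> CPUs %*pbl\n",
				 i, pci_irq_vector(pdev, i),
				 cpumask_pr_args(mask));
	}

	return nvec;
}
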
Bart Van Assche June 15, 2016, 10:42 a.m. UTC | #3
On 06/15/2016 12:23 PM, Christoph Hellwig wrote:
> Hi Bart,
>
> On Wed, Jun 15, 2016 at 10:44:37AM +0200, Bart Van Assche wrote:
>> However, is excluding these interrupts from irqbalanced really the
>> way to go?
>
> What positive effect will irqbalanced have on explicitly spread
> interrupts?
>
>> Suppose e.g. that a system is equipped with two RDMA adapters,
>> that these adapters are used by a blk-mq enabled block initiator driver and
>> that each adapter supports eight MSI-X vectors. Should the interrupts of
>> the two RDMA adapters be assigned to different CPU cores? If so, which
>> software layer should realize this? The kernel or user space?
>
> RDMA should eventually use the interrupt spreading implemented in this
> series, as should networking (RDMA actually is on my near term todo list).
>
> RDMA block protocols will then pick up the queue information from the
> HCA driver.  I've not actually implemented this yet, but my current idea
> is:
>
>  - the HCA drivers are switched to use pci_alloc_irq_vectors to spread
>    their interrupt vectors around the system
>  - the HCA drivers will expose the irq_affinity array
>    in struct ib_device (we'll need to consider what to do about the
>    odd completion vector (rather than irq) terminology in the RDMA stack,
>    but that's not a show stopper)
>  - multiqueue aware block drivers will then feed the irq_affinity
>    cpumask from the hca driver to blk-mq.  We'll also need to ensure
>    the number of protocol queues aligns nicely to the number of hardware
>    queues.  My current thinking is that they should be the same or
>    a fraction of the hardware completion queues, but this might need
>    some careful benchmarking.

Hello Christoph,

Today irqbalanced is responsible for deciding how to assign interrupts 
from different adapters to CPU cores. Does the above mean that for 
adapters that support multiple MSI-X interrupts the kernel will have 
full responsibility for assigning interrupt vectors to CPU cores?

If two identical adapters are present in a system, will these generate 
the same irq_affinity mask? Do you agree that interrupt vectors from 
different adapters should be assigned to different CPU cores if enough 
CPU cores are available? If so, which software layer will assign 
interrupt vectors from different adapters to different CPU cores?

Thanks,

Bart.
Keith Busch June 15, 2016, 3:14 p.m. UTC | #4
On Wed, Jun 15, 2016 at 12:42:53PM +0200, Bart Van Assche wrote:
> Today irqbalanced is responsible for deciding how to assign interrupts from
> different adapters to CPU cores. Does the above mean that for adapters that
> support multiple MSI-X interrupts the kernel will have full responsibility
> for assigning interrupt vectors to CPU cores?

Hi Bart,

Right, the kernel would be responsible for assigning interrupt vectors to
cores. The kernel is already responsible for setting the affinity hint,
but we want direct control because we can do better than irqbalance,
which has been a problem point for users.

Many adapters gain significant performance when irqbalance is using
"exact" hint policy. But that's not irqbalance's default setting, and
we don't necessarily want to enforce "exact" on the entire system when
only a subset of devices benefit from such a setup.
 
> If two identical adapters are present in a system, will these generate the
> same irq_affinity mask? Do you agree that interrupt vectors from different
> adapters should be assigned to different CPU cores if enough CPU cores are
> available? If so, which software layer will assign interrupt vectors from
> different adapters to different CPU cores?

I think the idea is to have the irq_affinity mask match the CPU mapping on
the submission side context associated with that particular vector. If
two identical adapters generate the same submission CPU mapping, I don't
think we can do better than matching irq_affinity masks.
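
To make that concrete, here is a minimal sketch of the submission-side
selection; the modulo is just a stand-in for the real spreading logic that
blk-mq uses internally:

/*
 * Pick the hardware queue -- and therefore the MSI-X vector -- from the
 * submitting CPU, so the completion interrupt comes back to a CPU in that
 * vector's affinity mask.  Illustrative only.
 */
static unsigned int queue_for_cpu(unsigned int cpu, unsigned int nr_queues)
{
	return cpu % nr_queues;
}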

Thanks,
Keith
Bart Van Assche June 15, 2016, 3:28 p.m. UTC | #5
On 06/15/2016 05:14 PM, Keith Busch wrote:
> On Wed, Jun 15, 2016 at 12:42:53PM +0200, Bart Van Assche wrote:
>> If two identical adapters are present in a system, will these generate the
>> same irq_affinity mask? Do you agree that interrupt vectors from different
>> adapters should be assigned to different CPU cores if enough CPU cores are
>> available? If so, which software layer will assign interrupt vectors from
>> different adapters to different CPU cores?
>
> I think the idea is to have the irq_affinity mask match the CPU mapping on
> the submission side context associated with that particular vector. If
> two identical adapters generate the same submission CPU mapping, I don't
> think we can do better than matching irq_affinity masks.

Has this been verified by measurements? Sorry but I'm not convinced that 
using the same mapping for multiple identical adapters instead of 
spreading interrupts will result in better performance.

Bart.
Keith Busch June 15, 2016, 4:03 p.m. UTC | #6
On Wed, Jun 15, 2016 at 05:28:54PM +0200, Bart Van Assche wrote:
> On 06/15/2016 05:14 PM, Keith Busch wrote:
> >I think the idea is to have the irq_affinity mask match the CPU mapping on
> >the submission side context associated with that particular vector. If
> >two identical adapters generate the same submission CPU mapping, I don't
> >think we can do better than matching irq_affinity masks.
> 
> Has this been verified by measurements? Sorry but I'm not convinced that
> using the same mapping for multiple identical adapters instead of spreading
> interrupts will result in better performance.

The interrupts automatically spread based on which CPU submitted the
work. If you want to spread interrupts across more CPUs, then you can
spread submissions to the CPUs you want to service the interrupts.

Completing work on the same CPU that submitted it is quickest thanks to
cache-hot access. I have equipment available to demo this. What
affinity_mask policy would you like to see compared with the proposal?
Bart Van Assche June 15, 2016, 7:36 p.m. UTC | #7
On 06/15/2016 06:03 PM, Keith Busch wrote:
> On Wed, Jun 15, 2016 at 05:28:54PM +0200, Bart Van Assche wrote:
>> On 06/15/2016 05:14 PM, Keith Busch wrote:
>>> I think the idea is to have the irq_affinity mask match the CPU mapping on
>>> the submission side context associated with that particular vector. If
>>> two identical adapters generate the same submission CPU mapping, I don't
>>> think we can do better than matching irq_affinity masks.
>>
>> Has this been verified by measurements? Sorry but I'm not convinced that
>> using the same mapping for multiple identical adapters instead of spreading
>> interrupts will result in better performance.
>
> The interrupts automatically spread based on which CPU submitted the
> work. If you want to spread interrupts across more CPUs, then you can
> spread submissions to the CPUs you want to service the interrupts.
>
> Completing work on the same CPU that submitted it is quickest thanks to
> cache-hot access. I have equipment available to demo this. What
> affinity_mask policy would you like to see compared with the proposal?

Hello Keith,

Sorry that I had not yet made this clear, but my concern is about a
system equipped with two or more adapters and with more CPU cores than 
the number of MSI-X interrupts per adapter. Consider e.g. a system with 
two adapters (A and B), 8 interrupts per adapter (A0..A7 and B0..B7), 32 
CPU cores and two NUMA nodes. Assuming that hyperthreading is disabled, 
will the patches from this patch series generate the following interrupt 
assignment?

0: A0 B0
1: A1 B1
2: A2 B2
3: A3 B3
4: A4 B4
5: A5 B5
6: A6 B6
7: A7 B7
8: (none)
...
31: (none)

The mapping I would like to see is as follows (assuming CPU cores 0..15 
correspond to NUMA node 0 and CPU cores 16..31 correspond to NUMA node 1):

0: A0
1: B0
2: (none)
3: (none)
4: A1
5: B1
6: (none)
7: (none)
8: A2
9: B2
10: (none)
11: (none)
12: A3
13: B3
14: (none)
15: (none)
...
31: (none)

Do you agree that - ignoring other interrupt assignments - the
latter interrupt assignment scheme would result in higher throughput and 
lower interrupt processing latency?

Thanks,

Bart.
Keith Busch June 15, 2016, 8:06 p.m. UTC | #8
On Wed, Jun 15, 2016 at 09:36:54PM +0200, Bart Van Assche wrote:
> Sorry that I had not yet made this clear, but my concern is about a
> system equipped with two or more adapters and with more CPU cores than the
> number of MSI-X interrupts per adapter. Consider e.g. a system with two
> adapters (A and B), 8 interrupts per adapter (A0..A7 and B0..B7), 32 CPU
> cores and two NUMA nodes. Assuming that hyperthreading is disabled, will the
> patches from this patch series generate the following interrupt assignment?
> 
> 0: A0 B0
> 1: A1 B1
> 2: A2 B2
> 3: A3 B3
> 4: A4 B4
> 5: A5 B5
> 6: A6 B6
> 7: A7 B7
> 8: (none)
> ...
> 31: (none)

I'll need to look at the follow-on patches to confirm, but that's
not what this should do. All CPUs should have a vector assigned because
every CPU needs to be assigned a submission context using a vector. In
your example, every vector's affinity mask should be assigned to 4 CPUs:
vector '8' starts over with A0 B0, '9' gets A1 B1, and so on.
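
Spelling that out for your example (an illustration of the wrap-around only,
not a measured configuration), each vector's mask would cover four CPUs:

CPU  0..7:  A0..A7 paired with B0..B7
CPU  8..15: A0..A7 / B0..B7 again
CPU 16..23: A0..A7 / B0..B7 again
CPU 24..31: A0..A7 / B0..B7 again

so e.g. A0's (and B0's) affinity mask would be {0, 8, 16, 24}.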

If it's done such that all CPUs are assigned and no sharing occurs across
NUMA nodes, does that change your concern?
Keith Busch June 15, 2016, 8:12 p.m. UTC | #9
On Wed, Jun 15, 2016 at 04:06:55PM -0400, Keith Busch wrote:
> > 
> > 0: A0 B0
> > 1: A1 B1
> > 2: A2 B2
> > 3: A3 B3
> > 4: A4 B4
> > 5: A5 B5
> > 6: A6 B6
> > 7: A7 B7
> > 8: (none)
> > ...
> > 31: (none)
> 
> I'll need to look at the follow-on patches to confirm, but that's
> not what this should do. All CPUs should have a vector assigned because
> every CPU needs to be assigned a submission context using a vector. In
> your example, every vector's affinity mask should be assigned to 4 CPUs:
> vector '8' starts over with A0 B0, '9' gets A1 B1, and so on.

  ^^^^^^

Sorry, I meant "CPU '8'", not "vector '8'".
Bart Van Assche June 15, 2016, 8:50 p.m. UTC | #10
On 06/15/2016 10:12 PM, Keith Busch wrote:
> On Wed, Jun 15, 2016 at 04:06:55PM -0400, Keith Busch wrote:
>>>
>>> 0: A0 B0
>>> 1: A1 B1
>>> 2: A2 B2
>>> 3: A3 B3
>>> 4: A4 B4
>>> 5: A5 B5
>>> 6: A6 B6
>>> 7: A7 B7
>>> 8: (none)
>>> ...
>>> 31: (none)
>>
>> I'll need to look at the follow-on patches to confirm, but that's
>> not what this should do. All CPUs should have a vector assigned because
>> every CPU needs to be assigned a submission context using a vector. In
>> your example, every vector's affinity mask should be assigned to 4 CPUs:
>> vector '8' starts over with A0 B0, '9' gets A1 B1, and so on.
>
>   ^^^^^^
>
> Sorry, I meant "CPU '8'", not "vector '8'".

Hello Keith,

Does it matter on x86 systems whether or not these interrupt vectors are 
also associated with a CPU with a higher CPU number? Although multiple 
bits can be set in /proc/irq/<n>/smp_affinity only the first bit counts 
on x86 platforms. In default_cpu_mask_to_apicid_and() it is easy to see 
that only the first bit that has been set in that mask counts on x86 
systems.

Bart.
Bart Van Assche June 16, 2016, 9:08 a.m. UTC | #11
On 06/14/2016 09:58 PM, Christoph Hellwig wrote:
> diff --git a/include/linux/irq.h b/include/linux/irq.h
> index 4d758a7..49d66d1 100644
> --- a/include/linux/irq.h
> +++ b/include/linux/irq.h
> @@ -197,6 +197,7 @@ struct irq_data {
>   * IRQD_IRQ_INPROGRESS		- In progress state of the interrupt
>   * IRQD_WAKEUP_ARMED		- Wakeup mode armed
>   * IRQD_FORWARDED_TO_VCPU	- The interrupt is forwarded to a VCPU
> + * IRQD_AFFINITY_MANAGED	- Affinity is managed automatically
>   */

Does "managed automatically" mean managed by software? If so, I think it 
would help to mention which software manages IRQ affinity if the 
IRQD_AFFINITY_MANAGED flag has been set.

Thanks,

Bart.
Keith Busch June 16, 2016, 3:19 p.m. UTC | #12
On Wed, Jun 15, 2016 at 10:50:53PM +0200, Bart Van Assche wrote:
> Does it matter on x86 systems whether or not these interrupt vectors are
> also associated with a CPU with a higher CPU number? Although multiple bits
> can be set in /proc/irq/<n>/smp_affinity only the first bit counts on x86
> platforms. In default_cpu_mask_to_apicid_and() it is easy to see that only
> the first bit that has been set in that mask counts on x86 systems.

Wow, thanks for the information. I didn't know the apic wasn't using
the full cpu mask, so this changes how I need to look at this, and will
experiment with such a configuration.
Christoph Hellwig June 16, 2016, 3:20 p.m. UTC | #13
On Wed, Jun 15, 2016 at 09:36:54PM +0200, Bart Van Assche wrote:
> Do you agree that - ignoring other interrupt assignments - the latter
> interrupt assignment scheme would result in higher throughput and lower 
> interrupt processing latency?

Probably.  Once we've got it in the core IRQ code we can tweak the
algorithm to be optimal.
Bart Van Assche June 16, 2016, 3:39 p.m. UTC | #14
On 06/16/2016 05:20 PM, Christoph Hellwig wrote:
> On Wed, Jun 15, 2016 at 09:36:54PM +0200, Bart Van Assche wrote:
>> Do you agree that - ignoring other interrupt assignments - the latter
>> interrupt assignment scheme would result in higher throughput and lower
>> interrupt processing latency?
>
> Probably.  Once we've got it in the core IRQ code we can tweak the
> algorithm to be optimal.

Sorry but I'm afraid that we are embedding policy in the kernel, 
something we should not do. I know that there are workloads for which 
dedicating some CPU cores to interrupt processing and other CPU cores to 
running kernel threads improves throughput, probably because this 
results in less cache eviction on the CPU cores that run kernel threads 
and some degree of interrupt coalescing on the CPU cores that process 
interrupts. My concern is that I doubt that there is an interrupt 
assignment scheme that works optimally for all workloads. Hence my 
request to preserve the ability to modify interrupt affinity from user 
space.

Bart.
Christoph Hellwig June 20, 2016, 12:22 p.m. UTC | #15
On Thu, Jun 16, 2016 at 05:39:07PM +0200, Bart Van Assche wrote:
> On 06/16/2016 05:20 PM, Christoph Hellwig wrote:
>> On Wed, Jun 15, 2016 at 09:36:54PM +0200, Bart Van Assche wrote:
>>> Do you agree that - ignoring other interrupt assignments - the latter
>>> interrupt assignment scheme would result in higher throughput and lower
>>> interrupt processing latency?
>>
>> Probably.  Once we've got it in the core IRQ code we can tweak the
>> algorithm to be optimal.
>
> Sorry but I'm afraid that we are embedding policy in the kernel, something 
> we should not do. I know that there are workloads for which dedicating some 
> CPU cores to interrupt processing and other CPU cores to running kernel 
> threads improves throughput, probably because this results in less cache 
> eviction on the CPU cores that run kernel threads and some degree of 
> interrupt coalescing on the CPU cores that process interrupts.

And you can still easily set this use case up by choosing fewer queues
(aka interrupts) than CPUs and assigning your workload to the other
cores.

> My concern 
> is that I doubt that there is an interrupt assignment scheme that works 
> optimally for all workloads. Hence my request to preserve the ability to 
> modify interrupt affinity from user space.

I'd say let's do such an interface incrementally based on the use
case - especially after we get networking over to use common code
to distribute the interrupts.  If you were doing something like this
with the current blk-mq code it wouldn't work very well due to the
fact that you'd have a mismatch between the assigned interrupt and
the blk-mq queue mapping anyway.

It might be a good idea to start brainstorming how we'd want to handle
this change - we'd basically need a per-device notification that the
interrupt mapping changes so that we can rebuild the queue mapping,
which is somewhat similar to the lib/cpu_rmap.c code used by a few
networking drivers.  This would also help with dealing with cpu
hotplug events that change the cpu mapping.
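
Purely as a strawman, something like the following (nothing like this exists
yet, all names are made up for illustration):

#include <linux/device.h>

/*
 * Hypothetical sketch: a per-device callback invoked after the irq <-> cpu
 * mapping of @dev has changed (e.g. after cpu hotplug), so the driver can
 * rebuild its queue mapping.
 */
struct irq_mapping_notifier {
	void (*mapping_changed)(struct device *dev, void *data);
	void *data;
};

int register_irq_mapping_notifier(struct device *dev,
				  struct irq_mapping_notifier *n);
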
Bart Van Assche June 20, 2016, 1:21 p.m. UTC | #16
On 06/20/2016 02:22 PM, Christoph Hellwig wrote:
> On Thu, Jun 16, 2016 at 05:39:07PM +0200, Bart Van Assche wrote:
>> On 06/16/2016 05:20 PM, Christoph Hellwig wrote:
>>> On Wed, Jun 15, 2016 at 09:36:54PM +0200, Bart Van Assche wrote:
>> My concern
>> is that I doubt that there is an interrupt assignment scheme that works
>> optimally for all workloads. Hence my request to preserve the ability to
>> modify interrupt affinity from user space.
>
> I'd say let's do such an interface incrementally based on the use
> case - especially after we get networking over to use common code
> to distribute the interrupts.  If you were doing something like this
> with the current blk-mq code it wouldn't work very well due to the
> fact that you'd have a mismatch between the assigned interrupt and
> the blk-mq queue mapping anyway.
>
> It might be a good idea to start brainstorming how we'd want to handle
> this change - we'd basically need a per-device notification that the
> interrupt mapping changes so that we can rebuild the queue mapping,
> which is somewhat similar to the lib/cpu_rmap.c code used by a few
> networking drivers.  This would also help with dealing with cpu
> hotplug events that change the cpu mapping.

A notification mechanism that reports interrupt mapping changes will 
definitely help. What would also help is an API that allows drivers to 
query the MSI-X IRQ of an adapter that is nearest to a given cpumask, e.g.
hctx->cpumask. Another function can then map that IRQ into an index in 
the range 0..n-1 where n is the number of MSI-X interrupts for that 
adapter. Every blk-mq/scsi-mq driver will need this functionality to 
decide which IRQ to associate with a block layer hctx.
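
In other words, hypothetical prototypes along these lines (these functions do
not exist; the names are only meant to make the request concrete):

/* Return the IRQ of the MSI-X vector of @pdev "nearest" to @mask. */
int pci_irq_nearest(struct pci_dev *pdev, const struct cpumask *mask);

/* Map an IRQ of @pdev back to its MSI-X vector index in 0..n-1. */
int pci_irq_to_vector_index(struct pci_dev *pdev, int irq);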

Bart.
Christoph Hellwig June 21, 2016, 2:31 p.m. UTC | #17
On Mon, Jun 20, 2016 at 03:21:47PM +0200, Bart Van Assche wrote:
> A notification mechanism that reports interrupt mapping changes will 
> definitely help. What would also help is an API that allows drivers to 
> query the MSI-X IRQ of an adapter that is nearest to a given cpumask, e.g.
> hctx->cpumask.

This is still the wrong way around - we need to build the blk-mq queue
mappings based on the interrupts, not the other way around.

> Another function can then map that IRQ into an index in the 
> range 0..n-1 where n is the number of MSI-X interrupts for that adapter. 
> Every blk-mq/scsi-mq driver will need this functionality to decide which 
> IRQ to associate with a block layer hctx.

This is something that should be done in common code and is done in
common code in this series - the driver passes a cpumask to blk-mq,
and blk-mq creates a queue for every cpu that is set in the cpumask.
Alexander Gordeev June 22, 2016, 11:56 a.m. UTC | #18
On Thu, Jun 16, 2016 at 11:19:51AM -0400, Keith Busch wrote:
> On Wed, Jun 15, 2016 at 10:50:53PM +0200, Bart Van Assche wrote:
> > Does it matter on x86 systems whether or not these interrupt vectors are
> > also associated with a CPU with a higher CPU number? Although multiple bits
> > can be set in /proc/irq/<n>/smp_affinity only the first bit counts on x86
> > platforms. In default_cpu_mask_to_apicid_and() it is easy to see that only
> > the first bit that has been set in that mask counts on x86 systems.
> 
> Wow, thanks for the information. I didn't know the apic wasn't using
> the full cpu mask, so this changes how I need to look at this, and will
> experiment with such a configuration.

I have vague memories of this, but you probably need to check PPC as well.
Its interrupt distribution is not straightforward either, AFAIR.

Patch

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 4d758a7..49d66d1 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -197,6 +197,7 @@  struct irq_data {
  * IRQD_IRQ_INPROGRESS		- In progress state of the interrupt
  * IRQD_WAKEUP_ARMED		- Wakeup mode armed
  * IRQD_FORWARDED_TO_VCPU	- The interrupt is forwarded to a VCPU
+ * IRQD_AFFINITY_MANAGED	- Affinity is managed automatically
  */
 enum {
 	IRQD_TRIGGER_MASK		= 0xf,
@@ -212,6 +213,7 @@  enum {
 	IRQD_IRQ_INPROGRESS		= (1 << 18),
 	IRQD_WAKEUP_ARMED		= (1 << 19),
 	IRQD_FORWARDED_TO_VCPU		= (1 << 20),
+	IRQD_AFFINITY_MANAGED		= (1 << 21),
 };
 
 #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors)
@@ -305,6 +307,11 @@  static inline void irqd_clr_forwarded_to_vcpu(struct irq_data *d)
 	__irqd_to_state(d) &= ~IRQD_FORWARDED_TO_VCPU;
 }
 
+static inline bool irqd_affinity_is_managed(struct irq_data *d)
+{
+	return __irqd_to_state(d) & IRQD_AFFINITY_MANAGED;
+}
+
 #undef __irqd_to_state
 
 static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 09be2c9..b15aa3b 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -105,6 +105,8 @@  static inline void unregister_handler_proc(unsigned int irq,
 					   struct irqaction *action) { }
 #endif
 
+extern bool irq_can_set_affinity_usr(unsigned int irq);
+
 extern int irq_select_affinity_usr(unsigned int irq, struct cpumask *mask);
 
 extern void irq_set_thread_affinity(struct irq_desc *desc);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index ef0bc02..30658e9 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -115,12 +115,12 @@  EXPORT_SYMBOL(synchronize_irq);
 #ifdef CONFIG_SMP
 cpumask_var_t irq_default_affinity;
 
-static int __irq_can_set_affinity(struct irq_desc *desc)
+static bool __irq_can_set_affinity(struct irq_desc *desc)
 {
 	if (!desc || !irqd_can_balance(&desc->irq_data) ||
 	    !desc->irq_data.chip || !desc->irq_data.chip->irq_set_affinity)
-		return 0;
-	return 1;
+		return false;
+	return true;
 }
 
 /**
@@ -134,6 +134,21 @@  int irq_can_set_affinity(unsigned int irq)
 }
 
 /**
+ * irq_can_set_affinity_usr - Check if affinity of a irq can be set from user space
+ * @irq:	Interrupt to check
+ *
+ * Like irq_can_set_affinity() above, but additionally checks for the
+ * AFFINITY_MANAGED flag.
+ */
+bool irq_can_set_affinity_usr(unsigned int irq)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+
+	return __irq_can_set_affinity(desc) &&
+		!irqd_affinity_is_managed(&desc->irq_data);
+}
+
+/**
  *	irq_set_thread_affinity - Notify irq threads to adjust affinity
  *	@desc:		irq descriptor which has affitnity changed
  *
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 4e1b947..40bdcdc 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -96,7 +96,7 @@  static ssize_t write_irq_affinity(int type, struct file *file,
 	cpumask_var_t new_value;
 	int err;
 
-	if (!irq_can_set_affinity(irq) || no_irq_affinity)
+	if (!irq_can_set_affinity_usr(irq) || no_irq_affinity)
 		return -EIO;
 
 	if (!alloc_cpumask_var(&new_value, GFP_KERNEL))