
[07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()

Message ID 20201005152856.974112-7-dwmw2@infradead.org (mailing list archive)
State New, archived
Series Fix per-domain IRQ affinity, allow >255 CPUs on x86 without IRQ remapping

Commit Message

David Woodhouse Oct. 5, 2020, 3:28 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

This is the maximum possible set of CPUs which can be used. Use it
to calculate the default affinity requested from __irq_alloc_descs()
by first attempting to find the intersection with irq_default_affinity,
or falling back to using just the max_affinity if the intersection
would be empty.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 include/linux/irqdomain.h |  3 ++-
 kernel/irq/ipi.c          |  2 +-
 kernel/irq/irqdomain.c    | 45 +++++++++++++++++++++++++++++++++------
 3 files changed, 42 insertions(+), 8 deletions(-)

Comments

Thomas Gleixner Oct. 6, 2020, 9:26 p.m. UTC | #1
On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> This is the maximum possible set of CPUs which can be used. Use it
> to calculate the default affinity requested from __irq_alloc_descs()
> by first attempting to find the intersection with irq_default_affinity,
> or falling back to using just the max_affinity if the intersection
> would be empty.

And why do we need that as yet another argument?

This is an optional property of the irq domain, really and no caller has
any business with that. 

>  int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
> -			   int node, const struct irq_affinity_desc *affinity)
> +			   int node, const struct irq_affinity_desc *affinity,
> +			   const struct cpumask *max_affinity)
>  {
> +	cpumask_var_t default_affinity;
>  	unsigned int hint;
> +	int i;
> +
> +	/* Check requested per-IRQ affinities are in the possible range */
> +	if (affinity && max_affinity) {
> +		for (i = 0; i < cnt; i++)
> +			if (!cpumask_subset(&affinity[i].mask, max_affinity))
> +				return -EINVAL;

https://lore.kernel.org/r/alpine.DEB.2.20.1701171956290.3645@nanos

What is preventing the affinity spreading code from spreading the masks
out to unusable CPUs? The changelog is silent about that part.

> +	/*
> +	 * Generate default affinity. Either the possible subset of
> +	 * irq_default_affinity if such a subset is non-empty, or fall
> +	 * back to the provided max_affinity if there is no intersection.
..
> +	 * And just a copy of irq_default_affinity in the
> +	 * !CONFIG_CPUMASK_OFFSTACK case.

We know that already...

> +	 */
> +	memset(&default_affinity, 0, sizeof(default_affinity));

Right, memset() before allocating is useful.

> +	if ((max_affinity &&
> +	     !cpumask_subset(irq_default_affinity, max_affinity))) {
> +		if (!alloc_cpumask_var(&default_affinity, GFP_KERNEL))
> +			return -ENOMEM;
> +		cpumask_and(default_affinity, max_affinity,
> +			    irq_default_affinity);
> +		if (cpumask_empty(default_affinity))
> +			cpumask_copy(default_affinity, max_affinity);
> +	} else if (cpumask_available(default_affinity))
> +		cpumask_copy(default_affinity, irq_default_affinity);

That's garbage and unreadable.

Thanks,

        tglx
David Woodhouse Oct. 7, 2020, 7:19 a.m. UTC | #2
On Tue, 2020-10-06 at 23:26 +0200, Thomas Gleixner wrote:
> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > This is the maximum possible set of CPUs which can be used. Use it
> > to calculate the default affinity requested from __irq_alloc_descs()
> > by first attempting to find the intersection with irq_default_affinity,
> > or falling back to using just the max_affinity if the intersection
> > would be empty.
> 
> And why do we need that as yet another argument?
> 
> This is an optional property of the irq domain, really and no caller has
> any business with that. 

Because irq_domain_alloc_descs() doesn't actually *take* the domain as
an argument. It's more of an internal function, which is only non-
static because it's used from kernel/irq/ipi.c too for some reason. If
we convert the IPI code to just call __irq_alloc_descs() directly,
perhaps we can actually make irq_domain_alloc_descs() static.
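
A minimal sketch of that conversion, assuming the unmodified six-argument
__irq_alloc_descs() signature (i.e. before this series adds a default
affinity parameter); the hint of 1 mirrors what irq_domain_alloc_descs()
ends up using for hwirq == 0:

/* kernel/irq/ipi.c, irq_reserve_ipi(): sketch only */
virq = __irq_alloc_descs(-1, 1, nr_irqs, NUMA_NO_NODE, THIS_MODULE, NULL);
if (virq <= 0) {
	pr_warn("Can't reserve IPI, failed to alloc descs\n");
	return -ENOMEM;
}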

> >  int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
> > -			   int node, const struct irq_affinity_desc *affinity)
> > +			   int node, const struct irq_affinity_desc *affinity,
> > +			   const struct cpumask *max_affinity)
> >  {
> > +	cpumask_var_t default_affinity;
> >  	unsigned int hint;
> > +	int i;
> > +
> > +	/* Check requested per-IRQ affinities are in the possible range */
> > +	if (affinity && max_affinity) {
> > +		for (i = 0; i < cnt; i++)
> > +			if (!cpumask_subset(&affinity[i].mask, max_affinity))
> > +				return -EINVAL;
> 
> https://lore.kernel.org/r/alpine.DEB.2.20.1701171956290.3645@nanos
> 
> What is preventing the affinity spreading code from spreading the masks
> out to unusable CPUs? The changelog is silent about that part.

I'm coming to the conclusion that we should allow unusable CPUs to be
specified at this point, just as we do offline CPUs. That's largely
driven by the realisation that our x86_non_ir_cpumask is only going to
contain online CPUs anyway, and hotplugged CPUs only get added to it as
they are brought online.

> > +	/*
> > +	 * Generate default affinity. Either the possible subset of
> > +	 * irq_default_affinity if such a subset is non-empty, or fall
> > +	 * back to the provided max_affinity if there is no intersection.
> 
> ..
> > +	 * And just a copy of irq_default_affinity in the
> > +	 * !CONFIG_CPUMASK_OFFSTACK case.
> 
> We know that already...
> 
> > +	 */
> > +	memset(&default_affinity, 0, sizeof(default_affinity));
> 
> Right, memset() before allocating is useful.

The memset is because there's no cpumask_var_t initialiser that I can
see. So cpumask_available() on the uninitialised 'default_affinity'
variable might be true even in the OFFSTACK case.
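
For reference, the relevant definitions look roughly like this (paraphrased
from include/linux/cpumask.h of this era; worth double-checking against the
actual tree):

#ifdef CONFIG_CPUMASK_OFFSTACK
/* A pointer: stack garbage may be non-NULL, so cpumask_available() on an
 * uninitialised variable cannot be trusted until it is zeroed or allocated. */
typedef struct cpumask *cpumask_var_t;
#else
/* An embedded array: cpumask_available() is always true. */
typedef struct cpumask cpumask_var_t[1];
#endif

Hence the memset(): it forces the OFFSTACK pointer to NULL so that the later
cpumask_available() checks in the function do the right thing.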

> > +	if ((max_affinity &&
> > +	     !cpumask_subset(irq_default_affinity, max_affinity))) {
> > +		if (!alloc_cpumask_var(&default_affinity, GFP_KERNEL))
> > +			return -ENOMEM;
> > +		cpumask_and(default_affinity, max_affinity,
> > +			    irq_default_affinity);
> > +		if (cpumask_empty(default_affinity))
> > +			cpumask_copy(default_affinity, max_affinity);
> > +	} else if (cpumask_available(default_affinity))
> > +		cpumask_copy(default_affinity, irq_default_affinity);
> 
> That's garbage and unreadable.

That's why there was a comment explaining it... at which point you
claimed to already know :)

Clearly the comment didn't do the job it was supposed to do. I'll
rework.
Thomas Gleixner Oct. 7, 2020, 1:37 p.m. UTC | #3
On Wed, Oct 07 2020 at 08:19, David Woodhouse wrote:
> On Tue, 2020-10-06 at 23:26 +0200, Thomas Gleixner wrote:
>> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
>> > From: David Woodhouse <dwmw@amazon.co.uk>
>> > 
>> > This is the maximum possible set of CPUs which can be used. Use it
>> > to calculate the default affinity requested from __irq_alloc_descs()
>> > by first attempting to find the intersection with irq_default_affinity,
>> > or falling back to using just the max_affinity if the intersection
>> > would be empty.
>> 
>> And why do we need that as yet another argument?
>> 
>> This is an optional property of the irq domain, really and no caller has
>> any business with that. 
>
> Because irq_domain_alloc_descs() doesn't actually *take* the domain as
> an argument. It's more of an internal function, which is only non-
> static because it's used from kernel/irq/ipi.c too for some reason. If
> we convert the IPI code to just call __irq_alloc_descs() directly,
> perhaps we can actually make irq_domain_alloc_descs() static.

What is preventing you from changing the function signature? But handing
down irqdomain here is not cutting it. The right thing to do is to
replace 'struct irq_affinity_desc *affinity' with something more
flexible.

>> >  int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
>> > -			   int node, const struct irq_affinity_desc *affinity)
>> > +			   int node, const struct irq_affinity_desc *affinity,
>> > +			   const struct cpumask *max_affinity)
>> >  {
>> > +	cpumask_var_t default_affinity;
>> >  	unsigned int hint;
>> > +	int i;
>> > +
>> > +	/* Check requested per-IRQ affinities are in the possible range */
>> > +	if (affinity && max_affinity) {
>> > +		for (i = 0; i < cnt; i++)
>> > +			if (!cpumask_subset(&affinity[i].mask, max_affinity))
>> > +				return -EINVAL;
>> 
>> https://lore.kernel.org/r/alpine.DEB.2.20.1701171956290.3645@nanos
>> 
>> What is preventing the affinity spreading code from spreading the masks
>> out to unusable CPUs? The changelog is silent about that part.
>
> I'm coming to the conclusion that we should allow unusable CPUs to be
> specified at this point, just as we do offline CPUs. That's largely
> driven by the realisation that our x86_non_ir_cpumask is only going to
> contain online CPUs anyway, and hotplugged CPUs only get added to it as
> they are brought online.

Can you please stop looking at this from an x86-only perspective. It's
largely irrelevant what particular needs x86 or virt or whatever has.

Fact is, that if there are CPUs which cannot be targeted by device
interrupts then the multiqueue affinity mechanism has to be fixed to
handle this. Right now it's just broken.

Passing yet more cpumasks and random pointers around through device
drivers and whatever is just not going to happen. Neither are we going
to have

        arch_can_be_used_for_device_interrupts_mask

or whatever you come up with and claim it to be 'generic'.

The whole affinity control mechanism needs to be refactored from ground
up and the information about CPUs which can be targeted has to be
retrievable through the irqdomain hierarchy.

Anything else is just tinkering and I have zero interest in mopping up
after you.

It's pretty obvious that the irq domains are the right place to store
that information:

const struct cpumask *irqdomain_get_possible_affinity(struct irq_domain *d)
{
        while (d) {
                if (d->get_possible_affinity)
                        return d->get_possible_affinity(d);
                d = d->parent;
        }
        return cpu_possible_mask;
}

So if you look at X86 then you have either:

   [VECTOR] ----------------- [IO/APIC]
                          |-- [MSI]
                          |-- [WHATEVER]

or

   [VECTOR] ---[REMAP]------- [IO/APIC]
             |            |-- [MSI]
             |----------------[WHATEVER]

So if REMAP allows cpu_possible_mask and VECTOR some restricted subset
then irqdomain_get_possible_affinity() will return the correct result
independent of whether remapping is enabled or not.

This allows that to be used for other things like per-node restrictions or
whatever people come up with, without sprinkling more insanities through
the tree.

Thanks,

        tglx
David Woodhouse Oct. 7, 2020, 2:10 p.m. UTC | #4
On Wed, 2020-10-07 at 15:37 +0200, Thomas Gleixner wrote:
> On Wed, Oct 07 2020 at 08:19, David Woodhouse wrote:
> > On Tue, 2020-10-06 at 23:26 +0200, Thomas Gleixner wrote:
> > > On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > > 
> > > > This is the maximum possible set of CPUs which can be used. Use it
> > > > to calculate the default affinity requested from __irq_alloc_descs()
> > > > by first attempting to find the intersection with irq_default_affinity,
> > > > or falling back to using just the max_affinity if the intersection
> > > > would be empty.
> > > 
> > > And why do we need that as yet another argument?
> > > 
> > > This is an optional property of the irq domain, really and no caller has
> > > any business with that. 
> > 
> > Because irq_domain_alloc_descs() doesn't actually *take* the domain as
> > an argument. It's more of an internal function, which is only non-
> > static because it's used from kernel/irq/ipi.c too for some reason. If
> > we convert the IPI code to just call __irq_alloc_descs() directly,
> > perhaps we can actually make irq_domain_alloc_descs() static.
> 
> What is preventing you from changing the function signature? But handing
> down irqdomain here is not cutting it. The right thing to do is to
> replace 'struct irq_affinity_desc *affinity' with something more
> flexible.

Yeah, although I do think I want to ditch this part completely, and
treat the "possible" mask for the domain very much more like we do
cpu_online_mask. In that we can allow affinities to be *requested*
which are outside it, and they can just never be *effective* while
those CPUs aren't present and reachable.

That way a lot of the nastiness about preparing an "acceptable" mask in
advance can go away.

It's *also* driven, as I noted, by the realisation that on x86, the
x86_non_ir_cpumask will only ever contain those CPUs which have been
brought online so far and meet the criteria... but won't that be true
on *any* system where CPU numbers are virtual and not 1:1 mapped with
whatever determines reachability by the IRQ domain? It isn't *such* an
x86ism, surely?

> Fact is, that if there are CPUs which cannot be targeted by device
> interrupts then the multiqueue affinity mechanism has to be fixed to
> handle this. Right now it's just broken.

I think part of the problem there is that I don't really understand how
this part is *supposed* to work. I was focusing on getting the simple
case working first, in the expectation that we'd come back to that part
and you'd keep me honest. Is there some decent documentation on this
that I'm missing?

> It's pretty obvious that the irq domains are the right place to store
> that information:
> 
> const struct cpumask *irqdomain_get_possible_affinity(struct irq_domain *d)
> {
>         while (d) {
>                 if (d->get_possible_affinity)
>                         return d->get_possible_affinity(d);
>                 d = d->parent;
>         }
>         return cpu_possible_mask;
> }

Sure.

> So if you look at X86 then you have either:
> 
>    [VECTOR] ----------------- [IO/APIC]
>                           |-- [MSI]
>                           |-- [WHATEVER]
> 
> or
> 
>    [VECTOR] ---[REMAP]------- [IO/APIC]
>              |            |-- [MSI]
>              |----------------[WHATEVER]

Hierarchically, theoretically the IOAPIC and HPET really ought to be
children of the MSI domain. It's the Compatibility MSI which has the
restriction on destination ID, because of how many bits it interprets
from the MSI address. HPET and IOAPIC are just generating MSIs that hit
that upstream limit.

> So if REMAP allows cpu_possible_mask and VECTOR some restricted subset
> then irqdomain_get_possible_affinity() will return the correct result
> independent of whether remapping is enabled or not.

Sure. Although VECTOR doesn't have the restriction. As noted, it's the
Compatibility MSI that does. So the diagram looks something like this:

 [ VECTOR ] ---- [ REMAP ] ---- [ IR-MSI ] ---- [ IR-HPET ]
              |                             |---[ IR-PCI-MSI ]
              |                             |---[ IR-IOAPIC ]
              |
              |--[ COMPAT-MSI ] ---- [ HPET ]
                                 |-- [ PCI-MSI ]
                                 |-- [ IOAPIC ]


In this diagram, it's COMPAT-MSI that has the affinity restriction,
inherited by its children.
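
As an illustration of where that restriction comes from, the compat-MSI limit
could be expressed as a cpumask of CPUs whose APIC ID fits in the 8
destination-ID bits. This is a sketch only and does not correspond to code in
this series:

/* Sketch: collect CPUs addressable with an 8-bit MSI destination ID. */
static void build_8bit_destid_mask(struct cpumask *mask)
{
	unsigned int cpu;

	cpumask_clear(mask);
	for_each_possible_cpu(cpu) {
		/*
		 * CPUs which have never been registered still read
		 * BAD_APICID (0xFFFF) here, which is exactly the
		 * "mask only ever covers registered CPUs" problem
		 * discussed elsewhere in this thread.
		 */
		if (per_cpu(x86_cpu_to_apicid, cpu) < 256)
			cpumask_set_cpu(cpu, mask);
	}
}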

Now, I understand that you're not keen on IOAPIC actually being a child
of the MSI domain, and that's fine. In Linux right now, those generic
'IR-MSI' and 'COMPAT-MSI' domains don't exist. So all three of the
compatibility HPET, PCI-MSI and IOAPIC domains would have to add that
same 8-bit affinity limit for themselves, as their parent is the VECTOR
domain.

I suppose it *might* not hurt to pretend that VECTOR does indeed have
the limit, and to *remove* it in the remapping domain. And then the
affinity limit could be removed in one place by the REMAP domain
because even now in Linux's imprecise hierarchy, the IR-HPET, IR-PCI-
MSI and IR-IOAPIC domains *are* children of that.

But honestly, I'd rather iterate and get the generic irqdomain support
for affinity limits into decent shape before I worry *too* much about
precisely which way round it gets used by x86.
Thomas Gleixner Oct. 7, 2020, 3:57 p.m. UTC | #5
On Wed, Oct 07 2020 at 15:10, David Woodhouse wrote:
> On Wed, 2020-10-07 at 15:37 +0200, Thomas Gleixner wrote:
>> What is preventing you from changing the function signature? But handing
>> down irqdomain here is not cutting it. The right thing to do is to
>> replace 'struct irq_affinity_desc *affinity' with something more
>> flexible.
>
> Yeah, although I do think I want to ditch this part completely, and
> treat the "possible" mask for the domain very much more like we do
> cpu_online_mask. In that we can allow affinities to be *requested*
> which are outside it, and they can just never be *effective* while
> those CPUs aren't present and reachable.

Huch?

> That way a lot of the nastiness about preparing an "acceptable" mask in
> advance can go away.

There is not lot's of nastiness.

> It's *also* driven, as I noted, by the realisation that on x86, the
> x86_non_ir_cpumask will only ever contain those CPUs which have been
> brought online so far and meet the criteria... but won't that be true
> on *any* system where CPU numbers are virtual and not 1:1 mapped with
> whatever determines reachability by the IRQ domain? It isn't *such* an
> x86ism, surely?

Yes, but that's exactly the reason why I don't want to have whatever
mask name you chose to be directly exposed and want it to be part of the
irq domains because then you can express arbitrary per domain limits.

>> Fact is, that if there are CPUs which cannot be targeted by device
>> interrupts then the multiqueue affinity mechanism has to be fixed to
>> handle this. Right now it's just broken.
>
> I think part of the problem there is that I don't really understand how
> this part is *supposed* to work. I was focusing on getting the simple
> case working first, in the expectation that we'd come back to that part
> and you'd keep me honest. Is there some decent documentation on this
> that I'm missing?

TLDR & HTF;

Multiqueue devices want to have at max 1 queue per CPU or if the device
has less queues than CPUs they want the queues to have a fixed
association to a set of CPUs.

At setup time this is established considering possible CPUs to handle
'physical' hotplug correctly.

If a queue has no online CPUs it cannot be started. If it's active and
the last CPU goes down then it's quiesced and stopped and the core code
shuts down the interrupt and does not move it to a still online CPU.

So with your hackery, we end up in a situation where we have a large
possible mask, but not all CPUs in that mask can be reached, which means
in a 1 queue per CPU scenario all unreachable CPUs would have
dysfunctional queues.

So that spreading algorithm needs to know about this limitation.
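
For anyone following along, this is roughly how a multi-queue driver ends up
with that fixed association today. The device and queue count below are made
up, but pci_alloc_irq_vectors_affinity(), PCI_IRQ_AFFINITY and struct
irq_affinity are the real managed-interrupt entry points of this era:

#include <linux/pci.h>
#include <linux/interrupt.h>

static int example_setup_queue_irqs(struct pci_dev *pdev,
				    unsigned int nr_queues)
{
	/* One non-managed vector for admin/config interrupts. */
	struct irq_affinity affd = { .pre_vectors = 1 };
	int nvecs;

	/*
	 * The spreading code behind PCI_IRQ_AFFINITY distributes the queue
	 * vectors over *possible* CPUs so that a physically hotplugged CPU
	 * later lands on an already prepared queue.  This is the code that
	 * would have to learn about unreachable CPUs.
	 */
	nvecs = pci_alloc_irq_vectors_affinity(pdev, 2, nr_queues + 1,
					       PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					       &affd);
	return nvecs < 0 ? nvecs : 0;
}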

>> So if you look at X86 then you have either:
>> 
>>    [VECTOR] ----------------- [IO/APIC]
>>                           |-- [MSI]
>>                           |-- [WHATEVER]
>> 
>> or
>> 
>>    [VECTOR] ---[REMAP]------- [IO/APIC]
>>              |            |-- [MSI]
>>              |----------------[WHATEVER]
>
> Hierarchically, theoretically the IOAPIC and HPET really ought to be
> children of the MSI domain. It's the Compatibility MSI which has the
> restriction on destination ID, because of how many bits it interprets
> from the MSI address. HPET and IOAPIC are just generating MSIs that hit
> that upstream limit.

We kinda have that, but not nicely abstracted. But we surely can and
should fix that.

>> So if REMAP allows cpu_possible_mask and VECTOR some restricted subset
>> then irqdomain_get_possible_affinity() will return the correct result
>> independent of whether remapping is enabled or not.
>
> Sure. Although VECTOR doesn't have the restriction. As noted, it's the
> Compatibility MSI that does. So the diagram looks something like this:
>
>  [ VECTOR ] ---- [ REMAP ] ---- [ IR-MSI ] ---- [ IR-HPET ]
>               |                             |---[ IR-PCI-MSI ]
>               |                             |---[ IR-IOAPIC ]
>               |
>               |--[ COMPAT-MSI ] ---- [ HPET ]
>                                  |-- [ PCI-MSI ]
>                                  |-- [ IOAPIC ]
>
>
> In this diagram, it's COMPAT-MSI that has the affinity restriction,
> inherited by its children.
>
> Now, I understand that you're not keen on IOAPIC actually being a child
> of the MSI domain, and that's fine. In Linux right now, those generic
> 'IR-MSI' and 'COMPAT-MSI' domains don't exist. So all three of the
> compatibility HPET, PCI-MSI and IOAPIC domains would have to add that
> same 8-bit affinity limit for themselves, as their parent is the VECTOR
> domain.

No. We fix it proper and not by hacking around it.

> I suppose it *might* not hurt to pretend that VECTOR does indeed have
> the limit, and to *remove* it in the remapping domain. And then the
> affinity limit could be removed in one place by the REMAP domain
> because even now in Linux's imprecise hierarchy, the IR-HPET, IR-PCI-
> MSI and IR-IOAPIC domains *are* children of that.

It's not rocket science to fix that as the irq remapping code already
differentiates between the device types.

Thanks,

        tglx
David Woodhouse Oct. 7, 2020, 4:11 p.m. UTC | #6
On 7 October 2020 16:57:36 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Wed, Oct 07 2020 at 15:10, David Woodhouse wrote:
>> On Wed, 2020-10-07 at 15:37 +0200, Thomas Gleixner wrote:
>>> What is preventing you from changing the function signature? But handing
>>> down irqdomain here is not cutting it. The right thing to do is to
>>> replace 'struct irq_affinity_desc *affinity' with something more
>>> flexible.
>>
>> Yeah, although I do think I want to ditch this part completely, and
>> treat the "possible" mask for the domain very much more like we do
>> cpu_online_mask. In that we can allow affinities to be *requested*
>> which are outside it, and they can just never be *effective* while
>> those CPUs aren't present and reachable.
>
>Huch?
>
>> That way a lot of the nastiness about preparing an "acceptable" mask in
>> advance can go away.
>
>There is not lot's of nastiness.

OK, but I think we do have to cope with the fact that the limit is dynamic, and a CPU might be added which widens the mask. I think that's fundamental and not x86-specific.

>> It's *also* driven, as I noted, by the realisation that on x86, the
>> x86_non_ir_cpumask will only ever contain those CPUs which have been
>> brought online so far and meet the criteria... but won't that be true
>> on *any* system where CPU numbers are virtual and not 1:1 mapped with
>> whatever determines reachability by the IRQ domain? It isn't *such* an
>> x86ism, surely?
>
>Yes, but that's exactly the reason why I don't want to have whatever
>mask name you chose to be directly exposed and want it to be part of
>the
>irq domains because then you can express arbitrary per domain limits.

The x86_non_ir_cpumask isn't intended to be directly exposed to any generic IRQ code.

It's set up by the x86 APIC code so that those x86 IRQ domains which are affected by it can set it with irqdomain_set_max_affinity() for the generic code to see, without each having to create the cpumask from scratch for themselves.

> ... reading for once the kids are elsewhere...

Thanks.

>It's not rocket science to fix that as the irq remapping code already
>differentiates between the device types.

I don't actually like that very much. The IRQ remapping code could just compose the MSI messages that it desires without really having to care so much about the child device. The bit where IRQ remapping actually composes IOAPIC RTEs makes me particularly sad.
Thomas Gleixner Oct. 7, 2020, 8:53 p.m. UTC | #7
On Wed, Oct 07 2020 at 17:11, David Woodhouse wrote:
> On 7 October 2020 16:57:36 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>>There is not lot's of nastiness.
>
> OK, but I think we do have to cope with the fact that the limit is
> dynamic, and a CPU might be added which widens the mask. I think
> that's fundamental and not x86-specific.

Yes, but it needs some thoughts vs. serialization against CPU hotplug.

The stability of that cpumask is architecture specific if the calling
context cannot serialize against CPU hotplug. The hot-unplug case is
less interesting because the mask does not shrink, it only gets wider.

That's why I want a callback instead of a pointer assignment and that
callback cannot return a pointer to something which can be modified
concurrently, it needs to update a caller provided mask so that the
result is at least consistent in itself.
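
Something along these lines, as a sketch of that shape (the
->get_possible_affinity member is hypothetical, not an existing irqdomain
field):

/*
 * Fill a caller-provided mask instead of returning a pointer that CPU
 * hotplug could change underneath the caller; the result is then at
 * least self-consistent.
 */
static void irq_domain_get_possible_affinity(struct irq_domain *d,
					     struct cpumask *mask)
{
	for (; d; d = d->parent) {
		if (d->get_possible_affinity) {
			d->get_possible_affinity(d, mask);
			return;
		}
	}
	cpumask_copy(mask, cpu_possible_mask);
}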

It just occurred to me that there is something which needs some more
thought vs. CPU hotunplug as well:

  Right now we just check whether there are enough vectors available on
  the remaining online CPUs so that the interrupts which have an
  effective affinity directed to the outgoing CPU can be migrated away
  (modulo the managed ones).

  That check obviously needs to take the target CPU restrictions into
  account to prevent that devices end up with no target at the end.

I bet there are some other interesting bits and pieces which need
some care...

Thanks,

        tglx
David Woodhouse Oct. 8, 2020, 7:21 a.m. UTC | #8
On Wed, 2020-10-07 at 17:57 +0200, Thomas Gleixner wrote:
> TLDR & HTF;
> 
> Multiqueue devices want to have at max 1 queue per CPU or if the device
> has less queues than CPUs they want the queues to have a fixed
> association to a set of CPUs.
> 
> At setup time this is established considering possible CPUs to handle
> 'physical' hotplug correctly.
> 
> If a queue has no online CPUs it cannot be started. If it's active and
> the last CPU goes down then it's quiesced and stopped and the core code
> shuts down the interrupt and does not move it to a still online CPU.
> 
> So with your hackery, we end up in a situation where we have a large
> possible mask, but not all CPUs in that mask can be reached, which means
> in a 1 queue per CPU scenario all unreachable CPUs would have
> dysfunctional queues.
> 
> So that spreading algorithm needs to know about this limitation.

OK, thanks. So the queue exists, with an MSI assigned to point to an
offline CPU(s), but it cannot actually be used until/unless at least
one CPU in its mask comes online.

So when I said I wanted to try treating "reachable" the same way as
"online", that would mean the queue can't start until/unless at least
one *reachable* CPU in its mask comes online.


The underlying problem here is that until a CPU comes online, we don't
actually *know* if it's reachable or not.

So if we want to carefully create the affinity masks at setup time so that
they don't include any unreachable CPUs... that basically means we
don't include any non-present CPUs at all (unless they've been added
once and then removed).

That's because — at least on x86 — we don't assign CPU numbers to CPUs
which haven't yet been added. Theoretically we *could* but if there are
more than NR_CPUS listed in (e.g.) the MADT then we could run out of
CPU numbers. Then if the admin hotplugs one of the CPUs we *didn't*
have space for, we'd not be able to handle it.

I suppose there might be options like pre-assigning CPU numbers only to
non-present APIC IDs below 256 (or 32768). Or *grouping* the CPU
numbers, so some not-yet-assigned CPU#s are for low APICIDs and some
are for high APICIDs but it doesn't actually have to be a 1:1
predetermined mapping.

But those really do seem like hacks which might only apply on x86,
while the generic approach of treating "reachable" like "online" seems
like it would work in other cases too.

Fundamentally, there are three sets of CPUs. There are those known to
be reachable, those known not to be, and those which are not yet known.

So another approach we could use is to work with a cpumask of those
*known* not to be reachable, and to filter those *out* of the prebuilt
affinities. That gives us basically the right behaviour without
hotplug, but does include absent CPUs in a mask that *if* they are ever
added, wouldn't be able to receive the IRQ. Which does mean we'd have
to refrain from bringing up the corresponding queue.
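
Something like the following filtering step, as a sketch (the
known-unreachable mask is hypothetical; the irq_affinity_desc array is the
prebuilt per-vector affinity handed to irq_domain_alloc_descs()):

static void filter_unreachable_cpus(struct irq_affinity_desc *affd,
				    unsigned int cnt,
				    const struct cpumask *known_unreachable)
{
	unsigned int i;

	for (i = 0; i < cnt; i++) {
		/* Drop CPUs already known to be out of reach. */
		cpumask_andnot(&affd[i].mask, &affd[i].mask, known_unreachable);
		/*
		 * The mask may still contain absent CPUs whose reachability
		 * is unknown; if such a CPU is later added and turns out to
		 * be unreachable, the corresponding queue must not be
		 * brought up.
		 */
	}
}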
Thomas Gleixner Oct. 8, 2020, 9:34 a.m. UTC | #9
On Thu, Oct 08 2020 at 08:21, David Woodhouse wrote:
> On Wed, 2020-10-07 at 17:57 +0200, Thomas Gleixner wrote:
>> Multiqueue devices want to have at max 1 queue per CPU or if the device
>> has less queues than CPUs they want the queues to have a fixed
>> association to a set of CPUs.
>> 
>> At setup time this is established considering possible CPUs to handle
>> 'physical' hotplug correctly.
>> 
>> If a queue has no online CPUs it cannot be started. If it's active and
>> the last CPU goes down then it's quiesced and stopped and the core code
>> shuts down the interrupt and does not move it to a still online CPU.
>> 
>> So with your hackery, we end up in a situation where we have a large
>> possible mask, but not all CPUs in that mask can be reached, which means
>> in a 1 queue per CPU scenario all unreachable CPUs would have
>> dysfunctional queues.
>> 
>> So that spreading algorithm needs to know about this limitation.
>
> OK, thanks. So the queue exists, with an MSI assigned to point to an
> offline CPU(s), but it cannot actually be used until/unless at least
> one CPU in its mask comes online.

The MSI entry in that case is actually directed to an online CPU's
MANAGED_IRQ_SHUTDOWN_VECTOR to catch cases where an interrupt is raised
by the device after shutdown.

> So when I said I wanted to try treating "reachable" the same way as
> "online", that would mean the queue can't start until/unless at least
> one *reachable* CPU in its mask comes online.
>
> The underlying problem here is that until a CPU comes online, we don't
> actually *know* if it's reachable or not.

It's known before online, i.e. when the CPU is registered which is
either at boot time for present CPUs or at 'physical' hotplug.

> So if we want to carefully create the affinity masks at setup time so that
> they don't include any unreachable CPUs... that basically means we
> don't include any non-present CPUs at all (unless they've been added
> once and then removed).

That breaks _all_ multi-queue assumptions in one go. :)

> But those really do seem like hacks which might only apply on x86,
> while the generic approach of treating "reachable" like "online" seems
> like it would work in other cases too.
>
> Fundamentally, there are three sets of CPUs. There are those known to
> be reachable, those known not to be, and those which are not yet
> known.

Unfortunately there are lots of assumptions all over the place that
possible CPUs are reachable. Multi-queue using managed interrupts is
just the tip of the iceberg.

> So another approach we could use is to work with a cpumask of those
> *known* not to be reachable, and to filter those *out* of the prebuilt
> affinities. That gives us basically the right behaviour without
> hotplug, but does include absent CPUs in a mask that *if* they are ever
> added, wouldn't be able to receive the IRQ. Which does mean we'd have
> to refrain from bringing up the corresponding queue. 

The multi-queue drivers rely on the interrupt setup to create their
queues and the fundamental assumption is that this setup works. The
managed interrupt mechanism guarantees that the queue has a vector
available on all CPUs which are in the queues assigned affinity mask. As
of today it also guarantees that these CPUs are reachable once they come
online.

So in order to make that work you'd need to teach the multi-queue stuff
about this new world order:

 1) On hotplug the queue needs to be able to figure out whether the
    interrupt is functional. If not it has to redirect any requests to
    some actually functional queue.

 2) On unplug it needs to be able to figure out whether the interrupt
    will shutdown because the outgoing CPU is the last reachable in the
    group and if there are still online but unreachable CPUs then use
    the redirect mechanism.

I'm sure that the multi-queue people will be enthusiastic to add all of
this and deal with all the nasty corner cases coming out of it.

The overall conclusion for this is:

 1) X2APIC support on bare metal w/o irq remapping is not going to
    happen unless you:

      - added support in multi-queue devices which utilize managed
        interrupts
        
      - audited the whole tree for other assumptions related to the
        reachability of possible CPUs.

    I'm not expecting you to be done with that before I retire so for
    me it's just not going to happen :)

 2) X2APIC support on VIRT is possible if the extended ID magic is
    supported by the hypervisor because that does not make any CPU
    unreachable for MSI and therefore the multi-queue muck and
    everything else just works.

    This requires to have either the domain affinity limitation for HPET
    in place or just to force disable HPET or at least HPET-MSI which is
    a reasonable tradeoff.

    HPET is not required for guests which have kvmclock and
    APIC/deadline timer and known (hypervisor provided) frequencies.

Anything else is just wishful thinking, really.

Thanks,

        tglx
David Woodhouse Oct. 8, 2020, 11:10 a.m. UTC | #10
On Thu, 2020-10-08 at 11:34 +0200, Thomas Gleixner wrote:
> The overall conclusion for this is:
> 
>  1) X2APIC support on bare metal w/o irq remapping is not going to
>     happen unless you:
> 
>       - added support in multi-queue devices which utilize managed
>         interrupts
>         
>       - audited the whole tree for other assumptions related to the
>         reachability of possible CPUs.
> 
>     I'm not expecting you to be done with that before I retire so for
>     me it's just not going to happen :)

Makes sense. It probably does mean we should add a BUG_ON for the case
where IRQ remapping *is* enabled but any device is found which isn't
behind it. But that's OK.

>  2) X2APIC support on VIRT is possible if the extended ID magic is
>     supported by the hypervisor because that does not make any CPU
>     unreachable for MSI and therefore the multi-queue muck and
>     everything else just works.
> 
>     This requires to have either the domain affinity limitation for HPET
>     in place or just to force disable HPET or at least HPET-MSI which is
>     a reasonable tradeoff.
> 
>     HPET is not required for guests which have kvmclock and
>     APIC/deadline timer and known (hypervisor provided) frequencies.

HPET-MSI should work fine. Like the IOAPIC, it's just a child of the
*actual* MSI domain. The address/data in the MSI message are completely
opaque to it, and if the parent domain happens to put meaningful
information into bits 11-5 of the MSI address, the HPET won't even
notice.

The HPET's Tn_FSB_INT_ADDR register does have a full 32 bits of the MSI
address; it's not doing bit-swizzling like the IOAPIC does, which might
potentially *not* have been able to set certain bits in the MSI.
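
For reference, the address layout being discussed looks roughly like this
(a sketch of the x86 MSI address format from memory; the extended
destination ID bit positions should be checked against the spec and the
asm headers):

/*
 * Compose the low 32 bits of an MSI address.  Bits 19-12 carry destination
 * APIC ID bits 0-7; the "extended destination ID" reuses bits 11-5 for
 * APIC ID bits 8-14, which is what lets a hypervisor target >255 CPUs
 * without interrupt remapping.
 */
static u32 compose_msi_addr_lo(u32 apicid, bool logical_mode)
{
	u32 addr = 0xfee00000;			/* MSI base address */

	addr |= (apicid & 0xff) << 12;		/* destination ID bits 0-7 */
	addr |= ((apicid >> 8) & 0x7f) << 5;	/* extended dest ID bits 8-14 */
	if (logical_mode)
		addr |= 1U << 2;		/* logical destination mode */
	return addr;
}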
Thomas Gleixner Oct. 8, 2020, 12:40 p.m. UTC | #11
On Thu, Oct 08 2020 at 12:10, David Woodhouse wrote:
> On Thu, 2020-10-08 at 11:34 +0200, Thomas Gleixner wrote:
>> The overall conclusion for this is:
>> 
>>  1) X2APIC support on bare metal w/o irq remapping is not going to
>>     happen unless you:
>> 
>>       - added support in multi-queue devices which utilize managed
>>         interrupts
>>         
>>       - audited the whole tree for other assumptions related to the
>>         reachability of possible CPUs.
>> 
>>     I'm not expecting you to be done with that before I retire so for
>>     me it's just not going to happen :)
>
> Makes sense. It probably does mean we should add a BUG_ON for the case
> where IRQ remapping *is* enabled but any device is found which isn't
> behind it. But that's OK.

We can kinda gracefully handle that. See the completely untested and
incomplete patch below.

>>  2) X2APIC support on VIRT is possible if the extended ID magic is
>>     supported by the hypervisor because that does not make any CPU
>>     unreachable for MSI and therefore the multi-queue muck and
>>     everything else just works.
>> 
>>     This requires to have either the domain affinity limitation for HPET
>>     in place or just to force disable HPET or at least HPET-MSI which is
>>     a reasonable tradeoff.
>> 
>>     HPET is not required for guests which have kvmclock and
>>     APIC/deadline timer and known (hypervisor provided) frequencies.
>
> HPET-MSI should work fine. Like the IOAPIC, it's just a child of the
> *actual* MSI domain. The address/data in the MSI message are completely
> opaque to it, and if the parent domain happens to put meaningful
> information into bits 11-5 of the MSI address, the HPET won't even
> notice.
>
> The HPET's Tn_FSB_INT_ADDR register does have a full 32 bits of the MSI
> address; it's not doing bit-swizzling like the IOAPIC does, which might
> potentially *not* have been able to set certain bits in the MSI.

Indeed. I thought it was crippled in some way, but you're right it has
all the bits.

Thanks,

        tglx
---
Subject: x86/iommu: Make interrupt remapping more robust
From: Thomas Gleixner <tglx@linutronix.de>
Date: Thu, 08 Oct 2020 14:09:44 +0200

Needs to be split into pieces and cover PCI proper. Right now PCI gets a
NULL pointer assigned which makes it explode at the wrong place
later. Also hyperv iommu wants some love.

NOT-Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/io_apic.c      |    4 +++-
 arch/x86/kernel/apic/msi.c          |   24 ++++++++++++++----------
 drivers/iommu/amd/iommu.c           |    6 +++---
 drivers/iommu/intel/irq_remapping.c |    4 ++--
 4 files changed, 22 insertions(+), 16 deletions(-)

--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2300,7 +2300,9 @@ static int mp_irqdomain_create(int ioapi
 	info.type = X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT;
 	info.devid = mpc_ioapic_id(ioapic);
 	parent = irq_remapping_get_irq_domain(&info);
-	if (!parent)
+	if (IS_ERR(parent))
+		return PTR_ERR(parent);
+	else if (!parent)
 		parent = x86_vector_domain;
 	else
 		name = "IO-APIC-IR";
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -415,9 +415,9 @@ static struct msi_domain_info hpet_msi_d
 struct irq_domain *hpet_create_irq_domain(int hpet_id)
 {
 	struct msi_domain_info *domain_info;
+	struct fwnode_handle *fn = NULL;
 	struct irq_domain *parent, *d;
 	struct irq_alloc_info info;
-	struct fwnode_handle *fn;
 
 	if (x86_vector_domain == NULL)
 		return NULL;
@@ -432,25 +432,29 @@ struct irq_domain *hpet_create_irq_domai
 	init_irq_alloc_info(&info, NULL);
 	info.type = X86_IRQ_ALLOC_TYPE_HPET_GET_PARENT;
 	info.devid = hpet_id;
+
 	parent = irq_remapping_get_irq_domain(&info);
-	if (parent == NULL)
+	if (IS_ERR(parent))
+		goto fail;
+	else if (!parent)
 		parent = x86_vector_domain;
 	else
 		hpet_msi_controller.name = "IR-HPET-MSI";
 
 	fn = irq_domain_alloc_named_id_fwnode(hpet_msi_controller.name,
 					      hpet_id);
-	if (!fn) {
-		kfree(domain_info);
-		return NULL;
-	}
+	if (!fn)
+		goto fail;
 
 	d = msi_create_irq_domain(fn, domain_info, parent);
-	if (!d) {
-		irq_domain_free_fwnode(fn);
-		kfree(domain_info);
-	}
+	if (!d)
+		goto fail;
 	return d;
+
+fail:
+	irq_domain_free_fwnode(fn);
+	kfree(domain_info);
+	return NULL;
 }
 
 int hpet_assign_irq(struct irq_domain *domain, struct hpet_channel *hc,
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3557,7 +3557,7 @@ static struct irq_domain *get_irq_domain
 	struct amd_iommu *iommu = amd_iommu_rlookup_table[devid];
 
 	if (!iommu)
-		return NULL;
+		return ERR_PTR(-ENODEV);
 
 	switch (info->type) {
 	case X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT:
@@ -3565,7 +3565,7 @@ static struct irq_domain *get_irq_domain
 		return iommu->ir_domain;
 	default:
 		WARN_ON_ONCE(1);
-		return NULL;
+		return ERR_PTR(-ENODEV);
 	}
 }
 
@@ -3578,7 +3578,7 @@ static struct irq_domain *get_irq_domain
 
 	devid = get_devid(info);
 	if (devid < 0)
-		return NULL;
+		return ERR_PTR(-ENODEV);
 	return get_irq_domain_for_devid(info, devid);
 }
 
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -212,7 +212,7 @@ static struct irq_domain *map_hpet_to_ir
 		if (ir_hpet[i].id == hpet_id && ir_hpet[i].iommu)
 			return ir_hpet[i].iommu->ir_domain;
 	}
-	return NULL;
+	return ERR_PTR(-ENODEV);
 }
 
 static struct intel_iommu *map_ioapic_to_iommu(int apic)
@@ -230,7 +230,7 @@ static struct irq_domain *map_ioapic_to_
 {
 	struct intel_iommu *iommu = map_ioapic_to_iommu(apic);
 
-	return iommu ? iommu->ir_domain : NULL;
+	return iommu ? iommu->ir_domain : ERR_PTR(-ENODEV);
 }
 
 static struct irq_domain *map_dev_to_ir(struct pci_dev *dev)
David Woodhouse Oct. 9, 2020, 7:54 a.m. UTC | #12
On Thu, 2020-10-08 at 14:40 +0200, Thomas Gleixner wrote:
> Subject: x86/iommu: Make interrupt remapping more robust
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Thu, 08 Oct 2020 14:09:44 +0200
> 
> Needs to be split into pieces and cover PCI proper. Right now PCI gets a
> NULL pointer assigned which makes it explode at the wrong place
> later. Also hyperv iommu wants some love.
> 
> NOT-Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/kernel/apic/io_apic.c      |    4 +++-
>  arch/x86/kernel/apic/msi.c          |   24 ++++++++++++++----------
>  drivers/iommu/amd/iommu.c           |    6 +++---
>  drivers/iommu/intel/irq_remapping.c |    4 ++--
>  4 files changed, 22 insertions(+), 16 deletions(-)
> 
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -2300,7 +2300,9 @@ static int mp_irqdomain_create(int ioapi
>         info.type = X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT;
>         info.devid = mpc_ioapic_id(ioapic);
>         parent = irq_remapping_get_irq_domain(&info);
> -       if (!parent)
> +       if (IS_ERR(parent))
> +               return PTR_ERR(parent);
> +       else if (!parent)
>                 parent = x86_vector_domain;
>         else
>                 name = "IO-APIC-IR";
> --- a/arch/x86/kernel/apic/msi.c
> +++ b/arch/x86/kernel/apic/msi.c
> @@ -415,9 +415,9 @@ static struct msi_domain_info hpet_msi_d
>  struct irq_domain *hpet_create_irq_domain(int hpet_id)
>  {
>         struct msi_domain_info *domain_info;
> +       struct fwnode_handle *fn = NULL;
>         struct irq_domain *parent, *d;
>         struct irq_alloc_info info;
> -       struct fwnode_handle *fn;
>  
>         if (x86_vector_domain == NULL)
>                 return NULL;
> @@ -432,25 +432,29 @@ struct irq_domain *hpet_create_irq_domai
>         init_irq_alloc_info(&info, NULL);
>         info.type = X86_IRQ_ALLOC_TYPE_HPET_GET_PARENT;
>         info.devid = hpet_id;
> +
>         parent = irq_remapping_get_irq_domain(&info);
> -       if (parent == NULL)
> +       if (IS_ERR(parent))
> +               goto fail;
> +       else if (!parent)
>                 parent = x86_vector_domain;
>         else
>                 hpet_msi_controller.name = "IR-HPET-MSI";

Hrm... this is the '-ENODEV changes' that you wanted me to pick up,
right?

I confess I don't like it very much.

I'd much prefer to be able to use a generic foo_get_parent_irq_domain()
function which just returns an answer or an error.

Definitely none of this "if it returns NULL, caller fills it in for
themselves with some magic global because we're special".

And I don't much like abusing the irq_alloc_info for this either. I
didn't like the irq_alloc_info very much in its *original* use cases,
for actually allocating IRQs. Abusing it for this is worse.
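
The shape being argued for here would look more like this; the
hpet_get_parent_irq_domain() helper is hypothetical and does not exist,
this is just a sketch of the calling convention:

/* Fragment of hpet_create_irq_domain(), sketch only. */
struct irq_domain *parent;

parent = hpet_get_parent_irq_domain(hpet_id);	/* hypothetical helper */
if (IS_ERR(parent))
	return NULL;	/* or propagate the error, depending on the caller */

/*
 * Whether 'parent' is a remapping domain or the plain vector domain is
 * the helper's business; the HPET code no longer fills in a global
 * fallback or inspects an irq_alloc_info.
 */
d = msi_create_irq_domain(fn, domain_info, parent);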

I'm more than happy to start digging into this, but once I do I fear
I'm not going to stop until HPET and IOAPIC don't know *anything* about
whether they're behind IR or not.

And yes, I know that IOAPIC has a different EOI method for IR but
AFAICT that's *already* hosed for Hyper-V because it doesn't *really*
do remapping and so the RTEs *can* change there to move interrupts
around. There are more differences between ioapic_ack_level() and
ioapic_ir_ack_level() than the erratum workaround and whether we use
data->entry.vector... aren't there?

For the specific case you were targeting with this patch, you already
successfully persuaded me it should never happen, and there's a WARN_ON
which would trigger if it ever does (with too many CPUs in the system).

Let's come back and tackle this can of worms in a separate cleanup
series.

Patch

diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index 44445d9de881..6b5576da77f0 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -278,7 +278,8 @@  extern void irq_set_default_host(struct irq_domain *host);
 extern struct irq_domain *irq_get_default_host(void);
 extern int irq_domain_alloc_descs(int virq, unsigned int nr_irqs,
 				  irq_hw_number_t hwirq, int node,
-				  const struct irq_affinity_desc *affinity);
+				  const struct irq_affinity_desc *affinity,
+				  const struct cpumask *max_affinity);
 
 static inline struct fwnode_handle *of_node_to_fwnode(struct device_node *node)
 {
diff --git a/kernel/irq/ipi.c b/kernel/irq/ipi.c
index 43e3d1be622c..13f56210eca9 100644
--- a/kernel/irq/ipi.c
+++ b/kernel/irq/ipi.c
@@ -75,7 +75,7 @@  int irq_reserve_ipi(struct irq_domain *domain,
 		}
 	}
 
-	virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE, NULL);
+	virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE, NULL, dest);
 	if (virq <= 0) {
 		pr_warn("Can't reserve IPI, failed to alloc descs\n");
 		return -ENOMEM;
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index c93e00ca11d8..ffd41f21afca 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -660,7 +660,8 @@  unsigned int irq_create_mapping(struct irq_domain *domain,
 	}
 
 	/* Allocate a virtual interrupt number */
-	virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), NULL);
+	virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node),
+				      NULL, NULL);
 	if (virq <= 0) {
 		pr_debug("-> virq allocation failed\n");
 		return 0;
@@ -1011,25 +1012,57 @@  int irq_domain_translate_twocell(struct irq_domain *d,
 EXPORT_SYMBOL_GPL(irq_domain_translate_twocell);
 
 int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
-			   int node, const struct irq_affinity_desc *affinity)
+			   int node, const struct irq_affinity_desc *affinity,
+			   const struct cpumask *max_affinity)
 {
+	cpumask_var_t default_affinity;
 	unsigned int hint;
+	int i;
+
+	/* Check requested per-IRQ affinities are in the possible range */
+	if (affinity && max_affinity) {
+		for (i = 0; i < cnt; i++)
+			if (!cpumask_subset(&affinity[i].mask, max_affinity))
+				return -EINVAL;
+	}
+
+	/*
+	 * Generate default affinity. Either the possible subset of
+	 * irq_default_affinity if such a subset is non-empty, or fall
+	 * back to the provided max_affinity if there is no intersection.
+	 * And just a copy of irq_default_affinity in the
+	 * !CONFIG_CPUMASK_OFFSTACK case.
+	 */
+	memset(&default_affinity, 0, sizeof(default_affinity));
+	if ((max_affinity &&
+	     !cpumask_subset(irq_default_affinity, max_affinity))) {
+		if (!alloc_cpumask_var(&default_affinity, GFP_KERNEL))
+			return -ENOMEM;
+		cpumask_and(default_affinity, max_affinity,
+			    irq_default_affinity);
+		if (cpumask_empty(default_affinity))
+			cpumask_copy(default_affinity, max_affinity);
+	} else if (cpumask_available(default_affinity))
+		cpumask_copy(default_affinity, irq_default_affinity);
 
 	if (virq >= 0) {
 		virq = __irq_alloc_descs(virq, virq, cnt, node, THIS_MODULE,
-					 affinity, NULL);
+					 affinity, default_affinity);
 	} else {
 		hint = hwirq % nr_irqs;
 		if (hint == 0)
 			hint++;
 		virq = __irq_alloc_descs(-1, hint, cnt, node, THIS_MODULE,
-					 affinity, NULL);
+					 affinity, default_affinity);
 		if (virq <= 0 && hint > 1) {
 			virq = __irq_alloc_descs(-1, 1, cnt, node, THIS_MODULE,
-						 affinity, NULL);
+						 affinity, default_affinity);
 		}
 	}
 
+	if (cpumask_available(default_affinity))
+		free_cpumask_var(default_affinity);
+
 	return virq;
 }
 
@@ -1342,7 +1375,7 @@  int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
 		virq = irq_base;
 	} else {
 		virq = irq_domain_alloc_descs(irq_base, nr_irqs, 0, node,
-					      affinity);
+					      affinity, NULL);
 		if (virq < 0) {
 			pr_debug("cannot allocate IRQ(base %d, count %d)\n",
 				 irq_base, nr_irqs);