Message ID | 20201005152856.974112-7-dwmw2@infradead.org (mailing list archive)
State | New, archived
Series | Fix per-domain IRQ affinity, allow >255 CPUs on x86 without IRQ remapping
On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> This is the maximum possible set of CPUs which can be used. Use it
> to calculate the default affinity requested from __irq_alloc_descs()
> by first attempting to find the intersection with irq_default_affinity,
> or falling back to using just the max_affinity if the intersection
> would be empty.

And why do we need that as yet another argument?

This is an optional property of the irq domain, really and no caller has
any business with that.

>  int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
> -			   int node, const struct irq_affinity_desc *affinity)
> +			   int node, const struct irq_affinity_desc *affinity,
> +			   const struct cpumask *max_affinity)
>  {
> +	cpumask_var_t default_affinity;
>  	unsigned int hint;
> +	int i;
> +
> +	/* Check requested per-IRQ affinities are in the possible range */
> +	if (affinity && max_affinity) {
> +		for (i = 0; i < cnt; i++)
> +			if (!cpumask_subset(&affinity[i].mask, max_affinity))
> +				return -EINVAL;

https://lore.kernel.org/r/alpine.DEB.2.20.1701171956290.3645@nanos

What is preventing the affinity spreading code from spreading the masks
out to unusable CPUs? The changelog is silent about that part.

> +	/*
> +	 * Generate default affinity. Either the possible subset of
> +	 * irq_default_affinity if such a subset is non-empty, or fall
> +	 * back to the provided max_affinity if there is no intersection.

..

> +	 * And just a copy of irq_default_affinity in the
> +	 * !CONFIG_CPUMASK_OFFSTACK case.

We know that already...

> +	 */
> +	memset(&default_affinity, 0, sizeof(default_affinity));

Right, memset() before allocating is useful.

> +	if ((max_affinity &&
> +	     !cpumask_subset(irq_default_affinity, max_affinity))) {
> +		if (!alloc_cpumask_var(&default_affinity, GFP_KERNEL))
> +			return -ENOMEM;
> +		cpumask_and(default_affinity, max_affinity,
> +			    irq_default_affinity);
> +		if (cpumask_empty(default_affinity))
> +			cpumask_copy(default_affinity, max_affinity);
> +	} else if (cpumask_available(default_affinity))
> +		cpumask_copy(default_affinity, irq_default_affinity);

That's garbage and unreadable.

Thanks,

        tglx
On Tue, 2020-10-06 at 23:26 +0200, Thomas Gleixner wrote:
> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > This is the maximum possible set of CPUs which can be used. Use it
> > to calculate the default affinity requested from __irq_alloc_descs()
> > by first attempting to find the intersection with irq_default_affinity,
> > or falling back to using just the max_affinity if the intersection
> > would be empty.
>
> And why do we need that as yet another argument?
>
> This is an optional property of the irq domain, really and no caller has
> any business with that.

Because irq_domain_alloc_descs() doesn't actually *take* the domain as
an argument. It's more of an internal function, which is only non-
static because it's used from kernel/irq/ipi.c too for some reason. If
we convert the IPI code to just call __irq_alloc_descs() directly,
perhaps we can actually make irq_domain_alloc_descs() static.

> >  int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
> > -			   int node, const struct irq_affinity_desc *affinity)
> > +			   int node, const struct irq_affinity_desc *affinity,
> > +			   const struct cpumask *max_affinity)
> >  {
> > +	cpumask_var_t default_affinity;
> >  	unsigned int hint;
> > +	int i;
> > +
> > +	/* Check requested per-IRQ affinities are in the possible range */
> > +	if (affinity && max_affinity) {
> > +		for (i = 0; i < cnt; i++)
> > +			if (!cpumask_subset(&affinity[i].mask, max_affinity))
> > +				return -EINVAL;
>
> https://lore.kernel.org/r/alpine.DEB.2.20.1701171956290.3645@nanos
>
> What is preventing the affinity spreading code from spreading the masks
> out to unusable CPUs? The changelog is silent about that part.

I'm coming to the conclusion that we should allow unusable CPUs to be
specified at this point, just as we do offline CPUs. That's largely
driven by the realisation that our x86_non_ir_cpumask is only going to
contain online CPUs anyway, and hotplugged CPUs only get added to it as
they are brought online.

> > +	/*
> > +	 * Generate default affinity. Either the possible subset of
> > +	 * irq_default_affinity if such a subset is non-empty, or fall
> > +	 * back to the provided max_affinity if there is no intersection.
>
> ..
>
> > +	 * And just a copy of irq_default_affinity in the
> > +	 * !CONFIG_CPUMASK_OFFSTACK case.
>
> We know that already...
>
> > +	 */
> > +	memset(&default_affinity, 0, sizeof(default_affinity));
>
> Right, memset() before allocating is useful.

The memset is because there's no cpumask_var_t initialiser that I can
see. So cpumask_available() on the uninitialised 'default_affinity'
variable might be true even in the OFFSTACK case.

> > +	if ((max_affinity &&
> > +	     !cpumask_subset(irq_default_affinity, max_affinity))) {
> > +		if (!alloc_cpumask_var(&default_affinity, GFP_KERNEL))
> > +			return -ENOMEM;
> > +		cpumask_and(default_affinity, max_affinity,
> > +			    irq_default_affinity);
> > +		if (cpumask_empty(default_affinity))
> > +			cpumask_copy(default_affinity, max_affinity);
> > +	} else if (cpumask_available(default_affinity))
> > +		cpumask_copy(default_affinity, irq_default_affinity);
>
> That's garbage and unreadable.

That's why there was a comment explaining it... at which point you
claimed to already know :)

Clearly the comment didn't do the job it was supposed to do. I'll
rework.
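[For illustration only: one way the promised rework could make the default-affinity computation readable is to pull it into a helper, roughly as below. This is a sketch against the quoted hunk, not the actual respin; the helper name is hypothetical.]

	/*
	 * Hypothetical helper: compute the default affinity for new irq
	 * descriptors, honouring an optional per-domain limit. Falls back
	 * to the limit itself if it does not intersect irq_default_affinity.
	 */
	static void irq_domain_default_affinity(const struct cpumask *max_affinity,
						struct cpumask *out)
	{
		if (!max_affinity ||
		    cpumask_subset(irq_default_affinity, max_affinity)) {
			cpumask_copy(out, irq_default_affinity);
			return;
		}

		cpumask_and(out, irq_default_affinity, max_affinity);
		if (cpumask_empty(out))
			cpumask_copy(out, max_affinity);
	}

[The allocation of the temporary cpumask and the calls to __irq_alloc_descs() would then stay in irq_domain_alloc_descs(), with no branching on CONFIG_CPUMASK_OFFSTACK visible to the reader.]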
On Wed, Oct 07 2020 at 08:19, David Woodhouse wrote:
> On Tue, 2020-10-06 at 23:26 +0200, Thomas Gleixner wrote:
>> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
>> > From: David Woodhouse <dwmw@amazon.co.uk>
>> >
>> > This is the maximum possible set of CPUs which can be used. Use it
>> > to calculate the default affinity requested from __irq_alloc_descs()
>> > by first attempting to find the intersection with irq_default_affinity,
>> > or falling back to using just the max_affinity if the intersection
>> > would be empty.
>>
>> And why do we need that as yet another argument?
>>
>> This is an optional property of the irq domain, really and no caller has
>> any business with that.
>
> Because irq_domain_alloc_descs() doesn't actually *take* the domain as
> an argument. It's more of an internal function, which is only non-
> static because it's used from kernel/irq/ipi.c too for some reason. If
> we convert the IPI code to just call __irq_alloc_descs() directly,
> perhaps we can actually make irq_domain_alloc_descs() static.

What is preventing you from changing the function signature? But handing
down irqdomain here is not cutting it. The right thing to do is to
replace 'struct irq_affinity_desc *affinity' with something more
flexible.

>> >  int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
>> > -			   int node, const struct irq_affinity_desc *affinity)
>> > +			   int node, const struct irq_affinity_desc *affinity,
>> > +			   const struct cpumask *max_affinity)
>> >  {
>> > +	cpumask_var_t default_affinity;
>> >  	unsigned int hint;
>> > +	int i;
>> > +
>> > +	/* Check requested per-IRQ affinities are in the possible range */
>> > +	if (affinity && max_affinity) {
>> > +		for (i = 0; i < cnt; i++)
>> > +			if (!cpumask_subset(&affinity[i].mask, max_affinity))
>> > +				return -EINVAL;
>>
>> https://lore.kernel.org/r/alpine.DEB.2.20.1701171956290.3645@nanos
>>
>> What is preventing the affinity spreading code from spreading the masks
>> out to unusable CPUs? The changelog is silent about that part.
>
> I'm coming to the conclusion that we should allow unusable CPUs to be
> specified at this point, just as we do offline CPUs. That's largely
> driven by the realisation that our x86_non_ir_cpumask is only going to
> contain online CPUs anyway, and hotplugged CPUs only get added to it as
> they are brought online.

Can you please stop looking at this from an x86-only perspective. It's
largely irrelevant what particular needs x86 or virt or whatever has.

Fact is, that if there are CPUs which cannot be targeted by device
interrupts then the multiqueue affinity mechanism has to be fixed to
handle this. Right now it's just broken.

Passing yet more cpumasks and random pointers around through device
drivers and whatever is just not going to happen. Neither are we going
to have arch_can_be_used_for_device_interrupts_mask or whatever you
come up with and claim it to be 'generic'.

The whole affinity control mechanism needs to be refactored from ground
up and the information about CPUs which can be targeted has to be
retrievable through the irqdomain hierarchy. Anything else is just
tinkering and I have zero interest in mopping up after you.
It's pretty obvious that the irq domains are the right place to store
that information:

const struct cpumask *irqdomain_get_possible_affinity(struct irq_domain *d)
{
        while (d) {
                if (d->get_possible_affinity)
                        return d->get_possible_affinity(d);
                d = d->parent;
        }
        return cpu_possible_mask;
}

So if you look at X86 then you have either:

   [VECTOR] ----------------- [IO/APIC]
                          |-- [MSI]
                          |-- [WHATEVER]

or

   [VECTOR] ---[REMAP]------- [IO/APIC]
       |                  |-- [MSI]
       |----------------[WHATEVER]

So if REMAP allows cpu_possible_mask and VECTOR some restricted subset
then irqdomain_get_possible_affinity() will return the correct result
independent whether remapping is enabled or not.

This allows to use that for other things like per node restrictions or
whatever people come up with, without sprinkling more insanities through
the tree.

Thanks,

        tglx
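[For illustration, a domain with a restricted reach could plug into this proposal roughly as follows. The get_possible_affinity callback is the proposal above and x86_non_ir_cpumask is the mask from this series; neither exists in mainline, and the setup function is purely hypothetical.]

	/* Sketch: the x86 vector domain advertises the CPUs it can reach */
	static const struct cpumask *x86_vector_possible_affinity(struct irq_domain *d)
	{
		/* CPUs whose APIC IDs are reachable without interrupt remapping */
		return &x86_non_ir_cpumask;
	}

	static void x86_vector_domain_set_limit(void)
	{
		/* Hypothetical hook-up during vector domain initialisation */
		x86_vector_domain->get_possible_affinity = x86_vector_possible_affinity;
	}

[A consumer would then call irqdomain_get_possible_affinity(domain) and walk up the hierarchy, rather than being handed an extra cpumask argument.]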
On Wed, 2020-10-07 at 15:37 +0200, Thomas Gleixner wrote:
> On Wed, Oct 07 2020 at 08:19, David Woodhouse wrote:
> > On Tue, 2020-10-06 at 23:26 +0200, Thomas Gleixner wrote:
> > > On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > >
> > > > This is the maximum possible set of CPUs which can be used. Use it
> > > > to calculate the default affinity requested from __irq_alloc_descs()
> > > > by first attempting to find the intersection with irq_default_affinity,
> > > > or falling back to using just the max_affinity if the intersection
> > > > would be empty.
> > >
> > > And why do we need that as yet another argument?
> > >
> > > This is an optional property of the irq domain, really and no caller has
> > > any business with that.
> >
> > Because irq_domain_alloc_descs() doesn't actually *take* the domain as
> > an argument. It's more of an internal function, which is only non-
> > static because it's used from kernel/irq/ipi.c too for some reason. If
> > we convert the IPI code to just call __irq_alloc_descs() directly,
> > perhaps we can actually make irq_domain_alloc_descs() static.
>
> What is preventing you from changing the function signature? But handing
> down irqdomain here is not cutting it. The right thing to do is to
> replace 'struct irq_affinity_desc *affinity' with something more
> flexible.

Yeah, although I do think I want to ditch this part completely, and
treat the "possible" mask for the domain very much more like we do
cpu_online_mask. In that we can allow affinities to be *requested*
which are outside it, and they can just never be *effective* while
those CPUs aren't present and reachable.

That way a lot of the nastiness about preparing an "acceptable" mask in
advance can go away.

It's *also* driven, as I noted, by the realisation that on x86, the
x86_non_ir_cpumask will only ever contain those CPUs which have been
brought online so far and meet the criteria... but won't that be true
on *any* system where CPU numbers are virtual and not 1:1 mapped with
whatever determines reachability by the IRQ domain? It isn't *such* an
x86ism, surely?

> Fact is, that if there are CPUs which cannot be targeted by device
> interrupts then the multiqueue affinity mechanism has to be fixed to
> handle this. Right now it's just broken.

I think part of the problem there is that I don't really understand how
this part is *supposed* to work. I was focusing on getting the simple
case working first, in the expectation that we'd come back to that part
and you'd keep me honest. Is there some decent documentation on this
that I'm missing?

> It's pretty obvious that the irq domains are the right place to store
> that information:
>
> const struct cpumask *irqdomain_get_possible_affinity(struct irq_domain *d)
> {
>         while (d) {
>                 if (d->get_possible_affinity)
>                         return d->get_possible_affinity(d);
>                 d = d->parent;
>         }
>         return cpu_possible_mask;
> }

Sure.

> So if you look at X86 then you have either:
>
>    [VECTOR] ----------------- [IO/APIC]
>                           |-- [MSI]
>                           |-- [WHATEVER]
>
> or
>
>    [VECTOR] ---[REMAP]------- [IO/APIC]
>        |                  |-- [MSI]
>        |----------------[WHATEVER]

Hierarchically, theoretically the IOAPIC and HPET really ought to be
children of the MSI domain. It's the Compatibility MSI which has the
restriction on destination ID, because of how many bits it interprets
from the MSI address. HPET and IOAPIC are just generating MSIs that hit
that upstream limit.
> So if REMAP allows cpu_possible_mask and VECTOR some restricted subset
> then irqdomain_get_possible_affinity() will return the correct result
> independent whether remapping is enabled or not.

Sure. Although VECTOR doesn't have the restriction. As noted, it's the
Compatibility MSI that does. So the diagram looks something like this:

   [ VECTOR ] ---- [ REMAP ] ---- [ IR-MSI ] ---- [ IR-HPET ]
       |                              |---[ IR-PCI-MSI ]
       |                              |---[ IR-IOAPIC ]
       |
       |--[ COMPAT-MSI ] ---- [ HPET ]
                 |-- [ PCI-MSI ]
                 |-- [ IOAPIC ]

In this diagram, it's COMPAT-MSI that has the affinity restriction,
inherited by its children.

Now, I understand that you're not keen on IOAPIC actually being a child
of the MSI domain, and that's fine. In Linux right now, those generic
'IR-MSI' and 'COMPAT-MSI' domains don't exist. So all three of the
compatibility HPET, PCI-MSI and IOAPIC domains would have to add that
same 8-bit affinity limit for themselves, as their parent is the VECTOR
domain.

I suppose it *might* not hurt to pretend that VECTOR does indeed have
the limit, and to *remove* it in the remapping domain. And then the
affinity limit could be removed in one place by the REMAP domain
because even now in Linux's imprecise hierarchy, the IR-HPET, IR-PCI-
MSI and IR-IOAPIC domains *are* children of that.

But honestly, I'd rather iterate and get the generic irqdomain support
for affinity limits into decent shape before I worry *too* much about
precisely which way round it gets used by x86.
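[For illustration, the source of the 8-bit limit discussed above: a compatibility-format MSI address carries the destination APIC ID only in bits 19:12, so APIC IDs above 255 cannot be encoded without remapping (or the extended-destination-ID bits 11:5 that come up later in the thread). The constants below are illustrative, not the kernel's exact definitions.]

	#define COMPAT_MSI_ADDR_BASE	0xfee00000U
	#define COMPAT_MSI_DEST_ID(id)	(((id) & 0xffU) << 12)	/* only 8 bits fit */

	/* Sketch: compose a compatibility-format MSI address for a target CPU */
	static u32 compat_msi_address(u32 apicid)
	{
		/* apicid > 255 is silently truncated here; that is the restriction
		 * the COMPAT-MSI domain (and its children) inherit. */
		return COMPAT_MSI_ADDR_BASE | COMPAT_MSI_DEST_ID(apicid);
	}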
On Wed, Oct 07 2020 at 15:10, David Woodhouse wrote:
> On Wed, 2020-10-07 at 15:37 +0200, Thomas Gleixner wrote:
>> What is preventing you from changing the function signature? But handing
>> down irqdomain here is not cutting it. The right thing to do is to
>> replace 'struct irq_affinity_desc *affinity' with something more
>> flexible.
>
> Yeah, although I do think I want to ditch this part completely, and
> treat the "possible" mask for the domain very much more like we do
> cpu_online_mask. In that we can allow affinities to be *requested*
> which are outside it, and they can just never be *effective* while
> those CPUs aren't present and reachable.

Huch?

> That way a lot of the nastiness about preparing an "acceptable" mask in
> advance can go away.

There is not lots of nastiness.

> It's *also* driven, as I noted, by the realisation that on x86, the
> x86_non_ir_cpumask will only ever contain those CPUs which have been
> brought online so far and meet the criteria... but won't that be true
> on *any* system where CPU numbers are virtual and not 1:1 mapped with
> whatever determines reachability by the IRQ domain? It isn't *such* an
> x86ism, surely?

Yes, but that's exactly the reason why I don't want to have whatever
mask name you chose to be directly exposed and want it to be part of
the irq domains because then you can express arbitrary per domain
limits.

>> Fact is, that if there are CPUs which cannot be targeted by device
>> interrupts then the multiqueue affinity mechanism has to be fixed to
>> handle this. Right now it's just broken.
>
> I think part of the problem there is that I don't really understand how
> this part is *supposed* to work. I was focusing on getting the simple
> case working first, in the expectation that we'd come back to that part
> and you'd keep me honest. Is there some decent documentation on this
> that I'm missing?

TLDR & HTF;

Multiqueue devices want to have at max 1 queue per CPU or if the device
has less queues than CPUs they want the queues to have a fixed
association to a set of CPUs.

At setup time this is established considering possible CPUs to handle
'physical' hotplug correctly.

If a queue has no online CPUs it cannot be started. If it's active and
the last CPU goes down then it's quiesced and stopped and the core code
shuts down the interrupt and does not move it to a still online CPU.

So with your hackery, we end up in a situation where we have a large
possible mask, but not all CPUs in that mask can be reached, which means
in a 1 queue per CPU scenario all unreachable CPUs would have
dysfunctional queues.

So that spreading algorithm needs to know about this limitation.

>> So if you look at X86 then you have either:
>>
>>    [VECTOR] ----------------- [IO/APIC]
>>                           |-- [MSI]
>>                           |-- [WHATEVER]
>>
>> or
>>
>>    [VECTOR] ---[REMAP]------- [IO/APIC]
>>        |                  |-- [MSI]
>>        |----------------[WHATEVER]
>
> Hierarchically, theoretically the IOAPIC and HPET really ought to be
> children of the MSI domain. It's the Compatibility MSI which has the
> restriction on destination ID, because of how many bits it interprets
> from the MSI address. HPET and IOAPIC are just generating MSIs that hit
> that upstream limit.

We kinda have that, but not nicely abstracted. But we surely can and
should fix that.

>> So if REMAP allows cpu_possible_mask and VECTOR some restricted subset
>> then irqdomain_get_possible_affinity() will return the correct result
>> independent whether remapping is enabled or not.
>
> Sure. Although VECTOR doesn't have the restriction.
> As noted, it's the Compatibility MSI that does. So the diagram looks
> something like this:
>
>    [ VECTOR ] ---- [ REMAP ] ---- [ IR-MSI ] ---- [ IR-HPET ]
>        |                              |---[ IR-PCI-MSI ]
>        |                              |---[ IR-IOAPIC ]
>        |
>        |--[ COMPAT-MSI ] ---- [ HPET ]
>                  |-- [ PCI-MSI ]
>                  |-- [ IOAPIC ]
>
> In this diagram, it's COMPAT-MSI that has the affinity restriction,
> inherited by its children.
>
> Now, I understand that you're not keen on IOAPIC actually being a child
> of the MSI domain, and that's fine. In Linux right now, those generic
> 'IR-MSI' and 'COMPAT-MSI' domains don't exist. So all three of the
> compatibility HPET, PCI-MSI and IOAPIC domains would have to add that
> same 8-bit affinity limit for themselves, as their parent is the VECTOR
> domain.

No. We fix it proper and not by hacking around it.

> I suppose it *might* not hurt to pretend that VECTOR does indeed have
> the limit, and to *remove* it in the remapping domain. And then the
> affinity limit could be removed in one place by the REMAP domain
> because even now in Linux's imprecise hierarchy, the IR-HPET, IR-PCI-
> MSI and IR-IOAPIC domains *are* children of that.

It's not rocket science to fix that as the irq remapping code already
differentiates between the device types.

Thanks,

        tglx
On 7 October 2020 16:57:36 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Wed, Oct 07 2020 at 15:10, David Woodhouse wrote:
>> On Wed, 2020-10-07 at 15:37 +0200, Thomas Gleixner wrote:
>>> What is preventing you from changing the function signature? But handing
>>> down irqdomain here is not cutting it. The right thing to do is to
>>> replace 'struct irq_affinity_desc *affinity' with something more
>>> flexible.
>>
>> Yeah, although I do think I want to ditch this part completely, and
>> treat the "possible" mask for the domain very much more like we do
>> cpu_online_mask. In that we can allow affinities to be *requested*
>> which are outside it, and they can just never be *effective* while
>> those CPUs aren't present and reachable.
>
>Huch?
>
>> That way a lot of the nastiness about preparing an "acceptable" mask in
>> advance can go away.
>
>There is not lots of nastiness.

OK, but I think we do have to cope with the fact that the limit is
dynamic, and a CPU might be added which widens the mask. I think that's
fundamental and not x86-specific.

>> It's *also* driven, as I noted, by the realisation that on x86, the
>> x86_non_ir_cpumask will only ever contain those CPUs which have been
>> brought online so far and meet the criteria... but won't that be true
>> on *any* system where CPU numbers are virtual and not 1:1 mapped with
>> whatever determines reachability by the IRQ domain? It isn't *such* an
>> x86ism, surely?
>
>Yes, but that's exactly the reason why I don't want to have whatever
>mask name you chose to be directly exposed and want it to be part of
>the irq domains because then you can express arbitrary per domain
>limits.

The x86_non_ir_mask isn't intended to be directly exposed to any
generic IRQ code. It's set up by the x86 APIC code so that those x86
IRQ domains which are affected by it, can set it with
irqdomain_set_max_affinity() for the generic code to see. Without each
having to create the cpumask from scratch for themselves.

> ... reading for once the kids are elsewhere...

Thanks.

>It's not rocket science to fix that as the irq remapping code already
>differentiates between the device types.

I don't actually like that very much. The IRQ remapping code could just
compose the MSI messages that it desires without really having to care
so much about the child device.

The bit where IRQ remapping actually composes IOAPIC RTEs makes me
particularly sad.
On Wed, Oct 07 2020 at 17:11, David Woodhouse wrote:
> On 7 October 2020 16:57:36 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>> There is not lots of nastiness.
>
> OK, but I think we do have to cope with the fact that the limit is
> dynamic, and a CPU might be added which widens the mask. I think
> that's fundamental and not x86-specific.

Yes, but it needs some thought vs. serialization against CPU hotplug.

The stability of that cpumask is architecture specific if the calling
context cannot serialize against CPU hotplug. The hot-unplug case is
less interesting because the mask does not shrink, it only gets wider.

That's why I want a callback instead of a pointer assignment and that
callback cannot return a pointer to something which can be modified
concurrently, it needs to update a caller provided mask so that the
result is at least consistent in itself.

It just occurred to me that there is something which needs some more
thought vs. CPU hotunplug as well:

Right now we just check whether there are enough vectors available on
the remaining online CPUs so that the interrupts which have an
effective affinity directed to the outgoing CPU can be migrated away
(modulo the managed ones).

That check obviously needs to take the target CPU restrictions into
account to prevent that devices end up with no target at the end.

I bet there are some other interesting bits and pieces which need some
care...

Thanks,

        tglx
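[A sketch of the callback shape described above, with the result written into a caller-provided mask so it is self-consistent even if CPU hotplug changes the underlying set concurrently. Names are illustrative; neither the helper nor the callback field exists in mainline.]

	/* Sketch: walk up the hierarchy and fill a caller-provided mask */
	int irq_domain_get_possible_affinity(struct irq_domain *d,
					     struct cpumask *mask)
	{
		while (d) {
			if (d->get_possible_affinity)
				return d->get_possible_affinity(d, mask);
			d = d->parent;
		}
		/* No restriction anywhere in the hierarchy */
		cpumask_copy(mask, cpu_possible_mask);
		return 0;
	}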
On Wed, 2020-10-07 at 17:57 +0200, Thomas Gleixner wrote:
> TLDR & HTF;
>
> Multiqueue devices want to have at max 1 queue per CPU or if the device
> has less queues than CPUs they want the queues to have a fixed
> association to a set of CPUs.
>
> At setup time this is established considering possible CPUs to handle
> 'physical' hotplug correctly.
>
> If a queue has no online CPUs it cannot be started. If it's active and
> the last CPU goes down then it's quiesced and stopped and the core code
> shuts down the interrupt and does not move it to a still online CPU.
>
> So with your hackery, we end up in a situation where we have a large
> possible mask, but not all CPUs in that mask can be reached, which means
> in a 1 queue per CPU scenario all unreachable CPUs would have
> dysfunctional queues.
>
> So that spreading algorithm needs to know about this limitation.

OK, thanks. So the queue exists, with an MSI assigned to point to an
offline CPU(s), but it cannot actually be used until/unless at least
one CPU in its mask comes online.

So when I said I wanted to try treating "reachable" the same way as
"online", that would mean the queue can't start until/unless at least
one *reachable* CPU in its mask comes online.

The underlying problem here is that until a CPU comes online, we don't
actually *know* if it's reachable or not.

So if we want to carefully create the affinity masks at setup time so
that they don't include any unreachable CPUs... that basically means we
don't include any non-present CPUs at all (unless they've been added
once and then removed).

That's because — at least on x86 — we don't assign CPU numbers to CPUs
which haven't yet been added. Theoretically we *could* but if there are
more than NR_CPUS listed in (e.g.) the MADT then we could run out of
CPU numbers. Then if the admin hotplugs one of the CPUs we *didn't*
have space for, we'd not be able to handle it.

I suppose there might be options like pre-assigning CPU numbers only to
non-present APIC IDs below 256 (or 32768). Or *grouping* the CPU
numbers, so some not-yet-assigned CPU#s are for low APICIDs and some
are for high APICIDs but it doesn't actually have to be a 1:1
predetermined mapping. But those really do seem like hacks which might
only apply on x86, while the generic approach of treating "reachable"
like "online" seems like it would work in other cases too.

Fundamentally, there are three sets of CPUs. There are those known to
be reachable, those known not to be, and those which are not yet known.

So another approach we could use is to work with a cpumask of those
*known* not to be reachable, and to filter those *out* of the prebuilt
affinities. That gives us basically the right behaviour without
hotplug, but does include absent CPUs in a mask that *if* they are ever
added, wouldn't be able to receive the IRQ. Which does mean we'd have
to refrain from bringing up the corresponding queue.
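[A sketch of the "filter out the CPUs known to be unreachable" variant described in the last paragraph. irq_unreachable_cpumask and the helper are hypothetical; they exist nowhere in the tree.]

	/* Hypothetical: CPUs known to be unreachable by device interrupts */
	static cpumask_var_t irq_unreachable_cpumask;

	/* Drop known-unreachable CPUs from one prebuilt per-vector affinity */
	static void irq_filter_unreachable(struct irq_affinity_desc *desc)
	{
		cpumask_andnot(&desc->mask, &desc->mask, irq_unreachable_cpumask);
	}

[Not-yet-known CPUs stay in the mask, which is exactly the "might never be able to receive the IRQ" caveat mentioned above.]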
On Thu, Oct 08 2020 at 08:21, David Woodhouse wrote:
> On Wed, 2020-10-07 at 17:57 +0200, Thomas Gleixner wrote:
>> Multiqueue devices want to have at max 1 queue per CPU or if the device
>> has less queues than CPUs they want the queues to have a fixed
>> association to a set of CPUs.
>>
>> At setup time this is established considering possible CPUs to handle
>> 'physical' hotplug correctly.
>>
>> If a queue has no online CPUs it cannot be started. If it's active and
>> the last CPU goes down then it's quiesced and stopped and the core code
>> shuts down the interrupt and does not move it to a still online CPU.
>>
>> So with your hackery, we end up in a situation where we have a large
>> possible mask, but not all CPUs in that mask can be reached, which means
>> in a 1 queue per CPU scenario all unreachable CPUs would have
>> dysfunctional queues.
>>
>> So that spreading algorithm needs to know about this limitation.
>
> OK, thanks. So the queue exists, with an MSI assigned to point to an
> offline CPU(s), but it cannot actually be used until/unless at least
> one CPU in its mask comes online.

The MSI entry in that case is actually directed to an online CPU's
MANAGED_IRQ_SHUTDOWN_VECTOR to catch cases where an interrupt is raised
by the device after shutdown.

> So when I said I wanted to try treating "reachable" the same way as
> "online", that would mean the queue can't start until/unless at least
> one *reachable* CPU in its mask comes online.
>
> The underlying problem here is that until a CPU comes online, we don't
> actually *know* if it's reachable or not.

It's known before online, i.e. when the CPU is registered which is
either at boot time for present CPUs or at 'physical' hotplug.

> So if we want to carefully create the affinity masks at setup time so
> that they don't include any unreachable CPUs... that basically means we
> don't include any non-present CPUs at all (unless they've been added
> once and then removed).

That breaks _all_ multi-queue assumptions in one go. :)

> But those really do seem like hacks which might only apply on x86,
> while the generic approach of treating "reachable" like "online" seems
> like it would work in other cases too.
>
> Fundamentally, there are three sets of CPUs. There are those known to
> be reachable, those known not to be, and those which are not yet known.

Unfortunately there are lots of assumptions all over the place that
possible CPUs are reachable. Multi-queue using managed interrupts is
just the tip of the iceberg.

> So another approach we could use is to work with a cpumask of those
> *known* not to be reachable, and to filter those *out* of the prebuilt
> affinities. That gives us basically the right behaviour without
> hotplug, but does include absent CPUs in a mask that *if* they are ever
> added, wouldn't be able to receive the IRQ. Which does mean we'd have
> to refrain from bringing up the corresponding queue.

The multi-queue drivers rely on the interrupt setup to create their
queues and the fundamental assumption is that this setup works. The
managed interrupt mechanism guarantees that the queue has a vector
available on all CPUs which are in the queue's assigned affinity mask.
As of today it also guarantees that these CPUs are reachable once they
come online.

So in order to make that work you'd need to teach the multi-queue stuff
about this new world order:

 1) On hotplug the queue needs to be able to figure out whether the
    interrupt is functional. If not it has to redirect any requests to
    some actually functional queue.
 2) On unplug it needs to be able to figure out whether the interrupt
    will shutdown because the outgoing CPU is the last reachable in the
    group and if there are still online but unreachable CPUs then use
    the redirect mechanism.

I'm sure that the multi-queue people will be enthusiastic to add all of
this and deal with all the nasty corner cases coming out of it.

The overall conclusion for this is:

 1) X2APIC support on bare metal w/o irq remapping is not going to
    happen unless you:

     - added support in multi-queue devices which utilize managed
       interrupts

     - audited the whole tree for other assumptions related to the
       reachability of possible CPUs.

    I'm not expecting you to be done with that before I retire so for
    me it's just not going to happen :)

 2) X2APIC support on VIRT is possible if the extended ID magic is
    supported by the hypervisor because that does not make any CPU
    unreachable for MSI and therefore the multi-queue muck and
    everything else just works.

    This requires to have either the domain affinity limitation for
    HPET in place or just to force disable HPET or at least HPET-MSI
    which is a reasonable tradeoff.

    HPET is not required for guests which have kvmclock and
    APIC/deadline timer and known (hypervisor provided) frequencies.

Anything else is just wishful thinking, really.

Thanks,

        tglx
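[For illustration, the check that item 1) above would require from a multi-queue driver might look roughly like this. Both the helper and the way the reachable mask is obtained are hypothetical; no such infrastructure exists today.]

	/*
	 * Sketch: can this queue's interrupt actually fire? The queue needs
	 * at least one CPU in its mask that is both online and reachable by
	 * its (managed) interrupt.
	 */
	static bool queue_irq_is_functional(const struct cpumask *queue_mask,
					    const struct cpumask *reachable_mask)
	{
		unsigned int cpu;

		for_each_cpu_and(cpu, queue_mask, reachable_mask) {
			if (cpu_online(cpu))
				return true;
		}
		return false;
	}

[If this returns false on hotplug, the driver would have to redirect requests to a functional queue, which is exactly the work item 1) describes.]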
On Thu, 2020-10-08 at 11:34 +0200, Thomas Gleixner wrote:
> The overall conclusion for this is:
>
>  1) X2APIC support on bare metal w/o irq remapping is not going to
>     happen unless you:
>
>      - added support in multi-queue devices which utilize managed
>        interrupts
>
>      - audited the whole tree for other assumptions related to the
>        reachability of possible CPUs.
>
>     I'm not expecting you to be done with that before I retire so for
>     me it's just not going to happen :)

Makes sense. It probably does mean we should add a BUG_ON for the case
where IRQ remapping *is* enabled but any device is found which isn't
behind it. But that's OK.

>  2) X2APIC support on VIRT is possible if the extended ID magic is
>     supported by the hypervisor because that does not make any CPU
>     unreachable for MSI and therefore the multi-queue muck and
>     everything else just works.
>
>     This requires to have either the domain affinity limitation for
>     HPET in place or just to force disable HPET or at least HPET-MSI
>     which is a reasonable tradeoff.
>
>     HPET is not required for guests which have kvmclock and
>     APIC/deadline timer and known (hypervisor provided) frequencies.

HPET-MSI should work fine. Like the IOAPIC, it's just a child of the
*actual* MSI domain. The address/data in the MSI message are completely
opaque to it, and if the parent domain happens to put meaningful
information into bits 11-5 of the MSI address, the HPET won't even
notice.

The HPET's Tn_FSB_INT_ADDR register does have a full 32 bits of the MSI
address; it's not doing bit-swizzling like the IOAPIC does, which might
potentially *not* have been able to set certain bits in the MSI.
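[For illustration of the last point: the HPET FSB route register is programmed with the full 32-bit MSI address and data, so whatever the parent domain encodes (including the extended-destination-ID bits 11:5) survives unchanged. This is roughly what the kernel's hpet_msi_write() does; details below are approximate.]

	/* Sketch: program one HPET channel's FSB interrupt route */
	static void hpet_fsb_write(unsigned int channel, struct msi_msg *msg)
	{
		hpet_writel(msg->data, HPET_Tn_ROUTE(channel));		/* FSB data word    */
		hpet_writel(msg->address_lo, HPET_Tn_ROUTE(channel) + 4);	/* full 32-bit addr */
	}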
On Thu, Oct 08 2020 at 12:10, David Woodhouse wrote:
> On Thu, 2020-10-08 at 11:34 +0200, Thomas Gleixner wrote:
>> The overall conclusion for this is:
>>
>>  1) X2APIC support on bare metal w/o irq remapping is not going to
>>     happen unless you:
>>
>>      - added support in multi-queue devices which utilize managed
>>        interrupts
>>
>>      - audited the whole tree for other assumptions related to the
>>        reachability of possible CPUs.
>>
>>     I'm not expecting you to be done with that before I retire so for
>>     me it's just not going to happen :)
>
> Makes sense. It probably does mean we should add a BUG_ON for the case
> where IRQ remapping *is* enabled but any device is found which isn't
> behind it. But that's OK.

We can kinda gracefully handle that. See the completely untested and
incomplete patch below.

>>  2) X2APIC support on VIRT is possible if the extended ID magic is
>>     supported by the hypervisor because that does not make any CPU
>>     unreachable for MSI and therefore the multi-queue muck and
>>     everything else just works.
>>
>>     This requires to have either the domain affinity limitation for
>>     HPET in place or just to force disable HPET or at least HPET-MSI
>>     which is a reasonable tradeoff.
>>
>>     HPET is not required for guests which have kvmclock and
>>     APIC/deadline timer and known (hypervisor provided) frequencies.
>
> HPET-MSI should work fine. Like the IOAPIC, it's just a child of the
> *actual* MSI domain. The address/data in the MSI message are completely
> opaque to it, and if the parent domain happens to put meaningful
> information into bits 11-5 of the MSI address, the HPET won't even
> notice.
>
> The HPET's Tn_FSB_INT_ADDR register does have a full 32 bits of the MSI
> address; it's not doing bit-swizzling like the IOAPIC does, which might
> potentially *not* have been able to set certain bits in the MSI.

Indeed. I thought it was crippled in some way, but you're right it has
all the bits.

Thanks,

        tglx

---
Subject: x86/iommu: Make interrupt remapping more robust
From: Thomas Gleixner <tglx@linutronix.de>
Date: Thu, 08 Oct 2020 14:09:44 +0200

Needs to be split into pieces and cover PCI proper. Right now PCI gets a
NULL pointer assigned which makes it explode at the wrong place later.
Also hyperv iommu wants some love.
NOT-Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/apic/io_apic.c      |    4 +++-
 arch/x86/kernel/apic/msi.c          |   24 ++++++++++++++----------
 drivers/iommu/amd/iommu.c           |    6 +++---
 drivers/iommu/intel/irq_remapping.c |    4 ++--
 4 files changed, 22 insertions(+), 16 deletions(-)

--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2300,7 +2300,9 @@ static int mp_irqdomain_create(int ioapi
 	info.type = X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT;
 	info.devid = mpc_ioapic_id(ioapic);
 	parent = irq_remapping_get_irq_domain(&info);
-	if (!parent)
+	if (IS_ERR(parent))
+		return PTR_ERR(parent);
+	else if (!parent)
 		parent = x86_vector_domain;
 	else
 		name = "IO-APIC-IR";
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -415,9 +415,9 @@ static struct msi_domain_info hpet_msi_d
 struct irq_domain *hpet_create_irq_domain(int hpet_id)
 {
 	struct msi_domain_info *domain_info;
+	struct fwnode_handle *fn = NULL;
 	struct irq_domain *parent, *d;
 	struct irq_alloc_info info;
-	struct fwnode_handle *fn;

 	if (x86_vector_domain == NULL)
 		return NULL;
@@ -432,25 +432,29 @@ struct irq_domain *hpet_create_irq_domai
 	init_irq_alloc_info(&info, NULL);
 	info.type = X86_IRQ_ALLOC_TYPE_HPET_GET_PARENT;
 	info.devid = hpet_id;
+
 	parent = irq_remapping_get_irq_domain(&info);
-	if (parent == NULL)
+	if (IS_ERR(parent))
+		goto fail;
+	else if (!parent)
 		parent = x86_vector_domain;
 	else
 		hpet_msi_controller.name = "IR-HPET-MSI";

 	fn = irq_domain_alloc_named_id_fwnode(hpet_msi_controller.name, hpet_id);
-	if (!fn) {
-		kfree(domain_info);
-		return NULL;
-	}
+	if (!fn)
+		goto fail;

 	d = msi_create_irq_domain(fn, domain_info, parent);
-	if (!d) {
-		irq_domain_free_fwnode(fn);
-		kfree(domain_info);
-	}
+	if (!d)
+		goto fail;
+
 	return d;
+
+fail:
+	irq_domain_free_fwnode(fn);
+	kfree(domain_info);
+	return NULL;
 }

 int hpet_assign_irq(struct irq_domain *domain, struct hpet_channel *hc,
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3557,7 +3557,7 @@ static struct irq_domain *get_irq_domain
 	struct amd_iommu *iommu = amd_iommu_rlookup_table[devid];

 	if (!iommu)
-		return NULL;
+		return ERR_PTR(-ENODEV);

 	switch (info->type) {
 	case X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT:
@@ -3565,7 +3565,7 @@ static struct irq_domain *get_irq_domain
 		return iommu->ir_domain;
 	default:
 		WARN_ON_ONCE(1);
-		return NULL;
+		return ERR_PTR(-ENODEV);
 	}
 }

@@ -3578,7 +3578,7 @@ static struct irq_domain *get_irq_domain
 	devid = get_devid(info);
 	if (devid < 0)
-		return NULL;
+		return ERR_PTR(-ENODEV);

 	return get_irq_domain_for_devid(info, devid);
 }
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -212,7 +212,7 @@ static struct irq_domain *map_hpet_to_ir
 		if (ir_hpet[i].id == hpet_id && ir_hpet[i].iommu)
 			return ir_hpet[i].iommu->ir_domain;
 	}
-	return NULL;
+	return ERR_PTR(-ENODEV);
 }

 static struct intel_iommu *map_ioapic_to_iommu(int apic)
@@ -230,7 +230,7 @@ static struct irq_domain *map_ioapic_to_
 {
 	struct intel_iommu *iommu = map_ioapic_to_iommu(apic);

-	return iommu ? iommu->ir_domain : NULL;
+	return iommu ? iommu->ir_domain : ERR_PTR(-ENODEV);
 }

 static struct irq_domain *map_dev_to_ir(struct pci_dev *dev)
On Thu, 2020-10-08 at 14:40 +0200, Thomas Gleixner wrote:
> Subject: x86/iommu: Make interrupt remapping more robust
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Thu, 08 Oct 2020 14:09:44 +0200
>
> Needs to be split into pieces and cover PCI proper. Right now PCI gets a
> NULL pointer assigned which makes it explode at the wrong place later.
> Also hyperv iommu wants some love.
>
> NOT-Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/kernel/apic/io_apic.c      |    4 +++-
>  arch/x86/kernel/apic/msi.c          |   24 ++++++++++++++----------
>  drivers/iommu/amd/iommu.c           |    6 +++---
>  drivers/iommu/intel/irq_remapping.c |    4 ++--
>  4 files changed, 22 insertions(+), 16 deletions(-)
>
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -2300,7 +2300,9 @@ static int mp_irqdomain_create(int ioapi
>  	info.type = X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT;
>  	info.devid = mpc_ioapic_id(ioapic);
>  	parent = irq_remapping_get_irq_domain(&info);
> -	if (!parent)
> +	if (IS_ERR(parent))
> +		return PTR_ERR(parent);
> +	else if (!parent)
>  		parent = x86_vector_domain;
>  	else
>  		name = "IO-APIC-IR";
> --- a/arch/x86/kernel/apic/msi.c
> +++ b/arch/x86/kernel/apic/msi.c
> @@ -415,9 +415,9 @@ static struct msi_domain_info hpet_msi_d
>  struct irq_domain *hpet_create_irq_domain(int hpet_id)
>  {
>  	struct msi_domain_info *domain_info;
> +	struct fwnode_handle *fn = NULL;
>  	struct irq_domain *parent, *d;
>  	struct irq_alloc_info info;
> -	struct fwnode_handle *fn;
>
>  	if (x86_vector_domain == NULL)
>  		return NULL;
> @@ -432,25 +432,29 @@ struct irq_domain *hpet_create_irq_domai
>  	init_irq_alloc_info(&info, NULL);
>  	info.type = X86_IRQ_ALLOC_TYPE_HPET_GET_PARENT;
>  	info.devid = hpet_id;
> +
>  	parent = irq_remapping_get_irq_domain(&info);
> -	if (parent == NULL)
> +	if (IS_ERR(parent))
> +		goto fail;
> +	else if (!parent)
>  		parent = x86_vector_domain;
>  	else
>  		hpet_msi_controller.name = "IR-HPET-MSI";

Hrm... this is the '-ENODEV changes' that you wanted me to pick up,
right?

I confess I don't like it very much. I'd much prefer to be able to use
a generic foo_get_parent_irq_domain() function which just returns an
answer or an error. Definitely none of this "if it returns NULL, caller
fills it in for themselves with some magic global because we're
special".

And I don't much like abusing the irq_alloc_info for this either. I
didn't like the irq_alloc_info very much in its *original* use cases,
for actually allocating IRQs. Abusing it for this is worse.

I'm more than happy to start digging into this, but once I do I fear
I'm not going to stop until HPET and IOAPIC don't know *anything* about
whether they're behind IR or not.

And yes, I know that IOAPIC has a different EOI method for IR but
AFAICT that's *already* hosed for Hyper-V because it doesn't *really*
do remapping and so the RTEs *can* change there to move interrupts
around. There are more differences between ioapic_ack_level() and
ioapic_ir_ack_level() than the erratum workaround and whether we use
data->entry.vector... aren't there?

For the specific case you were targeting with this patch, you already
successfully persuaded me it should never happen, and there's a WARN_ON
which would trigger if it ever does (with too many CPUs in the system).
Let's come back and tackle this can of worms in a separate cleanup
series.
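[For illustration, the shape being asked for here might look roughly like the helper below: a single lookup that returns the parent domain or an ERR_PTR(), with no "NULL means fall back to the vector domain" special case in every caller. The helper is hypothetical and, for brevity, still takes an irq_alloc_info even though the mail argues against that as well.]

	/* Sketch: resolve the parent domain once, callers never see NULL */
	static struct irq_domain *x86_get_parent_irq_domain(struct irq_alloc_info *info)
	{
		struct irq_domain *parent = irq_remapping_get_irq_domain(info);

		if (IS_ERR(parent))
			return parent;
		/* No remapping domain: fall back to the vector domain here,
		 * in exactly one place, instead of in every caller. */
		return parent ? parent : x86_vector_domain;
	}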
diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index 44445d9de881..6b5576da77f0 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -278,7 +278,8 @@ extern void irq_set_default_host(struct irq_domain *host);
 extern struct irq_domain *irq_get_default_host(void);
 extern int irq_domain_alloc_descs(int virq, unsigned int nr_irqs,
 				  irq_hw_number_t hwirq, int node,
-				  const struct irq_affinity_desc *affinity);
+				  const struct irq_affinity_desc *affinity,
+				  const struct cpumask *max_affinity);
 
 static inline struct fwnode_handle *of_node_to_fwnode(struct device_node *node)
 {
diff --git a/kernel/irq/ipi.c b/kernel/irq/ipi.c
index 43e3d1be622c..13f56210eca9 100644
--- a/kernel/irq/ipi.c
+++ b/kernel/irq/ipi.c
@@ -75,7 +75,7 @@ int irq_reserve_ipi(struct irq_domain *domain,
 		}
 	}
 
-	virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE, NULL);
+	virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE, NULL, dest);
 	if (virq <= 0) {
 		pr_warn("Can't reserve IPI, failed to alloc descs\n");
 		return -ENOMEM;
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index c93e00ca11d8..ffd41f21afca 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -660,7 +660,8 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
 	}
 
 	/* Allocate a virtual interrupt number */
-	virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), NULL);
+	virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node),
+				      NULL, NULL);
 	if (virq <= 0) {
 		pr_debug("-> virq allocation failed\n");
 		return 0;
@@ -1011,25 +1012,57 @@ int irq_domain_translate_twocell(struct irq_domain *d,
 EXPORT_SYMBOL_GPL(irq_domain_translate_twocell);
 
 int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
-			   int node, const struct irq_affinity_desc *affinity)
+			   int node, const struct irq_affinity_desc *affinity,
+			   const struct cpumask *max_affinity)
 {
+	cpumask_var_t default_affinity;
 	unsigned int hint;
+	int i;
+
+	/* Check requested per-IRQ affinities are in the possible range */
+	if (affinity && max_affinity) {
+		for (i = 0; i < cnt; i++)
+			if (!cpumask_subset(&affinity[i].mask, max_affinity))
+				return -EINVAL;
+	}
+
+	/*
+	 * Generate default affinity. Either the possible subset of
+	 * irq_default_affinity if such a subset is non-empty, or fall
+	 * back to the provided max_affinity if there is no intersection.
+	 * And just a copy of irq_default_affinity in the
+	 * !CONFIG_CPUMASK_OFFSTACK case.
+	 */
+	memset(&default_affinity, 0, sizeof(default_affinity));
+	if ((max_affinity &&
+	     !cpumask_subset(irq_default_affinity, max_affinity))) {
+		if (!alloc_cpumask_var(&default_affinity, GFP_KERNEL))
+			return -ENOMEM;
+		cpumask_and(default_affinity, max_affinity,
+			    irq_default_affinity);
+		if (cpumask_empty(default_affinity))
+			cpumask_copy(default_affinity, max_affinity);
+	} else if (cpumask_available(default_affinity))
+		cpumask_copy(default_affinity, irq_default_affinity);
 
 	if (virq >= 0) {
 		virq = __irq_alloc_descs(virq, virq, cnt, node, THIS_MODULE,
-					 affinity, NULL);
+					 affinity, default_affinity);
 	} else {
 		hint = hwirq % nr_irqs;
 		if (hint == 0)
 			hint++;
 		virq = __irq_alloc_descs(-1, hint, cnt, node, THIS_MODULE,
-					 affinity, NULL);
+					 affinity, default_affinity);
 		if (virq <= 0 && hint > 1) {
 			virq = __irq_alloc_descs(-1, 1, cnt, node, THIS_MODULE,
-						 affinity, NULL);
+						 affinity, default_affinity);
 		}
 	}
 
+	if (cpumask_available(default_affinity))
+		free_cpumask_var(default_affinity);
+
 	return virq;
 }
 
@@ -1342,7 +1375,7 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
 		virq = irq_base;
 	} else {
 		virq = irq_domain_alloc_descs(irq_base, nr_irqs, 0, node,
-					      affinity);
+					      affinity, NULL);
 		if (virq < 0) {
 			pr_debug("cannot allocate IRQ(base %d, count %d)\n",
 				 irq_base, nr_irqs);