Message ID: 806c51fa-992b-33ac-61a9-00a606f82edb@linux.intel.com
State: Not Applicable, archived
Series: MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug
On Wed, Mar 18, 2020 at 12:23 PM Mathias Nyman <mathias.nyman@linux.intel.com> wrote: > > Hi > > I can reproduce the lost MSI interrupt issue on 5.6-rc6 which includes > the "Plug non-maskable MSI affinity race" patch. > > I can see this on a couple platforms, I'm running a script that first generates > a lot of usb traffic, and then in a busyloop sets irq affinity and turns off > and on cpus: > > for i in 1 3 5 7; do > echo "1" > /sys/devices/system/cpu/cpu$i/online > done > echo "A" > "/proc/irq/*/smp_affinity" > echo "A" > "/proc/irq/*/smp_affinity" > echo "F" > "/proc/irq/*/smp_affinity" > for i in 1 3 5 7; do > echo "0" > /sys/devices/system/cpu/cpu$i/online > done > > I added some very simple debugging but I don't really know what to look for. > xhci interrupts (122) just stop after a setting msi affinity, it survived many > similar msi_set_affinity() calls before this. > > I'm not that familiar with the inner workings of this, but I'll be happy to > help out with adding debugging and testing patches. How quickly are you able to reproduce this when you run your script? Does reverting Thomas' patch make it repro faster? Can you send the output of lspci -vvv for the xhci device? -Evan
On 19.3.2020 22.24, Evan Green wrote: > On Wed, Mar 18, 2020 at 12:23 PM Mathias Nyman > <mathias.nyman@linux.intel.com> wrote: >> >> Hi >> >> I can reproduce the lost MSI interrupt issue on 5.6-rc6 which includes >> the "Plug non-maskable MSI affinity race" patch. >> >> I can see this on a couple platforms, I'm running a script that first generates >> a lot of usb traffic, and then in a busyloop sets irq affinity and turns off >> and on cpus: >> >> for i in 1 3 5 7; do >> echo "1" > /sys/devices/system/cpu/cpu$i/online >> done >> echo "A" > "/proc/irq/*/smp_affinity" >> echo "A" > "/proc/irq/*/smp_affinity" >> echo "F" > "/proc/irq/*/smp_affinity" >> for i in 1 3 5 7; do >> echo "0" > /sys/devices/system/cpu/cpu$i/online >> done >> >> I added some very simple debugging but I don't really know what to look for. >> xhci interrupts (122) just stop after a setting msi affinity, it survived many >> similar msi_set_affinity() calls before this. >> >> I'm not that familiar with the inner workings of this, but I'll be happy to >> help out with adding debugging and testing patches. > > How quickly are you able to reproduce this when you run your script? Less than a minute > Does reverting Thomas' patch make it repro faster? Can you send the > output of lspci -vvv for the xhci device? It's roughly the same after reverting "6f1a4891a592 x86/apic/msi: Plug non-maskable MSI affinity race" # lspci -vvv -d :a3af 00:14.0 USB controller: Intel Corporation Device a3af (prog-if 30 [XHCI]) Subsystem: Intel Corporation Device 7270 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0 Interrupt: pin ? routed to IRQ 122 Region 0: Memory at b1020000 (64-bit, non-prefetchable) [size=64K] Capabilities: [70] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [80] MSI: Enable+ Count=1/8 Maskable- 64bit+ Address: 00000000fee08000 Data: 4021 Kernel driver in use: xhci_hcd -Mathias
Mathias, Mathias Nyman <mathias.nyman@linux.intel.com> writes: > I can reproduce the lost MSI interrupt issue on 5.6-rc6 which includes > the "Plug non-maskable MSI affinity race" patch. > > I can see this on a couple platforms, I'm running a script that first generates > a lot of usb traffic, and then in a busyloop sets irq affinity and turns off > and on cpus: > > for i in 1 3 5 7; do > echo "1" > /sys/devices/system/cpu/cpu$i/online > done > echo "A" > "/proc/irq/*/smp_affinity" > echo "A" > "/proc/irq/*/smp_affinity" > echo "F" > "/proc/irq/*/smp_affinity" > for i in 1 3 5 7; do > echo "0" > /sys/devices/system/cpu/cpu$i/online > done > trace snippet: > <idle>-0 [001] d.h. 129.676900: xhci_irq: xhci irq > <idle>-0 [001] d.h. 129.677507: xhci_irq: xhci irq > <idle>-0 [001] d.h. 129.677556: xhci_irq: xhci irq > <idle>-0 [001] d.h. 129.677647: xhci_irq: xhci irq > <...>-14 [001] d..1 129.679802: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 2 -> 6 Looks like a regular affinity setting in interrupt context, but I can't make sense of the time stamps > <idle>-0 [003] d.h. 129.682639: xhci_irq: xhci irq > <idle>-0 [003] d.h. 129.702380: xhci_irq: xhci irq > <idle>-0 [003] d.h. 129.702493: xhci_irq: xhci irq > migration/3-24 [003] d..1 129.703150: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 6 -> 0 So this is a CPU offline operation and after that irq 122 is silent, right? > kworker/0:0-5 [000] d.h. 131.328790: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0 > kworker/0:0-5 [000] d.h. 133.312704: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0 > kworker/0:0-5 [000] d.h. 135.360786: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0 > <idle>-0 [000] d.h. 137.344694: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0 > kworker/0:0-5 [000] d.h. 139.128679: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0 > kworker/0:0-5 [000] d.h. 141.312686: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0 > kworker/0:0-5 [000] d.h. 143.360703: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0 > kworker/0:0-5 [000] d.h. 145.344791: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0 That kworker context looks fishy. Can you please enable stacktraces in the tracer so I can see the call chains leading to this? OTOH that's irq 121 not 122. Anyway moar information is always useful. And please add the patch below. Thanks, tglx 8<--------------- --- a/arch/x86/kernel/irq.c +++ b/arch/x86/kernel/irq.c @@ -243,6 +243,7 @@ u64 arch_irq_stat(void) RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU"); desc = __this_cpu_read(vector_irq[vector]); + trace_printk("vector: %u desc %lx\n", vector, (unsigned long) desc); if (likely(!IS_ERR_OR_NULL(desc))) { if (IS_ENABLED(CONFIG_X86_32)) handle_irq(desc, regs);
On 20.3.2020 11.52, Thomas Gleixner wrote: > Mathias, > > Mathias Nyman <mathias.nyman@linux.intel.com> writes: >> I can reproduce the lost MSI interrupt issue on 5.6-rc6 which includes >> the "Plug non-maskable MSI affinity race" patch. >> >> I can see this on a couple platforms, I'm running a script that first generates >> a lot of usb traffic, and then in a busyloop sets irq affinity and turns off >> and on cpus: >> >> for i in 1 3 5 7; do >> echo "1" > /sys/devices/system/cpu/cpu$i/online >> done >> echo "A" > "/proc/irq/*/smp_affinity" >> echo "A" > "/proc/irq/*/smp_affinity" >> echo "F" > "/proc/irq/*/smp_affinity" >> for i in 1 3 5 7; do >> echo "0" > /sys/devices/system/cpu/cpu$i/online >> done >> trace snippet: >> <idle>-0 [001] d.h. 129.676900: xhci_irq: xhci irq >> <idle>-0 [001] d.h. 129.677507: xhci_irq: xhci irq >> <idle>-0 [001] d.h. 129.677556: xhci_irq: xhci irq >> <idle>-0 [001] d.h. 129.677647: xhci_irq: xhci irq >> <...>-14 [001] d..1 129.679802: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 2 -> 6 > > Looks like a regular affinity setting in interrupt context, but I can't > make sense of the time stamps I think so, everything worked normally after this one still. > >> <idle>-0 [003] d.h. 129.682639: xhci_irq: xhci irq >> <idle>-0 [003] d.h. 129.702380: xhci_irq: xhci irq >> <idle>-0 [003] d.h. 129.702493: xhci_irq: xhci irq >> migration/3-24 [003] d..1 129.703150: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 6 -> 0 > > So this is a CPU offline operation and after that irq 122 is silent, right? Yes, after this irq 122 was silent. > >> kworker/0:0-5 [000] d.h. 131.328790: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0 >> kworker/0:0-5 [000] d.h. 133.312704: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0 >> kworker/0:0-5 [000] d.h. 135.360786: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0 >> <idle>-0 [000] d.h. 137.344694: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0 >> kworker/0:0-5 [000] d.h. 139.128679: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0 >> kworker/0:0-5 [000] d.h. 141.312686: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0 >> kworker/0:0-5 [000] d.h. 143.360703: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0 >> kworker/0:0-5 [000] d.h. 145.344791: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0 > > That kworker context looks fishy. Can you please enable stacktraces in > the tracer so I can see the call chains leading to this? OTOH that's irq > 121 not 122. Anyway moar information is always useful. > > And please add the patch below. > Full function trace with patch is huge, can be found compressed at https://drive.google.com/drive/folders/19AFZe32DYk4Kzxi8VYv-OWmNOCyIY6M5?usp=sharing xhci_traces.tgz contains: trace_full: full function trace. trace: timestamp ~48.29 to ~48.93 of trace above, section with last xhci irq trace_prink_only: only trace_printk() of "trace" above This time xhci interrupts stopped after migration/3-24 [003] d..1 48.530271: msi_set_affinity: twostep update msi, irq 122, vector 33 -> 34, apicid: 6 -> 4 Thanks -Mathias
Mathias, Mathias Nyman <mathias.nyman@linux.intel.com> writes: > Full function trace with patch is huge, can be found compressed at > > This time xhci interrupts stopped after > migration/3-24 [003] d..1 48.530271: msi_set_affinity: twostep update msi, irq 122, vector 33 -> 34, apicid: 6 -> 4 thanks for providing the data. I think I decoded the issue. Can you please test the patch below? Thanks, tglx ---- diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c index 159bd0cb8548..49aec92fa35b 100644 --- a/arch/x86/kernel/apic/msi.c +++ b/arch/x86/kernel/apic/msi.c @@ -63,12 +63,13 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force) { struct irq_cfg old_cfg, *cfg = irqd_cfg(irqd); struct irq_data *parent = irqd->parent_data; - unsigned int cpu; + unsigned int cpu, old_vector; int ret; /* Save the current configuration */ cpu = cpumask_first(irq_data_get_effective_affinity_mask(irqd)); old_cfg = *cfg; + old_vector = old_cfg.vector; /* Allocate a new target vector */ ret = parent->chip->irq_set_affinity(parent, mask, force); @@ -90,10 +91,10 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force) */ if (!irqd_msi_nomask_quirk(irqd) || cfg->vector == old_cfg.vector || - old_cfg.vector == MANAGED_IRQ_SHUTDOWN_VECTOR || + old_vector == MANAGED_IRQ_SHUTDOWN_VECTOR || cfg->dest_apicid == old_cfg.dest_apicid) { irq_msi_update_msg(irqd, cfg); - return ret; + goto check_old; } /* @@ -102,7 +103,7 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force) */ if (WARN_ON_ONCE(cpu != smp_processor_id())) { irq_msi_update_msg(irqd, cfg); - return ret; + goto check_old; } /* @@ -163,6 +164,11 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force) if (lapic_vector_set_in_irr(cfg->vector)) irq_data_get_irq_chip(irqd)->irq_retrigger(irqd); +check_old: + if (!in_irq() && old_vector != MANAGED_IRQ_SHUTDOWN_VECTOR && + irqd_msi_nomask_quirk(irqd) && lapic_vector_set_in_irr(old_vector)) + irq_data_get_irq_chip(irqd)->irq_retrigger(irqd); + return ret; }
On 23.3.2020 16.10, Thomas Gleixner wrote: > > thanks for providing the data. I think I decoded the issue. Can you > please test the patch below? > Unfortunately it didn't help. Traces added to xhci_trace2.tgz in the same place https://drive.google.com/drive/folders/19AFZe32DYk4Kzxi8VYv-OWmNOCyIY6M5?usp=sharing Thanks -Mathias
Mathias Nyman <mathias.nyman@linux.intel.com> writes: > On 23.3.2020 16.10, Thomas Gleixner wrote: >> >> thanks for providing the data. I think I decoded the issue. Can you >> please test the patch below? > > Unfortunately it didn't help. I did not expect that to help, simply because the same issue is caught by the loop in fixup_irqs(). What I wanted to make sure is that there is not something in between which causes the latter to fail. So I stared at the trace data earlier today and looked at the xhci irq events. They are following a more or less periodic schedule and the forced migration on CPU hotplug hits definitely in the time frame where the next interrupt should be raised by the device. 1) First of all I do not have to understand why new systems released in 2020 still use non-maskable MSI, which is the root cause of all of this trouble, especially on Intel systems which are known to have these disastrous interrupt migration troubles. Please tell your hardware people to stop this. 2) I have no idea why the two step mechanism fails exactly on this system. I tried the same test case on a skylake client and I can clearly see from the traces that the interrupt raised in the device falls exactly into the two step update and causes the IRR to be set which resolves the situation by IPI'ing the new target CPU. I have not found a single instance of IPI recovery in your traces. Instead of that your system stops working in exactly this situation. The two step mechanism tries to work around the fact that PCI does not support a 64bit atomic config space update. So we carefully avoid changing more than one 32bit value at a time, i.e. we change first the vector and then the destination ID (part of address_lo). This ensures that the message is consistent all the time. But obviously on your system this does not work as expected. Why? I really can't tell. Please talk to your hardware folks. And of course all of this is so well documented that all of us can clearly figure out what's going on... Thanks, tglx
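The two-step update Thomas describes is the tail end of msi_set_affinity() in arch/x86/kernel/apic/msi.c. Stripped of the locking, the quirk checks and the vector_irq bookkeeping, the sequence is roughly the following (a simplified sketch for orientation, not the verbatim kernel code):

static void msi_two_step_update_sketch(struct irq_data *irqd,
                                       struct irq_cfg *old_cfg,
                                       struct irq_cfg *cfg)
{
        struct irq_cfg tmp = *old_cfg;

        /*
         * Step 1: write only the new vector, keep the old destination
         * APIC ID. The message stays consistent because both the old
         * and the new vector are valid on the old CPU at this point.
         */
        tmp.vector = cfg->vector;
        irq_msi_update_msg(irqd, &tmp);

        /*
         * Step 2: write the new destination APIC ID (address_lo). Now
         * vector and destination both point at the new CPU.
         */
        irq_msi_update_msg(irqd, cfg);

        /*
         * An interrupt raised between step 1 and step 2 hits the new
         * vector on the old CPU; the IRR check is meant to catch that
         * and retrigger the interrupt.
         */
        if (lapic_vector_set_in_irr(cfg->vector))
                irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
}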
On Mon, Mar 23, 2020 at 5:24 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > Mathias Nyman <mathias.nyman@linux.intel.com> writes: > > On 23.3.2020 16.10, Thomas Gleixner wrote: > >> > >> thanks for providing the data. I think I decoded the issue. Can you > >> please test the patch below? > > > > Unfortunately it didn't help. > > I did not expect that to help, simply because the same issue is caught > by the loop in fixup_irqs(). What I wanted to make sure is that there is > not something in between which causes the latter to fail. > > So I stared at the trace data earlier today and looked at the xhci irq > events. They are following a more or less periodic schedule and the > forced migration on CPU hotplug hits definitely in the time frame where > the next interrupt should be raised by the device. > > 1) First off all I do not have to understand why new systems released > in 2020 still use non-maskable MSI which is the root cause of all of > this trouble especially in Intel systems which are known to have > this disastrouos interrupt migration troubles. > > Please tell your hardware people to stop this. > > 2) I have no idea why the two step mechanism fails exactly on this > system. I tried the same test case on a skylake client and I can > clearly see from the traces that the interrupt raised in the device > falls exactly into the two step update and causes the IRR to be set > which resolves the situation by IPI'ing the new target CPU. > > I have not found a single instance of IPI recovery in your > traces. Instead of that your system stops working in exactly this > situation. > > The two step mechanism tries to work around the fact that PCI does > not support a 64bit atomic config space update. So we carefully avoid > changing more than one 32bit value at a time, i.e. we change first > the vector and then the destination ID (part of address_lo). This > ensures that the message is consistent all the time. > > But obviously on your system this does not work as expected. Why? I > really can't tell. > > Please talk to your hardware folks. > > And of course all of this is so well documented that all of us can > clearly figure out what's going on... I won't pretend to know what's going on, so I'll preface this by labeling it all as "flailing", but: I wonder if there's some way the interrupt can get delayed between XHCI snapping the torn value and it finding its way into the IRR. For instance, if xhci read this value at the start of their interrupt moderation timer period, that would be awful (I hope they don't do this). One test patch would be to carve out 8 vectors reserved for xhci on all cpus. Whenever you change the affinity, the assigned vector is always reserved_base + cpu_number. That lets you exercise the affinity switching code, but in a controlled manner where torn interrupts could be easily seen (ie hey I got an interrupt on cpu 4's vector but I'm cpu 2). I might struggle to write such a change, but in theory it's doable. Actually the slightly easier experiment might be to reserve a single vector for xhci on all cores. We'd strongly expect the problem to go away, since now there's no more torn writes since the vector is always the same. If it somehow didn't go away, you'd know there's more to the story. I was alternately trying to build a theory in my head about the write somehow being posted and getting out of order, but I don't think that can happen. Another experiment would be to try my old patch in [1]. 
I'm not advocating for this patch as a solution, Thomas and Bjorn have convinced me that it will break the rest of the world. But your PCI device 0xa3af seems to be Comet Lake. I was also on Comet Lake. So I'd expect to at least see it mask your problem. Again, if it didn't, that might be an interesting datapoint. [1] https://lore.kernel.org/lkml/20200117162444.v2.1.I9c7e72144ef639cc135ea33ef332852a6b33730f@changeid/ -Evan
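As a rough illustration of the reserved-vector experiment Evan sketches above (hypothetical debug code only; XHCI_DEBUG_VECTOR_BASE and the per-CPU reservation do not exist in the kernel and are assumptions here), the torn-write detection would boil down to:

#define XHCI_DEBUG_VECTOR_BASE  0x60    /* assumption: 8 vectors reserved on every CPU */
#define XHCI_DEBUG_VECTOR_COUNT 8

/* Called from the interrupt entry path with the arriving vector number. */
static void xhci_check_torn_vector(unsigned int vector)
{
        unsigned int expected = XHCI_DEBUG_VECTOR_BASE + smp_processor_id();

        if (vector < XHCI_DEBUG_VECTOR_BASE ||
            vector >= XHCI_DEBUG_VECTOR_BASE + XHCI_DEBUG_VECTOR_COUNT)
                return; /* not one of the reserved xhci vectors */

        if (vector != expected)
                trace_printk("torn xhci msi: vector %u arrived on cpu %u, expected vector %u\n",
                             vector, smp_processor_id(), expected);
}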
Evan Green <evgreen@chromium.org> writes: > On Mon, Mar 23, 2020 at 5:24 PM Thomas Gleixner <tglx@linutronix.de> wrote: >> And of course all of this is so well documented that all of us can >> clearly figure out what's going on... > > I won't pretend to know what's going on, so I'll preface this by > labeling it all as "flailing", but: > > I wonder if there's some way the interrupt can get delayed between > XHCI snapping the torn value and it finding its way into the IRR. For > instance, if xhci read this value at the start of their interrupt > moderation timer period, that would be awful (I hope they don't do > this). One test patch would be to carve out 8 vectors reserved for > xhci on all cpus. Whenever you change the affinity, the assigned > vector is always reserved_base + cpu_number. That lets you exercise > the affinity switching code, but in a controlled manner where torn > interrupts could be easily seen (ie hey I got an interrupt on cpu 4's > vector but I'm cpu 2). I might struggle to write such a change, but in > theory it's doable. Well, the point is that we don't see a spurious interrupt on any CPU. We added a traceprintk into do_IRQ() and that would immediately tell us where the thing goes off into lala land. Which it didn't. > I was alternately trying to build a theory in my head about the write > somehow being posted and getting out of order, but I don't think that > can happen. If that happens then the lost XHCI interrupt is the least of your worries. Thanks, tglx
On 24.3.2020 18.17, Evan Green wrote: > > Another experiment would be to try my old patch in [1]. I'm not > advocating for this patch as a solution, Thomas and Bjorn have > convinced me that it will break the rest of the world. But your PCI > device 0xa3af seems to be Comet Lake. I was also on Comet Lake. So I'd > expect to at least see it mask your problem. Again, if it didn't, that > might be an interesting datapoint. > > [1] https://lore.kernel.org/lkml/20200117162444.v2.1.I9c7e72144ef639cc135ea33ef332852a6b33730f@changeid/ > Yes I'm testing on a Comet Lake, oddly your patch doesn't mask the problem. Did a few rounds, seems that xhci interrupts don't always stop when both vector and apicid are updated. Sometimes just apicid update triggers it. (tail of) round 1: <idle>-0 [001] d.h. 249.040051: xhci_irq: xhci irq <idle>-0 [001] d.h. 249.040122: do_IRQ: vector: 33 desc ffff9e84075b1800 <idle>-0 [001] d.h. 249.040125: xhci_irq: xhci irq migration/1-14 [001] d..1 249.045231: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 2 -> 6 migration/3-24 [003] d..1 249.069563: msi_set_affinity: twostep update msi, irq 123, vector 33 -> 35, apicid: 6 -> 0 <idle>-0 [000] d.h. 250.294934: do_IRQ: vector: 33 desc ffff9e84066df200 <idle>-0 [000] d.h. 250.294940: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 0 -> 0 <idle>-0 [000] d.h. 252.276099: do_IRQ: vector: 33 desc ffff9e84066df200 round 2: <idle>-0 [001] d.h. 75.580577: do_IRQ: vector: 33 desc ffffa0449d084a00 <idle>-0 [001] d.h. 75.580578: xhci_irq: xhci irq <idle>-0 [001] d.h. 75.580639: do_IRQ: vector: 33 desc ffffa0449d084a00 <idle>-0 [001] d.h. 75.580640: xhci_irq: xhci irq <...>-14 [001] d..1 75.585814: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 2 -> 6 migration/3-24 [003] d..1 75.606381: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 6 -> 0 <idle>-0 [000] d.h. 76.275792: do_IRQ: vector: 35 desc ffffa04486009a00 round 3: <idle>-0 [001] d.h. 117.383266: xhci_irq: xhci irq <idle>-0 [001] d.h. 117.383407: do_IRQ: vector: 33 desc ffff92515d0d5800 <idle>-0 [001] d.h. 117.383410: xhci_irq: xhci irq migration/1-14 [001] d..1 117.388527: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 2 -> 6 migration/3-24 [003] d..1 117.413877: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 6 -> 0 <idle>-0 [000] d.h. 119.290474: do_IRQ: vector: 35 desc ffff92514757b600 round 4: <idle>-0 [003] d.h. 147.950569: do_IRQ: vector: 33 desc ffff976fdd04ca00 <idle>-0 [003] d.h. 147.950569: xhci_irq: xhci irq <idle>-0 [003] d.h. 147.950689: do_IRQ: vector: 33 desc ffff976fdd04ca00 <idle>-0 [003] d.h. 147.950690: xhci_irq: xhci irq migration/3-24 [003] d..1 147.951368: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 6 -> 0 <idle>-0 [000] d.h. 148.491193: do_IRQ: vector: 35 desc ffff976fdcc9da00 <idle>-0 [000] d.h. 148.491198: msi_set_affinity: direct update msi 122, vector 35 -> 35, apicid: 0 -> 0 -Mathias
Hi Thomas, Just started looking into it to get some idea about what could be going on. I had some questions that would be helpful to clarify. On Tue, Mar 24, 2020 at 08:03:44PM +0100, Thomas Gleixner wrote: > Evan Green <evgreen@chromium.org> writes: > > On Mon, Mar 23, 2020 at 5:24 PM Thomas Gleixner <tglx@linutronix.de> wrote: > >> And of course all of this is so well documented that all of us can > >> clearly figure out what's going on... > > > > I won't pretend to know what's going on, so I'll preface this by > > labeling it all as "flailing", but: > > > > I wonder if there's some way the interrupt can get delayed between > > XHCI snapping the torn value and it finding its way into the IRR. For > > instance, if xhci read this value at the start of their interrupt > > moderation timer period, that would be awful (I hope they don't do > > this). One test patch would be to carve out 8 vectors reserved for > > xhci on all cpus. Whenever you change the affinity, the assigned > > vector is always reserved_base + cpu_number. That lets you exercise > > the affinity switching code, but in a controlled manner where torn > > interrupts could be easily seen (ie hey I got an interrupt on cpu 4's > > vector but I'm cpu 2). I might struggle to write such a change, but in > > theory it's doable. > > Well, the point is that we don't see a spurious interrupt on any > CPU. We added a traceprintk into do_IRQ() and that would immediately > tell us where the thing goes off into lala land. Which it didn't. Now that we don't have the torn write issue, we did an experiment with legacy MSI and no interrupt remap support. One of the thought processes was: since we don't have a way to ensure that previous MSI writes are globally observed, a read from the device should flush any outstanding writes, correct? (according to PCI, not sure if we can depend on this.. or chipsets would take their own sweet time to push to CPU) I'm not certain if such a read happens today. So to make it simple we tried to force a retrigger. In the following case of direct update, even though the vector isn't changing, an MSI write to the previous destination could have been sent to the previous CPU, right? arch/x86/kernel/apic/msi.c: msi_set_affinity() /* * Direct update is possible when: * - The MSI is maskable (remapped MSI does not use this code path)). * The quirk bit is not set in this case. * - The new vector is the same as the old vector * - The old vector is MANAGED_IRQ_SHUTDOWN_VECTOR (interrupt starts up) * - The new destination CPU is the same as the old destination CPU */ if (!irqd_msi_nomask_quirk(irqd) || cfg->vector == old_cfg.vector || old_cfg.vector == MANAGED_IRQ_SHUTDOWN_VECTOR || cfg->dest_apicid == old_cfg.dest_apicid) { irq_msi_update_msg(irqd, cfg); -->>> force a retrigger It appears that without a guarantee of flushing MSI writes from the device, the check for lapic_vector_set_in_irr(vector) is still racy. With adding the forced retrigger in both places, the test didn't reveal any lost interrupt cases. Now the second question, with Interrupt Remapping Support: intel_ir_set_affinity->intel_ir_reconfigure_irte()-> modify_irte() The flush of the Interrupt Entry Cache (IEC) should ensure that, if any interrupts were in flight, they made it to the previous CPU, and any new interrupts must be delivered to the new CPU. The question is: do we need a check similar to the legacy MSI handling, i.e. if (lapic_vector_set_in_irr()) handle the interrupt? Is there a reason we don't check whether the interrupt was delivered to the previous CPU in intel_ir_set_affinity()?
Or does send_cleanup_vector() send an IPI to perform the cleanup? It appears that maybe send_cleanup_vector() sends an IPI to the old cpu and that somehow ensures the device interrupt handler actually gets called? I lost track somewhere down there :) Cheers, Ashok
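The "force a retrigger" experiment described above amounts to the following change in the direct-update path of msi_set_affinity() (shown only to make the experiment concrete; as Thomas notes further down, unconditionally retriggering papers over the root cause rather than fixing it):

        if (!irqd_msi_nomask_quirk(irqd) ||
            cfg->vector == old_cfg.vector ||
            old_cfg.vector == MANAGED_IRQ_SHUTDOWN_VECTOR ||
            cfg->dest_apicid == old_cfg.dest_apicid) {
                irq_msi_update_msg(irqd, cfg);
                /*
                 * Experiment: retrigger unconditionally instead of relying
                 * on the IRR check, in case an MSI write to the previous
                 * destination was still in flight.
                 */
                irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
                return ret;
        }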
Ashok, "Raj, Ashok" <ashok.raj@linux.intel.com> writes: > On Tue, Mar 24, 2020 at 08:03:44PM +0100, Thomas Gleixner wrote: >> Evan Green <evgreen@chromium.org> writes: >> Well, the point is that we don't see a spurious interrupt on any >> CPU. We added a traceprintk into do_IRQ() and that would immediately >> tell us where the thing goes off into lala land. Which it didn't. > > Now that we don't have the torn write issue. We did an experiment > with legacy MSI, and no interrupt remap support. One of the thought > process was, since we don't have a way to ensure that previous MSI writes > are globally observed, a read from the device should flush any > outstanidng writes correct? (according to PCI, not sure if we can > depend on this.. or chipsets would take their own sweet time to push > to CPU) Aargh. Indeed. > I'm not certain if such a read happens today? So to make it simple tried > to force a retrigger. In the following case of direct update, > even though the vector isn't changing a MSI write to the previous > destination could have been sent to the previous CPU right? Just checked the pci msi write code and there is no read at the end which would flush the darn posted write. Duh, I never noticed. > With adding the forced retrigger in both places, the test didn't reveal any > lost interrupt cases. Not a surprise, but not what we really want. > Now the second question with Interrupt Remapping Support: > > intel_ir_set_affinity->intel_ir_reconfigure_irte()-> modify_irte() > > The flush of Interrupt Entry Cache (IEC) should ensure, if any interrupts > were in flight, they made it to the previous CPU, and any new interrupts > must be delivered to the new CPU. > > Question is do we need a check similar to the legacy MSI handling > > if (lapic_vector_set_in_irr()) > handle interrupt? > > Is there a reason we don't check if the interrupt delivered to previous > CPU in intel_ir_set_affinity()? Or is the send_cleanup_vector() sends > an IPI to perform the cleanup? > > It appears that maybe send_cleanup_vector() sends IPI to the old cpu > and that somehow ensures the device interrupt handler actually getting > called? I lost my track somewhere down there :) The way it works is: 1) New vector on different CPU is allocated 2) New vector is installed 3) When the first interrupt on the new CPU arrives then the cleanup IPI is sent to the previous CPU which tears down the old vector So if the interrupt is sent to the original CPU just before #2 then this will be correctly handled on that CPU after the set affinity code reenables interrupts. Can you try the patch below? Thanks, tglx 8<-------------- drivers/pci/msi.c | 2 ++ 1 file changed, 2 insertions(+) --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -323,6 +323,7 @@ void __pci_write_msi_msg(struct msi_desc writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR); writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR); writel(msg->data, base + PCI_MSIX_ENTRY_DATA); + readl(base + PCI_MSIX_ENTRY_DATA); } else { int pos = dev->msi_cap; u16 msgctl; @@ -343,6 +344,7 @@ void __pci_write_msi_msg(struct msi_desc pci_write_config_word(dev, pos + PCI_MSI_DATA_32, msg->data); } + pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl); } skip:
On Tue, May 05, 2020 at 09:36:04PM +0200, Thomas Gleixner wrote: > Ashok, > > > > Now the second question with Interrupt Remapping Support: > > > > intel_ir_set_affinity->intel_ir_reconfigure_irte()-> modify_irte() > > > > The flush of Interrupt Entry Cache (IEC) should ensure, if any interrupts > > were in flight, they made it to the previous CPU, and any new interrupts > > must be delivered to the new CPU. > > > > Question is do we need a check similar to the legacy MSI handling > > > > if (lapic_vector_set_in_irr()) > > handle interrupt? > > > > Is there a reason we don't check if the interrupt delivered to previous > > CPU in intel_ir_set_affinity()? Or is the send_cleanup_vector() sends > > an IPI to perform the cleanup? > > > > It appears that maybe send_cleanup_vector() sends IPI to the old cpu > > and that somehow ensures the device interrupt handler actually getting > > called? I lost my track somewhere down there :) > > The way it works is: > > 1) New vector on different CPU is allocated > > 2) New vector is installed > > 3) When the first interrupt on the new CPU arrives then the cleanup > IPI is sent to the previous CPU which tears down the old vector > > So if the interrupt is sent to the original CPU just before #2 then this > will be correctly handled on that CPU after the set affinity code > reenables interrupts. I'll have this test tried out.. but in msi_set_affinity() the check below for lapic_vector_set_in_irr(cfg->vector), this is the new vector correct? don't we want to check for old_cfg.vector instead? msi_set_affinit () { .... unlock_vector_lock(); /* * Check whether the transition raced with a device interrupt and * is pending in the local APICs IRR. It is safe to do this outside * of vector lock as the irq_desc::lock of this interrupt is still * held and interrupts are disabled: The check is not accessing the * underlying vector store. It's just checking the local APIC's * IRR. */ if (lapic_vector_set_in_irr(cfg->vector)) irq_data_get_irq_chip(irqd)->irq_retrigger(irqd); > > Can you try the patch below? > > Thanks, > > tglx > > 8<-------------- > drivers/pci/msi.c | 2 ++ > 1 file changed, 2 insertions(+) > > --- a/drivers/pci/msi.c > +++ b/drivers/pci/msi.c > @@ -323,6 +323,7 @@ void __pci_write_msi_msg(struct msi_desc > writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR); > writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR); > writel(msg->data, base + PCI_MSIX_ENTRY_DATA); > + readl(base + PCI_MSIX_ENTRY_DATA); > } else { > int pos = dev->msi_cap; > u16 msgctl; > @@ -343,6 +344,7 @@ void __pci_write_msi_msg(struct msi_desc > pci_write_config_word(dev, pos + PCI_MSI_DATA_32, > msg->data); > } > + pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl); > } > > skip:
Ashok, "Raj, Ashok" <ashok.raj@intel.com> writes: > On Tue, May 05, 2020 at 09:36:04PM +0200, Thomas Gleixner wrote: >> The way it works is: >> >> 1) New vector on different CPU is allocated >> >> 2) New vector is installed >> >> 3) When the first interrupt on the new CPU arrives then the cleanup >> IPI is sent to the previous CPU which tears down the old vector >> >> So if the interrupt is sent to the original CPU just before #2 then this >> will be correctly handled on that CPU after the set affinity code >> reenables interrupts. > > I'll have this test tried out.. but in msi_set_affinity() the check below > for lapic_vector_set_in_irr(cfg->vector), this is the new vector correct? > don't we want to check for old_cfg.vector instead? > > msi_set_affinit () > { > .... > unlock_vector_lock(); > > /* > * Check whether the transition raced with a device interrupt and > * is pending in the local APICs IRR. It is safe to do this outside > * of vector lock as the irq_desc::lock of this interrupt is still > * held and interrupts are disabled: The check is not accessing the > * underlying vector store. It's just checking the local APIC's > * IRR. > */ > if (lapic_vector_set_in_irr(cfg->vector)) > irq_data_get_irq_chip(irqd)->irq_retrigger(irqd); No. This catches the transitional interrupt to the new vector on the original CPU, i.e. the one which is running that code. Again the steps are: 1) Allocate new vector on new CPU 2) Set new vector on original CPU 3) Set new vector on new CPU So we have 3 points where an interrupt can fire: A) Before #2 B) After #2 and before #3 C) After #3 #A is hitting the old vector which is still valid on the old CPU and will be handled once interrupts are enabled with the correct irq descriptor - Normal operation (same as with maskable MSI) #B This must be checked in the IRR because the there is no valid vector on the old CPU. #C is handled on the new vector on the new CPU Thanks, tglx
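Mapped onto the code in msi_set_affinity(), the three windows Thomas lists line up roughly like this (comments added for illustration; the surrounding locking and vector_irq bookkeeping are omitted):

        /* (A) interrupt here: old vector, old CPU - still valid, handled
         *     normally once interrupts are re-enabled */

        /* Redirect it to the new vector on the local CPU temporarily */
        old_cfg.vector = cfg->vector;
        irq_msi_update_msg(irqd, &old_cfg);

        /* (B) interrupt here: new vector, but still targeted at the old
         *     CPU - nothing is behind the old vector anymore, so only the
         *     IRR check below can recover it */

        /* Now transition it to the target CPU */
        irq_msi_update_msg(irqd, cfg);

        /* (C) interrupt here: new vector on the new CPU - handled there */

        if (lapic_vector_set_in_irr(cfg->vector))
                irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);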
Hi Thomas We did a bit more tracing and it looks like the IRR check is actually not happening on the right cpu. See below. On Tue, May 05, 2020 at 11:47:26PM +0200, Thomas Gleixner wrote: > > > > msi_set_affinit () > > { > > .... > > unlock_vector_lock(); > > > > /* > > * Check whether the transition raced with a device interrupt and > > * is pending in the local APICs IRR. It is safe to do this outside > > * of vector lock as the irq_desc::lock of this interrupt is still > > * held and interrupts are disabled: The check is not accessing the > > * underlying vector store. It's just checking the local APIC's > > * IRR. > > */ > > if (lapic_vector_set_in_irr(cfg->vector)) > > irq_data_get_irq_chip(irqd)->irq_retrigger(irqd); > > No. This catches the transitional interrupt to the new vector on the > original CPU, i.e. the one which is running that code. Mathias added some trace to his xhci driver when the isr is called. Below is the tail of my trace with last two times xhci_irq isr is called: <idle>-0 [003] d.h. 200.277971: xhci_irq: xhci irq <idle>-0 [003] d.h. 200.278052: xhci_irq: xhci irq Just trying to follow your steps below with traces. The traces follow the same comments in the source. > > Again the steps are: > > 1) Allocate new vector on new CPU /* Allocate a new target vector */ ret = parent->chip->irq_set_affinity(parent, mask, force); migration/3-24 [003] d..1 200.283012: msi_set_affinity: msi_set_affinity: quirk: 1: new vector allocated, new cpu = 0 > > 2) Set new vector on original CPU /* Redirect it to the new vector on the local CPU temporarily */ old_cfg.vector = cfg->vector; irq_msi_update_msg(irqd, &old_cfg); migration/3-24 [003] d..1 200.283033: msi_set_affinity: msi_set_affinity: Redirect to new vector 33 on old cpu 6 > > 3) Set new vector on new CPU /* Now transition it to the target CPU */ irq_msi_update_msg(irqd, cfg); migration/3-24 [003] d..1 200.283044: msi_set_affinity: msi_set_affinity: Transition to new target cpu 0 vector 33 if (lapic_vector_set_in_irr(cfg->vector)) irq_data_get_irq_chip(irqd)->irq_retrigger(irqd); migration/3-24 [003] d..1 200.283046: msi_set_affinity: msi_set_affinity: Update Done [IRR 0]: irq 123 localsw: Nvec 33 Napic 0 > > So we have 3 points where an interrupt can fire: > > A) Before #2 > > B) After #2 and before #3 > > C) After #3 > > #A is hitting the old vector which is still valid on the old CPU and > will be handled once interrupts are enabled with the correct irq > descriptor - Normal operation (same as with maskable MSI) > > #B This must be checked in the IRR because the there is no valid vector > on the old CPU. The check for IRR seems like on a random cpu3 vs checking for the new vector 33 on old cpu 6? This is the place when we force the retrigger without the IRR check things seem to fix itself. > > #C is handled on the new vector on the new CPU > Did we miss something? Cheers, Ashok
Ashok, "Raj, Ashok" <ashok.raj@intel.com> writes: > We did a bit more tracing and it looks like the IRR check is actually > not happening on the right cpu. See below. What? > On Tue, May 05, 2020 at 11:47:26PM +0200, Thomas Gleixner wrote: >> > >> > msi_set_affinit () >> > { >> > .... >> > unlock_vector_lock(); >> > >> > /* >> > * Check whether the transition raced with a device interrupt and >> > * is pending in the local APICs IRR. It is safe to do this outside >> > * of vector lock as the irq_desc::lock of this interrupt is still >> > * held and interrupts are disabled: The check is not accessing the >> > * underlying vector store. It's just checking the local APIC's >> > * IRR. >> > */ >> > if (lapic_vector_set_in_irr(cfg->vector)) >> > irq_data_get_irq_chip(irqd)->irq_retrigger(irqd); >> >> No. This catches the transitional interrupt to the new vector on the >> original CPU, i.e. the one which is running that code. > > Mathias added some trace to his xhci driver when the isr is called. > > Below is the tail of my trace with last two times xhci_irq isr is called: > > <idle>-0 [003] d.h. 200.277971: xhci_irq: xhci irq > <idle>-0 [003] d.h. 200.278052: xhci_irq: xhci irq > > Just trying to follow your steps below with traces. The traces follow > the same comments in the source. > >> >> Again the steps are: >> >> 1) Allocate new vector on new CPU > > /* Allocate a new target vector */ > ret = parent->chip->irq_set_affinity(parent, mask, force); > > migration/3-24 [003] d..1 200.283012: msi_set_affinity: msi_set_affinity: quirk: 1: new vector allocated, new cpu = 0 > >> >> 2) Set new vector on original CPU > > /* Redirect it to the new vector on the local CPU temporarily */ > old_cfg.vector = cfg->vector; > irq_msi_update_msg(irqd, &old_cfg); > > migration/3-24 [003] d..1 200.283033: msi_set_affinity: msi_set_affinity: Redirect to new vector 33 on old cpu 6 On old CPU 6? This runs on CPU 3 which is wrong to begin with. >> 3) Set new vector on new CPU > > /* Now transition it to the target CPU */ > irq_msi_update_msg(irqd, cfg); > > migration/3-24 [003] d..1 200.283044: msi_set_affinity: msi_set_affinity: Transition to new target cpu 0 vector 33 > > > > if (lapic_vector_set_in_irr(cfg->vector)) > irq_data_get_irq_chip(irqd)->irq_retrigger(irqd); > > > migration/3-24 [003] d..1 200.283046: msi_set_affinity: msi_set_affinity: Update Done [IRR 0]: irq 123 localsw: Nvec 33 Napic 0 > >> >> So we have 3 points where an interrupt can fire: >> >> A) Before #2 >> >> B) After #2 and before #3 >> >> C) After #3 >> >> #A is hitting the old vector which is still valid on the old CPU and >> will be handled once interrupts are enabled with the correct irq >> descriptor - Normal operation (same as with maskable MSI) >> >> #B This must be checked in the IRR because the there is no valid vector >> on the old CPU. > > The check for IRR seems like on a random cpu3 vs checking for the new vector 33 > on old cpu 6? The whole sequence runs on CPU 3. If old CPU was 6 then this should never run on CPU 3. > This is the place when we force the retrigger without the IRR check things seem to fix itself. It's not fixing it. It's papering over the root cause. > Did we miss something? Yes, you missed to analyze why this runs on CPU3 when old CPU is 6. But the last interrupt actually was on CPU3. > <idle>-0 [003] d.h. 200.278052: xhci_irq: xhci irq Can you please provide the full trace and the patch you used to generate it? Thanks, tglx
Ashok, "Raj, Ashok" <ashok.raj@intel.com> writes: > > I think i got mixed up with logical apic id and logical cpu :-( Stuff happens. > <idle>-0 [000] d.h. 44.376659: msi_set_affinity: quirk[1] new vector allocated, new apic = 2 vector = 33 this apic = 0 > <idle>-0 [000] d.h. 44.376684: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 0 Nvec 33 Napic 2 > <idle>-0 [000] d.h. 44.376685: xhci_irq: xhci irq > <idle>-0 [001] d.h. 44.376750: msi_set_affinity: quirk[1] new vector allocated, new apic = 2 vector = 33 this apic = 2 > <idle>-0 [001] d.h. 44.376774: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 2 Nvec 33 Napic 2 > <idle>-0 [001] d.h. 44.376776: xhci_irq: xhci irq > <idle>-0 [001] d.h. 44.395824: xhci_irq: xhci irq > <...>-14 [001] d..1 44.400666: msi_set_affinity: quirk[1] new vector allocated, new apic = 6 vector = 33 this apic = 2 > <...>-14 [001] d..1 44.400691: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 2 Nvec 33 Napic 6 > <idle>-0 [003] d.h. 44.421021: xhci_irq: xhci irq > <idle>-0 [003] d.h. 44.421135: xhci_irq: xhci irq > migration/3-24 [003] d..1 44.421784: msi_set_affinity: quirk[1] new vector allocated, new apic = 0 vector = 33 this apic = 6 > migration/3-24 [003] d..1 44.421803: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 6 Nvec 33 Napic 0 So this last one is a direct update. Straight forward moving it from one to the other CPU on the same vector number. And that's the case where we either expect the interrupt to come in on CPU3 or on CPU0. There is actually an example in the trace: <idle>-0 [000] d.h. 40.616467: msi_set_affinity: quirk[1] new vector allocated, new apic = 2 vector = 33 this apic = 0 <idle>-0 [000] d.h. 40.616488: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 0 Nvec 33 Napic 2 <idle>-0 [000] d.h. 40.616488: xhci_irq: xhci irq <idle>-0 [001] d.h. 40.616504: xhci_irq: xhci irq > migration/3-24 [003] d..1 44.421784: msi_set_affinity: quirk[1] new vector allocated, new apic = 0 vector = 33 this apic = 6 > migration/3-24 [003] d..1 44.421803: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 6 Nvec 33 Napic 0 But as this last one is the migration thread, aka stomp machine, I assume this is a hotplug operation. Which means the CPU cannot handle interrupts anymore. In that case we check the old vector on the unplugged CPU in fixup_irqs() and do the retrigger from there. Can you please add tracing to that one as well? Thanks, tglx
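For the tracing Thomas asks for: the hotplug path in question is fixup_irqs() in arch/x86/kernel/irq.c, which scans the outgoing CPU's vectors, checks the local APIC IRR and retriggers anything still pending. A sketch of what the added tracing could look like (against the 5.6-era layout of that loop; exact context lines may differ):

                /* inside the per-vector loop of fixup_irqs() */
                irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
                if (irr & (1 << (vector % 32))) {
                        desc = __this_cpu_read(vector_irq[vector]);
                        trace_printk("fixup_irqs: IRR set for vector %u, retriggering irq %u\n",
                                     vector, irq_desc_get_irq(desc));
                        /* existing code then takes desc->lock and calls
                         * chip->irq_retrigger() */
                }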
--- a/arch/x86/kernel/apic/msi.c +++ b/arch/x86/kernel/apic/msi.c @@ -92,6 +92,10 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force) cfg->vector == old_cfg.vector || old_cfg.vector == MANAGED_IRQ_SHUTDOWN_VECTOR || cfg->dest_apicid == old_cfg.dest_apicid) { + trace_printk("direct update msi %u, vector %u -> %u, apicid: %u -> %u\n", + irqd->irq, + old_cfg.vector, cfg->vector, + old_cfg.dest_apicid, cfg->dest_apicid); irq_msi_update_msg(irqd, cfg); return ret; } @@ -134,7 +138,10 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force) */ if (IS_ERR_OR_NULL(this_cpu_read(vector_irq[cfg->vector]))) this_cpu_write(vector_irq[cfg->vector], VECTOR_RETRIGGERED); - + trace_printk("twostep update msi, irq %u, vector %u -> %u, apicid: %u -> %u\n", + irqd->irq, + old_cfg.vector, cfg->vector, + old_cfg.dest_apicid, cfg->dest_apicid); /* Redirect it to the new vector on the local CPU temporarily */ old_cfg.vector = cfg->vector; irq_msi_update_msg(irqd, &old_cfg);