[V2] x86/apic/msi: Plug non-maskable MSI affinity race

Thomas Gleixner <tglx@linutronix.de> writes:

Evan tracked down a subtle race between the update of the MSI message and
the device raising an interrupt internally on PCI devices which do not
support MSI masking. The update of the MSI message is non-atomic and
consists of either 2 or 3 sequential 32bit wide writes to the PCI config
space. 

   - Write address low 32bits
   - Write address high 32bits (If supported by device)
   - Write data

When an interrupt is migrated then both address and data might change, so
the kernel attempts to mask the MSI interrupt first. But for MSI masking is
optional, so there exist devices which do not provide it. That means that
if the device raises an interrupt internally between the writes and MSI
message is sent built from half updated state.

On x86 this can lead to spurious interrupts on the wrong interrupt
vector when the affinity setting changes both address and data. As a
consequence the device interrupt can be lost causing the device to
become stuck or malfunctioning.

Evan tried to handle that by disabling MSI accross an MSI message
update. That's not feasible because disabling MSI has issues on its own:

 If MSI is disabled the PCI device is routing an interrupt to the legacy
 INTx mechanism. The INTx delivery can be disabled, but the disablement is
 not working on all devices.

 Some devices lose interrupts when both MSI and INTx delivery are disabled.

Another way to solve this would be to enforce the allocation of the same
vector on all CPUs in the system for this kind of screwed devices. That
could be done, but it would bring back the vector space exhaustion problems
which got solved a few years ago.

Fortunately the high address (if supported by the device) is only relevant
when X2APIC is enabled which implies interrupt remapping. In the interrupt
remapping case the affinity setting is happening at the interrupt remapping
unit and the PCI MSI message is programmed only once when the PCI device is
initialized.

That makes it possible to solve it with a two step update:

  1) Target the MSI msg to the new vector on the current target CPU

  2) Target the MSI msg to the new vector on the new target CPU

In both cases writing the MSI message is only changing a single 32bit word
which prevents the issue of inconsistency.

After writing the final destination it is necessary to check whether the
device issued an interrupt while the intermediate state #1 (new vector,
current CPU) was in effect.

This is possible because the affinity change is always happening on the
current target CPU. The code runs with interrupts disabled, so the
interrupt can be detected by checking the IRR of the local APIC. If the
vector is pending in the IRR then the interrupt is retriggered on the new
target CPU by sending an IPI for the associated vector on the target CPU.

This can cause spurious interrupts on both the local and the new target
CPU.

 1) If the new vector is not in use on the local CPU and the device
    affected by the affinity change raised an interrupt during the
    transitional state (step #1 above) then interrupt entry code will
    ignore that spurious interrupt. The vector is marked so that the
    'No irq handler for vector' warning is supressed once.

 2) If the new vector is in use already on the local CPU then the IRR check
    might see an pending interrupt from the device which is using this
    vector. The IPI to the new target CPU will then invoke the handler of
    the device, which got the affinity change, even if that device did not
    issue an interrupt

 3) If the new vector is in use already on the local CPU and the device
    affected by the affinity change raised an interrupt during the
    transitional state (step #1 above) then the handler of the device which
    uses that vector on the local CPU will be invoked.

#1 is uninteresting and has no unintended side effects. #2 and #3 might
expose issues in device driver interrupt handlers which are not prepared to
handle a spurious interrupt correctly. This not a regression, it's just
exposing something which was already broken as spurious interrupts can
happen for a lot of reasons and all driver handlers need to be able to deal
with them.

Reported-by: Evan Green <evgreen@chromium.org>
Debugged-by: Evan Green <evgreen@chromium.org>                                                                                        Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rajat Jain <rajatja@google.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Cc: x86@kernel.org
Cc: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
---

V2: Add the missing bracket ....

@stable: The Fixes tag is missing intentionally, as this problem has
been there forever. The rework of the vector allocation logic just made
it more likely to be observable.

---
 arch/x86/include/asm/apic.h |    8 ++
 arch/x86/kernel/apic/msi.c  |  128 ++++++++++++++++++++++++++++++++++++++++++--
 include/linux/irq.h         |   18 ++++++
 include/linux/irqdomain.h   |    7 ++
 kernel/irq/debugfs.c        |    1 
 kernel/irq/msi.c            |    5 +
 6 files changed, 163 insertions(+), 4 deletions(-)

Message ID	87imkr4s7n.fsf@nanos.tec.linutronix.de (mailing list archive)
State	Not Applicable, archived
Headers	show Return-Path: <SRS0=9tGp=3U=vger.kernel.org=linux-pci-owner@kernel.org> Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8902014D5 for <patchwork-linux-pci@patchwork.kernel.org>; Fri, 31 Jan 2020 14:27:08 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 5D38B2067C for <patchwork-linux-pci@patchwork.kernel.org>; Fri, 31 Jan 2020 14:27:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728881AbgAaO1H (ORCPT <rfc822;patchwork-linux-pci@patchwork.kernel.org>); Fri, 31 Jan 2020 09:27:07 -0500 Received: from Galois.linutronix.de ([193.142.43.55]:55683 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728730AbgAaO1H (ORCPT <rfc822;linux-pci@vger.kernel.org>); Fri, 31 Jan 2020 09:27:07 -0500 Received: from [195.177.82.11] (helo=nanos.tec.linutronix.de) by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256) (Exim 4.80) (envelope-from <tglx@linutronix.de>) id 1ixXG7-000731-9Q; Fri, 31 Jan 2020 15:26:59 +0100 Received: by nanos.tec.linutronix.de (Postfix, from userid 1000) id 150DF105BE6; Fri, 31 Jan 2020 15:26:52 +0100 (CET) From: Thomas Gleixner <tglx@linutronix.de> To: Evan Green <evgreen@chromium.org> Cc: Rajat Jain <rajatja@google.com>, Bjorn Helgaas <bhelgaas@google.com>, linux-pci <linux-pci@vger.kernel.org>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, x86@kernel.org, Marc Zyngier <maz@kernel.org> Subject: [PATCH V2] x86/apic/msi: Plug non-maskable MSI affinity race In-Reply-To: <87lfpn50id.fsf@nanos.tec.linutronix.de> References: <20200117162444.v2.1.I9c7e72144ef639cc135ea33ef332852a6b33730f@changeid> <CACK8Z6Ft95qj4e_fsA32r_bcz2SsHOW1xxqZJt3_DBAJw=NMGA@mail.gmail.com> <CAE=gft6fKQWExW-=xjZGzXs30XohfpA5SKggvL2WtYXAHmzMew@mail.gmail.com> <87y2tytv5i.fsf@nanos.tec.linutronix.de> <87eevqkpgn.fsf@nanos.tec.linutronix.de> <CAE=gft6YiM5S1A7iJYJTd5zmaAa8=nhLE3B94JtWa+XW-qVSqQ@mail.gmail.com> <CAE=gft5xta4XCJtctWe=R3w=kVr598JCbk9VSRue04nzKAk3CQ@mail.gmail.com> <CAE=gft7MqQ3Mej5oCT=gw6ZLMSTHoSyMGOFz=-hae-eRZvXLxA@mail.gmail.com> <87d0b82a9o.fsf@nanos.tec.linutronix.de> <CAE=gft7C5HTmcTLsXqXbCtcYDeKG6bCJ0gmgwVNc0PDHLJ5y_A@mail.gmail.com> <878slwmpu9.fsf@nanos.tec.linutronix.de> <87imkv63yf.fsf@nanos.tec.linutronix.de> <CAE=gft7Gu0ah4qcbsEB1X+kUMagCzPR+cdCfn2caofcGV+tBjA@mail.gmail.com> <87pnf342pr.fsf@nanos.tec.linutronix.de> <CAE=gft69hQcbmT46b1T8eLdPFyb9Pp-sDYd5JfPsZ2JWL4PXqQ@mail.gmail.com> <877e1a2d11.fsf@nanos.tec.linutronix.de> <CAE=gft7mLAU3G+f8gi_etRSpUijoCh7_6ni9Ob2JqjW7Q1n3yQ@mail.gmail.com> <874kwd3lbn.fsf@nanos.tec.linutronix.de> <CAE=gft52iBTJyyvDTXeHdEYnpSSROvrQsweuXjd6OuaLO47ACw@mail.gmail.com> <87lfpn50id.fsf@nanos.tec.linutronix.de> Date: Fri, 31 Jan 2020 15:26:52 +0100 Message-ID: <87imkr4s7n.fsf@nanos.tec.linutronix.de> MIME-Version: 1.0 Content-Type: text/plain X-Linutronix-Spam-Score: -1.0 X-Linutronix-Spam-Level: - X-Linutronix-Spam-Status: No , -1.0 points, 5.0 required, ALL_TRUSTED=-1,SHORTCIRCUIT=-0.0001 Sender: linux-pci-owner@vger.kernel.org Precedence: bulk List-ID: <linux-pci.vger.kernel.org> X-Mailing-List: linux-pci@vger.kernel.org
Series	[V2] x86/apic/msi: Plug non-maskable MSI affinity race \| expand [V2] x86/apic/msi: Plug non-maskable MSI affinity race

[V2] x86/apic/msi: Plug non-maskable MSI affinity race

Commit Message

Comments

Patch