[v6,00/18] APEI in_nmi() rework

Message ID	20180921221705.6478-1-james.morse@arm.com (mailing list archive)
Headers	show Return-Path: <owner-linux-mm@kvack.org> Received-SPF: pass (google.com: domain of james.morse@arm.com designates 217.140.101.70 as permitted sender) client-ip=217.140.101.70; From: James Morse <james.morse@arm.com> To: linux-acpi@vger.kernel.org Cc: kvmarm@lists.cs.columbia.edu, linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org, Borislav Petkov <bp@alien8.de>, Marc Zyngier <marc.zyngier@arm.com>, Christoffer Dall <christoffer.dall@arm.com>, Will Deacon <will.deacon@arm.com>, Catalin Marinas <catalin.marinas@arm.com>, Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>, Rafael Wysocki <rjw@rjwysocki.net>, Len Brown <lenb@kernel.org>, Tony Luck <tony.luck@intel.com>, Tyler Baicar <tbaicar@codeaurora.org>, Dongjiu Geng <gengdongjiu@huawei.com>, Xie XiuQi <xiexiuqi@huawei.com>, Punit Agrawal <punit.agrawal@arm.com>, jonathan.zhang@cavium.com, James Morse <james.morse@arm.com> Subject: [PATCH v6 00/18] APEI in_nmi() rework Date: Fri, 21 Sep 2018 23:16:47 +0100 Message-Id: <20180921221705.6478-1-james.morse@arm.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	APEI in_nmi() rework \| expand [v6,00/18] APEI in_nmi() rework [v6,01/18] ACPI / APEI: Move the estatus queue code up, and under its own ifdef [v6,02/18] ACPI / APEI: Generalise the estatus queue's add/remove and notify code [v6,03/18] ACPI / APEI: don't wait to serialise with oops messages when panic()ing [v6,04/18] ACPI / APEI: Switch NOTIFY_SEA to use the estatus queue [v6,05/18] ACPI / APEI: Make estatus queue a Kconfig symbol [v6,06/18] KVM: arm/arm64: Add kvm_ras.h to collect kvm specific RAS plumbing [v6,07/18] arm64: KVM/mm: Move SEA handling behind a single 'claim' interface [v6,08/18] ACPI / APEI: Move locking to the notification helper [v6,09/18] ACPI / APEI: Let the notification helper specify the fixmap slot [v6,10/18] ACPI / APEI: preparatory split of ghes->estatus [v6,11/18] ACPI / APEI: Remove silent flag from ghes_read_estatus() [v6,12/18] ACPI / APEI: Don't store CPER records physical address in struct ghes [v6,13/18] ACPI / APEI: Don't update struct ghes' flags in read/clear estatus [v6,14/18] ACPI / APEI: Split ghes_read_estatus() to read CPER length [v6,15/18] ACPI / APEI: Only use queued estatus entry during _in_nmi_notify_one() [v6,16/18] ACPI / APEI: Split fixmap pages for arm64 NMI-like notifications [v6,17/18] mm/memory-failure: increase queued recovery work's priority [v6,18/18] arm64: acpi: Make apei_claim_sea() synchronise with APEI's irq work

Message ID

20180921221705.6478-1-james.morse@arm.com (mailing list archive)

Headers

Received-SPF: pass (google.com: domain of james.morse@arm.com designates
 217.140.101.70 as permitted sender) client-ip=217.140.101.70;
From: James Morse <james.morse@arm.com>
To: linux-acpi@vger.kernel.org
Cc: kvmarm@lists.cs.columbia.edu,
	linux-arm-kernel@lists.infradead.org,
	linux-mm@kvack.org,
	Borislav Petkov <bp@alien8.de>,
	Marc Zyngier <marc.zyngier@arm.com>,
	Christoffer Dall <christoffer.dall@arm.com>,
	Will Deacon <will.deacon@arm.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Rafael Wysocki <rjw@rjwysocki.net>,
	Len Brown <lenb@kernel.org>,
	Tony Luck <tony.luck@intel.com>,
	Tyler Baicar <tbaicar@codeaurora.org>,
	Dongjiu Geng <gengdongjiu@huawei.com>,
	Xie XiuQi <xiexiuqi@huawei.com>,
	Punit Agrawal <punit.agrawal@arm.com>,
	jonathan.zhang@cavium.com,
	James Morse <james.morse@arm.com>
Subject: [PATCH v6 00/18] APEI in_nmi() rework
Date: Fri, 21 Sep 2018 23:16:47 +0100
Message-Id: <20180921221705.6478-1-james.morse@arm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

APEI in_nmi() rework | expand

Message

James Morse Sept. 21, 2018, 10:16 p.m. UTC

Hello,

The GHES driver has collected quite a few bugs:

ghes_proc() at ghes_probe() time can be interrupted by an NMI that
will clobber the ghes->estatus fields, flags, and the buffer_paddr.

ghes_copy_tofrom_phys() uses in_nmi() to decide which path to take. arm64's
SEA taking both paths, depending on what it interrupted.

There is no guarantee that queued memory_failure() errors will be processed
before this CPU returns to user-space.

x86 can't TLBI from interrupt-masked code which this driver does all the
time.


This series aims to fix the first three, with an eye to fixing the
last one with a follow-up series.

Previous postings included the SDEI notification calls, which I haven't
finished re-testing. This series is big enough as it is.


Any NMIlike notification should always be in_nmi(), and should use the
ghes estatus cache to hold the CPER records until they can be processed.

The path through GHES should be nmi-safe, without the need to look at
in_nmi(). Abstract the estatus cache, and re-plumb arm64 to always
nmi_enter() before making the ghes_notify_sea() call.

To remove the use of in_nmi(), the locks are pushed out to the notification
helpers, and the fixmap slot to use is passed in. (A future series could
change as many nnotification helpers as possible to not mask-irqs, and
pass in some GHES_FIXMAP_NONE that indicates ioremap() should be used)

Change the now common _in_nmi_notify_one() to use local estatus/paddr/flags,
instead of clobbering those in the struct ghes.

Finally we try and ensure the memory_failure() work will run before this
CPU returns to user-space where the error may be triggered again.


Changes since v5:
 * Fixed phys_addr_t/u64 that failed to build on 32bit x86.
 * Removed buffer/flags from struct ghes, these are now on the stack.

To make future irq/tlbi fixes easier:
 * Moved the locking further out to make it easier to avoid masking interrupts
   for notifications where it isn't needed.
 * Restored map/unmap helpers so they can use ioremap() when interrupts aren't
   masked.


Feedback welcome,

Thanks

James Morse (18):
  ACPI / APEI: Move the estatus queue code up, and under its own ifdef
  ACPI / APEI: Generalise the estatus queue's add/remove and notify code
  ACPI / APEI: don't wait to serialise with oops messages when
    panic()ing
  ACPI / APEI: Switch NOTIFY_SEA to use the estatus queue
  ACPI / APEI: Make estatus queue a Kconfig symbol
  KVM: arm/arm64: Add kvm_ras.h to collect kvm specific RAS plumbing
  arm64: KVM/mm: Move SEA handling behind a single 'claim' interface
  ACPI / APEI: Move locking to the notification helper
  ACPI / APEI: Let the notification helper specify the fixmap slot
  ACPI / APEI: preparatory split of ghes->estatus
  ACPI / APEI: Remove silent flag from ghes_read_estatus()
  ACPI / APEI: Don't store CPER records physical address in struct ghes
  ACPI / APEI: Don't update struct ghes' flags in read/clear estatus
  ACPI / APEI: Split ghes_read_estatus() to read CPER length
  ACPI / APEI: Only use queued estatus entry during _in_nmi_notify_one()
  ACPI / APEI: Split fixmap pages for arm64 NMI-like notifications
  mm/memory-failure: increase queued recovery work's priority
  arm64: acpi: Make apei_claim_sea() synchronise with APEI's irq work

 arch/arm/include/asm/kvm_ras.h       |  14 +
 arch/arm/include/asm/system_misc.h   |   5 -
 arch/arm64/include/asm/acpi.h        |   4 +
 arch/arm64/include/asm/daifflags.h   |   1 +
 arch/arm64/include/asm/fixmap.h      |   4 +-
 arch/arm64/include/asm/kvm_ras.h     |  25 ++
 arch/arm64/include/asm/system_misc.h |   2 -
 arch/arm64/kernel/acpi.c             |  48 +++
 arch/arm64/mm/fault.c                |  25 +-
 drivers/acpi/apei/Kconfig            |   6 +
 drivers/acpi/apei/ghes.c             | 564 +++++++++++++++------------
 include/acpi/ghes.h                  |   2 -
 mm/memory-failure.c                  |  11 +-
 virt/kvm/arm/mmu.c                   |   4 +-
 14 files changed, 426 insertions(+), 289 deletions(-)
 create mode 100644 arch/arm/include/asm/kvm_ras.h
 create mode 100644 arch/arm64/include/asm/kvm_ras.h

Comments

Borislav Petkov Sept. 25, 2018, 12:45 p.m. UTC | #1

On Fri, Sep 21, 2018 at 11:16:47PM +0100, James Morse wrote:
> Hello,
> 
> The GHES driver has collected quite a few bugs:
> 
> ghes_proc() at ghes_probe() time can be interrupted by an NMI that
> will clobber the ghes->estatus fields, flags, and the buffer_paddr.
> 
> ghes_copy_tofrom_phys() uses in_nmi() to decide which path to take. arm64's
> SEA taking both paths, depending on what it interrupted.
> 
> There is no guarantee that queued memory_failure() errors will be processed
> before this CPU returns to user-space.
> 
> x86 can't TLBI from interrupt-masked code which this driver does all the
> time.
> 
> 
> This series aims to fix the first three, with an eye to fixing the
> last one with a follow-up series.
> 
> Previous postings included the SDEI notification calls, which I haven't
> finished re-testing. This series is big enough as it is.

Yeah, and everywhere I look, this thing looks overengineered. Like,
for example, what's the purpose of this ghes_esource_prealloc_size()
computing a size each time the pool changes size?

AFAICT, this size can be computed exactly *once* at driver init and be
done with it. Right?

Or am I missing something subtle?

James Morse Oct. 3, 2018, 5:50 p.m. UTC | #2

Hi Boris,

On 25/09/18 13:45, Borislav Petkov wrote:
> On Fri, Sep 21, 2018 at 11:16:47PM +0100, James Morse wrote:
>> Hello,
>>
>> The GHES driver has collected quite a few bugs:
>>
>> ghes_proc() at ghes_probe() time can be interrupted by an NMI that
>> will clobber the ghes->estatus fields, flags, and the buffer_paddr.
>>
>> ghes_copy_tofrom_phys() uses in_nmi() to decide which path to take. arm64's
>> SEA taking both paths, depending on what it interrupted.
>>
>> There is no guarantee that queued memory_failure() errors will be processed
>> before this CPU returns to user-space.
>>
>> x86 can't TLBI from interrupt-masked code which this driver does all the
>> time.
>>
>>
>> This series aims to fix the first three, with an eye to fixing the
>> last one with a follow-up series.
>>
>> Previous postings included the SDEI notification calls, which I haven't
>> finished re-testing. This series is big enough as it is.

> Yeah, and everywhere I look, this thing looks overengineered. Like,
> for example, what's the purpose of this ghes_esource_prealloc_size()
> computing a size each time the pool changes size?

The size to grow the pool by, because each error-source described by a GHES
entry has its own worst-case size.

Today ghes_nmi_add() does this each time its called. You could have multiple
GHES entries in the HEST that describe NMI as the notification. The worst-case
size for the records is described in the GHES entry, and could be different for
each one. (error_block_length and records_to_preallocate, or table 18-379 of
acpi v6.2)

These different error-sources could be delivered on different CPUs at the same
time, so need their own pre-allocated reserved memory. ghes_notify_nmi()'s
atomic_add_unless() suggests this can happen on x86, but I don't know the
arch-specifics. It definitely can happen on arm64.

> AFAICT, this size can be computed exactly *once* at driver init and be
> done with it. Right?

We could do two passes of the HEST to pre-compute the total size of this
estatus-queue memory, allocate it, then do the notification registration stuff.
But this doesn't really work with the way this driver acts as platform-driver
for a ghes device...

The non-ghes HEST entries have a "number of records to pre-allocate" too, we
could make this memory pool something hest.c looks after, but I can't see if the
other error sources use those values.

Hmmm,
The size is capped to 64K, we could ignore the firmware description of the
memory requirements, and allocate SZ_64K each time. Doing it per-GHES is still
the only way to avoid allocating nmi-safe memory for irqs.

Thanks,

James

Borislav Petkov Oct. 4, 2018, 3:15 p.m. UTC | #3

On Wed, Oct 03, 2018 at 06:50:38PM +0100, James Morse wrote:

...

> The non-ghes HEST entries have a "number of records to pre-allocate" too, we
> could make this memory pool something hest.c looks after, but I can't see if the
> other error sources use those values.

Thanks for the detailed analysis!

> Hmmm, The size is capped to 64K, we could ignore the firmware description of the
> memory requirements, and allocate SZ_64K each time. Doing it per-GHES is still
> the only way to avoid allocating nmi-safe memory for irqs.

Right, so I'm thinking a lot simpler: allocate a pool which should
be large enough to handle all situations and drop all that logic
which recomputes and reallocates pool size. Just a static thing which
JustWorks(tm).

For a couple of reasons:

 - you state it above: all those synchronization issues are gone with a
 prellocated pool

 - 64K per-GHES pool is nothing if you consider the machines this thing
 runs on - fat servers with lotsa memory. And RAS there *is* important.
 And TBH 64K is nothing even on a small client sporting gigabytes of
 memory.

 - code is a lot simpler and cleaner - you don't need all that pool
 expanding and shrinking. I mean, I'm all for smarter solutions if they
 have any clear advantages warranting the complication but this is a
 lot of machinery just so that we can save a couple of KBs. Which, as a
 whole, sounds just too much to me.

But this is just me.